The "LZ4 explained" post on fastcompression.blogspot.com talks about LZ4:
Starting at the section "Regarding the way LZ4 searches and finds matches":
If I am understanding the process correctly (and correct me if I am wrong), it sounds like they are hashing the 4 bytes at every position of the last 64K (because the offset is 16-bit). I can see this being done quickly with a rolling window: add a byte to the output, hash it, and forget the entry that fell off the end of the 64K.
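If I have that right, the table side would look something like this rough sketch. The names hash4 and insert_position and the 64K-entry table size are my own guesses for illustration; the 2654435761 multiplier is the Knuth-style multiplicative hash the post describes:

```c
#include <stdint.h>
#include <string.h>

#define HASH_LOG 16   /* 64K-entry table: a guess for this sketch */

/* Hash the 4 bytes at p into a small table index. */
static uint32_t hash4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, 4);                 /* unaligned-safe 4-byte load */
    return (v * 2654435761U) >> (32 - HASH_LOG);
}

/* table[h] = last input position whose 4 bytes hashed to h.
 * Nothing is explicitly forgotten: an old entry is simply
 * rejected later when its offset turns out to exceed 64K. */
static uint32_t table[1 << HASH_LOG];

static void insert_position(const uint8_t *base, uint32_t pos)
{
    table[hash4(base + pos)] = pos;
}
```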
Then, when looking at the new part of the stream to compress, you hash it and check the table for matches. Presumably there is a mechanism to test past the 4 bytes to look for the longest sequence you can reference, but I am not sure about that.
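The lookup-and-extend step I am imagining, building on the sketch above (find_match is a hypothetical helper; I gather real LZ4 compares a machine word at a time, but byte-by-byte shows the idea):

```c
/* Return the match length at ip (0 if none) and its backward offset. */
static size_t find_match(const uint8_t *base, const uint8_t *ip,
                         const uint8_t *iend, uint32_t ipos,
                         uint32_t *match_off)
{
    uint32_t cand = table[hash4(ip)];
    if (cand >= ipos || ipos - cand > 65535)
        return 0;                     /* empty slot, or too far back */

    const uint8_t *mp = base + cand;
    if (memcmp(mp, ip, 4) != 0)
        return 0;                     /* hash collision, not a real match */

    /* The "test past the 4 bytes" part: keep comparing until we differ. */
    size_t len = 4;
    while (ip + len < iend && mp[len] == ip[len])
        len++;

    *match_off = ipos - cand;
    return len;
}
```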
My question is this: why hash at all? Why not just do a memcmp-style byte-by-byte check? An inline function could count the matching bytes between the source and each position in the 64K window. Would that really be that much slower?
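Concretely, the brute-force version I have in mind would be something like this (brute_force_match is hypothetical; the inner loop is the memcmp-style byte count). The catch I can already see is that it scans up to 64K candidates per input position where the hash table does a single probe:

```c
/* Scan every offset in the 64K window and keep the longest run
 * of matching bytes. O(window) work per position, versus the
 * O(1) probe in find_match above. */
static size_t brute_force_match(const uint8_t *base, const uint8_t *ip,
                                const uint8_t *iend, uint32_t ipos,
                                uint32_t *best_off)
{
    size_t best_len = 0;
    uint32_t start = ipos > 65535 ? ipos - 65535 : 0;

    for (uint32_t cand = start; cand < ipos; cand++) {
        const uint8_t *mp = base + cand;
        size_t len = 0;
        while (ip + len < iend && mp[len] == ip[len])
            len++;
        if (len > best_len) {
            best_len = len;
            *best_off = ipos - cand;
        }
    }
    return best_len;
}
```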