
Please help me understand this compression

alank2

This page talks about LZ4:


Starting at the section: "Regarding the way LZ4 searches and finds matches"

If I am understanding the process (and correct me if I am wrong), it sounds like they are hashing 4 bytes at every location of the last 64K (because the offset is 16-bit). I can see this being done quickly with a rolling window: add a byte to the output, hash at that position, and forget the entry that fell off the end of the 64K.

Then, when looking at the new part of the stream to compress, you hash it and check the table for hash matches. Presumably there is a mechanism to test past the 4 bytes to look for the longest sequence you can reference, but I am not sure about that.

My question is this - why hash it at all? Why not just do a memcmp-style byte-by-byte check? An inline function could count the matching bytes between the source and each position in the 64K window. Would that really be that much slower?
 
My question is this - why hash it at all? Why not just do a memcmp-style byte-by-byte check? An inline function could count the matching bytes between the source and each position in the 64K window. Would that really be that much slower?
An algorithm designer would be looking at asymptotic performance as the buffer size grows to be as big as the file. A linear search of the file for every input byte like that means the complexity of compression is quadratic in the size of the file; if all the hash table operations can be done in constant time, that cuts the overall complexity down to linear.
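To put the memcmp-style search in code, it would look something like this (a sketch of the idea, not anything from the LZ4 source):

Code:
/* Naive match finder: for the data at src+pos, scan every position in the
   previous 64K window and return the longest match found.  This is the
   memcmp-style approach: O(window) candidate positions per input byte,
   each candidate costing up to O(match length) byte compares. */
static size_t naive_longest_match(const unsigned char *src, size_t pos,
                                  size_t end, size_t *match_pos)
{
    size_t best_len = 0;
    size_t start = (pos > 65535) ? pos - 65535 : 0;

    for (size_t cand = start; cand < pos; cand++) {
        size_t len = 0;
        while (pos + len < end && src[cand + len] == src[pos + len])
            len++;
        if (len > best_len) {
            best_len = len;
            *match_pos = cand;
        }
    }
    return best_len;    /* caller would only emit a match if this is >= 4 */
}

Every input position pays for a scan of up to 64K candidates, and each candidate pays again for the byte-by-byte compare; the hash table replaces that entire outer scan with a single lookup.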
 
What if more than one position in the source has the same hash? Are multiple hashes checked to see which has the longest match length?
 
What if more than one position in the source has the same hash? Are multiple hashes checked to see which has the longest match length?
I haven't looked at the code, but the article describes a "single-cell wide hash table" (i.e. no chaining; if a cell is already occupied [i.e. there is a collision] the old contents get discarded and replaced) and suggests hash sizes of only 12 or 10 bits. That would imply two things:
  1. When it finds a match, it is the most recent substring whose first four bytes match, not necessarily the longest match available.
  2. If a collision has already evicted an entry whose first four bytes would have matched, replacing it with one whose hash matches but whose bytes don't, then a match that does exist within the past 64K will be missed.
Therefore, as they describe, this heavily favors speed and memory usage over compression ratio: "Obviously, the smaller the table, the more collisions (false positive) we get, reducing compression effectiveness. But it nonetheless still works, and remain fully compatible with more complex and memory-hungry versions."
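In code, the single-cell table amounts to something like this (my sketch of the idea described in the article, not the actual LZ4 source; read32/hash4/probe_and_update are made-up names):

Code:
#include <stdint.h>
#include <string.h>

#define HASHLOG 12                      /* 4096 cells, as in the article */

static uint32_t table[1u << HASHLOG];   /* one cell per hash: last position seen */

static uint32_t read32(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, 4);                   /* unaligned-safe 4-byte load */
    return v;
}

static uint32_t hash4(uint32_t seq)
{
    /* multiplicative hash: keep the top HASHLOG bits of the product */
    return (seq * 2654435761u) >> (32 - HASHLOG);
}

/* Return the previous position that hashed to the same cell (a candidate
   match), then overwrite the cell with the current position.  A collision
   simply evicts whatever was there; no chaining, no second chance. */
static uint32_t probe_and_update(const uint8_t *src, uint32_t pos)
{
    uint32_t h = hash4(read32(src + pos));
    uint32_t candidate = table[h];
    table[h] = pos;
    return candidate;
}

The caller still has to check that the candidate is in range (before the current position, within 64K) and that its four bytes really match, since different sequences can share a hash.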
 
Thanks Makefile; just messing around with this I did get a compress and decompress function of my own design working last night which was fun. But without the hash table method it is quite slow comparatively. I am surprised the "single-cell" method would work as well as it does if that is what they are doing.

Any thoughts on how a hash is selected - it looks like he is just multiplying by some value:

Code:
LZ4_FORCE_INLINE U32 LZ4_hash4(U32 sequence, tableType_t const tableType)
{
    /* Multiplicative (Fibonacci) hashing: 2654435761 is Knuth's constant,
       a prime close to 2^32/phi.  The shift keeps the top bits of the
       product, which are the best-mixed ones; the byU16 table type gets
       one extra hash bit, i.e. twice as many cells. */
    if (tableType == byU16)
        return ((sequence * 2654435761U) >> ((MINMATCH*8)-(LZ4_HASHLOG+1)));
    else
        return ((sequence * 2654435761U) >> ((MINMATCH*8)-LZ4_HASHLOG));
}

LZ4_FORCE_INLINE U32 LZ4_hash5(U64 sequence, tableType_t const tableType)
{
    /* Same idea for 5-byte sequences: shift so only the first 5 bytes of the
       64-bit load take part, multiply by a large prime, keep the top bits. */
    const U32 hashLog = (tableType == byU16) ? LZ4_HASHLOG+1 : LZ4_HASHLOG;
    if (LZ4_isLittleEndian()) {
        const U64 prime5bytes = 889523592379ULL;
        return (U32)(((sequence << 24) * prime5bytes) >> (64 - hashLog));
    } else {
        const U64 prime8bytes = 11400714785074694791ULL;
        return (U32)(((sequence >> 24) * prime8bytes) >> (64 - hashLog));
    }
}
 
Thanks Chuck; that gives me some stuff to chew on for awhile.

I am pretty surprised that it gets decent compression given only one hash value (single-cell) location. It sounds like, for each byte, it calculates the hash, checks the hash table, verifies that the data matches and if so creates a match, else puts that byte in as a literal. Somehow in my mind I figured there had to be a mechanism to check multiple matches to find the best one and use it, but maybe that isn't necessary. Could it be that if 4 bytes match then it is likely that more do, and compared to other candidate matches perhaps there isn't much variation/benefit in checking them?

I haven't looked to see what the speed setting does in the LZ4 code, but if it isn't checking more hashes, what could it do? He mentions a dictionary in his code, but I haven't looked at it too deeply to see what it is doing. I thought for sure he would have multiple locations per hash value to check.
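To be concrete, here is my rough sketch of the loop I described above (not the real LZ4 code; read32/hash4/table are as in the sketch earlier in the thread, and emit_literal/emit_match are placeholders for whatever writes the output format):

Code:
static void compress_sketch(const uint8_t *src, uint32_t len)
{
    uint32_t pos = 0;
    while (pos + 4 <= len) {
        uint32_t seq  = read32(src + pos);
        uint32_t h    = hash4(seq);
        uint32_t cand = table[h];
        table[h] = pos;                         /* always overwrite the cell */

        if (cand < pos && pos - cand <= 65535 && read32(src + cand) == seq) {
            uint32_t mlen = 4;                  /* extend past the verified 4 bytes */
            while (pos + mlen < len && src[cand + mlen] == src[pos + mlen])
                mlen++;
            emit_match(pos - cand, mlen);       /* 16-bit offset + length */
            pos += mlen;
        } else {
            emit_literal(src[pos]);             /* no usable candidate */
            pos++;
        }
    }
    while (pos < len)                           /* trailing <4 bytes stay literal */
        emit_literal(src[pos++]);
}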
 
Once I got the idea that I should update an existing hash entry only if it was unused, that took me from 88% compression down to 69%. LZ4 gets 66%, so I'm pretty pleased with my very basic code. It is very fast for sure.

This is a 1.44MB floppy disk image I am compressing one track at a time.

Hash table size performance:

Code:
128K    16bit    1029609 (69%)    mc=47360, lc=29376, gc=100    (time 16ms)
 64K    15bit    1030389 (69%)    mc=46987, lc=29221, gc=100    (time 16ms)
 32K    14bit    1033610 (70%)    mc=46218, lc=28950, gc=95     (time 16ms)
 16K    13bit    1043151 (70%)    mc=43543, lc=28225, gc=97     (time 32ms)
  8K    12bit    1056268 (71%)    mc=39869, lc=26203, gc=100    (time 32ms)
  4K    11bit    1085377 (73%)    mc=32990, lc=22307, gc=98     (time 15ms)
  2K    10bit    1119516 (75%)    mc=25075, lc=16511, gc=103    (time 15ms)
 
Clearly I need to think about ways to improve my hash table management. I added the ability to do multiple levels: it will check each level to see how long the match is and go with the longest. This isn't adding too much time, but it isn't getting the results I hoped for. The first number is the block size. Note that the larger the block, the better LZ4 does, while my functions mostly do worse, though the 12-bit hash with 16 levels did better at 18432 bytes than at 9216 bytes.

Code:
lz4
 9216     975543 (66%)    lc=0, mc=0, gc=0            (time 93ms)
18432     961676 (65%)    lc=0, mc=0, gc=0            (time 94ms)
36864     950884 (64%)    lc=0, mc=0, gc=0            (time 78ms)

mine bits 16 / levels 1
 9216    1029609 (69%)    lc=29376, mc=47360, gc=100  (time 16ms)
18432    1035959 (70%)    lc=29767, mc=52249, gc=74   (time 31ms)
36864    1099885 (74%)    lc=21118, mc=45318, gc=55   (time 32ms)

mine bits 12 / levels 16
 9216     992602 (67%)    lc=29262, mc=36649, gc=101  (time 62ms)
18432     985679 (66%)    lc=29758, mc=37709, gc=75   (time 63ms)
36864    1045182 (70%)    lc=21081, mc=28551, gc=56   (time 94ms)

mine bits 20 / levels 1
 9216    1028889 (69%)    lc=29497, mc=47711, gc=101  (time 93ms)
18432    1034699 (70%)    lc=30006, mc=52909, gc=75   (time 46ms)
36864    1099132 (74%)    lc=21308, mc=45745, gc=56   (time 47ms)
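For reference, the multi-level table is just the single-cell idea with a small ring of recent positions per hash cell: check every slot, keep the longest verified match, then store the current position round-robin. A simplified sketch (not my exact code; read32/hash4 are the same kind of helpers as in the earlier sketch):

Code:
#define HASHLOG 12
#define LEVELS  16                         /* slots ("levels") per hash cell */

static uint32_t mtable[1u << HASHLOG][LEVELS];
static uint8_t  mslot[1u << HASHLOG];      /* round-robin insertion point */

static uint32_t best_candidate(const uint8_t *src, uint32_t pos, uint32_t len,
                               uint32_t *best_len)
{
    uint32_t h = hash4(read32(src + pos));
    uint32_t best = 0;
    *best_len = 0;

    for (int k = 0; k < LEVELS; k++) {
        uint32_t cand = mtable[h][k];
        if (cand >= pos || pos - cand > 65535)
            continue;                      /* empty slot or out of range */
        uint32_t mlen = 0;
        while (pos + mlen < len && src[cand + mlen] == src[pos + mlen])
            mlen++;
        if (mlen > *best_len) { *best_len = mlen; best = cand; }
    }

    mtable[h][mslot[h]] = pos;             /* remember the current position */
    mslot[h] = (uint8_t)((mslot[h] + 1) % LEVELS);
    return best;                           /* only usable if *best_len >= 4 */
}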
 
Can you post a sample file that you're compressing, for comparison, in case anyone else wants to play with this and see what compression ratio they get?
 
I am essentially using a DOS 6.22 disk that isn't quite full (it has had the most useful files copied to it). Let me find a floppy disk image that is on the web to test with/use instead; that way I'm not posting anything with copyrighted material in it.

I finally got to where I can study what the LZ4 code is doing by disabling all the inlining, fixing the bug with #elif that causes the compiler to get confused about which code line it is on, and stepping through it a bit.

It doesn't do more than a single hash check, but where I think it is getting the improved compression is that match references can extend past the current output position and overlap the data being produced, which effectively back-fills runs (like zeros), and my code doesn't try to do that. I output the pairs (reference location, size) and I could see that that one feature is certainly helpful for reduction.
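As far as I can tell that works because the decoder copies a match byte by byte, so the match can overlap the data it is producing. A sketch of the decode side of that trick (not the actual LZ4 code):

Code:
#include <stdint.h>
#include <stddef.h>

/* Because the copy runs byte by byte, a match can reference data that is
   still being produced: offset=1 with length=20 turns one byte into a
   20-byte run.  My code never emits a match like that. */
static void copy_match(uint8_t *dst, size_t dst_pos, size_t offset, size_t length)
{
    const uint8_t *match = dst + dst_pos - offset;
    while (length--)
        dst[dst_pos++] = *match++;
}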

I'm going to try my luck at a different method where I have a format like this:

<token> <possible extended size byte> <data>

where the token has 2 top bits to specify the type: literal, reference (match), or one of two sequence types. I haven't decided if the sequence types will be single/double byte or double/quad byte, but essentially each token can be one of these 4 types. So I am adding sequence detection that can reduce sequences like 3f 3f 3f 3f, or 12 34 12 34, or possibly even 12 34 56 78 12 34 56 78 as well. They are easy to find computationally and probably occur quite a bit in data, so my thought is that this replaces LZ4's out-of-range reference / 00-padding trick with something hopefully more capable. The other bit in the token flags an extended size byte, which adds 8 more bits to the 5 in the token for a total of 2^13 sizes. In the 2-/4-byte sequence case this can reach even further because the size counts sets of bytes (2 or 4) rather than single bytes. The data would be the literal data, a 2-byte reference location, or the sequence itself.

I've got a lot of ideas to try, but just ideas and nothing coded/tested yet, but we'll see how it turns out.
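Roughly the token layout I'm picturing, just as a sketch (names made up, nothing of this exists yet):

Code:
#include <stdint.h>

/* Hypothetical token layout (untested):
     bits 7..6  type: 0=literal run, 1=match reference, 2=sequence type 1, 3=sequence type 2
     bit  5     extended-size flag: a following byte supplies 8 more size bits
     bits 4..0  low 5 bits of the size (so up to 2^13 with the extension)
   After the token comes the payload: literal bytes, a 2-byte reference
   location, or the repeating unit of the sequence. */
enum { T_LITERAL = 0, T_MATCH = 1, T_SEQ1 = 2, T_SEQ2 = 3 };

static unsigned emit_token(uint8_t *out, unsigned type, unsigned size)
{
    unsigned n = 0;
    if (size < 32) {
        out[n++] = (uint8_t)((type << 6) | size);
    } else {
        out[n++] = (uint8_t)((type << 6) | 0x20 | (size & 0x1F));
        out[n++] = (uint8_t)(size >> 5);    /* sizes up to 2^13 - 1 */
    }
    return n;                               /* bytes of header written */
}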

I am seriously loving the hash table idea and what it accomplishes.
 
I was playing with lz4 a bit as well. An lz4 decompressor is quite straightforward, even in C. The 16-bit x86 segment wraparound can be used to your advantage also and simplify things (I suspect that was on purpose when lz4 was designed). I was doing compression in Python just so I didn't have to think about the hashing and buffering.
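From memory, the whole raw-block decode loop is roughly this much C (a sketch based on the block format, no bounds checking, untested):

Code:
#include <stdint.h>
#include <stddef.h>

/* Minimal LZ4 raw-block decompressor sketch.  Returns the decompressed size. */
size_t lz4_block_decompress(const uint8_t *src, size_t src_len, uint8_t *dst)
{
    const uint8_t *ip = src, *iend = src + src_len;
    uint8_t *op = dst;

    while (ip < iend) {
        uint8_t token = *ip++;

        /* literal run: high nibble, 15 means "read more length bytes" */
        size_t len = token >> 4;
        if (len == 15) {
            uint8_t b;
            do { b = *ip++; len += b; } while (b == 255);
        }
        while (len--) *op++ = *ip++;

        if (ip >= iend) break;              /* last sequence has no match part */

        /* match: 2-byte little-endian offset, then length (low nibble + 4) */
        size_t offset = ip[0] | (ip[1] << 8);
        ip += 2;
        len = token & 15;
        if (len == 15) {
            uint8_t b;
            do { b = *ip++; len += b; } while (b == 255);
        }
        len += 4;                            /* MINMATCH */

        const uint8_t *match = op - offset;
        while (len--) *op++ = *match++;      /* byte copy so overlaps work */
    }
    return (size_t)(op - dst);
}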

I don't exactly follow the extra optimizations you are proposing, but I suspect you will find this intriguing: https://en.wikipedia.org/wiki/BCJ_(algorithm)
 
Got a lot done on it today.

You can make a disk for testing by formatting a 1.44MB disk with /u /s and then copying these files to it:

Volume in drive A has no label
Volume Serial Number is 3179-0D1D
Directory of A:\

COMMAND.COM HIMEM.SYS DISKCOPY.COM EDIT.COM FORMAT.COM
MODE.COM MORE.COM SYS.COM UNFORMAT.COM ATTRIB.EXE
CHKDSK.EXE DEBUG.EXE DELTREE.EXE EMM386.EXE FDISK.EXE
INTERLNK.EXE INTERSVR.EXE LABEL.EXE MEM.EXE MOVE.EXE
MSCDEX.EXE MSD.EXE QBASIC.EXE SCANDISK.EXE SETVER.EXE
SHARE.EXE SMARTDRV.EXE SORT.EXE UNDELETE.EXE XCOPY.EXE
EDIT.HLP RAMDRIVE.SYS
32 file(s) 1,136,084 bytes
235,008 bytes free

Once I added sequence detection I was getting about the same value as LZ4:

From LZ4:
Code:
blksz 512      1071124/1474560     72.64%    110ms
blksz 9216     1027914/1474560     69.71%    78ms
blksz 18432    1021731/1474560     69.29%    62ms
blksz 36864    1016159/1474560     68.91%    63ms
all            4136928/5898240     70.14%    313ms

My current code with reference, sequence 1, and sequence 2 enabled:

Code:
blksz 512      1059933/1474560     71.88%    187ms
blksz 9216     1031270/1474560     69.94%    31ms
blksz 18432    1026918/1474560     69.64%    32ms
blksz 36864    1023526/1474560     69.41%    47ms

all            4141647/5898240     70.22%    297ms
refsave 435690, seq1save 1340671, seq2save 80890

Disabling the sequence detection raises it up to:

Code:
blksz 512      1081574/1474560     73.35%    172ms
blksz 9216     1063499/1474560     72.12%    15ms
blksz 18432    1065090/1474560     72.23%    31ms
blksz 36864    1064154/1474560     72.17%    16ms

all            4274317/5898240     72.47%    234ms
refsave 1720192, seq1save 0, seq2save 0

So the sequence detection reduces it by 2.25%.

I did some testing to see how the hash method can fail and sure enough, if I make a 00 00 00 00 ff sequence, it will never be able to replace long strings of 00 00 00 00 00 00 00 because it doesn't know where they are.
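That is exactly the case the sequence detection is meant to catch; the scan itself is cheap, roughly this (a sketch, not my actual code):

Code:
#include <stdint.h>

/* Count how far a repeating unit of 'period' bytes (1 or 2 here) extends
   from p, up to 'max' bytes.  A long run of 00 is period 1; 12 34 12 34 ...
   is period 2. */
static unsigned run_length(const uint8_t *p, unsigned max, unsigned period)
{
    unsigned n = period;
    while (n < max && p[n] == p[n - period])
        n++;
    return n;   /* total bytes covered, including the first unit */
}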

I was pleased with the performance of it so far; I haven't even tried to optimize it much yet.

My goal here (besides learning about compression!) is to make some code I can use for my 16-bit floppy drive image program. I didn't get too far trying to compile the LZ4 code with a 16-bit compiler.

Attaching what I've done so far if you want to mess with it.
 

Attachments

  • unit1.zip (4 KB)
My goal here (besides learning about compression!) is to make some code I can use for my 16-bit floppy drive image program. I didn't get too far trying to compile the LZ4 code with a 16-bit compiler.
Since the reference code uses a 32x32 bit multiply in the hash function, I wonder if you'd do better by substituting CRC16 or something else more appropriate for a 16-bit CPU.
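For example, a 16-bit multiplicative hash along these lines might do the job (just a possibility, untested; hash4_16 and HASHLOG are made-up names here):

Code:
#define HASHLOG 12   /* e.g. a 4K-entry table */

/* Fold the 4 bytes to 16 bits, multiply by the 16-bit Fibonacci constant
   (40503, roughly 2^16/phi), and keep the top HASHLOG bits.  Needs only a
   16x16 multiply, which the 8086 has natively. */
static unsigned hash4_16(const unsigned char *p)
{
    unsigned x = (unsigned)(p[0] | (p[1] << 8)) ^ (unsigned)(p[2] | (p[3] << 8));
    return ((x * 40503u) & 0xFFFFu) >> (16 - HASHLOG);
}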
 
Possibly. One thing I noticed is that he does the multiply, but instead of taking the lower bits of the product, he shifts to keep the most significant ones.
 
Here is where I am with it. I am calling it fastlz (wishful thinking perhaps) because it doesn't do anything more than match replacements, though it can do the nice trick of replacing a large pattern that doesn't fully exist yet, because it comes into existence as the match is copied.

If you see something else I can optimize in it (in the compressor fastlzc.cpp) let me know.

My goal for it was to be able to compress/decompress disk images and not be irritatingly slow, even on my old Toshiba T1100plus 8086 running at 7.16MHz. From my tests, it looks like it can compress a 9-sector track (4608 bytes) in about a third of a second, so hopefully that won't be too bad. I've got lots of work to do on my disk image program to get it where I want it, but hopefully in the end it will be able to read/write to or from drives, uncompressed disk images, and its own compressed disk images.
 

Attachments

  • fastlz-3.zip (5.3 KB)
BTW, this somewhat intersects with the other thread about using standard types like uint16_t vs. unsigned int. I found the best approach was to use specific types where they are required, but if it does not matter, then leave them unsigned int or int. This approach was taken in the above compression code and made it very easy to support buffers up to 64K in 16-bit builds, and of course much larger in 32-bit builds.
 
This tested great on my old Toshiba T1100plus. I put it in my floppy disk imaging tool, and I would *think* compression would take a 7.16MHz 8086 to a crawl, but its performance is very fast.
 