By the way, I don't feel I should be giving advice to a master, but I hope some of this may improve that already excellent code. On line 1316:
Code:
asm mov cl,30
_loop: //Update Lines
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm movsw
asm add di,84-40
asm sub si,40
asm loop _loop
might be faster if reduced to something like this:
Code:
asm mov bl,30
asm sub ch,ch
_loop: //Update Lines
asm mov cl,20
asm rep movsw
asm add di,84-40
asm sub si,40
asm dec bl
asm jnz _loop
REP MOVSW is usually faster than a run of separate MOVSW instructions, if only because there are far fewer instructions to fetch. The downside is that CX is needed twice: as the counter for LOOP and as the repeat count for REP. How to solve this conflict? We can keep the line count in a spare register and mimic the LOOP instruction with DEC, letting JNZ test the zero flag that DEC sets. That's not as efficient as LOOP, but in my opinion it's pretty close, and the lightning speed of REP MOVSW more than makes up for it. Note that SUB CH,CH only has to run once: REP MOVSW leaves CX at zero, so reloading CL on each pass is enough.
An alternative could be something like this: back up CL in BL with the very fast XCHG, so that CX can serve both REP and LOOP.
Code:
asm mov cl,30
asm sub ch,ch
_loop: //Update Lines
asm xchg bl,cl
asm mov cl,20
asm rep movsw
asm add di,84-40
asm sub si,40
asm xchg cl,bl
asm loop _loop
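
To convince myself that the three versions really do the same job, here is a rough C model of the registers involved. It is only a sketch: the Regs struct, the helper names and variants_agree() are made up for illustration, and it assumes the direction flag is clear and that CH is zero on entry to the original loop. It checks what gets copied, not how fast.

```c
#include <stdint.h>
#include <string.h>

enum { ROWS = 30, WORDS = 20, STRIDE = 84 };   /* constants from the post */

/* Hypothetical register file: just what the three loops touch. */
typedef struct { uint16_t cx, si, di; uint8_t bl; } Regs;

static void set_cl(Regs *r, uint8_t v) { r->cx = (uint16_t)((r->cx & 0xFF00u) | v); }

static void xchg_bl_cl(Regs *r) {
    uint8_t old_cl = (uint8_t)(r->cx & 0xFFu);
    set_cl(r, r->bl);
    r->bl = old_cl;
}

/* MOVSW with DF=0 (assumed): copy one word, advance SI and DI by 2. */
static void movsw(Regs *r, const uint8_t *src, uint8_t *dst) {
    memcpy(dst + r->di, src + r->si, 2);
    r->si = (uint16_t)(r->si + 2);
    r->di = (uint16_t)(r->di + 2);
}

/* REP MOVSW: repeat MOVSW until CX reaches zero. */
static void rep_movsw(Regs *r, const uint8_t *src, uint8_t *dst) {
    while (r->cx != 0) { movsw(r, src, dst); r->cx--; }
}

static void next_row(Regs *r) {
    r->di = (uint16_t)(r->di + STRIDE - 2 * WORDS);   /* add di,84-40 */
    r->si = (uint16_t)(r->si - 2 * WORDS);            /* sub si,40    */
}

/* Version 1: twenty MOVSW in a row, LOOP counts the lines in CX. */
static uint16_t unrolled(const uint8_t *src, uint8_t *dst) {
    Regs r = {0};
    set_cl(&r, ROWS);                                 /* mov cl,30, CH assumed 0 */
    do {
        for (int i = 0; i < WORDS; i++) movsw(&r, src, dst);
        next_row(&r);
    } while (--r.cx != 0);                            /* loop _loop */
    return r.cx;
}

/* Version 2: REP MOVSW owns CX, BL counts the lines with DEC/JNZ. */
static uint16_t dec_jnz(const uint8_t *src, uint8_t *dst) {
    Regs r = {0};
    r.bl = ROWS;                                      /* mov bl,30 */
    r.cx &= 0x00FFu;                                  /* sub ch,ch */
    do {
        set_cl(&r, WORDS);                            /* mov cl,20 */
        rep_movsw(&r, src, dst);
        next_row(&r);
    } while (--r.bl != 0);                            /* dec bl / jnz _loop */
    return r.cx;
}

/* Version 3: the line count is parked in BL with XCHG around REP MOVSW. */
static uint16_t xchg_loop(const uint8_t *src, uint8_t *dst) {
    Regs r = {0};
    set_cl(&r, ROWS);                                 /* mov cl,30 */
    r.cx &= 0x00FFu;                                  /* sub ch,ch */
    do {
        xchg_bl_cl(&r);                               /* xchg bl,cl */
        set_cl(&r, WORDS);                            /* mov cl,20  */
        rep_movsw(&r, src, dst);
        next_row(&r);
        xchg_bl_cl(&r);                               /* xchg cl,bl */
    } while (--r.cx != 0);                            /* loop _loop */
    return r.cx;
}

/* Returns 1 if all three versions write identical bytes and leave CX at 0. */
int variants_agree(void) {
    static uint8_t a[ROWS * STRIDE], b[ROWS * STRIDE], c[ROWS * STRIDE];
    uint8_t src[2 * WORDS];
    for (int i = 0; i < 2 * WORDS; i++) src[i] = (uint8_t)(i + 1);
    memset(a, 0xFF, sizeof a);
    memset(b, 0xFF, sizeof b);
    memset(c, 0xFF, sizeof c);
    uint16_t end_a = unrolled(src, a);
    uint16_t end_b = dec_jnz(src, b);
    uint16_t end_c = xchg_loop(src, c);
    return end_a == 0 && end_b == 0 && end_c == 0
        && memcmp(a, b, sizeof a) == 0 && memcmp(a, c, sizeof a) == 0
        && memcmp(a + (ROWS - 1) * STRIDE, src, sizeof src) == 0  /* last line written */
        && a[2 * WORDS] == 0xFF;                      /* gap between lines untouched */
}
```

Calling variants_agree() from a small main() should return 1. The model also shows why the XCHG version works: REP MOVSW leaves CX at zero, so the second XCHG puts the line count back in CL with CH still clear, which is exactly what LOOP needs.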