I found it was faster when profiling. When I get home I'll run some tests on the XT+CGA for you.
Remember that if I keep the backbuffer in its current format (which is the most efficient I've found for blitting sprites to and erasing sprites from/resetting), the code would end up having to be:
mov ah,$DE
; start copying backbuff to screen
lodsb
stosw
I'm pretty sure lodsb+stosw is going to be slower than stosb+inc di.
Changing the backbuffer format to match ups the backbuffer to 16k (not really viable for my memory target), makes the sprite blits more complex and take twice as long...
and...
; 4 clocks fetch movsw
movsw ; 26 clocks
; 26-16m = 10 free, so fetch next 2.5 instructions
movsw ; 26 clocks
; 26-16m = 8 free, 2 used to finish fetch final
movsw ; 26 clocks
; fetch non-event
movsw ; 26 clocks
runs at 108 clocks (104+4 fetch)
; 4 clocks fetch movsb
movsb ; 18 clocks
; 18-8m = 10 free, so fetch next 2.5 instructions
inc di ; 3 clocks
; 3 free, so +0.75
movsb ; 18 clocks
; 18-8m = 10 free, so fetch next 2.5 instructions
inc di ; 3 clocks
; 3 free, another +0.75
; somewhere around here, we've actually filled the biu so the prefetch actually hangs waiting for opcodes to run.
movsb ; 18 clocks
; 18-8m = 10 free, so fetch next 2.5 instructions
inc di ; 3 clocks
; 3 free, another +0.75
movsb ; 18 clocks
; 18-8m = 10 free, so fetch next 2.5 instructions
inc di ; 3 clocks
Thanks to the BIU, the extra instructions are prefetched, so there's no penalty despite the movsb+inc being 8 bytes for 4x. By the figures above that's 88 clocks (4 x 21, plus 4 fetch) vs. 108 for movsw... so movsb+inc is faster than movsw with them matching -- and that's not even taking into account video memory being slower. Think of it this way: 4 extra bytes written to video memory, or 4 extra 1 byte opcodes read from system memory... not a hard choice when the combined EU time is shorter too (21 clocks vs. 26).
I figured that out while writing Paku Paku - because originally I had it doing what you suggested, and VERY quickly switched the backbuffer to packed just to write to video memory less.
I was unaware up until now you were writing to a virtual buffer whose memory layout does not match the screen layout. One generally does this for two reasons: Screen update speed (the backbuffer memory layout matches the screen's layout) or housekeeping speed (the backbuffer is arranged as necessary for fast composing).
I'm definitely choosing the latter... as well as the lower memory requirements.
Have you considered changing your backbuffer so that it is byte-based, like MCGA? That way, you eliminate reading the background for masked sprite operations.
I did, but rejected it... if I stored them all as bottom nybble, I'd have to do a painfully slow shift 4 while copying to screen... if I stored them alternating high/low, I'd still have to "or" them together while blitting to screen... meaning...
lodsw
or al,ah
stosb
inc di
Painful at best... It also means that when writing my sprites, anywhere I can blit 2px flat out becomes a stosw instead of a stosb, and 4px flat out becomes two stosw instead of one.
Storing them pixel packed lets my sprite to buffer routine exploit stosw to do 4 pixels in one operation...
I think I should have included the third operation in this equation, the blit of a sprite to the backbuffer... an example of which is above as well with sprite0 offset 0.
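To make the "4 pixels in one operation" point concrete, here is a minimal Pascal sketch rather than the actual blit routine. It assumes a packed layout of 4 bits per pixel with two pixels per byte, so one word store covers four pixels. The buffer size, the variable names and the nibble order used for printing are all made up for illustration.

```pascal
program PackedWrite;
{ Assumption: 4 bits per pixel, two pixels per byte,
  high nibble taken as the even-numbered pixel. }
var
  buf : array[0..7] of byte;   { hypothetical 16 pixel slice of a backbuffer }
  pw  : ^word;
  i   : integer;
begin
  fillchar(buf, sizeof(buf), 0);

  pw := @buf[2];               { byte 2 holds pixels 4 and 5 }
  pw^ := $EEEE;                { one word store fills pixels 4..7 with colour 14 }

  { unpack and print every pixel so the effect of the store is visible }
  for i := 0 to 15 do
    if odd(i) then
      write(buf[i shr 1] and $0F, ' ')
    else
      write(buf[i shr 1] shr 4, ' ');
  writeln;
end.
```

With a byte-per-pixel buffer the same four pixels would take four bytes, i.e. two word stores, which is the doubling described above.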
Array lookups are faster than pointer lookups on this architecture. There's a reason Intel calls bx/bp/si/di indexing registers and why the 808x allows you to use them for that purpose (ie. mov ax,[bx+si+1234] is much faster than add si,bx; add si,1234; mov al,[si]). I'd suggest looking at some compiled code for both arrays and pointer chains in Turbo Debugger so you can see how TP is constructing the code for each.
From what I've seen... well, for example, let's say I had an object... I'll use one of the stars as an example...
Code:
type
pStar=^tStar;
tStar=object
x,y,dy:integer;
color:byte;
next:pStar;
bufferP,oldBufferP,
screenP,oldScreenP:pByte;
constructor init;
procedure reset;
destructor term;
end;
If I had an array[0..15] of those, and iterated through them
Code:
var t:word;
begin
for t:=0 to 15 do with starList[t] do begin
{ do something here}
end;
end;
It takes more memory and is 10-20% slower than if I populated .next properly, and did
Code:
var p:Pstar;
begin
p:=first;
repeat
with p^ do begin
{do something here }
p:=next;
end;
until p=nil;
end;
It's why pointered lists for records and objects replaced arrays in the first place; no calculation as the 'next' one is already stored. It ends up about 60 bytes less code compiled and 10-20% faster.
If I were tracking just a single variable, like say a word sized x coordinate, then an array lookup might be faster, but once you get into complex data types, not so much.
It's why I build them thus:
Code:
constructor tStarList.init(stars:word);
var
t:word;
begin
new(first,init);
last:=first;
for t:=2 to stars do begin
new(last^.next,init);
last:=last^.next;
end;
end;
Pointered lists kick arrays' backside... even more so if you keep them on the heap (well, except for the 4 bytes of extra overhead for the 'next' pointer), and even more so since each object's destructor can dispose of its child before it disposes of itself.
Code:
destructor tStar.term;
begin
if not(next=nil) then dispose(next,term);
end;
destructor tStarList.term;
begin
if not(first=nil) then dispose(first,term);
end;
Makes cleanup a snap. It's also handy if you need to sort them, as it makes btrees simple... not to mention you can change how many of them there are at runtime while still having range-checking. Arrays DO rock when you cannot predict the order or offset you want to access them in; but if you are only ever going to call them sequentially in the same order over and over again, pointered lists 'for the win', as there is no extra calculation when moving on to "the next": it's already calced. It's the same as having an array of fixed size elements and incrementing a base pointer until the end instead of doing the multiply for every access.
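The last comparison (stepping a base pointer through fixed size elements vs. indexing each time) can be sketched like this. It's written for Free Pascal, where inc() on a typed pointer advances by one element; that is an assumption about the compiler, not the TP code being discussed. tStar is trimmed to a plain record, and the sum variables are only there so both loops have something to do.

```pascal
program StepDemo;
type
  pStar = ^tStar;
  tStar = record               { cut-down stand-in for the star object }
    x, y, dy : integer;
    color    : byte;
  end;
var
  starList   : array[0..15] of tStar;
  p          : pStar;
  t          : word;
  sumIndexed,
  sumStepped : longint;
begin
  for t := 0 to 15 do starList[t].x := t * 10;

  { indexed: conceptually base + t*sizeof(tStar) on every access }
  sumIndexed := 0;
  for t := 0 to 15 do
    sumIndexed := sumIndexed + starList[t].x;

  { stepped: keep one pointer and bump it by one element each pass }
  sumStepped := 0;
  p := @starList[0];
  for t := 0 to 15 do begin
    sumStepped := sumStepped + p^.x;
    inc(p);                    { Free Pascal: advances by sizeof(tStar) }
  end;

  { both loops visit the same elements, so the sums should match }
  writeln(sumIndexed, ' ', sumStepped);
end.
```

Whether the indexed loop really pays for a multiply on each pass depends on the compiler and the element size, so treat this as a picture of the idea rather than a benchmark.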
Random Access -- use the array
Sequential Access -- use pointers
Kind of like blitting sprites
Single points at non-sequential offsets -- use displacements
Sequential points -- use STOS
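For anyone wanting to try the list approach, here is a minimal self-contained sketch that pulls the constructor, traversal and destructor pieces from this post into one program. It trims tStar down to a single field, assumes tStar.init sets next to nil (new() does not clear memory, so the chained dispose relies on that), and adds a hypothetical count method purely to show the walk.

```pascal
program StarListDemo;
type
  pStar = ^tStar;
  tStar = object
    x    : integer;
    next : pStar;
    constructor init;
    destructor term;
  end;
  tStarList = object
    first, last : pStar;
    constructor init(stars : word);
    function count : word;     { hypothetical helper, not in the original }
    destructor term;
  end;

constructor tStar.init;
begin
  x := 0;
  next := nil;                 { assumption: init must nil this for term to work }
end;

destructor tStar.term;
begin
  if next <> nil then dispose(next, term);   { child goes first }
end;

constructor tStarList.init(stars : word);
var t : word;
begin
  new(first, init);
  last := first;
  for t := 2 to stars do begin
    new(last^.next, init);
    last := last^.next;
  end;
end;

function tStarList.count : word;
var
  p : pStar;
  n : word;
begin
  n := 0;
  p := first;
  while p <> nil do begin      { sequential walk: no index, just follow next }
    inc(n);
    p := p^.next;
  end;
  count := n;
end;

destructor tStarList.term;
begin
  if first <> nil then dispose(first, term);
end;

var
  list : tStarList;
begin
  list.init(16);
  writeln(list.count);         { number of stars reached by the walk }
  list.term;                   { one call tears down the whole chain }
end.
```

One thing to keep in mind: the chained dispose recurses once per node, so a very long list costs stack space during cleanup.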