
Help testing my next game's engine

Was just thinking about compiled sprites and looking at the examples on those pages -- and while they might be acceptable on a 386 or higher, or even a 286... they are DISASTROUSLY bad on an 8088 the way they are coded.

I mentioned them as a starting point, not exactly what you should use verbatim. Think about how to improve them for your specific video mode and processor. Because your video mode is based on nybbles, you'd have to compile a sprite to two targets, one for the first nybble in an attribute byte as a target and another for the second nybble. (Compiled sprites for VGA 16-color mode produce 4 compiled targets, one for each shift. It's a massive space waster, but it is a massive speed boost as well. You must weigh the benefits.)
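To illustrate the two-target idea, here is a Python sketch (illustrative only, not the engine's code; which nybble counts as "first" in a hardware byte is an assumption here) that "compiles" one row of nybble pixels to both byte alignments:

```python
# Pack 4-bit pixels into bytes for two alignments: one where the row starts
# on the high nybble of a byte, one where it starts on the low nybble.
# Padding nybbles are 0 here; a real compiler would emit masked writes there.
def pack_row(nybbles, start_high=True):
    """Pack a row of 4-bit pixels into bytes, padding partial bytes with 0."""
    padded = ([] if start_high else [0]) + list(nybbles)
    if len(padded) % 2:
        padded.append(0)
    return bytes((padded[i] << 4) | padded[i + 1]
                 for i in range(0, len(padded), 2))

row = [0xC, 0xC, 0x7]
print(pack_row(row, True).hex())   # high-nybble target: 'cc70'
print(pack_row(row, False).hex())  # low-nybble target:  '0cc7'
```

Each sprite would be compiled once per alignment, and the blitter picks the variant from the low bit of the x coordinate.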

BTW, I'm not sure your cycle counting is accurate. Let's assume we want to plot two MCGA pixels, the second at (+1,+1) from the first, in two different colors, and that ds:di points to the screen. The absolute addressing version of the code would do this:

Code:
mov byte ptr [di+68],ee ;10+EA, EA=9, 19 cycles
mov byte ptr [di+389],ff ;19 cycles

9 bytes, 38 cycles (not counting instruction fetch time)

If you want to force it to write with a target of es:di instead:

Code:
es: mov byte ptr [di+68],ee ;10+EA, EA=9, add 2 for seg override, 21 cycles
es: mov byte ptr [di+389],ff ;21 cycles

11 bytes, 42 cycles (not counting instruction fetch time)

As above (es:di target), but using stosb to write the pixels as you suggested:

Code:
add di,68  ;4c
mov al,ee  ;4c
stosb      ;11c
add di,321 ;4c 
mov al,ff  ;4c
stosb      ;11c

13 bytes, 38 cycles (not counting instruction fetch time)

The first and third have the same CPU execution time, but the first and second are faster than the third, because each opcode byte takes 4 cycles to fetch. The prefetch queue helps somewhat, but overall the three snippets above, which all achieve the same thing, are listed in order from fastest to slowest.
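A quick arithmetic check of those totals in Python (EU execution clocks only, instruction fetch excluded; the byte counts assume the assembler picks the short disp8/imm8 encodings, which is exactly what gets disputed later in the thread):

```python
# Byte and EU-clock totals for the three two-pixel variants.
absolute = {"bytes": 4 + 5,               # disp8 form + disp16 form
            "clocks": 19 + 19}            # 10+EA, EA=9, each
with_es  = {"bytes": 5 + 6,               # +1 byte per seg override
            "clocks": 21 + 21}            # +2 clocks per seg override
stos_ver = {"bytes": 3 + 2 + 1 + 4 + 2 + 1,          # add/mov/stosb x2
            "clocks": 4 + 4 + 11 + 4 + 4 + 11}

print(absolute)  # {'bytes': 9, 'clocks': 38}
print(with_es)   # {'bytes': 11, 'clocks': 42}
print(stos_ver)  # {'bytes': 13, 'clocks': 38}
```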
 
I mentioned them as a starting point, not exactly what you should use verbatim. Think about how to improve them for your specific video mode and processor. Because your video mode is based on nybbles, you'd have to compile a sprite to two targets, one for the first nybble in an attribute byte as a target and another for the second nybble. (Compiled sprites for VGA 16-color mode produce 4 compiled targets, one for each shift. It's a massive space waster, but it is a massive speed boost as well. You must weigh the benefits.)
Oh, I am, and it looks like I will be using it for the primary game sprites; I was considering it for the fonts too, but found that it was just way too big... the 128 game sprites (256 total, thanks to the precalced shift) end up 14k once compiled (170k of source) -- using an array of proc for the lookups let me axe the conditionals too... the fonts end up another 32k on top of that (praise be, units in TP get their own code segments), which isn't worth it for something called far less frequently. It ends up so large since, in addition to pulling the mask value, I want to colorize them on the fly.

Storing the fonts bitmap-style, I just make them solid white ($0F) with no mask...

Code:
{ assumes bl= color nybble:color nybble, for example $33 = cyan, $11 = blue }
	lodsb           { get data, should be one of $0F, $F0 or $FF }  
	mov  ah,al      { copy to colorize }
	and  ah,bl      { ah = colored data }
	not  al         { al = mask }
	and  al,es:[di] { apply existing video byte to mask }
	or   al,ah      { apply colored data to masked byte }
	stosb           { write it back out }
	inc  di         { skip the character byte }

Though if anyone has any ideas on speeding that up...
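For sanity-checking the byte logic of that sequence, here's the same mask-and-colorize operation modeled in Python (all values 8-bit; this is just a model of the asm above, not production code):

```python
# Model of the font colorize sequence: the data byte doubles as its own mask.
def draw_font_byte(data, color_pair, screen):
    """data: one of 0x0F, 0xF0, 0xFF; color_pair: color nybble duplicated,
    e.g. 0x33 = cyan; screen: the existing video byte."""
    colored = data & color_pair          # and  ah,bl
    mask = ~data & 0xFF                  # not  al
    return (screen & mask) | colored     # and al,es:[di] / or al,ah

# $FF data with color $11 (blue) replaces the whole byte:
print(hex(draw_font_byte(0xFF, 0x11, 0x0F)))  # 0x11
# $0F data leaves the high nybble of the screen byte intact:
print(hex(draw_font_byte(0x0F, 0x33, 0x70)))  # 0x73
```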

BTW, I'm not sure your cycle counting is accurate. Let's assume we want to plot two MCGA pixels, the second at (+1,+1) from the first, in two different colors, and that ds:di points to the screen. The absolute addressing version of the code would do this:

Code:
mov byte ptr [di+68],ee ;10+EA, EA=9, 19 cycles
mov byte ptr [di+389],ff ;19 cycles

9 bytes, 38 cycles (not counting instruction fetch time)
... I thought DISP was always a word? (TASM/BASM seems to compile it that way...) so that should be 10 bytes, not nine... the reason mine said 25 was that mine were word operations, so 12 bytes for the two opcodes, and 14+EA each... so without the seg override that's 23.

If you want to force it to write with a target of es:di instead:
due to the clocks disappearing under the fetch, that may or may not be a good choice...

Let's go over that rewrite

fetch 5 bytes = 20 clocks EU hung with thumb up arse
mov byte ptr [di+68],ee; 10+EA, EA=9, 19 cycles
19 cycles minus 4 memory operation = 15 cycles for BIU fetch
fetch 4 bytes, 1 clock overhang
fetch 1 byte, 4 clocks, EU hung
mov byte ptr [di+389],ff; ;19 cycles

So not counting fetching anything extra that's 38 EU clocks and 25 BIU fetch clocks.

Taking the ES:DI stosb version...

Code:
; fetch 4 bytes ;                 16 clocks EU hang
  add di,68     ; 4c, 4 for BIU
; fetch 2 bytes ; 8-4              4 clocks hang
  mov al,ee     ; 4c, 4 for BIU
; fetch 1 byte  ; 4-4              0
  stosb         ; 11c, -4c mem = 7 for BIU
; fetch 4 bytes ; 16-7             9 clocks hang
  add di,321    ; 4c, 4 for BIU 
; fetch 2 bytes ; 8-4              4 clocks hang
  mov al,ff     ; 4c, 4 for BIU
; fetch 1 byte  ; 4-4              0
  stosb         ; 11c, -4c mem = 7 for BIU

once again your byte counts are screwy... that's 14 bytes, not thirteen, though yes, it's 38 EU clocks and 33 clocks of EU hang -- slower; but sprites usually have large sections of sequential pixels, at least three or four per line, often more... so it comes down to the type of data being moved... once you start writing sequential pixels, the numbers flip in favor of stosb (since you lose the adds in-between), or stosw (which doesn't grow in fetch size enough to be slower).

Let's take a 'real' sprite and compare just the first two lines of the blit; I'll use the 'red ghost' from Paku Paku, promoted to 8-bit, as an example.

Code:
; fetch 5 bytes               ; 20 fetch              20 clocks hang
  mov byte ptr [di+1],$0C     ; 19c - 4 mem = 15 biu
; fetch 6 bytes               ; 24 fetch - 15          9 clocks hang
  mov word ptr [di+2],$0C0C   ; 23c - 8 mem = 15 biu
; fetch 6 bytes               ; 24 fetch - 15          9 clocks hang
  mov word ptr [di+320],$070C ; 23c - 8 mem = 15 biu
; fetch 6 bytes               ; 24 fetch - 15          9 clocks hang
  mov word ptr [di+322],$070C ; 23c - 8 mem = 15 biu
; fetch 5 bytes               ; 20 fetch - 15          5 clocks hang
  mov byte ptr [di+324],$0C   ; 19c - 4 mem = 15 biu
So that's 28 bytes, 107 EU clocks, and 52 clocks waiting on fetch.

"optimized" we can kind-of cheat due to the repeating data.

Code:
; fetch 1 byte             ;                        4 clocks hang
inc di                     ; 3c = 3 biu
; fetch 3 bytes            ; 12-3                   9 clocks hang
mov ax,$0C0C               ; 4c = 4 biu
; fetch 1 byte             ; 4-4                    0
stosw                      ; 15c-8m = 7 biu
; fetch 1 byte             ; 3 biu remain
stosb                      ; 11c-4m+3r = 10 biu
; fetch 4 bytes            ; 16-10                  6 clocks hang
add di,316                 ; 4c = 4 biu
; fetch 2 bytes            ; 8-4                    4 clocks hang
mov ah,$07                 ; 4c = 4 biu
; fetch 1 byte             ; 4-4                    0
stosw                      ; 16c-8m = 8 biu
; fetch 1 byte             ;
; fetch 1 byte             ; no hang, plenty of prefetch free this point on
stosw                      ; 15c-8m = 7 biu
stosb                      ; 11c-4m = 7 biu

Which is 15 bytes, 83 EU clocks, and 23 clocks EU hung waiting on BIU fetch. For all the 'alternating' colors used in sprites, they actually tend to have large runs of the same pixel, especially at low color depths.
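The EU/BIU bookkeeping used in these walkthroughs can be modeled mechanically. Here's a Python sketch of the accounting rule (an assumption about the method, distilled from the traces above: 4 clocks per opcode byte fetched, an instruction's EU clocks minus its own memory clocks freed up for prefetch, and spare BIU time carried forward), with the instruction tuples taken from the listings:

```python
# A minimal sketch of the 8088 EU/BIU overlap accounting used above.
def fetch_hang(instrs):
    """Total clocks the EU spends hung waiting on opcode fetch.
    instrs: list of (opcode_bytes, eu_clocks, mem_clocks)."""
    hang, avail = 0, 0
    for nbytes, eu, mem in instrs:
        need = 4 * nbytes                          # 4 clocks per opcode byte
        hang += max(0, need - avail)
        avail = (eu - mem) + max(0, avail - need)  # spare BIU time carries over
    return hang

# The two absolute-addressing pixel writes (5 bytes, 19c, 4c mem each):
print(fetch_hang([(5, 19, 4)] * 2))                        # 25
# The add/mov/stosb version:
print(fetch_hang([(4, 4, 0), (2, 4, 0), (1, 11, 4)] * 2))  # 33
# The "optimized" ghost rows traced above:
ghost = [(1, 3, 0), (3, 4, 0), (1, 15, 8), (1, 11, 4),
         (4, 4, 0), (2, 4, 0), (1, 16, 8), (1, 15, 8), (1, 11, 4)]
print(fetch_hang(ghost))                                   # 23
```

The same rule reproduces the hang totals quoted in the traces, which is a decent sign the bookkeeping is internally consistent.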

Making my own "sprite compiler" for the 160x100 mode, I've taken that into consideration by having four different types of write operations. Thankfully, since I'm writing to a backbuffer and not the display, I don't have the headache of the interlaced character bytes to deal with.

Code:
{ 2 pixel 'flat' write }
mov al,dataByte
stosb

{ 2 pixel masked }
mov ax,(dataByte shl 8+maskByte)
and al,es:[di]
or  al,ah
stosb

{ 4 pixel 'flat' write }
mov ax,dataWord
stosw

{ 4 pixel masked }
mov ax,es:[di]
and ax,maskWord
or  ax,dataWord
stosw

The increments in-between each of course vary based on the sprite... the 'fighter' ship (sprite 0), using my current "compiler", ends up as this:

Code:
procedure SHIPS_0_0(bufferP:pointer); assembler;
asm
	les di,bufferP
{ Row 1 }
	inc di
	mov ax,$0FF0  { 2 pixel masked }
	and al,es:[di]
	or  al,ah
	stosb
{ Row 2 }
	add di,SHIPSInc+3
	mov ax,es:[di]
	and ax,$0F00  { 4 pixel Masked }
	or  ax,$7079
	stosw
{ Row 3 }
	add di,SHIPSInc+2
	mov ax,$09F0  { 2 pixel masked }
	and al,es:[di]
	or  al,ah
	stosb
{ Row 4 }
	add di,SHIPSInc+3
	mov ax,es:[di]
	and ax,$0F00  { 4 pixel Masked }
	or  ax,$707F
	stosw
{ Row 5 }
	add di,SHIPSInc+1
	mov ax,es:[di]
	and ax,$00F0  { 4 pixel Masked }
	or  ax,$7F0F
	stosw
	mov al,$7F    { 2 pixel Flat }
	stosb
{ Row 6 }
	add di,SHIPSInc+1
	mov ax,$CFF7  { 4 pixel flat }
	stosw
	mov ax,es:[di]
	and ax,$0F00  { 4 pixel Masked }
	or  ax,$F0C7
	stosw
end;

SHIPSInc is a constant, equal to buffer width in bytes minus sprite width in bytes.

Works pretty well. The real speed increase over reading the sprite data isn't so much from the data being inlined as from being able to switch the type of operation, so that when I have two solid pixels (or even better, 4 pixels) in a row I don't have to read from the backbuffer first. Every single row above using normal sprites would have ended up:

Code:
lodsw
mov  bx,ax
lodsw
and  ax,es:[di]
or   ax,bx
stosw
lodsw
mov  bx,ax
lodsw
and  ax,es:[di]
or   ax,bx
stosw
add  di,SHIPSInc

Regardless of the number of bytes actually being changed.
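In Python pseudocode (illustrative, not the poster's code), each word of that generic path costs a backbuffer read even when the mask is fully solid -- which is exactly what the compiled variants avoid:

```python
# Model of the generic inner loop: the sprite stream interleaves a data word
# and a mask word, and every word pays a backbuffer read-modify-write.
def generic_blit_row(stream, buffer, off):
    """stream: [data0, mask0, data1, mask1]; buffer: list of 16-bit words."""
    for i in range(2):
        data, mask = stream[2 * i], stream[2 * i + 1]
        # and ax,es:[di] / or ax,bx / stosw -- even for a solid mask of 0
        buffer[off + i] = (buffer[off + i] & mask) | data
    return buffer

buf = [0xFFFF, 0xFFFF]
generic_blit_row([0x0C0C, 0x0000, 0x070C, 0xF000], buf, 0)
print([hex(w) for w in buf])  # ['0xc0c', '0xf70c']
```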

New test/demo coming up sometime today if all goes well. I may end up storing the fonts as a constant instead of loading them from a file onto the heap -- see if I can get the whole thing self-contained in the .exe
 
BTW, don't suppose anyone knows a faster way to do this:
Code:
procedure buffer_showSprite8x6(screenP,bufferP:pointer); assembler;
asm
	les  di,screenP
	inc  di
	mov  dx,ds
	lds  si,bufferP
	mov  ax,153
	mov  bx,60
	
	movsb
	inc  di
	movsb
	inc  di
	movsb
	inc  di
	movsb
	add  di,ax
	add  si,bx
	
	movsb
	inc  di
	movsb
	inc  di
	movsb
	inc  di
	movsb
	add  di,ax
	add  si,bx
	
	movsb
	inc  di
	movsb
	inc  di
	movsb
	inc  di
	movsb
	add  di,ax
	add  si,bx
	
	movsb
	inc  di
	movsb
	inc  di
	movsb
	inc  di
	movsb
	add  di,ax
	add  si,bx
	
	movsb
	inc  di
	movsb
	inc  di
	movsb
	inc  di
	movsb
	add  di,ax
	add  si,bx
	
	movsb
	inc  di
	movsb
	inc  di
	movsb
	inc  di
	movsb
	
	mov ds,dx
end;

Or is that as lean as I'm gonna get on taking a flat, pixel-packed backbuffer and copying it to text-mode attribute bytes? On the 8086/286 that's tons of fun, what with all those misalignment penalties.
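What one row of that routine does, sketched in Python (a model only; the asm does it with movsb, with inc di skipping the character byte):

```python
# Copy one row of packed backbuffer bytes into every other byte of screen
# memory (the attribute bytes), leaving the character bytes untouched.
def show_row(screen, dst, backbuffer, src, width=4):
    for i in range(width):
        screen[dst + 2 * i] = backbuffer[src + i]   # movsb
        # inc di skips the character byte in between
    return screen

screen = bytearray([0xDE, 0x00] * 4)   # char byte $DE, attribute 0
show_row(screen, 1, bytes([0x11, 0x22, 0x33, 0x44]), 0)
print(screen.hex())  # 'de11de22de33de44' -- char bytes untouched
```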

Faster version of this would be nice too...
Code:
procedure buffer_erase8x6(bufferP:pointer); assembler;
asm
	les di,bufferP
	xor ax,ax
	mov bx,60
	stosw
	stosw
	add di,bx
	stosw
	stosw
	add di,bx
	stosw
	stosw
	add di,bx
	stosw
	stosw
	add di,bx
	stosw
	stosw
	add di,bx
	stosw
	stosw
end;
But I don't think that exists either.
 
BTW, don't suppose anyone knows a faster way to do this:
Or is that as lean as I'm gonna get on taking a flat, pixel-packed backbuffer and copying it to text-mode attribute bytes? On the 8086/286 that's tons of fun, what with all those misalignment penalties.

The faster way is storing your sprite data in the format of the screen, so that writing a pixel is a movsw. The only benefit to keeping a sprite in its packed pixel form is if you're going to write it to some intermediate buffer which gets translated on its way to the screen. However, for 8088+CGA at least, the speed you gain in sprite/drawing ops is lost in the translate-to-screen op.

Faster version of this would be nice too...
Code:
procedure buffer_erase8x6(bufferP:pointer); assembler;
asm
    les di,bufferP
    xor ax,ax
    mov bx,60
    stosw
    stosw
    add di,bx
    stosw
    stosw
    add di,bx
    stosw
    stosw
    add di,bx
    stosw
    stosw
    add di,bx
    stosw
    stosw
    add di,bx
    stosw
    stosw
end;

Shouldn't that be four stosws instead of two? (8x6 implies you're erasing 8 nybbles per scanline.) The only way to speed that up is to not call it as a procedure but insert it inline(). Doing so, you'd save at least 35 cycles (near call+ret).

Remember to profile at various stages of development -- you may find your back-end gameplay housekeeping taking up orders of magnitude more time than sprite handling, making the above optimizations moot. An old demoscene trick is, while debugging, sync the start of a frame on vertical retrace, then change the background color (in your video mode's case, border color) to a different color every time you enter a new area (draw, erase, housekeeping, sound, screen, etc.). You can then visually see what part of a frame is taken up by what procedure (and, if things start flickering wildly, you know you've gone over a frame in duration).

The long-period Zen timer is good too, if you just want to profile and put numbers on the screen as the game is running.
 
The faster way is storing your sprite data in the format of the screen, so that writing a pixel is a movsw.
You'd think so, but there's a catch: that other byte is unchanging, and we're talking about writing to video memory -- which takes two or three times longer than main memory. The wait on the 8-bit bus to write that extra byte is longer than it takes to read in the inc di... No point in hanging the bus trying to write a byte that should NEVER change.

Also that's copying a backbuffer tile to the screen, which the sprites and background stuff are written to... that way I can layer the sprite transparencies without flicker. That's not the sprite being written to the screen, it's a backbuffer tile... so sprite + background + any sprites underneath it.

Goes with the render order for making transparent overlaid sprites:

update ALL sprite positions while storing old ones
render ALL sprites to backbuffer
copy ALL sprite tiles AND ALL old sprite tile locations to screen
erase all sprites from backbuffer (or if a complex background, restore backbuffer from back-back-buffer)

Hmm... your suggestion of not making a function call may be worth implementing for just those erases, as they're nested in that pointer chain. (again using pointer chains instead of for/array index)

The only benefit to keeping a sprite in its packed pixel form is if you're going to write it to some intermediate buffer which gets translated on its way to the screen.
Which is what's going on in that 'show' function -- the pixel packed back-buffer being copied to screen.

Shouldn't that be four stosws instead of two?
Pixels... nybble sized... 2 words == 4 bytes == 8 nybbles... so... no, that's two STOSW on the pixel-packed buffer. That's being written to the backbuffer, which is pixel packed... NOT to screen. Again, why I don't have massive flicker like Moonbugs or Round42... which is why the backbuffer AND sprites are not in the same format as screen...

If the backbuffer and pixels were stored interlaced as per screen, that would be:
sprite to buffer - 4 stosw
buffer to screen - 4 stosw
old loc to screen - 4 stosw
erase on buffer - 4 stosw
vs.
sprite to buffer - 2 stosw
buffer to screen - 4 stosb, 4 inc di
old loc to screen - 4 stosb, 4 inc di
erase on buffer - 2 stosw

... and since inc di is a 1 byte instruction, it gets shoved in the prefetch with time to spare during a stosb... and again with screen RAM being so slow...

The only way to speed that up is to not call it as a procedure but insert it inline(). Doing so, you'd save at least 35 cycles (near call+ret).
Worth considering, as I said above -- originally I wasn't doing it because they're in different library units... but given that it's a TP proc call, it's gonna be a lot more than 35 cycles, what with passing values on the stack.

Though there is the problem of passing heap-area objects to BASM, which is most easily circumvented by passing them on the stack to a procedure... that, or I move them all into global vars so I can access them, but as a rule I hate having stuff as globals and/or in the data segment unless I have to.

Remember to profile at various stages of development -- you may find your back-end gameplay housekeeping taking up orders of magnitude more time than sprite handling, making the above optimizations moot.
RARELY have I ever had that be the case... compared to shoving stuff at video memory, things like keyboard handling and game logic are bupkis; but then I do things like 8:8 fixed-point math instead of floats, procedural variables in arrays instead of if/case statements when possible, etc., etc... The only two things that come close to video for me are audio (because it needs to update so often, and cards like the AdLib are PAINFULLY slow) and the joystick (since inherently a joystick read is just sitting there with the CPU bent over the table with a loop shoved up its I/O bus)... and even then, all those things combined usually work out to less than a quarter of my total time per frame.

An old demoscene trick is, while debugging, sync the start of a frame on vertical retrace, then change the background color (in your video mode's case, border color) to a different color every time you enter a new area (draw, erase, housekeeping, sound, screen, etc.). You can then visually see what part of a frame is taken up by what procedure (and, if things start flickering wildly, you know you've gone over a frame in duration).
Yeah, I use that trick on occasion; it's nice because setting the overscan via the ports adds little to no overhead to skew the results, and it's VERY visual, instantly letting you know what's going on.

The long-period Zen timer is good too, if you just want to profile and put numbers on the screen as the game is running.
Which, Zen is good for recording, but I'd not display those types of results while this is running -- if I had real ASCII text available I'd consider it, but blitting text or even a status line would take so long you'd lose the actual flow... Though I suppose at 286+ speeds this codebase could handle it -- the problem being I can't trust the opcode times on those processors to represent a real 8088, my minimum target.

One trick I often use is to simply run each subsection independently inside a loop for a fixed time period... since each subsection ends up having its own object and/or method. See the code for the next 'test.exe'...

Code:
repeat
	starList.update;
	shipList.update;
	starList.drawOnBuffer;
	shipList.drawOnBuffer;
	starList.blitFromBuffer;
	shipList.blitFromBuffer;
	starList.eraseFromBuffer;
	shipList.eraseFromBuffer;
	inc(shows[t]);
	while userTimerExpired(timerInterval) do inc(ticks);
	if keypressed then begin
		ch:=readkey;
		if ch=#27 then exit;
	end;
until ticks>testTimeInTicks;

That's to run it flat-out. The fixed frame-rate version just swaps the inc and the while, thus:

Code:
	inc(ticks);
	while not(userTimerExpired(timerInterval)) do inc(excess[t]);

I think you're right though -- I've got enough calls as it is with the compiled sprites and objects -- moving some stuff out of the buffer unit and into the sprites unit could make a big difference... if I can convince TP7's assembler to actually read the values stored in an object on the heap... I could abandon objects for arrays, but that tends to lose as much speed as I'd gain, thanks to how long array lookups take... I suppose I could just allocate dummy space on the stack for them as needed -- one 32-bit pointer copy vs. a CALL and a slew of push/pop to the stack AND passing values on the stack? Hmm...

Tough call all around.
 
You'd think so, but there's a catch; that other byte is unchanging, and we're talking about writing to video memory -- which takes two or three times longer than main memory. The wait on the 8 bit bus to write that extra byte is longer than it takes to read in the inc di... No point in hanging the bus trying to write a byte that should NEVER change.

I found it was faster when profiling. When I get home I'll run some tests on the XT+CGA for you.

Also that's copying a backbuffer tile to the screen, which the sprites and background stuff are written to... that way I can layer the sprite transparencies without flicker. That's not the sprite being written to the screen, it's a backbuffer tile... so sprite + background + any sprites underneath it.

I was unaware up until now you were writing to a virtual buffer whose memory layout does not match the screen layout. One generally does this for two reasons: Screen update speed (the backbuffer memory layout matches the screen's layout) or housekeeping speed (the backbuffer is arranged as necessary for fast composing). Have you considered changing your backbuffer so that it is byte-based, like MCGA? That way, you eliminate reading the background for masked sprite operations.

Which zen is good for recording, but I'd not display those types of results while this is running

I meant for a debugging mode, for example where interrupts are not used to advance a frame, but rather a fixed event like a keypress.

thanks to how long array lookups take

Array lookups are faster than pointer lookups on this architecture. There's a reason Intel calls bx/bp/si/di indexing registers and why the 808x allows you to use them for that purpose (ie. mov ax,[bx+si+1234] is much faster than add si,bx; add si,1234; mov al,[si]). I'd suggest looking at some compiled code for both arrays and pointer chains in Turbo Debugger so you can see how TP is constructing the code for each.
 
I found it was faster when profiling. When I get home I'll run some tests on the XT+CGA for you.
remember that if I keep the backbuffer in its current format (which is the most efficient I've found for blitting sprites to, and erasing/resetting sprites from) the code would end up having to be:

Code:
mov ah,$DE
; start copying backbuff to screen
lodsb
stosw

I'm pretty sure lodsb+stosw is going to be slower than stosb+inc di.

Changing the backbuffer format to match ups the backbuffer to 16k (not really viable for my memory target), and makes the sprite blits more complex and take twice as long...

and...
; 4 clocks fetch movsw
movsw ; 26 clocks
; 26-16m = 10 free, so fetch next 2.5 instructions
movsw ; 26 clocks
; 26-16m = 10 free, 2 used to finish fetching the final movsw
movsw ; 26 clocks
; fetch non-event
movsw ; 26 clocks

runs at 108 clocks (104+4 fetch)

; 4 clocks fetch movsb
movsb ; 18 clocks
; 18-8m = 10 free, so fetch next 2.5 instructions
inc di ; 3 clocks
; 3 free, so +0.75
movsb ; 18 clocks
; 18-8m = 10 free, so fetch next 2.5 instructions
inc di ; 3 clocks
; 3 free, another +0.75
; somewhere around here, we've actually filled the biu so the prefetch actually hangs waiting for opcodes to run.
movsb ; 18 clocks
; 18-8m = 10 free, so fetch next 2.5 instructions
inc di ; 3 clocks
; 3 free, another +0.75
movsb ; 18 clocks
; 18-8m = 10 free, so fetch next 2.5 instructions
inc di ; 3 clocks

Thanks to the BIU, the extra instructions are prefetched, so there's no penalty despite the movsb+inc being 8 bytes for the 4x -- so that's 76 clocks vs. 108 for movsw... so movsb+inc is faster than movsw with the formats matching -- and that's not even taking the slower speed of video memory into account. Think of it this way -- 4 extra bytes written to video memory, or 4 extra 1-byte opcodes read from system memory... not a hard choice when the combined EU time is shorter too (21 clocks vs. 26).

I figured that out while writing Paku Paku - because originally I had it doing what you suggested, and VERY quickly switched the backbuffer to packed just to write to video memory less.

I was unaware up until now you were writing to a virtual buffer whose memory layout does not match the screen layout. One generally does this for two reasons: Screen update speed (the backbuffer memory layout matches the screen's layout) or housekeeping speed (the backbuffer is arranged as necessary for fast composing).
Which I'm definitely choosing the latter... as well as the lower memory requirements.

Have you considered changing your backbuffer so that it is byte-based, like MCGA? That way, you eliminate reading the background for masked sprite operations.
I did, but rejected it... if I stored them all as the bottom nybble, I'd have to do a painfully slow shift-by-4 while copying to screen... if I stored them alternating high/low, I'd still have to "or" them together while blitting to screen... meaning...

Code:
lodsw
or al,ah
stosb
inc di

Painful at best... It also means that when I can blit 2px flat it's a stosw instead of a stosb, and 4px flat would be two stosw instead of one.

Storing them pixel packed lets my sprite to buffer routine exploit stosw to do 4 pixels in one operation...

I think I should have included the third operation in this equation, the blit of a sprite to the backbuffer... an example of which is above as well with sprite0 offset 0.

Array lookups are faster than pointer lookups on this architecture. There's a reason Intel calls bx/bp/si/di indexing registers and why the 808x allows you to use them for that purpose (ie. mov ax,[bx+si+1234] is much faster than add si,bx; add si,1234; mov al,[si]). I'd suggest looking at some compiled code for both arrays and pointer chains in Turbo Debugger so you can see how TP is constructing the code for each.

from what I've seen... well, for example, let's say I had an object... I'll use one of the stars as an example...

Code:
type
	pStar=^tStar;
	tStar=object
		x,y,dy:integer;
		color:byte;
		next:pStar;
		bufferP,oldBufferP,
		screenP,oldScreenP:pByte;
		constructor init;
		procedure reset;
		destructor term;
	end;

If I had an array[0..15] of those, and iterated through them

Code:
var t:word;
begin
  for t:=0 to 15 do with starList[t] do begin
    { do something here}
  end; end;
end;

It takes more memory and is 10-20% slower than if I populated .next properly, and did

Code:
var p:Pstar;
begin
  p:=first;
  repeat
    with p^ do begin
      {do something here }
      p:=next;
    end;
  until p=nil;
end;

It's why pointered lists of records and objects replaced arrays in the first place; no calculation, as the 'next' one is already stored. It ends up about 60 bytes less code compiled, and 10-20% faster.

If I was tracking just a single variable, like say a word-sized x coordinate, then an array lookup might be faster, but once you get into complex data types, not so much.

It's why I build them thus:
Code:
constructor tStarList.init(stars:word);
var
	t:word;
begin
	new(first,init);
	last:=first;
	for t:=2 to stars do begin
		new(last^.next,init);
		last:=last^.next;
	end;
end;

Pointered lists kick arrays' backsides... even more so if you keep them on the heap -- well, except for that 4 bytes of extra overhead for the 'next' pointer... even more so since you can use the dispose method to kill its children before it disposes of itself.
Code:
destructor tStar.term;
begin
	if not(next=nil) then dispose(next,term);
end;

destructor tStarList.term;
begin
	if not(first=nil) then dispose(first,term);
end;

makes cleanup a snap. It's also handy if you need to sort them, as it makes btrees simple... much less you can change how many of them there are at runtime while still having range-checking... though arrays DO rock when you cannot predict the order or offset you want to access them in; but if you are only ever going to sequentially call them in the same order over and over again, pointered lists 'for the win', as there's no extra calculation when moving on to "the next" -- it's already calced. It's the same as having an array of fixed-size elements and incrementing a base pointer until the end, instead of doing the multiply for every access.

Random Access -- use the array
Sequential Access -- use pointers

Kind of like blitting sprites
Single points at non-sequential offsets -- use displacements
Sequential points -- use STOS
 
I'm pretty sure lodsb+stosw is going to be slower than stosb+inc di.

You mean movsb+inc di, right? (Just want to make sure so I'm profiling the right code)

Storing them pixel packed lets my sprite to buffer routine exploit stosw to do 4 pixels in one operation...

You mean 2 pixels, right? Because an stosw will write a charbyte+attrbyte.
 
You mean movsb+inc di, right? (Just want to make sure so I'm profiling the right code)
right... I think on a real 8088 you'll be surprised by that one; a movsb and inc is faster than a movsw if you do a bunch of them in a row when all you want to update is every other byte -- though there IS the alignment penalty when you get to a real 16 bit bus on the 286+.

You mean 2 pixels, right? Because an stosw will write a charbyte+attrbyte.
No, pixel-packed backbuffer, so no charbyte... no charbyte in the backbuffer lets stosw do four pixels when going from sprite to backbuffer, or from background-back-buffer to backbuffer (when using a background-back-buffer like in Paku Paku). stosw only manages 2 pixels (and 1 byte of wasted memory access... well, two, since we're talking a read and a write) when dealing with backbuffer to screen.

sprite -- pixel packed, 2 bytes == 4 pixels (not really since I'm using procedurals now)

backbuffer -- pixel packed, 2 bytes == 4 pixels

screen -- character interlaced, 2 bytes == 2 pixels + character we don't need to change value of.
 
Random Access -- use the array
Sequential Access -- use pointers

.... not to stand around quoting myself, but this reminded me to show how I'm handling calling those 'compiled sprites' procedures.

Code:
type
	tBufferProc=procedure(bufferP:pointer);

var
	SHIPSList:array[0..255] of tBufferProc;

Then I populate the array 'the hard way' during unit initialization...
Code:
begin
	SHIPSList[0]:=SHIPS_0_0;
	SHIPSList[1]:=SHIPS_0_1;
	SHIPSList[2]:=SHIPS_1_0;
	SHIPSList[3]:=SHIPS_1_1;

etc, etc, etc... It sucks, the amount of code involved in setting it up, but that array lookup beats using massive 'case' statements, since all I have to do is call:

Code:
BufferP:=pointer(longint(backBuffer)+(x div 2)+y*bufferWidthInBytes);
SHIPSLIST[sprite*2+x and 1](bufferP);

Keeps it pretty simple and quick. I can't predict which one, as I'm not accessing them in a fixed order, so the array lookup for the proper procedure is the best bet. Arrays of procedure are one of TP's most powerful yet underused capabilities.
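The same dispatch idea, sketched in Python (illustrative only; the procedure names here stand in for the generated SHIPS_* routines): a flat table of callables indexed by sprite number and x alignment replaces a big case statement.

```python
# Each entry stands in for one compiled-sprite procedure.
def ship_0_even(buf): buf.append("SHIPS_0_0")
def ship_0_odd(buf):  buf.append("SHIPS_0_1")
def ship_1_even(buf): buf.append("SHIPS_1_0")
def ship_1_odd(buf):  buf.append("SHIPS_1_1")

ships_list = [ship_0_even, ship_0_odd, ship_1_even, ship_1_odd]

def draw(sprite, x, buf):
    # same index math as SHIPSLIST[sprite*2 + x and 1](bufferP)
    ships_list[sprite * 2 + (x & 1)](buf)

calls = []
draw(1, 5, calls)   # sprite 1, odd x coordinate
print(calls)        # ['SHIPS_1_1']
```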

For some reason it's faster to do the math myself than to index the array as [0..1,0..127] or [0..127,0..1] -- not sure why... You'd think:

SHIPSList[sprite,x and 1](bufferP);

would be faster, but it isn't.

-- edit --

Huh, I went back and checked that again, and it's faster now -- somehow the changes I made since I last tested indexing the array changed the performance... I wonder if calling it inside UPDATE instead of as its own 'ShowOnBuffer' routine has something to do with it. (One less function call saves the day again?)

... and moving the eraseFromBuffer and blitFromBuffer's actual write routines into the loops, instead of as function calls, added up to another 8% speedup... so good advice there.

Code:
procedure tSpriteList.eraseFromBuffer;
var
	p:pSprite;
	bPointer:pointer;
begin
	p:=first;
	repeat
		with p^ do begin
			bPointer:=bufferLoc;
			p:=next;
		end;
	{	asm same as buffer_erase8x6 }
		asm
			les di,bPointer
			xor ax,ax
			mov bx,60
			stosw
			stosw
			add di,bx
			stosw
			stosw
			add di,bx
			stosw
			stosw
			add di,bx
			stosw
			stosw
			add di,bx
			stosw
			stosw
			add di,bx
			stosw
			stosw
		end;
	until p=nil;
end;
Sucks to have to copy the pointer (p^.bufferLoc) into the local var, but its being on the heap and in an object makes it hard to access in BASM otherwise; it's no different overhead-wise than storing it on the stack for a function call though, so it's fine.
 
Right... I think on a real 8088 you'll be surprised by that one; a movsb and an inc is faster than a movsw if you do a bunch of them in a row when all you want to update is every other byte -- though there IS the alignment penalty once you get to a real 16-bit bus on the 286+.

Your assumption is correct. On a 4.77MHz 8088, here are two code snippets and their timings:

Code:
movsb
inc di
movsb
inc di
movsb
inc di
movsb
inc di

24 microseconds

Code:
lodsb
stosw
lodsb
stosw
lodsb
stosw
lodsb
stosw

36 microseconds

I also tested the speed of various array lookup methods, from a simple array of bytes, to a complex array of unaligned records, to a linked list. My code retrieved a byte from the structure, then advanced the index pointer (in the case of the linked list, the advance took the form of p:=p^.next). According to the code that Turbo Pascal 7 outputs, here are the results:
  • Simple array: 25 usecs
  • Complex array: 53 usecs (math needed on index to point to proper location)
  • Linked list: 47 usecs
The linked list is always 47 usecs regardless of the complexity of the structure. Array lookups can be faster than the linked list, but it depends on the complexity of the index math. In the case of the "complex array" test above, I made it as bad as possible by having the index land on a prime number as equidistant from two powers-of-two as I could.

I was slightly disappointed to see that TP doesn't output better code; for example, iterating through a list of bytes with just "mov al,[si+bx]; inc bx;" -- functionally equivalent to the "simple array" test above -- takes only 5 usecs.
 
I was slightly disappointed to see that TP doesn't output better code; for example, iterating through a list of bytes with just "mov al,[si+bx]; inc bx;" -- functionally equivalent to the "simple array" test above -- takes only 5 usecs.
It does produce better code; it's just that speed of execution wasn't the only priority: the same code is used for arrays on the stack as for arrays on the heap, WITHOUT changing DS. TP's compiler considers DS 'inviolate', always pointing at the data segment, which is why you have to preserve it in your own ASM functions... They probably should have detected at compile time where the array was being accessed from, but there's a reason it compiles fairly quickly... It makes some assumptions, and what you might expect and what it actually does don't always line up.

... and the code it makes is far, far better than, say... anyone else's Pascal compiler of the era... or C compiler, for that matter... or even some compilers of today; see GCC, whose x86 output is absolutely dreadful. (It really was never meant for Intel chips in the first place, and it shows.)

Also part of why RISC type chips are so popular with compiler makers -- As Prof. Wirth wryly noted some time ago, "RISC is meant for people who write compilers, CISC is meant for people who write assembly." (though I'm not sure he REALLY said that -- you know how it is with sayings attributed to people who never said them)

Though... wouldn't

add si,bx
lodsb

be much better? 3 bytes, ~15 EU clocks (3 for the add, 12 for lodsb), the entire operation smaller than the prefetch queue... as opposed to mov al,[si+bx]; inc bx, which is also 3 bytes but around 18 clocks? (Unless you really need to preserve SI, but one push/pop isn't the end of the world; or use another register, or sub si,count+bx at the end.)

Though again, if your values aren't on the stack in the first place... the cost of swapping the stack segment or saving the stack index is often not worth it compared to just using a raw calculation.

... also remember, var T:word; is not BX or even CX -- TP often fails to replace vars in for/next loops with registers, which also throws the advantage to the pointer methods.
 