Adventures in x86 C & assembly

eeguru · Nov 3, 2011

Last night I managed to block out some time to work on a bit of BIOS code for the JR-IDE card. It really brought me back to a place I wasn't expecting. With modern CPUs we completely take for granted caching, pipe-lining, prefetch, high clock rates, and even things as simple as 32-bit wide busses. With the 4.77 MHz 8088, I was freshly reminded each instruction is multiple bytes long. Each byte of the instruction must be fetched one bus cycle at a time. Each bus cycle takes 4-7 clocks. Instruction decode, setup, and execution takes anywhere from 2 to 40 clocks. Then each output result must be stored back one bus cycle and 4-7 clocks at a time. 4.77 million clocks a second evaporates quick when everything is a geometric divide.

Most of us are well aware of these facts. I certain was and am after 30 years of doing this. However I don't think I was really 'in-tune' with it till I found myself actually writing code where eliminating a few instructions or even shortening the encoding of what you are trying to accomplish really became visible in human time. There was a time in my youth where I was a bit sharper and I really viewed C as an assembly macro-language that was more maintainable than raw assembly. I was very intimate with both Watcom and GNU compilers and could see ahead the generated assembly in my brain much like Cypher could see brunettes in the Matrix. Last night, at least somewhat, brought me back to that place. Was a bit like scrunching my toes in the sand on my favorite beach I only get to visit once a decade.

Was good stuff. And great to have fresh perspective in my day to day.

pearce_jj · Nov 8, 2011

This is gaining traction in engineering and academic circles too..., in the UK at least it is now widely recognised that there is bascially very, very little being taught about how computers actually work, the rot having started in the 90's when Cobol, Pascal & C for computing science projects were replaced with Excel VBA and indeed computing science itself then replaced with 'ICT' or some other nonsense. As said, in the late 80's the fetch-execute cycle actually translated directly to the hardware that was on the table in front of students.

As it happens whilst researching iPad security I stumbled across the raspberry-pi project which aims to provide a cheap single-board computer the aim being to address that. Although of course even with the little ARM chip it's processing throughput will still be near incomprehensible compared to and C code it runs.

That probably is the crux of the issue - the CPUs operate at speeds we (or I at least) simply cannot comprehend. Whereas even in the early 90's programming the typical school project library database system (written from scratch in Pascal) meant that coding inefficiencies and sorting algorithm choices REALLY mattered, even with only 100 records in the database - especially when running from floppy disk!

But back on topic, very interested in your development efforts and can't wait to see the results!

eeguru · Nov 8, 2011

I work in embedded automotive atm and I had a peer review some code of a junior person yesterday. In his report, he cited that every variable declaration didn't have a zero initializer at the point of declaration per our coding standards (most of which I despise). The code was fine. The author initialized every variable before first use - rarely with zero. I just forehead slapped. This is an embedded programmer in 2011? He is an older guy (40s) but I am fairly certain he has never looked at an assembly mnemonic in his entire life. And he has his own team of people he mentors every day. Granted most compilers will optimize out the first of two move immediates, if you enable optimizations. But to me, it emphasizes the difference between a computer scientist and engineer.

Tor · Nov 8, 2011

Pre-initializing every variable is a very bad habit and is prohibited in coding standards where I work.

The reason is that if you make a coding error and forget to set a variable (to its calculated value, somewhere in the actual code section) before using it elsewhere the compiler will be unable to detect your mistake. If you don't initialize it the compiler will spit out a warning, at least modern GCC and most commercial compilers I know about.

The only time you initialize a variable is if it's effectively a constant and won't be set elsewhere, or in certain cases where you normally use a default value but it's set to something else in exceptional cases. Even then you might be better off not initializing it.

-Tor

deathshadow · Nov 8, 2011

eeguru said:
Each byte of the instruction must be fetched one bus cycle at a time. Each bus cycle takes 4-7 clocks. Instruction decode, setup, and execution takes anywhere from 2 to 40 clocks. Then each output result must be stored back one bus cycle and 4-7 clocks at a time. 4.77 million clocks a second evaporates quick when everything is a geometric divide.

Gets even more complicated when you figure in the 4 byte prefetch buffer... something I dealt directly with in writing Paku Paku. Many coders think of their execution times just as byte size plus opcode size, when it's actually even more complex... for example:

Code:

shl  ax,1
shl  ax,1
shl  ax,1

2 bytes, 2 cycles each -- since it takes 4 cycles per byte to fetch the precache spends most of it's time empty on that -- do that after a prefetch emptying instruction (like jmp), and it's going to take 26 clocks to execute. 8 clock fetch, 2 execute, 2 fetch during that execute, 6 more fetch, 2 execute, 2fetch during that execute, 6 more fetch, then 2 execute... the fetch during execute not counting towards the total... but let's say you ran it after a longer instruction like say... mul bx

Code:

mul    bx         { 113 to 133 clocks -- let's just say prefetch is full with the next 4 bytes }
shl    ax,1      { 2 clocks, freeing two bytes prefetch and start half handshaking for next byte }
shl    ax,1      { 2 clocks, freeing two bytes prefetch finish fetching next byte}
shl    ax,1      { 2 clocks, fetch 1 byte for 4 clocks. }

So instead of the 26 clocks of the first one, those three shifts execute in 10 clocks due to the multiply before it! Gets even more interesting on the 8086 since it has a 6 byte prefetch queue.

In many places in my code, I found just re-arranging the order in which values are set and memory is read the execution times go all over the place... for example:

Code:

xor  di,di   { 1 byte, 2 clocks }
mov  al,$20 { 3 bytes, 4 clocks }
mov  cx,mem {  4 bytes, 29 clocks thanks to EA calc}

ends up taking 18 more clock cycles to execute at the start of a procedure (call is another prefetch empty) than simply flipping the order:

Code:

mov  cx,mem {  4 bytes, 29 clocks thanks to EA calc}
mov  al,$20 { 3 bytes, 4 clocks }
xor  di,di   { 1 byte, 2 clocks }

Of the 21 clocks left over after the memory read on the mov cx, 16 can be used to fetch all four bytes of the next two instructions. Executing the 3 byte instruction before the 1 byte one frees up more space in the prefetch meaning that over the next six clocks an entire byte and handshaking for the next byte can be done.

Some of the tricks one gets into with graphics programming can be amusing on that front too. Take a simple X,Y address calculation for the VGA's 320x200x256 graphics mode. you'll see this example a lot.

Code:

mov  di,screenX
mov  ax,screenY
mov  bx,320
mul  bx
add  di,ax

MUL is a ridiculously slow instruction, even on later processors. (right up to the 386 really)... the normal solution people use is to say "bah, forget the 8088. It's VGA, use 286 instructions"

Code:

mov  di,screenX
mov  ax,screenY
shr  ax,6 { == *64 }
add  di,ax
shr  ax,2 { now == * 256 }
add  di,ax

Which is a lot faster -- to the tune of around 10 clocks faster on an AT; but -- it doesn't even run on a 8088/8086 since it can't do more than shl reg,1 without getting cl involved... but if you think about the data - the Y coordinate only contains 0..199 -- meaning there's a faster way to get that *256 out of it.

Code:

mov  di,screenX
mov  ah,screenY
xor  al,al      { ax=screenY:00 == screenY*256! }
add  di,ax
shr  ax,1       { ax=screenY*128 }
shr  ax,1       { ax=screenY*64 }
add  di,ax      { di now equals screenX+screenY*320! }

Which actually executes as fast if not faster than that AT optimized version, in about the same number of bytes, while working on the 8088/8086. Simply using a byte swap to get it *256 and two single bit shifts gives you the same result... which is one of the key skills to working in assembler; sometimes you have to come at the problem from the opposite direction... like replacing a multiply with two shifts basically dividing by two each.

Fun stuff.

pearce_jj · Nov 18, 2011

Only just seen this post, the screen calculation example is absolutely great!

commodorejohn · Nov 18, 2011

Oh, now that's a good one.

One of these days I need to revisit my Tandy tile/sprite code and see if I can't get it tuned up...ran problematically slow even with just one 32x32 sprite when I tried it. It's that damn CGA-style interleave complicating everything... :/

deathshadow · Nov 18, 2011

commodorejohn said:
One of these days I need to revisit my Tandy tile/sprite code and see if I can't get it tuned up...ran problematically slow even with just one 32x32 sprite when I tried it. It's that damn CGA-style interleave complicating everything... :/

The trick to dealing with the CGA interleave is to store your sprite interlaced as well... think about it:

Blit 16 lines.
address:=(address+($2000-(scanlineWidthInBytes*16))) and $3FFF;
Blit 16 lines.

Makes it simple. I'm actually working on a 320x200 4 color CGA port of paku paku on request for the handful of machines that still can't manage my tweaked text mode... (like my sharp PC-7000)

My blitting routine (16x12 move for 12x12 sprites) ends up using that approach to go from backbuffer to screen... of course the game logic is going to remain 160x100 with everything moving 2px per 'tick'... we'll see how that turns out.

Code:

procedure copy2Screen12x12(x:word; y:byte); assembler;
{
	actually copies 16x12 on dword boundary to handle sprite shift.
	Faster to blit whole block than it is to add extra logic to handle offset

	game logic remains 160x100, so the scanline 'copy' is easier to deal with.
}
asm
	mov  di,x
	shr  di,1
	shr  di,1
	mov  ah,y
	xor  al,al { ax=y*256 }
	shr  ax,1
	shr  ax,1
	add  di,ax { +64 }
	shr  ax,1
	shr  ax,1
	add  di,ax { +16 = 80 bytes per scanline }
	mov  ax,$B800
	mov  es,ax
	push ds
	lds  si,backBuffer
	add  si,di
	mov  ax,76
	mov  cx,6
@loop1:
	movsw
	movsw
	add di,ax
	add si,ax
	loop @loop1
	mov cx,6
	add di,$1E20 { 8192-80*6 }
	add si,$1E20 
@loop2:
	movsw
	movsw
	add di,ax
	add si,ax
	loop @loop2
	pop ds
end;

fun times. Also thinking I might try to do a composite color version -- just for kicks.

commodorejohn · Nov 19, 2011

Duh, I can't believe I didn't think of that. IIRC I had it blitting from a linear sprite and jumping down one "plane" each line, then back every fourth line. Now I feel a bit stupid :/

The other thing is positioning on odd X positions...the fastest approach would probably be to store shifted copies of the sprites, but that just seems like such a waste of memory...

deathshadow · Nov 19, 2011

commodorejohn said:
The other thing is positioning on odd X positions...the fastest approach would probably be to store shifted copies of the sprites, but that just seems like such a waste of memory...

It is, but it's easily ten times faster. I do it in Paku Paku for two copies... Tandy/Jr 16 color would be much the same (It's why I had a tandy/jr 160x200 version for a bit). CGA 4 color is much worse since at 2bpp that would be four copies. Paku Paku's sprites are stored 3 bytes wide and five bytes tall.. for a 5x5 grid with room for the shift -- that's only 15 bytes for each shift, for 30 bytes per 'frameset'... which is why the entire game's sprites don't even consume 8k total WITH masks that also have shifted copies. (so actually 4 bitmaps per 'frameset')

(frameset == left and right shifted sprite frame together)

But that hinges on your number of sprites and their size. 32x32 is pretty massive for a CGA sprite... that's roughly 1/6th the screen height... even so at 2 bits per pixel with two shifts, that's 1k per 'frameset'... not too bad, but yeah that could add up fast. A more realistic sprite size for 320x200 16 color mode would be 15x15 stored as 15 scanlines of 8 bytes each -- with the shifted pre-copy that's only 240 bytes per 'frameset'. 7x7 might be even better since that would be 7 scanlines of 4 bytes... that's only 112 bytes per 'frameset'.

Keep in mind your bus speed and video ram limitations as well... even 112 bytes per blit with masking at 30fps limits you to around 4 sprites maximum, around 10 at once without masking.

It's actually something else that tweaked text mode gives an advantage on, bigger sprites due to the lower resoluton means less RAM to shove around. Probably why Sierra did the original King's Quest in 160x200 instead of 320x200.

commodorejohn · Nov 19, 2011

Yeah, true. I just have a basic tendency towards larger characters and objects, it's so much easier to create good sprites at 24x24 or larger than at 16x16, unless you go for a "super-deformed" look designed specifically to fit in that size...

Of course, not all sprites have to be exactly that size, so a game could have one or two larger sprites and a handful of bullets/small enemies or something.

deathshadow · Nov 29, 2011

I actually find the smaller sizes EASIER to work with -- and more fun.

For example, some of the sprites I'm playing with now for various upcoming games -- none of them are bigger than 16x16... and I kinda like it that way.

Click for 4x view

Those are all designed to fit 12x16, pretty large for the 160x100 mode.

Click for 4x view

Still VERY early on in working on those, but you can see they're designed to a 16x12 box.

Click for 5x view

Your typical defense command/invaders/galaxian/gorf type game sprites, these designed to fit a 8x6...

Though in the case of each, they're actually 1px narrower so I have room to provide the shifted copy without changing the total byte width.

It is a bit more... challenging to work in that size, but once you get enough colors in there you can do some pretty nice looking stuff... I mean, if it was good enough for the better Atari 400/800/5200 games and the intellivision...

But then, I'm an old B&W trash-80 guy, so anything more than 6x6 is "massive" to me.

commodorejohn · Nov 29, 2011

Yeah, certainly you can make good use of smaller sprites, if you know what you're doing - it's just I find I'm more suited to something in the medium range (i.e. not your Street Fighter Alpha 192px monstrosities, but something larger than 16x16 to be sure.)

kiyotewolf · Feb 13, 2012

Here's some C stuff, C for 8088, Small C is what it's called.

http://www.cpm.z80.de/small_c.html

~Paul

Trixter · Feb 13, 2012

eeguru said:
Last night I managed to block out some time to work on a bit of BIOS code for the JR-IDE card.

Love that this is happening. If this work is still in progress, I'd love to see what you've done. My hobby over the past decade has been 8088 optimization.

Adventures in x86 C & assembly

eeguru

Veteran Member

pearce_jj

Veteran Member

eeguru

Veteran Member

Tor

Veteran Member

deathshadow

Veteran Member

pearce_jj

Veteran Member

commodorejohn

Veteran Member

deathshadow

Veteran Member

commodorejohn

Veteran Member

deathshadow

Veteran Member

commodorejohn

Veteran Member

deathshadow

Veteran Member

commodorejohn

Veteran Member

kiyotewolf

Experienced Member

Trixter

Veteran Member