Optimizing Double Buffer: C vs Asm

neilobremski · Feb 16, 2017

There's no contest, right? C versus Assembler and you know the latter is going to crush the former. The question really becomes why and what you can do about it. But first I wanted to see just how fast I could make a double buffer for CGA.

As with previous entries, I'm using Microsoft C 5.10 as my baseline compiler. Your own compiler may get better or worse results, but I've found this to be a good average as far as tools of the real mode era go (i.e. the late 1980s). I created some helper functions for a double buffer that is 256x190 pixels in size. This will be used for an optimized version of Magenta's Maze, and those are the dimensions of its 3D viewing area. For 2-bit pixels this comes out to 12,160 bytes (256 * 190 / 4). This is about a 25% savings over buffering the entire CGA memory area, which is actually 16,384 bytes due to the interlaced scan lines.
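As a quick sanity check of those numbers, here is a small sketch in plain, portable C. The constant and function names are made up for illustration and are not part of the Magenta's Maze code.

```c
#include <assert.h>

/* Illustrative constants (hypothetical names, not from the original code). */
enum {
    VIEW_WIDTH      = 256,   /* pixels per line in the 3D view */
    VIEW_HEIGHT     = 190,   /* scan lines in the 3D view */
    PIXELS_PER_BYTE = 4,     /* 2-bit pixels: four per byte */
    CGA_VIDEO_BYTES = 16384  /* full CGA refresh buffer, both banks */
};

/* Bytes needed for one line of the view: 256 / 4 = 64. */
unsigned view_bytes_per_line(void)
{
    return VIEW_WIDTH / PIXELS_PER_BYTE;
}

/* Bytes needed for the whole double buffer: 64 * 190 = 12160. */
unsigned long view_buffer_bytes(void)
{
    return (unsigned long)view_bytes_per_line() * VIEW_HEIGHT;
}

/* Whole-percent saving compared with buffering all 16,384 bytes
   (truncated, so 25.78% is reported as 25). */
unsigned view_saving_percent(void)
{
    unsigned long used = view_buffer_bytes();
    assert(used <= CGA_VIDEO_BYTES);
    return (unsigned)(((CGA_VIDEO_BYTES - used) * 100UL) / CGA_VIDEO_BYTES);
}
```

Each line of the view is 64 bytes, which is the figure the copy routines below rely on.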

Here's the slow (C) version I wrote which unrolls loops and attempts to copy 16 bits at a time using unsigned short pointers.

Code:

unsigned char far *dbuf = (unsigned char far*)0;

void initdbuf(void)
{
	dbuf = _fmalloc(12160); /* 256 * 190 / 4 */
}

void clrdbuf(unsigned short pattern)
{
	unsigned short far *dp = (unsigned short far*)dbuf;
	unsigned short height = 190;
	unsigned short register pw = pattern;

	while (height--) {
		*dp++ = pw; *dp++ = pw; *dp++ = pw; *dp++ = pw;
		*dp++ = pw; *dp++ = pw; *dp++ = pw; *dp++ = pw;
		*dp++ = pw; *dp++ = pw; *dp++ = pw; *dp++ = pw;
		*dp++ = pw; *dp++ = pw; *dp++ = pw; *dp++ = pw;
		*dp++ = pw; *dp++ = pw; *dp++ = pw; *dp++ = pw;
		*dp++ = pw; *dp++ = pw; *dp++ = pw; *dp++ = pw;
		*dp++ = pw; *dp++ = pw; *dp++ = pw; *dp++ = pw;
		*dp++ = pw; *dp++ = pw; *dp++ = pw; *dp++ = pw;
	}
}

void copydbuf(void)
{
	unsigned short far *sp = (unsigned short far*)dbuf;
	unsigned short far *dp = (unsigned short far*)0xB8000000L;
	unsigned height = 95; /* x2 = 190 */

	while (height--) {
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;

		dp += 0xFE0;	/* word pointer: +0x1FC0 bytes, to the odd line at +0x2000 */

		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;
		*dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++; *dp++ = *sp++;

		dp -= 0xFF8;	/* -0x1FF0 bytes: back to the even bank, one 80-byte line down */
	}
}

void quitdbuf(void)
{
	_ffree(dbuf);
	dbuf = (unsigned char far*)0;
}

It's a little bloated but it seems reasonable. Let's see how it performs on my Tandy 1000 HX, timing both the clear and copy functions four times, once for each CGA color:

View attachment 36262

For those that can't view the animation, the results are:

Clear Times:
0.) 114506 (114 milliseconds)
1.) 60598 (60 milliseconds)
2.) 60600 (60 milliseconds)
3.) 60599 (60 milliseconds)
Copy Times:
0.) 279838 (279 milliseconds)
1.) 279847 (279 milliseconds)
2.) 279843 (279 milliseconds)
3.) 279844 (279 milliseconds)

It takes about a third of a second (roughly 340 milliseconds) to clear the double buffer and copy it to the video refresh buffer on the CGA. That would mean a maximum FPS of just under 3 without anything else being drawn or done with the CPU. That's really poor, and clearly it's why the double buffering code is turned off by default in Magenta's Maze on the Tandy 1000.

What I did for assembler was to write the code into a memory buffer and then call it using an interrupt vector so that I could specify the register values in C (based on my research making FWMEMCPY earlier this morning). Why? Because MSC 5.1 doesn't have inline assembler ... and I'm a weird crank that likes to see if I can do something atypical to get good results!

Here then is the fast (asm-based) code:

Code:

unsigned char far *dbuf = (unsigned char far*)0;
unsigned char REP_STOSW_IRET_NOP[4] = { 0xF3, 0xAB, 0xCF, 0x90 };
unsigned char far *DBUF_X86_COPY = (unsigned char far*)0;
p_interrupt oldvect60h = (p_interrupt)0;
p_interrupt oldvect61h = (p_interrupt)0;

void initdbuf(void)
{
	int y;
	struct SREGS segregs;
	p_interrupt pint;
	segread(&segregs);

	/* DBUF_X86_COPY : REP MOVSW code for each scanline */
	DBUF_X86_COPY = _fmalloc(1522);
	*DBUF_X86_COPY++ = 0xFC; /* CLD */
	for (y = 0; y < 190; y++) {
		unsigned short offset = (y / 2 * 80) + (y % 2 ? 0x2000 : 0x000);

		/* MOV CX, 0020h */
		DBUF_X86_COPY[(y<<3)+0] = 0xB9;
		DBUF_X86_COPY[(y<<3)+1] = 0x20;
		DBUF_X86_COPY[(y<<3)+2] = 0x00;

		/* MOV DI, <offset> */
		DBUF_X86_COPY[(y<<3)+3] = 0xBF;
		DBUF_X86_COPY[(y<<3)+4] = (unsigned char)(offset & 0xFF);
		DBUF_X86_COPY[(y<<3)+5] = (unsigned char)(offset >> 8);

		/* REP MOVSW (DEBUG shows the F3 prefix as REPZ) */
		DBUF_X86_COPY[(y<<3)+6] = 0xF3;
		DBUF_X86_COPY[(y<<3)+7] = 0xA5;
	}

	DBUF_X86_COPY--;
	DBUF_X86_COPY[1521] = 0xCF;	/* IRET (0xCF) */

	oldvect60h = _dos_getvect(0x60);
	pint = (p_interrupt)MK_FP(segregs.ds, REP_STOSW_IRET_NOP);
	_dos_setvect(0x60, pint);

	oldvect61h = _dos_getvect(0x61);
	_dos_setvect(0x61, (p_interrupt)DBUF_X86_COPY);

	dbuf = _fmalloc(12160); /* 256 * 190 / 4 */
}

void clrdbuf(unsigned short pattern)
{
	union REGS regs;
	struct SREGS segregs;

	regs.x.ax = pattern;
	regs.x.cx = 6080; /* dbuf size / 2 */

	segread(&segregs);
	segregs.es = FP_SEG(dbuf); regs.x.di = FP_OFF(dbuf);

	int86x(0x60, &regs, &regs, &segregs);
}

void copydbuf(void)
{
	union REGS regs;
	struct SREGS segregs;

	segread(&segregs); segregs.es = 0xB800; regs.x.di = 0;
	segregs.ds = FP_SEG(dbuf); regs.x.si = FP_OFF(dbuf);

	int86x(0x61, &regs, &regs, &segregs);
}

void quitdbuf(void)
{
	_dos_setvect(0x60, oldvect60h);
	_dos_setvect(0x61, oldvect61h);
	oldvect60h = (p_interrupt)0;
	oldvect61h = (p_interrupt)0;

	_ffree(DBUF_X86_COPY);
	DBUF_X86_COPY = (unsigned char far*)0;

	_ffree(dbuf);
	dbuf = (unsigned char far*)0;
}

Woah! You might still say this is C code but it's calling into machine instructions I assembled in DEBUG and then reconstructed into a far array here. Most of the work is being done in initdbuf() where our interrupt is built in machine code. This is simpler than it sounds or looks.

First, we know there are 190 scan lines contiguously placed in the buffer. The CGA on the other hand is interlaced. And we don't want to copy all 80 bytes (320 pixels) of each CGA scan line, we're only copying 64 bytes (256 pixels). This means that we need code for each line to blast out that line's pixels. Each line requires 8 bytes of machine code which comes down to the instructions:

Code:

MOV CX, 32		; Set count to 32 words (64 bytes)
MOV DI, <offset>	; Set ES:DI to start of scanline
REP MOVSW		; Copy CX words from DS:SI to ES:DI

The only thing seemingly variable here is the offset put in DI, but even this is an immediate value that is written directly into the machine code. There's no address resolution and no memory locations are used besides what was set up in DS:SI and ES:DI.
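To make the effect of those per-line instructions concrete, here is a portable C sketch that simulates them with ordinary arrays instead of far pointers and real video memory. The names cga_line_offset and sim_copydbuf are made up for illustration; this is not the code that runs on the Tandy, only a model of what it does.

```c
#include <assert.h>
#include <string.h>

#define VIEW_LINES      190     /* scan lines in the double buffer */
#define VIEW_LINE_BYTES 64      /* 256 pixels at 2 bits each */
#define CGA_BYTES       16384   /* size of the CGA refresh buffer */

/* Byte offset of scan line y inside CGA memory: even lines start at
   0x0000, odd lines at 0x2000, and each line is 80 bytes wide. */
unsigned cga_line_offset(unsigned y)
{
    assert(y < 200);
    return (y / 2) * 80 + ((y % 2) ? 0x2000 : 0x0000);
}

/* Portable stand-in for the generated code. For every line, the
   immediate loaded by "MOV DI, <offset>" becomes the destination
   index, and "REP MOVSW" with CX = 32 becomes a 64-byte memcpy.
   The source pointer simply keeps advancing, as SI does. */
void sim_copydbuf(unsigned char *cga, const unsigned char *dbuf)
{
    unsigned y;

    for (y = 0; y < VIEW_LINES; y++) {
        memcpy(cga + cga_line_offset(y), dbuf, VIEW_LINE_BYTES);
        dbuf += VIEW_LINE_BYTES;
    }
}
```

Bytes 64 to 79 of every CGA line are never written, which is why the source can stay contiguous while the destination jumps around.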

This expands to 1522 bytes of machine instructions: 190 lines at 8 bytes each is 1520, plus 2 for the initial CLD (to make sure copying works forward rather than backward) and the final IRET to return from the interrupt.
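The byte count can be checked with a portable sketch of the generator that writes into a plain array with a running index. The name gen_copy_code and the GEN_* macros are hypothetical; the opcode bytes are the same ones used in initdbuf().

```c
#include <assert.h>
#include <stddef.h>

#define GEN_LINES     190
#define GEN_PER_LINE  8
#define GEN_CODE_SIZE (1 + GEN_LINES * GEN_PER_LINE + 1)  /* 1522 */

/* Fills code[] with CLD, then MOV CX / MOV DI / REP MOVSW for each
   scan line, then IRET. Returns the number of bytes written. */
size_t gen_copy_code(unsigned char code[GEN_CODE_SIZE])
{
    size_t n = 0;
    unsigned y;

    code[n++] = 0xFC;                       /* CLD */
    for (y = 0; y < GEN_LINES; y++) {
        unsigned offset = (y / 2) * 80 + ((y % 2) ? 0x2000 : 0);

        code[n++] = 0xB9;                   /* MOV CX, imm16 */
        code[n++] = 0x20;                   /* low byte: 32 words */
        code[n++] = 0x00;                   /* high byte */
        code[n++] = 0xBF;                   /* MOV DI, imm16 */
        code[n++] = (unsigned char)(offset & 0xFF);
        code[n++] = (unsigned char)(offset >> 8);
        code[n++] = 0xF3;                   /* REP prefix */
        code[n++] = 0xA5;                   /* MOVSW */
    }
    code[n++] = 0xCF;                       /* IRET */

    assert(n == GEN_CODE_SIZE);
    return n;
}
```

The immediates are stored low byte first, so the second line (offset 0x2000) is encoded as BF 00 20.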

Sure, there's the overhead of interrupts and the setup and teardown for C's interrupt-calling functions, but let's see how it fares against our previous numbers on my Tandy 1000 HX ...

View attachment 36263

The above image shows:

Clear Times:
0.) 35149 (35 milliseconds)
1.) 36601 (36 milliseconds)
2.) 36603 (36 milliseconds)
3.) 36600 (36 milliseconds)
Copy Times:
0.) 11792 (11 milliseconds)
1.) 11790 (11 milliseconds)
2.) 11789 (11 milliseconds)
3.) 11792 (11 milliseconds)

Yowza, that is blistering fast! Strangely, the clearing takes about three times as long as the copying, and it is only about 1.7 times as fast as it was in C (36 versus 60 milliseconds). The copying, however, is now roughly 24 times as fast at 11 milliseconds, which gives a combined rate (without drawing or other actions) of about 21 FPS. That's not bad at all considering I'm shooting for a much lower rate for Magenta's Maze (9 FPS would be stellar).

I have to say that after I did the first test filmed above, I reran it and the copy times now consistently come up as 18 milliseconds instead of 11. I'm not sure why, but there you have it. That would be a combined FPS of about 18; a few milliseconds sure make a difference!
