Caching in on the Z Divide

neilobremski · Feb 9, 2017

Projecting a 3D point onto a 2D screen is probably the most basic thing one has to do and one of the most frequent. The calculation is to multiply a coordinate by the Viewing Distance (VD) and then divide by Depth (Z). This is done for both X (horizontal) and Y (vertical) axes and thus occurs twice per point. Magenta's Maze encapsulates this into the macros project_x() and project_y() which both contain this core equation:

Code:

((int)( ((long)(p) << VD) / (z)))

The primary platform target is a Tandy 1000 HX which has a 8088-based CPU. This poor thing does not do well on multiplications or divides and even worse when they are 32-bit. Two problems: one is that although the 8088 has 16-bit instructions it can only pump 8-bits at a time and secondly it has no 32-bit instructions at all! This means that casting to a long actually calls some CRT method depending on the compiler you use.[SUP]1[/SUP]

There are 17 vertices used for each segment of the leaves mesh and there are 3 of these meshes per tree in Magenta's Maze. If there are 25 trees in the view then that's 1,275 vertices (17 x 3 x 25) and thus 2,550 projection calculations being done. That's quite a lot on trees alone!

Today I had an idea to cache this calculation which I usually call the Z Divide but it has a multiplication involved as well [SUP]2[/SUP]. I wondered if this would be faster at all since memory access on the 8088 is excruciatingly slow and this would be so much data, I'd need to put it in another segment via far pointer. There was only one way to find out: test it!

What I did was to allocate 64,512 bytes where each one is an unsigned projected 2D coordinate, giving me 0 - 255. In order to handle a maximum Z of 500 and a maximum X/Y of about 500 as well I had to truncate one of the two. I opted to truncate Z by dividing it by 4 and then putting it in the upper 7 bits of the index variable. Thus, accessing a single point in this cache is:

Code:

#define cache_point(p,z) pcache[(z&0x1FC)<<7+p] /* [z/4*512+p] */

Obviously the code to figure out the subscript position is extra cycles on top of even reading the data from the table so this alone had me worried. However, I pressed on and created a comparison between the raw calculation and the cached version:

Code:

printf("Timing Calculated Projections ... ");
start = microsec(TRUE);
for (i = -511; i <= 511; i++) {
	for (z = 1; z <= 500; z++) {
		int p = project_p(i,z);
	}
}
end = microsec(FALSE);
printf("%ld (%ld ms)\n", (end - start), (end - start) / 1000);

printf("Timing Cached Projections ....... ");
start = microsec(TRUE);
for (i = -511; i <= 511; i++) {
	for (z = 1; z <= 500; z++) {
		int p = (i < 0 ? -cache_point(-i,z) : cache_point(i,z));
	}
}
end = microsec(FALSE);
printf("%ld (%ld ms)\n", (end - start), (end - start) / 1000);

I compiled this in MSC 5.1 both with optimizations enabled and disabled, then I ran it on my Tandy 1000 HX:

View attachment 36056

This shows a 61% improvement, it ran the same calculations in 64.4 versus 103.7 seconds. I'm not sure if this is good enough to warrant using 64k of the heap, however, since the Tandy only has 256k total. Still it tells me that there is hope. My next attempt at this may be to figure out how to use some clever shifting and 16-bit IMUL and IDIV instructions then compare that to this.

Update 2/10, Int16 vs Caching:

After writing yesterday's entry, I realized the maximums I enforced on axis coordinates (-511 to +511) were such that unsigned 16-bit integers could be used. My viewing distance is 128 represented by a 7 bit shift left. Thus 511 * 128 = 65,408 (and 512 * 128 = 65,536). The problem would then be some comparison/branches to check for a negative axis coordinate, to say nothing of doing clipping which is not something I talked about.

Anyway, I created this macro and its associated timing block:

Code:

#define project_i(i,z) ((int)( (((unsigned)i) << VD) / (z) ) )

printf("Timing Int16 Projections ........ ");
start = microsec(TRUE);
for (i = -511; i <= 511; i++) {
	for (z = 1; z <= YON; z++) {
		int p = (i < 0 ? -project_i(-i,z) : project_i(i,z));
	}
}
end = microsec(FALSE);
printf("%ld (%ld ms)\n", (end - start), (end - start) / 1000);

I ran this on the Tandy 1000 HX, compiled with optimizations, and the results were:

Code:

Building Projection Cache ................................ OK!
Timing Calculated Projections ... 103600559 (103600 ms)
Timing Cached Projections ....... 64330068 (64330 ms)
Timing Int16 Projections ........ 30755434 (30755 ms)

That is a huge improvement over both of my old calculations. It was certainly fun to come up with a caching method but, as you can see, it was totally unnecessary!

Footnotes:

[SUP]1[/SUP] I used Microsoft C 5.10 which calls __aNldiv for dividing 32-bit integers.

[SUP]2[/SUP] There is no Aspect Ratio correction in Magenta's Maze; it would add an extra multiplication to one of the two axes and I didn't think it was worth it considering the resolution.

Caching in on the Z Divide

neilobremski

Experienced Member