Help with C/C++ inline assembly divide and shift (for fixed point math) on 16 bit / 8086 target.

radiance32 · May 8, 2024

Hi all,

I'm optimizing a mostly C program with some inline assembly for my fixed point math routines. (16 bit fixed point, 6.10 format)
I have hardly any experience with 8086 assembly programming, but I did get a working multiplication routine from a friend.
I have none for division.

Here's the code. The F_REAL type (short for Fixed Real number) is an alias for an int16.
The C functions work, and so does the assembly multiplication routine. Using it improves my speed from ~0.6 to ~0.75.
All good.

Code:

#define FIXED_POINT_FRACTIONAL_BITS 10

// Multiplication of two fixed-point numbers
//inline F_REAL F_REAL_MUL(F_REAL a, F_REAL b) {
//    return (F_REAL)(((long)a * b) >> FIXED_POINT_FRACTIONAL_BITS);
//}
inline F_REAL F_REAL_MUL(F_REAL a, F_REAL b);
#pragma aux F_REAL_MUL =   \
"imul bx" \
"mov cx, 10" \
"shrd_loop:" \
"    shr dx, 1" \
"    rcr ax, 1"  \
"loop shrd_loop" \
    parm [ax] [bx]       \
    modify [ax bx cx dx]   \
    value [ax];
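
For reference, here is a small portable C model of the 6.10 format and of what the multiply computes. It is only a sketch. The names `f_real_from_double`, `f_real_to_double` and `f_real_mul_ref` are made up for illustration, `int16_t`/`int32_t` stand in for Watcom's 16-bit `int` and 32-bit `long`, and the `>>` on a negative product assumes the compiler performs an arithmetic shift.

```c
#include <stdint.h>

#define FIXED_POINT_FRACTIONAL_BITS 10

typedef int16_t F_REAL;   /* 6.10 fixed point: value = raw / 1024 */

/* Hypothetical conversion helpers, not from the thread. */
static F_REAL f_real_from_double(double x) {
    return (F_REAL)(x * (1 << FIXED_POINT_FRACTIONAL_BITS));
}

static double f_real_to_double(F_REAL x) {
    return (double)x / (1 << FIXED_POINT_FRACTIONAL_BITS);
}

/* Same arithmetic as the commented-out C version of F_REAL_MUL. */
static F_REAL f_real_mul_ref(F_REAL a, F_REAL b) {
    int32_t product = (int32_t)a * b;   /* what IMUL BX leaves in DX:AX */
    /* Assumes >> on a negative value is an arithmetic shift. */
    return (F_REAL)(product >> FIXED_POINT_FRACTIONAL_BITS);
}
```

With these helpers, 1.5 (raw 1536) times 2.25 (raw 2304) gives raw 3456, which is 3.375.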

But, I also need the equivalent for my divisions,
and I've tried to modify the assembly language from the multiply routine above but it fails...
Can someone help ?

Here's the working division C function and my broken assembly language attempt:

Code:

// Division of two fixed-point numbers
// inline F_REAL F_REAL_DIV(F_REAL a, F_REAL b) {
//    return (F_REAL)(((long)a << FIXED_POINT_FRACTIONAL_BITS) / b);
//}
inline F_REAL F_REAL_DIV(F_REAL a, F_REAL b);
#pragma aux F_REAL_DIV =   \
"idiv bx" \
"mov cx, 10" \
"shld_loop:" \
"    shl dx, 1" \
"    rcl ax, 1"  \
"loop shld_loop" \
    parm [ax] [bx]       \
    modify [ax bx cx dx]   \
    value [ax];

I am using the Watcom C/C++ 1.9 compiler with an 8086 target and FPU emulation (I don't use many floats in my code since I switched to fixed point), and I'm testing on DOSBOX (8086 CPU, no FPU).
If someone can fix this, I would be very grateful.
I am a visual learner, so a small sample of code is better for me to understand than a discussion...

Also, if someone knows of any ways to optimize the multiplication (and division) functions even more (maybe with no loops), I'm all ears.

Thanks,
Terrence

bakemono · May 8, 2024

I'm not sure if you're counting on overflows to be handled a certain way, but I think your multiply routine would be faster to shift left by 6 and take the result from DX instead of shifting right 10 and taking the result from AX. Or just shift 2 bits and then shuffle your 8-bit registers.
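
A minimal C model of the first suggestion, for illustration only. It treats DX:AX as one unsigned 32-bit value and checks that "shift left 6, read DX" selects the same 16 bits (bits 10 to 25 of the product) as "shift right 10, read AX". The function names are hypothetical, and the final cast to `int16_t` assumes the usual two's-complement wraparound.

```c
#include <stdint.h>

/* Loop version: shift DX:AX right by 10, take the result from AX. */
static int16_t mul_shift_right_10(int16_t a, int16_t b) {
    uint32_t dxax = (uint32_t)((int32_t)a * b);   /* DX:AX after IMUL */
    return (int16_t)(uint16_t)(dxax >> 10);       /* low 16 bits = AX */
}

/* Suggested variant: shift DX:AX left by 6, take the result from DX. */
static int16_t mul_shift_left_6(int16_t a, int16_t b) {
    uint32_t dxax = (uint32_t)((int32_t)a * b);
    dxax <<= 6;                                   /* six SHL AX,1 / RCL DX,1 pairs */
    return (int16_t)(uint16_t)(dxax >> 16);       /* high 16 bits = DX */
}
```

Both functions return the same value for every input, so the only difference on the CPU is six shift iterations instead of ten.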

For the divide, you'll want to sign-extend into DX:AX, shift left by 10, and then do the IDIV.

Code:

cwd
mov cx,10
shld_loop:
shl ax,1
rcl dx,1
loop shld_loop
idiv bx

Again, there is an opportunity to reduce loop iterations by shuffling bytes instead.
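
To see why the order matters, here is a hedged C model of the divide. The broken attempt ran `idiv bx` first, with DX never loaded (only AX and BX are passed in), and shifted afterwards. The second function below is a simplified stand-in for that "divide first, scale later" order, assuming DX had at least been sign-extended. All function names are hypothetical. The safety check is conservative and reflects that IDIV raises a divide error when the divisor is zero or the quotient does not fit in 16 bits.

```c
#include <stdint.h>

#define FIXED_POINT_FRACTIONAL_BITS 10

/* Order used in the fix: CWD, shift DX:AX left by 10, then IDIV BX. */
static int16_t f_real_div_model(int16_t a, int16_t b) {
    int32_t dividend = (int32_t)a * ((int32_t)1 << FIXED_POINT_FRACTIONAL_BITS);
    return (int16_t)(dividend / b);   /* truncates toward zero, like IDIV */
}

/* Simplified model of dividing first and scaling afterwards. */
static int16_t f_real_div_wrong_order(int16_t a, int16_t b) {
    int32_t quotient = (int32_t)a / b;   /* the fraction is already lost here */
    return (int16_t)(quotient * ((int32_t)1 << FIXED_POINT_FRACTIONAL_BITS));
}

/* Returns 1 if the scaled quotient fits, 0 if IDIV would fault.
   Conservative: the most negative 16-bit quotient is treated as unsafe. */
static int f_real_div_is_safe(int16_t a, int16_t b) {
    int32_t q;
    if (b == 0) return 0;
    q = ((int32_t)a * ((int32_t)1 << FIXED_POINT_FRACTIONAL_BITS)) / b;
    return q >= -32767 && q <= 32767;
}
```

For 3.0 / 2.0 (raw 3072 / 2048) the correct order gives raw 1536 (1.5), while dividing first gives raw 1024 (1.0).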

radiance32 · May 9, 2024

Hi bakemono,

First of all big thanks for your response,
I will try it out shortly...

Could you write a small sample assembly routine that does what you describe for the multiply routine (shift left and take the result from DX)?
The fixed multiply is used throughout my app MANY times, so optimizing it further, even a bit, is going to make a big difference...

I'd love to see an example of your idea... if you have the time for it...

Cheers,
Terrence

bakemono · May 9, 2024

how about this

Code:

imul bx
shr dx,1
rcr ax,1
shr dx,1
rcr ax,1
mov al,ah
mov ah,dl
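
The sequence above can be checked with a small C emulation of the registers. This is only a sketch: `mul_byte_shuffle` is a hypothetical name, DX:AX is modelled as one unsigned 32-bit value, and the final cast assumes two's-complement wraparound. Two single-bit shifts plus the byte moves add up to a right shift by 10.

```c
#include <stdint.h>

/* Emulates: IMUL BX; (SHR DX,1; RCR AX,1) twice; MOV AL,AH; MOV AH,DL */
static int16_t mul_byte_shuffle(int16_t a, int16_t b) {
    uint32_t dxax = (uint32_t)((int32_t)a * b);   /* DX:AX after IMUL */
    uint16_t dx, ax;
    uint8_t al, ah, dl;

    dxax >>= 2;                      /* the two SHR/RCR pairs */
    dx = (uint16_t)(dxax >> 16);
    ax = (uint16_t)dxax;
    ah = (uint8_t)(ax >> 8);
    dl = (uint8_t)dx;

    al = ah;                         /* MOV AL,AH: shift right by 8 ... */
    ah = dl;                         /* MOV AH,DL: ... pulling in DX's low byte */
    return (int16_t)(uint16_t)(((uint16_t)ah << 8) | al);
}
```

The result holds bits 10 to 25 of the product, the same bits the ten-iteration loop leaves in AX.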

radiance32 · May 9, 2024

Hi,

Big thanks, that worked perfectly

I replaced the multiply function with your code:

Code:

inline static F_REAL F_REAL_MUL(F_REAL a, F_REAL b);
#pragma aux F_REAL_MUL =   \
"imul bx"       \
"shr dx, 1"     \
"rcr ax, 1"     \
"shr dx, 1"     \
"rcr ax, 1"     \
"mov al, ah"    \
"mov ah, dl"    \
    parm [ax] [bx]       \
    modify [ax bx dx al ah dl]   \
    value [ax];

And since I am using a lot of multiplications (lots of vec3's and matrix4x4's), I got a speedup from 0.4334 FPS to 0.5058 FPS.

That's a gain of ~0.0724 FPS, or about 17% faster: the frame time drops from roughly 2.31 to 1.98 seconds, saving about 0.33 seconds per frame.

Considering I'm rendering an interactive wireframe 3D model of a little over 4000 triangles on an 80186 CPU with no FPU, that's a great result.

How would this change affect the division ? Can it be applied for the division function too ?

I have another question,
do you have any experience doing matrix multiplications in assembly language ?
I think that they can be made a lot faster if someone with a lot of assembly language skill would do them by trying to use all the available registers...
What's your idea on this ?

Thanks again!, I really appreciate this...
Terrence

radiance32 · May 11, 2024

bakemono said:
how about this

Code:

imul bx
shr dx,1
rcr ax,1
shr dx,1
rcr ax,1
mov al,ah
mov ah,dl

Hi,

Do you think you could help me with writing some 8086-compatible assembly code to do a matrix multiply of a 3D vector?
I could probably stitch one together from the multiply code you gave me, but that wouldn't make much sense.
I was hoping someone could help me write a matrix multiply function in 8086 assembly that uses all the available registers in the CPU to do the job faster...
Here's what I'm currently using in my C/C++ code:

Code:

                vec3d v;
                v.x = F_REAL_MUL(i.x, m.m[0][0]) + F_REAL_MUL(i.y, m.m[1][0]) + F_REAL_MUL(i.z, m.m[2][0]) + F_REAL_MUL(i.w, m.m[3][0]);
                v.y = F_REAL_MUL(i.x, m.m[0][1]) + F_REAL_MUL(i.y, m.m[1][1]) + F_REAL_MUL(i.z, m.m[2][1]) + F_REAL_MUL(i.w, m.m[3][1]);
                v.z = F_REAL_MUL(i.x, m.m[0][2]) + F_REAL_MUL(i.y, m.m[1][2]) + F_REAL_MUL(i.z, m.m[2][2]) + F_REAL_MUL(i.w, m.m[3][2]);

Note that the F_REAL_MUL(a, b) function is the assembly code you wrote earlier in this thread (i.e. the one without the loop in it)...
Also, I'm using 16-bit signed int 6.10 fixed point values for every variable, so the CPU does a 16-bit x 16-bit multiplication with a 32-bit result in every F_REAL_MUL(a, b), but you probably already know that.

Any code, help or info/tips would be greatly appreciated!

Cheers,
Terrence

bakemono · May 11, 2024

for the divide you can try this

Code:

cwd
shl ax,1
rcl dx,1
shl ax,1
rcl dx,1
mov dh,dl
mov dl,ah
mov ah,al
mov al,0
idiv bx

Might be just as good if you leave out the mov al,0 for that matter.

In your matrix multiply you have a series of multiply-accumulate operations. It would be more efficient to dump the F_REAL_MUL in this case, do normal 16bitx16bit=32bit multiplies, and then only after you have the 32-bit sum do the right-shift to get the final result.
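
As a rough C illustration of that idea (hypothetical function names, with `int16_t`/`int32_t` standing in for Watcom's `int`/`long`), compare shifting every product with accumulating the full 32-bit products and shifting once. The second form does one shift per output component instead of three and also truncates only once. It assumes the 32-bit sum does not overflow and that `>>` on a negative sum is an arithmetic shift.

```c
#include <stdint.h>

#define FIXED_POINT_FRACTIONAL_BITS 10

/* Per-term version: three shifts, three truncations. */
static int16_t dot3_shift_each(const int16_t v[3], const int16_t col[3]) {
    int16_t sum = 0;
    int k;
    for (k = 0; k < 3; ++k)
        sum = (int16_t)(sum + (int16_t)(((int32_t)v[k] * col[k])
                                        >> FIXED_POINT_FRACTIONAL_BITS));
    return sum;
}

/* Accumulate-then-shift version: one shift, one truncation. */
static int16_t dot3_shift_once(const int16_t v[3], const int16_t col[3]) {
    int32_t acc = 0;
    int k;
    for (k = 0; k < 3; ++k)
        acc += (int32_t)v[k] * col[k];   /* full 32-bit products */
    return (int16_t)(acc >> FIXED_POINT_FRACTIONAL_BITS);
}
```

With v = {512, 512, 512} and col = {1, 1, 1}, each product is 512, which the per-term version truncates to 0 three times. The accumulated version sums to 1536 and returns 1.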

radiance32 · May 11, 2024

bakemono said:
for the divide you can try this

Code:

cwd
shl ax,1
rcl dx,1
shl ax,1
rcl dx,1
mov dh,dl
mov dl,ah
mov ah,al
mov al,0
idiv bx

Might be just as good if you leave out the mov al,0 for that matter.

In your matrix multiply you have a series of multiply-accumulate operations. It would be more efficient to dump the F_REAL_MUL in this case, do normal 16bitx16bit=32bit multiplies, and then only after you have the 32-bit sum do the right-shift to get the final result.

Okay,

Sounds good. But remember I've NO experience at all with 8086 assembly...
For the assembly language matrix multiplication: I've no idea where to store the temporary result when adding the multiplies together... And NO idea how to load the values into where they need to go...
Can you give me a quick example ? I'm a visual learner, I need to see some code to make sense of it.

Thanks again, your help is invaluable...
Terrence

radiance32 · May 11, 2024

bakemono said:
for the divide you can try this

Code:

cwd
shl ax,1
rcl dx,1
shl ax,1
rcl dx,1
mov dh,dl
mov dl,ah
mov ah,al
mov al,0
idiv bx

Might be just as good if you leave out the mov al,0 for that matter.

In your matrix multiply you have a series of multiply-accumulate operations. It would be more efficient to dump the F_REAL_MUL in this case, do normal 16bitx16bit=32bit multiplies, and then only after you have the 32-bit sum do the right-shift to get the final result.

I wrote your idea into my C matrix multiply function and it doesn't work: I get a black screen instead of my wireframe model.

LOL

Code:

        vec3d Matrix_MultiplyVector3(mat4x4& m, vec3d& i)
        {
                vec3d v;
                //v.x = F_REAL_MUL(i.x, m.m[0][0]) + F_REAL_MUL(i.y, m.m[1][0]) + F_REAL_MUL(i.z, m.m[2][0]) + F_REAL_MUL(i.w, m.m[3][0]);
                //v.y = F_REAL_MUL(i.x, m.m[0][1]) + F_REAL_MUL(i.y, m.m[1][1]) + F_REAL_MUL(i.z, m.m[2][1]) + F_REAL_MUL(i.w, m.m[3][1]);
                //v.z = F_REAL_MUL(i.x, m.m[0][2]) + F_REAL_MUL(i.y, m.m[1][2]) + F_REAL_MUL(i.z, m.m[2][2]) + F_REAL_MUL(i.w, m.m[3][2]);

                v.x = (F_REAL) ( ((long i.x) * m.m[0][0] + (long) i.y * m.m[1][0] + (long) i.z * m.m[2][0] * (long) i.w, m.m[3][0]) >> FIXED_POINT_FRACTIONAL_BITS );
                v.y = (F_REAL) ( ((long i.x) * m.m[0][1] + (long) i.y * m.m[1][1] + (long) i.z * m.m[2][1] * (long) i.w, m.m[3][1]) >> FIXED_POINT_FRACTIONAL_BITS );
                v.z = (F_REAL) ( ((long i.x) * m.m[0][2] + (long) i.y * m.m[1][2] + (long) i.z * m.m[2][2] * (long) i.w, m.m[3][2]) >> FIXED_POINT_FRACTIONAL_BITS );
                return v;
        }
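
Two details in the three new lines above stand out. As posted, `(long i.x)` has the closing parenthesis in the wrong place and would not compile. And the last term reads `* (long) i.w, m.m[3][0]`, where the comma is C's comma operator: the whole sum is evaluated and discarded, and the expression yields just `m.m[3][0]`, which is then shifted right by 10. That alone could plausibly explain a blank screen. A possible corrected sketch follows, written as plain C with pointers instead of references. The `vec3d` and `mat4x4` layouts are assumptions (the thread does not show them), as is passing `w` through unchanged.

```c
#include <stdint.h>

#define FIXED_POINT_FRACTIONAL_BITS 10

typedef int16_t F_REAL;

/* Hypothetical layouts; the real definitions are not shown in the thread. */
typedef struct { F_REAL x, y, z, w; } vec3d;
typedef struct { F_REAL m[4][4]; } mat4x4;

/* One output component: 32-bit sum of four products, shifted once. */
static F_REAL transform_component(const mat4x4 *m, const vec3d *i, int col) {
    int32_t acc = (int32_t)i->x * m->m[0][col]
                + (int32_t)i->y * m->m[1][col]
                + (int32_t)i->z * m->m[2][col]
                + (int32_t)i->w * m->m[3][col];
    /* Assumes >> on a negative value is an arithmetic shift. */
    return (F_REAL)(acc >> FIXED_POINT_FRACTIONAL_BITS);
}

/* Pointer-based stand-in for Matrix_MultiplyVector3. */
static vec3d matrix_multiply_vector3(const mat4x4 *m, const vec3d *i) {
    vec3d v;
    v.x = transform_component(m, i, 0);
    v.y = transform_component(m, i, 1);
    v.z = transform_component(m, i, 2);
    v.w = i->w;   /* assumption: w is passed through unchanged */
    return v;
}
```

With an identity matrix (1024 on the diagonal) the vector should come back unchanged, which makes a quick sanity check before trying the real model.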

FYI, FIXED_POINT_FRACTIONAL_BITS = 10 and here's my C multiply function (although I'm using your assembly version in practice):

Code:

// Multiplication of two fixed-point numbers
F_REAL F_REAL_MUL(F_REAL a, F_REAL b) {
    return (F_REAL)(((long)a * b) >> FIXED_POINT_FRACTIONAL_BITS);
}

I tried with and without the (long) casts for the rightmost 3 multiplies, leaving only i.x as cast to (long) but nothing works...

Cheers,
Terrence

bakemono · May 11, 2024

Looks like the right idea, but I'm not a C programmer so someone else might have to weigh in. Is >> a signed shift right in C? (IIRC you have to use >>> in Verilog at least)

radiance32 · May 12, 2024

bakemono said:
Looks like the right idea, but I'm not a C programmer so someone else might have to weigh in. Is >> a signed shift right in C? (IIRC you have to use >>> in Verilog at least)

Hi,

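Strictly speaking, right-shifting a negative signed value is implementation-defined in C, but as far as I know Open Watcom (like most compilers) implements it as an arithmetic right shift, which performs sign-extension.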
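
If relying on the compiler's behaviour for `>>` on negative values is a concern, a sign-extending shift can be written so that only non-negative values are ever shifted. This is a sketch with a hypothetical helper name, `asr32`, using `int32_t` in place of Watcom's 32-bit `long`.

```c
#include <stdint.h>

/* Arithmetic (sign-extending) right shift that does not depend on how the
   compiler treats >> on negative values. Valid for 0 <= n <= 31. */
static int32_t asr32(int32_t x, unsigned n) {
    /* For negative x, ~x is non-negative, so shifting it is fully defined. */
    return x < 0 ? ~(~x >> n) : x >> n;
}
```

Note that this rounds toward negative infinity, so `asr32(-1025, 10)` is -2, whereas the division `-1025 / 1024` truncates toward zero and gives -1.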

Would you be able to write some example 8086 assembly code that I can use to speed this up?

Terrence

radiance32 · May 12, 2024

In fact, all I need for my final transforms is:

Code:

                v.x = F_REAL_MUL(i.x, m.m[0][0]) + F_REAL_MUL(i.y, m.m[1][0]) + F_REAL_MUL(i.z, m.m[2][0]);
                v.y = F_REAL_MUL(i.x, m.m[0][1]) + F_REAL_MUL(i.y, m.m[1][1]) + F_REAL_MUL(i.z, m.m[2][1]);
                v.z = F_REAL_MUL(i.x, m.m[0][2]) + F_REAL_MUL(i.y, m.m[1][2]) + F_REAL_MUL(i.z, m.m[2][2]);

Cheers,
Terrence

bakemono · May 12, 2024

What does the code generated by the compiler look like?

radiance32 · Monday at 12:49 AM

bakemono said:
What does the code generated by the compiler look like?

Hey,

I've no idea. I'm new to developing for 16 bit DOS (Before I only used win32/64 development with modern tools).

I'm using OpenWatcom 1.9 as my IDE and compiler and I test/run in DOSBOX, so I'm basically cross-compiling...

I've had a look around in OpenWatcom, but the debugger doesn't work (as my compiled binaries don't run on a modern OS),
I can only run them in DOSBOX...

Cheers,
Terrence
