
Bitmap and playfield encoding -- what I've been working on.

deathshadow

Just thought I'd share what I've been working on.

For a while now I've been working on speeding up my 160x100 game engine -- the first release that's going to make use of it is my upcoming "it's done when it's done" Paku Paku 2.0 -- pretty much a total rewrite of the game from scratch.

I learned a lot writing version 1.x, as well as with my coded sprites demo that came after -- writing a version for the C64 also made me "rethink my ink" and change out a lot of methodology... but more than anything, working with an actual unexpanded PCJr has made me realize I need to squeeze even more out of the engine.

The first step was switching from TP's inline assembler to using NASM. Turbo Pascal adds some overhead to the inline assembler that we don't always need, and makes its own changes to what you code as it thinks appropriate -- using a "real" assembler lets me micro-manage what's really being done. (like do I really need to dick with BP and SP when I'm not passing parameters?) Laughably that alone seems to shave anywhere from 16 to 128 bytes off EVERY function... Which surprised me a lot.

One of the big things I've found in testing is that reading Jr memory is slower than writing -- noticeably so. This means that an RLE type compression of data can pay off bigtime.

Which also works with my other goal for Paku 2.0 and the three other games I'm working on... that being getting the memory footprint below 64k. I figure if I can have a 40k memory footprint on a C64 version, there's no excuse for the current 93k version 1.6 uses.

The existing engine for Paku Paku renders 4x3 tile sprites to the screen from a 28x31 list of tiles. A copy of that same 28x31 tile list is used for the gameplay map to determine if pellets have been eaten or not, and where walls are. This means that at the start of the game, AND when I'm trying to render the 'greyscale' flash at the end of the game, it has to go through all 868 tile locations to render the appropriate 4x3 tile (actually 3x3 with a shift). It also means each tile has TWO copies, since the data is 3 pixels wide and shifting by 4 in realtime is rubbish.

The first major change I've made is to create a separate copy of the playfield as it appears in the backbuffer from the tile sheet. This means I can blit the bitmap to the backbuffer or screen faster -- the latter being ideal for the "blink" between blue and greyscale at the end of each level. The intent being that I can encode a 'master copy' of the play layout and the bitmap version, and simplify the playfield table.

My first attempt at encoding was a bit overboard, attempting LZW, LZH and LZ4 -- and in all cases finding the decoder to be bigger than the reduction in file size. Worse, the implementation for most such progressive encoders involves going back and copying from what's already been written, completely negating any possible "speed" you could gain from the reduction in size. -- So much for those.

Next I tried a simple two-byte RLE... first byte being 'of what', second byte being 'how many' -- stored in that order so "lodsw; mov cl, ah; rep stosb;" works. It works well, but the increase in reads and number of unique values meant that speed-wise and size-wise I figured I could do better.
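Roughly, that decode loop looks like the following sketch -- the pair count in DX is a hypothetical terminator, since how the real stream ends isn't shown here:
Code:
	xor  cx, cx          ; keep CH clear so CL is the whole run count
.pair:
	lodsw                ; al = value, ah = run length
	mov  cl, ah
	rep  stosb           ; write AL to es:di, CL times
	dec  dx              ; hypothetical pair counter
	jnz  .pair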

So I went in and started playing with multi-byte RLE. In multi-byte you use a bit or two as a trigger to say "this is a repeat" vs. "this is just data" -- the problem was that the data in question mostly uses all the bits, too much for that to really offer any savings... unless I did a lookup table.

AHA. If I use a lookup table I can have TWO tables, one for the blue version, one for the greyscale... Which started me looking at the data itself.

Really the bitmap for screen/buffer only has four pixel values if I don't render the pellets... if I make the pellets separate (another EUREKA! moment) that only has TWO values and results in longer runs of empty pixels.

So basically for pixel data we have:
L = low intensity border (dark blue or dark grey),
H = high intensity border (light blue or white),
M = middle intensity border (exit for ghost jail, magenta or grey),
0 = empty

Since that's two pixels per byte, the possible combinations become:
LL, HH, 00
0L, L0, 0H, H0,
LH, HL,
LM, MM, ML

Looking deeper, I realized that only three states actually need 'run lengths' -- LL, HH and 00 -- 90%+ of the data lines up on those. If I use the top two bits as the triggers I can use the bottom 6 bits for how many to repeat or the desired value. The LM, MM and ML always appear in the order LM MM MM ML, so I could reduce that four byte run to a single byte.

The backbuffer is also a different width from the screen and I want the ability to decode to both -- solution there? Add an 'end of line' character, and if we're gonna have EOL, might as well put in "end of data" so we don't have to run a counter. Also for the repeat of 0, I chose to skip writing to the screen and just increment the buffer offset, speeding it up even more -- for setting up gameplay we can just do a buffer clear first, and for writing the 'blink' we don't want to modify the 'blank' parts anyways.

; Binary
; x = count
; . = don't care
01xx xxxx == repeat 00
10xx xxxx == repeat LL
11xx xxxx == repeat HH
00.. .000 == 0L
00.. .001 == L0
00.. .010 == 0H
00.. .011 == H0
00.. .100 == LH
00.. .101 == HL
00.. .110 == LM MM MM ML
00.. .111 == EOL
else == EOD

This encoding is pretty efficient for the amount of compression... but checking the top two bits with multiple nested TEST and CMP wasn't exactly great from a decoder speed point of view... having to "AND" off the extra bits certainly wasn't helping -- if only there was a way to test bits AND do the compare in one operation without corrupting the count...

I sat there thinking about it -- how can I extract the state more efficiently? Then it hit me: flip it around. Move the state to the bottom bit... then I can do a shift and jump on carry, with the drop-through on the conditional jump being the most likely of values, in order. A few extra cycles for HH or the single bytes doesn't matter if the long runs of 'nothing' and LL can have less overhead.

xxxx xxx0 == repeat 00
xxxx xx01 == repeat LL
xxxx x011 == repeat HH
0000 0111 == 0L
0001 0111 == L0
0010 0111 == 0H
0011 0111 == H0
0100 0111 == LH
0101 0111 == HL
...0 1111 == EOL
..01 1111 == LM MM MM ML
..11 1111 == EOD

So for example: (NASM code)
Code:
normalMap:
	db  0x01, 0x10, 0x09, 0x90, 0x19, 0x91, 0x11, 0x99, 0x15, 0x55, 0x51
whiteMap:
	db  0x08, 0x80, 0x0F, 0xF0, 0x8F, 0xF8, 0x88, 0xFF, 0x87, 0x77, 0x78

With this as the decoder:
Code:
; ASSUMES:
;   si points to encoded map
;   di points to backBuffer
;   bx points to normalMap or whiteMap

	xor  ax, ax
	xor  cx, cx
.nextByte:
	lodsb
	shr  al, 1
	jc   .xxxx_xx?1
; xxxx_xxx0 skip
	add  di, ax
	jmp  .nextByte
.xxxx_xx?1:
	shr  al, 1
	jc   .xxxx_x?11
; xxxx_xx01 repeat LL
	mov  cl, al
	mov  al, [bx + 6]
	rep  stosb
	jmp  .nextByte
.xxxx_x?11:
	shr  al, 1
	jc   .xxxx_?111
; xxxx_x011 repeat HH
	mov  cl, al
	mov  al, [bx + 7]
	rep  stosb
	jmp  .nextByte
.xxxx_?111:
	shr  al, 1
	jc   .xxx?_1111
; xxxx_0111 table lookup one byte
	xlat
	stosb
	jmp  .nextByte
.xxx?_1111:
	shr  al, 1
	jc   .xx?1_1111
; xxx0_1111 EOL
	add  di, 6 ; data is 42 bytes wide, backBuffer is 48
	jmp  .nextByte
.xx?1_1111:
	shr  al, 1
	jc   .xx11_1111
; xx01_1111 LM MM MM ML
	mov  al, [bx + 8]
	stosb
	mov  al, [bx + 9]
	stosb
	stosb
	mov  al, [bx + 10]
	stosb
	jmp  .nextByte
.xx11_1111: ; EOD
	retf

ROCK AND ROLL. Delivers around 3.5:1 compression, specific to this data -- AND it blits fast enough to be acceptable on the Jr.

Once encoded, it became apparent with the EOLs in place how often the map repeats itself. With the above decoder actually working out faster than I need, I started thinking on how to reduce the redundancies in the data... finally I just said "fine, we'll use a lookup table for each line"

Code:
segment CONST

mapLineList:
	dw  mapLineData00, mapLineData01, mapLineData02, mapLineData02
	dw  mapLineData02, mapLineData02, mapLineData02, mapLineData03
	dw  mapLineData04, mapLineData04, mapLineData04, mapLineData04
; etc, etc, etc for 93 lines total
	dw  0 ; end of scanlines

The line data being something like this:
Code:
mapLineData00:
	db  0x02, 0xFB, 0x4B, 0x02, 0x0F
mapLineData01:
	db  0x27, 0x4D, 0x04, 0x4D, 0x37, 0x0F
mapLineData02:
	db  0x57, 0x26, 0x1F, 0x26, 0x47, 0x0F
mapLineData03:
	db  0x57, 0x06, 0x11, 0x06, 0x07, 0x15, 0x06, 0x1F
	db  0x06, 0x15, 0x17, 0x06, 0x11, 0x06, 0x47, 0x0F
mapLineData04:
	db  0x57, 0x04, 0x07, 0x08, 0x17, 0x04, 0x17, 0x0A
	db  0x17, 0x04, 0x1F, 0x04, 0x07, 0x0A, 0x07, 0x04
	db  0x07, 0x08, 0x17, 0x04, 0x47, 0x0F
... and so forth.

I make the last value in mapLineList 0 as an easy out.

Changing the above decoder to handle this was simple.
Code:
; ASSUMES:
;   ds:si points to mapLineList
;   es:di points to backBuffer
;   bx points to normalMap or whiteMap

	xor  cx, cx
.nextLine:
	lodsw
	or   ax, ax
	jz   .done
	mov  dx, si ; since we don't "need" dx, use it instead of a push
	mov  si, ax
	xor  ax, ax
.nextByte:
	lodsb
	shr  al, 1
	jc   .xxxx_xx?1
; xxxx_xxx0 skip
	add  di, ax
	jmp  .nextByte
.xxxx_xx?1:
	shr  al, 1
	jc   .xxxx_x?11
; xxxx_xx01 repeat LL
	mov  cl, al
	mov  al, [bx + 6]
	rep  stosb
	jmp  .nextByte
.xxxx_x?11:
	shr  al, 1
	jc   .xxxx_?111
; xxxx_x011 repeat HH
	mov  cl, al
	mov  al, [bx + 7]
	rep  stosb
	jmp  .nextByte
.xxxx_?111:
	shr  al, 1
	jc   .xxx?_1111
; xxxx_0111 table lookup one byte
	xlat
	stosb
	jmp  .nextByte
.xxx?_1111:
	shr  al, 1
	jc   .xxx1_1111
; xxx0_1111 EOL
	add  di, 6 ; data is 42 bytes wide, backBuffer is 48
	mov  si, dx
	jmp  .nextLine
.xxx1_1111: ; LM MM MM ML
	mov  al, [bx + 8]
	stosb
	mov  al, [bx + 9]
	stosb
	stosb
	mov  al, [bx + 10]
	stosb
	jmp  .nextByte
.done:
	retf

Didn't impact performance significantly, but reduced the data size another ~40%... net result is basically taking 3,822 bytes of data and reducing it to 632 bytes. UAH... though I'm still arguing with myself over push/pop on SI and using DX to store HHLL for the repeat blits. I'm not sure that handful of [bx + #] calculations on those two runs is worth the effort.

Converting it to write to screen is as simple as adding an "inc di" after each stosb, unrolling the REP into a LOOP of the same, and increasing the EOL addition to 76 (since the screen is 160 bytes wide).
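For example, the "repeat LL" handler ends up something like this -- a sketch assuming the same register conventions as the decoder above:
Code:
.xxxx_xx01:
	mov  cl, al
	mov  al, [bx + 6]
.blit:
	stosb
	inc  di              ; skip the interleaved screen byte
	loop .blit
	jmp  .nextByte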

I also went a little further and added more 'unique' multi-byte runs to the xxxx1111 list... specifically there are a LOT of cases where L00L or 0LL0 show up and reducing those to a single byte shaved a few more bytes off it.

I did try adding another shift to allow for stosw instead of stosb, but the runs aren't long enough for that extra code and overhead to actually pay off where it's needed. A lot of optimizations I'd make on larger data sets seem to be working out like that.

Using similar methods to draw the pellets and build the map used by the actual game engine for collisions has reduced the memory footprint of the game around 8k and made it run far, far faster.

For example the pellet encoding is even simpler:

xxxx xxx0 == skip x
xxxx xx01 == repeat 0x70 0x07 (0x0770) skip 1
.... .011 == 0x07 skip 1
.... 0111 == 0x70

I render a normal pellet at the super-pellet locations, since those have their buffer offset and screen offsets hardcoded due to their blink animations. Storing 0x0770 in DX speeds this up a bit (so I'm not using immediates inside the loop), though really there's no reason to speed up how fast the pellets are drawn since they're only done at the start of each level.
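For illustration, a sketch of that pellet-pair handler with DX pre-loaded as described; the label names and register conventions are assumptions carried over from the playfield decoder:
Code:
.pelletPair:
	mov  cl, al          ; run count from the shifted byte
	xchg ax, dx          ; al = 0x70, ah = 0x07; old AX parked in DX
.pair:
	stosw                ; writes 0x70 then 0x07, low byte first
	inc  di              ; skip 1
	loop .pair
	xchg ax, dx          ; restore AX (AH back to zero for the skip path)
	jmp  .nextByte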

The playmap encoding is similar as well.
xxxx xxx0 == wall x
xxxx xx01 == pellets x
xxxx x011 == empty x
.... 0111 == super pellet
.... 1111 == ghost wall (2 bytes)

Even simpler. The values written:

empty 0x00
pellets 0x01
super pellet 0x03
wall 0x80
ghost wall 0xC0

Lets me quickly test for walls by checking the sign flag, or pellets with bit 0, using a simple "and al, 0x83" followed by jz "empty" and js "wall" -- the drop-through being "eat a pellet", which then also drops through to "empty" to allow navigation.
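In code form that's just this -- a sketch, with playMap and the labels as hypothetical names:
Code:
	mov  al, [playMap + bx]  ; cell the player is moving into
	and  al, 0x83            ; keep wall bit 7 plus pellet bits 1:0
	jz   .empty              ; 0x00 -> nothing there, free to move
	js   .wall               ; 0x80 / 0xC0 -> wall, movement blocked
	; drop-through: 0x01 or 0x03 -- eat the pellet, then fall
	; through into .empty to allow navigation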

Much like with the pellets, the live gameplay map only needs to be written at the start of each level, so optimization efforts went more towards code and data size than they did speed... even though laughably this is far faster than what paku 1.x is doing.

Though with the unexpanded 128k Jr. as my current target minimum, smaller code size most of the time does mean faster, even when it's slower run on a real PC. I'm hoping these changes will be enough to let the unmodified low-ram Junior run the game as well as a real PC from the same codebase. If it doesn't reach that goal I'm probably going to have to make a Junior-specific version using Trixter's Jr. specific mode, since the linear screen buffer would reduce the overhead of blitting a LOT... and I could use page flipping instead of a manual backbuffer. HOPING I won't have to do that though, as I'd like 2.0 to be self contained.

I'm also playing with my joystick read routine to make it a bit better on the Jr. Turns out the interrupt problem isn't entirely what's wrong with how I was doing it; fixing it was just alleviating problems in the 'dead zone' common when you have a low return value as 'center'. Pretty much, if the center is less than 10, I need to make sure the dead zone is only 3 wide.

Code:
; procedure stickUpdate;
pProcedure stickUpdate
	mov  dx, 0x201
	mov  cx, stickLimit
	xor  ax, ax
	mov  bx, ax
	mov  di, ax
	mov  si, ax
	mov  ah, [stickMask]
	push bp
	mov  bp, bx
	cli
	; using SP for zero inside the loop increases speed ~20 clocks per loop
	mov  es, sp ; since we can't use the stack, we put sp into es
	mov  sp, bx
	out  dx, al
.loop:
	in   al, dx
	and  al, ah
	ror  al, 1
	adc  bx, sp ; sp = 0 here, so each adc adds only the carry --
	ror  al, 1  ; counting loop passes until each axis bit drops low
	adc  di, sp
	ror  al, 1
	adc  si, sp
	ror  al, 1
	adc  bp, sp
	or   al, al
	loopnz .loop
	mov  sp, es
	sti
	mov  [stick0x], bx
	mov  [stick0y], di
	mov  [stick1x], si
	mov  [stick1y], bp
	pop  bp
	retf

stickMask is determined during startup as to which 4 axes to test. Doing an 'and' of its value against what's read from the port masks off ports that aren't connected. Side effect of the detection code I use for this is that when machines are 'too fast' for this stick reader the joystick is disabled. (that's actually a good thing!)

In any case, just thought I'd share what I've been up to and get it down somewhere. Sometimes just writing it down helps with debugging and thinking on new ways of handling things.

Also, never hurts to have another set of eyes on things. Any suggestions are more than welcome.
 
Just thought I'd share what I've been working on.
Thank you for taking the time to write it all down. It's a monster post but I enjoyed every second reading it. As for suggestions, I don't have much to contribute except for this little idea:

Replace this:
Code:
	mov  dx, si ; since we don't "need" dx, use it instead of a push
	mov  si, ax
	xor  ax, ax
.nextByte:

with this:
Code:
	mov  dx, si ; since we don't "need" dx, use it instead of a push
.nextByteMinusThree:
	xor  si, si
	xchg si, ax
.nextByte:
This on its own saves a byte, but that's not all. Replace this (15 bytes):
Code:
.xxx1_1111: ; LM MM MM ML
	mov  al, [bx + 8]
	stosb
	mov  al, [bx + 9]
	stosb
	stosb
	mov  al, [bx + 10]
	stosb
	jmp  .nextByte

with this (9 bytes):
Code:
.xxx1_1111: ; LM MM MM ML
	lea  ax, [bx + 8]
	xchg si, ax
	movsw
	dec  si
	movsw
	jmp  .nextByteMinusThree

I doubt it will improve speed but it is smaller.
 
That first one -- thanks for pointing that out. I always forget that "xchg reg,acc" is 3 and 1 instead of 4 and 2. Definitely adding that.

On the second one, I'm actually thinking of making the lookup table store LM MM MM ML so I can just do two stosw without the DEC -- one more byte in the static data isn't a big deal, at least not compared to the code difference... or four "stosb; inc di" for screen output. I was playing with the same basic idea last night, though I was getting far more complex than need be on the SI preservation, which your use of XCHG makes far, FAR simpler.
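Something like this, assuming the maps grow a twelfth entry so bytes 8..11 hold LM MM MM ML contiguously:
Code:
.xxx1_1111:
	lea  ax, [bx + 8]
	xchg si, ax          ; si -> the four table bytes, old si parked in ax
	movsw                ; LM MM
	movsw                ; MM ML
	jmp  .nextByteMinusThree ; restores si from ax and zeroes ax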

Good advice on both of those -- thanks.... though on the second one that only gets run ONCE, so less code is more of an objective than speed.
 
You expressed privately being receptive to suggestions, so what follows is a lot of constructive criticism gleaned from extensive experience writing high-performance code for this platform. But first, some general observations:

You should be wary of making assumptions about the hardware, the compiler, etc. as it sends you into weeks of optimization for things that don't really matter. Always benchmark and profile your code before embarking down a long journey. For example, you've spent a ton of time optimizing the sprite plotting when in fact there are bigger issues at hand. Here is some rough profiling information from paku 1.6 using Turbo Profiler (gathered from an IBM PC 5160 with PC speaker audio active):

Code:
Turbo Profiler  Version 2.2  Thu Sep 18 12:50:50 2014

Program: D:\GAMES\PAKU_16A\PAKU.EXE

Execution Profile
Total time: 142.29 sec  Total Ticks: 14229
% of total: 16 %
       Run: 1 of 1

    Filter: All 
      Show: Time
      Sort: Frequency

PAKU.WAIT         11.08 sec  48% |**********************************************
PAKU.TGHOST.UPDA   3.11 sec  13% |************
PAKU.TESTCOLLISI   2.62 sec  11% |**********
PAKU.TPLAYER.UPD   1.31 sec   5% |*****
PAKU.DRAWPLAYFIE   1.06 sec   4% |****
PAKU.1992          0.31 sec   1% |*
PAKU.2069          0.30 sec   1% |*
PAKU.SCOREPOINTS   0.28 sec   1% |*
PAKU.GAMEKEYCHEC   0.21 sec  <1% |
PAKU.2009          0.20 sec  <1% |
PAKU.2063          0.19 sec  <1% |
PAKU.2056          0.16 sec  <1% |
... etc.

Based on this, I'm struck by a few things. First, your idea of dividing processing time into "timeslices" results in nearly half the active time wasted in a WAIT routine. It may be smoother to just pick a target update rate (ie. 30Hz) and put all processing in a single pass. But the larger issue is TESTCOLLISION which takes up nearly as much time as the stuff you spent weeks optimizing.

Does that mean you need to optimize TESTCOLLISION into assembler? No, but I would definitely compile PAKU.EXE with debugging information turned on (both integrated and standalone) and load it into Turbo Debugger, so that you can navigate to the source lines in question and use View->CPU to see what Turbo Pascal is actually generating as code. It may shed some light on optimization opportunities -- such as using static objects instead of dynamic ones, etc.

Turning on .MAP file generation from the linker can also help determine what to target when optimizing for size. Here's a summary of segment sizes for PAKU 1.6:

Code:
paku               : 15018
joystick           : 85
txtGraph           : 4427
sound              : 4237
timer              : 237
jfunc              : 1254
System             : 6584
DATA               : 3752
STACK              : 3072

Ordered by size:

Code:
joystick           : 85
timer              : 237
jfunc              : 1254
STACK              : 3072
DATA               : 3752
sound              : 4237
txtGraph           : 4427
System             : 6584
paku               : 15018
Total memory size of code + data: 38666

Finally, I'd suggest profiling high-performance code sections with the Zen Timer. You may think you know how to optimize for 8088, but a microsecond timer will prove if you're right or wrong. Hint: You should be counting total number of I/Os (bytes read/written), not cycles. The only time you consider cycles is if they take longer than the time it takes to read the opcode (ie. MUL, DIV, etc.)

And now, some specific responses:

Laughably that alone seems to shave anywhere from 16 to 128 bytes off EVERY function... Which surprised me a lot.

While this is true, don't miss the forest for the trees. For example, eliminating the stack frame from procedures can save you 1000 bytes, but what if optimizing a piece of pure Pascal code elsewhere can save you 2000?

One of the big things I've found in testing is that reading Jr memory is slower than writing -- noticeably so.

This is correct, verified by profiling on a real PCjr:

Code:
<= 128KB:
MOV CX,4000; REP LODSB = 13412 microseconds
MOV CX,4000; REP STOSB = 13412 microseconds
MOV CX,4000; REP MOVSB = 22352 microseconds

> 128KB (as provided by a jrIDE):
MOV CX,4000; REP LODSB = 10897 microseconds
MOV CX,4000; REP STOSB = 08383 microseconds
MOV CX,4000; REP MOVSB = 14251 microseconds

On a regular 5150/5160/etc. this is roughly the same, although CGA imposes a penalty for reading screen RAM:

Code:
Video memory (stock CGA):
MOV CX,4000; REP LODSB = 17882 microseconds
MOV CX,4000; REP STOSB = 13412 microseconds
MOV CX,4000; REP MOVSB = 24141 microseconds

System memory:
MOV CX,4000; REP LODSB = 10974 microseconds
MOV CX,4000; REP STOSB = 08624 microseconds
MOV CX,4000; REP MOVSB = 15087 microseconds

So, is this a path worth pursuing? Also, will reading a packed memory structure take more time than it saves? These are all things to consider.

Which also works with my other goal for Paku 2.0 and the three other games I'm working on... that being getting the memory footprint below 64k.

The only way to guarantee that is to write your code in pure asm. TP is not good at optimizing for size.

My first attempt at encoding was a bit overboard, attempting LZW, LZH and LZ4 -- and in all cases finding the decoder to be bigger than the reduction in file size.

The official (ie. recognized by the LZ4 author) LZ4 16-bit x86 assembler implementation has two decoders, one of which is optimized for size and is only 79 bytes long: http://www.oldskool.org/pc/lz4_8088

Next I tried a simple two-byte RLE... first byte being 'of what', second byte being 'how many' -- stored in that order so "lodsw; mov cl, ah; rep stosb;" works. It works well, but the increase in reads and number of unique values meant that speed-wise and size-wise I figured I could do better.

If the alternative uses a lot of branching, the gains could be eaten up by the branching. There's no way to tell without profiling.

The tradeoff of working with a packed memory structure is that the code that works with it gets larger. If you save 1000 bytes in data but it takes 1200 bytes of code to deal with the changes, you've actually gone backwards.

Turns out the interrupt problem isn't entirely what's wrong with how I was doing it

Yes, it was. When the method used to read the stick is based on a tight loop, anything that interrupts that tight loop is going to change the values returned from the loop. Your assertion about the "dead zone" being wrong is because, when the values get interrupted, they fall way outside the dead zone. The proper fix isn't a bigger dead zone, but consistent values.

The new stick reading procedure you posted is great, because it disables interrupts around a timing-sensitive piece of code, and has consistent performance per loop iteration. It is less granular than a procedure that monitors only one axis at a time, but that is exactly what you want for a game that will only use digital up/down/left/right/diagonal directions anyway.

Side effect of the detection code I use for this is that when machines are 'too fast' for this stick reader the joystick is disabled. (that's actually a good thing!)

Well, you could fall back to a timer method, which works on all systems. Or, if the system is "too fast" then it has a later BIOS which supports joystick reads on int 15h. Here's a reference: http://pastebin.com/xv02GJF3
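For reference, the BIOS call is just this -- a sketch of the AH=84h service found on AT-class (and late XT) BIOSes:
Code:
	mov  ah, 0x84
	mov  dx, 1           ; DX=1: read resistive inputs (stick positions)
	int  0x15            ; AX/BX = stick A x/y, CX/DX = stick B x/y
	                     ; (DX=0 returns the button states in AL instead)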
 
You should be wary of making assumptions about the hardware, the compiler, etc. as it sends you into weeks of optimization for things that don't really matter.
I spent a LOT of time testing pieces section by section and profiling my timeslices as to where the problems lay. Certain things need to work at certain intervals; audio in particular is problematic, and unless I'm going to move the audio updates into the ISR (which adds all sorts of bigger issues, like screwing up the draw to screen) it's just not viable. Having the audio decide to fire in the middle of copying from the backbuffer to screen looked like choppy crap; disabling IRQs during the blits stopped that but made the audio crap -- not as bad as in my version 1.0 that had no back-buffer, but it's still bad.

I use the timeslices for a number of reasons -- so that I can split up the audio between doing things WITHOUT interrupting them being the biggest. Some slices have extra 'wait', some do not... that's just realtime programming 101.

Kind of like bit-banging on a fixed processor target. Sometimes you are right up against the execution limit, sometimes you're filling it with nop's to keep it at the exact frequency per "slice". (that's my days of working with QNX talking, RTOS is a whole different ballgame)

Basically when you need perfect timing, there's a LOT of time you'll spend with the program sitting around with its thumb up its backside -- then there's the time you'll spend in a little tiny slice where you can't move what's going on into that gap.

Always benchmark and profile your code before embarking down a long journey.
I have been -- the reason you're not seeing it is that you're looking at the overall picture INSTEAD of the slices. It's a 'trap' you can fall into when profiling: if something takes a bunch of time when you HAVE the time, it's fine... if something takes a medium or even small amount of time when you have a minuscule bit of time to work with, that's where the efforts have to be placed.

Again, for example, the score updating: in the handful of slices where it happened, updating it on the screen was consuming more than the entire slice allotment. In the grand picture of the game that's at most 1 out of every 15 ticks at the fastest game speed (1 out of 24 ticks at the slowest)... If I looked at that in the overall execution time it's a non-issue... but right there the game 'chokes' and was stuttering. Switching to BCD math in low-byte-first order gave me a 'fast enough' addition, and a data stream easily used with an address lookup to output the number with coded sprites. It was something that in the grand scheme was taking little to no time and would NEVER have shown up as an issue in normal profiling, but WAS a choke point in actual application. CRAZY as that sounds, that's often the nature of realtime programming. Profiling at 1200 of my custom timer ticks shows it as less than 1% of the overall execution time, but it WAS causing a hiccup every time a pellet was eaten, as it caused a spike.

That's really what I'm trying to do here -- line conditioning. I'm trying to smooth out the spikes for more uniform execution.

For example, you've spent a ton of time optimizing the sprite plotting when in fact there are bigger issues at hand.
In terms of hitting up against the limits per slice, it is one of the biggest issues -- see rendering the score, which even given its own slice takes longer than the slice -- which either means moving the audio into the ISR or letting the audio warble when the computer being used is too slow.

Here is some rough profiling information from paku 1.6 using Turbo Profiler
At what level was that profiled? This statement:
Based on this, I'm struck by a few things. First, your idea of dividing processing time into "timeslices" results in nearly half the active time wasted in a WAIT routine.
The first "slow" levels for example have THREE timeslices of absolutely nothing to slow the gameplay down! Remember, 15, 20 and 24 fps are the gameplay speeds on a 120hz tasker. Try doing that at a fixed 30fps loop, and you'll have choppy crap. That 120 times a second of the audio timer was chosen for a reason -- it is simply sexagesimal times two, so I can easily use 5 timeslices per frame, with delays of 1 or 3 extra timeslices for slower levels.

Only five of the timeslices actually do anything -- pump it to a level without those extra waits and try again. That's why this:
It may be smoother to just pick a target update rate (ie. 30Hz) and put all processing in a single pass.
Is simply not viable while keeping the gameplay as smooth as it is. I'd end up with something like Atarisoft's version or PC-Man, both of which feel jerky and sloppy to me in how they update the screen and control the flow of the game. (and sadly they don't even HAVE different level speeds!)

Your profiling came to the same conclusion I did about two routines, update and testcollision -- BUT -- those do not actually cause delays in their timeslices, and it's not like I can split them into other timeslices easily "as is" with how 1.6 was doing things; nor is removing the timeslice code the answer, as then you end up with something as choppy/strange as Moonbugs or worse, Round 42.

That said, the rewrites of those two routines are gonna fix that a good deal.

1.x's ghost "update" is flawed because it actually uses a divide by three and a modulo -- the new version uses xlat, alleviating that problem. It's also where the back-buffer blitting occurs and, well... this next problem:

Both "update" and "Testcollision" are flawed because they use pointers to the data instead of keeping that data in the data segment which REALLY slows them down, particularly going back and forth between basically two different segments... which is basically me planning to implement:

It may shed some light on optimization opportunities -- such as using static objects instead of dynamic ones, etc.
I had already made that determination... so you're spot on there. I had actually disassembled what TP had built, just as you said, and noticed all the swapping of DS on heap elements. The new version is NOT going to store ghost data in the heap; in fact, instead of the 26k heap I was using I plan on a 1.5k total heap -- and none of that is really for gameplay. I may in fact reduce it even further. I'm also going to be hard-calling the ghosts instead of having one point at the next point at the next until null. It was creating extra stack overhead for no good reason, and since they were objects on the heap, that's even more overhead that's just not needed. This version is putting everything into the global scope so it's all in the same DS.

So yes, I'm aware of those two -- but the laugh is, within their own timeslices they do NOT have an overall impact on the gameplay. They are 'fast enough' within their slices, even if they eat up most if not all of it... that might sound confusing, but again that's what timeslice smoothing is about. Being able to spread them around more will help... that's also one nice thing about timeslicing; I can profile each part of code and play with their execution orders between the updates that MUST happen at certain times.

In the profiler, scorepoints may only be consuming 1%, but if that 1% occurs at a choke point, throwing the timing off? Then it's a problem. I'm also moving around what's in each slice by profiling the slices on their own.

Instead of getting as complex as the Zen Timer, which can introduce its own issues, I'm simply waiting for the clock rollover and looping for 182 updates to count how many times I can run a section in approximately ten seconds. Just split them out, feed them dummy data and loop until rollover.

Code:
const
	testFor = 182;
	
var
	counter:word absolute $0040:$006C;
	countLast, countTime:word;
	
procedure startTest;
begin
	countTime := testFor;
	countLast := counter;
	{ wait for rollover }
	repeat until not(countLast = counter);
	countLast := counter;
end;

function timeTest:boolean;
begin
	if not(countLast = counter) then begin
		dec(countTime);
		countLast := counter;
	end;
	timeTest := countTime = 0;
end;

begin
	startTest;
	repeat
		{ whatever you are testing }
	until timeTest;
end.

I always prefer testing over time instead of timing over a test... probably due to seeing too many benchmarks in languages like PHP or JS that were flawed by doing the latter. While the Zen timer generally doesn't have those problems, the above is more than simple enough to compare how long two bits of code take to run.

Turning on .MAP file generation from the linker can also help determine what to target when optimizing for size. Here's a summary of segment sizes for PAKU 1.6
Which doesn't entirely apply as this is a total rewrite; I'm not actually retaining a whole lot from the original in terms of code, hence the jump in version number -- though YES, I do plan on doing that with the new version once the new 'parts' are ready to be glued together.

This version is not only getting more ASM, it's also being built the way I use server-side languages -- PHP, HTML and database engines. Properly used, PHP is just 'glue' -- you use it to glue together output from things like SQL, which are optimized to their task, into markup that says what that output is.

That's what I'm using TP for -- it's there to glue together the optimized bits. Kind of like the load supervisor a lot of Linux distros use that speeds up how fast Linux loads, despite being written in an INTERPRETED language (Perl). All it does is optimize the load order, where it doesn't have to be fast, so the faster stuff can be even faster and simpler to maintain.

You may think you know how to optimize for 8088, but a microsecond timer will prove if you're right or wrong. Hint: You should be counting total number of I/Os (bytes read/written), not cycles. The only time you consider cycles is if they take longer than the time it takes to read the opcode (ie. MUL, DIV, etc.)
I prefer to instead run for a fixed time period to see how many iterations I can do over that time period... but the result is much the same.

Though when I'm hand calculating, I actually figure the BIU in my head, so I'm counting BOTH execution time AND byte size -- since that 4-clock (or more on the Jr) fetch time per byte makes most of those 2-3 clock opcodes take 8 clocks anyways. Sometimes I'm off a clock or two, but that's where -- as you keep saying -- profiling comes into play.

And now, some specific responses:

While this is true, don't miss the forest for the trees. For example, eliminating the stack frame from procedures can save you 1000 bytes, but what if optimizing a piece of pure Pascal code elsewhere can save you 2000?
True enough -- though really, when I've got 20k of compiled code and 60k of constants and variables? I've dropped the data to half that, so that was time well spent. Right now just from what I've done it looks like I might even be under 48k -- down to the size of the C64 version.

(re: memory speeds)
So, is this a path worth pursuing? Also, will reading a packed memory structure take more time than it saves? These are all things to consider.
Which is why I actually tested and profiled every approach, and it kept getting faster and faster... As I outlined above, though, I have the advantage here (unlike say... a video stream) of being able to customize the encoding to a fixed data set -- so I'm not using a generic encoder/decoder. I also have the fact that once it's "faster than I need" I can put more effort into size. It's a balancing act.

Sadly one of the biggest things that set me on the path of rewriting this part of the game wasn't size (though it helps) but making it so the white-blue-white-blue end of level blink looks nicer. On the Jr. you could see it drawing from top to bottom, and that bugged the hell out of me.

The only way to guarantee that is to write your code in pure asm. TP is not good at optimizing for size.
It's better than a lot of other compilers though... and really in this case the code is NOT what's sucking on most of the memory; though reducing BOTH is gonna work out pretty good.

The official (ie. recognized by the LZ4 author) LZ4 16-bit x86 assembler implementation has two decoders, one of which is optimized for size and is only 79 bytes long
But sadly it reads back its own writes -- meaning it takes longer to run than simply copying an unencoded copy. That's why I rejected it. See "x slower than memcpy" on the table. RLE is faster than memcpy because long runs of the same byte are always faster, as it's just rep stos instead of rep movs.

Which is why in my case bitwise triggers with RLE are WAY faster; handy since I'm using it to decode not just to both backbuffer and screen, but also in two different color sets. I mean really, 90%+ of the data consists of long (2+ byte) runs of just THREE values. Take just the first eight lines of the map without pellets.

Line 1: 1x 00 | 40x HH | 1x 00
Line 2: 1x 0H | 19x LL | 2x 00 | 19x LL | 1x H0
Line 3: 1x HL | 19x 00 | 1x L00L | 19x 00 | 1x LH
Line 4: same as line 3
line 5: same as line 3
line 6: same as line 3
line 7: same as line 3
line 8:
1x HL | 3x 00 | 4x LL | 3x 00 | 1x 0L | 5x LL | 3x 00
1x L00L
3x 00 | 5x LL | 1x L0 | 3x 00 | 4x LL | 3x 00 | 1x LH

When you look at it that way, using the shift and jump for three conditions makes a good deal of sense. It's much akin to how LZH works, but as an RLE trigger. ALL those runs of $00 are the drop-through on the very first check. The runs of LL are the drop-through on the second, the runs of HH (of which there are 10 total) are the third. The difference is instead of a generic encoder/decoder for all sorts of data, I have a highly specific coder for a very specific datastream.

If the alternative uses a lot of branching, the gains could be eaten up by the branching. There's no way to tell without profiling.
Or by simply doing the math and knowing what the data is... though profiling to test -- and even better, seeing if it does what you need done on the actual hardware -- still matters, since I want it to use less memory and render fast enough on the Jr. that it's at least equal to what the original code was on the 7.16MHz T1K...

At the start of the optimization the first thing I did was switch to unencoded data, which WAS faster than the 48 4x3 tiles with lookups. Sadly it was also six times the memory footprint. This branch encoding/RLE combination is many times faster than a flat copy AND is smaller (though not by a whole lot) than the tile approach was. If you look at your table, anything where "x slower than memcpy" is greater than one simply isn't viable for what I'm doing...

The tradeoff of working with a packed memory structure is that the code that works with it gets larger. If you save 1000 bytes in data but it takes 1200 bytes of code to deal with the changes, you've actually gone backwards.
Unless you need both size AND speed. Basically what this got me was in the ballpark of 6x faster in ~200 bytes less code and data (since repeating the tiles was also low memory profile) but due to changes, it's got a much smaller memory footprint.

Yes, it was. When the method used to read the stick is based on a tight loop, anything that interrupts that tight loop is going to change the values returned from the loop. Your assertion about the "dead zone" being wrong is because, when the values get interrupted, they fall way outside the dead zone. The proper fix isn't a bigger dead zone, but consistent values.
I think you misunderstood me; the problem was that the way I was determining the dead zone resulted in one so large that the extremes of position were INSIDE the dead zone. As such the answer is a SMALLER dead zone, not a larger one. That's why it wasn't responding properly on the Jr's right and bottom. With the interrupts on, the center read was too small. Disabling interrupts didn't just decrease the jitter (which most certainly helped too), but also increased the range of output so the calibration at the start wasn't off.

I was actually a little surprised myself when I figured that one out. Even with interrupts off it was doing the same thing on an unexpanded Jr, which is what led me to dig deeper on that.

The new stick reading procedure you posted is great, because it disables interrupts around a timing-sensitive piece of code, and has consistent performance per loop iteration. It is less granular than a procedure that monitors only one axis at a time, but that is exactly what you want for a game that will only use digital up/down/left/right/diagonal directions anyway.
Precisely the thinking behind it. Input granularity is overrated anyways on a joystick interface that doesn't even bother with a real ADC, where some hardware can have as much as 20% jitter just from the use of cheap capacitors, and where simply running the device can cause drift over time as the temperature changes. :/ Always thought it was kind of sad that at age ten I had built a better analog joystick interface for a Trash-80 out of $15 in parts than IBM could manage with their vast resources; apparently $3.50 (in 1980 dollars) 6-bit ADCs were "too expensive a part". If so, how did the Shack spring for them on the CoCo?

The joystick interface was always a laugh -- it's a bit like Gelflings; there's a line between outright idiocy and sheer genius, but it's often hard to figure out on what side of the line things fall. The PC Joystick interface is either a brilliant way to make a cheap interface, a really stupid way of cutting corners... or a hefty dose of both. Knowing IBM at the time...

I agree with a lot of what you said, and much of it I'm already working on. Where we really differ is in time-slice vs. flat execution. One of the things I keep having people talk about being so impressive with the original game is the smoothness of gameplay compared to other CGA games, including those that use the 160x100 mode. That's what the timeslices give me: precise timing and control, just like an RTOS task-slicing kernel would provide; I'm just using cooperative multitasking to do it.

Lemme put it this way: let's say it took you ~3000 clocks or more to update the next sound interval; try maintaining 24fps during FMV playback while manually running that sound update at 120Hz with no DMA assistance.

Sadly most of the overhead of the sound updates is the multiple card support and the "3 voice priority" code, though I'm planning on addressing that as well. Instead of a case statement for the current card I'm going to use a pointer to the correct one, and the PC speaker multi-voice code is going to be ASM branching for better short-circuiting than the current method.

Though I do have arpeggio code I was thinking of trying, but I think it might take too long -- though really it shouldn't be any slower than the painfully slow mess that is AdLib... PakuMDA (also WIP) will probably use it though, as there's FAR more CPU time free on that one, since I'm only doing 80x50 graphics in normal 80x25 text mode and it's only going to have PC speaker sound.

In any case thanks for the advice -- some of it really doesn't apply, some of it I'm already doing or planning, and some of it has me thinking in new directions. You even got me double-checking what I've already done for other approaches, making me confirm that yes, I'm on the right track. (mostly the time-slicing vs. moving audio into the timer ISR -- the latter just isn't viable)
 
For your rewrite, I would like to suggest the following to help you meet your goals:

- Fire the PIT at 120Hz, not 240. You don't need that kind of resolution and the CALL/RETs just take up time.

- Sync the PIT firing with screen refresh. Do all housekeeping on one timeslice, screen updates on the other. (This might eliminate the "snow" people see on real CGA cards.) I can provide code if you need it. The PIT divisor for 60Hz CGA is ((912*262) div 12) = 19912. For ~120Hz, divide this value by 2 -- see the sketch after this list.

- Don't use Turbo Pascal's heap at all. Not once. This will force you to fit everything into a single data segment and additionally cut down on memory access times.

- Develop your program with range checking and stack checking turned on. If it doesn't run properly with those on, you have bugs in your code. (If your code doesn't run, don't be tempted to blame TP's error checking. It is not wrong.)

- Be careful mixing longint math with non-longint math in TP. TP has some bugs where it misses the proper typecast. inc/dec are fine, but watch out for +, -, /, * operators. 99.9% of the time it works properly but if you run into issues, try breaking up a long math statement into parts. Turning on arithmetic checking is a good check for this; it fires on buggy miscasts as well as legitimate arithmetic underflow/overflow.

- TP uses the data segment for the stack. Larger stacks cut down the amount of global data available and vice versa. If you're not using recursion and/or don't have a lot of local variables in your procedures, you should be able to get by with a small stack, 2K or less. If your program is crashing with stack checking on, that's not a bug -- you're blowing past the stack and you need to examine what you're doing.
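Here's a sketch of that ~120Hz reprogramming -- the standard 8253 command sequence for channel 0, using the halved divisor:
Code:
	mov  al, 0x36        ; channel 0, lobyte/hibyte, mode 3 (square wave)
	out  0x43, al
	mov  ax, 19912 / 2   ; = 9956, half the 60Hz CGA-synced divisor
	out  0x40, al        ; divisor low byte
	mov  al, ah
	out  0x40, al        ; divisor high byte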
 
Actually, I had the actual timer ISR at twice the game frequency for a reason, but for the life of me I can't remember what that reason was... I guess I'll find out. (could be code rot)

I do think I'm gonna take that suggestion -- refigure that 240 to 4978 and try 120 at 9956 so it matches the CGA timing as you suggested. Making it 1:1 to the slicer would make it better; I'm trying to figure out why I didn't do that.

Hmm. Might I be better off trying to pull the actual scan-rates from the card, just in case of CGAs that use slightly off timings? I know my Sharp PC 7k has snow on the external connector (but not the internal display) but has two fewer scanlines in the blanking... (so 260 total instead of 262)... I'll have to play with that. It's going to be tight to fit all five sprites into the blanking period (with my old method that just wasn't going to happen) at the fastest interval of 4 slices per game frame.
 
Figured out why I had it in there at 240Hz -- silly me forgot to adjust the timer from when I was trying to use arpeggios for two-voice music in the theme before... so yeah, that was code rot.

A new timer ISR is in the works anyways, fixed at 120hz with a "waitSlice" function that will, well:

Code:
; function waitSlice:word;
waitSlice:
  xor  ax, ax
  xchg ax, [sliceCount] ; grab AND clear the count in one go, so the
  or   ax, ax           ; next call waits on a fresh slice from the ISR
  jz   waitSlice
  retf

The reason I wasn't having the test wait was to keep the keyboard buffer empty, since I'm using BIOS for key reads; the new keyboard check now loops as long as there's stuff in the buffer.
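That check amounts to the following sketch, using the BIOS keystroke services (INT 16h functions 1 and 0):
Code:
.drain:
	mov  ah, 0x01
	int  0x16            ; ZF set = keyboard buffer empty
	jz   .empty
	xor  ah, ah
	int  0x16            ; consume one keystroke (AX = scan/ASCII)
	jmp  .drain
.empty: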
 
Just wanted to say this was a fascinating read! Pretty amazing to use RLE in something so similar to a screen buffer.

Any news on the monochrome version of PakuPaku?
 
Any news on the monochrome version of PakuPaku?
The MDA version is hitting up against its own unique set of... challenges.

One of the biggest is that the intensity attributes are not uniform -- there are actually FOUR intensities (including black), not just three, and worse, you have little control over them. Attribute 0x78 gives you the 'mysterious' darker grey on grey, and 0xF8 gives you that same 'darker grey' on bright grey. These seem to be the only times they exist, and there is NO regular grey on bright grey!
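A quick way to see that one for yourself -- a sketch poking a single half-block cell into MDA memory at B000:0000:
Code:
	mov  ax, 0xB000
	mov  es, ax
	xor  di, di
	mov  ax, 0x78DC      ; AH = attribute 0x78, AL = 0xDC (lower half block)
	stosw                ; darker grey bottom half on regular grey top half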

Laughably it's even worse on a Hercules, where they call the darker one "dim" and actually allow you to use it as a background attribute -- which may or may not be visible depending on the monitor. Of the three configs I tested, only one of my two actual monochrome displays showed it, as did the InColor card on an EGA display. Basically the Herc does NOT handle intensity attributes the same as a real MDA, so I may have to implement some kind of check to test for Herc vs. real MDA... and christmas only knows what clone cards are gonna do.

Even accepting that when there's a bright in half the character the normal part is going to be "darker", or possibly not show at all, the attributes are only provided for the darker color on light, not the other way around. So to even do that, with the 'graphical glitch' of the border being darker when up against it, my backbuffer has to be rendered to the display as both the attribute and the character, flipping between 0xDC and 0xDF as the character used. This actually means I'm writing 6x2, so more like two thirds the bytes of the CGA version despite half the resolution :/

It's such a mess, I've been playing with simply trying to interlace the display to make my own greyscale. I'm thinking with the decay time being half that of the frame rate it might actually be passable without too much flicker on real hardware -- though the result is hideous in DOSBox.

I've also been playing with the idea of taking a page out of your book (thanks for the inspiration, MagiDuck rocks), and simply not showing the whole map on screen and scrolling it, with a side-bar map akin to the old "dung beetles" / "megabugs". No, wait... more like "Radar Rat Race" / "Rally X"

This would let me return the map to the original aspect ratio and pellet count. I could then run an even bigger tile (5x5) but with a 1px gap between borders and sprites, hiding the overlap issue... sadly that ends up looking a LOT like Scarfman, though the intensity bits help. Biggest problem with that is I would have to blit the entire playfield from the backbuffer every frame, since I won't even have access to multiple pages.

Really it comes down to whether I can live with the border disappearing when a high intensity sprite is next to it or not... I'm starting to see why MDA games (what few there are) don't try to use 80x50 graphics a whole lot.
 
Oh, and technically the Hercules displays FIVE intensities, as the "dim on bright" is brighter than the regular "dim" but darker than "normal".

Basically converted to 0..255 greyscale, the choices are:

MDA:
#00 full character
#AA on #00
#FF on #00
#30 on #AA
#30 on #FF

Hercules:
#00 full character
#AA on #00
#FF on #00
#30 full character
#AA on #30
#FF on #30
#55 on #AA
#55 on #FF

With the #30 intensity not showing up on some displays at all. Those are 'the only choices' and, well, you can see how that might drive one batty. There is no #AA (normal) on #FF (bright) or vice-versa!

Oh, I'm also playing with making a Hercules InColor version, but that's probably gonna have an AT as minimum spec, since the 720x348 resolution is, well... another challenge thanks to the oddball aspect ratio... each sprite needs to be 2.4x its height, meaning I'm aiming for 24x10 sprites. Just DESIGNING for that size is difficult, particularly since testing is pretty much restricted to the actual hardware.

Harder to make work than my unreleased 320x200 tand... uhm... yeah. Sometimes the CPU time just isn't there.

Of course who knows when ANY of this is going to be done, especially with all my other projects. Kind of an "it's done when it's done" situation on all of this. Particularly with my trying to implement an FB-01 emulator on a Teensy 3.0 -- I think a 40MHz ARM should be up to the job of 8 voice, 4 operator AM, though I might add a pair of external 16-bit DACs to it since I'm leery of using PWM for audio output, much less 8-bit output, which isn't all that great for a synth module... that, or I resistor-link two PWM outputs together after filtering, but that can be VERY hit or miss... though laughably I keep thinking "longs are going to suck at 8 bits" because the Teensy can run Arduino sketches -- when it's a 32-bit processor. REALLY not used to having so much bit-width available. I can just flat add all 32 operators (4 operators over 8 voices) and then >>5 for the output value, once for each channel (left/right).
 
One of the biggest is that the intensity attributes are not uniform -- there are actually FOUR intensities (including black), not just three, and worse, you have little control over them. Attribute 0x78 gives you the 'mysterious' darker grey on grey, and 0xF8 gives you that same 'darker grey' on bright grey. These seem to be the only times they exist, and there is NO regular grey on bright grey!

http://www.seasip.info/VintagePC/mda.html#memmap

I'm starting to see why MDA games (what few there are) don't try to use 80x50 graphics a whole lot.

That's an advantage -- there's no CGA snow to worry about, and you have a few frames of persistence, so entirely different effects are possible.
 
Ouch, yeah... Just tried DOSBox's Hercules emulation with QBasic. Not very useful attribute combinations indeed. Also, DOSBox only shows black, grey, white and the blinks.

Pretty nasty having to consider both emulators and real hardware, at least if one is intending to have an audience. :)

I made some monochrome mockups with 5x5 sprites and a maze that leaves a 7x7 space around them to avoid two adjacent attributes. I wonder if you've already considered this layout?

[Attached mockups: pakumock0.jpg, pakumock1.jpg, pakumock2.jpg, pakumock3.jpg]

It's based on 3x3 tiles + one 2x3 tile in the middle of the screen. So 26*3 + 2 = 80 columns.

Instead of a mini-map, you could show ghosts that are beyond screen bounds as text-indicators that would change their case and attribute based on their y-distance. The indicators could also move horizontally to show their x-position.
- If ghosts are above the screen, the main hud would be shown at the bottom of the screen.
- If ghosts are both above and below the screen, it could show two indicator bars and override the main hud.
- If all ghosts are visible, the indicator bar would disappear and show the last two pixel rows normally.
- Whoops, sorry, I used the wrong ghost names on these!

I don't know how blasphemous these ideas are, but I guess nothing's lost in sharing. Admittedly the visible screen space is quite small, even compared to the Megadrive/Genesis version of Ms. Pacman. Not sure if these mockups have quite the right feel for Paku Paku either...

Also, please delete this post in case it's derailing this thread... Maybe I got a little too excited with this problem. :)
 
I don't know how blasphemous these ideas are, but I guess nothing's lost in sharing.
No such thing as a blasphemous IDEA to me. Ideas are good -- I might argue them, I might shoot them down, I might even reject them out of hand -- does NOT mean I didn't want to hear it.

Though really what you did is similar to what I'm playing with -- though I think I'm settling on the same 3x3 sprites with a 1px gap around them that Scarfman used. Scrolling even at the fastest blit rate isn't viable for the frame-rates I want to maintain, so I'm pretty much going to have to abandon even resembling the real Pac-Man layout -- unless I were to follow PC-Man's example of rotating it 90 degrees. (which I'm also considering)

Not sure if these mockups have quite the right feel for Paku Paku either...
Halving the resolution in general has those issues.

I might simply scrap the whole plan and go with a Hercules version, though the amount of memory the sprites end up taking (30 bytes each) means I'm not sure a 4.77MHz PC could keep up with it.

Also, please delete this post in case it's derailing this thread... Maybe I got a little too excited with this problem. :)
I don't tend to get my panties in a wad over thread drift -- a conversation goes where a conversation goes.

Again, ideas are always welcome -- even if I reject them for one reason or another it's good to at least THINK about ALL the possibilities.
 
I might simply scrap the whole plan and go with a Hercules version, though the amount of memory the sprites end up taking (30 bytes each) means I'm not sure a 4.77MHz PC could keep up with it.

Pros: All Hercules cards have 64K of RAM, so you would have two video pages. 1-bit graphics layout means more straightforward graphics operations (especially if you forgo masks).
Cons: 720x350 is a odd resolution and aspect ratio to design graphics for. Not every vintage owner has a Hercules.

There is a certain charm in games that use 100% text mode. For example, Czorian Siege was specifically designed for it (and the graphics support merely "mirrors" the existing game in a pseudo-text-mode).
 