Writing Assemblers... What should a good assembler do?

cj7hawk · Dec 10, 2023

Hi All,

I'm just rewriting my z80 assembler in z80 this time ( So it can assemble itself - and will be compatible with the cross-assembler I use under Windows11) - and had some thoughts about whether to change aspects of it and thought I'd throw a couple of questions out to the forum.

First, should it be possible to reassign a label mid-assembly? Generally this should cause an exception, but if allowed, it would let you cut and paste code segments and change constants without changing the source code.

My thinking on this one is fail on the first pass, but are there any assemblers that allow reassignment of constants and is this ever of value?

Second, is how long should a label be allowed to get? 11 characters? 16 characters? 24 characters? 32 characters? 79 characters?

I'm still building the lexical analyzer at the moment, and want to get these things right.

Thirdly, what operators should be allowed on functions?

Add, Subtract, Divide, Multiply, AND, OR, MOD, XOR, Invert, Shift left/right, Neg ( though NEG can also be invert+1 ) - And is there any standard as to what symbols should be used to represent these functions? And with z80 is there any valid reason to calculate a result more than 16 bits in the assembler? - eg, LD HL, 100000 / 5

Fourthly, how useful are macros? Or includes? And how best to use them?

For example, an include could "chain" another assembly file to the current, or could be as simple as reading it during the first pass and picking up labels for jumps. Or they could be two different things... And Macros can be confusing in assembly and promote bad code that is difficult to read, and a good assembler should be able to do everything in the line... Including flipping 8th bit on text strings, etc. What should a Macro do that you can't do in the general assembly file itself?

Fifthly, I'm looking to use . as a "null" character - that except in quotes, is simply ignored by everything... So .ORG and ORG are valid, and it can be used to help with readability - eg, .EQU MASK,%0000.0011.1111.1111

Finally, should a line have length limits or wrap around? Or should it just read until EOL is reached, ignoring everything after the comment marker until the final EOL is found, and assuming whitespace after the initial command is to be ignored, so that LD HL , 10 would be the same as LD HL,10 without spaces?

Would love to hear some thoughts on what a good assembler should support and what it doesn't need to.

Thanks
David

cj7hawk · Dec 10, 2023

And one extra I forgot to add... What characters should be allowed in Constant names? At the moment, I'm making it case insensitive. It allows numbers, underscores and letters and doesn't generally permit other symbols, but doesn't specifically exclude them either, so things like ROUTINE(1): would be acceptable at the moment, as would START[INITIALISE]: The < and > characters aren't permitted though, as these are shift operators... I'm not sure why I did those, but my cross assembler got them assigned, so I kept shift left and shift right.

In case it helps with an earlier question if anyone can assist, here's a list of my operators. ( and ; is comment to EOL, and . is ignore character)

; Operators
; , separator
; + add
; - subtract
; / divide
; \ modulo ( remainder from divide ).
; * multiply
; @ and
; # or
; $ hexadecimal value follows.
; % binary number value follows.
; ' single quote means a byte or series of bytes follows in ASCII. 8 bits. Quotes are NOT normal operators.
; " same as single quote, but must also be closed with a double. eg, '"' and "'" are both valid.
; < rotate left ( Only on immediate value... Can be chained. )
; > rotate right ( Only on immediate value... Can be chained. )
; ^ current program counter ( Without offset ).
; ! invert current value. !+1 = Make Negative.

Also, I keep the maths very simple. A bit like polish notation in linear progression, so it looks normal, but the operator acts on the immediate value and the next value. Quotes can be ' or " however must be terminated by the same quote. ( I am tired of putting the missing quote value into code as an ASCII code ).

eg,

1+2 * 3 would give 9 since it evaluates terms in order.
%01010101 @ "AB" would treat A as the LSB, apply the logical AND to it, and would delete the B since there's no bits in the upper 8 bits of the AND.

Also I don't distinguish between 8 bits and 16 bits except as is relevant to the command, which may trap the exception if it detects the wrong number of bits - eg, LD A,$1234 would give an error, but LD HL,%0101 would not, and H would be 0. However LD A,$234/60 would work since the result is 8 bits.
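To make the left-to-right rule concrete, here is a minimal Python sketch of an evaluator with no operator precedence. It is an illustration only, not the assembler's actual code: the `TOKEN` pattern, the `OPS` table and the `evaluate` function are hypothetical names, and only a subset of the operators listed above is handled (+ - * / \ @ #, with $ and % number prefixes and quoted text packed LSB first).

```python
import re

# One token: $hex, %binary, decimal, a quoted string, or a single-character operator.
TOKEN = re.compile(r"""\s*(?:(\$[0-9A-Fa-f]+)|(%[01]+)|(\d+)|'([^']*)'|"([^"]*)"|([-+*/\\@#]))""")

# Hypothetical operator table, using the symbols from the list above.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,   # integer divide
    "\\": lambda a, b: a % b,   # modulo
    "@": lambda a, b: a & b,    # AND
    "#": lambda a, b: a | b,    # OR
}

def evaluate(expr):
    """Evaluate strictly left to right: each operator acts on the running value and the next value."""
    expr = expr.strip()
    pos, acc, pending = 0, None, None
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected input at column {pos}")
        pos = m.end()
        hexnum, binnum, decnum, single, double, op = m.groups()
        if op is not None:
            if acc is None or pending is not None:
                raise ValueError("operator needs a value on its left")
            pending = op
            continue
        if hexnum is not None:
            value = int(hexnum[1:], 16)
        elif binnum is not None:
            value = int(binnum[1:], 2)
        elif decnum is not None:
            value = int(decnum)
        else:
            chars = single if single is not None else double
            # First character becomes the least significant byte.
            value = int.from_bytes(chars.encode("ascii"), "little")
        if acc is None:
            acc = value
        elif pending is None:
            raise ValueError("two values with no operator between them")
        else:
            acc = OPS[pending](acc, value)
            pending = None
    if acc is None or pending is not None:
        raise ValueError("incomplete expression")
    return acc
```

With this sketch, `evaluate("1+2 * 3")` gives 9 rather than 7, and `evaluate('%01010101 @ "AB"')` keeps only the masked bits of 'A'. Checking whether the result fits the instruction's operand width would be a separate step.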

Thanks for any input.

David.

Phil_G · Dec 10, 2023

I'm just rewriting my z80 assembler in z80 this time ( So it can assemble itself - and will be compatible with the cross-assembler I use under Windows11) - and had some thoughts about whether to change aspects of it and thought I'd throw a couple of questions out to the forum.
Just my personal observations following a Zen port, Zen assembles itself ok.

First is should a label be possible to reassign mid-assembly?
Changing an equate 'live' might be useful, changing a label might cause a helluva tangle (for the user, maybe not for the assembler)

Second, is how long should a label be allowed to get? 11 characters? 16 characters? 24 characters? 32 characters? 79 characters?
For a nice tidy .PRN file it's usual to allow one 'tab' for labels, sometimes two tabs. So labels longer than 7 or 15 characters need to be on a separate line; personally I don't like that, so 16 max. And always with a ':'

Thirdly, what operators should be allowed on functions?
Add, Subtract, Divide, Multiply, AND, OR, MOD, XOR, Invert, Shift left/right, Neg ( though NEG can also be invert+1 ) - And is there any standard as to what symbols should be used to represent these functions? And with z80 is there any valid reason to calculate a result more than 16 bits in the assembler?
I'd suggest standard 'c' symbols as they're the most familiar, 16 bit maths is fine for me. Using 32 could slow down assembly?

Fourthly, how useful are macros? Or includes? And how best to use them?
I avoid macros and I find other people's source hard to follow if they've used macros extensively.
As a feature it's good provided it's not overused - some almost redefine an entire instruction set using macros, very hard to decipher years later!
Of course having facilitated macros, you've no control over how they are used!
They also tend to produce repetitive code blocks, and larger symbol tables to accommodate the local labels - I think generally 'optimisation' in the compiler sense isn't a desirable feature.
This might well be just me though

Includes are useful, some might say essential.

Fifthly, I'm looking to use . as a "null" character - that except in quotes, is simply ignored by everything... So .ORG and ORG are valid, and it can be used to help with readability - eg, .EQU MASK,%0000.0011.1111.1111
Ok but it's unexpected behaviour which might confuse; also I'd expect the syntax to be label first, MASK: .EQU %0000.0011.1111.1111

Finally, should a line have length limits or wrap around?
The user would normally self-regulate to an 80-column screen, but sometimes it's handy, occasionally essential, to have a long line. EOL is good.

Would love to hear some thoughts on what a good assembler should support and what it doesn't need to.
Some suggestions:
Choice of binary or Intel Hex output, the filenames of which are command line arguments rather than embedded in the source like SBASM
a 'Phase' directive whereby code is physically placed at a different address to the ORG (very handy for code to be ROM'ed and for TSR code that is copied to high memory before being run)
Symbol table on request, rather than absent or enforced in the .PRN file. Pasmo doesn't even give you a .PRN listing!!! This is bad.

A few numbers after pass 2 would be nice: source code line-count and character-count, binary byte-count, last address used (ORG + byte-count), and a list of unreferenced labels & code blocks
These are just my own personal thoughts, others will disagree as I admit I'm rather 'old school' !

Is this a disk-based assembler like ZSM or M80? or memory resident like Zeap and Zen? If memory-resident you'd need to allow for shifting things around, symbol table, scratchpad, stack etc
Good luck with the project
Cheers
Phil

daver2 · Dec 10, 2023

You might want to consider a symbol for "the current location counter" as the assembler is assembling so that this can be used within expressions.

I use Macros and Include files in assembler code very often. I generally prefer them to be "text substitution" - but this is a whole minefield...

The concept of 'changing' constants and labels on the fly sounds like a recipe for disaster - unless it is controlled. If you are going to implement Macros, you will need some form of local label mechanism.

DEC's MACRO-11 feature of .<number> after a valid label is a good feature to specify local labels that don't really benefit from being named.

Dave

durgadas311 · Dec 10, 2023

Regarding labels that can change value, DRI (at least) defined SET vs. EQU directives. Labels assigned with EQU must have the same value on each pass. Labels assigned with SET may change values, and even have multiple assignments (with different values) throughout code.
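A minimal Python sketch of that SET vs. EQU distinction, assuming a case-insensitive symbol table as described earlier in the thread. The class and method names (`SymbolTable`, `equ`, `set`, `lookup`) are hypothetical and do not come from any DRI tool.

```python
class SymbolError(Exception):
    """Raised when a symbol is redefined in a way its directive does not allow."""

class SymbolTable:
    def __init__(self):
        self._values = {}  # name -> current value
        self._kinds = {}   # name -> "EQU" or "SET"

    def equ(self, name, value):
        """Define a constant; re-defining is only accepted with the same value (e.g. on pass 2)."""
        key = name.upper()
        if key in self._values:
            if self._kinds[key] != "EQU" or self._values[key] != value:
                raise SymbolError(f"{name}: EQU symbol cannot change value")
            return
        self._values[key] = value
        self._kinds[key] = "EQU"

    def set(self, name, value):
        """Define or reassign a variable symbol; any number of assignments is allowed."""
        key = name.upper()
        if self._kinds.get(key) == "EQU":
            raise SymbolError(f"{name}: already defined with EQU")
        self._values[key] = value
        self._kinds[key] = "SET"

    def lookup(self, name):
        return self._values[name.upper()]
```

In this sketch an EQU that repeats the same value is silently accepted, so the same definitions can be processed on both passes, while a SET symbol simply takes the most recent value.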

Macros can be life-savers, although I don't use them often. When you need them, you REALLY REALLY need them.

Not that I'm fond of it, but DRI also used '$' as a character ignored in labels and constants. I have used that to help with readability, but it can also be problematic if you are not consistent with labels (try searching for instances of a label that might be "foobar", "foo$bar", ...).

cj7hawk · Dec 10, 2023

Phil_G said:
Thirdly, what operators should be allowed on functions?
Add, Subtract, Divide, Multiply, AND, OR, MOD, XOR, Invert, Shift left/right, Neg ( though NEG can also be invert+1 ) - And is there any standard as to what symbols should be used to represent these functions? And with z80 is there any valid reason to calculate a result more than 16 bits in the assembler?
I'd suggest standard 'c' symbols as they're the most familiar, 16 bit maths is fine for me. Using 32 could slow down assembly?

I really liked this idea until I looked at what the C symbols were. And realized that my lexical analyzer is completely different from the C one. The algorithm I use makes it pretty easy to do stuff like add inline code references, macros etc ( even though I haven't supported them yet ) and new assembler directives. But there are constraints. Still, the idea is interesting and some crossover might be worthwhile... I'll look deeper into that - thank you.

Phil_G said:
Fifthly, I'm looking to use . as a "null" character - that except in quotes, is simply ignored by everything... So .ORG and ORG are valid, and it can be used to help with readability - eg, .EQU MASK,%0000.0011.1111.1111
Ok but it's unexpected behaviour which might confuse; also I'd expect the syntax to be label first, MASK: .EQU %0000.0011.1111.1111

This one seems unique to my assembler. It also accepts .EQU MASK: %0000 0011 1111 1111 as well, but making the dot invisible solved the issue of some compilers using .ORG and others using ORG - And when I'm checking it much later, it's so much easier to be able to break up long binary and even hexadecimal and decimal numbers.
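As a rough illustration of the "invisible dot" idea, the Python sketch below removes the ignore character everywhere except inside quoted text, before any other parsing happens. The function name `strip_ignored` and its `ignore` parameter are made up for this example.

```python
def strip_ignored(text, ignore="."):
    """Drop every occurrence of the ignore character that is not inside quotes."""
    out = []
    quote = None  # the quote character we are currently inside, if any
    for ch in text:
        if quote:
            out.append(ch)
            if ch == quote:
                quote = None       # closing quote must match the opening one
        elif ch in "'\"":
            quote = ch
            out.append(ch)
        elif ch != ignore:
            out.append(ch)
    return "".join(out)
```

Under these assumptions `.EQU MASK,%0000.0011.1111.1111` and `EQU MASK,%0000001111111111` become the same text, while a dot inside a quoted string is left alone.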

Phil_G said:
Would love to hear some thoughts on what a good assembler should support and what it doesn't need to.
Some suggestions:
Choice of binary or Intel Hex output, the filenames of which are command line arguments rather than embedded in the source like SBASM

Actually, having them in the source was what I was planning... What do others think of this - which way is preferred? I suppose both could be supported, as long as specifying things like filenames is possible.

Phil_G said:
a 'Phase' directive whereby code is physically placed at a different address to the ORG (very handy for code to be ROM'ed and for TSR code that is copied to high memory before being run)

That's already planned. I call it "Offset" and it's just a number added to the PC that the assembler uses to offset where the code is stored, so I can assemble for another location in the current stream...
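One way to picture the phase/offset idea is an emitter that tracks two things separately: the address labels see, and the place the bytes are actually stored. The Python sketch below is only an illustration of that split; `Emitter`, `phase`, `dephase` and `store_address` are hypothetical names, not part of either assembler discussed here.

```python
class Emitter:
    def __init__(self, origin=0x0100):
        self.origin = origin       # load address of the first output byte
        self.pc = origin           # logical address used for labels
        self.image = bytearray()   # bytes in the order they are written out

    def store_address(self):
        """Address at which the next byte will physically sit when loaded."""
        return self.origin + len(self.image)

    def phase(self, address):
        """Assemble the following code as if it ran at 'address'."""
        self.pc = address

    def dephase(self):
        """Return to assembling for the physical location."""
        self.pc = self.store_address()

    def emit(self, data):
        self.image += bytes(data)
        self.pc += len(data)
```

The difference `pc - store_address()` plays the role of the offset: it is zero for ordinary code and non-zero for a block that will be copied elsewhere before it runs.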

Phil_G said:
Symbol table on request, rather than absent or enforced in the .PRN file. Pasmo doesn't even give you a .PRN listing!!! This is bad.

I don't understand this one. Even after googling PASMO and PRN? I think I've missed something here.

Phil_G said:
A few numbers after pass 2 would be nice: source code line-count and character-count, binary byte-count, last address used (ORG + byte-count), and a list of unreferenced labels & code blocks

I like that stuff too.. Maybe a little too much... It will definitely give output on success.

Phil_G said:
These are just my own personal thoughts, others will disagree as I admit I'm rather 'old school' !
Is this a disk-based assembler like ZSM or M80? or memory resident like Zeap and Zen? If memory-resident you'd need to allow for shifting things around, symbol table, scratchpad, stack etc
Good luck with the project
Cheers
Phil

I think I get something about the PRN from the above. It's a disk based assembler... Like a COM. Well, exactly a COM. And creates a COM by default, but probably creating hex files as an alternative option is a good idea. I wanted something that when it assembles, it creates a COM directly. Omitting the linker.

I wanted to write the original to fit entirely within the TPA and store symbols in memory, then write the COM in a single output pass.

It should support somewhere around 2000 to 4000 labels, constants etc. But later I'll update it to use my other architectures and use paged memory, and then I suppose letting it run in memory as a TSR would be a good idea for development... Especially since it could switch in and out of the TPA and current process at will. Even though the cross-assembler won't support it directly, the development environment includes an emulator which would. I do enjoy editing large comments under Notepad++ though, so it's more just an addition to the operating system I built, as I try to make that more complete. That means I need to write some missing elements that would round it out, and my current assembler is different to the ones of old, so that's another issue to address.

I like that idea also, adding in a resident mode - thank you. That won't be in version 1, which will be a vanilla CP/M app, but will be in later ones. At least for the z80 version. Then having an editor for it also makes sense.

Thanks for that. I've only started the latest write... So far I've only written the maths evaluation routines, labels and directives and the first pass run to set up the label table. I have to start adding in the instructions now so that it can initially generate byte counts, letting the first pass complete and fill in late labels. Then I will start writing the second pass to generate the correct code.

cj7hawk · Dec 10, 2023

durgadas311 said:
Regarding labels that can change value, DRI (at least) defined SET vs. EQU directives. Labels assigned with EQU must have the same value on each pass. Labels assigned with SET may change values, and even have multiple assignments (with different values) throughout code.

I like that idea.. Very little additional code to support it also... Thank you.

durgadas311 said:
Macros can be life-savers, although I don't use them often. When you need them, you REALLY REALLY need them.

Are you able to give an example of such an instance? Of a function that couldn't be included in the assembler itself? ( or that typically wouldn't be? )

durgadas311 said:
Not that I'm fond of it, but DRI also used '$' as a character ignored in labels and constants. I have used that to help with readability, but it can also be problematic if you are not consistent with labels (try searching for instances of a label that might be "foobar", "foo$bar", ...).

Good point. Though hopefully people won't include invisible characters in the labels... I'm not even sure how I'd deal with that if it arose... It feels like it comes down to the programmer.

cj7hawk · Dec 10, 2023

daver2 said:
You might want to consider a symbol for "the current location counter" as the assembler is assembling so that this can be used within expressions.

Presently I use ^ as that symbol, but I'm thinking of switching it to @ and using & for AND.

Also I'll have an offset so the assembler thinks it's at a different location to where it's writing code.

I figure after this I'll have separate codebases for the cross assembler and native assembler, so I might as well make any long term changes now before I get in too deep.

Phil_G · Dec 10, 2023

cj7hawk said:
I don't understand this one. Even after googling PASMO and PRN? I think I've missed something here.

Meaning the .PRN or sometimes called .LST file, the listing file formatted for printing. Pasmo is an example of a very popular Z80 assembler that I personally dislike because it doesn't give a list file output like M80, RMAC, AZ80, ZSM, Zen, Zeap, even TASM etc. all do.

For the current program counter value, $ is popular and feels natural to me. I'd suggest sticking to what exists, ie surely everyone expects ^ to be xor?
I'd suggest not using @ for the PC because many of us work on multiple cpus, some of which use @ to imply indirection - or again that might be just me...

Re listing and object-code filenames embedded in the source, I use SBASM a lot, and that imposition (no alternative) drives me nuts.
Say you want to try a small change, so you open your MYSRC.ASM source, do your change & save it as MYSRC2.ASM, but when you assemble it, it overwrites your 'good' list, hex and binary files so they now have the original name but the 'new' code. Unless of course you remember to update the embedded names, which for a quick (probably temporary) test is a pain, and then you have to remember to change them back again. Love SBASM, hate that 'feature'!
It also means that the output drive is fixed in the source code, and you may want it to go elsewhere, maybe just once - easy when specified in a command line argument, but more editing (and back) if its embedded.
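The command-line approach could look something like the Python sketch below, written for a cross-assembler front end. The program name `asm` and the option names (`-o`, `--hex`, `--listing`) are invented for illustration. The key assumption is that the output name defaults to the source name, so assembling MYSRC2.ASM never touches the files produced from MYSRC.ASM.

```python
import argparse
from pathlib import Path

def parse_args(argv):
    """Parse a hypothetical assembler command line; nothing here is embedded in the source file."""
    parser = argparse.ArgumentParser(prog="asm")
    parser.add_argument("source", help="assembly source file")
    parser.add_argument("-o", "--output", help="output file (default: derived from source name)")
    parser.add_argument("--hex", action="store_true", help="write Intel HEX instead of a binary")
    parser.add_argument("--listing", help="optional listing file")
    args = parser.parse_args(argv)
    if args.output is None:
        # Derive the output name from the source name so trial copies get their own outputs.
        suffix = ".hex" if args.hex else ".com"
        args.output = str(Path(args.source).with_suffix(suffix))
    return args
```

An explicit `-o` still lets the output go to another drive or name for a one-off run without editing the source.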

Svenska · Dec 10, 2023

cj7hawk said:
First is should a label be possible to reassign mid-assembly?

I think labels should never be changed once assigned, but their assignment can be deferred. Only values should be possible to change mid-assembly. Separate namespaces (segments) with their own PC values for symbols are useful if you want to support development using overlays or other advanced features.

cj7hawk said:
Second, is how long should a label be allowed to get? 11 characters? 16 characters? 24 characters? 32 characters? 79 characters?

Any length should be allowed, given memory constraints. C compilers must handle 31 significant characters (6 for external symbols), but I have written C code (closely following a specification) where 64 characters were insufficient. Modern C compilers support symbol names of up to 255 characters, which is reasonable.

cj7hawk said:
Add, Subtract, Divide, Multiply, AND, OR, MOD, XOR, Invert, Shift left/right, Neg ( though NEG can also be invert+1 ) - And is there any standard as to what symbols should be used to represent these functions? And with z80 is there any valid reason to calculate a result more than 16 bits in the assembler? - eg, LD HL, 100000 / 5

The modern world likes to use large numbers, and even CP/M disk sizes do not fit in 16-bit integers. Having 32-bit arithmetic support in the assembler is very useful, as well as extracting bytes and words from them. The eZ80 processors are binary compatible with the Z80, but support 24-bit addressing.
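A small Python sketch of what this could mean in practice: evaluate with wide arithmetic, then range-check and slice the result only when it is placed into an operand. The helper names (`fit`, `low_word`, `high_word`, `byte_n`) are hypothetical, and the 8 MB figure is just an illustrative value that does not fit in 16 bits.

```python
def fit(value, bits):
    """Return value as an unsigned field of the given width, or raise if it cannot fit."""
    if not -(1 << (bits - 1)) <= value < (1 << bits):
        raise OverflowError(f"{value} does not fit in {bits} bits")
    return value & ((1 << bits) - 1)

def low_word(value):
    return value & 0xFFFF

def high_word(value):
    return (value >> 16) & 0xFFFF

def byte_n(value, n):
    """Extract byte n of a value, with byte 0 being the least significant."""
    return (value >> (8 * n)) & 0xFF

# The intermediate 100000 needs more than 16 bits, but the final operand does not.
operand = fit(100000 // 5, 16)

# An illustrative 8 MB size in bytes, split into two 16-bit words for storage.
size = 8 * 1024 * 1024
words = (low_word(size), high_word(size))
```

The design choice here is that overflow is an error at the point of use, not during evaluation, so `LD HL, 100000 / 5` is accepted while `LD HL, 100000` is rejected.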

cj7hawk said:
Fourthly, how useful are macros? Or includes? And how best to use them?

Macros are very useful in few cases. Even relatively simple macro support can often replace an external source code preprocessor. Includes are a good way to structure larger programs. How to use them best should be left to the programmer; I am generally not a fan of opinionated tools.

cj7hawk said:
Finally, should a line have length limits or wrap around? Or should it just read until EOL is reached, ignoring everything after the comment marker until the final EOL is found, and assuming whitespace after the initial command is to be ignored, so that LD HL , 10 would be the same as LD HL,10 without spaces?

Any restrictions on line length and whitespace formatting are usually caused by a stupid parser design. Avoid making programming hard for the programmer. Turn your source code into a token stream first, then continue from there. The assembler video by hjalfi may be a good start.
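A minimal Python sketch of the "token stream first" approach, under the assumption that ; starts a comment and that numbers use the $ and % prefixes mentioned earlier in the thread. The token categories and the `tokenize` function are hypothetical.

```python
import re

# Each alternative is a named token class; anything that is not whitespace falls through to 'punct'.
TOKEN_RE = re.compile(r"""
    (?P<comment>;.*)                        |
    (?P<string>'[^']*'|"[^"]*")             |
    (?P<number>\$[0-9A-Fa-f]+|%[01]+|\d+)   |
    (?P<name>[A-Za-z_][A-Za-z0-9_]*)        |
    (?P<punct>[^\s])                        |
    (?P<space>\s+)
""", re.VERBOSE)

def tokenize(line):
    """Turn one source line of any length into (kind, text) pairs, dropping spaces and comments."""
    tokens = []
    for m in TOKEN_RE.finditer(line):
        kind = m.lastgroup
        if kind in ("space", "comment"):
            continue
        tokens.append((kind, m.group()))
    return tokens
```

Because whitespace never reaches the parser, `LD HL , 10` and `LD HL,10` produce the same tokens, and there is no built-in limit on line length.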

cj7hawk said:
What characters should be allowed in Constant names?

Any UTF-8 should be acceptable. Your assembler doesn't need to care. Handling string constants (messages) is more interesting, but CP/M leaves character encodings to the terminal.

cj7hawk said:
And creates a COM by default, but probably creating hex files as an alternative option is a good idea. I wanted something that when it assembles, it creates a COM directly. Omitting the linker.

Creating COM files is practical, and HEX files are nice to have. PRN files (human-readable combination of output and input) are very helpful when developing, and a SYM file (listing all symbols and their values) is great for debugging. All of these should be optional.
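For the HEX option, the Intel HEX record layout is simple enough to sketch in a few lines of Python: a colon, byte count, 16-bit address, record type, data, and a two's-complement checksum. The function names `hex_record` and `to_intel_hex` and the 16-byte chunk size are choices made for this example; only data (type 00) and end-of-file (type 01) records are produced.

```python
def hex_record(address, data, record_type=0x00):
    """Build one Intel HEX record for up to 255 bytes of data."""
    body = bytes([len(data), (address >> 8) & 0xFF, address & 0xFF, record_type]) + bytes(data)
    checksum = (-sum(body)) & 0xFF  # two's complement of the byte sum
    return ":" + body.hex().upper() + f"{checksum:02X}"

def to_intel_hex(image, origin=0x0100, chunk=16):
    """Convert a binary image loaded at 'origin' into Intel HEX text."""
    lines = [
        hex_record(origin + i, image[i:i + chunk])
        for i in range(0, len(image), chunk)
    ]
    lines.append(hex_record(0, b"", 0x01))  # end-of-file record
    return "\n".join(lines) + "\n"
```

For example, a three-byte image `C3 00 01` (JP 0100h) at origin 0100h becomes the record `:03010000C3000138` followed by `:00000001FF`.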

For a Z80 assembler, a switch restricting it to 8080 instructions is useful. If set, the assembler should warn or fail on Z80-only instructions. A similar switch could be used for eZ80 support.

Phil_G · Dec 10, 2023

Bear in mind that a .com file and a binary object-file are the same thing, so thats two birds with one stone. Which is which depends only on whether it is ORG'ed at 100h or not

Re the current PC, I just thought, $ is often used for hex notation, $42 = 0x42 = 42h ( = 66d )

Svenska · Dec 10, 2023

Phil_G said:
Bear in mind that a .com file and a binary object-file are the same thing, so thats two birds with one stone.

Object files contain relocation information, COM files are fully relocated. It is possible to assemble at both 0x0000 and 0x0100 to get relocation information if the assembler doesn't support it.

Phil_G said:
Re the current PC, I just thought, $ is often used for hex notation, $42 = 0x42 = 42h ( = 66d )

Yes, support for all common notations is desirable.

Chuck(G) · Dec 10, 2023

My biggest requirements are (1) A really powerful macro facility with assembly-time variables--you should be able to do character manipulation easily. (2) Strong typing with associated conditionals (3) data structures, including bitfields and initialization constructs, as well as arrays (4) opdefs for new or custom instructions.
On x86, I've used every version of MASM from 1.0 to 6.14. Things didn't really start to get what I'd call "friendly" until MASM 5 or so. There are still a couple of "you can't get there from heres" in 6, but they can be worked around.
If you want to see an assembler with a decent macro facility, consider the more-than-50-year-old H level assembler for S/370 (on bitsavers).

But much of my career has been programming nothing but assembly. If you consider C to be a high-level assembly language, then almost all of my experience has been in assembly.

Phil_G · Dec 10, 2023

Svenska said:
Object files contain relocation information, COM files are fully relocated. It is possible to assemble at both 0x0000 and 0x0100 to get relocation information if the assembler doesn't support it.

Only in a relocating assembler like RMAC, yes, but unless I missed that requirement we're not talking about producing relocatable code, are we? In which case a .com is simply the bin file ORG'ed at 100h.

Providing 'phase' or what David calls 'offset' isnt the same as creating relocatable code, I know you know that, I'm just explaining myself

Svenska · Dec 10, 2023

Yes, but that doesn't make it an object file. Binary file yes, object file no.

Phil_G · Dec 10, 2023

Then we agree, but in that one post I inattentively used the phrase " binary object file" when I should have said "binary file"

In my defence, within the documentation for several non-relocating assemblers, the output is often referred to as an object file even though relocation is not provided - however I accept that to the dictionary definition you're right. I'll climb back into my box now

cj7hawk · Dec 11, 2023

Phil_G said:
Bear in mind that a .com file and a binary object-file are the same thing, so thats two birds with one stone. Which is which depends only on whether it is ORG'ed at 100h or not
Re the current PC, I just thought, $ is often used for hex notation, $42 = 0x42 = 42h ( = 66d )

This is also where I use the $ sign as Hex. Assemblers and High Level languages are maybe not a good mix, but learning the symbols of a few more assemblers might help put together a good final symbol set.

I do want to keep this as small and simple as possible, and presently I have a very simple lexical analyser. It picks up the next token to process very efficiently, either stopping on whitespace/operator or just on operator while ignoring whitespace. As such I don't really have any arbitrary length limitations yet, but any operators must be a single character - Things like >> aren't acceptable, but > is.

I'll see what options are... C seems a bit far from compatible with the approach I've used, but now you've mentioned it, I should be able to put together a list of assembler operator functions.

Is there any advantage to relocatable code in CP/M? Where most code either loads at 100 or as a bin file destined for a different location? Or with an offset, so a loader can load at 100 and then relocate the code to the original intended location before execution?

Technically I get that's not relocatable code, since once linked it can't be relocated unless written that way. Is it just a case of keeping a list of any fixed address vectors and adding the program load offset to them? Nearly all of the code I wrote was for embedded applications, and I never found a use for relocatable code on z80 or encountered it. What use did CP/M make of relocatable code directly? Or was that in things like MPM?

Chuck(G) said:
My biggest requirements are (1) A really powerful macro facility with assembly-time variables--you should be able to do character manipulation easily. (2) Strong typing with associated conditionals (3) data structures, including bitfields and initialization constructs, as well as arrays (4) opdefs for new or custom instructions.

Do you know if any assemblers offered "custom" instructions in the assembler itself? I imagine source code wasn't that common for assemblers, but would access to the assembler source meet the requirement to add new instructions? I wonder how much of what is described, with the exception of Macro source, could be achieved by changing the assembler directly to add new instructions such as ez80 ones as mentioned. I wonder if this might be an alternative to having to use macros to create new instructions.

It should be easier to add instructions to the assembler than to create a table of references to code chunks.

Also, when you say character manipulation, do you mean adding characters into a formula to represent a number value - eg, 'A'+$80 ?

Thanks
David.

tofro · Dec 11, 2023

For "normal" programs, you don't really need or want relocatable code. But if you want to write RSX (resident system extensions) modules or GSX drivers, (and some "more exotic" system extensions that need to relocate themselves to high memory), you need to have relocatable programs - Especially when you're loading more than one driver or RSX, the load address is not known beforehand.

daver2 · Dec 11, 2023

>>> Do you know if any assemblers offered "custom" instructions in the assembler itself?

This is the purpose of a meta assembler (see https://www.farnell.com/datasheets/100571.pdf for details of Cross-32). This is an assembler whereby the specific instruction set is defined in a table. The end user can define their own tables. I have created my own tables for a number of specific processors not supported "out of the box" by cross-32.
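The table-driven idea can be shown with a toy Python sketch. This is not how Cross-32 defines its tables; the `TABLE` layout, the operand pattern strings and the `encode` function are invented for illustration, and only a handful of Z80 opcodes are included.

```python
# Hypothetical table: (mnemonic, operand pattern) -> (opcode bytes, immediate width in bits or None).
TABLE = {
    ("NOP", ""): ([0x00], None),
    ("RET", ""): ([0xC9], None),
    ("LD", "A,n"): ([0x3E], 8),
    ("LD", "HL,nn"): ([0x21], 16),
    ("JP", "nn"): ([0xC3], 16),
}

def encode(mnemonic, pattern="", value=0, table=TABLE):
    """Look the instruction up in the table and append a little-endian immediate if it has one."""
    opcode, width = table[(mnemonic.upper(), pattern)]
    out = bytes(opcode)
    if width == 8:
        out += bytes([value & 0xFF])
    elif width == 16:
        out += bytes([value & 0xFF, (value >> 8) & 0xFF])
    return out

# Adding an instruction is a table edit, not a code change: here the Z180 MLT BC (ED 4C).
custom = dict(TABLE)
custom[("MLT", "BC")] = ([0xED, 0x4C], None)
```

The encoder itself knows nothing about the Z80, so supporting another processor means supplying a different table.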

Dave

cj7hawk · Dec 11, 2023

tofro said:
For "normal" programs, you don't really need or want relocatable code. But if you want to write RSX (resident system extensions) modules or GSX drivers, (and some "more exotic" system extensions that need to relocate themselves to high memory), you need to have relocatable programs - Especially when you're loading more than one driver or RSX, the load address is not known beforehand.

That makes sense, thank you for the example. I guess my current architecture would just replace these with process hooks, so they could all install in the same location ( Typically $1000 ) and page in/out as called, while still being able to access common tables and other data in the original TPA. So it's pseudo relocatable from that perspective. But from your example, assuming I saved a list of fixed vectors to update with the new code location while loading, I could relocate code anywhere in memory.
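The "list of fixed vectors" approach could be sketched in Python as below. It assumes the assembler saved, for every absolute 16-bit address in the image, the offset of its low byte; `relocate` and its parameters are hypothetical names, and this is not a description of any particular CP/M loader.

```python
def relocate(image, fixups, delta):
    """Return a copy of 'image' with 'delta' added to each 16-bit little-endian address.

    'fixups' lists the offset of the low byte of every address that must move with the code.
    """
    out = bytearray(image)
    for off in fixups:
        address = out[off] | (out[off + 1] << 8)
        address = (address + delta) & 0xFFFF   # wrap within the 64K address space
        out[off] = address & 0xFF
        out[off + 1] = address >> 8
    return bytes(out)
```

For example, a `JP 0105h` assembled for 0100h (`C3 05 01`) moved to E000h becomes `C3 05 E0`. Unlike a page-aligned scheme, this form allows relocation to any byte address at the cost of a two-byte patch per entry.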

Do you know if CP/M had a common loader for such code, or did linkers get used for this purpose?
