
Writing Assemblers... What should a good assembler do?

cj7hawk

Hi All,

I'm just rewriting my z80 assembler in z80 this time ( So it can assemble itself - and will be compatible with the cross-assembler I use under Windows11) - and had some thoughts about whether to change aspects of it and thought I'd throw a couple of questions out to the forum.

First is: should a label be possible to reassign mid-assembly? Generally this should cause an exception, but if allowed, it would let you cut and paste code segments and change constants without changing the source code.

My thinking on this one is fail on the first pass, but are there any assemblers that allow reassignment of constants and is this ever of value?

Second, is how long should a label be allowed to get? 11 characters? 16 characters? 24 characters? 32 characters? 79 characters?

I'm still building the lexical analyzer at the moment, and want to get these things right.

Thirdly, what operators should be allowed on functions?

Add, Subtract, Divide, Multiply, AND, OR, MOD, XOR, Invert, Shift left/right, Neg ( though NEG can also be invert+1 ) - And is there any standard as to what symbols should be used to represent these functions? And with z80 is there any valid reason to calculate a result more than 16 bits in the assembler? - eg, LD HL, 100000 / 5

Fourthly, how useful are macros? Or includes? And how to best use them?

For example, an include could "chain" another assembly file to the current, or could be as simple as reading it during the first pass and picking up labels for jumps. Or they could be two different things... And Macros can be confusing in assembly and promote bad code that is difficult to read, and a good assembler should be able to do everything in the line... Including flipping 8th bit on text strings, etc. What should a Macro do that you can't do in the general assembly file itself?

Fifthly, I'm looking to use . as a "null" character - that except in quotes, is simply ignored by everything... So .ORG and ORG are valid, and it can be used to help with readability - eg, .EQU MASK,%0000.0011.1111.1111
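
A minimal Python sketch of the "null character" idea, assuming the dot is stripped from the raw line before any other parsing and that both quote styles protect their contents. The function name `strip_null_dots` is hypothetical, and the sketch does not treat `;` comments specially.

```python
def strip_null_dots(line: str) -> str:
    """Remove '.' everywhere except inside '...' or "..." quoted text."""
    out = []
    quote = None  # the quote character that is currently open, if any
    for ch in line:
        if quote:
            out.append(ch)
            if ch == quote:      # the same quote character closes the string
                quote = None
        elif ch in "'\"":
            quote = ch
            out.append(ch)
        elif ch != ".":          # outside quotes a dot is simply dropped
            out.append(ch)
    return "".join(out)


print(strip_null_dots(".EQU MASK,%0000.0011.1111.1111"))
print(strip_null_dots("DB 'a.b'"))
```

With this approach .ORG and ORG reach the directive lookup as the same text, so no second spelling has to be stored in the directive table.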

Finally, should a line have length limits or wrap around? Or should it just read until EOL is reached, ignoring everything after the comment until the final EOL is found, assuming whitespace after the initial command is to be ignored? So that LD HL , 10 would be the same as LD HL,10 without spaces.


Would love to hear some thoughts on what a good assembler should support and what it doesn't need to,

Thanks
David
 
And one extra I forgot to add... What characters should be allowed in Constant names? At the moment, I'm making it case insensitive. It allows numbers, underscores and letters, and doesn't generally permit other symbols, but doesn't specifically exclude them either, so things like ROUTINE(1): would be acceptable at the moment, as would START[INITIALISE]: The < and > characters aren't permitted though, as these are shift operators... I'm not sure why I chose those, but my cross assembler already had them assigned, so I kept them for shift left and shift right.

In case it helps with an earlier question if anyone can assist, here's a list of my operators. ( and ; is comment to EOL, and . is ignore character)

; Operators
; , separator
; + add
; - subtract
; / divide
; \ modulo ( remainder from divide ).
; * multiply
; @ and
; # or
; $ hexadecimal value follows.
; % binary number value follows.
; ' single quote means a byte or series of bytes follows in ASCII. 8 bits. Quotes are NOT normal operators.
; " same as single quote, but must also be closed with a double. eg, '"' and "'" are both valid.
; < rotate left ( Only on immediate value... Can be chained. )
; > rotate right ( Only on immediate value... Can be chained. )
; ^ current program counter ( Without offset ).
; ! invert current value. !+1 = Make Negative.

Also, I keep the maths very simple. A bit like polish notation in linear progression, so it looks normal, but the operator acts on the immediate value and the next value. Quotes can be ' or " however must be terminated by the same quote. ( I am tired of putting the missing quote value into code as an ASCII code ).

eg,

1+2 * 3 would give 9 since it evaluates terms in order.
%01010101 @ "AB" would treat A as the LSB, apply the logical AND to it, and would delete the B since there's no bits in the upper 8 bits of the AND.

Also I don't distinguish between 8 bits and 16 bits except as is relevant to the command, which may trap the exception if it detects the wrong number of bits - eg, LD A,$1234 would give an error, but LD HL,%0101 would not, and H would be 0. However LD A,$234/60 would work since the result is 8 bits.
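
The strict left-to-right evaluation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it covers the operators + - * / \ @ # from the list, the names `evaluate` and `check_width` are hypothetical, intermediate values are not truncated, and a quoted string is packed with its first character as the least significant byte.

```python
import re

# One token: $hex, %binary, decimal, a quoted string, or a one-character operator.
TOKEN = re.compile(r"""\s*(?:\$([0-9A-Fa-f]+)|%([01]+)|(\d+)|'([^']*)'|"([^"]*)"|([-+*/\\@#]))""")

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,   # integer divide
    "\\": lambda a, b: a % b,   # modulo
    "@": lambda a, b: a & b,    # AND
    "#": lambda a, b: a | b,    # OR
}


def evaluate(expr: str) -> int:
    """Apply each operator to the running value and the next term, in order."""
    expr = expr.rstrip()
    value, pending, pos = None, None, 0
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected text: {expr[pos:]!r}")
        pos = m.end()
        hexa, bina, dec, sq, dq, op = m.groups()
        if op is not None:
            if value is None or pending is not None:
                raise ValueError("operator without a left-hand value")
            pending = op
            continue
        if hexa is not None:
            term = int(hexa, 16)
        elif bina is not None:
            term = int(bina, 2)
        elif dec is not None:
            term = int(dec)
        else:  # quoted text: first character becomes the low byte
            text = sq if sq is not None else dq
            term = int.from_bytes(text.encode("ascii"), "little")
        if value is None:
            value = term
        elif pending is None:
            raise ValueError("two values without an operator")
        else:
            value, pending = OPS[pending](value, term), None
    if value is None or pending is not None:
        raise ValueError("incomplete expression")
    return value


def check_width(value: int, bits: int) -> int:
    """Let the instruction decide whether the final result fits its operand."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{value} does not fit in {bits} bits")
    return value


print(evaluate("1+2 * 3"))                    # 9
print(hex(evaluate('%01010101 @ "AB"')))      # 0x41
print(check_width(evaluate("$234/60"), 8))    # 9
```

Checking the width only at the end is what lets LD A,$234/60 pass while LD A,$1234 fails, as in the examples above.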

Thanks for any input.

David.
 
I'm just rewriting my z80 assembler in z80 this time ( So it can assemble itself - and will be compatible with the cross-assembler I use under Windows11) - and had some thoughts about whether to change aspects of it and thought I'd throw a couple of questions out to the forum.
Just my personal observations following a Zen port, Zen assembles itself ok.

First is should a label be possible to reassign mid-assembly?
Changing an equate 'live' might be useful, changing a label might cause a helluva tangle (for the user, maybe not for the assembler)

Second, is how long should a label be allowed to get? 11 characters? 16 characters? 24 characters? 32 characters? 79 characters?
For a nice tidy .PRN file its usual to allow one 'tab' for labels, sometimes two tabs. So labels >7 or 15 need to be on a separate line, personally I dont like that so 16 max. And always with a ':' :)

Thirdly, what operators should be allowed on functions?
Add, Subtract, Divide, Multiply, AND, OR, MOD, XOR, Invert, Shift left/right, Neg ( though NEG can also be invert+1 ) - And is there any standard as to what symbols should be used to represent these functions? And with z80 is there any valid reason to calculate a result more than 16 bits in the assembler?

I'd suggest standard 'c' symbols as they're the most familiar, 16 bit maths is fine for me. Using 32 could slow down assembly?

Fourthly, how useful are macros? Or includes? And how to best use them?
I avoid macros and I find other peoples source hard to follow if they've used macros extensively.
As a feature its good provided its not over used - some almost redefine an entire instruction set using macros, very hard to decipher years later!
Of course having facilitated macros, you've no control over how they are used!
They also tend to produce repetitive code blocks, and larger symbol tables to accommodate the local labels - I think generally 'optimisation' in the compiler sense isnt a desirable feature.
This might well be just me though :) Includes are useful, some might say essential.


Fifthly, I'm looking to use . as a "null" character - that except in quotes, is simply ignored by everything... So .ORG and ORG are valid, and it can be used to help with readability - eg, .EQU MASK,%0000.0011.1111.1111
Ok but its unexpected behaviour which might confuse, also I'd expect the syntax to be label first, MASK: .EQU %0000.0011.1111.1111

Finally, should a line have length limits or wrap around?
The user would normally self regulate to an 80-column screen, but sometimes its handy, occasionally essential, to have a long line. EOL is good.

Would love to hear some thoughts on what a good assembler should support and what it doesn't need to,
Some suggestions:
Choice of binary or Intel Hex output, the filenames of which are command line arguments rather than embedded in the source like SBASM
a 'Phase' directive whereby code is physically placed at a different address to the ORG (very handy for code to be ROM'ed and for TSR code that is copied to high memory before being run)
Symbol table on request, rather than absent or enforced in the .PRN file. Pasmo doesnt even give you a .PRN listing!!! this is bad :)
A few numbers after pass 2 would be nice: source code line-count and character-count, binary byte-count, last address used (ORG + byte-count), and a list of unreferenced labels & code blocks
These are just my own personal thoughts, others will disagree as I admit I'm rather 'old school' ! :)
Is this a disk-based assembler like ZSM or M80? or memory resident like Zeap and Zen? If memory-resident you'd need to allow for shifting things around, symbol table, scratchpad, stack etc
Good luck with the project
Cheers
Phil
 
You might want to consider a symbol for "the current location counter" as the assembler is assembling so that this can be used within expressions.

I use Macros and Include files in assembler code very often. I generally prefer them to be "text substitution" - but this is a whole minefield...

The concept of 'changing' constants and labels on the fly sounds like a recipe for disaster - unless it is controlled. If you are going to implement Macros, you will need some form of local label mechanism.

DEC's MACRO-11 feature of .<number> after a valid label is a good feature to specify local labels that don't really benefit from being named.

Dave
 
Regarding labels that can change value, DRI (at least) defined SET vs. EQU directives. Labels assigned with EQU must have the same value on each pass. Labels assigned with SET may change values, and even have multiple assignments (with different values) throughout code.
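
A rough Python sketch of that SET versus EQU distinction, assuming a case-insensitive symbol table. The class and method names are hypothetical and this is not how any DRI assembler is actually implemented.

```python
class SymbolTable:
    """EQU symbols are fixed; SET symbols may be reassigned at any time."""

    def __init__(self):
        self.values = {}
        self.reassignable = set()   # names that were created with SET

    def equ(self, name: str, value: int) -> None:
        key = name.upper()
        if key in self.reassignable:
            raise ValueError(f"{name} was defined with SET")
        # Seeing the same value again (for example on pass 2) is fine.
        if key in self.values and self.values[key] != value:
            raise ValueError(f"{name} redefined: {self.values[key]} -> {value}")
        self.values[key] = value

    def set(self, name: str, value: int) -> None:
        key = name.upper()
        if key in self.values and key not in self.reassignable:
            raise ValueError(f"{name} was defined with EQU")
        self.reassignable.add(key)
        self.values[key] = value


table = SymbolTable()
table.equ("MASK", 0x03FF)
table.equ("mask", 0x03FF)   # same value on a later pass: accepted
table.set("COUNT", 1)
table.set("COUNT", 2)       # reassignment is the whole point of SET
print(table.values)
```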

Macros can be life-savers, although I don't use them often. When you need them, you REALLY REALLY need them.

Not that I'm fond of it, but DRI also used '$' as a character ignored in labels and constants. I have used that to help with readability, but it can also be problematic if you are not consistent with labels (try searching for instances of a label that might be "foobar", "foo$bar", ...).
 
Thirdly, what operators should be allowed on functions?
Add, Subtract, Divide, Multiply, AND, OR, MOD, XOR, Invert, Shift left/right, Neg ( though NEG can also be invert+1 ) - And is there any standard as to what symbols should be used to represent these functions? And with z80 is there any valid reason to calculate a result more than 16 bits in the assembler?

I'd suggest standard 'c' symbols as they're the most familiar, 16 bit maths is fine for me. Using 32 could slow down assembly?

I really liked this idea until I looked at what the C symbols were. And realized that my lexical analyzer is completely different from the C one. The algorithm I use makes it pretty easy to do stuff like add inline code references, macros etc ( even though I haven't supported them yet ) and new assembler directives. But there are constraints. Still, the idea is interesting and some crossover might be worthwhile... I'll look deeper into that - thank you.

Fifthly, I'm looking to use . as a "null" character - that except in quotes, is simply ignored by everything... So .ORG and ORG are valid, and it can be used to help with readability - eg, .EQU MASK,%0000.0011.1111.1111
Ok but its unexpected behaviour which might confuse, also I'd expect the syntax to be label first, MASK: .EQU %0000.0011.1111.1111

This one seems unique to my assembler. It also accepts .EQU MASK: %0000 0011 1111 1111 as well, but making the dot invisible solved the issue of some compilers using .ORG and others using ORG - And when I'm checking it much later, it's so much easier to be able to break up long binary and even hexadecimal and decimal numbers.

Would love to hear some thoughts on what a good assembler should support and what it doesn't need to,
Some suggestions:
Choice of binary or Intel Hex output, the filenames of which are command line arguments rather than embedded in the source like SBASM

Actually, having them in the source was what I was planning... What do others think of this - which way is preferred? I suppose supporting both is possible, as long as things like filenames can be given either way.

a 'Phase' directive whereby code is physically placed at a different address to the ORG (very handy for code to be ROM'ed and for TSR code that is copied to high memory before being run)

That's already planned. I call it "Offset" and it's just a number added to the PC that the assembler uses to offset where the code is stored, so I can assemble for another location in the current stream...
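
One way to picture the "Offset" idea is to keep the program counter and the storage address as separate quantities. The Python sketch below assumes the storage address is simply pc + offset; the class name `Output` and its fields are hypothetical.

```python
class Output:
    def __init__(self, org: int = 0x0100):
        self.pc = org       # address the code believes it will run at
        self.offset = 0     # added to pc to decide where bytes are stored
        self.image = {}     # storage address -> byte value

    def emit(self, *data: int) -> None:
        for b in data:
            self.image[self.pc + self.offset] = b & 0xFF
            self.pc += 1


out = Output(org=0x0100)
out.emit(0x00)                   # NOP, stored at 0x0100 and runs at 0x0100

# Code meant to run at 0xF000 but stored right after the NOP in the file.
out.pc = 0xF000
out.offset = 0x0101 - 0xF000
out.emit(0xC3, 0x00, 0xF0)       # JP 0xF000, stored at 0x0101..0x0103

print({hex(addr): hex(b) for addr, b in sorted(out.image.items())})
```

Labels defined while the offset is active take their value from `pc`, so jumps inside the relocated block point at 0xF000 and up even though the bytes sit low in the output.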

Symbol table on request, rather than absent or enforced in the .PRN file. Pasmo doesnt even give you a .PRN listing!!! this is bad :)

I don't understand this one. Even after googling PASMO and PRN? I think I've missed something here.


A few numbers after pass 2 would be nice: source code line-count and character-count, binary byte-count, last address used (ORG + byte-count), and a list of unreferenced labels & code blocks

I like that stuff too.. Maybe a little too much... It will definitely give output on success.

These are just my own personal thoughts, others will disagree as I admit I'm rather 'old school' ! :)
Is this a disk-based assembler like ZSM or M80? or memory resident like Zeap and Zen? If memory-resident you'd need to allow for shifting things around, symbol table, scratchpad, stack etc
Good luck with the project
Cheers
Phil

I think I get something about the PRN from the above. It's a disk based assembler... Like a COM. Well, exactly a COM. And creates a COM by default, but probably creating hex files as an alternative option is a good idea. I wanted something that when it assembles, it creates a COM directly. Omitting the linker.

I wanted to write the original to fit entirely within the TPA and store symbols in memory, then write the COM in a single output pass.

It should support somewhere around 2000 to 4000 labels, constants etc. Later I'll update it to use my other architectures and paged memory, and then I suppose letting it run in memory as a TSR would be a good idea for development, especially since it could switch in and out of the TPA and current process at will. The cross-assembler won't support that directly, but the development environment includes an emulator which would. I do enjoy editing large comments under Notepad++ though, so this is more an addition to the operating system I built as I try to make that more complete. That means I need to write some missing elements that would round it out, and my current assembler is different to the ones of old, so that's another issue to address.

I like that idea also, adding in a resident mode - thank you. That won't be in version 1, which will be a vanilla CP/M app, but will be in later ones. At least for the z80 version. Then having an editor for it also makes sense.

Thanks for that. I've only started the latest write... So far I've only written the maths evaluation routines, labels and directives, and the first pass run to set up the label table. I have to start adding in the instructions now so that the first pass can generate byte counts, complete, and fill in late labels. Then I will start writing the second pass to generate the correct code.
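
The two-pass structure described here can be sketched in Python with a three-instruction subset (NOP, RET and JP nn, using the standard Z80 opcodes 00, C9 and C3). The tuple-based source format and the names `SIZES` and `assemble` are assumptions made for the illustration.

```python
SIZES = {"NOP": 1, "RET": 1, "JP": 3}   # bytes generated per instruction


def assemble(lines, org=0x0100):
    """lines is a list of (label or None, mnemonic, operand or None)."""
    # Pass 1: only count bytes, so that every label gets an address.
    labels, pc = {}, org
    for label, mnemonic, _ in lines:
        if label:
            if label in labels:
                raise ValueError(f"duplicate label {label}")
            labels[label] = pc
        pc += SIZES[mnemonic]

    # Pass 2: every label is known now, so forward references resolve.
    code = bytearray()
    for _, mnemonic, operand in lines:
        if mnemonic == "NOP":
            code.append(0x00)
        elif mnemonic == "RET":
            code.append(0xC9)
        elif mnemonic == "JP":
            target = labels[operand]
            code += bytes([0xC3, target & 0xFF, target >> 8])
    return bytes(code), labels


program = [
    ("START", "JP", "END"),   # forward reference to a label defined later
    (None, "NOP", None),
    ("END", "RET", None),
]
code, labels = assemble(program)
print(code.hex(" "), labels)
```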
 
Regarding labels that can change value, DRI (at least) defined SET vs. EQU directives. Labels assigned with EQU must have the same value on each pass. Labels assigned with SET may change values, and even have multiple assignments (with different values) throughout code.

I like that idea.. Very little additional code to support it also... Thank you.

Macros can be life-savers, although I don't use them often. When you need them, you REALLY REALLY need them.

Are you able to give an example of such an instance? Of a function that couldn't be included in the assembler itself? ( or that typically wouldn't be? )

Not that I'm fond of it, but DRI also used '$' as a character ignored in labels and constants. I have used that to help with readability, but it can also be problematic if you are not consistent with labels (try searching for instances of a label that might be "foobar", "foo$bar", ...).

Good point. Though hopefully people won't include invisible characters in the labels... I'm not even sure how I'd deal with that if it arose... It feels like it comes down to the programmer.
 
You might want to consider a symbol for "the current location counter" as the assembler is assembling so that this can be used within expressions.

Presently I use ^ as that symbol, but I'm thinking of switching it to @ and using & for AND.

Also I'll have an offset so the assembler thinks it's at a different location to where it's writing code.

I figure after this I'll have separate codebases for the cross assembler and native assembler, so I might as well make any long term changes now before I get in too deep.
 
I don't understand this one. Even after googling PASMO and PRN? I think I've missed something here.

Meaning the .PRN or sometimes called .LST file, the listing file formatted for printing. Pasmo is an example of a very popular Z80 assembler that I personally dislike because it doesnt give a list file output like M80, RMAC, AZ80, ZSM, Zen, Zeap, even TASM etc all do. :)
For the current program counter value, $ is popular and feels natural to me. I'd suggest sticking to what exists, ie surely everyone expects ^ to be xor?
I'd suggest not using @ for the PC because many of us work on multiple cpus, some of which use @ to imply indirection - or again that might be just me... :)

Re listing and object-code filenames embedded in the source, I use SBASM a lot, and that imposition (no alternative) drives me nuts.
Say you want to try a small change, so you open your MYSRC.ASM source, do your change & save it as MYSRC2.ASM, but when you assemble it it overwrites your 'good' list, hex and binary files so they now have the original name but the 'new' code. Unless of course you remember to update the embedded names, which for a quick (probably temporary) test, is a pain :) and then you have to remember to change them back again. Love SBASM, hate that 'feature' !
It also means that the output drive is fixed in the source code, and you may want it to go elsewhere, maybe just once - easy when specified in a command line argument, but more editing (and back) if its embedded.
 
First is should a label be possible to reassign mid-assembly?
I think labels should never be changed once assigned, but their assignment can be deferred. Only values should be possible to change mid-assembly. Separate namespaces (segments) with their own PC values for symbols are useful if you want to support development using overlays or other advanced features.

Second, is how long should a label be allowed to get? 11 characters? 16 characters? 24 characters? 32 characters? 79 characters?
Any length should be allowed, given memory constraints. C compilers must handle 31 significant characters (6 for external symbols), but I have written C code (closely following a specification) where 64 characters were insufficient. Modern C compilers support symbol names of up to 255 characters, which is reasonable.

Add, Subtract, Divide, Multiply, AND, OR, MOD, XOR, Invert, Shift left/right, Neg ( though NEG can also be invert+1 ) - And is there any standard as to what symbols should be used to represent these functions? And with z80 is there any valid reason to calculate a result more than 16 bits in the assembler? - eg, LD HL, 100000 / 5
The modern world likes to use large numbers, and even CP/M disk sizes do not fit in 16-bit integers. Having 32-bit arithmetic support in the assembler is very useful, as well as extracting bytes and words from them. The eZ80 processors are binary compatible with the Z80, but support 24-bit addressing.
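
A small Python illustration of why wider intermediate arithmetic matters, plus byte and word extraction. The helper names `low_word`, `high_word` and `byte_of` are hypothetical, not operators from any particular assembler.

```python
def low_word(value: int) -> int:
    return value & 0xFFFF


def high_word(value: int) -> int:
    return (value >> 16) & 0xFFFF


def byte_of(value: int, index: int) -> int:
    """Byte 0 is the least significant byte."""
    return (value >> (8 * index)) & 0xFF


# LD HL, 100000 / 5 : the result fits in 16 bits, the intermediate does not.
print(100000 // 5)               # 20000 with wide arithmetic
print((100000 & 0xFFFF) // 5)    # 6892 if 100000 is truncated to 16 bits first

# An 8 MB disk size split into two 16-bit words for storage in a table.
size = 8 * 1024 * 1024
print(hex(high_word(size)), hex(low_word(size)), hex(byte_of(size, 2)))
```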

Fourthly, how useful are macros? Or includes? And how to best use them?
Macros are very useful in few cases. Even relatively simple macro support can often replace an external source code preprocessor. Includes are a good way to structure larger programs. How to use them best should be left to the programmer; I am generally not a fan of opinionated tools.

Finally, should a line have length limits or wrap around? Or should it just read until EOL is reached, ignoring everything after the comment until the final EOL is found, assuming whitespace after the initial command is to be ignored? So that LD HL , 10 would be the same as LD HL,10 without spaces.
Any restrictions on line length and whitespace formatting are usually caused by a stupid parser design. Avoid making programming hard for the programmer. Turn your source code into a token stream first, then continue from there. The assembler video by hjalfi at
may be a good start.

What characters should be allowed in Constant names?
Any UTF-8 should be acceptable. Your assembler doesn't need to care. Handling string constants (messages) is more interesting, but CP/M leaves character encodings to the terminal.

And creates a COM by default, but probably creating hex files as an alternative option is a good idea. I wanted something that when it assembles, it creates a COM directly. Omitting the linker.
Creating COM files is practical, and HEX files are nice to have. PRN files (human-readable combination of output and input) are very helpful when developing, and a SYM file (listing all symbols and their values) is great for debugging. All of these should be optional.

For a Z80 assembler, a switch restricting it to 8080 instructions is useful. If set, the assembler should warn or fail on Z80-only instructions. A similar switch could be used for eZ80 support.
 
Bear in mind that a .com file and a binary object-file are the same thing, so thats two birds with one stone. Which is which depends only on whether it is ORG'ed at 100h or not :)
Re the current PC, I just thought, $ is often used for hex notation, $42 = 0x42 = 42h ( = 66d )
 
Bear in mind that a .com file and a binary object-file are the same thing, so thats two birds with one stone.
Object files contain relocation information, COM files are fully relocated. It is possible to assemble at both 0x0000 and 0x0100 to get relocation information if the assembler doesn't support it.
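
The two-origin trick can be sketched in Python: assemble the same source at 0x0000 and at 0x0100, then compare the images. Because the origins differ by exactly 0x100, only the high byte of each absolute address changes, and it changes by one. The function name `relocation_offsets` is hypothetical.

```python
def relocation_offsets(at_0000: bytes, at_0100: bytes) -> list:
    """Return offsets of the address high bytes that depend on the origin."""
    if len(at_0000) != len(at_0100):
        raise ValueError("images differ in length")
    offsets = []
    for i, (a, b) in enumerate(zip(at_0000, at_0100)):
        if a == b:
            continue
        if ((b - a) & 0xFF) != 1:
            raise ValueError(f"unexpected difference at offset {i}")
        offsets.append(i)
    return offsets


# JP END / NOP / END: RET, assembled at both origins.
image_0 = bytes([0xC3, 0x04, 0x00, 0x00, 0xC9])
image_1 = bytes([0xC3, 0x04, 0x01, 0x00, 0xC9])
print(relocation_offsets(image_0, image_1))   # [2]
```

A loader that knows these offsets only has to add the page number of the real load address to each marked byte, provided the code is loaded on a 256-byte boundary.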

Re the current PC, I just thought, $ is often used for hex notation, $42 = 0x42 = 42h ( = 66d )
Yes, support for all common notations is desirable.
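
A minimal Python sketch of accepting several notations for the same constant. It assumes a leading $ followed by hex digits is a number, which would leave a bare $ free to mean the current location; the function name `parse_number` and the trailing d for decimal are assumptions for the example.

```python
def parse_number(text: str) -> int:
    t = text.strip().lower()
    if t.startswith("$"):
        return int(t[1:], 16)     # $42
    if t.startswith("0x"):
        return int(t[2:], 16)     # 0x42
    if t.startswith("%"):
        return int(t[1:], 2)      # %01000010
    if t.endswith("h"):
        return int(t[:-1], 16)    # 42h
    if t.endswith("d"):
        return int(t[:-1], 10)    # 66d
    return int(t, 10)             # 66


for literal in ["$42", "0x42", "42h", "%01000010", "66d", "66"]:
    print(literal, "=", parse_number(literal))
```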
 
My biggest requirements are (1) A really powerful macro facility with assembly-time variables--you should be able to do character manipulation easily. (2) Strong typing with associated conditionals (3) data structures, including bitfields and initialization constructs, as well as arrays (4) opdefs for new or custom instructions.
On x86, I've used every version of MASM from 1.0 to 6.14. Things didn't really start to get what I'd call "friendly" until MASM 5 or so. There are still a couple of "you can't get there from heres" in 6, but they can be worked around.
If you want to see an assembler with a decent macro facility, consider the more-than-50-year-old H level assembler for S/370 (on bitsavers).

But much of my career has been programming nothing but assembly. If you consider C to be a high-level assembly language, then almost all of my experience has been in assembly.
 
Object files contain relocation information, COM files are fully relocated. It is possible to assemble at both 0x0000 and 0x0100 to get relocation information if the assembler doesn't support it.
Only in a relocating assembler like RMAC, yes, but unless I missed that requirement we're not talking about producing relocatable code are we?
in which case a .com is simply the bin file ORG'ed at 100h :)
Providing 'phase' or what David calls 'offset' isnt the same as creating relocatable code, I know you know that, I'm just explaining myself :)
 
Then we agree, but in that one post I inattentively used the phrase " binary object file" when I should have said "binary file" :)
In my defence, within the documentation for several non-relocating assemblers, the output is often referred to as an object file even though relocation is not provided - however I accept that to the dictionary definition you're right. I'll climb back into my box now :)
 
Bear in mind that a .com file and a binary object-file are the same thing, so thats two birds with one stone. Which is which depends only on whether it is ORG'ed at 100h or not :)
Re the current PC, I just thought, $ is often used for hex notation, $42 = 0x42 = 42h ( = 66d )

This is also where I use the $ sign as Hex. Assemblers and High Level languages are maybe not a good mix, but learning the symbols of a few more assemblers might help put together a good final symbol set.

I do want to keep this as small and simple as possible, and presently I have a very simple lexical analyser. It picks up the next token to process very efficiently, either stopping on whitespace/operator or just on operator while ignoring whitespace. As such I don't really have any arbitrary length limitations yet, but any operators must be a single character - Things like >> aren't acceptable, but > is.

I'll see what options are... C seems a bit far from compatible with the approach I've used, but now you've mentioned it, I should be able to put together a list of assembler operator functions.

Is there any advantage to relocatable code in CP/M? Where most code either loads at 100 or as a bin file destined for a different location? Or with an offset, so a loader can load at 100 and then relocate the code to the original intended location before execution?

Technically I get that's not relocatable code, since once linked it can't be relocated unless written that way. Is it just a case of keeping a list of any fixed address vectors and adding the program load offset to them? Nearly all of the code I wrote was for embedded applications, and I never found a use for relocatable code on z80 or encountered it. What use did CP/M make of relocatable code directly? Or was that in things like MP/M?

My biggest requirements are (1) A really powerful macro facility with assembly-time variables--you should be able to do character manipulation easily. (2) Strong typing with associated conditionals (3) data structures, including bitfields and initialization constructs, as well as arrays (4) opdefs for new or custom instructions.

Do you know if any assemblers offered "custom" instructions in the assembler itself? I imagine source code wasn't that common for assemblers, but would access to the assembler source meet the requirement to add new instructions? I wonder how much of what is described, with the exception of Macro source, could be achieved by changing the assembler directly to add new instructions such as ez80 ones as mentioned. I wonder if this might be an alternative to having to use macros to create new instructions.

It should be easier to add instructions to the assembler than to create a table of references to code chunks.

Also, when you say character manipulation, do you mean adding characters into formula to represent a number value - eg, 'A'+$80 ?

Thanks
David.
 
For "normal" programs, you don't really need or want relocatable code. But if you want to write RSX (resident system extensions) modules or GSX drivers, (and some "more exotic" system extensions that need to relocate themselves to high memory), you need to have relocatable programs - Especially when you're loading more than one driver or RSX, the load address is not known beforehand.
 
>>> Do you know if any assemblers offered "custom" instructions in the assembler itself?

This is the purpose of a meta assembler (see https://www.farnell.com/datasheets/100571.pdf for details of Cross-32). This is an assembler whereby the specific instruction set is defined in a table. The end user can define their own tables. I have created my own tables for a number of specific processors not supported "out of the box" by cross-32.
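
The table-driven idea can be illustrated with a toy Python encoder. The table layout below is invented for the example and is not the Cross-32 table format; the opcodes are the standard Z80 encodings for the instructions shown, and the names `TABLE` and `encode` are hypothetical.

```python
# (mnemonic, operand pattern) -> (opcode bytes, width of the immediate in bits)
TABLE = {
    ("NOP", ""): ([0x00], None),
    ("RET", ""): ([0xC9], None),
    ("LD", "A,n"): ([0x3E], 8),
    ("LD", "HL,nn"): ([0x21], 16),
    ("JP", "nn"): ([0xC3], 16),
}


def encode(mnemonic: str, pattern: str = "", value: int = 0) -> bytes:
    opcode, width = TABLE[(mnemonic.upper(), pattern)]
    out = list(opcode)
    if width == 8:
        out.append(value & 0xFF)
    elif width == 16:
        out += [value & 0xFF, (value >> 8) & 0xFF]   # low byte first
    return bytes(out)


# Supporting a new instruction is a new table row, not new assembler code.
TABLE[("EX", "DE,HL")] = ([0xEB], None)

print(encode("LD", "HL,nn", 0x1234).hex(" "))   # 21 34 12
print(encode("EX", "DE,HL").hex(" "))           # eb
```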

Dave
 
For "normal" programs, you don't really need or want relocatable code. But if you want to write RSX (resident system extensions) modules or GSX drivers, (and some "more exotic" system extensions that need to relocate themselves to high memory), you need to have relocatable programs - Especially when you're loading more than one driver or RSX, the load address is not known beforehand.
That makes sense, thank you for the example. I guess my current architecture would just replace these with process hooks, so they could all install in the same location ( Typically $1000 ) and page in/out as called, while still being able to access common tables and other data in the original TPA. So it's pseudo relocatable from that perspective. But from your example, assuming I saved a list of fixed vectors to update with the new code location while loading, I could relocate code anywhere in memory.
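
The "saved list of fixed vectors" approach might look like the Python sketch below. It assumes each fixup entry is the offset of the low byte of a 16-bit little-endian address inside the image; the function name `relocate` and the fixup format are hypothetical.

```python
def relocate(image: bytes, fixups, assembled_at: int, load_at: int) -> bytes:
    """Patch every listed 16-bit address by the distance the code has moved."""
    delta = load_at - assembled_at
    out = bytearray(image)
    for off in fixups:
        addr = out[off] | (out[off + 1] << 8)
        addr = (addr + delta) & 0xFFFF
        out[off] = addr & 0xFF
        out[off + 1] = addr >> 8
    return bytes(out)


# JP 0x0104 / NOP / RET assembled at 0x0100; the JP operand starts at offset 1.
image = bytes([0xC3, 0x04, 0x01, 0x00, 0xC9])
moved = relocate(image, fixups=[1], assembled_at=0x0100, load_at=0xD000)
print(moved.hex(" "))   # c3 04 d0 00 c9
```

Unlike patching only high bytes, full 16-bit fixups allow loading at any address, at the cost of a slightly larger fixup table.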

Do you know if CP/M had a common loader for such code, or did linkers get used for this purpose?
 