
Writing Assemblers... What should a good assembler do?

I know it's fun to argue about what's the best way to do a hash, but the simple fact is, it's premature optimization. The hash doesn't matter if the rest of the assembler doesn't work, but the assembler will still work if it uses a simple linked list.
And what is it about the current version of the assembler you think isn't working?
 
And what is it about the current version of the assembler you think isn't working?

I imagine by "Not Working" he means "Unfinished" -and he's not incorrect about that.

There's a list of things I've mentioned that I'm still working on. It's functional, but in terms of working - i.e., completeness - I'm still getting the functions to work as intended.

Today, the Macros are finally nesting correctly, with 32 levels of nesting reserved in memory. Each level of nest takes 6 bytes of reserved memory, and it's working pretty well. Macros can call macros, which can call macros. I even wrote a macro that called itself until it reached maximum depth using conditional assembly, then called it again to test level trapping, and finally backed it out all the way to the original call and terminated the macro. The amount of nesting allowed can be changed, but presently that's the limit.
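For anyone who wants a picture of how that nesting limit behaves, here's a minimal sketch in Python - not the actual Z80 code, and what the 6 bytes per level really hold (return position, argument pointer, or whatever) is my assumption purely for illustration:

Code:
# Minimal sketch of a bounded macro-nesting stack. Not the real Z80 code;
# the frame contents stand in for whatever the reserved 6 bytes per level hold.

MAX_DEPTH = 32

class MacroNestingError(Exception):
    pass

class MacroStack:
    def __init__(self, max_depth=MAX_DEPTH):
        self.max_depth = max_depth
        self.frames = []                       # one frame per active macro expansion

    def enter(self, return_pos, arg_base):
        if len(self.frames) >= self.max_depth:
            # The "level trapping" case: refuse to nest any deeper.
            raise MacroNestingError("maximum macro nesting depth reached")
        self.frames.append((return_pos, arg_base))

    def leave(self):
        # Called when a macro terminates; unwinds one level.
        return self.frames.pop()

    @property
    def depth(self):
        return len(self.frames)

A macro that keeps calling itself would hit the trap at level 32, then unwind frame by frame back to the original call - which is essentially the test described above.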

Macros can call Includes, and Includes can call Macros, though the intent is to include system macros rather than bind them to RSTs like the PC does.

Macros and Includes both allow up to 10 arguments to be passed to them via the calling line, and Binary includes are allowed as well.

I still need to write the routines to send screen output to a file instead of the console, too.

Then I need to add the ability to enter switches and values via the command line from CP/M, and it's complete... A little cleanup maybe, but the point then is something I agree with: I need to get it working correctly as I want, then begin playing with the optimisations.

Though the first priority I'm going to have is documenting it all. I'm already forgetting stuff I've added in, and what commands I used, and going back through the source to look isn't optimal. I can implement other data structures later to improve speed if it's an issue, but having it complete, then making a compatible cross-assembler, is the primary objective. The second priority on completing it is writing the cross-assembler to replace the current cross-assembler.

The current technical limits/capabilities of my assembler are:

* 32 levels of Macro nesting
* Memory-limited levels of Include nesting
* Unlimited Binary Includes
* 9 arguments transferred to called Includes or Macros ( per call. )
* Global and Local variable capabilities, and system labels can be turned on and off and reused concurrently.
* Ability to access all labels in a structured manner.
* Ability to use the same label name for Group labels and Data labels.
* Full z80 opcode set ( including undocumented codes )
* Temporary Labels that can be removed mid-assembly.
* One or Two pass iterations with macros and includes.
* Intel Hex Output capable
* Binary Output capable
* Offset PC and direct source code access to relevant system functions ( eg, pass, PC, Macro depth, etc ).
* Simple console messaging output.
* System labels accessible if required ( some )
* Presently less than 11K (<8K Objective, <12K Threshold) - I think I'll make it.
* Does not require linking ( also does not create link files ) as it is intentionally designed to create binaries direct from source.
* JIT assembly ( not implemented yet ).
* PRN Output ( not implemented yet ).
* Output console to a file instead of screen ( not implemented yet ).
* Matched Cross-Assembler (source compatible - partially implemented ).
* Can assemble itself from source.
 
I know it's fun to argue about what's the best way to do a hash, but the simple fact is, it's premature optimization. The hash doesn't matter if the rest of the assembler doesn't work, but the assembler will still work if it uses a simple linked list.
I imagine by "Not Working" he means "Unfinished" -and he's not incorrect about that.
Well, if so, he is using an odd definition of the "avoid premature optimisation" guideline. There's enough working in your assembler that you can assemble things, and the discussion relating to efficiency has been related to parts that are already running, in particular related to objections to adding certain features because they're currently extremely inefficient due to the current implementation, e.g., "It would be very inefficient on a z80 based system to search up to 50K of label space before decoding opcodes."

There are a couple of potential solutions there: the more direct one of making your label searches faster (with hash tables or whatever, which could also significantly benefit assembly speed as a whole), or my suggestion of having a separate namespace for macros. (It's not clear to me why you would want the ability to, e.g., jmp somemacro or ld hl,somemacro, or even what you would fill in for the values of those symbols.)
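To make the hashing suggestion concrete, here's a rough sketch (Python, purely illustrative - the names and bucket count are made up, and it says nothing about how your assembler actually stores labels) of what bucketing buys over walking one long list. A separate dictionary for macro names would likewise keep macro lookup out of the label search entirely:

Code:
# Illustration only: a linear label list vs. a hash-bucketed one.
# Structure and names are hypothetical, not taken from the assembler.

def cheap_hash(name, buckets=64):
    # Something a Z80 can do cheaply: sum the name's bytes, then mask down
    # to a power-of-two bucket count.
    return sum(name.encode("ascii")) & (buckets - 1)

class LinearTable:
    def __init__(self):
        self.entries = []                      # list of (name, value)

    def add(self, name, value):
        self.entries.append((name, value))

    def lookup(self, name):
        for n, v in self.entries:              # walks every entry in the worst case
            if n == name:
                return v
        return None

class HashedTable:
    def __init__(self, buckets=64):
        self.buckets = [[] for _ in range(buckets)]

    def add(self, name, value):
        self.buckets[cheap_hash(name, len(self.buckets))].append((name, value))

    def lookup(self, name):
        for n, v in self.buckets[cheap_hash(name, len(self.buckets))]:
            if n == name:                      # only scans roughly 1/64th of the entries
                return v
        return None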

But to sum it up, it's not premature optimisation when you're dropping a feature because you think it will run too slow.

As for an instruction/pseudo-op trie, instruction/pseudo-op lookup is clearly implemented and already working. It's heavily used and so has potential to speed up the assembler noticeably with a faster search, and also has the potential to save a bunch of space, which has always been a source of pressure on the size of things you can assemble. ("Presently less than 11K (<8K Objective, <12K Threshold).")
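And for reference, the mnemonic/pseudo-op trie idea looks roughly like this - a toy Python sketch over a handful of mnemonics, just to show the shape of the lookup; a real table would of course be a compact byte structure rather than nested dictionaries:

Code:
# Toy trie over a few Z80 mnemonics, purely to illustrate the lookup shape.

MNEMONICS = {"LD": 1, "LDI": 2, "LDIR": 3, "JP": 4, "JR": 5, "CALL": 6}

def build_trie(words):
    root = {}
    for word, code in words.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = code                       # end-of-word marker holds the handler id
    return root

def lookup(trie, word):
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:                       # prefix dies out: not a mnemonic
            return None
    return node.get("$")

trie = build_trie(MNEMONICS)
assert lookup(trie, "LDIR") == 3
assert lookup(trie, "LDX") is None

The search cost becomes proportional to the length of the mnemonic rather than to the number of entries, and common prefixes are stored only once.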
 
How long does it take to assemble the assembler on a heritage system (or a simulated heritage system)? That's the real measure, everything else is academic.

I run z80pack at a 4MHz clock speed, but even that's not quite authentic, as I/O is still "instant". I mean, I guess it is instant - there's no seek time (that's at host speed) - but if it's moving bytes from the "disk drive" using IN and OUT commands, that's at 4MHz, vs some out-of-band "magic" DMA. Anyway, you can certainly feel the machine at a simulated 4MHz.
 
How long does it take to assemble the assembler on a heritage system (or a simulated heritage system)? That's the real measure, everything else is academic.
No, there are two other things that are clearly not academic:

1. If you wish to assemble something other than the assembler itself, and you run out of memory. At that point, the time it takes is effectively the same as "forever."

2. If, regardless of speed, the developer(s) are dropping useful features because they think they will be too slow. (And it's not unreasonable for them to be unwilling to do a bunch of implementation work with no idea if it will be practical to use or will have to be thrown out for performance reasons.) In that case, it's necessary to provide some convincing ideas that the new feature can be designed in a way to run fast enough. (It doesn't matter what the current speed without the feature is, since that's not what's blocking the new feature.)

It's of course true that one shouldn't do premature optimisation, but not all thoughts about optimisation are premature just because something hasn't been tested and timed on the hardware. Basic knowledge of how algorithms work can easily give you estimates for large differences in performance and size, enough that in some situations it will be obvious that it's not worth writing the poorly performing and/or large code only to have to replace it later.
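As a concrete instance of the kind of estimate I mean - counting only string comparisons, with a made-up table size and no hardware timings, since those would just be guesses:

Code:
# Back-of-envelope: expected comparisons per successful lookup, nothing machine-specific.
n_labels = 2000                                # hypothetical table size
buckets  = 64

linear_avg = n_labels / 2                      # ~1000 comparisons per lookup
hashed_avg = 1 + n_labels / (2 * buckets)      # hash once, then ~16 comparisons

print(linear_avg, hashed_avg)                  # 1000.0 vs ~16.6, roughly a 60x gap

A gap that size doesn't need to be timed on the hardware to know which way it goes.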

(And do keep in mind, I'm not the sort of person that optimises unnecessarily. You can look back at my previous posts for examples where I've deliberately used, e.g. O(n) algorithms.)
 
No, there are two other things that are clearly not academic:

1. If you wish to assemble something other than the assembler itself, and you run out of memory. At that point, the time it takes is effectively the same as "forever."

It's not as though I can address memory problems in RAM entirely through any strategy. All that's being established by any memory efficiency increases at this point is the approximate output code size capability. Initial collected data suggests it's pretty close to 1:1 at the moment even without further optimisation, which is adequate for a z80 based system.

I will get around to writing a version for the target architecture eventually, which will have 1MB of memory, so that's not an issue long-term. But I want this version to run on a normal CP/M 2.2 system.

2. If, regardless of speed, the developer(s) are dropping useful features because they think they will be too slow.

It's not that I think the features suggested are too slow. I think they would provide a significant improvement in performance. I like them. They will take up more space without question ( A hash system is going to have to use more code than a similar linear system because it has to do more ) - But even that's reasonably likely to still be within requirements.

The suggestions are good and I fully intend to take advantage of them.

Even so, I completely agree with @Bruce Tomlin 's point that I should finish the current project objective first and then evaluate what needs to change before moving onto the next objective. Given where the code development is at presently, there are a number of practical issues with retrofitting a hash table into the existing linear search.

The first and foremost is that some of my functions *require* a linear search. Group functions, for example, manipulate the linked list to remove entire sections of the table from processing. But that nature presently requires a linear search.

Also, the Macro function embeds itself into the list as normal labels, and it manipulates the list in the same way, so that the macro content itself is removed from processing. The Macro as a label does have a value - presently the length of the Macro - but that could be used in the future to hold version information, as the macro length isn't used for processing.
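To make that concrete, here's roughly how I picture the trick of unlinking a whole group from the search path - a Python model with hypothetical field names, not the actual list layout in the assembler:

Code:
# Hypothetical model of a singly linked symbol list where a whole "group"
# (a chain of consecutive nodes) can be spliced out of the search path
# and kept alive for later re-attachment.

class Node:
    def __init__(self, name, value):
        self.name, self.value, self.next = name, value, None

class SymbolList:
    def __init__(self):
        self.head = None

    def add(self, name, value):
        node = Node(name, value)
        node.next, self.head = self.head, node
        return node

    def find(self, name):
        node = self.head
        while node:                            # the linear search everything relies on
            if node.name == name:
                return node
            node = node.next
        return None

    def unlink_span(self, first, last):
        # Remove the chain first..last from the search path, keeping the
        # nodes intact so the span can be re-attached later.
        node, prev = self.head, None
        while node is not first:
            prev, node = node, node.next
        if prev is None:
            self.head = last.next
        else:
            prev.next = last.next
        last.next = None
        return first

In this model a macro is just another node whose value happens to hold the macro's length, with its body spliced out the same way - which is exactly the property a bucketed layout would disturb.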

To facilitate a hash table would involve reexamining all of the functions that use the nature of the existing list and adapting them so as to maintain the capabilities, which would mean I need to examine all of my list data structures and rewrite them.

It's not inconceivable to find solutions to all of this presently, but it is impractical at this stage of the project.

So it's not like I don't like the ideas provided - I do like them and will spend some time considering how to use them. But there's a lot of thinking I still need to do on what is required and what the best way to implement them is. These are techniques I haven't used previously, so there's some learning curve there as well.

I also need to plan on how to best take advantage of the new/adapted data structures. For that I need to collect data and the best way to get that data is to finish the current project and get some metrics around how it performs and what makes a difference within the existing structure when tuned.

It's a great learning process and I intend to replicate the same structures in my new cross-assembler as well - just with more capacity. That will give me an opportunity to find an improved model that works with the same data structures, and will also allow me to find a creative solution that can be back-ported to the z80 version. So there's value in continuing on to the cross-assembler once I have the z80 assembler's documentation completed, even if the z80-native first version doesn't address some of the current deficits.

That's the current plan.

How long does it take to assemble the assembler on a heritage system (or a simulated heritage system)? That's the real measure, everything else is academic.

Completely insightful - and I believe this is the key to being pragmatic at this point in the project. I nearly managed to get my Amstrad down off the shelf over the weekend, but my desk didn't quite make it to "Clean" sufficiently to make space for it yet... Hopefully this week.

This is going to be the razor on which the decision is made. If I'm working with a floppy drive ( And an Amstrad is pretty powerful as z80 CP/M machines go ) as well as the RAM drive, it will provide some very useful performance data on how it performs in the wild.

If it performs acceptably, then I'll proceed to complete the current project, write the documentation, then implement the same data structures with new ideas into the cross-assembler rewrite to see how they go. What I learn during that process can be included in Version 2 of the z80 native assembler.

If it's woefully inadequate then I'll review the existing project and consider how best to address the key components that led to inefficient operation.

But until I test it I won't know. As an example, if it could be assembled in 5 seconds but mine takes 25 seconds, that would be bad; but if there are 2 minutes of disk activity over that same assembly period, then the extra 20 seconds is completely irrelevant for version 1.

The current source file is around 200K. That's pretty big for a z80 CP/M system. One thing I have learnt is that the first memory constraint I'll hit is source file length and disk capacity. I'm guessing I'd need around 600K of source file before system memory becomes a primary limitation.
 
It's not as though I can address memory problems in RAM entirely through any strategy.
You clearly haven't been reading the conversation that's gone by; some of the suggestions have been directly addressing how you might reduce RAM usage.

All that's being established by any memory efficiency increases at this point is the approximate output code size capability.
Initial collected data suggests it's pretty close to 1:1 at the moment....
Err, one what to one what?

And I'm mystified by why memory would place any limit on your output code size. I thought you were assembling from and to disk files. Are you holding the source or binary in memory? If so, there's a thing you can fix to massively increase the size of files you can assemble.

It's not that I think the features suggested are too slow.
Well, you could have fooled me. Your own words were (this is the third time I've quoted them): "It would be very inefficient on a z80 based system to search up to 50K of label space before decoding opcodes."

The first and foremost is that some of my functions *require* a linear search. Group functions, for example, manipulate the linked list to remove entire sections of the table from processing. But that nature presently requires a linear search.
Yeah, I'm going to go with that being the premature optimisation.

The current source file is around 200K. That's pretty big for a z80 CP/M system.
Depends on the system. Back in my day (late '70s, early '80s) it was quite normal to have a pair of 1.2 MB floppies on line, sometimes even four.
 
You clearly haven't been reading the conversation that's gone by; some of the suggestions have been directly addressing how you might reduce RAM usage.

Yes, *reducing* RAM usage isn't the same as addressing the lack of RAM. It's a "percentage" improvement, acknowledged, but it doesn't eliminate the problem - it just increases the number of labels that fit in the table, still within the same 64K limit.

Err, one what to one what?

For an output binary that's 16K, given the way I use long labels and the frequency with which I use them, the label table is approximately 16K.

And I'm mystified by why memory would place any limit on your output code size. I thought you were assembling from and to disk files. Are you holding the source or binary in memory? If so, there's a thing you can fix to massively increase the size of files you can assemble.
It's a limit of holding the label table in memory. The output file length isn't limited, but there is a correlation between output file size and label table size.

Well, you could have fooled me. Your own words were (this is the third time I've quoted them): "It would be very inefficient on a z80 based system to search up to 50K of label space before decoding opcodes."
I thought you were suggesting that I was suggesting the features you suggested were too slow... And I was saying I like the idea of the hash tables.

Yeah, I'm going to go with that being the premature optimisation.

Maybe. I've yet to test in the wild... Well, I did test in the wild this evening. Didn't quite go as planned.

Depends on the system. Back in my day (late '70s, early '80s) it was quite normal to have a pair of 1.2 MB floppies on line, sometimes even four.

If you don't mind me asking, how big did the single .ASM files you were seeing get back then? The biggest system I ever worked on had around 360K or 720K - can't remember which. It was a Microbee. And the targets were EPROMs, so I never had enough assembly to even fill 8K on any of my early projects. I am going to guess these were PC8801s you were using? They had a liking for multiple 1.2Mb FDDs from around 1985 onwards, didn't they?

Well, as mentioned, it was tested in the wild. And it crashed on a PCW-9512 due to file related issues ( problems reading the source, or mixing up the DMAs or something... ) - well, that's easily tested, fixed and returned to working, hopefully.

So of course, @Bruce Tomlin might also be referring to the current Github version not working under CP/M... Some issues in my file handling, and it looks like LokiOS has some compatibility issues to solve.

And it looks like Joyce is my new test platform for CP/M testing, since once it works with Joyce I should be able to drop it back onto the Amstrad's Gotek for some real-world file testing. The Gotek is slightly faster than a floppy, but not by much, and it's consistent. There's also the M: drive to work with.
 
Back in my day (late '70s, early '80s) it was quite normal to have a pair of 1.2 MB floppies on line, sometimes even four.
But 200K source files were uncommon. Paging editors were uncommon. Navigating a 200K file on a glacial machine, paging from floppies, is a rather horrible experience.

It can be argued that a singular motivation behind linkers and library utilities is to cope with not having enough memory. Thus encouraging software to be broken up into pieces for individual processing, until finally linking them together.

Sure, there's always the "reuse" banner to fly, but when you see "EDIT1.ASM, EDIT2.ASM, EDIT3.ASM" or some other similar pattern of filenames, that's someone fighting editor memory, assembly time, or some other limit of the system -- even if they're all on the same floppy.

There's been a recent phenomenon in computing where folks are avoiding linking altogether. Where "util.h" is not a header file simply enumerating labels and storage, it's the entire source code. SQLite is developed as several individual files, but distributed in a single, very large C file. A combination of having "infinite" resources that allow us to process huge source code files combined with the absolute disaster that dependency management is today is what's giving momentum to such practices.

Linkers allowed individual, smaller, parts to be combined into a whole. Library managers allowed large libraries to be easier to distribute, save file space, and save directory space. A CP/M directory has, what, 64 entries? I ran into that limit on a project I was working on. Space on the disk, but not on the directory. There's always been the issue of that 100 byte file consuming 4K of disk space because of the directory structure. Library files help address both of those issues.

I don't know when the last time I saw a new project use a library file was. They're essentially unnecessary nowadays. Easy enough to just ship a directory of .o files.

For context, there's the disassembled source code for M80 available, that's got comments in it, and it's 183K. I can guarantee you that the original source code was not developed as a single 183K file. It's part of a pretty cool project that lets you run M80 (and L80, etc.) from modern machines. It bundles the CP/M runtime and turns it into a command line program, so you can write Macro 80 normally and use it as a cross assembler. It's pretty clever.
 
But 200K source files were uncommon. Paging editors were uncommon. Navigating a 200K file on a glacial machine, paging from floppies, is a rather horrible experience.
Well, it's your choice to make it one huge file rather than using include files. I don't know why you'd do that.

It can be argued that a singular motivation behind linkers and library utilities is to cope with not having enough memory. Thus encouraging software to be broken up into pieces for individual processing, until finally linking them together.
It could be, sure. But it's not clear to me why, if that's your only issue, you'd not simply (again) use include files. Another argument would be speed, so that you don't need to re-assemble the whole program. But that's pretty dependent on how small your modules are and how often you're changing them; you're trading less assembly for a fairly large linking cost (which also eats a lot of disk space).

The major advantage that linkers offer over includes in assembly systems is that you can distribute modules that can be linked with other people's code without you having to distribute your source code.

A combination of having "infinite" resources that allow us to process huge source code files combined with the absolute disaster that dependency management is today is what's giving momentum to such practices.
That's an interesting opinion, but certainly contrary to my experience. I started out with an assembler that used a linker and quickly switched to whole-program assembly because it made dependency management much easier; I can parametrise my modules in much more sophisticated ways that really help when building for small-memory systems.

Note that this is far from a new idea, nor used only by assemblers. MLton, a whole-program compiler for ML, is just one example that comes to mind. It was this or some other similar system that allowed someone to build an incredibly fast DNS server and other network applications because they were essentially compiling the operating system and application together, giving massive optimisation benefits.

Linkers allowed individual, smaller, parts to be combined into a whole.
Just like included files, except with less flexibility.

Library managers allowed large libraries to be easier to distribute, save file space, and save directory space. A CP/M directory has, what, 64 entries?
You really need to upgrade to CP/M 2.0. There's really no need to live with the limitations of CP/M 1.x in this day and age.

That said, even if you have configured your DPB to have only 64 directory entries (which actually isn't unreasonable), that's plenty of room for a dozen source files of say 32K each, a dozen tools (editor, assembler, nsweep, etc.) and three or four copies of your output file. That is likely all under a megabyte, making development even on a single-disk system practical, albeit annoying when you want to make backups.

I ran into that limit on a project I was working on. Space on the disk, but not on the directory. There's always been the issue of that 100 byte file consuming 4K of disk space because of the directory structure. Library files help address both of those issues.
Or you could get rid of the library file, linking utility, and separate intermediate object outputs for every source file, and replace it all with a single parametrised source file you include.

I don't know when the last time I saw a new project use a library file was. They're essentially unnecessary nowadays. Easy enough to just ship a directory of .o files.
On Unix I've never seen anybody ship a directory of .o files, though admittedly I don't deal much these days with software where the source is not available. And I can't imagine why someone wouldn't build a link archive; shipping a pile of different files is not as easy as shipping a single archive. I have seen plenty of projects where even internal libraries are built as archives. And of course any set of functions used by other programs is built as a single file shared library; the same is true under Windows.

For context, there's the disassembled source code for M80 available, that's got comments in it, and it's 183K. I can guarantee you that the original source code was not developed as a single 183K file.
Of course not; that would be insane. I have no trouble with 183K files on my modern cross-development system, and I still wouldn't do that.

It bundles the CP/M runtime and turns it into a command line program, so you can write Macro 80 normally and use it as a cross assembler. It's pretty clever.
That's an excellent idea, and I was proposing something along those lines for the OP's cross-assembler. But he's quite set on having a completely separate source base, in a different language, for the cross-assembler.

Do you have a link to that? I'd like to look at making use of their Z80 and CP/M emulation code. I've been using RunCPM to date, but there are a few things about it that I'm not happy with.
 
Oops, I missed one important point about separate assembly/link systems in my post above. One of the things separate assembly does is drop non-public symbols, which could have a large effect on the amount of memory needed to build a system.

If you do whole-program assembly, you need enough memory for all of the symbols in all of the files. But if you separately assemble each file and build a linkable object file, that object file will have only the public symbols in it, which means that the assembler needs enough memory for all the symbols in the largest file, and the linker needs enough memory only for all the public symbols.
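A sketch of the memory trade being described, with entirely made-up module contents, just to show where the peaks fall:

Code:
# Illustration of the symbol-memory trade-off between whole-program assembly
# and separate assembly plus linking. Module contents are invented.

modules = {
    "main.asm": {"public": {"start": 0x0100}, "local": {"loop1": 0x0105, "tmp": 0x0120}},
    "io.asm":   {"public": {"putc": 0x0200},  "local": {"wait": 0x0210}},
    "math.asm": {"public": {"mul16": 0x0300}, "local": {"m1": 0x0310, "m2": 0x0318}},
}

# Whole-program assembly: every symbol of every module is live at once.
whole_program_peak = sum(len(m["public"]) + len(m["local"]) for m in modules.values())

# Separate assembly: the assembler holds only one module's symbols at a time;
# the linker later holds just the public ones.
assembler_peak = max(len(m["public"]) + len(m["local"]) for m in modules.values())
linker_peak    = sum(len(m["public"]) for m in modules.values())

print(whole_program_peak, assembler_peak, linker_peak)   # 8, 3, 3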

I personally find whole-program assembly so compelling that I'd probably look at other solutions (such as caching the symbol table to a disk file) before writing a linker, but again, there's also the issue of being able to distribute linkable binaries without source code, which was a much bigger issue back in the day than it is now.
 
I fixed the bugs that caused the assembler to fail when running under CP/M. It was a register I forgot to save before calling a BDOS function related to printing to the console. LokiOS didn't have the same issue, so I missed it entirely.

Then I did some tests.

Some times the Loki Assembler took to assemble itself:
Approx 6500 lines of code - 220KB source file.
11K of assembled code.

Amstrad PCW Gotek - 5m 15s
Amstrad PCW M:RAMDISK - 3m 10s
Joyce Emulator - 1m 36s
My emulator - 0m 8s.

Wow... That took a very long time... Way longer than I expected. But it works, and it can reassemble its own source, which meets requirements. Half the time is spent on disk access and half dealing with the source. Instruction assembly takes only seconds, so I assume the slowness is the labels. It will be interesting to see how that changes in the future. 2 minutes was my outer guess; I was off by a bit. Some data structure optimisation would definitely increase speeds. I'll work on that while I build the new cross-assembler.

I am surprised my emulator was that much faster also... That is not something I expected - I thought it might be 4 to 5 times faster, not 30 times faster!

Also, does anyone have an ASM or M80 assembler and BDOS source pair and can tell me how long it takes to assemble on a 4MHz z80 system using a floppy drive with output to the same floppy? (The end to end process, including assembling and linking ).

If you do whole-program assembly, you need enough memory for all of the symbols in all of the files. But if you separately assemble each file and build a linkable object file, that object file will have only the public symbols in it, which means that the assembler needs enough memory for all the symbols in the largest file, and the linker needs enough memory only for all the public symbols.

That is something that it does/can do. All the files can be included at the start, and while it takes a little while to load each one, one after another, the table can be cleared out entirely after each include is assembled. Or you can clear out just the local variables. There's different ways to do it.
 
I am surprised my emulator was that much faster also... That is not something I expected - I thought it might be 4 to 5 times faster, not 30 times faster!
30 or more times faster is no surprise; remember a single core on a "modern" machine from the last ten or fifteen years is literally a thousand or more times faster than Z80s circa 1978-1985. A really good emulator should be a couple of orders of magnitude faster, even with the hit of running over a modern OS with split kernel/userland. Even my Python simulators, which are entirely interpreted and quite inefficiently written (and so ten to a hundred times slower than a good emulator) run at more or less the speed of an original machine.

If you do whole-program assembly, you need enough memory for all of the symbols in all of the files. But if you separately assemble each file and build a linkable object file, that object file will have only the public symbols in it, which means that the assembler needs enough memory for all the symbols in the largest file, and the linker needs enough memory only for all the public symbols.
That is something that it does/can do. All the files can be included at the start, and while it takes a little while to load each one, one after another, the table can be cleared out entirely after each include is assembled. Or you can clear out just the local variables. There's different ways to do it.
You can't clear out all the symbols, or you won't be able to call or use anything in the code you just included and built. Even just clearing out local symbols is likely to break things because you've now lost the values for them that you filled in when you have forward references you're resolving in the next pass. And I thought you didn't have proper local symbols, anyway.

But that does introduce an interesting and I think reasonable idea for increasing the number of symbols you can handle in whole-program assembly: set up a file to hold local symbols and, every time a scope closes, dump all the local symbols (resolved or not) to a section of that file. Then you need keep only global symbols and one scope's worth of local symbols in memory. When you come back for the next pass, reload the local symbols (including any you resolved after their use) and generate code (or fail, needing another pass, in which case you dump again and repeat as necessary).
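In case a sketch helps, here's roughly what that scheme could look like - Python, with the spill-file format and names invented purely for illustration:

Code:
# Rough sketch of spilling a closed scope's local symbols to a file and
# reloading them on the next pass. File layout and names are invented.

import json

class ScopedSymbols:
    def __init__(self, spill_path="locals.spill"):
        self.spill_path = spill_path
        self.globals = {}
        self.locals = {}                       # only the open scope lives in memory
        self.scope_index = 0
        open(spill_path, "w").close()          # start with an empty spill file

    def close_scope(self):
        # Dump the scope's locals (resolved or not) and free the memory.
        with open(self.spill_path, "a") as f:
            f.write(json.dumps({"scope": self.scope_index,
                                "symbols": self.locals}) + "\n")
        self.locals = {}
        self.scope_index += 1

    def reload_scope(self, wanted):
        # On the next pass, pull one scope's locals back in before re-reading it.
        self.locals = {}
        with open(self.spill_path) as f:
            for line in f:
                record = json.loads(line)
                if record["scope"] == wanted:
                    self.locals = record["symbols"]
                    return

At any moment you're holding the globals plus one scope's worth of locals, which is the whole point.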
 
30 or more times faster is no surprise; remember a single core on a "modern" machine from the last ten or fifteen years is literally a thousand or more times faster than Z80s circa 1978-1985. A really good emulator should be a couple of orders of magnitude faster, even with the hit of running over a modern OS with split kernel/userland. Even my Python simulators, which are entirely interpreted and quite inefficiently written (and so ten to a hundred times slower than a good emulator) run at more or less the speed of an original machine.

My emulator is written in BASIC and isn't very clever in its design. It was written without experience and is somewhat amateurish in every sense of the term, and there was no planning in its design or programming. I just kept adding stuff in the clumsiest way imaginable until everything worked as it was supposed to... It works OK now, but it's not pretty or well designed. It's a brute-force type application. I'm fortunate it runs fast on modern machines.

It started off about the same as a z80 running at 50 MHz, but that quickly evaporated as I added functionality, and I didn't think it maintained much of its speed. At a guess, I would have said around 4 to 5 times faster. And it's emulating hardware at the same time.

I think in this case, since different instructions execute at different speeds, it just turns out that the parts I am using are the more efficient ones.

You can't clear out all the symbols, or you won't be able to call or use anything in the code you just included and built. Even just clearing out local symbols is likely to break things because you've now lost the values for them that you filled in when you have forward references you're resolving in the next pass. And I thought you didn't have proper local symbols, anyway.

But that does introduce an interesting and I think reasonable idea for increasing the number of symbols you can handle in whole-program assembly: set up a file to hold local symbols and, every time a scope closes, dump all the local symbols (resolved or not) to a section of that file. Then you need keep only global symbols and one scope's worth of local symbols in memory. When you come back for the next pass, reload the local symbols (including any you resolved after their use) and generate code (or fail, needing another pass, in which case you dump again and repeat as necessary).

Originally when I hit that problem, I just worked off the idea that each include would go through two passes.

The downside is it also means that it's possible to have multiple passes call multiple passes, and if it iterates 5 levels deep, that's 64 passes in total for some pieces of code - and there's no limit except memory to the number of passes, so 2^128 passes is entirely possible, though I suspect the surface of the disk would be eroded and worn away into dust long before the source could be read that many times.
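The growth there is just a doubling per level: with two passes at the top, and each include making two passes per enclosing pass, the source at depth N gets read 2^(N+1) times - e.g.:

Code:
# Doubling per nesting level: two passes at the top, and each include does
# two passes per enclosing pass, so the file at depth N is read 2**(N+1) times.
for depth in range(6):
    print(depth, 2 ** (depth + 1))             # depth 5 -> 64, matching the figure above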

I thought about saving the labels to disk, but in the end just provided controls around the passes: allowing multiple passes, potential reassignment of the variables, conditional assembly with knowledge of which pass is occurring, and the programmer can even control the passes *during* a pass (eg, Can switch from pass1 to pass2 mid pass, and back again) so that EQUs that are global only get defined once with very little declaration overhead, while all the local ones get redefined.

Hopefully no one would ever do that, and there are ways to avoid that, such as simply using HEX file outputs and only completing the two passes of the include on the first pass, rather than creating linear binaries.

Non-linear hex files can be turned back into binaries by loading them into real memory. Also, later dependencies ( which hopefully won't exist ) can be implemented by just updating that code, since the writing of code can be turned on and off in the assembly and revisited later if required.

So it's possible to write all the linking stuff so it only gets assembled twice at any depth, and as a result code gets written in the order it's successfully assembled, rather than linearly... It's a bit of a mind twister to think about, but the non-linear nature of HEX files is perfectly suited to the objective. It fills in includes as it goes, and the programmer doesn't have to give it any thought outside of knowing it needs to be assembled into HEX, not binary. ( And now that I think about it, the code can also detect whether hex or binary is selected and error otherwise... )
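Since every Intel HEX record carries its own load address, emitting chunks out of order really is free there. A small illustrative writer (standard record layout; the checksum is the two's complement of the record's byte sum):

Code:
# Minimal Intel HEX data-record writer. Records carry their own addresses,
# so chunks can be emitted in whatever order they happen to assemble.

def hex_record(address, data):
    rec = bytes([len(data), (address >> 8) & 0xFF, address & 0xFF, 0x00]) + bytes(data)
    checksum = (-sum(rec)) & 0xFF
    return ":" + rec.hex().upper() + f"{checksum:02X}"

def end_record():
    return ":00000001FF"

# Chunks written in the order they were resolved, not in address order.
chunks = [(0x0200, [0x3E, 0x01]),              # an include assembled first
          (0x0100, [0xC3, 0x00, 0x02])]        # the jump at the start, filled in later

for addr, data in chunks:
    print(hex_record(addr, data))
print(end_record())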

It would be nightmarish to try and read the hex file manually though... A bit like assembling a jigsaw by picking up random pieces, seeing where they go, and dropping them in the correct place until the entire picture appears as the gaps slowly fill in, with some pieces potentially being swapped around at the end to provide a different picture entirely. It would be all over the place at first, but it still comes together, with no piece of code typically ever requiring more than 2 passes in total; in some cases this can even be dropped to one pass globally.

Macros, of course, are tied to the current file being assembled, so don't suffer from the same issue at all. They can iterate or nest up to 32 times, but all is treated as new code.

Generally includes should always be included at the start, though it is possible to include them in the middle or at the end also. It's not a perfect solution, but it captures what is necessary.
 
Originally when I hit that problem, I just worked off the idea that each include would go through two passes.
Well, of course. Each include goes through the same number of passes as the whole assembly does, since each pass starts at the beginning, goes through all the code including each include in turn, and then when you get to the end you decide whether you've resolved everything or not and, if not, do another pass.

But from the way you put it, I get this funny suspicion that you're doing a pass for an included file, going back and doing another pass on just that included file, and then throwing away all of its "local" symbols and not going back through it again when you need to another pass through the whole file.

Which can't work, of course. If you start your program with,

Code:
            jp   start
                
            include foo
            include main        ; includes "start" symbol

You of course cannot fill in start when you first reach it, because it's a forward reference. So you go through the various includes, and presumably are throwing away the code you generate until you restart the entire assembly having found the value of start. So there was no point in re-reading foo twice when it was included; you're still going to have to go back and re-read it again when you are able to generate code from the top.
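For comparison, the conventional behaviour is simply: pass 1 assigns addresses and records symbols, pass 2 re-reads everything - includes and all - and fills in the forward references. A toy sketch, with invented instruction sizes, over a source that's already had its includes substituted in place:

Code:
# Conventional two-pass behaviour over an already-flattened source.
# Instruction sizes are invented for the illustration.

source = [
    ("jp", "start"),        # forward reference: start is unknown on pass 1
    ("nop", None),
    ("label", "start"),
    ("ret", None),
]
SIZES = {"jp": 3, "nop": 1, "ret": 1, "label": 0}

# Pass 1: assign addresses and collect symbols; emit nothing final.
symbols, pc = {}, 0x0100
for op, arg in source:
    if op == "label":
        symbols[arg] = pc
    pc += SIZES[op]

# Pass 2: re-read the whole thing and emit, with every symbol now known.
output, pc = [], 0x0100
for op, arg in source:
    if op == "label":
        continue
    operand = symbols[arg] if arg else None
    output.append((pc, op, operand))
    pc += SIZES[op]

print(hex(symbols["start"]))    # 0x104
print(output)                   # each entry is (address, op, operand); the jp now carries start's address

Every file, included or not, simply participates in however many passes the whole assembly needs.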

...and the programmer can even control the passes *during* a pass (eg, Can switch from pass1 to pass2 mid pass, and back again)....
That doesn't make sense to me at all. So you can say, "I'm on pass 1, and haven't yet found the location of start, but wait, I'll push a button to be on pass 2 and magically know the value of start even though I still haven't encountered it in the source code"?
 
That doesn't make sense to me at all. So you can say, "I'm on pass 1, and haven't yet found the location of start, but wait, I'll push a button to be on pass 2 and magically know the value of start even though I still haven't encountered it in the source code"?

It all depends on what I'm doing with the labels and where they are coming from. Also whether the forward reference is in the same file, or is in the calling file.

An include referencing forward labels outside of the include shouldn't generally happen, but can be addressed with conditional codes and dry writes as one strategy, or it can just do a loop in line with what is expected - eg, a single pass locked to the global pass. Or if it's small enough, a macro can rewrite the jumps later. There's a few ways to deal with it.

But a more likely scenario is a forward label in the include itself that sets a global variable outside of the include, since there are ways to pass these to an include as vectors.

In those cases, since the labels must be set correctly by the end of the first pass, deleting the labels on the first pass necessitates two runs of the include to resolve the internal forward reference.

The simple solution if the local labels are being deleted between passes is to simply instruct the include to make 2 passes for each pass, deleting local variables both times. Inefficient, but simple.

If the programmer desires, and is writing something that will be included in a lot of programs, and there are no external forward references, a more effective way is to drop out on the second pass entirely ( ie, ignore it with conditional includes ), check that HEX is set for output, and then generate the final code for that include on the first pass - setting local variables to be deleted on completion, with global variables remaining in the table to pass parameters back to the main program, such as jumps. Even forward jumps can be set as global parameters by redistribution - eg:

Code:
; Include routines for MATHS.
.2PASS ; Force two passes locally in this first pass.
GROUP MATHS
EQU CALCULATE,DOCALCULATE

MODID LOCAL
; Local declarations and code go here.
DOCALCULATE: ; Memory address defined as the entry point for calls by the main program ( or calling include ).
; Some routine we want to calculate.
REMOVE LOCAL
ENDGROUP MATHS
.END

And in the main program, you'd do,

Code:
IFZ XPASS\$100-1 ; All 2-pass includes to be run on the first pass only.
INCLUDE MATHS.ASM
ENDIF

Then as long as the output is set to hex, it would write the includes straight away, and the global label "CALCULATE" would be passed to the main program, which could

Code:
OPEN MATHS ; Optional if reusing "calculate" elsewhere.
CALL CALCULATE
CLOSE MATHS ; Optional to reuse "calculate" elsewhere.

And the call can be forwards or backwards, since doing both passes of the include will generate the hex for the include first, and establish the label MATHS which will also be valid in the global second pass, when the main program writes in the actual jump value.

The real vector "DOCALCULATE" would be erased...

It's also possible to use temporary labels for local labels - eg,

Code:
CALCULATE: ; Memory address where the calculation subroutine is in the include.
~ALLOTHER: ; Labels to be deleted at the end of the include.
CLEAN LOCAL ; Clean up temporary variables ( there's one for ALL temp variables ).

This would keep CALCULATE but erase ~ALLOTHER.

Several ways to do it depending on programmer preferences.
 
It started off about the same as a z80 running at 50 MHz, but that quickly evaporated as I added functionality, and I didn't think it maintained much of its speed. At a guess, I would have said around 4 to 5 times faster. And it's emulating hardware at the same time.
But your I/O is free. You save almost 2m just by running to RAMDISK. It's hard to appreciate today how slow those drives were back then.
 
But your I/O is free. You save almost 2m just by running to RAMDISK. It's hard to appreciate today how slow those drives were back then.

2m and 5 seconds to be exact, and yes, it really was a different world, where information took a long time to process. I remember writing a simple accounting package for someone in BASIC using stringy floppies, and he waited 30 seconds to a minute between updates when entering or looking up information.

So disk access accounts for almost half of the program time. It is incredibly difficult to consider the timeframes of access back then for practical purposes when our minds have moved on to instantaneous technology that is powerful enough to produce near-real-time photo-realistic images through AI.

And then we get Microsoft Windows, or Linux, running on a machine a million times faster than that era could provide, and they still take 30 seconds to 5 minutes to boot - it highlights just how little progress has been made on such technologies and how incredibly bloated things have become on modern computer systems, which require more than 4GB of memory just to run up an operating system.

People shift to SSD for many applications, but I still do some disk-intensive work on a ramdisk even now. The RAMDISK is still an order of magnitude faster than the SSDs. Especially with lots of small files.

Well, I can't argue that 3 minutes for a 16K output file is great. At some point I need to find better data structures as was suggested earlier in the thread, but it's working for the moment and that's enough for me and the few hours a week I get to spend on the project.

The CP/M compatibility issue I hit was related to the BDOS call corrupting the B register and making a DJNZ go infinite. It's an amateurish mistake: when I wrote that section of code and tested it, I hadn't protected the register value pre-call, and my own BDOS call didn't corrupt the B register, so I didn't notice. That was a very early routine I wrote, so it took a bit of digging and troubleshooting to find it, placing checkpoints through the code to narrow down where it was failing.

I'm spoiled by use of the emulator, which can instantly spit out code traces, but now I had to do the same troubleshooting on an alien OS with no debugging support, so I went back to basics.

I'll add frequent testing on the Joyce emulator as a test element too, to maintain CP/M compatibility.
 
An include referencing forward labels outside of the include shouldn't generally happen, but can be addressed with conditional codes and dry writes as one strategy, or it can just do a loop in line with what is expected - eg, a single pass locked to the global pass.

...
Your whole model for how this is done is so complex and byzantine, especially in its mixing of what I guess is some sort of scope thing with include files, that I can't even begin to wrap my mind around it.

You may want to consider whether other users are going to have any chance of understanding what the heck your assembler is doing. And, even if they do, whether they'd want to deal with all the extra complexity you're adding to the problem they're trying to solve.

You may also find it helpful to stop coding for a bit and write the manual for your assembler. I often find that writing documentation for the developers who are going to be using my system makes it clear to me when I've gone down some far-too-convoluted path for how something should work, and convinces me to redesign things to be a lot simpler.

(For an example of where to start, consider describing your include mechanism in detail. For most assemblers, this is simply, "when file A includes file B, everything operates just as if you have a single file where you'd substituted the contents of file B at the point of its include in file A.")
 