
Overwriting non-interfering sparse files in CP/M

cj7hawk

Veteran Member
Joined
Jan 25, 2022
Messages
1,122
Location
Perth, Western Australia.
OK, here's an unusual question... Can large files be broken up into smaller sparse files on different disks, and then be rejoined by simply PIP'ing them back together?

My guess, given that PIP will simply overwrite a file, is that it respects the extent number and extent mask, and will just add a sparse entry to the large disk rather than doing anything to the existing sparse entries. As long as the records themselves are not rewritten, it should even be possible for records to be blocked/deblocked into an existing extent containing a mix of allocated and unassigned blocks.
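
For reference, the extent number and extent mask live in the directory entry itself. A rough sketch of the standard CP/M 2.2 directory entry (field names follow the usual documentation; the struct is purely illustrative):

    #include <stdint.h>

    /* CP/M 2.2 directory entry, 32 bytes on disk. */
    struct cpm_dirent {
        uint8_t user;     /* user number 0-15; 0xE5 marks a free entry   */
        uint8_t name[8];  /* file name, space padded                     */
        uint8_t type[3];  /* file type; top bits carry R/O and SYS flags */
        uint8_t ex;       /* extent number, low bits                     */
        uint8_t s1;       /* reserved                                    */
        uint8_t s2;       /* extent number, high bits                    */
        uint8_t rc;       /* 128-byte records used in the last extent    */
        uint8_t al[16];   /* allocation map: blocks owned by this extent */
    };

A file whose only entries cover, say, extents 4-5 simply has no entry for records 0-511, and a sequential read of that range returns zeros - which is the sparseness in question.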

So, if you had a particularly large file, it should be possible to simply break it up and write it to smaller disks, then reconstitute it later.

The only question then is whether PIP, used as the "de-archiving" tool, will handle sparse files correctly.
 
Your use of "sparse" concerns me. You can split large files into a bunch of smaller ones, and re-assemble them using PIP later. But if a file is sparse, that implies that there are gaps in the allocations and that the files "size" is larger than the number of blocks assigned to it. PIP will not "do the right thing" re-assembling sparse files, or at least is not likely to do what you want. Reading the gaps in a sparse file returns blocks of "0", so PIP will reassemble a bunch of sparse files into an immense pile of junk that is not likely to be usable.
 
The CP/M file system supports sparse files, but this is more of an accident than intentional design. Applications do not recognize or understand them, and will likely only create them by accident. There is no tooling for sparse files, either.

There is no such thing as a sparse record. The file system tracks allocations on a block level, so the sparseness of a file depends on the file system parameters to start with. If a block is not allocated, it will just read as zero; otherwise it contains data.
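
To make that granularity concrete, here is a minimal sketch (my own, not from any CP/M source) of which file-relative block a 128-byte record lands in, given the block size from the disk parameter block:

    #include <stdio.h>

    #define RECLEN 128

    /* bls is the allocation block size (1K-16K on CP/M 2.2). */
    unsigned block_of_record(unsigned record, unsigned bls)
    {
        return record / (bls / RECLEN);  /* 2K blocks -> 16 records each */
    }

    int main(void)
    {
        /* On a 2K-block disk, records 32-47 share file block 2, so that
           block is a hole only if none of those 16 records was written. */
        printf("record 40 -> file block %u\n", block_of_record(40, 2048));
        return 0;
    }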

Sparse files also have the disadvantage that writing into a hole of an existing file may fail for lack of disk space, since the block has to be allocated at write time. There are very few instances where they are useful, none of which apply to CP/M.

You could write a program that sequentially reads a file while keeping track of the underlying blocks, and deallocates the all-zero ones to save some disk space. But this means mucking around in the on-disk structure. It is smarter to avoid storing huge runs of zeros in files in the first place. For archiving tools, even the simplest RLE compression schemes will handle sparse files efficiently - without knowing anything about sparseness.
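
To show how little is needed on the compression side, here is a toy zero-run RLE encoder (the format and names are invented for illustration): literal non-zero bytes pass through, and each run of zeros becomes a zero marker byte followed by a count.

    #include <stdio.h>

    /* Toy zero-run RLE: non-zero bytes are copied through; each run of
       1-255 zeros becomes a 0x00 marker byte followed by a count byte. */
    void rle_zero_encode(FILE *in, FILE *out)
    {
        int c;
        while ((c = fgetc(in)) != EOF) {
            if (c != 0) { fputc(c, out); continue; }
            int run = 1;
            while (run < 255 && (c = fgetc(in)) == 0)
                run++;
            fputc(0x00, out);          /* marker */
            fputc(run, out);           /* run length 1-255 */
            if (c != 0 && c != EOF)
                fputc(c, out);         /* replay the byte that ended the run */
        }
    }

A matching decoder just expands each marker/count pair back into zeros, so a hole costs two bytes per run regardless of the block size.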
 
That does seem a clever way of reassembling the files in the correct order without the user having to type out the append commands. Each disk would only have its own portion of the allocation listed, with all the other blocks left empty. I imagine a replacement for PIP would be needed to correctly note each group of blocks.

The underlying mechanism would be somewhat similar to BitTorrent, except the partial files would live on different disks instead of different computers.
 
So, if the original goal is simply to break a large file up into pieces that will fit on some smaller media, to be later reassembled, using sparse files has no advantage. You simply copy the first X records to a file on media 1, the next X records to a file on media 2, etc. Not sparse files. Then you just use PIP to concatenate the files back together.
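
Concretely (filenames hypothetical), the reassembly side is a single PIP command. The [O] option tells PIP to treat the files as object (binary) files, so an embedded ^Z doesn't end the copy early:

    A>PIP C:WHOLE.DAT=A:PART1.DAT[O],B:PART2.DAT[O]

The splitting side is the part PIP can't do for binary files - as far as I know its S/Q options only extract between strings in ASCII text - so a small program has to write the first X records to PART1, the next X to PART2, and so on.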
 
I was thinking about splitting a large random-access file across multiple disks if it needed to be accessed directly but couldn't fit on a single disk. The elements of it could then be reached by swapping disks one at a time, as long as the calling program allowed a manual intervention whenever the block being accessed was not on the current disk - all without having to break it into multiple files, in case it was moved from a large drive to a smaller one.

It's not something I'm currently planning on doing, but was curious then about the behavior.

WordStar, BTW, does something like this. It messes with the extent number to hide working files on the disk. I found out when my WordStar crashed and I was having compatibility issues, so I dumped the disk to a virtual disk and examined it, and found WordStar entries missing the lower extents that would normally make them show up in a DIR listing. WordStar doesn't spread them across multiple disks, though.
 
If I understand correctly, you want to allocate virtual blocks of a file, referring to other disks; the file areas before and after the current disk would then be seen as unallocated/sparse. Trying to access such a block could then be used to trigger a disk swap. Yes, that is theoretically possible, but not supported by CP/M.

Speaking of CP/M 2.2, the maximum file size is not much larger than a typical file system (the block size is kept small for space efficiency), so you simply run out of block addresses. And if you split your file across two different file systems (e.g. a single-density and a double-density disk), the allocation scheme breaks down entirely.

It is smarter to keep block/record virtualization strictly within the application. That scales a lot better and doesn't risk being destroyed by a careless program. You could keep the same file name on all disks, but that becomes a pain when dealing with larger, fixed media.
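
A minimal sketch of what that in-application virtualization could look like (all names invented): logical record numbers map to a (volume, local record) pair, and the program asks for a disk swap when the needed volume isn't mounted.

    #include <stdio.h>

    #define RECS_PER_VOLUME 2048UL     /* e.g. 256K worth of 128-byte records */

    static int mounted = -1;           /* volume currently in the drive */

    void require_volume(int vol)
    {
        if (mounted != vol) {
            printf("Insert volume %d and press RETURN\n", vol);
            getchar();
            mounted = vol;             /* real code would re-log the disk */
        }
    }

    void read_logical(unsigned long rec, unsigned char buf[128])
    {
        int vol = (int)(rec / RECS_PER_VOLUME);
        unsigned long local = rec % RECS_PER_VOLUME;
        require_volume(vol);
        /* ... then read record 'local' from this volume's part-file,
           e.g. via BDOS random read (function 33) under CP/M 2.2. */
        (void)local; (void)buf;
    }

Each volume holds an ordinary, fully allocated file, so nothing depends on sparse support and the parts survive being copied with plain PIP.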
 
The maximum file size is around 8 MB, and the question was more of a thought experiment: I've noticed that my source files sometimes grow a little larger than the maximum capacity of some single-sided disks. Hence I started to wonder whether there was a way to split a single file across multiple physical disks in the same drive, but also to bring them back together later on a larger disk.
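
For reference, the 8 MB figure falls straight out of CP/M 2.2's 16-bit random record number:

    65,536 records x 128 bytes/record = 8,388,608 bytes = 8 MB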
 