
The joys of filesharing ..

I did some testing based on the ideas I gathered here. All tests are run on the same machine (386-40) and are run for a meaningful length of time. I run each test a few times and take the best number.

Sending a real file one packet at a time with UDP checksums: 166KB/sec
Sending a real file one packet at a time wo/ UDP checksums: 200KB/sec

So turning off UDP checksums improves the speed by 20%.
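
For reference, here's what that checksum actually is: the standard Internet (one's complement) checksum that UDP uses. Every payload byte gets touched once, which is why skipping it saves so much CPU on these machines. This is a generic sketch in C, not the code from my program:

Code:
/* RFC 1071 style Internet checksum: add the data up as 16-bit words,
   fold the carries back into the low word, then invert. */
unsigned int inetChecksum( unsigned char *data, unsigned int len )
{
   unsigned long sum = 0;

   while ( len > 1 ) {                     /* sum 16-bit words */
      sum += ( (unsigned int)data[0] << 8 ) | data[1];
      data += 2;
      len -= 2;
   }

   if ( len == 1 ) {                       /* pad a trailing odd byte */
      sum += (unsigned int)data[0] << 8;
   }

   while ( sum >> 16 ) {                   /* fold carries into the low word */
      sum = ( sum & 0xFFFF ) + ( sum >> 16 );
   }

   return (unsigned int)( ~sum & 0xFFFF ); /* one's complement of the sum */
}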

Now do the same test, but just send packets as fast as you can without checking for reply packets from the server. (This tests how fast you can read a file and shove it down the Ethernet.)

Sending a real file as fast as possible with UDP checksums: 260KB/sec
Sending a real file as fast as possible wo/ UDP checksums: 338KB/sec


Not waiting for the server to ack each individual packet makes a big difference. It's over 50% faster. Here the difference between UDP checksums and no UDP checksums is slightly more pronounced .. I don't understand why, but it's a 30% difference.

And finally, just for grins, see what the effect of file reading is by not sending real data. In this test I just send dummy data to avoid the file read.

With UDP checksums: 492KB/sec
Wo/ UDP checksums: 649KB/sec

Wow, eh? Too bad sending dummy data doesn't accomplish anything. The difference between UDP checksums here and no UDP checksums is 30% again, so I'm going to assume in the first test something was flawed and it should have been a 30% difference there too, not a 20% difference.

Note that a 10Mb/sec Ethernet works out to 1220KB/sec if everything is perfect. File transfer throughput suffers because of gaps between frames, frame headers, IP headers, etc. If the machine was fast enough the file xfer rate would be close to 1000KB/sec, including the TCP/IP overhead.

The only thing I could do differently would be to use a slightly larger packet size to cut down on some overhead. (I'm using a packet size of approximately 1100 bytes now.)
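
For the curious, here is the rough arithmetic behind that ceiling, assuming ~1100 data bytes per packet and the standard header sizes (nothing here is measured, it's just back-of-the-envelope):

Code:
10Mb/sec Ethernet            = 1,250,000 bytes/sec  (~1220KB/sec)

Per packet with ~1100 data bytes:
  Preamble + interframe gap  =   20 bytes
  Ethernet header + FCS      =   18 bytes
  IP header                  =   20 bytes
  UDP header                 =    8 bytes
  Data                       = 1100 bytes
                               ----------
  On the wire                = 1166 bytes

1,250,000 / 1166 = ~1072 packets/sec
1072 x 1100      = ~1,179,000 data bytes/sec (~1150KB/sec)

So the framing overhead by itself only costs a few percent; the rest of the drop comes from acks flowing back, turnaround time, and the CPU work done per packet.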


Anyway, this is getting too long. Contrary to what I was thinking, waiting for the acks for each packet does take a while. So I implemented a simple mechanism to let the 386 'get ahead' of the server a little bit, while still keeping all of the error detection. Here are the new results:

With UDP checksums: 248KB/sec
Wo/ UDP checksums: 310KB/sec

So, compare 248KB/sec vs. 166KB/sec, or 310KB/sec vs. 200KB/sec. It's about 50% faster with no loss of error checking!

Lesson learned .. at least for a machine like the 386-40, the sliding window is worthwhile.
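
Roughly, the 'get ahead' mechanism looks like this. It's only a sketch; sendPacket and pollForAck are made-up helpers standing in for the real code, and timeouts/retransmits are left out:

Code:
#define WINDOW 4                   /* packets allowed in flight        */

extern void sendPacket( unsigned int seq );      /* hypothetical       */
extern int  pollForAck( unsigned int *ackSeq );  /* helpers            */

void sendFile( unsigned int totalPackets )
{
   unsigned int nextToSend = 0;    /* next packet to transmit          */
   unsigned int lastAcked = 0;     /* packets below this are acked     */
   unsigned int ackSeq;

   while ( lastAcked < totalPackets ) {

      /* Keep sending while the window has room. */
      while ( nextToSend < totalPackets &&
              nextToSend - lastAcked < WINDOW ) {
         sendPacket( nextToSend );
         nextToSend++;
      }

      /* Drain any acks; each one slides the window forward. */
      while ( pollForAck( &ackSeq ) ) {
         if ( ackSeq >= lastAcked ) lastAcked = ackSeq + 1;
      }
   }
}

Stop-and-wait is the same loop with WINDOW set to 1; raising it is what lets the sender keep the wire busy while acks are still in transit.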
 
New numbers for the PCjr using a new Ethernet card:

With UDP checksums: 21.4KB/sec
Without UDP checksums: 39.26KB/sec

With the Xircom parallel port Ethernet adapter it was 15KB/sec and 24KB/sec, so this is a nice bump.

Here's a picture of the nasty hack :)

http://brutman.com/PCjr_WD_small.jpg
 
Yeah, it's a little exposed. Hardware hacking at its best. I need to build a proper enclosure because if a stray pen touches anything it'll all go up in smoke!

Got the receive code done, so now I can send and receive. The send code is pretty good, but the receive code has only just started working, so I have to go back and add all of the error checking, sliding window, and other tweaks. That'll take another two or three weeks. But then I'll be able to send and receive from the DOS machines to the Java program running on a Windows or Linux box. After that gets polished it will be time to start the TCP part.
 
I did some performance work on my UDP checksum code and got some great advice on exploiting the 8088 from Jim Leonard of '8088 Corruption' fame. Here are the new numbers:

Code:
Machine   Card             Before    After
386DX-40  NE2000 clone     335800    356800
PCjr      Xircom PE3-10BT   18000     20700
PCjr      WD 8003           25400     31200

(All numbers are in bytes per second)

What a difference a little more 8088 assembly language makes! Especially well-chosen instructions, and keeping everything in registers while in a performance-sensitive loop.
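
To give a flavor of the kind of change (shown in C for readability; the real gains come from hand-written 8088 assembly along these lines, e.g. LODSW/ADC in an unrolled loop): processing several words per pass keeps the running sum in a register and spreads the loop overhead across more bytes.

Code:
/* Inner loop of the checksum, unrolled four 16-bit words at a time.
   The caller still folds the carries and inverts the result. */
unsigned long sumWords( unsigned int *w, unsigned int words )
{
   unsigned long sum = 0;

   while ( words >= 4 ) {
      sum += w[0];
      sum += w[1];
      sum += w[2];
      sum += w[3];
      w += 4;
      words -= 4;
   }

   while ( words-- ) sum += *w++;  /* leftover words */

   return sum;
}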
 
Does crossing a boundary (e.g. 256-byte pages) make a difference in x86 assembler, or does everything run at the same speed as long as you stay within the same 64K segment?
 
There is no penalty because there really are no boundaries. If you go past the end of your current segment, the pointer just wraps around, since it is a 16-bit pointer.

You only start to get boundary penalties when you get into virtual memory. Everything I'm doing on the 8088 is a relatively flat memory model .. there is no virtual memory or paging even possible. The segment registers are just a necessary evil to extend your range past 64K.
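
To make the wrap concrete, here's the real-mode address calculation in C. This is just the 8086's addressing rule, nothing specific to my code:

Code:
/* Physical address = segment * 16 + offset, giving a 20-bit address
   out of two 16-bit registers. */
unsigned long physAddr( unsigned int segment, unsigned int offset )
{
   return ( (unsigned long)segment << 4 ) + offset;
}

/* Example: 2000:FFFF is physical 0x2FFFF.  Incrementing the offset
   wraps it to 0000, so the next byte is physical 0x20000 again -- you
   stay inside the same 64K unless you change the segment register. */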

I just got my first multi-packet TCP/IP transaction to run. The DOS machine contacted Apache running on a Linux box and sent 'GET /'. Apache in turn spewed forth many small packets, while my code dumped the packets to the screen and sent acks.

I'm still missing a lot of code, but getting multi-packet transactions is awesome .. this is coming together much faster than I thought it would.
 
The code has improved quite a bit, and I've learned much more. It's stable enough now such that I'm starting to measure performance.

I added some required error checking (the infamous checksum) for incoming data which slowed down receives quite a bit, but there really is no way around that. So now, as far as I can tell, I detect all required errors. Handling them gracefully is a different issue. :)

I also cheated and found a simple way to force the TCP 'maximum segment size' to 1024, up from the default of 576. That helps things a lot because there is a bit of overhead to make a TCP/IP packet, so bigger packets reduces your overhead. Unless you start losing packets .. then it sucks. The size will be configurable at some point.
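
For reference, the MSS is advertised as a little four-byte option in the SYN segment. The sketch below just fills in those bytes; how my code actually slots them into the packet isn't shown:

Code:
/* TCP option: kind 2 (MSS), length 4, then the MSS in network byte
   order (high byte first). */
void setMssOption( unsigned char *opt, unsigned int mss )
{
   opt[0] = 2;                             /* option kind: MSS       */
   opt[1] = 4;                             /* option length in bytes */
   opt[2] = (unsigned char)( mss >> 8 );   /* MSS high byte          */
   opt[3] = (unsigned char)( mss & 0xFF ); /* MSS low byte           */
}

Calling it with 1024 advertises the bigger segments mentioned above.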

Lastly, to measure the speed of the code and the Ethernet card and not my disk drives, I've changed my testing methodology. If I do a 'speed test' the machine sends 4MB of random bytes from memory to a Linux machine instead of reading data from a file. The random crap is still error checked and has to arrive correctly. That simulates an infinitely fast disk. Similarly, on a receive I receive packets but then toss them away instead of writing to disk.
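
The speed test itself is nothing fancy; roughly this, with made-up fillRandom/tcpSend names standing in for the real calls (the sketch reuses one random chunk to keep it short):

Code:
#define TEST_BYTES 4194304L        /* 4MB                             */
#define CHUNK      1024            /* one segment's worth             */

extern void fillRandom( unsigned char *buf, unsigned int len );  /* hypothetical */
extern void tcpSend( unsigned char *buf, unsigned int len );     /* helpers      */

void speedTestSend( void )
{
   static unsigned char buf[CHUNK];
   long sent = 0;

   fillRandom( buf, CHUNK );       /* junk payload, but it still gets */
                                   /* checksummed on the way out      */
   while ( sent < TEST_BYTES ) {
      tcpSend( buf, CHUNK );
      sent += CHUNK;
   }
}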

And here are the numbers

JR with WD Ethernet: Send 37.7KB/sec, Receive 36.9KB/sec
386-40 with NE2000: Send 475.0KB/sec, Receive 492.0KB/sec


I'll update tomorrow with numbers from an XT running an original 10MB hard drive, with data being written to and from the hard disk.
 
NICE! Those are some very good numbers for computers that old.
 
Here are the updated numbers, with the XT included:

PCjr with WD Ethernet: Send 37.7KB/sec, Receive 36.9KB/sec
PC XT with 3Com Etherlink II: Send 36.3KB/sec, Receive 33.8KB/sec
386-40 with NE2000: Send 475.0KB/sec, Receive 492.0KB/sec

So the Jr is marginally faster than the XT, but mine has a NEC V20 and a different Ethernet card, so it's close enough.

On a real world file transfer (885KB) using the XT I got the following:

Receive: 16.43KB/sec, Send: 16.57KB/sec

Same machine, but using NCSA Telnet/FTP for the file transfer:

Receive: 17KB/sec, Send: 20.58KB/sec


Ugh, it would seem that I got spanked by 11-year-old code. But you need to peel the onion a bit to see why.

NCSA's FTP code uses a 10000 or 20000 byte buffer. It doesn't read or write from the disk unless it has a full buffer, so on a machine with a slow disk it makes a big difference. To use such a large buffer they have to use memory copies to get data into packets, which is expensive but they have a nice memory copy loop.

My code uses 1KB buffers, but no memcopies.

Apparently the disk overhead is so bad that it pays to make fewer, bigger operations and suffer through the memcopies rather than to make more frequent, smaller operations without the memcopies.

If I do a real FTP client and design it like that, then I should be able to beat NCSA's FTP, which is very good.
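
The idea, roughly, is below. The names are made up; this isn't NCSA's code or mine. Copy each incoming packet into a big buffer and only hit the disk when the buffer fills:

Code:
#include <stdio.h>
#include <string.h>

#define DISKBUF 16384              /* in the 10000-20000 byte range */

extern unsigned int recvPacket( unsigned char *dst, unsigned int max );  /* hypothetical */

void receiveToFile( FILE *out )
{
   static unsigned char diskBuf[DISKBUF];
   unsigned char packet[1500];
   unsigned int used = 0, len;

   while ( (len = recvPacket( packet, sizeof(packet) )) != 0 ) {
      if ( used + len > DISKBUF ) {          /* buffer full: one big write */
         fwrite( diskBuf, 1, used, out );
         used = 0;
      }
      memcpy( diskBuf + used, packet, len ); /* the "expensive" copy       */
      used += len;
   }

   if ( used ) fwrite( diskBuf, 1, used, out );
}

The memcpy costs CPU, but one 16KB write is much cheaper than sixteen 1KB writes on an XT-class hard disk, and that's where NCSA wins.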


Mike
 
Still working on the TCP/IP code ...

Right now the code is capable of opening a socket connection to another machine, sending and receiving data, and closing the connection cleanly. The machine is rock solid during hours of coding and testing so I'm not corrupting memory or anything bad like that. And the Linux machine on the other side isn't getting mad at my sloppy disconnects anymore. :)

It still performs fairly well, but it is getting slower ... as I add more features to meet TCP/IP requirements it slows down. Still, I'm getting 22KB/sec when sending a file and 16KB/sec when receiving, and that includes the overhead of writing to disk. And if I increase the size of my disk reads and writes, those numbers will improve. Those numbers are on a PC XT with the original 10MB hard disk and a 3Com Etherlink II. On my 386-40 the numbers are around 300KB/sec.

On the todo list:

- Add Unix style send and recv calls. This will slow things down by introducing a memcpy, but when doing file transfers it will open the door to larger disk reads and writes. I'll still have my current send & recv calls, which are fast because they avoid the extra memcpy but they limit the amount of data sent and received to what fits in a single packet. (There's a quick sketch of the two styles after this list.)

- Add a listen call so that the code can be used to implement servers.

- Fix my connection reset detection code .. (or lack thereof)

- Go back and clean up my UDP implementation based on what I've learned from TCP. I haven't touched the UDP code in two months, and I'm sure it needs work.

- Do a simple DNS implementation so that you don't always have to specify raw IP addresses.

- Improve my timeout and retransmission code and my window-size code.

- More performance work.
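
Here's a rough picture of the two call styles from the first item above (made-up prototypes, not the final API):

Code:
/* Current style: the application works directly with the packet
   buffer, so there is no copy, but one call moves at most one
   packet's worth of data. */
int sendPacketDirect( void *packetBuf, unsigned int len );

/* Unix style: the stack copies between the packet buffers and the
   caller's buffer.  The memcpy costs CPU, but the caller can hand in
   a big buffer and do one large disk read or write per call. */
int send( int socket, void *buf, unsigned int len );
int recv( int socket, void *buf, unsigned int maxLen );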



Do we have any C programmers in the audience? I plan to use what I'm writing here for a telnet BBS, but it would be nice to have other people using some of this.

I've been working on it on and off for close to a year now. Last year when I started it was a struggle just to get a packet of any data over the Ethernet. Now I'm starting to get ideas for what the telnet BBS should look like. :)

On a related note, is anybody else involved in a long term programming project on vintage hardware? If so, start a new thread here and let us know what you're building!
 
I've not worked professionally in C, but as an advanced hobbyist. Not on older 8086/88 class systems though. I think I've used Turbo C version 1 or so on a Wang in PC emulation mode. Some of the more modern mechanisms were not implemented, IIRC, like a function not being able to return void (?) or calloc() having to be replaced by malloc().
 