Spent the evening at W7RM's with Mike K7NT (the ham formerly known as
AA7NX). Initially, we were able to make the network crash in about
2 minutes. Well, crash enough for some of the debug stuff I had to
start yelling in alarm.
After much hair pulling and enhancements to the tests, I finally
figured out the problem. I kept a linked list of messages originated
by a specific computer. When that computer saw its message come back,
it would delete the message from the linked list. Of course, it would
not send this message on to the next computer, since by then everyone
has seen it.
This same list is used to generate the retries. If something is on
the list long enough, it will generate a retry (up to ten of them).
Let's say it does a retry every 30 seconds.
Okay, now pretend the first message took 35 seconds to come around
the network. When the originating computer sees the message, it
recognizes it as its own, deletes the entry in the retry list, and
doesn't send it on to the other computer. What is wrong with this
picture???
The computer sent a retry of the original message 5 seconds before
receiving the first message. Now when that second message comes around the
loop, it does not appear in the retry list - and thus nobody takes
any ownership of the message and it just keeps going around and around
until the "Time to live" count goes to zero (about 40 hops).
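
The race above can be reproduced with a small sketch. This is Python, not the program's own code, and every name in it (OldOriginator, hops_for_copy, TTL_START) is hypothetical. It assumes a 3-computer ring and a starting time-to-live of 40 hops, taken from the "about 40 hops" figure above. The original and its retry carry the same message ID, so the first copy to return removes the list entry and the second copy finds nothing to match.

```python
TTL_START = 40  # assumption: "about 40 hops"; the exact value is not given


class OldOriginator:
    """Hypothetical model of the original logic: one list entry per message."""

    def __init__(self):
        self.pending = set()

    def originate(self, msg_id):
        self.pending.add(msg_id)

    def on_receive(self, msg_id):
        # Absorb the message only if it is still on the retry list.
        if msg_id in self.pending:
            self.pending.discard(msg_id)
            return "absorbed"
        return "forwarded"  # nobody owns it any more


def hops_for_copy(node, msg_id, ring_size, ttl=TTL_START):
    """Count the hops one copy makes before it is absorbed or its TTL hits zero."""
    hops = 0
    while ttl > 0:
        hops += 1
        ttl -= 1
        if hops % ring_size == 0:  # the copy is back at the originator
            if node.on_receive(msg_id) == "absorbed":
                return hops
    return hops


node = OldOriginator()
node.originate("QSO-1")  # t=0: original sent; the retry goes out at t=30
first_copy_hops = hops_for_copy(node, "QSO-1", ring_size=3)   # returns at t=35
retry_copy_hops = hops_for_copy(node, "QSO-1", ring_size=3)   # orphaned retry
```

In this sketch the first copy is absorbed after one lap, while the retry keeps circulating until its time-to-live runs out.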
To make things worse, the program initially sent the first retry
after only 4 seconds! Then another at 14. If you had a computer
that was doing a lot of hard disk work (like a slow 286 with the
network processing 1500 QSOs/hour), these messages might take
a minute to get around... meanwhile there is a queue of about 5
retry messages (with the old timing). The output buffer quickly
filled up causing the computer to slow to a crawl.
So, the retry buffer has been re-engineered so the last 32 messages
are remembered. A flag is set when the message comes all the way
around the network so it won't be resent. The retries are spread out
every 30 seconds (meaning you could have a computer down for a few
minutes and everything should catch up). I have also increased the baud
rate from 2400 to 4800, which really helped speed things up.
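
One way the re-engineered buffer could look, as a minimal Python sketch rather than the program's actual code. The names (RetryHistory, due_retries, the "acked" flag) are hypothetical. The 32-entry history and the 30-second spacing come from the description above; keeping the ten-retry limit from the earlier design is an assumption.

```python
from collections import OrderedDict

HISTORY_SIZE = 32     # "the last 32 messages are remembered"
RETRY_INTERVAL = 30   # seconds between retries
MAX_RETRIES = 10      # assumption: carried over from the earlier design


class RetryHistory:
    """Hypothetical retry buffer that flags entries instead of deleting them."""

    def __init__(self):
        self.entries = OrderedDict()  # msg_id -> state, oldest first

    def originate(self, msg_id, now):
        self.entries[msg_id] = {"last_sent": now, "retries": 0, "acked": False}
        while len(self.entries) > HISTORY_SIZE:
            self.entries.popitem(last=False)  # forget the oldest message

    def on_receive(self, msg_id):
        entry = self.entries.get(msg_id)
        if entry is None:
            return "forwarded"   # not one of ours (or too old to remember)
        entry["acked"] = True    # it made it all the way around
        return "absorbed"        # late retry copies are absorbed too

    def due_retries(self, now):
        """Return the IDs that should be resent at time `now`."""
        due = []
        for msg_id, entry in self.entries.items():
            if entry["acked"] or entry["retries"] >= MAX_RETRIES:
                continue
            if now - entry["last_sent"] >= RETRY_INTERVAL:
                entry["last_sent"] = now
                entry["retries"] += 1
                due.append(msg_id)
        return due
```

Because the entry stays in the history after the flag is set, a retry that was already in flight is still recognized and absorbed when it comes around, instead of circulating until its time-to-live expires.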
Now, with 3 computers going full speed, the delay time for QSOs
showing up on the computers is on the order of seconds. We ran
about 600 QSOs at rates close to 2K/hour without losing a single
one.
TR is now network ready!!!!!
I am going to bed.
Tree N6TR
tree@contesting.com