I’m happy to report that, as far as we have been able to determine, all our speed-related problems at Mockingbird Tower are now resolved. Assuming this announcement doesn’t generate a spate of contrary responses, our next tasks will be to arrange for incremental UPS protection at our towers, beginning with the ones that are most highly subscribed, and to work with our consultants to ensure that we can deliver to our subscribers all the untapped bandwidth currently available at our Internet gateway.
For the technically curious, a brief description of the source of our speed problems appears below. (Warning: unpaved roads ahead!)
Most modern networks use an automatic routing protocol (OSPF) designed to detect tower and pathway outages automatically and re-route traffic over available alternative paths. This is why we desire to establish redundant pathways between as many pairs of towers as practical.
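For readers who like to see the idea in code, here is a minimal sketch of rerouting around an outage. It is not OSPF itself, only the shortest-path calculation at the heart of that kind of protocol. The tower names "A", "B", "C" and the link costs are made up for illustration and do not describe our actual network.

```python
import heapq

def shortest_path(links, source, target):
    """Dijkstra's algorithm over undirected links given as {(a, b): cost}."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0, source, [source])]  # (total cost, current tower, path so far)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None  # no surviving path between the two towers

# Hypothetical three-tower network with one redundant (more expensive) link.
links = {("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 5}
normal = shortest_path(links, "A", "C")

# Simulate an outage of the B-C link and recompute the route.
surviving = {pair: cost for pair, cost in links.items() if pair != ("B", "C")}
rerouted = shortest_path(surviving, "A", "C")

print(normal)
print(rerouted)
```

With the redundant A-C link in place, traffic still has somewhere to go when B-C fails. Without it, the function would return `None`, which is the situation redundant pathways are meant to prevent.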
In networks with complex interconnections, it’s possible for the automatic algorithm to set up an alternative topology that routes packets in such a way that they can arrive at the same location from multiple sources, potentially causing an unending feedback loop (“packet storm”). To prevent this, another protocol (RSTP) briefly shuts down redundant pathways that cause such problems, forcing “timeouts” that allow the feedback loops to die out.
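The loop-prevention idea can be sketched in a few lines: keep a loop-free tree of links carrying traffic and hold the redundant ones in reserve. This is a deliberately simplified illustration, not the RSTP algorithm; real RSTP elects a root bridge and weighs port costs, which this sketch skips. The bridge names and the choice of "A" as root are assumptions for the example.

```python
from collections import deque

def spanning_tree(links, root):
    """Split links into a loop-free active set and a blocked (standby) set.

    Uses a breadth-first walk from the root; any link not needed to reach
    a new bridge would close a loop, so it is left blocked.
    """
    graph = {}
    for a, b in links:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)

    active = set()
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(graph.get(node, [])):
            if neighbor not in seen:
                seen.add(neighbor)
                active.add(frozenset((node, neighbor)))
                queue.append(neighbor)

    blocked = {frozenset(link) for link in links} - active
    return active, blocked

# Hypothetical triangle of bridges: three links, so exactly one forms a loop.
links = [("A", "B"), ("B", "C"), ("A", "C")]
active, blocked = spanning_tree(links, "A")

print("active: ", sorted(sorted(link) for link in active))
print("blocked:", sorted(sorted(link) for link in blocked))
```

In this triangle, the B-C link ends up blocked: every bridge is still reachable from A, but a packet can no longer circle the loop.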
In January, our router manufacturer released a software upgrade that subtly broke RSTP, a fault that wasn’t identified in the field for quite some time.
In mid-March, immediately after the sensational “Vault 7” Wikileaks disclosure, we were forced by circumstances to promptly adopt one of the descendants of that January release, in order to close security loopholes whose details were now internationally available to hackers. We were still unaware of the bug, or how it might affect us.
Some time after that, we began seeing mysterious subscriber unit dropouts, as RSTP decided not to route traffic over perfectly functional paths. This is when we discovered the existence of the bug, and exercised our only recourse, which was to disable RSTP on any portion of our network we found to be misbehaving, right down to the subscriber roof units themselves, until communication was re-established with the affected subscriber. In all cases, we detected and remedied the break in communication without the subscriber ever reporting it to us. Other than one or two cases, all these dropouts occurred on the Mockingbird Tower and neighborhood POPs fed by the Mockingbird Tower.
On June 20, our manufacturer finally released a bug fix for this problem, which we adopted after testing. With the bug fixed, and since we had modified many subscriber units to evade it, we changed our nightly maintenance script to re-enable RSTP on all the bridges in our network.
And that’s where we goofed.
We essentially created a few very large RSTP bridges, inadvertently crossing wireless boundaries, that encompassed every device connected to a tower.
Because the RSTP protocol requires bridge devices to communicate among themselves (at some overhead) to detect packet storms, our error put the bandwidth available to any subscriber at the mercy of the bandwidth available to the subscriber with the weakest performance. When RSTP failed to hear back from these units within its allotted time (which was pretty often), it would enforce a four-second timeout, suspending all traffic on the bridge, meaning the entire tower. As you might guess, this brought tower performance to a crawl.
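To get a feel for how costly those stalls are, here is a back-of-the-envelope model. The four-second stall comes from the description above; everything else is an assumption. In particular, the gaps between stalls are illustrative values, not measurements from our towers, and the model pretends stalls arrive at a steady rhythm.

```python
STALL_SECONDS = 4.0  # length of each tower-wide suspension, per the text above

def usable_fraction(seconds_between_stalls, stall_seconds=STALL_SECONDS):
    """Fraction of time the tower passes traffic.

    Toy model: each cycle is a stretch of normal traffic followed by one stall.
    """
    if seconds_between_stalls < 0 or stall_seconds <= 0:
        raise ValueError("durations must be non-negative, stall must be positive")
    return seconds_between_stalls / (seconds_between_stalls + stall_seconds)

# Hypothetical gaps between stalls, in seconds.
for gap in (60, 12, 4, 1):
    print(f"{gap:>3} s between stalls: {usable_fraction(gap):.0%} of tower time usable")
```

Under these assumptions, a stall every 12 seconds already costs a quarter of the tower's time, and a stall every 4 seconds costs half of it. Since every subscriber on the bridge shares the stall, one marginal unit is enough to drag everyone down.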
It was a subtle and insidious error, one that took us and our consultants weeks to locate, because it gave all indications of being caused by some radio frequency issue. It hit Mockingbird the hardest because by necessity Mockingbird (due to local terrain factors) serves a greater number of marginal subscribers than any of our other towers.
Once the problem was obvious, we corrected it by hand, one at a time, on each of the 50+ Mockingbird subscriber units, the tower itself, and all inter-tower link units on our network; and recoded the nightly maintenance script to correct it on our other towers as well. At that point, we were left with a pure speed issue on Mockingbird, which turned out to be the result of a broadcast protocol change we played with during the weeks that we were searching for the real cause of our problem. Once we reverted that, full tower performance resumed.
Yes, it was a highly technical nightmare, the sort of labyrinth network engineers can be faced with navigating on any given day. There are a relative handful of people in the world who know everything there is to know about networking, and (despite commercial training and frequent recertification) I’m confident I’ll never be among them. But what Grand Avenue Broadband can bring to the table instead are the resources and the willingness to obtain expert help when indicated, and the perseverance to pursue problem situations until they are resolved.
Thanks for making us your Internet provider.