[HamWAN PSDR] Service Impact Notice
me at bartk.us
Fri Mar 11 00:56:22 PST 2016
Hmm, that's not the whole story though. If it were just the one router
failure (in reality a hypervisor failure), we'd be in a much better
position, but it's combined with two modem failures. We had the
ETiger->SnoDEM modem die over the winter, and it needs replacement.
That link has been down for a month or more now. And most recently
we're having the Tukwila->Baldi modem lose connectivity frequently.
We've implemented an automatic mitigation for that, but it still
produces sporadic short downtime windows of a few minutes. I'd just
like to move that modem to a NetMetal 5. Our servers are also being
affected by instability in the Quagga routing software. We need to
replace this with a more stable alternative, like BIRD. Lastly, the
Baldi emergency uplink is configured to go only to Westin and Corvallis,
not Tukwila.
We could have avoided DNS outages too, if the anycast groups were
populated with more of the available servers. I believe lack of good
automation for server build-outs is causing the deployment lag here.
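To illustrate the anycast point, here is a minimal sketch. The server names and the failure set are made up for the example and are not our actual inventory. It only shows that a group stays reachable as long as at least one member is outside the set of failed hosts:

```python
# Hypothetical server names; not the real HamWAN inventory.
def group_survives(anycast_members, failed):
    """Return True if at least one anycast member is still up."""
    return any(server not in failed for server in anycast_members)


failed = {"server-a", "server-b"}        # assumed simultaneous failures
small_group = ["server-a", "server-b"]   # sparsely populated anycast group
large_group = ["server-a", "server-b", "server-c", "server-d"]

print(group_survives(small_group, failed))  # False: every member is down
print(group_survives(large_group, failed))  # True: server-c and server-d remain
```

With the same two failures, the sparsely populated group goes dark while the fuller group keeps answering.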
The network is designed to withstand failures, even multiple failures,
but we've got many broken things right now that need fixing. Once those
are fixed, I would really love to see some folks get behind improving our
monitoring, deployment, and diagnostic automation. Networks like this
won't scale unless they're nearly completely automated and simple to
manage. I would not mind at all if we even rolled back some features
until we can get them re-implemented in 100% automated ways.
As important as all this is, I still think the deep penetration project
takes precedence, so I can't drop that work in favor of this. Aside
from helping out on the simple break-fix stuff, I mean.
On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote:
> Thanks for the update, Nigel.
> On Wed, Mar 9, 2016 at 10:17 PM, Nigel Vander Houwen
> <nigel at nigelvh.com> wrote:
> Hello All,
> Just wanted to send out a quick notice here. We’ve had a failure
> at our Seattle edge router, which we’re still investigating. In
> the meantime, our Tukwila edge router is still providing
> connectivity, but you may notice higher latencies or issues
> reaching things. If you find things you can’t reach, please let me
> know, as we’d like to make sure the redundancy is working, while
> we’re working to resolve the issues we’re investigating with the
> Seattle edge router.
> PSDR mailing list
> PSDR at hamwan.org
> Ryan Turner