[HamWAN PSDR] Service Impact Notice

Bart Kus me at bartk.us
Fri Mar 11 00:56:22 PST 2016

Hmm, that's not the whole story though.  If it were just the one router 
failure (in reality a hypervisor failure), we'd be in a much better 
position, but it's combined with two modem failures.  We had the 
ETiger->SnoDEM modem die over the winter, and it needs replacement.  
That link has been down for a month or more now.  And most recently 
we're having the Tukwila->Baldi modem lose connectivity frequently.  
We've implemented an automatic mitigation for that, but it still 
produces sporadic short downtime windows of a few minutes.  I'd just 
like to move that modem to a NetMetal 5.  Our servers are also being 
affected by instability in the Quagga routing software.  We need to 
replace this with a more stable alternative, like BIRD.  Lastly, the 
Baldi emergency uplink is configured to go only to Westin and Corvallis, 
not to Tukwila.
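
For anyone curious what an automatic mitigation for a flaky link can 
look like, here is a minimal sketch.  It is not the mitigation we 
actually deployed; the LinkWatchdog class, the probe and recover 
callables, and the threshold of 3 are all hypothetical names and values 
picked for illustration.  The idea is to count consecutive failed 
probes and trigger a recovery action only after several in a row, which 
is also why a scheme like this still leaves short downtime windows.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LinkWatchdog:
    """Counts consecutive probe failures and fires a recovery action.

    Hypothetical sketch: probe() should return True when the link is up,
    recover() performs whatever reset is appropriate for the link.
    """
    probe: Callable[[], bool]
    recover: Callable[[], None]
    threshold: int = 3   # assumed value, tune per link
    failures: int = 0

    def tick(self) -> bool:
        """Run one probe; return True if recovery was triggered."""
        if self.probe():
            self.failures = 0          # link is healthy, clear the counter
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.recover()             # e.g. reset the radio interface
            self.failures = 0          # start counting again after recovery
            return True
        return False
```

In practice tick() would be called from a timer every few seconds, with 
probe() wrapping a ping across the link.  The time to recover is then 
roughly threshold times the probe interval, plus however long the reset 
itself takes.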

We could have avoided DNS outages too, if the anycast groups were 
populated with more of the available servers.  I believe lack of good 
automation for server build-outs is causing the deployment lag here.
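
As a rough illustration of the kind of automation I mean, here is a 
small sketch that works out which servers should join or leave an 
anycast group based on a health check.  The function name 
plan_anycast, the server names, and the health-check results are all 
made up for the example; how the announce and withdraw steps get 
applied (routing daemon config, etc.) is deliberately left out.

```python
from typing import Dict, Set, Tuple


def plan_anycast(health: Dict[str, bool],
                 announced: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (to_announce, to_withdraw) for an anycast group.

    health maps server name -> True if its local DNS answered a test
    query; announced is the set of servers currently advertising the
    anycast address.  Both inputs are hypothetical for this sketch.
    """
    healthy = {name for name, ok in health.items() if ok}
    to_announce = healthy - announced    # healthy but not yet in the group
    to_withdraw = announced - healthy    # in the group but failing checks
    return to_announce, to_withdraw


# Example with invented server names: only one server is announcing,
# and it is the one that just failed.
join, leave = plan_anycast(
    {"dns-a": True, "dns-b": True, "dns-c": False},
    announced={"dns-c"},
)
```

With every healthy server in the group, losing a single one just shifts 
queries to the next nearest announcer instead of causing an outage.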

The network is designed to withstand failures, even multiple failures, 
but we've got many broken things right now that need fixing.  Once those 
are fixed, I would really love to see some folks get behind improving our 
monitoring, deployment and diagnostic automation.  Networks like this 
won't scale unless they're nearly completely automated and simple to 
manage.  I would not mind at all if we even rolled back some features 
until we can get them re-implemented in 100% automated ways.

As important as all this is, I still think the deep penetration project 
takes precedence, so I can't drop that work in favor of this, aside 
from helping out on the simple break-fix stuff.


On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote:
> Thanks for the update, Nigel.
> On Wed, Mar 9, 2016 at 10:17 PM, Nigel Vander Houwen 
> <nigel at nigelvh.com> wrote:
>     Hello All,
>     Just wanted to send out a quick notice here. We’ve had a failure
>     at our Seattle edge router, which we’re still investigating. In
>     the meantime, our Tukwila edge router is still providing
>     connectivity, but you may notice higher latencies or issues
>     reaching things. If you find things you can’t reach, please let me
>     know, as we’d like to make sure the redundancy is working, while
>     we’re working to resolve the issues we’re investigating with the
>     Seattle edge router.
>     Nigel
>     _______________________________________________
>     PSDR mailing list
>     PSDR at hamwan.org
>     http://mail.hamwan.net/mailman/listinfo/psdr
> -- 
> Ryan Turner
