Mastodawn

Watching a #tornog talk about the Rogers countrywide outage. Surprising but _not surprised at all_ that there was a chain of missing safeguards that would have prevented it, and missing remediation tools.

mhoye Apr 13

"Longest prefix match. Always gets us." #tornog
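
For readers outside the "It's always BGP" crowd, here is a minimal sketch of what longest prefix match means: among all routes that contain a destination, the most specific one wins. The route table, next-hop labels, and the function name `longest_prefix_match` are hypothetical and only for illustration; real routers do this lookup in specialized data structures, not a linear scan.

```python
import ipaddress


def longest_prefix_match(destination, routes):
    """Return (network, next_hop) for the most specific matching route, or None.

    `routes` is a hypothetical mapping of prefix string -> next-hop label.
    """
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        # A route matches if it contains the address; the longest prefix wins.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best


# Illustrative table: a default route, a broad aggregate, and a more specific route.
routes = {
    "0.0.0.0/0": "default",
    "10.0.0.0/8": "core",
    "10.1.0.0/16": "branch",
}
match = longest_prefix_match("10.1.2.3", routes)
```

This is also why it "always gets us": a more specific route announced by mistake silently wins over the broader, intended one.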

mhoye Apr 13

It's actually refreshing to be hanging out with people where "it's always DNS" is like a child's toy model of a problem. This is an "It's always BGP" crowd.

#tornog

mhoye Apr 13

Very cool to see a new tool getting open-sourced on stage, for realtime monitoring/alerting of BGP problems! Doubly cool to see part of its utility is about alerting you to what peers - and _who to contact_ at those peers - are causing problems. Social-context awareness is an undervalued part of operational hygienics. #tornog

Vivi Apr 13

@mhoye would be happy to see the project when it's posted!

mhoye Apr 13

@vivithecanine Apparently the official announcement is later this week, I'll come back with it.

robert daniel pickard May 1

@vivithecanine @mhoye The videos for TORNOG are up on YT now

https://www.youtube.com/@TeamTORNOG

and the RAVEN project is on github

https://github.com/nokia/bgp-routing-security-monitor

(hic/haec/hoc) Apr 13

@mhoye MTU problems are worse to diagnose than both of those, don't ask me how I know :'(

mhoye Apr 13

@_hic_haec_hoc Oh, I'm going to ask you. (I wonder if we can beg @lcamtuf to bring back the Museum Of Lost Packets....)

(hic/haec/hoc) Apr 13

@mhoye @lcamtuf some branch offices started reporting that they randomly couldn't connect to some internal servers, it wouldn't work for some time and then start working again. Eventually we figured out that they were all connected to the same MPLS router in the same PoP, and eventually we noticed that this PoP was connected to the rest of the network via a fiber optics link and a radio link. The MTU of a router port connected to the radio link *on the other side* of the link was larger than the

(hic/haec/hoc) Apr 13

@mhoye @lcamtuf actual MTU on the radio link (something like 1800 bytes instead of 1580), so most normal IP-over-MPLS packets would still go through, making it look like everybody was ok, but some BGP packets generated on the routers would be larger, so they would never arrive and the VPNv4 and the routes in the users' VRFs would just disappear...
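
The failure mode described above can be sketched roughly as follows. This is a toy model, not the actual router behaviour: it assumes the sender only checks its own configured port MTU, and that the link silently drops anything over its real MTU. The 1800 and 1580 figures are the approximate ones from the post; the packet sizes and the function name `delivered` are hypothetical.

```python
def delivered(packet_size, configured_mtu, actual_link_mtu):
    """Return True if a packet of this size makes it across the link.

    Toy model: the sending port accepts anything up to its configured MTU,
    but the link itself silently drops anything over its real MTU.
    """
    if packet_size > configured_mtu:
        # The sender itself would refuse or fragment; not sent as-is.
        return False
    return packet_size <= actual_link_mtu


CONFIGURED_MTU = 1800   # router port setting (approximate, from the post)
ACTUAL_LINK_MTU = 1580  # what the radio link really carried (approximate)

# Hypothetical sizes: ordinary user traffic vs. a large router-generated BGP message.
ordinary_ok = delivered(1500, CONFIGURED_MTU, ACTUAL_LINK_MTU)
large_bgp_ok = delivered(1700, CONFIGURED_MTU, ACTUAL_LINK_MTU)
```

Only packets in the gap between the two values are lost, which is why ordinary traffic looked fine while some routing updates never arrived.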

(hic/haec/hoc) Apr 13

@mhoye @lcamtuf but the time one of our upstream ISPs misconfigured the MTU of one link of a LAG was worse, because we couldn't reliably reproduce the issue. We spent days looking at our network without any success, we had to convince several SaaS vendors to do the same without any success on their side either, and we finally figured out it was an ISP's fault only after we became so desperate we started playing with route maps on the border routers to force the traffic onto the other ISPs

(hic/haec/hoc) Apr 13

@mhoye @lcamtuf at least after we told them "guys, there's definitely something wrong on your network because when we send the traffic for AWS through our peering with you things randomly stop working" they listened and found the problem in less than 24 hours...

Jean-François Mezei Apr 13

@mhoye from the CRTC consultation after the Rogers outage: Rogers didn’t have any rollback plan, and did not include one in its promises to institute a change management program.