The way Facebook and its allied services, WhatsApp and Instagram, stopped working on the internet last night is quite unbelievable. The moment our tech experts learned of it, they promptly looked into the matter and found that DNS names had stopped resolving and the infrastructure IPs were unreachable. But how was that possible for the world's largest social media platform? Facebook has since explained via a blog post that it all happened due to internal issues.
Our technical experts, meanwhile, started probing the matter, dug deep into the BGP and DNS problems, and soon found that it all began with a configuration change that impacted the entire network.
Now, most of you must be wondering what this BGP and DNS we are talking about actually is. Let's give you a brief overview.
What is BGP?
BGP stands for Border Gateway Protocol. It is the mechanism by which autonomous systems (ASes) on the Internet exchange routing information with one another. Without BGP, the internet simply would not work.
The entire framework of the internet is bound together by BGP, and it allows one network (say, Facebook) to advertise its presence to the other networks that form the Internet. Each individual network has an Autonomous System Number (ASN) and a unified internal routing policy. An AS can both originate prefixes, meaning it controls a group of IP addresses, and transit prefixes, meaning it knows how to reach them.
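To make the idea concrete, here is a minimal, purely illustrative sketch of an AS originating a prefix. The class name and the specific prefix are our own examples (157.240.0.0/16 is one of Facebook's well-known ranges), not anything from Facebook's actual configuration.

```python
# Illustrative sketch: a network, identified by its ASN, advertises
# IP prefixes so the rest of the Internet knows how to reach it.
import ipaddress

class AutonomousSystem:
    def __init__(self, asn, name):
        self.asn = asn
        self.name = name
        self.prefixes = set()  # prefixes this AS originates

    def advertise(self, prefix):
        """Originate a prefix, making its addresses reachable via this AS."""
        self.prefixes.add(ipaddress.ip_network(prefix))

    def originates(self, address):
        """Check whether an IP address falls inside any advertised prefix."""
        ip = ipaddress.ip_address(address)
        return any(ip in p for p in self.prefixes)

# Facebook's real ASN is 32934 (mentioned later in this article);
# the prefix below is an example range attributed to it.
fb = AutonomousSystem(32934, "Facebook")
fb.advertise("157.240.0.0/16")
print(fb.originates("157.240.1.35"))   # True: the address is covered
print(fb.originates("8.8.8.8"))        # False: outside every advertised prefix
```

When an AS stops advertising a prefix, every address inside it simply drops off the map for the rest of the Internet, which is exactly what the next section walks through.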
The BGP Updates
Our Fluper team includes experts who are well versed in BGP and ASN concepts, and that is why we can present this evaluation to you. When they started exploring the number of updates received from Facebook during their BGP database evaluation, they found a surge of BGP UPDATE messages. An UPDATE message informs a router of any changes made to a prefix advertisement, or withdraws the prefix entirely.
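A hedged sketch of what such an UPDATE does to a peer's routing table may help here: an announcement adds or changes a route, while a withdrawal removes it, leaving the prefix unreachable. The message format and prefix below are simplified illustrations, not real BGP wire format.

```python
# Toy model of BGP UPDATE handling: announce adds a route,
# withdraw removes it, and a withdrawn prefix becomes unreachable.
import ipaddress

routing_table = {}  # prefix -> originating ASN

def apply_update(update):
    prefix = ipaddress.ip_network(update["prefix"])
    if update["type"] == "announce":
        routing_table[prefix] = update["asn"]
    elif update["type"] == "withdraw":
        routing_table.pop(prefix, None)  # route disappears entirely

apply_update({"type": "announce", "prefix": "129.134.30.0/24", "asn": 32934})
apply_update({"type": "withdraw", "prefix": "129.134.30.0/24"})
print(routing_table)   # {} -- with no route left, the prefix is unreachable
```

This is essentially what happened: a flood of withdrawals left the rest of the Internet with no route to Facebook's infrastructure.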
At around 15:40 UTC, our experts witnessed a peak of routing changes from Facebook, and this was where the problem actually arrived. Routes were withdrawn, Facebook's DNS servers went offline, and within a minute of the problem occurring, Cloudflare engineers were in a room wondering why their 1.1.1.1 resolver couldn't resolve facebook.com, worrying that it was somehow a fault with their own systems.
Its Impact on DNS
As a result, DNS resolvers all over the world stopped resolving Facebook's domain names. That is simply because DNS, like many other systems on the Internet, has its own routing mechanism. When someone types https://facebook.com into the browser, the DNS resolver, responsible for translating domain names into the actual IP addresses to connect to, first checks whether it has something in its cache and uses it. If not, it tries to grab the answer from the domain's nameservers, typically hosted by the entity that owns the domain.
If the nameservers are unreachable and fail to respond for any reason, a SERVFAIL message is returned and the browser shows an error to the user.
What Happens When DNS Gets Affected?
The moment Facebook's DNS prefix routes were withdrawn, our and everyone else's DNS resolvers had no way to connect to their nameservers. Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.
But this was not the end. Most apps won't accept an error for an answer and start retrying aggressively, and end-users likewise won't take an error for an answer and start reloading pages, or killing and relaunching their apps.
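This retry behavior multiplies load. A toy model with assumed numbers (the retry counts below are illustrative, not measured) shows how app retries and user reloads compound:

```python
# Toy model of retry amplification: when lookups fail, app-level retries
# and user-driven reloads multiply the query load on DNS resolvers.
def query_load(users, base_queries, app_retries, user_reloads, failing):
    if not failing:
        return users * base_queries
    # every failed lookup is retried by the app, and users reload on top of that
    return users * base_queries * app_retries * user_reloads

normal = query_load(1_000_000, 1, app_retries=3, user_reloads=5, failing=False)
outage = query_load(1_000_000, 1, app_retries=3, user_reloads=5, failing=True)
print(outage // normal)   # 15x normal query volume in this toy scenario
```

With Facebook-scale numbers plugged in, this is how resolvers ended up fielding an order of magnitude more traffic than usual.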
Facebook has a huge user base, and because of this, DNS resolvers around the world started handling 30 times more queries than usual, which eventually resulted in latency and time-out issues for various other platforms.
As Cloudflare's 1.1.1.1 resolver was built to be free, swift, and scalable, it was able to keep serving its users with minimal impact, and the vast majority of its DNS requests kept resolving in under 10 ms. At the same time, a small fraction of requests at the p95 and p99 percentiles saw increased response times, probably because expired TTLs forced the resolver back to the Facebook nameservers, where it had to wait for a timeout. The 10-second DNS timeout limit is well known among engineers.
Impact on Other Services
When Facebook became unreachable, we witnessed increased DNS queries to Twitter, Signal, and other messaging and social media platforms. Another side effect of this unreachability was seen in WARP traffic to and from Facebook's affected ASN 32934, with traffic during that window shifting noticeably in each country compared with three hours earlier.
The incident simply illustrates the complexity and interdependence of the millions of systems and protocols working together, and shows that trust, standardization, and cooperation among entities are at the heart of keeping the internet available for almost 5 billion active users worldwide.