Facebook, Instagram, and WhatsApp: what we know about the outage, and how to prevent this problem.
Having these three tools and methodologies on hand will help you avoid system breakdowns in the future: Ticketing, ITIL, and Change Management. Facebook services such as Instagram, WhatsApp, Messenger, and the social network itself were affected by the global blackout. After a 6-hour outage, the social networks and messaging services are back online. The internet giant’s three services experienced a global outage that prevented millions of users from accessing them.
The company claims that the problem was caused by a change in the configuration of its systems.
According to a Facebook statement, this impacted the company’s internal tools and systems, making it difficult to solve the problem.
Downdetector, a website that tracks downtime on online platforms, has received hundreds of thousands of bug reports for Facebook (and its Messenger service), Instagram, and WhatsApp.
The issue affects both mobile and PC services, regardless of the service provider. Facebook, Messenger, WhatsApp, and Instagram are completely inaccessible, even through a web browser. When attempting to access one of these sites, the error message “DNS_PROBE_FINISHED_NXDOMAIN” is shown.
The #InstagramDown hashtag is back on Twitter, and the number of tweets complaining about the service’s unavailability is rapidly increasing, propelling the hashtag into trending status. The tweets are written in different languages, suggesting a major outage, possibly worldwide. It was immediately followed by the hashtag #FacebookDown and the phrase “WhatsApp.”
A DNS PROBLEM ON THE SERVERS OF FACEBOOK, WHATSAPP, AND INSTAGRAM?
Reverse engineering specialist Jane Manchun Wong said it could be a DNS problem, both for public sites and for the site employees use for their tests.
What does DNS stand for? What does it mean?
The Domain Name System (DNS) is the Internet’s phone book. People access material on the internet through domain names, whereas web browsers communicate through Internet Protocol (IP) addresses. DNS translates domain names to IP addresses, enabling browsers to access Internet resources. Each Internet-connected device has its own IP address, which other devices use to locate it.
A DNS problem means the system can no longer match a server’s IP address to its publicly accessible domain name. Solving such a problem could take several hours.
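To make the lookup step concrete, here is a minimal Python sketch built on the standard library’s socket.getaddrinfo. The helper name resolve and its lookup parameter are assumptions introduced for illustration; the parameter only exists so the function can be exercised without a live network.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for hostname.

    An empty list means the name could not be resolved, which is the
    situation browsers ran into during the outage.
    """
    try:
        results = lookup(hostname, None)
    except socket.gaierror:
        # The resolver could not provide an answer for this name.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address as a string.
    return sorted({entry[4][0] for entry in results})
```

On a working network, resolve("example.com") should return one or more addresses, while a name that cannot be resolved yields an empty list instead of raising an exception.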
What happened to Facebook?
Cloudflare performed one of the most extensive assessments of the event, concluding that the outage was almost certainly not the consequence of a hack, but rather of a maintenance error at the level of routing and the Border Gateway Protocol (BGP). “It’s as if someone ripped the cables from Facebook’s data centers in one fell swoop and took the sites offline,” summarizes a post on the Cloudflare blog that reviews the incident. During the incident, Facebook’s DNS and IP infrastructure were unavailable.
Cloudflare continuously monitors the data exchanged through BGP, which underpins the worldwide Internet (BGP allows a network, such as Facebook, to signal its presence to other networks on the Internet). It observed Facebook’s DNS (Domain Name System) servers disappear, which prevented queries for facebook.com or instagram.com from being answered, along with a spike in routing changes from Facebook. This is unusual for Facebook, which typically makes few changes to its network in real time.
According to Cloudflare, when Facebook’s DNS went down, some of Cloudflare’s own engineers first mistook it for a problem with their systems and tried to figure out where the outage was coming from before realizing the obvious: Facebook and its websites were cut off from the Internet.
There is little or no impact on the rest of the Internet traffic.
What do users and their applications or Internet browsers do when Facebook, Instagram, or WhatsApp do not respond? They keep trying to connect, resending queries to public DNS resolvers such as Cloudflare’s 1.1.1.1 or Google’s 8.8.8.8, and each time they receive the same response, indicating that the servers cannot be reached. According to Cloudflare, this explains the surge in DNS requests during the outage: DNS resolvers were forced to process up to 30 times more queries than usual. Fortunately, these infrastructures are extremely efficient and resilient, and they did not collapse despite the additional load.
Fortunately, a global Facebook outage is insufficient to bring down other websites and services on the Internet. A very small percentage of requests simply had their processing time extended, but in an almost imperceptible way to the user.
Facebook, for its part, states that the failure was caused by “configuration changes in the routers that coordinate traffic between the data centers.” These configuration changes had several effects on communications between the American group’s various servers. “We want to reassure you that we believe the underlying reason for this issue is an incorrect configuration modification,” the company said. “There is no proof that user data was hacked as a result of the outage.”
What is meant by BGP?
The Border Gateway Protocol (BGP) is the Internet’s postal service. When a letter is placed in a mailbox, the postal service processes it and determines the most efficient method of delivery to the intended recipient. Similarly, when someone sends data over the Internet, BGP is in charge of analyzing all of the available routes and choosing the best one, which often entails hopping between autonomous systems. BGP is the protocol that allows the Internet to work, and it does so by enabling data routing. When a person in Singapore visits a website whose origin servers are in Argentina, BGP is the protocol that allows that communication to happen quickly and efficiently.
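The route-selection idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how a real router is implemented: actual BGP compares several attributes, whereas this sketch only prefers the shortest path of autonomous systems. The Route class, the best_route function, and the documentation-range prefixes and AS numbers used with them are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str     # destination network, e.g. "203.0.113.0/24"
    as_path: tuple  # autonomous systems the announcement passed through


def best_route(routes, prefix):
    """Pick the announcement for prefix with the shortest AS path.

    Returns None when no neighbor announces the prefix at all.
    """
    candidates = [route for route in routes if route.prefix == prefix]
    if not candidates:
        return None
    return min(candidates, key=lambda route: len(route.as_path))
```

Given two announcements for the same prefix, one crossing three autonomous systems and one crossing two, best_route returns the two-hop announcement; for a prefix nobody announces, it returns None.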
What is the relationship between DNS and BGP?
According to Cloudflare, DNS tells you where you are heading, whereas BGP tells you how to get there. DNS is how computers figure out what IP address a website or another resource has, but that knowledge alone is not enough: if you ask a friend where their house is, you will most likely still need GPS to get there. Cloudflare also has a fantastic technical account of how BGP failures can cause DNS requests to fail. The post is specific to Monday’s Facebook incident, so it is worth reading if you are seeking an explanation from the perspective of an autonomous system.
FACEBOOK WAS ONLINE BUT COULD NOT BE FOUND
In particular, it appears that a change in BGP configuration is to blame for the failure of Facebook services. “BGP enables a network, such as Facebook, to advertise its presence to the other networks that comprise the Internet. As of this writing, Facebook is no longer advertising its presence, ISPs and other networks are unable to find Facebook’s network, and it is therefore unreachable,” stated Cloudflare. Facebook remained online, but Internet service providers no longer knew how to reach it.
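A toy model may help show why a network can be online and still impossible to find. The Python sketch below keeps a table of announced prefixes and looks up the most specific match for an address; once the prefix is withdrawn, the lookup returns nothing even though the server behind it has not changed. The RoutingTable class, its method names, and the documentation-range addresses are all assumptions made for illustration, not a description of Facebook’s or any ISP’s equipment.

```python
import ipaddress


class RoutingTable:
    """Toy view of the prefixes an ISP has learned from its neighbors."""

    def __init__(self):
        self._routes = {}  # ip_network -> name of the announcing network

    def announce(self, prefix, origin):
        self._routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        # Withdrawing an unknown prefix is treated as a no-op.
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self._routes if ip in net]
        if not matches:
            return None
        most_specific = max(matches, key=lambda net: net.prefixlen)
        return self._routes[most_specific]
```

After table.announce("192.0.2.0/24", "example-network"), a lookup of 192.0.2.53 finds the announcing network; after table.withdraw("192.0.2.0/24"), the same lookup returns None, which is the situation Cloudflare describes.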
What would be the solution to this issue?
If there is one term to characterize the process of making significant changes to major processes, systems, or workflows, it may be “chaos.” But here is the good news: implementing a plan to manage this chaos using these three methods can result in long-term beneficial effects.
A-WHAT IS ITIL?
ITIL (Information Technology Infrastructure Library) is a collection of specialized organizational abilities aimed at creating value for end-users in the form of services. ITIL lays the foundation for international practices that organizations can adopt, in whole or in part, to provide valuable services to their clients.
ITIL continues to assist organizations and professionals in getting the most out of IT and digital services. It provides a very clear model of the skills required of service providers, aligning them with business strategy and customer needs.
The following are the primary advantages of the ITIL method:
Assists organizations in the new technological era. Robotics, artificial intelligence, nanotechnology, biotechnology, the Internet of Things (IoT), 3D printing, autonomous vehicles, and other emerging technologies characterize the fourth industrial revolution.
Provides a realistic and adaptable framework for companies to use as they embark on their digital transformation path, assisting them in aligning their digital and physical resources to compete in an increasingly complicated environment.
Focuses on the organizational and technological context of the framework, as well as how the framework connects with Agile, DevOps, and digital transformation.
Is critical for software developers and service management professionals because it encourages a comprehensive approach to product and service delivery.
Stresses the value of cooperation, openness, automation, and a holistic approach.
Organizations must continually adapt to change since it is constant.
ITIL v4 is one of the Best Practices that assists companies in managing continual change.
Helps deliver quality more quickly and at a higher value to businesses and individuals.
B-WHAT IS TICKETING?
Ticketing is, first and foremost, a method of managing your company’s priorities by identifying the importance of each additional task that comes on top of your regular work. For instance, if a customer requests that you carry out a verification engagement, this is not a routine activity. You then create a ticket to indicate that this task is a request from a contact rather than a routine project follow-up.
The ticket also lets you set the task’s priority level, and its obvious goal is to help you organize your work.
Incident ticket management saves time by organizing your work and assigning responsibilities to members of a team.
It also keeps you in contact with your customers and provides them with omnichannel access to the help they require.
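The idea of tickets carrying a priority level and an owner can be sketched as a small priority queue in Python. This is only an illustration under stated assumptions: the Ticket and TicketQueue names, the convention that 1 is the most urgent level, and the sample titles are invented for the example and do not describe any particular ticketing product.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Ticket:
    priority: int                         # 1 = most urgent (assumed convention)
    sequence: int                         # tie-breaker: earlier tickets first
    title: str = field(compare=False)
    assignee: str = field(compare=False)


class TicketQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def open(self, title, priority, assignee):
        """Create a ticket and place it in the queue."""
        ticket = Ticket(priority, next(self._counter), title, assignee)
        heapq.heappush(self._heap, ticket)
        return ticket

    def next_ticket(self):
        """Pop the most urgent ticket, or return None if the queue is empty."""
        return heapq.heappop(self._heap) if self._heap else None
```

Opening a routine follow-up at priority 3 and a customer request at priority 1 means next_ticket() hands back the customer request first, whatever order the two were opened in.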
C-WHAT IS CHANGE MANAGEMENT?
Change management entails a methodical approach to transitioning from the old to the new. It considers all of the people, processes, and systems that will be affected by a change. When a company decides to make a change – to a new system, kind of software, vendor, or even culture – they must go through a change management process.
A cloud migration, for example, may affect legal, security, and end users. Working with stakeholders before and throughout the transition can help to reduce disruption across the business.
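One way to picture such a process is as a gate that a change must pass before it is deployed. The Python sketch below is a hypothetical illustration: the ChangeRequest class, its fields, and the rule that every affected team must approve and a rollback plan must exist are assumptions chosen for the example, not a prescribed ITIL procedure or Facebook’s internal process.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    summary: str
    affected_teams: set                   # e.g. {"legal", "security"}
    rollback_plan: str = ""               # assumed requirement for this sketch
    approvals: set = field(default_factory=set)

    def approve(self, team):
        if team not in self.affected_teams:
            raise ValueError(f"{team} is not a stakeholder of this change")
        self.approvals.add(team)

    def blockers(self):
        """List the reasons this change is not yet ready to deploy."""
        reasons = []
        if not self.rollback_plan.strip():
            reasons.append("missing rollback plan")
        for team in sorted(self.affected_teams - self.approvals):
            reasons.append(f"awaiting approval from {team}")
        return reasons

    def ready(self):
        return not self.blockers()
```

A request covering legal, security, and end users stays blocked until all three have approved and a rollback plan has been written, which makes the outstanding work visible before anything reaches production.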
To summarize, Facebook did not anticipate this breakdown, and the world, as well as the company itself, its managers, its engineers, and its CEO, Mark Zuckerberg, suffered greatly as a result.
During the outage, Facebook’s stock price, which trades on the Nasdaq, fell by about 5%.