How to react to a network outage

By Lora O’Haver

May 19, 2020

Recent DDoS (distributed denial-of-service) attacks have everyone on edge, from DNS hosting providers to individual organizations fearing they might be caught in the next denial-of-service campaign. It’s a valid concern, considering that the cost of unplanned data center outages has increased 38 percent in five years to nearly $9,000 a minute, or about half a million dollars an hour.

Unfortunately, network outages don’t need to be instigated by malicious outside forces in order to occur. Configuration choices, new releases, or the cumulative impact of changes made over many years can lead to catastrophic failures as well. Examples of this are common. Just this summer, Southwest Airlines experienced a router failure that caused the cancellation of about 2,300 flights in four days. The failure took down several Southwest Airlines systems, and the outage lasted about 12 hours because the backup systems didn’t work as expected. Software can be the cause too. Last year, the New York Stock Exchange experienced an outage caused by a software update, crippling the exchange platform in the longest technology-related disruption in recent memory.

An outage can strike anywhere, and even a minor internal problem can create a ripple effect that causes widespread disruption, costing an organization money and consumer trust. How can you mitigate the risk of such an outage happening to your organization?

Why It Happens

Single Points of Failure: Deploying a device on the network without failsafe protection for other network components can lead to a lot of headaches. When a failure does occur, like the router incident above, the entire traffic pathway can fail and result in a complete network outage. In other words, a single device can have a huge impact on business operations, and fixes can be hard to implement because it is often difficult to identify and isolate an individual issue in a short period of time.
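To make the idea concrete, here is a minimal sketch of how a team might look for single points of failure in a simple map of its network. The device names and the TOPOLOGY layout are hypothetical, and the check assumes the network is fully connected to begin with: it removes one device at a time and tests whether the remaining devices can still reach each other.

```python
from collections import deque

# Hypothetical topology: each device maps to the devices it is cabled to.
TOPOLOGY = {
    "isp": ["router"],
    "router": ["isp", "switch_a", "switch_b"],
    "switch_a": ["router", "switch_b", "server"],
    "switch_b": ["router", "switch_a", "server"],
    "server": ["switch_a", "switch_b"],
}


def reachable(graph, start, removed):
    """Return the devices reachable from start when `removed` is offline."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(graph):
    """List devices whose failure cuts off at least one other device."""
    found = []
    for device in graph:
        others = [name for name in graph if name != device]
        if not others:
            continue
        # If some remaining device is unreachable, this device is critical.
        if len(reachable(graph, others[0], device)) < len(others):
            found.append(device)
    return sorted(found)


if __name__ == "__main__":
    print(single_points_of_failure(TOPOLOGY))
```

In this made-up layout the two switches back each other up, so only the router is flagged. A real network map would need to account for link failures and device roles as well.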

The Human Element: But technology is not the only thing at fault; humans are too. Last year, Avaya found that 81 percent of IT pros said human error (e.g., configuration mistakes) had taken them offline. Another human element is conflicting priorities between teams, which can delay necessary actions, such as software updates, and leave vulnerabilities exposed.

Environmental Issues: Finally, you can’t predict random occurrences. For example, if your HVAC system fails, the data center room can become overheated, leading to potential damage and failure of any system deployed there.

Taking Steps to Prevent the Problem

The first step is accepting that you cannot prevent all network outages, so it is critical to develop a disaster recovery and business continuity plan. What is an acceptable amount of risk or downtime? Being specific about what risks are acceptable and which are not allows for better prioritization.
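One way to put numbers on "acceptable downtime" is to translate an availability target into minutes per year and a rough cost. The sketch below is illustrative only: it uses the roughly $9,000-a-minute figure cited earlier as a default, and the availability targets shown are example values, not recommendations.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_pct):
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def downtime_cost(minutes, cost_per_minute=9000.0):
    """Estimated outage cost; the default rate is the article's cited figure."""
    return minutes * cost_per_minute


if __name__ == "__main__":
    # Example targets only; pick the ones that match your own plan.
    for target in (99.0, 99.9, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% available: {minutes:,.1f} min/year, "
              f"about ${downtime_cost(minutes):,.0f}")
```

Your own cost per minute will differ, so treat the rate as a parameter to replace with figures from your business.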

Second, it is important to employ the right setup for your network. No device should be deployed without providing an alternate path in the event of a failure. Something as simple as an external bypass switch can maintain availability when deployed in front of network devices.

Additionally, you need resilient network security that provides continuous traffic inspection and recovers from any outage in an acceptable time frame, to minimize exposure. That means having complete visibility of traditional and cloud traffic, plus a security fabric that protects the network from an outage in any of your security tools and ensures the maximum amount of traffic is inspected.

Once the necessary steps have been taken, test, test, test. Networks and devices should be continuously tested as part of standard IT operations. This should be done with loads that reflect real-world scenarios, particularly when applications or the network change.

The incidents I describe above, ones in which an IT problem rapidly becomes a major business issue, are almost certain to happen to any business at one point or another. As such, they should not be any less a cause for concern than a denial-of-service attack. Understanding the sources of common network disruption, and how they can be addressed in advance, will help in avoiding major fallout, from lost revenue to lost customers. Are you doing this now? If not, it is time to make this a top priority.

Lora O’Haver is Solutions Marketing Manager at Ixia
