Network Failures: 10 Ways You Can Avoid Mishaps and Manage Risk

By Eric Harris, IT Director, Tech Networks of Boston

The network is down! These four words can set off panic attacks in the workplace. Workers and managers stress about missed deadlines, and IT leaders scramble to figure out the problem and fix it.

Network failures do happen. But they don’t have to be regular occurrences. Here are 10 examples of why networks fail and what you as an IT leader can do to make sure they don’t keep happening in the future.

1 – ISP outages

Connections to major internet service providers go down every now and then. There’s really nothing you can do about an individual connection but grin and bear it. What you can do is set up a redundant ISP connection, managed by a networking device or load balancer. It may be as simple as a firewall with a WAN failover function. Getting a secondary service doesn’t have to be expensive. Sign up for a service where you pay ad hoc, per event, and your network devices will automatically fail over in a heartbeat.
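To make the failover idea concrete, here is a minimal sketch of the decision a firewall or load balancer makes: try the links in priority order and use the first one that answers. The `WanLink` class, the link names and the simulated health probes are all hypothetical; a real device would run its own checks, such as pinging an outside host.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class WanLink:
    name: str                       # hypothetical label for the connection
    is_healthy: Callable[[], bool]  # hypothetical probe; real devices use pings or similar


def pick_active_link(links: Sequence[WanLink]) -> Optional[WanLink]:
    """Return the first healthy link in priority order, or None if all are down."""
    for link in links:
        if link.is_healthy():
            return link
    return None


# Simulated probes: the primary ISP is down, the pay-per-event backup is up.
links = [
    WanLink("primary-isp", lambda: False),
    WanLink("backup-isp", lambda: True),
]
active = pick_active_link(links)
print(active.name if active else "no link available")
```

Listing the primary first means traffic only moves to the paid-per-event backup when the primary probe fails, which keeps the secondary service cheap.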

2 – Power issues

Sometimes a weak power link to a data facility can cause voltages to fluctuate and lead to an outage. This is a tricky one to manage. It’s next to impossible to strong-arm utilities, but you can negotiate with the utility to bring in a new line or a redundant line. Another option is to simply have a backup plan to move the data center in the event of an outage. A local data center can provide redundant power and AC on a warranty basis.

3 – Hardware glitches

If you’re holding onto old equipment that’s beyond its expected life span, chances are it won’t have vendor support. The solution: Keep your components current. Make sure your warranties are up to date. Vendors have service packages offering four-hour turnarounds to get devices repaired with new parts and back online.

4 – Human factors

Maybe it’s not the equipment’s fault. Maybe it was installed wrong or not properly maintained. Maybe the skill sets don’t match the needs of a technical environment. There isn’t a quick-fix solution to this issue other than to make sure you’re diligent in your processes. If you have a team you’re happy with, assign responsibilities in a better way. If you have to outsource, make sure you can ask pointed questions to ensure you’re getting the right team for the job. The more you know as an HR administrator, the better.

5 – No redundancies built into the architecture

This should be standard operating procedure. If you buy a server, you’ll want to have dual power supplies and a redundant disk. Redundant network links are also really important. This gives you the ability to hedge against the failure of one switch that’s servicing the servers.
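A quick back-of-the-envelope calculation shows why redundancy pays off. The sketch below assumes the redundant components fail independently and that any one of them can carry the load alone; the 99% availability figure is purely illustrative, not a vendor specification.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent components when any one is enough."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


HOURS_PER_YEAR = 24 * 365

for copies in (1, 2):
    availability = combined_availability(0.99, copies)  # illustrative figure
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{copies} switch(es): {availability:.2%} available, "
          f"about {downtime_hours:.1f} hours down per year")
```

Under these assumptions, a second switch cuts expected downtime from roughly 88 hours a year to under one hour. Shared failure causes, such as both switches sitting on the same power circuit, would make the real gain smaller.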

6 – Configuration errors in equipment

If equipment is configured incorrectly, bottlenecks can occur and the network can experience generally poor performance. To avoid this, document all the changes you make and plan to make. Try to make configuration changes that are reversible. Document all the steps and keep communication lines open with all the technicians working on the job.

7 – Not updating with patches

This happens more often than not. Make sure you apply all patches in a timely manner, especially security-related network updates. Attacks are coming faster and more often. There’s no excuse for not staying current with the regular upgrades technology suppliers provide.

8 – Cooling failures

When summer comes around, technology equipment can overheat and shut down. Cooling should be a priority all year round, but never more than in the summer. Be mindful of the air flow in your server room. Buy a standalone AC unit that’s water cooled, or at the very least make sure you have fans blowing air around. Experts can configure more elaborate cooling systems if overheating becomes a regular issue.

9 – A mis-sized network link

Underestimating the load on your network could lead to outages. It could be in your Wi-Fi setup or in the core infrastructure in your data center. You could have all of your clients trying to connect to resources and have too much network traffic trying to squeeze through a too-small link. To guard against this, find the proper tools to simulate network flows. There are also online calculators you can plug your numbers into to determine what would happen in the event of a peak flow.
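The arithmetic behind such a calculator is simple enough to sketch. The example below assumes the worst case, where every client transmits at once, and uses made-up numbers (200 clients, 8 Mbps each, a 1 Gbps uplink); the function name and parameters are illustrative, not taken from any particular tool.

```python
def peak_utilization(clients: int, mbps_per_client: float, link_mbps: float) -> float:
    """Fraction of link capacity used if every client transmits at once."""
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    return (clients * mbps_per_client) / link_mbps


# Illustrative numbers only: 200 clients at 8 Mbps each over a 1 Gbps uplink.
utilization = peak_utilization(clients=200, mbps_per_client=8.0, link_mbps=1000.0)
print(f"Peak utilization: {utilization:.0%}")
if utilization > 1.0:
    print("The link is undersized for this peak.")
```

A result above 100% means the peak demand exceeds what the link can carry. Real traffic is burstier than this worst case, so a proper flow simulation will give a more realistic picture.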

10 – Trying to make too many changes at once

Changing one aspect of a network is complicated enough. If you make several changes at once, you raise the risk of a component reacting negatively. Exercise caution. Keep changes minimal and do them in bite-sized projects. If one change works, move on to the next one.
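One way to picture the bite-sized approach is as a loop: apply a single change, check that it worked, and only then move on. The sketch below is a toy model in which a dictionary stands in for device configuration; the `Change` class, the setting names and the simulated failure are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Change:
    name: str
    apply: Callable[[], None]     # makes the change
    rollback: Callable[[], None]  # undoes it
    verify: Callable[[], bool]    # confirms the network still behaves


def apply_in_sequence(changes: List[Change]) -> List[str]:
    """Apply changes one at a time; undo and stop at the first failed check."""
    applied = []
    for change in changes:
        change.apply()
        if not change.verify():
            change.rollback()
            break
        applied.append(change.name)
    return applied


# Toy configuration standing in for a real device.
config = {"mtu": 1500, "vlan": 10}

changes = [
    Change("raise-mtu",
           lambda: config.update(mtu=9000),
           lambda: config.update(mtu=1500),
           lambda: config["mtu"] == 9000),
    Change("move-vlan",
           lambda: config.update(vlan=20),
           lambda: config.update(vlan=10),
           lambda: False),  # simulated failed check
]

applied = apply_in_sequence(changes)
print(applied)
print(config)
```

Because each change carries its own rollback, a failure undoes only the step that caused it, and the changes that already passed their checks stay in place.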

Conclusion

As you can see from this list, there are plenty of culprits that can bring networks down. To keep outages to a minimum, you need to be informed about potential issues and exercise common sense. Communicate with your staff, and make sure you’re all working together to fend off network problems. If you need help, reach out to a managed service provider or another expert in network functions.

Does your organization have a disaster recovery plan in place in case a network outage occurs? Use this helpful cheat sheet to develop a disaster recovery plan for your business.

Do you have a network failure story? How did you deal with it, and what did you learn?
