Reimagining Network Fault Management
As we begin 2020, if you’re like many of us, you’re thinking about a few New Year resolutions. Maybe you’re thinking about losing a few pounds, hitting the gym more regularly, reading a few more books, or taking that vacation you always dreamed about.
I have a question for you: what about resolving to make your work life better?
As a network engineer/manager, when was the last time you re-evaluated the way your organization identifies and manages network faults and outages? Could a less stressful, more efficient network operations environment at work also help you achieve your personal goals and resolutions? I know for myself that when I reduce the amount of crisis management and overall work stress, I achieve better balance and further enable myself to commit to and follow through on achieving my personal goals/resolutions.
While I am not a customer service or network operations expert, I am fortunate enough to work in a company that has many of them. I went to these experts to learn about the latest processes and tools they use to identify, manage, and resolve network faults and outages, and to analyze their root causes.
I conducted multiple interviews with customer service teammates specializing in 3 types of networks: metro/long-haul optical networks, submarine networks, and Internet Protocol/Multiprotocol Label Switching (IP/MPLS) routing networks.
Each interview was conducted independently and separately from the others in order to obtain individual perspectives and avoid group uniformity/influence. While there were differences in the responses, some common themes emerged. The 3 interviews form the basis of our Elite 8 list for 2020.
Tip 1. Stay calm and fasten your seatbelt.
When faults are impacting your network or the services you deliver to your customers, it can be difficult to remain calm and logical. All my respondents described situations where either they or the service provider personnel failed to stay calm. Every fault or service-impacting issue has a root cause and a resolution. To break the analysis into manageable pieces and think clearly, it is imperative that you remain calm even under extreme conditions.
Tip 2. Make training a priority.
Training is something that may seem obvious, but with so many company mergers and acquisitions and continuous pressure to minimize operational expenses and improve profitability, it’s easy for training to be overlooked or avoided (e.g., we’ll just delay it another 6 months).
Today, it is common to encounter network services personnel with limited knowledge of or experience with the network or the products they are managing. The good news for service providers and vendors alike is that the way we conduct training, and the ways in which we learn, have also changed. We no longer have to fly people to a remote location with a massive lab for a full week of on-site training. Many classes are offered online. Many are recorded and available for replay.
Besides vendor-specific training, online courses are readily available from companies like Udacity and Coursera, across all kinds of subjects, for little or no cost. Probably more important than how the training is conducted is ensuring that your work environment and culture support training and continuous learning for employees, including support for the time investment away from the daily job. If an engineer is in training and is pinged every 15 minutes about his/her day job, the training won’t be of much use.
Tip 3. Don’t mistake activity for progress.
There are only 2 things that matter when you are dealing with a fault or service-impacting situation: service restoration and root-cause analysis. There are times when customer service personnel are researching logs, reviewing maintenance data, and correlating alarms to identify the source of the issue and restore service.
In these situations, don’t mistake quiet thoughtful analysis for a lack of urgency or progress — just like you shouldn’t interpret rapid actions and chaotic behavior as making progress.
One of the interviewees recalled a time when he was working at a service provider and failed to remain calm while also misinterpreting the actions of the vendor’s customer services personnel. After a few minutes, he declared that the vendor had better give him an answer in the next 5 minutes or he was going to power cycle the entire node, which had thousands of end-customer services running on it. The vendor’s customer services personnel pushed back, telling him he was going to do no such thing, that they were analyzing the data, and that they would have a much better resolution if he just had a little patience. After another 10-15 minutes, the vendor identified the issue and gave him precise procedures that addressed the issue without a node reboot, which would have impacted thousands of additional services.
Remember, remaining calm under pressure gets easier with practice and with preparation.
Tip 4. Develop a process, and the discipline to follow it.
One of the ways that surgeons avoid leaving instruments inside patients and pilots avoid missing critical steps in the pre-flight take-off of a plane is the disciplined use of checklists. In a similar manner, using a formal method-of-procedure (MOP) or a checklist in diagnosing networking faults helps everyone to avoid mistakes or missed steps in the process. There are specific questions (e.g., what, where, how big/impactful) that need to be answered and specific data to be collected every time. By documenting a process and the expectations, new employees can be on-boarded and handoffs can be efficiently managed including when escalating to other organizations or to vendors for assistance.
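As a rough illustration of the checklist idea, the questions above can be captured as a small record that reports which items are still unanswered before a fault is handed off or escalated. This is only a sketch: the FaultIntake class, its field names, and the sample values are hypothetical and not taken from any particular MOP.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FaultIntake:
    """Hypothetical intake record; the field names are illustrative only."""
    what: Optional[str] = None    # symptom or alarm observed
    where: Optional[str] = None   # node, span, or site affected
    impact: Optional[str] = None  # how big: services or customers affected
    logs_collected: bool = False  # have logs and alarms been captured?


def missing_items(record: FaultIntake) -> List[str]:
    """Return the names of checklist items that are still unanswered."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if value is None or value is False or value == "":
            missing.append(f.name)
    return missing


# Made-up example: two of the four items have been filled in so far.
record = FaultIntake(what="Loss of signal alarm", where="Node A, port 3")
print(missing_items(record))  # ['impact', 'logs_collected']
```

A handoff or escalation step could refuse to proceed until the list of missing items is empty, which is the discipline the checklist is meant to enforce.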
Tip 5. Make use of simulation.
With modern optical and IP/MPLS routing solutions, many vendors offer simulation environments that run on standard cloud servers. In many cases, simulation environments can be used in lieu of physical hardware labs — thus avoiding physical space, power, cooling, and the cost of the hardware itself.
Service providers can avoid many of the faults caused by configuration changes, network upgrades, and software upgrades by first running a sanity check in a simulation environment, where network element databases and circuits are analyzed before the change reaches the live network.
Tip 6. Automate.
Humans make mistakes. It happens. One of the most stable times in a network can be the holiday season, when non-critical activities are put on hold. Network trouble tickets consistently drop off during the holiday period, which further reinforces the point that humans aren’t perfect.
Increasingly, we see the benefit of custom software tools and automation to execute repetitive activities, which helps us avoid typos and typical human errors. By automating simple tasks and recurring activities, service providers can minimize the frequency of human-induced faults all year long.
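To make this concrete, here is a minimal sketch of the kind of small tool that removes typos from a repetitive task: it checks a list of port names against an expected pattern and then generates one command line per port. The shelf/slot/port naming pattern and the command syntax are invented for illustration and do not correspond to any real vendor CLI.

```python
import re
from typing import List

# Assumed naming convention (shelf/slot/port, e.g. "1/2/3"); adjust as needed.
PORT_PATTERN = re.compile(r"^\d+/\d+/\d+$")


def render_commands(ports: List[str], description: str) -> List[str]:
    """Build one command line per port, rejecting malformed port names."""
    bad = [p for p in ports if not PORT_PATTERN.match(p)]
    if bad:
        raise ValueError(f"malformed port names: {bad}")
    # The command syntax below is hypothetical, not a real CLI.
    return [f'set port {p} description "{description}"' for p in ports]


for line in render_commands(["1/1/1", "1/1/2"], "uplink"):
    print(line)
```

A port typed as "1/l/3" (a lowercase L instead of a 1) would be rejected before any command is generated, which is exactly the kind of mistake that is easy to miss when typing by hand.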
Tip 7. Share.
Have an expert on your team? Help them mentor others by creating a lunch’n’learn series where your experts have an opportunity to share their knowledge and experience with the rest of the team. If your expert isn’t a public speaker, have him/her start a collaborative wiki where experts can record information from past faults and everyone can contribute to continuous learning.
Tip 8. Commit to regular maintenance.
One of the key diagnostic decisions in fault detection and resolution in submarine networks is to determine if an issue is in the wet-plant or the submarine line terminal equipment (SLTE) that uses coherent dense wavelength division multiplexing transmission gear. A rigorous maintenance schedule that includes daily collection and storage of network performance data is key to rapidly diagnosing an issue. With such data, advanced analysis tools can rapidly detect a change or abnormal condition — and pinpoint the part of the network and the location of incidence.
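As a simple illustration of why stored daily data matters, the sketch below compares the latest reading of a metric against a baseline built from earlier readings and flags it when it deviates by more than a chosen number of standard deviations. This is not the analysis any specific tool performs: the metric (received optical power in dBm), the sample values, and the threshold of 3 are all assumptions made for the example.

```python
from statistics import mean, stdev
from typing import Sequence


def is_abnormal(history: Sequence[float], latest: float,
                threshold: float = 3.0) -> bool:
    """Flag a reading that deviates from the historical baseline.

    Uses a basic z-score check: distance from the historical mean,
    measured in sample standard deviations.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical readings")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        return latest != baseline
    return abs(latest - baseline) / spread > threshold


# Made-up daily readings of received power in dBm for one span.
history = [-12.1, -12.0, -12.2, -11.9, -12.1, -12.0, -12.1]
print(is_abnormal(history, -12.05))  # False
print(is_abnormal(history, -14.5))   # True
```

Running the same check per span or per terminal is one way to narrow down which part of the network changed, but it only works if the historical readings were collected and kept in the first place.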
So, this is our Elite 8 list for 2020, to help you reimagine your approach to network fault identification and resolution. We hope you can implement some of these ideas to improve your operational networking environment, reduce work stress, and follow through on your 2020 New Year resolutions.
Mark Leuzinger, Daniel Gatto, and Luis Perez, all from Infinera, contributed their expertise to this article.