Don’t Touch Anything!
For more than 30 years, either as a vendor or as a manager with union employees, I could not even flip a line bay breaker switch. Recently, I have worked on live 1GE a lot and some 10GE. This was my first chance to work on live 100GE service.
For several months recently, I had the weekend outage duty for 5 areas every 5 weeks. I never know when or how many times the phone will ring, but I rarely make it through a weekend without a call. I was happily puttering around in my garage when the phone rang.
Last weekend, I had my hands on a live 100GE circuit for the first time. Fourteen hours later, I was convinced that we were never looking in the right place.
After talking to the NOC Tech, I knew from the beginning that this 100GE thing was going to be trouble. The NOC Tech did not understand the design of the circuit that we were shooting, and neither did I. His office was in another guy’s territory. The NOC was short-handed that weekend, and so were we. Basically, we were stuck with each other, and we had to do our best.
NOC and Net
The interaction between the NOC Tech and the Network (Net) Tech is crucial to outage repair. The NOC Tech is the specific equipment expert and the Net Tech is the local network expert. Some trouble doesn’t require a lot of both areas of expertise, but sometimes, we really need each other.
Interestingly, there can be friction between the two groups sometimes. When I worked with the NOC, the Net Techs were sometimes called “knuckle draggers”. As a technician again, I sometimes hear complaints about “drone pilots” in the NOC. Still, generally, everybody gets along and works together. And it must be said that there are some superstars in both groups.
Santa’s 2017 Gift
We lost an entire small town this last Christmas Day due to the faulty installation of a generator transfer switch. Everything was dead: voice, data, Internet, transport. Everything. The faulty installation even prevented the alarms from activating.
What happened from there was the impressive part. Techs from 4 different NOC groups worked in concert with 2 Net Techs and 2 Net Managers.
From the time we hit the front door of a completely dark CO:
• Transport was up in 30 minutes.
• Internet was up, and customers were in sync in 60 minutes.
• Class 5 Voice Switch was up in 120 minutes.
In terms of background, here is the 100GE circuit trouble description.
1. 100GE circuit between CO #1 and CO #2.
2. At the COs, it attaches to another ring via Cyan Z33s.
3. Services being impacted were cell sites for a very vocal customer.
4. If there was a redundant standby circuit, nobody ever found it.
5. Approximately every 4 minutes the circuit would take massive hits and degrade to the point of going down.
6. The NOC Tech and I agreed to remotely shut down the light and reset the CFP (100GE version of SFP).
7. We were celebrating after 5 minutes, but not after 25 minutes.
8. After that first failure, it went back to failing every 4 minutes.
It must be noted that a transport fiber optic cable was broken the previous day, causing massive outages. This outage was 150 miles away, and supposedly not on the same ring. Still, that nagged me throughout this process.
Note: Due to road conditions, every trip between the COs was about 60 minutes.
To understand the progression of the live 100GE repair, please “enjoy” the steps we went through on that Christmas Day.
CO #2 Trip 1: Net Tech #1 and NOC Tech #1
• We physically reseated the CFP on the Cyan card.
• We cleaned all of the connectors and ports.
• Then, the circuit wouldn’t come up at all.
• A 100GE circuit next to it went down briefly.
• We saw bad light at CO #1.
CO #1 Trip 1: Net Tech #1 and NOC Tech #2
• Note: the NOC had a shift change, so another NOC Tech was assigned.
• Verified that the equipment on site was putting out good light.
• Low light coming from CO #2.
• We changed frequencies at CO #1 through the link to CO #2.
CO #2 Trip 2: Net Tech #1 and NOC Tech #2
• Moved the jumper to the new frequency on the link between CO #1 and CO #2.
• The circuit came back up.
• It lasted 25 minutes and dropped.
• It started dropping every 4 minutes.
NOC Tech decides the CFP at CO #1 is bad
• 2 more Net Techs are sent a total of 200 miles over questionable roads to get the closest spare — I was to stay and continue testing.
• More cleaning and testing.
• Though light levels look good on both sides, the circuit is down again and won’t come back.
• I am sent back to CO #1.
CO #1 Trip 2: Net Tech #1 and NOC Tech #2
• In transit, the circuit came back up.
• It came up 30 minutes after I left CO #2, and nobody had changed anything (that we know of).
• It stayed up 90 minutes.
• The decision was made to change out the CFP at CO #1 anyway due to a “software alarm”.
• Note: The circuit was up, I had 12 hours in and a long drive back, so I left the CFP replacement to the Net Tech who was sent to get the hardware.
CO #1 Trip 3: Net Tech #2 and NOC Tech #3
• Note: Another shift change at the NOC.
• Circuit never dropped again (over 3 hours by now).
• Software error still active.
• CFP replaced.
• Circuit came up and stayed up.
• Software error still active.
• 2 hours of analysis with no change.
• Net Tech #2 takes the long drive home, because no more local support is required.
The result: weeks later, the circuit has never gone down again.
Immediately, the NOC tech suspected a dirty connection or a bad CFP (100 Gig version of the SFP). Though I concurred in the beginning, I started to have doubts early on in the process.
At 100 gigabits per second, physical issues knock down a circuit quickly: not in 25 minutes, and not even in 4 minutes. So much data is flowing that even an extremely low error rate accumulates fast enough to overwhelm the circuit.
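To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The bit error rates are illustrative assumptions, not measurements from this circuit, and the arithmetic ignores forward error correction and line-coding overhead. It only shows how many bits pass in the time windows we kept seeing, and how many of them would be errored at a given rate.

```python
# Back-of-the-envelope only: nominal 100 Gb/s, no FEC, no coding overhead.
LINE_RATE_BPS = 100e9


def bits_in_window(seconds, rate_bps=LINE_RATE_BPS):
    """Number of bits that cross the link in the given number of seconds."""
    return rate_bps * seconds


def expected_bit_errors(ber, seconds, rate_bps=LINE_RATE_BPS):
    """Expected errored bits for an assumed bit error rate (BER)."""
    return ber * bits_in_window(seconds, rate_bps)


# Assumed BER values, chosen only to show the scale of the numbers.
ASSUMED_BERS = (1e-12, 1e-9, 1e-6)
# The windows are the intervals from this outage: 4 and 25 minutes.
WINDOWS = (("1 second", 1), ("4 minutes", 240), ("25 minutes", 1500))

for ber in ASSUMED_BERS:
    for label, seconds in WINDOWS:
        errors = expected_bit_errors(ber, seconds)
        print(f"BER {ber:.0e} over {label}: about {errors:,.0f} errored bits")
```

Under these assumptions, even a very clean link at a BER of 1e-12 sees about 24 errored bits in 4 minutes, and anything much worse piles up errors within the first second. That is why a 25-minute run followed by a 4-minute cycle did not look like a dirty connector to me.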
All the people with a TDM background with whom I later spoke are convinced this was a timing issue. The broken transport fiber had been back in service for only 12 hours when this trouble started.
What I learned is that 100GE is a different animal from 1GE or even 10GE. It does not like being touched!
I know what happened when we knocked down the adjacent 100GE circuit. I was physically tracing the jumper looking for macro-bends. One tug at the wrong spot in that fiber rack and that other circuit dropped.
New 100GE Strategy
I am changing my practices based on this experience. I admit that I have been pretty rough with 1GE and 10GE fiber jumpers. You must be, if you are physically tracing them. Also, I have been cleaning all the connections at the same time to reduce outage time.
I am implementing these changes:
1. Don’t physically touch anything!
2. Understand exactly where the circuit goes and what it does. If not understood, see Step #1.
3. Use remote data sources to passively evaluate symptoms. Until then, see Step #1.
4. Create a short list of possible root causes.
5. Create a test strategy for each root cause. May I repeat: physical touching is the last resort for each strategy.
6. CAREFULLY change one thing at a time, and one thing only.
7. Re-evaluate strategy based on results.
8. Repeat Steps 5-7 until complete.
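For readers who think in code, the ordering behind Steps 4 through 6 can be sketched in Python. This is only an illustration of the idea, not a tool anyone used that day. The class names, the three intrusiveness levels, and the example actions are all hypothetical. The candidate root causes are the ones raised in this column. The re-evaluation in Step 7 stays with the humans.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Intrusiveness(IntEnum):
    """Hypothetical ranking of how much an action disturbs a live circuit."""
    PASSIVE = 0    # read remote alarms, counters, light levels
    REMOTE = 1     # remote change, such as shutting down the light
    PHYSICAL = 2   # hands on jumpers or optics: the last resort


@dataclass
class Action:
    description: str
    intrusiveness: Intrusiveness


@dataclass
class RootCause:
    name: str
    actions: list = field(default_factory=list)


def plan(causes):
    """Flatten every cause's actions into one queue, least intrusive first.

    sorted() is stable, so actions at the same level keep the order
    in which the causes were listed. Each queue entry is one change.
    """
    steps = [(cause.name, action) for cause in causes for action in cause.actions]
    return sorted(steps, key=lambda step: step[1].intrusiveness)


# Short list of possible root causes (Step 4) with illustrative tests (Step 5).
causes = [
    RootCause("dirty connection", [
        Action("review receive light levels remotely", Intrusiveness.PASSIVE),
        Action("clean one connector", Intrusiveness.PHYSICAL),
    ]),
    RootCause("bad CFP", [
        Action("read CFP alarms remotely", Intrusiveness.PASSIVE),
        Action("remotely reset the CFP", Intrusiveness.REMOTE),
        Action("swap the CFP", Intrusiveness.PHYSICAL),
    ]),
    RootCause("timing issue", [
        Action("check timing alarms remotely", Intrusiveness.PASSIVE),
    ]),
]

# One change per numbered step (Step 6); stop and re-evaluate after each.
for number, (cause, action) in enumerate(plan(causes), start=1):
    print(f"{number}. [{action.intrusiveness.name}] {cause}: {action.description}")
```

The point of sorting across all causes, rather than finishing one cause before starting the next, is that every passive check gets done before anybody touches a jumper.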
When it comes to 100GE, I now believe in the Hippocratic Oath: First do no harm.
I believe that much of the testing and cleaning performed that day did more harm than good. I think we broke that circuit at least twice and broke an adjacent 100GE circuit briefly once. I don’t think we ever had much chance of fixing anything because the trouble was not where we were looking.
I believe somebody in another group found a problem left over from the fiber break the previous day. They were working on it when the circuit would not come up for 30 minutes. They fixed whatever the problem was and it never dropped again.
If you have an alternative explanation, please send it in and we can discuss it in a later column.
When I was at my truck that Christmas day, an elderly couple pulled up and rolled down the window. “Thank you for sacrificing your Christmas to help us,” the smartly dressed silver-haired gentleman said to me.
Because we are onsite, we get to hear the gratitude that the NOC people don’t get to hear. I wanted to share this shout-out because a lot of people sacrificed their Christmas that day. In fact, this was the fifth Christmas callout in a row for one NOC Tech, whom I must thank for his/her help.
Heroes are made by the paths they choose, not the powers they are graced with. (Brodi Ashton, Everneath)