Yesterday, I got a first-hand demonstration of how a simple, well-meaning act of tidying up can have far-reaching consequences for a network.
Our campus uses Cisco IP phones both for regular communication and for emergency paging. As such, every classroom is equipped with an IP phone, and each of these phones has a built-in switch port, so that rooms with only one active network drop may still have a computer (or more often a networked printer) wired in. If you work in such an environment, I hope that this short story will serve as a cautionary tale about what happens when you don’t clean up.
I was working at my desk yesterday afternoon, already having more than enough to do, since the start of school is only a few days away, and everybody wants a piece of me all at once. While I was reading through some log files, a bit of motion at the bottom of my vision caught my attention: the screen on my phone had gone from its normal display to a screen that just said “Registering” at the bottom left with a little spinning wheel. Well, thought I, it’s just a blip in the system–not the first time my phone’s just cut out for a second. So I reset my phone. Then I looked and saw that my co-workers’ phones were doing the same thing. Must just be something with our switch, I thought. So I connected to the switch over a terminal session and checked the status of the VLANs. Finding them to be all present and accounted for, I took the next logical step and reset the switch. A couple of minutes later, the switch was back up and running, but our phones were still out.
Logging in to the voice box, I couldn’t see anything out of the ordinary, and the closest phone I could find outside of my office was fully operational. Soon, I began getting reports that the phones, the Wi-Fi, and even the wired internet were down or at least very slow elsewhere on campus, though from my desk, I was still able to get out to the internet with every device available to me. The reports, though, weren’t all-encompassing. The middle school, right across a courtyard from my office, still had phones, as did the art studios next door, but the upper school was down, and the foreign language building was almost completely disconnected from the rest of the network–the few times I could get a ping through, the latency ranged from 666 (seriously) to 1200-ish milliseconds.
I reset the switches I could reach in the most badly affected areas. I reset the core switch. I reset the voice box. Nothing changed. I checked the IP routes on the firewall: nothing out of the ordinary. Finally, in desperation, my boss and I started unplugging buildings, pulling fiber out of the uplink ports on their switches, then waiting to see if anything changed. Taking out the foreign language building, the most crippled building, seemed like the best starting point, but was fruitless. Then we unplugged the main upper school building, and everything went back to normal elsewhere on campus. Plug the upper school in, boom–the phones died again–unplug it, and a minute later, everything was all happy internet and telephony.
We walked through the building, looking for anything out of the ordinary, but our initial inspection turned up nothing, so, with tape and a marker in hand, I started unplugging cables from the switch, one by one, labeling them as I went. After disconnecting everything on the first module of the main switch, along with the secondary PoE switch that served most of the classroom phones, I plugged in the uplink cable. The network stayed up. One by one, I plugged cables back into the first module, but everything stayed up. Then I plugged the phone switch back in, and down the network went again.
After another session of unplugging and labeling cables, I plugged the now-empty voice switch back in, hoping for the best. The network stayed up. Then I plugged the first of the cables back into the switch. Down the network went. Unplug. Back up. Following the cable back to the patch panel, we eventually found the problem, missed on my initial sweep of the rooms: two cables hanging out of a phone, both plugged into ports in the wall. For whatever reason, both ports on that wall plate had been live, and that second cable, plugged in out of some sense of orderliness, had created the loop that flooded the network with broadcast packets and brought down more than half of campus.
Take away whatever lesson you want from this story, but after working for almost four hours to find one little loop, I will think twice about hotting up two adjacent ports if they aren’t both going to be connected immediately and (semi)permanently to some device, especially if one of them is going to a phone.
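For anyone wondering how one extra patch cable can do this much damage, here is a rough Python sketch of the flooding behavior. It is a toy model, not a description of our actual gear: the node names ("core", "voice", "phone") and the function count_flooded_frames are made up for illustration, and it assumes every device simply repeats a broadcast out of every link except the one it arrived on, with nothing like spanning tree stepping in to block the redundant path.

```python
def count_flooded_frames(links, source, max_hops):
    """Count transmissions triggered by ONE broadcast frame.

    links    -- list of (node_a, node_b) pairs; listing the same pair
                twice models two parallel cables between two devices.
    source   -- node where the broadcast originates.
    max_hops -- stop after this many forwarding rounds (real Ethernet
                frames have no hop limit; this only ends the simulation).

    Toy assumption: every node floods a received broadcast out of every
    link except the one it arrived on, and no loop prevention is active.
    """
    # Map each node to its attached (link_id, neighbour) pairs.
    ports = {}
    for link_id, (a, b) in enumerate(links):
        ports.setdefault(a, []).append((link_id, b))
        ports.setdefault(b, []).append((link_id, a))

    # Each in-flight copy is (node it has reached, link it arrived on).
    in_flight = [(source, None)]
    transmissions = 0
    for _ in range(max_hops):
        next_round = []
        for node, arrived_on in in_flight:
            for link_id, neighbour in ports.get(node, []):
                if link_id != arrived_on:
                    next_round.append((neighbour, link_id))
        transmissions += len(next_round)
        in_flight = next_round
        if not in_flight:
            break  # the broadcast has died out on its own
    return transmissions


# One cable from the wall to the phone: a simple chain, no loop.
one_cable = [("core", "voice"), ("voice", "phone")]

# The "tidy" second cable: two parallel links between the same devices.
two_cables = [("core", "voice"), ("voice", "phone"), ("voice", "phone")]

if __name__ == "__main__":
    for hops in (10, 50, 200):
        print(hops,
              count_flooded_frames(one_cable, "core", hops),
              count_flooded_frames(two_cables, "core", hops))
```

With one cable, the broadcast is forwarded twice and then has nowhere left to go. With two cables, the copies keep chasing each other around the loop, so the transmission count keeps climbing for as long as you let the simulation run, and every lap sends more copies back toward the core. Multiply that by every broadcast on a real network and you get an afternoon like mine.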