Net Work

Below are the 20 most recent journal entries recorded in network_nerd's LiveJournal:

Friday, January 5th, 2007
12:44 pm
Re: Email moved to new server
To: Relevant Managers

I did not get, from the various advance announcements about the new webmail access, that the email server itself would be moving to a new machine.
This is important for three reasons:

1. While users whose email client was configured to use the email server alias got moved automatically, those of us who had somehow wound up with the actual server name had to manually correct that to receive today's email. If we're lucky, I was in a small minority. (Judging by the number of "email problem" help tickets, I suspect not.)

2. I have a couple of small wireless routers deployed which send me daily log reports. These are pretty minimal devices, and have the SMTP server configured by IP address. I'm going to have to manually change it on each one....

3. The important one!
There is an entry in the perimeter access lists to prevent outside email from coming directly to the server, so that mail from outside goes through the Barracuda box first instead.
I've now added a similar line for the new server address, but in the meantime I received an infected email that came directly to the server and bypassed Barracuda. [*I* recognized it as suspicious, and Norton confirmed it later, but we can't assume that all of our users would be as astute.] (I think this is the same infected email as one of our users reported receiving.)
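A minimal sketch of the gap described in point 3, assuming a simple perimeter rule that denies outside SMTP to any listed mail server. Every address below is a hypothetical documentation-range value, and `smtp_allowed` is an illustrative helper, not the actual access list syntax.

```python
from ipaddress import ip_address, ip_network

# Hypothetical addresses from the RFC 5737 documentation ranges.
INSIDE = ip_network("192.0.2.0/24")
BARRACUDA = ip_address("192.0.2.10")
OLD_MAIL = ip_address("192.0.2.25")
NEW_MAIL = ip_address("192.0.2.26")


def smtp_allowed(src, dst, protected_servers):
    """Return True if an SMTP connection from src to dst passes the perimeter."""
    src, dst = ip_address(src), ip_address(dst)
    if src in INSIDE:
        return True   # internal clients may talk to the mail server
    if dst in protected_servers:
        return False  # outside mail may not reach the server directly
    return True       # anything else, including the filter box, is allowed


acl_before_fix = {OLD_MAIL}           # only the old server was protected
acl_after_fix = {OLD_MAIL, NEW_MAIL}  # similar line added for the new server

outside = "198.51.100.7"
print(smtp_allowed(outside, NEW_MAIL, acl_before_fix))  # gap: direct delivery passes
print(smtp_allowed(outside, NEW_MAIL, acl_after_fix))   # gap closed
print(smtp_allowed(outside, BARRACUDA, acl_after_fix))  # filtered path still open
```

Until the new address is added to the protected set, outside mail can skip the filter entirely, which is the window in which the infected message arrived.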

Inevitably, services get moved over time as necessary. But amongst the various announcements that webmail was changing, this server move never reached my attention until well after the fact. This probably just means that anyone involved directly in the move, IF they realized that others might need to know about it, assumed someone else was taking care of that....
Thursday, December 28th, 2006
9:59 am
Christmas Crash
To: Network Team

Unfortunately, Santa didn't leave us any .dmp files, so all we know is that the lights looked good, but no traffic would pass, and a reboot took care of it.

David Gillett

-----Original Message-----
From: Monitoring System
Sent: Monday, December 25, 2006 2:12 AM
To: Network Team
Subject: Bridge FWAN DOWN

Time: 02:12:29 on 12/25/2006
Address: x.x.x.x
Status: Timed Out ( 11010)
Friday, December 15th, 2006
10:59 am
Fight me if you dare | Combat Cards
Wednesday, December 13th, 2006
6:02 pm
12:58 pm
Worm attack
Late Monday afternoon, I noticed that a machine was scanning random addresses across both campuses using port 135 (DCE). I blocked the port and tracked the machine to the support area, where one of the techs was reformatting a laptop.
Late Tuesday afternoon, I noticed similar traffic from another machine, and blocked that port.

This morning, that second machine showed up somewhere else on campus, and similar traffic was flooding from 22 additional machines, 19 at the big campus and 3 at the other -- most appear to also be laptops.

In addition to spreading via port 135, I've also seen:

1. At least one machine eventually started similar scanning on port 445 (CIFS).

2. These machines all try to "phone home" to port 7654 of a remote machine. I've got that blocked now, but one succeeded and appeared to be talking IRC over that port, reporting a "successful file download" to/from an additional machine which (so far) doesn't appear to have been trying to spread the infection further.

I've got the "phone home" traffic blocked, and the known infected machines null-routed at the gateway, which *should* make it just about impossible for them to infect outside their own VLANs.
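For illustration, the kind of scanning described above can be spotted by counting how many distinct destinations each source contacts on the worm's ports. This is only a sketch: the flow-record format, the `find_scanners` helper, and the threshold of 50 are assumptions, not the tooling actually used here.

```python
from collections import defaultdict

WORM_PORTS = {135, 445}  # the two ports seen in this outbreak
THRESHOLD = 50           # hypothetical cut-off for "scanning"


def find_scanners(flows, ports=WORM_PORTS, threshold=THRESHOLD):
    """Return sources that contacted at least `threshold` distinct hosts.

    flows is an iterable of (src_ip, dst_ip, dst_port) tuples.
    """
    targets = defaultdict(set)
    for src, dst, port in flows:
        if port in ports:
            targets[src].add(dst)
    return sorted(src for src, dsts in targets.items() if len(dsts) >= threshold)


# Made-up sample: one host sweeping addresses, one behaving normally.
flows = [("10.0.0.5", f"10.1.{i // 250}.{i % 250 + 1}", 135) for i in range(60)]
flows += [("10.0.0.9", "10.1.0.1", 445), ("10.0.0.9", "10.1.0.2", 80)]
print(find_scanners(flows))
```

A legitimate client talks to a handful of servers on these ports, so a high distinct-destination count is a reasonable first filter before null-routing anything.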
Monday, November 20th, 2006
10:59 am
Illicit application(s)
There's a machine on our network that is generating some odd traffic. The traffic which originally attracted my attention was BitTorrent, which we block, but something on that machine is also trying to open web connections to unreachable IP addresses, and "1337" turns up in those attempts. No big deal, except that "1337" is hacker "leet speak" and so these might be symptoms of some kind of rootkit, virus, or other hackage. The machine is in the Counselling department, so it may have access to sensitive personal information; if it has been hacked, that *could* be a serious data breach.

I did some research, and learned that some hacker tools attempt such connections as a test to see if the error message they get back indicates a web server/proxy that will do tunnelling to hide their connections. This *might* mean that the user has installed something which violates the "hacking tools" provision of the Acceptable Use policy.

I have not (yet) blocked the port, since there's no sign that bad traffic is actually getting through.
Friday, August 4th, 2006
10:24 am
Core reboot tonight
Got the new code images rolled out to all the big chassis switches, and will reboot them tonight. I'll be replacing the FCSM card in the coremost switch as well, so I'll be on site for that.

(Last time we rebooted the next switch along the ring, the core switch went dysfunctional until it was rebooted too -- and since that's where the MSS is, the whole ATM cloud was down. Vendor thinks replacing the FCSM will fix that.)

Also, the cluster is now talking on two different switches, so as long as I only reboot one of them at a time, it will stay up.

Oh, and a new network engineer starts Monday. I hope he's as good as his interview.
Tuesday, August 1st, 2006
5:31 pm
Antec fan
Last year, I got a good deal on a particularly quiet PC case (Antec "Sonata II" model). The PC has actually been sitting in my office at work, but when I came in on Monday it was dead. Restarted it, and it died again after about 10 minutes. I eventually figured out that the case fan had died -- it's larger than most(*) and has a three-speed switch, so I doubted I could find a replacement at Fry's. ((*) This is a feature; larger diameter means it can move as much air as a smaller fan at lower RPMs and thus less bearing noise.)

It didn't take much googling (there are a number of online reviews of this case) to determine that it features a "120mm Tricool fan", and only a little more to find where on Antec's site to order a replacement.
Friday, July 28th, 2006
10:15 am
RMA day
3x 3032 10 Mbps switches
1x CSM-622 OC-12 board

2 more 3032s and one more CSM-622 in queue to be authorized

I think there's another bad CSM-622 out there, but it might be the FCSM that's actually bad.
Thursday, July 13th, 2006
4:52 pm
You started it
Yet another user installed an application that tries to sidestep firewalls -- in this case, some Yahoo VOIP thing that first tries port 5061, but if it can't get through then it falls back to 443 and finally 80, even though it's using SIP and not HTTP or HTTPS.


The best definition I know of for "firewall" is "Network Policy Enforcement Device". So if you engineer an app to bypass typical firewalls, what you've created is, by definition, a "Network Policy VIOLATION Device". So the end users you're trying to help go from just not being able to use an unauthorized application, to potentially being FIRED for trying. User friendly? Hardly.

Look, guys: If you build your nifty thingamabob assuming that network security is your users' enemy, guess what? IT WILL BE.

Play nice. Use your own ports, register and document them. I routinely Google on "product name" and "firewall" to learn what I have to do to allow my users to use said product, and make the appropriate adjustments, usually within 24 hours of the first request from a user that gets approved.

But pull a stunt like Yahoo, and I have to start blocking addresses and checking the status of funding for an SSL proxy and possibly making it a bit hard for our users to get to some approved destinations while figuring out how to block your crap. Result is that I'm not happy, and neither are my users, and so when it reaches someone who can approve the use of your app -- or NOT! -- on our network, my recommendation is going to be "No, we can't trust them" and odds are that the blocks will be made permanent.

And it will be your own fault.
Friday, April 14th, 2006
10:33 am
Talk to us, please
Ticket from user:

NETWORK -- Only some of the ethernet ports in room xxxx are switched on. Can someone please turn the rest of them on?
Thank you

WHY? Are they making that room a lab? A facility for drop-ins to plug in and get Internet access? A secret server facility?

In order to bring up the ports, we need to determine what network to connect them to! We need some clue about how they will be used.
Friday, February 10th, 2006
4:36 pm
The cycle of life and death
And around it goes....

I did two things today that were rather unusual.

In the morning, I replaced the "engine" (motherboard) and I/O board of a Cisco 7204 router for our ISP, which meant stepping through enough initial configuration to get it to where they could talk to it remotely and finish the job of bringing it fully online. I don't often have to deal with a completely unconfigured router.

In the afternoon, I shut down all but one non-routing interface on an RSM that has been throwing tantrums that confuse its neighbors. For those who might not remember, the RSM was more or less a 2500-series router on a blade to fit in a 5000-series switch chassis. The result is not (quite) a layer 3 switch -- the chassis supervisor and the RSM each have their own OS, IP address, and console. The RSM mostly routes between VLAN virtual interfaces, while the other switch blades provide the physical interfaces which populate or extend those VLANs.

I thought it was kind of fitting to bring up a new router and shut down an old one....
Friday, January 20th, 2006
12:02 pm
Fun (not!) with virtual CSM ports
One of the things that we found recently, which might relate to some of the ATM instabilities we've had lately, was that the configurations of several ports on a key switch were not quite right. So last night I fixed them and rebooted the switch.

Unfortunately, one of the incorrect ports was a virtual port on the FCSM, and it came up in a non-working state. So there was no way to fix that from home; I had to come into the data center to the console to even talk to the switch.

Somehow, it had originally let me change the state of the port, but now would not let me fix it because it was "in use". I tried several different ways to clear that, without success.
What finally worked was to shut down the switch, pull its DS3 and OC3 cards so nothing would be seen by the FCSM, reboot, fix the FCSM port config, shutdown, reinsert the cards, and reboot. It's now up and working, and traffic is once more flowing between the campuses.
Thursday, January 12th, 2006
4:40 pm
Busy, busy, busy
Got a fresh (or perhaps refurbished...) switch back from RMA, and tried to configure it to replace one in the field that has been acting up. No go -- looks like serious problems with the flash filesystem. Set it aside and try another fresh switch.

Similar symptoms. This is bad news, because it gets me into reloading code images using ZModem instead of FTP. Let's see, 10 Kbps (9600, actually), rather than 10 Mbps, that's going to take roughly 1000 times as long.... I can cut it to 500 by temporarily jacking the console port up to 19200.
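The back-of-envelope numbers above can be checked with nominal line rates. This sketch ignores protocol overhead and serial start/stop bits, so real transfers would be somewhat slower on both sides; `slowdown` is just an illustrative helper.

```python
ETHERNET_BPS = 10_000_000  # nominal 10 Mbps link used for FTP


def slowdown(fast_bps, slow_bps):
    """How many times longer a transfer takes at slow_bps than at fast_bps."""
    return fast_bps / slow_bps


print(round(slowdown(ETHERNET_BPS, 9600)))   # 1042
print(round(slowdown(ETHERNET_BPS, 19200)))  # 521
```

So "roughly 1000 times" at 9600 and "500" at 19200 are both fair roundings.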

In amongst the misbehavior of the filesystem, though, I see an intermittent message. "Bad Hbus device type of 0". What's an Hbus? I don't know. But one thing these two switches have in common is that I'm trying the same 10/100 Mbps uplink module. What if I try a spare module instead? The symptoms vanish. I complete the config, deliver the switch, pull the one it's replacing and the temporary helper I installed in September, and I'm less than half an hour late to the evening's scheduled event.
Monday, January 9th, 2006
5:15 pm
ELAN routing
We've got a pair of ELANs, one per campus, for environmental (HVAC) and energy-management gear. Think SCADA.

These ELANs meet at three routers -- the core routers for each campus, and a chassis-based switch configured to be a member of both ELANs and route between them.

The Physical Plant folks say that this became "slow" a few months ago -- perhaps around the time we started noticing broadcast storms which we eventually traced to one of these ELANs. And that now it has stopped working altogether for *some* client devices.

Unfortunately, the devices have been configured by someone with little clue about IP addresses, masks, gateways, etc. And in many cases the end device configuration can only be examined/changed at the vendor's facilities, not in the field.

Since my colleague who helped set this stuff up is out, it's up to me to troubleshoot it....
10:47 am
This 'n' that
Configure a port for a printer in one building. We have one blade in one switch there which reverts to a default config whenever it's rebooted, which happened during one of last week's storms.

Laptops for a class are being reimaged for the term, and can't see the classroom access point. Can mine? It can, so must be an issue with the image/config. [EDIT: It helps if the laptops spell the SSID the same way as the AP does.]
8:21 am
Busy Sunday
Four calls.

One from an instructor who hasn't used one of our patch panels before, and can't find the document that tells how. We made sure to train the department admins so they could support instructors with this equipment, but I guess they like their Sundays off.

Two from campus security, reporting network down. Referred them to the duty cell phone, since I was (a) out of town, and (b) can't do unscheduled overtime without authorization from one of the directors who carry the duty phone. No sign of a general outage, so it may just have been their building.

And one from a colleague who's dealing with family issues. He had planned to come in on the weekend to configure a new switch for one building, but that didn't work out and he hoped I could handle it. So I came in at 7am and had the switch ready by 8am when a tech came by to deliver and install it.
Friday, January 6th, 2006
4:12 pm
Well, the temporary configuration from last night, which we hoped would be stable (long enough to resolve deeper issues), wasn't. It looks like having the primary MSS offer LES for some ELANs while others go to the secondary, because their LES on the primary is stopped, doesn't work very well.

So the primary is offline, and everything is running from the secondary now.

When we tried that before, the one switch on the other campus that needed an ELAN from here couldn't get it. Now we know why: When you move an ATM client from one switch to another, its ATM address changes. The primary MSS hasn't moved, but the secondary has (a couple of times) without updating the LECS database.

When the LECS is found by WKA (Well Known Address) and the LES(*) is "local", the correct LES address will be given out to clients. But if the LES is not local, the MSS had better have the correct ATM address to hand out! So when the secondary was live but moved, clients that relied on the primary or the far campus MSSes for LECS got handed an incorrect LES address and so never got VCs.
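A toy model of the failure described above, assuming an ATM address is built from a prefix supplied by the attached switch plus an identifier that stays with the device, and that a LECS answers from a static database whenever the LES is not local. The class, function, and all address strings are hypothetical; this is not how the MSS is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class LesHost:
    esi: str            # identifier that stays with the device
    switch_prefix: str  # prefix taken from whichever switch it is plugged into

    @property
    def atm_address(self):
        return f"{self.switch_prefix}.{self.esi}"


def lecs_answer(les, lecs_database, elan, les_is_local):
    """LES address a LECS would hand to a client asking about `elan`."""
    if les_is_local:
        return les.atm_address   # live address, always current
    return lecs_database[elan]   # static entry, only as good as its last update


secondary = LesHost(esi="ESI-B", switch_prefix="PFX-1")
database = {"elan5": secondary.atm_address}  # recorded before the move

secondary.switch_prefix = "PFX-2"            # MSS moved to another switch

print(lecs_answer(secondary, database, "elan5", les_is_local=True))   # PFX-2.ESI-B
print(lecs_answer(secondary, database, "elan5", les_is_local=False))  # PFX-1.ESI-B
```

In this model, clients asking a remote LECS are sent to an address where no LES is listening, which matches the symptom of never getting VCs.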

On the good side: our issues have now been escalated to an engineer we have worked with before, and who knows his stuff.
Thursday, January 5th, 2006
5:24 pm
We've had an interesting night and day, and believe we've found some important clues!

[1] Bug Report regarding switch directly connected to MSS

At the moment we have two MSSes live on this campus using IBM redundancy mode. We've forced clients to get their VCs from the secondary MSS for some VLANs by *stopping those LES-BUSes* on the primary.

This works for almost everybody. But the switch that the MSS is plugged into will NOT get VCs from the secondary in this case. Apparently, having a live primary plugged directly into the switch means that it absolutely will not fail over to a secondary LES-BUS.

Once again we encounter a configuration that was probably never included in the MSS test plan, and doesn't work.

[2] VPI/VCI configuration

Our secondary MSS is configured for 1024 maximum calls, and is plugged into a UNI 3.1 port configured for 2 VPI bits and 10 VCI bits. This seems to be working.

Our primary MSS is configured for 2048 maximum calls. It used to be plugged into a UNI 3.1 port configured for 1 VPI bit and 11 VCI bits. That used to work, but the MSS was moved from that port some time back, and when we move it to that port now, nobody gets any VCs. The Omni-9 that it's plugged into seems to only manage to get VCs from it when it's plugged into its current port instead.

Said current port is reported under "vap" as having 0 VPI bits and 10 VCI bits. The same is true of another UNI 3.1 port on the switch, and there's a PNNI port also reporting 0 and 10, as does the FCSM's virtual port.

A couple of years ago, we had a rash of ports reporting 0 and 10, and undertook to correct them; we were pretty sure we had. So we're concerned that this may indicate that *something* is corrupting the switch configuration in this regard. We plan to survey our network to see if this is cropping up anywhere else; we're not aware of any reason this issue should be present on our other campus, where we've had a similar range of issues.

[3] VC number exhaustion

We have two VLANs that are present on every switch: 2 and 7. Many switches also have VLANs 5 and/or 6. Most client problem reports have involved VLAN 5 or 6, and we often also find a problem with VLAN 2 when investigating. [VLANs 5 and 6 are rarely found on the Omni-9s making up the core ring, which may be why we see the problem mostly affecting distribution switches.]

I got a repeat trouble report today about VLAN 6 on a particular distribution switch (.223), yet I was already pinging it on VLANs 6 and 7 without any problem.

So I telnetted to the switch and did a broadcast ping of VLAN 7. I was getting timeouts about one in every 5-8 tries, which shouldn't happen. A broadcast ping on VLAN 6 showed no such failures. A broadcast ping on VLAN 2 was fully answered too -- and my ongoing ping from my machine to that switch on VLAN 6 stopped getting answers!

"vas" showed large numbers of VCs for each of these three VLANs; some of the VC numbers were just over 1000. A "mas save" would discard all of these VCs and replace them with the basic four VCs to the LECS/LES, but no further VCs could be obtained for any VLAN. Rebooting the switch would not fix the problem either. Only a LES-BUS restart would clear it, as we had discovered the other day.

Our current theory is that the MSS thinks it can issue VCs up to the "maximum number of calls" 2048, but the switch it is plugged into is only configured for 10 VCI bits on that port and so past 1024 the VCs don't get handed out.

That, though, raises several additional questions:

1. Why did it only break recently? We have gradually added a few more VLANs, but the most recent were way back in September.

2. Why won't it work on any other port now? We haven't been entirely systematic yet about testing it, but we believe we've tried it on a 1/11 port (correct), 2/10 port (incorrect), and another 0/10 port (how is this even possible?) without success.

3. Why are there ports showing as 0/10? What causes it? How do we fix it?

4. How many of our other ATM issues are side-effects of this one? Or perhaps vice versa?
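The arithmetic behind the VC exhaustion theory in [3] can be sketched as follows, assuming VC numbers map directly onto VCI values within a single VPI. The helper names are illustrative only.

```python
def vci_values(vci_bits):
    """Number of distinct VCI values a port can address with vci_bits bits."""
    return 2 ** vci_bits


def first_unreachable_vc(max_calls, vci_bits):
    """Lowest VC number the MSS might issue that the port cannot carry.

    Returns None when the port's VCI range covers every call the MSS may make.
    """
    limit = vci_values(vci_bits)
    return limit if max_calls > limit else None


print(first_unreachable_vc(2048, 10))  # primary on a 10-bit port: 1024
print(first_unreachable_vc(2048, 11))  # primary on an 11-bit port: None
print(first_unreachable_vc(1024, 10))  # secondary on a 10-bit port: None
```

Under that assumption, the primary's 2048 calls only fit on a port with 11 VCI bits, while the secondary's 1024 fit within 10, which is consistent with VC numbers stalling just past 1000.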
Wednesday, January 4th, 2006
4:21 pm
More ELAN instability
We've seen fewer spontaneous switch reboots, I think, but we've begun to see a different issue:

   Each of our campuses has a core ring of OC-12 connected Omni-9s. Some of our stackable switches are fed via OC-3s from those core switches.

   In a few places, there's a distribution layer, an Omni-9 that connects via 2 or 4 (usually 4) aggregated OC-3s (PNNI) to a core switch, and additional stackables are fed from there. [In one place, we have an additional Omni-9 fed from the distribution layer.]

   We've seen several cases in the last 2-3 weeks where one of these distribution switches, or one or more of the stackables fed from it, can obtain VCs for some of its VLANs, but others are stuck in "LECS Connect" state and do not get VCs. Doing a "mas save" on the service does not clear it. Going into "mas" and choosing not to try ILMI (which we do not use) does not clear it. Rebooting the switch does not clear it.

   Restarting the LES-BUS at the MSS *usually* clears the problem. This is sort of tolerable this week when 90% of our users are still off campus. It will be intolerably disruptive next week.

   This problem is occurring on both campuses, each of which has its own MSS and core.

   We've had several power failures in the last month due to severe weather, and on a couple of occasions we've seen an Omni-9 apparently not come back on line when power is restored. We *suspect* that in at least some cases, the Omni-9 may simply have failed to get VCs on all of its major VLANs.

   In one case, I found that pinging the MSS from the Omni-9 it's connected to was seeing responses to only every other ping. One of the LES-BUSes that I needed to restart was the one that the MSS is on, and when I did that this symptom also cleared up.

   It's not obvious whether this problem is related to the "SSCOP flutters" and/or "spontaneous reboots" issue. The main thing they have in common is that our ATM infrastructure has become unstable, and we need to get it stable again.