March 24, 2003

I am so blind

So here’s the scoop on what happened last week when I went crazy and did an early termination of my web log entry.

For starters, I neglected to mention that my /etc/ld.so.conf was empty on both the RH8 machines I was looking at. I’ll further add that this was abnormal. What I eventually found out was that /etc/ld.so.conf on every RH8 machine was getting cleared out. In addition, two copies of glibc-utils would usually end up hanging around. On a few machines glibc was missing altogether.

I ended up using the Bootdisk HOWTO to create some boot disks that I sent out to a few technicians. The procedure was like: hook up keyboard (most boxes are headless), put floppy in drive, hit reset button (no real good hope of a clean reboot), wait for five minutes (while the boot floppy loaded the kernel — 5m should be more than enough), remove floppy, insert floppy (root floppy), press enter on keyboard (kernel prompt to insert root disk), wait five minutes (for the stuff on the root disk to run; BusyBox rules BTW), remove floppy, reboot computer, call me. This last step was a good thing, because in most cases these boot floppies didn’t work; this was especially the case on systems without glibc.

I also never ended up sleeping. After I finished that log entry I had to diagnose the problem, come up with a fix, shower, call some people, then I ran around to three or four sites, mainly clients, before returning home. I got a one-hour nap before euphorik and I had to go pick up tuxes and get dinner, then I made dinner.

During all this I learned that a glibc update was put out by Red Hat. This got put on my Red Hat mirror (currently offline because its drive finally filled up, damnit), built into my APT repository, then all my boxes tried to pull it down when their nightly apt-get dist-upgrade hit. Some of my boxes didn’t get the update by a coincidence of timing, so I was able to play around with what happened. I did a straight rpm -Uvh of the same bunch of RPMs that APT should have been updating and the bug did not manifest itself: /etc/ld.so.conf fine, everything installed OK.

Really, that’s not entirely true: when you use this glibc update you have to run service sshd restart afterward. This is apparently being reported by other people as well, so I don’t think it’s just me. I made the mistake of not doing this and I was unable to reconnect via SSH.

Anyway, I was puzzled about what APT had been doing at that point. Some environment variable? Fucked up flag to rpm?

Just now I was flipping through my e-mail and noticed a message I had saved in a half-awake stupor. It’s the output of the APT cron job from one of the machines that ended up dying. Check this out:

The following packages will be upgraded
  glibc glibc-common glibc-devel glibc-utils nscd
5 packages upgraded, 0 newly installed, 0 removed and 0 not upgraded.
Need to get 18.6MB of archives. After unpacking 678kB will be used.
[...]
Fetched 18.6MB in 16m29s (18.8kB/s)
Executing RPM (-e)...
warning: /etc/nsswitch.conf saved as /etc/nsswitch.conf.rpmsave
warning: /etc/ld.so.conf saved as /etc/ld.so.conf.rpmsave
error: %postun(glibc-2.2.93-5) scriptlet failed, exit status 255
Executing RPM (-Uvh)...
warning: /var/cache/apt/archives/glibc-utils_2.3.2-4.80_i386.rpm: V3 DSA signature: NOKEY, key ID db42a60e
Preparing...                ##################################################
glibc-utils                 ##################################################
error: %post(glibc-utils-2.3.2-4.80) scriptlet failed, exit status 255
error: %pre(glibc-devel-2.3.2-4.80) scriptlet failed, exit status 255
error:   install: %pre scriptlet failed (2), skipping glibc-devel-2.3.2-4.80
glibc-common                ##################################################
glibc                       ##################################################
nscd                        ##################################################
E: Sub-process /bin/rpm returned an error code (5)

Now, when I first read this, I thought, “What does status 5 mean?” I knew RPM was statically linked, so it wouldn’t be failing to be invoked. Then I read further up from the last line and noticed the little bit about error: %postun(glibc-2.2.93-5) scriptlet failed, exit status 255. At this point it hit me, as it may have just hit you. APT uninstalls the old packages first, then installs the new ones; it notably does not do a straight rpm -U (upgrade) over the installed packages. The log shows rpm -e running before rpm -Uvh, so by the time the second command runs there is nothing left to upgrade. When it removed the RH8 glibc package, that package took /etc/ld.so.conf with it. When the new glibc came along it just created an empty /etc/ld.so.conf. Mayhem ensued. (Lots of stuff on the system is linked with Kerberos libraries, I think.) This doesn’t definitively explain everything, like why some machines had no glibc at all, but I’d be willing to bet this was the root cause.
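
Since the giveaway lines were buried in a cron mail, here is a rough sketch of how that kind of output could be scanned automatically for failed scriptlets and config files that got moved aside. The function name scan_apt_output is made up for illustration, and the regular expressions are modeled only on the messages in the log above; other RPM versions may word things differently.

```python
import re

# Patterns modeled on the log above. The exact message wording on other
# RPM versions is an assumption.
SCRIPTLET_FAILED = re.compile(
    r"^error: (%\w+)\(([^)]+)\) scriptlet failed, exit status (\d+)"
)
CONFIG_SAVED = re.compile(r"^warning: (\S+) saved as (\S+\.rpmsave)$")


def scan_apt_output(text):
    """Return (failed_scriptlets, saved_configs) found in APT/RPM output.

    failed_scriptlets is a list of (phase, package, exit_status) tuples;
    saved_configs is a list of config paths that were renamed to .rpmsave.
    """
    failed = []
    saved = []
    for line in text.splitlines():
        line = line.strip()
        m = SCRIPTLET_FAILED.match(line)
        if m:
            failed.append((m.group(1), m.group(2), int(m.group(3))))
            continue
        m = CONFIG_SAVED.match(line)
        if m:
            saved.append(m.group(1))
    return failed, saved


if __name__ == "__main__":
    # A few lines copied from the cron output above.
    sample = """\
Executing RPM (-e)...
warning: /etc/nsswitch.conf saved as /etc/nsswitch.conf.rpmsave
warning: /etc/ld.so.conf saved as /etc/ld.so.conf.rpmsave
error: %postun(glibc-2.2.93-5) scriptlet failed, exit status 255
"""
    failed, saved = scan_apt_output(sample)
    for phase, package, status in failed:
        print(f"{phase} failed for {package} (exit {status})")
    for path in saved:
        print(f"config moved aside: {path}")
```

A failed %postun together with an .rpmsave warning for /etc/ld.so.conf would have been a reasonable thing to alert on before the boxes started falling over.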

Now what do I do about this? Put a hold on all glibc packages in APT? That’s about the only thing I can really think of. I suspect it would be non-trivial to make APT run rpm -U instead of rpm -e then rpm -i. I think there’s a noremove flag you can give to the %config directive that’ll tell RPM not to remove the configuration file, perhaps. Perhaps. So maybe it is Red Hat’s fault after all.
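
Separately from holding the packages, one possible stopgap is to check after each nightly run whether /etc/ld.so.conf was left empty while an .rpmsave copy sits next to it, as the log suggests happened here. The sketch below is only that, a sketch: restore_if_emptied is a hypothetical helper, it defaults to a dry run, and it assumes the .rpmsave copy is the one you actually want back.

```python
import os
import shutil


def restore_if_emptied(path, suffix=".rpmsave", dry_run=True):
    """Copy path+suffix back over path if path is missing or zero bytes.

    Returns True when a restore is (or, in dry-run mode, would be) done.
    A file containing only whitespace is not treated as empty.
    """
    backup = path + suffix
    if not os.path.isfile(backup):
        return False  # nothing to restore from
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return False  # current file has content; leave it alone
    if not dry_run:
        shutil.copy2(backup, path)
    return True


if __name__ == "__main__":
    # Dry run by default: only report, do not touch anything.
    if restore_if_emptied("/etc/ld.so.conf"):
        print("/etc/ld.so.conf is empty but an .rpmsave copy exists")
```

After an actual restore, ldconfig would still need to be run so the linker cache picks up the restored paths.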

Inspector darky solves the case of the missing glibc.

On a less angsty note, ardent had his wedding this past weekend and is now sipping on a glass of scotch somewhere south of Cancun. I hope he and his new bride are very happy, and want to thank him for letting us all have a good time. I’m having serious feelings of inadequacy about having lived up to my role of best man, but I was and am still honored to have been chosen to stand at his wedding. I hope I didn’t make things too hard on him. :)