These are the ramblings of Matthijs Kooijman, concerning the software he hacks on, hobbies he has and occasionally his personal life.
Most content on this site is licensed under the WTFPL, version 2 (details).
A few months ago, I put up an old Atom-powered Supermicro server (SYS-5015A-PHF) again, to serve at De War to collect and display various sensor and energy data about our building.
The server turned out to have an annoying habit: every now and then it would start beeping (one continuous, annoying beep) and would not stop until the machine was rebooted. It happened sporadically, but kept coming back. When I used this machine before, it was located in a datacenter where nobody would care about a beep more or less (so maybe it had been beeping for years on end before I replaced the server), but now it was in a server cabinet inside our local Fablab, where there are plenty of people to become annoyed by a beeping server...
I eventually traced this back to faulty sensor readings and fixed this by disabling the faulty sensors completely in the server's IPMI unit, which will hopefully prevent the annoying beep. In this post, I'll share my steps, in case anyone needs to do the same.
At first, I noticed that the IPMI web interface displayed an alarm for one of the fans. Of course it makes sense to be notified of a faulty fan, except that the system did not have any fans connected... It did show the fan speed as 0RPM (or -2560RPM depending on where you looked) as expected, so I suspected that it would start up correctly detecting that there was no fan, but then sporadically pick up a bit of electrical noise on the fan speed pin, causing it to mark the fan as present and immediately as not running, triggering the alarm. I tried to fix this by shorting the fan speed detection pins to the GND pins to make them more noise-resilient.
However, a couple of weeks later, the server started beeping again. This time I
looked a bit more closely, and found that the problem was now caused by a
temperature reading that was too high. The IPMI system event log (queried using
`ipmi-sel`) showed:
43 | Feb-17-2023 | 09:18:58 | CPU Temp | Temperature | Upper Non-critical - going high ; Sensor Reading = 125.00 C ; Threshold = 85.00 C
44 | Feb-17-2023 | 09:18:58 | CPU Temp | Temperature | Upper Critical - going high ; Sensor Reading = 125.00 C ; Threshold = 90.00 C
45 | Feb-17-2023 | 09:18:58 | CPU Temp | Temperature | Upper Non-recoverable - going high ; Sensor Reading = 125.00 C ; Threshold = 95.00 C
46 | Feb-17-2023 | 16:26:16 | CPU Temp | Temperature | Upper Non-recoverable - going high ; Sensor Reading = 41.00 C ; Threshold = 95.00 C
47 | Feb-17-2023 | 16:26:16 | CPU Temp | Temperature | Upper Critical - going high ; Sensor Reading = 41.00 C ; Threshold = 90.00 C
48 | Feb-17-2023 | 16:26:16 | CPU Temp | Temperature | Upper Non-critical - going high ; Sensor Reading = 41.00 C ; Threshold = 85.00 C
This is a bit opaque, but the events at 9:18 show the temperature was read as 125°C, clearly indicating a faulty sensor. These are (I presume) the "asserted" events for each of the thresholds that this sensor has. Then at 16:26, the server was rebooted and the sensor read 41°C again (which I believe is still higher than realistic) and each of the thresholds emitted a "deasserted" event.
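Readings like these can also be picked out of the log automatically. Below is a minimal sketch, assuming `ipmi-sel` output in the pipe-separated format shown above. The function name `implausible_temp_events` and the default limits of 0 and 100°C are made up for illustration, so pick limits that make sense for your hardware.

```shell
#!/bin/sh
# Sketch: print SEL temperature events whose reading is outside a plausible
# range. Reads ipmi-sel output (format as shown above) on stdin.
# $1 and $2 are hypothetical plausibility limits in degrees C (default 0, 100).
implausible_temp_events() {
    awk -F'|' -v min="${1:-0}" -v max="${2:-100}" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        trim($5) == "Temperature" && match($6, /Sensor Reading = -?[0-9.]+/) {
            # Skip the 17-character "Sensor Reading = " prefix, keep the number
            reading = substr($6, RSTART + 17, RLENGTH - 17) + 0
            if (reading < min || reading > max)
                print trim($2), trim($3), trim($4), reading
        }'
}
```

It could be used as `sudo ipmi-sel | implausible_temp_events 0 100`, which would list the three 125°C events above and skip the 41°C ones.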
Looking back, I noticed that the log showed events for both fans and both
temperature sensors, so it seemed all of these sensors were really wonky. I
could also see the incorrect temperatures clearly in the sensor data I had been
collecting from the server (using telegraf, collected using `lm-sensors` from
within the Linux system itself, but clearly reading from the same sensor as
IPMI):
Note that the graph above shows three sensors, while IPMI only reads two, so I am not sure what the third one is. The alarm from the IPMI log is shown clearly as a sudden jump of the temp2 purple line (jumping back down when the server was rebooted). But also note an unexplained second jump down a few hours later, and note that the next day temp1 dives down to -53°C for some reason, which also matches what IPMI reads:
$ sudo ipmitool sensor
System Temp | -53.000 | degrees C | nr | -9.000 | -7.000 | -5.000 | 75.000 | 77.000 | 79.000
CPU Temp | 27.000 | degrees C | ok | -11.000 | -8.000 | -5.000 | 85.000 | 90.000 | 95.000
CPU FAN | -2560.000 | RPM | nr | 400.000 | 585.000 | 770.000 | 29260.000 | 29815.000 | 30370.000
SYS FAN | -2560.000 | RPM | nr | 400.000 | 585.000 | 770.000 | 29260.000 | 29815.000 | 30370.000
CPU Vcore | 1.160 | Volts | ok | 0.640 | 0.664 | 0.688 | 1.344 | 1.408 | 1.472
Vichcore | 1.040 | Volts | ok | 0.808 | 0.824 | 0.840 | 1.160 | 1.176 | 1.192
+3.3VCC | 3.280 | Volts | ok | 2.816 | 2.880 | 2.944 | 3.584 | 3.648 | 3.712
VDIMM | 1.824 | Volts | ok | 1.448 | 1.480 | 1.512 | 1.960 | 1.992 | 2.024
+5 V | 5.056 | Volts | ok | 4.096 | 4.320 | 4.576 | 5.344 | 5.600 | 5.632
+12 V | 11.904 | Volts | ok | 10.368 | 10.496 | 10.752 | 12.928 | 13.056 | 13.312
+3.3VSB | 3.296 | Volts | ok | 2.816 | 2.880 | 2.944 | 3.584 | 3.648 | 3.712
VBAT | 2.912 | Volts | ok | 2.560 | 2.624 | 2.688 | 3.328 | 3.392 | 3.456
Chassis Intru | 0x0 | discrete | 0x0000| na | na | na | na | na | na
PS Status | 0x1 | discrete | 0x01ff| na | na | na | na | na | na
Note that the voltage sensors show readings that do make sense, and looking at the history, they show no sudden jumps, so those are probably still reliable (even though they are read from the same sensor chip according to lm-sensors).
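To get a quick overview of which sensors the BMC considers out of range, the table above can be filtered on its status column. This is only a sketch: the function name `suspect_sensors` is made up, and it assumes the `ipmitool sensor` column layout shown above (name, value, unit, status), where `ok` means within thresholds, `nr` means non-recoverable and `na` means no reading.

```shell
#!/bin/sh
# Sketch: list threshold sensors that ipmitool reports with a status other
# than "ok". Reads "ipmitool sensor" output on stdin.
suspect_sensors() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        {
            name = trim($1); value = trim($2); unit = trim($3); status = trim($4)
            # Discrete sensors report a bitmask instead of a status, skip them
            if (unit != "discrete" && status != "ok" && status != "na")
                printf "%s: %s %s (status %s)\n", name, value, unit, status
        }'
}
```

Running `sudo ipmitool sensor | suspect_sensors` on the output above would list System Temp and both fans, and none of the voltage sensors.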
It seems you can disable the generation of events when a threshold is crossed, and can even disable reading the sensor entirely. Hopefully this will also prevent the BMC from beeping on weird sensor values.
To disable things, I used `ipmi-sensors-config` (from the `freeipmi-tools`
Debian package):
First I queried the current sensor configuration:
sudo ipmi-sensors-config --checkout > ipmi-sensors-config.txt
Then I edited the generated file, setting `Enable_All_Event_Messages` and
`Enable_Scanning_On_This_Sensor` to `No`. I also had to set the hysteresis
values for the fans to `None`, since the -2375 value generated by `--checkout`
was refused when writing back the values in the next step.
Then I committed the changes with:
sudo ipmi-sensors-config --commit --filename ipmi-sensors-config.txt
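The manual edit in the second step could also be scripted. The sketch below is untested against a real BMC and rests on an assumption about the checkout file format: `Section <name>` ... `EndSection` blocks with one key and value per line. The function name `disable_sensors` and the section name pattern are hypothetical, so check both against your own checkout file.

```shell
#!/bin/sh
# Sketch: disable events and scanning for selected sensors in a file
# produced by "ipmi-sensors-config --checkout".
# ASSUMPTION: the file consists of "Section <name>" ... "EndSection" blocks
# with one "Key Value" pair per line; verify this against your own checkout.
disable_sensors() {
    # $1: extended regex matched against the section name (hypothetical names)
    awk -v pat="$1" '
        $1 == "Section"    { active = ($2 ~ pat) }
        $1 == "EndSection" { active = 0 }
        active && ($1 == "Enable_All_Event_Messages" ||
                   $1 == "Enable_Scanning_On_This_Sensor") {
            sub(/Yes[ \t]*$/, "No")
        }
        # Negative hysteresis values were refused on commit, so use None
        active && $1 ~ /_Hysteresis$/ { sub(/-[0-9.]+[ \t]*$/, "None") }
        { print }'
}
```

For example, `disable_sensors 'FAN|Temp' < ipmi-sensors-config.txt > ipmi-sensors-config.edited.txt` would write a modified copy, which is best reviewed with `diff` before passing it to `--commit --filename`.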
I suspect that setting `Enable_All_Event_Messages` to `No` still allows the
sensor to be read, but prevents the thresholds from being checked and events
from being generated (especially since this setting seems to just clear the
corresponding setting for each available threshold, so it seems you can also
use this to disable some of the thresholds and keep others). However, it is not
entirely clear to me whether this would just prevent these events from showing
up in the event log, or whether it would actually prevent the system from
beeping (when does the system beep? On any event? Specific events? This is not
clear to me).
For good measure, I decided to also modify `Enable_Scanning_On_This_Sensor`,
which I believe prevents the sensor from being read at all by the BMC, so that
should really prevent alarms. This also causes `ipmitool sensor` to display
value and status as `na` for these sensors. The `sensors` command (from the
`lm-sensors` package) can still read the sensors without issues (though the
values are not very useful anyway...).
Note that apparently these settings are not always persistent across reboots and power cycles, so make sure you test that. For this particular server, the settings survive a reboot; I have not tested a hard power cycle yet.
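One way to test persistence is to compare a fresh checkout after the reboot against the file that was committed. The following is a sketch under the same assumption about the file format (`Section <name>` blocks with one key and value per line). The helper names `enable_flags` and `settings_persisted` are made up, and only the two enable settings are compared.

```shell
#!/bin/sh
# Sketch: check whether the two enable settings in a fresh checkout still
# match the committed file.
# ASSUMPTION: "Section <name>" blocks with one "Key Value" pair per line.
enable_flags() {
    # Reduce a config file to "section key value" lines for the two settings
    awk '$1 == "Section" { section = $2 }
         $1 == "Enable_All_Event_Messages" ||
         $1 == "Enable_Scanning_On_This_Sensor" { print section, $1, $2 }'
}

settings_persisted() {
    # $1: the file that was committed; stdin: fresh --checkout output
    committed=$(enable_flags < "$1")
    current=$(enable_flags)
    [ "$committed" = "$current" ]
}
```

After a reboot or power cycle, this could be run as `sudo ipmi-sensors-config --checkout | settings_persisted ipmi-sensors-config.txt && echo persisted || echo changed`.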
I cannot yet tell for sure if this has fixed the problem (I only applied the changes today), but I'm pretty confident that this will indeed keep the people in our Fablab happy (and if not - I'll just solder off the beeper from the motherboard, but let's hope I will not have to resort to such extreme measures...).