Failed hot spare drive is not detected / HPE ssacli #229

crocodileneptune · 2024-11-10T09:58:14Z

Hello Glen,

first of all thanks so much for your work!

I noticed that your check_raid.pl plugin doesn't seem to trigger a warning or critical state when a spare drive has failed. In my case, the server used to run on two hard disks in a RAID 1 configuration, with a third hard disk configured as a hot spare. When I looked at the iLO logs last night, I saw that the hard disk in bay 1 had failed and the hot spare in bay 3 had been activated some time ago.

I would have expected the check_raid.pl plugin to raise some sort of alert whenever a hard disk fails, which is why I created this bug report. I don't mind the exact state (warning or critical), but a failed device needs to trigger an action; that is why I am using the plugin. I read CONTRIBUTING.md and I hope that all relevant details are included in this bug report.

Output of check_raid -d:

# /usr/lib/nagios/plugins/check_raid.pl -d
check_raid 4.0.10
Visit <https://github.com/glensc/nagios-plugin-check_raid#reporting-bugs> how to report bugs
Please include output of **ALL** commands in bugreport

DEBUG EXEC: /sbin/dmsetup status --noflush at /usr/lib/nagios/plugins/check_raid.pl line 503.
DEBUG EXEC: /proc/mdstat at /usr/lib/nagios/plugins/check_raid.pl line 503.
DEBUG EXEC: /sbin/ssacli controller all show status at /usr/lib/nagios/plugins/check_raid.pl line 503.
DEBUG EXEC: /sbin/ssacli controller slot=0 logicaldrive all show at /usr/lib/nagios/plugins/check_raid.pl line 503.
OK: ssacli:[Smart Array P440ar[OK]: Array A(OK)[LUN1:OK]]

Output of each command from check_raid -d

/sbin/ssacli controller all show status

Smart Array P440ar in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

/sbin/ssacli controller slot=0 logicaldrive all show

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      logicaldrive 1 (558.88 GB, RAID 1, OK)

However, the failed hot spare drive is not detected, even though ssacli notices it:

/sbin/ssacli ctrl slot=0 pd all show status

   physicaldrive 1I:3:2 (port 1I:box 3:bay 2, 600 GB): OK
   physicaldrive 1I:3:3 (port 1I:box 3:bay 3, 600 GB): OK
   physicaldrive 1I:3:1 (port 1I:box 3:bay 1, 0 GB, spare): Failed
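
The status lines above follow a regular `physicaldrive <id> (<details>): <status>` shape, so non-OK drives can be picked out with a short filter. The following is only a minimal sketch and is not part of check_raid.pl: the function name `non_ok_drives` and the `sample` variable are made up for illustration, the sample text is copied from the output above, and other controllers or ssacli versions may format these lines differently.

```shell
# Hypothetical helper: read `ssacli ... pd all show status` output on stdin
# and print "<drive id> <status>" for every drive whose status is not OK.
non_ok_drives() {
  awk -F': ' '
    /^[ \t]*physicaldrive / {
      status = $NF                       # text after the last ": "
      sub(/[ \t\r]+$/, "", status)       # drop trailing whitespace
      if (status != "OK") {
        split($1, parts, " ")            # parts[2] is the drive id
        print parts[2], status
      }
    }'
}

# Sample input copied from the ssacli output in this report.
sample='   physicaldrive 1I:3:2 (port 1I:box 3:bay 2, 600 GB): OK
   physicaldrive 1I:3:3 (port 1I:box 3:bay 3, 600 GB): OK
   physicaldrive 1I:3:1 (port 1I:box 3:bay 1, 0 GB, spare): Failed'

printf '%s\n' "$sample" | non_ok_drives
```

With the sample above, only the failed spare `1I:3:1` is printed; an empty result would mean every listed drive reported OK.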

Additional environment details:

Debian 12 Bookworm
HPE DL360 Gen9 with a P440ar raid controller + BBU

Thanks and best wishes!

crocodileneptune · 2024-12-03T19:35:30Z

With the help of ChatGPT, I wrote a separate Bash script that checks for failed drives. In case anyone else faces the same problem, here is the script that I use:

#!/bin/bash

# Nagios return codes
OK=0
WARNING=1
CRITICAL=2
UNKNOWN=3

# Command to check drive status
CMD="sudo /usr/sbin/ssacli ctrl slot=0 pd all show status"

# Execute command and capture output
output=$($CMD 2>&1)
exit_code=$?

# Check if the command execution was successful or failed due to sudo permission issues
if [[ $exit_code -ne 0 ]]; then
  if echo "$output" | grep -q "sudo: a password is required"; then
    echo "CRITICAL - Sudo permission error: password is required to execute the command."
  elif echo "$output" | grep -q "sudo: "; then
    echo "CRITICAL - Sudo permission error: $output"
  else
    echo "CRITICAL - Command execution failed: $output"
  fi
  exit $CRITICAL
fi

# Initialize status flag
all_ok=true
status_message=""

# Process each line in the output
while IFS= read -r line; do
  if [[ $line =~ ^[[:space:]]*physicaldrive ]]; then
    # Extract the status at the end of the line
    drive_status=$(echo "$line" | awk -F': ' '{print $NF}')
    
    # Check if the drive status is not "OK"
    if [[ "$drive_status" != "OK" ]]; then
      all_ok=false
      # Add newline to status_message only if it is not empty
      [[ -n $status_message ]] && status_message+=$'\n'
      status_message+="Warning - Non-OK status found: $line"
    fi
  fi
done <<< "$output"

# Determine Nagios plugin output based on status
if $all_ok; then
  echo "OK - All physical drives report OK"
  exit $OK
else
  printf '%s\n' "$status_message"
  exit $WARNING
fi
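
The script above hard-codes `slot=0`. On a machine with more than one controller, the slot numbers could probably be read from the `controller all show status` output shown earlier in this thread. The following is a rough sketch under that assumption: `controller_slots` is a hypothetical helper name, the sample text is copied from the report, and it assumes every controller header contains `in Slot <number>`, which I have only seen for the P440ar output above.

```shell
# Hypothetical helper: read `ssacli controller all show status` output on
# stdin and print one controller slot number per line.
controller_slots() {
  sed -n 's/^.* in Slot \([0-9][0-9]*\).*$/\1/p'
}

# Sample input copied from the ssacli output in this report.
sample='Smart Array P440ar in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK'

for slot in $(printf '%s\n' "$sample" | controller_slots); do
  # Only print the command that would be run for each slot
  # (assumes the same ssacli path as in the script above).
  echo "/usr/sbin/ssacli ctrl slot=$slot pd all show status"
done
```

Each printed command would also need a matching sudoers entry before the SNMP user could run it.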

I call the script from the SNMP daemon, so sudo has to allow its user (Debian-snmp) to run the commands without a password:

cat /etc/sudoers.d/check_raid 
User_Alias CHECK_RAID=Debian-snmp
Defaults:CHECK_RAID !requiretty
CHECK_RAID ALL=(root) NOPASSWD: /usr/sbin/ssacli controller all show status
CHECK_RAID ALL=(root) NOPASSWD: /usr/sbin/ssacli controller slot=0 logicaldrive all show
CHECK_RAID ALL=(root) NOPASSWD: /usr/sbin/ssacli controller * logicaldrive all show
CHECK_RAID ALL=(root) NOPASSWD: /sbin/dmsetup status --noflush
CHECK_RAID ALL=(root) NOPASSWD: /sbin/dmsetup status
CHECK_RAID ALL=(root) NOPASSWD: /usr/sbin/ssacli ctrl slot=0 pd all show status
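
To confirm that a rule like the ones above is actually in effect, `sudo -n -l <command>` can be used: `-n` makes sudo fail instead of prompting for a password, and `-l` followed by a command exits non-zero when the policy does not permit that command. The following is a small sketch, with `sudo_allows` as a hypothetical helper name; it has to be run as the user the rules apply to (here presumably Debian-snmp) to be meaningful.

```shell
# Hypothetical helper: succeed only if the current user may run the given
# command through sudo without being prompted for a password.
sudo_allows() {
  sudo -n -l "$@" >/dev/null 2>&1
}

if sudo_allows /usr/sbin/ssacli ctrl slot=0 pd all show status; then
  echo "sudo rule present"
else
  echo "sudo rule missing or password required"
fi
```

Running this check once after editing the sudoers file is cheaper than waiting for the monitoring check to report a sudo permission error.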

glensc · 2024-12-03T20:12:36Z

@crocodileneptune Pull requests are accepted. You should send your code as a pull request; see CONTRIBUTING.md for how to get started.

Nowadays everything is virtualized, and I haven't needed check_raid on bare metal for a very long time.
