Failed hot spare drive is not detected / HPE ssacli #229

Open
crocodileneptune opened this issue Nov 10, 2024 · 2 comments

@crocodileneptune

Hello Glen,

First of all, thanks so much for your work!

I noticed that your check_raid.pl plugin does not trigger a warning or critical state when a spare drive has failed. My server used to run on two hard disks in a RAID 1 configuration, with a third hard disk configured as a hot spare. When I looked at the iLO logs last night, I saw that the hard disk in bay 1 had failed and the hot spare in bay 3 had been activated some time ago.

I would have expected check_raid.pl to raise some sort of warning when any hard disk fails, which is why I am filing this bug report. I don't mind the exact state (warning or critical), but a failed device needs to trigger an action; that is why I am using the plugin. I read CONTRIBUTING.md and hope that all relevant details are included in this bug report.

Output of check_raid -d:

# /usr/lib/nagios/plugins/check_raid.pl -d
check_raid 4.0.10
Visit <https://github.com/glensc/nagios-plugin-check_raid#reporting-bugs> how to report bugs
Please include output of **ALL** commands in bugreport

DEBUG EXEC: /sbin/dmsetup status --noflush at /usr/lib/nagios/plugins/check_raid.pl line 503.
DEBUG EXEC: /proc/mdstat at /usr/lib/nagios/plugins/check_raid.pl line 503.
DEBUG EXEC: /sbin/ssacli controller all show status at /usr/lib/nagios/plugins/check_raid.pl line 503.
DEBUG EXEC: /sbin/ssacli controller slot=0 logicaldrive all show at /usr/lib/nagios/plugins/check_raid.pl line 503.
OK: ssacli:[Smart Array P440ar[OK]: Array A(OK)[LUN1:OK]]

Output of each command from check_raid -d:

/sbin/ssacli controller all show status

Smart Array P440ar in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

/sbin/ssacli controller slot=0 logicaldrive all show

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      logicaldrive 1 (558.88 GB, RAID 1, OK)

However, check_raid does not detect the failed hot spare drive, even though ssacli reports it:

/sbin/ssacli ctrl slot=0 pd all show status

   physicaldrive 1I:3:2 (port 1I:box 3:bay 2, 600 GB): OK
   physicaldrive 1I:3:3 (port 1I:box 3:bay 3, 600 GB): OK
   physicaldrive 1I:3:1 (port 1I:box 3:bay 1, 0 GB, spare): Failed

Additional environment details:

  • Debian 12 Bookworm
  • HPE DL360 Gen9 with a P440ar raid controller + BBU

Thanks and best wishes!

@crocodileneptune
Author

With the help of ChatGPT, I wrote a separate Bash script that checks for failed drives. If anyone faces the same problem, here is the script I use:

#!/bin/bash

# Nagios return codes
OK=0
WARNING=1
CRITICAL=2
UNKNOWN=3

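# Command to check drive status, stored as an array so the arguments are passed intact;
# -n makes sudo fail immediately instead of prompting for a password
CMD=(sudo -n /usr/sbin/ssacli ctrl slot=0 pd all show status)

# Execute command and capture output
output=$("${CMD[@]}" 2>&1)
exit_code=$?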

# Check if the command execution was successful or failed due to sudo permission issues
if [[ $exit_code -ne 0 ]]; then
  if echo "$output" | grep -q "sudo: a password is required"; then
    echo "CRITICAL - Sudo permission error: password is required to execute the command."
  elif echo "$output" | grep -q "sudo: "; then
    echo "CRITICAL - Sudo permission error: $output"
  else
    echo "CRITICAL - Command execution failed: $output"
  fi
  exit $CRITICAL
fi

# Initialize status flag
all_ok=true
status_message=""

# Process each line in the output
while IFS= read -r line; do
  if [[ $line =~ ^[[:space:]]*physicaldrive ]]; then
    # Extract the status at the end of the line (everything after the last ": ")
    drive_status=${line##*: }

    # Check if the drive status is not "OK"
    if [[ "$drive_status" != "OK" ]]; then
      all_ok=false
      # Add newline to status_message only if it is not empty
      [[ -n $status_message ]] && status_message+=$'\n'
      status_message+="WARNING - Non-OK status found: $line"
    fi
  fi
done <<< "$output"

# Determine Nagios plugin output based on status
if $all_ok; then
  echo "OK - All physical drives report status OK"
  exit $OK
else
  # printf prints the message verbatim; echo -e would interpret backslashes in it
  printf '%s\n' "$status_message"
  exit $WARNING
fi
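
The parsing part of the script above can also be kept separate from the sudo call, which makes it possible to try it against saved ssacli output. The following is only a sketch under a few assumptions: the function name check_pd_status is made up for illustration, treating "Failed" as CRITICAL and any other non-OK status as WARNING is an arbitrary choice (the script above returns WARNING for everything), and returning UNKNOWN when no physicaldrive line is seen is an extra safeguard that the script above does not have.

```shell
#!/bin/bash
# Hypothetical helper (not part of check_raid): reads the output of
# "ssacli ctrl slot=0 pd all show status" on stdin, prints a Nagios-style
# summary and returns 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).
check_pd_status() {
  local line status seen=0 worst=0
  local -a problems=()

  while IFS= read -r line; do
    # Only lines describing a physical drive are of interest
    [[ $line =~ ^[[:space:]]*physicaldrive ]] || continue
    seen=$((seen + 1))

    # Strip leading whitespace, then take the text after the last ": "
    line=${line#"${line%%[![:space:]]*}"}
    status=${line##*: }

    case $status in
      OK) ;;
      # Assumption: a failed drive is CRITICAL, anything else non-OK is WARNING
      Failed) worst=2; problems+=("$line") ;;
      *) if (( worst < 1 )); then worst=1; fi; problems+=("$line") ;;
    esac
  done

  if (( seen == 0 )); then
    echo "UNKNOWN - no physicaldrive lines found in input"
    return 3
  fi

  if (( worst == 0 )); then
    echo "OK - all $seen physical drives report OK"
    return 0
  fi

  local label=WARNING
  if (( worst == 2 )); then label=CRITICAL; fi
  echo "$label - ${#problems[@]} of $seen physical drives not OK"
  printf '%s\n' "${problems[@]}"
  return "$worst"
}
```

On the server this could be used as "sudo -n /usr/sbin/ssacli ctrl slot=0 pd all show status | check_pd_status", with the exit status of the pipeline passed on to Nagios or SNMP.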

I call the script from the SNMP process, which runs as the Debian-snmp user, so sudo needs to be configured to allow the ssacli commands without a password:

# cat /etc/sudoers.d/check_raid
User_Alias CHECK_RAID=Debian-snmp
Defaults:CHECK_RAID !requiretty
CHECK_RAID ALL=(root) NOPASSWD: /usr/sbin/ssacli controller all show status
CHECK_RAID ALL=(root) NOPASSWD: /usr/sbin/ssacli controller slot=0 logicaldrive all show
CHECK_RAID ALL=(root) NOPASSWD: /usr/sbin/ssacli controller * logicaldrive all show
CHECK_RAID ALL=(root) NOPASSWD: /sbin/dmsetup status --noflush
CHECK_RAID ALL=(root) NOPASSWD: /sbin/dmsetup status
CHECK_RAID ALL=(root) NOPASSWD: /usr/sbin/ssacli ctrl slot=0 pd all show status
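
For anyone copying the sudoers rules above, it may be worth checking them before relying on the monitoring output. The sketch below is an assumption-laden helper, not something from check_raid: the function name verify_check_raid_sudo is made up, it has to be run as root, and it assumes the file path and the Debian-snmp user shown above. It uses "visudo -cf" to syntax-check the file and "sudo -n -l" to ask sudo whether the exact command is permitted, without actually running ssacli.

```shell
#!/bin/bash
# Hypothetical helper; run as root. Defaults match the setup described above.
verify_check_raid_sudo() {
  local file=${1:-/etc/sudoers.d/check_raid}
  local user=${2:-Debian-snmp}

  # Syntax-check the sudoers drop-in file without changing anything
  visudo -cf "$file" || return 1

  # As the monitoring user, query the sudo policy for the exact command.
  # -n prevents a password prompt; -l only lists whether the command is
  # allowed (non-zero exit status if it is not) and does not execute it.
  sudo -u "$user" sudo -n -l /usr/sbin/ssacli ctrl slot=0 pd all show status
}
```

Calling verify_check_raid_sudo with no arguments checks the defaults; a different file or user can be passed as the first and second argument.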

@glensc
Owner

glensc commented Dec 3, 2024

@crocodileneptune Pull requests are accepted. You should send your code as a pull request; see CONTRIBUTING.md for how to start.

Nowadays everything is virtualized, and I haven't needed check_raid on bare metal for a long time.
