Fixes for cciss_vol_status 1.12a #205

Merged: 5 commits, Mar 4, 2024


knorrie
Contributor

@knorrie knorrie commented Jul 28, 2021

Hi, I have some fixes for check_raid to be able to cope with the changed text output of cciss_vol_status version 1.12a.

First opening the WIP MR with the code changes, so that I have a number to use while trying to get additional test cases set up.

knorrie added 2 commits July 28, 2021 15:29
The latest version of cciss_vol_status is 1.12a. For the first time, an
extra letter was appended.

The version pattern would not match any more. The function will return
0, and later on, the code to detect if the version is >= 1.10 (for new
added functionality) would wrongly return false.

So, change it to also accept an optional letter at the end, but ignore
it. Adding the letter to the detected version string would result in
errors like "Argument "1.12a" isn't numeric" later on in the code, and
right now there is no need to be able to distinguish between 1.12 and
1.12a in the plugin.
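
A minimal sketch of the version handling this commit message describes, not the plugin's actual code. The helper name parse_version and the banner wording are assumptions for illustration; only the "1.12a" suffix and the fallback to 0 come from the text above.

```perl
use strict;
use warnings;

# Hypothetical helper: extract a numeric version from the tool's banner.
# The banner wording is an assumption; only the "1.12a" suffix is from the PR.
sub parse_version {
    my ($banner) = @_;
    # Capture major.minor; match but discard one optional trailing letter.
    return $1 if $banner =~ /version (\d+\.\d+)[a-z]?\b/;
    return 0;    # no match: the fallback the commit message mentions
}

for my $banner ('cciss_vol_status version 1.12', 'cciss_vol_status version 1.12a') {
    my $v = parse_version($banner);
    printf "%-32s -> %s (>= 1.10: %s)\n", $banner, $v, $v >= 1.10 ? 'yes' : 'no';
}
```

Keeping the letter outside the capture group means the returned value stays purely numeric, so later numeric comparisons do not trigger the "isn't numeric" warning.
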
In version 1.12a of cciss_vol_status,
  RAID 1
is changed to
  RAID1(1+0)
which makes the pattern fail.

Accept the optional (X+Y).

Without this change, no array would be detected, and the program will
fail with:

  UNKNOWN: cciss:[Plugin error]
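
A rough sketch of a pattern that accepts the optional (X+Y) suffix, assuming a simplified input string. The names raid_re and parse_raid are hypothetical, and the real plugin pattern matches a much longer line.

```perl
use strict;
use warnings;

# Hypothetical pattern: RAID level, optionally followed by "(X+Y)".
# \s* tolerates both the "RAID 1" and "RAID1" spellings seen in this thread.
my $raid_re = qr/RAID\s*(\d+)(?:\((\d+\+\d+)\))?/;

sub parse_raid {
    my ($text) = @_;
    return undef unless $text =~ $raid_re;
    # layout stays undef for output from versions before 1.12a
    return +{ level => $1, layout => $2 };
}

for my $text ('RAID 1', 'RAID 1(1+0)', 'RAID1(1+0)') {
    my $r = parse_raid($text);
    printf "%-12s level=%s layout=%s\n", $text, $r->{level}, $r->{layout} // '-';
}
```

Because the suffix group is optional, output from older versions keeps matching exactly as before.
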
@glensc glensc marked this pull request as draft July 28, 2021 14:33
@knorrie
Contributor Author

knorrie commented Jul 28, 2021

H'okay, so. There's a new cciss_vol_status version 1.12a which contains 'Formatting changes', yay. [0] Not only formatting changes in the program output, but even in the version number handling.

The formatting changes break the fragile text processing in check_raid. I ran into this while upgrading physical servers from Debian 10 (Buster) to 11 (Bullseye).

I spent some time figuring out what exactly changed [1], and how to adapt check_raid to it. The first three commits in this PR are the fixes, which I think can deal with both the new and the previous output. In my own environment, I tested the changes by making sure the check_raid output looks normal again with cciss_vol_status 1.12a, and I checked that after downgrading to cciss_vol_status 1.12, everything still looks fine.

Summary:

  • The output change for RAID1(1+0) alone was already causing an UNKNOWN: cciss:[Plugin error]. This was the starting point for debugging.
  • After fixing that, I noticed that the entire Drives(x): etc. section was still missing from the check output. So, it would say /dev/sda(Smart Array P420i): Volume 0 (RAID 1): OK without the extra Drives(8): 1I-1-1,1....
  • Additionally, I found out about the extra space character that was added to the physical disk location output lines, and then I found out that the code handling those lines was not even executed, because parsing the new version number with an 'a' attached failed.

The fourth commit is my best attempt to add a new test. I did not understand all the fields to be filled in (especially detect_cciss), but it works and passes...

Well, not entirely, because:

t/check_cciss.t ..... 1/85 Unparsed[                 Total cache memory: 816 MiB] at t//../lib/App/Monitoring/Plugin/CheckRaid/Plugins/cciss.pm line 383, <$fh> line 18.
Unparsed[                        Cache Ratio: 10% Read / 90% Write] at t//../lib/App/Monitoring/Plugin/CheckRaid/Plugins/cciss.pm line 383, <$fh> line 19.
Unparsed[                 Total cache memory: 816 MiB] at t//../lib/App/Monitoring/Plugin/CheckRaid/Plugins/cciss.pm line 383, <$fh> line 18.
Unparsed[                        Cache Ratio: 10% Read / 90% Write] at t//../lib/App/Monitoring/Plugin/CheckRaid/Plugins/cciss.pm line 383, <$fh> line 19.

This is a different and already known issue, I see. The check_raid plugin in Debian 11 now carries a patch in its packaging, which you can see at [2]. When applying that change to check_raid here and running the tests again, the test passes with my new cciss_vol_status output, but tests with older output do not pass, because the patch requires the new lines to be there:

t/check_cciss.t ..... 1/85
#   Failed test 'controller structure'
#   at t/check_cciss.t line 219.
#     Structures begin differing at:
#          $got->{/dev/sda}{cache}{cache_ratio} = '10% Read / 90% Write'
#     $expected->{/dev/sda}{cache}{cache_ratio} = Does not exist
# Looks like you failed 1 test of 85.
t/check_cciss.t ..... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/85 subtests

I agree that this might be an OK temporary band-aid fix for the Debian package, because the real issue is not resolved yet. I have to admit that I did server upgrade tests months ago, but at the time the yolo test servers were not hooked up to actual nagios alerting, so I totally missed the disk plugin being broken. :-(

I don't mind helping out a bit extra to get this cache memory/ratio issue properly fixed so that it works with older and newer output. I have some test servers here which can produce different kinds of cciss_vol_status output to help out with that.

The new test does pass (for my new output file) if the two-line Debian patch for the new total cache stuff is applied, or if you remove the following two lines from the cciss_vol_status output:

                 Total cache memory: 816 MiB
                        Cache Ratio: 10% Read / 90% Write

Just let me know what you think about this. Apparently, this fix depends on the other cache fix being done, unless we want to 'forge' test input (which would not be right).

I'll post this now and then re-read the CONTRIBUTING to see if there's anything else that I should add.

Have fun,
Knorrie

[0] https://sourceforge.net/projects/cciss/files/cciss_vol_status/
[1] https://salsa.debian.org/debian/cciss-vol-status/-/commit/f8df6cefd4dedc0f3031e6ff6d25001a8a4cf0b5#6ea9d5500e187d921a7ab20e567a8530a5eb36c6
[2] https://salsa.debian.org/nagios-team/pkg-nagios-plugins-contrib/-/blob/master/debian/patches/check_raid/fix_unparsed_error_cciss

@knorrie
Contributor Author

knorrie commented Jul 28, 2021

Ah, apparently the data collected in the process (especially the test files in the 4th commit) is already covering the output of the -d things. That's good.

# ./check_raid -d -p cciss
check_raid 4.0.9
Visit <https://github.com/glensc/nagios-plugin-check_raid#reporting-bugs> how to report bugs
Please include output of **ALL** commands in bugreport

DEBUG EXEC: /bin/lsscsi -g at ./check_raid line 504.
DEBUG EXEC: >&2 /bin/cciss_vol_status -v at ./check_raid line 500.
DEBUG EXEC: /bin/cciss_vol_status -V /dev/sg0 at ./check_raid line 504.
OK: cciss:[/dev/sda(Smart Array P420i): Volume 0 (RAID 1(1+0)): OK, Drives(8): 1I-1-1,1I-1-2,1I-1-3,1I-1-4,2I-1-5,2I-1-6,2I-1-7,2I-1-8=OK, Cache: WriteCache FlashCache ReadMem:81 MiB WriteMem:735 MiB]

@glensc
Owner

glensc commented Jul 28, 2021

  1. remove WIP from commit message (do not want to merge commit saying it's work in progress)
  2. the changes should not break older version support, so some more effort is needed to detect this.

in our infra, everything is virtualized, so check_raid is useless, so I'll try to support you with feedback.

@glensc
Owner

glensc commented Jul 28, 2021

The Debian patch seems to originate from #200

@knorrie knorrie changed the title WIP: Fixes for cciss_vol_status 1.12a Fixes for cciss_vol_status 1.12a Jul 28, 2021
@knorrie
Contributor Author

knorrie commented Jul 28, 2021

  1. done
  2. my changes seem to be ok, but there's "Error runing plug-in on debian 10" #196 (which may or may not be related to my issue) in our way.

Aha, yes, #200 submits a change to "make it work", but it will not pass our tests with a collection of previous output of cciss_vol_status versions.

I can have a look at #196, because I have some test hardware here that I can use for that issue, and improve it so that it can go through and we can have both.

Just let me know what you think and want to do.

I think it seems rather easy to first get that #196 change done properly and then this one on top. I'll have a look at it tomorrow.

Thanks,
Hans

@knorrie
Contributor Author

knorrie commented Jul 28, 2021

Ok, right. I see that #200 is just fixing things forward, and because there's an extra level of indirection involved (the my %map = ( hash), I don't really have a clue how to quickly resolve this dependency. I can do regexes pretty well, but I'm not a perl programmer.

So, unless someone shows up to rewrite that my %map = ( a bit so that it can handle optional entries to deal with older cciss_vol_status output, this issue is blocked.

@glensc
Owner

glensc commented Jul 28, 2021

@knorrie so perhaps add another entry to %map, and later in code use one or another key:

			my %map = (
				total_cache_memory => qr/Total cache memory: (.+)/,
				total_cache_memory_v2 => qr/Total cache memory: (.+)/,


...

if ($cache->{total_cache_memory} || $cache->{total_cache_memory_v2}) ...

or maybe even:

# fill total_cache_memory if it's missing and total_cache_memory_v2 is present
$cache->{total_cache_memory} = $cache->{total_cache_memory_v2} if not $cache->{total_cache_memory} and $cache->{total_cache_memory_v2};

I don't remember the code that well, but you should get the gist from this.

also, if only the text of the label to match has changed, use some regex like this:

			my %map = (
				total_cache_memory => qr/(?:Total cache memory|Memory Cache total): (.+)/,

it's just important not to create a new capture group, so (?:) is used

hope these help
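
To illustrate the capture-group point: a small sketch, assuming the value is read from $1. The helper name cache_total is hypothetical, and "Memory Cache total" is only the made-up alternative label from the snippet above, not real cciss_vol_status output.

```perl
use strict;
use warnings;

# (?:...) groups the two labels without capturing, so the value stays in $1.
# "Memory Cache total" is the made-up alternative label from the thread.
my $re = qr/(?:Total cache memory|Memory Cache total): (.+)/;

# Hypothetical helper: return the captured value, or undef if no label matches.
sub cache_total {
    my ($line) = @_;
    return $line =~ $re ? $1 : undef;
}

print cache_total('   Total cache memory: 816 MiB'), "\n";
print cache_total('   Memory Cache total: 816 MiB'), "\n";
```

With a plain capturing group around the alternation, $1 would hold the label and the value would move to $2, which would break any code that reads $1.
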

@knorrie
Contributor Author

knorrie commented Jul 28, 2021

Ok, thanks for the hints, I will have a look at this.

@knorrie knorrie force-pushed the cciss_vol_status_1_12a branch from fa2a242 to f1e5eeb Compare August 5, 2021 10:23
@knorrie
Contributor Author

knorrie commented Aug 5, 2021

Hi, looking at this again.

The new cache information lines can just be added to that map. The additional missing link to make tests pass was to also add the info to cstatus, so it ends up in the files in t/dump.
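
A simplified sketch of why map entries can be treated as optional, under the assumption that each entry is tried against every line and only matches are stored. The loop, the helper name parse_cache, and the placeholder input lines are illustrative and not the plugin's real code; the key names follow the ones visible in this thread.

```perl
use strict;
use warnings;

# Hypothetical excerpt: the two labels come from the 1.12 output quoted in
# this PR; the parsing loop below is an illustration only.
my %map = (
    total_cache_memory => qr/Total cache memory: (.+)/,
    cache_ratio        => qr/Cache Ratio: (.+)/,
);

sub parse_cache {
    my (@lines) = @_;
    my %cache;
    for my $line (@lines) {
        for my $key (sort keys %map) {
            # A key is only set when its line is actually present.
            $cache{$key} = $1 if $line =~ $map{$key};
        }
    }
    return \%cache;
}

my $new = parse_cache('   Total cache memory: 816 MiB',
                      '   Cache Ratio: 10% Read / 90% Write');
my $old = parse_cache('   a line from older output');

print join(',', sort keys %$new), "\n";
print scalar(keys %$old), " keys from older output\n";
```

Older output simply yields no such keys, so expected structures recorded for previous versions stay valid while newer output gains the extra fields.
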

Now all tests pass again, with my new test case added.

~/s/nagios-plugin-check_raid m (cciss_vol_status_1_12a) 0-$ make test
perl -MTest::Harness -e 'runtests @ARGV' t/*.t
t/check_aaccli.t .... ok   
t/check_afacli.t .... ok   
t/check_arcconf.t ... ok       
t/check_areca.t ..... ok     
t/check_cciss.t ..... ok     
t/check_cmdtool2.t .. ok   
t/check_dm.t ........ ok     
t/check_dmraid.t .... ok     
t/check_dpt_i2o.t ... ok   
t/check_gdth.t ...... ok     
t/check_hp_msa.t .... ok   
t/check_hpacucli.t .. ok     
t/check_ips.t ....... ok   
t/check_lsraid.t .... ok   
t/check_lsscsi.t .... ok   
t/check_lsvg.t ...... ok   
t/check_mdstat.t .... ok       
t/check_megacli.t ... ok       
t/check_megarc.t .... ok   
t/check_metastat.t .. ok     
t/check_mpt.t ....... ok     
t/check_mvcli.t ..... ok     
t/check_sas2ircu.t .. ok     
t/check_smartctl.t .. ok   
t/check_tw_cli.t .... ok     
t/enabled.t ......... ok     
t/status.t .......... ok     
t/sudo.t ............ ok     
All tests successful.
Files=28, Tests=1005,  4 wallclock secs ( 0.16 usr  0.07 sys +  2.70 cusr  0.43 csys =  3.36 CPU)
Result: PASS

Let me know what you think of it all. And yes, this also makes #200 obsolete.

Have fun,
Hans

In version 1.12a, cciss_vol_status changes the output of physical location of drives:

    -       sprintf(tail, " connector %c%c box %d bay %d %40s %40s %8s",
    +       sprintf(tail, " connector %c%c box %d bay %-2d %40s %40s %8s",

This means that the bay number is now 2 positions wide, left aligned.
So, e.g. '1 ', '6 ', or '12'.

Allow for an extra space to be present when the bay number is less than
10.

Without this change, the pattern will not match any more, and the
program complains about Unparsed lines.
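
The effect of the format change can be reproduced with Perl's own sprintf, which accepts the same conversions as the C code. This is a sketch only: parse_location is a hypothetical helper, and the model, serial and status values are placeholders.

```perl
use strict;
use warnings;

# The two C format strings from the diff above. In Perl, %c takes a
# character code, hence the ord() calls below.
my $old_fmt = " connector %c%c box %d bay %d %40s %40s %8s";
my $new_fmt = " connector %c%c box %d bay %-2d %40s %40s %8s";

# Hypothetical parser: \s+ after the bay number absorbs the extra padding
# that %-2d adds for single-digit bays.
sub parse_location {
    my ($line) = @_;
    return undef unless $line =~ /connector (\w\w) box (\d+) bay (\d+)\s+/;
    return +{ connector => $1, box => $2, bay => $3 };
}

for my $fmt ($old_fmt, $new_fmt) {
    # Model, serial and status values are placeholders.
    my $line = sprintf($fmt, ord('1'), ord('I'), 1, 6, 'MODEL', 'SERIAL', 'OK');
    my $loc  = parse_location($line);
    print "$loc->{connector}-$loc->{box}-$loc->{bay}\n";
}
```

Both layouts parse to the same connector, box and bay, which is the behaviour the commit aims for with lines from 1.12 and 1.12a alike.
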
@knorrie knorrie force-pushed the cciss_vol_status_1_12a branch from f1e5eeb to 2f57038 Compare August 5, 2021 11:52
knorrie added 2 commits August 5, 2021 13:55
Since v1.12, cciss_vol_status outputs these two new extra lines about
cache status:

        Total cache memory: 816 MiB
               Cache Ratio: 10% Read / 90% Write

Make sure we parse them and include them in cstatus, so that they also
end up in the test dump files which makes the tests pass again.
This adds a test case which contains all changes to cciss_vol_status
1.12 and 1.12a output.
@knorrie knorrie force-pushed the cciss_vol_status_1_12a branch from 2f57038 to 13acd51 Compare August 5, 2021 11:55
knorrie added a commit to mendix/nagios-plugins-mendix that referenced this pull request Aug 5, 2021
The plugin would only scream "UNKNOWN: cciss:[Plugin error]".

See glensc/nagios-plugin-check_raid#205 about
all of this.
@knorrie
Contributor Author

knorrie commented Aug 5, 2021

There's still something not right, since Nagios is telling me "NRPE: Unable to read output"... Trying to debug now. Oh, right, that was an issue in my local sudoers config, ignore.

@glensc
Owner

glensc commented Aug 5, 2021

@knorrie remove Draft status once this is ready for merge.

also, are you interested in having developer access in this repo?

@Napsty

Napsty commented Sep 2, 2021

Thanks @knorrie for working on this.
I just manually tested commit mendix/nagios-plugins-mendix@d8def49 and it seems to work.

Debian Bullseye, cciss_vol_status version 1.12a, check_raid 4.0.10

Without your changes:

# ./check_raid.pl -d
check_raid 4.0.10
Visit <https://github.com/glensc/nagios-plugin-check_raid#reporting-bugs> how to report bugs
Please include output of **ALL** commands in bugreport

DEBUG EXEC: /sbin/dmsetup status --noflush at ./check_raid.pl line 503.
DEBUG EXEC: /bin/lsscsi -g at ./check_raid.pl line 503.
DEBUG EXEC: >&2 /bin/cciss_vol_status -v at ./check_raid.pl line 499.
DEBUG EXEC: /bin/cciss_vol_status /dev/sg0 at ./check_raid.pl line 503.
UNKNOWN: cciss:[Plugin error]

With your changes from the mentioned commit:

# ./check_raid.pl -d
check_raid 4.0.10
Visit <https://github.com/glensc/nagios-plugin-check_raid#reporting-bugs> how to report bugs
Please include output of **ALL** commands in bugreport

DEBUG EXEC: /sbin/dmsetup status --noflush at ./check_raid.pl line 503.
DEBUG EXEC: /bin/lsscsi -g at ./check_raid.pl line 503.
DEBUG EXEC: >&2 /bin/cciss_vol_status -v at ./check_raid.pl line 499.
DEBUG EXEC: /bin/cciss_vol_status -V /dev/sg0 at ./check_raid.pl line 503.
Unparsed[                 Total cache memory: 912 MiB] at ./check_raid.pl line 1873, <$fh> line 16.
Unparsed[                        Cache Ratio: 25% Read / 75% Write] at ./check_raid.pl line 1873, <$fh> line 17.
OK: cciss:[/dev/sda(Smart Array P410i): Volume 0 (RAID 1(1+0)): OK, Drives(6): 1I-1-1,1I-1-4,2I-1-5,2I-1-6,2I-1-7,2I-1-8=OK, Cache: WriteCache FlashCache ReadMem:228 MiB WriteMem:684 MiB]

@glensc
Owner

glensc commented Feb 11, 2022

@knorrie are you able to finish up the PR? please remove the Draft status when it's ready to merge.


@xals xals left a comment

Hi,
I just applied the changeset on a Debian Bullseye server with cciss_vol_status v1.12a, it fixed the UNKNOWN: cciss:[Plugin error].

Thanks!

@Napsty

Napsty commented Dec 8, 2022

@knorrie reminder, pleeeease :)
@glensc can you do a manual merge if the draft status is not removed? it's definitely a much needed code change for Bullseye.

@glensc
Owner

glensc commented Dec 8, 2022

@Napsty you can carry the commits to your own branch and submit new PR with same commits, but do verify the changes are ok first.

@glensc
Owner

glensc commented Mar 4, 2024

merging the original instead, as users from another PR report it working:

@glensc glensc marked this pull request as ready for review March 4, 2024 22:33
@glensc glensc merged commit 818a694 into glensc:master Mar 4, 2024