-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes for cciss_vol_status 1.12a #205
Conversation
The latest version of cciss_vol_status is 1.12a. For the first time, an extra letter was appended. The version pattern would not match any more. The function will return 0, and later on, the code to detect if the version is >= 1.10 (for new added functionality) would wrongly return false. So, change it to also accept an optional letter at the end, but ignore it. Adding it to the detected version string will result in errors like "Argument "1.12a" isn't numeric" later on in the code, and because right now, there is no need to be able to distinguish between 1.12 and 1.12a in the plugin.
In version 1.12a of cciss_vol_status, RAID 1 is changed to RAID1(1+0) which makes the pattern fail. Accept the optional (X+Y). Without this change, no array would be detected, and the program will fail with: UNKNOWN: cciss:[Plugin error]
H'okay, so. There's a new cciss_vol_status version 1.12a which contains 'Formatting changes', yay. [0] Not only formatting changes in the program output, but even in the version number handling. The formatting changes break the fragile text processing in check_raid. I ran into this while upgrading physical servers from Debian 10 (Buster) to 11 (Bullseye). I spent some time to figure out what exactly changed [1], and how to adapt check_raid to this. The first three commits in this PR are the fixes, which I think can deal with the new and previous output. In my own environment, I tested the changes by making sure the check_raid output would be like normal again with cciss_vol_status 1.12a, and I checked that when downgrading to cciss_vol_status 1.12, everything would also look still fine. Summary:
The fourth commit is my best attempt to add a new test. I did not understand all the fields to be filled in (especially detect_cciss), but it works and passes... Well, not entirely, because:
This is a different and already known issue I see. The check_raid plugin in Debian 11 now has a patch in the packaging which you can see at [2]. When applying this change to check_raid here and running the tests again, I can see that it passes with my new cciss_vol_status output, but tests with older output do not pass, because it requires the new lines to be there:
I agree that this might be an OK temporary band-aid fix for the Debian package because the real issue is not resolved yet. I have to admit that I have been doing server upgrade tests months ago, but while doing that, the yolo test servers were not hooked up to actual nagios alerting, so I totally missed the disk plugin being broken. :-( I won't mind helping out a bit extra to get this cache memory/ratio issue properly fixed so that it works with older and newer output. I have some test servers here which can produce different kind of cciss_vol_status output to help out with that. The new test does pass (for my new output file) if the two line debian patch fix for the new total cache stuff is applied, or if you remove the following two lines from the cciss_vol_status output:
Just let me know what you think about this. Apparently, this fix depends on the other cache fix being done, unless we want to 'forge' test input (which would not be right). I'll post this now and then re-read the CONTRIBUTING to see if there's anything else that I should add. Have fun, [0] https://sourceforge.net/projects/cciss/files/cciss_vol_status/ |
Ah, apparently the data collected in the process (especially the test files in the 4th commit) is already covering the output of the
|
in our infra, everything is virtualized, so check_raid is useless, so I'll try to support you with feedback. |
The Debian patch seem to originate from #200 |
Aha, yes, #200 submits a change to "make it work", but it will not pass our tests with a collection of previous output of cciss_vol_status versions. I can have a look at #196 because I have some test hardware here that I can use for that issue and improve it so that it can go throught and we can have both. Just let me know what you think and want to do. I think it seems rather easy to first get that #196 change done properly and then this one on top. I'll have a look at it tomorrow. Thanks, |
Ok, right. I see that #200 is just fixing things forward, and because there's an extra level of indirection involved ( So, unless someone shows up to rewrite that |
@knorrie so perhaps add another entry to my %map = (
total_cache_memory => qr/Total cache memory: (.+)/,
total_cache_memory_v2 => qr/Total cache memory: (.+)/,
...
if ($cache->{total_cache_memory} || $cache->{total_cache_memory_v2}) ... or maybe even: # fill total_cache_memory if it's missing and total_cache_memory_v2 is present
$cache->{total_cache_memory} = $cache->{total_cache_memory_v2} if not $cache->{total_cache_memory} and $cache->{total_cache_memory_v2}; I don't remember that much he code, but you should get the gist from this. also, if what you need has just changed the match key, use some regex like this: my %map = (
total_cache_memory => qr/(?:Total cache memory|Memory Cache total): (.+)/, it's just important not to create a new capture group, so hope these help |
Ok, thanks for the hints, I will have a look at this. |
fa2a242
to
f1e5eeb
Compare
Hi, looking at this again. The new cache information lines can just be added to that map. The additional missing link to make tests pass was to also add the info to cstatus, so it ends up in the files in t/dump. Now all tests pass again, with my new test case added.
Let me know what you think of it all. And yes, this also makes #200 obsolete. Have fun, |
In version 1.12a, cciss_vol_status changes the output of physical location of drives: - sprintf(tail, " connector %c%c box %d bay %d %40s %40s %8s", + sprintf(tail, " connector %c%c box %d bay %-2d %40s %40s %8s", This means that the bay number is now 2 posititions, left aligned. So, e.g. '1 ', '6 ', or '12'. Allow for an extra space to be present when the bay number is less than 10. Without this change, the pattern will not match any more, and the program complains about Unparsed lines.
f1e5eeb
to
2f57038
Compare
Since v1.12, cciss_vol_status outputs these two new extra lines about cache status: Total cache memory: 816 MiB Cache Ratio: 10% Read / 90% Write Make sure we parse them and include them in cstatus, so that they also end up in the test dump files which makes the tests pass again.
This adds a test case which contains all changes to cciss_vol_status 1.12 and 1.12a output.
2f57038
to
13acd51
Compare
The plugin would only scream "UNKNOWN: cciss:[Plugin error]". See glensc/nagios-plugin-check_raid#205 about all of this.
|
@knorrie remove Draft status once this is ready for merge. also, are you interested in having developer access in this repo? |
Thanks @knorrie for working on this. Debian Bullseye, cciss_vol_status version 1.12a, check_raid 4.0.10 Without your changes:
With your changes from the mentioned commit:
|
@knorrie are you able to finish up the PR? please remove the Draft status when it's ready to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
I just applied the changeset on a Debian Bullseye server with cciss_vol_status v1.12a, it fixed the UNKNOWN: cciss:[Plugin error]
.
Thanks!
@Napsty you can carry the commits to your own branch and submit new PR with same commits, but do verify the changes are ok first. |
merging the original instead, as uses from another pr report his working: |
Hi, I have some fixes for check_raid to be able to cope with the changed text output of cciss_vol_status version 1.12a.
First opening the WIP MR with the code changes, so that I have a number to use while trying to get additional test cases set up.