Workaround the flaky downgrade robustness test due to WAL records missing in all members #19147
Labels
area/robustness-testing
priority/important-soon
type/feature
What would you like to be added?
Background
Currently, the robustness test uses WAL records to rebuild the etcdserver's real history in order to check correctness.
It therefore requires that at least one member has complete WAL records.
Issue
Based on the discussion in #19095 and #19038, a member might fail to flush some WAL records to disk when it is stopped (and restarted later). This causes the error
failed to read WAL, cannot be repaired, err: wal: slice bounds out of range
Refer to #19038 (comment); that issue was fixed in #19095. Usually this isn't a problem, because the robustness test usually stops or kills only one member, and a single member missing WAL records is tolerable as long as another member still has the complete records.
But in the downgrade test, we need to stop and restart all the members one by one, so each member may be missing some WAL records. It might then be impossible to read the complete WAL records from any member, which causes the error
last succesful client write .... was not persisted, required to validate
Refer to #19095 (comment).
Proposed solution
Based on my previous tests, the issue (WAL records failing to be flushed to disk) usually only happens in high-traffic scenarios, so one workaround is to generate only very low traffic when running the downgrade case in the robustness test.
Also, the robustness test currently reads the longest WAL records, but the longest one may not be the correct one. We should ensure that at least a majority of members have the same longest WAL records. Refer to #19095 (comment).
cc @siyuanfoundation @serathius
Related discussion
This is for other contributors' reference. The goal of the robustness test is to verify the correctness of etcd, so ideally it should NOT depend on any data (including WAL files) generated by etcd; it should treat etcd entirely as a black box.
But it's hard and very costly for the robustness test to explore all of the exponentially many possibilities (when a client gets a failure response, the request may have failed or succeeded on the server side; with multiple failed client requests, the number of possibilities grows exponentially). So a practical approach is to use WAL records to build the real history from the server side.
Why is this needed?
To work around the flakiness of the downgrade robustness test.