Workaround the flaky downgrade robustness test due to WAL records missing in all members #19147

ahrtr · 2025-01-08T20:21:23Z

What would you like to be added?

Background

Currently the robustness test uses the WAL records to rebuild the etcdserver's real history, and then checks whether that history matches what the client side received.

So it requires that at least one member has complete WAL records.

Issue

Based on the discussion in #19095 and #19038, a member might fail to flush some WAL records to disk when it is stopped (and restarted later). This causes the error "failed to read WAL, cannot be repaired, err: wal: slice bounds out of range"; refer to #19038 (comment). That issue was fixed in #19095.

Usually this isn't a problem, because typically only one member gets stopped or killed in a robustness test, and a single member missing WAL records is harmless as long as another member still has the complete records.

Note that there is NO issue from the user's perspective. A member missing some WAL records isn't a problem, because it can get a snapshot from the leader when it starts again if it lags far behind the leader.

But in the downgrade test, we need to stop and restart all the members one by one, so it's possible that every member has some missing WAL records. It might then be impossible to read the complete WAL records from any member, which causes the error "last succesful client write .... was not persisted, required to validate"; refer to #19095 (comment).
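To make the failure mode concrete, here is a minimal sketch in Go. It is a simplified model, not the robustness test's actual code: a member's WAL is represented only by the sorted entry indices that reached disk, and `readableUpTo` and `canValidate` are hypothetical helper names. It assumes that only the gap-free run of entries can be read back from a member.

```go
package main

import "fmt"

// readableUpTo returns the last index of the gap-free run that starts at the
// first persisted index, or 0 if nothing was persisted.
// Assumption: persisted is sorted in ascending order.
func readableUpTo(persisted []uint64) uint64 {
	if len(persisted) == 0 {
		return 0
	}
	last := persisted[0]
	for _, idx := range persisted[1:] {
		if idx != last+1 {
			break // a record is missing here; nothing after the gap is usable
		}
		last = idx
	}
	return last
}

// canValidate reports whether at least one member's gap-free WAL reaches the
// index of the last successful client write.
func canValidate(members [][]uint64, lastClientWrite uint64) bool {
	for _, m := range members {
		if readableUpTo(m) >= lastClientWrite {
			return true
		}
	}
	return false
}

func main() {
	// Only one member was restarted and lost records: the others are complete.
	oneRestarted := [][]uint64{
		{1, 2, 3, 6, 7, 8},
		{1, 2, 3, 4, 5, 6, 7, 8},
		{1, 2, 3, 4, 5, 6, 7, 8},
	}
	// Downgrade case: every member was restarted and lost different records.
	allRestarted := [][]uint64{
		{1, 2, 3, 6, 7, 8},
		{1, 2, 3, 4, 5, 8},
		{1, 2, 3, 4, 5, 6, 8},
	}
	fmt.Println("one member restarted:", canValidate(oneRestarted, 8))
	fmt.Println("all members restarted:", canValidate(allRestarted, 8))
}
```

In this model the cluster as a whole still holds every entry, but no single member can supply the full history, which is the situation the downgrade test can run into.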

Again, there is NO issue from the user's perspective.

Proposed solution

Based on my previous tests, the issue (WAL records failing to be flushed to disk) usually only happens in high-traffic scenarios, so one workaround is to generate only very low traffic when running the downgrade case in the robustness test.

Also, the robustness test currently reads the longest WAL records. But the longest one may not be the correct one. We should ensure that at least a majority of members have the same longest WAL records. Refer to #19095 (comment).
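One possible way to express the majority check is sketched below. This is only an illustration under stated assumptions, not the existing robustness code: `Entry` is a hypothetical, simplified stand-in for a WAL entry, `majorityAgreedEntries` is a made-up name, and every member's slice is assumed to start at the same index.

```go
package main

import "fmt"

// Entry is a hypothetical, simplified stand-in for a raft WAL entry.
type Entry struct {
	Index uint64
	Term  uint64
	Data  string
}

// majorityAgreedEntries returns the longest prefix of entries that at least a
// quorum of members hold identically.
// Assumption: every member's slice starts at the same index.
func majorityAgreedEntries(members [][]Entry) []Entry {
	quorum := len(members)/2 + 1
	candidates := members
	var agreed []Entry
	for i := 0; ; i++ {
		// Count how many remaining candidates hold each distinct entry at position i.
		counts := map[Entry]int{}
		for _, m := range candidates {
			if i < len(m) {
				counts[m[i]]++
			}
		}
		var best Entry
		bestCount := 0
		for e, c := range counts {
			if c > bestCount {
				best, bestCount = e, c
			}
		}
		if bestCount < quorum {
			return agreed // no entry at this position is backed by a majority
		}
		agreed = append(agreed, best)
		// Keep only the members that agree with the majority so far.
		var next [][]Entry
		for _, m := range candidates {
			if i < len(m) && m[i] == best {
				next = append(next, m)
			}
		}
		candidates = next
	}
}

func main() {
	full := []Entry{{1, 1, "a"}, {2, 1, "b"}, {3, 1, "c"}, {4, 1, "d"}, {5, 1, "e"}}
	members := [][]Entry{full, full[:3], full[:4]}
	agreed := majorityAgreedEntries(members)
	fmt.Printf("longest WAL has %d entries, majority-agreed prefix has %d\n", len(full), len(agreed))
}
```

In this example the longest WAL has five entries, but only the first four are held by a majority, so the sketch returns four. Whether the test should fall back to this prefix or fail outright is a design decision left open here.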

cc @siyuanfoundation @serathius

Related discussion

This is for other contributors' reference. The goal of the robustness test is to verify the correctness of etcd, so ideally it should NOT depend on any data (including WAL files) generated by etcd; it should treat etcd entirely as a black box.

But it's hard and very costly for the robustness test to build all the exponential possibilities (when a client gets a failure response, the request may have failed or succeeded on the server side; when there are multiple failed client requests, the possibilities increase exponentially). So a practical way is to use WAL records to build the real history from the server side.
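To illustrate the growth, the sketch below enumerates every server-side outcome for n failed client requests, treating each request as either persisted or not. `possibleOutcomes` is a hypothetical helper written for this illustration only; it is not part of the robustness test.

```go
package main

import "fmt"

// possibleOutcomes enumerates, for n failed client requests, every combination
// of "persisted on the server" (true) and "not persisted" (false).
func possibleOutcomes(n int) [][]bool {
	total := 1 << n // 2^n combinations
	out := make([][]bool, 0, total)
	for mask := 0; mask < total; mask++ {
		combo := make([]bool, n)
		for i := 0; i < n; i++ {
			combo[i] = mask&(1<<i) != 0
		}
		out = append(out, combo)
	}
	return out
}

func main() {
	for _, n := range []int{1, 2, 3, 10} {
		fmt.Printf("%d failed requests -> %d candidate histories\n", n, len(possibleOutcomes(n)))
	}
}
```

Each additional failed request doubles the number of candidate histories a black-box checker would have to consider, which is why reading the WAL is the practical shortcut.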

Why is this needed?

To work around the flakiness of the downgrade robustness test.

ahrtr added type/feature area/robustness-testing labels Jan 8, 2025

ahrtr mentioned this issue Jan 8, 2025

Still return continuous WAL entries when running into ErrSliceOutOfRange #19095

Merged

ahrtr added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 8, 2025

ahrtr assigned siyuanfoundation Jan 8, 2025
