
Multi-threaded Executor starvation fix #2702

Open
wants to merge 2 commits into base: rolling

Conversation


@HarunTeper HarunTeper commented Dec 9, 2024

Pull request addressing the issues in #2360 and #2645.

So far, I have added a test that detects starvation in the multi-threaded executor.
This test includes a mutually-exclusive callback group with two timers.
The executor should alternate between these two tasks, never executing one task twice before the other.
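For reference, a minimal sketch of what such a starvation test could look like against the rclcpp multi-threaded executor; the node name, timer periods, and the final counter comparison are illustrative only (the PR's actual test checks strict alternation rather than a counter difference):

```cpp
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <memory>
#include <thread>

#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("starvation_test_node");

  // Both timers share one mutually exclusive callback group, so at most one
  // of them can be executing at any point in time.
  auto group = node->create_callback_group(
    rclcpp::CallbackGroupType::MutuallyExclusive);

  std::atomic<int> count_a{0};
  std::atomic<int> count_b{0};

  auto timer_a = node->create_wall_timer(
    std::chrono::milliseconds(10), [&count_a]() {count_a++;}, group);
  auto timer_b = node->create_wall_timer(
    std::chrono::milliseconds(10), [&count_b]() {count_b++;}, group);

  rclcpp::executors::MultiThreadedExecutor executor;
  executor.add_node(node);

  // Spin in a background thread for a while, then stop the executor.
  std::thread spin_thread([&executor]() {executor.spin();});
  std::this_thread::sleep_for(std::chrono::seconds(1));
  executor.cancel();
  spin_thread.join();
  rclcpp::shutdown();

  // With a fair executor the two counters should stay close together; a large
  // gap means one timer starved the other.
  const int diff = std::abs(count_a.load() - count_b.load());
  return diff <= 1 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```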

To fix starvation, I have identified the following steps (which I have not yet been able to implement completely):

  1. Introduce a new mutex (I refer to it as 'notify_mutex_') in the executor.hpp file that is used to guard the callback group flags.
  2. Introduce a function that locks the 'notify_mutex_', triggers the 'interrupt_guard_condition_', and unlocks the callback group flag (and the 'notify_mutex_' afterwards). Currently, the 'MultiThreadedExecutor::run' function includes the guard condition trigger, while 'Executor::execute_any_executable' includes the call that unlocks the callback group. These need to be combined and guarded by the 'notify_mutex_'.
  3. The 'Executor::wait_for_work' function needs to be split into two functions. One may be called 'Executor::prepare_wait_set' and does everything up to (but excluding) the 'wait' call that is currently in 'wait_for_work'. The rest can be kept in the current function. This change is necessary to lock the 'notify_mutex_' while a callback is being extracted from the wait set by 'get_next_executable', ensuring that no other thread can change a callback group flag at the same time (a rough sketch of steps 2 and 3 follows after this list).
  4. The most complex change: the function that collects and adds the entities to the 'wait_set_' needs to be updated. First, the 'wait_result_' should not be reset. Then a function needs to be added that adds all callback instances from the previous 'wait_result_' to the current 'wait_result_' if they are blocked and not invalid. This new function has to be executed after the 'wait_set_.wait' call; otherwise, the previously blocked jobs would immediately unblock the wait function, since they are already ready when it is called, which would lead to busy waiting. Furthermore, the position at which the previous job instances are inserted also needs to be decided, as it determines the priority of those jobs after insertion. For this change, it may also be necessary to introduce a variable called 'previous_wait_result_'.
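To make steps 2 and 3 more concrete, here is a minimal sketch against a hypothetical, heavily simplified executor skeleton; the member names follow the description above, but the class, the call sites, and the use of 'rclcpp::GuardCondition' and 'CallbackGroup::can_be_taken_from()' for the flag are simplifying assumptions, not the actual rclcpp executor code:

```cpp
#include <chrono>
#include <memory>
#include <mutex>

#include "rclcpp/callback_group.hpp"
#include "rclcpp/guard_condition.hpp"

// Hypothetical executor skeleton: just enough members to show the locking
// order of steps 2 and 3.
class SketchExecutor
{
protected:
  std::mutex notify_mutex_;  // step 1: guards the callback group flags
  // assumed to be created elsewhere by the executor
  std::shared_ptr<rclcpp::GuardCondition> interrupt_guard_condition_;

  // Step 2: trigger the guard condition and release the callback group under
  // one lock, so no other thread sees the group in a half-released state.
  void trigger_and_release(const rclcpp::CallbackGroup::SharedPtr & group)
  {
    std::lock_guard<std::mutex> lock(notify_mutex_);
    interrupt_guard_condition_->trigger();
    group->can_be_taken_from().store(true);
  }

  // Step 3: the part of wait_for_work() that runs before the blocking wait,
  // split out so get_next_executable() can hold notify_mutex_ while entities
  // are collected, without holding it across the wait itself.
  void prepare_wait_set()
  {
    std::lock_guard<std::mutex> lock(notify_mutex_);
    // collect entities from callback groups that are not currently claimed
    // by another thread and rebuild the wait set here ...
  }

  void wait_for_work(std::chrono::nanoseconds timeout)
  {
    (void)timeout;
    // blocking wait on the wait set plus post-processing of the result ...
  }
};
```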

These steps are based on the work I published here:

https://ieeexplore.ieee.org/document/9622336

https://daes.cs.tu-dortmund.de/storages/daes-cs/r/publications/teper2024emsoft_preprint.pdf

I have already tried to implement some of these steps, and I will also commit some of the changes to this fork this week. However, for step 4, I may require some help. I also noticed that my changes break some of the tests that are currently part of rclcpp, as I move the functions that set callback group flags and trigger guard conditions.

@HarunTeper HarunTeper changed the title from "Added test for starvation in the multi-threaded executor" to "Multi-threaded Executor starvation fix" on Dec 9, 2024
@jmachowinski
Contributor

The proposed changes, as well as the paper, don't make sense to me. But this might be a wording vs implementation issue. Let's wait for the actual implementation...
@HarunTeper you might want to attend the next Client Library Working Group meeting to discuss this in person.

@alsora
Collaborator

alsora commented Dec 12, 2024

Hi @HarunTeper,
I agree that this topic seems complex enough that it would be better to have an in-person conversation.

The next Client Library WG meeting will happen Friday 12/20/2024 at 8AM PST.
See some details here, and I will post a reminder next week: https://discourse.ros.org/t/next-client-library-wg-meeting-friday-6th-december-2024-8am-pt/40954

@sloretz sloretz added the "more-information-needed (Further information is required)" label on Dec 20, 2024
@HarunTeper
Author

> Hi @HarunTeper, I agree that this topic seems complex enough that it would be better to have an in-person conversation.
>
> The next Client Library WG meeting will happen Friday 12/20/2024 at 8AM PST. See some details here, and I will post a reminder next week: https://discourse.ros.org/t/next-client-library-wg-meeting-friday-6th-december-2024-8am-pt/40954

I will likely join the next meeting and work on the pull request until then. Sorry for not making it to the last one.

@HarunTeper
Author

I have now uploaded my fix for Humble to a custom repository. It provides a devcontainer and VSCode tasks to build and run the test that shows starvation. (After switching branches, you should delete the build, log, and install folders before rebuilding.)

https://github.com/HarunTeper/rclcpp_humble_multithreaded_executor

You can see the changes in the branch called "fix".

I am currently working on applying the same changes to the current version of the multi-threaded executor. However, since Humble, the executor code itself has changed significantly, in particular the wait set handling and the 'wait_for_work' function.

I will try to implement the version that I propose; however, it will differ from my previous solution. For example, I think the current version has threads busy-waiting for tasks to become ready if all tasks are blocked but not all threads are currently executing tasks.

If there is another meeting, I will join that one to talk about the executor. In the meantime, I will work on the fix.

@jmachowinski
Contributor

I had a look at the 'fix' implementation. As far as I can see, this comes down to:

  • If there are unprocessed entities of a callback group, don't add any entities of the callback group to the waitset on repoll
  • Make unprocessed entities from the last poll 'persistent' / don't clear unprocessed entities of callback groups in execution on repoll
  • Use some mutexes to avoid races between unmarking of callback group execution and unprocessed entities updates
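A toy illustration of the first two bullet points, using invented types rather than the actual rclcpp wait set and callback group data structures:

```cpp
#include <string>
#include <vector>

// Purely illustrative data layout: each callback group keeps the entities it
// registered plus any leftovers from the previous poll.
struct CallbackGroupState
{
  bool in_execution = false;
  std::vector<std::string> registered_entities;   // timers, subscriptions, ...
  std::vector<std::string> unprocessed_entities;  // ready but not yet executed
};

// On repoll, skip groups that are executing or still have unprocessed
// entities, and leave their leftovers untouched instead of clearing them.
std::vector<std::string> collect_entities_for_repoll(
  const std::vector<CallbackGroupState> & groups)
{
  std::vector<std::string> wait_on;
  for (const auto & group : groups) {
    if (group.in_execution || !group.unprocessed_entities.empty()) {
      continue;  // leftovers stay queued; this group is not polled again
    }
    wait_on.insert(
      wait_on.end(),
      group.registered_entities.begin(), group.registered_entities.end());
  }
  return wait_on;
}
```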

While I think this fix works, the implementation is awkward, in the sense that the code flow is extremely hard to follow. To be fair, the executor code flow isn't great in the first place...
I would also say that the implementation suffers from unneeded interruptions of the wait set polling and from context switches. Also, to be fair here, the Humble and the current implementation both suffer from the same problem.

I would suggest a different approach to the problem:
Create a deque of ready events per callback group.
Have two lists of callback groups:

  • Idle
  • Processing

Poll for events:
Only poll for callback groups from the idle list.
After each poll:

  • Add the events to the corresponding deque
  • Move callback groups with ready events into the processing list

Worker threads:

  • if processing is empty
    • poll / block if someone else is already polling
    • wake up workers corresponding to the size of the processing list
    • continue
  • remove first entry of processing
  • process and remove the first ready entity of the removed callback group
  • check if callback group has further pending events
    • if not, add to idle, wake up the poll thread
    • if there are pending events, re-add to processing
This is basically the logic I use in the EventsCBGExecutor.
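For illustration, here is a rough sketch of that scheduling structure; all names are invented, it assumes a dedicated poll thread rather than workers taking turns polling, and it is not the actual EventsCBGExecutor code:

```cpp
#include <algorithm>
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Invented placeholder for "one ready timer/subscription/... to execute".
struct ReadyEvent {};

struct GroupState
{
  std::deque<ReadyEvent> ready_events;  // per-callback-group event queue
};

class Scheduler
{
public:
  // Poll-thread side: only groups in idle_ were part of the wait set. After a
  // wait returns, queue the events, move groups with work into processing_,
  // and wake workers.
  void on_wait_result(
    std::vector<std::pair<std::shared_ptr<GroupState>, ReadyEvent>> events)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    for (auto & entry : events) {
      entry.first->ready_events.push_back(entry.second);
      move_from_idle_to_processing(entry.first);
    }
    work_available_.notify_all();
  }

  // Worker-thread side: take one group out of processing_, execute exactly one
  // of its ready events, then put the group back into idle_ or processing_.
  void worker_step()
  {
    std::shared_ptr<GroupState> group;
    {
      std::unique_lock<std::mutex> lock(mutex_);
      work_available_.wait(lock, [this] {return !processing_.empty();});
      group = processing_.front();
      processing_.pop_front();
    }

    // While the group is in neither list, no other thread touches its deque.
    ReadyEvent event = group->ready_events.front();
    group->ready_events.pop_front();
    execute(event);

    std::lock_guard<std::mutex> lock(mutex_);
    if (group->ready_events.empty()) {
      idle_.push_back(group);      // group may be polled again
      poll_wakeup_.notify_one();   // tell the poll thread to re-collect
    } else {
      processing_.push_back(group);  // still has pending work
    }
  }

private:
  void move_from_idle_to_processing(const std::shared_ptr<GroupState> & group)
  {
    auto it = std::find(idle_.begin(), idle_.end(), group);
    if (it != idle_.end()) {
      idle_.erase(it);
      processing_.push_back(group);
    }
  }

  void execute(const ReadyEvent &) { /* run the callback here */ }

  std::mutex mutex_;
  std::condition_variable work_available_;
  std::condition_variable poll_wakeup_;
  std::deque<std::shared_ptr<GroupState>> idle_;
  std::deque<std::shared_ptr<GroupState>> processing_;
};
```

The key property is that a callback group is in at most one of the two lists (or checked out by exactly one worker), so it can never be polled and executed concurrently.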
