Holoscan not running multiple instances of operator simultaneously #41

Open
CameronDevine opened this issue Jan 10, 2025 · 7 comments
Labels: help wanted (Extra attention is needed)

@CameronDevine

Please describe your question

I have an application where I would like to use Holoscan for an image processing pipeline. The problem I am having is that Holoscan seems to only run a single instance of an operator at a time. This is in contrast to Intel's Threading Building Blocks, which will run multiple copies of an operator at once. Since much of the processing I need to do is serial, the performance of the application is dominated by the runtime of the slowest operator. Is there a different scheduler I can use to enable multiple copies of an operator to run at once, or could I implement a custom scheduler?
 

Please specify what Holoscan SDK version you are using

2.8.0

Please add any details about your platform or use case

I am working on building a pipeline for real time processing of mouse brain images from transmission electron microscopes.

@tbirdso (Contributor) commented Jan 10, 2025

Hi there @CameronDevine, thanks for using Holoscan SDK. Yes, by default Holoscan SDK uses greedy scheduling, but it also provides a polling-based scheduler and an event-based scheduler for multithreaded operation. We recommend visiting the Holoscan Schedulers documentation or the multithreaded scheduler example for more on multithreaded approaches.
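
For example, switching an application to the event-based scheduler is just a matter of passing a scheduler instance to the application before running it; a minimal sketch (where MyApp is a placeholder for your own Application subclass) looks like:

from holoscan.schedulers import EventBasedScheduler

# MyApp stands in for your own holoscan.core.Application subclass.
app = MyApp()
# Use 4 worker threads instead of the default single-threaded greedy scheduler.
app.scheduler(EventBasedScheduler(app, worker_thread_number=4))
app.run()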

@CameronDevine (Author)

Hi @tbirdso, here's some code showing the issue I am encountering. When run with -s greedy, I get a rate of approximately 0.5 Hz, which is what I expect. However, when I switch to either the event-based (-s event) or multi-threaded (-s multi) scheduler, I still get a rate of approximately 0.5 Hz. I would expect this to increase to about 2 Hz with either of those schedulers, since there are 4 worker threads (and more than 4 cores on my machine) which could each run the Wait operator simultaneously.

from holoscan.core import Application, Operator
import holoscan.schedulers
import argparse
from time import sleep, monotonic


class Count(Operator):
    def setup(self, spec):
        spec.output("out")

    def start(self):
        self.i = 0

    def compute(self, input, output, context):
        self.i += 1
        output.emit(self.i, "out")


class Wait(Operator):
    def setup(self, spec):
        spec.input("in")
        spec.output("out")
        spec.param("duration")

    def compute(self, input, output, context):
        sleep(self.duration)
        output.emit(input.receive("in"), "out")


class Rate(Operator):
    def setup(self, spec):
        spec.input("in")
        spec.param("window")

    def start(self):
        self.last_call = None
        self.durations = []

    def compute(self, input, output, context):
        if self.last_call is None:
            self.last_call = monotonic()
            print(f"Received {input.receive('in')}")
        else:
            now = monotonic()
            self.durations.append(now - self.last_call)
            self.last_call = now
            self.durations = self.durations[-self.window :]
            rate = len(self.durations) / sum(self.durations)
            print(f"Received {input.receive('in')} (rate {rate} Hz)")


class Test(Application):
    def __init__(self, scheduler):
        super().__init__()

        schedulers = {
            "greedy": lambda: holoscan.schedulers.GreedyScheduler(self),
            "event": lambda: holoscan.schedulers.EventBasedScheduler(
                self, worker_thread_number=4
            ),
            "multi": lambda: holoscan.schedulers.MultiThreadScheduler(
                self, worker_thread_number=4
            ),
        }

        assert scheduler in schedulers

        self.scheduler(schedulers[scheduler]())

    def compose(self):
        count = Count(self)
        wait = Wait(self, duration=2)
        rate = Rate(self, window=8)

        self.add_flow(count, wait)
        self.add_flow(wait, rate)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--scheduler",
        "-s",
        type=str,
        default="greedy",
        help='The scheduler to use, either "greedy", "event", or "multi".',
    )

    args = parser.parse_args()

    app = Test(args.scheduler)
    app.run()


if __name__ == "__main__":
    main()

@grlee77 (Contributor) commented Jan 10, 2025

Hi @CameronDevine,

Thanks for providing the concrete code example. I think there is a misconception about how the scheduling works. Looking at your compose method:

    def compose(self):
        count = Count(self)
        wait = Wait(self, duration=2)
        rate = Rate(self, window=8)

        self.add_flow(count, wait)
        self.add_flow(wait, rate)

There are only three operators in total, connected in a single linear chain, so there is nothing that can run in parallel. In other words, the scheduler does not launch multiple copies of the application (or of any operator); only these three operator instances are executed repeatedly.

For a concrete example of running multiple wait/delay operators in parallel see this one included in the examples folder:
https://github.com/nvidia-holoscan/holoscan-sdk/blob/main/examples/multithread/python/multithread.py (README)

In that case, there is a single source operator that connects to N "delay" operators in parallel, which then converge on a single sink operator. It is not required to have a common source/sink; that is just what we happened to choose for the example. You could also build the same type of example with a separate source/sink for each "delay" operator.
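
For instance, using the Count/Wait/Rate operators from your snippet, a rough sketch of a compose method with a separate source and sink per branch might look like the following (this is not the linked example verbatim, and the name= keyword is only used here to keep the operator names unique):

    def compose(self):
        # Four independent Count -> Wait -> Rate chains; with an event-based or
        # multi-threaded scheduler the four Wait operators can run concurrently.
        for i in range(4):
            count = Count(self, name=f"count{i}")
            wait = Wait(self, name=f"wait{i}", duration=2)
            rate = Rate(self, name=f"rate{i}", window=8)
            self.add_flow(count, wait)
            self.add_flow(wait, rate)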

@CameronDevine (Author)

Hi @grlee77,

Thank you (and @tbirdso) for confirming what I had found experimentally. Unfortunately, this means that Holoscan will not work for my application as I have a pipeline that is primarily serial and the total computation time is dominated by a small number of operators.

@whom3 (Contributor) commented Jan 15, 2025

@CameronDevine how many of these operators will you have connected serially in your use case?

In this simple example, there is only one operator that takes a long time and the runtimes of the other operators are trivial, so having multiple workers won't help.

If we modify your example to include multiple operators connected serially, we can see a difference in the reported rate. Here I have set worker_thread_number to 2.

def compose(self):
    count = Count(self)
    wait1 = Wait(self, duration=2)
    wait2 = Wait(self, duration=2)
    wait3 = Wait(self, duration=2)
    wait4 = Wait(self, duration=2)
    rate = Rate(self, window=8)

    self.add_flow(count, wait1)
    self.add_flow(wait1, wait2)
    self.add_flow(wait2, wait3)
    self.add_flow(wait3, wait4)
    self.add_flow(wait4, rate)

With greedy scheduler:

python3 example.py 
Received 1
Received 2 (rate 0.12488370303508864 Hz)
Received 3 (rate 0.1248758491955746 Hz)
Received 4 (rate 0.12486922475752466 Hz)
Received 5 (rate 0.12486592252718796 Hz)

With event based scheduler:

python3 example.py -s event
Received 1
Received 2 (rate 0.24969815600480916 Hz)
Received 3 (rate 0.24969767448312177 Hz)
Received 4 (rate 0.24969736471333656 Hz)
Received 5 (rate 0.24969612976670194 Hz)

@CameronDevine (Author)

Hi @whom3, I expect I will have about 10 operators, which is still fewer than the number of cores on the systems I will be running the code on.

@whom3 (Contributor) commented Jan 15, 2025

@CameronDevine You could try the following, which modifies your original example so that:

  • there are now 10 Wait operators connected serially, each waiting 1 second
  • the number of workers can be specified via -n
  • the input port of the Wait operator has a queue size of 2, which allows an upstream operator to do work while the downstream operator is still busy; with the default queue size of 1 we still get some benefit from multiple workers, but not as much
from holoscan.core import Application, Operator, IOSpec
import holoscan.schedulers
import argparse
from time import sleep, monotonic

class Count(Operator):
    def setup(self, spec):
        spec.output("out")

    def start(self):
        self.i = 0

    def compute(self, input, output, context):
        self.i += 1
        output.emit(self.i, "out")


class Wait(Operator):
    def setup(self, spec):
        spec.input("in").connector(IOSpec.ConnectorType.DOUBLE_BUFFER, capacity=2)
        spec.output("out")
        spec.param("duration")

    def compute(self, input, output, context):
        sleep(self.duration)
        output.emit(input.receive("in"), "out")


class Rate(Operator):
    def setup(self, spec):
        spec.input("in")
        spec.param("window")

    def start(self):
        self.last_call = None
        self.durations = []

    def compute(self, input, output, context):
        if self.last_call is None:
            self.last_call = monotonic()
            print(f"Received {input.receive('in')}")
        else:
            now = monotonic()
            self.durations.append(now - self.last_call)
            self.last_call = now
            self.durations = self.durations[-self.window :]
            rate = len(self.durations) / sum(self.durations)
            print(f"Received {input.receive('in')} (rate {rate} Hz)")


class Test(Application):
    def __init__(self, scheduler, num_workers):
        super().__init__()

        schedulers = {
            "greedy": lambda: holoscan.schedulers.GreedyScheduler(self),
            "event": lambda: holoscan.schedulers.EventBasedScheduler(
                self, worker_thread_number=num_workers
            ),
            "multi": lambda: holoscan.schedulers.MultiThreadScheduler(
                self, worker_thread_number=num_workers
            ),
        }

        assert scheduler in schedulers

        self.scheduler(schedulers[scheduler]())

    def compose(self):
        count = Count(self)
        wait1 = Wait(self, duration=1)
        wait2 = Wait(self, duration=1)
        wait3 = Wait(self, duration=1)
        wait4 = Wait(self, duration=1)
        wait5 = Wait(self, duration=1)
        wait6 = Wait(self, duration=1)
        wait7 = Wait(self, duration=1)
        wait8 = Wait(self, duration=1)
        wait9 = Wait(self, duration=1)
        wait10 = Wait(self, duration=1)
        rate = Rate(self, window=8)

        self.add_flow(count, wait1)
        self.add_flow(wait1, wait2)
        self.add_flow(wait2, wait3)
        self.add_flow(wait3, wait4)
        self.add_flow(wait4, wait5)
        self.add_flow(wait5, wait6)
        self.add_flow(wait6, wait7)
        self.add_flow(wait7, wait8)
        self.add_flow(wait8, wait9)
        self.add_flow(wait9, wait10)
        self.add_flow(wait10, rate)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--scheduler",
        "-s",
        type=str,
        default="greedy",
        help='The scheduler to use, either "greedy", "event", or "multi".',
    )
    parser.add_argument(
        "--num_workers",
        "-n",
        type=int,
        default=4,
        help='Number of workers for multi-thread and event based scheduler',
    )


    args = parser.parse_args()

    app = Test(args.scheduler, args.num_workers)
    app.run()


if __name__ == "__main__":
    main()

For the above, I am seeing much better scaling when increasing the number of workers:

  • greedy: ~0.1 Hz
  • event with 4 workers: ~0.3 Hz
  • event with 5 workers: ~0.43 Hz
  • event with 8 workers: ~0.85 Hz
  • event with 10 workers: ~0.996 Hz
