-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simple source rate limiting #2149
base: devel
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
|
# Conflicts: # docs/website/docs/general-usage/source.md
bac447c
to
9b9c9ec
Compare
The timing on mac is again working quite differently from the other platforms. Maybe we should add sth to exec_info to be able to determine mac. and specifically allow more generous timeframes if tests are run there. |
@@ -52,6 +54,8 @@ class PipeIteratorConfiguration(BaseConfiguration): | |||
futures_poll_interval: float = 0.01 | |||
copy_on_fork: bool = False | |||
next_item_mode: str = "round_robin" | |||
rate_limit: Optional[float] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this be a callback? Or even better a generic RateLimiter
instance for example that can be shared between different pipes and dynamically updated?
I am thinking about #1485 (comment)
which describes an API that defines a global rate limit across all api endpoints.
If I have different resources (real-world example: https://github.com/dlt-hub/verified-sources/pull/587/files#diff-0fc4db143e89ab087ef737341bc4d93a631141a0bb633c31831ca22bb072c146R100-R105) that are in different pipes but share one limit, then I can't easily express this with this API as I wouldn't know how many other requests are made in other pipes (you can see from the code above that the pipes even may depend on the user configuration).
It also wouldn't be possible to make the rate limit dynamic. E.g. let's say I start a pipeline run that takes 2h but during that time an external system uses a certain amount of the resource for a while, then I'd like to adjust the rate limiting here. I.e.:
t0: I start pipe 1, I have 900 requests/s
t1: another system starts working, uses 100 requests/s, I want to reduce pipe 1 to 800r/s
t2: the other system stops it's work, I want to up the requests of pipe 1 to 900 aain
or
t0: I start pipe1 and pipe2 parallel, I have 900 req/s in total, so they share 450req/s each
t1: pipe2 is much faster and finishes, I want to increase the req/s for pipe1 to 900
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joscha could you maybe collect your requirements for rate limiting in a ticket. I'm not sure if this change here will make it into the code soon, but it is meant as fairly simple rate limiting mechanism to address requirements we hear a lot from the community. The idea to use exponential backoff and to respect well defined rate limiting header also is under consideration and would go into the rest api layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can most certainly put it in a ticket. I am also not suggesting that having rate liming here is bad, I am just not sure the current implementation is flexible enough for more complex systems. Having one way to define rate limits and exponential backoff would be amazing, as it then would mean we only have one way to define it and next time we add the feature to another layer (like the RESTClient for example) we can just reuse what's already known.
Thinking in terms of how sources and/or pipes are structured and limiting them does also not always easily translate to the underlying system these source(s)/resource(s) pull data from. For resources based on an external REST API for example we possibly wouldn't be interested in limiting the work in the resource, but only how many requests are leaving the system (possibly even based on the endpoint as often APIs define different limits for different endpoints but a dlt resource might be using more than one type)?
Description
This PR adds a simple rate limiting mechanism controlled by the PipeIterator which should be very useful for certain api based use-cases.