Worker Network Timeseries #5129
Thank you for doing this. Request for future PRs, can I ask you to include
a quick screenshot? I suspect that this will make it easier / more
exciting for future reviewers :)
…On Tue, Jul 27, 2021 at 12:41 PM Naty Clementi ***@***.***> wrote:
- Closes #5090 <#5090>
- Tests added / passed
- Passes black distributed / flake8 distributed / isort distributed
Currently computing the average for read_bytes and write_bytes across
workers. If we decided we want the sum we can change that, let me know in
the comments.
Commit Summary
- add timeseries for network bandwith
- add timeseries bandwidth entry on http services
- Modify legend to reference average
File Changes
- *M* distributed/dashboard/components/scheduler.py
<https://github.com/dask/distributed/pull/5129/files#diff-ad4c586347761d4ae47c28a8f98199b0278aeb4e8c0cab96406483a4c2202379>
(75)
- *M* distributed/dashboard/scheduler.py
<https://github.com/dask/distributed/pull/5129/files#diff-204bf47402cd79041d0e13db17e7345740603fe16667c824b08883da63f087bb>
(4)
- *M* distributed/dashboard/tests/test_scheduler_bokeh.py
<https://github.com/dask/distributed/pull/5129/files#diff-fe7fdacaea63aa88e75656603c15602128b855c1cada8ef973fb7316fc89e019>
(45)
- *M* docs/source/http_services.rst
<https://github.com/dask/distributed/pull/5129/files#diff-b60c986aa66a44a3d964d0cdf5be70a24503ab23551182e6c4c6b80e364f35d8>
(1)
Patch Links:
- https://github.com/dask/distributed/pull/5129.patch
- https://github.com/dask/distributed/pull/5129.diff
|
My preference is for sum over average
|
I'll modify it to use the sum.
Absolutely, I'll modify things to work with sum and add a screenshot/gif. |
This looks great. For the sake of cleanliness/minimalism I recommend removing the minor ticks on the y-axis and the vertical gridlines (there are hopefully examples of doing this in other parts of the code) |
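A minimal sketch of that styling suggestion, assuming a plain Bokeh figure. The figure, title, and data here are hypothetical stand-ins, not the plot added in this PR:

```python
from bokeh.plotting import figure

# Hypothetical stand-in for one of the dashboard timeseries figures.
fig = figure(title="Workers Network Bandwidth", x_axis_type="datetime")
fig.line([0, 1, 2], [10, 30, 20])

# Hide the minor ticks on the y-axis.
fig.yaxis.minor_tick_line_alpha = 0
# Hide the vertical gridlines (the grid attached to the x-axis).
fig.xgrid.visible = False
```

Setting the alpha to zero keeps the major ticks and labels untouched, so only the visual clutter goes away.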
Thanks @ncclementi! I just took this for a spin and it looks great. A couple of comments I have (which aren't meant to be blocking):
- The y-axis scales up automatically when there's lots of network activity, but doesn't scale down. Not a huge deal as refreshing the browser causes the y-axis to be resized.
- We might want to combine the "workers network bandwidth" and "workers network bandwidth timeseries" plots into a single page. Do you have a sense for how much extra work that is? If it's quick and easy, then I'll suggest we do it. Otherwise we can leave it as follow-up work in a future PR.
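One possible way to let the y-axis shrink again, sketched under the assumption that the upper limit is recomputed from the points currently on screen at every update. The `y_range_end` helper, its `floor` and `headroom` parameters, and the sample values are hypothetical; this is not the behavior implemented in the PR.

```python
from collections import deque


def y_range_end(window, floor=1.0, headroom=1.25):
    """Upper y limit derived only from the values currently on screen."""
    peak = max(window, default=0.0)
    return max(peak * headroom, floor)


# Keep only the most recent points, mirroring a rollover on a streaming source.
recent = deque(maxlen=5)
limits = []
for value in [10, 200, 10, 10, 10, 10, 10]:
    recent.append(value)
    limits.append(y_range_end(recent))

# The limit stays high while the spike is in the window and drops once it leaves.
```

In a Bokeh plot the result could be assigned to `fig.y_range.end` on each update, at the cost of the axis moving more often.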
I recommend that we skip this for now. I think that we should get a bunch of plots built, and then think holistically about how to present those plots. |
For example, I think that we might want to include a page that includes timeseries for all of cpu/memory/network/disk that include both the timeseries and the real-time per-worker chart. |
Actually @ncclementi if it's easy for you to make sure that we have each of those that might be welcome. My hope is that this is not significantly harder than doing one or two. Maybe there is some nice way to avoid code duplication, but even if not I think that that's ok. |
+1. One thing to be aware of is that we're already presenting all the As we add a bunch of new individual plots, those two things will become more cluttered. Again, like I said before, this isn't a huge deal, just something to be aware of |
I think that the right solution is to provide a more ordered list of plots: dask/dask-labextension#179 (comment) |
That sounds great for the labextension |
Maybe we do something similar in our own navbar as well |
I got a bit lost here. Are you saying that we want something like what is on /system, but with the time series of cpu/memory/network/disk in one column and the corresponding bar plot next to each? To avoid duplicating code, the approach seems to be to create an HTML template with custom CSS. (I'm not that familiar with HTML/CSS, so this might take some time.) Currently, all the plots in /system are made in one class, so I'll have to investigate whether I can extract only the ones we need, i.e. cpu/memory, and then create a page like the status page which uses a The other approach would be going for something like what is on /system which creates all the plots together in one class and uses |
I wonder if it's possible to adjust the dropdown to only include the plots that are not somewhere else in the dashboard. |
My original point is: let's not worry too much about layout right now and just get a lot of charts out. We can ask HTML/CSS folks to assemble them for us in the future if we desire. However, you should probably be aware that I'm likely going to ask you to replicate these network plots for all of CPU/memory/network/disk, so if there are efficiencies to be had in doing those together then we should probably consider them. |
Sure, I guess I'm a little confused on what's remaining.
|
The system page only reports information for the machine that the scheduler is running on. We're also curious about this same information, but aggregated across machines |
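A minimal sketch of the cross-worker aggregation being discussed, assuming each worker reports a metrics dict with `cpu`, `memory`, `read_bytes`, and `write_bytes` keys. The `aggregate_metrics` helper and the sample values are hypothetical and are not the code in this PR.

```python
from statistics import mean

# Hypothetical per-worker metrics, one dict per worker.
worker_metrics = [
    {"cpu": 50.0, "memory": 1_000, "read_bytes": 200.0, "write_bytes": 100.0},
    {"cpu": 150.0, "memory": 3_000, "read_bytes": 400.0, "write_bytes": 300.0},
]


def aggregate_metrics(metrics, keys, how="sum"):
    """Combine per-worker metrics into one value per key (sum or mean)."""
    if not metrics:
        # No workers connected: report zeros instead of failing on an empty mean.
        return {key: 0 for key in keys}
    combine = sum if how == "sum" else mean
    return {key: combine(m[key] for m in metrics) for key in keys}


totals = aggregate_metrics(worker_metrics, ["read_bytes", "write_bytes"], how="sum")
averages = aggregate_metrics(worker_metrics, ["cpu", "memory"], how="mean")
```

The `how` switch reflects the sum-versus-average question raised earlier in the thread; the empty-list guard is an added assumption about what should happen before any worker connects.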
Oh I see, so we are missing the timeseries that have the aggregated data that comes from ws.metrics["cpu"] and ws.metrics["memory"]. With this in mind, would it be useful to have all the time series cpu/memory/network/disk on one page? This should be straightforward (once I add the disk metrics) and reduce code lines. In simpler words, follow the format of the SystemMonitor class but with the aggregated info coming from ws.metrics
https://github.com/dask/distributed/blob/50fd3ff34e1a66e2fe0b27bce1457e8fd4b00d7d/distributed/dashboard/components/shared.py#L409
|
I recommend that we start with everything in separate pages. This is
probably due to my preference for using JLab, where constructing custom
layouts is really useful. I never use the /system page for example because
I can't easily see the one plot that I want.
I think that after we get lots of charts down in single pages then other
folks will be able to come by and set them in pre-configured pages. I
don't think that the group assembled on this PR is the right group for that
though. I think that we should focus on shoving data from the scheduler
into plots and leave layout to other people.
|
Here is a look at the 4 timeseries plots. But I was wondering about a couple of things:
|
It looks simple and straightforward to me |
For the cpu and memory limits, I've been trying to get something reasonable but using sum at least in the case of cpu seems a bit much, (see below). I was checking what do we do in the current timeseries on /system and we leave them without and

Adding this to the update

    if self.scheduler.workers:
        y_end_cpu = sum(ws.nthreads or 1 for ws in self.scheduler.workers.values())
        y_end_mem = sum(ws.memory_limit for ws in self.scheduler.workers.values())
    else:
        y_end_cpu = 1
        y_end_mem = 100_000_000
    self.cpu.y_range.end = y_end_cpu * 100
    self.memory.y_range.end = y_end_mem

If I run something like the code below, it's hard to see with such a high

    import dask.array as da
    x = da.random.random((50_000, 50_000), chunks=(1_000, 1_000))
    result = (x + x.T).mean(axis=0).mean()
    result.compute()

|
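To compare what the axis limits look like under a sum versus a mean, the limit computation can be pulled into a small standalone function. This is only a sketch: `FakeWorker` is a hypothetical stand-in for the scheduler's worker records, and `axis_limits` is not a function in distributed.

```python
from dataclasses import dataclass


@dataclass
class FakeWorker:
    """Hypothetical stand-in for a scheduler worker record."""

    nthreads: int
    memory_limit: int


def axis_limits(workers, how="sum"):
    """Return (cpu_percent_end, memory_bytes_end) for the two y-axes."""
    if not workers:
        # Same fallbacks as the snippet above: 1 thread and 100 MB.
        return 100, 100_000_000
    cpu = sum(w.nthreads or 1 for w in workers)
    mem = sum(w.memory_limit for w in workers)
    if how == "mean":
        cpu, mem = cpu / len(workers), mem / len(workers)
    return cpu * 100, mem


workers = [FakeWorker(4, 8_000_000_000), FakeWorker(4, 8_000_000_000)]
sum_limits = axis_limits(workers, how="sum")
mean_limits = axis_limits(workers, how="mean")
```

With two 4-thread workers the summed CPU limit is 800% while the mean limit is 400%, which is the scale difference that makes low utilization hard to read on a summed axis.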
Interesting, I would have expected the machine to peak at 800%. I'm curious, if you look at something like |
@ncclementi and I are walking through CPU utilization discrepancy offline and it looks like the code here is correct, but there's an upstream issue with EDIT: I should note that I see the expected CPU load in the dashboard on my laptop, which is also a Mac, but not with an M1 chip |
Good to merge then? |
Let's hold off for now. I'm slightly concerned the |
I took this for a spin this morning. It has already helped immeasurably :) Some feedback from usage:
@bryevdv I'm not sure if you're around and are interested in these sorts of things. If you are then I encourage you to speak up. I suspect that you're able to hammer these out very quickly. |
@mrocklin I haven't actually seen either

Edit: If you are OK using alpha compositing for different "bands" (which seems the simplest thing to do) then streaming bands is fairly straightforward and works very well

    import numpy as np

    from bokeh.driving import count
    from bokeh.models import Band
    from bokeh.plotting import ColumnDataSource, curdoc, figure

    source = ColumnDataSource(data=dict(x=[], y=[], l1=[], u1=[], l2=[], u2=[]))

    p = figure(width=900, y_range=(0, 20))
    p.x_range.follow = "end"
    p.x_range.follow_interval = 80
    p.x_range.range_padding = 0

    p.line('x', 'y', color="white", source=source)

    b1 = Band(base='x', lower='l1', upper='u1', source=source, level='underlay',
              fill_alpha=0.4, line_width=1, fill_color='navy', line_color=None)
    p.add_layout(b1)

    b2 = Band(base='x', lower='l2', upper='u2', source=source, level='underlay',
              fill_alpha=0.4, line_width=1, fill_color='navy', line_color=None)
    p.add_layout(b2)

    @count()
    def cb(x):
        y = 10 + np.random.random()
        l1 = y - 2 * np.random.random()
        u1 = y + 2 * np.random.random()
        l2 = l1 - 3 * np.random.random() - 1
        u2 = u1 + 3 * np.random.random() + 1
        source.stream(dict(x=[x], y=[y], l1=[l1], u1=[u1], l2=[l2], u2=[u2]), rollover=200)

    curdoc().add_periodic_callback(cb, 60)
    curdoc().add_root(p)

As an aside, I still think it could be possible to offer a hook to control stream length via a JS callback (instead of specifying a fixed N) e.g. in order to make the length correspond more to a "fixed time duration". This is currently marked as GFI in bokeh/bokeh#7024 but if there is a desire to prioritize it I can re-triage it for more attention. |
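Until such a hook exists, a fixed `rollover` can be chosen to approximate a time window, assuming the callback fires at a steady interval. The `rollover_for` helper below is hypothetical and simply does that arithmetic.

```python
import math


def rollover_for(duration_s, interval_ms):
    """Number of points to keep so the stream spans roughly duration_s seconds."""
    return math.ceil(duration_s * 1000 / interval_ms)


# The example above streams every 60 ms with rollover=200, i.e. about 12 seconds.
points_for_12s = rollover_for(12, 60)
# Keeping one minute of data at a 500 ms update interval.
points_for_60s = rollover_for(60, 500)
```

This only holds while updates arrive on schedule; if callbacks are delayed, the same number of points covers a longer span, which is the gap a duration-based hook would close.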
Awesome, glad to hear!
We can definitely add the pan tool, to all the plots.
I'm not sure if this is possible to coordinate if they are not on the same page. Currently, the
Do we want this for this PR? I believe that we can probably add layouts to the current line plots using bands like in the example above. |
Thanks for all your work on this @ncclementi. Apologies for the delayed response. It looks like distributed/dashboard/tests/test_scheduler_bokeh.py::test_SystemTimeseries
is legitimately failing in CI. Do you have an idea as to what's going on there?
I think we're also nearing a solid checkpoint for this work where we should merge this PR in and then address remaining review comments in a follow-up PR. As @mrocklin mentioned, the current set of changes here will already be valuable to users.
(I'm regularly including this in temporary branches when I run experiments,
I've noticed Gabe doing the same)
|
Yes, I just realized that when I pushed to compute mean instead of sum, forgot to update the tests. Working on it right now, will push the fix soon. |
I added another individual bar plot that shows the disk read/write, since this was part of the original plan too. |
Also, if possible it would be nice to move the legend for read/write on the timeseries plots over to the left. When it's on the right in a small space it blocks the most recent values. I think that moving this to the left makes it block older values, which are less immediately relevant. |
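A short sketch of moving a Bokeh legend to the left, assuming a figure whose lines are labelled "read" and "write". The figure and data are hypothetical stand-ins for the timeseries plots in this PR.

```python
from bokeh.plotting import figure

# Hypothetical stand-in for a read/write timeseries figure.
fig = figure(title="Workers Disk", x_axis_type="datetime")
fig.line([0, 1, 2], [5, 7, 6], legend_label="read", color="red")
fig.line([0, 1, 2], [2, 3, 4], legend_label="write", color="blue")

# The legend only exists once a glyph has a legend_label, so set this last.
fig.legend.location = "top_left"
```

With the x-range following the most recent data on the right, a top-left legend covers older samples instead of the latest ones.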
This shows up sometimes in practice
|
Oh, yes great catch. This is happening because if for some reason we don't have workers up, |
Thanks for all your work here @ncclementi!