Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Worker Network Timeseries #5129

Merged
merged 17 commits into from
Aug 17, 2021
Merged

Worker Network Timeseries #5129

merged 17 commits into from
Aug 17, 2021

Conversation

ncclementi
Copy link
Member

Currently computing the average for read_bytes and write_bytes across workers. If we decided we want the sum we can change that, let me know in the comments.

@mrocklin
Copy link
Member

mrocklin commented Jul 27, 2021 via email

@mrocklin
Copy link
Member

mrocklin commented Jul 27, 2021 via email

@ncclementi
Copy link
Member Author

My preference is for sum over average

I'll modify it to use the sum.
One question I have is do we still do an average on the time (we have one entry per worker)?

Thank you for doing this. Request for future PRs, can I ask you to include a quick screenshot? I suspect that this will make it easier / more exciting for future reviewers :)

Absolutely, I'll modify things to work with sum and add a screenshot/gif.

@ncclementi
Copy link
Member Author

  • Replaced average by sum.
  • Keep using the average of time.

This is how it looks when there are no tasks running.

timeseries

@mrocklin
Copy link
Member

This looks great. For the sake of cleanliness/minimalism I recommend removing the minor ticks on the y-axis and the vertical gridlines (there are hopefully examples of doing this in other parts of the code)

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ncclementi! I just took this for a spin and it looks great. A couple of comments I have (which aren't meant to be blocking):

  1. The y-axis scales up automatically when there's lots of network activity, but doesn't scale down. Not a huge deal as refreshing the browser causes the y-axis to be resized.
  2. We might want to combine the "workers network bandwidth" and "workers network bandwidth timeseries" plots into a single page. Do you have a sense for how much extra work that is? If it's quick and easy, then I'll suggest we do it. Otherwise we can leave it as follow-up work in a future PR.

@mrocklin
Copy link
Member

We might want to combine the "workers network bandwidth" and "workers network bandwidth timeseries" plots into a single page. Do you have a sense for how much extra work that is? If it's quick and easy, then I'll suggest we do it. Otherwise we can leave it as follow-up work in a future PR.

I recommend that we skip this for now. I think that we should get a bunch of plots built, and then think holistically about how to present those plots.

@mrocklin
Copy link
Member

For example, I think that we might want to include a page that includes timeseries for all of cpu/memory/network/disk that include both the timeseries and the real-time per-worker chart.

@mrocklin
Copy link
Member

Actually @ncclementi if it's easy for you to make sure that we have each of those that might be welcome. My hope is that this is not significantly harder than doing one or two. Maybe there is some nice way to avoid code duplication, but even if not I think that that's ok.

@jrbourbeau
Copy link
Member

think holistically about how to present those plots.

+1. One thing to be aware of is that we're already presenting all the individual-* plots through the new "More..." dropdown in the dashboard and the JupyterLab extension:

Screen Shot 2021-07-27 at 4 36 23 PM

As we add a bunch of new individual plots, those two things will become more cluttered. Again, like I said before, this isn't a huge deal, just something to be aware of

@mrocklin
Copy link
Member

I think that the right solution is to provide a more ordered list of plots: dask/dask-labextension#179 (comment)

@jrbourbeau
Copy link
Member

That sounds great for the labextension

@mrocklin
Copy link
Member

That sounds great for the labextension

Maybe we do something similar in our own navbar as well

@ncclementi
Copy link
Member Author

For example, I think that we might want to include a page that includes timeseries for all of cpu/memory/network/disk that include both the timeseries and the real-time per-worker chart.

I got a bit lost in here. Are you trying to say that we get something like what is on /system but has on one column the time series of cpu/memory/network/disk and next to it the corresponding bar plot.

To not duplicate code it seems the approach is to create an HTML template with a custom css. (I'm not that familiarized with html/css this might take some time)

Currently, all the plots in /system are made in one class, will have to investigate if I can only extract the ones we need ie. cpu/memory, and then create a page like the status page which uses a status.html template based on this css

The other approach would be going for something like what is on /system which creates all the plots together in one class and uses column() from bokeh to put one under the other. This will lead to code duplication if we want the page and the plot as an individual plot too.

@ncclementi
Copy link
Member Author

As we add a bunch of new individual plots, those two things will become more cluttered. Again, like I said before, this isn't a huge deal, just something to be aware of

I wonder if it's possible to adjust the dropdown to only include the plots that are not somewhere else in the dashboard.
BUt this is just an idea, I wouldn't know how to go about it.

@mrocklin
Copy link
Member

My original point is let's not worry too much about layout right now and just get a lot of charts out. We can ask HTML/CSS folks to assmble them for us in the future if we desire.

However, you probably should be aware that I'm likely going to ask you to replicate these network plots for all of CPU/memory/network/disk, and so if there are efficienceis to be had in doing those together then we should probably consider them.

@ncclementi
Copy link
Member Author

However, you probably should be aware that I'm likely going to ask you to replicate these network plots for all of CPU/memory/network/disk, and so if there are efficiencies to be had in doing those together then we should probably consider them.

Sure, I guess I'm a little confused on what's remaining.

  1. Aren't CPU and memory already cover between whats on /system and the main status page (CPU tab on left bottom corner and workers memory) --> or we want something different?
  2. Network read_bytes and write_bytes should be cover between this PR (timeseries) and Add WorkerNetworkBandwidth chart to dashboard #5104
  3. Disk, yes it's pending and on my TODO list.

@mrocklin
Copy link
Member

Aren't CPU and memory already cover between whats on /system and the main status page (CPU tab on left bottom corner and workers memory) --> or we want something different?

The system page only reports information for the machine that the scheduler is running on. We're also curious about this same information, but aggregated across machines

@ncclementi
Copy link
Member Author

The system page only reports information for the machine that the scheduler is running on. We're also curious about this same information, but aggregated across machines

Oh I see, so we are missing the timeseries that have the aggregated data that comes from ws.metrics["cpu"] and ws.metrics["memory"]

With this in mind would it be useful have all the time series cpu/memory/network/disk on one page? This should be straight forward (once I add the disk metrics) and reduce code lines. In simpler words follow the format of the SystemMonitor Class but with the aggregated info coming from ws.metrics

class SystemMonitor(DashboardComponent):

@mrocklin
Copy link
Member

mrocklin commented Jul 27, 2021 via email

@ncclementi ncclementi changed the title Worker Network Bandwidth Timeseries Worker Network Timeseries Jul 28, 2021
@ncclementi
Copy link
Member Author

Here is a look at the 4 timeseries plots. But I was wondering couple of things:

  • if I needed to add a test for adding the disk metrics, wasn't sure exactly what.
  • I think some names of the plots are not good, same as labels, not quite sure about some.
  • the get_data function on SystemTimeseries I wonder if there is a better way of doing it (it looks ugly)
  • the SystemTimeseries test on test_scheduler_bokeh.py is a bit messy too.

test_timeseries

@mrocklin
Copy link
Member

the get_data function on SystemTimeseries I wonder if there is a better way of doing it (it looks ugly)

It looks simple and straightforward to me

@ncclementi
Copy link
Member Author

For the cpu and memory limits, I've been trying to get something reasonable but using sum at least in the case of cpu seems a bit much, (see below). I was checking what do we do in the current timeseries on /system and we leave them without and y_range.end (Is there a particular reason we want to go differently on this case?)

Adding this to the update

  if self.scheduler.workers:
      y_end_cpu = sum(ws.nthreads or 1 for ws in self.scheduler.workers.values())
      y_end_mem = sum(ws.memory_limit for ws in self.scheduler.workers.values())
  else:
      y_end_cpu = 1
      y_end_mem = 100_000_000

  self.cpu.y_range.end = y_end_cpu * 100
  self.memory.y_range.end = y_end_mem

With no computations:
Screen Shot 2021-07-29 at 3 21 45 PM

If I run something like the code below, it's hard to see with such a high y_range.end

import dask.array as da

x = da.random.random((50_000, 50_000), chunks=(1_000, 1_000))
result = (x + x.T).mean(axis=0).mean()
result.compute()

Screen Shot 2021-07-29 at 3 23 19 PM

@mrocklin
Copy link
Member

Interesting, I would have expected the machine to peak at 800%. I'm curious, if you look at something like top during this time is it not using several hundred precent of CPU?

@jrbourbeau
Copy link
Member

jrbourbeau commented Jul 29, 2021

@ncclementi and I are walking through CPU utilization discrepancy offline and it looks like the code here is correct, but there's an upstream issue with psutil.cpu_percent() on M1 chip Macs (xref giampaolo/psutil#1956)

EDIT: I should note that I see the expected CPU load in the dashboard on my laptop, which is also a Mac, but not with an M1 chip

@mrocklin
Copy link
Member

Good to merge then?

@jrbourbeau
Copy link
Member

Let's hold off for now. I'm slightly concerned the SystemMonitor.update() method is doing more work than we want it to for the individual plots. I'll do some quick benchmarking to see if this is an issue or not

@mrocklin
Copy link
Member

mrocklin commented Aug 1, 2021

I took this for a spin this morning. It has already helped immeasurably :)

Some feedback from usage:

  1. It would be useful to have the ability to pan back in history a bit. I think that this doable if we add the pan tool and keep a long enough history when we call stream(...). The other timeseries plots do this and can serve as an example.
  2. Let's make all of the figures share an x_range. That way when we slide back and forth they'll all slide together. That is only though, if they all share the same SystemTimeseries instance. We might want to do something clever here to make only one of these things and have all plots share the same object.
  3. I think that @ncclementi suggested this, but at some point we should consider making an aggregate view with all four of these timeseries, and probably all four of their realtime equivalents off to the side. Probably not in this PR, but I do think that a single view there would be valuable. I could even imagine putting this into the JLab individual-plots.json listing.
  4. I'm going to walk back my "sum vs mean" stance from before. We should maybe show mean. I actually still value sum more than mean, but I want to get a sense for the distribution of values in each timeseries. I'd love to have some sort of shaded area for one standard deviation around the mean. Or even, if we wanted to get fancy, two shades for one standard deviation, and also min/max. This probably requires going beyond line glyphs.
  5. I am curious what happens when we're on a single machine. I suspect that disk/network measurements are systemwide, and so we're maybe over-counting if we have multiple workers on the same host. We can clean this up later, but it might be wise to uniquify values based on host address.

@bryevdv I'm not sure if you're around and are interested in these sorts of things. If you are then I encourage you to speak up. I suspect that you're able to hammer these out very quickly.

@bryevdv
Copy link
Contributor

bryevdv commented Aug 2, 2021

Or even, if we wanted to get fancy, two shades for one standard deviation, and also min/max. This probably requires going beyond line glyphs.

@mrocklin I haven't actually seen either Band or varea used in conjunction with streaming (personally or from any users). I can't think of any reasons offhand why there would be any issue, but you know how things go. Once this PR is merged I am happy to take a look at this as a follow-on, especially just to see and document how Bokeh performs in this specific scenario.

Edit: If you are OK using alpha compositing for different "bands" (which seems the simplest thing to do) then streaming bands is fairly straightforward and works very well

import numpy as np
from bokeh.driving import count
from bokeh.models import Band
from bokeh.plotting import ColumnDataSource, curdoc, figure

source = ColumnDataSource(data=dict(x=[], y=[], l1=[], u1=[], l2=[], u2=[]))

p = figure(width=900, y_range=(0, 20))
p.x_range.follow = "end"
p.x_range.follow_interval = 80
p.x_range.range_padding = 0
p.line('x', 'y', color="white", source=source)

b1 = Band(base='x', lower='l1', upper='u1', source=source, level='underlay',
            fill_alpha=0.4, line_width=1, fill_color='navy', line_color=None)
p.add_layout(b1)

b2 = Band(base='x', lower='l2', upper='u2', source=source, level='underlay',
            fill_alpha=0.4, line_width=1, fill_color='navy', line_color=None)
p.add_layout(b2)

@count()
def cb(x):
    y = 10 + np.random.random()
    l1 = y - 2 * np.random.random()
    u1 = y + 2 * np.random.random()
    l2 = l1 - 3 * np.random.random() - 1
    u2 = u1 + 3 * np.random.random() + 1
    source.stream(dict(x=[x], y=[y], l1=[l1], u1=[u1], l2=[l2], u2=[u2]), rollover=200)

curdoc().add_periodic_callback(cb, 60)

curdoc().add_root(p)

It would be useful to have the ability to pan back in history a bit. I think that this doable if we add the pan tool and keep a long enough history when we call stream(...).

As an aside, I still think it could be possible to offer a hook to control stream length via a JS callback (instead of specifying a fixed N) e.g. in order to make the length correspond more to a "fixed time duration". This is currently marked as GFI in bokeh/bokeh#7024 but if there is a desire to prioritize it I can re-triage it for more attention.

@ncclementi
Copy link
Member Author

ncclementi commented Aug 3, 2021

I took this for a spin this morning. It has already helped immeasurably :)

Awesome, glad to hear!

  1. It would be useful to have the ability to pan back in history a bit. I think that this doable if we add the pan tool and keep a long enough history when we call stream(...). The other timeseries plots do this and can serve as an example.

We can definitely add the pan tool, to all the plots.

  1. Let's make all of the figures share an x_range. That way when we slide back and forth they'll all slide together. That is only though, if they all share the same SystemTimeseries instance. We might want to do something clever here to make only one of these things and have all plots share the same object.

I'm not sure if this is possible to coordinate if they are not on the same page. Currently, the x_range used for all the plots is the same but they are not going to be in sync unless they are on the same plot page. But if we do this like what we have in /system then we can't have them as separate plots.

  1. I'm going to walk back my "sum vs mean" stance from before. We should maybe show mean. I actually still value sum more than mean, but I want to get a sense for the distribution of values in each timeseries. I'd love to have some sort of shaded area for one standard deviation around the mean. Or even, if we wanted to get fancy, two shades for one standard deviation, and also min/max. This probably requires going beyond line glyphs.

Do we want this for this PR. I believe that we can probably add layouts to the current line plots using bands like in the example above.
An intermediate step would be plotting the average and we can have the min, max, and std on an annotation like we have in /system CPU and Memory.

@mrocklin
Copy link
Member

mrocklin commented Aug 9, 2021

For reference, this is what my current JLab session looks like

image

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work on this @ncclementi. Apologies for the delayed response. It looks like distributed/dashboard/tests/test_scheduler_bokeh.py::test_SystemTimeseries is legitimately failing in CI. Do you have an idea as to what's going on there?

I think we're also nearing a solid checkpoint for this work where we should merge this PR in and then address remaining review comments in a follow-up PR. As @mrocklin mentioned, the current set of changes here will already be valuable to users.

@mrocklin
Copy link
Member

mrocklin commented Aug 12, 2021 via email

@ncclementi
Copy link
Member Author

Thanks for all your work on this @ncclementi. Apologies for the delayed response. It looks like distributed/dashboard/tests/test_scheduler_bokeh.py::test_SystemTimeseries is legitimately failing in CI. Do you have an idea as to what's going on there?

I think we're also nearing a solid checkpoint for this work where we should merge this PR in and then address remaining review comments in a follow-up PR. As @mrocklin mentioned, the current set of changes here will already be valuable to users.

Yes, I just realized that when I pushed to compute mean instead of sum, forgot to update the tests. Working on it right now, will push the fix soon.

@ncclementi
Copy link
Member Author

I added another individual a bar plot that has the disk read/write since this was part of the original plan too.
The plot is in the More tab under "Workers disk"

@mrocklin
Copy link
Member

Also, if possible it would be nice to move the legend for read/write on the timeseries plots over to the left. When it's on the right in a small space it blocks the most recent values. I think that moving this to the left makes it block older values, which are less immediately relevant.

@mrocklin
Copy link
Member

This shows up sometimes in practice

  File "/home/mrocklin/workspace/distributed/distributed/dashboard/components/scheduler.py", line 854, in update
    max(x_read),
ValueError: max() arg is an empty sequence

@ncclementi
Copy link
Member Author

This shows up sometimes in practice

  File "/home/mrocklin/workspace/distributed/distributed/dashboard/components/scheduler.py", line 854, in update
    max(x_read),
ValueError: max() arg is an empty sequence

Oh, yes great catch. This is happening because if for some reason we don't have workers up, x_read, x_write, etc are empty and it's trying to get the max() of an empty list. I push something that fixes it.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work here @ncclementi!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Observed worker network bandwidth chart
4 participants