
Add warnings, errors, and tips to benchmark report #4

Open
BrianHicks opened this issue May 11, 2018 · 3 comments

@BrianHicks (Collaborator)

Issue by jwoudenberg
Monday Jun 05, 2017 at 12:35 GMT
Originally opened as BrianHicks/elm-benchmark#13


In the same vein as the Elm compiler, wouldn't it be really nice if elm-benchmark gave us warnings, errors, and tips to help us write better benchmarks? From working with Brian a bit, I know he has tons of context on this, part of which could be automatically distributed in the benchmark report.

Below is an outline of some of Brian's tips that I remember, to give an idea of the type of helpful messages that could be displayed. (A sketch of how these checks might look in code follows the list.)

  • Standard deviations are large:
    • Run counts are low
      • "Try to make the benchmark run faster by narrowing down the code it runs to the part you're actually trying to benchmark."
      • "Perhaps you're not actually writing a micro-benchmark? Consider using a different tool. Here's a good one: <link>."
    • Run counts are not low
      • "These programs are known to interfere with benchmark performance. Try closing them and running the benchmark again."
      • "Did none if the above work? Please report an issue to help us make elm-benchmark better! <link>"
  • Standard deviation is larger than delta:
    • "The benchmark is inconclusive."
    • Should the delta even be shown in this scenario?
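
Here's a rough sketch of how that decision tree could look in code. Everything below is invented for illustration — the record fields, thresholds, and message wording are placeholders, not elm-benchmark's actual API:

```elm
module ReportAdvice exposing (warnings)

{-| Map hypothetical report stats to helpful messages. The thresholds
(10% of the mean for "large" standard deviation, 1000 for "low" run
counts) are placeholders, not tuned values.
-}
warnings : { stdDev : Float, mean : Float, runCount : Int, delta : Maybe Float } -> List String
warnings { stdDev, mean, runCount, delta } =
    let
        largeStdDev =
            stdDev > 0.1 * mean

        lowRunCount =
            runCount < 1000
    in
    List.concat
        [ if largeStdDev && lowRunCount then
            [ "Try to make the benchmark run faster by narrowing down the code it runs."
            , "Perhaps you're not actually writing a micro-benchmark? Consider using a different tool."
            ]

          else if largeStdDev then
            [ "Some programs are known to interfere with benchmark performance. Try closing them and running the benchmark again."
            , "Did none of the above work? Please report an issue to help us make elm-benchmark better!"
            ]

          else
            []
        , case delta of
            Just d ->
                if stdDev > abs d then
                    [ "This comparison is inconclusive: the standard deviation is larger than the delta." ]

                else
                    []

            Nothing ->
                []
        ]
```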
@BrianHicks (Collaborator, Author)

Comment by BrianHicks
Wednesday Dec 06, 2017 at 23:01 GMT


FWIW I'm reducing these numbers down to two in the next version: runs per second and goodness of fit. Runs per second is pretty self-descriptive, but goodness of fit is not. In the new version, we vary the sample size in order to generate a trend line, and goodness of fit is a measure of the error in that trend. It's expressed as a percentage, and higher is better. So this advice will end up close to the following (a sketch of computing goodness of fit follows the list):

  1. total samples are low (number TBD but related to samples/bucket): same advice as "run counts are low" above.
  2. 5% of buckets have points outside 2 sigma (exact numbers TBD): high outlier count, try re-running (just reloading the page will keep the JIT hot enough to avoid these, usually.)
  3. goodness of fit is less than 95%: there may be interference on the system. Try closing programs or tabs that are consuming significant system resources (Slack, Spotify are typical candidates) and re-running.
  4. goodness of fit is less than 85%: there's something really wrong; don't trust these results.
    1. same advice on closing heavy tabs or programs
    2. if that doesn't solve it, try increasing the sample time
    3. if that doesn't solve it, show up in #elm-benchmark on the Elm Slack and we'll try to get you sorted out. There's probably some error this tool can't detect, or we need to account for your system setup in the sampling approach.
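
To make "goodness of fit" concrete: it can be computed as R², which compares the trend line's residual error to the total variance of the samples. A minimal sketch, assuming a least-squares trend line through the origin over (sample size, runtime) points — the function names here are mine, not the library's:

```elm
module Fit exposing (goodnessOfFit)

{-| Least-squares slope for a trend line through the origin:
slope = Σxy / Σx². Assumes at least one point with a nonzero x.
-}
slope : List ( Float, Float ) -> Float
slope points =
    List.sum (List.map (\( x, y ) -> x * y) points)
        / List.sum (List.map (\( x, _ ) -> x * x) points)


{-| R² = 1 - (residual sum of squares / total sum of squares), scaled
to a percentage. Higher is better; 100% means every point sits exactly
on the trend line.
-}
goodnessOfFit : List ( Float, Float ) -> Float
goodnessOfFit points =
    let
        m =
            slope points

        ys =
            List.map Tuple.second points

        meanY =
            List.sum ys / toFloat (List.length ys)

        ssResidual =
            List.sum (List.map (\( x, y ) -> (y - m * x) ^ 2) points)

        ssTotal =
            List.sum (List.map (\y -> (y - meanY) ^ 2) ys)
    in
    (1 - ssResidual / ssTotal) * 100
```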

@BrianHicks (Collaborator, Author)

Comment by BrianHicks
Wednesday Dec 06, 2017 at 23:08 GMT


Also, the new approach solves these in the following ways:

  • Standard deviations are large: trends on linear data are about equally susceptible to this problem, but goodness of fit is a single intuitive metric that we can easily generate advice from.
  • Standard deviation is larger than delta: varying sample size means we just don't have this problem… but we do have other ones. Goodness of fit measures this, too.
  • Run counts are low: we're going to take more, smaller samples. This allows for larger benchmarks. It still has some problems, but hopefully in fewer cases.
  • System interference: a few outliers should not throw everything off (see the sketch after this list).
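
For point 2 in my earlier comment, the outlier check could be as simple as counting residuals beyond two standard deviations of the fitted trend. A sketch, again with invented names and placeholder thresholds:

```elm
{-| Flag a run when more than 5% of samples fall more than two standard
deviations from the fitted trend line. `trendSlope` is the slope of a
trend line through the origin; points are (sample size, runtime) pairs.
-}
tooManyOutliers : Float -> List ( Float, Float ) -> Bool
tooManyOutliers trendSlope points =
    let
        residuals =
            List.map (\( x, y ) -> y - trendSlope * x) points

        n =
            toFloat (List.length points)

        sigma =
            sqrt (List.sum (List.map (\r -> r * r) residuals) / n)

        outliers =
            List.filter (\r -> abs r > 2 * sigma) residuals
    in
    toFloat (List.length outliers) / n > 0.05
```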

In addition, I'm adding lots of charts. Just looking at the data reveals problems more often than you'd suspect; humans are very good at saying "hey, that's weird..." and not trusting the results. For example, I can show the sample points, which makes outliers easy to spot, along with jags due to system spikes. If I show the trend line, it'll obviously be a good or bad fit (though the fit is somewhat susceptible to outliers).

@BrianHicks (Collaborator, Author)

Actually, @jwoudenberg, do you think this is resolved in the latest version? Did you have a chance to try a 2.x release?
