Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added simple tile size calculation #2

Open
wants to merge 12 commits into
base: impala
Choose a base branch
from

Conversation

cod3monk
Copy link

It is currently tailored to 2D boxed stencils, but can be generalized

@richardmembarth
Copy link
Member

This seems to block only for L1 cache: if cur_lvl == 0 { ... tile(cur_lvl + 1, ...) and is almost 4x slower on my machine compared to what is used currently :)

I guess you have to pass cur_lvl to get_tile_dims and iterate only over the cache sizes relevant for the current level, starting from L3 down to L1. It seems also that get_tile_dims() gets not partially evaluated (lower2cff hangs). @leissa can you have a look at this?

For debugging, you can define a Boolean debug_tiling and set it to false in production mode.

@cod3monk
Copy link
Author

cod3monk commented Feb 27, 2017

@with the code in 9720ea8 I get 478ms meadian(27) and with my version in c097a4f I get 493 ms. We are talking about gaussian with iteration_advanced, right? If I increase the threshold for loop-overhead vs. blocking efficiency to 2000, it forces the calculation to block for L2 and yields 487 ms. So I can hardly find any performance increase with L1 vs L2 blocking.

You mention blocking for multiple levels, with the given image size of 4096x4096, the L1 block_size needs to be <= 1638 elements, for all other caches the layer conditions are fulfilled this the original size (L2 would need blocking with <=13107 and L3 <= 1M elements). Did I misunderstand what you mean by blocking for other levels?

//print_string(", ");
//print_int(mask.size_y);
//print_string(")\n");
if debug_tiling @{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No @ is required here - the frontend will already discard the conditional code if debug_tiling is false.

@richardmembarth
Copy link
Member

Your current logic starts tile with cur_lvl = 0, adds tiling for one cache level and calls then tile with cur_lvl = 1. However, the code checks only for cur_lvl == 0 and emits otherwise the body. Hence, there will be at most tiling for one cache level.

@richardmembarth
Copy link
Member

To make debugging programs using PE easier, we will add an instruction to show the "status" during PE, something like pe_info("message", var), which will print the value of var during PE if var is static and a warning if var is dynamic, that is, not partially evaluated.

@richardmembarth
Copy link
Member

richardmembarth commented Mar 1, 2017

Regarding speed: your version takes 286 ms and the version on master takes only 76 ms on my laptop (gaussian & iteration_advanced).
My guess is that the calls to get_tile_dims() are not partially evaluated and this gets executed at runtime. Once we have the pe_info() instruction such cases are easy to debug.

@richardmembarth
Copy link
Member

Ok, we've added the pe_info() instruction, see here how to use it: 91af79f
You need the latest version of impala / thorin / runtime for it to work and you also want the latest changes from Stincilla.

Adding pe_info() to your code:

let (xtile_dim, ytile_dim) = @get_tile_dims(mask, img);

pe_info("xtile_dim", xtile_dim);
pe_info("ytile_dim", ytile_dim);

Shows that ytile_dim is a constant (4096 of type int), but xtile_dim is not (variable _20511 and _47437):

I:stincilla/mapping_cpu.impala:205 col 9 - 41: pe_info: xtile_dim: _20511
I:stincilla/mapping_cpu.impala:206 col 9 - 41: pe_info: ytile_dim: qs32 4096
I:stincilla/mapping_cpu.impala:205 col 9 - 41: pe_info: xtile_dim: _47437
I:stincilla/mapping_cpu.impala:206 col 9 - 41: pe_info: ytile_dim: qs32 4096

That is, some condition in get_tile_dims() is not partially evaluated.
Talking to Roland, it turned out that we generate load and stores for mutable variables, e.g. for let mut max_distance = (0, 0);

There are two solution:

  • functional-style programming to avoid mutable variables
  • folding of loads and stores on our side

@leissa
Copy link
Member

leissa commented Mar 10, 2017

To be clear:
Having mutable variables is fine for the partial evaluator as long as the address is not taken from the variable. But this happens implicitly when the variable is captured in a closure.

Examples:

  • let mut x = 42;
    let p = &x;
    x += 3;
    // ...
    Because the address of x is taken here impala generates loads and store which - unless another optimization kicks in - the partial evaluator will not fold at the moment.
  • let mut x = 42;
    || { /*...*/ x /*...*/ }
    Same thing here because the address of x is implicitly taken within the closure.
  • let mut x = 42;
    for i in range(a, b) { /*...*/ x /*...*/ }
    Same as above since the block expression is just a closure in this case.

That being said, note that usually, other optimizations in thorin will fold those loads & stores afterwards. So folding these loads & stores is only critical if control flow depends on such variables.

I think we should fold loads & store during partial evaluation. This is, however, more complicated than it sounds. So, don't expect it to be fixed next week :(

@cod3monk
Copy link
Author

That was a a measurement mistake (last deleted comment). Is there a way to parse command line arguments? I would like to switch between the different versions dynamically as to avoid such mistakes.

@madmann91
Copy link
Contributor

You can:

  • Pass directly the result of parsing the command line arguments to some Impala function and call that Impala function from C:
fn impala_main(do_thing: i32) -> () {
  if do_thing == 42 {
    // ...
  }
}
extern "C" void impala_main(int);

int main(int argc, char** argv) {
  int do_thing = parse_args(argc, argv);
  impala_main(do_thing);
}
  • Parse the arguments in impala using strcmp, for example:
extern "C" {
  fn strcmp(&[u8], &[u8]) -> i32;
}

fn main(argc: i32, argv: &[&[u8]]) -> i32 {
  for i in range(1, argc) {
    if !strcmp(argv(i), "-h") {
      usage();
      return(0)
    } else if !strcmp(argv(i), ...) {
      // ... and so on
    }
  }
}

@richardmembarth
Copy link
Member

Some of our tests in Impala use argc and argv, for example nbody

I'm going to have a look at your changes tomorrow.

@richardmembarth
Copy link
Member

I had a look at the code, first a minor remark:

  • you can have pe_info() also in non-debug builds, no need to check for debug_tiling. Actually pe_info() is a no-op and removed by the compiler after printing.

Looking at execution times:

  • iteration: ~207 ms
  • iteration_bounds: ~58 ms
  • iteration_advanced: ~313 ms
  • iteration_advanced (master): ~57 ms

So it seems something is going wrong here ...
Adding the following code in addition to the pe_info("x|y tile dim", ...)

pe_info("x lower | upper", (xl, xu));
pe_info("y lower | upper", (yl, yu));

shows the problem:

...
x lower | upper: ((qs32, qs32) tuple qs32 2, qs32 4094)
y lower | upper: ((qs32, qs32) tuple qs32 2, qs32 4094)
...
x lower | upper: _39537
y lower | upper: _39540
...

xl and xu are only known when we enter tile() the first time. For the second call to tile(), xl and xu are not know since min() and max() can't be evaluated. That is, the loops for cur_lvl == 0 should be partially evaluated, but not those when calling body().
Applying these changes gives the same execution time as iteration_bounds: ~58 ms

I can push these changes to your fork if you give me the permissions.

@richardmembarth
Copy link
Member

I've pushed my changes.

Maybe also good to know: pe_info() messages will be only printed if you increase the log-level of Impala to info. This is done by Stincilla. By default, log-level is only set to error.

@richardmembarth
Copy link
Member

Enabling fast-math gives:

  • iteration_bounds: ~56 ms
  • iteration_advanced: ~51 ms

@richardmembarth richardmembarth deleted the branch AnyDSL:impala September 5, 2022 11:06
@richardmembarth richardmembarth changed the base branch from master to impala September 5, 2022 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants