Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extraneous mov in Pulley bytecode after register allocation #9942

Open
alexcrichton opened this issue Jan 7, 2025 · 0 comments
Open

Extraneous mov in Pulley bytecode after register allocation #9942

alexcrichton opened this issue Jan 7, 2025 · 0 comments

Comments

@alexcrichton
Copy link
Member

Given this input CLIF:

function u0:0(i32) -> i32 tail {
    sig0 = (i32) -> i32 tail
    fn0 = colocated u0:0 sig0

block0(v2: i32):
    v19 = call fn0(v2)
    v20 = iadd v2, v19
    return v20
}

I'm currently seeing on main

$ cargo run --features pulley compile --target x86_64 -D ../clif/wasm_func_0.clif --set opt_level=speed
.byte 157, 16, 0, 0, 0, 0, 0, 32, 0, 64, 21, 0, 2, 21, 0, 0, 0, 0, 64, 5, 21, 69, 160, 0, 158, 16, 0, 0, 0, 0, 0, 32, 0, 0

Disassembly of 34 bytes <u0:0>:
       0: 9d 10 00 00 00 00 00 20 00         push_frame_save 16, x21
       9: 40 15 00                           xmov x21, x0
       c: 02 15 00 00 00 00                  call1 x21, 0x0    // target = 0xc
      12: 40 05 15                           xmov x5, x21
      15: 45 a0 00                           xadd32 x0, x5, x0
      18: 9e 10 00 00 00 00 00 20 00         pop_frame_restore 16, x21
      21: 00                                 ret

Notably instruction at 12, xmov x5, x21, I'm not sure why that exists. In theory that should not be there and the subsequent xadd32 should be xadd32 x0, x21, x0 and that would remove the need for the insertion of the xmov. This xmov is being inserted as part of register allocation as logging shows that the input program is:

VCode {
  Entry block: 0
Block 0([]):
    (original IR block: block0)
  Inst 0: args v192=x0
  Inst 1: call CallInfo { dest: PulleyCall { name: User(userextname0), args: [XReg(v192)] }, uses: [], defs: [CallRetPair { vreg: Writable { reg: v196 }, preg: p0i }], clobbers: PRegSet { bits: [65534, 65535, 4294967295, 0] }, callee_conv: Tail, caller_conv: Tail, callee_pop_size: 0 }
  Inst 2: xadd32 v195, v192, v196
  Inst 3: rets v195=x0
}

On aarch64 I see:

$ cargo run -q --features pulley compile --target aarch64 -D ../clif/wasm_func_0.clif --set opt_level=speed
.byte 253, 123, 191, 169, 253, 3, 0, 145, 248, 15, 31, 248, 248, 3, 2, 170, 0, 0, 0, 148, 2, 3, 2, 11, 248, 7, 65, 248, 253, 123, 193, 168, 192, 3, 95, 214

Disassembly of 36 bytes <u0:0>:
   0:   fd 7b bf a9             stp     x29, x30, [sp, #-0x10]!
   4:   fd 03 00 91             mov     x29, sp
   8:   f8 0f 1f f8             str     x24, [sp, #-0x10]!
   c:   f8 03 02 aa             mov     x24, x2
  10:   00 00 00 94             bl      #0x10
  14:   02 03 02 0b             add     w2, w24, w2
  18:   f8 07 41 f8             ldr     x24, [sp], #0x10
  1c:   fd 7b c1 a8             ldp     x29, x30, [sp], #0x10
  20:   c0 03 5f d6             ret

which does the right thing after the call instruction at 10 and the add instruction at 14 doesn't have a mov in front. Additionally for x64 it also looks "correct":

$ cargo run -q --features pulley compile --target x86_64 -D ../clif/wasm_func_0.clif --set opt_level=speed
.byte 85, 72, 137, 229, 72, 131, 236, 16, 76, 137, 60, 36, 73, 137, 255, 232, 0, 0, 0, 0, 68, 1, 248, 76, 139, 60, 36, 72, 131, 196, 16, 72, 137, 236, 93, 195

Disassembly of 36 bytes <u0:0>:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 83 ec 10             subq    $0x10, %rsp
   8:   4c 89 3c 24             movq    %r15, (%rsp)
   c:   49 89 ff                movq    %rdi, %r15
   f:   e8 00 00 00 00          callq   0x14
  14:   44 01 f8                addl    %r15d, %eax
  17:   4c 8b 3c 24             movq    (%rsp), %r15
  1b:   48 83 c4 10             addq    $0x10, %rsp
  1f:   48 89 ec                movq    %rbp, %rsp
  22:   5d                      popq    %rbp
  23:   c3                      retq

Notably the addl has no preceding mov instruction.

Is this the Pulley backend perhaps missing something with respect to regalloc metadata?

The generated get_operands method for the instruction here is:

        RawInst::Xadd32 { dst, src1,src2, .. } => {
            collector.reg_use(src1);
collector.reg_use(src2);

            collector.reg_def(dst);

        }

which is in theory pretty similar to how x64/aarch64 both work. I'm not sure if this is falling off a regalloc heuristic or a meaningful difference between x64/aarch64 though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant