macOS nightly wheel builds failing since 2024-11-19 #7019

Closed
swolchok opened this issue Nov 21, 2024 · 25 comments

@swolchok
Contributor

🐛 Describe the bug

Status page: https://github.com/pytorch/executorch/actions/workflows/build-wheels-m1.yml
Note that the Python 3.9 build always fails, so even though the runs are red, they were successful through 2024-11-18.

Linking is failing with ld: invalid use of ADRP in '_init_f32_vcopysign_config' to '_xnn_f32_vcopysign_ukernel__neon_u8'.

Versions

N/A

@swolchok
Contributor Author

Inspection of PRs landed between the last good build and first bad build suggested the following:

Trial revert of #6837 in #7013 still failed the job; trialing revert of the other two PRs together

swolchok added a commit that referenced this issue Nov 21, 2024
This reverts commit 5b4d9bb. Attempting to debug/fix #7019.
@swolchok
Contributor Author

trial revert of #6522 in #7020 did not fix the job

@swolchok
Contributor Author

trial revert of #6892 in #7021 did not fix the job.

I am also not able to repro this locally, and I've inspected git diff 8526d0a2d798658b6a6e3a42ec935b8093f355ef..04f6fcd4b3920eaf1be9905d12b449f301f89ca7 without finding anything else suspicious, so I wonder if the runners broke somehow
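
One way to enumerate the commits in that range is sketched below as a small shell helper. The function name `commits_between` is hypothetical (it is not part of the executorch repo), and the example assumes it is run inside an executorch checkout with the two hashes quoted above.

```shell
#!/bin/sh
# Hypothetical helper: list the commits reachable from the first bad
# revision but not from the last good one, one line per commit.
commits_between() {
    good="$1"
    bad="$2"
    git log --oneline "${good}..${bad}"
}

# Example, inside an executorch checkout:
#   commits_between 8526d0a2d798658b6a6e3a42ec935b8093f355ef \
#                   04f6fcd4b3920eaf1be9905d12b449f301f89ca7
```

Each listed commit can then be checked against the failing job, which is roughly what the trial reverts above do by hand.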

@swolchok
Contributor Author

> I wonder if the runners broke somehow

I reran the last good workflow run; builds succeeded (there were some failures due to an unrelated issue).

@swolchok swolchok self-assigned this Nov 21, 2024
@larryliu0820
Contributor

Found a failure with the same error message in a different job (test-llama-runner-mac): https://github.com/pytorch/executorch/actions/runs/11959891658/job/33342737621?pr=7010

@swolchok
Contributor Author

> Found a failure with the same error message in a different job (test-llama-runner-mac): https://github.com/pytorch/executorch/actions/runs/11959891658/job/33342737621?pr=7010

that job is green on trunk runs though! https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=llama-runner-mac%20(fp32%2C%20mps

@kimishpatel
Contributor

I'm late to this, so I'm not sure my comments will help, but was there any change related to an XNNPACK upgrade? I ask because the job failure is related to XNNPACK.

@swolchok
Contributor Author

@larryliu0820 suggested maybe the runner toolchain changed.

It looks like we're using macos-m1-stable runners for test-llama-runner-mac: https://github.com/pytorch/executorch/blob/main/.github/workflows/trunk.yml#L236. I'm not sure what runner the wheel build uses.

I don't know a whole lot about this runner type, but 1) it seems to be in-house (pytorch/pytorch#127490), and 2) I don't see recent activity in https://github.com/pytorch-labs/pytorch-gha-infra/ suggesting that there was a recent update.

@swolchok
Contributor Author

> any change related to xnnpack upgrade

as I mentioned above, I inspected all the commits (there aren't many) in the range of commit hashes flagged in the nightly builds.

@larryliu0820
Contributor

An example of trunk job passing:

https://github.com/pytorch/executorch/actions/runs/11962683652/job/33351640398

An example of PR job failing:

https://github.com/pytorch/executorch/actions/runs/11959891658/job/33342745520?pr=7010

I don't see an obvious difference between these two in the environment setup.

@huydhn anything obvious to you?

@swolchok
Contributor Author

Another example: PR jobs failing on #7044; tbd if they fail consistently

@swolchok
Contributor Author

interesting that a large block of jobs all failed on the same PR. Points to some piece of shared state being the cause, either the repo state itself or sccache

@swolchok
Contributor Author

@wdvr is it a potential problem that our Mac builds are still on sccache 0.4.1? I see that you updated the ubuntu build to 0.8.2 in #6837

@swolchok
Contributor Author

I am now able to repro! gh pr checkout 7040; ./install_requirements.sh --pybind xnnpack

@swolchok
Contributor Author

reverting backends/xnnpack/third-party/XNNPACK to ad0e62d69815946be92134a56ed3ff688e2549e8 (updated in #6101) does not fix it

@swolchok
Contributor Author

removing --pybind xnnpack from the install_requirements.sh line does fix it (duh), so perhaps we couldn't repro with setup.py because we weren't doing whatever magic to build XNNPACK.

@swolchok
Contributor Author

just reconfirmed that ./install_requirements.py --pybind xnnpack does not repro on main; must gh pr checkout 7040 first.

@huydhn
Contributor

huydhn commented Nov 23, 2024

> @wdvr is it a potential problem that our Mac builds are still on sccache 0.4.1? I see that you updated the ubuntu build to 0.8.2 in #6837

sccache uses the file path, the compiler name, and the compiler flags in the cache key. So there shouldn't be any issue from the 0.8.2 update on Ubuntu, as the two caches are well isolated.
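
To illustrate why keying on those inputs keeps the caches apart, here is a toy sketch. It is not sccache's actual implementation: the function `toy_cache_key`, the use of `cksum` as the hash, and the example path and flags are all assumptions made for illustration.

```shell
#!/bin/sh
# Toy model only: derive a cache key from the source path, compiler name,
# and flags. sccache's real key derivation is more involved; this only
# shows that a different compiler or flag set yields a different key.
toy_cache_key() {
    src_path="$1"
    compiler="$2"
    flags="$3"
    printf '%s\n%s\n%s\n' "$src_path" "$compiler" "$flags" | cksum | cut -d ' ' -f 1
}

# Hypothetical inputs for a Linux and a macOS build of the same file.
linux_key=$(toy_cache_key "src/copysign.c" "gcc" "-O2")
macos_key=$(toy_cache_key "src/copysign.c" "clang" "-O2 -arch arm64")

if [ "$linux_key" != "$macos_key" ]; then
    echo "distinct cache entries"
fi
```

Under this model, a Linux object could only be served to a macOS build if the path, compiler, and flags all matched.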

@larryliu0820
Contributor

Oh it could be coming from PyTorch. #7010 only bumps PyTorch pin and jobs are failing. It seems #7044 is also bumping the pin?

@malfet
Contributor

malfet commented Nov 25, 2024

There was a recent XNNPACK update in PyTorch. If ET depends directly on XNNPACK but its version is older, that can easily create a problem, as macOS, unlike Linux, does not have -fvisibility=hidden set by default.
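
One way to check for this kind of clash is to list which XNNPACK symbols a library exports. The sketch below filters `nm -g`-style output read from stdin. The function name `exported_xnn_symbols` and the library path in the usage comment are hypothetical, not taken from the failing build.

```shell
#!/bin/sh
# Hypothetical helper: read `nm -g`-style lines ("address type name") on
# stdin and print the names of defined, externally visible symbols that
# start with xnn_ (with or without a leading underscore).
# An uppercase type letter marks an external symbol; undefined references
# ("U" lines) carry no address, have only two fields, and are skipped.
exported_xnn_symbols() {
    awk 'NF == 3 && length($2) == 1 && $2 ~ /[[:alpha:]]/ && $2 == toupper($2) && $3 ~ /^_?xnn_/ { print $3 }'
}

# Example (the library path is a placeholder):
#   nm -g path/to/library.dylib | exported_xnn_symbols
```

Running it over both the PyTorch library and the ET extension and comparing the two lists would show whether the same `xnn_` names are exported from both.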

@swolchok
Contributor Author

> recent xnnpack update

for clarity, the update is pytorch/pytorch#139913 and landed on 11/18, the day before nightlies started failing, so it's very suspicious. I've asked @digantdesai / @mcr229 about this internally; tagging them here as well for visibility.

@huydhn
Contributor

huydhn commented Nov 28, 2024

I think #6538 doesn't fix the issue, as it's still showing up on the latest nightly with the change in place https://github.com/pytorch/executorch/actions/runs/12060458350/job/33630916538. I should have added ciflow/binaries to run the build on the PR; then there would have been a clear signal there.

@larryliu0820
Contributor

Should we revisit commits between 11/18 nightly and 11/19 nightly? 8526d0a...04f6fcd

@larryliu0820
Contributor

Repro steps:

pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cpu 
export CMAKE_ARGS=' -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_COREML=ON -DEXECUTORCH_BUILD_MPS=ON'
export EXECUTORCH_BUILD_PYBIND=1
python setup.py bdist_wheel

@mcr229
Contributor

mcr229 commented Jan 14, 2025

closing as it was fixed

@mcr229 mcr229 closed this as completed Jan 14, 2025