Distributed deployment script fails to work on Peanut cluster at UChicago #198

Open
fkengun opened this issue Oct 3, 2024 · 0 comments

fkengun commented Oct 3, 2024

Summary

The distributed deployment script fails on the Peanut cluster at UChicago.

Steps to reproduce

Run the script with the arguments -d -w WORK_DIR -j JOB_ID -n 2 -e after a successful allocation using salloc.

What is the current bug behavior?

  • The script exits abnormally while preparing the hosts file because of a mismatched number of ChronoKeepers.
  • The script exits abnormally while getting remote hostnames.
  • ChronoKeeper fails to launch because of a wrong IP address or hostname in the configuration file.

Initial diagnosis:

  • The hostname list cannot be obtained correctly. Slurm sets the environment variable SLURM_JOB_ID automatically after salloc, which differs from the behavior on Ares. As a result, the script takes the if branch in the prepare_hosts function, and that branch does not work.
  • mpssh enables ssh host key checking by default. Peanut currently has a conflicting key problem, so ssh does not work with key checking enabled.
  • dig is used to get the IP address of a remote hostname, but it returns nothing useful on Peanut. nslookup works on Peanut but fails on Ares.
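
One possible way to work around the last two points is sketched below. It is a minimal sketch and not the script's current code: the names first_ipv4, resolve_ipv4 and SSH_OPTS are hypothetical. It assumes that trying dig, getent and nslookup in turn is acceptable on both clusters, and that disabling host key checking is tolerable inside a trusted allocation.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; none of these names come from the deployment script.

# Print the first IPv4-looking address found on stdin, or nothing if there is none.
first_ipv4() {
    grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -n 1
}

# Resolve a hostname to an IPv4 address, trying each available tool in turn
# so that a tool that returns nothing on one cluster does not abort the lookup.
resolve_ipv4() {
    local host="$1" ip=""
    if command -v dig >/dev/null 2>&1; then
        ip="$(dig +short "$host" A 2>/dev/null | first_ipv4)"
    fi
    if [ -z "$ip" ] && command -v getent >/dev/null 2>&1; then
        ip="$(getent ahostsv4 "$host" 2>/dev/null | first_ipv4)"
    fi
    if [ -z "$ip" ] && command -v nslookup >/dev/null 2>&1; then
        # Drop everything up to the "Name:" line so the DNS server's own
        # address in the header is not mistaken for the answer.
        ip="$(nslookup "$host" 2>/dev/null | sed '1,/^Name:/d' | first_ipv4)"
    fi
    # Succeed only when an address was found.
    [ -n "$ip" ] && printf '%s\n' "$ip"
}

# ssh options that skip host key verification (assumption: acceptable only
# within a trusted cluster allocation, since this weakens ssh security).
SSH_OPTS=(-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR)
```

With these definitions, resolve_ipv4 "$host" prints an address or returns a non-zero status, and a plain ssh call could be written as ssh "${SSH_OPTS[@]}" "$host" hostname. Whether mpssh can be given the same options is not covered here and would need to be checked separately.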

What is the expected correct behavior?

ChronoLog is deployed on multiple nodes. Data from clients can be stored in WORK_DIR/output as CSV files.

Relevant logs and/or screenshots

N/A

@fkengun fkengun self-assigned this Oct 3, 2024
@ibrodkin ibrodkin added this to the 2024-10-04 milestone Oct 15, 2024