reV Generation Module Fails When Using SLURM on AWS ParallelCluster #501

Open
ODOU opened this issue Dec 24, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@ODOU

ODOU commented Dec 24, 2024

Issue Description

I'm analyzing renewable energy (RE) potential in West Africa using reV, to connect with the ReEDS model for capacity expansion. Following the AWS ParallelCluster setup guide (https://nrel.github.io/reV/misc/examples.aws_pcluster.html), I've run into an issue when switching to the SLURM execution option.

Current Behavior

  • With execution_control option set to local: the run successfully generates the h5 file in ~25 minutes for 300 sc points.
  • With execution_control option set to slurm: the run fails to generate the output h5 file.
  • Command used: reV generation -c config_gen_WestAfrica.json

Expected Behavior

  • The model should generate an h5 output file when run with SLURM execution mode on the AWS cluster

Environment Details

AWS ParallelCluster Configuration:

  • Region: us-west-2
  • OS: Amazon Linux 2
  • Head Node: t2.large
  • Compute Nodes: c5.2xlarge (0-8 nodes)
  • Scheduler: SLURM
  • Shared Storage: 5000GB EBS (gp2) mounted at /shared

Multiple errors found


During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/miniconda3/envs/hsds_env/lib/python3.8/site-packages/rex/utilities/utilities.py", line 336, in check_res_file
with h5pyd.Folder(hsds_dir + '/') as f:
File "/home/ec2-user/miniconda3/envs/hsds_env/lib/python3.8/site-packages/h5pyd/_hl/folders.py", line 204, in __init__
rsp = self._http_conn.GET(req)
File "/home/ec2-user/miniconda3/envs/hsds_env/lib/python3.8/site-packages/h5pyd/_hl/httpconn.py", line 480, in GET
raise IOError("Connection Error")
OSError: Connection Error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "", line 1, in
File "/home/ec2-user/miniconda3/envs/hsds_env/lib/python3.8/site-packages/gaps/cli/config.py", line 594, in run_with_status_updates
out = run_func(**run_kwargs)
File "/shared/reV/reV/generation/generation.py", line 436, in __init__
self._multi_h5_res, self._hsds = check_res_file(resource_file)
File "/home/ec2-user/miniconda3/envs/hsds_env/lib/python3.8/site-packages/rex/utilities/utilities.py", line 351, in check_res_file
raise FileNotFoundError(msg) from ex
FileNotFoundError: /nrel/nsrdb/meteosat/meteosat_2019.h5 is not a valid file path, and HSDS cannot be check for a file at this path:Connection Error!
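The final error message suggests that the resource path is first treated as a file on disk, and only when that fails is it looked up on HSDS. The following is a minimal sketch of that decision order for diagnostic purposes, not the actual rex implementation. The function name `classify_resource_path` and the `s3://` branch are assumptions added for illustration.

```python
import os
from urllib.parse import urlparse


def classify_resource_path(path):
    """Guess how a resource path would have to be resolved.

    Returns one of 's3', 'local', or 'hsds-candidate'.
    This mirrors the order implied by the traceback, not rex's real code.
    """
    # Assumption: S3 locations are written as s3:// URIs.
    if urlparse(path).scheme == "s3":
        return "s3"
    # A path that exists on this node can be opened directly.
    if os.path.exists(path):
        return "local"
    # Not on disk, so a reachable HSDS server would be needed to serve it.
    return "hsds-candidate"


if __name__ == "__main__":
    print(classify_resource_path("/nrel/nsrdb/meteosat/meteosat_2019.h5"))
```

If the path from the config is reported as an HSDS candidate on the compute node, the run depends on the HSDS server being up on that node.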


Attached Files

  1. Full error log file
  2. Configuration file (config_gen_WestAfrica.json)
  3. start_hsds.sh script

I've tried various configuration options but haven't been able to resolve the issue. Any assistance would be greatly appreciated.

config_gen_WestAfrica.json
westafrica_generation.log
westafrica_generation_j0_79.e.txt

ODOU added the bug label Dec 24, 2024
@ppinchuk
Collaborator

I haven't had a chance to look into this properly yet, but an initial glance through your log file (westafrica_generation_j0_79.e.txt) suggests that the hsds setup script is not completing properly. Looks like the failure point is cd ~/hsds/, and it's failing with /home/ec2-user/start_hsds.sh: line 58: cd: /home/ec2-user/hsds/: No such file or directory. So maybe the hsds folder is not being properly copied to the EC2 instances?
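One way to test this hypothesis is to check for the folder both on the head node and from inside a SLURM job, then compare the results. The snippet below is a rough sketch using only the standard library. The helper name `describe_path` is made up for illustration, and `~/hsds` is taken from the error message above.

```python
import socket
from pathlib import Path


def describe_path(path):
    """Report which host this runs on and whether `path` is a directory there."""
    resolved = Path(path).expanduser()
    return {
        "host": socket.gethostname(),
        "path": str(resolved),
        "is_dir": resolved.is_dir(),
    }


if __name__ == "__main__":
    # ~/hsds is the directory that start_hsds.sh tries to cd into.
    print(describe_path("~/hsds"))
```

Running the script once in a login shell and once through the scheduler (for example with `srun python <script>`) would show whether the compute node sees the same home directory contents as the head node.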

@ODOU
Author

ODOU commented Dec 25, 2024

Thanks for the reply.
I checked again: the folder exists and the path works when I use the command line (see screenshot below).
image

It seems to connect to the folder now, but throws another connection error.


Error response from daemon: No such container: hsds_sn_1
Traceback (most recent call last):
File "/home/ec2-user/miniconda3/envs/hsds_env/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/home/ec2-user/miniconda3/envs/hsds_env/lib/python3.8/site-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/home/ec2-user/miniconda3/envs/hsds_env/lib/python3.8/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:


westafrica_generation_j4_98.txt
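A "Connection refused" error usually means nothing is listening on the target port. Before launching reV, a quick check run on the compute node can confirm whether the HSDS endpoint accepts TCP connections at all. This is a minimal sketch: the endpoint `http://localhost:5101` is an assumption (use whatever `hs_endpoint` is configured), and `endpoint_is_reachable` is a hypothetical helper name.

```python
import socket
from urllib.parse import urlparse


def endpoint_is_reachable(endpoint, timeout=3.0):
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parsed = urlparse(endpoint)
    if parsed.hostname is None:
        raise ValueError(f"Cannot parse a host from endpoint: {endpoint!r}")
    # Fall back to the scheme's default port if none is given.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Assumed endpoint; replace with the hs_endpoint value in use.
    print(endpoint_is_reachable("http://localhost:5101"))
```

A result of False on the compute node, together with the "No such container" message, would be consistent with the HSDS containers never having started there.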

@ppinchuk
Collaborator

Hello, sorry for the delayed response.

Based on the log file you attached, it looks like your HSDS server is still not launching correctly. I think from here you have two options:

  1. The hsds startup shell script you are using is really outdated. See this note:

Note that these instructions were originally developed and tested in February 2022 and have not been maintained. The latest instructions for setting up HSDS local servers can be found in the rex docs page: HSDS local server instructions. The best way to run reV on an AWS PCluster with HSDS local servers may be a combination of the instructions below and the latest instructions from the rex docs page.

If you want to continue pursuing this route, you will need to update the shell script to properly set up an HSDS server. We have some guidance on setting up a local HSDS server here, but we do not maintain a single script that does this.

  2. We recently added a new capability to specify S3 filepaths directly to reV. If you point to S3 filepaths, you wouldn't have to set up an HSDS server at all (though execution may be somewhat slower).

I think option 2 may be the easiest, but hopefully this gives you enough info to continue to make progress on this.
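For option 2, the change would likely be limited to the resource path in the generation config. The sketch below shows one way to produce an updated copy of the config. It assumes the config stores the path under a `resource_file` key, and the S3 URI shown is a hypothetical placeholder, not the real location of the Meteosat data.

```python
import json


def point_config_at_s3(config, s3_uri):
    """Return a copy of `config` whose resource_file is the given S3 URI."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Expected an s3:// URI, got: {s3_uri!r}")
    updated = dict(config)  # shallow copy so the original stays untouched
    updated["resource_file"] = s3_uri
    return updated


if __name__ == "__main__":
    original = {
        "resource_file": "/nrel/nsrdb/meteosat/meteosat_2019.h5",
        "execution_control": {"option": "slurm"},
    }
    # Hypothetical bucket and key, for illustration only.
    new_config = point_config_at_s3(
        original, "s3://example-bucket/path/to/meteosat_2019.h5"
    )
    print(json.dumps(new_config, indent=2))
```

The correct bucket and key for the dataset should be taken from the data provider's documentation rather than from this example.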
