
Add JEDI QG model example cycled DA workflow #85

Closed
wants to merge 96 commits

Conversation

Collaborator

@christopherwharrop-noaa commented Mar 18, 2024

This PR adds a collection of Parsl wrappers that configure and execute the various applications needed for a simple cycled 3DVar JEDI QG workflow. The wrappers consist of "configure" and "run" tasks that execute separately on different resources. The "configure" tasks will run on either a "service" (when internet access is required) or a "serial" resource while the "run" tasks can execute on compute nodes in parallel. Wrappers are created for: installing JEDI, running a "truth" forecast, creating simulated observations from the "truth", running 3DVar using the simulated obs and a background from a previous cycle forecast, and running a forecast initialized from the analysis produced by 3DVar.

These are not the most efficient or best-organized wrappers. Much more thought should be put into a design that can be applied to future applications and experiment types. However, an effort was made to build configurable components that can be used to run cycled 3DVar experiments with varying configuration options.

An experiment program, workflow.py, was created to illustrate how the wrappers can be used to build a workflow.
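
For orientation, here is a minimal sketch of the configure/run pattern using plain Parsl. The executor labels, app names, and the qg_forecast.x executable are illustrative assumptions for this sketch, not the actual chiltepin API:

    import parsl
    from parsl import bash_app, python_app
    from parsl.config import Config
    from parsl.executors import HighThroughputExecutor

    # Two executors mirroring the "service"/"serial" vs. compute split
    # described above (labels and default local providers are illustrative).
    parsl.load(Config(executors=[
        HighThroughputExecutor(label="service"),
        HighThroughputExecutor(label="compute"),
    ]))

    @python_app(executors=["service"])
    def configure_forecast(workdir):
        # Render the forecast's YAML config (placeholder logic).
        import os
        os.makedirs(workdir, exist_ok=True)
        path = f"{workdir}/forecast.yaml"
        with open(path, "w") as f:
            f.write("forecast: {}\n")
        return path

    @bash_app(executors=["compute"])
    def run_forecast(config_path, stdout=None, stderr=None):
        # Launch the QG forecast executable (executable name is hypothetical).
        return f"qg_forecast.x {config_path}"

    # One step of a cycle: passing the configure future into run_forecast
    # makes Parsl defer the run until configuration completes.
    cfg = configure_forecast("/tmp/qg_expt")
    run_forecast(cfg, stdout="forecast.out", stderr="forecast.err").result()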

Automated tests for the various wrappers have not been added yet because the best approach is not yet clear. All such tests would depend on a JEDI installation, which takes time to build and would take even longer in CI. Testing will be added once a strategy is determined.

Closes #84

* Update LICENSE and README to be consistent with repository contents

* Add basic install scripts for Spack and ExaWorks SDK

* Add basic test for ExaWorks installation
* Add ExaWorks Docker containers and update ExaWorks install scripts

Dockerfiles for both Ubuntu and CentOS containers are added.
Scripts for installing Spack and Exaworks are updated.
Add basic "hello" Parsl test script.
Modify CI workflow to use containers for testing ExaWorks.

* Rename CI workflow

Modify CI workflow names to be consistent with current usage.

* Fix typo in comment and add newline at end of Docker files
* Add CI test for Docker container and update checkout action for test example CI

Also, add py-pylint to the Spack install script to enable better testing.

* Restructure ExaWorks installation into small pieces

Create installation scripts for each main ExaWorks component.
Create Docker files for each main ExaWorks component.
Use Spack build caches to optimize installation times.
Update Github Actions workflow to leverage Spack build caches.
Update Github Actions workflow to split install into different build jobs.

* Add missing colon to workflow CI yml file

* Fix another missing colon in the workflow yaml file

* Rework Docker containers and install scripts

Install scripts and containers were modified to build
and use a "bootstrap" compiler for building the ExaWorks
software.  This is necessary because dependencies shared
between various components of ExaWorks will induce multiple
rebuilds of the gcc compiler, which takes a very long time.
This is avoided by bootstrapping gcc with the native
compiler, and then rebuilding the newer gcc with itself
in a "base" environment.  In this way, gcc is built twice
for the base environment, but subsequent environments
using the "base" environment will not rebuild gcc.

This commit also adds CentOS7 containers for the
multi-stage build of exaworks that breaks it into
smaller pieces with shorter build times. The Github
Actions workflow was updated to add CentOS containers.

* Fix typo in script comment
* Consolidate Dockerfiles for ExaWorks constituents

There were Dockerfiles for Ubuntu20 and CentOS7 builds
of ExaWorks and its constituents.  The Dockerfiles
for the base environment were significantly different
but the ones for the constituent pieces only differed
in the initial FROM statement at the top.  This commit
removes the duplicate Dockerfiles for the constituent
parts and uses build arguments (e.g. ARG) to specify the
OS to use at build time.  This eliminates 5 extraneous
Dockerfiles and increases maintainability.

* Fix typo in GHA workflow for CentOS container

* Use environment variable in CI workflow to control OS used for containers

* Use env context to access container os variable
* Modify CI so that Parsl examples are tested for latest containers

* Remove unused CI workflow

* Fix typo in container bind mount for parsl test CI workflow
* Split install of bootstrap compiler into its own script

Do not remove the compiler bootstrap environment because it's
needed when installing from a build cache.

Add explicit specs for compilers to disambiguate.

* Update Dockerfiles for Spack exaworks base install

* Fix accidental incorrect placement of ./install_bootstrap.sh in Dockerfile

* Fix comment "boostrap" typos
* Add miniconda3 and newer gcc, parsl, flux

* Rework install scripts to use pip for Parsl and RADICAL

This allows the latest versions of Parsl and RADICAL to be
installed via pip with miniconda3.  It works around the
problem that Spack packages are not kept up to date for
Parsl and RADICAL.  It also drastically speeds up the
installation process by eliminating extremely long builds
of Parsl and RADICAL dependencies.

* Checksums are missing for Flux, so don't try to retrieve them
* Create README with a list of links to workflow resources

* Add list markdown for resource items

* Fix typo

* Fix swift/t github link
* Add slurm cluster docker files

* Add docker container with test for Parsl+Flux+MPI

* Clean up unnecessary files and change Docker build context

* Refactor CI workflows for efficiency

* Remove unnecessary docker entrypoint files and attempt to optimize docker layer caching in CI
* Add a status badge to the README

* Don't use caching in CI when pushing container images to registry

* Remove use of docker image cache in push steps
* Consolidate install scripts

* Update CI workflow to reflect install script changes
* Initialize Intel compilers silently
Update Dockerfiles and CI workflow to remove loading of Flux Spack env.
* Reorganize tests to compare output against baselines

* Fix typo in CI yaml config

* Fix typo in CI test yaml

* Fix incorrect python for installing parsl via pip

* Fix parsl_flux_resource_list baseline for CI resources
* Add portable install scripts

Install scripts must account for cases where the default
compiler differs from the one needed in the Flux Spack
environment.

* Fix typos in Dockerfile

* chown Spack install to admin:admin

* Make all three containers use same shell init

* Refresh Spack view after install to get correct PYTHONPATH and CPATH

* Call flux/parsl install script twice to repair pip damage
* Split installation steps into separate docker layers

* Rearrange installation of spack and flux/parsl

* Use docker compose v2

* Print compose version

* Add ghcr docker layer caching back in

* Set up intel oneapi and flux env activation in /etc/profile.d

* Fix spack environment setup

* Fix typos in GHA workflow config

* Fix comments in Dockerfiles
* Install Flux with conda instead of Spack

This removes use of Spack for installing the required
packages.  Miniconda3 is installed instead and used to
install both Parsl and Flux.  The "chiltepin" conda
environment is activated at login time.

* Add initialization to both .bashrc and .bash_profile
* Switch to GNU compilers and JEDI spack-stack environment

* Fix /opt copy mistake and add spack-stack Dockerfile

* Update CI workflow

* Turn off "load" option since container cannot be both pushed and loaded

* Fix typo in CI workflow yaml

* Remove debug "which" from parsl/flux test program

* Remove commented line and add verification for MPI test

* Try changing string quoting in CI yaml

* Escape the quote in CI workflow

* Remove tab from CI YAML file
* Automate creation of custom spack-stack Dockerfile

* Fix incorrect comments, add arguments for stack to use

* Enable selection of stack for tests

* Add test for spack-stack stack

* Update parsl_flux_mpi_hello baseline to work with new tests
* Small fixes to conda environment and creation of machine-specific stacks

* Update chiltepin stack for container use

* Create a rudimentary stack factory

* Add rudimentary factory for config

* Move config to YAML files

* Update CI to use yaml file argument and remove obsolete JEDI stack calls

* Update baseline test data to match change in test code
* Remove obsolete ExaWorks directory

* Update frontend container's /opt to be consistent with spack-stack changes
* Restructure repo to have proper locations for package source and tests
Port tests to pytest framework

* Update chiltepin conda env to add pytest

* Rework container to use user modifiable conda install

Specify version 7.4.0 of pytest to work around bugs

* Add pytest fixture to set up and teardown parsl for flux tests

* Improve tests to check for output correctness

* Fix bug, only remove previous output if it exists

* Update CI workflow to use pytest

* Remove unneeded test baseline output
* Add test to ensure concurrent MPI programs don't use overlapping resources (#45)

* Add test to ensure concurrent MPI programs don't use overlapping resources

Add mpi_pi.f90 test program for tests requiring longer run times.
Update config factory to set cores per node in the provider constructor.

Add test

* Fix line length issue in mpi_pi.f90

Create ci.yaml configuration specifically for CI testing.
* Update documentation to provide the most basic instructions for use

* Update README with docker usage

* Fix grammatical errors in README
The latest Parsl breaks the latest version of Flux that
is available via conda.  So we explicitly set the version
of Parsl to get the latest one compatible with Flux.
@@ -0,0 +1,49 @@
def leadtime_to_seconds(fcst):
Collaborator

Since we renamed this from fcst_to_seconds to leadtime_to_seconds, should we also take advantage of the change and rename the parameter being passed in from fcst to leadtime?

Collaborator Author

That's a great idea! It does look confusing now.

Comment on lines 1 to 23
from chiltepin.jedi.leadtime import leadtime_to_seconds, seconds_to_leadtime

def test_leadtime_to_seconds():
    assert leadtime_to_seconds("PT0S") == 0
    assert leadtime_to_seconds("PT1M") == 0
    assert leadtime_to_seconds("PT1H") == 3600
    assert leadtime_to_seconds("PT1H30M") == 3600

    assert leadtime_to_seconds("P1D") == 86400
    assert leadtime_to_seconds("P1DT1H") == 90000
    assert leadtime_to_seconds("P1DT1H30M") == 90000

    assert leadtime_to_seconds("MT0S") == -0
    assert leadtime_to_seconds("MT1H") == -3600
    assert leadtime_to_seconds("MT1H30M") == -3600

    assert leadtime_to_seconds("M1D") == -86400
    assert leadtime_to_seconds("M1DT1H") == -90000
    assert leadtime_to_seconds("M1DT1H30M") == -90000
    assert leadtime_to_seconds("M1DT1H30M") == -90000


Collaborator
@NaureenBharwaniNOAA Mar 28, 2024

Several of these expected values are off. There are also only 13 entries for leadtime_to_seconds but 14 supplied values for seconds_to_leadtime, so the two lists don't line up. The numbers in the code do pass the tests as written; the incorrect values are called out in the comments below.

leadtime.leadtime_to_seconds

  • "PT0S"
  • "PT1M"
  • "PT1H"
  • "PT1H30M"
  • "P1D"
  • "P1DT1H"
  • "P1DT1H30M"
  • "MT0S"
  • "MT1H"
  • "MT1H30M"
  • "M1D"
  • "M1DT1H"
  • "M1DT1H30M"

leadtime.seconds_to_leadtime

  • 0
  • 60
  • 3600
  • 5400
  • 86400
  • 90000
  • 91800
  • -0
  • -60
  • -3600
  • -5400
  • -86400
  • -90000
  • -91800

Collaborator Author

assert leadtime_to_seconds("PT1M") == 0
should be
assert leadtime_to_seconds("PT1M") == 60

Collaborator Author

assert leadtime_to_seconds("PT1H30M") == 3600
should be
assert leadtime_to_seconds("PT1H30M") == 5400

Collaborator Author

assert leadtime_to_seconds("P1DT1H30M") == 90000
should be
assert leadtime_to_seconds("P1DT1H30M") == 91800

Collaborator Author

assert leadtime_to_seconds("MT1H30M") == -3600
should be
assert leadtime_to_seconds("MT1H30M") == -5400

Collaborator Author

assert leadtime_to_seconds("M1DT1H30M") == -90000
should be
assert leadtime_to_seconds("M1DT1H30M") == -91800

Collaborator Author

It appears I accidentally left out the MT1M case, which should evaluate to -60. Perhaps that missing case caused the leadtime-to-seconds mapping in the issue to get misaligned. Sorry about that. I'll go edit the issue and fix it.

Collaborator

Changed all instances of fcst to leadtime
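
For reference, here is a minimal sketch of a leadtime parser consistent with the corrected values above. It is an illustration only, not the actual chiltepin.jedi.leadtime implementation, and it assumes a leading "M" (in place of "P") marks a negative duration:

    import re

    def leadtime_to_seconds(leadtime):
        # Parse an ISO-8601-style duration such as "P1DT1H30M".
        # Sketch assumption: a leading "M" instead of "P" negates the
        # duration, e.g. "M1DT1H30M" -> -91800.
        sign = -1 if leadtime.startswith("M") else 1
        match = re.fullmatch(
            r"[PM](?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?",
            leadtime,
        )
        if match is None:
            raise ValueError(f"unrecognized leadtime: {leadtime}")
        days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
        return sign * (86400 * days + 3600 * hours + 60 * minutes + seconds)

    assert leadtime_to_seconds("PT1M") == 60           # corrected case from above
    assert leadtime_to_seconds("P1DT1H30M") == 91800   # corrected case from above
    assert leadtime_to_seconds("M1DT1H30M") == -91800  # corrected case from above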
