Issues with run_neon with the --experiment flag starting in ctsm5.1.dev172 #2433

Closed
wwieder opened this issue Mar 22, 2024 · 19 comments · Fixed by #2435
Labels: enhancement (new capability or improved behavior of existing capability), investigation (needs to be verified and more investigation into what's going on), support (user or developer needs help)

Comments

@wwieder
Contributor

wwieder commented Mar 22, 2024

Brief summary of bug

It seems like in the NEON refactor for PLUMBER2 we lost some capabilities of the --experiment flag? Specifically, creating an AD case that includes the experiment name in the case name.

General bug information

CTSM version you are using: dev175

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: running NEON cases with --experiment flag.

Details of bug

I created two cases, one with dev171 & another with dev175, that were intended to evaluate impacts of dead arctic veg on NEON simulations in Alaska. The following command with dev175 creates an AD case, as expected, but without the experiment flag in the case name.

./run_neon --neon-sites BARR --run-type ad --experiment dev175 --output-root /glade/derecho/scratch/wwieder/neon_AK

The same command using dev171 produces case names like this: BARR.dev171.ad
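
To make the expected naming concrete, here is a minimal sketch of how a case name of the form site.experiment.run_type could be assembled. The helper `build_case_name` is hypothetical and is not the actual run_neon code; it only illustrates the dev171 behavior described above and what the name looks like when the experiment is dropped.

```python
from typing import Optional


def build_case_name(site: str, run_type: str, experiment: Optional[str] = None) -> str:
    """Hypothetical helper: join the non-empty name parts with dots."""
    parts = [site]
    if experiment:
        parts.append(experiment)
    parts.append(run_type)
    return ".".join(parts)


# Expected (dev171-style) name when --experiment is honored.
print(build_case_name("BARR", "ad", experiment="dev171"))  # BARR.dev171.ad

# Name produced when the experiment never reaches the naming logic.
print(build_case_name("BARR", "ad"))  # BARR.ad
```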

The cases are all in my scratch directory:
/glade/u/home/wwieder/scratch/neon_AK

@wwieder wwieder added enhancement new capability or improved behavior of existing capability support user or developer needs help investigation Needs to be verified and more investigation into what's going on. next this should get some attention in the next week or two. Normally each Thursday SE meeting. labels Mar 22, 2024
@wwieder wwieder added this to the CESM3 milestone Mar 22, 2024
@wwieder wwieder changed the title Have updates to run_neon deprecated previous features with the --experiment flag? Have updates to run_neon deprecated previous features with the --experiment flag? Mar 22, 2024
@ekluzek
Collaborator

ekluzek commented Mar 22, 2024

Thanks for pointing this out @wwieder. This must be in #2363, but because it's a refactoring step it's hard to point to the exact line.

This is where adding more testing for all the important configurations run_neon is used in would be critical to prevent this type of thing. We test a few -- but obviously not enough.

So @TeaganKing and @wwieder, it would be good to sit down together and map out the list of important combinations of options you use for run_neon. It looks like right now our testing is really minimal, and we also need to test interactions of command-line options with each other (like in this case).

Having this kind of comprehensive testing enables us to refactor and improve the design without losing important functionality. Rather than adding comprehensive testing up front we've been incrementally adding testing as we go, which in this case led to this problem. In Agile SDM it's considered important to have the comprehensive testing up front so you know you can refactor without regressing functionality. So we might want to discuss what kind of balance we want to have between those two things as we move forward.
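
As a rough illustration of testing option interactions, the sketch below loops over every combination of run type and experiment and checks the resulting case name. Everything here is an assumption for illustration: `build_case_name` is a toy stand-in for the naming logic, not the function used by run_neon, and a real system test would call run_neon itself.

```python
import itertools
import unittest


def build_case_name(site, run_type, experiment=None):
    """Toy stand-in for the naming logic under test (hypothetical)."""
    parts = [site] + ([experiment] if experiment else []) + [run_type]
    return ".".join(parts)


class TestOptionInteractions(unittest.TestCase):
    """Check each run type with and without an experiment name."""

    def test_run_type_and_experiment(self):
        run_types = ["ad", "postad", "transient"]
        experiments = [None, "dev175"]
        for run_type, experiment in itertools.product(run_types, experiments):
            with self.subTest(run_type=run_type, experiment=experiment):
                name = build_case_name("BARR", run_type, experiment)
                self.assertTrue(name.startswith("BARR."))
                self.assertTrue(name.endswith("." + run_type))
                if experiment:
                    self.assertIn("." + experiment + ".", name)


# Run the suite programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestOptionInteractions)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Using `subTest` means a failure in one combination is reported with its parameters while the remaining combinations still run.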

@TeaganKing
Contributor

Thanks for catching this bug and making this issue. A few other notes I wanted to add are below.

  • Without the run type specified, it looks like the actual build is not completing, and requires an extra step of running ./case.build separately from the main case directory (not the .transient case directory). I think this may be a more significant problem.
  • I've made a fix to how run_neon.py calls run_case to specify the optional arguments which might resolve this issue; I'm testing a few examples now. I likely won't have actual results before Monday morning, but wanted to share a status update.
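
One general way a refactor can silently drop an option is positional argument passing: if parameters are added or reordered, later values shift into the wrong slots. The sketch below is hypothetical. The `run_case` here is a toy with an assumed parameter list, not the CTSM signature, and it is not a description of the actual fix; it only shows why passing optional arguments by keyword is more robust.

```python
def run_case(base_case_root, run_type="transient", prism=False,
             run_length="0Y", user_version=None, experiment=None):
    """Toy stand-in; parameter names and defaults are assumptions."""
    parts = [base_case_root]
    if experiment:
        parts.append(experiment)
    parts.append(run_type)
    return ".".join(parts)


# Positional call written against a different parameter order: "dev175"
# lands in the user_version slot, so the experiment is silently ignored.
print(run_case("BARR", "ad", False, "0Y", "dev175"))  # BARR.ad

# Keyword call: each value reaches the parameter it was meant for.
print(run_case("BARR", run_type="ad", experiment="dev175"))  # BARR.dev175.ad
```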

@wwieder
Contributor Author

wwieder commented Mar 24, 2024

A few more errors after trying to start postad runs

---- cloning the base case in /glade/derecho/scratch/wwieder/neon_AK/BARR.postad
Traceback (most recent call last):
  File "./run_neon", line 48, in <module>
    main(__doc__)
  File "/glade/u/home/wwieder/CTSM/tools/site_and_regional/../../python/ctsm/site_and_regional/run_neon.py", line 241, in main
    experiment,
  File "/glade/u/home/wwieder/CTSM/tools/site_and_regional/../../python/ctsm/site_and_regional/neon_site.py", line 103, in run_case
    base_case_root, run_type, prism, run_length, user_version, tower_type, user_mods_dirs
  File "/glade/u/home/wwieder/CTSM/tools/site_and_regional/../../python/ctsm/site_and_regional/tower_site.py", line 394, in run_case
    self.set_ref_case(case)
  File "/glade/u/home/wwieder/CTSM/tools/site_and_regional/../../python/ctsm/site_and_regional/tower_site.py", line 230, in set_ref_case
    symlink_force(reffile, os.path.join(rundir, os.path.basename(reffile)))
  File "/glade/u/home/wwieder/CTSM/cime/CIME/utils.py", line 1473, in symlink_force
    raise e
  File "/glade/u/home/wwieder/CTSM/cime/CIME/utils.py", line 1467, in symlink_force
    os.symlink(target, link_name)
FileNotFoundError: [Errno 2] No such file or directory: '/glade/derecho/scratch/wwieder/neon_AK/BARR.ad/run/BARR.ad.clm2.r.1018-01-01-00000.nc' -> '/glade/derecho/scratch/wwieder/neon_AK/BARR.postad/run/BARR.ad.clm2.r.1018-01-01-00000.nc'

@wwieder
Contributor Author

wwieder commented Mar 24, 2024

It seems like the run directory isn't being created in the new .postad case, so the initial conditions can't be copied and the jobs fail. Manually copying files into the run directory still doesn't pick up the right restart files (not sure what else isn't happening correctly...)
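
For context, `os.symlink` does not check that its target exists, but on POSIX systems it does raise `FileNotFoundError` when the directory that should hold the new link is missing, which is consistent with the run directory not having been created. The sketch below is a hypothetical helper (`link_restart` is not part of CTSM or CIME) showing one defensive pattern: verify the restart file, create the run directory, then link.

```python
import os
import tempfile


def link_restart(reffile: str, rundir: str) -> str:
    """Hypothetical helper: symlink a restart file into rundir with clear errors."""
    if not os.path.isfile(reffile):
        raise FileNotFoundError(f"restart file missing: {reffile}")
    os.makedirs(rundir, exist_ok=True)  # create the run directory if needed
    link_name = os.path.join(rundir, os.path.basename(reffile))
    if os.path.lexists(link_name):
        os.remove(link_name)  # replace a stale link, like a forced symlink
    os.symlink(reffile, link_name)
    return link_name


# Demonstration in a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    ref = os.path.join(tmp, "BARR.ad", "run", "BARR.ad.clm2.r.nc")
    os.makedirs(os.path.dirname(ref))
    open(ref, "w").close()
    link = link_restart(ref, os.path.join(tmp, "BARR.postad", "run"))
    print(os.path.islink(link))  # True
```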

@TeaganKing
Contributor

Thanks for these extra details, @wwieder !

@slevis-lmwg
Contributor

I may have fixed the same failure in this test with a suggestion from @ekluzek:
SSPMATRIXCN_Ly5_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianRs.izumi_intel.clm-ciso_monthly
@TeaganKing and/or @wwieder maybe we could discuss at Stand-up this morning? Or it would also be fine to meet another time.

@TeaganKing
Contributor

Hi @slevis-lmwg , Is there another stand-up other than the Tuesday 3pm one? @ekluzek , @wwieder , and I were planning to discuss testing strategies at 11am this morning. I think the PR mentioned above includes a fix that seems to be working, but if you have already implemented this and/or want to discuss alternative methods to fixing this issue, I'd be happy to chat!

@slevis-lmwg
Contributor

Mondays 10 am MT we have the ctsm software stand-up. The error message seems the same but our fixes are different. I think that yours seems fine in the context of this code.

@TeaganKing
Contributor

TeaganKing commented Mar 25, 2024

Okay, thanks. I have a different 10-11am meeting, but I might be able to join at the end of the meeting if we wrap up my other meeting early... Otherwise maybe Erik and Will can bring a summary of where we're at to the stand up, and then I can discuss with them a bit at 11am. Or, if you want to point me to your fix, it might be helpful to see what you did, too. Did you implement/merge this in already?

@slevis-lmwg
Contributor

I just pushed the code change in 64870fa (PR #640)

@wwieder
Contributor Author

wwieder commented Mar 25, 2024

I confirmed that @TeaganKing's fixes in #2435 address the --experiment bug.

The issue with creating postad cases seems to be more related to the externals update in dev172 (see #2437), which should be addressed separately.

@TeaganKing
Contributor

Thanks for testing this @wwieder ! I also did some tests on run_type and it does look like this argument is now being passed correctly, and transient is used as the default if not specified, as well.
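
The default behavior described here can be sketched with argparse. This is an illustrative parser, not the one in run_neon.py: the option names are taken from the commands quoted in this thread, while the `choices` list, the `--neon-sites` default, and the program name are assumptions.

```python
import argparse


def get_parser() -> argparse.ArgumentParser:
    """Illustrative parser; choices and defaults are assumptions."""
    parser = argparse.ArgumentParser(prog="run_neon_sketch")
    parser.add_argument("--neon-sites", nargs="+", default=["all"])
    parser.add_argument(
        "--run-type",
        choices=["ad", "postad", "transient"],
        default="transient",  # used when --run-type is not given
    )
    parser.add_argument("--experiment", default=None)
    return parser


# No --run-type on the command line, so the default applies.
args = get_parser().parse_args(["--neon-sites", "ABBY", "--experiment", "test"])
print(args.run_type)  # transient
```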

@TeaganKing
Contributor

I am however running into some issues such as the following. @wwieder are you seeing similar results, as well?

Command submitted:
./run_neon --neon-sites ABBY --experiment test_no_run-type_specified

Error:
ERROR: Build complete is not True please rebuild the model by calling case.build

@wwieder
Contributor Author

wwieder commented Mar 25, 2024

Agreed, @TeaganKing, I'm getting this error too. A bit further up in the output I'm also seeing:

File /glade/derecho/scratch/wwieder/neon_AK/RMNP.no_runtype_test.transient/LockedFiles/env_build.xml has been modified
found difference in CALENDAR : case 'GREGORIAN' locked 'NO_LEAP'
Setting build complete to False

I'm not really sure where the calendar is getting set or changed, but maybe this is a clue?
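
The message suggests a comparison between the current build settings and a locked copy saved at build time. The sketch below is a toy model of that idea, not CIME's implementation: `diff_locked` and the dictionaries are hypothetical, and the only values taken from this thread are the CALENDAR settings. It shows how changing a setting after the build would flip the build-complete state.

```python
def diff_locked(case_values: dict, locked_values: dict) -> list:
    """Toy comparison of current settings against a locked snapshot."""
    diffs = []
    for key in sorted(locked_values):
        if case_values.get(key) != locked_values[key]:
            diffs.append((key, case_values.get(key), locked_values[key]))
    return diffs


locked = {"CALENDAR": "NO_LEAP"}  # snapshot taken when the case was built
case = dict(locked)
case["CALENDAR"] = "GREGORIAN"  # setting changed after the build

for key, now, was in diff_locked(case, locked):
    print(f"found difference in {key} : case {now!r} locked {was!r}")

# Any difference means the earlier build can no longer be trusted.
build_complete = not diff_locked(case, locked)
print(build_complete)  # False
```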

@ekluzek
Collaborator

ekluzek commented Mar 25, 2024

In our meeting this morning we decided that, to make this problem less likely, @TeaganKing will do the following:

This would exercise the most important options to run_neon and give much better test coverage for them.

Also putting #2438 in place would help us with externals issues.

@TeaganKing
Contributor

TeaganKing commented Mar 26, 2024

It looks like the calendar differences error is only occurring with the newest CIME checkout. I'm still a bit stumped as to why it's occurring.

With the same command as above, I'm getting a different error with the previous CIME checkout:
WARNING: buildlib is being called as a program rather than a subroutine as it is expected to be in the CESM context.

However, upon trying to run ./case.build directly from the case directory, I do still see an error relating to env_build.xml having changed (found difference in CALENDAR : case 'GREGORIAN' locked 'NO_LEAP'). So perhaps the same issue is just appearing differently in both cases due to CIME changes.

@TeaganKing
Contributor

TeaganKing commented Mar 26, 2024

> In our meeting this morning we thought to make this problem less likely, @TeaganKing will do the following:
>
> This would exercise the most important options to run-neon and gives much better test coverage for the options in run-neon.
>
> Also putting #2438 in place would help us with externals issues.

I don't seem to be able to check off items in your comment, but these are addressed in #2406 -- with the caveat that I think we discussed not actually testing prism and specifying 'ad' run in the second test.

@TeaganKing
Contributor

TeaganKing commented Mar 26, 2024

Just to add documentation as I'm working through this, there's also a line case.set_value("CALENDAR", "GREGORIAN") in tower_site.py that may be the culprit here. This was also previously in neon_site in dev171, so I'm still not entirely sure why it would cause issues now.

@samsrabin samsrabin removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Mar 28, 2024
@ekluzek
Collaborator

ekluzek commented Mar 29, 2024

With ctsm5.2 work I ended up doing the python testing for recent versions of ctsm5.1. And I noticed that ctsm5.1.dev175 fails as follows (previous versions pass from dev171 to dev174).

I think this might be helpful to @TeaganKing @wwieder and @slevis-lmwg. @slevis-lmwg this covers what we were discussing with the b4b-dev tag testing (that we added to a future CTSM SE meeting to discuss with the group). In this case for ctsm5.1.dev175, doing the python testing would have been helpful (and I'm also just making sure you didn't run the python testing). But, I think the error below might help us in tracking down where at least the fail below happened. And that might fix some issues for us...

 ./run_ctsm_py_tests --sys
................
Inactive Modules:
  1) hdf5/1.12.2     2) intel/2023.0.0     3) ncarcompilers/1.0.0     4) netcdf/4.9.2

Due to MODULEPATH changes, the following have been reloaded:
  1) conda/latest     2) craype/2.7.20

The following have been reloaded with a version change:
  1) cdo/2.1.1 => cdo/2.3.0     2) ncarenv/23.06 => ncarenv/23.09     3) nco/5.1.4 => nco/5.1.9     4) ncview/2.1.8 => ncview/2.1.9

The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) cesmdev/1.0   2) ncarenv/23.09
Done converting /glade/derecho/scratch/erik/tmp/tmp5qfwj9_c/scrip.nc
...E
Stdout:
in neonsite adding usermodsdirs
usermodsdirs: ['/glade/derecho/scratch/erik/ctsm5.1.dev175/cime_config/usermods_dirs/NEON/BART']
---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------
---- base case build ------
--- This may take a while and you may see WARNING messages ---

======================================================================
ERROR: test_one_site (test.test_sys_run_neon.TestSysRunNeon)
This test specifies a site to run
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/utils.py", line 705, in run_sub_or_cmd
    getattr(mod, subname)(*subargs)
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/build_scripts/buildlib.gptl", line 74, in buildlib
    run_bld_cmd_ensure_logging(cmd, logger)
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/utils.py", line 2614, in run_bld_cmd_ensure_logging
    expect(stat == 0, filter_unicode(errput))
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/utils.py", line 176, in expect
    raise exc_type(msg)
CIME.utils.CIMEError: ERROR: Warning:
 Headers from cray-mpich will be ignored because they are not compatible with network-target=none.

Error invoking pkg-config!
Package cray-pmi was not found in the pkg-config search path.
Perhaps you should add the directory containing `cray-pmi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'cray-pmi' found
Package libpals was not found in the pkg-config search path.
Perhaps you should add the directory containing `libpals.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libpals' found
gmake: *** [/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/non_py/src/timing/Makefile:85: gptl.o] Error 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/test/test_sys_run_neon.py", line 57, in test_one_site
    main("")
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/site_and_regional/run_neon.py", line 227, in main
    cesmroot, output_root, res, compset, user_mods_dirs, overwrite, setup_only
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/site_and_regional/neon_site.py", line 52, in build_base_case
    case_path = super().build_base_case(cesmroot, output_root, res, compset, user_mods_dirs)
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/site_and_regional/tower_site.py", line 157, in build_base_case
    build.case_build(case_path, case=case)
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/build.py", line 1327, in case_build
    return run_and_log_case_status(functor, cb, caseroot=caseroot)
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/utils.py", line 2480, in run_and_log_case_status
    rv = func()
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/build.py", line 1320, in <lambda>
    dry_run,
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/build.py", line 1212, in _case_build_impl
    complist,
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/build.py", line 840, in _build_libraries
    logfile=file_build,
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/utils.py", line 717, in run_sub_or_cmd
    expect(False, "{} FAILED, cat {}".format(cmd, logfile))
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/utils.py", line 176, in expect
    raise exc_type(msg)
CIME.utils.CIMEError: ERROR: /glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/build_scripts/buildlib.gptl FAILED, cat /glade/derecho/scratch/erik/tmp/tmpc8f8ojc_/BART/bld/gptl.bldlog.240329-123746

Stdout:
in neonsite adding usermodsdirs
usermodsdirs: ['/glade/derecho/scratch/erik/ctsm5.1.dev175/cime_config/usermods_dirs/NEON/BART']
---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------
---- base case build ------
--- This may take a while and you may see WARNING messages ---

----------------------------------------------------------------------
Ran 20 tests in 37.962s

FAILED (errors=1)

@ekluzek ekluzek changed the title Have updates to run_neon deprecated previous features with the --experiment flag? Issues with run_neon with the --experiment flag starting in ctsm5.1.dev172 Mar 29, 2024
This was referenced Apr 12, 2024
slevis-lmwg added a commit that referenced this issue Apr 16, 2024
experiment bug fix

Address #2433 with changes to the arguments in run_case
@slevis-lmwg slevis-lmwg linked a pull request Apr 16, 2024 that will close this issue
@wwieder wwieder closed this as completed May 2, 2024

5 participants