Feature/hafs rtcases #646
…_3denvar_hafens, hafs_3denvar_hybens, hafs_4denvar_glbens
…ther hafs regression test cases.
Fixed!
On Mon, Oct 23, 2023 at 2:01 PM RussTreadon-NOAA wrote:
In regression/regression_var.sh:
> @@ -4,6 +4,10 @@
# To run with hybrid ensemble option on, change HYBENS_GLOBAL and/or HYBENS_REGIONAL from "false" to "true".
# These are located at the end of this script.
+export local_or_default="/work/noaa/hwrf/noscrub/jcheng/gsi"
Quick note: Please remove lines 7 to 10. I understand these are for your
testing. We don't want them committed to the authoritative repo.
|
Ran hafs ctests on Orion after removing lines 7-10 from
The
The slower running contrl job suggests an Orion workload issue since PR #646 does not alter The
A check of the updat and contrl wall times shows that the loproc_contrl took noticeably longer than the loproc_updat.
Since PR #646 does not alter source code, this difference is due to the Orion system load. Hence, this is a non-critical failure. |
…e 4denvar and 3denvar_hybens cases. Removed the 3denvar_glbens and 3denvar_hafens cases.
I've reduced the total number of HAFS-related regression tests to 2 after discussing with Shun. |
@yonghuiweng Could you please review this PR and run the regression tests on WCOSS and the other HPCs that HAFS uses? Thank you. |
@JingCheng-NOAA For verification, you can use the same check as rrfs: In "GSI/regression/regression_test.sh", you can change the following two lines:
Add "hafs":
Thanks, |
Hera ctests FYI, Hera
These changes remove references to the global_3dvar, global_4dvar, hwrf_nmm_d2, and hwrf_nmm_d3 regression tests. The data for the hafs tests has been sync'd to Hera. The remaining 7 ctests were run with the following results.
The hafs failures are non-fatal failures. Both tests failed the maxmem check
This check is misleading. The threshold is set to the per-node memory divided by the number of cores. For Hera this is 96 GB divided by 40 cores (2516582 KB, or 2.4 GB). The hafs tests run with 20 tasks per Hera node. As such, neither hafs test comes close to exceeding the 96 GB limit for Hera nodes. As @hu5970 mentioned, some of the current regression test checks should either be removed or revised. The maxmem and timing scalability tests can probably be removed. |
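To make the arithmetic above concrete, here is a minimal Python sketch. It assumes only the Hera figures quoted in the comment (96 GB per node, 40 cores, 20 tasks per node). The function and variable names are hypothetical and are not taken from regression_test.sh.

```python
# Sketch of the maxmem arithmetic described above.
# Assumption: 1 GB = 1024 * 1024 KB, which reproduces the 2516582 KB figure.
KB_PER_GB = 1024 * 1024

NODE_MEMORY_GB = 96   # Hera per-node memory quoted in the comment
CORES_PER_NODE = 40   # Hera cores per node quoted in the comment
TASKS_PER_NODE = 20   # tasks per node used by the hafs tests


def memory_share_kb(node_memory_gb: int, divisor: int) -> int:
    """Return the node memory in KB split evenly over `divisor` cores or tasks."""
    return node_memory_gb * KB_PER_GB // divisor


# Threshold used by the check: node memory divided by the number of cores.
threshold_kb = memory_share_kb(NODE_MEMORY_GB, CORES_PER_NODE)

# Memory actually available to each task when only 20 tasks share a node.
per_task_kb = memory_share_kb(NODE_MEMORY_GB, TASKS_PER_NODE)

print(f"per-core threshold: {threshold_kb} KB ({threshold_kb / KB_PER_GB:.1f} GB)")
print(f"per-task share at {TASKS_PER_NODE} tasks/node: {per_task_kb} KB "
      f"({per_task_kb / KB_PER_GB:.1f} GB)")
```

With 20 tasks per node, each task can use about twice the per-core threshold before the node limit is reached, which is why a task that exceeds the threshold is not necessarily close to exhausting the node.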
Previously I only tested it on Orion. So the settings for Hera may be off. Please feel free to change it or I can make changes as well. |
Hera ctests (continued) Remove maxmem and timing scalability tests from
All tests pass. This is an expected result since this PR does not alter code, namelists, modules, or scripts. This test is relevant to GSI issue #647 and is cross referenced there. |
It seems Hera and Orion are good for testing. Is anyone working on WCOSS2? |
As suggested at Monday's meeting, someone from HAFS DA might run the tests on WCOSS to get familiar with the HAFS DA regression tests. Yonghui might run the regression tests on WCOSS once he gets a chance. |
HAFS data needs to be rsync'd to WCOSS2. |
Yes, I’m working on dogwood.
|
…ion_tests.sh for verification
WCOSS2 (Dogwood) tests
As a reminder, my modified working copy of feature/hafs_rtcases removes the maxmem and timing scalability tests.
The two hafs tests have the longest run time in the ctest suite. Notice that the hafs tests run two outer loops with each inner loop containing 50 iterations. Can we reduce the number of inner iterations? Reducing the inner loop iteration count is an easy way to reduce wall time.
The global_4denvar test runs 2 outer loops with 5 iterations on the first outer loop and 10 on the second outer loop. The global_4denvar test could reduce the second outer loop iterations to 5.
We should check the number of inner loop iterations used in other ctests. |
I have no issue with reducing the iterations.
Jing
On Oct 25, 2023, at 21:26, RussTreadon-NOAA wrote:
WCOSS2 (Dogwood) tests
Do the following
Rsync Hera working copy of hafs_rtcases to Dogwood /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr645
Update with current head of feature/hafs_rtcases
Rsync hafs regression test data to Dogwood
Build feature/hafs_rtcases
Run seven ctests. As shown below, all tests pass. This is an expected result.
***@***.***:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr645/build> ctest -j 7
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr645/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_glbens
Start 4: netcdf_fv3_regional
Start 5: hafs_4denvar_glbens
Start 6: hafs_3denvar_hybens
Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional .............. Passed 482.59 sec
2/7 Test #7: global_enkf ...................... Passed 610.86 sec
3/7 Test #3: rrfs_3denvar_glbens .............. Passed 666.14 sec
4/7 Test #2: rtma ............................. Passed 1088.11 sec
5/7 Test #1: global_4denvar ................... Passed 1381.85 sec
6/7 Test #6: hafs_3denvar_hybens .............. Passed 1448.17 sec
7/7 Test #5: hafs_4denvar_glbens .............. Passed 1630.54 sec
100% tests passed, 0 tests failed out of 7
Total Test time (real) = 1630.56 sec
As a reminder, my modified working copy of feature/hafs_rtcases removes the maxmem and timing scalability tests.
The two hafs tests have the longest run time in the ctest suite. Notice that the hafs tests run two outer loops with each inner loop containing 50 iterations. Can we reduce the number of inner iterations? Reducing the inner loop iteration count is an easy way to reduce wall time.
The global_4denvar test runs 2 outer loops with 5 iterations on the first outer loop and 10 on the second outer loop. The global_4denvar test could reduce the second outer loop iterations to 5.
We should check the number of inner loop iterations used in other ctests.
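For anyone who wants to rank the tests by run time without reading the log by eye, here is a minimal Python sketch. It assumes the ctest summary lines look exactly like the Dogwood lines pasted above. The regular expression and variable names are illustrative only and are not part of the GSI regression scripts.

```python
import re

# Per-test summary lines copied from the Dogwood ctest run above.
CTEST_SUMMARY = """\
1/7 Test #4: netcdf_fv3_regional .............. Passed 482.59 sec
2/7 Test #7: global_enkf ...................... Passed 610.86 sec
3/7 Test #3: rrfs_3denvar_glbens .............. Passed 666.14 sec
4/7 Test #2: rtma ............................. Passed 1088.11 sec
5/7 Test #1: global_4denvar ................... Passed 1381.85 sec
6/7 Test #6: hafs_3denvar_hybens .............. Passed 1448.17 sec
7/7 Test #5: hafs_4denvar_glbens .............. Passed 1630.54 sec
"""

# Capture the test name and its wall time in seconds from each passing line.
SUMMARY_LINE = re.compile(r"Test\s+#\d+:\s+(\S+)\s+\.+\s+Passed\s+([\d.]+)\s+sec")

times = {name: float(sec) for name, sec in SUMMARY_LINE.findall(CTEST_SUMMARY)}

# Slowest tests first.
ranked = sorted(times, key=times.get, reverse=True)

for name in ranked:
    print(f"{name:22s} {times[name]:8.2f} sec")
```

In this log the total test time reported by ctest -j 7 (1630.56 sec) is essentially the run time of the slowest test (hafs_4denvar_glbens, 1630.54 sec), so shortening the longest-running tests is what reduces the total.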
|
Orion ctests Repeat on Orion the steps followed on Hera and WCOSS2. All seven ctests pass
|
Thank you @JingCheng-NOAA for being open to reducing the number of inner loop iterations in the hafs tests. I'll reduce the inner loop iterations in other tests and rerun the ctests to quantify the run time impact. |
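Reduced inner loop iterations
Reduce the number of inner loop iterations for several tests. Rerun ctests on Orion, Hera, and WCOSS2 (Dogwood). All seven tests pass on each machine, with the following total test times.
Orion: 1562.08 sec. The previous round of ctests on Orion took 2114.54 sec.
Hera: 2160.75 sec. The previous round of ctests on Hera took 3053.84 sec.
WCOSS2 (Dogwood): 1322.45 sec. The previous round of ctests on Dogwood took 1630.56 sec.
The modified working directory for hafs_rtcases, including changes in the number of inner loop iterations, is found in the following locations:
- Orion: /work2/noaa/da/rtreadon/git/gsi/pr645
- Hera: /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr645
- Dogwood: /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr645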
|
Thanks for the update and the test results look great.
On Thu, Oct 26, 2023 at 6:01 AM RussTreadon-NOAA wrote:
*Reduced inner loop iterations*
Reduce the number of inner loop iterations for several tests. Rerun ctests
on Orion, Hera, and WCOSS2 (Dogwood) with the following results.
*Orion*
Orion-login-4:/work2/noaa/da/rtreadon/git/gsi/pr645/build$ ctest -j 7
Test project /work2/noaa/da/rtreadon/git/gsi/pr645/build
Start 5: hafs_4denvar_glbens
Start 6: hafs_3denvar_hybens
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_glbens
Start 7: global_enkf
Start 4: netcdf_fv3_regional
1/7 Test #4: netcdf_fv3_regional .............. Passed 482.46 sec
2/7 Test #7: global_enkf ...................... Passed 488.62 sec
3/7 Test #3: rrfs_3denvar_glbens .............. Passed 545.26 sec
4/7 Test #2: rtma ............................. Passed 969.11 sec
5/7 Test #6: hafs_3denvar_hybens .............. Passed 1273.80 sec
6/7 Test #5: hafs_4denvar_glbens .............. Passed 1452.80 sec
7/7 Test #1: global_4denvar ................... Passed 1562.07 sec
100% tests passed, 0 tests failed out of 7
Total Test time (real) = 1562.08 sec
The previous round of ctests on *Orion* took 2114.54 sec.
*Hera*
Hera(hfe10):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr645/build$ ctest -j 7
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr645/build
Start 1: global_4denvar
Start 5: hafs_4denvar_glbens
Start 6: hafs_3denvar_hybens
Start 2: rtma
Start 7: global_enkf
Start 4: netcdf_fv3_regional
Start 3: rrfs_3denvar_glbens
1/7 Test #4: netcdf_fv3_regional .............. Passed 676.55 sec
2/7 Test #3: rrfs_3denvar_glbens .............. Passed 733.49 sec
3/7 Test #7: global_enkf ...................... Passed 985.35 sec
4/7 Test #2: rtma ............................. Passed 1218.54 sec
5/7 Test #6: hafs_3denvar_hybens .............. Passed 1406.38 sec
6/7 Test #5: hafs_4denvar_glbens .............. Passed 1464.25 sec
7/7 Test #1: global_4denvar ................... Passed 2160.74 sec
100% tests passed, 0 tests failed out of 7
Total Test time (real) = 2160.75 sec
The previous round of ctests on *Hera* took 3053.84 sec.
*WCOSS2 (Dogwood)*
***@***.***:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr645/build> ctest -j 7
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr645/build
Start 1: global_4denvar
Start 5: hafs_4denvar_glbens
Start 6: hafs_3denvar_hybens
Start 2: rtma
Start 7: global_enkf
Start 3: rrfs_3denvar_glbens
Start 4: netcdf_fv3_regional
1/7 Test #4: netcdf_fv3_regional .............. Passed 482.63 sec
2/7 Test #3: rrfs_3denvar_glbens .............. Passed 485.01 sec
3/7 Test #7: global_enkf ...................... Passed 609.06 sec
4/7 Test #2: rtma ............................. Passed 968.42 sec
5/7 Test #6: hafs_3denvar_hybens .............. Passed 1208.72 sec
6/7 Test #5: hafs_4denvar_glbens .............. Passed 1209.08 sec
7/7 Test #1: global_4denvar ................... Passed 1322.42 sec
100% tests passed, 0 tests failed out of 7
Total Test time (real) = 1322.45 sec
The previous round of ctests on Dogwood took 1630.56 sec.
The modified working directory for hafs_rtcases including changes in the
number of inner loop iterations is found in the following locations:
- Orion: /work2/noaa/da/rtreadon/git/gsi/pr645
- Hera: /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr645
- Dogwood: /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr645
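The run time impact reported above can be quantified with a short Python sketch. The totals are copied from the ctest output for each machine. The dictionary and function names are hypothetical and chosen only for illustration.

```python
# Total ctest wall times in seconds, copied from the runs reported above.
previous_total = {"Orion": 2114.54, "Hera": 3053.84, "Dogwood": 1630.56}
reduced_total = {"Orion": 1562.08, "Hera": 2160.75, "Dogwood": 1322.45}


def percent_saved(before: float, after: float) -> float:
    """Return the wall time reduction as a percentage of the earlier run."""
    return 100.0 * (before - after) / before


savings = {}
for machine, before in previous_total.items():
    after = reduced_total[machine]
    savings[machine] = (before - after, percent_saved(before, after))

for machine, (seconds, percent) in savings.items():
    print(f"{machine}: {seconds:.2f} sec saved ({percent:.1f}% shorter)")
```

These are single runs on shared systems, so part of each difference may reflect machine load rather than the reduced iteration counts.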
|
I updated a working copy of
I can not push the commit to If the changes in Orion Thank you. |
Hi Russ, |
@JingCheng-NOAA , please try again. I executed a chmod 755 on the directory. |
Thanks, it worked.
|
Hi Russ, I know you've removed the maxmem and timing scalability tests in "regression_test_enkf.sh" and "regression_test.sh" for test purposes. I shouldn't include the changes in these two scripts, correct? |
@JingCheng-NOAA , after talking with @ShunLiu-NOAA it's my impression that we want to bring all my modifications into your branch. This includes changes to regression_test.sh and regression_test_enkf.sh. |
Thanks for the clarification. I will include all the changes then.
|
Thank you @RussTreadon-NOAA and @JingCheng-NOAA. |
@JingCheng-NOAA , we should git rm the following files
regression/global_3dvar.sh
regression/global_4dvar.sh
in hafs-community:feature/hafs_rtcases. global_4denvar is the only global gsi test executed by ctest. |
Agree. I removed them.
|
Rerun ctests on Hera, Orion, and Dogwood. All 7 tests pass on each machine.
Approve pending input from others.
- @hu5970 , do you have any regression test changes you want to commit to hafs-community:feature/hafs_rtcases?
- @JingCheng-NOAA , do you have additional changes you would like to make to your branch?
I have no more changes to add.
BTW, I've also tested on ORION and all 7 tests passed with no issues.
Test project /work/noaa/hwrf/save/jcheng/GSIversions/GSI/build
Start 5: hafs_4denvar_glbens
Start 1: global_4denvar
Start 6: hafs_3denvar_hybens
Start 2: rtma
Start 3: rrfs_3denvar_glbens
Start 7: global_enkf
Start 4: netcdf_fv3_regional
1/7 Test #4: netcdf_fv3_regional .............. Passed 242.96 sec
2/7 Test #7: global_enkf ...................... Passed 249.75 sec
3/7 Test #3: rrfs_3denvar_glbens .............. Passed 546.38 sec
4/7 Test #2: rtma ............................. Passed 970.55 sec
5/7 Test #6: hafs_3denvar_hybens .............. Passed 1338.81 sec
6/7 Test #5: hafs_4denvar_glbens .............. Passed 1456.55 sec
7/7 Test #1: global_4denvar ................... Passed 1562.20 sec
100% tests passed, 0 tests failed out of 7
Total Test time (real) = 1562.21 sec
|
Add HAFS-related regression tests to the GSI ctests to resolve issue #600.
Four sets of regression tests for the current suite of HAFSv1 GSI:
3DEnvar with GDAS Ensemble plus FGAT capability -- "hafs_3denvar_glbens".
4DEnvar with GDAS Ensemble plus FGAT capability -- "hafs_4denvar_glbens".
3DEnvar with self-cycled HAFS Ensemble -- "hafs_3denvar_hafens".
3DEnvar with GDAS Ensemble plus self-cycled HAFS Ensemble -- "hafs_3denvar_hybens".
Fixes #600
Partially fixes #647
Type of change
How Has This Been Tested?
These changes were tested on Orion through the GSI ctests.
Checklist
DUE DATE for this PR is 12/4/2023. If this PR is not merged into develop by this date, the PR will be closed and returned to the developer.