ERP tests failing to run #406
Comments
Hi Erik, was this test run after merging PR #401, which is supposed to fix this? If so, is there still some problem finding the restart file?
This is with that merge. It solved #388, which allows ERI tests to work. Here I don't think the problem is finding the restart file; rather, it seems to have trouble when it tries to open the restart file. This ERP test runs with 1 MPI task and 25 threads, and then does the restart with half as many threads. A restart test where the number of threads doesn't change works fine, and starting up with either number of threads also seems to be fine.
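The layout change this ERP test exercises can be sketched in Python. The sketch assumes the CIME test-name convention that the `_P<tasks>x<threads>` modifier sets the PE layout and that the ERP restart leg halves both counts (integer division, minimum 1). `pe_layout` and `erp_restart_layout` are hypothetical helpers, not CIME functions.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for a PE-layout modifier such as "P1x25" (tasks x threads).
PE_MODIFIER = re.compile(r"^P(\d+)x(\d+)$")


def pe_layout(testname: str) -> Optional[Tuple[int, int]]:
    """Return (mpi_tasks, threads) from a _P<tasks>x<threads> modifier, or None."""
    # The test type and its modifiers form the first dot-separated field,
    # e.g. "ERP_D_Mmpi-serial_P1x25"; drop the leading test type ("ERP").
    modifiers = testname.split(".")[0].split("_")[1:]
    for mod in modifiers:
        match = PE_MODIFIER.match(mod)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def erp_restart_layout(tasks: int, threads: int) -> Tuple[int, int]:
    """Assumed ERP rule: halve tasks and threads for the restart leg, never below 1."""
    return max(1, tasks // 2), max(1, threads // 2)


if __name__ == "__main__":
    name = "ERP_D_Mmpi-serial_P1x25.5x5_amazon.I2000Clm50Sp.cheyenne_intel.mizuroute-default"
    first = pe_layout(name)
    print(first, erp_restart_layout(*first))  # (1, 25) (1, 12)
```

Under this assumption, a single-task run keeps 1 task and drops from 25 to 12 threads. That matches the 12-thread case used in the ERS comparison.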
OK, I will need to take a close look at this one, as well as the amazon grid one.
OK, it seems to have trouble only when running with MPI for a single task. When I run with mpi-serial, I thought it was working for both the intel and gnu compilers, but I was wrong: it fails for both MPI and non-MPI.

ERP_D_Mmpi-serial_P1x25.5x5_amazon.I2000Clm50Sp.cheyenne_intel.mizuroute-default

It might be useful to have a multi-task MPI test to make sure you can do restarts with a differing number of MPI tasks, even though we know it will change answers because of #256.
My previous comment was actually incorrect, and it does fail for both MPI and non-MPI. I'll update the comment above.
The traceback for ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default looks like this:
rof.log ends on...
Line 333 of pio_utils is the open of the restart file:

ierr = pio_openfile(pioIoSystem, pioFileDesc, iotype, trim(fname), mode)
if(ierr/=pio_noerr)then; message=trim(message)//'Could not open netCDF'; return; endif
Note that the 25- and 12-thread ERS tests run as expected, so the problem isn't the thread count used when reading the restart files. They fail the comparison because of #390, but they do run:

ERS_D_Mmpi-serial_P1x12.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default COMPARE_base_rest
When I compare the restart namelists for the ERP test to those for the 12-thread ERS test, the comparison looks correct to me; the only difference is the casename.

diff -wbcr CaseDocs/ /glade/work/erik/ctsm_worktrees/mizuRoute/cime/scripts/cases/ERS_D_Mmpi-serial_P1x12.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/CaseDocs/ | less
(ctsm_pylib) case2/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist> pwd
/glade/work/erik/ctsm_worktrees/mizuRoute/cime/scripts/cases/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/case2/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist
This is trying to open the history file, not the restart file. That seems to be incorrect. But the message said the file is not there? So is the rpointer wrong?
Ahhh, you are right: the problem is that the history file isn't there. The restart file is, but the history file is not being copied over like it should be.
OK, to get this to work, a string variable containing the name(s) of the history file(s) that need to be read in must be added to the restart file. The name of that variable also needs to be added to the config_archive.xml file. The history file name is the same as the one that gets written to the mizuroute.rpointer file. For clm this variable is called locfnh, and it's added to the archive as...
I propose something longer and more descriptive, such as "restart_history_filenames".
Actually, when gauge data is output, the filenames should include both the history file and the gauge file (so hfileout and hfileout_gage).
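Here is a minimal Python sketch of this proposal. It assumes the restart file stores the names in a fixed-width character variable, because netCDF classic files have no variable-length string type. Only `restart_history_filenames`, `hfileout`, and `hfileout_gage` come from the discussion. `to_char_array` and `from_char_array` are hypothetical helpers, not mizuRoute routines.

```python
from typing import List, Optional


def restart_history_filenames(hfileout: str, hfileout_gage: Optional[str] = None) -> List[str]:
    """History files a restart depends on: the main file, plus the gauge file if gauge output is on."""
    names = [hfileout]
    if hfileout_gage:  # assumption: None or "" means gauge output is disabled
        names.append(hfileout_gage)
    return names


def to_char_array(names: List[str], strlen: int) -> List[str]:
    """Blank-pad each name to a fixed width, like rows of a (nfiles, strlen) char variable."""
    for name in names:
        if len(name) > strlen:
            raise ValueError(f"{name!r} is longer than {strlen} characters")
    return [name.ljust(strlen) for name in names]


def from_char_array(rows: List[str]) -> List[str]:
    """Recover the names by stripping trailing blank or NUL padding (like Fortran trim)."""
    return [row.rstrip(" \x00") for row in rows]
```

Storing the gauge filename only when it is set keeps runs without gauge output from carrying an empty entry that the archiver would then try to copy.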
I've got this working in #391 (writing the filenames to the restart file and having them copied over). However, the ERP tests still fail because the model is trying to read in the history file dated 2000-01-12-00000.nc rather than 2000-01-07-00000.nc. The Jan 12 date is also what the restart pointer file lists for the history file. This means something is going wrong with the logic for reading in the history file at restart.
Hmm... does the test stop on 2000-01-07 but generate the 2000-01-12 history file? Does the test produce daily or monthly history files? I may need to know the configuration used for the test to make a better guess.
@nmizukami I found the problem. This might be a case you didn't think about. In CESM, especially for testing, we have cases where restart files are written out before the end of the run. So you output restarts for day 7 but run until day 12. By that point the history file name in the rpointer file has been updated, but the restart file name hasn't. When I make the following change, I get it to work:

diff --git a/route/build/src/write_simoutput_pio.f90 b/route/build/src/write_simoutput_pio.f90
index 0d9d5862..98f61b50 100644
--- a/route/build/src/write_simoutput_pio.f90
+++ b/route/build/src/write_simoutput_pio.f90
@@ -108,8 +108,8 @@ SUBROUTINE main_new_file(ierr, message)
end if
! update history files
- call io_rpfile('w', ierr, cmessage)
- if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif
+ !call io_rpfile('w', ierr, cmessage)
+ !if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif
END SUBROUTINE main_new_file
Is it OK to just get rid of that call to io_rpfile, or is this case important for standalone runs? If it is, there could be an if block around it. With the change above, the test works as I expect it to.
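The failure mode can be reproduced with a toy simulation in Python. This is not mizuRoute code. It assumes a new history file every day, a restart written on day 7 of a 12-day run, and hypothetical `hist.`/`rest.` filename prefixes. `write_on_new_history=True` stands in for the `io_rpfile('w', ...)` call in `main_new_file`, and `False` stands in for the patched behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rpointer:
    """Hypothetical model of the two entries kept in the rpointer file."""
    restart: Optional[str] = None
    history: Optional[str] = None


def run(stop_day: int, restart_day: int, write_on_new_history: bool) -> Rpointer:
    """Simulate a run that opens a new history file each day and writes one restart."""
    rp = Rpointer()
    for day in range(1, stop_day + 1):
        # A new history file is opened (assumed daily here for illustration).
        current_history = f"hist.2000-01-{day:02d}-00000.nc"
        if write_on_new_history:
            # Pre-fix behavior: the rpointer is rewritten whenever a history file opens.
            rp.history = current_history
        if day == restart_day:
            # Restart writer: records the restart and the history file open right now.
            rp.restart = f"rest.2000-01-{day:02d}-00000.nc"
            rp.history = current_history
    return rp


if __name__ == "__main__":
    print(run(12, 7, write_on_new_history=True))
    print(run(12, 7, write_on_new_history=False))
```

In the first case the rpointer pairs the day-7 restart with the day-12 history file, which is the mismatch the ERP test hit. In the second case both entries refer to day 7.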
Thanks, Erik, for finding this. I can see that. It seems there is no need for this in standalone mode either. So the rpointer file needs to have 2000-01-07 for both the history file and the restart file?
Yes, the rpointer file needs to keep them consistent: 2000-01-07 for both the history and restart files, at least in this case. Depending on how often the history file is written, the history file date could be behind the restart file date, but never ahead of it. I'll just remove those lines then.
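The invariant stated here, that the history date is never ahead of the restart date, can be checked mechanically. The sketch below assumes both filenames carry the YYYY-MM-DD-SSSSS stamp seen in the dates above. `stamp` and `rpointer_consistent` are hypothetical helpers for illustration only.

```python
import re
from typing import Tuple

# Matches a YYYY-MM-DD-SSSSS timestamp such as "2000-01-07-00000".
STAMP = re.compile(r"(\d{4})-(\d{2})-(\d{2})-(\d{5})")


def stamp(filename: str) -> Tuple[int, int, int, int]:
    """Extract (year, month, day, seconds) from the first timestamp in a filename."""
    match = STAMP.search(filename)
    if match is None:
        raise ValueError(f"no YYYY-MM-DD-SSSSS timestamp in {filename!r}")
    year, month, day, seconds = (int(g) for g in match.groups())
    return (year, month, day, seconds)


def rpointer_consistent(restart_file: str, history_file: str) -> bool:
    """True if the history file is not dated after the restart file (tuples compare in order)."""
    return stamp(history_file) <= stamp(restart_file)
```

A history file dated on or before the restart passes, so a monthly history file opened earlier than a mid-month restart is still accepted. The day-12 history file paired with a day-7 restart is rejected.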
OK. After these lines are removed, io_rpfile (for writing) is called only in restart_output, which is called from main_restart (the main restart-writing routine). So the rpointer file is updated only when the restart file is written, and the history file name is picked up at that same time.
Exact restart tests with a change in thread count are failing to run. The following test fails when opening the restart file:

ERP_D_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default

Other compilers fail as well. ERS tests work, and ERP tests without mizuRoute also work.