[Bug]: CDAT migration: potential performance or memory bottlenecks handling time-series data files #892

Open · chengzhuzhang opened this issue on Nov 8, 2024 · 4 comments · May be fixed by #907

chengzhuzhang (Contributor) commented on Nov 8, 2024

What happened?

This happened again in a test (climo) vs. obs (time-series) run. When time series are used as input, the run either hangs or is killed before finishing. I suspect there is a performance/memory bottleneck in reading the time-series files and computing climatology on the fly (sketched below); more debugging is needed.

Running the same variable on the main branch takes less than 1 minute of wall time.
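For context, "computing climatology on the fly" means averaging the sliced reference time series at run time, roughly as in the sketch below (the path, variable name, and unweighted mean are illustrative assumptions, not the actual e3sm_diags implementation):

import xarray as xr

# Open the reference time-series files lazily and slice to the requested years.
ds = xr.open_mfdataset("/path/to/time-series/ua_*.nc")
ds = ds.sel(time=slice("1996-01-01", "1996-12-31"))

# Annual (ANN) climatology as a simple unweighted time mean, for brevity;
# seasonal/monthly climatologies and time weighting are omitted here.
ann_climo = ds["ua"].mean(dim="time")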

What did you expect to happen? Are there any possible answers you came across?

No response

Minimal Complete Verifiable Example (MVCE)

Command line:
python run_script.py -d U.cfg

# run_script.py

import os
from e3sm_diags.parameter.core_parameter import CoreParameter
from e3sm_diags.run import runner

param = CoreParameter()


param.reference_data_path = '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/time-series'
param.test_data_path = '/global/cfs/cdirs/e3sm/chengzhu/eamxx/post/data/rgr'
param.test_name = 'eamxx_decadal'
param.seasons = ["ANN"]
#param.save_netcdf = True

param.ref_timeseries_input = True
# Years to slice the ref data, base this off the years in the filenames.
param.ref_start_yr = "1996"
param.ref_end_yr = "1996"

prefix = '/global/cfs/cdirs/e3sm/www/zhang40/tests/eamxx'
param.results_dir = os.path.join(prefix, 'eamxx_decadal_1996_1107_edv3')

runner.sets_to_run = [
    "lat_lon",
    "zonal_mean_xy",
    "zonal_mean_2d",
    "zonal_mean_2d_stratosphere",
    "polar",
    "cosp_histogram",
    "meridional_mean_2d",
    "annual_cycle_zonal_mean",
]

runner.run_diags([param])



# U.cfg
[#]
sets = ["lat_lon"]
case_id = "ERA5"
variables = ["U"]
ref_name = "ERA5"
reference_name = "ERA5 Reanalysis"
seasons = ["ANN", "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "DJF", "MAM", "JJA", "SON"]
plevs = [850.0]
test_colormap = "PiYG_r"
reference_colormap = "PiYG_r"
contour_levels = [-20, -15, -10, -8, -5, -3, -1, 1, 3, 5, 8, 10, 15, 20]
diff_levels = [-8, -6, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5, 6, 8]
regrid_method = "bilinear"

Relevant log output

No response

Anything else we need to know?

No response

Environment

cdat-migration-fy24 branch

chengzhuzhang added the bug (Bug fix, will increment patch version) label on Nov 8, 2024
tomvothecoder (Collaborator) commented on Nov 8, 2024

This time-series variable (U) is derived from other variables, and I noticed that one of the variables used in the derivation is ua. That dataset is 76 GB, which on my end causes a hang and then a crash when calling ds.load().

Did you confirm that the CDAT code handles datasets this large? I don't recall other diags using datasets of this size, but I might be wrong.
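For illustration, a minimal sketch of the pattern that triggers the hang (the path is hypothetical, not the actual e3sm_diags code): the full multi-file ua time series is loaded eagerly before any time subsetting.

import xarray as xr

# Problematic order of operations: load the entire multi-file time series,
# then subset. The full ~76 GB of ua is pulled into memory at once.
ds = xr.open_mfdataset("/path/to/time-series/ua_*.nc")
ds.load()                                                # hangs or runs out of memory
ds_sub = ds.sel(time=slice("1996-01-01", "1996-12-31"))  # subsetting happens too late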

tomvothecoder (Collaborator) commented:

RE: My comment above.

I think the ds.load() call is loading the entire time series into memory rather than just the requested time slice. I'll try switching the order of operations to slice on time before calling ds.load(), which should cut down on the memory requirements.
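A minimal sketch of the reordered logic, assuming an xarray/dask-backed dataset (the path and years are illustrative, not the actual implementation):

import xarray as xr

# Open the multi-file time series lazily (dask-backed); nothing is read yet.
ds = xr.open_mfdataset("/path/to/time-series/ua_*.nc")

# Slice on time first, while the data are still lazy, so only the requested
# 1996 subset is ever read from disk.
ds_sub = ds.sel(time=slice("1996-01-01", "1996-12-31"))

# Load just the subset into memory instead of the full ~76 GB series.
ds_sub.load()

Subsetting before loading keeps the peak memory footprint proportional to the requested years rather than to the full file set.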

tomvothecoder mentioned this issue on Nov 8, 2024
chengzhuzhang (Contributor, Author) commented:

Closed with #892

tomvothecoder (Collaborator) commented:

Reopening because we found this is happening again in #880.

tomvothecoder reopened this on Dec 16, 2024
tomvothecoder linked a pull request on Dec 17, 2024 that will close this issue