🐛[BUG]: aero_graph_net dataloader worker exited unexpectedly #713

Open
willyawan16 opened this issue Nov 16, 2024 · 1 comment
Labels
2 - In Progress (currently a work in progress), bug (something isn't working)

Comments

willyawan16 commented Nov 16, 2024

Version

0.8.0

On which installation method(s) does this occur?

No response

Describe the issue

aero_graph_net stopped suddenly in the middle of training with the error "DataLoader worker exited unexpectedly".

I used the following command to run train.py:
HYDRA_FULL_ERROR=1 python train.py +experiment=ahmed/mgn data.data_dir=/home/willy/modulus/modulus/examples/cfd/aero_graph_net/data/ahmed_body data.train.num_workers=1 data.val.num_workers=1 data.test.num_workers=1 data.train.num_samples=10 data.val.num_samples=5 data.test.num_samples=5

I also tried setting num_workers=0, and the result was the same.

Minimum reproducible example

Relevant log output

[11:02:51 - agnet - INFO] Loading the training dataset...
[11:02:57 - agnet - INFO] Using 10 training samples.
[11:02:57 - agnet - INFO] Loading the validation dataset...
[11:03:01 - agnet - INFO] Using 5 validation samples.
[11:03:01 - agnet - INFO] Creating the dataloaders...
[11:03:01 - agnet - INFO] Creating the model...
ERROR:checkpoint:Could not find valid model file /home/willy/modulus/modulus/examples/cfd/aero_graph_net/outputs/2024-11-16/11-02-51/MeshGraphNet.0.0.mdlus, skipping load
[11:03:02 - agnet - INFO] Training started...
[11:03:09 - agnet - INFO] epoch:     1, loss: 1.04640, lr: 0.0000999, time per epoch:  7.01
[11:03:10 - agnet - INFO] Validation loss: graph: 0.9278, total: 0.9278
[11:03:14 - agnet - INFO] epoch:     2, loss: 0.99062, lr: 0.0000997, time per epoch:  3.15
[11:03:15 - agnet - INFO] Validation loss: graph: 0.8663, total: 0.8663
[11:03:17 - agnet - INFO] epoch:     3, loss: 0.95101, lr: 0.0000996, time per epoch:  2.24
[11:03:18 - agnet - INFO] Validation loss: graph: 0.8572, total: 0.8572
[11:03:20 - agnet - INFO] epoch:     4, loss: 0.90867, lr: 0.0000994, time per epoch:  2.25
[11:03:21 - agnet - INFO] Validation loss: graph: 0.8699, total: 0.8699
[11:03:23 - agnet - INFO] epoch:     5, loss: 0.92775, lr: 0.0000993, time per epoch:  2.22
[11:03:24 - agnet - INFO] Validation loss: graph: 0.8395, total: 0.8395
[11:03:26 - agnet - INFO] epoch:     6, loss: 0.91789, lr: 0.0000991, time per epoch:  2.23
[11:03:27 - agnet - INFO] Validation loss: graph: 0.8131, total: 0.8131
[11:03:29 - agnet - INFO] epoch:     7, loss: 0.88435, lr: 0.0000990, time per epoch:  2.22
[11:03:30 - agnet - INFO] Validation loss: graph: 0.8058, total: 0.8058
[11:03:32 - agnet - INFO] epoch:     8, loss: 0.84495, lr: 0.0000988, time per epoch:  2.21
[11:03:33 - agnet - INFO] Validation loss: graph: 0.8140, total: 0.8140
[11:03:36 - agnet - INFO] epoch:     9, loss: 0.83925, lr: 0.0000987, time per epoch:  2.20
[11:03:37 - agnet - INFO] Validation loss: graph: 0.7596, total: 0.7596
[11:03:40 - agnet - INFO] epoch:    10, loss: 0.83024, lr: 0.0000985, time per epoch:  2.23
[11:03:40 - agnet - INFO] Validation loss: graph: 0.8068, total: 0.8068
[11:03:41 - agnet - INFO] Saved model on rank 0
[11:03:43 - agnet - INFO] epoch:    11, loss: 0.84523, lr: 0.0000984, time per epoch:  2.26
[11:03:44 - agnet - INFO] Validation loss: graph: 0.7840, total: 0.7840
[11:03:46 - agnet - INFO] epoch:    12, loss: 0.83194, lr: 0.0000982, time per epoch:  2.24
[11:03:47 - agnet - INFO] Validation loss: graph: 0.8148, total: 0.8148
[11:03:49 - agnet - INFO] epoch:    13, loss: 0.80304, lr: 0.0000981, time per epoch:  2.22
[11:03:50 - agnet - INFO] Validation loss: graph: 0.7451, total: 0.7451
[11:03:52 - agnet - INFO] epoch:    14, loss: 0.73931, lr: 0.0000979, time per epoch:  2.23
[11:03:53 - agnet - INFO] Validation loss: graph: 0.6774, total: 0.6774
[11:03:55 - agnet - INFO] epoch:    15, loss: 0.70535, lr: 0.0000978, time per epoch:  2.31
[11:03:56 - agnet - INFO] Validation loss: graph: 0.7420, total: 0.7420
[11:03:59 - agnet - INFO] epoch:    16, loss: 0.66442, lr: 0.0000976, time per epoch:  2.28
[11:04:00 - agnet - INFO] Validation loss: graph: 0.7097, total: 0.7097
[11:04:02 - agnet - INFO] epoch:    17, loss: 0.62061, lr: 0.0000975, time per epoch:  2.26
[11:04:03 - agnet - INFO] Validation loss: graph: 0.6342, total: 0.6342
[11:04:05 - agnet - INFO] epoch:    18, loss: 0.58241, lr: 0.0000973, time per epoch:  2.24
[11:04:06 - agnet - INFO] Validation loss: graph: 0.6717, total: 0.6717
[11:04:08 - agnet - INFO] epoch:    19, loss: 0.55367, lr: 0.0000972, time per epoch:  2.24
[11:04:09 - agnet - INFO] Validation loss: graph: 0.6473, total: 0.6473
[11:04:12 - agnet - INFO] epoch:    20, loss: 0.58924, lr: 0.0000970, time per epoch:  3.16
[11:04:13 - agnet - INFO] Validation loss: graph: 0.6289, total: 0.6289
[11:04:13 - agnet - INFO] Saved model on rank 0
[11:04:16 - agnet - INFO] epoch:    21, loss: 0.56127, lr: 0.0000969, time per epoch:  2.26
[11:04:17 - agnet - INFO] Validation loss: graph: 0.6261, total: 0.6261
[11:04:19 - agnet - INFO] epoch:    22, loss: 0.56012, lr: 0.0000968, time per epoch:  2.24
[11:04:20 - agnet - INFO] Validation loss: graph: 0.6119, total: 0.6119
[11:04:22 - agnet - INFO] epoch:    23, loss: 0.54818, lr: 0.0000966, time per epoch:  2.27
[11:04:23 - agnet - INFO] Validation loss: graph: 0.6387, total: 0.6387
[11:04:25 - agnet - INFO] epoch:    24, loss: 0.61648, lr: 0.0000965, time per epoch:  2.26
[11:04:26 - agnet - INFO] Validation loss: graph: 0.6684, total: 0.6684
[11:04:29 - agnet - INFO] epoch:    25, loss: 0.61332, lr: 0.0000963, time per epoch:  2.25
[11:04:30 - agnet - INFO] Validation loss: graph: 0.5929, total: 0.5929
[11:04:32 - agnet - INFO] epoch:    26, loss: 0.53022, lr: 0.0000962, time per epoch:  2.26
[11:04:33 - agnet - INFO] Validation loss: graph: 0.6428, total: 0.6428
[11:04:35 - agnet - INFO] epoch:    27, loss: 0.54248, lr: 0.0000960, time per epoch:  2.52
[11:04:36 - agnet - INFO] Validation loss: graph: 0.6354, total: 0.6354
[11:04:39 - agnet - INFO] epoch:    28, loss: 0.53696, lr: 0.0000959, time per epoch:  2.26
[11:04:40 - agnet - INFO] Validation loss: graph: 0.6226, total: 0.6226
[11:04:42 - agnet - INFO] epoch:    29, loss: 0.52558, lr: 0.0000957, time per epoch:  2.26
[11:04:44 - agnet - INFO] Validation loss: graph: 0.6171, total: 0.6171
[11:04:46 - agnet - INFO] epoch:    30, loss: 0.50851, lr: 0.0000956, time per epoch:  2.25
[11:04:47 - agnet - INFO] Validation loss: graph: 0.7020, total: 0.7020
[11:04:47 - agnet - INFO] Saved model on rank 0
[11:04:49 - agnet - INFO] epoch:    31, loss: 0.49040, lr: 0.0000955, time per epoch:  2.26
[11:04:50 - agnet - INFO] Validation loss: graph: 0.5944, total: 0.5944
[11:04:53 - agnet - INFO] epoch:    32, loss: 0.48895, lr: 0.0000953, time per epoch:  2.28
[11:04:54 - agnet - INFO] Validation loss: graph: 0.6124, total: 0.6124
[11:04:56 - agnet - INFO] epoch:    33, loss: 0.46899, lr: 0.0000952, time per epoch:  2.27
[11:04:57 - agnet - INFO] Validation loss: graph: 0.5988, total: 0.5988
[11:04:59 - agnet - INFO] epoch:    34, loss: 0.44761, lr: 0.0000950, time per epoch:  2.28
[11:05:00 - agnet - INFO] Validation loss: graph: 0.5896, total: 0.5896
[11:05:03 - agnet - INFO] epoch:    35, loss: 0.47603, lr: 0.0000949, time per epoch:  2.28
[11:05:04 - agnet - INFO] Validation loss: graph: 0.6296, total: 0.6296
[11:05:06 - agnet - INFO] epoch:    36, loss: 0.48775, lr: 0.0000947, time per epoch:  2.27
[11:05:07 - agnet - INFO] Validation loss: graph: 0.6657, total: 0.6657
[11:05:09 - agnet - INFO] epoch:    37, loss: 0.47699, lr: 0.0000946, time per epoch:  2.35
[11:05:10 - agnet - INFO] Validation loss: graph: 0.5994, total: 0.5994
[11:05:13 - agnet - INFO] epoch:    38, loss: 0.45868, lr: 0.0000945, time per epoch:  2.39
[11:05:14 - agnet - INFO] Validation loss: graph: 0.6130, total: 0.6130
[11:05:17 - agnet - INFO] epoch:    39, loss: 0.45016, lr: 0.0000943, time per epoch:  3.33
[11:05:18 - agnet - INFO] Validation loss: graph: 0.5912, total: 0.5912
[11:05:21 - agnet - INFO] epoch:    40, loss: 0.43625, lr: 0.0000942, time per epoch:  2.29
[11:05:22 - agnet - INFO] Validation loss: graph: 0.6006, total: 0.6006
[11:05:22 - agnet - INFO] Saved model on rank 0
[11:05:24 - agnet - INFO] epoch:    41, loss: 0.44077, lr: 0.0000940, time per epoch:  2.28
[11:05:25 - agnet - INFO] Validation loss: graph: 0.6132, total: 0.6132
[11:05:27 - agnet - INFO] epoch:    42, loss: 0.44645, lr: 0.0000939, time per epoch:  2.28
[11:05:29 - agnet - INFO] Validation loss: graph: 0.5658, total: 0.5658
[11:05:31 - agnet - INFO] epoch:    43, loss: 0.44287, lr: 0.0000938, time per epoch:  2.27
[11:05:32 - agnet - INFO] Validation loss: graph: 0.5943, total: 0.5943
[11:05:34 - agnet - INFO] epoch:    44, loss: 0.45043, lr: 0.0000936, time per epoch:  2.29
[11:05:35 - agnet - INFO] Validation loss: graph: 0.5937, total: 0.5937
[11:05:38 - agnet - INFO] epoch:    45, loss: 0.44687, lr: 0.0000935, time per epoch:  2.35
[11:05:39 - agnet - INFO] Validation loss: graph: 0.5660, total: 0.5660
[11:05:41 - agnet - INFO] epoch:    46, loss: 0.43094, lr: 0.0000933, time per epoch:  2.32
[11:05:42 - agnet - INFO] Validation loss: graph: 0.5814, total: 0.5814
[11:05:45 - agnet - INFO] epoch:    47, loss: 0.42908, lr: 0.0000932, time per epoch:  2.33
[11:05:46 - agnet - INFO] Validation loss: graph: 0.5723, total: 0.5723
[11:05:48 - agnet - INFO] epoch:    48, loss: 0.43253, lr: 0.0000931, time per epoch:  2.37
[11:05:50 - agnet - INFO] Validation loss: graph: 0.6036, total: 0.6036
[11:05:53 - agnet - INFO] epoch:    49, loss: 0.43167, lr: 0.0000929, time per epoch:  2.34
[11:05:54 - agnet - INFO] Validation loss: graph: 0.5951, total: 0.5951
[11:05:56 - agnet - INFO] epoch:    50, loss: 0.43239, lr: 0.0000928, time per epoch:  2.36
[11:05:57 - agnet - INFO] Validation loss: graph: 0.5776, total: 0.5776
[11:05:58 - agnet - INFO] Saved model on rank 0
[11:06:00 - agnet - INFO] epoch:    51, loss: 0.42958, lr: 0.0000926, time per epoch:  2.36
[11:06:01 - agnet - INFO] Validation loss: graph: 0.5959, total: 0.5959
[11:06:03 - agnet - INFO] epoch:    52, loss: 0.42537, lr: 0.0000925, time per epoch:  2.36
[11:06:05 - agnet - INFO] Validation loss: graph: 0.6439, total: 0.6439
[11:06:07 - agnet - INFO] epoch:    53, loss: 0.42174, lr: 0.0000924, time per epoch:  2.36
[11:06:08 - agnet - INFO] Validation loss: graph: 0.6653, total: 0.6653
[11:06:11 - agnet - INFO] epoch:    54, loss: 0.45621, lr: 0.0000922, time per epoch:  2.35
[11:06:12 - agnet - INFO] Validation loss: graph: 0.7041, total: 0.7041
[11:06:14 - agnet - INFO] epoch:    55, loss: 0.45867, lr: 0.0000921, time per epoch:  2.36
[11:06:15 - agnet - INFO] Validation loss: graph: 0.5986, total: 0.5986
[11:06:18 - agnet - INFO] epoch:    56, loss: 0.44210, lr: 0.0000919, time per epoch:  2.38
[11:06:19 - agnet - INFO] Validation loss: graph: 0.6084, total: 0.6084
[11:06:21 - agnet - INFO] epoch:    57, loss: 0.42640, lr: 0.0000918, time per epoch:  2.36
[11:06:23 - agnet - INFO] Validation loss: graph: 0.5960, total: 0.5960
[11:06:26 - agnet - INFO] epoch:    58, loss: 0.42906, lr: 0.0000917, time per epoch:  2.35
[11:06:27 - agnet - INFO] Validation loss: graph: 0.5877, total: 0.5877
[11:06:29 - agnet - INFO] epoch:    59, loss: 0.41593, lr: 0.0000915, time per epoch:  2.31
[11:06:30 - agnet - INFO] Validation loss: graph: 0.5977, total: 0.5977
[11:06:33 - agnet - INFO] epoch:    60, loss: 0.42479, lr: 0.0000914, time per epoch:  2.34
[11:06:34 - agnet - INFO] Validation loss: graph: 0.5854, total: 0.5854
[11:06:34 - agnet - INFO] Saved model on rank 0
[11:06:36 - agnet - INFO] epoch:    61, loss: 0.42576, lr: 0.0000913, time per epoch:  2.35
[11:06:37 - agnet - INFO] Validation loss: graph: 0.5894, total: 0.5894
[11:06:40 - agnet - INFO] epoch:    62, loss: 0.41232, lr: 0.0000911, time per epoch:  2.36
[11:06:41 - agnet - INFO] Validation loss: graph: 0.6056, total: 0.6056
[11:06:43 - agnet - INFO] epoch:    63, loss: 0.41637, lr: 0.0000910, time per epoch:  2.36
[11:06:45 - agnet - INFO] Validation loss: graph: 0.6054, total: 0.6054
[11:06:47 - agnet - INFO] epoch:    64, loss: 0.43276, lr: 0.0000908, time per epoch:  2.35
[11:06:48 - agnet - INFO] Validation loss: graph: 0.6051, total: 0.6051
[11:06:51 - agnet - INFO] epoch:    65, loss: 0.43972, lr: 0.0000907, time per epoch:  2.37
[11:06:52 - agnet - INFO] Validation loss: graph: 0.5941, total: 0.5941
[11:06:54 - agnet - INFO] epoch:    66, loss: 0.41911, lr: 0.0000906, time per epoch:  2.37
[11:06:56 - agnet - INFO] Validation loss: graph: 0.5699, total: 0.5699
[11:06:59 - agnet - INFO] epoch:    67, loss: 0.39792, lr: 0.0000904, time per epoch:  2.38
[11:07:00 - agnet - INFO] Validation loss: graph: 0.5766, total: 0.5766
[11:07:02 - agnet - INFO] epoch:    68, loss: 0.39070, lr: 0.0000903, time per epoch:  2.38
[11:07:04 - agnet - INFO] Validation loss: graph: 0.5935, total: 0.5935
[11:07:06 - agnet - INFO] epoch:    69, loss: 0.42633, lr: 0.0000902, time per epoch:  2.37
[11:07:07 - agnet - INFO] Validation loss: graph: 0.6092, total: 0.6092
[11:07:10 - agnet - INFO] epoch:    70, loss: 0.43192, lr: 0.0000900, time per epoch:  2.39
[11:07:11 - agnet - INFO] Validation loss: graph: 0.5988, total: 0.5988
[11:07:11 - agnet - INFO] Saved model on rank 0
[11:07:14 - agnet - INFO] epoch:    71, loss: 0.40431, lr: 0.0000899, time per epoch:  2.45
[11:07:15 - agnet - INFO] Validation loss: graph: 0.5811, total: 0.5811
[11:07:17 - agnet - INFO] epoch:    72, loss: 0.41174, lr: 0.0000898, time per epoch:  2.45
[11:07:19 - agnet - INFO] Validation loss: graph: 0.5780, total: 0.5780
[11:07:21 - agnet - INFO] epoch:    73, loss: 0.41795, lr: 0.0000896, time per epoch:  2.47
[11:07:23 - agnet - INFO] Validation loss: graph: 0.5996, total: 0.5996
[11:07:25 - agnet - INFO] epoch:    74, loss: 0.41099, lr: 0.0000895, time per epoch:  2.46
[11:07:26 - agnet - INFO] Validation loss: graph: 0.6146, total: 0.6146
[11:07:30 - agnet - INFO] epoch:    75, loss: 0.41559, lr: 0.0000894, time per epoch:  3.37
[11:07:31 - agnet - INFO] Validation loss: graph: 0.6174, total: 0.6174
[11:07:34 - agnet - INFO] epoch:    76, loss: 0.40612, lr: 0.0000892, time per epoch:  2.48
[11:07:35 - agnet - INFO] Validation loss: graph: 0.6176, total: 0.6176
[11:07:37 - agnet - INFO] epoch:    77, loss: 0.40496, lr: 0.0000891, time per epoch:  2.43
[11:07:39 - agnet - INFO] Validation loss: graph: 0.5746, total: 0.5746
[11:07:41 - agnet - INFO] epoch:    78, loss: 0.42470, lr: 0.0000890, time per epoch:  2.46
[11:07:43 - agnet - INFO] Validation loss: graph: 0.6034, total: 0.6034
[11:07:45 - agnet - INFO] epoch:    79, loss: 0.41885, lr: 0.0000888, time per epoch:  2.43
[11:07:46 - agnet - INFO] Validation loss: graph: 0.6060, total: 0.6060
[11:07:49 - agnet - INFO] epoch:    80, loss: 0.39811, lr: 0.0000887, time per epoch:  2.46
[11:07:50 - agnet - INFO] Validation loss: graph: 0.5926, total: 0.5926
[11:07:50 - agnet - INFO] Saved model on rank 0
[11:07:53 - agnet - INFO] epoch:    81, loss: 0.39563, lr: 0.0000886, time per epoch:  2.48
[11:07:54 - agnet - INFO] Validation loss: graph: 0.5974, total: 0.5974
[11:07:57 - agnet - INFO] epoch:    82, loss: 0.40131, lr: 0.0000884, time per epoch:  2.50
[11:07:58 - agnet - INFO] Validation loss: graph: 0.6259, total: 0.6259
[11:08:01 - agnet - INFO] epoch:    83, loss: 0.39569, lr: 0.0000883, time per epoch:  2.52
[11:08:03 - agnet - INFO] Validation loss: graph: 0.5768, total: 0.5768
[11:08:06 - agnet - INFO] epoch:    84, loss: 0.40172, lr: 0.0000882, time per epoch:  2.63
[11:08:07 - agnet - INFO] Validation loss: graph: 0.5554, total: 0.5554
[11:08:10 - agnet - INFO] epoch:    85, loss: 0.41280, lr: 0.0000880, time per epoch:  2.54
[11:08:11 - agnet - INFO] Validation loss: graph: 0.6162, total: 0.6162
[11:08:14 - agnet - INFO] epoch:    86, loss: 0.39156, lr: 0.0000879, time per epoch:  2.49
[11:08:15 - agnet - INFO] Validation loss: graph: 0.6115, total: 0.6115
[11:08:18 - agnet - INFO] epoch:    87, loss: 0.37685, lr: 0.0000878, time per epoch:  2.43
[11:08:19 - agnet - INFO] Validation loss: graph: 0.6713, total: 0.6713
[11:08:25 - agnet - INFO] epoch:    88, loss: 0.38061, lr: 0.0000876, time per epoch:  6.03
[11:08:27 - agnet - INFO] Validation loss: graph: 0.5992, total: 0.5992
[11:08:29 - agnet - INFO] epoch:    89, loss: 0.37490, lr: 0.0000875, time per epoch:  2.47
[11:08:31 - agnet - INFO] Validation loss: graph: 0.5933, total: 0.5933
[11:08:33 - agnet - INFO] epoch:    90, loss: 0.36721, lr: 0.0000874, time per epoch:  2.46
[11:08:35 - agnet - INFO] Validation loss: graph: 0.6246, total: 0.6246
[11:08:36 - agnet - INFO] Saved model on rank 0
[11:08:38 - agnet - INFO] epoch:    91, loss: 0.37175, lr: 0.0000872, time per epoch:  2.51
[11:08:39 - agnet - INFO] Validation loss: graph: 0.6272, total: 0.6272
[11:08:42 - agnet - INFO] epoch:    92, loss: 0.38019, lr: 0.0000871, time per epoch:  2.52
[11:08:43 - agnet - INFO] Validation loss: graph: 0.5839, total: 0.5839
[11:08:46 - agnet - INFO] epoch:    93, loss: 0.38484, lr: 0.0000870, time per epoch:  2.49
[11:08:47 - agnet - INFO] Validation loss: graph: 0.5811, total: 0.5811
[11:08:50 - agnet - INFO] epoch:    94, loss: 0.39524, lr: 0.0000868, time per epoch:  2.89
[11:08:52 - agnet - INFO] Validation loss: graph: 0.6102, total: 0.6102
[11:08:54 - agnet - INFO] epoch:    95, loss: 0.39685, lr: 0.0000867, time per epoch:  2.44
[11:08:55 - agnet - INFO] Validation loss: graph: 0.5561, total: 0.5561
[11:08:58 - agnet - INFO] epoch:    96, loss: 0.39618, lr: 0.0000866, time per epoch:  2.46
[11:08:59 - agnet - INFO] Validation loss: graph: 0.5514, total: 0.5514
[11:09:02 - agnet - INFO] epoch:    97, loss: 0.37102, lr: 0.0000865, time per epoch:  2.51
[11:09:03 - agnet - INFO] Validation loss: graph: 0.5808, total: 0.5808
[11:09:06 - agnet - INFO] epoch:    98, loss: 0.38046, lr: 0.0000863, time per epoch:  2.51
[11:09:07 - agnet - INFO] Validation loss: graph: 0.6673, total: 0.6673
[11:09:11 - agnet - INFO] epoch:    99, loss: 0.40059, lr: 0.0000862, time per epoch:  3.45
[11:09:12 - agnet - INFO] Validation loss: graph: 0.5723, total: 0.5723
[11:09:15 - agnet - INFO] epoch:   100, loss: 0.38197, lr: 0.0000861, time per epoch:  2.54
[11:09:16 - agnet - INFO] Validation loss: graph: 0.5761, total: 0.5761
[11:09:16 - agnet - INFO] Saved model on rank 0
[11:09:19 - agnet - INFO] epoch:   101, loss: 0.39936, lr: 0.0000859, time per epoch:  2.52
[11:09:20 - agnet - INFO] Validation loss: graph: 0.6000, total: 0.6000
[11:09:23 - agnet - INFO] epoch:   102, loss: 0.39781, lr: 0.0000858, time per epoch:  2.90
[11:09:25 - agnet - INFO] Validation loss: graph: 0.5894, total: 0.5894
[11:09:27 - agnet - INFO] epoch:   103, loss: 0.37163, lr: 0.0000857, time per epoch:  2.47
[11:09:29 - agnet - INFO] Validation loss: graph: 0.6307, total: 0.6307
[11:09:31 - agnet - INFO] epoch:   104, loss: 0.35317, lr: 0.0000856, time per epoch:  2.46
[11:09:33 - agnet - INFO] Validation loss: graph: 0.5912, total: 0.5912
[11:09:35 - agnet - INFO] epoch:   105, loss: 0.36086, lr: 0.0000854, time per epoch:  2.53
[11:09:37 - agnet - INFO] Validation loss: graph: 0.5923, total: 0.5923
[11:09:39 - agnet - INFO] epoch:   106, loss: 0.35450, lr: 0.0000853, time per epoch:  2.55
[11:09:41 - agnet - INFO] Validation loss: graph: 0.6093, total: 0.6093
[11:09:45 - agnet - INFO] epoch:   107, loss: 0.34461, lr: 0.0000852, time per epoch:  4.29
[11:09:47 - agnet - INFO] Validation loss: graph: 0.6033, total: 0.6033
[11:09:49 - agnet - INFO] epoch:   108, loss: 0.35229, lr: 0.0000850, time per epoch:  2.55
[11:09:51 - agnet - INFO] Validation loss: graph: 0.6246, total: 0.6246
Error executing job with overrides: ['+experiment=ahmed/mgn', 'data.data_dir=/home/willy/modulus/modulus/examples/cfd/aero_graph_net/data/ahmed_body', 'data.train.num_workers=1', 'data.val.num_workers=1', 'data.test.num_workers=1', 'data.train.num_samples=10', 'data.val.num_samples=5', 'data.test.num_samples=5']
Traceback (most recent call last):
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1131, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/queue.py", line 179, in get
    raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/willy/modulus/modulus/examples/cfd/aero_graph_net/train.py", line 267, in <module>
    main()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/willy/modulus/modulus/examples/cfd/aero_graph_net/train.py", line 225, in main
    for batch in trainer.dataloader:
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1327, in _next_data
    idx, data = self._get_data()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1283, in _get_data
    success, data = self._try_get_data()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1144, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 171111) exited unexpectedly
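
In the traceback above, queue.Empty is the direct cause of the RuntimeError: the main process timed out waiting on the data queue and then found that a worker process was gone. A worker that is killed from outside (for example by the kernel's out-of-memory killer, which is only one possible cause and is not confirmed here) cannot raise a Python exception, so the queue simply stays empty. The sketch below reproduces that pattern with the standard library only. The names `_worker` and `fetch_one` are hypothetical and are not part of PyTorch or Modulus, and the "fork" start method assumes a Unix system.

```python
import multiprocessing as mp
import os
import queue
import signal


def _worker(out_queue, die):
    """Stand-in for a data-loading worker (hypothetical, not Modulus code)."""
    if die:
        # Simulate an external kill: no Python exception is raised,
        # the process just disappears.
        os.kill(os.getpid(), signal.SIGKILL)
    out_queue.put("batch")


def fetch_one(die, timeout=5.0):
    """Return ("ok", batch), ("worker died", exitcode) or ("timeout", None)."""
    ctx = mp.get_context("fork")  # Unix-only start method
    out_queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(out_queue, die))
    proc.start()
    try:
        batch = out_queue.get(timeout=timeout)
    except queue.Empty:
        if proc.is_alive():
            # The worker is merely slow; stop it and report a timeout.
            proc.terminate()
            proc.join()
            return "timeout", None
        # The worker is gone; a negative exit code -N means "killed by signal N".
        return "worker died", proc.exitcode
    proc.join()
    return "ok", batch


if __name__ == "__main__":
    print(fetch_one(die=False))
    print(fetch_one(die=True, timeout=1.0))
```

If the real worker was killed by a signal, the system log (for example the kernel log on Linux) is usually a better source than the Python traceback, since the traceback only records that the worker vanished.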

Environment details

willyawan16 added the "? - Needs Triage" and "bug" labels on Nov 16, 2024
willyawan16 changed the title from "🐛[BUG]: aero_graph_net suddenly stopped training" to "🐛[BUG]: aero_graph_net dataloader worker exited unexpectedly" on Nov 16, 2024
Alexey-Kamenev added the "2 - In Progress" label and removed the "? - Needs Triage" label on Dec 4, 2024
@Alexey-Kamenev
Collaborator

I have a few questions which might help in resolving the issue:

  1. What is the Modulus installation method? Are you running the example from the Modulus Docker container or in some other way?
  2. Does the error happen after a certain number of epochs, or is it random?
  3. What is the GPU memory utilization around the time of the crash?
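
One way to collect the numbers asked for above is to record memory once per epoch and check whether it grows toward the point of the crash. The following is a minimal sketch using only the Python standard library. The helper names (`peak_rss_mib`, `gpu_memory_used_mib`, `memory_snapshot`) are hypothetical and not part of Modulus, and the sketch assumes a Unix system with `nvidia-smi` on the PATH for the GPU part. It measures only the calling process, so it covers data loading only when num_workers=0.

```python
import resource
import shutil
import subprocess
import sys


def peak_rss_mib():
    """Peak resident memory of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    return peak / (1024 * 1024 if sys.platform == "darwin" else 1024)


def gpu_memory_used_mib():
    """Used memory per GPU in MiB via nvidia-smi, or None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        )
        return [int(value) for value in result.stdout.split()]
    except (subprocess.SubprocessError, OSError, ValueError):
        return None


def memory_snapshot(epoch):
    """Hypothetical helper: one record per epoch, to log next to the loss."""
    return {
        "epoch": epoch,
        "peak_rss_mib": round(peak_rss_mib(), 1),
        "gpu_used_mib": gpu_memory_used_mib(),
    }


if __name__ == "__main__":
    print(memory_snapshot(epoch=1))
```

Calling `memory_snapshot(epoch)` at the end of each epoch and comparing the first and last records before the failure would show whether host or GPU memory climbs steadily, which also helps answer whether the crash point is tied to the epoch count.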
