Running the maddpg benchmark for a long time results in a traci error. #158

rsuwa · 2020-11-13T05:54:48Z

Reproduction

If you run the task for a long time (1-2 hours), there is a high probability that the following error will occur.

Command

python run.py scenarios/intersections/4lane -f agents/maddpg/baseline-lane-control.yaml

Full logs

Failure # 1 (occurred at 2020-11-13_14-32-54)
Traceback (most recent call last):
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 726, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/worker.py", line 1452, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FatalTraCIError): ray::MADDPG2.train() (pid=8802, ip=192.168.10.106)
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 517, in train
    raise e
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 506, in train
    result = Trainable.train(self)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/tune/trainable.py", line 336, in train
    result = self.step()
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 147, in step
    res = next(self.train_exec_impl)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 1075, in build_union
    item = next(it)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  [Previous line repeated 1 more time]
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 471, in base_iterator
    yield ray.get(futures, timeout=timeout)
ray.exceptions.RayTaskError(FatalTraCIError): ray::RolloutWorker.par_iter_next() (pid=8801, ip=192.168.10.106)
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 1152, in par_iter_next
    return next(self.local_it)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 317, in gen_rollouts
    yield self.sample()
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 621, in sample
    batches = [self.input_reader.next()]
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 94, in next
    batches = [self.get_data()]
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 211, in get_data
    item = next(self.rollout_provider)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 602, in _env_runner
    observation_fn=observation_fn,
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 896, in _process_observations
    env_id)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/env/base_env.py", line 422, in try_reset
    obs = self.env_states[env_id].reset()
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/env/base_env.py", line 460, in reset
    self.last_obs = self.env.reset()
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/wrappers/rllib/early_done.py", line 34, in reset
    obs = self.env.reset()
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/smarts/env/rllib_hiway_env.py", line 158, in reset
    env_observations = self._smarts.reset(scenario)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/smarts/core/smarts.py", line 270, in reset
    self.setup(scenario)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/smarts/core/smarts.py", line 318, in setup
    provider_state = self._setup_providers(self._scenario)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/smarts/core/smarts.py", line 600, in _setup_providers
    provider_state.merge(provider.setup(scenario))
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/smarts/core/sumo_traffic_simulation.py", line 238, in setup
    [tc.VAR_DEPARTED_VEHICLES_IDS, tc.VAR_ARRIVED_VEHICLES_IDS]
  File "/usr/share/sumo/tools/traci/_simulation.py", line 440, in subscribe
    Domain.subscribe(self, "", varIDs, begin, end)
  File "/usr/share/sumo/tools/traci/domain.py", line 208, in subscribe
    self._connection._subscribe(self._subscribeID, begin, end, objectID, varIDs)
  File "/usr/share/sumo/tools/traci/connection.py", line 231, in _subscribe
    result = self._sendCmd(cmdID, (begin, end), objID, format, *args)
  File "/usr/share/sumo/tools/traci/connection.py", line 178, in _sendCmd
    return self._sendExact()
  File "/usr/share/sumo/tools/traci/connection.py", line 88, in _sendExact
    raise FatalTraCIError("connection closed by SUMO")
traci.exceptions.FatalTraCIError: connection closed by SUMO


KornbergFresnel · 2020-11-14T00:50:09Z

@rsuwa Yes, this error is produced by SUMO. You can read related issue reports like this one for more details. It is worth noting that this problem will not affect your training. You can set a large max_failures and decrease the number of workers, which ensures that plenty of samples can still be collected and also reduces the probability of a TraCIError being raised:

from ray import tune

analysis = tune.run(
    "PG",
    # ...
    max_failures=3,  # retry a failed trial up to 3 times
    # ...
)

rsuwa · 2020-11-14T05:05:07Z

@KornbergFresnel
Okay, I'll check it out.
However, once the error occurs, the run does not recover.
The number of steps also appears to be reset.
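
A possible explanation for the step reset, offered as an assumption rather than a diagnosis: when Tune retries a failed trial under max_failures, it restores from the latest checkpoint if one exists, and otherwise starts the trial over. The sketch below assembles tune.run arguments that enable periodic checkpointing alongside the retry limit and a reduced worker count. The helper names build_tune_kwargs and launch and all numeric values are illustrative and are not taken from the benchmark's run.py; the benchmark's own env and stop settings would still need to be merged into config.

```python
def build_tune_kwargs(max_failures=3, checkpoint_freq=10, num_workers=1):
    """Assemble keyword arguments for ray.tune.run (illustrative values only)."""
    return {
        # Number of times Tune tries to recover a failed trial.
        "max_failures": max_failures,
        # Save a checkpoint every N training iterations so that a retry
        # can resume from it instead of starting from step 0.
        "checkpoint_freq": checkpoint_freq,
        "checkpoint_at_end": True,
        # RLlib setting: fewer rollout workers means fewer SUMO processes.
        "config": {"num_workers": num_workers},
    }


def launch(trainable="PG", **overrides):
    """Hypothetical entry point; env and stop criteria are omitted here."""
    # Imported lazily so build_tune_kwargs can be inspected without Ray.
    from ray import tune

    return tune.run(trainable, **build_tune_kwargs(**overrides))
```

Whether a restored trial continues its step count exactly depends on what the trainer stores in its checkpoint, so this is worth verifying on a short run first.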

Adaickalavan · 2023-04-21T16:17:38Z

Given the graceful handling of TraCI connection errors in the latest SMARTS version, this issue is being closed.
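
For readers on older versions, here is a rough illustration of what handling the error at the reset boundary can look like. This is not the code from the SMARTS fix; it is a generic retry sketch. ConnectionLost is a stand-in for traci.exceptions.FatalTraCIError, and reset_with_retry and FlakyEnv are hypothetical names used only for this example. It also assumes that calling reset() again is enough to re-establish the simulator connection, which may not hold for a given SMARTS version.

```python
import time


class ConnectionLost(Exception):
    """Stand-in for traci.exceptions.FatalTraCIError in this sketch."""


def reset_with_retry(env, retries=3, delay=0.0, error_type=ConnectionLost):
    """Call env.reset(), retrying when the simulator connection drops."""
    for attempt in range(retries + 1):
        try:
            return env.reset()
        except error_type:
            if attempt == retries:
                # Out of retries: let the caller (e.g. Tune) see the error.
                raise
            time.sleep(delay)


class FlakyEnv:
    """Toy environment whose first `failures` resets raise ConnectionLost."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def reset(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionLost("connection closed by SUMO")
        return {"obs": 0}
```

With FlakyEnv(2), reset_with_retry succeeds on the third call; with more failures than retries, the original exception propagates unchanged.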

Gamenot added this to the Backlog milestone Jan 27, 2021

Adaickalavan linked a pull request Mar 18, 2021 that will close this issue

Bugtest sumo crash reproduction #619

Open

Gamenot mentioned this issue Nov 25, 2021

Graceful handle of TraCI connection errors #1138

Merged

Adaickalavan closed this as completed Apr 21, 2023
