Hacks for DRL with TORCS
Very important. One could naively set a maximum episode length of, say, 3000 steps, and then wander off while the code runs; this massive waste of time and resources is almost as bad as building The Wall. In the opening episodes, the agent typically drifts off the road (say, towards the left) within the first 200 steps and stays there with its nose in the wall, filling the replay buffer with trash and learning nothing along the way. After a few tens of laborious episodes, it does the same thing on the other side.
The obvious solution is to restart episodes early, which makes learning faster and more efficient. An episode is restarted if the agent:
- goes off the track,
- is making no progress,
- or has turned around.
One can modify the parameters and/or penalties associated with the above conditions, as well as add new conditions, in gym_torcs.py.
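The three restart conditions listed above could be collected into a single check, roughly as follows. This is a minimal sketch and not the code in gym_torcs.py: the function name, argument names, and threshold values are illustrative assumptions. It assumes a normalized lateral track position whose absolute value exceeds 1 once the car is off the track, and an angle in radians between the car's heading and the track axis.

```python
import math


def should_restart(track_pos, progress, angle, step,
                   grace_steps=100, min_progress=1.0):
    """Return (restart, reason) for one simulation step.

    All names and thresholds here are hypothetical; tune them for your setup.
    track_pos : normalized lateral position, assumed |track_pos| > 1 off track
    progress  : speed component along the track axis
    angle     : angle between car heading and track axis, in radians
    step      : number of steps elapsed in the current episode
    """
    if abs(track_pos) > 1.0:
        return True, "off_track"
    # Ignore slow progress during the first few steps so the car can launch.
    if step > grace_steps and progress < min_progress:
        return True, "no_progress"
    # cos(angle) < 0 means the car points more than 90 degrees away
    # from the direction of the track, i.e. it has turned around.
    if math.cos(angle) < 0.0:
        return True, "turned_around"
    return False, ""
```

Returning the reason alongside the flag makes it easy to attach a different penalty to each kind of restart.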
- With multiple agents learning in the same environment, the following design choice is made: whenever any agent requires a restart, the episode is reset for all the agents. If the other agents were allowed to run until all of them crashed or reached the maximum episode length, the experience replay of the crashed agent(s) would fill up with garbage transitions, slowing down the learning.
- One way to think about this is that the learning occurs at the rate of the slowest agent.
- In case you can think of a better design choice, please create a new issue and hopefully a pull request. :)
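The "reset everyone when anyone needs a restart" rule can be sketched as below. The callable `step_all_agents` is a hypothetical stand-in for whatever advances every agent by one step and reports one restart flag per agent; it is not part of this repository.

```python
def run_joint_episode(step_all_agents, max_steps):
    """Step all agents together; stop as soon as any agent needs a restart.

    step_all_agents : hypothetical callable returning a list of booleans,
                      one restart flag per agent, for the step just taken
    Returns the number of steps the joint episode lasted.
    """
    for step in range(1, max_steps + 1):
        restart_flags = step_all_agents()
        if any(restart_flags):
            # One agent crashed or stalled: end the episode for everyone.
            return step
    return max_steps
```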
Popular exploration techniques like epsilon-greedy do not work well here with the 3 continuous actions (steering, acceleration, and brake), because random actions produce useless combinations, such as a brake value larger than the acceleration value, which results in no movement. Instead, an Ornstein-Uhlenbeck process is commonly used to generate temporally correlated exploration noise with inertia. In simple words, the noise is mean-reverting: it keeps drifting back towards a mean that you choose. By initializing suitable means for the noise in acceleration and braking, you avoid the nonsensical values that epsilon-greedy exploration produces.
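A discretized Ornstein-Uhlenbeck process can be sketched in a few lines using only the standard library. The class name and the per-action parameter values below are illustrative assumptions rather than settings taken from this document; the update rule is the usual Euler discretization x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1).

```python
import math
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process for one action dimension."""

    def __init__(self, mu, theta, sigma, dt=1.0, x0=None, rng=None):
        self.mu = mu          # mean the noise reverts to
        self.theta = theta    # strength of the pull towards mu
        self.sigma = sigma    # scale of the random perturbation
        self.dt = dt
        self.x = mu if x0 is None else x0
        self.rng = rng or random.Random()

    def sample(self):
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


# Illustrative parameters only (assumptions): a positive mean for
# acceleration and a slightly negative mean for braking bias exploration
# towards moving forward.
noise = {
    "steer": OUNoise(mu=0.0, theta=0.6, sigma=0.3),
    "accel": OUNoise(mu=0.5, theta=1.0, sigma=0.1),
    "brake": OUNoise(mu=-0.1, theta=1.0, sigma=0.05),
}
```

Each sample would be added to the corresponding action from the policy and the result clipped to the valid action range before being sent to the simulator.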
- Don't waste precious CPU cycles rendering every frame of your training process. Turn rendering off by setting the display mode in the configuration file to "results only". This gives a 2-5x speedup.
- Locate the configuration files `quickrace.xml` or `practice.xml` using `sudo find / -name practice.xml`.
- A good practice is to visualize the training every 10 episodes or so, as per your convenience, to check that nothing untoward is happening, and to take appropriate measures if it is. 1
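Switching the display mode can also be scripted. The sketch below rewrites the setting in the text of a configuration file; it assumes the setting is stored as an element of the form `<attstr name="display mode" val="normal"/>`, which you should confirm against your own `practice.xml` or `quickrace.xml` before relying on it. The function name is hypothetical.

```python
import re

# Assumed layout of the setting: <attstr name="display mode" val="..."/>
_DISPLAY_MODE = re.compile(r'(<attstr\s+name="display mode"\s+val=")[^"]*(")')


def set_display_mode(xml_text, mode="results only"):
    """Return (new_text, n_replaced) with each display-mode value set to mode.

    Only the matched attribute value is changed; the rest of the file
    is left exactly as it was.
    """
    return _DISPLAY_MODE.subn(
        lambda m: m.group(1) + mode + m.group(2), xml_text
    )
```

To use it, read the file found with the `find` command into a string, call `set_display_mode`, and write the result back, keeping a backup of the original. If `n_replaced` is 0, the file does not match the assumed layout and should be edited by hand.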
Now, getting DRL to work even on simple Atari or MuJoCo tasks is painful in itself. Why make a hard task harder when we can use some domain knowledge to help the agent learn the expected behavior?
- Suppose you're training the agent(s) to drive in traffic. Since TORCS is inherently a racing game, all the cars (including both the agents and the 'traffic') start from a grid. To give the agent(s) a decent amount of experience with overtaking situations, let the traffic get a head start of 500-700 steps. In technical terms, take a no-op action for that many steps.
- Similarly, think of other ways you can make the agent's job simpler. After all, it does not have years and years of a cognitive model being trained with the physics and dynamics of the world and the plethora of objects that exist in it. 2
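The traffic head start described above amounts to a short loop of no-op actions at the beginning of each episode. This is a minimal sketch, assuming a Gym-style environment whose `step(action)` returns `(observation, reward, done, info)`; the function name and the choice of no-op action are hypothetical and depend on your action encoding.

```python
def give_traffic_head_start(env, noop_action, n_steps=500):
    """Hold the agent with a no-op action for up to n_steps.

    env         : assumed to expose step(action) -> (obs, reward, done, info)
    noop_action : whatever "do nothing" means in your action encoding
    Returns (observation, done, steps_taken); observation is None if
    n_steps is 0.
    """
    obs, done, taken = None, False, 0
    for _ in range(n_steps):
        obs, _reward, done, _info = env.step(noop_action)
        taken += 1
        if done:
            # The episode ended during the head start; let the caller reset.
            break
    return obs, done, taken
```

The returned observation is the starting point for the learning agent. It is probably sensible not to store these no-op transitions in the replay buffer, since they carry no information about the agent's own decisions.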
1 This point mainly originates from an approaching student-project deadline with only limited resources at hand...
2 That in itself is another interesting direction of research. ;)