Learning.lhs
Cody Shepherd
> module Learning where
> import System.Random
> import Data.List
> import Data.List.Extras.Argmax
> import Data.List.Utils
> import Data.Ord
> import Board
> import Data.Map (Map)
> import qualified Data.Map as Map
This is the module that defines the learning algorithm - i.e. the math part.
It defines the functions responsible for creating and updating the Q-Matrix.
First, we should define the Q-Matrix and specify how one stores, retrieves, and updates
the information contained within it.
The state of a cell is going to be important in the Robot's learning algorithm. The robot needs
to be able to observe the contents of its cell and neighboring cells in order to make a
decision about the best course of action.
Therefore we create named values for the different observations that make up a state.
> data State = E
>            | C
>            | W
>            | Rc
>            | Re
>            deriving (Eq, Show, Ord, Enum)
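Reading the observation functions below: E marks an empty neighboring cell, C a cell occupied by one of the board's contents (the cs list), W a wall, and Rc / Re the robot's own cell with or without such an occupant.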
We are also going to want a quick way to create a "hash" of a given state signature, so that we can
store information about a state configuration in the Qmatrix.
> stateKey :: [State] -> String
> stateKey [] = []
> stateKey (x:xs) = let nx = case x of
>                             E  -> '_'
>                             C  -> '.'
>                             W  -> 'w'
>                             Rc -> '%'
>                             Re -> 'o'
>                   in nx : stateKey xs
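As a quick sanity check (a hypothetical snippet, not part of the original module), observing W to the north, E to the south and east, C to the west, and Re on the robot's own cell hashes as follows:

> -- Hypothetical illustration only; exampleKey is not used anywhere else.
> exampleKey :: String
> exampleKey = stateKey [W, E, E, C, Re]   -- evaluates to "w__.o"
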
Our Qmatrix is just a simple mapping from stateKey strings to lists of Q-values, one Double per action.
> type Qmatrix = Map String [Double]
Now that we have a notion of a state and a state key, plus a Qmatrix entry for each (a stateKey string
mapped to a list of five Double values, one per action), we need a function that allows Rob to observe
his current state.
I have chosen to break this observe function into five separate helper functions (one per neighboring
direction, plus one for the robot's own cell) to make it easier to manage and understand. Ultimately an
"observation" results in a stateKey string representing the configuration of the 5 cells that make up
the robot's "footprint."
> obsU :: Board -> State
> obsU (Board ((a, b), cs, r))
>   | fst rx == a-1 = W
>   | any (\x -> (cfst x == (fst rx)+1) && (csnd x == snd rx)) cs = C
>   | otherwise = E
>   where
>     rx = rToPair r
> obsD :: Board -> State
> obsD (Board ((a, b), cs, r))
>   | fst rx == 0 = W
>   | any (\x -> (cfst x == (fst rx)-1) && (csnd x == snd rx)) cs = C
>   | otherwise = E
>   where
>     rx = rToPair r
> obsE :: Board -> State
> obsE (Board ((a, b), cs, r))
>   | snd rx == b-1 = W
>   | any (\x -> (csnd x == (snd rx)+1) && (cfst x == fst rx)) cs = C
>   | otherwise = E
>   where
>     rx = rToPair r
> obsW :: Board -> State
> obsW (Board ((a, b), cs, r))
>   | snd rx == 0 = W
>   | any (\x -> (csnd x == (snd rx)-1) && (cfst x == fst rx)) cs = C
>   | otherwise = E
>   where
>     rx = rToPair r
> obsH :: Board -> State
> obsH (Board ((a, b), cs, r))
>   | any (\x -> (csnd x == snd rx) && (cfst x == fst rx)) cs = Rc
>   | otherwise = Re
>   where
>     rx = rToPair r
> observe :: Board -> String
> observe b = let n = obsU b
>                 s = obsD b
>                 e = obsE b
>                 w = obsW b
>                 h = obsH b
>             in stateKey [n, s, e, w, h]
Fundamentally, Q-Learning requires two parts: training and testing. These parts differ
in that the Q-matrix may be updated during training, but not during testing.
Training
- Initialize Q(s,a) to all zeros
- Initialize s
- Select action a
- Take action a and receive reward r
- Observe new state s'
- Update Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a))   (a worked example follows this list)
- s <- s'
Testing
- The same steps as above, except without updating the Q table
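As a quick worked example of the update rule, using the eta = 0.2 and gamma = 0.9 defined below and made-up numbers: if Q(s,a) = 0, the reward r = 10, and max_a' Q(s',a') = 5, then the new value is 0 + 0.2 * (10 + 0.9 * 5 - 0) = 2.9.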
In service of either algorithm, the robot needs to be able to select an action
in a deterministic way, but also get a random action if it wants.
> action :: Int -> IO Dir
> action n = case n `mod` 6 of
>              0 -> return U
>              1 -> return D
>              2 -> return R
>              3 -> return L
>              4 -> return P
>              5 -> do v <- randomRIO (0, 4)
>                      action v
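Calling action with any value in [0,4] deterministically returns the corresponding action; action 5 draws a random index in [0,4] and recurses, so it returns a uniformly random action.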
Sometimes this action selection should be random (e.g. when the robot has not learned enough,
or when it wants to maximize exploration over exploitation). The probability of choosing randomly
should be governed by a tunable value.
> isRandom :: Double -> IO Bool
> isRandom p = do v <- randomRIO (0.0, 1.0)
>                 if v < p then return True else return False
We also want a way to get an action we know or think is good. This requires looking at the Qmatrix
and checking for any "learned knowledge" stored under the given state configuration's key.
> maxI :: [Double] -> Int
> maxI xs = let (f, i) = maximumBy (comparing fst) (zip xs [0..]) in i
> bestAction :: String -> Qmatrix -> IO Dir
> bestAction k q = do let l = Map.lookup k q
>                     --print ("bestAction lookup: " ++ show l)
>                     case l of
>                       Just n  -> action (maxI n)
>                       Nothing -> action 5
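For instance (with made-up numbers), maxI [0.0, 2.5, 1.0, 0.0, 0.0] is 1, so if that row were the lookup result, bestAction would return the same action that action 1 yields (D).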
In order to perform a single step during training, we need the updated states of
the Board and the Q-table; a single step potentially updates both, so it should return them
together as a pair.
We will need some global constants to control the behavior of our program and keep the number of
copied parameters to a minimum.
> eta = 0.2 :: Double
> gamma = 0.9 :: Double
Training is fundamentally an updating of the board position and the contents of the Qmatrix. Because it
relies on randomness, it becomes an IO function.
> train :: Double -> (Qmatrix, Board) -> IO (Qmatrix, Board)
> train eps (q, b) = do t <- isRandom eps
>                       let s = observe b
>                       --showBoard b
>                       a <- if t then action 5 else bestAction s q
>                       --print a
>                       let (s', r) = move a b
>                           q' = updateQ q s (observe s') a r
>                       --print ("reward " ++ show r)
>                       --print ("stateKey: " ++ show s)
>                       --print (Map.lookup s q')
>                       return (q', s')
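A minimal sketch of how train might be iterated over an episode, assuming the caller supplies a starting Board (trainSteps is hypothetical and not part of the original module):

> -- Hypothetical driver: run n training steps, threading the Qmatrix and Board through.
> trainSteps :: Int -> Double -> (Qmatrix, Board) -> IO (Qmatrix, Board)
> trainSteps 0 _   qb = return qb
> trainSteps n eps qb = train eps qb >>= trainSteps (n - 1) eps
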
The Qmatrix must be updated according to the Q-Learning algorithm. A more detailed explanation of this algorithm,
and thus what is going on in this function, is provided in my paper.
> updateQ :: Qmatrix -> String -> String -> Dir -> Double -> Qmatrix
> updateQ q s s' a r =
>   let qcurrent = Map.lookup s q
>   in case qcurrent of
>        Just sa ->
>          let y = head $ drop (dirIndex a) sa
>              snext = Map.lookup s' q
>              qt = case snext of
>                     Just z  -> let newval = y + (eta * (r + (gamma * (maximum z)) - y))
>                                in take (dirIndex a) sa ++ [newval] ++ drop ((dirIndex a) + 1) sa
>                     Nothing -> let newval = y + (eta * (r - y))
>                                in take (dirIndex a) sa ++ [newval] ++ drop ((dirIndex a) + 1) sa
>          in Map.insert s qt q
>        Nothing ->
>          let y = 0.0
>              snext = Map.lookup s' q
>              qt = case snext of
>                     Just z  -> let newval = y + (eta * (r + (gamma * (maximum z)) - y))
>                                in replicate (dirIndex a) 0.0 ++ [newval] ++ (drop ((dirIndex a) + 1) $ replicate 5 0.0)
>                     Nothing -> let newval = eta * r
>                                in replicate (dirIndex a) 0.0 ++ [newval] ++ (drop ((dirIndex a) + 1) $ replicate 5 0.0)
>          in Map.insert s qt q
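As a small sanity check (hypothetical values, not part of the original module), a single update with reward 10.0 applied to the zero-initialized table introduced just below should leave eta * 10.0 = 2.0 in the updated entry, since every estimate for the next state is still zero:

> -- Hypothetical example: exampleUpdate's entry for state "w__.o" and action U
> -- should equal eta * 10.0 = 2.0 after this single update.
> exampleUpdate :: Qmatrix
> exampleUpdate = updateQ newQTable "w__.o" "____o" U 10.0
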
We also need a quick way to generate the very first, zero-initialized Qmatrix, based on combinations over the
possible stateKeys. Note that there are some states generated by this algorithm that will never occur, such as
"wwwwo" (a cell surrounded by four walls), but the cost in memory and time is minimal, so we don't really care.
> newQTable :: Map String [Double]
> newQTable = let strings = [ [n] ++ [s] ++ [e] ++ [w] ++ [h] | n <- [E .. W], s <- [E .. W]
>                                                             , e <- [E .. W], w <- [E .. W], h <- [Rc .. Re]]
>             in Map.fromList $ map makePair $ map stateKey strings
>   where makePair x = (x, [0.0,0.0,0.0,0.0,0.0])
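This comprehension enumerates three possibilities (E, C, or W) for each of the four neighboring cells and two (Rc or Re) for the robot's own cell, so the fresh table holds 3^4 * 2 = 162 zeroed rows.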
A test step is fundamentally similar to a training step. Note the lack of a call to updateQ.
> test :: Double -> (Qmatrix, Board, Double) -> IO (Qmatrix, Board, Double)
> test eps (q, b, r) = do let s = observe b
>                         showBoard b
>                         t <- isRandom eps
>                         --print ("Is action random: " ++ show t)
>                         a <- if t then action 5 else bestAction s q
>                         let (s', r') = move a b
>                         print ("Action: " ++ show a)
>                         print r'
>                         throwaway <- getLine
>                         return (q, s', r'+r)
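Note that the third element of the tuple threads the accumulated reward: each call adds the reward r' earned on this step to the running total r, so iterating test over an episode leaves the episode's total reward in that slot. The getLine call just waits for a line of input, pausing the run between steps.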