Most publications and explanations of deep learning rely on two sorts of explanation:
While I must give all credit to Colah’s blog for putting together that excellent diagram, I think it is telling why it is so well known: every other LSTM diagram before it was kind of terrible. Making high-quality diagrams is hard, and takes a lot of time and effort.
And it is hard to do math in diagrams: each transformation requires a new diagram, so explaining every step takes a lot of time and effort. Proof-oriented diagrams (like my own here) are extremely time-consuming, so you more often see proofs written like this:
TERRIBLE LOOKING PROOF HERE
What I am looking for is a notation system that is
I’ll start by example:
To train on observations X and known values y with a 2-layer network under an MSE error:
X --[M_1|relu]-> h_1 --[M_2]-> p => y
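As a sanity check, this one-line program maps directly onto a few lines of numpy. The shapes, seed, and parameter scales below are my own choices; `M_1` and `M_2` are read as dense parameter matrices, and `=>` as "train p toward y":

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 8 observations with 4 features, scalar targets.
X = rng.normal(size=(8, 4))
y = rng.normal(size=(8, 1))

# Parameters M_1 and M_2 (dense layers; biases omitted for brevity).
M_1 = rng.normal(size=(4, 16))
M_2 = rng.normal(size=(16, 1))

h_1 = np.maximum(X @ M_1, 0.0)    # X --[M_1|relu]-> h_1
p = h_1 @ M_2                     # h_1 --[M_2]-> p
loss = np.mean((p - y) ** 2)      # p => y under MSE
```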
To train on discrete labels y with a softmax cross-entropy error (a bit surprising, but yes, the math checks out):
X --[M_1|relu]-> h_1 --[M_2|softmax]-> p => one_hot(y)
To train on an observation sequence X_t and a label sequence y_t using MSE loss:
0 -> h_0
X_t|h_{t-1} --[Enc]-> h_t --[Gen]-> p => y_t
Of course, different functions can be specified for Enc and Gen:
0 -> h_0
X_t|h_{t-1} --[M_1]-> h_t --[M_2]-> p => y_t
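Unrolled, this recurrent program is just a loop. A minimal numpy sketch, where `|` is read as concatenation and the shapes and parameter scales are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

T, d_in, d_h = 5, 3, 8
X = rng.normal(size=(T, d_in))
y = rng.normal(size=(T, 1))

# Hypothetical parameters for M_1 and M_2 (plain dense layers).
W_1 = rng.normal(size=(d_in + d_h, d_h)) * 0.1
W_2 = rng.normal(size=(d_h, 1)) * 0.1

h = np.zeros(d_h)                       # 0 -> h_0
loss = 0.0
for t in range(T):
    c = np.concatenate([X[t], h])       # X_t|h_{t-1}
    h = c @ W_1                         # --[M_1]-> h_t
    p = h @ W_2                         # --[M_2]-> p
    loss += np.mean((p - y[t]) ** 2)    # p => y_t under MSE
```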
Here Enc can be defined:
A somewhat computational definition:
0 -> h_0
h_{t-1}|x_t -> c_t
c_t --[R|sigmoid]-> r_t
c_t --[Z|sigmoid]-> z_t
x_t|(h_{t-1}*r_t) --[H|tanh]-> l_t
h_{t-1} * (1 - z_t) + z_t * l_t -> h_t
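This definition maps almost line-for-line onto code. A minimal numpy sketch (parameter shapes are my own, biases are omitted, and the final line uses the standard GRU update h_t = (1 - z_t) * h_{t-1} + z_t * l_t):

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

def gru_step(h_prev, x, R, Z, H):
    """One GRU step following the notation above."""
    c = np.concatenate([h_prev, x])          # h_{t-1}|x_t -> c_t
    r = sigmoid(c @ R)                       # c_t --[R|sigmoid]-> r_t
    z = sigmoid(c @ Z)                       # c_t --[Z|sigmoid]-> z_t
    cand = np.concatenate([x, h_prev * r])   # x_t|(h_{t-1}*r_t)
    l = np.tanh(cand @ H)                    # --[H|tanh]-> l_t
    return h_prev * (1 - z) + z * l          # standard GRU update -> h_t
```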
We can define a new macro instead, which will be useful if we want a layered GRU.
Enc: h_{t-1},x_t
h_{t-1}|x_t -> c_t
c_t --[R|sigmoid]-> r_t
c_t --[Z|sigmoid]-> z_t
x_t|(h_{t-1}*r_t) --[H|tanh]-> l_t
h_{t-1} * (1 - z_t) + z_t * l_t -> h_t
return h_t
h_0 := 0
h_{t-1},x_t --[Enc]-> h_t
0 -> c_0
0 -> h_0
h_{t-1}|x_t -> p_t
p_t --[F|sigmoid]-> f_t
p_t --[I|sigmoid]-> i_t
p_t --[C|tanh]-> l_t
p_t --[O|sigmoid]-> o_t
c_{t-1}*f_t + i_t*l_t -> c_t
tanh(c_t)*o_t -> h_t
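The LSTM program translates just as directly. A minimal numpy sketch (parameter shapes are my own; biases omitted):

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

def lstm_step(h_prev, c_prev, x, F, I, C, O):
    """One LSTM step following the notation above."""
    p = np.concatenate([h_prev, x])   # h_{t-1}|x_t -> p_t
    f = sigmoid(p @ F)                # p_t --[F|sigmoid]-> f_t
    i = sigmoid(p @ I)                # p_t --[I|sigmoid]-> i_t
    l = np.tanh(p @ C)                # p_t --[C|tanh]-> l_t
    o = sigmoid(p @ O)                # p_t --[O|sigmoid]-> o_t
    c = c_prev * f + i * l            # c_{t-1}*f_t + i_t*l_t -> c_t
    h = np.tanh(c) * o                # tanh(c_t)*o_t -> h_t
    return h, c
```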
To define an environment, we need a few custom functions:
reset() -> S_0
step(S_{t-1}) -> S_t,O_t,r_t,d_t
(in general, of course, these will not be differentiable)
0.999 -> gamma
reset() -> S_0
S_{t-1},a_{t-1} --[step]-> S_t,O_t,r_t,d_t
O_t --[Value]-> V_t => V_{t-1} * gamma + r_t
0.999 -> gamma
reset() -> S_0
S_{t-1} --[step]-> S_t,O_t,r_t,d_t
O_t --[Value]-> V_t => V_{t-1} * gamma + r_t
V_{t+1}*gamma + r_t - V_t -> A_t
O_t --[clip_grad(Policy)]-> pi_t $ A_t * log(pi_t)
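The advantage line is plain arithmetic once a rollout of values and rewards is collected. A small numpy sketch with made-up rollout data; I read the `$` objective as "sum A_t * log(pi_t) with A_t held constant":

```python
import numpy as np

gamma = 0.999

# Hypothetical rollout data: values V_0..V_3 and rewards r_0..r_2.
V = np.array([0.5, 0.6, 0.4, 0.3])
r = np.array([0.0, 1.0, 0.0])

# V_{t+1}*gamma + r_t - V_t -> A_t  (one-step TD advantage)
A = V[1:] * gamma + r - V[:-1]

# Hypothetical probabilities of the actions actually taken.
pi = np.array([0.2, 0.5, 0.9])

# $ A_t * log(pi_t), with A_t treated as a constant.
objective = np.sum(A * np.log(pi))
```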
This one is interesting because there is forward-looking logic, which is very hard to implement in a general and efficient way (likely there would have to be some point where gamma^n * lam^n is rounded to 0 and ignored). However, it is fairly easy to write down.
0.999 -> gamma
0.95 -> lam
reset() -> S_0
S_{t-1} --[step]-> S_t,O_t,r_t,d_t
O_t --[Value]-> V_t => V_{t-1} * gamma + r_t
V_{t+1}*gamma + r_t - V_t -> TD_t
TD_t + gamma * lam * A_{t+1} * (1-d_t) -> A_t
O_t --[clip_grad(Policy)]-> pi_t $ A_t * log(pi_t)
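One way the forward-looking recursion becomes tractable: once the rollout is fully collected, it turns into a backward loop (this is the usual GAE trick). A numpy sketch with made-up rollout data:

```python
import numpy as np

gamma, lam = 0.999, 0.95

# Hypothetical rollout: T = 4 steps, plus a bootstrap value V_4.
V = np.array([0.5, 0.6, 0.4, 0.3, 0.2])
r = np.array([0.0, 1.0, 0.0, 0.0])
d = np.array([0.0, 0.0, 0.0, 1.0])   # d_t = 1 marks episode end

T = len(r)
TD = V[1:] * gamma + r - V[:-1]      # V_{t+1}*gamma + r_t - V_t -> TD_t

# The forward-looking recursion
#   TD_t + gamma * lam * A_{t+1} * (1-d_t) -> A_t
# evaluated as a backward loop:
A = np.zeros(T)
A[T - 1] = TD[T - 1]                 # no successor advantage at the end
for t in range(T - 2, -1, -1):
    A[t] = TD[t] + gamma * lam * A[t + 1] * (1 - d[t])
```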
Taking the Enc definition from the GRU implementation:
0.999 -> gamma
reset() -> S_0
S_{t-1} --[step]-> S_t,O_t,r_t,d_t
0 -> h_0
O_t,h_{t-1} --[Enc]-> h_t
h_t --[Value]-> V_t => V_{t-1} * gamma + r_t
DQN requires a buffer, which is a bit of a weird function/parameter combo.
Specialized buffer functions include:
There is also the small complexity of adding an action-sampling method…
0.999 -> gamma
reset() -> S_0
S_{t-1} --[step]-> S_t,O_t,r_t,d_t
O_{t-1} --[Q]-> q_{t-1} $ r_t + one_hot(argmax(q_t))
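The buffer itself is easy to sketch in code: a bounded FIFO of transitions plus uniform sampling, which is where the function/parameter flavor comes from (it is called like a function but carries mutable state like a parameter). A minimal version, omitting the specialized buffer functions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform-sampling replay buffer sketch."""

    def __init__(self, capacity):
        # Bounded FIFO: old transitions are evicted automatically.
        self.data = deque(maxlen=capacity)

    def push(self, transition):
        # A transition might be e.g. (O_{t-1}, a_{t-1}, r_t, O_t, d_t).
        self.data.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.data), batch_size)
```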
Three primitive concepts:
States are local variable definitions. States are immutable and local to a single input value (possibly shared across time). State definitions have attributes
A reference to a state is accessed with the notation <Name>_<context> (the context is optional and defaults to 0).
Parameters are global variable definitions that are shared across all data points and all contexts. Parameter instantiations are mutable, and by definition, they have the following attributes:
Functions operate on states and update parameters.
3 rules define a function:
If f is a stateless function (like relu) then Update() will accept and return an empty set of parameters. If f is a non-differentiable function (like one_hot) then Autodiff will additionally return constant zeros as the gradients for its inputs.
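As an illustration, here is a minimal Python sketch of what such a function interface might look like for a stateless function. The text names Update and Autodiff; `apply` is my own guess at the third rule, and the whole interface is an assumption:

```python
class Relu:
    """Stateless function: forward pass, input gradients, no parameters."""

    def apply(self, x):
        # Forward rule: elementwise max(x, 0).
        return [xi if xi > 0 else 0.0 for xi in x]

    def autodiff(self, x, grad_out):
        # Gradient rule: pass gradients through where the input was positive.
        return [g if xi > 0 else 0.0 for xi, g in zip(x, grad_out)]

    def update(self, params):
        # Update rule: stateless, so it accepts and returns an empty
        # set of parameters unchanged.
        return params
```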
Builtin functions include:
dense_M(x)
conv_K(x)
add(x,y)
exp(x)
batch_norm(x)
dropout(x)
Transforms operate on functions (essentially higher-order functions). Transforms have 3 major operations
Some special transforms include:
The implementation should be able to generate low-level tensor code for any backend, for example PyTorch.
AST manipulations, transformation rules, and static optimization procedures should be done in a proper functional language, such as OCaml.
Features like RL environments and replay buffers should use C extensions to execute the functions. Proper support for these may require type definitions and code generation.
Decent library support to wrap these into functions is required.
Recurrent and RL methods involve performance/memory tradeoffs over whether caching or recomputation is best.
The static optimizer should try some combinations and determine the best tradeoff.