Recurrent neural networks (RNNs) add feedback from past states to a neural network, and with it the notion of a time step. Unrolled over time, an RNN is represented as a computational DAG.

An RNN layer adds a state variable $h$ (the hidden unit), and the operation of the layer depends on this state. The state is updated at every time step. $f$ and $g$ can be arbitrary differentiable functions (differentiability is required for backpropagation).

$$
\begin{aligned}
h_{t} &= f (h_{t - 1}, x_{in} [t]) \\
x_{out} [t] &= g (h_{t})
\end{aligned}
$$
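The recurrence above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the notes: $f$ is taken to be a tanh of an affine map, $g$ a linear readout, and the helper names `matvec`, `rnn_step`, and `run_rnn` are made up for this example.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(h_prev, x_in, W_hh, W_hx, W_ho):
    """One time step: h_t = f(h_{t-1}, x_in[t]) and x_out[t] = g(h_t)."""
    pre = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_hx, x_in))]
    h = [math.tanh(p) for p in pre]   # f: assumed to be tanh(W_hh h + W_hx x)
    x_out = matvec(W_ho, h)           # g: assumed to be a linear readout
    return h, x_out

def run_rnn(xs, h0, W_hh, W_hx, W_ho):
    """Apply the same step at every time step, carrying the state forward."""
    h, outs = h0, []
    for x in xs:
        h, out = rnn_step(h, x, W_hh, W_hx, W_ho)
        outs.append(out)
    return h, outs
```

The same weights are reused at every step; only the state `h` changes, which is what makes the unrolled graph a DAG with shared parameters.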

Backpropagation through time: our goal is to compute the maximum likelihood estimate (MLE) of the parameters by solving $θ^{*} = \arg\max_{θ} p (y_{1 : T} ∣ x_{1 : T}, θ)$. To compute the MLE, we need the gradients of the loss with respect to the parameters. Consider a general parameterized RNN model:

$$
\begin{aligned}
h_{t} &= W_{h x} x_{t} + W_{hh} h_{t - 1} = f (x_{t}, h_{t - 1}, w_{h}) \\
o_{t} &= W_{h o} h_{t} = g (h_{t}, w_{o})
\end{aligned}
$$

The loss can be written as $L = \frac{1}{T} \sum_{t = 1}^{T} ℓ (y_{t}, o_{t})$, and we need to compute $\frac{\partial L}{\partial W _{h x}}, \frac{\partial L}{\partial W _{hh}}, \frac{\partial L}{\partial W _{h o}}$. Here $w_{h}$ is just the flattened version of $W_{h x}$ and $W_{hh}$, so:
$\frac{\partial L}{\partial w _{h}} = \frac{1}{T} \sum_{t = 1}^{T} \frac{\partial ℓ ( y _{t} , o _{t} )}{\partial w _{h}}$
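Using the chain rule, each term of this sum can be computed, but the term for step $t$ depends on all earlier time steps, so the computation takes $O (T^{2})$ time overall. In practice the sum is therefore truncated after a fixed number of time steps to keep the computation tractable.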
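To see where the quadratic cost comes from, the chain rule can be written out as a recursion. This is a sketch that assumes the hidden state has the form $h_{t} = f (x_{t}, h_{t - 1}, w_{h})$, as implied by the recurrence $h_{t} = W_{h x} x_{t} + W_{hh} h_{t - 1}$:

$$
\begin{aligned}
\frac{\partial ℓ (y_{t}, o_{t})}{\partial w_{h}} &= \frac{\partial ℓ (y_{t}, o_{t})}{\partial o_{t}} \frac{\partial g (h_{t}, w_{o})}{\partial h_{t}} \frac{\partial h_{t}}{\partial w_{h}} \\
\frac{\partial h_{t}}{\partial w_{h}} &= \frac{\partial f (x_{t}, h_{t - 1}, w_{h})}{\partial w_{h}} + \frac{\partial f (x_{t}, h_{t - 1}, w_{h})}{\partial h_{t - 1}} \frac{\partial h_{t - 1}}{\partial w_{h}}
\end{aligned}
$$

Unrolling the second line gives a sum with one term per earlier time step, so step $t$ costs $O (t)$ and all $T$ steps together cost $O (T^{2})$. Truncation keeps only the most recent few terms of that sum.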
Exploding and vanishing gradients: backpropagation multiplies by a Jacobian at each time step. Consider the gradient $\frac{\partial x _{out} [ t ]}{\partial x _{in} [ 0 ]} = \frac{\partial x _{out} [ t ]}{\partial h _{t}} \frac{\partial h _{t}}{\partial h _{t - 1}} \dots \frac{\partial h _{0}}{\partial x _{in} [ 0 ]}$, where $\frac{\partial h _{t}}{\partial h _{t - 1}} = σ^{'} W$ and $σ^{'}$ is the derivative of the nonlinearity. This factor is applied once per time step, so if its magnitude is consistently larger than 1 the gradient grows exponentially with $t$ (exploding gradients), and if it is consistently smaller than 1 the gradient shrinks exponentially (vanishing gradients).
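The effect of the repeated Jacobian product is easy to see in the scalar case. The sketch below is an illustration, not part of the original notes: it assumes a one-dimensional recurrence $h_{t} = σ (w h_{t - 1})$ with no input, and the function name and its `linear` flag are hypothetical.

```python
import math

def grad_h_T_wrt_h_0(w, T, h0=0.1, linear=False):
    """Product of the per-step factors dh_t/dh_{t-1} = sigma'(a_t) * w
    for the scalar recurrence h_t = sigma(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(T):
        a = w * h
        if linear:
            h, dsigma = a, 1.0        # sigma is the identity, so sigma' = 1
        else:
            h = math.tanh(a)
            dsigma = 1.0 - h * h      # tanh'(a) = 1 - tanh(a)^2
        grad *= dsigma * w            # one Jacobian factor per time step
    return grad
```

With `w = 0.5` and tanh, every factor is at most 0.5, so the product shrinks toward zero as `T` grows. With the identity activation and `w = 1.5`, the product is `1.5 ** T` and blows up.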

LSTM

Much better introduction by Chris Olah.

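LSTMs mitigate the vanishing gradient problem of plain RNNs. Exploding gradients can still occur and are usually handled separately, for example with gradient clipping.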
LSTM adds a memory cell $C_{t}$ that is controlled by three gates: the output gate $O_{t}$, the input gate $I_{t}$, and the forget gate $F_{t}$. Each serves a specific purpose:
- Output: determines what gets read out of the cell into the hidden state.
- Input: determines what gets read into the cell.
- Forget: determines when to reset the cell.

Each gate is composed of a sigmoid neural network layer and a pointwise multiplication. The sigmoid outputs a value between 0 and 1 that controls how much of something passes through, and the pointwise multiplication applies that value as a soft switch: a “1” keeps something, and a “0” removes it.
Let’s first write the computation equations, and then understand them:

$$
\begin{aligned}
O_{t} &= σ (W_{o} [H_{t - 1}, X_{t}] + b_{o}) \\
I_{t} &= σ (W_{i} [H_{t - 1}, X_{t}] + b_{i}) \\
F_{t} &= σ (W_{f} [H_{t - 1}, X_{t}] + b_{f})
\end{aligned}
$$

At the first step, the old cell state is scaled by the forget gate: $C_{t - 1} ⊙ F_{t}$. This means we use the forget gate output to forget some of the information in the cell state.
At the second step, $C_{t} = C_{t - 1} ⊙ F_{t} + I_{t} ⊙ \tilde{C}_{t}$: the candidate memory $\tilde{C}_{t} = \tanh (W_{c} [H_{t - 1}, X_{t}] + b_{c})$ proposes new values that could be added to the state, and the input gate $I_{t}$ scales them according to their importance.
The memory is updated at this point, and the LSTM now decides what to output.
The output gate $O_{t}$ decides what gets read out of the cell: the cell state is passed through $\tanh$ and multiplied elementwise by $O_{t}$ to compute the next hidden state, $H_{t} = O_{t} ⊙ \tanh (C_{t})$.
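The steps above can be sketched as a single function. This is a minimal, unbatched illustration in plain Python rather than a reference implementation: the helper names and the parameter dictionary `p` (mapping `"f"`, `"i"`, `"o"`, `"c"` to a `(W, b)` pair) are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix W (list of rows), bias b and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; every (W, b) pair in p acts on the concatenation [H_{t-1}, X_t]."""
    hx = h_prev + x                                          # list concatenation
    f = [sigmoid(z) for z in affine(*p["f"], hx)]            # forget gate F_t
    i = [sigmoid(z) for z in affine(*p["i"], hx)]            # input gate I_t
    o = [sigmoid(z) for z in affine(*p["o"], hx)]            # output gate O_t
    c_tilde = [math.tanh(z) for z in affine(*p["c"], hx)]    # candidate memory
    # Forget part of the old cell state, then add the gated candidate.
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    # Read out: squash the cell state and gate it with O_t.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every gate outputs 0.5 and the candidate is 0, so the step simply halves the cell state. That makes a convenient sanity check.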

GRU

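GRU combines the forget and input gates into a single “update” gate $Z_{t} \in \mathbb{R}^{N \times H}$, and also merges the cell state and the hidden state. A reset gate $R_{t}$ controls how much of the previous hidden state enters the candidate state $\tilde{H}_{t}$.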

$$
\begin{aligned}
R_{t} &= σ (W_{r} [H_{t - 1}, X_{t}] + b_{r}) \\
Z_{t} &= σ (W_{z} [H_{t - 1}, X_{t}] + b_{z}) \\
\tilde{H}_{t} &= \tanh (W_{h} [R_{t} ⊙ H_{t - 1}, X_{t}] + b_{h}) \\
H_{t} &= Z_{t} ⊙ H_{t - 1} + (1 - Z_{t}) ⊙ \tilde{H}_{t}
\end{aligned}
$$
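The GRU update can be sketched the same way. Again this is an unbatched illustration in plain Python under assumptions of its own: the helper names and the parameter dictionary `p` (mapping `"r"`, `"z"`, `"h"` to a `(W, b)` pair) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix W (list of rows), bias b and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def gru_step(x, h_prev, p):
    """One GRU step; the gates act on the concatenation [H_{t-1}, X_t]."""
    hx = h_prev + x
    r = [sigmoid(a) for a in affine(*p["r"], hx)]          # reset gate R_t
    z = [sigmoid(a) for a in affine(*p["z"], hx)]          # update gate Z_t
    # The candidate sees the previous state only through the reset gate.
    rhx = [ri * hi for ri, hi in zip(r, h_prev)] + x
    h_tilde = [math.tanh(a) for a in affine(*p["h"], rhx)]
    # Z_t near 1 keeps the old state, Z_t near 0 replaces it with the candidate.
    return [zi * hp + (1.0 - zi) * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]
```

Because there is no separate cell state, the function returns only the new hidden state.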

Beam search

Compute the top $K$ candidate outputs at each step, expand each one in all $V$ possible ways (one per vocabulary token) to generate $VK$ candidates, and then select the top $K$ again.
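A small sketch of this loop, scoring candidates by summed log-probability. The interface is an assumption made for illustration: `log_prob_fn(prefix)` is a hypothetical callback that returns a dict from token to log-probability, and `toy_model` is a made-up two-token model, not a trained RNN.

```python
import math

def beam_search(log_prob_fn, vocab, K, steps):
    """Keep the K best prefixes; each step expands them into V * K candidates."""
    beams = [((), 0.0)]                       # (prefix, total log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            log_probs = log_prob_fn(prefix)   # next-token log-probabilities
            for tok in vocab:
                candidates.append((prefix + (tok,), score + log_probs[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:K]                # select the top K again
    return beams

def toy_model(prefix):
    """Hypothetical next-token distribution where the greedy first choice is a trap."""
    if prefix == ():
        probs = {"a": 0.6, "b": 0.4}
    elif prefix[-1] == "a":
        probs = {"a": 0.5, "b": 0.5}
    else:
        probs = {"a": 0.9, "b": 0.1}
    return {tok: math.log(q) for tok, q in probs.items()}
```

With this toy model, `K = 1` (greedy decoding) commits to `"a"` at the first step and ends with probability 0.3, while `K = 2` keeps `"b"` alive and finds `("b", "a")` with probability 0.36.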

Stochastic beam search samples the $K$ candidates without replacement instead of taking the top $K$ deterministically, i.e. pick one, renormalize the remaining probabilities, pick the next one, and so on.
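The pick, renormalize, repeat procedure can be sketched for a single step's candidate distribution. This is only an illustration of sampling without replacement, not a full stochastic beam search; it assumes "pick" means drawing a candidate at random in proportion to its probability, and the function name is hypothetical.

```python
import random

def sample_without_replacement(probs, K, rng):
    """Draw K distinct candidates: sample one, remove it, renormalize, repeat."""
    remaining = dict(probs)
    picked = []
    for _ in range(min(K, len(remaining))):
        total = sum(remaining.values())
        items = list(remaining)
        weights = [remaining[item] / total for item in items]   # renormalize
        choice = rng.choices(items, weights=weights, k=1)[0]
        picked.append(choice)
        del remaining[choice]      # without replacement: it cannot be drawn again
    return picked
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, for example `sample_without_replacement({"a": 0.5, "b": 0.3, "c": 0.2}, 2, random.Random(0))`.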
