Metadata

These are my notes for Efficient Neural Architecture Search - ENAS

Just a high-level overview of the main idea. I will not go into the details of each component.

The game: Learning what we should learn but faster…

In an earlier writeup I described the setting of learning hyperparameters using the REINFORCE algorithm.

Quick recap: start with some random configurations of convolutional neural networks. Train them until convergence and evaluate their performance on some validation set. Each configuration can be described as a sequence of tokens. Feed these sequences and their validation accuracies to an RNN controller to produce new sequences. Use those sequences as the new configurations and repeat.
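The loop above can be sketched roughly as follows. This is a toy stand-in, not the paper's code: the controller is just a random sampler and the "validation accuracy" is a made-up scoring function, so all names here are my own.

```python
import random

# Toy token vocabulary for describing architectures (my own choice of tokens).
VOCAB = ["conv3x3", "conv5x5", "relu", "sigmoid"]

def sample_architecture(length=4):
    """Stand-in for the RNN controller: emit a sequence of tokens."""
    return [random.choice(VOCAB) for _ in range(length)]

def train_and_evaluate(arch):
    """Stand-in for training a child network to convergence and
    measuring its validation accuracy (a fake score here)."""
    return sum(tok.startswith("conv") for tok in arch) / len(arch)

history = []
for step in range(10):
    arch = sample_architecture()
    reward = train_and_evaluate(arch)
    # A real controller would take a REINFORCE gradient step on `reward` here.
    history.append((arch, reward))

best_arch, best_reward = max(history, key=lambda p: p[1])
```

The expensive part is `train_and_evaluate`: in real NAS each sampled child network is trained from scratch, which is what racks up the GPU-days.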

One of the downsides of this approach, as mentioned in the paper, is that it takes 450 GPUs and 3 days to train the model end to end. So unless you're Mother Google or one of the other big tech companies - good luck coding this up.

In this paper the authors show us how to get the same results using a single 1080 Ti GPU and 7 hours of training.

What changed?

Well, close to everything. So it's probably more productive not to relate this part to the earlier post and instead treat it as its own separate idea.

The main idea

“Central to the idea of ENAS is the observation that all of the graphs which NAS ends up iterating over can be viewed as sub-graphs of a larger graph”

So say we define some graph where each node defines some computation.

Take, for example, the picture below: node 1 represents a 5x5 convolution with a certain stride and padding, node 2 represents an activation function like ReLU or sigmoid, and so on. One can construct a neural network by subsampling nodes in this big graph and sticking them together.
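As a toy encoding of this idea (my own illustration, not the paper's representation): each node offers a menu of operations, and an architecture is a choice of one operation per node. The pooling options at node 3 are an assumption I added for the example.

```python
# Each node in the "big graph" offers a menu of candidate operations.
big_graph = {
    1: ["conv3x3", "conv5x5"],   # node 1: convolution choices
    2: ["relu", "sigmoid"],      # node 2: activation choices
    3: ["maxpool", "avgpool"],   # node 3: pooling choices (my assumption)
}

def subsample(choices):
    """Pick one operation per node to form a concrete network."""
    return [(node, ops[choices[node]]) for node, ops in big_graph.items()]

net = subsample({1: 1, 2: 0, 3: 0})
# net == [(1, "conv5x5"), (2, "relu"), (3, "maxpool")]
```

The search space is the set of all such choice combinations (here a mere 2 x 2 x 2 = 8 networks; the real DAG is vastly larger).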

The red lines depict the edges the controller has sampled from the big DAG to propose an architecture.

This is pretty much what Efficient NAS is. The search space is now restricted by the big DAG instead of the vocabulary of sequences in the original NAS paper. While the search space is smaller than that of NAS, it is not trivially small. NAS produces networks with roughly 54 million parameters as opposed to the 24 million parameters produced by ENAS.

So where is the gain coming from? The next section illustrates the key idea.

Shared weights

Going back to the picture above.

Each node in the big DAG defines a computation. Say node 1 is the input, node 2 is a convolution, node 3 is an activation, and node 4 is yet another convolution. As node 1 is the input, there is nothing to do there. At node 2, say we can have either a 3x3 convolution or a 5x5 convolution. Node 3 defines some activation - either ReLU or sigmoid. There are a total of 4 possible dataflow paths between node 2 and node 3. Namely: 3x3 conv into ReLU, 3x3 conv into sigmoid, 5x5 conv into ReLU, and 5x5 conv into sigmoid.
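These four paths are just the cross product of each node's operation choices, which we can enumerate directly:

```python
from itertools import product

# Operation choices at each node, as in the example above.
node2_ops = ["conv3x3", "conv5x5"]
node3_ops = ["relu", "sigmoid"]

# Every dataflow path from node 2 to node 3 is one (conv, activation) pair.
paths = list(product(node2_ops, node3_ops))
# 2 choices x 2 choices = 4 paths
```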

The job of the controller is to sample one of these paths. In the paper they say: “we note that for each pair of nodes \(j < ℓ\), there is an independent parameter matrix \(W^h_{ℓ,j}\). By choosing the previous indices, the controller also decides which parameter matrices are used. Therefore, in ENAS, all recurrent cells in a search space share the same set of parameters.”

So unlike in the original NAS, where each recurrent cell could recommend an independent kernel/weight matrix to learn, here there is a finite number of weights to learn and they're all shared by the recurrent cells. This is somewhat confusing to me.

What happens when you learn a certain subgraph and then sample another subgraph that partially overlaps with the previous one? Do you relearn those weights or reuse the old ones? Refer to the confusions section.

Controller architecture

The controller looks pretty much the same as in NAS. It's a seq-to-seq model using some RNN (LSTM or GRU or whatever). Each node in the big graph corresponds to an RNN cell in the controller. At each cell the controller does two things: it samples the previous nodes it should connect to (this leads to skip connections) and decides which computation defined by that node to sample.
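The two per-node decisions can be sketched like this (uniform random sampling stands in for the RNN's learned distributions; node count and op menu are my own placeholders):

```python
import random

# Candidate operations a node can compute (placeholder menu).
OPS = ["conv3x3", "conv5x5", "relu", "sigmoid"]

def sample_cell(num_nodes=4):
    """For each node after the input, make the controller's two decisions:
    which previous node(s) to connect to, and which operation to compute."""
    decisions = []
    for node in range(2, num_nodes + 1):
        # Decision 1: sample previous nodes to connect to (skip connections).
        prev = random.sample(range(1, node), k=random.randint(1, node - 1))
        # Decision 2: sample the operation this node computes.
        op = random.choice(OPS)
        decisions.append((node, sorted(prev), op))
    return decisions

arch = sample_cell()
```

A trained controller would replace both `random` calls with softmax samples from the RNN's output at that step.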

Potential Research Extensions and Confusions