A deep learning algorithm requires tuning of the following set of parameters (also known as hyperparameters):
- $\alpha$ - learning rate - the most important hyperparameter to tune
- learning rate decay - probably second in importance
- mini-batch size - as important as learning rate decay
- mini-batch iterations
- number of layers
- $\beta$ - momentum parameter (after $\alpha$)
- $\beta_1, \beta_2, \epsilon$ (Adam parameters) - no tuning, standard values
There is no hard and fast rule for the relative importance of hyperparameters.
Essentially, there are two approaches for hyperparameter tuning:
Grid Method:
Impose a grid on the possible space of hyperparameter values, then go over each cell of the grid one by one and evaluate the model against the values from that cell. The grid method tends to waste resources trying out parameter combinations that make no sense at all.
Random Sampling Method:
With random sampling, we have a high probability of finding a good set of parameters quickly. After sampling at random for a while, we can zoom into the region indicative of a good set of parameters - a coarse-to-fine search.
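The coarse-to-fine idea can be sketched as follows. This is a minimal illustration, not the course's code: the `evaluate` function is a hypothetical stand-in for training a model and returning a dev-set score.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(alpha, beta):
    # Hypothetical stand-in for "train the model, return dev-set score";
    # this toy score peaks near alpha = 0.01, beta = 0.95.
    return -((np.log10(alpha) + 2) ** 2 + (100 * (beta - 0.95)) ** 2)

# Coarse pass: sample hyperparameters widely at random (log scale for alpha).
coarse = [(10 ** rng.uniform(-4, 0), 1 - 10 ** rng.uniform(-3, -1))
          for _ in range(50)]
best_alpha, best_beta = max(coarse, key=lambda p: evaluate(*p))

# Fine pass: zoom into the neighborhood of the best coarse point.
fine = [(best_alpha * 10 ** rng.uniform(-0.5, 0.5),
         float(np.clip(best_beta + rng.uniform(-0.02, 0.02), 0.9, 0.999)))
        for _ in range(50)]
best_alpha, best_beta = max(fine, key=lambda p: evaluate(*p))
```

Unlike the grid method, every random sample tries a fresh value of each hyperparameter, so no evaluations are wasted on repeated values of an unimportant parameter.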
Appropriate Scale for Hyperparameters
Random sampling allows efficient search in hyperparameter space. But sampling at random does not guarantee uniformity over the range of valid values. Therefore, it is important to pick an appropriate scale.
Let's say we have $n^{[l]} \in \{50, 51, 52, \ldots, 100\}$ as the range of hidden units in a layer. In this range it is quite reasonable to pick values uniformly at random. Another example is picking the number of layers from $L \in \{2, 3, 4\}$. However, this does not hold for all hyperparameters of a deep learning algorithm.
For instance, with $\alpha = 0.0001, \ldots, 1$, 90% of uniformly sampled values will fall between 0.1 and 1. Instead, if we search on a log scale over $0.0001, 0.001, 0.01, 0.1, 1$, each interval has an equal chance of being sampled. In Python we implement this as follows:
```python
import numpy as np

r = -4 * np.random.rand()  # r is a random number between -4 and 0
alpha = 10 ** r            # alpha will be between 10^-4 and 10^0
```
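As a quick sanity check (not from the course), sampling many values this way lands roughly a quarter of them in each decade of the range:

```python
import numpy as np

rng = np.random.default_rng(1)
r = -4 * rng.random(100_000)  # uniform in [-4, 0]
alpha = 10 ** r               # log-uniform in [1e-4, 1]

# Count samples per decade: [1e-4, 1e-3), [1e-3, 1e-2), [1e-2, 1e-1), [1e-1, 1]
counts, _ = np.histogram(np.log10(alpha), bins=[-4, -3, -2, -1, 0])
fractions = counts / counts.sum()  # each fraction is approximately 0.25
```

With uniform sampling of $\alpha$ itself, the first three decades together would receive only about 10% of the samples.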
Exponentially Weighted Averages
During optimization of a deep learning algorithm, we typically use the hyperparameter $\beta = 0.9$. Suppose we suspect $\beta$ can be anywhere in $0.9, \ldots, 0.999$. $\beta = 0.9$ means averaging over roughly the last 10 values, whereas $\beta = 0.999$ averages over roughly the previous 1000 values.
As with $\alpha$, we need to take scale into account when sampling at random from $0.9 \ldots 0.999$. It is better to search by sampling $1-\beta$ over $0.1 \ldots 0.001$ than by sampling $\beta$ directly. Here we sample $r$ uniformly at random from $r \in [-3, -1]$ such that
$$1-\beta = 10^r$$
$$\beta = 1 - 10^r$$
This way we spend equal resources exploring each interval of the hyperparameter range.
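A short sketch of this sampling in NumPy, with the effective averaging window $\approx 1/(1-\beta)$ noted in a comment:

```python
import numpy as np

rng = np.random.default_rng(2)
r = rng.uniform(-3, -1)  # r sampled uniformly from [-3, -1]
beta = 1 - 10 ** r       # beta lands in [0.9, 0.999]
window = 1 / (1 - beta)  # approx. number of past values averaged: 10 ... 1000
```

Sampling on this scale matters because the averaging behavior is very sensitive to $\beta$ near 1: moving from 0.999 to 0.9995 doubles the window, while moving from 0.9 to 0.9005 barely changes it.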
Pandas vs. Caviar - Hyperparameter Tuning Approach
Deep learning is being used in many different areas - NLP, vision, logistics, ads, etc. Hyperparameter intuitions often do not transfer from one area to another. Therefore, we should not get locked into our intuitions and should instead reevaluate them periodically.
Babysitting Model
Babysitting works well when we do not have enough computational power or capacity to train many models at once. We patiently watch one model's performance over time and tune hyperparameters based on its performance the previous day. This is like the Panda strategy - pandas have few offspring and babysit a single baby panda so that it does well eventually.
Training Many Models in Parallel
We train many models in parallel, each with different hyperparameter settings. These models will generate different learning curves, and at the end we pick the model with the best learning curve. This is known as the Caviar strategy - like fish that lay many eggs and see which ones survive.