
Exploring Deep Neural Network Loss Surfaces

Understanding the loss surface of your objective function is one of the most essential and fundamental concepts in machine learning. Knowing what kind of loss surface we are dealing with can tremendously help us design and train our models, because the ability to train a model is precisely the ability to minimize its loss function.

Generally, we have a thorough understanding of how various optimization algorithms behave on convex or quasi-convex loss functions. Hence, we often try to formulate our problem as a convex one when possible. Of course, these days, due to their excellent performance, everyone deploys deep neural networks in practice. However, our understanding of deep neural networks barely scratches the surface: they are not only non-convex functions, but they also live in very high-dimensional parameter spaces.

Current research focuses on modifying architectures and developing optimization methods based on the researcher’s intuition and experience. Unfortunately, this is far from ideal: we are used to (and limited to) envisioning things in 3D space, and intuition built from 3D visualizations can easily break down in higher dimensions. For example, the mass of a Gaussian distribution, or of a uniform distribution over a sphere or cube, is spread out very differently as the dimension grows.
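As a quick illustration of how that intuition breaks down (my own sketch, not from the paper), the following NumPy snippet shows two such effects: the norm of a standard Gaussian sample concentrates around √d rather than near the mode at the origin, and the ball inscribed in a cube captures almost none of the cube’s volume once the dimension is large.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

for d in (3, 100, 1_000):
    # Standard Gaussian: the norm concentrates tightly around sqrt(d),
    # so almost no probability mass sits near the mode at the origin.
    g = rng.standard_normal((n_samples, d))
    norms = np.linalg.norm(g, axis=1)
    print(f"d={d:>5}: mean ||x|| = {norms.mean():7.2f} "
          f"(sqrt(d) = {np.sqrt(d):7.2f}), std = {norms.std():.2f}")

    # Uniform on the cube [-1, 1]^d: the fraction of samples that also land
    # inside the inscribed unit ball collapses to zero as d grows.
    u = rng.uniform(-1.0, 1.0, size=(n_samples, d))
    inside_ball = (np.linalg.norm(u, axis=1) <= 1.0).mean()
    print(f"         fraction of the cube inside the unit ball = {inside_ball:.4f}")
```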

Here, I will summarize the results of the paper “An Empirical Analysis of Deep Network Loss Surfaces”. The results of the paper provide valuable intuition about the geometry of neural network loss surfaces under multiple stochastic optimization methods. The paper visualizes loss surfaces along low-dimensional slices (lines and polygons spanned by parameter configurations) to understand how and when these stochastic optimization methods find critical points. (The paper abuses the term “local minima”, since the authors do not verify that the Hessian is positive semi-definite; nevertheless, this seems to be typical in the deep learning community.)
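To make the visualization idea concrete, here is a minimal PyTorch sketch of the simplest (1D) version of such a slice: evaluating the loss along the straight line between two final parameter configurations, e.g. the solutions found by two different optimizers. The function and variable names are mine, not the authors’ code, and `loss_fn` is assumed to return a batch-mean loss.

```python
import copy
import torch

def interpolate_loss(model_a, model_b, loss_fn, data_loader, n_points=25):
    """Loss along the line theta(t) = (1 - t) * theta_A + t * theta_B."""
    params_a = [p.detach().clone() for p in model_a.parameters()]
    params_b = [p.detach().clone() for p in model_b.parameters()]
    probe = copy.deepcopy(model_a)   # scratch model overwritten at each t
    probe.eval()

    losses = []
    for t in torch.linspace(0.0, 1.0, n_points):
        with torch.no_grad():
            for p, pa, pb in zip(probe.parameters(), params_a, params_b):
                p.copy_((1.0 - t) * pa + t * pb)
            total, count = 0.0, 0
            for x, y in data_loader:
                total += loss_fn(probe(x), y).item() * len(y)
                count += len(y)
        losses.append(total / count)
    # A bump between t = 0 and t = 1 is the "hump" referred to throughout this post.
    return losses
```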

 

1. Different optimization methods find different local minima

The lower-left triangle of the table shows that the different optimization methods each reach a distinct valley. The 3D visualizations indicate that the final configurations reached by different methods from the same initial configuration are separated by a hump, at least within the 2D slice. This is somewhat straightforward, but it is still interesting to see the bifurcation of behaviour. The goals were to observe the complexity of the loss surface in 2D slices, and to determine whether the complexity of the local regions found by different optimization methods differs.

Deep neural network loss surfaces are highly symmetric: we can permute the indices of the hidden units in a layer, along with the corresponding weights, and still obtain the same functional outputs (see the small example below). In practice, what we really care about is the function being computed. So the authors wondered whether the deep networks found by different optimization methods are the same function or not. The upper-right triangle of the table presents the functional output differences between different optimization methods (see Section 4.1 of the paper for details). These figures illustrate that the local minima found by different methods correspond to effectively different networks, and the hump between them indicates that they approach different functions.

Network-In-Network (left) and VGG (right)
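The permutation symmetry mentioned above is easy to check directly. The following small PyTorch example (my own illustration, not from the paper) permutes the hidden units of a two-layer network and confirms that the outputs are unchanged even though the parameter vector is different.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 5))
permuted = copy.deepcopy(net)

# Shuffle the 32 hidden units: permute the rows of the first layer's weight
# and bias, and the columns of the second layer's weight, in the same way.
perm = torch.randperm(32)
with torch.no_grad():
    permuted[0].weight.copy_(net[0].weight[perm])
    permuted[0].bias.copy_(net[0].bias[perm])
    permuted[2].weight.copy_(net[2].weight[:, perm])

x = torch.randn(4, 10)
# Different points in parameter space, identical function values.
print(torch.allclose(net(x), permuted(x), atol=1e-6))   # True
```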

2. Different optimization methods find different “types” of local minima

Another interesting observation concerns the shape of the basins found by different methods. The authors trained deep neural networks multiple times with different initializations. As shown in the figures below, the surfaces look strikingly similar across different runs of the same algorithm, but characteristically different across algorithms. Thus, the authors found evidence that the different algorithms land in qualitatively different types of local minima.

Network-In-Network (left) and VGG (right)

The absolute differences in size between the basins found by different methods are also interesting to see (see Section 4.2 of the paper for details). Note that a loss value of 2.4 corresponds to the loss at initialization.

3. Different local minima can be found even at later stages of training

The authors also wondered at what point during training the eventual critical point gets chosen. The learning curve usually drops dramatically at the beginning of learning, and then the rate of decay slows down as training progresses.

It is easy to imagine that the destined local minimum has already been chosen by the end of the fast-decay period, and that we are merely settling into it during the slow-decay phase; at least, that is what one might conclude after staring for a long time at different optimization methods converging on a convex loss surface.

In the paper, the authors test this idea by switching the descent direction at different points during training, for example switching from Adam to SGD at a late stage of training. The goal was to check whether the valley stays the same even after the descent directions change. To make sure they were not overshooting (due to a large learning rate), they also tried several learning rates. Here is one of the results from the supplementary material:

Each row uses a different learning rate (larger at the top, smaller at the bottom), and A200 and S200 denote Adam and SGD run for 200 epochs, respectively.

The observation is that there is a sudden change in accuracy immediately after switching to another method. The distances travelled from the initial parameters to the parameter values at each epoch, shown in the second column, are also intriguing: they suggest that the direction of optimization changes radically when the optimization method changes. Perhaps the most interesting phenomenon is that in every experiment SGD appeared to converge somewhere closer to the origin, while Adam and the other adaptive-learning-rate methods travelled further away (at least for these deep networks on the CIFAR-10 dataset). Lastly, the linearly interpolated loss surfaces between the different final configurations indicate that all of them are separated by a hump along at least one direction.
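For concreteness, here is a rough sketch of how such an A200 → S200 switch, along with the distance-from-initialization measurement shown in the second column, could be set up in PyTorch. This is my own sketch with placeholder hyperparameters, not the authors’ code.

```python
import torch

def flat_params(model):
    """Concatenate all parameters into a single vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def train_with_switch(model, loader, loss_fn, lr=1e-3,
                      switch_epoch=200, total_epochs=400):
    """Train with Adam, then hand the same parameters over to SGD."""
    theta0 = flat_params(model).clone()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(total_epochs):
        if epoch == switch_epoch:
            # Swap the update rule mid-training; the parameters are untouched.
            optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        # Distance travelled from the initial parameters, ||theta_t - theta_0||.
        dist = torch.norm(flat_params(model) - theta0).item()
        print(f"epoch {epoch:3d}  ||theta - theta_0|| = {dist:.2f}")
```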

To sum up, even when the authors switched optimization algorithms, the local minima found were all different. “… This directly disagrees with the notion that the local minimum has effectively been chosen before the “minimization” phase, but instead that which local minimum is found is still in flux this late in optimization.” In other words, the basin of attraction is not yet settled even at a late stage of training, and the learning curve gives no hint of this.

4. Batch-normalization makes the loss surface much steeper, more convex, and more stable

Lastly, one of the most effective techniques that has been proposed and widely adopted is batch-normalization. The idea was motivated by the phenomenon called internal covariate shift: the technique normalizes the pre-activations at each layer of a deep network so as to reduce this shift. You can see the associated papers for more details. Here, instead, we will observe how batch-normalization applied to each layer affects the overall loss surface. We will see how badass it is.
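As a point of reference (my own sketch, not the paper’s code), the with/without comparison below boils down to two otherwise-identical networks that differ only in whether a BatchNorm layer normalizes each hidden layer’s pre-activations; the layer sizes here are placeholders for a flattened CIFAR-10-style input.

```python
import torch.nn as nn

def mlp(width=512, use_bn=True):
    """Identical architectures except for BatchNorm on the pre-activations."""
    def block(n_in, n_out):
        layers = [nn.Linear(n_in, n_out)]
        if use_bn:
            layers.append(nn.BatchNorm1d(n_out))  # normalize pre-activations
        layers.append(nn.ReLU())
        return layers
    return nn.Sequential(*block(3 * 32 * 32, width),
                         *block(width, width),
                         nn.Linear(width, 10))

with_bn, without_bn = mlp(use_bn=True), mlp(use_bn=False)
# Plotting 1D or 2D loss slices for these two models (as in the figures below)
# is what reveals the flat plateaus disappearing once BatchNorm is added.
```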

First, let’s compare 1D slices of deep network loss surfaces with and without batch-normalization:

Without Batch-Normalization (left) and With Batch-Normalization (right)

These are the results of repeated experiments with randomly drawn initial parameters, and the effect is clearly visible in the figure above. There are two interesting observations. First, batch-normalization unifies the loss surface: all the curves stay close together. Second, the loss curve drops immediately from the start, and the barrier between the two final configurations is sharp rather than forming a plateau. This becomes even clearer when we look at the 3D visualization!

The table above shows a collection of 2D slices of the loss surfaces. The main thing to notice is that the surfaces with batch-normalization have steeper slopes and much lower-error regions (shaded in blue).

To sum up, these empirical results provide interesting insights into characterizing optimization methods on deep networks, as well as a step towards understanding the loss surfaces of deep networks. I believe this direction of research is vital to understanding deep learning, especially as we explore different topologies of neural networks.

Acknowledgements: I am very thankful to Carolyn Augusta for her great feedback.