Deep learning theory struggled for a long time because it focused on engineering deep learning methods. Typically, a researcher would attempt to define a DL training algorithm that always has certain convergence properties, so that a practitioner could use the method with confidence that it will always work. However, it turns out that the constraints necessary for these general, prescriptive proofs to hold hurt performance greatly in practice, and that methods without these provable guarantees consistently worked far better than those with them. This sort of theory is now largely regarded as a failed project.
So successful deep learning theory started to take a more descriptive approach to its subject of study.
After this theoretical work, you, as the practitioner, have a powerful monitoring metric available that can help you determine whether there is a problem in your training regime. Once you have narrowed the problem down to a well-understood theoretical difficulty in training, you can look to your fellow empiricists and experienced ML craftsmen for suggestions on debugging, data normalization, and algorithmic techniques (in that order) that have been found to improve this metric, and thus hopefully improve your overall training performance.
This post goes through several of the most important theoretical properties using this approach:
Most credit should go not to me, but to Soheil Feizi and his Deep Learning Foundations course ([available for free on YouTube](https://www.youtube.com/playlist?list=PLHgjs9ncvHi80UCSlSvQe-TK_uOyDv_Jf)), perhaps the best free deep learning theory course available online.
“Gradient methods for the minimisation of functionals”; B.T.Polyak
Applied to deep learning by the paper [“On the exponential convergence rate of proximal gradient flow algorithms”; Hassan-Moghaddam and Mihailo R. Jovanovic](https://viterbi-web.usc.edu/~mihailo/papers/mogjovCDC18a.pdf#:~:text=The%20Polyak-Lojasiewicz%20%28PL%29%20condition%20is%20an%20inequality%20that,is%20a%20twice%20differentiable%20function%20with%20a%20Lipschitz)
PL condition always holds when training loss goes to zero
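For reference, the standard statement of the PL inequality for a loss \(f\) with minimum value \(f^*\): there exists a constant \(\mu > 0\) such that
\[\frac{1}{2}\,\|\nabla f(\theta)\|^2 \;\ge\; \mu\,\bigl(f(\theta) - f^*\bigr) \quad \text{for all } \theta.\]
When this holds (and \(\nabla f\) is Lipschitz continuous), gradient descent converges to the minimum loss value at a linear rate, even though \(f\) need not be convex.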
All neural networks with finite-valued weights (and Lipschitz activations, which covers ReLU, tanh, sigmoid, etc.) are Lipschitz continuous.
The Picard–Lindelöf theorem shows that an initial value problem with a Lipschitz continuous right-hand side has a unique solution, so gradient descent with infinitesimal updates (gradient flow) follows a unique trajectory and is guaranteed to converge.
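Concretely, the initial value problem in question is the gradient flow ODE on the training loss \(L\); Picard–Lindelöf applies whenever its right-hand side \(-\nabla L\) is Lipschitz continuous:
\[\frac{d\theta(t)}{dt} = -\nabla L(\theta(t)), \qquad \theta(0) = \theta_0.\]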
[Condition number](https://en.wikipedia.org/wiki/Condition_number)
numerator of the condition number is the spectral norm
high condition number implies a certain sort of overfitting
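Explicitly, for a matrix \(M\), the condition number is the ratio of its largest to its smallest singular value, so its numerator is the spectral norm:
\[\kappa(M) = \frac{\sigma_{\max}(M)}{\sigma_{\min}(M)} = \frac{\|M\|_2}{\sigma_{\min}(M)}.\]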
Calculating an upper bound on the Lipschitz constant: for a feedforward network with 1-Lipschitz activations (ReLU, tanh, sigmoid), the product of the spectral norms of its weight matrices upper-bounds the Lipschitz constant of the whole network.
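A minimal sketch of computing this bound with scipy, assuming the network's weight matrices are already available as numpy arrays (the shapes and names below are made up for illustration):

```python
import numpy as np
from scipy.linalg import svdvals

def lipschitz_upper_bound(weight_matrices):
    """Upper bound the Lipschitz constant of a feedforward net with
    1-Lipschitz activations by the product of the spectral norms
    (largest singular values) of its weight matrices."""
    bound = 1.0
    for W in weight_matrices:
        bound *= svdvals(W)[0]  # svdvals returns singular values in decreasing order
    return bound

# Hypothetical 3-layer MLP: 64 -> 128 -> 128 -> 10
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=s) for s in [(128, 64), (128, 128), (10, 128)]]
print(lipschitz_upper_bound(weights))
```

The bound is often loose for trained networks, but it is cheap enough to monitor during training.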
GAN stability can be aided by simply normalizing each weight matrix in the discriminator by its spectral norm, bounding the Lipschitz constant of the discriminator by 1.
[“Spectral Normalization for Generative Adversarial Networks”; Miyato, Kataoka, Koyama, Yoshida](https://arxiv.org/pdf/1802.05957v1.pdf)
Spectral normalization isn’t generally used in supervised learning: it slows training and doesn’t help end performance.
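A minimal numpy sketch of the idea; the paper itself estimates each spectral norm with one step of power iteration per update for efficiency, whereas this sketch uses an exact SVD for clarity:

```python
import numpy as np
from scipy.linalg import svdvals

def spectral_normalize(W, eps=1e-12):
    """Divide a weight matrix by its spectral norm so the layer is (at most) 1-Lipschitz."""
    return W / (svdvals(W)[0] + eps)

# Applying this to every weight matrix of the discriminator bounds its
# Lipschitz constant by 1 (with 1-Lipschitz activations such as ReLU).
rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))
W_sn = spectral_normalize(W)
print(svdvals(W)[0], svdvals(W_sn)[0])  # original spectral norm vs. 1.0
```

In practice, PyTorch ships this technique as `torch.nn.utils.spectral_norm`, which wraps a layer and renormalizes its weight (via power iteration) on each forward pass.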
Batch normalization lowers the empirical condition number of each layer in the network, by linearly rescaling the layer's actual inputs and outputs onto the same scale and the same mean value.
Take fixed data input/output pairs \(\{(a_1, z_1), \ldots, (a_n, z_n)\}\), and a matrix \(M\) fit to minimize \(\sum_{i=1}^n \|M a_i - z_i\|^2\).
Now apply batch normalization to the inputs and outputs, where \(\beta_a, \beta_z\) are the per-coordinate means and \(\alpha_a, \alpha_z\) the per-coordinate inverse standard deviations of the \(a_i\) and \(z_i\) respectively.
And fit a new matrix \(M^\prime\) to minimize \(\sum_{i=1}^n \bigl\|M^\prime\bigl((a_i - \beta_a) \odot \alpha_a\bigr) - (z_i - \beta_z) \odot \alpha_z\bigr\|^2\) (where \(\odot\) is the elementwise, or Hadamard, product).
For a hand-wavy argument for why batch normalization can improve the conditioning of the problem, consider the empirical condition number of the data:
Empirical condition number:
\[\max_{i,j} \frac{|a_i|}{|z_i|} \cdot \frac{|z_j|}{|a_j|} \;=\; \frac{\max_{i} |a_i| / |z_i|}{\min_{j} |a_j| / |z_j|}\]
Batch normalized empirical condition number:
\[\max_{i,j} \frac{|(a_i - \beta_a) \odot \alpha_a|}{|(z_i - \beta_z) \odot \alpha_z|} \cdot \frac{|(z_j - \beta_z) \odot \alpha_z|}{|(a_j - \beta_a) \odot \alpha_a|} \;=\; \frac{\max_{i} |(a_i - \beta_a) \odot \alpha_a| \,/\, |(z_i - \beta_z) \odot \alpha_z|}{\min_{j} |(a_j - \beta_a) \odot \alpha_a| \,/\, |(z_j - \beta_z) \odot \alpha_z|}\]
Hand-wavy argument for why batch normalization improves this condition number:
Note that badly conditioned examples in the original dataset would no longer be badly conditioned in the normalized dataset: the small-norm vectors would likely no longer be small after the offset, and the previously large-norm vectors would no longer have a large norm after the rescaling.
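A tiny numpy sketch of this argument on toy data; the deliberately axis-aligned, ill-conditioned linear map and all names here are my own illustrative choices, not from the course:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: rows of A are the inputs a_i, rows of Z the targets z_i = M a_i.
# M is ill-conditioned along the coordinate axes, the easiest case for batch norm to fix.
n, d = 1024, 4
A = rng.normal(size=(n, d))
M = np.diag([1.0, 10.0, 100.0, 1000.0])
Z = A @ M

def empirical_condition_number(A, Z):
    """Ratio of the largest to the smallest per-example norm ratio |a_i| / |z_i|."""
    ratios = np.linalg.norm(A, axis=1) / np.linalg.norm(Z, axis=1)
    return ratios.max() / ratios.min()

def batch_normalize(X, eps=1e-5):
    """Subtract the per-coordinate mean (beta) and rescale by the inverse std (alpha)."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

print("raw:         ", empirical_condition_number(A, Z))                                     # large
print("batch-normed:", empirical_condition_number(batch_normalize(A), batch_normalize(Z)))   # ~1
```

Because the ill-conditioning here is per-coordinate, normalizing each coordinate removes it almost entirely; for ill-conditioning that is not axis-aligned, batch normalization helps less.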
Batch normalization is much cheaper to evaluate than spectral normalization, and works much better for supervised learning, since it works on both the numerator and denominator of the condition number, rather than just the numerator. However, the dynamics of how these weights are updated can be problematic, especially in adversarial regimes, and it doesn’t help if the values are already well conditioned.
[Reconciling modern machine learning practice and the bias-variance trade-off; Belkin, et al. 2019](https://arxiv.org/abs/1812.11118)
Explain main theorem from:
https://youtu.be/75n2BIILNMc?list=PLHgjs9ncvHi80UCSlSvQe-TK_uOyDv_Jf&t=3994
Following summary guidelines for practitioners:
Thus inspiring the following debugging technique: