Information Theory of Deep Learning

I will present a comprehensive theory of large-scale learning with Deep Neural Networks, based on the correspondence between Deep Learning and the Information Bottleneck framework. The theory rests on the following components: (1) Rethinking learning theory: I will prove a new generalization bound, the input-compression bound, which shows that compression of the input variable is far more important for generalization than the dimension of the hypothesis class, an ill-defined notion for deep learning. (2) I will then prove that for large-scale Deep Neural Networks, the mutual information of the last hidden layer with the input and output variables provides a complete characterization of the sample complexity and accuracy of the network. This establishes the Information Bottleneck bound as the optimal trade-off between sample complexity and accuracy achievable by ANY learning algorithm. (3) I will then show how stochastic gradient descent, as used in Deep Learning, actually achieves this optimal bound. In this sense, Deep Learning is a method for solving the Information Bottleneck problem for large-scale supervised learning problems. The theory gives concrete predictions for the structure of the layers of Deep Neural Networks, and design principles for such networks, which turn out to depend solely on the joint distribution of the input and output and on the sample size.
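The mutual-information quantities above can be estimated empirically. As a minimal sketch (not the talk's own implementation), one common approach discretizes a layer's continuous activations into bins and computes mutual information from the resulting empirical joint distribution; the function names and the bin count here are illustrative choices:

```python
import numpy as np

def discrete_mutual_information(x_labels, t_labels):
    """Estimate I(X; T) in bits from paired discrete labels,
    using empirical frequencies as probabilities."""
    n = len(x_labels)
    joint = {}
    for x, t in zip(x_labels, t_labels):
        joint[(x, t)] = joint.get((x, t), 0) + 1
    # Marginals are obtained by summing the joint counts.
    px, pt = {}, {}
    for (x, t), c in joint.items():
        px[x] = px.get(x, 0) + c
        pt[t] = pt.get(t, 0) + c
    mi = 0.0
    for (x, t), c in joint.items():
        p_xt = c / n
        mi += p_xt * np.log2(p_xt / ((px[x] / n) * (pt[t] / n)))
    return mi

def bin_activations(activations, n_bins=30):
    """Discretize each unit's continuous activation into n_bins bins,
    then map each activation vector to one hashable label per sample."""
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    binned = np.digitize(activations, edges)
    return [tuple(row) for row in binned]
```

Applied to a hidden layer T, `discrete_mutual_information(inputs, bin_activations(T))` and `discrete_mutual_information(labels, bin_activations(T))` give the two coordinates of the layer on the information plane.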

Based on joint work with Ravid Schwartz-Ziv and Noga Zaslavsky.