Empirical Risk Minimization: Complexity, Duality, Sampling, Sparsity and Big Data

Empirical Risk Minimization (ERM) is a key paradigm for training statistical learning models, including prediction tasks such as regression, binary classification and deep neural networks. In ERM, one seeks to minimize the average of a very large number of loss functions, each corresponding to a data point (example). State-of-the-art algorithms for ERM sample a single or a small number of examples in each iteration, and update the predictor based on the information contained in it. I will describe a modern, flexible and scalable stochastic primal-dual algorithm for ERM, describe its convergence properties, and comment on how data sparsity is related to the speedup one should expect to obtain from minibatching.