- The paper explains how neural networks exploit properties rooted in the laws of physics, such as symmetry, locality, and compositionality, to learn efficiently.
- It demonstrates that deep neural networks can approximate certain functions with exponentially fewer neurons, and hence parameters, than shallow networks, thanks to their hierarchical structure.
- The study draws on concepts from statistics and information theory, such as sufficient statistics and causal hierarchies, to explain deep learning's robustness and versatility.
Understanding the Efficacy of Deep and Cheap Learning
Introduction
The paper "Why does deep and cheap learning work so well?" explores the surprising effectiveness of deep learning by considering both mathematical and physical perspectives (1608.08225). The central thesis is that deep learning's success is not solely due to mathematical guarantees but is also attributed to specific characteristics in the data and the architectural properties of neural networks. These neural networks can exploit physical axioms such as symmetry, locality, and compositionality, allowing efficient representation and learning with a significantly reduced parameter count.
Approximating Functions with Neural Networks
Deep neural networks are remarkably effective at approximating the complex functions that arise in practice. The paper emphasizes that these functions can often be represented far more compactly because of inherent physical properties:
- Symmetry: Many problems exhibit symmetries, such as translation or rotation invariance, which reduce the number of independent parameters needed to approximate the function.
- Locality: Physical systems typically involve only local interactions, so a network can restrict each unit's computation to a small neighborhood of its input.
- Compositionality: Problems can often be decomposed into simpler subproblems that deep networks handle in a layered architecture, capturing the hierarchical structure inherent in the data.
Together, these properties let neural networks approximate the relevant functions with exponentially fewer parameters than a generic representation would require, which addresses the puzzle of why networks with seemingly far too few parameters perform so well; a back-of-the-envelope sketch follows.
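As a rough, hypothetical illustration (the layer widths, kernel size, and depth below are my own assumptions, not figures from the paper), this Python sketch compares the size of a generic lookup table for a function of n binary inputs with the parameter counts of a single fully connected layer and of a deep, local, weight-shared (convolution-style) stack:

```python
# Back-of-the-envelope parameter counts for representing a function of n binary inputs.
# Illustrative only: the widths, kernel size, and depth are arbitrary assumptions.

def lookup_table_params(n):
    """A generic function of n bits needs one entry per input configuration."""
    return 2 ** n

def fully_connected_params(n, width):
    """One dense hidden layer of `width` units plus a scalar output (weights + biases)."""
    return n * width + width + width + 1

def local_shared_params(n, kernel, channels, depth):
    """A conv-style stack: each layer reuses one small kernel across all positions."""
    first = kernel * 1 * channels + channels              # input has a single channel
    per_layer = kernel * channels * channels + channels   # shared weights + biases
    return first + (depth - 1) * per_layer + channels + 1 # plus a linear readout

if __name__ == "__main__":
    n = 32  # e.g. a 32-pixel binary "image"
    print("lookup table   :", lookup_table_params(n))            # 4,294,967,296
    print("fully connected:", fully_connected_params(n, 256))    # 8,705
    print("local + shared :", local_shared_params(n, 3, 16, 4))  # 2,433
```

The point is only the scaling: a generic representation grows exponentially in n, while locality (small kernels) and symmetry (weight sharing) keep the count roughly linear.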
Deep vs. Shallow Networks
The paper examines the differences between deep and shallow networks, highlighting that deep networks are often more efficient:
- Hierarchical Structure: Deep networks mirror the hierarchical generative processes found in many physical systems, allowing them to efficiently encode complex relationships found in real-world data.
- No-Flattening Theorems: These theorems describe situations in which a shallow network cannot efficiently replicate the computation of an equivalent deep network without a substantial blow-up in resources. For example, the paper shows that the product of n inputs can be computed by a deep network whose neuron count grows only linearly in n, whereas a single-hidden-layer network requires exponentially many neurons; an illustrative sketch follows this list.
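To make the compositional structure concrete, here is a small Python sketch (my own illustration, not the paper's network construction) that computes a product of n numbers as a balanced binary tree of pairwise products. The relevant resource count: the tree uses n - 1 pairwise multiplications at depth about log2(n), and since the paper shows each pairwise product can be approximated by a fixed-size block of neurons, the deep version needs only O(n) neurons overall:

```python
# Illustrative sketch: a product of n inputs computed as a binary tree of
# pairwise products, mirroring how a deep network can compose small blocks.

import math

def tree_product(values):
    """Multiply values with a balanced binary tree of pairwise products."""
    layer = list(values)
    depth = 0
    pairwise_ops = 0
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            nxt.append(layer[i] * layer[i + 1])
            pairwise_ops += 1
        if len(layer) % 2:          # carry an unpaired value to the next level
            nxt.append(layer[-1])
        layer = nxt
        depth += 1
    return layer[0], pairwise_ops, depth

if __name__ == "__main__":
    xs = [1.5, 0.5, 2.0, 3.0, 0.25, 4.0, 1.0, 2.0]
    prod, ops, depth = tree_product(xs)
    print(prod)                               # 9.0
    print(ops, "pairwise products")           # n - 1 = 7
    print(depth, "levels, log2(n) =", math.log2(len(xs)))
    # With a constant number of neurons per pairwise product, the deep version
    # needs O(n) neurons; a single-hidden-layer net needs roughly 2**n.
```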
The hierarchy-induced efficiency is further understood through the lens of information theory:
- Causal Hierarchies: Information in many systems flows through a hierarchy, with upper layers encoding abstract concepts and lower layers focusing on detailed, specific information.
- Sufficient Statistics and Distillation: Deep networks can be viewed as stages of progressive information distillation, each layer keeping the essential features and discarding irrelevant detail and noise. This mirrors the statistical concept of a sufficient statistic: ideally, the network compresses the data into a minimal sufficient form that retains all the information needed for inference. A small illustration follows this list.
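As a minimal illustration of the statistical idea (the coin-flip example is mine, not the paper's): for i.i.d. Bernoulli data, the number of successes is a sufficient statistic for the bias, so a "layer" that keeps only that count discards nothing relevant to inferring the bias.

```python
# Sufficient statistics in miniature: two samples with the same number of heads
# yield identical likelihood functions for the coin bias theta, so keeping only
# the count loses no information relevant to the inference task.

import numpy as np

def bernoulli_log_likelihood(data, theta):
    data = np.asarray(data)
    k, n = data.sum(), data.size
    return k * np.log(theta) + (n - k) * np.log(1.0 - theta)

sample_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # 6 heads out of 10
sample_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # also 6 heads out of 10

thetas = np.linspace(0.05, 0.95, 19)
ll_a = bernoulli_log_likelihood(sample_a, thetas)
ll_b = bernoulli_log_likelihood(sample_b, thetas)

print(np.allclose(ll_a, ll_b))   # True: the count is all that matters
```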
Implications and Applications
The insights presented in the paper have implications for both theoretical understanding and practical application:
- Robustness and Versatility: Understanding the properties that enable efficient learning through deep networks can lead to more robust designs, capable of generalizing well across different tasks and conditions.
- Future Developments in AI: The theoretical insights can inspire new architectures and training methodologies that leverage physical principles more explicitly, possibly leading to designs with specific invariances and efficiency considerations built in.
Conclusion
Examining deep learning's effectiveness through the lenses of both mathematics and physics provides a dual perspective on why deep networks learn complex patterns with comparatively few parameters. The paper argues that the hierarchical structure of deep networks closely mirrors the compositional nature of the physical processes that generate real-world data. These insights sharpen the theoretical understanding of neural network efficiency and can inform practical implementations that aim to exploit these efficiencies in real-world machine learning applications.