- The paper presents the Xception architecture that reinterprets Inception modules as depthwise separable convolutions to enhance computational efficiency.
- The method achieves superior performance at a comparable parameter count, obtaining a top-1 ImageNet accuracy of 79.0% and an improved MAP@100 on the JFT dataset.
- The study underscores the architecture’s practical impact and suggests further exploration of intermediate convolution formulations and optimization techniques.
Xception: Deep Learning with Depthwise Separable Convolutions
The paper by François Chollet, titled "Xception: Deep Learning with Depthwise Separable Convolutions," presents a compelling continuation and refinement of concepts underpinning the design of deep convolutional neural networks (CNNs). The work builds on observations concerning the Inception module originally introduced by Szegedy et al. The Xception architecture proposed in this paper, whose name stands for "Extreme Inception," replaces Inception modules with depthwise separable convolutions, yielding significant improvements in both efficiency and performance.
Core Contributions
- Rethinking Inception Modules: The Inception module is interpreted as an intermediary between regular convolutions and depthwise separable convolutions. The latter are viewed as an "extreme" form of the Inception module, where cross-channel and spatial correlations are mapped in completely separate steps.
- Improving Computational Efficiency: Depthwise separable convolutions decouple the mapping of spatial and cross-channel correlations, significantly reducing computational cost while enabling the model to use its parameters more efficiently.
- Comparison and Performance: The Xception architecture, while maintaining parameter-count parity with Inception V3, consistently demonstrates enhanced performance. Particularly notable improvements are observed on the JFT dataset, a large-scale image classification dataset, highlighting the broad applicability and practical utility of the proposed architecture.
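The efficiency claim in the contributions above can be made concrete with a back-of-the-envelope parameter count. The sketch below compares a regular 3x3 convolution with its depthwise separable counterpart; the channel sizes are illustrative choices, not figures from the paper.

```python
# Parameter counts (ignoring biases) for a single 3x3 convolution
# mapping 256 input channels to 256 output channels.
# These sizes are illustrative, not taken from the paper.

k, c_in, c_out = 3, 256, 256

# Regular convolution: every filter spans all input channels,
# mapping spatial and cross-channel correlations in one joint step.
regular = k * k * c_in * c_out          # 589,824

# Depthwise separable convolution: a per-channel k x k spatial
# filter (depthwise), followed by a 1x1 pointwise convolution
# that maps cross-channel correlations in a separate step.
depthwise = k * k * c_in                # 2,304
pointwise = 1 * 1 * c_in * c_out        # 65,536
separable = depthwise + pointwise       # 67,840

print(regular, separable, round(regular / separable, 1))
# → 589824 67840 8.7
```

At these sizes the separable version uses roughly 8.7x fewer parameters for the same input/output shape, which is the budget Xception reinvests in depth while staying at parameter parity with Inception V3.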
Experimental Setup and Results
ImageNet Dataset
- Performance Metrics: On the ImageNet dataset, Xception marginally outperforms Inception V3, with a top-1 accuracy of 79.0% compared to 78.2% for Inception V3. Because the two models have roughly the same number of parameters, this gain cannot be attributed to increased model capacity.
- Optimization and Regularization: Training on ImageNet employed stochastic gradient descent (SGD) with momentum, while RMSprop was used for JFT. Polyak averaging of the weights and dropout (on ImageNet; JFT is large enough that dropout was unnecessary) were employed to bolster generalization.
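The stepped learning-rate decay used for ImageNet can be sketched as a small helper. The schedule shape (geometric decay applied every fixed number of epochs) and the specific values below follow the hyperparameters reported in the paper, but treat this function itself as an illustrative reconstruction rather than the authors' code.

```python
def imagenet_lr(epoch, base_lr=0.045, decay=0.94, every=2):
    """Stepped learning-rate schedule as reported for ImageNet:
    start at 0.045 and decay by a factor of 0.94 every 2 epochs.
    Illustrative reconstruction, not the authors' implementation."""
    return base_lr * decay ** (epoch // every)

print(imagenet_lr(0))   # → 0.045
print(imagenet_lr(4))   # base_lr * 0.94**2
```

Paired with SGD momentum 0.9, this kind of slow geometric decay is what lets both architectures be trained with identical schedules, keeping the comparison fair.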
JFT Dataset
- Superior Performance on Large-Scale Data: Evaluations on the JFT dataset show a larger relative improvement than on ImageNet: Xception achieved a Mean Average Precision (MAP@100) of 6.70 compared to 6.36 for Inception V3. This gain underscores the architecture's suitability for complex, large-scale image classification tasks.
- Training and Convergence: The authors trained on 60 NVIDIA K80 GPUs, underscoring the significant computational resources required to realize the full potential of the proposed architecture. The models were not trained to full convergence on JFT, suggesting that extended training could yield further performance gains.
Implications and Future Directions
The research posits that Xception's advanced performance is rooted not in increased capacity but in the more efficient allocation and utilization of model parameters. These findings suggest several directions for future investigation:
- Intermediate Formulations: The paper highlights a spectrum between regular and depthwise separable convolutions, suggesting that intermediate formulations of Inception modules could yield additional gains. This hypothesis warrants further empirical exploration to better understand its practical implications.
- Broader Applicability: Given the architecture's demonstrated efficacy, future work might explore its applicability beyond image classification, including object detection and segmentation tasks where spatial correlations are paramount.
- Optimization Improvements: The significant computational resources required imply future work could also focus on optimizing the implementation of depthwise separable convolutions to enhance training efficiency, particularly for large-scale datasets.
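The "spectrum" mentioned under Intermediate Formulations can be illustrated with grouped convolutions, one natural way to interpolate between the two extremes: with one group you recover a regular convolution, and with one group per channel you recover the depthwise half of a separable convolution. The helper below is a hypothetical parameter-count sketch, not something defined in the paper.

```python
def grouped_conv_params(k, c_in, c_out, groups):
    """Parameter count (ignoring biases) of a k x k grouped
    convolution: each group sees only c_in // groups input
    channels. Hypothetical helper for illustration."""
    assert c_in % groups == 0 and c_out % groups == 0
    return k * k * (c_in // groups) * c_out

# Sweeping the number of groups traces the spectrum from a
# regular convolution (groups=1) to a depthwise one (groups=c_in).
for g in (1, 4, 64, 256):
    print(g, grouped_conv_params(3, 256, 256, g))
# → 1 589824
#   4 147456
#   64 9216
#   256 2304
```

Intermediate group counts sit between the two extremes in cost, which is exactly the design space the paper suggests exploring.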
Conclusion
The Xception architecture represents a meaningful step forward in the design of deep convolutional neural networks. By replacing traditional Inception modules with depthwise separable convolutions, the model capitalizes on a more efficient parameterization, enabling superior performance on diverse and extensive datasets without increasing model size. This work underscores depthwise separable convolutions' potential to become integral to future convolutional neural network designs, promoting both theoretical insights and pragmatic advancements in deep learning applications.