Learned Image Compression with Mixed Transformer-CNN Architectures (2303.14978v1)

Published 27 Mar 2023 in eess.IV and cs.CV

Abstract: Learned image compression (LIC) methods have exhibited promising progress and superior rate-distortion performance compared with classical image compression standards. Most existing LIC methods are Convolutional Neural Networks-based (CNN-based) or Transformer-based, which have different advantages. Exploiting both advantages is a point worth exploring, which has two challenges: 1) how to effectively fuse the two methods? 2) how to achieve higher performance with a suitable complexity? In this paper, we propose an efficient parallel Transformer-CNN Mixture (TCM) block with a controllable complexity to incorporate the local modeling ability of CNN and the non-local modeling ability of transformers to improve the overall architecture of image compression models. Besides, inspired by the recent progress of entropy estimation models and attention modules, we propose a channel-wise entropy model with parameter-efficient swin-transformer-based attention (SWAtten) modules by using channel squeezing. Experimental results demonstrate our proposed method achieves state-of-the-art rate-distortion performances on three different resolution datasets (i.e., Kodak, Tecnick, CLIC Professional Validation) compared to existing LIC methods. The code is at https://github.com/jmliu206/LIC_TCM.

Authors (3)

Jinming Liu (29 papers)
Heming Sun (39 papers)
Jiro Katto (36 papers)

Citations (149)


Summary

The paper introduces a parallel Transformer-CNN Mixture (TCM) block that fuses local and global features for improved rate-distortion efficiency.
It employs a channel-wise entropy model combined with a parameter-efficient swin-transformer-based attention (SWAtten) module that uses channel squeezing to limit computational complexity.
Experiments on Kodak, Tecnick, and CLIC Professional Validation show BD-rate savings of up to 13.71% relative to VVC, establishing state-of-the-art performance in learned image compression.

Overview of "Learned Image Compression with Mixed Transformer-CNN Architectures"

The paper "Learned Image Compression with Mixed Transformer-CNN Architectures" introduces a novel approach to image compression that effectively combines the strengths of Convolutional Neural Networks (CNNs) and Transformers. The authors propose a parallel Transformer-CNN Mixture (TCM) block designed to incorporate the local modeling capabilities of CNNs with the non-local modeling strengths of Transformers. This method aims to achieve superior rate-distortion performance compared to existing Learned Image Compression (LIC) methods. The proposed framework is evaluated on three datasets: Kodak, Tecnick, and CLIC Professional Validation, demonstrating state-of-the-art results in image compression.

Problem Formulation and Methodology

Classical codecs such as JPEG and VVC rely on hand-designed stages: a transform, quantization, and entropy coding. LIC methods instead optimize the whole pipeline end-to-end with neural networks and have shown better rate-distortion performance when distortion is measured by Peak Signal-to-Noise Ratio (PSNR) or Multi-Scale Structural Similarity Index (MS-SSIM).
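End-to-end optimization of this kind is usually expressed as a rate-distortion Lagrangian. The form below is the one commonly used in the LIC literature rather than a formula taken from this paper, and the symbols (analysis transform g_a, synthesis transform g_s, quantizer Q, trade-off weight λ) are introduced here only for illustration.

```latex
\mathcal{L} = R(\hat{y}) + \lambda \, D(x, \hat{x}),
\qquad \hat{y} = Q\big(g_a(x)\big),
\qquad \hat{x} = g_s(\hat{y}),
\qquad R(\hat{y}) = \mathbb{E}\big[-\log_2 p_{\hat{y}}(\hat{y})\big]
```

Here D is typically mean squared error (when reporting PSNR) or 1 − MS-SSIM, and a larger λ trades more bits for lower distortion.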

This research addresses two core challenges:

Fusion of Architectures: How to combine CNNs and Transformers effectively to harness both local and long-range data dependencies.
Complexity Management: Achieving high performance without excessive computational complexity.

The authors propose a TCM block in which the feature channels are split into two groups that are processed in parallel: a CNN branch models local structure and a Transformer branch models non-local dependencies. The two outputs are then fused, so the block retains both fine spatial detail and long-range context.
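The split, parallel-process, and fuse pattern can be sketched in a few lines of NumPy. This is a minimal illustration under strong assumptions, not the authors' implementation: the local branch is a fixed 3x3 mean filter standing in for a learned residual CNN block, the global branch is plain single-head self-attention over all positions standing in for Swin window attention, and the names `local_branch`, `global_branch`, `tcm_block`, and `fuse_weight` are hypothetical.

```python
import numpy as np

def local_branch(x):
    """Stand-in for the CNN branch: 3x3 mean filter per channel plus a residual."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero padding on H and W
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return x + out / 9.0

def global_branch(x):
    """Stand-in for the Transformer branch: self-attention over all positions plus a residual."""
    c, h, w = x.shape
    tokens = x.reshape(c, h * w).T                 # (N, C), one token per position
    scores = tokens @ tokens.T / np.sqrt(c)        # (N, N) similarity
    scores -= scores.max(axis=1, keepdims=True)    # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    attended = weights @ tokens                    # (N, C)
    return x + attended.T.reshape(c, h, w)

def tcm_block(x, fuse_weight):
    """Split channels, run both branches in parallel, concatenate, fuse with a 1x1 mix."""
    half = x.shape[0] // 2
    merged = np.concatenate(
        [local_branch(x[:half]), global_branch(x[half:])], axis=0
    )
    fused = np.einsum("oc,chw->ohw", fuse_weight, merged)  # 1x1 convolution
    return x + fused                                       # outer residual connection

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 6, 6))       # (channels, height, width), toy sizes
fuse_weight = rng.standard_normal((8, 8)) * 0.1
out = tcm_block(features, fuse_weight)
print(out.shape)
```

Because each branch sees only half of the channels, the cost of the attention branch is bounded by the split rather than by the full feature width, which is one way to read the "controllable complexity" mentioned in the abstract.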

Channel-wise Entropy Model and SWAtten Module

Building on recent work in entropy modeling, the authors introduce a channel-wise entropy model enhanced with a parameter-efficient swin-transformer-based attention module (SWAtten). The module uses channel squeezing to reduce computational load while maintaining strong performance. In a channel-wise entropy model, the latent representation is divided into slices along the channel dimension, and the distribution parameters of each slice are predicted using the slices that have already been decoded, so later slices can be coded with a better-informed probability estimate.
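The slice-by-slice conditioning can be illustrated with a toy rate estimate. This is only a sketch under stated assumptions: `predict_params` is a hypothetical placeholder for the learned parameter networks (it simply averages the previously decoded slices), the fixed scales 2.0 and 1.0 are arbitrary, and the latent is synthetic data with identical channels chosen to make the benefit of conditioning visible.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def discretized_gaussian_bits(y_hat, mean, scale):
    """Bits needed to code integer symbols under a Gaussian integrated over unit bins."""
    root2 = math.sqrt(2.0)
    upper = 0.5 * (1.0 + _erf((y_hat + 0.5 - mean) / (scale * root2)))
    lower = 0.5 * (1.0 + _erf((y_hat - 0.5 - mean) / (scale * root2)))
    prob = np.clip(upper - lower, 1e-9, 1.0)
    return float(-np.log2(prob).sum())

def predict_params(decoded_slices, shape):
    """Hypothetical stand-in for the learned networks that predict mean and scale."""
    if not decoded_slices:
        # First slice: no channel context yet, so use a broad zero-mean prior.
        return np.zeros(shape), np.full(shape, 2.0)
    # Later slices: use the per-position average of everything decoded so far.
    context = np.concatenate(decoded_slices, axis=0).mean(axis=0, keepdims=True)
    return np.broadcast_to(context, shape), np.full(shape, 1.0)

def channelwise_rate(y_hat, num_slices):
    """Code the latent slice by slice, conditioning each slice on the previous ones."""
    decoded, total_bits = [], 0.0
    for current in np.split(y_hat, num_slices, axis=0):
        mean, scale = predict_params(decoded, current.shape)
        total_bits += discretized_gaussian_bits(current, mean, scale)
        decoded.append(current)
    return total_bits

rng = np.random.default_rng(0)
base = rng.integers(-3, 4, size=(1, 4, 4))
y_hat = np.repeat(base, 8, axis=0)      # synthetic latent: 8 identical channels

bits_single = channelwise_rate(y_hat, 1)  # one slice, no channel conditioning
bits_sliced = channelwise_rate(y_hat, 4)  # four slices of two channels each
print(bits_sliced < bits_single)
```

With this synthetic latent, the sliced estimate needs fewer bits because every slice after the first is predicted from channels that carry the same content. Real latents are far less redundant, so the gain there depends on how well the learned networks exploit cross-channel correlation.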

The SWAtten module is designed to capture both local and non-local information inside the entropy model. Because attention is applied to channel-squeezed features, it adds less complexity than approaches that place heavier attention layers throughout the entire network.
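A quick parameter count shows why squeezing channels before attention helps. The arithmetic below is a rough sketch: it counts only the query, key, value, and output projection weights of one attention layer plus two 1x1 convolutions, ignores biases and any feed-forward layers, and uses a channel width of 320 and a squeeze ratio of 4 as illustrative values that are not taken from the summary above.

```python
def attention_params(channels):
    """Weights in the Q, K, V, and output projections of one attention layer (no biases)."""
    return 4 * channels * channels

def squeezed_attention_params(channels, squeeze_ratio):
    """1x1 conv down, attention at the reduced width, 1x1 conv back up."""
    inner = channels // squeeze_ratio
    squeeze = channels * inner
    unsqueeze = inner * channels
    return squeeze + attention_params(inner) + unsqueeze

full = attention_params(320)
squeezed = squeezed_attention_params(320, 4)
print(full, squeezed, squeezed / full)  # 409600 76800 0.1875
```

Under these assumptions, the squeezed variant uses under a fifth of the projection weights, with the two 1x1 convolutions accounting for most of what remains.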

Experimental Results

The proposed method improves on existing LIC methods as measured by Bjøntegaard delta rate (BD-rate) against the VVC anchor. On the Kodak, Tecnick, and CLIC Professional Validation datasets, it reduces BD-rate by 12.30%, 13.71%, and 11.85%, respectively, meaning it needs that much less bitrate on average than VVC for the same quality. The paper also provides visual comparisons showing better preservation of detail than older methods at the same bit rate.
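For readers unfamiliar with the metric, BD-rate can be computed roughly as follows: fit a cubic polynomial to log-rate as a function of PSNR for each codec, average both fits over the shared PSNR range, and convert the difference back to a percentage. The sketch below follows that common recipe with NumPy; the function name `bd_rate` is hypothetical, and the rate and PSNR points are synthetic placeholders, not results from the paper.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of test vs. anchor at equal PSNR; negative means savings."""
    log_anchor = np.log(np.asarray(rate_anchor, dtype=float))
    log_test = np.log(np.asarray(rate_test, dtype=float))

    # Cubic fit of log-rate as a function of quality for each curve.
    fit_anchor = np.polyfit(psnr_anchor, log_anchor, 3)
    fit_test = np.polyfit(psnr_test, log_test, 3)

    # Integrate both fits over the overlapping PSNR interval only.
    low = max(min(psnr_anchor), min(psnr_test))
    high = min(max(psnr_anchor), max(psnr_test))
    int_anchor = np.polyint(fit_anchor)
    int_test = np.polyint(fit_test)
    avg_anchor = (np.polyval(int_anchor, high) - np.polyval(int_anchor, low)) / (high - low)
    avg_test = (np.polyval(int_test, high) - np.polyval(int_test, low)) / (high - low)

    return float((np.exp(avg_test - avg_anchor) - 1.0) * 100.0)

# Synthetic curves: the test codec uses 10% less rate at every quality level.
psnr = [30.0, 33.0, 36.0, 39.0]
anchor_bpp = [0.2, 0.4, 0.8, 1.6]
test_bpp = [0.9 * r for r in anchor_bpp]

result = bd_rate(anchor_bpp, psnr, test_bpp, psnr)
print(round(result, 2))  # close to -10 for these synthetic curves
```

A reported BD-rate improvement of 12.30% on Kodak therefore corresponds to a value of about -12.30 from a calculation of this kind, with VVC as the anchor.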

Implications and Future Directions

The paper contributes a hybrid architecture that combines the complementary strengths of CNNs and Transformers for image compression. Its finding that local and non-local feature aggregation work well side by side could inform related areas such as video compression and real-time image processing.

Future work may focus on further reducing the computational complexity while possibly increasing compression efficacy, exploring the use of larger and more diverse datasets, and generalizing this method to other types of data beyond images. Potential advancements could also incorporate more sophisticated entropy models or adaptive architectures that dynamically balance load between CNN and Transformer components in real-time usage scenarios.

In conclusion, the paper shows that running CNN and Transformer branches in parallel, together with a channel-wise entropy model that uses SWAtten, improves rate-distortion performance over prior LIC methods on the three evaluated datasets.
