Mixer is more than just a model (2402.18007v2)
Abstract: MLP architectures have recently regained popularity, with MLP-Mixer standing out as a prominent example. In computer vision, MLP-Mixer is noted for extracting information from both channel and token perspectives, fusing the two. More broadly, Mixer embodies a paradigm of information extraction that blends views of the data from multiple perspectives, which is the true sense of "mixing" in neural network design. Beyond channels and tokens, mixers can be tailored to other perspectives to better suit the requirements of a specific task. This study focuses on audio recognition and introduces Audio Spectrogram Mixer with Roll-Time and Hermit FFT (ASM-RH), a model that incorporates views from both the time and frequency domains. Experimental results show that ASM-RH is well-suited to audio data and achieves promising results across multiple classification tasks. The models and best-performing weight files will be released.
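To make the mixing paradigm concrete, below is a minimal sketch of a single block that mixes a spectrogram along its time axis and then its frequency axis, in the spirit the abstract describes. This is a hypothetical illustration, not the authors' ASM-RH implementation: the class name, plain MLP mixing, and dimensions are assumptions, and the Roll-Time and Hermit FFT components are not reproduced here.

```python
# Minimal sketch of time/frequency mixing, analogous to how MLP-Mixer
# alternates token and channel mixing. Illustrative assumptions only;
# this is not the ASM-RH architecture from the paper.
import torch
import torch.nn as nn


class TimeFrequencyMixerBlock(nn.Module):
    """Mixes a (batch, n_freq, n_time) spectrogram along time, then frequency."""

    def __init__(self, n_freq: int, n_time: int, hidden: int = 256):
        super().__init__()
        self.norm_t = nn.LayerNorm(n_time)
        self.mix_time = nn.Sequential(  # MLP applied across time frames
            nn.Linear(n_time, hidden), nn.GELU(), nn.Linear(hidden, n_time)
        )
        self.norm_f = nn.LayerNorm(n_freq)
        self.mix_freq = nn.Sequential(  # MLP applied across frequency bins
            nn.Linear(n_freq, hidden), nn.GELU(), nn.Linear(hidden, n_freq)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_freq, n_time), e.g. a log-mel spectrogram
        x = x + self.mix_time(self.norm_t(x))   # time-domain view
        x = x.transpose(1, 2)                   # -> (batch, n_time, n_freq)
        x = x + self.mix_freq(self.norm_f(x))   # frequency-domain view
        return x.transpose(1, 2)                # -> (batch, n_freq, n_time)


if __name__ == "__main__":
    spec = torch.randn(8, 128, 100)  # batch of 128-bin, 100-frame spectrograms
    block = TimeFrequencyMixerBlock(n_freq=128, n_time=100)
    print(block(spec).shape)  # torch.Size([8, 128, 100])
```

The design choice mirrors MLP-Mixer: each axis gets its own residual MLP, so the same block template can be re-targeted to whatever "perspectives" a task exposes (channel/token for images, time/frequency for audio).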