Model Compression Method for S4 with Diagonal State Space Layers using Balanced Truncation (2402.15993v3)
Abstract: Model compression methods are widely recognized as useful for deploying deep learning models on edge devices. However, it remains unclear which compression methods are effective for Structured State Space Sequence (S4) models with Diagonal State Space (DSS) layers, which are tailored to long-sequence data. In this paper, we propose a novel model compression method that applies balanced truncation, a prevalent model reduction technique in control theory, to the DSS layers of a pre-trained S4 model. We further propose using the reduced model parameters obtained by balanced truncation as initial parameters of S4 models with DSS layers during the main training process. Numerical experiments demonstrate that models trained with this balanced-truncation initialization surpass conventionally trained models with Skew-HiPPO initialization in accuracy, even with fewer parameters. Furthermore, we observe a positive correlation: higher accuracy of the original model consistently leads to higher accuracy of models trained with our compression method, suggesting that our approach effectively leverages the strengths of the original model.
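The abstract summarizes the method at a high level. As a rough illustration of the core operation, the sketch below applies square-root balanced truncation to a single stable diagonal state-space system of the kind found in a DSS layer. The function name, shapes, and the toy system are illustrative assumptions, not the paper's implementation; in particular, the paper operates on pre-trained DSS layer parameters, and the reduced state matrix would typically be re-diagonalized afterwards to fit the DSS parameterization.

```python
import numpy as np


def balanced_truncation_dss(Lambda, B, C, r):
    """Reduce a stable diagonal state-space model (Lambda, B, C) to order r
    via square-root balanced truncation (illustrative sketch, not the paper's code).

    Lambda : (n,) complex, diagonal of the state matrix A (Re(Lambda) < 0)
    B      : (n, 1) complex input vector
    C      : (1, n) complex output vector
    r      : target state dimension, r <= n
    """
    # For diagonal A the Lyapunov equations have closed-form solutions:
    #   A P + P A^H + B B^H = 0  =>  P_ij = -B_i conj(B_j) / (lam_i + conj(lam_j))
    #   A^H Q + Q A + C^H C = 0  =>  Q_ij = -conj(C_i) C_j / (conj(lam_i) + lam_j)
    denom = Lambda[:, None] + np.conj(Lambda)[None, :]
    P = -(B @ B.conj().T) / denom               # controllability Gramian
    Q = -(C.conj().T @ C) / np.conj(denom)      # observability Gramian

    # Square-root method: factor the Gramians, then SVD the cross product.
    # (Assumes the Gramians are positive definite; ill-conditioned cases may
    # need a more careful factorization.)
    S = np.linalg.cholesky(P)                   # P = S S^H
    R = np.linalg.cholesky(Q)                   # Q = R R^H
    U, hsv, Vh = np.linalg.svd(R.conj().T @ S)  # hsv = Hankel singular values

    scale = np.diag(hsv[:r] ** -0.5)
    T1 = S @ Vh[:r, :].conj().T @ scale         # right projection
    T2 = R @ U[:, :r] @ scale                   # left projection, T2^H T1 = I_r

    A_r = T2.conj().T @ (Lambda[:, None] * T1)  # T2^H A T1, with A = diag(Lambda)
    B_r = T2.conj().T @ B
    C_r = C @ T1
    return A_r, B_r, C_r, hsv


# Toy usage on a random stable diagonal system (hypothetical values).
rng = np.random.default_rng(0)
n, r = 16, 4
Lambda = -(0.5 + np.abs(rng.standard_normal(n))) + 1j * rng.standard_normal(n)
B = rng.standard_normal((n, 1)) + 1j * rng.standard_normal((n, 1))
C = rng.standard_normal((1, n)) + 1j * rng.standard_normal((1, n))
A_r, B_r, C_r, hsv = balanced_truncation_dss(Lambda, B, C, r)
```

The Hankel singular values `hsv` indicate how many states carry significant input-output energy and hence how far the layer can be truncated; per the abstract, the reduced parameters then serve as the initialization for the main training run.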