
Predictive Coding Based Multiscale Network with Encoder-Decoder LSTM for Video Prediction (2212.11642v3)

Published 22 Dec 2022 in cs.CV and cs.AI

Abstract: We present a multi-scale predictive coding model for future video frame prediction. Drawing inspiration from the "Predictive Coding" theories in cognitive science, the model is updated by a combination of bottom-up and top-down information flows, which enhances the interaction between different network levels. However, traditional predictive coding models only predict what is happening hierarchically rather than predicting the future. To address this problem, our model employs a multi-scale (coarse-to-fine) approach, in which higher-level neurons generate coarser predictions (lower resolution) while lower-level neurons generate finer predictions (higher resolution). In terms of network architecture, we directly incorporate the encoder-decoder network within the LSTM module and share the final encoded high-level semantic information across different network levels. Compared with the traditional Encoder-LSTM-Decoder architecture, this enables more comprehensive interaction between the current input and the historical states of the LSTM, and thus learns more plausible temporal and spatial dependencies. Furthermore, to tackle the instability of adversarial training and to mitigate the accumulation of prediction errors in long-term prediction, we propose several improvements to the training strategy. Our approach achieves good performance on datasets such as KTH, Moving MNIST, and Caltech Pedestrian. Code is available at https://github.com/Ling-CF/MSPN.
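The abstract's two key architectural points (folding the encoder-decoder into the LSTM update itself, and passing coarse top-down information to finer levels) can be illustrated with a minimal PyTorch sketch. This is not the authors' MSPN implementation; the module names, channel counts, and the upsample-and-concatenate top-down fusion are illustrative assumptions.

```python
# A minimal sketch (not the MSPN code) of an LSTM cell whose gates are
# computed on a joint encoding of the current input and previous hidden
# state, with a decoder restoring the working resolution afterwards.
import torch
import torch.nn as nn

class EncoderDecoderLSTMCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int):
        super().__init__()
        # Encoder: jointly downsample the input and the previous hidden state.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch + hid_ch, hid_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # All four LSTM gates are computed at the encoded (coarse) resolution.
        self.gates = nn.Conv2d(hid_ch, 4 * hid_ch, kernel_size=3, padding=1)
        # Decoder: bring the new hidden state back to the input resolution.
        self.decoder = nn.ConvTranspose2d(hid_ch, hid_ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x, h, c):
        # h is kept at input resolution; c lives at the encoded resolution.
        z = self.encoder(torch.cat([x, h], dim=1))
        i, f, g, o = torch.chunk(self.gates(z), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = self.decoder(torch.sigmoid(o) * torch.tanh(c))
        return h, c

# Toy two-level, coarse-to-fine step: the higher level works at half
# resolution, and its hidden state is upsampled and concatenated onto the
# lower level's input, mimicking the top-down flow described above.
B, C, H, W = 1, 8, 64, 64
top = EncoderDecoderLSTMCell(C, 16)          # operates at H/2 x W/2
bottom = EncoderDecoderLSTMCell(C + 16, 16)  # operates at H x W

x = torch.randn(B, C, H, W)
h_top, c_top = torch.zeros(B, 16, H // 2, W // 2), torch.zeros(B, 16, H // 4, W // 4)
h_bot, c_bot = torch.zeros(B, 16, H, W), torch.zeros(B, 16, H // 2, W // 2)

x_coarse = nn.functional.avg_pool2d(x, 2)                      # bottom-up
h_top, c_top = top(x_coarse, h_top, c_top)                     # coarse update
top_down = nn.functional.interpolate(h_top, scale_factor=2.0)  # top-down
h_bot, c_bot = bottom(torch.cat([x, top_down], dim=1), h_bot, c_bot)
print(h_bot.shape)  # torch.Size([1, 16, 64, 64])
```

In the full model this interaction would repeat across more levels and across time steps, with higher levels supervised at lower resolutions; the sketch only shows the single-step data flow.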
