Aligning Human Motion Generation with Human Perceptions (2407.02272v2)
Abstract: Human motion generation is a critical task with a wide range of applications. Achieving high realism in generated motions requires naturalness, smoothness, and plausibility. Despite rapid advancements in the field, current generation methods often fall short of these goals. Furthermore, existing evaluation metrics typically rely on ground-truth-based errors, simple heuristics, or distribution distances, none of which align well with human perceptions of motion quality. In this work, we propose a data-driven approach to bridge this gap by introducing a large-scale human perceptual evaluation dataset, MotionPercept, and a human motion critic model, MotionCritic, which captures human perceptual preferences. Our critic model offers a more accurate metric for assessing motion quality and can be readily integrated into the motion generation pipeline to enhance generation quality. Extensive experiments demonstrate the effectiveness of our approach in both evaluating and improving the quality of generated human motions by aligning with human perceptions. Code and data are publicly available at https://motioncritic.github.io/.
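A minimal sketch of the core idea behind a preference-trained critic, under the assumption (standard for pairwise human-preference data, and not confirmed by the abstract itself) that the critic is fit with a Bradley-Terry loss: given pairs where annotators preferred one motion over another, the critic's score margin is pushed through a logistic likelihood. The linear critic, synthetic features, and training loop below are all illustrative stand-ins, not MotionCritic's actual architecture or data.

```python
import numpy as np

rng = np.random.default_rng(0)

dim = 8                         # stand-in for flattened motion features
w_true = rng.normal(size=dim)   # hidden "human" scoring direction (synthetic)

# Synthetic preference pairs: x_pos was preferred over x_neg.
n = 2000
a = rng.normal(size=(n, dim))
b = rng.normal(size=(n, dim))
prefer_a = (a @ w_true) > (b @ w_true)
x_pos = np.where(prefer_a[:, None], a, b)
x_neg = np.where(prefer_a[:, None], b, a)

# Bradley-Terry objective: minimize -log sigmoid(s(x_pos) - s(x_neg))
# for a linear critic s(x) = w . x, by plain gradient descent.
w = np.zeros(dim)
lr = 0.5
for _ in range(200):
    margin = (x_pos - x_neg) @ w
    p = 1.0 / (1.0 + np.exp(-margin))   # P(preferred motion scores higher)
    grad = -((1.0 - p)[:, None] * (x_pos - x_neg)).mean(axis=0)
    w -= lr * grad

# Pairwise accuracy: fraction of pairs where the critic ranks the
# human-preferred motion higher.
acc = (((x_pos - x_neg) @ w) > 0).mean()
print(f"pairwise accuracy on training preferences: {acc:.2f}")
```

Once trained, such a critic can serve double duty, as the abstract describes: its score is a quality metric on its own, and its gradient (for a differentiable critic) can be used to fine-tune or guide a motion generator toward higher-scoring outputs.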