Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization (2403.16071v2)
Abstract: Lip reading, the task of interpreting silent speech from visual lip movements, has attracted increasing attention for its wide range of practical applications. Deep learning approaches have greatly improved current lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, remains challenging due to inter-speaker variability: a well-trained lip reading system may perform poorly when handling a brand-new speaker. A key insight for learning a speaker-robust lip reading model is to reduce visual variations across speakers, preventing the model from overfitting to specific speakers. In this work, considering both the input visual clues and the latent representations of a hybrid CTC/attention architecture, we propose to exploit lip landmark-guided fine-grained visual clues instead of the frequently used mouth-cropped images as input features, diminishing speaker-specific appearance characteristics. Furthermore, we propose a max-min mutual information regularization approach to capture speaker-insensitive latent representations. Experimental evaluations on public lip reading datasets demonstrate the effectiveness of the proposed approach under both intra-speaker and inter-speaker conditions.
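The max-min mutual information (MI) regularization described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: it pairs an InfoNCE-style lower bound (maximized, to keep content information in the latent code) with a CLUB-style upper bound under a unit-variance Gaussian variational approximation (minimized, to squeeze out speaker information). The function names, the choice of estimators, and the weighting term `lam` are illustrative assumptions.

```python
import numpy as np

def info_nce_lower_bound(z, c):
    """InfoNCE-style lower bound on I(z; c).

    Rows of z and c are paired embeddings; positives sit on the
    diagonal of the dot-product similarity matrix.
    """
    scores = z @ c.T                                   # (N, N) similarities
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(np.mean(np.diag(log_probs)) + np.log(len(z)))

def club_upper_bound(z, s_mean):
    """CLUB-style upper bound on I(z; s).

    Assumes a unit-variance Gaussian variational approximation
    q(z | s) = N(s_mean, I); the bound is
    E_p(z,s)[log q(z|s)] - E_p(z)p(s)[log q(z|s)] (constants cancel).
    """
    pos = -0.5 * ((z - s_mean) ** 2).sum(axis=1)                     # matched pairs
    neg = -0.5 * ((z[:, None, :] - s_mean[None, :, :]) ** 2).sum(axis=2)  # all pairs
    return float(np.mean(pos) - np.mean(neg))

def max_min_mi_loss(z, content, speaker_mean, lam=1.0):
    """Max-min objective: maximize I(z; content), minimize I(z; speaker)."""
    return -info_nce_lower_bound(z, content) + lam * club_upper_bound(z, speaker_mean)
```

In training, the lower bound would be maximized with respect to the encoder while the upper bound is minimized, so gradients push the latent code toward content-relevant, speaker-insensitive features; here everything is computed on fixed arrays purely to show the shape of the objective.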