IMAD: IMage-Augmented multi-modal Dialogue (2305.10512v2)

Published 17 May 2023 in cs.CL and cs.HC

Abstract: Currently, dialogue systems have achieved high performance in processing text-based communication. However, they have not yet effectively incorporated visual information, which poses a significant challenge. Furthermore, existing models that incorporate images in dialogue generation focus on discussing the image itself. Our proposed approach presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue. By doing so, we aim to expand the capabilities of current dialogue systems and transition them from a single modality (text) to multi-modality. However, there is a lack of validated English datasets that contain both images and dialogue contexts for this task. Thus, we propose a two-stage approach to automatically construct a multi-modal dialogue dataset. In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image. In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model. We used this approach, along with additional labeling, to create the IMage Augmented multi-modal Dialogue dataset (IMAD), which can serve as a validated dataset for this task. Furthermore, we propose a baseline model trained on this dataset, which outperforms both a model trained on the same data without images and BlenderBot.
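
The first stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the thresholds, the toy embedding vectors, and the choice to compare each utterance with a caption of the candidate image are all assumptions made for illustration, since the abstract does not specify how the two similarity scores are computed or combined.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def replaceable_utterances(
    utt_embs: List[Sequence[float]],
    image_embs: List[Sequence[float]],
    caption_embs: List[Sequence[float]],
    image_thr: float = 0.8,  # hypothetical threshold, not from the paper
    sent_thr: float = 0.8,   # hypothetical threshold, not from the paper
) -> List[int]:
    """Return indices of utterances that some candidate image could replace.

    Assumption: an utterance qualifies when at least one image passes both
    a text-to-image similarity check and a sentence similarity check against
    that image's caption.
    """
    picked = []
    for i, utt in enumerate(utt_embs):
        for img, cap in zip(image_embs, caption_embs):
            if cosine(utt, img) >= image_thr and cosine(utt, cap) >= sent_thr:
                picked.append(i)
                break  # one matching image is enough for this utterance
    return picked


# Toy vectors standing in for real text and image embeddings.
utterances = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
images = [[0.9, 0.1, 0.0]]
captions = [[1.0, 0.1, 0.0]]
print(replaceable_utterances(utterances, images, captions))  # prints [0]
```

In a real pipeline the vectors would come from a joint text-image encoder and a sentence encoder, and the second stage would then filter the selected images with a visual question answering model.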

