Sora Detector: A Unified Hallucination Detection for Large Text-to-Video Models (2405.04180v1)
Abstract: Rapid advances in text-to-video (T2V) generative models have enabled the synthesis of high-fidelity video content guided by textual descriptions. Despite this progress, these models are often susceptible to hallucination, generating content that contradicts the input text, which poses a challenge to their reliability and practical deployment. To address this issue, we introduce SoraDetector, a novel unified framework designed to detect hallucinations across diverse large T2V models, including the cutting-edge Sora model. Our framework is built upon a comprehensive analysis of hallucination phenomena, categorizing them based on their manifestation in the video content. Leveraging state-of-the-art keyframe extraction techniques and multimodal LLMs, SoraDetector first evaluates the consistency between a summary of the extracted video content and the textual prompt, then constructs static and dynamic knowledge graphs (KGs) from frames to detect hallucinations both within single frames and across frames. Sora Detector provides robust, quantifiable measures of consistency, static hallucination, and dynamic hallucination. In addition, we have developed the Sora Detector Agent to automate the hallucination detection process and generate a complete video quality report for each input video. Lastly, we present a novel meta-evaluation benchmark, T2VHaluBench, crafted to facilitate the evaluation of advances in T2V hallucination detection. Through extensive experiments on videos generated by Sora and other large T2V models, we demonstrate the efficacy of our approach in accurately detecting hallucinations. The code and dataset can be accessed via GitHub.
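The distinction between static (single-frame) and dynamic (cross-frame) checks described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: in the paper the knowledge graphs are built by multimodal LLMs from extracted keyframes, whereas here each graph is assumed to be a hand-written set of (subject, relation, object) triples, and the function names `static_conflicts` and `dynamic_changes` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumption: a knowledge graph is a plain set of (subject, relation, object) triples.
Triple = Tuple[str, str, str]


def static_conflicts(prompt_kg: Set[Triple], frame_kg: Set[Triple]) -> Set[Triple]:
    """Return prompt triples that a single frame contradicts.

    A prompt triple counts as contradicted when the frame contains the same
    (subject, relation) pair but only with different objects.
    """
    frame_index: Dict[Tuple[str, str], Set[str]] = {}
    for subj, rel, obj in frame_kg:
        frame_index.setdefault((subj, rel), set()).add(obj)
    return {
        (subj, rel, obj)
        for subj, rel, obj in prompt_kg
        if (subj, rel) in frame_index and obj not in frame_index[(subj, rel)]
    }


def dynamic_changes(
    frame_kgs: List[Set[Triple]],
) -> List[Tuple[int, Set[Triple], Set[Triple]]]:
    """Compare consecutive frame graphs and report triples that vanish or appear.

    Each result is (index of the later frame, removed triples, added triples).
    Not every change is a hallucination; this only surfaces candidates.
    """
    changes = []
    for i in range(1, len(frame_kgs)):
        removed = frame_kgs[i - 1] - frame_kgs[i]
        added = frame_kgs[i] - frame_kgs[i - 1]
        if removed or added:
            changes.append((i, removed, added))
    return changes


# Hypothetical example: the prompt asks for a black cat.
prompt_kg = {("cat", "color", "black")}
frames = [
    {("cat", "color", "black"), ("cat", "leg_count", "4")},
    {("cat", "color", "white"), ("cat", "leg_count", "5")},
]

per_frame = [static_conflicts(prompt_kg, kg) for kg in frames]
across_frames = dynamic_changes(frames)
```

In this toy input the second frame contradicts the prompt (a static candidate), and the cat's color and leg count both change between frames (dynamic candidates). A real pipeline would still need a judgment step, such as an LLM or a rule set, to decide which cross-frame changes are implausible rather than legitimate motion or scene changes.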