MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos (2406.08407v3)
Abstract: Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models' different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.
Explain it Like I'm 14
Overview
This paper introduces MMWorld, a big test (called a “benchmark”) to check how well AI systems that understand both language and visuals can make sense of the real world by watching videos. The main idea is to see if these AIs have a good “world model”—a kind of common‑sense understanding of how things work, why they happen, and what might happen next.
What questions does the paper try to answer?
In simple terms, the researchers ask:
- Can AI understand videos across many school‑like subjects (like science, sports, health, engineering, and business)?
- Can AI do more than just describe what it sees—like explain why something happened, imagine what would happen if something changed (what‑if), or predict the future?
- How do current AIs perform on these skills, and where do they struggle compared to humans?
How did they do it?
Think of MMWorld as a giant quiz made from real videos:
- The team collected 1,910 short videos from seven big areas: Art & Sports, Business, Science, Health & Medicine, Embodied Tasks (like robots or people doing step‑by‑step tasks), Tech & Engineering, and Games. These are split into 69 smaller topics (for example: robotics, chemistry, trading, agriculture).
- They wrote 6,627 multiple‑choice questions about these videos. The questions test different kinds of thinking, not just “what’s in the picture.”
Here are the main types of questions, explained with everyday ideas:
- Explanation: Why did something happen in the video? (Like, “Why did the chemical fizz?”)
- Counterfactual (What‑if): What would happen if we changed something? (Like, “If the person didn’t tighten the screw, what would happen next?”)
- Future prediction: What’s likely to happen next in the video?
- Domain expertise: Subject‑specific knowledge (like health tips, basic physics, or engineering steps).
- Temporal understanding: Understanding order and timing (what came first, what comes later).
- Attribution and procedure: Figuring out cause‑and‑effect (attribution) and understanding step‑by‑step tasks (procedure).
Two parts of the dataset:
- Human‑annotated set: People watched the whole videos, wrote questions, and checked answers carefully.
- Synthetic set: The team also made “controlled” questions using an automated pipeline. Some questions use only the video’s visuals; others use only the audio (like speech or sounds). This helps test whether the AI can handle each type of information separately.
How the automated pipeline works (simple analogy): It’s like a smart librarian that:
- Finds videos with open licenses,
- Picks key frames (important snapshots) or transcribes the audio (turns speech into text),
- Uses a strong AI to write questions about either just the visuals or just the audio,
- Has humans spot‑check the quality to keep things fair and sensible.
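To make this concrete, here is a minimal Python sketch of such a pipeline. It is an illustration only, not the authors' code: it uses OpenCV for uniform frame grabs and OpenAI's Whisper for transcription (the real pipeline uses Katna-based keyframe selection and UniVTG summarization), and generate_questions is a hypothetical stand-in for the GPT‑4V question writer.

```python
# Illustrative sketch only -- not the MMWorld authors' pipeline.
# Assumes opencv-python (cv2) and openai-whisper are installed;
# generate_questions() is a hypothetical stand-in for the GPT-4V QA generator.
import cv2
import whisper


def extract_keyframes(video_path: str, num_frames: int = 10):
    """Uniformly sample up to `num_frames` snapshots from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)
    frames = []
    for idx in range(0, total, step)[:num_frames]:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


def transcribe_audio(video_path: str) -> str:
    """Turn the video's speech track into text with Whisper ASR."""
    model = whisper.load_model("base")
    return model.transcribe(video_path)["text"]


def generate_questions(context, modality: str):
    """Placeholder: prompt a strong MLLM (e.g., GPT-4V) with either the
    keyframes or the transcript and ask it to write multiple-choice QA."""
    raise NotImplementedError("hypothetical QA-generation step")


def build_items(video_path: str):
    visual_qa = generate_questions(extract_keyframes(video_path), "visual")
    audio_qa = generate_questions(transcribe_audio(video_path), "audio")
    # Human spot-checks would follow before items enter the benchmark.
    return visual_qa, audio_qa
```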
Finally, they tested 12 well‑known AI models (some open‑source, some commercial) on these questions to see how well they did.
What did they find, and why does it matter?
Main results:
- The best model (GPT‑4V) scored about 52% accuracy overall. In other words, even the strongest model gets nearly half of these multiple‑choice questions wrong, meaning the tasks are tough and today’s models are far from true “world understanding.”
- Some models trained specifically for video even did worse than random guessing on this benchmark—showing that real‑world video reasoning is hard.
- Strengths vary by model:
- Closed‑source (commercial) models like GPT‑4V and Gemini were often best overall.
- An open‑source model (Video‑LLaVA) did especially well on time‑related understanding and on areas that need strong “motion over time” skills (like sports and hands‑on tasks).
- Human vs. AI differences:
- There’s some overlap: questions humans find harder also tend to be harder for models.
- But AI and humans don’t struggle with the exact same things. Sometimes AI gets “expert‑level” questions right that non‑expert humans miss, and other times AI trips on “easy” questions because it lacks context or common sense.
Why this matters:
- These results show that describing a single image is much easier than understanding a whole video over time, explaining why things happen, or predicting the future.
- MMWorld gives researchers a clear target to improve AI models that must reason about the real world, not just recognize objects.
What’s the impact?
- For researchers and developers: MMWorld is a challenging, well‑designed testbed to build and compare better “world models” in AI—especially for video. It encourages progress on skills like cause‑and‑effect, what‑if reasoning, and step‑by‑step understanding.
- For real‑world uses: Stronger video understanding could help in education (explaining experiments), robotics (following procedures safely), sports analysis (predicting plays), healthcare training (understanding proper steps), and more.
- Cautions: The authors note risks like AI mistakes (hallucinations) and privacy concerns if video understanding is misused. Responsible use and careful evaluation are important.
In short, MMWorld is like a tough, multi‑subject exam for video‑watching AIs. Today’s best AIs still struggle, which shows how much room there is to grow—and gives a clear path for making smarter, safer, and more helpful systems in the future.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of concrete gaps and unresolved questions that future work could address:
- Inconsistent dataset statistics and documentation clarity: the paper states both 6,627 QA pairs overall and, elsewhere, ranges like “<417–1,559>”; clarify final counts per subset, per discipline, and per question type; report inter-annotator agreement and full annotation protocol.
- Limited temporal granularity in evaluation: proprietary models are fed only 10 frames per video; quantify how accuracy scales with frame count, sampling strategy (uniform vs. content-aware), and continuous/streaming video tokens.
- Short-horizon bias: average video length is ~100–116 seconds and synthetic collection restricts queries to 2 minutes; evaluate long-horizon temporal reasoning and memory over longer videos and multi-scene narratives.
- Audio modality underutilization and unfair comparison: many evaluations rely on frames only; audio-based tests use ASR transcripts and, for Gemini (audio setting), only the question is provided; standardize audio inputs across models and include non-speech acoustic events, multilingual speech, and noisy conditions.
- Lack of multimodal integration tasks: synthetic subsets isolate single modalities; add tasks where correct answers provably require joint audio–visual reasoning (with ablation showing failure when either modality is removed).
- Counterfactual and future prediction ground truthing: multiple plausible futures or counterfactuals may exist; detail how correctness was operationalized, include multiple-acceptable answers where appropriate, and measure agreement among independent annotators.
- Potential circularity/bias in synthetic data: GPT-4V generates QAs and captions while also being evaluated; quantify advantage to GPT-4V from style/knowledge alignment, and provide ablations using alternative generators or human-only curation.
- Limited human baseline and expertise calibration: “difficulty” is based on 3 turkers per item (non-experts); collect broader human baselines and expert annotations per domain to validate “domain expertise” items and set meaningful ceilings.
- No hard controls against shortcutting: add adversarial controls that require temporal dependencies (e.g., mask single frames), verify that single-frame baselines perform significantly worse, and release per-item “evidence dependency” labels.
- Fairness of cross-model protocol: models differ in number of frames, prompt templates, and safety settings; standardize prompts, input budgets, and frame/audio provisioning; report sensitivity to prompt variants and temperature.
- GPT-4-as-judge reliability and coverage: current validation is 189 examples (4.76% error); increase human verification scale across disciplines and question types, report per-type judge error, and release judge prompts and decision rules.
- Data contamination risk: videos come from public web sources likely included in pretraining; perform near-duplicate detection and overlap analysis with known training corpora, and report results per model.
- Distractor quality and multiple-choice artifacts: analyze plausibility and hardness of distractors, control for lexical cues, and benchmark free-form answering with calibrated semantic matching.
- Missing evaluation of calibration and abstention: add metrics for confidence calibration, selective answering, and robustness to uncertain or ambiguous questions.
- No evaluation of explanation faithfulness: while error analyses collect explanations, there is no formal measure of reasoning faithfulness; add counterfactual or rationale-consistency tests.
- Limited diversity and representativeness: YouTube Creative Commons and selected datasets may skew geography, language, and content; quantify distributional biases and add curated non-English, non-Western, and low-resource domains.
- Lack of privacy and fairness audits: beyond a brief risk note, no systematic audit of sensitive content, demographic bias, or privacy implications; include datasheets, consent assessment, and red-teaming for surveillance misuse.
- Missing tasks central to “world models”: no explicit tests of latent state estimation under occlusion, physical commonsense/causality beyond short clips, 3D spatial reasoning, or counterfactual physics; design controlled physical reasoning tasks with measurable state variables.
- No interactive or closed-loop evaluation: current setup is passive video QA; add embodied/interactive tasks (e.g., simulated environments) to probe planning, action prediction, and counterfactual interventions.
- Limited analysis of subdiscipline balance: some subdisciplines are sparsely represented; publish per-subdiscipline counts and ensure balanced splits to avoid training priors dominating results.
- Captioning unused despite availability: captions are released but not evaluated; include captioning benchmarks and assess alignment between QA and caption performance.
- Language coverage gaps: the pipeline depends on English ASR and prompts; extend to multilingual videos, transcripts, and QAs, and evaluate cross-lingual transfer.
- Robustness to video quality: no study of resolution, frame rate, compression, or motion blur; systematically vary these factors and report degradation curves.
- Statistical significance and uncertainty: report confidence intervals across items (not only runs), per-discipline significance tests, and item response theory analyses to validate difficulty claims.
- Reproducibility and dataset persistence: reliance on external links (YouTube) risks link rot; provide mirrored clips or deterministic frame/keyframe packages with versioned checksums and clear licensing.
- Tool-use and retrieval: domain expertise questions do not test retrieval-augmented models or tool integration; add settings where external resources are permitted and measure gains.
- Chain-of-thought video reasoning: no evaluation of step-by-step spatiotemporal reasoning; introduce CoT prompts with intermediate supervision and measure faithfulness/utility.
- Frame selection method unspecified: detail and ablate keyframe extraction methods (uniform vs. motion-/event-based) and their impact on temporal reasoning tasks (a minimal sketch of such an ablation follows this list).
- Hard negatives for audio vs. visual reliance: construct pairs of items that differ only in audio or only in visual cues to quantify modality reliance and triangulate error causes.
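As a concrete starting point for the keyframe-selection ablation flagged above, the sketch below contrasts uniform sampling with a crude frame-difference heuristic as a proxy for motion-/event-based selection. It is a hypothetical illustration built on OpenCV, not the benchmark's Katna-based selector; an actual ablation would feed both frame sets to the same model and prompt and compare accuracy on temporal-reasoning items.

```python
# Hypothetical ablation sketch: uniform vs. frame-difference keyframe selection.
# Built on OpenCV; not MMWorld's Katna-based pipeline.
import cv2
import numpy as np


def sample_uniform(video_path: str, k: int = 10):
    """Take k evenly spaced frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), k, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


def sample_by_motion(video_path: str, k: int = 10):
    """Score frames by how much their (downscaled, grayscale) content differs
    from the previous frame, then keep the k highest-scoring frames."""
    cap = cv2.VideoCapture(video_path)
    scores, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, (160, 90)), cv2.COLOR_BGR2GRAY)
        scores.append(0.0 if prev is None else float(np.mean(cv2.absdiff(gray, prev))))
        prev = gray
    cap.release()
    keep = sorted(int(i) for i in np.argsort(scores)[-k:])
    # Re-read only the selected frames to keep memory bounded.
    cap = cv2.VideoCapture(video_path)
    frames = []
    for idx in keep:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```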
Glossary
- Ablation study: A controlled analysis that removes or varies components to assess their impact on results. "Finally, the statistics of automated curated data, which is used for the ablation study, are shown in Table~\ref{tab:merged_benchmark_stats_total}."
- Attribution Understanding: A reasoning task focused on identifying cause-and-effect relationships within videos. "GPT-4V emerges as the strongest model across Future Prediction, Domain Expertise, and Attribution Understanding."
- Automatic Speech Recognition (ASR): Technology that converts spoken language in audio to text. "Keyframes are extracted for visual-based QA generation, and videos are transcribed using an ASR module for audio-based QA generation."
- CLIP ViT-L/14: A large vision transformer variant used in CLIP for vision-language representation learning. "adoption of CLIP ViT-L/14 trained in LanguageBind~\citep{videollava} as its vision model"
- Counterfactual thinking: Reasoning about “what if” scenarios and alternative outcomes. "counterfactual thinking (answering what-if questions)"
- Domain expertise: Specialized knowledge required to understand content from specific fields. "Hence, domain expertise across a variety of disciplines is imperative for a thorough evaluation of a model's world understanding towards AGI~\citep{morris2023agi,yue2023mmmu}."
- Embodied Tasks: Tasks involving agents acting in and interacting with the physical world. "the best open-source model Video-LLaVA-7B outperforms GPT-4V and Gemini on Embodied Tasks by a large margin"
- First-party annotation: Data labels created directly by the dataset’s authors or curators rather than scraped from elsewhere. "MMWorld is the first multi-discipline and multitask video understanding benchmark that covers wider reasoning questions, and also included first-party data annotations."
- Frame embedding layer: A model component that encodes individual video frames into vector representations. "VideoLLaMA~\citep{zhang2023videollama} introduces a frame embedding layer and also leverages ImageBind to inject temporal and audio information into the LLM backend."
- Future prediction: Anticipating plausible upcoming events or states based on current observations. "future prediction (predicting future events)"
- Hallucination: An AI failure mode where models generate confident but unsupported or incorrect content. "While using LLMs for data generation can introduce hallucination issues"
- Instruction finetuning: Adapting a model using datasets of instruction-response pairs to improve task following. "Otter~\citep{li2023otter} proposes to conduct instruction finetuning based on Openflamingo~\citep{awadalla2023openflamingo}."
- Keyframes: Representative frames selected from a video to summarize content for efficient processing. "Keyframes are extracted for visual-based QA generation"
- LAION-CCSBU: A large-scale image-text dataset used for training vision-LLMs. "558K LAION-CCSBU image-text pairs"
- LanguageBind: A framework/model suite aligning multiple modalities (e.g., audio, video) with language representations. "adoption of CLIP ViT-L/14 trained in LanguageBind~\citep{videollava} as its vision model"
- LLM: A neural network trained on large text corpora to perform a wide range of language tasks. "Foundation models, such as LLMs~\citep{openai2023gpt4, touvron2023llama, jiang2023mistral,anil2023palm}"
- Multimodal LLM (MLLM): An LLM that can process and reason over multiple modalities, such as text, images, audio, and video. "and Multimodal LLMs (MLLMs)~\citep{gpt4-v,team2023gemini,videollava,li2023videochat,Maaz2023VideoChatGPT,chen2023minigptv2}"
- Perceiver resampler: A module that reduces and adapts high-dimensional perceptual inputs for downstream processing. "with only the Perceiver resampler module fine-tuned, which may contribute to its lower performance."
- Procedure Understanding: A reasoning task assessing comprehension of step-by-step processes in videos. "7) Procedure Understanding: Tests the model's ability to comprehend and explain procedural tasks shown in the video."
- QFormer: A query-based transformer used to map visual features into token sequences compatible with LLMs. "VideoChat~\citep{li2023videochat} leverages the QFormer~\citep{blip2} to map visual representations to LLM~\citep{vicuna2023}"
- Query-focused video summarization: Summarizing videos based on relevance to a specific query or topic. "The Video Summarization module utilizes Query-focused video summarization techniques based on Katna"
- Spatio-temporal reasoning: Inference over both spatial and temporal dependencies in video data. "which enhances its spatio-temporal reasoning abilities."
- Spatiotemporal dynamics: The combined spatial and temporal patterns that characterize events in videos. "where spatiotemporal dynamics play a more crucial role in video understanding."
- Synthetic dataset: Artificially constructed data (often via automated pipelines) used to analyze or train models under controlled conditions. "a synthetic dataset designed to analyze MLLMs within a single modality of perception."
- Temporal Understanding: Reasoning about the order, duration, and timing of events in a video. "This is further validated with its leading results on the Temporal Understanding question type."
- UniVTG: A unified video-language temporal grounding model used here to support query-focused summarization. "and UniVTG~\citep{univtg}"
- Video infilling: Predicting and generating missing frames or segments within a video sequence. "and video infilling~\citep{himakunthala2023lets}."
- Video question answering: Answering questions grounded in video content by integrating visual, temporal, and sometimes audio cues. "enabling tasks such as video question answering and video captioning."
- WebVid: A large-scale video-text dataset commonly used for training video-LLMs. "702K video-text pairs from WebVid~\citep{webvid}."
- Whisper: An automatic speech recognition model by OpenAI used for transcribing audio. "This may be attributed to its use of the Whisper~\citep{whisper} speech recognition model."
- World Model: An internal representation enabling a system to infer hidden state and predict plausible future states of the world. "Are they equipped with an inherent World Model~\citep{lecun2022path,worldknowledge,worldmodels,pandora} that can understand and reason about the underlying principles and causalities of the dynamic, multimodal world?"
- YouTube-8M: A large-scale labeled video dataset used for retrieval and evaluation. "YouTube-8M dataset~\citep{youtube8m}."
Practical Applications
Immediate Applications
The following applications can be deployed now using the MMWorld benchmark, datasets, findings, and tooling described in the paper.
- Model benchmarking and selection for video features (software, media, robotics)
- Use MMWorld’s multi-discipline, multi-faceted QA evaluation to choose MLLMs for specific product features like video summarization, coaching analytics, or assembly assistance.
- Workflow: Integrate MMWorld QA tasks (explanation, temporal understanding, future prediction) into CI/CD model gates; adopt audio-only vs visual-only tests to diagnose modality weaknesses (a minimal gate sketch appears at the end of this list).
- Assumptions/dependencies: Current top models achieve ~52% accuracy overall; reliability varies by discipline (e.g., Video-LLaVA is stronger on temporal/embodied tasks). Requires GPU resources and standardized prompting and answer parsing.
- Pre-deployment risk and compliance QA harness (policy, safety, governance)
- Use MMWorld’s error taxonomy (question understanding, audio/visual perception, hallucination, reasoning, domain knowledge gaps, refusal) to design pre-release checklists and red-team evaluations for video understanding features.
- Workflow: Add benchmark slices per sector; track error-type frequencies; establish minimum accuracy thresholds for regulated contexts (e.g., healthcare, surveillance-sensitive applications).
- Assumptions/dependencies: Benchmark coverage is broad but not exhaustive; policy teams must align thresholds with domain-specific risk.
- Diagnostic training-data curation and curriculum design (academia, ML engineering)
- Employ MMWorld’s seven reasoning types to structure curriculum learning and targeted data augmentation (e.g., add domain-expertise samples for medical assembly tasks; counterfactual prompts for science labs).
- Workflow: Create per-skill datasets; run ablations using synthetic audio-only or visual-only subsets to isolate failure modes; measure uplift per skill.
- Assumptions/dependencies: Instruction tuning improves specific skills; synthetic QA relies on GPT-4V-generated annotations (biases must be monitored).
- Audio-only and visual-only perception testing (media, accessibility, call centers)
- Use synthetic Subset I (audio) and Subset II (visual) to evaluate and select agents for speech-centric tasks (ASR-based QA, broadcast compliance) or visual perception tasks (scene understanding without audio).
- Tools/products: Voice QA agents for customer support; visual-only content analyzers for muted streams.
- Assumptions/dependencies: Audio pipelines work best when paired with reliable ASR (e.g., Whisper); some models outperform random only slightly in certain settings.
- Robotics and embodied-task assistant evaluation and tuning (robotics, manufacturing)
- Leverage MMWorld’s embodied tasks and temporal understanding to select models for procedure tracking (e.g., IKEA assembly, household tasks).
- Workflow: Pair Video-LLaVA (strong temporal reasoning) with high-quality ASR; use procedure/attribution questions to verify step recognition and cause-effect reasoning.
- Assumptions/dependencies: Current accuracy limits preclude full autonomy; best suited for assistive/co-pilot roles with human oversight.
- Sports analytics and coaching tools (sports, media)
- Apply temporal reasoning tasks to build explainable video QA tools for play analysis, rule explanations, and future move predictions.
- Products: Coach-side assistants that query game clips (“What led to this turnover?”; “What happens if defense switches here?”).
- Assumptions/dependencies: Domain-specific fine-tuning and robust data ingestion (frame selection, event segmentation) needed.
- Healthcare training and education content QA (healthcare, education)
- Use domain-expertise and procedure-understanding questions to generate and validate training modules from medical/rehab videos; check comprehension via multi-choice QA.
- Workflow: Construct sector-specific benchmarks (e.g., PT exercises, device assembly) using the automated pipeline; track learner progress by reasoning type.
- Assumptions/dependencies: Not for clinical decision-making; requires compliance with privacy and licensing; human review for accuracy.
- Content platform moderation and explainable captioning (media, platforms)
- Use explanation and attribution QA to improve auto-captions and safety checks (e.g., flag risky procedures, provide context-aware summaries).
- Tools: Caption enhancers that synthesize causal/temporal explanations; moderation lists tuned by discipline.
- Assumptions/dependencies: Accuracy thresholds and human-in-the-loop processes required to prevent erroneous flags or misleading explanations.
- Sector-specific benchmark creation via the automated pipeline (software, research)
- Reuse the paper’s pipeline (Katna keyframe selection, UniVTG summarization, ASR transcripts, GPT-4V QA generation) to build custom benchmarks (e.g., surgical videos, factory line audits).
- Workflow: Generate subdiscipline queries; filter Creative Commons content; produce QA/captions; run human spot-checks.
- Assumptions/dependencies: Licensing and ethics review; GPT-4V involvement in QA generation introduces biases and costs.
- Human–AI complementary labeling strategies (operations, research)
- Exploit the observed human–model skill differences: route “expert-level” items models can handle to AI, and “easy” items models oddly miss to human annotators for QC.
- Workflow: Difficulty-aware task routing; monitor discipline-wise performance gaps.
- Assumptions/dependencies: Requires measurement infrastructure to classify difficulty and track cross-skill performance.
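For the benchmarking and CI-gate workflow above, a minimal harness sketch is shown below. The JSONL record format (question_type, answer, prediction fields) and the thresholds are assumptions for illustration; MMWorld does not ship this tool.

```python
# Minimal CI-gate sketch (hypothetical record format; not an official MMWorld tool).
# Each JSONL line is assumed to look like:
#   {"question_type": "Temporal Understanding", "answer": "b", "prediction": "c"}
import json
import sys
from collections import defaultdict

THRESHOLDS = {"Temporal Understanding": 0.45, "Explanation": 0.50}  # example gates


def accuracy_by_type(path: str) -> dict:
    """Compute accuracy per question type from a JSONL results file."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            qtype = rec["question_type"]
            total[qtype] += 1
            correct[qtype] += int(rec["prediction"].strip().lower()
                                  == rec["answer"].strip().lower())
    return {t: correct[t] / total[t] for t in total}


def gate(path: str) -> int:
    """Return a non-zero exit code if any tracked skill falls below threshold."""
    scores = accuracy_by_type(path)
    for qtype, acc in sorted(scores.items()):
        print(f"{qtype}: {acc:.1%}")
    failed = [t for t, thr in THRESHOLDS.items() if scores.get(t, 0.0) < thr]
    if failed:
        print("Gate failed for: " + ", ".join(failed))
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```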
Long-Term Applications
The following opportunities require further research, scaling, or development to reach reliable deployment.
- World-model capability certification and standards (policy, governance)
- Establish sector-wide standards using MMWorld-style multi-faceted video evaluation to certify “world modeling” claims (explanation, counterfactuals, future prediction, temporal cognition).
- Products: Third-party audit services; capability scorecards by discipline and modality.
- Dependencies: Community consensus on metrics and thresholds; expanded datasets with expert annotations and sensitive contexts.
- Robotics autonomy with procedure understanding and counterfactual planning (robotics, manufacturing)
- Build agents that watch live video, understand procedural steps, predict future states, and suggest recovery plans when deviations occur.
- Tools: On-device spatiotemporal transformers; task graphs from video; counterfactual simulators for “what-if” replanning.
- Dependencies: Higher accuracy and robustness; long-horizon memory; safety certification; integration with sensor fusion beyond video/audio.
- Clinical video decision support and patient monitoring (healthcare)
- Real-time reasoning on surgical/recovery videos: explain steps, predict complications, suggest interventions, and assess adherence to protocols.
- Products: OR assistants; rehab compliance monitors; training simulators with counterfactual feedback.
- Dependencies: Medical-grade accuracy; regulated data pipelines; strong privacy protections; bias and fairness audits; clinician-in-the-loop.
- Industrial safety and predictive hazard detection (energy, manufacturing, construction)
- Video-based agents that explain causes, anticipate hazards (future prediction), and recommend mitigations in dynamic environments.
- Tools: Risk dashboards; timeline visualizations; near-miss predictors using attribution and temporal reasoning.
- Dependencies: Integration with IoT sensors; edge compute; reliability across varied lighting and occlusion; strong false-positive/negative controls.
- Intelligent educational tutors for labs and demonstrations (education, science)
- Tutors that watch student lab videos, explain phenomena, offer counterfactuals (“What if we increase temperature?”), and predict outcomes.
- Products: Interactive lab assistants; assessment engines aligned to MMWorld question types.
- Dependencies: Domain-specific expert datasets; alignment with curricula; robust handling of noisy student-generated video; pedagogical validation.
- Retail and operations planning via video reasoning (finance, retail)
- Analyze store-floor videos to explain bottlenecks, predict customer flow, and evaluate procedural compliance.
- Tools: Ops analytics platforms; scenario simulators using counterfactuals (e.g., staffing changes).
- Dependencies: Privacy safeguards; reliable person and activity recognition; organizational buy-in for continuous monitoring ethics.
- Video-based forecasting tools in sports/media (sports, media)
- Predict near-term plays or outcomes; offer strategy counterfactuals with explainable rationales grounded in video dynamics.
- Products: Analyst tools; fan-facing interactive breakdowns; coaching simulators.
- Dependencies: High-quality event segmentation; better temporal modeling; domain knowledge infusion.
- Multimodal training curricula and RLHF pipelines targeting reasoning skills (academia, ML platforms)
- Develop curricula that explicitly train explanation, counterfactual, future prediction, temporal and attribution abilities using MMWorld-like tasks and feedback.
- Workflow: Skill-specific datasets; reward models judging correctness and reasoning steps; longitudinal capability tracking.
- Dependencies: Scalable data with verified answers; improved long-context models; cost-effective training.
- Extended multimodal benchmarks and architectures (software, research)
- Expand MMWorld to include longer videos, continuous audio, and additional modalities (egocentric sensors, depth, events), enabling richer world-model evaluation.
- Tools: Streaming evaluation harnesses; memory-efficient sequence models; cross-modal alignment modules.
- Dependencies: Engineering advances in long-sequence modeling; standardized data formats; compute and storage.
- Privacy-preserving, on-device video understanding (policy, edge computing)
- Build privacy-first agents that process sensitive videos locally while offering explanations and predictions aligned with MMWorld tasks.
- Products: On-device assistants for homes/clinics; differential privacy or federated learning strategies.
- Dependencies: Efficient edge models; verifiable privacy guarantees; user consent and transparency frameworks.