
What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision (1503.01558v3)

Published 5 Mar 2015 in cs.CL, cs.CV, and cs.IR

Abstract: We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
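The abstract's core idea — treating recipe steps as HMM states and transcript sentences as observations, with a monotonic alignment decoded by dynamic programming — can be sketched as follows. This is a minimal illustration, not the paper's actual model: the `viterbi_align` function, the word-overlap emission score, and the step-penalty constant are all hypothetical simplifications assumed for the example.

```python
# Hypothetical sketch: align recipe steps (HMM states) to transcript sentences
# (observations) via Viterbi decoding with monotonic transitions. The paper's
# emission model is richer (speech + vision); here we use plain word overlap.

def viterbi_align(steps, transcript):
    """steps: list of recipe-step token sets.
    transcript: list of sentences, each a list of tokens.
    Returns one step index per sentence, monotonically non-decreasing."""
    n_steps, n_obs = len(steps), len(transcript)

    def emission(s, t):
        # Score a step/sentence pair by word overlap (stand-in for a real model).
        return len(steps[s] & set(transcript[t]))

    # dp[t][s] = best alignment score for sentences 0..t ending on step s
    dp = [[float("-inf")] * n_steps for _ in range(n_obs)]
    back = [[0] * n_steps for _ in range(n_obs)]
    for s in range(n_steps):
        dp[0][s] = emission(s, 0) - 0.1 * s  # mild preference to start early
    for t in range(1, n_obs):
        for s in range(n_steps):
            # Monotonic: stay on step s or arrive from any earlier step.
            prev_best, prev_arg = max((dp[t - 1][p], p) for p in range(s + 1))
            dp[t][s] = prev_best + emission(s, t)
            back[t][s] = prev_arg

    # Backtrace the best path.
    path = [0] * n_obs
    path[-1] = max(range(n_steps), key=lambda s: dp[-1][s])
    for t in range(n_obs - 1, 0, -1):
        path[t - 1] = back[t][path[t]]
    return path


steps = [{"chop", "onion"}, {"fry", "onion", "oil"}, {"add", "tomato"}]
transcript = [
    ["first", "chop", "the", "onion"],
    ["now", "fry", "it", "in", "oil"],
    ["then", "add", "the", "tomato"],
]
print(viterbi_align(steps, transcript))  # → [0, 1, 2]
```

The monotonicity constraint (each sentence aligns to the current or a later step, never an earlier one) is what distinguishes this from independent keyword spotting, which the abstract reports is outperformed by the HMM approach.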

Authors (6)
  1. Jonathan Malmaud (6 papers)
  2. Jonathan Huang (46 papers)
  3. Vivek Rathod (12 papers)
  4. Nick Johnston (17 papers)
  5. Andrew Rabinovich (23 papers)
  6. Kevin Murphy (87 papers)
Citations (152)
