
Joint Gaze-Location and Gaze-Object Detection (2308.13857v1)

Published 26 Aug 2023 in cs.CV

Abstract: This paper proposes an efficient and effective method for joint gaze location detection (GL-D) and gaze object detection (GO-D), \emph{i.e.}, gaze following detection. Current approaches frame GL-D and GO-D as two separate tasks, employing a multi-stage framework in which human head crops must first be detected and then fed into a subsequent GL-D sub-network, followed by an additional object detector for GO-D. In contrast, we reframe gaze following detection as detecting human head locations and their gaze followings simultaneously, aiming to jointly detect human gaze locations and gaze objects in a unified, single-stage pipeline. To this end, we propose GTR, short for \underline{G}aze following detection \underline{TR}ansformer, which streamlines the gaze following detection pipeline by eliminating all additional components, yielding the first unified paradigm that unites GL-D and GO-D in a fully end-to-end manner. GTR enables iterative interaction between holistic semantics and human head features through a hierarchical structure, inferring the relations between salient objects and human gaze from the global image context and achieving impressive accuracy. Concretely, GTR achieves a 12.1 mAP gain ($\mathbf{25.1}\%$) on GazeFollowing and an 18.2 mAP gain ($\mathbf{43.3\%}$) on VideoAttentionTarget for GL-D, as well as a 19 mAP improvement ($\mathbf{45.2\%}$) on GOO-Real for GO-D. Meanwhile, unlike existing systems that detect gaze following sequentially because they require a human head as input, GTR can comprehend the gaze followings of any number of people simultaneously, resulting in high efficiency. Specifically, GTR delivers over a $9\times$ improvement in FPS, and the relative gap becomes more pronounced as the number of people grows.
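To make the single-stage idea concrete, below is a minimal PyTorch sketch of a DETR-style detector in which each decoder query jointly predicts a head box, a gaze location, and a gazed-object box, so GL-D and GO-D come out of one forward pass for any number of people. This is an illustrative reconstruction based only on the abstract's description; the class names (`GTRSketch`, `GazeFollowingHead`), layer sizes, and the stand-in backbone are assumptions, not the paper's actual architecture, losses, or matching scheme.

```python
import torch
import torch.nn as nn


class GazeFollowingHead(nn.Module):
    """Per-query prediction heads: head box, gaze point, and gaze-object box.

    All output sizes and names are illustrative, not taken from the paper.
    """

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.head_box = nn.Linear(d_model, 4)    # (cx, cy, w, h), normalized
        self.gaze_point = nn.Linear(d_model, 2)  # (x, y) gaze location, normalized
        self.object_box = nn.Linear(d_model, 4)  # gazed-object box, normalized
        self.score = nn.Linear(d_model, 1)       # logit: does this query match a person?

    def forward(self, queries: torch.Tensor) -> dict:
        # queries: (batch, num_queries, d_model) decoder outputs
        return {
            "head_boxes": self.head_box(queries).sigmoid(),
            "gaze_points": self.gaze_point(queries).sigmoid(),
            "object_boxes": self.object_box(queries).sigmoid(),
            "scores": self.score(queries).squeeze(-1),
        }


class GTRSketch(nn.Module):
    """Minimal single-stage pipeline: image features -> transformer -> joint heads."""

    def __init__(self, d_model: int = 256, num_queries: int = 20):
        super().__init__()
        # Stand-in patchify backbone; a real system would use a CNN/ViT backbone.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=3,
            num_decoder_layers=3, batch_first=True,
        )
        self.queries = nn.Embedding(num_queries, d_model)
        self.heads = GazeFollowingHead(d_model)

    def forward(self, images: torch.Tensor) -> dict:
        # images: (batch, 3, H, W)
        feats = self.backbone(images)              # (B, C, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, HW/256, C) image tokens
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        decoded = self.transformer(tokens, q)      # (B, num_queries, C)
        return self.heads(decoded)


if __name__ == "__main__":
    model = GTRSketch()
    out = model(torch.randn(2, 3, 224, 224))
    print({k: tuple(v.shape) for k, v in out.items()})
```

Because every query is decoded in parallel against the full image context, adding more people in the scene does not add extra per-head forward passes, which is the source of the efficiency advantage the abstract reports over sequential head-crop pipelines.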

Citations (3)
