Follow the Attention: Combining Partial Pose and Object Motion for Fine-Grained Action Detection (1905.04430v2)

Published 11 May 2019 in cs.CV and cs.LG

Abstract: Retailers have long sought ways to understand their customers' behaviour in order to provide a smooth and pleasant shopping experience that attracts more customers every day and, consequently, maximises revenue. Humans can flawlessly understand others' behaviour by combining different visual cues, from activity to gestures and facial expressions. Enabling computer vision systems to do the same, however, is still an open problem, owing both to its intrinsic challenges and to extrinsic difficulties such as the lack of publicly available data and unique, in-the-wild environment conditions. In this work, we focus on detecting the first and by far the most crucial cue in behaviour analysis: human activity. To do so, we introduce a framework that integrates human pose and object motion to both temporally detect and classify activities in a fine-grained manner (that is, activities that are very short and similar to one another). We incorporate partial human pose and interaction with objects in a multi-stream neural network architecture to guide the spatiotemporal attention mechanism for more efficient activity recognition. To this end, in the absence of pose supervision, we propose using a Generative Adversarial Network (GAN) to generate exact joint locations from noisy probability heat maps. Additionally, based on the intuition that complex actions demand more than one source of information to be identified, even by humans, we integrate a second stream of object motion into our network as prior knowledge, which we quantitatively show improves the recognition results. We empirically demonstrate the capability of our approach by achieving state-of-the-art results on the MERL Shopping dataset. We further investigate the effectiveness of this approach on a new shopping dataset that we have collected to address existing shortcomings.