- The paper introduces a spatio-temporal DenseNet within a tracking-by-detection framework that achieves an 84.76% average precision in predicting pedestrian intent.
- It leverages YOLOv3 paired with SORT-UKF for efficient pedestrian detection and tracking, maintaining robustness even with noisy inputs.
- The study underscores a significant step toward enhanced AGV safety by enabling real-time behavior prediction, paving the way for future sensor fusion applications.
Real-time Intent Prediction of Pedestrians for Autonomous Ground Vehicles via Spatio-Temporal DenseNet
The paper presents a comprehensive study of the challenges faced by autonomous ground vehicles (AGVs) in understanding human intentions and behaviors, particularly in complex urban environments where pedestrian interactions are frequent and often unpredictable. It introduces a novel framework that uses video from monocular RGB cameras to predict pedestrian intent accurately in real time.
Technical Contribution
The central contribution of this paper is the integration of spatio-temporal DenseNet models within a tracking-by-detection framework, enabling a nuanced understanding of pedestrian intent from visual data alone. The DenseNet model, adapted for spatio-temporal use, effectively captures dependencies across video sequences, offering a promising alternative to conventional methods that rely heavily on recurrent neural networks (RNNs), which are computationally intensive and less suitable for real-time applications.
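To make the "spatio-temporal DenseNet" idea concrete, here is a minimal sketch of a 3D dense block in PyTorch. This is not the paper's architecture; the layer names, growth rate, and clip dimensions are illustrative assumptions. The key property it demonstrates is dense connectivity over spatio-temporal feature maps: each layer's output is concatenated to its input along the channel axis, and 3D convolutions mix information across both frames and spatial locations.

```python
import torch
import torch.nn as nn

class DenseLayer3D(nn.Module):
    """One dense layer: BN -> ReLU -> Conv3d, output concatenated to input."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm3d(in_channels)
        self.conv = nn.Conv3d(in_channels, growth_rate, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv(torch.relu(self.bn(x)))
        return torch.cat([x, out], dim=1)  # dense connectivity

class DenseBlock3D(nn.Module):
    """Stack of dense layers; channel count grows by growth_rate per layer."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        layers = []
        ch = in_channels
        for _ in range(num_layers):
            layers.append(DenseLayer3D(ch, growth_rate))
            ch += growth_rate
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

# A clip of 16 frames of 64x64 RGB pedestrian crops: (batch, channels, time, H, W)
clip = torch.randn(1, 3, 16, 64, 64)
block = DenseBlock3D(in_channels=3, growth_rate=12, num_layers=4)
out = block(clip)
print(tuple(out.shape))  # channels grow to 3 + 4 * 12 = 51
```

Because 3D convolutions see the temporal axis directly, the network can model motion cues without the sequential recurrence of an RNN, which is what makes the approach attractive for real-time use.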
Framework Architecture
The paper details a two-stage process:
- Detection and Tracking: Utilizes YOLOv3 combined with SORT-UKF to detect and track pedestrians efficiently. The choice of YOLOv3 over other detectors like Faster-RCNN is justified by its single-stage nature, offering a balance between detection accuracy and processing speed, essential for real-time applications.
- Intent Prediction: The spatio-temporal DenseNet model processes sequences of pedestrian bounding boxes to predict their intended actions (e.g., deciding to cross the street), yielding an average precision score of 84.76%. This performance surpasses several benchmark models, including ConvNet-LSTM and C3D approaches, underlining the robustness of the proposed method.
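The tracking half of the pipeline hinges on associating each tracked pedestrian's predicted box with a fresh detection every frame. As a rough illustration of that association step (not the paper's implementation, which pairs SORT with an unscented Kalman filter and typically uses Hungarian matching rather than the greedy loop below), here is a minimal IoU-based matcher; the function names and threshold are assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(track_boxes, detections, iou_threshold=0.3):
    """Greedily match each predicted track box to its best unused detection."""
    matches, used = [], set()
    for ti, t in enumerate(track_boxes):
        best, best_iou = None, iou_threshold
        for di, d in enumerate(detections):
            if di in used:
                continue
            score = iou(t, d)
            if score > best_iou:
                best, best_iou = di, score
        if best is not None:
            used.add(best)
            matches.append((ti, best))
    return matches

# One track near (0,0); two detections, only the first overlaps it.
print(associate([[0, 0, 10, 10]], [[1, 1, 11, 11], [50, 50, 60, 60]]))
# -> [(0, 0)]
```

Matched tracks are then updated by the filter, and the resulting per-track sequences of bounding-box crops are what the ST-DenseNet consumes for intent classification.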
Evaluation and Results
The paper rigorously evaluates the framework's performance on the JAAD dataset, demonstrating superior predictive accuracy compared to existing models. Furthermore, the authors conducted a series of experiments to understand the framework's resilience under noisy detection conditions. Even with noisy inputs from different detector models (SSD, ACF), the framework remained competitive, highlighting the robustness of ST-DenseNet in maintaining prediction quality.
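For readers unfamiliar with the headline metric, average precision summarizes the precision-recall trade-off of a binary score (here, per-sequence "will cross" probabilities) into a single number. A minimal sketch with scikit-learn, using made-up labels and scores rather than the paper's JAAD evaluation protocol:

```python
from sklearn.metrics import average_precision_score

# Hypothetical per-sequence labels (1 = pedestrian crosses) and model scores.
y_true = [1, 0, 1, 1, 0, 1]
y_score = [0.9, 0.4, 0.8, 0.35, 0.2, 0.75]

ap = average_precision_score(y_true, y_score)
print(round(ap, 2))  # -> 0.95
```

Under a metric like this, the reported 84.76% means the model ranks crossing sequences well above non-crossing ones across operating thresholds, rather than being accurate at a single cutoff.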
Implications and Future Directions
From a practical standpoint, the successful application of this framework may enhance the safety and reliability of AGVs operating in urban traffic settings by providing them with improved capabilities to anticipate human actions. Such advancements could accelerate the integration of autonomous vehicles into everyday transportation systems.
Theoretically, this work opens avenues for further research into hybrid frameworks that combine dense connectivity principles with spatio-temporal modeling. Future developments could explore the scalability of the approach across different sensory modalities, such as depth or thermal imaging, and extend its applications to other vulnerable road user scenarios beyond pedestrian interaction.
Overall, this paper makes a substantial contribution to the field of autonomous vehicle interaction, offering insights into both technical strategy and real-world applicability. The advancements in real-time behavior prediction underscore the potential for more intuitive and responsive autonomous systems.