Towards Neuro-Symbolic Video Understanding

Minkyu Choi1,2, Harsh Goel† 1,2, Mohammad Omama† 1,2, Yunhao Yang1, Sahil Shah1,2, Sandeep Chinchali1,2
1The University of Texas at Austin 2UT Swarm Lab
†Contributed equally to this work


Abstract

The unprecedented surge in video data production in recent years necessitates efficient tools to extract meaningful frames from videos for downstream tasks. Long-term temporal reasoning is a key desideratum for frame retrieval systems. While state-of-the-art foundation models, like VideoLLaMA and ViCLIP, are proficient in short-term semantic understanding, they surprisingly fail at long-term reasoning across frames. A key reason for this failure is that they intertwine per-frame perception and temporal reasoning into a single deep network. Hence, decoupling but co-designing semantic understanding and temporal reasoning is essential for efficient scene identification. We propose a system that leverages vision-language models for semantic understanding of individual frames but effectively reasons about the long-term evolution of events using state machines and temporal logic (TL) formulae that inherently capture memory. Our TL-based reasoning improves the F1 score of complex event identification by 9-15% compared to benchmarks that use GPT-4 for reasoning on state-of-the-art self-driving datasets such as Waymo and nuScenes. The source code is available on GitHub.

Methodology


We introduce a novel way to identify scenes of interest using a neuro-symbolic approach. Given a video stream or clip together with a temporal logic (TL) specification Φ, Neuro-Symbolic Visual Search with Temporal Logic (NSVS-TL) identifies the scenes of interest.
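The following is a minimal illustrative sketch of this decoupling, not the released API; the function and proposition names (find_scene, perceive, "pedestrian", "car_stops") are assumptions. A perception callable maps each frame to truth values of atomic propositions, and a small state machine checks the query "eventually a pedestrian appears, and afterwards the car stops".

from typing import Callable, Dict, Iterable, List

def find_scene(frames: Iterable,
               perceive: Callable[[object], Dict[str, bool]]) -> List[int]:
    """Search for frames satisfying: eventually pedestrian, then eventually car_stops."""
    matched: List[int] = []
    state = 0                       # 0: waiting for pedestrian, 1: waiting for car_stops
    for t, frame in enumerate(frames):
        props = perceive(frame)     # per-frame perception, e.g. {"pedestrian": True, ...}
        if state == 0 and props.get("pedestrian", False):
            matched.append(t)
            state = 1
        elif state == 1 and props.get("car_stops", False):
            matched.append(t)
            return matched          # specification satisfied; return both frame indices
    return []                       # specification not satisfied over this clip

The point of the sketch is the separation of concerns: the neural model only answers per-frame questions, while the state machine carries the memory needed for long-term reasoning.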


Method Overview


Autonomous Driving Example



Key Capabilities


Long Horizon Video Understanding

We evaluate multi-event sequences separated by temporally extended gaps, which substantially increase video length. Performance remains consistent on videos spanning up to 40 minutes, indicating reliability in handling long videos.
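The sketch below, assuming a hypothetical monitor_sequence helper and a sequential specification ("eventually e1, then eventually e2, ..."), illustrates why TL-based monitoring scales to long horizons: the checker stores only its position in the sequence, so memory stays constant no matter how many minutes separate the events.

from typing import Dict, Iterable, List

def monitor_sequence(events: List[str],
                     prop_stream: Iterable[Dict[str, bool]]) -> List[int]:
    """Return the frame index at which each event in `events` first holds, in order."""
    hits: List[int] = []
    next_event = 0                          # only the current position is stored
    for t, props in enumerate(prop_stream):
        if next_event < len(events) and props.get(events[next_event], False):
            hits.append(t)
            next_event += 1
    return hits                             # full match iff len(hits) == len(events)

For example, monitor_sequence(["pedestrian", "car_stops", "car_moves"], stream) can scan a 40-minute proposition stream frame by frame without buffering past frames.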

Plug In Your Own Model

Our framework supports the integration of any neural perception model, strengthening per-frame semantic understanding of videos. This modularity lets us localize frames of interest with respect to a given query.
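A minimal sketch of this plug-in idea, with hypothetical class and method names (PerceptionModel and ThresholdedModel are not the repository's API): any model that scores atomic propositions per frame, whether an object detector or a vision-language model, can be thresholded into the truth values the TL checker consumes.

from abc import ABC, abstractmethod
from typing import Callable, Dict

class PerceptionModel(ABC):
    """Common interface: map a frame to per-proposition confidence scores."""
    @abstractmethod
    def propositions(self, frame) -> Dict[str, float]:
        ...

class ThresholdedModel(PerceptionModel):
    """Wraps any scoring function (object detector, vision-language model, ...)."""
    def __init__(self, score_fn: Callable[[object], Dict[str, float]],
                 threshold: float = 0.5):
        self.score_fn = score_fn
        self.threshold = threshold

    def propositions(self, frame) -> Dict[str, float]:
        return self.score_fn(frame)

    def truth_values(self, frame) -> Dict[str, bool]:
        # Boolean atomic propositions consumed by the TL checker.
        return {p: s >= self.threshold for p, s in self.propositions(frame).items()}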

Comparison to Benchmark

From the experiments, we observe that the performance of NSVS-TL with different neural perception models varies with the complexity of the TL specification and the dataset. For single-event scenarios, both our method and LLM-based reasoning perform reasonably well, since such events do not require complex temporal reasoning. For multi-event scenarios, however, our TL-based reasoning outperforms all LLM-based baselines.



BibTeX

@inproceedings{Choi_2024_ECCV,
  author    = {Choi, Minkyu and Goel, Harsh and Omama, Mohammad and Yang, Yunhao and Shah, Sahil and Chinchali, Sandeep},
  title     = {Towards Neuro-Symbolic Video Understanding},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  month     = {September},
  year      = {2024},
}