NVIDIA’s AutoGaze makes AI video understanding up to 19x faster by mimicking human attention

NVIDIA found a way to make AI video understanding up to 19x faster by teaching models to ignore the parts of a video that do not matter.

Image: NVIDIA

Processing video is one of the most computationally expensive things you can ask an AI model to do. Standard pipelines feed every pixel into the model whether it matters or not, and for long, high-resolution video, that adds up fast.

AutoGaze, published by NVIDIA and UC Berkeley researchers, solves this the same way your eyes do. Instead of processing everything, it learns to identify which parts of a video actually changed or matter, and throws out the rest.

The module is surprisingly small at just 3 million parameters, and it plugs into existing video models without rebuilding them. (A rough sketch of the idea follows the list below.)

  • How much it cuts: 4x to 100x fewer tokens depending on the video content, typically keeping 5 to 20% of patches.
  • Speed gains: Up to 19x faster for standard vision transformers, 10x for NVIDIA’s NVILA-8B model.
  • What it enables: Real-time processing of 4K video with over 1,000 frames on hardware that would previously struggle with it.
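
The announcement doesn't spell out the exact architecture, but the core idea, a learned saliency score per patch followed by top-k pruning before the expensive transformer backbone, fits in a few lines. Here is a minimal PyTorch sketch; the PatchGate name, the MLP scorer, and the keep_ratio knob are illustrative assumptions, not NVIDIA's actual code. It also shows why the speedups are plausible: self-attention cost grows quadratically with token count, so keeping 10 percent of patches cuts attention compute by up to roughly 100x, with smaller end-to-end gains because other stages still touch every frame.

```python
# Illustrative sketch of learned token pruning for video transformers.
# AutoGaze's real design is not public at this level of detail; the
# scorer, keep_ratio, and shapes below are assumptions for clarity.
import torch
import torch.nn as nn

class PatchGate(nn.Module):
    """Tiny scoring head: predicts a saliency score for each patch token."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        # Far smaller than the backbone it feeds, in the spirit of the
        # ~3M-parameter module described in the article.
        self.scorer = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tokens: torch.Tensor, keep_ratio: float = 0.1):
        # tokens: (batch, num_patches, dim) patch embeddings for a clip
        scores = self.scorer(tokens).squeeze(-1)          # (B, N)
        k = max(1, int(tokens.shape[1] * keep_ratio))     # e.g. keep 10%
        top = scores.topk(k, dim=1).indices               # most salient patches
        idx = top.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        kept = torch.gather(tokens, 1, idx)               # (B, k, dim)
        return kept, top

# Prune tokens before they hit the transformer backbone. With
# keep_ratio=0.1, self-attention (quadratic in token count) sees
# roughly 100x less work; end-to-end speedups are smaller because
# patch embedding and decoding still touch every frame.
gate = PatchGate(dim=768)
clip = torch.randn(2, 4096, 768)   # a long 4K clip yields many patch tokens
kept, indices = gate(clip, keep_ratio=0.1)
print(kept.shape)                  # torch.Size([2, 409, 768])
```

Because the gate only selects among tokens the host model already produces, it can sit in front of an existing vision transformer without retraining the backbone, which is what makes the plug-in claim credible.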

NVIDIA already shipped a model using it. NVILA-8B-HD-Video incorporates AutoGaze natively and handles long-form 4K video question answering significantly better than previous versions.

On a new benchmark called HLVid, which tests five-minute 4K videos with fine-detail questions:

  • NVILA-8B-HD-Video with AutoGaze: 52.6% accuracy
  • Improvement over baseline: +10.1% over the same model without AutoGaze
  • Improvement over previous best: +4.5% over the prior state of the art

The limitations are worth knowing. Heavy camera movement still causes problems, and pushing compression too far produces smearing artifacts. It is not perfect, but it removes a real bottleneck.

This is not the only video bet NVIDIA is making right now. The company is also partnering with Runway to build a model that generates HD video in real time, a story we covered recently.

The Bottom Line: AutoGaze will not make headlines the way a new language model does. But making AI video understanding up to 19x faster with a 3-million-parameter plugin is exactly the kind of unglamorous infrastructure work that actually moves the field forward.

Source: NVIDIA / UC Berkeley

RunPod

If you need on-demand GPUs for training, fine-tuning, inference, or running open-source models, give RunPod a try.

  • Available hardware: H100, H200, A100, L40S, RTX 4090, RTX 5090, and 30+ more
  • Cost: significantly cheaper than AWS or GCP, billed per second, no contracts
  • Setup: spins up in under a minute, 30+ regions worldwide
Try RunPod →
Affiliate disclosure: We may earn a commission if you sign up via our link, at no extra cost to you.
Efficienist Newsletter

Get the core business tech news delivered straight to your inbox. We track AI, automation, SaaS, and cybersecurity so you don't have to.

Just read what you want, and be done with it.
