Accelerating Action Recognition by Optimizing Python Video Pipelines

In October 2022, I was asked to join a project that used a specialized feature-extraction approach to build action recognition inference models. At the time, the team had a high-priority customer, won over by an early proof of concept. It was time to build a production inference pipeline, but the existing system relied on specialized hardware that made it impractical to parallelize across multiple processes.

I was brought in to build a system that could operate on multiple camera streams, each running its content through multiple models, then aggregate and publish the results to an MQTT broker, all in real time.

I started by developing a GStreamer plugin in Python so I could integrate with the custom feature-extraction hardware and process frames through the team's inference models. Along the way, I used the profiler py-spy to identify heavy code paths. Based on the results, I rewrote portions of the code to reduce the amount of video data that needed to be copied or transferred, and I replaced slow mathematics in hot loops with optimized extension modules or with versions that required fewer operations.
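To make those two kinds of change concrete, here is a minimal sketch. It is not the project's code: the function names, the tiny frame size, and the mean-brightness calculation are all hypothetical, chosen only for illustration, and NumPy stands in for the "optimized extension module". The first function wraps a raw RGB buffer as an array without copying it (in a real plugin the bytes would come from a mapped video buffer, which is assumed here and not shown). The other two compute the same value, once with a per-pixel Python loop and once by averaging each channel first, which leaves only three multiplications.

```python
import numpy as np

# Hypothetical luma weights (ITU-R BT.601), used only as an example computation.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def frame_view(raw, height, width):
    """Wrap a raw RGB byte buffer as an (H, W, 3) array without copying it."""
    return np.frombuffer(raw, dtype=np.uint8).reshape(height, width, 3)


def mean_luma_loop(frame):
    """Slow baseline: visit every pixel from Python."""
    total = 0.0
    for row in frame:
        for r, g, b in row:
            total += 0.299 * int(r) + 0.587 * int(g) + 0.114 * int(b)
    return total / (frame.shape[0] * frame.shape[1])


def mean_luma_fast(frame):
    """Same result with fewer operations.

    The mean of a weighted sum equals the weighted sum of the means, so
    average each channel in compiled code, then apply the three weights once.
    """
    channel_means = frame.reshape(-1, 3).mean(axis=0)
    return float(channel_means @ LUMA_WEIGHTS)


if __name__ == "__main__":
    raw = bytearray(range(24))      # stand-in for a 2x4 RGB frame
    frame = frame_view(raw, 2, 4)
    raw[0] = 200                    # the view sees the change: no copy was made
    print(frame[0, 0, 0], mean_luma_loop(frame), mean_luma_fast(frame))
```

On full-size frames the loop version spends nearly all of its time in the Python interpreter, which is the kind of hot path a sampling profiler such as py-spy makes visible.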

This enabled us to operate on multiple camera streams and easily iterate on different pipeline designs. The changes produced a tenfold improvement in single-camera frame rates and made it possible to operate on 15 streams simultaneously, for a 150-fold improvement in total throughput. I was recognized by the project sponsor “for the outstanding job of exploring various options for multi-camera/multi-model solutions, testing & adjusting through structured experiments, and then rapidly delivering a solution against an aggressive schedule. The recent […] approval is a testament to your thoroughness, attention to detail, and quality.”