Slashing Network Wait with Compressed Repo Snapshots

The company I work for has a lot of teams with a lot of code. For many years, teams managed their own repositories, but several years ago the company started an effort to move most of these to GitHub. The team responsible for the tooling and infrastructure around this move needed a way to integrate internal systems with GitHub’s permissions models, to let teams self-manage their repositories with minimal intervention, and to attach company-related metadata to these repositories.

And so they developed a special repository, known simply as “Inventory”, consisting of a directory tree with tens of thousands of tiny files written in a custom, YAML-inspired data format. Their idea was that they could leverage GitHub’s PR interface and events: when employees needed new repositories or changes to existing ones, they could submit a pull request, and automated tooling would seek the appropriate approvals, create or move repositories within organizations, and modify permissions in internal and GitHub systems.

The idea has its merits, but runs into problems at scale…

When I joined the Health, Education, and Consumer DevOps team, we needed the data contained in those files. The Inventory team provided an HTTP service, but its API supported only simple queries, so we were often stuck paginating through every single repository. The networking and data transfer overhead made this prohibitive, and relatively simple tasks started taking hours. That in turn made it difficult to prototype and test, slowing us down further.

Of course, the actual data we needed was all in the Inventory repository, and that was just a git clone away! But this “database” layout takes up hundreds of megabytes on disk, and reading and parsing all those files takes a long time. Since several of our tools needed this data, it wasn’t reasonable to build cloning and parsing into every one of them.

I recognized that the actual information content in Inventory is tiny, so my idea was to convert it into something cheaper to store and parse. And although parsing the individual files was the bottleneck, walking the tree, parsing each file, and converting its contents is embarrassingly parallel, so it was relatively easy to wrap the work in Python’s asyncio. My tool could rip through the dataset in seconds and compress it down to just 3% of its original size.
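
In outline, the conversion looked roughly like the sketch below. This is a minimal illustration rather than the real tool: the Inventory file format is internal, so `parse_inventory_file` is a hypothetical stub, and the `inventory/` and `inventory.json.gz` paths are made up. It just shows the general shape of the approach: fan the CPU-bound parsing out to a process pool from asyncio, collapse everything into one JSON object, and gzip the result.

```python
import asyncio
import gzip
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def parse_inventory_file(path: Path) -> dict:
    """Hypothetical stand-in for parsing the custom, YAML-inspired format.

    The real format is internal; here we just capture the raw text so the
    sketch runs end to end.
    """
    return {"path": str(path), "raw": path.read_text()}


async def build_snapshot(root: Path, out_path: Path) -> None:
    loop = asyncio.get_running_loop()
    files = [p for p in root.rglob("*") if p.is_file()]

    # Parsing is CPU-bound and embarrassingly parallel, so fan it out to a
    # process pool and await all of the results from the event loop.
    with ProcessPoolExecutor() as pool:
        records = await asyncio.gather(
            *(loop.run_in_executor(pool, parse_inventory_file, p) for p in files)
        )

    # Collapse the whole tree into a single JSON object keyed by file path,
    # then gzip it; the compressed snapshot is a small fraction of the tree.
    snapshot = {record["path"]: record for record in records}
    with gzip.open(out_path, "wt", encoding="utf-8") as fh:
        json.dump(snapshot, fh)


if __name__ == "__main__":
    asyncio.run(build_snapshot(Path("inventory"), Path("inventory.json.gz")))
```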

I set up automation to convert the entire dataset to a single JSON object and upload daily snapshots to a central Artifactory instance. This made the snapshots available both to our tools and to those of other teams at the company. Such a file is easy to “query” locally with jq or with the JSON tooling available in most modern languages. Processes that used to take hours now needed only to download a single, small file, which they could parse in seconds with whatever made sense for the task. We were also able to use these snapshots to prototype and test new and existing tools, greatly increasing the team’s velocity.
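
As an example of what “querying” a snapshot can look like, here is a small, hypothetical consumer. The path prefix is illustrative (the real schema is internal), but the pattern is simply: load one gzipped JSON file and filter it locally instead of paginating through an HTTP API.

```python
import gzip
import json

# Load a daily snapshot (already downloaded from Artifactory) and answer a
# question locally instead of calling a remote service.
with gzip.open("inventory.json.gz", "rt", encoding="utf-8") as fh:
    inventory = json.load(fh)

# Illustrative "query": count entries under a hypothetical path prefix.
matches = [
    record
    for path, record in inventory.items()
    if path.startswith("orgs/health/")
]
print(f"{len(matches)} matching repositories")
```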