We have a git repo that uses git-lfs. We had a scare where we realized the repo was much bigger than the files in it, and concluded that something large was not in LFS. In fact the problem was just that the LFS cache was big.

For a minute there, I was considering writing a script that checked every file's LFS status and reported the largest file not in LFS, and maybe the file extension that contributes most to non-LFS repo weight. But now I wonder: does a script like that exist already?
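The core of such a script is just bookkeeping. Here's a sketch of the aggregation step, assuming you've already collected working-tree file sizes (e.g. from `git ls-files` plus `os.stat`) and the set of LFS-tracked paths (e.g. from `git lfs ls-files --name-only`); the function name and its input shapes are my own:

```python
import os
from collections import Counter

def summarize_non_lfs(file_sizes, lfs_paths):
    """file_sizes: dict of path -> size in bytes for every tracked file.
    lfs_paths: set of paths that are tracked by git-lfs.
    Returns (largest non-LFS file, Counter of bytes per extension)."""
    non_lfs = {p: s for p, s in file_sizes.items() if p not in lfs_paths}
    if not non_lfs:
        return None, Counter()
    largest = max(non_lfs, key=non_lfs.get)
    by_ext = Counter()
    for path, size in non_lfs.items():
        ext = os.path.splitext(path)[1] or "(none)"
        by_ext[ext] += size
    return largest, by_ext
```

`by_ext.most_common(1)` then gives you the extension carrying the most non-LFS weight.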

It seems like this is seconds to write, but then the part that gets a little more complicated is that you probably also want to look through the git *history*, to spot whether there's, like, a 2 GB file that predates adding LFS to the repo, which you've been lugging around the whole time. Off the top of my head, I'm not sure I know how to scrape the entire history like that. You could check out each commit in turn, but theoretically git offers slightly more powerful tools.
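One of those more powerful tools is `git rev-list --objects --all` piped into `git cat-file --batch-check`, which enumerates every blob ever committed without checking anything out. A sketch that parses that output (the function name is my own):

```python
def largest_history_blobs(batch_check_lines, top=10):
    """Find the largest blobs anywhere in history.

    batch_check_lines: lines of the form '<type> <oid> <size> <path>',
    as produced by:
        git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'
    Returns the `top` largest blobs as (size, oid, path) tuples."""
    blobs = []
    for line in batch_check_lines:
        parts = line.split(maxsplit=3)
        if len(parts) >= 3 and parts[0] == "blob":
            size = int(parts[2])
            path = parts[3] if len(parts) == 4 else ""
            blobs.append((size, parts[1], path))
    blobs.sort(reverse=True)
    return blobs[:top]
```

A pre-LFS 2 GB file shows up at the top of this list even if no current commit contains it.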

@mcc Using libgit2 (which has bindings for many languages) or gix (Rust), I made a graph of cpython's history: for each commit, the sum of the uncompressed sizes of its blobs, recursively, plotted against the commit date.

This avoids being extremely slow by caching the result for each tree object, so each commit only walks trees it hasn't seen before.
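That caching works because trees are content-addressed: an unchanged subtree has the same oid in every commit that contains it. A minimal sketch of the memoized walk, independent of any particular bindings (the `get_object` adapter shape is my own; with pygit2 or libgit2 you'd wrap object lookups to return it):

```python
def total_blob_size(oid, get_object, cache):
    """Sum uncompressed blob sizes under an object, memoized per oid.

    get_object(oid) -> ("blob", size_in_bytes)
                    or ("tree", [child_oid, ...])   # hypothetical adapter
    cache: dict shared across all commits, so identical subtrees
    are only ever walked once."""
    if oid in cache:
        return cache[oid]
    kind, payload = get_object(oid)
    if kind == "blob":
        total = payload
    else:
        total = sum(total_blob_size(child, get_object, cache) for child in payload)
    cache[oid] = total
    return total
```

Calling this on each commit's root tree (sorted by commit date) gives the data points for the graph, and the shared cache is what keeps it from being O(commits × files).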