[{"data":1,"prerenderedAt":367},["ShallowReactive",2],{"NoscriptNav_XrRK2e2e8meJ0jKVGkb5ULGQDVi3UiFQ9nupAr7Yns":3,"\u002Fideas\u002Frevisiting-gitballs":8},["Island",4],{"key":5,"result":6},"NoscriptNav_XrRK2e2e8meJ0jKVGkb5ULGQDVi3UiFQ9nupAr7Yns",{"head":7},{},{"id":9,"title":10,"authors":11,"body":13,"canonicalUrl":353,"canonicalWebsiteName":354,"category":355,"date":356,"description":357,"extension":358,"featured":359,"fullWidthLayout":359,"image":360,"imageAlt":360,"location":360,"meta":361,"metaImage":360,"navigation":362,"path":363,"seo":364,"stem":365,"venue":360,"venueUrl":360,"__hash__":366},"ideas\u002Fideas\u002Frevisiting-gitballs.md","Revisiting Gitballs",[12],"andrew",{"type":14,"value":15,"toc":346},"minimark",[16,28,40,43,156,159,210,213,218,239,248,251,272,275,279,282,285,288,292,295,335,338],[17,18,19,20,27],"p",{},"Nine years ago I made a small experiment called ",[21,22,26],"a",{"href":23,"rel":24},"https:\u002F\u002Fgithub.com\u002Fandrew\u002Fgitballs",[25],"nofollow","Gitballs",". Package registries store every release as a complete tarball, but most releases are just a few lines changed from the previous version. Git is good at storing diffs efficiently. What if you committed each release to a git repo and let git's delta compression do the work?",[17,29,30,31,35,36,39],{},"The script downloads every release of a package, extracts each one, commits it to a git repo in version order, then runs ",[32,33,34],"code",{},"git gc --aggressive",". The result is a single ",[32,37,38],{},".git"," folder containing every release.",[17,41,42],{},"The space savings were significant for packages with many releases:",[44,45,46,68],"table",{},[47,48,49],"thead",{},[50,51,52,56,59,62,65],"tr",{},[53,54,55],"th",{},"package",[53,57,58],{},"releases",[53,60,61],{},"tarball size",[53,63,64],{},"gitball size",[53,66,67],{},"saving",[69,70,71,89,106,122,139],"tbody",{},[50,72,73,77,80,83,86],{},[74,75,76],"td",{},"rails",[74,78,79],{},"288",[74,81,82],{},"159M",[74,84,85],{},"7.4M",[74,87,88],{},"95%",[50,90,91,94,97,100,103],{},[74,92,93],{},"sass",[74,95,96],{},"309",[74,98,99],{},"74M",[74,101,102],{},"2.0M",[74,104,105],{},"97%",[50,107,108,111,114,117,120],{},[74,109,110],{},"bundler",[74,112,113],{},"225",[74,115,116],{},"42M",[74,118,119],{},"1.9M",[74,121,88],{},[50,123,124,127,130,133,136],{},[74,125,126],{},"lodash",[74,128,129],{},"88",[74,131,132],{},"79M",[74,134,135],{},"8.1M",[74,137,138],{},"90%",[50,140,141,144,147,150,153],{},[74,142,143],{},"nokogiri",[74,145,146],{},"94",[74,148,149],{},"275M",[74,151,152],{},"33M",[74,154,155],{},"88%",[17,157,158],{},"But for packages with few releases, the git overhead made things worse:",[44,160,161,175],{},[47,162,163],{},[50,164,165,167,169,171,173],{},[53,166,55],{},[53,168,58],{},[53,170,61],{},[53,172,64],{},[53,174,67],{},[69,176,177,194],{},[50,178,179,182,185,188,191],{},[74,180,181],{},"left-pad",[74,183,184],{},"11",[74,186,187],{},"52K",[74,189,190],{},"348K",[74,192,193],{},"-569%",[50,195,196,199,202,204,207],{},[74,197,198],{},"i18n-active_record",[74,200,201],{},"4",[74,203,187],{},[74,205,206],{},"360K",[74,208,209],{},"-590%",[17,211,212],{},"It was an afternoon experiment. Life got busy and I forgot about it.",[214,215,217],"h2",{"id":216},"why-im-thinking-about-it-again","Why I'm thinking about it again",[17,219,220,221,226,227,232,233,238],{},"Last week I was in Paris for a ",[21,222,225],{"href":223,"rel":224},"https:\u002F\u002Fgithub.com\u002Fcodemeta\u002Fcodemeta\u002Fdiscussions\u002F445",[25],"CodeMeta unconference"," hosted by ",[21,228,231],{"href":229,"rel":230},"https:\u002F\u002Fwww.softwareheritage.org\u002F",[25],"Software Heritage",". I got to meet Roberto Di Cosmo and Stefano Zacchiroli and talk about integration points with ",[21,234,237],{"href":235,"rel":236},"https:\u002F\u002Fecosyste.ms",[25],"ecosyste.ms",".",[17,240,241,242,247],{},"Software Heritage archives all publicly available source code using ",[21,243,246],{"href":244,"rel":245},"https:\u002F\u002Fdocs.softwareheritage.org\u002Fdevel\u002Fswh-model\u002Fpersistent-identifiers.html",[25],"SWHIDs"," (Software Heritage Identifiers), content-addressed identifiers where two identical files always have the same SWHID regardless of where they're stored.",[17,249,250],{},"That's the same principle gitballs was exploring. Git stores snapshots at each commit (much like package releases), but the packfile format finds similar objects, computes deltas between them, and compresses everything together. Every blob, tree, and commit is identified by its SHA hash, so identical content is automatically deduplicated.",[17,252,253,254,259,260,265,266,271],{},"I've been writing a ",[21,255,258],{"href":256,"rel":257},"https:\u002F\u002Fgithub.com\u002Fandrew\u002Fswhid",[25],"Ruby gem for generating SWHIDs",", partly to learn the standard, partly because I'm hoping to ",[21,261,264],{"href":262,"rel":263},"https:\u002F\u002Fgithub.com\u002Fecosyste-ms\u002Fpackages\u002Fissues\u002F1206",[25],"generate SWHIDs for every version of every package"," in ",[21,267,270],{"href":268,"rel":269},"https:\u002F\u002Fpackages.ecosyste.ms",[25],"packages.ecosyste.ms"," at some point. Working on that got me thinking about gitballs again, because if you're computing content hashes for millions of package releases anyway, you're most of the way to a deduplication scheme.",[17,273,274],{},"The same principle shows up in Nix and Guix, which use content-addressed stores for reproducible builds. And pnpm, which deduplicates packages across projects by storing them in a content-addressed cache.",[214,276,278],{"id":277},"still-relevant","Still relevant?",[17,280,281],{},"I haven't re-run the gitballs numbers yet. The original data is from 2016, and packages like Rails have had hundreds more releases since then. It would be interesting to see if the compression ratios still hold, or if modern packages (with more dependencies, more generated files) compress differently.",[17,283,284],{},"Putting every version of every package in a single git repo probably isn't practical. The write path is slow and you'd need to handle concurrent writes. But the experiment did show that sequential releases of the same package compress well, and identical files across packages (MIT-LICENSE, .gitignore, tsconfig.json) would dedupe automatically with content-addressing.",[17,286,287],{},"What if you focused on the top 1% of packages that make up 99% of downloads? Most registry bandwidth goes to a small number of popular packages. Deduplicating just those might get you most of the savings without the complexity of handling the long tail. Managing packfiles across hundreds of millions of releases globally would be expensive, but a targeted approach might be practical.",[214,289,291],{"id":290},"related-ideas","Related ideas",[17,293,294],{},"If you're interested in content-addressed storage for packages:",[296,297,298,305,319,327],"ul",{},[299,300,301,304],"li",{},[21,302,231],{"href":229,"rel":303},[25]," archives source code with content-addressed identifiers",[299,306,307,312,313,318],{},[21,308,311],{"href":309,"rel":310},"https:\u002F\u002Fnixos.org\u002F",[25],"Nix"," and ",[21,314,317],{"href":315,"rel":316},"https:\u002F\u002Fguix.gnu.org\u002F",[25],"Guix"," use content-addressed stores for reproducible builds",[299,320,321,326],{},[21,322,325],{"href":323,"rel":324},"https:\u002F\u002Fpnpm.io\u002F",[25],"pnpm"," deduplicates node_modules across projects",[299,328,329,334],{},[21,330,333],{"href":331,"rel":332},"https:\u002F\u002Fgithub.com\u002Fopencontainers\u002Fdistribution-spec",[25],"OCI registries"," use content-addressed layers for container images",[17,336,337],{},"SWHIDs can actually encompass git repos and their history. When Software Heritage ingests a git repository, the content hashes for blobs and trees match git's SHAs. SWHIDs add a type prefix and can reference things git can't (like snapshots of entire repositories), but they're built on the same foundations.",[17,339,340,341,345],{},"The ",[21,342,344],{"href":23,"rel":343},[25],"gitballs code"," is still on GitHub if you want to try it yourself.",{"title":347,"searchDepth":348,"depth":348,"links":349},"",2,[350,351,352],{"id":216,"depth":348,"text":217},{"id":277,"depth":348,"text":278},{"id":290,"depth":348,"text":291},"https:\u002F\u002Fnesbitt.io\u002F\u002F2025\u002F11\u002F28\u002Frevisiting-gitballs","nesbitt.io","tooling","2025-11-28","Nine years ago I experimented with storing package tarballs as git objects. A visit to Software Heritage got me thinking about it again.","md",false,null,{},true,"\u002Fideas\u002Frevisiting-gitballs",{"title":10,"description":357},"ideas\u002Frevisiting-gitballs","S_s8SATWtjYrD36KqF9kq7P6WKcUyc5y9LKZ5rtSHsE",1780596105062]