Just enough minimally invasive order has now been brought to the Geocities files that it is possible to serve them via a proxy server that even imitates the original URLs completely. Using this proxy, you will be able to click on any historic Geocities URL and experience it in your browser, which ideally is a historic browser as well.1
All technical measures applied to the data are presented on GitHub in the form of annotated programs written in bash, Perl, SQL and Python. The comments inside the scripts explain problems, considerations, compromises, decisions and technical solutions, trying to match the ideal of software being its own executable documentation.
You are welcome to evaluate each step and use this information to make your own Geocities proxy server, or use the developed techniques to revive other dead web sites.
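To give a rough idea of what "imitating the original URLs" can mean in practice, here is a minimal sketch in Python. It is not taken from the scripts on GitHub. The archive root, the one-directory-per-host layout, the index.html default and the function name local_path_for are all assumptions made for illustration. The sketch relies on the fact that a browser configured to use an HTTP proxy sends the full original URL with each request, so the proxy only has to translate that URL into a location inside the restored files.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import unquote, urlsplit

# Hypothetical location of the restored files (an assumption for this sketch).
ARCHIVE_ROOT = PurePosixPath("/archive/geocities")


def local_path_for(url: str) -> Optional[PurePosixPath]:
    """Map a historic URL to a path under ARCHIVE_ROOT.

    Returns None for hosts outside geocities.com and for paths that
    try to climb out of the archive with "..".
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host != "geocities.com" and not host.endswith(".geocities.com"):
        return None

    segments = []
    for segment in unquote(parts.path).split("/"):
        if segment in ("", "."):
            continue  # skip empty segments and no-op "." segments
        if segment == "..":
            return None  # refuse path traversal
        segments.append(segment)

    # Assumed default document for directory-style URLs.
    if parts.path.endswith("/") or not segments:
        segments.append("index.html")

    return ARCHIVE_ROOT.joinpath(host, *segments)


if __name__ == "__main__":
    print(local_path_for("http://www.geocities.com/Heartland/Plains/1234/"))
```

A real proxy has to deal with much more than this, which is exactly what the comments in the published scripts cover. The sketch only shows the basic shape of the URL-to-file translation.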
There is a small discussion about this graphic over at Reddit. You’re welcome to compare this treemap with the first published treemap, in which problematic areas are encircled.
The overall number of files has come down from 36 million to 28 million.
Next up is re-packaging the Geocities files and database contents for a cleaned-up distribution on the Internet Archive.