Tree Maps are a practical way to visualize large file systems, and a very useful tool for data archeologists, archivists and restorers.

The Free Software tool gdmap draws tree maps, and running it on the Geocities Torrent files is useful for identifying problems within the collection. It produces images like these:

Edit: See screenshot of the color coding legend. Please excuse the German interface.

This is just the main server. On mouse-hover one can get path information for each rectangle, which helps to decide which parts of the files still need work:

The large shaded rectangles represent the neighborhoods; inside them are the sub-neighborhoods. Even before fixing the various apparent problems it is clear that HeartLand dwarfs them all. The Vanity Profiles on the right edge are per-user accounts created after Yahoo! bought Geocities and neighborhoods were discontinued. Some of the largest vanity profiles seem to rival whole sub-neighborhoods in size.

Instances of Symlink Cancer (marked red) show up as big blobs of directories with hardly any files in them. These are cases where Yahoo! deeply screwed up their file hosting and URL forwarding, making wget, the Archive Team’s harvesting software, run in circles and create almost endlessly nested directories. Of course this cancer distorts the graphics quite a bit, so it needs to go!
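For the curious, here is a minimal sketch of how such runaway directories could be flagged. It is not the cleanup procedure actually used on the torrent, just a heuristic I am assuming for illustration: a path in which the same component name keeps coming back is probably wget running in circles. The function names and the threshold `max_repeats` are made up for this example.

```python
import os
from collections import Counter
from pathlib import PurePosixPath


def looks_like_symlink_loop(path, max_repeats=3):
    """Heuristic: True if one path component occurs more than max_repeats times."""
    counts = Counter(PurePosixPath(path).parts)
    return any(n > max_repeats for n in counts.values())


def find_suspects(root, max_repeats=3):
    """Yield directories below root whose relative path trips the heuristic."""
    for dirpath, dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root).replace(os.sep, "/")
        if looks_like_symlink_loop(rel, max_repeats):
            dirnames.clear()  # do not descend any further into the loop
            yield dirpath
```

A hypothetical path such as `HeartLand/a/b/a/b/a/b/a/b` would be flagged, while an ordinary page path would not. The threshold is a guess and would need tuning against real data, since some legitimate sites do reuse directory names.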

Missing mergers (marked blue) indicate huge case-related double-entries, for example “heartland” still has to be merged into “HeartLand” and “soho” into “SoHo”.

Apparently finding case-related redundancies will require another huge round of processing time.
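Finding the candidates for such mergers is the easy part. The sketch below groups directory names that differ only in upper and lower case; it is an illustration under my own assumptions (the function name `case_collisions` is invented here), and it does not attempt the actual merging, which is where the processing time goes.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that are identical when case is ignored.

    Returns a dict mapping the case-folded name to the sorted list of
    spellings found, but only for names with more than one spelling.
    """
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }
```

Fed the directory listing of the main server, for example via `os.listdir`, this would report pairs such as “heartland” and “HeartLand” side by side, leaving the decision about which spelling is canonical to a human.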

The artwork The Deleted City by Richard Vijgen [2] uses treemap visualization to show the contents of the Geocities Torrent as a cityscape.

Regular readers of this blog will immediately notice that there was officially no neighborhood named “athens”, only one called “Athens”. However, “EnchantedForest” and “FashionAvenue” seem alright. Without having seen the piece in action personally, I am guessing Deleted City does not represent Geocities very accurately. Richard, do you need a cleaned data set? Contact me. :)

  1. With a dataset that large the image takes several hours to compile.
  2. Already mentioned here before.

