In its current state (62.59% downloaded) the Geocities Torrent already provides us with tens of thousands of historic home pages. How do we find the interesting, typical or exceptional ones in such a huge amount of data?
Our private web server generates index files listing everything, but they are almost impossible to navigate. Not only do they take a very long time to generate on each access (a pain that can be softened a bit), but the browser rendering them can also choke on 50,000+ bullet items and links.
Picking out the classic neighborhoods is a good strategy for finding a starting point and related styles and content. Many neighborhoods were also held together via web rings; however, most of these are now defunct because they relied on centralized software that went offline many years ago. The problem is that the neighborhoods represent only a small part of Geocities: the program was discontinued as early as 1999, when Yahoo! bought the service.
Good old web surfing still works very well. When we find a nice home page, we of course follow the links placed manually by the webmasters to explore our cultural terabyte, sometimes even reaching places outside the archive that are still alive. Dead ends are unlikely this way. Still, it leaves open the question of where to start exploring.
We decided that picking home pages at random would give us a very good sample. Starting at an arbitrary location gives obscure home pages a chance to be discovered and provides a nice overview of what could be representative. The first finding is that de.geocities.com (the German branch) contains huge numbers of link-spamming sites, apparently set up shortly before the end of the service.
To facilitate random surfing, cd into the “geocities” directory of the decrunched torrent and run:
$ find . -iregex ".*geocities.*/.*/index.html?" > indexes.txt
This will take some time and save a list of all index files. You will have to re-run this command each time new data from the torrent is decrunched. Go back to the root of your web server and create this Perl script:
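#!/usr/bin/perl
use strict;
use warnings;

# Load the list of index files generated by find.
my @indexes;
open(my $list, '<', 'geocities/indexes.txt')
    or die "Cannot open geocities/indexes.txt: $!";
while (<$list>) {
    chomp;
    push @indexes, $_;
}
close $list;

# Pick one entry at random; substr drops the leading "." of "./path".
my $file = $indexes[int(rand(@indexes))];
my $URL = 'http://localhost/geocities' . substr($file, 1);

# CGI response: redirect headers, a blank line, then a short body.
print qq(Location: $URL
Status: 302 Bounce
Content-Type: text/html

Bounced to $URL
);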
Access the script from your browser to be forwarded to a random Geocities site. I use this script as my browser’s start page – great fun!
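For a quick check without the web server, the same list can also be sampled from the shell. The following is only a sketch and not part of our setup: the function names are made up for illustration, `shuf` assumes GNU coreutils, and the filter assumes that the spam-heavy German branch lives in a directory whose name contains `de.geocities.com`.

```shell
#!/bin/sh
# Hypothetical helpers for working with the list written by find.
# Assumes GNU coreutils (shuf) and entries of the form "./host/.../index.html".

# count_indexes LIST
# Prints the number of entries in LIST.
count_indexes() {
    wc -l < "$1" | tr -d ' '
}

# pick_random_index LIST [BASE_URL]
# Prints the URL of one randomly chosen entry.
pick_random_index() {
    base=${2:-http://localhost/geocities}
    path=$(shuf -n 1 "$1") || return 1
    # An empty list yields no path; report failure instead of a bare base URL.
    [ -n "$path" ] || return 1
    # Drop the leading "." so "./host/page" becomes "/host/page".
    printf '%s%s\n' "$base" "${path#.}"
}

# filter_indexes LIST PATTERN
# Prints every entry that does not contain PATTERN (matched as a fixed string).
filter_indexes() {
    grep -v -F -- "$2" "$1"
}
```

With these functions loaded in the “geocities” directory, `pick_random_index indexes.txt` prints one random URL, and `filter_indexes indexes.txt de.geocities.com > indexes-filtered.txt` writes a list without the German branch (the file name is just an example). Note that this crude filter also throws away the legitimate German home pages.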