As mentioned in Olia’s A Vernacular Web, people writing in the Cyrillic alphabet had lots of fun before Unicode became common. Every operating system sported multiple ways of mixing Cyrillic and ASCII letters into one 8-bit encoding scheme.

Many Russian web sites had to offer their text in several encodings to accommodate all possible visitors, and a strip of buttons offering these encodings could be found at the top of many pages. Collections of such button graphics in different styles became part of web folklore.1

And of course many Geocities users wrote in the Cyrillic alphabet. Vladimir Sitnik from Artemovsky created a site for his local Karate Dojo. A typical link to a photo on his website would look like this:

pict_e.html?RU?images/040502b.jpg?Наша%20команда?с%20наградами%20из%20Одессы%202004

Naturally, during the Archive Team’s backup, this confused wget quite a lot and it created many files looking like this:

Since most current Linux systems use Unicode for file names, the 8-bit encoded Cyrillic letters are illegal and printed as question marks. The HTML code of any of the files in the directory reveals the encoding used by Vladimir Sitnik:

<meta http-equiv="Content-Type" content="text/html; Charset=Windows-1251">

Using the handy tool convmv with the codepage cp1251, and switching the terminal to the same encoding for an especially nasty case, I managed to restore the file names:
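The core of such a rename is reinterpreting the bytes of each file name in the right codepage. Here is a minimal Python sketch of that idea, assuming the names were stored as raw Windows-1251 bytes. The function names restore_name and rename_tree are made up for illustration, and this is not how convmv itself is implemented.

```python
import os


def restore_name(raw_name: bytes, codepage: str = "cp1251") -> str:
    """Decode a file name stored as 8-bit bytes into a Unicode string."""
    return raw_name.decode(codepage)


def rename_tree(root: bytes, codepage: str = "cp1251", dry_run: bool = True):
    """Yield (old, new) byte paths for file names that are not valid UTF-8.

    With dry_run=False the files are actually renamed.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                name.decode("utf-8")  # already readable, leave it alone
                continue
            except UnicodeDecodeError:
                pass
            new_name = restore_name(name, codepage).encode("utf-8")
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, new_name)
            if not dry_run:
                os.rename(old_path, new_path)
            yield old_path, new_path
```

Passing the root directory as bytes makes os.walk return byte strings, so the undecodable names can be inspected before anything is renamed.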

Only after completing this tedious operation did I check and find that each of the pict_e.html file variations was actually the same, apart from some minor tracking code inserted by the Geocities server, so there is no reason to really keep more than one.2 They were all generated because Vladimir Sitnik successfully circumvented the fact that Geocities did not offer any kind of server-side scripting or usable templating mechanism. He did not want to create hundreds of pages with just a different photo on each, so he had the quite exotic idea of using HTTP GET parameters to create photo pages via Javascript.

In his links, the parameters are separated by question marks: interface language, file name of the photo, caption lines. pict_e.html then uses the Javascript instruction document.writeln to create the actual HTML. So that is a minimal content management system on a free service that does not offer content management systems. Almost AJAX, but Netscape 4 compatible!
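To make the link format concrete, here is a small Python sketch that takes one of these links apart and builds a page fragment from it. The splitting follows the description above. The generated markup and the function names are invented for illustration, since the actual output of pict_e.html is not reproduced here.

```python
from html import escape
from urllib.parse import unquote


def parse_photo_link(link: str) -> dict:
    """Split 'page?LANG?image?caption line?caption line...' into its parts."""
    page, language, image, *caption = link.split("?")
    return {
        "page": page,
        "language": language,
        "image": image,
        "caption": [unquote(line) for line in caption],
    }


def render_photo_page(params: dict) -> str:
    """Build a small HTML fragment; the markup itself is an assumption."""
    caption = "<br>".join(escape(line) for line in params["caption"])
    return f'<img src="{escape(params["image"])}"><p>{caption}</p>'
```

Applied to the link shown earlier, parse_photo_link yields the language code RU, the image path images/040502b.jpg and two caption lines.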

Considering this page was made by an “amateur”, created in 2000 and last updated in 2006, it already shows that much better scraping tools than wget are needed3, even when it comes to pages that give a very modest impression to “professionals”. Users have been very inventive in all aspects of web authoring: they did not care for any specification or standard, they went ahead and pushed the borders, trying all the new options presented to them in the browser wars.

Original URL: http://www.geocities.com/soho/Museum/1134/ph4tu_r.html


  1. All examples lifted from A Vernacular Web.
  2. Except for studying wget behavior or Geocities’ random number generator.
  3. Maybe something based on PhantomJS?

Tree Maps are a practical way to visualize large file systems, and a very useful tool for data archeologists, archivists and restorers.

The Free Software tool gdmap draws tree maps, and running it on the Geocities Torrent files is useful for identifying problems within the collection. It produces images like these:

Edit: See screenshot of the color coding legend. Please excuse the German interface.

This is just the main server, www.geocities.com.1 On mouse-hover one can get path information for each rectangle, helping to decide which part of the files still needs work:

The large shaded rectangles represent the neighborhoods, inside them are the sub-neighborhoods, and even before fixing the various apparent problems it is clear that HeartLand dwarfs them all. Vanity Profiles on the right edge are per-user accounts created after Yahoo! bought Geocities and neighborhoods were discontinued. Some of the largest vanity profiles seem to challenge whole sub-neighborhoods in size.

Instances of Symlink Cancer (marked red) show up as big blobs of directories with hardly any files in them. These are cases where Yahoo! deeply screwed up their file hosting and URL forwarding, making the Archive Team’s harvesting software wget run in circles and create almost infinitely nested directories. Of course this cancer distorts the graphics quite a bit, so I need to get rid of it!

Missing mergers (marked blue) indicate huge case-related double-entries, for example “heartland” still has to be merged into “HeartLand” and “soho” into “SoHo”.

Apparently finding case-related redundancies will require another huge round of processing time.
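Spotting such case-related doubles within a single directory is simple in principle; the expensive part is doing it across millions of paths. A minimal Python sketch of the detection step follows, with the function name case_collisions chosen only for this example:

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that are identical except for upper/lower case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    # Only keys with more than one spelling are candidates for merging.
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}
```

Fed with a directory listing, for example the output of os.listdir on the www.geocities.com directory, it returns only the names that occur in more than one spelling.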

The artwork The Deleted City by Richard Vijgen2 uses treemap visualization to show the contents of the Geocities Torrent as a cityscape.

Regular readers of this blog will immediately notice that there was officially no neighborhood named “athens”, only one called “Athens”. However, “EnchantedForest” and “FashionAvenue” seem alright. Without having seen the piece in action personally, I am guessing that Deleted City does not represent Geocities very accurately. Richard, do you need a cleaned data set? Contact me. :)


  1. With a dataset that large the image takes several hours to compile.
  2. Already mentioned here before.

As has already become clear, the contents of the torrent cannot be considered an archive, but rather the result of a distributed web harvesting effort.1 Many of the Torrent’s files are redundant or conflicting, because of uppercase/lowercase issues, different URLs pointing to the same file, etc.

Before these files can be put together again and served via HTTP2, decisions need to be made about which of these files should be served when a certain URL is requested. For most URLs this decision is easy: a request for

http://www.geocities.com/EnchantedForest/3151/index.html

should probably be answered with the corresponding file in the directory

www.geocities.com/EnchantedForest/3151/index.html.

However this file exists three times:

  • www.geocities.com/EnchantedForest/3151/index.html,
  • geocities.com/EnchantedForest/3151/index.html (without the “www”) and
  • YAHOOIDS/E/n/EnchantedForest/3151/index.html.

And all of them differ in contents!

While this particular case is easy (the only difference between these files is a tracking code inserted by Yahoo!, randomly generated for every visitor of the page), in general each conflict has to be examined, especially when the last-modified dates differ.

Many conflicts might be resolved automatically, however the first step was to identify conflicts and remove redundancies from the file system. This means actually changing the source material: the files contained in the Torrent and how they are organized. Ethically this is problematic, practically I do not see another way than reducing the amount of decisions to be made for a reconstruction.
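The distinction between a redundancy and a conflict can be sketched in a few lines of Python: copies with identical contents are redundant, copies that differ are a conflict. This is only an illustration of the idea, not a port of merge_directories.pl, and the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hash a file's contents in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify_copies(paths):
    """Return (verdict, groups) for the candidate copies of one URL.

    verdict is 'missing', 'unique', 'redundant' or 'conflict';
    groups maps each content digest to the paths sharing it.
    """
    groups = defaultdict(list)
    for path in map(Path, paths):
        if path.is_file():
            groups[file_digest(path)].append(path)
    copies = sum(len(found) for found in groups.values())
    if copies == 0:
        verdict = "missing"
    elif copies == 1:
        verdict = "unique"
    elif len(groups) == 1:
        verdict = "redundant"
    else:
        verdict = "conflict"
    return verdict, dict(groups)
```

A byte-wise comparison like this would report the three index.html copies above as a conflict, because the tracking code differs; recognizing them as equivalent needs an extra normalization step before hashing.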

To make the alteration more acceptable I publish all tools that I created for handling the torrent. They can be found on github.3 merge_directories.pl is the script that did most of the conflict identifying work. The directory with scripts created for handling the Torrent contains a list of scripts that should be run in order after downloading to have the data normalized. Running them all in a row took me 29 hours and 47 minutes.

The original Geocities Torrent contains 36’238’442 files. My scripts managed to identify 3’198’141 redundant files, a quite impressive 8.83%. There are still 768’884 conflicts, 475 of them involving triple files. And this is even before taking into account case-related redundancies or conflicts, which I hope to solve with the help of a database.

Ingesting is next.


  1. See Tips for Torrenters, Cleaning Up The Torrent.
  2. See Authenticity/Access.
  3. The simple ones are Unix shell scripts; most Linux distributions come with a rich set of tools for file handling. More complex scripts are written in Perl. I hold the opinion that anybody involved in data preservation and digital archeology should be knowledgeable in Perl for mangling large amounts of data and interfacing to everything ever invented.

Found this picture on my desktop. I can’t remember what profile it comes from. Maybe it is not even from Geocities. And I’m not sure if it was planned for this blog or for car metaphors.

Why were animated GIFs rarely used (outside of porn sites) to show film and video sequences in the second half of the 90s? Because even half a second of heavily compressed and downscaled, barely recognizable footage would still be too heavy for the slow network of that time.

In 1998, Shocking Blue fan Greg converted one second of a TV performance of the group’s hit song “Venus” into a GIF. It is 160×120 pixels, contains 15 frames and weighs 93KB. Greg didn’t dare to confront the visitors of the page with such a huge file.1 He uses a static image and suggests starting to load the animation with a click, but being prepared that “it may take 20 sec”.

Starting image

Animated GIF

The moment in the original video

Original URL: http://www.geocities.com/ofmang/greg/shockblu.html


  1. Just for comparison, the third picture in the Alternative Animated GIF Timeline, a video loop as common in GIFs today, is 461×322 pixels, 2MB, 42 frames.


Original URL: http://www.geocities.com/vienna/4302/


Original URL: http://www.geocities.com/~johanh/

The page is still on(VRM)line.

Access to the remains of Geocities can be measured on two axes: authenticity (how realistically the harvested data can be presented again) and ease of access (what technical requirements and what knowledge are needed to gain access on a certain level of authenticity).

Graphical authenticity on the pixel level means that a Geocities web page will render exactly as it would have to visitors at the time the page was published. This might be considered esoteric on the grounds that web authors could never be sure how their creations would appear to their audience, by the nature of HTML itself. However, web authors often were not aware of this fact, and people of the 1990s used different browsers on different operating systems than we do today. The dominance of, for example, Netscape browsers and a certain set of plugins meant that most people would experience a web page in a certain way that was very different from accessing it now with WebKit-based browsers.

What has changed most significantly in operating systems since the high times of Geocities is the display of text. All current operating systems render characters with smoothed out edges, and this is reflected as much in current web design as the historic aliased pixel text display influenced the web design of the past.

Low barrier access to original visual web culture can be provided by screenshots taken from virtual machines running an historic operating system and browser.

Today, MIDI music files do not sound at all as they used to when they dominated web audio. These audio files can be recorded from historic hardware and operating systems or rendered with emulators for contemporary listeners. This will strip them of any context, but can still give a good, easily accessible impression of how the web sounded at a certain point in time.1

The web pages’ original interactivity can be restored by accessing them from a mirror and employing a browser addon that re-writes the original URLs contained in the HTML to match the mirror’s URLs.2

If the data should not be tampered with, neither on the client nor on the mirror server, and URLs shown in the browser’s address bar should be the exact originals, a proxy server must be put in front of the mirror server. This proxy can transparently redirect requests for, say, http://www.geocities.com/ to http://mirror.local/www.geocities.com/ without the need for any name server tricks or changing historic HTML code. It is trivial to write such a proxy in Node.js, for example.
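The URL mapping such a proxy performs can be sketched as follows. The example is written in Python rather than Node.js, uses only the standard library, and treats http://mirror.local/ as the placeholder mirror from the paragraph above. The names mirror_url, MirrorProxy and run are assumptions for this sketch, which handles plain GET requests only.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError, URLError
from urllib.parse import urlsplit
from urllib.request import urlopen

MIRROR_BASE = "http://mirror.local/"  # placeholder mirror, as in the text


def mirror_url(original_url: str, mirror_base: str = MIRROR_BASE) -> str:
    """Map http://host/path?query to <mirror_base>host/path?query."""
    parts = urlsplit(original_url)
    if not parts.netloc:
        raise ValueError("expected an absolute URL, got %r" % original_url)
    target = mirror_base + parts.netloc + (parts.path or "/")
    if parts.query:
        target += "?" + parts.query
    return target


class MirrorProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # A browser configured to use a proxy sends the full original URL
        # in the request line, so self.path is already absolute.
        try:
            with urlopen(mirror_url(self.path)) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get(
                    "Content-Type", "application/octet-stream")
        except ValueError:
            self.send_error(400)
            return
        except HTTPError as err:
            self.send_error(err.code)
            return
        except URLError:
            self.send_error(502)
            return
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port: int = 8080):
    """Listen on localhost only, so the proxy is not open to the world."""
    HTTPServer(("127.0.0.1", port), MirrorProxy).serve_forever()
```

Calling run() and pointing a browser's HTTP proxy setting at 127.0.0.1:8080 would route every request through mirror_url, while the address bar keeps showing the original URL.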

As even the oldest browsers support proxy servers, one can employ this system on a virtualized historic operating system like Windows 95 or Windows 98 with authentic web browsers. This significantly raises the access barrier though, as access to such a proxy must be technically restricted, or the Internet will collapse.3 Also, virtual machine software, historic operating systems and historic browsers and plugins are not easily available to the largest part of web users that might be interested in looking at their cultural past.

The largest authenticity surface is covered by using all the aforementioned techniques and running historic hardware. Current virtualization and emulation systems are quite good at re-creating all aspects of 1990s computers, except audio. There are just not enough business-critical MIDI files out there to make companies like VMware emulate OPL3 sound chips. Additionally, all graphical output looks very different on CRT monitors and their special surface-to-pixel ratios than it does on contemporary flat screens. For example, when looking at historic web pages on an 800×600 pixel 14″ CRT screen with a 60Hz refresh rate, it becomes clear why many people decided to use dark backgrounds and bright text for their designs instead of emulating paper with black text on a white background.

Balancing

While restoration work must be done on the right end of the scale to provide a very authentic re-creation of the web’s past, it is just as important to work on every point of the scale in between to allow the broadest possible audience to experience the most authentic re-enactment of Geocities that is comfortable for consumption on many levels of expertise and interest.


  1. The artist Ryder Ripps for example published MP3 recordings of MIDI files as online mix tapes and a vinyl record containing recordings of a selection of classic MIDI files. Of course these projects present only “snapshots” of a certain kind of MIDI playback sound and the selection process definitely targets musically interested people of today, but no doubt these efforts will keep the sound present.
  2. See Tips for Torrenters on this blog for a version of a browser addon that does this. reocities takes a similar approach and even automatically changed all HTML code so it validates and is more likely to “work” on contemporary browsers. While this seems questionable from an archivist’s point of view it surely allows broader access to the historic data.
  3. The most common problem with open proxies is, as experience from my project insert_coin shows, that people from Saudi Arabia use up a lot of bandwidth while accessing pornographic material that is blocked in their home country.

Many people started the adventure of downloading the Geocities Torrent, and while constantly seeding it, I have seen the peer numbers drop to zero a long time ago. Anyway, I do not recommend using the torrent at the moment, since Jason Scott, now working for the Internet Archive, put all the Geocities data into the Geocities Valhalla.

These are still the same files, hard to handle in their sheer amount, and not as neatly organized as the contents of the original torrent. If you would like to complete your canceled torrent download, or if you wish to download the whole thing via HTTP, you might like this script:

I mostly wrote it for myself to check data integrity before the first serious database ingest, but extended it to support downloading. It requires Perl 5.1x, XML::TreePP and md5sum. Hope you will find it useful.
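For readers who only want the idea behind the integrity check, here is a hedged Python sketch. It is not the Perl script itself. It assumes an XML file list in which every file element has a name attribute and an md5 child element, which is how I understand Internet Archive file lists to look; the function names are made up.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def md5_of(path: Path) -> str:
    """Compute the MD5 sum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_files(file_list_xml: Path, data_dir: Path) -> dict:
    """Compare local files against the MD5 sums in an XML file list.

    Returns a dict mapping each listed name to 'ok', 'missing' or 'corrupt'.
    The XML layout (<file name="..."><md5>...</md5></file>) is an assumption.
    """
    report = {}
    for entry in ET.parse(file_list_xml).getroot().iter("file"):
        name = entry.get("name")
        expected = entry.findtext("md5")
        if name is None or expected is None:
            continue  # entry without a checksum, nothing to verify
        local = data_dir / name
        if not local.is_file():
            report[name] = "missing"
        elif md5_of(local) == expected.strip().lower():
            report[name] = "ok"
        else:
            report[name] = "corrupt"
    return report
```

Everything reported as missing or corrupt is a candidate for a fresh download via HTTP.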

(To all the peeps reading this with a feed reader: The post contains a Perl script, embedded via gist from github.)

New Media researcher Anne Helmond asked if her 1997 “Unofficial Eric’s Trip Homepage” at SunsetStrip/3500/ was archived.

It is there. The lo-fi (HTML) part is almost complete and functional; only the guest book and “Sounds” are missing. The Hi-Fi link leads to a non-existent ethi.html.

As I mentioned in Ruins and Templates of Geocities, we rarely know whether missing files were lost due to glitches during the archiving process, or to the site owner’s lack of skill or failure to maintain the files and links between them. But this time I could ask the user personally what was there.

Anne said that probably she never ever made a HiFi version and can’t remember what she planned for this.

It is so 1997! I remember believing that broadband was just some weeks away and preparing for the near future with links like that. That may be it. Broken links were links to files that were to be made very, very soon.

Original URL: http://www.geocities.com/SunsetStrip/3500/