Monthly Archives: June 2012

Russian Karate ⁋ by despens

last updated 2006 + meta + user 2012-06-16

As mentioned in Olia’s A Vernacular Web, people writing in the Cyrillic alphabet had lots of fun before Unicode became common. Every operating system sported multiple ways of mixing Cyrillic and ASCII letters into one 8-bit encoding scheme.

Many Russian web sites had to offer their text in several encodings to accommodate all possible visitors, and a strip of buttons offering these encodings could be found at the top of many pages. Collections of such button graphics in different styles became part of web folklore.

And of course many Geocities users wrote in the Cyrillic alphabet. Vladimir Sitnik from Artemovsky created a site for his local Karate Dojo. A typical link to a photo on his website would look like this:

pict_e.html?RU?images/040502b.jpg?Наша%20команда?с%20наградами%20из%20Одессы%202004

Naturally, during the Archive Team’s backup, this confused wget quite a lot and it created many files looking like this:

Since most current Linux systems use Unicode for file names, the 8-bit encoded Cyrillic letters are invalid and printed as question marks. The HTML code of any of the files in the directory reveals the encoding used by Vladimir Sitnik:

<meta http-equiv="Content-Type" content="text/html; Charset=Windows-1251">

Using the handy tool convmv with the codepage cp1251, and switching the terminal to the same encoding for an especially nasty case, I managed to restore the file names:
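What such a renaming amounts to can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool used above: the function names `recode_name` and `plan_renames` are made up for this example, and it assumes that every name that is not valid UTF-8 is in fact Windows-1251.

```python
import os

def recode_name(raw: bytes, source_encoding: str = "cp1251") -> bytes:
    """Return the UTF-8 form of a file name given as raw bytes.

    Names that already decode as UTF-8 (including plain ASCII) are
    returned unchanged. Everything else is assumed to be in
    source_encoding, which is a guess that has to be verified by hand.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(source_encoding).encode("utf-8")

def plan_renames(directory: bytes, source_encoding: str = "cp1251"):
    """Yield (old_path, new_path) pairs without touching the disk."""
    # Passing bytes to os.listdir returns the names as raw bytes.
    for raw in os.listdir(directory):
        new = recode_name(raw, source_encoding)
        if new != raw:
            yield os.path.join(directory, raw), os.path.join(directory, new)
```

Printing the planned pairs before calling `os.rename` on them is a cheap way to catch a wrong encoding guess early.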

Only after completing this tedious operation did I check that each of the pict_e.html file variations was actually the same, apart from some minor tracking code inserted by the Geocities server, so there is no reason to really keep more than one.² They were all generated because Vladimir Sitnik successfully circumvented the fact that Geocities did not offer any kind of server-side scripting or usable templating mechanism. He did not want to create hundreds of pages with just a different photo on each, so he had the quite exotic idea of using HTTP GET parameters to create photo pages via JavaScript.

In his links, the parameters are separated by question marks: interface language, file name of the photo, caption lines. pict_e.html then uses the JavaScript instruction document.writeln to create the actual HTML. So that is a minimal content management system on a free service that does not offer content management systems. Almost AJAX, but Netscape 4 compatible!
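The parameter scheme described above can be mimicked in Python to make it concrete. This is a sketch of the splitting logic only, not a transcription of the site's script: `PhotoLink` and `parse_photo_link` are names invented here, and the link is treated as Unicode text, whereas the original pages were served as Windows-1251.

```python
from dataclasses import dataclass
from urllib.parse import unquote

@dataclass
class PhotoLink:
    page: str            # the generic photo page, e.g. pict_e.html
    language: str        # interface language
    image: str           # file name of the photo
    caption_lines: list  # zero or more caption lines

def parse_photo_link(href: str) -> PhotoLink:
    """Split a link whose parameters are separated by question marks."""
    page, *params = href.split("?")
    if len(params) < 2:
        raise ValueError("expected at least a language and an image")
    # Undo the percent-escapes (%20 for spaces) in every parameter.
    language, image, *captions = (unquote(p) for p in params)
    return PhotoLink(page, language, image, captions)

link = ("pict_e.html?RU?images/040502b.jpg"
        "?Наша%20команда?с%20наградами%20из%20Одессы%202004")
photo = parse_photo_link(link)
print(photo.image)          # images/040502b.jpg
print(photo.caption_lines)  # ['Наша команда', 'с наградами из Одессы 2004']
```

Seen this way, it is clear why a crawler treats every such link as a separate page: the page name is always the same and only the parameters differ.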

Considering this page was made by an “amateur”, created in 2000 and last updated in 2006, it already shows that much better scraping tools than wget are needed,³ even for pages that give a very modest impression to “professionals”. Users have been very inventive in all aspects of web authoring: they did not care for any specification or standard, they went ahead and pushed the borders, trying all the new options presented to them in the browser wars.

Original URL: http://www.geocities.com/soho/Museum/1134/ph4tu_r.html

¹ All examples lifted from A Vernacular Web.
² Except for studying wget behavior or Geocities’ random number generator.
³ Maybe something based on PhantomJS?

Birdseye view on www.geocities.com ⁋ by despens

meta + neighborhoods 2012-06-10

Tree Maps are a practical way to visualize large file systems, and a very useful tool for data archeologists, archivists and restorers.

The Free Software tool gdmap draws tree maps, and running it on the Geocities Torrent files is useful for identifying problems within the collection. It produces images like these:

Edit: See screenshot of the color coding legend. Please excuse the German interface.

This is just the main server, www.geocities.com.¹ On mouse-hover one can get path information for each rectangle, helping to decide which parts of the files still need work:

The large shaded rectangles represent the neighborhoods, with the sub-neighborhoods inside them. Even before fixing the various apparent problems, it is clear that HeartLand dwarfs them all. Vanity Profiles on the right edge are per-user accounts created after Yahoo! bought Geocities and neighborhoods were discontinued. Some of the largest vanity profiles seem to challenge whole sub-neighborhoods in size.

Instances of Symlink Cancer (marked red) show up as big blobs of directories with hardly any files in them. These are cases where Yahoo! deeply screwed up their file hosting and URL forwarding, making the Archive Team’s harvesting software wget run in circles and create almost endlessly nested directories. Of course this cancer distorts the graphics quite a bit; I need to get rid of it!
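One way to hunt for such loops without a tree map is to look for paths in which the same run of directory names repeats back to back. The following Python sketch is only a heuristic and not how the collection was actually cleaned: `repeated_run` and the threshold of three repeats are assumptions, and the example path is invented.

```python
from pathlib import PurePosixPath

def repeated_run(path: str, min_repeats: int = 3):
    """Return the first run of path components that occurs at least
    min_repeats times back to back, or None if there is no such run."""
    parts = PurePosixPath(path).parts
    n = len(parts)
    # Try short runs first: a loop of one directory, then two, and so on.
    for size in range(1, n // min_repeats + 1):
        for start in range(n - size * min_repeats + 1):
            block = parts[start:start + size]
            if all(
                parts[start + k * size:start + (k + 1) * size] == block
                for k in range(min_repeats)
            ):
                return block
    return None

# Hypothetical path in the style of a crawler running in circles.
suspect = "www.geocities.com/SoHo/1134/a/b/a/b/a/b/index.html"
print(repeated_run(suspect))  # ('a', 'b')
```

A legitimate site can of course contain repeated directory names too, so any hit still needs a human look before something is deleted.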

Missing mergers (marked blue) indicate huge case-related double-entries, for example “heartland” still has to be merged into “HeartLand” and “soho” into “SoHo”.

Apparently finding case-related redundancies will require another huge round of processing time.
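Finding candidates for such mergers boils down to grouping names that are equal once case is ignored. A minimal Python sketch, assuming the directory names have already been collected into a list; the function name `case_collisions` is made up for this example:

```python
from collections import defaultdict

def case_collisions(names):
    """Group names that differ only in upper/lower case.

    Returns a dict mapping the case-folded name to the sorted list of
    spellings found, restricted to names with more than one spelling.
    """
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}

neighborhoods = ["HeartLand", "heartland", "SoHo", "soho", "Athens"]
print(case_collisions(neighborhoods))
# {'heartland': ['HeartLand', 'heartland'], 'soho': ['SoHo', 'soho']}
```

The grouping itself is cheap. The expensive part on a collection of this size is walking every directory level to collect the names in the first place.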

The artwork The Deleted City by Richard Vijgen² uses treemap visualization to show the contents of the Geocities Torrent as a city scape.

Regular readers of this blog will immediately notice that there was officially no neighborhood named “athens”, only one called “Athens”. However, “EnchantedForest” and “FashionAvenue” seem alright. Without having seen the piece in action personally, I am guessing Deleted City does not represent Geocities very accurately. Richard, do you need a cleaned data set? Contact me. :)

¹ With a dataset that large the image takes several hours to compile.
² Already mentioned here before.

Resolving conflicts ⁋ by despens

meta 2012-06-09

As has already become clear, the contents of the torrent cannot be considered an archive, but rather the result of a distributed web harvesting effort.¹ Many of the Torrent’s files are redundant or conflicting, because of uppercase/lowercase issues, different URLs pointing to the same file, and so on.

Before these files can be put together again and served via HTTP², decisions need to be made about which of these files should be served when a certain URL is requested. For most URLs this decision is easy: a request for

http://www.geocities.com/EnchantedForest/3151/index.html

should probably be answered with the corresponding file in the directory

www.geocities.com/EnchantedForest/3151/index.html.

However this file exists three times:

www.geocities.com/EnchantedForest/3151/index.html,
geocities.com/EnchantedForest/3151/index.html (without the “www”) and
YAHOOIDS/E/n/EnchantedForest/3151/index.html.

And all of them differ in contents!

While this particular case is easy (the difference between these files is a tracking code inserted by Yahoo!, randomly generated for every visitor of the page), in general each conflict has to be examined, especially when the last-modified dates differ.

Many conflicts might be resolved automatically; however, the first step was to identify conflicts and remove redundancies from the file system. This means actually changing the source material: the files contained in the Torrent and how they are organized. Ethically this is problematic, but practically I do not see another way than reducing the number of decisions to be made for a reconstruction.
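The distinction between a redundancy and a conflict can be illustrated by comparing the candidate copies of one URL byte for byte. The Python sketch below uses SHA-256 digests for that. It is a simplified illustration under that assumption, not a description of the scripts actually used, and the names `digest` and `classify` are invented for it.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files are fine."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def classify(copies):
    """Classify the candidate copies of one URL.

    Returns (status, groups): status is 'missing', 'unique',
    'redundant' (all copies identical) or 'conflict' (contents
    differ); groups maps each digest to the paths that share it.
    """
    existing = [Path(p) for p in copies if Path(p).is_file()]
    if not existing:
        return "missing", {}
    groups = defaultdict(list)
    for path in existing:
        groups[digest(path)].append(path)
    if len(existing) == 1:
        status = "unique"
    elif len(groups) == 1:
        status = "redundant"
    else:
        status = "conflict"
    return status, dict(groups)
```

Copies that differ only by a per-visitor tracking code would still be reported as a conflict here, since the comparison is strictly byte for byte. Those are exactly the cases that need a closer look or a smarter comparison.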

To make the alteration more acceptable I publish all tools that I created for handling the torrent. They can be found on GitHub.³ merge_directories.pl is the script that did most of the conflict identifying work. The directory with scripts created for handling the Torrent contains a list of scripts that should be run in order after downloading to have the data normalized. Running them all in a row took me 29 hours and 47 minutes.

The original Geocities Torrent contains 36’238’442 files. My scripts managed to identify 3’198’141 redundant files, a quite impressive 8.83%. There are still 768’884 conflicts, 475 involving triple files. And this is even before taking into account case-related redundancies or conflicts, which I hope to solve with the help of a database.

Ingesting is next.

¹ See Tips for Torrenters, Cleaning Up The Torrent.
² See Authenticity/Access.
³ The simple ones are Unix shell scripts; most Linux distributions come with a rich set of tools to do file handling. More complex scripts are written in Perl. I hold the opinion that anybody involved in data preservation and digital archeology should be knowledgeable in Perl for mangling large amounts of data and interfacing to everything ever invented.

One Terabyte of Kilobyte Age
