Author Archives: despens


Skywriting (excerpt)
86 video/MIDI sound captures of Geocities home pages featuring welcome_plane.gif

Megarave
Kunsthaus Langenthal, Switzerland
28 August – 16 November 2014

Thanks to the Geocities Research Institute’s intern, Joel Holmberg (olia-3), who helped identify copies and variations of welcome_plane.gif.

For a cultural researcher, the amount of material contained in the ArchiveTeam’s Geocities copy is simply overwhelming. The tumblr blog One Terabyte Of Kilobyte Age Photo Op presents one way to make it all accessible, by transforming it into an exciting soap opera of screenshots.

With the Geocities Research Institute’s latest effort, categorizing the home pages is as easy as checking tumblr: when accessed through the Geocities proxy server, each post is connected with the local database, and widgets to view and modify the displayed home page’s metadata are inserted into the tumblr page.

Tagging is a good way to work with experimental or explorational ontologies, since a hierarchy can be built ad-hoc. Different views on the same material can exist at the same time — a prerequisite for giving meaning to a collection as large as Geocities. Tags are useful before a complete collection of items is fully known and can be created independently by different researchers, without too much prior agreement on a vocabulary.

In conclusion, each new opportunity and context to enter meta information has the potential to be valuable. Meta information does not have to be definitive or objective to help structure items, as long as the system applying it performs reasonably fast and allows quick combinations.
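To make the tagging idea concrete, here is a minimal sketch of how such ad-hoc tags could be stored, assuming a simple SQLite schema (the table names and the example tags are hypothetical, not the Institute’s actual database):

```python
import sqlite3

# Tags are created ad hoc; several researchers' vocabularies can coexist
# on the same pages, each just adding rows to the same tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE page_tags (page_id INTEGER, tag_id INTEGER,
                        UNIQUE (page_id, tag_id));
""")

def tag(url, name):
    # Create page and tag rows on first use, then link them.
    db.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    db.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
    db.execute("""INSERT OR IGNORE INTO page_tags
                  SELECT p.id, t.id FROM pages p, tags t
                  WHERE p.url = ? AND t.name = ?""", (url, name))

def pages_with(name):
    return [row[0] for row in db.execute(
        """SELECT p.url FROM pages p
           JOIN page_tags pt ON pt.page_id = p.id
           JOIN tags t ON t.id = pt.tag_id
           WHERE t.name = ?""", (name,))]

tag("www.geocities.com/EnchantedForest/3151/", "welcome_plane")
tag("www.geocities.com/EnchantedForest/3151/", "under_construction")
```

Because tags are only rows, a hierarchy or controlled vocabulary can be layered on later without touching the pages themselves.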

To celebrate one year of our Geocities screenshot tumblr One Terabyte of Kilobyte Age Photo Op I restored the top 3 reblogged and liked home pages posted there, as tracked by Olia.

The access has been optimized for contemporary browsers and high interactivity. On the Authenticity/Access chart, this restoration is placed between “screenshots” and “HTTP mirror access, contemporary, browser AddOn”: the pixels generated by contemporary browsers are not the same as the ones rendered by a 1997/98 browser, and the URLs are not the original ones; however, the interactivity comes close to the original, and the graphics look fine enough and are animated. As a bonus, for the first time in website restoration, the embedded MIDI files have been transformed into audio recordings using timidity and a Soundblaster AWE32 instrument set.1
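The rendering step can be sketched as building a timidity command line; the config file name below is a hypothetical placeholder for a config that points timidity at an AWE32 instrument set, while -c, -Ow and -o are real timidity options:

```python
def timidity_command(midi_path, wav_path, cfg="awe32.cfg"):
    # Render a MIDI file to a WAVE recording with timidity.
    # "awe32.cfg" is an assumed filename for a config selecting the
    # AWE32 instrument set.
    # -c: read this config file, -Ow: output RIFF WAVE, -o: output file
    return ["timidity", "-c", cfg, "-Ow", "-o", wav_path, midi_path]

# e.g. subprocess.run(timidity_command("my_heart.mid", "my_heart.wav"))
```

The resulting WAVE files can then be compressed to the OGG and MP3 recordings mentioned in the footnote.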

#3 I have a website

All material except the counter image was present in the ArchiveTeam’s Geocities torrent distribution. The missing image icq.JPG was probably never uploaded by the user; it is not present on any public Geocities mirror.

The counter image was lifted from the Wayback Machine. The original URL http://www.geocities.com/cgi-bin/counter probably used browser referrer information to assign the counter to a certain web page.2 The Internet Archive’s web crawler saved the counter showing 0000 over a few years. We will not be able to reconstruct the number of visitors to the page, but at least we can imagine how it looked.
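A minimal sketch of how such a Referer-keyed counter could have worked on the server side (the lookup structure is an assumption for illustration):

```python
def counter_value(headers, counters):
    # One URL for everyone: the Referer header sent by the browser tells
    # the server which page's counter to return. A crawler that sends no
    # Referer gets the default "0000" -- which matches what the Internet
    # Archive's crawler recorded.
    referer = headers.get("Referer", "")
    return counters.get(referer, "0000")
```

This also explains why the counter cannot be reconstructed: the per-page state lived only on Geocities’ server.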

The MIDI file embedded in the page, a version of Celine Dion’s “My Heart Will Go On”, is heavily damaged and produces strange noises when played back via timidity. I haven’t verified how it would be interpreted on a legacy system, but since the file does not meet the MIDI specification, it certainly will not reproduce a perfect version of the song. (The file is damaged or not present in all public mirrors of Geocities.)

#2 Cute Boy Site

This simple home page posed no further problems; “As Long As You Love Me” by the Backstreet Boys was conserved in a perfect MIDI version. The missing image devlayy.jpg never left the author’s hard disk: it is referenced outside of the home page’s root directory, in a folder called Annies GirlClub.

#1 Divorced Dads Page

From Divorced Dads, the ArchiveTeam’s copy only contains the main page. I took some missing pages from reocities. The downside of reocities is that the server delivers no Last-Modified header; the upside is that the original HTML is less modified than on the wayback machine.3 Thankfully the wayback machine delivers original Last-Modified dates in extra HTTP headers, so I was able to transfer this metadata to the reocities copies.4

The banner at the bottom was replaced with a generic 2003 banner ad from the same banner exchange service, as found on the wayback machine.

The top of the page features a Java applet called “GeoGuide” that is referenced on many Geocities home pages. Unfortunately, Java applets have posed issues for webcrawler-based archiving, since they are opaque blobs of code that might load further resources, for example images or object code libraries. Most crawlers would not even download the applet files, because of the low likelihood that they would work later. There is no public mirror of Geocities that contains this applet, and until now no screenshot or other documentation of GeoGuide has been found.

The counter used to be delivered from a personalized URL, http://www.geocities.com/cgi-bin/counter/jacquestheman; when it was first checked on the wayback machine in 2003, it was already producing a “file not found”. Since Geocities moved their user tracking to a separate server, visit.geocities.com, I decided to look there and indeed found four zeroes printed in a nice font, still alive. This might be the counter the page’s author customized for himself, or it might not be.

All sub pages use a non-standard font called “Paramount” (<FONT FACE="Paramount">). A metadata tag <META NAME="GENERATOR" CONTENT="Mozilla/4.01 [en] (Win95; I) [Netscape]"> hints at Windows 95 as the platform the pages were created on, but there is no information available about what this font might be: there are some freeware fonts with that name, but no font of that name was ever included in, for example, the “Microsoft Plus” packs that gave Windows users extra features and fonts, nor did Microsoft Office ever ship with a Paramount font. Since any choice of font would be arbitrary, and the likelihood that page visitors in 1997/98 had exactly this font installed and actually saw it is very low, I decided to leave the browser’s default font in place.

Enjoy!


  1. If the restorations are accessed via a legacy system, audio authenticity suffers because the original MIDI files have been replaced with OGG and MP3 recordings. []
  2. Upon loading images, browsers send a Referer header to the server that contains the URL of the web page the image is embedded in. This way, servers can return different images for the same URL. []
  3. Reocities just inserts a Javascript banner after the opening body tag and seems to have done a simple search/replace on geocities.com to reocities.com. []
  4. The headers encountered during crawl time are reproduced by The Internet Archive as starting with X-Archive-Orig-, for example X-Archive-Orig-Last-Modified: Sun, 10 Aug 1997 01:00:05 GMT []

There is now just enough minimally invasive order brought to the Geocities files that it is possible to serve them via a proxy server that even imitates the original URLs completely. Using this proxy, you will be able to click on any historic Geocities URL and experience it in your browser, which ideally is a historic browser as well.1

All technical measures applied to the data are presented on GitHub in the form of annotated programs written in bash, Perl, SQL and Python. The comments inside the scripts explain problems, considerations, compromises, decisions and technical solutions, trying to live up to the ideal of software being executable documentation of itself.

You are welcome to evaluate each step and use this information to make your own Geocities proxy server, or use the developed techniques to revive other dead web sites.


Click to enlarge!!

There is a small discussion over at reddit about this graphic. You’re welcome to compare this treemap with the first published treemap, with problematic areas encircled.

The overall number of files has come down from 36 million to 28 million.

Next up is re-packaging the Geocities files and database contents for a cleaned-up distribution on the Internet Archive.


  1. Get your ancient web surfing gear at evolt.org. []

For those wondering how the automated tumblr blog One Terabyte of Kilobyte Age Photo Op is made, the graphic below should be enlightening. More on technology, philosophy and historical value of screenshots will follow soon.

  1. Make a list of the files that need inspection with convmv, like this:
    (convmv --lowmem -r --nfc -f latin1 -t utf8 *) 2>&1 >> ~/Desktop/encoding_errors.txt
    This will give you a list of the files whose names are probably not encoded in utf8.
  2. Check out each directory containing bad filenames. Usually, inside one profile you will only encounter one encoding. To find out which one exactly, grep for parts of the bad file name in surrounding HTML files. For example:
    $ ls
    A?onet.jpg
    milenio.html
    $ grep 'onet\.jpg' *.html
    milenio.html:    <td colspan=3 rowspan=1 width=314 align="center" [rest of output line omitted]

    Now you know that milenio.html links to the file that has a mysterious file name.

  3. Check out the HTML file in Firefox:
    $ firefox milenio.html
    By looking at the source code (Control+U) or “page information” (Control+I) you can usually guess the encoding used. Either

    • it is explicitly written into the <head> part of the HTML, like
      <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
    • or the characters are written as entities inside the HTML, like &Agrave;, &#192; or %C0. In that case the browser will usually display them correctly and you can try different encodings for the file.
    • If no encoding is specified anywhere, at least on Geocities the likelihood that it is iso-8859-1 is very high. (This used to be Netscape’s default encoding if no other was specified.)
  4. Use convmv to convert the file names:
    $ convmv -f iso-8859-1 -t utf8 -r --notest *
    The --notest option makes convmv actually do the renaming; if this option is omitted, convmv will just display what it would do to the files.

As mentioned in Olia’s A Vernacular Web, people writing in the Cyrillic alphabet had lots of fun before Unicode became common. Every operating system sported multiple ways of mixing Cyrillic and ASCII letters into one 8bit encoding scheme.

Many Russian web sites had to offer their text in several encodings to accommodate all possible visitors, and a stripe of buttons offering these encodings could be found at the top of many pages. Collections of such button graphics in different styles became part of the web folklore.

1

And of course many Geocities users wrote in the Cyrillic alphabet. Vladimir Sitnik from Artemovsky created a site for his local Karate Dojo. A typical link to a photo on his website would look like this:

pict_e.html?RU?images/040502b.jpg?Наша%20команда?с%20наградами%20из%20Одессы%202004

Naturally, during the Archive Team’s backup, this confused wget quite a lot and it created many files looking like this:

Since most current Linux systems use Unicode for file names, the 8bit encoded Cyrillic letters are illegal and printed as question marks. The HTML code of any of the files in the directory reveals the encoding used by Vladimir Sitnik:

<meta http-equiv="Content-Type" content="text/html; Charset=Windows-1251">

Using the handy tool convmv with the codepage cp1251, and switching the terminal to the same encoding for an especially nasty case, I managed to restore the file names:
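What convmv does here can be sketched in a few lines of Python; the byte string below spells one of Vladimir Sitnik’s captions in Windows-1251, the way wget would have written it to disk:

```python
def repair_name(raw_bytes, encoding="cp1251"):
    # Interpret the raw on-disk filename bytes in the page's declared
    # legacy encoding; the result is a proper Unicode name that can be
    # re-encoded as UTF-8 for a current file system.
    return raw_bytes.decode(encoding)

# "Наша команда" as Windows-1251 bytes
raw = b"\xcd\xe0\xf8\xe0 \xea\xee\xec\xe0\xed\xe4\xe0"
```

The actual renaming (os.rename on the decoded name) is what convmv performs recursively once --notest is given.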

Only after completing this tedious operation did I check that each of the pict_e.html file variations was actually the same, apart from some minor tracking code inserted by the Geocities server; there is no reason to really keep more than one.2 They all were generated because Vladimir Sitnik successfully circumvented the fact that Geocities did not offer any kind of server-side scripting or usable templating mechanism. He still did not want to create hundreds of pages with just a different photo on each, so he had the quite exotic idea of using HTTP GET parameters to create photo pages via Javascript.

In his links, the parameters are separated by question marks: interface language, file name of the photo, caption lines. pict_e.html then uses the Javascript instruction document.writeln to create the actual HTML. So that is a minimal content management system on a free service that does not offer content management systems. Almost AJAX, but Netscape 4 compatible!
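The link format can be re-implemented in a few lines; this is a Python sketch of the parsing that pict_e.html’s Javascript performs before calling document.writeln (the function name is mine, not from the original page):

```python
from urllib.parse import unquote

def parse_photo_link(url):
    # Everything after the first "?" is a "?"-separated parameter list:
    # interface language, photo file name, then any number of caption lines.
    _, _, params = url.partition("?")
    lang, image, *captions = params.split("?")
    return lang, image, [unquote(c) for c in captions]
```

Run against the example link above, it yields the language “RU”, the photo path and the two decoded caption lines.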

Considering this page was made by an “amateur”, created in 2000 and last updated in 2006, it already shows that much better scraping tools than wget are needed3 — even for pages that give a very modest impression to “professionals”. Users have been very inventive in all aspects of web authoring; they did not care for any specification or standard, they went ahead and pushed the borders, trying all the new options presented to them during the browser wars.

Original URL: http://www.geocities.com/soho/Museum/1134/ph4tu_r.html


  1. All examples lifted from A Vernacular Web. []
  2. Except for studying wget behavior or Geocities’ random number generator. []
  3. Maybe something based on PhantomJS? []

Treemaps are a practical way to visualize large file systems, and a very useful tool for data archeologists, archivists and restorers.

The Free Software tool gdmap draws treemaps, and running it on the Geocities Torrent files helps identify problems within the collection. It produces images like these (click to enlarge):

Edit: See screenshot of the color coding legend. Please excuse the German interface.

This is just the main server, www.geocities.com.1 On mouse-hover one can get path information for each rectangle, helping to decide which parts of the files still need work:

The large shaded rectangles represent the neighborhoods, inside them are the sub-neighborhoods, and even before fixing the various apparent problems it is clear that HeartLand dwarfs them all. Vanity Profiles on the right edge are per-user accounts created after Yahoo! bought Geocities and neighborhoods were discontinued. Some of the largest vanity profiles seem to challenge whole sub-neighborhoods in size.

Instances of Symlink Cancer (marked red) show up as big blobs of directories with hardly any files in them. These are cases where Yahoo! deeply screwed up their file hosting and URL forwarding, making the Archive Team’s harvesting software wget run in circles and create almost endlessly nested directories. Of course this cancer distorts the graphics quite a bit; it needs to go!

Missing mergers (marked blue) indicate huge case-related double entries: for example, “heartland” still has to be merged into “HeartLand” and “soho” into “SoHo”.

Apparently finding case-related redundancies will require another huge round of processing time.
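Finding those case-related double entries can be sketched as grouping names by their lowercased form; this is a hypothetical helper for illustration, not one of the published scripts:

```python
from collections import defaultdict

def case_conflicts(names):
    # Group directory names that differ only in case, so e.g.
    # "heartland" and "HeartLand" end up in the same merge group.
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}
```

The expensive part is not the grouping itself but comparing and merging the file trees inside each group.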

The artwork The Deleted City by Richard Vijgen2 uses treemap visualization to show the contents of the Geocities Torrent as a city scape.

Regular readers of this blog will immediately notice that there was officially no neighborhood named “athens”, only one called “Athens”. However “EnchantedForest” and “FashionAvenue” seem alright. Without having seen the piece in action personally, I am guessing Deleted City is not very accurately representing Geocities. Richard, do you need a cleaned data set? Contact me. :)


  1. With a dataset that large the image takes several hours to compile. []
  2. already mentioned here before []

As has already become clear, the contents of the torrent cannot be considered an archive, but rather the result of a distributed web harvesting effort.1 Many of the Torrent’s files are redundant or conflicting, because of uppercase/lowercase issues, different URLs pointing to the same file, etc.

Before these files can be put together again and served via HTTP2, decisions need to be made about which file should be served when a certain URL is requested. For most URLs this decision is easy — a request for

http://www.geocities.com/EnchantedForest/3151/index.html

should probably be answered with the corresponding file in the directory

www.geocities.com/EnchantedForest/3151/index.html.

However this file exists three times:

  • www.geocities.com/EnchantedForest/3151/index.html,
  • geocities.com/EnchantedForest/3151/index.html (without the “www”) and
  • YAHOOIDS/E/n/EnchantedForest/3151/index.html.

And all of them differ in contents!

While this particular case is easy (the difference between these files is a tracking code inserted by Yahoo!, randomly generated for every visitor of the page), in general each conflict has to be examined, especially when last-modified dates differ.

Many conflicts might be resolved automatically; however, the first step was to identify conflicts and remove redundancies from the file system. This means actually changing the source material: the files contained in the Torrent and how they are organized. Ethically this is problematic; practically, I do not see another way to reduce the number of decisions to be made for a reconstruction.
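One way such automatic resolution could work is to normalize away the injected tracking code before comparing copies: files that differ only inside the tracking snippet get the same content key and can be merged. The marker pattern below is an assumption for illustration only; the real injected snippet looks different and would have to be determined from the actual files.

```python
import hashlib
import re

# Hypothetical marker pattern standing in for Yahoo!'s injected snippet.
TRACKING = re.compile(rb"<!-- tracking start -->.*?<!-- tracking end -->",
                      re.S)

def content_key(html_bytes):
    # Strip the (assumed) tracking block, then hash what remains.
    # Copies that differ only inside the markers share one key.
    return hashlib.sha1(TRACKING.sub(b"", html_bytes)).hexdigest()
```

Conflicts whose keys still differ after normalization are the ones that need a human decision.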

To make the alteration more acceptable, I am publishing all tools that I created for handling the torrent. They can be found on github.3 merge_directories.pl is the script that did most of the conflict-identifying work. The directory with scripts created for handling the Torrent contains a list of scripts that should be run in order after downloading to normalize the data. Running them all in a row took me 29 hours and 47 minutes.

The original Geocities Torrent contains 36’238’442 files. My scripts managed to identify 3’198’141 redundant files, a quite impressive 8.83%. There are still 768’884 conflicts, 475 of them involving triple files, and this is even before taking into account case-related redundancies or conflicts that I hope to solve with the help of a database.

Ingesting is next.


  1. See Tips for Torrenters, Cleaning Up The Torrent []
  2. see Authenticity/Access []
  3. The simple ones are unix shell scripts, most Linux distributions come with a rich set of tools to do file handling. More complex scripts are written in Perl. I hold the opinion that anybody involved in data preservation and digital archeology should be knowledgeable in Perl for mangling large amounts of data and interfacing to everything ever invented. []

Access to the remains of Geocities can be measured on two axes: authenticity (how realistically the harvested data can be presented again) and ease of access (what technical requirements and knowledge are needed to gain access at a certain level of authenticity).

Graphical authenticity on the pixel level means that a Geocities web page will render exactly as it did for visitors at the time the page was published. This might be considered esoteric, on the grounds that web authors could never be sure how their creations would appear to their audience, by the nature of HTML itself. However, web authors often were not aware of this fact, and people of the 1990s used different browsers on different operating systems than we do today. The dominance of, for example, Netscape browsers and a certain set of plugins meant that most people would experience a web page in a certain way, one that is very different from accessing it now with Webkit-based browsers.

What has changed most significantly in operating systems since the high times of Geocities is the display of text. All current operating systems render characters with smoothed out edges, and this is reflected as much in current web design as the historic aliased pixel text display influenced the web design of the past.

Low-barrier access to original visual web culture can be provided by screenshots taken from virtual machines running a historic operating system and browser.

Today, MIDI music files do not sound at all as they used to when they dominated web audio. These audio files can be recorded from historic hardware and operating systems, or rendered with emulators, for contemporary listeners. This strips them of any context, but can still give a good, easily accessible impression of how the web sounded at a certain point in time.1

The web pages’ original interactivity can be restored by accessing them from a mirror and employing a browser addon that re-writes the original URLs contained in the HTML to match the mirror’s URLs.2

If the data should not be tampered with, neither on the client nor on the mirror server, and the URLs shown in the browser’s address bar should be the exact originals, a proxy server must be put in front of the mirror server. This proxy can transparently redirect requests for, say, http://www.geocities.com/ to http://mirror.local/www.geocities.com/, without the need for any name server tricks or changes to historic HTML code. It is trivial to write such a proxy in nodejs, for example.
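The URL mapping such a proxy performs can be sketched in a few lines (shown here in Python rather than nodejs; mirror.local is the example host from the paragraph above, and the real proxy would also forward the request and stream the response back):

```python
from urllib.parse import urlsplit

MIRROR = "http://mirror.local"  # example mirror host

def rewrite(request_url):
    # Map the original URL onto the mirror's directory layout:
    # http://www.geocities.com/foo -> http://mirror.local/www.geocities.com/foo
    parts = urlsplit(request_url)
    return "%s/%s%s" % (MIRROR, parts.netloc, parts.path or "/")
```

Since the hostname becomes the first path component, the same proxy serves geocities.com, www.geocities.com and visit.geocities.com without special cases.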

As even the oldest browsers support proxy servers, one can employ this system on a virtualized historic operating system like Windows 95 or Windows 98 with authentic web browsers. This significantly raises the access barrier, though, as access to such a proxy must be technically restricted, or the Internet will collapse.3 Also, virtual machine software, historic operating systems, and historic browsers and plugins are not easily available to the largest part of web users who might be interested in looking at their cultural past.

The largest authenticity surface is covered by using all of the aforementioned techniques and running historic hardware. Current virtualization and emulation systems are quite good at re-creating all aspects of 1990s computers, except audio. There are just not enough business-critical MIDI files out there to make companies like VMWare emulate OPL3 sound chips. Additionally, all graphical output looks very different on CRT monitors, with their special surface-to-pixel ratios, than it does on contemporary flat screens. For example, when looking at historic web pages on an 800×600 pixel 14″ CRT screen with a 60Hz refresh rate, it becomes clear why many people decided to use dark backgrounds and bright text for their designs instead of emulating paper with black text on a white background.

Balancing

While restoration work must be done at the right end of the scale to provide a very authentic re-creation of the web’s past, it is just as important to work on every point of the scale in between, to allow the broadest possible audience to experience the most authentic re-enactment of Geocities that is comfortable to consume, on many levels of expertise and interest.


  1. The artist Ryder Ripps for example published MP3 recordings of MIDI files as online mix tapes and a vinyl record containing recordings of a selection of classic MIDI files. Of course these projects present only “snapshots” of a certain kind of MIDI playback sound and the selection process definitely targets musically interested people of today, but no doubt these efforts will keep the sound present. []
  2. See Tips for Torrenters on this blog for a version of a browser addon that does this. reocities takes a similar approach and even automatically changed all HTML code so it validates and is more likely to “work” on contemporary browsers. While this seems questionable from an archivist’s point of view it surely allows broader access to the historic data. []
  3. The most common problem with open proxies is, as experience from my project insert_coin shows, that people from Saudi Arabia use up a lot of bandwidth while accessing pornographic material that is blocked in their home country. []