One Terabyte of Kilobyte Age | Digging through the Geocities Torrent

Web Designers vs History of Web Design @abookapart ⁋ by olia

related 2012-08-22

I like A Book Apart’s idea of making “brief books for people who make websites”, and I order their books for my interface design students to provide them with texts written by designers who a) are not making websites in Photoshop, b) have their own opinion about web design as a profession, and c) can write.

But book #5, Designing for Emotion, gives me a hard time. Look at the titles of the first 4 chapters … and now imagine that they (and the rest of the book) are written without any mention of or reflection on the web sites people were making in the 90’s and are still making. Sites that are emotional and personal and human for real, not because of a marketing strategy.

The author, Aarron Walter, confesses in the beginning that for him the WWW started with dot.com, with the rush for gold zeros and gold ones. Maybe that’s why he doesn’t know or doesn’t want to know that the web outside of company websites and facebook profiles is and always was a very human environment. Besides, makers of early web pages command an impressive set of instruments to appeal to the emotions of other users, to keep their audience and to be special and authentic.

The ignorance of web design professionals towards the vernacular web started in 1997 and is still progressing. Instead of valuing the rich history of their own medium, they’re proud when their web sites look like comics, magazine spreads, bottle labels or cartoons. Web designers, learn from the users, learn from the 19 (not 15!) years of web design, dig through the Geocities Torrent with us. It will bring you closer to “emotional design’s primary goal to facilitate human-to-human communication” (p.29), if this is really the goal.

P.S. Designers at Work

How to fix wrongly encoded file names ⁋ by despens

meta 2012-07-01

Make a list of the files that need inspection with convmv, like this:
(convmv --lowmem -r --nfc -f latin1 -t utf8 *) 2>&1 >> ~/Desktop/encoding_errors.txt
This list shows which files probably have names that are not encoded in UTF-8.
Check out each directory containing bad filenames. Usually, inside one profile you will only encounter one encoding. To find out which one exactly, grep for parts of the bad file name in surrounding HTML files. For example:
$ ls
A?onet.jpg  milenio.html
$ grep 'onet\.jpg' *.html
milenio.html: <td colspan=3 rowspan=1 width=314 align="center" [rest of output line omitted]

Now you know that milenio.html links to the file that has a mysterious file name.
Check out the HTML file in Firefox:
$ firefox milenio.html
By looking at the source code (Control+U) or “page information” (Control+I) you can usually guess the encoding used. Either
- it is explicitly written into the <head> part of the HTML, like
  <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
- or the characters are written as entities or escapes inside the HTML, like &Agrave;, &#192; or %C0. Then the browser will usually display them correctly and you can try different encodings for the file.
- If no encoding is specified anywhere, at least on Geocities the likelihood that it is iso-8859-1 is very high. (This used to be Netscape’s default encoding if no other was specified.)
Use convmv to convert the file names:
$ convmv -f iso-8859-1 -t utf8 -r --notest *
The --notest option makes convmv actually do the renaming; if this option is omitted, convmv will just display what it would do to the files.
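
The inspection steps above can also be sketched in code. The following Python sketch is not part of the workflow described here (convmv does the actual renaming); it only illustrates the idea of reading the declared charset from a neighbouring HTML file, falling back to iso-8859-1 as suggested above, and previewing how a raw file name would decode. The function names and the sample bytes are illustrative assumptions.

```python
import re

# Matches charset=... as it appears inside a <meta> tag. The pattern works on
# raw bytes because the encoding of the HTML file is not known yet.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)

def guess_charset(html_bytes, default="iso-8859-1"):
    """Return the charset declared in the HTML, or the default if none is found."""
    match = CHARSET_RE.search(html_bytes)
    if match:
        return match.group(1).decode("ascii").lower()
    return default

def is_valid_utf8(raw_name):
    """True if the raw file name bytes already form valid UTF-8."""
    try:
        raw_name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def preview_name(raw_name, encoding):
    """Show what a badly encoded file name would look like after conversion."""
    return raw_name.decode(encoding)

# Hypothetical example: the byte 0xF1 stands in for the unreadable character.
html = b'<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />'
raw = b"A\xf1onet.jpg"
if not is_valid_utf8(raw):
    print(preview_name(raw, guess_charset(html)))
```

If the previewed name looks plausible next to the links in the HTML, the same encoding is a good candidate for the -f option of convmv.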

Russian Karate ⁋ by despens

last updated 2006 + meta + user 2012-06-16

As mentioned in Olia’s A Vernacular Web, people writing in the Cyrillic alphabet had lots of fun before Unicode became common. Every operating system sported multiple ways of mixing Cyrillic and ASCII letters into one 8bit encoding scheme.

Many Russian web sites had to offer their text in several encodings to accommodate all possible visitors, and a strip of buttons offering these encodings could be found at the top of many pages. Collections of such button graphics in different styles became part of web folklore.

And of course many Geocities users wrote in the Cyrillic alphabet. Vladimir Sitnik from Artemovsky created a site for his local Karate Dojo. A typical link to a photo on his website would look like this:

pict_e.html?RU?images/040502b.jpg?Наша%20команда?с%20наградами%20из%20Одессы%202004

Naturally, during the Archive Team’s backup, this confused wget quite a lot and it created many files looking like this:

Since most current Linux systems use Unicode for file names, the 8bit encoded Cyrillic letters are illegal and printed as question marks. The HTML code of any of the files in the directory reveals the encoding used by Vladimir Sitnik:

<meta http-equiv="Content-Type" content="text/html; Charset=Windows-1251">

Using the handy tool convmv with the codepage cp1251, and switching the terminal to the same encoding for an especially nasty case, I managed to restore the file names:

Only after completing this tedious operation did I check that each of the pict_e.html file variations was actually the same, apart from some minor tracking code inserted by the Geocities server; there is no reason to really keep more than one.² They were all generated because Vladimir Sitnik successfully circumvented the fact that Geocities did not offer any kind of server side scripting or usable templating mechanism. But he still did not want to create hundreds of pages with just a different photo on each. So he had the quite exotic idea of using HTTP GET parameters to create photo pages via Javascript.

In his links, the parameters are separated by question marks: interface language, file name of the photo, caption lines. pict_e.html then uses the Javascript instruction document.writeln to create the actual HTML. So that is a minimal content management system on a free service that does not offer content management systems. Almost AJAX, but Netscape 4 compatible!
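
To make the parameter layout concrete, here is a minimal sketch that takes such a link apart. It is written in Python for illustration only; the original page did this in Javascript with document.writeln, and that code is not reproduced here. The field order (interface language, photo file name, caption lines) follows the description above, while the function name and the returned dictionary are assumptions.

```python
from urllib.parse import unquote

def parse_photo_link(link):
    """Split a '?'-separated link into page, language, photo path and caption lines.

    Raises ValueError if the link has fewer than three '?'-separated parts.
    """
    page, language, photo, *caption_parts = link.split("?")
    # %20 and similar escapes in the caption are decoded to plain characters.
    caption_lines = [unquote(part) for part in caption_parts]
    return {
        "page": page,
        "language": language,
        "photo": photo,
        "caption": caption_lines,
    }

link = "pict_e.html?RU?images/040502b.jpg?Наша%20команда?с%20наградами%20из%20Одессы%202004"
info = parse_photo_link(link)
print(info["language"], info["photo"])
for line in info["caption"]:
    print(line)
```

Because every distinct combination of parameters counts as a different URL, a crawler that does not run the script ends up saving one copy of pict_e.html per photo.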

Considering this page was made by an “amateur”, created in 2000 and last updated 2006, it already shows that much better scraping tools than wget are needed³ — even when it comes to pages that give a very modest impression to “professionals”. Users have been very inventive in all aspects of web authoring, they did not care for any specification or standard, they went ahead and pushed the borders, trying all the new options presented to them in the browser wars.

Original URL: http://www.geocities.com/soho/Museum/1134/ph4tu_r.html

¹ All examples lifted from A Vernacular Web. [↩]
² Except for studying wget behavior or Geocities’ random number generator. [↩]
³ Maybe something based on PhantomJS? [↩]

Birdseye view on www.geocities.com ⁋ by despens

meta + neighborhoods 2012-06-10

Tree Maps are a practical way to visualize large file systems, and a very useful tool for data archeologists, archivists and restorers.

The Free Software tool gdmap draws tree maps and running it on the Geocities Torrent files is useful for identifying problems within the collection. It produces images like these (click to enlarge):

Edit: See screenshot of the color coding legend. Please excuse the German interface.

This is just the main server, www.geocities.com.¹ On mouse-hover one can get path information for each rectangle, which helps to decide which parts of the files still need work:

The large shaded rectangles represent the neighborhoods, inside them are the sub-neighborhoods, and even before fixing the various apparent problems it is clear that HeartLand dwarfs them all. Vanity Profiles on the right edge are per-user accounts created after Yahoo! bought Geocities and neighborhoods were discontinued. Some of the largest vanity profiles seem to challenge whole sub-neighborhoods in size.

Instances of Symlink Cancer (marked red) are showing up as big blobs of directories with hardly any files in them. These are cases where Yahoo! deeply screwed up their file hosting and URL forwarding, making the Archive Team’s harvesting software wget run in circles and create almost indefinitely nested directories. Of course this cancer distorts the graphics quite a bit, need to get rid of it!
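
One way to flag such runaway nesting without a tree map is to look for directory names that repeat within a single path. The sketch below is a simple heuristic and an assumption on my part, not the method used for the images above; the threshold and the sample paths are hypothetical.

```python
from collections import Counter

def max_segment_repeats(path):
    """Return how often the most frequent directory or file name occurs in a path."""
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        return 0
    return max(Counter(segments).values())

def looks_like_loop(path, threshold=3):
    """Flag paths in which one name shows up suspiciously often."""
    return max_segment_repeats(path) >= threshold

# Hypothetical paths, for illustration only.
print(looks_like_loop("www.geocities.com/a/b/a/b/a/b/index.html"))
print(looks_like_loop("www.geocities.com/EnchantedForest/3151/index.html"))
```

A low threshold will produce false positives for users who legitimately reuse folder names, so the flagged paths still need a human look.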

Missing mergers (marked blue) indicate huge case-related double-entries, for example “heartland” still has to be merged into “HeartLand” and “soho” into “SoHo”.

Apparently finding case-related redundancies will require another huge round of processing time.
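
Finding the candidates for such mergers within one directory listing is simple in principle; the expensive part is doing it across millions of directories. A minimal sketch, assuming the names have already been collected into a list (the function name is hypothetical):

```python
from collections import defaultdict

def find_case_collisions(names):
    """Group names that differ only in upper/lower case."""
    groups = defaultdict(set)
    for name in names:
        groups[name.lower()].add(name)
    # Keep only groups with more than one spelling.
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

collisions = find_case_collisions(["HeartLand", "heartland", "SoHo", "soho", "Athens"])
for key, variants in collisions.items():
    print(key, variants)
```

Which spelling should win is a separate decision that the sketch does not make.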

The artwork The Deleted City by Richard Vijgen² uses treemap visualization to show the contents of the Geocities Torrent as a city scape.

Regular readers of this blog will immediately notice that there was officially no neighborhood named “athens”, only one called “Athens”. However “EnchantedForest” and “FashionAvenue” seem alright. Without having seen the piece in action personally I am guessing Deleted City is not very accurately representing Geocities. Richard, do you need a cleaned data set? Contact me. :)

¹ With a dataset that large the image takes several hours to compile. [↩]
² already mentioned here before [↩]

Resolving conflicts ⁋ by despens

meta 2012-06-09

As has already become clear, the contents of the torrent cannot be considered an archive, but rather the result of a distributed web harvesting effort.¹ Many of the Torrent’s files are redundant or conflicting, because of uppercase/lowercase issues, different URLs pointing to the same file, etc.

Before these files can be put together again and served via HTTP², decisions need to be made about which of these files should be served when a certain URL is requested. For most URLs this decision is easy: a request for

http://www.geocities.com/EnchantedForest/3151/index.html

should probably be answered with the corresponding file in the directory

www.geocities.com/EnchantedForest/3151/index.html.

However this file exists three times:

www.geocities.com/EnchantedForest/3151/index.html,
geocities.com/EnchantedForest/3151/index.html (without the “www”) and
YAHOOIDS/E/n/EnchantedForest/3151/index.html.

And all of them differ in contents!
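
The lookup can be sketched as a function that lists every place a copy might live. This is a sketch under assumptions: the YAHOOIDS layout (first character, second character, then the name) is inferred from the single example above and may not hold everywhere, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

def candidate_paths(url):
    """List the places in the Torrent where a copy of the requested URL may live."""
    path = urlsplit(url).path.lstrip("/")
    candidates = [
        "www.geocities.com/" + path,
        "geocities.com/" + path,
    ]
    name = path.split("/", 1)[0]
    if len(name) >= 2:
        # Assumption: YAHOOIDS is sharded by the first two characters of the name,
        # as in YAHOOIDS/E/n/EnchantedForest/...
        candidates.append(f"YAHOOIDS/{name[0]}/{name[1]}/{path}")
    return candidates

for path in candidate_paths("http://www.geocities.com/EnchantedForest/3151/index.html"):
    print(path)
```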

While this particular case is easy (the difference between these files is a tracking code inserted by Yahoo!, randomly generated for every visitor of the page), in general each conflict has to be examined, especially when the last-modified dates differ.

Many conflicts might be resolved automatically; however, the first step was to identify conflicts and remove redundancies from the file system. This means actually changing the source material: the files contained in the Torrent and how they are organized. Ethically this is problematic, but practically I do not see any other way than to reduce the number of decisions to be made for a reconstruction.
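
The distinction between a redundancy and a conflict can be illustrated by comparing checksums of all copies of one file. The following Python sketch is an illustration only and not a port of the published scripts; the function names and the four labels are assumptions.

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's contents; fine for web pages, which are small."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def classify_copies(paths):
    """Return 'missing', 'single', 'redundant' or 'conflict' for the copies of one file."""
    existing = [p for p in paths if Path(p).is_file()]
    if not existing:
        return "missing"
    if len(existing) == 1:
        return "single"
    digests = {file_digest(p) for p in existing}
    # Identical bytes everywhere: all but one copy can go. Otherwise someone has to decide.
    return "redundant" if len(digests) == 1 else "conflict"
```

Copies that differ only in an inserted tracking code would still be reported as conflicts here; recognizing them as harmless needs an extra normalization step that this sketch leaves out.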

To make the alteration more acceptable I publish all tools that I created for handling the torrent. They can be found on github.³ merge_directories.pl is the script that did most of the conflict identifying work. The directory with scripts created for handling the Torrent contains a list of scripts that should be run in order after downloading to have the data normalized. Running them all in a row took me 29 hours and 47 minutes.

The original Geocities Torrent contains 36’238’442 files. My scripts managed to identify 3’198’141 redundant files, a quite impressive 8.83%. There are still 768’884 conflicts, 475 involving triple files. And this is even before taking into account case-related redundancies or conflicts, which I hope to solve with the help of a database.

Ingesting is next.

¹ See Tips for Torrenters, Cleaning Up The Torrent [↩]
² see Authenticity/Access [↩]
³ The simple ones are unix shell scripts, most Linux distributions come with a rich set of tools to do file handling. More complex scripts are written in Perl. I hold the opinion that anybody involved in data preservation and digital archeology should be knowledgeable in Perl for mangling large amounts of data and interfacing to everything ever invented. [↩]

please bear with me ⁋ by olia

meta + under construction 2012-05-22

Found this picture on my desktop. I can’t remember what profile it comes from. Maybe it is not even from Geocities. And I’m not sure if it was planned for this blog or for car metaphors…

This may take up to 20 sec ⁋ by olia

alive + GIF + last updated 2006 + webmaster 2012-04-28

Why were animated GIFs rarely used (outside of porn sites) to show film and video sequences in the second half of the 90’s? Because even half a second of heavily compressed and downscaled, barely recognizable footage would still be too heavy for the slow networks of that time.

In 1998, Shocking Blue fan Greg converted one second of a TV performance of the group’s hit song “Venus” into a GIF. It is 160×120 pixels, contains 15 frames and weighs 93KB. Greg didn’t dare to confront the visitors of the page with such a huge file.¹ He uses a static image and suggests starting to load the animation with a click, while being prepared that “it may take 20 sec”.
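
Greg’s estimate is easy to check with a back-of-the-envelope calculation. The line rates below are typical dial-up speeds of the period and are my assumption, as is reading 1KB as 1024 bytes; protocol overhead and modem compression are ignored.

```python
def transfer_seconds(size_bytes, line_rate_bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and modem compression."""
    return size_bytes * 8 / line_rate_bits_per_second

gif_bytes = 93 * 1024  # 93KB as stated above, assuming 1KB = 1024 bytes

for rate in (28_800, 33_600, 56_000):  # assumed dial-up line rates in bit/s
    print(f"{rate} bit/s: {transfer_seconds(gif_bytes, rate):.1f} s")
```

Under these assumptions the file needs roughly 14 to 26 seconds, so “20 sec” was a fair warning.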

Starting image

Animated GIF

The moment in the original video

Original URL: http://www.geocities.com/ofmang/greg/shockblu.html

¹ Just for comparison, the third picture in the Alternative Animated GIF Timeline, a video loop as common in GIFs today, is 461×322 pixels, 2MB, 42 frames. [↩]

word ⁋ by olia

alive + last updated 2009 2012-04-25

Original URL: http://www.geocities.com/vienna/4302/

V(RM)Lcome to my page! ⁋ by olia

alive + GIF + last updated 2001 + welcome 2012-04-25

Original URL: http://www.geocities.com/~johanh/

The page is still on(VRM)line.

Authenticity/Access ⁋ by despens

meta 2012-04-09

Access to the remains of Geocities can be measured on two axes: authenticity (how realistically the harvested data can be presented again) and ease of access (what technical requirements and what knowledge are needed to gain access at a certain level of authenticity).

Graphical authenticity on the pixel level means that a Geocities web page will render exactly as it would have to visitors at the time the page was published. This might be considered esoteric on the grounds that web authors could never be sure how their creations would appear to their audience, by the nature of HTML itself. However, web authors often were not aware of this fact, and people in the 1990’s used different browsers on different operating systems than we do today. The dominance of, for example, Netscape browsers and a certain set of plugins meant that most people would experience a web page in a certain way that was very different from accessing it now with WebKit-based browsers.

What has changed most significantly in operating systems since the high times of Geocities is the display of text. All current operating systems render characters with smoothed out edges, and this is reflected as much in current web design as the historic aliased pixel text display influenced the web design of the past.

Low barrier access to original visual web culture can be provided by screenshots taken from virtual machines running an historic operating system and browser.

Today, MIDI music files do not sound at all as they used to when they dominated web audio. These audio files can be recorded from historic hardware and operating systems or rendered with emulators for contemporary listeners. This will strip them of any context, but can still give a good, easily accessible impression of how the web sounded at a certain point in time.¹

The web pages’ original interactivity can be restored by accessing them from a mirror and employing a browser addon that re-writes the original URLs contained in the HTML to match the mirror’s URLs.²

If the data should not be tampered with, neither on the client nor on the mirror server, and URLs shown in the browser’s address bar should be the exact originals, a proxy server must be put in front of the mirror server. This proxy can transparently redirect requests for, say, http://www.geocities.com/ to http://mirror.local/www.geocities.com/ without the need for any name server tricks or changing historic HTML code. It is trivial to write such a proxy, for example in nodejs.
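
For illustration, here is a minimal sketch of such a proxy. It is written in Python rather than nodejs and is not the proxy actually used for this project. The mirror address comes from the example above; the list of allowed hosts, the port and all names in the code are assumptions. It only handles GET requests and refuses everything that is not a Geocities URL.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import HTTPError, URLError
from urllib.parse import urlsplit
from urllib.request import urlopen

MIRROR_BASE = "http://mirror.local"  # mirror address from the example in the text
ALLOWED_HOSTS = {"www.geocities.com", "geocities.com"}  # assumption: hosts to serve

def rewrite_to_mirror(url):
    """Map an original URL to its location on the mirror, or None if not allowed."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "http" or host not in ALLOWED_HOSTS:
        return None
    target = f"{MIRROR_BASE}/{host}{parts.path or '/'}"
    if parts.query:
        target += "?" + parts.query
    return target

class MirrorProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # A browser configured to use a proxy sends the full URL in the request line.
        target = rewrite_to_mirror(self.path)
        if target is None:
            self.send_error(403, "Only Geocities URLs are served")
            return
        try:
            with urlopen(target) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get("Content-Type", "application/octet-stream")
        except HTTPError as err:
            self.send_error(err.code)
            return
        except URLError:
            self.send_error(502, "Mirror not reachable")
            return
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def run(port=8080):
    """Start the proxy on localhost only; call run() to serve."""
    ThreadingHTTPServer(("127.0.0.1", port), MirrorProxy).serve_forever()
```

The address bar keeps showing the original URL because the browser never learns about the mirror; only the proxy does.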

As even the oldest browsers support proxy servers, one can employ this system on a virtualized historic operating system like Windows 95 or Windows 98 with authentic web browsers. This significantly raises the access barrier though, as access to such a proxy must be technically restricted, or the Internet will collapse.³ Also, virtual machine software, historic operating systems and historic browsers and plugins are not easily available to the largest part of web users that might be interested in looking at their cultural past.

The largest authenticity surface is covered by using all the aforementioned techniques and running historic hardware. Current virtualization and emulation systems are quite good at re-creating all aspects of 1990’s computers, except audio. There are just not enough business critical MIDI files out there to make companies like VMware emulate OPL3 sound chips. Additionally, all graphical output looks very different on CRT monitors, with their special surface-to-pixel ratios, than it does on contemporary flat screens. For example, when looking at historic web pages on an 800×600 pixel 14″ CRT screen with a 60Hz refresh rate, it becomes clear why many people decided to use dark backgrounds and bright text for their designs instead of emulating paper with black text on a white background.

Balancing

While restoration work must be done on the right end of the scale to provide a very authentic re-creation of the web’s past, it is just as important to work on every point of the scale in between to allow the broadest possible audience to experience the most authentic re-enactment of Geocities that is comfortable for consumption on many levels of expertise and interest.

¹ The artist Ryder Ripps for example published MP3 recordings of MIDI files as online mix tapes and a vinyl record containing recordings of a selection of classic MIDI files. Of course these projects present only “snapshots” of a certain kind of MIDI playback sound and the selection process definitely targets musically interested people of today, but no doubt these efforts will keep the sound present. [↩]
² See Tips for Torrenters on this blog for a version of a browser addon that does this. reocities takes a similar approach and even automatically changed all HTML code so it validates and is more likely to “work” on contemporary browsers. While this seems questionable from an archivist’s point of view it surely allows broader access to the historic data. [↩]
³ The most common problem with open proxies is, as experience from my project insert_coin shows, that people from Saudi Arabia use up a lot of bandwidth while accessing pornographic material that is blocked in their home country. [↩]
