Author Archives: despens

Many people started the adventure of downloading the Geocities Torrent. While I have been seeding it constantly, I saw the peer numbers drop to zero a long time ago. Anyway, I do not recommend using the torrent at the moment, since Jason Scott, now working for the Internet Archive, has put all the Geocities data into the Geocities Valhalla.

These are still the same files, hard to handle in their sheer amount, and not as neatly organized as the contents of the original torrent. If you would like to complete your canceled torrent download, or if you wish to download the whole thing via HTTP, you might like this script:

I mostly wrote it for myself to check data integrity before the first serious database ingest, but extended it to support downloading. It requires Perl 5.1x, XML::TreePP and md5sum. Hope you will find it useful.

(To all the peeps reading this with a feed reader: The post contains a Perl script, embedded via gist from github.)

A Research Database for Everybody

Geocities was a people’s effort, and it should continue belonging to the people, even after its demise. This is important to keep in mind when doing research on the millions of home pages that have been rescued. The results of this research should be easy to build new research upon for any interested party.

The scripts and system setup guides I am preparing for publication on GitHub will make it relatively comfortable to work with the parts of Geocities that have already been recovered or will be recovered in the future.

Using a database to help make sense of this huge body of data is a very straightforward idea, but how should it be organized so that it is as portable as the rest of the project? Additionally, the database must be able to work with multiple, contradicting versions of the same file that were harvested by different people at different points in time, using different tools, making different mistakes. And it should be possible to dump the complete database, or parts of it, hand it over to another researcher, and later merge the changes back.

Naïve database designs1 usually work with surrogate keys to uniquely identify records. The simplest form of such a key is a counter that increases with every new database record.2 So the first entry gets an id of 1, the second an id of 2, and so forth. Another approach to generating surrogate keys is to use random numbers or strings. Because such keys have nothing to do with the records they help identify, they are called “surrogates.”
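
For illustration, a minimal sketch of such a surrogate-key table in PostgreSQL (the table and column names are invented for this post):

    -- Naïve design: "id" is a surrogate key, a serial counter.
    -- Its value depends purely on the order of ingest, so two researchers
    -- ingesting the same files end up with different ids.
    CREATE TABLE files_naive (
        id    serial PRIMARY KEY,  -- 1, 2, 3, ... in insertion order
        path  text   NOT NULL,
        size  bigint NOT NULL
    );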

This is fine for a centralized system collecting data in one place, but unfit for the distributed case outlined above. Without knowing the state of such a counter on one computer, it is impossible to add entries on another computer without risking a key collision when the databases are merged again. Even when staying isolated inside one computer, the keys’ values are, in the case of a serial counter, determined by the often arbitrary chronological order in which records are put into the database.3 This matters because it will likely be infeasible to distribute the over-sized database as a whole; instead, scripts that generate it will be distributed.

The solution is to use natural keys to identify database records. These keys are generated from unique properties of the records they identify and therefore are predictable. This approach leads to certain constraints that will be explained below.

Truth and opinion

Trying to normalize a dataset like Geocities seems like heresy in the first place. Most of the insights waiting to be discovered inside it are not enumerable. So the database has to reflect this by separating truth from opinion.

There is little absolute truth in each file extracted from the torrent, apart from the sparse filesystem metadata and the actual data it contains: name, size, last-modified date, and of course its contents. It is not even possible to say for sure from which URL each file was harvested: the classic wget4 does not save this information, and the original Geocities server used case-insensitive URLs5, so the same file could be retrieved from many different URLs (and indeed many duplicates can be found in the collections). Then there is the classic case of a file being named www.host.com/folder/index.html when it was in fact downloaded from just http://www.host.com/folder/ without the “index.html” part, and so on.

So, this is The Truth:
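
What follows is a minimal sketch, in PostgreSQL, of a “truth” table that records nothing but these facts; the table and column names are made up for this post and are not the final schema. The key is derived from properties of the file itself, a natural key, so it comes out identical on every researcher’s machine:

    -- The Truth: only what the torrent's file system itself can tell us.
    CREATE TABLE files (
        path   text      NOT NULL,  -- file name as found in the torrent
        md5    char(32)  NOT NULL,  -- checksum of the contents
        size   bigint    NOT NULL,
        mtime  timestamp NOT NULL,  -- last-modified date
        PRIMARY KEY (path, md5)
    );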

For the simple case of reviving Geocities to a state that can be experienced again in a browser, a mapping of URLs to files is needed. While the URL can be figured out in most cases with a straightforward lower-casing of the file name, it is actually guesswork for some percentage of the files. Especially in the case of duplicates, it is not easy to decide which file to serve for a certain URL.

To solve such cases, a system of agents is introduced.

Every opinionated entry in the database is signed by an agent. An agent might be a script that does some analysis of a file to extract information. Later, a human agent might correct mistakes the script has made, or a later version of the script might add new information.

A table that can be customized by each user/researcher records how much weight each agent has. In the case of contradicting information, the entry created by the agent with the highest weight wins; usually, humans should win over robots.

Different agents can hold the same opinion, since the agent name is part of the natural key for each entry. This is vital for the weighting system to work. But each agent can have only one opinion on each topic. A sloppy script that made a mistake might undo it by adding corrected information, but only under a different agent name; this is why the agent name for scripts should always contain the version number of the script.

Humans might need different identifiers, too, if they intend to erase a mistake they have made before. I suggest using Twitter handles as identifiers, and “namespacing” them with dashes if required.6 For example, while human agent “despens” made some bad decisions, “despens-wiser” might overwrite them. Of course this is only required for publicly released database records or scripts.
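
To make the opinion side concrete, here is a rough sketch of how a URL-to-file mapping could be stored and resolved; again, the table names, the example agent name “lowercaser-0.1” and the query are my own illustration, not a finished schema:

    -- Opinions: which file to serve for a URL, signed by the agent proposing it.
    -- The agent name is part of the natural key, so contradicting opinions
    -- from different agents can sit side by side.
    CREATE TABLE url_mapping (
        url    text     NOT NULL,
        path   text     NOT NULL,      -- file to serve for this URL
        md5    char(32) NOT NULL,
        agent  text     NOT NULL,      -- e.g. 'lowercaser-0.1' or 'despens'
        PRIMARY KEY (url, agent)
    );

    -- Per-researcher weighting of agents; humans usually rank above robots.
    CREATE TABLE agent_weights (
        agent  text    PRIMARY KEY,
        weight integer NOT NULL
    );

    -- For each URL, keep only the opinion of the highest-weighted agent.
    SELECT DISTINCT ON (m.url) m.url, m.path, m.md5
    FROM url_mapping m
    JOIN agent_weights w USING (agent)
    ORDER BY m.url, w.weight DESC;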

Stay tuned as the database gets fleshed out more with additional tables. These were just examples.

About SQL

A “new wave” of databases has been developed recently, e.g. CouchDB and MongoDB, summarized under the label NoSQL. While they are rightfully praised for their performance and flexibility, they do not really address the problems that come up when handling complex semantic relations, which I believe is required for working with something like Geocities. Instead, they make this power and complexity optional by pushing it out of the database system and into the software that wants to make use of the data.

My main motivation for creating this database is to amass knowledge and interpretations about Digital Folklore, giving researchers (including Olia and myself) a tool to verify assumptions and find surprising correlations. SQL may seem like a bit of a weird language, but it is tried-and-tested, can solve complex tasks, and knowledge about it is widespread among archivists and researchers.

I decided to build everything on the PostgreSQL database, which is a free software project, refined since 1989, with mature documentation and some nice graphical frontends7; plus, PostgreSQL is actually reasonably fast when it comes to writing records.


  1. … meaning “What I used to do until now” :) []
  2. In the PostgreSQL database, for example, these counters are called serial. []
  3. For instance, when reading a list of files to add to the database, the order in which these files are read is typically determined by the order they are stored on disk. This order is quite random; creating a predictable order would mean sorting the file names before ingest, which is just another step where a lot could go wrong. []
  4. wget is GNU software for automatically downloading network data. The Archive Team recently fixed some of its problems and released wget-warc. []
  5. See Cleaning Up the Torrent []
  6. Twitter, or any other service that allows only unique names, could be used as an authority; however this will only become a problem when hundreds of people start to research Geocities, which seems quite unlikely at the moment. []
  7. I prefer pgadmin. []

FortuneCity, a free home page hosting service just like Geocities, will be closing April 30th 2012. ArchiveTeam is busy copying it, and if you want to help, look at the instructions. The distributed copying system with a centralized tracker seems quite sophisticated. While the Geocities download was organized on a wiki page, this time everything is fully automated and smooth. I am happy to contribute three computers in three different networks!

I am still working on the Geocities database, and it will be fit to hold data from FortuneCity as well. Time to buy a new harddisk.

Big news: My work on Geocities is now part of the European project Digital Art Conservation. For the whole of 2012 I will be working1 for the University of Applied Arts in Switzerland in the Analit research project, which has the stated goal of “moving from case studies to a conservation matrix” in digital archiving.

My job is to make it possible to meaningfully experience Geocities again. Of course this heap of data is a good way to examine how such artifacts can be handled. Reflecting on what I have learned so far, digital conservation and archeology are still struggling with many issues:

  • The logic of business dominates the creation and preservation of digital culture.
  • Objects considered art get a premium conservation treatment. Especially for internet art this seems weird, because if the “low culture” is not preserved next to them, the art pieces are removed from any context. Meaningful internet art is not about uploading a self-sufficient piece of data and using the net just as a distribution channel; it is deeply interwoven with the net as a whole.
  • The second kind of objects waiting in line to be conserved are off-mainstream, weird, outstanding artifacts that do not represent the day-to-day reality of digital life. Of course artifacts that resist being fed into the mainstream systems are very interesting, but to favor them distorts what actually happens and smells of hacker elitism.
  • The idea of original objects, authenticity and defined authorship still bleeds over from library science into interpreting digital mass culture.
  • The practices of how digital cultural artifacts were produced or used are not taken into account enough, as pointed out by Camille Paloque-Berges in the panel at Transmediale that she, Olia and I took part in.

Incorporating historic artifacts into new works of art or transferring them into other contexts seems to be the most successful strategy for preservation at the moment. For example, The Deleted City, an interactive piece by Richard Vijgen, is an attractive interface to Geocities data. It does not have much to do with what Geocities was or what it means today, but it attracts attention to it and does not work without keeping this data. Museums buying this artwork will have to make sure that they keep a copy of Geocities around.

If the goal is to use Geocities as research material or to make it possible to experience it again in different grades of realism or meaningfulness, it seems crucial to work with different versions and collections. There are already a lot of contradictions and redundancies, and at the end of last year Jason Scott announced on Twitter that a new pile of Geocities files is available on archive.org. I am also hoping to cooperate with the person behind oocities to merge his and the ArchiveTeam’s collections. This situation will be typical for future archiving efforts, and I will try to create a meaningful framework to handle it with a defined ingestion process.

What will guide this “normalization work” is the firm conviction that Geocities belongs to the people. Olia and I certainly hope that it can help build an identity and history for personal computer and net users. So while this blog has some code and instructions sprinkled around, everything will from now on end up on GitHub, in well-defined order and with documentation. As soon as there is something usable there, you’ll be the first to know!

Also, as the ArchiveTeam masterfully pioneered with the release of their torrent, there are other ways of keeping digital heritage alive and meaningful. We will certainly investigate several possibilities of spreading the knowledge and love.

And let me remind you that there is still this torrent seeding.


  1. Olia is still available for funding! []

What you get by downloading the Geocities Torrent is not actually an “archive”. It contains many 7zip archive files, but the way the data therein is organized is not fit for statistical analysis. The Geocities Torrent tells the story of a great disaster and salvation, also in its structure.

For example, to simply answer the question “How many Geocities accounts are contained in the torrent?” or “What was the most used divider image in a certain neighborhood?”, counting index.html files is not enough. For this, we need to know the original directory structure on the Geocities server, and since Yahoo! didn’t give anybody direct access to it, we have to rely on the information about it that was available to the Archive Team via HTTP at the time they made the copy. Also, the Archive Team had to pack the data for optimal distribution, which worked very well and created an almost entertaining downloading experience.1 But the large number of symbolic links makes it difficult to do even simple counting.

Users do not like case-sensitive file systems

Geocities used the powerful Apache web server on an unknown Unix-like operating system.2 User account names, neighborhood names plus directory and file names were stored case-sensitively there, meaning that the file “Hello.html” is different from “hello.html” or “heLLo.html”. Traditionally, most users do not understand why there should be a difference between the same name written in different cases. “Consumer” operating systems (aka Windows and Macintosh) do not distinguish case in the file system. Most users of Geocities didn’t care about case when putting links in their HTML code; for example, they would link to http://www.geocities.com/bob/dogs/ when the actual file name on the Geocities server would call for a link like http://www.geocities.com/Bob/Dogs/.

Apparently, Geocities followed two strategies for easing their users’ pain with case-sensitivity:

Symbolic Links by Geocities

They created symbolic links in their file system that pointed from Bob to bob.3 This means that when looking into the directory bob, it will always show the same content as the directory Bob, and vice-versa. Symbolic links are a powerful file system feature, but it is very easy to create train wrecks with them, for example directories that contain a link to themselves: an infinite loop in the file system. What’s worse, when looking at a site through a browser, symbolic links cannot be distinguished from a real file or directory. So both Bob and bob would appear to exist, as if there were two users instead of one. And the Archive Team of course had to save both variations, because, without looking inside each directory, they couldn’t know whether there was perhaps another user that went by the same name in lowercase.4

There are many ruins of symbolic links to be found in the torrent, especially of the type that creates infinite loops.

mod_speling

There is a plugin for the Apache web server, mod_speling, which tries to correct wrongly typed URLs and redirects the browser to the actual URL with the correct case. It appears that at some point the Geocities server was equipped with this module; otherwise it would be a miracle how all this could function at all. However, wget, the mirror tool used by the Archive Team to copy Geocities, will still save the file under the originally requested name. So if you ask wget to copy bob, it will be redirected to Bob and save what is found there, but still give it the local name bob.5 And again this results in a potential duplicate file.

This is neither the fault of the Archive Team6, nor the wget developers, nor Geocities. HTTP and HTML were designed in a certain way, but when millions of users are let loose on a technically well-defined standard, unpredictable things happen.

Where the Archive Team detected duplicate downloads, they replaced them with symbolic links in their copy’s file system. While this makes browsing the data much easier, it also raises the problem of deciding, for each operation, which type of symbolic link has to be followed. If, for example, a symlink makes a whole sub-neighborhood exist twice in two different spellings, this symlink should be ignored. A symlink to a user’s account that is stored in YAHOOIDS should be followed, though. It would be possible to develop a logic that takes all of this into account, but it will be prone to errors, meaning that some research operations would have to be repeated when bugs in research scripts are found. And each run can take ages! So it seems like a good investment to fix the file system before going any further.

Fixing

  1. Most analysis on the Geocities Torrent will have to be conducted through HTTP and an Apache webserver running mod_speling. Redirects will have to be taken into account.
  2. All symbolic links have to be resolved (a rough sketch of one such round follows below this list). Steps:
    1. Use the command find . -type l to catch the first level of symbolic links.
    2. Use readlink to determine where symlinks are pointing and replace them with the original files (first rm the symlink, then mv the original to the symlink’s location).
    3. Repeat steps 1 and 2 until no symbolic links are left.

    Of course, every round of found symlinks has to be examined manually for infinite loops or obvious traps.

  3. Find directories with “almost equal” names, e.g. names that would be matched by mod_speling. Compare the pairs’ contents. If the contents are equal, decide which is the original and delete the other. If the contents are partly different, merge the contents and keep only one version. If the contents are completely different, keep both versions. (This should probably be done using diff.)
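
Below is a rough shell sketch of one round of the symlink resolution from step 2. It assumes GNU find and readlink and that the script is run from the root of the extracted torrent; it is an illustration of the procedure, not a ready-made tool, and every round should still be checked manually before the next one.

    #!/bin/bash
    # One round of symlink resolution: replace each symlink found right now
    # with the file or directory it points to. Rerun (after a manual check
    # for loops and traps) until `find . -type l` reports nothing.
    find . -type l | while IFS= read -r link; do
        target=$(readlink -f "$link")   # where does the symlink point?
        [ -e "$target" ] || continue    # dangling or looping link: leave for manual review
        rm "$link"                      # first remove the symlink ...
        mv "$target" "$link"            # ... then move the original into its place
    done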

Each of these operations takes from hours to days. So please bear with us for a while :)


  1. How the user accounts were pouring in with the arrival of each 7zip file was simply blissful. []
  2. We know because Apache generates certain kinds of index and error pages that can be found in the torrent. Also, the file system is definitely case-sensitive. []
  3. It is not clear if users were also allowed to create symbolic links. []
  4. If they had taken the time to compare all this, there would probably be no torrent at all. []
  5. Browsers still do the same: How often did you save a PDF file with the name download.php? []
  6. In fact, using the default behaviors of standard software was the best choice in this case, because now the coming about of the data can easily be reconstructed. If the Archive Team had made assumptions on how the Geocities server was configured and had modified wget accordingly, a lot of data might have been lost. []

HTML frames, introduced to web users in 1996 with version 2.0 of the then-dominant browser Netscape Navigator, offered a way to divide the browser window into rectangles, each showing a different HTML page. A lot has been written about this most controversial tag in the history of markup languages. Already in the year it was created, usability experts announced that it breaks fundamental rules of hypertext and navigation. Users, who until this moment had happily followed every new technical enhancement given to them with each software update, developed strong opinions about whether frames should be used or not.1 Finally, in today’s web, frames are not used anymore – they have even been removed from the HTML5 standard.2

All this battle aside, the fact is: frames made possible new kinds of structures and graphical effects that could not be achieved in the browser before.3 And Geocities carries thousands of examples of how amateurs used them. Remarkably, the frames debate is very present in their pages’ designs, because visitors are in most cases given the choice: frames or no frames.


Original URL: http://www.geocities.com/SoHo/Gallery/2826/

Asking web site visitors such “technical” questions is professionally considered a capital usability offence. In some corners of today’s web you can still find the quite similar “Flash or HTML?” But frames or no frames wasn’t only a technical question, it was a new approach to dealing with controversies online: just give people a choice and everybody will be happy.

So when visitors had to decide something anyway, many webmasters used the opportunity to create welcome pages describing what lay ahead.


Original URL: http://www.geocities.com/Tokyo/Towers/7492/

In many cases their appearance differed from the following actual web site, deliberately standing out in the navigation flow and creating dramaturgy. The MIDI Universe for example embeds the question whether to use frames in a space setting. Take a step back and consider the big picture! The following pages give the impression of having zoomed into the cloud or ocean surface of the planet. Dive in!


Original URL: http://www.geocities.com/SunsetStrip/Palms/7120/

Speaking of MIDI: Of course any serious web site with music playing in the background would use frames: one constantly holding the music player while others could be used to navigate around.

Typically, one of the frames was filled with links to all parts of the site; the minimal setup consisted of just two frames, navigation aid and content. This meant that webmasters had to provide two navigation systems, one for frames and one for no frames. The most efficient way was probably to include the no frames navigation on the site’s first “content” page. The downside of this approach was that in frames mode, visitors would see the navigation twice, once in the navigation frame and once in the “content” frame.


Original URL: http://www.geocities.com/SiliconValley/Lab/5481/frames.html

Another very popular style was to create a different starting page with the navigation for no frames that would not show up in the frames context. Since webmasters offering frames usually also preferred the frames mode themselves, the no frames version would in many cases fall out of their sight, soon be forgotten and miss updates.



Original URL: http://www.geocities.com/SouthBeach/Boardwalk/2643/

Commercial web sites liked to use frames for branding. Splitting the browser window made it possible to place a visually stable logo on screen that wasn’t affected by the visitor scrolling in another frame. Of course amateurs did the same with their sites and branded them with their own names or their made-up companies’ names.


Original URL: http://www.geocities.com/SouthBeach/Sands/5486/

Some frame-branded pages would display double branding, with two different logos. The reason is similar to that of the navigation frames: the frames version shows an additional logo, and the no frames version got its own logo. I suspect that many webmasters created two different logos on purpose, because it looks much less confusing than having the same logo twice.


Original URL: http://www.geocities.com/SouthBeach/Boardwalk/2643/frames.html

Finally, the choice of frames vs no frames reflected the joy of playing with new technology. Many sites were started before frames became available as a design tool. So when webmasters added them later, they got filled with elements not necessarily related to the rest of the pages, very likely “multimedia” things. It is common that a frames site contains Java applets, MIDI music, javascripted scrolling and so on. No frames was there as a safe option, for people who hadn’t yet upgraded their browser or owned a weak computer that couldn’t handle displaying six HTML pages at once.


Original URL: http://www.geocities.com/TheTropics/Shores/9911/menu.html

Assorted examples


Original URL: http://www.geocities.com/RodeoDrive/1031/


Original URL: http://www.geocities.com/SouthBeach/9128/


Original URL: http://www.geocities.com/Hollywood/Set/4440/


Original URL: http://www.geocities.com/SouthBeach/4413/


  1. See the part about frames in Olia’s “A Vernacular Web” for more details on the impact of this tag. []
  2. Read some condolences by Tobias Leingruber. []
  3. And the later established style of frame layouts would become a strong sign for “online” in print design as well. Advertising posters for online companies, for example, would show compositions reminiscent of framesets. []

Splash screens, those web pages announcing that one is about to enter somewhere with the next click, described and recommended for commercial web sites in David Siegel’s legendary book Creating Killer Web Sites, are today thought to be one of the most annoying things the web of the 90’s had to offer. On Geocities they do not seem to be very common; instead, most page authors chose to place a “Welcome” message and offer navigation options right there.

Anyway, Prince Etrigan created a very nice splash screen: the metaphor of a state border is used to claim a place on the web, while at the same time it “allows” you to move freely. While most splash screens seem to hold travelers up, this one gives a sense of relief. There is a state (a private kingdom in fact), but this state is not like the ones in reality. Knowingly or not, this welcome greeting oscillates between Barlow’s A Declaration of the Independence of Cyberspace, ideas to bring this freedom to the actual world (e.g. as demanded by the noborder network), and virtual art states like NSKSTATE.1

We have gotten used to “traveling” freely around the internet. For 90’s web users, it was a completely new experience, and at the time a technical challenge. Today it has become a political challenge to keep it this way, while the web has also grown an oppressive face towards real-world travelers.

Original URL: http://www.geocities.com/TimesSquare/1141/


  1. That one definitely looked better in the 90’s! []

In addition to my last post on claiming copyright as a sign of strong involvement, here is another striking example: user “sunshiney”, an at the time 17-year-old girl from Norway who got online in 1993, thinks of this background as the finest creation of her life. And indeed it is marvelous. While other artists used the more or less invisible “comment” part of image files to state copyright lines, sunshiney named the file:

copyrightsunshiney.jpg

Personally, I recommend using this image, but keep the file name as a tribute to the author.

Original URL: http://www.geocities.com/SiliconValley/Lab/5481/

This new torrent of Geocities unifies the original release plus the patches. Thanks again to the Archive Team for releasing both!

With this combined torrent, it is possible to seed the data forever, without having to keep two copies of everything. And I kindly ask everybody to do so!

Download @ Piratebay

UPDATE 2011-05-02: If you had problems with the torrent before, please go and check the link again. I fixed the torrent after some reports of it not being compatible with many clients.

The Archive Team released a patch to the Geocities Torrent, fixing the last 0.05% of the download that never seemed to arrive and some broken archives that couldn’t be decrunched before. Hurray!!

It is already the second day the computer is working hard applying this patch, using scripts published here before. To use this patch we had to stop seeding the torrent, because the patched data is no longer consistent with the data the torrent is spreading. Going online again would mean erasing the patch. So, once applying the patch is completed, we will publish a new torrent containing the patches (if the Archive Team does not do it themselves in the meantime).