A Research Database for Everybody

Geocities was a people’s effort, and it should continue belonging to the people, even after its demise. This is important to keep in mind when doing research on the millions of home pages that have been rescued. The results of this research should be easy for any interested party to build new research upon.

The scripts and system setup guides I am preparing for publication on GitHub will make it relatively comfortable to work with the parts of Geocities that have already been recovered or will be recovered in the future.

Using a database to help make sense of this huge body of data is a very straightforward idea, but how should it be organized so that it is as portable as the rest of the project? Additionally, the database must be able to work with multiple, contradictory versions of the same file that were harvested by different people at different points in time, using different tools, making different mistakes. And it should be possible to dump the complete database, or parts of it, hand it over to another researcher, and later merge the changes back.

Naïve database designs1 usually work with surrogate keys to uniquely identify records. The simplest form of such a key is a counter that increases with every new database record.2 So the first entry gets an id of 1, the second an id of 2, and so forth. Another approach to generating surrogate keys is to use random numbers or strings. Because such keys have nothing to do with the records they help identify, they are called “surrogates.”

This is fine for a centralized system collecting data in one place, but unfit for a distributed case as outlined above. Without knowing the state of such a counter on one computer, it is impossible to add entries on another computer without risking a key collision when the databases are merged again. Even when staying isolated on one computer, the keys’ values are — in the case of a serial counter — determined by the often arbitrary temporal order in which records are put into the database.3 This is critical because it will likely be infeasible to distribute the over-sized database as a whole; instead, scripts that generate it will be distributed.
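The merge problem can be shown in a few lines. This is a minimal sketch of my own (the file names are invented): two databases hand out serial ids independently, and the ids collide as soon as the databases are combined.

```python
# Two researchers ingest different files in isolation; each serial
# counter starts at 1, so both databases hand out the same id.
db_a = {1: "www.geocities.com/SoHo/7373/index.html"}
db_b = {1: "www.geocities.com/Heartland/1234/cat.gif"}

# A naive merge: both records claim id 1, so one silently
# overwrites the other and a record is lost.
merged = {**db_a, **db_b}
assert len(merged) == 1  # two records went in, only one survived
```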

The solution is to use natural keys to identify database records. These keys are generated from unique properties of the records they identify and therefore are predictable. This approach leads to certain constraints that will be explained below.
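As an illustration (my own sketch, not the project’s actual key scheme), a natural key can be computed as a hash over exactly those file properties that are treated as truth, so every researcher who ingests the same file arrives at the same key, on any machine and in any order:

```python
import hashlib

def natural_key(path: str, size: int, mtime: str, content: bytes) -> str:
    """Derive a predictable key from the file's own properties."""
    material = f"{path}\0{size}\0{mtime}\0".encode() + content
    return hashlib.sha256(material).hexdigest()

# The same file yields the same key wherever and whenever it is ingested.
k1 = natural_key("www.host.com/folder/index.html", 1024, "2001-06-01", b"<html>")
k2 = natural_key("www.host.com/folder/index.html", 1024, "2001-06-01", b"<html>")
assert k1 == k2
```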

Truth and opinion

Trying to normalize a dataset like Geocities seems like heresy at first. Most of the insights waiting for discovery inside of it are not enumerable. So the database has to reflect this by separating truth and opinion.

There is little absolute truth in each file extracted from the torrent, apart from the sparse filesystem metadata it carries (name, size, last-modified date) and of course its contents. It is not even possible to say for sure from which URL each file was harvested: the classic wget4 does not save this information, and the original Geocities server used case-insensitive URLs5, so the same file could have been retrieved from many different URLs (and indeed many duplicates can be found in the collections). Then there is the classic case of a file being named www.host.com/folder/index.html when it was in fact downloaded from just http://www.host.com/folder/ without the “index.html” part, and so on.

So, this is The Truth:

For the simple case of reviving Geocities to a state that can be experienced again in a browser, a mapping of URLs to files is needed. While the URL can be figured out in most cases with a straightforward lower-casing of the file name, it is actually guesswork for some percentage of the files. Especially in the case of duplicates, it is not easy to decide which file to serve for a certain URL.
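The straightforward case can be sketched like this (a hypothetical helper of mine, not code from the project): lower-case the stored file name to get a URL, and treat a trailing index.html as ambiguous, since the file may equally well have been fetched from the bare directory URL.

```python
def guess_urls(filename: str) -> list[str]:
    """Return the plausible source URLs for a harvested file."""
    url = "http://" + filename.lower()
    candidates = [url]
    # an index.html may have been served for the directory URL itself
    if url.endswith("/index.html"):
        candidates.append(url[: -len("index.html")])
    return candidates

# Both the explicit file URL and the bare directory form are plausible:
print(guess_urls("www.GeoCities.com/SoHo/7373/Index.HTML"))
```

Deciding which of the candidates is the real source URL, and which duplicate to serve for it, is exactly the kind of guesswork the agent system below is meant to record.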

To solve such cases, a system of agents is introduced.

Every opinionated entry in the database is signed by an agent. An agent might be a script that does some analysis of a file to extract information. Later, a human agent might correct mistakes the script has made, or a later version of the script might add new information.

A table that can be customized by each user/researcher contains information about how much weight each agent has. In the case of contradictory information, the entry created by the agent with the highest weight wins. Usually, humans should win over robots.

Different agents can hold the same opinion, since the agent name is part of the natural key for the entry. This is vital for the weighting system to work. But each agent can have only one opinion on each topic. A sloppy script that made a mistake might undo it by adding corrected information, but under a different agent name; this is why the agent name for scripts should always contain the version number of the script.
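A minimal sketch of the resolution rule (the agent names and weights are invented for illustration): every opinion is keyed by topic and agent, and a contradiction is settled by picking the opinion of the agent with the highest weight.

```python
# Per-researcher weight table: the human agent outweighs the script.
weights = {"url-guesser-0.3": 10, "despens": 100}

# Opinionated entries, each signed by an agent: (topic, agent, value).
opinions = [
    ("url-of-file-42", "url-guesser-0.3", "http://www.host.com/a/index.html"),
    ("url-of-file-42", "despens", "http://www.host.com/a/"),
]

def resolve(topic: str) -> str:
    """Settle a contradiction: the heaviest agent's opinion wins."""
    relevant = [o for o in opinions if o[0] == topic]
    return max(relevant, key=lambda o: weights[o[1]])[2]

assert resolve("url-of-file-42") == "http://www.host.com/a/"
```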

Humans might need different identifiers, too, if they intend to erase a mistake they have made before. I suggest using Twitter handles as identifiers, “namespacing” them with dashes if required.6 For example, while human agent “despens” made some bad decisions, “despens-wiser” might overwrite them. Of course this is only required for publicly released database records or scripts.

Stay tuned as the database gets fleshed out more with additional tables. These were just examples.

About SQL

There has been a wave of “new” databases – e.g. CouchDB and MongoDB, summarized under NoSQL. While they are rightfully praised for their performance and flexibility, they do not really address the problems that come up when handling complex semantic relations, as is, I believe, required for working with something like Geocities. Instead, they make this power and complexity optional by pushing it out of the database system and into the software that wants to make use of the data.

My main motivation for creating this database is to amass knowledge and interpretations about Digital Folklore, giving researchers (including Olia and myself) a tool to verify assumptions and find surprising correlations. SQL may appear to be a bit of a weird language, but it is tried and tested, can solve complex tasks, and knowledge about it is widespread among archivists and researchers.

I decided to build everything on the PostgreSQL database, a free software project refined since 1989, with mature documentation and some nice graphical frontends7; plus, PostgreSQL is actually reasonably fast when it comes to writing records.


  1. … meaning “What I used to do until now” :) []
  2. In the PostgreSQL database, for example, these counters are called serial. []
  3. For instance, when reading a list of files to add to the database, the order in which these files are read is typically determined by the order they are stored on disk. This order is quite random; creating a predictable order would mean sorting the file names before ingest, which is just another step where a lot could go wrong. []
  4. wget is GNU software for automatically downloading network data. The ArchiveTeam recently fixed some of its problems and released wget-warc. []
  5. See Cleaning Up the Torrent []
  6. Twitter, or any other service that allows only unique names, could be used as an authority; however, this will only become a problem when hundreds of people start to research Geocities, which seems quite unlikely at the moment. []
  7. I prefer pgadmin. []

FortuneCity, a free home page hosting service just like Geocities, will be closing April 30th 2012. ArchiveTeam is busy copying it, and if you want to help, look at the instructions. The distributed copying system with a centralized tracker seems quite sophisticated. While the Geocities download was organized on a wiki page, this time everything is fully automated and smooth. I am happy to contribute three computers in three different networks!

I am still working on the Geocities database, and it will be fit to hold data from FortuneCity as well. Time to buy a new hard disk.


I was happy to present our imaginary social networks and services of 1997 — Once Upon — at Unlike Us, hosted by the Institute of Network Cultures in Amsterdam.

It was a great event with a truly interested and competent audience. The only problem was that in the end it was mostly about Facebook, even though in The Netherlands Hyves still has more users than Facebook.

That’s why it was very important to see the presentation by Philipp Budka from the Department of Social and Cultural Anthropology at the University of Vienna. He talked about Myknet, an obscure Canadian social network and web hosting service.

[…] a system of personal homepages intended for remote First Nations users in Northern Ontario. This free of charge, free of advertisements, locally-supported online social environment grew from a constituency base of remote First Nations in a region where numerous communities lived without adequate residential telecom service well into the millennium (Ramirez, Aitkin, Jamieson, & Richardson, 2003; Fiser, Clement, & Walmark, 2006). MyKnet.org now hosts over 30,000 registered user accounts, of which approximately 25,000 represent active homepages. This is particularly notable considering that the system primarily serves members of Northern Ontario’s First Nations whose combined population is approximately 45,000 (occupying a geographic area comparable to the size of France). Equally significant is that over half of this population is under the age of 25, making MyKnet.org primarily a youth-driven online social environment.

A simple alphabetic list of all the users makes it possible to go from account to account. You can also look at the last updated ones and see that there were already almost 400 pages updated today. It is only around 14:00 in Ontario now.

The first impression: Myknet users make their pages in a special way that is obviously influenced by some simple and permissive content management system or a custom-made site builder. It would be interesting to get access to it.

Clearly, the pages are made for low resolution displays and slow connections.

The most exciting fact is that they look exactly like classic home pages, but many are used like tweets or status updates. Visually and structurally it is an unexpected mix. See examples at the top of the article.
Many users have moved to Facebook, leaving their Facebook badge on Myknet — functioning like classic “this page has moved” notices.

Many accounts are deleted or were last updated in the previous decade, which the system interprets as if they never existed.

Working Title:
From Heartland Neighborhood to Pinterest.com.
The Rise and Fall of “organizing and sharing the things you love” Culture.

Fig.1.1
Homepage of heartlandhelpinghands, Heartland’s satellite.

Fig.1.2
Pinterest.com with menu “Everything” unfolded.

A very interesting thread on Reddit.
Geocities is mentioned quite often, sometimes seriously, sometimes as a joke.
I liked the comment “You think you’re being funny, but Geocities probably should be on that list.”
Some mention Altavista, Napster and other dead services; others mention what is hot right now and really do believe that “the internet is only 8 years old.”
I could read this discussion forever. Still, this is my favorite reaction: “I somehow read this as the “Stevie Wonders” of the internet.”

Big news: My work on Geocities is now part of the European project Digital Art Conservation. For the whole of 2012 I will be working1 for the University of Applied Arts in Switzerland in the Analit research project, which has the stated goal of “moving from case studies to a conservation matrix” in digital archiving.

My job is to make it possible to meaningfully experience Geocities again. Of course this heap of data is a good way to examine how such artifacts can be handled. Reflecting on what I have learned so far, digital conservation and archeology are still struggling with many issues:

  • The logic of business dominates the creation and preservation of digital culture.
  • Objects considered art get a premium conservation treatment. Especially for internet art this seems weird, because if the “low culture” is not preserved next to them, the art pieces are removed from any context. Meaningful internet art is not about uploading a self-sufficient piece of data and using the net just as a distribution channel; it is deeply interwoven with the net as a whole.
  • The second kind of objects waiting in line to be conserved are off-mainstream, weird, outstanding artifacts that do not represent the day-to-day reality of digital life. Of course artifacts that refuse to be fed into mainstream systems are very interesting, but favoring them distorts what actually happens and smells of hacker elitism.
  • The idea of original objects, authenticity and defined authorship still bleeds over from library science into interpreting digital mass culture.
  • Practices of how digital cultural artifacts were produced or used are not taken into account enough, as pointed out by Camille Paloque-Berges in the panel at Transmediale that she, Olia, and I took part in.

Incorporating historic artifacts into new works of art or transferring them into other contexts seems to be the most successful strategy for preservation at the moment. For example, The Deleted City, an interactive piece by Richard Vijgen, is an attractive interface to Geocities data. It does not have much to do with what Geocities was or what it means today, but it attracts attention to it and does not work without keeping this data. Museums buying this artwork will have to make sure that they keep a copy of Geocities around.

If the goal is to use Geocities as research material or make it possible to experience it again in different grades of realism or meaningfulness, it seems crucial to work with different versions and collections. There are already a lot of contradictions and redundancies, and at the end of last year, Jason Scott announced on Twitter that a new pile of Geocities files is available on archive.org. I am also hoping to cooperate with the person behind oocities to merge his and the ArchiveTeam’s collections. This situation will be typical for future archiving efforts, and I will try to create a meaningful framework to handle it with a defined ingestion process.

What will guide this “normalization work” is the firm conviction that Geocities belongs to the people. Olia and I certainly hope that it can help build an identity and history for personal computer and net users. So while this blog has some code and instructions sprinkled around, everything will now end up on GitHub, in well-defined order and including documentation. As soon as there is something usable there, you’ll be the first to know!

Also, as the ArchiveTeam masterfully pioneered with the release of their torrent, there are other ways of keeping digital heritage alive and meaningful. We will certainly investigate several possibilities of spreading the knowledge and love.

And let me remind you that there is still this torrent seeding.


  1. Olia is still available for funding! []

The web is almost 20 years old now. Throughout these years, it has transformed from a new medium to new media in the very best sense of the word – it continues to evolve. Its “stupidity”, its neutrality, which lies at the core of internet architecture, allows it to continue to grow without ever really growing old or mature.

Dealing with an eternally young medium means that we always have to deal with something new – technically as well as ideologically and aesthetically. It means that the dying of web pages, users or services is seen as a natural process. And it makes no sense to speak of a project reaching middle age, because age has no value here. Getting old is something that you don’t do on the net.

The Geocities archive provides us with the experience of getting old. Coming into contact with aged pages is an important lesson that defies the impression that on the net, everything always happens in the present.

Ruins and Templates of Geocities is the first chapter of the publication Still There. It is based on this blog’s posts tagged with “ruins”, “templates” and “alive”.

We made a new work about the web of the 90s — Once Upon. Thanks to top bloggers and twitterers it was reblogged, retweeted and got quite some attention and positive response.

But we can’t ignore that links to “Once Upon” were mostly tagged with humor, web humor, geek humor and the like. Which by itself is ok. After all, we are not deadly serious: we laugh about the architecture and appearance of today’s social networks and smile at our own obsession with designing framesets and tables.

What really troubles us is the idea that we wanted to make fun of the WWW of the last century, that we wanted to show how ugly Facebook etc. could look, or how ugly the web looked then. Even the most loyal commenters, even the most nostalgic ones, notice that we make the services “look ugly”1. Though we didn’t. Neither intentionally nor out of ignorance.

We actually find our designs beautiful (as well as the design of many Geocities pages we analyze in this blog), but here everybody is free to disagree. We really insist that our Once Upon designs are browser-specific: they are modular — you see the borders between data of different formats like pictures, text, and the markup itself; and they don’t hide markup — you see the borders of tables, the borders of frames, scrollbars, the default colors of links and background. In a slightly exaggerated manner they show the web’s native aesthetics.

And, sadly, this browser specificity is exactly what is considered to be “ugly”.

Remember Ambition/Fear, the famous text by typography gurus Zuzana Licko and Rudy VanderLans, founders of Emigre magazine? In 1989 they reassured fellow designers:

There is nothing intrinsically “computer-like” about digitally generated images. Low-end devices such as the Macintosh do not yield a stronger inherent style than do the high-end Scitex systems, which are often perceived as functioning invisibly and seamlessly. This merely shows what computer virgins we are. High-end computers have been painstakingly programmed to mimic traditional techniques such as airbrushing or calligraphy, whereas the low-end machines force us to deal with more original, sometimes alien, manifestations. Coarse bitmaps are no more visibly obtrusive than the texture of oil paint on a canvas, but our unfamiliarity with bitmaps causes us to confuse the medium with the message. Creating a graphic language with today’s tools will mean forgetting the styles of archaic technologies and remembering the very basic of design principles.

Belief in “the basic design principles” and the desire to overcome the limitations of young technologies are mainstream. Professional screen design is all about forgetting. Forgetting and ignoring the browser, the interface, pixels, …

We should do something about it in 2012.

Happy New Year

olia


  1. See for example ubergizmo’s blog post. []


Jason Scott answers Dragan’s question about GeoCities profiles that have apparently survived during VERBINDINGEN/JONCTIONS 13 2011-12-04.

In his opinion, this happened to people who paid for Yahoo! Web Hosting, an option suggested to GeoCities users when Yahoo! announced its end. And because of a technical glitch, not only the newly bought domain names led to the old content, but also the classic GeoCities URLs.

We just looked at Yahoo!’s help and found something else: a supposed premium GeoCities account called GeoCities Plus that promised unupdatable eternity, WWW’s hell:

If you’re a Yahoo! GeoCities Plus customer, your friends and family can still view your web site as usual. However, you can no longer access your files or update your pages with GeoCities tools. To update your site, you’ll need to upgrade to Web Hosting.

You won’t believe it, but several hours ago, while surfing and looking for something unrelated to Geocities, we found a Geocities page that still exists! On Geocities!
Not a folder with templates by Yahoo. Not an invisible GIF.

But a real profile of a real user!

Namely famous ASCII artist Joan G. Stark.
http://www.geocities.com/spunk1111/
Last updated in 2001.

We rubbed our eyes, reloaded and Shift-reloaded, but the miracle didn’t disappear. The page is still there. And that’s not all: further research showed that Spunk’s previous account /7373 in the SoHo neighborhood is also online. http://www.geocities.com/SoHo/7373/
Last updated in 2001 as well.

Both profiles are almost identical. And there is a third one — http://www.ascii-art.com/.

But it is only an index page, not updated since 2001 and squatted by porn spam; if you click “enter” you are back at Geocities/SoHo/7373/.

What’s going on? How did it happen? Was it forgotten? Protected? Paid? Are there other survivors?

Update: the question about other survivors is answered in comments by Nick and Google http://www.google.ca/search?q=site%3Ageocities.com.
What are all these profiles doing there?