Organizing the Database | One Terabyte of Kilobyte Age

A Research Database for Everybody

Geocities was a people’s effort, and it should continue belonging to the people, even after its demise. This is important to keep in mind when doing research on the millions of home pages that have been rescued. The results of this research should be easy to build new research upon for any interested party.

The scripts and system setup guides I am preparing to be published on GitHub will make it relatively comfortable to work with parts of Geocities that already have been recovered or will be recovered in the future.

Using a database for helping to make sense of this huge body of data is a very straightforward idea, but how to organize it so that it is as portable as the rest of the project? Additionally, the database must be able to work with multiple, contradicting versions of the same file that where harvested by different people at different points in time, using different tools, making different mistakes. And it should be possible to dump the complete database, or parts of it, hand it over to another researcher, and later merge the changes.

Naïve database designs¹ usually work with surrogate keys to uniquely identify records. The simplest form of such a key is a counter that increases with every new database record.² So the first entry gets an id of 1, the second an id of 2, and so forth. Another approach to generating surrogate keys is randomizing numbers or strings. Because such keys have nothing to do with the records they help identify, they are called “surrogates.”

This is fine for a centralized system collecting data in one place, but unfit for a distributed case as outlined above. Without knowing the state of such a counter on one computer it is impossible to add entries on another computer without the possibility of a key collision when the databases are merged again. Even when staying isolated inside one computer, the keys’ values are — in the case of a serial counter — determined by the in many cases arbitrary timely order in which records are put into the database.³ This is critical because it is likely that it will be infeasible to distribute the over-sized database as a whole, and rather distribute scripts that generate it.

The solution is to use natural keys to identify database records. These keys are generated from unique properties of the records they identify and therefore are predictable. This approach leads to certain constraints that will be explained below.

Truth and opinion

Trying to normalize a dataset like Geocities seems like heresy in the first place. Most of the insights waiting for discovery inside of it are not enumerable. So the database has to reflect on this by separating truth and opinion.

There is little absolute truth in each file extracted from the torrent, apart from actual data and sparse filesystem metadata it contains: name, size, last-modified date and of course its contents. It is not even possible to say for sure from which URL each file was harvested: the classic wget⁴ does not save this information and the original Geocities server used case-insensitive URLs⁵, so the same file could be retrieved from many different URLs (and indeed many duplicates can be found in the collections). Then there is the classic case of a file being named www.host.com/folder/index.html when it was in fact downloaded from just http://ww.host.com/folder/ without the “index.html” part, and so on.

So, this is The Truth:

For the simple case of reviving Geocities to a state that can be experienced again in a browser, a mapping of URLs to files is needed. While the URL can be figured out in most cases with a straightforward lower-casing of the file name, it is actually guesswork for some percentage of the files. Especially in the case of duplicates, it is not easy to decide which file to serve for a certain URL.

To solve such cases, a system of agents is introduced.

Every opinionated entry in the database is signed by an agent. An agent might be a script that does some analysis of a file to extract information. Later, a human agent might correct mistakes the script has made, or a later version of the script might add new information.

A table that can be customized for each user/researcher contains information how much weight each agent has. In the case of contradicting information, the information created by the agent with the highest weight wins. — Usually, humans should win over robots.

Different agents can have the same opinion as the agent name is part of the natural key for the entry. This is vital for the weighting system to work. But each agent can have only one opinion on each topic. Sloppy scripts that made a mistake might undo it by adding corrected information, but under a different agent name — this is why the agent name for scripts should always contain the version number of the script.

Humans might need different identifiers, too, if they intend to erase a mistake they have made before. I suggest using twitter handles as identifiers, and “namespacing” them with dashes if required.⁶ For example, while human agent “despens” made some bad decisions, “despens-wiser” might overwrite them. Of course this is only required for publicly released database records or scripts.

Stay tuned as the database gets fleshed out more with additional tables. These were just examples.

About SQL

There has been a development of “new wave” databases – e.g. couchdb and mongodb, summarized under NoSQL. While they are rightfully praised for their performance and flexibility, they do not really address any problems that come up when handling complex semantic relations like it is required, as I believe, for working with something like Geocities. Instead, they make this power and complexity optional by pushing it out of the database system into the software that wants to make use of this data.

My main motivation for creating this database is to amass knowledge and interpretations about Digital Folklore, giving researchers (including Olia and myself) a tool to verify assumptions and find surprising correlations. SQL appears like a bit of a weird language, but it is tried-and-tested, can solve complex tasks and knowledge about it is wide-spread among archivists and researchers.

I decided to build everything on the PostgreSQL database, which is a free software project, refined since 1989, with mature documentation and some nice graphical frontends⁷ ; plus, PostgreSQL is actually reasonably fast when it comes to writing records.

… meaning “What I used to do until now” :) [↩]
In for example the PostgreSQL database, these counters are called serial. [↩]
For instance, when reading a list of files to add to the database, the order in which these files are read from the disk is typically determined by the order they are stored on disk. This order is quite random, to create a predictable order would mean to sort the file names before ingest. This represents just another step where a lot could go wrong. [↩]
wget is a GNU software for automatic downloading of network data. The archiveteam recently fixed some of its problems and released wget-warc. [↩]
See Cleaning Up the Torrent [↩]
Twitter, or any other service that allows only unique names, could be used as an authority; however this will only become a problem when hundreds of people start to research Geocities, which seems quite unlikely at the moment. [↩]
I prefer pgadmin. [↩]

One Terabyte of Kilobyte Age

Organizing the Database ⁋ by despens

A Research Database for Everybody

Truth and opinion

About SQL

Leave a Reply Cancel reply