As mentioned in Olia’s A Vernacular Web, people writing in the Cyrillic alphabet had lots of fun before Unicode became common. Every operating system sported multiple ways of mixing Cyrillic and ASCII letters into one 8-bit encoding scheme.
Many Russian web sites had to offer their text in several encodings to accommodate all possible visitors, and a strip of buttons offering these encodings could be found at the top of many pages. Collections of such button graphics in different styles became part of web folklore.
And of course many Geocities users wrote in the Cyrillic alphabet. Vladimir Sitnik from Artemovsky created a site for his local Karate Dojo. A typical link to a photo on his website would look like this:
pict_e.html?RU?images/040502b.jpg?Наша%20команда?с%20наградами%20из%20Одессы%202004
Naturally, during the Archive Team’s backup, this confused wget quite a lot and it created many files looking like this:
Since most current Linux systems use Unicode for file names, the 8-bit encoded Cyrillic letters are invalid and printed as question marks. The HTML code of any of the files in the directory reveals the encoding used by Vladimir Sitnik:
<meta http-equiv="Content-Type" content="text/html; Charset=Windows-1251">
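Reading that declaration out of a saved page can be automated. The following is a minimal Python sketch, not the procedure used here: the function name `sniff_charset` and the regular expression are illustrative assumptions, and it only covers `Charset=` declarations of the kind shown above.

```python
import codecs
import re

# Matches "charset=..." case-insensitively on raw bytes, because the
# encoding of the file is not yet known at this point.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE)

def sniff_charset(html_bytes):
    """Return Python's codec name for the declared charset, or None."""
    match = CHARSET_RE.search(html_bytes)
    if match is None:
        return None
    declared = match.group(1).decode("ascii")
    try:
        # codecs.lookup resolves aliases such as "Windows-1251"
        return codecs.lookup(declared).name
    except LookupError:
        return None

page = b'<meta http-equiv="Content-Type" content="text/html; Charset=Windows-1251">'
print(sniff_charset(page))
```

Python's codec registry treats Windows-1251 as an alias of cp1251, so the result can be compared directly with a codepage name.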
Using the handy tool convmv with the codepage cp1251
and switching the terminal to the same encoding for an especially nasty case, I managed to restore the file names:
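What such a recoding does to a single name can be sketched in Python. This is an illustration under assumptions, not the command used above: `restore_name` is a hypothetical helper, and the sample bytes are built from the caption in the link rather than taken from the actual files.

```python
def restore_name(raw_name, codepage="cp1251"):
    """Decode a file name stored as 8-bit bytes and re-encode it as UTF-8."""
    return raw_name.decode(codepage).encode("utf-8")

# Hypothetical sample: the first caption line as Windows-1251 bytes,
# the form in which it would have ended up in a file name.
raw = "Наша команда".encode("cp1251")

# Read as UTF-8, these bytes are invalid, which is why they show up as
# question marks in a Unicode terminal.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

fixed = restore_name(raw)
print(valid_utf8, fixed.decode("utf-8"))
```

On disk, the same function could be applied to the byte-string names returned by `os.listdir(b".")` before passing the result to `os.rename`.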
Only after completing this tedious operation did I check that each of the pict_e.html file variations was actually the same, apart from some minor tracking code inserted by the Geocities server; there is no reason to really keep more than one.[2] The variations were all generated because Vladimir Sitnik successfully worked around the fact that Geocities did not offer any kind of server-side scripting or usable templating mechanism, yet he did not want to create hundreds of pages with just a different photo on each. So he had the quite exotic idea of using HTTP GET parameters to create photo pages via Javascript.
In his links, the parameters are separated by question marks: interface language, file name of the photo, caption lines. pict_e.html then uses the Javascript instruction document.writeln to create the actual HTML. So that is a minimal content management system on a free service that does not offer content management systems. Almost AJAX, but Netscape 4 compatible!
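The splitting that pict_e.html performs can be sketched as follows. This is a Python illustration rather than the page's actual Javascript, which is not reproduced here; the function names and the generated markup are assumptions, and only the parameter order (language, photo, caption lines) comes from the link shown earlier.

```python
from html import escape
from urllib.parse import unquote

LINK = ("pict_e.html?RU?images/040502b.jpg"
        "?Наша%20команда?с%20наградами%20из%20Одессы%202004")

def parse_photo_link(link):
    """Split a link on question marks into page, language, image, captions."""
    page, *params = link.split("?")
    # unquote turns %20 back into spaces; Cyrillic letters pass through as-is
    language, image, *caption = [unquote(p) for p in params]
    return {"page": page, "language": language,
            "image": image, "caption": caption}

def render_photo_page(link):
    """Build a small HTML fragment from the parameters (markup is assumed)."""
    info = parse_photo_link(link)
    image = escape(info["image"], quote=True)
    parts = ['<img src="' + image + '">']
    parts += ["<p>" + escape(line) + "</p>" for line in info["caption"]]
    return "\n".join(parts)

print(render_photo_page(LINK))
```

Because every photo is described entirely by its link, one template page is enough for the whole gallery, which is also why the archived copies of pict_e.html differ only in their file names.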
Considering this page was made by an “amateur”, created in 2000 and last updated in 2006, it already shows that much better scraping tools than wget are needed,[3] even when it comes to pages that give a very modest impression to “professionals”. Users have been very inventive in all aspects of web authoring: they did not care for any specification or standard, they went ahead and pushed the borders, trying all the new options presented to them in the browser wars.
Original URL: http://www.geocities.com/soho/Museum/1134/ph4tu_r.html
- All examples lifted from A Vernacular Web. [↩]
- Except for studying wget behavior or Geocities’ random number generator. [↩]
- Maybe something based on PhantomJS? [↩]