Internet researchers are confronted with an instable object of study, the ephemerality of the object. The question is how to make the medium permanent so we can study it with care? The shape of the archive informs what I can ask the archive.
This perspective on archives is placed within Weltevrede’s research into National Webs. To think nationally with the web might seem counterintuitively at first because dominant ideas of the web are so global. This originates from the 90s idea of Cyberspace which is a universal space with ideas of disembodiment and identity play. Crucially, cyberspace is a place that is disembedded from reality. After 2000 cyberspace was confronted with what Weltevrede calls “the national turn.”
This may be seen in a number of places, probably most familiar is Google.com redirects you to the location you are at, for example Google.nl and you get a totally different result page. Another example is “This video is not available in your country” intellectual property is really dominant in the nationalization of web content. You might also think in the terms of language. English used to be the dominant universal language, there is a lot of clustering happening on the web based on a shared language.
To move to the web archive, the most exhaustive project in the field is the Internet Archive which originates from the cyberspace period (1996.) This can also be seen in how the archive was set up. First of all, the scope of the collection is the “whole” internet which is a very broad collection aim. Secondly, when you look at the interface of the archive, the Wayback Machine, what you immediately notice is that you query it by URL and browse from that point on. It is characterized by browsing instead the current dominant form: searching. The Internet Archive therefor privileges single site histories instead of researching its context.
The Internet Archive emerged from the web company Alexa and Alexa provides all the crawls and donates it to the archive. This means that the selection of sites is based on traffic data. If you have the Alexa toolbar installed every page you visit will be included in the archive. It is a very smart way to start thinking about which pages should be included in the archive. After the Internet Archive in 1996 a number of initiatives emerged with a national focus. The general thought behind that was that national web archives can best serve local wishes and demands and serve the community (researchers, general public) best.
As an example we will look at a Dutch web archive maintained by the Royal Library of the Netherlands, the KB. Before we go into the actual project, let’s get a size of the Dutch web. The .nl domain is the fourth largest country domain with 3.2 million sites, an enormous amount.
How to demarcate the national web
- .nl is the 4th largest country domain
- A second way to look at the national web (.nl is not the whole Dutch web you could argue) we can look at all the domains registered by the Dutch (sidn.nl 2008)
- What do we Dutch people find relevant sites? We can look at the most visited websites as listed by Alexa. We find these sites important through the number of visits.
These are three ways to think of how to define the national web by web means. The definition of the national aspect as used by the Royal Library is. They created a new definition of what is Dutch content.
- A: Website in Dutch, registered in the Netherlands
- B: Website in another language, registered in the Netherlands
- C: Website in Dutch, registered in another country
- D: Website in another language, registered in another country, topic aimed at the Netherlands.
All of these options seem technically feasible except for the last one. We cannot technically or automatically define content that is aimed at the Netherlands. It makes it highly unlikely that this Dutch web can be archived. What the Royal Library has done, is leave this definition and manually select sites. They started with 100 sites, it became 400 and now just over a 1000. They archive those sites really well.
As an internet researcher Weltevrede is particularly interested in the dynamics of websites. The contribution she would like to put forward is how else can we approach the object of collection, the Dutch web?
If you start web archiving the most easy and effective method is to follow the possibilities of the medium. You can automate a lot of things and besides that you can also focus on the context and prominence of the website in a particular period. The first point calls attention to the challenge to develop methods that follow the medium to automate the collection process. You could
schedule Google.nl for the query “.nl” because Google takes into account what is relevant, links to a website. These are not only considered relevant by Google but by a large group of people. Hyperlink structures are human acts of association, links die and emerge, what would that information provide us about the context and its network? If you would schedule it over time you could see the relevance of a particular source in a particular period. It would provide context for sources or websites, the born digital.
The final questions are:
- What would the national Web archive look like when the focus is on capturing hyperlinks, search engine results, and other digital objects?
- What aspects besides the digital document are relevant to save and why?
- Can we learn from how born digital devices (e.g. search engines, platforms and recommendation systems) make use of the objects, and if so, how can such uses be repurposed for Web archiving>
Final personal note: The day after this presentation (this morning) my friend and colleague Esther Weltevrede graduated Cum Laude from the University of Amsterdam on her research on Archiving Web Dynamics. She will continue her research on National Webs as a PhD candidate with the Digital Methods Initiative. Congratulations Esther!