DMI mini-conference Day 1: Michael Stevenson on the Archived Blogosphere

The Digital Methods Initiative is holding a three-day mini-conference with workshops presenting papers and research proposals. Today I responded to Michael Stevenson’s paper on the history of the blogosphere through the eyes of EatonWeb and the Internet Archive. The following is my summary of his paper and argument, followed by questions.

Michael Stevenson. The archived blogosphere: exploring web historical methods using the Internet Archive

Respondent: Anne Helmond, University of Amsterdam. 20 January 2010.

One of the main questions of Stevenson’s research is: How can we use and repurpose the Internet Archive to study the history of the blogosphere? The Internet Archive is especially useful for single-site histories, as the Archive is browsed by URL. However, websites rarely exist in a vacuum. This is partly recognized by the Archive’s special collections on a particular topic or event. Blogs, and their (in)formal linking policies, constitute a different type of collection of sites, one that converges not on topic or event but on formal characteristics: the blogosphere. As Stevenson notes, “The genre (of blogs) was defined less by content than by form, with reverse-chronology and the centrality of linking trumping the extent to which bloggers focused on similar topics.” How do we deal with a collection of websites that constitutes a separate websphere when the device used is best suited to studying the history of single sites?

Historical accounts of the blogosphere are often written from an anecdotal perspective (Blood 2000; Rosenberg 2009). Stevenson notes that:

What is missing in this approach, however, is reflection on the changing conditions for historical research when the object of study is the Web, or (as may increasingly be the case) is studied with the Web. (p. 75)

The Internet Archive (IA) is described as a legacy system in the sense that it is based on browsing instead of the current trend of searching, and so displays aspects of an earlier (web) culture. What is sustained is cyberculture. Cyberculture (1980s-1990s) is characterized by a “commitment to egalitarian and universal access to information” (p. 78). Cyberspace is described as “somewhere else,” a notion still visible in the IA, which prefers browsing over querying. The rise of the blogosphere may be seen as “the rejection of cyberspace” and as a transition phase from cyberculture (egalitarian) to web culture (A-lists). The blogosphere is marked by a strong tension between the idea of egalitarianism and the actual formation of A-lists through disproportionate linking.

Case study
How to delimit the object of study? DMI asks how the dominant devices do it. For example, the engines define a blog as anything that publishes a feed. In this case study, the first dominant blogosphere device, EatonWeb, was taken as a starting point. EatonWeb was a manually created collection (expert list) of blogs, and inclusion was based on the formal characteristic of blogs: reverse-chronologically ordered entries. “Of the 947 blogs listed by the directory, 857 (or 85.5%) were present in the Internet Archive.” The blogs missing from the Archive were located by following the outlinks of the blogs in the set. This presents a map of the “whole” early blogosphere.
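
To make the outlink step concrete, here is a minimal sketch of how missing blogs could be positioned by the links they receive. It is my own illustration, not Stevenson's actual tooling: the function names, the example hosts, and the choice to count each linking page once per target host are all assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class OutlinkParser(HTMLParser):
    """Collect the absolute target of every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def outlink_hosts(base_url, html):
    """Return the set of external hosts that a page links to."""
    parser = OutlinkParser(base_url)
    parser.feed(html)
    hosts = {urlparse(link).netloc.lower() for link in parser.links}
    hosts.discard("")  # mailto: and similar links have no host
    hosts.discard(urlparse(base_url).netloc.lower())  # drop self-links
    return hosts


def locate_missing(pages, archived_hosts):
    """Count, per non-archived host, how many pages in the set link to it.

    pages: dict mapping page URL -> HTML (hypothetical input format).
    archived_hosts: hosts already present in the archived set.
    """
    inlinks = {}
    for base_url, html in pages.items():
        for host in outlink_hosts(base_url, html):
            if host not in archived_hosts:
                inlinks[host] = inlinks.get(host, 0) + 1
    return inlinks
```

Hosts with many inlinks from the archived set but no capture of their own are the candidates for the "missing pieces" of the map. Note that this sketch reads links from the whole page; restricting it to a blogroll would need extra markup-specific logic.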

Contribution
Stevenson contributes to studies on the history of the blogosphere by compiling a new special collection, the Early Blogosphere (according to EatonWeb), that may be mapped and queried. By mapping the outlinks of the blogs in EatonWeb, the non-archived blogs (the pieces of the blogosphere missing from the Internet Archive) are positioned within the network.

Questions
“The organization of the EatonWeb Portal suggested egalitarianism” which is in line with the characteristics of cyberspace. Are ranking devices the official end of cyberspace? Do you consider EatonWeb in that sense a transitional device?
You have now compiled your own special collection of the early blogosphere. Querying this collection, in contrast to the IA, is now possible. What would you like to ask the collection?
The focus is now on outlinks. Where were these links taken from? The whole page? Suggestion for detailed focus: blogroll analysis only. Do they provide a different map?

Further research
Platform-specific maps. Actors that receive links from EatonWeb blogs but are not in EatonWeb themselves are often blog platforms such as Blogger.com and Pitas.com. Redo the map with a focus on platforms. Do platforms cluster?
There are some specific Pitas blogs on the maps, but no specific Blogger.com websites. Is it possible to look “beyond” pitas.com (*.pitas.com) or blogger.com (*.blogger.com) to see which sites were there?
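
One way to look "beyond" the platform domains would be to group the hosts on the map by platform suffix. The following is only a sketch under my own assumptions: the helper names are hypothetical, the platform list contains just the two platforms mentioned above, and the platform's own front pages (the bare domain and its www host) are left out.

```python
# Platforms named in the post; extend the tuple for other hosts.
PLATFORMS = ("pitas.com", "blogger.com")


def platform_of(host):
    """Return the platform a host is a subdomain of, or None."""
    host = host.lower().rstrip(".")
    for platform in PLATFORMS:
        # Match *.pitas.com but not pitas.com itself or notpitas.com.
        if host.endswith("." + platform):
            return platform
    return None


def sites_by_platform(hosts):
    """Group individual blog hosts under the platform that serves them."""
    grouped = {platform: set() for platform in PLATFORMS}
    for host in hosts:
        platform = platform_of(host)
        if platform is None:
            continue
        name = host.lower().rstrip(".")
        if name == "www." + platform:
            continue  # the platform's front page, not an individual blog
        grouped[platform].add(name)
    return {platform: sorted(names) for platform, names in grouped.items()}
```

An empty list for a platform would reproduce the observation above: links go to the platform itself, but no individual sites on it show up in the set.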

More info on Michael Stevenson’s & DMI research on the DMI wiki:
Tracing And Mapping The Evolution Of The Early Blogosphere With The Internet Archive
Profiling the Archived Blogosphere
Wayback Web Collections
Early Blog Features

Archive 2020: Esther Weltevrede – Archiving Web Dynamics

Archive 2020
Internet researchers are confronted with an unstable object of study: the ephemerality of the web. The question is how to make the medium permanent so that we can study it with care. The shape of the archive informs what I can ask the archive.

This perspective on archives is placed within Weltevrede’s research into National Webs. To think nationally with the web might seem counterintuitive at first because dominant ideas of the web are so global. This originates from the 90s idea of cyberspace, a universal space associated with disembodiment and identity play. Crucially, cyberspace is a place that is disembedded from reality. After 2000, cyberspace was confronted with what Weltevrede calls “the national turn.”

This may be seen in a number of places. Probably the most familiar is that Google.com redirects you to the local version for your location, for example Google.nl, where you get a totally different results page. Another example is the message “This video is not available in your country”: intellectual property is a dominant factor in the nationalization of web content. You might also think in terms of language. English used to be the dominant universal language, but there is now a lot of clustering on the web based on a shared language.

To move to the web archive, the most exhaustive project in the field is the Internet Archive, which originates from the cyberspace period (1996). This can also be seen in how the archive was set up. First of all, the scope of the collection is the “whole” internet, which is a very broad collection aim. Secondly, when you look at the interface of the archive, the Wayback Machine, what you immediately notice is that you query it by URL and browse from that point on. It is characterized by browsing instead of the currently dominant form: searching. The Internet Archive therefore privileges single-site histories over researching a site’s context.
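
To illustrate what "query by URL" means in practice, here is a small sketch that builds a Wayback Machine address for one site at one moment. It assumes the publicly visible address pattern of web.archive.org (a 14-digit timestamp followed by the original URL); the function name is my own, and the sketch only constructs the address without fetching anything.

```python
from datetime import datetime

# Assumed address pattern: https://web.archive.org/web/<timestamp>/<url>
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url, when):
    """Build the Wayback Machine address for a URL near a given moment."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # YYYYMMDDhhmmss
    return f"{WAYBACK_PREFIX}/{timestamp}/{url}"
```

The only way in is a known URL plus a date, which is exactly why the interface favors single-site histories: there is no entry point for asking which sites belonged together at that moment.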

The Internet Archive emerged from the web company Alexa, which provides all the crawls and donates them to the archive. This means that the selection of sites is based on traffic data. If you have the Alexa toolbar installed, every page you visit will be included in the archive. It is a very smart way to start thinking about which pages should be included in the archive. After the Internet Archive started in 1996, a number of initiatives emerged with a national focus. The general thought behind them was that national web archives can best serve local wishes and demands, and so best serve the community (researchers, general public).

As an example we will look at a Dutch web archive maintained by the Royal Library of the Netherlands, the KB. Before we go into the actual project, let’s get a sense of the size of the Dutch web. The .nl domain is the fourth largest country domain with 3.2 million sites, an enormous number.

How to demarcate the national web

  1. .nl is the 4th largest country domain
  2. A second way to look at the national web (you could argue that .nl is not the whole Dutch web) is to look at all the domains registered by the Dutch (sidn.nl 2008)
  3. Which sites do we Dutch people find relevant? We can look at the most visited websites as listed by Alexa. The importance of these sites is expressed through the number of visits.

These are three ways to define the national web by web means. The Royal Library uses a different definition of the national aspect: it created a new definition of what counts as Dutch content.

  • A: Website in Dutch, registered in the Netherlands
  • B: Website in another language, registered in the Netherlands
  • C: Website in Dutch, registered in another country
  • D: Website in another language, registered in another country, topic aimed at the Netherlands.

All of these options seem technically feasible except for the last one. We cannot technically or automatically identify content that is aimed at the Netherlands. This makes it highly unlikely that the Dutch web, defined in this way, can be archived. What the Royal Library has done is set this definition aside and select sites manually. They started with 100 sites, which became 400 and now just over 1,000. They archive those sites really well.
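
The point about feasibility can be sketched in a few lines. This is an illustration under my own assumptions, not the Royal Library's procedure: it presumes that a site's language and country of registration have already been determined elsewhere, and the type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    url: str
    language: str              # e.g. "nl" or "en"; assumed detected upstream
    registration_country: str  # e.g. "NL" or "DE"; assumed from registration data


def kb_category(site: Site) -> Optional[str]:
    """Assign category A, B or C; return None where a person has to decide."""
    in_dutch = site.language == "nl"
    in_netherlands = site.registration_country == "NL"
    if in_dutch and in_netherlands:
        return "A"
    if in_netherlands:
        return "B"
    if in_dutch:
        return "C"
    # Category D depends on whether the topic is aimed at the Netherlands,
    # which neither field can tell us, so it cannot be assigned automatically.
    return None
```

Everything that falls through to None is either category D or not Dutch at all, and separating the two is the part that needs manual selection.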

As an internet researcher Weltevrede is particularly interested in the dynamics of websites. The contribution she would like to put forward is how else can we approach the object of collection, the Dutch web?

If you start web archiving, the easiest and most effective method is to follow the possibilities of the medium. You can automate a lot of things, and you can also focus on the context and prominence of a website in a particular period. The first point calls attention to the challenge of developing methods that follow the medium to automate the collection process. You could schedule the query “.nl” on Google.nl, because Google takes into account what is relevant: links to a website. These sites are considered relevant not only by Google but by a large group of people. Hyperlink structures are human acts of association; links die and emerge. What would that information tell us about the context and its network? If you scheduled the query over time, you could see the relevance of a particular source in a particular period. It would provide context for sources or websites, the born digital.
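
What a scheduled query could yield is easiest to see with a toy example. The sketch below assumes the result lists have already been saved per date (how they are collected is left open); the function names and the input format, a dict from ISO date string to ranked list of URLs, are my own assumptions.

```python
def rank_history(snapshots):
    """Map each URL to its rank on every date it appeared.

    snapshots: dict mapping an ISO date string to a ranked list of URLs
    (hypothetical format for saved result pages).
    """
    history = {}
    for date in sorted(snapshots):
        for rank, url in enumerate(snapshots[date], start=1):
            history.setdefault(url, {})[date] = rank
    return history


def changes(snapshots):
    """List (date, appeared, disappeared) between consecutive snapshots."""
    dates = sorted(snapshots)
    result = []
    for previous, current in zip(dates, dates[1:]):
        before = set(snapshots[previous])
        after = set(snapshots[current])
        result.append((current, sorted(after - before), sorted(before - after)))
    return result
```

The rank history shows the prominence of a source in a particular period, and the list of changes shows sources emerging and dying, much like the links themselves.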

The final questions are:

  • What would the national Web archive look like when the focus is on capturing hyperlinks, search engine results, and other digital objects?
  • What aspects besides the digital document are relevant to save and why?
  • Can we learn from how born digital devices (e.g. search engines, platforms and recommendation systems) make use of the objects, and if so, how can such uses be repurposed for Web archiving?

Final personal note: The day after this presentation (this morning) my friend and colleague Esther Weltevrede graduated cum laude from the University of Amsterdam with her research on Archiving Web Dynamics. She will continue her research on National Webs as a PhD candidate with the Digital Methods Initiative. Congratulations Esther!