MIT8 Talk: Exploring the Boundaries of a Website. Using the Internet Archive to Study Historical Web Ecologies

[slideshare id=20725582&doc=mit8webecologies-130507073144-phpapp01]

Slides and notes from my conference presentation “Exploring the Boundaries of a Website. Using the Internet Archive to Study Historical Web Ecologies” at MiT8: public media, private media. May 3-5, 2013 at MIT, Cambridge, MA.


I’m Anne, a PhD candidate and lecturer in New Media and Digital Culture at the University of Amsterdam. I am part of a research group called the Digital Methods Initiative, which is dedicated to developing digital methods and tools to analyze and map web data. In this paper I want, first, to explore the boundaries of a website and propose to reconceptualize the website as an ecology, and second, to put the Internet Archive to new uses by proposing a new method to reconstruct historical web ecologies using Internet Archive data.


What are the boundaries of a website? Where does a website begin and end? With the shift towards the web as platform, or Web 2.0, the boundaries of a website become difficult to establish and delineate. In the early days of the web, often referred to as Web 1.0 or the ‘Web-as-information-source’, websites were created by webmasters and were fairly self-contained units, as most content was stored on the same server (Song 2010, 251). Within Web 2.0, or the ‘Web-as-participation-platform’, websites are increasingly entangled in a networked context and shaped by third-party content and dynamically generated functionality. While a so-called “shift” to Web 2.0 is contested, and often considered a discursive move, we can see the emergence of a complex web ecology where the website is entangled in various relations with other websites but also with ad servers, social media platforms and other actors that shape and permeate the boundaries of the website. So what about the website as an archived object?



Most web archives take the website as the main unit of archiving (Brügger 2012, 5) and privilege the content of a website over its socio-technical context (Weltevrede 2009, 84). The website as an archived object is favored over other natively digital objects: the website is archived over “the references contained therein (hyperlinks), the systems that delivered them (engines), the ecology in which they may or may not thrive (the sphere) and the related pages, profiles and status updates on platforms” (Rogers 2013). Thus, in the archiving process the website is detached from the larger web context it resides in. The problem with archived websites is that they are separated from this context on various levels. Think, for example, of the website’s server log, its statistics (e.g. Google Analytics), its ranking on Alexa, its related web activities such as likes, shares or retweets, or its comment space enabled by an external commenting system such as Disqus or Facebook Comments. Today I would like to address this problem of archived websites by proposing a new method to reconstruct parts of this larger context using Internet Archive data. But first, let’s look at a website in detail to see how and where we can find traces of this larger context from within the website itself.



As argued before, a website is no longer a self-contained unit but is actively shaped by content hosted on content delivery networks, dynamically generated ads delivered by ad servers, and by social media platforms delivering a personalized website environment and extending their platform features using social plugins. If we look closely at a website, and specifically at its source code, we can detect the presence of various third-party actors exchanging content and functionality. Here, we have an example of the Huffington Post, which uses the content delivery network Akamai, a number of ad servers, and various social plugins such as the Twitter button and the Facebook Like button. All these different third parties co-constitute and shape the website as they enable the circulation of content and data flows, and at the same time they embed the website in various relations with actors such as ad servers, tracking companies and social media platforms. So what does that mean for the boundaries of a website?
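
To make the idea of reading third-party actors from source code concrete, here is a minimal Python sketch. It is not one of the tools used in this study; the function name `third_party_hosts`, the domain `example-news.com` and the sample markup are all made up for illustration. It simply lists the external hosts that a page loads scripts, images, frames or stylesheets from.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ResourceCollector(HTMLParser):
    """Collects src/href values from tags that load external resources."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in {"script", "img", "iframe", "link"}:
            for name, value in attrs:
                if name in {"src", "href"} and value:
                    self.urls.append(value)


def third_party_hosts(html, first_party_domain):
    """Return the sorted hosts referenced in `html` that are not part of
    `first_party_domain` (hypothetical helper, for illustration only)."""
    collector = ResourceCollector()
    collector.feed(html)
    hosts = set()
    for url in collector.urls:
        host = urlparse(url).hostname
        # Relative URLs have no hostname: they point at the site itself.
        if not host:
            continue
        is_first_party = (
            host == first_party_domain
            or host.endswith("." + first_party_domain)
        )
        if not is_first_party:
            hosts.add(host)
    return sorted(hosts)


# Made-up sample markup, not taken from any real website.
sample = """
<script src="http://platform.twitter.com/widgets.js"></script>
<link rel="stylesheet" href="/css/main.css">
<img src="http://www.example-news.com/logo.png">
<iframe src="//www.facebook.com/plugins/like.php?href=x"></iframe>
"""
print(third_party_hosts(sample, "example-news.com"))
# ['platform.twitter.com', 'www.facebook.com']
```

A host list like this says nothing yet about what each third party does; it only shows where the website's boundary is crossed.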



The Huffington Post uses social plugins to enable the sharing of their articles across various social media platforms and they also cross-post their own articles on these platforms. Social buttons are a good example of how the web operates as platform, allowing content to circulate between websites and social media platforms and allowing websites to embed platform functionality such as liking, sharing and tweeting. Within the web as platform, APIs, or application programming interfaces, provide a structured exchange of data and functionality between platforms and services, apps and websites (see also Langlois et al. 2009).



This circulation of content opens up the boundaries of a website because commenting on an article is no longer restricted to the website’s commentspace but commentspaces are now distributed across various social media platforms such as Twitter and Facebook (Helmond & Gerlitz 2010).



The Huffington Post also incorporates related tweets on their website and thereby integrates this distributed commentspace on Twitter back into the article.



Some news websites have implemented Facebook Comments, which allow users to use their Facebook account to comment on an article. When you comment on an article, the comment is shown beneath the article but is also posted on your Facebook News Feed. When someone replies to that comment from within Facebook, the reply is fed back into the website and shown in the comment section. Facebook Comments blur the boundaries of a website because comments may be exchanged between website and platform in both directions. Social plugins make the boundaries of a website permeable, as they allow for formatted content and functionality exchanges between websites and platforms.



Besides these exchanges, social plugins also establish data exchanges in the background because most social plugins function as trackers (Gerlitz & Helmond 2013). In this Huffington Post example, the browser plugin Ghostery found 16 different trackers establishing data connections with third parties, including social media platforms such as Twitter and Facebook. The website is embedded in a complex network of trackers that operate invisibly in the back-end.



Thus, I would like to propose to reconceptualize the website as ecology, following Matthew Fuller’s notion of media ecology and David M. Berry’s notion of computational ecology. Both authors share an interest in ecologies of the non-human, where the website may be seen as the habitus of various actors on the web maintaining dynamic relations. In this paper I refer to website ecology as the study of relations of dynamically generated objects within websites and the interrelations with their environments, for example their interactions with social media platforms and tracking companies.



The website is inhabited and co-constituted by various actors and the question here is how we can study the relations between these actors and the interrelations with their larger environment. How can we use the website as an object of study to analyze larger web contexts beyond the website itself? And how can we use the archived website as an object of study to analyze historical states of the web that websites are embedded in?



In this case study I would like to address a common problem with archived websites, where the website is archived over its context. Despite the focus on the website as the core archival unit and its detachment from the larger web context it resides in, we can still see the traces of this assemblage within the website that point to the larger context outside of the website. I would like to show how we can use Internet Archive data to detect the traces of the web ecology a website is embedded in and how the source code of an archived website provides a very rich source to reconstruct historical web ecologies.



The traces of the larger web ecology in which the archived website was embedded can be found in the website’s source code. Here we do not only find the connections with other websites in the form of links, but we can also find connections with trackers. The point here is that while JavaScript and dynamic content pose a problem for archiving, because dynamic content is frozen in the archiving process and JavaScript functionality will not render, we can use these code traces to reconstruct historical states of the web. In this case study I focus on the presence of trackers on the New York Times website, to see how the website’s relations with tracking companies, and the tracking practices it is embedded in, have changed over time.



For this case study I used some tools created by the Digital Methods Initiative and I would like to briefly run you through the method and some initial findings.



First, I want to get the links to all the archived New York Times snapshots from the Internet Archive’s Wayback Machine to create a corpus. The Internet Archive Wayback Machine Link Ripper tool gets all the links to available archived snapshots.
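
For readers who want to build a comparable corpus themselves, the following Python sketch shows one possible route. It is not the Link Ripper tool and makes no claim about how that tool works internally; it assumes the Wayback Machine's public CDX search endpoint and its `output=json` format, in which the first row names the fields. The function names and the canned response are made up for illustration.

```python
import json
from urllib.parse import urlencode

# Assumption: the public Wayback CDX search endpoint.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(site, from_year, to_year):
    """Build a CDX query for successful captures of `site`,
    collapsed to at most one capture per month."""
    params = {
        "url": site,
        "from": str(from_year),
        "to": str(to_year),
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "timestamp:6",  # first 6 digits = YYYYMM
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_links(cdx_json_text):
    """Turn a CDX JSON response (header row first) into snapshot URLs."""
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    ts = header.index("timestamp")
    original = header.index("original")
    return [
        "http://web.archive.org/web/{}/{}".format(row[ts], row[original])
        for row in captures
    ]


# Made-up response in the CDX JSON shape, so the sketch runs offline.
canned = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20070101000000", "http://www.example.com/", "text/html", "200", "X", "1"],
])
print(cdx_query_url("nytimes.com", 1996, 2012))
print(snapshot_links(canned))
```

To use it against the live archive, the query URL would be fetched (for example with `urllib.request.urlopen`) and the response text passed to `snapshot_links`.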



Second, I checked all these archived websites for the presence of third-party trackers using our Tracker Tracker tool. This tool is based on the previously mentioned browser plugin Ghostery, which detects over 1200 trackers on websites. We repurposed the tool so it can also detect trackers on websites archived in the Internet Archive’s Wayback Machine. It looks for known patterns of trackers in the archived website’s source code and outputs all the trackers found.
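
The pattern-matching step can be sketched in a few lines of Python. This is only an illustration of the general technique, not the Tracker Tracker tool itself, and the four patterns below are hand-picked examples rather than Ghostery's actual tracker library. The function name `detect_trackers` and the sample source are made up.

```python
import re

# Hypothetical, hand-picked patterns for illustration only.
# A real detection library would contain many more entries.
TRACKER_PATTERNS = {
    "DoubleClick": r"doubleclick\.net",
    "Google Analytics": r"google-analytics\.com/(ga|urchin)\.js",
    "Facebook Social Plugins": r"facebook\.com/plugins/",
    "Twitter Button": r"platform\.twitter\.com/widgets\.js",
}


def detect_trackers(source, patterns=TRACKER_PATTERNS):
    """Return the sorted names of all patterns found in the page source."""
    return sorted(
        name
        for name, pattern in patterns.items()
        if re.search(pattern, source, re.IGNORECASE)
    )


# Made-up archived source. Archived pages often carry an archive URL
# prefix in front of the original address; a substring search still
# finds the tracker's domain behind that prefix.
archived_source = """
<script src="http://web.archive.org/web/20070101000000/http://ad.doubleclick.net/adj/x"></script>
<script src="http://www.google-analytics.com/urchin.js"></script>
"""
print(detect_trackers(archived_source))
# ['DoubleClick', 'Google Analytics']
```

Because the search runs on the frozen source code rather than on live network traffic, it works even though the archived JavaScript no longer executes.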



I did this for sixteen years of available New York Times data. Here we see the number of unique trackers per year on the New York Times front page. In the beginning (1996-2000) the New York Times used first-party trackers. Over the years we see a proliferation of the types of trackers: in the early days trackers were mainly ads and trackers, and later also analytics and widgets such as social plugins with tracking capabilities. There is a decline in trackers in 2004 but this is probably due to a gap in archived snapshots in the Internet Archive, as you can see. In 2006 and 2007 there were 18 unique trackers on the NYT front page and after that we see a slow decline. One of the questions for further research would be whether this is due to media concentration.
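
The yearly count can be derived from per-snapshot detections as in the Python sketch below. The input format (a Wayback-style 14-digit timestamp paired with a list of tracker names), the function name and the sample data are assumptions made for illustration; the placeholder names are not findings from this study.

```python
from collections import defaultdict


def unique_trackers_per_year(detections):
    """Count distinct tracker names per year.

    `detections` is a list of (timestamp, tracker_names) pairs, where
    the timestamp starts with the four-digit year (YYYYMMDDhhmmss).
    """
    per_year = defaultdict(set)
    for timestamp, trackers in detections:
        per_year[int(timestamp[:4])].update(trackers)
    return {year: len(names) for year, names in sorted(per_year.items())}


# Made-up detections with placeholder tracker names.
detections = [
    ("20060101000000", ["tracker_a", "tracker_b"]),
    ("20060601000000", ["tracker_b", "tracker_c"]),
    ("20070101000000", ["tracker_a"]),
]
print(unique_trackers_per_year(detections))
# {2006: 3, 2007: 1}
```

Counting with sets means a tracker that appears in many snapshots of the same year is counted once, which is what "unique trackers per year" requires. A year with few snapshots will still tend to show fewer trackers, so gaps in the archive need to be read alongside the counts.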



Here we see which trackers have been found and when. DoubleClick has persistently been on the New York Times front page, and new trackers have been introduced over the years. The Internet Archive recently updated its index and now has data up till a few weeks ago. I would like to include this data in further research to investigate the diversity of trackers and the types of trackers over time: how has the tracking ecology changed over time?
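
A simple way to approximate such a "which tracker, and when" overview is to record the first and last year in which each tracker was detected. The Python sketch below assumes per-snapshot detections as (timestamp, tracker names) pairs; the function name and the placeholder data are made up for illustration and do not reproduce the results shown on the slide.

```python
def tracker_lifespans(detections):
    """Map each tracker name to the (first_year, last_year) it was seen.

    `detections` is a list of (timestamp, tracker_names) pairs, where
    the timestamp starts with the four-digit year (YYYYMMDDhhmmss).
    """
    spans = {}
    for timestamp, trackers in detections:
        year = int(timestamp[:4])
        for name in trackers:
            first, last = spans.get(name, (year, year))
            spans[name] = (min(first, year), max(last, year))
    return spans


# Made-up detections with placeholder tracker names.
detections = [
    ("20060101000000", ["tracker_a", "tracker_b"]),
    ("20080601000000", ["tracker_a", "tracker_c"]),
    ("20100101000000", ["tracker_a"]),
]
for name, (first, last) in sorted(tracker_lifespans(detections).items()):
    print(name, first, last)
```

A tracker whose span covers the whole period is a persistent presence, while short spans point to trackers that were introduced and dropped again. Note that a span does not guarantee the tracker was present in every year in between.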



In a next step of this project I do not only want to look at trackers on an individual website such as the New York Times but at a network of websites. Here you see an example of the top 1000 websites according to Alexa and the large tracker ecology these websites are embedded in. I would like to reconstruct historical tracking networks using the previously described method, with the archived website as an entry point. In other words, by looking into the presence of third-party trackers on a selection of websites over time we can reconstruct historical web ecologies to reveal the complex tracking network that websites have been embedded in over time.
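
Such a network can be represented as a two-mode (website to tracker) edge list. The Python sketch below is a minimal illustration under that assumption; the function names, site names and tracker names are placeholders, and the resulting edge list would still need to be exported to a graph tool for visualization.

```python
from collections import Counter


def tracker_edges(site_trackers):
    """Build sorted (site, tracker) edges from a mapping of each
    website to the trackers detected on it."""
    return sorted(
        (site, tracker)
        for site, trackers in site_trackers.items()
        for tracker in set(trackers)  # ignore duplicate detections
    )


def tracker_degree(edges):
    """Count on how many websites each tracker appears."""
    return Counter(tracker for _, tracker in edges)


# Made-up detections for one point in time, with placeholder names.
site_trackers = {
    "site-a.example": ["tracker_1", "tracker_2"],
    "site-b.example": ["tracker_1"],
    "site-c.example": ["tracker_1", "tracker_3"],
}
edges = tracker_edges(site_trackers)
print(edges)
print(tracker_degree(edges).most_common(1))
# [('tracker_1', 3)]
```

Repeating this for snapshots from different years would give one network per period, so that changes in the tracking ecology can be compared over time.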



To conclude, today I have discussed the boundaries of a website by showing how websites are shaped by third parties that permeate their boundaries, and how they are embedded in a complex web ecology including social media platforms and trackers. In my case study I have put the Internet Archive to new use to show how the source code of an archived website provides a very rich source to reconstruct historical web ecologies.



Berry, David M. 2012. “Life in Code and Software.” Living Books About Life. September 23.

Brügger, Niels. 2012. “Web Historiography and Internet Studies: Challenges and Perspectives.” New Media & Society (November 21). doi:10.1177/1461444812462852.

Fuller, Matthew. 2005. Media Ecologies: Materialist Energies In Art And Technoculture. Cambridge: MIT Press.

Gerlitz, Carolin, and Anne Helmond. 2013. “The Like Economy: Social Buttons and the Data-intensive Web.” New Media & Society (February 4). doi:10.1177/1461444812472322.

Helmond, Anne and Carolin Gerlitz. 2010. Distributed Commentspaces.

Langlois, Ganaele, Fenwick McKelvey, Greg Elmer, and Kenneth Werbin. 2009. “Mapping Commercial Web 2.0 Worlds: Towards a New Critical Ontogenesis” (14).

Rogers, Richard. 2013. Digital Methods. Cambridge: MIT Press.

Song, Felicia Wu. 2010. “Theorizing Web 2.0.” Information, Communication and Society 13 (2): 249–275. doi:10.1080/13691180902914610.

Weltevrede, Esther. 2009. “Thinking Nationally with the Web: A Medium-specific Approach to the National Turn in Web Archiving.” Amsterdam: University of Amsterdam.
