The Platform Governance Archive v1: A longitudinal dataset to study the governance of communication and interactions by platforms and the historical evolution of platform policies

Author: Katzenbach, C., Kopps, A., Magalhaes, J. C., Redeker, D., Sühr, T., Wunderlich, L.
Published in: Zentrum für Medien-, Kommunikations- und Informationsforschung (ZeMKI)
Year: 2023
Type: Academic articles
DOI: 10.26092/elib/2331

Platform policies contain the spelled-out rules about what is allowed and prohibited on a service. As such, they constitute both a normative framework and a means of public communication by platforms. Studying the evolution of the increasingly complex web of policies that platforms have developed can hence allow us to trace the emergence of a specific normative order, i.e. the ways in which platforms govern user activities and public speech and communication dynamics, as well as to identify how they have reacted to public controversies, political debates and legal regulation. A major difficulty for studies on the historical evolution of platform policies, however, is the availability of past policies, which are often needed for a thorough analysis: policies change quite frequently, and even their names and locations often differ from those of the current version. Although platforms have become increasingly transparent about how and when they change their rules and have begun to offer public archives of the historical versions of their policies, these archives often do not contain all past versions of a policy, and relying on them entails trusting the platforms to provide complete information. Thus it remains hard to systematically study how the rules and norms of platforms have changed over time. Our Platform Governance Archive (PGA) aims to address this need by providing a comprehensive and uniformly collected dataset of all historical versions of platform policies that does not rely on the platforms’ own public records.
While we are working on extending the scope of the archive to include more platforms and policies, the current dataset described in this paper contains all historical versions of three types of policy documents (Terms of Service, Community Guidelines, Privacy Policies) by four major platforms (Facebook, YouTube, Twitter and Instagram) in the time period from the inception of each policy until the end of 2021. Our paper gives a comprehensive overview of the conceptual layout of the Platform Governance Archive and details the automated and manual processes of data collection and data cleaning, as well as our practical and theoretical challenges. Starting with how we define a relevant change to a platform policy, we lay out how we used the Internet Archive’s Wayback Machine to identify past versions of platform policies, collect them, and then automatically and manually check for changes. Specifically, we explain how we mapped the URLs of the selected policies and how they have changed over time, putting together a puzzle of how they were renamed and relocated. We then detail the automated scraping of these URLs from the Wayback Machine as well as the automated diff-checking which we employed. The last step of the data cleaning consisted of a manual revision of the automatically identified versions based on our definition of a relevant change, which was necessary because a significant amount of noise remained in the data. The paper furthermore describes how the platforms' ways of displaying their policies have changed over time by increasingly turning them into interactive pages and multi-page documents, as well as how we addressed the data collection challenges that arose from this. It also provides an overview of the resulting v1 corpus of the Platform Governance Archive, a dataset consisting of 354 policy documents with a total of 6,036 pages.
By detailing the structure of our data repository on GitHub, we offer a guide on how to access and work with the data. We furthermore describe the characteristics and details of each platform and policy type to account for the fact that each of them has undergone a specific historical development. Lastly, our paper also presents a document-level structural analysis of some of the general trends and patterns which are visible in the dataset over a time period of up to almost two decades. Using a quantitative analysis, we analyse how the change frequency and the character count of each platform policy have developed over time. A comparative visualisation of these findings allows us to show how the extent of the policies has grown over time, to identify periods of high growth and frequent changes and to draw comparative conclusions about the four different platforms. The Platform Governance Archive aims to be a resource for researchers, journalists, policy-makers, platform operators, activists, and other stakeholders as well as the general public. By offering both a comprehensive dataset and an accessible interface, we aim to offer and continue to develop this resource to enable research and public debate on the historical evolution of platform policies in order to trace changes, to identify characteristic periods of isomorphic policies, to measure influencing factors, and to understand how specific debates, events, and legislation have influenced and manifested in platform policies.
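
The automated diff-checking and the document-level measures mentioned in the abstract (change frequency and character count) can be illustrated with a small sketch. This is not the authors' implementation: the `Snapshot` type, the function names and the whitespace-only normalisation are assumptions made for illustration, and the paper's own definition of a relevant change, together with its manual revision step, is not reproduced here. The sketch works on already-collected snapshot texts and makes no requests to the Wayback Machine.

```python
import difflib
from collections import Counter
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One archived capture of a policy page (hypothetical structure)."""
    date: str  # ISO date of the capture, e.g. "2011-05-01"
    text: str  # policy text extracted from the capture


def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout-only differences are ignored."""
    return " ".join(text.split())


def distinct_versions(snapshots: list[Snapshot]) -> list[Snapshot]:
    """Keep the first snapshot and every later one whose text differs
    from the previously kept version after normalisation."""
    versions: list[Snapshot] = []
    for snap in sorted(snapshots, key=lambda s: s.date):
        if not versions or normalize(versions[-1].text) != normalize(snap.text):
            versions.append(snap)
    return versions


def changed_lines(old: Snapshot, new: Snapshot) -> list[str]:
    """Return added ("+ ") and removed ("- ") lines for manual review."""
    diff = difflib.ndiff(old.text.splitlines(), new.text.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def character_counts(versions: list[Snapshot]) -> list[tuple[str, int]]:
    """Character count of each version, paired with its capture date."""
    return [(v.date, len(v.text)) for v in versions]


def changes_per_year(versions: list[Snapshot]) -> Counter:
    """Number of changes per year; the initial version is not a change."""
    return Counter(v.date[:4] for v in versions[1:])


snapshots = [
    Snapshot("2010-01-01", "Be nice.\nNo spam."),
    Snapshot("2010-03-01", "Be  nice.\nNo spam."),  # whitespace-only difference
    Snapshot("2011-05-01", "Be nice.\nNo spam.\nNo harassment."),
]
versions = distinct_versions(snapshots)
print(character_counts(versions))
print(changed_lines(versions[0], versions[1]))
print(changes_per_year(versions))
```

In this toy input the second capture differs from the first only in whitespace, so it is not counted as a new version. A real pipeline would need a stricter notion of relevance than text inequality, which is why the abstract describes a manual revision of the automatically identified versions.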


Connected HIIG researchers

Adrian Kopps

Associated researcher: The evolving digital society

Christian Katzenbach, Prof. Dr.

Associated researcher: The evolving digital society

  • Open Access
