24 November 2015

Big data: from the doom of sociology to its method

This article was first published by the Internet Policy Review.

Mike Savage once predicted the downfall of sociology, but he has since revised his pessimism. In 2007, his famous essay “The coming crisis of empirical sociology” caused quite a stir. Back then, he argued that social scientists were falling behind the natural scientists by failing to make use of the oil of the 21st century: big data.

In a talk in the series “Big Data: Big power shifts?”, held on 5 November 2015 in Berlin, the sociologist from the London School of Economics and Political Science (LSE) concluded that today’s most successful and popular social scientists build their work primarily on data analysis.

Today, it would be unthinkable for intellectuals to construct grand theoretical edifices or propagate grand narratives like those of Michel Foucault or Jürgen Habermas. Still, there is a new star in the sky: the social scientist Thomas Piketty with his book “Capital in the Twenty-First Century”.

Piketty as a pioneer of big data

According to Savage, Piketty draws on data from various sources, focusing on income distribution, to illustrate complex relationships in a handful of simple data visualisations. “Piketty is using big Data, but he is not calling it Big Data,” Savage said.

Furthermore, the French economist builds his critical reasoning on accessible data visualisations, combining a descriptive approach with a critique of prevailing conditions.

According to Savage, Robert Putnam takes a similar approach in his book “Bowling Alone”, basing his thesis of declining social integration on club membership data and other statistics. Another example of social science relying on big data is the book “The Spirit Level” by Richard Wilkinson and Kate Pickett, which focuses on social inequality. In Savage’s view, social scientists can make up for their lack of technical knowledge only through better contextualisation.

At the 5 November event, organised by the Humboldt Institute for Internet and Society in collaboration with the Vodafone Institute for Society and Communications, Isabelle Sonnenfeld from Google News Lab made a similar statement: “Social scientists, unlike computer scientists, can come to a data source with a more complex and historical understanding.” The company decided to make some of its most important data, its search data, partially accessible via offerings such as google.com/trends. The decisive factor, however, is not the data itself but its interpretation. “We provide aggregated and anonymised Google Trends data, but it is the journalists and academics who are contextualising it,” Sonnenfeld said.

Why big data still has a long way to go

Nevertheless, Google’s approach of sharing some of its data with the public clearly shows that access to big data is still unevenly distributed. Fortunately, more and more large companies, most recently Deutsche Bahn, as well as state institutions are embracing openness and making machine-readable datasets available to the public, as a web search for “gov data” shows. A significant remaining problem, however, is that these data are not very informative, because they usually lack relevant context and sufficient granularity.

Deutsche Bahn, for example, has so far released only seven datasets, including a directory listing the lengths and heights of the railway platforms in Germany. At the same time, far more interesting and informative data on the consumption and mobility patterns of the German people remain inaccessible to the public. Data journalist Lorenz Matzat therefore calls the datasets published so far “Schnarchdaten” (“snoring data”). State administrations, too, have so far held back their more interesting datasets. The City of Cologne, for example, has published its budget data in machine-readable form. However, since the budget items are summarised in rough categories, the data remain difficult to decipher.

While many datasets are not published at all, there are also problems with the ones that are available. Typically, datasets are published in anonymised form. This is important for privacy protection, but it makes it almost impossible to link an anonymised dataset with other data. Yet gaining insight from data requires being able to integrate and compare different datasets.
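The effect of anonymisation on linkage can be illustrated with a minimal Python sketch. The records are invented, the `anonymise` and `shared_fields` helpers are hypothetical, and the sketch assumes that name and address are the identifying fields. Once those fields are removed, the two datasets no longer have any field in common that could be used to match records.

```python
# Assumption: these are the fields that identify a person.
IDENTIFYING_FIELDS = {"name", "address"}


def anonymise(records):
    """Return copies of the records without the identifying fields."""
    return [
        {key: value for key, value in record.items()
         if key not in IDENTIFYING_FIELDS}
        for record in records
    ]


def shared_fields(dataset_a, dataset_b):
    """Return the fields that occur in every record of both datasets."""
    fields_a = set.intersection(*(set(record) for record in dataset_a))
    fields_b = set.intersection(*(set(record) for record in dataset_b))
    return fields_a & fields_b


# Invented example records, for illustration only.
tickets = [{"name": "A. Example", "address": "1 Sample St",
            "route": "Berlin-Cologne"}]
incomes = [{"name": "A. Example", "address": "1 Sample St",
            "income_band": "middle"}]

linkable_before = shared_fields(tickets, incomes)
linkable_after = shared_fields(anonymise(tickets), anonymise(incomes))
```

Before anonymisation, both datasets share name and address; afterwards, `linkable_after` is empty, so there is nothing left to join on.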

An example: a supermarket chain collects data on the shopping habits of its customers via customer cards, but little demographic or personal data is attached to the card; only name and address, for example. By itself, the dataset is of little interest, so, in order to gain more insight, the supermarket chain purchases additional data on demography, household size, age, hobbies, interests, etc. from a third party. The consumer profiles are thus “filled with life”, allowing conclusions about the possible motives behind purchasing decisions. Big data can only develop its full potential if it is possible to connect different datasets.

The key to connecting different datasets is a so-called unique identifier, which identifies a person across several datasets; in our example, this would probably be name and address. While companies and security agencies rely on integrating different datasets, researchers and journalists often do not have this option: firstly, because of a lack of money, and secondly, because of ethical concerns about investigating and publishing such data.
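How such a link works in practice can be sketched in a few lines of Python. This is only an illustration of the supermarket example under stated assumptions: the records are invented, the `make_key` and `link` functions are hypothetical, and name plus address is assumed to serve as the identifier.

```python
def make_key(record):
    """Build a join key from name and address, ignoring case and extra spaces."""
    name = " ".join(record["name"].lower().split())
    address = " ".join(record["address"].lower().split())
    return (name, address)


def link(purchases, demographics):
    """Attach third-party attributes to every purchase with a matching key."""
    by_key = {make_key(entry): entry for entry in demographics}
    linked = []
    for purchase in purchases:
        match = by_key.get(make_key(purchase))
        if match is not None:
            # Fields from the purchase record take precedence on conflicts.
            linked.append({**match, **purchase})
    return linked


# Invented customer-card data held by the supermarket chain.
purchases = [
    {"name": "Erika Mustermann", "address": "Musterstr. 1, Berlin",
     "item": "organic milk"},
    {"name": "Max Beispiel", "address": "Beispielweg 2, Köln",
     "item": "frozen pizza"},
]

# Invented third-party data, spelled slightly differently.
demographics = [
    {"name": "erika  mustermann", "address": "musterstr. 1, berlin",
     "household_size": 3},
]

profiles = link(purchases, demographics)
```

Only the first customer ends up with an enriched profile, because the third-party data contains no entry for the second. The sketch also hints at why name and address make a fragile identifier: a single typo or a change of address is enough to break the match.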

Great opportunity vs. ethical concerns

For the social sciences, it is both an ethical dilemma and a great opportunity that people feel unobserved while producing the data, unaware that they are an object of study. For example, it could become possible to examine fields such as income distribution or prostitution, in which voluntary disclosure and self-description often lead to inaccurate results.

Of course, it can be argued that social scientists have always worked with large amounts of data, or Big Data, in censuses, election analyses or large surveys. However, what is new about Big Data is that a lot of the data are seemingly collected incidentally rather than for a specific purpose, as in a census. The online retailer Amazon, for instance, primarily sells products, but it collects a lot of consumer data as well. These data are saved and treated as a raw material, on the assumption that it will sooner or later be useful to evaluate them.

Social scientists as “Jacks of all trades”

It is not only the access to large datasets that is unequally distributed, but also the skills to handle them. While companies such as Google have countless programmers and data analysts to interpret data, social scientists often work on their own.

Should aspiring sociologists therefore also learn programming? Savage doesn’t think so: “If you had to actually learn those big data skills, that would be a big commitment – and you would lose a lot of theoretical and substantive skills too.” Instead, sociologists need to cooperate with programmers and data analysts. In such mixed teams, the sociologists’ theoretical, critical and historical knowledge could help to interpret data.

The lecture took place at the British Embassy in Berlin on 5 November 2015. The event series “Big Data: Big power shifts?” of the Humboldt Institute for Internet and Society is supported by the Vodafone Institute for Society and Communications. More information: https://www.hiig.de/big-data-big-power-shifts/

This post represents the view of the author and does not necessarily represent the view of the institute itself. For more information about the topics of these articles and associated research projects, please contact info@hiig.de.