Cornerstone Course – Day 5: Digital Society – Big Data

Digital Society is a very elastic phrase. We will explore three examples:

  • Network Neutrality
  • Privacy and Surveillance
  • Big Data

All three concern how technology changes society, and whether that impact is positive or negative is contested. The issues sit at the intersection of information and communications technologies with society, law, and public policy.

Big Data

The mass collection of personal information is essentially a form of discrimination, yet it is widely used, for example in credit rating (Gandy, 1993). The Internet aggravated the situation through the commercial use of targeted marketing. This leads to fine-grained market segmentation and systematic discrimination, which in turn is hard to detect or resist. Even worse, most companies cannot pinpoint the discrimination that their own services apply.

The traditional scientific approach describes itself as

  1. Formulate hypothesis
  2. Design and conduct experiments
  3. Use results to confirm or disprove the hypothesis
  4. Use the outcome as the basis for decisions and actions

This is arguably not how science actually works, but it is how science presents itself.

Big Data contrasts with the scientific approach. It proceeds as follows:

  1. Start from an existing large data set (not necessarily what you were looking for)
  2. Mine the data for correlations (patterns)
  3. Infer links between factors (a sort of hypothesis)
  4. Use the inferred links as the basis for decisions and actions

The approach is automated end to end: the models are produced by a computer, with no humans involved other than in devising the algorithms. The resulting models of the world are highly complicated and incomprehensible to humans, possibly even beyond what humans could understand in principle. Big Data further focuses on correlation rather than causation. It uses the complete data set rather than a sample, and it aims for statistical accuracy rather than accuracy about each individual. To make this work you must collect the data in advance, and more specifically you must collect any data you can.
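The "mine data for correlations" step can be illustrated with a minimal sketch. The records, the attribute names, the helper names `pearson` and `mine_correlations`, and the threshold of 0.8 are all invented for illustration; real systems work on far larger data sets and use more elaborate models.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mine_correlations(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every attribute pair with |r| >= threshold."""
    found = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            found.append((a, b, round(r, 3)))
    return found


# Invented toy records: one list per attribute, one position per person.
data = {
    "late_night_purchases": [1, 3, 2, 5, 4, 6],
    "insurance_claims": [0, 2, 1, 4, 3, 5],
    "shoe_size": [42, 38, 44, 40, 39, 43],
}

# No hypothesis is stated in advance: every pair of attributes is tested.
patterns = mine_correlations(data)
print(patterns)
```

Note that the code never asks why two attributes move together. It only reports that they do, which is exactly the shift from causation to correlation described above.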

Why now?

Computational power has become much cheaper. Data is available, and data mining and machine learning have become viable. The Internet of Things (IoT) is drastically increasing the amount of data available. Processing the data is difficult, and it is not clear how malicious actors could influence the process. Most IoT services are useful, but they generate a huge amount of data that is shared and used by the provider of the IoT services.

Google Translate is a case in point for Big Data. Previously, people tried to deconstruct a text by understanding its grammar and then reassemble it in another language. Google's system learns nothing about grammar; instead it correlates the same text in two languages to obtain a statistical connection between the languages. The EU provided a great source of data, as (nearly) all its texts are (manually) translated into all 24 official languages.
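The idea of correlating the same text in two languages can be shown with a toy sketch. This is not how Google Translate is implemented; it is a deliberately simple co-occurrence count over an invented four-sentence parallel corpus, scored with the Dice coefficient, and the function name `learn_translations` is hypothetical.

```python
from collections import Counter


def learn_translations(sentence_pairs):
    """Map each source word to the target word it co-occurs with most strongly."""
    source_counts = Counter()
    target_counts = Counter()
    pair_counts = Counter()
    for source, target in sentence_pairs:
        source_words = set(source.lower().split())
        target_words = set(target.lower().split())
        source_counts.update(source_words)
        target_counts.update(target_words)
        # Count every source/target word pair seen in the same sentence pair.
        for s in source_words:
            for t in target_words:
                pair_counts[(s, t)] += 1

    table = {}
    for s in source_counts:
        # Dice coefficient: 2 * co-occurrences / (count of s + count of t).
        def dice(t, s=s):
            return 2 * pair_counts[(s, t)] / (source_counts[s] + target_counts[t])

        table[s] = max(sorted(target_counts), key=dice)
    return table


# Invented parallel corpus (English, German); no grammar rules are supplied.
corpus = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a house", "ein haus"),
    ("a book", "ein buch"),
]

table = learn_translations(corpus)
print(table)
```

The sketch recovers word pairs purely from which words appear together, without any knowledge of either language, which is why the amount of available parallel text matters so much.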

Another example is Google Flu Trends, which automatically found search terms that were correlated with influenza cases to create a prediction system. The system worked well for data between 2004 and 2010, but then it broke down. The question is whether public policy can be based on such a system.

“Personal data is the new oil of the Internet and the new currency of the digital world.” – Meglena Kuneva, European Consumer Commissioner, 2009

Buying habits, the likelihood of voting for a particular party, the likelihood of accidents, and health habits can all be predicted, or at least prediction can be attempted. Statistical learning works better with larger data sets, which favours larger players. Data also has unexpected and unpredictable uses when it is correlated with apparently unrelated information. All this makes data look like a natural monopoly.

An example is the manhole covers in New York that exploded inexplicably. The explosions turned out to be correlated with requests for telephone line repairs. An investigation found that old (broken) lines produce explosive chemicals that would eventually go off. Replacing the lines solved the issue.

However, correlation is not causation and therefore it is dangerous to base policy solely on Big Data.
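A small simulation makes the danger concrete. It is a sketch under invented assumptions: a hidden factor drives two observed quantities `x` and `y`, so they correlate strongly, yet forcing `x` to new values (a "policy") leaves `y` unchanged. The variable names, noise levels, and sample size are illustrative only.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 2000

# Observed world: a hidden factor drives both x and y; x never influences y.
hidden = [rng.gauss(0, 1) for _ in range(n)]
x_observed = [h + rng.gauss(0, 0.5) for h in hidden]
y_observed = [h + rng.gauss(0, 0.5) for h in hidden]
r_observed = pearson(x_observed, y_observed)

# Policy world: x is set independently of the hidden factor.
# y still depends only on the hidden factor, so the policy has no effect on it.
x_forced = [rng.gauss(0, 1) for _ in range(n)]
y_after = [h + rng.gauss(0, 0.5) for h in hidden]
r_after = pearson(x_forced, y_after)

print(f"correlation in observed data: {r_observed:.2f}")
print(f"correlation after forcing x:  {r_after:.2f}")
```

In the observed data the correlation is strong, so a purely data-driven model would suggest acting on `x` to change `y`. Once `x` is actually changed, the correlation disappears, because the link was never causal.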

References

Gandy, O. H., Jr. (1993). The Panoptic Sort: A Political Economy of Personal Information. Critical Studies in Communication and in the Cultural Industries. Boulder, CO: Westview Press, Inc.