April 2, 2013

Correlation is not the same as causation. But where had I read this recently? I pride myself on quoting my sources and I obsessively archive what I read, but paper archives are not so easily searched. I could switch to online reading and archiving. Following David Pogue's review (*), I could use Google Keep to "make a note on [my] phone". Am I so mindful of my eprivacy that I let myself be bypassed by the future?

On a hunch that this scientific truth came from a New York Times article, I googled "correlation explanation New York Times" over "the past month" and the answer came back as an article by John Noble Wilford, himself quoting Dr. Tschinkel on "mysterious circles dotting an African desert" (**). Eager that Google read and index this fillip to further spread my fully formed ideas, how could I lose sleep over the chance it duly notes my search?

Most of my memes are not even exclusively mine. Were they so, they would still be little more than the visible artifacts whose origin scientists are at pains to prove. My thoughts, however, should remain as subterranean as the suspected cause of the circles, "a particular species of sand termites".

My own hymn to data sharing is more than four years old. It is surely time to say Big Data too can do a lot of good if used appropriately, meaning if not appropriated beforehand in a way either overbearing or underhanded. Well tamed, Big Data is really about putting your own data to work. As Alan Feuer reports, this is how New York City's "Office of Policy and Strategic Planning" helps undermanned departments focus their resources. Dig deep and correlate issues, e.g., "clogged drains", with probable causes, i.e., "restaurants that did not have a carter [to haul away the grease]" (***).

Legalizing data ownership beyond today's noble class would no more confine Big Data to isolated silos than other types of property rights impede economic exchanges. On the contrary. I have just described how Google's masterful machine, which indexes all publicly available documents, can constructively interact with my humble human brain, a personal web of 100 billion neurons (1). Alan Feuer mentions a cooperative project "correlating [New York] city information with data from utilities, like ConEdison" to "detect, in real time, when a building's heat or lights were out".

Yet the more Big Data can be refined into useful information, whether actionable or insightful, the more we need to understand the issues at stake.

Where eprivacy is concerned, the first issue that comes to mind is the risk incurred, whether of identity theft or to personal reputation. Within the past decade, the actual phenomenon has become so common that the media hardly bother with it anymore except to urge constant caution. "Representatives from American Express confirmed that the company was under attack Thursday, but said there was no evidence that customer data had been compromised", relay Nicole Perlroth and David E. Sanger (****). Absence of evidence, of course, is no proof of lack of damage.

American Express stores personal data in order to fulfill transactions made by its users themselves. When consumers are aware neither of "volunteering" data nor of being "observed", will personal data aggregators take as much care? Cost containment is their first concern and we should be under no illusion. Barney Jopson warns us: "Outsourcing companies which provide low-cost computer services are emerging as "the weakest link" in the battle against cybercrime" (*****). As for cybernomachia, it knows no difference between civilian and military targets.

Big banks too are notorious for shifting risks onto taxpayers under the cover of being useful to society. But risks only tell part of the story. Like banking despite its mad excesses, Big Data brings benefits, including shared Big Data, wherein lies the second issue: how to share collaborative profits? The best proof that we live at the dawn of the Information Age is that nobody knows how to go beyond "might makes right". Those who speak of the Digital Age are no more than clever conjurers who use their legerdemot to focus our attention on technology and away from power.

Take Tim Bradshaw's news story about "British teenager Nick D'Aloisio and Summly, which automatically summarises news stories for the small screen" (******). I would be a Grinch to begrudge D'Aloisio my applause. Were he called Aloysius instead of Nick, what else could I ask for? Yet I do have a question on Nick's trick. If it changes "the way content is consumed", how will the ensuing income be split? Since the snippets with which Google answers searches are but clumsy summaries, will Yahoo Summly also summarize the Financial Times' content without its consent?

Besides my own modest proposal on collaborative value creation, does anyone claim to know how to manage the necessary negotiations between monopolies? Aren't the only answers today arbitrary decisions taken by pronaocratic governments under the corrupting influence of competing lobbyists? Richard Waters speaks of "deregulating manholes" as "the latest creative approach to the concept of "unbundling"" (*******). But Amazon, Apple, Facebook and Google practice new forms of bundling as the way to extract consumer data for free and nobody dares to block them.

Authors of a recent report on taxing the Digital Age (2), Pierre Collin and Nicolas Colin call "multisided" such a business model, which rests on extracting personal data for free, by bundling the extraction with a low cost service consumers want, in order to resell the data to advertisers at a huge profit. Not all multisided models succeed. "The value of the Reader system is accruing to the users [...] rather than to Google", writes Paul Ford (********). Google agrees, hence the demise of the service, a clear signal of who controls Big Data benefits, of where power resides today.

Service termination, however, is but a crude profit-sharing power play. Subtler is what John Kay calls "statistical prestidigitation" (*********).

"It is possible to create instantaneous and surprisingly detailed psycho-demographic user profiles - containing statistically valid information - using only publicly available Facebook Likes" (**********). Michal Kosinski stresses the privacy risk created by what can be "inferred" from "volunteered" and "observed" data (3). This again raises the issue of how to apportion value to the co-creators. Notice how the word "statistically" gives us a clue to the third issue, the ability of whoever controls Big Data to bend reality to selfish purposes without bearing any responsibility.

Correlation is not causation. Statistical truth does not do justice to an individual. But for corporations looking for a profit and governments strapped for tax revenues, statistical truth is good enough on average. In the case of grease-clogged sewers, New York City "hand[ed] inspectors a list of statistically likely suspect[s]". What prevents such a list from being handled automatically in the future? Wasn't this Hadopi's original solution to illegal downloads? Why stop there? How difficult is it to salt the statistics to fit a desired result, as John Kay accuses the British government of doing?

In the Age of Food, control of the land gave rise to despotic states. In the Age of Energy, control over commerce begat colonial empires. What heartless dominions will the Age of Information bring about, based on the control of data, Big Data, our data mainly, and its statistical interpretation?

While the establishment of Hedonism exalts the individual as being autonomous, will the latter be ironically reduced to an insignificant data point?

Philippe Coueignoux

April 2013
