Philippe's Fillips

Confidential data transaction cost cutting

June 21, 2011

In support of a compulsory personal "data commons", Marc Rodwin criticizes solutions based on the recognition of individual eprivacy rights for imposing high transaction costs. Yet how can one reasonably justify making data piracy legal when ordinary citizens are the victims but illegal when they are the perpetrators? I propose instead to look for ways to cut data transaction costs.

Most of those costs derive from a single factor. Crediting Pamela Samuelson, Marc Rodwin explains "when individuals transfer their property interest in data, they lose control over its use" (1). While one should use data contracts to attach relevant terms and conditions (2), "such restrictions will be very hard to enforce after the sale". Associated costs are all the higher as one tries to create markets amidst a plunder economy.

True democracy is not cost free. Denied public resources, Justice quickly descends to its cheapest level, i.e. "might makes right". Still one can be pragmatic without losing one's ideals. The best way to cut data transaction costs is to stop making such data transactions. This also makes the whole issue of profile de-identification irrelevant. Neat, isn't it?

Am I being facetious? Not so. My faithful readers already know my solution. New ones may well meditate on Milton's words on inventions. "So easy it seem'd / Once found, which yet unfound, most would have thought / Impossible".

But before I proceed, let me stress a vital point. Whether addressing social or health related problems, data mining is meaningless if not based on common data definitions. Not for nothing does Adam Liptak report that "in the last two decades, the use of dictionaries at the Supreme Court has been booming" (*). Deride lawyers all you want for asking what the meaning of the word "is" is, but heed the symptoms. The underlying disease is deadly.

Take Susan Saulny and Jacques Steinberg's analysis of race as reported by US college applications (**). Is it how candidates see themselves, how they are seen by those with whom they have interacted so far or how they want to be seen by admissions officers and future employers? "Adding to the confusion in admissions offices is that there is no standard definition, in higher education or elsewhere, of what it means to be mixed race".

Physicians use race to qualify the risks a patient faces relative to certain conditions. In that case, race relates more to genetic predispositions than personal perceptions. In other cases, race is a code rather than a fact, to be expressed and modulated according to circumstances in a society anything but race neutral. How can one carry out academic research on such data without knowing what it measures in the first place?

Hence before one even dreams of sharing data, one must insist researchers and data owners all share the same formal vocabulary within practical domains of application. The cost associated with this requirement does not depend on the number of data transactions it enables and should prove a highly profitable investment. Make no mistake. The stakes are high. Word definitions carry power, witness the trend noted at the Supreme Court.

Assume then the existence of shared vocabularies. My solution simply provides individual profile owners with a confidential software platform which can receive and process the questions put to them by researchers over the Internet and send back the results without ever disclosing the data (3).
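One way to picture this platform is the minimal sketch below. It assumes a hypothetical question format (a term from the shared vocabulary, a comparison operator and a value). The names `Question`, `ConfidentialPlatform` and the sample profile fields are illustrative assumptions only, not the design described in the patents. The point is simply that the question travels to the data, and only a yes/no answer travels back.

```python
from dataclasses import dataclass
import operator

# Hypothetical set of comparison operators a question may use.
OPERATORS = {"eq": operator.eq, "lt": operator.lt, "ge": operator.ge}


@dataclass(frozen=True)
class Question:
    term: str      # a term taken from the shared vocabulary
    op: str        # a key into OPERATORS
    value: object  # the value to compare the profile entry against


class ConfidentialPlatform:
    """Holds one individual's profile; the profile never leaves this object."""

    def __init__(self, profile, vocabulary):
        self._profile = dict(profile)
        self._vocabulary = set(vocabulary)

    def answer(self, question):
        # Reject questions phrased outside the shared vocabulary.
        if question.term not in self._vocabulary or question.op not in OPERATORS:
            raise ValueError("question does not conform to the shared vocabulary")
        # A term absent from this profile simply yields a negative answer.
        if question.term not in self._profile:
            return False
        compare = OPERATORS[question.op]
        return bool(compare(self._profile[question.term], question.value))


# Usage: the researcher learns only the boolean reply, not the age itself.
platform = ConfidentialPlatform(
    profile={"age": 47, "smoker": False},
    vocabulary={"age", "smoker"},
)
reply = platform.answer(Question("age", "ge", 40))
```

In this sketch the vocabulary check stands in for the shared formal vocabulary discussed above: a question using an unknown term is refused rather than guessed at.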

While this architecture may use the inherently decentralized nature of the Internet quite efficiently, it cannot avoid a critical discussion regarding the rights of both the researcher and the individual data owners.

When they start without a clue about what is important, researchers need freedom of thought. As Simon Kuper writes of statistics in soccer (***), "by the mid-2000's, the numbers men in football were becoming uneasily aware that many of the stats they had been trusting were useless". Yet, if its domain vocabulary lacks the correct term, my arm's-length approach cannot discover it. One then needs direct access to raw data in its totality.

Michael M. Crow for instance proposes to reform NIH with the goal to "actually improv[e] people's health" with an eye on "behavioral, social and cultural shifts" (****). But medical records and billing statements miss most of what primary care physicians and social workers, let alone patients, do. Learning from soccer, he may want to capture them all on tape. Not to be overwhelmed with a "Big Data" approach, he will have to select and enroll limited learning samples with the consent of all concerned. Once what matters is agreed upon, it can be defined and handled by my approach.

Freedom of thought is not only curbed by denying researchers direct data access. Having researchers send their questions to research subjects, rather than having subjects send their data to researchers, raises the possibility of censorship and surveillance. Similar to surveillance but done by competitors, spying also threatens the economic interests behind academic research. Again look at soccer. "A lot of [the numbers that matter] is proprietary" says "Chelsea's performance director".

The solution is to encrypt questions and, for higher security, process them within the user's confidential environment in a tamper resistant module.

Consumer products may not offer the latter overnight although smart cards have shown the way. But, at least in healthcare, what could be the fixed cost of fitting a tamper proof module to today's electronic records systems? And when individual patients hand over their co-payments, what could be the marginal cost of soliciting and entering their consents? Would securing the latter by lowering the co-payments bankrupt healthcare?

Profile data is never exchanged. Can it be difficult to draw up a standard contract simply to instruct an organization, to which a consumer has already entrusted personal data and which would recognize the proprietary rights of the consumer, to accept and process third party questions against it?

Still, answering a question about one's own private profile can leak back information about it to the party who asked the question. Whenever a party freely engages in a mutually agreeable transaction, it must accept letting the counter party observe some personal data. The case of social and health related research is special though, as the individual's interest is highly diluted, the researchers unknown and the questions unforeseeable.

To rule out observations in this case, rely on the formalism of my approach to set up intermediaries with special roles at both ends of the chain. Individual answers can be reduced to a series of counter increments. Ask a proxy downstream to implement these counters and certify the population surveyed is large enough to make reverse identification impossible according to the principles of confidential data mining. Upstream ask a domain specialist to vet questions for conformity, which is not the same as censorship by data owners or holders.
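The downstream proxy can be sketched as follows. This is an illustration under stated assumptions: the class name `CounterProxy` and the `min_population` threshold are hypothetical, and a simple size threshold is only a stand-in for the principles of confidential data mining, which call for more than this to rule out reverse identification.

```python
class CounterProxy:
    """Hypothetical downstream proxy: tallies answers, never records who gave them."""

    def __init__(self, min_population):
        self.min_population = min_population  # assumed release threshold
        self.respondents = 0
        self.matches = 0

    def record(self, answer):
        # Each individual answer is reduced to counter increments.
        self.respondents += 1
        if answer:
            self.matches += 1

    def release(self):
        # Withhold the tally until the surveyed population is large enough.
        if self.respondents < self.min_population:
            return None
        return {"respondents": self.respondents, "matches": self.matches}


# Usage: three answers are not enough for a threshold of five.
proxy = CounterProxy(min_population=5)
for individual_answer in [True, False, True]:
    proxy.record(individual_answer)
too_early = proxy.release()

# Two more answers reach the threshold and the counters can be released.
for individual_answer in [False, True]:
    proxy.record(individual_answer)
result = proxy.release()
```

The researcher only ever sees the released counters, which is what keeps individual answers out of view.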

Aren't such stringent measures too stifling? Not really. Assume a question precisely describes and sizes a rare population. Instead of adding enough noise to thwart identification, why not complement the question with a solicitation for matching targets to enroll in a limited training sample? Training sets are also the way to run best-fitting algorithms. Use my approach only to measure how good the "best fit" thus found is. But one could also use it to discover that beer correlates with diapers at supermarkets (4). Assuming they sell 32,000 items, the proxy can fit the billion or so pairwise counters on a PC hard drive!
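The billion figure can be checked with a few lines of arithmetic. The sketch below assumes one counter per pair of items and, as a further assumption not found in the text, four bytes per counter.

```python
ITEMS = 32_000
BYTES_PER_COUNTER = 4  # assumption: one 32-bit counter per pair of items

# Counting every (item A, item B) combination, order included.
ordered_pairs = ITEMS * ITEMS

# Counting each distinct pair of different items once.
unordered_pairs = ITEMS * (ITEMS - 1) // 2

# Storage for the larger of the two tables, in gigabytes (10**9 bytes).
storage_gb = ordered_pairs * BYTES_PER_COUNTER / 1e9

print(ordered_pairs, unordered_pairs, storage_gb)
```

Under these assumptions the full table holds 1,024,000,000 counters and takes about 4.1 GB, while counting each distinct pair once halves that to roughly 512 million counters.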

Yet I admit my solution comes up short on one source of cost, unfortunately far from marginal, i.e. changing the general mindset. Ask Copernicus.

Philippe Coueignoux

(*) Justices Turning More Frequently to Dictionary, and Not Just for Big Words, by Adam Liptak (New York Times) - June 14, 2011

(**) On College Forms, a Question Of Race, or Races, Can Perplex, by Susan Saulny and Jacques Steinberg (New York Times) - June 14, 2011

(***) The numbers game, by Simon Kuper (Financial Times) - June 18, 2011

(****) Growing a better NIH, by Michael M. Crow (Boston Globe) - June 19, 2011

(1) for more details see Patient Data: Property, Privacy & the Public Interest, by Marc A. Rodwin (Suffolk University Law School) - May 2, 2010

(2) this is indeed what I advocate in my submission to the Department of Commerce call for feedback on privacy - January 27, 2011

(3) for more details, see US Patent Numbers 6,092,197 and 7,945,954 and US Patent Application 2009/0076914.

(4) I do not remember the exact source of this example, used to illustrate the power of "Big Data" to generate unexpected information from raw data.

June 2011

Copyright © 2011 ePrio Inc. All rights reserved.