June 7, 2011
Polemic in nature, already too long for today's average reader's attention span, my fillips may raise questions and some eyebrows. Yet what am I to do when reality itself goes beyond belief? I warned last week against freedom of speech fundamentalism. Let my evidence speak for itself.
"Charged with violating the Stolen Valor Act, [which] makes it a federal crime to lie about being a military hero", which he admits he has done, "Mr Strandlof has been fighting the case against him, arguing that the law violates his right to free speech", Dan Frosch reported two weeks ago (*).
Is it sane for the US to borrow from China in order to pay federal judges for deciding the matter? If what one says means nothing, why convict Raj Rajaratnam for mouthing mere words into FBI recording machines? If lying about official facts is protected by the constitution, why is it a crime to practice medicine without a license or to pass one's own print for real money? A tyranny where the armed forces are the victims is still a tyranny.
And yet, under certain circumstances, one may lie. Indeed, to free oneself from the fear of offending the social fabric, one should lie.
Recall last week I also criticized Jane Yakowitz for her "data commons" proposal, to be fed by parties in possession of confidential data without the consent of the persons concerned. I pointed out that she was in no hurry to share the drafts of her own papers with the general public (1).
Fortunately Leon Neyfakh, the journalist who had interviewed her, was kind enough to help my research with his insider information. He gave me a URL for Jane Yakowitz' draft as well as for a paper by Suffolk University Law Professor Marc Rodwin. I intend to review both next week.
Incidentally this highlights an issue plaguing the border between what is public and what is private. If I put some personal data under a URL given to a few friends only, is it still private? For Jane Yakowitz, it would remain private unless someone made the URL itself available online. For Paul Ohm, knowing that anyone with the correct URL can access it at will, it would be as good as public, since over time my enemies would come to know the URL.
Both Jane Yakowitz and Marc Rodwin refer to a body of science, confidential data mining, which teaches how one can do research on personal data while duly protecting the persons concerned. George T. Duncan, quoted in Leon Neyfakh's essay, has written a book on this very subject (2). Following Jane Yakowitz' wishes, his work is public. But, although his desire to benefit from his very own data is quite justified, the fact his publisher charges researchers $70 for the right to read it puts him at odds with Jane Yakowitz' ideal of a free "data commons".
It also puts his book outside of my reach. Instead I owe my knowledge on confidential data mining to University of Massachusetts Professor Xiaobai Li. Any error in what follows is due to this modest student's lack of merit. But if I recall correctly, lying happens to be one tool of choice.
Imagine you look at a small town, not far from Lake Wobegon. To help all researchers, its mayor has compiled an extensive data base profiling his fellow citizens, stripped of all personally identifiable data (PII). But what about a certain Lou Gehrig, not that it be his real name naturally?
A baseball coach at the local high school, Lou Gehrig sadly suffers from ALS, a progressive, mostly fatal disease. As he is the only citizen in this case, and despite the absence of any PII, anyone with a passing knowledge of his town knows who the baseball coach suffering from ALS is. This so-called "cell of one" issue is the tip of an iceberg: identity is an emergent quality of personal profiles, no matter how well anonymized.
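To make the "cell of one" concrete, here is a minimal sketch in Python. The profiles are invented for illustration and the function name `cells_of_one` is my own, not a term from the literature. It simply counts how many records share each combination of attributes and reports the combinations held by a single person.

```python
from collections import Counter

# Hypothetical, made-up profiles: no names, only occupation and condition.
profiles = [
    {"occupation": "baseball coach", "condition": "ALS"},
    {"occupation": "physics teacher", "condition": "none"},
    {"occupation": "farmer", "condition": "ALS"},
    {"occupation": "farmer", "condition": "ALS"},
    {"occupation": "farmer", "condition": "none"},
    {"occupation": "farmer", "condition": "none"},
]

def cells_of_one(rows, keys):
    """Return the value combinations for `keys` that match exactly one row."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return sorted(cell for cell, n in counts.items() if n == 1)

# Each attribute alone may look harmless; the combination singles people out.
singletons = cells_of_one(profiles, ["occupation", "condition"])
```

In this toy town the condition alone isolates nobody, yet the pair (occupation, condition) isolates the baseball coach at once.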
One solution would be to erase the occupation from the profile data base. For, contrary to its better known and more endowed neighbor, this town has many cases of ALS. Yet, as occupation is often a crucial factor in their studies, many researchers would object. The same would happen if the mayor struck down health related data. As Paul Ohm would say, anonymity can only be achieved at the expense of utility.
There is a solution however. Simply lie to the researchers. Having run for office, the mayor has a long experience of telling well meaning lies. But sustainable lying is not easy. This is why confidential data mining has it down to a science, although it does not call itself lying, naturally.
The method is to make sure the larger picture stays correct although its details can no longer be relied upon. By shifting the baseball coach's ALS onto the physics teacher, the mayor does not skew his town's ALS statistics. Yet if he lets it be known that his data is false half of the time at random, nobody will be foolish enough to think the only physics teacher in town had so far been able to keep his having ALS a secret.
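The mayor's trick can be sketched as follows. This is only an illustration under my own assumptions: the data is invented, the name `swap_conditions` and its parameters are hypothetical, and pairing records at random is the simplest scheme I could think of rather than the method any textbook prescribes. Records are paired at random and, with probability one half, each pair exchanges its health condition.

```python
import random
from collections import Counter

# Hypothetical, made-up profiles for the small town.
town = [
    {"occupation": "baseball coach", "condition": "ALS"},
    {"occupation": "physics teacher", "condition": "none"},
    {"occupation": "farmer", "condition": "ALS"},
    {"occupation": "farmer", "condition": "none"},
    {"occupation": "grocer", "condition": "none"},
    {"occupation": "mayor", "condition": "none"},
]

def swap_conditions(rows, swap_probability=0.5, seed=None):
    """Return a copy of `rows` where random pairs may have exchanged conditions."""
    rng = random.Random(seed)
    doctored = [dict(row) for row in rows]  # never touch the true data
    order = list(range(len(doctored)))
    rng.shuffle(order)
    # Walk through the shuffled records two by two.
    for i, j in zip(order[0::2], order[1::2]):
        if rng.random() < swap_probability:
            doctored[i]["condition"], doctored[j]["condition"] = (
                doctored[j]["condition"],
                doctored[i]["condition"],
            )
    return doctored

released = swap_conditions(town, seed=2011)
```

Whatever the random draw, the town still counts the same number of ALS cases and the same number of farmers. Only the link between the two is no longer to be trusted. Note that a swap between two records with the same condition changes nothing, so fewer than half of the released records are actually false.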
Is this solution fool-proof? Will Paul Ohm be compelled to concede defeat? Not so fast. If the researcher is bent on exploring occupational diseases, shifting an ALS case from a baseball coach to a physics teacher may well introduce a bias. Not if the study focuses on asbestos exposure, but what if it looks at neurological conditions? Yet how could the poor mayor know in advance to what uses his data will be put once released?
Is then confidential data-mining an oxymoron, a scientific cover for over eager researchers? Again not so fast. Our mayor's wisdom is one reason his electors keep voting for him. He understands one important lesson. If you ask the researcher to release her precise statistical quest to you in advance, then you may still find a way to lie honestly and efficiently about the profile data you send her in answer. Otherwise deny the request.
Science cedes priority to power plays. Forced to reveal their intentions, researchers will bristle against what appears to them as censorship under another name. Without the raw data, who is to prove there was no safe way to lie? Besides, this approach puts a brake on their creativity. If their best idea comes once they have the doctored data, can they afford to bother the mayor again to make sure his past lies do not lie in their new ways?
Aren't there suddenly some points worth raising? Isn't the whole issue a negotiation about whose privacy is more important, the individuals concerned or the researcher? And if research questions must be communicated in advance, why bother producing the data, especially since it will be doctored and doctored in ways tailored to each question? Why can't the mayor process the questions himself and give away no profile data at all?
Researchers will say that, no matter his wisdom, this poor mayor is incompetent. Forget about social studies, what will he do about health sciences? But this is beside the point. Well formulated, research questions can be programmed. Well programmed, they can be executed on the mayor's machine. Only Big Data devotees believe the sharpest insights come from centrally accumulating as much raw data from as many people as possible.
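What "executed on the mayor's machine" might look like can be sketched in a few lines. Everything here is hypothetical: the data, the names `run_on_mayors_machine` and `als_by_occupation`, and above all the rule of withholding counts below a minimum size, which is my own addition to keep the answer from recreating a cell of one. The researcher sends a question in the form of a program and the mayor returns aggregate counts only.

```python
from collections import Counter

# Hypothetical profiles: they stay on the mayor's machine and are never sent out.
_PROFILES = [
    {"occupation": "baseball coach", "condition": "ALS"},
    {"occupation": "physics teacher", "condition": "none"},
    {"occupation": "farmer", "condition": "ALS"},
    {"occupation": "farmer", "condition": "ALS"},
    {"occupation": "farmer", "condition": "ALS"},
    {"occupation": "farmer", "condition": "none"},
    {"occupation": "farmer", "condition": "none"},
]

def als_by_occupation(rows):
    """The researcher's question: how many ALS cases per occupation?"""
    return Counter(row["occupation"] for row in rows if row["condition"] == "ALS")

def run_on_mayors_machine(query, min_cell_size=3):
    """Run the researcher's query locally and release only large enough counts."""
    counts = query(_PROFILES)
    # Assumed safeguard: drop any count small enough to point at one person.
    return {cell: n for cell, n in counts.items() if n >= min_cell_size}

answer = run_on_mayors_machine(als_by_occupation)
```

The researcher learns that three farmers suffer from ALS. About the baseball coach she learns nothing, and no profile has left town.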
If all 200 million adult Americans sat on all juries, answered all polls, told the President what to do all the time, what a great tyranny it would be!
Philippe Coueignoux
- (*) Fighting for the Right to Tell Lies, by Dan Frosch (New York Times) - May 21, 2011
- (1) see Jane Yakowitz's publications, per the Brooklyn Law site on June 6, 2011
- (2) see Statistical Confidentiality, by George T. Duncan, Mark Elliot and Juan-José Salazar-González (Springer), 2011