Data Mining: The good, the bad and how to regulate effectively.

by Inayat Chaudhry

We live in an era of “big data”. Every day, all of us provide large amounts of data to private and public organizations, the government and telecommunication companies. Data mining can be defined as “the intelligent search for new knowledge in existing masses of data.”[1] While unregulated surveillance plainly threatens our privacy, it would be a mistake to say that data mining is entirely harmful. Some of the information contained in metadata fuels innovative research, and data mining has recently been especially helpful in healthcare. The discussion that follows is divided into three parts. Parts I and II discuss, respectively, the privacy concerns that data mining raises and the research benefits it provides for public health. Part III proposes a three-step approach for regulating data mining effectively, so that we can diminish the privacy concerns while preserving and expanding the research benefits.

I. The Bad: Data mining and Privacy concerns.

Before discussing privacy concerns in an age of data mining, it is important to note that data mining by private or public organizations and data mining by the government are two completely different beasts. This paper does not address government surveillance; it focuses on the former.

Most users click “accept” on privacy policies without hesitation. We check into places on Foursquare and Path, and we use debit and credit cards that can be traced back to our home addresses. Most social networking websites, such as Facebook and Twitter, offer options to include your location when you transmit data. This is exactly the kind of behavior that raises privacy concerns. The explosion of social media gives data companies a much deeper look into one’s personal and social life, including access to a user’s habits, likes and dislikes.[2] Scholars have explained that “digital dossiers” holding users’ personal information have been created on an extensive scale.[3] Without a way to look at our own consumer profiles, it is difficult to say what these companies know and what kind of information they are disclosing to other interested parties.

A deep concern for privacy experts and lawmakers is what might be done with a user’s information once it is collected: identity theft, impersonation and personal embarrassment are only some of the consequences that concern them. Most importantly, they worry that people might be written off as “undesirable” when their virtual selves differ from their “actual offline” selves. For example, Acxiom, a giant of consumer database marketing, categorizes people into different socioeconomic groups and markets to them accordingly.[4] However, information obtained about a user online might not be fully accurate, and an inaccurate profile might deprive some people of offers that would otherwise be targeted towards them. Furthermore, though some users might find “discounted” personalized offers beneficial, others might see the surveillance required to make these offers as intrusive and as a violation of their privacy.

II. The Good: The Benefits of Data Mining.

While there is a lot of talk about the harmful effects of data mining on a user’s privacy, there is also a bright side to this practice. Recent research has shown that applying data mining techniques to large medical datasets can uncover valuable knowledge that would otherwise go untapped. The following discussion therefore focuses on the significant benefits that data mining has created in the field of healthcare, using case studies that have proved successful.

In 2012, researchers from Harvard’s School of Public Health analyzed how human mobility affects malaria infections.[5] In this study, the scientists collected data from approximately 15 million mobile phones over the course of one year.[6] The data was used to identify “hot spots”: locations to which infected humans were most likely to travel, carrying the disease with them. Based on this information, the team was able to show that malaria is spread through the movement of infected humans rather than the movement of mosquitoes.[7] The hot spots helped the scientists identify locations with high endemic rates[8] that could most benefit from targeted malaria intervention programs.[9]

In another case study from 2001, a group of researchers set out to analyze the relationship between antipsychotic drugs and myocarditis and cardiomyopathy, using international databases maintained by the World Health Organization. The researchers used a data mining approach to test reports of clozapine and other antipsychotic drugs suspected of causing myocarditis and cardiomyopathy against all other reports in the WHO database.[10] Using Bayesian statistics, they found that clozapine was reported in relation to cardiomyopathy and myocarditis significantly more frequently than other drugs.[11] While the researchers concluded that further research is required in order to determine a causal effect, it is easy to see why this finding might be useful. Chemists could potentially save lives by comparing clozapine’s chemical structure with that of other drugs to tease out which chemical component leads to inflammation and chronic disease of the heart muscle. Isolating and eliminating that component in the production of clozapine and other drugs could help avoid this result in the future.
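To make the idea of testing one drug's reports against all other reports more concrete, the following is a minimal sketch of a disproportionality score. It is not taken from the study: the function name and all counts are hypothetical, invented purely for illustration. It compares how often a drug and a reaction are reported together with how often they would be expected to appear together if they were unrelated, using a log2 ratio with a small shrinkage term, a measure often called an information component in pharmacovigilance.

```python
import math


def information_component(n_drug_event, n_drug, n_event, n_total):
    """Return log2(observed / expected) with +0.5 shrinkage on both terms.

    A score well above 0 means the drug-reaction pair is reported more
    often than independence would predict. It signals disproportionate
    reporting only; it does not establish causation.
    """
    # Expected number of joint reports if drug and reaction were unrelated.
    expected = n_drug * n_event / n_total
    return math.log2((n_drug_event + 0.5) / (expected + 0.5))


# Hypothetical counts, invented for illustration (not the WHO data).
score = information_component(
    n_drug_event=40,    # reports naming both the drug and the reaction
    n_drug=2_000,       # all reports naming the drug
    n_event=500,        # all reports naming the reaction
    n_total=1_000_000,  # all reports in the database
)
print(f"disproportionality score: {score:.2f}")
```

A score near zero would mean the pair is reported about as often as chance predicts, which is why a finding like the one above still calls for follow-up research before any causal claim is made.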

III. Effective Regulation – Focus on use, harsh punishments for abuse.

A huge problem with America’s privacy law is that it regulates the release of data rather than its use.[12] The critical question, therefore, is how organizations, the government and individuals can focus more closely on data use. In this section of the paper, I put forth my own three-step approach of anonymity, accountability and punishment to address this issue. The first step, borrowed from Jane Yakowitz’s proposal,[13] is that all personal data collected by organizations should be stripped of personally identifiable information. In the second step, this anonymized data should be placed in a hypothetical “protective incubator” that grants access only after an entity requests it. The entity would have to provide the legal name of its organization or self and at least one type of unique identifier that can speak to its legitimacy, thereby building accountability into the structure. With the barrier to entry this high, a number of the ways in which personal data is misused can be eliminated at this stage. Of course, this protective incubator model would require an agency dedicated to the cause, but even if forming a new agency for this purpose is not viable, the Federal Trade Commission’s Bureau of Consumer Protection should be able to handle this kind of regulation. The last step would require lawmakers to impose strict punishment for the misuse of data. To be effective, the penalty would have to be strong enough to give pause to a reasonable person.

At this point, it is worth noting that although the data will be anonymous, the incubator (and only the incubator) would retain the ability to trace the identity of users whose information is in the database. Following HIPAA’s lead, the purpose is to put the incubator in a position to provide identifiable information when an organization requires it for purposes such as medical research. The requesting organization, however, would have to be accountable by assuring that it would personally guard and handle the data itself. If found in violation of this condition, it would be subject to harsh punishment.
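The first two steps of the model can be sketched in a few lines of code. This is only an illustration under stated assumptions, not a design for a real system: the class name, the field names and the list of identifying fields are all hypothetical, and a real incubator would need a far stronger notion of anonymization than dropping a few named fields. The sketch shows identifying fields being separated from the rest of each record, the identities being held only inside the incubator, and every access request being refused unless it carries a legal name and an identifier, then logged for accountability.

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical set of fields treated as personally identifying.
IDENTIFYING_FIELDS = {"name", "email", "address"}


@dataclass
class Incubator:
    """Hypothetical 'protective incubator' holding anonymized records."""

    _records: dict = field(default_factory=dict)     # token -> anonymized data
    _identities: dict = field(default_factory=dict)  # token -> identity, kept internal
    access_log: list = field(default_factory=list)   # who asked for access

    def deposit(self, record):
        """Step 1: split a record into identity and anonymized parts."""
        token = uuid.uuid4().hex
        self._identities[token] = {
            k: v for k, v in record.items() if k in IDENTIFYING_FIELDS
        }
        self._records[token] = {
            k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS
        }
        return token

    def request_access(self, legal_name, identifier):
        """Step 2: release anonymized data only to a named, logged requester."""
        if not legal_name or not identifier:
            raise PermissionError("legal name and unique identifier are required")
        self.access_log.append((legal_name, identifier))
        # Return copies so callers cannot alter the stored records.
        return [dict(r) for r in self._records.values()]


incubator = Incubator()
incubator.deposit(
    {"name": "Jane Doe", "email": "jane@example.com", "zip3": "021", "purchases": 7}
)
released = incubator.request_access("Example Research Lab", "hypothetical-id-123")
print(released)
```

The access log is what makes the third step workable: punishment for misuse presupposes a record of who received the data.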

The key to this model is the ability of lawmakers and institutions such as the FTC to place reasonable limits on an organization’s access to data. I believe that this paradigm shift, from a focus on the release and collection of data to a focus on its use, will mitigate many of the “bad consequences” of data mining, such as identity theft, fraud and unfair discrimination, while retaining its research benefits.

[1] See Joseph S. Fulda, Data Mining and Privacy, 11 Alb. L.J. Sci. & Tech. 105, 106 (2000-2001).

[2] See generally Natasha Singer, Mapping, and Sharing, the Consumer Genome (accessed March 11, 2014).

[3] See generally Daniel J. Solove, Digital Dossiers and the Dissipation of Fourth Amendment Privacy, 75 S. Cal. L. Rev. 1083 (2002).

[4] Id.

[5] See generally Amy Wesolowski, Nathan Eagle, Andrew J. Tatem, David L. Smith, Abdisalan M. Noor, Robert W. Snow & Caroline O. Buckee, Quantifying the Impact of Human Mobility on Malaria, 338 Science 267 (2012) (available at

[6] Id. at 268.

[7] Id.

[8] Id. at 269.

[9] Id. at 270.

[10] See David M. Coulter, Andrew Bate, Ronald H B Meyboom, Marie Lindquist & I Ralph Edwards, Antipsychotic drugs and heart muscle disorder in international pharmacovigilance: data mining study, 322 British Med. J. 1207, 1207 (available at

[11] Id. at 1208.

[12] See Jane Yakowitz, Tragedy of the Data Commons, 25 Harv. J.L. & Tech. 1, 43 (available at

[13] Id. at 44.