April 17, 2012
Big data is not (yet) big enough - part two
In my previous post, I examined the emergence of petabyte-scale data sets and some of the benefits, and risks, of using them as evidence to power health care decisions. To further that conversation, I wanted to share some additional thoughts on the importance of diverse groups working with these data sets.
The value of having more diverse communities use this data may lie in the serendipitous opportunities it creates. Not only will different communities look for new data sets to combine with the primary data (research networks, and even social networks such as Facebook or Twitter), but they may also approach the problem from new angles.
Explicit, contractually formalized partnerships among collaborators are needed: ones that are clear on the principles of data stewardship and that engage and remunerate participating consumers, to counterbalance the incentives that provider institutions and other organizations have for near-real-time use of clinical data in knowledge discovery and evidence development. One of my posts last fall addressed the potential for statistical biases and wrong answers when individuals or groups decline to participate, and do so in a manner influenced by non-random factors. The same concern arises when participation is declined because of intense competition between teams of researchers who are all seeking tenure or grants, or between companies and universities that have financial and intellectual property (patent) interests in the results of the research.
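To make the non-participation concern concrete, here is a minimal Python sketch of how non-random opt-outs can skew an estimate. Everything in it is an illustrative assumption rather than real data: the function name `simulate_opt_out`, the "severity" score, and the participation rates (90% for people below the population average, 50% for those at or above it) are all hypothetical.

```python
import random


def simulate_opt_out(n=100_000, seed=0):
    """Compare a population mean with the mean seen among participants only.

    Hypothetical setup: each person has a 'severity' score drawn from a
    normal distribution (mean 50, standard deviation 10). People with
    higher scores are assumed to be less likely to share their data.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_sum = 0.0
    observed_sum = 0.0
    observed_n = 0
    for _ in range(n):
        severity = rng.gauss(50, 10)
        true_sum += severity
        # Assumption: participation depends on the very thing being measured.
        p_participate = 0.9 if severity < 50 else 0.5
        if rng.random() < p_participate:
            observed_sum += severity
            observed_n += 1
    true_mean = true_sum / n
    observed_mean = observed_sum / observed_n
    participation_rate = observed_n / n
    return true_mean, observed_mean, participation_rate


if __name__ == "__main__":
    true_mean, observed_mean, rate = simulate_opt_out()
    print(f"true mean:          {true_mean:.2f}")
    print(f"observed mean:      {observed_mean:.2f}")
    print(f"participation rate: {rate:.2%}")
```

Under these assumptions the participants-only average comes out lower than the population average, even though roughly 70% of people take part. Collecting more records does not remove the gap, because the missing records are not missing at random.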
“Free” data sharing is a noble, mom-and-apple-pie notion, but it is extremely unlikely in the long run, on account of incentives and private property rights. This reality is overlooked or discounted by Michael Nielsen and by other Big Data evangelists in the scientific community. In industry and academia alike, most scientists have not shared their raw data or all of their findings for free with any and all interested parties, including their competitor scientists, because their own future publications and career advancement, and their own organizations' success, depend on their not sharing. It is purely delusional to think that the growing availability of interoperable mechanisms for Big Data and sharing will somehow make all of those motivations and incentives passé.
Although I believe that getting a broader base of people interested in health-related Big Data will have the significant benefits I noted in my previous post, I suspect that most of the work of mining health data, and its “cutting edge,” will remain with the specialists and with the organizations that have the funding to support it. Making breakthrough discoveries and creating big gains in value, safety, and efficiency in health care requires paying for the collection, quality assurance, and curation of the Big Data, the management of its privacy and ethics, and the protection of the property rights that inhere in the data and in derivative works produced from the raw data. Remember that the data do not digitize themselves, curate themselves, or serve themselves up for use without money being spent to do those things. It is an expensive proposition to create and maintain usable, accurate datasets that can be mined, and the organizations and individuals who create and maintain them need a fair return on their investment, or else the whole Big Data enterprise will not be financially sustainable enough to deliver lasting value to society.
In any case, without the continuous involvement of scientifically qualified people, mining Big Data can lead to serious misinterpretations and false conclusions, especially where the data are complex, de-normalized, and non-transparent. Presently, only within the bioinformatics communities do you find the combination of ontological, epistemological, and statistical expertise needed to make major, reliably accurate discoveries and breakthroughs in understanding from Big Data. In fact, many of the potential apps imagined will absolutely require expert input to create and maintain genuinely valuable, reliable solutions (products, services). But without adequate and continuous involvement of non-scientist consumers, businesses, and other stakeholders, mining Big Data can leave those groups seriously disenfranchised from the eventual applications and policies, and from the attendant risks and benefits.
It would help tremendously to create more opportunities for citizen scientists, domain experts, and the general public to engage together with this critically important data and with policies and practices governing its use. In that regard, there is a two-day conference to be held in late September on this topic at Oxford University. Maybe you will consider going? It is after most of the Queen's Diamond Jubilee events and the Summer Olympics, so the logistics of attending the conference should not be too daunting...
Douglas McNair, MD, PhD, is president of Cerner Math, Inc., and one of three Cerner Engineering Fellows. He is responsible for innovations in decision support and very-large-scale data mining. McNair joined Cerner in 1986, first as VP of Cerner's Knowledge Systems engineering department, then as VP of Regulatory Affairs, and then as General Manager for Cerner's Detroit and Kansas City branches. Subsequently, he was Chief Research Officer, responsible for Cerner's clinical research operations. In 1987, McNair was co-inventor and co-developer of Discern Expert®, a decision-support engine that today is used in more than 2,000 health care facilities around the world. Between 1977 and 1986, McNair was a faculty member of Baylor College of Medicine in the Departments of Medicine and Pathology. He is a diplomate of the American Board of Pathology and the American Board of Internal Medicine.
Bekkerman R, Bilenko M, Langford J, eds. Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge Univ, 2011.
Campbell J, et al., eds. Knowledge and Skepticism. Bradford, 2010.
Davis K, Patterson D. Ethics of Big Data: Balancing Risk and Innovation. O'Reilly, 2012.
Dobbs R, et al. Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey, 2011.
Fox B. Using big data for big impact: How predictive modeling can affect patient outcomes. Health Manag Technol. 2012;33:32-3.
Franks B. Taming The Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics. Wiley, 2012.
Graham M. Big Data and the end of Theory? The Guardian, 09-MAR-2012.
Grossman C, et al., eds. Clinical Data as the Basic Staple of Health Learning. Institute of Medicine/National Academies Press, 2010.
Lifsitz R. Big Data will begin improving the bottom line of healthcare players. Scientia Advisors Blog, 06-FEB-2012.
Lohr S. The age of Big Data. New York Times, 11-FEB-2012.
Loukides M, et al. Big Data Now: Current Perspectives. O'Reilly, 2012.
Minelli M, Smith D. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. Wiley, 2012.
Oxford University 2012 Conference on Internet, Politics & Policy: Big Data, Big Challenges?
Pearl J. Causality: Models, Reasoning and Inference. 2e. Cambridge Univ, 2009.
Ratner B. Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data. 2e. CRC, 2011.
Rogers S, Girolami M. A First Course in Machine Learning. CRC, 2011.
Seni G, Elder J. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool, 2011.
Tang L, Liu H. Community Detection and Mining in Social Media. Morgan & Claypool, 2010.
Watts D. Everything is Obvious: *Once You Know The Answer. Crown, 2011.
Weinberger D. Too Big to Know: Rethinking Knowledge Now That the Facts Aren't the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room. Basic, 2012.
Wu J, Coggeshall S. Foundations of Predictive Analytics. CRC, 2012.