April 17, 2012
Big data is not (yet) big enough - part two
In my previous post, I examined the emergence of petabyte-scale data sets and some of the benefits, and risks, of using them as evidence to power health care decisions. To further that conversation, I wanted to share some additional thoughts on the importance of diverse groups working with these data sets.
The value of having more diverse communities use these data may lie in the
serendipitous opportunities that such use creates. Not only will different
communities look for new data sets to combine with the primary data
(research networks, and even social networks such as Facebook or Twitter), but
they may also approach the problem from new angles.
Explicit and contractually formalized partnerships of collaborators are
needed: ones in which there is clarity on the principles of data stewardship,
plus engagement and remuneration of participating consumers, to
counterbalance the incentives to provider institutions and other organizations
for near-real-time use of clinical data in knowledge discovery and evidence
development. One of my posts last fall addressed the potential for statistical
biases and wrong answers when individuals or groups of people decline to
participate and do so in a manner that is influenced by non-random factors. The
same concern arises when participation is declined because of intense competition
between teams of researchers who are all seeking tenure or grants, or between
companies and universities that have financial and intellectual property
(patent) interests in the results of the research.
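To make the non-random opt-out concern concrete, here is a minimal Python sketch. Everything in it is a hypothetical illustration rather than an analysis of any real data set: the function name `simulate_opt_out`, the uniform "risk score," and the participation probabilities are all assumptions chosen only to show the direction of the bias.

```python
import random
from statistics import mean


def simulate_opt_out(n=10_000, seed=0):
    """Compare the true population mean with means from two kinds of samples.

    Hypothetical setup: each person has a risk score drawn uniformly from
    [0, 1]. In the biased sample, the chance of sharing data falls as the
    score rises (non-random opt-out). In the control sample, everyone
    participates with the same flat probability.
    """
    rng = random.Random(seed)
    scores = [rng.random() for _ in range(n)]

    # Assumed rule: participation probability drops from 0.9 to 0.3 as risk rises.
    biased = [s for s in scores if rng.random() < 0.9 - 0.6 * s]

    # Control: opt-out unrelated to risk, flat 60% participation.
    unbiased = [s for s in scores if rng.random() < 0.6]

    return mean(scores), mean(biased), mean(unbiased)


if __name__ == "__main__":
    true_mean, biased_mean, random_mean = simulate_opt_out()
    print(f"true mean:               {true_mean:.3f}")
    print(f"non-random opt-out mean: {biased_mean:.3f}")
    print(f"random opt-out mean:     {random_mean:.3f}")
```

Under these assumed numbers, the mean among participants tends toward 0.25/0.6, or about 0.42, rather than the true 0.50, while the randomly thinned sample stays close to the truth. Collecting more records does not shrink that gap, which is why non-random refusal is a problem that sheer data volume cannot fix.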
“Free” data sharing is a noble, mom-and-apple-pie notion, but it is extremely
unlikely in the long run, on account of incentives and private property rights.
This reality is overlooked or discounted by Michael Nielsen and by other Big
Data evangelists in the scientific community. In industry and academia alike,
most scientists have not shared their raw data or all of their findings for
free with all and sundry interested parties, including their competitor
scientists, because their own future publications and career advancement, and
their own organizations’ success, depend on their not sharing. It is purely
delusional to think that the growing availability of interoperable mechanisms
for Big Data and sharing will somehow make all of those motivations and
incentives passé.
Although I believe that getting a broader base of people interested in
health-related Big Data will have the significant benefits I noted in my
previous post, I suspect that the majority and the “cutting edge” of mining
health data to make breakthrough discoveries and create big gains in value,
safety, and efficiency in health care will remain with the specialists and with
the organizations that have the funding to support the collection, quality
assurance, and curation of the Big Data, plus the management of the privacy and
ethics of Big Data, plus protection of the property rights that inhere in the
data and in derivative works that are produced from the raw data. Remember that
the data do not digitize themselves, curate themselves, or serve themselves up
for use without any money having to be spent to do those things. It is an
expensive proposition to create and maintain usable, accurate datasets that can
be mined, and the organizations and individuals who create and maintain them
need a fair return on their investment, or else the whole Big Data enterprise
will not be financially sustainable enough to deliver lasting value to society.
In any case, without the continuous involvement of scientifically qualified
people, mining Big Data can lead to serious misinterpretations and false
conclusions, especially where the data are complex, denormalized, and
non-transparent. Presently, only within the bioinformatics communities do you
find the combination of ontological, epistemological, and statistical expertise
needed to make major, reliably accurate discoveries and breakthroughs in
understanding from Big Data. In fact, many of the potential apps imagined will
absolutely require expert input to create and maintain genuinely valuable,
reliable solutions (products, services). But without adequate and continuous
involvement of non-scientist consumers, businesses, and other stakeholders,
mining Big Data can leave those stakeholders seriously disenfranchised from the
eventual applications, policies, risks, and benefits.
It would help tremendously to create more opportunities for citizen
scientists, domain experts, and the general public to engage together with this
critically important data and with the policies and practices governing its
use. In that regard, a two-day conference on this topic will be held in late
September at Oxford University. Maybe you will consider going? It is after most
of the Queen’s Diamond Jubilee events and the Summer Olympics, so the logistics
of attending the conference should not be too daunting...
Douglas McNair, MD, PhD, is president of Cerner Math, Inc., and one of three Cerner Engineering Fellows, responsible for innovations in decision support and very-large-scale data mining. McNair joined Cerner in 1986, first as VP of Cerner’s Knowledge Systems engineering department; then as VP of Regulatory Affairs; then as General Manager for Cerner’s Detroit and Kansas City branches. Subsequently, he was Chief Research Officer, responsible for Cerner’s clinical research operations. In 1987, McNair was co-inventor and co-developer of Discern Expert®, a decision-support engine that today is used in more than 2,000 health care facilities around the world. Between 1977 and 1986, McNair was a faculty member of Baylor College of Medicine in the Departments of Medicine and Pathology. He is a diplomate of the American Board of Pathology and the American Board of Internal Medicine.
Bekkerman R, Bilenko M, Langford J, eds. Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge Univ, 2011.
Campbell J, et al., eds. Knowledge and Skepticism. Bradford, 2010.
Davis K, Patterson D. Ethics of Big Data: Balancing Risk and Innovation. O'Reilly, 2012.
Dobbs R, et al. Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey, 2011.
Fox B. Using big data for big impact: How predictive modeling can affect patient outcomes. Health Manag Technol. 2012;33:32-3.
Franks B. Taming The Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics. Wiley, 2012.
Graham M. Big Data and the end of Theory? The Guardian, 09-MAR-2012.
Grossman C, et al., eds. Clinical Data as the Basic Staple of Health Learning. Institute of Medicine/National Academies Press, 2010.
Lifsitz R. Big Data will begin improving the bottom line of healthcare players. Scientia Advisors Blog, 06-FEB-2012.
Lohr S. The age of Big Data. New York Times, 11-FEB-2012.
Loukides M, et al. Big Data Now: Current Perspectives. O'Reilly, 2012.
Minelli M, Smith D. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. Wiley, 2012.
Oxford University 2012 Conference on Internet, Politics & Policy: Big Data, Big Challenges?
Pearl J. Causality: Models, Reasoning and Inference. 2e. Cambridge Univ, 2009.
Ratner B. Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data. 2e. CRC, 2011.
Rogers S, Girolami M. A First Course in Machine Learning. CRC, 2011.
Seni G, Elder J. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool, 2011.
Tang L, Liu H. Community Detection and Mining in Social Media. Morgan & Claypool, 2010.
Watts D. Everything is Obvious: *Once You Know The Answer. Crown, 2011.
Weinberger D. Too Big to Know: Rethinking Knowledge Now That the Facts Aren't the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room. Basic, 2012.
Wu J, Coggeshall S. Foundations of Predictive Analytics. CRC, 2012.