March 28, 2012
Big data is not (yet) big enough - part one
The promise of timely, reliable evidence in quantities sufficient to
statistically ‘power’ health care decisions depends on the quality of the data, on the interoperable ontological
mapping of its concepts (so that “apples” are really “apples”) and on the
accessibility of the data on which to base discoveries and conclusions. All three
have been improving rapidly in health care Big Data, making that promise real. There is no
doubt that detection, identification and descriptive-statistical analysis
must inform the choice of best matches, optimal options, improvement
actions and predictive models.
The key
issue to be addressed in health and health care, though, is the same one faced by
other industries and domains: how to avoid being overwhelmed by petabyte-scale data
yet find the best, ethically appropriate ways to obtain and apply actionable
knowledge from the data. If only the “operational” perspective is applied (see graphic
above), the risk is to remain at a superficial level, whipsawed by ephemeral
“macro” features that may be more apparent than real in their physicochemical, clinical
or operational relevance. And if only “analysis” is pursued, the risk is to
obtain a very detailed numerical characterization of the dataset without
adequate understanding of (a) how to interpret these figures in a statistically
valid way or (b) the practical deployment details and the social and ethical
defensibility of the interpretation and contemplated application of the
knowledge.
“At the petabyte scale, information is not a matter of simple three or four axes and drilling-down, but of dimensionally-agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. Petabyte scale forces us to view data mathematically first and establish a ‘context’—a ‘meaning’ or codeset for it—later. Google conquered the advertising world with nothing more than applied mathematics. Google didn’t pretend to know anything about the culture and conventions of advertising—it just assumed that better data, with better analytic tools, and an algorithm to derive ‘meaning’ (PageRank) would win the day. And Google was right.” --Chris Anderson, ‘Long Tail: Selling Less of More’
These two risks both bear on the challenge of
obtaining actionable knowledge from the data: how to provide a description
good enough to inform the definition of improvement actions. As far as
Big Data is concerned, the health research communities currently
seem to be mostly preoccupied with the “identification/description” pole. There
are quite a few quantitative methods for processing safety and
comparative-effectiveness performance data.
On the other hand, research, analyses and real-world, practical
applications are beginning to appear that make productive use of the
petabytes of health data now accumulating. These are clearly innovative and beneficial to the
public and to private organizations and individuals. They aim to:
- Predict future risks and adverse health events/statuses, with enough
lead time to enable interventions to prevent the event/status
from materializing
- Automatically discover multi-variable patterns that implicitly
constitute new ‘concepts,’ including features that denote similarities that
form a basis for personalized medicine decision-making
- Quickly find a large, anonymized set of other persons whose array of
health-related attributes (and medical history, and outcomes subsequent to
treatment) closely resembles the array of facts that have materialized so far for
me or for a family member, so that the likely consequences of options I am (we
are) now considering can be objectively evaluated
- Optimize health plan benefits design and transparent pricing of
policies for fairer and more accountable (and cheaper, or better value)
coverages for more people
- Optimize procedures and processes, to deliver superior clinical
outcomes, operational efficiency, and financial performance
- Simulate the outcomes of alternative contemplated changes in staffing
or processes, before implementing them
- Discover new safe and effective diagnostics and therapeutics (and
label-expansion covering new uses of existing ones), by mining outcomes data
for products (and combinations of concomitant uses of two or more products)
that have a similar mechanism-of-action and that are already available in the
market
- Detect and forecast public health trends and outbreaks
- Measure the effectiveness of procedures and health policies in
communities or diagnosis-oriented groupings of people
- Automatically find patients who meet inclusion-exclusion criteria to
be eligible to participate in clinical trials, so that research can reach
reliable conclusions and decisions more quickly and respond to unmet health
needs and reduce one bottleneck in new-product development
- Quantitatively design clinical trial inclusion-exclusion criteria so
that more therapeutic products that are safe and effective can successfully complete
the clinical development and regulatory approval cycles, more quickly and with
lower risk and cost
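As an illustration of the "find persons who resemble me" item in the list above, here is a minimal sketch in Python. It is not a description of any production system: the attributes (age, systolic blood pressure, HbA1c), the five-row cohort, the outcome flags and the function names are all hypothetical, and Euclidean distance on standardized attributes is just one simple choice of similarity measure among many.

```python
import math

def standardize(rows):
    """Rescale each column to zero mean and unit variance so that no
    single attribute dominates the distance calculation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column is constant (standard deviation of 0).
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    scaled = [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]
    return scaled, means, sds

def nearest_patients(cohort, query, k):
    """Return the indices of the k cohort rows closest to the query,
    using Euclidean distance on the standardized attributes."""
    scaled, means, sds = standardize(cohort)
    q = [(x - m) / s for x, m, s in zip(query, means, sds)]
    dists = [(math.dist(row, q), i) for i, row in enumerate(scaled)]
    return [i for _, i in sorted(dists)[:k]]

# Hypothetical anonymized cohort: [age, systolic BP, HbA1c]
cohort = [
    [34, 118, 5.2],
    [61, 145, 7.9],
    [58, 150, 8.1],
    [29, 110, 5.0],
    [65, 138, 7.4],
]
# Hypothetical outcomes: 1 = adverse event after treatment, 0 = none
outcomes = [0, 1, 1, 0, 0]

query = [60, 148, 8.0]  # the person weighing the treatment option
similar = nearest_patients(cohort, query, k=3)
event_rate = sum(outcomes[i] for i in similar) / len(similar)
```

In this toy example the event rate among the three most similar persons is what would be reported back to the person asking. A real system would need far more attributes, a clinically validated similarity measure, and enough matched persons for the estimate to be statistically meaningful.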
But while noting these benefits and emerging Big Data-based
applications, we should also look more closely at the data and ask hard
questions about it, about the factors that contribute to its creation and about how those factors affect the answers that come from analyzing it. Are Big Data repositories of health care-related
tweets really a sound and reliably generalizable measure of consumer sentiment
if the decision to tweet (about a doctor or a provider institution or a disease
or a treatment or a side-effect or outcome of a treatment) isn’t random and
isn’t geographically and socio-economically unbiased? Who is online, how do
they differ from the offline folks, and how do those differences potentially
skew the inferences that can be drawn from the data?
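The non-random-tweeting concern can be made concrete with a small calculation. The sketch below assumes a hypothetical population split into two groups, each with its own true rate of dissatisfaction and its own probability of tweeting; every number and name in it is invented for illustration and not drawn from any real dataset.

```python
def observed_vs_true(groups):
    """Compare the true dissatisfaction rate in a population with the
    share of negative tweets an analyst would actually observe.

    Each group is a tuple:
      (population share, true dissatisfaction rate,
       P(tweet | dissatisfied), P(tweet | satisfied))
    """
    true_rate = sum(share * rate for share, rate, _, _ in groups)
    neg_tweets = sum(share * rate * p_neg
                     for share, rate, p_neg, _ in groups)
    pos_tweets = sum(share * (1 - rate) * p_pos
                     for share, rate, _, p_pos in groups)
    tweet_rate = neg_tweets / (neg_tweets + pos_tweets)
    return true_rate, tweet_rate

# Hypothetical population (all values are assumptions)
groups = [
    # share, dissatisfied, P(tweet | dissatisfied), P(tweet | satisfied)
    (0.6, 0.10, 0.20, 0.02),    # heavily online group
    (0.4, 0.30, 0.01, 0.001),   # mostly offline group
]

true_rate, tweet_rate = observed_vs_true(groups)
print(f"true dissatisfaction rate: {true_rate:.2f}")
print(f"negative share of tweets:  {tweet_rate:.2f}")
```

Under these assumed numbers, the negative share of tweets comes out at roughly three times the true dissatisfaction rate, and the offline group, which is the less satisfied one here, contributes almost nothing to what is observed. The size and direction of the distortion depend entirely on the assumed tweeting probabilities, which is exactly why they need to be asked about.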
Are scholarly refereed journal articles really still the best “gold standard”
of evidence, if the published studies mostly report positive results and involve only a few hundred or a few tens of
thousands of subjects, when the Big Data alternative is relatively well-controlled,
accuracy-validated, and statistically unbiased observational cohorts of cases
and controls involving many millions?
Are health data warehouses that are only fed by a set of relatively
financially well-off, metropolitan health care institutions adequately and
unbiasedly projectable (see Campbell, pp. 230, 290) to represent (and
contribute to decision-making for) the non-participating rural institutions or
the public charity-care institutions whose financial performance is marginal?
Are Watson's Jeopardy-style, page-rank-based, rapidly-rung-in answers adequate for
low-probability-event safety-oriented questions or for questions involving new
discoveries and innovations where there is as yet no large corpus of
pre-existing evidence “needles” for the Big Data Watson-type algorithms to
identify in the “haystack”?
Today there is scant discussion of questions and risks like these.
Perhaps the shortage of attention so far to limitations of “Big Data” research
is just a predictable, breathless, early “best thing since sliced bread” phase
of the technology innovation cycle. But that doesn’t mean we should relax our
scientific standards just because petabyte-scale data conveniently now exist.
Part two of this discussion is available here.
Douglas McNair, MD, PhD, is president of Cerner Math, Inc., and one of three Cerner Engineering Fellows. He is responsible for innovations in decision support and very-large-scale data mining. McNair joined Cerner in 1986, first as VP of Cerner’s Knowledge Systems engineering department, then as VP of Regulatory Affairs, and then as General Manager for Cerner’s Detroit and Kansas City branches. Subsequently, he was Chief Research Officer, responsible for Cerner’s clinical research operations. In 1987, McNair was co-inventor and co-developer of Discern Expert®, a decision-support engine that today is used in more than 2,000 health care facilities around the world. Between 1977 and 1986, McNair was a faculty member of Baylor College of Medicine in the Departments of Medicine and Pathology. He is a diplomate of the American Board of Pathology and the American Board of Internal Medicine.