August 15, 2012
Big Data: Numerousness doesn't equal value
Theories and practices in outcomes performance, decision-making, leadership and
many other domains are firmly rooted in the “norm of normality”: the assumption
that individual performance follows a normal distribution and that any
deviations from normality are “outliers” that must be “fixed.” For example,
actuaries’ mathematical models in health insurance often censor the top and
bottom few percent of cases as “outliers.” In other cases, analysts make
strong, simplifying, but often unjustified assumptions that the data follow a
symmetrical, bell-shaped distribution.
But in the Big
Data era it is clear that very few of the objects of analysis
in health and health care meet the bell-shaped “normal distribution” criterion.
Almost all of the analyses involve objects whose statistical distributions of
values are highly asymmetrical or skewed.
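To make the effect of skewness and of censoring “outliers” concrete, here is a
minimal Python sketch. It assumes a purely hypothetical, log-normal-shaped
"cost per case" variable; the function names, the sample size, the shape
parameter sigma=1.5 and the 2 percent trimming fraction are all illustrative
assumptions, not values taken from any real data set. It compares the mean with
the median, and shows how much of the total sits in the cases that trimming
would throw away.

```python
import math
from statistics import NormalDist, mean, median


def skewed_sample(n=10_000, sigma=1.5):
    """Build a deterministic, log-normal-shaped sample.

    Hypothetical stand-in for a right-skewed variable such as cost per case;
    n and sigma are illustrative assumptions, not values from real data.
    """
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Evaluate evenly spaced quantiles, so no random number generator is needed.
    return [math.exp(sigma * std_normal.inv_cdf((i + 0.5) / n)) for i in range(n)]


def trim(values, frac):
    """Drop the lowest and highest `frac` of cases, as 'outlier' censoring does."""
    ordered = sorted(values)
    k = int(len(ordered) * frac)
    return ordered[k:len(ordered) - k]


def top_share(values, frac):
    """Fraction of the grand total contributed by the highest `frac` of cases."""
    ordered = sorted(values)
    k = int(len(ordered) * frac)
    return sum(ordered[len(ordered) - k:]) / sum(ordered)


if __name__ == "__main__":
    costs = skewed_sample()
    print("mean:", round(mean(costs), 2))
    print("median:", round(median(costs), 2))
    print("mean after trimming 2% from each tail:", round(mean(trim(costs, 0.02)), 2))
    print("share of total held by the top 2%:", round(top_share(costs, 0.02), 2))
```

In a sample shaped like this one, the mean sits well above the median, and the
trimmed mean describes a population that no longer includes the cases carrying
a large part of the total.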
IBM Watson and other groups extol the promise of letting web-sourced searches
“ring in” with answers that may be acted upon as received, with clinicians
taking the results as normative guides for decision-making in the care of their
own patients. Such a process gives normative priority to the commonest, most
often-repeated items. It does so without regard to the social or logistical
reasons why items with a certain notoriety might be the most frequently
mentioned in the corpus of indexed and searched information, and therefore
top-ranked.
In our experience, only the most conventional and slowly changing wisdom is
retrieved by Watson-type processes. Collections of evidence and multi-variable
patterns that offer superior outcomes, but that have never been noticed by
human beings and therefore have never been published or collated into an
indexable, searchable document, will be missed completely. So will brand-new
and emerging findings that have not yet accumulated a high page rank or many
cross-citations or repetitions in the secondary literature.
A case in point was published in a recent issue of the Journal
of Pediatrics:
“Researchers entered 13 key phrases
pertaining to infant sleep safety, and then analyzed a total of 1,300 Web sites
found through Google search. Less than half (43.5 percent) of the sites listed
contained infant sleep safety information that reflects American Academy of
Pediatrics (AAP) recommendations.” — B. Joyner &
coworkers
A majority of
workers in health data analytics espouse the following goals:
- Improve normative choice transparency: Provide patients and providers with
accurate and reasonably complete information about which care options are
available and how relevant each one is in the patient’s own context
- Improve value transparency: Provide patients and providers with
more information about the relative effectiveness and safety of the available
care options
- Increase cost transparency: Provide patients with more
information about the underlying costs to them of the service with which they
are being provided
- Create positive as well as negative incentives: Allow patients who discover
value, or who through their behavior drive better-than-average value, to
pocket most of the savings they drive
But we contend that these goals can only be achieved by carefully designed,
scientifically validated and independently auditable processes and methodology.
Such methodology must respect the “tails” of the statistical distributions of
the variables pertinent to the decisions and outcomes, and must not discard the
low page-rank instances that reflect “new” cases that out-perform or
under-perform the majority. The goals cannot be achieved where the process is
subject to the vagaries of casual search strings and unvalidated processes,
operating against uncontrolled corpora of data in a manner that is neither
reproducible nor auditable.
In summary, proper Big Data methodology for health sciences must in most cases
use non-parametric statistical methods, because the distributions of the data
variables show varying, severe and in many cases unquantified skewness.
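As one illustration of what a non-parametric comparison looks like, the sketch
below computes the Mann-Whitney U (Wilcoxon rank-sum) statistic in plain
Python. This is only one of many non-parametric methods and is not necessarily
the one the authors have in mind; the function names and the tiny example
groups are hypothetical. Because the statistic depends only on the rank order
of the values, it makes no assumption about a bell-shaped distribution, and an
extreme value in a tail does not need to be censored.

```python
def midranks(values):
    """Return 1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """U for sample_a: number of (a, b) pairs with a > b, ties counting 0.5."""
    ranks = midranks(list(sample_a) + list(sample_b))
    n_a = len(sample_a)
    rank_sum_a = sum(ranks[:n_a])
    return rank_sum_a - n_a * (n_a + 1) / 2


def prob_superiority(sample_a, sample_b):
    """Estimated probability that a random case from a exceeds one from b."""
    return mann_whitney_u(sample_a, sample_b) / (len(sample_a) * len(sample_b))


if __name__ == "__main__":
    # Hypothetical outcome values for two groups; group_b_tail has one extreme case.
    group_a = [1, 2, 3]
    group_b = [4, 5, 6]
    group_b_tail = [4, 5, 1000]
    print("U, no extreme case:  ", mann_whitney_u(group_b, group_a))
    print("U, with extreme case:", mann_whitney_u(group_b_tail, group_a))
```

The two calls give the same U because replacing 6 with 1000 leaves the rank
order unchanged, whereas a comparison of group means would shift sharply. A
production analysis would also need a significance test and attention to
sample size, which this sketch leaves out.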
For accuracy and fairness, we strongly recommend that Big Data methodology for
health sciences not arbitrarily coerce data on comparative effectiveness,
accountable care outcomes performance, or safety into parametric, normal
distributions. Nor should it attach undue priority or privilege to the common
or conventional. The authorities in each country responsible for health policy
and comparative effectiveness research (CER) should establish minimum standards
for methodology and for its verification and validation, much as the
International Conference on Harmonization has done for clinical trials. This is
necessary to prevent a future cacophony of information purporting to be
knowledge, but without adequate warrants as to its truth and contextual
validity.
Additional Resources
Google search no guarantee for health data accuracy. Healthcare IT News, 06-AUG-2012.
The best and the rest: Revisiting the norm of normality of individual performance. Personnel Psychology, 27-FEB-2012.
Applied Nonparametric Statistical Methods, Fourth Edition
Correlation heat map
of physicians’ rates of non-achievement of HbA1c targets with severity-matched
cohorts of diabetic patients in their care
Douglas McNair, MD, PhD, is president of Cerner Math, Inc., and one of three Cerner Engineering Fellows. He is responsible for innovations in decision support and very-large-scale data mining. McNair joined Cerner in 1986, first as VP of Cerner’s Knowledge Systems engineering department; then as VP of Regulatory Affairs; then as General Manager for Cerner’s Detroit and Kansas City branches. Subsequently, he was Chief Research Officer, responsible for Cerner’s clinical research operations. In 1987, McNair was co-inventor and co-developer of Discern Expert®, a decision-support engine that today is used in more than 2,000 health care facilities around the world. Between 1977 and 1986, McNair was a faculty member of Baylor College of Medicine in the Departments of Medicine and Pathology. He is a diplomate of the American Board of Pathology and the American Board of Internal Medicine.
Sasanka Are, PhD, is Senior Director at Cerner Math, Inc. Cerner Math delivers superior analytical solutions that drive smarter healthcare decisions. The company’s innovative use of mathematics to predict health outcomes and optimize health processes transforms the health value chain and revolutionizes the way risk is managed. Sasanka has extensive research experience in the fields of Mechanical Engineering (3-D Turbulence modeling, Engine Simulations) and Mathematics (Monte-Carlo methods, Stochastic modeling, Numerical methods). Sasanka has a PhD in Mathematics and a master’s degree in Mechanical Engineering.