August 15, 2012

Big Data: Numerousness doesn't equal value

Theories and practices about outcomes performance, decision-making, leadership and many other domains are firmly rooted in the “norm of normality,” where individual performance follows a normal distribution and any deviations from normality are seen as “outliers” that must be “fixed.” For example, actuaries’ mathematical modeling in health insurance often censors out the top and the bottom few percent of cases as “outliers.” In other examples, analysts make strong, simplifying, but often unjustified assumptions about the symmetrical, bell-shaped distribution of the data.
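The effect of censoring the tails can be illustrated with a minimal sketch. The data are simulated, not real claims: the lognormal shape, its parameters, the 2% cutoff, and the helper name `trim_tails` are all assumptions chosen only to produce a right-skewed sample.

```python
import random
import statistics


def trim_tails(values, fraction):
    """Return values with the lowest and highest `fraction` of cases removed."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k > 0 else ordered


# Hypothetical per-member annual costs. The lognormal shape and its
# parameters are assumptions picked only to produce a right-skewed sample.
rng = random.Random(0)
costs = [rng.lognormvariate(8.0, 1.5) for _ in range(10_000)]

full_mean = statistics.fmean(costs)
trimmed_mean = statistics.fmean(trim_tails(costs, 0.02))

# Share of total cost carried by the top 2% of cases that trimming discards.
top_k = int(len(costs) * 0.02)
top_share = sum(sorted(costs)[-top_k:]) / sum(costs)

print(f"mean of all cases:      {full_mean:,.0f}")
print(f"mean after 2% trimming: {trimmed_mean:,.0f}")
print(f"cost share of top 2%:   {top_share:.1%}")
```

On a right-skewed sample like this one, the trimmed mean falls well below the full mean, because the few cases treated as "outliers" carry a disproportionate share of the total.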

But in the Big Data era it is clear that very few of the objects of analysis in health and health care meet the bell-shaped “normal distribution” criterion. Almost all of the analyses involve objects whose statistical distributions of values are highly asymmetrical or skewed.
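One simple way to check for asymmetry before assuming a bell curve is to compare the mean with the median and to compute a skewness coefficient. The sketch below uses the moment coefficient of skewness. The function name `sample_skewness` and the length-of-stay values are hypothetical and invented for illustration.

```python
import statistics


def sample_skewness(values):
    """Moment coefficient of skewness g1 = m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


# Hypothetical lengths of stay in days, invented for illustration only.
length_of_stay = [2, 2, 3, 3, 3, 4, 4, 5, 6, 9, 14, 41]

print("mean:    ", statistics.fmean(length_of_stay))   # 8.0
print("median:  ", statistics.median(length_of_stay))  # 4.0
print("skewness:", round(sample_skewness(length_of_stay), 2))
```

A mean that sits far above the median, together with a clearly positive skewness, signals that a symmetric, normal model would misdescribe the data.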

IBM Watson and other groups extol the hoped-for virtues and promise of letting web-sourced searches “ring in” with answers that may be acted upon as received, with clinicians taking the results as normative guides for decision-making in the care of their own patients. Such a process gives normative priority to the commonest, most often-repeated items, without regard to the social or logistical reasons why items with a certain notoriety might be the most frequently mentioned in the corpus of indexed and searched information and therefore be top-ranked.

In our experience, only the most conventional and slowly changing wisdom is retrieved by Watson-type processes. Two kinds of evidence will be missed completely. The first is collections of evidence and multi-variable patterns that offer superior outcomes but that have never been noticed by human beings and, therefore, have never been published or collated into an indexable, searchable document. The second is brand-new and emerging findings that have not yet accumulated a high page rank or many cross-citations or repetitions in the secondary literature.

A case in point was published in a recent issue of the Journal of Pediatrics:

“Researchers entered 13 key phrases pertaining to infant sleep safety, and then analyzed a total of 1,300 Web sites found through Google search. Less than half (43.5 percent) of the sites listed contained infant sleep safety information that reflects American Academy of Pediatrics (AAP) recommendations.” — B. Joyner & coworkers

A majority of workers in health data analytics espouse the following goals:

  • Improve normative choice transparency: Provide patients and providers with accurate and reasonably complete information about which care options are available and what the comparative relevance of each of them is in the same context that prevails for the patient
  • Improve value transparency: Provide patients and providers with more information about the relative effectiveness and safety of the available care options
  • Increase cost transparency: Provide patients with more information about the underlying costs to them of the service with which they are being provided
  • Create positive as well as negative incentives: Allow patients who discover value, or who through their behavior drive better-than-average value, to pocket most of the savings they drive

But we contend that these goals can be achieved only by carefully designed, scientifically validated and independently auditable processes and methodology. Such methodology must respect the “tails” of the statistical distributions of the variables that are pertinent to the decisions and outcomes, and must not discard the low page-rank instances that reflect “new” instances that out-perform or under-perform the majority. The goals cannot be achieved where the process is subject to the vagaries of casual search strings and unvalidated processes, operating against uncontrolled corpora of data in a manner that is neither reproducible nor auditable.

In summary, proper Big Data methodology for health sciences must utilize non-parametric statistical methods in most cases, on account of the varying, severe (and in many cases unquantified) skewness of the data variables’ distributions.
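As one illustration of a non-parametric approach, the sketch below computes the Mann-Whitney U statistic directly from its pairwise definition. It is a minimal example, not a full significance test. The function names and the outcome scores for the two care options are hypothetical and invented for illustration.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: count of pairs (a, b) with a > b, ties counting 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def probability_of_superiority(sample_a, sample_b):
    """Estimate P(A > B) + 0.5 * P(A == B) from the U statistic."""
    return mann_whitney_u(sample_a, sample_b) / (len(sample_a) * len(sample_b))


# Hypothetical outcome scores for two care options, invented for illustration.
option_a = [3, 5, 8, 13, 40]
option_b = [2, 4, 4, 6, 7]

u = mann_whitney_u(option_a, option_b)
print(u, probability_of_superiority(option_a, option_b))  # 19.0 0.76

# Making the extreme case a hundred times more extreme leaves U unchanged,
# because only the ordering of values matters, not their magnitudes.
option_a_extreme = [3, 5, 8, 13, 4000]
print(mann_whitney_u(option_a_extreme, option_b))  # 19.0
```

Because the statistic depends only on ranks, it needs no assumption about the shape of the distribution, and the cases in the tail are kept in the comparison without dominating it.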

For accuracy and fairness, we strongly recommend that Big Data methodology for health sciences not arbitrarily coerce data on comparative effectiveness, accountable care outcomes performance, or safety into parametric, normal distributions. Nor should Big Data methodology attach undue priority or privilege to the common or conventional. The authorities in each country responsible for health policy and comparative effectiveness research (CER) should establish minimum standards for methodology and for its verification and validation, much as the International Conference on Harmonisation (ICH) has done for clinical trials. This is necessary in order to prevent a future cacophony of information purporting to be knowledge, but without adequate warrants as to its truth and contextual validity.

Additional Resources

Google search no guarantee for health data accuracy. Healthcare IT News, 06-AUG-2012.

The best and the rest: Revisiting the norm of normality of individual performance. Personnel Psychology, 27-FEB-2012.

Applied Nonparametric Statistical Methods, Fourth Edition

Correlation heat map of physicians’ rates of non-achievement of HbA1c targets with severity-matched cohorts of diabetic patients in their care

Douglas McNair, MD, PhD, is president of Cerner Math, Inc., and one of three Cerner Engineering Fellows. He is responsible for innovations in decision support and very-large-scale data mining. McNair joined Cerner in 1986, first as VP of Cerner’s Knowledge Systems engineering department; then as VP of Regulatory Affairs; then as General Manager for Cerner’s Detroit and Kansas City branches. Subsequently, he was Chief Research Officer, responsible for Cerner’s clinical research operations. In 1987, McNair was co-inventor and co-developer of Discern Expert®, a decision-support engine that today is used in more than 2,000 health care facilities around the world. Between 1977 and 1986, McNair was a faculty member of Baylor College of Medicine in the Departments of Medicine and Pathology. He is a diplomate of the American Board of Pathology and the American Board of Internal Medicine.

Sasanka Are, PhD, is Senior Director at Cerner Math, Inc. Cerner Math delivers superior analytical solutions that drive smarter healthcare decisions. The company’s innovative use of mathematics to predict health outcomes and optimize health processes transforms the health value chain and revolutionizes the way risk is managed. Sasanka has extensive research experience in the fields of Mechanical Engineering (3-D Turbulence modeling, Engine Simulations) and Mathematics (Monte-Carlo methods, Stochastic modeling, Numerical methods). Sasanka has a PhD in Mathematics and a Masters in Mechanical Engineering.