Machine learning and computational modeling remain some of the hottest topics in both the academic and industrial biology and healthcare communities. Here I’ll take a closer look at some problems with how we conceive of the use of AI and machine learning in biology, and how shifting our mode of thinking may lead to more realistic assessments and outcomes.
A recent article by Forbes contributor David Shaywitz reviewed a talk I gave last fall at a conference about big data and AI in cancer. In that talk, I described two cultures of biomarker discovery (or, more generally, of modeling in biology): a mechanistic or deductive culture, which seeks to build models bottom-up from known mechanisms, versus an empirical or inductive culture, which seeks to accumulate large amounts of (usually observational) data and automate model building with statistical methods, without reference to mechanism. Both approaches face limitations, which I described in the talk (and which David summarized nicely in his article). In this article I want to focus on the empirical mindset.
The roots of the empirical/inductive approach are well-founded: although we don’t know enough to build our models of biology entirely from first principles, empiricism suggests that we might not need to. The goal of a model is to make a prediction; you might need complete information to make any arbitrary prediction, but a particular prediction may not require all of that information. For example, if I wanted to model the flight of a paper airplane, I might need very detailed, high-resolution data on wind speeds, but if I only wanted to know whether the plane would get rained on, I could probably do reasonably well with much less information by just looking for clouds. So, if there is “enough” data available, maybe the irrelevant details will average out as noise and the relevant information will rise to the surface, or rather be drawn out, by the power of our algorithms.
Note that in that last sentence, I might as well have replaced the word “algorithms” with “magic” (accio signal?). While there are certainly well-thought-out, detailed proposals that go beyond this level, many proposed applications of AI in biology (whether in diagnostics, drug discovery, or synthetic biology) fall into the trap of treating “machine learning” as a magic wand to wave over “large” data sets until results fall out. Indeed, many of these proposals have so little methodological detail that they might as well be invoking magic.
In the talk I claim that, as magicians go, blindly applied machine learning in biology is more like Gilderoy Lockhart (the fraudulent celebrity wizard from Harry Potter and the Chamber of Secrets) than Hermione Granger (the magical genius). Specifically, the up-front assumption that the “irrelevant details will average out” tends not to hold. Data sets in biology are typically small enough that, without outside information, an uninformed observer (i.e., an algorithm) has no good way to tell what is signal and what is an irrelevant confounder. For example, biomarker discovery sets are often heavily confounded with age: because the incidence of many diseases increases with age, case samples usually come from older individuals than control samples, all else being equal, so it can be difficult to tell whether a model is actually finding signal for the disease or just predicting age. More subtle effects can creep in as well, such as technical bias from the way experiments are run.
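To make the failure concrete, here is a minimal sketch on simulated data (all numbers and features hypothetical) of how an age-confounded cohort lets a model look accurate while learning nothing about the disease: the only informative feature tracks age, a naive evaluation rewards the model anyway, and the illusion fades once we evaluate within a narrow age band.

```python
# Minimal sketch with simulated data: the model looks accurate only because
# age differs between cases and controls, not because it learned any biology.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Cases come from an older population than controls: age is the confounder.
is_case = rng.integers(0, 2, size=n)
age = np.where(is_case == 1, rng.normal(65, 8, n), rng.normal(50, 8, n))

# Features: one "biomarker" that tracks age only, plus pure noise columns.
X = np.column_stack([age + rng.normal(0, 2, n), rng.normal(0, 1, (n, 5))])

X_tr, X_te, y_tr, y_te, age_tr, age_te = train_test_split(
    X, is_case, age, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy on the confounded test set:", model.score(X_te, y_te))

# Re-evaluate only within a narrow age band, where the confounder is roughly
# constant: performance falls toward chance because the model learned age.
band = (age_te > 55) & (age_te < 60)
print("accuracy within an age-matched band:", model.score(X_te[band], y_te[band]))
```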
This is particularly problematic in biology. Methodologically, biology has enormous complexity, both in the number of underlying processes and in the wide range of their effect sizes, and both make reliable automated inference difficult. The lack of reliability also matters in practice: much of the interest in this space ties into direct applications to healthcare, where mistakes can be very costly indeed.
I argue that rethinking how we approach experimental and strategic design can improve the reliability of the biological (or other) machine learning work we do. Delving deep into my shallow knowledge of popular literature from British authors, I suggest the Sherlock Holmes strategy: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.” Instead of spending all our effort trying to show the machine what “good” looks like, we should spend equal or greater effort showing it what “bad” looks like and making sure it does not fall into those traps. In other words, if we can keep the machine from picking “impossible” (incorrect) scenarios, then whatever remains (if anything) must be the truth. So how can we define these failure modes and validate that models are not falling into them?
A starting point is to use the structure of our existing knowledge to define what “good” and “bad” look like and to integrate this into our evaluation of model quality. A recent piece of research (shameless plug: from the research group I led) shows an example of how to critically evaluate machine learning models in this context. Like other diseases, cancer incidence is highly correlated with age, and “standard” training on a confounded data set can produce a model that looks good on the surface but performs badly in the real world if it is actually picking out aging rather than cancer. Ideally, we would have perfectly age-matched sets of samples. In reality, this is not practical: even if we could match samples on age, there would be no way to match every possible covariate. We addressed these challenges in our paper by designing validation methods that tested models against several known failure modes (age, sequencing batch, institution of origin) and verifying that performance did not collapse when each confounder was held out. If it had, that would have implied that the model’s predictions relied on the confounder, which is exactly what we want to avoid. Filtering to models that remain robust under these controls is one example of “eliminating the impossible”.
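As a rough illustration of this kind of check (a generic sketch using scikit-learn’s GroupKFold, not the exact protocol from our paper), one can compare naively shuffled cross-validation against cross-validation that holds out each suspected confounder as a group; a large drop for a given confounder suggests the model is leaning on it:

```python
# Generic sketch of confounder-aware validation: hold out each known
# confounder as a group and compare against naive shuffled cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

def confounder_check(X, y, confounders, model=None):
    """Compare naive shuffled CV against group-wise CV for each confounder.

    `confounders` maps a name (e.g. "batch", "institution") to an array of
    group labels, one per sample; continuous confounders such as age can be
    binned into groups first. A large drop under group-wise CV suggests the
    model is leaning on that confounder rather than on the biology.
    """
    model = model or RandomForestClassifier(n_estimators=200, random_state=0)
    naive = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0))
    print(f"naive shuffled CV accuracy: {naive.mean():.2f}")
    for name, groups in confounders.items():
        held_out = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(5))
        print(f"accuracy with {name} held out across folds: {held_out.mean():.2f}")

# Hypothetical usage, given a feature matrix X, labels y, and per-sample
# annotations for sequencing batch and institution:
# confounder_check(X, y, {"sequencing batch": batch_ids, "institution": site_ids})
```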
It’s worth noting that while the Sherlock Holmes principle may sound good at first glance, the real world is always more complicated than a story: it is not easy to cut off every way a machine learning model can cheat. Models evading what their engineers believe are well-designed constraints is known as “specification gaming”, and it is a hard problem in applied AI/ML. An approachable exploration of the idea is the story “The Hidden Complexity of Wishes,” and a crowd-sourced table of interesting examples of specification gaming has also been assembled.
Taking a step back, the notion that we can call a particular variable a “confounder” (that is, irrelevant) up front implies that we have some approximate notion of the mechanisms in play and can identify which are relevant and which are not. Good experimental design has a rich set of such assumptions baked in: we call them controls. The cross-validation methods described in the earlier paper are ways to use negative controls (“this variable is irrelevant”) to evaluate a model’s performance. There are also methods that use similar negative-control data during model inference, not just during validation. For example, this paper on minimizing batch effects in biological data can be thought of as a way to use data from technical replicates, which encode the negative control that “different batches should not give different results”. Designing methods that integrate information from more complicated negative controls, as well as positive controls (“this ought to be relevant”), may be an interesting direction forward.
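As a cartoon of the idea (a deliberately simplified sketch, not the method from the paper linked above), technical replicates of a shared reference sample can be used to estimate and remove per-batch offsets, since the negative control says those replicates should look identical across batches:

```python
# Simplified sketch: use technical replicates of a shared reference sample as
# a negative control to estimate and subtract per-batch technical offsets.
import numpy as np

def remove_batch_offsets(X, batch, is_reference_replicate):
    """Center each batch using technical replicates of a shared reference.

    X: (n_samples, n_features) expression-like matrix.
    batch: (n_samples,) batch label for each sample.
    is_reference_replicate: (n_samples,) boolean mask marking technical
        replicates of the same reference material, ideally run in every batch.
    """
    X = np.asarray(X, dtype=float).copy()
    batch = np.asarray(batch)
    is_reference_replicate = np.asarray(is_reference_replicate, dtype=bool)

    # Pooled profile of the reference material across all batches.
    global_ref = X[is_reference_replicate].mean(axis=0)
    for b in np.unique(batch):
        in_batch = batch == b
        ref_in_batch = in_batch & is_reference_replicate
        if not ref_in_batch.any():
            continue  # no replicate in this batch; leave it uncorrected
        # The replicate's deviation from the pooled profile is treated as a
        # purely technical, per-batch offset and subtracted from every sample.
        offset = X[ref_in_batch].mean(axis=0) - global_ref
        X[in_batch] -= offset
    return X
```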