I was recently invited by OpenEye Scientific to speak at their 19th CUP meeting - a scientific conference focused on the challenges of computational modeling in drug discovery. It was my first time back at CUP in 8 years, since I switched fields from computational chemistry into computational biology. This time around, rather than presenting a particular computational method (like the GPU similarity algorithms I presented in 2011), at the request of the organizers I discussed principles to keep in mind when thinking about developing an AI solution to a problem.
Annotated slides of the talk are available here, if you’d like to follow along with some of the discussion points in this post.
Driven largely (though not entirely) by hype, an enormous number of potential problems have been proposed as good (or at least, the next) killer application for AI. In particular, lots of people hope that problems that human intelligence hasn’t been able to crack might fall to artificial intelligence! My earlier post laid out some ideas on how to think about experimental design given that we want to try an ML approach. Let’s take it one step earlier here, and consider how to choose our problems: which things might be good targets for ML, and which might be wastes of time and resources?
In the talk linked above, I discuss three criteria, each phrased as a question, that I have used myself to evaluate project proposals: do we actually understand the problem? Is the data not just big, but tall? And can we get feedback quickly?
Each of these questions is broadly useful in evaluating ML projects generally, and each is particularly valuable in exploratory domains like biology and chemistry. Let's discuss each of them more closely.
For the first criterion, three simple questions can help evaluate the difficulty of a project in AI/ML: (1) has someone already solved this problem, or one very similar to it? (2) Do human experts know how to solve it? (3) Would we recognize a good solution if we saw one?
These questions help stratify the difficulty of the proposed project:
If #1 is yes, then great! Someone else has shown that the problem is feasible, and you may even be able to license or otherwise implement their solution. This means that you have a business problem (“will this work in my domain”, “will this have the impact I want”) and certainly an engineering problem (“can I build this out and operationalize it”), but probably not a research problem. Note that the solution may not even involve ML.
If #1 is no, then the question becomes how to solve the problem. If #2 is yes, that means there are people who have domain expertise and can help guide non-experts, whether they're engineers or computers, towards a solution. However, that doesn't automatically mean that an ML solution will actually work: humans have been able to process natural language for millennia, but only recently have we had success in building solid machine translation systems. Here, you have a research problem: can we build an algorithm that can effectively capture the experts' knowledge from the available data?
If #2 is also no, now you have a hard problem. Assuming #3 is yes, at least the idea of a “good solution” is well-defined, and perhaps you’ll be able to accumulate a lot of data and hope that an algorithm will be able to pick out the right patterns. Note, though, that the lack of constraint here means that false discovery (the “Gilderoy Lockhart problem” from the earlier post) will become a big challenge.
Finally, if the answer to all three questions is no, this problem is probably just too hard; in fact, you haven’t actually defined the problem. Notably, most problems that we think are well-defined usually end up not being so. If you’re interested in learning more, check out the discussion on specification gaming in the earlier post.
A pithy summary of the above is that it's useful to consider how many "magic wands" we would need to build to solve a problem. When the answers to all three questions are "yes", no magic is required: the problem and the solution are well-understood. When only #2 and #3 are yes, we need to build at least one magic wand ("get the knowledge out of the expert's head and into the computer"), and the difficulty scales up as more answers become no. One magic wand is a research project; more than one is a research program.
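To make the triage concrete, here is a minimal sketch in Python (the function name and return strings are my own shorthand, not anything prescribed in the talk) of how the three yes/no answers map onto the number of magic wands:

```python
def count_magic_wands(already_solved: bool,
                      experts_can_solve: bool,
                      good_solution_recognizable: bool) -> str:
    """Triage a proposed AI/ML project from the three yes/no questions above."""
    answers = [already_solved, experts_can_solve, good_solution_recognizable]
    # The first "yes" in the cascade caps how much magic is required.
    wands = answers.index(True) if any(answers) else 3
    return {
        0: "0 wands: engineering and business questions, but not research",
        1: "1 wand: a research project (capture the experts' knowledge)",
        2: "2 wands: a research program (watch out for false discovery)",
        3: "3 wands: the problem isn't really defined yet",
    }[wands]

print(count_magic_wands(already_solved=False, experts_can_solve=True,
                        good_solution_recognizable=True))
```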
The second question, about the data, seems like the most straightforward, mostly because everyone thinks that they have "Big Data". (If your data fits in RAM, it's probably not Big. And you can get servers with a lot of RAM these days.)
While the usual metric for measuring data big-ness is the sheer number of bytes, the point of this principle is to force oneself to consider the shape and structure of those bytes, as this makes a big difference in our ability to learn over the data. A useful framing is that a dataset has both a width (the number of attributes measured per sample or instance) and a height (the number of instances). This captures the intuition that a 100GB dataset of emails is very different from a 100GB dataset of whole-genome sequences. While the former likely has millions or tens of millions of messages, the latter probably covers only a few dozen individuals, so the diversity and completeness of the two datasets must be dramatically different. Prefer tall data to "big" data.
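As a back-of-the-envelope illustration (the per-item sizes below are rough assumptions, not measurements), the same 100GB corresponds to wildly different heights:

```python
# Assumed rough sizes: ~10 KB per email, ~3 GB per whole-genome sequence.
DATASET_BYTES = 100 * 10**9

email_rows = DATASET_BYTES // (10 * 10**3)    # ~10,000,000 messages
genome_rows = DATASET_BYTES // (3 * 10**9)    # ~33 individuals

print(f"emails:  ~{email_rows:,} rows")
print(f"genomes: ~{genome_rows:,} rows")
```

Same number of bytes, but more than five orders of magnitude difference in height.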
The second parameter to pay attention to is the structure of the attributes measured on each sample. Consider, for example, two datasets each with five attributes:
| Daily temperature in | Daily blood levels of |
| --- | --- |
| San Jose, CA | C-reactive protein |
| Palo Alto, CA | triglycerides |
| Redwood City, CA | insulin |
| Daly City, CA | cortisol |
| San Francisco, CA | PSA |
If you’re familiar with Bay Area geography, you’ll recognize that the dataset on the left has spatial structure: data points sampled roughly south to north. And despite the notoriety of Bay Area microclimates, you can generally infer that if it’s 70 degrees in San Francisco and 80 in Redwood City, it won’t be snowing in Daly City. The data set on the right lacks such simple structure! While there are probably some relationships (e.g., connected to stress, inflammation, etc.), there’s nothing nearly as obvious or simple as on the left. Simple structure (particularly local structure) tends to be more amenable to ML approaches. In fact, building up the ability to capture more complicated structure is one of the key problems in ML research.
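Here's a toy sketch of why that local structure helps (the temperatures are made up, purely for illustration): when the samples are ordered along a spatial axis, a missing value can be guessed reasonably well from its neighbors, which is exactly the kind of regularity a learning algorithm can exploit. There is no analogous ordering of CRP, triglycerides, insulin, cortisol, and PSA to interpolate along.

```python
import numpy as np

# Cities ordered roughly south to north, with made-up afternoon temperatures (F).
cities = ["San Jose", "Palo Alto", "Redwood City", "Daly City", "San Francisco"]
temps = np.array([80.0, 78.0, 76.0, np.nan, 70.0])   # Daly City unobserved

# Because the samples are spatially ordered, linear interpolation is a sensible guess.
missing = np.isnan(temps)
positions = np.arange(len(temps))
estimate = np.interp(positions[missing], positions[~missing], temps[~missing])
print(f"Guessed temperature for Daly City: {estimate[0]:.1f} F")   # ~73 F
```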
So given these considerations, we return to the key evaluation principle: is the data available not just big, but also tall (and with tractable structure)? When data are small (few samples or few attributes), hand-encoding of domain knowledge through feature engineering takes on greater value, but this is very much not an automatic process.
The last question, whether we can get feedback quickly, is the simplest to state and yet often the hardest to optimize, particularly in applications that have to touch the real world (as opposed to those operating in a purely digital form). The goal of an ML system is to predict, but no prediction system is perfect, and it can only learn if it knows when it gets things right and wrong. The dimensions along which feedback might vary include its speed (how long after a prediction do we learn whether it was right?) and its cost (what does a wrong prediction cost in money, time, or even lives?).
We've seen a great deal of progress on problems where feedback, both positive and negative, comes extremely rapidly (e.g., game-playing or spam filtering). Furthermore, in these applications (unlike, e.g., medical pathology AI), the consequences of a bad prediction are fairly low-stakes, so it's OK to let the machine explore its parameter space for a while before converging.
In many biological problems, these properties have a much more difficult character: feedback on a biological model may take anywhere from days (cell culture) to years (clinical trials) to collect, and bad predictions along the way might be extremely costly (in dollars, time, or even lives), making it hard to train a model.
Note that this question is tightly connected with question two: in many cases, if feedback is fast, then it's also easy to get lots of data (in particular, when simulation is the source of both sides of the equation). However, they're not entirely the same: we may be able to generate boatloads of training data up front but then be left with nothing better than cross-validation for evaluation, because new test data are extremely slow or expensive to generate. At its core, this question asks how easily we can get test data.
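When fresh test data are that expensive, cross-validation over the existing matrix is often the only practical evaluation. A minimal scikit-learn sketch (with random data standing in for a real assay) might look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data: 200 samples x 50 features, with a simple signal in the first column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# 5-fold CV reuses the same expensive-to-generate matrix for both training and testing.
model = RandomForestRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="r2")
print(f"cross-validated R^2: {scores.mean():.2f} +/- {scores.std():.2f}")
```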
The talk closes out with a discussion of how to apply some of these principles to drug discovery, and how we might be able to leverage technological developments to help.
A big problem (no pun intended) for ML in chemistry is the size of the available data in comparison to the size of the chemical universe. The latter (as noted in the first entry of the drug discovery checklist) is incredibly huge. Jean-Louis Reymond and his group at the University of Bern have computationally explored some of the possibilities here, and note that there are likely around 26.4M molecules with 11 or fewer atoms (C/N/O/F only). Raising the threshold to 17 atoms increases the count more than 6000-fold, to 164B molecules. Only a tiny, tiny fraction of these compounds have ever been experimentally described; the CAS registry contains only on the order of 100M molecules, less than 1/1000th of even the 17-atom enumeration, and of course most of these have more than 17 atoms.
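The arithmetic behind those ratios, using the figures quoted above:

```python
mols_11_atoms = 26.4e6   # enumerated molecules with <= 11 heavy atoms (C/N/O/F)
mols_17_atoms = 164e9    # enumerated molecules with <= 17 heavy atoms
cas_registry = 100e6     # order-of-magnitude size of the CAS registry

print(f"11 -> 17 atoms: ~{mols_17_atoms / mols_11_atoms:,.0f}-fold more molecules")
print(f"CAS coverage of the 17-atom space: {cas_registry / mols_17_atoms:.1e} (< 1/1000)")
```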
A differentiating factor between chemistry and genomics is that datasets in chemistry tend to be not just short, but also surprisingly narrow. Whereas in genomics it is relatively easy to measure thousands of parameters per sample (e.g., whole-transcriptome RNA), both experimental data and computational features on compounds tend to be fairly limited. Consequently, there has been a lot of effort on chemical metrics like fingerprints, shape comparison, functional group matching, electrostatics, etc., all of which we can imagine as hand-engineered methods of widening the data matrix. While there are people working on automated feature extraction from chemical structures (using, e.g., graph convolutions), the small data matrices available may be a limiting factor.
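As a concrete example of such hand-engineered widening, a circular (Morgan) fingerprint turns each structure into a few thousand binary columns. A short sketch using RDKit (assuming it is installed; the SMILES strings and parameters here are arbitrary choices for illustration):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]   # ethanol, phenol, aspirin

rows = []
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)
    # Radius-2 Morgan fingerprint hashed to 2048 bits: one widened row per molecule.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    rows.append([int(b) for b in fp.ToBitString()])

X = np.array(rows)
print(X.shape)   # (3, 2048): 3 samples, 2048 engineered binary features
```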
I finished by sketching a couple of directions that could be taken to address this data limitation in chemistry. One option is to leverage physics-based chemical simulation; in particular, there was an entire session at CUP the day before that focused on model-building from simulation data. The other direction, which I'm particularly excited about, is the ability to use massively parallel assays (mostly counting assays based on NGS or mass spec) to pull out giant matrices of data in one go, spanning both the diversity of chemical species or conditions and the properties measured. A key organizational challenge will be to set up interdisciplinary groups in which experimentalists and computationalists can work together to design these fiendishly complicated experiments, as it's likely that the optimal experiment for generating modeling data will not be interpretable by standard analysis methodologies. (The example of perfect recovery of CT data by compressed sensing comes to mind.) This approach will create a number of sub-challenges along the way, including the need to deal with biases in highly noisy data. While there are basic procedural approaches that can help mitigate these (e.g., the cross-validation procedures discussed in this paper and the analytical protocols I discussed in a talk last year), it may also be interesting to consider how to build models to directly attack these biases.