Three Principles for AI/ML in Drug Discovery

What makes ML a better or worse fit for your problem?


I was recently invited by OpenEye Scientific to speak at their 19th CUP meeting - a scientific conference focused on challenges in computational modeling for drug discovery. It was my first time back at CUP in the 8 years since I switched fields from computational chemistry to computational biology. This time around, rather than presenting a particular computational method (like GPU similarity algorithms in 2011), at the request of the organizers I discussed principles to keep in mind when thinking about developing an AI solution to a problem.

Annotated slides of the talk are available here, if you’d like to follow along with some of the discussion points in this post.

Structured Thinking about ML applications

Driven largely (though not entirely) by hype, an enormous number of potential problems have been proposed as good (or at least, the next) killer application for AI. In particular, lots of people hope that problems that human intelligence hasn’t been able to crack might fall to artificial intelligence! My earlier post laid out some ideas on how to think about experimental design given that we want to try an ML approach. Let’s take it one step earlier here, and consider how to choose our problems: which things might be good targets for ML, and which might be wastes of time and resources?

In the talk linked above, I discuss three criteria that I have used myself to evaluate project proposals:

  1. How many magic wands would you need to invent?
  2. How tall is your data?
  3. How fast can you get feedback?

Each of these questions is broadly useful for evaluating ML projects in general, and is particularly valuable in exploratory domains like biology and chemistry. Let’s discuss each of them more closely.

Principle 1: Research Problems vs Business Problems

For this first principle, three simple questions can help evaluate the difficulty of an AI/ML project:

  1. Has someone already gotten a computer to solve this problem (perhaps in another domain, or without ML)?
  2. Do there exist humans who know how to solve this problem? (And are they on your team?)
  3. Is a “good” solution well-defined?

These questions help stratify the difficulty of the proposed project.

A pithy summary of the above is that it’s useful to consider how many “magic wands” we would need to build to solve a problem. When the answers to all three questions are “yes”, no magic is required: the problem and the solution are well-understood. When only #2 and #3 are yes, we need to build at least one magic wand (“get the knowledge out of his/her head into the computer”), and the difficulty scales up as more answers become no. One magic wand is a research project; more than one is a research program.
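As a toy sketch of this stratification (my own illustration, not from the talk), the three yes/no answers map to a rough count of magic wands:

```python
def count_magic_wands(computer_solved: bool, human_expert: bool, well_defined: bool) -> str:
    """Rough difficulty estimate: one 'magic wand' per 'no' answer.

    computer_solved: has someone already gotten a computer to solve this problem?
    human_expert:    do humans exist who know how to solve it?
    well_defined:    is a "good" solution well-defined?
    """
    wands = sum(not answer for answer in (computer_solved, human_expert, well_defined))
    if wands == 0:
        return "0 wands: a well-understood engineering problem"
    if wands == 1:
        return "1 wand: a research project"
    return f"{wands} wands: a research program"

# Only questions 2 and 3 are "yes": one magic wand, i.e., a research project.
print(count_magic_wands(computer_solved=False, human_expert=True, well_defined=True))
```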

Principle 2: Data, cheap and big or expensive and small?

This question seems like the most straightforward, mostly because everyone thinks that they have “Big Data”. (If your data fits in RAM, it’s probably not Big. And you can get servers with a lot of RAM these days.)

While the usual metric for data big-ness is the sheer number of bytes, the point of this principle is to force oneself to consider the shape and structure of those bytes, as this makes a big difference in our ability to learn from the data. A useful framing is that a dataset has both a width (the number of attributes measured per sample or instance) and a height (the number of instances). This captures the intuition that a 100GB dataset of emails is very different from a 100GB dataset of whole-genome sequences: while the former likely contains millions or tens of millions of messages, the latter probably covers only a few dozen individuals, so the diversity and completeness of the two datasets must be dramatically different. Prefer tall data to “big” data.
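As a minimal sketch of this height/width framing (the shapes below are invented, order-of-magnitude illustrations, not measurements):

```python
# Hypothetical (height, width) shapes for two datasets of similar byte size;
# the numbers are illustrative only.
datasets = {
    "emails (~100GB)": (10_000_000, 100),       # many samples, few attributes
    "whole genomes (~100GB)": (30, 3_000_000),  # few samples, many attributes
}

for name, (height, width) in datasets.items():
    shape = "tall" if height > width else "wide"
    print(f"{name}: height={height:,} samples, width={width:,} attributes -> {shape}")
```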

The second parameter to pay attention to is the structure of the attributes measured on each sample. Consider, for example, two datasets each with five attributes:

Daily temperature in      Daily blood levels of
San Jose, CA              C-reactive protein
Palo Alto, CA             triglycerides
Redwood City, CA          insulin
Daly City, CA             cortisol
San Francisco, CA         PSA

If you’re familiar with Bay Area geography, you’ll recognize that the dataset on the left has spatial structure: the data points are sampled roughly from south to north. And despite the notoriety of Bay Area microclimates, you can generally infer that if it’s 70 degrees in San Francisco and 80 in Redwood City, it won’t be snowing in Daly City. The dataset on the right lacks such simple structure! While there are probably some relationships (e.g., connections to stress, inflammation, etc.), there’s nothing nearly as obvious or simple as on the left. Simple structure (particularly local structure) tends to be more amenable to ML approaches; in fact, building up the ability to capture more complicated structure is one of the key problems in ML research.
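To make the point about local structure concrete, here is a toy sketch of my own (the temperatures are invented): with a south-to-north ordering, a missing reading can be filled in by interpolating its neighbors, while the biomarker panel offers no comparable ordering.

```python
import numpy as np

# Cities ordered roughly south to north, with one missing reading (Daly City).
cities = ["San Jose", "Palo Alto", "Redwood City", "Daly City", "San Francisco"]
temps_f = np.array([80.0, 78.0, 77.0, np.nan, 70.0])  # invented values

# Local spatial structure lets us fill the gap by interpolating its neighbors.
positions = np.arange(len(cities))
missing = np.isnan(temps_f)
temps_f[missing] = np.interp(positions[missing], positions[~missing], temps_f[~missing])
print(dict(zip(cities, temps_f.round(1))))  # Daly City filled in at ~73.5 F

# The biomarker panel (CRP, triglycerides, insulin, cortisol, PSA) has no such
# one-dimensional ordering, so a missing value can't be filled by neighbor interpolation.
```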

So given these considerations, we return to the key evaluation principle: is the data available not just big, but also tall (and with tractable structure)? When data are small (few samples or few attributes), hand-encoding of domain knowledge through feature engineering takes on greater value, but this is very much not an automatic process.

Principle 3: Is Feedback Fast or Slow?

The last question is the simplest to state and yet often the hardest to optimize, particularly in applications that have to touch the real world (as opposed to those operating in a purely digital form). The goal of an ML system is to predict, but no prediction system is perfect, and it can only learn if it knows when it gets things right and wrong. Feedback can vary along several dimensions, including how quickly it arrives, how much it costs to obtain, and how severe the consequences of a wrong prediction are.

We’ve seen a great deal of progress on problems where feedback — both positive and negative — comes extremely rapidly (e.g., game-playing, or marking spam). Furthermore, in these applications (unlike, say, medical pathology AI), the consequences of a bad prediction are fairly low-stakes, so it’s OK to let the machine explore its parameter space for a while before converging.

In many biological problems, these properties have a much more difficult character: feedback on a biological model may take anywhere from days (cell culture) to years (clinical trials) to collect, and bad predictions along the way might be extremely costly (in dollars, time, or even lives), making it hard to train a model.

Note that this question is tightly connected with question two: in many cases, if feedback is fast, then it’s also easy to get lots of data (particularly when simulation is the source of both). However, they’re not entirely the same: we may be able to generate boatloads of training data yet be left with no method better than cross-validation for constructing test sets, because new data are extremely slow or expensive to collect. This question focuses on how easily we can get test data.
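As a minimal sketch of that fallback, assuming scikit-learn and synthetic data: when no independent test set can be affordably collected, cross-validation over the existing data is often the only available estimate of generalization.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Stand-in for an expensive-to-extend dataset: no fresh test set is coming.
X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# With no held-out data, k-fold cross-validation is the default estimate of
# how well the model generalizes -- with all of its usual caveats.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(f"5-fold R^2: mean={scores.mean():.3f}, std={scores.std():.3f}")
```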

Putting it together: future directions for ML in drug discovery

The talk closes out with a discussion of how to apply some of these principles to drug discovery, and how we might be able to leverage technological developments to help.

A big problem (no pun intended) for ML in chemistry is the size of the available data in comparison to the size of the chemical universe. The latter (as noted in the first entry of the drug discovery checklist) is staggeringly large. Jean-Louis Reymond and his group at the University of Bern have computationally explored some of the possibilities here, and note that there are likely around 26.4M molecules with 11 or fewer atoms (C/N/O/F only). Raising the threshold to 17 atoms increases the count more than 6000-fold, to 164B molecules. Only a tiny, tiny fraction of these compounds have ever been experimentally described; the CAS registry contains only on the order of 100M molecules — less than 1/1000th of even the 17-atom enumeration — and of course most of these have more than 17 atoms.
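The back-of-the-envelope arithmetic behind “less than 1/1000th”, using the figures quoted above:

```python
gdb11 = 26.4e6   # molecules with <= 11 atoms (C/N/O/F only), per Reymond et al.
gdb17 = 164e9    # molecules with <= 17 atoms, as quoted above
cas   = 100e6    # order-of-magnitude size of the CAS registry

print(f"Enumeration growth, 11 -> 17 atoms: ~{gdb17 / gdb11:,.0f}-fold")
print(f"CAS coverage of the 17-atom space:  ~{cas / gdb17:.2e} "
      f"(about 1 in {gdb17 / cas:,.0f})")
```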

A differentiating factor between chemistry and genomics is that datasets in chemistry tend to be not just short, but also surprisingly narrow. Whereas in genomics it is relatively easy to measure thousands of parameters per sample (e.g., whole-transcriptome RNA), both experimental data and computational features on compounds tend to be fairly limited. Consequently, there has been a lot of effort on chemical metrics like fingerprints, shape comparison, functional group matching, electrostatics, etc., all of which we can imagine as hand-engineered methods of widening the data matrix. While there are people working on automated feature extraction from chemical structures (using, e.g., graph convolutions), the small data matrices available may be a limiting factor.
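As one concrete example of such hand-engineered widening (a sketch assuming the RDKit toolkit, which the talk does not specifically call out): a Morgan/circular fingerprint turns each compound into a fixed-length bit vector, and stacking those vectors widens the data matrix.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # ethanol, benzene, aspirin

# Each Morgan (circular) fingerprint widens one compound into 2048 binary features.
fps = []
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    fps.append(np.array(list(fp), dtype=np.int8))

X = np.vstack(fps)
print(X.shape)  # (3, 2048): a small-but-wide matrix of hand-engineered features
```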

I finished by sketching a couple of directions that could be taken to address this data limitation in chemistry. One option is to leverage physics-based chemical simulation; in particular, an entire session at CUP the day before focused on model-building from simulation data. The other direction, which I’m particularly excited about, is the use of massively parallel assays (mostly counting assays based on NGS or mass spec) to pull out giant matrices of data in one go, spanning both diversity in chemical species or conditions and diversity in the measured properties. A key organizational challenge will be to set up interdisciplinary groups in which experimentalists and computationalists can work together to design these fiendishly complicated experiments, as the optimal experiment for generating model-training data may not be interpretable by standard methodologies. (The example of perfect recovery of CT data by compressed sensing comes to mind.) This approach will create a number of sub-challenges along the way, including the need to address biases in highly noisy data. While there are basic procedural approaches to mitigate these (e.g., the cross-validation procedures discussed in this paper and the analytical protocols I discussed in a talk last year), it may also be interesting to consider how to build models that directly attack these biases.
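As a toy illustration of the compressed-sensing analogy (my own sketch, not from the talk, and assuming scikit-learn): an L1-penalized fit can recover a sparse signal from far fewer random measurements than unknowns, even though the measurement design itself looks uninterpretable.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# A sparse "true" signal: 200 unknowns, only 8 of them nonzero.
n_features, n_nonzero, n_measurements = 200, 8, 60
x_true = np.zeros(n_features)
support = rng.choice(n_features, n_nonzero, replace=False)
x_true[support] = rng.normal(size=n_nonzero)

# Random (uninterpretable-looking) measurement design with far fewer rows than unknowns.
A = rng.normal(size=(n_measurements, n_features)) / np.sqrt(n_measurements)
y = A @ x_true

# L1-penalized regression typically recovers the sparse signal to within a small error.
x_hat = Lasso(alpha=0.01, max_iter=10_000).fit(A, y).coef_
print(f"max recovery error: {np.abs(x_hat - x_true).max():.3f}")
```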