The following was submitted to the Journal of Medicinal Chemistry as a mini-perspective for their special issue on AI in drug discovery. I leave it to the reader to judge whether they were right to reject it forthwith.
The breadth of the field of artificial intelligence (AI) in drug discovery (DD) can be difficult to keep up with for even the most well-read scientists (DTKUWFETMWS)! To ensure that you might be able to count yourself among the best-prepared audience members at this year’s biggest meetings, we (the royal “we”, there’s only one author willing to take credit for this) have prepared this special piece reviewing the top terms you need to know to ensure fluent understanding no matter which parallel talk track you might wander into (you poor soul).
It is widely recognized that if you show up at a conference and admit not knowing one of the terms in Table 1, the organizers may not only ban you from attending in future years, but may in fact ridicule you on stage [1,2]. Don’t let this happen to you — identify terms you don’t recognize, and look up the definitions in the handy-dandy glossary that follows! Then, to reinforce the value of your learning, print out the table and take it with you to your next meeting. Check off a box each time you hear one of the terms, and count up the value of your studying in real time!
NB: while Table 1 includes a central square marked “Free”, we strongly discourage readers from printing out permutations of the table, handing them out to their colleagues, and seeing who can first complete a line of 5 marked squares and shout “BINGO!” in the lecture hall, as that would be uncouth.
Table 1: An index to the most critical terminology for AI-powered drug discovery in 2020. Any resemblances the terms may have to buzzwords, or the table may have to a Bingo card, are purely coincidental.
AI / Artificial Intelligence: Any process that involves a computer in any way.
Blockchain: A system to secure your precious information by shouting “cryptography” really loudly while setting up an extremely slow shared database. Meanwhile, your officemate is happily entering his login credentials into a page that looks like the email portal but oddly is at “outlook.com.wearedefinitelynothackers.stealingyourdata.com”. Weird.
CRISPR/Cas9: Pro Tip: sound ahead of the curve by casually dropping in conversation that you’ve been doing gene editing since back in the CRISPR/Cas5 days, not like all of today’s amateurs.
Data lake: Is your lab’s data a bit dirty? Results smell kinda fishy? Fear not: enterprise IT just finished reading about 1950s environmental chemistry and got inspired: if we dump everyone’s results together into a single pool, no one can blame any individual group for the trash data screwing up the model. Remember: “dilution is the solution to pollution!”
Data science: You know that one colleague with a PhD who, despite being surrounded by PhDs, insists on being addressed as “Doctor”, as though they’re not quite sure if everyone has noticed their exalted status? Well, anyway, isn’t it funny how it’s not called “chemistry science” or “biology science”?
Decision support system: A box containing a coin, a pair of dice, and a flask of whiskey.
Deep learning: Linear regression hidden under a deep stack of venture capital dollars.
Digital transformation: Using highly advanced technologies to replace your face over video chat with that of a cartoon dog. Looks great on screen; surprisingly, does not change your real-life appearance or abilities. Woof.
Embedding: Some exec took the idea of “chemical space” a little too literally and demanded a map of your screening library, and now you’re having to explain why, no, even though you assigned coordinates to molecules it wouldn’t be a good idea to use Waze to find a retrosynthetic route, it’s really not that kind of map, and yes, Apple Maps isn’t as good haha yes sir great joke.
End-to-end: Referring to a technique in which source data is transformed to final predictions completely automatically, without feature engineering or expert advice in the middle, which therefore implies that if the predictions were bad, the original data must have been bad, and so it’s all the fault of those pesky lab types.
FAIR: Literally just the ability to find and use data; nothing to do with “fairness”. As long as we’re using acronyms that are already words to mean things that aren’t that word, I’ll suggest another useful set of properties for big data: “Broad, Unbiased, Logically-Linked, Specific, High-Information,”…hmm, seem to have run out of ideas. Tarnation.
Generative chemistry: A computer program to automatically generate more molecules than your human chemists will actually make for you, necessitating the purchase of a robotic lab. Like all robot uprisings, if it didn’t work well the first time, it’s because you didn’t eliminate enough humans.
Graph convolutions: A highly democratic approach to building chemical models, wherein if only we could get all the carbons to not just talk to other atoms that look like them, but even talk to their neighbors of different atomic number, charge state, and hybridization, and their neighbors’ neighbors, and so on in one big happy forum, we could trust their conversations to together work out the right answer. You think I’m joking; look it up.
HF/6-31G*: Don’t worry, no one else in the audience knows either.
Knowledge graph: A generalization of the logical technique of reductio ad absurdum in which one never calls any conclusion absurd.
NaN: “Not a number” - the most common result from any AI system that is supposed to output a number. This one isn’t a joke; it’s real life. Pass me the last component of your decision support system, would you?
Platform solution: “You see, houses sell for a million bucks, and I bet you could use this scaffolding to make AT LEAST ten houses, so really, $5M for this pile of pipes and planks is a bargain.”
PubChem Bioassay: It’s a data lake with chemistry data! Yay! Except instead of a private lake, it’s more of a public swimming pool — and that group over yonder’s been in the water for a suspiciously long time without getting out for a bathroom break.
Real-world evidence: Refers to a recent movement to collect safety and efficacy data on Earth, reversing a trend that has grown since 1986, when a coalition of large pharmaceutical companies successfully opened a portal to Earth-Prime and outsourced most drug development to cheaper CROs in a parallel dimension.
Reinforcement learning: A method of training AI systems that borrows from the Montessori method. Traditional deep learning is punitive, assigning models “losses” but never “wins”, and sometimes involves outright brainwashing, “normalizing” and “regularizing” models that fail to behave. By contrast, practitioners of reinforcement learning believe that a nurturing, exploratory environment is best for young AIs.
Robotic lab: Depending on whether the speaker comes from academia or industry, actually a swarm of undergrads or highly educated yet cheap scientists in a nation 15 time zones away. Often paired with a system to do generative chemistry.
State-of-the-art: Still not good enough to do anything useful, or we would have done it.
Virtual screening: In graduate school, the author was told by his housing manager that insect mesh for the windows would not be provided by management (true story). In a fit of exasperation one summer evening, and lacking the funds for netting or insect repellent, he shouted loudly at the mosquitoes to keep out. It turns out that simulation is really no substitute for actual physical and chemical techniques to let the good molecules in and keep the bad ones out.
[1] Haque IS. AI in Drug Discovery Glossary, 2020 Edition.
[2] Muntz, Nelson. “Ha-ha!” Personal communication.
A disclaimer, in case it went over anyone’s head: the views expressed in this post do not reflect the views of my past, present, future, or subjunctive employers on the topic, nor do they really reflect my own. My view is that if we were to be unable to laugh at ourselves, all would be lost.