Imran Haque — Sun 18 August 2019

The shocking conclusion to a series on advances in early detection liquid biopsy from 2019H1:

In the previous installments of this series, I’ve briefly reviewed some of the key technical advances presented by Grail and Guardant at conferences earlier this year and examined how these advances interact with the ever-important factor of tumor fraction to create the pure detection metrics of stage-specific sensitivity and specificity. In this, the last part of the series, I’ll bring these numbers into a broader clinical context with comparison to existing screening methods and consideration of important clinical/population parameters like cost and adherence (how many patients told to take the test actually take the test). Today, I’ll narrow the focus to colorectal cancer (CRC) specifically, for two key reasons: first, it has a TON of existing comparison data from prospective evaluation of various stool tests; and second, it’s the only cancer for which Guardant disclosed data at AACR.

To do this, we’ll build a VERY simple analytical and economic model to analyze each test’s projected performance and its cost to the overall health system…and to come to a very interesting conclusion. You can see the model and its parameters on Google Sheets or download it in Excel format. The relevant parameters for each test are its stage-specific sensitivity and nominal specificity, adherence rate, cost of test, cost of followup on positives, cost of “missing” a cancer (false negative), and periodicity (how often the test is done). I’ll consider two stool-based tests: Cologuard and FIT (both using data taken from the Cologuard validation study, Imperiale 2014 [1]), and two different blood-based tests (from Grail [2] and Guardant [3]) at a few different specificity cutoffs.

There are a number of caveats with this model (which I’ll get into later in the post), but one extremely critical one I’ll mention upfront: while the data for stool testing were generated in a fully-prospective screening population, the liquid biopsy data were all generated in much smaller case-control populations, often those with prior clinical presentation. I’m using the numbers at face value in this analysis, but know that history shows that numbers generated under these circumstances usually get worse when they hit the real world — so consider the analysis here something of a best-case scenario for liquid tests.

Analytical Performance

In a prospective screening population, Imperiale et al. reported 65 CRCs in almost 10,000 patients (roughly 0.65%); knowing this overall incidence is critical for determining the number of false positives and negatives that we’ll see, as well as the benefit from true positives (detected cancers). Every test performs differently as a function of stage; in particular, as we saw in earlier parts of this series, liquid biopsies tend to perform quite poorly at stages I and II (compared to stool-based testing), but interestingly seem to do better with metastatic tumors. (This could be down to tumor burden and location, or just statistical noise.) While Guardant’s numbers at first glance seemed to me to be implausibly high, when put in context with the Grail results they look a bit more reasonable. We haven’t seen much in the way of ROC curves from Grail, but the performance reported by Guardant seems like a reasonable sens/spec tradeoff with respect to the (even higher specificity) Grail data.

Test	Specificity	Sensitivity, stage I	Sensitivity, stage II	Sensitivity, stage III	Sensitivity, stage IV
Guardant AACR 2019	90%	84%	90%	95%	100%
Guardant AACR 2019	95%	76%	87%	95%	100%
Guardant AACR 2019	98%	64%	82%	84%	100%
Grail ASCO 2019	99%	53.9%	77.2%	78.2%	92.7%
Cologuard	86.6%	88.7%	100%	90%	75.3%
FIT	94.9%	65.5%	75.9%	90%	75.5%

I then weighted these stage-specific sensitivities by the stage distribution of cancers in the screening population reported by Imperiale 2014 to compute an overall projected sensitivity in screening. From this weighted sensitivity, the nominal specificity, and the incidence in the population, it’s easy to compute the full confusion matrix (true/false positive/negative rates) as well as interesting metrics like positive predictive value. For full details, check the spreadsheets linked above. Some highlights:

Test	Sensitivity	Specificity	PPV
Grail	67.8%	99%	30.7%
Guardant	88.7%	90%	5.5%
Cologuard	91.8%	86.6%	4.3%

Grail’s test appears to have much lower overall sensitivity than Cologuard (as expected from its poor stage I performance): 68% vs 92%; but much higher specificity (99% vs 87%).
Guardant’s test at its lowest-specificity operating point is strikingly similar to Cologuard’s numbers.
All of these tests have quite poor PPV! As a consequence of the rarity of CRC, even the Grail test with its 99% specificity will have almost 7/10 of positive results be false positives. Tests with specificity around 90% are even worse: 19/20 positive test results will be false positives. That’s a lot of potential scares.
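
As a rough illustration, the weighting and PPV arithmetic can be sketched in a few lines of Python. Two inputs here are assumptions that are not stated in this post: the staged cancer counts (29, 21, 10 and 4 for stages I to IV) and the rounded prevalence of 65 per 10,000. They were chosen because they reproduce the summary numbers above; the linked spreadsheet remains the authoritative version of the model.

```python
# Stage-specific sensitivities (stages I, II, III, IV) copied from the table above.
STAGE_SENSITIVITY = {
    "Grail": [0.539, 0.772, 0.782, 0.927],
    "Guardant@90": [0.84, 0.90, 0.95, 1.00],
    "Cologuard": [0.887, 1.00, 0.90, 0.753],
}
SPECIFICITY = {"Grail": 0.99, "Guardant@90": 0.90, "Cologuard": 0.866}

# ASSUMPTION: staged cancer counts (I, II, III, IV) in the screening cohort.
STAGE_COUNTS = [29, 21, 10, 4]
# ASSUMPTION: prevalence rounded to 65 cancers per 10,000 people screened.
PREVALENCE = 65 / 10_000


def weighted_sensitivity(stage_sens, stage_counts=STAGE_COUNTS):
    """Average the per-stage sensitivities, weighted by cancers at each stage."""
    total = sum(stage_counts)
    return sum(s * n for s, n in zip(stage_sens, stage_counts)) / total


def confusion_rates(sensitivity, specificity, prevalence=PREVALENCE):
    """Return the four confusion-matrix cells as fractions of those tested."""
    return {
        "tp": sensitivity * prevalence,
        "fn": (1 - sensitivity) * prevalence,
        "fp": (1 - specificity) * (1 - prevalence),
        "tn": specificity * (1 - prevalence),
    }


def ppv(sensitivity, specificity, prevalence=PREVALENCE):
    """Positive predictive value: true positives over all positives."""
    r = confusion_rates(sensitivity, specificity, prevalence)
    return r["tp"] / (r["tp"] + r["fp"])


for name, stage_sens in STAGE_SENSITIVITY.items():
    sens = weighted_sensitivity(stage_sens)
    print(f"{name}: sensitivity {sens:.1%}, PPV {ppv(sens, SPECIFICITY[name]):.1%}")
```

Because false positives scale with the roughly 99.35% of people who do not have cancer, even a one-point drop in specificity adds more positives than the entire pool of true cancers, which is why PPV falls so quickly.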

Economics

I then took these analytical numbers and fed in some simple assumptions on costs to build a VERY SIMPLE economic model for these screening tests. I’ll model the overall “system” cost of a test as: (cost of test + cost of any followup testing on positive results + cost of missing a cancer) / (interval of screening). No cancer treatment other than colonoscopy is modeled. Some price estimates were fairly arbitrary:

I modeled FIT as costing $50/test and colonoscopy (the positive followup) as $1200 (both may be excessive).
I modeled Cologuard at $482/test based on (revenue / total tests) from Exact Sciences’ 2019Q2 earnings report [4], and set the cost of each blood test to match that of Cologuard (under the assumption that COGS are likely higher for liquid biopsy, but reimbursement is not likely to be higher than that for Cologuard).

Given that existing screening guidelines recommend either annual FIT or triennial Cologuard, I modeled the cost of a false negative at $25,645, which makes the final system cost for annual FIT and triennial Cologuard the same in this simple model.

Finally, adherence is a critical parameter in the model; one of the main appeals of liquid biopsy screening is that it may be more appealing to patients than stool collection or exposure to ionizing radiation. Extensive data exist for adherence to stool testing; I modeled it at 65.5%, the mean of the results from Inadomi 2012 and Cyhaniuk 2016 [5,6]. Data on adherence to liquid biopsy screening are, understandably, nonexistent. However, other blood-based monitoring tests exist in different areas of medicine. Moffet et al. report adherence of 86% to hemoglobin A1c testing in the management of diabetes, which seems like a reasonable upper bound for adherence to cancer screening [7]. Adherence to followup colonoscopy for positives is assumed to be 100%.
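
One way to put the cost formula and these parameters together is sketched below in Python. This is a reconstruction rather than the spreadsheet's exact logic, and it rests on three assumptions: test and followup costs are only incurred by people who adhere, cancers in people who skip screening count as misses, and prevalence is rounded to 65 per 10,000. The function name `annual_system_cost` and its parameter names are hypothetical.

```python
def annual_system_cost(
    sensitivity,
    specificity,
    adherence,
    test_cost,
    interval_years,
    followup_cost=1200.0,          # colonoscopy after a positive result
    false_negative_cost=25_645.0,  # cost assigned to each missed cancer
    prevalence=65 / 10_000,        # ASSUMPTION: rounded cohort prevalence
):
    """Average system cost per eligible person per year (simplified sketch)."""
    # Fraction of tested people who get a positive result (true + false).
    positive_rate = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    # ASSUMPTION: a cancer is missed if the person skips the test or tests negative.
    missed = (1 - adherence) * prevalence + adherence * (1 - sensitivity) * prevalence
    # ASSUMPTION: only adherent people incur test and followup costs.
    cost_per_round = (
        adherence * test_cost
        + adherence * positive_rate * followup_cost
        + missed * false_negative_cost
    )
    return cost_per_round / interval_years


# Overall sensitivity and specificity from the summary table; adherence of 65.5%
# for stool testing and 86% for blood testing, as discussed above.
cologuard = annual_system_cost(0.918, 0.866, 0.655, 482.0, interval_years=3)
grail = annual_system_cost(0.678, 0.99, 0.86, 482.0, interval_years=3)
print(f"Cologuard: ${cologuard:.2f}/patient/year")
print(f"Grail:     ${grail:.2f}/patient/year")
```

Keeping adherence as an explicit argument makes it easy to rerun the comparison under more pessimistic assumptions about how many people would actually take a blood test.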

The results of this model were extremely surprising to me. It turns out that high-specificity liquid biopsy tests, in spite of their relatively poor sensitivity, not only come out with a similar cost profile to stool testing, but even come out with a similar effective sensitivity (i.e., not many more false negatives). This comes down to adherence: although Cologuard works better in those who actually take the test, the blood tests as modeled are able to get more people to actually be screened, which basically cancels out the liquid tests’ analytical deficiency. There’s also an interesting tradeoff to be made in relaxing specificity of the liquid tests: the Guardant test at 90% specificity ends up only a few percent more expensive than the high-spec test, but cuts the number of missed cancers by a third. Changing the weighting applied to followup cost versus false negative costs could really change the outcome here!

Test	Periodicity	Total cost/patient/year	Effective false-negative rate in population
Cologuard	Triennial	$163.87	0.26%
Grail	Triennial	$166.31	0.27%
Guardant @spec98	Triennial	$166.31	0.23%
Guardant @spec90	Triennial	$172.55	0.15%
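
The adherence effect behind the last column can be sketched in Python as follows. The sketch assumes that a cancer counts as missed when the person either skips the test or tests negative, and that prevalence is rounded to 65 per 10,000; both are assumptions chosen because they reproduce the rates in the table. The function names are hypothetical.

```python
# ASSUMPTION: prevalence rounded to 65 cancers per 10,000 people.
PREVALENCE = 65 / 10_000


def effective_false_negative_rate(sensitivity, adherence, prevalence=PREVALENCE):
    """Share of the whole eligible population with a cancer that goes undetected.

    ASSUMPTION: cancers in people who skip the test count as missed, so the
    detected share of cancers is adherence * sensitivity.
    """
    return prevalence * (1 - adherence * sensitivity)


def break_even_adherence(sens_new, sens_ref, adherence_ref):
    """Adherence a new test needs to miss no more cancers than a reference test."""
    return adherence_ref * sens_ref / sens_new


# Overall sensitivities from the summary table; 65.5% adherence for stool
# testing and 86% for blood testing.
for name, sens, adherence in [
    ("Cologuard", 0.918, 0.655),
    ("Grail", 0.678, 0.86),
    ("Guardant @spec90", 0.887, 0.86),
]:
    rate = effective_false_negative_rate(sens, adherence)
    print(f"{name}: {rate:.2%} of the population has a missed cancer")

needed = break_even_adherence(sens_new=0.678, sens_ref=0.918, adherence_ref=0.655)
print(f"Adherence Grail needs to match Cologuard: {needed:.1%}")
```

Under these assumptions what matters is the product of adherence and sensitivity, so a test with 68% sensitivity needs adherence in the high 80s to match a 92%-sensitive test taken by 65.5% of people.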

Conclusion

I admit that taking this exercise to its conclusion quite surprised me: the landscape of screening may be a lot more interesting than the analytical numbers alone dictate. I’ve so far been quite bearish on the published liquid-based screening (especially mutation-based screening) protocols on the basis of their inability to deal with low-tumor-fraction and early stage cancers: basically, if our goal is to catch cancer early, but the test can’t even detect stage I at better than coin-flip odds, what’s the point? However, incorporating some very simple economics and patient preferences into the analysis changes the game: it may be that from a population perspective, the (analytically) worse test is (epidemiologically) better, because you can actually get people to take it! At the very least, it’s far more competitive than I had previously thought.

That said, there are a number of really, really big limitations in this analysis. From a medical perspective, I’ve completely ignored the ability of each test to detect precancerous lesions (e.g., adenomas in CRC; DCIS in breast cancer). As no one has published on liquid biopsies doing this, it seems likely that non-liquid protocols still do a better job here. On the other hand, most liquid tests in this space aim to detect more than one cancer (Grail’s data explicitly discussed this), which may get more value out of a single test run (or more cost, if lots of expensive followup is needed on a low-PPV test!). Don’t forget that the analytical comparison here is not really fair, as the liquid tests have not undergone a true population-based prospective trial, and will likely see their performance drop in such a setting. Finally, the economic analysis in this post is laughably simplistic; at the very least, you’d want to account for the life-years gained from screening as a function of stage, with a cost per life-year or QALY/DALY (this might shift things back in favor of protocols with better early-stage performance: there are many more life-years gained from cutting out a stage I tumor than from initiating treatment at stage IV).

Regardless of these limitations, I hope this has been as enlightening to read as the analysis was for me to do, and that it conveys the extremely complicated and multidimensional nature of evaluating technical advances in screening for actual clinical application.