Using clinical text to combat selection bias in medical research

Stanford researchers identify clinically meaningful confounders of retrospective observational research in medical record text.

Unstructured text in patients’ medical records can reveal treatment selection bias, Stanford researchers find. 

Patients diagnosed with cancer must often choose between several treatment options, such as surgery, chemotherapy, or radiation. And they expect their clinicians to recommend the option that offers the greatest likelihood of survival - or the best quality of life.

To provide that advice, clinicians turn to research that compares various treatments’ success. Some of that research is based on randomized clinical trials, which are the gold standard for so-called comparative effectiveness research. But because randomized clinical trials have not been done for all possible treatments, clinicians may instead rely on studies that look retrospectively at the relative success of different treatments in the real world.

The problem is this: The results of these retrospective studies are often confounded by the non-random ways in which clinicians actually decide which patients should get a particular treatment.

"We have to really watch out for selection bias," says Ross Shachter, Stanford associate professor of management science and engineering and faculty affiliate of the Stanford Institute for Human-Centered Artificial Intelligence. For example, he says, "Older patients might be naturally dying at a higher rate, and if they are shunted to a different treatment in practice, it looks like the treatment they are receiving is worse because they are starting with sicker patients."

In addition to patient age, potential sources of confounding include pre-existing conditions and comorbidities - factors that might cause doctors to recommend different treatments and that also might produce worse outcomes.

To identify specific sources of confounding in retrospective observational research studies - and ultimately to support clinician decision making - Jiaming Zeng, a PhD candidate in Management Science and Engineering, teamed up with Shachter, professor Susan Athey of Stanford Graduate School of Business, and Stanford Medical School professor Daniel Rubin and clinical associate professor Michael Gensheimer.

Zeng and her collaborators used relatively simple machine learning methods to identify clinically meaningful sources of treatment selection bias in the unstructured text portion of patients’ electronic medical records.

"The terms we uncover in the medical records’ unstructured text can help us better understand how treatment decisions are currently made and common decisions that introduce selection bias without our awareness," says Zeng, lead author on the team’s paper. "This can in turn help inform medical decisions."

Doctors want to know about potential sources of selection bias in retrospective studies before relying on them when advising their patients. And Zeng’s work gives them that explanation. "It’s a way to gain their trust," she says. 

The Confounding Problem

In a randomized clinical trial (RCT), researchers select an appropriately circumscribed group of patients and randomly assign them to various treatments. They then follow those patients over time and evaluate the treatment groups’ relative survival rates and post-treatment quality of life. But as cancer treatments proliferate, it can be hard to conduct an expensive RCT for every possible treatment option. 

The main alternative to an RCT is a retrospective comparative effectiveness study that looks at outcomes for patients who have been treated in the real world. For these studies, experts strive to select a circumscribed set of patients who have already had a particular treatment. And they try very hard to account for factors that could confound the result, such as a patient’s gender, race, ethnicity, disease severity, or pre-existing conditions. "The idea is that if you can control for all the potential confounders in a retrospective population-based study, you would get a causal inference result that should match the randomized clinical trial," Zeng notes.

But confounders - nonrandom factors that are associated with both treatment and outcomes - continue to affect the outcomes of retrospective studies. Indeed, researchers have seen huge discrepancies between observational study results and RCT results. For example, a 2015 retrospective study of treatments for prostate cancer suggested that surgery was better than radiation for patients’ overall survival, but a subsequent RCT showed that survival was actually the same for patients treated with either radiation or surgery. One hypothesis about the source of the bias is that younger or healthier patients were offered surgery because they were more likely to recover from it, while older or less healthy patients were more likely to be offered radiation. The successful recovery of the healthier patients made surgery seem like the better option, when in fact the two treatments yield similar survival outcomes. As a result of this research, doctors who had preferred surgery over radiation had to change their practice.

Seeking Confounders in Unstructured Text

Even though designers of retrospective research try to account for confounding through reliance on the structured part of the electronic medical record, they don’t quite succeed. "That’s why we had the idea to use the textual information in the electronic medical records," Zeng says. "Textual records are messy, but there’s a lot more information in there," she says.

The goal was to identify words in the unstructured text of the electronic medical record that are predictive of both the treatment and the outcome. "That’s the whole issue with selection bias - trying to control for the variables that affect both," Shachter says.

Zeng focused on a population of Stanford Health Center patients who had been treated for either prostate cancer or non-small-cell (NSC) lung cancer in the last decade or so. She chose these cancers because there already existed an RCT that could serve as a benchmark for each: A prostate cancer RCT had found that surgery, radiation, and active monitoring all produced similar survival and quality of life outcomes; and an NSC lung cancer RCT found slightly better survival for those treated with radiation than with surgery.

Zeng extracted biomedical terms from the prostate and NSC lung cancer patients’ unstructured clinical notes and then used a simple natural language processing technique called "bag of words" to generate a matrix of word frequency counts. She then applied a method for shrinking the number of relevant variables and selecting those that were closely tied to patients’ treatment and outcome. Because of the approach she used, the outcome was interpretable: Each variable could be matched to a word.
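To make the two steps above concrete, here is a minimal Python sketch, not the team's actual pipeline. The notes, treatment labels, outcomes, vocabulary, and threshold are all invented for illustration, and the variable-selection step is replaced by a much simpler stand-in: keeping only the terms whose counts correlate with both treatment and outcome. The article does not name the selection method Zeng used, so this screening rule is an assumption.

```python
from collections import Counter

def bag_of_words(notes, vocabulary):
    """Return one row of term counts per note, with columns ordered as `vocabulary`."""
    rows = []
    for note in notes:
        counts = Counter(note.lower().split())
        rows.append([counts[term] for term in vocabulary])
    return rows

def correlation(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def candidate_confounders(matrix, vocabulary, treatment, outcome, threshold):
    """Keep terms whose counts are associated with BOTH treatment and outcome.

    This correlation screen is a simplified stand-in for the variable-selection
    step described in the article, not the method the researchers used.
    """
    keep = []
    for j, term in enumerate(vocabulary):
        column = [row[j] for row in matrix]
        linked_to_treatment = abs(correlation(column, treatment)) >= threshold
        linked_to_outcome = abs(correlation(column, outcome)) >= threshold
        if linked_to_treatment and linked_to_outcome:
            keep.append(term)
    return keep

# Hypothetical clinical notes, one per patient (invented for this sketch).
notes = [
    "bladder mass noted bladder wall thickening",
    "urothelial carcinoma history bladder",
    "follow up psa stable",
    "routine psa stable no complaints",
    "bladder involvement suspected",
    "routine visit psa rising",
]
treatment = [1, 1, 0, 0, 1, 0]  # hypothetical: 1 = surgery, 0 = radiation
outcome = [1, 1, 0, 1, 0, 0]    # hypothetical: 1 = died during follow-up

vocabulary = ["bladder", "urothelial", "psa", "routine"]
matrix = bag_of_words(notes, vocabulary)
selected = candidate_confounders(matrix, vocabulary, treatment, outcome, threshold=0.4)
print(selected)
```

Because every column of the matrix is a single word, anything the screen keeps can be read directly by a clinician, which is the interpretability property the article emphasizes.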

For prostate cancer, Zeng’s bag-of-words approach found that the terms "bladder" and "urothelial" were linked to worse surgical outcomes. In consultation with co-author Gensheimer, Zeng learned that patients with bladder or urothelial cancer are more likely to be assigned surgery because these cancers don’t respond well to radiation. "That’s how these terms relate to treatment decisions," Zeng says. Also, she learned, such patients are likely to be older and to have other comorbidities - which is how these terms affect outcomes.

For lung cancer, Zeng’s bag of words identified the terms "ALK" and "left.low" as being linked with both radiation treatment and poor survival. It turns out that ALK is a type of lung cancer mutation that can affect both treatment decisions and survival. And Zeng found papers observing that patients with cancer in the lower left lobe of the lung tend to have poorer survival rates. "This again shows that the terms we have uncovered are interpretable and can give interesting clinical insights," Zeng says.

But Zeng did not stop there. She wanted to know if including the additional potential confounders she uncovered from the clinical text in a retrospective study would provide more accurate estimates of which treatments yield higher survival rates. "The idea is to adjust for the source of confounding," she says.

When Zeng did what was essentially a classic retrospective population-based study of prostate cancer treatments without including her unstructured text, she found that patients treated with radiation or monitoring fared somewhat better than those treated with surgery - a result that differed from the RCT findings. But when she added the variables identified through her method, such as the frequency of the words "bladder" and "urothelial," the treatments’ relative survival rates moved closer to equipoise - the result predicted by the RCT. 
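The effect of adjusting for a text-derived confounder can be illustrated with a toy calculation. The Python sketch below uses a made-up cohort in which the two treatments have identical death rates within each group of patients, but patients whose notes mention "bladder" are both sicker and more likely to get surgery. It adjusts by simple stratification, which is an assumption chosen for clarity and not necessarily the estimator used in the study. All counts are invented.

```python
def death_rate(records):
    """Fraction of patients in `records` who died."""
    return sum(died for _, _, died in records) / len(records)

def naive_difference(records):
    """Surgery death rate minus radiation death rate, ignoring confounders."""
    surgery = [r for r in records if r[0] == "surgery"]
    radiation = [r for r in records if r[0] == "radiation"]
    return death_rate(surgery) - death_rate(radiation)

def adjusted_difference(records):
    """Compare treatments within each stratum of the confounder, then average
    the per-stratum differences weighted by stratum size.

    Assumes every stratum contains patients from both treatment groups.
    """
    total = len(records)
    result = 0.0
    for flag in (True, False):
        stratum = [r for r in records if r[1] == flag]
        result += len(stratum) / total * naive_difference(stratum)
    return result

def make_group(treatment, mentions_bladder, n, deaths):
    """Build n hypothetical records (treatment, mentions_bladder, died)."""
    return [(treatment, mentions_bladder, i < deaths) for i in range(n)]

# Invented cohort: notes mentioning "bladder" mark sicker patients (50% die
# under either treatment); all other patients die at 10% under either one.
cohort = (
    make_group("surgery", True, 60, 30)
    + make_group("surgery", False, 40, 4)
    + make_group("radiation", True, 20, 10)
    + make_group("radiation", False, 80, 8)
)

naive = naive_difference(cohort)        # surgery looks worse
adjusted = adjusted_difference(cohort)  # the gap closes after adjustment
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

In this constructed example the unadjusted comparison makes surgery look worse only because surgery patients were sicker to begin with; once the comparison is made within each stratum, the difference disappears, mirroring the move toward equipoise described above.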

This shift was not as clear for Zeng’s lung cancer study. "The movement toward equipoise isn’t as great in this case, but the terms we uncovered still gave interesting clinical insights," Zeng says.

In future work, Zeng says it would be interesting to use more advanced natural language processing techniques that look not just at word counts, but at the context in which words are used. But whatever technique is tried, she says, it should be interpretable rather than black box. "What was interesting in our work were the terms we pulled out and how those were able to give us meaningful clinical insight." 

A De-Confounding Pipeline

Zeng’s research sets up a new infrastructure for future retrospective research, Shachter says. "People can follow this path to get rid of some of the sources of selection bias in interpreting observational data." The approach could help predict the success of treatments for other diseases and other populations; and it may even prove useful in the social sciences, where researchers commonly rely on retrospective data.

Of course, some confounding may be unknown and unobservable. "Selection bias is a really tough nut," Shachter says. "And we can’t solve it, but we can chip away at it."

