A study published in the Journal of the American College of Surgeons finds that missing patient information in a national database can lead to overestimation of survival rates and exclusion of underserved patients.
A significant number of cancer patients may have incomplete case information in the Surveillance, Epidemiology, and End Results (SEER) database, according to a study published in the Journal of the American College of Surgeons. The gaps are concentrated among patients with advanced-stage disease, who are more likely to receive care at community hospitals, safety net hospitals, and rural medical centers. This finding raises concerns about the reliability of studies that rely on SEER data.
Senior study author Schelomo Marmor, Ph.D., MPH, notes that these missing cases create "blind spots" in the data. Dr. Marmor and McKenzie White, MD, a complex general surgical oncology fellow at Moffitt Cancer Center, studied four types of cancer: breast, pancreas, colon, and non-small cell lung cancer (NSCLC). The researchers found that patients with missing data had significantly lower three-year overall survival rates than those with complete records. For example, breast cancer patients with missing data had a 63% overall survival rate, compared with 81% for those with complete data.
"These are not minor statistical loose ends," Dr. Marmor emphasizes. "These are high-risk, underserved populations that effectively disappear from the scientific record every time studies exclude incomplete cases."
The study analyzed 328,000 patients and found that those treated at centers without Commission on Cancer (CoC) accreditation were two to three times more likely to have missing data. For instance, 23% of breast cancer patients at non-CoC-accredited centers had missing data, compared with 9% at CoC-accredited facilities.
"This means that the most difficult cases are being systematically excluded from analyses," Dr. Marmor explains. "When these records are dropped, we don't just lose data points; we lose the clinical and human reality of cancer in America."
The study also revealed that patients with missing data were more likely to be older and to come from rural areas or socioeconomically disadvantaged backgrounds, making them less likely to receive preventive care and more likely to be diagnosed at advanced stages. Excluding these patients from analyses further exacerbates existing health disparities.
"Think of it this way: If you set out to understand how cancer treatments perform across the entire country but your data systematically leaves out the sickest patients, the oldest patients, and those from rural or underserved communities, then what you're left with is a portrait of cancer care that looks much rosier than reality," Dr. Marmor says.
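The size of that "rosier" effect can be roughly illustrated with the breast cancer figures quoted above. The sketch below is not the study's method. It assumes, purely for illustration, that the 63% and 81% three-year survival rates apply equally at CoC-accredited and non-accredited centers, which the article does not state. The function names are hypothetical.

```python
# Illustrative arithmetic only, using the breast cancer figures quoted in the article.
# Assumption (not from the study): the two survival rates are the same at
# CoC-accredited and non-accredited centers.
SURVIVAL_COMPLETE = 0.81  # three-year overall survival, complete records
SURVIVAL_MISSING = 0.63   # three-year overall survival, records with missing data


def pooled_survival(missing_fraction: float) -> float:
    """Survival across all patients when a given share has missing data."""
    return (
        (1 - missing_fraction) * SURVIVAL_COMPLETE
        + missing_fraction * SURVIVAL_MISSING
    )


def complete_case_bias(missing_fraction: float) -> float:
    """How far a complete-records-only analysis overstates pooled survival."""
    return SURVIVAL_COMPLETE - pooled_survival(missing_fraction)


if __name__ == "__main__":
    # Shares of breast cancer patients with missing data, as quoted in the article.
    for label, fraction in [("non-CoC-accredited", 0.23), ("CoC-accredited", 0.09)]:
        print(
            f"{label}: pooled survival {pooled_survival(fraction):.1%}, "
            f"overstated by {complete_case_bias(fraction):.1%} if incomplete "
            "records are dropped"
        )
```

Under these assumptions, the overstatement grows in direct proportion to the share of incomplete records, so dropping those records distorts the picture most at the centers where missing data is most common.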
The implications for building AI models are significant. Population-based registries like SEER are foundational in enabling AI-driven oncology research. However, if specific data points are systematically excluded due to missing information, it can lead to biased or inaccurate AI predictions.
Dr. Marmor suggests that cancer researchers draw on multiple data sources, such as the National Cancer Database alongside SEER, to build a more comprehensive picture of patient outcomes, and calls for further research into why specific data are missing, since that knowledge is crucial for developing strategies to address these gaps in future studies.
"This work highlights the importance of ensuring that all patients' records are included in analyses," Dr. Marmor concludes. "Without complete and accurate data, we risk perpetuating health disparities and overlooking critical insights into cancer care."