The CDRH workshop: “Evolving Role of Artificial Intelligence in Radiological Imaging”
As data scientists we often focus on solving specific problems, and do so in an idealized setting. Because of this it is important, from time to time, to pause for a moment and examine the general context in which our solutions will be deployed: to listen to the professional and scientific communities we work with, and to understand what they value and what concerns them. In this sense, the recent CDRH workshop – “Evolving Role of Artificial Intelligence in Radiological Imaging” – represented an interesting opportunity, and in this post I hope to convey some of the major highlights from the workshop.
As computational power, imaging quality and technical know-how constantly increase, machine learning becomes ever more integrated into day-to-day medical practice, raising new technical and regulatory challenges along the way.
Framed as classification, segmentation, or a mixture of both, a large body of work has been devoted to building detection algorithms. As imaging resolution increases, several presenters expressed interest in streamlined systems that automatically detect all abnormalities in the acquired data, especially those that are small or hard to spot.
In practice, there is some concern over the effectiveness of these devices once implemented, as past Computer Aided Diagnosis and Computer Aided Detection systems (CADx and CADe) were in some cases shown to be ineffective or even detrimental in terms of specificity and positive predictive value (Fenton et al., 2007, Influence of computer-aided detection on performance of screening mammography).
This should serve as an important reminder to take the general context into account. These automated systems have to be understood as part of a larger process that begins when the patient is admitted and includes the entire image-processing pipeline as well as the physician. Quality control of the data employed, ensuring it is fit for its purpose, was highlighted as a critical aspect of realistic clinical-level pipelines.
The performance of radiologists themselves, however, can be rather inconsistent, varying with experience, skill, and whether the radiologist is sub-specialized in the specific context. Automated detection algorithms could increase the sensitivity of the diagnostic process, particularly when compared to less-experienced radiologists, and even more so when integrating lab results, omics data and medical records. In fact, this context is absolutely fundamental in real-world medical diagnosis, and should ideally be integrated into ML tools.
Knowledge of the underlying physics can help generate synthetic data with known ground truth, as in the work presented by Abhinav Jha. This is especially valuable for tasks where data is scarce, or hard to correctly label manually.
Rule-out systems are an interesting form of detection system with a narrower scope, and for this reason likely safer and easier to fit into existing medical practice. For example, rather than hoping the AI will correctly label each and every Pap smear image with no human involvement, it can be used to screen out high-confidence negative results.
Rule-out systems are especially indicated where false positives are risky or resource-expensive. In practical applications a large proportion of samples is negative, so even if only 25% of negatives reached high enough confidence to be screened out, the result would still be a significant workload reduction.
As prediction confidence is a fundamental aspect of these systems, Bayesian deep learning models would seem to be especially interesting for rule-out systems. As each false negative could potentially lead to loss of life, an extremely high negative predictive value is required, surpassing human performance.
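The operating logic of a rule-out system can be sketched in a few lines. The sketch below is purely illustrative: the scores are synthetic stand-ins for a model's predicted probabilities, and the 0.05 rule-out threshold is a hypothetical choice, not one endorsed by the workshop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model scores (probability of disease) for 10,000 samples,
# where ~90% are truly negative, as is typical of screening populations.
n = 10_000
labels = rng.random(n) < 0.10                         # True = diseased
scores = np.clip(rng.normal(0.7 * labels + 0.1, 0.15), 0, 1)

# Rule out only samples the model is very confident are negative.
rule_out_threshold = 0.05
ruled_out = scores < rule_out_threshold

# Fraction of cases removed from the human reading workload.
workload_reduction = ruled_out.mean()
# Negative predictive value among ruled-out cases: must be near 1,
# since each false negative could potentially lead to loss of life.
npv = (~labels[ruled_out]).mean()

print(f"workload reduction: {workload_reduction:.1%}")
print(f"NPV on ruled-out cases: {npv:.4f}")
```

Lowering the threshold trades workload reduction for a higher NPV, which is exactly why well-calibrated confidence estimates (e.g. from Bayesian deep learning) matter so much here.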
AI in existing measuring pipelines
One of the general take-away messages we should gather from this workshop is that medical practitioners hope AI will help them increase the amount of time they can directly dedicate to their patients.
In deep learning research we often look for an end-to-end approach, yet there is still a great interest in AI streamlining existing measuring pipelines. Automated quantitation of relevant diagnostic data can at the same time enhance the quality of the measures involved and save valuable time. There is significant interest in automating all those repetitive tasks involved in medical practice that force physicians to dedicate less time to their patients.
Generative models and autoencoders
Another interesting direction of research in medical imaging focuses on image reconstruction and generation. For example, a generative model can take a low-dose CT scan and generate what appears to be a good-looking scan of the same tissues acquired with a higher radiation dose. However, this does not necessarily provide any clinical advantage, and can in some cases introduce features that would not be present in a real scan of the same tissues.
This is not surprising: a generative model learns to produce likely results given its inputs, but this is a statistical process. It is not informed by the actual tissues, and cannot take the physics of the instrument into account, save for what is extrapolated during training.
While these models can still be useful for data augmentation, we have to be extremely careful when employing them for image reconstruction, denoising or other visual improvements, which require task-specific evaluation. These systems need to be evaluated based on their actual impact on clinical outcomes.
Incidental findings are remarkably common in medical imaging. A 2017 study of nearly 4,000 children reported incidental findings in 25.6% of them, and approximately 1 in 200 children required clinical follow-up. Especially when envisioning data-to-diagnosis systems with no direct human oversight of the images, opportunistic screening becomes essential: it is necessary to look beyond the primary reason for acquiring the data.
Many imaging techniques could benefit from AI-guided acquisition. These systems could guide the user when selecting acquisition parameters, for example by providing guidance regarding positioning and orientation. In MRI they could advise regarding sequence parameters, FOV and slicing options.
Ultrasound imaging hardware has become compact and portable, yet still requires specialized knowledge to operate. AI guidance could allow general practitioners or untrained users to acquire clinically useful data. This is especially interesting for rural areas, for countries where healthcare access is limited, and for increasing access for disabled patients.
Another benefit of AI-guided acquisition would be the standardization of the acquired data and a general improvement in quality, which would facilitate all the other forms of automation outlined above.
More interesting applications
Other interesting applications, proposed to reduce workload and streamline various aspects of the medical profession, include applying machine learning to triage and prognosis prediction. Many of these methods could also benefit from information extracted through natural language processing from medical records and patient history. Another interesting area of research is the development of neural networks that might simulate the effects of specific treatments on the patients.
An interesting point regarding triage is that implementing currently available clinical ML solutions for triage often increases turnaround time, as a consequence of an excessive number of false positives.
Two general concerns about automated systems are data drift and real-world impact on clinical outcome, both requiring post-market surveillance.
Data accumulated over time can change enough that, even though preclinical performance was satisfactory, the system becomes ineffective in practice. This calls for continuous monitoring for statistical change, and can require constant retraining and tweaking of the AI system to keep up with the shifting environment.
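One common way to monitor for this kind of statistical change is the population stability index (PSI), which compares the distribution of a feature in production data against a reference snapshot from validation time. The sketch below is a minimal illustration on synthetic data; the feature, thresholds and sample sizes are all assumptions for the example.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10):
    """PSI between a reference distribution (e.g. validation data)
    and current production data for a single scalar feature."""
    # Bin edges from the reference distribution's quantiles,
    # with open-ended outer bins to catch out-of-range values.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Small floor avoids division by zero in empty bins.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, 5000)  # e.g. a pixel-intensity summary
stable = rng.normal(0.0, 1.0, 5000)     # new data, same acquisition setup
drifted = rng.normal(0.8, 1.5, 5000)    # new data after a protocol change

psi_stable = population_stability_index(reference, stable)
psi_drifted = population_stability_index(reference, drifted)
print(psi_stable)   # small: distributions match
print(psi_drifted)  # large: flag the pipeline for review
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift worth investigating, though the appropriate thresholds would need clinical validation.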
The FDA in this regard requires a proactive approach from the manufacturer, preemptively addressing risk factors and building appropriate risk-mitigation strategies. Current regulation allows these devices to be continuously updated and tweaked, provided the modalities are clearly outlined beforehand. To also address privacy concerns, federated learning strategies are expected to prove valuable in this context.
Understandably, a major industry concern is generalization. It is not yet clear how the spectrum of adaptiveness of a system can be characterized, so generalization needs to be actively proven, or the system restricted to data it has been shown to handle effectively. In this regard quality control is key.
We also saw some examples of GANs used to adapt data to a different contrast for data augmentation, which might also help for data drift. It is unclear, and an interesting research topic, if and to what extent the outputs of automated procedures can be used as the ground truth later on.
Indeed, current methods for explainability do not tell us much about the decision process of the AI. A saliency map can indicate what the network is using, not how. However, it is still incredibly useful in practice for the medical practitioner, as it allows for easier interpretation of the results.
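To make the "what, not how" point concrete, here is a toy gradient-based saliency map. The model is a deliberately trivial stand-in (a single sigmoid unit with hand-set weights, not a trained network): the saliency of each pixel is the magnitude of the prediction's gradient with respect to that pixel, which highlights where the model looks without explaining its reasoning.

```python
import numpy as np

# For p = sigmoid(sum(w * x)), the gradient saliency of pixel x_ij is
# |dp/dx_ij| = |w_ij| * p * (1 - p).
rng = np.random.default_rng(2)
h, w = 8, 8
weights = np.zeros((h, w))
weights[2:5, 2:5] = 1.0            # the model only "attends" to a 3x3 patch
image = rng.random((h, w))         # synthetic stand-in for a scan

logit = float(np.sum(weights * image))
p = 1.0 / (1.0 + np.exp(-logit))
saliency = np.abs(weights) * p * (1 - p)

# The bright region of the map coincides with the nonzero weights:
# it shows which pixels move the prediction, not why they matter.
print(saliency.round(3))
```

Even in this two-line model the map cannot distinguish a clinically real feature from an artifact that happens to fall inside the attended region, which is exactly the limitation raised at the workshop.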
Doctors need to know when they can trust the outputs, and whether a decision was based on an artifact or on real features. As observed by radiologist Peter Chang, this becomes a key workflow component: MDs need the tools to understand when to override the AI. This process should become quicker and more effective as a consequence of introducing automation, rather than requiring extensive second-guessing of the machine’s motives.
Unknown unknowns and the human factor
Machine learning generates models we often do not fully understand, and coupling them with real, complex systems such as the ones involved in medicine results in a large number of unknown unknowns. While as data scientists we often focus on specific implementation steps and operations, the entire patient + physician + AI system requires constant evaluation and oversight. We need to be wary of hidden assumptions, and there is demand for automated quality control at every level of analysis.
Design and interface are also key for effective real world use. The roles of automation and the human user need to be clearly delineated, and an effective interface provided between the two. It has to be clear to the user how to verify the correctness of results, and how to override them if needed.
It is still unclear how medical errors should be handled legally when the mistakes depend on the AI. Equivocal mistakes, where a human could just as well have been fooled, carry little legal liability; random errors, however, do not. These aspects will be worked out over time, and while not technical in themselves, the regulatory constraints that emerge will ultimately be relevant even for academic research.
We should also distinguish between models and the specific use cases. The same model, with the same outputs, can be used in different ways for different purposes. For example when deciding whether to bump up a patient in the queue it could be reasonable to only do so if the probability of a positive is high, as this operation would push others down the queue. However, when deciding to send patients home it would be desirable to have a larger tolerance and only send home those who are extremely unlikely to be positive for a certain condition. Even though from a ML perspective this would be based on the same model, these different use cases might need separate approval.
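The distinction between one model and its multiple use cases boils down to different operating points on the same score. The sketch below makes that explicit; both function names and both thresholds are hypothetical choices for illustration, not clinically validated values.

```python
def triage_decision(p_positive: float) -> bool:
    """Bump a patient up the queue only when the model is quite
    confident of a positive finding (hypothetical threshold)."""
    return p_positive > 0.90

def discharge_decision(p_positive: float) -> bool:
    """Send a patient home only when a positive finding is extremely
    unlikely (hypothetical, far stricter threshold)."""
    return p_positive < 0.01

# The same score leads to different actions depending on the use case:
p = 0.05
print(triage_decision(p))      # not confident enough to reprioritize
print(discharge_decision(p))   # not confident enough to send home
```

Since each threshold encodes a different risk trade-off, it is natural that regulators may treat the two deployments as separate devices even though the underlying model is identical.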
Traditionally, medical imaging did not focus on reproducible results, but focused on clarity and speed. Most systems were not designed to be quantitative, and because of this imaging data can be very heterogeneous based on hardware vendors and experimental setup.
Speakers also expressed interest in quantifying the number of cases required to effectively train AI networks, and insisted on the need for public datasets. A highlight from the workshop is the Cancer Imaging Archive, providing data from about 47,300 anonymized patients across multiple modalities, divided into 110 collections. While these numbers are still far from classic ML datasets like ImageNet, these are clearly steps in the right direction.
Both standardization and the effective composition of datasets contribute to an algorithm’s robustness. Datasets should be enriched both in terms of biological heterogeneity and technological diversity, including different vendors and acquisition parameters. It is imperative to avoid selection bias by including representative cohorts and harder-to-label samples, so as not to limit the system to the most obvious cases.
Following on the example of the DICOM standard, there is a growing need for open standards to allow comparison and integration between different datasets.
Defining the ground truth is also far from trivial. Radiologists contradict each other, and even themselves, so the problem of establishing ground truth needs to be explicitly addressed.
During this workshop medical practitioners and regulators highlighted a number of reasonable concerns about generalization, interpretability, and user interaction. Most importantly, they provided their input on workflow integration and operating values.