FDA Center For Devices and Radiological Health
This is the computer aided diagnosis workshop, and we are basically here today to hear
from you. Today marks the beginning of FDA's effort to write guidance in this area, and
the first step is that we want to hear from you - particularly your ideas on the relevant
characteristics of these devices and how they should be evaluated. We'll have a number of
speakers this morning and then plenty of time for discussion this afternoon. We're going
to start off with a couple of introductory talks about device regulation and general
software policy, followed by several speakers from the Center who'll be talking on CADx,
and then we'll get to the main part of the program. Obviously we can't cover everything
today, so later on as ideas occur to you, please submit them to us in writing.
Medical devices are regulated under the authority of the medical device amendments to
the Federal Food, Drug, and Cosmetic Act of 1938 [21 Code of Federal Regulations (CFR)].
The 1938 Act required devices to be safe but placed the burden of proof to remove unsafe
products on the government. The 1976 Medical Device Amendments and 1990 Safe Medical
Devices Act established a comprehensive scheme of regulation. They defined the term
"device", provided for classification of all medical devices into three classes,
required device manufacturer registration and listing of products, and set up procedures
for clinical investigations (Investigational Device Exemption -- IDE), premarket
notification (510(k)), and premarket approval (PMA). In addition they included the basic
prohibition on misbranding and adulteration, required adherence to good manufacturing
practices (GMP), and provided for post market surveillance (of selected devices). The
device definition was written with such generality as to include a wide range of
products--including computer software--within its scope: "...an instrument,
apparatus, implement, machine, contrivance, implant, in vitro reagent, or other similar or
related article, including any component, part, or accessory, which is - (1) recognized in
the official National Formulary, or the United States Pharmacopeia, or any supplement to
them, (2) intended for use in the diagnosis of disease or other conditions, or in the
cure, mitigation, treatment, or prevention of disease, in man or other animals, or (3)
intended to affect the structure or any function of the body of man or other animals, and
[which is not a drug]..." [Section 201(h)]. Further information concerning these
regulations can be obtained through the Center's Division of Small Manufacturers
Assistance (DSMA) at 800-638-2041.
An FDA software policy is under development at this time. No such policy currently
exists, although there has been a "draft" policy for several years. With regard
to this policy, the first question normally asked is how software can be a medical device
anyway. The previous speaker gave the answer to that question, indicating that much
software is a device since it is a component or accessory to a device, and other software
would be covered (probably as a "contrivance") under the very broad device
definition. The second question with regard to software policy is why such a policy is
needed. A policy is needed because any device is subject to all of the requirements of the
Federal Food, Drug, and Cosmetic Act as amended, including registration and listing, GMPs, and
premarket review, unless specifically exempted by regulation. If we rigorously applied
these provisions to all software medical devices, it would represent a tremendous burden
on both the Agency and on the medical community. The software policy is risk/exemption
based. We are trying to assess the risks of medical software devices, decide on
appropriate exemptions, and write classification regulations to implement the exemptions.
Thus, our first task is to define criteria for assessing the impact of product failure on
the patient and apply them rationally to known products. Here are some of the criteria
which seem reasonable: (1) Seriousness of the disease to be diagnosed or treated, (2) Time
frame for use of the information, (3) Concordance with accepted medical practice, (4)
Format of data and its presentation, (5) Individualized vs. Aggregate patient care
recommendations, and (6) Clarity of the algorithm. We plan to hold a public
workshop to discuss the policy. A Federal Register notice announcing the workshop and
laying out some of the details of such a policy is under preparation and should be
published in a few months [Registrants at this workshop will be informed of details of the
Software Policy meeting]. Finally, how do CADx devices fit into this picture? CADx systems
are either accessories or have been determined to have significant impact on patient care
and thus need to be regulated via premarket review. Thus, they are at the high impact/risk
end of the medical device software spectrum. The question is not how they should be
regulated, but what sort of information is necessary to make good decisions on clearing
these products.
The Center has begun receiving premarket approval submissions for devices with CADx
features. These are devices which use modern data analysis techniques to carry out some
portion of the decision making process previously provided by the physician or other
health care professionals. Thus, CADx refers to computer-aided-diagnosis devices (decision
support products) and not to computer-aided diagnostic devices (computerized devices used
to provide basic input to the diagnostic process, e.g., CT or MRI systems). Examples of
CADx devices include image analysis products used to identify potential abnormalities, ECG
analysis programs, and in-vitro diagnostic test devices which flag
"out-of-bounds" test results and/or provide some more sophisticated data
synthesis. The Center believes that in order to perform intelligent and timely reviews of
these devices with consistency across product lines, it is appropriate at this time to
develop reviewer guidance for them. To that end the Center has established a CADx Working
Group, composed of reviewers and other technical professionals from all CDRH components.
Today's public workshop is an effort to obtain input from the public at the initial stage
of this project. In particular, we pose two questions to you: (1) How should CADx devices
be categorized?--i.e., What device attributes are relevant to the degree of regulatory
oversight exercised by the Center over a particular CADx device? And (2) What evaluation
methodologies are appropriate to the assessment of the performance of these devices? We
look forward to your comments today and to any written comments you wish to submit to us
in the future.
In 1995, the FDA approved the first two computer-assisted devices for evaluation of
Papanicolaou (Pap) smear slides. These devices are limited to rescreening of Pap smear
slides that have been previously screened by manual microscopy and diagnosed as
negative (within normal limits, WNL). As yet, no computer-assisted devices have been
approved for primary screening of Pap smear slides. FDA regulates computer-assisted Pap
smear readers as in-vitro diagnostic medical devices. FDA's premarket approvals for the
NeoPath, Inc. AutoPap 300 QC Automatic Pap Rescreener System and the Neuromedical Systems,
Inc. PAPNET Testing System were based on the evaluation by FDA staff and a panel of
consultants of the design of each device, and the manufacturers' pre-clinical and clinical
testing data that demonstrated the effectiveness and safety of the devices for their
intended use and intended populations for use. The approved intended uses and indications
for use are published in the FDA-approved package insert for each device.
The purpose of this talk is to briefly discuss evaluations of diagnostic cardiovascular
devices, and to point out areas of concern. Electrocardiographs with computer
interpretation are devices that acquire the diagnostic-quality electrocardiogram (ECG),
extract various measurements or features from the signal, and apply those features to some
deterministic or probabilistic decision-making algorithm to arrive at an interpretation.
If the algorithm uses patient information and traditional ECG measurements, and the output
is overread by an appropriate physician, then we would be less concerned with algorithm
performance. Since the devices are used by the general clinical community, however, we
routinely request performance statistics for each possible interpretation. In rare cases,
when we have determined that stand-alone software packages are not accessories to other
classified devices, we have also exempted the devices from premarket notification. In addition
to devices that mimic clinical decision making, the Division also conducts reviews of
devices with advanced digital signal processing of the ECG. Heart rate variability
analysis is illustrative of how further processing of the data elicits further
consideration of these submissions. These issues involve not only clinical application of
the information but also providing the user with an understanding of how the data were
generated. If the measurements have some basis in traditional electrocardiography, and a
reasonable approach to validation is taken to document the reliability of the information,
then we are likely to continue to clear devices for market, with restricted labeling,
until such time when a manufacturer is able to provide clinical data to support specific
diagnostic indications. This strategy may not completely alleviate our concerns about
potentially unreliable data that cannot be verified by the user, or about the impact of
those data on clinical decision making and, ultimately, on patient safety.
Computer-aided Diagnostic (CADx) devices will be an indispensable part of the future
practice of clinical medicine. Computer-aided diagnostic devices must be distinguished
from computer-based diagnostic devices. A diagnosis is a prediction. An important
implication of treating CADx devices as prediction devices is that CADx software is assessed
in terms of its accuracy, not its efficacy. There are at least three prediction methods: statistical,
expert systems, and empirical formulae. Each method requires a somewhat different
evaluation approach. Prediction methods can be either general methods applied to medical
problems, or unique-to-the-medical-problem methods. All three methods can be applied to:
(1) the generation and analysis of diagnostic test information (including laboratory tests
such as a SMAC or genetic screening, functional tests such as an ECG, and radiographic
tests such as CT, MRI, mammogram) and (2) the integration of diagnostic information. Three
device categories can be defined: devices marketed to the public, devices marketed to
physicians involving peer-reviewed statistical methods, and devices marketed to physicians
that have not been peer-reviewed or devices that involve expert system methods. The
creation of CADx guidelines is currently being performed by an internal FDA CADx Working
Group. In order to obtain (i) a non-regulatory perspective and (ii) additional CADx
expertise, non-FDA CADx experts (who are not currently associated with a CADx device)
should be invited to join the CADx Working Group. The view that the FDA is an obstacle to
innovation in medicine may no longer be correct. In the CADx domain it may be that rather
than trying to protect everyone from everything, the FDA is adopting the view that its job
is to make sure that companies that wish to market CADx devices to physicians provide: (i)
the FDA with sufficient information so that it can determine that the device meets its
functional and accuracy claims and (ii) physicians with sufficient information so that
they can determine whether the device will be medically useful in their specific clinical
situation.
I propose that experts with potential conflicts be included in the review process.
Their potential conflict status should be considered when the panel makes the final
advisory decision; however, their educated and experienced opinions should be utilized to
the fullest. Too much is at stake to lose the expertise of highly qualified individuals
merely because they have perhaps in the past represented industrial developers with a
financial interest. This applies to all developers and any consultants who are working
toward a common goal. The American public will have confidence in FDA scientists and their
consultants if they adhere to principles of scientific integrity including full
disclosure, and all will benefit from these devices, especially the patients for whom they
are designed.
Quantitative analyses of the electroencephalogram (EEG) have been available for over
50 years. During the past several decades, researchers at New York University Medical
Center's Brain Research Laboratories have developed the neurometric method of analysis of
the EEG. Digitized EEG recordings are subjected to a Fast Fourier Transform to extract
information on power, frequency, and phase. These measures are log-transformed to
approximate gaussianity, age-regressed to account for variations in EEG variable
distribution as a function of age, and compared to an extensive normative database to
derive Z-score estimates of deviation from normal. The Z-score matrix provided by the
1200+ extracted variables is subjected to a multivariate analysis that corrects for
intercorrelations between and within measures to provide accurate estimates of the
difference between patient Z-score values and those of the normal population. Discriminant
analyses are used to identify variables that contribute to differentiating the patient
from normal (normal vs. abnormal comparison), and correlate the profile with that of
various empirically defined clinical groups. The likelihood that the profile matches
profiles of groups consisting of individuals with known disorders is stated in
probabilistic terms. Test sensitivity and specificity are evaluated by using ROC curves.
Statistical tables summarize the results of the analysis. Data are further transformed into
topographic maps that visually depict the extent of the deviation of the patient from the
normal reference group. The neurometric method is based on widely accepted statistical
procedures, and has been replicated in a variety of laboratories around the world.
Neurometrics provides an empirical test of brain function and structure, and is useful as
a diagnostic aid for patient evaluation, treatment planning, and treatment monitoring,
enhancing the quality of patient care in psychiatry, neurology, and related disciplines.
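The neurometric steps described above (spectral analysis, log transformation, age
regression, and Z-scoring against a normative database) can be sketched roughly as
follows. This is only a minimal illustration in Python, not the laboratory's actual
implementation: a naive discrete Fourier transform stands in for the FFT, the multivariate
and discriminant steps are omitted, and the function names, the 8-13 Hz band, the synthetic
signal, and the normative intercept, slope, and standard deviation are all hypothetical
values chosen for the example.

```python
import cmath
import math


def power_spectrum(samples):
    """Power at each non-negative frequency bin, via a naive DFT.

    A real system would use an FFT; the result is the same, only slower.
    """
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def band_power(samples, sample_rate_hz, low_hz, high_hz):
    """Sum the power of all bins whose frequency lies in [low_hz, high_hz)."""
    n = len(samples)
    return sum(power
               for k, power in enumerate(power_spectrum(samples))
               if low_hz <= k * sample_rate_hz / n < high_hz)


def z_score(log_power, age_years, norm):
    """Deviation from the age-expected value, in normative standard deviations.

    `norm` is a hypothetical normative-database entry: the expected log power
    is modeled as intercept + slope * log(age).
    """
    expected = norm["intercept"] + norm["slope"] * math.log(age_years)
    return (log_power - expected) / norm["sd"]


# One second of a synthetic 10 Hz rhythm sampled at 128 Hz (assumed values).
SAMPLE_RATE_HZ = 128
signal = [math.sin(2 * math.pi * 10 * t / SAMPLE_RATE_HZ)
          for t in range(SAMPLE_RATE_HZ)]

alpha_power = band_power(signal, SAMPLE_RATE_HZ, 8, 13)
hypothetical_norm = {"intercept": 4.0, "slope": -0.2, "sd": 0.5}
alpha_z = z_score(math.log(alpha_power), age_years=30, norm=hypothetical_norm)
```

In a full analysis one such Z-score would be computed for every extracted variable, and
the resulting matrix would then go on to the multivariate and discriminant steps.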
With the rapid expansion in the speed and capabilities of computer software and
hardware, we are now beginning to have the capabilities of simulating some of the human
decision processes involved in differential diagnosis and image pattern recognition. While
certainly issues of validation and hazard analysis of software systems intended for human
diagnosis are a significant part of the evaluation process, the fundamental part of
assessing effectiveness of such devices will remain the clinical evaluation of the
accuracy of the classification process the device claims to perform. The behavior of the
binomial distribution, and the associated issues of statistical rigor in experimental
design, will be absolutely critical in understanding the procedures for performance
evaluation of computer-assisted diagnosis devices. The binomial distribution dictates how
devices which classify patients (images, samples, assays etc.) into one of two categories
(normal/abnormal) will behave, and this behavior can often be counter-intuitive.
Evaluating such a device requires particularly careful attention to the standard clinical
design issues of poolability, cross-over evaluations on the same samples/patients with and
without the device in place in the diagnostic process, statement of all assumptions
involved in testing, statement of the correct hypotheses, collection of correctly random
and unbiased samples from the specified target population, and separation of performance
measures into those independent of prevalence assumptions and those which explicitly or
implicitly depend on prevalence (such as predictive value). Most important of all,
perhaps, is the need to include the entire range of difficulties of classification in
the evaluation process. None of these factors can be ignored in designing or reviewing
submissions on such classification devices.
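Two of the points above lend themselves to a short numerical sketch: the uncertainty that
the binomial distribution attaches to an observed sensitivity, and the dependence of
predictive value on prevalence. The Python below is an illustration under stated
assumptions, not a recommended review procedure. The Wilson score interval is one common
choice of binomial confidence interval and is not named in the talk, and the counts and
prevalence values are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials
                               + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Hypothetical study: 45 of 50 abnormal cases detected (observed sensitivity 0.90).
sens_low, sens_high = wilson_interval(45, 50)

# Same sensitivity and specificity, two hypothetical prevalences.
ppv_enriched, _ = predictive_values(0.90, 0.90, prevalence=0.50)
ppv_screening, _ = predictive_values(0.90, 0.90, prevalence=0.01)
```

With only 50 abnormal cases the interval around the observed sensitivity is wide, and the
same sensitivity and specificity give a much lower positive predictive value at screening
prevalence than in an enriched sample. That is the kind of counter-intuitive behavior the
paragraph above warns about.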
NeoPath, Inc. has been engaged for six years in the development of an automated
cytological screener for the analysis of Pap smears. Last year, the NeoPath AutoPap 300 QC
System was granted PMA approval following the methods presented below. It is NeoPath's
position that the basic clinical testing of a cytological screening device must include
well-controlled, scientifically valid studies to establish a quantitative baseline of
device performance. This baseline provides a means for FDA reviewers to assess the initial
safety and efficacy of a device as well as evaluate device enhancements and future
devices. For either primary screening or QC rescreening of Pap smears in accord with the
Bethesda system of cervical cytology, four interlocking studies are needed: prospective
intended use, historical and current sensitivity, multi-run precision-reproducibility, and
historical consistency.
The incidence of and mortality from invasive cervical cancer have been increasing in the
United States since 1986 in the purportedly well-screened population of white women under
50 years of age. This disturbing trend is thought to be attributable, at least in part, to
the spread of the human papillomavirus (HPV), which has now reached near-epidemic
proportions in young women throughout the world. Thus, the factors contributing to the
development of cervical cancer are apparently so widespread that more women are developing
this preventable cancer despite screening. In addition, there are estimated to be over 50
million cervical smear tests performed each year in the United States. Therefore, it is
paramount that FDA assure that any new automated device to be used as a substitute for
conventional microscopic screening be very rigorously tested to assure that even rare
cytopathology or unusual presentations of abnormalities are detected, as even
"rare" cases can affect tens of thousands of American women at a national level.
There are three primary degrees of freedom that must be considered when assuring that all
presentations have been sampled and included in the clinical trial: (1) Diagnostic
variations (all categories of The Bethesda System, including various types of
adenocarcinomas); (2) Patient variations (prevalence of abnormal cells, size of abnormal
cells, smear patterns - must include various patient demographics); (3) Laboratory
variations (staining color and intensity, coverslip bubbles, artifacts - must include a
wide variety of laboratories). The clinical trial should simulate the device's intended
use as closely as possible. In addition, in terms of establishing standards for comparison
with conventional screening, bias should be minimized by utilizing historic screening
records and applying exhaustive microscopic searching and automated rescreeners to ensure
that no significant abnormality is missed by the substitutive test. In conclusion, given
the public health threat represented by the HPV epidemic, the rise in incidence in some
populations and the potential for prevention of cervical cancer, the objective of
automated cervical smear screening should be increasing the accuracy of the test, and not
serving as a labor substitute at the expense of sensitivity.
Reasonable standards have already been established for clinical trials for medical
devices within the FDA and the medical community. CADx Pap smear medical devices are
essentially no different from other devices, and careful trial design should be followed.
In general, there are four issues that need to be addressed. First, the device must be
tested in its intended use. This means that levels of disease prevalence used in the trial
should reflect prevalence in routine use. Too high a level of disease in a trial can
affect vigilance of the participants. Second, a reference standard must be established.
Essentially, we want to compare the discriminatory level of a Pap smear screening device
to that of humans. Using standards, one can evaluate sensitivity and specificity,
preferably using an analytical method such as receiver operating characteristic (ROC) curves.
Such performance standards and comparisons are necessary for educating potential users in
how a Pap smear device may work in their laboratories. It will also help users to compare
the device to other alternatives; for example it may be desirable to compare the accuracy
and cost effectiveness of a double screening by humans to the combined use of humans and a
machine. Third, given the subjective nature of cytology, and the difficulty with
borderline diagnoses, a method of adjudicating the difference between the reference and
the CADx result must be developed. There are many methods that can be applied to the Pap
smear, and any one of these should prove acceptable. They include the use of an
independent pathologist, a panel review, biopsy, human papillomavirus (HPV) testing, and
patient follow-up. Finally, in a trial, vigilance must be controlled since taking part in
any trial elevates one's attention and performance. This demands the use of a two-armed
clinical trial so that both the CADx arm and the standard (e.g., human) arm have elevated
levels of vigilance. One should also consider how vigilance might be raised, or even
lowered, in actual use of a CADx system. In summary, careful clinical
trial design is critical for evaluating Pap smear methodology and the results of trials
should be presented in such a way that the potential users of a system will be able to
comprehend potential performance in their own laboratories.
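The ROC analysis mentioned above can be sketched in a few lines. The Python below is a
minimal illustration, assuming each slide receives a numeric abnormality score and a
binary reference-standard label; the function names and the six example scores are
hypothetical, and a real study would also need confidence intervals and an adjudicated
reference standard as described in the talk.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    `labels` holds 1 for reference-standard abnormal and 0 for normal; a case is
    called positive when its score is at or above the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one abnormal and one normal case")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels)
                       if s >= threshold and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels)
                        if s >= threshold and y == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points (0.5 is chance, 1.0 is perfect)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores for three abnormal (1) and three normal (0) slides.
example_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
example_labels = [1, 1, 0, 1, 0, 0]
example_points = roc_points(example_scores, example_labels)
example_auc = area_under_curve(example_points)
```

Running the same calculation for the human arm and the CADx arm on the same slides gives
two curves that can be compared directly.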
The main points I would like to leave with you are: (1) Discriminating power is the
underlying measure of performance of a CADx device. (2) At a minimum, a single measurement
of both sensitivity AND specificity is necessary to establish discriminating power. This
would also allow an ROC analysis to be conducted. A well-designed study should also
generate sensitivity AND specificity results for a human reviewing the same material
without the CADx device. This would yield two-armed results and should form the basis for
the product's evaluation. (3) If a CADx device, by itself, has greater discriminating
power than a human, then approval should be forthcoming. (4) If a CADx device does not, by
itself, have greater discriminating power than a human, then this information should be
made clear in the labeling. Without this caution clearly in the labeling, users WILL
mistakenly assume such a device is better than a human--after all the FDA approved it. (5)
A device with lower discriminating power may still provide benefit if it is cheaper than a
human, and is used for back-up purposes only. The FDA should make information available to
allow a user to calculate cost-effectiveness. This information is either the underlying
discriminating power or the sensitivity/specificity results.
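A two-armed comparison of the kind proposed in point (2) might be tabulated as in the
following Python sketch. The counts are hypothetical, and Youden's J (sensitivity plus
specificity minus one) is used here only as one possible single-number summary of
discriminating power; the talk does not specify a particular index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmResult:
    """Confusion counts for one study arm, judged against the reference standard."""
    true_pos: int
    false_neg: int
    true_neg: int
    false_pos: int

    @property
    def sensitivity(self):
        return self.true_pos / (self.true_pos + self.false_neg)

    @property
    def specificity(self):
        return self.true_neg / (self.true_neg + self.false_pos)

    @property
    def youden_j(self):
        """Sensitivity + specificity - 1: 0 means chance, 1 means perfect."""
        return self.sensitivity + self.specificity - 1


def compare_arms(device, human):
    """Differences (device minus human) on the same reviewed material."""
    return {
        "sensitivity_diff": device.sensitivity - human.sensitivity,
        "specificity_diff": device.specificity - human.specificity,
        "youden_j_diff": device.youden_j - human.youden_j,
    }


# Hypothetical counts: 100 abnormal and 900 normal cases in each arm.
device_arm = ArmResult(true_pos=85, false_neg=15, true_neg=810, false_pos=90)
human_arm = ArmResult(true_pos=80, false_neg=20, true_neg=855, false_pos=45)
differences = compare_arms(device_arm, human_arm)
```

In this made-up example the device is more sensitive but less specific than the human
reader, and the summary index is the same for both. This is why sensitivity AND
specificity both need to be reported rather than either one alone.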
As one of the nation's leading providers of cervical cytology testing services, Corning
Clinical Laboratories has been working with each of the developers of new technologies
that promise to improve Pap testing accuracy. There is a risk, however, that vigorous
marketing by these developers and media exposure will create pressure to adopt these new
technologies before problems inherent in their use are fully resolved and before complete
data regarding their efficacy is available. In particular, our concerns include (1) likely
loss of positive predictive value of abnormal results during a months- or years-long
pathologist and cytotechnologist "learning curve" (the two recently FDA-approved
devices, PapNet and AutoPap, achieve higher sensitivity by "flagging" cells or
slides for manual re-review, leading possibly to negative cases being misinterpreted as
abnormal because of device-created biases), (2) possible loss of situation awareness among
those who read Pap smear slides and who may develop complacency and decreased detection
rates, (3) the risk that new standards of care will be created "by default" in
the face of a dearth of clinical outcome studies, and (4) a difficult post-FDA-approval
period marked by unclear regulations, a lack of reporting format standards, and
inconsistent reimbursement policies. Perhaps public health interests, together with the
very near approach by some of these new technologies to actual diagnostic processes,
warrant an FDA paradigm shift, namely to require technology assessment that goes beyond the normal
purview. For example, the FDA could require post-approval market surveillance that
includes rigorous training requirements and measurement of training outcomes, and
post-approval clinical specificity studies. Other agencies and professional organizations
could also play a more active role than they have so far. These concerns
notwithstanding, our company believes the FDA-approved technologies plus others still in
development promise to significantly improve the accuracy of cervical cytology testing.
The primary goal of the Pap screening test is to eliminate death and suffering that
result from invasive cancer of the cervix, at an acceptable cost. This is accomplished by
identifying and treating pre-invasive cancerous lesions. Other uses of the Pap test, such
as the detection of ovarian cancer, sexually transmitted diseases, etc., are, at best,
secondary goals. The sensitivity of the Pap test to the detection of STDs and other
conditions is poor. The performance of the existing Pap test should be well understood by
those who assess a computer assisted Pap test. The conventional Pap test leads to
treatment of very many women for conditions which, if left untreated, would never develop
into invasive cancer. For every woman with a truly pre-cancerous lesion, at least 30 and
possibly more than 50 receive treatment. Computer assisted diagnostic Pap screeners should
be assessed in the context of an objective assessment of the current conventional system.
For an imperfect test, accuracy is most completely characterized by the ROC
(receiver operating characteristic) curve. The test accuracy describes the ability of the
test to separate overlapping true positive and true negative populations. The notion of
positive predictive value combines test accuracy with disease incidence. For the Pap test
process to achieve a positive predictive value of only 10% would require test accuracy
corresponding to a separation of the true positive and true negative populations by 5-6
standard deviations. This problem is fundamental to the current Pap test. It does not
result from the occasional failure of a screener to detect a "needle in a
haystack;" it results from the facts that few "haystacks with needles" have
the potential to become invasive cervical cancer, and we can't tell which ones they are
with the current approach.
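The relationship among population separation, disease frequency, and positive predictive
value can be explored with a simple model. The Python sketch below assumes an
equal-variance binormal model (true negatives distributed as N(0, 1), true positives as
N(d, 1)) with the decision threshold placed midway between the two means. The model, the
threshold rule, and the prevalence of 0.001 are assumptions made for illustration and are
not taken from the talk, so the sketch does not attempt to reproduce the 5-6 standard
deviation figure quoted above.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()


def binormal_ppv(separation_sd, threshold_sd, prevalence):
    """Positive predictive value under an equal-variance binormal model.

    Negatives ~ N(0, 1), positives ~ N(separation_sd, 1); a case is called
    positive when its value exceeds threshold_sd.
    """
    specificity = _STANDARD_NORMAL.cdf(threshold_sd)
    sensitivity = 1.0 - _STANDARD_NORMAL.cdf(threshold_sd - separation_sd)
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical prevalence of the condition the test is really meant to find.
HYPOTHETICAL_PREVALENCE = 0.001

# PPV for separations of 1 to 6 standard deviations, threshold at the midpoint.
ppv_by_separation = {
    d: binormal_ppv(d, threshold_sd=d / 2, prevalence=HYPOTHETICAL_PREVALENCE)
    for d in (1, 2, 3, 4, 5, 6)
}
```

Under these assumptions the positive predictive value rises with separation but stays
low until the two populations are many standard deviations apart, which is the
qualitative point made in the talk.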
My presentation will focus on evaluating the quality and utility of digital images. I
will summarize some of the principles developed in ongoing collaborations with Stanford
colleagues Robert M. Gray, PhD, Professor of Electrical Engineering; Debra Ikeda, MD,
Section Chief of Breast Imaging; and many others. Our research involves compression and
enhancement of digital medical images and the applications of these technologies to
computer-aided diagnosis. We study CT images of the lung and mediastinum, MR chest images
taken for the purpose of measuring major vessels in the chest, and many aspects of
mammography. However, when computational interventions affect what radiologists see, it is
imperative that these interventions be evaluated by carefully designed clinical
experiments. Experimental protocols should simulate ordinary clinical practice to the
extent possible. A nearly full range of examples should be included. Findings should be
reportable using the American College of Radiology Standardized Lexicon. Statistical
analyses should be based upon assumptions that are faithful to the clinical scenario and
tasks. The numbers of studies and radiologists should be sufficient to ensure satisfactory
size and power for the principal statistical tests of interest. "Gold standards"
must be defined clearly and be consistent with experimental hypotheses. Sources of bias
should be recognized and minimized. To the extent possible, I will deal with all these
issues in my 10 minutes, not least some statistical techniques that we feel are
particularly relevant here.
Mammography has become the standard for detection of early, more curable breast cancer.
Increasing numbers of mammograms are being performed, and reading screening mammograms is
a repetitive task that requires high attention to minute detail. While mammography is the
best method for early breast cancer detection, radiologists interpreting the mammograms
are fallible, and an estimated 30% of breast cancers are present but missed on mammograms.
A second human observer can detect up to 15% more cancers, but having a second reader is
time-consuming and probably impractical, being done in only about 5% of practices in the
US. This type of problem is one that lends itself to automation, through computer-aided
diagnosis. The detection process of flagging potential abnormalities for the radiologist
can be accomplished by CAD, using digitized mammograms. CAD can be defined as a diagnosis
made by a radiologist using computer output to improve his or her decision, with a goal of
making radiographic interpretation easier and more accurate. Work over the last decade has
developed mammographic CAD programs to a level where about 85% of breast cancers are
detected by the computer, at a reasonably low false positive rate of 1 or 2 per image. The
detection programs work for both calcifications and masses, the two prime signs of breast
cancer on mammogram. At the University of Chicago, we have been running CAD in our
clinical mammography area for over a year, on more than 5,000 mammograms. Analysis of the
first 1,149 patients shows that CAD performed as expected, identifying 86% of the
screening-detected cancers. We have also been greatly encouraged by our studies showing
that retrospective CAD can correctly identify approximately 50% of lesions clinically
missed by radiologists (observation errors). The current level of development is
appropriate for clinical introduction, acting as a second reader to aid the radiologist,
who retains the final decision on whether or not potential areas on the mammogram are
suspicious enough to warrant further work up. I believe that introduction of technology of
this type is inevitable, as the results to date have been very promising. Radiologist
access to CAD in the clinical setting should act to significantly improve patient care.
Many different types of systems are categorized as "computer aided diagnosis
(CADx)." The essential characteristics of these systems that have relevance for
assessments of safety and effectiveness can be analyzed into three groups: (1) type of
system design, of which three are identified here; (2) type of information base
incorporated in the system, of which three are also identified, these three not having a
one-to-one correspondence with the three types of system design; and (3) type, or level
of certainty, of system output, of which four are identified, again without any direct
correspondence to the preceding six classes. Within each of the above three groups, a
hierarchy of the types is identified, each level having more serious implications for
evaluation of safety and effectiveness than the preceding. Thus, there is, hypothetically
at least, the possibility of 36 (3 x 3 x 4) distinct combinations of the levels of the three groups of
essential characteristics within the set of CADx systems, each having a different
convolution of the challenges and opportunities for assessment of safety and effectiveness
represented within the groups. This suggests that CADx systems cannot be thought of as a
single type of entity and that a single regulatory policy cannot be successful for all.
Rather, a regulatory scheme that recognizes each of the identified subgroups and
establishes policies for them which account for the various ways in which they may be
combined is required. Some systems will require extensive clinical testing. Others may be
fully evaluated through engineering testing alone.
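The count of 36 follows from taking one level from each of the three groups. As a minimal sketch, the enumeration could look like the following in Python. The level labels are placeholders invented for illustration, since this summary gives only the number of levels in each group and not their names.

```python
from itertools import product

# Placeholder labels (assumptions): the summary states only that there are
# 3 design types, 3 information-base types, and 4 output types, listed here
# from least to most serious implications for evaluation.
SYSTEM_DESIGNS = ["design-1", "design-2", "design-3"]
INFORMATION_BASES = ["info-base-1", "info-base-2", "info-base-3"]
OUTPUT_TYPES = ["output-1", "output-2", "output-3", "output-4"]


def enumerate_profiles():
    """Return every (design, information base, output) combination."""
    return list(product(SYSTEM_DESIGNS, INFORMATION_BASES, OUTPUT_TYPES))


profiles = enumerate_profiles()
print(len(profiles))  # 3 * 3 * 4 = 36
```

Each tuple stands for one hypothetical class of CADx system to which a regulatory policy would need to apply.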
1. Richard Eaton of NEMA presented a list of questions from his organization:
   (i) General questions and issues: How do computer-aided devices differ from
   computer-controlled devices? Are there additional requirements for 510(k) applications?
   What 510(k) and postmarket surveillance requirements will be associated with these
   types of devices? Are there different levels of concern for each of these types of
   devices? Will recalls be required if there is a "glitch" in the software? How do user
   errors influence regulation of these devices?
   (ii) Issues pertaining to transmission of data over a line: We have concerns over what
   happens when data is sent over a line. How is validation handled when data is sent over
   a line, as opposed to on-site validation? What about patient confidentiality issues?
   (iii) Issues relating to a device "acting as a physician": Will a duplicate diagnosis
   by a physician be needed if the device itself "is acting as a physician" and thus
   renders a diagnosis?
   (iv) Sufficiency of electronic signatures: Is there an "equivalent" of the doctor's
   signature, an "electronic OK," that is needed before the diagnosis data can be
   transmitted across the lines?
   (v) Effect of FDA approval of a class III device on the manufacturer's product
   liability exposure, and use of a favorable decision as a defense: If FDA determines
   that a computer-aided device is a class III device, and thus that a PMA is required,
   would FDA approval of an application reduce the manufacturer's product liability
   exposure, such that the approval could be used at least as a partial defense to an
   action against the manufacturer?
2. Several participants said that they would like an explanation of how a CADx
decision was reached. This would better allow the physician to judge the reliability of
the CADx output, which would otherwise simply emerge from a "black box."
Others, however, indicated that this view is overly simplistic: for any difficult problem,
it is very hard to provide a simple explanation of the CADx system's "reasoning." It
was further suggested that what the user of a CADx system really needs is an indication
of the consistency and accuracy of the diagnostic information, not a description of how
the decision was reached.
3. Numerous speakers cited the usefulness and validity of receiver operating
characteristic (ROC) analysis. The ROC curve is a plot of the variation of the true
positive fraction as a function of the false positive fraction (sensitivity vs. one minus
specificity). The ROC curve is obtained by varying the threshold criterion for deciding
between positive and negative diagnoses from more conservative to less conservative. It
therefore includes information on all system operating points (sensitivity/specificity
pairs) and is independent of disease prevalence. A particular benefit of the method is
that it allows the separation of technology assessment from practice-of-medicine issues.
One participant was concerned about the Gaussian assumptions made by many ROC analysis
programs, although these are not fundamental to ROC theory. Discussion also ensued
concerning the variability of human observer performance and the difficulty this causes
for the evaluation of a machine "observer." A further complication was raised: diagnostic
tasks are typically not binary, as conventional ROC analysis requires, but have multiple
possible outcomes (diagnoses).
4. The suggestion was made that the evaluation of commercial CADx devices should
resemble the scientific peer review process. The machine algorithm and
representative data should be available for outside professionals to carry out
disinterested confirmation of the manufacturer's results. The fear of compromising trade
secrets or other proprietary information seemed to temper the enthusiasm of commercial
participants for this suggestion.
5. Questions were raised concerning the availability of guidance on other software
matters. It was noted that in addition to the overall software policy, there are Center
groups considering policy with regard to commercial off-the-shelf (COTS) software and
developing design control guidance as part of the GMP revision efforts. All of these
efforts will be soliciting public comment.
6. It was noted that CADx algorithms may be very sensitive to the particular sensors
used in obtaining training data. Great care must be exercised in determining the range of
input sensors for which the device functions accurately. Furthermore, it was noted that
often in the evaluation of CADx devices there is a commingling of the training and testing
sets. This must be avoided in order to obtain an unbiased performance estimate.
7. Compression was mentioned as a source of performance degradation for CADx devices.
When large data sets are needed, (lossy) compression may be required. Its effect must be
examined carefully.
8. One participant noted that a liberal interpretation of the medical device definition
would result in clinical guidelines being considered medical devices. Despite their
ubiquitous presence in the field, very few such guidelines have been properly validated.
9. The presentation of CADx results and the labeling of CADx devices in terms of
probabilities was discussed. This was felt by many participants to be desirable; however,
it was suggested that the clinician "user" population was not sufficiently
sophisticated to understand data presented in that way.
10. The problem of CADx false positives was raised. CADx "attention getting"
systems typically point to many areas where no abnormality exists. This was felt to be a
natural attribute of these systems and an aspect which should be addressed through user
training and experience. As long as these systems are only "aiding" in the
diagnosis, they should not be held to the same standards as a device actually making the
diagnosis.
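Two of the points above lend themselves to a short illustration: the construction of an ROC curve by sweeping the decision threshold (item 3) and the need to keep training and testing cases separate (item 6). The following Python sketch is an empirical (nonparametric) version, so it makes no Gaussian assumption. The function names, the scores, the labels, and the case identifiers are invented for illustration and do not come from any system discussed at the workshop.

```python
def roc_points(scores, labels):
    """Return (false positive fraction, true positive fraction) pairs.

    scores: CADx output per case; a higher value is assumed to mean
            "more suspicious".
    labels: 1 for truly diseased cases, 0 for truly normal cases.
    The threshold is swept from the strictest setting to the most lenient.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one diseased and one normal case")

    points = [(0.0, 0.0)]  # strictest threshold: nothing is called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the empirical ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def check_no_overlap(train_ids, test_ids):
    """Raise if any case appears in both the training and the testing set."""
    shared = set(train_ids) & set(test_ids)
    if shared:
        raise ValueError(f"cases used for both training and testing: {sorted(shared)}")


# Made-up test cases: four diseased (label 1) and four normal (label 0).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

check_no_overlap(train_ids=["case-01", "case-02"], test_ids=["case-03", "case-04"])
curve = roc_points(scores, labels)
print(curve)
print(area_under_curve(curve))

# Tripling the number of normal cases changes the prevalence but not the curve.
more_scores = scores + [0.7, 0.55, 0.3, 0.2] * 2
more_labels = labels + [0] * 8
print(roc_points(more_scores, more_labels) == curve)
```

Because each axis is normalized by its own class size, adding more normal cases with the same score distribution leaves the curve unchanged. This is the sense in which the ROC curve is independent of disease prevalence.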
Thank you for participating in the computer-aided diagnosis device workshop. We have
heard today a few ideas on the categories of CADx devices and even more on evaluation
methods, especially the use of the receiver operating characteristic curve. Further
written comments are solicited and may be faxed to us at (301) 443-9101. This input will
aid us in preparing reviewer guidance for the premarket clearance of these devices. As a
reminder to the speakers, please give me a copy of your overheads and, if possible, your
talk. I will submit these to Dockets Management for docket number 95N-0363, where they
will be available to the public. If the speakers will also provide me with a brief
summary of their talks, we will compile a meeting summary and mail it to everyone who
registered for this workshop. The summary will also be available on the
World Wide Web at the URL http://www.fda.gov.
(March 6, 1996)