In medicine, the cautionary tales about the unintended effects of artificial intelligence are already legendary.
There was the program meant to predict when patients would develop sepsis, a deadly bloodstream infection, that triggered a litany of false alarms. Another, intended to improve follow-up care for the sickest patients, appeared to deepen troubling health disparities.
Wary of such flaws, physicians have kept A.I. working on the sidelines: assisting as a scribe, as a casual second opinion and as a back-office organizer. But the field has gained investment and momentum for uses in medicine and beyond.
Within the Food and Drug Administration, which plays a key role in approving new medical products, A.I. is a hot topic. It is helping to discover new drugs. It could pinpoint unexpected side effects. And it is even being discussed as an aid to staff who are overwhelmed with repetitive, rote tasks.
Yet in one crucial way, the F.D.A.’s role has been subject to sharp criticism: how carefully it vets and describes the programs it approves to help doctors detect everything from tumors to blood clots to collapsed lungs.
“We’re going to have a lot of choices. It’s exciting,” Dr. Jesse Ehrenfeld, president of the American Medical Association, a leading doctors’ lobbying group, said in an interview. “But if physicians are going to incorporate these things into their workflow, if they’re going to pay for them and if they’re going to use them — we’re going to have to have some confidence that these tools work.”
From doctors’ offices to the White House and Congress, the rise of A.I. has elicited calls for heightened scrutiny. No single agency governs the entire landscape. Senator Chuck Schumer, Democrat of New York and the majority leader, summoned tech executives to Capitol Hill in September to discuss ways to nurture the field and also identify pitfalls.
Google has already drawn attention from Congress with its pilot of a new chatbot for health workers. Called Med-PaLM 2, it is designed to answer medical questions, but has raised concerns about patient privacy and informed consent.
How the F.D.A. will oversee such “large language models,” or programs that mimic expert advisers, is just one area where the agency lags behind rapidly evolving advances in the A.I. field. Agency officials have only begun to talk about reviewing technology that would continue to “learn” as it processes thousands of diagnostic scans. And the agency’s existing rules encourage developers to focus on one problem at a time, like a heart murmur or a brain aneurysm, in contrast to A.I. tools used in Europe that scan for a range of problems.
The agency’s reach is limited to products being approved for sale. It has no authority over programs that health systems build and use internally. Large health systems like Stanford, Mayo Clinic and Duke — as well as health insurers — can build their own A.I. tools that affect care and coverage decisions for thousands of patients with little to no direct government oversight.
Still, doctors are raising more questions as they attempt to deploy the roughly 350 software tools that the F.D.A. has cleared to help detect clots, tumors or a hole in the lung. They have found few answers to basic questions: How was the program built? How many people was it tested on? Is it likely to identify something a typical doctor would miss?
The lack of publicly available information, perhaps paradoxical in a realm replete with data, is causing doctors to hang back, wary that technology that sounds exciting can lead patients down a path to more biopsies, higher medical bills and toxic drugs without significantly improving care.
Dr. Eric Topol, author of a book on A.I. in medicine, is a nearly unflappable optimist about the technology’s potential. But he said the F.D.A. had fumbled by allowing A.I. developers to keep their “secret sauce” under wraps and failing to require careful studies to assess any meaningful benefits.
“You have to have really compelling, great data to change medical practice and to exude confidence that this is the way to go,” said Dr. Topol, executive vice president of Scripps Research in San Diego. Instead, he added, the F.D.A. has allowed “shortcuts.”
Large studies are beginning to tell more of the story: One found benefits in using A.I. to detect breast cancer, and another highlighted flaws in an app meant to identify skin cancer, Dr. Topol said.
Dr. Jeffrey Shuren, the chief of the F.D.A.’s medical device division, has acknowledged the need for continuing efforts to ensure that A.I. programs deliver on their promises after his division clears them. While drugs and some devices are tested on patients before approval, the same is not typically required of A.I. software programs.
One new approach could be building labs where developers could access vast amounts of data and build or test A.I. programs, Dr. Shuren said during the National Organization for Rare Disorders conference on Oct. 16.
“If we really want to assure that right balance, we’re going to have to change federal law, because the framework in place for us to use for these technologies is almost 50 years old,” Dr. Shuren said. “It really was not designed for A.I.”
Other forces complicate efforts to adapt machine learning for major hospital and health networks. Software systems don’t talk to each other. No one agrees on who should pay for them.
By one estimate, about 30 percent of radiologists (a field in which A.I. has made deep inroads) are using A.I. technology. Simple tools that might sharpen an image are an easy sell. But higher-risk ones, like those selecting whose brain scans should be given priority, concern doctors if they do not know, for instance, whether the program was trained to catch the maladies of a 19-year-old versus a 90-year-old.
Aware of such flaws, Dr. Nina Kottler is leading a multiyear, multimillion-dollar effort to vet A.I. programs. She is the chief medical officer for clinical A.I. at Radiology Partners, a Los Angeles-based practice that reads roughly 50 million scans annually for about 3,200 hospitals, free-standing emergency rooms and imaging centers in the United States.
She knew diving into A.I. would be delicate with the practice’s 3,600 radiologists. After all, Geoffrey Hinton, known as the “godfather of A.I.,” roiled the profession in 2016 when he predicted that machine learning would replace radiologists altogether.
Dr. Kottler said she began evaluating approved A.I. programs by quizzing their developers and then tested some to see which programs missed relatively obvious problems or pinpointed subtle ones.
She rejected one approved program that did not detect lung abnormalities beyond the cases her radiologists found — and missed some obvious ones.
Another program that scanned images of the head for aneurysms, a potentially life-threatening condition, proved impressive, she said. Though it flagged many false positives, it detected about 24 percent more cases than radiologists had identified. More people with an apparent brain aneurysm received follow-up care, including a 47-year-old with a bulging vessel in an unexpected corner of the brain.
At the end of a telehealth appointment in August, Dr. Roy Fagan realized he was having trouble speaking to the patient. Suspecting a stroke, he hurried to a hospital in rural North Carolina for a CT scan.
The image went to Greensboro Radiology, a Radiology Partners practice, where it set off an alert in a stroke-triage A.I. program. A radiologist didn’t have to sift through cases ahead of Dr. Fagan’s or click through more than 1,000 image slices; the one showing the brain clot popped up immediately.
The radiologist had Dr. Fagan transferred to a larger hospital that could rapidly remove the clot. He woke up feeling normal.
“It doesn’t always work this well,” said Dr. Sriyesh Krishnan, of Greensboro Radiology, who is also director of innovation development at Radiology Partners. “But when it works this well, it’s life changing for these patients.”
Dr. Fagan wanted to return to work the following Monday, but agreed to rest for a week. Impressed with the A.I. program, he said, “It’s a real advancement to have it here now.”
Radiology Partners has not published its findings in medical journals. Researchers who have published, though, have highlighted less inspiring examples of the effects of A.I. in medicine.
University of Michigan researchers examined a widely used A.I. tool in an electronic health-record system meant to predict which patients would develop sepsis. They found that the program fired off alerts on one in five patients, though only 12 percent of those flagged went on to develop sepsis.
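To see what those rates mean in practice, the short sketch below works through the arithmetic. It rests on two assumptions that are not in the study as reported here: a hypothetical cohort of 1,000 patients, chosen only to make the numbers easy to read, and a reading of the 12 percent figure as the share of flagged patients who went on to develop sepsis.

```python
def alert_breakdown(cohort_size, alert_rate, sepsis_rate_among_alerted):
    """Split alerts into those followed by sepsis and false alarms.

    All inputs are illustrative; the function name and parameters are
    hypothetical and not taken from the study or any vendor's software.
    """
    alerts = round(cohort_size * alert_rate)
    true_alerts = round(alerts * sepsis_rate_among_alerted)
    false_alarms = alerts - true_alerts
    return alerts, true_alerts, false_alarms


# Assumed cohort of 1,000 patients; rates as quoted in the article.
alerts, true_alerts, false_alarms = alert_breakdown(1000, 0.20, 0.12)
print(f"Alerts fired: {alerts}")
print(f"Followed by sepsis: {true_alerts}")
print(f"False alarms: {false_alarms}")
```

Under those assumptions, roughly seven of every eight alerts would be false alarms. The sketch says nothing about the sepsis cases the tool never flagged, which the figures quoted here do not cover.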
Another program that analyzed health costs as a proxy to predict medical needs ended up denying treatment to Black patients who were just as sick as white ones. The cost data turned out to be a bad stand-in for illness, a study in the journal Science found, since less money is typically spent on Black patients.
Those programs were not vetted by the F.D.A. But given the uncertainties, doctors have turned to agency approval records for reassurance. They found little. One research team looking at A.I. programs for critically ill patients found that evidence of real-world use was “completely absent” or based on computer models. The University of Pennsylvania and University of Southern California team also discovered that some of the programs were approved based on their similarities to existing medical devices, including some that did not even use artificial intelligence.
Another study of F.D.A.-cleared programs through 2021 found that of 118 A.I. tools, only one described the geographic and racial breakdown of the patients the program was trained on. The majority of the programs were tested on 500 or fewer cases — not enough, the study concluded, to justify deploying them widely.
Dr. Keith Dreyer, a study author and chief data science officer at Massachusetts General Hospital, is now leading a project through the American College of Radiology to fill the information gap. With the help of A.I. vendors that have been willing to share information, he and colleagues plan to publish an update on the agency-cleared programs.
That way, for instance, doctors can look up how many pediatric cases a program was built to recognize, alerting them to blind spots that could potentially affect care.
James McKinney, an F.D.A. spokesman, said the agency’s staff members review thousands of pages before clearing A.I. programs, but acknowledged that software makers may write the publicly released summaries. Those are not “intended for the purpose of making purchasing decisions,” he said, adding that more detailed information is provided on product labels, which are not readily accessible to the public.
Getting A.I. oversight right in medicine, a task that involves several agencies, is critical, said Dr. Ehrenfeld, the A.M.A. president. He said doctors have scrutinized the role of A.I. in deadly plane crashes to warn about the perils of automated safety systems overriding a pilot’s — or a doctor’s — judgment.
He said inquiries into the 737 Max crashes had shown that pilots weren’t trained to override a safety system that contributed to the disasters. He is concerned that doctors might encounter a similar use of A.I. running in the background of patient care that could prove harmful.
“Just understanding that the A.I. is there should be an obvious place to start,” Dr. Ehrenfeld said. “But it’s not clear that that will always happen if we don’t have the right regulatory framework.”