The propensity for AI systems to make mistakes and for humans to miss those mistakes has been on full display in the US legal system as of late. The follies began when lawyers—including some at prestigious firms—submitted documents citing cases that didn’t exist. Similar mistakes soon spread to other roles in the courts. In December, a Stanford professor submitted sworn testimony containing hallucinations and errors in a case about deepfakes, despite being an expert on AI and misinformation himself.

The buck stopped with judges, who—whether they or opposing counsel caught the mistakes—issued reprimands and fines, and likely left attorneys embarrassed enough to think twice before trusting AI again.

But now judges are experimenting with generative AI too. Some are confident that, with the right precautions, the technology can expedite legal research, summarize cases, draft routine orders, and generally help speed up the court system, which is badly backlogged in many parts of the US. This summer, though, we’ve already seen AI-generated mistakes go undetected and cited by judges. A federal judge in New Jersey had to reissue an order riddled with errors that may have come from AI, and a judge in Mississippi refused to explain why his order, too, contained mistakes that seemed like AI hallucinations.

The results of these early-adopter experiments make two things clear. One, the category of routine tasks—for which AI can assist without requiring human judgment—is slippery to define. Two, while lawyers face sharp scrutiny when their use of AI leads to mistakes, judges may not face the same accountability, and walking back their mistakes before they do damage is much harder.

Drawing boundaries

Xavier Rodriguez, a federal judge for the Western District of Texas, has good reason to be skeptical of AI. He started learning about artificial intelligence back in 2018, four years before the release of ChatGPT (thanks in part to the influence of his twin brother, who works in tech). But he’s also seen AI-generated mistakes in his own court. 

In a recent dispute about who was to receive an insurance payout, both the plaintiff and the defendant represented themselves, without lawyers (this is not uncommon—nearly a quarter of civil cases in federal court involve at least one unrepresented party). The two sides wrote their own filings and made their own arguments. 

“Both sides used AI tools,” Rodriguez says, and both submitted filings that referenced made-up cases. He had authority to reprimand them, but given that they were not lawyers, he opted not to. 

“I think there’s been an overreaction by a lot of judges on these sanctions. The running joke I tell when I’m on the speaking circuit is that lawyers have been hallucinating well before AI,” he says. Missing a mistake from an AI model is not wholly different, to Rodriguez, from failing to catch the error of a first-year lawyer. “I’m not as deeply offended as everybody else,” he says. 


In his court, Rodriguez has been using generative AI tools (he wouldn’t publicly name which ones, to avoid the appearance of an endorsement) to summarize cases. He’ll ask AI to identify key players involved and then have it generate a timeline of key events. Ahead of specific hearings, Rodriguez will also ask it to generate questions for attorneys based on the materials they submit.

These tasks, to him, don’t lean on human judgment. They also offer lots of opportunities for him to intervene and uncover any mistakes before they’re brought to the court. “It’s not any final decision being made, and so it’s relatively risk free,” he says. Using AI to predict whether someone should be eligible for bail, on the other hand, goes too far in the direction of judgment and discretion, in his view.

Erin Solovey, a professor and researcher on human-AI interaction at Worcester Polytechnic Institute in Massachusetts, recently studied how judges in the UK think about this distinction between rote, machine-friendly work that feels safe to delegate to AI and tasks that lean more heavily on human expertise. 

“The line between what is appropriate for a human judge to do versus what is appropriate for AI tools to do changes from judge to judge and from one scenario to the next,” she says.

Even so, according to Solovey, some of these tasks simply don’t match what AI is good at. Asking AI to summarize a large document, for example, might produce drastically different results depending on whether the model has been trained to summarize for a general audience or a legal one. AI also struggles with logic-based tasks like ordering the events of a case. “A very plausible-sounding timeline may be factually incorrect,” Solovey says. 

Rodriguez and a number of other judges crafted guidelines that were published in February by the Sedona Conference, an influential think tank that issues principles for particularly murky areas of the law. The guidelines outline a host of potentially “safe” uses of AI for judges, including conducting legal research, creating preliminary transcripts, and searching briefings, while warning that judges should verify outputs from AI and that “no known GenAI tools have fully resolved the hallucination problem.”

Dodging AI blunders

Judge Allison Goddard, a federal magistrate judge in California and a coauthor of the guidelines, first felt the impact that AI would have on the judiciary when she taught a class on the art of advocacy at her daughter’s high school. She was impressed by a student’s essay and mentioned it to her daughter. “She said, ‘Oh, Mom, that’s ChatGPT.’”


“What I realized very quickly was this is going to really transform the legal profession,” she says. In her court, Goddard has been experimenting with ChatGPT, Claude (which she keeps “open all day”), and a host of other AI models. If a case involves a particularly technical issue, she might ask AI to help her understand which questions to ask attorneys. She’ll have a model summarize 60-page orders from the district judge and then ask it follow-up questions about them, or have it organize information from messy documents.

“It’s kind of a thought partner, and it brings a perspective that you may not have considered,” she says.

Goddard also encourages her clerks to use AI, specifically Anthropic’s Claude, because by default it does not train on user conversations. But general-purpose models have their limits. For anything that requires law-specific knowledge, she’ll use Westlaw or Lexis, which offer AI tools built specifically for lawyers, though she finds general-purpose models faster for many other tasks. And her concerns about bias have kept her from using AI for tasks in criminal cases, like determining whether there was probable cause for an arrest.

In this, Goddard appears to be caught in the same predicament the AI boom has created for many of us. Three years in, companies have built tools that sound so fluent and humanlike they obscure the intractable problems lurking underneath—answers that read well but are wrong, models that are trained to be decent at everything but perfect for nothing, and the risk that your conversations with them will be leaked to the internet. Each time we use them, we bet that the time saved will outweigh the risks, and trust ourselves to catch the mistakes before they matter. For judges, the stakes are sky-high: If they lose that bet, they face very public consequences, and the impact of such mistakes on the people they serve can be lasting. 

“I’m not going to be the judge that cites hallucinated cases and orders,” Goddard says. “It’s really embarrassing, very professionally embarrassing.”

Still, some judges don’t want to get left behind in the AI age. Some in the AI sector have suggested that the supposed objectivity and rationality of AI models could make them better judges than fallible humans, a view that might lead some on the bench to think that falling behind poses a bigger risk than getting too far out ahead.

A ‘crisis waiting to happen’

The risks of early adoption have raised alarm bells with Judge Scott Schlegel, who serves on the Fifth Circuit Court of Appeal in Louisiana. Schlegel has long blogged about the helpful role technology can play in modernizing the court system, but he has warned that AI-generated mistakes in judges’ rulings signal a “crisis waiting to happen,” one that would dwarf the problem of lawyers’ submitting filings with made-up cases. 


Attorneys who make mistakes can get sanctioned, have their motions dismissed, or lose cases when the opposing party finds out and flags the errors. “When the judge makes a mistake, that’s the law,” he says. “I can’t go a month or two later and go ‘Oops, so sorry,’ and reverse myself. It doesn’t work that way.”

Consider child custody cases or bail proceedings, Schlegel says: “There are pretty significant consequences when a judge relies upon artificial intelligence to make the decision,” especially if the citations that decision relies on are made-up or incorrect.

This is not theoretical. In June, a Georgia appellate court judge issued an order that relied partially on made-up cases submitted by one of the parties, a mistake that went uncaught. In July, a federal judge in New Jersey withdrew an opinion after lawyers complained it too contained hallucinations. 

Unlike lawyers, who can be ordered by the court to explain why there are mistakes in their filings, judges do not have to show much transparency, and there is little reason to think they’ll do so voluntarily. On August 4, a federal judge in Mississippi had to issue a new decision in a civil rights case after the original was found to contain incorrect names and serious errors. The judge did not fully explain what led to the errors even after the state asked him to do so. “No further explanation is warranted,” the judge wrote.

These mistakes could erode the public’s faith in the legitimacy of courts, Schlegel says. Certain narrow and monitored applications of AI—summarizing testimonies, getting quick writing feedback—can save time, and they can produce good results if judges treat the work like that of a first-year associate, checking it thoroughly for accuracy. But most of the job of being a judge is dealing with what he calls the white-page problem: You’re presiding over a complex case with a blank page in front of you, forced to make difficult decisions. Thinking through those decisions, he says, is indeed the work of being a judge. Getting help with a first draft from an AI undermines that purpose.

“If you’re making a decision on who gets the kids this weekend and somebody finds out you use Grok and you should have used Gemini or ChatGPT—you know, that’s not the justice system.”
