Beyond Accuracy: Evaluating LLMs for Evidentiary Use in Legal Contexts

We are happy to share that a new book on AI and legal evidence has just been published by Routledge, and that Fair Patterns contributed to the first chapter.
The chapter, written by Sundaraparipurnan Narayanan, Marie Potel, and Vibhav Mithal, examines a question that arises increasingly often: how do we evaluate large language models in legal contexts?
LLMs are already part of legal workflows
Large language models are now used for a wide range of tasks: legal research, drafting, summarising, even predicting outcomes.
They can be useful, but they also come with clear limitations.
General-purpose models are not built for legal reasoning. They can produce incorrect information, rely on incomplete context, or reflect biases present in their training data. In low-risk situations, this may be manageable. In legal contexts, it is not.
Why accuracy is not enough
A common way to evaluate these systems is to look at accuracy. Does the answer look correct?
In practice, this is not sufficient.
Legal use cases require more than surface-level correctness. They involve context, interpretation, jurisdiction, and responsibility. A response can sound convincing while still being incomplete or misleading.
This becomes particularly important in tasks like contract clause generation, where small details can have significant consequences.
Key challenges in legal use
The chapter highlights several challenges that need to be considered when using LLMs in legal contexts:
- adapting to different legal systems and evolving regulations
- handling sensitive data and ensuring privacy
- identifying and mitigating biases
- maintaining human oversight
- managing complex, multi-jurisdictional scenarios
- ensuring accountability when errors occur
These are not edge cases. They are part of everyday legal practice.
Introducing the CLAUSE framework
To address these questions, the chapter introduces a framework called CLAUSE.
Rather than focusing only on accuracy, it proposes a broader way to evaluate LLMs used for contract clause generation, based on six dimensions:
- compliance and adaptability across jurisdictions
- legal reasoning and competence
- reliability and trustworthiness
- understanding of context
- handling of structured contract elements
- ethics, privacy, and accountability
The goal is to provide a more complete view of how these systems perform in real-world legal scenarios.
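To make the six dimensions more concrete, here is a minimal sketch of how a multi-dimensional score card like CLAUSE could be represented in code. The 0-to-5 scale, the equal weighting, and all identifier names are illustrative assumptions for this sketch; the chapter does not prescribe a specific scoring scheme.

```python
from dataclasses import dataclass

# The six CLAUSE dimensions named above. The scoring scheme below
# (0-5 scale, equal weights, simple average) is a hypothetical
# illustration, not the chapter's actual methodology.
DIMENSIONS = [
    "compliance_and_adaptability",
    "legal_reasoning",
    "reliability",
    "context_understanding",
    "structured_elements",
    "ethics_privacy_accountability",
]

@dataclass
class ClauseEvaluation:
    """Scores for one generated contract clause, one per dimension."""
    scores: dict  # dimension name -> score in [0, 5]

    def overall(self) -> float:
        """Equal-weight average across all six dimensions."""
        return sum(self.scores.get(d, 0) for d in DIMENSIONS) / len(DIMENSIONS)

    def weakest(self) -> str:
        """The dimension most in need of human review."""
        return min(DIMENSIONS, key=lambda d: self.scores.get(d, 0))

# Example: a clause that reads fluently but mishandles jurisdiction.
evaluation = ClauseEvaluation(scores={
    "compliance_and_adaptability": 2,
    "legal_reasoning": 4,
    "reliability": 4,
    "context_understanding": 3,
    "structured_elements": 5,
    "ethics_privacy_accountability": 4,
})
print(round(evaluation.overall(), 2))  # 3.67
print(evaluation.weakest())            # compliance_and_adaptability
```

The point of a structure like this is that a single headline number hides exactly the failures that matter most in legal work: a clause can score well overall while still being non-compliant in a given jurisdiction, which is why the per-dimension view is the one a reviewer should act on.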
A need for continuous evaluation
One of the main ideas behind this work is that evaluation cannot be a one-time step.
Legal environments evolve. Models evolve. Use cases evolve.
This makes continuous review and human oversight essential. Not only to improve performance, but also to ensure that these systems remain aligned with legal and ethical expectations.
Read more
The chapter is part of the book:
Artificial Intelligence and Legal Evidence: The Indian Policy Perspective (Routledge)

