Beyond Accuracy: Evaluating LLMs for Evidentiary Use in Legal Contexts

We are happy to share that a new book on AI and legal evidence has just been published by Routledge, and that Fair Patterns contributed to the first chapter.
The chapter, written by Sundaraparipurnan Narayanan, Marie Potel, and Vibhav Mithal, examines a question that arises increasingly often: how do we evaluate large language models in legal contexts?
LLMs are already part of legal workflows
Large language models are now used for a wide range of tasks: legal research, drafting, summarising, even predicting outcomes.
They can be useful, but they also come with clear limitations.
General-purpose models are not built for legal reasoning. They can produce incorrect information, rely on incomplete context, or reflect biases present in their training data. In low-risk situations, this may be manageable. In legal contexts, it is not.
Why accuracy is not enough
A common way to evaluate these systems is to look at accuracy. Does the answer look correct?
In practice, this is not sufficient.
Legal use cases require more than surface-level correctness. They involve context, interpretation, jurisdiction, and responsibility. A response can sound convincing while still being incomplete or misleading.
This becomes particularly important in tasks like contract clause generation, where small details can have significant consequences.
Key challenges in legal use
The chapter highlights several challenges that need to be considered when using LLMs in legal contexts:
- adapting to different legal systems and evolving regulations
- handling sensitive data and ensuring privacy
- identifying and mitigating biases
- maintaining human oversight
- managing complex, multi-jurisdictional scenarios
- ensuring accountability when errors occur
These are not edge cases. They are part of everyday legal practice.
Introducing the CLAUSE framework
To address these questions, the chapter introduces a framework called CLAUSE.
Rather than focusing only on accuracy, it proposes a broader way to evaluate LLMs used for contract clause generation, based on six dimensions:
- compliance and adaptability across jurisdictions
- legal reasoning and competence
- reliability and trustworthiness
- understanding of context
- handling of structured contract elements
- ethics, privacy, and accountability
The goal is to provide a more complete view of how these systems perform in real-world legal scenarios.
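To make the six dimensions more concrete, here is a minimal sketch of how a multi-dimensional score card like CLAUSE could be represented in code. The 0-to-5 scale, the equal weighting, and all identifier names are illustrative assumptions for this sketch; the chapter does not prescribe a specific scoring scheme.

```python
from dataclasses import dataclass

# The six CLAUSE dimensions named above. The scoring scheme below
# (0-5 scale, equal weights, simple average) is a hypothetical
# illustration, not the chapter's actual methodology.
DIMENSIONS = [
    "compliance_and_adaptability",
    "legal_reasoning",
    "reliability",
    "context_understanding",
    "structured_elements",
    "ethics_privacy_accountability",
]

@dataclass
class ClauseEvaluation:
    """Scores for one generated contract clause, one per dimension."""
    scores: dict  # dimension name -> score in [0, 5]

    def overall(self) -> float:
        """Equal-weight average across all six dimensions."""
        return sum(self.scores.get(d, 0) for d in DIMENSIONS) / len(DIMENSIONS)

    def weakest(self) -> str:
        """The dimension most in need of human review."""
        return min(DIMENSIONS, key=lambda d: self.scores.get(d, 0))

# Example: a clause that reads fluently but mishandles jurisdiction.
evaluation = ClauseEvaluation(scores={
    "compliance_and_adaptability": 2,
    "legal_reasoning": 4,
    "reliability": 4,
    "context_understanding": 3,
    "structured_elements": 5,
    "ethics_privacy_accountability": 4,
})
print(round(evaluation.overall(), 2))  # 3.67
print(evaluation.weakest())            # compliance_and_adaptability
```

The point of a structure like this is that a single headline number hides exactly the failures that matter most in legal work: a clause can score well overall while still being non-compliant in a given jurisdiction, which is why the per-dimension view is the one a reviewer should act on.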
A need for continuous evaluation
One of the main ideas behind this work is that evaluation cannot be a one-time step.
Legal environments evolve. Models evolve. Use cases evolve.
This makes continuous review and human oversight essential. Not only to improve performance, but also to ensure that these systems remain aligned with legal and ethical expectations.
Read more
The chapter is part of the book:
Artificial Intelligence and Legal Evidence: The Indian Policy Perspective (Routledge)

