
Are LLMs the Silver Bullet for Compliance?

Written by Alex Friedmann
on July 26, 2023

Over the last 18 months, the team at ERM Libryo have had multiple conversations with customers about recent developments in AI, particularly ChatGPT, and what they mean for the legal compliance industry. We reached out to Alex, one of our team members who specialises in Regulatory System Development, to shed more light on the subject and share some of the exciting insights he has gained. Thank you, Alex!

AI is developing quickly, and some of the biggest breakthroughs have come in natural language processing (NLP).

NLP is an area of AI that sits at the intersection of human languages and machines. It deals with tasks like making predictions from text and extracting insight from it. For instance, you can use NLP to search for laws about “working at heights”, classify laws containing legal obligations, or group similar regulations together. These are the traditional use cases of NLP. But over the last five years, new transformer architectures, huge text datasets, and growing computing resources have culminated in something new: Large Language Models (LLMs).
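
As a rough illustration of those traditional use cases, here is a minimal sketch of semantic search over a handful of invented regulation snippets. It assumes the open-source sentence-transformers library and the general-purpose all-MiniLM-L6-v2 embedding model; none of it is specific to how ERM Libryo works.

```python
# A minimal sketch of semantic search over regulations, assuming the
# sentence-transformers library (pip install sentence-transformers).
# The regulation snippets below are invented for illustration.
from sentence_transformers import SentenceTransformer, util

snippets = [
    "Employers must provide fall protection for work at elevations of 1.8 m or more.",
    "Water suppliers shall sample for inorganic chemicals at each entry point.",
    "Scaffolding must be inspected by a competent person before each shift.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model
snippet_embeddings = model.encode(snippets, convert_to_tensor=True)

# Embed the query and rank snippets by cosine similarity.
query_embedding = model.encode("working at heights", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, snippet_embeddings)[0]
for score, text in sorted(zip(scores.tolist(), snippets), reverse=True):
    print(f"{score:.2f}  {text}")
```

Ranking by embedding similarity is what lets a query like “working at heights” match a rule about “fall protection” even though the two share no words.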

The era of LLMs

At the heart of the LLM is the transformer architecture – it’s like the instruction book that tells a computer how to train something like GPT. It works like a sandwich: on the top, you have layers of sauces (encoders), and on the bottom, you have layers of toppings (decoders). The encoders learn which sequences of words in the input text are important. The decoders then generate new word sequences based on what the encoders learned. (Strictly speaking, GPT-style models keep only the decoder half of the sandwich, predicting each next word from the words before it.) Of course, training on more varied texts results in more complex text generation, like a more complex sandwich flavour. A lot more can go into this ‘sandwich’, but these are the basics.
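
To make the ‘instruction book’ slightly more concrete, here is a toy sketch of scaled dot-product attention, the core operation repeated inside every encoder and decoder layer. The dimensions and inputs are invented for illustration; real models add multiple heads, learned projections, and many stacked layers.

```python
# A toy sketch of scaled dot-product attention, the core operation inside
# every transformer layer. Dimensions and inputs are invented for illustration.
import numpy as np

def attention(Q, K, V):
    # Compare every query with every key, scale, and normalise to weights.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax
    # Each output position is a weighted mix of the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # 5 tokens, each an 8-dimensional embedding
out = attention(x, x, x)      # self-attention: tokens attend to each other
print(out.shape)              # (5, 8)
```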

LLMs are significant because they allow us to generate text from a prompt, a fundamentally new use case in NLP. But this capability has created widespread excitement and panic as many industries ask themselves: if LLMs can write anything, why can’t we use them to write everything? Or, in the context of regulatory technology: why can’t we use them to assess and report compliance?
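
To see prompt-driven generation first-hand without ChatGPT, here is a minimal sketch using the Hugging Face transformers library and the small GPT-2 model, a tiny ancestor of today’s LLMs that uses the same generate-from-a-prompt mechanism. The prompt is our own invention.

```python
# A minimal sketch of generating text from a prompt, assuming the Hugging Face
# transformers library (pip install transformers) and the small GPT-2 model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator(
    "An employer's duties when employees work at height include",
    max_new_tokens=40,   # how much new text to generate
    do_sample=True,      # sample rather than always taking the likeliest word
)
print(result[0]["generated_text"])
```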

In this article, we’ll explore some of the reasons why relying on nothing but an LLM for compliance reporting is probably a bad idea.

1. Hallucination

Model hallucination refers to cases where LLMs generate responses that look correct but contain factual errors. These responses are sometimes very confidently written, which makes them particularly devious. Reliance on hallucinated responses can have serious real-world consequences. Almost monthly, lawyers around the world are getting caught out with ChatGPT-fabricated legal research, landing them in hot litigation waters. 

Fortunately, there have been recent improvements to the safeguards for legal queries. For example, we just asked ChatGPT:

“What does Article 4 of Chapter 4 of Division 7 of the California Water Code require me to do?”

A few months ago, it would have tried to give an answer, even though that answer would have almost certainly been incorrect. Now, however, it’s a lot more cautious:

“…as of my last update, there is no specific information available regarding Article 4 of Chapter 4 of Division 7 of the California Water Code. …To obtain the most accurate and up-to-date information regarding Article 4 of Chapter 4 of Division 7 of the California Water Code, I recommend referring to the official California legislative website, legal resources, or consulting with an attorney…”

Despite these comforting developments, hallucination is still a difficult problem to solve completely, especially in the context of compliance. Regulations are voluminous, complex, open to interpretation, and frequently changing. In a best-case scenario, an LLM will simply tell you to consult an attorney. In a worst-case scenario, it will provide a hallucinated legal opinion that lands you a fine or even jail time. As a general rule, the higher the risk of getting it wrong, the less you should rely solely on an LLM.


2. Expertise

LLMs perform well on a wide variety of generative tasks and are continuously improving. But they currently struggle to produce the level of technical detail found in many real-world regulations. For example, we gave ChatGPT the following prompt: 

“Draft an example obligation that deals with monitoring inorganic chemicals in water.” 

It returned the following:

“Sampling and Analysis:

3.1 The obligation requires the collection of representative water samples from various points within the water supply system, including raw water sources, treatment processes, and distribution networks.

3.2 Sampling procedures shall follow approved methodologies to ensure accuracy and consistency in sample collection, preservation, and transportation.

3.3 Water samples shall be analysed by qualified laboratories using approved analytical techniques and equipment capable of detecting and quantifying the target inorganic chemicals within acceptable detection limits.”

On the surface, this looks pretty reasonable. It mentions things like distribution networks, ensuring accuracy, and qualified labs. But the more you compare generated regulations like these with real-world ones, the more you realise how they tend to be simplistic, generic, and vague. By contrast, here’s what Title 40 of the US Code of Federal Regulations says:   

Ҥ 141.23 Inorganic chemical sampling and analytical requirements.

(a) Monitoring shall be conducted as follows: 

(1) Groundwater systems shall take a minimum of one sample at every entry point to the distribution system which is representative of each well after treatment (hereafter called a sampling point) beginning in the initial compliance period. The system shall take each sample at the same sampling point unless conditions make another sampling point more representative of each source or treatment plant.

….”

Real-world regulations deal with specific substances, specific engineering standards, specific timeframes, and specific agencies. But ChatGPT is typically vague. It told us to adhere to “approved methodologies” for collecting “representative samples” from “various points” in the water system. What are the methodologies? What are representative samples? Where are the various points? Even with follow-up clarifications, it’s hard to get ChatGPT to commit to the specifics.

With compliance, the devil is often in the details. So an LLM’s inability to drill down into detail can be a serious limitation, and one that’s mainly due to a lack of training data. The overwhelming majority of any LLM’s training data comes from scraping freely available, conveniently accessible websites. The problem is that freely available, high-quality regulatory data is not always convenient to scrape.

For example, regulatory content might be locked behind a premium subscription service, region-locked to a specific country, technically difficult (or illegal) to scrape, or in a format that’s hard to work with, like PDFs or images of documents. And besides, acquiring extensive regulatory content is simply not a priority for most organisations that provide consumer LLMs, much less acquiring non-English regulatory content. 
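
To give a feel for why formats like PDF are awkward, here is a minimal sketch of extracting text from a hypothetical regulation PDF, assuming the pypdf library. Scanned, image-only documents would return nothing and need OCR on top.

```python
# A minimal sketch of pulling text out of a regulatory PDF, assuming the pypdf
# library (pip install pypdf). "regulation.pdf" is a hypothetical file; scanned
# image-only PDFs would come back empty and need OCR instead.
from pypdf import PdfReader

reader = PdfReader("regulation.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Layout is lost: headings, clause numbering, footnotes, and tables arrive as
# one undifferentiated stream, which is why such sources are hard to train on.
print(text[:500])
```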

Despite these limitations, though, model expertise is set to improve over time. Many LLM providers are conducting Reinforcement Learning from Human Feedback (RLHF), which involves employing real people to correct model responses. This human-in-the-loop approach brought notable improvements in the iterations leading up to GPT-4 and is the “secret sauce” in many bleeding-edge LLMs. We’ll probably soon see more domain experts employed to use their knowledge to improve large language models. But until then, the more technically complex the question, the riskier it is to rely on an LLM for a technically correct answer.

 

3. Regulatory Updates

In June 2023, there were 106 changes to the Texas Administrative Code, covering matters from inland marine insurance to emergency response systems. Around the world, hundreds of regulations like these are being created, amended, and repealed every week. Add to that the global monthly changes in executive actions and standards, and you have an overwhelming amount of data to keep track of. 

Indeed, ChatGPT’s training data ended in September 2021, which means it hasn’t been trained on any legal updates since (in fact, it probably wasn’t trained on any significant number of legal updates to begin with, due to the scraping challenges mentioned above). Moreover, it’s currently not possible for LLMs to somehow “absorb” new updates as they arrive. Training a new LLM from scratch on fresh data every month is not feasible, nor is monthly fine-tuning at this scale. So what does all this mean? It means an LLM by itself will not help you find changes to the laws that affect your business.

Fortunately, LLMs are still excellent at tasks like summarising and extracting information from existing documents. And there’s been an explosion of developments in this area, from small-scale solutions, like ChatGPT plugins that extract data from a collection of personal PDFs, to large-scale solutions like LLM search engine integrations. If you could acquire and store the regulatory updates (or any documents, for that matter) relevant to your business, you could technically extract insight from them using an LLM. Your document collection would serve as a kind of memory bank.
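
Here is a minimal sketch of that memory-bank idea, often called retrieval-augmented generation: embed the stored updates, retrieve the most relevant one for a question, and hand it to an LLM alongside the question. The updates are invented, and the final LLM call is left as a placeholder.

```python
# A minimal sketch of retrieval-augmented generation over stored regulatory
# updates. The updates are invented; send_to_llm is a placeholder for whichever
# LLM API you actually use.
from sentence_transformers import SentenceTransformer, util

updates = [
    "2023-06-01: Texas Administrative Code amended reporting rules for inland marine insurance.",
    "2023-06-14: New emission limits adopted for stationary diesel generators.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
update_embeddings = model.encode(updates, convert_to_tensor=True)

question = "Did anything change recently for our diesel generators?"
scores = util.cos_sim(model.encode(question, convert_to_tensor=True), update_embeddings)[0]
best = updates[int(scores.argmax())]  # retrieve the most relevant stored update

prompt = f"Using only this document:\n{best}\n\nAnswer: {question}"
# send_to_llm(prompt)  # placeholder: pass the grounded prompt to your LLM of choice
print(prompt)
```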

Unfortunately, acquiring all the relevant regulatory updates is still an enormous challenge – not to mention storing, organising and displaying them consistently. You would run into all the scraping problems mentioned above but on a larger scale. Even commercial search engines don’t do such a great job of finding legal updates, because they aren’t built with this use case in mind. The best they can do is point you to the right website. That’s fine if the website is good, but if it’s not then you’re on your own. So it’s very difficult to squeeze any benefit out of LLMs in this area without serious investment in purpose-built web scraping and search technologies.

 

4. Interfaces

When ChatGPT was first released, everyone was asking it all sorts of questions. The idea that you can write something, and a machine can sensibly reply, is truly remarkable. But we suspect that this prompt-response, question-answer style interface will start losing its novelty soon. We think users are already starting to feel tired and overwhelmed in cases involving lots of back and forth. One such case is compliance, where resolving issues with a chatbot is often time-consuming, inelegant, and exhausting. 

Let’s assume that GPT10 has just been released. It can answer any legal question with perfect accuracy, even about the most recent regulatory changes. In this scenario, managing compliance with GPT10 amounts to managing conversations, which is deceptively hard. For example, you would need to consider the following when dealing with the LLM:

  • What questions should I ask?
  • How should I phrase my questions?
  • What follow-up/clarification questions should I ask?
  • How do I know when to stop asking questions?
  • Have I already asked a similar question?
  • Did my colleague already ask a similar question?
  • Do my colleagues at different sites need to ask different questions?
  • Do my colleagues in different jurisdictions get different answers?
  • Was my question fully answered?
  • Are my answers still applicable?
  • How often should I re-ask the same questions?
  • When should I ask a new question?

Knowing what to ask is also difficult because it requires detailed, site-specific insight. Imagine you need to assess and report on a particular site’s air quality and emissions. Before you can begin asking questions, you would first need to know that having fuel-burning equipment on the premises is a material fact to communicate to an LLM. Depending on the jurisdiction, you would also need to know the equipment type (boilers, process heaters, incinerators etc), the wattage, whether vent streams are introduced with primary fuel, and so on. And you might have just had a long conversation only to realise you provided the wrong boiler wattage, or forgot to mention you also have a diesel generator.
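
To make that concrete, here is a sketch, with entirely invented fields, of the kind of site profile you would need to assemble before the conversation even starts. Omitting a single fact (the diesel generator, say) can silently invalidate the whole exchange.

```python
# Invented example of the site-specific facts you would need to surface before
# questioning an LLM about air-quality obligations. Every field below can change
# which regulations apply, and omitting one can invalidate the conversation.
site_profile = {
    "jurisdiction": "Texas, USA",
    "fuel_burning_equipment": [
        {"type": "boiler", "rated_output_kw": 750, "primary_fuel": "natural gas"},
        {"type": "diesel generator", "rated_output_kw": 200, "use": "backup"},
    ],
    "vent_streams_with_primary_fuel": False,
}

prompt = (
    "Given this site profile, which air-quality monitoring and reporting "
    f"obligations apply?\n{site_profile}"
)
print(prompt)
```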

When interactions like these happen at scale, you inevitably end up with large, unordered collections of interlinked conversations that still need to be organised, reviewed, and periodically re-evaluated. They could potentially grow until they’re similar in size and complexity to the actual regulatory content itself. In these cases, you might as well just read the actual law directly. After all this, it’s also still not obvious whether you’re actually 100% compliant. Perhaps you left a stone unturned, a question unasked?

 

My final insights

Large Language Models (LLMs) are a truly groundbreaking development in AI. They can already do so much, and their capabilities are set to improve rapidly from here. They’re a great way to boost productivity, especially when integrated into other dedicated EHS or ESG solutions. However, they are not, in themselves, integrated platforms to manage and report on regulatory compliance. So they shouldn’t be used for complex or high-risk legal tasks, or tasks that depend on precise and up-to-date regulatory information.

 

Learn how ERM Libryo uses the latest technology to accurately compile and track site-specific regulation using Libryo Streams.