I’ve written before about how we’re thinking about “low resource” use of Large Language Models (LLMs), and how some of the benefits of LLMs can be captured without getting caught in the trade-off between depending on an external API and needing new infrastructure to run models internally.
One of the use cases we have for LLMs is categorisation: across parliamentary data in TheyWorkForYou and FOI data in WhatDoTheyKnow, we have a lot of unstructured text that it would be useful to assign structured labels to, for either public-facing or internal processes.
This blog post is a write-up of an experiment (working title: RuleBox) that uses LLMs to create classification rules, which can then be run on traditional computing infrastructure. This allows large-scale text categorisation to run quickly and cheaply without ongoing API dependencies.
Categorising Early Day Motions
We have a big dataset of parliamentary Early Day Motions (EDMs), which are formally ‘draft motions’ for parliamentary discussion but effectively work as an internal petition tool where MPs can signal their interest in, or support for, different areas.
In tools like the Local Intelligence Hub (LIH) we highlight a few EDMs as indicators that an MP has a special interest in an area of climate/environmental work. We want to keep these better up to date, and to have a pipeline that is flexible enough for future versions of the LIH that might focus on different sectors. We want to be able to tag existing and new EDMs depending on whether they relate to climate/environmental matters, or to other domains of interest such as violence against women and girls (VAWG).
A very simple approach would just be to plug into the OpenAI API and store some categories each day, but this gives us a dependency and an ongoing cost. What we’ve experimented with instead is an approach where we use the OpenAI API to bootstrap a process: we’ve used the commercial LLM to add categories to a limited set of data, and then seen how we can use that to create rules to categorise the rest.
Machine learning and text classification
Regular expressions and text processing rules
The “traditional” way of classifying lots of text automatically is to use text matching or regular expressions.
Regular expressions are a special format for defining when a piece of text matches a pattern (which might be “contains one of these words” or “find the thing that is structured like an email address”).
The advantage of this approach is that you can see the rules you’ve added, and at this point the underlying technical implementations are really fast. The disadvantage is that you might need to add a lot of edge cases manually, and regular expression syntax is not always easy to read.
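As a rough illustration (not rules from this project), here is what those two kinds of pattern look like in Python; the keyword list and email pattern are made up for the example.

```python
import re

# A minimal sketch of the two kinds of pattern mentioned above:
# keyword matching and structure matching. Patterns are illustrative only.
KEYWORDS = re.compile(r"\b(climate|emissions|biodiversity)\b", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

text = "That this House notes rising carbon emissions; contact clerk@example.org."
print(bool(KEYWORDS.search(text)))   # True: contains one of the keywords
print(EMAIL.findall(text))           # ['clerk@example.org']: structured like an email address
```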
Machine learning
The use of “normal” machine learning provides a new tool. Here, models that have already been trained on a big dataset of the language are then fine-tuned to map input texts to provided categories.
The theory of what is happening here is that in order to accurately “predict the next word”, language models need to have developed internal structures that map to different flows and structures in the text. As such, if you cut off the final “predicting the next word” step, and replace it with a “what category” step, those internal structures can be usefully repurposed to this task.
As such, machine learning based text classifiers can be more flexible. They pick up patterns like “this flavour of word is in proximity to this flavour of word” that would be difficult to code for manually. The downside is that they are a black box, and it is hard to understand how the model has made a classification decision. They are also more resource intensive and slower to categorise large datasets, but still fundamentally possible to run on traditional hardware.
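As a rough illustration of running such a classifier locally, here is a sketch using the Hugging Face transformers library. The model is an off-the-shelf fine-tuned sentiment classifier standing in for a model fine-tuned on your own categories; it is not something we use in this project.

```python
# A minimal sketch of running a small fine-tuned classifier on local hardware.
# The model here is a public sentiment classifier used purely as a stand-in;
# the same pattern applies to a model fine-tuned on your own labels.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("That this House welcomes investment in renewable energy."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```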
LLMs
The next wave is LLMs, which take the same basic concept and massively increase the data and the size of the model. Here, rather than replacing the “next word” step, the LLM is trained on datasets that contain both instructions and the results of following those instructions. This makes zero-shot classification possible: without retraining, a model can be given a text and a list of labels and asked to assign the appropriate label.
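As an illustration of what zero-shot classification looks like in practice, here is a sketch using OpenAI’s Python client; the model name, labels and prompt are made up for the example rather than being the ones used in this project.

```python
# A minimal sketch of zero-shot classification via an LLM API.
# Labels, prompt and model name are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
LABELS = ["climate/environment", "health", "transport", "none of these"]

def classify(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Assign exactly one of these labels and reply with the label only: "
                        + ", ".join(LABELS)},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify("That this House is concerned about flood risk to coastal communities."))
```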
This remains a (now massive) black box, but errors in category assignment can be reduced by adjusting the instructions. The new downsides over smaller machine learning models are that the much larger model size hugely increases the cost of self-hosting, and creates dependencies on external companies providing models. If you use proprietary models (which are regularly updated and deprecated), this creates problems for reproducible processes.
RuleBox approach
The RuleBox approach combines aspects of both. One of the things that LLMs are quite good at is writing code to solve stated problems. Here we’re doing a version of that: providing text and a category, and asking the LLM to produce a set of regular expressions that should assign this category.
This has its own set of pros and cons: you are still bound by the underlying limitation of regular expressions, which match on literal text rather than the vibes of the text (which language models are better at). But you have massively reduced the labour time needed to create the huge set of rules, and once you have them they can be applied at speed on traditional hardware.
This is part of a focus on “low resource” use of LLMs, where we want to think about how we can get the most value out of new technology in a way that avoids new dependencies or the need for hugely increased capacity.
The process
We used an OpenAI-based process to assign labels to a set of 2,000 EDMs (1,000 each for the training and validation datasets).
We then created a basic structure for holding regular expression rules, using Pydantic for the underlying data structure. Each rule holds a list of regexes combined as either AND (all must match) or OR (at least one must match), with the option of NOT rules that negate a positive match.
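As a rough sketch (the class and field names are illustrative, not the actual RuleBox schema), the container looks something like this:

```python
import re
from pydantic import BaseModel, Field

class Rule(BaseModel):
    label: str
    mode: str = "or"                                        # "and": all patterns must match; "or": any one
    patterns: list[str] = Field(default_factory=list)       # regexes for a positive match
    not_patterns: list[str] = Field(default_factory=list)   # a match here negates a positive match

    def matches(self, text: str) -> bool:
        if not self.patterns:
            return False
        hits = [bool(re.search(p, text, re.IGNORECASE)) for p in self.patterns]
        matched = all(hits) if self.mode == "and" else any(hits)
        if matched and any(re.search(p, text, re.IGNORECASE) for p in self.not_patterns):
            return False
        return matched

class RuleSet(BaseModel):
    rules: list[Rule] = Field(default_factory=list)

    def labels_for(self, text: str) -> set[str]:
        return {rule.label for rule in self.rules if rule.matches(text)}
```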
Once we have the holder for a set of rules, and a dataset with a set of labels, we can start to calculate mismatches between the labels the rules assign and the ground truth. Running this in a loop, with steps that query an LLM, helps refine the result.
The steps are:
- Calculate mismatches between the ground truth labels and the assigned labels, finding both missing labels and incorrect labels.
- AI: for each missing label, create a new regex rule that would assign the correct label.
- AI: for each incorrect label, adjust and replace the regex rules that triggered this label.
- Repeat until there are no missing or incorrect labels (a rough sketch of this loop follows).
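Sketched in Python, and assuming the RuleSet container above plus hypothetical helpers that stand in for the LLM calls, the loop looks roughly like this:

```python
# A rough sketch of the refinement loop. ask_llm_for_new_rule and
# ask_llm_to_fix_rules are hypothetical helpers standing in for the LLM calls.
def refine(ruleset: RuleSet, training_data: list[tuple[str, set[str]]], max_rounds: int = 20) -> RuleSet:
    for _ in range(max_rounds):
        missing, incorrect = [], []
        for text, true_labels in training_data:
            assigned = ruleset.labels_for(text)
            missing += [(text, label) for label in true_labels - assigned]
            incorrect += [(text, label) for label in assigned - true_labels]
        if not missing and not incorrect:
            break  # the rules now reproduce the training labels exactly
        for text, label in missing:
            ruleset.rules.append(ask_llm_for_new_rule(text, label))  # add a rule that matches this text
        for text, label in incorrect:
            ruleset = ask_llm_to_fix_rules(ruleset, text, label)     # tighten the rules that over-matched
    return ruleset
```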
PydanticAI is used to interface with the OpenAI API. This includes not just using Pydantic to validate the returned data structure, but also extra validation checks that the resulting rules match the text that was input. So, for instance, if a rule generated to assign a label to a piece of text fails to match that text, the failure is passed back to the model to trigger a retry.
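As an illustrative sketch of that retry check (PydanticAI’s exact names, such as result_type and result_validator, vary between versions, so treat this as indicative rather than exact):

```python
# A rough sketch of the "generated rule must match the source text" check,
# using PydanticAI's validator hook and ModelRetry. Rule is the Pydantic
# model sketched earlier; the wiring here is illustrative.
from pydantic_ai import Agent, ModelRetry, RunContext

rule_agent = Agent(
    "openai:gpt-4o",
    result_type=Rule,   # the Rule model sketched above
    deps_type=str,      # the source text is passed in as a dependency
)

@rule_agent.result_validator
def rule_must_match_source(ctx: RunContext[str], rule: Rule) -> Rule:
    if not rule.matches(ctx.deps):
        raise ModelRetry("The generated rule does not match the source text; please adjust it.")
    return rule
```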
The initial attempt at this got stuck in a loop, creating rules that were too general and then trying to narrow them down. At this point, we cut the categories down to just the two we were really interested in, and once that performed better, expanded out to six more categories where it felt like a keyword approach should perform reasonably well (or at least successfully generate rules). This ends up with 1,500 regular expressions assigning eight categories.
Applying the rules
Once we have the rules, we know they work for the training dataset, but how useful are they in general?
Using the validation dataset, we can see the following differences:
- Correct labels: 230
- Missing labels: 73
- Incorrect labels: 41
- Items correctly assigned no labels: 808 (of 1,000 total items)
Reviewing these, the incorrect labels generally felt fair enough: these tended to be examples that contained obvious keywords related to the environment, but were part of longer lists where the labelling process did not judge the environment to be one of the focuses of the text. The missing labels were more of a problem: 33 of them were environmental ones. Expanding the training data should improve this, but there is always going to be a long tail that is missed.
Something else we experimented with at this stage was moving the process that applies the rules from Python to Rust (using an LLM to translate a basic version of the Python mechanics). This cut the time taken to categorise 13,000 EDMs from two minutes to four seconds. The benefit of this isn’t just being fast on this dataset, but that much more complicated rulesets would not cause a big slowdown.
What have we learned?
In general, this is an approach worth investigating further, as it bridges several useful features: with it, we are able to translate an initial high-intensity use of LLMs into a process that runs fast on traditional hardware and, importantly, is not a black box in terms of how it assigns labels.
It doesn’t completely carry over the benefits of LLMs: it is better for smaller, more precise categories, and it really needs a good theory of why a keyword approach would be a good way of categorising something. It might be a good transitional approach for a few years while options stabilise around more open models with lower resource requirements.
Next steps
The next steps on this are to expand the training data a bit and start seeing if we can practically make use of the categories assigned, or if the accuracy causes problems.
Depending on how this goes, we can revisit the initial experiment code and tidy it up into a more general classifying tool. This could tackle other classification problems we have that might be suitable, and we could make the tool more widely available. An advantage of this kind of approach (as with our previous work around vector search) is that it is the kind of project where “a technically-minded volunteer helped us to create a tool” might help organisations without creating significant new dependencies or new infrastructure requirements.
We also want to think about where hybrid approaches might be useful. For instance, in these datasets, most items are not labelled at all. A fast rule-based first pass could identify potential items, with an LLM second pass then knocking out false positives from the data. Similarly, once we have a smaller pool of environmentally-linked items, further subclassification using LLMs becomes much more viable.
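A rough sketch of that hybrid shape, reusing the RuleSet and classify() sketches from earlier (names illustrative):

```python
# A rough sketch of the hybrid idea: the cheap rule-based pass runs over
# everything, and the expensive LLM pass only runs on the items the rules flagged.
def hybrid_labels(items: list[str], ruleset: RuleSet) -> dict[str, str]:
    results = {}
    for text in items:
        candidates = ruleset.labels_for(text)  # cheap: regex pass over every item
        if candidates:
            results[text] = classify(text)     # expensive: LLM pass only on flagged items
    return results
```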
Our general approach is to try and identify the things that LLMs can do uniquely well, and build them into overall processes that tame some of the things that worry us about AI in general. Here we have explored a focused use of LLMs, resulting in new processes that are both fast and efficient. For more about our approach, read our AI framework.
Photo by Marc Sendra Martorell on Unsplash