Daniel Rosehill

Making An LLM A Data Analysis Intern (Who Even Likes Reading Sustainability Reports!)

Have you ever spent a week wading through sustainability disclosures trying to get to the bottom of a niche but vital question of our time, like how the profitability of major oil and gas producers compares with the greenhouse gas emissions attributable to them (a lot)?

I have. Fun it was not.

I emerged more informed about our world, weary in the way that you tend to feel after spending extensive effort to find information that, ultimately, wasn't long on cause for optimism.

But I also left wondering why it seemed to take exhaustive research, and inordinate perseverance, to answer what should be a simple question, answerable quickly from public-source information.

Thanks to the powers of the internet put on steroids by the information-mining power of tools like ChatGPT, anyone motivated to find out the current temperature in Beijing or the approximate cost of a pizza in London can do so in about five seconds.

But try to answer this question and you're in for a bumpier ride:

"What if we added together the reported GHG emissions for Shell in 2023 and attempted to convert them into dollars at one of the various suggested social cost of carbon numbers? And what if we compared that number to their [financial metric]?"


LLMs' Value-Add In Data Analysis: Scaling Up First-Pass Insight Generation

What struck me when trying to do exactly this repetitively:

Reading and interpreting sustainability data is hard work, especially for non-experts like me scrambling furiously to cobble together our own data dictionaries (sustainability people: please take a few leaves out of the tech industry's handbook of documentation best practices. Data dictionaries can be one of them).

Another problem inherent in the way sustainability data is released to the vast pool of people who might scrutinise it: the tedious process of analysing it often precludes doing so at scale. And getting a wide-angle view of this kind of topic requires exactly that.

A core and underdiscussed contributing factor (we'll get to the more optimistic topic of potential solutions later)?

The kind of technical minutiae that many are wont to dismiss as unimportant detail, but which actually play a pivotal role in determining the size of the gap between what companies think are helpful ways of disclosing their sustainability imprint and the form of data liberation that might actually be helpful to those with a vested interest in understanding it (and very often, that's all of us).

Take the aforementioned glut of sustainability PDFs released by just about every company that has taken the view that doing so is either useful or expedient to its self-interest.

Many of the companies with the most significant volume of emissions are the most wont to release emissions data as long PDFs (these can come packaged in the form of "sustainability reports", "ESG data supplements" or one of a whole number of synonyms).


Reading this literature is a bit like stepping into a strange alternate universe in which you find yourself wondering whether these documents were written page by page by entirely different people.


You might find yourself scratching your head in confusion: isn't performance here actually getting worse? If so, why is that the headline? It's mildly crazy-making in a way that it really shouldn't be.

While some companies have made notable efforts to make these documents data-analysis friendly, many, if not the vast majority, have not.

Those inclined towards endorsing conspiracy theories would likely retort: "yes Daniel, that's the entire point. So that you don't read these things."

Call me naive, but I try to take a slightly more benevolent view.

I think that many mean well but overestimate the ability of an interested readership to actually glean useful information from their content, which is invariably heavy on visionary statements and commitments.

The norm, at least in terms of distribution format: a long and very glossy PDF that is rapturous reading for probably nobody. Potentially useful, perhaps, for investors and determined sector analysts zoned in on the pivots up and down in the line chart. For the great unwashed rest of us? Much harder to process.


Somewhere between the part where the company describes its vision and the index, however, are invariably the cold hard numbers outlining how much the company is emitting across the three conventional 'scopes' in which greenhouse gas emissions are measured (scopes 1, 2, and 3; oh, and because they are often imputed rather than measured, scope 3 emissions are frequently, unhelpfully, in a separate document!).

They frequently look very much like this (Amazon 2023).


Or this one from BP (also 2023):


The bitter irony:

The ever-rising drumbeat to be transparent and release data often - perversely - ends up simply creating so much attractively formatted data that getting to the kernel of the interesting datapoints becomes exasperatingly difficult (to say nothing of the units of measurement, or the unfortunate fact that metric tonnes of CO2 equivalents and millions of tonnes of CO2 equivalents can share the same acronym!).

Consider this page from p24 of BP's 2023 Sustainability Report (exhibit A). Notice also how much easier and more appealing it is to read about BP's net zero sales target instead of trying to assess whether the bar charts on the mid-left are getting bigger (worse) or smaller (better):


Zoom in or pull out the magnifying glass sitting in your filing cabinet!

The reported scope 3 emissions for 2023 (315 MtCO2) actually marked the second year-on-year increase since 2021. Their upstream impact is getting worse, not better:


For those trying to read these reports with a critical eye (as we all should be), three things are needed in abundance:

  • Coffee
  • Patience
  • The zoom shortcut on your web browser

All this means that, in the absence of these data being released in analyst-friendly formats (like APIs), the cause of transparency, which sustainability disclosures profess to serve, is, I suspect, just as often frustrated as enabled by these efforts.

In practice, sector analysts tracking these markets and diligent but overworked environmental reporters do much of the heavy lifting in bringing the critical and expert analysis that these documents demand (note, for example, that this morsel from BP's report was covered in sustainability and industry literature).

But even that (arguably) isn't a reliable system: if analysis of open data is packed into a closed-source analytical product, it's not very helpful to the general public.

Could LLMs (With An Injection of RAG) Help?

It was against the backdrop of this rather bleak endeavor that I decided I might as well see whether large language models (LLMs) could help us embattled analysts out, even just a little.

I did so with a sense of trepidation.

While my enthusiasm for the potential of LLMs is currently somewhat boundless, I have taken note that many of my friends are less bullish in their assessments of these models' capabilities.

Hallucinations and inaccurate data retrieval are certainly problems, but the charge I've often heard ("it's all bullsh*t") is also a vast overstatement in the other direction.

As probably only oddities like me do, I've used LLMs for everything from calculating comparative trade data to (yes, you're reading this right) attempting to graph the altitudes and speeds at which Iran fired ballistic missiles at Israel - and expressing the latter as multiples of the average cruising speed of an A330 (this was, I believe, my most interesting prompting experiment).


So the objective of this analysis did not (and does not) seem that fanciful to me:

As retrieving this data manually is so slow and painful, could we use LLMs (and RAG) to do what they're inherently great at: working with language and, increasingly, knowing when to enlist the help of sidekick APIs like Wolfram Alpha for computation?

To tell it like it was (and is), writing prompts and configuration texts to try to do this data analysis is gruelling work.

It took iteration upon iteration to bake in enough safeguards and instructions to start getting reliable outputs.

Even AI fiends are wont to dismiss prompt engineering as a "light" activity in AI. It's not. It's hard work. But the results are often what gives those of us enamoured by the pursuit the fire to keep trying.

So after I began receiving my first tentatively hopeful results, I decided to try to kick things up a notch:

I ran my prompt through an LLM, told it what the objective was (comparing sustainability performance with financial profitability) and asked it what additional parameters it thought were worthy of including in the analysis.


This is another area where LLMs shine brightly: they're an on-demand brainstorming partner that allows us to think beyond the confines of our often limited thought processes.

The first run on that attempt yielded five additional variables to feed into the analysis. But the current limitations around context and tokenisation demanded far more prudence.

My mind was urging:

Gather 10 datapoints, not 5! Calculate correlations! Chart the data (or generate a Plotly script to chart the data with!).

But I yielded to the less exciting but more responsible methodology of starting small and iterating slowly. Still, I saw promising signs of hope.

Deployment methods


Finally, let's get down to the technical brass tacks of how this proof of concept actually worked:

First, I validated this as a LangChain script writing out the reports as markdown files, and as a chatbot (frontend: OpenAI).
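For the curious, the shape of that script was roughly as follows. This is a stripped-down sketch rather than my production config: the model name, prompt wording, and file layout are all placeholders.

```python
# Minimal LangChain sketch: one company name in, one markdown report out.
# Assumes OPENAI_API_KEY is set; model and prompt text are illustrative only.
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a data research assistant. Retrieve the company's "
               "2023 scope 1, 2 and 3 GHG emissions with sources, sum them, "
               "and report the result as markdown. If you cannot find the "
               "data, say so instead of guessing."),
    ("user", "{company}"),
])

chain = prompt | llm

def write_report(company: str, out_dir: Path = Path("reports")) -> Path:
    """Run the chain for one company and persist the markdown output."""
    out_dir.mkdir(exist_ok=True)
    result = chain.invoke({"company": company})
    path = out_dir / f"{company.lower().replace(' ', '-')}.md"
    path.write_text(result.content)
    return path

if __name__ == "__main__":
    print(write_report("Shell"))
```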

Next, I attempted a version which does what is ultimately the objective: gathering the data for five companies and comparing them against one another. However, this drastically increases the context load on the LLM. Not only does each datapoint need to be gathered five times over and held within one context window; all the sources need to be gathered too.

Research shows, however, that even when you're safely within context limits, response coherence can degrade. So right now the best advice is probably this: chunk long multi-step prompts to keep the task workload light. But even small prompts can gather together helpful outputs.
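In practice, "chunking" here meant something like the sketch below: one small extraction call per company, then a comparison pass over the compact results, rather than one giant multi-company prompt (model name and prompt wording are illustrative).

```python
# Illustrative chunking: one focused extraction call per company, then a
# final comparison call over the compact results, instead of one huge prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "gpt-4o"   # placeholder model name

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

companies = ["Shell", "BP", "Amazon"]

# Step 1: a small prompt per company keeps each context window light.
extracted = {
    name: ask(f"Summarise {name}'s reported 2023 scope 1-3 GHG emissions "
              f"as three numbers with units, citing the source report.")
    for name in companies
}

# Step 2: the comparison prompt only sees the compact summaries, not the PDFs.
comparison = ask("Compare these companies' 2023 emissions:\n"
                 + "\n".join(f"{k}: {v}" for k, v in extracted.items()))
print(comparison)
```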

Other cause for hope: some day reasonably soon, even this challenge will be a relic of the past and it will seem laughable that the first LLMs to reach mass adoption were once so "primitive".


Long-context models and more powerful underlying hardware are chipping away at the context challenge posed by the very nature of tokenising text. One bright day, this challenge will be history. But for now, we're limited to being strategic in working with the AI at our disposal, even though it's already very powerful.

But even beginning from a modest starting point (gather three datapoints, multiply them by one multiplier, format them in one format), the results have given cause for optimism - something I was left short on when reading these reports by hand.

What's also been impressive as I have analysed and validated the results: not only can the models pick out the right datapoints (no small feat when even the units of measurement are inherently confusing, like millions of metric tons of carbon dioxide equivalents), they can perform accurate mathematics, including when the financial figures are denoted in notation like $23.4M (determining 'M' to be shorthand for millions).


Tools That Make This Work (Data Gathering, Analysis, And Visualisation Stack)

So although it's tempting to attempt to get an LLM to do literally all of the analysis and visualisation workload asked for here (believe me, I tried!), it's probably both more realistic and responsible to try to divvy up the task between services.

But the core of the stack here - and why I remain hugely optimistic about their role in reshaping all manner of tech workflows - is the large language model (LLM) itself.

A modern LLM might be using RAG to pull in the data and an external API to do the actual number-crunching. But it's still the indispensable orchestrator that makes this newfound symphony of AI strike up.

RAG might ground it in up-to-date data, and the other tools can increase its data-crunching powers. But these are tools we've had easy access to for some time now. The LLM is the pivotal bridge between the world of human language and the technical universe within which it sits.


When using LLMs for data analysis, choose your model carefully

But as I discovered through trial and error, not just any LLM will do.

For one, this is an instructional use case and not much of a conversational one. Ideally, this use-case isn't even a conversation at all: the user names a company and gets back a report. We don't need our model to be much of a talker.

We do, however, need a model that can think and analyse.

Data analysis workflows demand abilities that span several domains: reasoning, analysis, and an understanding of computation - so choose a model or variant that plays to those strengths (despite its high API pricing, I would suggest that OpenAI's o1 series ticks these boxes well).

A model that has these sustainability disclosures in its training data would obviate the need for RAG entirely - so a recent training data cutoff is another natural fit. And one important point: even if the model can retrieve the data from a RAG pipeline, having it in the model's own knowledge provides something else that's useful: a benchmark for corroboration.

Accessory modules:

  • Wolfram Alpha is an outstanding add-on for computation and has an API that's easy to access (a minimal sketch follows this list).
  • The list of promising data visualisation tools is a bit shorter, but visualisation probably stretches past the part of this process that it makes sense to hand to AI anyway.
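As an illustration of the computation hand-off, here is a minimal sketch against what I understand to be Wolfram|Alpha's Short Answers endpoint (you need your own AppID; the query string is just an example):

```python
# Hand a computation off to Wolfram|Alpha's Short Answers API.
# Requires a WOLFRAM_APPID environment variable; the query is illustrative.
import os
import requests

def wolfram_compute(query: str) -> str:
    resp = requests.get(
        "https://api.wolframalpha.com/v1/result",
        params={"appid": os.environ["WOLFRAM_APPID"], "i": query},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text

# e.g. let the LLM phrase the arithmetic and the API do the multiplication
print(wolfram_compute("315 million metric tons * 236 USD per metric ton"))
```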

Once you've assembled this stack, you've theoretically got:

  • Something to gather the data
  • Something to parse the data
  • Something to visualise the data

Then scale all of this up to:

  • Gather & parse reports at scale
  • Parse and analyse data at scale
  • Generate insights and comparisons at scale

Variations & Config Notes

The configuration text is below, but here are a few notes about some of the deliberate choices of language it contains:

Hallucination prevention is key to this kind of workload and the hurdle to overcome when trying to sell this idea to those who haven't, like me, guzzled down liters of AI Kool Aid.

Although my intention in developing this was to create an assistant that frontloads the task of human review (hence the demand for sources), you still want to include robust language to break the chain if data can't be found.

Consider cues like:

Only use X as a data source. If you cannot retrieve the sustainability data for this financial year, inform the user that you cannot complete the task.

The variable (X) can refer to either a single discrete source or a whitelist of sources you trust to be accurate.

Other slightly more passive-aggressive instructions can be added on for further insistence like:

Do not return speculative information under any circumstance.

The checkpoint that I ultimately baked into my script was this one:

If you are certain that you can retrieve and parse the requested data for {company} continue. If you cannot, inform the user that you cannot reliably perform the requested analysis.

Output formatting

Another thing I love about prompt engineering and why working with information via an LLM is (to my mind) so much more engrossing than doing so with a search engine: the ability to shape how you want to receive the information retrieved.

In this case - and when working with LLMs generally - adding verbiage to instruct the LLM to favor a specific output format is the go-to technique.

The output format can be markdown (if using this as a script; I got mine to fill up .md files for every company within a GitHub repository).

Or you can configure an OpenAI Assistant to provide JSON for easier intake into analytics and automation workloads.

In OpenAI Platform, change the response format selector:

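If you're driving this from code rather than the Platform UI, the rough equivalent is the response_format parameter on a chat completion (model name and prompt text here are illustrative):

```python
# Ask the model for JSON output via the API (roughly the equivalent of the
# Platform UI's response format selector); the prompt text is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": "Return the requested emissions data as a JSON object."},
        {"role": "user", "content": "Amazon"},
    ],
)
print(response.choices[0].message.content)  # a JSON string
```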

The latter is particularly powerful as it allows the automation to be configured in something like (say) Zapier.

Here's one workflow that would function very efficiently:

  • LLM gathers raw sustainability report data (source: training data / RAG pipeline / API call)
  • If required, PDF extraction API parses it (optional; LLM can also handle)
  • Computational API performs calculation (optional; mathematical LLMs can also handle)
  • Data visualisation tool plots requested charts
  • Processed analysis returned to user / BI system

If you're using this method, you can be more deterministic about where you want the data to be written to.

For example:

  • Parse the JSON with a JSON processor, identify and extract the CSV row returned by the Assistant, and add that as a new row of data to your analytics database/data warehouse (a minimal sketch follows this list).
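A minimal sketch of that step, using SQLite as a stand-in for the analytics database (the "csv_row" key and the column names are my own assumptions, not part of the Assistant config):

```python
# Parse the Assistant's JSON reply, extract the CSV row it contains, and
# append it as a new row. SQLite stands in for the analytics database;
# the "csv_row" key and the column names are assumptions for illustration.
import json
import sqlite3

def store_row(assistant_reply: str, db_path: str = "sustainability.db") -> None:
    payload = json.loads(assistant_reply)
    fields = [f.strip() for f in payload["csv_row"].split(",")]
    if len(fields) != 4:
        raise ValueError("Unexpected CSV shape in Assistant output")

    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS emissions "
            "(company TEXT, total_emissions TEXT, monetised_usd TEXT, ebitda TEXT)"
        )
        conn.execute("INSERT INTO emissions VALUES (?, ?, ?, ?)", fields)
```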

Evaluation and Supervision

A note or two about evaluation and supervision, as it's clearly of high importance when trying to leverage AI for real world data analysis:

The configuration retrieves the source data both for the sake of accuracy and to enable evaluation/supervision/QA activities.

So, in outline (a rough code sketch follows this list):

  • Test the LLM on a run of X companies
  • Evaluate against actual sustainability report data
  • Determine accuracy
  • Iterate for improvements
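In code, that loop can be as unglamorous as comparing the model's extracted figures against a hand-checked reference file (the CSV columns here are my own convention):

```python
# Bare-bones evaluation: compare model-extracted totals against hand-checked
# reference values and report how often they agree within a tolerance.
# The CSV columns ("company", "expected", "extracted") are my own convention.
import csv

def evaluate(results_path: str = "llm_results.csv", tolerance: float = 0.01) -> float:
    with open(results_path, newline="") as f:
        rows = list(csv.DictReader(f))
    correct = sum(
        1 for row in rows
        if abs(float(row["extracted"]) - float(row["expected"]))
        <= tolerance * abs(float(row["expected"]))
    )
    return correct / len(rows) if rows else 0.0

print(f"Accuracy within 1%: {evaluate():.0%}")
```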

Input: Tokenisation And Context

As mentioned, context is a tough obstacle in the fight to squeeze every potential jot of data processing power out of these exciting technologies.

The configuration I wrote below is about 500 words, which is comfortably within the input limits of major LLM and assistant platforms (e.g. OpenAI's Assistants feature).

The configuration is designed to return the output on receipt of a simple one-word prompt: the company name.

This can also be used for programmatic, template-based prompting by calling an API endpoint and passing the variable, like this:

https://api.yourllm.com/data-analysis-llm-endpoint/companyname=thecompany
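That endpoint is notional, of course; with OpenAI's Assistants API the same "company name in, report out" pattern looks roughly like this (the assistant ID is a placeholder):

```python
# Notional "company name in, report out" call against a pre-configured
# Assistant. The configuration text lives in the Assistant itself, so the
# prompt is just the company name. The assistant ID is a placeholder.
from openai import OpenAI

client = OpenAI()
ASSISTANT_ID = "asst_XXXXXXXX"  # placeholder

def run_report(company: str) -> str:
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=company
    )
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=ASSISTANT_ID
    )
    if run.status != "completed":
        raise RuntimeError(f"Run ended with status: {run.status}")
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    return messages.data[0].content[0].text.value

print(run_report("Shell"))
```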

Now, finally, the Assistant config:

Version for consistent CSV output

You are a data research assistant. Your task is to do the following.

Ask the user to provide the name of a company.

If you are not sure which company it is, disambiguate it. If you are reasonably sure, proceed.

Retrieve the following:

Emissions Data Checkpoint

If you can establish that the company released its GHG emissions data for 2023, proceed to the next step. If not, inform the user that no data could be retrieved and end the interaction.

Emissions Data Gathering

If you have these data, retrieve them. Validate them. Note that the company may not have reported all of these datapoints. If a scope wasn't reported, note that and record it as 0.

  • Scope 1 emissions
    - Retrieve the value and units of reporting
  • Scope 2 emissions
    - Retrieve the value and units of reporting. If location-based and place-based emissions are reported, choose the place-based emissions.
  • Scope 3 emissions
    - Retrieve the value and units of reporting

Calculated Values: Emissions

Sum together the value of these three emissions. This variable is {total-emissions}.

Report the unit of measurement both as an abbreviation and spelled out: for example MtCO2e (millions of tons of carbon dioxide equivalents).

Report these data:

  • Report URL
  • Report title
  • Report publication date

Financial performance

Find the company's EBITDA for year end 2022.

Provide its source (URL, title, publication date).

Report this value correct to two decimal places.

Final calculation

Take:

{total-emissions}:

  • If the reporting unit is millions of tonnes of co2e multiply it by 236,000,000.
  • If the reporting unit is tonnes of co2e multiply it by 236.

This figure is monetised total emissions and is denominated in USD. Report it correct to two decimal places (e.g. $23.23BN).


Report Format

Gather all the data and follow exactly this template, outputting the data you gather as CSV within a codefence:

 your,header,row

Notes

  • Provide CSV header row to gather data in a standard format, if desired.
  • Thumbnail can be omitted; it's there for data visualisation.
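As a sanity check, the "Final calculation" step in the config above is trivial to reproduce outside the model (the $236/tonne figure comes straight from the config; the function is my own):

```python
# Reproduce the config's monetisation step: $236 per tonne of CO2e, so
# values reported in millions of tonnes get multiplied by 236,000,000.
def monetise_emissions(value: float, unit: str, price_per_tonne: float = 236.0) -> float:
    """Return monetised emissions in USD for a reported value and its unit."""
    if unit == "millions of tonnes of co2e":
        return value * price_per_tonne * 1_000_000
    if unit == "tonnes of co2e":
        return value * price_per_tonne
    raise ValueError(f"Unrecognised reporting unit: {unit}")

# e.g. the 315 MtCO2e scope 3 figure from BP's report mentioned earlier
print(f"${monetise_emissions(315, 'millions of tonnes of co2e'):,.2f}")
```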

Demo

User input:

Output is a formatted analysis using the requested calculations (reported carbon dioxide equivalents emitted, with a simulated 'monetisation' at the $236 per metric tonne of CO2e rate proposed by the International Foundation for Valuing Impacts):

QA / Validation

Data: Amazon 2023 (GHG Emissions) And 2022 (Year-End EBITDA)

Validated: emissions:


Validated: EBITDA:



End Data Visualisation Chain

The power of this approach for first-pass data analysis and visualisation lies in the enormous flexibility provided by asking large language models (LLMs) to format their output according to specific instructions.

For example, for markdown, with PDF as the ultimate target:

Enclose the output within a codeblock, formatted as markdown

For CSV, with ingestion to a database/analytics tool/data warehouse as the objective:

Output the gathered data as raw CSV enclosed within a codeblock

To achieve a consistent output format on repeated/batch runs, a standard header row can be written and then passed as an instruction:

Output the gathered data as raw CSV enclosed within a codeblock. Adhere precisely to this format.

You can even request that the header row be included to validate that the format instruction has been followed, and then strip it from the output if it has:

Output the gathered data as raw CSV enclosed within a codeblock. Return the header row (row 1) and then your findings (row 2).
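Post-processing that is then a couple of lines: check the first row against the header you asked for, and strip it before ingestion (the expected header below is just the placeholder from above):

```python
# Validate that the returned CSV begins with the requested header row, then
# strip it before ingestion. The expected header is the placeholder from above.
EXPECTED_HEADER = "your,header,row"

def strip_header(raw_csv: str) -> str:
    lines = raw_csv.strip().splitlines()
    if not lines or lines[0].strip() != EXPECTED_HEADER:
        raise ValueError("Output did not follow the requested CSV format")
    return "\n".join(lines[1:])
```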
