There was a paper published in Nature Scientific Reports recently claiming that AI models complete writing and illustration tasks 130-2,900 times more efficiently than humans. I broke down their calculations and highlighted some areas of uncertainty in this post. Here, I’ve made some different design decisions and repeated the calculations using Impact Framework. I’ll explain my decisions and walk through the calculations step by step in part 3 here; in this post I’ll just give a high-level summary of the different approaches and their outcomes.

My aim was:

  • to make a different set of design decisions from those in the original paper and compare the results

My aim is not necessarily to present my way as correct and their way as wrong, but to show the impact of a different set of assumptions and approaches. You can make up your own mind or, even better, fork the manifests and create more manifests of your own.

Note! This is a non-adversarial post. Carbon calculations are difficult and we are always working with incomplete data and proxies that are often abstracted quite far from the actual metrics we’d ideally like to observe. I respect the hard work that was put into the paper and am grateful that the authors made the paper and supplementary information open-access online. It’s only because the authors were open and transparent that critiques, refinements and alternative calculations are possible.

The starting point for this post is the original paper. I recomputed their results in an Impact Framework manifest file, matching each value in the computed manifest to cells in the calculation spreadsheet provided in their supplementary information. This helped me understand what was done to create the values in the original paper. It was surprising to see a broadly-scoped top-down estimate for human writers being compared to a narrow-scoped bottom-up estimate for AI. Therefore, I focused on creating alternatives that implemented different combinations of top-down and bottom-up estimates.

Main takeaways

  • Comparing top-down to bottom-up probably does not yield a fair comparison. Top-down calculations tend to overestimate because they include too many emissions factors in the application boundary, whereas bottom-up calculations tend to underestimate because components get missed. This is especially true in this case: the top-down calculation for human creators encompassed a huge range of emission factors unrelated to writing, whereas the AI creators had a very narrow scope that excluded important components such as the embodied carbon of the hardware used to serve inference.

  • In our modified versions, we compared bottom-up to bottom-up calculations for humans and the same two LLMs, and found them to be very similar (~20 g CO2e/page for ChatGPT, ~22 g CO2e/page for a human writer, ~10 g CO2e/page for BLOOM). However, two assumptions still favour the AI creators even in our bottom-up calculations: a) the finished page of writing is always generated from a single prompt, and b) we account for heating the writer’s workspace using natural gas, which is the largest component of the human writer’s carbon emissions.

  • In our bottom-up models, the efficiency of human writers and LLMs is roughly equivalent UNLESS a page requires more than one prompt to complete. Every additional prompt adds roughly one human writer’s total carbon per page, so a 2-prompt page emits about twice the carbon of a human writer and a 5-prompt page about 5x (see the sketch after this list). Since it is common for users to prompt multiple times (many LLM guides online advise repeating and refining prompts to get the response you want), it makes sense to relax the one-prompt-per-page constraint, meaning AI writers are probably less efficient than human writers in reality.

  • Excluding home heating and lighting (which could be considered equivalent to omitting data center lighting, temperature control, etc. from the AI estimates in the paper), a human writer emits about 1/4 of the carbon of ChatGPT or about half that of BLOOM.

  • A simple cost-based top-down estimate for ChatGPT exceeded the original paper’s top-down estimate for human writers when we assumed 3 queries per page of writing.

  • The estimate of carbon emissions for human writers or AI models is very sensitive to design decisions such as top-down vs bottom-up, the application boundary, and carbon intensity values. We need a combination of accepted standards and transparency to ensure estimates made by different groups are intercomparable.
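To make the prompts-per-page point concrete, here’s a minimal sketch using the rounded bottom-up figures above (~20 g CO2e per ChatGPT prompt, ~10 g per BLOOM prompt, ~22 g per page for a human writer). It simply scales the per-prompt figure by the number of prompts needed to finish a page; nothing here comes from the manifests beyond those rounded values.

```python
# Rounded per-page figures from the bottom-up estimates above (g CO2e).
HUMAN_PER_PAGE = 22.0      # human writer, including heating and lighting
CHATGPT_PER_PROMPT = 20.0  # ChatGPT, bottom-up estimate per query
BLOOM_PER_PROMPT = 10.0    # BLOOM, bottom-up estimate per query

def ai_per_page(per_prompt: float, prompts_per_page: int) -> float:
    """Carbon per finished page if each page takes `prompts_per_page` queries."""
    return per_prompt * prompts_per_page

for prompts in (1, 2, 3, 5):
    chatgpt = ai_per_page(CHATGPT_PER_PROMPT, prompts)
    print(f"{prompts} prompt(s)/page: ChatGPT ~{chatgpt:.0f} g CO2e "
          f"({chatgpt / HUMAN_PER_PAGE:.1f}x the human writer's ~{HUMAN_PER_PAGE:.0f} g)")
```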

The baseline

The baseline was the original paper. I simply reproduced their results in the form of an Impact Framework manifest file. This is top-down human vs bottom-up AI.

The result is:

Human: ~1,400 g CO2e/page of writing
AI: ~1 g CO2e/page of writing

Here’s the manifest and the output file.
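As a rough sanity check, the reproduced baseline implies an efficiency ratio of about 1,400x in favour of the AI, which sits within the 130-2,900x range claimed in the paper (that range covers writing and illustration across different models, so this is only an order-of-magnitude check):

```python
# Baseline figures reproduced from the original paper (g CO2e per page of writing).
human_g_per_page = 1400.0
ai_g_per_page = 1.0

print(f"Implied efficiency ratio: ~{human_g_per_page / ai_g_per_page:.0f}x")  # ~1400x
```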

Bottom-up human vs bottom-up AI

Here I made some adjustments to Tomlinson et al.’s (2024) approach to bottom-up estimates for AI. I did this for both BLOOM and ChatGPT. Let’s focus on the ChatGPT version here for brevity (you can see the BLOOM version in part 3).

The main differences are that we included the embodied carbon of inference as well as training, and included the servers as well as the GPUs. We used different sources, which indicated that a much larger number of GPUs was used than the original paper assumed. We also used different data to inform our estimates of operational carbon for training and inference.

In the end, we estimated the carbon per query for ChatGPT to be around 20 g (6.38 + 5.93 + 8.125 = 20.435 g CO2e/query), compared to Tomlinson et al.’s 1.34 g.

These folks estimated 4.32 g CO2e/query, seemingly just for the operational carbon component, about half of the corresponding value we suggest here. Our estimate is about 15x that of Tomlinson et al. (2024), who estimated 1.35 g/page.
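For reference, here is that per-query arithmetic as a snippet. The three values come from the breakdown above; how each one maps onto a specific component (training, embodied inference hardware, operational inference) is my assumption rather than something spelled out here, so treat the labelling as illustrative.

```python
# Per-query components of our bottom-up ChatGPT estimate (g CO2e).
# NOTE: only the values are taken from the text above; the mapping of each value
# to a specific component (training vs embodied vs operational inference) is assumed.
components_g = [6.38, 5.93, 8.125]
ours_per_query_g = sum(components_g)   # 20.435 g CO2e/query

tomlinson_per_query_g = 1.34           # Tomlinson et al. (2024)

print(f"Our estimate: {ours_per_query_g:.3f} g CO2e/query")
print(f"Ratio to Tomlinson et al.: ~{ours_per_query_g / tomlinson_per_query_g:.0f}x")  # ~15x
```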

Assuming a single query is used to generate a page of text, we get to ~20 g CO2e/page.

For the bottom-up human estimate, we decided that lighting, heating and laptop usage were inside the application boundary. We assumed the writer was in the UK and sourced values for average UK household gas and electricity consumption, along with the proportions accounted for by lighting and heating. We then scaled these down to the fraction of time and space used for writing. The same approach was taken for the laptop power consumption. In the end we arrived at a value of ~22 g CO2e/page (or ~4 g excluding home lighting and heating).
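To show the shape of that calculation, here is a minimal sketch of the bottom-up human boundary. The scaling logic follows the description above, but every numeric input below is an illustrative placeholder, not the actual UK consumption figures, carbon intensities or fractions used in the manifest (those are in part 3).

```python
# Sketch of the bottom-up human-writer boundary described above.
# ALL numbers are illustrative placeholders; the real UK consumption figures,
# carbon intensities and scaling fractions live in the manifest (see part 3).

HOURS_PER_YEAR = 8760

heating_kwh_per_year = 9000        # placeholder: household gas used for space heating
lighting_kwh_per_year = 300        # placeholder: household electricity used for lighting
gas_intensity_g_per_kwh = 185.0    # placeholder carbon intensity of natural gas
elec_intensity_g_per_kwh = 200.0   # placeholder carbon intensity of electricity

space_fraction = 0.1               # placeholder: writing space as a share of the home
hours_per_page = 1.0               # placeholder: writing time per page
laptop_power_w = 15                # placeholder: laptop power draw

heating_g = heating_kwh_per_year / HOURS_PER_YEAR * space_fraction * hours_per_page * gas_intensity_g_per_kwh
lighting_g = lighting_kwh_per_year / HOURS_PER_YEAR * space_fraction * hours_per_page * elec_intensity_g_per_kwh
laptop_g = laptop_power_w / 1000 * hours_per_page * elec_intensity_g_per_kwh

total_g = heating_g + lighting_g + laptop_g
print(f"heating {heating_g:.1f} g + lighting {lighting_g:.2f} g + laptop {laptop_g:.1f} g "
      f"= ~{total_g:.0f} g CO2e/page")
```

The point is the structure rather than the placeholder numbers: heating dominates, the laptop is a small fraction, and everything scales with the time and space attributed to writing.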

So the bottom-up estimates were roughly equal for human and AI writers, at ~22 and ~20 g CO2e/page respectively. However, this assumes one query per page. Some light googling and chatting to others in my network suggested 3 queries per page is reasonable, which triples the estimate for ChatGPT, causing it to substantially exceed the carbon emissions of a human writer.

Therefore, comparing bottom-up to bottom-up yields approximate equality between humans and AI when we assume 1 query per page; when we assume 3 queries per page, ChatGPT emits about triple the carbon of a human writer.

Here’s the manifest and output file assuming 3 queries per page, and the manifest and output file assuming 1 query per page.

bottom-up human vs bottom-up AI assuming 1 query per page

IF visualizer for bottom-up human vs bottom-up AI assuming 1 query per page

bottom-up human vs bottom-up AI assuming 3 queries per page

IF visualizer for bottom-up human vs bottom-up AI assuming 3 queries per page

Top-down human vs top-down AI

We simply retain the top-down estimate for human writers from Tomlinson et al. (2024) and attempt an equivalent top-down calculation for ChatGPT.

Our top-down estimate for ChatGPT used cost as the entry point. Microsoft’s reported Scope 1, 2 and 3 (location-based) emissions were divided by their revenue for the same year, yielding an estimate of the carbon per billion dollars of revenue: 539,325.842696 t, or 539,325,842,696.62 g CO2e.

OpenAI reportedly paid Microsoft $4 billion to train and serve ChatGPT, so we can estimate 4 * 539,325,842,696.62 = 2,157,303,370,786.48 g CO2e as the total carbon cost of the model.

According to this blog, ChatGPT serves 10,000,000 requests a day, or 3,650,000,000 queries per year.

So, we can estimate the carbon per query as:

2,157,303,370,786.48 / 3,650,000,000 = 591.04 g CO2e/query.

Assuming one query per page, the carbon per page is equal to the carbon per query, so 591.04 g CO2e.

If you assume three queries per page of text, then the carbon per page is 591.04 * 3 = 1,773.13 g CO2e.
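The whole top-down chain fits in a few lines; every input below is a figure already quoted above (Microsoft’s emissions-per-revenue intensity, the reported $4 billion spend, and 10 million requests per day).

```python
# Top-down, cost-based estimate for ChatGPT using the figures quoted above.
carbon_per_billion_usd_g = 539_325_842_696.62  # Microsoft Scope 1+2+3 emissions per $1bn revenue
openai_spend_billion_usd = 4                   # reported payment to Microsoft to train and serve ChatGPT
queries_per_year = 10_000_000 * 365            # 10 million requests per day

total_carbon_g = carbon_per_billion_usd_g * openai_spend_billion_usd
per_query_g = total_carbon_g / queries_per_year

print(f"Total: {total_carbon_g:,.2f} g CO2e")                        # ~2,157,303,370,786.48 g
print(f"Per query: {per_query_g:.2f} g CO2e")                        # ~591.04 g
print(f"Per page at 3 queries/page: {per_query_g * 3:,.2f} g CO2e")  # ~1,773.13 g
```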

Again, assuming one query per page, the carbon emissions of a human writer exceed those of the AI, but with 3 queries per page the AI’s emissions greatly exceed the human writer’s.

Here’s the manifest and output file for 1 query per page, and manifest and output file assuming 3 queries per page.

top-down human vs top-down AI assuming 1 query per page

IF visualizer for top-down human vs top-down AI assuming 1 query per page

top-down human vs top-down AI assuming 3 queries per page

IF visualizer for top-down human vs top-down AI assuming 3 queries per page

Summary

We’ve shown that the approach taken to calculate carbon emissions greatly influences the outcomes. This includes the decision to go top-down or bottom-up, what to include inside the application boundary, and the more nuanced decisions about coefficient values and the like that get made along the way. Depending on the approach taken, you can show AI exceeding humans, approximate equality, or humans exceeding AI. However, I personally feel that more than one query per page is realistic, and that it is hard to justify going top-down for one component and bottom-up for another. With these constraints, AI writers typically emit more carbon than human writers in my experiments.

Now you can hop to Part 3 to see the granular details of the calculations.