Deep diving "The carbon emissions of writing and illustrating are lower for AI than for humans": Part 1
There was a paper published in Nature Scientific Reports recently that claimed that writing and illustration tasks are completed by AI models 130-2900 times more efficiently than by humans. This eye-catching claim piqued my interest, as it did for many other folks in the green software space. There were a lot of conflicting takes on the results and the methods used in the paper, so I decided to reproduce the results myself using the Impact Framework.
My aims were:
- to understand the calculations that were done so I could better evaluate the results
- to represent the calculations and results in a shareable Impact Framework manifest file format so that I and others can use it as the basis for experimentation
- to fork and modify the baseline manifest to see the effects of changing the approach and methodological details
This experiment will be split into three parts:
- Part 1 (this post) will break down what was done in the original study
- Part 2 will show the results from changing the calculation methodology in various ways
- Part 3 will be a detailed, step-by-step walkthrough of all the changes that were made in Part 2
You can also hop straight to my manifest files here.
Note! This is a non-adversarial post. Carbon calculations are difficult and we are always working with incomplete data and proxies that are often abstracted quite far from the actual metrics we’d ideally like to observe. I respect the hard work that was put into the paper and am grateful that the authors made the paper and supplementary information open-access online. It’s only because the authors were open and transparent that follow-on critiques and refinements are possible.
What does the paper show?
The paper aims to compare the carbon costs of two creative tasks, writing and illustrating, between human creators and AI models. For writing, the authors estimate the carbon emitted by Chat-GPT and BLOOM per page of text and compare that against the carbon emitted by a human writing the same. For illustrations, human illustrators are compared against DALL-E2 and Midjourney. For each AI model, they take into account the operational carbon associated with training and inference, plus the embodied carbon of the hardware used in training. For human creators, they take the total carbon emitted by a person in a year, in the USA and India separately, and scale that down to the hours a person might spend writing annually. The authors also do an additional comparison between the AI models and home computers being used for word processing or digital illustration, by taking annual emissions estimates for laptops and desktops and scaling them by the writing time per year. These values are then compared to yield the following results (these are ratios of the carbon emissions for each pair of creators):
WRITING
-------
Laptop vs. ChatGPT: 20.06224896
Laptop vs. BLOOM: 28.334896601008
Desktop vs. ChatGPT: 53.56125109
Desktop vs. BLOOM: 75.6471776504275
US human vs. ChatGPT: 1060.282902
US human vs. BLOOM: 1497.48946154968
India human vs. ChatGPT: 134.302501
India human vs. BLOOM: 189.681998490944

ILLUSTRATION
------------
Laptop vs. DALL-E2: 46.65807808
Laptop vs. Midjourney: 54.56488158
Desktop vs. DALL-E2: 124.565548
Desktop vs. Midjourney: 145.674761
US human vs. DALL-E2: 2465.863251
US human vs. Midjourney: 2883.735074
India human vs. DALL-E2: 312.3426784
India human vs. Midjourney: 365.2731094
The paper’s abstract references the human vs AI ratio range 130 - 2900, where 130 is rounded down from a ratio of 134 for a human in India vs ChatGPT for writing, and 2900 is rounded up from a ratio of 2883 for a human in the USA vs Midjourney.
The calculations
In this section, I’ll break down what I think was done. I went through the spreadsheet provided in the paper’s supplementary info and recreated the operations done there in a manifest file. The results and intermediate values all agree with what’s in the sheet, so I’m pretty confident it’s a faithful representation of what’s in the paper. I’ll break this down into sections for each task (writing and illustration) and sub-sections for each creator.
Writing
BLOOM
The IF component that calculates the carbon per written page for the BLOOM model looks as follows. I won’t copy out the plugin config for each plugin in the pipeline, as it would take too much space, but you can explore them here. Hopefully the notes here and the descriptive instance names are enough to see what’s going on.
bloom-model:
  pipeline:
    compute:
      - divide-training-carbon-by-queries-per-month-bloom
      - divide-inference-carbon-by-n-queries-bloom
      - sum-embodied-carbon-components-bloom
      - scale-embodied-carbon-to-training-time-period-bloom
      - scale-embodied-carbon-to-queries-per-month-bloom
      - sum-training-and-inference-carbon-per-query-bloom
      - convert-carbon-per-query-to-carbon-per-word-bloom
      - sum-words-per-query-and-embodied-carbon-per-query-bloom
      - calculate-total-footprint-per-page-bloom
  defaults:
    training-carbon-bloom: 30000000
    queries-per-month-bloom: 300000000
    carbon-per-n-queries-bloom: 340000
    n-queries-in-bloom-study: 230768
    bloom-gpu-embodied-carbon: 12439.69798
    fraction-of-gpu-lifetime-used-for-training-bloom: 0.08238203957
    embodied-energy-of-a100-gpu: 150000
    carbon-footprint-for-recycling-gpu: 1000
    words-per-bloom-response: 412.8
    words-per-page: 250
  inputs:
    - timestamp: '2024-09-01T10:00:00'
      duration: 86400
The BLOOM model estimates were generated by first grabbing a total training operational carbon emissions value (30 metric tonnes) from this paper and dividing it by an assumed 300,000,000 prompts per month (10,000,000 prompts per day, as per this Medium article). This yielded the training carbon allocated per query (0.1 gCO2e/query).
Then they calculated the operational carbon for model inference, taking values from the same paper, in which BLOOM's carbon emissions were measured over a period during which 230,768 queries were made. The authors divide the carbon emissions by the number of queries to yield the carbon per query (1.47 gCO2e/query).
The carbon-per-query values for training and inference were summed to give an overall operational carbon value (1.57 gCO2e/query).
The embodied carbon was then accounted for by summing the embodied carbon of an A100 GPU (150 kg, from the aforementioned BLOOM study) and the embodied carbon of recycling an A100 GPU (1 kg), giving a total-embodied-carbon of 151 kg. This was then scaled by the ratio of the training time to the GPU lifespan (1082 hours / 1.5 years = 0.08238). This gave an embodied carbon of 12.439 kg CO2e, which was then amortized over the 300,000,000 monthly queries to give an embodied carbon-per-query of 4.1465 × 10⁻⁵ g/query.
The total carbon was then normalized per word on a written page, rather than per query. If I’ve followed the logic correctly, this was done by grabbing a value for the average number of words on a page (250, from https://wordcounter.net/words-per-page) and the average number of words a writer writes per hour (300, also from https://wordcounter.net/words-per-page). Dividing these gives a writing rate in hours/page (0.83333).
The total operational carbon per query was divided by the writing rate in hours/page, and the result was divided by the sum of the number of words in a typical Chat-GPT response (calculated from the number of words in 10 responses the authors made to Chat-GPT, and assumed to also apply to BLOOM) and the embodied carbon per query.
They report a final value of 0.95289 gCO2e/page.
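The chain of operations above can be reproduced in a few lines (a hedged sketch using the values quoted above; the variable names are mine, not the paper's or the manifest's). The final per-page step here is the simple per-word scaling (carbon per query ÷ words per response × words per page), which reproduces the reported 0.95289 almost exactly; the paper's exact formula is discussed in the next section.

```python
# Sketch of the BLOOM calculation described above (all values from the text).
training_carbon_g = 30_000_000                  # 30 t of training emissions
queries_per_month = 300_000_000                 # assumed 10M prompts/day
training_per_query = training_carbon_g / queries_per_month         # 0.1 g/query

inference_per_query = 340_000 / 230_768         # ~1.47 g/query, from the BLOOM study
operational_per_query = training_per_query + inference_per_query   # ~1.57 g/query

# Embodied carbon: one A100 (150 kg) plus recycling (1 kg), scaled by the
# fraction of the GPU lifetime spent training, then amortized per query.
embodied_g = (150_000 + 1_000) * 0.08238203957       # ~12.44 kg
embodied_per_query = embodied_g / queries_per_month  # ~4.15e-5 g/query

# Normalize to a page: 412.8 words per response, 250 words per page.
carbon_per_page = operational_per_query * 250 / 412.8
print(round(carbon_per_page, 3))  # → 0.953
```

Running this confirms the intermediate values quoted above and lands within rounding distance of the paper's 0.95289 gCO2e/page.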
Things I haven’t understood about the BLOOM model calculations:
1) The final calculation step:
I’ve so far failed to understand why a summation of the embodied-carbon-per-query and words-per-response makes sense as a denominator here. To be clear, this is what was done:
total-carbon-per-page = (operational-carbon-per-query * words-per-response) / (words-per-response + embodied-carbon-per-query)
Perhaps it is supposed to be a multiplication rather than an addition in the denominator here, which, if we are assuming one page per response, would yield embodied-carbon-per-word. At the same time, I’m not sure why it makes sense to multiply operational-carbon-per-response (which we assume is equal to operational carbon per page) by words-per-response. To me, this term represents embodied carbon per 250 pages. To get embodied carbon per word, you surely want to divide embodied-carbon-per-page by words-per-page?
2) Why do they only use 1 GPU, no servers, and a short training time in the embodied carbon estimates
They use a value for A100 GPU embodied carbon from this BLOOM study of 150 kg and amortize it across queries and time to account for the embodied carbon of model training. However, that same study states that 384 of those GPUs were used to train the BLOOM model, whereas only one is accounted for here. The cited BLOOM study suggested 7.57 tonnes of embodied carbon for the servers used (omitted here) and 3.67 tonnes of embodied carbon for the GPUs, giving a total of ~11 tonnes of embodied carbon to amortize, whereas 150 kg were accounted for here. Also, only one GPU was accounted for in the recycling step. Plugging 11 tonnes into the manifest in place of the original 150 kg, and 384 kg for recycling instead of 1 kg (because there are 384 GPUs to account for), increases the estimated carbon per page from 0.95 g to > 4 g. I can see that they note in the supplementary info that they extend the time that a single GPU was used instead of accounting for multiple GPUs running in parallel. But the training time used is 1082 hours (45 days), whereas in the BLOOM model paper the coefficients were derived from a reported training time of 118 days across 384 GPUs (totalling 1,082,990 GPU hours, compared to the 1082 GPU hours used here), so the time extension doesn’t immediately seem to account for the missing GPUs.
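A quick back-of-the-envelope check of the GPU-hour gap, using only the figures quoted above:

```python
# GPU hours used in the paper's embodied-carbon scaling: one GPU for ~45 days.
gpu_hours_in_paper = 1082

# GPU hours behind the BLOOM study's own coefficients.
days, gpus = 118, 384
reported_gpu_hours = 1_082_990

print(days * 24 * gpus)                         # 1087488, close to the reported total
print(reported_gpu_hours / gpu_hours_in_paper)  # roughly a 1000x gap
```

So even with the time extension, the paper's embodied-carbon scaling appears to cover about a thousandth of the GPU hours the BLOOM study reports.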
3) Why do they include embodied carbon of the hardware used for training but not for inference?
The paper only accounts for embodied carbon for the hardware used to train the model, but there is no embodied carbon component for the hardware used to serve responses to queries. I’m not sure the rationale for excluding this component from the application boundary.
4) If a human writer’s impact should include the totality of the human’s carbon emissions, not just those directly related to writing (see later sections for details), shouldn’t we also account for the human prompting the AI? If we account for a human writer’s heating, lighting, laptop etc., then these things are also required to prompt an AI.
Chat-GPT
The IF component that calculates the carbon per written page for the Chat-GPT model looks as follows:
chat-gpt-model:
  pipeline:
    compute:
      - divide-training-carbon-by-queries-per-month-chat-gpt
      - divide-inference-carbon-by-n-queries-chat-gpt
      - sum-training-and-inference-carbon-per-query
      - sum-embodied-carbon-components-chat-gpt
      - scale-embodied-carbon-to-training-time-period-chat-gpt
      - scale-embodied-carbon-to-queries-per-month-chat-gpt
      - convert-carbon-per-query-to-carbon-per-word-chat-gpt
      - sum-words-per-query-and-embodied-carbon-per-query-chat-gpt
      - calculate-total-footprint-per-page-chat-gpt
  defaults:
    training-carbon-chat-gpt: 552000000 # from https://medium.com/@chrispointon/the-carbon-footprint-of-chatgpt-e1bc14e4cc2a
    queries-per-month-chat-gpt: 300000000 # from https://medium.com/@chrispointon/the-carbon-footprint-of-chatgpt-e1bc14e4cc2a
    operational-carbon-per-day-chat-gpt: 3820000 # from https://medium.com/@chrispointon/the-carbon-footprint-of-chatgpt-e1bc14e4cc2a
    queries-per-day-chat-gpt: 10000000 # from https://medium.com/@chrispointon/the-carbon-footprint-of-chatgpt-e1bc14e4cc2a
    chat-gpt-gpu-embodied-carbon: 36974.31507
    fraction-of-gpu-lifetime-used-for-training-chat-gpt: 0.2448630137 # 3217.5 hours out of 1.5 year lifetime
    embodied-energy-of-a100-gpu: 150000
    carbon-footprint-for-recycling-gpu: 1000
    words-per-chat-gpt-response: 412.8
    words-per-page: 250
  inputs:
    - timestamp: '2024-09-01T10:00:00'
      duration: 86400
The Chat-GPT calculations were done in a similar way to the BLOOM model, subbing in some values.
The operational carbon associated with model training was 552 metric tonnes, taken from this paper (compared to 30 tonnes for BLOOM). This was divided by one month’s worth of prompts, giving a training-carbon-per-prompt of 1.84 g/query.
The operational carbon associated with model inference was calculated by dividing Chris Pointon’s estimate for the daily carbon emissions of inference, 3.82 tonnes, by the prompts per day taken from the same article. This yielded a value of 0.382 gCO2e/query.
The total operational carbon associated with Chat-GPT was the sum of the training and inference operational carbon: 1.84 + 0.382 = 2.22 g/query.
The embodied carbon was calculated in the same way as for the BLOOM model. The embodied carbon of an A100 GPU (150 kg) and its recycling (1 kg) were summed to give the total embodied carbon of the Chat-GPT hardware. Note that the recycling value is derived from a blog post that suggests 1 kg as a recycling emission value for a laptop, and it is assumed to be the same for an A100 GPU here. The total embodied carbon was scaled by the ratio of the training time (assumed to be 3217.5 hours) to the GPU lifespan (assumed to be 1.5 years).
Again, the sum of the operational and embodied carbon components for training and inference was converted into a carbon-per-page value using the same method as explained above for BLOOM.
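As with BLOOM, the Chat-GPT numbers can be sketched quickly (a hedged reconstruction from the manifest values above; the final step again uses the simple per-word scaling described in the BLOOM section, and the variable names are mine):

```python
# Sketch of the Chat-GPT calculation described above (values from the manifest).
training_per_query = 552_000_000 / 300_000_000   # 1.84 g/query
inference_per_query = 3_820_000 / 10_000_000     # 0.382 g/query
operational_per_query = training_per_query + inference_per_query  # ~2.22 g/query

# Embodied: one A100 (150 kg) + recycling (1 kg), scaled to the training fraction.
embodied_g = (150_000 + 1_000) * 0.2448630137    # ~36.97 kg, matches the manifest
embodied_per_query = embodied_g / 300_000_000    # ~1.2e-4 g/query

# Normalize to a page: 412.8 words per response, 250 words per page.
carbon_per_page = operational_per_query * 250 / 412.8  # ~1.35 gCO2e/page
```

This per-page figure (~1.35 g) is consistent with the headline ratios: dividing the human per-page values in the next section by it recovers numbers close to those in the results table.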
Things I haven’t understood about the Chat-GPT model calculations:
1) Why the conversion to carbon/page works
It’s the same uncertainty I outlined for the BLOOM model, as the method was identical for Chat-GPT; working through it again for Chat-GPT did not make it any clearer to me.
2) Why do they only use 1 GPU, no servers, and a short training time in the embodied carbon estimates
It’s the same query as for the BLOOM model - I won’t hash out the details again here, except to point out that the training time used is 3217.5 GPU hours. It’s not clear where these timings were derived from, but other researchers estimate ~34 days of training time using 10,000 GPUs, giving 340,000 GPU days (or 8,160,000 GPU hours). It’s a large difference.
3) Why do they include embodied carbon of the hardware used for training but not for inference?
As above for BLOOM. The paper only accounts for embodied carbon for the hardware used to train the model, but there is no embodied carbon component for the hardware used to serve responses to queries. I’m not sure the rationale for this.
4) As above for BLOOM: if a human writer’s impact should include the totality of the human’s carbon emissions, not just those directly related to writing (see later sections for details), shouldn’t we also account for the human prompting the AI? If we account for a human writer’s heating, lighting, laptop etc., then these things are also required to prompt an AI.
Human writers
The IF manifest component for human writers looks as follows:
human-usa:
  pipeline:
    compute:
      - compute-writing-rate-in-hours-per-page
      - multiply-human-carbon-footprint-by-writing-rate
  defaults:
    writing-rate-words-per-hour: 300 # from https://www.writermag.com/writing-inspiration/the-writing-life/many-words-one-write-per-day/
    writing-rate-words-per-page: 250 # Common knowledge, or https://wordcounter.net/words-per-page
    human-footprint-in-gco2-per-hour: 1712.328767123329 # derived from https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions
  inputs:
    - timestamp: '2024-09-01T10:00:00'
      duration: 86400
The carbon emissions for human writers were calculated by taking a value for the total annual carbon emitted by a person in the USA and in India from Our World in Data, scaled to an hourly value (1712 gCO2/hour in the USA, 216 gCO2/hour in India). This was multiplied by a writing rate measured in hours/page (derived from the values for average words written per hour and words per page, as explained in the BLOOM model section). The product of these two values is taken to be the carbon emissions of a human writing a page of text.
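Working backwards from the manifest's hourly value, the US figure corresponds to roughly 15 tonnes per person per year. A sketch of the per-page arithmetic (my variable names; the Chat-GPT per-page value of ~1.3457 g is the one derived in the Chat-GPT section above):

```python
# US per-capita emissions of ~15 t/year spread over 8760 hours in a year.
us_hourly = 15_000_000 / 8760        # ≈ 1712.33 gCO2/hour, the manifest value
hours_per_page = 250 / 300           # 250 words/page at 300 words/hour

us_per_page = 1712.328767123329 * hours_per_page   # ≈ 1427 gCO2e/page
india_per_page = 216 * hours_per_page              # = 180 gCO2e/page

# Sanity check against the paper's headline ratio (Chat-GPT ≈ 1.3457 g/page).
print(us_per_page / 1.3457)          # ≈ 1060, the "US human vs ChatGPT" ratio
```

The recovered ratio of ~1060 matches the 1060.28 in the results table, which suggests the human side really is just (annual emissions / hours in a year) × (hours per page).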
Things I haven’t understood about the human writer calculations:
1) Why are human writers estimated top-down when the AI models are estimated bottom-up?
The choice of application boundary makes a big difference to the carbon emissions attributed to a task. For the AI models, the calculations have a fairly tightly scoped application boundary: specifically, the operational and embodied carbon associated with model training and inference. The data center and supply chain emissions are omitted - there are no scope 3 emissions.
However, for the human emissions, the boundary is much broader. The URL they cite for the human emissions links to a page with several datasets - I’m assuming they are getting their data from the plot titled “per capita CO2 emissions”. The explainer for that plot indicates that the value accounts for all fossil fuel combustion and industrial emissions for each country, divided by the country’s population. The industrial emissions include cement and steel production. Fossil fuel emissions include emissions from coal, oil, gas, flaring “and other processes”. None of these broader emissions are accounted for on the AI side, despite the fact that a human requires the same basic resources to sit and prompt an LLM. The AI doesn’t replace the human, in reality.
This seems quite far from an apples-to-apples comparison, since the human estimates are not derived from processes directly connected to writing; they include a large basket of emissions that power human society as a whole. For the AI models, on the other hand, there’s a narrow set of factors that directly lead to a page of text being generated.
Laptops and desktops
The IF manifest components for laptops and desktops look like this:
laptop:
  pipeline:
    compute:
      - compute-writing-rate-in-hours-per-page
      - multiply-laptop-footprint-by-writing-rate
  defaults:
    writing-rate-words-per-hour: 300 # from https://www.writermag.com/writing-inspiration/the-writing-life/many-words-one-write-per-day/
    writing-rate-words-per-page: 250 # Common knowledge, or https://wordcounter.net/words-per-page
    laptop-footprint-in-gco2-per-hour: 32.4 # derived from https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator#results
  inputs:
    - timestamp: '2024-09-01T10:00:00'
      duration: 8640

desktop:
  pipeline:
    compute:
      - compute-writing-rate-in-hours-per-page
      - multiply-desktop-footprint-by-writing-rate
  defaults:
    writing-rate-words-per-hour: 300 # from https://www.writermag.com/writing-inspiration/the-writing-life/many-words-one-write-per-day/
    writing-rate-words-per-page: 250 # Common knowledge, or https://wordcounter.net/words-per-page
    desktop-footprint-in-gco2-per-hour: 72.08 # derived from https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator#results
  inputs:
    - timestamp: '2024-09-01T10:00:00'
      duration: 86400
The calculations for laptops and desktops start from a value for each device’s hourly carbon footprint. The authors cite a URL that leads to the EPA carbon calculator for these cells in the supplementary info sheet, and the paper indicates they used power values of 75 W for a laptop and 200 W for a desktop. The values they report are 32.4 gCO2e for a laptop and 86.5 gCO2e for a desktop. I got slightly different but very similar results when I entered these values into the calculator; I assume the grid carbon intensity used to convert energy to carbon has changed.
These values were then scaled by the aforementioned hours/page value. This yields 27 gCO2e/page for laptops and 72 gCO2e/page for desktops.
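These per-page figures can be checked directly from the hourly values quoted above (a sketch; 32.4 and 86.5 gCO2e/hour are the values from the text):

```python
# Laptop/desktop per-page emissions: hourly footprint x writing rate.
hours_per_page = 250 / 300                   # 0.8333 hours per page
laptop_per_page = 32.4 * hours_per_page      # = 27.0 gCO2e/page
desktop_per_page = 86.5 * hours_per_page     # ≈ 72.08 gCO2e/page
print(laptop_per_page, desktop_per_page)
```

Note that 86.5 × 0.8333 ≈ 72.08, which is numerically the same as the desktop manifest default, so the hourly and per-page desktop values are easy to conflate.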
Things I haven’t understood about the laptop/desktop calculations:
1) Are these stand-ins for bottom-up estimates for human writers?
I think what I’m not completely sure about here is the extent to which these estimates are supposed to be understood as bottom-up estimates for human writing. If so, only the laptop’s operational carbon is accounted for, with all other factors including the laptop’s embodied emissions being omitted. If they are not supposed to be stand-ins for bottom-up estimates for human writers, then are they supposed to be benchmark tech for both the human and AI writers to compare against? If so, I’m not sure why only the operational carbon is accounted for, whereas the AI models and humans are treated differently. So I’m finding the purpose of these estimates a little fuzzy and struggling to see how the comparisons are close to like-for-like.
2) Why are the humans decoupled from their tools?
I’m also not fully understanding why it’s useful to separately account for the human writers and the tools they would do their writing on. Given the breadth of the emissions accounted for in the top-down estimates for human writers, shouldn’t the electricity burned by their computers be accounted for already? How should I interpret the laptop and desktop emissions given that they also need a human operator - should these estimates be additive?
Illustrations
I won’t go component by component through the illustration section, as I’d repeat myself a lot. Most of the calculations are ported over from the writing example, except that the total carbon is normalized per generated image rather than per page of writing.
Summary
This was an interesting study that aimed to do something very difficult, in a context where data is very patchy and a lot of proxies and fallbacks are necessary. Recalculating the results using the Impact Framework was a useful exercise that helped me understand what was done and the assumptions that were made. I’ve focused on the technical aspects, not things like the differing quality of the results. I suspect that some different decisions in the calculation pipeline could have led to dramatically higher carbon emission estimates for the AI models and lower ones for the humans, influencing the paper’s conclusions. I’ve highlighted some things here that I haven’t fully understood and would love to learn more about. Kudos again to the authors for openly sharing their calculation sheet alongside the paper.
Now you can hop to Part 2 where I show the outcomes from changing the calculation methodology.