Deep diving "The carbon emissions of writing and illustrating are lower for AI than for humans": Part 3
In two previous posts (Part 1 and Part 2), I dug into what was done in the Tomlinson et al (2024) paper that compared human and AI writers and illustrators. I outlined a reproduction of the original calculations published in the paper, which took a top-down approach to estimating the carbon emissions of human writers and a bottom-up approach to estimating the carbon emissions of AI writers. I then showed a set of modified manifest files that take alternative approaches to the same question, finding that radically different conclusions emerge from taking the following approaches to the calculations:
- bottom-up estimate of AI vs top-down estimate of humans (the original paper’s calculations)
- bottom-up estimate of humans vs bottom-up estimate of AI
- top-down estimate of humans vs top-down estimate of AI
Furthermore, the original study assumed a page of writing is generated using a single query when estimating the carbon emissions for AI. Googling and asking experts in my network suggests that 3 might be more typical, so I repeated some parts of the study assuming 3 queries per page of text.
The point is not that any of my alternative manifests are right - the point is that depending on the approach taken, the conclusions can be very different. When I applied an equivalent measurement protocol to both humans and AIs, their carbon emissions were about equal when I assumed 1 query per page. With 3 queries per page, AIs were far less carbon-efficient than human writers. This is the opposite conclusion to the original paper.
Here I go into fine detail about the design decisions and implementation details for each alternative manifest file.
Bottom-up estimate of AI writers
In this section, I’ll try to achieve the same outcomes as the original paper but by a different route. I’ll explain my working and give a side-by-side comparison for each component.
BLOOM
The original paper did the following:
- grab training operational carbon emissions value (30 metric tonnes) from this paper
- divide total operational carbon by 300000000 prompts per month (assuming 10000000 prompts per day as per this medium article), yielding the training carbon that was allocated per query (equalling 0.1 gCO2e/query).
- grab values from the same paper for BLOOM emissions per 230768 queries. Divide the carbon emissions by the number of queries to yield carbon per query (equalling 1.47 gCO2e/query).
- sum operational carbon for training and inference to give overall operational carbon of 1.57 g/query.
- Sum the embodied carbon of an A100 GPU (150 kg, from the aforementioned BLOOM study) and the embodied carbon of recycling an A100 GPU (1 kg), giving a total-embodied-carbon of 151 kg.
- Scale the embodied carbon by the ratio of the training time to the GPU lifespan (1082 hours / 1.5 years = 0.08238). This gave an embodied carbon of 12.439 kg CO2e, which was then amortized over the 300000000 monthly queries to give an embodied-carbon-per-query of 4.1465e-5 g/query.
- Normalize to per page using a writing rate in hours/page (0.83333).
- They report a final value of 0.95289 gCO2e/page.
Tomlinson et al (2024) pipeline:
bloom-model:
pipeline:
compute:
- divide-training-carbon-by-queries-per-month-bloom
- divide-inference-carbon-by-n-queries-bloom
- sum-embodied-carbon-components-bloom
- scale-embodied-carbon-to-training-time-period-bloom
- scale-embodied-carbon-to-queries-per-month-bloom
- sum-training-and-inference-carbon-per-query-bloom
- convert-carbon-per-query-to-carbon-per-word-bloom
- sum-words-per-query-and-embodied-carbon-per-query-bloom
- calculate-total-footprint-per-page-bloom
defaults:
training-carbon-bloom: 30000000
queries-per-month-bloom: 300000000
carbon-per-n-queries-bloom: 340000
n-queries-in-bloom-study: 230768
bloom-gpu-embodied-carbon: 12439.69798
fraction-of-gpu-lifetime-used-for-training-bloom: 0.08238203957
embodied-energy-of-a100-gpu: 150000
carbon-footprint-for-recycling-gpu: 1000
words-per-bloom-response: 412.8
words-per-page: 250
inputs:
- timestamp: '2024-09-01T10:00:00'
duration: 86400
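To make the arithmetic in this pipeline easier to follow, here is a minimal Python sketch of the same per-query calculation using the defaults above. The variable names are mine, not the manifest's, and I stop before the per-query-to-per-page conversion, which I come back to later in this post.

```python
# Rough reproduction of the Tomlinson et al. (2024) BLOOM arithmetic,
# using the defaults from the manifest above. All masses are in grams.
training_carbon = 30_000_000            # 30 t of operational carbon for training
queries_per_month = 300_000_000         # 10M queries/day * 30 days
training_carbon_per_query = training_carbon / queries_per_month       # 0.1 g

inference_carbon = 340_000              # carbon for 230,768 queries in the BLOOM study
n_queries_in_study = 230_768
inference_carbon_per_query = inference_carbon / n_queries_in_study    # ~1.47 g

operational_per_query = training_carbon_per_query + inference_carbon_per_query  # ~1.57 g

gpu_embodied = 150_000                  # A100 embodied carbon
gpu_recycling = 1_000                   # recycling footprint
fraction_of_lifetime = 0.08238203957    # 1082 h out of a 1.5-year lifetime
embodied_for_training = (gpu_embodied + gpu_recycling) * fraction_of_lifetime  # ~12,440 g
embodied_per_query = embodied_for_training / queries_per_month        # ~4.15e-5 g

print(f"operational: {operational_per_query:.3f} g/query")
print(f"embodied:    {embodied_per_query:.3e} g/query")
```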
My approach will be similar. I will also go bottom up and draw values from the Luccioni et al (2022) paper; however, there will be some substantial differences in the values selected and the way they are treated. Let’s go through each piece:
We can start with the operational carbon associated with the inference phase. I agree with the Tomlinson et al (2024) approach here. We can indeed take the carbon emitted by the BLOOM model across 230768 queries and divide it by 230768 to yield carbon per query - so we’ll persist their value of 1.47 g/query.
Now the operational carbon for model training. I can’t actually find the 30 tonne total training emissions used by Tomlinson et al (2024) in the study they cite. I can see two estimates for the training footprint - one that only includes the dynamic power consumption across the training period (24.7 metric tonnes), and one with a wider application boundary that also includes equipment embodied carbon and idle emissions (50.5 metric tonnes). The paper breaks down the total training carbon by component, so we can just take the dynamic and idling operational carbon, omitting the embodied as we’ll deal with that separately. They report 24.69 T CO2e for the dynamic consumption and 14.6 T for the idle consumption, summing to 39.29 T of operational carbon due to model training. We follow Tomlinson et al (2024) in dividing this value by 300,000,000 monthly queries (despite thinking this is likely an overestimate for BLOOM) to yield an operational carbon per query of 0.131 g CO2e/query, compared to Tomlinson et al’s value of 0.1. So far, pretty similar.
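As a quick sanity check, the operational part of my alternative BLOOM estimate boils down to a couple of divisions (a sketch using the values quoted above):

```python
# Operational carbon per query for BLOOM, bottom-up (grams CO2e).
queries_per_month = 300_000_000

# Training: dynamic + idle consumption from Luccioni et al. (2022), in tonnes.
training_operational_t = 24.69 + 14.6                                   # 39.29 t
training_per_query = training_operational_t * 1_000_000 / queries_per_month  # ~0.131 g

# Inference: carbon for 230,768 queries reported in the same paper.
inference_per_query = 340_000 / 230_768                                 # ~1.47 g

print(round(training_per_query, 3), round(inference_per_query, 2))      # 0.131 1.47
```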
The embodied carbon is where we first start to diverge significantly. Let’s start with embodied carbon for model training.
They use a single A100 GPU, noting that they extended the training time to account for the fact that multiple GPUs are run in parallel. However, the Luccioni et al (2022) paper provides us with ample information to account for the actual GPU hours used. They state that 384 GPUs were used to train the BLOOM model, with each having an embodied carbon value of 150 kg. This gives 57600 kg of embodied emissions for the GPUs across the GPU lifespan, which they scale down to 0.003 kg CO2e per hour per GPU. They also estimate 2500 kg of embodied emissions for the servers used to train the model, scaled down to 0.056 kg CO2e per GPU hour (the GPU carbon is not included in the server value). The total number of GPU hours used to train the model was 1.08 million, whereas Tomlinson et al (2024) used a value of 1082. This yields a total embodied carbon of 11.2 T for the model training, comprising 7.57 T for the servers and 3.64 T for the GPUs. I will use the Luccioni value of 11.2 T, compared to the Tomlinson value of 151 kg total footprint, which they scale down to 12 kg to be accounted for during the training period.
Tomlinson et al (2024) also include the embodied carbon of recycling a GPU - we can persist their method here, but we’ll use all 384 GPUs rather than just one, yielding 384 kg of recycling carbon.
Note that these values already account for the ratio of training time to GPU lifespan - this was done by Luccioni et al (2022) when they generated the 11.2 T embodied carbon value, so no further scaling is required here. We can simply distribute this over our 300,000,000 queries to give an embodied carbon value of
((11.2 * 1000 * 1000) + (384 * 1000)) / 300000000 = 0.039 g CO2e/query
which is about 892 times higher than the Tomlinson et al (2024) value of 4.1465e-5 g/query (cell B52).
Tomlinson et al (2024) do not account for any embodied carbon for the hardware used for inference. Given that we are borrowing Tomlinson et al’s (2024) estimate of 300,000,000 prompts per month, we should account for the embodied carbon of hardware appropriate to serve that many requests. Semianalysis did a very deep dive into Chat-GPT and suggested 3617 HGX A100 servers were required to handle the demand generated by 100M users, which we can assume generate the 10,000,000 requests per day that the Tomlinson 300,000,000 requests per month figure was derived from.
Nvidia do not publish embodied carbon values for their hardware, but we can look for a suitable analog. The challenge was therefore to find an AI-specific rack server that supports GPUs and also has published embodied carbon estimates. I found the Dell PowerEdge R740xd. This supports up to three double-width 300W GPUs, such as the A100 or V100. The reported embodied carbon value for this server is 9180 kgCO2e, i.e. 9.18 tonnes per server, over a full 4-year lifecycle.
If we assume that 3617 servers, each having an embodied carbon of 9.18 T, are used for inference, we have a total embodied carbon of 33204 T CO2e for inference hardware. But this accounts for the entire lifespan of the servers (4 years). We know the training time was more like 118 days (from the Luccioni paper). So we can scale by the ratio of 118 days to 4 years, which is 0.0808. This gives an inference embodied carbon value of 2682 T. Let’s normalize this to the 300,000,000 requests and we get
(2682*1000*1000)/300000000 = 8.94 g CO2e/query
Now we can add all our components together (training and inference, operational and embodied) to yield a total carbon per query of:
0.131 + 1.47 + 0.039 + 8.94 = 10.58 g CO2e/query
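Putting the operational and embodied pieces together, the whole alternative BLOOM estimate is only a few lines of arithmetic (a sketch using the inputs described above):

```python
# Alternative bottom-up estimate for BLOOM, in grams CO2e per query.
queries_per_month = 300_000_000

# Operational carbon per query (from earlier in this section).
training_operational = 0.131      # dynamic + idle training consumption
inference_operational = 1.47      # from the Luccioni et al. (2022) study

# Training embodied: 11.2 t (Luccioni et al. 2022) plus 384 kg of GPU recycling.
training_embodied = (11.2e6 + 384e3) / queries_per_month                 # ~0.039 g/query

# Inference embodied: 3617 servers at 9.18 t each, scaled by 118 days of a 4-year life.
inference_embodied = (3617 * 9.18 * 0.0808) * 1e6 / queries_per_month    # ~8.94 g/query

total = training_operational + inference_operational + training_embodied + inference_embodied
print(f"{total:.2f} g CO2e per query")   # ~10.58 g, and per page at one query per page
```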
Tomlinson et al (2024) added a further transformation step that aimed to convert carbon per query into carbon per page, but I have not been able to understand the logic in this step. They did the following:
total-carbon-per-page = (operational-carbon-per-query * words-per-response) / (words-per-response + embodied-carbon-per-query)
I’m not going to replicate this, as I think one page per query is an acceptable heuristic, making carbon per query and carbon per page equal.
Therefore, my carbon-per-page for BLOOM is 10.58 g/page, compared to the Tomlinson et al (2024) value of 0.95 g/page, roughly an 11x difference.
Chat-GPT
The Tomlinson et al (2024) paper did the following:
- grab a value for the operational carbon associated with model training (552 metric tonnes) from this paper
- divide by one month’s worth of prompts, giving 1.84 gCO2e/query.
- operational carbon associated with model inference was calculated by dividing Chris Pointon’s estimate for daily carbon emissions for inference, 3.82 tonnes, by the prompts per day taken from the same article. This yielded a value of 0.382 gCO2e/query.
- Sum training and inference operational carbon: 1.84 + 0.382 = 2.22 g/query.
- For embodied carbon, repeat the method from the BLOOM model with the same values, except for the training time, which was assumed to be 3217.5 hours.
- Sum the embodied and operational carbon components and convert to carbon-per-page to give 1.35 g CO2e/page.
Tomlinson et al (2024) pipeline:
chat-gpt-model:
pipeline:
compute:
- divide-training-carbon-by-queries-per-month-chat-gpt
- divide-inference-carbon-by-n-queries-chat-gpt
- sum-training-and-inference-carbon-per-query
- sum-embodied-carbon-components-chat-gpt
- scale-embodied-carbon-to-training-time-period-chat-gpt
- scale-embodied-carbon-to-queries-per-month-chat-gpt
- convert-carbon-per-query-to-carbon-per-word-chat-gpt
- sum-words-per-query-and-embodied-carbon-per-query-chat-gpt
- calculate-total-footprint-per-page-chat-gpt
defaults:
training-carbon-chat-gpt: 552000000 # from https://medium.com/@chrispointon/the-carbon-footprint-of-chatgpt-e1bc14e4cc2a
queries-per-month-chat-gpt: 300000000 # from https://medium.com/@chrispointon/the-carbon-footprint-of-chatgpt-e1bc14e4cc2a
operational-carbon-per-day-chat-gpt: 3820000 # from https://medium.com/@chrispointon/the-carbon-footprint-of-chatgpt-e1bc14e4cc2a
queries-per-day-chat-gpt: 10000000 # from https://medium.com/@chrispointon/the-carbon-footprint-of-chatgpt-e1bc14e4cc2a
chat-gpt-gpu-embodied-carbon: 36974.31507
fraction-of-gpu-lifetime-used-for-training-chat-gpt: 0.2448630137 # 3217.5 hours out of 1.5 year lifetime
embodied-energy-of-a100-gpu: 150000
carbon-footprint-for-recycling-gpu: 1000
words-per-chat-gpt-response: 412.8
words-per-page: 250
inputs:
- timestamp: '2024-09-01T10:00:00'
duration: 86400
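As with BLOOM, here is a minimal Python sketch of the per-query arithmetic implied by these defaults (variable names are mine); the final conversion to carbon-per-page is the step I come back to below.

```python
# Rough reproduction of the Tomlinson et al. (2024) Chat-GPT arithmetic,
# using the defaults from the manifest above. All masses are in grams.
queries_per_month = 300_000_000
queries_per_day = 10_000_000

training_per_query = 552_000_000 / queries_per_month       # 1.84 g
inference_per_query = 3_820_000 / queries_per_day           # 0.382 g
operational_per_query = training_per_query + inference_per_query   # ~2.22 g

# Embodied: same A100 figures as for BLOOM, scaled by 3217.5 h of a 1.5-year lifetime.
embodied = (150_000 + 1_000) * 0.2448630137                 # ~36,974 g
embodied_per_query = embodied / queries_per_month            # ~1.2e-4 g

print(f"operational: {operational_per_query:.2f} g/query")
print(f"embodied:    {embodied_per_query:.2e} g/query")
```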
My approach will be a little different.
The Tomlinson et al (2024) paper seems to have used values for training carbon that were for the GPT-3 model, rather than Chat-GPT. There is a strong argument to be made that the emissions from both GPT-3 and Chat-GPT should be included here, but attribution of emissions from parent models to their children is complicated and it’s ambiguous how to do it correctly, so here we’ll omit the parent model and only include the Chat-GPT model.
We don’t know the precise hardware used to train Chat-GPT. We know, from OpenAI, that Chat-GPT was fine-tuned from a model in the GPT-3.5 series in 2022. We also know from several press releases that it was trained on Microsoft hardware. The Megatron-Turing NLG supercomputer was announced as a partnership between Nvidia and Microsoft just before Chat-GPT was trained. It contains 4480 A100 GPUs. It is possible that this specific machine was used, but even if not, it is probably a good analog for whatever hardware was used. So, we can assume that Chat-GPT was trained on 4480 A100s.
Here are some links to sources supporting this logic: YouTube video, The Verge, Bloomberg, Microsoft News, Nvidia news.
It’s also very possible that the hardware requirements for model inference exceed those required for model training. Microsoft and Nvidia have already announced that the models run on Azure servers. A single Nvidia DGX or HGX A100 server is sufficient to run the model, but not at the scale of 100M users per day. Semianalysis did a very deep dive and suggested 3617 HGX A100 servers are required to serve inference. We’ll assume this is correct.
With this information we can estimate the embodied carbon associated with training and inference for Chat-GPT.
Nvidia do not publish embodied carbon values for their hardware, so we sought an analog. Again, I used the PowerEdge R740xd. This supports up to three double-width 300W GPUs, such as the A100. The reported embodied carbon value for this server is 9180 kgCO2e, i.e. 9.18 tonnes per server, for a full 4 year lifecycle.
Given that we identified that 4480 GPUs were used to train Chat-GPT, and that each server can accommodate three, the total number of servers required to deliver the Chat-GPT training is:
n-servers: 4480/3 ≈ 1493
It’s not clear whether the embodied carbon values on the manufacturer’s sheet include the GPUs or not; I’ll assume they do, to err on the side of a conservative estimate. If I find out otherwise, I’ll come back and revise the values.
Now we can multiply the number of servers by the embodied carbon per server:
1493 * 9.18 = 13706 T CO2e
This is the total embodied carbon across the entire lifespan of the hardware. I have not been able to find any quantitative estimates of the training time, only qualitative suggestions such as “several months”. We can assume the hardware was dedicated during this time. Let’s say it’s 4 months, accepting that this is a guess. The final value for embodied carbon is fairly sensitive to this estimate - it makes an appreciable difference whether the true number is, say, 3 months or 6 months.
4 months out of 4 years = 4/48 = 0.0833, so the embodied carbon we attribute to the Chat-GPT training is:
13706 * 0.0833 = 1142 T
This is the estimated total embodied carbon for training Chat-GPT.
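To keep the assumptions explicit, here is a short sketch of the training embodied carbon calculation (the GPU count, server analog and 4-month training period are all assumptions discussed above):

```python
# Embodied carbon attributed to Chat-GPT training (a sketch of the assumptions above).
gpus_for_training = 4480           # A100s, assuming Megatron-Turing-class hardware
gpus_per_server = 3                # PowerEdge R740xd analog
embodied_per_server_t = 9.18       # tonnes CO2e over a 4-year lifecycle
training_months = 4                # a guess; see the caveat above
server_lifetime_months = 48

n_servers = gpus_for_training / gpus_per_server               # ~1493
total_embodied_t = n_servers * embodied_per_server_t          # ~13,706 t
training_share_t = total_embodied_t * training_months / server_lifetime_months
print(f"{training_share_t:.0f} t CO2e attributed to training")  # ~1142 t
```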
Now we can do the same for inference. Let’s assume the same type of servers are used - we identified that 3617 servers are required. This equates to:
3617*9.18
== 33205 T
We’ll follow Tomlinson in amortizing it over one month’s worth of queries, so that the estimates can be compared side-by-side. This gives:
33205 * (1/48) = 691 T CO2e
The total embodied carbon for training and inference is then the sum of 691 T and 1142 T, giving:
691 + 1142 = 1833 T CO2e
which we distribute over 300,000,000 queries to give
(1833 * 1000 * 1000) / 300000000 = 6.11 gCO2e/query
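And the equivalent sketch for the embodied carbon per query, combining the inference share with the training share calculated above:

```python
# Embodied carbon per query for Chat-GPT (training + inference shares), grams CO2e.
queries_per_month = 300_000_000

inference_servers = 3617
inference_embodied_t = inference_servers * 9.18           # ~33,204 t over a 4-year life
inference_monthly_share_t = inference_embodied_t / 48     # ~691 t for one month

training_share_t = 1142                                   # from the previous sketch

total_embodied_t = inference_monthly_share_t + training_share_t      # ~1833 t
embodied_per_query = total_embodied_t * 1e6 / queries_per_month
print(f"{embodied_per_query:.2f} g CO2e/query")           # ~6.11
```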
Now we move on to operational carbon.
There were 4480 300W GPUs used to train Chat-GPT over a period of 4 months (120 days), and we assume a PUE of 1.1 and a carbon intensity of 390 g/kWh (a typical West Coast USA value). This means we can do the following:
(((300*4480)*(120*24))*1.1) // 120 days in hours
= 4257792000 Wh
= 4257792 kWh, or 4257.79 MWh of energy used
4257792 * 390 = 1660538880 g CO2e
or 1660.54 T
Scaling this down to per query, we get:
(1660.54 * 1000 * 1000) / 300000000
= 5.54 g/query
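In code, the training operational estimate looks like this (again, the GPU count, PUE and grid intensity are the assumptions stated above):

```python
# Operational carbon for Chat-GPT training (a sketch of the assumptions above).
gpus = 4480
gpu_power_w = 300
hours = 120 * 24          # ~4 months of continuous training
pue = 1.1
grid_intensity = 390      # g CO2e per kWh, a typical US West Coast value

energy_kwh = gpus * gpu_power_w * hours * pue / 1000       # ~4.26 million kWh
carbon_t = energy_kwh * grid_intensity / 1e6               # ~1660 t CO2e
per_query = carbon_t * 1e6 / 300_000_000
print(f"{carbon_t:.1f} t, {per_query:.2f} g CO2e/query")   # ~1660.5 t, ~5.54 g/query
```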
Now we turn to inference. Again we use Semianalysis estimate of 3617 HGX A100 servers being required to serve inference. Each of these servers contains 8 A100 GPUs, meaning there are 3617 * 8 = 28936 GPUs serving queries.
In this case, we can assume they are all being fully utilized and are running constantly. This means we can do:
300W * 28936 = 8680800 W power
Then turn this into an amount of energy used in a day:
8680800 * 24
== 208339200 Wh per day
== 208339.2 kWh per day
We use the same grid carbon intensity as before (390 g/kWh) to yield:
208339.2 * 390 = 81252288 g CO2e per day
== 81252.29 kg CO2e per day
Now let’s scale by the number of queries:
81252288/10000000
= 8.125 g CO2e/query
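And the corresponding sketch for the inference operational carbon, using the Semianalysis server count and the same grid intensity:

```python
# Operational carbon for Chat-GPT inference (a sketch of the assumptions above).
servers = 3617
gpus = servers * 8                            # 28,936 A100s, assumed fully utilised 24/7
power_w = gpus * 300                          # 8,680,800 W
energy_kwh_per_day = power_w * 24 / 1000      # ~208,339 kWh/day
grid_intensity = 390                          # g CO2e per kWh

carbon_g_per_day = energy_kwh_per_day * grid_intensity     # ~81.25 million g/day
per_query = carbon_g_per_day / 10_000_000                  # 10M queries per day
print(f"{per_query:.3f} g CO2e/query")                     # ~8.125
```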
Now we simply sum the embodied and operational carbon components to get a total carbon per query. Again, we’ll assume one query per page of writing.
6.11 + 5.54 + 8.125 = 19.775 g CO2e/query
These folks estimated 4.32 gCO2e/query, seemingly just for the operational carbon component - about half of our operational estimate for inference. Our estimate is about 15x that of Tomlinson et al (2024), who estimated 1.35 g/page.
Bottom-up estimate for human writers
In the original paper, the estimate for human writers was done top-down, by dividing a value for the total carbon emissions of a human per hour by the number of hours it takes to write a page.
The IF manifest component for the original paper looked as follows:
human-usa:
pipeline:
compute:
- compute-writing-rate-in-hours-per-page
- multiply-human-carbon-footprint-by-writing-rate
defaults:
writing-rate-words-per-hour: 300 # from https://www.writermag.com/writing-inspiration/the-writing-life/many-words-one-write-per-day/
writing-rate-words-per-page: 250 # Common knowledge, or https://wordcounter.net/words-per-page
human-footprint-in-gco2-per-hour: 1712.328767123329 # derived from https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions
inputs:
- timestamp: '2024-09-01T10:00:00'
duration: 86400
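For reference, the arithmetic this manifest performs is very simple; here is a Python sketch using the defaults above (the per-hour footprint is the USA figure from the manifest):

```python
# Top-down human-writer estimate from the Tomlinson et al. (2024) defaults.
words_per_hour = 300
words_per_page = 250
footprint_g_per_hour = 1712.328767123329     # annual per-capita emissions spread over the year

hours_per_page = words_per_page / words_per_hour            # ~0.83 h
carbon_per_page = footprint_g_per_hour * hours_per_page
print(f"{carbon_per_page:.1f} g CO2e/page")                 # ~1426.9 for a US-based writer
```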
To my mind, it only makes sense to do a top down calculation for human writers if we also do a top down calculation for AI with the same application boundary. The equivalent top down calculation for an AI would be to take all the carbon emitted by a data center (scopes 1, 2, 3) and divide it by the hours spent training models and serving inference. However, this was not what was done by Tomlinson et al (2024) - they computed AI emissions bottom-up with a narrow application boundary, so that’s also the approach that should be taken for human writers, otherwise the comparisons are not apples-to-apples.
So let’s think about what should be in the application boundary for a human writer. To compare like for like with the AI writers, we should omit scope 3 emissions, as no upstream or downstream emissions were accounted for in the BLOOM or Chat-GPT estimates. So we are left with the energy used to power the writer’s device. Maybe we can also include the energy used to light and heat the writer’s workspace, but not the food they eat or their general life activities that are not directly associated with the writing task. So we’ll include the following:
- laptop being used to write - embodied and operational
- office lighting
- office heating
We will assume the writer is working on the laptop without any additional peripherals like a second monitor or external speakers. We’ll take Tomlinson’s estimate of a writing rate of 0.83 hours per page.
The average UK household uses 3941 kWh of electricity per year, of which 15% is due to lighting (https://www.ovoenergy.com/guides/energy-guides/how-much-electricity-does-a-home-use). We assume 1/8 of the total space in the house is used for writing, giving (3941 x 0.15) x 0.125 = 73.89 kWh/yr for a remote-working writer on lighting alone. Scaling to a day gives 73.89/365 = 0.20 kWh, and to an hour gives 0.20/24 = 0.008 kWh. But since we want to normalize per page, not per hour, and we are assuming one page takes 0.8333 hours to write, we scale further: 0.008 * 0.8333 = 0.0066 kWh. At the 2023 UK average grid carbon intensity of 162 gCO2e/kWh, this yields 0.0066 * 162 = 1.06 g CO2e/page.
A typical UK household uses 11500 kWh of natural gas per year (https://www.ofgem.gov.uk/average-gas-and-electricity-usage). We assume 2/3 of this is used for heating, giving 7666 kWh. We scale this down to the 1/8 of the total space in the house that is used for home working: 7666 * 0.125 = 958 kWh/yr. Again we can scale this down to a value per 0.83 hours: (958 / 365 / 24) * 0.83 = 0.09 kWh. We take the carbon intensity of burning natural gas to be 0.185 kg/kWh (185 g/kWh), from here, so the carbon we account for heating is 0.09 * 185 = 16.65 g.
A laptop uses approximately 0.05 kWh per working day (https://www.ovoenergy.com/guides/energy-guides/how-much-electricity-does-a-home-use). Scaling this down to the time taken to write a page, we get (0.05 / 24) * 0.833 = 0.0017 kWh. At the 2023 UK grid carbon intensity, this is 162 * 0.0017 = 0.27 g CO2e/page.
For a MacBook Pro 14”, the embodied carbon is 196.56 kg CO2e (i.e. 196560 g) covering a 4-year lifespan. This equates to 4.65 g per page of writing.
So the bottom-up estimate for a human writer is the sum of these components: 1.06 + 16.65 + 0.27 + 4.65 = 22.63 g CO2e/page. This is about 62x smaller than Tomlinson’s estimate of 1426 g CO2e/page for a human in the USA, or 10x smaller than their estimate for a human in India.
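Here is the whole bottom-up human estimate in one short sketch; it lands slightly above the 22.63 g figure because the prose above rounds at each intermediate step:

```python
# Bottom-up estimate for a human writer, grams CO2e per page (assumptions as above).
hours_per_page = 0.8333
uk_grid_intensity = 162            # g CO2e/kWh, 2023 UK average

# Lighting: 15% of 3941 kWh/yr household electricity, 1/8 of the space, per writing hour.
lighting_kwh = 3941 * 0.15 * 0.125 / 365 / 24 * hours_per_page
lighting_g = lighting_kwh * uk_grid_intensity                      # ~1.1 g

# Heating: 2/3 of 11,500 kWh/yr of gas, 1/8 of the space, at 185 g/kWh for natural gas.
heating_kwh = 11500 * (2 / 3) * 0.125 / 365 / 24 * hours_per_page
heating_g = heating_kwh * 185                                      # ~16.9 g

# Laptop: ~0.05 kWh/day operational, plus 196.56 kg embodied over a 4-year lifespan.
laptop_use_g = 0.05 / 24 * hours_per_page * uk_grid_intensity      # ~0.28 g
laptop_embodied_g = 196_560 / (4 * 365 * 24) * hours_per_page      # ~4.7 g

total = lighting_g + heating_g + laptop_use_g + laptop_embodied_g
print(f"{total:.1f} g CO2e/page")   # ~23 g, close to the 22.63 g derived above
```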
This puts the carbon emissions of a human writer only slightly above those of Chat-GPT, and most of the human total comes from the energy required to heat the writing space. It’s still not really a like-for-like comparison: we’d also need to add data center energy usage to the AI estimates to mirror the home heating and lighting components for the human writers, and we’ve been generous in assuming people use one prompt to generate a page of writing - I suspect most people use several prompts to refine each page of AI-generated text. As soon as more than one prompt is used to generate a page of text, the carbon cost of AI text generation rises, and at two or more prompts per page humans come out at least roughly twice as efficient (they are about twice as efficient at prompts-per-page = 2).
Excluding home heating and lighting, we have a value for human writers of 0.27 + 4.65 = 4.92 g CO2e, about 1/4 of the carbon emissions per page for Chat-GPT, or half of BLOOM’s.
Top-down estimate for Chat-GPT
A top-down estimate for Chat-GPT uses cost as an entry point. Microsoft’s reported Scope 1, 2 and 3 (location-based) emissions were divided by their revenue for the same year, yielding an estimate for carbon per billion dollars of revenue: 539325.842696 T, or 539325842696.62 g, of CO2e.
OpenAI reportedly paid Microsoft $4 billion to train and serve Chat-GPT, so we can estimate 4 * 539325842696.62 = 2157303370786.48 g CO2e as the total carbon cost of the model.
According to this blog Chat-GPT serves 10,000,000 requests a day, or 3650000000 queries per year.
So, we can estimate the carbon per query as: 2157303370786.48 / 3650000000 = 591.04 g CO2e/query.
Assuming one query per page, the carbon per page is equal to the carbon per query, so 591.04 g CO2e.
If you assume three queries per page of text, then the carbon per page is 591.04 * 3 = 1773.13 gCO2e.
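The whole top-down estimate fits in a few lines (a sketch of the reasoning above, taking the reported figures at face value):

```python
# Top-down, cost-based estimate for Chat-GPT (a sketch of the reasoning above).
msft_carbon_per_billion_usd = 539_325_842_696.62   # g CO2e per $1B of Microsoft revenue
openai_spend_billion_usd = 4                       # reported training + serving spend
queries_per_year = 10_000_000 * 365

total_carbon_g = msft_carbon_per_billion_usd * openai_spend_billion_usd
per_query = total_carbon_g / queries_per_year
print(f"{per_query:.2f} g CO2e/query")                            # ~591.04
print(f"{per_query * 3:.2f} g CO2e/page at 3 queries per page")   # ~1773.13
```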