Is training in scope for AI carbon intensity?

September 17, 2024

I recently sat down and tried to write an Impact Framework manifest file for calculating the SCI score of Chat-GPT3. This turned out to be quite eye-opening, because once I started trying to roll the estimated embodied and operational carbon of the model training into the SCI calculation, I hit some conceptual barriers that I hadn't anticipated. What's more, I think these nuances generalize to any software system that has a training phase or equivalent finite-length setup phase that's distinct from an operation phase of indefinite duration. In this post, I'll dive into these barriers and explain why they matter for calculating software carbon intensity for AI.

A primer on carbon and SCI

Carbon emissions can be calculated by summing the carbon emitted from all the various components of an application that fall within the application boundary. For example:

  • embodied carbon for the hardware used to train a model
  • embodied carbon for the hardware used to serve requests
  • embodied carbon for end user devices
  • embodied carbon for networking devices
  • operational carbon for sending data over the network during model training
  • operational carbon for sending data over the network to send requests and responses
  • operational carbon for training the model
  • operational carbon for inference

Very often, direct measurements of the carbon emissions arising from each of these components are not available, but proxies, analogs and models are often available that can yield reasonable estimates that can then be summed to provide an estimate of the total carbon emissions for the system.

To turn this carbon estimate into an SCI score, we divide by a "functional unit" - turning the carbon emissions into a rate that can be compared between different systems that might use very different underlying technologies and have different popularity and reach.

For example, if website A emits 10 kg CO2e/ day and website B emits 100 kg CO2e/day, we might initially favour website A over website B. However, we later discover that website B is serving 100x more users than website A, meaning that despite emitting more carbon overall, website B is 10x more carbon efficient in terms of carbon emitted per user. The SCI score corrects for these effects by normalizing to a functional unit that is common to all the systems you might want to compare.
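To make the normalization concrete, here's a minimal sketch of the website comparison above. Note that the user counts of 1,000 and 100,000 are assumptions for illustration - the example only states that B serves 100x more users than A:

```python
def sci(total_carbon_kg: float, functional_units: int) -> float:
    """SCI: total carbon emissions divided by a functional unit
    (here, users served per day)."""
    return total_carbon_kg / functional_units

# Website A: 10 kg CO2e/day over an assumed 1,000 users/day
site_a = sci(10, 1_000)     # 0.01 kg CO2e/user
# Website B: 100 kg CO2e/day over 100x more users
site_b = sci(100, 100_000)  # 0.001 kg CO2e/user

# Despite emitting 10x more carbon overall, B is 10x more
# carbon efficient per user.
print(site_a, site_b)
```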

You might also have sufficiently granular data to break the carbon emissions down into timesteps, so that you can track peaks and troughs in your carbon emissions and SCI score. Typically, this is done using some usage monitor on a resource, such as a cloud instance that exposes CPU, RAM and data ingress/egress via an API. This can be used to model the operational carbon from that resource over time, given information about the hardware being used and the carbon intensity of the electricity powering the machines.

For embodied carbon, the total embodied carbon of the hardware is often derived either from manufacturer's data that in turn originated from a scope-3 life cycle assessment of the hardware components, or using a model such as this one proposed by CCF. The total embodied carbon is then scaled by the proportion of the hardware's total computational resources being dedicated to a particular application, and the portion of the hardware lifespan being allocated to that application.
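As a rough sketch of that scaling step, here's the time- and resource-share calculation with invented figures (the server specs and 1,000 kg embodied carbon value are assumptions, not real data):

```python
def embodied_share(total_embodied_kg: float, time_reserved_h: float,
                   lifespan_h: float, resources_reserved: float,
                   total_resources: float) -> float:
    """Scale total embodied carbon by the share of the hardware's
    lifespan and the share of its resources used by the application."""
    return (total_embodied_kg
            * (time_reserved_h / lifespan_h)
            * (resources_reserved / total_resources))

# Hypothetical server: 1,000 kg embodied carbon, 4-year expected
# lifespan, fully reserved (8 of 8 GPUs) for a 30-day training run.
share = embodied_share(1_000, 30 * 24, 4 * 365 * 24, 8, 8)
print(round(share, 2))  # 20.55 kg attributable to the training run
```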

These approaches are thought to work well for most applications, but they break down when we try to apply them to LLMs. Here's why:

Training is finite

The first reason why SCI is difficult to apply to AI applications is that training a model happens in some finite amount of time, using some finite amount of operational carbon and accounting for some fixed amount of embodied carbon. We then have to define a function for fairly allocating this carbon over an indefinite time period during which the model is active and serving responses to queries. This turns out to be awkward. Two possible approaches to solving the issue are imposing stop conditions and time-boxing the SCI calculation - below I'll explain why neither of these are complete solutions.

Stop conditions

Allocating the carbon emissions of the training hardware fails when we attempt to express them as a rate, because there is no natural way to stop the allocation algorithm once the absolute amount of carbon emitted during training has been fully accounted for.

To take a very simplified example - let's say we emit 10 kg CO2e during model training (the sum of the operational carbon emitted due to the electricity used to power the model training and the allocatable fraction of the total embodied carbon of the hardware used for the model training). Training took 10 days, meaning we emitted 1 kg CO2e/day. We also observe that we serve 10 queries per day, so we end up proposing an SCI score of 0.1 kg CO2e/query.

In this scenario, the sum total of the carbon emissions due to model training have been accounted for at timestep 10. At every timestep from 11 to N, we are artificially accounting for carbon that was already allocated earlier in the time series. We are effectively assigning a carbon value to training that will be additive forever, even after the original carbon emissions have already been accounted for.
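The over-counting is easy to see if we sketch the toy example in code, using the same figures as above:

```python
TRAINING_KG = 10.0       # total training carbon from the example above
TRAINING_DAYS = 10
QUERIES_PER_DAY = 10

# 1 kg CO2e/day spread over 10 queries/day -> 0.1 kg CO2e/query
per_query = (TRAINING_KG / TRAINING_DAYS) / QUERIES_PER_DAY

def training_carbon_attributed(days_live: int) -> float:
    """Total training carbon attributed after `days_live` days of serving,
    if the per-query rate is applied indefinitely."""
    return per_query * QUERIES_PER_DAY * days_live

print(training_carbon_attributed(10))   # 10.0 -> all training carbon accounted for
print(training_carbon_attributed(100))  # 100.0 -> 10x the carbon actually emitted
```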

This implies that some stop condition is needed in the calculation algorithm that omits the embodied carbon from training after the total emissions have been counted. This means the function for calculating the SCI score is a step function, with the intervals separated at the time when the training carbon has all been allocated. But then we are left with another conundrum, because it doesn't make sense for queries before t[n] to cost X gCO2e/query and queries after t[n] to cost Y gCO2e/query.

Imposing a stop condition artificially burdens early users with a high carbon allocation and makes later users wrongly appear to be more carbon efficient - in reality, both sets of users are using the same resource in the same way and should share the accountability for the training emissions. A straightforward application of SCI to an LLM that considers the operational and embodied carbon emitted during the training phase to be inside the application boundary will therefore grossly overestimate the carbon attributable to training, by an amount that increases over time.

Time boxing

My gut reaction to this problem was to try to fit some fixed time window over which we equally distribute the carbon costs of the training phase, to avoid biasing early or late users.

However, unless we are looking back retrospectively on the model after it is decommissioned, we can never know the total number of queries the model will serve over its lifespan, and can therefore never set an appropriate time window - a time box is functionally equivalent to imposing a stop condition.

In scenarios where we are calculating an SCI score for an active model, assuming the number of queries served by the model is not zero, we will amortize a little less training carbon per query each time we recalculate because the cumulative number of queries will have increased.

There are some benefits to this approach - it does prevent us from double-counting emissions, because each recalculation redistributes the total carbon from the training phase over the currently known set of queries, allocating less per query so that the total emitted remains constant. On the other hand, we again heavily bias against early users, and it creates a perverse incentive: you benefit from delaying your SCI calculations as long as possible to increase the value of your denominator!

It also means that until a model is fully decommissioned, you can never settle on an SCI value as you never know the future trajectory of the value of your denominator, i.e. the cumulative number of queries you want to distribute your carbon over.
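A sketch of this recalculation behaviour, reusing the 10 kg training budget from the earlier toy example:

```python
TRAINING_KG = 10.0  # fixed training budget from the toy example

def training_carbon_per_query(cumulative_queries: int) -> float:
    """Redistribute the whole training budget over every query served so far."""
    return TRAINING_KG / cumulative_queries

for q in (100, 1_000, 10_000):
    per_query = training_carbon_per_query(q)
    # The per-query allocation shrinks at every recalculation...
    # ...but the total attributed (rate x queries) stays fixed at 10 kg,
    # so nothing is double-counted.
    print(q, per_query, per_query * q)
```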

Another option could be to establish a decay curve for the carbon, so that instead of expiring on some given day, the burden decreases according to some linear or non-linear function, until eventually the carbon allocated per query becomes negligible, and we can decide what negligible means based on how much of the original carbon emission has been allocated. This seems unnecessarily complex, and it also doesn't address the issue of unevenly distributing carbon to queries that are only distinguished by time.
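For what it's worth, here's one possible shape for such a decay curve - an exponential with an arbitrary half-life, where "negligible" is defined by how much of the original emission remains unallocated. Both the half-life and the 99% cutoff are assumptions of mine, not part of any standard:

```python
import math

TRAINING_KG = 10.0
HALF_LIFE_DAYS = 30.0   # assumed decay rate
LAM = math.log(2) / HALF_LIFE_DAYS

def daily_allocation(day: int) -> float:
    """Training carbon allocated on a given day under exponential decay."""
    return TRAINING_KG * (math.exp(-LAM * day) - math.exp(-LAM * (day + 1)))

# After ~200 days, over 99% of the training carbon has been allocated;
# we could declare the remainder negligible and stop.
allocated = sum(daily_allocation(d) for d in range(200))
print(allocated)
```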

Carbon inheritance

LLMs and other AI models are not always trained from scratch from an original data source. Often, they are retrained or tuned from some older model used as a jumping off point. For example, Chat-GPT3 was tuned from the general purpose GPT-3 model. For every large general purpose model there are multiple more specialized models downstream. How do we then allocate carbon between them?

Let's say I train a model, Something-GPT. Now I create a more specialized version called Special-GPT. Should I include all the emissions from training Something-GPT when I calculate the SCI score for Special-GPT?

Now a friend comes along and trains a different specialized model, Extraspecial-GPT from Something-GPT. Now, should I distribute the Something-GPT training emissions equally between both models? In that case, I've halved the carbon intensity of Special-GPT without actually doing anything! The more new models are trained, the lower my carbon emissions!

So the alternative is not to redistribute, and instead to fully allocate the Something-GPT emissions to both models. Well, now one single, finite block of carbon is accounted for twice. The more models that are downstream of Something-GPT, the more carbon was apparently emitted.

Both options are clearly nonsensical!
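The two allocation options can be sketched with hypothetical numbers (the 100 kg figure for Something-GPT's training emissions is invented):

```python
BASE_TRAINING_KG = 100.0  # hypothetical training emissions for Something-GPT

def redistribute(n_downstream_models: int) -> float:
    """Option 1: split the base model's emissions equally - each model's
    share shrinks whenever someone else trains a new downstream model."""
    return BASE_TRAINING_KG / n_downstream_models

def duplicate(n_downstream_models: int) -> float:
    """Option 2: charge every downstream model the full base emissions -
    the apparent total grows with each new model."""
    return BASE_TRAINING_KG * n_downstream_models

print(redistribute(2))  # 50.0 -> Special-GPT's burden halved "for free"
print(duplicate(2))     # 200.0 -> one 100 kg training run counted twice
```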

So the final alternative is to omit the Something-GPT emissions from the Special-GPT carbon accounting altogether. But assuming the carbon emissions for a small model are entirely independent of the potentially very large model it was bootstrapped from doesn't really smell right. On the other hand, there are emissions associated with dependencies for every application that are typically considered out of scope. We typically would not account for the carbon emissions associated with developing, for example, the TensorFlow or PyTorch libraries that were used to build some AI app. We wouldn't typically account for the carbon emissions associated with developing the JavaScript or Python languages when calculating the impact of a JavaScript or Python app. Maybe the base model should be treated equivalently - as a low-level dependency whose development emissions are now decoupled from the carbon emissions of the latest app and therefore outside the application boundary.

What to do?

Right now, I don't see a good solution to scenarios where we need to determine an SCI score for currently active models. There seems to be a condition on the application of the SCI in order for model training to be included:

  • If the time window captured by an SCI score is smaller than the total lifespan of a model, it's critical to understand that the SCI score is sensitive to the window's start time, stop time and duration.

  • We must appreciate that this could be gamed, or yield unrepresentative values. At a bare minimum, an SCI score without precise information about the time period being used should not be accepted.

On the other hand, we can be more confident in SCI scores that account for the entire lifespan of a model. Unfortunately, retrospective calculations can't be used to course-correct or mitigate emissions in real time, and they impose a frustrating lag between action and accountability.

The answer may well be that model training sits outside the boundary of the SCI calculation, and can only be accounted for using absolute carbon emissions, not carbon intensity expressed as a rate. Perhaps the SCI applies only to AI inference, and the carbon emitted during training is either omitted altogether, or becomes a number that comes along with the SCI as a kind of sidecar. e.g. SCI = 1.3g CO2e/query + 900kg training.
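As a sketch of what that sidecar representation might look like in practice - the structure and field names here are my own invention, not a proposed standard:

```python
from dataclasses import dataclass

@dataclass
class SciWithSidecar:
    inference_g_per_query: float  # rate-based SCI for inference only
    training_kg: float            # one-off absolute training emissions

score = SciWithSidecar(inference_g_per_query=1.3, training_kg=900.0)
print(f"SCI = {score.inference_g_per_query} g CO2e/query "
      f"+ {score.training_kg} kg training")
```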

At the time of writing this, the inheritance issues also feel unsolvable, but I'm open-minded to solutions that emerge after I've spent a lot more time thinking on it and hopefully had some feedback from others in the space. At a minimum, determining an industry standard that defines how to account for model hierarchies is going to be critical. At least then, we will be better set up to compare like for like and understand the limitations of different calculations.

Some other perspectives

After posting this article on LinkedIn, I got some very insightful feedback. One aspect I hadn't originally considered was that the uneven allocation of training carbon over time (i.e. penalizing early adopters and rewarding latecomers) might be a feature, not a bug. This is because there's a sustainability benefit to making better use of older models, extending their useful lifespan rather than chasing the latest shiny new model, especially because newer models are typically more carbon expensive due to the bigger-is-better paradigm for AI.

On the other hand, there was also feedback reinforcing the idea of omitting training carbon from SCI on the basis that the user is only responsible for the number of queries they make to a given model - responsibility for training carbon rests with the entity doing the training. Since SCI is supposed to be a metric that can be used as a sustainability KPI, it makes sense for training to be outside of the application boundary.

Finally, it was pointed out that the issues described here apply to all embodied carbon allocations because the actual hardware lifespan isn't known in advance. I agree with that - but think the problem is exacerbated for AI. The main conceptual difference is that for a "traditional" app, there's a real-time coupling between the app's usage and the embodied carbon you allocate to it. Your software lifespan can't exceed the hardware lifespan, so there's always some of the total embodied carbon available to allocate without double-counting. There's some added complexity due to upgrading or replacing hardware - but you still don't typically see a siloed blob of historical embodied carbon to allocate that is decoupled from what your software does today. It's that decoupling that introduces a double-counting risk for AI applications.

So, I'd say the issue isn't entirely resolved, but from the very small sample of commenters on my post, it seems like there's a preference for omitting training carbon from the SCI.