If only data centers would participate in demand response
Even for AI training workloads, data center demand response remains an academic exercise - intriguing but impractical.
Data center demand response is one of those topics that academics love. If you want to guarantee funding, journal paper acceptance, and press coverage, then researching data center demand response is a good bet.
Yet, despite its theoretical appeal, data centers are unlikely to adopt it in any meaningful way. This is a classic example of academic research being completely disconnected from commercial realities.
The topic has come up again with a new report:
Norris, T. H., T. Profeta, D. Patino-Echeverri, and A. Cowie-Haskell. 2025. Rethinking Load Growth: Assessing the Potential for Integration of Large Flexible Loads in US Power Systems. NI R 25-01. Durham, NC: Nicholas Institute for Energy, Environment & Sustainability, Duke University.
I wrote about this topic back in 2023, but let’s take another look.
Data center buildout is power constrained
It’s clear that there will be a need for much more data center capacity in coming years. The problem is not building the data centers themselves, but provisioning the power capacity (transmission and interconnection) to support the load. The US grid is not in a good state to handle such expansion, with backlogs of up to 7 years in some regions. I’ve written about similar problems in the UK.
So the results of this report are interesting:
76 GW of new load—equivalent to 10% of the nation’s current aggregate peak demand—could be integrated with an average annual load curtailment rate of 0.25% (i.e., if new loads can be curtailed for 0.25% of their maximum uptime)
98 GW of new load could be integrated at an average annual load curtailment rate of 0.5%, and 126 GW at a rate of 1.0%
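To put those rates in context, here is a quick back-of-the-envelope conversion from annual curtailment rate to hours per year. The arithmetic is mine, not the report's, and "maximum uptime" in the report may be defined slightly differently, but it gives the order of magnitude:

```python
# Convert the report's annual curtailment rates into approximate hours per year.
HOURS_PER_YEAR = 8760

for new_load_gw, curtailment_rate in [(76, 0.0025), (98, 0.005), (126, 0.01)]:
    hours = curtailment_rate * HOURS_PER_YEAR
    print(f"{new_load_gw} GW of new load -> ~{hours:.0f} hours of curtailment per year")

# 76 GW  -> ~22 hours/year
# 98 GW  -> ~44 hours/year
# 126 GW -> ~88 hours/year
```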
In other words, much of that new load could be connected to the existing grid, without waiting for new infrastructure, if those data centers curtailed their power demand for short periods each year.
The constraints on power procurement and grid improvements are mostly regulatory, so this is an opportunity for the new administration to have a real impact. However, there is so much regulation to deal with at various levels of government that other solutions are worth considering.
Uptime assumptions
Data centers are engineered for uptime, offering a controlled environment for reliable compute services. Redundant power supplies, backup batteries, and rapid-start generators are standard.
Uptime is assumed, but well-architected applications assume failure might happen. This is why cloud services are designed around zones within regions. Each zone is typically a distinct physical data center, and because network latency between zones needs to be minimized, those data centers have to sit close to each other within a small geographic area.
Applications can often handle zonal outages if architected for it. Regional failures, however, are tougher - especially for stateful applications. Regional failover is usually reserved for disaster recovery, as AWS explains:
A Multi-AZ architecture is also part of a DR strategy designed to make workloads better isolated and protected from issues such as power outages, lightning strikes, tornadoes, earthquakes, and more. DR strategies may also make use of multiple AWS Regions. For example, in an active/passive configuration, service for the workload fails over from its active Region to its DR Region if the active Region can no longer serve requests.
Changing this assumption to one where a data center might go offline at short notice is a major shift in how applications must be designed. Sure, one data center (or zone) might be able to go offline if you have built zonal failover into your architecture, but are zones spread across distinct grid segments? Unlikely.
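To make that concrete, here is a minimal sketch of what curtailment-aware zonal failover might look like, assuming you even knew which grid segment fed each zone. The zone names follow common cloud naming conventions, but the grid-segment mapping is entirely hypothetical; providers don't publish it:

```python
from dataclasses import dataclass

@dataclass
class Zone:
    name: str
    grid_segment: str   # hypothetical: which feeder/substation serves the zone
    healthy: bool = True

def zones_to_drain_to(zones: list[Zone], curtailed_segment: str) -> list[Zone]:
    """Return zones that could absorb traffic during a curtailment event.

    Zonal failover only helps if the remaining zones sit on a different grid
    segment; otherwise the whole region gets curtailed together.
    """
    return [z for z in zones if z.healthy and z.grid_segment != curtailed_segment]

region = [
    Zone("eu-west-1a", grid_segment="substation-A"),
    Zone("eu-west-1b", grid_segment="substation-A"),  # same feeder as 1a
    Zone("eu-west-1c", grid_segment="substation-B"),
]

print([z.name for z in zones_to_drain_to(region, curtailed_segment="substation-A")])
# ['eu-west-1c'] - a single remaining zone, which may not have spare capacity
```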
Predictable flexibility
Short-notice flexibility of this sort is incompatible with how the majority of applications are built today. However, what about managing workloads with more notice? The California duck curve is a well-known phenomenon, so could we architect facilities around predictable periods of peak demand?
Google is already doing this. They have three case studies across Europe, Asia and the US where they use predictable events to reduce peak power consumption:
Scheduled power reductions during peak periods of Winter 2022-23 between 5pm-9pm in the Netherlands, Belgium, Ireland, Finland, and Denmark.
Daily peak power reduction in Taiwan during the summers of 2022 and 2023.
Reduced data center power consumption in Oregon, Nebraska, and the Southeast during extreme weather events.
These successes share a theme: predictability (or sufficient notice) and Google’s total control over its software and facilities. As they explain:
When we receive notice from a grid operator of a forecasted local grid event, for example an extreme weather event that will cause a supply constraint, we can alert our global computing planning system to when and where it will take place. This alert activates an algorithm that generates hour-by-hour instructions for specified data centers to limit non-urgent compute tasks for the duration of the grid event, and allows them to be rescheduled after the grid event has passed. When feasible, some of these tasks get rerouted to a data center on a different power grid.
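A heavily simplified sketch of that kind of scheduling might look something like this. It is my illustration of the idea, not Google's actual planning system, and the site names and task list are made up:

```python
from datetime import datetime, timedelta

# Hypothetical task list: (name, urgent, preferred_site)
TASKS = [
    ("serve-search-traffic",  True,  "eemshaven"),
    ("batch-video-transcode", False, "eemshaven"),
    ("ml-training-job",       False, "eemshaven"),
]

def plan(tasks, event_start, event_end, alternate_site=None):
    """Rough sketch: defer non-urgent work past a forecast grid event,
    or reroute it to a site on a different power grid if one is available."""
    schedule = []
    for name, urgent, site in tasks:
        if urgent:
            schedule.append((name, site, event_start))            # keeps running through the event
        elif alternate_site:
            schedule.append((name, alternate_site, event_start))  # reroute to another grid
        else:
            schedule.append((name, site, event_end))              # defer until the event ends
    return schedule

event_start = datetime(2023, 1, 15, 17, 0)      # 5pm
event_end = event_start + timedelta(hours=4)    # 9pm
for name, site, start in plan(TASKS, event_start, event_end, alternate_site="hamina"):
    print(f"{name}: run at {site} from {start:%H:%M}")
```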
To what extent can others apply the same level of sophistication to their data center usage, particularly when using shared cloud services? The report itself notes this:
These facilities house multiple tenants, each with varying operational requirements. Coordinating demand response participation in such environments introduces layers of administrative and logistical complexity, as operators must mediate cost- and reward-sharing agreements among tenants.
Spatial flexibility: moving workloads
The report focuses on how AI-specialized data centers, with deferrable tasks like neural network training, could enable load flexibility:
The central hypothesis is that the evolving computational load profiles of AI-specialized data centers facilitate operational capabilities that are more amendable to load flexibility. Unlike the many real-time processing demands typical of conventional data center workloads, such as cloud services and enterprise applications, the training of neural networks that power large language models and other machine learning algorithms is deferrable.
Unlike real-time cloud or enterprise workloads, AI training can theoretically pause. But this overlooks key costs:
Duplication: Maintaining redundant infrastructure across regions is expensive.
Migration: Moving large datasets across networks is slow and costly (see the rough calculation below).
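As a rough illustration of the migration cost, here is a back-of-the-envelope calculation. The dataset size, link speed, and utilisation are my own assumptions, not figures from the report:

```python
# Rough transfer-time arithmetic for moving a training dataset between regions.
dataset_tb = 1000     # 1 PB of training data
link_gbps = 100       # dedicated inter-region link
utilisation = 0.7     # realistic sustained throughput

seconds = dataset_tb * 8_000 / (link_gbps * utilisation)
print(f"~{seconds / 3600:.0f} hours to move {dataset_tb} TB at {link_gbps} Gbps")
# ~32 hours - longer than many of the curtailment windows above
```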
Does shifting workloads to off-peak grids save enough to justify these expenses? If workloads pause instead, can they resume seamlessly?
Snapshotting & checkpointing is possible, but convergence is a key part of AI training runs. That relies on regular recalculations across large-scale distributed GPU or TPU clusters, adaptive optimization, and learning rate (step size) tweaks, which cannot always be restarted without data loss.
This might be feasible for small jobs, but the risk involved with interrupting large jobs could outweigh any benefits.
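For what it's worth, here is a minimal sketch of what checkpoint-and-pause could look like for a single process, assuming the curtailment notice arrives as a Unix signal. It deliberately glosses over everything that makes this hard at scale: sharded model and optimizer state, learning-rate schedules, and in-flight collective operations across thousands of accelerators.

```python
import os
import pickle
import signal

CHECKPOINT = "train_state.pkl"
curtailed = False

def on_curtailment(signum, frame):
    # Pretend the grid operator's notice reaches the training process as a signal.
    global curtailed
    curtailed = True

signal.signal(signal.SIGUSR1, on_curtailment)  # POSIX only

# Resume from a previous checkpoint if we were interrupted earlier.
state = {"step": 0}
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT, "rb") as f:
        state = pickle.load(f)

TOTAL_STEPS = 10_000
while state["step"] < TOTAL_STEPS:
    # ... one training step would go here (forward, backward, optimizer update) ...
    state["step"] += 1

    if curtailed:
        # Persist enough state to resume: model weights, optimizer moments,
        # learning-rate schedule position, data-loader cursor. In a real
        # distributed run this state is sharded across thousands of GPUs.
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)
        print(f"Checkpointed at step {state['step']}, pausing for curtailment.")
        break
```

Even this toy version shows that the decision to pause has to be wired into the training loop itself, which is only realistic when the operator controls both the facility and the workload.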
What’s missing from these studies
Even without data center growth, there are clearly major issues with how long it takes to upgrade grid capacity, deliver new equipment, and improve reliability. Regulations in the UK and US have been designed conservatively to minimize disruption and maintain reliability - precisely what you want when grid demand is stable. But that is changing. Regulations now need to adapt to allow for a faster pace.
Examining demand response is a good idea because it can offer another solution (why not do both?). However, every analysis I’ve read misses a discussion of the practicalities:
Workload Flexibility: How many tasks are truly movable in time and space?
Incentives: How do operators weigh power savings against uptime? Does this differ for single-tenant (e.g., Google) versus multi-tenant (e.g., cloud providers) setups?
Spot Markets: Could cloud providers use spot pricing to nudge customers toward flexibility?
Development Costs: How should developers handle replication and migration costs? What notice period mitigates these?
Workload Fit: Are AI training runs as pausable as assumed? How do checkpointing and scheduling affect convergence, especially for large-scale jobs?
Without tackling these, data center demand response remains an academic exercise - intriguing but impractical.