How do you set capacity for live services?
You did not answer this question.
How to do better
Below are rapidly actionable steps to reduce waste and move beyond provisioning for the extreme peak:
Implement Resource Monitoring and Basic Analytics
- Gather usage metrics to understand actual peaks, off-peak times, and daily/weekly cycles:
- AWS CloudWatch metrics + AWS Cost Explorer to see usage vs. cost patterns
- Azure Monitor + Azure Cost Management for hourly/daily usage trends
- GCP Monitoring + GCP Billing reports (BigQuery export for deeper analysis)
- OCI Monitoring + OCI Cost Analysis for instance-level metrics
- IBM Cloud Monitoring + IBM Cloud Cost Estimator for hourly usage and trends
- Share this data with stakeholders to highlight the discrepancy between peak vs. average usage.
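To make the peak-versus-average gap concrete, a small script can pull utilisation history from your provider's monitoring API. The sketch below is illustrative only, using Python with boto3 against AWS CloudWatch; the instance ID and look-back window are placeholder assumptions, and the other providers listed above expose equivalent metric APIs.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

def cpu_peak_vs_average(instance_id: str, days: int = 14) -> None:
    """Print hourly-average peak and mean CPU for one EC2 instance."""
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=3600,              # one-hour buckets
        Statistics=["Average"],
    )
    points = [p["Average"] for p in stats["Datapoints"]]
    if points:
        print(f"{instance_id}: peak {max(points):.1f}%, average {sum(points)/len(points):.1f}%")

cpu_peak_vs_average("i-0123456789abcdef0")  # hypothetical instance ID
```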
Pilot Scheduled Shutdowns for Non-Critical Systems
- Identify development and testing environments or batch-processing servers that don’t require 24/7 availability:
- Utilise AWS Instance Scheduler to automate start and stop times for Amazon EC2 and RDS instances.
- Implement Azure Automation’s Start/Stop VMs v2 to manage virtual machines on user-defined schedules.
- Apply Google Cloud’s Instance Schedules to automatically start and stop Compute Engine instances based on a schedule.
- Use Oracle Cloud Infrastructure’s Resource Scheduler to manage compute instances’ power states according to defined schedules.
- Use IBM Cloud Schedule Scaling to add or remove instance group capacity, based on daily, intermittent, or seasonal demand. You can create multiple scheduled actions that scale capacity monthly, weekly, daily, hourly, or even every set number of minutes.
- Sharing before-and-after usage data with stakeholders demonstrates immediate cost savings from these schedules without impacting production systems.
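If a full scheduler product feels heavyweight, a scheduled function that stops tagged instances out of hours achieves the same effect. A minimal sketch, assuming AWS, boto3, and a hypothetical `Schedule=office-hours` tag convention; the managed services listed above (Azure Automation, GCP instance schedules, OCI Resource Scheduler, IBM Cloud schedule scaling) are the equivalent turnkey options.

```python
import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    """Stop running instances tagged Schedule=office-hours (invoke from an evening schedule)."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Schedule", "Values": ["office-hours"]},  # hypothetical tag convention
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"stopped": instance_ids}
```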
Explore Simple Autoscaling Solutions
Even if you continue peak provisioning for mission-critical workloads, consider selecting a smaller or non-critical service to test autoscaling:
AWS Auto Scaling Groups – basic CPU-based triggers: Amazon EC2 Auto Scaling allows you to automatically add or remove EC2 instances based on CPU utilisation or other metrics, ensuring your application scales to meet demand.
Azure Virtual Machine Scale Sets – scale by CPU or memory usage: Azure Virtual Machine Scale Sets enable you to create and manage a group of load-balanced VMs, automatically scaling the number of instances based on CPU or memory usage to match your workload demands.
GCP Managed Instance Groups – autoscale based on utilisation thresholds: Google Cloud’s Managed Instance Groups provide autoscaling capabilities that adjust the number of VM instances based on utilisation metrics, such as CPU usage, to accommodate changing workloads.
OCI Instance Pool Autoscaling – CPU or custom metrics triggers: Oracle Cloud Infrastructure’s Instance Pool Autoscaling allows you to automatically adjust the number of instances in a pool based on CPU utilisation or custom metrics, helping to optimise performance and cost.
IBM Cloud Auto Scale for VPC allows you to create an instance group to scale according to your requirements. Based on the target utilisation metrics that you define, the instance group can dynamically add or remove instances to achieve your specified instance availability.
Implementing autoscaling in a controlled environment allows you to evaluate its benefits and challenges, providing valuable insights before considering broader adoption for more critical workloads.
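For a first pilot, a single target-tracking policy is usually enough. The sketch below attaches a CPU-based target-tracking policy to an existing AWS Auto Scaling group with boto3; the group name and the 60% target are illustrative assumptions, and the Azure, GCP, OCI, and IBM services above accept analogous thresholds.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep average CPU across the group near 60%; the group scales out above that
# level and back in when utilisation falls.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="pilot-web-asg",          # hypothetical group name
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 60.0,
    },
)
```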
Review Reserved or Discounted Pricing
If you must maintain consistently high capacity, consider vendor discount programs to reduce per-hour costs:
AWS Savings Plans or Reserved Instances: AWS offers Savings Plans, which provide flexibility by allowing you to commit to a consistent amount of compute usage (measured in $/hour) over a 1- or 3-year term, applicable across various services and regions. Reserved Instances, on the other hand, involve committing to specific instance configurations for a term, offering significant discounts for predictable workloads.
Azure Reservations for VMs and Reserved Capacity: Azure provides Reservations that allow you to commit to a specific VM or database service for a 1- or 3-year period, resulting in cost savings compared to pay-as-you-go pricing. These reservations are ideal for workloads with predictable resource requirements.
GCP Committed Use Discounts: Google Cloud offers Committed Use Discounts, enabling you to commit to a certain amount of usage for a 1- or 3-year term, which can lead to substantial savings for steady-state or predictable workloads.
OCI Universal Credits: Oracle Cloud Infrastructure provides Universal Credits, allowing you to utilise any OCI platform service in any region with a flexible consumption model. By purchasing a sufficient number of credits, you can benefit from volume discounts and predictable billing, which is advantageous for maintaining high-capacity workloads.
IBM Cloud Reservations are a good option when you want significant cost savings and dedicated resources for future deployments: you choose a 1- or 3-year term, server quantity, and specific profile, then provision those servers when needed. With the IBM Cloud Enterprise Savings Plan billing model, you commit to spend a certain amount on IBM Cloud and receive discounts across the platform; you are billed monthly based on usage and continue to receive the discount even after you reach your committed amount.
Implementing these discount programs won’t eliminate over-provisioning but can soften the budget impact.
Engage Leadership on the Financial and Sustainability Benefits
- Present how on-demand autoscaling or even basic scheduling can reduce overhead and potentially improve your service’s environmental footprint.
- Link these improvements to departmental net-zero or cost reduction goals, highlighting easy wins.
Through monitoring, scheduling, basic autoscaling pilots, and potential reserved capacity, you can move away from static peak provisioning. This approach preserves reliability while unlocking efficiency gains—an important step in balancing cost, compliance, and performance goals in the UK public sector.
How to do better
Here are rapidly actionable steps to evolve from manual seasonal scaling to a more automated, responsive model:
Automate the Manual Steps You Already Do
If you anticipate seasonal peaks (e.g., quarterly public reporting load), replace manual processes with scheduled scripts to ensure timely scaling and prevent missed scale-downs:
AWS: Utilise AWS Step Functions in conjunction with Amazon EventBridge Scheduler to automate the start and stop of EC2 instances based on a defined schedule.
Azure: Implement Azure Automation Runbooks within Automation Accounts to create scripts that manage the scaling of resources during peak periods.
Google Cloud Platform (GCP): Leverage Cloud Scheduler to trigger Cloud Functions or Terraform scripts that adjust instance groups in response to anticipated load changes.
Oracle Cloud Infrastructure (OCI): Use Resource Manager stacks combined with Cron tasks to schedule scaling events, ensuring resources are appropriately managed during peak times.
Automating these processes ensures that scaling actions occur as planned, reducing the risk of human error and optimising resource utilisation during peak and off-peak periods.
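As a concrete example of replacing a manual runbook, the sketch below registers two scheduled actions on an AWS Auto Scaling group with boto3: scale up ahead of a known weekday peak and scale back afterwards (the "scale-back window" discussed next). The group name, sizes, and UTC cron expressions are assumptions; the Azure, GCP, and OCI mechanisms above offer equivalent schedules.

```python
import boto3

autoscaling = boto3.client("autoscaling")
GROUP = "reporting-api-asg"  # hypothetical group name

# Scale up at 07:30 UTC on weekdays ahead of the morning peak...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName=GROUP,
    ScheduledActionName="weekday-morning-scale-up",
    Recurrence="30 7 * * MON-FRI",
    MinSize=4, MaxSize=12, DesiredCapacity=6,
)

# ...and enforce the scale-back window at 19:00 UTC so extra capacity never lingers.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName=GROUP,
    ScheduledActionName="weekday-evening-scale-down",
    Recurrence="0 19 * * MON-FRI",
    MinSize=1, MaxSize=4, DesiredCapacity=2,
)
```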
Identify and Enforce “Scale-Back” Windows
- Even if you scale up for busy times, ensure you have a defined “sunset” for increased capacity:
- Configure an autoscaling group or scale set to revert to default size after the peak.
- Set reminders or triggers to ensure you don’t pay for extra capacity indefinitely.
Introduce Autoscaling on a Limited Component
Choose a module that frequently experiences load variations within a day or week—perhaps a web front-end for a public information portal:
AWS: Implement Auto Scaling Groups with CPU-based or request-based triggers to automatically adjust the number of EC2 instances handling your service’s load.
Azure: Utilise Virtual Machine Scale Sets or the AKS Cluster Autoscaler to manage the scaling of virtual machines or Kubernetes clusters for your busiest microservices.
Google Cloud Platform (GCP): Use Managed Instance Groups with load-based autoscaling to dynamically adjust the number of instances serving your front-end application based on real-time demand.
Oracle Cloud Infrastructure (OCI): Apply Instance Pool Autoscaling or the OKE Cluster Autoscaler to automatically scale a specific containerised service in response to workload changes.
Implementing autoscaling on a targeted component allows you to observe immediate benefits, such as improved resource utilisation and cost efficiency, which can encourage broader adoption across your infrastructure.
Consider Serverless for Spiky Components
If certain tasks run sporadically (e.g., monthly data transformation or PDF generation), investigate moving them to event-driven or serverless solutions:
AWS: Utilise AWS Lambda for event-driven functions or AWS Fargate for running containers without managing servers. AWS Lambda is ideal for short-duration, event-driven tasks, while AWS Fargate is better suited for longer-running applications and tasks requiring intricate orchestration.
Azure: Implement Azure Functions for serverless compute, Logic Apps for workflow automation, or Container Apps for running microservices and containerised applications. Azure Logic Apps can automate workflows and business processes, making them suitable for scheduled tasks.
Google Cloud Platform (GCP): Deploy Cloud Functions for lightweight event-driven functions or Cloud Run for running containerised applications in a fully managed environment. Cloud Run is suitable for web-based workloads, REST or gRPC APIs, and internal custom back-office apps.
Oracle Cloud Infrastructure (OCI): Use OCI Functions for on-demand, serverless workloads. OCI Functions is a fully managed, multi-tenant, highly scalable, on-demand, Functions-as-a-Service platform built on enterprise-grade infrastructure.
Transitioning to serverless solutions for sporadic tasks eliminates the need to manually adjust virtual machines for short bursts, enhancing efficiency and reducing operational overhead.
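As an illustration of the pattern, the sketch below is a minimal AWS Lambda handler for a sporadic job (for example, a monthly extract) triggered by a schedule or event rather than a VM left running all month. The bucket name and processing logic are placeholders; Azure Functions, Cloud Functions/Cloud Run, and OCI Functions support the same event-driven shape.

```python
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered on a schedule; compute runs only for the duration of the job."""
    bucket = "monthly-reports-example"          # hypothetical bucket
    summary = {"records_processed": 0}          # placeholder for the real transformation logic
    s3.put_object(
        Bucket=bucket,
        Key="summaries/latest.json",
        Body=json.dumps(summary).encode("utf-8"),
    )
    return summary
```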
Monitor and Alert on Usage Deviations
Utilise cost and performance alerts to detect unexpected surges or prolonged idle resources:
AWS: Implement AWS Budgets to set custom cost and usage thresholds, receiving alerts when limits are approached or exceeded. Additionally, use Amazon CloudWatch’s anomaly detection to monitor metrics and identify unusual patterns in resource utilisation.
Azure: Set up Azure Monitor alerts to track resource performance and configure cost anomaly alerts within Azure Cost Management to detect and notify you of unexpected spending patterns.
Google Cloud Platform (GCP): Create budgets in Google Cloud Billing and configure Pub/Sub notifications to receive alerts on cost anomalies, enabling prompt responses to unexpected expenses.
Oracle Cloud Infrastructure (OCI): Establish budgets and set up alert rules in OCI Cost Management to monitor spending. Additionally, configure OCI Alarms with notifications to detect and respond to unusual resource usage patterns.
Implementing these alerts enables quicker responses to anomalies, reducing the reliance on manual monitoring and helping to maintain optimal resource utilisation and cost efficiency.
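A budget with an alert threshold is a quick way to catch prolonged over-provisioning. The sketch below creates a monthly AWS cost budget with an 80% email alert using boto3; the account ID, amount, and address are placeholders, and Azure Cost Management, GCP Billing budgets, and OCI budgets provide the same construct.

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",                       # hypothetical account ID
    Budget={
        "BudgetName": "live-service-monthly",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,                      # alert at 80% of the budget
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "finops-team@example.gov.uk"}],
    }],
)
```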
By automating your manual scaling processes, exploring partial autoscaling, and shifting spiky tasks to serverless, you unlock more agility and cost efficiency. This approach helps ensure you’re not left scrambling if usage deviates from seasonal patterns.
How to do better
Below are actionable ways to upgrade from basic autoscaling:
Broaden Autoscaling Coverage
Extend autoscaling to more workloads to enhance efficiency and responsiveness:
AWS:
- EC2 Auto Scaling: Implement EC2 Auto Scaling across multiple groups to automatically adjust the number of EC2 instances based on demand, ensuring consistent application performance.
- ECS Service Auto Scaling: Configure Amazon ECS Service Auto Scaling to automatically scale your containerised services in response to changing demand.
- RDS Auto Scaling: Utilise Amazon Aurora Auto Scaling to automatically adjust the number of Aurora Replicas to handle changes in workload demand.
Azure:
- Virtual Machine Scale Sets (VMSS): Deploy Azure Virtual Machine Scale Sets to manage and scale multiple VMs for various services, automatically adjusting capacity based on demand.
- Azure Kubernetes Service (AKS): Implement the AKS Cluster Autoscaler to automatically adjust the number of nodes in your cluster based on resource requirements.
- Azure SQL Elastic Pools: Use Azure SQL Elastic Pools to manage and scale multiple databases with varying usage patterns, optimising resource utilisation and cost.
Google Cloud Platform (GCP):
- Managed Instance Groups (MIGs): Expand the use of Managed Instance Groups with autoscaling across multiple zones to ensure high availability and automatic scaling of your applications.
- Cloud SQL Autoscaling: Leverage Cloud SQL’s automatic storage increase to handle growing database storage needs without manual intervention.
Oracle Cloud Infrastructure (OCI):
- Instance Pool Autoscaling: Apply OCI Instance Pool Autoscaling to additional workloads, enabling automatic adjustment of compute resources based on performance metrics.
- Database Auto Scaling: Utilise OCI Autonomous Database Auto Scaling to automatically scale compute and storage resources in response to workload demands.
Gradually incorporating more of your application’s microservices into the autoscaling framework can lead to improved performance, cost efficiency, and resilience across your infrastructure.
Incorporate More Granular Metrics
Move beyond simple CPU-based thresholds to handle memory usage, disk I/O, or application-level concurrency:
AWS: Implement Amazon CloudWatch custom metrics to monitor specific parameters such as memory usage, disk I/O, or application-level metrics. Additionally, utilise Application Load Balancer (ALB) request count to trigger autoscaling based on incoming traffic.
Azure: Use Azure Monitor custom metrics to track specific performance indicators like queue length or HTTP request rate. These metrics can feed into Virtual Machine Scale Sets or the Azure Kubernetes Service (AKS) Horizontal Pod Autoscaler (HPA) for more responsive scaling.
Google Cloud Platform (GCP): Leverage Google Cloud’s Monitoring custom metrics to capture detailed performance data. Implement request-based autoscaling in Google Kubernetes Engine (GKE) or Cloud Run to adjust resources based on real-time demand.
Oracle Cloud Infrastructure (OCI): Utilise OCI Monitoring service’s custom metrics to track parameters such as queue depth, memory usage, or user concurrency. These metrics can inform autoscaling decisions to ensure optimal performance.
Incorporating more granular metrics allows for precise autoscaling, ensuring that resources are allocated based on comprehensive performance indicators rather than relying solely on CPU usage.
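Feeding an application-level signal into the monitoring service is the prerequisite for scaling on anything richer than CPU. The sketch below publishes a queue-depth custom metric to CloudWatch with boto3, which an alarm or scaling policy can then reference; the namespace, metric name, and value source are assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_queue_depth(service: str, depth: int) -> None:
    """Publish an application-level metric that autoscaling policies can target."""
    cloudwatch.put_metric_data(
        Namespace="PublicPortal/Backend",           # hypothetical namespace
        MetricData=[{
            "MetricName": "QueueDepth",
            "Dimensions": [{"Name": "Service", "Value": service}],
            "Value": depth,
            "Unit": "Count",
        }],
    )

publish_queue_depth("case-processing", 42)
```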
Implement Dynamic, Scheduled, or Predictive Scaling
If you observe consistent patterns in your application’s usage—such as increased activity during lunchtime or reduced traffic on weekends—consider enhancing your existing autoscaling strategies with scheduled scaling actions:
AWS: Configure Amazon EC2 Auto Scaling scheduled actions to adjust capacity at predetermined times. For instance, you can set the system to scale up at 08:00 and scale down at 20:00 to align with daily usage patterns.
Azure: Utilise Azure Virtual Machine Scale Sets to implement scheduled scaling. Additionally, integrate scaling adjustments into your Azure DevOps pipelines to automate capacity changes in response to anticipated workload variations.
Google Cloud Platform (GCP): Employ Managed Instance Group (MIG) scheduled scaling to define scaling behaviors based on time-based schedules. Alternatively, use Cloud Scheduler to trigger scripts that adjust resources in line with expected demand fluctuations.
Oracle Cloud Infrastructure (OCI): Set up scheduled autoscaling for instance pools to manage resource allocation according to known usage patterns. You can also deploy Oracle Functions to execute timed scaling events, ensuring resources are appropriately scaled during peak and off-peak periods.
Implementing scheduled scaling allows your system to proactively adjust resources in anticipation of predictable workload changes, enhancing performance and cost efficiency.
For environments with variable and unpredictable workloads, consider utilising predictive scaling features. Predictive scaling analyzes historical data to forecast future demand, enabling the system to scale resources in advance of anticipated spikes. This approach combines the benefits of both proactive and reactive scaling, ensuring optimal resource availability and responsiveness.
AWS: Explore Predictive Scaling for Amazon EC2 Auto Scaling, which uses machine learning models to forecast traffic patterns and adjust capacity accordingly.
Azure: Use Predictive Autoscale in Azure Monitor for Virtual Machine Scale Sets, which forecasts CPU load from historical usage patterns; for other services, analyse historical metrics through Azure Monitor and create automation scripts that adjust scaling based on predicted trends.
GCP: Google Cloud’s autoscaler primarily operates on real-time metrics. For predictive capabilities, consider developing custom predictive models using historical data from Cloud Monitoring to inform scaling decisions.
OCI: Oracle Cloud Infrastructure allows for the creation of custom scripts and functions to implement predictive scaling based on historical usage patterns, although a native predictive scaling feature may not be available.
By integrating scheduled and predictive scaling strategies, you can enhance your application’s ability to handle varying workloads efficiently, ensuring optimal performance while managing costs effectively.
Enhance Observability to Validate Autoscaling Efficacy
Instrument your autoscaling events and track them to ensure optimal performance and resource utilisation:
Dashboard Real-Time Metrics: Monitor CPU, memory, and queue metrics alongside scaling events to visualise system performance in real-time.
Analyze Scaling Timeliness: Assess whether scaling actions occur promptly by checking for prolonged high CPU usage or frequent scale-in events that may indicate over-scaling.
Tools:
AWS:
AWS X-Ray: Utilise AWS X-Ray to trace requests through your application, gaining insights into performance bottlenecks and the impact of scaling events.
Amazon CloudWatch: Create dashboards in Amazon CloudWatch to display real-time metrics and logs, correlating them with scaling activities for comprehensive monitoring.
Azure:
Azure Monitor: Leverage Azure Monitor to collect and analyze telemetry data, setting up alerts and visualisations to track performance metrics in relation to scaling events.
Application Insights: Use Azure Application Insights to detect anomalies and diagnose issues, correlating scaling actions with application performance for deeper analysis.
Google Cloud Platform (GCP):
Cloud Monitoring: Employ Google Cloud’s Operations Suite to monitor and visualise metrics, setting up dashboards that reflect the relationship between resource utilisation and scaling events.
Cloud Logging and Tracing: Implement Cloud Logging and Cloud Trace to collect logs and trace data, enabling the analysis of autoscaling impacts on application performance.
Oracle Cloud Infrastructure (OCI):
OCI Logging: Use OCI Logging to manage and search logs, providing visibility into scaling events and their effects on system performance.
OCI Monitoring: Utilise OCI Monitoring to track metrics and set alarms, ensuring that scaling actions align with performance expectations.
By enhancing observability, you can validate the effectiveness of your autoscaling strategies, promptly identify and address issues, and optimise resource allocation to maintain application performance and cost efficiency.
Adopt Spot/Preemptible Instances for Autoscaled Non-Critical Workloads
To further optimise costs, consider utilising spot or preemptible virtual machines (VMs) for non-critical, autoscaled workloads. These instances are offered at significant discounts compared to standard on-demand instances but can be terminated by the cloud provider when resources are needed elsewhere. Therefore, they are best suited for fault-tolerant and flexible applications.
AWS: Implement EC2 Spot Instances within an Auto Scaling Group to run fault-tolerant workloads at up to 90% off the On-Demand price. By configuring Auto Scaling groups with mixed instances, you can combine Spot Instances with On-Demand Instances to balance cost and availability.
Azure: Utilise Azure Spot Virtual Machines within Virtual Machine Scale Sets for non-critical workloads. Azure Spot VMs allow you to take advantage of unused capacity at significant cost savings, making them ideal for interruptible workloads such as batch processing jobs and development/testing environments.
Google Cloud Platform (GCP): Deploy Preemptible VMs in Managed Instance Groups to run short-duration, fault-tolerant workloads at a reduced cost. Preemptible VMs provide substantial savings for workloads that can tolerate interruptions, such as data analysis and batch processing tasks.
Oracle Cloud Infrastructure (OCI): Leverage Preemptible Instances for batch processing or flexible tasks. OCI Preemptible Instances offer a cost-effective solution for workloads that are resilient to interruptions, enabling efficient scaling of non-critical applications.
By integrating these cost-effective instance types into your autoscaling strategies, you can significantly reduce expenses for non-critical workloads while maintaining the flexibility to scale resources as needed.
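One way to blend spot and on-demand capacity in a single group is a mixed-instances policy. The sketch below creates an AWS Auto Scaling group that keeps a small on-demand baseline and fills the rest with Spot capacity; the launch template, subnets, sizes, and percentages are illustrative assumptions.

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="batch-workers-mixed",          # hypothetical group
    MinSize=2,
    MaxSize=20,
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222",     # hypothetical subnets
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "batch-worker",     # hypothetical launch template
                "Version": "$Latest",
            },
            "Overrides": [{"InstanceType": "m5.large"}, {"InstanceType": "m5a.large"}],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 2,                    # always keep two on-demand instances
            "OnDemandPercentageAboveBaseCapacity": 25,    # 75% of extra capacity from Spot
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```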
By broadening autoscaling across more components, incorporating richer metrics, scheduling, and advanced cost strategies like spot instances, you transform your “basic” scaling approach into a more agile, cost-effective solution. Over time, these steps foster robust, automated resource management across your entire environment.
How to do better
Here are actionable ways to refine your widespread autoscaling strategy to handle more nuanced workloads:
Adopt Application-Level or Log-Based Metrics
Move beyond CPU and memory metrics to incorporate transaction rates, request latency, or user concurrency for more responsive and efficient autoscaling:
AWS:
- CloudWatch Custom Metrics: Publish custom metrics derived from application logs to Amazon CloudWatch, enabling monitoring of specific application-level indicators such as transaction rates and user concurrency.
- Real-Time Log Analysis with Kinesis and Lambda: Set up real-time log analysis by streaming logs through Amazon Kinesis and processing them with AWS Lambda to generate dynamic scaling triggers based on application behavior.
Azure:
- Application Insights: Utilise Azure Monitor’s Application Insights to collect detailed usage data, including request rates and response times, which can inform scaling decisions for services hosted in Azure Kubernetes Service (AKS) or Virtual Machine Scale Sets.
- Custom Logs for Scaling Signals: Implement custom logging to capture specific application metrics and configure Azure Monitor to use these logs as signals for autoscaling, enhancing responsiveness to real-time application demands.
Google Cloud Platform (GCP):
- Cloud Monitoring Custom Metrics: Create custom metrics in Google Cloud’s Monitoring to track application-specific indicators such as request count, latency, or queue depth, facilitating more precise autoscaling of Compute Engine (GCE) instances or Google Kubernetes Engine (GKE) clusters.
- Integration with Logging: Combine Cloud Logging with Cloud Monitoring to analyze application logs and derive metrics that can trigger autoscaling events based on real-time application performance.
Oracle Cloud Infrastructure (OCI):
- Monitoring Custom Metrics: Leverage OCI Monitoring to create custom metrics from application logs, capturing detailed performance indicators that can inform autoscaling decisions.
- Logging Analytics: Use OCI Logging Analytics to process and analyze application logs, extracting metrics that reflect user concurrency or transaction rates, which can then be used to trigger autoscaling events.
Incorporating application-level and log-based metrics into your autoscaling strategy allows for more nuanced and effective scaling decisions, ensuring that resources align closely with actual application demands and improving overall performance and cost efficiency.
Introduce Multi-Metric Policies
- Instead of a single threshold, combine metrics. For instance:
- Scale up if CPU > 70% AND average request latency > 300ms.
- This ensures you only scale when both resource utilisation and user experience degrade, reducing false positives or unneeded expansions.
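One way to express this AND condition on AWS is a composite CloudWatch alarm that only fires when both child alarms are breaching; the sketch below assumes two existing child alarms with the hypothetical names shown, and other providers support similar multi-condition rules.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fires only when the existing CPU alarm AND the latency alarm are both in ALARM
# state; a scaling action or notification can then hang off this single alarm.
cloudwatch.put_composite_alarm(
    AlarmName="scale-up-cpu-and-latency",
    AlarmRule='ALARM("asg-cpu-above-70") AND ALARM("api-latency-above-300ms")',  # hypothetical child alarms
    ActionsEnabled=True,
)
```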
Implement Predictive or Machine Learning–Driven Autoscaling
To anticipate demand spikes before traditional metrics like CPU utilisation react, consider implementing predictive or machine learning–driven autoscaling solutions offered by cloud providers:
AWS:
- Predictive Scaling: Leverage Predictive Scaling for Amazon EC2 Auto Scaling, which analyzes historical data to forecast future traffic and proactively adjusts capacity to meet anticipated demand.
Azure:
- Predictive Autoscale: Utilise Predictive Autoscale in Azure Monitor, which employs machine learning to forecast CPU load for Virtual Machine Scale Sets based on historical usage patterns, enabling proactive scaling.
Google Cloud Platform (GCP):
- Custom Machine Learning Models: Develop custom machine learning models to analyze historical performance data and predict future demand, triggering autoscaling events in services like Google Kubernetes Engine (GKE) or Cloud Run based on these forecasts.
Oracle Cloud Infrastructure (OCI):
- Custom Analytics Integration: Integrate Oracle Analytics Cloud with OCI to perform machine learning–based forecasting, enabling predictive scaling by analyzing historical data and anticipating future resource requirements.
Implementing predictive or machine learning–driven autoscaling allows your applications to adjust resources proactively, maintaining performance and cost efficiency by anticipating demand before traditional metrics indicate the need for scaling.
Correlate Autoscaling with End-User Experience
To enhance user satisfaction, align your autoscaling strategies with user-centric metrics such as page load times and overall responsiveness. By monitoring these metrics, you can ensure that scaling actions directly improve the end-user experience.
AWS:
- Application Load Balancer (ALB) Target Response Times: Monitor ALB target response times using Amazon CloudWatch to assess backend performance. Elevated response times can indicate the need for scaling to maintain optimal user experience.
- Network Load Balancer (NLB) Metrics: Track NLB metrics to monitor network performance and identify potential bottlenecks affecting end-user experience.
Azure:
- Azure Front Door Logs: Analyze Azure Front Door logs to monitor end-to-end latency and other performance metrics. Insights from these logs can inform scaling decisions to enhance user experience.
- Application Insights: Utilise Application Insights to collect detailed telemetry data, including response times and user interaction metrics, aiding in correlating autoscaling with user satisfaction.
Google Cloud Platform (GCP):
- Cloud Load Balancing Logs: Examine Cloud Load Balancing logs to assess request latency and backend performance. Use this data to adjust autoscaling policies, ensuring they align with user experience goals.
- Service Level Objectives (SLOs): Define SLOs in Cloud Monitoring to set performance targets based on user-centric metrics, enabling proactive scaling to meet user expectations.
Oracle Cloud Infrastructure (OCI):
- Load Balancer Health Checks: Implement OCI Load Balancer health checks to monitor backend server performance. Use health check data to inform autoscaling decisions that directly impact user experience.
- Custom Application Pings: Set up custom application pings to measure response times and user concurrency, feeding this data into autoscaling triggers to maintain optimal performance during varying user loads.
By integrating user-centric metrics into your autoscaling logic, you ensure that scaling actions are directly correlated with improvements in end-user experience, leading to higher satisfaction and engagement.
Refine Scaling Cooldowns and Timers
- Tweak scale-up and scale-down intervals to avoid thrashing:
- A short scale-up delay can address spikes quickly.
- A slightly longer scale-down delay prevents abrupt resource removals when a short spike recedes.
- Evaluate your autoscaling policy settings monthly to align with evolving traffic patterns.
By incorporating more sophisticated application or log-based metrics, predictive scaling, and user-centric triggers, you ensure capacity aligns closely with real workloads. This approach elevates your autoscaling from a broad CPU/memory-based strategy to a finely tuned system that balances user experience, performance, and cost efficiency.
How to do better
Even at the top level, you can refine and push boundaries further:
Adopt More Granular “Distributed SLO” Metrics
Evaluate Each Microservice’s Service-Level Objectives (SLOs): Define precise SLOs for each microservice, such as ensuring the 99th-percentile latency remains under 400 milliseconds. This granular approach allows for targeted performance monitoring and scaling decisions.
Utilise Cloud Provider Tools to Monitor and Enforce SLOs:
AWS:
- CloudWatch ServiceLens: Integrate Amazon CloudWatch ServiceLens to gain comprehensive insights into application performance and availability, correlating metrics, logs, and traces.
- Custom Metrics and SLO-Based Alerts: Implement custom CloudWatch metrics to monitor specific performance indicators and set up SLO-based alerts to proactively manage service health.
Azure:
- Application Insights: Leverage Azure Monitor’s Application Insights to track detailed telemetry data, enabling the definition and monitoring of SLOs for individual microservices.
- Service Map: Use Azure Monitor’s Service Map to visualise dependencies and performance metrics across services, aiding in the assessment of SLO adherence.
Google Cloud Platform (GCP):
- Cloud Operations Suite: Employ Google Cloud’s Operations Suite to create SLO dashboards that monitor service performance against defined objectives, facilitating informed scaling decisions.
Oracle Cloud Infrastructure (OCI):
- Observability and Management Platform: Implement OCI’s observability tools to define SLOs and correlate them with performance metrics, ensuring each microservice meets its performance targets.
Benefits of Implementing Distributed SLO Metrics:
Precision in Scaling: By closely monitoring how each component meets its SLOs, you can make informed decisions to scale resources appropriately, balancing performance needs with cost considerations.
Proactive Issue Detection: Granular SLO metrics enable the early detection of performance degradations within specific microservices, allowing for timely interventions before they impact the overall system.
Enhanced User Experience: Maintaining stringent SLOs ensures that end-users receive consistent and reliable service, thereby improving satisfaction and trust in your application.
Implementation Considerations:
Define Clear SLOs: Collaborate with stakeholders to establish realistic and measurable SLOs for each microservice, considering factors such as latency, throughput, and error rates.
Continuous Monitoring and Adjustment: Regularly review and adjust SLOs and associated monitoring tools to adapt to evolving application requirements and user expectations.
Conclusion: Adopting more granular “distributed SLO” metrics empowers you to fine-tune your application’s performance management, ensuring that each microservice operates within its defined parameters. This approach facilitates precise scaling decisions, optimising both performance and cost efficiency.
Experiment with Multi-Provider or Hybrid Autoscaling
- If policy allows, or your architecture is containerised, test the feasibility of bursting into another region or cloud for capacity:
- This approach is advanced but can further optimise resilience and cost across providers.
Integrate with Detailed Cost Allocation & Forecasting
- Combine real-time scale data with cost forecasting models:
- AWS Budgets with advanced forecasting, or AWS Cost Anomaly Detection for unplanned scale-ups.
- Azure Cost Management budgets with Power BI integration for detailed analysis.
- GCP Budgets & cost predictions in the Billing console, with BigQuery analysis for scale patterns vs. spend.
- OCI Cost Analysis with usage forecasting and custom alerts for spike detection.
- This ensures you can quickly investigate if an unusual surge in scaling leads to unapproved budget expansions.
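To connect scale data with spend forecasting, a short script can pull the provider's forecast and compare it against an agreed budget. The sketch below uses the AWS Cost Explorer forecast API via boto3; the dates and budget figure are placeholder assumptions, and the Azure, GCP, and OCI tools above expose similar forecasts.

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")
MONTHLY_BUDGET_USD = 20000.0                      # hypothetical agreed budget

start = date.today() + timedelta(days=1)
end = start + timedelta(days=30)

forecast = ce.get_cost_forecast(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Metric="UNBLENDED_COST",
    Granularity="MONTHLY",
)
predicted = float(forecast["Total"]["Amount"])
if predicted > MONTHLY_BUDGET_USD:
    print(f"Forecast {predicted:.0f} exceeds budget {MONTHLY_BUDGET_USD:.0f}: review recent scale-ups")
```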
Leverage AI/ML for Real-Time Scaling Decisions
- Deploy advanced ML models that continuously adapt scaling triggers based on anomaly detection in logs or usage patterns.
- Tools or patterns:
- AWS Lookout for Metrics integrated with AWS Lambda to adjust scaling groups in real-time.
- Azure Cognitive Services or ML pipelines that feed insights to an auto-scaling script in AKS or Scale Sets.
- GCP Vertex AI or Dataflow pipelines analyzing streaming logs to instruct MIG or Cloud Run scaling policies.
- OCI Data Science/AI services that produce dynamic scale signals consumed by instance pools or OKE clusters.
Adopt Sustainable/Green Autoscaling Policies
- If your usage is flexible, consider shifting workloads to times or regions with lower carbon intensity:
- AWS Sustainability Pillar in Well-Architected Framework and region selection guidance for scheduling large tasks.
- Azure Emissions Impact Dashboard integrated with scheduled scale tasks in greener data center regions.
- Google Cloud’s Carbon Footprint and Active Assist for reducing cloud carbon footprint.
- Oracle Cloud Infrastructure’s sustainability initiatives combined with custom autoscaling triggers for environment-friendly computing.
- This step can integrate cost savings with environmental commitments, aligning with the Greening Government Commitments.
By blending advanced SLO-based scaling, multi-provider strategies, cost forecasting, ML-driven anomaly detection, and sustainability considerations, you ensure your autoscaling remains cutting-edge. This not only provides exemplary performance and cost control but also positions your UK public sector organisation as a leader in efficient, responsible cloud computing.
Keep doing what you’re doing, and consider sharing your successes via blog posts or internal knowledge bases. Submit pull requests to this guidance if you have innovative approaches or examples that can benefit other public sector organisations. By exchanging real-world insights, we collectively raise the bar for cloud maturity and cost effectiveness across the entire UK public sector.
How do you run services in the cloud?
You did not answer this question.
How to do better
Here are rapidly actionable improvements to help you move beyond purely static VMs:
Enable Basic Monitoring and Cost Insights
- Even if you keep long-running VMs, gather usage metrics and financial data:
- Check CPU, memory, and storage utilisation. If these metrics show consistent underuse (like 10% CPU usage around the clock), it’s a sign you can downsize or re-architect.
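A quick utilisation scan can flag always-on VMs that sit at low CPU around the clock and are therefore candidates for downsizing. The sketch below is an illustration using Python with boto3 against CloudWatch; the 10% threshold and 30-day look-back are assumptions, and pagination is omitted for brevity.

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

for reservation in ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]:
    for instance in reservation["Instances"]:
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
            StartTime=end - timedelta(days=30),
            EndTime=end,
            Period=86400,                      # daily averages over the last month
            Statistics=["Average"],
        )
        points = [p["Average"] for p in stats["Datapoints"]]
        if points and max(points) < 10.0:      # never above 10% CPU in a month
            print(f"{instance['InstanceId']}: candidate for downsizing or consolidation")
```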
Leverage Built-in Right-sizing Tools
- Major cloud providers offer “right-sizing” recommendations:
- AWS Compute Optimiser to get suggestions for smaller or larger instance sizes.
- Azure Advisor for VM right-sizing to identify underutilised virtual machines.
- GCP Recommender for machine types to optimise resource utilisation.
- OCI Workload and Resource Optimisation for tailored resource recommendations.
- IBM Cloud Resource Controller is the next-generation IBM Cloud platform provisioning layer that manages the lifecycle of IBM Cloud resources in a customer account.
- Make a plan to apply at least one or two right-sizing recommendations each quarter. This is a quick, low-risk path to cost savings and better resource use.
Introduce Simple Scheduling
- If some VMs are only needed during business hours, schedule automatic shutdown at night or on weekends:
- A single action to stop dev/test or lightly used environments after hours can yield noticeable cost (and energy) savings.
Conduct a Feasibility Check for a Small Container Pilot
- Even if you retain most workloads on VMs, pick one small application or batch job and try containerising it:
- AWS Fargate or Amazon EKS for containers.
- Azure Container Instances or Azure Kubernetes Service (AKS).
- Google Cloud Run or Google Kubernetes Engine (GKE).
- Oracle Cloud Infrastructure (OCI) Container Instances or Oracle Kubernetes Engine (OKE).
- IBM Cloud offers two container platform options: Red Hat OpenShift on IBM Cloud and IBM Cloud Kubernetes Service.
- By piloting a single container-based workload, you can assess potential elasticity and determine whether container orchestration solutions meet your needs. This approach allows for quick experimentation with minimal risk.
Raise Awareness with Internal Stakeholders
- Share simple usage and cost graphs with your finance or leadership teams. Show them the difference between “always-on” vs. “scaled” or “scheduled” usage.
- This could drive more formal mandates or budget incentives to encourage partial re-architecture or adoption of short-lived compute in the future.
By monitoring usage, applying right-sizing, scheduling idle time, and introducing a small container pilot, you can meaningfully reduce waste. Over time, you’ll build momentum toward more flexible compute strategies while still respecting the constraints of your existing environment.
How to do better
Here are actionable next steps to accelerate your modernisation journey without overwhelming resources:
Expand Container/Serverless Pilots in a Structured Way
- Identify a short list of low-risk workloads that could benefit from ephemeral compute, such as batch processing or data transformation.
- Use native solutions to reduce complexity:
- AWS Fargate with ECS/EKS for container-based tasks without server management.
- Azure Container Apps or Azure Functions for event-driven workloads.
- Google Cloud Run for container-based microservices or Google Cloud Functions.
- Oracle Cloud Infrastructure (OCI) Container Instances or OCI Functions for short-lived tasks.
- Document real cost/performance outcomes to present a stronger case for further expansion.
Implement Granular VM Auto-Scaling
- Even with VMs, you can configure auto-scaling groups or scale sets to handle changing loads.
- This ensures you pay only for the capacity you need during peak vs. off-peak times.
Use Container Services for Non-Critical Production
- If you have a stable container proof-of-concept, consider migrating a small but genuine production workload. Examples:
- Internal APIs, internal data analytics pipelines, or front-end servers that can scale up/down.
- Focus on microservices that do not require extensive refactoring.
- This fosters real operational experience, bridging from “non-critical tasks” to “production readiness.”
Leverage Cloud Marketplace or Government Frameworks
- Explore container-based solutions or DevOps tooling that might be available under G-Cloud or Crown Commercial Service frameworks.
- Some providers offer managed container solutions pre-configured for compliance or security—this can reduce friction around governance.
Train or Upskill Teams
- Provide short courses or lunch-and-learns on container orchestration (Kubernetes, ECS, AKS, etc.) or serverless fundamentals.
- Many vendors have free or low-cost training.
Building confidence and skills helps teams adopt more advanced compute models.
Through these steps—structured expansions of containerised or serverless pilots, improved auto-scaling of VMs, and staff training—your organisation can gradually shift from “limited experimentation” to a more balanced compute ecosystem. The result is improved agility, potential cost savings, and readiness for more modern architectures.
How to do better
Below are rapidly actionable ways to enhance your mixed compute model:
Adopt Unified Deployment Pipelines
- Strive for standard tooling that can deploy both VMs and container/serverless environments. For instance:
- AWS CodePipeline or AWS CodeBuild integrated with ECS, Lambda, EC2, etc.
- Azure Pipelines or GitHub Actions for VMs, AKS, Azure Functions.
- Google Cloud Build for GCE, GKE, Cloud Run deployments.
- OCI DevOps service for flexible deployments to OKE, Functions, or VMs.
- This reduces fragmentation and fosters consistent best practices (code review, automated testing, environment provisioning).
Enhance Observability
- Implement a single monitoring stack that captures logs, metrics, and traces across VMs, containers, and functions:
- AWS CloudWatch combined with AWS X-Ray for distributed tracing in containers or Lambda.
- Azure Monitor along with Application Insights for containers and serverless telemetry.
- Google Cloud’s Operations Suite utilising Cloud Logging and Cloud Trace for multi-service environments.
- Oracle Cloud Infrastructure (OCI) Logging integrated with the Observability and Management Platform for cross-service insights.
- Unified observability ensures you can quickly identify inefficiencies or scaling issues.
Introduce a Tagging/Governance Policy
- Standardise tags or labels for cost center, environment, and application name. This practice aids in tracking spending, performance, and potential carbon footprint across various compute services.
- Utilise your cloud provider’s native tagging, policy, and cost-reporting tools to enforce and report on these labels.
- Implementing a unified tagging strategy fosters accountability and helps identify usage patterns that may require optimisation.
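Tag enforcement is easier when applying the standard set is scripted. The sketch below applies a hypothetical cost-centre/environment/application tag set to a list of resource ARNs using the AWS Resource Groups Tagging API via boto3; Azure Policy, GCP labels, and OCI tag namespaces serve the same purpose.

```python
import boto3

tagging = boto3.client("resourcegroupstaggingapi")

STANDARD_TAGS = {                       # hypothetical organisation-wide tag standard
    "CostCentre": "digital-services",
    "Environment": "production",
    "Application": "case-portal",
}

tagging.tag_resources(
    ResourceARNList=[
        "arn:aws:ec2:eu-west-2:123456789012:instance/i-0123456789abcdef0",  # placeholder ARN
    ],
    Tags=STANDARD_TAGS,
)
```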
Implement Automated or Dynamic Scaling
- For container-based workloads, set CPU and memory usage thresholds to enable auto-scaling of pods or tasks.
- For serverless architectures, establish concurrency or usage limits to prevent unexpected cost spikes.
Implementing these scaling strategies ensures that your applications can efficiently handle varying workloads while controlling costs.
Leverage Reserved or Discounted Pricing for Steady Components
- If certain VMs or container clusters must run continuously, investigate vendor discount models.
- Blend on-demand resources for elastic workloads with reservations for predictable baselines to optimise costs.
Implementing these strategies can lead to significant cost savings for workloads with consistent usage patterns.
By unifying your deployment practices, consolidating observability, enforcing tagging, and refining autoscaling or discount usage, you move from an ad-hoc mix of compute styles to a more cohesive, cost-effective cloud ecosystem. This sets the stage for robust, consistent governance and significant agility gains.
How to do better
Below are actionable expansions to push your ephemeral usage approach further:
Adopt a “Compute Decision Framework”
- Formalise how new workloads choose among FaaS (functions), CaaS (containers), or short-lived VMs:
- If event-driven with spiky traffic, prefer serverless.
- If the service requires consistent runtime dependencies but can scale, prefer containers.
- If specialised hardware or older OS is needed briefly, use short-lived VMs.
- This standardisation helps teams quickly pick the best fit.
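The framework can be as simple as a shared helper that teams call (or a form that encodes the same rules). Below is a minimal sketch of the decision logic described above, in Python; a real framework would add regulatory, data-residency, and cost questions.

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    event_driven: bool                    # triggered by events, with spiky traffic?
    needs_consistent_runtime: bool        # fixed runtime dependencies, but can scale?
    needs_special_hardware_or_os: bool    # specialised hardware or older OS, needed briefly?

def recommend_compute(profile: WorkloadProfile) -> str:
    """Apply the decision rules above in order."""
    if profile.event_driven:
        return "serverless (FaaS)"
    if profile.needs_consistent_runtime:
        return "containers (CaaS)"
    if profile.needs_special_hardware_or_os:
        return "short-lived VM"
    return "review with the architecture team"

print(recommend_compute(WorkloadProfile(True, False, False)))  # -> serverless (FaaS)
```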
Enable Event-Driven Automation
- Use events to trigger ephemeral jobs:
- AWS EventBridge or CloudWatch Events to invoke Lambda or spin up ECS tasks.
- Azure Event Grid or Logic Apps triggering Functions or container jobs.
- GCP Pub/Sub or EventArc calls Cloud Run services or GCE ephemeral jobs.
- OCI Events Service integrated with Functions or autoscaling rules.
- This ensures resources only run when triggered, further minimising idle time.
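Wiring an event source to an ephemeral job is typically two API calls: create the rule, then point it at the function or task. The sketch below creates a scheduled AWS EventBridge rule that invokes a Lambda function via boto3; the schedule and ARN are placeholders, and the function also needs a resource-based permission allowing EventBridge to invoke it. The Azure, GCP, and OCI services above follow the same pattern.

```python
import boto3

events = boto3.client("events")

# Run the ephemeral job at 02:00 UTC on the first day of each month.
events.put_rule(
    Name="monthly-transform",
    ScheduleExpression="cron(0 2 1 * ? *)",
    State="ENABLED",
)

events.put_targets(
    Rule="monthly-transform",
    Targets=[{
        "Id": "transform-function",
        "Arn": "arn:aws:lambda:eu-west-2:123456789012:function:monthly-transform",  # placeholder ARN
    }],
)
```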
Implement Container Security Best Practices
- As ephemeral container usage grows, so do potential security concerns:
- Use AWS ECR scanning or Amazon Inspector for container images.
- Use Azure Container Registry (ACR) image scanning with Microsoft Defender for Cloud.
- Use GCP Container Registry or Artifact Registry with scanning and Google Cloud Security Command Center.
- Use OCI Container Registry scanning and Security Zones for container compliance.
- Integrate scans into your CI/CD pipeline for immediate alerts and automation.
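One concrete first step is ensuring every image pushed to your registry is scanned automatically. The sketch below enables scan-on-push for an AWS ECR repository via boto3; the repository name is a placeholder, and ACR, Artifact Registry, and OCI Container Registry offer comparable settings.

```python
import boto3

ecr = boto3.client("ecr")

# Scan images for known vulnerabilities as soon as they are pushed.
ecr.put_image_scanning_configuration(
    repositoryName="public-portal-api",            # hypothetical repository
    imageScanningConfiguration={"scanOnPush": True},
)
```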
Refine Infrastructure as Code (IaC) and Pipeline Patterns
- Standardise ephemeral environment creation using:
- AWS CloudFormation or AWS CDK, plus AWS CodePipeline.
- Azure Resource Manager templates or Bicep, plus Azure DevOps or GitHub Actions.
- GCP Deployment Manager or Terraform, with Cloud Build triggers.
- OCI Resource Manager for stack deployments, integrated with OCI DevOps pipeline.
- Encourage a shared library of environment definitions to accelerate new project spin-up.
Extend Tagging and Cost Allocation
Since ephemeral resources come and go quickly, ensure they are labeled or tagged upon creation.
Set up budgets or cost alerts to identify if ephemeral usage unexpectedly spikes.
By formalising your decision framework, expanding event-driven architectures, ensuring container security, and strengthening IaC patterns, you solidify your short-lived compute model. This approach reduces overheads, fosters agility, and helps UK public sector teams remain compliant with cost and operational excellence targets.
How to do better
Even at this advanced state, you can still hone practices. Below are suggestions:
Automate Decision Workflows
- Build an internal “Service Catalog” or “Decision Tree.” For instance:
- A web-based form that asks about the workload’s functional, regulatory, performance, and cost constraints, then suggests suitable solutions (SaaS, FaaS, containers, etc.).
- This can be integrated with pipeline automation so new projects must pass through the framework before provisioning resources.
Deepen SaaS Exploration for Niche Needs
- Explore specialised SaaS options for areas like data analytics, content management, or identity services.
- Ensure your staff or solution architects regularly revisit the G-Cloud listings or other Crown Commercial Service frameworks to see if an updated SaaS solution can replace custom-coded or container-based systems.
Further Standardise DevOps Across All Layers
- If you run FaaS on multiple clouds or keep some workloads on private cloud, unify your deployment approach.
- Encourage a single pipeline style:
- AWS CodePipeline or GitHub Actions for everything from AWS Lambda to Amazon ECS, plus AWS CloudFormation for infrastructure as code.
- Azure DevOps for .NET-based function apps, container solutions like Azure Container Instances, or Azure Virtual Machines under one roof.
- Google Cloud Build triggers that handle Cloud Run, Google Compute Engine, or third-party SaaS integrations.
- Oracle Cloud Infrastructure (OCI) DevOps pipeline for a mixed environment using Oracle Kubernetes Engine (OKE), Oracle Functions, or third-party webhooks.
Maintain a Living Right-sizing Strategy
- Expand beyond memory/CPU metrics to measure cost per request, concurrency, or throughput.
- Tools like:
- AWS Compute Optimiser advanced metrics for EBS I/O, Lambda concurrency, etc.
- Azure Monitor Workbooks with custom performance/cost insights
- GCP Recommenders for scaling beyond just CPU/memory (like disk usage suggestions)
- OCI Observability with granular resource usage metrics for compute and storage optimisation
Focus on Energy Efficiency and Sustainability
- Refine your approach with a strong environmental lens:
- Pick regions or times that yield lower carbon intensity, if permitted by data residency rules.
- Enforce ephemeral usage policies to avoid running resources unnecessarily.
- Each vendor offers sustainability or carbon data to inform your “fit for purpose” decisions.
Champion Cross-Public-Sector Collaboration
- Share lessons or templates with other departments or agencies. This fosters consistent best practices across local councils, NHS trusts, or central government bodies.
By automating your decision workflows, continuously exploring SaaS, standardising DevOps pipelines, and incorporating advanced metrics (including sustainability), you maintain an iterative improvement path at the peak of compute maturity. This ensures you remain agile in responding to new user requirements and evolving government initiatives, all while controlling costs and optimising resource efficiency.
Keep doing what you’re doing, and consider writing up success stories, internal case studies, or blog posts. Submit pull requests to this guidance or relevant public sector best-practice repositories so others can learn from your achievements. By sharing real-world experiences, you help the entire UK public sector enhance its cloud compute maturity.
How do you track sustainability?
You did not answer this question.
How to do better
Below are rapidly actionable steps that provide greater visibility and ensure you move beyond mere vendor assurances:
Request Vendor Transparency
- Ask your provider for UK-region-specific energy usage information and carbon intensity data.
- Even if the data is approximate, it helps you begin to monitor trends.
Enable Basic Billing and Usage Reports
- Activate native cost-and-usage tooling to gather baseline compute usage:
- AWS Cost Explorer with daily or hourly granularity.
- Azure Cost Management
- GCP Billing Export to BigQuery
- OCI Cost Analysis
- IBM Cloud Billing & IBM Cost Estimator
- While these tools focus on monetary spend, you can correlate usage data with the vendor’s sustainability information.
Incorporate Sustainability Clauses in Contracts
- When renewing or issuing new calls on frameworks like G-Cloud, add explicit language for carbon reporting.
- Request quarterly or annual updates on how your usage ties into the vendor’s net-zero or carbon offset strategies.
Incorporating sustainability clauses into your contracts is essential for ensuring that your cloud service providers align with your environmental goals. The Crown Commercial Service offers guidance on integrating such clauses into the G-Cloud framework. Additionally, the Chancery Lane Project provides model clauses for environmental performance, which can be adapted to your contracts.
By proactively including these clauses, you can hold vendors accountable for their sustainability commitments and ensure that your organisation’s operations contribute positively to environmental objectives.
Track Internal Workload Growth
- Even if you rely on vendor neutrality claims, set up a simple spreadsheet or a lightweight tracker for each of your main cloud workloads (service name, region, typical CPU usage, typical memory usage). If usage grows, you will notice potential new carbon hotspots.
Raise Internal Awareness
- Create a short briefing note for leadership or relevant teams (e.g., finance, procurement) highlighting:
- Your current reliance on vendor offsetting, and
- The need for baseline data collection.
This ensures any interest in deeper environmental reporting can gather support before usage grows further.
How to do better
Here are quick wins to strengthen your approach and make it more actionable:
Use Vendor Sustainability Tools for Basic Estimation
- Enable the carbon or sustainability dashboards in your chosen cloud platform to get monthly or quarterly snapshots.
Create Simple Internal Guidelines
- Expand beyond policy statements:
- Resource Tagging: Mandate that every new resource is tagged with an owner, environment, and a sustainability tag (e.g., “non-prod, auto-shutdown” vs. “production, high-availability”).
- Preferred Regions: If feasible, prefer data centers that the vendor identifies as more carbon-friendly. For example, some AWS and Azure UK-based regions rely on greener energy sourcing than others.
Schedule Simple Sustainability Checkpoints
- Alongside your standard procurement or architectural reviews, add a sustainability review item. E.g.:
- “Does the new service use the recommended low-carbon region?”
- “Is there a plan to power down dev/test resources after hours?”
- This ensures your new policy is not forgotten in day-to-day activities.
Offer Quick Training or Knowledge Sessions
- Host short lunch-and-learn events or internal micro-training on “Cloud Sustainability 101” for staff, showing them how to use the provider cost and carbon dashboards.
The point is to connect cost optimisation with sustainability—over-provisioned resources burn more carbon.
Publish Simple Reporting
- Create a once-a-quarter dashboard or presentation highlighting approximate cloud emissions. Even if the data is partial or not perfect, transparency drives accountability.
By rapidly applying these steps—using native vendor tools to measure usage, establishing minimal but meaningful guidelines, and scheduling brief training or check-ins—you elevate your policy from mere awareness to actual practice.
How to do better
Focus on rapid, vendor-native steps to convert targets into tangible reductions:
Automate Right-sizing
- Many providers have native tools to recommend more efficient instance sizes:
- AWS Compute Optimiser to identify underutilised EC2, EBS, or Lambda resources
- Azure Advisor Right-sizing for VMs and databases
- GCP Recommender for VM rightsizing
- OCI Adaptive Intelligence for resource optimisation
By automatically resizing or shifting to lower-tier SKUs, you reduce both cost and emissions.
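Right-sizing recommendations can also be pulled programmatically and fed straight into a backlog. A minimal sketch using AWS Compute Optimizer via boto3 (the account must already be opted in to the service); the response field names follow Compute Optimizer's API shape, and the finding comparison is written defensively since the exact enum casing is an assumption. The other providers' recommenders expose similar data.

```python
import boto3

optimizer = boto3.client("compute-optimizer")

# List over-provisioned instances and the first recommended alternative size.
response = optimizer.get_ec2_instance_recommendations()
for rec in response["instanceRecommendations"]:
    finding = rec["finding"].replace("_", "").lower()
    if finding == "overprovisioned":
        options = rec.get("recommendationOptions", [])
        suggestion = options[0]["instanceType"] if options else "n/a"
        print(f"{rec['instanceArn']}: {rec['currentInstanceType']} -> {suggestion}")
```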
Implement Scheduled Autoscaling
- Introduce or refine your autoscaling policies so that workloads scale down outside peak times.
This directly lowers carbon usage by removing idle capacity.
Leverage Serverless or Container Services
- Where feasible, re-platform certain workloads to serverless or container-based architectures that scale to zero; rapid wins are often found in sporadic batch jobs or event-driven tasks.
Serverless can significantly cut wasted resources, which aligns with your reduction targets.
Adopt “Carbon Budgets” in Project Plans
- For every new app or service, define a carbon allowance. If estimates exceed the budget, require design changes. Incorporate vendor solutions that show region-level carbon data.
These tools provide insights into the carbon emissions associated with different regions, enabling more sustainable decision-making.
Align with Departmental or National Sustainability Goals
- Update your internal reporting to reflect how your targets link to national net zero obligations or departmental commitments (e.g., the NHS net zero plan, local authority climate emergency pledges). This ensures your measurement and goals remain relevant to broader public sector accountability.
Implementing these steps swiftly helps ensure you don’t just measure but actually reduce your carbon footprint. Regular iteration—checking usage data, right-sizing, adjusting autoscaling—ensures continuous progress toward your stated targets.
How to do better
Actionable steps to deepen your integrated approach:
- Set Up Automated Governance Rules
- Enforce region-based or instance-based policies automatically:
- AWS Service Control Policies to block high-carbon region usage in non-essential cases
- Azure Policy for “Allowed Locations” or “Tagging Enforcement” with sustainability tags
- GCP Organisation Policy to limit usage to certain carbon-friendly regions
- OCI Security Zones or policies restricting resource deployment
Implementing these policies ensures that resources are deployed in regions with lower carbon footprints, aligning with your sustainability objectives.
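For AWS specifically, one way to express a "preferred regions only" rule is a Service Control Policy attached to a non-production organisational unit. The sketch below is a minimal example, assuming AWS Organizations is in use and that eu-west-2 is your approved region; the policy content and names are illustrative, and in practice global services usually need explicit exemptions.

```python
# Minimal sketch: create an SCP that denies actions outside an approved region list.
# Assumes AWS Organizations is enabled; the region list and policy name are illustrative.
import json
import boto3

ALLOWED_REGIONS = ["eu-west-2"]  # e.g., London only for non-essential workloads

scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyOutsideApprovedRegions",
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {"StringNotEquals": {"aws:RequestedRegion": ALLOWED_REGIONS}},
    }],
}

orgs = boto3.client("organizations")
orgs.create_policy(
    Name="approved-regions-only",
    Description="Restrict deployments to approved low-carbon regions",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)
# The policy still needs to be attached to the relevant OU or account after creation.
```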
Adopt Full Lifecycle Management
- Extend sustainability beyond compute:
- Automate data retention: Move older data to cooler or archive storage for lower energy usage:
- Review ephemeral development: Ensure test environments are automatically cleaned after a set period.
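For the data-retention point above, object-storage lifecycle rules do most of the work. A minimal S3 sketch is shown below, assuming a bucket holding logs under a `logs/` prefix; the bucket name, prefix, and retention periods are placeholders, and Azure, GCP, OCI, and IBM all offer equivalent lifecycle or tiering policies.

```python
# Minimal sketch: transition old log objects to archive storage, then expire them.
# Bucket name, prefix, and day counts are illustrative.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-departmental-logs",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)
```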
Use Vendor-Specific Sustainability Advisors
- Some providers offer “sustainability pillars” or specialised frameworks:
Incorporate these suggestions directly into sprint backlogs or monthly improvement tasks.
Embed Sustainability in DevOps Pipelines
- Modify build/deployment pipelines to check resource usage or region selection:
- If a new environment is spun up in a high-carbon region or with large instance sizes, the pipeline can prompt a warning or require an override.
- Tools like GitHub Actions or Azure DevOps Pipelines can call vendor APIs to fetch sustainability metrics and fail a build if it’s non-compliant.
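The gate itself can be a very small script run as a pipeline step. The sketch below assumes the pipeline exposes the target region and instance type as environment variables (the variable names and allow-lists are hypothetical) and fails the build when either falls outside your guidelines; CI systems such as GitHub Actions or Azure DevOps treat the non-zero exit code as a failed check.

```python
# Minimal sketch of a CI gate: fail the build for high-carbon regions or oversized instances.
# Environment variable names and the allow-lists are illustrative assumptions.
import os
import sys

ALLOWED_REGIONS = {"eu-west-2", "eu-west-1"}          # assumed approved regions
ALLOWED_SIZE_SUFFIXES = {"micro", "small", "medium", "large"}

region = os.environ.get("DEPLOY_REGION", "")
instance_type = os.environ.get("DEPLOY_INSTANCE_TYPE", "")  # e.g. "m5.2xlarge"

problems = []
if region not in ALLOWED_REGIONS:
    problems.append(f"Region '{region}' is not on the approved low-carbon list.")
if instance_type and instance_type.split(".")[-1] not in ALLOWED_SIZE_SUFFIXES:
    problems.append(f"Instance type '{instance_type}' exceeds the default size ceiling.")

if problems:
    print("Sustainability gate failed:")
    for problem in problems:
        print(f"  - {problem}")
    sys.exit(1)  # non-zero exit marks the pipeline step as failed
print("Sustainability gate passed.")
```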
Promote Cross-Functional “Green Teams”
- Form a small working group or “green champions” network across procurement, DevOps, governance, and finance, meeting monthly to share best practices and track new optimisation opportunities.
- This approach keeps your integrated practices dynamic, ensuring you respond quickly to new vendor features or updated government climate guidance.
By adding these automated controls, pipeline checks, and cross-functional alignment, you ensure that your integrated sustainability approach not only continues but evolves in real time. You become more agile in responding to shifting requirements and new tools, maintaining a leadership stance in UK public sector cloud sustainability.
How to do better
Even at this advanced level, below are further actions to refine your dynamic management:
Build or Leverage Carbon-Aware Autoscaling
- Many providers offer advanced scaling rules that consider multiple signals. Integrate carbon signals:
- AWS EventBridge + Lambda triggers that check region carbon intensity before scaling up large clusters
- Azure Monitor + Azure Functions to re-schedule HPC tasks when the grid is greener
- GCP Cloud Scheduler + Dataflow for time-shifted batch jobs based on carbon metrics
- OCI Notifications + Functions to enact advanced scheduling policies
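A simple way to make such triggers carbon-aware in the UK is to consult the public Carbon Intensity API before launching deferrable work. Below is a minimal sketch, assuming the workload can tolerate delay; the intensity threshold and the "start the job" step are placeholders for your own scaling or job-submission call.

```python
# Minimal sketch: only start a deferrable batch job when GB grid carbon intensity is low.
# The threshold and the follow-up scale-up step are illustrative placeholders.
import requests

THRESHOLD_GCO2_PER_KWH = 150  # assumed acceptable intensity for this workload

resp = requests.get("https://api.carbonintensity.org.uk/intensity", timeout=10)
resp.raise_for_status()
current = resp.json()["data"][0]["intensity"]
intensity = current.get("actual") or current.get("forecast")

if intensity is not None and intensity <= THRESHOLD_GCO2_PER_KWH:
    print(f"Grid intensity {intensity} gCO2/kWh - starting batch cluster scale-up.")
    # e.g. call your autoscaling or job-submission API here
else:
    print(f"Grid intensity {intensity} gCO2/kWh - deferring until a greener window.")
```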
Collaborate with DESNZ or Relevant Government Bodies
- The Department for Energy Security and Net Zero (DESNZ, which took on this remit from BEIS) and other bodies track grid-level carbon. If you can integrate their public data (e.g., real-time carbon intensity in the UK), you can refine your scheduling.
- Seek synergy with national digital transformation or sustainability pilot programmes that might offer new tools or funding for experimentation.
AI or ML-Driven Forecasting
- Incorporate predictive analytics that forecast your usage spikes and align them with projected carbon intensity (peak/off-peak). Tools like:
Then automatically shift or throttle workloads accordingly.
Innovate with Low-Power Hardware
- Evaluate next-gen or specialised hardware solutions with lower energy profiles:
Typically, these instance families consume less energy for similar workloads, further reducing carbon footprints.
Automated Data Classification and Tiering
- For advanced data management, use AI to classify data in real-time and automatically place it in the most sustainable storage tier:
This ensures minimal energy overhead for data retention.
Set an Example through Openness
- If compliance allows, publish near real-time dashboards illustrating your advanced scheduling successes or hardware usage.
- Share code or Infrastructure-as-Code templates with other public sector teams to accelerate mutual learning.
By implementing these advanced tactics, you sharpen your dynamic optimisation approach, continuously pushing the envelope of what’s possible in sustainable cloud operations—while respecting legal constraints around data sovereignty and any performance requirements unique to public services.
Keep doing what you’re doing, and consider documenting or blogging about your experiences. Submit pull requests to this guidance so other UK public sector organisations can accelerate their own sustainability journeys. By sharing real-world results and vendor-specific approaches, you help shape a greener future for public services across the entire nation.
How do you manage costs? [change your answer]
You did not answer this question.
How do I do better?
If you want to improve beyond “Restricted Billing Visibility,” the next step typically involves democratising cost data. This transition does not mean giving everyone unrestricted access to sensitive financial accounts or payment details. Instead, it centers on making relevant usage and cost breakdowns accessible to those who influence spending decisions, such as product owners, development teams, and DevOps staff, in a manner that is both secure and comprehensible.
Below are tangible ways to create a more open and proactive cost culture:
Role-Based Access to Billing Dashboards
- Most major cloud providers offer robust billing dashboards that can be securely shared with different levels of detail. For example, you can configure specialised read-only roles that allow developers to see usage patterns and daily cost breakdown without granting them access to critical financial settings.
- Look into official documentation and solutions from your preferred cloud provider:
- By carefully configuring role-based access, you enable various teams to monitor cost drivers without exposing sensitive billing details such as invoicing or payment methods.
Regular Cost Review Meetings
- Schedule short, recurring meetings (monthly or bi-weekly) where finance, engineering, operations, and leadership briefly review cost trends. This fosters collaboration, encourages data-driven decisions, and allows everyone to ask questions or highlight anomalies.
- Ensure these sessions focus on actionable items. For instance, if a certain service’s spend has doubled, discuss whether that trend reflects legitimate growth or a misconfiguration that can be quickly fixed.
Automated Cost Alerts for Key Stakeholders
- Integrating cost alerts into your organisational communication channels can be a game changer. Instead of passively waiting for monthly bills, set up cost threshold alerts, daily or weekly cost notifications, or usage-anomaly alerts that are shared in Slack, Microsoft Teams, or email distribution lists.
- This approach ensures that the right people see cost increases in near real-time. If a developer spins up a large instance for testing and forgets to turn it off, you can catch that quickly.
- Each major provider offers alerting and budgeting features:
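As one concrete pattern, a short scheduled script can pull yesterday's spend and post it to a chat channel. A minimal sketch follows, assuming a Slack incoming-webhook URL is held in an environment variable (the variable name is an assumed convention); Azure Cost Management, GCP Billing budgets, and the OCI and IBM equivalents support comparable alerting natively.

```python
# Minimal sketch: post yesterday's AWS spend to a Slack channel via an incoming webhook.
# The webhook environment variable name is an assumed convention, not a standard.
import datetime
import os

import boto3
import requests

today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer API endpoint
result = ce.get_cost_and_usage(
    TimePeriod={"Start": yesterday.isoformat(), "End": today.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
)
amount = result["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"]

requests.post(
    os.environ["SLACK_COST_WEBHOOK_URL"],
    json={"text": f"Cloud spend for {yesterday}: ${float(amount):,.2f}"},
    timeout=10,
)
```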
Cost Dashboards Embedded into Engineering Workflows
- Rather than expecting developers to remember to check a separate financial console, embed cost insights into the tools they already use. For example, if your organisation relies on a continuous integration/continuous deployment (CI/CD) pipeline, you can integrate scripts or APIs that retrieve daily cost data and present them in your pipeline dashboards or as part of a daily Slack summary.
- Some organisations incorporate cost metrics into code review processes, ensuring that changes with potential cost implications (like selecting a new instance type or enabling a new managed service) are considered from both a technical and financial perspective.
Empowering DevOps with Cost Governance
- If you have a DevOps or platform engineering team, involve them in evaluating cost optimisation best practices. By giving them partial visibility into real-time spend data, they can quickly adjust scaling policies, identify over-provisioned resources, or investigate usage anomalies before a bill skyrockets.
- You might create a “Cost Champion” role in each engineering squad—someone who monitors usage, implements resource tagging strategies, and ensures that the rest of the team remains mindful of cloud spend.
Use of FinOps Principles
- The emerging discipline of FinOps (short for “Financial Operations”) focuses on bringing together finance, engineering, and business stakeholders to drive financial accountability. Adopting a FinOps mindset means cost visibility becomes a shared responsibility, with iterative improvement at its core.
- Consider referencing frameworks like the FinOps Foundation’s Principles to learn about building a culture of cost ownership, unit economics, and cross-team collaboration.
Security and Compliance Considerations
- Improving visibility does not mean exposing sensitive corporate finance data or violating compliance rules. Many organisations adopt an approach where top-level financial details (like credit card info or total monthly invoice) remain restricted, but usage-based metrics, daily cost reports, and resource-level data are made available.
- Work with your governance or risk management teams to ensure that any expanded visibility aligns with data protection regulations and internal security policies.
By following these strategies, you shift from a guarded approach—where only finance or management see the details—to a more inclusive cost culture. The biggest benefit is that your engineering teams gain the insight they need to optimise continuously. Rather than discovering at the end of the month that a test environment was running at full throttle, teams can detect and fix potential overspending early. Over time, this fosters a sense of shared cost responsibility, encourages more efficient design decisions, and drives proactive cost management practices across the organisation.
How do I do better?
To enhance a “Proactive Spend Commitment by Finance” model, organisations often evolve toward deeper collaboration between finance, engineering, and product teams. This ensures that negotiated contracts and reserved purchasing decisions accurately reflect real workloads, growth patterns, and future expansions. Below are methods to improve:
Integrated Forecasting and Capacity Planning
- Instead of having finance make decisions based purely on past billing, establish a forecasting model that includes planned product launches, major infrastructure changes, or architectural transformations.
- Encourage technical teams to share roadmaps (e.g., upcoming container migrations, new microservices, or expansions into different regions) so finance can assess whether existing reservation strategies are aligned with future reality.
- By merging product timelines with historical usage data, finance can negotiate better deals and tailor them closely to the actual environment.
Dynamic Monitoring of Reservation Coverage
- Use vendor-specific tools or third-party solutions to track your reservation utilisation in near-real-time. For instance:
- Continuously reviewing coverage lets you adjust reservations if your provider or plan permits it. Some vendors allow you to modify instance families, shift reservations to different regions, or exchange them for alternative instance sizes, subject to specific constraints.
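If you use AWS, the Cost Explorer API exposes reservation utilisation figures you can poll instead of checking dashboards by hand. The minimal sketch below reads the response defensively, since the exact fields can vary; Azure, GCP, OCI, and IBM expose comparable reservation or committed-use reporting.

```python
# Minimal sketch: report reservation utilisation for the last 30 days from AWS Cost Explorer.
# Response fields are read defensively; dates simply cover the previous 30 days.
import datetime

import boto3

end = datetime.date.today()
start = end - datetime.timedelta(days=30)

ce = boto3.client("ce", region_name="us-east-1")
resp = ce.get_reservation_utilization(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
)
for period in resp.get("UtilizationsByTime", []):
    total = period.get("Total", {})
    print(
        f"{period.get('TimePeriod', {}).get('Start')}: "
        f"utilisation {total.get('UtilizationPercentage', 'n/a')}%"
    )
```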
Cross-Functional Reservation Committees
- Create a cross-functional group that meets quarterly or monthly to decide on reservation purchases or modifications. In this group, finance presents cost data, while engineering clarifies usage patterns and product owners forecast upcoming demand changes.
- This ensures that any new commits or expansions account for near-future workloads rather than only historical data. If you adopt agile practices, incorporate these reservation reviews as part of your sprint cycle or program increment planning.
Leverage Spot or Preemptible Instances for Variable Workloads
- An advanced tactic is to blend long-term reservations for predictable workloads with short-term, highly cost-effective instance types—such as AWS Spot Instances, Azure Spot VMs, GCP Preemptible VMs, or OCI Preemptible Instances—for workloads that can tolerate interruptions.
- Finance-led pre-commits for baseline needs plus engineering-led strategies for ephemeral or experimental tasks can minimise your total cloud spend. This synergy requires communication between finance and engineering so that the latter group can identify which workloads can safely run on spot capacity.
Refining Commitment Levels and Terms
- If your cloud vendor offers multiple commitment term lengths (e.g., 1-year vs. 3-year reservations, partial upfront vs. full upfront) and different coverage tiers, refine your strategy to match usage stability. For example, if 60% of your workload is unwavering, consider 3-year commits; if another 20% fluctuates, opt for 1-year or on-demand.
- Over time, as your usage data becomes more accurate and your architecture stabilises, you can shift more workloads into longer-term commitments for greater discounts. Conversely, if your environment is in flux, keep your commitments lighter to avoid overpaying.
Unit Economics and Cost Allocation
- Enhance your commitment strategy by tying it to unit economics—i.e., cost per customer, cost per product feature, or cost per transaction. Once you can express your cloud bills in terms of product-level or service-level metrics, you gain more clarity on which areas most justify pre-commits.
- If you identify a specific product line that reliably has N monthly active users, and you have stable usage patterns there, you can base reservations on that product’s forecast. Then, the cost savings from reservations become more attributable to specific products, making budgeting and cost accountability smoother.
Ongoing Financial-Technical Collaboration
- Beyond initial negotiations, keep the lines of communication open. Cloud resource usage is dynamic, particularly with continuous integration and deployment practices. Having monthly or quarterly check-ins between finance and engineering ensures you track coverage, refine cost models, and respond quickly to usage spikes or dips.
- Consider forming a “FinOps” group if your cloud usage is substantial. This multi-disciplinary team can use data from daily or weekly cost dashboards to fine-tune reservations, detect anomalies, and champion cost-optimisation strategies across the business.
By progressively weaving in these improvements, you move from a purely finance-led contract negotiation model to one where decisions about reserved spending or commitments are strongly informed by real-time engineering data and future product roadmaps. This more holistic approach leads to higher reservation utilisation, fewer wasted commitments, and better alignment of your cloud spending with actual business goals. The result is typically a more predictable cost structure, improved cost efficiency, and reduced risk of paying for capacity you do not need.
How do I do better?
If you wish to refine your cost-efficiency, consider adding more sophisticated processes, automation, and cultural practices. Here are ways to evolve:
Implement More Granular Auto-Scaling Policies
- Move beyond simple CPU-based or time-based triggers. Incorporate multiple metrics (memory usage, queue depth, request latency) so you scale up and down more precisely. This ensures that environments adjust capacity as soon as traffic drops, boosting your savings.
- Evaluate advanced solutions from your cloud provider:
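To illustrate the "multiple metrics" point, the sketch below attaches a target-tracking policy to an EC2 Auto Scaling group using a customised CloudWatch metric (queue depth per instance); the group name, metric namespace, and target value are hypothetical, and equivalent constructs exist in the other providers' autoscalers.

```python
# Minimal sketch: target-tracking scaling on a custom "queue depth per instance" metric.
# Group name, metric namespace/name, and the target value are illustrative.
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-2")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="api-workers-asg",        # hypothetical group
    PolicyName="track-queue-depth",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            "MetricName": "QueueDepthPerInstance",  # published to CloudWatch by your app
            "Namespace": "ExampleService",
            "Statistic": "Average",
        },
        "TargetValue": 10.0,  # aim for roughly 10 queued items per instance
    },
)
```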
Use Infrastructure as Code for Environment Management
- Instead of ad hoc creation and shutdown scripts, adopt Infrastructure as Code (IaC) tools (e.g., Terraform, AWS CloudFormation, Azure Bicep, Google Deployment Manager, or OCI Resource Manager) to version-control environment configurations. Combine IaC with schedule-based or event-based triggers.
- This approach ensures that ephemeral environments are consistently built and torn down, leaving minimal risk of leftover resources. You can also implement automated tagging to track cost by environment, team, or project.
Re-Architect for Serverless or Containerised Workloads
- If your application can tolerate stateless, event-driven, or container-based architectures, consider adopting serverless computing (e.g., AWS Lambda, Azure Functions, GCP Cloud Functions, OCI Functions) or container orchestrators (e.g., Kubernetes, Docker Swarm).
- These models often scale to zero when no requests are active, ensuring you only pay for actual usage. While not all workloads are suitable, re-architecting certain components can yield significant cost improvements.
Optimise Storage and Networking
- Cost-effective management extends beyond compute. Look for opportunities to move infrequently accessed data to cheaper storage tiers, such as object storage archive classes or lower-performance block storage. Configure lifecycle policies to purge logs or snapshots after a specified retention.
- Monitor data transfer costs between regions, availability zones, or external endpoints. If your architecture unnecessarily routes traffic through costlier paths, consider direct inter-region or peering solutions that reduce egress charges.
Scheduled Resource Hibernation and Wake-Up Processes
- Extend beyond typical off-hour shutdowns by creating fully automated schedules for every environment that does not require 24/7 availability. For instance, set a policy to shut down dev/test resources at 7 p.m. local time, and spin them back up at 8 a.m. the next day.
- Tools or scripts can detect usage anomalies (e.g., someone working late) and override the schedule or send a prompt to confirm if the environment should remain active. This approach ensures maximum cost avoidance, especially for large dev clusters or specialised GPU instances.
Incorporate Cost Considerations into Code Reviews and Architecture Decisions
- Foster a culture in which cost is a first-class design principle. During code reviews, developers might highlight the cost implications of using a high-tier database service, retrieving data across regions, or enabling a premium feature.
- Architecture design documents should include estimated cost breakdowns, referencing official pricing details for the services involved. Over time, teams become more adept at spotting potential overspending.
Automated Auditing and Cleanup
- Implement scripts or tools that run daily or weekly to detect unattached volumes, unused IP addresses, idle load balancers, or dormant container images. Provide automated cleanup or at least raise alerts for manual review (a small reporting sketch follows this list).
- Many cloud providers have built-in recommendations engines:
- AWS: AWS Trusted Advisor
- Azure: Azure Advisor
- GCP: Recommender Hub
- OCI: Oracle Cloud Advisor
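Alongside those advisors, a small sweep script can surface the most common orphans. The sketch below checks one AWS region for unattached EBS volumes and unassociated Elastic IPs; it only reports, deliberately leaving deletion as a separate, reviewed step, and similar queries exist in the other vendors' SDKs.

```python
# Minimal sketch: report unattached volumes and unassociated Elastic IPs in one region.
# Reporting only - deletion is left as a deliberate follow-up action.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")

volumes = ec2.describe_volumes(Filters=[{"Name": "status", "Values": ["available"]}])
for vol in volumes["Volumes"]:
    print(f"Unattached volume: {vol['VolumeId']} ({vol['Size']} GiB)")

addresses = ec2.describe_addresses()
for addr in addresses["Addresses"]:
    if "AssociationId" not in addr:
        print(f"Unassociated Elastic IP: {addr.get('PublicIp')}")
```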
Track and Celebrate Savings
- Publicise cost optimisation wins. If an engineering team shaved 20% off monthly bills by fine-tuning auto-scaling, celebrate that accomplishment in internal communications. Show the before/after metrics to encourage others to follow suit.
- This positive reinforcement helps maintain momentum and fosters a sense of shared ownership.
By layering these enhancements, you move beyond basic scheduling or minimal auto-scaling. Instead, you cultivate a deeply ingrained practice of continuous optimisation. You harness automation to enforce best practices, integrate cost awareness into everyday decisions, and systematically re-architect services for maximum efficiency. Over time, the result is a lean cloud environment that can expand when needed but otherwise runs with minimal waste.
How do I do better?
If you want to upgrade your cost-aware development environment, you can deepen the integration of financial insight into everyday engineering. Below are practical methods:
Enhance Toolchain Integrations
- Provide cost data directly in the platforms developers use daily:
- Pull Request Annotations: When a developer opens a pull request in GitHub or GitLab that adds new cloud resources (e.g., creating a new database or enabling advanced analytics), an automated comment could estimate the monthly or annual cost impact; a sketch of such a comment step follows this list.
- IDE Plugins: Investigate or develop plugins that estimate cost implications of certain library or service calls. While advanced, such solutions can drastically reduce guesswork.
- CI/CD Pipeline Steps: Incorporate cost checks as a gating mechanism in your CI/CD process. If a change is projected to exceed certain cost thresholds, it triggers a review or a labeled warning.
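One lightweight form of pull-request annotation is a pipeline step that posts an estimated cost delta as a comment. The sketch below assumes the estimate has already been produced earlier in the pipeline (here it is just a placeholder figure) and uses the GitHub REST API; the repository, PR-number, and token variables are illustrative conventions.

```python
# Minimal sketch: post an estimated monthly cost impact as a pull-request comment.
# The cost figure is a placeholder assumed to come from an earlier estimation step.
import os

import requests

repo = os.environ["GITHUB_REPOSITORY"]   # e.g. "org/service" (set by GitHub Actions)
pr_number = os.environ["PR_NUMBER"]      # assumed to be exported by the workflow
token = os.environ["GITHUB_TOKEN"]

estimated_monthly_cost = 42.50           # placeholder from your estimation step

requests.post(
    f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    },
    json={"body": f"Estimated cost impact of this change: ~£{estimated_monthly_cost:.2f}/month."},
    timeout=10,
)
```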
Reward and Recognition Systems
- Implement a system that publicly acknowledges or rewards teams that achieve significant cost savings or code optimisations that reduce the cloud bill. This can be a monthly “cost champion” award or a highlight in the company-wide newsletter.
- Recognising teams for cost-smart decisions helps embed a culture where financial prudence is celebrated alongside feature delivery and reliability.
Cost Education Workshops
- Host internal workshops or lunch-and-learns where experts (whether from finance, DevOps, or a specialised FinOps team) explain how cloud billing works, interpret usage graphs, or share best practices for cost-efficient coding.
- Make these sessions as practical and example-driven as possible: walk developers through real code and show the difference in cost from alternative approaches.
Tagging and Chargeback/Showback Mechanisms
- Encourage consistent resource tagging so that each application component or service is clearly attributed to a specific team, project, or feature. This tagging data feeds into cost reports that let you see which code bases or squads are driving usage.
- You can then implement a “showback” model (where each team sees the monthly cost of their resources) or a “chargeback” model (where those costs directly affect team budgets). Such financial accountability often motivates more thoughtful engineering decisions.
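Once tagging is consistent, a showback report can be generated directly from the billing API. Below is a minimal sketch grouping month-to-date AWS spend by a `team` cost-allocation tag; the tag key is illustrative and must first be activated for cost allocation, and the other providers support equivalent grouping by tag or label.

```python
# Minimal sketch: month-to-date spend grouped by a "team" cost-allocation tag.
# The tag key is illustrative and must be activated as a cost-allocation tag first.
import datetime

import boto3

today = datetime.date.today()
month_start = today.replace(day=1)

ce = boto3.client("ce", region_name="us-east-1")
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": month_start.isoformat(), "End": today.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)
for period in resp["ResultsByTime"]:
    for group in period["Groups"]:
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{group['Keys'][0]}: ${amount:,.2f}")
```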
Guidelines and Architecture Blueprints
- Produce internal reference guides that show recommended patterns for cost optimisation. For example, specify which database types or instance families are preferred for certain workloads. Provide example Terraform modules or CloudFormation templates that are pre-configured for cost-efficiency.
- Encourage developers to consult these guidelines when designing new systems. Over time, the default approach becomes inherently cost-aware.
Frequent Feedback Loops
- Implement daily or weekly cost digests that are automatically posted in relevant Slack channels or email lists. These digests highlight the top 5 cost changes from the previous period, giving engineering teams rapid insight into where spend is shifting.
- Additionally, create a channel or forum where developers can ask cost-related questions in real time, ensuring they do not have to guess how a new feature might affect the budget.
Collaborative Budgeting and Forecasting
- For upcoming features or architectural revamps, involve engineers in forecasting the cost impact. By inviting them into the financial planning process, you ensure they understand the budgets they are expected to work within.
- Conversely, finance or product managers can learn from engineers about the real operational complexities, leading to more accurate forecasting and fewer unrealistic cost targets.
Adopt a FinOps Mindset
- Expand on the FinOps principles beyond finance alone. Encourage all engineering teams to take part in continuous cost optimisation cycles—inform, optimise, and operate. In these cycles, you measure usage, identify opportunities, experiment with changes, and track results.
- Over time, cost efficiency becomes an ongoing practice rather than a one-time initiative.
By adopting these approaches, you elevate cost awareness from a passive, occasional concern to a dynamic, integrated element of day-to-day development. This deeper integration helps your teams design, code, and deploy with financial considerations in mind—often leading to innovative solutions that deliver both performance and cost savings.
How do you choose where to run workloads and store data? [change your answer]
You did not answer this question.
How to do better
Below are rapidly actionable ways to refine an intra-region approach:
Enable Automatic Multi-AZ Deployments
- e.g., AWS Auto Scaling groups across multiple AZs, Azure VM Scale Sets in multiple zones, GCP Managed Instance Groups (MIGs) or multi-zonal regional clusters, OCI multi-AD distribution for compute/storage, IBM Cloud Instance Group for Autoscaling.
- Minimises manual overhead for distributing workloads.
Replicate Data Synchronously
- For databases, consider regionally resilient services:
- AWS RDS Multi-AZ
- Azure SQL Zone Redundancy
- GCP Cloud SQL HA
- OCI Data Guard in Multi-AD Mode
- IBM Cloud: high-availability options for PostgreSQL, Cloudant, MySQL, and other Cloud Databases
- Ensures quick failover if one Availability Zone (AZ) fails.
Set AZ-Aware Networking
- Deploy separate subnets or load balancers for each Availability Zone (AZ) so traffic automatically reroutes upon an AZ failure:
- Ensures high availability and fault tolerance by distributing traffic across multiple AZs.
Regularly Test AZ Failover
- Induce a partial Availability Zone (AZ) outage or rely on “game days” to ensure applications properly degrade or failover:
- Referencing NCSC guidance on vulnerability management.
- Ensures systems can handle unexpected disruptions effectively.
Monitor Cross-AZ Costs
- Some vendors charge for data transfer between AZs, so monitor usage with AWS Cost Explorer, Azure Cost Management, GCP Billing, OCI Cost Analysis, IBM Cloud Billing & IBM Cost Estimator.
By automatically spreading workloads, replicating data in multiple AZs, ensuring AZ-aware networking, regularly testing failover, and monitoring cross-AZ costs, you solidify your organisation’s resilience within a single region while controlling costs.
How to do better
Below are rapidly actionable improvements:
Automate Cross-Region Backups
- e.g., AWS S3 Cross-Region Replication, Azure Backup to another region, GCP Snapshot replication, OCI cross-region object replication.
- Minimises manual tasks and ensures consistent DR coverage.
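As an example of the object-storage case, the sketch below enables replication from a primary London bucket to a pre-created bucket in another region. It assumes versioning is already enabled on both buckets and that a suitable replication IAM role exists; bucket names and the role ARN are placeholders.

```python
# Minimal sketch: replicate objects from a primary bucket to a DR bucket in another region.
# Assumes versioning is enabled on both buckets and the IAM role already exists.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="example-primary-london",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # placeholder ARN
        "Rules": [{
            "ID": "dr-copy",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": ""},                      # replicate everything
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::example-dr-dublin",  # placeholder DR bucket
                "StorageClass": "STANDARD_IA",
            },
        }],
    },
)
```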
Schedule Non-Production in Cheaper Regions
- If cost is a driver, shut down dev/test in off-peak times or run them in a region with lower rates:
- Referencing your chosen vendor’s regional pricing page.
Establish a Basic DR Plan
- For the second region, define how you’d bring up minimal services if the primary region fails:
Regularly Test Failover
- Do partial or full DR exercises at least annually, ensuring data in the second region can spin up quickly.
- Referencing NIST SP 800-34 DR test recommendations or NCSC operational resilience playbooks.
Plan for Data Residency
- If using non-UK regions, confirm any legal constraints on data location, referencing GOV.UK data residency rules or relevant departmental guidelines.
By automating cross-region backups, offloading dev/test workloads where cost is lower, defining a minimal DR plan, regularly testing failover, and ensuring data residency compliance, you expand from a single-region approach to a modest but effective multi-region strategy.
How to do better
Below are rapidly actionable enhancements:
Sustainability-Driven Tools
- e.g., AWS Customer Carbon Footprint Tool, Azure Carbon Optimisation, GCP Carbon Footprint, OCI Carbon Footprint.
- Evaluate region choices for best environmental impact.
Implement Real-Time Cost & Performance Monitoring
- Track usage and cost by region daily or hourly.
- Referencing AWS Cost Explorer, Azure Cost Management, GCP Billing Alerts, OCI Cost Analysis.
Enable Multi-Region Data Sync
- If you shift workloads for HPC or AI tasks, ensure data is pre-replicated to the chosen region:
Address Latency & End-User Performance
- For services with user-facing components, consider CDN edges, multi-region front-end load balancing, or local read replicas to ensure acceptable performance.
Document Region Swapping Procedures
- If you occasionally relocate entire workloads for cost or sustainability, define runbooks or scripts to manage DB replication, DNS updates, and environment spin-up.
By using sustainability calculators to choose greener regions, implementing real-time cost/performance checks, ensuring multi-region data readiness, managing user latency via CDNs or local replicas, and documenting region-swapping, you fully leverage each provider’s global footprint for cost and environmental benefits.
How to do better
Below are rapidly actionable methods to refine dynamic, cost-sustainable distribution:
Automate Workload Placement
- Tools like AWS Spot Instances with EC2 Fleet, Azure Spot VMs with scale sets, GCP Preemptible VMs, or OCI Preemptible Instances, or container orchestrators that factor in region costs:
- Referencing vendor cost management APIs or third-party cost analytics.
Use Real-Time Carbon & Pricing Signals
- e.g., AWS Instance Metadata + carbon data, Azure carbon footprint metrics, GCP Carbon Footprint reports, OCI sustainability stats.
- Shift workloads to the region with the best real-time carbon intensity or lowest spot price.
Add Continual Governance
- Ensure no region usage violates data residency constraints or compliance:
- referencing NCSC multi-region compliance advice or departmental data classification guidelines.
Embrace Chaos Engineering
- Regularly test failover or region-shifting events to ensure dynamic distribution can recover from partial region outages or surges:
- Referencing NCSC guidance on chaos engineering or vendor solutions:
- These tools help simulate real-world disruptions, allowing you to observe system behavior and enhance resilience.
Integrate Advanced DevSecOps
- For each region shift, the pipeline or orchestrator re-checks security posture and cost thresholds in real time.
By automating workload placement with spot or preemptible instances, factoring real-time carbon and cost signals, applying continuous data residency checks, stress-testing region shifts with chaos engineering, and embedding advanced DevSecOps validations, you maintain a dynamic, cost-sustainable distribution model that meets the highest operational and environmental standards for UK public sector services.
Keep doing what you’re doing, and consider blogging about or opening pull requests to share how you handle multi-region distribution and operational management for cloud workloads. This information can help other UK public sector organisations adopt or improve similar approaches in alignment with NCSC, NIST, and GOV.UK best-practice guidance.