Report

Cost & Sustainability

How do you set capacity for live services? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to reduce waste and move beyond provisioning for the extreme peak:

Implement Resource Monitoring and Basic Analytics
- Gather usage metrics to understand actual peaks, off-peak times, and daily/weekly cycles:
  - AWS CloudWatch metrics + AWS Cost Explorer to see usage vs. cost patterns
  - Azure Monitor + Azure Cost Management for hourly/daily usage trends
  - GCP Monitoring + GCP Billing reports (BigQuery export for deeper analysis)
  - OCI Monitoring + OCI Cost Analysis for instance-level metrics
  - IBM Cloud Monitoring + IBM Cloud Cost Estimator for hourly usage and trends
- Share this data with stakeholders to highlight the discrepancy between peak vs. average usage.
Pilot Scheduled Shutdowns for Non-Critical Systems
- Identify development and testing environments or batch-processing servers that don’t require 24/7 availability:
  - Utilise AWS Instance Scheduler to automate start and stop times for Amazon EC2 and RDS instances.
  - Implement Azure Automation’s Start/Stop VMs v2 to manage virtual machines on user-defined schedules.
  - Apply Google Cloud’s Instance Schedules to automatically start and stop Compute Engine instances based on a schedule.
  - Use Oracle Cloud Infrastructure’s Resource Scheduler to manage compute instances’ power states according to defined schedules.
  - Use IBM Cloud Schedule Scaling to add or remove instance group capacity, based on daily, intermittent, or seasonal demand. You can create multiple scheduled actions that scale capacity monthly, weekly, daily, hourly, or even every set number of minutes.
- Sharing this data with stakeholders can highlight the discrepancy between peak and average usage, demonstrating immediate cost savings without impacting production systems.
Explore Simple Autoscaling Solutions
- Even if you continue peak provisioning for mission-critical workloads, consider selecting a smaller or non-critical service to test autoscaling:
  - AWS Auto Scaling Groups – basic CPU-based triggers: Amazon EC2 Auto Scaling allows you to automatically add or remove EC2 instances based on CPU utilisation or other metrics, ensuring your application scales to meet demand.
  - Azure Virtual Machine Scale Sets – scale by CPU or memory usage: Azure Virtual Machine Scale Sets enable you to create and manage a group of load-balanced VMs, automatically scaling the number of instances based on CPU or memory usage to match your workload demands.
  - GCP Managed Instance Groups – autoscale based on utilisation thresholds: Google Cloud’s Managed Instance Groups provide autoscaling capabilities that adjust the number of VM instances based on utilsation metrics, such as CPU usage, to accommodate changing workloads.
  - OCI Instance Pool Autoscaling – CPU or custom metrics triggers: Oracle Cloud Infrastructure’s Instance Pool Autoscaling allows you to automatically adjust the number of instances in a pool based on CPU utilisation or custom metrics, helping to optimise performance and cost.
  - IBM Cloud Auto Scale for VPC allows you to create an instance group to scale according to your requirements. Based on the target utilisation metrics that you define, the instance group can dynamically add or remove instances to achieve your specified instance availability.

Implementing autoscaling in a controlled environment allows you to evaluate its benefits and challenges, providing valuable insights before considering broader adoption for more critical workloads.

Review Reserved or Discounted Pricing
- If you must maintain consistently high capacity, consider vendor discount programs to reduce per-hour costs:
  - AWS Savings Plans or Reserved Instances: AWS offers Savings Plans, which provide flexibility by allowing you to commit to a consistent amount of compute usage (measured in $/hour) over a 1- or 3-year term, applicable across various services and regions. Reserved Instances, on the other hand, involve committing to specific instance configurations for a term, offering significant discounts for predictable workloads.
  - Azure Reservations for VMs and Reserved Capacity: Azure provides Reservations that allow you to commit to a specific VM or database service for a 1- or 3-year period, resulting in cost savings compared to pay-as-you-go pricing. These reservations are ideal for workloads with predictable resource requirements.
  - GCP Committed Use Discounts: Google Cloud offers Committed Use Discounts, enabling you to commit to a certain amount of usage for a 1- or 3-year term, which can lead to substantial savings for steady-state or predictable workloads.
  - OCI Universal Credits: Oracle Cloud Infrastructure provides Universal Credits, allowing you to utilise any OCI platform service in any region with a flexible consumption model. By purchasing a sufficient number of credits, you can benefit from volume discounts and predictable billing, which is advantageous for maintaining high-capacity workloads.
  - IBM Cloud Reservations are a great option when you want significant cost savings and dedicated resources for future deployments. You can choose a 1 or 3-year term, server quantity, specific profile, and provision those servers when needed. IBM Cloud Enterprise Savings Plan, with this billing model, you commit to spend a certain amount on IBM Cloud and receive discounts across the platform. You are billed monthly based on your usage and you continue to receive a discount even after you reach your committed amount.
- Implementing these discount programs won’t eliminate over-provisioning but can soften the budget impact.
Engage Leadership on the Financial and Sustainability Benefits
- Present how on-demand autoscaling or even basic scheduling can reduce overhead and potentially improve your service’s environmental footprint.
- Link these improvements to departmental net-zero or cost reduction goals, highlighting easy wins.

Through monitoring, scheduling, basic autoscaling pilots, and potential reserved capacity, you can move away from static peak provisioning. This approach preserves reliability while unlocking efficiency gains—an important step in balancing cost, compliance, and performance goals in the UK public sector.

How to do better

Here are rapidly actionable steps to evolve from manual seasonal scaling to a more automated, responsive model:

Automate the Manual Steps You Already Do
- If you anticipate seasonal peaks (e.g., quarterly public reporting load), replace manual processes with scheduled scripts to ensure timely scaling and prevent missed scale-downs:
  - AWS: Utilise AWS Step Functions in conjunction with Amazon EventBridge Scheduler to automate the start and stop of EC2 instances based on a defined schedule.
  - Azure: Implement Azure Automation Runbooks within Automation Accounts to create scripts that manage the scaling of resources during peak periods.
  - Google Cloud Platform (GCP): Leverage Cloud Scheduler to trigger Cloud Functions or Terraform scripts that adjust instance groups in response to anticipated load changes.
  - Oracle Cloud Infrastructure (OCI): Use Resource Manager stacks combined with Cron tasks to schedule scaling events, ensuring resources are appropriately managed during peak times.
- Automating these processes ensures that scaling actions occur as planned, reducing the risk of human error and optimising resource utilisation during peak and off-peak periods.
Identify and Enforce “Scale-Back” Windows
- Even if you scale up for busy times, ensure you have a defined “sunset” for increased capacity:
  - Configure an autoscaling group or scale set to revert to default size after the peak.
  - Set reminders or triggers to ensure you don’t pay for extra capacity indefinitely.
Introduce Autoscaling on a Limited Component
- Choose a module that frequently experiences load variations within a day or week—perhaps a web front-end for a public information portal:
  - AWS: Implement Auto Scaling Groups with CPU-based or request-based triggers to automatically adjust the number of EC2 instances handling your service’s load.
  - Azure: Utilise Virtual Machine Scale Sets or the AKS Cluster Autoscaler to manage the scaling of virtual machines or Kubernetes clusters for your busiest microservices.
  - Google Cloud Platform (GCP): Use Managed Instance Groups with load-based autoscaling to dynamically adjust the number of instances serving your front-end application based on real-time demand.
  - Oracle Cloud Infrastructure (OCI): Apply Instance Pool Autoscaling or the OKE Cluster Autoscaler to automatically scale a specific containerised service in response to workload changes.
- Implementing autoscaling on a targeted component allows you to observe immediate benefits, such as improved resource utilisation and cost efficiency, which can encourage broader adoption across your infrastructure.
Consider Serverless for Spiky Components
- If certain tasks run sporadically (e.g., monthly data transformation or PDF generation), investigate moving them to event-driven or serverless solutions:
  - AWS: Utilise AWS Lambda for event-driven functions or AWS Fargate for running containers without managing servers. AWS Lambda is ideal for short-duration, event-driven tasks, while AWS Fargate is better suited for longer-running applications and tasks requiring intricate orchestration.
  - Azure: Implement Azure Functions for serverless compute, Logic Apps for workflow automation, or Container Apps for running microservices and containerised applications. Azure Logic Apps can automate workflows and business processes, making them suitable for scheduled tasks.
  - Google Cloud Platform (GCP): Deploy Cloud Functions for lightweight event-driven functions or Cloud Run for running containerised applications in a fully managed environment. Cloud Run is suitable for web-based workloads, REST or gRPC APIs, and internal custom back-office apps.
  - Oracle Cloud Infrastructure (OCI): Use OCI Functions for on-demand, serverless workloads. OCI Functions is a fully managed, multi-tenant, highly scalable, on-demand, Functions-as-a-Service platform built on enterprise-grade infrastructure.
- Transitioning to serverless solutions for sporadic tasks eliminates the need to manually adjust virtual machines for short bursts, enhancing efficiency and reducing operational overhead.
Monitor and Alert on Usage Deviations
- Utilise cost and performance alerts to detect unexpected surges or prolonged idle resources:
  - AWS: Implement AWS Budgets to set custom cost and usage thresholds, receiving alerts when limits are approached or exceeded. Additionally, use Amazon CloudWatch’s anomaly detection to monitor metrics and identify unusual patterns in resource utilisation.
  - Azure: Set up Azure Monitor alerts to track resource performance and configure cost anomaly alerts within Azure Cost Management to detect and notify you of unexpected spending patterns.
  - Google Cloud Platform (GCP): Create budgets in Google Cloud Billing and configure Pub/Sub notifications to receive alerts on cost anomalies, enabling prompt responses to unexpected expenses.
  - Oracle Cloud Infrastructure (OCI): Establish budgets and set up alert rules in OCI Cost Management to monitor spending. Additionally, configure OCI Alarms with notifications to detect and respond to unusual resource usage patterns.
- Implementing these alerts enables quicker responses to anomalies, reducing the reliance on manual monitoring and helping to maintain optimal resource utilisation and cost efficiency.

By automating your manual scaling processes, exploring partial autoscaling, and shifting spiky tasks to serverless, you unlock more agility and cost efficiency. This approach helps ensure you’re not left scrambling if usage deviates from seasonal patterns.

How to do better

Below are actionable ways to upgrade from basic autoscaling:

Broaden Autoscaling Coverage
- Extend autoscaling to more workloads to enhance efficiency and responsiveness:
  - AWS:
    - EC2 Auto Scaling: Implement EC2 Auto Scaling across multiple groups to automatically adjust the number of EC2 instances based on demand, ensuring consistent application performance.
    - ECS Service Auto Scaling: Configure Amazon ECS Service Auto Scaling to automatically scale your containerised services in response to changing demand.
    - RDS Auto Scaling: Utilise Amazon Aurora Auto Scaling to automatically adjust the number of Aurora Replicas to handle changes in workload demand.
  - Azure:
    - Virtual Machine Scale Sets (VMSS): Deploy Azure Virtual Machine Scale Sets to manage and scale multiple VMs for various services, automatically adjusting capacity based on demand.
    - Azure Kubernetes Service (AKS): Implement the AKS Cluster Autoscaler to automatically adjust the number of nodes in your cluster based on resource requirements.
    - Azure SQL Elastic Pools: Use Azure SQL Elastic Pools to manage and scale multiple databases with varying usage patterns, optimising resource utilisation and cost.
  - Google Cloud Platform (GCP):
    - Managed Instance Groups (MIGs): Expand the use of Managed Instance Groups with autoscaling across multiple zones to ensure high availability and automatic scaling of your applications.
    - Cloud SQL Autoscaling: Leverage Cloud SQL’s automatic storage increase to handle growing database storage needs without manual intervention.
  - Oracle Cloud Infrastructure (OCI):
    - Instance Pool Autoscaling: Apply OCI Instance Pool Autoscaling to additional workloads, enabling automatic adjustment of compute resources based on performance metrics.
    - Database Auto Scaling: Utilise OCI Autonomous Database Auto Scaling to automatically scale compute and storage resources in response to workload demands.
- Gradually incorporating more of your application’s microservices into the autoscaling framework can lead to improved performance, cost efficiency, and resilience across your infrastructure.
Incorporate More Granular Metrics
- Move beyond simple CPU-based thresholds to handle memory usage, disk I/O, or application-level concurrency:
  - AWS: Implement Amazon CloudWatch custom metrics to monitor specific parameters such as memory usage, disk I/O, or application-level metrics. Additionally, utilise Application Load Balancer (ALB) request count to trigger autoscaling based on incoming traffic.
  - Azure: Use Azure Monitor custom metrics to track specific performance indicators like queue length or HTTP request rate. These metrics can feed into Virtual Machine Scale Sets or the Azure Kubernetes Service (AKS) Horisontal Pod Autoscaler (HPA) for more responsive scaling.
  - Google Cloud Platform (GCP): Leverage Google Cloud’s Monitoring custom metrics to capture detailed performance data. Implement request-based autoscaling in Google Kubernetes Engine (GKE) or Cloud Run to adjust resources based on real-time demand.
  - Oracle Cloud Infrastructure (OCI): Utilise OCI Monitoring service’s custom metrics to track parameters such as queue depth, memory usage, or user concurrency. These metrics can inform autoscaling decisions to ensure optimal performance.
- Incorporating more granular metrics allows for precise autoscaling, ensuring that resources are allocated based on comprehensive performance indicators rather than relying solely on CPU usage.
Implement Dynamic, Scheduled, or Predictive Scaling
- If you observe consistent patterns in your application’s usage—such as increased activity during lunchtime or reduced traffic on weekends—consider enhancing your existing autoscaling strategies with scheduled scaling actions:
  - AWS: Configure Amazon EC2 Auto Scaling scheduled actions to adjust capacity at predetermined times. For instance, you can set the system to scale up at 08:00 and scale down at 20:00 to align with daily usage patterns.
  - Azure: Utilise Azure Virtual Machine Scale Sets to implement scheduled scaling. Additionally, integrate scaling adjustments into your Azure DevOps pipelines to automate capacity changes in response to anticipated workload variations.
  - Google Cloud Platform (GCP): Employ Managed Instance Group (MIG) scheduled scaling to define scaling behaviors based on time-based schedules. Alternatively, use Cloud Scheduler to trigger scripts that adjust resources in line with expected demand fluctuations.
  - Oracle Cloud Infrastructure (OCI): Set up scheduled autoscaling for instance pools to manage resource allocation according to known usage patterns. You can also deploy Oracle Functions to execute timed scaling events, ensuring resources are appropriately scaled during peak and off-peak periods.
- Implementing scheduled scaling allows your system to proactively adjust resources in anticipation of predictable workload changes, enhancing performance and cost efficiency.
- For environments with variable and unpredictable workloads, consider utilising predictive scaling features. Predictive scaling analyzes historical data to forecast future demand, enabling the system to scale resources in advance of anticipated spikes. This approach combines the benefits of both proactive and reactive scaling, ensuring optimal resource availability and responsiveness.
  - AWS: Explore Predictive Scaling for Amazon EC2 Auto Scaling, which uses machine learning models to forecast traffic patterns and adjust capacity accordingly.
  - Azure: While Azure does not currently offer a native predictive scaling feature, you can implement custom solutions by analyzing historical metrics through Azure Monitor and creating automation scripts to adjust scaling based on predicted trends.
  - GCP: Google Cloud’s autoscaler primarily operates on real-time metrics. For predictive capabilities, consider developing custom predictive models using historical data from Cloud Monitoring to inform scaling decisions.
  - OCI: Oracle Cloud Infrastructure allows for the creation of custom scripts and functions to implement predictive scaling based on historical usage patterns, although a native predictive scaling feature may not be available.
- By integrating scheduled and predictive scaling strategies, you can enhance your application’s ability to handle varying workloads efficiently, ensuring optimal performance while managing costs effectively.
Enhance Observability to Validate Autoscaling Efficacy
- Instrument your autoscaling events and track them to ensure optimal performance and resource utilisation:
  - Dashboard Real-Time Metrics: Monitor CPU, memory, and queue metrics alongside scaling events to visualise system performance in real-time.
  - Analyze Scaling Timeliness: Assess whether scaling actions occur promptly by checking for prolonged high CPU usage or frequent scale-in events that may indicate over-scaling.
- Tools:
  - AWS:
    - AWS X-Ray: Utilise AWS X-Ray to trace requests through your application, gaining insights into performance bottlenecks and the impact of scaling events.
    - Amazon CloudWatch: Create dashboards in Amazon CloudWatch to display real-time metrics and logs, correlating them with scaling activities for comprehensive monitoring.
  - Azure:
    - Azure Monitor: Leverage Azure Monitor to collect and analyze telemetry data, setting up alerts and visualisations to track performance metrics in relation to scaling events.
    - Application Insights: Use Azure Application Insights to detect anomalies and diagnose issues, correlating scaling actions with application performance for deeper analysis.
  - Google Cloud Platform (GCP):
    - Cloud Monitoring: Employ Google Cloud’s Operations Suite to monitor and visualise metrics, setting up dashboards that reflect the relationship between resource utilisation and scaling events.
    - Cloud Logging and Tracing: Implement Cloud Logging and Cloud Trace to collect logs and trace data, enabling the analysis of autoscaling impacts on application performance.
  - Oracle Cloud Infrastructure (OCI):
    - OCI Logging: Use OCI Logging to manage and search logs, providing visibility into scaling events and their effects on system performance.
    - OCI Monitoring: Utilise OCI Monitoring to track metrics and set alarms, ensuring that scaling actions align with performance expectations.
- By enhancing observability, you can validate the effectiveness of your autoscaling strategies, promptly identify and address issues, and optimise resource allocation to maintain application performance and cost efficiency.
Adopt Spot/Preemptible Instances for Autoscaled Non-Critical Workloads
- To further optimise costs, consider utilising spot or preemptible virtual machines (VMs) for non-critical, autoscaled workloads. These instances are offered at significant discounts compared to standard on-demand instances but can be terminated by the cloud provider when resources are needed elsewhere. Therefore, they are best suited for fault-tolerant and flexible applications.
  - AWS: Implement EC2 Spot Instances within an Auto Scaling Group to run fault-tolerant workloads at up to 90% off the On-Demand price. By configuring Auto Scaling groups with mixed instances, you can combine Spot Instances with On-Demand Instances to balance cost and availability.
  - Azure: Utilise Azure Spot Virtual Machines within Virtual Machine Scale Sets for non-critical workloads. Azure Spot VMs allow you to take advantage of unused capacity at significant cost savings, making them ideal for interruptible workloads such as batch processing jobs and development/testing environments.
  - Google Cloud Platform (GCP): Deploy Preemptible VMs in Managed Instance Groups to run short-duration, fault-tolerant workloads at a reduced cost. Preemptible VMs provide substantial savings for workloads that can tolerate interruptions, such as data analysis and batch processing tasks.
  - Oracle Cloud Infrastructure (OCI): Leverage Preemptible Instances for batch processing or flexible tasks. OCI Preemptible Instances offer a cost-effective solution for workloads that are resilient to interruptions, enabling efficient scaling of non-critical applications.
- By integrating these cost-effective instance types into your autoscaling strategies, you can significantly reduce expenses for non-critical workloads while maintaining the flexibility to scale resources as needed.

By broadening autoscaling across more components, incorporating richer metrics, scheduling, and advanced cost strategies like spot instances, you transform your “basic” scaling approach into a more agile, cost-effective solution. Over time, these steps foster robust, automated resource management across your entire environment.

How to do better

Here are actionable ways to refine your widespread autoscaling strategy to handle more nuanced workloads:

Adopt Application-Level or Log-Based Metrics
- Move beyond CPU and memory metrics to incorporate transaction rates, request latency, or user concurrency for more responsive and efficient autoscaling:
  - AWS:
    - CloudWatch Custom Metrics: Publish custom metrics derived from application logs to Amazon CloudWatch, enabling monitoring of specific application-level indicators such as transaction rates and user concurrency.
    - Real-Time Log Analysis with Kinesis and Lambda: Set up real-time log analysis by streaming logs through Amazon Kinesis and processing them with AWS Lambda to generate dynamic scaling triggers based on application behavior.
  - Azure:
    - Application Insights: Utilise Azure Monitor’s Application Insights to collect detailed usage data, including request rates and response times, which can inform scaling decisions for services hosted in Azure Kubernetes Service (AKS) or Virtual Machine Scale Sets.
    - Custom Logs for Scaling Signals: Implement custom logging to capture specific application metrics and configure Azure Monitor to use these logs as signals for autoscaling, enhancing responsiveness to real-time application demands.
  - Google Cloud Platform (GCP):
    - Cloud Monitoring Custom Metrics: Create custom metrics in Google Cloud’s Monitoring to track application-specific indicators such as request count, latency, or queue depth, facilitating more precise autoscaling of Compute Engine (GCE) instances or Google Kubernetes Engine (GKE) clusters.
    - Integration with Logging: Combine Cloud Logging with Cloud Monitoring to analyze application logs and derive metrics that can trigger autoscaling events based on real-time application performance.
  - Oracle Cloud Infrastructure (OCI):
    - Monitoring Custom Metrics: Leverage OCI Monitoring to create custom metrics from application logs, capturing detailed performance indicators that can inform autoscaling decisions.
    - Logging Analytics: Use OCI Logging Analytics to process and analyze application logs, extracting metrics that reflect user concurrency or transaction rates, which can then be used to trigger autoscaling events.
- Incorporating application-level and log-based metrics into your autoscaling strategy allows for more nuanced and effective scaling decisions, ensuring that resources align closely with actual application demands and improving overall performance and cost efficiency.
Introduce Multi-Metric Policies
- Instead of a single threshold, combine metrics. For instance:
  - Scale up if CPU > 70% AND average request latency > 300ms.
  - This ensures you only scale when both resource utilisation and user experience degrade, reducing false positives or unneeded expansions.
Implement Predictive or Machine Learning–Driven Autoscaling
- To anticipate demand spikes before traditional metrics like CPU utilisation react, consider implementing predictive or machine learning–driven autoscaling solutions offered by cloud providers:
  - AWS:
    - Predictive Scaling: Leverage Predictive Scaling for Amazon EC2 Auto Scaling, which analyzes historical data to forecast future traffic and proactively adjusts capacity to meet anticipated demand.
  - Azure:
    - Predictive Autoscale: Utilise Predictive Autoscale in Azure Monitor, which employs machine learning to forecast CPU load for Virtual Machine Scale Sets based on historical usage patterns, enabling proactive scaling.
  - Google Cloud Platform (GCP):
    - Custom Machine Learning Models: Develop custom machine learning models to analyze historical performance data and predict future demand, triggering autoscaling events in services like Google Kubernetes Engine (GKE) or Cloud Run based on these forecasts.
  - Oracle Cloud Infrastructure (OCI):
    - Custom Analytics Integration: Integrate Oracle Analytics Cloud with OCI to perform machine learning–based forecasting, enabling predictive scaling by analyzing historical data and anticipating future resource requirements.
- Implementing predictive or machine learning–driven autoscaling allows your applications to adjust resources proactively, maintaining performance and cost efficiency by anticipating demand before traditional metrics indicate the need for scaling.
Correlate Autoscaling with End-User Experience
- To enhance user satisfaction, align your autoscaling strategies with user-centric metrics such as page load times and overall responsiveness. By monitoring these metrics, you can ensure that scaling actions directly improve the end-user experience.
  - AWS:
    - Application Load Balancer (ALB) Target Response Times: Monitor ALB target response times using Amazon CloudWatch to assess backend performance. Elevated response times can indicate the need for scaling to maintain optimal user experience.
    - Network Load Balancer (NLB) Metrics: Track NLB metrics to monitor network performance and identify potential bottlenecks affecting end-user experience.
  - Azure:
    - Azure Front Door Logs: Analyze Azure Front Door logs to monitor end-to-end latency and other performance metrics. Insights from these logs can inform scaling decisions to enhance user experience.
    - Application Insights: Utilise Application Insights to collect detailed telemetry data, including response times and user interaction metrics, aiding in correlating autoscaling with user satisfaction.
  - Google Cloud Platform (GCP):
    - Cloud Load Balancing Logs: Examine Cloud Load Balancing logs to assess request latency and backend performance. Use this data to adjust autoscaling policies, ensuring they align with user experience goals.
    - Service Level Objectives (SLOs): Define SLOs in Cloud Monitoring to set performance targets based on user-centric metrics, enabling proactive scaling to meet user expectations.
  - Oracle Cloud Infrastructure (OCI):
    - Load Balancer Health Checks: Implement OCI Load Balancer health checks to monitor backend server performance. Use health check data to inform autoscaling decisions that directly impact user experience.
    - Custom Application Pings: Set up custom application pings to measure response times and user concurrency, feeding this data into autoscaling triggers to maintain optimal performance during varying user loads.
- By integrating user-centric metrics into your autoscaling logic, you ensure that scaling actions are directly correlated with improvements in end-user experience, leading to higher satisfaction and engagement.
Refine Scaling Cooldowns and Timers
- Tweak scale-up and scale-down intervals to avoid thrashing:
  - A short scale-up delay can address spikes quickly.
  - A slightly longer scale-down delay prevents abrupt resource removals when a short spike recedes.
- Evaluate your autoscaling policy settings monthly to align with evolving traffic patterns.

By incorporating more sophisticated application or log-based metrics, predictive scaling, and user-centric triggers, you ensure capacity aligns closely with real workloads. This approach elevates your autoscaling from a broad CPU/memory-based strategy to a finely tuned system that balances user experience, performance, and cost efficiency.

How to do better

Even at the top level, you can refine and push boundaries further:

Adopt More Granular “Distributed SLO” Metrics
- Evaluate Each Microservice’s Service-Level Objectives (SLOs): Define precise SLOs for each microservice, such as ensuring the 99th-percentile latency remains under 400 milliseconds. This granular approach allows for targeted performance monitoring and scaling decisions.
- Utilise Cloud Provider Tools to Monitor and Enforce SLOs:
  - AWS:
    - CloudWatch ServiceLens: Integrate Amazon CloudWatch ServiceLens to gain comprehensive insights into application performance and availability, correlating metrics, logs, and traces.
    - Custom Metrics and SLO-Based Alerts: Implement custom CloudWatch metrics to monitor specific performance indicators and set up SLO-based alerts to proactively manage service health.
  - Azure:
    - Application Insights: Leverage Azure Monitor’s Application Insights to track detailed telemetry data, enabling the definition and monitoring of SLOs for individual microservices.
    - Service Map: Use Azure Monitor’s Service Map to visualise dependencies and performance metrics across services, aiding in the assessment of SLO adherence.
  - Google Cloud Platform (GCP):
    - Cloud Operations Suite: Employ Google Cloud’s Operations Suite to create SLO dashboards that monitor service performance against defined objectives, facilitating informed scaling decisions.
  - Oracle Cloud Infrastructure (OCI):
    - Observability and Management Platform: Implement OCI’s observability tools to define SLOs and correlate them with performance metrics, ensuring each microservice meets its performance targets.
- Benefits of Implementing Distributed SLO Metrics:
  - Precision in Scaling: By closely monitoring how each component meets its SLOs, you can make informed decisions to scale resources appropriately, balancing performance needs with cost considerations.
  - Proactive Issue Detection: Granular SLO metrics enable the early detection of performance degradations within specific microservices, allowing for timely interventions before they impact the overall system.
  - Enhanced User Experience: Maintaining stringent SLOs ensures that end-users receive consistent and reliable service, thereby improving satisfaction and trust in your application.
- Implementation Considerations:
  - Define Clear SLOs: Collaborate with stakeholders to establish realistic and measurable SLOs for each microservice, considering factors such as latency, throughput, and error rates.
  - Continuous Monitoring and Adjustment: Regularly review and adjust SLOs and associated monitoring tools to adapt to evolving application requirements and user expectations.
- Conclusion: Adopting more granular “distributed SLO” metrics empowers you to fine-tune your application’s performance management, ensuring that each microservice operates within its defined parameters. This approach facilitates precise scaling decisions, optimising both performance and cost efficiency.
Experiment with Multi-Provider or Hybrid Autoscaling
- If policy allows, or your architecture is containerised, test the feasibility of bursting into another region or cloud for capacity:
- This approach is advanced but can further optimise resilience and cost across providers.
Integrate with Detailed Cost Allocation & Forecasting
- Combine real-time scale data with cost forecasting models:
- This ensures you can quickly investigate if an unusual surge in scaling leads to unapproved budget expansions.
Leverage AI/ML for Real-Time Scaling Decisions
- Deploy advanced ML models that continuously adapt scaling triggers based on anomaly detection in logs or usage patterns.
- Tools or patterns:
Adopt Sustainable/Green Autoscaling Policies
- If your usage is flexible, consider shifting workloads to times or regions with lower carbon intensity:
  - AWS Sustainability Pillar in Well-Architected Framework and region selection guidance for scheduling large tasks.
  - Azure Emissions Impact Dashboard integrated with scheduled scale tasks in greener data center regions.
  - Google Cloud’s Carbon Footprint and Active Assist for reducing cloud carbon footprint.
  - Oracle Cloud Infrastructure’s sustainability initiatives combined with custom autoscaling triggers for environment-friendly computing.
- This step can integrate cost savings with environmental commitments, aligning with the Greening Government Commitments.

By blending advanced SLO-based scaling, multi-provider strategies, cost forecasting, ML-driven anomaly detection, and sustainability considerations, you ensure your autoscaling remains cutting-edge. This not only provides exemplary performance and cost control but also positions your UK public sector organisation as a leader in efficient, responsible cloud computing.

Keep doing what you’re doing, and consider sharing your successes via blog posts or internal knowledge bases. Submit pull requests to this guidance if you have innovative approaches or examples that can benefit other public sector organisations. By exchanging real-world insights, we collectively raise the bar for cloud maturity and cost effectiveness across the entire UK public sector.

How do you run services in the cloud? [change your answer]

You did not answer this question.

How to do better

Here are rapidly actionable improvements to help you move beyond purely static VMs:

Enable Basic Monitoring and Cost Insights
- Even if you keep long-running VMs, gather usage metrics and financial data:
  - AWS CloudWatch and AWS Cost Explorer.
  - Azure Monitor and Azure Cost Management.
  - GCP Monitoring and Billing Reports.
  - OCI Monitoring and Cost Analysis.
  - IBM Cloud Billing and IBM Cost Estimator.
- Check CPU, memory, and storage utilisation. If these metrics show consistent underuse (like 10% CPU usage around the clock), it’s a sign you can downsize or re-architect.
Leverage Built-in Right-sizing Tools
- Major cloud providers offer “right-sizing” recommendations:
  - AWS Compute Optimiser to get suggestions for smaller or larger instance sizes.
  - Azure Advisor for VM right-sizing to identify underutilised virtual machines.
  - GCP Recommender for machine types to optimise resource utilisation.
  - OCI Workload and Resource Optimisation for tailored resource recommendations.
  - IBM Cloud Resource Controller is the next-generation IBM Cloud platform provisioning layer that manages the lifecycle of IBM Cloud resources in a customer account.
- Make a plan to apply at least one or two right-sizing recommendations each quarter. This is a quick, low-risk path to cost savings and better resource use.
Introduce Simple Scheduling
- If some VMs are only needed during business hours, schedule automatic shutdown at night or on weekends:
- A single action to stop dev/test or lightly used environments after hours can yield noticeable cost (and energy) savings.
Conduct a Feasibility Check for a Small Container Pilot
- Even if you retain most workloads on VMs, pick one small application or batch job and try containerising it:
  - AWS Fargate or Amazon EKS for containers.
  - Azure Container Instances or Azure Kubernetes Service (AKS).
  - Google Cloud Run or Google Kubernetes Engine (GKE).
  - Oracle Cloud Infrastructure (OCI) Container Instances or Oracle Kubernetes Engine (OKE).
  - There are two options on IBM Cloud for a container platform, Redhat Openshift on IBM Cloud or IBM Cloud Kubernetes Service
- By piloting a single container-based workload, you can assess potential elasticity and determine whether container orchestration solutions meet your needs. This approach allows for quick experimentation with minimal risk.
Raise Awareness with Internal Stakeholders
- Share simple usage and cost graphs with your finance or leadership teams. Show them the difference between “always-on” vs. “scaled” or “scheduled” usage.
- This could drive more formal mandates or budget incentives to encourage partial re-architecture or adoption of short-lived compute in the future.

By monitoring usage, applying right-sizing, scheduling idle time, and introducing a small container pilot, you can meaningfully reduce waste. Over time, you’ll build momentum toward more flexible compute strategies while still respecting the constraints of your existing environment.

How to do better

Here are actionable next steps to accelerate your modernisation journey without overwhelming resources:

Expand Container/Serverless Pilots in a Structured Way
- Identify a short list of low-risk workloads that could benefit from ephemeral compute, such as batch processing or data transformation.
- Use native solutions to reduce complexity:
  - AWS Fargate with ECS/EKS for container-based tasks without server management.
  - Azure Container Apps or Azure Functions for event-driven workloads.
  - Google Cloud Run for container-based microservices or Google Cloud Functions.
  - Oracle Cloud Infrastructure (OCI) Container Instances or OCI Functions for short-lived tasks.
- Document real cost/performance outcomes to present a stronger case for further expansion.
Implement Granular VM Auto-Scaling
- Even with VMs, you can configure auto-scaling groups or scale sets to handle changing loads:
- This ensures you pay only for the capacity you need during peak vs. off-peak times.
Use Container Services for Non-Critical Production
- If you have a stable container proof-of-concept, consider migrating a small but genuine production workload. Examples:
  - Internal APIs, internal data analytics pipelines, or front-end servers that can scale up/down.
  - Focus on microservices that do not require extensive refactoring.
- This fosters real operational experience, bridging from “non-critical tasks” to “production readiness.”
Leverage Cloud Marketplace or Government Frameworks
- Explore container-based solutions or DevOps tooling that might be available under G-Cloud or Crown Commercial Service frameworks.
- Some providers offer managed container solutions pre-configured for compliance or security—this can reduce friction around governance.
Train or Upskill Teams
- Provide short courses or lunch-and-learns on container orchestration (Kubernetes, ECS, AKS, etc.) or serverless fundamentals.
- Many vendors have free or low-cost training:
Building confidence and skills helps teams adopt more advanced compute models.

Through these steps—structured expansions of containerised or serverless pilots, improved auto-scaling of VMs, and staff training—your organisation can gradually shift from “limited experimentation” to a more balanced compute ecosystem. The result is improved agility, potential cost savings, and readiness for more modern architectures.

How to do better

Below are rapidly actionable ways to enhance your mixed compute model:

Adopt Unified Deployment Pipelines
- Strive for standard tooling that can deploy both VMs and container/serverless environments. For instance:
  - AWS CodePipeline or AWS CodeBuild integrated with ECS, Lambda, EC2, etc.
  - Azure Pipelines or GitHub Actions for VMs, AKS, Azure Functions.
  - Google Cloud Build for GCE, GKE, Cloud Run deployments.
  - OCI DevOps service for flexible deployments to OKE, Functions, or VMs.
- This reduces fragmentation and fosters consistent best practices (code review, automated testing, environment provisioning).
Enhance Observability
- Implement a single monitoring stack that captures logs, metrics, and traces across VMs, containers, and functions:
  - AWS CloudWatch combined with AWS X-Ray for distributed tracing in containers or Lambda.
  - Azure Monitor along with Application Insights for containers and serverless telemetry.
  - Google Cloud’s Operations Suite utilising Cloud Logging and Cloud Trace for multi-service environments.
  - Oracle Cloud Infrastructure (OCI) Logging integrated with the Observability and Management Platform for cross-service insights.
- Unified observability ensures you can quickly identify inefficiencies or scaling issues.
Introduce a Tagging/Governance Policy
- Standardise tags or labels for cost center, environment, and application name. This practice aids in tracking spending, performance, and potential carbon footprint across various compute services.
- Utilise tools such as:
- Implementing a unified tagging strategy fosters accountability and helps identify usage patterns that may require optimisation.
Implement Automated or Dynamic Scaling
- For container-based workloads, set CPU and memory usage thresholds to enable auto-scaling of pods or tasks:
  - AWS Fargate/ECS auto-scaling based on CloudWatch metrics.
  - Azure Kubernetes Service Horisontal Pod Autoscaler (HPA).
  - GCP GKE Horisontal Pod Autoscaler or Cloud Run request-based autoscaling.
  - OCI OKE cluster autoscaler or container usage triggers.
- For serverless architectures, establish concurrency or usage limits to prevent unexpected cost spikes.
Implementing these scaling strategies ensures that your applications can efficiently handle varying workloads while controlling costs.
Leverage Reserved or Discounted Pricing for Steady Components
- If certain VMs or container clusters must run continuously, investigate vendor discount models:
- Blend on-demand resources for elastic workloads with reservations for predictable baselines to optimise costs.
Implementing these strategies can lead to significant cost savings for workloads with consistent usage patterns.

By unifying your deployment practices, consolidating observability, enforcing tagging, and refining autoscaling or discount usage, you move from an ad-hoc mix of compute styles to a more cohesive, cost-effective cloud ecosystem. This sets the stage for robust, consistent governance and significant agility gains.

How to do better

Below are actionable expansions to push your ephemeral usage approach further:

Adopt a “Compute Decision Framework”
- Formalise how new workloads choose among FaaS (functions), CaaS (containers), or short-lived VMs:
  - If event-driven with spiky traffic, prefer serverless.
  - If the service requires consistent runtime dependencies but can scale, prefer containers.
  - If specialised hardware or older OS is needed briefly, use short-lived VMs.
- This standardisation helps teams quickly pick the best fit.
Enable Event-Driven Automation
- Use events to trigger ephemeral jobs:
  - AWS EventBridge or CloudWatch Events to invoke Lambda or spin up ECS tasks.
  - Azure Event Grid or Logic Apps triggering Functions or container jobs.
  - GCP Pub/Sub or EventArc calls Cloud Run services or GCE ephemeral jobs.
  - OCI Events Service integrated with Functions or autoscaling rules.
- This ensures resources only run when triggered, further minimising idle time.
Implement Container Security Best Practices
- As ephemeral container usage grows, so do potential security concerns:
- Integrate scans into your CI/CD pipeline for immediate alerts and automation.
Refine Infrastructure as Code (IaC) and Pipeline Patterns
- Standardise ephemeral environment creation using:
  - AWS CloudFormation or AWS CDK, plus AWS CodePipeline.
  - Azure Resource Manager templates or Bicep, plus Azure DevOps or GitHub Actions.
  - GCP Deployment Manager or Terraform, with Cloud Build triggers.
  - OCI Resource Manager for stack deployments, integrated with OCI DevOps pipeline.
- Encourage a shared library of environment definitions to accelerate new project spin-up.
Extend Tagging and Cost Allocation
- Since ephemeral resources come and go quickly, ensure they are labeled or tagged upon creation.
- Set up budgets or cost alerts to identify if ephemeral usage unexpectedly spikes:

By formalising your decision framework, expanding event-driven architectures, ensuring container security, and strengthening IaC patterns, you solidify your short-lived compute model. This approach reduces overheads, fosters agility, and helps UK public sector teams remain compliant with cost and operational excellence targets.

How to do better

Even at this advanced state, you can still hone practices. Below are suggestions:

Automate Decision Workflows
- Build an internal “Service Catalog” or “Decision Tree.” For instance:
  - A web-based form that asks about the workload’s functional, regulatory, performance, and cost constraints, then suggests suitable solutions (SaaS, FaaS, containers, etc.).
- This can be integrated with pipeline automation so new projects must pass through the framework before provisioning resources.
Deepen SaaS Exploration for Niche Needs
- Explore specialised SaaS options for areas like data analytics, content management, or identity services.
- Ensure your staff or solution architects regularly revisit the G-Cloud listings or other Crown Commercial Service frameworks to see if an updated SaaS solution can replace custom-coded or container-based systems.
Further Standardise DevOps Across All Layers
- If you run FaaS on multiple clouds or keep some workloads on private cloud, unify your deployment approach.
- Encourage a single pipeline style:
  - AWS CodePipeline or GitHub Actions for everything from AWS Lambda to Amazon ECS, plus AWS CloudFormation for infrastructure as code.
  - Azure DevOps for .NET-based function apps, container solutions like Azure Container Instances, or Azure Virtual Machines under one roof.
  - Google Cloud Build triggers that handle Cloud Run, Google Compute Engine, or third-party SaaS integrations.
  - Oracle Cloud Infrastructure (OCI) DevOps pipeline for a mixed environment using Oracle Kubernetes Engine (OKE), Oracle Functions, or third-party webhooks.
Maintain a Living Right-sizing Strategy
- Expand beyond memory/CPU metrics to measure cost per request, concurrency, or throughput.
- Tools like:
Focus on Energy Efficiency and Sustainability
- Refine your approach with a strong environmental lens:
  - Pick regions or times that yield lower carbon intensity, if permitted by data residency rules.
  - Enforce ephemeral usage policies to avoid running resources unnecessarily.
- Each vendor offers sustainability or carbon data to inform your “fit for purpose” decisions:
Champion Cross-Public-Sector Collaboration
- Share lessons or templates with other departments or agencies. This fosters consistent best practices across local councils, NHS trusts, or central government bodies.

By automating your decision workflows, continuously exploring SaaS, standardising DevOps pipelines, and incorporating advanced metrics (including sustainability), you maintain an iterative improvement path at the peak of compute maturity. This ensures you remain agile in responding to new user requirements and evolving government initiatives, all while controlling costs and optimising resource efficiency.

Keep doing what you’re doing, and consider writing up success stories, internal case studies, or blog posts. Submit pull requests to this guidance or relevant public sector best-practice repositories so others can learn from your achievements. By sharing real-world experiences, you help the entire UK public sector enhance its cloud compute maturity.

How do you track sustainability? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps that provide greater visibility and ensure you move beyond mere vendor assurances:

Request Vendor Transparency
- Ask your provider for UK-region-specific energy usage information and carbon intensity data. For example:
- Even if the data is approximate, it helps you begin to monitor trends.
Enable Basic Billing and Usage Reports
- Activate native cost-and-usage tooling to gather baseline compute usage:
  - AWS Cost Explorer with daily or hourly granularity.
  - Azure Cost Management
  - GCP Billing Export to BigQuery
  - OCI Cost Analysis
  - IBM Cloud Billing & IBM Cost Estimator
- While these tools focus on monetary spend, you can correlate usage data with the vendor’s sustainability information.
Incorporate Sustainability Clauses in Contracts
- When renewing or issuing new calls on frameworks like G-Cloud, add explicit language for carbon reporting.
- Request quarterly or annual updates on how your usage ties into the vendor’s net-zero or carbon offset strategies.
Incorporating sustainability clauses into your contracts is essential for ensuring that your cloud service providers align with your environmental goals. The Crown Commercial Service offers guidance on integrating such clauses into the G-Cloud framework. Additionally, the Chancery Lane Project provides model clauses for environmental performance, which can be adapted to your contracts.
By proactively including these clauses, you can hold vendors accountable for their sustainability commitments and ensure that your organisation’s operations contribute positively to environmental objectives.
Track Internal Workload Growth
- Even if you rely on vendor neutrality claims, set up a simple spreadsheet or a lightweight tracker for each of your main cloud workloads (service name, region, typical CPU usage, typical memory usage). If usage grows, you will notice potential new carbon hotspots.
Raise Internal Awareness
- Create a short briefing note for leadership or relevant teams (e.g., finance, procurement) highlighting:
  1. Your current reliance on vendor offsetting, and
  2. The need for baseline data collection.
This ensures any interest in deeper environmental reporting can gather support before usage grows further.

How to do better

Here are quick wins to strengthen your approach and make it more actionable:

Use Vendor Sustainability Tools for Basic Estimation
- Enable the carbon or sustainability dashboards in your chosen cloud platform to get monthly or quarterly snapshots:
Create Simple Internal Guidelines
- Expand beyond policy statements:
  1. Resource Tagging: Mandate that every new resource is tagged with an owner, environment, and a sustainability tag (e.g., “non-prod, auto-shutdown” vs. “production, high-availability”).
  2. Preferred Regions: If feasible, prefer data centers that the vendor identifies as more carbon-friendly. For example, some AWS and Azure UK-based regions rely on greener energy sourcing than others.
Schedule Simple Sustainability Checkpoints
- Alongside your standard procurement or architectural reviews, add a sustainability review item. E.g.:
  - “Does the new service use the recommended low-carbon region?”
  - “Is there a plan to power down dev/test resources after hours?”
- This ensures your new policy is not forgotten in day-to-day activities.
Offer Quick Training or Knowledge Sessions
- Host short lunch-and-learn events or internal micro-training on “Cloud Sustainability 101” for staff. Show them how they can use:
The point is to connect cost optimisation with sustainability—over-provisioned resources burn more carbon.
Publish Simple Reporting
- Create a once-a-quarter dashboard or presentation highlighting approximate cloud emissions. Even if the data is partial or not perfect, transparency drives accountability.

By rapidly applying these steps—using native vendor tools to measure usage, establishing minimal but meaningful guidelines, and scheduling brief training or check-ins—you elevate your policy from mere awareness to actual practice.

How to do better

Focus on rapid, vendor-native steps to convert targets into tangible reductions:

Automate Right-sizing
- Many providers have native tools to recommend more efficient instance sizes:
  - AWS Compute Optimiser to identify underutilised EC2, EBS, or Lambda resources
  - Azure Advisor Right-sizing for VMs and databases
  - GCP Recommender for VM rightsizing
  - OCI Adaptive Intelligence for resource optimisation
By automatically resizing or shifting to lower-tier SKUs, you reduce both cost and emissions.
Implement Scheduled Autoscaling
- Introduce or refine your autoscaling policies so that workloads scale down outside peak times:
This directly lowers carbon usage by removing idle capacity.
Leverage Serverless or Container Services
- Where feasible, re-platform certain workloads to serverless or container-based architectures that scale to zero. Rapid wins can be found by:
Serverless can significantly cut wasted resources, which aligns with your reduction targets.
Adopt “Carbon Budgets” in Project Plans
- For every new app or service, define a carbon allowance. If estimates exceed the budget, require design changes. Incorporate vendor solutions that show region-level carbon data:

These tools provide insights into the carbon emissions associated with different regions, enabling more sustainable decision-making.

Align with Departmental or National Sustainability Goals
- Update your internal reporting to reflect how your targets link to national net zero obligations or departmental commitments (e.g., the NHS net zero plan, local authority climate emergency pledges). This ensures your measurement and goals remain relevant to broader public sector accountability.

Implementing these steps swiftly helps ensure you don’t just measure but actually reduce your carbon footprint. Regular iteration—checking usage data, right-sizing, adjusting autoscaling—ensures continuous progress toward your stated targets.

How to do better

Actionable steps to deepen your integrated approach:

Set Up Automated Governance Rules
- Enforce region-based or instance-based policies automatically:

Implementing these policies ensures that resources are deployed in regions with lower carbon footprints, aligning with your sustainability objectives.

Adopt Full Lifecycle Management
- Extend sustainability beyond compute:
  - Automate data retention: Move older data to cooler or archive storage for lower energy usage:
  - Review ephemeral development: Ensure test environments are automatically cleaned after a set period.
Use Vendor-Specific Sustainability Advisors
- Some providers offer “sustainability pillars” or specialised frameworks:
Incorporate these suggestions directly into sprint backlogs or monthly improvement tasks.
Embed Sustainability in DevOps Pipelines
- Modify build/deployment pipelines to check resource usage or region selection:
  - If a new environment is spun up in a high-carbon region or with large instance sizes, the pipeline can prompt a warning or require an override.
  - Tools like GitHub Actions or Azure DevOps Pipelines can call vendor APIs to fetch sustainability metrics and fail a build if it’s non-compliant.
Promote Cross-Functional “Green Teams”
- Form a small working group or “green champions” network across procurement, DevOps, governance, and finance, meeting monthly to share best practices and track new optimisation opportunities.
- This approach keeps your integrated practices dynamic, ensuring you respond quickly to new vendor features or updated government climate guidance.

By adding these automated controls, pipeline checks, and cross-functional alignment, you ensure that your integrated sustainability approach not only continues but evolves in real time. You become more agile in responding to shifting requirements and new tools, maintaining a leadership stance in UK public sector cloud sustainability.

How to do better

Even at this advanced level, below are further actions to refine your dynamic management:

Build or Leverage Carbon-Aware Autoscaling
- Many providers offer advanced scaling rules that consider multiple signals. Integrate carbon signals:
Collaborate with BEIS or Relevant Government Bodies
- The Department for Business, Energy & Industrial Strategy (BEIS) or other departments may track grid-level carbon. If you can integrate their public data (e.g., real-time carbon intensity in the UK), you can refine your scheduling.
- Seek synergy with national digital transformation or sustainability pilot programmes that might offer new tools or funding for experimentation.
AI or ML-Driven Forecasting
- Incorporate predictive analytics that forecast your usage spikes and align them with projected carbon intensity (peak/off-peak). Tools like:
Then automatically shift or throttle workloads accordingly.
Innovate with Low-Power Hardware
- Evaluate next-gen or specialised hardware solutions with lower energy profiles:
Typically, these instance families consume less energy for similar workloads, further reducing carbon footprints.
Automated Data Classification and Tiering
- For advanced data management, use AI to classify data in real-time and automatically place it in the most sustainable storage tier:
This ensures minimal energy overhead for data retention.
Set an Example through Openness
- If compliance allows, publish near real-time dashboards illustrating your advanced scheduling successes or hardware usage.
- Share code or Infrastructure-as-Code templates with other public sector teams to accelerate mutual learning.

By implementing these advanced tactics, you sharpen your dynamic optimisation approach, continuously pushing the envelope of what’s possible in sustainable cloud operations—while respecting legal constraints around data sovereignty and any performance requirements unique to public services.

Keep doing what you’re doing, and consider documenting or blogging about your experiences. Submit pull requests to this guidance so other UK public sector organisations can accelerate their own sustainability journeys. By sharing real-world results and vendor-specific approaches, you help shape a greener future for public services across the entire nation.

How do you manage costs? [change your answer]

You did not answer this question.

How do I do better?

If you want to improve beyond “Restricted Billing Visibility,” the next step typically involves democratising cost data. This transition does not mean giving everyone unrestricted access to sensitive financial accounts or payment details. Instead, it centers on making relevant usage and cost breakdowns accessible to those who influence spending decisions, such as product owners, development teams, and DevOps staff, in a manner that is both secure and comprehensible.

Below are tangible ways to create a more open and proactive cost culture:

Role-Based Access to Billing Dashboards
- Most major cloud providers offer robust billing dashboards that can be securely shared with different levels of detail. For example, you can configure specialised read-only roles that allow developers to see usage patterns and daily cost breakdown without granting them access to critical financial settings.
- Look into official documentation and solutions from your preferred cloud provider:
  - AWS: AWS Cost Explorer
  - Azure: Azure Cost Management
  - GCP: Cloud Billing Reports
  - OCI: Oracle Cloud Cost Analysis
  - IBM Cloud Billing & IBM Cost Estimator
- By carefully configuring role-based access, you enable various teams to monitor cost drivers without exposing sensitive billing details such as invoicing or payment methods.
Regular Cost Review Meetings
- Schedule short, recurring meetings (monthly or bi-weekly) where finance, engineering, operations, and leadership briefly review cost trends. This fosters collaboration, encourages data-driven decisions, and allows everyone to ask questions or highlight anomalies.
- Ensure these sessions focus on actionable items. For instance, if a certain service’s spend has doubled, discuss whether that trend reflects legitimate growth or a misconfiguration that can be quickly fixed.
Automated Cost Alerts for Key Stakeholders
- Integrating cost alerts into your organisational communication channels can be a game changer. Instead of passively waiting for monthly bills, set up cost thresholds, daily or weekly cost notifications, or usage anomalies that get shared in Slack, Microsoft Teams, or email distribution lists.
- This approach ensures that the right people see cost increases in near real-time. If a developer spins up a large instance for testing and forgets to turn it off, you can catch that quickly.
- Each major provider offers alerting and budgeting features:
  - AWS: Budgets and Alerts
  - Azure: Budget Alerts and Advisor Recommendations
  - GCP: Budget Notifications and Alerts
  - OCI: Cost Tracking and Alerting
  - IBM Cloud: Best practices for Account Cost Tracking
Cost Dashboards Embedded into Engineering Workflows
- Rather than expecting developers to remember to check a separate financial console, embed cost insights into the tools they already use. For example, if your organisation relies on a continuous integration/continuous deployment (CI/CD) pipeline, you can integrate scripts or APIs that retrieve daily cost data and present them in your pipeline dashboards or as part of a daily Slack summary.
- Some organisations incorporate cost metrics into code review processes, ensuring that changes with potential cost implications (like selecting a new instance type or enabling a new managed service) are considered from both a technical and financial perspective.
Empowering DevOps with Cost Governance
- If you have a DevOps or platform engineering team, involve them in evaluating cost optimisation best practices. By giving them partial visibility into real-time spend data, they can quickly adjust scaling policies, identify over-provisioned resources, or investigate usage anomalies before a bill skyrockets.
- You might create a “Cost Champion” role in each engineering squad—someone who monitors usage, implements resource tagging strategies, and ensures that the rest of the team remains mindful of cloud spend.
Use of FinOps Principles
- The emerging discipline of FinOps (short for “Financial Operations”) focuses on bringing together finance, engineering, and business stakeholders to drive financial accountability. Adopting a FinOps mindset means cost visibility becomes a shared responsibility, with iterative improvement at its core.
- Consider referencing frameworks like the FinOps Foundation’s Principles to learn about building a culture of cost ownership, unit economics, and cross-team collaboration.
Security and Compliance Considerations
- Improving visibility does not mean exposing sensitive corporate finance data or violating compliance rules. Many organisations adopt an approach where top-level financial details (like credit card info or total monthly invoice) remain restricted, but usage-based metrics, daily cost reports, and resource-level data are made available.
- Work with your governance or risk management teams to ensure that any expanded visibility aligns with data protection regulations and internal security policies.

By following these strategies, you shift from a guarded approach—where only finance or management see the details—to a more inclusive cost culture. The biggest benefit is that your engineering teams gain the insight they need to optimise continuously. Rather than discovering at the end of the month that a test environment was running at full throttle, teams can detect and fix potential overspending early. Over time, this fosters a sense of shared cost responsibility, encourages more efficient design decisions, and drives proactive cost management practices across the organisation.

How do I do better?

To enhance a “Proactive Spend Commitment by Finance” model, organisations often evolve toward deeper collaboration between finance, engineering, and product teams. This ensures that negotiated contracts and reserved purchasing decisions accurately reflect real workloads, growth patterns, and future expansions. Below are methods to improve:

Integrated Forecasting and Capacity Planning
- Instead of having finance make decisions based purely on past billing, establish a forecasting model that includes planned product launches, major infrastructure changes, or architectural transformations.
- Encourage technical teams to share roadmaps (e.g., upcoming container migrations, new microservices, or expansions into different regions) so finance can assess whether existing reservation strategies are aligned with future reality.
- By merging product timelines with historical usage data, finance can negotiate better deals and tailor them closely to the actual environment.
Dynamic Monitoring of Reservation Coverage
- Use vendor-specific tools or third-party solutions to track your reservation utilisation in near-real-time. For instance:
- Continuously reviewing coverage lets you adjust reservations if your provider or plan permits it. Some vendors allow you to modify instance families, shift reservations to different regions, or exchange them for alternative instance sizes, subject to specific constraints.
Cross-Functional Reservation Committees
- Create a cross-functional group that meets quarterly or monthly to decide on reservation purchases or modifications. In this group, finance presents cost data, while engineering clarifies usage patterns and product owners forecast upcoming demand changes.
- This ensures that any new commits or expansions account for near-future workloads rather than only historical data. If you adopt agile practices, incorporate these reservation reviews as part of your sprint cycle or program increment planning.
Leverage Spot or Preemptible Instances for Variable Workloads
- An advanced tactic is to blend long-term reservations for predictable workloads with short-term, highly cost-effective instance types—such as AWS Spot Instances, Azure Spot VMs, GCP Preemptible VMs, or OCI Preemptible Instances—for workloads that can tolerate interruptions.
- Finance-led pre-commits for baseline needs plus engineering-led strategies for ephemeral or experimental tasks can minimise your total cloud spend. This synergy requires communication between finance and engineering so that the latter group can identify which workloads can safely run on spot capacity.
Refining Commitment Levels and Terms
- If your cloud vendor offers multiple commitment term lengths (e.g., 1-year vs. 3-year reservations, partial upfront vs. full upfront) and different coverage tiers, refine your strategy to match usage stability. For example, if 60% of your workload is unwavering, consider 3-year commits; if another 20% fluctuates, opt for 1-year or on-demand.
- Over time, as your usage data becomes more accurate and your architecture stabilises, you can shift more workloads into longer-term commitments for greater discounts. Conversely, if your environment is in flux, keep your commitments lighter to avoid overpaying.
Unit Economics and Cost Allocation
- Enhance your commitment strategy by tying it to unit economics—i.e., cost per customer, cost per product feature, or cost per transaction. Once you can express your cloud bills in terms of product-level or service-level metrics, you gain more clarity on which areas most justify pre-commits.
- If you identify a specific product line that reliably has N monthly active users, and you have stable usage patterns there, you can base reservations on that product’s forecast. Then, the cost savings from reservations become more attributable to specific products, making budgeting and cost accountability smoother.
Ongoing Financial-Technical Collaboration
- Beyond initial negotiations, keep the lines of communication open. Cloud resource usage is dynamic, particularly with continuous integration and deployment practices. Having monthly or quarterly check-ins between finance and engineering ensures you track coverage, refine cost models, and respond quickly to usage spikes or dips.
- Consider forming a “FinOps” group if your cloud usage is substantial. This multi-disciplinary team can use data from daily or weekly cost dashboards to fine-tune reservations, detect anomalies, and champion cost-optimisation strategies across the business.

By progressively weaving in these improvements, you move from a purely finance-led contract negotiation model to one where decisions about reserved spending or commitments are strongly informed by real-time engineering data and future product roadmaps. This more holistic approach leads to higher reservation utilisation, fewer wasted commitments, and better alignment of your cloud spending with actual business goals. The result is typically a more predictable cost structure, improved cost efficiency, and reduced risk of paying for capacity you do not need.

How do I do better?

If you wish to refine your cost-efficiency, consider adding more sophisticated processes, automation, and cultural practices. Here are ways to evolve:

Implement More Granular Auto-Scaling Policies
- Move beyond simple CPU-based or time-based triggers. Incorporate multiple metrics (memory usage, queue depth, request latency) so you scale up and down more precisely. This ensures that environments adjust capacity as soon as traffic drops, boosting your savings.
- Evaluate advanced solutions from your cloud provider:
Use Infrastructure as Code for Environment Management
- Instead of ad hoc creation and shutdown scripts, adopt Infrastructure as Code (IaC) tools (e.g., Terraform, AWS CloudFormation, Azure Bicep, Google Deployment Manager, or OCI Resource Manager) to version-control environment configurations. Combine IaC with schedule-based or event-based triggers.
- This approach ensures that ephemeral environments are consistently built and torn down, leaving minimal risk of leftover resources. You can also implement automated tagging to track cost by environment, team, or project.
Re-Architect for Serverless or Containerised Workloads
- If your application can tolerate stateless, event-driven, or container-based architectures, consider adopting serverless computing (e.g., AWS Lambda, Azure Functions, GCP Cloud Functions, OCI Functions) or container orchestrators (e.g., Kubernetes, Docker Swarm).
- These models often scale to zero when no requests are active, ensuring you only pay for actual usage. While not all workloads are suitable, re-architecting certain components can yield significant cost improvements.
Optimise Storage and Networking
- Cost-effective management extends beyond compute. Look for opportunities to move infrequently accessed data to cheaper storage tiers, such as object storage archive classes or lower-performance block storage. Configure lifecycle policies to purge logs or snapshots after a specified retention.
- Monitor data transfer costs between regions, availability zones, or external endpoints. If your architecture unnecessarily routes traffic through costlier paths, consider direct inter-region or peering solutions that reduce egress charges.
Scheduled Resource Hibernation and Wake-Up Processes
- Extend beyond typical off-hour shutdowns by creating fully automated schedules for every environment that does not require 24/7 availability. For instance, set a policy to shut down dev/test resources at 7 p.m. local time, and spin them back up at 8 a.m. the next day.
- Tools or scripts can detect usage anomalies (e.g., someone working late) and override the schedule or send a prompt to confirm if the environment should remain active. This approach ensures maximum cost avoidance, especially for large dev clusters or specialised GPU instances.
Incorporate Cost Considerations into Code Reviews and Architecture Decisions
- Foster a culture in which cost is a first-class design principle. During code reviews, developers might highlight the cost implications of using a high-tier database service, retrieving data across regions, or enabling a premium feature.
- Architecture design documents should include estimated cost breakdowns, referencing official pricing details for the services involved. Over time, teams become more adept at spotting potential overspending.
Automated Auditing and Cleanup
- Implement scripts or tools that run daily or weekly to detect unattached volumes, unused IP addresses, idle load balancers, or dormant container images. Provide automated cleanup or at least raise alerts for manual review.
- Many cloud providers have built-in recommendations engines:
  - AWS: AWS Trusted Advisor
  - Azure: Azure Advisor
  - GCP: Recommender Hub
  - OCI: Oracle Cloud Advisor
Track and Celebrate Savings
- Publicise cost optimisation wins. If an engineering team shaved 20% off monthly bills by fine-tuning auto-scaling, celebrate that accomplishment in internal communications. Show the before/after metrics to encourage others to follow suit.
- This positive reinforcement helps maintain momentum and fosters a sense of shared ownership.

By layering these enhancements, you move beyond basic scheduling or minimal auto-scaling. Instead, you cultivate a deeply ingrained practice of continuous optimisation. You harness automation to enforce best practices, integrate cost awareness into everyday decisions, and systematically re-architect services for maximum efficiency. Over time, the result is a lean cloud environment that can expand when needed but otherwise runs with minimal waste.

How do I do better?

If you want to upgrade your cost-aware development environment, you can deepen the integration of financial insight into everyday engineering. Below are practical methods:

Enhance Toolchain Integrations
- Provide cost data directly in the platforms developers use daily:
  - Pull Request Annotations: When a developer opens a pull request in GitHub or GitLab that adds new cloud resources (e.g., creating a new database or enabling advanced analytics), an automated comment could estimate the monthly or annual cost impact.
  - IDE Plugins: Investigate or develop plugins that estimate cost implications of certain library or service calls. While advanced, such solutions can drastically reduce guesswork.
  - CI/CD Pipeline Steps: Incorporate cost checks as a gating mechanism in your CI/CD process. If a change is projected to exceed certain cost thresholds, it triggers a review or a labeled warning.
Reward and Recognition Systems
- Implement a system that publicly acknowledges or rewards teams that achieve significant cost savings or code optimisations that reduce the cloud bill. This can be a monthly “cost champion” award or a highlight in the company-wide newsletter.
- Recognising teams for cost-smart decisions helps embed a culture where financial prudence is celebrated alongside feature delivery and reliability.
Cost Education Workshops
- Host internal workshops or lunch-and-learns where experts (whether from finance, DevOps, or a specialised FinOps team) explain how cloud billing works, interpret usage graphs, or share best practices for cost-efficient coding.
- Make these sessions as practical and example-driven as possible: walk developers through real code and show the difference in cost from alternative approaches.
Tagging and Chargeback/Showback Mechanisms
- Encourage consistent resource tagging so that each application component or service is clearly attributed to a specific team, project, or feature. This tagging data feeds into cost reports that let you see which code bases or squads are driving usage.
- You can then implement a “showback” model (where each team sees the monthly cost of their resources) or a “chargeback” model (where those costs directly affect team budgets). Such financial accountability often motivates more thoughtful engineering decisions.
Guidelines and Architecture Blueprints
- Produce internal reference guides that show recommended patterns for cost optimisation. For example, specify which database types or instance families are preferred for certain workloads. Provide example Terraform modules or CloudFormation templates that are pre-configured for cost-efficiency.
- Encourage developers to consult these guidelines when designing new systems. Over time, the default approach becomes inherently cost-aware.
Frequent Feedback Loops
- Implement daily or weekly cost digests that are automatically posted in relevant Slack channels or email lists. These digests highlight the top 5 cost changes from the previous period, giving engineering teams rapid insight into where spend is shifting.
- Additionally, create a channel or forum where developers can ask cost-related questions in real time, ensuring they do not have to guess how a new feature might affect the budget.
Collaborative Budgeting and Forecasting
- For upcoming features or architectural revamps, involve engineers in forecasting the cost impact. By inviting them into the financial planning process, you ensure they understand the budgets they are expected to work within.
- Conversely, finance or product managers can learn from engineers about the real operational complexities, leading to more accurate forecasting and fewer unrealistic cost targets.
Adopt a FinOps Mindset
- Expand on the FinOps principles beyond finance alone. Encourage all engineering teams to take part in continuous cost optimisation cycles—inform, optimise, and operate. In these cycles, you measure usage, identify opportunities, experiment with changes, and track results.
- Over time, cost efficiency becomes an ongoing practice rather than a one-time initiative.

By adopting these approaches, you elevate cost awareness from a passive, occasional concern to a dynamic, integrated element of day-to-day development. This deeper integration helps your teams design, code, and deploy with financial considerations in mind—often leading to innovative solutions that deliver both performance and cost savings.

How do you choose where to run workloads and store data? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable ways to refine an intra-region approach:

Enable Automatic Multi-AZ Deployments
- e.g., AWS Auto Scaling groups across multiple AZs, Azure VM Scale Sets in multiple zones, GCP Managed Instance Groups (MIGs) or multi-zonal regional clusters, OCI multi-AD distribution for compute/storage,IBM Cloud Instance Group for Autoscaling.
- Minimises manual overhead for distributing workloads.
Replicate Data Synchronously
- For databases, consider regionally resilient services:
  - AWS RDS Multi-AZ
  - Azure SQL Zone Redundancy
  - GCP Cloud SQL HA
  - OCI Data Guard in Multi-AD Mode
  - IBM Cloud: for PostgreSQL, for Cloudant, for MySQL & for Cloud Databases
- Ensures quick failover if one Availability Zone (AZ) fails.
Set AZ-Aware Networking
- Deploy separate subnets or load balancers for each Availability Zone (AZ) so traffic automatically reroutes upon an AZ failure:
- Ensures high availability and fault tolerance by distributing traffic across multiple AZs.
Regularly Test AZ Failover
- Induce a partial Availability Zone (AZ) outage or rely on “game days” to ensure applications properly degrade or failover:
  - Referencing NCSC guidance on vulnerability management.
- Ensures systems can handle unexpected disruptions effectively.
Monitor Cross-AZ Costs
- Some vendors charge for data transfer between AZs, so monitor usage with AWS Cost Explorer, Azure Cost Management, GCP Billing, OCI Cost Analysis, IBM Cloud Billing & IBM Cost Estimator.

By automatically spreading workloads, replicating data in multiple AZs, ensuring AZ-aware networking, regularly testing failover, and monitoring cross-AZ costs, you solidify your organisation’s resilience within a single region while controlling costs.

How to do better

Below are rapidly actionable improvements:

Automate Cross-Region Backups
- e.g., AWS S3 Cross-Region Replication, Azure Backup to another region, GCP Snapshot replication, OCI cross-region object replication.
- Minimises manual tasks and ensures consistent DR coverage.
Schedule Non-Production in Cheaper Regions
- If cost is a driver, shut down dev/test in off-peak times or run them in a region with lower rates:
  - Referencing your chosen vendor’s regional pricing page.
Establish a Basic DR Plan
- For the second region, define how you’d bring up minimal services if the primary region fails:
  - Referencing AWS CloudEndure, RDS cross-region read replicas, Azure Site Recovery, GCP DR solutions, OCI Disaster Recovery orchestration.
Regularly Test Failover
- Do partial or full DR exercises at least annually, ensuring data in the second region can spin up quickly.
- Referencing NIST SP 800-34 DR test recommendations or NCSC operational resilience playbooks.
Plan for Data Residency
- If using non-UK regions, confirm any legal constraints on data location, referencing GOV.UK data residency rules or relevant departmental guidelines.

By automating cross-region backups, offloading dev/test workloads where cost is lower, defining a minimal DR plan, regularly testing failover, and ensuring data residency compliance, you expand from a single-region approach to a modest but effective multi-region strategy.

How to do better

Below are rapidly actionable enhancements:

Sustainability-Driven Tools
- e.g., AWS Customer Carbon Footprint Tool, Azure Carbon Optimisation, GCP Carbon Footprint, OCI Carbon Footprint.
- Evaluate region choices for best environmental impact.
Implement Real-Time Cost & Perf Monitoring
- Track usage and cost by region daily or hourly.
- Referencing AWS Cost Explorer, Azure Cost Management, GCP Billing Alerts, OCI Cost Analysis.
Enable Multi-Region Data Sync
- If you shift workloads for HPC or AI tasks, ensure data is pre-replicated to the chosen region:
  - Referencing AWS S3 Cross-Region Replication, Azure Geo-Redundant Storage, GCP Multi-Region Storage, OCI cross-region replication.
Address Latency & End-User Performance
- For services with user-facing components, consider CDN edges, multi-region front-end load balancing, or local read replicas to ensure acceptable performance.
Document Region Swapping Procedures
- If you occasionally relocate entire workloads for cost or sustainability, define runbooks or scripts to manage DB replication, DNS updates, and environment spin-up.

By using sustainability calculators to choose greener regions, implementing real-time cost/performance checks, ensuring multi-region data readiness, managing user latency via CDNs or local replicas, and documenting region-swapping, you fully leverage each provider’s global footprint for cost and environmental benefits.

How to do better

Below are rapidly actionable methods to refine dynamic, cost-sustainable distribution:

Automate Workload Placement
- Tools like [AWS Spot Instance with EC2 Fleet, Azure Spot VMs with scale sets, GCP Preemptible VMs, OCI Preemptible Instances] or container orchestrators that factor region costs:
  - referencing vendor cost management APIs or third-party cost analytics.
Use Real-Time Carbon & Pricing Signals
- e.g., AWS Instance Metadata + carbon data, Azure carbon footprint metrics, GCP Carbon Footprint reports, OCI sustainability stats.
- Shift workloads to the region with the best real-time carbon intensity or lowest spot price.
Add Continual Governance
- Ensure no region usage violates data residency constraints or compliance:
  - referencing NCSC multi-region compliance advice or departmental data classification guidelines.
Embrace Chaos Engineering
- Regularly test failover or region-shifting events to ensure dynamic distribution can recover from partial region outages or surges:
  - Referencing NCSC guidance on chaos engineering or vendor solutions:
- These tools help simulate real-world disruptions, allowing you to observe system behavior and enhance resilience.
Integrate Advanced DevSecOps
- For each region shift, the pipeline or orchestrator re-checks security posture and cost thresholds in real time.

By automating workload placement with spot or preemptible instances, factoring real-time carbon and cost signals, applying continuous data residency checks, stress-testing region shifts with chaos engineering, and embedding advanced DevSecOps validations, you maintain a dynamic, cost-sustainable distribution model that meets the highest operational and environmental standards for UK public sector services.

Keep doing what you’re doing, and consider blogging about or opening pull requests to share how you handle multi-region distribution and operational management for cloud workloads. This information can help other UK public sector organisations adopt or improve similar approaches in alignment with NCSC, NIST, and GOV.UK best-practice guidance.

Data

How do you manage data storage and usage? [change your answer]

You did not answer this question.

How to do better

Here are rapidly actionable steps to establish foundational data management and reduce risks:

Identify and Tag All Existing Data Stores
- Start by running a quick inventory or “data discovery” across your cloud environment:
- Even if you only have partial naming standards, tag each discovered resource with “owner,” “purpose,” and “data type.” This immediately lowers the risk of data sprawl.
Establish Basic Data Handling Guidelines
- Document a short set of rules about where teams should store data, who can access it, and minimal security classification steps (e.g., “Use only these approved folders/buckets for OFFICIAL-SENSITIVE data”).
- Reference the Government Security Classification Policy (GSCP) or departmental guidelines to outline baseline compliance steps.
Enable Basic Monitoring and Access Controls
- Ensure you have simple controls in place:
- This helps prevent accidental public exposure or misconfigurations.
Educate Teams on Data Sensitivity
- Run short, targeted training or lunch-and-learns on recognising PII, official data, or other categories.
- Emphasize that storing data in an “unofficial” manner can violate data protection laws or hamper future compliance efforts.
Draft an Interim Data Policy
- Outline a simple, interim policy that sets initial standards for usage. For example:
  - "Always store sensitive data (OFFICIAL-SENSITIVE) in an encrypted bucket or database.
  - “Tag resources with project name, data owner, and data sensitivity level.”
- Having any policy is better than none, setting the stage for more formal governance.

By identifying your data storage resources, applying minimal security tagging, and sharing initial guidelines, you shift from ad hoc practices to a basic, more controlled environment. This foundation paves the way for adopting robust data governance tools and processes down the line.

How to do better

Below are rapidly actionable ways to improve upon team-based documentation:

Adopt Centralised Tagging/Labeling Policies
- Instead of each team inventing its own naming or classification, unify your approach:
- This fosters consistent data metadata across teams.
Introduce Lightweight Tools for Schema and Documentation
- Even if you can’t deploy a full data catalog, encourage teams to use a shared wiki or knowledge base that references cloud resources directly:
- This can evolve into a more formal data inventory.
Standardise on Security and Compliance Checklists
- Provide each team with a short checklist:
  - Data classification verified?
  - Encryption enabled?
  - Access controls (RBAC/IAM) aligned with sensitivity?
- Tools and references:
Schedule Quarterly or Semi-Annual Data Reviews
- Even if managed by each team, commit to an organisational cycle:
  - They update their data inventories, verify classification, and confirm no stale or untagged storage resources.
  - Summarise findings to central governance or a data protection officer for quick oversight.
Motivate with Quick Wins
- Share success stories: “Team X saved money by archiving old data after a manual review, or prevented a compliance risk by discovering unencrypted PII.”
- This fosters cultural buy-in and continuous improvement.

By implementing standardised tagging, shared documentation tools, and routine checklists, you enhance consistency and reduce errors. You’re also positioning yourself for the next maturity level, which often involves more automated scanning and classification across the organisation.

How to do better

To refine your “Inventoried and Classified Data” approach, apply these rapidly actionable enhancements:

Automate Scanning and Classification
- Supplement manual entries with scanning tools that detect PII, sensitive patterns, or regulated data:
- Regularly schedule these scans so new data is automatically classified.
Introduce Basic Lineage Tracing
- Even if partial, track how data flows from source to destination:
  - For instance, a CRM system exporting daily CSV to an S3 bucket for analytics, then into a data warehouse.
- Tools like:
- This practice enhances data traceability and supports compliance efforts.
Align with Legal & Policy Requirements
- Mark data sets with relevant regulations—UK GDPR, FOI, PCI-DSS, etc.
- Build retention policies that automatically archive or delete data when it meets disposal criteria:
Create a Single “Data Inventory” Dashboard
- Consolidate classification statuses in a simple dashboard or spreadsheet so data governance leads can track changes at a glance.
- If possible, generate monthly or quarterly “data classification health” reports.
Provide Self-Service Tools for Teams
- Offer them a quick way to see if their new dataset might include sensitive fields or which storage option is recommended for OFFICIAL-SENSITIVE data.
- Maintaining “responsible autonomy” fosters compliance while reducing central bottlenecks.

With scanning, lineage insights, policy-aligned retention, and better visibility, you not only maintain your inventory but move it toward a dynamic, living data map. This sets the stage for deeper data understanding and advanced catalog solutions.

How to do better

Below are rapidly actionable steps to deepen your data lineage and documentation:

Adopt or Expand a Data Catalog with Lineage Features
- Introduce or enhance tooling that can map data flows automatically or semi-automatically:
Create a Standard Operating Procedure for Lineage Updates
- Whenever a new data pipeline is created or an ETL job changes, staff must add or adjust lineage documentation.
- Ensure this ties into your DevOps or CI/CD process:
  - E.g., new code merges automatically trigger updates in Purview or Data Catalog.
Encourage Data Reuse and Collaboration
- With partial lineage, teams might still re-collect or duplicate data. Create incentives for them to discover existing data sets:
  - Host a monthly “Data Discovery Forum” or internal knowledge-sharing session.
  - Highlight “success stories” where reusing a known dataset saved time or reduced duplication.
Set Up Tiered Access Policies
- Understanding lineage helps define more granular access control. If you see that certain data flows from a core system to multiple departmental stores, you can apply consistent RBAC or attribute-based access control:
Integrate with Risk and Compliance Dashboards
- If you have a departmental risk register, link data classification/lineage issues into that.
- This ensures any changes or gaps in lineage are recognised as potential compliance or operational risks.

By systematically building out lineage features and embedding them in everyday workflows, you move closer to a truly integrated data environment. Over time, each dataset’s path through your infrastructure becomes transparent, boosting collaboration, reducing duplication, and easing regulatory compliance.

How to do better

Even at the highest maturity, here are actionable ways to refine:

Incorporate Real-Time or Streaming Data
- Expand your catalog’s scope to include real-time pipelines, e.g., streaming from IoT devices or sensor networks:
Add Automated Data Quality Rules and Alerts
- Configure threshold-based triggers that check data quality daily:
  - e.g., “If more than 5% of new rows fail validation, alert the data steward.”
- Some vendor-native tools or third-party solutions can embed these checks in your data pipeline or catalog.
Leverage AI/ML to Classify and Suggest Metadata
- Let machine learning simplify classification:
Integrate Catalog with Wider Public Sector Ecosystems
- If your data catalog can integrate with cross-government data registries or share metadata with partner organisations, you reduce duplication and improve interoperability. For instance:
  - Some local authorities or NHS trusts might share standardised definitions or GDS guidelines.
  - Tools or APIs that facilitate federation with external catalogs can open up broad data collaboration.
Continuously Evaluate Security, Access, and Usage
- Review who actually accesses data vs. who is authorised, adjusting policies based on usage patterns.
- If certain data sets see heavy usage from a new department, ensure lineage, classification, and approvals remain correct.

At this advanced level, your main goal is to keep your data catalog living, dynamic, and well-integrated with the rest of your technology stack and governance frameworks. By embracing new data sources, automating quality checks, leveraging ML classification, and ensuring interoperability across the UK public sector, you solidify your position as a model of data governance and strategic data management.

Keep doing what you’re doing, and consider publishing blog posts or internal case studies about your data governance journey. Submit pull requests to this guidance or relevant public sector repositories to share innovative approaches. By swapping best practices, we collectively improve data maturity, compliance, and service quality across the entire UK public sector.

What is your approach to data retention? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to strengthen your organisational policy awareness and transition toward more robust management:

Map Policy to Actual Cloud Storage
- Encourage each team to identify where their data resides and apply your organisation’s retention timeline:
- This ensures that the policy is not just known but also visible in cloud environments.
Implement Basic Lifecycle Rules for Key Data Types
- Even at an early stage, you can set simple time-based rules:
Offer Practical Guidelines
- Simplify your policy into short, scenario-based instructions. For instance:
  - “Project data that includes personal information must be kept for 2 years, then deleted.”
  - “No indefinite retention without approval from Data Protection Officer.”
- Make these guidelines easily accessible (intranet page, project templates).
Encourage Regular Self-Checks
- Have teams perform a quick “retention check” every quarter or release cycle to see if they are retaining any data beyond the policy.
- Tools like:
Align with Stakeholders
- Brief senior leadership, legal teams, and information governance officers on any proposed changes or automation.
- Gains their support by showing how these improvements reduce compliance risk and free up unnecessary storage costs.

By proactively mapping retention policies to actual data, implementing simple lifecycle rules, and guiding teams with clear, scenario-based instructions, you reinforce “Organisation-Level Policy Awareness” with tangible, enforceable practices.

How to do better

Below are rapidly actionable ways to ensure attestations translate to real adherence:

Incorporate Retention Audits into CI/CD
- Automate checks whenever a new data store is created or an environment is updated:
Spot-Check Attestations with Periodic Scans
- Randomly select a few projects each quarter to run data retention scans:
  - Compare declared retention schedules vs. actual lifecycle settings or creation dates.
  - Tools:
Centralise Retention Documentation
- Instead of scattered project documents, maintain a central registry or dashboard capturing:
  - Project name, data types, retention period, date of last attestation.
- Provide read access to compliance and governance staff, ensuring quick oversight.
Link Attestation to Funding or Approvals
- For large programmes, make data retention compliance a prerequisite for budget release or major go/no-go decisions.
- This creates a strong incentive to maintain correct lifecycle settings.
Short Mandatory Training
- Provide teams a bite-sized eLearning or workshop on how to configure retention in their chosen cloud environment.
- This ensures they know the practical steps needed, so attestation isn’t just paperwork.

By coupling attestation with actual configuration checks, spot audits, centralised documentation, and relevant training, you boost confidence that claims of compliance match reality.

How to do better

Below are rapidly actionable ways to strengthen your audit and review process:

Adopt Automated Compliance Dashboards
- Supplement periodic manual audits with near-real-time or daily checks:
- This ensures frequent visibility, not just at audit time.
Include Retention in Security Scans
- Many organisations focus on security misconfigurations but forget data retention. Integrate retention checks into:
- This ensures that retention policies are consistently enforced and monitored across your cloud environments.
Track Action Plans to Closure
- Use a centralised ticketing or workflow tool (e.g., Jira, ServiceNow) to capture audit findings, track remediation, and confirm sign-off.
- Tag each ticket with “Data Retention Issue” for easy reporting and trend analysis.
Publish Trends and Success Metrics
- Show leadership the quarterly or monthly improvement in compliance percentage.
- Celebrating zero major findings in a review cycle fosters a positive compliance culture and encourages teams to keep up the good work.
Integrate with Other Governance Reviews
- Data retention checks can be coupled with data security, privacy, or cost reviews.
- This holistic approach ensures teams address multiple dimensions of good data stewardship simultaneously.

By automating aspects of the review process, embedding retention checks into security tools, and systematically remediating findings, you evolve from static cyclical audits to a dynamic, ongoing compliance posture.

How to do better

Below are rapidly actionable ways to embed retention exceptions deeper into risk management:

Automate Exception Labelling and Monitoring
- When a project is granted an exception, label or tag the data with “Exception=Approved” or “RetentionOverride=Yes,” along with a reference ID:
Set Time-Bound Exceptions
- Rarely should exceptions be indefinite. Include an “exception end date” in your risk register.
- Use cloud scheduling or lifecycle policies to revisit after that date:
  - E.g., if an exception ends in 1 year, revert to normal retention automatically unless renewed.
Enhance Risk Register Integration
- Link risk items to your data inventory or data catalog so you can quickly see which resources are covered by the exception.
- Tools like ServiceNow, Jira, or custom risk management solutions can cross-reference cloud resource IDs or labels.
Reevaluate Exception Cases in Each Audit
- Incorporate exception checks into your regular data retention audits:
  - Confirm the exception is still valid and authorised.
  - If it’s no longer needed, remove it and revert to standard retention policies.
Leverage Encryption or Extra Security for Exceptions
- If data must be stored longer than usual, apply enhanced controls:

By systematically capturing exceptions as risks, labeling them in cloud resources, setting expiry dates, and ensuring periodic review, your exceptions process remains controlled rather than a loophole. This approach mitigates the dangers of indefinite data hoarding and supports robust risk governance.

How to do better

Even at the top maturity level, here are rapidly actionable ways to refine your automated enforcement:

Deepen Integration with Data Catalog
- Ensure your automated retention engine references data classification in your catalog:
Leverage Event-Driven Remediation
- Use serverless functions or automation to react instantly to non-compliant provisioning:
Expand to All Data Storage Services
- Beyond object storage, ensure automation covers databases, logs, and backups:
Adopt Predictive Monitoring for Storage Growth
Utilise Predictive Analytics for Data Growth and Anomaly Detection
- Employ predictive analytics to forecast data growth and identify anomalies when retention rules aren’t effective:
Continuously Update Policies for New Data Types
- As your department adopts new AI workloads, IoT sensor data, or unstructured media, confirm your automated retention tools can handle these new data flows.
- Keep stakeholder alignment: if legislation changes (e.g., new FOI or data privacy rules), swiftly update your policy-as-code approach.

By aligning your advanced automation with data classification, extending coverage to all storage services, and employing event-driven remediation, you maintain an agile, reliable data retention program that rapidly adapts to technology or policy shifts. This ensures your UK public sector organisation upholds compliance, minimises data sprawl, and demonstrates best-in-class stewardship of public data.

Keep doing what you’re doing, and consider documenting or blogging about your journey to automated data retention enforcement. Submit pull requests to this guidance or share your success stories with the broader UK public sector community to help others achieve similarly robust data retention practices.

Governance

How do you decide who handles the different aspects of cloud security? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond minimal consideration of shared responsibilities:

Identify Your Specific Obligations
- Review provider documentation on the shared responsibility model:
- Make a short list or matrix of tasks you must own (patching certain layers, data backups, encryption management, etc.) vs. what the vendor handles (infrastructure security, certain managed services).
Apply Basic Tagging for Ownership
- Use resource tags or labels to clarify who is responsible for tasks like patching, rotating encryption keys, or daily backups:
Conduct a Simple Risk Assessment
- Walk through a typical scenario (e.g., security incident or downtime) and identify who would act under the current arrangement.
- Document any gaps (e.g., “We assumed the vendor patches the OS, but it’s actually an IaaS solution so we must do it ourselves.”) and address them promptly.
Raise Awareness with a Short Internal Briefing
- Present the shared responsibility model in a simple slide deck or lunch-and-learn:
  - Emphasize how it differs from on-prem or typical outsourcing.
  - Show real examples of misconfigurations that occurred because teams weren’t aware of their portion of responsibility.
Involve Governance or Compliance Officers
- Ensure your information governance team or compliance officer sees the model. They can help flag missing responsibilities, especially around data protection (UK GDPR) or official classification levels.
- This can prevent future misunderstandings.

By clarifying essential tasks, assigning explicit ownership, and performing a quick risk assessment, you proactively plug the biggest gaps that come from ignoring the shared responsibility model.

How to do better

Here are rapidly actionable ways to convert basic awareness into structured alignment:

Develop a Clear Responsibilities Matrix
- Create a simple spreadsheet or diagram that outlines specific responsibilities for each service model (IaaS, PaaS, SaaS). For example:
  - “Networking configuration: Cloud vendor is responsible for physical network security; we handle firewall rules.”
  - “VM patching: We handle OS patches for IaaS; vendor handles it for managed PaaS.”
- Share this matrix with all relevant teams—developers, ops, security, compliance.
Embed Responsibility Checks in CI/CD
- Include reminders or tasks in your pipeline for whichever responsibilities your organisation must handle:
Set Up Basic Compliance Rules
- Use native policy or configuration tools to ensure teams don’t forget their portion of security:
Create a Minimum Standards Document
- Summarise “We do X, vendor does Y” in a concise, 1- or 2-page reference for new hires, project leads, or procurement teams.
- This helps each team swiftly verify if they’re meeting their obligations.
Schedule Regular (Bi-Annual) Awareness Sessions
- As new people join or existing staff shift roles, re-run an internal training on the shared responsibility model.
- This ensures knowledge doesn’t degrade over time.

By formalising the understanding into documented responsibilities, embedding checks in your workflows, and reinforcing compliance rules, you strengthen your posture beyond mere awareness and toward consistent application across teams.

How to do better

Below are rapidly actionable improvements to reinforce your informed decision-making:

Adopt a “Responsibility Checklist” in Every Project Kickoff
- Expand your architecture or project initiation checklist to include:
  - Security responsibilities (e.g., OS patching, identity management).
  - Data responsibilities (e.g., encryption key ownership, backups).
  - Operational responsibilities (e.g., scaling, monitoring, incident response).
- Tools and References:
Integrate with Governance Boards or Change Advisory Boards (CAB)
- Whenever a major cloud solution is proposed, the governance board ensures the shared responsibility breakdown is explicit.
- This formal gate fosters consistent compliance with your model.
Track “Responsibility Gaps” in Risk Registers
- If you discover any mismatch—like you thought the vendor handled container OS patching, but it’s actually your job—log it in your risk register until resolved.
- This encourages a quick fix and ensures no gap remains unaddressed.
Conduct Periodic “Mock Incident” Exercises
- For key services, run a tabletop exercise or test scenario: e.g., a severe OS vulnerability or unexpected data leak.
- Evaluate how well the team knows who must patch or respond. Document lessons learned to refine your decision-making process.
Refine Cost Transparency
- Show how responsibilities can affect cost:
  - If you’re using a fully managed database, you pay a premium but shift more patching or upgrades to the vendor.
  - If you choose IaaS, you do more patching but may see lower direct service charges.
- Provide a quick cost/responsibility matrix so teams can weigh these trade-offs effectively.

By embedding the model into architecture reviews, governance boards, risk tracking, and cost analysis, you ensure each cloud decision is well-informed and widely understood across the organisation.

How to do better

Here are rapidly actionable ideas to refine your strategic integration:

Formalise a “Shared Responsibility Roadmap”
- Outline how your responsibilities may shift as you adopt new services or modernise apps:
  - E.g., “We plan to transition from self-managed DB to a fully managed service, shifting patching to the vendor by Q4.”
- Maintain an updated doc or wiki, shared with vendor account managers if relevant.
Implement Joint Incident-Response Protocols
- For critical workloads, define a response plan that involves both your team and the vendor:
- This ensures everyone knows their role if an incident arises—no confusion about who must take the first steps.
Regular Joint Reviews of SLAs and MoUs
- MoU (Memorandum of Understanding) or contracts can explicitly reference responsibilities.
- Revisit them at least annually to confirm they remain relevant, especially if the vendor introduces new features or if you adopt new compliance frameworks.
Quantify Responsibility Impacts on Cost and Resource
- Evaluate how shifting responsibilities (e.g., from IaaS to PaaS) reduces your operational overhead or risk while potentially increasing subscription fees.
- This cost-benefit analysis should guide strategic decisions about which responsibilities to keep in house.
Publish Internal Case Studies
- Showcase a project that integrated the shared responsibility model successfully, explaining how it prevented major incidents or streamlined compliance.
- This inspires other teams to replicate the approach.

By systematically planning your responsibilities roadmap, establishing joint incident protocols, and performing regular SLA reviews, you embed the shared responsibility model at the heart of your strategic cloud partnerships.

How to do better

Even at the pinnacle, there are actionable strategies to maintain and refine:

Incorporate Real-Time Observability of Shared Responsibilities
- Extend your monitoring dashboards to highlight any newly provisioned resources that don’t align with known responsibilities or best practices:
Conduct Regular Cost-Benefit Re-Evaluations
- At least quarterly, re-check if shifting more responsibilities to vendor-managed solutions or retaining them in house remains the best approach:
  - Some tasks might become cheaper or more secure if the vendor has introduced an improved managed feature or a new region with stronger compliance credentials.
- Document these findings for leadership to see the ROI of the chosen approach.
Shape Best Practices Across the Public Sector
- Share your advanced model with partner agencies, local councils, or central government departments.
- Contribute to cross-government playbooks on cloud adoption, showing how the shared responsibility model fosters better outcomes.
Combine Shared Responsibility Insights with Ongoing Cloud Transformation
- If you’re running modernisation or digital transformation programs, embed the shared responsibility model into new microservices, container deployments, or serverless expansions.
- Constantly question: “Where does the boundary lie, and is it cost-effective or compliance-aligned to shift it?”
Prepare for Regulatory Changes
- Monitor updates to UK data protection laws, the National Cyber Security Centre (NCSC) guidelines, or changes in vendor compliance certifications.
- Adjust responsibilities quickly if new standards require a different approach (e.g., more encryption or different backup retention mandated by a new policy).

By ensuring real-time observability, frequent cost-benefit checks, sector-wide collaboration, and a readiness to pivot for regulatory shifts, you sustain a robust, adaptive shared responsibility model at the core of your cloud usage. This cements your organisation’s position as a leader in cost-effective, secure, and compliant public sector cloud adoption.

Keep doing what you’re doing, and consider sharing blog posts, case studies, or internal knowledge base articles on how your organisation integrates the shared responsibility model into cloud governance. Submit pull requests to this guidance or similar public sector best-practice repositories to help others learn from your success.

How do you manage and store build artefacts (files created when building software)? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move away from ad-hoc methods:

Introduce a Basic CI/CD Pipeline
- Even a minimal pipeline can automatically build code from a version control system:
Ensure Everything Is in Version Control
- Do not edit code or configurations directly on servers. Instead:
Create a Shared Storage for Build Outputs
- Set up a simple “build artifacts” bucket or file share for your compiled binaries or container images:
Document Basic Rollback Steps
- At a minimum, define how to revert a server or application if a live edit breaks something:
  - Write a short rollback procedure referencing the last known working code in version control.
- This ensures you’re not stuck with manual edits you can’t undo.
Educate the Team
- Explain the risks of live server edits in short training sessions:
  - Potential compliance violations if changes are not auditable.
  - Difficulty diagnosing or rolling back production issues.

By adopting minimal CI/CD, storing artifacts in a shared location, and referencing everything in version control, you reduce chaos and set a foundation for more robust artifact management.

How to do better

Below are rapidly actionable strategies:

Centralise Your Build Once
- Shift to a pipeline that builds the artifact once, then deploys the same artifact to dev, test, and production. For instance:
Define a Consistent Build Container
- If you want complete reproducibility:
  - Use a Docker image as your build environment (e.g., pinned versions of compilers, frameworks).
  - Keep that Docker image in your artifact registry so each new build uses the same environment.
Implement Version or Commit Hash Tagging
- Tag the artifact with a version or Git commit hash. Each environment references the same exact build (like “my-service:build-1234”).
- This eliminates guesswork about which code made it to production vs. test.
Apply Simple Promotion Strategies
- Instead of rebuilding, you “promote” the tested artifact from dev to test to production:
  - Mark the artifact as “passed QA tests” or “passed security scan,” so you have a clear chain of trust.
- This approach improves reliability and shortens lead times.
Create Basic Documentation
- Summarise the difference between “build once, deploy many” and “build in each environment.”
- Show management how consistent builds reduce risk and effort.

By consolidating the build process, storing a single artifact per version, and promoting that same artifact across environments, you achieve consistency and reduce the risk of environment drift.

How to do better

Here are rapidly actionable enhancements:

Adopt Write-Once-Read-Many (WORM) or Immutable Storage
- Many cloud vendors offer immutable or tamper-resistant storage:
Set Up Access Controls and Auditing
- Restrict who can modify or delete artifacts. Log all changes:
Enforce In-House or Managed Build Numbering Standards
- Decide how you version artifacts (e.g., semver, build number, git commit) to ensure consistent tracking across repos.
- This practice reduces confusion when dev/test teams talk about a specific build.
Extend to Container Images or Package Repositories
- If you produce Docker images or library packages (NuGet, npm, etc.), store them in:
Introduce Minimal Integrity Checks
- Even if you don’t have full cryptographic signatures, consider generating checksums (e.g., SHA256) for each artifact to detect accidental corruption.

By using immutable storage, controlling access, and standardising versioning, you strengthen artifact reliability and traceability without overwhelming your current processes.

How to do better

Below are rapidly actionable improvements:

Leverage Vendor Tools for Dependency Scanning
- Integrate automatic scanning to confirm pinned versions match known secure states:
Sign Your Artifacts
- Use code signing or digital signatures:
Adopt a “Bill of Materials” (SBOM)
- Generate a Software Bill of Materials for each build, listing all dependencies and their checksums:
  - This clarifies exactly which libraries or frameworks were used, crucial for quick vulnerability response.
Enforce Minimal Versions or Patch Levels
- If a library has a known CVE, your pipeline rejects builds that rely on that version.
- This ensures you don’t accidentally revert to vulnerable dependencies.
Combine with Immutable Storage
- If you haven’t already, store these pinned, verified artifacts in a write-once or strongly controlled location.
- This ensures no tampering after you sign or hash them.

By scanning for vulnerabilities, signing artifacts, using SBOMs, and enforcing patch-level policies, you secure your supply chain and provide strong assurance of artifact integrity.

How to do better

Even at this pinnacle, there are actionable ways to refine:

Automate Artifact Verification on Deployment
- For example:
Embed Forensic Analysis Hooks
- Provide metadata in logs (e.g., commit hashes, SBOM references) so if an incident occurs, security teams can quickly retrieve the relevant artifact.
- This reduces incident response time.
Regularly Test Restoration Scenarios
- Conduct a “forensic reenactment” once or twice a year:
  - Attempt to reconstruct an environment from your stored artifacts.
  - Check if you can seamlessly spin up an older version with pinned dependencies and configurations.
- This ensures the system works under real conditions, not just theory.
Apply Multi-Factor Access Control
- Protect your signing keys or artifact storage with strong MFA and hardware security modules (HSMs) if needed:
Participate in Industry or Government Communities
- As you lead in artifact management maturity, share best practices with other public sector bodies or cross-governmental security groups.
- Encourage consistent auditing and artifact immutability standards across local councils, departmental agencies, or NHS trusts.

By verifying artifacts on each deployment, maintaining robust forensic readiness, testing restoration scenarios, and securing signing keys with HSMs or advanced controls, you perpetually refine your processes. This ensures unwavering trust and compliance in your build pipeline, even under rigorous UK public sector scrutiny.

Keep doing what you’re doing, and consider sharing case studies or best-practice guides. Submit pull requests to this guidance or other UK public sector repositories to help others learn from your advanced artifact management journey.

How do you manage and update access policies, and how do you tell people about changes? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move away from ad-hoc management:

Begin a Simple Policy Definition
- Draft a one-page document outlining baseline access rules (e.g., “Least privilege,” “Need to know”).
- Reference relevant UK government guidance on access controls or consult your departmental policy docs.
Centralise Identity and Access
- Instead of manual account creation or server-based user lists, consider cloud-native identity solutions:
Record Exemptions in a Simple Tracker
- If you must grant an ad-hoc exception, log it in a basic spreadsheet or ticket system:
  - Who was granted the exception?
  - Why?
  - When will it be revisited or revoked?
Define at Least One “Review Step”
- If someone wants new permissions, ensure a second person or a small group must approve the request.
- This adds minimal overhead but prevents hasty over-permissioning.
Communicate the New Basic Policy
- Email a short notice to your team, or host a 15-minute briefing.
- Emphasize that all new access requests must align with the minimal policy.

By introducing a baseline policy, centralising identity management, tracking exceptions, and implementing a simple approval step, you achieve immediate improvements and lay the groundwork for more robust policy governance.

How to do better

Here are rapidly actionable enhancements:

Schedule Regular Policy Updates
- Commit to revisiting policies at least annually or semi-annually, and each time there’s a major change (e.g., new compliance requirement).
- Add a reminder to your calendar or project board for a policy review session.
Establish a Basic Change Log
- Store the policy in version control (e.g., GitHub or an internal repo). Each update is a commit, so you have a clear history:
Use Consistent Communication Channels
- If you have an organisational Slack, Teams, or intranet, create a #policy-updates channel (or equivalent) to announce changes.
- Summarise the key differences in plain language.
Apply or Update an RBAC Model
- For each system, define roles that map to policy privileges:
Create a Briefing Deck
- Summarise the policy in fewer than 10 slides or 1–2 pages, so teams quickly grasp their obligations.
- Present it in your next all-hands or departmental meeting.

By versioning your policy documents, scheduling updates, and communicating changes through consistent channels, you ensure staff remain aligned with the policy’s intent and scope, even as it evolves.

How to do better

Below are rapidly actionable ways to refine:

Introduce a “Policy Advisory Group”
- Involve representatives from different teams (security, compliance, operations, major app owners).
- They review proposed changes before final approval, fostering collaboration and broader buy-in.
Leverage Automated Policy Tools
- Integrate policy definitions or changes with your cloud environment:
Conduct Impact Assessments
- Each time a policy update is proposed, share an “impact summary” so teams know if they must adjust access roles, add new logging, or change their workflows.
Record Meeting Minutes or Summaries
- Publish a short summary of each policy review session.
- This allows staff who couldn’t attend to remain informed and fosters more transparency.
Add a Feedback Loop
- Let staff submit policy improvement suggestions via an online form or an email address.
- Review these suggestions in each policy cycle, acknowledging them in announcements.

By establishing a policy advisory group, using automated enforcement, sharing impact assessments, and keeping transparent documentation, you enhance collaboration and understanding around policy changes.

How to do better

Below are rapidly actionable strategies:

Use Version Control for Policy and Automated Testing
- Host policy definitions (or partial automation code) in a Git repository:
- This fosters transparency, and each stakeholder can see exactly how changes are being deployed.
Schedule Interactive Workshops
- Quarterly or monthly policy workshops enable direct Q&A and early feedback on proposed changes, preventing surprises.
Implement a Self-Service Portal or Dashboard
- Provide a simple interface where teams can request new access or see current policy constraints. For instance:
Link Policy Changes to Organisational Goals
- For each update, clearly state how it supports:
  - Security improvements (reducing potential data breaches).
  - Compliance with UK data protection or government classification requirements.
  - Operational efficiency or cost savings.
Establish Basic Metrics
- E.g., measure “time to complete a policy change,” “number of exemptions requested,” or “incident rate attributed to policy confusion.”
- Track these to demonstrate improvements over time.

By versioning policy code, conducting interactive workshops, providing self-service dashboards, and linking changes to tangible organisational goals, you reinforce a collaborative, integrated policy management culture.

How to do better

Below are rapidly actionable improvements, even at the highest level:

Adopt Advanced Policy Testing Frameworks
- For instance:
Create a Sandbox for Policy Experiments
- Let staff propose changes in a “policy staging environment” or a set of test subscriptions/accounts/folders.
- Automatic validation ensures no harmful or contradictory rules get merged into production.
Automate Documentation Generation
- Convert policy-as-code comments into readable documentation so staff see both the code logic and a plain-language explanation:
- This fosters transparency, and each stakeholder can see exactly how changes are being deployed.
Extend Collaboration to Partner Agencies
- If you work closely with other local authorities or health boards, consider sharing a portion of your policy code or best practices across organisations.
- This fosters consistency and accelerates policy alignment.
Perform Periodic “Policy Drills”
- Similar to security incident drills, you can test large policy changes:
  - E.g., “We propose removing direct SSH access to all VMs” or “We require multi-factor authentication for every console user.”
- Observe the process of review, merging, and rollout—this ensures your pipeline works under pressure.

By integrating advanced testing frameworks, using a sandbox approach, automating documentation, and sharing with partner agencies, you keep your policy-as-code approach dynamic and continuously improving, setting a standard for robust and transparent governance in the UK public sector.

Keep doing what you’re doing, and consider writing blog posts or internal knowledge base articles on your policy management journey. Submit pull requests to this guidance or similar public sector best-practice repositories to help others learn from your successful practices.

How do you manage your cloud environment? [change your answer]

You did not answer this question.

How to do better

Runbooks and Playbooks

Create Minimal Runbooks/Playbooks
- Document step-by-step procedures for essential tasks (e.g., adding an instance, rotating keys).
- Referencing:
Ensure Accessibility & Security
- Store runbooks in a version-controlled repository (e.g., GitHub, GitLab).
- Avoid passwords or secrets in the documentation, referencing NCSC guidelines on secure storage of credentials.
Enforce Update Discipline
- Each time an admin modifies the environment, they must update the runbook.
- Prevents drift where docs become irrelevant or untrusted.

Change Logs and Audit Logs

Enable Cloud Provider Audit Logging
- e.g., AWS CloudTrail for AWS, Azure Activity Logs, GCP Audit Logs, OCI Audit Service, IBM Cloud Logs.
- Familiarise yourself with how to query logs and set retention.
Capture the “Why”
- Maintain a short change log to record the rationale behind config changes:
  - Possibly a central wiki or a simple Slack channel for “cloud change announcements.”
Plan Next Steps
- Use these logs to identify repetitive tasks or areas ripe for automation in the near future.

By documenting runbooks/playbooks, ensuring logs are enabled and accessible, capturing rationale behind changes, and frequently updating your documentation, you reduce the risks tied to manual “click-ops” while preparing the groundwork for partial or full automation.

How to do better

Below are rapidly actionable improvements:

Use Scripting for Repetitive Tasks
- Even if you remain “click-ops” at large, certain steps can be scripted:
  - e.g., AWS CLI or PowerShell scripts, Azure CLI, GCP CLI, OCI CLI.
- Minimises errors between test and production.
Track Environment Differences
- For each environment, note variations (like instance sizes, domain naming).
- referencing NCSC guidance on environment segregation or NIST environment management best practices.
Add Post-Deployment Verification
- After each manual deployment, run a checklist or small script that verifies key resources are correct.
Plan a Shift to Infrastructure-as-Code
- Over the next 3–6 months, adopt IaC for at least one main service:
  - e.g., AWS CloudFormation, Azure Bicep or ARM templates, GCP Deployment Manager, OCI Resource Manager, HashiCorp Terraform/Ansible.
Initiate Basic Drift Detection
- Tools like AWS Config, Azure Resource Graph, GCP Config Controller, or OCI Resource Discovery can highlight differences across environments or changes made outside your runbooks.

By partially automating recurring tasks, carefully recording environment discrepancies, verifying deployments, piloting Infrastructure-as-Code, and implementing drift checks, you mitigate errors and pave the way for more complete automation.

How to do better

Below are rapidly actionable ways to evolve from partial scripting:

Expand Scripting to Complex Tasks
- Tackle the next biggest source of manual changes—e.g., managing load balancer rules, rotating credentials, or updating complex network rules.
- referencing AWS CLI scripts, Azure CLI or PowerShell, GCP CLI, OCI CLI.
Adopt an IaC Framework
- Convert major scripts into Terraform, AWS CloudFormation, Azure Bicep/ARM, GCP Deployment Manager, OCI Resource Manager templates for more uniform deployment.
Introduce Basic CI/CD
- If you have a central Git repo for scripts, integrate them with AWS CodePipeline, Azure DevOps, GCP Cloud Build, OCI DevOps pipeline for consistent application across dev/test/prod.
Set up a “Review & Approve” Process
- For complex tasks, code changes in scripts or IaC are peer-reviewed before deployment:
  - referencing NCSC guidance on secure code reviews or NIST secure development frameworks.
Leverage Cloud Vendor Tools
- e.g., AWS Systems Manager Automation runbooks, Azure Automation runbooks, GCP Workflows, OCI Automation and Orchestration to handle advanced tasks with minimal manual input.

By incrementally automating complex changes, standardising on an IaC framework, establishing a basic CI/CD workflow, ensuring code reviews, and utilising vendor orchestration tools, you reduce your reliance on manual interventions and strengthen cloud environment consistency.

How to do better

Below are rapidly actionable ways to refine a highly automated approach:

Implement Automatic Drift Remediation
- If changes are made outside your IaC pipeline, the system automatically reverts them or alerts the team:
  - e.g., AWS Config auto-remediation, Azure Policy with remediation tasks, GCP Config Controller with policy-based management, OCI drift detection & auto-correction.
Incorporate Policy-as-Code
- Tools like Open Policy Agent, AWS SCP, Azure Policy, GCP Org Policy, OCI Security Zones define governance rules in code, preventing non-compliant configs from deploying.
Extend DevSecOps Tooling
- e.g., scanning IaC templates for security issues, verifying recommended best practices in each pipeline step:
  - referencing NCSC’s secure developer guidelines or NIST SP 800-53 R5 for secure configurations.
Perform Regular Architecture Reviews
- With a high level of automation, a small monthly or quarterly session can keep IaC templates up to date with new cloud features or cost optimisation.
Foster Cross-Department Knowledge Sharing
- If relevant, coordinate with other public sector orgs to share automation scripts or IaC modules:
  - referencing GOV.UK cross-department knowledge sharing guidance.

By enabling automatic drift remediation, implementing policy-as-code, enhancing DevSecOps pipeline checks, conducting periodic architecture reviews, and collaborating across agencies, you refine a strong foundation of standardised, highly automated processes for cloud management.

How to do better

Below are rapidly actionable methods to maximise a fully declarative, drift-detecting environment:

Integrate Real-Time Security & Cost Checks
- Each code merge triggers scanning for known misconfigurations or cost anomalies:
  - e.g., Terraform Sentinel policies, AWS CFN Guard, Azure Bicep policy enforcement, GCP Config Controller, OCI Security Zones with policy checks.
Adopt Multi-Cloud or Hybrid Templates
- If you operate across multiple clouds or on-prem, unify definitions in a single code base:
  - referencing HashiCorp Terraform, Pulumi, Crossplane with GCP, AWS, Azure, OCI providers, or a consistent multi-cloud approach.
Enhance Observability
- Introduce robust logging and distributed tracing for infrastructure-level events:
  - e.g., correlating IaC changes with performance or cost trends in AWS CloudWatch, Azure Monitor, GCP Operations Suite, OCI Observability and Management.
Foster a Culture of Peer Reviews
- For each IaC or pipeline update, encourage a thorough peer review:
  - referencing NCSC secure code review or NIST SP 800-160 suggestions for code-based reviews.
Pursue Cross-Government Collaboration
- If possible, share or open-source reusable modules or templates:
  - referencing GOV.UK guidance on open source, NCSC supply chain security for code reuse across departments.

By adding real-time security and cost checks in your pipeline, adopting multi-cloud/hybrid IaC, enhancing observability, promoting peer reviews, and collaborating with other UK public sector bodies, you reinforce an already advanced, fully declarative environment with robust drift detection—ensuring secure, consistent, and efficient cloud management.

Keep doing what you’re doing, and consider publishing blog posts or making pull requests to share your approach to fully automated, code-based cloud management with drift detection. This knowledge can help other UK public sector organisations replicate your success under NCSC, NIST, and GOV.UK best-practice guidelines.

How do you apply and enforce policies? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to start applying policies:

Define a Minimal Baseline Policy
- Begin by stating basic governance guidelines (e.g., “All user accounts must have multi-factor authentication,” “All data must be encrypted at rest”).
- Publish this in a short doc or wiki, referencing relevant UK public sector best practices.
Identify a Small Pilot Use Case
- Pick a single area (e.g., identity and access management) to apply a simple policy.
- For instance:
Communicate the Policy
- Alert your team that from now on, they must follow this minimal policy.
- Provide quick references or instructions in your Slack/Teams channel or an intranet page.
Log Exceptions
- If someone must deviate from the baseline (e.g., a short-term test needing an exception), track it in a simple spreadsheet or ticket system.
- This fosters accountability and sets the stage for incremental improvement.

By taking these initial steps—defining a baseline policy, piloting it, and communicating expectations—you move from “no policy application” toward a more controlled environment.

How to do better

Below are rapidly actionable ways to start enforcing existing policies:

Adopt Basic Monitoring or Reporting
- Use native cloud governance tools to see if resources match policy guidelines:
Automate Alerts for Major Breaches
- If a policy states “No public buckets,” set an alert that triggers if a bucket becomes public:
Introduce Basic Consequence Management
- If a policy is violated, require the team to fill out an exception form or gain approval from a manager.
- This ensures staff think twice before ignoring policy.
Incrementally Expand Enforcement
- Start with “auditing mode,” then gradually move to “deny mode.” For example:
  - In AWS, use Service Control Policies or AWS Config rules in “detect-only” mode first, then enforce.
  - In Azure, run Azure Policy in “audit” effect, then shift to “deny” once comfortable.
  - GCP or OCI similarly allow rules to initially only log and then eventually block non-compliant actions.

By automating policy checks, alerting on critical breaches, and phasing in enforcement, you build momentum toward consistent compliance without overwhelming teams.

How to do better

Below are rapidly actionable ways to enhance process-driven application:

Introduce Lightweight Technical Automation
- Even if processes remain the backbone, add a few checks:
Use a Single Source of Truth
- Store policy documentation and forms in a single location (e.g., SharePoint, Confluence, or an internal Git repo).
- This avoids confusion about which version of the process to follow.
Add a “Policy Gate” to Ticketing Systems
- For example, in ServiceNow or Jira:
  - A ticket for provisioning a new VM or network must pass a “policy gate” status, requiring sign-off from a compliance or security person referencing your standard steps.
Measure Process Efficiency
- Track how long it takes to apply each policy step. Identify bottlenecks or missed checks.
- This helps you see where minimal automation or additional staff training could cut manual overhead.
Conduct Periodic Spot Audits
- Check a random subset of completed tickets or new resources to ensure every policy step was genuinely followed, not just ticked off.
- Publicise the outcomes so staff remain vigilant.

By introducing minor automation, centralising policy references, adding a policy gate in ticketing, and auditing process compliance, you blend the reliability of your current manual approach with the efficiency gains of technical enablers.

How to do better

Below are rapidly actionable ways to reinforce or extend your existing setup:

Expand Technical Enforcement
- Implement more “deny by default” mechanisms:
Integrate Observability and Alerting
- Use real-time or near-real-time monitoring to detect policy breaches quickly:
Adopt “Immutability” or “Infrastructure as Code”
- If possible, define infrastructure states in code. Your policy steps can be embedded:
Push for More Cross-Team Training
- Ensure DevOps, security, and compliance teams understand how to interpret automated policy checks.
- This fosters a shared sense of ownership and makes the half-automated approach more effective.
Set Up a Policy Remediation or “Self-Healing” Mechanism
- Where feasible, let your system automatically fix minor compliance drifts:
  - e.g., If a bucket is created public by mistake, the system reverts it to private and notifies the user.

By strengthening technical guardrails, improving alerting, and embedding your policies deeper into IaC, you evolve your limited technical controls into a more comprehensive and proactive enforcement model.

How to do better

Below are rapidly actionable refinements, even at the highest maturity:

Adopt Policy-as-Code with Automated Testing
- Store policy definitions in version control, run them through pipeline tests:
Enable Dynamic, Real-Time Adjustments
- Some advanced organisations adopt “adaptive policies” that can respond automatically to shifting risk contexts:
  - e.g., Requiring step-up authentication or extra scanning if abnormal usage patterns appear.
Analytics and Reporting on Policy Efficacy
- Track metrics like “time to resolve policy violations,” “number of exceptions requested per quarter,” or “percentage of resources in compliance.”
- Present these metrics to leadership for data-driven improvements.
Cross-department Collaboration
- If you share data or resources with other public sector agencies, coordinate policy definitions or enforcement bridging solutions.
- This ensures consistent governance and security across multi-department projects.
Regularly Test Failover or Incident Response
- Conduct simulation exercises to confirm that policy enforcement remains intact during partial outages or security incidents.
- Evaluate whether the automated controls effectively protect resources and whether manual overrides are restricted or well-logged.

By implementing policy-as-code with automated testing, adopting dynamic enforcement, collecting analytics on compliance, and performing cross-department or incident drills, you ensure your integrated model remains agile and robust—setting a high benchmark for public sector governance.

Keep doing what you’re doing, and consider writing internal blog posts or external case studies about your policy enforcement journey. Submit pull requests to this guidance or related public sector best-practice repositories so others can learn from your advanced application and enforcement strategies.

How do you use version control and branch strategies? [change your answer]

You did not answer this question.

How do I do better?

Below are rapidly actionable next steps:

Pick a Git-based Platform
- e.g., GitHub, GitLab, Bitbucket, or a cloud vendor’s service (AWS CodeCommit, Azure Repos, GCP Source Repos, OCI DevOps Repos).
- Start by simply pushing your code there.
Require Commits for Every Change
- Prohibit direct edits on production servers or local code without commits.
- referencing NCSC best practices for code repository usage.
Document Basic Workflow
- A short doc stating each developer must commit changes daily or at key milestones, helps trace history.
- referencing GOV.UK guide: “How GDS uses Git and GitHub”.
Tag Notable Versions
- If something is “ready for release,” apply a Git tag or version (e.g., v1.0).
- Minimises guesswork about which commit correlates to a live environment.
Plan for Future Branching Strategy
- Over 3–6 months, adopt a recognised model (e.g., GitHub Flow or trunk-based) to handle multiple contributors or features.

By using a modern Git-based platform, ensuring all changes result in commits, documenting a minimal workflow, tagging key releases, and scheduling a shift to a recognised branching strategy, you quickly move from minimal version control to a more robust approach that supports collaboration and security needs.

How do I do better?

Below are rapidly actionable methods to move from a custom approach to a standard one:

Map Existing Branching to a Known Strategy
- Compare your custom steps to recognised flows like GitFlow, GitHub Flow, trunk-based, or Azure DevOps typical branching.
- Identify similarities or differences.
Document a Cross-Reference
- If you choose GitFlow, rename your custom “hotfix” or “dev” branches to align with standard naming, making it easier for new joiners.
Simplify Where Possible
- Some custom strategies overcomplicate merges. Consolidate or reduce the number of long-lived branches to avoid confusion.
Provide a Quick “Cheatsheet”
- e.g., a short wiki page or PDF explaining how to handle new feature branches, bug fixes, or emergency patches:
  - referencing GOV.UK or departmental dev guidelines on version control naming conventions.
Pilot a Standard Flow on a New Project
- In parallel, adopt a recognised model (e.g., GitHub Flow) on a small project to gain team familiarity before rolling it out more widely.

By comparing your custom model to standard flows, documenting a cross-reference, simplifying branch use, providing a quick reference, and trialing a standard approach on a new project, you reduce complexity and align with recognised best practices.

How do I do better?

Below are rapidly actionable improvements:

Document the Adaptations
- Clarify how your version of GitFlow or trunk-based differs from the original.
- Minimises onboarding confusion.
Regularly Revisit Branch Usage
- If certain branches (like “hotfix”) see little use, consider simplifying them out of the process:
  - referencing NCSC guidance on agile software iteration and trunk-based dev for speed and clarity.
Incorporate CI/CD Automation
- Whenever a new branch is pushed, run automated tests or security scans:
  - e.g., AWS CodePipeline, CodeBuild triggers, Azure DevOps pipelines, GCP Cloud Build triggers, OCI DevOps pipelines.
Train New Team Members
- Provide short “branch strategy 101” sessions, referencing well-known Git tutorials.
- referencing GOV.UK “How GDS uses Git” or NCSC’s developer resource library.
Simplify for Next Project
- If you find the strategy too complex for frequent releases, consider trunk-based or GitHub Flow on your next new service or microservice.

By documenting your adaptations clearly, removing unused branches, adding CI/CD hooks for every branch commit, onboarding new developers, and evaluating simpler flows for future projects, you ensure your branch strategy remains practical and efficient.

How do I do better?

Below are rapidly actionable ways to optimise a textbook GitFlow-like approach:

Apply Automated Merges/Sync
- Tools that automatically keep “develop” and “main” in sync after merges reduce manual merges or missed fixes:
  - referencing GitHub Actions, Azure DevOps auto-merge solutions, AWS CodeBuild-based auto merges, GCP Cloud Build triggers.
Monitor Branch Sprawl
- Limit the number of concurrent “release” branches.
- If dev sees multiple releases with cross-pollination, consider if a simpler model might be more agile.
Include Security Checks per Branch
- For each “hotfix” or “feature” branch, run security scans (SAST/DAST):
  - referencing NCSC DevSecOps scanning tools or NIST SP 800-53 for secure code checks.
Document Rarely Used Branches
- If your GitFlow includes “hotfix” or “maintenance” branches rarely used, confirm usage patterns or retire them for simplicity.
Evaluate Branch Strategy Periodically
- Every 6–12 months, revisit whether GitFlow remains necessary or trunk-based dev might serve better for speed.

By automating merges, controlling branch sprawl, embedding security checks into every branch, documenting rarely used branches, and regularly re-evaluating your overall branching structure, you keep your textbook GitFlow or similar approach practical and effective.

How do I do better?

Below are rapidly actionable ways to refine a minimal branch strategy:

Expand Test Coverage
- Ensure automated tests (unit, integration, security scans) run on every PR or push to main:
  - referencing AWS CodeBuild with check runs, Azure DevOps build pipelines, GCP Cloud Build triggers, OCI DevOps test stages.
Establish Feature Flags
- If new code is not fully ready for users, hide it behind toggles:
  - referencing LaunchDarkly, Azure App Configuration with feature flags, AWS AppConfig, GCP config toggles, OCI config solutions.
Enforce Peer Review
- Before merging to main, at least one peer or senior dev reviews the PR, referencing GOV.UK dev guidelines for code review best practices.
Set Real-Time Release Observability
- After merges, watch metrics and logs for anomalies. Roll back quickly if issues arise:
  - referencing AWS CloudWatch + CodeDeploy auto-rollback, Azure DevOps pipeline with canary checks, GCP Rolling updates with GKE, OCI deployment checks and auto-rollback.
Encourage Short-Lived Branches
- Keep branches open for days or less, not weeks, ensuring minimal drift from main and fewer merge conflicts.

By strengthening test coverage, leveraging feature flags, requiring peer reviews, observing real-time release metrics, and promoting short-lived branches, you optimise a streamlined approach that fosters continuous delivery and rapid iteration aligned with modern DevSecOps standards.

Keep doing what you’re doing, and consider sharing your version control and branching strategy successes through blog posts or contributing them as best practices. This helps other UK public sector organisations adopt effective workflows aligned with NCSC, NIST, and GOV.UK guidance for secure, efficient software development.

How do you provision cloud services? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond purely manual provisioning:

Start Capturing Configurations in Scripts
- Even if you rely on the portal/console, record steps in a lightweight script. For example:
Implement Basic Naming and Tagging Conventions
- Create a short doc listing agreed naming prefixes/suffixes and mandatory tags:
  - e.g., DepartmentName, Environment (Dev/Test/Prod), Owner tags.
- This fosters consistency and prepares for more advanced automation.
Add a Simple Approval Step
- If you’re used to provisioning without oversight, set up a minimal “approval check.”
- For instance, use a shared Slack or Teams channel where you post new resource requests, and a manager or security person acknowledges before provisioning.
Consider a Pilot with Infrastructure as Code (IaC)
- Select a small, low-risk environment to try:
Document Provisioning Steps
- Keep a simple runbook or wiki page. Summarise each manual provisioning step so you can easily shift these instructions into scripts or templates later.

By scripting basic tasks, implementing a simple naming/tagging policy, adding minimal approvals, and piloting an IaC solution, you start transitioning from ad hoc provisioning to more consistent automation practices.

How to do better

Below are rapidly actionable ways to standardise your provisioning scripts:

Adopt a Common Repository for Scripts
- Create an internal Git repo (e.g., on GitHub, GitLab, or a cloud-hosted repo) for all provisioning scripts:
  - AWS CodeCommit, Azure DevOps Repos, or GCP Source Repositories can also be used for version control
- Encourage teams to share and reuse scripts, aligning naming conventions and code structure.
Define Minimal Scripting Standards
- E.g., standard file naming, function naming, environment variable usage, or logging style.
- Keep it simple but ensure each team references the same baseline.
Use Infrastructure as Code Tools
- Instead of random scripts, consider a consistent IaC approach:
- Start with a small template, then expand as teams gain confidence.
Create a Shared Module or Template Library
- If multiple teams need similar infrastructure (e.g., a standard VPC, a typical storage bucket), store that logic in a common template or module:
  - e.g., Terraform modules in a shared Git repo or a private registry.
- This ensures consistent best practices are used across all projects.
Encourage Collaboration and Peer Reviews
- Have team members review each other’s scripts or templates in a code review process, catching mistakes and unifying approaches along the way.

By consolidating scripts in a shared repository, defining lightweight standards, introducing IaC tools, and fostering peer reviews, you gradually unify your provisioning process and reduce fragmentation.

How to do better

Below are rapidly actionable ways to expand your declarative automation:

Set Organisation-Wide IaC Defaults
- Decide on a primary IaC tool (Terraform, CloudFormation, Bicep, Deployment Manager, or others) and specify guidelines:
  - e.g., “All new infrastructure that goes to production must use Terraform for provisioning, with code in X repo.”
Create a Reference Architecture or Template
- Provide an example repository for a standard environment:
- Encourage teams to clone and adapt these templates.
Extend IaC Usage to Lower Environments
- Even for dev/test, use declarative templates so staff get comfortable and maintain consistency:
  - This ensures the same patterns scale up to production effortlessly.
Implement Automated Checks
- Use CI/CD pipelines to validate IaC templates before deployment:
Offer Incentives for Adoption
- e.g., Team metrics or internal recognition if all new deployments use IaC.
- Showcase success stories: “Team A reduced production incidents by 30% after adopting IaC.”

By standardising your IaC approach, providing shared templates, enforcing usage even in lower environments, and automating checks, you accelerate your journey toward uniform, declarative provisioning across teams.

How to do better

Below are rapidly actionable ways to continue refining:

Integrate with CI/CD Pipelines
- If not already done, ensure every major deployment goes through a pipeline that runs:
  - Linting, security scans (e.g., checking for known misconfigurations), and policy compliance checks:
Establish a Platform Engineering or DevOps Guild
- A cross-team group can maintain shared IaC libraries, track upgrades, and collaborate on improvements.
- This fosters ongoing enhancements and helps new teams onboard quickly.
Strengthen Security and Compliance Automation
- Embed more advanced checks into your IaC pipeline:
  - e.g., verifying that certain resources cannot be exposed to the public internet, forcing encryption at rest, etc.
Expand to Multi-Cloud or Hybrid
- If relevant, unify your IaC approach for resources across multiple clouds or on-prem environments:
  - Tools like Terraform can handle multi-cloud provisioning under one codebase.
Continue Upskilling Staff
- Offer advanced IaC training, sessions on best practices, or pair programming to help teams adopt more sophisticated patterns (modules, dynamic references, etc.).

By using formal CI/CD for all deployments, fostering a DevOps guild, strengthening compliance checks, and supporting multi-cloud approaches, you refine widespread IaC usage into a highly orchestrated, reliable practice across the organisation.

How to do better

Below are rapidly actionable ways to push the boundaries, even at the highest maturity:

Implement Policy-as-Code
- Ensure each pipeline run checks compliance automatically:
Adopt Advanced Testing and Security Checks
- Extend your pipeline to run static code analysis (SAST), dynamic checks (DAST), and security scanning for container images or VM base images.
- Provide a thorough “shift-left” approach, catching issues pre-production.
Introduce Automated Change Approvals
- If you want a “human in the loop” for major changes, use pipeline gating:
  - e.g., a Slack or Teams approval step before applying infrastructure changes in production.
- This merges automation with the final manual sign-off for critical changes.
Evolve Toward Self-Service Platforms
- Provide an internal “portal” or “service catalog” for non-expert teams to request resources that are auto-provisioned via Infrastructure as Code (IaC) and Continuous Integration/Continuous Deployment (CI/CD):
Expand to True “GitOps” for Ongoing Management
- Continuously synchronise changes from Git to your runtime environment:
  - e.g., using FluxCD or ArgoCD for containerised workloads, or hooking a Terraform operator into a Git repo.

By integrating policy-as-code, advanced security checks, optional gating approvals, self-service catalogs, and GitOps strategies, you refine your mandatory declarative automation approach into a truly world-class, highly efficient model of modern cloud provisioning for the UK public sector.

Keep doing what you’re doing, and consider sharing internal or external blog posts about your provisioning automation journey. Submit pull requests to this guidance or similar public sector best-practice repositories to help others learn from your experiences and successes.

Operations

Do you use continuous integration and continuous deployment (CI/CD) tools? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to adopt a basic CI/CD foundation:

Begin with Simple Scripting
- Automate part of your build or test process via scripts:
Implement Basic Automated Testing
- Start by automating unit tests:
  - Each commit triggers a script that runs tests in a shared environment, providing at least a “pass/fail” outcome.
Use a Shared Version Control Repository
- If you’re not already using one, adopt Git (e.g., GitHub, GitLab, or an internal service) to store your source code so that you can begin integrating basic CI steps.
Document the Process
- Create a short runbook or wiki entry explaining how code is built, tested, and deployed.
- This helps new team members adopt the new process.
Set a Goal to Remove Manual Steps Gradually
- Identify the most error-prone or time-consuming manual tasks. Automate them first to gain quick wins.

By introducing simple build/test scripting, hosting code in version control, and documenting your process, you establish the baseline for a more formal CI/CD pipeline in the future.

How to do better

Below are rapidly actionable ways to broaden CI/CD usage:

Establish a Centralised CI/CD Reference
- Create an internal wiki or repository showcasing how leading teams set up their pipelines:
  - For example, an example pipeline for .NET in Azure DevOps Pipelines or GitHub Actions.
  - A Java pipeline in AWS CodePipeline.
- Encourage other teams to replicate successful patterns.
Provide or Recommend CI/CD Tools
- Suggest a small set of commonly supported tools:
- This consistency can reduce fragmentation.
Host Skill-Sharing Sessions
- Have teams currently using CI/CD present their approaches in short lunch-and-learn sessions.
- Record these sessions so new staff or less mature teams can learn at their own pace.
Create Minimal Pipeline Templates
- Provide a starter template for each major language or platform (e.g., Node.js, Java, .NET).
- Ensure these templates include basic build, test, and package steps out of the box.
Reward Cross-Team Collaboration
- If a more advanced project helps a struggling team set up their pipeline, recognise both parties’ efforts.
- This fosters a culture of internal assistance rather than siloed approaches.

By sharing knowledge, offering recommended tools, and providing example templates, you organically expand CI/CD adoption and empower teams to adopt consistent approaches.

How to do better

Below are rapidly actionable ways to refine or unify CI/CD tool usage:

Define Core Principles or Best Practices
- Even if each team chooses different tools, align on key principles:
  - Every pipeline must run unit tests, produce build artifacts, and store logs.
  - Every pipeline must integrate with code reviews and version control.
- This ensures consistency of outcomes, if not standard tooling.
Document Cross-Tool Patterns
- Create a short doc or wiki explaining how to handle:
  - Secrets management, environment variables, artifact storage, and standard branch strategies.
- This helps teams use the same approach to security and governance, even if they use different CI/CD apps.
Encourage Modular Pipeline Code
- Teams can share modular scripts or config chunks for tasks like static analysis, security checks, or environment provisioning:
  - e.g., Docker build modules, Terraform integration steps, or test coverage logic.
Highlight or Mentor
- If certain pipelines are especially successful, highlight them as “recommended” or offer mentorship so other teams can replicate their approach.
- Over time, the organisation may naturally converge on a handful of widely accepted tools.
Consider a Central CI/CD Service for Key Use Cases
- Some organisations set up a central instance of Jenkins or a self-hosted GitLab/GitHub runner for teams to use, at least for shared services or highly regulated workloads.
- Others rely on cloud-native solutions like AWS CodePipeline/CodeBuild, Azure DevOps Pipelines, GCP Cloud Build, or OCI DevOps for standardised approaches.

By defining core CI/CD principles, documenting shared patterns, and selectively offering a central service or recommended tool, you maintain team autonomy while reaping benefits of consistent practices.

How to do better

Below are rapidly actionable ways to refine widespread team-driven CI/CD:

Introduce a DevOps Guild or CoE (Center of Excellence)
- Regularly meet with representatives from each team, discussing pipeline improvements, new features, or security issues.
- Gather best practices in a single location.
Further Integrate Security (DevSecOps)
- Encourage each pipeline to include vulnerability scanning, license checks, and compliance validations:
Standardise Basic Access & Observability
- Regardless of the pipeline tool, ensure:
  - A consistent approach to storing build logs and artifacts, tagging builds with version numbers, and applying RBAC for pipeline access.
- This unifies the data your compliance officers or governance teams rely on.
Automate Approvals for Critical Environments
- If production deployments require sign-off, implement a pipeline-based approval process:
  - e.g., Slack or Teams-based approval checks, or an integrated manual approval step in the pipeline (Azure DevOps, GitHub Actions, GCP Cloud Build triggers, or AWS CodePipeline).
Measure Pipeline Performance and Reliability
- Gather metrics like average build time, deployment success rate, or lead time for changes.
- Use these insights to target pipeline improvements or unify slow or error-prone steps.

By fostering a DevOps guild, infusing security checks, and unifying logging/artifact storage, you balance team autonomy with enough cross-cutting standards to maximise reliability and compliance.

How to do better

Below are rapidly actionable ways to refine your standardised CI/CD practices:

Adopt Pipeline-as-Code for All
- Store pipeline definitions in Git, ensuring changes undergo the same review as application code:
Implement Advanced Deployment Strategies
- For example, canary or blue/green deployments:
  - This reduces downtime and risk during releases, making your pipelines more robust.
Integrate Policy-as-Code
- Ensure pipeline runs automatically verify compliance with organisational policies:
  - e.g., scanning IaC templates or container images for security or cost violations, referencing official standards.
Expand Observability
- Offer real-time dashboards for build success rates, deployment times, and test coverage.
- Publish these metrics in a central location so leadership and cross-functional teams see progress.
Encourage “Chaos Days” or Hackathons
- Let teams experiment with pipeline improvements, new integration patterns, or novel reliability tests.
- This fosters ongoing innovation and ensures your standardised approach does not become static.

By version-controlling pipeline definitions, embracing advanced deployment patterns, integrating policy checks, and driving continuous improvement initiatives, you keep your standardised CI/CD framework at the cutting edge—well-aligned with UK public sector priorities of robust compliance, reliability, and efficiency.

Keep doing what you’re doing, and consider writing up your CI/CD journey in internal blog posts or knowledge bases. Submit pull requests to this guidance or related public sector best-practice repositories so others can learn from your experiences as well.

How fast are your builds and deployments? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to introduce basic measurements and reduce build/deployment durations:

Implement a Simple Tracking Mechanism
- Start by documenting each deployment’s start and end times in a spreadsheet or ticket system:
  - Track which environment was deployed, total time taken, any blockers encountered.
- Over a few weeks, you’ll get a baseline for improvement.
Automate Basic Steps
- If you’re manually building code, add a script or minimal pipeline:
Adopt a Central Version Control System
- If you aren’t already, store source code and deployment artifacts in Git (e.g., GitHub, GitLab, Azure Repos, etc.):
- This lays the groundwork for more advanced automation later.
Introduce Basic SLAs for Deployment Windows
- e.g., “We aim to complete production deployments within 1 working day once approved.”
- This ensures staff start to see time-to-deploy as a priority.
Identify Key Bottlenecks
- Are approvals causing delays? Are you waiting for a single SME to do manual steps?
- Focus on automating or streamlining the top pain point first.

By tracking deployments in a simple manner, automating the most time-consuming tasks, and setting minimal SLAs, you begin reducing deployment time and gain insight into where further improvements can be made.

How to do better

Below are rapidly actionable ways to reduce delays and evolve your tracking:

Automate Testing
- Expand beyond a simple build script, adding automated tests (unit, integration):
Streamline Approvals
- If manager sign-off is causing long waits, propose a structured yet efficient approval flow:
  - For example, define a Slack or Teams channel where changes can be quickly acknowledged.
  - Use a ticket system or pipeline-based manual approval steps that require minimal overhead.
Implement Parallel or Incremental Deployments
- Instead of a big-bang approach, deploy smaller changes more frequently:
  - If teams see fewer changes in each release, testing and validation can be quicker.
Enforce Clear Deployment Windows
- e.g., “Production deploys occur every Tuesday and Thursday at 2 PM,” with a cut-off for code submissions.
- This planning reduces ad hoc deployments that cause confusion.
Set Target Timelines
- e.g., “Builds should not exceed 30 minutes from commit to artifact,” or “Deployments to test environments should complete within an hour of code merges.”
- Start small, measure progress, and refine goals.

By adding automated testing, simplifying approvals, and promoting incremental deployments, you shorten delays and create a more responsive release pipeline.

How to do better

Below are rapidly actionable ways to enhance your moderate efficiency:

Add Real-Time or Automated Monitoring
- Implement dashboards or Slack/Teams notifications for every build/deployment, capturing:
  - Duration, pass/fail status, and any QA feedback.
  - Tools:
Optimise Build and Test Steps
- Identify any overly long test suites or build tasks:
  - e.g., parallelise tests or use caching to skip redundant steps.
- Tools like AWS CodeBuild caching, Azure Pipeline caching, or GCP Cloud Build caching can accelerate repeat builds.
Adopt Infrastructure as Code (IaC)
- If you manage infrastructure changes manually, incorporate IaC to reduce environment setup delays:
  - AWS CloudFormation, Azure Bicep, GCP Deployment Manager, or Terraform for multi-cloud solutions.
- This ensures consistent provisioning for test and production environments.
Implement Rolling or Blue/Green Deployments
- Reduce downtime and user impact by applying advanced deployment strategies.
- The more confident you are in your pipeline, the faster you can roll out changes.
Introduce Regular Retrospectives
- e.g., monthly or bi-weekly sessions to review deployment metrics (average build time, deployment durations).
- Plan small improvements each cycle—like removing a manual test step or simplifying a build script.

By improving monitoring, optimising test/build steps, adopting IaC, and refining deployment strategies, you make your moderately efficient process even faster and more stable.

How to do better

Below are rapidly actionable ways to optimise an already streamlined process:

Expand Shift-Left Testing and Security
- Integrate early security scanning, code quality checks, and performance tests into your pipeline:
Add Automated Rollback or Canary Analysis
- If a new release fails performance or user acceptance checks, revert automatically:
  - e.g., using canary deployments with AWS AppConfig or Azure App Service Deployment Slots or GCP Cloud Run revisions
Adopt Feature Flags
- Further speed up deployment by decoupling feature rollout from the actual code release:
  - This allows partial or user-segmented rollouts, improving feedback loops.
Implement Detailed Pipeline Telemetry
- If you only track overall build/deploy times, gather finer metrics:
  - Time spent in unit tests vs. integration tests, container image builds vs. scanning, environment creation vs. final validations.
- These insights highlight your next optimisation targets.
Formalise Continuous Improvement
- Host regular pipeline reviews or “build engineering” sprints.
- Evaluate changes in build times, error rates, or frequency of hotfixes. Use these insights to plan enhancements.

By infusing advanced scanning, canary release strategies, feature flags, and deeper telemetry into your existing streamlined pipeline, you further reduce risk, speed up feedback, and maintain a high level of operational maturity.

How to do better

Below are rapidly actionable ways to refine a near-optimal pipeline:

Incorporate AI/ML Insights
- Tools or custom scripts that analyze build logs and deployment results for anomalies or patterns over time:
  - e.g., predicting which code changes may cause test failures, optimising pipeline concurrency.
Expand Multi-Stage Testing and Observability
- Integrate performance, load, and chaos testing into your pipeline:
  - AWS Fault Injection Simulator or Azure Chaos Studio for resilience tests automatically triggered in your pipeline after staging deploys
  - GCP can use chaos engineering frameworks in Cloud Build triggers, or custom steps for load tests in staging environments
  - OCI can incorporate chaos testing scripts in DevOps pipelines for reliability checks pre-production
Share Expertise Across Agencies
- If your pipeline is among the fastest in the UK public sector, participate in cross-government knowledge-sharing:
  - Offer case studies or presentations at GDS or GovTech events, or collaborate with other agencies for mutual learning.
Fully Integrate Infrastructure and Policy as Code
- Ensure that not only your app code but also your network, security group, and policy definitions are stored in the pipeline, with automatic checks:
  - This creates a fully self-service environment for dev teams, reducing manual interventions further.
Set Zero-Downtime Deployment Goals
- If you haven’t already, aim for zero user-impact deployments:
  - e.g., advanced canary or rolling strategies in every environment, with automated rollback if user metrics degrade.

By experimenting with AI-driven pipeline intelligence, chaos engineering, advanced zero-downtime deployment strategies, and cross-department collaboration, you continue pushing the boundaries of high-speed, highly reliable build/deployment processes—reinforcing your position as a leader in efficient operations within the UK public sector.

Keep doing what you’re doing, and consider creating blog posts or internal case studies to document your continuous improvement journey. You can also submit pull requests to this guidance or related public sector best-practice repositories, helping others learn from your approach to fast and dependable build/deployment processes.

How do you monitor your systems? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move from reactive observation to basic continuous monitoring:

Implement Simple Infrastructure Monitoring
- Use vendor-native dashboards or minimal agent-based metrics:
Enable Basic Application Logging
- Configure logs to flow into a centralised service:
Set Up Minimal Alerts
- e.g., CPU usage > 80% triggers an email, or container restarts exceed a threshold:
  - This ensures you don’t rely purely on user reports for operational awareness.
Document Observability Practices
- A short wiki or runbook describing how to check logs, which metrics to watch, and who to contact if issues emerge.
- Even a minimal approach fosters consistency across dev and ops teams.
Schedule a Monitoring Improvement Plan
- Book a monthly or quarterly checkpoint to discuss any monitoring issues or data from the past period.
- Decide on incremental enhancements each time.

By adopting basic infrastructure metrics, centralising logs, configuring minimal alerts, and documenting your approach, you shift from purely reactive observation to foundational continuous monitoring.

How to do better

Below are rapidly actionable ways to integrate your basic monitoring tools:

Consolidate Metrics in a Central Dashboard
- If each cloud service has its own dashboard, unify them in a single view:
Automate Alerts
- Replace or supplement manual checks with automated alerts for abnormal spikes or dips:
  - e.g., memory usage, 5xx error rates, queue backlogs, etc.
- Alerts should reach relevant Slack/Teams channels or an email distribution list.
Introduce Tagging for Correlation
- If you tag resources consistently, your monitoring tool can group related services:
  - e.g., “Project=ServiceX” or “Environment=Production.”
- This helps you spot trends across all resources for a specific application.
Document Standard Operating Procedures (SOPs)
- For each common alert (e.g., high CPU, memory leak), define recommended steps or references to logs for quick troubleshooting.
- This reduces reliance on guesswork or individual heroics.
Integrate with Deployment Pipelines
- If you have a CI/CD pipeline, embed a step that checks basic health metrics post-deployment:
  - e.g., if error rates spike after a new release, roll back automatically or alert the dev team.

By consolidating metrics, automating alerts, introducing consistent tagging, and creating SOPs, you reduce manual overhead and gain a more unified picture of your environment, improving response times.

How to do better

Below are rapidly actionable ways to deepen integration of infrastructure and application data:

Adopt APM (Application Performance Monitoring) Tools
- Pair your infrastructure metrics with application tracing or performance insight:
Implement Unified Logging and Metric Correlation
- Use a logging solution that supports correlation IDs or distributed traces:
  - This helps you pivot from an app error to the underlying VM or container metrics in one step.
Create Multi-Dimensional Alerts
- Instead of CPU-based alerts alone, combine them with application error rates or queue backlog:
  - e.g., alert only if CPU > 80% AND 5xx errors spike, reducing false positives.
Enable Synthetic Monitoring
- Set up automated user-journey or transaction tests:
  - If these fail, you know the user experience is impacted, not just backend metrics.
Refine SLA/SLI/SLO
- If you measure high-level “availability,” break it down into a more precise measure (e.g., 99.9% of user requests under 2 seconds).
- Align your alerts to these SLOs so your monitoring focuses on real user impact.

By combining APM, correlated logs, synthetic tests, and multi-dimensional alerts, you ensure your teams spot potential issues quickly and tie them directly to user experience, thereby boosting operational effectiveness.

How to do better

Below are rapidly actionable methods to push partial integration to near full integration:

Enhance Distributed Tracing
- If you only partially track transactions across microservices, unify them:
Adopt an Observability-First Culture
- Encourage developers to embed structured logs, custom metrics, and trace headers from day one.
- This synergy helps advanced monitoring tools build a full picture of performance.
Automate Root Cause Analysis (RCA)
- Some advanced tools or scripts can identify potential root causes by analyzing correlated data:
  - e.g., pinpoint a failing database node or a memory leak in a specific container automatically.
Refine Alert Thresholds Using Historical Data
- If you have advanced metrics but struggle with noisy or missed alerts, adjust thresholds based on past trends.
- e.g., If your memory usage typically runs at 70% baseline, alert at 85% instead of 75% to reduce false positives.
Integrate ChatOps
- Deliver real-time alerts and logs to Slack/Teams channels. Let teams query metrics or logs from chat directly:
  - e.g., a chatbot that surfaces relevant data for incidents or just-in-time debugging.

By fortifying distributed tracing, adopting an “observability-first” mindset, automating partial root cause analysis, and refining alerts, you close the remaining gaps and strengthen end-to-end situational awareness.

How to do better

Below are rapidly actionable ways to refine an already integrated “single pane of glass” approach:

Leverage AI/ML-Based Anomaly Detection
- Some vendor-native or third-party solutions can preemptively spot unusual patterns:
Implement Self-Healing
- If your integrated system detects a consistent fixable issue, automate the remedy:
  - e.g., automatically scale containers or restart a microservice if certain metrics exceed thresholds.
- Ensure any automated fix logs the action for audit or compliance.
Integrate Observability with ChatOps
- Offer real-time interactive troubleshooting:
  - e.g., Slack bots that can run queries or “explain” anomalies using your “single pane” data.
Adopt Full Lifecycle Cost and Performance Analysis
- Link your monitoring data to cost metrics for a holistic view:
  - e.g., seeing how scaling up or out affects not only performance but also budget.
- This fosters more strategic decisions around resource usage.
Share Observability Insights Across the Public Sector
- If you’ve achieved a truly integrated solution, document your architecture, the tools you used, and best practices.
- Present or collaborate with other agencies or local councils, uplifting broader public sector observability.

By harnessing AI-driven detection, automating remediation steps, integrating real-time ChatOps, and linking cost with performance data, you push your advanced single-pane-of-glass monitoring to a new level—enabling near-instant responses and deeper strategic insights.

Keep doing what you’re doing, and consider writing internal blogs or case studies on your observability journey. Submit pull requests to this guidance or other public sector best-practice repositories to help others learn from your experiences with integrated cloud monitoring.

How do you get real-time data and insights? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to improve data literacy and real-time insight capabilities:

Provide Basic Data Literacy Training
- Organise short workshops, possibly in partnership with GOV.UK Data in government guidance or local councils, focusing on:
  - How to read and interpret basic charts or dashboards.
  - Terminology for metrics (e.g., “mean,” “median,” “time series,” “confidence intervals”).
- This empowers more staff to self-serve on simpler data queries.
Adopt a Simple Visualisation or BI Tool
- Introduce a basic tool that can produce automated reports from spreadsheets or CSV data:
- Even rudimentary dashboards reduce the SME dependency for repetitive questions.
Pilot a Data Lake or Central Data Repository
- Instead of storing departmental data in multiple ad hoc spreadsheets or on local drives, centralise it:
- This central repository can feed into simple dashboards or queries.
Encourage a Data Buddy System
- Pair domain experts with data-literate staff (or external analysts) who can guide them on structured data approaches.
- This fosters knowledge transfer and upskills both sides.
Reference Official Guidance on Data Handling
- For compliance and security, consult:

By improving data literacy, introducing a basic BI tool, creating a pilot data repository, and pairing experts with data-savvy staff, you begin reducing your reliance on point-in-time manual analysis. Over time, these steps pave the way for real-time insights.

How to do better

Below are rapidly actionable ways to transition from basic delayed reporting to more timely insights:

Explore Incremental Data Refresh
- Instead of daily or weekly full loads, adopt incremental or micro-batch processing:
Add Near Real-Time Dashboards
- Maintain existing weekly summary reports while layering a near real-time view for critical metrics:
  - e.g., the number of service requests in the last hour or real-time error rates in a public-facing service.
Improve Data Quality Checks
- If data quality or cleaning is causing delays, implement automated checks:
Set Timeliness KPIs
- e.g., “All critical data sets must be updated at least every 2 hours,” or “System error logs refresh in analytics within 15 minutes.”
- Over time, strive to meet or improve these targets.
Align with NCSC and NIST Guidance on Continuous Monitoring
- Assess if your delayed insights hamper quick detection of security anomalies, referencing:
  - NCSC Logging and Protective Monitoring guidance
  - NIST SP 800-137 on continuous monitoring (for security and operations)

With incremental data refreshes, partial real-time dashboards, better data pipelines, and timeliness KPIs, you reduce the gap between data generation and insight delivery, improving responsiveness.

How to do better

Below are rapidly actionable ways to enhance your partially real-time analytics:

Adopt Stream Processing for More Datasets
- If only a few sources stream data, expand to additional streams:
Consolidate Real-Time Dashboards
- Instead of multiple tools, unify around one main real-time analytics platform:
Enhance Data Integration
- If certain data sets remain batch-only, try hybrid ingestion methods:
  - e.g., partial streaming for time-critical fields, scheduled for large historical loads.
Conduct Cross-Team Drills
- Run mock scenarios (e.g., a surge in user transactions or a security event) to test if real-time analytics allow quick response.
- Identify where missing or delayed data hampers resolution.
Leverage Gov/Industry Guidance
- For data handling and streaming best practices:
  - NIST Big Data Interoperability Framework for scaling analytics solutions
  - NCSC best practices for monitoring and incident response

By increasing stream processing, consolidating dashboards, and expanding real-time coverage to more data sets, you minimise the blind spots in your analytics, enabling faster, more informed decisions across the board.

How to do better

Below are rapidly actionable ways to refine your advanced real-time analytics:

Enhance Data Federation and Governance
- If data sits across multiple cloud or on-prem systems, implement a data mesh or robust governance policy:
- Ensure compliance with relevant NCSC data security and NIST data governance guidelines.
Promote Self-Service BI
- Offer user-friendly dashboards with drag-and-drop analytics:
  - e.g., enabling policy officers, operation managers, or finance leads to build custom views without waiting on IT.
Incorporate Automated Anomaly Detection
- Move beyond manual queries to ML-based insight:
Support Data Literacy Initiatives
- Provide ongoing training, e.g., workshops or eLearning, referencing:
  - GDS Academy data training courses
  - NIST Big Data Public Working Group insights or relevant NCSC guidelines for data monitoring
Set Real-Time Performance Goals
- e.g., “90% of operational metrics should be visible within 60 seconds of ingestion.”
- Routinely track how these goals are met or if data pipelines slow over time, making improvements as needed.

By strengthening data governance, encouraging self-service, adopting automated anomaly detection, and continuing to boost data literacy, you maximise the value of your advanced analytics environment.

How to do better

Below are rapidly actionable ways to refine self-service real-time insights:

Expand Data Sources and Data Quality
- Enrich dashboards by integrating external open data or cross-department feeds:
  - e.g., integrating UK open data from data.gov.uk or other public sector agencies for broader context.
Introduce Natural Language or Conversational Queries
- Tools like:
Automate Governance and Access Controls
- Ensure compliance with data protection regulations (e.g., UK GDPR). Implement dynamic row-level or column-level security for sensitive data:
Integrate Predictive Insights in Dashboards
- If you have ML models, embed their output directly into the dashboard:
  - e.g., forecasting future usage or risk, highlighting anomalies on live charts.
Foster Cross-department Collaboration
- Share your best-practice dashboards or data schemas with other public sector bodies, referencing:
  - NCSC guidelines for secure data exchange and logging practices
  - GOV.UK’s cross-department data sharing protocols

By expanding data sources, enabling natural language querying, automating governance, embedding predictive analytics, and partnering with other agencies, you ensure your comprehensive self-service environment stays at the cutting edge—empowering a data-driven culture in UK public sector organisations.

Keep doing what you’re doing, and consider blogging about your journey toward real-time analytics and self-service dashboarding. Submit pull requests to this guidance or other public sector best-practice repositories to help others learn from your successes in delivering timely, actionable insights.

How do you release updates? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to transition from downtime-based updates to more resilient approaches:

Pilot a Rolling or Blue/Green Approach
- Instead of a complete shutdown, start with a minimal approach:
Establish a Basic CI/CD Pipeline
- So that updates are automated and consistent:
  - e.g., run unit tests, integration checks, and create a deployable artifact with each commit.
- NCSC’s guidance on DevSecOps or NIST SP 800-160 can inform security integration into the pipeline.
Use Snapshot Testing or Quick Cloning
- If you remain reliant on backups for rollback, test them frequently:
  - Ensure daily or more frequent snapshots can be swiftly restored in a staging environment to confirm reliability.
Communicate Downtime Effectively
- If immediate elimination of downtime is not feasible, set up a transparent communication plan:
  - Inform users of upcoming windows via email or intranet, referencing any gov.uk service continuity guidelines.
Aim for Rolling Updates Pilot
- Identify at least one non-critical service to pilot rolling or partial updates, building confidence for production.

By adopting minimal rolling or staging-based updates, automating deployment pipelines, and ensuring robust backup/restore processes, you reduce the disruptive nature of downtime-based updates—paving the way for more advanced, near-zero-downtime methods.

How to do better

Below are rapidly actionable improvements:

Implement Automated Health Checks
- Ensure each instance is verified healthy before taking the next one offline:
Adopt a Canary or Blue/Green Strategy for Critical Services
- Gradually test changes on a small portion of traffic before proceeding:
  - This reduces risk if an update has issues.
Shorten or Eliminate Maintenance Windows
- If rolling updates are stable, see if you can do them in business hours for services with robust capacity.
- Communicate frequently with users about partial capacity reductions, referencing relevant GOV.UK operational guidelines.
Automate Rollback
- If an update fails, ensure your pipeline or scripts can quickly revert to the previous version:
  - Storing versioned artifacts in, for example, AWS S3 or ECR, Azure Container Registry, GCP Artifact Registry, or OCI Container Registry.
Reference NCSC Guidance on Operational Resilience
- Rolling updates align with resilience best practices, but see if NCSC or NIST SP 800-53 revision on system and communications protection controls suggests additional steps to reduce downtime.

By adding health checks, introducing partial canary or blue/green methods, and continuously automating rollbacks, you further minimise the user impact even within a rolling update strategy—potentially removing the need for fixed maintenance windows.

How to do better

Below are rapidly actionable ways to enhance manual cut-over processes:

Automate the Switch
- Even if you keep a manual approval step, script the rest of the transition:
  - e.g., flipping a DNS entry, load balancer config, or feature toggle automatically:
Incorporate Automated Testing Pre-Cut-Over
- Run smoke/integration tests on the new environment before the final switch:
  - If tests pass, you simply approve the cut-over.
Establish Clear Checklists
- List each step, from final pre-check to DNS swap, ensuring all relevant logs, metrics, or alerts are turned on:
  - Minimises risk of skipping a crucial step during a manual process.
Use Observability Tools for Rapid Validation
- After switching, verify the new environment quickly with real-time dashboards or synthetic user tests:
  - This helps confirm everything runs well before fully retiring the old version.
Refer to NCSC Operational Resilience Guidance
- NCSC documentation offers principles for ensuring minimal disruption when switching environments.
- NIST SP 800-160 Vol 2 can also provide insights on engineering for cyber-resilience in deployment processes.

By automating as many cut-over steps as possible, implementing integrated testing, and leveraging robust observability, you reduce manual overhead while retaining the safety of parallel versions.

How to do better

Below are rapidly actionable methods to enhance manual canary or blue/green strategies:

Automate Traffic Shaping
- Instead of manually controlling traffic percentages, leverage:
Implement Automated Rollback
- If metrics degrade beyond thresholds, revert automatically to the stable version without waiting for manual action:
  - e.g., a pipeline checking real-time error rates or latency.
Adopt Observability-Driven Deployment
- Use real-time logging, metrics, and user experience monitoring to confirm if the new version is healthy:
  - [NCSC and NIST SP 800-137 (Continuous Monitoring) guidance can help formalise the approach].
Enhance Developer Autonomy
- If your policy allows, let smaller updates or patch releases auto-deploy after canary checks pass, reserving manual oversight only for major changes or high-risk deployments.
Consider ChatOps or Tools for One-Click Approvals
- Slack/Teams integrated pipeline steps let authorised personnel type a simple command or press a button to shift traffic from old to new version.
- This lowers friction while preserving manual control.

By introducing traffic shaping with partial auto-deploy or rollback, deeper observability, and flexible chat-based control, you refine your canary or blue/green approach, reducing the manual overhead of each release while keeping high confidence.

How to do better

Even at this top maturity level, there are rapidly actionable improvements:

Expand Automated Testing & AI/ML Analysis
- If canary performance is only measured by simple metrics (error rate, latency), consider advanced checks:
Implement Feature Flag Management
- Decouple feature releases from deployments entirely:
  - e.g., changing user experience or enabling new functionality with toggles, tested gradually.
- Tools like [LaunchDarkly], or vendor-based solutions [AWS AppConfig feature flags, Azure Feature Management, GCP Feature Flags, or OCI-based toggles] can help.
Advance Security & Testing
- Integrate real-time security checks pre- and post-deployment:
  - e.g., scanning container images or serverless packages for known vulnerabilities, referencing NIST SP 800-190 for container security best practices or NCSC’s container security guidance.
Explore Multi-Cluster or Multi-Region Failover
- If one region or cluster is updating, route traffic to another fully operational cluster for absolute minimal disruption:
  - This further cements zero downtime across a national or global footprint.
Collaborate with Other Public Sector Bodies
- Share your near-instant, zero-downtime deployment patterns with local councils or other departments:
  - Possibly present at cross-government events, referencing the GOV.UK community approach to agile delivery for broader impact.

By embedding advanced anomaly detection, feature flag strategies, multi-region failover, and deepening security checks, you maintain a cutting-edge continuous deployment ecosystem—aligning with top-tier operational excellence in the UK public sector.

Keep doing what you’re doing, and consider documenting your advanced release strategies in internal or external blog posts. You can also submit pull requests to this guidance or other public sector best-practice repositories, helping others progress toward zero-downtime, high-confidence release methods.

How do you manage deployment and QA? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond entirely manual QA and deployments:

Introduce a Simple CI Pipeline
- Begin by automating at least the build and basic test steps:
Document a Standard Release Checklist
- Ensure each deployment follows a consistent procedure, covering essential steps like code review, environment checks, and sign-off by the project lead.
Schedule a Pilot for Automated QA
- If you typically rely on manual testers, pick a small piece of your test suite to automate:
  - e.g., smoke tests or a top-priority user journey.
- This pilot can demonstrate the value of automation to stakeholders.
Set Clear Goals for Reducing Manual Steps
- Aim to reduce “time to deploy” or “time spent on QA” by a certain percentage over the next quarter, aligning with agile or DevOps improvement cycles recommended by GOV.UK Service Manual practices.
Review Security Compliance
- Consult NCSC’s DevSecOps recommendations and NIST SP 800-160 Vol 2 for integrating secure coding checks or scanning into your newly introduced pipeline steps.

By establishing minimal CI automation, clarifying release steps, and piloting automated QA, you build confidence in incremental improvements, setting the foundation for more robust pipelines.

How to do better

Below are rapidly actionable methods to evolve from partial automation:

Expand Automated Tests to Integration or End-to-End (E2E)
- Move beyond simple unit tests:
Adopt a More Frequent Release Cadence
- Commit to at least monthly or bi-weekly releases, allowing you to discover issues earlier and respond to user needs faster.
Introduce Automated Rollback or Versioning
- Store artifacts in a repository for easier rollback:
  - AWS S3, ECR, or CodeArtifact; Azure Artifacts or Container Registry; GCP Artifact Registry; OCI Container Registry
- Make rollback steps part of your pipeline script to minimise disruption if a new release fails QA in production.
Refine Manual Approvals
- If manual gates remain, streamline them with a single sign-off or Slack-based approvals rather than long email chains:
  - This ensures partial automation doesn’t stall at a manual step for days.
Consult NIST SP 800-53
- Evaluate recommended controls for software release (CM-3, SA-10) and integrate them into your pipeline for better compliance documentation.

By broadening test coverage, increasing release frequency, and automating rollbacks, you lay the groundwork for more frequent, confident deployments that align with modern DevOps practices.

How to do better

Below are rapidly actionable ways to enhance integrated deployment and QA:

Add Security and Performance Testing
- Integrate security scanning tools into the pipeline:
- Also consider lightweight performance tests in staging to detect regressions early.
Implement Parallel Testing or Test Suites
- If test execution time is long, parallelise them:
  - e.g., AWS CodeBuild parallel builds, Azure Pipelines multi-job phases, GCP Cloud Build multi-step concurrency, or OCI DevOps parallel test runs.
Introduce Slack/Teams Notifications
- Notify dev and ops channels automatically about pipeline status, test results, and potential regressions:
  - Encourages quick fixes and fosters a more collaborative environment.
Adopt Feature Flag Approaches
- Deploy new code continuously but hide features behind flags:
  - This ensures “not fully tested or accepted” features remain off for end users until QA sign-off.
Reference GOV.UK and NCSC
- GOV.UK agile delivery guidelines can help refine iterative approaches.
- NCSC advice on DevSecOps pipelines encourages secure integration from start to finish.

By strengthening security/performance checks, parallelising tests, using real-time notifications, and employing feature flags, you further streamline your integrated QA pipeline while maintaining robust checks and balances.

How to do better

Below are rapidly actionable ways to refine your existing CI/CD with automated testing:

Shift Left Security
- Embed security tests (SAST, DAST, license compliance) earlier in the pipeline:
  - e.g., scanning pull requests or pre-merge checks for known vulnerabilities.
Adopt Canary/Blue-Green Deployments
- Pair your stable CI/CD pipeline with progressive exposure of new versions to real traffic:
Implement Automated Rollback
- If user impact or error rates spike post-deployment, revert automatically to the previous version without manual steps.
Use Feature Flags for Safer Experiments
- Deploy code continuously but toggle features on gradually.
- This approach de-risks large releases and speeds up delivery.
Encourage Cross-Government Collaboration
- Share pipeline patterns with other public sector bodies, referencing GOV.UK community guidance on agile/DevOps communities.

By deepening security integration, adopting advanced deployment tactics, and refining rollbacks or feature flags, you enhance an already stable CI/CD pipeline. This leads to even faster, safer releases aligned with top-tier DevSecOps practices recommended by NCSC and NIST.

How to do better

Even at this apex, there are rapidly actionable improvements:

Adopt Policy-as-Code for Environment Provisioning
- Ensure ephemeral environments adhere to data governance, resource tagging, and security baselines automatically:
Automated Data Masking or Synthetic Data
- If ephemeral environments need real data, ensure compliance with UK data protection regs:
  - Use synthetic test data or anonymise production copies to maintain NCSC data security best practices.
Inject Chaos or Performance Tests
- Incorporate chaos engineering (e.g., random container/network failures) and load tests in ephemeral environments:
  - This ensures high resilience under real-world stress.
Optimise Environment Lifecycle
- Monitor resource usage to avoid ephemeral environments lingering longer than needed:
  - e.g., automatically tear down environments if no activity is detected after 48 hours.
Collaborate with UK Gov or Local Councils
- Offer case studies on ephemeral environment success, referencing GOV.UK best practices in agile dev and continuous improvement.

By embedding policy-as-code, securing data in ephemeral environments, introducing chaos/performance tests, and aggressively managing environment lifecycles, you ensure your pipeline remains at the cutting edge—fully aligned with advanced DevOps capabilities recommended by NCSC, NIST, and other relevant bodies.

Keep doing what you’re doing, and consider writing up your experiences or creating blog posts about your ephemeral environment successes. You can also submit pull requests to this guidance or other public sector best-practice repositories, helping others in the UK public sector evolve their QA pipelines and deployment processes.

How do you develop and implement your cloud strategy? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to start formalising a cloud-oriented approach:

Identify a Cloud Advocate
- Appoint a single volunteer (or a small group) as the go-to person(s) for cloud questions:
  - They can gather and share best practices, referencing NCSC guidance on secure cloud migrations.
Host Internal Workshops
- Invite vendor public sector teams (AWS Public Sector, Azure for Government, GCP Public Sector, or Oracle Government Cloud) for short awareness sessions on cloud fundamentals and cost management.
Create a Cloud Starter Doc
- Summarise the organisation’s existing cloud usage, known gaps, and next steps for improvement.
- Include references to GOV.UK’s technology code of practice or NIST’s cloud computing guidelines for alignment.
Pilot a Small Cross-Functional Team
- If you have an upcoming project with cloud components, assemble a temporary team from different departments (development, security, finance) to coordinate on cloud decisions.
Define Basic Cloud Roles
- Even without a dedicated cloud team, define who handles security reviews, cost optimisation checks, or architectural guidance.

By designating a cloud advocate, introducing basic cloud knowledge sessions, and forming a small cross-functional group for a pilot project, you lay the groundwork for a more coordinated approach to cloud strategy and operations.

How to do better

Below are rapidly actionable ideas to strengthen informal cloud expertise:

Formalise a Community of Practice
- Schedule monthly or bi-monthly meetups for cloud practitioners across teams:
  - They can share success stories, approaches to cost management, referencing AWS Cost Explorer, Azure Cost Management, or GCP Cloud Billing dashboards.
Create a Shared Knowledge Base
- Host a wiki, Slack channel, or Teams group to store common Q&As or how-to guides:
  - Link to relevant NCSC cloud security resources and GOV.UK technology code of practice.
Encourage One-Stop Repos
- For repeated patterns (e.g., Terraform templates for secure VMs or container deployments), maintain a Git repo that all teams can reference.
Promote Shared Governance
- Align on a minimal set of “must do” controls (e.g., mandatory encryption, logging).
- Consider referencing NIST SP 800-53 controls for cloud resource security responsibilities.
Pilot a Small Formal Working Group
- If informal collaboration works well, create a small “Cloud Working Group” recognised by leadership.
- They can propose consistent patterns or cost-saving tips for cross-team usage.

By forming a community of practice, establishing a knowledge base, and beginning minimal governance alignment, you transition from ad hoc experts toward a more structured, widely beneficial cloud strategy.

How to do better

Below are rapidly actionable strategies to improve a formal Cloud COE:

Offer Self-Service Catalogs or Templates
- Provide easily consumable Terraform or CloudFormation templates for standard workloads:
Extend COE Services
- e.g., specialised security reviews, compliance checks referencing NCSC 14 Cloud Security Principles, or cost optimisation workshops that unify departmental approaches.
Set up a Community of Practice
- Have the COE coordinate monthly open sessions for all cloud practitioners to discuss new vendor features, success stories, or security enhancements.
Embed COE Members in Key Projects
- Provide “COE ambassadors” who temporarily join project teams to share knowledge and shape architecture from the start.
Consult NIST and GOV.UK for Strategy Guidance
- e.g., NIST Cloud Computing Reference Architecture or GOV.UK recommendations on technology strategies can strengthen your COE’s strategic approach.

By delivering self-service solutions, deeper security reviews, and an active cloud community, the COE matures into a vital driver for consistent, secure, and cost-effective cloud adoption across the organisation.

How to do better

Below are rapidly actionable steps to further integrate the COE’s standards into everyday operations:

Adopt “Cloud-First” or “Cloud-Smart” Policies
- Mandate that new solutions default to cloud-based approaches unless there’s a compliance or cost reason not to.
- Reference relevant policy from GOV.UK’s Cloud First policy for alignment.
Introduce Automated Compliance Checks
- Bake COE standards into automated tools:
- This ensures no team can inadvertently deviate from security or cost baselines.
Enable On-Demand Cloud Labs/Training
- Provide hands-on workshops or sandbox accounts where staff can experiment with new cloud services in a safe environment.
- Encourages further skill growth and cross-pollination.
Measure Outcomes and Iterate
- Track success metrics: e.g., time to provision environments, frequency of security incidents, cost savings realised by standard patterns.
- Present these metrics in monthly or quarterly leadership updates, aligning with NCSC operational resilience guidance.
Improve Cross-Functional Team Composition
- Incorporate security engineers and cloud architects directly into product squads for new digital services, reducing handoffs.

By mandating automated compliance checks, fostering a “cloud-first” approach, expanding skill-building labs, and embedding security/architecture roles into each delivery team, you further entrench consistent, effective cloud usage across the public sector organisation.

How to do better

Below are rapidly actionable ways to refine an already advanced operating model:

Introduce FinOps Practices
- Link cost optimisation more tightly with developer workflows:
Enable Self-Service Data & AI
- If each product team can not only provision compute but also harness advanced analytics or ML on demand:
  - Speeds up data-driven policy or service improvements.
Adopt Policy-as-Code
- Extend your automated governance:
  - e.g., using [Open Policy Agent (OPA) or AWS Service Control Policies, Azure Policy, GCP Organisation Policy, or OCI Security Zones] to ensure consistent rules across the entire estate.
Engage in Cross-Government Collaboration
- Share your advanced COE successes with other departments, local councils, or healthcare orgs:
  - Possibly present at GOV.UK community meetups, or work on open-source infrastructure modules that other public bodies can reuse.
Stay Current with Tech and Security Trends
- Periodically assess new NCSC or NIST advisories, cloud vendor releases, or best-practice updates to keep your operating model fresh, secure, and cost-effective.

By incorporating robust FinOps, self-service AI, policy-as-code, cross-government collaboration, and continuous trend analysis, you ensure your advanced COE model remains at the forefront of effective and secure cloud adoption in the UK public sector.

Keep doing what you’re doing, and consider writing blog posts or internal knowledge-sharing articles about your advanced Cloud COE. Submit pull requests to this guidance or other public sector best-practice repositories to help others learn from your successes in structuring cross-functional cloud teams and ensuring an effective operating model.

Who manages your cloud operations? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond developer-exclusive cloud management:

Form a DevOps Guild or Community of Practice
- Even without a formal team, bring developers interested in operations together monthly to share tips.
- This fosters consistent practices, referencing NCSC secure cloud recommendations or GOV.UK’s agile/delivery guidelines.
Introduce Minimal Automated Monitoring & Alerts
- Ensure developers aren’t manually checking logs. Use built-in tools:
Implement Basic Infrastructure as Code
- If developers manage cloud resources manually via console, introduce:
Add a Cloud Security Checklist
- Ensure developers follow at least a minimal set of policies for encryption, IAM, and logging, aligned with NCSC Cloud Security guidance.
Request Budget or Headcount
- If workloads grow, advocate for dedicated cloud engineering staff. Present cost/risk benefits to leadership, referencing GOV.UK cloud-first policy and potential agility gains.

By fostering a DevOps guild, adding automated monitoring, adopting IaC, and pushing for minimal security guidelines, you gradually evolve from purely developer-led ops to a more stable, repeatable cloud operation that can scale.

How to do better

Below are rapidly actionable ways to balance outsourced support with internal ownership:

Retain Strategic Oversight
- Even if operations remain outsourced, designate an internal “Cloud Lead” or small working group responsible for governance and security:
  - They should sign off on major architectural changes, referencing NIST Cloud Computing frameworks.
Set Clear SLA and KPI Requirements
- Make sure the vendor’s contract outlines response times, compliance with GOV.UK Cloud Security Principles or NCSC best practices, and regular cost-optimisation reviews.
Insist on Transparent Reporting
- Request routine dashboards or monthly metrics on performance, cost, security events.
- Ask the vendor to integrate with your chosen monitoring tools if possible.
Plan a Knowledge Transfer Path
- Negotiate with the vendor to provide training sessions or shadowing opportunities, building internal cloud literacy:
  - e.g., monthly knowledge-sharing on cost optimisation or security patterns.
Retain Final Decision Power on Strategic Moves
- The vendor can propose solutions, but major platform changes or expansions should get internal review for alignment with departmental objectives.
- This ensures the outsourced arrangement doesn’t override your broader digital strategy.

By keeping strategic authority, setting stringent SLAs, fostering vendor-provided knowledge transfer, and maintaining transparent reporting, you reduce vendor lock-in and ensure your cloud approach aligns with public sector priorities and compliance expectations.

How to do better

Below are rapidly actionable enhancements:

Co-Create Operational Standards
- Collaborate with your outsourced vendor on a joint “Operations Handbook” that includes standard procedures for deployments, monitoring, or incident response:
  - Reference NCSC incident management guidance or relevant GOV.UK service operation guidelines.
Embed Vendor Staff into Internal Teams
- If feasible, have vendor ops staff attend your sprint reviews or planning sessions, improving communication and reducing friction.
Establish Regular Strategic Review
- Conduct quarterly or monthly reviews to align on:
  - Future cloud services adoption
  - Cost optimisation opportunities
  - Evolving security or compliance needs
Request Real-Time Metrics
- Ensure the vendor’s operational data (e.g., cost usage, performance dashboards) is accessible to your internal strategic leads:
  - e.g., a shared AWS Cost Explorer or Azure Cost Management view for weekly usage checks.
Plan for Potential In-House Expansion
- If usage grows or departmental leadership wants more direct control, negotiate partial insourcing of key roles or knowledge transfer from the vendor.

By jointly defining an operations handbook, integrating vendor ops staff in your planning, reviewing strategy regularly, and retaining real-time metrics, you strengthen internal leadership while enjoying the convenience of outsourced operational tasks.

How to do better

Below are rapidly actionable ways to optimise the hybrid approach:

Standardise Tools and Processes
- Require both in-house and vendor teams to adopt a single set of CI/CD pipelines or logging solutions:
  - e.g., AWS CodePipeline or Azure DevOps for build/test/deploy, with shared logging in CloudWatch/Azure Monitor/GCP Logging/OCI Logging
- This ensures seamless handoffs and consistent security posture.
Define Clear Responsibilities
- For each area (e.g., incident management, security patching, cost reviews), specify whether the vendor or in-house staff leads.
- Consult NCSC’s supply chain security guidance to ensure robust accountability.
Integrate On-Call Rotations
- If the vendor provides 24/7 coverage, have an internal secondary on-call or bridging approach:
  - This fosters knowledge exchange and ensures no single point of failure if the vendor struggles.
Align on a Joint Roadmap
- Create a 6-12 month cloud roadmap, listing major initiatives like infrastructure refreshes, security enhancements (e.g., compliance with NIST SP 800-53 controls), or cost optimisation steps.
Encourage Cross-Training
- Rotate vendor staff into internal workshops or hackathons, and have your staff occasionally shadow vendor experts to deepen in-house capabilities.

By unifying tools, clarifying roles, rotating on-call duties, aligning on a roadmap, and cross-training, you make the hybrid model more cohesive—maximising agility and ensuring consistent cloud operation standards across internal and outsourced teams.

How to do better

Below are rapidly actionable ways to refine an already dedicated in-house cloud team:

Adopt a DevSecOps Center of Excellence (COE)
- Evolve your cloud team into a central repository for best practices, security frameworks, and ongoing training:
  - Provide guidelines on ephemeral environments, compliance-as-code, or advanced ML operations.
Set Up Autonomous Product Teams
- Embed cloud team members directly into product squads, letting them self-manage infrastructure and pipelines with minimal central gatekeeping:
  - This fosters agility while the central team maintains overarching governance.
Implement Policy-as-Code and FinOps
- Automate compliance (e.g., OPA or vendor-based policy enforcements like AWS SCPs, Azure Policy, GCP Organization Policy, OCI Security Zones) across accounts or projects.
- Integrate cost visibility into daily dev processes, referencing NCSC supply chain or financial governance, or NIST SP guidelines on cost management.
Champion Innovations
- Keep experimenting with advanced features (e.g., AWS Graviton, Azure confidential computing, GCP Anthos multi-cloud, or OCI HPC offerings) to continuously optimise performance and cost.
Regularly Review and Update the Roadmap
- Adapt to new government mandates, NCSC advisories, or emerging technologies.
- Share lessons learned via GOV.UK blog posts on digital transformation.

By embedding security and cost best practices, enabling cross-functional product teams, instituting policy-as-code, and continually updating your roadmap, your dedicated in-house cloud team evolves into a dynamic, cutting-edge force that consistently meets UK public sector operational and compliance demands.

Keep doing what you’re doing, and consider writing up your experiences or publishing blog posts on your cloud team’s journey. Also, contribute pull requests to this guidance or similar public sector best-practice repositories, helping others evolve their organisational structures for effective cloud operations.

How do you plan for incidents? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond ad-hoc incident response:

Draft a Simple Incident Response (IR) Checklist
- Outline basic steps for triage, analysis, containment, and escalation:
  - Who to notify, which logs to check, how to isolate affected systems, etc.
- Reference NCSC’s incident response best practices.
Identify Key Roles
- Even if you can’t create a full incident response team, designate an incident lead and a communications point of contact.
- Clarify who decides on severe actions (e.g., taking services offline).
Set Up Basic Monitoring and Alerts
- Ensure you have at least minimal coverage:
Coordinate with Third Parties
- If you rely on external suppliers or a cloud MSP, note their support lines and escalation processes in your checklist.
Review and Refine After Each Incident
- Conduct a mini post-mortem for any downtime or breach, adding lessons learned to your ad-hoc plan.

By drafting a minimal IR checklist, assigning key roles, enabling basic alerts, and learning from each incident, you can quickly improve your readiness without a massive resource investment.

How to do better

Below are rapidly actionable ways to strengthen an initial documented IR plan:

Integrate IR Documentation into CI/CD
- If you maintain an Infrastructure as Code or pipeline approach, embed references to the IR plan or scripts:
  - e.g., one-liners explaining how to isolate or roll back in the event of a security alert.
Automate Some Deployment Checks
- Before launch, run security scans or vulnerability checks:
Link IR Plan to Monitoring Dashboards
- Provide direct references in the plan to the dashboards or logs used for incident detection:
  - This helps new team members quickly identify relevant data sources in a crisis.
Consult Gov & NCSC Patterns
- Reference NIST SP 800-61 Section 3.2 “Incident Handling Checklist” or NCSC Cloud Security guidance to flesh out robust procedures.
Schedule a 3-Month Review Post-Launch
- Ensure the IR plan is updated after initial real-world usage.
- Adjust for any changes in architecture or newly discovered risks.

By embedding IR considerations into your pipeline, linking them to monitoring resources, referencing official guidance, and doing a post-launch review, you maintain an up-to-date plan that effectively handles incidents as the service evolves.

How to do better

Below are rapidly actionable ways to elevate a regularly updated IR plan:

Link Plan Updates to Service/Org Changes
- If new microservices launch or staff roles shift, require an immediate plan review:
  - e.g., add or remove relevant escalation points, update monitoring references.
Automate IR Plan Distribution
- Store the IR plan in version control (like GitHub), so everyone can see changes easily:
  - e.g., label each revision with a date or release tag.
- This fosters transparency and avoids outdated copies lurking in email threads.
Encourage DR Drills
- Expand on tabletop exercises by running limited real-world simulations:
  - e.g., intentionally degrade a non-critical environment to test the plan’s response steps.
- Tools like AWS Fault Injection Simulator, Azure Chaos Studio, or Chaos Mesh on GCP/OCI can facilitate chaos engineering.
Include Ransomware or DDoS Scenarios
- Adapt the plan to cover advanced threats relevant to public sector services, referencing NCSC’s ransomware guidance, NIST SP 800-61 for incident categories.
Regular Stakeholder Briefings
- Present IR readiness status updates to leadership or departmental leads, aligning them with the IR plan improvements.

By linking plan updates to actual org changes, distributing it via version control, frequently testing via drills, and preparing for advanced threats, you maintain an agile, effective IR plan that evolves with your environment.

How to do better

Below are rapidly actionable ways to further optimise integrated, tested IR plans:

Adopt Multi-Cloud or Region Failover Testing
- If your DR strategy includes shifting workloads to another cloud or region, periodically simulate it:
Expand Real-Time Monitoring Integration
- Ensure that if an alert triggers a continuity plan, the IR process is automatically updated with relevant logs or metrics.
- Tools like AWS EventBridge, Azure Event Grid, GCP Pub/Sub, or OCI Events can route incidents to the correct channels instantly.
Formalise Post-Incident Reviews
- Document everything in a post-mortem or “lessons learned” session, referencing NCSC’s post-incident evaluation guidelines.
- Update the plan accordingly.
Include Communication and PR
- Integrate public communication steps if your service is citizen-facing:
  - e.g., prepared statements or web page banners, referencing GOV.UK best practices on emergency communications.
Use NIST 800-61 or NCSC Models
- Evaluate if your IR plan’s phases (preparation, detection, analysis, containment, eradication, recovery, post-incident) align with recognised frameworks.

By simulating cross-region failovers, integrating real-time alert triggers with continuity plans, conducting thorough post-incident reviews, and weaving communications into the IR plan, you maintain a robust, seamlessly tested approach that can respond to diverse incident scenarios.

How to do better

Even at this advanced stage, below are rapidly actionable refinements:

Embed Chaos Drills
- Randomly inject failures or security anomalies in production-like environments to ensure IR readiness:
  - Tools like AWS Fault Injection Simulator or Azure Chaos Studio can orchestrate purposeful disruptions.
  - GCP or OCI can adopt open-source solutions like Chaos Mesh for container-level fault injection.
Adopt AI/ML-Driven Threat Detection
- Integrate advanced analytics for anomaly detection:
  - AWS DevOps Guru or Amazon GuardDuty
  - Azure Sentinel with ML insights
  - GCP Cloud Anomaly Detection
  - OCI Security Advisor with ML-based patterns
- This ensures you detect suspicious behavior even before explicit alerts fire.
Coordinate Regional or Multi-department Exercises
- Team up with allied public bodies or departments to run a joint incident scenario, testing real collaborative processes.
- Sharing data or responsibilities across agencies aligns with NCSC’s multi-organisation incident response guidance.
Link IR Performance to Gov Accountability
- Provide leadership with metrics or dashboards that show how quickly critical services can be restored.
- This fosters ongoing support for practicing and funding IR improvements.
Benchmark with International Standards
- Assess if your IR process meets or exceeds frameworks like [NIST SP 800-61], [ISO 27035], or related global best practices.
- Update or fine-tune accordingly.

By regularly practicing chaos drills, leveraging AI-driven threat detection, collaborating with other agencies, and aligning with recognised international standards, your IR capabilities become even more robust. This ensures you stay prepared for evolving threats while maintaining compliance and demonstrating exceptional public sector resilience.

Keep doing what you’re doing, and consider writing up your incident response practice experiences (e.g., tabletop drills, real-world successes) in a blog post or internal case studies. Submit pull requests to this guidance or public sector best-practice repositories so others can learn from your advanced approaches to incident preparedness and response.

People

How does your organisation work with cloud providers? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move from minimal interaction to stronger collaboration with cloud providers:

Set Up Basic Account Management Contacts
- Register for at least a standard or free tier of support:
- This ensures you know how to escalate issues if they arise.
Use Vendor Documentation & Quickstart Guides
- Encourage staff to leverage official tutorials and quickstarts for key services (compute, storage, networking).
- Reference NIST Cloud Computing resources for broad conceptual best practices.
Attend Vendor Webinars/Events
- Cloud providers frequently hold free webinars or online sessions geared to public sector or cost optimisation:
Implement Minimal Security Best Practices
- Even with minimal contact:
  - NCSC cloud security: At least enable multi-factor authentication (MFA), basic logging, and encryption at rest.
Document Next Steps
- E.g., “In 3 months, explore a higher support tier or schedule a call with a provider solutions architect to discuss cost or architecture reviews.”

By establishing basic contacts, using vendor quickstarts, tapping into free events, and implementing minimal security measures, you start reaping more value from your cloud provider relationship and set the stage for deeper engagement.

How to do better

Below are rapidly actionable ways to evolve beyond basic support:

Establish Regular Check-Ins with Account Managers
- Request quarterly calls or monthly updates:
Request Architecture/Cost Reviews
- Providers typically offer free or low-cost reviews to identify cost-saving or performance improvements:
  - e.g., AWS Well-Architected Review, Azure Architecture Review, GCP Architecture Check, OCI Architecture Center.
Attend or Organise Vendor-Led Training
- Encourage staff to attend vendor-led courses or sign up for NCSC-endorsed cloud security training materials if available.
- This builds internal skill sets, reducing reliance on ad-hoc support.
Leverage Vendor Communities & Forums
- For quick answers outside official tickets, use:
Institute a “Support Triage” Process
- Define guidelines on which issues can be solved internally vs. escalated to the provider to expedite resolution times.
- Helps staff know when to open tickets and what info to include.

By scheduling regular check-ins with account managers, requesting architecture and cost reviews, organising training sessions, and clarifying a support triage process, you step up from reactive usage of basic support to a more proactive and beneficial relationship.

How to do better

Below are rapidly actionable ways to leverage regular provider interaction more effectively:

Pursue Dedicated Technical Engagement
- If your usage or complexity warrants it, consider an advanced support tier:
Targeted Workshops for Specific Projects
- Request solution architecture workshops tailored to, say, big data analytics, HPC, or IoT in the public sector context.
- Align these with departmental goals, referencing NIST Big Data guidelines or NCSC data security advice.
Co-Develop a Cloud Roadmap
- With the provider’s account manager, outline next-year priorities: e.g., expansions to new regions, adopting serverless, or cost optimisation drives.
- Ensure these are documented in a shared action plan.
Engage in Beta/Preview Programs
- Providers often invite customers to test new features, offering direct input.
- This can yield early insights into tools beneficial for your departmental use cases.
Share Feedback on Public Sector Needs
- Raise local government, NHS, or departmental compliance concerns so the provider can adapt or recommend solutions (e.g., private endpoints, advanced encryption key management).

By scheduling advanced support tiers or specialised workshops, co-developing a cloud roadmap, participating in early feature programs, and continuously feeding back public sector requirements, you strengthen the partnership for mutual benefit.

How to do better

Below are rapidly actionable ways to deepen this proactive, tailored relationship:

Establish Joint Success Criteria
- e.g., “Reduce average monthly cloud cost by 20%,” or “Achieve 99.95% uptime with no unplanned downtime over the next quarter.”
- Collaborate with the provider’s solution architects to measure progress monthly.
Conduct Regular Technical Deep-Dives
- If using advanced analytics or HPC, schedule monthly architecture feedback with vendor specialists who can propose further optimisation or new service usage.
- Incorporate relevant NIST SP 500-299 HPC guidelines or domain-specific standards if relevant.
Engage in Co-Innovation Programs
- Some providers run “co-innovation labs” or pilot programs specifically for public sector transformations:
Formalise an Enhancement Request Process
- For feature gaps or special compliance needs, let your account team log these requests, referencing [NCSC or GOV.UK requirements].
- Potentially expedite solutions that meet public sector demand.
Public Sector Showcases
- Offer to speak at vendor events or in case studies, highlighting your success:
  - This often results in further tailored support or early access to relevant solutions.

By defining success metrics, scheduling technical deep-dives, pursuing co-innovation, and ensuring an open channel for feature requests, you make the most of your proactive provider engagement—driving continuous improvement in alignment with public sector priorities.

How to do better

Even at this advanced level, below are rapidly actionable ways to refine a strategic partnership:

Co-Develop Advanced Pilots
- Test cutting-edge solutions, e.g., advanced AI/ML for predictive analytics, HPC for large-scale modeling:
  - Align with NIST AI frameworks or specialised vendor HPC solutions.
- This pushes your public sector services into future-forward innovations.
Integrate Multi-Cloud or Hybrid Strategies
- If relevant, partner with multiple providers while ensuring secure, consistent management:
  - NCSC multi-cloud security considerations or vendor multi-cloud bridging solutions (Azure Arc, GCP Anthos, AWS Outposts, OCI Interconnect)
Spearhead Cross-Government Collaborations
- Collaborate with local councils, NHS, or other agencies—invite them to share your advanced partnership benefits, referencing GOV.UK’s cross-government digital approach.
- Potentially form shared procurement or compliance frameworks with the provider’s help.
Ensure Regular, Comprehensive Security Drills
- Pair with your provider for joint incident simulations, verifying consistent coverage of best practices:
  - e.g., region failovers, advanced DDoS scenarios, referencing NCSC’s DDoS protection guidance and the provider’s protective services.
Establish a Lessons Learned Repository
- Each joint initiative or advanced workshop should produce shareable documentation or “playbooks,” continuously updating your knowledge base for broader departmental usage.

By pushing into co-developed pilots, multi-cloud or hybrid expansions, cross-government collaborations, advanced security drills, and structured knowledge sharing, you maintain a forward-looking, fully integrated partnership with your cloud provider—ensuring ongoing alignment with strategic public sector aspirations.

Keep doing what you’re doing, and consider sharing your experiences (e.g., co-pilots, advanced solutions) in blog posts or on official channels. Submit pull requests to this guidance or related best-practice repositories to help others in the UK public sector benefit from your advanced collaborations with cloud providers.

How does your organisation support cloud training and certification? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to introduce at least a minimal structure for cloud-related training:

Create a Basic Cloud Skills Inventory
- Ask staff to self-report familiarity with AWS, Azure, GCP, OCI, or relevant frameworks (like DevOps, security, cost management).
- This inventory helps identify who might need basic or advanced training.
Encourage Free Vendor Resources
- Point teams to free training modules or documentation:
Sponsor One-Off Training Sessions
- If resources are extremely limited, schedule a short internal knowledge-sharing day:
  - For instance, have a staff member who learned AWS best practices do a 1-hour teach-in for colleagues.
Reference GOV.UK and NCSC Guidelines
- For developing staff skills in public sector contexts, see:
  - GOV.UK’s capability frameworks for digital, data, and technology roles
  - NCSC’s general advice on building secure teams and skill sets
Plan for Future Budget Requests
- If adoption grows, prepare a case for funding basic training or at least paying for exam vouchers, showing potential cost or security benefits.

By initiating a simple skills inventory, directing staff to free resources, hosting internal sessions, and referencing official guidance, you plant the seeds for more structured, formalised cloud training down the line.

How to do better

Below are rapidly actionable steps to unify manager-led training into a more consistent approach:

Set Organisation-Wide Cloud Skill Standards
- e.g., requiring at least one fundamental certification per dev/ops staff, referencing AWS Certified Cloud Practitioner, Azure Fundamentals, GCP Cloud Digital Leader, or OCI Foundations.
- This ensures a baseline competence across all teams.
Track Training Efforts Centrally
- Even if managers sponsor training, request monthly or quarterly updates from each manager:
  - Summaries of who took which courses, certifications earned, or next steps.
Provide a Shared Training Budget or Resource Pool
- Instead of leaving it entirely to managers, allocate a central fund for cloud courses or exam vouchers.
- Teams can draw from it with minimal bureaucracy, ensuring equity.
Host Cross-Team Training Days
- Let managers co-sponsor internal “training day sprints,” where staff from different teams pair up for labs or workshops:
  - Possibly invite vendor solution architects for a half-day session on cost optimisation or serverless.
Reference GOV.UK & NIST on Training Governance
- Align with GOV.UK skill frameworks for digital, data, and technology roles and NIST workforce security guidelines.
- Show managers how structured skill-building can reduce operational risks.

By defining organisation-wide skill baselines, tracking training across teams, offering a shared budget, and running cross-team training events, you build a more equitable and cohesive approach—improving consistency in cloud competence.

How to do better

Below are rapidly actionable improvements:

Customise Training by Role Path
- Provide recommended vendor certification journeys for each role (DevOps, Data Engineer, Security Engineer, etc.):
Incorporate Regular Skills Audits
- Each quarter or half-year, staff update training statuses and new certifications.
- Identify areas for further focus, e.g., advanced security or HPC skills.
Implement Gamified Recognition
- e.g., awarding digital badges or points for completing specific labs or passing certifications:
  - Ties in with internal comms celebrating achievements, boosting morale.
Align Training with Security & Cost Goals
- For instance, if cost optimisation is a priority, encourage staff to take relevant vendor cost management courses.
- If advanced security is crucial, highlight vendor security specialty paths.
Coordinate with GOV.UK Skills Framework
- Cross-check your roles and training paths with digital, data, and technology capability frameworks on GOV.UK.
- Possibly update job descriptions or performance metrics.

By mapping certifications to roles, regularly auditing skills, gamifying recognition, and aligning training with strategic objectives, you embed continuous cloud skill growth into your corporate culture—ensuring sustained readiness and compliance.

How to do better

Below are rapidly actionable ways to refine role-based training and self-assessment:

Integrate Self-Assessments into Performance Reviews
- Encourage staff to reference role-based metrics during appraisals:
  - e.g., “Achieved AWS Solutions Architect – Associate, aiming for Azure Security Engineer next.”
- Ties personal development to formal performance frameworks.
Provide “Skill Depth” Options
- Some staff may prefer broad multi-cloud knowledge, while others want deep specialisation in a single vendor:
  - e.g., a “multi-cloud track” vs. “AWS advanced track” approach.
Enable Peer Mentoring
- Pair junior staff who want a certain certification with an experienced internal mentor or sponsor.
- Encourages knowledge sharing, reinforcing your training culture.
Automate Role-Based Onboarding
- New hires get automatically assigned recommended learning modules or labs:
  - e.g., AWS Qwiklabs, Azure Hands-on Labs, GCP Quick Labs, or OCI hands-on labs that match their role.
Check Alignment with NCSC & NIST
- If security roles require advanced training, ensure it meets NCSC’s Cyber Essentials or advanced security training advice, or NIST SP 800-16 for role-based cybersecurity training.

By linking self-assessments to performance, diversifying skill tracks, enabling peer mentoring, and automating onboarding processes, you create a fully integrated environment where each role’s learning path is clear, self-directed, and aligned to organisational needs.

How to do better

Below are rapidly actionable suggestions to perfect an incentivised and assessed training program:

Tie Certifications to Mastery Projects
- In addition to passing exams, employees might complete real, in-house projects demonstrating they can apply those skills:
  - e.g., building a pilot serverless application or implementing end-to-end security logging using NCSC best practices.
Organise Internal “Training Sprints” or Hackathons
- e.g., a week-long challenge where staff pursue advanced certification labs together, culminating in recognition or prises.
Reward Mentors
- If staff help others achieve certifications, consider awarding them additional recognition or digital badges:
  - Encourages a culture of mentorship and upskilling.
Set Up Cross-Government Partnerships
- Share your approach with other public sector bodies, possibly hosting inter-department training events:
  - Align with GOV.UK community approaches to digital and data capability building.
Monitor ROI & Impact
- Track how training improvements affect cost optimisation, user satisfaction, or speed of service releases:
  - Present these metrics to leadership as evidence that the incentivised approach works.

By coupling incentives with real project mastery, hosting hackathons, rewarding mentors, forming cross-government partnerships, and measuring returns, you refine a world-class training program that fosters continual cloud skill advancement and directly benefits your public sector missions.

Keep doing what you’re doing, and consider writing up your training and certification successes, possibly in blog posts or internal case studies. Submit pull requests to this guidance or other public sector best-practice repositories so fellow UK organisations can follow your lead in creating robust cloud skill-building programs.

How important is cloud experience when hiring leaders, suppliers, and contractors? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to begin emphasizing cloud skills in hiring:

Add Cloud Awareness to Job Descriptions
- Even if not mandatory, mention “cloud awareness” or “willingness to learn cloud technologies” for relevant roles.
- Encourage upskilling referencing free training from AWS Skill Builder, Azure Microsoft Learn, GCP Skill Boost, or OCI Free Training, IBM Cloud free training
Encourage Current Staff to Share Cloud Knowledge
- If you have even one or two employees with cloud expertise, host internal lunchtime talks or short workshop sessions.
- Build a minor internal market for cloud knowledge so that future roles can specify these basic competencies.
Prepare for Cloud-Focused Future
- If you have a known modernisation program, consider building a pipeline of cloud-savvy talent:
  - Start by adding basic cloud competence to “desired” (not required) criteria in some new roles.
Reference NIST & NCSC Workforce Guidance
- For instance, NIST SP 800-181 National Initiative for Cybersecurity Education (NICE) Framework provides role-based skill guidelines, which can be extended to cloud roles.
- NCSC guidance on building a cloud-ready workforce can help formalise job competencies.
Short Internal Hackathons
- Let staff explore a simple cloud project, e.g., deploying a test app or serverless function.
- This stirs interest in cloud skills, naturally leading to job postings that mention them.

By introducing even minor cloud awareness requirements, providing internal knowledge sharing, referencing official frameworks, and organising small hackathons, you start shifting your hiring practices to future-proof your organisation’s cloud readiness.

How to do better

Below are rapidly actionable improvements:

Include Cloud Skills for Leadership
- For senior or executive positions that influence technology strategy, add “awareness of cloud architectures and security” to the job description.
- This aligns with modern public sector digital leadership standards.
Establish Clear Criteria
- Define which roles “must have,” “should have,” or “could have” cloud experience:
  - e.g., for a principal engineer or head of infrastructure, cloud experience is “must have.” For a data analyst, it might be “should have” or optional.
Collaborate with HR or Recruitment
- Ensure recruiters understand terms like “AWS Certified Solutions Architect,” “Azure DevOps Engineer Expert,” or “GCP Professional Cloud Architect.”
- They can better filter or source candidates if they know relevant cloud certifications or skill sets.
Assess Supplier Cloud Proficiency
- When contracting or hiring contingent labor, require them to demonstrate cloud capabilities (like having staff certified to a certain level).
- Reference NCSC supply chain security guidelines to set minimal standards for external vendors.
Offer Pathways for Internal Staff
- Provide existing employees an option to upskill into these “cloud-required” roles, reinforcing a culture of growth.
- Supports staff retention and aligns with NIST workforce development frameworks.

By adding leadership-level cloud awareness, clarifying role-based cloud criteria, ensuring recruiters or contingent labor providers understand these requirements, and offering internal upskilling, you create a more consistent approach that meets both immediate and long-term organisational needs.

How to do better

Below are rapidly actionable ways to advance beyond simple mandatory requirements:

Regularly Update Role Profiles
- As AWS, Azure, GCP, and OCI evolve, review job descriptions annually:
  - e.g., adding modern DevSecOps patterns, container orchestration, serverless, or big data capabilities.
Introduce Cloud Competency Levels
- e.g., “Level 1 – Cloud Foundations,” “Level 2 – Advanced Cloud Practitioner,” “Level 3 – Cloud Architect.”
- This ensures clarity about skill depth for each role, linking to vendor certifications.
Ensure Continuity & Succession
- Plan for staff turnover by establishing robust knowledge transfer processes, referencing NCSC workforce security advice.
- Minimises risk if a key cloud-skilled individual leaves.
Promote Multi-Cloud Awareness
- If your organisation uses more than one provider, encourage roles to include cross-provider or “cloud-agnostic” concepts:
  - e.g., Terraform, Kubernetes, or zero-trust security patterns relevant across AWS, Azure, GCP, or OCI.
Involve Senior Leadership
- Demonstrate how mandatory cloud experience in roles directly supports mission-critical public services, cost optimisation, or security compliance, building top-level buy-in.

By routinely revising DDaT role definitions to keep pace with evolving cloud tech, defining competency levels, planning continuity, encouraging multi-cloud knowledge, and securing leadership sponsorship, you firmly embed cloud skill requirements into your organisational DNA.

How to do better

Below are rapidly actionable methods to keep role definitions agile in a cloud-first IT organisation:

Periodically Revalidate Roles
- Introduce a yearly review cycle where HR, IT leadership, and line managers re-check if roles align with current cloud usage or new compliance mandates (like NIST SP 800-53 revision updates).
Provide Upgrade Path for Existing Staff
- Offer training or special secondments so staff who were initially on-prem can adapt to cloud:
  - e.g., an internal “cloud transformation bootcamp,” referencing AWS, GCP, Azure, or OCI training labs.
Embed Cloud in Performance Management
- Align staff appraisal or objective-setting with adoption of new cloud skills, cost-saving initiatives, or security improvements.
Create a Cloud Champion Network
- For each department, designate “cloud champions” who ensure local roles remain updated and can escalate new skill demands if usage evolves.
Follow GOV.UK or DDaT ‘Career Paths’
- Cross-check with official Digital, Data and Technology (DDaT) role definitions on GOV.UK to ensure your newly updated roles align with common public sector standards.

By systematically revalidating roles, offering staff training for on-prem to cloud transitions, linking performance metrics to cloud initiatives, and referencing official frameworks, you future-proof your team structures in a dynamic cloud landscape.

How to do better

Below are rapidly actionable methods to keep your fully cloud-oriented workforce thriving:

Nurture Advanced Specialisations
- Some roles may deepen knowledge in containers (Kubernetes), serverless, HPC, or big data analytics:
  - e.g., adopting advanced AWS, Azure, GCP, or OCI certifications for architecture, security, or data engineering.
Embed Continuous Learning
- Offer staff consistent updates, hack days, or vendor-led labs to adapt to new features quickly:
  - e.g., monthly community-of-practice sessions to discuss the latest cloud service releases or security advisories.
Encourage Cross-Organisational Collaboration
- Collaborate with other UK public sector bodies, sharing roles or secondment opportunities for advanced cloud experiences.
- This fosters a broader, more resilient talent pool across government.
Pursue International or R&D Partnerships
- If your department engages in cutting-edge projects or HPC research, consider co-innovation programs with cloud providers or academic institutions:
  - This might spin up entirely new specialised roles (AI/ML ops, HPC performance engineer, etc.).
Benchmark Against Leading Practices
- Leverage NCSC or NIST case studies to compare your staff skill frameworks with top-tier digital organisations.
- Conduct periodic audits on the relevance of your role definitions and skill requirements.

By encouraging advanced specialisations, sustaining continuous learning, collaborating with other public sector entities, pursuing co-innovation partnerships, and benchmarking against top-tier best practices, you maintain an extremely robust, cloud-first workforce strategy that evolves with emerging technologies and public sector demands.

Keep doing what you’re doing, and consider writing blog posts or internal knowledge base articles about your journey toward fully integrating cloud skills into hiring. Submit pull requests to this guidance or other public sector best-practice repositories, sharing lessons learned to help others adopt a comprehensive, future-ready cloud workforce strategy.

How do you choose suppliers and partners for cloud work? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond marketing-based selection:

Define Basic Technical and Security Criteria
- Before awarding a contract, ensure the supplier meets minimal security (e.g., ISO 27001) or compliance standards from NCSC’s cloud security guidelines.
- Check if they have relevant cloud certifications (e.g., AWS or Azure partner tiers).
Use Simple Supplier Questionnaires
- Ask about their experience with public sector, references for past cloud projects, and how they manage cost optimisation or data protection.
- This ensures more depth than marketing claims alone.
Check Real-Life Feedback
- Seek out reviews from other departments or local councils that used the same supplier:
  - e.g., informal networks, mailing lists, or digital communities of practice in the public sector.
Ensure They Can Align with GOV.UK Cloud First
- Ask if they understand government data classification, cost reporting, or typical NCSC compliance frameworks.
Plan an Incremental Engagement
- Start with a small pilot or short-term contract to validate their capabilities. If they prove reliable, expand the relationship.

By introducing a basic technical/security questionnaire, referencing real-life feedback, and piloting short engagements, you reduce reliance on marketing materials and ensure suppliers at least meet foundational public sector cloud requirements.

How to do better

Below are rapidly actionable methods to elevate from minimal compliance checks:

Evaluate Supplier Cloud Certifications
- Check if they hold AWS Partner tiers, Azure Expert MSP, GCP Premier Partner status, or OCI Specialised certifications:
  - AWS Partner Network levels, Azure Expert MSP status, GCP Partner Advantage tiers, OCI PartnerNetwork tiers.
Request Past Performance or Case Studies
- Ask for references from other UK public sector clients or comparable regulated industries.
- Prefer those who’ve demonstrated cost-saving or security success stories.
Incorporate Cloud-Specific Criteria in RFPs
- Beyond general compliance, request details on:
  - Cost optimisation approach, multi-region or multi-cloud experience, DevOps maturity, and NCSC’s 14 cloud security principles.
Conduct Briefing Sessions
- Invite top candidates to present their capabilities or do a short proof-of-concept:
  - This highlights who truly understands your departmental needs.
Ensure Contract Provisions for Exit and Risk
- If the supplier underperforms, you need a clear off-ramp or transition plan.
- Align with NIST best practices for exit strategies in cloud supply chain risk management.

By integrating cloud-specific partner certifications, verifying past performance, and adding mandatory contract clauses around risk and exit, you ensure your due diligence extends beyond basic compliance to real technical and operational aptitude.

How to do better

Below are rapidly actionable suggestions to refine moderate screening:

Request a Security & Architecture ‘Show Me’ Session
- Potential suppliers should demonstrate a typical architecture for a user story or scenario relevant to your environment:
  - e.g., how they configure a secure multi-tier application on AWS or Azure, referencing standard patterns from AWS Well-Architected, Azure Architecture Center, etc.
Evaluate Supplier DevSecOps Maturity
- Ask about their CI/CD pipeline, automated testing, or DevSecOps approach:
  - e.g., do they integrate SAST/DAST or infrastructure-as-code checks, referencing NCSC DevOps security advice?
Include Cost Management Criteria
- Suppliers should outline how they manage or optimise cloud spend:
  - Possibly referencing AWS Cost Explorer, Azure Cost Management, GCP Billing Alerts, or OCI Budgets.
- Helps ensure they won’t rack up unplanned expenses.
Check Multi-Region or DR Capabilities
- If resilience is key, ensure they’ve handled multi-region failovers or DR scenarios aligned with NIST SP 800-34 or NCSC resilience guidelines for continuity planning.
Formalise Weighted Scoring
- Allocate points for each requirement (experience, security alignment, cost management, references).
- This ensures an objective method to compare competing suppliers.

By pushing for real demonstrations of security/architecture, assessing DevSecOps maturity, reviewing cost management solutions, checking DR abilities, and using a weighted scoring system, you gain deeper insight into a supplier’s true capability and alignment with your goals.

How to do better

Below are rapidly actionable ways to expand a comprehensive evaluation:

Adopt a Custom Supplier Questionnaire
- Incorporate sections on:
  - Cloud competence (architecture patterns, security defaults), ethical labor practices, diversity and inclusion policies, environment-friendly operations.
- Align with NCSC’s supplier assurance for security aspects.
Verify Internal Code of Conduct Alignment
- Ensure the supplier’s approach to data privacy, anti-discrimination, or workforce conditions matches Civil Service code of conduct or relevant departmental codes.
Assess Cloud Roadmap Consistency
- Evaluate how the supplier’s technology roadmap or R&D investments align with your department’s future strategy:
  - e.g., multi-cloud, advanced ML/AI, zero-trust networking in line with NCSC zero-trust architecture guidance.
Engage in Pilot Co-Creation
- Where feasible, run a small PoC or co-innovation sprint with top candidates to see if they truly deliver under real conditions:
  - AWS PoC programs, Azure fast-track solutions, GCP pilot engagements, or OCI innovation labs.
Weight Sustainability in Procurement
- Incorporate a scoring element for green cloud operations, referencing vendor data on region-level carbon footprints or NCSC’s environment-friendly cloud usage tips.

By employing a custom questionnaire that includes ethical, environmental, and advanced cloud criteria, verifying code-of-conduct alignment, ensuring compatibility with your technical roadmap, piloting co-creation sprints, and weighting sustainability, you further refine the comprehensive evaluation for a well-rounded supplier selection process.

How to do better

Below are rapidly actionable ways to enhance strategic supplier selection:

Promote Multi-Year Collaboration
- Consider multi-year roadmaps with staged deliverables and built-in agility:
  - e.g., specifying review points for adopting new cloud services or ramping up HPC/ML capabilities when needed.
Publish Clear Risk Management Requirements
- Require suppliers to maintain a living risk register, shared with your security team, covering performance, security, and cost risks.
- Align with NCSC’s risk management approach.
Encourage Apprenticeships and Community Contributions
- Award extra points to suppliers who support local apprenticeships or sponsor digital skill-building in your region:
  - GOV.UK apprenticeship guidelines or digital skill initiatives.
Conduct Joint Business Reviews
- Schedule an annual or semi-annual leadership review session, focusing on:
  - Roadmap alignment, upcoming technology expansions, sustainability targets, and success stories to share cross-government.
Integrate ESG and Sustainability
- Evaluate how suppliers reduce carbon footprints in data center usage:
  - e.g., verifying providers’ renewable energy usage or referencing NCSC’s sustainability advice for cloud usage.

By defining multi-year collaborative roadmaps, embedding a shared risk register, incentivising apprenticeships or broader skill contributions, maintaining periodic leadership reviews, and factoring in sustainability metrics, you cultivate a strategic, mutually beneficial relationship with cloud suppliers. This ensures alignment with public sector values, security standards, and a visionary approach to digital transformation.

Keep doing what you’re doing, and consider writing some blog posts about your advanced supplier selection processes or opening pull requests to this guidance for others. By sharing how you integrate technical, ethical, and sustainability factors, you help other UK public sector organisations adopt strategic, future-focused cloud supplier qualification processes.

How do you help staff with little or no cloud experience move into cloud roles? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable ways to establish a baseline development path for new cloud learners:

Create a Simple Cloud Familiarisation Resource
- Gather free vendor tutorials:
- Provide a short list of recommended links to staff wanting to explore cloud concepts.
Encourage Self-Study
- Offer small incentives (e.g., internal recognition or minor expense coverage) if employees complete a fundamental cloud course.
- Even a simple certificate of completion fosters motivation.
Promote Internal Shadowing
- If you have at least one cloud-savvy colleague, arrange informal shadowing or pair sessions.
- This ensures staff with zero cloud background get exposure to real tasks.
Reference GOV.UK and NCSC
- Link staff to relevant GOV.UK digital and technology frameworks or basic NCSC cloud security advice.
Pilot a Tiny Cloud Project
- If budget or time is tight, propose a small, non-critical cloud POC. Staff with no cloud experience can attempt deploying a simple website or serverless function, building basic confidence.

By assembling free training resources, sponsoring small incentives, and facilitating internal shadowing or mini pilots, you kickstart a foundational path for employees to begin acquiring cloud knowledge in a low-cost, organic way.

How to do better

Below are rapidly actionable ways to strengthen basic on-the-job cloud training:

Define Simple Mentorship Guidelines
- Even if informally, specify a mentor’s role—e.g., conducting weekly check-ins, demonstrating best practices for provisioning, cost management, or security scanning.
Adopt a Buddy System for Cloud Tasks
- Pair a novice with a more experienced engineer on actual cloud tickets or incidents:
- Encourages learning through real-world problem-solving.
Introduce a Lightweight Skills Matrix
- Track essential cloud tasks (e.g., spinning up a VM, setting up logging, basic security config) and check them off as novices learn:
  - e.g., [AWS/Azure/GCP/OCI basics], referencing relevant vendor quickstarts.
Encourage Self-Paced Online Labs
- Provide access to some structured labs:
  - [AWS Hands-on labs, Azure Lab Services, GCP codelabs, or OCI labs], guiding novices step-by-step.
Celebrate Progress
- Recognise or reward staff who complete key tasks or mini-certs (like AWS Cloud Practitioner):
  - This fosters a positive culture around skill growth.

By structuring mentorship roles, ensuring novices participate in real tasks, tracking essential skills, adding lab-based self-study, and giving recognition, you can rapidly accelerate staff readiness and consistency in cloud ops.

How to do better

Below are rapidly actionable ways to enhance structured training/mentorship:

Formalise Cloud Learning Journeys
- e.g., for a DevOps role, define stepping stones from fundamental vendor certs to advanced specialisations:
  - AWS Solutions Architect -> SysOps -> Security, Azure Administrator -> DevOps Engineer, GCP Associate Engineer -> Professional Architect, etc.
Adopt Official Vendor Training Programs
- Microsoft’s Enterprise Skills Initiative, AWS Skills Guild, GCP Professional Services training, or Oracle University courses:
  - This can scale up in a structured manner, referencing NIST NICE framework for workforce skill mapping.
Establish Time Allocations
- Guarantee staff a certain number of hours per month for cloud labs, workshops, or self-paced learning:
  - Minimises conflicts with daily duties.
Integrate Real Projects into Training
- Let trainees apply new skills to an actual low-risk project, e.g., a new serverless prototype or a cost optimisation analysis:
  - Encourages practical retention.
Track & Reward Milestones
- Summarise achievements in quarterly stats: “Team X gained five new AWS Solutions Architect Associates.”
- Offer small recognition or career advancement alignment with Civil Service success profiles.

By defining clear cloud learning journeys, leveraging vendor training, scheduling dedicated study time, embedding real projects in the curriculum, and publicly recognising accomplishments, you foster a thriving environment for upskilling staff in cloud technologies.

How to do better

Below are rapidly actionable tips to refine integrated learning and development:

Formal Apprenticeship or Bootcamp
- Partner with recognised training providers:
  - e.g., AWS re/Start, Azure Academy, GCP JumpStart, or Oracle Next Education for more in-depth coverage.
- Ensure alignment with NCSC or NIST cybersecurity modules.
Set Clear Learning Roadmaps by Function
- For Dev, Ops, Security, Data roles—each has curated course combos, from fundamentals to specialised advanced topics:
  - This fosters structured progression.
Involve Senior Leadership Support
- Encourage exec sponsors to highlight success stories, attend final presentations of training cohorts, or discuss how these new skills align with departmental digital transformation goals.
Combine Internal & External Teaching
- Use a mix of vendor trainers, in-house subject matter experts, and third-party specialists for well-rounded instruction.
- This ensures staff see multiple perspectives.
Measure ROI
- Track cost savings, decreased deployment times, or increased user satisfaction from cloud projects led by newly trained staff:
  - Present these metrics in leadership reviews, justifying ongoing investment.

By implementing apprenticeship or structured bootcamp approaches, organising role-specific learning paths, ensuring leadership buy-in, blending internal and external expertise, and measuring ROI, you develop a truly comprehensive and outcome-driven cloud skill development program.

How to do better

Below are rapidly actionable ways to further refine your mature apprenticeship or bootcamp program:

Expand Specialist Tracks
- Develop advanced sub-tracks (e.g., HPC, AI/ML, Zero-Trust Security) for participants who excel at foundational cloud skills:
  - Align with vendor specialised training or NCSC/NIST security standards for deeper expertise.
Coordinate Multi-department Bootcamps
- Collaborate with local councils, NHS, or other government bodies to form a larger talent pool:
  - Shared labs, cross-government hackathons, or combined funding can scale impact.
Ensure Continuous Performance Assessments
- Conduct formal evaluations 6, 12, or 18 months post-bootcamp:
  - Checking advanced skill adoption, real project outcomes, and personal career growth.
Public Acknowledgment & Advancement
- Link successful completion to career progression or pay grade enhancements, referencing civil service HR frameworks or GOV.UK’s capability frameworks.
Incorporate Cost-Savings and ROI Proof
- Track how newly trained staff reduce external consultancy reliance, deliver projects faster, or improve security.
- Present data to leadership, ensuring sustained or increased budgets for these programs.

By launching specialised advanced tracks, fostering cross-department collaborations, performing ongoing performance assessments, integrating real career incentives, and measuring ROI, you secure a pipeline of skilled cloud professionals well-suited to public sector demands, maintaining a resilient workforce aligned with national digital transformation objectives.

Keep doing what you’re doing, and consider documenting your apprenticeship or bootcamp approaches in internal blog posts or knowledge bases. Submit pull requests to this guidance or other best-practice repositories so fellow UK public sector organisations can replicate your success in rapidly upskilling staff for cloud roles.

How much do you rely on third parties for cloud work? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable ways to reduce over-dependence on a single third party:

Retain Critical Access
- Designate at least one in-house staff member with admin or break glass rights, ensuring your organisation can still operate if the supplier is unavailable.
- Cloud providers typically support delegated access models:
Require Transparent Documentation
- Request the third party produce architecture diagrams, runbooks, and logs:
  - So your internal teams can reference them and step in if needed.
Set Clear SLAs and Security Requirements
- Stipulate compliance with NCSC’s cloud security principles, any relevant NIST frameworks, and cost accountability:
  - This helps ensure strong security posture and predictable budgeting.
Conduct Periodic Access Reviews
- Evaluate who has root-level or full access privileges. Revoke or reduce if not absolutely necessary:
  - Minimises the impact if the supplier or a contractor is compromised.
Begin In-House Skill Development
- While outsourcing can remain an option, create a roadmap for building minimal internal cloud literacy:
  - e.g., sponsor staff to complete fundamental vendor certs or attend free training from AWS Skill Builder, Azure Learn, GCP Skill Boost, or OCI Free Training.

By retaining critical admin access, demanding thorough documentation, setting rigorous SLAs, auditing access, and growing your internal skill base, you hedge against supplier lock-in or failure and maintain some sovereignty over crucial cloud operations.

How to do better

Below are rapidly actionable improvements:

Use Granular IAM Permissions
- Instead of giving suppliers full admin rights, adopt least privilege:
Create Supplier-Specific Accounts or Subscriptions
- Segment your cloud environment so suppliers only see or modify what’s relevant:
  - This helps contain damage if credentials leak or get misused.
Mandate Activity Logging & Auditing
- Configure [AWS CloudTrail, Azure Monitor, GCP Cloud Logging, OCI Audit] to track every privileged action:
  - Helps detect anomalies or investigate incidents quickly.
Conduct Scheduled Joint Reviews
- Align on cost management, architecture updates, security posture with the supplier monthly or quarterly:
  - e.g., use [AWS Trusted Advisor / Azure Advisor / GCP Recommender / OCI Advisor] to see if best practices are followed.
Plan for Possible Transition
- If you decide to reduce the supplier’s role in the future, ensure documentation or staff knowledge exist to avoid single-point dependencies.

By applying least privilege IAM, isolating supplier access, logging all privileged actions, collaborating on architecture/cost reviews, and planning for possible transitions, you maintain high security while leveraging external expertise effectively.

How to do better

Below are rapidly actionable ways to refine specialised third-party support:

Automate Break-Glass Processes
- e.g., storing break-glass credentials in a secure vault (like [AWS Secrets Manager, Azure Key Vault, GCP Secret Manager, OCI Vault]) requiring multi-party approval or temporary permission escalation.
Develop Clear Incident Protocols
- Document precisely when to invoke the supplier’s “emergency” access and how to revoke it once resolved:
  - e.g., reference NCSC incident management guidelines.
Perform Yearly Access Drills
- Simulate a scenario requiring supplier intervention:
  - Validate that the break-glass account retrieval process, notifications, and post-incident re-lock steps all work smoothly.
Enforce Accountability
- Keep robust logs of every action taken under break-glass credentials, analyzing for anomalies:
  - AWS CloudTrail, Azure Monitor, GCP Cloud Logging, OCI Audit with mandatory MFA for break-glass usage.
Periodic Skills Transfer
- Let external experts run short workshops, training sessions, or knowledge transfers:
  - e.g., HPC performance tuning, advanced DevSecOps, or AI/ML best practices—improving your team’s ability to handle issues without always relying on break-glass.

By automating break-glass credentials, establishing clear incident protocols, conducting annual drills, logging all privileged actions, and regularly upskilling staff with supplier-led sessions, you can maintain strong security while accessing specialised expertise only when needed.

How to do better

Below are rapidly actionable ways to leverage specialised knowledge further:

Add Read-Only or Auditor Roles
- If a supplier needs to see logs or metrics, create limited read-only access:
  - AWS IAM “Auditor” roles, Azure “Reader” role, GCP “Viewer” role, OCI “Read-Only” policy for compartments.
- This streamlines feedback without giving them admin powers.
Enable Collaborative Architecture Reviews
- Provide sanitised environment data or architecture diagrams for the supplier to review:
  - e.g., removing any sensitive info but enough detail to yield beneficial recommendations.
Request Proactive Security or Cost Analysis
- Possibly share cost usage dashboards (AWS Cost Explorer, Azure Cost Management, GCP Billing, OCI Cost Analysis) or security posture data so the supplier can offer suggestions.
Formalise Knowledge Transfer
- For each engagement, define deliverables like architectural guidelines, best-practice documents, or mini-lab sessions with staff.
- Ensures that specialised advice becomes actionable in-house expertise.
Regular Check-Ins and Feedback Loop
- If they have no direct access, schedule monthly or quarterly calls to review changing requirements or new services, referencing relevant NCSC or NIST updates on secure cloud operations.

By granting read-only roles for better collaboration, scheduling architecture or security reviews, requesting continuous cost/security analysis, and structuring knowledge transfers, you maximise the benefits of external specialists while maintaining tight control over your environment.

How to do better

Below are rapidly actionable ways to refine a minimal/augmentative third-party approach:

Maintain Partnerships Without Access
- Keep a list of vetted specialised vendors (e.g., HPC, big data, AI/ML, security) for future on-demand projects:
  - AWS HPC Competency or Data Analytics Competency partners, Azure HPC specialized consultancies, GCP ML specialized partners, OCI HPC experts.
Ensure Proper Documentation and Knowledge Transfer
- Whenever you briefly hire contingent staff, they must update runbooks, diagrams, or code repos:
  - Mitigates risk of “knowledge walkout.”
Incorporate Cross-Government Collaboration
- For advanced or new cloud initiatives, consider partnering with other public sector bodies first, exchanging staff or expertise:
  - e.g., short secondments or co-located sprints can accelerate learning while minimising external costs.
Benchmark Internal Teams Regularly
- Evaluate your staff’s readiness for new features, security approaches, or multi-cloud expansions.
- Use NCSC skill frameworks or NIST workforce standards to ensure coverage.
Public Sector Thought Leadership
- If you have minimal external dependencies, you likely have strong internal mastery—consider sharing success stories or best practices across local councils or GOV.UK communities of practice.

By maintaining a supplier list without granting them privileged access, enforcing thorough knowledge transfer, collaborating cross-government for specialised expertise, continuously benchmarking in-house capabilities, and showcasing your self-reliant approach, you preserve a high level of operational independence aligned with secure, cost-effective public sector cloud usage.

Keep doing what you’re doing, and consider writing about your strategies for third-party involvement in cloud initiatives or creating pull requests to this guidance. This helps other UK public sector organisations learn how to balance external expertise with robust internal control over their cloud environment.

What does success look like for your cloud team? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to establish at least minimal success criteria:

Identify Key Cloud Objectives
- E.g., reduce hosting costs by 10%, or migrate a pilot workload to AWS/Azure/GCP/OCI.
- Reference departmental priorities or NIST cloud computing frameworks for initial guidance.
Define Simple Metrics
- Examples: “Number of staff trained on fundamental cloud skills,” “Mean time to deploy a new environment,” “Basic cost usage reduction from month to month.”
Align with Leadership
- Present a short list of proposed success metrics to senior management for sign-off, ensuring these metrics reflect organisational or GOV.UK Cloud First policies.
Track Progress Visibly
- Use a shared dashboard or simple spreadsheet to record outcomes:
  - e.g., new workloads migrated, number of test passes, or cost changes.
Create a Baseline
- If you have no prior data, quickly measure current on-prem costs or the time it takes to provision infrastructure:
  - This baseline will contextualise progress in adopting cloud solutions.

By identifying basic cloud objectives, selecting simple metrics, confirming leadership support, tracking progress, and establishing a baseline, you move from undefined success to a workable system that can be refined as your team matures.

How to do better

Below are rapidly actionable steps to advance beyond PoC-based success:

Set PoC Transition Targets
- Define a timeline or conditions under which successful PoCs move into pilot production or scale to more workloads:
  - e.g., “If the PoC meets X performance criteria at Y cost, proceed to production by date Z.”
Establish Operational Metrics
- Expand criteria from “PoC completed” to performance, security, or user satisfaction metrics:
  - e.g., incorporate AWS Well-Architected Framework checks, Azure Advisor recommendations, or equivalent GCP/OCI best practices.
Involve Real End Users
- If feasible, let a pilot serve actual staff or a subset of public users:
  - Gains more meaningful feedback on feasibility or user experience.
Document & Share Learnings
- Produce a short “PoC to Production” playbook referencing GOV.UK service manual agile approach or NCSC cloud security principles.
Link PoCs to Organisational Goals
- Ensure each PoC addresses a genuine departmental need (like cost, user experience, or operational agility), so it’s not a siloed experiment.

By defining clear triggers for scaling PoCs, measuring advanced metrics, engaging real users, sharing lessons learned, and tying PoCs to broader goals, you accelerate from pilot outcomes to genuine organisational transformation.

How to do better

Below are rapidly actionable ways to refine production-based success criteria:

Track Key Operational Metrics
- e.g., Mean Time to Recovery (MTTR), cost per transaction, or user satisfaction scores:
  - Gather real-time data via AWS CloudWatch, Azure Monitor, GCP Cloud Logging/Monitoring, OCI Observability.
Integrate Security & Cost Efficiency
- Expand success definitions to include passing regular security scans (like AWS Inspector, Azure Defender for Cloud, GCP Security Scanner, OCI Security Advisor) or achieving cost baseline targets:
  - e.g., “90% of resources use auto-scaling and adhere to tagging policies referencing NCSC supply chain or security guidelines.”
Define a Full Lifecycle Approach
- Ensure pipelines for new features, rollbacks, or replacements are tested and documented:
  - Reduces risk of “stagnation” where workloads remain unoptimised once launched.
Share Achievements & Best Practices
- Show leadership how launching a new cloud app saved costs, improved uptime, or aligned with GOV.UK’s Cloud First policy.
Plan for Next Steps
- If a single workload is successful in production, identify the next logical workload or cost-saving measure to adopt:
  - e.g., serverless expansions, HPC jobs, advanced AI/ML adoption.

By incorporating operational metrics, weaving in security and cost success factors, ensuring a continuous pipeline approach, celebrating achievements, and planning further expansions, you create a robust definition of success that fosters ongoing improvements.

How to do better

Below are rapidly actionable strategies to further scale prototypes into core services:

Adopt Advanced HA/DR Strategies
- Implement multi-region or multi-availability zone approaches:
  - AWS multi-AZ databases and cross-region replication, Azure paired regions, GCP multi-regional storage, OCI cross-region replication.
- Ensures resilience for business-critical workloads.
Integrate Automated Security Testing
- If not already, embed scanning in CI/CD pipelines:
  - e.g., AWS CodeGuru Security or Amazon Inspector, Azure DevOps with GitHub Advanced Security, GCP Cloud Build with container scanning, OCI DevOps scanning integrations.
Quantify Impact
- Track cost savings, performance gains, or user satisfaction improvements from scaling cloud usage.
- Present these metrics to leadership or cross-government peers.
Develop or Refine Architectural Standards
- Document best practices for microservices, HPC, AI/ML, or data analytics workloads.
- Reference AWS Well-Architected, Azure Architecture Center, GCP Architecture Framework, OCI Reference Architectures.
Collaborate with Other Public Sector Entities
- If you’re delivering critical services, consider knowledge sharing or secondments with local councils, NHS, or central departments:
  - Aligned with GDS cross-government collaboration initiatives.

By adopting advanced resiliency and security, measuring impact thoroughly, standardising architectural approaches, and collaborating with other public sector bodies, you mature from simply scaling prototypes to robust, enterprise-level cloud service delivery.

How to do better

Below are rapidly actionable ways to continue improving innovation- and value-centric success criteria:

Adopt a Value Stream Approach
- Link each cloud initiative to a user-facing or operational outcome:
  - e.g., reducing form-processing time from days to minutes, or improving public web performance by X%.
- This ensures the entire pipeline, from idea to deployment, focuses on delivering measurable benefits.
Incorporate Cross-Organisational Goals
- For large departmental or multi-department programs, align success metrics to shared objectives:
  - e.g., joint cost savings, integrated citizen ID solutions, or unified data analytics capabilities.
Advance Sustainability Metrics
- Include environment-friendly cloud usage as part of success:
  - Checking region-level carbon footprints, or referencing NCSC’s sustainability in cloud usage tips.
- Encourages a green approach to innovation.
Enable Continuous Learning and Sharing
- Promote open blog posts or internal wiki pages detailing each new experiment’s results—whether success or failure.
- Encourages a virtuous cycle of rapid improvement.
Periodically Recalibrate Metrics
- As technology evolves, update or retire older success metrics (e.g., “time to spin up a VM” might be replaced by “time to deploy a new serverless function”), ensuring they stay relevant to strategic ambitions.

By implementing a value stream approach, embedding cross-organisational goals, focusing on sustainability, encouraging transparency in experiments, and periodically recalibrating metrics, your cloud team solidifies its role as a driver of innovation and public value creation. This ensures alignment with evolving public sector needs, best practices, and digital transformation objectives.

Keep doing what you’re doing, and consider writing blog posts about your success criteria or opening pull requests to this guidance so other public sector organisations can adopt or refine similar approaches to measuring and achieving cloud team success.

Do leadership support your move to the cloud? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable suggestions to secure at least minimal executive sponsorship:

Document Quick-Win Success Stories
- Show leadership how small pilots delivered cost savings, performance gains, or alignment with departmental digital goals.
- For instance, highlight a pilot serverless function that replaced an aging on-prem script.
Link Cloud Adoption to Organisational Mandates
- Identify how cloud usage aligns with GOV.UK Service Manual best practices, or how it might enhance operational resilience per NCSC guidelines.
Prepare a Simple Business Case
- Emphasize potential cost savings or improved agility:
  - referencing AWS/Azure/GCP/OCI cost management calculators or TCO tools, Azure Cost Management, GCP Cost Management, OCI Cost Management, IBM Cloud Cost Estimator.
Request a Brief Meeting with a Senior Sponsor
- Secure 15-30 minutes to share pilot results or near-term opportunities:
  - Stress risk of continuing without guidance from top leadership (e.g., security gaps, budget overruns, or duplication).
Offer an Executive-Level Intro
- Propose an hour-long cloud fundamentals overview for interested executives, possibly with vendor partner support or free training sessions.

By compiling quick-win stories, framing cloud adoption in organisational mandates, presenting a succinct business case, requesting a short meeting, and offering an executive primer, you begin building the case for at least baseline senior buy-in.

How to do better

Below are rapidly actionable steps to expand from senior management to full executive endorsement:

Demonstrate Departmental Wins
- Have senior managers publicise successful departmental cloud outcomes to executives:
  - e.g., a 20% cost reduction or improved user satisfaction in a pilot citizen-facing service.
Facilitate an Exec-Level Briefing
- Invite the CFO or CIO to a short session with the senior manager champion:
  - Outline potential broader savings or strategic opportunities aligned with GOV.UK digital transformation guidelines.
Align with Organisational Strategy
- Show how the cloud adoption aligns with published departmental or cross-government strategies, referencing NIST risk management or NCSC operational resilience advice.
Request Executive Sponsor for Large-Scale Migrations
- If you plan a major migration (like HPC, AI/ML, or large data center closure), propose a “sponsor” role for a top exec:
  - Encourages them to champion budget allocations and remove cross-department barriers.
Create a Vision Statement
- Collaborate with senior managers to draft a concise “cloud vision” for the next 1-3 years, referencing success metrics (cost, security posture, user satisfaction) to interest executives.

By showcasing departmental successes, hosting briefings with executives, integrating the initiative into overarching strategies, seeking an executive sponsor for large projects, and formalising a short vision statement, you steadily shift from partial senior sponsorship to broader top-level leadership buy-in.

How to do better

Below are rapidly actionable ways to leverage C-level sponsorship further:

Develop a Multi-Year Cloud Roadmap
- Collaborate with the C-level sponsor to define short, medium, and long-term goals:
  - e.g., incremental migrations, security enhancements, cost optimisation targets.
Establish Clear KPIs & Milestones
- For example, monthly or quarterly metrics:
  - Resource usage cost, user satisfaction, time-to-deploy improvements, referencing AWS/Azure/GCP/OCI cost management dashboards, Azure Cost Management, GCP Cost Management, OCI Cost Management.
Ensure Inter-Departmental Collaboration
- The sponsor can champion cross-department synergy:
  - e.g., merging data streams for analytics, universal security guidelines per NCSC cloud security or NIST risk management frameworks.
Embed Security as a First-Class Concern
- Since you have top-level support, request integrated DevSecOps tooling and compliance checks from day one:
  - e.g., AWS CodeGuru or Security Hub, Azure DevOps security add-ons, GCP Cloud Build scanning, OCI security advisor services.
Highlight Public Sector Success
- Encourage your sponsor to share wins at internal leadership summits or cross-gov conferences, fostering further executive-level peer collaboration.

By crafting a multi-year roadmap, specifying meaningful KPIs, promoting cross-department synergy, embedding robust security from the start, and publicising achievements, you realise the full benefits of C-level sponsorship—driving cohesive, secure, and strategic cloud adoption.

How to do better

Below are rapidly actionable ways to strengthen a comprehensive C-level sponsorship with a strategic roadmap:

Involve Staff in Roadmap Updates
- Host quarterly open sessions where devs, ops, or security can give feedback on the strategic plan:
  - Encourages buy-in and surfaces practical constraints.
Institute a Cloud Steering Committee
- Form a cross-functional group with representation from finance, HR, security, architecture, and user departments:
  - They meet regularly to track progress, share challenges, and drive adjustments in the roadmap.
Focus on Advanced Migrations or Services
- Tackle HPC, advanced analytics, or AI/ML adoption:
  - Possibly referencing HPC or ML best practices from AWS HPC Competency, Azure HPC, GCP AI Platform, OCI HPC solutions.
Integrate Multi-Cloud or Hybrid Considerations
- If the roadmap suggests multi-cloud or a hybrid approach:
  - Align with NCSC multi-cloud security best practices or NIST guidelines on hybrid architecture risk management.
Publish Success Metrics
- Show top-level achievements or cost savings in staff newsletters or a leadership dashboard:
  - Reinforces organisational momentum for the roadmap.

By updating the roadmap collaboratively, establishing a cloud steering committee, venturing into advanced HPC/AI/ML, acknowledging multi-cloud/hybrid scenarios, and publicising success metrics, you deepen the synergy and accountability behind your cloud adoption plan—leading to dynamic, well-supported progress.

How to do better

Below are rapidly actionable ways to continuously strengthen a cloud-first culture under comprehensive C-level sponsorship:

Scale Innovation Hubs
- If you have a center of excellence or an innovation lab, extend its scope to HPC, AI/ML, IoT, or advanced HPC:
  - e.g., adopt HPC solutions from AWS, Azure, GCP, OCI, incorporating domain specialists.
Open Source & Share
- Encourage teams to open-source relevant code or automation, participating in cross-government communities:
  - fosters broader knowledge exchange, referencing GOV.UK open source policy.
Enable Real-Time Security & Compliance
- Consolidate compliance with AWS Control Tower, Azure Policy, GCP Organisation Policy, OCI Security Zones for frictionless guardrails:
  - ensures staff can spin up resources rapidly without jeopardising security or policy compliance.
Track Cloud Maturity Beyond Tech
- Evaluate cultural aspects: e.g., dev empowerment, cost accountability, user feedback loops.
- Revisit or revise success criteria every 6-12 months.
Recognise and Reward Cloud Champions
- Publicly celebrate individuals or squads who pioneer new solutions, demonstrate cost savings, or deliver advanced workloads in HPC or serverless.

By scaling innovation hubs, open-sourcing solutions, implementing real-time compliance guardrails, tracking maturity across cultural dimensions, and publicly recognising cloud champions, you cement a thriving, cloud-first culture that embraces experimentation, security, and strategic public sector outcomes.

Keep doing what you’re doing, and consider publishing blog posts or opening pull requests to share your experiences in fostering a cloud-first mindset under strong executive sponsorship. This helps others in the UK public sector replicate or learn from your advanced leadership-driven cloud transformation.

Security

How do you manage accounts used by software, not people? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to enhance service account security beyond basic user/pass credentials:

Use Cloud-Native IAM for Service Accounts
- Instead of creating user credentials, define service accounts with least privilege:
Adopt a Central Secret Manager
- Store credentials securely in:
- Reduces plaintext password usage, enabling future rotation.
Automate Rotation
- If you must keep user/pass-based secrets temporarily, implement at least monthly or quarterly rotations:
  - Minimises window of exposure if leaked.
Reference NCSC & NIST
- Follow NCSC’s Identity and Access Management principles or NIST SP 800-53 Access Controls (AC-3, AC-6, etc.).
- Ensures alignment with recommended identity hygiene.
Plan for Future Migration
- Target short-lived tokens or IAM role-based approaches as soon as feasible, phasing out permanent user credentials for non-human accounts.

By employing a secure secret manager, rotating basic credentials, and gradually moving to role-based or short-lived tokens, you significantly reduce the risk associated with static user/password pairs for service accounts.

How to do better

Below are rapidly actionable ways to move beyond static API keys:

Store Keys in a Central Secret Manager
- e.g., AWS Secrets Manager or AWS SSM Parameter Store with encryption, Azure Key Vault with RBAC controls, GCP Secret Manager with IAM-based access, OCI Vault with KMS encryption.
- Avoid embedding keys in code or config files.
Automate API Key Rotation
- Implement a rotation schedule (e.g., monthly or quarterly) or on every deployment:
  - Reduces the window if a key is leaked.
Consider IAM Role or Token-Based Alternatives
- Where possible, use short-lived tokens or ephemeral credentials to reduce static API key usage:
  - e.g., AWS IAM roles with STS, Azure Managed Identities, GCP Service Accounts short-lived tokens, OCI dynamic groups/tokens.
Limit Scopes
- If you must rely on an API key, ensure it has the narrowest possible permissions, referencing NCSC’s least-privilege principle.
Log & Alert on Key Usage
- Enable logs that track API calls with each key, setting alerts for unusual activity:
  - e.g., AWS CloudTrail, Azure Monitor, GCP Cloud Logging, OCI Audit Logs integrated with anomaly detection.

By centrally managing keys, rotating them automatically, transitioning to role-based or token-based credentials, enforcing least privilege, and auditing usage, you substantially reduce the security risk associated with static API keys.

How to do better

Below are rapidly actionable ways to refine a centralised secret store with partial rotation:

Extend Rotation to All or Most Credentials
- If some are still static, define a plan for each credential’s rotation frequency:
  - e.g., monthly or upon every production deployment.
Build Automated Pipelines
- Integrate secret retrieval or rotation scripts into your CI/CD:
  - e.g., AWS CodePipeline retrieving from Secrets Manager, Azure DevOps tasks pulling from Key Vault, GCP Cloud Build with Secret Manager, OCI DevOps pipeline with Vault integration.
Enforce Access Policies
- Use AWS IAM policies, Azure RBAC, GCP IAM, OCI compartments to strictly control who can read, update, or rotate secrets.
- Reference NCSC’s least-privilege principle for secret operations.
Combine with Role-Based Authentication
- Shift away from credential-based if possible, using ephemeral roles or instance-based authentication for certain services:
  - e.g., AWS STS assume-role approach, Azure Managed Identities, GCP service account short-lived tokens, OCI dynamic groups/tokens.
Monitor for Stale or Unused Secrets
- Regularly check your secret store for credentials not accessed in a while or older than a certain rotation threshold:
  - helps avoid accumulating outdated secrets.

By expanding automated rotation, integrating secret retrieval into pipelines, enforcing tight access controls, adopting role-based methods for new services, and cleaning stale secrets, you further strengthen your centralised secret store approach for secure, efficient credential management.

How to do better

Below are rapidly actionable ways to improve your mTLS-based authentication approach:

Short-Lived Certificates
- Automate certificate issuance and renewal:
  - e.g., AWS Private CA with automated renewal, Azure Key Vault certificates, GCP Certificate Authority Service, OCI Certificates service.
- Minimises risk if a certificate is compromised.
Adopt a Service Mesh
- If using microservices in Kubernetes, incorporate [Istio, Linkerd, or AWS App Mesh, Azure Service Mesh, GCP Anthos Service Mesh, OCI OKE integrated mesh] to handle mTLS automatically:
  - Enforces consistent policies across services.
Implement Strict Certificate Policies
- E.g., no wildcard certs for internal services, clear naming or SAN usage, referencing NCSC certificate issuance best practices.
Monitor for Expiry and Potential Compromises
- Track certificate expiry dates, set alerts well in advance.
- Log all handshake errors in AWS CloudWatch, Azure Monitor, GCP Logging, OCI Logging to detect potential mismatches or malicious attempts.
Combine with IAM for Additional Controls
- For advanced zero-trust, complement mTLS with role-based or token-based checks:
  - e.g., verifying principal claims in addition to cryptographic identities.

By employing short-lived certs, possibly using a service mesh, establishing strict certificate policies, continuously monitoring usage, and optionally layering further IAM or token checks, you maximise the security benefits of mTLS for your service accounts.

How to do better

Even at this top level, below are rapidly actionable refinements:

Leverage Vendor Identity Federation Tools
- e.g., [AWS IAM roles with Web Identity Federation or AWS Secure Token Service, Azure AD token issuance, GCP IAM federation, OCI Identity Federation with IDCS], ensuring minimal friction for ephemeral tokens.
Integrate Policy-as-Code
- Tools like [Open Policy Agent or vendor policy engines (AWS SCP, Azure Policy, GCP Organisation Policy, OCI Security Zones)] can dynamically evaluate each identity request in real time.
Adopt Service Mesh with Dynamic Identity
- In container or microservice architectures, pair ephemeral identity with a service mesh that injects secure tokens automatically.
Continuously Audit and Analyze Logs
- Check usage patterns: any suspicious repeated token fetch or abnormal expansions of privileges.
- Tools like AWS CloudWatch Logs, Azure Monitor, GCP Logging, OCI Monitoring + ML-based anomaly detection can highlight anomalies.
Cross-Government Federated Services
- If multiple agencies need to collaborate, explore cross-government single sign-on or identity federation solutions that comply with GOV.UK’s identity and digital standards.

By fully harnessing vendor identity federation, embedding policy-as-code, integrating ephemeral identity usage in service meshes, analyzing usage logs for anomalies, and considering cross-government identity solutions, you refine an already highly secure and agile environment for non-human service accounts aligned with best-in-class public sector practices.

Keep doing what you’re doing, and consider publishing blogs or opening pull requests to this guidance about your success in elevating non-human identity security in cloud environments. Sharing your experiences helps other UK public sector organisations adopt robust credential management aligned with the highest security standards.

How does your organisation manage user identities and authentication? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable suggestions to introduce at least a minimal level of identity governance:

Define a Basic Password/Passphrase Policy
- For instance, require passphrases of at least 14 characters, no enforced complexity that leads to repeated password re-use.
- Consult NCSC’s password guidance for recommended best practices.
Centralise Authentication for Cloud Services
- Use vendor-native IAM or single sign-on capabilities:
Start Logging Identity Events
- At a minimum, enable auditing of logins, password resets, or privilege changes:
  - AWS CloudTrail, Azure Monitor, GCP Cloud Logging, OCI Audit Logs
- This ensures you have some data to reference if suspicious activity occurs.
Establish a Simple Governance Policy
- Even a one-page policy stating password length, no shared accounts, and periodic user review is better than nothing.
- Possibly incorporate NIST SP 800-53 AC-2 controls for account management.
Plan for Incremental Improvement
- Mark out a short timeline (e.g., 3-6 months) to adopt multi-factor authentication for privileged or admin roles next.

By introducing a foundational password policy, centralising authentication, enabling basic identity event logging, creating a minimal governance document, and scheduling incremental improvements, you’ll rapidly move beyond ad hoc practices toward a more secure, consistent approach.

How to do better

Below are rapidly actionable ways to automate and strengthen your identity policy enforcement:

Deploy Automated Audits
- For each cloud environment, enable identity-related checks:
Enforce Basic MFA for Privileged Accounts
- For all admin or highly privileged roles, mandate multi-factor authentication:
  - NCSC strongly recommends multi-factor for accounts with elevated access.
Establish Self-Service or Automated Access Reviews
- Implement a monthly or quarterly identity review:
  - e.g., a simple emailed listing of who has what roles, requiring managers to confirm or revoke access.
Adopt Single Sign-On (SSO)
- Use a single IdP (Identity Provider) for all cloud accounts, e.g.:
  - AWS SSO or Azure AD SSO solutions, GCP Identity Federation, or OCI integration with enterprise IdPs.
- This reduces manual overhead and password sprawl.
Store Policies & Logs in a Central Repo
- Keep your identity policy in version control and track changes:
  - Ensures updates are documented, and staff can reference them easily, aligning with GOV.UK policy transparency norms.

By automating audits, enforcing MFA, implementing automated access reviews, consolidating sign-on, and centralising policy documentation, you move from manual enforcement to a more efficient, consistently secure identity posture.

How to do better

Below are rapidly actionable ways to progress toward advanced identity automation:

Expand MFA Requirements to All Users
- If only privileged users have 2FA, consider rolling out to all staff or external collaborators:
  - e.g., AWS, Azure, GCP, OCI support TOTP apps, hardware security keys, or SMS as fallback (not recommended if higher security needed).
Use Role/Attribute-Based Access
- For each environment (AWS, Azure, GCP, OCI), define roles or groups with appropriate privileges:
  - Minimises the risk of over-privileged accounts, referencing NCSC’s least privilege principle.
Consolidate Identity Tools
- If you’ve multiple sub-accounts or subscriptions, unify identity management via:
  - AWS Organizations + AWS SSO or Azure AD tenant-level controls, GCP Organization IAM, OCI compartments/policies with IDCS integration.
Integrate Automated Deprovisioning
- Tie identity systems to HR or staff rosters, automatically disabling accounts when a staff leaves or changes roles:
  - e.g., Azure AD user provisioning to SaaS apps, AWS SCIM integration, GCP Directory sync, OCI integration with enterprise directories.
Enhance Monitoring & Alerting
- Add real-time alerts for suspicious identity events:
  - e.g., multiple failed logins, sudden role escalations, or new key creation.

By extending MFA to all, embracing role-based access, consolidating identity management, automating deprovisioning, and boosting real-time monitoring, you achieve more robust, near-seamless identity automation aligned with best practices for public sector security.

How to do better

Below are rapidly actionable steps to elevate advanced identity management:

Adopt Conditional Access or Policy-based Access
- e.g., AWS IAM condition keys, Azure Conditional Access, GCP Access Context Manager, OCI IAM condition-based policies:
  - Restrict or grant access based on device compliance, location, or time-based rules.
Incorporate Just-In-Time (JIT) Privileges
- For admin tasks, require users to elevate privileges temporarily:
  - e.g., AWS IAM Permission boundaries, Azure Privileged Identity Management, GCP short-lived access tokens, OCI dynamic roles with short-lived credentials.
Monitor Identity with SIEM or Security Analytics
- e.g., [AWS Security Hub, Azure Sentinel, GCP Security Command Center, OCI Logging Analytics] for real-time anomaly detection or advanced threat intelligence:
  - Ties into your identity logs to detect suspicious patterns.
Engage in Regular “Zero-Trust” Drills
- Simulate partial network compromises to test if identity-based controls alone can protect resources:
  - referencing NCSC zero trust architecture advice or NIST SP 800-207.
Promote Cross-Government Identity Standards
- If relevant, align with or propose solutions for single sign-on across multiple agencies to streamline staff movements within the public sector:
  - e.g., exploring GOV.UK One Login or similar cross-government identity initiatives.

By implementing conditional or JIT access, leveraging robust SIEM-based identity monitoring, holding zero-trust scenario drills, and sharing identity solutions across the public sector, you further strengthen an already advanced identity environment.

How to do better

Even at the apex, below are rapidly actionable ways to further optimise:

Multi-Cloud Single Pane IAM
- If you use multiple cloud providers, unify them under a single identity provider or a cross-cloud identity framework:
  - e.g., Azure AD for AWS + Azure + GCP roles, or a third-party IDaaS solution with robust zero-trust policies.
Advanced Risk-Based Authentication
- Leverage vendor AI to detect unusual login behavior, then require step-up (MFA or manager approval):
  - AWS Cognito or Azure AD Conditional Access with risk-based sign-on, GCP Identity-Aware Proxy risk signals, OCI adaptive security policies.
Adopt Policy-as-Code for Identity
- Tools like [Open Policy Agent or vendor policy frameworks (AWS Organizations SCP, Azure Policy, GCP Organization Policy, OCI Security Zones)] to define identity controls in code:
  - Facilitates versioning, review, and auditable changes.
Extend 2FA to Cross-Government Collaboration
- If staff from other agencies frequently collaborate, integrate cross-department SSO or federated identity, referencing GOV.UK single sign-on possibilities or multi-department IAM bridging solutions.
Publish Regular Identity Health Reports
- Summaries of user activity, stale accounts, or re-certifications. Encourages transparency and fosters trust in your identity processes.

By unifying multi-cloud identity, implementing advanced risk-based authentication, using policy-as-code for identity controls, expanding cross-government 2FA, and regularly reporting identity health metrics, you maintain a cutting-edge identity management ecosystem. This ensures robust security, compliance, and agility for your UK public sector organisation in an evolving threat environment.

Keep doing what you’re doing, and consider writing up your experiences, success metrics, or blog posts on advanced identity management. Contribute pull requests to this guidance or other best-practice repositories so fellow UK public sector entities can learn from your identity management maturity journey.

How do you make sure people have the right access for their role? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to transition from ad-hoc reviews to basic structured processes:

Define a Minimal Access Policy
- Even one page stating all roles must have least privilege, with approvals required for additional rights.
- Reference NCSC’s Access Management best practices.
Create a Simple RACI for Access Management
- Identify who is Responsible, Accountable, Consulted, and Informed for each step (e.g., granting, revoking, auditing).
- Helps clarify accountability if something goes wrong.
Leverage Built-In Cloud IAM Tools
- AWS IAM, Azure RBAC, GCP IAM, OCI IAM compartments/policies can define or limit privileges.
- Minimises guesswork in manual permission assignments.
Maintain a Basic User Inventory
- Keep a spreadsheet or list of all privileged users, what roles they have, and last update date:
  - So you can spot dormant accounts or over-privileged roles.
Plan for Periodic Checkpoints
- Commit to a small monthly or quarterly access sanity check with relevant admins, reducing overlooked issues over time.

By laying out a minimal access policy, assigning RACI for administration, adopting cloud-native IAM, maintaining a simple user inventory, and scheduling monthly or quarterly check-ins, you’ll quickly improve from ad-hoc reviews to a more reliable approach.

How to do better

Below are rapidly actionable ways to evolve beyond limited-action reviews:

Mandate a “Test Before Revoke” Procedure
- If concerns about “breaking something” hinder revocations, adopt a short test environment to confirm the user or system truly needs certain permissions.
Categorise Users by Risk
- For high-risk roles (e.g., admin accounts with access to production data), enforce stricter reviews or more frequent re-validation:
  - Potentially referencing AWS IAM Access Analyzer, Azure AD Access Reviews, GCP’s IAM Recommender, OCI IAM tools.
Implement Review Dashboards
- Summarise each user’s privileges, last login, or role usage:
  - If certain roles are not used in X days, consider removing them.
Show Leadership Examples
- Have a pilot case where you successfully reduce access for a role with no negative consequences, building confidence.
Incentivise or Recognise Proper Clean-Up
- Acknowledge teams or managers who diligently remove no-longer-needed permissions:
  - Encourages a habit of safe privilege reduction.

By adopting test environments before revoking privileges, classifying user risk levels, building simple dashboards, demonstrating safe revocations, and recognising best practices, you reduce hesitancy and further align with security best practices.

How to do better

Below are rapidly actionable steps to incorporate permission reduction:

Implement a “Use it or Lose it” Policy
- If a user’s permission or role is unused for a set period (e.g., 30 days), it’s automatically flagged for removal:
  - Tools like AWS IAM Access Analyzer, Azure AD Access Reviews, GCP IAM Recommender, or OCI IAM metrics can show which roles are not used.
Mark Temporary Access with Expiry
- For short-term projects, set an end date for extra privileges:
  - e.g., using AWS or Azure policy conditions, GCP short-lived tokens, or OCI compartments-based ephemeral roles.
Combine with Slack/Teams Approvals
- Automate revocation requests: if an admin sees stale permissions, they click a button to remove them, and a second manager approves:
  - Minimises fear of accidental breakage.
Reward “Right-sizing”
- Celebrate teams that proactively reduce permission sprawl, referencing cost savings or risk reduction:
  - e.g., mention in staff newsletters or internal security updates.
Refine Review Frequency
- If reviews are monthly or quarterly, consider stepping up to weekly or adopting a continuous scanning approach for business-critical accounts.

By adding a usage-based revocation policy, setting expiry for short-lived roles, integrating quick approval workflows, recognising teams that successfully remove unused privileges, and potentially increasing review frequency, you shift from additive-only changes to an environment that truly enforces minimal privileges.

How to do better

Below are rapidly actionable methods to enhance expiry-based reviews:

Use Cloud-Native Access Review Tools
- e.g., AWS IAM Access Analyzer, AWS SSO user provisioning with rotation, Azure Access Reviews in Azure AD, GCP IAM Recommender with time-based checks, OCI IAM compartments with automated policy review triggers.
- Minimises manual overhead.
Adopt Automated Alerts for Upcoming Expiries
- If a role is nearing its expiry date, the user and manager receive an email or Slack notice to re-certify or let it lapse.
Incorporate Risk Scoring
- If an account has high privileges or sensitive system access, require more frequent or thorough re-validation:
  - e.g., monthly for privileged accounts, quarterly for standard user roles.
Implement Delegated Approvals
- For major role changes, define a short chain (e.g., a user’s manager + security lead) to re-approve before extension of privileges.
- Align with NCSC’s supply chain or internal access control best practices.
Maintain Audit Trails
- Store logs of who re-approved or revoked each role, referencing AWS CloudTrail, Azure Monitor, GCP Logging, or OCI Audit logs.
- Demonstrates compliance if audited, per GOV.UK or departmental policies.

By leveraging cloud-native review tools, alerting for soon-to-expire roles, risk-scoring high-privilege accounts for more frequent checks, implementing delegated re-approval processes, and storing thorough audit trails, you maintain an agile, secure environment aligned with best practices.

How to do better

Below are rapidly actionable ways to refine a fully automated, risk-based review system:

Incorporate Real-Time Risk Signals
- E.g., require additional verification for suspicious location logins or rapidly changing user behaviors:
  - AWS Macie or GuardDuty alerts, Azure AD Identity Protection, GCP Security Command Center anomaly detection, OCI Security Advisor signals.
Use Policy-as-Code for Access
- Tools like [Open Policy Agent or vendor-based solutions (AWS Organizations SCP, Azure Policy, GCP Organization Policy, OCI Security Zones)] can define rules for dynamic role allocation.
Ensure Continuous Oversight
- Provide dashboards for leadership or security officers, showing current risk posture, overdue re-certifications, or upcoming role changes:
  - Minimises the chance of an overlooked anomaly.
Extend to Multi-Cloud or Hybrid
- If your department spans AWS, Azure, GCP, or on-prem systems, unify identity reviews under a single orchestrator or Identity Governance tool:
  - e.g., Azure AD Identity Governance, Okta, Ping, etc. with multi-cloud connectors.
Cross-Government Sharing
- Publish a success story or best-practice playbook so other agencies can replicate your automated approach, aligning with GOV.UK digital collaboration initiatives and NCSC supply chain security best practices.

By integrating real-time risk analysis, employing policy-as-code for dynamic role assignment, offering continuous oversight dashboards, supporting multi-cloud/hybrid scenarios, and sharing insights across government bodies, you further refine an already advanced, automated identity review system. This ensures minimal security risk and maximum agility in the public sector context.

Keep doing what you’re doing, and consider publishing blog posts or making pull requests to this guidance about your advanced access review processes. Sharing experiences helps other UK public sector organisations adopt similarly robust, automated solutions for managing user permissions.

How do you create and manage user accounts for cloud systems? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond shared/manual accounts:

Eliminate Shared Accounts
- Mandate each user has an individual account, referencing NCSC’s identity best practices.
- This fosters actual accountability and compliance with typical public sector guidelines.
Set Up Basic IAM
- Use vendor-native identity tools to define unique accounts, e.g.:
Document a Minimal Process
- Write a short policy on how to add or remove users, referencing NIST SP 800-53 AC controls.
Enable Basic Audit Logging
- Turn on logs for sign-in or role usage in each cloud environment:
  - AWS CloudTrail, Azure Activity Log, GCP Cloud Logging, OCI Audit Log, IBM Cloud Monitoring & IBM Cloud Logs
- Identifies who does what in the system.
Move to a Single Sign-On Approach
- Plan to adopt SSO with a single user directory in the next phase:
  - Minimises manual overhead and ensures consistency.

By ensuring each user has an individual account, using vendor IAM for creation, documenting a minimal lifecycle process, enabling audit logging, and preparing for SSO, you remedy the major pitfalls of shared/manual account approaches.

How to do better

Below are rapidly actionable steps to unify and automate your on-prem identity repository with cloud systems:

Enable Federation or SSO
- e.g., AWS Directory Service + AD trust, Azure AD Connect, GCP Cloud Identity Sync, OCI Identity Federation with AD/LDAP:
- Minimises manual user creation in each cloud service.
Deploy Basic Automation Scripts
- If a full federation is not possible immediately, create scripts that read from your directory and auto-provision or auto-delete accounts in cloud:
  - e.g., using vendor CLIs or REST APIs.
Standardise User Roles
- For each cloud environment, define roles that map to on-prem groups:
  - e.g., “Developer group in AD -> Dev role in AWS.”
- Ensures consistent privileges across systems, referencing NCSC’s least-privilege principle.
Implement a Scheduled Sync
- Regularly compare your on-prem directory with each cloud environment to detect orphaned or mismatch accounts.
- Could be monthly or weekly initially.
Transition to Identity Provider Integration
- If feasible, shift to a modern IDP (Azure AD, Okta, GCP Identity, etc.) so manual processes fade out:
  - This might also meet NIST guidelines on cross-domain identity management (SCIM, etc.).

By federating or automating the sync between your directory and cloud, standardising roles, scheduling periodic comparisons, and eventually adopting a modern identity provider, you gradually remove manual friction and potential security gaps.

How to do better

Below are rapidly actionable ways to refine standard identity management:

Require SSO or Federation for All Services
- For new cloud apps, mandate SAML/OIDC/SCIM compliance:
  - e.g., AWS SSO integration, Azure AD enterprise apps, GCP Identity-Aware Proxy, OCI IDCS federation.
Implement Access Workflows
- Use built-in or third-party approval workflows:
  - e.g., Azure AD Privileged Identity Management, AWS SSO access request workflows, GCP Identity Groups, OCI workflow integrations.
- Ensures no direct admin changes bypass the standardised process.
Continuously Evaluate Cloud Services
- Maintain a whitelist of services that meet your identity standards:
  - If a service can’t integrate with SSO or can’t match your password/MFA policies, strongly discourage its use.
Include Role Mapping in a Central Catalog
- Publish a short doc or portal mapping each standard role to corresponding cloud privileges, referencing NCSC’s RBAC best practice.
Expand Logging & Alerting
- If your identity bridging sees repeated login failures, quickly alert managers or security teams:
  - e.g., AWS Security Hub, Azure Sentinel, GCP SCC, OCI security logs integrated with SIEM tools.

By enforcing SSO/federation for all services, deploying structured access workflows, continuously evaluating new cloud offerings, documenting role-to-privilege mappings, and bolstering security alerting, you ensure consistent, secure user identity alignment across your cloud ecosystem.

How to do better

Below are rapidly actionable ways to reinforce automated federated identity:

Adopt Short-Lived Credentials
- e.g., ephemeral tokens from your IDP for each session, referencing AWS STS, Azure AD tokens, GCP short-lived tokens, OCI dynamic tokens.
- Reduces standing privileges.
Implement Policy-as-Code for Identity
- Use Open Policy Agent or vendor-based solutions (AWS SCP, Azure Policy, GCP Org Policy, OCI Security Zones) to define identity governance in code, ensuring version-controlled and auditable changes.
Add Real-Time Security Monitoring
- If a user tries to access a new or high-risk service, enforce additional checks:
  - e.g., multi-factor step-up, manager approval, location-based restrictions.
Integrate Cross-department SSO
- If staff frequently collaborate across multiple public sector agencies, explore cross-government identity solutions:
  - e.g., bridging Azure AD tenants or adopting solutions that unify NHS, local council, or central government credentials.
Review & Update Roles Continuously
- Encourage monthly or quarterly role usage analyses, removing unneeded entitlements automatically:
  - Minimises risk from leftover privileges.

By adopting short-lived credentials, storing identity policy in code, enabling real-time security checks, exploring cross-department SSO, and continuously reviewing role usage, you transform a solid federation setup into a robust and adaptive identity ecosystem.

How to do better

Below are rapidly actionable ways to refine an already unified, cloud-based identity approach:

Implement Passwordless or Phishing-Resistant MFA
- e.g., FIDO2 security keys, Microsoft Authenticator passwordless, or AWS hardware MFA tokens, GCP Titan Security Keys, OCI-based FIDO solutions to further reduce credential compromise risks.
Add Dynamic Risk Scoring
- Use advanced AI to evaluate user login contexts:
  - e.g., abnormal location, device compliance checks, referencing Azure AD Identity Protection or AWS Identity anomaly detection, GCP security analytics, OCI risk-based authentication features.
Extend Identity to Third-Party Collaboration
- If outside contractors or multi-department teams need access, enable B2B or cross-tenant solutions:
  - Azure AD B2B, AWS SSO external identity providers, GCP external identity federation, OCI cross-tenant identity sharing.
Encourage Cross-Public Sector Federation
- Explore or pilot solutions that unify multiple agencies’ directories, aligned with GOV.UK single sign-on or identity assurance frameworks.
Regularly Assess Identity Posture
- Perform security posture reviews or zero-trust evaluations (e.g., referencing NCSC’s zero trust guidance or NIST SP 800-207 for zero-trust architecture):
  - Ensures you keep pace with evolving threats.

By adopting passwordless MFA, integrating dynamic risk scoring, enabling external collaborator identity, exploring cross-public sector federation, and performing continuous zero-trust posture checks, you achieve an exceptionally secure, efficient environment—exemplifying best practices for user provisioning and identity management in the UK public sector.

Keep doing what you’re doing, and consider publishing blog posts or opening pull requests to share your experiences in creating a unified, cloud-based identity approach. By collaborating with others in the UK public sector, you help propagate secure, advanced authentication practices across government services.

How do you manage non-human service accounts in the cloud? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond human-like accounts for services:

Introduce Role-Based Service Accounts
- Use the cloud provider’s native service account or role concept:
  - AWS IAM roles for EC2 or ECS tasks, Azure Managed Identities, GCP Service Accounts, OCI Dynamic Groups
- Avoid user/password-based approaches.
Limit Shared Credentials
- Immediately stop creating or reusing credentials across multiple services. Assign each service a unique identity:
  - Ensures logs and auditing can differentiate actions.
Enforce MFA or Short-Lived Tokens
- If a service truly needs interactive login (rare), require MFA or ephemeral credentials where possible.
- NCSC guidance on multi-factor authentication for accounts.
Document a Minimal Policy
- A short doc stating “No non-human accounts with user-like credentials,” referencing both NCSC principle of least privilege and NIST guidelines.
Begin Transition to Cloud-Native Identity
- Plan a short-term goal (2-4 months) to retire all shared/human-like service accounts, adopting native roles or short-lived credentials where feasible.

By introducing cloud-native roles for services, eliminating shared credentials, enabling MFA or short-lived tokens if needed, documenting a minimal policy, and planning a transition, you reduce security risks posed by long-lived, human-like service accounts.

How to do better

Below are rapidly actionable steps to centralise and secure long-lived API keys:

Move Keys to a Central Secret Store
- e.g., AWS Secrets Manager, Azure Key Vault, GCP Secret Manager, OCI Vault for storing all API keys.
- Minimises local sprawl and fosters consistent security controls.
Enforce Rotation Policies
- Implement at least quarterly or monthly rotation for API keys to reduce exposure window if compromised:
  - Possibly automate via AWS Lambda, Azure Functions, GCP Cloud Functions, or OCI functions.
Use Tooling for Local Key Discovery
- If keys might be in code repos, scan with open-source or vendor tools:
  - e.g., Trufflehog, Gitleaks, AWS CodeGuru Security, or Azure DevOps Security scanning.
- Alert if secrets are committed to version control.
Document a Single Organisational Policy
- State that “All API keys must be stored in central secret management, with at least every X months rotation.”
- Reference NIST secret management or NCSC credential rotation best practices.
Transition to Role-Based or Short-Lived Tokens
- While central secret storage helps, plan a future move to ephemeral tokens or IAM roles:
  - Reduces reliance on static keys altogether.

By centralising key storage, rotating keys automatically, scanning for accidental exposures, formalising a policy, and starting to shift away from static keys, you significantly enhance the security of locally managed long-lived credentials.

How to do better

Below are rapidly actionable ways to strengthen your centralised secret store approach:

Automate Secret Rotation
- For each stored secret (e.g., a database password, a service’s API key), implement rotation:
  - AWS Secrets Manager rotation, Azure Key Vault rotation, GCP Secret Manager rotation, or OCI Vault rotation features.
Incorporate Access Control & Monitoring
- Strictly limit who can retrieve or update each secret, using fine-grained IAM or RBAC:
  - e.g., AWS IAM policies, Azure Key Vault RBAC, GCP Secret Manager IAM, or OCI compartments/policies.
- Monitor logs for unusual access patterns.
Reference a “Secret Lifecycle” Document
- Outline creation, usage, rotation, and revocation steps for each type of secret.
- Align with NIST recommended credential lifecycles or NCSC guidance on secret hygiene.
Integrate into CI/CD
- Ensure automation pipelines fetch credentials from the secret store at build or deploy time, never storing them in code.
Begin Adopting Ephemeral Credentials
- For new services, consider short-lived tokens or dynamic role assumption, stepping away from even stored secrets:
  - e.g., AWS STS, Azure Managed Identity tokens, GCP workload identity pools, or OCI dynamic groups/tokens.

By automating secret rotation, refining access controls, documenting a secret lifecycle, hooking the store into CI/CD, and planning ephemeral credentials for new services, you build on your strong foundation of centralised secret usage to minimise risk further.

How to do better

Below are rapidly actionable improvements to further secure ephemeral identity usage:

Embed Short-Lived Tokens in CI/CD
- For instance, dev and build systems can assume roles or fetch tokens just-in-time:
  - e.g., AWS STS for container builds, Azure DevOps with Managed Identities, GCP Cloud Build using short-lived service account tokens, or OCI DevOps ephemeral tokens.
Adopt Service Mesh or mTLS
- If you have container/microservice architectures, combine ephemeral identity with Istio, AWS App Mesh, Azure Service Fabric, GCP Anthos Service Mesh, or OCI OKE with a mesh add-on for strong mutual TLS:
  - Further ensures identities are validated end-to-end.
Leverage Policy-as-Code
- e.g., Open Policy Agent (OPA) or vendor-based policy solutions (AWS Organizations SCP, Azure Policy, GCP Org Policy, OCI Security Zones) for dynamic authorisation checks:
  - Grant ephemeral credentials only if a container or instance meets certain attestation criteria.
Regularly Audit Attestation Mechanisms
- Confirm your environment attestation approach is updated and trustworthy, referencing NCSC hardware root of trust or secure boot guidance or NIST hardware security modules.
Integrate with Cross-Org Federation
- If multi-department or local councils share workloads, ensure ephemeral identity can federate across boundaries, referencing GOV.UK guidance on cross-government tech collaboration.

By embedding ephemeral tokens into your CI/CD, adding a service mesh or mTLS, employing policy-as-code, auditing attestation rigorously, and exploring cross-organisation federation, you evolve ephemeral identity usage into a highly secure, flexible, and zero-trust-aligned solution.

How to do better

Below are rapidly actionable ways to enhance code-managed identities with federated trust:

Incorporate Real-Time Security Policies
- Use policy-as-code (OPA, AWS SCP, Azure Policy, GCP Org Policy, OCI Security Zones) to automatically detect and block misconfigurations in your IaC definitions.
Leverage DevSecOps Workflows
- Integrate identity code linting, security scanning, and ephemeral token provisioning into CI/CD:
  - e.g., scanning Terraform or CloudFormation for suspicious identity references before merge.
Implement Zero-Trust Microsegmentation
- Each microservice identity obtains ephemeral credentials from a central authority:
  - e.g., HashiCorp Vault with dynamic secrets, AWS STS with short-lived tokens, Azure Managed Identities with per-service tokens, GCP Workload Identity Federation, or OCI dynamic group tokens.
Expand to Multi-Cloud/Hybrid
- If multiple providers or on-prem systems are used, unify identity definitions across them:
  - e.g., bridging AWS, Azure, GCP, OCI roles in the same Terraform codebase, referencing NCSC’s multi-cloud security patterns.
Regularly Validate & Audit
- Implement automated “drift detection” to confirm the code matches deployed reality, ensuring no manual overrides exist.
- Tools like Terraform Cloud, AWS Config, Azure Resource Graph, GCP Config Controller, or OCI resource search + CI/CD checks can help.

By employing policy-as-code, adopting DevSecOps scanning in your pipeline, embracing zero-trust microsegmentation, extending code-based identity to multi-cloud/hybrid, and continuously auditing for drift, you perfect a code-centric model that securely and efficiently manages service identities across your entire public sector environment.

Keep doing what you’re doing, and consider sharing your approach to code-managed identity and federated trust in blog posts or by making pull requests to this guidance. This knowledge helps other UK public sector organisations adopt similarly robust, zero-trust-aligned solutions for non-human service account authentication.

How do you manage risks? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to improve from an informal approach:

Create a Simple Risk Checklist
- Document cloud-specific concerns: data breaches, credential leaks, cost overruns, vendor lock-in.
- Align with NCSC’s Cloud Security Principles or a NIST SP 800-37 based checklist.
Record & Communicate Regularly
- Even a single spreadsheet or Word doc with identified risks, likelihood, and impact fosters consistency.
- Share it monthly or quarterly with the relevant stakeholders.
Assign Risk Owners
- For each risk, name someone responsible for tracking and mitigating.
- Prevents duplication or “everyone and no one” owning an issue.
Introduce Basic Likelihood & Impact Scoring
- e.g., 1-5 scale for likelihood, 1-5 for impact, multiply for a total risk rating.
- This helps prioritise and start discussion around risk tolerance.
Plan for Next Steps
- Over the next 3-6 months, aim to adopt a minimal formal risk register or define a short process, referencing official guidelines from NCSC or GOV.UK project risk management.

By establishing a simple risk checklist, scheduling short reviews, assigning ownership, adopting basic scoring, and outlining a plan for incremental improvements, you quickly move from purely informal approaches to a more recognisable and consistent risk management foundation.

How to do better

Below are rapidly actionable improvements:

Adopt a Standardised Template
- Provide a uniform risk register template across all projects.
- Outline columns (e.g., risk description, category, likelihood, impact, owner, mitigations, target resolution date).
Encourage Regular Cross-Project Reviews
- Monthly or quarterly, each project lead presents top risks.
- Creates awareness of shared or similar risks (like cloud credential leaks, compliance deadlines).
Consolidate Key Risks
- Extract major issues from each spreadsheet into a single “organisational risk summary” for senior leadership or departmental boards.
Implement Basic Tool or Shared Repository
- e.g., host a central SharePoint list, JIRA board, or Google Sheet consolidating all project-level risk inputs:
  - Minimises confusion while maintaining a single source of truth.
Leverage Some Automation
- For cloud-specific issues, consider vendor solutions:
  - AWS Security Hub or AWS Config for scanning misconfigurations, Azure Advisor or Azure Security Center, GCP Security Command Center, or OCI Security Advisor can highlight recognised cloud security or cost risks to feed into your register.

By adopting a consistent template, hosting cross-project reviews, summarising top risks in an organisational-level register, using a shared tool or repository, and partly automating detection of cloud security concerns, you advance from ad-hoc spreadsheets to a more coordinated approach.

How to do better

Below are rapidly actionable ways to expand your formal risk register process:

Introduce Real-Time Updates or Alerts
- If new vulnerabilities or breaches occur, staff must promptly add or update a risk in the register:
  - Possibly integrate with AWS Security Hub, Azure DevOps, GCP Security scans, or OCI Security Advisor for quick notifications.
Measure Risk Reduction Over Time
- Track how mitigations lower risk levels. Summaries can feed departmental or board-level dashboards:
  - e.g., “Risk #12: Cloud credential leaks reduced from High to Medium after implementing MFA and secret rotation.”
Encourage GRC Tools
- Government Risk and Compliance tools can unify multiple registers:
  - e.g., ServiceNow GRC, RSA Archer, or open-source solutions.
- Minimises duplication across large organisations or multiple projects.
Link Mitigations to Budgets and Timelines
- Where possible, highlight the cost or resource needed for each major mitigation:
  - Helps leadership see rationale for investing in e.g., staff training, new security tools.
Adopt a Cloud-Specific Risk Taxonomy
- Incorporate categories like “Data Residency,” “Vendor Lock-in,” “Cost Overrun,” or “Insecure IAM,” referencing NCSC or NIST guidelines.
- Ensures team members identify typical cloud vulnerabilities systematically.

By setting up real-time triggers for new risks, visualising risk reduction, considering GRC tooling, linking mitigation to budgets, and classifying cloud-specific risk areas, you reinforce a structured risk registry that handles dynamic and evolving threats efficiently.

How to do better

Below are rapidly actionable ways to optimise integrated, centrally overseen risk management:

Incorporate Cloud-Specific Telemetry
- Feed alerts from AWS Security Hub, Azure Sentinel, GCP SCC, or OCI Security Advisor directly into your central risk management system:
  - Automates new risk entries or risk re-scoring when a new vulnerability emerges.
Advance Real-Time Dashboards
- Provide live risk dashboards for each department or service, updating as soon as a risk or its mitigations change:
  - e.g., hooking up GRC tools to Slack/Teams for immediate notifications.
Use Weighted Scoring for Cloud Projects
- Factor in vendor lock-in, cost unpredictability, and data security in your risk scoring.
- Align with NCSC’s cloud security frameworks or NIST SP 800-53 AC/B as relevant.
Formalise Risk Response Plans
- For high or urgent risks, define an immediate action plan or “playbook,” referencing NCSC’s incident response methods.
Encourage Cross-department Collaboration
- Public sector bodies often share similar cloud challenges—facilitate risk-sharing sessions with local councils, NHS, or other departments:
  - Possibly aligning with GOV.UK best practices for cross-government knowledge exchange.

By integrating real-time cloud telemetry into your central risk system, offering advanced dashboards, applying specialised scoring for cloud contexts, setting formal risk responses, and cross-collaborating among agencies, you achieve deeper, more proactive risk management.

How to do better

Below are rapidly actionable ways to enhance an already advanced, proactive risk management system:

Adopt AI/ML for Predictive Risk
- Tools or scripts that detect emerging patterns before they become major issues:
  - e.g., anomalous cost spikes or security logs flagged by AWS DevOps Guru, Azure Sentinel ML, GCP Security Command Center with ML, or OCI advanced analytics.
Integrate Risk with DevSecOps
- Show real-time risk scores in CI/CD pipelines, halting deployments if a new “High” or “Critical” risk is detected:
  - e.g., referencing AWS CodePipeline gates, Azure DevOps approvals, GCP Cloud Build triggers, or OCI DevOps pipeline policy checks.
Multi-Cloud or Hybrid Risk Consolidation
- If operating across AWS, Azure, GCP, OCI, or on-prem, unify them in one advanced GRC or SIEM tool:
  - Minimises siloed risk reporting.
Extend Collaborative Risk Governance
- If you share HPC or cross-department AI/ML projects, hold multi-department risk board sessions:
  - Aligning with GOV.UK collaborative approach or relevant NCSC multi-stakeholder security frameworks.
Regularly Refresh Risk Tolerance & Metrics
- Reassess risk thresholds to ensure they remain relevant.
- If your environment scales or new HPC/AI workloads are introduced, adapt risk definitions accordingly.

By leveraging AI for predictive risk detection, embedding risk scoring in DevSecOps pipelines, consolidating multi-cloud/hybrid risk data, collaborating on risk boards across agencies, and regularly updating risk tolerance metrics, you optimise an already advanced, proactive risk management system—ensuring continuous alignment with evolving public sector challenges and security imperatives.

Keep doing what you’re doing, and consider documenting your advanced risk management approaches through blog posts or by opening pull requests to this guidance. Sharing such experiences helps other UK public sector organisations adopt progressive risk management strategies in alignment with NCSC, NIST, and GOV.UK best practices.

How do you manage staff identities? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond isolated identity management:

Create a Basic Directory or SSO Pilot
- For new services, define a single user store or IDP:
  - e.g., AWS Directory Service or AWS SSO, Azure AD, GCP Identity, or OCI IDCS, IBM Cloud AppID
- Minimises further fragmentation for new apps.
Maintain a Simple User Inventory
- List out each app’s user base and identify duplicates or potential orphan accounts:
  - Helps to see the scale of the fragmentation problem.
Encourage Unique Credentials
- Discourage password re-use and adopt basic password policies referencing NCSC password guidance.
Plan a Gradual Migration
- Set a short timeline (e.g., 6-12 months) to unify at least a few key services under a single ID provider.
Highlight Quick-Wins
- If consolidating one or two widely used services to a shared login shows immediate benefits (less support overhead, better logs), use that success to rally internal support.

By implementing a small shared ID approach for new services, maintaining an org-wide user inventory, encouraging unique credentials with basic password hygiene, scheduling partial migrations, and publicising quick results, you steadily reduce the complexity and risk of scattered service-specific identities.

How to do better

Below are rapidly actionable steps to further unify your basic centralised identity:

Mandate SSO for New Services
- All future cloud apps must integrate with your central ID system (SAML, OIDC, etc.).
- AWS SSO, Azure AD App Registrations, GCP Identity Federation, or OCI IDCS integrations.
Target Legacy Systems
- Identify 1-3 high-value legacy applications and plan a short roadmap for migrating them to the central ID store:
  - e.g., rewriting authentication to use SAML or OIDC.
Introduce Periodic Role or Access Reviews
- Ensure the centralised identity system is coupled with a simple process for managers to confirm staff roles:
  - referencing AWS IAM Access Analyzer, Azure AD Access Reviews, GCP IAM Recommender, or OCI IAM policy checks.
Extend MFA Requirements
- If only some users in the centralised system have MFA, gradually require it for all:
  - referencing NCSC’s multi-factor authentication guidance.
Aim for Full Integration by a Set Date
- e.g., a 12-18 month plan to unify all services, presenting to leadership how this will lower security risk and reduce support costs.

By demanding SSO for new apps, migrating top-priority legacy systems, enabling periodic role reviews, enforcing MFA across the board, and setting a timeline for full integration, you reinforce your centralised identity approach and shrink vulnerabilities from leftover local user stores.

How to do better

Below are rapidly actionable ways to incorporate the last few outliers:

Establish an “Exception Approval”
- If a service claims it can’t integrate, mandate a formal sign-off by security or architecture boards:
  - Minimises indefinite exceptions.
Plan Legacy Replacement or Integration
- For each separate system, define a short project to incorporate them with SAML, OIDC, or SCIM:
  - e.g., Azure AD integration for older apps, AWS SSO bridging, GCP Identity front-end with IAP or custom IAM integration, or OCI IDCS bridging.
Enhance Monitoring on Exceptions
- If integration isn’t possible yet, strictly log and track those system’s user access, referencing NCSC logging recommendations or NIST logging controls (AU family).
Regularly Reassess or Sunset Non-Compliant Services
- If an exception remains beyond a certain period (e.g., 6-12 months), escalate to leadership.
- This keeps pressure on removing exceptions eventually.
Include Exceptions in Identity Audits
- Ensure these standalone services aren’t forgotten in user account cleanup or security scanning efforts:
  - e.g., hooking them into an “all-of-org” identity or vulnerability scan at least quarterly.

By requiring official approval for non-integrated systems, scheduling integration projects, monitoring or sunsetting exceptions, and auditing them in the main identity reviews, you unify identity management and ensure consistent security across all cloud services.

How to do better

Below are rapidly actionable ways to enhance advanced integrated identity management:

Explore Zero-Trust or Risk-Adaptive Auth
- If a user tries to access a high-risk service from an unknown location or device, require step-up authentication:
  - e.g., Azure AD Conditional Access with risk-based sign-ins, AWS SSO with conditional logic, GCP Access Context Manager, or OCI adaptive authentication features.
Adopt Policy-as-Code for Identity
- Use Open Policy Agent or vendor-based solutions (e.g., AWS Organizations SCP, Azure Policy, GCP Org Policy, OCI Security Zones) to define identity and resource controls in code for versioning and traceability.
Enable Fine-Grained Roles and Minimal Privileges
- Continuously refine roles so each user only has what they need, referencing NCSC’s least privilege guidance or NIST SP 800-53 AC-6 on least privilege.
Implement Automated Access Certification
- Every few months, prompt managers to re-check their team’s privileges:
  - Tools like Azure AD Access Reviews, AWS IAM Access Analyzer, GCP IAM Recommender, or OCI IAM policy checks can highlight unneeded privileges.
Sustain a Culture of Continuous Improvement
- Encourage security champions to look for new features (like passwordless sign-in or advanced biometrics):
  - e.g., FIDO2-based solutions, hardware tokens, or passwordless approaches recommended by NCSC/NIST for next-level security.

By implementing zero-trust or risk-based authentication, adopting identity policy-as-code, refining least privilege roles, automating access certifications, and fostering continuous improvements, you advance from a strong integrated identity environment to a cutting-edge, security-first approach aligned with UK public sector best practices.

How to do better

Below are rapidly actionable ways to refine a mandatory single source of identity:

Implement Risk-Adaptive Authentication
- Combine the single identity with dynamic checks (like device compliance, location, or time) to apply additional verifications if risk is high:
  - e.g., Azure AD Identity Protection, AWS Cognito adaptive auth, GCP Identity-Aware Proxy, or OCI adaptive authentication.
Extend Identity to Multi-Cloud
- If you operate across multiple providers, unify identity definitions so staff seamlessly access AWS, Azure, GCP, or OCI:
  - Possibly referencing external IDPs or cross-cloud SSO integrations.
Incorporate Passwordless Tech
- FIDO2 or hardware token-based sign-ins for staff:
  - NCSC encourages strong, phishing-resistant MFA where possible.
Align with Cross-Government Identity Initiatives
- If relevant, collaborate with other departments on shared SSO or bridging solutions:
  - e.g., GOV.UK One Login program or NHS/Local Government identity bridging.
Continuously Review and Audit
- Maintain monthly or quarterly audits ensuring no system bypasses the single identity policy.
- Tools like Azure AD application listing, AWS Organizations integration, GCP Organization-level IAM policy, or OCI compartments integration can detect any outliers.

By adopting risk-based auth, ensuring multi-cloud identity unification, deploying passwordless approaches, collaborating with cross-government identity programs, and regularly auditing for compliance with the mandatory single source policy, you reinforce a top-tier security stance. This guarantees minimal identity sprawl and maximum accountability in the UK public sector environment.

Keep doing what you’re doing, and consider creating blog posts or making pull requests to this guidance about your advanced single-source identity management success. Sharing practical examples helps other UK public sector organisations move toward robust, consistent identity strategies.

How do you reduce the risk from staff with high-level access? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to bolster security beyond mere user vetting:

Implement the Principle of Least Privilege
- Even fully vetted staff should not have more privileges than needed:
  - E.g., use AWS IAM roles with distinct privileges, Azure RBAC, GCP IAM with granular roles, or OCI compartments/policies, IAM Access Controls
Mandate MFA for Privileged Accounts
- For root/admin accounts, enforce multi-factor authentication referencing NCSC guidance on MFA best practices.
- Minimises the chance of stolen credentials being abused.
Adopt Break-Glass Procedures
- Provide normal user roles with day-to-day privileges. Escalation to super-user (root/admin) requires justification or time-limited credentials.
Track Changes & Access
- Enable audit logs for all privileged actions, storing them in an immutable store:
  - e.g., AWS CloudTrail + Amazon S3 with SSE-KMS encryption, Azure Monitor with immutable storage, GCP Logging with Bucket Lock, or OCI Logging with WORM policies.
Periodic Re-Vetting
- Re-assess staff in privileged positions every 1-2 years or upon role changes to ensure continuous alignment with NCSC or departmental clearance policies.

By reinforcing least privilege, requiring MFA for admins, introducing break-glass accounts, logging privileged actions immutably, and scheduling re-vetting cycles, you address the limitations of purely one-time user vetting practices.

How to do better

Below are rapidly actionable steps for robust logging:

Centralise Logs
- Collect logs from all key systems into a single location:
  - AWS CloudWatch Logs or Amazon S3, Azure Monitor Logs or Log Analytics, GCP Cloud Logging, or OCI Logging service.
- Simplifies correlation and search.
Implement Basic Retention Policies
- Define how long logs remain:
  - e.g., minimum 90 days or 1 year for privileged user activity, referencing NCSC or departmental retention guidelines.
Add Tiered Access
- Ensure only authorised security or audit staff can retrieve log data, particularly sensitive privileged user logs.
Adopt Alerts or Scripting
- If no advanced SIEM in place, set simple CloudWatch or Monitor alerts for suspicious events:
  - e.g., repeated authentication failures, unusual times for privileged actions.
Plan for Future SIEM
- Keep in mind an upgrade to a security information and event management tool or advanced logging solution in the next 6-12 months:
  - e.g., AWS Security Hub, Azure Sentinel, GCP Security Command Center, or OCI Security Advisor integrations.

By centralising logs, defining retention policies, restricting log access, employing basic alerts, and charting a path to a future SIEM or advanced monitoring approach, you progress from minimal log compliance to meaningful protective monitoring for privileged accounts.

How to do better

Below are rapidly actionable steps to enhance local audit log checks:

Introduce Scheduled Log Reviews
- e.g., once a month or quarter, verify logs remain present, complete, and show no anomalies:
  - Provide a short checklist or script for consistent checks.
Adopt a Central Logging Approach
- Even if you keep local logs, replicate them to a central store or SIEM:
  - AWS S3 or Amazon ES, Azure Monitor Logs, GCP Logging + BigQuery, or OCI Logging Analytics.
Establish an Alerting Mechanism
- Set triggers for suspicious events:
  - repeated privileged commands, attempts to disable logging, or high-volume data exfil events.
Retest Periodically
- Expand from a pre-launch compliance check to ongoing compliance checks, referencing NCSC operational resilience or protective monitoring advice.
Involve Security/Operations in Reviews
- Encourage cross-team peer reviews, so security staff or ops can weigh in on log completeness or retention policies.

By scheduling routine log reviews, centralising logs or employing a SIEM, establishing real-time alerts, retesting logs beyond initial go-live, and collaborating with security teams on checks, you elevate from one-time assessments to ongoing protective monitoring.

How to do better

Below are rapidly actionable ways to enhance a centralised, immutable audit logging approach:

Incorporate a SIEM or Security Analytics
- e.g., Splunk, AWS Security Hub, Azure Sentinel, GCP Chronicle, or OCI Logging Analytics with advanced detection:
  - Gains rapid threat detection, correlation, and visual dashboards.
Define Tiered Log Retention
- Some logs might only need short retention, while privileged user logs or financial transaction logs might need multi-year retention, referencing departmental policies or NCSC recommended durations.
Implement Role-Based Log Access
- Ensure only authorised staff see certain logs (privileged user logs may contain sensitive data).
- Align with NIST SP 800-53 Access Control guidelines.
Add Instant Alerts for High-Risk Actions
- e.g., attempts to disable logging, repeated root-level changes, or suspicious escalations.
- Tools like AWS CloudWatch Alarms, Azure Monitor Alerts, GCP Logging Alerts, or OCI Notifications integrations are typically easy to set up.
Cross-department Collaboration
- If your service interacts with other public sector organisations, consider shared logging approaches for end-to-end traceability.
- Possibly referencing GOV.UK cross-department data sharing or NCSC supply chain security best practices.

By coupling an advanced SIEM with defined retention tiers, enforcing role-based log access, setting real-time alerts for critical events, and collaborating beyond your department, you push your centralised, immutable logging approach to best-in-class standards aligned with public sector needs.

How to do better

Below are rapidly actionable suggestions to deepen advanced log audits and legal compliance:

Formalise Forensic Readiness
- Publish an internal document describing how logs are collected, secured, and presented in legal contexts:
  - referencing NCSC forensic readiness best practices.
Simulate Real-World Insider Incidents
- Conduct tabletop exercises or “red team” scenarios focusing on a privileged user gone rogue:
  - confirm the logs indeed catch suspicious actions and remain legally defensible.
Adopt Chain-of-Custody Tools
- Use tamper-evident hashing or digital signatures on log files:
  - e.g., storing in AWS S3 Glacier with Vault Lock, Azure immutable storage with WORM, GCP Bucket Lock, or OCI Object Storage with retention policies.
Engage with Legal/HR for Pre-Agreed Procedures
- Ensure a consistent approach to handle suspected insider cases, clarifying roles for HR, security, legal, and management:
  - Minimises delays or confusion during investigations.
Leverage Cross-department Insights
- If possible, share experiences with other public sector bodies:
  - e.g., local councils or departmental agencies implementing similar forensic checks, referencing GOV.UK data and knowledge sharing communities.

By refining your forensic readiness policy, running insider threat simulations, implementing chain-of-custody measures, coordinating with legal/HR teams, and exchanging insights cross-department, you maximise the readiness and legal defensibility of your logs, ensuring robust protection against privileged internal threats in the UK public sector environment.

Keep doing what you’re doing, and consider blogging or creating pull requests to share these advanced approaches for safeguarding logs, verifying legal readiness, and mitigating privileged insider threats. Such knowledge helps strengthen collective security practices across UK public sector organisations.

How do you keep your software supply chain secure? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to handle unmanaged dependencies more safely:

Adopt Basic Package Manifests
- Even if you install packages with apt, create a minimal list of versions used. For language-based repos (Node, Python, etc.), commit package.json / Pipfile or equivalent:
  - Minimises drift and ensures consistent builds.
Begin Generating Simple SBOM
- Tools like Syft, CycloneDX CLI, or OWASP Dependency-Check can produce a rudimentary SBOM from your current dependencies.
- This helps you see what libraries you’re actually using.
Enable Automatic or Regular Patch Checks
- For OS packages, configure AWS Systems Manager Patch Manager, Azure Automation Update Management, GCP OS Patch Management, or OCI OS Management Service if you’re running cloud-based VMs, IBM Cloud
Document a Basic Update Policy
- e.g., “All packages are updated monthly,” referencing NCSC patch management best practices.
Plan an Overhaul to Managed Dependencies
- In the next 3-6 months, decide on a standard approach for dependencies:
  - e.g., using Node’s package-lock.json, Python’s requirements.txt, or Docker images pinned to specific versions.

By adopting minimal package manifests, generating basic SBOM data, automating patch checks, documenting an update policy, and planning a transition toward managed dependencies, you lay the groundwork for a more secure, transparent software supply chain.

How to do better

Below are rapidly actionable ways to strengthen basic dependency management:

Automate Regular Dependency Scans
- Integrate scanners into CI pipelines:
  - e.g., GitHub Dependabot, GitLab Dependency Scanning, Azure DevOps Security scanners, AWS CodeGuru Security, or 3rd-party solutions like Snyk or Sonatype Nexus.
Define a Scheduled Update Policy
- e.g., monthly or bi-weekly updates for critical libraries, referencing NCSC’s patch management recommendations.
Maintain SBOM or Lock Files
- Ensure each repo has a “lock file” or a manifest. Also, consider generating SBOM data (CycloneDX, SPDX) for compliance:
  - Aligns with NIST supply chain security guidance on EO 14028.
Enable Alerting for Known Vulnerabilities
- e.g., AWS Security Hub or Lambda scanning solutions, Azure Security Center with container scanning, GCP Container Analysis, or OCI Vulnerability Scanning if container-based.
Document Emergency Patching
- Formalise an approach for urgent CVE patching outside major releases.
- Minimises ad-hoc panic when a high severity bug appears.

By automating scans, scheduling regular update windows, maintaining SBOM or lock files, setting up vulnerability alerts, and establishing a well-defined emergency patch process, you move from ad-hoc monitoring to a more structured, frequent approach that better secures the software supply chain.

How to do better

Below are rapidly actionable ways to strengthen proactive repository remediation:

Introduce Risk Scoring or Context
- Distinguish vulnerabilities that truly impact your code path from those that are unreferenced dependencies:
  - e.g., using advanced scanning tools like Snyk, Sonatype, or vendor-based solutions.
Adopt Container and OS Package Scanning
- If using Docker images or base OS packages, incorporate scanning in your CI/CD:
  - AWS ECR image scanning, Azure Container Registry scanning, GCP Container Analysis, or OCI Vulnerability Scanning Service.
Refine Automated Testing
- Ensure new dependency updates pass comprehensive tests (unit, integration, security checks) before merging:
  - referencing NCSC DevSecOps recommendations and relevant NIST DevSecOps frameworks.
Define an SLA for Fixes
- e.g., “Critical vulnerabilities fixed within 48 hours, high severity within 7 days,” aligning with NCSC’s vulnerability management best practices.
Document & Track Exceptions
- If a patch is delayed (e.g., due to breakage risk), keep a formal record of why and a timeline for resolution:
  - Minimises the chance of indefinite deferral of serious issues.

By introducing vulnerability risk scoring, scanning container/OS packages, enhancing test automation for new patches, setting fix SLAs, and controlling deferrals, you significantly improve the proactive repository-level remediation approach across your entire software estate.

How to do better

Below are rapidly actionable ways to refine centralised, context-aware triage:

Add Real-Time Threat Intelligence
- Integrate intel feeds that highlight newly discovered exploits targeting specific libraries:
  - e.g., AWS Security Hub or Amazon Inspector with threat intel, Azure Defender’s threat DB, GCP threat intelligence, or OCI security advisories.
Automate Contextual Analysis
- Tools that parse call graphs or code references to see if a vulnerable function is actually invoked:
  - Minimises false positives and patch churn.
Collaborate with Dev Teams
- If a patch might break production, the SOC can coordinate safe rollout or canary testing to confirm stability before mandatory updates.
Measure & Publish Remediation Metrics
- e.g., average time to fix a critical CVE or high severity vulnerability.
- Encourages healthy competition and accountability across teams.
Align with Overall Risk Registers
- When a big vulnerability emerges, feed it into your organisational risk register, referencing NCSC or departmental risk management frameworks.

By integrating real-time threat intel, employing contextual code usage analysis, collaborating with dev for safe patch rollouts, tracking remediation metrics, and linking to broader risk management, you elevate centralised monitoring to a dynamic, strategic posture in addressing supply chain security.

How to do better

Below are rapidly actionable ways to refine advanced, integrated supply chain security:

Implement Automated Policy-as-Code
- e.g., Open Policy Agent (OPA) in CI/CD or vendor-based tools (AWS Service Control, Azure Policy, GCP Org Policy, OCI Security Zones).
Extend SBOM Generation & Validation
- Enforce real-time SBOM generation and sign-off at each build:
  - referencing NIST SBOM frameworks or NCSC guidelines on software transparency.
- Automate verifying known safe versions.
Adopt Multi-Factor Scanning
- Combine static code analysis, dependency scanning, container image scanning, and runtime threat detection:
  - e.g., AWS CodeGuru Security + ECR scanning, Azure DevOps SAST + Container Registry scans, GCP Cloud Build with Container Analysis + Snyk, or OCI Vulnerability Scanning + DevOps scanning integrations.
Coordinate with Supplier/Partner Security
- If you rely on external code or vendors, integrate them into your scanning or require them to produce SBOMs:
  - align with NCSC supply chain risk management best practices.
Drive a Security-First Culture
- Provide ongoing staff training, referencing NCSC e-learning resources or relevant NIST-based secure coding frameworks.
- Encourage environment that prioritises prompt, efficient patching.

By implementing policy-as-code in your pipelines, strengthening SBOM usage, blending multiple scanning techniques, managing upstream vendor security, and fostering a security-first ethos, you sustain a cutting-edge supply chain security environment—ensuring minimal risk, maximum compliance, and rapid threat response across UK public sector software development.

How do you find and fix security problems, vulnerabilities, and misconfigurations? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to implement basic vulnerability reporting:

Publish a Simple Disclosure Policy
- e.g., a “Contact Security” page or statement on your website acknowledging how to report vulnerabilities, referencing NCSC vulnerability disclosure best practices.
Set Up a Dedicated Email or Form
- Provide a clear email (like security@yourdomain.gov.uk) or secure submission form:
  - Minimises confusion about who to contact.
Respond with a Standard Acknowledgement
- Even an automated template that thanks the researcher and notes you’ll follow up within X days fosters trust.
Engage Leadership
- Brief senior management that ignoring external reports can lead to missed critical vulnerabilities.
Plan a Gradual Evolution
- Over the next 6-12 months, consider joining a responsible disclosure platform or adopting a bug bounty approach for larger-scale feedback.

By defining a minimal disclosure policy, setting up a dedicated channel, creating an acknowledgment workflow, involving leadership awareness, and planning for future expansions, you shift from no vulnerability management to a more transparent and open approach that encourages safe vulnerability reporting.

How to do better

Below are rapidly actionable ways to evolve beyond a standard disclosure policy:

Link Policy with Internal Remediation SLAs
- For example, “critical vulnerabilities responded to within 24 hours, resolved or mitigated within 7 days,” to ensure a consistent process.
Integrate with DevSecOps
- Feed reported vulnerabilities into your CI/CD backlog or JIRA/DevOps boards, referencing NCSC DevSecOps advice and NIST secure development frameworks.
Offer Coordinated Vulnerability Disclosure Rewards
- If feasible, small gestures (like public thanks or acknowledgement) or bug bounty tokens encourage more thorough testing from external researchers.
Publish Summary of Findings
- Periodically share anonymised or high-level results of vulnerability disclosures, illustrating how quickly you resolved them.
- Builds trust with citizens or partner agencies.
Combine with Automated Tools
- Don’t rely solely on external reports. Implement scanning solutions:
  - AWS Inspector, Azure Security Center, GCP Security Command Center, or OCI Vulnerability Scanning Service for internal checks.

By defining clear internal SLAs, integrating vulnerability disclosures into dev workflows, offering small acknowledgments or bounties, releasing summary fix timelines, and coupling with continuous scanning tools, you can both refine external disclosure processes and ensure robust internal vulnerability management.

How to do better

Below are rapidly actionable ways to enhance scanning and regular assessments:

Expand to Multi-Layer Scans
- Combine SAST (code scanning), DAST (runtime scanning), container image scanning, and OS patch checks:
  - e.g., AWS CodeGuru Security + Amazon Inspector, Azure DevOps Security + Container Registry scanning, GCP Security Command Center + container analysis, or OCI Vulnerability Scanning + DevOps scanning integrations.
Adopt Real-Time or Daily Scans
- If feasible, move from monthly/quarterly to daily or per-commit scanning in your CI/CD pipeline.
- Early detection fosters quicker fixes.
Integrate with SIEM
- Forward scanning results to a SIEM (e.g., AWS Security Hub, Azure Sentinel, GCP Chronicle, or OCI Security Advisor) for correlation with logs:
  - Helps identify patterns or repeated vulnerabilities.
Prioritise with Risk Scoring
- Tag vulnerabilities by severity and usage context. Tackle high-severity, widely used dependencies first, referencing NCSC guidelines on vulnerability prioritisation.
Publish Shared “Security Scorecards”
- Departments or teams see summary risk/vulnerability data. Encourages knowledge sharing and a culture of continuous improvement.

By broadening scanning layers, shifting to more frequent scans, integrating results in a SIEM, risk-scoring discovered issues, and creating departmental security scorecards, you refine a robust automated scanning regimen that swiftly addresses vulnerabilities.

How to do better

Below are rapidly actionable methods to refine proactive threat hunting and incident response:

Adopt Purple Teaming
- Combine red team (offensive) and blue team (defensive) exercises periodically to test detection and response workflows.
- e.g., referencing NCSC red teaming best practices.
Enable Automated Quarantine
- If a container, VM, or instance shows malicious behavior, automatically isolate it:
  - e.g., AWS Lambda or Azure Functions triggered by SIEM alerts, GCP Cloud Functions for security event response, or OCI cloud events integration.
Add Forensic Readiness
- Plan for collecting logs, memory dumps, or container images upon suspicious activity, referencing NCSC forensic readiness guidance or NIST SP 800-86.
Integrate Cross-Government Threat Intel
- If relevant, share or consume intelligence from local councils, NHS, or central government:
  - GOV.UK and NCSC programs for cross-department threat intel collaboration.
Expand Zero-Trust Microsegmentation
- Combine threat hunting with per-service or per-workload identity controls, so once an anomaly is found, lateral movement is minimised:
  - referencing NCSC zero trust or NIST SP 800-207 frameworks.

By introducing purple teaming, automating quarantine procedures, ensuring forensic readiness, collaborating on threat intel across agencies, and adopting zero-trust microsegmentation, you deepen your proactive stance and expedite incident responses.

How to do better

Below are rapidly actionable ways to optimise comprehensive security operations:

Incorporate HPC/AI Security
- If you run HPC or AI/ML workloads, ensure specialised testing in these unique environments:
  - referencing AWS HPC Competency, Azure HPC, GCP HPC solutions, or OCI HPC, plus relevant HPC security guidelines.
Include Third-Party Supply Chain
- Extend red/purple team scenarios to external suppliers or integrated services, referencing NCSC’s supply chain security approaches.
Automate Cross-Cloud Security
- If you operate in AWS, Azure, GCP, or OCI simultaneously, unify threat detection:
  - e.g., employing SIEM solutions like Azure Sentinel, Splunk, or AWS Security Hub aggregator across multiple accounts.
Public-Sector Collaboration
- Share red team findings or best practices with local councils, NHS, or other agencies within the constraints of sensitivity:
  - fosters wider security improvements, referencing GOV.UK cross-department knowledge sharing guidance.
Continuously Evaluate Zero-Trust
- Combine red team results with zero-trust strategy expansions:
  - referencing NIST SP 800-207 Zero Trust Architecture and NCSC zero trust guidance.

By adopting HPC/AI-targeted checks, incorporating suppliers in red team exercises, unifying multi-cloud threat intelligence, collaborating across public sector units, and reinforcing zero-trust initiatives, you further enhance your holistic security operations. This ensures comprehensive, proactive defense against sophisticated threats and misconfigurations in the UK public sector context.

Keep doing what you’re doing, and consider blogging or opening pull requests to share your advanced security operations approaches. This knowledge supports other UK public sector organisations in achieving robust threat/vulnerability management and protective monitoring aligned with NCSC, NIST, and GOV.UK best practices.

How do you secure your network and control access? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to strengthen or evolve from perimeter-only security:

Introduce MFA for Privileged Access
- Even if you maintain a perimeter, require multi-factor authentication for admin or root accounts:
  - e.g., AWS IAM MFA, Azure AD MFA, GCP IAM 2FA, or OCI IAM MFA, IBM Cloud MFA
- Minimises risk of compromised credentials bypassing the firewall.
Implement Least-Privilege IAM
- Don’t rely solely on IP allow-lists. Use role-based or attribute-based access for each service:
  - referencing NCSC’s guidance on access control.
Segment Networks Internally
- If you must keep a perimeter, create subnet-level or micro-segmentation to contain potential lateral movement:
  - e.g., AWS Security Groups + Network ACLs, Azure Network Security Groups, GCP VPC firewall rules, or OCI Security Lists/NSGs.
Enable TLS Everywhere
- Even inside the perimeter, adopt TLS for internal service traffic.
- NCSC’s guidance on TLS best practices ensures data in transit is protected if perimeter is breached.
Plan for Identity-Based Security
- Over the next 6-12 months, pilot a small zero-trust or identity-centric approach for a less critical app, paving the way to reduce dependence on perimeter rules.

By enforcing multi-factor authentication, introducing least-privilege IAM, segmenting networks internally, ensuring end-to-end TLS, and planning a shift toward identity-based models, you move beyond the risks of purely perimeter-centric security.

How to do better

Below are rapidly actionable ways to extend identity verification:

Enforce MFA for All Users
- Expand from privileged accounts to all staff, referencing NCSC’s multi-factor authentication guidance or vendor-based solutions:
  - AWS IAM, Azure AD MFA, GCP Identity, or OCI IAM MFA.
Increase Granularity of Access Controls
- Instead of letting a user into the entire internal network after login, define specific role-based or service-based access:
  - e.g., AWS IAM condition keys, Azure Conditional Access, GCP Access Context Manager, or OCI compartments/policies.
Adopt SSO
- If each app behind the perimeter uses separate user stores, unify them with SSO:
  - e.g., AWS SSO, Azure AD SSO, GCP Identity Federation, or OCI IDCS integration.
Enable Auditing & Logging
- Once inside the network, log user actions for each app or system:
  - e.g., AWS CloudTrail, Azure Monitor, GCP Cloud Logging, or OCI Audit Logs for post-identity verification behavior.
Consider Device Trust or Conditional Access
- If feasible, require verified device posture (up-to-date OS, security agent running) before granting app access.

By mandating MFA for all, refining role-based or service-level access, introducing SSO, logging all user actions, and optionally checking device security posture, you significantly reduce reliance on a single perimeter gate.

How to do better

Below are rapidly actionable ways to strengthen user+service identity verification:

Use mTLS or Short-Lived Tokens
- e.g., AWS IAM roles for EC2 with STS, Azure Managed Identities, GCP Workload Identity Federation, or OCI dynamic groups/tokens, plus mTLS for containers or microservices.
Adopt Policy-as-Code
- Incorporate Open Policy Agent or vendor-based solutions (AWS SCP, Azure Policy, GCP Org Policy, or OCI Security Zones) to define rules that check both user claims and service identity for each call.
Enforce Request-Level Authorisation
- For each critical API, evaluate the user identity, service identity, and method scope:
  - referencing NCSC least privilege guidance or NIST SP 800-53 AC-6 for role-based checks.
Implement JIT Privileges
- For especially sensitive or admin tasks, require ephemeral or just-in-time escalation tokens (with a short lifetime).
Log & Analyze Service-to-Service Interactions
- If microservices talk to each other, capture logs about which identity was used, referencing NCSC protective monitoring best practices.

By implementing mTLS or ephemeral tokens for user+service identity, deploying policy-as-code, requiring request-level authorisation, enabling JIT privileges for critical tasks, and thoroughly logging microservice communications, you move closer to a robust zero-trust framework within a partially perimeter-based model.

How to do better

Below are rapidly actionable ways to deepen identity-centric security:

Retire or Restrict VPN
- If a VPN is still used to reach certain legacy apps, plan a phased approach to move them behind identity-based gateways:
  - e.g., AWS AppStream or AWS WorkSpaces, Azure App Proxy, GCP BeyondCorp Enterprise, or OCI Identity Aware solutions.
Embed Device Trust
- Combine user identity with device compliance checks:
  - e.g., [Azure AD Conditional Access with device compliance, Google BeyondCorp device posture, AWS or OCI solutions integrated with MDM] for advanced zero-trust.
Embrace Microsegmentation
- Each app or microservice is accessible only with the correct identity claim, not broad network-level trust.
- referencing NCSC’s microsegmentation advice or DevSecOps patterns.
Establish Single Sign-On for All
- If some staff still need separate logins for older apps, unify them with AWS SSO, Azure AD, GCP Identity, or OCI IDCS Federation.
Continuously Train Staff
- Emphasize new patterns (no reliance on VPN, ephemeral credentials, and device checks).
- referencing GOV.UK or NCSC training resources on zero-trust and identity-based security.

By methodically retiring or limiting VPN usage, integrating device posture checks, employing microsegmentation, standardising single sign-on for all apps, and training staff on the identity-centric model, you further reduce perimeter dependence and approach a more robust zero-trust posture.

How to do better

Below are rapidly actionable ways to sustain no-perimeter, identity-based security:

Refine Device & User Risk Scoring
- If a device shows outdated OS or known vulnerabilities, reduce or block certain privileges automatically:
  - e.g., Azure AD Conditional Access with risk-based policies, AWS Cognito device posture checks, GCP BeyondCorp device trust, or OCI device posture solutions.
Enforce Continuous Authentication
- Check user identity validity at frequent intervals, not just at session start:
  - Tools for short-lived tokens or renewed claims, referencing NCSC’s recommended short-session best practices.
Extend Zero-Trust to Microservices
- Each microservice or container also obtains ephemeral credentials or mTLS, ensuring service-to-service trust.
- referencing NCSC supply chain guidance or NIST SP 800-53 AC controls for machine identity.
Use Policy-as-Code
- Implement Open Policy Agent (OPA), AWS SCP, Azure Policy, GCP Org Policy, or OCI Security Zones for dynamic, code-defined guardrails that adapt to real-time signals.
Collaborate & Share
- As a leading zero-trust example, share your experiences or case studies with other public sector bodies, referencing cross-government events or guidance from GDS / NCSC communities.

By deploying advanced device risk scoring, introducing continuous re-auth, expanding zero trust to microservices, employing policy-as-code for dynamic guardrails, and collaborating across the public sector, you refine your environment as a modern, identity-centric security pioneer, fully detached from traditional network perimeters and VPN reliance.

Keep doing what you’re doing, and consider writing up your experiences or opening pull requests to share your zero-trust or identity-centric security transformations. This knowledge benefits other UK public sector organisations striving to reduce reliance on network perimeters and adopt robust, identity-first security models.

How do you use two-factor or multi-factor authentication (2FA/MFA)? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move from an “encouraged” MFA model to a consistent approach:

Identify Privileged Accounts First
- Immediately enforce MFA for admin or root-level users, referencing AWS IAM MFA on privileged roles, Azure AD MFA on global admins, GCP IAM MFA, or OCI IAM MFA.
Educate Staff on Risks
- Provide short e-learning or internal comms about real incidents caused by single-factor breaches:
  - e.g., referencing NCSC’s blog or case studies on stolen credentials.
Incentivise Voluntary Adoption
- Recognise teams or individuals who enable MFA (e.g., shout-outs or small accolades).
- Encourages cultural acceptance before a final mandate.
Publish a Simple Internal FAQ
- Outline how to set up Google Authenticator, Microsoft Authenticator, hardware tokens, or other TOTP apps.
- Minimises friction for new adopters.
Plan a Timeline for Mandatory MFA
- Over 3–6 months, aim to require MFA for at least all staff accessing sensitive services.

By prioritising MFA for privileged users, educating staff on credential compromise scenarios, incentivising early adoption, providing user-friendly setup instructions, and scheduling a near-future MFA mandate, you evolve from optional guidance to real protective measures.

How to do better

Below are rapidly actionable methods to close the enforcement gap:

Enable Enforcement in Cloud IAM
- E.g., AWS IAM or AWS SSO policy to force MFA, Azure AD conditional access “MFA always required”, GCP Organization Policy for MFA, or OCI IAM policy for mandatory MFA.
Monitor for Noncompliance
- Generate monthly or weekly reports on which users still lack MFA:
  - e.g., AWS Security Hub or custom queries, Azure AD “Users without MFA” query, GCP Cloud Identity “MFA usage” checks, or OCI IAM logging for no-MFA users.
Apply a Hard Deadline
- Communicate a date beyond which single-factor logins will be revoked, referencing official departmental or local policy.
Offer Support & Tools
- Provide hardware tokens for staff without suitable smartphones, referencing FIDO2 or YubiKey-based methods recommended by NCSC or NIST.
Handle Legacy Systems
- For older apps, implement an SSO or MFA proxy if direct integration isn’t possible, e.g., Azure App Proxy or GCP IAP, AWS SSO bridging, or OCI integration with IDCS.

By enabling built-in forced MFA, monitoring compliance, communicating a strict cutoff date, supplying alternative authenticators, and bridging older systems with SSO or proxy solutions, you systematically remove any gaps that allow single-factor access.

How to do better

Below are rapidly actionable ways to remove or mitigate the last few exceptions:

Document a Sunset Plan for Exceptions
- If a system can’t integrate MFA now, define a target date or solution path (like an MFA-proxy or upgrade).
- Minimises indefinite exceptions.
Risk-Base or Step-Up
- If certain actions are higher risk (e.g., large data exports), require a second factor again or hardware key.
- referencing Azure Conditional Access, AWS contextual MFA, GCP BeyondCorp enterprise settings, or OCI advanced IAM polices.
Consider Device-Focused Security
- For known lower-risk cases, confirm devices meet compliance (updated OS, MDM) as a mitigating factor.
- referencing NCSC device posture or zero-trust approaches.
Combine with Identity-Centric Security
- Move from perimeter to identity-based approach if not already, ensuring MFA is central in each request’s trust decision.
- referencing NIST SP 800-207 zero-trust architecture, NCSC guidelines.
Review & Renew
- Periodically re-check each exception’s rationale—some may no longer be valid as technology or policies evolve.

By planning for the eventual elimination of exceptions, deploying step-up authentication for sensitive tasks, ensuring device posture checks for minimal-risk scenarios, integrating identity-based zero-trust, and reviewing exceptions regularly, you further strengthen your universal MFA adoption.

How to do better

Below are rapidly actionable enhancements:

Adopt FIDO2 or Hardware Security Keys
- For highly privileged accounts, consider YubiKey, Feitian, or other FIDO2-based solutions offering strong phishing resistance.
Set Up Backup Mechanisms
- Provide staff a fallback if TOTP or hardware tokens are lost/stolen:
  - e.g., secure self-service recovery using AWS SSO with backup codes, Azure AD with alternative verification, GCP Identity fallback factors, or OCI IAM backup tokens.
Integrate Risk-Based Policies
- If an account attempts to log in from an unusual location, require a higher assurance factor:
  - referencing Azure Conditional Access location-based rules, AWS context keys, GCP Access Context Manager, or OCI policy conditions.
Consider Device Certificates
- For some use cases, device-based certificates or mTLS can supplement user factors, further preventing compromised endpoints from impersonation.
Regularly Revisit Factor Security
- Check if new vulnerabilities arise in your TOTP or hardware token methods, referencing NCSC’s hardware token security briefs, FIDO Alliance updates, or NIST advisories.

By introducing hardware-based MFA, ensuring robust fallback processes, applying risk-based authentication for suspicious attempts, deploying device certs, and staying alert to newly discovered factor vulnerabilities, you push your “no weak MFA” stance to a sophisticated, security-first environment.

How to do better

Below are rapidly actionable ways to optimise hardware-based MFA:

Embrace Risk-Based Authentication
- If unusual attempts occur, force an additional step or token re-validation:
  - referencing Azure AD Conditional Access with hardware tokens, AWS context-based policies, GCP identity risk signals, or OCI advanced IAM policies.
Implement Zero-Trust & Microsegmentation
- Pair hardware tokens with per-request or per-service authentication. Each microservice may require ephemeral token requests.
- referencing NIST SP 800-207 zero-trust architecture guidelines.
Maintain Inventory & Lifecycle
- Automate key distribution, revocation, or replacement. If a staff member loses a token, the system quickly blocks it.
- e.g., a central asset management or HR-driven approach ensuring no leftover active tokens for departed staff.
Test Against Realistic Threats
- Conduct red team exercises specifically targeting hardware token scenarios:
  - referencing NCSC or local ITHC red/purple teaming best practices.
Plan for Cross-department Interoperability
- If staff need to collaborate with other departments, consider bridging identity solutions or allowing hardware tokens recognised across multiple organisations:
  - referencing GOV.UK single sign-on or cross-department identity frameworks.

By coupling hardware tokens with adaptive risk checks, adopting zero-trust microsegmentation for each request, carefully managing the entire token lifecycle, running targeted red team tests, and exploring cross-department usage, you elevate an already stringent hardware-based MFA approach to a seamlessly integrated, high-security ecosystem suitable for sensitive UK public sector operations.

Keep doing what you’re doing, and consider sharing your experiences or opening pull requests to this guidance. Others in the UK public sector can learn from how you enforce robust MFA standards, whether using FIDO2 hardware keys, advanced risk-based checks, or zero-trust patterns.

How do you manage privileged access? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond ad-hoc privileged credential management:

Create a Basic Privileged Access Policy
- Even a short doc stating how privileged accounts are created, stored, rotated, and revoked is better than none.
- Referencing NCSC’s privileged access management best practices.
Mandate Individual Admin Accounts
- Eliminate shared “admin” user logins. Each privileged user gets a unique account so you can track actions.
Introduce MFA for Admins
- Even if no vaulting solution is in place, require multi-factor authentication on any privileged ID:
  - AWS IAM with MFA, Azure AD PIM MFA, GCP IAM accounts with MFA, or OCI IAM MFA, IBM Cloud MFA
Document & Track Privileged Roles
- Keep a minimal register or spreadsheet listing all privileged accounts, systems they access, and assigned owners:
  - Helps see if too many administrators exist.
Schedule Transition to Vaulting
- Plan to adopt a basic password vault or secrets manager, e.g., AWS Secrets Manager, Azure Key Vault, GCP Secret Manager, or OCI Vault, or IBM Cloud Secrets Manager for privileged credentials in the next 3-6 months.

By creating a short privileged access policy, enforcing unique admin accounts with MFA, documenting roles, and preparing for a vault-based solution, you significantly reduce the risk of ad-hoc mismanagement and insider threats.

How to do better

Below are rapidly actionable steps to refine centralised vaulting:

Enable Automatic Credential Rotation
- Many vault solutions allow scheduled rotation:
  - e.g., AWS Secrets Manager rotation, Azure Key Vault versioning, GCP Secret Manager rotation, or OCI Vault key rotation.
Integrate with CI/CD
- If dev pipelines need privileged credentials (e.g., for deployment), fetch them from the vault at runtime, never storing them in code or config:
  - referencing NCSC’s guidance on secrets management.
Automate Access Reviews
- Regularly review who has vault access, removing staff or contractors who no longer need it, referencing NIST SP 800-53 AC-2 for continuous account management.
Adopt Fine-Grained Access Policies
- Distinguish read-only vs. rotate vs. admin permissions in the vault.
- e.g., AWS IAM roles for Secrets Manager, Azure RBAC for Key Vault, GCP IAM for Secret Manager, or OCI IAM compartment policies.
Add Multi-Factor for Vault Access
- Ensure staff need an extra factor to retrieve privileged credentials from the vault, referencing NCSC’s MFA best practice.

By rotating credentials automatically, integrating vault secrets into CI/CD, conducting periodic access reviews, refining vault access policies, and enforcing MFA for vault retrieval, you build a stronger, more secure foundation for privileged credentials management.

How to do better

Below are rapidly actionable ways to strengthen identity administration and OTP usage:

Integrate OTP into Break-Glass Procedures
- When a user escalates to super-admin, require a one-time password from the vault, valid only for a few minutes:
  - e.g., AWS STS with short-lived tokens, Azure PIM with just-in-time, GCP short-lived role tokens, or OCI dynamic tokens.
Use Security Keys for Admin Access
- Consider hardware tokens (FIDO2, YubiKey) for privileged roles.
- referencing NCSC’s hardware token guidance.
Automate Logging & Alerts
- Generate real-time alerts if an OTP is used or if multiple OTP requests appear in quick succession:
  - e.g., AWS CloudWatch Events, Azure Monitor Alerts, GCP Logging Alerts, or OCI Notifications.
Schedule Regular Privileged Access Reviews
- Confirm that each privileged user still needs their role.
- referencing NIST SP 800-53 AC-3 for minimal role-based privileges.
Expand OTP to Non-Human Accounts
- Where feasible, short-lived tokens for services or automation tasks too, fostering ephemeral credentials.

By embedding OTP steps in break-glass procedures, adopting hardware tokens for admins, enabling automated logs/alerts, reviewing privileged roles frequently, and using ephemeral tokens for services as well, you build a more rigorous privileged access model with robust checks.

How to do better

Below are rapidly actionable ways to elevate automated, risk-based privileged access:

Incorporate Threat Intelligence
- If certain privileged users or roles are targeted in known campaigns, your system should adapt policies:
  - e.g., Azure Sentinel threat intel, AWS Security Hub with curated feeds, GCP Chronicle threat analysis, or OCI threat intelligence integrations.
Tie Access to Device Posture
- Checking if the user’s device meets security standards (latest patches, MDM compliance) before granting elevated privileges:
  - referencing NCSC’s device posture or MDM recommendations.
Implement Granular Observability
- For privileged sessions, record or track commands in near real-time, ensuring immediate response to suspicious operations:
  - e.g., AWS CloudTrail with CloudWatch Alarms, Azure Monitor logs for advanced admin actions, GCP Admin Activity logs, or OCI Audit logs + notifications.
Automate Just-in-Time (JIT) Access
- Use short-lived role escalations that revert automatically:
  - e.g., Azure Privileged Identity Management, GCP ephemeral role grants, AWS STS custom sessions, or OCI dynamic group tokens with time-based constraints.
Regular Security Drills
- Conduct scenario testing or red team exercises focusing on privileged accounts.
- referencing NCSC red teaming best practices.

By combining threat intelligence, verifying device posture, enabling granular session-level logging, adopting just-in-time privileges, and running regular security exercises, you further refine risk-based controls for privileged access across all cloud platforms.

How to do better

Below are rapidly actionable ways to optimise context-aware just-in-time privileges:

Deeper Risk-Based Logic
- For example, if a user requests privileged access on a weekend, the system demands additional manager approval or a second hardware token.
- referencing Azure PIM advanced policies, AWS Access Analyzer with context conditions, GCP short-lived roles + custom conditions, or OCI advanced IAM condition checks.
Enforce Micro-Segmentation
- Combine ephemeral privileges with strict micro-segmentation: each resource requires a separate ephemeral token:
  - Minimises lateral movement if any one credential is compromised.
Incorporate Real-Time Forensic Tools
- If privileged activity looks unusual, log a forensic snapshot or automatically isolate that user session:
  - referencing NCSC forensic readiness or advanced threat detection approaches.
Enable AI/ML Anomaly Detection
- Tools or scripts that examine normal patterns for each user, alerting on out-of-norm privileged requests:
  - e.g., Azure Sentinel ML rules, AWS DevOps Guru or Security Hub custom checks, GCP Chronicle AI, or OCI Security Advisor advanced analytics.
Regular Multi-Stakeholder Drills
- Include managers, security leads, and senior leadership in simulated privileged escalation misuse scenarios:
  - refining the after-action wash-up process, referencing NIST SP 800-61 incident handling guide or red/purple teaming.

By enhancing risk-based logic in JIT access, pairing ephemeral privileges with micro-segmentation, adopting real-time forensic checks, integrating AI-based anomaly detection, and practicing multi-stakeholder drills, you perfect a context-aware just-in-time privileged access model that secures the most sensitive operations in the UK public sector context.

Keep doing what you’re doing, and consider blogging or creating pull requests to share your experiences in implementing advanced privileged access systems with just-in-time context-based controls. Such knowledge benefits other UK public sector bodies aiming to secure administrative actions under a zero-trust, ephemeral access paradigm.

How does your organisation respond to security breaches and incidents? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond manual classification:

Adopt a Simple Data Classification Scheme
- E.g., Official, Official-Sensitive, or your departmental equivalents.
- Align with GOV.UK’s Government Security Classifications or relevant local policies.
Introduce Basic Tooling
- For shared file systems or code repos, use built-in labeling or metadata:
  - Azure Information Protection, AWS Macie or S3 object tagging, GCP DLP + labeling, or OCI data labeling approaches, IBM Cloud - Cloud Object Storage
Require Access Controls
- Even if classification is manual, enforce least privilege for each data repository:
  - referencing NCSC’s zero trust or NIST SP 800-207 for identity-based data protection.
Document a Minimal Process
- A short policy clarifying how staff label data, who can reclassify, and how they request access changes:
  - Minimises confusion or inconsistent labeling.
Plan for Automated Classification
- In the next 3–6 months, evaluate solutions like AWS Macie, Azure Purview, GCP DLP, or OCI Cloud Guard data detection for partial automation.

By introducing a simple classification scheme, adopting minimal tooling for labeling, ensuring basic least-privilege access, documenting a short classification process, and preparing for automated solutions, you create a more structured approach to data security than purely manual methods.

How to do better

Below are rapidly actionable ways to strengthen centralised data security policies:

Implement Automated Policy Enforcement
- Tools that apply encryption, retention, or classification automatically, e.g.:
  - AWS KMS + S3 bucket policies, Azure Information Protection auto-labelling, GCP DLP auto-redaction, or OCI encryption + data labeling solutions.
Add Tiered Access
- For sensitive data sets, require stronger verification or ephemeral credentials before granting read/write:
  - referencing NCSC’s privileged access management guidance or NIST SP 800-53 AC controls for data segmentation.
Consolidate Data Stores
- If departmental data is scattered, unify them under controlled solutions:
  - e.g., Azure Purview, AWS Glue Data Catalog + Macie, GCP Data Catalog, or OCI Data Catalog for consistent policy application.
Define a Data Lifecycle
- Outline how data is created, stored, archived, or destroyed:
  - referencing NCSC’s data management best practices, GOV.UK records management policies, or NIST retention guidance.
Monitor for Policy Deviations
- Tools like AWS Config, Azure Policy, GCP Org Policy, or OCI Security Zones can detect if a new resource bypasses encryption or classification requirements.

By automating policy enforcement, requiring tiered access for sensitive data, consolidating data stores, clarifying data lifecycle, and monitoring for policy anomalies, you refine your centralised data security approach, ensuring consistent coverage and minimal manual drift.

How to do better

Below are rapidly actionable ways to expand limited monitoring:

Adopt or Expand DLP Tools
- e.g., Microsoft Purview DLP (Azure), AWS Macie for S3 data scanning, GCP DLP scanning, or OCI Cloud Guard data scanning.
- Configurable for alerts on large data exports or suspicious file patterns.
Integrate SIEM for Correlation
- e.g., Azure Sentinel, AWS Security Hub / CloudWatch Logs, GCP Chronicle, or OCI Security Advisor for data exfil attempts correlated with user roles or session logs.
Add Real-Time Alerts
- If a user downloads an unusually large amount of data or from unusual IPs, trigger immediate SOC or security team notifications.
Include Lateral Movement Checks
- If an account with normal read privileges suddenly tries to access data not in their job role, flag it:
  - referencing NCSC zero trust or NIST SP 800-207 identity-based microsegmentation guidelines.
Regular Drills and Tests
- Simulate data exfil attempts or insider threat to test if your limited monitoring indeed picks up suspicious events.

By leveraging or expanding DLP solutions, correlating logs in a SIEM, implementing real-time anomaly alerts, detecting lateral movement, and running exfiltration drills, you enhance your approach from partial monitoring to more comprehensive oversight of data movements.

How to do better

Below are rapidly actionable methods to reinforce automated detection:

Risk-Scored Alerts
- Combine user identity, device posture, and data classification to prioritise which anomalies matter most:
  - referencing Azure Sentinel ML rules, AWS Security Hub with risk-based scoring, GCP Chronicle risk detection, or OCI Security Advisor advanced analytics.
Automated Quarantine & Blocking
- If exfil is suspected, block the user session or transfer automatically, referencing NCSC incident management playbooks.
Integrate Threat Intelligence
- Use external feeds or cross-government intel to see if certain IP addresses or tactics target your data assets.
Regularly Update Detection Rules
- Threat patterns evolve; schedule monthly or quarterly rule reviews to incorporate the latest TTPs (tactics, techniques, and procedures) used by adversaries.
Drill Data Restoration
- Data corruption or deletion can be as damaging as exfil. Ensure backups and DR processes are tested frequently:
  - e.g., referencing AWS Backup + DR, Azure Backup + Site Recovery, GCP Backup & DR, or OCI Backup & DR Services.

By adding risk-scored alerts, automatically quarantining suspicious activity, incorporating threat intelligence, periodically updating detection rules, and verifying backups or DR for data restoration, you create a highly adaptive system that promptly detects and mitigates data breach attempts.

How to do better

Below are rapidly actionable ways to refine fully automated, proactive data security:

Leverage AI/ML for Data Anomalies
- Tools that identify unusual data patterns or exfil attempts automatically:
  - e.g., Azure Purview ML classification, AWS Macie or Amazon Detective, GCP DLP with ML, or OCI Logging Analytics AI-based detection.
Adopt Policy-as-Code
- Tools like Open Policy Agent or vendor-specific: AWS SCP, Azure Policy, GCP Organization Policy, or OCI Security Zones define data security in code for version-controlled, auditable changes.
Expand Zero-Trust Microsegmentation
- Ensure each request for data is validated at the identity, device posture, and context level, even inside your environment:
  - referencing NCSC or NIST zero-trust frameworks.
Cross-Government Data Sharing
- If relevant, unify or standardise data security controls across multiple agencies or local councils:
  - referencing GOV.UK guidance on data sharing and collaboration security.
Regular “Chaos” or Stress Tests
- Simulate insider threats, external hacking, or HPC data manipulations to confirm your automated defenses.
- referencing NCSC red/purple teaming best practices.

By employing AI-driven anomaly detection, embedding policy-as-code for data security, adopting zero-trust microsegmentation, collaborating on cross-government data controls, and running robust chaos or stress tests, you sustain a cutting-edge, proactive data protection approach suitable for the evolving demands of UK public sector operations.

Keep doing what you’re doing, and consider blogging about or opening pull requests to share how you maintain or improve your data breach mitigation strategies. Your experiences support other UK public sector organisations, reinforcing best practices under NCSC, NIST, and GOV.UK guidance.

Technology

How do you choose technologies for new projects? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable ways to move away from fully independent, unaligned technology decisions:

Start a Basic Tech Catalog
- Document each major technology used across projects, referencing at least version, licensing, security posture:
  - Helps discover overlaps or common solutions already in use.
Create a Minimal Governance Policy
- For instance, a short doc that outlines which technologies require sign-off (e.g., for security or cost reasons):
  - referencing GOV.UK’s technology code of practice or NCSC supply chain considerations.
Encourage Knowledge Sharing
- Run short “tech share” sessions, where teams present why they picked certain tools:
  - fosters cross-project alignment.
Identify Quick-Win Common Tools
- E.g., centralised logging or container orchestration solutions (AWS ECS/EKS, Azure AKS, GCP GKE, OCI OKE) standardising at least some operational aspects.
Plan for a Tech Radar or Steering Group
- Over the next 3–6 months, propose forming a small cross-departmental group or technology radar process to guide future selections.

By documenting existing tools, drafting minimal governance, facilitating knowledge exchange, pinpointing shared solutions, and preparing a technology steering approach, you mitigate fragmentation while still preserving some project autonomy.

How to do better

Below are rapidly actionable ways to refine a uniform tech mandate:

Allow Exceptions via a Lightweight Process
- Define how teams can request use of a new framework if they demonstrate clear benefits (e.g., for HPC, AI, or serverless solutions).
- referencing NCSC’s guidance on evaluating new cloud services securely.
Maintain a Living “Approved List”
- Encourage periodic updates to the mandated stack, adding modern solutions (like container orchestration or microservice frameworks) that align with cost and security best practices:
  - e.g., AWS ECS/EKS, Azure AKS, GCP GKE, or OCI OKE for container orchestration.
Pilot Innovations
- If staff identify potential new technology, sponsor a small pilot or proof-of-concept under controlled conditions, referencing NIST SP 800-160 SecDevOps guidelines.
Implement Regular Tech Reviews
- e.g., every 6–12 months, a board or steering group reviews the mandated stack in light of feedback or new GDS or NCSC recommendations.
Combine with Security & Cost Insights
- Show how uniform solutions reduce risk and expense, reassuring teams that standardisation benefits them while still enabling progress in areas like containerisation or DevSecOps.

By allowing exceptions via a straightforward process, regularly updating the approved tech list, sponsoring pilot projects, scheduling periodic reviews, and highlighting cost/security gains, you preserve the benefits of uniform technology while avoiding stagnation or shadow IT.

How to do better

Below are rapidly actionable ways to revitalise or replace outdated resources:

Initiate a Quick Radar Refresh
- A small cross-team group can produce an updated doc or web-based radar in 2-4 weeks, referencing recent frameworks, security improvements, and cost considerations:
  - e.g., adopting AWS Graviton, Azure Functions, GCP AI/ML solutions, or OCI HPC offerings.
Introduce a Living “Tech Patterns” Wiki
- Encourage teams to add their experiences or recommended patterns, so the resource remains collaborative and dynamic:
  - e.g., referencing [Confluence, GitHub Wiki, or internal SharePoint with version control].
Schedule Semi-Annual Reviews
- Put it on the organisational calendar to revisit or update the radar every 6 months, factoring in NCSC’s new advisories, GDS technology code of practice updates, or NIST’s emerging guidelines.
Gather Feedback
- Ask project teams what patterns they rely on or find missing. Include new technologies that have proven valuable:
  - fosters a sense of collective ownership.
Use Real Examples
- Populate the updated patterns with success stories from internal projects that solved real user needs.

By quickly refreshing the tech radar, establishing a living wiki, scheduling periodic updates, gathering project feedback, and focusing on real success stories, you transform outdated references into a relevant, frequently consulted guide that shapes better technology decisions.

How to do better

Below are rapidly actionable ways to enhance current, well-used guidance:

Introduce a “Feedback Loop”
- Provide an easy mechanism (e.g., Slack channel, GitHub Issues) for teams to propose new additions or share experiences.
- referencing NCSC’s agile and iterative approach to technology improvement.
Add Security & Cost Criteria
- For each technology in the radar, briefly discuss security posture and typical cost drivers (like egress fees or licensing):
  - referencing AWS TCO calculators, Azure Pricing, GCP Pricing, or OCI cost analysis tools.
Practice “Sunsetting”
- If a technology on the radar is outdated or replaced, mark it for deprecation with a recommended timeline:
  - Minimises legacy tech usage.
Conduct Regular Showcases
- Let teams demo how they used a recommended pattern or overcame a challenge.
- Encourages synergy and real adoption.
Cross-Gov Collaboration
- Consider aligning with other government department radars for consistency, referencing GOV.UK cross-department best practices or local council tech networks.

By enhancing feedback channels, adding security/cost insights to each item, marking deprecated technologies, hosting showcases, and collaborating across agencies, you keep the guidance fresh, relevant, and beneficial for new project tech decisions.

How to do better

Below are rapidly actionable ways to strengthen a collaborative, evolving tech ecosystem:

Establish a Formal Inner-Source Model
- Encourage code sharing or libraries across departments, referencing open-source practices but within the public sector context:
  - e.g., GitHub Enterprise or GitLab.
Encourage Pairing or Multi-Dept Projects
- Sponsor short stints where devs from different teams cross-pollinate or solve shared challenges:
  - referencing NCSC’s recommended cross-department collaboration practices.
Recognise Innovators
- Publicly highlight staff who introduce successful new frameworks or cost-saving architecture patterns:
  - fosters a healthy “improvement” culture.
Adopt Cross-department Show-and-Tell
- If relevant, share or co-present successful solutions with local councils or NHS, referencing GOV.UK cross-government knowledge sharing events.
Integrate Feedback into Tech Radar
- Each time a new solution is proven, update the radar or patterns promptly:
  - ensuring the living doc truly represents real usage and best practice.

By establishing an inner-source approach, supporting short cross-team collaborations, celebrating innovators, connecting with other public sector bodies for knowledge sharing, and consistently updating patterns or the tech radar, you continuously evolve an energetic ecosystem that fosters reuse, innovation, and high-quality technology decisions.

Keep doing what you’re doing, and consider writing some blog posts or opening pull requests to share how your collaborative, evolving tech environment benefits your UK public sector organisation. This helps others adopt or improve similar patterns and fosters a culture of open innovation across government.

What best describes your current technology stack? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to transition from a monolithic approach:

Identify Natural Component Boundaries
- E.g., separate a large monolith into core modules (user authentication, reporting, payment processing).
- Provide early scoping for partial decomposition.
Adopt Container or VM Packaging
- Even if the app remains monolithic, packaging in Docker /ECS, Azure Container Instances, GCP Cloud Run, or OCI Container Engine, or IBM Cloud Code Engine can simplify deployment and initial partial scaling.
Refactor Shared Libraries
- If multiple large monoliths share logic, isolate common code to reduce duplication:
  - referencing NCSC’s recommendations on code reuse and supply chain considerations.
Automate Basic CI/CD
- Even if a monolith, introduce versioned builds, automated tests, and environment-based deployments:
  - e.g., AWS CodePipeline, Azure DevOps, GCP Cloud Build, or OCI DevOps, or IBM Cloud Continuous Delivery
Plan a Phased Decomposition
- Over 6–12 months, pilot a single microservice or separate module as a stepping stone.
- referencing GOV.UK’s service manual for iterative technology improvements.

By identifying component boundaries, packaging the monolith for simpler deployments, refactoring shared libraries, automating CI/CD, and scheduling partial decomposition, you reduce friction and set a path toward more modular solutions.

How to do better

Below are rapidly actionable ways to shift modules from concept to independent deployment:

Introduce Containerisation at Module-Level
- If each module can run separately, containerise them individually:
  - referencing AWS ECS/EKS, Azure AKS, GCP GKE, or OCI OKE for container orchestration.
Provide Separate Build Pipelines
- For each module, create a distinct CI pipeline that compiles, tests, and packages it:
  - e.g., AWS CodeBuild + CodePipeline, Azure DevOps pipelines, GCP Cloud Build triggers, or OCI DevOps pipelines.
Adopt an API or Messaging Boundary
- Clarify how modules communicate via REST or message queues:
  - fosters loose coupling, referencing NCSC microservice security patterns or NIST microservices guidelines.
Test and Deploy Modules Independently
- Even if they remain part of a bigger system, trial partial independent deploys:
  - e.g., can you update a single library or microservice without redeploying everything?
Demonstrate Gains
- Show leadership how incremental module updates reduce downtime or accelerate security patching:
  - Encourages buy-in for further decoupling.

By containerising modules, setting up separate build pipelines, enforcing clear module boundaries, individually deploying or updating modules, and showcasing tangible benefits, you progress toward a fully independent deployment pipeline that capitalises on modularity.

How to do better

Below are rapidly actionable ways to handle interdependencies in individually deployable components:

Introduce Contract Testing
- For each module’s API or message interface, define stable contracts tested automatically:
  - referencing Pact.io, or custom contract test frameworks in AWS CodeBuild, Azure DevOps, GCP Cloud Build, or OCI DevOps pipelines.
Automate Consumer-Driven Testing
- Consumers of a service define expected inputs/outputs; the service must pass these for each release.
- Minimises “integration hell.”
Adopt Semantic Versioning
- Modules declare major/minor/patch versions, ensuring backward compatibility for minor or patch releases:
  - referencing NCSC software maturity advice or standard versioning best practices.
Publish a Dependency Matrix
- A short table or repo listing which module versions are known to be compatible, referencing GOV.UK or departmental guidance on multi-service environments.
Enforce Feature Flags
- If new functionality in one component requires changes in another, hide it behind a feature flag until both are deployed.
- referencing LaunchDarkly, Azure App Configuration flags, AWS AppConfig, GCP Cloud Run with feature toggles, or OCI config solutions.

By introducing contract or consumer-driven testing, adopting semantic versioning, publishing a compatibility matrix, and employing feature flags to manage cross-component rollouts, you reduce interdependency friction and safely leverage your modular architecture.

How to do better

Below are rapidly actionable ways to address the leftover monolithic elements:

Identify High-Impact Subsystem to Extract
- If a monolith is large, pick the subsystem or domain logic that changes most frequently. Migrate that to a microservice first.
- referencing AWS microservices patterns, Azure microservices guides, GCP microservices best practices, or OCI microservices solutions.
Establish Clear Migration Plan
- e.g., define a 12–24 month roadmap with incremental steps or re-platforming on containers:
  - Minimises big-bang rewrites.
Enhance DevOps for Monolith
- Even if it remains monolithic for a while, ensure robust CI/CD, container packaging, automated tests, referencing NCSC DevSecOps guidance.
Limit New Features in Legacy
- Encourage new capabilities or major enhancements in microservices around the edges, gradually reducing the monolith’s importance.
Highlight ROI & Risk
- Present management with cost of leaving the monolith vs. benefits of further decomposition (faster releases, easier security fixes).

By selecting high-impact subsystems for extraction, creating a phased migration plan, applying DevOps best practices to the existing monolith, steering new features away from legacy, and continuously communicating the ROI of decomposition, you inch closer to a fully modular environment.

How to do better

Below are rapidly actionable ways to optimise a fully component-based approach:

Enhance Observability & Tracing
- Adopt distributed tracing and advanced logging across microservices:
  - e.g., AWS X-Ray, Azure Monitor Application Insights, GCP Cloud Trace, or OCI Logging Analytics with tracing integrations.
Apply Zero-Trust for Service Communication
- Each microservice authenticates via mTLS or ephemeral tokens, referencing NCSC zero-trust or NIST SP 800-207 guidelines.
Adopt or Refine Service Mesh
- Tools like Istio, Linkerd, Consul, AWS App Mesh, Azure Service Fabric Mesh, GCP Anthos Service Mesh, or OCI OKE with mesh add-ons can handle cross-cutting concerns (observability, security, routing).
Continuous Architecture Review
- With so many components, schedule architecture retros or periodic design reviews ensuring no sprawl or duplication arises.
Collaborate Across Departments or Agencies
- If your microservices could benefit other public sector bodies (e.g., local councils or NHS units), share them via open repositories or knowledge sessions:
  - referencing GOV.UK or NCSC guidance on cross-public sector knowledge exchange.

By enhancing distributed tracing, adopting zero-trust service communications, exploring or refining a service mesh, scheduling architecture reviews, and collaborating with other government entities, you maintain a top-tier, fully component-based environment that remains agile, secure, and efficient in meeting public sector demands.

Keep doing what you’re doing, and consider sharing or blogging about your experience with modular architectures. Contributing pull requests to this guidance or other best-practice repositories helps UK public sector organisations adopt similarly progressive strategies for building and maintaining cloud and on-premises systems.