Report

Cost & Sustainability

How do you set capacity for live services?

You did not answer this question.

How to do better

Below are rapidly actionable steps to reduce waste and move beyond provisioning for the extreme peak:

  1. Implement Resource Monitoring and Basic Analytics

  2. Pilot Scheduled Shutdowns for Non-Critical Systems

    • Identify development and testing environments or batch-processing servers that don’t require 24/7 availability, and schedule them to shut down outside working hours (a minimal shutdown sketch appears after this list).
    • Sharing usage data from these systems with stakeholders can highlight the discrepancy between peak and average usage, demonstrating immediate cost savings without impacting production systems.
  3. Explore Simple Autoscaling Solutions

    • Even if you continue peak provisioning for mission-critical workloads, consider selecting a smaller or non-critical service to test autoscaling.

Implementing autoscaling in a controlled environment allows you to evaluate its benefits and challenges, providing valuable insights before considering broader adoption for more critical workloads.
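
For illustration, here is a minimal sketch of such a pilot on AWS, assuming an existing non-critical Auto Scaling group (the group name is hypothetical) and a simple CPU target-tracking policy; equivalent target-tracking options exist for Azure Virtual Machine Scale Sets, GCP managed instance groups, and OCI instance pools.

```python
# A minimal sketch, assuming an existing non-critical Auto Scaling group.
# The group name is hypothetical; the policy keeps average CPU near 50%.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="pilot-web-asg",  # hypothetical group name
    PolicyName="pilot-cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```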

  4. Review Reserved or Discounted Pricing

    • If you must maintain consistently high capacity, consider vendor discount programs to reduce per-hour costs:

      • AWS Savings Plans or Reserved Instances: AWS offers Savings Plans, which provide flexibility by allowing you to commit to a consistent amount of compute usage (measured in $/hour) over a 1- or 3-year term, applicable across various services and regions. Reserved Instances, on the other hand, involve committing to specific instance configurations for a term, offering significant discounts for predictable workloads.

      • Azure Reservations for VMs and Reserved Capacity: Azure provides Reservations that allow you to commit to a specific VM or database service for a 1- or 3-year period, resulting in cost savings compared to pay-as-you-go pricing. These reservations are ideal for workloads with predictable resource requirements.

      • GCP Committed Use Discounts: Google Cloud offers Committed Use Discounts, enabling you to commit to a certain amount of usage for a 1- or 3-year term, which can lead to substantial savings for steady-state or predictable workloads.

      • OCI Universal Credits: Oracle Cloud Infrastructure provides Universal Credits, allowing you to utilise any OCI platform service in any region with a flexible consumption model. By purchasing a sufficient number of credits, you can benefit from volume discounts and predictable billing, which is advantageous for maintaining high-capacity workloads.

      • IBM Cloud Reservations and Enterprise Savings Plan: IBM Cloud Reservations offer significant cost savings and dedicated resources for future deployments; you choose a 1- or 3-year term, server quantity, and specific profile, then provision those servers when needed. With the IBM Cloud Enterprise Savings Plan billing model, you commit to spend a certain amount on IBM Cloud and receive discounts across the platform; you are billed monthly based on usage and continue to receive a discount even after you reach your committed amount.

    • Implementing these discount programs won’t eliminate over-provisioning but can soften the budget impact.

  5. Engage Leadership on the Financial and Sustainability Benefits

    • Present how on-demand autoscaling or even basic scheduling can reduce overhead and potentially improve your service’s environmental footprint.
    • Link these improvements to departmental net-zero or cost reduction goals, highlighting easy wins.
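
As a starting point for the scheduled shutdowns in step 2, the sketch below stops running EC2 instances that carry a hypothetical Schedule=office-hours tag; it could run from a scheduled Lambda function or a nightly cron job. The tag name and schedule are assumptions to adapt locally, and the other providers offer equivalent start/stop automation.

```python
import boto3

ec2 = boto3.client("ec2")

# Find running instances carrying the hypothetical "Schedule=office-hours" tag.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Schedule", "Values": ["office-hours"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
]

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print(f"Stopping {len(instance_ids)} instance(s) for the evening.")
```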

Through monitoring, scheduling, basic autoscaling pilots, and potential reserved capacity, you can move away from static peak provisioning. This approach preserves reliability while unlocking efficiency gains—an important step in balancing cost, compliance, and performance goals in the UK public sector.

How to do better

Here are rapidly actionable steps to evolve from manual seasonal scaling to a more automated, responsive model:

  1. Automate the Manual Steps You Already Do

    • If you anticipate seasonal peaks (e.g., quarterly public reporting load), replace manual processes with scheduled scripts to ensure timely scaling and prevent missed scale-downs:

      • AWS: Utilise AWS Step Functions in conjunction with Amazon EventBridge Scheduler to automate the start and stop of EC2 instances based on a defined schedule.

      • Azure: Implement Azure Automation Runbooks within Automation Accounts to create scripts that manage the scaling of resources during peak periods.

      • Google Cloud Platform (GCP): Leverage Cloud Scheduler to trigger Cloud Functions or Terraform scripts that adjust instance groups in response to anticipated load changes.

      • Oracle Cloud Infrastructure (OCI): Use Resource Manager stacks combined with Cron tasks to schedule scaling events, ensuring resources are appropriately managed during peak times.

    • Automating these processes ensures that scaling actions occur as planned, reducing the risk of human error and optimising resource utilisation during peak and off-peak periods (a minimal scheduling sketch appears at the end of this section).

  2. Identify and Enforce “Scale-Back” Windows

    • Even if you scale up for busy times, ensure you have a defined “sunset” for increased capacity:
      • Configure an autoscaling group or scale set to revert to default size after the peak.
      • Set reminders or triggers to ensure you don’t pay for extra capacity indefinitely.
  3. Introduce Autoscaling on a Limited Component

    • Choose a module that frequently experiences load variations within a day or week—perhaps a web front-end for a public information portal:

      • AWS: Implement Auto Scaling Groups with CPU-based or request-based triggers to automatically adjust the number of EC2 instances handling your service’s load.

      • Azure: Utilise Virtual Machine Scale Sets or the AKS Cluster Autoscaler to manage the scaling of virtual machines or Kubernetes clusters for your busiest microservices.

      • Google Cloud Platform (GCP): Use Managed Instance Groups with load-based autoscaling to dynamically adjust the number of instances serving your front-end application based on real-time demand.

      • Oracle Cloud Infrastructure (OCI): Apply Instance Pool Autoscaling or the OKE Cluster Autoscaler to automatically scale a specific containerised service in response to workload changes.

    • Implementing autoscaling on a targeted component allows you to observe immediate benefits, such as improved resource utilisation and cost efficiency, which can encourage broader adoption across your infrastructure.

  4. Consider Serverless for Spiky Components

    • If certain tasks run sporadically (e.g., monthly data transformation or PDF generation), investigate moving them to event-driven or serverless solutions:

      • AWS: Utilise AWS Lambda for event-driven functions or AWS Fargate for running containers without managing servers. AWS Lambda is ideal for short-duration, event-driven tasks, while AWS Fargate is better suited for longer-running applications and tasks requiring intricate orchestration.

      • Azure: Implement Azure Functions for serverless compute, Logic Apps for workflow automation, or Container Apps for running microservices and containerised applications. Azure Logic Apps can automate workflows and business processes, making them suitable for scheduled tasks.

      • Google Cloud Platform (GCP): Deploy Cloud Functions for lightweight event-driven functions or Cloud Run for running containerised applications in a fully managed environment. Cloud Run is suitable for web-based workloads, REST or gRPC APIs, and internal custom back-office apps.

      • Oracle Cloud Infrastructure (OCI): Use OCI Functions for on-demand, serverless workloads. OCI Functions is a fully managed, multi-tenant, highly scalable, on-demand, Functions-as-a-Service platform built on enterprise-grade infrastructure.

    • Transitioning to serverless solutions for sporadic tasks eliminates the need to manually adjust virtual machines for short bursts, enhancing efficiency and reducing operational overhead.

  5. Monitor and Alert on Usage Deviations

    • Utilise cost and performance alerts to detect unexpected surges or prolonged idle resources:

      • AWS: Implement AWS Budgets to set custom cost and usage thresholds, receiving alerts when limits are approached or exceeded. Additionally, use Amazon CloudWatch’s anomaly detection to monitor metrics and identify unusual patterns in resource utilisation.

      • Azure: Set up Azure Monitor alerts to track resource performance and configure cost anomaly alerts within Azure Cost Management to detect and notify you of unexpected spending patterns.

      • Google Cloud Platform (GCP): Create budgets in Google Cloud Billing and configure Pub/Sub notifications to receive alerts on cost anomalies, enabling prompt responses to unexpected expenses.

      • Oracle Cloud Infrastructure (OCI): Establish budgets and set up alert rules in OCI Cost Management to monitor spending. Additionally, configure OCI Alarms with notifications to detect and respond to unusual resource usage patterns.

    • Implementing these alerts enables quicker responses to anomalies, reducing the reliance on manual monitoring and helping to maintain optimal resource utilisation and cost efficiency.

By automating your manual scaling processes, exploring partial autoscaling, and shifting spiky tasks to serverless, you unlock more agility and cost efficiency. This approach helps ensure you’re not left scrambling if usage deviates from seasonal patterns.
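
To make steps 1 and 2 concrete, here is a minimal sketch using AWS Auto Scaling scheduled actions; the group name, capacities, and cron expressions are hypothetical, and the same pattern applies to the schedulers listed for the other providers above.

```python
# A minimal sketch of a scheduled scale-up before a quarterly reporting peak
# and an enforced scale-back a week later. All values are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")

# Scale up ahead of the quarterly reporting peak (07:00 UTC on the 1st of Jan/Apr/Jul/Oct).
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="reporting-asg",      # hypothetical group name
    ScheduledActionName="quarterly-scale-up",
    Recurrence="0 7 1 1,4,7,10 *",
    MinSize=4,
    MaxSize=12,
    DesiredCapacity=6,
)

# Enforce the scale-back window a week later so extra capacity is not left running indefinitely.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="reporting-asg",
    ScheduledActionName="quarterly-scale-back",
    Recurrence="0 19 8 1,4,7,10 *",
    MinSize=1,
    MaxSize=4,
    DesiredCapacity=2,
)
```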

How to do better

Below are actionable ways to upgrade from basic autoscaling:

  1. Broaden Autoscaling Coverage

    • Extend autoscaling to more workloads to enhance efficiency and responsiveness:

      • AWS:

        • EC2 Auto Scaling: Implement EC2 Auto Scaling across multiple groups to automatically adjust the number of EC2 instances based on demand, ensuring consistent application performance.
        • ECS Service Auto Scaling: Configure Amazon ECS Service Auto Scaling to automatically scale your containerised services in response to changing demand.
        • RDS Auto Scaling: Utilise Amazon Aurora Auto Scaling to automatically adjust the number of Aurora Replicas to handle changes in workload demand.
      • Azure:

        • Virtual Machine Scale Sets (VMSS): Deploy Azure Virtual Machine Scale Sets to manage and scale multiple VMs for various services, automatically adjusting capacity based on demand.
        • Azure Kubernetes Service (AKS): Implement the AKS Cluster Autoscaler to automatically adjust the number of nodes in your cluster based on resource requirements.
        • Azure SQL Elastic Pools: Use Azure SQL Elastic Pools to manage and scale multiple databases with varying usage patterns, optimising resource utilisation and cost.
      • Google Cloud Platform (GCP):

        • Managed Instance Groups (MIGs): Expand the use of Managed Instance Groups with autoscaling across multiple zones to ensure high availability and automatic scaling of your applications.
        • Cloud SQL Autoscaling: Leverage Cloud SQL’s automatic storage increase to handle growing database storage needs without manual intervention.
      • Oracle Cloud Infrastructure (OCI):

    • Gradually incorporating more of your application’s microservices into the autoscaling framework can lead to improved performance, cost efficiency, and resilience across your infrastructure.

  2. Incorporate More Granular Metrics

  3. Implement Dynamic, Scheduled, or Predictive Scaling

    • If you observe consistent patterns in your application’s usage—such as increased activity during lunchtime or reduced traffic on weekends—consider enhancing your existing autoscaling strategies with scheduled scaling actions.

    • Implementing scheduled scaling allows your system to proactively adjust resources in anticipation of predictable workload changes, enhancing performance and cost efficiency.

    • For environments with variable and unpredictable workloads, consider utilising predictive scaling features. Predictive scaling analyzes historical data to forecast future demand, enabling the system to scale resources in advance of anticipated spikes. This approach combines the benefits of both proactive and reactive scaling, ensuring optimal resource availability and responsiveness.

      • AWS: Explore Predictive Scaling for Amazon EC2 Auto Scaling, which uses machine learning models to forecast traffic patterns and adjust capacity accordingly.

      • Azure: While Azure does not currently offer a native predictive scaling feature, you can implement custom solutions by analyzing historical metrics through Azure Monitor and creating automation scripts to adjust scaling based on predicted trends.

      • GCP: Google Cloud’s autoscaler primarily operates on real-time metrics. For predictive capabilities, consider developing custom predictive models using historical data from Cloud Monitoring to inform scaling decisions.

      • OCI: Oracle Cloud Infrastructure allows for the creation of custom scripts and functions to implement predictive scaling based on historical usage patterns, although a native predictive scaling feature may not be available.

    • By integrating scheduled and predictive scaling strategies, you can enhance your application’s ability to handle varying workloads efficiently, ensuring optimal performance while managing costs effectively.

  4. Enhance Observability to Validate Autoscaling Efficacy

    • Instrument your autoscaling events and track them to ensure optimal performance and resource utilisation:

      • Dashboard Real-Time Metrics: Monitor CPU, memory, and queue metrics alongside scaling events to visualise system performance in real-time.

      • Analyze Scaling Timeliness: Assess whether scaling actions occur promptly by checking for prolonged high CPU usage or frequent scale-in events that may indicate over-scaling.

    • Tools:

      • AWS:

        • AWS X-Ray: Utilise AWS X-Ray to trace requests through your application, gaining insights into performance bottlenecks and the impact of scaling events.

        • Amazon CloudWatch: Create dashboards in Amazon CloudWatch to display real-time metrics and logs, correlating them with scaling activities for comprehensive monitoring.

      • Azure:

        • Azure Monitor: Leverage Azure Monitor to collect and analyze telemetry data, setting up alerts and visualisations to track performance metrics in relation to scaling events.

        • Application Insights: Use Azure Application Insights to detect anomalies and diagnose issues, correlating scaling actions with application performance for deeper analysis.

      • Google Cloud Platform (GCP):

        • Cloud Monitoring: Employ Google Cloud’s Operations Suite to monitor and visualise metrics, setting up dashboards that reflect the relationship between resource utilisation and scaling events.

        • Cloud Logging and Tracing: Implement Cloud Logging and Cloud Trace to collect logs and trace data, enabling the analysis of autoscaling impacts on application performance.

      • Oracle Cloud Infrastructure (OCI):

        • OCI Logging: Use OCI Logging to manage and search logs, providing visibility into scaling events and their effects on system performance.

        • OCI Monitoring: Utilise OCI Monitoring to track metrics and set alarms, ensuring that scaling actions align with performance expectations.

    • By enhancing observability, you can validate the effectiveness of your autoscaling strategies, promptly identify and address issues, and optimise resource allocation to maintain application performance and cost efficiency.

  5. Adopt Spot/Preemptible Instances for Autoscaled Non-Critical Workloads

    • To further optimise costs, consider utilising spot or preemptible virtual machines (VMs) for non-critical, autoscaled workloads. These instances are offered at significant discounts compared to standard on-demand instances but can be terminated by the cloud provider when resources are needed elsewhere. Therefore, they are best suited for fault-tolerant and flexible applications.

      • AWS: Implement EC2 Spot Instances within an Auto Scaling Group to run fault-tolerant workloads at up to 90% off the On-Demand price. By configuring Auto Scaling groups with mixed instances, you can combine Spot Instances with On-Demand Instances to balance cost and availability.

      • Azure: Utilise Azure Spot Virtual Machines within Virtual Machine Scale Sets for non-critical workloads. Azure Spot VMs allow you to take advantage of unused capacity at significant cost savings, making them ideal for interruptible workloads such as batch processing jobs and development/testing environments.

      • Google Cloud Platform (GCP): Deploy Preemptible VMs in Managed Instance Groups to run short-duration, fault-tolerant workloads at a reduced cost. Preemptible VMs provide substantial savings for workloads that can tolerate interruptions, such as data analysis and batch processing tasks.

      • Oracle Cloud Infrastructure (OCI): Leverage Preemptible Instances for batch processing or flexible tasks. OCI Preemptible Instances offer a cost-effective solution for workloads that are resilient to interruptions, enabling efficient scaling of non-critical applications.

    • By integrating these cost-effective instance types into your autoscaling strategies, you can significantly reduce expenses for non-critical workloads while maintaining the flexibility to scale resources as needed.
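
One way this mixed On-Demand/Spot pattern could look on AWS is sketched below; the launch template, subnets, and capacity split are hypothetical and should be tuned to how much interruption the workload can tolerate.

```python
# A minimal sketch of an Auto Scaling group that blends On-Demand and Spot
# capacity for a fault-tolerant workload. All identifiers are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="batch-workers-asg",          # hypothetical
    MinSize=0,
    MaxSize=20,
    DesiredCapacity=4,
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222",   # hypothetical subnets
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0abc123def4567890",  # hypothetical
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": "m5.large"},
                {"InstanceType": "m5a.large"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 1,
            "OnDemandPercentageAboveBaseCapacity": 25,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```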

By broadening autoscaling across more components, incorporating richer metrics, scheduling, and advanced cost strategies like spot instances, you transform your “basic” scaling approach into a more agile, cost-effective solution. Over time, these steps foster robust, automated resource management across your entire environment.

How to do better

Here are actionable ways to refine your widespread autoscaling strategy to handle more nuanced workloads:

  1. Adopt Application-Level or Log-Based Metrics

    • Move beyond CPU and memory metrics to incorporate transaction rates, request latency, or user concurrency for more responsive and efficient autoscaling:

      • AWS:

        • CloudWatch Custom Metrics: Publish custom metrics derived from application logs to Amazon CloudWatch, enabling monitoring of specific application-level indicators such as transaction rates and user concurrency.
        • Real-Time Log Analysis with Kinesis and Lambda: Set up real-time log analysis by streaming logs through Amazon Kinesis and processing them with AWS Lambda to generate dynamic scaling triggers based on application behavior.
      • Azure:

        • Application Insights: Utilise Azure Monitor’s Application Insights to collect detailed usage data, including request rates and response times, which can inform scaling decisions for services hosted in Azure Kubernetes Service (AKS) or Virtual Machine Scale Sets.
        • Custom Logs for Scaling Signals: Implement custom logging to capture specific application metrics and configure Azure Monitor to use these logs as signals for autoscaling, enhancing responsiveness to real-time application demands.
      • Google Cloud Platform (GCP):

        • Cloud Monitoring Custom Metrics: Create custom metrics in Google Cloud’s Monitoring to track application-specific indicators such as request count, latency, or queue depth, facilitating more precise autoscaling of Compute Engine (GCE) instances or Google Kubernetes Engine (GKE) clusters.
        • Integration with Logging: Combine Cloud Logging with Cloud Monitoring to analyze application logs and derive metrics that can trigger autoscaling events based on real-time application performance.
      • Oracle Cloud Infrastructure (OCI):

        • Monitoring Custom Metrics: Leverage OCI Monitoring to create custom metrics from application logs, capturing detailed performance indicators that can inform autoscaling decisions.
        • Logging Analytics: Use OCI Logging Analytics to process and analyze application logs, extracting metrics that reflect user concurrency or transaction rates, which can then be used to trigger autoscaling events.
    • Incorporating application-level and log-based metrics into your autoscaling strategy allows for more nuanced and effective scaling decisions, ensuring that resources align closely with actual application demands and improving overall performance and cost efficiency (a minimal custom-metric sketch appears at the end of this section).

  2. Introduce Multi-Metric Policies

    • Instead of a single threshold, combine metrics. For instance:
      • Scale up if CPU > 70% AND average request latency > 300ms.
      • This ensures you only scale when both resource utilisation and user experience degrade, reducing false positives or unneeded expansions.
  3. Implement Predictive or Machine Learning–Driven Autoscaling

    • To anticipate demand spikes before traditional metrics like CPU utilisation react, consider implementing predictive or machine learning–driven autoscaling solutions offered by cloud providers:

      • AWS:

        • Predictive Scaling for EC2 Auto Scaling: Use predictive scaling policies, which apply machine learning to forecast traffic patterns and adjust EC2 capacity ahead of anticipated demand.

      • Azure:

        • Predictive Autoscale: Utilise Predictive Autoscale in Azure Monitor, which employs machine learning to forecast CPU load for Virtual Machine Scale Sets based on historical usage patterns, enabling proactive scaling.
      • Google Cloud Platform (GCP):

        • Custom Machine Learning Models: Develop custom machine learning models to analyze historical performance data and predict future demand, triggering autoscaling events in services like Google Kubernetes Engine (GKE) or Cloud Run based on these forecasts.
      • Oracle Cloud Infrastructure (OCI):

        • Custom Analytics Integration: Integrate Oracle Analytics Cloud with OCI to perform machine learning–based forecasting, enabling predictive scaling by analyzing historical data and anticipating future resource requirements.
    • Implementing predictive or machine learning–driven autoscaling allows your applications to adjust resources proactively, maintaining performance and cost efficiency by anticipating demand before traditional metrics indicate the need for scaling.

  4. Correlate Autoscaling with End-User Experience

    • To enhance user satisfaction, align your autoscaling strategies with user-centric metrics such as page load times and overall responsiveness. By monitoring these metrics, you can ensure that scaling actions directly improve the end-user experience.

      • AWS:

        • Application Load Balancer (ALB) Target Response Times: Monitor ALB target response times using Amazon CloudWatch to assess backend performance. Elevated response times can indicate the need for scaling to maintain optimal user experience.
        • Network Load Balancer (NLB) Metrics: Track NLB metrics to monitor network performance and identify potential bottlenecks affecting end-user experience.
      • Azure:

        • Azure Front Door Logs: Analyze Azure Front Door logs to monitor end-to-end latency and other performance metrics. Insights from these logs can inform scaling decisions to enhance user experience.
        • Application Insights: Utilise Application Insights to collect detailed telemetry data, including response times and user interaction metrics, aiding in correlating autoscaling with user satisfaction.
      • Google Cloud Platform (GCP):

        • Cloud Load Balancing Logs: Examine Cloud Load Balancing logs to assess request latency and backend performance. Use this data to adjust autoscaling policies, ensuring they align with user experience goals.
        • Service Level Objectives (SLOs): Define SLOs in Cloud Monitoring to set performance targets based on user-centric metrics, enabling proactive scaling to meet user expectations.
      • Oracle Cloud Infrastructure (OCI):

        • Load Balancer Health Checks: Implement OCI Load Balancer health checks to monitor backend server performance. Use health check data to inform autoscaling decisions that directly impact user experience.
        • Custom Application Pings: Set up custom application pings to measure response times and user concurrency, feeding this data into autoscaling triggers to maintain optimal performance during varying user loads.
    • By integrating user-centric metrics into your autoscaling logic, you ensure that scaling actions are directly correlated with improvements in end-user experience, leading to higher satisfaction and engagement.

  5. Refine Scaling Cooldowns and Timers

    • Tweak scale-up and scale-down intervals to avoid thrashing:
      • A short scale-up delay can address spikes quickly.
      • A slightly longer scale-down delay prevents abrupt resource removals when a short spike recedes.
    • Evaluate your autoscaling policy settings monthly to align with evolving traffic patterns.

By incorporating more sophisticated application or log-based metrics, predictive scaling, and user-centric triggers, you ensure capacity aligns closely with real workloads. This approach elevates your autoscaling from a broad CPU/memory-based strategy to a finely tuned system that balances user experience, performance, and cost efficiency.
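
As a concrete example of step 1, the sketch below publishes an application-level latency figure as a CloudWatch custom metric that alarms or scaling policies can then reference; the namespace, dimension, and value are hypothetical.

```python
# A minimal sketch: publish request latency as a CloudWatch custom metric.
# The namespace, dimension, and value shown here are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="PublicPortal/Frontend",               # hypothetical namespace
    MetricData=[
        {
            "MetricName": "RequestLatencyMs",
            "Dimensions": [{"Name": "Service", "Value": "case-search"}],
            "Value": 287.0,                          # measured per request or per batch
            "Unit": "Milliseconds",
        }
    ],
)
```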

How to do better

Even at the top level, you can refine and push boundaries further:

  1. Adopt More Granular “Distributed SLO” Metrics

    • Evaluate Each Microservice’s Service-Level Objectives (SLOs): Define precise SLOs for each microservice, such as ensuring the 99th-percentile latency remains under 400 milliseconds. This granular approach allows for targeted performance monitoring and scaling decisions.

    • Utilise Cloud Provider Tools to Monitor and Enforce SLOs:

      • AWS:

        • CloudWatch ServiceLens: Integrate Amazon CloudWatch ServiceLens to gain comprehensive insights into application performance and availability, correlating metrics, logs, and traces.
        • Custom Metrics and SLO-Based Alerts: Implement custom CloudWatch metrics to monitor specific performance indicators and set up SLO-based alerts to proactively manage service health.
      • Azure:

        • Application Insights: Leverage Azure Monitor’s Application Insights to track detailed telemetry data, enabling the definition and monitoring of SLOs for individual microservices.
        • Service Map: Use Azure Monitor’s Service Map to visualise dependencies and performance metrics across services, aiding in the assessment of SLO adherence.
      • Google Cloud Platform (GCP):

        • Cloud Operations Suite: Employ Google Cloud’s Operations Suite to create SLO dashboards that monitor service performance against defined objectives, facilitating informed scaling decisions.
      • Oracle Cloud Infrastructure (OCI):

        • Observability and Management Platform: Implement OCI’s observability tools to define SLOs and correlate them with performance metrics, ensuring each microservice meets its performance targets.
    • Benefits of Implementing Distributed SLO Metrics:

      • Precision in Scaling: By closely monitoring how each component meets its SLOs, you can make informed decisions to scale resources appropriately, balancing performance needs with cost considerations.

      • Proactive Issue Detection: Granular SLO metrics enable the early detection of performance degradations within specific microservices, allowing for timely interventions before they impact the overall system.

      • Enhanced User Experience: Maintaining stringent SLOs ensures that end-users receive consistent and reliable service, thereby improving satisfaction and trust in your application.

    • Implementation Considerations:

      • Define Clear SLOs: Collaborate with stakeholders to establish realistic and measurable SLOs for each microservice, considering factors such as latency, throughput, and error rates.

      • Continuous Monitoring and Adjustment: Regularly review and adjust SLOs and associated monitoring tools to adapt to evolving application requirements and user expectations.

    • Conclusion: Adopting more granular “distributed SLO” metrics empowers you to fine-tune your application’s performance management, ensuring that each microservice operates within its defined parameters. This approach facilitates precise scaling decisions, optimising both performance and cost efficiency.

  2. Experiment with Multi-Provider or Hybrid Autoscaling

  3. Integrate with Detailed Cost Allocation & Forecasting

  4. Leverage AI/ML for Real-Time Scaling Decisions

  5. Adopt Sustainable/Green Autoscaling Policies

By blending advanced SLO-based scaling, multi-provider strategies, cost forecasting, ML-driven anomaly detection, and sustainability considerations, you ensure your autoscaling remains cutting-edge. This not only provides exemplary performance and cost control but also positions your UK public sector organisation as a leader in efficient, responsible cloud computing.
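
For step 1, a provider-neutral illustration of checking a 99th-percentile latency objective is sketched below; the sample latencies and the 400 ms target are illustrative only.

```python
import math

def percentile(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(values)
    index = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[index]

# Illustrative latency samples (milliseconds) for one microservice.
latencies_ms = [120, 180, 240, 310, 290, 350, 410, 220, 199, 385]
slo_ms = 400

p99 = percentile(latencies_ms, 99)
print(f"p99 latency: {p99} ms (objective: {slo_ms} ms)")
if p99 > slo_ms:
    print("SLO breached: investigate the service or scale it out.")
```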

Keep doing what you’re doing, and consider sharing your successes via blog posts or internal knowledge bases. Submit pull requests to this guidance if you have innovative approaches or examples that can benefit other public sector organisations. By exchanging real-world insights, we collectively raise the bar for cloud maturity and cost effectiveness across the entire UK public sector.


How do you run services in the cloud?

You did not answer this question.

How to do better

Here are rapidly actionable improvements to help you move beyond purely static VMs:

  1. Enable Basic Monitoring and Cost Insights

  2. Leverage Built-in Right-sizing Tools

  3. Introduce Simple Scheduling

  4. Conduct a Feasibility Check for a Small Container Pilot

  5. Raise Awareness with Internal Stakeholders

    • Share simple usage and cost graphs with your finance or leadership teams. Show them the difference between “always-on” vs. “scaled” or “scheduled” usage.
    • This could drive more formal mandates or budget incentives to encourage partial re-architecture or adoption of short-lived compute in the future.

By monitoring usage, applying right-sizing, scheduling idle time, and introducing a small container pilot, you can meaningfully reduce waste. Over time, you’ll build momentum toward more flexible compute strategies while still respecting the constraints of your existing environment.
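
As an example of the right-sizing checks in step 2, the sketch below pulls two weeks of CloudWatch CPU data for a single EC2 instance (the instance ID is hypothetical) and flags persistently low utilisation; the other providers expose equivalent metrics APIs and advisor tools.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=14)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical ID
    StartTime=start,
    EndTime=end,
    Period=3600,           # hourly datapoints
    Statistics=["Average"],
)

datapoints = stats["Datapoints"]
if datapoints:
    average = sum(dp["Average"] for dp in datapoints) / len(datapoints)
    print(f"14-day average CPU: {average:.1f}%")
    if average < 10:
        print("Persistently low utilisation: consider a smaller instance size.")
```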

How to do better

Here are actionable next steps to accelerate your modernisation journey without overwhelming resources:

  1. Expand Container/Serverless Pilots in a Structured Way

  2. Implement Granular VM Auto-Scaling

  3. Use Container Services for Non-Critical Production

    • If you have a stable container proof-of-concept, consider migrating a small but genuine production workload. Examples:
      • Internal APIs, internal data analytics pipelines, or front-end servers that can scale up/down.
      • Focus on microservices that do not require extensive refactoring.
    • This fosters real operational experience, bridging from “non-critical tasks” to “production readiness.”
  4. Leverage Cloud Marketplace or Government Frameworks

    • Explore container-based solutions or DevOps tooling that might be available under G-Cloud or Crown Commercial Service frameworks.
    • Some providers offer managed container solutions pre-configured for compliance or security—this can reduce friction around governance.
  5. Train or Upskill Teams

    Building confidence and skills helps teams adopt more advanced compute models.

Through these steps—structured expansions of containerised or serverless pilots, improved auto-scaling of VMs, and staff training—your organisation can gradually shift from “limited experimentation” to a more balanced compute ecosystem. The result is improved agility, potential cost savings, and readiness for more modern architectures.

How to do better

Below are rapidly actionable ways to enhance your mixed compute model:

  1. Adopt Unified Deployment Pipelines

    • Strive for standard tooling that can deploy both VMs and container/serverless environments.
    • This reduces fragmentation and fosters consistent best practices (code review, automated testing, environment provisioning).
  2. Enhance Observability

  3. Introduce a Tagging/Governance Policy

  4. Implement Automated or Dynamic Scaling

    Implementing these scaling strategies ensures that your applications can efficiently handle varying workloads while controlling costs.

  5. Leverage Reserved or Discounted Pricing for Steady Components

    Implementing these strategies can lead to significant cost savings for workloads with consistent usage patterns.

By unifying your deployment practices, consolidating observability, enforcing tagging, and refining autoscaling or discount usage, you move from an ad-hoc mix of compute styles to a more cohesive, cost-effective cloud ecosystem. This sets the stage for robust, consistent governance and significant agility gains.
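
To support the tagging and governance policy in step 3, a minimal audit sketch is shown below; the required tag keys are hypothetical examples of a local standard.

```python
import boto3

REQUIRED_TAGS = {"Owner", "Environment", "CostCentre"}  # hypothetical local standard

ec2 = boto3.client("ec2")
paginator = ec2.get_paginator("describe_instances")

# Report any EC2 instance missing one of the mandated tags.
for page in paginator.paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            present = {tag["Key"] for tag in instance.get("Tags", [])}
            missing = REQUIRED_TAGS - present
            if missing:
                print(f"{instance['InstanceId']} is missing tags: {sorted(missing)}")
```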

How to do better

Below are actionable expansions to push your ephemeral usage approach further:

  1. Adopt a “Compute Decision Framework”

    • Formalise how new workloads choose among FaaS (functions), CaaS (containers), or short-lived VMs:
      • If event-driven with spiky traffic, prefer serverless.
      • If the service requires consistent runtime dependencies but can scale, prefer containers.
      • If specialised hardware or older OS is needed briefly, use short-lived VMs.
    • This standardisation helps teams quickly pick the best fit (a lightweight sketch appears at the end of this section).
  2. Enable Event-Driven Automation

  3. Implement Container Security Best Practices

  4. Refine Infrastructure as Code (IaC) and Pipeline Patterns

  5. Extend Tagging and Cost Allocation

By formalising your decision framework, expanding event-driven architectures, ensuring container security, and strengthening IaC patterns, you solidify your short-lived compute model. This approach reduces overheads, fosters agility, and helps UK public sector teams remain compliant with cost and operational excellence targets.
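
The decision rules in step 1 could be captured in something as small as the sketch below; the function name, inputs, and return values are hypothetical and would be adapted to local standards.

```python
# The rules mirror the bullets above; names and categories are hypothetical.
def recommend_compute(event_driven_spiky: bool,
                      consistent_runtime_dependencies: bool,
                      needs_specialised_hardware_or_legacy_os: bool) -> str:
    if needs_specialised_hardware_or_legacy_os:
        return "short-lived VM"
    if event_driven_spiky:
        return "serverless (FaaS)"
    if consistent_runtime_dependencies:
        return "containers (CaaS)"
    return "review with the platform team"


print(recommend_compute(event_driven_spiky=True,
                        consistent_runtime_dependencies=False,
                        needs_specialised_hardware_or_legacy_os=False))
```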

How to do better

Even at this advanced state, you can still hone practices. Below are suggestions:

  1. Automate Decision Workflows

    • Build an internal “Service Catalog” or “Decision Tree.” For instance:
      • A web-based form that asks about the workload’s functional, regulatory, performance, and cost constraints, then suggests suitable solutions (SaaS, FaaS, containers, etc.).
    • This can be integrated with pipeline automation so new projects must pass through the framework before provisioning resources.
  2. Deepen SaaS Exploration for Niche Needs

    • Explore specialised SaaS options for areas like data analytics, content management, or identity services.
    • Ensure your staff or solution architects regularly revisit the G-Cloud listings or other Crown Commercial Service frameworks to see if an updated SaaS solution can replace custom-coded or container-based systems.
  3. Further Standardise DevOps Across All Layers

  4. Maintain a Living Right-sizing Strategy

  5. Focus on Energy Efficiency and Sustainability

  6. Champion Cross-Public-Sector Collaboration

    • Share lessons or templates with other departments or agencies. This fosters consistent best practices across local councils, NHS trusts, or central government bodies.

By automating your decision workflows, continuously exploring SaaS, standardising DevOps pipelines, and incorporating advanced metrics (including sustainability), you maintain an iterative improvement path at the peak of compute maturity. This ensures you remain agile in responding to new user requirements and evolving government initiatives, all while controlling costs and optimising resource efficiency.

Keep doing what you’re doing, and consider writing up success stories, internal case studies, or blog posts. Submit pull requests to this guidance or relevant public sector best-practice repositories so others can learn from your achievements. By sharing real-world experiences, you help the entire UK public sector enhance its cloud compute maturity.


How do you track sustainability?

You did not answer this question.

How to do better

Below are rapidly actionable steps that provide greater visibility and ensure you move beyond mere vendor assurances:

  1. Request Vendor Transparency

  2. Enable Basic Billing and Usage Reports

  3. Incorporate Sustainability Clauses in Contracts

    • When renewing or issuing new calls on frameworks like G-Cloud, add explicit language for carbon reporting.
    • Request quarterly or annual updates on how your usage ties into the vendor’s net-zero or carbon offset strategies.

    Incorporating sustainability clauses into your contracts is essential for ensuring that your cloud service providers align with your environmental goals. The Crown Commercial Service offers guidance on integrating such clauses into the G-Cloud framework. Additionally, the Chancery Lane Project provides model clauses for environmental performance, which can be adapted to your contracts.

    By proactively including these clauses, you can hold vendors accountable for their sustainability commitments and ensure that your organisation’s operations contribute positively to environmental objectives.

  4. Track Internal Workload Growth

    • Even if you rely on vendor neutrality claims, set up a simple spreadsheet or a lightweight tracker for each of your main cloud workloads (service name, region, typical CPU usage, typical memory usage). If usage grows, you will notice potential new carbon hotspots.
  5. Raise Internal Awareness

    • Create a short briefing note for leadership or relevant teams (e.g., finance, procurement) highlighting:
      1. Your current reliance on vendor offsetting, and
      2. The need for baseline data collection.

    This ensures any interest in deeper environmental reporting can gather support before usage grows further.
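
For the lightweight tracker suggested in step 4, even a small script appending to a CSV file is enough to start spotting growth; the column names and example values below are hypothetical.

```python
import csv
from datetime import date

# Hypothetical columns matching the simple tracker described in step 4.
FIELDS = ["date", "service", "region", "avg_cpu_percent", "avg_memory_gb"]

def record_usage(path, service, region, avg_cpu_percent, avg_memory_gb):
    with open(path, "a", newline="") as tracker:
        writer = csv.DictWriter(tracker, fieldnames=FIELDS)
        if tracker.tell() == 0:  # new file: write the header row first
            writer.writeheader()
        writer.writerow({
            "date": date.today().isoformat(),
            "service": service,
            "region": region,
            "avg_cpu_percent": avg_cpu_percent,
            "avg_memory_gb": avg_memory_gb,
        })

# Example entry; the service name and figures are illustrative.
record_usage("cloud-workloads.csv", "case-api", "eu-west-2", 23.5, 4.0)
```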

How to do better

Here are quick wins to strengthen your approach and make it more actionable:

  1. Use Vendor Sustainability Tools for Basic Estimation

  2. Create Simple Internal Guidelines

    • Expand beyond policy statements:
      1. Resource Tagging: Mandate that every new resource is tagged with an owner, environment, and a sustainability tag (e.g., “non-prod, auto-shutdown” vs. “production, high-availability”).
      2. Preferred Regions: If feasible, prefer data centers that the vendor identifies as more carbon-friendly. For example, some AWS and Azure UK-based regions rely on greener energy sourcing than others.
  3. Schedule Simple Sustainability Checkpoints

    • Alongside your standard procurement or architectural reviews, add a sustainability review item. E.g.:
      • “Does the new service use the recommended low-carbon region?”
      • “Is there a plan to power down dev/test resources after hours?”
    • This ensures your new policy is not forgotten in day-to-day activities.
  4. Offer Quick Training or Knowledge Sessions

    The point is to connect cost optimisation with sustainability—over-provisioned resources burn more carbon.

  5. Publish Simple Reporting

    • Create a once-a-quarter dashboard or presentation highlighting approximate cloud emissions. Even if the data is partial or not perfect, transparency drives accountability.

By rapidly applying these steps—using native vendor tools to measure usage, establishing minimal but meaningful guidelines, and scheduling brief training or check-ins—you elevate your policy from mere awareness to actual practice.

How to do better

Focus on rapid, vendor-native steps to convert targets into tangible reductions:

  1. Automate Right-sizing

    By automatically resizing or shifting to lower-tier SKUs, you reduce both cost and emissions.

  2. Implement Scheduled Autoscaling

    This directly lowers carbon usage by removing idle capacity.

  3. Leverage Serverless or Container Services

    Serverless can significantly cut wasted resources, which aligns with your reduction targets.

  4. Adopt “Carbon Budgets” in Project Plans

These tools provide insights into the carbon emissions associated with different regions, enabling more sustainable decision-making.

  5. Align with Departmental or National Sustainability Goals
    • Update your internal reporting to reflect how your targets link to national net zero obligations or departmental commitments (e.g., the NHS net zero plan, local authority climate emergency pledges). This ensures your measurement and goals remain relevant to broader public sector accountability.

Implementing these steps swiftly helps ensure you don’t just measure but actually reduce your carbon footprint. Regular iteration—checking usage data, right-sizing, adjusting autoscaling—ensures continuous progress toward your stated targets.
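
As one way to automate the right-sizing in step 1 on AWS, the sketch below reads AWS Compute Optimizer recommendations; the service must be enabled first, and the response field names are assumptions to verify against the current API documentation.

```python
import boto3

optimizer = boto3.client("compute-optimizer")
response = optimizer.get_ec2_instance_recommendations()

# Field names below are assumed from the Compute Optimizer API format.
for recommendation in response.get("instanceRecommendations", []):
    print(recommendation.get("instanceArn"), "-", recommendation.get("finding"))
    # Show the top suggested alternative, if one is returned.
    for option in recommendation.get("recommendationOptions", [])[:1]:
        print("  suggested instance type:", option.get("instanceType"))
```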

How to do better

Actionable steps to deepen your integrated approach:

  1. Set Up Automated Governance Rules

Implementing these policies ensures that resources are deployed in regions with lower carbon footprints, aligning with your sustainability objectives.

  2. Adopt Full Lifecycle Management

  3. Use Vendor-Specific Sustainability Advisors

    Incorporate these suggestions directly into sprint backlogs or monthly improvement tasks.

  4. Embed Sustainability in DevOps Pipelines

    • Modify build/deployment pipelines to check resource usage or region selection:
      • If a new environment is spun up in a high-carbon region or with large instance sizes, the pipeline can prompt a warning or require an override.
      • Tools like GitHub Actions or Azure DevOps Pipelines can call vendor APIs to fetch sustainability metrics and fail a build if it’s non-compliant.
  5. Promote Cross-Functional “Green Teams”

    • Form a small working group or “green champions” network across procurement, DevOps, governance, and finance, meeting monthly to share best practices and track new optimisation opportunities.
    • This approach keeps your integrated practices dynamic, ensuring you respond quickly to new vendor features or updated government climate guidance.

By adding these automated controls, pipeline checks, and cross-functional alignment, you ensure that your integrated sustainability approach not only continues but evolves in real time. You become more agile in responding to shifting requirements and new tools, maintaining a leadership stance in UK public sector cloud sustainability.
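
A pipeline check of the kind described under “Embed Sustainability in DevOps Pipelines” can be a short script run as a build step; the environment variables, allow-list of lower-carbon regions, and vCPU cap below are all hypothetical policy choices.

```python
import os
import sys

# Hypothetical policy values: regions treated as lower-carbon by the
# organisation, and a vCPU cap for non-production environments.
LOW_CARBON_REGIONS = {"eu-west-2", "eu-north-1"}
MAX_NON_PROD_VCPUS = 8

region = os.environ.get("DEPLOY_REGION", "")
vcpus = int(os.environ.get("REQUESTED_VCPUS", "0"))
environment = os.environ.get("DEPLOY_ENVIRONMENT", "dev")

problems = []
if region not in LOW_CARBON_REGIONS:
    problems.append(f"Region {region!r} is not on the low-carbon allow-list.")
if environment != "production" and vcpus > MAX_NON_PROD_VCPUS:
    problems.append(f"{vcpus} vCPUs exceeds the non-production cap of {MAX_NON_PROD_VCPUS}.")

if problems:
    print("\n".join(problems))
    sys.exit(1)  # fail (or warn in) the pipeline step
print("Sustainability checks passed.")
```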

How to do better

Even at this advanced level, below are further actions to refine your dynamic management:

  1. Build or Leverage Carbon-Aware Autoscaling

  2. Collaborate with BEIS or Relevant Government Bodies

    • The Department for Business, Energy & Industrial Strategy (BEIS) or other departments may track grid-level carbon. If you can integrate their public data (e.g., real-time carbon intensity in the UK), you can refine your scheduling.
    • Seek synergy with national digital transformation or sustainability pilot programmes that might offer new tools or funding for experimentation.
  3. AI or ML-Driven Forecasting

    Then automatically shift or throttle workloads accordingly.

  4. Innovate with Low-Power Hardware

    Typically, these instance families consume less energy for similar workloads, further reducing carbon footprints.

  5. Automated Data Classification and Tiering

    This ensures minimal energy overhead for data retention.

  6. Set an Example through Openness

    • If compliance allows, publish near real-time dashboards illustrating your advanced scheduling successes or hardware usage.
    • Share code or Infrastructure-as-Code templates with other public sector teams to accelerate mutual learning.

By implementing these advanced tactics, you sharpen your dynamic optimisation approach, continuously pushing the envelope of what’s possible in sustainable cloud operations—while respecting legal constraints around data sovereignty and any performance requirements unique to public services.
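
To illustrate using public grid data as suggested in step 2, the sketch below queries the national Carbon Intensity API and defers a batch job when intensity is high; the threshold is a hypothetical policy value, and the response fields follow that API’s published format and should be verified.

```python
import requests

CARBON_THRESHOLD = 200  # gCO2/kWh; hypothetical policy threshold

response = requests.get("https://api.carbonintensity.org.uk/intensity", timeout=10)
response.raise_for_status()
intensity = response.json()["data"][0]["intensity"]

# "actual" can be null for the current half-hour, so fall back to the forecast.
current = intensity.get("actual") or intensity.get("forecast", 0)

if current > CARBON_THRESHOLD:
    print(f"Grid intensity {current} gCO2/kWh is high: defer the batch workload.")
else:
    print(f"Grid intensity {current} gCO2/kWh is low: start the batch workload now.")
```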

Keep doing what you’re doing, and consider documenting or blogging about your experiences. Submit pull requests to this guidance so other UK public sector organisations can accelerate their own sustainability journeys. By sharing real-world results and vendor-specific approaches, you help shape a greener future for public services across the entire nation.


How do you manage costs?

You did not answer this question.

How do I do better?

If you want to improve beyond “Restricted Billing Visibility,” the next step typically involves democratising cost data. This transition does not mean giving everyone unrestricted access to sensitive financial accounts or payment details. Instead, it centers on making relevant usage and cost breakdowns accessible to those who influence spending decisions, such as product owners, development teams, and DevOps staff, in a manner that is both secure and comprehensible.

Below are tangible ways to create a more open and proactive cost culture:

  1. Role-Based Access to Billing Dashboards

    • Most major cloud providers offer robust billing dashboards that can be securely shared with different levels of detail. For example, you can configure specialised read-only roles that allow developers to see usage patterns and daily cost breakdown without granting them access to critical financial settings.
    • Look into your preferred cloud provider’s official documentation for configuring this kind of read-only billing access.
    • By carefully configuring role-based access, you enable various teams to monitor cost drivers without exposing sensitive billing details such as invoicing or payment methods.
  2. Regular Cost Review Meetings

    • Schedule short, recurring meetings (monthly or bi-weekly) where finance, engineering, operations, and leadership briefly review cost trends. This fosters collaboration, encourages data-driven decisions, and allows everyone to ask questions or highlight anomalies.
    • Ensure these sessions focus on actionable items. For instance, if a certain service’s spend has doubled, discuss whether that trend reflects legitimate growth or a misconfiguration that can be quickly fixed.
  3. Automated Cost Alerts for Key Stakeholders

  4. Cost Dashboards Embedded into Engineering Workflows

    • Rather than expecting developers to remember to check a separate financial console, embed cost insights into the tools they already use. For example, if your organisation relies on a continuous integration/continuous deployment (CI/CD) pipeline, you can integrate scripts or APIs that retrieve daily cost data and present them in your pipeline dashboards or as part of a daily Slack summary.
    • Some organisations incorporate cost metrics into code review processes, ensuring that changes with potential cost implications (like selecting a new instance type or enabling a new managed service) are considered from both a technical and financial perspective.
  5. Empowering DevOps with Cost Governance

    • If you have a DevOps or platform engineering team, involve them in evaluating cost optimisation best practices. By giving them partial visibility into real-time spend data, they can quickly adjust scaling policies, identify over-provisioned resources, or investigate usage anomalies before a bill skyrockets.
    • You might create a “Cost Champion” role in each engineering squad—someone who monitors usage, implements resource tagging strategies, and ensures that the rest of the team remains mindful of cloud spend.
  6. Use of FinOps Principles

    • The emerging discipline of FinOps (short for “Financial Operations”) focuses on bringing together finance, engineering, and business stakeholders to drive financial accountability. Adopting a FinOps mindset means cost visibility becomes a shared responsibility, with iterative improvement at its core.
    • Consider referencing frameworks like the FinOps Foundation’s Principles to learn about building a culture of cost ownership, unit economics, and cross-team collaboration.
  7. Security and Compliance Considerations

    • Improving visibility does not mean exposing sensitive corporate finance data or violating compliance rules. Many organisations adopt an approach where top-level financial details (like credit card info or total monthly invoice) remain restricted, but usage-based metrics, daily cost reports, and resource-level data are made available.
    • Work with your governance or risk management teams to ensure that any expanded visibility aligns with data protection regulations and internal security policies.

By following these strategies, you shift from a guarded approach—where only finance or management see the details—to a more inclusive cost culture. The biggest benefit is that your engineering teams gain the insight they need to optimise continuously. Rather than discovering at the end of the month that a test environment was running at full throttle, teams can detect and fix potential overspending early. Over time, this fosters a sense of shared cost responsibility, encourages more efficient design decisions, and drives proactive cost management practices across the organisation.
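
As an example of embedding cost insight into everyday tooling (item 4 above), the sketch below fetches yesterday’s spend per service from AWS Cost Explorer and posts a short summary to a chat webhook; the webhook URL is a placeholder, and the same idea applies to other providers’ billing APIs.

```python
import boto3
import requests
from datetime import date, timedelta

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"  # placeholder

ce = boto3.client("ce")
yesterday = date.today() - timedelta(days=1)

result = ce.get_cost_and_usage(
    TimePeriod={"Start": yesterday.isoformat(), "End": date.today().isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

groups = result["ResultsByTime"][0]["Groups"]
top_five = sorted(
    groups,
    key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
    reverse=True,
)[:5]
lines = [
    f"{g['Keys'][0]}: {float(g['Metrics']['UnblendedCost']['Amount']):.2f}"
    for g in top_five
]

requests.post(
    SLACK_WEBHOOK_URL,
    json={"text": "Yesterday's top cloud costs:\n" + "\n".join(lines)},
    timeout=10,
)
```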

How do I do better?

To enhance a “Proactive Spend Commitment by Finance” model, organisations often evolve toward deeper collaboration between finance, engineering, and product teams. This ensures that negotiated contracts and reserved purchasing decisions accurately reflect real workloads, growth patterns, and future expansions. Below are methods to improve:

  1. Integrated Forecasting and Capacity Planning

    • Instead of having finance make decisions based purely on past billing, establish a forecasting model that includes planned product launches, major infrastructure changes, or architectural transformations.
    • Encourage technical teams to share roadmaps (e.g., upcoming container migrations, new microservices, or expansions into different regions) so finance can assess whether existing reservation strategies are aligned with future reality.
    • By merging product timelines with historical usage data, finance can negotiate better deals and tailor them closely to the actual environment.
  2. Dynamic Monitoring of Reservation Coverage

  3. Cross-Functional Reservation Committees

    • Create a cross-functional group that meets quarterly or monthly to decide on reservation purchases or modifications. In this group, finance presents cost data, while engineering clarifies usage patterns and product owners forecast upcoming demand changes.
    • This ensures that any new commits or expansions account for near-future workloads rather than only historical data. If you adopt agile practices, incorporate these reservation reviews as part of your sprint cycle or program increment planning.
  4. Leverage Spot or Preemptible Instances for Variable Workloads

    • An advanced tactic is to blend long-term reservations for predictable workloads with short-term, highly cost-effective instance types—such as AWS Spot Instances, Azure Spot VMs, GCP Preemptible VMs, or OCI Preemptible Instances—for workloads that can tolerate interruptions.
    • Finance-led pre-commits for baseline needs plus engineering-led strategies for ephemeral or experimental tasks can minimise your total cloud spend. This synergy requires communication between finance and engineering so that the latter group can identify which workloads can safely run on spot capacity.
  5. Refining Commitment Levels and Terms

    • If your cloud vendor offers multiple commitment term lengths (e.g., 1-year vs. 3-year reservations, partial upfront vs. full upfront) and different coverage tiers, refine your strategy to match usage stability. For example, if 60% of your workload is unwavering, consider 3-year commits; if another 20% fluctuates, opt for 1-year or on-demand.
    • Over time, as your usage data becomes more accurate and your architecture stabilises, you can shift more workloads into longer-term commitments for greater discounts. Conversely, if your environment is in flux, keep your commitments lighter to avoid overpaying.
  6. Unit Economics and Cost Allocation

    • Enhance your commitment strategy by tying it to unit economics—i.e., cost per customer, cost per product feature, or cost per transaction. Once you can express your cloud bills in terms of product-level or service-level metrics, you gain more clarity on which areas most justify pre-commits.
    • If you identify a specific product line that reliably has N monthly active users, and you have stable usage patterns there, you can base reservations on that product’s forecast. Then, the cost savings from reservations become more attributable to specific products, making budgeting and cost accountability smoother.
  7. Ongoing Financial-Technical Collaboration

    • Beyond initial negotiations, keep the lines of communication open. Cloud resource usage is dynamic, particularly with continuous integration and deployment practices. Having monthly or quarterly check-ins between finance and engineering ensures you track coverage, refine cost models, and respond quickly to usage spikes or dips.
    • Consider forming a “FinOps” group if your cloud usage is substantial. This multi-disciplinary team can use data from daily or weekly cost dashboards to fine-tune reservations, detect anomalies, and champion cost-optimisation strategies across the business.

By progressively weaving in these improvements, you move from a purely finance-led contract negotiation model to one where decisions about reserved spending or commitments are strongly informed by real-time engineering data and future product roadmaps. This more holistic approach leads to higher reservation utilisation, fewer wasted commitments, and better alignment of your cloud spending with actual business goals. The result is typically a more predictable cost structure, improved cost efficiency, and reduced risk of paying for capacity you do not need.
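
For the dynamic reservation monitoring described in point 2, a minimal AWS sketch is shown below; it reports last month’s reservation utilisation and coverage, and the response field names are assumptions to verify against the Cost Explorer documentation.

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")
month_start = date.today().replace(day=1)
previous_month_start = (month_start - timedelta(days=1)).replace(day=1)
period = {"Start": previous_month_start.isoformat(), "End": month_start.isoformat()}

# How much of the purchased reservations were actually used last month.
utilisation = ce.get_reservation_utilization(TimePeriod=period)
print("Reservation utilisation (%):",
      utilisation.get("Total", {}).get("UtilizationPercentage"))

# How much of the running footprint was covered by reservations.
coverage = ce.get_reservation_coverage(TimePeriod=period)
for window in coverage.get("CoveragesByTime", []):
    hours = window.get("Total", {}).get("CoverageHours", {})
    print("Reserved coverage of running hours (%):",
          hours.get("CoverageHoursPercentage"))
```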

How do I do better?

If you wish to refine your cost-efficiency, consider adding more sophisticated processes, automation, and cultural practices. Here are ways to evolve:

  1. Implement More Granular Auto-Scaling Policies

  2. Use Infrastructure as Code for Environment Management

    • Instead of ad hoc creation and shutdown scripts, adopt Infrastructure as Code (IaC) tools (e.g., Terraform, AWS CloudFormation, Azure Bicep, Google Deployment Manager, or OCI Resource Manager) to version-control environment configurations. Combine IaC with schedule-based or event-based triggers.
    • This approach ensures that ephemeral environments are consistently built and torn down, leaving minimal risk of leftover resources. You can also implement automated tagging to track cost by environment, team, or project.
  3. Re-Architect for Serverless or Containerised Workloads

    • If your application can tolerate stateless, event-driven, or container-based architectures, consider adopting serverless computing (e.g., AWS Lambda, Azure Functions, GCP Cloud Functions, OCI Functions) or container orchestrators (e.g., Kubernetes, Docker Swarm).
    • These models often scale to zero when no requests are active, ensuring you only pay for actual usage. While not all workloads are suitable, re-architecting certain components can yield significant cost improvements.
  4. Optimise Storage and Networking

    • Cost-effective management extends beyond compute. Look for opportunities to move infrequently accessed data to cheaper storage tiers, such as object storage archive classes or lower-performance block storage. Configure lifecycle policies to purge logs or snapshots after a specified retention.
    • Monitor data transfer costs between regions, availability zones, or external endpoints. If your architecture unnecessarily routes traffic through costlier paths, consider direct inter-region or peering solutions that reduce egress charges.
  5. Scheduled Resource Hibernation and Wake-Up Processes

    • Extend beyond typical off-hour shutdowns by creating fully automated schedules for every environment that does not require 24/7 availability. For instance, set a policy to shut down dev/test resources at 7 p.m. local time, and spin them back up at 8 a.m. the next day.
    • Tools or scripts can detect usage anomalies (e.g., someone working late) and override the schedule or send a prompt to confirm if the environment should remain active. This approach ensures maximum cost avoidance, especially for large dev clusters or specialised GPU instances.
  6. Incorporate Cost Considerations into Code Reviews and Architecture Decisions

    • Foster a culture in which cost is a first-class design principle. During code reviews, developers might highlight the cost implications of using a high-tier database service, retrieving data across regions, or enabling a premium feature.
    • Architecture design documents should include estimated cost breakdowns, referencing official pricing details for the services involved. Over time, teams become more adept at spotting potential overspending.
  7. Automated Auditing and Cleanup

    • Implement scripts or tools that run daily or weekly to detect unattached volumes, unused IP addresses, idle load balancers, or dormant container images. Provide automated cleanup or at least raise alerts for manual review.
    • Many cloud providers offer built-in recommendation engines that highlight idle or unused resources for review.
  8. Track and Celebrate Savings

    • Publicise cost optimisation wins. If an engineering team shaved 20% off monthly bills by fine-tuning auto-scaling, celebrate that accomplishment in internal communications. Show the before/after metrics to encourage others to follow suit.
    • This positive reinforcement helps maintain momentum and fosters a sense of shared ownership.
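
As an illustration of the scheduled hibernation described in point 5, the minimal sketch below stops running EC2 instances tagged as non-production at the end of the working day. It assumes AWS with the boto3 SDK and an "Environment" tag convention; Azure, GCP, and OCI offer equivalent SDK calls. In practice you would run it from a scheduler such as cron or an event rule.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")


def stop_non_production_instances() -> list[str]:
    """Stop running instances tagged Environment=dev or Environment=test.

    Assumes instances carry an 'Environment' tag; adjust the tag key and
    values to match your own tagging convention.
    """
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["dev", "test"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids


if __name__ == "__main__":
    stopped = stop_non_production_instances()
    print(f"Stopped {len(stopped)} non-production instances: {stopped}")
```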

By layering these enhancements, you move beyond basic scheduling or minimal auto-scaling. Instead, you cultivate a deeply ingrained practice of continuous optimisation. You harness automation to enforce best practices, integrate cost awareness into everyday decisions, and systematically re-architect services for maximum efficiency. Over time, the result is a lean cloud environment that can expand when needed but otherwise runs with minimal waste.

How do I do better?

If you want to upgrade your cost-aware development environment, you can deepen the integration of financial insight into everyday engineering. Below are practical methods:

  1. Enhance Toolchain Integrations

    • Provide cost data directly in the platforms developers use daily:
      • Pull Request Annotations: When a developer opens a pull request in GitHub or GitLab that adds new cloud resources (e.g., creating a new database or enabling advanced analytics), an automated comment could estimate the monthly or annual cost impact.
      • IDE Plugins: Investigate or develop plugins that estimate cost implications of certain library or service calls. While advanced, such solutions can drastically reduce guesswork.
      • CI/CD Pipeline Steps: Incorporate cost checks as a gating mechanism in your CI/CD process. If a change is projected to exceed certain cost thresholds, it triggers a review or a labelled warning (see the sketch after this list).
  2. Reward and Recognition Systems

    • Implement a system that publicly acknowledges or rewards teams that achieve significant cost savings or code optimisations that reduce the cloud bill. This can be a monthly “cost champion” award or a highlight in the company-wide newsletter.
    • Recognising teams for cost-smart decisions helps embed a culture where financial prudence is celebrated alongside feature delivery and reliability.
  3. Cost Education Workshops

    • Host internal workshops or lunch-and-learns where experts (whether from finance, DevOps, or a specialised FinOps team) explain how cloud billing works, interpret usage graphs, or share best practices for cost-efficient coding.
    • Make these sessions as practical and example-driven as possible: walk developers through real code and show the difference in cost from alternative approaches.
  4. Tagging and Chargeback/Showback Mechanisms

    • Encourage consistent resource tagging so that each application component or service is clearly attributed to a specific team, project, or feature. This tagging data feeds into cost reports that let you see which code bases or squads are driving usage.
    • You can then implement a “showback” model (where each team sees the monthly cost of their resources) or a “chargeback” model (where those costs directly affect team budgets). Such financial accountability often motivates more thoughtful engineering decisions.
  5. Guidelines and Architecture Blueprints

    • Produce internal reference guides that show recommended patterns for cost optimisation. For example, specify which database types or instance families are preferred for certain workloads. Provide example Terraform modules or CloudFormation templates that are pre-configured for cost-efficiency.
    • Encourage developers to consult these guidelines when designing new systems. Over time, the default approach becomes inherently cost-aware.
  6. Frequent Feedback Loops

    • Implement daily or weekly cost digests that are automatically posted in relevant Slack channels or email lists. These digests highlight the top 5 cost changes from the previous period, giving engineering teams rapid insight into where spend is shifting.
    • Additionally, create a channel or forum where developers can ask cost-related questions in real time, ensuring they do not have to guess how a new feature might affect the budget.
  7. Collaborative Budgeting and Forecasting

    • For upcoming features or architectural revamps, involve engineers in forecasting the cost impact. By inviting them into the financial planning process, you ensure they understand the budgets they are expected to work within.
    • Conversely, finance or product managers can learn from engineers about the real operational complexities, leading to more accurate forecasting and fewer unrealistic cost targets.
  8. Adopt a FinOps Mindset

    • Expand on the FinOps principles beyond finance alone. Encourage all engineering teams to take part in continuous cost optimisation cycles—inform, optimise, and operate. In these cycles, you measure usage, identify opportunities, experiment with changes, and track results.
    • Over time, cost efficiency becomes an ongoing practice rather than a one-time initiative.
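
To make the CI/CD cost gate described under point 1 concrete, here is a minimal sketch of a pipeline step that fails the build when a projected monthly cost exceeds a threshold. The cost-estimate.json file, its projected_monthly_gbp field, and the COST_THRESHOLD_GBP variable are illustrative assumptions; substitute the output of whatever cost-estimation tooling your pipeline already produces.

```python
import json
import os
import sys

# Assumed inputs: a JSON report from your cost-estimation tooling and a
# threshold supplied by the pipeline; both names are placeholders.
REPORT_PATH = os.environ.get("COST_REPORT_PATH", "cost-estimate.json")
THRESHOLD_GBP = float(os.environ.get("COST_THRESHOLD_GBP", "500"))


def main() -> int:
    with open(REPORT_PATH) as handle:
        report = json.load(handle)

    projected = float(report["projected_monthly_gbp"])  # assumed field name
    print(f"Projected monthly cost: £{projected:.2f} (threshold £{THRESHOLD_GBP:.2f})")

    if projected > THRESHOLD_GBP:
        print("Cost threshold exceeded: flagging this change for review.")
        return 1  # a non-zero exit fails the pipeline step or adds a warning label
    return 0


if __name__ == "__main__":
    sys.exit(main())
```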

By adopting these approaches, you elevate cost awareness from a passive, occasional concern to a dynamic, integrated element of day-to-day development. This deeper integration helps your teams design, code, and deploy with financial considerations in mind—often leading to innovative solutions that deliver both performance and cost savings.


How do you choose where to run workloads and store data? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable ways to refine an intra-region approach:

  1. Enable Automatic Multi-AZ Deployments

  2. Replicate Data Synchronously

  3. Set AZ-Aware Networking

  4. Regularly Test AZ Failover

    • Induce a partial Availability Zone (AZ) outage, or rely on “game days”, to ensure applications degrade gracefully or fail over as designed.
    • Ensures systems can handle unexpected disruptions effectively.
  5. Monitor Cross-AZ Costs

By automatically spreading workloads, replicating data in multiple AZs, ensuring AZ-aware networking, regularly testing failover, and monitoring cross-AZ costs, you solidify your organisation’s resilience within a single region while controlling costs.

How to do better

Below are rapidly actionable improvements:

  1. Automate Cross-Region Backups (see the sketch after this list)

  2. Schedule Non-Production in Cheaper Regions

    • If cost is a driver, shut down dev/test environments during off-peak hours or run them in a region with lower rates:
      • Referencing your chosen vendor’s regional pricing page.
  3. Establish a Basic DR Plan

  4. Regularly Test Failover

  5. Plan for Data Residency
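
As a starting point for point 1, the sketch below copies an EBS snapshot from a primary region into a secondary region using boto3. The snapshot ID, regions, and description are placeholders, and other providers offer equivalent cross-region copy operations.

```python
import boto3

SOURCE_REGION = "eu-west-1"  # placeholder primary region
DEST_REGION = "eu-west-2"    # placeholder backup region


def copy_snapshot_cross_region(snapshot_id: str, description: str) -> str:
    """Copy an EBS snapshot into the backup region and return the new snapshot ID."""
    dest_ec2 = boto3.client("ec2", region_name=DEST_REGION)
    result = dest_ec2.copy_snapshot(
        SourceRegion=SOURCE_REGION,
        SourceSnapshotId=snapshot_id,
        Description=description,
    )
    return result["SnapshotId"]


if __name__ == "__main__":
    # "snap-0123456789abcdef0" is a placeholder; look up your own snapshot IDs,
    # for example via describe_snapshots, before copying.
    new_id = copy_snapshot_cross_region(
        "snap-0123456789abcdef0", "Nightly cross-region copy for DR"
    )
    print(f"Created backup snapshot {new_id} in {DEST_REGION}")
```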

By automating cross-region backups, offloading dev/test workloads where cost is lower, defining a minimal DR plan, regularly testing failover, and ensuring data residency compliance, you expand from a single-region approach to a modest but effective multi-region strategy.

How to do better

Below are rapidly actionable enhancements:

  1. Sustainability-Driven Tools

  2. Implement Real-Time Cost & Performance Monitoring

  3. Enable Multi-Region Data Sync

  4. Address Latency & End-User Performance

    • For services with user-facing components, consider CDN edges, multi-region front-end load balancing, or local read replicas to ensure acceptable performance.
  5. Document Region Swapping Procedures

    • If you occasionally relocate entire workloads for cost or sustainability, define runbooks or scripts to manage DB replication, DNS updates, and environment spin-up.

By using sustainability calculators to choose greener regions, implementing real-time cost/performance checks, ensuring multi-region data readiness, managing user latency via CDNs or local replicas, and documenting region-swapping, you fully leverage each provider’s global footprint for cost and environmental benefits.

How to do better

Below are rapidly actionable methods to refine dynamic, cost-sustainable distribution:

  1. Automate Workload Placement

    • Use spot or preemptible capacity (e.g., AWS Spot Instances with EC2 Fleet, Azure Spot VMs with scale sets, GCP Preemptible VMs, OCI Preemptible Instances) or container orchestrators that factor region costs into placement decisions:
      • referencing vendor cost management APIs or third-party cost analytics.
  2. Use Real-Time Carbon & Pricing Signals (see the sketch after this list)

  3. Add Continual Governance

    • Ensure no region usage violates data residency constraints or compliance:
      • referencing NCSC multi-region compliance advice or departmental data classification guidelines.
  4. Embrace Chaos Engineering

  5. Integrate Advanced DevSecOps

    • For each region shift, the pipeline or orchestrator re-checks security posture and cost thresholds in real time.
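
To illustrate how the carbon and pricing signals in point 2 might drive placement, the sketch below scores candidate regions and picks the best one. The regions, prices, carbon intensities, and weighting are entirely invented placeholders; in practice you would feed in your provider’s pricing data and a carbon-intensity source, and filter the candidates for data residency first (point 3).

```python
# Illustrative only: the regions, prices, and carbon intensities below are
# invented placeholders, not real vendor figures. Filter this mapping for
# data-residency compliance before scoring.
CANDIDATE_REGIONS = {
    "eu-west-2": {"price_per_hour": 0.045, "gco2_per_kwh": 180},
    "eu-west-1": {"price_per_hour": 0.042, "gco2_per_kwh": 290},
    "eu-north-1": {"price_per_hour": 0.040, "gco2_per_kwh": 30},
}


def score(region_data: dict, carbon_weight: float = 0.5) -> float:
    """Lower is better: a simple weighted blend of normalised price and carbon."""
    max_price = max(r["price_per_hour"] for r in CANDIDATE_REGIONS.values())
    max_carbon = max(r["gco2_per_kwh"] for r in CANDIDATE_REGIONS.values())
    price_score = region_data["price_per_hour"] / max_price
    carbon_score = region_data["gco2_per_kwh"] / max_carbon
    return (1 - carbon_weight) * price_score + carbon_weight * carbon_score


def choose_region() -> str:
    return min(CANDIDATE_REGIONS, key=lambda name: score(CANDIDATE_REGIONS[name]))


if __name__ == "__main__":
    print(f"Preferred region for the next batch workload: {choose_region()}")
```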

By automating workload placement with spot or preemptible instances, factoring real-time carbon and cost signals, applying continuous data residency checks, stress-testing region shifts with chaos engineering, and embedding advanced DevSecOps validations, you maintain a dynamic, cost-sustainable distribution model that meets the highest operational and environmental standards for UK public sector services.

Keep doing what you’re doing, and consider blogging about or opening pull requests to share how you handle multi-region distribution and operational management for cloud workloads. This information can help other UK public sector organisations adopt or improve similar approaches in alignment with NCSC, NIST, and GOV.UK best-practice guidance.

Data

How do you manage data storage and usage? [change your answer]

You did not answer this question.

How to do better

Here are rapidly actionable steps to establish foundational data management and reduce risks:

  1. Identify and Tag All Existing Data Stores (see the sketch after this list)

  2. Establish Basic Data Handling Guidelines

    • Document a short set of rules about where teams should store data, who can access it, and minimal security classification steps (e.g., “Use only these approved folders/buckets for OFFICIAL-SENSITIVE data”).
    • Reference the Government Security Classification Policy (GSCP) or departmental guidelines to outline baseline compliance steps.
  3. Enable Basic Monitoring and Access Controls

  4. Educate Teams on Data Sensitivity

    • Run short, targeted training or lunch-and-learns on recognising PII, official data, or other categories.
    • Emphasise that storing data in an “unofficial” manner can violate data protection laws or hamper future compliance efforts.
  5. Draft an Interim Data Policy

    • Outline a simple, interim policy that sets initial standards for usage. For example:
      • "Always store sensitive data (OFFICIAL-SENSITIVE) in an encrypted bucket or database.
      • “Tag resources with project name, data owner, and data sensitivity level.”
    • Having any policy is better than none, setting the stage for more formal governance.
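
For point 1, the sketch below applies owner and sensitivity tags to a known list of S3 buckets using boto3. The bucket names and tag values are placeholders, and note that put_bucket_tagging replaces any existing tag set, so merge with current tags first if your buckets are already partially tagged.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder inventory: map each known bucket to its owner and classification.
BUCKET_METADATA = {
    "example-project-raw-data": {"data-owner": "records-team", "sensitivity": "OFFICIAL"},
    "example-project-case-files": {"data-owner": "casework-team", "sensitivity": "OFFICIAL-SENSITIVE"},
}


def tag_known_buckets() -> None:
    for bucket, tags in BUCKET_METADATA.items():
        tag_set = [{"Key": key, "Value": value} for key, value in tags.items()]
        # Note: this call replaces the bucket's whole tag set.
        s3.put_bucket_tagging(Bucket=bucket, Tagging={"TagSet": tag_set})
        print(f"Tagged {bucket}: {tags}")


if __name__ == "__main__":
    tag_known_buckets()
```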

By identifying your data storage resources, applying minimal security tagging, and sharing initial guidelines, you shift from ad hoc practices to a basic, more controlled environment. This foundation paves the way for adopting robust data governance tools and processes down the line.

How to do better

Below are rapidly actionable ways to improve upon team-based documentation:

  1. Adopt Centralised Tagging/Labeling Policies

  2. Introduce Lightweight Tools for Schema and Documentation

  3. Standardise on Security and Compliance Checklists

  4. Schedule Quarterly or Semi-Annual Data Reviews

    • Even if managed by each team, commit to an organisational cycle:
      • Each team updates its data inventories, verifies classification, and confirms there are no stale or untagged storage resources.
      • Summarise findings to central governance or a data protection officer for quick oversight.
  5. Motivate with Quick Wins

    • Share success stories: “Team X saved money by archiving old data after a manual review, or prevented a compliance risk by discovering unencrypted PII.”
    • This fosters cultural buy-in and continuous improvement.

By implementing standardised tagging, shared documentation tools, and routine checklists, you enhance consistency and reduce errors. You’re also positioning yourself for the next maturity level, which often involves more automated scanning and classification across the organisation.

How to do better

To refine your “Inventoried and Classified Data” approach, apply these rapidly actionable enhancements:

  1. Automate Scanning and Classification

  2. Introduce Basic Lineage Tracing

  3. Align with Legal & Policy Requirements

  4. Create a Single “Data Inventory” Dashboard

    • Consolidate classification statuses in a simple dashboard or spreadsheet so data governance leads can track changes at a glance.
    • If possible, generate monthly or quarterly “data classification health” reports.
  5. Provide Self-Service Tools for Teams

    • Offer them a quick way to see if their new dataset might include sensitive fields or which storage option is recommended for OFFICIAL-SENSITIVE data.
    • Maintaining “responsible autonomy” fosters compliance while reducing central bottlenecks.

With scanning, lineage insights, policy-aligned retention, and better visibility, you not only maintain your inventory but move it toward a dynamic, living data map. This sets the stage for deeper data understanding and advanced catalog solutions.

How to do better

Below are rapidly actionable steps to deepen your data lineage and documentation:

  1. Adopt or Expand a Data Catalog with Lineage Features

  2. Create a Standard Operating Procedure for Lineage Updates

    • Whenever a new data pipeline is created or an ETL job changes, staff must add or adjust lineage documentation.
    • Ensure this ties into your DevOps or CI/CD process:
      • E.g., new code merges automatically trigger updates in Purview or Data Catalog.
  3. Encourage Data Reuse and Collaboration

    • With partial lineage, teams might still re-collect or duplicate data. Create incentives for them to discover existing data sets:
      • Host a monthly “Data Discovery Forum” or internal knowledge-sharing session.
      • Highlight “success stories” where reusing a known dataset saved time or reduced duplication.
  4. Set Up Tiered Access Policies

  5. Integrate with Risk and Compliance Dashboards

    • If you have a departmental risk register, link data classification/lineage issues into that.
    • This ensures any changes or gaps in lineage are recognised as potential compliance or operational risks.

By systematically building out lineage features and embedding them in everyday workflows, you move closer to a truly integrated data environment. Over time, each dataset’s path through your infrastructure becomes transparent, boosting collaboration, reducing duplication, and easing regulatory compliance.

How to do better

Even at the highest maturity, here are actionable ways to refine:

  1. Incorporate Real-Time or Streaming Data

  2. Add Automated Data Quality Rules and Alerts

    • Configure threshold-based triggers that check data quality daily:
      • e.g., “If more than 5% of new rows fail validation, alert the data steward.”
    • Some vendor-native tools or third-party solutions can embed these checks in your data pipeline or catalog (see the sketch after this list).
  3. Leverage AI/ML to Classify and Suggest Metadata

  4. Integrate Catalog with Wider Public Sector Ecosystems

    • If your data catalog can integrate with cross-government data registries or share metadata with partner organisations, you reduce duplication and improve interoperability. For instance:
      • Some local authorities or NHS trusts might share standardised definitions or GDS guidelines.
      • Tools or APIs that facilitate federation with external catalogs can open up broad data collaboration.
  5. Continuously Evaluate Security, Access, and Usage

    • Review who actually accesses data vs. who is authorised, adjusting policies based on usage patterns.
    • If certain data sets see heavy usage from a new department, ensure lineage, classification, and approvals remain correct.
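
A minimal sketch of the threshold rule in point 2: it computes the share of new rows failing validation and alerts a data steward when the 5% threshold is breached. The validation function and the alerting call are placeholders for whatever checks and notification channel you already use.

```python
from typing import Callable, Iterable

FAILURE_THRESHOLD = 0.05  # 5% of new rows, as in the example rule above


def check_daily_quality(
    new_rows: Iterable[dict],
    is_valid: Callable[[dict], bool],
    alert: Callable[[str], None],
) -> float:
    """Return the failure rate and raise an alert if it exceeds the threshold."""
    rows = list(new_rows)
    if not rows:
        return 0.0
    failures = sum(1 for row in rows if not is_valid(row))
    rate = failures / len(rows)
    if rate > FAILURE_THRESHOLD:
        alert(
            f"Data quality alert: {failures}/{len(rows)} new rows "
            f"({rate:.1%}) failed validation today."
        )
    return rate


if __name__ == "__main__":
    # Placeholder example: a row is valid if it has a non-empty 'record_id' field.
    sample = [{"record_id": "123"}, {"record_id": ""}, {"record_id": "456"}]
    check_daily_quality(sample, lambda row: bool(row.get("record_id")), print)
```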

At this advanced level, your main goal is to keep your data catalog living, dynamic, and well-integrated with the rest of your technology stack and governance frameworks. By embracing new data sources, automating quality checks, leveraging ML classification, and ensuring interoperability across the UK public sector, you solidify your position as a model of data governance and strategic data management.

Keep doing what you’re doing, and consider publishing blog posts or internal case studies about your data governance journey. Submit pull requests to this guidance or relevant public sector repositories to share innovative approaches. By swapping best practices, we collectively improve data maturity, compliance, and service quality across the entire UK public sector.


What is your approach to data retention? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to strengthen your organisational policy awareness and transition toward more robust management:

  1. Map Policy to Actual Cloud Storage

  2. Implement Basic Lifecycle Rules for Key Data Types (see the sketch after this list)

  3. Offer Practical Guidelines

    • Simplify your policy into short, scenario-based instructions. For instance:
      • “Project data that includes personal information must be kept for 2 years, then deleted.”
      • “No indefinite retention without approval from Data Protection Officer.”
    • Make these guidelines easily accessible (intranet page, project templates).
  4. Encourage Regular Self-Checks

  5. Align with Stakeholders

    • Brief senior leadership, legal teams, and information governance officers on any proposed changes or automation.
    • Gain their support by showing how these improvements reduce compliance risk and cut unnecessary storage costs.
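
For point 2, the sketch below applies a basic S3 lifecycle rule with boto3 that archives objects under a project prefix after 90 days and deletes them after two years, echoing the example guideline above. The bucket name, prefix, and timings are placeholders, and Azure, GCP, and OCI offer equivalent lifecycle-management features.

```python
import boto3

s3 = boto3.client("s3")


def apply_retention_rule(bucket: str) -> None:
    """Archive project data after 90 days and expire it after 2 years (730 days)."""
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "project-data-two-year-retention",
                    "Filter": {"Prefix": "project-data/"},  # placeholder prefix
                    "Status": "Enabled",
                    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                    "Expiration": {"Days": 730},
                }
            ]
        },
    )


if __name__ == "__main__":
    apply_retention_rule("example-departmental-records")  # placeholder bucket name
```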

By proactively mapping retention policies to actual data, implementing simple lifecycle rules, and guiding teams with clear, scenario-based instructions, you reinforce “Organisation-Level Policy Awareness” with tangible, enforceable practices.

How to do better

Below are rapidly actionable ways to ensure attestations translate to real adherence:

  1. Incorporate Retention Audits into CI/CD

  2. Spot-Check Attestations with Periodic Scans

  3. Centralise Retention Documentation

    • Instead of scattered project documents, maintain a central registry or dashboard capturing:
      • Project name, data types, retention period, date of last attestation.
    • Provide read access to compliance and governance staff, ensuring quick oversight.
  4. Link Attestation to Funding or Approvals

    • For large programmes, make data retention compliance a prerequisite for budget release or major go/no-go decisions.
    • This creates a strong incentive to maintain correct lifecycle settings.
  5. Short Mandatory Training

    • Provide teams a bite-sized eLearning or workshop on how to configure retention in their chosen cloud environment.
    • This ensures they know the practical steps needed, so attestation isn’t just paperwork.

By coupling attestation with actual configuration checks, spot audits, centralised documentation, and relevant training, you boost confidence that claims of compliance match reality.

How to do better

Below are rapidly actionable ways to strengthen your audit and review process:

  1. Adopt Automated Compliance Dashboards

  2. Include Retention in Security Scans

  3. Track Action Plans to Closure

    • Use a centralised ticketing or workflow tool (e.g., Jira, ServiceNow) to capture audit findings, track remediation, and confirm sign-off.
    • Tag each ticket with “Data Retention Issue” for easy reporting and trend analysis.
  4. Publish Trends and Success Metrics

    • Show leadership the quarterly or monthly improvement in compliance percentage.
    • Celebrating zero major findings in a review cycle fosters a positive compliance culture and encourages teams to keep up the good work.
  5. Integrate with Other Governance Reviews

    • Data retention checks can be coupled with data security, privacy, or cost reviews.
    • This holistic approach ensures teams address multiple dimensions of good data stewardship simultaneously.

By automating aspects of the review process, embedding retention checks into security tools, and systematically remediating findings, you evolve from static cyclical audits to a dynamic, ongoing compliance posture.

How to do better

Below are rapidly actionable ways to embed retention exceptions deeper into risk management:

  1. Automate Exception Labelling and Monitoring

  2. Set Time-Bound Exceptions

    • Rarely should exceptions be indefinite. Include an “exception end date” in your risk register.
    • Use cloud scheduling or lifecycle policies to revisit after that date:
      • E.g., if an exception ends in 1 year, revert to normal retention automatically unless renewed (see the sketch after this list).
  3. Enhance Risk Register Integration

    • Link risk items to your data inventory or data catalog so you can quickly see which resources are covered by the exception.
    • Tools like ServiceNow, Jira, or custom risk management solutions can cross-reference cloud resource IDs or labels.
  4. Reevaluate Exception Cases in Each Audit

    • Incorporate exception checks into your regular data retention audits:
      • Confirm the exception is still valid and authorised.
      • If it’s no longer needed, remove it and revert to standard retention policies.
  5. Leverage Encryption or Extra Security for Exceptions
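
To support the expiry dates in point 2, the sketch below uses the AWS Resource Groups Tagging API to find resources carrying a retention-exception expiry tag and reports any that have passed their end date. The tag key and date format are assumptions you would align with your own risk register and labelling convention from point 1.

```python
from datetime import date

import boto3

EXCEPTION_TAG = "retention-exception-expiry"  # assumed tag key, value e.g. "2026-03-31"


def find_expired_exceptions() -> list[str]:
    """Return ARNs of resources whose retention exception end date has passed."""
    tagging = boto3.client("resourcegroupstaggingapi")
    expired = []
    paginator = tagging.get_paginator("get_resources")
    for page in paginator.paginate(TagFilters=[{"Key": EXCEPTION_TAG}]):
        for mapping in page["ResourceTagMappingList"]:
            tags = {tag["Key"]: tag["Value"] for tag in mapping["Tags"]}
            end_date = date.fromisoformat(tags[EXCEPTION_TAG])
            if end_date < date.today():
                expired.append(mapping["ResourceARN"])
    return expired


if __name__ == "__main__":
    for arn in find_expired_exceptions():
        print(f"Retention exception has expired for: {arn}")
```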

By systematically capturing exceptions as risks, labelling them in cloud resources, setting expiry dates, and ensuring periodic review, your exceptions process remains controlled rather than a loophole. This approach mitigates the dangers of indefinite data hoarding and supports robust risk governance.

How to do better

Even at the top maturity level, here are rapidly actionable ways to refine your automated enforcement:

  1. Deepen Integration with Data Catalog

  2. Leverage Event-Driven Remediation

  3. Expand to All Data Storage Services

  4. Adopt Predictive Analytics for Storage Growth and Anomaly Detection

  5. Continuously Update Policies for New Data Types

    • As your department adopts new AI workloads, IoT sensor data, or unstructured media, confirm your automated retention tools can handle these new data flows.
    • Keep stakeholder alignment: if legislation changes (e.g., new FOI or data privacy rules), swiftly update your policy-as-code approach.

By aligning your advanced automation with data classification, extending coverage to all storage services, and employing event-driven remediation, you maintain an agile, reliable data retention program that rapidly adapts to technology or policy shifts. This ensures your UK public sector organisation upholds compliance, minimises data sprawl, and demonstrates best-in-class stewardship of public data.

Keep doing what you’re doing, and consider documenting or blogging about your journey to automated data retention enforcement. Submit pull requests to this guidance or share your success stories with the broader UK public sector community to help others achieve similarly robust data retention practices.

Governance

How do you decide who handles the different aspects of cloud security? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond minimal consideration of shared responsibilities:

  1. Identify Your Specific Obligations

  2. Apply Basic Tagging for Ownership

  3. Conduct a Simple Risk Assessment

    • Walk through a typical scenario (e.g., security incident or downtime) and identify who would act under the current arrangement.
    • Document any gaps (e.g., “We assumed the vendor patches the OS, but it’s actually an IaaS solution so we must do it ourselves.”) and address them promptly.
  4. Raise Awareness with a Short Internal Briefing

    • Present the shared responsibility model in a simple slide deck or lunch-and-learn:
      • Emphasise how it differs from on-prem or typical outsourcing.
      • Show real examples of misconfigurations that occurred because teams weren’t aware of their portion of responsibility.
  5. Involve Governance or Compliance Officers

    • Ensure your information governance team or compliance officer sees the model. They can help flag missing responsibilities, especially around data protection (UK GDPR) or official classification levels.
    • This can prevent future misunderstandings.

By clarifying essential tasks, assigning explicit ownership, and performing a quick risk assessment, you proactively plug the biggest gaps that come from ignoring the shared responsibility model.

How to do better

Here are rapidly actionable ways to convert basic awareness into structured alignment:

  1. Develop a Clear Responsibilities Matrix

    • Create a simple spreadsheet or diagram that outlines specific responsibilities for each service model (IaaS, PaaS, SaaS). For example:
      • “Networking configuration: Cloud vendor is responsible for physical network security; we handle firewall rules.”
      • “VM patching: We handle OS patches for IaaS; vendor handles it for managed PaaS.”
    • Share this matrix with all relevant teams—developers, ops, security, compliance.
  2. Embed Responsibility Checks in CI/CD

  3. Set Up Basic Compliance Rules

  4. Create a Minimum Standards Document

    • Summarise “We do X, vendor does Y” in a concise, 1- or 2-page reference for new hires, project leads, or procurement teams.
    • This helps each team swiftly verify if they’re meeting their obligations.
  5. Schedule Regular (Bi-Annual) Awareness Sessions

    • As new people join or existing staff shift roles, re-run an internal training on the shared responsibility model.
    • This ensures knowledge doesn’t degrade over time.

By formalising the understanding into documented responsibilities, embedding checks in your workflows, and reinforcing compliance rules, you strengthen your posture beyond mere awareness and toward consistent application across teams.

How to do better

Below are rapidly actionable improvements to reinforce your informed decision-making:

  1. Adopt a “Responsibility Checklist” in Every Project Kickoff

  2. Integrate with Governance Boards or Change Advisory Boards (CAB)

    • Whenever a major cloud solution is proposed, the governance board ensures the shared responsibility breakdown is explicit.
    • This formal gate fosters consistent compliance with your model.
  3. Track “Responsibility Gaps” in Risk Registers

    • If you discover any mismatch—like you thought the vendor handled container OS patching, but it’s actually your job—log it in your risk register until resolved.
    • This encourages a quick fix and ensures no gap remains unaddressed.
  4. Conduct Periodic “Mock Incident” Exercises

    • For key services, run a tabletop exercise or test scenario: e.g., a severe OS vulnerability or unexpected data leak.
    • Evaluate how well the team knows who must patch or respond. Document lessons learned to refine your decision-making process.
  5. Refine Cost Transparency

    • Show how responsibilities can affect cost:
      • If you’re using a fully managed database, you pay a premium but shift more patching or upgrades to the vendor.
      • If you choose IaaS, you do more patching but may see lower direct service charges.
    • Provide a quick cost/responsibility matrix so teams can weigh these trade-offs effectively.

By embedding the model into architecture reviews, governance boards, risk tracking, and cost analysis, you ensure each cloud decision is well-informed and widely understood across the organisation.

How to do better

Here are rapidly actionable ideas to refine your strategic integration:

  1. Formalise a “Shared Responsibility Roadmap”

    • Outline how your responsibilities may shift as you adopt new services or modernise apps:
      • E.g., “We plan to transition from self-managed DB to a fully managed service, shifting patching to the vendor by Q4.”
    • Maintain an updated doc or wiki, shared with vendor account managers if relevant.
  2. Implement Joint Incident-Response Protocols

  3. Regular Joint Reviews of SLAs and MoUs

    • MoU (Memorandum of Understanding) or contracts can explicitly reference responsibilities.
    • Revisit them at least annually to confirm they remain relevant, especially if the vendor introduces new features or if you adopt new compliance frameworks.
  4. Quantify Responsibility Impacts on Cost and Resource

    • Evaluate how shifting responsibilities (e.g., from IaaS to PaaS) reduces your operational overhead or risk while potentially increasing subscription fees.
    • This cost-benefit analysis should guide strategic decisions about which responsibilities to keep in house.
  5. Publish Internal Case Studies

    • Showcase a project that integrated the shared responsibility model successfully, explaining how it prevented major incidents or streamlined compliance.
    • This inspires other teams to replicate the approach.

By systematically planning your responsibilities roadmap, establishing joint incident protocols, and performing regular SLA reviews, you embed the shared responsibility model at the heart of your strategic cloud partnerships.

How to do better

Even at the pinnacle, there are actionable strategies to maintain and refine:

  1. Incorporate Real-Time Observability of Shared Responsibilities

  2. Conduct Regular Cost-Benefit Re-Evaluations

    • At least quarterly, re-check if shifting more responsibilities to vendor-managed solutions or retaining them in house remains the best approach:
      • Some tasks might become cheaper or more secure if the vendor has introduced an improved managed feature or a new region with stronger compliance credentials.
    • Document these findings for leadership to see the ROI of the chosen approach.
  3. Shape Best Practices Across the Public Sector

    • Share your advanced model with partner agencies, local councils, or central government departments.
    • Contribute to cross-government playbooks on cloud adoption, showing how the shared responsibility model fosters better outcomes.
  4. Combine Shared Responsibility Insights with Ongoing Cloud Transformation

    • If you’re running modernisation or digital transformation programmes, embed the shared responsibility model into new microservices, container deployments, or serverless expansions.
    • Constantly question: “Where does the boundary lie, and is it cost-effective or compliance-aligned to shift it?”
  5. Prepare for Regulatory Changes

    • Monitor updates to UK data protection laws, the National Cyber Security Centre (NCSC) guidelines, or changes in vendor compliance certifications.
    • Adjust responsibilities quickly if new standards require a different approach (e.g., more encryption or different backup retention mandated by a new policy).

By ensuring real-time observability, frequent cost-benefit checks, sector-wide collaboration, and a readiness to pivot for regulatory shifts, you sustain a robust, adaptive shared responsibility model at the core of your cloud usage. This cements your organisation’s position as a leader in cost-effective, secure, and compliant public sector cloud adoption.

Keep doing what you’re doing, and consider sharing blog posts, case studies, or internal knowledge base articles on how your organisation integrates the shared responsibility model into cloud governance. Submit pull requests to this guidance or similar public sector best-practice repositories to help others learn from your success.


How do you manage and store build artefacts (files created when building software)? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move away from ad-hoc methods:

  1. Introduce a Basic CI/CD Pipeline

  2. Ensure Everything Is in Version Control

  3. Create a Shared Storage for Build Outputs

  4. Document Basic Rollback Steps

    • At a minimum, define how to revert a server or application if a live edit breaks something:
      • Write a short rollback procedure referencing the last known working code in version control.
    • This ensures you’re not stuck with manual edits you can’t undo.
  5. Educate the Team

    • Explain the risks of live server edits in short training sessions:
      • Potential compliance violations if changes are not auditable.
      • Difficulty diagnosing or rolling back production issues.

By adopting minimal CI/CD, storing artifacts in a shared location, and referencing everything in version control, you reduce chaos and set a foundation for more robust artifact management.

How to do better

Below are rapidly actionable strategies:

  1. Centralise Your Build Once

  2. Define a Consistent Build Container

    • If you want complete reproducibility:
      • Use a Docker image as your build environment (e.g., pinned versions of compilers, frameworks).
      • Keep that Docker image in your artifact registry so each new build uses the same environment.
  3. Implement Version or Commit Hash Tagging

    • Tag the artifact with a version or Git commit hash. Each environment references the same exact build (like “my-service:build-1234”).
    • This eliminates guesswork about which code made it to production vs. test.
  4. Apply Simple Promotion Strategies

    • Instead of rebuilding, you “promote” the tested artifact from dev to test to production:
      • Mark the artifact as “passed QA tests” or “passed security scan,” so you have a clear chain of trust.
    • This approach improves reliability and shortens lead times.
  5. Create Basic Documentation

    • Summarise the difference between “build once, deploy many” and “build in each environment.”
    • Show management how consistent builds reduce risk and effort.

By consolidating the build process, storing a single artifact per version, and promoting that same artifact across environments, you achieve consistency and reduce the risk of environment drift.

How to do better

Here are rapidly actionable enhancements:

  1. Adopt Write-Once-Read-Many (WORM) or Immutable Storage

  2. Set Up Access Controls and Auditing

  3. Enforce In-House or Managed Build Numbering Standards

    • Decide how you version artifacts (e.g., semver, build number, git commit) to ensure consistent tracking across repos.
    • This practice reduces confusion when dev/test teams talk about a specific build.
  4. Extend to Container Images or Package Repositories

  5. Introduce Minimal Integrity Checks

    • Even if you don’t have full cryptographic signatures, consider generating checksums (e.g., SHA-256) for each artifact to detect accidental corruption (see the sketch after this list).
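
A minimal sketch of the integrity check in point 5: it computes a SHA-256 digest for a build artifact and writes it alongside the file so later stages can detect accidental corruption. The file path is a placeholder for whatever your build actually produces.

```python
import hashlib
from pathlib import Path


def write_checksum(artifact: Path) -> str:
    """Compute a SHA-256 digest for the artifact and store it next to the file."""
    digest = hashlib.sha256()
    with artifact.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    checksum = digest.hexdigest()
    artifact.with_suffix(artifact.suffix + ".sha256").write_text(
        f"{checksum}  {artifact.name}\n"
    )
    return checksum


if __name__ == "__main__":
    # Placeholder path: point this at the artifact produced by your build.
    print(write_checksum(Path("dist/my-service-1.2.3.tar.gz")))
```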

By using immutable storage, controlling access, and standardising versioning, you strengthen artifact reliability and traceability without overwhelming your current processes.

How to do better

Below are rapidly actionable improvements:

  1. Leverage Vendor Tools for Dependency Scanning

  2. Sign Your Artifacts

  3. Adopt a “Bill of Materials” (SBOM)

    • Generate a Software Bill of Materials for each build, listing all dependencies and their checksums:
      • This clarifies exactly which libraries or frameworks were used, crucial for quick vulnerability response.
  4. Enforce Minimal Versions or Patch Levels

    • If a library has a known CVE, your pipeline rejects builds that rely on that version.
    • This ensures you don’t accidentally revert to vulnerable dependencies.
  5. Combine with Immutable Storage

    • If you haven’t already, store these pinned, verified artifacts in a write-once or strongly controlled location.
    • This ensures no tampering after you sign or hash them.

By scanning for vulnerabilities, signing artifacts, using SBOMs, and enforcing patch-level policies, you secure your supply chain and provide strong assurance of artifact integrity.

How to do better

Even at this pinnacle, there are actionable ways to refine:

  1. Automate Artifact Verification on Deployment

  2. Embed Forensic Analysis Hooks

    • Provide metadata in logs (e.g., commit hashes, SBOM references) so if an incident occurs, security teams can quickly retrieve the relevant artifact.
    • This reduces incident response time.
  3. Regularly Test Restoration Scenarios

    • Conduct a “forensic reenactment” once or twice a year:
      • Attempt to reconstruct an environment from your stored artifacts.
      • Check if you can seamlessly spin up an older version with pinned dependencies and configurations.
    • This ensures the system works under real conditions, not just theory.
  4. Apply Multi-Factor Access Control

  5. Participate in Industry or Government Communities

    • As you lead in artifact management maturity, share best practices with other public sector bodies or cross-governmental security groups.
    • Encourage consistent auditing and artifact immutability standards across local councils, departmental agencies, or NHS trusts.

By verifying artifacts on each deployment, maintaining robust forensic readiness, testing restoration scenarios, and securing signing keys with HSMs or advanced controls, you perpetually refine your processes. This ensures unwavering trust and compliance in your build pipeline, even under rigorous UK public sector scrutiny.

Keep doing what you’re doing, and consider sharing case studies or best-practice guides. Submit pull requests to this guidance or other UK public sector repositories to help others learn from your advanced artifact management journey.


How do you manage and update access policies, and how do you tell people about changes? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move away from ad-hoc management:

  1. Begin a Simple Policy Definition

    • Draft a one-page document outlining baseline access rules (e.g., “Least privilege,” “Need to know”).
    • Reference relevant UK government guidance on access controls or consult your departmental policy docs.
  2. Centralise Identity and Access

  3. Record Exemptions in a Simple Tracker

    • If you must grant an ad-hoc exception, log it in a basic spreadsheet or ticket system:
      • Who was granted the exception?
      • Why?
      • When will it be revisited or revoked?
  4. Define at Least One “Review Step”

    • If someone wants new permissions, ensure a second person or a small group must approve the request.
    • This adds minimal overhead but prevents hasty over-permissioning.
  5. Communicate the New Basic Policy

    • Email a short notice to your team, or host a 15-minute briefing.
    • Emphasise that all new access requests must align with the minimal policy.

By introducing a baseline policy, centralising identity management, tracking exceptions, and implementing a simple approval step, you achieve immediate improvements and lay the groundwork for more robust policy governance.

How to do better

Here are rapidly actionable enhancements:

  1. Schedule Regular Policy Updates

    • Commit to revisiting policies at least annually or semi-annually, and each time there’s a major change (e.g., new compliance requirement).
    • Add a reminder to your calendar or project board for a policy review session.
  2. Establish a Basic Change Log

  3. Use Consistent Communication Channels

    • If you have an organisational Slack, Teams, or intranet, create a #policy-updates channel (or equivalent) to announce changes.
    • Summarise the key differences in plain language.
  4. Apply or Update an RBAC Model

  5. Create a Briefing Deck

    • Summarise the policy in fewer than 10 slides or 1–2 pages, so teams quickly grasp their obligations.
    • Present it in your next all-hands or departmental meeting.

By versioning your policy documents, scheduling updates, and communicating changes through consistent channels, you ensure staff remain aligned with the policy’s intent and scope, even as it evolves.

How to do better

Below are rapidly actionable ways to refine:

  1. Introduce a “Policy Advisory Group”

    • Involve representatives from different teams (security, compliance, operations, major app owners).
    • They review proposed changes before final approval, fostering collaboration and broader buy-in.
  2. Leverage Automated Policy Tools

  3. Conduct Impact Assessments

    • Each time a policy update is proposed, share an “impact summary” so teams know if they must adjust access roles, add new logging, or change their workflows.
  4. Record Meeting Minutes or Summaries

    • Publish a short summary of each policy review session.
    • This allows staff who couldn’t attend to remain informed and fosters more transparency.
  5. Add a Feedback Loop

    • Let staff submit policy improvement suggestions via an online form or an email address.
    • Review these suggestions in each policy cycle, acknowledging them in announcements.

By establishing a policy advisory group, using automated enforcement, sharing impact assessments, and keeping transparent documentation, you enhance collaboration and understanding around policy changes.

How to do better

Below are rapidly actionable strategies:

  1. Use Version Control for Policy and Automated Testing

  2. Schedule Interactive Workshops

    • Quarterly or monthly policy workshops enable direct Q&A and early feedback on proposed changes, preventing surprises.
  3. Implement a Self-Service Portal or Dashboard

  4. Link Policy Changes to Organisational Goals

    • For each update, clearly state how it supports:
      • Security improvements (reducing potential data breaches).
      • Compliance with UK data protection or government classification requirements.
      • Operational efficiency or cost savings.
  5. Establish Basic Metrics

    • E.g., measure “time to complete a policy change,” “number of exemptions requested,” or “incident rate attributed to policy confusion.”
    • Track these to demonstrate improvements over time.

By versioning policy code, conducting interactive workshops, providing self-service dashboards, and linking changes to tangible organisational goals, you reinforce a collaborative, integrated policy management culture.

How to do better

Below are rapidly actionable improvements, even at the highest level:

  1. Adopt Advanced Policy Testing Frameworks

  2. Create a Sandbox for Policy Experiments

    • Let staff propose changes in a “policy staging environment” or a set of test subscriptions/accounts/folders.
    • Automatic validation ensures no harmful or contradictory rules get merged into production.
  3. Automate Documentation Generation

  4. Extend Collaboration to Partner Agencies

    • If you work closely with other local authorities or health boards, consider sharing a portion of your policy code or best practices across organisations.
    • This fosters consistency and accelerates policy alignment.
  5. Perform Periodic “Policy Drills”

    • Similar to security incident drills, you can test large policy changes:
      • E.g., “We propose removing direct SSH access to all VMs” or “We require multi-factor authentication for every console user.”
    • Observe the process of review, merging, and rollout—this ensures your pipeline works under pressure.

By integrating advanced testing frameworks, using a sandbox approach, automating documentation, and sharing with partner agencies, you keep your policy-as-code approach dynamic and continuously improving, setting a standard for robust and transparent governance in the UK public sector.

Keep doing what you’re doing, and consider writing blog posts or internal knowledge base articles on your policy management journey. Submit pull requests to this guidance or similar public sector best-practice repositories to help others learn from your successful practices.


How do you manage your cloud environment? [change your answer]

You did not answer this question.

How to do better

Runbooks and Playbooks
  1. Create Minimal Runbooks/Playbooks

  2. Ensure Accessibility & Security

  3. Enforce Update Discipline

    • Each time an admin modifies the environment, they must update the runbook.
    • Prevents drift where docs become irrelevant or untrusted.
Change Logs and Audit Logs
  1. Enable Cloud Provider Audit Logging

  2. Capture the “Why”

    • Maintain a short change log to record the rationale behind config changes:
      • Possibly a central wiki or a simple Slack channel for “cloud change announcements.”
  3. Plan Next Steps

    • Use these logs to identify repetitive tasks or areas ripe for automation in the near future.

By documenting runbooks/playbooks, ensuring logs are enabled and accessible, capturing rationale behind changes, and frequently updating your documentation, you reduce the risks tied to manual “click-ops” while preparing the groundwork for partial or full automation.

How to do better

Below are rapidly actionable improvements:

  1. Use Scripting for Repetitive Tasks

  2. Track Environment Differences

  3. Add Post-Deployment Verification

    • After each manual deployment, run a checklist or small script that verifies key resources are correct.
  4. Plan a Shift to Infrastructure-as-Code

  5. Initiate Basic Drift Detection (see the sketch after this list)
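
For point 5, the sketch below is a very basic drift check: it compares the instance type recorded for a few known EC2 instances against what is actually deployed and reports mismatches. The expected-state mapping and instance IDs are placeholders for whatever record you keep of intended configuration, and the same idea applies to other providers’ SDKs.

```python
import boto3

# Placeholder "intended state": in practice this would come from your
# documentation, scripts, or (later) Infrastructure-as-Code definitions.
EXPECTED_INSTANCE_TYPES = {
    "i-0aaaaaaaaaaaaaaa1": "t3.medium",
    "i-0bbbbbbbbbbbbbbb2": "m5.large",
}


def detect_drift() -> list[str]:
    """Return a list of human-readable drift findings."""
    ec2 = boto3.client("ec2", region_name="eu-west-2")
    response = ec2.describe_instances(InstanceIds=list(EXPECTED_INSTANCE_TYPES))
    findings = []
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            expected = EXPECTED_INSTANCE_TYPES[instance["InstanceId"]]
            actual = instance["InstanceType"]
            if actual != expected:
                findings.append(
                    f"{instance['InstanceId']}: expected {expected}, found {actual}"
                )
    return findings


if __name__ == "__main__":
    for finding in detect_drift():
        print(f"Drift detected - {finding}")
```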

By partially automating recurring tasks, carefully recording environment discrepancies, verifying deployments, piloting Infrastructure-as-Code, and implementing drift checks, you mitigate errors and pave the way for more complete automation.

How to do better

Below are rapidly actionable ways to evolve from partial scripting:

  1. Expand Scripting to Complex Tasks

  2. Adopt an IaC Framework

  3. Introduce Basic CI/CD

  4. Set up a “Review & Approve” Process

  5. Leverage Cloud Vendor Tools

By incrementally automating complex changes, standardising on an IaC framework, establishing a basic CI/CD workflow, ensuring code reviews, and utilising vendor orchestration tools, you reduce your reliance on manual interventions and strengthen cloud environment consistency.

How to do better

Below are rapidly actionable ways to refine a highly automated approach:

  1. Implement Automatic Drift Remediation

  2. Incorporate Policy-as-Code

  3. Extend DevSecOps Tooling

    • e.g., scanning IaC templates for security issues, verifying recommended best practices in each pipeline step:
      • referencing NCSC’s secure developer guidelines or NIST SP 800-53 R5 for secure configurations.
  4. Perform Regular Architecture Reviews

    • With a high level of automation, a small monthly or quarterly session can keep IaC templates up to date with new cloud features or cost optimisation.
  5. Foster Cross-Department Knowledge Sharing

By enabling automatic drift remediation, implementing policy-as-code, enhancing DevSecOps pipeline checks, conducting periodic architecture reviews, and collaborating across agencies, you refine a strong foundation of standardised, highly automated processes for cloud management.

How to do better

Below are rapidly actionable methods to maximise a fully declarative, drift-detecting environment:

  1. Integrate Real-Time Security & Cost Checks

  2. Adopt Multi-Cloud or Hybrid Templates

    • If you operate across multiple clouds or on-prem, unify definitions in a single code base.
  3. Enhance Observability

  4. Foster a Culture of Peer Reviews

  5. Pursue Cross-Government Collaboration

    • If possible, share or open-source reusable modules or templates.

By adding real-time security and cost checks in your pipeline, adopting multi-cloud/hybrid IaC, enhancing observability, promoting peer reviews, and collaborating with other UK public sector bodies, you reinforce an already advanced, fully declarative environment with robust drift detection—ensuring secure, consistent, and efficient cloud management.

Keep doing what you’re doing, and consider publishing blog posts or making pull requests to share your approach to fully automated, code-based cloud management with drift detection. This knowledge can help other UK public sector organisations replicate your success under NCSC, NIST, and GOV.UK best-practice guidelines.


How do you apply and enforce policies? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to start applying policies:

  1. Define a Minimal Baseline Policy

    • Begin by stating basic governance guidelines (e.g., “All user accounts must have multi-factor authentication,” “All data must be encrypted at rest”).
    • Publish this in a short doc or wiki, referencing relevant UK public sector best practices.
  2. Identify a Small Pilot Use Case

  3. Communicate the Policy

    • Alert your team that from now on, they must follow this minimal policy.
    • Provide quick references or instructions in your Slack/Teams channel or an intranet page.
  4. Log Exceptions

    • If someone must deviate from the baseline (e.g., a short-term test needing an exception), track it in a simple spreadsheet or ticket system.
    • This fosters accountability and sets the stage for incremental improvement.

By taking these initial steps—defining a baseline policy, piloting it, and communicating expectations—you move from “no policy application” toward a more controlled environment.

How to do better

Below are rapidly actionable ways to start enforcing existing policies:

  1. Adopt Basic Monitoring or Reporting

  2. Automate Alerts for Major Breaches

  3. Introduce Basic Consequence Management

    • If a policy is violated, require the team to fill out an exception form or gain approval from a manager.
    • This ensures staff think twice before ignoring policy.
  4. Incrementally Expand Enforcement

    • Start with “auditing mode,” then gradually move to “deny mode.” For example:
      • In AWS, use Service Control Policies or AWS Config rules in “detect-only” mode first, then enforce.
      • In Azure, run Azure Policy in “audit” effect, then shift to “deny” once comfortable.
      • GCP or OCI similarly allow rules to initially only log and then eventually block non-compliant actions.

By automating policy checks, alerting on critical breaches, and phasing in enforcement, you build momentum toward consistent compliance without overwhelming teams.

How to do better

Below are rapidly actionable ways to enhance process-driven application:

  1. Introduce Lightweight Technical Automation

  2. Use a Single Source of Truth

    • Store policy documentation and forms in a single location (e.g., SharePoint, Confluence, or an internal Git repo).
    • This avoids confusion about which version of the process to follow.
  3. Add a “Policy Gate” to Ticketing Systems

    • For example, in ServiceNow or Jira:
      • A ticket for provisioning a new VM or network must pass a “policy gate” status, requiring sign-off from a compliance or security person referencing your standard steps.
  4. Measure Process Efficiency

    • Track how long it takes to apply each policy step. Identify bottlenecks or missed checks.
    • This helps you see where minimal automation or additional staff training could cut manual overhead.
  5. Conduct Periodic Spot Audits

    • Check a random subset of completed tickets or new resources to ensure every policy step was genuinely followed, not just ticked off.
    • Publicise the outcomes so staff remain vigilant.

By introducing minor automation, centralising policy references, adding a policy gate in ticketing, and auditing process compliance, you blend the reliability of your current manual approach with the efficiency gains of technical enablers.

How to do better

Below are rapidly actionable ways to reinforce or extend your existing setup:

  1. Expand Technical Enforcement

  2. Integrate Observability and Alerting

  3. Adopt “Immutability” or “Infrastructure as Code”

  4. Push for More Cross-Team Training

    • Ensure DevOps, security, and compliance teams understand how to interpret automated policy checks.
    • This fosters a shared sense of ownership and makes the half-automated approach more effective.
  5. Set Up a Policy Remediation or “Self-Healing” Mechanism

    • Where feasible, let your system automatically fix minor compliance drifts:
      • e.g., if a bucket is created public by mistake, the system reverts it to private and notifies the user (see the sketch after this list).
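
As an example of the self-healing idea in point 5, the sketch below re-applies an S3 public access block to a bucket and returns a message for notification. In practice it would be triggered by an event rule when a misconfiguration is detected; the bucket name and the notification mechanism are placeholders for your own alerting channel.

```python
import boto3

s3 = boto3.client("s3")


def remediate_public_bucket(bucket: str) -> str:
    """Block all public access on the bucket and return a message for notification."""
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    return f"Public access on bucket '{bucket}' was blocked automatically; please review."


if __name__ == "__main__":
    # Placeholder bucket name; in a real setup it would come from the
    # triggering event (e.g. an audit-log or configuration-change notification).
    print(remediate_public_bucket("example-misconfigured-bucket"))
```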

By strengthening technical guardrails, improving alerting, and embedding your policies deeper into IaC, you evolve your limited technical controls into a more comprehensive and proactive enforcement model.

How to do better

Below are rapidly actionable refinements, even at the highest maturity:

  1. Adopt Policy-as-Code with Automated Testing

  2. Enable Dynamic, Real-Time Adjustments

    • Some advanced organisations adopt “adaptive policies” that can respond automatically to shifting risk contexts:
      • e.g., Requiring step-up authentication or extra scanning if abnormal usage patterns appear.
  3. Analytics and Reporting on Policy Efficacy

    • Track metrics like “time to resolve policy violations,” “number of exceptions requested per quarter,” or “percentage of resources in compliance.”
    • Present these metrics to leadership for data-driven improvements.
  4. Cross-department Collaboration

    • If you share data or resources with other public sector agencies, coordinate policy definitions or enforcement bridging solutions.
    • This ensures consistent governance and security across multi-department projects.
  5. Regularly Test Failover or Incident Response

    • Conduct simulation exercises to confirm that policy enforcement remains intact during partial outages or security incidents.
    • Evaluate whether the automated controls effectively protect resources and whether manual overrides are restricted or well-logged.

By implementing policy-as-code with automated testing, adopting dynamic enforcement, collecting analytics on compliance, and performing cross-department or incident drills, you ensure your integrated model remains agile and robust—setting a high benchmark for public sector governance.

Keep doing what you’re doing, and consider writing internal blog posts or external case studies about your policy enforcement journey. Submit pull requests to this guidance or related public sector best-practice repositories so others can learn from your advanced application and enforcement strategies.


How do you use version control and branch strategies? [change your answer]

You did not answer this question.

How do I do better?

Below are rapidly actionable next steps:

  1. Pick a Git-based Platform

  2. Require Commits for Every Change

  3. Document Basic Workflow

  4. Tag Notable Versions

    • If something is “ready for release,” apply a Git tag or version (e.g., v1.0).
    • Minimises guesswork about which commit correlates to a live environment.
  5. Plan for Future Branching Strategy

    • Over 3–6 months, adopt a recognised model (e.g., GitHub Flow or trunk-based) to handle multiple contributors or features.

By using a modern Git-based platform, ensuring all changes result in commits, documenting a minimal workflow, tagging key releases, and scheduling a shift to a recognised branching strategy, you quickly move from minimal version control to a more robust approach that supports collaboration and security needs.

How do I do better?

Below are rapidly actionable methods to move from a custom approach to a standard one:

  1. Map Existing Branching to a Known Strategy

  2. Document a Cross-Reference

    • If you choose GitFlow, rename your custom “hotfix” or “dev” branches to align with standard naming, making it easier for new joiners.
  3. Simplify Where Possible

    • Some custom strategies overcomplicate merges. Consolidate or reduce the number of long-lived branches to avoid confusion.
  4. Provide a Quick “Cheatsheet”

  5. Pilot a Standard Flow on a New Project

    • In parallel, adopt a recognised model (e.g., GitHub Flow) on a small project to gain team familiarity before rolling it out more widely.

By comparing your custom model to standard flows, documenting a cross-reference, simplifying branch use, providing a quick reference, and trialing a standard approach on a new project, you reduce complexity and align with recognised best practices.

How to do better

Below are rapidly actionable improvements:

  1. Document the Adaptations

    • Clarify how your version of GitFlow or trunk-based differs from the original.
    • Minimises onboarding confusion.
  2. Regularly Revisit Branch Usage

  3. Incorporate CI/CD Automation

  4. Train New Team Members

  5. Simplify for Next Project

    • If you find the strategy too complex for frequent releases, consider trunk-based or GitHub Flow on your next new service or microservice.

By documenting your adaptations clearly, removing unused branches, adding CI/CD hooks for every branch commit, onboarding new developers, and evaluating simpler flows for future projects, you ensure your branch strategy remains practical and efficient.

How to do better

Below are rapidly actionable ways to optimise a textbook GitFlow-like approach:

  1. Apply Automated Merges/Sync

  2. Monitor Branch Sprawl

    • Limit the number of concurrent “release” branches.
    • If development runs several release branches at once, with changes merged back and forth between them, consider whether a simpler model might be more agile.
  3. Include Security Checks per Branch

  4. Document Rarely Used Branches

    • If your GitFlow includes “hotfix” or “maintenance” branches rarely used, confirm usage patterns or retire them for simplicity.
  5. Evaluate Branch Strategy Periodically

    • Every 6–12 months, revisit whether GitFlow remains necessary or trunk-based dev might serve better for speed.

By automating merges, controlling branch sprawl, embedding security checks into every branch, documenting rarely used branches, and regularly re-evaluating your overall branching structure, you keep your textbook GitFlow or similar approach practical and effective.

How to do better

Below are rapidly actionable ways to refine a minimal branch strategy:

  1. Expand Test Coverage

  2. Establish Feature Flags

  3. Enforce Peer Review

  4. Set Real-Time Release Observability

  5. Encourage Short-Lived Branches

    • Keep branches open for days or less, not weeks, ensuring minimal drift from main and fewer merge conflicts.

By strengthening test coverage, leveraging feature flags, requiring peer reviews, observing real-time release metrics, and promoting short-lived branches, you optimise a streamlined approach that fosters continuous delivery and rapid iteration aligned with modern DevSecOps standards.
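
To make the feature-flag step (2) above concrete, here is a minimal sketch of a flag check that decouples merging code to main from exposing it to users. The flag names and the in-memory flag store are assumptions for illustration; in practice the flags would live in a config file or a flag-management service.

```python
import random

# Illustrative flag store; the structure is an assumption for the example.
FLAGS = {
    "new-search-ui": {"enabled": True, "rollout_percent": 10},  # canary-style rollout
    "bulk-export":   {"enabled": False, "rollout_percent": 0},
}

def flag_enabled(name: str) -> bool:
    flag = FLAGS.get(name, {"enabled": False, "rollout_percent": 0})
    if not flag["enabled"]:
        return False
    # Expose the feature to only a percentage of requests or users.
    return random.uniform(0, 100) < flag["rollout_percent"]

# Code merged to main can ship with the feature dark until the flag is raised.
if flag_enabled("new-search-ui"):
    print("render new search UI")
else:
    print("render existing search UI")
```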

Keep doing what you’re doing, and consider sharing your version control and branching strategy successes through blog posts or contributing them as best practices. This helps other UK public sector organisations adopt effective workflows aligned with NCSC, NIST, and GOV.UK guidance for secure, efficient software development.


How do you provision cloud services? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond purely manual provisioning:

  1. Start Capturing Configurations in Scripts

  2. Implement Basic Naming and Tagging Conventions (see the sketch after this list)

    • Create a short doc listing agreed naming prefixes/suffixes and mandatory tags:
      • e.g., DepartmentName, Environment (Dev/Test/Prod), Owner tags.
    • This fosters consistency and prepares for more advanced automation.
  3. Add a Simple Approval Step

    • If you’re used to provisioning without oversight, set up a minimal “approval check.”
    • For instance, use a shared Slack or Teams channel where you post new resource requests, and a manager or security person acknowledges before provisioning.
  4. Consider a Pilot with Infrastructure as Code (IaC)

  5. Document Provisioning Steps

    • Keep a simple runbook or wiki page. Summarise each manual provisioning step so you can easily shift these instructions into scripts or templates later.

By scripting basic tasks, implementing a simple naming/tagging policy, adding minimal approvals, and piloting an IaC solution, you start transitioning from ad hoc provisioning to more consistent automation practices.
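
As a small illustration of the naming and tagging step (2) above, the check below validates that a resource definition carries the mandatory tags before provisioning proceeds. The tag keys mirror the examples in the list; the resource dictionary is an assumed input format rather than a specific cloud API response.

```python
MANDATORY_TAGS = {"DepartmentName", "Environment", "Owner"}
ALLOWED_ENVIRONMENTS = {"Dev", "Test", "Prod"}

def validate_tags(resource_name: str, tags: dict) -> list:
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = [f"missing tag '{key}'" for key in MANDATORY_TAGS - tags.keys()]
    env = tags.get("Environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"Environment '{env}' not one of {sorted(ALLOWED_ENVIRONMENTS)}")
    return [f"{resource_name}: {p}" for p in problems]

# Example run against a hypothetical resource definition.
print(validate_tags("app-server-01", {"DepartmentName": "Housing", "Environment": "Dev"}))
# -> ["app-server-01: missing tag 'Owner'"]
```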

How to do better

Below are rapidly actionable ways to standardise your provisioning scripts:

  1. Adopt a Common Repository for Scripts

    • Create an internal Git repo (e.g., on GitHub, GitLab, or a cloud-hosted repo) for all provisioning scripts:
    • Encourage teams to share and reuse scripts, aligning naming conventions and code structure.
  2. Define Minimal Scripting Standards

    • E.g., standard file naming, function naming, environment variable usage, or logging style.
    • Keep it simple but ensure each team references the same baseline.
  3. Use Infrastructure as Code Tools

  4. Create a Shared Module or Template Library

    • If multiple teams need similar infrastructure (e.g., a standard VPC, a typical storage bucket), store that logic in a common template or module:
      • e.g., Terraform modules in a shared Git repo or a private registry.
    • This ensures consistent best practices are used across all projects.
  5. Encourage Collaboration and Peer Reviews

    • Have team members review each other’s scripts or templates in a code review process, catching mistakes and unifying approaches along the way.

By consolidating scripts in a shared repository, defining lightweight standards, introducing IaC tools, and fostering peer reviews, you gradually unify your provisioning process and reduce fragmentation.

How to do better

Below are rapidly actionable ways to expand your declarative automation:

  1. Set Organisation-Wide IaC Defaults

    • Decide on a primary IaC tool (Terraform, CloudFormation, Bicep, Deployment Manager, or others) and specify guidelines:
      • e.g., “All new infrastructure that goes to production must use Terraform for provisioning, with code in X repo.”
  2. Create a Reference Architecture or Template

  3. Extend IaC Usage to Lower Environments

    • Even for dev/test, use declarative templates so staff get comfortable and maintain consistency:
      • This ensures the same patterns scale up to production effortlessly.
  4. Implement Automated Checks

  5. Offer Incentives for Adoption

    • e.g., Team metrics or internal recognition if all new deployments use IaC.
    • Showcase success stories: “Team A reduced production incidents by 30% after adopting IaC.”

By standardising your IaC approach, providing shared templates, enforcing usage even in lower environments, and automating checks, you accelerate your journey toward uniform, declarative provisioning across teams.
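
As one way to start the automated checks in step 4 above, the sketch below scans a Terraform plan (exported as JSON) for newly created resources that are missing mandatory tags, so a pipeline can fail fast before anything is applied. The workflow and field names assume Terraform's JSON plan format; adjust the tag attribute (for example `tags` vs `tags_all`) for your provider, and treat this as a starting point rather than a complete policy engine.

```python
import json
import sys

MANDATORY_TAGS = {"DepartmentName", "Environment", "Owner"}

def check_plan(plan_path: str) -> int:
    """Scan a Terraform JSON plan for created resources missing mandatory tags."""
    with open(plan_path) as f:
        plan = json.load(f)

    failures = 0
    for change in plan.get("resource_changes", []):
        details = change.get("change") or {}
        after = details.get("after") or {}
        tags = after.get("tags") or {}  # not all resource types carry tags
        missing = MANDATORY_TAGS - tags.keys()
        if "create" in details.get("actions", []) and missing:
            print(f"{change['address']}: missing tags {sorted(missing)}")
            failures += 1
    return failures

if __name__ == "__main__":
    # Assumed workflow: terraform plan -out=plan.out
    #                   terraform show -json plan.out > plan.json
    #                   python check_plan.py plan.json
    sys.exit(1 if check_plan(sys.argv[1]) else 0)
```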

How to do better

Below are rapidly actionable ways to continue refining:

  1. Integrate with CI/CD Pipelines

  2. Establish a Platform Engineering or DevOps Guild

    • A cross-team group can maintain shared IaC libraries, track upgrades, and collaborate on improvements.
    • This fosters ongoing enhancements and helps new teams onboard quickly.
  3. Strengthen Security and Compliance Automation

    • Embed more advanced checks into your IaC pipeline:
      • e.g., verifying that certain resources cannot be exposed to the public internet, forcing encryption at rest, etc.
  4. Expand to Multi-Cloud or Hybrid

    • If relevant, unify your IaC approach for resources across multiple clouds or on-prem environments:
      • Tools like Terraform can handle multi-cloud provisioning under one codebase.
  5. Continue Upskilling Staff

    • Offer advanced IaC training, sessions on best practices, or pair programming to help teams adopt more sophisticated patterns (modules, dynamic references, etc.).

By using formal CI/CD for all deployments, fostering a DevOps guild, strengthening compliance checks, and supporting multi-cloud approaches, you refine widespread IaC usage into a highly orchestrated, reliable practice across the organisation.

How to do better

Below are rapidly actionable ways to push the boundaries, even at the highest maturity:

  1. Implement Policy-as-Code

  2. Adopt Advanced Testing and Security Checks

    • Extend your pipeline to run static code analysis (SAST), dynamic checks (DAST), and security scanning for container images or VM base images.
    • Provide a thorough “shift-left” approach, catching issues pre-production.
  3. Introduce Automated Change Approvals

    • If you want a “human in the loop” for major changes, use pipeline gating:
      • e.g., a Slack or Teams approval step before applying infrastructure changes in production.
    • This merges automation with the final manual sign-off for critical changes.
  4. Evolve Toward Self-Service Platforms

  5. Expand to True “GitOps” for Ongoing Management

    • Continuously synchronise changes from Git to your runtime environment:
      • e.g., using FluxCD or ArgoCD for containerised workloads, or hooking a Terraform operator into a Git repo.

By integrating policy-as-code, advanced security checks, optional gating approvals, self-service catalogs, and GitOps strategies, you refine your mandatory declarative automation approach into a truly world-class, highly efficient model of modern cloud provisioning for the UK public sector.

Keep doing what you’re doing, and consider sharing internal or external blog posts about your provisioning automation journey. Submit pull requests to this guidance or similar public sector best-practice repositories to help others learn from your experiences and successes.

Operations

Do you use continuous integration and continuous deployment (CI/CD) tools? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to adopt a basic CI/CD foundation:

  1. Begin with Simple Scripting

  2. Implement Basic Automated Testing

    • Start by automating unit tests:
      • Each commit triggers a script that runs tests in a shared environment, providing at least a “pass/fail” outcome.
  3. Use a Shared Version Control Repository

    • If you’re not already using one, adopt Git (e.g., GitHub, GitLab, or an internal service) to store your source code so that you can begin integrating basic CI steps.
  4. Document the Process

    • Create a short runbook or wiki entry explaining how code is built, tested, and deployed.
    • This helps new team members adopt the new process.
  5. Set a Goal to Remove Manual Steps Gradually

    • Identify the most error-prone or time-consuming manual tasks. Automate them first to gain quick wins.

By introducing simple build/test scripting, hosting code in version control, and documenting your process, you establish the baseline for a more formal CI/CD pipeline in the future.

How to do better

Below are rapidly actionable ways to broaden CI/CD usage:

  1. Establish a Centralised CI/CD Reference

    • Create an internal wiki or repository showcasing how leading teams set up their pipelines:
    • Encourage other teams to replicate successful patterns.
  2. Provide or Recommend CI/CD Tools

  3. Host Skill-Sharing Sessions

    • Have teams currently using CI/CD present their approaches in short lunch-and-learn sessions.
    • Record these sessions so new staff or less mature teams can learn at their own pace.
  4. Create Minimal Pipeline Templates

    • Provide a starter template for each major language or platform (e.g., Node.js, Java, .NET).
    • Ensure these templates include basic build, test, and package steps out of the box.
  5. Reward Cross-Team Collaboration

    • If a more advanced project helps a struggling team set up their pipeline, recognise both parties’ efforts.
    • This fosters a culture of internal assistance rather than siloed approaches.

By sharing knowledge, offering recommended tools, and providing example templates, you organically expand CI/CD adoption and empower teams to adopt consistent approaches.

How to do better

Below are rapidly actionable ways to refine or unify CI/CD tool usage:

  1. Define Core Principles or Best Practices

    • Even if each team chooses different tools, align on key principles:
      • Every pipeline must run unit tests, produce build artifacts, and store logs.
      • Every pipeline must integrate with code reviews and version control.
    • This ensures consistency of outcomes, if not standard tooling.
  2. Document Cross-Tool Patterns

    • Create a short doc or wiki explaining how to handle:
      • Secrets management, environment variables, artifact storage, and standard branch strategies.
    • This helps teams use the same approach to security and governance, even if they use different CI/CD apps.
  3. Encourage Modular Pipeline Code

    • Teams can share modular scripts or config chunks for tasks like static analysis, security checks, or environment provisioning:
      • e.g., Docker build modules, Terraform integration steps, or test coverage logic.
  4. Highlight or Mentor

    • If certain pipelines are especially successful, highlight them as “recommended” or offer mentorship so other teams can replicate their approach.
    • Over time, the organisation may naturally converge on a handful of widely accepted tools.
  5. Consider a Central CI/CD Service for Key Use Cases

By defining core CI/CD principles, documenting shared patterns, and selectively offering a central service or recommended tool, you maintain team autonomy while reaping benefits of consistent practices.

How to do better

Below are rapidly actionable ways to refine widespread team-driven CI/CD:

  1. Introduce a DevOps Guild or CoE (Centre of Excellence)

    • Regularly meet with representatives from each team, discussing pipeline improvements, new features, or security issues.
    • Gather best practices in a single location.
  2. Further Integrate Security (DevSecOps)

  3. Standardise Basic Access & Observability

    • Regardless of the pipeline tool, ensure:
      • A consistent approach to storing build logs and artifacts, tagging builds with version numbers, and applying RBAC for pipeline access.
    • This unifies the data your compliance officers or governance teams rely on.
  4. Automate Approvals for Critical Environments

    • If production deployments require sign-off, implement a pipeline-based approval process:
      • e.g., Slack or Teams-based approval checks, or an integrated manual approval step in the pipeline (Azure DevOps, GitHub Actions, GCP Cloud Build triggers, or AWS CodePipeline).
  5. Measure Pipeline Performance and Reliability

    • Gather metrics like average build time, deployment success rate, or lead time for changes.
    • Use these insights to target pipeline improvements or unify slow or error-prone steps.

By fostering a DevOps guild, infusing security checks, and unifying logging/artifact storage, you balance team autonomy with enough cross-cutting standards to maximise reliability and compliance.
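
To make the measurement step (5) above tangible, the sketch below derives three of the suggested metrics (average build time, deployment success rate, and lead time for changes) from pipeline run records. The record format is an assumption for illustration; in practice the data would be exported from your CI/CD tool's API or logs.

```python
from datetime import datetime
from statistics import mean

# Illustrative pipeline run records; field names are assumptions for this sketch.
runs = [
    {"commit_at": "2024-06-03T09:12", "deployed_at": "2024-06-03T11:40", "build_minutes": 14, "succeeded": True},
    {"commit_at": "2024-06-04T10:05", "deployed_at": "2024-06-04T10:52", "build_minutes": 11, "succeeded": True},
    {"commit_at": "2024-06-05T15:30", "deployed_at": "2024-06-05T18:20", "build_minutes": 19, "succeeded": False},
]

def lead_time_hours(run):
    start = datetime.fromisoformat(run["commit_at"])
    end = datetime.fromisoformat(run["deployed_at"])
    return (end - start).total_seconds() / 3600

print(f"Average build time: {mean(r['build_minutes'] for r in runs):.1f} minutes")
print(f"Deployment success rate: {sum(r['succeeded'] for r in runs) / len(runs):.0%}")
print(f"Mean lead time for changes: {mean(lead_time_hours(r) for r in runs):.1f} hours")
```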

How to do better

Below are rapidly actionable ways to refine your standardised CI/CD practices:

  1. Adopt Pipeline-as-Code for All

  2. Implement Advanced Deployment Strategies

    • For example, canary or blue/green deployments:
      • This reduces downtime and risk during releases, making your pipelines more robust.
  3. Integrate Policy-as-Code

    • Ensure pipeline runs automatically verify compliance with organisational policies:
      • e.g., scanning IaC templates or container images for security or cost violations, referencing official standards.
  4. Expand Observability

    • Offer real-time dashboards for build success rates, deployment times, and test coverage.
    • Publish these metrics in a central location so leadership and cross-functional teams see progress.
  5. Encourage “Chaos Days” or Hackathons

    • Let teams experiment with pipeline improvements, new integration patterns, or novel reliability tests.
    • This fosters ongoing innovation and ensures your standardised approach does not become static.

By version-controlling pipeline definitions, embracing advanced deployment patterns, integrating policy checks, and driving continuous improvement initiatives, you keep your standardised CI/CD framework at the cutting edge—well-aligned with UK public sector priorities of robust compliance, reliability, and efficiency.

Keep doing what you’re doing, and consider writing up your CI/CD journey in internal blog posts or knowledge bases. Submit pull requests to this guidance or related public sector best-practice repositories so others can learn from your experiences as well.


How fast are your builds and deployments? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to introduce basic measurements and reduce build/deployment durations:

  1. Implement a Simple Tracking Mechanism

    • Start by documenting each deployment’s start and end times in a spreadsheet or ticket system:
      • Track which environment was deployed, total time taken, any blockers encountered.
    • Over a few weeks, you’ll get a baseline for improvement.
  2. Automate Basic Steps

  3. Adopt a Central Version Control System

    • If you aren’t already, store source code and deployment artifacts in Git (e.g., GitHub, GitLab, Azure Repos, etc.):
    • This lays the groundwork for more advanced automation later.
  4. Introduce Basic SLAs for Deployment Windows

    • e.g., “We aim to complete production deployments within 1 working day once approved.”
    • This ensures staff start to see time-to-deploy as a priority.
  5. Identify Key Bottlenecks

    • Are approvals causing delays? Are you waiting for a single SME to do manual steps?
    • Focus on automating or streamlining the top pain point first.

By tracking deployments in a simple manner, automating the most time-consuming tasks, and setting minimal SLAs, you begin reducing deployment time and gain insight into where further improvements can be made.
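
If a spreadsheet feels too manual for the tracking mechanism in step 1 above, a few lines of scripting can capture the same baseline data. The file location and columns below are assumptions for illustration; the point is simply to record start time, end time, and blockers for every deployment.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_FILE = Path("deployments.csv")  # illustrative location

def record_deployment(environment: str, started: datetime, finished: datetime, blockers: str = ""):
    """Append one deployment record so a baseline builds up over a few weeks."""
    new_file = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["environment", "started", "finished", "minutes", "blockers"])
        minutes = round((finished - started).total_seconds() / 60, 1)
        writer.writerow([environment, started.isoformat(), finished.isoformat(), minutes, blockers])

# Example usage after a production deployment.
record_deployment(
    "production",
    started=datetime(2024, 6, 10, 14, 0, tzinfo=timezone.utc),
    finished=datetime(2024, 6, 10, 15, 25, tzinfo=timezone.utc),
    blockers="waited 40 minutes for change approval",
)
```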

How to do better

Below are rapidly actionable ways to reduce delays and evolve your tracking:

  1. Automate Testing

  2. Streamline Approvals

    • If manager sign-off is causing long waits, propose a structured yet efficient approval flow:
      • For example, define a Slack or Teams channel where changes can be quickly acknowledged.
      • Use a ticket system or pipeline-based manual approval steps that require minimal overhead.
  3. Implement Parallel or Incremental Deployments

    • Instead of a big-bang approach, deploy smaller changes more frequently:
      • If teams see fewer changes in each release, testing and validation can be quicker.
  4. Enforce Clear Deployment Windows

    • e.g., “Production deploys occur every Tuesday and Thursday at 2 PM,” with a cut-off for code submissions.
    • This planning reduces ad hoc deployments that cause confusion.
  5. Set Target Timelines

    • e.g., “Builds should not exceed 30 minutes from commit to artifact,” or “Deployments to test environments should complete within an hour of code merges.”
    • Start small, measure progress, and refine goals.

By adding automated testing, simplifying approvals, and promoting incremental deployments, you shorten delays and create a more responsive release pipeline.

How to do better

Below are rapidly actionable ways to enhance your moderate efficiency:

  1. Add Real-Time or Automated Monitoring

  2. Optimise Build and Test Steps

  3. Adopt Infrastructure as Code (IaC)

  4. Implement Rolling or Blue/Green Deployments

    • Reduce downtime and user impact by applying advanced deployment strategies.
    • The more confident you are in your pipeline, the faster you can roll out changes.
  5. Introduce Regular Retrospectives

    • e.g., monthly or bi-weekly sessions to review deployment metrics (average build time, deployment durations).
    • Plan small improvements each cycle—like removing a manual test step or simplifying a build script.

By improving monitoring, optimising test/build steps, adopting IaC, and refining deployment strategies, you make your moderately efficient process even faster and more stable.

How to do better

Below are rapidly actionable ways to optimise an already streamlined process:

  1. Expand Shift-Left Testing and Security

  2. Add Automated Rollback or Canary Analysis

  3. Adopt Feature Flags

    • Further speed up deployment by decoupling feature rollout from the actual code release:
      • This allows partial or user-segmented rollouts, improving feedback loops.
  4. Implement Detailed Pipeline Telemetry

    • If you only track overall build/deploy times, gather finer metrics:
      • Time spent in unit tests vs. integration tests, container image builds vs. scanning, environment creation vs. final validations.
    • These insights highlight your next optimisation targets.
  5. Formalise Continuous Improvement

    • Host regular pipeline reviews or “build engineering” sprints.
    • Evaluate changes in build times, error rates, or frequency of hotfixes. Use these insights to plan enhancements.

By infusing advanced scanning, canary release strategies, feature flags, and deeper telemetry into your existing streamlined pipeline, you further reduce risk, speed up feedback, and maintain a high level of operational maturity.

How to do better

Below are rapidly actionable ways to refine a near-optimal pipeline:

  1. Incorporate AI/ML Insights

    • Tools or custom scripts that analyse build logs and deployment results for anomalies or patterns over time:
      • e.g., predicting which code changes may cause test failures, optimising pipeline concurrency.
  2. Expand Multi-Stage Testing and Observability

  3. Share Expertise Across Agencies

    • If your pipeline is among the fastest in the UK public sector, participate in cross-government knowledge-sharing:
      • Offer case studies or presentations at GDS or GovTech events, or collaborate with other agencies for mutual learning.
  4. Fully Integrate Infrastructure and Policy as Code

    • Ensure that not only your app code but also your network, security group, and policy definitions are stored in the pipeline, with automatic checks:
      • This creates a fully self-service environment for dev teams, reducing manual interventions further.
  5. Set Zero-Downtime Deployment Goals

    • If you haven’t already, aim for zero user-impact deployments:
      • e.g., advanced canary or rolling strategies in every environment, with automated rollback if user metrics degrade.

By experimenting with AI-driven pipeline intelligence, chaos engineering, advanced zero-downtime deployment strategies, and cross-department collaboration, you continue pushing the boundaries of high-speed, highly reliable build/deployment processes—reinforcing your position as a leader in efficient operations within the UK public sector.

Keep doing what you’re doing, and consider creating blog posts or internal case studies to document your continuous improvement journey. You can also submit pull requests to this guidance or related public sector best-practice repositories, helping others learn from your approach to fast and dependable build/deployment processes.


How do you monitor your systems? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move from reactive observation to basic continuous monitoring:

  1. Implement Simple Infrastructure Monitoring

  2. Enable Basic Application Logging

  3. Set Up Minimal Alerts

    • e.g., CPU usage > 80% triggers an email, or container restarts exceed a threshold:
      • This ensures you don’t rely purely on user reports for operational awareness.
  4. Document Observability Practices

    • A short wiki or runbook describing how to check logs, which metrics to watch, and who to contact if issues emerge.
    • Even a minimal approach fosters consistency across dev and ops teams.
  5. Schedule a Monitoring Improvement Plan

    • Book a monthly or quarterly checkpoint to discuss any monitoring issues or data from the past period.
    • Decide on incremental enhancements each time.

By adopting basic infrastructure metrics, centralising logs, configuring minimal alerts, and documenting your approach, you shift from purely reactive observation to foundational continuous monitoring.

How to do better

Below are rapidly actionable ways to integrate your basic monitoring tools:

  1. Consolidate Metrics in a Central Dashboard

  2. Automate Alerts

    • Replace or supplement manual checks with automated alerts for abnormal spikes or dips:
      • e.g., memory usage, 5xx error rates, queue backlogs, etc.
    • Alerts should reach relevant Slack/Teams channels or an email distribution list.
  3. Introduce Tagging for Correlation

    • If you tag resources consistently, your monitoring tool can group related services:
      • e.g., “Project=ServiceX” or “Environment=Production.”
    • This helps you spot trends across all resources for a specific application.
  4. Document Standard Operating Procedures (SOPs)

    • For each common alert (e.g., high CPU, memory leak), define recommended steps or references to logs for quick troubleshooting.
    • This reduces reliance on guesswork or individual heroics.
  5. Integrate with Deployment Pipelines

    • If you have a CI/CD pipeline, embed a step that checks basic health metrics post-deployment:
      • e.g., if error rates spike after a new release, roll back automatically or alert the dev team.

By consolidating metrics, automating alerts, introducing consistent tagging, and creating SOPs, you reduce manual overhead and gain a more unified picture of your environment, improving response times.
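
As a concrete example of the pipeline integration in step 5 above, here is a minimal post-deployment health check that a pipeline can run: it polls a health endpoint a few times and exits non-zero if the service never responds, which lets the pipeline alert or roll back. The URL, retry counts, and timings are placeholder assumptions.

```python
import sys
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://service.example.gov.uk/health"  # placeholder endpoint
ATTEMPTS = 5
WAIT_SECONDS = 30

def healthy(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

for attempt in range(1, ATTEMPTS + 1):
    if healthy(HEALTH_URL):
        print(f"Health check passed on attempt {attempt}")
        sys.exit(0)
    print(f"Attempt {attempt} failed; retrying in {WAIT_SECONDS}s")
    time.sleep(WAIT_SECONDS)

# Non-zero exit lets the pipeline trigger an alert or an automated rollback.
sys.exit(1)
```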

How to do better

Below are rapidly actionable ways to deepen integration of infrastructure and application data:

  1. Adopt APM (Application Performance Monitoring) Tools

  2. Implement Unified Logging and Metric Correlation

    • Use a logging solution that supports correlation IDs or distributed traces:
      • This helps you pivot from an app error to the underlying VM or container metrics in one step.
  3. Create Multi-Dimensional Alerts

    • Instead of CPU-based alerts alone, combine them with application error rates or queue backlog:
      • e.g., alert only if CPU > 80% AND 5xx errors spike, reducing false positives.
  4. Enable Synthetic Monitoring

    • Set up automated user-journey or transaction tests:
      • If these fail, you know the user experience is impacted, not just backend metrics.
  5. Refine SLA/SLI/SLO

    • If you measure high-level “availability,” break it down into a more precise measure (e.g., 99.9% of user requests under 2 seconds).
    • Align your alerts to these SLOs so your monitoring focuses on real user impact.

By combining APM, correlated logs, synthetic tests, and multi-dimensional alerts, you ensure your teams spot potential issues quickly and tie them directly to user experience, thereby boosting operational effectiveness.
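
The sketch below illustrates steps 3 and 5 above: a multi-dimensional alert condition that fires only when CPU and 5xx errors are both abnormal, and a simple SLI calculation against the 2-second target mentioned in the text. The thresholds and sample data are assumptions for illustration.

```python
def should_alert(cpu_percent: float, error_5xx_per_min: float) -> bool:
    """Multi-dimensional alert: fire only when CPU is high AND 5xx errors spike."""
    return cpu_percent > 80 and error_5xx_per_min > 5  # illustrative thresholds

def slo_compliance(latencies_ms: list, threshold_ms: float = 2000) -> float:
    """SLI: fraction of requests completing under the 2-second target."""
    within = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return within / len(latencies_ms)

# Example readings from monitoring (made-up numbers).
print(should_alert(cpu_percent=86, error_5xx_per_min=2))   # False: CPU high but errors normal
print(should_alert(cpu_percent=86, error_5xx_per_min=12))  # True: both conditions breached
print(f"SLO attainment: {slo_compliance([350, 800, 2400, 150, 600]):.1%}")  # compare to the SLO target
```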

How to do better

Below are rapidly actionable methods to push partial integration to near full integration:

  1. Enhance Distributed Tracing

  2. Adopt an Observability-First Culture

    • Encourage developers to embed structured logs, custom metrics, and trace headers from day one.
    • This synergy helps advanced monitoring tools build a full picture of performance.
  3. Automate Root Cause Analysis (RCA)

    • Some advanced tools or scripts can identify potential root causes by analysing correlated data:
      • e.g., pinpoint a failing database node or a memory leak in a specific container automatically.
  4. Refine Alert Thresholds Using Historical Data

    • If you have advanced metrics but struggle with noisy or missed alerts, adjust thresholds based on past trends.
    • e.g., If your memory usage typically runs at 70% baseline, alert at 85% instead of 75% to reduce false positives.
  5. Integrate ChatOps

    • Deliver real-time alerts and logs to Slack/Teams channels. Let teams query metrics or logs from chat directly:
      • e.g., a chatbot that surfaces relevant data for incidents or just-in-time debugging.

By fortifying distributed tracing, adopting an “observability-first” mindset, automating partial root cause analysis, and refining alerts, you close the remaining gaps and strengthen end-to-end situational awareness.
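
For the threshold-refinement step (4) above, one simple approach is to derive the alert threshold from a high percentile of recent history plus some headroom, rather than picking a round number. The percentile choice and headroom below are illustrative assumptions to be tuned against your own false-positive experience.

```python
from statistics import quantiles

def suggest_threshold(history: list, headroom: float = 10.0) -> float:
    """Suggest an alert threshold: 95th percentile of recent samples plus headroom."""
    p95 = quantiles(history, n=100)[94]   # 95th percentile of the samples
    return min(p95 + headroom, 100.0)     # cap at 100% for a utilisation metric

# Memory utilisation samples (%) over the last month, baseline around 70%.
samples = [68, 71, 69, 73, 70, 72, 74, 69, 71, 75, 70, 72, 68, 73, 71, 76, 70, 69, 72, 74]
print(f"Suggested alert threshold: {suggest_threshold(samples):.0f}%")
```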

How to do better

Below are rapidly actionable ways to refine an already integrated “single pane of glass” approach:

  1. Leverage AI/ML-Based Anomaly Detection

  2. Implement Self-Healing

    • If your integrated system detects a consistent fixable issue, automate the remedy:
      • e.g., automatically scale containers or restart a microservice if certain metrics exceed thresholds.
    • Ensure any automated fix logs the action for audit or compliance.
  3. Integrate Observability with ChatOps

    • Offer real-time interactive troubleshooting:
      • e.g., Slack bots that can run queries or “explain” anomalies using your “single pane” data.
  4. Adopt Full Lifecycle Cost and Performance Analysis

    • Link your monitoring data to cost metrics for a holistic view:
      • e.g., seeing how scaling up or out affects not only performance but also budget.
    • This fosters more strategic decisions around resource usage.
  5. Share Observability Insights Across the Public Sector

    • If you’ve achieved a truly integrated solution, document your architecture, the tools you used, and best practices.
    • Present or collaborate with other agencies or local councils, uplifting broader public sector observability.

By harnessing AI-driven detection, automating remediation steps, integrating real-time ChatOps, and linking cost with performance data, you push your advanced single-pane-of-glass monitoring to a new level—enabling near-instant responses and deeper strategic insights.
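
To show the shape of the self-healing step (2) above, here is a minimal remediation sketch: it acts on a known-fixable condition and writes an audit record for every automated action. The `restart_service` function is a hypothetical placeholder for whatever platform API or operator you actually use, and the threshold is an assumption.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="self_healing_audit.log", level=logging.INFO)

def restart_service(name: str) -> None:
    """Placeholder: in practice this would call your platform's API
    (e.g. a container restart or a scaling action)."""
    print(f"restarting {name} ...")

def remediate_if_needed(service: str, restart_count_last_hour: int, threshold: int = 5) -> None:
    if restart_count_last_hour < threshold:
        return
    restart_service(service)
    # Every automated fix is logged so the action can be audited later.
    logging.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "condition": f"{restart_count_last_hour} restarts in the last hour",
        "action": "automated service restart",
    }))

remediate_if_needed("payments-api", restart_count_last_hour=7)
```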

Keep doing what you’re doing, and consider writing internal blogs or case studies on your observability journey. Submit pull requests to this guidance or other public sector best-practice repositories to help others learn from your experiences with integrated cloud monitoring.


How do you get real-time data and insights? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to improve data literacy and real-time insight capabilities:

  1. Provide Basic Data Literacy Training

    • Organise short workshops, possibly in partnership with GOV.UK Data in government guidance or local councils, focusing on:
      • How to read and interpret basic charts or dashboards.
      • Terminology for metrics (e.g., “mean,” “median,” “time series,” “confidence intervals”).
    • This empowers more staff to self-serve on simpler data queries.
  2. Adopt a Simple Visualisation or BI Tool

  3. Pilot a Data Lake or Central Data Repository

  4. Encourage a Data Buddy System

    • Pair domain experts with data-literate staff (or external analysts) who can guide them on structured data approaches.
    • This fosters knowledge transfer and upskills both sides.
  5. Reference Official Guidance on Data Handling

By improving data literacy, introducing a basic BI tool, creating a pilot data repository, and pairing experts with data-savvy staff, you begin reducing your reliance on point-in-time manual analysis. Over time, these steps pave the way for real-time insights.

How to do better

Below are rapidly actionable ways to transition from basic delayed reporting to more timely insights:

  1. Explore Incremental Data Refresh

  2. Add Near Real-Time Dashboards

    • Maintain existing weekly summary reports while layering a near real-time view for critical metrics:
      • e.g., the number of service requests in the last hour or real-time error rates in a public-facing service.
  3. Improve Data Quality Checks

  4. Set Timeliness KPIs

    • e.g., “All critical data sets must be updated at least every 2 hours,” or “System error logs refresh in analytics within 15 minutes.”
    • Over time, strive to meet or improve these targets.
  5. Align with NCSC and NIST Guidance on Continuous Monitoring

With incremental data refreshes, partial real-time dashboards, better data pipelines, and timeliness KPIs, you reduce the gap between data generation and insight delivery, improving responsiveness.
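
A small freshness check can turn the timeliness KPIs in step 4 above into something measurable. The dataset names, targets, and last-refresh timestamps below are assumptions for illustration; in practice they would come from your pipeline metadata or data catalogue.

```python
from datetime import datetime, timedelta, timezone

# Illustrative freshness targets and last-refresh times.
TARGETS = {
    "service-requests": timedelta(hours=2),
    "error-logs": timedelta(minutes=15),
}
last_refreshed = {
    "service-requests": datetime.now(timezone.utc) - timedelta(minutes=50),
    "error-logs": datetime.now(timezone.utc) - timedelta(minutes=40),
}

now = datetime.now(timezone.utc)
for dataset, target in TARGETS.items():
    age = now - last_refreshed[dataset]
    status = "OK" if age <= target else "STALE"
    print(f"{dataset}: last refreshed {age.total_seconds() / 60:.0f} min ago "
          f"(target {target}) -> {status}")
```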

How to do better

Below are rapidly actionable ways to enhance your partially real-time analytics:

  1. Adopt Stream Processing for More Datasets

  2. Consolidate Real-Time Dashboards

  3. Enhance Data Integration

    • If certain data sets remain batch-only, try hybrid ingestion methods:
      • e.g., partial streaming for time-critical fields, with scheduled batch loads for large historical data.
  4. Conduct Cross-Team Drills

    • Run mock scenarios (e.g., a surge in user transactions or a security event) to test if real-time analytics allow quick response.
    • Identify where missing or delayed data hampers resolution.
  5. Leverage Gov/Industry Guidance

By increasing stream processing, consolidating dashboards, and expanding real-time coverage to more data sets, you minimise the blind spots in your analytics, enabling faster, more informed decisions across the board.

How to do better

Below are rapidly actionable ways to refine your advanced real-time analytics:

  1. Enhance Data Federation and Governance

  2. Promote Self-Service BI

    • Offer user-friendly dashboards with drag-and-drop analytics:
      • e.g., enabling policy officers, operation managers, or finance leads to build custom views without waiting on IT.
  3. Incorporate Automated Anomaly Detection

  4. Support Data Literacy Initiatives

  5. Set Real-Time Performance Goals

    • e.g., “90% of operational metrics should be visible within 60 seconds of ingestion.”
    • Routinely track how these goals are met or if data pipelines slow over time, making improvements as needed.

By strengthening data governance, encouraging self-service, adopting automated anomaly detection, and continuing to boost data literacy, you maximise the value of your advanced analytics environment.
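
As a deliberately simple stand-in for the automated anomaly detection in step 3 above, the sketch below flags a reading that sits far outside recent variation using a z-score. The window of samples and the threshold are assumptions; managed anomaly-detection services would normally replace this in production.

```python
from statistics import mean, stdev

def is_anomalous(history: list, latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest reading if it is more than z_threshold standard
    deviations away from the recent mean."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Requests per minute over the last half hour, then a sudden spike.
recent = [410, 395, 402, 388, 420, 407, 399, 415, 401, 396]
print(is_anomalous(recent, 405))   # False: within normal variation
print(is_anomalous(recent, 980))   # True: likely anomaly worth surfacing
```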

How to do better

Below are rapidly actionable ways to refine self-service real-time insights:

  1. Expand Data Sources and Data Quality

    • Enrich dashboards by integrating external open data or cross-department feeds.
  2. Introduce Natural Language or Conversational Queries

  3. Automate Governance and Access Controls

  4. Integrate Predictive Insights in Dashboards

    • If you have ML models, embed their output directly into the dashboard:
      • e.g., forecasting future usage or risk, highlighting anomalies on live charts.
  5. Foster Cross-department Collaboration

By expanding data sources, enabling natural language querying, automating governance, embedding predictive analytics, and partnering with other agencies, you ensure your comprehensive self-service environment stays at the cutting edge—empowering a data-driven culture in UK public sector organisations.

Keep doing what you’re doing, and consider blogging about your journey toward real-time analytics and self-service dashboarding. Submit pull requests to this guidance or other public sector best-practice repositories to help others learn from your successes in delivering timely, actionable insights.


How do you release updates? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to transition from downtime-based updates to more resilient approaches:

  1. Pilot a Rolling or Blue/Green Approach

  2. Establish a Basic CI/CD Pipeline

    • So that updates are automated and consistent:
      • e.g., run unit tests, integration checks, and create a deployable artifact with each commit.
    • NCSC’s guidance on DevSecOps or NIST SP 800-160 can inform security integration into the pipeline.
  3. Use Snapshot Testing or Quick Cloning

    • If you remain reliant on backups for rollback, test them frequently:
      • Ensure daily or more frequent snapshots can be swiftly restored in a staging environment to confirm reliability.
  4. Communicate Downtime Effectively

    • If immediate elimination of downtime is not feasible, set up a transparent communication plan.
  5. Aim for Rolling Updates Pilot

    • Identify at least one non-critical service to pilot rolling or partial updates, building confidence for production.

By adopting minimal rolling or staging-based updates, automating deployment pipelines, and ensuring robust backup/restore processes, you reduce the disruptive nature of downtime-based updates—paving the way for more advanced, near-zero-downtime methods.

How to do better

Below are rapidly actionable improvements:

  1. Implement Automated Health Checks

  2. Adopt a Canary or Blue/Green Strategy for Critical Services

    • Gradually test changes on a small portion of traffic before proceeding:
      • This reduces risk if an update has issues.
  3. Shorten or Eliminate Maintenance Windows

    • If rolling updates are stable, see if you can do them in business hours for services with robust capacity.
    • Communicate frequently with users about partial capacity reductions, referencing relevant GOV.UK operational guidelines.
  4. Automate Rollback

  5. Reference NCSC Guidance on Operational Resilience

By adding health checks, introducing partial canary or blue/green methods, and continuously automating rollbacks, you further minimise the user impact even within a rolling update strategy—potentially removing the need for fixed maintenance windows.

How to do better

Below are rapidly actionable ways to enhance manual cut-over processes:

  1. Automate the Switch

  2. Incorporate Automated Testing Pre-Cut-Over

    • Run smoke/integration tests on the new environment before the final switch:
      • If tests pass, you simply approve the cut-over.
  3. Establish Clear Checklists

    • List each step, from final pre-check to DNS swap, ensuring all relevant logs, metrics, or alerts are turned on:
      • Minimises risk of skipping a crucial step during a manual process.
  4. Use Observability Tools for Rapid Validation

    • After switching, verify the new environment quickly with real-time dashboards or synthetic user tests:
      • This helps confirm everything runs well before fully retiring the old version.
  5. Refer to NCSC Operational Resilience Guidance

    • NCSC documentation offers principles for ensuring minimal disruption when switching environments.
    • NIST SP 800-160 Vol 2 can also provide insights on engineering for cyber-resilience in deployment processes.

By automating as many cut-over steps as possible, implementing integrated testing, and leveraging robust observability, you reduce manual overhead while retaining the safety of parallel versions.

How to do better

Below are rapidly actionable methods to enhance manual canary or blue/green strategies:

  1. Automate Traffic Shaping

  2. Implement Automated Rollback

    • If metrics degrade beyond thresholds, revert automatically to the stable version without waiting for manual action:
      • e.g., a pipeline checking real-time error rates or latency.
  3. Adopt Observability-Driven Deployment

    • Use real-time logging, metrics, and user experience monitoring to confirm that the new version is healthy.
  4. Enhance Developer Autonomy

    • If your policy allows, let smaller updates or patch releases auto-deploy after canary checks pass, reserving manual oversight only for major changes or high-risk deployments.
  5. Consider ChatOps or Tools for One-Click Approvals

    • Slack/Teams integrated pipeline steps let authorised personnel type a simple command or press a button to shift traffic from old to new version.
    • This lowers friction while preserving manual control.

By introducing traffic shaping with partial auto-deploy or rollback, deeper observability, and flexible chat-based control, you refine your canary or blue/green approach, reducing the manual overhead of each release while keeping high confidence.
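
The automated rollback in step 2 above ultimately comes down to a comparison between the canary and the stable baseline. The sketch below shows one such decision rule; the thresholds and the source of the error-rate figures are assumptions, and real deployments would typically look at several metrics (latency, saturation) rather than errors alone.

```python
def canary_decision(baseline_error_rate: float, canary_error_rate: float,
                    max_absolute: float = 0.02, max_ratio: float = 2.0) -> str:
    """Decide whether a canary release should be promoted or rolled back.

    Roll back if the canary's error rate is materially worse than the baseline,
    either in absolute terms or relative to it. Thresholds are illustrative.
    """
    if canary_error_rate > max_absolute:
        return "rollback"
    if baseline_error_rate > 0 and canary_error_rate / baseline_error_rate > max_ratio:
        return "rollback"
    return "promote"

# Example: error rates observed over the canary window (made-up figures).
print(canary_decision(baseline_error_rate=0.004, canary_error_rate=0.005))  # promote
print(canary_decision(baseline_error_rate=0.004, canary_error_rate=0.031))  # rollback
```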

How to do better

Even at this top maturity level, there are rapidly actionable improvements:

  1. Expand Automated Testing & AI/ML Analysis

  2. Implement Feature Flag Management

    • Decouple feature releases from deployments entirely:
      • e.g., changing user experience or enabling new functionality with toggles, tested gradually.
    • Tools like LaunchDarkly, or vendor-based solutions such as AWS AppConfig feature flags, Azure Feature Management, GCP feature flags, or OCI-based toggles, can help.
  3. Advance Security & Testing

  4. Explore Multi-Cluster or Multi-Region Failover

    • If one region or cluster is updating, route traffic to another fully operational cluster for absolute minimal disruption:
      • This further cements zero downtime across a national or global footprint.
  5. Collaborate with Other Public Sector Bodies

    • Share your near-instant, zero-downtime deployment patterns with local councils or other departments.

By embedding advanced anomaly detection, feature flag strategies, multi-region failover, and deepening security checks, you maintain a cutting-edge continuous deployment ecosystem—aligning with top-tier operational excellence in the UK public sector.

Keep doing what you’re doing, and consider documenting your advanced release strategies in internal or external blog posts. You can also submit pull requests to this guidance or other public sector best-practice repositories, helping others progress toward zero-downtime, high-confidence release methods.


How do you manage deployment and QA? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond entirely manual QA and deployments:

  1. Introduce a Simple CI Pipeline

  2. Document a Standard Release Checklist

    • Ensure each deployment follows a consistent procedure, covering essential steps like code review, environment checks, and sign-off by the project lead.
  3. Schedule a Pilot for Automated QA

    • If you typically rely on manual testers, pick a small piece of your test suite to automate:
      • e.g., smoke tests or a top-priority user journey.
    • This pilot can demonstrate the value of automation to stakeholders.
  4. Set Clear Goals for Reducing Manual Steps

    • Aim to reduce “time to deploy” or “time spent on QA” by a certain percentage over the next quarter, aligning with agile or DevOps improvement cycles recommended by GOV.UK Service Manual practices.
  5. Review Security Compliance

By establishing minimal CI automation, clarifying release steps, and piloting automated QA, you build confidence in incremental improvements, setting the foundation for more robust pipelines.

How to do better

Below are rapidly actionable methods to evolve from partial automation:

  1. Expand Automated Tests to Integration or End-to-End (E2E)

  2. Adopt a More Frequent Release Cadence

    • Commit to at least monthly or bi-weekly releases, allowing you to discover issues earlier and respond to user needs faster.
  3. Introduce Automated Rollback or Versioning

  4. Refine Manual Approvals

    • If manual gates remain, streamline them with a single sign-off or Slack-based approvals rather than long email chains:
      • This ensures partial automation doesn’t stall at a manual step for days.
  5. Consult NIST SP 800-53

    • Evaluate recommended controls for software release (CM-3, SA-10) and integrate them into your pipeline for better compliance documentation.

By broadening test coverage, increasing release frequency, and automating rollbacks, you lay the groundwork for more frequent, confident deployments that align with modern DevOps practices.

How to do better

Below are rapidly actionable ways to enhance integrated deployment and QA:

  1. Add Security and Performance Testing

  2. Implement Parallel Testing or Test Suites

    • If test execution time is long, parallelise them:
      • e.g., AWS CodeBuild parallel builds, Azure Pipelines multi-job phases, GCP Cloud Build multi-step concurrency, or OCI DevOps parallel test runs.
  3. Introduce Slack/Teams Notifications

    • Notify dev and ops channels automatically about pipeline status, test results, and potential regressions:
      • Encourages quick fixes and fosters a more collaborative environment.
  4. Adopt Feature Flag Approaches

    • Deploy new code continuously but hide features behind flags:
      • This ensures “not fully tested or accepted” features remain off for end users until QA sign-off.
  5. Reference GOV.UK and NCSC

By strengthening security/performance checks, parallelising tests, using real-time notifications, and employing feature flags, you further streamline your integrated QA pipeline while maintaining robust checks and balances.

How to do better

Below are rapidly actionable ways to refine your existing CI/CD with automated testing:

  1. Shift Left Security

    • Embed security tests (SAST, DAST, license compliance) earlier in the pipeline:
      • e.g., scanning pull requests or pre-merge checks for known vulnerabilities.
  2. Adopt Canary/Blue-Green Deployments

  3. Implement Automated Rollback

    • If user impact or error rates spike post-deployment, revert automatically to the previous version without manual steps.
  4. Use Feature Flags for Safer Experiments

    • Deploy code continuously but toggle features on gradually.
    • This approach de-risks large releases and speeds up delivery.
  5. Encourage Cross-Government Collaboration

By deepening security integration, adopting advanced deployment tactics, and refining rollbacks or feature flags, you enhance an already stable CI/CD pipeline. This leads to even faster, safer releases aligned with top-tier DevSecOps practices recommended by NCSC and NIST.

How to do better

Even at this apex, there are rapidly actionable improvements:

  1. Adopt Policy-as-Code for Environment Provisioning

  2. Automated Data Masking or Synthetic Data

    • If ephemeral environments need real data, ensure compliance with UK data protection regulations:
      • Use synthetic test data or anonymise production copies, in line with NCSC data security best practices.
  3. Inject Chaos or Performance Tests

    • Incorporate chaos engineering (e.g., random container/network failures) and load tests in ephemeral environments:
      • This ensures high resilience under real-world stress.
  4. Optimise Environment Lifecycle

    • Monitor resource usage to avoid ephemeral environments lingering longer than needed:
      • e.g., automatically tear down environments if no activity is detected after 48 hours.
  5. Collaborate with UK Gov or Local Councils

By embedding policy-as-code, securing data in ephemeral environments, introducing chaos/performance tests, and aggressively managing environment lifecycles, you ensure your pipeline remains at the cutting edge—fully aligned with advanced DevOps capabilities recommended by NCSC, NIST, and other relevant bodies.
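
For the environment lifecycle step (4) above, a scheduled clean-up job can be very small. The sketch below tears down environments idle for more than 48 hours; the environment inventory and the `tear_down` function are placeholder assumptions standing in for your platform API or an IaC destroy job.

```python
from datetime import datetime, timedelta, timezone

IDLE_LIMIT = timedelta(hours=48)

# Illustrative environment inventory; in practice this would come from tags or
# your platform's API rather than a hard-coded list.
environments = [
    {"name": "review-app-1421", "last_activity": datetime.now(timezone.utc) - timedelta(hours=6)},
    {"name": "review-app-1398", "last_activity": datetime.now(timezone.utc) - timedelta(hours=70)},
]

def tear_down(name: str) -> None:
    """Placeholder for the real teardown step (e.g. a 'terraform destroy' job)."""
    print(f"tearing down {name}")

now = datetime.now(timezone.utc)
for env in environments:
    if now - env["last_activity"] > IDLE_LIMIT:
        tear_down(env["name"])
    else:
        print(f"keeping {env['name']} (active recently)")
```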

Keep doing what you’re doing, and consider writing up your experiences or creating blog posts about your ephemeral environment successes. You can also submit pull requests to this guidance or other public sector best-practice repositories, helping others in the UK public sector evolve their QA pipelines and deployment processes.


How do you develop and implement your cloud strategy? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to start formalising a cloud-oriented approach:

  1. Identify a Cloud Advocate

  2. Host Internal Workshops

  3. Create a Cloud Starter Doc

  4. Pilot a Small Cross-Functional Team

    • If you have an upcoming project with cloud components, assemble a temporary team from different departments (development, security, finance) to coordinate on cloud decisions.
  5. Define Basic Cloud Roles

    • Even without a dedicated cloud team, define who handles security reviews, cost optimisation checks, or architectural guidance.

By designating a cloud advocate, introducing basic cloud knowledge sessions, and forming a small cross-functional group for a pilot project, you lay the groundwork for a more coordinated approach to cloud strategy and operations.

How to do better

Below are rapidly actionable ideas to strengthen informal cloud expertise:

  1. Formalise a Community of Practice

  2. Create a Shared Knowledge Base

  3. Encourage One-Stop Repos

    • For repeated patterns (e.g., Terraform templates for secure VMs or container deployments), maintain a Git repo that all teams can reference.
  4. Promote Shared Governance

    • Align on a minimal set of “must do” controls (e.g., mandatory encryption, logging).
    • Consider referencing NIST SP 800-53 controls for cloud resource security responsibilities.
  5. Pilot a Small Formal Working Group

    • If informal collaboration works well, create a small “Cloud Working Group” recognised by leadership.
    • They can propose consistent patterns or cost-saving tips for cross-team usage.

By forming a community of practice, establishing a knowledge base, and beginning minimal governance alignment, you transition from ad hoc experts toward a more structured, widely beneficial cloud strategy.

How to do better

Below are rapidly actionable strategies to improve a formal Cloud COE:

  1. Offer Self-Service Catalogs or Templates

  2. Extend COE Services

    • e.g., specialised security reviews, compliance checks referencing NCSC 14 Cloud Security Principles, or cost optimisation workshops that unify departmental approaches.
  3. Set up a Community of Practice

    • Have the COE coordinate monthly open sessions for all cloud practitioners to discuss new vendor features, success stories, or security enhancements.
  4. Embed COE Members in Key Projects

    • Provide “COE ambassadors” who temporarily join project teams to share knowledge and shape architecture from the start.
  5. Consult NIST and GOV.UK for Strategy Guidance

By delivering self-service solutions, deeper security reviews, and an active cloud community, the COE matures into a vital driver for consistent, secure, and cost-effective cloud adoption across the organisation.

How to do better

Below are rapidly actionable steps to further integrate the COE’s standards into everyday operations:

  1. Adopt “Cloud-First” or “Cloud-Smart” Policies

    • Mandate that new solutions default to cloud-based approaches unless there’s a compliance or cost reason not to.
    • Reference GOV.UK's Cloud First policy for alignment.
  2. Introduce Automated Compliance Checks

  3. Enable On-Demand Cloud Labs/Training

    • Provide hands-on workshops or sandbox accounts where staff can experiment with new cloud services in a safe environment.
    • Encourages further skill growth and cross-pollination.
  4. Measure Outcomes and Iterate

    • Track success metrics: e.g., time to provision environments, frequency of security incidents, cost savings realised by standard patterns.
    • Present these metrics in monthly or quarterly leadership updates, aligning with NCSC operational resilience guidance.
  5. Improve Cross-Functional Team Composition

    • Incorporate security engineers and cloud architects directly into product squads for new digital services, reducing handoffs.

By mandating automated compliance checks, fostering a “cloud-first” approach, expanding skill-building labs, and embedding security/architecture roles into each delivery team, you further entrench consistent, effective cloud usage across the public sector organisation.

How to do better

Below are rapidly actionable ways to refine an already advanced operating model:

  1. Introduce FinOps Practices

  2. Enable Self-Service Data & AI

    • If each product team can not only provision compute but also harness advanced analytics or ML on demand:
      • Speeds up data-driven policy or service improvements.
  3. Adopt Policy-as-Code

    • Extend your automated governance:
      • e.g., using Open Policy Agent (OPA), AWS Service Control Policies, Azure Policy, GCP Organisation Policy, or OCI Security Zones to ensure consistent rules across the entire estate.
  4. Engage in Cross-Government Collaboration

    • Share your advanced COE successes with other departments, local councils, or healthcare orgs:
      • Possibly present at GOV.UK community meetups, or work on open-source infrastructure modules that other public bodies can reuse.
  5. Stay Current with Tech and Security Trends

    • Periodically assess new NCSC or NIST advisories, cloud vendor releases, or best-practice updates to keep your operating model fresh, secure, and cost-effective.

By incorporating robust FinOps, self-service AI, policy-as-code, cross-government collaboration, and continuous trend analysis, you ensure your advanced COE model remains at the forefront of effective and secure cloud adoption in the UK public sector.

Keep doing what you’re doing, and consider writing blog posts or internal knowledge-sharing articles about your advanced Cloud COE. Submit pull requests to this guidance or other public sector best-practice repositories to help others learn from your successes in structuring cross-functional cloud teams and ensuring an effective operating model.


Who manages your cloud operations? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond developer-exclusive cloud management:

  1. Form a DevOps Guild or Community of Practice

  2. Introduce Minimal Automated Monitoring & Alerts

  3. Implement Basic Infrastructure as Code

  4. Add a Cloud Security Checklist

  5. Request Budget or Headcount

    • If workloads grow, advocate for dedicated cloud engineering staff. Present cost/risk benefits to leadership, referencing GOV.UK cloud-first policy and potential agility gains.

By fostering a DevOps guild, adding automated monitoring, adopting IaC, and pushing for minimal security guidelines, you gradually evolve from purely developer-led ops to a more stable, repeatable cloud operation that can scale.

How to do better

Below are rapidly actionable ways to balance outsourced support with internal ownership:

  1. Retain Strategic Oversight

    • Even if operations remain outsourced, designate an internal "Cloud Lead" or small working group responsible for governance and security.
  2. Set Clear SLA and KPI Requirements

    • Make sure the vendor’s contract outlines response times, compliance with GOV.UK Cloud Security Principles or NCSC best practices, and regular cost-optimisation reviews.
  3. Insist on Transparent Reporting

    • Request routine dashboards or monthly metrics on performance, cost, security events.
    • Ask the vendor to integrate with your chosen monitoring tools if possible.
  4. Plan a Knowledge Transfer Path

    • Negotiate with the vendor to provide training sessions or shadowing opportunities, building internal cloud literacy:
      • e.g., monthly knowledge-sharing on cost optimisation or security patterns.
  5. Retain Final Decision Power on Strategic Moves

    • The vendor can propose solutions, but major platform changes or expansions should get internal review for alignment with departmental objectives.
    • This ensures the outsourced arrangement doesn’t override your broader digital strategy.

By keeping strategic authority, setting stringent SLAs, fostering vendor-provided knowledge transfer, and maintaining transparent reporting, you reduce vendor lock-in and ensure your cloud approach aligns with public sector priorities and compliance expectations.

How to do better

Below are rapidly actionable enhancements:

  1. Co-Create Operational Standards

  2. Embed Vendor Staff into Internal Teams

    • If feasible, have vendor ops staff attend your sprint reviews or planning sessions, improving communication and reducing friction.
  3. Establish Regular Strategic Review

    • Conduct quarterly or monthly reviews to align on:
      • Future cloud services adoption
      • Cost optimisation opportunities
      • Evolving security or compliance needs
  4. Request Real-Time Metrics

    • Ensure the vendor’s operational data (e.g., cost usage, performance dashboards) is accessible to your internal strategic leads:
      • e.g., a shared AWS Cost Explorer or Azure Cost Management view for weekly usage checks.
  5. Plan for Potential In-House Expansion

    • If usage grows or departmental leadership wants more direct control, negotiate partial insourcing of key roles or knowledge transfer from the vendor.

By jointly defining an operations handbook, integrating vendor ops staff in your planning, reviewing strategy regularly, and retaining real-time metrics, you strengthen internal leadership while enjoying the convenience of outsourced operational tasks.

How to do better

Below are rapidly actionable ways to optimise the hybrid approach:

  1. Standardise Tools and Processes

  2. Define Clear Responsibilities

    • For each area (e.g., incident management, security patching, cost reviews), specify whether the vendor or in-house staff leads.
    • Consult NCSC’s supply chain security guidance to ensure robust accountability.
  3. Integrate On-Call Rotations

    • If the vendor provides 24/7 coverage, have an internal secondary on-call or bridging approach:
      • This fosters knowledge exchange and ensures no single point of failure if the vendor struggles.
  4. Align on a Joint Roadmap

    • Create a 6-12 month cloud roadmap, listing major initiatives like infrastructure refreshes, security enhancements (e.g., compliance with NIST SP 800-53 controls), or cost optimisation steps.
  5. Encourage Cross-Training

    • Rotate vendor staff into internal workshops or hackathons, and have your staff occasionally shadow vendor experts to deepen in-house capabilities.

By unifying tools, clarifying roles, rotating on-call duties, aligning on a roadmap, and cross-training, you make the hybrid model more cohesive—maximising agility and ensuring consistent cloud operation standards across internal and outsourced teams.

How to do better

Below are rapidly actionable ways to refine an already dedicated in-house cloud team:

  1. Adopt a DevSecOps Centre of Excellence (CoE)

    • Evolve your cloud team into a central hub for best practices, security frameworks, and ongoing training:
      • Provide guidelines on ephemeral environments, compliance-as-code, or advanced ML operations.
  2. Set Up Autonomous Product Teams

    • Embed cloud team members directly into product squads, letting them self-manage infrastructure and pipelines with minimal central gatekeeping:
      • This fosters agility while the central team maintains overarching governance.
  3. Implement Policy-as-Code and FinOps (see the sketch below)

  4. Champion Innovations

    • Keep experimenting with advanced features (e.g., AWS Graviton, Azure confidential computing, GCP Anthos multi-cloud, or OCI HPC offerings) to continuously optimise performance and cost.
  5. Regularly Review and Update the Roadmap

By embedding security and cost best practices, enabling cross-functional product teams, instituting policy-as-code, and continually updating your roadmap, your dedicated in-house cloud team evolves into a dynamic, cutting-edge force that consistently meets UK public sector operational and compliance demands.
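
To make the policy-as-code and FinOps step less abstract, here is a deliberately tiny compliance check: a sketch assuming Python and boto3, with a hypothetical mandatory cost-centre tag. In practice you would more likely express such rules in AWS Config, Azure Policy, or Open Policy Agent, but the idea is the same: the policy is executable code rather than a document.

```python
# A toy compliance-as-code check, assuming Python and boto3, with a hypothetical
# mandatory "cost-centre" tag. Real estates would usually encode such rules in
# AWS Config, Azure Policy, or Open Policy Agent; the point is that the policy
# is executable and repeatable rather than a document.
import boto3
from botocore.exceptions import ClientError

REQUIRED_TAG = "cost-centre"  # assumed tagging convention

s3 = boto3.client("s3")

for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        tag_set = s3.get_bucket_tagging(Bucket=name)["TagSet"]
        tags = {tag["Key"]: tag["Value"] for tag in tag_set}
    except ClientError:
        tags = {}  # bucket has no tags at all
    if REQUIRED_TAG not in tags:
        print(f"NON-COMPLIANT: bucket '{name}' is missing the '{REQUIRED_TAG}' tag")
```

Run from a pipeline or a scheduled job, a check like this turns a FinOps tagging policy into something that fails fast instead of drifting.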

Keep doing what you’re doing, and consider writing up your experiences or publishing blog posts on your cloud team’s journey. Also, contribute pull requests to this guidance or similar public sector best-practice repositories, helping others evolve their organisational structures for effective cloud operations.


How do you plan for incidents? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond ad-hoc incident response:

  1. Draft a Simple Incident Response (IR) Checklist

  2. Identify Key Roles

    • Even if you can’t create a full incident response team, designate an incident lead and a communications point of contact.
    • Clarify who decides on severe actions (e.g., taking services offline).
  3. Set Up Basic Monitoring and Alerts (see the example below)

  4. Coordinate with Third Parties

    • If you rely on external suppliers or a cloud MSP, note their support lines and escalation processes in your checklist.
  5. Review and Refine After Each Incident

    • Conduct a mini post-mortem for any downtime or breach, adding lessons learned to your ad-hoc plan.

By drafting a minimal IR checklist, assigning key roles, enabling basic alerts, and learning from each incident, you can quickly improve your readiness without a massive resource investment.
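
As one way to start the basic monitoring and alerts step, the sketch below raises a CloudWatch alarm when a service’s 5xx errors spike, assuming Python, boto3, and an existing SNS topic for notifications; all names and thresholds are placeholders. Azure Monitor, GCP Cloud Monitoring, and OCI Monitoring offer equivalent alert rules.

```python
# Illustrative only: a single CloudWatch alarm that notifies an assumed SNS
# topic when a service's 5xx errors spike. The load balancer name, topic ARN,
# and threshold are placeholders; Azure Monitor, GCP Cloud Monitoring, and OCI
# Monitoring offer equivalent alert rules.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="pilot-service-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/pilot-service/abc123"}],  # placeholder
    Statistic="Sum",
    Period=300,                  # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=10,                # alert if more than 10 errors in a window
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-2:111122223333:incident-alerts"],  # placeholder topic
    AlarmDescription="Notify the incident lead when 5xx errors spike (see IR checklist).",
)
```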

How to do better

Below are rapidly actionable ways to strengthen an initial documented IR plan:

  1. Integrate IR Documentation into CI/CD

    • If you maintain an Infrastructure as Code or pipeline approach, embed references to the IR plan or scripts:
      • e.g., one-liners explaining how to isolate or roll back in the event of a security alert.
  2. Automate Some Deployment Checks

  3. Link IR Plan to Monitoring Dashboards

    • Provide direct references in the plan to the dashboards or logs used for incident detection:
      • This helps new team members quickly identify relevant data sources in a crisis.
  4. Consult GOV.UK & NCSC Patterns

  5. Schedule a 3-Month Review Post-Launch

    • Ensure the IR plan is updated after initial real-world usage.
    • Adjust for any changes in architecture or newly discovered risks.

By embedding IR considerations into your pipeline, linking them to monitoring resources, referencing official guidance, and doing a post-launch review, you maintain an up-to-date plan that effectively handles incidents as the service evolves.

How to do better

Below are rapidly actionable ways to elevate a regularly updated IR plan:

  1. Link Plan Updates to Service/Org Changes

    • If new microservices launch or staff roles shift, require an immediate plan review:
      • e.g., add or remove relevant escalation points, update monitoring references.
  2. Automate IR Plan Distribution

    • Store the IR plan in version control (e.g., a Git repository on GitHub), so everyone can see changes easily:
      • e.g., label each revision with a date or release tag.
    • This fosters transparency and avoids outdated copies lurking in email threads.
  3. Encourage DR Drills

  4. Include Ransomware or DDoS Scenarios

  5. Regular Stakeholder Briefings

    • Present IR readiness status updates to leadership or departmental leads, aligning them with the IR plan improvements.

By linking plan updates to actual org changes, distributing it via version control, frequently testing via drills, and preparing for advanced threats, you maintain an agile, effective IR plan that evolves with your environment.

How to do better

Below are rapidly actionable ways to further optimise integrated, tested IR plans:

  1. Adopt Multi-Cloud or Multi-Region Failover Testing

  2. Expand Real-Time Monitoring Integration

  3. Formalise Post-Incident Reviews

  4. Include Communication and PR

  5. Use NIST SP 800-61 or NCSC Models

    • Evaluate if your IR plan’s phases (preparation, detection, analysis, containment, eradication, recovery, post-incident) align with recognised frameworks.

By simulating cross-region failovers, integrating real-time alert triggers with continuity plans, conducting thorough post-incident reviews, and weaving communications into the IR plan, you maintain a robust, seamlessly tested approach that can respond to diverse incident scenarios.

How to do better

Even at this advanced stage, below are rapidly actionable refinements:

  1. Embed Chaos Drills

    • Randomly inject failures or security anomalies in production-like environments to ensure IR readiness (a minimal sketch follows this list).
  2. Adopt AI/ML-Driven Threat Detection

  3. Coordinate Regional or Multi-department Exercises

  4. Link IR Performance to Government Accountability

    • Provide leadership with metrics or dashboards that show how quickly critical services can be restored.
    • This fosters ongoing support for practising and funding IR improvements.
  5. Benchmark with International Standards

    • Assess if your IR process meets or exceeds frameworks like NIST SP 800-61, ISO 27035, or related global best practices.
    • Update or fine-tune accordingly.

By regularly practising chaos drills, leveraging AI-driven threat detection, collaborating with other agencies, and aligning with recognised international standards, your IR capabilities become even more robust. This ensures you stay prepared for evolving threats while maintaining compliance and demonstrating exceptional public sector resilience.
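
For the chaos-drill step, a deliberately small experiment is enough to start. The sketch below assumes Python, boto3, and a hypothetical opt-in chaos-ready=true tag: it stops one random tagged instance in a production-like environment so the team can practise detection and recovery. Managed options such as AWS Fault Injection Service or Azure Chaos Studio can replace this once drills become routine.

```python
# A deliberately small chaos-drill sketch, assuming Python, boto3, and a
# hypothetical opt-in "chaos-ready=true" tag on instances in a production-like
# environment. Managed services such as AWS Fault Injection Service or Azure
# Chaos Studio can replace this once drills become routine.
import random

import boto3

ec2 = boto3.client("ec2")

response = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-ready", "Values": ["true"]},        # opt-in tag (assumed convention)
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

instances = [
    instance["InstanceId"]
    for reservation in response["Reservations"]
    for instance in reservation["Instances"]
]

if instances:
    victim = random.choice(instances)
    print(f"Chaos drill: stopping {victim}")
    ec2.stop_instances(InstanceIds=[victim])   # the team should detect and recover
else:
    print("No opt-in instances found; nothing to do.")
```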

Keep doing what you’re doing, and consider writing up your incident response practice experiences (e.g., tabletop drills, real-world successes) in a blog post or internal case studies. Submit pull requests to this guidance or public sector best-practice repositories so others can learn from your advanced approaches to incident preparedness and response.

People

How does your organisation work with cloud providers? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move from minimal interaction to stronger collaboration with cloud providers:

  1. Set Up Basic Account Management Contacts

  2. Use Vendor Documentation & Quickstart Guides

    • Encourage staff to leverage official tutorials and quickstarts for key services (compute, storage, networking).
    • Reference NIST Cloud Computing resources for broad conceptual best practices.
  3. Attend Vendor Webinars/Events

  4. Implement Minimal Security Best Practices

  5. Document Next Steps

    • E.g., “In 3 months, explore a higher support tier or schedule a call with a provider solutions architect to discuss cost or architecture reviews.”

By establishing basic contacts, using vendor quickstarts, tapping into free events, and implementing minimal security measures, you start reaping more value from your cloud provider relationship and set the stage for deeper engagement.

How to do better

Below are rapidly actionable ways to evolve beyond basic support:

  1. Establish Regular Check-Ins with Account Managers

  2. Request Architecture/Cost Reviews

    • Providers typically offer free or low-cost reviews to identify cost-saving or performance improvements:
      • e.g., AWS Well-Architected Review, Azure Well-Architected Review, Google Cloud Architecture Framework review, or OCI Architecture Center guidance.
  3. Attend or Organise Vendor-Led Training

  4. Leverage Vendor Communities & Forums

  5. Institute a “Support Triage” Process

    • Define guidelines on which issues can be solved internally vs. escalated to the provider to expedite resolution times.
    • Helps staff know when to open tickets and what info to include.

By scheduling regular check-ins with account managers, requesting architecture and cost reviews, organising training sessions, and clarifying a support triage process, you step up from reactive usage of basic support to a more proactive and beneficial relationship.

How to do better

Below are rapidly actionable ways to leverage regular provider interaction more effectively:

  1. Pursue Dedicated Technical Engagement

  2. Targeted Workshops for Specific Projects

  3. Co-Develop a Cloud Roadmap

    • With the provider’s account manager, outline next-year priorities: e.g., expansions to new regions, adopting serverless, or cost optimisation drives.
    • Ensure these are documented in a shared action plan.
  4. Engage in Beta/Preview Programs

    • Providers often invite customers to test new features, offering direct input.
    • This can yield early insights into tools beneficial for your departmental use cases.
  5. Share Feedback on Public Sector Needs

    • Raise local government, NHS, or departmental compliance concerns so the provider can adapt or recommend solutions (e.g., private endpoints, advanced encryption key management).

By scheduling advanced support tiers or specialised workshops, co-developing a cloud roadmap, participating in early feature programs, and continuously feeding back public sector requirements, you strengthen the partnership for mutual benefit.

How to do better

Below are rapidly actionable ways to deepen this proactive, tailored relationship:

  1. Establish Joint Success Criteria

    • e.g., “Reduce average monthly cloud cost by 20%” or “Achieve 99.95% uptime over the next quarter with no major unplanned outages.”
    • Collaborate with the provider’s solution architects to measure progress monthly.
  2. Conduct Regular Technical Deep-Dives

    • If using advanced analytics or HPC, schedule monthly architecture feedback with vendor specialists who can propose further optimisation or new service usage.
    • Incorporate guidance such as the NIST Cloud Computing Security Reference Architecture (NIST SP 500-299) or domain-specific standards where relevant.
  3. Engage in Co-Innovation Programs

  4. Formalise an Enhancement Request Process

    • For feature gaps or special compliance needs, let your account team log these requests, referencing NCSC or GOV.UK requirements.
    • Potentially expedite solutions that meet public sector demand.
  5. Public Sector Showcases

    • Offer to speak at vendor events or in case studies, highlighting your success:
      • This often results in further tailored support or early access to relevant solutions.

By defining success metrics, scheduling technical deep-dives, pursuing co-innovation, and ensuring an open channel for feature requests, you make the most of your proactive provider engagement—driving continuous improvement in alignment with public sector priorities.

How to do better

Even at this advanced level, below are rapidly actionable ways to refine a strategic partnership:

  1. Co-Develop Advanced Pilots

    • Test cutting-edge solutions, e.g., advanced AI/ML for predictive analytics, HPC for large-scale modelling:
    • This pushes your public sector services into future-forward innovations.
  2. Integrate Multi-Cloud or Hybrid Strategies

  3. Spearhead Cross-Government Collaborations

    • Collaborate with local councils, NHS, or other agencies—invite them to share your advanced partnership benefits, referencing GOV.UK’s cross-government digital approach.
    • Potentially form shared procurement or compliance frameworks with the provider’s help.
  4. Ensure Regular, Comprehensive Security Drills

    • Pair with your provider for joint incident simulations, verifying consistent coverage of best practices:
  5. Establish a Lessons Learned Repository

    • Each joint initiative or advanced workshop should produce shareable documentation or “playbooks,” continuously updating your knowledge base for broader departmental usage.

By pushing into co-developed pilots, multi-cloud or hybrid expansions, cross-government collaborations, advanced security drills, and structured knowledge sharing, you maintain a forward-looking, fully integrated partnership with your cloud provider—ensuring ongoing alignment with strategic public sector aspirations.

Keep doing what you’re doing, and consider sharing your experiences (e.g., co-pilots, advanced solutions) in blog posts or on official channels. Submit pull requests to this guidance or related best-practice repositories to help others in the UK public sector benefit from your advanced collaborations with cloud providers.


How does your organisation support cloud training and certification? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to introduce at least a minimal structure for cloud-related training:

  1. Create a Basic Cloud Skills Inventory

    • Ask staff to self-report familiarity with AWS, Azure, GCP, OCI, or relevant frameworks (like DevOps, security, cost management).
    • This inventory helps identify who might need basic or advanced training.
  2. Encourage Free Vendor Resources

  3. Sponsor One-Off Training Sessions

    • If resources are extremely limited, schedule a short internal knowledge-sharing day:
      • For instance, have a staff member who learned AWS best practices do a 1-hour teach-in for colleagues.
  4. Reference GOV.UK and NCSC Guidelines

  5. Plan for Future Budget Requests

    • If adoption grows, prepare a case for funding basic training or at least paying for exam vouchers, showing potential cost or security benefits.

By initiating a simple skills inventory, directing staff to free resources, hosting internal sessions, and referencing official guidance, you plant the seeds for more structured, formalised cloud training down the line.

How to do better

Below are rapidly actionable steps to unify manager-led training into a more consistent approach:

  1. Set Organisation-Wide Cloud Skill Standards

  2. Track Training Efforts Centrally

    • Even if managers sponsor training, request monthly or quarterly updates from each manager:
      • Summaries of who took which courses, certifications earned, or next steps.
  3. Provide a Shared Training Budget or Resource Pool

    • Instead of leaving it entirely to managers, allocate a central fund for cloud courses or exam vouchers.
    • Teams can draw from it with minimal bureaucracy, ensuring equity.
  4. Host Cross-Team Training Days

    • Let managers co-sponsor internal “training day sprints,” where staff from different teams pair up for labs or workshops:
      • Possibly invite vendor solution architects for a half-day session on cost optimisation or serverless.
  5. Reference GOV.UK & NIST on Training Governance

By defining organisation-wide skill baselines, tracking training across teams, offering a shared budget, and running cross-team training events, you build a more equitable and cohesive approach—improving consistency in cloud competence.

How to do better

Below are rapidly actionable improvements:

  1. Customise Training by Role Path

  2. Incorporate Regular Skills Audits

    • Each quarter or half-year, staff update training statuses and new certifications.
    • Identify areas for further focus, e.g., advanced security or HPC skills.
  3. Implement Gamified Recognition

    • e.g., awarding digital badges or points for completing specific labs or passing certifications:
      • Ties in with internal comms celebrating achievements, boosting morale.
  4. Align Training with Security & Cost Goals

    • For instance, if cost optimisation is a priority, encourage staff to take relevant vendor cost management courses.
    • If advanced security is crucial, highlight vendor security specialty paths.
  5. Coordinate with GOV.UK Skills Framework

By mapping certifications to roles, regularly auditing skills, gamifying recognition, and aligning training with strategic objectives, you embed continuous cloud skill growth into your corporate culture—ensuring sustained readiness and compliance.

How to do better

Below are rapidly actionable ways to refine role-based training and self-assessment:

  1. Integrate Self-Assessments into Performance Reviews

    • Encourage staff to reference role-based metrics during appraisals:
      • e.g., “Achieved AWS Solutions Architect – Associate, aiming for Azure Security Engineer next.”
    • Ties personal development to formal performance frameworks.
  2. Provide “Skill Depth” Options

    • Some staff may prefer broad multi-cloud knowledge, while others want deep specialisation in a single vendor:
      • e.g., a “multi-cloud track” vs. “AWS advanced track” approach.
  3. Enable Peer Mentoring

    • Pair junior staff who want a certain certification with an experienced internal mentor or sponsor.
    • Encourages knowledge sharing, reinforcing your training culture.
  4. Automate Role-Based Onboarding

    • New hires get automatically assigned recommended learning modules or labs:
      • e.g., AWS Qwiklabs, Azure Hands-on Labs, GCP Quick Labs, or OCI hands-on labs that match their role.
  5. Check Alignment with NCSC & NIST

By linking self-assessments to performance, diversifying skill tracks, enabling peer mentoring, and automating onboarding processes, you create a fully integrated environment where each role’s learning path is clear, self-directed, and aligned to organisational needs.

How to do better

Below are rapidly actionable suggestions to perfect an incentivised and assessed training program:

  1. Tie Certifications to Mastery Projects

    • In addition to passing exams, employees might complete real, in-house projects demonstrating they can apply those skills:
      • e.g., building a pilot serverless application or implementing end-to-end security logging using NCSC best practices.
  2. Organise Internal “Training Sprints” or Hackathons

    • e.g., a week-long challenge where staff pursue advanced certification labs together, culminating in recognition or prizes.
  3. Reward Mentors

    • If staff help others achieve certifications, consider awarding them additional recognition or digital badges:
      • Encourages a culture of mentorship and upskilling.
  4. Set Up Cross-Government Partnerships

  5. Monitor ROI & Impact

    • Track how training improvements affect cost optimisation, user satisfaction, or speed of service releases:
      • Present these metrics to leadership as evidence that the incentivised approach works.

By coupling incentives with real project mastery, hosting hackathons, rewarding mentors, forming cross-government partnerships, and measuring returns, you refine a world-class training program that fosters continual cloud skill advancement and directly benefits your public sector missions.

Keep doing what you’re doing, and consider writing up your training and certification successes, possibly in blog posts or internal case studies. Submit pull requests to this guidance or other public sector best-practice repositories so fellow UK organisations can follow your lead in creating robust cloud skill-building programs.


How important is cloud experience when hiring leaders, suppliers, and contractors? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to begin emphasising cloud skills in hiring:

  1. Add Cloud Awareness to Job Descriptions

  2. Encourage Current Staff to Share Cloud Knowledge

    • If you have even one or two employees with cloud expertise, host internal lunchtime talks or short workshop sessions.
    • Build an internal pool of cloud knowledge so that future role descriptions can specify these basic competencies.
  3. Prepare for Cloud-Focused Future

    • If you have a known modernisation program, consider building a pipeline of cloud-savvy talent:
      • Start by adding basic cloud competence to “desired” (not required) criteria in some new roles.
  4. Reference NIST & NCSC Workforce Guidance

  5. Short Internal Hackathons

    • Let staff explore a simple cloud project, e.g., deploying a test app or serverless function.
    • This stirs interest in cloud skills, naturally leading to job postings that mention them.

By introducing even minor cloud awareness requirements, providing internal knowledge sharing, referencing official frameworks, and organising small hackathons, you start shifting your hiring practices to future-proof your organisation’s cloud readiness.

How to do better

Below are rapidly actionable improvements:

  1. Include Cloud Skills for Leadership

    • For senior or executive positions that influence technology strategy, add “awareness of cloud architectures and security” to the job description.
    • This aligns with modern public sector digital leadership standards.
  2. Establish Clear Criteria

    • Define which roles “must have,” “should have,” or “could have” cloud experience:
      • e.g., for a principal engineer or head of infrastructure, cloud experience is “must have.” For a data analyst, it might be “should have” or optional.
  3. Collaborate with HR or Recruitment

    • Ensure recruiters understand terms like “AWS Certified Solutions Architect,” “Azure DevOps Engineer Expert,” or “GCP Professional Cloud Architect.”
    • They can better filter or source candidates if they know relevant cloud certifications or skill sets.
  4. Assess Supplier Cloud Proficiency

    • When contracting or hiring contingent labour, require them to demonstrate cloud capabilities (like having staff certified to a certain level).
    • Reference NCSC supply chain security guidelines to set minimal standards for external vendors.
  5. Offer Pathways for Internal Staff

    • Provide existing employees an option to upskill into these “cloud-required” roles, reinforcing a culture of growth.
    • Supports staff retention and aligns with NIST workforce development frameworks.

By adding leadership-level cloud awareness, clarifying role-based cloud criteria, ensuring recruiters or contingent labour providers understand these requirements, and offering internal upskilling, you create a more consistent approach that meets both immediate and long-term organisational needs.

How to do better

Below are rapidly actionable ways to advance beyond simple mandatory requirements:

  1. Regularly Update Role Profiles

    • As AWS, Azure, GCP, and OCI evolve, review job descriptions annually:
      • e.g., adding modern DevSecOps patterns, container orchestration, serverless, or big data capabilities.
  2. Introduce Cloud Competency Levels

    • e.g., “Level 1 – Cloud Foundations,” “Level 2 – Advanced Cloud Practitioner,” “Level 3 – Cloud Architect.”
    • This ensures clarity about skill depth for each role, linking to vendor certifications.
  3. Ensure Continuity & Succession

    • Plan for staff turnover by establishing robust knowledge transfer processes, referencing NCSC workforce security advice.
    • Minimises risk if a key cloud-skilled individual leaves.
  4. Promote Multi-Cloud Awareness

    • If your organisation uses more than one provider, encourage roles to include cross-provider or “cloud-agnostic” concepts:
      • e.g., Terraform, Kubernetes, or zero-trust security patterns relevant across AWS, Azure, GCP, or OCI.
  5. Involve Senior Leadership

    • Demonstrate how mandatory cloud experience in roles directly supports mission-critical public services, cost optimisation, or security compliance, building top-level buy-in.

By routinely revising DDaT role definitions to keep pace with evolving cloud tech, defining competency levels, planning continuity, encouraging multi-cloud knowledge, and securing leadership sponsorship, you firmly embed cloud skill requirements into your organisational DNA.

How to do better

Below are rapidly actionable methods to keep role definitions agile in a cloud-first IT organisation:

  1. Periodically Revalidate Roles

    • Introduce a yearly review cycle where HR, IT leadership, and line managers re-check if roles align with current cloud usage or new compliance mandates (like NIST SP 800-53 revision updates).
  2. Provide Upgrade Path for Existing Staff

    • Offer training or special secondments so staff who were initially on-prem can adapt to cloud:
      • e.g., an internal “cloud transformation bootcamp,” referencing AWS, GCP, Azure, or OCI training labs.
  3. Embed Cloud in Performance Management

    • Align staff appraisal or objective-setting with adoption of new cloud skills, cost-saving initiatives, or security improvements.
  4. Create a Cloud Champion Network

    • For each department, designate “cloud champions” who ensure local roles remain updated and can escalate new skill demands if usage evolves.
  5. Follow GOV.UK or DDaT ‘Career Paths’

By systematically revalidating roles, offering staff training for on-prem to cloud transitions, linking performance metrics to cloud initiatives, and referencing official frameworks, you future-proof your team structures in a dynamic cloud landscape.

How to do better

Below are rapidly actionable methods to keep your fully cloud-oriented workforce thriving:

  1. Nurture Advanced Specialisations

    • Some roles may deepen knowledge in containers (Kubernetes), serverless, HPC, or big data analytics:
      • e.g., adopting advanced AWS, Azure, GCP, or OCI certifications for architecture, security, or data engineering.
  2. Embed Continuous Learning

    • Offer staff consistent updates, hack days, or vendor-led labs to adapt to new features quickly:
      • e.g., monthly community-of-practice sessions to discuss the latest cloud service releases or security advisories.
  3. Encourage Cross-Organisational Collaboration

    • Collaborate with other UK public sector bodies, sharing roles or secondment opportunities for advanced cloud experiences.
    • This fosters a broader, more resilient talent pool across government.
  4. Pursue International or R&D Partnerships

    • If your department engages in cutting-edge projects or HPC research, consider co-innovation programs with cloud providers or academic institutions:
      • This might spin up entirely new specialised roles (AI/ML ops, HPC performance engineer, etc.).
  5. Benchmark Against Leading Practices

    • Leverage NCSC or NIST case studies to compare your staff skill frameworks with top-tier digital organisations.
    • Conduct periodic audits on the relevance of your role definitions and skill requirements.

By encouraging advanced specialisations, sustaining continuous learning, collaborating with other public sector entities, pursuing co-innovation partnerships, and benchmarking against top-tier best practices, you maintain an extremely robust, cloud-first workforce strategy that evolves with emerging technologies and public sector demands.

Keep doing what you’re doing, and consider writing blog posts or internal knowledge base articles about your journey toward fully integrating cloud skills into hiring. Submit pull requests to this guidance or other public sector best-practice repositories, sharing lessons learned to help others adopt a comprehensive, future-ready cloud workforce strategy.


How do you choose suppliers and partners for cloud work? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond marketing-based selection:

  1. Define Basic Technical and Security Criteria

    • Before awarding a contract, ensure the supplier meets minimal security (e.g., ISO 27001) or compliance standards from NCSC’s cloud security guidelines.
    • Check if they have relevant cloud certifications (e.g., AWS or Azure partner tiers).
  2. Use Simple Supplier Questionnaires

    • Ask about their experience with public sector, references for past cloud projects, and how they manage cost optimisation or data protection.
    • This ensures more depth than marketing claims alone.
  3. Check Real-Life Feedback

    • Seek out reviews from other departments or local councils that used the same supplier:
      • e.g., informal networks, mailing lists, or digital communities of practice in the public sector.
  4. Ensure They Can Align with GOV.UK Cloud First

    • Ask if they understand government data classification, cost reporting, or typical NCSC compliance frameworks.
  5. Plan an Incremental Engagement

    • Start with a small pilot or short-term contract to validate their capabilities. If they prove reliable, expand the relationship.

By introducing a basic technical/security questionnaire, referencing real-life feedback, and piloting short engagements, you reduce reliance on marketing materials and ensure suppliers at least meet foundational public sector cloud requirements.

How to do better

Below are rapidly actionable methods to elevate from minimal compliance checks:

  1. Evaluate Supplier Cloud Certifications

  2. Request Past Performance or Case Studies

    • Ask for references from other UK public sector clients or comparable regulated industries.
    • Prefer suppliers who can evidence cost-saving or security successes from past engagements.
  3. Incorporate Cloud-Specific Criteria in RFPs

  4. Conduct Briefing Sessions

    • Invite top candidates to present their capabilities or do a short proof-of-concept:
      • This highlights who truly understands your departmental needs.
  5. Ensure Contract Provisions for Exit and Risk

By integrating cloud-specific partner certifications, verifying past performance, and adding mandatory contract clauses around risk and exit, you ensure your due diligence extends beyond basic compliance to real technical and operational aptitude.

How to do better

Below are rapidly actionable suggestions to refine moderate screening:

  1. Request a Security & Architecture ‘Show Me’ Session

    • Potential suppliers should demonstrate a typical architecture for a user story or scenario relevant to your environment.
  2. Evaluate Supplier DevSecOps Maturity

    • Ask about their CI/CD pipeline, automated testing, or DevSecOps approach.
  3. Include Cost Management Criteria

  4. Check Multi-Region or DR Capabilities

  5. Formalise Weighted Scoring

    • Allocate points for each requirement (experience, security alignment, cost management, references).
    • This ensures an objective method to compare competing suppliers.

By pushing for real demonstrations of security/architecture, assessing DevSecOps maturity, reviewing cost management solutions, checking DR abilities, and using a weighted scoring system, you gain deeper insight into a supplier’s true capability and alignment with your goals.

How to do better

Below are rapidly actionable ways to expand a comprehensive evaluation:

  1. Adopt a Custom Supplier Questionnaire

  2. Verify Internal Code of Conduct Alignment

  3. Assess Cloud Roadmap Consistency

    • Evaluate how the supplier’s technology roadmap or R&D investments align with your department’s future strategy:
  4. Engage in Pilot Co-Creation

  5. Weight Sustainability in Procurement

By employing a custom questionnaire that includes ethical, environmental, and advanced cloud criteria, verifying code-of-conduct alignment, ensuring compatibility with your technical roadmap, piloting co-creation sprints, and weighting sustainability, you further refine the comprehensive evaluation for a well-rounded supplier selection process.

How to do better

Below are rapidly actionable ways to enhance strategic supplier selection:

  1. Promote Multi-Year Collaboration

    • Consider multi-year roadmaps with staged deliverables and built-in agility:
      • e.g., specifying review points for adopting new cloud services or ramping up HPC/ML capabilities when needed.
  2. Publish Clear Risk Management Requirements

    • Require suppliers to maintain a living risk register, shared with your security team, covering performance, security, and cost risks.
    • Align with NCSC’s risk management approach.
  3. Encourage Apprenticeships and Community Contributions

  4. Conduct Joint Business Reviews

    • Schedule an annual or semi-annual leadership review session, focusing on:
      • Roadmap alignment, upcoming technology expansions, sustainability targets, and success stories to share cross-government.
  5. Integrate ESG and Sustainability

By defining multi-year collaborative roadmaps, embedding a shared risk register, incentivising apprenticeships or broader skill contributions, maintaining periodic leadership reviews, and factoring in sustainability metrics, you cultivate a strategic, mutually beneficial relationship with cloud suppliers. This ensures alignment with public sector values, security standards, and a visionary approach to digital transformation.

Keep doing what you’re doing, and consider writing some blog posts about your advanced supplier selection processes or opening pull requests to this guidance for others. By sharing how you integrate technical, ethical, and sustainability factors, you help other UK public sector organisations adopt strategic, future-focused cloud supplier qualification processes.


How do you help staff with little or no cloud experience move into cloud roles? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable ways to establish a baseline development path for new cloud learners:

  1. Create a Simple Cloud Familiarisation Resource

  2. Encourage Self-Study

    • Offer small incentives (e.g., internal recognition or minor expense coverage) if employees complete a fundamental cloud course.
    • Even a simple certificate of completion fosters motivation.
  3. Promote Internal Shadowing

    • If you have at least one cloud-savvy colleague, arrange informal shadowing or pair sessions.
    • This ensures staff with zero cloud background get exposure to real tasks.
  4. Reference GOV.UK and NCSC

  5. Pilot a Tiny Cloud Project

    • If budget or time is tight, propose a small, non-critical cloud PoC. Staff with no cloud experience can attempt deploying a simple website or serverless function, building basic confidence (a minimal example follows below).

By assembling free training resources, sponsoring small incentives, and facilitating internal shadowing or mini pilots, you kickstart a foundational path for employees to begin acquiring cloud knowledge in a low-cost, organic way.
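
For the "tiny cloud project" step, even a single serverless function is a worthwhile first deployment. The sketch below is a minimal AWS Lambda handler in Python; the function name, trigger, and deployment method are left open, and Azure Functions, GCP Cloud Functions, and OCI Functions are close equivalents.

```python
# A minimal AWS Lambda handler for a first, non-critical serverless pilot,
# illustrative only. Deployment (console, SAM, CDK) and trigger (function URL,
# API Gateway) are left open; Azure Functions, GCP Cloud Functions, and OCI
# Functions have close equivalents.
import json
from datetime import datetime, timezone


def handler(event, context):
    """Return a small JSON payload confirming the pilot function is live."""
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(
            {
                "message": "Hello from the cloud pilot",
                "invoked_at": datetime.now(timezone.utc).isoformat(),
            }
        ),
    }
```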

How to do better

Below are rapidly actionable ways to strengthen basic on-the-job cloud training:

  1. Define Simple Mentorship Guidelines

    • Even if informally, specify a mentor’s role—e.g., conducting weekly check-ins, demonstrating best practices for provisioning, cost management, or security scanning.
  2. Adopt a Buddy System for Cloud Tasks

    • Pair a novice with a more experienced engineer on actual cloud tickets or incidents:
    • Encourages learning through real-world problem-solving.
  3. Introduce a Lightweight Skills Matrix

    • Track essential cloud tasks (e.g., spinning up a VM, setting up logging, basic security config) and check them off as novices learn:
      • e.g., AWS, Azure, GCP, or OCI basics, referencing relevant vendor quickstarts.
  4. Encourage Self-Paced Online Labs

    • Provide access to some structured labs:
      • AWS hands-on labs, Azure Lab Services, GCP codelabs, or OCI labs, guiding novices step-by-step.
  5. Celebrate Progress

    • Recognise or reward staff who complete key tasks or mini-certs (like AWS Cloud Practitioner):
      • This fosters a positive culture around skill growth.

By structuring mentorship roles, ensuring novices participate in real tasks, tracking essential skills, adding lab-based self-study, and giving recognition, you can rapidly accelerate staff readiness and consistency in cloud ops.

How to do better

Below are rapidly actionable ways to enhance structured training/mentorship:

  1. Formalise Cloud Learning Journeys

    • e.g., for a DevOps role, define stepping stones from fundamental vendor certs to advanced specialisations:
      • AWS Solutions Architect -> SysOps -> Security, Azure Administrator -> DevOps Engineer, GCP Associate Engineer -> Professional Architect, etc.
  2. Adopt Official Vendor Training Programs

  3. Establish Time Allocations

    • Guarantee staff a certain number of hours per month for cloud labs, workshops, or self-paced learning:
      • Minimises conflicts with daily duties.
  4. Integrate Real Projects into Training

    • Let trainees apply new skills to an actual low-risk project, e.g., a new serverless prototype or a cost optimisation analysis:
      • Encourages practical retention.
  5. Track & Reward Milestones

    • Summarise achievements in quarterly stats: “Team X gained five new AWS Solutions Architect Associates.”
    • Offer small recognition or career advancement alignment with Civil Service success profiles.

By defining clear cloud learning journeys, leveraging vendor training, scheduling dedicated study time, embedding real projects in the curriculum, and publicly recognising accomplishments, you foster a thriving environment for upskilling staff in cloud technologies.

How to do better

Below are rapidly actionable tips to refine integrated learning and development:

  1. Formal Apprenticeship or Bootcamp

    • Partner with recognised training providers:
      • e.g., AWS re/Start, Azure Academy, GCP JumpStart, or Oracle Next Education for more in-depth coverage.
    • Ensure alignment with NCSC or NIST cybersecurity modules.
  2. Set Clear Learning Roadmaps by Function

    • For Dev, Ops, Security, Data roles—each has curated course combos, from fundamentals to specialised advanced topics:
      • This fosters structured progression.
  3. Involve Senior Leadership Support

    • Encourage exec sponsors to highlight success stories, attend final presentations of training cohorts, or discuss how these new skills align with departmental digital transformation goals.
  4. Combine Internal & External Teaching

    • Use a mix of vendor trainers, in-house subject matter experts, and third-party specialists for well-rounded instruction.
    • This ensures staff see multiple perspectives.
  5. Measure ROI

    • Track cost savings, decreased deployment times, or increased user satisfaction from cloud projects led by newly trained staff:
      • Present these metrics in leadership reviews, justifying ongoing investment.

By implementing apprenticeship or structured bootcamp approaches, organising role-specific learning paths, ensuring leadership buy-in, blending internal and external expertise, and measuring ROI, you develop a truly comprehensive and outcome-driven cloud skill development program.

How to do better

Below are rapidly actionable ways to further refine your mature apprenticeship or bootcamp program:

  1. Expand Specialist Tracks

    • Develop advanced sub-tracks (e.g., HPC, AI/ML, Zero-Trust Security) for participants who excel at foundational cloud skills:
      • Align with vendor specialised training or NCSC/NIST security standards for deeper expertise.
  2. Coordinate Multi-department Bootcamps

    • Collaborate with local councils, NHS, or other government bodies to form a larger talent pool:
      • Shared labs, cross-government hackathons, or combined funding can scale impact.
  3. Ensure Continuous Performance Assessments

    • Conduct formal evaluations 6, 12, or 18 months post-bootcamp:
      • Checking advanced skill adoption, real project outcomes, and personal career growth.
  4. Public Acknowledgment & Advancement

  5. Incorporate Cost-Savings and ROI Proof

    • Track how newly trained staff reduce external consultancy reliance, deliver projects faster, or improve security.
    • Present data to leadership, ensuring sustained or increased budgets for these programs.

By launching specialised advanced tracks, fostering cross-department collaborations, performing ongoing performance assessments, integrating real career incentives, and measuring ROI, you secure a pipeline of skilled cloud professionals well-suited to public sector demands, maintaining a resilient workforce aligned with national digital transformation objectives.

Keep doing what you’re doing, and consider documenting your apprenticeship or bootcamp approaches in internal blog posts or knowledge bases. Submit pull requests to this guidance or other best-practice repositories so fellow UK public sector organisations can replicate your success in rapidly upskilling staff for cloud roles.


How much do you rely on third parties for cloud work? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable ways to reduce over-dependence on a single third party:

  1. Retain Critical Access

  2. Require Transparent Documentation

    • Request the third party produce architecture diagrams, runbooks, and logs:
      • So your internal teams can reference them and step in if needed.
  3. Set Clear SLAs and Security Requirements

  4. Conduct Periodic Access Reviews

    • Evaluate who has root-level or full access privileges. Revoke or reduce if not absolutely necessary:
      • Minimises the impact if the supplier or a contractor is compromised.
  5. Begin In-House Skill Development

By retaining critical admin access, demanding thorough documentation, setting rigorous SLAs, auditing access, and growing your internal skill base, you hedge against supplier lock-in or failure and maintain some sovereignty over crucial cloud operations.

How to do better

Below are rapidly actionable improvements:

  1. Use Granular IAM Permissions

  2. Create Supplier-Specific Accounts or Subscriptions

    • Segment your cloud environment so suppliers only see or modify what’s relevant:
      • This helps contain damage if credentials leak or get misused.
  3. Mandate Activity Logging & Auditing

    • Configure AWS CloudTrail, Azure Monitor, GCP Cloud Logging, or OCI Audit to track every privileged action (see the sketch after this list):
      • Helps detect anomalies or investigate incidents quickly.
  4. Conduct Scheduled Joint Reviews

    • Align on cost management, architecture updates, security posture with the supplier monthly or quarterly:
      • e.g., use AWS Trusted Advisor, Azure Advisor, GCP Recommender, or OCI Cloud Advisor to see if best practices are followed.
  5. Plan for Possible Transition

    • If you decide to reduce the supplier’s role in the future, ensure documentation or staff knowledge exist to avoid single-point dependencies.

By applying least privilege IAM, isolating supplier access, logging all privileged actions, collaborating on architecture/cost reviews, and planning for possible transitions, you maintain high security while leveraging external expertise effectively.
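
To make the activity-logging step tangible, the sketch below lists the last 24 hours of CloudTrail events attributed to a supplier identity so privileged third-party activity stays visible to internal staff. It assumes Python, boto3, and a hypothetical supplier-ops-role principal name; Azure Monitor, GCP Cloud Logging, and OCI Audit support equivalent queries.

```python
# A sketch assuming Python, boto3, and a hypothetical "supplier-ops-role"
# principal name: list the last 24 hours of CloudTrail events attributed to a
# supplier identity so privileged third-party activity stays visible to
# internal staff. Azure Monitor, GCP Cloud Logging, and OCI Audit support
# equivalent queries.
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")
since = datetime.now(timezone.utc) - timedelta(hours=24)

paginator = cloudtrail.get_paginator("lookup_events")
pages = paginator.paginate(
    LookupAttributes=[{"AttributeKey": "Username", "AttributeValue": "supplier-ops-role"}],  # placeholder
    StartTime=since,
)

for page in pages:
    for event in page["Events"]:
        print(event["EventTime"], event["EventName"], event.get("EventSource", ""))
```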

How to do better

Below are rapidly actionable ways to refine specialised third-party support:

  1. Automate Break-Glass Processes

    • e.g., storing break-glass credentials in a secure vault (such as AWS Secrets Manager, Azure Key Vault, GCP Secret Manager, or OCI Vault) requiring multi-party approval or temporary permission escalation (a minimal sketch follows this list).
  2. Develop Clear Incident Protocols

  3. Perform Yearly Access Drills

    • Simulate a scenario requiring supplier intervention:
      • Validate that the break-glass account retrieval process, notifications, and post-incident re-lock steps all work smoothly.
  4. Enforce Accountability

  5. Periodic Skills Transfer

    • Let external experts run short workshops, training sessions, or knowledge transfers:
      • e.g., HPC performance tuning, advanced DevSecOps, or AI/ML best practices—improving your team’s ability to handle issues without always relying on break-glass.

By automating break-glass credentials, establishing clear incident protocols, conducting annual drills, logging all privileged actions, and regularly upskilling staff with supplier-led sessions, you can maintain strong security while accessing specialised expertise only when needed.
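
As a minimal sketch of an audited break-glass retrieval, the script below fetches the emergency credential and immediately publishes an audit notification so the access is never silent. It assumes Python, boto3, and hypothetical secret and SNS topic names; Azure Key Vault, GCP Secret Manager, and OCI Vault offer the same building blocks.

```python
# A minimal sketch of an audited break-glass retrieval, assuming Python, boto3,
# and hypothetical secret and SNS topic names: fetch the emergency credential,
# then immediately publish an audit notification so the access is never silent.
# Azure Key Vault, GCP Secret Manager, and OCI Vault offer the same building blocks.
import getpass
from datetime import datetime, timezone

import boto3

SECRET_ID = "break-glass/supplier-admin"                                # placeholder
AUDIT_TOPIC = "arn:aws:sns:eu-west-2:111122223333:break-glass-audit"    # placeholder

secrets = boto3.client("secretsmanager")
sns = boto3.client("sns")

reason = input("Reason for break-glass access: ")
secret = secrets.get_secret_value(SecretId=SECRET_ID)

sns.publish(
    TopicArn=AUDIT_TOPIC,
    Subject="Break-glass credential retrieved",
    Message=(
        f"user={getpass.getuser()} "
        f"time={datetime.now(timezone.utc).isoformat()} "
        f"secret={SECRET_ID} reason={reason}"
    ),
)

# secret["SecretString"] now holds the credential; rotate it once the incident closes.
```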

How to do better

Below are rapidly actionable ways to leverage specialised knowledge further:

  1. Add Read-Only or Auditor Roles

  2. Enable Collaborative Architecture Reviews

    • Provide sanitised environment data or architecture diagrams for the supplier to review:
      • e.g., removing any sensitive info but enough detail to yield beneficial recommendations.
  3. Request Proactive Security or Cost Analysis

  4. Formalise Knowledge Transfer

    • For each engagement, define deliverables like architectural guidelines, best-practice documents, or mini-lab sessions with staff.
    • Ensures that specialised advice becomes actionable in-house expertise.
  5. Regular Check-Ins and Feedback Loop

By granting read-only roles for better collaboration, scheduling architecture or security reviews, requesting continuous cost/security analysis, and structuring knowledge transfers, you maximise the benefits of external specialists while maintaining tight control over your environment.

How to do better

Below are rapidly actionable ways to refine a minimal/augmentative third-party approach:

  1. Maintain Partnerships Without Access

  2. Ensure Proper Documentation and Knowledge Transfer

    • Whenever you briefly hire contingent staff, they must update runbooks, diagrams, or code repos:
      • Mitigates the risk of a “knowledge walkout”.
  3. Incorporate Cross-Government Collaboration

    • For advanced or new cloud initiatives, consider partnering with other public sector bodies first, exchanging staff or expertise:
      • e.g., short secondments or co-located sprints can accelerate learning while minimising external costs.
  4. Benchmark Internal Teams Regularly

  5. Public Sector Thought Leadership

    • If you have minimal external dependencies, you likely have strong internal mastery—consider sharing success stories or best practices across local councils or GOV.UK communities of practice.

By maintaining a supplier list without granting them privileged access, enforcing thorough knowledge transfer, collaborating cross-government for specialised expertise, continuously benchmarking in-house capabilities, and showcasing your self-reliant approach, you preserve a high level of operational independence aligned with secure, cost-effective public sector cloud usage.

Keep doing what you’re doing, and consider writing about your strategies for third-party involvement in cloud initiatives or creating pull requests to this guidance. This helps other UK public sector organisations learn how to balance external expertise with robust internal control over their cloud environment.


What does success look like for your cloud team? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to establish at least minimal success criteria:

  1. Identify Key Cloud Objectives

    • E.g., reduce hosting costs by 10%, or migrate a pilot workload to AWS/Azure/GCP/OCI.
    • Reference departmental priorities or NIST cloud computing frameworks for initial guidance.
  2. Define Simple Metrics

    • Examples: “Number of staff trained on fundamental cloud skills,” “Mean time to deploy a new environment,” “Basic cost usage reduction from month to month.”
  3. Align with Leadership

    • Present a short list of proposed success metrics to senior management for sign-off, ensuring these metrics reflect organisational or GOV.UK Cloud First policies.
  4. Track Progress Visibly

    • Use a shared dashboard or simple spreadsheet to record outcomes:
      • e.g., new workloads migrated, number of test passes, or cost changes.
  5. Create a Baseline

    • If you have no prior data, quickly measure current on-prem costs or the time it takes to provision infrastructure:
      • This baseline will contextualise progress in adopting cloud solutions.

By identifying basic cloud objectives, selecting simple metrics, confirming leadership support, tracking progress, and establishing a baseline, you move from undefined success to a workable system that can be refined as your team matures.

How to do better

Below are rapidly actionable steps to advance beyond PoC-based success:

  1. Set PoC Transition Targets

    • Define a timeline or conditions under which successful PoCs move into pilot production or scale to more workloads:
      • e.g., “If the PoC meets X performance criteria at Y cost, proceed to production by date Z.”
  2. Establish Operational Metrics

  3. Involve Real End Users

    • If feasible, let a pilot serve actual staff or a subset of public users:
      • Gains more meaningful feedback on feasibility or user experience.
  4. Document & Share Learnings

  5. Link PoCs to Organisational Goals

    • Ensure each PoC addresses a genuine departmental need (like cost, user experience, or operational agility), so it’s not a siloed experiment.

By defining clear triggers for scaling PoCs, measuring advanced metrics, engaging real users, sharing lessons learned, and tying PoCs to broader goals, you accelerate from pilot outcomes to genuine organisational transformation.

How to do better

Below are rapidly actionable ways to refine production-based success criteria:

  1. Track Key Operational Metrics

  2. Integrate Security & Cost Efficiency

  3. Define a Full Lifecycle Approach

    • Ensure pipelines for new features, rollbacks, or replacements are tested and documented:
      • Reduces risk of “stagnation” where workloads remain unoptimised once launched.
  4. Share Achievements & Best Practices

  5. Plan for Next Steps

    • If a single workload is successful in production, identify the next logical workload or cost-saving measure to adopt:
      • e.g., serverless expansions, HPC jobs, advanced AI/ML adoption.

By incorporating operational metrics, weaving in security and cost success factors, ensuring a continuous pipeline approach, celebrating achievements, and planning further expansions, you create a robust definition of success that fosters ongoing improvements.

How to do better

Below are rapidly actionable strategies to further scale prototypes into core services:

  1. Adopt Advanced HA/DR Strategies

  2. Integrate Automated Security Testing

  3. Quantify Impact

    • Track cost savings, performance gains, or user satisfaction improvements from scaling cloud usage.
    • Present these metrics to leadership or cross-government peers.
  4. Develop or Refine Architectural Standards

  5. Collaborate with Other Public Sector Entities

By adopting advanced resiliency and security, measuring impact thoroughly, standardising architectural approaches, and collaborating with other public sector bodies, you mature from simply scaling prototypes to robust, enterprise-level cloud service delivery.

How to do better

Below are rapidly actionable ways to continue improving innovation- and value-centric success criteria:

  1. Adopt a Value Stream Approach

    • Link each cloud initiative to a user-facing or operational outcome:
      • e.g., reducing form-processing time from days to minutes, or improving public web performance by X%.
    • This ensures the entire pipeline, from idea to deployment, focuses on delivering measurable benefits.
  2. Incorporate Cross-Organisational Goals

    • For large departmental or multi-department programs, align success metrics to shared objectives:
      • e.g., joint cost savings, integrated citizen ID solutions, or unified data analytics capabilities.
  3. Advance Sustainability Metrics

  4. Enable Continuous Learning and Sharing

    • Promote open blog posts or internal wiki pages detailing each new experiment’s results—whether success or failure.
    • Encourages a virtuous cycle of rapid improvement.
  5. Periodically Recalibrate Metrics

    • As technology evolves, update or retire older success metrics (e.g., “time to spin up a VM” might be replaced by “time to deploy a new serverless function”), ensuring they stay relevant to strategic ambitions.

By implementing a value stream approach, embedding cross-organisational goals, focusing on sustainability, encouraging transparency in experiments, and periodically recalibrating metrics, your cloud team solidifies its role as a driver of innovation and public value creation. This ensures alignment with evolving public sector needs, best practices, and digital transformation objectives.

Keep doing what you’re doing, and consider writing blog posts about your success criteria or opening pull requests to this guidance so other public sector organisations can adopt or refine similar approaches to measuring and achieving cloud team success.


Do leadership support your move to the cloud? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable suggestions to secure at least minimal executive sponsorship:

  1. Document Quick-Win Success Stories

    • Show leadership how small pilots delivered cost savings, performance gains, or alignment with departmental digital goals.
    • For instance, highlight a pilot serverless function that replaced an ageing on-prem script.
  2. Link Cloud Adoption to Organisational Mandates

  3. Prepare a Simple Business Case

  4. Request a Brief Meeting with a Senior Sponsor

    • Secure 15-30 minutes to share pilot results or near-term opportunities:
      • Stress risk of continuing without guidance from top leadership (e.g., security gaps, budget overruns, or duplication).
  5. Offer an Executive-Level Intro

    • Propose an hour-long cloud fundamentals overview for interested executives, possibly with vendor partner support or free training sessions.

By compiling quick-win stories, framing cloud adoption in organisational mandates, presenting a succinct business case, requesting a short meeting, and offering an executive primer, you begin building the case for at least baseline senior buy-in.

How to do better

Below are rapidly actionable steps to expand from senior management to full executive endorsement:

  1. Demonstrate Departmental Wins

    • Have senior managers publicise successful departmental cloud outcomes to executives:
      • e.g., a 20% cost reduction or improved user satisfaction in a pilot citizen-facing service.
  2. Facilitate an Exec-Level Briefing

  3. Align with Organisational Strategy

  4. Request Executive Sponsor for Large-Scale Migrations

    • If you plan a major migration (like HPC, AI/ML, or a large data centre closure), propose a “sponsor” role for a top exec:
      • Encourages them to champion budget allocations and remove cross-department barriers.
  5. Create a Vision Statement

    • Collaborate with senior managers to draft a concise “cloud vision” for the next 1-3 years, referencing success metrics (cost, security posture, user satisfaction) to interest executives.

By showcasing departmental successes, hosting briefings with executives, integrating the initiative into overarching strategies, seeking an executive sponsor for large projects, and formalising a short vision statement, you steadily shift from partial senior sponsorship to broader top-level leadership buy-in.

How to do better

Below are rapidly actionable ways to leverage C-level sponsorship further:

  1. Develop a Multi-Year Cloud Roadmap

    • Collaborate with the C-level sponsor to define short, medium, and long-term goals:
      • e.g., incremental migrations, security enhancements, cost optimisation targets.
  2. Establish Clear KPIs & Milestones

  3. Ensure Inter-Departmental Collaboration

  4. Embed Security as a First-Class Concern

  5. Highlight Public Sector Success

    • Encourage your sponsor to share wins at internal leadership summits or cross-gov conferences, fostering further executive-level peer collaboration.

By crafting a multi-year roadmap, specifying meaningful KPIs, promoting cross-department synergy, embedding robust security from the start, and publicising achievements, you realise the full benefits of C-level sponsorship—driving cohesive, secure, and strategic cloud adoption.

How to do better

Below are rapidly actionable ways to strengthen a comprehensive C-level sponsorship with a strategic roadmap:

  1. Involve Staff in Roadmap Updates

    • Host quarterly open sessions where devs, ops, or security can give feedback on the strategic plan:
      • Encourages buy-in and surfaces practical constraints.
  2. Institute a Cloud Steering Committee

    • Form a cross-functional group with representation from finance, HR, security, architecture, and user departments:
      • They meet regularly to track progress, share challenges, and drive adjustments in the roadmap.
  3. Focus on Advanced Migrations or Services

  4. Integrate Multi-Cloud or Hybrid Considerations

  5. Publish Success Metrics

    • Show top-level achievements or cost savings in staff newsletters or a leadership dashboard:
      • Reinforces organisational momentum for the roadmap.

By updating the roadmap collaboratively, establishing a cloud steering committee, venturing into advanced HPC/AI/ML, acknowledging multi-cloud/hybrid scenarios, and publicising success metrics, you deepen the synergy and accountability behind your cloud adoption plan—leading to dynamic, well-supported progress.

How to do better

Below are rapidly actionable ways to continuously strengthen a cloud-first culture under comprehensive C-level sponsorship:

  1. Scale Innovation Hubs

    • If you have a centre of excellence or an innovation lab, extend its scope to HPC, AI/ML, IoT, or other advanced workloads:
      • e.g., adopt HPC solutions from AWS, Azure, GCP, OCI, incorporating domain specialists.
  2. Open Source & Share

    • Encourage teams to open-source relevant code or automation, participating in cross-government communities:
  3. Enable Real-Time Security & Compliance

  4. Track Cloud Maturity Beyond Tech

    • Evaluate cultural aspects: e.g., dev empowerment, cost accountability, user feedback loops.
    • Revisit or revise success criteria every 6-12 months.
  5. Recognise and Reward Cloud Champions

    • Publicly celebrate individuals or squads who pioneer new solutions, demonstrate cost savings, or deliver advanced workloads in HPC or serverless.

By scaling innovation hubs, open-sourcing solutions, implementing real-time compliance guardrails, tracking maturity across cultural dimensions, and publicly recognising cloud champions, you cement a thriving, cloud-first culture that embraces experimentation, security, and strategic public sector outcomes.

Keep doing what you’re doing, and consider publishing blog posts or opening pull requests to share your experiences in fostering a cloud-first mindset under strong executive sponsorship. This helps others in the UK public sector replicate or learn from your advanced leadership-driven cloud transformation.

Security

How do you manage accounts used by software, not people? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to enhance service account security beyond basic user/pass credentials:

  1. Use Cloud-Native IAM for Service Accounts

  2. Adopt a Central Secret Manager

  3. Automate Rotation

    • If you must keep user/pass-based secrets temporarily, implement at least monthly or quarterly rotations (a short audit sketch follows this list):
      • Minimises window of exposure if leaked.
  4. Reference NCSC & NIST

  5. Plan for Future Migration

    • Target short-lived tokens or IAM role-based approaches as soon as feasible, phasing out permanent user credentials for non-human accounts.

By employing a secure secret manager, rotating basic credentials, and gradually moving to role-based or short-lived tokens, you significantly reduce the risk associated with static user/password pairs for service accounts.
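
As a concrete illustration of the rotation step above, the following minimal sketch (assuming AWS Secrets Manager and the boto3 SDK, with an illustrative 90-day threshold) lists secrets that have not been rotated or changed recently so they can be prioritised; Azure Key Vault, GCP Secret Manager, and OCI Vault offer equivalent APIs.

```python
# Minimal sketch: flag AWS Secrets Manager entries that look overdue for rotation.
# Assumes boto3 and AWS credentials are configured; the 90-day threshold is illustrative.
from datetime import datetime, timedelta, timezone

import boto3

THRESHOLD = timedelta(days=90)  # example rotation window; adjust to your policy


def stale_secrets():
    client = boto3.client("secretsmanager")
    cutoff = datetime.now(timezone.utc) - THRESHOLD
    for page in client.get_paginator("list_secrets").paginate():
        for secret in page["SecretList"]:
            # Prefer the last rotation date; fall back to the last change date.
            last = secret.get("LastRotatedDate") or secret.get("LastChangedDate")
            if last is None or last < cutoff:
                yield secret["Name"], last


if __name__ == "__main__":
    for name, last in stale_secrets():
        print(f"{name}: last rotated/changed {last or 'never'}")
```

A report like this can feed a monthly review before you commit to fully automated rotation.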

How to do better

Below are rapidly actionable ways to move beyond static API keys:

  1. Store Keys in a Central Secret Manager

  2. Automate API Key Rotation

    • Implement a rotation schedule (e.g., monthly or quarterly) or on every deployment (see the key-age report sketched after this list):
      • Reduces the window if a key is leaked.
  3. Consider IAM Role or Token-Based Alternatives

  4. Limit Scopes

  5. Log & Alert on Key Usage

By centrally managing keys, rotating them automatically, transitioning to role-based or token-based credentials, enforcing least privilege, and auditing usage, you substantially reduce the security risk associated with static API keys.
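
To make the rotation schedule in step 2 measurable, the sketch below (assuming AWS IAM and boto3; the 90-day threshold is an arbitrary example) reports active access keys older than the threshold together with their last-use date, which is a reasonable starting point for alerting.

```python
# Minimal sketch: report AWS IAM access keys older than a rotation threshold,
# plus when each key was last used. Assumes boto3 and suitable IAM read permissions;
# the 90-day threshold is illustrative.
from datetime import datetime, timedelta, timezone

import boto3

MAX_AGE = timedelta(days=90)

iam = boto3.client("iam")
cutoff = datetime.now(timezone.utc) - MAX_AGE

for page in iam.get_paginator("list_users").paginate():
    for user in page["Users"]:
        keys = iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]
        for key in keys:
            if key["Status"] == "Active" and key["CreateDate"] < cutoff:
                last_used = iam.get_access_key_last_used(AccessKeyId=key["AccessKeyId"])
                when = last_used["AccessKeyLastUsed"].get("LastUsedDate", "never")
                print(f"{user['UserName']}: key {key['AccessKeyId']} "
                      f"created {key['CreateDate']:%Y-%m-%d}, last used {when}")
```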

How to do better

Below are rapidly actionable ways to refine a centralised secret store with partial rotation:

  1. Extend Rotation to All or Most Credentials

    • If some are still static, define a plan for each credential’s rotation frequency:
      • e.g., monthly or upon every production deployment.
  2. Build Automated Pipelines

  3. Enforce Access Policies

  4. Combine with Role-Based Authentication

  5. Monitor for Stale or Unused Secrets

    • Regularly check your secret store for credentials not accessed in a while or older than a certain rotation threshold:
      • helps avoid accumulating outdated secrets.

By expanding automated rotation, integrating secret retrieval into pipelines, enforcing tight access controls, adopting role-based methods for new services, and cleaning stale secrets, you further strengthen your centralised secret store approach for secure, efficient credential management.

How to do better

Below are rapidly actionable ways to improve your mTLS-based authentication approach; a minimal client-side example follows the list:

  1. Short-Lived Certificates

  2. Adopt a Service Mesh

    • If using microservices in Kubernetes, incorporate a mesh such as [Istio, Linkerd, AWS App Mesh, Azure Service Mesh, GCP Anthos Service Mesh, or OCI OKE integrated mesh] to handle mTLS automatically:
      • Enforces consistent policies across services.
  3. Implement Strict Certificate Policies

  4. Monitor for Expiry and Potential Compromises

  5. Combine with IAM for Additional Controls

    • For advanced zero-trust, complement mTLS with role-based or token-based checks:
      • e.g., verifying principal claims in addition to cryptographic identities.

By employing short-lived certs, possibly using a service mesh, establishing strict certificate policies, continuously monitoring usage, and optionally layering further IAM or token checks, you maximise the security benefits of mTLS for your service accounts.
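
To make the client side of mTLS concrete, here is a minimal sketch using Python's requests library; the URL and certificate paths are placeholders, and in practice short-lived certificates would be issued by an internal CA or injected by a service mesh rather than held on disk long-term.

```python
# Minimal sketch of a client making an mTLS call with the `requests` library.
# File paths and the URL are placeholders; real deployments would source
# short-lived certificates from an internal CA or a service mesh sidecar.
import requests

response = requests.get(
    "https://internal-service.example.gov.uk/api/status",  # placeholder URL
    cert=("/etc/pki/service/client.crt", "/etc/pki/service/client.key"),  # client identity
    verify="/etc/pki/service/ca-bundle.pem",  # pin the internal CA that signs server certs
    timeout=10,
)
response.raise_for_status()
print(response.json())
```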

How to do better

Even at this top level, below are rapidly actionable refinements:

  1. Leverage Vendor Identity Federation Tools

    • e.g., [AWS IAM roles with Web Identity Federation or AWS Secure Token Service, Azure AD token issuance, GCP IAM federation, OCI Identity Federation with IDCS], ensuring minimal friction for ephemeral tokens.
  2. Integrate Policy-as-Code

    • Tools like [Open Policy Agent or vendor policy engines (AWS SCP, Azure Policy, GCP Organisation Policy, OCI Security Zones)] can dynamically evaluate each identity request in real time.
  3. Adopt Service Mesh with Dynamic Identity

    • In container or microservice architectures, pair ephemeral identity with a service mesh that injects secure tokens automatically.
  4. Continuously Audit and Analyze Logs

  5. Cross-Government Federated Services

By fully harnessing vendor identity federation, embedding policy-as-code, integrating ephemeral identity usage in service meshes, analyzing usage logs for anomalies, and considering cross-government identity solutions, you refine an already highly secure and agile environment for non-human service accounts aligned with best-in-class public sector practices.

Keep doing what you’re doing, and consider publishing blogs or opening pull requests to this guidance about your success in elevating non-human identity security in cloud environments. Sharing your experiences helps other UK public sector organisations adopt robust credential management aligned with the highest security standards.


How does your organisation manage user identities and authentication? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable suggestions to introduce at least a minimal level of identity governance:

  1. Define a Basic Password/Passphrase Policy

    • For instance, require passphrases of at least 14 characters, and avoid enforced complexity rules that encourage repeated password re-use.
    • Consult NCSC’s password guidance for recommended best practices.
  2. Centralise Authentication for Cloud Services

  3. Start Logging Identity Events

  4. Establish a Simple Governance Policy

  5. Plan for Incremental Improvement

    • Mark out a short timeline (e.g., 3-6 months) to adopt multi-factor authentication for privileged or admin roles next.

By introducing a foundational password policy, centralising authentication, enabling basic identity event logging, creating a minimal governance document, and scheduling incremental improvements, you’ll rapidly move beyond ad hoc practices toward a more secure, consistent approach.

How to do better

Below are rapidly actionable ways to automate and strengthen your identity policy enforcement; an example MFA audit sketch follows the list:

  1. Deploy Automated Audits

  2. Enforce Basic MFA for Privileged Accounts

  3. Establish Self-Service or Automated Access Reviews

    • Implement a monthly or quarterly identity review:
      • e.g., a simple emailed listing of who has what roles, requiring managers to confirm or revoke access.
  4. Adopt Single Sign-On (SSO)

  5. Store Policies & Logs in a Central Repo

    • Keep your identity policy in version control and track changes:

By automating audits, enforcing MFA, implementing automated access reviews, consolidating sign-on, and centralising policy documentation, you move from manual enforcement to a more efficient, consistently secure identity posture.
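
As one way to start the automated audits described above, the sketch below (assuming AWS IAM and boto3) lists users who hold a console password but have no MFA device enrolled; the same idea can be replicated against Azure AD/Entra, GCP Cloud Identity, or OCI IAM reports.

```python
# Minimal sketch: list AWS IAM users who have a console password but no MFA device.
# Assumes boto3 and IAM read permissions.
import boto3

iam = boto3.client("iam")

for page in iam.get_paginator("list_users").paginate():
    for user in page["Users"]:
        name = user["UserName"]
        try:
            iam.get_login_profile(UserName=name)  # raises if the user has no console password
        except iam.exceptions.NoSuchEntityException:
            continue  # no console access, so MFA is less urgent here
        mfa = iam.list_mfa_devices(UserName=name)["MFADevices"]
        if not mfa:
            print(f"{name}: console access without MFA")
```

Running this on a schedule and emailing the output is often enough to drive early compliance.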

How to do better

Below are rapidly actionable ways to progress toward advanced identity automation:

  1. Expand MFA Requirements to All Users

    • If only privileged users have 2FA, consider rolling out to all staff or external collaborators:
      • e.g., AWS, Azure, GCP, and OCI support TOTP apps, hardware security keys, or SMS as a fallback (not recommended where higher security is needed).
  2. Use Role/Attribute-Based Access

    • For each environment (AWS, Azure, GCP, OCI), define roles or groups with appropriate privileges:
  3. Consolidate Identity Tools

  4. Integrate Automated Deprovisioning

  5. Enhance Monitoring & Alerting

    • Add real-time alerts for suspicious identity events:
      • e.g., multiple failed logins, sudden role escalations, or new key creation.

By extending MFA to all, embracing role-based access, consolidating identity management, automating deprovisioning, and boosting real-time monitoring, you achieve more robust, near-seamless identity automation aligned with best practices for public sector security.

How to do better

Below are rapidly actionable steps to elevate advanced identity management:

  1. Adopt Conditional Access or Policy-based Access

  2. Incorporate Just-In-Time (JIT) Privileges

    • For admin tasks, require users to elevate privileges temporarily (see the short-lived credential sketch after this list):
      • e.g., AWS IAM Permission boundaries, Azure Privileged Identity Management, GCP short-lived access tokens, OCI dynamic roles with short-lived credentials.
  3. Monitor Identity with SIEM or Security Analytics

    • e.g., [AWS Security Hub, Azure Sentinel, GCP Security Command Center, OCI Logging Analytics] for real-time anomaly detection or advanced threat intelligence:
      • Ties into your identity logs to detect suspicious patterns.
  4. Engage in Regular “Zero-Trust” Drills

  5. Promote Cross-Government Identity Standards

By implementing conditional or JIT access, leveraging robust SIEM-based identity monitoring, holding zero-trust scenario drills, and sharing identity solutions across the public sector, you further strengthen an already advanced identity environment.
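
As an illustration of just-in-time elevation, the sketch below uses AWS STS to mint credentials that expire after 15 minutes; the role ARN and session name are placeholders, and comparable mechanisms exist in Azure Privileged Identity Management, GCP short-lived tokens, and OCI dynamic credentials.

```python
# Minimal sketch: obtain short-lived, just-in-time admin credentials via AWS STS.
# The role ARN and session name are placeholders; 900 seconds (15 minutes) is the
# minimum session duration STS allows and is purely illustrative.
import boto3

sts = boto3.client("sts")

resp = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/BreakGlassAdmin",  # placeholder role
    RoleSessionName="jit-admin-task",
    DurationSeconds=900,  # credentials expire after 15 minutes
)
creds = resp["Credentials"]

# Use the temporary credentials for the privileged task only.
admin_session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print("Temporary credentials expire at", creds["Expiration"])
```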

How to do better

Even at the apex, below are rapidly actionable ways to further optimise:

  1. Multi-Cloud Single Pane IAM

    • If you use multiple cloud providers, unify them under a single identity provider or a cross-cloud identity framework:
      • e.g., Azure AD for AWS + Azure + GCP roles, or a third-party IDaaS solution with robust zero-trust policies.
  2. Advanced Risk-Based Authentication

  3. Adopt Policy-as-Code for Identity

    • Tools like [Open Policy Agent or vendor policy frameworks (AWS Organizations SCP, Azure Policy, GCP Organization Policy, OCI Security Zones)] to define identity controls in code:
      • Facilitates versioning, review, and auditable changes.
  4. Extend 2FA to Cross-Government Collaboration

  5. Publish Regular Identity Health Reports

    • Summaries of user activity, stale accounts, or re-certifications. Encourages transparency and fosters trust in your identity processes.

By unifying multi-cloud identity, implementing advanced risk-based authentication, using policy-as-code for identity controls, expanding cross-government 2FA, and regularly reporting identity health metrics, you maintain a cutting-edge identity management ecosystem. This ensures robust security, compliance, and agility for your UK public sector organisation in an evolving threat environment.

Keep doing what you’re doing, and consider writing up your experiences, success metrics, or blog posts on advanced identity management. Contribute pull requests to this guidance or other best-practice repositories so fellow UK public sector entities can learn from your identity management maturity journey.


How do you make sure people have the right access for their role? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to transition from ad-hoc reviews to basic structured processes:

  1. Define a Minimal Access Policy

  2. Create a Simple RACI for Access Management

    • Identify who is Responsible, Accountable, Consulted, and Informed for each step (e.g., granting, revoking, auditing).
    • Helps clarify accountability if something goes wrong.
  3. Leverage Built-In Cloud IAM Tools

  4. Maintain a Basic User Inventory

    • Keep a spreadsheet or list of all privileged users, what roles they have, and last update date:
      • So you can spot dormant accounts or over-privileged roles.
  5. Plan for Periodic Checkpoints

    • Commit to a small monthly or quarterly access sanity check with relevant admins, reducing overlooked issues over time.

By laying out a minimal access policy, assigning RACI for administration, adopting cloud-native IAM, maintaining a simple user inventory, and scheduling monthly or quarterly check-ins, you’ll quickly improve from ad-hoc reviews to a more reliable approach.

How to do better

Below are rapidly actionable ways to evolve beyond limited-action reviews:

  1. Mandate a “Test Before Revoke” Procedure

    • If concerns about “breaking something” hinder revocations, adopt a short test environment to confirm the user or system truly needs certain permissions.
  2. Categorise Users by Risk

  3. Implement Review Dashboards

    • Summarise each user’s privileges, last login, or role usage (the credential-report sketch after this list shows one way to gather this):
      • If certain roles are not used in X days, consider removing them.
  4. Show Leadership Examples

    • Have a pilot case where you successfully reduce access for a role with no negative consequences, building confidence.
  5. Incentivise or Recognise Proper Clean-Up

    • Acknowledge teams or managers who diligently remove no-longer-needed permissions:
      • Encourages a habit of safe privilege reduction.

By adopting test environments before revoking privileges, classifying user risk levels, building simple dashboards, demonstrating safe revocations, and recognising best practices, you reduce hesitancy and further align with security best practices.
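
One low-effort way to seed such a dashboard is the AWS IAM credential report, sketched below with boto3; it is an account-wide CSV covering password and access-key usage, and other providers expose similar access or sign-in reports.

```python
# Minimal sketch: build a simple "last activity" view from the AWS IAM credential
# report, which could feed a review dashboard. Assumes boto3.
import csv
import io
import time

import boto3

iam = boto3.client("iam")

# Ask AWS to (re)generate the report, then poll until it is ready.
while iam.generate_credential_report()["State"] != "COMPLETE":
    time.sleep(2)

report = iam.get_credential_report()["Content"].decode("utf-8")
for row in csv.DictReader(io.StringIO(report)):
    print(row["user"], "password last used:", row["password_last_used"])
```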

How to do better

Below are rapidly actionable steps to incorporate permission reduction:

  1. Implement a “Use it or Lose it” Policy

  2. Mark Temporary Access with Expiry

    • For short-term projects, set an end date for extra privileges:
      • e.g., using AWS or Azure policy conditions, GCP short-lived tokens, or OCI compartments-based ephemeral roles.
  3. Combine with Slack/Teams Approvals

    • Automate revocation requests: if an admin sees stale permissions, they click a button to remove them, and a second manager approves:
      • Minimises fear of accidental breakage.
  4. Reward “Right-sizing”

    • Celebrate teams that proactively reduce permission sprawl, referencing cost savings or risk reduction:
      • e.g., mention in staff newsletters or internal security updates.
  5. Refine Review Frequency

    • If reviews are monthly or quarterly, consider stepping up to weekly or adopting a continuous scanning approach for business-critical accounts.

By adding a usage-based revocation policy, setting expiry for short-lived roles, integrating quick approval workflows, recognising teams that successfully remove unused privileges, and potentially increasing review frequency, you shift from additive-only changes to an environment that truly enforces minimal privileges.

How to do better

Below are rapidly actionable methods to enhance expiry-based reviews:

  1. Use Cloud-Native Access Review Tools

  2. Adopt Automated Alerts for Upcoming Expiries

    • If a role is nearing its expiry date, the user and manager receive an email or Slack notice to re-certify or let it lapse.
  3. Incorporate Risk Scoring

    • If an account has high privileges or sensitive system access, require more frequent or thorough re-validation:
      • e.g., monthly for privileged accounts, quarterly for standard user roles.
  4. Implement Delegated Approvals

  5. Maintain Audit Trails

By leveraging cloud-native review tools, alerting for soon-to-expire roles, risk-scoring high-privilege accounts for more frequent checks, implementing delegated re-approval processes, and storing thorough audit trails, you maintain an agile, secure environment aligned with best practices.

How to do better

Below are rapidly actionable ways to refine a fully automated, risk-based review system:

  1. Incorporate Real-Time Risk Signals

  2. Use Policy-as-Code for Access

    • Tools like [Open Policy Agent or vendor-based solutions (AWS Organizations SCP, Azure Policy, GCP Organization Policy, OCI Security Zones)] can define rules for dynamic role allocation.
  3. Ensure Continuous Oversight

    • Provide dashboards for leadership or security officers, showing current risk posture, overdue re-certifications, or upcoming role changes:
      • Minimises the chance of an overlooked anomaly.
  4. Extend to Multi-Cloud or Hybrid

    • If your department spans AWS, Azure, GCP, or on-prem systems, unify identity reviews under a single orchestrator or Identity Governance tool:
  5. Cross-Government Sharing

By integrating real-time risk analysis, employing policy-as-code for dynamic role assignment, offering continuous oversight dashboards, supporting multi-cloud/hybrid scenarios, and sharing insights across government bodies, you further refine an already advanced, automated identity review system. This ensures minimal security risk and maximum agility in the public sector context.

Keep doing what you’re doing, and consider publishing blog posts or making pull requests to this guidance about your advanced access review processes. Sharing experiences helps other UK public sector organisations adopt similarly robust, automated solutions for managing user permissions.


How do you create and manage user accounts for cloud systems? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond shared/manual accounts:

  1. Eliminate Shared Accounts

    • Mandate each user has an individual account, referencing NCSC’s identity best practices.
    • This fosters actual accountability and compliance with typical public sector guidelines.
  2. Set Up Basic IAM

  3. Document a Minimal Process

  4. Enable Basic Audit Logging

  5. Move to a Single Sign-On Approach

    • Plan to adopt SSO with a single user directory in the next phase:
      • Minimises manual overhead and ensures consistency.

By ensuring each user has an individual account, using vendor IAM for creation, documenting a minimal lifecycle process, enabling audit logging, and preparing for SSO, you remedy the major pitfalls of shared/manual account approaches.

How to do better

Below are rapidly actionable steps to unify and automate your on-prem identity repository with cloud systems:

  1. Enable Federation or SSO

  2. Deploy Basic Automation Scripts

    • If a full federation is not possible immediately, create scripts that read from your directory and auto-provision or auto-delete accounts in the cloud:
      • e.g., using vendor CLIs or REST APIs.
  3. Standardise User Roles

    • For each cloud environment, define roles that map to on-prem groups:
      • e.g., “Developer group in AD -> Dev role in AWS.”
    • Ensures consistent privileges across systems, referencing NCSC’s least-privilege principle.
  4. Implement a Scheduled Sync

    • Regularly compare your on-prem directory with each cloud environment to detect orphaned or mismatched accounts (see the comparison sketch after this list).
    • Could be monthly or weekly initially.
  5. Transition to Identity Provider Integration

By federating or automating the sync between your directory and cloud, standardising roles, scheduling periodic comparisons, and eventually adopting a modern identity provider, you gradually remove manual friction and potential security gaps.
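
A scheduled sync can start as a simple set comparison, sketched below. The directory export function and file name are placeholders for however you extract users from your on-prem directory (an LDAP query or CSV dump, for example); the cloud side here assumes AWS IAM via boto3.

```python
# Minimal sketch: compare a directory export with cloud IAM users to spot orphans.
# `load_directory_users` and the file name are placeholders for your own export.
import boto3


def load_directory_users() -> set[str]:
    # Placeholder: replace with your LDAP/AD export mechanism.
    with open("directory_users.txt") as f:
        return {line.strip().lower() for line in f if line.strip()}


def load_cloud_users() -> set[str]:
    iam = boto3.client("iam")
    users = set()
    for page in iam.get_paginator("list_users").paginate():
        users.update(u["UserName"].lower() for u in page["Users"])
    return users


directory = load_directory_users()
cloud = load_cloud_users()

print("In cloud but not in directory (candidate orphans):", sorted(cloud - directory))
print("In directory but not in cloud (not yet provisioned):", sorted(directory - cloud))
```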

How to do better

Below are rapidly actionable ways to refine standard identity management:

  1. Require SSO or Federation for All Services

  2. Implement Access Workflows

  3. Continuously Evaluate Cloud Services

    • Maintain a whitelist of services that meet your identity standards:
      • If a service can’t integrate with SSO or can’t match your password/MFA policies, strongly discourage its use.
  4. Include Role Mapping in a Central Catalog

  5. Expand Logging & Alerting

By enforcing SSO/federation for all services, deploying structured access workflows, continuously evaluating new cloud offerings, documenting role-to-privilege mappings, and bolstering security alerting, you ensure consistent, secure user identity alignment across your cloud ecosystem.

How to do better

Below are rapidly actionable ways to reinforce automated federated identity:

  1. Adopt Short-Lived Credentials

  2. Implement Policy-as-Code for Identity

  3. Add Real-Time Security Monitoring

    • If a user tries to access a new or high-risk service, enforce additional checks:
      • e.g., multi-factor step-up, manager approval, location-based restrictions.
  4. Integrate Cross-department SSO

    • If staff frequently collaborate across multiple public sector agencies, explore cross-government identity solutions:
      • e.g., bridging Azure AD tenants or adopting solutions that unify NHS, local council, or central government credentials.
  5. Review & Update Roles Continuously

    • Encourage monthly or quarterly role usage analyses, removing unneeded entitlements automatically:
      • Minimises risk from leftover privileges.

By adopting short-lived credentials, storing identity policy in code, enabling real-time security checks, exploring cross-department SSO, and continuously reviewing role usage, you transform a solid federation setup into a robust and adaptive identity ecosystem.

How to do better

Below are rapidly actionable ways to refine an already unified, cloud-based identity approach:

  1. Implement Passwordless or Phishing-Resistant MFA

  2. Add Dynamic Risk Scoring

  3. Extend Identity to Third-Party Collaboration

  4. Encourage Cross-Public Sector Federation

  5. Regularly Assess Identity Posture

By adopting passwordless MFA, integrating dynamic risk scoring, enabling external collaborator identity, exploring cross-public sector federation, and performing continuous zero-trust posture checks, you achieve an exceptionally secure, efficient environment—exemplifying best practices for user provisioning and identity management in the UK public sector.

Keep doing what you’re doing, and consider publishing blog posts or opening pull requests to share your experiences in creating a unified, cloud-based identity approach. By collaborating with others in the UK public sector, you help propagate secure, advanced authentication practices across government services.


How do you manage non-human service accounts in the cloud? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond human-like accounts for services:

  1. Introduce Role-Based Service Accounts

  2. Limit Shared Credentials

    • Immediately stop creating or reusing credentials across multiple services. Assign each service a unique identity:
      • Ensures logs and auditing can differentiate actions.
  3. Enforce MFA or Short-Lived Tokens

  4. Document a Minimal Policy

  5. Begin Transition to Cloud-Native Identity

    • Plan a short-term goal (2-4 months) to retire all shared/human-like service accounts, adopting native roles or short-lived credentials where feasible.

By introducing cloud-native roles for services, eliminating shared credentials, enabling MFA or short-lived tokens if needed, documenting a minimal policy, and planning a transition, you reduce security risks posed by long-lived, human-like service accounts.

How to do better

Below are rapidly actionable steps to centralise and secure long-lived API keys:

  1. Move Keys to a Central Secret Store

  2. Enforce Rotation Policies

  3. Use Tooling for Local Key Discovery

  4. Document a Single Organisational Policy

  5. Transition to Role-Based or Short-Lived Tokens

    • While central secret storage helps, plan a future move to ephemeral tokens or IAM roles:
      • Reduces reliance on static keys altogether.

By centralising key storage, rotating keys automatically, scanning for accidental exposures, formalising a policy, and starting to shift away from static keys, you significantly enhance the security of locally managed long-lived credentials.

How to do better

Below are rapidly actionable ways to strengthen your centralised secret store approach:

  1. Automate Secret Rotation

  2. Incorporate Access Control & Monitoring

  3. Reference a “Secret Lifecycle” Document

  4. Integrate into CI/CD

    • Ensure automation pipelines fetch credentials from the secret store at build or deploy time, never storing them in code (see the sketch after this list).
  5. Begin Adopting Ephemeral Credentials

By automating secret rotation, refining access controls, documenting a secret lifecycle, hooking the store into CI/CD, and planning ephemeral credentials for new services, you build on your strong foundation of centralised secret usage to minimise risk further.
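
As an illustration of fetching credentials at deploy time rather than storing them in code, the sketch below assumes AWS Secrets Manager via boto3; the secret name is a placeholder, and Azure Key Vault, GCP Secret Manager, and OCI Vault offer equivalent calls.

```python
# Minimal sketch: fetch a credential from AWS Secrets Manager at deploy time
# instead of baking it into code or pipeline variables. The secret name is a placeholder.
import json

import boto3


def get_database_credentials(secret_name: str = "prod/reporting-db") -> dict:
    client = boto3.client("secretsmanager")
    value = client.get_secret_value(SecretId=secret_name)
    return json.loads(value["SecretString"])  # e.g. {"username": "...", "password": "..."}


if __name__ == "__main__":
    creds = get_database_credentials()
    print("Fetched credentials for user:", creds["username"])
```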

How to do better

Below are rapidly actionable improvements to further secure ephemeral identity usage:

  1. Embed Short-Lived Tokens in CI/CD

  2. Adopt Service Mesh or mTLS

  3. Leverage Policy-as-Code

  4. Regularly Audit Attestation Mechanisms

  5. Integrate with Cross-Org Federation

By embedding ephemeral tokens into your CI/CD, adding a service mesh or mTLS, employing policy-as-code, auditing attestation rigorously, and exploring cross-organisation federation, you evolve ephemeral identity usage into a highly secure, flexible, and zero-trust-aligned solution.

How to do better

Below are rapidly actionable ways to enhance code-managed identities with federated trust:

  1. Incorporate Real-Time Security Policies

    • Use policy-as-code (OPA, AWS SCP, Azure Policy, GCP Org Policy, OCI Security Zones) to automatically detect and block misconfigurations in your IaC definitions.
  2. Leverage DevSecOps Workflows

    • Integrate identity code linting, security scanning, and ephemeral token provisioning into CI/CD:
      • e.g., scanning Terraform or CloudFormation for suspicious identity references before merge.
  3. Implement Zero-Trust Microsegmentation

  4. Expand to Multi-Cloud/Hybrid

    • If multiple providers or on-prem systems are used, unify identity definitions across them:
  5. Regularly Validate & Audit

By employing policy-as-code, adopting DevSecOps scanning in your pipeline, embracing zero-trust microsegmentation, extending code-based identity to multi-cloud/hybrid, and continuously auditing for drift, you perfect a code-centric model that securely and efficiently manages service identities across your entire public sector environment.

Keep doing what you’re doing, and consider sharing your approach to code-managed identity and federated trust in blog posts or by making pull requests to this guidance. This knowledge helps other UK public sector organisations adopt similarly robust, zero-trust-aligned solutions for non-human service account authentication.


How do you manage risks? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to improve from an informal approach:

  1. Create a Simple Risk Checklist

  2. Record & Communicate Regularly

    • Even a single spreadsheet or Word doc with identified risks, likelihood, and impact fosters consistency.
    • Share it monthly or quarterly with the relevant stakeholders.
  3. Assign Risk Owners

    • For each risk, name someone responsible for tracking and mitigating.
    • Prevents duplication or “everyone and no one” owning an issue.
  4. Introduce Basic Likelihood & Impact Scoring

    • e.g., 1-5 scale for likelihood, 1-5 for impact, multiply for a total risk rating.
    • This helps prioritise and start discussion around risk tolerance; a short worked example follows this list.
  5. Plan for Next Steps

By establishing a simple risk checklist, scheduling short reviews, assigning ownership, adopting basic scoring, and outlining a plan for incremental improvements, you quickly move from purely informal approaches to a more recognisable and consistent risk management foundation.
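
The scoring in step 4 can be prototyped in a few lines; the sketch below uses illustrative risks only and simply multiplies likelihood by impact to order the register.

```python
# Minimal sketch: basic likelihood x impact scoring of a risk register,
# sorted so the highest-rated risks surface first. The risks shown are illustrative.
risks = [
    {"risk": "Cloud credential leak", "likelihood": 3, "impact": 5, "owner": "Security lead"},
    {"risk": "Cost overrun on pilot service", "likelihood": 4, "impact": 2, "owner": "Service owner"},
    {"risk": "Data residency breach", "likelihood": 2, "impact": 5, "owner": "DPO"},
]

for r in risks:
    r["rating"] = r["likelihood"] * r["impact"]  # 1-25 scale

for r in sorted(risks, key=lambda r: r["rating"], reverse=True):
    print(f"{r['rating']:>2}  {r['risk']} (owner: {r['owner']})")
```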

How to do better

Below are rapidly actionable improvements:

  1. Adopt a Standardised Template

    • Provide a uniform risk register template across all projects.
    • Outline columns (e.g., risk description, category, likelihood, impact, owner, mitigations, target resolution date).
  2. Encourage Regular Cross-Project Reviews

    • Monthly or quarterly, each project lead presents top risks.
    • Creates awareness of shared or similar risks (like cloud credential leaks, compliance deadlines).
  3. Consolidate Key Risks

    • Extract major issues from each spreadsheet into a single “organisational risk summary” for senior leadership or departmental boards.
  4. Implement Basic Tool or Shared Repository

    • e.g., host a central SharePoint list, JIRA board, or Google Sheet consolidating all project-level risk inputs:
      • Minimises confusion while maintaining a single source of truth.
  5. Leverage Some Automation

By adopting a consistent template, hosting cross-project reviews, summarising top risks in an organisational-level register, using a shared tool or repository, and partly automating detection of cloud security concerns, you advance from ad-hoc spreadsheets to a more coordinated approach.

How to do better

Below are rapidly actionable ways to expand your formal risk register process:

  1. Introduce Real-Time Updates or Alerts

  2. Measure Risk Reduction Over Time

    • Track how mitigations lower risk levels. Summaries can feed departmental or board-level dashboards:
      • e.g., “Risk #12: Cloud credential leaks reduced from High to Medium after implementing MFA and secret rotation.”
  3. Encourage GRC Tools

    • Governance, Risk, and Compliance (GRC) tools can unify multiple registers:
      • e.g., ServiceNow GRC, RSA Archer, or open-source solutions.
    • Minimises duplication across large organisations or multiple projects.
  4. Link Mitigations to Budgets and Timelines

    • Where possible, highlight the cost or resource needed for each major mitigation:
      • Helps leadership see rationale for investing in e.g., staff training, new security tools.
  5. Adopt a Cloud-Specific Risk Taxonomy

    • Incorporate categories like “Data Residency,” “Vendor Lock-in,” “Cost Overrun,” or “Insecure IAM,” referencing NCSC or NIST guidelines.
    • Ensures team members identify typical cloud vulnerabilities systematically.

By setting up real-time triggers for new risks, visualising risk reduction, considering GRC tooling, linking mitigation to budgets, and classifying cloud-specific risk areas, you reinforce a structured risk registry that handles dynamic and evolving threats efficiently.

How to do better

Below are rapidly actionable ways to optimise integrated, centrally overseen risk management:

  1. Incorporate Cloud-Specific Telemetry

  2. Advance Real-Time Dashboards

    • Provide live risk dashboards for each department or service, updating as soon as a risk or its mitigations change:
      • e.g., hooking up GRC tools to Slack/Teams for immediate notifications.
  3. Use Weighted Scoring for Cloud Projects

  4. Formalise Risk Response Plans

  5. Encourage Cross-department Collaboration

By integrating real-time cloud telemetry into your central risk system, offering advanced dashboards, applying specialised scoring for cloud contexts, setting formal risk responses, and cross-collaborating among agencies, you achieve deeper, more proactive risk management.

How to do better

Below are rapidly actionable ways to enhance an already advanced, proactive risk management system:

  1. Adopt AI/ML for Predictive Risk

  2. Integrate Risk with DevSecOps

  3. Multi-Cloud or Hybrid Risk Consolidation

    • If operating across AWS, Azure, GCP, OCI, or on-prem, unify them in one advanced GRC or SIEM tool:
      • Minimises siloed risk reporting.
  4. Extend Collaborative Risk Governance

  5. Regularly Refresh Risk Tolerance & Metrics

    • Reassess risk thresholds to ensure they remain relevant.
    • If your environment scales or new HPC/AI workloads are introduced, adapt risk definitions accordingly.

By leveraging AI for predictive risk detection, embedding risk scoring in DevSecOps pipelines, consolidating multi-cloud/hybrid risk data, collaborating on risk boards across agencies, and regularly updating risk tolerance metrics, you optimise an already advanced, proactive risk management system—ensuring continuous alignment with evolving public sector challenges and security imperatives.

Keep doing what you’re doing, and consider documenting your advanced risk management approaches through blog posts or by opening pull requests to this guidance. Sharing such experiences helps other UK public sector organisations adopt progressive risk management strategies in alignment with NCSC, NIST, and GOV.UK best practices.


How do you manage staff identities? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond isolated identity management:

  1. Create a Basic Directory or SSO Pilot

  2. Maintain a Simple User Inventory

    • List out each app’s user base and identify duplicates or potential orphan accounts:
      • Helps to see the scale of the fragmentation problem.
  3. Encourage Unique Credentials

  4. Plan a Gradual Migration

    • Set a short timeline (e.g., 6-12 months) to unify at least a few key services under a single ID provider.
  5. Highlight Quick-Wins

    • If consolidating one or two widely used services to a shared login shows immediate benefits (less support overhead, better logs), use that success to rally internal support.

By implementing a small shared ID approach for new services, maintaining an org-wide user inventory, encouraging unique credentials with basic password hygiene, scheduling partial migrations, and publicising quick results, you steadily reduce the complexity and risk of scattered service-specific identities.

How to do better

Below are rapidly actionable steps to further unify your basic centralised identity:

  1. Mandate SSO for New Services

  2. Target Legacy Systems

    • Identify 1-3 high-value legacy applications and plan a short roadmap for migrating them to the central ID store:
      • e.g., rewriting authentication to use SAML or OIDC.
  3. Introduce Periodic Role or Access Reviews

  4. Extend MFA Requirements

  5. Aim for Full Integration by a Set Date

    • e.g., a 12-18 month plan to unify all services, presenting to leadership how this will lower security risk and reduce support costs.

By demanding SSO for new apps, migrating top-priority legacy systems, enabling periodic role reviews, enforcing MFA across the board, and setting a timeline for full integration, you reinforce your centralised identity approach and shrink vulnerabilities from leftover local user stores.

How to do better

Below are rapidly actionable ways to incorporate the last few outliers:

  1. Establish an “Exception Approval”

    • If a service claims it can’t integrate, mandate a formal sign-off by security or architecture boards:
      • Minimises indefinite exceptions.
  2. Plan Legacy Replacement or Integration

  3. Enhance Monitoring on Exceptions

  4. Regularly Reassess or Sunset Non-Compliant Services

    • If an exception remains beyond a certain period (e.g., 6-12 months), escalate to leadership.
    • This keeps pressure on removing exceptions eventually.
  5. Include Exceptions in Identity Audits

    • Ensure these standalone services aren’t forgotten in user account cleanup or security scanning efforts:
      • e.g., hooking them into an “all-of-org” identity or vulnerability scan at least quarterly.

By requiring official approval for non-integrated systems, scheduling integration projects, monitoring or sunsetting exceptions, and auditing them in the main identity reviews, you unify identity management and ensure consistent security across all cloud services.

How to do better

Below are rapidly actionable ways to enhance advanced integrated identity management:

  1. Explore Zero-Trust or Risk-Adaptive Auth

  2. Adopt Policy-as-Code for Identity

  3. Enable Fine-Grained Roles and Minimal Privileges

  4. Implement Automated Access Certification

    • Every few months, prompt managers to re-check their team’s privileges:
      • Tools like Azure AD Access Reviews, AWS IAM Access Analyzer, GCP IAM Recommender, or OCI IAM policy checks can highlight unneeded privileges.
  5. Sustain a Culture of Continuous Improvement

By implementing zero-trust or risk-based authentication, adopting identity policy-as-code, refining least privilege roles, automating access certifications, and fostering continuous improvements, you advance from a strong integrated identity environment to a cutting-edge, security-first approach aligned with UK public sector best practices.

How to do better

Below are rapidly actionable ways to refine a mandatory single source of identity:

  1. Implement Risk-Adaptive Authentication

  2. Extend Identity to Multi-Cloud

    • If you operate across multiple providers, unify identity definitions so staff seamlessly access AWS, Azure, GCP, or OCI:
      • Possibly referencing external IDPs or cross-cloud SSO integrations.
  3. Incorporate Passwordless Tech

  4. Align with Cross-Government Identity Initiatives

  5. Continuously Review and Audit

By adopting risk-based auth, ensuring multi-cloud identity unification, deploying passwordless approaches, collaborating with cross-government identity programs, and regularly auditing for compliance with the mandatory single source policy, you reinforce a top-tier security stance. This guarantees minimal identity sprawl and maximum accountability in the UK public sector environment.

Keep doing what you’re doing, and consider creating blog posts or making pull requests to this guidance about your advanced single-source identity management success. Sharing practical examples helps other UK public sector organisations move toward robust, consistent identity strategies.


How do you reduce the risk from staff with high-level access? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to bolster security beyond mere user vetting:

  1. Implement the Principle of Least Privilege

  2. Mandate MFA for Privileged Accounts

  3. Adopt Break-Glass Procedures

    • Provide normal user roles with day-to-day privileges. Escalation to super-user (root/admin) requires justification or time-limited credentials.
  4. Track Changes & Access

  5. Periodic Re-Vetting

By reinforcing least privilege, requiring MFA for admins, introducing break-glass accounts, logging privileged actions immutably, and scheduling re-vetting cycles, you address the limitations of purely one-time user vetting practices.

How to do better

Below are rapidly actionable steps for robust logging:

  1. Centralise Logs

  2. Implement Basic Retention Policies

  3. Add Tiered Access

    • Ensure only authorised security or audit staff can retrieve log data, particularly sensitive privileged user logs.
  4. Adopt Alerts or Scripting

    • If no advanced SIEM is in place, set simple alerts (e.g., AWS CloudWatch or Azure Monitor) for suspicious events (see the CloudTrail sketch after this list):
      • e.g., repeated authentication failures, unusual times for privileged actions.
  5. Plan for Future SIEM

By centralising logs, defining retention policies, restricting log access, employing basic alerts, and charting a path to a future SIEM or advanced monitoring approach, you progress from minimal log compliance to meaningful protective monitoring for privileged accounts.
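
Before investing in a SIEM, a small script can surface obvious signals; the sketch below (assuming AWS CloudTrail and boto3, with an illustrative 24-hour window) prints recent console sign-in failures, which could later become a CloudWatch or Azure Monitor alert.

```python
# Minimal sketch: pull recent AWS console sign-in events from CloudTrail and
# print the failures, as a stepping stone towards proper alerting.
# Assumes boto3; the 24-hour window is illustrative.
import json
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")
since = datetime.now(timezone.utc) - timedelta(hours=24)

events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "ConsoleLogin"}],
    StartTime=since,
)

for event in events["Events"]:
    detail = json.loads(event["CloudTrailEvent"])
    outcome = (detail.get("responseElements") or {}).get("ConsoleLogin")
    if outcome != "Success":
        who = (detail.get("userIdentity") or {}).get("arn", "unknown principal")
        print(event["EventTime"], who, outcome)
```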

How to do better

Below are rapidly actionable steps to enhance local audit log checks:

  1. Introduce Scheduled Log Reviews

    • e.g., once a month or quarter, verify logs remain present, complete, and show no anomalies:
      • Provide a short checklist or script for consistent checks.
  2. Adopt a Central Logging Approach

  3. Establish an Alerting Mechanism

    • Set triggers for suspicious events:
      • repeated privileged commands, attempts to disable logging, or high-volume data exfiltration events.
  4. Retest Periodically

  5. Involve Security/Operations in Reviews

    • Encourage cross-team peer reviews, so security staff or ops can weigh in on log completeness or retention policies.

By scheduling routine log reviews, centralising logs or employing a SIEM, establishing real-time alerts, retesting logs beyond initial go-live, and collaborating with security teams on checks, you elevate from one-time assessments to ongoing protective monitoring.

How to do better

Below are rapidly actionable ways to enhance a centralised, immutable audit logging approach:

  1. Incorporate a SIEM or Security Analytics

  2. Define Tiered Log Retention

    • Some logs might only need short retention, while privileged user logs or financial transaction logs might need multi-year retention, referencing departmental policies or NCSC recommended durations.
  3. Implement Role-Based Log Access

  4. Add Instant Alerts for High-Risk Actions

  5. Cross-department Collaboration

By coupling an advanced SIEM with defined retention tiers, enforcing role-based log access, setting real-time alerts for critical events, and collaborating beyond your department, you push your centralised, immutable logging approach to best-in-class standards aligned with public sector needs.

How to do better

Below are rapidly actionable suggestions to deepen advanced log audits and legal compliance:

  1. Formalise Forensic Readiness

  2. Simulate Real-World Insider Incidents

    • Conduct tabletop exercises or “red team” scenarios focusing on a privileged user gone rogue:
      • confirm the logs indeed catch suspicious actions and remain legally defensible.
  3. Adopt Chain-of-Custody Tools

  4. Engage with Legal/HR for Pre-Agreed Procedures

    • Ensure a consistent approach to handle suspected insider cases, clarifying roles for HR, security, legal, and management:
      • Minimises delays or confusion during investigations.
  5. Leverage Cross-department Insights

By refining your forensic readiness policy, running insider threat simulations, implementing chain-of-custody measures, coordinating with legal/HR teams, and exchanging insights cross-department, you maximise the readiness and legal defensibility of your logs, ensuring robust protection against privileged internal threats in the UK public sector environment.

Keep doing what you’re doing, and consider blogging or creating pull requests to share these advanced approaches for safeguarding logs, verifying legal readiness, and mitigating privileged insider threats. Such knowledge helps strengthen collective security practices across UK public sector organisations.


How do you keep your software supply chain secure? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to handle unmanaged dependencies more safely:

  1. Adopt Basic Package Manifests

    • Even if you install packages with apt, create a minimal list of versions used. For language-based repos (Node, Python, etc.), commit package.json / Pipfile or equivalent:
      • Minimises drift and ensures consistent builds.
  2. Begin Generating a Simple SBOM (a minimal package-inventory sketch follows this list)

  3. Enable Automatic or Regular Patch Checks

  4. Document a Basic Update Policy

  5. Plan an Overhaul to Managed Dependencies

    • In the next 3-6 months, decide on a standard approach for dependencies:
      • e.g., using Node’s package-lock.json, Python’s requirements.txt, or Docker images pinned to specific versions.

By adopting minimal package manifests, generating basic SBOM data, automating patch checks, documenting an update policy, and planning a transition toward managed dependencies, you lay the groundwork for a more secure, transparent software supply chain.
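
As a first step towards an SBOM, the sketch below uses only the Python standard library to dump the packages installed in an environment as a JSON inventory; a full SBOM format such as CycloneDX or SPDX can replace this once tooling is in place. The output file name is illustrative.

```python
# Minimal sketch: record the Python packages installed in an environment as a
# simple JSON inventory, using only the standard library.
import json
from importlib.metadata import distributions

packages = [
    {"name": dist.metadata["Name"], "version": dist.version}
    for dist in distributions()
    if dist.metadata["Name"]  # skip any malformed metadata
]
packages.sort(key=lambda item: item["name"].lower())

with open("package-inventory.json", "w") as f:
    json.dump(packages, f, indent=2)

print(f"Recorded {len(packages)} packages in package-inventory.json")
```

Committing an inventory like this alongside lock files gives auditors and security teams a consistent view of what each build actually contains.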

How to do better

Below are rapidly actionable ways to strengthen basic dependency management:

  1. Automate Regular Dependency Scans

  2. Define a Scheduled Update Policy

  3. Maintain SBOM or Lock Files

  4. Enable Alerting for Known Vulnerabilities

  5. Document Emergency Patching

    • Formalise an approach for urgent CVE patching outside major releases.
    • Minimises ad-hoc panic when a high severity bug appears.

By automating scans, scheduling regular update windows, maintaining SBOM or lock files, setting up vulnerability alerts, and establishing a well-defined emergency patch process, you move from ad-hoc monitoring to a more structured, frequent approach that better secures the software supply chain.

How to do better

Below are rapidly actionable ways to strengthen proactive repository remediation:

  1. Introduce Risk Scoring or Context

    • Distinguish vulnerabilities that truly impact your code path from those that are unreferenced dependencies:
      • e.g., using advanced scanning tools like Snyk, Sonatype, or vendor-based solutions.
  2. Adopt Container and OS Package Scanning

  3. Refine Automated Testing

  4. Define an SLA for Fixes

  5. Document & Track Exceptions

    • If a patch is delayed (e.g., due to breakage risk), keep a formal record of why and a timeline for resolution:
      • Minimises the chance of indefinite deferral of serious issues.

By introducing vulnerability risk scoring, scanning container/OS packages, enhancing test automation for new patches, setting fix SLAs, and controlling deferrals, you significantly improve the proactive repository-level remediation approach across your entire software estate.

How to do better

Below are rapidly actionable ways to refine centralised, context-aware triage:

  1. Add Real-Time Threat Intelligence

  2. Automate Contextual Analysis

    • Tools that parse call graphs or code references to see if a vulnerable function is actually invoked:
      • Minimises false positives and patch churn.
  3. Collaborate with Dev Teams

    • If a patch might break production, the SOC can coordinate safe rollout or canary testing to confirm stability before mandatory updates.
  4. Measure & Publish Remediation Metrics

    • e.g., average time to fix a critical CVE or high severity vulnerability.
    • Encourages healthy competition and accountability across teams.
  5. Align with Overall Risk Registers

By integrating real-time threat intel, employing contextual code usage analysis, collaborating with dev for safe patch rollouts, tracking remediation metrics, and linking to broader risk management, you elevate centralised monitoring to a dynamic, strategic posture in addressing supply chain security.

How to do better

Below are rapidly actionable ways to refine advanced, integrated supply chain security:

  1. Implement Automated Policy-as-Code

  2. Extend SBOM Generation & Validation

  3. Adopt Multi-Factor Scanning

  4. Coordinate with Supplier/Partner Security

  5. Drive a Security-First Culture

By implementing policy-as-code in your pipelines, strengthening SBOM usage, blending multiple scanning techniques, managing upstream vendor security, and fostering a security-first ethos, you sustain a cutting-edge supply chain security environment—ensuring minimal risk, maximum compliance, and rapid threat response across UK public sector software development.


How do you find and fix security problems, vulnerabilities, and misconfigurations? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to implement basic vulnerability reporting:

  1. Publish a Simple Disclosure Policy

  2. Set Up a Dedicated Email or Form

    • Provide a clear email (like security@yourdomain.gov.uk) or secure submission form:
      • Minimises confusion about who to contact.
  3. Respond with a Standard Acknowledgement

    • Even an automated template that thanks the researcher and notes you’ll follow up within X days fosters trust.
  4. Engage Leadership

    • Brief senior management that ignoring external reports can lead to missed critical vulnerabilities.
  5. Plan a Gradual Evolution

    • Over the next 6-12 months, consider joining a responsible disclosure platform or adopting a bug bounty approach for larger-scale feedback.

By defining a minimal disclosure policy, setting up a dedicated channel, creating an acknowledgment workflow, involving leadership awareness, and planning for future expansions, you shift from no vulnerability management to a more transparent and open approach that encourages safe vulnerability reporting.

How to do better

Below are rapidly actionable ways to evolve beyond a standard disclosure policy:

  1. Link Policy with Internal Remediation SLAs

    • For example, “critical vulnerabilities responded to within 24 hours, resolved or mitigated within 7 days,” to ensure a consistent process.
  2. Integrate with DevSecOps

  3. Offer Coordinated Vulnerability Disclosure Rewards

    • If feasible, small gestures (like public thanks or acknowledgement) or bug bounty tokens encourage more thorough testing from external researchers.
  4. Publish Summary of Findings

    • Periodically share anonymised or high-level results of vulnerability disclosures, illustrating how quickly you resolved them.
    • Builds trust with citizens or partner agencies.
  5. Combine with Automated Tools

By defining clear internal SLAs, integrating vulnerability disclosures into dev workflows, offering small acknowledgments or bounties, releasing summary fix timelines, and coupling with continuous scanning tools, you can both refine external disclosure processes and ensure robust internal vulnerability management.

How to do better

Below are rapidly actionable ways to enhance scanning and regular assessments:

  1. Expand to Multi-Layer Scans

  2. Adopt Real-Time or Daily Scans

    • If feasible, move from monthly/quarterly to daily or per-commit scanning in your CI/CD pipeline.
    • Early detection fosters quicker fixes.
  3. Integrate with SIEM

  4. Prioritise with Risk Scoring

  5. Publish Shared “Security Scorecards”

    • Departments or teams see summary risk/vulnerability data. Encourages knowledge sharing and a culture of continuous improvement.

By broadening scanning layers, shifting to more frequent scans, integrating results in a SIEM, risk-scoring discovered issues, and creating departmental security scorecards, you refine a robust automated scanning regimen that swiftly addresses vulnerabilities.

How to do better

Below are rapidly actionable methods to refine proactive threat hunting and incident response:

  1. Adopt Purple Teaming

    • Combine red team (offensive) and blue team (defensive) exercises periodically to test detection and response workflows.
    • e.g., referencing NCSC red teaming best practices.
  2. Enable Automated Quarantine

  3. Add Forensic Readiness

  4. Integrate Cross-Government Threat Intel

  5. Expand Zero-Trust Microsegmentation

By introducing purple teaming, automating quarantine procedures, ensuring forensic readiness, collaborating on threat intel across agencies, and adopting zero-trust microsegmentation, you deepen your proactive stance and expedite incident responses.

How to do better

Below are rapidly actionable ways to optimise comprehensive security operations:

  1. Incorporate HPC/AI Security

  2. Include Third-Party Supply Chain

  3. Automate Cross-Cloud Security

  4. Public-Sector Collaboration

  5. Continuously Evaluate Zero-Trust

By adopting HPC/AI-targeted checks, incorporating suppliers in red team exercises, unifying multi-cloud threat intelligence, collaborating across public sector units, and reinforcing zero-trust initiatives, you further enhance your holistic security operations. This ensures comprehensive, proactive defense against sophisticated threats and misconfigurations in the UK public sector context.

Keep doing what you’re doing, and consider blogging or opening pull requests to share your advanced security operations approaches. This knowledge supports other UK public sector organisations in achieving robust threat/vulnerability management and protective monitoring aligned with NCSC, NIST, and GOV.UK best practices.


How do you secure your network and control access? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to strengthen or evolve from perimeter-only security:

  1. Introduce MFA for Privileged Access

  2. Implement Least-Privilege IAM

  3. Segment Networks Internally

  4. Enable TLS Everywhere

  5. Plan for Identity-Based Security

    • Over the next 6-12 months, pilot a small zero-trust or identity-centric approach for a less critical app, paving the way to reduce dependence on perimeter rules.

By enforcing multi-factor authentication, introducing least-privilege IAM, segmenting networks internally, ensuring end-to-end TLS, and planning a shift toward identity-based models, you move beyond the risks of purely perimeter-centric security.

How to do better

Below are rapidly actionable ways to extend identity verification:

  1. Enforce MFA for All Users

  2. Increase Granularity of Access Controls

  3. Adopt SSO

  4. Enable Auditing & Logging

  5. Consider Device Trust or Conditional Access

    • If feasible, require verified device posture (up-to-date OS, security agent running) before granting app access.

By mandating MFA for all, refining role-based or service-level access, introducing SSO, logging all user actions, and optionally checking device security posture, you significantly reduce reliance on a single perimeter gate.

How to do better

Below are rapidly actionable ways to strengthen user+service identity verification:

  1. Use mTLS or Short-Lived Tokens

  2. Adopt Policy-as-Code

  3. Enforce Request-Level Authorisation

  4. Implement JIT Privileges

    • For especially sensitive or admin tasks, require ephemeral or just-in-time escalation tokens (with a short lifetime).
  5. Log & Analyze Service-to-Service Interactions

By implementing mTLS or ephemeral tokens for user+service identity, deploying policy-as-code, requiring request-level authorisation, enabling JIT privileges for critical tasks, and thoroughly logging microservice communications, you move closer to a robust zero-trust framework within a partially perimeter-based model.

How to do better

Below are rapidly actionable ways to deepen identity-centric security:

  1. Retire or Restrict VPN

  2. Embed Device Trust

    • Combine user identity with device compliance checks:
      • e.g., [Azure AD Conditional Access with device compliance, Google BeyondCorp device posture, AWS or OCI solutions integrated with MDM] for advanced zero-trust.
  3. Embrace Microsegmentation

  4. Establish Single Sign-On for All

  5. Continuously Train Staff

By methodically retiring or limiting VPN usage, integrating device posture checks, employing microsegmentation, standardising single sign-on for all apps, and training staff on the identity-centric model, you further reduce perimeter dependence and approach a more robust zero-trust posture.

How to do better

Below are rapidly actionable ways to sustain no-perimeter, identity-based security:

  1. Refine Device & User Risk Scoring

  2. Enforce Continuous Authentication

  3. Extend Zero-Trust to Microservices

  4. Use Policy-as-Code

  5. Collaborate & Share

    • As a leading zero-trust example, share your experiences or case studies with other public sector bodies, referencing cross-government events or guidance from GDS / NCSC communities.

By deploying advanced device risk scoring, introducing continuous re-auth, expanding zero trust to microservices, employing policy-as-code for dynamic guardrails, and collaborating across the public sector, you refine your environment as a modern, identity-centric security pioneer, fully detached from traditional network perimeters and VPN reliance.
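
As an illustration of combining device and user signals into a risk score that drives continuous re-authentication, the sketch below is deliberately simplified; the signals, weights, and thresholds are all hypothetical and would in practice come from your identity provider, MDM, and threat-intelligence feeds.

```python
from dataclasses import dataclass


@dataclass
class AccessContext:
    """Signals gathered for a single access request (all hypothetical)."""
    device_compliant: bool   # e.g. reported by your MDM
    known_location: bool     # e.g. matches the user's usual country/network
    recent_mfa: bool         # MFA completed within the current session window
    impossible_travel: bool  # flagged by your identity provider


def risk_score(ctx: AccessContext) -> int:
    """Combine signals into a simple additive score; higher means riskier."""
    score = 0
    if not ctx.device_compliant:
        score += 40
    if not ctx.known_location:
        score += 20
    if not ctx.recent_mfa:
        score += 20
    if ctx.impossible_travel:
        score += 60
    return score


def decide(ctx: AccessContext) -> str:
    """Map the score onto allow / step-up / block decisions."""
    score = risk_score(ctx)
    if score >= 60:
        return "block"
    if score >= 30:
        return "step-up"  # force re-authentication before continuing
    return "allow"


print(decide(AccessContext(True, True, True, False)))    # allow
print(decide(AccessContext(True, False, False, False)))  # step-up
print(decide(AccessContext(False, False, True, True)))   # block
```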

Keep doing what you’re doing, and consider writing up your experiences or opening pull requests to share your zero-trust or identity-centric security transformations. This knowledge benefits other UK public sector organisations striving to reduce reliance on network perimeters and adopt robust, identity-first security models.


How do you use two-factor or multi-factor authentication (2FA/MFA)? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move from an “encouraged” MFA model to a consistent approach:

  1. Identify Privileged Accounts First

  2. Educate Staff on Risks

  3. Incentivise Voluntary Adoption

    • Recognise teams or individuals who enable MFA (e.g., shout-outs or small accolades).
    • Encourages cultural acceptance before a final mandate.
  4. Publish a Simple Internal FAQ

  5. Plan a Timeline for Mandatory MFA

    • Over 3–6 months, aim to require MFA for all staff accessing sensitive services, at a minimum.

By prioritising MFA for privileged users, educating staff on credential compromise scenarios, incentivising early adoption, providing user-friendly setup instructions, and scheduling a near-future MFA mandate, you evolve from optional guidance to real protective measures.

How to do better

Below are rapidly actionable methods to close the enforcement gap:

  1. Enable Enforcement in Cloud IAM

  2. Monitor for Noncompliance

  3. Apply a Hard Deadline

    • Communicate a date beyond which single-factor logins will be revoked, referencing official departmental or local policy.
  4. Offer Support & Tools

  5. Handle Legacy Systems

By enabling built-in MFA enforcement, monitoring compliance, communicating a strict cutoff date, supplying alternative authenticators, and bridging older systems with SSO or proxy solutions, you systematically remove any gaps that allow single-factor access.
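
To support the monitoring and hard-deadline steps, a small compliance sweep can list which accounts still lack a second factor. The sketch below assumes AWS IAM users managed via the boto3 SDK; equivalent queries exist for Azure AD and Google Cloud Identity.

```python
import boto3


def users_without_mfa() -> list[str]:
    """Return IAM user names that have no MFA device registered."""
    iam = boto3.client("iam")
    missing = []
    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            devices = iam.list_mfa_devices(UserName=user["UserName"])["MFADevices"]
            if not devices:
                missing.append(user["UserName"])
    return missing


if __name__ == "__main__":
    for name in users_without_mfa():
        print(f"MFA not enabled: {name}")  # feed this into your noncompliance report
```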

How to do better

Below are rapidly actionable ways to remove or mitigate the last few exceptions:

  1. Document a Sunset Plan for Exceptions

    • If a system can’t integrate MFA now, define a target date or solution path (like an MFA-proxy or upgrade).
    • Minimises indefinite exceptions.
  2. Risk-Base or Step-Up

  3. Consider Device-Focused Security

  4. Combine with Identity-Centric Security

  5. Review & Renew

    • Periodically re-check each exception’s rationale—some may no longer be valid as technology or policies evolve.

By planning for the eventual elimination of exceptions, deploying step-up authentication for sensitive tasks, ensuring device posture checks for minimal-risk scenarios, integrating identity-based zero-trust, and reviewing exceptions regularly, you further strengthen your universal MFA adoption.

How to do better

Below are rapidly actionable enhancements:

  1. Adopt FIDO2 or Hardware Security Keys

  2. Set Up Backup Mechanisms

  3. Integrate Risk-Based Policies

  4. Consider Device Certificates

    • For some use cases, device-based certificates or mTLS can supplement user factors, further preventing compromised endpoints from impersonation.
  5. Regularly Revisit Factor Security

By introducing hardware-based MFA, ensuring robust fallback processes, applying risk-based authentication for suspicious attempts, deploying device certs, and staying alert to newly discovered factor vulnerabilities, you push your “no weak MFA” stance to a sophisticated, security-first environment.

How to do better

Below are rapidly actionable ways to optimise hardware-based MFA:

  1. Embrace Risk-Based Authentication

  2. Implement Zero-Trust & Microsegmentation

  3. Maintain Inventory & Lifecycle

    • Automate key distribution, revocation, or replacement. If a staff member loses a token, the system quickly blocks it.
    • e.g., a central asset management or HR-driven approach ensuring no leftover active tokens for departed staff.
  4. Test Against Realistic Threats

  5. Plan for Cross-department Interoperability

By coupling hardware tokens with adaptive risk checks, adopting zero-trust microsegmentation for each request, carefully managing the entire token lifecycle, running targeted red team tests, and exploring cross-department usage, you elevate an already stringent hardware-based MFA approach to a seamlessly integrated, high-security ecosystem suitable for sensitive UK public sector operations.
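
For the inventory-and-lifecycle step, the sketch below shows one hypothetical shape for an HR-driven token register: when a leaver feed arrives, any hardware keys still assigned to departed staff are flagged for immediate revocation. The data model and revocation hook are illustrative; a real deployment would also push the revocation to your identity provider.

```python
from dataclasses import dataclass


@dataclass
class HardwareToken:
    serial: str
    assigned_to: str  # staff identifier from HR
    active: bool = True


def revoke_tokens_for_leavers(tokens: list[HardwareToken], leavers: set[str]) -> list[str]:
    """Deactivate any token still assigned to someone who has left.

    Returns the serial numbers revoked so they can be reported and reclaimed.
    """
    revoked = []
    for token in tokens:
        if token.active and token.assigned_to in leavers:
            token.active = False
            revoked.append(token.serial)
    return revoked


register = [
    HardwareToken("FIDO-0001", "alice"),
    HardwareToken("FIDO-0002", "bob"),
]
print(revoke_tokens_for_leavers(register, leavers={"bob"}))  # ['FIDO-0002']
```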

Keep doing what you’re doing, and consider sharing your experiences or opening pull requests to this guidance. Others in the UK public sector can learn from how you enforce robust MFA standards, whether using FIDO2 hardware keys, advanced risk-based checks, or zero-trust patterns.


How do you manage privileged access? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond ad-hoc privileged credential management:

  1. Create a Basic Privileged Access Policy

  2. Mandate Individual Admin Accounts

    • Eliminate shared “admin” user logins. Each privileged user gets a unique account so you can track actions.
  3. Introduce MFA for Admins

  4. Document & Track Privileged Roles

    • Keep a minimal register or spreadsheet listing all privileged accounts, systems they access, and assigned owners:
      • Helps see if too many administrators exist.
  5. Schedule Transition to Vaulting

By creating a short privileged access policy, enforcing unique admin accounts with MFA, documenting roles, and preparing for a vault-based solution, you significantly reduce the risk of ad-hoc mismanagement and insider threats.

How to do better

Below are rapidly actionable steps to refine centralised vaulting:

  1. Enable Automatic Credential Rotation

  2. Integrate with CI/CD

    • If dev pipelines need privileged credentials (e.g., for deployment), fetch them from the vault at runtime, never storing them in code or config:
  3. Automate Access Reviews

  4. Adopt Fine-Grained Access Policies

  5. Add Multi-Factor for Vault Access

By rotating credentials automatically, integrating vault secrets into CI/CD, conducting periodic access reviews, refining vault access policies, and enforcing MFA for vault retrieval, you build a stronger, more secure foundation for privileged credentials management.
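
For the CI/CD integration step, the sketch below uses the hvac client for HashiCorp Vault to fetch a deployment credential at runtime rather than storing it in code or pipeline config. The Vault address, secret path, and field name are assumptions; the same pattern applies to AWS Secrets Manager, Azure Key Vault, or GCP Secret Manager.

```python
import os
import hvac  # HashiCorp Vault client


def fetch_deploy_credential() -> str:
    """Read a deployment credential from Vault at pipeline runtime.

    The pipeline only holds a Vault token (ideally issued per job); the actual
    credential never appears in source control or pipeline configuration.
    """
    client = hvac.Client(
        url=os.environ["VAULT_ADDR"],     # e.g. https://vault.example.internal
        token=os.environ["VAULT_TOKEN"],  # injected by the CI system per job
    )
    if not client.is_authenticated():
        raise RuntimeError("Vault authentication failed")

    # Hypothetical KV v2 path holding the deployment credential.
    secret = client.secrets.kv.v2.read_secret_version(path="ci/deploy")
    return secret["data"]["data"]["api_key"]


if __name__ == "__main__":
    credential = fetch_deploy_credential()
    # ...use the credential for the deployment step, then let it rotate/expire.
```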

How to do better

Below are rapidly actionable ways to strengthen identity administration and OTP usage:

  1. Integrate OTP into Break-Glass Procedures

  2. Use Security Keys for Admin Access

  3. Automate Logging & Alerts

  4. Schedule Regular Privileged Access Reviews

  5. Expand OTP to Non-Human Accounts

    • Where feasible, issue short-lived tokens for services or automation tasks too, fostering ephemeral credentials.

By embedding OTP steps in break-glass procedures, adopting hardware tokens for admins, enabling automated logs/alerts, reviewing privileged roles frequently, and using ephemeral tokens for services as well, you build a more rigorous privileged access model with robust checks.
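
As an example of ephemeral credentials for non-human accounts, the sketch below uses AWS STS via boto3 to assume a narrowly scoped role for fifteen minutes; the role ARN and session name are placeholders, and Azure managed identities, GCP short-lived service account tokens, and OCI dynamic groups offer equivalent mechanisms.

```python
import boto3


def get_ephemeral_session(role_arn: str, task_name: str) -> boto3.Session:
    """Assume a narrowly scoped role and return a session whose credentials
    expire automatically after 15 minutes."""
    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=task_name,  # shows up in CloudTrail for auditability
        DurationSeconds=900,        # minimum allowed lifetime: 15 minutes
    )
    creds = resp["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


# Hypothetical automation role used by a nightly batch job.
session = get_ephemeral_session(
    "arn:aws:iam::123456789012:role/batch-export-readonly", "nightly-export"
)
print(session.client("s3").list_buckets()["Buckets"])
```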

How to do better

Below are rapidly actionable ways to elevate automated, risk-based privileged access:

  1. Incorporate Threat Intelligence

  2. Tie Access to Device Posture

  3. Implement Granular Observability

  4. Automate Just-in-Time (JIT) Access

  5. Regular Security Drills

By combining threat intelligence, verifying device posture, enabling granular session-level logging, adopting just-in-time privileges, and running regular security exercises, you further refine risk-based controls for privileged access across all cloud platforms.

How to do better

Below are rapidly actionable ways to optimise context-aware just-in-time privileges:

  1. Deeper Risk-Based Logic

  2. Enforce Micro-Segmentation

    • Combine ephemeral privileges with strict micro-segmentation: each resource requires a separate ephemeral token:
      • Minimises lateral movement if any one credential is compromised.
  3. Incorporate Real-Time Forensic Tools

  4. Enable AI/ML Anomaly Detection

  5. Regular Multi-Stakeholder Drills

By enhancing risk-based logic in JIT access, pairing ephemeral privileges with micro-segmentation, adopting real-time forensic checks, integrating AI-based anomaly detection, and practising multi-stakeholder drills, you perfect a context-aware just-in-time privileged access model that secures the most sensitive operations in the UK public sector context.

Keep doing what you’re doing, and consider blogging or creating pull requests to share your experiences in implementing advanced privileged access systems with just-in-time context-based controls. Such knowledge benefits other UK public sector bodies aiming to secure administrative actions under a zero-trust, ephemeral access paradigm.


How does your organisation respond to security breaches and incidents? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to move beyond manual classification:

  1. Adopt a Simple Data Classification Scheme

  2. Introduce Basic Tooling

  3. Require Access Controls

  4. Document a Minimal Process

    • A short policy clarifying how staff label data, who can reclassify, and how they request access changes:
      • Minimises confusion or inconsistent labelling.
  5. Plan for Automated Classification

By introducing a simple classification scheme, adopting minimal tooling for labelling, ensuring basic least-privilege access, documenting a short classification process, and preparing for automated solutions, you create a more structured approach to data security than purely manual methods.
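
To make the classification and tooling steps concrete, a minimal sketch is shown below: it applies a label drawn from the UK government security classification scheme (e.g. OFFICIAL, OFFICIAL-SENSITIVE) as an object tag on an S3 object via boto3, so access policies and DLP tooling can key off the tag. The bucket, key, and helper function are illustrative assumptions.

```python
from enum import Enum
import boto3


class Classification(Enum):
    OFFICIAL = "OFFICIAL"
    OFFICIAL_SENSITIVE = "OFFICIAL-SENSITIVE"


def label_object(bucket: str, key: str, level: Classification) -> None:
    """Attach a classification tag to a stored object so access rules,
    retention, and DLP alerts can be driven from the label."""
    s3 = boto3.client("s3")
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "classification", "Value": level.value}]},
    )


# Hypothetical example: mark a published report as OFFICIAL.
label_object("example-dept-reports", "2024/annual-report.pdf", Classification.OFFICIAL)
```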

How to do better

Below are rapidly actionable ways to strengthen centralised data security policies:

  1. Implement Automated Policy Enforcement

  2. Add Tiered Access

  3. Consolidate Data Stores

  4. Define a Data Lifecycle

  5. Monitor for Policy Deviations

By automating policy enforcement, requiring tiered access for sensitive data, consolidating data stores, clarifying data lifecycle, and monitoring for policy anomalies, you refine your centralised data security approach, ensuring consistent coverage and minimal manual drift.

How to do better

Below are rapidly actionable ways to expand limited monitoring:

  1. Adopt or Expand DLP Tools

  2. Integrate SIEM for Correlation

  3. Add Real-Time Alerts

    • If a user downloads an unusually large amount of data or from unusual IPs, trigger immediate SOC or security team notifications.
  4. Include Lateral Movement Checks

  5. Regular Drills and Tests

    • Simulate data exfiltration attempts or insider threats to test whether your limited monitoring actually picks up suspicious events.

By leveraging or expanding DLP solutions, correlating logs in a SIEM, implementing real-time anomaly alerts, detecting lateral movement, and running exfiltration drills, you enhance your approach from partial monitoring to more comprehensive oversight of data movements.
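
The real-time alerting step can start very simply. The sketch below is a hypothetical detector that flags users whose download volume in a window exceeds a threshold, or who connect from an IP range you do not recognise; the record format, trusted ranges, threshold, and alert handling are all assumptions and would normally live inside your SIEM or DLP tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Hypothetical trusted ranges (e.g. office or departmental VPN egress).
TRUSTED_RANGES = [ip_network("203.0.113.0/24")]
DOWNLOAD_THRESHOLD_BYTES = 5 * 1024**3  # 5 GiB per window; tune to your baseline


@dataclass
class AccessRecord:
    user: str
    source_ip: str
    bytes_downloaded: int


def find_suspicious(records: list[AccessRecord]) -> list[str]:
    """Return alert messages for unusually large downloads or untrusted source IPs."""
    totals: dict[str, int] = defaultdict(int)
    alerts = []
    for rec in records:
        totals[rec.user] += rec.bytes_downloaded
        if not any(ip_address(rec.source_ip) in net for net in TRUSTED_RANGES):
            alerts.append(f"{rec.user}: access from untrusted IP {rec.source_ip}")
    for user, total in totals.items():
        if total > DOWNLOAD_THRESHOLD_BYTES:
            alerts.append(f"{user}: downloaded {total} bytes this window")
    return alerts


sample = [
    AccessRecord("alice", "203.0.113.10", 200_000_000),
    AccessRecord("bob", "198.51.100.7", 6 * 1024**3),
]
for alert in find_suspicious(sample):
    print(alert)  # in practice, route to your SOC / SIEM rather than printing
```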

How to do better

Below are rapidly actionable methods to reinforce automated detection:

  1. Risk-Scored Alerts

  2. Automated Quarantine & Blocking

  3. Integrate Threat Intelligence

    • Use external feeds or cross-government intel to see if certain IP addresses or tactics target your data assets.
  4. Regularly Update Detection Rules

    • Threat patterns evolve; schedule monthly or quarterly rule reviews to incorporate the latest TTPs (tactics, techniques, and procedures) used by adversaries.
  5. Drill Data Restoration

By adding risk-scored alerts, automatically quarantining suspicious activity, incorporating threat intelligence, periodically updating detection rules, and verifying backups or DR for data restoration, you create a highly adaptive system that promptly detects and mitigates data breach attempts.

How to do better

Below are rapidly actionable ways to refine fully automated, proactive data security:

  1. Leverage AI/ML for Data Anomalies

  2. Adopt Policy-as-Code

  3. Expand Zero-Trust Microsegmentation

    • Ensure each request for data is validated at the identity, device posture, and context level, even inside your environment:
  4. Cross-Government Data Sharing

  5. Regular “Chaos” or Stress Tests

By employing AI-driven anomaly detection, embedding policy-as-code for data security, adopting zero-trust microsegmentation, collaborating on cross-government data controls, and running robust chaos or stress tests, you sustain a cutting-edge, proactive data protection approach suitable for the evolving demands of UK public sector operations.

Keep doing what you’re doing, and consider blogging about or opening pull requests to share how you maintain or improve your data breach mitigation strategies. Your experiences support other UK public sector organisations, reinforcing best practices under NCSC, NIST, and GOV.UK guidance.

Technology

How do you choose technologies for new projects? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable ways to move away from fully independent, unaligned technology decisions:

  1. Start a Basic Tech Catalog

    • Document each major technology used across projects, referencing at least its version, licensing, and security posture:
      • Helps discover overlaps or common solutions already in use.
  2. Create a Minimal Governance Policy

  3. Encourage Knowledge Sharing

    • Run short “tech share” sessions, where teams present why they picked certain tools:
      • fosters cross-project alignment.
  4. Identify Quick-Win Common Tools

    • E.g., centralised logging or container orchestration solutions (AWS ECS/EKS, Azure AKS, GCP GKE, OCI OKE) to standardise at least some operational aspects.
  5. Plan for a Tech Radar or Steering Group

    • Over the next 3–6 months, propose forming a small cross-departmental group or technology radar process to guide future selections.

By documenting existing tools, drafting minimal governance, facilitating knowledge exchange, pinpointing shared solutions, and preparing a technology steering approach, you mitigate fragmentation while still preserving some project autonomy.
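
A basic tech catalogue does not need special tooling to start. The sketch below shows one hypothetical shape for catalogue entries (technology, version, licence, security notes, owning team) that can be kept in a repository and queried for overlaps; all entries and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    name: str
    version: str
    licence: str
    security_notes: str
    owning_team: str


CATALOG = [
    CatalogEntry("PostgreSQL", "15", "PostgreSQL Licence", "patched monthly", "Platform"),
    CatalogEntry("PostgreSQL", "11", "PostgreSQL Licence", "nearing end of support", "Casework"),
    CatalogEntry("Redis", "7.2", "BSD-3-Clause", "internal network only", "Platform"),
]


def find_overlaps(entries: list[CatalogEntry]) -> dict[str, list[str]]:
    """Group versions by technology name to spot duplication or drift across teams."""
    grouped: dict[str, list[str]] = {}
    for entry in entries:
        grouped.setdefault(entry.name, []).append(f"{entry.version} ({entry.owning_team})")
    return {name: versions for name, versions in grouped.items() if len(versions) > 1}


print(find_overlaps(CATALOG))  # {'PostgreSQL': ['15 (Platform)', '11 (Casework)']}
```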

How to do better

Below are rapidly actionable ways to refine a uniform tech mandate:

  1. Allow Exceptions via a Lightweight Process

  2. Maintain a Living “Approved List”

    • Encourage periodic updates to the mandated stack, adding modern solutions (like container orchestration or microservice frameworks) that align with cost and security best practices:
  3. Pilot Innovations

  4. Implement Regular Tech Reviews

    • e.g., every 6–12 months, a board or steering group reviews the mandated stack in light of feedback or new GDS or NCSC recommendations.
  5. Combine with Security & Cost Insights

    • Show how uniform solutions reduce risk and expense, reassuring teams that standardisation benefits them while still enabling progress in areas like containerisation or DevSecOps.

By allowing exceptions via a straightforward process, regularly updating the approved tech list, sponsoring pilot projects, scheduling periodic reviews, and highlighting cost/security gains, you preserve the benefits of uniform technology while avoiding stagnation or shadow IT.

How to do better

Below are rapidly actionable ways to revitalise or replace outdated resources:

  1. Initiate a Quick Radar Refresh

  2. Introduce a Living “Tech Patterns” Wiki

    • Encourage teams to add their experiences or recommended patterns, so the resource remains collaborative and dynamic:
      • e.g., referencing [Confluence, GitHub Wiki, or internal SharePoint with version control].
  3. Schedule Semi-Annual Reviews

  4. Gather Feedback

    • Ask project teams what patterns they rely on or find missing. Include new technologies that have proven valuable:
      • fosters a sense of collective ownership.
  5. Use Real Examples

    • Populate the updated patterns with success stories from internal projects that solved real user needs.

By quickly refreshing the tech radar, establishing a living wiki, scheduling periodic updates, gathering project feedback, and focusing on real success stories, you transform outdated references into a relevant, frequently consulted guide that shapes better technology decisions.

How to do better

Below are rapidly actionable ways to enhance current, well-used guidance:

  1. Introduce a “Feedback Loop”

  2. Add Security & Cost Criteria

  3. Practice “Sunsetting”

    • If a technology on the radar is outdated or replaced, mark it for deprecation with a recommended timeline:
      • Minimises legacy tech usage.
  4. Conduct Regular Showcases

    • Let teams demo how they used a recommended pattern or overcame a challenge.
    • Encourages synergy and real adoption.
  5. Cross-Gov Collaboration

By enhancing feedback channels, adding security/cost insights to each item, marking deprecated technologies, hosting showcases, and collaborating across agencies, you keep the guidance fresh, relevant, and beneficial for new project tech decisions.

How to do better

Below are rapidly actionable ways to strengthen a collaborative, evolving tech ecosystem:

  1. Establish a Formal Inner-Source Model

    • Encourage code sharing or libraries across departments, referencing open-source practices but within the public sector context:
  2. Encourage Pairing or Multi-Dept Projects

  3. Recognise Innovators

    • Publicly highlight staff who introduce successful new frameworks or cost-saving architecture patterns:
      • fosters a healthy “improvement” culture.
  4. Adopt Cross-department Show-and-Tell

  5. Integrate Feedback into Tech Radar

    • Each time a new solution is proven, update the radar or patterns promptly:
      • ensuring the living doc truly represents real usage and best practice.

By establishing an inner-source approach, supporting short cross-team collaborations, celebrating innovators, connecting with other public sector bodies for knowledge sharing, and consistently updating patterns or the tech radar, you continuously evolve an energetic ecosystem that fosters reuse, innovation, and high-quality technology decisions.

Keep doing what you’re doing, and consider writing some blog posts or opening pull requests to share how your collaborative, evolving tech environment benefits your UK public sector organisation. This helps others adopt or improve similar patterns and fosters a culture of open innovation across government.


What best describes your current technology stack? [change your answer]

You did not answer this question.

How to do better

Below are rapidly actionable steps to transition from a monolithic approach:

  1. Identify Natural Component Boundaries

    • E.g., separate a large monolith into core modules (user authentication, reporting, payment processing).
    • This provides early scoping for partial decomposition.
  2. Adopt Container or VM Packaging

  3. Refactor Shared Libraries

  4. Automate Basic CI/CD

  5. Plan a Phased Decomposition

By identifying component boundaries, packaging the monolith for simpler deployments, refactoring shared libraries, automating CI/CD, and scheduling partial decomposition, you reduce friction and set a path toward more modular solutions.

How to do better

Below are rapidly actionable ways to shift modules from concept to independent deployment:

  1. Introduce Containerisation at Module-Level

  2. Provide Separate Build Pipelines

  3. Adopt an API or Messaging Boundary

  4. Test and Deploy Modules Independently

    • Even if they remain part of a bigger system, trial partial independent deploys:
      • e.g., can you update a single library or microservice without redeploying everything?
  5. Demonstrate Gains

    • Show leadership how incremental module updates reduce downtime or accelerate security patching:
      • Encourages buy-in for further decoupling.

By containerising modules, setting up separate build pipelines, enforcing clear module boundaries, individually deploying or updating modules, and showcasing tangible benefits, you progress toward a fully independent deployment pipeline that capitalises on modularity.

How to do better

Below are rapidly actionable ways to handle interdependencies in individually deployable components:

  1. Introduce Contract Testing

  2. Automate Consumer-Driven Testing

    • Consumers of a service define expected inputs/outputs; the service must pass these for each release.
    • Minimises “integration hell.”
  3. Adopt Semantic Versioning

  4. Publish a Dependency Matrix

  5. Enforce Feature Flags

By introducing contract or consumer-driven testing, adopting semantic versioning, publishing a compatibility matrix, and employing feature flags to manage cross-component rollouts, you reduce interdependency friction and safely leverage your modular architecture.
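
To show what the contract and consumer-driven testing steps can look like in practice, here is a minimal pytest-style sketch in which a consumer pins down the fields it relies on, and the provider's release pipeline must keep that test green. The endpoint shape, field names, and `get_case_summary` function are hypothetical; dedicated tools such as Pact offer richer versions of the same idea.

```python
# test_case_summary_contract.py — run with `pytest`
# A consumer-driven contract: the consuming service declares exactly which
# fields and types it depends on, and the provider must not break them.


def get_case_summary(case_id: str) -> dict:
    """Hypothetical provider call; a real pipeline would hit the provider's API
    (or a locally started instance of it)."""
    return {
        "case_id": case_id,
        "status": "open",
        "assigned_team": "triage",
        "opened_at": "2024-05-01T09:30:00Z",
    }


REQUIRED_FIELDS = {
    "case_id": str,
    "status": str,
    "opened_at": str,
}


def test_case_summary_contract():
    response = get_case_summary("CASE-123")
    for field, expected_type in REQUIRED_FIELDS.items():
        assert field in response, f"contract broken: missing field '{field}'"
        assert isinstance(response[field], expected_type), (
            f"contract broken: '{field}' should be {expected_type.__name__}"
        )


def test_status_values_stay_within_agreed_set():
    # The consumer renders these states; adding a new one needs a coordinated release.
    assert get_case_summary("CASE-123")["status"] in {"open", "closed", "escalated"}
```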

How to do better

Below are rapidly actionable ways to address the leftover monolithic elements:

  1. Identify High-Impact Subsystem to Extract

  2. Establish Clear Migration Plan

    • e.g., define a 12–24 month roadmap with incremental steps or re-platforming on containers:
      • Minimises big-bang rewrites.
  3. Enhance DevOps for Monolith

    • Even if it remains monolithic for a while, ensure robust CI/CD, container packaging, automated tests, referencing NCSC DevSecOps guidance.
  4. Limit New Features in Legacy

    • Encourage new capabilities or major enhancements in microservices around the edges, gradually reducing the monolith’s importance.
  5. Highlight ROI & Risk

    • Present management with cost of leaving the monolith vs. benefits of further decomposition (faster releases, easier security fixes).

By selecting high-impact subsystems for extraction, creating a phased migration plan, applying DevOps best practices to the existing monolith, steering new features away from legacy, and continuously communicating the ROI of decomposition, you inch closer to a fully modular environment.

How to do better

Below are rapidly actionable ways to optimise a fully component-based approach:

  1. Enhance Observability & Tracing

  2. Apply Zero-Trust for Service Communication

  3. Adopt or Refine Service Mesh

  4. Continuous Architecture Review

    • With so many components, schedule architecture retros or periodic design reviews ensuring no sprawl or duplication arises.
  5. Collaborate Across Departments or Agencies

By enhancing distributed tracing, adopting zero-trust service communications, exploring or refining a service mesh, scheduling architecture reviews, and collaborating with other government entities, you maintain a top-tier, fully component-based environment that remains agile, secure, and efficient in meeting public sector demands.
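
For the observability and tracing step, the sketch below uses the OpenTelemetry Python SDK to emit spans for a request that fans out across components. The service and span names are illustrative, and in production you would export to your tracing backend (e.g. via an OTLP collector) rather than the console.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Set up a tracer that prints spans to the console; swap ConsoleSpanExporter
# for an OTLP exporter pointing at your collector in real deployments.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("case-service")  # hypothetical component name


def handle_request(case_id: str) -> None:
    # One span per inbound request, with child spans for each downstream call,
    # lets you follow a single transaction across many components.
    with tracer.start_as_current_span("handle_case_request") as span:
        span.set_attribute("case.id", case_id)
        with tracer.start_as_current_span("fetch_case_record"):
            pass  # call the records component here
        with tracer.start_as_current_span("notify_assignee"):
            pass  # call the notifications component here


handle_request("CASE-123")
```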

Keep doing what you’re doing, and consider sharing or blogging about your experience with modular architectures. Contributing pull requests to this guidance or other best-practice repositories helps UK public sector organisations adopt similarly progressive strategies for building and maintaining cloud and on-premises systems.