
Process Flows

Last Updated: 2026-03-06
Sources: repos/innovation-sandbox-on-aws, repos/innovation-sandbox-on-aws-approver, repos/innovation-sandbox-on-aws-costs, repos/innovation-sandbox-on-aws-deployer, repos/innovation-sandbox-on-aws-billing-seperator, repos/innovation-sandbox-on-aws-utils

Executive Summary

This document presents the end-to-end user and operational process flows of the NDX:Try AWS platform. It covers:

- the user journey from discovery to production adoption
- the full lease lifecycle and its state transitions
- the CloudFormation template deployment pipeline
- the cost tracking cycle from termination to chargeback
- the approver scoring process
- account cleanup with AWS Nuke
- manual pool account provisioning
- the operational runbooks for daily, weekly, monthly, and emergency activities


Flow 1: Complete User Journey (Discovery to Production)

End-to-End User Experience


Flow 2: Complete Lease Lifecycle

From Request to Cleanup


Flow 3: Complete Deployment Pipeline

CloudFormation Template Deployment


Flow 4: Complete Cost Tracking Cycle

From Termination to Chargeback


Flow 5: Approver Scoring Process

19-Rule Execution Flow


Flow 6: Account Cleanup Process (AWS Nuke)

Step Functions + CodeBuild Orchestration


Flow 7: Pool Account Provisioning (Manual)

Using innovation-sandbox-on-aws-utils

The innovation-sandbox-on-aws-utils repository contains Python scripts for manual pool account, lease, and user operations:

Available Utility Scripts

Script | Purpose
--- | ---
create_sandbox_pool_account.py | Create and register a new pool account
assign_lease.py | Manually assign a lease to a user
terminate_lease.py | Force-terminate an active lease
force_release_account.py | Release a quarantined account
create_user.py | Create a user in Identity Center
clean_console_state.py | Reset console preferences

Operational Processes

Daily Operations Checklist

[ ] Monitor quarantine queue (target: < 2 accounts)
[ ] Review cost overages from previous day
[ ] Check deployer success rate (target: > 95%)
[ ] Verify approver scoring (target: 80%+ auto-approval)
[ ] Review manual approval queue (target: < 10 pending)
[ ] Check pool capacity (target: >= 5 available accounts)
[ ] Review CloudWatch alarms (target: 0 active)
[ ] Scan EventBridge DLQ (target: empty)
[ ] Verify Cost Explorer quota usage (target: < 80%)
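The daily targets above are simple thresholds, so they can be checked mechanically. The following is a minimal sketch, not part of any NDX or ISB repository. The DailyMetrics fields and failed_checks are hypothetical names, and the sketch assumes the metric values have already been gathered from CloudWatch, the approver, and the pool. Only the items with numeric targets are encoded. "Review cost overages" needs human judgement and is left out.

```python
from dataclasses import dataclass


# Hypothetical snapshot of the values the daily checklist reviews.
# Field names are illustrative; they do not map to a real API.
@dataclass
class DailyMetrics:
    quarantined_accounts: int
    deployer_success_rate: float      # fraction, 0.0-1.0
    auto_approval_rate: float         # fraction, 0.0-1.0
    pending_manual_approvals: int
    available_pool_accounts: int
    active_alarms: int
    dlq_messages: int
    cost_explorer_quota_usage: float  # fraction of quota used, 0.0-1.0


def failed_checks(m: DailyMetrics) -> list[str]:
    """Return the checklist items whose target is not met."""
    # Each entry pairs a checklist item with whether its target is met.
    checks = [
        ("quarantine queue < 2", m.quarantined_accounts < 2),
        ("deployer success rate > 95%", m.deployer_success_rate > 0.95),
        ("auto-approval rate >= 80%", m.auto_approval_rate >= 0.80),
        ("manual approval queue < 10", m.pending_manual_approvals < 10),
        ("pool capacity >= 5", m.available_pool_accounts >= 5),
        ("CloudWatch alarms == 0", m.active_alarms == 0),
        ("EventBridge DLQ empty", m.dlq_messages == 0),
        ("Cost Explorer quota < 80%", m.cost_explorer_quota_usage < 0.80),
    ]
    return [name for name, ok in checks if not ok]
```

For example, a day with only 3 available pool accounts and everything else on target gives `failed_checks(DailyMetrics(1, 0.97, 0.82, 4, 3, 0, 0, 0.45))`, which returns `['pool capacity >= 5']`.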

Weekly Operations

[ ] Update ukps-domains whitelist from govuk-digital-backbone
[ ] Review quarantined accounts (manual cleanup if needed)
[ ] Generate pool utilisation report
[ ] Review Bedrock AI cost trends
[ ] Rotate GitHub API token (if approaching expiry)
[ ] Update team channel with metrics summary

Monthly Operations

[ ] Generate chargeback reports (1st of month)
[ ] Send cost reports to finance team
[ ] Review capacity planning (add pool accounts if needed)
[ ] Audit permission sets in Identity Center
[ ] Review and update lease templates
[ ] Conduct security audit (access logs, IAM policies)
[ ] Check how far the ISB (Innovation Sandbox on AWS) fork lags upstream (10 commits behind as of 2026-03-06)
[ ] Update documentation with operational learnings
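Chargeback reports are generated on the 1st of the month. The sketch below assumes each report covers the previous calendar month; the source does not state the exact window used by innovation-sandbox-on-aws-costs. previous_month_period is a hypothetical helper. It returns the window in the Start/End shape accepted by the TimePeriod parameter of the Cost Explorer GetCostAndUsage API, where Start is inclusive and End is exclusive.

```python
from datetime import date


def previous_month_period(run_date: date) -> dict[str, str]:
    """Return the previous calendar month as a Cost Explorer-style TimePeriod.

    Start is inclusive and End is exclusive, so End is the first day of the
    month in which the report runs.
    """
    first_of_this_month = run_date.replace(day=1)
    if first_of_this_month.month == 1:
        # January run: report on December of the previous year.
        first_of_prev_month = first_of_this_month.replace(
            year=first_of_this_month.year - 1, month=12
        )
    else:
        first_of_prev_month = first_of_this_month.replace(
            month=first_of_this_month.month - 1
        )
    return {
        "Start": first_of_prev_month.isoformat(),
        "End": first_of_this_month.isoformat(),
    }
```

A run on 2026-03-01 yields `{"Start": "2026-02-01", "End": "2026-03-01"}`, which covers all of February. Using an exclusive End avoids special-casing month lengths and leap years.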

Emergency Procedures

Pool Exhaustion (< 2 Available Accounts)

  1. Check the quarantine queue for accounts that are ready to release
  2. Run force_release_account.py on the oldest of those quarantined accounts
  3. If that does not restore capacity, create new pool accounts with create_sandbox_pool_account.py
  4. Escalate if capacity planning indicates a sustained increase in demand
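Steps 1 to 3 of the pool exhaustion procedure amount to a small planning decision. This is a minimal sketch, not code from innovation-sandbox-on-aws-utils. QuarantinedAccount, its ready_to_release flag, and plan_pool_recovery are all hypothetical. The sketch also assumes recovery should restore the daily checklist target of at least 5 available accounts, because the runbook itself only defines the < 2 trigger. It plans which accounts to pass to force_release_account.py and how many to create with create_sandbox_pool_account.py. It does not call either script, and it does not model the escalation in step 4.

```python
from dataclasses import dataclass
from datetime import datetime

LOW_WATERMARK = 2     # fewer than 2 available accounts triggers this procedure
TARGET_AVAILABLE = 5  # assumed recovery target (daily checklist: >= 5 available)


@dataclass
class QuarantinedAccount:
    account_id: str
    quarantined_at: datetime
    ready_to_release: bool  # hypothetical flag: safe to force-release


def plan_pool_recovery(
    available: int, quarantined: list[QuarantinedAccount]
) -> tuple[list[str], int]:
    """Return (account IDs to force-release, number of new accounts to create)."""
    if available >= LOW_WATERMARK:
        return [], 0  # not a pool exhaustion emergency
    shortfall = TARGET_AVAILABLE - available
    # Steps 1-2: consider only ready accounts, oldest quarantine first.
    candidates = sorted(
        (a for a in quarantined if a.ready_to_release),
        key=lambda a: a.quarantined_at,
    )
    to_release = [a.account_id for a in candidates[:shortfall]]
    # Step 3: create new pool accounts for whatever releasing cannot cover.
    to_create = shortfall - len(to_release)
    return to_release, to_create
```

With 1 available account and two ready quarantined accounts, the plan releases both (oldest first) and creates 2 new accounts to reach the assumed target of 5.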

Cleanup Failure (Account Stuck in Quarantine)

  1. Review CodeBuild logs for the failed Nuke execution
  2. Identify resources that could not be deleted
  3. Manually delete residual resources via the AWS Console
  4. Run force_release_account.py to move the account back to the Available OU
  5. Document the failure pattern to inform future nuke-config updates

Cost Explorer Outage

  1. The cost collection Lambda retries via the DLQ
  2. The billing separator automatically extends quarantine (up to 96h)
  3. At 96h, the account is force-released and the ops team is alerted
  4. If needed, estimate costs manually from CloudWatch metrics
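The billing separator's behaviour during an outage reduces to a three-way decision. This is a minimal sketch, not code from innovation-sandbox-on-aws-billing-seperator. quarantine_action, Action, and the costs_collected input are hypothetical. The sketch also makes two assumptions: the 96h cap is measured from when the account entered quarantine, and an account whose costs were collected is released normally.

```python
from datetime import datetime, timedelta
from enum import Enum

MAX_QUARANTINE = timedelta(hours=96)  # cap stated in the runbook


class Action(Enum):
    RELEASE = "release"                                   # costs collected
    EXTEND = "extend"                                     # keep waiting for Cost Explorer
    FORCE_RELEASE_AND_ALERT = "force_release_and_alert"   # 96h cap reached


def quarantine_action(
    quarantined_at: datetime, now: datetime, costs_collected: bool
) -> Action:
    """Decide what to do with an account held in quarantine for billing."""
    if costs_collected:
        return Action.RELEASE
    if now - quarantined_at >= MAX_QUARANTINE:
        # Stop waiting: release the account and alert the ops team.
        return Action.FORCE_RELEASE_AND_ALERT
    return Action.EXTEND
```

Keeping the cap as a single constant makes it easy to audit against the 96h figure in this runbook.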

References


Generated from source analysis. See 00-repo-inventory.md for full inventory.