6 Pillars AWS System Design
1. Operational Excellence
1.1. Design principles
-
Organize teams around business outcomes.
-
Implement observability for actionable insights.
-
Safely automate where possible.
-
Make frequent, small, reversible changes.
-
Refine operations procedures frequently.
-
Anticipate/Predict failure.
-
Learn from all operational events and metrics.
-
Use managed services.
1.2. Questions
Operations:
OPS 1: How do you determine what your priorities are?
-
OPS01-BP01: Evaluate external customer needs.
-
OPS01-BP02: Evaluate internal customer needs.
-
OPS01-BP03: Evaluate governance requirements: company rules.
-
OPS01-BP04: Evaluate compliance requirements: country rules.
-
OPS01-BP05: Evaluate threat landscape
-
OPS01-BP06: Evaluate tradeoffs while managing benefits and risks
OPS 2: How do you structure your organization to support your business outcomes?
-
OPS02-BP01: Resources have identified owners.
-
OPS02-BP02: Processes and procedures have identified owners.
-
OPS02-BP03: Operations activities have identified owners responsible for their performance.
-
OPS02-BP04: Mechanisms exist to manage responsibilities and ownership.
-
OPS02-BP05: Mechanisms exist to request additions, changes, and exceptions
-
OPS02-BP06: Responsibilities between teams are predefined or negotiated
OPS 3: How does your organizational culture support your business outcomes?
-
OPS03-BP01: Provide executive sponsorship
-
OPS03-BP02: Team members are empowered to take action when outcomes are at risk.
-
OPS03-BP03: Escalation is encouraged
-
OPS03-BP04: Communications are timely, clear, and actionable
-
OPS03-BP05: Experimentation is encouraged
-
OPS03-BP06: Team members are encouraged to maintain and grow their skill sets
-
OPS03-BP07: Resource teams appropriately
OPS 4: How do you implement observability in your workload?
-
OPS04-BP01: Identify key performance indicators
-
OPS04-BP02: Implement application telemetry
-
OPS04-BP03: Implement user experience telemetry
-
OPS04-BP04: Implement dependency telemetry
-
OPS04-BP05: Implement distributed tracing
OPS 5: How do you reduce defects, ease remediation, and improve flow into production?
-
OPS05-BP01: Use version control
-
OPS05-BP02: Test and validate changes
-
OPS05-BP03: Use configuration management systems
-
OPS05-BP04: Use build and deployment management systems
-
OPS05-BP05: Perform patch management: apply software patches.
-
OPS05-BP06: Share design standards
-
OPS05-BP07: Implement practices to improve code quality
-
OPS05-BP08: Use multiple environments
-
OPS05-BP09: Make frequent, small, reversible changes
-
OPS05-BP10: Fully automate integration and deployment
OPS 6: How do you mitigate deployment risks?
-
OPS06-BP01: Plan for unsuccessful changes
-
OPS06-BP02: Test deployments
-
OPS06-BP03: Employ safe deployment strategies
-
OPS06-BP04: Automate testing and rollback
OPS 7: How do you know that you are ready to support a workload?
-
OPS07-BP01: Ensure personnel capability
-
OPS07-BP02: Ensure a consistent review of operational readiness
-
OPS07-BP03: Use runbooks to perform procedures: A runbook is a documented process to achieve a specific outcome.
-
OPS07-BP04: Use playbooks to investigate issues: Playbooks are step-by-step guides used to investigate an incident
-
OPS07-BP05: Make informed decisions to deploy systems and changes
-
OPS07-BP06: Create support plans for production workloads
Operate:
OPS 8: How do you utilize workload observability in your organization?
-
OPS08-BP01: Analyze workload metrics
-
OPS08-BP02: Analyze workload logs
-
OPS08-BP03: Analyze workload traces
-
OPS08-BP04: Create actionable alerts
-
OPS08-BP05: Create dashboards
OPS 9: How do you understand the health of your operations?
-
OPS09-BP01: Measure operations goals and KPIs with metrics
-
OPS09-BP02: Communicate status and trends to ensure visibility into operation
-
OPS09-BP03: Review operations metrics and prioritize improvement
OPS 10: How do you manage workload and operations events?
-
OPS10-BP01: Use a process for event, incident, and problem management.
-
OPS10-BP02: Have a process per alert
-
OPS10-BP03: Prioritize operational events based on business impact
-
OPS10-BP04: Define escalation paths
-
OPS10-BP05: Define a customer communication plan for service-impacting events
-
OPS10-BP06: Communicate status through dashboards
-
OPS10-BP07: Automate responses to events
Evolve:
OPS 11: How do you evolve operations?
-
OPS11-BP01: Have a process for continuous improvement
-
OPS11-BP02: Perform post-incident analysis
-
OPS11-BP03: Implement feedback loops
-
OPS11-BP04: Perform knowledge management
-
OPS11-BP05: Define drivers for improvement
-
OPS11-BP06: Validate insights
-
OPS11-BP07: Perform operations metrics reviews
-
OPS11-BP08: Document and share lessons learned
-
OPS11-BP09: Allocate time to make improvements
2. Security
2.1. Design principles
-
Implement a least privilege foundation.
-
Maintain traceability: audit changes in your environment in real-time.
-
Apply security at all layers
-
Automate security best practices
-
Protect data in transit and at rest
-
Keep people away from data
-
Prepare for security events
2.2. Questions
Security foundations:
SEC 1: How do you securely operate your workload?
-
SEC01-BP01: Separate workloads using accounts
-
SEC01-BP02: Secure account root user and properties
-
SEC01-BP03: Identify and validate control objectives
-
SEC01-BP04: Stay up to date with security threats and recommendations
-
SEC01-BP05: Reduce security management scope
-
SEC01-BP06: Automate deployment of standard security controls
-
SEC01-BP07: Identify threats and prioritize mitigations using a threat model
-
SEC01-BP08: Evaluate and implement new security services and features regularly
Identity and access management:
SEC 2: How do you manage identities for people and machines?
-
SEC02-BP01: Use strong sign-in mechanisms
-
SEC02-BP02: Use temporary credentials
-
SEC02-BP03: Store and use secrets securely
-
SEC02-BP04: Rely on a centralized identity provider
-
SEC02-BP05: Audit and rotate credentials periodically
-
SEC02-BP06: Employ user groups and attributes
SEC 3: How do you manage permissions for people and machines?
-
SEC03-BP01: Define access requirements
-
SEC03-BP02: Grant least privilege access
-
SEC03-BP03: Establish emergency access process
-
SEC03-BP04: Reduce permissions continuously
-
SEC03-BP05: Define permission guardrails for your organization
-
SEC03-BP06: Manage access based on lifecycle
-
SEC03-BP07: Analyze public and cross-account access
-
SEC03-BP08: Share resources securely within your organization
-
SEC03-BP09: Share resources securely with a third party
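As an illustration of SEC03-BP02 (grant least privilege access), the sketch below builds an IAM policy document that allows a single read action on a single S3 prefix instead of a wildcard. The function name, bucket name, and prefix are hypothetical; only the policy grammar (`Version`, `Statement`, `Effect`, `Action`, `Resource`) and the `s3:GetObject` action come from IAM itself.

```python
import json


def read_only_bucket_policy(bucket: str, prefix: str) -> dict:
    """Build a policy document that allows reading objects under one prefix only.

    The bucket and prefix are illustrative inputs, not values from these notes.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                # One specific action rather than "s3:*".
                "Action": ["s3:GetObject"],
                # One specific prefix rather than the whole bucket or "*".
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix, printed as JSON for review.
    print(json.dumps(read_only_bucket_policy("example-reports-bucket", "monthly"), indent=2))
```

Starting from a narrow policy like this and widening it only when a concrete need appears also fits SEC03-BP04 (reduce permissions continuously).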
Detection:
SEC 4: How do you detect and investigate security events?
-
SEC04-BP01: Configure service and application logging
-
SEC04-BP02: Capture logs, findings, and metrics in standardized locations
-
SEC04-BP03: Correlate and enrich security alerts
-
SEC04-BP04: Initiate remediation for non-compliant resources: raise alerts for resources that do not meet compliance rules.
Infrastructure protection:
SEC 5: How do you protect your network resources? (Network)
-
SEC05-BP01: Create network layers
-
SEC05-BP02: Control traffic flow within your network layers
-
SEC05-BP03: Implement inspection-based protection
-
SEC05-BP04: Automate network protection
SEC 6: How do you protect your compute resources? (Compute)
-
SEC06-BP01: Perform vulnerability management
-
SEC06-BP02: Provision compute from hardened images
-
SEC06-BP03: Reduce manual management and interactive access
-
SEC06-BP04: Validate software integrity
-
SEC06-BP05: Automate compute protection
Data protection:
SEC 7: How do you classify your data? (Data)
-
SEC07-BP01: Understand your data classification scheme
-
SEC07-BP02: Apply data protection controls based on data sensitivity
-
SEC07-BP03: Automate identification and classification
-
SEC07-BP04: Define scalable data lifecycle management
SEC 8: How do you protect your data at rest? (Data rest)
-
SEC08-BP01: Implement secure key management
-
SEC08-BP02: Enforce encryption at rest
-
SEC08-BP03: Automate data at rest protection
-
SEC08-BP04: Enforce access control
SEC 9: How do you protect your data in transit? (Data transit)
-
SEC09-BP01: Implement secure key and certificate management
-
SEC09-BP02: Enforce encryption in transit
-
SEC09-BP03: Authenticate network communications
Incident response:
SEC 10: How do you anticipate, respond to, and recover from incidents? (Detect incidents)
-
SEC10-BP01: Identify key personnel and external resources
-
SEC10-BP02: Develop incident management plans
-
SEC10-BP03: Prepare forensic capabilities
-
SEC10-BP04: Develop and test security incident response playbooks
-
SEC10-BP05: Pre-provision access
-
SEC10-BP06: Pre-deploy tools
-
SEC10-BP07: Run simulations
-
SEC10-BP08: Establish a framework for learning from incidents
Application security:
SEC 11: How do you incorporate and validate the security properties of applications throughout the design, development, and deployment lifecycle? (application security)
-
SEC11-BP01: Train for application security
-
SEC11-BP02: Automate testing throughout the development and release lifecycle
-
SEC11-BP03: Perform regular penetration testing
-
SEC11-BP04: Conduct code reviews
-
SEC11-BP05: Centralize services for packages and dependencies
-
SEC11-BP06: Deploy software programmatically
-
SEC11-BP07: Regularly assess security properties of the pipelines
-
SEC11-BP08: Build a program that embeds security ownership in workload teams
3. Reliability
3.1. Design principles
-
Automatically recover from failure
-
Test recovery procedures
-
Scale horizontally to increase aggregate workload availability
-
Stop guessing capacity
-
Manage change through automation: automated changes can be tracked and reviewed.
3.2. Questions
Foundations:
REL 1: How do you manage Service Quotas (Limits) and constraints?
-
REL01-BP01: Aware of service quotas and constraints
-
REL01-BP02: Manage service quotas across accounts and regions
-
REL01-BP03: Accommodate fixed service quotas and constraints through architecture
-
REL01-BP04: Monitor and manage quotas
-
REL01-BP05: Automate quota management
-
REL01-BP06: Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover
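The arithmetic behind REL01-BP06 can be sketched as follows. The sketch assumes a workload spread evenly across several zones, and assumes that failed resources keep counting against the quota until they are cleaned up, so replacements for one zone's share must fit on top of peak usage. The function names and numbers are illustrative only.

```python
import math


def required_quota(peak_usage: int, zones: int) -> int:
    """Quota needed to replace one zone's share while failed resources still count.

    Assumes resources are spread evenly across `zones`.
    """
    if zones < 2:
        raise ValueError("failover needs at least two zones")
    replacement = math.ceil(peak_usage / zones)  # capacity lost with one zone
    return peak_usage + replacement


def has_failover_gap(quota: int, peak_usage: int, zones: int) -> bool:
    """True when the current quota leaves enough room for a one-zone failover."""
    return quota >= required_quota(peak_usage, zones)


if __name__ == "__main__":
    # 90 instances at peak across 3 zones: losing a zone needs 30 replacements.
    print(required_quota(90, 3))         # 120
    print(has_failover_gap(100, 90, 3))  # False
    print(has_failover_gap(120, 90, 3))  # True
```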
REL 2: How do you plan your network topology? (network)
-
REL02-BP01: Use highly available network connectivity for your workload public endpoints
-
REL02-BP02: Provision redundant connectivity between private networks in the cloud and on-premises environments.
-
REL02-BP03: Ensure IP subnet allocation accounts for expansion and availability
-
REL02-BP04: Prefer hub-and-spoke topologies over many-to-many mesh
-
REL02-BP05: Enforce non-overlapping private IP address ranges in all private address spaces where they are connected
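A check for REL02-BP05 (non-overlapping private IP ranges) can be sketched with Python's standard `ipaddress` module. The network names and CIDR blocks below are made up for illustration; a real check would read them from your own network inventory.

```python
from ipaddress import ip_network
from itertools import combinations


def find_overlaps(cidrs: dict) -> list:
    """Return every pair of network names whose CIDR blocks overlap."""
    networks = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    # Hypothetical address plan: the on-premises range sits inside vpc-a.
    plan = {
        "vpc-a": "10.0.0.0/16",
        "vpc-b": "10.1.0.0/16",
        "on-prem": "10.0.128.0/17",
    }
    print(find_overlaps(plan))  # [('vpc-a', 'on-prem')]
```

Running a check like this before connecting networks (peering, VPN, or a transit hub) catches conflicts while they are still cheap to fix.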
Workload architecture:
REL 3: How do you design your workload service architecture? (workload)
-
REL03-BP01: Choose how to segment your workload
-
REL03-BP02: Build services focused on specific business domains and functionality
-
REL03-BP03: Provide service contracts per API
REL 4: How do you design interactions in a distributed system to prevent failures? (prevent failure)
-
REL04-BP01: Identify the kind of distributed systems you depend on
-
REL04-BP02: Implement loosely coupled dependencies
-
REL04-BP03: Do constant work
-
REL04-BP04: Make mutating operations idempotent
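One common way to approach REL04-BP04 (make mutating operations idempotent) is an idempotency key supplied by the caller. The sketch below is a minimal in-memory illustration; `PaymentService`, its fields, and the key format are hypothetical, and a real service would keep the key-to-result mapping in durable storage.

```python
class PaymentService:
    """Toy service whose charge() can be retried safely with the same key."""

    def __init__(self):
        self._results = {}      # idempotency key -> stored result
        self.charges_made = 0   # how many real charges were performed

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key returns the stored result and performs no new charge.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


if __name__ == "__main__":
    service = PaymentService()
    first = service.charge("order-42", 500)
    retry = service.charge("order-42", 500)  # e.g. the client timed out and retried
    print(first == retry, service.charges_made)  # True 1
```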
REL 5: How do you design interactions in a distributed system to mitigate or withstand failures? (mitigate, withstand failures)
-
REL05-BP01: Implement graceful degradation to transform applicable hard dependencies into soft dependencies
-
REL05-BP02: Throttle requests
-
REL05-BP03: Control and limit retry calls
-
REL05-BP04: Fail fast and limit queues
-
REL05-BP05: Set client timeouts
-
REL05-BP06: Make systems stateless where possible
-
REL05-BP07: Implement emergency levers
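REL05-BP03 (control and limit retry calls) is often implemented as a bounded number of attempts with capped exponential backoff and random jitter. The following is a minimal sketch under that assumption; `TransientError`, `call_with_retries`, and the default delays are hypothetical names and values chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type marking failures that are worth retrying."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying TransientError a limited number of times.

    The wait before each retry is a random fraction (jitter) of an
    exponentially growing delay that is capped at max_delay.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)
```

Only the error type marked as transient is retried; anything else propagates immediately, which keeps the behavior close to REL05-BP04 (fail fast). The `sleep` and `rng` parameters exist so the function can be exercised without real waiting.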
Change management:
REL 6: How do you monitor workload resources? (monitor)
-
REL06-BP01: Monitor all components for the workload (Generation)
-
REL06-BP02: Define and calculate metrics (Aggregation)
-
REL06-BP03: Send notifications (Real-time processing and alarming)
-
REL06-BP04: Automate responses (Real-time processing and alarming)
-
REL06-BP05: Analyze logs
-
REL06-BP06: Regularly review monitoring scope and metrics
-
REL06-BP07: Monitor end-to-end tracing of requests through your system
REL 7: How do you design your workload to adapt to changes in demand? (adapt changes)
-
REL07-BP01: Use automation when obtaining or scaling resources
-
REL07-BP02: Obtain resources upon detection of impairment to a workload
-
REL07-BP03: Obtain resources upon detection that more resources are needed for a workload
-
REL07-BP04: Load test your workload
REL 8: How do you implement change? (implement changes)
-
REL08-BP01: Use runbooks for standard activities such as deployment
-
REL08-BP02: Integrate functional testing as part of your deployment
-
REL08-BP03: Integrate resiliency testing as part of your deployment
-
REL08-BP04: Deploy using immutable infrastructure
-
REL08-BP05: Deploy changes with automation
Failure management:
REL 9: How do you back up data? (back up data)
-
REL09-BP01: Identify and back up all data that needs to be backed up, or reproduce the data from sources
-
REL09-BP02: Secure and encrypt backups
-
REL09-BP03: Perform data backup automatically
-
REL09-BP04: Perform periodic recovery of the data to verify backup integrity and processes
REL 10: How do you use fault isolation to protect your workload? (fault isolation)
-
REL10-BP01: Deploy the workload to multiple locations
-
REL10-BP02: Automate recovery for components constrained to a single location
-
REL10-BP03: Use bulkhead architectures to limit scope of impact
REL 11: How do you design your workload to withstand component failures? (withstand fault)
-
REL11-BP01: Monitor all components of the workload to detect failures
-
REL11-BP02: Fail over to healthy resources
-
REL11-BP03: Automate healing on all layers
-
REL11-BP04: Rely on the data plane and not the control plane during recovery
-
REL11-BP05: Use static stability to prevent bimodal behavior
-
REL11-BP06: Send notifications when events impact availability
-
REL11-BP07: Architect your product to meet availability targets and uptime service level agreements (SLAs)
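For REL11-BP07, a rough availability estimate can be worked out from component availabilities. The sketch below assumes failures are independent, which real systems only approximate: hard dependencies in series multiply, and redundant copies fail only when every copy fails. The function names and the sample figures are illustrative.

```python
def serial_availability(*parts: float) -> float:
    """Availability of a chain of hard dependencies (all must be up)."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


def redundant_availability(availability: float, copies: int) -> float:
    """Availability of independent redundant copies (at least one must be up)."""
    return 1.0 - (1.0 - availability) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    print(round(serial_availability(0.999, 0.999), 6))    # 0.998001
    print(round(redundant_availability(0.99, 2), 6))      # 0.9999
    print(round(downtime_minutes_per_year(0.999), 1))     # 525.6
```

The first result shows why each added hard dependency lowers the achievable target, and the second why redundancy across locations (REL10-BP01) raises it.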
REL 12: How do you test reliability? (test)
-
REL12-BP01: Use playbooks to investigate failures (strategy)
-
REL12-BP02: Perform post-incident analysis
-
REL12-BP03: Test scalability and performance requirements
-
REL12-BP04: Test resiliency using chaos engineering
-
REL12-BP05: Conduct game days regularly
REL 13: How do you plan for disaster recovery (DR)? (recovery)
-
REL13-BP01: Define recovery objectives for downtime and data loss
-
REL13-BP02: Use defined recovery strategies to meet the recovery objectives
-
REL13-BP03: Test disaster recovery implementation to validate the implementation
-
REL13-BP04: Manage configuration drift at the DR site or Region
-
REL13-BP05: Automate recovery
4. Performance Efficiency
4.1. Design principles
-
Democratize advanced technologies: delegate complex tasks to the cloud vendor instead of asking the IT team to learn how to host and run new technology.
-
Go global in minutes
-
Use serverless architectures: you do not need to maintain physical servers; you just run your code.
-
Experiment more often
-
Consider mechanical sympathy: consider data access patterns when choosing a technology approach.
4.2. Questions
Architecture selection:
PERF 1: How do you select appropriate cloud resources and architecture patterns for your workload? (select cloud architecture)
-
PERF01-BP01: Learn about and understand available cloud services and features
-
PERF01-BP02: Use guidance from your cloud provider or an appropriate partner to learn about architecture patterns and best practices.
-
PERF01-BP03: Factor cost into architectural decisions.
-
PERF01-BP04: Evaluate how trade-offs impact customers and architecture efficiency
-
PERF01-BP05: Use policies and reference architectures
-
PERF01-BP06: Use benchmarking to drive architectural decisions
-
PERF01-BP07: Use a data-driven approach for architectural choices
Compute and hardware:
PERF 2: How do you select and use compute resources in your workload? (select compute resource)
-
PERF02-BP01: Select the best compute options for your workload
-
PERF02-BP02: Understand the available compute configuration and features
-
PERF02-BP03: Collect compute-related metrics
-
PERF02-BP04: Configure and right-size compute resources
-
PERF02-BP05: Scale your compute resources dynamically
-
PERF02-BP06: Use optimized hardware-based compute accelerators
Data management:
PERF 3: How do you store, manage, and access data in your workload? (select database, store, query, access)
-
PERF03-BP01: Use a purpose-built data store that best supports your data access and storage requirements
-
PERF03-BP02: Evaluate available configuration options for data store
-
PERF03-BP03: Collect and record data store performance metrics
-
PERF03-BP04: Implement strategies to improve query performance in data store
-
PERF03-BP05: Implement data access patterns that utilize caching
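PERF03-BP05 (data access patterns that utilize caching) is commonly realized as a cache-aside read path with a time-to-live. The sketch below is an in-process illustration only; `CacheAside`, its `loader` callback, and the TTL value are hypothetical, and a production setup would more likely use a shared cache service.

```python
import time


class CacheAside:
    """Read-through helper: serve from cache, fall back to the data store on a miss."""

    def __init__(self, loader, ttl_seconds: float, clock=time.monotonic):
        self._loader = loader    # function that reads from the slower data store
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]      # cache hit that has not expired
        value = self._loader(key)  # miss or expired: go to the data store
        self._entries[key] = (value, now + self._ttl)
        return value


if __name__ == "__main__":
    def slow_lookup(user_id):
        print(f"loading {user_id} from the data store")
        return {"id": user_id}

    cache = CacheAside(slow_lookup, ttl_seconds=30)
    cache.get("u1")  # loads from the data store
    cache.get("u1")  # served from the cache, no load message
```

The TTL bounds how stale a cached value can be; choosing it is a trade-off between data store load and freshness.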
Networking and content delivery:
PERF 4: How do you select and configure networking resources in your workload? (select networking, CDN)
-
PERF04-BP01: Understand how networking impacts performance
-
PERF04-BP02: Evaluate available networking features
-
PERF04-BP03: Choose appropriate dedicated connectivity or VPN for your workload
-
PERF04-BP04: Use load balancing to distribute traffic across multiple resources
-
PERF04-BP05: Choose network protocols to improve performance
-
PERF04-BP06: Choose your workload’s location based on network requirements
-
PERF04-BP07: Optimize network configuration based on metrics
Process and culture:
PERF 5: What process do you use to support more performance efficiency for your workload? (select process, KPIs, metrics)
-
PERF05-BP01: Establish key performance indicators (KPIs) to measure workload health and performance
-
PERF05-BP02: Use monitoring solutions to understand the areas where performance is most critical.
-
PERF05-BP03: Define a process to improve workload performance
-
PERF05-BP04: Load test your workload
-
PERF05-BP05: Use automation to proactively remediate performance-related issues
-
PERF05-BP06: Keep your workload and services up-to-date
-
PERF05-BP07: Review metrics at regular intervals
5. Cost Optimization
5.1. Design principles
-
Implement Cloud Financial Management
-
Adopt a consumption model: pay only for the computing resources you use, and increase or decrease usage depending on business requirements.
-
Measure overall efficiency
-
Stop spending money on undifferentiated heavy lifting: you do not need to manage hardware.
-
Analyze and attribute expenditure: accurately analyze the usage and cost of the system.
5.2. Questions
Practice Cloud Financial Management:
COST 1: How do you implement cloud financial management?
-
COST01-BP01: Establish ownership of cost optimization
-
COST01-BP02: Establish a partnership between finance and technology
-
COST01-BP03: Establish cloud budgets and forecasts
-
COST01-BP04: Implement cost awareness in your organizational processes
-
COST01-BP05: Report and notify on cost optimization
-
COST01-BP06: Monitor cost proactively
-
COST01-BP07: Keep up-to-date with new service releases
-
COST01-BP08: Create a cost-aware culture
-
COST01-BP09: Quantify business value from cost optimization
Expenditure and usage awareness:
COST 2: How do you govern usage? (rule to use)
-
COST02-BP01: Develop policies based on your organization requirements
-
COST02-BP02: Implement goals and targets
-
COST02-BP03: Implement an account structure
-
COST02-BP04: Implement groups and roles
-
COST02-BP05: Implement cost controls
-
COST02-BP06: Track project lifecycle
COST 3: How do you monitor usage and cost? (monitor cost)
-
COST03-BP01: Configure detailed information sources
-
COST03-BP02: Add organization information to cost and usage
-
COST03-BP03: Identify cost attribution categories
-
COST03-BP04: Establish organization metrics
-
COST03-BP05: Configure billing and cost management tools
-
COST03-BP06: Allocate costs based on workload metrics
COST 4: How do you decommission resources? (clean resources)
-
COST04-BP01: Track resources over their lifetime
-
COST04-BP02: Implement a decommissioning process
-
COST04-BP03: Decommission resources
-
COST04-BP04: Decommission resources automatically
-
COST04-BP05: Enforce data retention policies
Cost-effective resources:
COST 5: How do you evaluate cost when you select services? (evaluate cost)
-
COST05-BP01: Identify organization requirements for cost
-
COST05-BP02: Analyze all components of the workload
-
COST05-BP03: Perform a thorough analysis of each component
-
COST05-BP04: Select software with cost-effective licensing
-
COST05-BP05: Select components of this workload to optimize cost in line with organization priorities
-
COST05-BP06: Perform cost analysis for different usage over time
COST 6: How do you meet cost targets when you select resource type, size and number? (how to meet cost target)
-
COST06-BP01: Perform cost modeling
-
COST06-BP02: Select resource type, size, and number based on data
-
COST06-BP03: Select resource type, size, and number automatically based on metrics
-
COST06-BP04: Consider using shared resources
COST 7: How do you use pricing models to reduce cost? (how to reduce cost in all components)
-
COST07-BP01: Perform pricing model analysis
-
COST07-BP02: Choose Regions based on cost
-
COST07-BP03: Select third-party agreements with cost-efficient terms
-
COST07-BP04: Implement pricing models for all components of this workload
-
COST07-BP05: Perform pricing model analysis at the management account level
COST 8: How do you plan for data transfer charges? (data transfer)
-
COST08-BP01: Perform data transfer modeling
-
COST08-BP02: Select components to optimize data transfer cost
-
COST08-BP03: Implement services to reduce data transfer costs
Manage demand and supply resources:
COST 9: How do you manage demand, and supply resources? (Manage demand and supply resources)
-
COST09-BP01: Perform an analysis on the workload demand
-
COST09-BP02: Implement a buffer or throttle to manage demand
-
COST09-BP03: Supply resources dynamically
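The throttle mentioned in COST09-BP02 is often built as a token bucket: requests spend tokens, and tokens refill at a fixed rate up to a burst capacity. The sketch below is one minimal way to write it; `TokenBucket` and its parameters are hypothetical names, and the rate and capacity are illustrative.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, then a steady `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: int, clock=time.monotonic):
        self._rate = rate_per_second
        self._capacity = capacity
        self._tokens = float(capacity)  # start with a full bucket
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        """Return True and spend a token if the request may proceed."""
        now = self._clock()
        # Refill for the time elapsed since the last call, up to capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # caller should reject, delay, or queue the request


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_second=5, capacity=2)
    print([bucket.allow() for _ in range(3)])  # burst of 2 allowed, third throttled
```

A buffer (queue) keeps rejected work for later instead of dropping it; either approach flattens demand peaks so that fewer resources need to be provisioned.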
Optimize over time:
COST 10: How do you evaluate new services? (review resource)
-
COST10-BP01: Develop a workload review process
-
COST10-BP02: Review and analyze this workload regularly
COST 11: How do you evaluate the cost of effort? (review efforts)
- COST11-BP01: Perform automation for operations: automate tasks instead of doing them manually.
6. Sustainability
6.1. Design principles
-
Understand your impact
-
Establish sustainability goals
-
Maximize utilization
-
Anticipate and adopt new, more efficient hardware and software offerings
-
Use managed services
-
Reduce the downstream impact of your cloud workloads
6.2. Questions
Region selection:
SUS 1: How do you select Regions for your workload? (region)
- SUS01-BP01: Choose Region based on both business requirements and sustainability goals
Alignment to demand:
SUS 2: How do you align cloud resources to your demand? (save resources)
-
SUS02-BP01: Scale workload infrastructure dynamically
-
SUS02-BP02: Align SLAs with sustainability goals
-
SUS02-BP03: Stop the creation and maintenance of unused assets
-
SUS02-BP04: Optimize geographic placement of workloads based on their networking requirements
-
SUS02-BP05: Optimize team member resources for activities performed
-
SUS02-BP06: Implement buffering or throttling to flatten the demand curve
Software and architecture:
SUS 3: How do you take advantage of software and architecture patterns to support your sustainability goals? (how to use patterns)
-
SUS03-BP01: Optimize software and architecture for asynchronous and scheduled jobs
-
SUS03-BP02: Remove or refactor workload components with low or no use
-
SUS03-BP03: Optimize areas of code that consume the most time or resources
-
SUS03-BP04: Optimize impact on devices and equipment
-
SUS03-BP05: Use software patterns and architectures that best support data access and storage patterns
Data management:
SUS 4: How do you take advantage of data management policies and patterns to support your sustainability goals? (data management)
-
SUS04-BP01: Implement a data classification policy
-
SUS04-BP02: Use technologies that support data access and storage patterns
-
SUS04-BP03: Use policies to manage the lifecycle of your datasets
-
SUS04-BP04: Use elasticity and automation to expand block storage or file system
-
SUS04-BP05: Remove unneeded or redundant data
-
SUS04-BP06: Use shared file systems or storage to access common data
-
SUS04-BP07: Minimize data movement across networks
-
SUS04-BP08: Back up data only when difficult to recreate
Hardware and services:
SUS 5: How do you select and use cloud hardware and services in your architecture to support your sustainability goals? (cloud hardware and services)
-
SUS05-BP01: Use the minimum amount of hardware to meet your needs
-
SUS05-BP02: Use instance types with the least impact
-
SUS05-BP03: Use managed services
-
SUS05-BP04: Optimize your use of hardware-based compute accelerators
Process and culture:
SUS 6: How do your organizational processes support your sustainability goals? (process)
-
SUS06-BP01: Communicate and cascade your sustainability goals
-
SUS06-BP02: Adopt methods that can rapidly introduce sustainability improvements
-
SUS06-BP03: Keep your workload up-to-date
-
SUS06-BP04: Increase utilization of build environments
-
SUS06-BP05: Use managed device farms for testing