Well-Architected Framework

The AWS Well-Architected Framework was developed to help cloud architects build secure, high-performing, resilient, and efficient application infrastructure.

The Cloud Practitioner and Architecting on AWS courses both have one slide on each pillar.

The pillars are described in a set of white papers, which can be downloaded from AWS.

They are recommended reading for the Solutions Architect Associate exam.

I summarise them here, adding some examples to the bullet points presented in the official courseware.

Operational Excellence

When creating a design or architecture, you must be aware of how it will be deployed, updated, and operated. It is imperative that you work towards defect reduction and safe fixes, and enable observation with logging instrumentation.

  • Perform operations as code.

For example, use CloudFormation templates, Systems Manager Run Command scripts, and shell scripts, and place them under a source control system such as GitHub.
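As a rough illustration of operations as code, the sketch below builds a minimal CloudFormation template in Python and serialises it to JSON, so it can be committed to source control and reviewed like any other code. The logical ID `LogBucket` and the helper names `build_template` and `render` are made up for this example; the resource type `AWS::S3::Bucket` and its `VersioningConfiguration` property are standard CloudFormation.

```python
import json


def build_template() -> dict:
    """Return a minimal CloudFormation template as a plain dictionary."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Example: a versioned S3 bucket defined as code",
        "Resources": {
            # "LogBucket" is a hypothetical logical ID chosen for this sketch.
            "LogBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"}
                },
            }
        },
    }


def render(template: dict) -> str:
    """Serialise with sorted keys so that diffs in source control stay small."""
    return json.dumps(template, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(build_template()))
```

The rendered JSON is what would be checked in and deployed; the Python wrapper is optional and only there to show that the template is ordinary, reviewable data.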

  • Annotate documentation.

For example, the code or script itself can describe its intended function.

  • Make frequent, small, reversible changes.

Making small changes reduces the scope and impact of each change. Many of the deployment services support the ability to roll back changes.

  • Refine operations procedures frequently.

When you change the workload to improve it, update the runbooks, scripts, and documentation to match.

  • Anticipate failure.

Build a replica of your production environment, possibly using simulated data, and inject failures into it. For example, terminate EC2 instances to check how the application recovers. Disable a microservice and, depending on its criticality to the overall system, check how the application as a whole is affected.

  • Learn from all operational failures.

When you find a problem, update procedures, scripts and documentation. In other words, don’t make the same mistake twice.

Security

Security deals with protecting information and mitigating possible damage.

  • Implement a strong identity foundation.

Use the principle of least privilege. For example, if an EC2 instance needs to write to S3, it doesn’t need read permission, and it doesn’t need long-term credentials. Instead, it can use a role. An operator may not need access to the data in order to maintain the system.

Review privileges on a regular basis, for example, in case someone has been given a privilege for a one-off event. Does a user need Full Access to a service? Many AWS managed policies provide full access where it may not be required.

CloudTrail can help identify what a user has done, and therefore what privileges they require.
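The write-only S3 example above could be expressed as an IAM policy document along the following lines. This is a minimal sketch: the function name `write_only_policy` and the bucket name used in the demo are hypothetical, and a real policy would normally be attached to the role that the EC2 instance assumes.

```python
import json


def write_only_policy(bucket_name: str) -> dict:
    """Build an IAM policy document that allows uploads to one bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only PutObject: no s3:GetObject, no s3:ListBucket, no s3:*.
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


if __name__ == "__main__":
    # "example-upload-bucket" is a placeholder name for illustration.
    print(json.dumps(write_only_policy("example-upload-bucket"), indent=2))
```

Compared with a managed full-access policy, this grants a single action on a single bucket, which is usually all a writer needs.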

  • Enable traceability.

Be able to track who did what and when, using CloudTrail and AWS Config. Use CloudTrail as the basis for checking whether an action is normal or not, possibly using machine learning.
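To make "who did what and when" concrete, here is a small sketch that reduces a CloudTrail record to a one-line summary. The field names (`eventTime`, `eventSource`, `eventName`, `userIdentity.arn`) are part of the CloudTrail record format; the sample record itself, including the account number and user name, is invented for illustration.

```python
# A hypothetical, heavily trimmed CloudTrail record.
SAMPLE_RECORD = {
    "eventTime": "2024-01-01T12:00:00Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "TerminateInstances",
    "userIdentity": {
        "type": "IAMUser",
        "arn": "arn:aws:iam::123456789012:user/alice",
    },
}


def summarise_event(record: dict) -> str:
    """Return 'when who what' for a single CloudTrail record."""
    identity = record.get("userIdentity", {})
    # Fall back to the identity type when no ARN is present.
    who = identity.get("arn") or identity.get("type", "unknown")
    what = f'{record.get("eventSource", "?")}:{record.get("eventName", "?")}'
    when = record.get("eventTime", "?")
    return f"{when} {who} {what}"


if __name__ == "__main__":
    print(summarise_event(SAMPLE_RECORD))
```

A list of such summaries per user is also a reasonable starting point for the privilege reviews mentioned earlier.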

  • Apply security at all layers.

Apply security controls at every layer (e.g., edge network, VPC, subnet, load balancer, every instance, operating system, and application) rather than relying on a single outer layer.

  • Automate security best practices.

For example, use CloudFormation templates, where the security can be built into the templates themselves. Use CloudFormation drift detection to detect manual changes. Use AWS Config to automate the remediation of non-compliant configurations.

Customize the delivery of AWS CloudTrail and other service-specific logging to capture API activity globally and centralize the data for storage and analysis.

Integrate the flow of security events and findings into a notification and workflow system such as a ticketing system, a bug/issue system, or other security information and event management (SIEM) system.
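As an example of the kind of check that can be automated, the sketch below flags security groups that allow SSH from anywhere. It operates on plain dictionaries shaped like the entries returned by the EC2 `DescribeSecurityGroups` API (`IpPermissions`, `FromPort`, `ToPort`, `IpRanges`, `Ipv6Ranges`). The function name `open_to_world` is my own, and this is only an illustration of the idea, not how AWS Config evaluates its rules.

```python
def open_to_world(group: dict, port: int = 22) -> bool:
    """Return True if the group allows inbound TCP on `port` from any address."""
    for perm in group.get("IpPermissions", []):
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            covers_port = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            from_port = perm.get("FromPort")
            to_port = perm.get("ToPort")
            covers_port = (
                from_port is not None
                and to_port is not None
                and from_port <= port <= to_port
            )
        else:
            covers_port = False  # udp, icmp, etc. are ignored in this sketch
        if not covers_port:
            continue
        if any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", [])):
            return True
        if any(r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", [])):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical group with SSH open to the internet.
    risky = {"IpPermissions": [{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
                                "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]}
    print(open_to_world(risky))
```

A finding from a check like this could then be pushed into the ticketing or SIEM system described above.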

  • Protect data in transit and at rest.

Protect data as required to meet legal and compliance requirements. Classify your data into sensitivity levels and use mechanisms such as encryption, tokenization, and access control where appropriate.

  • Prepare for security events.

For example, if a system is compromised, have a procedure or playbook to follow. To contain the incident, adjust security group rules to isolate the compromised instance and de-register it from the load balancer. Place the instance on a forensic network for analysis.
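The containment steps of such a playbook can be scripted in advance. The sketch below assumes `ec2` and `elbv2` are boto3 clients (for example `boto3.client("ec2")` and `boto3.client("elbv2")`) and uses their `modify_instance_attribute` and `deregister_targets` calls. The quarantine security group, the target group ARN, and the instance ID are hypothetical, and `RecordingClient` is a stand-in I added so the example can be dry-run without touching AWS.

```python
class RecordingClient:
    """Stand-in for a boto3 client: records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. API method names.
        def method(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return method


def isolate_instance(ec2, elbv2, instance_id, quarantine_sg_id, target_group_arn):
    """Contain a compromised instance: swap security groups, leave the load balancer."""
    # Replace all of the instance's security groups with a quarantine group
    # (assumed to have no inbound or outbound rules).
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg_id])
    # Stop the load balancer from sending traffic to the instance.
    elbv2.deregister_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": instance_id}],
    )


if __name__ == "__main__":
    ec2, elbv2 = RecordingClient(), RecordingClient()
    isolate_instance(ec2, elbv2, "i-0123456789abcdef0", "sg-0quarantine",
                     "arn:aws:elasticloadbalancing:example")
    for call in ec2.calls + elbv2.calls:
        print(call)
```

The instance is deliberately left running so that its memory and disks remain available for forensic analysis.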

Reliability

The reliability pillar encompasses the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as mis-configurations or transient network issues.

  • Test recovery procedures.

Use automation to simulate different failures or to recreate scenarios that led to failures before.

  • Automatically recover from failure.

Use automated recovery processes that work around or repair a failure. For example, use multiple regions or AZs.

Know the availability requirements of the application. For example, 99.99% availability allows only about 53 minutes of downtime per year. If manual intervention is involved, that is an unrealistic figure.
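The downtime budget implied by an availability target is simple arithmetic, and it is worth working out before committing to a figure. The helper below is a small sketch; the function name is my own, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes per year")
```

At 99.99% the budget is roughly 52.6 minutes for the whole year, which leaves very little room for a person to be paged, log in, and diagnose a problem by hand.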

  • Scale horizontally to increase aggregate system availability.

For example, use EC2 Auto Scaling, replacing one large resource with multiple smaller ones to reduce the impact of a single failure. Be aware of limits: for example, limits on the number of instances or on available IP addresses may cause issues with Auto Scaling.
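The effect of spreading load over more instances, and of the limits mentioned above, can be put into numbers. This is a back-of-the-envelope sketch that assumes load is spread evenly across identical instances; both function names and their parameters are hypothetical.

```python
def capacity_after_one_failure(instance_count: int) -> float:
    """Fraction of total capacity left when one of N identical instances fails."""
    if instance_count < 1:
        raise ValueError("need at least one instance")
    return (instance_count - 1) / instance_count


def max_scalable_instances(instance_limit: int, free_ip_addresses: int) -> int:
    """Scaling out stops at whichever limit is reached first."""
    return min(instance_limit, free_ip_addresses)


if __name__ == "__main__":
    for n in (1, 2, 4, 10):
        print(f"{n} instance(s): {capacity_after_one_failure(n):.0%} capacity after one failure")
```

With a single large instance, one failure removes all capacity; with four smaller ones, three quarters of it survives.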

  • Stop guessing capacity.

Monitor demand and system utilization, and automate the addition or removal of resources to maintain the optimal level to satisfy demand without over- or under-provisioning. For example, use AWS Auto Scaling, which includes the scaling of DynamoDB.
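The idea of adjusting capacity to demand can be sketched as a simple proportional rule: scale the current capacity by the ratio of the observed metric to its target, then clamp the result to configured bounds. This is a simplification for illustration only and is not the algorithm AWS Auto Scaling actually uses; the function and parameter names are my own.

```python
import math


def desired_capacity(current: int, observed: float, target: float,
                     minimum: int, maximum: int) -> int:
    """Proportional scaling: keep the per-instance metric near the target."""
    if target <= 0:
        raise ValueError("target must be positive")
    # Round up so that we never deliberately under-provision.
    wanted = math.ceil(current * observed / target)
    return max(minimum, min(maximum, wanted))


if __name__ == "__main__":
    # Four instances at 90% average CPU, aiming for 60%.
    print(desired_capacity(current=4, observed=90, target=60, minimum=1, maximum=10))
```

The same rule scales in when utilization falls below the target, which is how over-provisioning is avoided as well as under-provisioning.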

  • Manage change in automation.

For example, use CloudFormation change sets to preview and manage changes to the infrastructure.

Performance Efficiency

When considering performance, you want to maximize your performance by using computation resources efficiently and maintain that efficiency as the demand changes.

  • Democratize advanced technologies.

Technologies such as NoSQL databases can be consumed as services, so you can focus on product development. In situations where a technology is difficult to implement yourself, consider using a vendor. In implementing the technology for you, the vendor takes on the complexity and knowledge, freeing your team to focus on more value-added work.

  • Go global in minutes.

Go global by deploying to multiple regions, or reduce latency by using edge locations. CloudFront, Web Application Firewall, and Lambda@Edge all integrate with the edge locations.

  • Use serverless architectures.

They remove the need for you to run and maintain servers. They offer high performance at low cost.

  • Experiment more often.

Test out new ideas in a way that is not possible with on-premises hardware. Quickly carry out comparative testing using different types of instances, storage, or configurations.

  • Apply mechanical sympathy.

Use the technology approach that aligns best to what you are trying to achieve. If a new service is released, research it as it might save you money.

Consider data access patterns when you select database or storage approaches.

Cost Optimization

Cost optimization is the ability to avoid or reduce unneeded cost. It is an ongoing requirement of any good architectural design. The process is iterative and should be refined and improved throughout your production lifetime.

  • Adopt a consumption model.

For example, you can stop test resources when they are not in use. Right-size instances by monitoring how the resources are being used.
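The saving from stopping test resources outside working hours is easy to estimate. The sketch below uses a made-up hourly rate and only counts compute hours; it ignores charges that continue while an instance is stopped, such as attached storage. The function names are hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_cost(hourly_rate: float, hours_running: float) -> float:
    """Compute cost for one week at a flat hourly rate."""
    return hourly_rate * hours_running


def weekly_saving(hourly_rate: float, hours_needed: float) -> float:
    """Saving from running only when needed instead of around the clock."""
    return weekly_cost(hourly_rate, HOURS_PER_WEEK) - weekly_cost(hourly_rate, hours_needed)


if __name__ == "__main__":
    rate = 0.10  # hypothetical price per hour, not a real AWS rate
    hours = 50   # ten hours a day, five days a week
    print(f"Always on: {weekly_cost(rate, HOURS_PER_WEEK):.2f}")
    print(f"Working hours only: {weekly_cost(rate, hours):.2f}")
    print(f"Saving: {weekly_saving(rate, hours):.2f}")
```

Running for 50 of the 168 hours in a week means paying for roughly 30% of the always-on compute cost.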

  • Measure overall efficiency.

Measure the costs associated with delivering an application or service, for example using cost allocation tags. The goal is to deliver business value at the lowest price point.

  • Stop spending money on data center operations.

AWS does the heavy lifting of racking, stacking, and powering servers, so you can focus on your customers and business projects rather than on IT infrastructure.

  • Analyze and attribute expenditure.

Using tags, you can attribute costs to the various business owners.

Often, the cost of data transfer is overlooked.
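Attributing costs by tag amounts to grouping line items by a tag value and summing them. The sketch below uses an invented line-item shape (`service`, `cost`, `tags`) rather than the real billing report format, and the `Owner` tag key is just an example. Untagged items, which in this sample include a data transfer charge, are collected under their own heading so they are not lost.

```python
from collections import defaultdict


def cost_by_tag(line_items: list, tag_key: str) -> dict:
    """Sum costs per value of `tag_key`; items without the tag go to 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical line items for illustration only.
    items = [
        {"service": "EC2", "cost": 10.0, "tags": {"Owner": "web"}},
        {"service": "RDS", "cost": 5.0, "tags": {"Owner": "web"}},
        {"service": "S3", "cost": 4.0, "tags": {"Owner": "analytics"}},
        {"service": "DataTransfer", "cost": 2.5, "tags": {}},
    ]
    for owner, total in sorted(cost_by_tag(items, "Owner").items()):
        print(f"{owner}: {total:.2f}")
```

A large "untagged" total is a useful signal in its own right, since it shows how much spend cannot yet be attributed to a business owner.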

  • Use managed services to reduce cost of ownership.

This removes the operational burden of maintaining servers.