AWS Well-Architected - Why so many get it wrong

This blog follows on from my speaking session as a Community Builder at the AWS Summit in London (June 7th 2024).

I looked at why so many people misunderstand the Well-Architected Framework and, as a result, perform poorly in Well-Architected Reviews.

While this post will not cover everything I talked about, it should give you an idea of the content and act as a reminder if you were at the event.


Summary of the Framework

In case you are unsure what the framework consists of, the image below gives a summary of each pillar and what, in my view, you should cover when reviewing workloads.

6 Pillars of the framework

Full details of the framework can be found at aws.amazon.com/architecture/well-architected. You can also take a look at some of my posts from a few years back on the What, Why and How of the Well-Architected Framework.


How people translate the pillars

Unfortunately, many people misunderstand the pillars, and this becomes the start of their problems.

Operational Excellence
When talking about Operational Excellence, I often hear the following statements:

I have lots of metrics that AWS gives me.
I’ve implemented solution X* for this.
I’m cloud native so AWS does all this for me.
Yeah, we do DevOps.

*For solution X, insert your choice of ops tool, such as Splunk, Datadog or AppDynamics.

The problem with these statements is that they assume no thought is needed for operations. This is not the case: whatever tool you use, you still need to make decisions that ensure you can manage the full development lifecycle.

Security
When it comes to security, while there is focus at a high level, I hear worrying comments such as:

We don’t have to worry about that, we’re all in on AWS.
We’ve not had a breach yet so must be doing ok.
It’s not production so we don’t need to worry.
We have a security team for all that stuff.

The worrying thing is that these comments treat security as an afterthought rather than as part of the development process. This means that mistakes can be very costly, not only to remediate but also in terms of reputation and potential legal penalties.

Reliability
While the cloud can offer more reliability, it is not inherent in the implementation, especially if you are not using managed services. It is therefore worrying that I hear comments such as:

We moved to micro-services, so that fixes all our reliability concerns.
AWS SLAs mean we don’t need to worry about reliability.
Everything's elastic so it will just self-heal.
Everything fails, what’s the issue?

There is a complacency here: an assumption that using modern architectural patterns or cloud services means reliability is not something that needs to be managed.
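To put rough numbers on the SLA comment, here is a minimal Python sketch. It assumes a request only succeeds when every dependency in its path is available and that failures are independent, so the individual availabilities multiply. The three figures are hypothetical placeholders, not real AWS SLA values, and the function names are my own for illustration.

```python
from math import prod


def composite_availability(availabilities):
    """Availability of a request path that needs every dependency to be up.

    Assumes dependencies fail independently, so availabilities multiply.
    """
    return prod(availabilities)


def monthly_downtime_minutes(availability, minutes_per_month=30 * 24 * 60):
    """Expected minutes of downtime in a 30-day month at a given availability."""
    return (1 - availability) * minutes_per_month


if __name__ == "__main__":
    # Hypothetical availabilities for three services in one request path.
    dependencies = [0.9995, 0.9999, 0.9995]
    overall = composite_availability(dependencies)
    print(f"End-to-end availability: {overall:.4%}")
    print(f"Expected downtime per month: {monthly_downtime_minutes(overall):.1f} minutes")
```

Even when each component looks highly available on its own, the end-to-end figure for the workload is lower than that of any single component, which is why reliability still has to be designed and managed rather than assumed from an SLA.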

Performance Efficiency
As with operational excellence and reliability, there is a worrying trend of assuming that by using hyperscalers such as AWS, or modern solutions such as Kubernetes, performance efficiencies are gained "out of the box". Comments such as the following show a lack of understanding of the cloud and modern technologies:

We have auto-scaling to manage performance.
We are using AWS Services for all components so performance efficiency is built in.
We are serverless so this is not an issue for us.
Yeah, K8S manages all that for us.

As with the operational excellence and reliability pillars, performance efficiency is something that needs to be managed throughout the lifecycle of workloads.

Cost Optimization
While moving to the cloud can offer cost savings, that does not translate to cost optimization. When people make the comments below, what they are really saying is "it's not my job to worry about costs".

Serverless means we don’t need to worry about optimizing our costs.
Someone else manages the bill so I don’t need to worry about costs.
It's cheaper than the legacy solution we replaced so what’s the issue?

The reality is that cost optimization is not an afterthought. Similar to security, if everyone is aware of the cost implications of the work they do, they can help to reduce the financial burden.

Sustainability
This is another pillar that is often treated as someone else's responsibility. The comments below seem to be common among those developing solutions, and show almost an indifference to the impact those solutions have.

The cloud is greener than on-premises so why worry too much?
We have no control over how sustainable AWS is.
As developers can we really impact sustainability that much?

The reality, similar to cost, is that if you make sustainability part of the culture and process, you can have a huge impact on the sustainability of your workloads.


Well-Architected in Action

The next section of my talk took a little dig at DevOps and SRE processes.

Focusing on the software lifecycle, I looked at how both of these agile mindsets often have a narrow view of what's needed at each phase.

For example, in the requirements phase I talked about how DevOps might just want the minimum requirements needed to push through the process, whereas SRE will look at the minimum needed for the system not to fail.

In the design and build phases I talked about DevOps wanting the latest tech and tools to build up their CVs, while SREs just want to do everything Netflix or Google do.

In the test and run phases I focused on DevOps just wanting to keep everyone happy and get sign-off, whereas SREs want to deploy fast without worrying about quality or commitments such as KPIs and SLAs.

Finally, I talked about what the lifecycle should look like. The main focus is tying back to business needs and values, and ensuring decisions are based on these and not on developer or engineer preference. I also talked about testing at each phase of the development cycle, not just to pass a requirement but to destruction, so that how the system reacts in certain scenarios can be measured, investigated and understood, documented, automated and handed to support teams.


What is Architecture

After talking about Well-Architected in Action, I looked at two areas I think also lead to poor performance in reviews.

The first was to look at the different types of architecture. Many mistakenly believe that the Well-Architected Framework and subsequent reviews are focused only on technical architecture. In fact, the framework and reviews are focused on the enterprise and solution architectures in addition to the technical ones. A good article for finding out more about the different types of architecture is from Michael Widjaja over on itarch.info.

What is IT Architecture & Types of Architectures
Explaining what is Technology architecture and different types of IT architecture
Understanding the levels of IT architecture on itarch.info

I also talked about the golden triangle, and how good architecture is not only focused on technology but also on people and processes, and how these are crucial to being "Well-Architected". Becky Simon over on smartsheet.com has a great article on People, Process and Technology (PPT).

Complete Guide to the PPT Framework | Smartsheet
Learn how the people, process, technology (PPT) framework can aid in your plans for organizational change.

I closed my talk by looking at what could be seen as a legacy architecture and how, once you look at the reasoning and choices behind it, it actually becomes a high-performing workload in a Well-Architected Review.

🗒️
Well-Architected is not just about the technology you deploy, it's about the processes that lead you to those decisions.

In the next few blog posts I'll dive into each pillar and look at what you can do to make sure your team doesn't end up making some of these comments, and that you come out on top when reviewing your workloads.