SRE's role in safer infrastructure-as-code
This article explores 2 simple ways for SREs to drive better practices and code hygiene within infrastructure-as-code (IAC) tooling like Terraform.
Why bother?
Because IAC is central to running cloud infrastructure efficiently, it's highly likely that you will get involved with an IAC problem at some point in your SRE career.
I will mention Terraform from time to time as a stand-in for IAC. However, many of the issues will likely apply to similar IAC tools like Pulumi, AWS CloudFormation and more.
Terraform helps automate large parts of platform work and cloud provisioning. However, despite its automation promise, the nature of Terraform work can generate ongoing maintenance toil.
Let's explore two optimizations for Terraform to reduce potential toil and risk burden.
Keep your Terraform code organized
Yevgeniy "Jim" Brikman is a co-founder of Gruntwork and specializes in Terraform, HashiCorp's infrastructure provisioning tool. He learned a few interesting things from writing 300,000 lines of infrastructure code.
One of these learnings was that many teams keep their Terraform code as a single monolithic file. It's mind-blowing to consider, but many engineering teams still have elements of legacy culture.
He noted more than a few teams storing all their environment configurations (development, QA, testing, staging, and production) in one gigantic file. This led to several problems, including:
difficulty finding the line of code that needed to be updated
long wait times (5+ minutes) for the monolith Terraform code to execute
an increased risk of missing red-line warnings in the plan output, e.g. that executing the code will delete the database
An even more concerning issue was that everyone was given admin permission to edit .tf files. In other words, less experienced developers gained admin-level access and could unwittingly make undesirable changes to production infrastructure.
This may be an acceptable risk in more progressive organizations with a strong psychological safety apparatus. But having extensive experience in bureaucratic workplaces, I can tell you it's not always wise.
That caution most certainly applies to more traditional work environments like finance, government, healthcare etc.
In summary, a monolith Terraform file risks a minor change breaking everything. So make sure it doesn't happen. And if it is happening, change it, pronto.
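One way to make it harder to miss a red-line change such as a deleted database is to scan the plan before anyone applies it. The sketch below assumes the plan has been exported to JSON (for example with `terraform show -json`) and relies on the `resource_changes` and `change.actions` fields of that output. The resource addresses and the `find_destructive_changes` helper are hypothetical, made up for illustration.

```python
import json


def find_destructive_changes(plan: dict) -> list[str]:
    """Return the addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as both "delete" and "create" in the actions list.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hypothetical, heavily trimmed plan document used only for illustration.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_db_instance.main", "change": {"actions": ["delete"]}},
    {"address": "aws_s3_bucket.assets", "change": {"actions": ["update"]}},
    {"address": "aws_instance.web", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_vpc.core", "change": {"actions": ["no-op"]}}
  ]
}
""")

for address in find_destructive_changes(sample_plan):
    print(f"DESTRUCTIVE: {address}")
```

A check like this could fail a pipeline step or simply post a warning on the change request. Which of the two is appropriate depends on how often your team replaces resources on purpose.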
Most teams don't plan to build a monolithic codebase, but IAC consultants keep finding them, so it is clearly happening. Here are recommendations for banishing monolith risk:
1. Isolate each environment as a separate .tf file or folder
2. Break down each environment into various components: database, VPC, frontend etc
3. Make all the above components reusable
Let's quickly unpack #2. VPC configuration may change once or twice a year, while the frontend might change 10 times a day. By separating them, you reduce the risk that one of those 10 daily frontend changes affects the more fragile VPC.
Outcome: modular architecture with smaller components reduces the surface area for attacks and errors affecting uptime.
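To illustrate how a modular layout limits the reach of a change, here is a minimal sketch that maps changed files to the environment and component they belong to. The `live/<environment>/<component>/` folder convention and the `affected_components` helper are assumptions made for this example, not a layout that Terraform prescribes.

```python
from pathlib import PurePosixPath

# Assumed (hypothetical) repository layout:
#   live/production/vpc/main.tf
#   live/production/database/main.tf
#   live/production/frontend/main.tf
#   live/staging/frontend/main.tf


def affected_components(changed_files: list[str]) -> list[tuple[str, str]]:
    """Return sorted (environment, component) pairs touched by the changed files."""
    affected = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        # Only files under live/<environment>/<component>/ belong to a component.
        if len(parts) >= 4 and parts[0] == "live":
            affected.add((parts[1], parts[2]))
    return sorted(affected)


changed = [
    "live/production/frontend/main.tf",
    "live/production/frontend/variables.tf",
    "live/staging/frontend/main.tf",
    "README.md",
]
print(affected_components(changed))
```

With this layout, a frontend change only requires planning and applying the frontend folders. The VPC and database state is never touched by that run.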
Safeguard your Terraform templates
There are 2 ways that Terraform templates make it into a software system:
Platform engineers equip developers with Terraform templates or
Developers source their own templates if there is no platform support
The latter brings with it an inherent risk. Both externally and internally sourced templates may have built-in vulnerabilities — sometimes by mistake, sometimes intentionally.
The result remains the same: potential production-level risks may get overlooked and cause future downtime.
An analysis by Unit 42 of Palo Alto Networks in 2020 found nearly 200,000 insecure IAC templates in use.
The risk is greater for teams that give developers full responsibility for their IAC. I am not trying to undermine the efforts of developers here, but it's an unfortunate reality.
Let's take the encryption standards of PCI (payment card industry) as an example. Since 2018, PCI has treated TLS 1.0 as non-compliant encryption. The protocol has known weaknesses that can let attackers intercept and read the traffic it encrypts.
A 2021 study by Azure operators found that developers were still trying to launch services to production with TLS 1.0 encryption. Almost 3 years later, for services that handled financial data.
The message here isn’t to avoid IAC if you aren't a Terraform expert. It’s that teams should consider the potential for risks in their deployments and mitigate them.
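As one possible mitigation for the TLS 1.0 case, a pre-deployment check can look for weak TLS settings in the planned resources. The sketch below again assumes a plan exported as JSON, where `change.after` holds the planned attribute values. The attribute names and values in `TLS_ATTRIBUTES` and `WEAK_TLS_VALUES` are assumptions: they differ between providers and resource types (`min_tls_version` with `TLS1_0` is how an Azure storage account expresses it), so adapt them to what you actually deploy.

```python
# Assumed attribute names and weak values; extend these for your own providers.
TLS_ATTRIBUTES = ("min_tls_version", "minimum_tls_version")
WEAK_TLS_VALUES = {"TLS1_0", "1.0"}


def find_weak_tls(plan: dict) -> list[tuple[str, str, str]]:
    """Return (address, attribute, value) for planned resources allowing TLS 1.0."""
    findings = []
    for rc in plan.get("resource_changes", []):
        # "after" is null when a resource is being deleted, hence the fallback.
        after = rc.get("change", {}).get("after") or {}
        for attr in TLS_ATTRIBUTES:
            if after.get(attr) in WEAK_TLS_VALUES:
                findings.append((rc["address"], attr, after[attr]))
    return findings


# Hypothetical, trimmed plan document used only for illustration.
sample_plan = {
    "resource_changes": [
        {
            "address": "azurerm_storage_account.legacy",
            "change": {"actions": ["create"], "after": {"min_tls_version": "TLS1_0"}},
        },
        {
            "address": "azurerm_storage_account.current",
            "change": {"actions": ["create"], "after": {"min_tls_version": "TLS1_2"}},
        },
    ]
}

for address, attr, value in find_weak_tls(sample_plan):
    print(f"{address}: {attr} = {value}")
```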
Here are 3 methods that can lower the risk in IAC without stifling developers:
Institute 2-person review to verify that the IAC has been coded correctly
Have platform engineers or SREs write IAC templates with input from AppSec
Set appropriate policy controls at the cloud service provider level
Let's now explore these 3 methods in more detail.
1. Institute 2-person review of IAC code
How would this work? One person, either the platform engineer (if available) or a developer, would write the infrastructure code. The other person would review it and comment on its correctness.
SREs are well-positioned to be reviewers. It's in their interest to pay attention, as improper IAC configuration and vulnerabilities can lead to downtime. This would consequently eat into their error budgets and impact their SLOs.
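The rule itself is simple enough to state in code: a change counts as reviewed only if someone other than its author approved it. The sketch below is a minimal illustration. The `ChangeRequest` type and the user names are hypothetical, and in practice most teams would enforce this through their code hosting platform's branch protection settings rather than a script of their own.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Hypothetical record of an infrastructure change awaiting merge."""
    author: str
    approvers: set[str] = field(default_factory=set)


def satisfies_two_person_rule(change: ChangeRequest) -> bool:
    """True if at least one approver is someone other than the author."""
    return bool(change.approvers - {change.author})


self_approved = ChangeRequest(author="dev-a", approvers={"dev-a"})
peer_reviewed = ChangeRequest(author="dev-a", approvers={"sre-b"})

print(satisfies_two_person_rule(self_approved))
print(satisfies_two_person_rule(peer_reviewed))
```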
2. Platform/SRE teams collaborate with security
In many forward-looking organizations, SREs have regular and direct interactions with security teams. If this is not the case in yours, consider it for IAC collaborations.
Security involvement in IAC development gives another perspective on vulnerabilities that can bring down the system in production.
SREs can bake what they learn from security colleagues into IAC templates, so that each new deployment starts from a more secure baseline.
3. Set appropriate policy controls at the cloud provider level
Cloud (IaaS) providers may have in-built policy controls for the kind of infrastructure that should get provisioned e.g. Azure Policy.
However, tread carefully with these: overly onerous generic controls can keep teams from deploying fast and often.
I suggest that SREs evaluate developer deployment patterns and service criticality to create more appropriate and customized controls at the cloud provider level.
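One way to picture such customized controls is a mapping from service criticality to the set of controls that applies. Everything in the sketch below is hypothetical: the tier names, the control names and the `controls_for` helper are placeholders, and the real enforcement would live in the cloud provider's policy service rather than in a script.

```python
# Hypothetical tiers and control names, for illustration only.
STRICTEST_TIER = "critical"

CONTROLS_BY_TIER = {
    "critical": frozenset({"require_tls_1_2", "deny_public_storage", "require_review"}),
    "standard": frozenset({"require_tls_1_2", "deny_public_storage"}),
    "sandbox": frozenset({"require_tls_1_2"}),
}


def controls_for(tier: str) -> frozenset[str]:
    """Return the controls for a tier, falling back to the strictest set."""
    return CONTROLS_BY_TIER.get(tier, CONTROLS_BY_TIER[STRICTEST_TIER])


print(sorted(controls_for("sandbox")))
print(sorted(controls_for("unclassified")))
```

Defaulting unknown tiers to the strictest set is a deliberate choice in this sketch: a service nobody has classified yet gets the tightest controls until someone decides otherwise.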
Conclusion
By the nature of their work, SREs need skill in, or at least an understanding of, IAC. This understanding needs to be constantly refined, as potential IAC vulnerabilities evolve over time.
A typical SRE's work is too varied for day-to-day IAC checks. Instead, they can provide timely guidance to developers and run monthly checks on the infrastructure code.
This article touches only the tip of the IAC iceberg. There are many more considerations one could make to ensure the infrastructure is provisioned effectively and securely. As always, keep learning.