Detecting Terraform Drift

Infrastructure-as-Code (IaC) is a great thing. But just moving to IaC doesn't mean all of your application infrastructure problems go away. In fact, it can lead to new issues you never anticipated.

One of the more pressing issues is drift. In this article, we'll cover what cloud infrastructure drift is and why it's so vexing. Then we'll look at the tools that one popular IaC platform, Terraform, provides to detect and manage drift.

What is drift?

With IaC, you can stand up and tear down the architecture for your application on demand. A fully automated IaC deployment means you can launch your application in production without any manual interference. Just check in your changes to your code repository and your application is live a short time later.

But what happens after your application is in production?

Once in production, it's tempting to tweak your stack yourself. Maybe you're trying to address an immediate customer problem. Or management told your team to stop spending so much money on virtual machines.

Other members of your team may also change your stack in other ways. For example, they may make API calls that modify or even delete stack infrastructure.

Drift occurs when an instance of your application stack deviates in reality from its written definition. It's all the changes you forgot or neglected to write down. Loosely stated, drift is a measure of how different your running stack would look if you re-deployed it from code.

Drift and desired configuration

As Drew Wright from Fugue notes, drift isn't always bad. Some changes are expected and even welcomed.

For example, let's say you launched an Amazon Elastic Container Service (ECS) cluster with three compute nodes. You also added an autoscaling rule that powers down a node on the cluster if its vCPU is less than 50% utilized for 30 minutes.

Now let's say you check on your cluster at some point and only one instance is running. That's drift. But it's not bad drift. You expect the number of instances you're running on your cluster to change in response to your traffic needs. That's one of the great benefits of the cloud: you can power off infrastructure you no longer need.

Additionally, this instance of drift isn't directly connected to your IaC templates. If you relaunched your production stack, then - assuming traffic to your app remained the same - the new ECS cluster would naturally return to running one instance based on the specified autoscaling rules.

Bad drift would be changing your autoscaling rules directly in production without changing them in your IaC code. In that case, if you relaunched your stack, your autoscaling would revert to old rules that no longer meet the needs of your business.

In other words, you don't care about every value in your application changing. You care about the ones that define how your stack is supposed to behave. The set of configuration values that you wish to remain stable across your application's lifetime represents its desired configuration.
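
To make the idea concrete, here is a minimal Python sketch that treats drift as a comparison limited to the values you care about. The dictionaries, the key names, and the find_drift helper are hypothetical illustrations, not part of Terraform or any cloud API.

```python
def find_drift(desired, actual, tracked_keys):
    """Return {key: (desired_value, actual_value)} for tracked keys that differ."""
    drift = {}
    for key in tracked_keys:
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical values loosely based on the ECS example above.
desired = {"scale_in_cpu_threshold": 50, "scale_in_minutes": 30, "running_nodes": 3}
actual = {"scale_in_cpu_threshold": 20, "scale_in_minutes": 30, "running_nodes": 1}

# Only the autoscaling rule belongs to the desired configuration.
# The node count is expected to change with traffic, so it is not tracked.
tracked = ["scale_in_cpu_threshold", "scale_in_minutes"]

print(find_drift(desired, actual, tracked))
```

In this sketch the changed node count is ignored, while the hand-edited CPU threshold is reported as drift.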

Common causes of drift

The most common cause of drift is manual intervention: someone makes a change to production without first making it in code.

Manual drift often starts innocently enough. A Site Reliability Engineer or even a dev changes a configuration value directly in production to resolve a pressing issue. Such quick changes are sometimes fine if they're followed up by an IaC deployment that records the change and makes it repeatable.

However, manual drift usually exposes two potential problems:

Overprivileged access. If fixing issues directly in production is team policy, it likely means too many people on your team have access to production. For security and privacy reasons, production access should be limited and granted on-demand.
Lack of change management oversight. All changes to production should go through their own deployment process - even configuration changes. This ensures that changes are reviewed and tested gradually before they're pushed live.

Drift can also occur across different environments due to incomplete deployments. If you tested a change in staging but didn't push it through to production, the environments will be out of sync.

Finally, other programs - such as automated tooling - can also change the state of a stack.

Why is architectural drift bad?

Drift can result in a jumble of mismatched configurations that destabilize your application's performance. It can even create a loss of site functionality so severe that it takes down your application.

Drift can also be a security breach waiting to happen. Think of the damage that can happen if a security fix - e.g., removal of plain-text credentials, enabling of HTTPS connections - never makes it to production.

Drift detection and reconciliation

Drift detection is an issue that predates the cloud. It plagues any complex software application where stack configuration isn't closely monitored and a uniform configuration isn't enforced.

For years, several software companies have produced configuration management tools that address exactly this problem. Tools like Chef and Ansible can not only deploy your application but also enforce a desired configuration state across all of your cloud resources.

How drift occurs with Terraform

Fortunately for its users, Terraform provides its own methods for reconciling drift.

As we explained in our last article, Terraform uses a state file to keep track of everything it's deployed into your cloud account. Drift can occur whenever you make a change to your infrastructure independent of your Terraform files.

How Terraform performs drift detection

Terraform integrates with cloud systems via a collection of plugins called providers. Each provider implements a READ method that enables Terraform to capture a cloud object's state.

Terraform Refresh

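Terraform will automatically refresh its state before it begins a deployment. It does this by calling the provider's READ method for each resource it manages. You can also instruct Terraform to perform this operation at any time by using the terraform refresh command. In recent Terraform releases, this command is deprecated in favor of terraform apply -refresh-only, which shows you the changes it detected and asks for approval before it updates the state file.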

How Terraform resolves drift

Terraform Refresh will reconcile your Terraform state file with whatever is running in your cloud account. However, it will not reconcile your actual infrastructure with a desired state. For that, you need to use Terraform Plan.
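
The distinction can be illustrated with a toy model in Python. This is a simplified sketch, not Terraform's implementation: the refresh function, the read_resource callback (standing in for a provider's READ method), and the resource names are all assumptions made for illustration.

```python
def refresh(state, read_resource):
    """Return a new state whose attributes mirror what the provider reads.

    `read_resource` stands in for a provider's READ method: it takes a
    resource address and returns the live attributes, or None if the
    resource no longer exists.
    """
    refreshed = {}
    for address in state:
        live = read_resource(address)
        if live is not None:
            refreshed[address] = dict(live)
        # A resource that is gone from the cloud is dropped from state.
    return refreshed


# Hypothetical cloud account: someone changed the instance type by hand
# and deleted the bucket.
cloud = {"aws_instance.web": {"instance_type": "t3.large"}}
state = {
    "aws_instance.web": {"instance_type": "t3.micro"},
    "aws_s3_bucket.logs": {"acl": "private"},
}

new_state = refresh(state, cloud.get)
print(new_state)
```

After the refresh, the state records the hand-edited instance type and no longer lists the deleted bucket. The cloud dictionary itself is untouched, which is the point: refreshing updates Terraform's record of reality, not reality.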

Terraform Plan

terraform plan is a command that Terraform runs prior to terraform apply. You can also run it manually yourself. terraform plan generates a list of the changes that running terraform apply will enact on your Terraform deployment. In other words, Terraform Plan captures the difference between actual state and desired state.

Terraform Plan will generate an action plan to restore desired state in a large number of scenarios, including:

Re-creating a cloud object that you defined in Terraform but deleted in your cloud account
Changing tags and configuration values back to their Terraform-defined values
Destroying and re-creating an object if the change can't be made in place (e.g., changing the AMI used as the base image of a virtual machine)
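
If you want to inspect these scenarios programmatically, one option is to save a plan and export it with terraform show -json, whose output includes a resource_changes list with the actions proposed for each resource. The Python sketch below groups those actions into the categories listed above. The sample document is hand-written and heavily trimmed, and the resource addresses and the summarize_plan helper are hypothetical.

```python
import json

# Hand-written, heavily trimmed sample of plan JSON. Resource names are made up.
SAMPLE_PLAN = """
{
  "resource_changes": [
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["create"]}},
    {"address": "aws_instance.web", "change": {"actions": ["update"]}},
    {"address": "aws_instance.worker", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}}
  ]
}
"""


def summarize_plan(plan):
    """Group resource addresses by the kind of change the plan proposes."""
    summary = {"create": [], "update": [], "replace": [], "delete": []}
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        if "create" in actions and "delete" in actions:
            kind = "replace"  # destroy and re-create, in either order
        elif actions in (["create"], ["update"], ["delete"]):
            kind = actions[0]
        else:
            continue  # "no-op" and "read" need no reconciliation
        summary[kind].append(resource["address"])
    return summary


summary = summarize_plan(json.loads(SAMPLE_PLAN))
print(summary)
```

In the sample, the bucket would be re-created, the web instance updated in place, and the worker instance replaced, while the unchanged VPC is left out of the summary.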

Pros of Terraform drift detection and resolution

The biggest pro of using Terraform's built-in drift detection and mitigation is that it keeps everything in a single tool. If you've already made the investment in Terraform, it makes sense to leverage that investment as much as possible.

Plus, utilizing Terraform's drift detection requires zero additional effort. It's run every time you perform a terraform apply and re-deploy your application. This keeps your architecture simple and reduces moving parts.

Cons of Terraform drift detection and resolution

The downside of relying on Terraform for drift detection and mitigation is that it might not work quickly enough. Usually, you won't call terraform apply until the next time you perform a deployment. By then, a change to production could have already resulted in unnecessary damage.
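
One way to narrow that window is to run terraform plan on a schedule rather than waiting for the next deployment. The sketch below is a minimal example of that idea, assuming the Terraform CLI is installed and the working directory is already initialized. It relies on the -detailed-exitcode flag, which makes terraform plan exit with 0 when there is nothing to change, 1 on error, and 2 when changes are pending. The function names and status strings are hypothetical.

```python
import subprocess

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending.
PLAN_CMD = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]


def interpret_exit_code(code):
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in-sync"
    if code == 2:
        return "changes-pending"
    return "error"


def check_drift(workdir):
    """Run a plan in `workdir` and report whether changes are pending.

    Requires the terraform binary on PATH and an initialized working directory.
    """
    result = subprocess.run(PLAN_CMD, cwd=workdir, capture_output=True, text=True)
    return interpret_exit_code(result.returncode)
```

A scheduler such as cron or a CI job could call check_drift("/path/to/stack") and raise an alert on "changes-pending". Note that a non-empty plan can mean either drift in the cloud account or code changes that haven't been applied yet, so the result is a prompt to investigate rather than proof of drift.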

The benefit of using a separate configuration management system is that it can monitor your environments for drift and restore monitored values shortly after a change. For example, tools like Ansible or AWS Systems Manager State Manager can ensure that EC2 firewall settings, Amazon S3 bucket read/write settings, and other values are set and maintained in a consistent manner across your infrastructure.

Another downside of relying solely on Terraform is that it only enforces configuration for the assets managed by Terraform. If you have other teams or projects not using Terraform, you will likely need a more comprehensive solution (or need to move everyone onto Terraform).

How TinyStacks can help

One of the major causes of drift is that multi-stage DevOps deployments are often too complicated to set up and maintain. TinyStacks makes it easy for anyone - from development shops to customer success teams - to deploy IaC-driven stacks to any environment and manage them through a single pane of glass. Check out TinyStacks today to see how it can accelerate and simplify your DevOps deployments.