Have you ever debugged something that worked perfectly until you ran your own infrastructure code again? That was my Tuesday afternoon. The day before, SES was happily verifying my domain in two regions, inbound and outbound, and sending and receiving email like a normal piece of infrastructure should. By Tuesday afternoon, one of the two regions had quietly slipped into verification-pending, and I could not figure out why.

I checked the DNS. dig TXT _amazonses.threemoonsnetwork.net. Saw one TXT value. Looked fine. Cross-referenced against the SES console’s expected token. Matched. SES, however, said Failed: TemporaryFailure. I was getting the polite version of “I don’t know, ask me later.”

So I asked Claude. Claude said: “DNS propagation can take a few minutes. Wait, then try again.”

Fair. Reasonable. DNS propagation is real. I waited. I re-ran terraform apply because I had been mid-deploy. I re-checked the DNS. Different TXT value this time. I re-checked the SES console. The two regions had swapped states: the one that had been pending was now verified, and the one that had been verified was now pending.

I asked Claude again. Claude said: “DNS records can be slow to converge. Give it a few more minutes.”

I waited. I apply’d. The DNS value changed again. The verification states flipped. I was watching a binary state machine where neither state was stable.

I asked Claude a third time. Claude said: “Sometimes you have to wait for SES to retry the verification check.”

Claude was right every time, in the sense that DNS propagation is a real thing, SES verification retries are a real thing, and waiting was a reasonable suggestion given the question I was asking. The question I was asking was “why is SES verification flaky.” The question I should have been asking, the one I did not think to ask Claude because I did not yet know to ask it, was “why is the DNS record itself changing every time I run terraform apply.”

I figured it out on the fourth or fifth round of dig-apply-dig. The TXT value for _amazonses.<domain> was not stable. It was being overwritten on every apply. Each apply produced a different value. Each value, in isolation, was a valid SES verification token. The token just happened to be for whichever region’s SES module had last touched the record.

The setup, which I had built and which I had never put together as one architecture in my head, was this. I had an aws_route53_record for _amazonses.<domain> inside the ses module, set to the outbound region's verification token. I had another aws_route53_record for _amazonses.<domain> inside the ses-inbound module, set to the inbound region's verification token. Each module set allow_overwrite = true on its record. Each module's apply ran without errors. Each module silently overwrote the other's TXT value, depending on which one Terraform decided to apply last in the dependency graph.
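In code, the collision looked roughly like this. This is a reconstruction from memory, not the actual module source; file paths, resource names, and variables are illustrative:

```hcl
# modules/ses/main.tf -- outbound region
resource "aws_ses_domain_identity" "this" {
  domain = var.domain
}

resource "aws_route53_record" "verification" {
  zone_id         = var.zone_id
  name            = "_amazonses.${var.domain}"
  type            = "TXT"
  ttl             = 600
  records         = [aws_ses_domain_identity.this.verification_token]
  allow_overwrite = true # "create this even if the record already exists"
}

# modules/ses-inbound/main.tf -- inbound region: same record name, same
# type, a different token, and the same allow_overwrite. Whichever of
# these applies last wins, and the other region's token disappears.
resource "aws_ses_domain_identity" "this" {
  domain = var.domain
}

resource "aws_route53_record" "verification" {
  zone_id         = var.zone_id
  name            = "_amazonses.${var.domain}"
  type            = "TXT"
  ttl             = 600
  records         = [aws_ses_domain_identity.this.verification_token]
  allow_overwrite = true
}
```

Each block is valid on its own. Neither is wrong in isolation. The bug only exists in the combination, which is why no single plan or apply ever flagged it.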

The DNS only stored one value at a time. Whichever module wrote last won. SES verification for the losing region would then fail until the next apply, at which point it would maybe win and the other region would lose. The system was technically working — Terraform was happy, the records existed, the values were valid tokens — but the architecture was fighting itself.

This is the “two abstractions, one resource” anti-pattern. It is the exact shape of a class of bug I have hit before in other systems. Ansible and Puppet both managing /etc/nginx/nginx.conf. A deploy pipeline and a developer both running database migrations. ECS service desired_count declared in Terraform while autoscaling tries to override it. In every case, the failure mode is the same: two writers, one resource, neither one knows the other exists. The infrastructure ping-pongs based on whoever wrote last.

The fix is also always the same: pick one owner. The other becomes a consumer. The resource has one source of truth in the code, and any other thing that needs to influence it does so by feeding inputs to the owner, not by writing directly.
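The ECS example above is worth one concrete snippet, because the fix there is the same move in miniature: decide that autoscaling owns desired_count and tell Terraform to stop writing it. A sketch, with placeholder cluster and task definition references:

```hcl
# Autoscaling is the sole owner of desired_count; Terraform sets it once
# at creation and never writes it again.
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id         # placeholder reference
  task_definition = aws_ecs_task_definition.app.arn # placeholder reference
  desired_count   = 2 # initial value only

  lifecycle {
    ignore_changes = [desired_count]
  }
}
```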

In this case the fix was to consolidate the TXT record. Take it out of both modules. Create one aws_route53_record at the environment level, with a records list containing both regions' tokens. The TXT record type allows multiple values per name. SES is happy to verify against any matching token in the record. Both regions get verified. Neither overwrites the other. The fight ends because there is no longer anyone to fight with.
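A minimal sketch of the consolidated version, assuming each SES module exposes its token as an output (the output and module names here are mine; the real ones may differ):

```hcl
# environments/prod/main.tf -- the single owner of the verification record.
# Both modules feed it; neither writes DNS directly anymore.
resource "aws_route53_record" "ses_verification" {
  zone_id = var.zone_id
  name    = "_amazonses.${var.domain}"
  type    = "TXT"
  ttl     = 600
  records = [
    module.ses.verification_token,         # outbound region
    module.ses_inbound.verification_token, # inbound region
  ]
}
```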

The consolidated record now lives in environments/prod/main.tf. There is a comment block above it that says, more or less: “Do not split this back into two records. They fight. SES randomly breaks. The fix took an afternoon to debug. The fix is to leave this alone.” I have a strong feeling that comment will save someone — possibly a future me — an afternoon, the next time someone decides to “clean up” the consolidated record and put each region’s token back in its own module for better separation of concerns.

I went back later and looked at the Claude conversation. Three “wait a few minutes” suggestions, each correct in the abstract, each unhelpful in my specific situation. The failure was not Claude’s. I had asked a context-free question — “why is SES verification flaky” — and got a context-free textbook answer. If I had pasted both module files and the apply order and said “here’s what I see, here’s what terraform is doing,” Claude would almost certainly have caught the dual-owner pattern. I never gave Claude the context, because I didn’t yet realize the context mattered. That is its own lesson. Pair programming, with an AI or with a human, is only as good as the framing of the problem you bring to it. A search-engine question gets a search-engine answer.

The bigger SRE lesson is the one about waiting. AWS told me to wait. The DNS docs told me to wait. SES told me to wait. Claude told me three times to wait. They were all right that DNS propagation is real and you usually need to wait a few minutes after a record change. They were also all unaware that the record was changing every time I waited and applied again. Waiting only works when the system is converging. When the system is fighting itself, waiting just makes you a slower spectator to the fight.

There is now a one-line check at the end of my deploy: dig the relevant TXT records and compare against expected values. If they don’t match, the deploy fails loudly. Two minutes to add. Would have saved an afternoon. That is the most boring possible piece of automation, and it is exactly the kind of thing that makes the next afternoon less stupid.
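In spirit it is one line of dig piped to grep. Spelled out, and assuming a shell deploy script with the expected tokens in environment variables (my assumptions; the real script will differ), it looks something like this:

```sh
# Fail the deploy if the live TXT record is missing any expected token.
for token in "$OUTBOUND_TOKEN" "$INBOUND_TOKEN"; do
  dig +short TXT "_amazonses.${DOMAIN}" | grep -qF "$token" || {
    echo "SES verification token missing from DNS: $token" >&2
    exit 1
  }
done
```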

Claude said wait. The AWS docs said wait. My own scripts now say wait. They were all right. The infrastructure was in flux and it needed to settle. The part nobody told me — the part I had to figure out by myself in dig-apply-dig — was that the infrastructure had to be allowed to settle. As long as I kept running terraform apply, it was never going to. I was the source of the flux. I was waiting for myself to stop being the problem, which is one of the slower-converging things in computing.

Anyway. SES is verified now, in both regions, off one record, with one owner. The system has settled. I am as patient as the infrastructure required me to be. It only took me four hours of impatience to get there.