Running AI agents in production: what nobody tells you about credentials

Seven credential failure modes that don't show up until you've shipped. Notes from running agents past the demo stage.

April 25, 2026 · 8 min read

The interesting bugs in AI agents are not in the prompts. They're not in the model. They're in the boring layer underneath: auth, tokens, refresh, scopes, who owns what. A demo agent works because the demo runs once, on your laptop, with all the keys you personally pasted in. A production agent runs forever, on someone else's box, against tokens that age, with concurrency you didn't model, behind networks you don't control.

This post is a punch list of seven failure modes I've watched first-hand or debugged for someone else. None of them are in the framework docs. All of them are in the postmortems.

1. The refresh-token race

The bug: two replicas of the same agent share a refresh token. Both notice the access token is about to expire at roughly the same time. Both call the token endpoint. One refresh succeeds and rotates the refresh token. The other gets an invalid_grant error because the old refresh token was just invalidated. That replica falls over. The next replica that picks up the work also falls over, because nobody updated the shared store with the new tokens.

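You can pre-detect this in two minutes if you check whether your provider rotates refresh tokens. The OAuth 2.0 spec (RFC 6749) makes rotation optional, but a provider that rotates treats each refresh token as single-use: the old one is invalidated the moment a new one is issued. GitHub Apps with expiring user tokens behave this way. Other providers don't rotate, and some don't issue refresh tokens at all, which is its own problem. Confirm the behavior in each provider's docs rather than assuming.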

The fix: serialize refreshes through a single broker process, or use a lease/leader pattern so only one replica refreshes at a time. The other replicas wait for the new token to land in the shared store and read it from there.
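A minimal sketch of the serialization idea, in Python. It isn't taken from any particular broker: SerializedRefresher and refresh_fn are hypothetical names, the in-process lock stands in for a distributed lease, and the in-memory Tokens object stands in for the shared store.

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class Tokens:
    access_token: str
    refresh_token: str
    expires_at: float  # Unix timestamp


class SerializedRefresher:
    """Let only one caller at a time spend the refresh token."""

    def __init__(self, tokens, refresh_fn, skew_seconds=60.0):
        self._tokens = tokens          # stand-in for the shared token store
        self._refresh_fn = refresh_fn  # hypothetical: refresh_token -> Tokens
        self._skew = skew_seconds      # refresh this many seconds before expiry
        self._lock = threading.Lock()  # stand-in for a lease or leader election

    def _is_fresh(self, tokens):
        return tokens.expires_at - time.time() > self._skew

    def get_access_token(self):
        tokens = self._tokens
        if self._is_fresh(tokens):
            return tokens.access_token
        with self._lock:
            # Re-check under the lock: another caller may have refreshed
            # while this one was waiting, so the old refresh token is gone.
            if not self._is_fresh(self._tokens):
                self._tokens = self._refresh_fn(self._tokens.refresh_token)
            return self._tokens.access_token
```

The re-check under the lock is the important part. Without it, every waiting caller would spend the refresh token again as soon as it got the lock. Across real replicas, the lock and the store both have to live somewhere shared.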

This is one of the reasons "just put the token in an env var per replica" stops working once you scale past one instance.

Warning

If your agent is multi-replica today and you have not built refresh serialization, you have a latent outage. It hasn't fired because all your replicas haven't tried to refresh in the same minute yet. They will. The triggering event is usually a deploy that restarts everything at once and forces a synchronized refresh.

2. Tokens that "work" but are actually wrong

A token can pass validation and still be the wrong token for the current request. Three common shapes:

Wrong account. Personal GitHub token used against a work API call. The API responds 200 with the wrong data. Your agent processes it and emits subtle bugs for a week.

Wrong scope. Token has repo but not workflow. Most API calls work. The one that needs workflow returns 403. Your agent retries it forever: the error is not transient, but the retry policy treats any 403 as a possible rate limit.

Wrong installation. GitHub App tokens are scoped to an installation. If you minted a token for installation A and the repo lives under installation B, you get 404. Not 403. 404 looks like "repo doesn't exist", which sends your error handler the wrong way.

The fix is to validate that the identity of the token matches the target of the call before retrying. A good broker does this by tagging each credential with the account/installation/scope it belongs to and refusing to use it for a call that doesn't match.
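Here is a rough sketch of that tagging check in Python. The Credential and Target types and the check_credential function are hypothetical names made up for illustration, not any broker's API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


class CredentialMismatch(Exception):
    """Raised instead of sending a call with the wrong credential."""


@dataclass(frozen=True)
class Credential:
    token: str
    account: str                    # org or user the token belongs to
    installation_id: Optional[int]  # GitHub App installation, if any
    scopes: FrozenSet[str]


@dataclass(frozen=True)
class Target:
    account: str
    installation_id: Optional[int]
    required_scopes: FrozenSet[str]


def check_credential(cred: Credential, target: Target) -> None:
    """Refuse a credential whose identity does not match the call target."""
    if cred.account != target.account:
        raise CredentialMismatch(
            f"wrong account: token is for {cred.account!r}, call targets {target.account!r}"
        )
    if target.installation_id is not None and cred.installation_id != target.installation_id:
        raise CredentialMismatch(
            f"wrong installation: token is for {cred.installation_id}, "
            f"call targets {target.installation_id}"
        )
    missing = target.required_scopes - cred.scopes
    if missing:
        raise CredentialMismatch(f"missing scopes: {sorted(missing)}")
```

Running this before the request, and again before any retry, turns a misleading 200, 403, or 404 into an error that names the actual mismatch.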

3. The midnight-restart silent break

You ship at 4pm. Everything works. At midnight, your container orchestrator does its routine restart. The agent comes back up. The agent's first task runs at 3am. The task fails because the OAuth refresh that ran during the restart used a stale client_secret that you rotated last week but forgot to push to the secrets manager the orchestrator pulls from.

You wake up at 8am to a queue of failed tasks and no obvious smoking gun, because the error in the logs is invalid_client and the production agent has never logged that string before.

The lessons:

  • Treat client_secret rotation as a deploy-time concern, not a "set it once and forget it" concern.
  • Have a startup-time auth probe that fails the container's readiness check if the OAuth credentials are bad. Better to crash-loop on startup than to silently fail tasks at 3am.
  • If you can avoid client_secret entirely (PKCE for SPAs, GitHub App private keys, etc.), do.
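The startup probe from the second bullet could look roughly like this in Python. It is a sketch under assumptions: run_auth_probes and is_ready are hypothetical names, and each probe is a callable you supply that performs one real credential check (for example, a token refresh) and raises on failure.

```python
from typing import Callable, Dict


def run_auth_probes(probes: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run every probe and return {connection name: failure reason}."""
    failures: Dict[str, str] = {}
    for name, probe in probes.items():
        try:
            probe()
        except Exception as exc:  # any failure means the container is not ready
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


def is_ready(probes: Dict[str, Callable[[], None]]) -> bool:
    """Readiness check: True only if every credential probe passed."""
    failures = run_auth_probes(probes)
    for name, reason in failures.items():
        # Log the reason now, so invalid_client shows up at deploy time
        # instead of in a failed task at 3am.
        print(f"auth probe failed for {name}: {reason}")
    return not failures
```

Wire is_ready into whatever your orchestrator polls for readiness, or exit non-zero at startup when it returns False.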

4. The provider that revoked your refresh token

Some providers expire refresh tokens after a period of inactivity. GitHub revokes after six months. Google after six months for some scopes and never for others. Microsoft has a 90-day sliding window for personal accounts and a different one for tenants.

If your agent runs every day, you never hit this. If your agent runs once a quarter (compliance reporting, billing rollups), you absolutely hit this and you find out at the worst possible moment.

The mitigation is to schedule a no-op refresh on a cadence shorter than the shortest revocation window for any provider you depend on. Daily is conservative and cheap. The broker should handle this; you shouldn't have to write a cron job to ping refresh endpoints.
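The cadence arithmetic is simple enough to sketch in Python. The function names, the 0.5 safety factor, and the one-day ceiling are my assumptions for illustration. The windows you pass in must come from each provider's current documentation.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def keepalive_interval(
    windows: Dict[str, timedelta],
    safety_factor: float = 0.5,
    ceiling: timedelta = timedelta(days=1),
) -> timedelta:
    """Pick a refresh cadence well inside the shortest inactivity window."""
    shortest = min(windows.values())
    return min(shortest * safety_factor, ceiling)


def due_for_refresh(
    last_refreshed: Dict[str, datetime],
    interval: timedelta,
    now: datetime,
) -> List[str]:
    """Return the connections whose last refresh is at least one interval old."""
    return sorted(
        name for name, last in last_refreshed.items() if now - last >= interval
    )
```

With a 90-day and a 180-day window, the default ceiling wins and the cadence comes out at one day, which matches the "daily is conservative and cheap" rule of thumb.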

5. The CI runner that holds a personal key

This one's a people problem more than a technical one. An engineer wires up an agent in a CI workflow with their personal API key, "just to get the demo working". Six months later, the engineer leaves. Their account gets deprovisioned. The CI workflow starts failing. Nobody knows whose key was in there until someone reads the workflow file.

The fix is policy, not tooling: any credential used in CI must belong to a service account, not a person. The broker pattern helps by making it obvious which connections are service vs. personal (you tag them at login time), and refusing to export personal credentials into CI environments.

A surprisingly common variant is the personal OPENAI_API_KEY that someone pasted into the team's shared .env.example. Search every repo you own for OPENAI_API_KEY=sk- right now. You may be surprised.
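If you'd rather script that search than grep by hand, here is a small Python sketch. find_pasted_keys is a hypothetical helper. It reports only file and line number, so the key itself never lands in your terminal history or CI logs.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Matches OPENAI_API_KEY=sk-..., with optional spaces and an optional quote.
KEY_PATTERN = re.compile(r"OPENAI_API_KEY\s*=\s*[\"']?sk-")


def find_pasted_keys(root: str) -> List[Tuple[str, int]]:
    """Return (path, line number) for every line that looks like a pasted key."""
    root_path = Path(root)
    hits: List[Tuple[str, int]] = []
    for path in root_path.rglob("*"):
        if not path.is_file():
            continue
        if ".git" in path.relative_to(root_path).parts:
            continue  # skip git internals
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than crash the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            if KEY_PATTERN.search(line):
                hits.append((str(path), lineno))
    return hits
```

This only looks at the working tree. A key that was committed and later deleted is still in git history, so treat any hit as a key to rotate, not just a line to remove.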

6. The HTTPS_PROXY that breaks DNS

If your agent runs in a corporate network, there's a proxy. If you also use a credential broker that sets HTTPS_PROXY=127.0.0.1:7998, you have two proxies fighting over the same env var. Which one wins depends on which one set it last. The subtle failure: outbound traffic that's supposed to go through the corporate proxy goes through the broker instead. The broker forwards it straight to the internet, and the corporate firewall drops the packets because that source IP shouldn't be making external calls.

The fix is to use NO_PROXY aggressively. Your broker should know which destinations it cares about and let everything else fall through to whatever proxy was set before it took over. The good ones do this automatically. If you're rolling your own proxy boundary, this is the thing that bites you in week three.
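The fall-through decision can be sketched as a per-request routing function in Python. This is an illustration, not how any specific broker does it: proxy_for and host_matches are hypothetical names, and the suffix-matching rule is borrowed loosely from the usual NO_PROXY convention.

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit


def host_matches(host: str, patterns: Iterable[str]) -> bool:
    """True if host equals a pattern or is a subdomain of it."""
    host = host.lower().rstrip(".")
    for pattern in patterns:
        suffix = pattern.lower().lstrip(".")
        if host == suffix or host.endswith("." + suffix):
            return True
    return False


def proxy_for(
    url: str,
    managed_hosts: Iterable[str],
    broker_proxy: str,
    previous_proxy: Optional[str],
) -> Optional[str]:
    """Send managed destinations to the broker and everything else to
    whatever proxy was configured before the broker took over."""
    host = urlsplit(url).hostname or ""
    if host_matches(host, managed_hosts):
        return broker_proxy
    return previous_proxy  # None means no proxy was set before
```

The detail that matters is capturing previous_proxy before the broker overwrites HTTPS_PROXY. Once the old value is gone, there is nothing left to fall through to.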

7. The audit log that nobody reads

When a credential gets misused, the question your security team will ask is: "who used this token, when, and what for?" If the answer is "I don't know, OpenAI's dashboard shows 14,000 calls last week and they're all from our server's IP", that's a long incident.

A broker's request log is the natural answer to this. Every outbound call passes through it. Every call can be tagged with the agent identity, the requesting task, the target host, the response code, and the time. None of those require additional plumbing in your agent code; the broker has all of them.

The catch is that the log is useless if nobody looks at it. Set up a weekly anomaly check: requests per host per agent, error rates, new destinations the agent has never called before. That last one is the prompt-injection alarm bell.
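A weekly check along those lines might start as small as this Python sketch. The RequestRecord fields mirror the tags listed above. The names, the known_hosts baseline, and the "status 400 or higher counts as an error" rule are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class RequestRecord:
    agent: str
    task: str
    host: str
    status: int
    timestamp: float


def weekly_report(
    records: Iterable[RequestRecord],
    known_hosts: Dict[str, Set[str]],
) -> dict:
    """Summarize one week of broker logs per (agent, host) pair."""
    counts: Counter = Counter()
    errors: Counter = Counter()
    new_destinations = set()
    for rec in records:
        key = (rec.agent, rec.host)
        counts[key] += 1
        if rec.status >= 400:
            errors[key] += 1
        # A host this agent has never called before is the signal to chase first.
        if rec.host not in known_hosts.get(rec.agent, set()):
            new_destinations.add(key)
    return {
        "requests": dict(counts),
        "error_rates": {key: errors[key] / total for key, total in counts.items()},
        "new_destinations": sorted(new_destinations),
    }
```

Feed known_hosts from the previous weeks' logs, and page someone when new_destinations is non-empty.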

The pattern

If you re-read this list, all seven items become tractable when there's a thing sitting between the agent and the provider that:

  • Owns the refresh logic (1, 4)
  • Knows the identity-to-target mapping (2)
  • Validates credentials at startup (3)
  • Tags personal vs. service credentials (5)
  • Owns the proxy environment (6)
  • Logs every request (7)

That thing is a credential broker. It's not magic; it's just the natural home for the cross-cutting concerns that every production agent rediscovers six months in.

The seven failure modes above don't go away with a broker. They become someone else's problem. Which is usually progress.

The other thing nobody tells you

You will be on call for your agents. The agent that worked perfectly in your dev environment will, in production, encounter every retry-storm, every TLS-pinning library, every rate limit, every CIDR allowlist, and every cron-time DST transition that exists. The credential layer is where most of those failures route through, because authentication is the first thing any API call does and the most informative thing it can fail on.

Plan for being woken up by an OAuth error. Plan for the runbook needing one command, not five. Plan for the on-call to be someone other than the original implementer.

If you've been the second person on call for an agent built by someone who left, you know what bad looks like. The broker pattern is one of the few things that genuinely makes the second-on-call situation better, because all the credential state lives in one well-known place with a documented CLI.

Priyansh Khodiyar

Maintainer

Works on authsome and the agentr.dev tooling.