Building a DevOps agent: cluster, cloud, PagerDuty, GitHub, without a single long-lived key

A field report on building a production DevOps incident triage agent across EKS, CloudWatch, PagerDuty, Datadog, GitHub and Slack with zero long-lived credentials on disk.

May 31, 2026 · 14 min read

Your CTO read a vendor blog. Your security team read a different one. Both arrive at your desk on the same Monday with the same instruction: "stand up an incident-triage agent on the prod cluster." You have until Friday.

The agent has to do real work. When PagerDuty fires, it pulls the incident, runs kubectl get pods against the affected namespace, queries CloudWatch or Datadog for the relevant time window, makes a couple of read-only AWS API calls under an STS-assumed role, opens a GitHub issue with the findings, and drops a note in #incidents on Slack. Then it stops and waits for a human.

You have also watched a steady drumbeat of public-repo credential leaks over the last few years. Research like Palo Alto Unit 42's EleKtra-Leak shows that exposed AWS keys on GitHub get picked up and abused by attacker tooling on very short timelines. GitGuardian, Snyk, and GitHub themselves have all published yearly "state of secrets" reports documenting millions of credentials leaked into public repositories. The exact numbers vary by methodology; the direction is consistent.

So the answer "drop AWS_SECRET_ACCESS_KEY, PD_API_KEY, DD_API_KEY, DD_APPLICATION_KEY, GITHUB_TOKEN, and SLACK_BOT_TOKEN into a Kubernetes Secret and exec the agent" is not a starting point. It is the postmortem.

This is the build I actually shipped. It triages incidents across six surfaces with zero long-lived credentials on disk, and the trick is that no single tool solves it. AWS is solved by AWS. Kubernetes is solved by Kubernetes. The SaaS layer (GitHub, Slack, Linear) is solved by a local credential broker. PagerDuty and Datadog sit in the honest middle where you either write a small custom provider or accept a tightly-scoped K8s Secret. I will show all of it.

The threat model in one paragraph

Treat the agent process as compromised. Assume an attacker can read its environment variables, dump its memory, exfiltrate any file it can open(), and make any outbound HTTP request it can make. The design goal is that even when all of that is true, the attacker cannot:

  1. Move laterally inside your AWS account beyond what a 1-hour STS session permits.
  2. Exfiltrate Kubernetes Secrets.
  3. Reach any host outside your allowlist.
  4. Hold a credential that survives the agent's pod restarting.

That is the whole post. The rest is mechanics.

For a wider framing of the four threat models AI agents face today, see AI agent security in 2026: four threat models.

The architecture

code
┌─────────────────────────────────────────────────────────────────────┐
│  EKS prod-east · namespace: ops                                      │
│                                                                      │
│   ┌────────────────────────────────────────┐                         │
│   │  Pod: devops-agent                     │                         │
│   │   - ServiceAccount: devops-agent       │                         │
│   │     ↳ EKS Pod Identity → IAM role      │  AWS APIs               │
│   │       (1h STS, read-only)              │ ───────────────────►    │
│   │   - kubeconfig: in-cluster, view RBAC  │  K8s API                │
│   │   - PD/DD creds: K8s Secret, scoped    │ ───────────────────►    │
│   │   - GH/Slack/Linear: NO env vars       │                         │
│   │     ↳ local broker injects per-host    │  api.github.com         │
│   │       at the proxy boundary            │  slack.com              │
│   └────────────────────────────────────────┘  api.linear.app         │
│                                                                      │
│   Calico egress policy: deny by default, allowlist 6 FQDNs           │
└─────────────────────────────────────────────────────────────────────┘

Six external surfaces, four credential mechanisms, one egress policy. Let's go through them in the order they bite you.

1. AWS: let AWS handle AWS

The single most important rule in this whole build: do not route AWS calls through any third-party broker. AWS already has short-lived credentials. AWS already has automatic rotation. AWS already has CloudTrail. Adding a TLS-terminating intermediary on the path to your prod account is strictly worse.

On EKS, the modern recommended mechanism is EKS Pod Identity. The Pod Identity Agent runs as a node-local DaemonSet, and AWS SDKs inside your pod fetch credentials from it over a link-local address. Sessions are short-lived and rotated automatically; review the Pod Identity Agent setup docs for the exact TTL behavior in your SDK version. The SDK reloads transparently.

If Pod Identity is not available to you (EKS on Outposts, EKS Anywhere, self-managed k8s on EC2), use IRSA. If your agent runs on a laptop, on-prem, or another cloud, use IAM Roles Anywhere with an X.509 cert and the AWS signing helper. In every case the workload holds a short-lived credential, not a long-lived key.

Wire it up:

bash
cat > trust.json <<'JSON'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "pods.eks.amazonaws.com" },
    "Action": ["sts:AssumeRole", "sts:TagSession"]
  }]
}
JSON

aws iam create-role \
  --role-name devops-agent-readonly \
  --assume-role-policy-document file://trust.json

aws iam attach-role-policy \
  --role-name devops-agent-readonly \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchReadOnlyAccess

aws eks create-pod-identity-association \
  --cluster-name prod-east \
  --namespace ops \
  --service-account devops-agent \
  --role-arn arn:aws:iam::111122223333:role/devops-agent-readonly

One quirk to call out before you build a more restrictive policy: several CloudWatch read actions like cloudwatch:GetMetricData and cloudwatch:ListMetrics do not support resource-level permissions and require Resource: "*" in IAM policies. See the CloudWatch IAM reference for the current list. You can restrict by action, you can restrict KMS keys, you can scope log-group ARNs for logs:*, but the metric-read actions are wildcard-only. Don't waste 90 minutes trying to get a Resource: arn:aws:cloudwatch:... policy to validate. It won't.
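
To make the wildcard quirk concrete, here is a minimal sketch of a tighter policy document built in Python. The log-group name is hypothetical, and the two logs:* actions are just illustrative picks; check the CloudWatch and CloudWatch Logs IAM references before relying on any specific action list.

```python
import json


def build_read_policy(log_group_arn: str) -> dict:
    """Return an IAM policy document: wildcard metric reads, scoped log reads."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Metric-read actions do not accept resource ARNs, so "*" is required.
                "Sid": "MetricReadsWildcardOnly",
                "Effect": "Allow",
                "Action": ["cloudwatch:GetMetricData", "cloudwatch:ListMetrics"],
                "Resource": "*",
            },
            {
                # Log reads can be narrowed to a single log group.
                "Sid": "LogReadsScopedToOneGroup",
                "Effect": "Allow",
                "Action": ["logs:GetLogEvents", "logs:FilterLogEvents"],
                "Resource": log_group_arn,
            },
        ],
    }


# Hypothetical log group; substitute the one your workloads actually write to.
EXAMPLE_ARN = "arn:aws:logs:us-east-1:111122223333:log-group:/eks/prod-east/app:*"
policy_json = json.dumps(build_read_policy(EXAMPLE_ARN), indent=2)
```

Writing policy_json to a file and passing it to aws iam put-role-policy would replace the managed CloudWatchReadOnlyAccess attachment with something narrower.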

AWS has published guidance on secure access patterns for AI agents that echoes the same principles: short-lived credentials, least-privilege roles, scoped tokens, network controls. If your security team needs a vendor link to bless the design, that is the family of references to point at.

2. Kubernetes: this is not a credential problem

The agent's pod already has a mounted service-account token. The question is what that service account is allowed to read.

The built-in view ClusterRole gives read-only access to most objects in a namespace, with Secrets excluded (Kubernetes RBAC docs). That exclusion is intentional: reading Secrets would expose service-account credentials, which is a privilege-escalation path. It is exactly the surface a triage agent needs. Bound through a ClusterRoleBinding, as below, that read access applies in every namespace. Don't invent a custom role.

yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: devops-agent
  namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: devops-agent-view
subjects:
- kind: ServiceAccount
  name: devops-agent
  namespace: ops
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io

If you want to be even tighter, drop the ClusterRoleBinding to a RoleBinding per namespace the agent is responsible for. The principle is the same: a Kubernetes-issued token, automatically rotated, scoped via RBAC. No kubectl config file with embedded certs. No long-lived KUBECONFIG baked into a Docker image.

Warning

Do not give the agent a kubeconfig pulled from aws eks update-kubeconfig on someone's laptop. Inside the cluster, client-go reads /var/run/secrets/kubernetes.io/serviceaccount/token automatically. Outside the cluster (the dev case), use aws eks get-token so the auth token is short-lived and scoped to the running developer's AWS session.

3. Egress: deny by default, allowlist six hosts

This is the step everyone skips and regrets.

Native Kubernetes NetworkPolicy does not support FQDN egress rules (k8s NetworkPolicy docs). It only understands IPs, CIDRs, namespace selectors, and pod selectors. The IP address behind api.pagerduty.com is on a CDN and rotates. You cannot allowlist it by IP without breaking the agent twice a week.

You need either Calico's FQDN-aware policies, Cilium's equivalent, or an egress proxy (Squid, Istio egress gateway). Calico is the easiest if you already have it; confirm that your Calico edition supports domain-based rules before you plan around them.

yaml
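apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: devops-agent-egress
  namespace: ops
spec:
  selector: app == "devops-agent"
  types:
  - Egress
  egress:
  # Allow DNS first: the final Deny would otherwise block lookups to kube-dns,
  # and the FQDN rule below could never resolve. The labels assume a stock
  # kube-dns/CoreDNS install; add TCP 53 and any in-cluster destinations
  # (Kubernetes API, Pod Identity Agent) your cluster needs.
  - action: Allow
    protocol: UDP
    destination:
      namespaceSelector: projectcalico.org/name == "kube-system"
      selector: k8s-app == "kube-dns"
      ports:
      - 53
  # The six external hosts the agent is allowed to reach.
  - action: Allow
    destination:
      domains:
      - api.pagerduty.com
      - api.datadoghq.com
      - api.github.com
      - slack.com
      - sts.amazonaws.com
      - monitoring.us-east-1.amazonaws.com
  # Everything else is dropped.
  - action: Deny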

This single policy is the difference between "the agent got prompt-injected and exfiltrated our incident data to a pastebin" and "the agent got prompt-injected and got a connection error." For the longer version of that argument, see how prompt injection becomes credential exfiltration.

Six hosts. Everything else denied. The list is short on purpose: when it grows past ten, you have lost the plot.
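
The same allowlist can be mirrored inside the agent as a cheap second check. The sketch below is an assumption of mine, not part of the shipped build, and it does not replace the network policy: a compromised process can simply skip it. It is mostly useful for failing fast with a readable error, and for showing why matching must be on the exact parsed hostname rather than a substring.

```python
from urllib.parse import urlsplit

# Mirrors the six FQDNs in the egress policy above.
ALLOWED_HOSTS = frozenset({
    "api.pagerduty.com",
    "api.datadoghq.com",
    "api.github.com",
    "slack.com",
    "sts.amazonaws.com",
    "monitoring.us-east-1.amazonaws.com",
})


def is_allowed(url: str) -> bool:
    """Return True only for https URLs whose parsed hostname is allowlisted."""
    parts = urlsplit(url)
    # .hostname is lowercased and excludes userinfo and port; drop a trailing dot.
    host = (parts.hostname or "").rstrip(".")
    return parts.scheme == "https" and host in ALLOWED_HOSTS
```

Exact matching is the point: https://api.github.com.evil.example/ and https://api.github.com@evil.example/ both contain an allowed name as a substring, and both are rejected because the parsed hostname is something else.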

4. PagerDuty: scoped OAuth, not a global API key

PagerDuty's classic REST API key is a generic admin key. Don't use it for an agent. Use scoped OAuth 2.0 and request only incidents.read (and incidents.write if and only if the agent acknowledges incidents). Check the current PagerDuty OAuth docs for token lifetimes and refresh-token behavior; you will need refresh tokens for a long-running service.

The triage call is unremarkable:

bash
curl -sS -G "https://api.pagerduty.com/incidents" \
  --data-urlencode "statuses[]=triggered" \
  --data-urlencode "statuses[]=acknowledged" \
  -H "Authorization: Bearer ${PD_OAUTH_TOKEN}" \
  -H "Accept: application/vnd.pagerduty+json;version=2"

The interesting part is webhook verification on the way in. PagerDuty v3 webhooks are signed with HMAC-SHA256 in the x-pagerduty-signature header (versioned, e.g. v1=...), and the signing secret is returned in the create-subscription response (PagerDuty webhooks docs). Verify every webhook. An unverified PagerDuty webhook handler is a remote trigger waiting for someone to find it.
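
A minimal verification sketch, using only the standard library. It assumes the header value is one or more comma-separated v1=<hex digest> entries and that the digest is HMAC-SHA256 over the raw request body; confirm both against the header your own subscription emits. The function name is mine.

```python
import hashlib
import hmac


def verify_pagerduty_signature(secret: str, raw_body: bytes, header_value: str) -> bool:
    """Return True if any v1 signature in the header matches the raw body."""
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    for candidate in header_value.split(","):
        version, _, digest = candidate.strip().partition("=")
        if version != "v1":
            continue  # ignore unknown signature versions
        # Constant-time comparison; bytes avoid errors on non-ASCII input.
        if hmac.compare_digest(digest.encode("utf-8"), expected.encode("utf-8")):
            return True
    return False
```

Call it with the unparsed request bytes, before any JSON decoding. Re-serializing the parsed body can change whitespace or key order, which would change the digest.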

PagerDuty is not bundled in Authsome's stock provider set. Two honest options:

  • Define it as a custom provider. Authsome reads provider definitions from ~/.authsome/providers/<name>.json and supports OAuth 2.0 with PKCE. You write it once.
  • Keep the PagerDuty OAuth refresh token in a tightly-scoped Kubernetes Secret with RoleBinding to the agent's service account only. Acceptable if you'd rather not maintain a custom provider for one service.

Both are fine. Both keep the secret off disk in the agent's process tree. Don't let perfect be the enemy of "off disk."

5. Datadog: two headers, scoped application key

Datadog requires both DD-API-KEY (org-level) and DD-APPLICATION-KEY (user/scope-level) on every authenticated call (Datadog API authentication). The application key is the one that carries RBAC scope; create a dedicated service-account user, attach only the read permissions you need (metrics, logs, monitors), and generate the application key under that user (API and application keys).

bash
curl -sS "https://api.datadoghq.com/api/v2/query/timeseries" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "data": {
      "type": "timeseries_request",
      "attributes": {
        "formulas": [{"formula": "query1"}],
        "queries": [{
          "name": "query1",
          "data_source": "metrics",
          "query": "avg:kubernetes.cpu.usage.total{kube_namespace:prod} by {pod_name}"
        }],
        "from": '"$(($(date +%s) - 900))"'000,
        "to": '"$(date +%s)"'000
      }
    }
  }'
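
The shell quoting around from and to is easy to get wrong, so in the agent itself it may be simpler to build the body in code. This sketch reproduces the same request shape as the curl call above; the function name and the 15-minute default window are my choices, and the timestamps are epoch milliseconds as in the shell version.

```python
import json
import time
from typing import Optional


def timeseries_request(query: str, window_seconds: int = 900,
                       now: Optional[float] = None) -> dict:
    """Build a timeseries query body covering the last `window_seconds`."""
    end_ms = int((time.time() if now is None else now) * 1000)
    start_ms = end_ms - window_seconds * 1000
    return {
        "data": {
            "type": "timeseries_request",
            "attributes": {
                "formulas": [{"formula": "query1"}],
                "queries": [{
                    "name": "query1",
                    "data_source": "metrics",
                    "query": query,
                }],
                "from": start_ms,
                "to": end_ms,
            },
        }
    }


body = json.dumps(timeseries_request(
    "avg:kubernetes.cpu.usage.total{kube_namespace:prod} by {pod_name}"
))
```

The resulting string is what goes in the POST body, with the same two DD-* headers as before.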

Like PagerDuty, Datadog is not in Authsome's bundled set. Same two options: custom provider JSON, or a K8s Secret with a tight RoleBinding. Pick one and move on.

6. GitHub: installation tokens, not a PAT

This is the one most teams get wrong. They generate a fine-grained PAT under "the SRE service account everyone shares" and paste it in. That token is now tied to a Google login that nobody owns; it never rotates, and it lives in plaintext in a .env.

Use a GitHub App installation token instead (GitHub Apps docs). The app installs to specific repos, the installation token has a short TTL (around an hour), rate limits scale per installed repo, and there is no user account that can be deactivated and break everything. If you must use a PAT, use a fine-grained PAT with repo-level scoping, not classic.

bash
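# 1. Sign a short-lived app JWT with the app's private key. GitHub rejects app
#    JWTs that expire more than 10 minutes out. Assumes a jwt CLI that accepts
#    these flags; adjust for the tool you have installed.
JWT=$(jwt encode --alg RS256 --secret @app.pem \
  --iss "${GH_APP_ID}" --exp +600 -)

# 2. Exchange the JWT for an installation access token.
TOKEN=$(curl -sS -X POST \
  -H "Authorization: Bearer ${JWT}" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/app/installations/${INSTALLATION_ID}/access_tokens" \
  | jq -r .token)

# 3. Open the issue using the installation token.
curl -sS -X POST \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/repos/acme/runbooks/issues" \
  -d '{"title":"PD-INC-12345: pod CrashLoopBackOff in prod","labels":["incident"]}'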

For a longer treatment of multi-account and token-hygiene patterns, see GitHub token hygiene for AI agents.

7. Slack: bot token, two scopes, that's it

Slack bot tokens (xoxb-) are tied to the app, not a user, so they survive when the human who installed it leaves (Slack token docs). For an incident-reporting agent you need exactly chat:write to post messages, plus channels:read (and groups:read if you post to private channels) to resolve channel IDs (Slack scopes reference). Don't grant channels:history, don't grant users:read.email, don't grant files:write. None of those serve a triage agent and all of them widen the blast radius.

bash
curl -sS -X POST https://slack.com/api/chat.postMessage \
  -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d '{
    "channel":"#incidents",
    "text":"PD-INC-12345 triaged · 3 pods CrashLoopBackOff in `prod/api` · runbook: https://github.com/acme/runbooks/issues/981"
  }'

8. Collapsing the SaaS surface

Here is where the architecture stops being three OAuth dances, three refresh-token rotations, and three plaintext tokens in the agent's environment.

GitHub, Slack, and Linear (the three SaaS providers the agent talks to that are bundled) collapse into a single local broker. The pattern is simple: the agent makes a request to api.github.com with no auth header, a local proxy matches the destination, looks up the right credential in an encrypted vault, and adds the Authorization header on the outbound TLS connection. The agent process itself never reads the token, never holds it in memory, never has it in its environment.
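
The injection step itself is small. The sketch below is a hypothetical illustration of the pattern, not any particular broker's implementation: the function name, the lookup callable standing in for the encrypted vault, and the choice to discard any Authorization value the agent sent are all my assumptions.

```python
from typing import Callable, Dict, Mapping, Optional


def inject_credentials(host: str,
                       headers: Mapping[str, str],
                       lookup: Callable[[str], Optional[str]]) -> Dict[str, str]:
    """Return a copy of `headers` with Authorization set for `host`.

    `lookup` stands in for the vault: it maps a destination host to a token,
    or returns None when the broker holds nothing for that host.
    """
    # Drop whatever the agent sent (for example a placeholder value).
    outbound = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    token = lookup(host)
    if token is None:
        raise PermissionError(f"no credential configured for {host}")
    outbound["Authorization"] = f"Bearer {token}"
    return outbound


# Usage with an in-memory dict as a stand-in vault (fake token value).
vault = {"api.github.com": "example-github-token"}
signed = inject_credentials("api.github.com",
                            {"Accept": "application/vnd.github+json"},
                            vault.get)
```

Because the lookup happens on the proxy side of the boundary, the agent only ever sees the headers it sent, never the ones that left the machine.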

If you want this without writing it yourself, Authsome is an open-source MIT-licensed local broker built for exactly this. Run authsome login github, authsome login slack, and authsome login linear to seed the vault, then launch the agent with authsome run -- python triage_agent.py. The agent's env vars hold only a placeholder like authsome-proxy-managed; the real tokens live in an encrypted SQLite vault under ~/.authsome/ and never enter the agent's process. Every credential read is appended to a local JSONL audit log.

You can use a credential broker, an MCP gateway, or a homegrown sidecar. The point is the boundary, not the brand. For the broader argument, see secrets managers vs credential brokers for AI agents.

What the broker does not do, and should not do: AWS. Routing AWS through any third-party proxy adds a TLS termination point you don't need on the path to your prod account. EKS Pod Identity already gives you short-lived STS credentials with automatic rotation and CloudTrail logging. Let it.

9. The audit trail

When the incident-of-incidents happens (the agent did something weird at 3am and the on-call wants to know what), you need four logs that line up by timestamp:

Surface             | Log source               | What it answers
--------------------|--------------------------|---------------------------------------------------------
AWS API calls       | CloudTrail               | Which IAM principal called what, when, from where
Kubernetes API      | apiserver audit log      | Which service account read which pod
SaaS credential use | Broker's JSONL audit log | Which credential was injected for which outbound request
Network egress      | Calico flow logs         | What the agent tried to reach, allowed or denied
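
"Line up by timestamp" can be as simple as a merge over streams that are already time-ordered. The sketch below assumes each stream has been normalized to JSON lines carrying an ISO 8601 "timestamp" field with an explicit offset; that field name is hypothetical, since each real source uses its own schema and needs an export step first.

```python
import heapq
import json
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[datetime, str, dict]


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(value.replace("Z", "+00:00")).astimezone(timezone.utc)


def tagged(source: str, lines: Iterable[str]) -> Iterator[Event]:
    """Yield (timestamp, source, event) for each non-empty JSON line."""
    for line in lines:
        if line.strip():
            event = json.loads(line)
            yield parse_ts(event["timestamp"]), source, event


def merge_streams(streams: Dict[str, Iterable[str]]) -> Iterator[Event]:
    """Merge per-source streams into one timeline.

    Each input must already be sorted by time; heapq.merge does not sort.
    """
    return heapq.merge(*(tagged(name, lines) for name, lines in streams.items()),
                       key=lambda item: item[0])
```

Feeding it four open file handles, one per log, yields a single ordered timeline that can be printed or filtered to the window around the 3am event.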

This is the level of evidence a SOC 2 auditor or a postmortem will actually accept. For the compliance angle, see compliance for AI agents: SOC2, audit trails, credentials.

10. Things that will bite you

A short list of bruises from the build, in no particular order.

  • PagerDuty webhook signatures are versioned (v1=...). Verify against the actual x-pagerduty-signature header your subscription emits before trusting any example code; the docs have lagged the behavior in the past.
  • Datadog renames RBAC permission slugs. Don't hardcode permission identifiers in a comment as if they're permanent; link to the live permissions list when you document the service-account user you created.
  • EKS Pod Identity sessions are short. Your SDK reloads automatically; your homegrown HTTP client probably does not. If you wrote your own AWS signer, fix the reload.
  • view ClusterRole excludes Secrets. Good. If your agent claims it needs to read a Secret to triage, the agent is wrong; the Secret can be referenced by name in the postmortem without its value.
  • Native NetworkPolicy is IP-only. If you don't have Calico or Cilium yet, an Istio egress gateway with an AuthorizationPolicy is a defensible alternative. Don't ship without egress controls.
  • Don't put PagerDuty / Datadog into your broker just to have them there. A custom provider you wrote in a hurry is worse than a tightly-scoped K8s Secret with a clean RoleBinding. Earn the broker entry.
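
On the Pod Identity bullet above: if you do maintain a homegrown client, the reload logic amounts to "refetch shortly before expiry". The sketch below shows that shape under stated assumptions. The Credentials fields, the fetch callable (standing in for however you read from the credential endpoint), and the five-minute margin are all hypothetical, and the class is not thread-safe as written.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Credentials:
    access_key_id: str
    secret_access_key: str
    session_token: str
    expires_at: float  # Unix seconds


class RefreshingCredentials:
    """Cache credentials and refetch them before they expire."""

    def __init__(self, fetch: Callable[[], Credentials],
                 refresh_margin: float = 300.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch
        self._margin = refresh_margin
        self._clock = clock
        self._cached: Optional[Credentials] = None

    def get(self) -> Credentials:
        """Return valid credentials, refetching inside the refresh margin."""
        now = self._clock()
        if self._cached is None or self._cached.expires_at - now <= self._margin:
            self._cached = self._fetch()
        return self._cached
```

The signer should call get() on every request instead of holding the credentials object itself; that is the part a hand-rolled client usually misses.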

What good looks like at the end of the week

  • The agent's pod has no environment variable containing a long-lived credential.
  • AWS calls use EKS Pod Identity. Sessions are short-lived. CloudTrail logs every call.
  • Kubernetes calls use the in-cluster service-account token bound to the view ClusterRole. Secrets are inaccessible by construction.
  • PagerDuty and Datadog use scoped OAuth and scoped application keys respectively, stored in K8s Secrets with RoleBinding to one service account.
  • GitHub, Slack, Linear are injected at a local broker proxy. Their tokens never enter the agent's process.
  • Calico egress policy denies everything except six FQDNs.
  • Four audit streams line up by timestamp for any postmortem.

The agent is useful. Nothing it holds is worth stealing. That is the entire bar.

Priyansh Khodiyar

Maintainer

Works on authsome and the agentr.dev tooling.