Infrastructure Engineer
What you'll do:
- Support the application teams: turn around infra requests (permissions, roles, service setup, project peering) so product engineers stay focused on shipping.
- Own CI/CD and deployments: maintain and extend our GitHub Actions workflows and help migrate toward a dedicated CD tool with proper permissioning — the goal is fully automated, locked-down deploys via service accounts, no direct engineer access to production.
- Build and maintain infrastructure as code: author and update Terraform modules for new and existing services across GCP environments.
- Run Kubernetes the right way: manage service deployments via Helm (we're on Helm 4) and keep async workloads healthy on Dagster.
- Unify observability (likely first project): consolidate today's per-team alerting into a single view — system-to-system dashboards plus incident alerting that routes upstream service/vendor failures to the right impacted teams and on-call rotations.
- Advance resilience: help move us toward a fully region- and cloud-agnostic posture so services can pick up and move if something fails.
- Strengthen security & access: apply IAM, secrets management, least privilege, and auditability; contribute to SOC 2 readiness.
- Automate with AI: build agent skills / agents.md so routine tasks (provisioning access, simple changes) can be handled by an agent instead of human engineering hours, and use AI to reason through bigger problems.
First 30 days. Ramp on the stack (GCP, Kubernetes/Helm, Terraform, GitHub Actions, Dagster). Meet the application and security stakeholders, and start reliably handling application-team requests.
First 90 days. Operating independently on the reactive workload and proactively creating/updating/managing infrastructure across GCP environments. On-call onboarding complete (Roby shadows then reverse-shadows your first shifts).
In 1 year. Delivered concrete platform improvements — new Terraform modules meeting app-team needs, upstream dependency upgrades, and a unified alerting/observability framework wired into incident reporting and on-call. Trusted to take significant infra projects off the lead's plate.
What you bring:
- Strong software-engineering fundamentals in at least one production language (Python, Go, TypeScript, or Rust); Python especially valued, plus comfort scripting and working in the shell.
- Hands-on experience with cloud infrastructure and core cloud services, especially GCP (AWS/Azure transferable).
- Experience operating large-scale Kubernetes production systems.
- Experience with Infrastructure as Code, especially Terraform.
- Familiarity with CI/CD systems, especially GitHub Actions or Octopus Deploy.
- Ability to debug production issues using logs, metrics, traces, shell tools, and source code.
- Security and access-control fundamentals: IAM, secrets management, least privilege, and auditability.
- Clear written communication around incidents, design decisions, and operational procedures.
- Supporting SOC 2 controls: evidence collection, access reviews, change management, or audit readiness.
- Observability with Datadog, Prometheus, Grafana, OpenTelemetry, Honeycomb, or similar.
- Improving developer experience through internal tooling, templates, scripts, or platform APIs.
- Incident response experience, including postmortems and follow-up remediation.
- Experience with Dagster, Helm 3+, high-scale CD tooling (Bazel, Octopus), or AI/agent-assisted ops.
- Basic web3 / DeFi literacy (transactions, wallets) and genuine curiosity about onchain — the role doesn't touch chain directly, but the business is onchain.