Senior Site Reliability Engineer, Wikimedia Enterprise
jobgether
Belgium
Full-time
Accountabilities
In this role, you will be responsible for ensuring the reliability, scalability, and performance of large-scale distributed systems that power data and API services. You will:
- Define, track, and continuously improve SLOs, SLIs, and error budgets for critical services
- Design and enhance observability systems including metrics, logging, and distributed tracing
- Participate in incident response, on-call rotations, and post-incident reviews to drive continuous improvement
- Build and maintain CI/CD and GitOps pipelines enabling secure, automated, and reliable deployments
- Implement infrastructure-as-code and automation-first practices to reduce operational toil
- Design and operate scalable cloud infrastructure across production environments
- Drive capacity planning, performance optimization, and resilience testing (including chaos engineering practices)
- Improve developer experience by enabling self-service infrastructure and streamlined workflows
- Collaborate with security, software, and release engineering teams to embed reliability and security best practices
- Optimize infrastructure cost and efficiency using FinOps principles without compromising availability
- Develop and maintain operational metrics such as MTTR, MTTD, and incident frequency
- Contribute to platform engineering initiatives that standardize infrastructure across teams
- Mentor peers and promote best practices in SRE, automation, and systems reliability
Requirements
This position requires strong expertise in site reliability engineering, distributed systems, and cloud infrastructure, along with a proactive and collaborative mindset. You should have:
- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
- Strong experience with infrastructure-as-code tools such as Terraform and/or Ansible
- Proficiency in at least one programming language (Python, Go, or similar)
- Hands-on experience with cloud platforms such as AWS, GCP, or Azure
- Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD or similar tools)
- Strong understanding of SRE principles including SLOs, SLIs, and error budgets
- Experience with observability tooling such as Prometheus, OpenTelemetry, or equivalent
- Proven experience in incident response, on-call operations, and postmortem analysis
- Ability to operate and optimize large-scale distributed systems with high availability requirements
- Strong communication and collaboration skills in distributed, remote-first environments
- Ability to document systems clearly and contribute to shared engineering knowledge
- Strong ownership mindset, with a focus on automation, reliability, and continuous improvement
- Adaptability to fast-evolving, technology-driven environments
Benefits
- Remote-first work model with global collaboration
- Opportunity to work on high-impact systems supporting global knowledge platforms
- Exposure to large-scale distributed systems and modern cloud-native architectures
- Culture of engineering excellence, automation, and continuous improvement
- Strong emphasis on learning, experimentation, and open collaboration
- Competitive compensation adjusted to location and experience
- Inclusive and diverse work environment with global team exposure
- Opportunity to contribute to open knowledge infrastructure used worldwide
This listing comes from ats_lever.