Senior Site Reliability Engineer
jobgether
Romania
Full-time
Accountabilities
- Carry out day-to-day operations and DevOps responsibilities across large-scale public-facing infrastructure, including deployment, configuration, maintenance, and troubleshooting.
- Manage and optimize configuration and deployment systems using tools such as Puppet and Kubernetes.
- Automate infrastructure provisioning, service deployment, and operational workflows to improve reliability and efficiency.
- Collaborate with product and engineering teams to design scalable architectures and ensure systems operate reliably under global traffic loads.
- Participate in a 24/7 on-call rotation, handling incident response, system alerts, troubleshooting, and post-incident reviews.
- Conduct root cause analysis of production incidents and implement preventive measures to improve system stability.
- Contribute to system monitoring, observability, and performance optimization initiatives.
- Mentor engineers and share operational expertise within a distributed, cross-functional team environment.
- Work asynchronously with global teams while ensuring clear and effective technical communication.
Requirements
- 6+ years of experience in Site Reliability Engineering, DevOps, or infrastructure operations roles within complex distributed systems.
- Strong proficiency in Linux systems administration, troubleshooting, and performance tuning.
- Experience with scripting languages such as Python, Bash, Go, or Ruby for automation and operational tooling.
- Hands-on experience with configuration management tools such as Puppet or Ansible.
- Solid understanding of distributed systems, caching technologies, and system optimization techniques.
- Experience with Linux package management (e.g., on Debian-based systems).
- Proven track record of automating operational processes and identifying opportunities for system improvement.
- Experience participating in incident response, postmortems, and reliability engineering practices.
- Strong communication skills in English, with the ability to work effectively in a fully remote, globally distributed team.
- Ability to work independently while collaborating across multiple time zones and teams.
Nice to Have
- Experience with monitoring and observability tools such as Prometheus, Grafana, or similar platforms.
- Experience contributing to or working within open-source software communities.
- Familiarity with LAMP stack technologies, including PHP-based systems and caching solutions like Redis or Memcached.
- Experience with Linux kernel tuning and advanced system performance optimization.
- Knowledge of large-scale storage or database systems such as Cassandra, MariaDB, Ceph, or OpenStack Swift.
- Experience defining and managing service-level objectives (SLOs) across teams.
- Exposure to MediaWiki or similar large-scale content platforms.
Benefits
- Fully remote position with global collaboration across multiple time zones.
- Opportunity to contribute to one of the world’s most visited and impactful knowledge platforms.
- Competitive compensation aligned with location, experience, and market standards.
- Strong culture of open-source collaboration, transparency, and knowledge sharing.
- High-impact role influencing the reliability and scalability of global infrastructure systems.
- Exposure to large-scale distributed systems and modern SRE practices.
- Supportive, mission-driven environment focused on free and open knowledge.
- Opportunities for continuous learning, technical growth, and participation in global engineering initiatives.
- Inclusive workplace committed to diversity, equity, and accessibility.