Senior Site Reliability (R-19383)
Dun & Bradstreet
Dublin - Ireland
employee: full time
Responsibilities:
- Own and improve the reliability, availability, and performance of production services in Google Cloud (GCP).
- Participate in incident management, including detection, triage, mitigation, escalation, and recovery.
- Use and improve incident workflows and tooling (e.g., ServiceNow) to ensure clear ownership and timely communication.
- Design, implement, and operate observability solutions including monitoring, logging, tracing, synthetics, and dashboards (e.g., Splunk Observability, OpenTelemetry).
- Reduce operational toil through automation and engineering-led solutions, proactively introducing and driving SRE best practices.
- Support on-call rotations across multiple time zones, contributing to a sustainable 24/7 support model.
- Define, monitor, and report SLIs, SLOs, and error budgets for critical services.
- Drive and be accountable for best-in-class service availability through SRE principles, automation, and proactive reliability engineering.
Requirements:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- Strong experience with cloud-native concepts and technologies, with a strong preference for Google Cloud Platform (GCP) and Kubernetes (GKE).
- Proven experience with Site Reliability Engineering and production incident management, ideally using platforms such as ServiceNow.
- Experience with monitoring and observability tools, including metrics, logs, traces, and synthetics (e.g., Splunk Observability, OpenTelemetry).
- Exposure to reliability testing, resilience engineering, or cost optimisation initiatives.
- Excellent analytical and problem-solving skills, with the ability to diagnose complex production issues quickly.
- Software development or automation experience using Python, shell scripts, or similar languages.
- Hands-on experience operating production cloud infrastructure at scale.
- Experience managing multi-region, high-availability production systems with a focus on scalability, resilience, and minimising service disruption during failures.
- Proficiency in the Microsoft Office suite.
- Show an ownership mindset in everything you do: be a curious, proactive problem solver who is inspired to take action, and seek ways to collaborate and connect with people and teams to drive success.
- Maintain a continuous growth mindset: keep learning through social experiences and relationships with stakeholders, experts, colleagues, and mentors, and broaden your competencies through structured courses and programs.
- Where applicable, fluency in English and languages relevant to the working market.