Site Reliability Engineer

Our client is a fast-growing, highly successful Technology Product Company that is the leader in DevOps for the Database. Downloaded more than 100 million times, their software enables DevOps teams around the globe to accelerate the software delivery process by automating database updates, security, and governance. They are a nimble, fast-paced, innovative team with the opportunity to make an outsized impact on the business and the industry.

They are hiring a Site Reliability Engineer (SRE) with specific, prior SRE experience supporting AWS-based, cloud-native applications. This role is focused on building, securing, and maintaining a highly resilient, multi-tenant SaaS platform utilizing AWS services. This is not a general DevOps role; they are looking for an SRE expert who excels at security, monitoring, alerting, and operational resilience. You will report directly to the Manager of DevOps, and your work will have a direct impact on building the foundation for a robust and scalable SaaS platform.

What YOU get to DO

  • Design, implement, and maintain highly resilient and secure infrastructure for our SaaS platform using AWS services, including API Gateway, Lambda, Aurora Serverless, OpenSearch Serverless, and Secrets Manager, as well as FusionAuth
  • Ensure best-in-class security of the application using AWS security services such as WAF, Shield, and GuardDuty, and implement industry-leading security practices
  • Develop, implement, and maintain robust monitoring and alerting solutions to ensure the reliability and performance of our SaaS platform, using tools such as CloudWatch, Prometheus, and Grafana
  • Facilitate and drive incident response, triage & resolution, and retrospective/root cause analysis to maintain the reliability and uptime of our platform
  • Lead incident post-mortems/retrospectives to surface reliability improvements and drive them to completion
  • Implement strategies to increase system resilience and performance through on-call rotation and process optimization
  • Apply a strong understanding of SRE principles, including error budgets, SLOs, SLIs, and SLAs, and identify and establish them for the team
  • Build and maintain infrastructure as code using Terraform
  • Provide input and expertise for system architecture and feature development
  • Engage and collaborate with stakeholders including Product, Development, QA, Customer Success & others to ensure work is properly defined, prioritized, and executed, including improvements & future initiatives
  • Educate and guide Engineering teams on best practices for reliability, resiliency, and security
  • Participate in the Agile Development lifecycle, helping us stay realistic about our goals and flexible in our execution
  • Foster a culture of collaboration while also working effectively on your own

What you NEED to SUCCEED

  • Bachelor’s degree in Computer Science, Software Engineering, or a related field (or equivalent work experience)
  • 5+ years of hands-on experience in site reliability engineering roles
  • AWS Solutions Architect and/or AWS DevOps Professional Certifications
  • Expert knowledge of AWS services, specifically API Gateway, Lambda, Aurora Serverless, OpenSearch Serverless, and Secrets Manager, as well as FusionAuth
  • Expertise in AWS security services, including WAF, Shield, and GuardDuty, and a deep understanding of cloud security practices
  • Prior SRE experience supporting a cloud-native SaaS platform with AWS
  • Strong experience with monitoring and alerting tools such as CloudWatch, Prometheus, Grafana, or similar
  • Willingness and availability for participation in a 24x7x365 on-call rotation (business hours only), ensuring prompt and effective responses to business-critical alerts outside of regular working hours
  • Extensive experience with Terraform for infrastructure as code
  • Experience building, securing, and maintaining a multi-tenant SaaS application
  • Experience with identity providers (IdPs) such as FusionAuth, Okta, Auth0, or similar
  • Strong understanding of information security principles and practices
  • Security certifications such as CISSP, CISM, CEH, or similar (preferred)
  • Experience with other cloud providers such as Azure and GCP (preferred)
  • Experience with Docker and container orchestration (preferred)
  • Prior startup experience, especially launching applications with customer adoption and usage (preferred)

What’s in it for YOU?

  • Remote culture, with the potential for company-wide in-person gatherings
  • Home office allowance for remote workers
  • Meaningful equity (US only)
  • Comprehensive health, vision, and dental benefits – country dependent
  • Generous paid time off and paid holidays
  • 401(k) matching (US only)