Site Reliability Engineer

Our client is a fast-growing, highly successful Technology Product Company that is the leader in DevOps for the Database. Downloaded more than 100 million times, their software enables DevOps teams around the globe to accelerate the software delivery process by automating database updates, security, and governance. They are a nimble, fast-paced, innovative team with the opportunity to make an outsized impact on the business and the industry.

They are hiring a Site Reliability Engineer (SRE) with specific, prior SRE experience supporting AWS-based, cloud-native applications. This role is focused on building, securing, and maintaining a highly resilient, multi-tenant SaaS platform utilizing AWS services. This is not a general DevOps role; they are looking for an SRE expert who excels at security, monitoring, alerting, and operational resilience. You will report directly to the Manager of DevOps, and your work will have a direct impact on building the foundation for a robust and scalable SaaS platform.

What YOU get to DO

Design, implement, and maintain highly resilient and secure infrastructure for our SaaS platform using AWS services, including API Gateway, Lambda, Aurora Serverless, OpenSearch Serverless, and Secrets Manager, along with FusionAuth
Ensure best-in-class security of the application using AWS security services such as WAF, Shield, and GuardDuty, and implement industry-leading security practices
Develop, implement, and maintain robust monitoring and alerting solutions to ensure the reliability and performance of our SaaS platform, including the use of CloudWatch, Prometheus, Grafana, etc.
Facilitate and drive incident response, triage & resolution, and retrospective/root cause analysis to maintain the reliability and uptime of our platform
Lead incident post-mortems and retrospectives to surface reliability improvements and drive them to completion
Implement strategies to increase system resilience and performance through on-call rotation and process optimization
Apply a strong understanding of SRE principles, including error budgets, SLOs, SLIs, and SLAs, and identify and establish them for the team
Build and maintain infrastructure as code using Terraform
Provide input and expertise for system architecture and feature development
Engage and collaborate with stakeholders including Product, Development, QA, Customer Success & others to ensure work is properly defined, prioritized, and executed, including improvements & future initiatives
Educate and guide Engineering teams on best practices with respect to reliability, resiliency, security, etc.
Participate in the Agile Development lifecycle, helping us to stay realistic on our goals and flexible in our execution
Foster a culture of group collaboration while being effective at working independently at the same time

What you NEED to SUCCEED

Bachelor’s degree in Computer Science, Software Engineering, or a related field (or equivalent work experience)
5+ years of hands-on experience in site reliability engineering roles
AWS Solutions Architect and/or AWS DevOps Professional Certifications
Expert knowledge of AWS services, specifically API Gateway, Lambda, Aurora Serverless, OpenSearch Serverless, and Secrets Manager, as well as FusionAuth
Expertise in AWS security services, including WAF, Shield, and GuardDuty, and a deep understanding of cloud security practices
Prior SRE experience supporting a cloud-native SaaS platform with AWS
Strong experience with monitoring and alerting tools such as CloudWatch, Prometheus, Grafana, or similar
Willingness and availability to participate in a 24x7x365 on-call rotation (business hours only), ensuring prompt and effective responses to business-critical alerts outside of regular working hours
Extensive experience with Terraform for infrastructure as code
Experience building, securing, and maintaining a multi-tenant SaaS application
Experience with IDPs such as FusionAuth, Okta, Auth0, or similar
Strong understanding of information security principles and practices
Security certifications such as CISSP, CISM, CEH, or similar (preferred)
Experience with other cloud providers such as Azure and GCP (preferred)
Experience with Docker and container orchestration (preferred)
Prior startup experience, especially launching applications with customer adoption and usage (preferred)

What's in it for YOU?

Remote culture, with the potential for company-wide in-person gatherings
Home office allowance for remote workers
Meaningful equity (US only)
Comprehensive health, vision, and dental benefits – country dependent
Generous paid time off and paid holidays
401(k) matching (US only)
