Manager, Site Reliability Engineer

Company: NVIDIA

Location: US, CA, Santa Clara

Commitment: Full time

Posted on: 2023-10-28 18:36

We are seeking a highly motivated, hands-on Site Reliability Engineer Manager/Technical Lead to join our Applications Infrastructure organization, which automates, deploys, and maintains infrastructure for various NVIDIA AI workflows and applications, such as Metropolis, ACE, and Riva, hosted in the cloud. Site Reliability Engineering (SRE) focuses on production health to prevent outages. It does so by defining and developing deep software engineering solutions and practices that simplify the operating environment, which not only make NVIDIA cloud services reliable but also make feature rollouts faster and safer.

As an SRE leader, you will work with our application DevOps engineers and help architect solutions to maintain and scale our ever-growing number of services hosted in the cloud. You will help lead a team of SREs and put SRE processes in place. You will serve as front-line support, triaging issues in the platform, the applications, or the underlying infrastructure. In this role, you will partner with multiple teams within and outside the Applications Infrastructure team, including a second SRE team that supervises the GPU cloud infrastructure, while this role focuses on monitoring the application stack. You will be involved in onboarding customers to our services and leading the customer lifecycle. You will work in a multi-functional capacity to successfully bring our cloud services to market.

This position requires the candidate to work from our Santa Clara, CA location.

What you'll be doing:

- Lead a team of site reliability engineers to bring a data-driven approach to operations, with a focus on observability, well-defined success metrics, and continuous improvement
- Work with our DevOps team to innovate and to build or integrate new software, tools, and analytics that drive improvements to the availability, scalability, latency, and efficiency of our cloud products and services
- Handle upgrades and automated rollbacks across all clusters
- Maintain Service Level Agreements (SLAs) with measurable benchmarks, working hand in hand with developers of new services to define SLIs and design stable, secure services
- Help guide the Change Advisory Board and RCCA processes
- Work with engineering, DevOps, and product area leads from technologies across the GPU cloud services stack to guide product engineering in building fast, reliable, and durable production systems
- Practice sustainable incident response and blameless postmortems
- Be part of an on-call rotation to support production systems
- Drive process changes to improve the reliability and performance of our cloud services
- Debug production issues across services and levels of the stack
- Onboard customers onto our cloud services
- Improve operational processes

What we need to see:

- Bachelor's degree in Computer Science or a related field, or equivalent experience
- 5+ years of experience in system design, complexity analysis, software design in Unix/Linux systems, performance, and application issues
- 5+ years of experience authoring and debugging software written in C++ and Python
- 3-5+ years of leading a team
- Deep hands-on experience with Kubernetes-based cloud environments
- Multi-cloud experience
- Ability to work with partners across multiple teams
- Experience operating production systems
- Experience leading people

Ways to stand out from the crowd:

- Background with PaaS and SaaS offerings
- Experience in application issues, algorithms, and data structures
- Understanding of the functioning of AI services

With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us and, due to outstanding growth, our best-in-class engineering teams are rapidly growing. If you're a creative and autonomous engineer with a real passion for technology, we want to hear from you!

The base salary range is $164,000 - $316,250. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.


Manager, Site Reliability Engineer - AI Services