Site Reliability Engineer

Company: NVIDIA
Location: US, CA, Santa Clara
Commitment: Full time
Posted on: 2023-10-28 18:37
We are seeking a highly motivated Site Reliability Engineer to join our Applications Infrastructure organization, which automates, deploys, and maintains infrastructure for various NVIDIA AI workflows and applications, such as Metropolis, ACE, and Riva, hosted in the cloud. Site Reliability Engineering (SRE) focuses on production health to prevent outages. It does so by defining and developing deep software engineering solutions and practices that simplify the operating environment, making NVIDIA cloud services more reliable and feature rollouts faster and safer.

As an SRE, you will work with our application DevOps engineers to maintain and scale our ever-growing number of services hosted in the cloud. You will serve as front-line support, triaging issues to the platform, the applications, or the underlying infrastructure. In this role, you will partner with multiple teams within and outside the Applications Infrastructure team, including a second SRE team that supervises the GPU cloud infrastructure, while this role focuses on monitoring the application stack.
You will be involved in onboarding customers to our services and managing the customer lifecycle.

What you'll be doing:
- Build and integrate new software, tools, and analytics that drive improvements to the availability, scalability, latency, and efficiency of our cloud products and services
- Handle upgrades and automated rollbacks across all clusters
- Maintain Service Level Agreements (SLAs) based on measurable benchmarks, working hand in hand with developers of new services to define SLIs and design stable, secure services
- Help guide the Change Advisory Board and RCCA processes
- Work with engineering, DevOps, and product area leads across the GPU cloud services stack to guide product engineering in building fast, reliable, and durable production systems
- Drive process changes to improve the reliability and performance of our cloud services
- Debug production issues across services and levels of the stack
- Improve operational processes

What we need to see:
- Bachelor's degree in Computer Science or a related field, or equivalent experience
- 3+ years of experience in system design, complexity analysis, software design in Unix/Linux systems, performance, and application issues
- 3+ years of experience authoring and debugging software written in C++ and Python
- Hands-on experience with Kubernetes-based cloud environments
- Multi-cloud experience
- Ability to work with partners across multiple teams
- Experience operating and maintaining production systems
- Experience in application issues, algorithms, and data structures

Ways to stand out from the crowd:
- Background with SaaS offerings
- Experience in application issues, algorithms, and data structures
- Excellent interpersonal skills

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most hard-working and dedicated people in the world working for us. If you're creative and autonomous, we want to hear from you!

The base salary range is $136,000 - $212,750.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.