Site Reliability Engineer

Company: NVIDIA
Location: India, Pune
Commitment: Full time
Posted on: 2023-10-28 18:36
We are seeking a highly motivated Site Reliability Engineer to join our Applications Infrastructure organization, which automates, deploys, and maintains infrastructure for NVIDIA AI workflows and applications hosted in the cloud, such as Metropolis, ACE, and Riva. Site Reliability Engineering (SRE) focuses on production health to prevent outages. It does so by defining and developing deep software engineering solutions and practices that simplify the operating environment, making NVIDIA cloud services reliable and feature rollouts faster and safer. As an SRE, you will work with our application DevOps engineers to maintain and scale our ever-growing number of cloud-hosted services. You will serve as front-line support, triaging issues to the platform, the applications, or the underlying infrastructure. In this role, you will partner with multiple teams within and outside the Applications Infrastructure team, including a second SRE team that supervises the GPU cloud infrastructure, while this role focuses on monitoring the application stack. You will also be involved in onboarding customers to our services and managing the customer lifecycle.

What you'll be doing:
- Build and integrate new software, tools, and analytics that drive improvements to the availability, scalability, latency, and efficiency of our cloud products and services
- Handle upgrades and automated rollbacks across all clusters
- Maintain Service Level Agreements (SLAs) with measurable benchmarks, working hand in hand with developers of new services to define SLIs and design stable, secure services
- Help guide the Change Advisory Board and RCCA processes
- Work with engineering, DevOps, and product area leads across the GPU cloud services stack to guide product engineering in building fast, reliable, and durable production systems
- Drive process changes to improve the reliability and performance of our cloud services
- Debug production issues across services and levels of the stack
- Improve operational processes

What we need to see:
- Bachelor's degree in Computer Science or a related field, or equivalent experience
- 3+ years of experience in system design, complexity analysis, software design in Unix/Linux systems, performance, and application issues
- 3+ years of experience authoring and debugging software written in C++ and Python
- Hands-on experience with Kubernetes-based cloud environments
- Multi-cloud experience
- Experience working with partners across multiple teams
- Experience operating production systems

Ways to stand out from the crowd:
- Background with SaaS offerings
- Experience in application issues, algorithms, and data structures