Site Reliability Engineering Manager

Company: NVIDIA

Location: US, CA, Santa Clara

Commitment: Full time

Posted on: 2023-05-03 15:38

NVIDIA is the leading artificial intelligence computing company and is paving the way with innovations in generative AI, conversational AI, supercomputing, gaming and visualization. NVIDIA gives research institutions, cloud providers, large companies and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems.

As the Manager of Site Reliability, you will establish an enthusiastic and dedicated SRE team serving the forefront of the latest science and technology trends. Working together with the NeMo development team, you will build and run large-scale, fault-tolerant systems and services able to run in any cloud. Are you passionate about infrastructure and looking for complex, meaningful issues? Are you ready to run the next generation of cloud services, and to design and code innovative solutions that address the needs of a whole organization? Then we are excited to have a motivated person like you!

What You Will Be Doing:

The NeMo Service team is responsible for building and deploying Generative AI services, including large language models and BioNeMo, our drug discovery cloud service. You will apply engineering leadership and deep knowledge of infrastructure and software development at scale to own the operation, adoption, and evolution of these services. You will lead by example, mentor the SRE and engineering teams, and establish credibility through quality technical execution, including hands-on contributions to code and automation to keep things running smoothly.

What We Need To See:

- 5+ overall years of demonstrated ability in site reliability and technical operations leadership
- BSCS or BSEE or equivalent experience
- Experience building large and geographically dispersed infrastructure supporting business-critical cloud & on-premises services
- 3+ years of people management and team leadership experience, including headcount planning and developing strong and motivated teams
- Experience running AI/ML operations through CI/CD pipelines
- Experience designing and implementing CI/CD back-end services
- Strong programming skills in Go. Python proficiency.
- Excellent debugging and troubleshooting skills

Ways To Stand Out From The Crowd:

- Excellent understanding of Kubernetes and one or more public clouds
- Ability to reason and choose the best possible algorithm to meet scaling and availability challenges
- Skilled at decomposing complex requirements into simple tasks and reusing available solutions to implement most of those
- You can design simple and reliable systems that can work without much support.
- Strong cloud management foundation
- Proven record of delivering solutions using Agile processes and methodologies

The base salary range is $216,000 - $333,500. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.
As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


Site Reliability Engineering Manager - NeMo LLM Service