We are seeking a highly motivated Director of Site Reliability Engineering to join our Omniverse Infrastructure organization, which develops hardware and software systems to power Omniverse Cloud.

NVIDIA Omniverse™ Cloud is a platform-as-a-service (PaaS) that provides developers and enterprises a full-stack cloud environment to design, develop, deploy, and manage industrial metaverse applications. As a Director, SRE & DevOps, you will lead a team of systems and software engineers to build and run large-scale, fault-tolerant, and highly available Omniverse systems and services. You will need to be self-motivated, a critical thinker, data-driven, and results-oriented, with a focus on delivering outstanding user experience.

What you'll be doing:

- Bring practical technical experience in dealing with large customer services
- Lead by example and mentor a team of managers and ICs, enabling them to deliver high-quality systems and end-user experience
- Establish credibility through quality technical execution, and pitch in with hands-on help and code as needed to keep things running smoothly
- Own the strategy and development of incident response management and service capacity management through core engineering execution
- Help implement automated deployments, monitoring, and operational tools along with true observability
- Apply engineering leadership and deep knowledge of infrastructure and software development at scale to lead the operation, adoption, and evolution of these services
- Bring a solid understanding of software development, debugging, optimization, and/or troubleshooting; hands-on experience with common programming languages preferred
- Own, innovate, and create programs, new software, and analytics that drive improvements to the availability, scalability, latency, and efficiency of Omniverse products and services
- Work cross-functionally with product area leads from technologies across the Omniverse stack to guide product engineering to build fast, reliable, and durable production systems
- Manage, lead, and grow a global team of Site Reliability Engineers

What we need to see:

- 8+ years of demonstrated ability in site reliability leadership in a Kubernetes-based cloud environment
- 10+ years of experience in developing technical solutions, including but not limited to: operations/engineering, infrastructure/database architecture, containerization and modern application design patterns, infrastructure as code, disaster recovery, or chaos engineering
- Experience building large and geographically dispersed infrastructure supporting business-critical cloud and on-premises services
- BS or MS in Computer Science, a related field, or equivalent experience
- 7+ years of people management and team leadership experience, including headcount planning and developing strong and motivated teams
- Experience with 24/7 site monitoring and ability to own uptime and performance SLAs
- Excellent written and verbal communication, able to collaborate and rally support
- Comfortable leading discussions with upper management, with experience tailoring the level of technical detail to suit the audience

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

The base salary range is $304,000 - $460,000. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.