We are seeking a highly motivated Principal Site Reliability Engineer to join our Omniverse Infrastructure organization, which develops hardware and software systems to power Omniverse Cloud. NVIDIA Omniverse™ Cloud is a platform-as-a-service (PaaS) that provides developers and enterprises a full-stack cloud environment to craft, develop, deploy, and run industrial Omniverse applications. Our Site Reliability team focuses on improving the reliability of our platform. We do this by diligently measuring the customer experience, tracing it to the health of our platform, actively responding to outages, and collaborating with our internal partners for continuous improvement.

As a Principal Omniverse Cloud SRE, you will architect solutions to scale our ever-growing number of clusters around the world, in support of Omniverse Cloud’s mission of enabling companies to unify digitalization across their core product and business processes. You will set best practices and select or build the tools and automation that improve the reliability of the platform. You will drive roadmaps, lead our SREs and engineers in implementing solutions that hit our SLA targets, and reduce the team's operational toil. You will be on the front line of our Incident Response team, curate our incident data, and lead communications with our partners (Omniverse developers, the Infrastructure team, the NVIDIA Cloud SRE team, and external cloud vendors) to drive reliability improvements.

At NVIDIA Omniverse we expect everyone to be highly autonomous, a great teammate, and uniquely focused on the mission. We self-organize and swarm as needed, and everyone is here to do their life’s work.
What you’ll be doing:
- Own, innovate, and build programs, new software, and analytics that drive improvements to the availability, scalability, latency, and efficiency of Omniverse products and services.
- Maintain Service Level Agreements (SLAs) with measurable benchmarks, working hand in hand with developers of new services to define SLIs and design stable, secure services.
- Establish a strong culture of shared production ownership in Omniverse Cloud.
- Create and guide strong processes for triage, feedback, automation, incident management, and reliability tools and improvements.
- Work with product area leads across the Omniverse stack to guide product engineering in building fast, reliable, and durable production systems.
- Bring experience in best practices and production excellence to help create a strong, exemplary Omniverse Cloud SRE team.

What we need to see:
- Master’s degree in Computer Science or a related field, or equivalent experience.
- 10+ years of experience.
- Background in distributed systems, large-scale systems, deployment, troubleshooting, and incident management in enterprise systems.
- Proven experience authoring and debugging software written in C++ and Python.
- Deep hands-on experience with Kubernetes-based cloud environments.
- Background with commercial cloud systems, preferably Azure.
- Experience working with and influencing partners across multiple teams.
- Executive stakeholder management and deep experience in highly complex, constantly evolving, customer-focused environments.

Ways to stand out from the crowd:
- Extensive experience with Azure.
- Experience with Prometheus, Grafana, and Azure Monitor.
- Background with PaaS and SaaS offerings.
- Background as an SRE, or equivalent experience, in a very large-scale, customer-facing environment.