Manager, DevOps and Infrastructure Systems Engineering

Company: NVIDIA

Location: India, Pune

Commitment: Full time

Posted on: 2023-09-08 05:56

NVIDIA is looking for a DevOps Manager to lead the NVIDIA Metropolis Applied AI Applications infrastructure team for our internal and external-facing applications and services, with a focus on workflow automation and on ensuring the reliability and uptime of cloud and on-prem bare-metal systems. The role calls for a hands-on mentality and an engineering approach to running better production systems. NVIDIA Metropolis is leading this AI revolution, providing the tools, technologies, and expertise to meet every challenge with smarter, faster applications.

Much of our software development focuses on eliminating manual work through automation, performance measurement, tuning, and improving the efficiency of products and systems. Because DevOps engineers are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to address a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages feed the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work.

This is a great opportunity to lead a team of talented engineers using proven, standard methodologies and to build a team that is the envy of the industry. We partner with Service Owners to drive the reliability of each service. The Metropolis and Applied AI Application services are exciting services in the emerging Applied AI industry, serving environments such as smart cities, retail, safety, and industrial automation.
This is a rare opportunity to work on multiple operating systems, multi-cloud platforms, and proprietary hardware with NVIDIA GPUs.

What you will be doing:

- Manage our global cloud and on-premises data platform services.
- Perform root cause analysis, investigations, and remediation.
- Develop tools to automate functions using programming languages such as Python, Go, and Bash on *Nix operating systems.
- Be responsible for chartering new technologies while helping to mentor and guide DevOps engineers.
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
- Support services before they launch through activities such as system design consulting, developing software platforms and frameworks, capacity management, and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Practice balanced incident response and blameless postmortems.
- Lead significant production improvements in tooling, automation, and process.

What we need to see:

- Master’s degree in Computer Science/Engineering or equivalent experience.
- 7+ years of overall experience on DevOps or SRE teams owning the end-to-end availability and performance of critically important services.
- Hands-on experience with various DevOps tools and best practices.
- Understanding of AI technologies and practices.
- 3+ years of technical leadership beyond development, including scoping, requirements gathering, leading, and influencing multiple teams of engineers on broad development initiatives.
- Ability to deliver software automation in various languages (Python, Bash, Go) and technologies (CI/CD, performance measurement tools).
- Experience managing an engineering team on projects with technical deep dives into cloud technologies (AWS/Azure/GCP/OCI), code, networking, operating systems, storage, etc.
- Project management skills consistent with leading a team through the complex development-to-production transition phase.
- Strong communication skills (written and oral).
