NVIDIA is looking for a world-class engineer to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization as a Senior DevOps and SRE Engineer. You will be part of a fast-paced crew that develops and maintains sophisticated build and test environments for a multitude of hardware platforms, both NVIDIA GPUs and Tegra processors, across various operating systems (Windows/Linux). The team works with other business units within NVIDIA Software, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, Robotics and Driverless Cars, to cater to their infrastructure and system needs.

What you'll be doing:

- Monitor and support critical high-performance, large-scale services running on a farm of 10,000+ hosts.
- Ensure more than 95% availability for the build and test farms.
- Participate in triaging and resolving complex build and test infrastructure issues.
- Collaborate with our other engineering teams to expose any defects and constraints.
- Collaborate with software development teams to deliver reliable, robust, and high-performance capability of the underlying infrastructure.
- Perform root cause analysis and implement corrective actions for any persistent and user-impacting issues.
- Implement high-availability infrastructure and disaster recovery solutions.
- Perform large-scale deployments across multiple Kubernetes and ESXi clusters to support CI/CD pipelines for NVIDIA products.
- Design and implement monitoring solutions to gain more insight into application and system health.
- Implement critical metrics using various analytics methods and dashboards.
- Craft and develop tools needed for automating workflows.
- Take part in prototyping, crafting, and developing cloud infrastructure for NVIDIA.
- Participate in on-call support and critical issue coverage as an SRE.

What we need to see:

- Solid programming background in Python/Tcl and/or similar scripting languages.
- Strong background with CI/CD workflows, GitLab/Jenkins or any other CI/CD tools.
- Proficiency with configuration management tools like Ansible, Chef, and Puppet, and with source code management and binary repository systems like GitLab, GitHub, and Artifactory.
- Demonstrable experience working in large-scale enterprise production systems.
- Proficiency with Kubernetes administration, Docker, and virtualization. Knowledge of standard methodologies related to security.
- Proficiency with data analytics/visualization and monitoring tools like Kibana, Grafana, Splunk, Zabbix, Prometheus and/or similar systems.
- Strong background in Docker, containerization, and managing large-scale container/pod deployments for Kubernetes clusters.
- Excellent debugging, problem-solving and analytical skills.
- Strong understanding of architectural requirements and development processes involved in building reliable, robust, scalable data products and pipelines.
- Experience in writing complex queries for MySQL or a similar database.
- 5+ years of proven experience.
- Bachelor’s or Master’s degree in Computer Science, Software Engineering, or equivalent experience.

Ways to stand out from the crowd:

- Experience with or knowledge of supporting Java-based applications, web servers, etc. is a plus.
- Thrives in a multitasking environment with constantly evolving priorities.
- Ability to break complex problems into simple sub-problems and then reuse available solutions to solve most of them.
- Ability to design simple systems that can work efficiently without needing much support.
- Prior experience with a large-scale operations team.
- Outstanding interpersonal skills and communication with all levels of management.