Senior ML Platform Engineer, AI

Company: NVIDIA

Location: US, CA, Santa Clara

Commitment: Full time

Posted on: 2023-11-10 05:01

Join the team building software that will be used by the entire world of AI. Work with high-caliber software engineers to implement a large-scale toolset that tests deep learning models and frameworks on the most powerful computers. The ability to work in a multifaceted, fast-paced environment is required, as are strong social skills. In this role you will interact with internal partners, users, and members of the open source community to implement solutions for building, testing, integrating, and releasing NVIDIA AI Services and Deep Learning Frameworks on the most powerful, enterprise-grade GPU clusters, capable of hundreds of petaFLOPS. The role spans multiple products, such as PyTorch, TensorFlow, JAX, and PaddlePaddle. You will work with internal engineering teams to deploy and operationalize AI models and services at scale by driving adoption of end-to-end Machine Learning and Deep Learning solutions in the cloud and on-prem.

We are seeking passionate individuals to help us scale our AI and deep learning services, platforms, models, and internal tools. You will be responsible for implementing and maintaining the DevOps/MLOps practices, tools, and infrastructure that enable our teams to deliver high-quality software reliably and efficiently, while ensuring smooth release management and deployment processes. Are you ready for this challenge?

What you’ll be doing:

- Develop, maintain, and improve CI/CD tools for on-prem and cloud deployment of our software, enable sophisticated cross-platform build systems, and bring world-class release engineering to NVIDIA's platform and cloud deployment process.
- Enable a self-service Deep Learning testing and benchmarking platform using industry-standard tools (e.g. GitLab, GitHub, Jenkins, Docker, Bash, …) and NVIDIA proprietary tools.
- Lead best practices and methodologies for building, testing, and releasing DL software, and support users of the platform.
- Monitor and fix the software development and deployment pipelines, identifying and resolving issues related to build failures, test failures, code quality, and performance, in collaboration with development, operations, and quality assurance teams.
- Prepare documentation for the proposed approaches, policies, data formats, test cases, and expected results within the scope of your projects. Document and evangelize them.
- Collaborate with development, operations, and quality assurance teams to establish and maintain efficient and reliable DevOps practices, tools, and infrastructure that enable continuous integration and continuous delivery (CI/CD) and efficient software release management.

What we need to see:

- BSc or MS degree in Computer Science, Computer Architecture, or a related technical field, or equivalent experience.
- 5+ years of work experience in platform engineering/MLOps/DevOps.
- Very good Python and Bash programming skills.
- Proficiency with popular CI/CD tools (e.g., GitLab CI, Jenkins), Linux, and Git (including management practices, versioning, branching, merging, and tagging), and experience with release management tools and processes.
- Knowledge of Docker, REST API services, Kubernetes, Elasticsearch, HashiCorp Vault, and Ansible.
- Experience working with cloud providers (AWS, OCI, GCP).
- Strong experience in setting up, maintaining, and automating continuous integration systems. Knowledge of and love for DevOps/MLOps practices. Proficiency in modern CI/CD techniques, GitOps, and Infrastructure as Code (IaC).
- Basic understanding of ML/DL training and inferencing concepts.
- Strong understanding of software testing principles, including unit testing, integration testing, and end-to-end testing, and experience with automated testing frameworks and tools.
- Detail-oriented, with great communication and documentation skills and habits.

Ways to stand out from the crowd:

- Hands-on experience creating integration, delivery, and deployment pipelines for ML/DL products and/or experience working with Deep Learning models and/or services.
- Familiarity with large-scale distributed computing systems and cloud platforms, or experience with HPC-based compute clusters and scheduling solutions like Slurm.
- Proven track record of delivering solutions to customers. Deep understanding of deployments at scale and/or upstream contributions to open source projects.
- Relevant certifications (e.g., AWS Certified DevOps Engineer, Linux Red Hat, Oracle, …) are a plus.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most brilliant and talented people in the world working for us. If you're creative and autonomous, we want to hear from you!

The base salary range is 144,000 USD - 270,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

#deeplearning

