You will join a group that specializes in security and networking for ML/AI development, building and maintaining the infrastructure, tools, and processes needed to support the machine learning (ML) lifecycle in a production environment. You will collaborate closely with data scientists, software engineers, and DevOps teams to ensure the smooth deployment, monitoring, and optimization of ML models. This role involves problem-solving alongside engineering teams and contributing to the development of a successful NVIDIA practice!

What you'll be doing:
- Develop and maintain scalable infrastructure for handling and deploying security and networking ML models in production, ensuring high availability, scalability, and performance.
- Design and implement data pipelines to efficiently process and transform large volumes of data for training and inference.
- Optimize and fine-tune ML models for performance, scalability, and resource utilization, considering factors such as latency, efficiency, and cost.
- Collaborate with data scientists and software engineers to operationalize and deploy ML models, including model versioning, packaging, and integration with existing systems.
- Collaborate with DevOps teams to integrate ML pipelines and workflows into the overall CI/CD process, ensuring flawless deployments and rollbacks.
- Automate the training and retraining processes, ensuring regular model updates and improvements based on new data and performance feedback.
- Implement and manage A/B testing frameworks to evaluate and compare the effectiveness of different ML models or algorithmic approaches.
- Build and maintain monitoring and alerting systems to proactively identify and resolve issues related to model performance, data quality, and infrastructure.
- Implement access controls, authentication mechanisms, and encryption standards for ML models and data.
- Document guidelines and standard operating procedures for MLOps processes and share knowledge with the wider team.

What we need to see:
- Bachelor's or master's degree in computer science, data science, or a related field.
- Strong background in machine learning, with at least 6 years of experience deploying and maintaining ML models in a production environment.
- Proficiency in programming languages such as Python, Java, or Scala, along with experience using ML frameworks and libraries (e.g., TensorFlow, PyTorch, scikit-learn).
- Experience with containerization technologies (e.g., Docker, Kubernetes) and cloud platforms (e.g., AWS, Azure, GCP) for deploying and scaling ML applications.
- Solid understanding of data engineering principles and experience with tools for data processing and storage (e.g., Apache Spark, Hadoop, SQL databases, NoSQL databases).
- Familiarity with version control systems (e.g., Git) and continuous integration/continuous deployment (CI/CD) tools and practices.
- A security and networking background is an advantage, including knowledge of security protocols, network architectures, firewalls, intrusion detection systems, and related security and networking concepts.
- Strong problem-solving skills and the ability to troubleshoot and resolve complex issues in a timely manner.
- Excellent communication and collaboration skills, with the ability to work effectively in multi-functional teams.
- Attention to detail and a focus on quality, ensuring robustness and reliability in production ML systems.

Ways to stand out from the crowd:
- Exude high energy and a positive attitude.
- Stellar verbal and written communication skills.
- Passion for data science and implementation.
- Data science and GPU performance experience.
- A drive to make what was impossible possible!

We are an equal opportunity employer and value diversity at our company. We do not discriminate based on race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.