Senior Site Reliability Engineer, NeMo Services

Company: NVIDIA

Location: US, CA, Santa Clara

Commitment: Full time

Posted on: 2023-05-03 15:37

NVIDIA is the leading artificial intelligence computing company and paving the way with innovations in generative AI, conversational AI, supercomputing, gaming and visualization. NVIDIA gives research institutions, cloud providers, large companies and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems.

As a Site Reliability Engineer, you will join an enthusiastic and dedicated site reliability engineering team serving the forefront of the latest science and technology trends. Working together with the NeMo development team, you will build and run large-scale, fault-tolerant systems and services able to run in any cloud. Are you passionate about infrastructure and looking for complex, meaningful issues? Are you ready to run the next generation of cloud services, design and code innovative solutions that address the needs of a whole organization? Then we are excited to have a motivated person like you!

What You Will Be Doing:

The NeMo Service team is responsible for building and deploying Generative AI services, including large language models and BioNeMo, our drug discovery cloud service. You will apply engineering leadership and deep knowledge of infrastructure and software development at scale to own the operation, adoption, and evolution of these services. You will lead by example, mentor the site reliability engineering and engineering teams, and establish credibility through quality technical execution, including hands-on contributions to code and automation to keep things running smoothly.

- Design, implement and support large scale Kubernetes clusters with monitoring, logging and alerting
- Engage in and improve the whole lifecycle of services, from inception and design, through deployment, operation and refinement
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
- Practice sustainable incident response and blameless postmortems
- Be part of an on call rotation to support production systems

What We Need To See:

- BS degree in Computer Science or related technical field involving coding, physics or mathematics, or equivalent experience
- Minimum of 3 years relevant experience
- Excellent interpersonal and written communication skills
- Experience with algorithms, data structures, complexity analysis and software design
- Experience in one or more of the following: Golang, Python, Node, C++, CUDA
- Outstanding teammate who can collaborate and influence in a multifaceted environment

Ways To Stand Out From The Crowd:

- Interest in crafting, analyzing and fixing large-scale distributed systems
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
- Ability to debug and optimize code and automate routine tasks
- Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker

The base salary range is $144,000 - $270,250. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
