Senior Software Engineer - Distributed Systems

Company: NVIDIA
Location: India, Bengaluru
Commitment: Full time
Posted on: 2023-10-28 18:39

NVIDIA is looking for outstanding engineers to work on breakthrough technologies that will scale next-generation AI and simulation systems. We expect you to have a deep understanding of distributed systems, file systems, I/O, networking, concurrency, data structures, scalable runtime systems, and fault tolerance. Candidates with hands-on development experience in OS internals, scalable systems design, networking, and container runtimes will be preferred. We also welcome out-of-the-box problem solvers who can provide new insights, challenge the status quo, and are willing to push boundaries. You and others on this team will help advance NVIDIA's state-of-the-art technology to deliver groundbreaking systems and solutions for modern AI applications.

NVIDIA has pioneered accelerated computing to tackle challenges that otherwise can’t be solved. NVIDIA is a world leader in AI, and our work is redefining industries valued at more than $100 trillion, from gaming to healthcare to transportation, and profoundly impacting society.

If you're creative, passionate about what you do, determined, and love having fun, then we encourage you to apply today!

What you will be doing:
- Join a core group of engineers with high critical-thinking abilities passionate about tackling some of the most sophisticated and hard problems in distributed systems and fault-tolerant design in real-world production systems.
- Solid technical foundation in distributed computing and storage, including significant experience with most of the following: server systems, operating systems, storage, I/O, networking, and system software.
- Design, develop, test, and maintain cluster monitoring and validation systems.
- Expand and optimize container orchestration infrastructure (K8s) for AI model training and inference, high-performance server systems, storage, I/O, networking, and system software.
- Deploy, monitor, and debug your software in production environments.
- Work with engineering teams across all of NVIDIA to ensure your software integrates seamlessly up and down the stack.

What we need to see:
- Deep understanding of data structures, concurrency, fault-tolerance, scalable runtime systems, operating systems and distributed systems design.
- Strong programming skills and expert-level knowledge of a systems programming language (C/C++/Go).
- Highly motivated with strong interpersonal skills, you have the ability to work successfully with multi-functional teams, principals, and architects and coordinate effectively across interpersonal boundaries and geographies.
- 5+ years of software engineering or research lab experience on large-scale systems.
- Ph.D/MS/BS in Computer Science/Engineering/Physics/Mathematics or other comparable degree or equivalent experience.
- Solid understanding of performance, security, and reliability in complex distributed systems.
- Familiarity with system-level architecture, such as interconnects, memory hierarchy, interrupts, and memory-mapped I/O.

Ways to stand out from the crowd:
- Hands-on development experience with OS internals, schedulers, networking, and container runtimes, and scale-out systems design.
- Familiarity with AI/ML technologies.
- Background with batch scheduling with K8s, Mesos, Slurm.
- Operational experience in AI Infrastructure and large-scale distributed systems.