HPC Cluster Administrator

Company: NVIDIA

Location: US, CA, Santa Clara

Commitment: Full time

Posted on: 2023-12-20 05:20

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 fueled the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are difficult to solve, that only we can seek, and that make a difference to the world. Do you hunger to realize your potential to perform at a high level and make a significant contribution at NVIDIA? Join us in revolutionizing the world of AI!

We are now looking for an HPC Cluster Administrator in the Datacenter Systems & Board Design Team to lead a diverse cluster of GPU-accelerated systems and provide architectural direction to teams in the development, hardware, and operating system domains. As a member of the DCS Team, you will collaborate with an extraordinary, innovative, multidisciplinary team of engineers to help build and develop our industry-leading GPU-accelerated clustered computing products. This role will support and maintain existing GPU compute clusters involving sophisticated system-level hardware and software.
In this role, you will help us with strategic challenges, including compute, networking, and monitoring for large-scale, high-performance workloads and effective resource utilization in a heterogeneous environment.

What You Will Be Doing:

- As an HPC Administrator, you will play a crucial role in crafting and implementing innovative architectures for high-performance computing systems, enabling efficient and scalable computation for scientific, research, and data-intensive applications.
- Collaborate closely with multi-functional teams, including hardware engineers, software developers, and domain experts, to deliver optimized solutions that meet the demanding requirements of HPC workloads.
- Collaborate with software and hardware engineers to design and optimize the system's software frameworks and computational components, including processors, accelerators, interconnects, and memory subsystems.
- Conduct performance analysis, benchmarking, and modeling to identify performance bottlenecks, optimize system parameters, and guide architectural enhancements.
- Provide technical guidance and mentorship to junior team members, fostering partnership and standard processes within the HPC architecture domain.
- Automate configuration management, software updates, maintenance, and monitoring of system availability using modern DevOps tools (Ansible, GitLab, etc.).
- Actively connect with management and SMEs regarding any problems with the equipment and propose resolutions.
- Design, implement, and support large-scale infrastructure with monitoring, logging, and alerting.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.

What We Need to See:

- BS or MS in Computer Science, or equivalent experience, with 3+ years of proven experience.
- Extensive experience building and owning large-scale, multi-threaded compute systems.
- Must have experience with Linux system administration (Ubuntu, CentOS/Red Hat), the Linux CLI, and TCP/IP network fundamentals.
- Must have experience setting up and administering HPC cluster schedulers such as Slurm and/or LSF.
- Comfortable collaborating with the engineering team to diagnose and own issues in HPC deployments, and with shell scripting and automation of repetitive administration tasks using Python, Perl, or Bash.
- Proficient in handling highly available and scalable IT infrastructure, with knowledge of Docker/virtualization, monitoring, Ansible, Puppet, Chef, and log analysis, and with performance skills.
- Strong technical skills and understanding of embedded systems, orchestration and automation systems, data centers, and cloud architecture, as well as excellent communication and planning skills.
- Experience with industry-standard interconnects and network fabrics, such as InfiniBand, Ethernet, or Omni-Path, and their impact on HPC system performance.
- Good attention to detail, as well as clear written and verbal communication.

Ways to Stand Out From the Crowd:

- Familiarity and prior work experience with one or more technologies such as Ansible, Git, Slurm, Zabbix, Prometheus, Grafana, Docker, and Bright Cluster Manager (BCM).
- Experience with mobile and embedded systems.
- Understanding of InfiniBand or Ethernet concepts.
- Experience with high-performance or large-scale computing environments and parallel computing for product bring-ups.
- Special skills in large-scale computing and cluster computing (MPI), and in data center design, including high-speed interconnects (InfiniBand), cluster storage, and scheduling-related design and/or management experience.

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most forward-thinking and hardworking people in the world on our team, and our collaborative talent continues to drive NVIDIA's growth. We are seeking creative and independent engineers with real passion for technology!

The base salary range is 116,000 USD - 224,250 USD.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
