The NVIDIA IT organization is looking for site reliability engineering talent to build, deploy and scale NVIDIA’s infrastructure. These services include software that manages hardware and network provisioning in order to deploy and manage a multi-tenant infrastructure. As a site reliability engineer, you will work with other site reliability engineers, software engineers, product owners, and network engineers as a collaborative team to deliver and maintain end-to-end solutions that manage complex hybrid cloud infrastructure deployments. You will write, and integrate with, services and software that align with the broad architectural vision for the NVIDIA IT Network Infrastructure, working with other teams to develop a robust, scalable, and sustainable system. You own your code, from development to commit to test to production. We expect you to be passionate about code quality, documentation, testing, deployment efficiency/simplicity and bringing amazing products and capacity to market.

What you will be doing:
Work with NVIDIA internal customers.
Design, build, and operate scalable software systems to manage NVIDIA’s network infrastructure.
Lead sustainable incident response, blameless postmortems, and production improvements that result in direct business opportunities for NVIDIA.
Provide guidance to other team members on managing end-to-end availability and performance of mission-critical services, on building automation to prevent problem recurrence, and on building automated responses for non-exceptional service conditions.
Build network and systems automation software for managing a multi-tenant cloud infrastructure.
Debug complex problems across the full stack and create solid solutions, identifying and digging deeper into root cause analysis of network incidents. A strong network background is good to have.
Automate work across a variety of infrastructure needs such as testing, failover, policy modifications and deployment.
Write, update, and use documentation, including runbooks/playbooks. The ability to respond consistently by regularly creating runbooks/playbooks, with an eye toward additional automation opportunities in the environment, is a must-have skill.

What we need to see:
8+ years of experience with designing and building distributed software systems.
BS/MS degree in Computer Science or related areas (or equivalent experience).
Demonstrated ability to write code in a mainstream systems programming language such as C, C++, Go, Python, Java, Rust, etc.
Demonstrated ability to use, design and implement maintainable APIs, including use of tools such as Git, NetBox, CloudVision Portal, SaltStack, VictoriaMetrics, SNMP and HashiCorp Vault.
Practical experience with asynchronous programming, type safety, threading models, and state machines.
Understanding of underlying Linux internals: kernel scheduling, memory management, and networking subsystems.
Knowledge of networking protocols such as IP, IPv6, BGP, HTTP, ICMP, and tunneling protocols (VXLAN, Geneve, GRE) in a multi-vendor environment as implemented on platforms such as Arista, Cumulus, Cisco, HP, Palo Alto and others.
Proven experience with data persistence (SQL or similar).
Understanding of secure communication protocols (mutual-TLS, IPsec, or similar).
Consistent record of reaching cross-functional consensus without having all the details.

Ways to stand out from the crowd:
Experience in a Hyperscale Cloud Service Provider (public facing or not).
Familiarity with high-level compiled languages such as Go or Java.
Exposure to host security services and security principles such as TPM, TXT, and Secure Boot.
Knowledge of SRE principles (observability, SLOs, SLIs, logging, etc.).

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services.
Our work opens new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence. NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and talented people in the world working for us. If you're creative and passionate about developing cloud services, we want to hear from you!

The base salary range is 160,000 USD - 304,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.