Site Reliability Engineer - Metrics

Company: NVIDIA
Location: US, NC, Durham
Commitment: Full time
Posted on: 2024-08-17 05:14
NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. NVIDIA is a “learning machine” that constantly evolves by seeking new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human creativity and intelligence. Make the choice to join us today!

As an SRE focused on metrics reporting, you will collaborate closely with cross-functional teams, including software engineers, data scientists, and operations, to monitor, analyze, and optimize our systems. Your primary responsibility will be to collect, analyze, and present key performance indicators (KPIs) that drive operational excellence and inform strategic decisions.

What you’ll be doing:
- Develop, test, and deploy data collectors, pipelines, and services to enhance use of our AI/ML and chip development infrastructure.
- Participate in the full life cycle of tool development, testing, and deployment.
- Work in a diverse team to provide operational and strategic metrics that empower our engineers to develop at the speed of light.
- Continuously improve our chip development process through better observability.
- Directly contribute to the overall quality of our next-generation chips and improve their time to market.

What we need to see:
- Experience in applying data analysis principles and influencing data-driven decisions.
- Experience with turning raw data into actionable reports.
- Hands-on experience with observability platforms such as Apache Spark, Elasticsearch/OpenSearch, Grafana, Prometheus, and other similar open source tools.
- Authoritative-level Python programming experience, including use of API calls.
- Extensive experience with CI/CD pipelines such as Jenkins and/or GitLab.
- Passion for improving the productivity of others.
- Excellent planning and interpersonal skills.
- Flexibility and adaptability working in a dynamic environment with changing requirements.
- MS (preferred) or BS in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
- 5+ years of relevant experience.

Ways to stand out from the crowd:
- Hands-on experience running GPU-based workloads in a batch computing environment.
- Passion for gathering and visualizing metrics and data.
- Experience with chip design workflows, such as front-end verification, back-end workflows, or mixed-signal workflows.
- Experience with job schedulers (in particular IBM Spectrum LSF and/or SLURM).
- Mastery of distributed system principles.

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most brilliant and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.

The base salary range is 148,000 USD - 276,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.
As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.