WitnessAI

SRE - Performance Engineering

Posted 5 Days Ago

7 Locations

180K-220K Annually

Senior level

As a Site Reliability Engineer focusing on performance, you will analyze and optimize cloud infrastructure, conduct performance tuning, design performance dashboards, and mentor teams on performance best practices. You will apply data-driven methodologies and profiling tools to solve complex system challenges involving Linux systems, AWS, Kubernetes, and GPU workloads.

The summary above was generated by AI

Job Title: Site Reliability Engineering - Performance Engineer

Location: Bay Area preferred/Hybrid

Department: DevOps

At WitnessAI, we're at the intersection of innovation and security in AI. We are seeking a Site Reliability Engineer focused on performance engineering. This role emphasizes deep systems-level performance analysis, tuning, and optimization to ensure the reliability and efficiency of our cloud-based infrastructure. You will drive performance across a tech stack that includes cloud infrastructure, Linux, Kubernetes, databases, message queuing systems, AI workloads, and GPUs. The ideal candidate brings a passion for data-driven methodologies, flame graph analysis, and advanced performance debugging to solve complex system challenges.

Key Responsibilities

Conduct root cause analysis (RCA) for performance bottlenecks using data-driven approaches like flame graphs, heatmaps, and latency histograms.
Perform detailed kernel and application tracing using tools based on technologies like eBPF, perf, and ftrace to gain insights into system behavior.
Design and implement performance dashboards to visualize key performance metrics in real time.
Recommend Linux and cloud server tuning improvements to increase throughput and reduce latency.
Tune Linux systems for workload-specific demands, including scheduler, I/O subsystem, and memory management optimizations.
Analyze and optimize cloud instance types, EBS volumes, and network configurations for high performance and low latency.
Improve throughput and latency for message queues (e.g., ActiveMQ, Kafka, SQS) by profiling producer/consumer behavior and tuning configurations.
Apply profiling tools to analyze GPU utilization and kernel execution times and implement techniques to boost GPU efficiency.
Optimize distributed training pipelines using industry-standard frameworks.
Evaluate and reduce training times through mixed precision training, model quantization, and resource-aware scheduling in Kubernetes.
Work with AI teams to identify scaling challenges and optimize GPU workloads for inference and training.
Design observability systems for granular monitoring of end-to-end latency, throughput, and resource utilization.
Implement and leverage modern observability stacks to capture critical insights into application and infrastructure behavior.
Work with developers to refactor applications for performance and scalability using profiling tools.
Mentor teams on performance best practices, debugging workflows, and methodologies inspired by leading performance engineers.

Qualifications

Required:

Deep expertise in Linux systems internals (kernel, I/O, networking, memory management) and performance tuning.
Strong experience with AWS cloud services and their performance optimization techniques.
Proficiency with performance analysis tools, load testing tools, and system tracing frameworks.
Hands-on experience with database tuning, query analysis, and indexing strategies.
Expertise in GPU workload optimization and cloud-based GPU instances.
Familiarity with message queuing systems including performance tuning.
Programming experience with a focus on profiling and tuning.
Strong scripting skills (e.g., Python, Bash) to automate performance measurement and tuning workflows.

Preferred:

Knowledge of distributed AI/ML training frameworks.
Experience designing and scaling GPU workloads on Kubernetes using GPU-aware scheduling and resource isolation.
Expertise in optimizing AI inference pipelines.
Familiarity with Brendan Gregg’s methodologies for systems analysis, such as the USE method (Utilization, Saturation, Errors) and workload characterization.

Benefits:

Hybrid work environment
Competitive salary
Health, dental, and vision insurance
401(k) plan
Opportunities for professional development and growth
Generous vacation policy

Salary range:

$180,000-$220,000

Top Skills

AWS

Bash

GPUs

Kubernetes

Linux

Python

San Mateo, CA, United States, 94403
