X Corp.

Site Reliability Engineer - High Performance Computing / AI-ML

Posted 24 Days Ago

5 Locations

120K-297K Annually

Junior

As a Site Reliability Engineer, you will manage large-scale HPC systems, ensure stability and performance, and automate deployment processes.

The summary above was generated by AI

Role: Site Reliability Engineer - HPC / AI-ML (All Levels)
Location: Palo Alto, New York, Seattle or Austin
Base Salary Range: $120,000 to $297,000 + Equity

Who We Are:

At X, we’re pioneering the frontier of technology with our innovative Everything App. Our mission is to revolutionize how people connect, share ideas, and engage in meaningful conversations. We champion freedom of speech and strive to create a platform that embraces diverse perspectives. Our commitment is to foster open dialogue and empower individuals to express themselves freely.

What You’ll Do:

As a Site Reliability Engineer (SRE) supporting HPC (High Performance Computing) + AI/ML initiatives at X, you will play a crucial role in maintaining and enhancing the reliability, availability, and performance of our large-scale systems. Your responsibilities will include:

Managing and troubleshooting large-scale clusters to ensure the stability and efficiency of our platform (primarily Linux + Kubernetes)
Collaborating with cross-functional teams, including hardware engineers and software developers, to support and improve our infrastructure
Automating the provisioning and deployment of systems to enhance long-term health and scalability
Ensuring the robustness of our HPC environments and storage clusters
Writing and maintaining scripts and tools for automation and monitoring
Addressing system failures and performance issues, identifying root causes, and implementing preventive measures
Working closely with end users to understand changing needs as our environment evolves

Who You Are:

We're looking for exceptional engineers who are passionate about our mission and have a strong desire to make a meaningful impact. The ideal candidate will have:

2+ years of professional software development experience
Extensive experience with Kubernetes and container orchestration
Proficiency in one or more object-oriented programming languages (e.g., Python, Java, C++, Scala)
Proficiency in scripting languages (e.g., Python, Bash)
Strong experience in configuration management (e.g., Puppet, Ansible, Chef)
Familiarity with Ethernet networking at scale and distributed systems
Strong troubleshooting skills and experience with HPC environments
Experience managing large-scale systems, ideally supporting thousands of machines
Working understanding of the storage systems required to support such environments
Experience with various GPU / accelerator architectures and ability to optimize performance on such platforms
Ability to think outside the box and come up with innovative solutions to complicated problems
Strong commitment and willingness to work in a fast-paced environment
Excellent communication and interpersonal skills

At X, our small but fast-paced team values innovation, creativity, and a strong commitment to our mission. As a Site Reliability Engineer, you'll have the opportunity to make a significant impact on the future of X and our aspiration to build the Everything App.

Top Skills

Ansible

Bash

C++

Chef

Java

Kubernetes

Linux

Puppet

Python

Scala

San Francisco, CA, United States

What you need to know about the San Francisco Tech Scene

San Francisco and the surrounding Bay Area attract more startup funding than any other region in the world. Home to Stanford University and UC Berkeley, leading VC firms and several of the world’s most valuable companies, the Bay Area is the place to go for anyone looking to make it big in the tech industry. That said, San Francisco has a lot to offer beyond technology thanks to a thriving art and music scene, excellent food and a short drive to several of the country’s most beautiful recreational areas.

Key Facts About San Francisco Tech

Number of Tech Workers: 365,500; 13.9% of overall workforce (2024 CompTIA survey)
Major Tech Employers: Google, Apple, Salesforce, Meta
Key Industries: Artificial intelligence, cloud computing, fintech, consumer technology, software
Funding Landscape: $50.5 billion in venture capital funding in 2024 (Pitchbook)
Notable Investors: Sequoia Capital, Andreessen Horowitz, Bessemer Venture Partners, Greylock Partners, Khosla Ventures, Kleiner Perkins
Research Centers and Universities: Stanford University; University of California, Berkeley; University of San Francisco; Santa Clara University; Ames Research Center; Center for AI Safety; California Institute for Regenerative Medicine
