RunPod

Site Reliability Engineer

Sorry, this job was removed at 10:16 p.m. (PST) on Monday, Feb 03, 2025

Remote

Hiring Remotely in USA

RunPod is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full-stack AI applications. Founded in 2022, we are a rapidly growing, well-funded company with a remote-first organization spread globally. Our mission is to empower innovators and enterprises to unlock AI's true potential, driving technology and transforming industries. Join us as we shape the future of AI.

As our organization continues its rapid expansion in managing large-scale, distributed systems, we are seeking a full-time, remote Site Reliability Engineer to join our team. This technical position will be pivotal in designing, implementing, and maintaining our robust infrastructure across multiple data centers. The ideal candidate will have deep knowledge of Linux systems, containerization, and virtualization technologies, coupled with strong experience in managing large bare-metal fleets and implementing security best practices. This role offers the opportunity to work with cutting-edge GPU/AI technologies, solve complex problems at scale, and contribute to the reliability and performance of our critical systems. We provide competitive compensation, including stock options, and the flexibility of remote work within a culture that values innovation, continuous learning, and technical excellence.

Key aspects of our SRE approach include:

Automation First: We write software to manage, scale, and optimize our infrastructure, moving beyond manual operations to enable rapid, consistent, and reliable system scaling.
Systems Thinking: Our SREs approach problems with a holistic view, considering how changes and improvements in one area can positively impact the entire system.
Continuous Improvement: We constantly iterate on our processes and tooling, using data-driven decisions to enhance system reliability and performance.
Proactive Problem Solving: Rather than reactively addressing issues, we build systems and tools that anticipate and mitigate potential problems before they occur.
Scalability Through Code: We believe in managing infrastructure as code, allowing us to version, test, and deploy our infrastructure configurations with the same rigor as application code.

As an SRE on our team, you'll be at the forefront of this approach, using your software engineering skills to build robust, scalable systems that support our rapidly growing infrastructure. You'll work on challenging projects that require innovative solutions, always with an eye toward automation, reliability, and performance at scale.

If you are passionate about building and maintaining highly reliable, scalable systems and have the skills to match, we want to hear from you. Join our team and help shape the future of AI compute infrastructure!

Responsibilities:

Design, implement, and maintain robust, scalable, and highly available systems
Troubleshoot and resolve complex issues in distributed environments
Develop and implement SLIs and SLOs to ensure system reliability and performance
Manage and optimize large-scale bare-metal fleets across multiple data centers
Implement and maintain secure practices for foundational systems
Collaborate with cross-functional teams to improve system design and operation
Automate processes to increase efficiency and reduce human error
Participate in on-call rotations to provide 24/7 support for critical systems

Requirements:

Deep knowledge of Linux kernel internals, containerization (Docker), virtualization (Kata/QEMU), and networking components
Extensive experience with distributed system troubleshooting and design
Proficiency in at least one programming language, preferably Python or Golang
Proven experience implementing and managing SLIs and SLOs
Experience with pull-based configuration management tools such as Chef or Puppet
Demonstrated ability to manage large-scale bare-metal fleets (5,000+ machines) across multiple data centers
Strong background in implementing security best practices for foundational systems, including secret management, AWS IAM permissions, and key distribution systems
Comprehensive understanding of OSI model Layers 3, 4, and 7
Successful completion of a background check

Preferred:

Bachelor's degree in Computer Science, Engineering, or a related field
Relevant industry certifications (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator)
Experience with cloud platforms (AWS, GCP, Azure)
Familiarity with monitoring and observability tools (e.g., StatsD, Grafana, Datadog, OpenTelemetry, VictoriaMetrics)
Experience with managing fleets of GPU compute resources at scale
Strong communication skills and ability to work effectively in a team environment

What You’ll Receive:

The competitive base pay for this position ranges from $152,000 to $175,000. Factors that may be used to determine your actual pay include your specific job-related knowledge, skills, and experience.
Stock options
The flexibility of remote work with an inclusive, collaborative team.
An opportunity to grow with a company that values innovation and user-centric design.
Generous vacation policy to ensure work-life harmony and well-being.
Contribute to a company with a global impact based in the US, Canada, and Europe.

RunPod is committed to maintaining a workplace free from discrimination and upholding the principles of equality and respect for all individuals. We believe that diversity in all its forms enhances our team. As an equal opportunity employer, RunPod is committed to creating an inclusive workforce at every level. We evaluate qualified applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, marital status, protected veteran status, disability status, or any other characteristic protected by law.

San Francisco, California, United States
