Fireworks AI

Software Engineer, Site Reliability Engineer

Posted 6 Days Ago
8 Locations
Mid level
The Software Engineer will focus on reducing incident response time, optimizing Kubernetes clusters, enhancing cloud networking, and automating infrastructure management. Responsibilities include implementing monitoring systems, conducting post-mortems, and improving service health through robust architectures and automation.
The summary above was generated by AI

About Us:

Here at Fireworks, we’re building the future of generative AI infrastructure. Fireworks offers the generative AI platform with the highest-quality models and the fastest, most scalable inference. We’ve been independently benchmarked to have the fastest LLM inference and have been getting great traction with innovative research projects, like our own function calling and multi-modal models. Fireworks is funded by top investors, like Benchmark and Sequoia, and we’re an ambitious, fun team composed primarily of veterans from PyTorch and Google Vertex AI.

The Role:

We’re seeking a highly skilled SRE/PE with deep expertise in Kubernetes (k8s), cloud networking, and infrastructure automation. This role will focus on reducing incident response time, implementing auto-remediation, optimizing auto-scaling, and improving cluster efficiency and service health. You’ll design systems that balance performance, cost, and reliability while working onsite with our Redwood City or New York City team.

Key Responsibilities:

  1. Incident Response & Reliability Engineering:

    • Drive initiatives to reduce incident response time through improved monitoring, alerting, and automated remediation.

    • Build self-healing systems and playbooks for common failure scenarios.

    • Lead blameless post-mortems and implement preventative measures.

  2. Kubernetes & GPU Cluster Optimization:

    • Manage and optimize GPU-enabled Kubernetes clusters for AI/ML workloads, focusing on cost-performance efficiency, auto-scaling, and resource utilization.

    • Debug performance bottlenecks in distributed systems (e.g., network, storage, GPU scheduling).

  3. Cloud Networking & Service Health:

    • Strengthen service health by refining cloud networking stacks (VPCs, load balancers, service meshes) and ensuring low-latency communication.

    • Design fault-tolerant architectures to minimize downtime.

  4. Monitoring & Observability:

    • Enhance service monitoring with tools like Prometheus, Grafana, and custom metrics pipelines.

    • Implement predictive analytics to proactively address system health risks.

  5. Automation & Infrastructure-as-Code (IaC):

    • Build automation for cluster provisioning, scaling, and recovery using Terraform, Argo, and CI/CD pipelines.

    • Develop tools to streamline operational workflows (e.g., automated rollbacks, canary deployments).

Minimum Qualifications:

  • 3+ years in SRE/PE/DevOps roles with production-grade Kubernetes experience.

  • Proficiency in cloud networking (AWS/GCP/Azure VPCs, firewalls, DNS) and service monitoring (Prometheus, Alertmanager, Grafana).

  • Hands-on experience with incident management and improving system reliability/SLOs.

  • Strong scripting/coding skills (Python/Go/Bash) for automation and tooling.

  • Familiarity with object storage (S3, GCS) and data pipeline integration.

Preferred Qualifications:

  • Experience with GPU clusters (NVIDIA GPUs, MIG, CUDA) and AI/ML workloads.

  • Knowledge of auto-scaling technologies (K8s HPA/VPA) and auto-remediation frameworks.

  • Expertise in service meshes (Istio).

Why Fireworks AI?

  • Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure, from low-latency inference to scalable model serving.

  • Build What’s Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally.

  • Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI—no bureaucracy, just results.

  • Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation.

Top Skills

AWS
Azure
Bash
GCP
Go
Grafana
Kubernetes
Prometheus
Python
Terraform
HQ

Fireworks AI Redwood City, California, USA Office

Redwood City, CA, United States, 94063

What you need to know about the San Francisco Tech Scene

San Francisco and the surrounding Bay Area attract more startup funding than any other region in the world. Home to Stanford University and UC Berkeley, leading VC firms and several of the world’s most valuable companies, the Bay Area is the place to go for anyone looking to make it big in the tech industry. That said, San Francisco has a lot to offer beyond technology thanks to a thriving art and music scene, excellent food and a short drive to several of the country’s most beautiful recreational areas.

Key Facts About San Francisco Tech

  • Number of Tech Workers: 365,500; 13.9% of overall workforce (2024 CompTIA survey)
  • Major Tech Employers: Google, Apple, Salesforce, Meta
  • Key Industries: Artificial intelligence, cloud computing, fintech, consumer technology, software
  • Funding Landscape: $50.5 billion in venture capital funding in 2024 (Pitchbook)
  • Notable Investors: Sequoia Capital, Andreessen Horowitz, Bessemer Venture Partners, Greylock Partners, Khosla Ventures, Kleiner Perkins
  • Research Centers and Universities: Stanford University; University of California, Berkeley; University of San Francisco; Santa Clara University; Ames Research Center; Center for AI Safety; California Institute for Regenerative Medicine