Software Engineer, Cloud Infrastructure

Fireworks AI
Full-time • $205k-$240k/year (USD) • Redwood City, United States

📍 Job Overview

  • Job Title: Software Engineer, Cloud Infrastructure
  • Company: Fireworks AI
  • Location: Redwood City, California, United States
  • Job Type: Full-Time
  • Category: DevOps Engineer, Infrastructure Engineer
  • Date Posted: 2025-06-20
  • Experience Level: Mid-Senior Level (5-10 years)

🚀 Role Summary

  • Build and maintain scalable, resilient, and high-performance backend infrastructure for distributed training, inference, and data processing pipelines.
  • Collaborate cross-functionally with ML, DevOps, and product teams to translate research and product needs into robust infrastructure solutions.
  • Own end-to-end systems from design to deployment and observability, with a strong emphasis on reliability, fault tolerance, and operational excellence.

📝 Enhancement Note: This role requires a deep understanding of distributed systems, cloud-native infrastructure, and machine learning platforms. The ideal candidate will have a proven track record in building and operating large-scale ML infrastructure and driving system performance, reliability, and cost-efficiency improvements.

💻 Primary Responsibilities

  • Architect and build scalable, resilient, and high-performance backend infrastructure to support distributed training, inference, and data processing pipelines.
  • Lead technical design discussions, mentor other engineers, and establish best practices for building and operating large-scale ML infrastructure.
  • Design and implement core backend services (e.g., job schedulers, resource managers, autoscalers, model serving layers) with a focus on efficiency and low latency.
  • Drive infrastructure optimization initiatives, including compute cost reduction, storage lifecycle management, and network performance tuning.
  • Collaborate cross-functionally with ML, DevOps, and product teams to translate research and product needs into robust infrastructure solutions.
  • Continuously evaluate and integrate cloud-native and open-source technologies (e.g., Kubernetes, Ray, Kubeflow, MLflow) to enhance the platform's capabilities and reliability (see the brief sketch below).
  • Own end-to-end systems from design to deployment and observability, with a strong emphasis on reliability, fault tolerance, and operational excellence.

📝 Enhancement Note: The primary responsibilities of this role revolve around designing, building, and maintaining scalable, resilient, and high-performance backend infrastructure for distributed AI workloads. This requires a strong focus on efficiency, low latency, and operational simplicity across compute, storage, and networking layers.
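
📝 Enhancement Note: As a purely illustrative example of the open-source tooling named in the responsibilities above, the sketch below uses Ray to fan a toy batch-preprocessing job out across a cluster. The task body, data, and resource settings are hypothetical assumptions made for the sketch, not a description of Fireworks AI's actual systems.

```python
# Minimal, illustrative sketch only: a hypothetical batch-preprocessing job
# distributed with Ray. The task body and data are toy examples and do not
# reflect Fireworks AI's internal systems.
import ray

ray.init()  # connects to a configured cluster if one exists, otherwise starts a local instance


@ray.remote(num_cpus=1)
def preprocess(shard: list) -> int:
    """Clean one shard of string records and return how many were kept."""
    cleaned = [record.strip() for record in shard if record.strip()]
    return len(cleaned)


shards = [["  a", "b ", ""], ["c", "  "], ["d", "e", "f"]]  # toy data
futures = [preprocess.remote(shard) for shard in shards]    # schedule tasks across the cluster
print(sum(ray.get(futures)))                                # gather results -> prints 6
```

In this role, the emphasis falls on the infrastructure around jobs like this one (scheduling, autoscaling, fault tolerance, observability) rather than on the task body itself.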

🎓 Skills & Qualifications

Education: Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience). A Master’s or PhD in Computer Science or a related field is preferred.

Experience: 5+ years of experience designing and building backend infrastructure in cloud environments (e.g., AWS, GCP, Azure). Proven experience in ML infrastructure and tooling (e.g., PyTorch, TensorFlow, Vertex AI, SageMaker, Kubernetes) is required.

Required Skills:

  • Strong software development skills in languages such as Python or C++.
  • Deep understanding of distributed systems fundamentals: scheduling, orchestration, storage, networking, and compute optimization (a brief scheduling sketch appears at the end of this section).
  • Experience with cloud-native infrastructure and machine learning platforms.
  • Familiarity with infrastructure-as-code and CI/CD tooling (e.g., Terraform, ArgoCD, GitOps) is preferred.

Preferred Skills:

  • Experience leading infrastructure projects supporting large-scale ML/AI workloads or high-throughput systems.
  • Contributions to open-source cloud or ML infrastructure projects are a plus.

📝 Enhancement Note: The required and preferred skills for this role emphasize a strong background in distributed systems, cloud-native infrastructure, and machine learning platforms. Candidates should have proven experience in ML infrastructure and tooling, as well as strong software development skills in languages such as Python or C++.
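
📝 Enhancement Note: To make the "scheduling" fundamental above concrete, here is a deliberately simplified, hypothetical sketch of priority-based job admission in Python. Real cluster schedulers (e.g., Kubernetes, Ray) add bin-packing, preemption, fairness, and failure handling; this is only an assumption-level illustration.

```python
# Deliberately simplified sketch of priority-based job admission under a GPU
# budget. All job names and numbers are hypothetical; production schedulers
# (Kubernetes, Ray, etc.) handle preemption, bin-packing, fairness, and retries.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    priority: int                       # lower value = runs sooner
    name: str = field(compare=False)
    gpus_needed: int = field(compare=False)


def schedule(jobs, free_gpus):
    """Admit jobs in priority order while GPU capacity remains."""
    heap = list(jobs)
    heapq.heapify(heap)                 # min-heap keyed on priority
    admitted = []
    while heap and free_gpus > 0:
        job = heapq.heappop(heap)
        if job.gpus_needed <= free_gpus:
            free_gpus -= job.gpus_needed
            admitted.append(job.name)
    return admitted


jobs = [Job(2, "batch-eval", 4), Job(1, "online-serving", 2), Job(3, "retrain", 8)]
print(schedule(jobs, free_gpus=8))      # -> ['online-serving', 'batch-eval']; 'retrain' is skipped for lack of GPUs
```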

📊 Portfolio & Project Requirements

Portfolio Essentials:

  • Demonstrate experience in building and operating large-scale ML infrastructure by showcasing projects that highlight your expertise in distributed systems, cloud-native infrastructure, and machine learning platforms.
  • Display a strong understanding of infrastructure optimization by presenting projects that focus on compute cost reduction, storage lifecycle management, and network performance tuning.
  • Highlight your ability to collaborate cross-functionally by showcasing projects that involve working with ML, DevOps, and product teams to translate research and product needs into robust infrastructure solutions.
  • Demonstrate your expertise in cloud-native and open-source technologies by presenting projects that integrate Kubernetes, Ray, Kubeflow, MLFlow, or other relevant tools.

Technical Documentation:

  • Provide detailed documentation for your projects, including code quality, commenting, and documentation standards.
  • Explain your version control, deployment processes, and server configuration in detail, highlighting your understanding of infrastructure-as-code and CI/CD tooling.
  • Describe your testing methodologies, performance metrics, and optimization techniques to showcase your commitment to building reliable, efficient, and high-performing infrastructure.

📝 Enhancement Note: The portfolio requirements for this role emphasize a strong background in building and operating large-scale ML infrastructure, as well as a deep understanding of infrastructure optimization and collaboration with cross-functional teams. Candidates should demonstrate their expertise in cloud-native and open-source technologies through their projects and technical documentation.

💵 Compensation & Benefits

Salary Range: $205,000 - $240,000 USD per year (base). The base pay range is intended as a guideline and may be adjusted. Total compensation also includes meaningful equity in a fast-growing startup and a comprehensive benefits package.

Benefits:

  • Equity in a fast-growing startup
  • Comprehensive benefits package

Working Hours: This is a full-time role, typically 40 hours per week, with flexibility expected for deployment windows, maintenance, and project deadlines.

📝 Enhancement Note: The salary range above is taken directly from the job listing and reflects the San Francisco Bay Area market, where the company is headquartered. The total compensation package includes meaningful equity in a fast-growing startup, along with a competitive salary and comprehensive benefits.

🎯 Team & Company Context

🏒 Company Culture

Industry: Fireworks AI is a generative AI infrastructure company focused on building the future of AI by offering the highest-quality models and the fastest, most scalable inference. The company has been independently benchmarked as having the fastest LLM inference and has gained strong traction with innovative research projects, such as its own function-calling and multi-modal models.

Company Size: Fireworks AI is a startup composed primarily of veterans of the PyTorch and Google Vertex AI teams, with roughly 50-100 employees.

Founded: Fireworks AI was founded in 2022 and has gained strong traction with innovative research projects, backed by top investors such as Benchmark and Sequoia.

Team Structure:

  • The Cloud Infrastructure team is responsible for architecting and building the foundational systems that power Fireworks AI's revolutionary generative AI platform.
  • The team works closely with engineering partners, product teams, and infrastructure stakeholders to design solutions that balance performance, cost-efficiency, and operational simplicity across compute, storage, and networking layers.
  • The team is composed of experienced engineers with deep expertise in distributed systems, cloud-native infrastructure, and machine learning platforms.

Development Methodology:

  • Fireworks AI uses Agile methodologies, with a focus on delivering unparalleled reliability, efficiency, and scalability for AI workloads.
  • The team emphasizes collaboration, code review, and continuous learning to ensure the highest quality of work.
  • Infrastructure-as-code and CI/CD tooling are used to automate deployment and ensure consistency across the platform (a brief illustrative sketch follows).
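
📝 Enhancement Note: As a small, hedged illustration of CI/CD automation around deployments, the sketch below uses the official Kubernetes Python client to gate a pipeline step on a Deployment's rollout completing. The deployment and namespace names are invented; the posting does not describe Fireworks AI's actual pipelines.

```python
# Hypothetical post-deploy gate for a CI/CD pipeline, written with the official
# Kubernetes Python client. Deployment and namespace names are invented for the
# sketch; this is not a description of Fireworks AI's pipelines.
import time

from kubernetes import client, config


def wait_for_rollout(name, namespace, timeout_s=300):
    """Return True once every desired replica of the Deployment reports ready."""
    config.load_kube_config()           # use config.load_incluster_config() when running inside a cluster
    apps = client.AppsV1Api()
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        dep = apps.read_namespaced_deployment(name, namespace)
        desired = dep.spec.replicas or 0
        ready = dep.status.ready_replicas or 0
        if desired > 0 and ready == desired:
            return True
        time.sleep(5)
    return False


if __name__ == "__main__":
    ok = wait_for_rollout("model-serving", "inference")  # hypothetical names
    raise SystemExit(0 if ok else 1)     # a non-zero exit fails the pipeline step
```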

Company Website: Fireworks AI

📝 Enhancement Note: Fireworks AI is a startup focused on building the future of generative AI infrastructure. The company culture emphasizes collaboration, continuous learning, and a strong commitment to delivering high-quality, reliable, and efficient AI solutions.

📈 Career & Growth Analysis

Career Level: This role is at the mid-senior level, requiring a deep understanding of distributed systems, cloud-native infrastructure, and machine learning platforms. The ideal candidate will have a proven track record in building and operating large-scale ML infrastructure and driving system performance, reliability, and cost-efficiency improvements.

Reporting Structure: The Software Engineer, Cloud Infrastructure, will report directly to the Engineering Manager of the Cloud Infrastructure team. They will work closely with engineering partners, product teams, and infrastructure stakeholders to design and implement robust infrastructure solutions.

Technical Impact: The technical impact of this role is significant, as the Software Engineer, Cloud Infrastructure, will be responsible for architecting and building the foundational systems that power Fireworks AI's revolutionary generative AI platform. Their work will directly influence the performance, reliability, and scalability of the platform, enabling innovative research projects and driving the company's success.

Growth Opportunities:

  • Technical Growth: The ideal candidate will have the opportunity to learn and grow their skills in distributed systems, cloud-native infrastructure, and machine learning platforms. They will work alongside experienced engineers and have the chance to contribute to open-source cloud or ML infrastructure projects.
  • Leadership Development: As the team grows, there may be opportunities for the Software Engineer, Cloud Infrastructure, to take on leadership roles, mentoring other engineers and driving best practices for building and operating large-scale ML infrastructure.
  • Architecture Decision-Making: The ideal candidate will have the opportunity to make critical architecture decisions that balance performance, cost-efficiency, and operational simplicity across compute, storage, and networking layers.

📝 Enhancement Note: This role offers significant growth opportunities for technical and leadership development. The ideal candidate will have the chance to learn from experienced engineers, contribute to open-source projects, and make critical architecture decisions that drive the success of Fireworks AI's generative AI platform.

🌐 Work Environment

Office Type: Fireworks AI's office is an open, modern, collaborative workspace designed to foster innovation, communication, and creativity.

Office Location(s): Fireworks AI's headquarters is located in Redwood City, California, with additional offices in San Francisco and remote team members worldwide.

Workspace Context:

  • Collaborative Development Environment: The workspace is designed to facilitate collaboration and communication between engineers, product teams, and stakeholders, ensuring that everyone is aligned and working towards the same goals.
  • Development Tools and Resources: The workspace is equipped with state-of-the-art development tools and resources, including multiple monitors, testing devices, and high-speed internet connectivity.
  • Cross-Functional Collaboration Opportunities: The workspace is designed to encourage cross-functional collaboration between engineers, designers, and stakeholders. This ensures that everyone's voices are heard and that the best possible solutions are implemented.

Work Schedule: The work schedule for this role is typically 40 hours per week, with flexibility for deployment windows, maintenance, and project deadlines. The team uses Agile methodologies, with regular sprint planning and stand-up meetings to ensure everyone is on track and working towards the same goals.

📝 Enhancement Note: Fireworks AI's work environment is designed to be collaborative, innovative, and supportive of its employees' growth and success. The workspace is equipped with state-of-the-art development tools and resources, and the team uses Agile methodologies to ensure everyone is aligned and working towards the same goals.

📄 Application & Technical Interview Process

Interview Process:

  1. Technical Phone Screen: A 30-minute phone screen to assess your understanding of distributed systems, cloud-native infrastructure, and machine learning platforms. Be prepared to discuss your experience with backend infrastructure, ML infrastructure tooling, and software development skills.
  2. Technical Deep Dive: A 60-minute deep dive into your technical expertise, focusing on your experience with backend infrastructure, ML infrastructure tooling, and software development skills. Be prepared to discuss your approach to designing, building, and maintaining scalable, resilient, and high-performance backend infrastructure.
  3. Behavioral and Cultural Fit: A 30-minute conversation to assess your cultural fit with the Fireworks AI team. Be prepared to discuss your problem-solving skills, collaboration style, and commitment to driving system performance, reliability, and cost-efficiency improvements.
  4. Final Decision: A final decision will be made based on your technical expertise, cultural fit, and alignment with the company's mission and values.

Portfolio Review Tips:

  • Highlight your experience in building and operating large-scale ML infrastructure by showcasing projects that demonstrate your expertise in distributed systems, cloud-native infrastructure, and machine learning platforms.
  • Demonstrate your understanding of infrastructure optimization by presenting projects that focus on compute cost reduction, storage lifecycle management, and network performance tuning.
  • Showcase your ability to collaborate cross-functionally by presenting projects that involve working with ML, DevOps, and product teams to translate research and product needs into robust infrastructure solutions.
  • Explain your approach to designing, building, and maintaining scalable, resilient, and high-performance backend infrastructure by discussing your experience with backend infrastructure, ML infrastructure tooling, and software development skills.

Technical Challenge Preparation:

  • Brush up on your knowledge of distributed systems fundamentals, including scheduling, orchestration, storage, networking, and compute optimization.
  • Review your experience with cloud-native infrastructure and machine learning platforms, ensuring you are familiar with the latest trends and best practices.
  • Prepare for questions about your experience with backend infrastructure, ML infrastructure tooling, and software development skills, focusing on your ability to design, build, and maintain scalable, resilient, and high-performance backend infrastructure.

ATS Keywords:

  • Programming Languages: Python, C++, Java, Go, Rust
  • Web Frameworks: Flask, Django, FastAPI, Spring Boot
  • Server Technologies: Kubernetes, Docker, AWS EKS, GKE, AKS
  • Databases: PostgreSQL, MySQL, MongoDB, Redis
  • Tools: Terraform, ArgoCD, GitOps, CI/CD pipelines, Infrastructure-as-Code
  • Methodologies: Agile, Scrum, Kanban, GitOps
  • Soft Skills: Problem-solving, collaboration, communication, leadership, mentoring
  • Industry Terms: Cloud infrastructure, distributed systems, machine learning platforms, backend development, ML infrastructure, CI/CD, infrastructure-as-code, networking, storage management, performance optimization, resource management, autoscalers

📝 Enhancement Note: The interview process for this role focuses on assessing the candidate's technical expertise in distributed systems, cloud-native infrastructure, and machine learning platforms. The portfolio review tips and technical challenge preparation guidance are designed to help candidates showcase their experience and expertise in building and operating large-scale ML infrastructure.

📌 Application Steps

To apply for this Software Engineer, Cloud Infrastructure position at Fireworks AI:

  1. Customize your resume and portfolio to highlight your experience with backend infrastructure, ML infrastructure tooling, and software development skills. Make sure to emphasize your ability to design, build, and maintain scalable, resilient, and high-performance backend infrastructure.
  2. Research Fireworks AI's mission, values, and culture to ensure you are a strong fit for the team. Prepare thoughtful questions to ask during the behavioral and cultural fit interview.
  3. Prepare for the technical phone screen and deep dive by brushing up on your knowledge of distributed systems fundamentals, cloud-native infrastructure, and machine learning platforms. Review your experience with backend infrastructure, ML infrastructure tooling, and software development skills.
  4. Submit your application through the application link provided in the job listing. Make sure to include your resume, portfolio, and any other relevant documents that showcase your experience and expertise.

⚠️ Important Notice: This enhanced job description includes AI-generated insights and cloud infrastructure industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.

Application Requirements

Bachelor’s degree in a technical field and 5+ years of experience in cloud environments required. Proven experience in ML infrastructure and strong software development skills are essential.