Senior / Staff Infrastructure Engineer (Compute)

FluidStack
Full-Time | San Francisco, United States

📍 Job Overview

  • Job Title: Senior / Staff Infrastructure Engineer (Compute)
  • Company: FluidStack
  • Location: San Francisco, California, United States & New York, New York, United States
  • Job Type: Full-Time
  • Category: DevOps, Infrastructure
  • Date Posted: 2025-08-03
  • Experience Level: 5-10 years
  • Remote Status: On-site

🚀 Role Summary

  • Design, deploy, and manage high-performance GPU compute infrastructure for AI workloads at scale.
  • Collaborate with cross-functional teams to ensure cohesive infrastructure operations and support AI workloads.
  • Troubleshoot complex GPU and compute system-related failures, optimizing performance and reliability.
  • Develop and maintain hardware/firmware management services, automating server lifecycle tasks.

📝 Enhancement Note: This role requires a strong background in compute infrastructure engineering, with a focus on GPU/ASIC infrastructure and AI workloads. Experience with bare metal provisioning tools and automation is crucial for success in this position.

💻 Primary Responsibilities

  • Infrastructure Design & Deployment: Design and implement GPU/ASIC infrastructure at the server, rack, and system level, ensuring high performance and scalability.
  • Hardware Troubleshooting: Troubleshoot complex GPU and compute system-related failures, collaborating with hardware vendors for RMAs when necessary.
  • Hardware Management Services: Develop and maintain hardware/firmware management services, automating server lifecycle tasks to improve efficiency.
  • Compute Lifecycle Management: Own the end-to-end compute lifecycle, including partnering with vendors on RMAs and ensuring optimal resource utilization.
  • System Performance Monitoring: Monitor system performance, identifying and resolving bottlenecks to maintain high system availability and performance.
  • Automation & Collaboration: Automate deployment and management tasks, and collaborate with storage and network teams to ensure cohesive infrastructure operations.

📝 Enhancement Note: This role requires a deep understanding of Linux systems administration and performance tuning, as well as familiarity with GPU hardware and workload optimization. Proficiency in automation tools and experience operating Kubernetes and SLURM clusters are also essential for success.
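
To make the hardware/firmware management responsibility above more concrete, the sketch below polls a server's BMC over the DMTF Redfish API to read power state, BIOS version, and overall health, the kind of data a lifecycle-automation service might collect before scheduling firmware updates or RMAs. This is a minimal illustration, not FluidStack's actual tooling: the resource path `/redfish/v1/Systems/1`, the credentials, and the disabled TLS verification are assumptions that vary by vendor and environment.

```python
import requests


def system_summary(bmc_host: str, user: str, password: str) -> dict:
    """Read basic lifecycle data from a BMC via Redfish (resource paths vary by vendor)."""
    # Many BMCs expose the first (or only) system at /redfish/v1/Systems/1;
    # in practice the Systems collection should be walked to discover members.
    url = f"https://{bmc_host}/redfish/v1/Systems/1"
    # verify=False is only acceptable for lab BMCs with self-signed certs;
    # production automation should validate certificates.
    resp = requests.get(url, auth=(user, password), verify=False, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    return {
        "power_state": data.get("PowerState"),
        "bios_version": data.get("BiosVersion"),
        "health": data.get("Status", {}).get("Health"),
    }


if __name__ == "__main__":
    # Hypothetical BMC address and credentials, for illustration only.
    print(system_summary("10.0.0.42", "admin", "changeme"))
```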

🎓 Skills & Qualifications

Education: Bachelor's degree in Computer Science, Engineering, or a related field. Relevant work experience may be considered in lieu of a degree.

Experience: 5+ years of experience in compute infrastructure engineering, with a focus on GPU/ASIC infrastructure and AI workloads.

Required Skills:

  • Strong knowledge of Linux systems administration and performance tuning.
  • Experience with bare metal provisioning tools (MaaS, Metal3, Tinkerbell, or other).
  • Familiarity with GPU hardware and workload optimization, especially kernel and driver level requirements.
  • Proficiency in automation tools (e.g., Ansible, Terraform).
  • Experience operating Kubernetes and SLURM clusters.

Preferred Skills:

  • Familiarity with AI workloads and optimization techniques.
  • Experience with hardware management services and firmware updates.
  • Knowledge of hardware vendor-specific tools and APIs.

📝 Enhancement Note: Candidates with hands-on experience in AI infrastructure and GPU workload optimization will have a competitive advantage in this role.

📊 Portfolio & Project Requirements

Portfolio Essentials:

  • Demonstrate experience in designing, deploying, and managing high-performance GPU compute infrastructure.
  • Showcase projects that involve troubleshooting complex GPU and compute system-related failures.
  • Highlight automation and scripting skills, with examples of automating server lifecycle tasks.
  • Include examples of collaborating with cross-functional teams to ensure cohesive infrastructure operations.

Technical Documentation:

  • Document system architecture, hardware specifications, and software configurations for your projects.
  • Include performance metrics, optimization techniques, and troubleshooting steps for GPU and compute systems.
  • Demonstrate understanding of AI workloads and optimization techniques through technical documentation.

📝 Enhancement Note: A strong portfolio demonstrates the ability to design, deploy, and manage high-performance GPU compute infrastructure for AI workloads, with concrete examples of cross-functional collaboration and automated server lifecycle management.

💵 Compensation & Benefits

Salary Range: $150,000 - $200,000 per year (based on experience and location)

Benefits:

  • Competitive total compensation package (salary + equity).
  • Retirement or pension plan, in line with local norms.
  • Health, dental, and vision insurance.
  • Generous PTO policy, in line with local norms.
  • FluidStack has offices in key hubs including NYC and SF.

📝 Enhancement Note: The salary range for this role is estimated based on market research for senior-level DevOps and infrastructure positions in San Francisco and New York. The actual salary may vary depending on the candidate's experience and skills.

🎯 Team & Company Context

🏢 Company Culture

Industry: AI Cloud Platform

Company Size: Small, highly motivated team focused on providing a world-class supercomputing experience.

Founded: 2021

Team Structure:

  • Small, highly motivated team focused on customer success.
  • Collaborative environment with a strong emphasis on customer focus and high standards.
  • Flat organizational structure with open communication and decision-making.

Development Methodology:

  • Agile development process with a focus on customer needs and continuous improvement.
  • Collaborative environment with regular team meetings and code reviews.
  • Strong emphasis on automation, testing, and quality assurance.

Company Website: www.fluidstack.io

📝 Enhancement Note: FluidStack is a small, highly motivated team focused on providing a world-class supercomputing experience for AI labs, governments, and enterprises. The company values customer focus, high standards, and open communication.

📈 Career & Growth Analysis

Career Level: Senior/Staff Infrastructure Engineer (Compute)

Reporting Structure: This role reports directly to the CTO and works closely with hardware and software teams.

Technical Impact: This role has a significant impact on the performance, scalability, and reliability of FluidStack's GPU clusters, which support AI workloads for top AI labs, governments, and enterprises.

Growth Opportunities:

  • Technical Leadership: Grow into a technical leadership role, mentoring junior engineers and driving infrastructure best practices.
  • Architecture Decisions: Influence architecture decisions, driving the design and deployment of high-performance GPU compute infrastructure.
  • Emerging Technologies: Stay up-to-date with emerging technologies in AI infrastructure and GPU optimization, driving innovation within the team.

📝 Enhancement Note: This role offers significant growth opportunities for candidates interested in technical leadership, architecture decisions, and driving innovation in AI infrastructure and GPU optimization.

🌐 Work Environment

Office Type: On-site, with offices in key hubs including NYC and SF.

Office Location(s): San Francisco, California, United States & New York, New York, United States

Workspace Context:

  • Collaborative workspace with a focus on customer success and high standards.
  • Flat organizational structure with open communication and decision-making.
  • Access to cutting-edge hardware and software tools for AI workloads.

Work Schedule: Standard full-time work schedule, with flexibility for deployment windows, maintenance, and project deadlines.

📝 Enhancement Note: FluidStack offers a collaborative work environment with a focus on customer success and high standards. The company provides access to cutting-edge hardware and software tools for AI workloads, with a flexible work schedule that accommodates deployment windows, maintenance, and project deadlines.

📄 Application & Technical Interview Process

Interview Process:

  • Phone Screen: A brief phone call to discuss the role, company, and candidate's background.
  • Technical Deep Dive: A detailed discussion of the candidate's experience with compute infrastructure engineering, GPU hardware, and AI workloads.
  • System Design: A system design exercise focused on designing high-performance GPU compute infrastructure for AI workloads.
  • Final Interview: A final interview with the CTO to discuss the candidate's fit within the team and company culture.

Portfolio Review Tips:

  • Highlight projects that demonstrate experience in designing, deploying, and managing high-performance GPU compute infrastructure.
  • Include examples of troubleshooting complex GPU and compute system-related failures, with a focus on optimization techniques and performance improvement.
  • Showcase automation and scripting skills, with examples of automating server lifecycle tasks and improving efficiency.
  • Include examples of collaborating with cross-functional teams to ensure cohesive infrastructure operations.

Technical Challenge Preparation:

  • Brush up on Linux systems administration and performance tuning skills.
  • Familiarize yourself with GPU hardware and workload optimization techniques, especially kernel and driver level requirements.
  • Prepare for system design exercises focused on high-performance GPU compute infrastructure for AI workloads.
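
As part of that preparation, it helps to be fluent with the quick GPU health checks used when triaging node failures. The sketch below shells out to `nvidia-smi` and parses per-GPU temperature, utilization, and uncorrected ECC error counts; the exact field names can be confirmed with `nvidia-smi --help-query-gpu`, and this is an illustrative example rather than anything specific to FluidStack's stack.

```python
import subprocess


def gpu_health() -> list[dict]:
    """Collect per-GPU temperature, utilization, and ECC counters via nvidia-smi."""
    fields = "index,name,temperature.gpu,utilization.gpu,ecc.errors.uncorrected.volatile.total"
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        idx, name, temp, util, ecc = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(idx),
            "name": name,
            "temp_c": int(temp),
            "util_pct": int(util),
            "ecc_uncorrected": ecc,  # may be "[N/A]" on GPUs without ECC enabled
        })
    return gpus


if __name__ == "__main__":
    for gpu in gpu_health():
        print(gpu)
```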

ATS Keywords: Linux, Systems Administration, Performance Tuning, Bare Metal Provisioning, GPU Hardware, AI Workloads, Automation Tools, Kubernetes, SLURM Clusters, Infrastructure Engineering, Compute Infrastructure, Hardware Management Services, AI Infrastructure, GPU Optimization

📝 Enhancement Note: The interview process centers on the candidate's experience with compute infrastructure engineering, GPU hardware, and AI workloads. Expect the portfolio review to probe how you designed, deployed, and managed high-performance GPU compute infrastructure, and focus your technical preparation on Linux systems administration, GPU hardware, and AI workload optimization.

🛠 Technology Stack & Infrastructure

Compute Infrastructure Technologies:

  • Linux (Ubuntu, CentOS)
  • Kubernetes
  • SLURM Clusters
  • Bare Metal Provisioning Tools (MaaS, Metal3, Tinkerbell, or other)
  • GPU Hardware (NVIDIA, AMD)
  • AI Workloads (TensorFlow, PyTorch, etc.)
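
As one concrete illustration of how GPU capacity on such a stack is typically consumed, the sketch below uses the official Kubernetes Python client to launch a single-GPU smoke-test pod that runs `nvidia-smi`. The `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed; the container image tag, namespace, and pod name are placeholders, not FluidStack specifics.

```python
from kubernetes import client, config


def launch_gpu_smoke_test(namespace: str = "default") -> None:
    """Create a one-shot pod that requests a single GPU and runs nvidia-smi."""
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="cuda-check",
                    image="nvidia/cuda:12.2.0-base-ubuntu22.04",  # illustrative tag
                    command=["nvidia-smi"],
                    # Exposed by the NVIDIA device plugin; AMD clusters use amd.com/gpu.
                    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)


if __name__ == "__main__":
    launch_gpu_smoke_test()
```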

Hardware Management Services:

  • Hardware Management Tools (e.g., IPMI, Redfish)
  • Firmware Updates
  • Hardware Vendor-Specific Tools and APIs
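
For out-of-band control where Redfish is unavailable, IPMI remains the lowest common denominator. The sketch below wraps `ipmitool` to check chassis power state over the BMC's LAN interface; the host, credentials, and interface are illustrative assumptions, and real tooling would pull credentials from a secrets store rather than passing them as arguments.

```python
import subprocess


def bmc_power_status(host: str, user: str, password: str) -> str:
    """Return the chassis power state reported by a BMC over IPMI-over-LAN."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password,
         "chassis", "power", "status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # e.g. "Chassis Power is on"


if __name__ == "__main__":
    # Hypothetical BMC address and credentials, for illustration only.
    print(bmc_power_status("10.0.0.42", "admin", "changeme"))
```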

Automation & Configuration Management Tools:

  • Ansible
  • Terraform
  • Puppet
  • Chef

📝 Enhancement Note: This role requires a strong background in compute infrastructure engineering, with a focus on GPU/ASIC infrastructure and AI workloads. Experience with Linux, Kubernetes, SLURM Clusters, and bare metal provisioning tools is essential for success in this position. Familiarity with hardware management services and automation tools is also highly desirable.

👥 Team Culture & Values

FluidStack Values:

  • Customer Focus: We put our customers first in everything we do, working hard to win repeated business and customer referrals.
  • High Standards: We expect ourselves and each other to care deeply about the work we do, the products we build, and the experience our customers have in every interaction with us.
  • Ownership: We take ownership from inception to delivery, approaching every problem with an open mind and a positive attitude.
  • Growth Mindset: We value effectiveness, competence, and a growth mindset, continuously learning and improving our skills and processes.

Collaboration Style:

  • Cross-Functional Integration: We collaborate with hardware and software teams to ensure cohesive infrastructure operations and support AI workloads.
  • Code Review Culture: We maintain a strong code review culture, with a focus on quality, performance, and maintainability.
  • Knowledge Sharing: We encourage knowledge sharing, technical mentoring, and continuous learning within the team.

📝 Enhancement Note: FluidStack values customer focus, high standards, ownership, and a growth mindset. The team encourages knowledge sharing, technical mentoring, and continuous learning, with a strong code review culture focused on quality, performance, and maintainability.

⚡ Challenges & Growth Opportunities

Technical Challenges:

  • High-Performance GPU Compute Infrastructure: Design and deploy high-performance GPU compute infrastructure that can support AI workloads at scale.
  • AI Workload Optimization: Optimize GPU hardware and workloads for maximum performance and efficiency, with a focus on AI-specific requirements.
  • Hardware Troubleshooting: Troubleshoot complex GPU and compute system-related failures, collaborating with hardware vendors for RMAs when necessary.
  • Automation & Efficiency: Automate deployment and management tasks to improve efficiency and reduce manual effort.
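
Supporting AI workloads at scale often means validating capacity through the batch scheduler; when that scheduler is SLURM, day-to-day checks frequently reduce to submitting small GPU jobs and inspecting the results. The sketch below submits a five-minute, single-GPU smoke test via `sbatch`; the partition name is a placeholder and GRES configuration differs per site.

```python
import subprocess


def submit_gpu_smoke_test(partition: str = "gpu") -> str:
    """Submit a short single-GPU job through sbatch and return its confirmation line."""
    result = subprocess.run(
        ["sbatch", f"--partition={partition}", "--gres=gpu:1",
         "--time=00:05:00", "--wrap", "nvidia-smi"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # e.g. "Submitted batch job 12345"


if __name__ == "__main__":
    print(submit_gpu_smoke_test())
```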

Learning & Development Opportunities:

  • Technical Skill Development: Stay up-to-date with emerging technologies in AI infrastructure and GPU optimization, driving innovation within the team.
  • Conference Attendance & Certification: Attend industry conferences and obtain relevant certifications to enhance your technical skills and knowledge.
  • Technical Mentoring: Mentor junior engineers and drive infrastructure best practices, contributing to the team's growth and success.

📝 Enhancement Note: This role offers significant technical challenges and growth opportunities for candidates interested in designing, deploying, and managing high-performance GPU compute infrastructure for AI workloads. The team encourages continuous learning and innovation, with a focus on emerging technologies in AI infrastructure and GPU optimization.

💡 Interview Preparation

Technical Questions:

  • Linux Systems Administration: Prepare for in-depth questions about Linux systems administration, performance tuning, and hardware management.
  • GPU Hardware & Workload Optimization: Brush up on your knowledge of GPU hardware, kernel and driver level requirements, and AI workload optimization techniques.
  • System Design: Prepare for system design exercises focused on designing high-performance GPU compute infrastructure for AI workloads.
  • Troubleshooting: Prepare for troubleshooting scenarios involving complex GPU and compute system-related failures.

Company & Culture Questions:

  • Customer Focus: Prepare for questions about FluidStack's customer focus and how you would ensure the success of our customers in this role.
  • High Standards: Prepare for questions about FluidStack's high standards and how you would maintain and improve the quality of our products and services.
  • Ownership & Growth Mindset: Prepare for questions about FluidStack's ownership and growth mindset, and how you would approach problems and challenges in this role.

Portfolio Presentation Strategy:

  • Project Walkthrough: Prepare a detailed walkthrough of your projects, highlighting your experience in designing, deploying, and managing high-performance GPU compute infrastructure.
  • Technical Deep Dive: Prepare for a deep dive into the technical aspects of your projects, with a focus on optimization techniques, performance improvement, and troubleshooting.
  • Collaboration & Teamwork: Prepare examples of collaborating with cross-functional teams to ensure cohesive infrastructure operations and support AI workloads.

📝 Enhancement Note: Expect in-depth questions on Linux systems administration, GPU hardware, and AI workloads, alongside system design exercises and troubleshooting scenarios. Structure your portfolio presentation around high-performance GPU compute infrastructure you have designed, deployed, and managed, emphasizing optimization techniques and measurable results.

📌 Application Steps

To apply for this Senior / Staff Infrastructure Engineer (Compute) position:

  • Submit your application through the application link provided.
  • Portfolio Customization: Tailor your portfolio to highlight your experience in designing, deploying, and managing high-performance GPU compute infrastructure, with a focus on AI workloads and optimization techniques.
  • Resume Optimization: Optimize your resume for infrastructure and DevOps roles, highlighting the projects and technical skills most relevant to this position.
  • Technical Interview Preparation: Prepare for the technical interview process, focusing on Linux systems administration, GPU hardware, and AI workloads. Brush up on your system design skills and prepare for troubleshooting scenarios.
  • Company Research: Research FluidStack's customer focus, high standards, ownership, and growth mindset, and prepare for questions about how you would contribute to the team's success in this role.

⚠️ Important Notice: This enhanced job description includes AI-generated insights and infrastructure/server administration industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.

Application Requirements

Candidates should have 5+ years of experience in compute infrastructure engineering and strong knowledge of Linux systems administration. Familiarity with GPU hardware and proficiency in automation tools are also required.