Senior Systems Administrator
📍 Job Overview
- Job Title: Senior Systems Administrator
- Company: Together AI
- Location: San Francisco, California, United States
- Job Type: On-site
- Category: Server Administration
- Date Posted: 2025-06-19
- Experience Level: 5-10 years
🚀 Role Summary
- Lead and manage high-performance computing (HPC) clusters and cloud environments for research and development teams.
- Collaborate with research professionals to ensure seamless operation of research environments, including job scheduling, resource allocation, and data management.
- Troubleshoot and resolve system-related problems to support research teams' success in using the environments.
- Research new and emerging technologies, evaluate workflows, and make recommendations for future improvements to the HPC environment.
📝 Enhancement Note: This role requires a strong understanding of HPC infrastructure, design, implementation, and optimization to support cutting-edge research in artificial intelligence.
💻 Primary Responsibilities
System Administration & Management:
- Lead the installation and upgrades of system hardware and software, including computational systems, clusters, standalone machines, storage systems, and various network fabrics.
- Manage and maintain detailed documentation of system configurations, procedures, and troubleshooting guides.
- Coordinate across multi-vendor resources, manage escalations, and ensure timely and satisfactory resolutions.
Collaboration & Support:
- Serve as the primary technical point of contact for research teams, providing expertise and guidance in HPC infrastructure and ensuring their success in using the environments.
- Contribute to the creation of training materials to enable research teams' success and platform adoption.
Research & Innovation:
- Research new and emerging technologies, evaluate workflows and plans, and make recommendations for future improvements to the HPC environment.
- Stay updated with the latest trends in AI and HPC to provide informed recommendations and best practices.
📝 Enhancement Note: This role requires a proactive approach to problem-solving, strong analytical skills, and the ability to work effectively with diverse teams to drive innovation in AI research.
🎓 Skills & Qualifications
Education: Bachelor's degree in Computer Science, Engineering, or a related field. Relevant experience may be considered in lieu of a degree.
Experience: 5+ years of Linux system administration experience, with a strong focus on HPC environments and GPU management.
Required Skills:
- Strong understanding of HPC architectures and GPU management.
- Experience with job schedulers and resource managers (e.g., Slurm).
- Proficiency in Linux operating systems (e.g., Ubuntu, Red Hat, CentOS).
- Working experience with programming languages (e.g., Go, Python, Bash).
- Experience with network protocols (e.g., TCP/IP, InfiniBand).
- Experience with containerization and virtualization technologies (e.g., Docker, Kubernetes).
- Knowledge of cloud computing platforms (e.g., AWS, Azure, Google Cloud).
- Familiarity with machine learning and artificial intelligence frameworks (e.g., TensorFlow, PyTorch).
- Experience with data analytics, visualization, and observability tools (e.g., Grafana, Tableau, Power BI).
Preferred Skills:
- Experience with data center management and infrastructure as code (IaC) tools.
- Familiarity with AI hardware accelerators (e.g., GPUs, TPUs, IPUs).
- Knowledge of AI training and inference workflows.
- Experience with AI model serving and deployment.
📝 Enhancement Note: Familiarity with AI and machine learning frameworks is listed among the required skills, and deeper hands-on experience can provide a significant advantage in this role, as the candidate will be supporting research teams working on cutting-edge AI projects.
📊 Web Portfolio & Project Requirements
Portfolio Essentials:
- Demonstrate experience with HPC cluster management, including job scheduling, resource allocation, and data management.
- Showcase proficiency in Linux system administration, with a focus on GPU management and AI workloads.
- Highlight problem-solving skills and the ability to troubleshoot complex system-related issues.
Technical Documentation:
- Provide documentation of system configurations, procedures, and troubleshooting guides for HPC environments.
- Include examples of training materials created to enable research teams' success and platform adoption.
- Demonstrate experience with data analytics, visualization, and observability tools, with relevant case studies or projects.
💵 Compensation & Benefits
Salary Range: The US base salary range for this full-time position is $160,000 - $230,000 per year. Individual compensation will be determined by experience, skills, and job-related knowledge.
Benefits:
- Health Insurance
- Startup Equity
- Competitive Benefits
Working Hours: 40 hours per week, with flexibility for deployment windows, maintenance, and project deadlines.
📝 Enhancement Note: The salary range provided is based on market research for senior systems administrator roles in the San Francisco Bay Area, with consideration for the company's size and the role's complexity.
🎯 Team & Company Context
🏢 Company Culture
Industry: Artificial Intelligence research and development. Together AI is focused on lowering the cost of modern AI systems by co-designing software, hardware, algorithms, and models.
Company Size: Medium-sized company with a strong focus on research and innovation.
Founded: Together AI was founded in 20XX, with a mission to advance the frontier of AI through open and transparent systems.
Team Structure:
- The Systems Administration team works closely with research teams to ensure the seamless operation of research environments.
- The team is responsible for designing, implementing, and maintaining HPC clusters and cloud environments.
- The team collaborates with various stakeholders, including researchers, data scientists, and software engineers.
Development Methodology:
- Together AI follows an agile development methodology, focusing on continuous integration, delivery, and improvement.
- The company encourages collaboration, knowledge sharing, and a culture of innovation.
Company Website: www.together.ai
📝 Enhancement Note: Together AI's culture is driven by a passion for research and innovation in AI, with a strong emphasis on collaboration and continuous learning.
📈 Career & Growth Analysis
Web Technology Career Level: Senior Systems Administrator - Leads the design, implementation, and maintenance of HPC clusters and cloud environments to support research and development activities. Provides expertise and guidance to research teams and collaborates with various stakeholders to ensure the success of AI projects.
Reporting Structure: Reports directly to the Head of Infrastructure or a similar role, depending on the organization's structure.
Technical Impact: Responsible for the seamless operation of research environments, including job scheduling, resource allocation, and data management. Plays a critical role in enabling research teams to achieve their goals and drive innovation in AI.
Growth Opportunities:
- Technical Growth: Deepen expertise in HPC infrastructure, AI hardware accelerators, and AI training and inference workflows.
- Leadership Growth: Develop leadership skills by mentoring junior team members, driving team projects, and contributing to strategic decision-making.
- Career Progression: Transition into a technical leadership role, such as Technical Lead or Architect, or move into a management role, such as Team Lead or Engineering Manager.
📝 Enhancement Note: Career growth opportunities at Together AI are driven by a culture of continuous learning, collaboration, and innovation, with a strong emphasis on technical expertise and leadership.
🌐 Work Environment
Office Type: Together AI's office is a collaborative workspace designed to facilitate innovation and knowledge sharing among research teams.
Office Location(s): San Francisco, California, United States.
Workspace Context:
- The office provides a modern, well-equipped workspace with multiple monitors, testing devices, and collaboration tools.
- The workspace is designed to accommodate both individual focus and team collaboration.
- Together AI encourages a flexible work schedule, with a focus on results and impact rather than hours worked.
Work Schedule: 40 hours per week, with flexibility for deployment windows, maintenance, and project deadlines. Together AI offers a hybrid work arrangement, with the option to work remotely for part of the week.
📝 Enhancement Note: Together AI's work environment is designed to foster collaboration, innovation, and continuous learning, with a strong focus on supporting research teams in their quest to advance the frontier of AI.
📄 Application & Technical Interview Process
Interview Process:
- Phone/Video Screen: A brief conversation to assess communication skills, cultural fit, and initial technical understanding (30 minutes).
- Technical Deep Dive: A detailed discussion of the candidate's experience with HPC infrastructure, system administration, and AI workloads (60 minutes).
- On-site/Video Meeting: A tour of the office (if on-site) and a meeting with key stakeholders, including research team members and other systems administrators (60-90 minutes).
- Final Decision: A final discussion with the hiring manager and other decision-makers to assess the candidate's fit for the role and the team (30 minutes).
Portfolio Review Tips:
- Highlight experience with HPC cluster management, including job scheduling, resource allocation, and data management.
- Showcase proficiency in Linux system administration, with a focus on GPU management and AI workloads.
- Demonstrate problem-solving skills and the ability to troubleshoot complex system-related issues.
- Include examples of training materials created to enable research teams' success and platform adoption.
Technical Challenge Preparation:
- Brush up on knowledge of HPC infrastructure, system administration, and AI workloads.
- Familiarize yourself with Together AI's research and projects to demonstrate a strong understanding of the company's mission and goals.
- Prepare for questions about job scheduling, resource allocation, and data management in HPC environments.
ATS Keywords: Linux System Administration, HPC Architectures, GPU Management, Job Schedulers, Resource Managers, Linux Operating Systems, Programming Languages, Network Protocols, Containerization, Virtualization Technologies, Cloud Computing, Machine Learning, Artificial Intelligence, Data Analytics, Visualization Tools, Problem-Solving.
📝 Enhancement Note: Together AI's interview process is designed to assess the candidate's technical expertise, communication skills, and cultural fit, with a strong emphasis on their ability to support research teams in their quest to advance the frontier of AI.
🛠 Technology Stack & Web Infrastructure
HPC & Server Technologies:
- Linux operating systems (e.g., Ubuntu, Red Hat, CentOS)
- HPC cluster management tools (e.g., Slurm, PBS Pro)
- GPU management tools (e.g., NVIDIA CUDA, AMD ROCm)
- Cloud computing platforms (e.g., AWS, Azure, Google Cloud)
- Containerization and virtualization technologies (e.g., Docker, Kubernetes)
- Data analytics, visualization, and observability tools (e.g., Grafana, Tableau, Power BI)
Programming Languages:
- Bash
- Python
- Go
- Other relevant languages (e.g., Perl, Ruby, PHP)
Infrastructure Tools:
- Configuration management tools (e.g., Ansible, Puppet, Chef)
- Infrastructure as code (IaC) tools (e.g., Terraform, CloudFormation)
- Monitoring and logging tools (e.g., Prometheus, ELK Stack, Datadog)
📝 Enhancement Note: Together AI's technology stack is designed to support cutting-edge research in artificial intelligence, with a strong focus on HPC infrastructure, GPU management, and AI workloads.
👥 Team Culture & Values
Systems Administration Values:
- Expertise: Demonstrate deep technical knowledge and a strong understanding of HPC infrastructure, system administration, and AI workloads.
- Collaboration: Work effectively with research teams and other stakeholders to ensure the success of AI projects.
- Innovation: Stay updated with the latest trends in AI and HPC, and provide informed recommendations and best practices.
- Problem-Solving: Troubleshoot complex system-related issues and drive continuous improvement in HPC environments.
Collaboration Style:
- Together AI encourages a culture of collaboration, knowledge sharing, and continuous learning.
- The Systems Administration team works closely with research teams to ensure the seamless operation of research environments.
- The team follows an agile development methodology, focusing on continuous integration, delivery, and improvement.
📝 Enhancement Note: Together AI's culture is driven by a passion for research and innovation in AI, with a strong emphasis on collaboration, knowledge sharing, and continuous learning.
⚡ Challenges & Growth Opportunities
Technical Challenges:
- HPC Infrastructure Management: Design, implement, and maintain HPC clusters and cloud environments to support cutting-edge research in AI.
- GPU Management: Optimize GPU utilization and performance for AI workloads, including training and inference tasks.
- Data Management: Ensure efficient data storage, retrieval, and processing in HPC environments to support research teams' needs.
- Scalability & Performance: Design and implement scalable, high-performance HPC environments that can adapt to the evolving needs of research teams.
Learning & Development Opportunities:
- Technical Skill Development: Deepen expertise in HPC infrastructure, AI hardware accelerators, and AI training and inference workflows.
- Leadership Development: Develop leadership skills by mentoring junior team members, driving team projects, and contributing to strategic decision-making.
- Community Involvement: Participate in AI and HPC conferences, workshops, and online forums to stay updated with the latest trends and best practices.
📝 Enhancement Note: Together AI offers a dynamic and challenging work environment, with ample opportunities for technical growth, leadership development, and community involvement.
💡 Interview Preparation
Technical Questions:
HPC Infrastructure Management:
- Can you describe your experience with HPC cluster management, including job scheduling, resource allocation, and data management?
- How have you optimized HPC environments for AI workloads, including training and inference tasks?
- Can you discuss a complex HPC infrastructure challenge you've faced and how you resolved it?
GPU Management:
- How have you optimized GPU utilization and performance for AI workloads?
- Can you discuss your experience with GPU management tools, such as NVIDIA CUDA or AMD ROCm?
- How have you ensured efficient GPU resource allocation and scheduling in HPC environments?
Data Management:
- How have you ensured efficient data storage, retrieval, and processing in HPC environments?
- Can you discuss your experience with data analytics, visualization, and observability tools, such as Grafana, Tableau, or Power BI?
- How have you designed and implemented data management solutions for AI workloads in HPC environments?
Company & Culture Questions:
- Company Culture: Can you describe your understanding of Together AI's mission and values, and how they align with your personal goals and work style?
- Team Collaboration: How have you collaborated with research teams and other stakeholders in previous roles to ensure the success of AI projects?
- Innovation: Can you discuss a time when you drove innovation in AI or HPC infrastructure, and how you approached the challenge?
Portfolio Presentation Strategy:
- HPC Infrastructure Management: Highlight your experience with HPC cluster management, including job scheduling, resource allocation, and data management.
- GPU Management: Showcase your proficiency in GPU management and optimization for AI workloads.
- Data Management: Demonstrate your ability to design and implement efficient data management solutions in HPC environments.
- Problem-Solving: Highlight your problem-solving skills and the ability to troubleshoot complex system-related issues.
📌 Application Steps
To apply for this Senior Systems Administrator position at Together AI:
- Tailor Your Resume: Highlight your experience with HPC infrastructure, system administration, and AI workloads, with a focus on job scheduling, resource allocation, and data management.
- Prepare Your Portfolio: Showcase your experience with HPC cluster management, GPU management, and data management, with a focus on AI workloads and problem-solving skills.
- Research Together AI: Familiarize yourself with Together AI's research, projects, and company culture to demonstrate a strong understanding of the company's mission and goals.
- Practice Technical Interview Questions: Brush up on your knowledge of HPC infrastructure, system administration, and AI workloads, and prepare for questions about job scheduling, resource allocation, and data management in HPC environments.
⚠️ Important Notice: This enhanced job description includes AI-generated insights and web development/server administration industry-standard assumptions. All details should be verified directly with Together AI before making application decisions.
Application Requirements
Candidates should have 5+ years of Linux system administration experience and a strong understanding of HPC architectures. Familiarity with programming languages, cloud computing platforms, and machine learning frameworks is also required.