Senior Manager, Technical Program Management - DGX Cloud

NVIDIA
Full-time | $232k-$368k/year (USD) | Santa Clara, United States

📍 Job Overview

  • Job Title: Senior Manager, Technical Program Management - DGX Cloud
  • Company: NVIDIA
  • Location: Santa Clara, California, United States
  • Job Type: Full-Time, Hybrid
  • Category: Technical Program Management, AI Infrastructure
  • Date Posted: 2025-07-21
  • Experience Level: 15+ years

🚀 Role Summary

  • Lead and scale a high-performing team focused on delivering a world-class AI platform for over 1,000 NVIDIA researchers.
  • Drive sophisticated, cross-functional programs involving compute platforms, cluster bring-ups, and infrastructure metrics tracking across the global DGX Cloud fleet.
  • Collaborate with NVIDIA Research and DGXC Engineering to ensure a resilient, high-performance infrastructure for AI training and inference.

📝 Enhancement Note: This role requires a strong background in AI/ML infrastructure, program management, and cross-functional team leadership to drive impact and ensure the success of the DGX Cloud platform.

💻 Primary Responsibilities

  • Team Leadership: Lead, mentor, and grow a team of Technical Program Managers focused on delivering a world-class AI platform that empowers NVIDIA researchers.
  • AI Platform Development: Drive the development of resilient, high-performance infrastructure for AI training and inference, setting industry standards in productivity, performance, and global impact.
  • Capacity Management: Apply a deep understanding of Slurm architecture, configuration, workload management, and job prioritization policies to drive capacity management and allocation processes (see the priority sketch after this list).
  • Cluster Bring-ups & Integration: Lead end-to-end cluster bring-ups and integration with MLOps stacks, including deployments across hyperscaler environments such as OCI, GCP, and others.
  • Capacity Modeling & Demand Forecasting: Own capacity modeling, demand forecasting, and supply-demand balancing to ensure optimal resource utilization and fleet efficiency.
  • Program Governance & Risk Management: Establish and enforce best-in-class program governance, roadmap planning, and risk management processes to ensure project success and accountability.
  • Stakeholder Communication: Develop and execute a communication strategy that keeps stakeholders informed about program progress, blockers, and impact at all levels.
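
Capacity management on Slurm largely comes down to governing how the scheduler ranks competing jobs. As a rough illustration only (the factor set mirrors Slurm's multifactor priority plugin, but the weight values are assumptions, not NVIDIA's production configuration), here is a minimal sketch of how weighted factors combine into a single priority score:

```python
# Minimal sketch: how Slurm's multifactor priority plugin combines weighted,
# normalized factors into one job priority score. Weight values are
# illustrative, not a real production configuration.

# Illustrative weights (set in slurm.conf via the PriorityWeight* parameters)
WEIGHTS = {
    "age": 1_000,         # how long the job has waited in the queue
    "fairshare": 10_000,  # boosts accounts that are under their share
    "job_size": 500,      # can favor larger (or smaller) jobs
    "qos": 5_000,         # QOS tier, e.g. interactive vs. batch
}

def job_priority(factors: dict) -> int:
    """Slurm normalizes each factor to [0.0, 1.0]; the weighted sum
    determines scheduling order among pending jobs."""
    return int(sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS))

# Example: a long-waiting job from an account with a strong fairshare position
print(job_priority({"age": 0.8, "fairshare": 0.9, "job_size": 0.2, "qos": 0.5}))
```

In practice these knobs live in slurm.conf; the program-management work is deciding how they should trade researcher priorities against fleet utilization, and tracking whether the resulting allocations match the agreed capacity plan.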

🎓 Skills & Qualifications

Education: Bachelor's degree in Computer Science, a related technical field, or equivalent experience.

Experience: 15+ years of program management experience, including 8+ years of managing a team. Proven track record in delivering AI/ML infrastructure programs at scale.

Required Skills:

  • Proven experience in driving AI/ML infrastructure programs at scale, with a deep understanding of system architecture and cluster deployments.
  • Strong grasp of capacity modeling, forecasting techniques, and demand/supply reconciliation in compute environments.
  • Proficiency with Grafana, Prometheus, or scheduler-native tooling to monitor job efficiency, wait times, and node health (a query sketch follows this list).
  • Solid communication and leadership skills to work effectively with multi-functional teams and coordinate across organizational boundaries and geographies.
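
To make the monitoring expectation concrete, here is a minimal sketch of pulling one fleet-health signal from Prometheus over its HTTP API. The server URL is hypothetical, and the DCGM metric and label names are assumptions that depend on which exporters the fleet actually runs:

```python
# Minimal sketch: query a Prometheus server for average GPU utilization per
# node over the last hour. The endpoint and metric/label names are assumed;
# adjust them to the exporters deployed on your clusters.
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint

def instant_query(promql: str) -> list:
    """Run an instant PromQL query and return the raw result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# DCGM exporter metric, averaged per node; the grouping label may be
# "Hostname" or "instance" depending on exporter configuration.
for series in instant_query("avg by (node) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]))"):
    node = series["metric"].get("node", "unknown")
    util = float(series["value"][1])
    print(f"{node}: {util:.1f}% average GPU utilization")
```

The same queries typically back the Grafana dashboards used to report job efficiency, wait times, and node health to stakeholders.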

Preferred Skills:

  • Solid understanding of cloud technologies.
  • Experience with new product introduction and program managing research teams.
  • Background with productivity tools and process automation.

📊 Portfolio & Project Requirements

Portfolio Essentials:

  • Demonstrate experience in leading large-scale AI/ML infrastructure programs, with a focus on cluster management, resource allocation, and fleet optimization.
  • Showcase successful capacity management strategies and demand forecasting techniques to improve resource utilization and reduce idle waste.
  • Highlight experience with Slurm and other workload management systems, including job prioritization policies and hybrid scheduling architectures.

Technical Documentation:

  • Provide case studies or examples of successful AI/ML infrastructure projects, including cluster bring-ups, integration with MLOps stacks, and deployment across hyperscaler environments.
  • Document capacity modeling and demand forecasting processes, along with the tools and methodologies used to track fleet efficiency metrics (a metric sketch follows this list).
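
For the fleet-efficiency documentation above, it helps to show the arithmetic behind the headline numbers. A minimal sketch, using illustrative field names and synthetic figures rather than real accounting data:

```python
# Minimal sketch: the kind of fleet-efficiency arithmetic a capacity report
# documents. Field names and numbers are illustrative only; real inputs would
# come from scheduler accounting (e.g. Slurm's sacct) and telemetry.
from dataclasses import dataclass

@dataclass
class ClusterWindow:
    name: str
    gpus: int
    hours: float                 # length of the reporting window
    allocated_gpu_hours: float   # GPU-hours handed to jobs by the scheduler
    busy_gpu_hours: float        # GPU-hours where GPUs were actually active

def report(windows: list) -> None:
    for w in windows:
        available = w.gpus * w.hours
        allocation_rate = w.allocated_gpu_hours / available    # supply handed out
        occupancy = w.busy_gpu_hours / w.allocated_gpu_hours   # handed-out supply used well
        idle_waste = 1.0 - (w.busy_gpu_hours / available)
        print(f"{w.name}: allocation {allocation_rate:.0%}, "
              f"occupancy {occupancy:.0%}, idle waste {idle_waste:.0%}")

report([ClusterWindow("oci-h100-pod1", gpus=1024, hours=168,
                      allocated_gpu_hours=150_000, busy_gpu_hours=120_000)])
```

Splitting allocation rate (did the scheduler hand out the supply?) from occupancy (did jobs use what they were given?) lets a capacity report distinguish scheduling problems from workload problems.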

📝 Enhancement Note: As this role focuses on AI/ML infrastructure and program management, a strong portfolio showcasing relevant experience and achievements in these areas will be crucial for success.

💵 Compensation & Benefits

Salary Range: $232,000 - $368,000 USD per year. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

Benefits:

  • Equity, plus comprehensive health, financial, and lifestyle benefits.

Working Hours: Full-time, with a hybrid work arrangement (on-site and remote work).

📝 Enhancement Note: The provided salary range is based on NVIDIA's job posting and is subject to regional adjustments and individual negotiation. The benefits package is comprehensive and includes equity and various health, financial, and lifestyle benefits.

🎯 Team & Company Context

🏢 Company Culture

Industry: NVIDIA operates in the technology industry, specializing in AI, graphics, and high-performance computing. The company's focus on AI and machine learning makes this role an excellent fit for professionals interested in driving innovation in these areas.

Company Size: NVIDIA is a large, multinational corporation with a significant global presence. This size offers opportunities for career growth, collaboration with diverse teams, and exposure to cutting-edge technologies.

Founded: NVIDIA was founded in 1993 and has since grown to become a leading innovator in AI, graphics, and high-performance computing.

Team Structure:

  • The DGX Cloud team is part of the broader NVIDIA organization, working closely with NVIDIA Research and DGXC Engineering.
  • The team consists of Technical Program Managers, Infrastructure Engineers, and other specialists focused on delivering world-class AI infrastructure.
  • The role reports directly to the DGX Cloud leadership and collaborates with various teams across the organization.

Development Methodology:

  • NVIDIA follows Agile methodologies for software development, with a focus on iterative improvement and customer-centric design.
  • The company emphasizes cross-functional collaboration, data-driven decision-making, and a culture of innovation.

Company Website: NVIDIA (https://www.nvidia.com)

📝 Enhancement Note: NVIDIA's culture is characterized by a strong focus on innovation, collaboration, and customer-centric design. This role will require a candidate who thrives in a dynamic, fast-paced environment and is passionate about driving AI infrastructure development.

📈 Career & Growth Analysis

Career Level: This role is a senior management position within the AI infrastructure domain. It requires a high level of expertise in AI/ML infrastructure, program management, and team leadership. The ideal candidate will have a proven track record of driving large-scale infrastructure programs and managing high-performing teams.

Reporting Structure: The Senior Manager, Technical Program Management - DGX Cloud reports directly to the DGX Cloud leadership and collaborates with various teams across the organization, including NVIDIA Research and DGXC Engineering.

Technical Impact: This role has a significant technical impact on NVIDIA's AI infrastructure, directly influencing the performance, scalability, and usability of the DGX Cloud platform. The successful candidate will drive innovation in AI infrastructure and set industry standards in productivity, performance, and global impact.

Growth Opportunities:

  • Team Expansion: As the team grows, there will be opportunities for the candidate to mentor and develop junior team members, fostering a culture of continuous learning and growth.
  • Technical Leadership: The candidate may have the opportunity to take on more technical leadership roles within the organization, driving innovation in AI infrastructure and setting industry standards.
  • Cross-Functional Collaboration: Working closely with NVIDIA Research and DGXC Engineering, the candidate will have the chance to expand their knowledge and skills in AI/ML and hardware development.

📝 Enhancement Note: This role offers significant growth potential for the right candidate, with opportunities to expand their technical expertise, develop their leadership skills, and drive innovation in AI infrastructure.

🌐 Work Environment

Office Type: NVIDIA's Santa Clara office is a modern, collaborative workspace designed to foster innovation and creativity. The company offers a mix of open-plan and private workspaces, along with various amenities to support employee well-being and productivity.

Office Location(s): Santa Clara, California, United States. NVIDIA has multiple office locations worldwide, offering flexibility for employees to work remotely or on-site.

Workspace Context:

  • Collaboration: The office encourages cross-functional collaboration, with dedicated spaces for team meetings, brainstorming sessions, and social events.
  • Technology: NVIDIA provides state-of-the-art technology and tools to support employee productivity, including high-performance workstations, multiple monitors, and testing devices.
  • Work-Life Balance: NVIDIA offers flexible work arrangements, including hybrid and remote work options, to support a healthy work-life balance.

Work Schedule: Full-time, with a hybrid work arrangement (on-site and remote work). The working hours are typically Monday to Friday, with some flexibility for project deadlines and maintenance windows.

📝 Enhancement Note: NVIDIA's work environment is designed to support collaboration, innovation, and employee well-being. The hybrid work arrangement offers flexibility for employees to balance their personal and professional lives while maintaining a strong connection to the team and the organization.

📄 Application & Technical Interview Process

Interview Process:

  1. Phone Screen: A brief phone call to discuss your background, experience, and motivation for the role. Be prepared to share examples of your experience in AI/ML infrastructure, program management, and team leadership.
  2. Technical Deep Dive: A detailed discussion of your technical expertise in AI/ML infrastructure, capacity management, and demand forecasting. Be prepared to walk through your portfolio and provide specific examples of your achievements in these areas.
  3. Behavioral & Cultural Fit: An assessment of your leadership style, communication skills, and cultural fit within the NVIDIA organization. Be prepared to share examples of your experience working with cross-functional teams and driving impact in a dynamic, fast-paced environment.
  4. Final Interview: A meeting with NVIDIA leadership to discuss your fit for the role, answer any remaining questions, and make a final decision.

Portfolio Review Tips:

  • Highlight your experience in AI/ML infrastructure, capacity management, and demand forecasting.
  • Include specific examples of your achievements in driving large-scale infrastructure projects and improving resource utilization.
  • Showcase your ability to work effectively with cross-functional teams and drive impact in a dynamic, fast-paced environment.

Technical Challenge Preparation:

  • Brush up on your knowledge of AI/ML infrastructure, capacity management, and demand forecasting techniques.
  • Prepare specific examples of your experience in these areas, focusing on your ability to drive impact and improve resource utilization.
  • Familiarize yourself with NVIDIA's products, services, and company culture to demonstrate your enthusiasm for the role and the organization.

ATS Keywords:

  • Program Management: Technical Program Management, Agile, Scrum, Roadmap Planning, Risk Management, Stakeholder Communication, Cross-Functional Collaboration
  • AI/ML Infrastructure: AI Infrastructure, Cluster Management, Capacity Management, Demand Forecasting, Fleet Optimization, MLOps, GPU Resource Management, Slurm, Prometheus, Grafana
  • Leadership: Team Leadership, Mentoring, Cross-Functional Teamwork, Strategic Planning, Change Management, Operational Excellence
  • Cloud Technologies: Hyperscaler Environments, OCI, GCP, AWS, Azure, Infrastructure as Code

📝 Enhancement Note: NVIDIA's interview process is designed to assess your technical expertise, leadership skills, and cultural fit within the organization. By preparing specific examples of your experience and achievements in AI/ML infrastructure, program management, and team leadership, you can demonstrate your qualifications for the role and increase your chances of success.

🛠 Technology Stack & Infrastructure

AI/ML Infrastructure:

  • Cluster Management: Slurm, Kubernetes, Mesos, YARN
  • Capacity Management: Prometheus, Grafana, New Relic, Datadog
  • Demand Forecasting: Prophet, ARIMA, LSTM, PyCaret (a forecasting sketch follows this list)
  • MLOps: MLflow, Kubeflow, TensorFlow Extended, AWS SageMaker, Azure ML
  • GPU Resource Management: NVIDIA CUDA, NVIDIA DGX systems, Multi-Instance GPU (MIG), NVIDIA GPU Operator
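
As a sketch of the forecasting these tools support (synthetic data and an arbitrary model order; a real forecast would be fit on scheduler accounting history):

```python
# Minimal sketch: a toy ARIMA forecast of weekly GPU-hour demand. The series
# below is synthetic; in practice it would come from accounting exports.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# 52 weeks of synthetic demand with mild growth and a periodic bump
history = pd.Series(
    [100_000 + 800 * week + (5_000 if week % 4 == 0 else 0) for week in range(52)],
    index=pd.date_range("2024-01-01", periods=52, freq="W"),
)

model = ARIMA(history, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=8)  # next 8 weeks of expected GPU-hours
print(forecast.round(0))
```

Prophet follows the same fit-then-predict pattern; the harder program-management work is reconciling whichever forecast is produced with committed supply across regions and hyperscalers.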

Cloud Technologies:

  • Hyperscaler Environments: OCI, GCP, AWS, Azure
  • Infrastructure as Code: Terraform, CloudFormation, Azure Resource Manager
  • Containerization: Docker, Kubernetes, Singularity (a GPU-capacity sketch follows this list)
  • Serverless: AWS Lambda, Azure Functions, Google Cloud Functions
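
Where clusters are Kubernetes-based, fleet GPU capacity can be read straight from the cluster API. A minimal sketch with the official Python client, assuming kubeconfig access and the standard nvidia.com/gpu extended resource advertised by the NVIDIA device plugin:

```python
# Minimal sketch: sum allocatable GPUs across a Kubernetes cluster's nodes.
# Assumes a reachable kubeconfig and the nvidia.com/gpu extended resource
# exposed by the NVIDIA device plugin.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

total_gpus = 0
for node in v1.list_node().items:
    gpus = int(node.status.allocatable.get("nvidia.com/gpu", "0"))
    total_gpus += gpus
    print(f"{node.metadata.name}: {gpus} allocatable GPUs")

print(f"fleet total: {total_gpus} GPUs")
```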

📝 Enhancement Note: NVIDIA's technology stack is focused on AI/ML infrastructure, cloud technologies, and GPU resource management. Familiarity with these technologies and relevant tools will be crucial for success in this role.

👥 Team Culture & Values

AI/ML Infrastructure Values:

  • Customer Obsessed: Focus on delivering world-class AI infrastructure that empowers NVIDIA researchers and drives innovation in AI/ML.
  • Innovation: Embrace a culture of continuous learning and improvement, driving innovation in AI/ML infrastructure and setting industry standards.
  • Collaboration: Foster a collaborative work environment that encourages cross-functional teamwork and knowledge sharing.
  • Accountability: Hold yourself and your team accountable for delivering high-quality AI infrastructure that meets the needs of NVIDIA researchers.

Collaboration Style:

  • Cross-Functional Integration: Work closely with NVIDIA Research and DGXC Engineering to ensure a seamless and efficient AI infrastructure experience for NVIDIA researchers.
  • Code Review Culture: Encourage a culture of design and code review and pair programming to ensure high-quality AI infrastructure and knowledge sharing.
  • Knowledge Sharing: Foster a culture of knowledge sharing, technical mentoring, and continuous learning to drive team growth and development.

📝 Enhancement Note: NVIDIA's team culture is characterized by a strong focus on innovation, collaboration, and accountability. The ideal candidate will thrive in this environment and be passionate about driving AI/ML infrastructure development and delivering world-class AI experiences for NVIDIA researchers.

⚡ Challenges & Growth Opportunities

Technical Challenges:

  • AI/ML Infrastructure: Stay up-to-date with the latest AI/ML infrastructure trends and technologies, and drive innovation in this rapidly evolving field.
  • Capacity Management: Develop and implement sophisticated capacity management strategies to optimize resource utilization and reduce idle waste in AI/ML infrastructure.
  • Demand Forecasting: Continuously refine and improve demand forecasting techniques to ensure optimal resource allocation and fleet efficiency in AI/ML infrastructure.
  • MLOps: Collaborate with NVIDIA Research and DGXC Engineering to integrate MLOps pipelines and ensure seamless user experiences for NVIDIA researchers.

Learning & Development Opportunities:

  • AI/ML Infrastructure: Attend industry conferences, webinars, and workshops to stay current with the latest trends and best practices in AI/ML infrastructure.
  • Cloud Technologies: Expand your knowledge of cloud technologies, including hyperscaler environments, infrastructure as code, and serverless architectures.
  • Leadership Development: Participate in leadership development programs and workshops to enhance your skills in team management, strategic planning, and change management.
  • Technical Mentoring: Provide technical mentoring and guidance to junior team members, fostering a culture of continuous learning and growth within the team.

📝 Enhancement Note: NVIDIA's technical challenges and learning opportunities offer significant growth potential for the right candidate, with the chance to drive innovation in AI/ML infrastructure, expand their technical expertise, and develop their leadership skills.

💡 Interview Preparation

Technical Questions:

  • AI/ML Infrastructure: Describe your experience with AI/ML infrastructure, cluster management, and capacity management. Provide specific examples of your achievements in these areas.
  • Demand Forecasting: Walk through your process for demand forecasting, including the tools and methodologies you've used to optimize resource allocation and fleet efficiency in AI/ML infrastructure.
  • MLOps: Explain your experience with MLOps pipelines and how you've integrated them with AI/ML infrastructure to ensure seamless user experiences for NVIDIA researchers.

Company & Culture Questions:

  • NVIDIA Culture: Describe your understanding of NVIDIA's culture and how your leadership style aligns with the company's values.
  • Cross-Functional Collaboration: Explain your experience working with cross-functional teams and how you've driven impact in a dynamic, fast-paced environment.
  • AI/ML Infrastructure Strategy: Outline your vision for AI/ML infrastructure at NVIDIA, including your plans for driving innovation, optimizing resource utilization, and improving user experiences for NVIDIA researchers.

Portfolio Presentation Strategy:

  • AI/ML Infrastructure: Highlight your experience with AI/ML infrastructure, cluster management, and capacity management. Include specific examples of your achievements in these areas and how they demonstrate your qualifications for the role.
  • Demand Forecasting: Showcase your demand forecasting techniques and how they've improved resource allocation and fleet efficiency in AI/ML infrastructure. Include any relevant data or visualizations to support your presentation.
  • MLOps: Demonstrate your experience with MLOps pipelines and how you've integrated them with AI/ML infrastructure to ensure seamless user experiences for NVIDIA researchers. Include any relevant case studies or success stories to support your presentation.

📝 Enhancement Note: Prepare quantified examples for each of these areas, such as improvements in utilization, wait times, or cluster bring-up timelines, so your answers demonstrate measurable impact rather than general familiarity.

📌 Application Steps

To apply for this Senior Manager, Technical Program Management - DGX Cloud position at NVIDIA:

  1. Tailor Your Resume: Highlight your experience in AI/ML infrastructure, program management, and team leadership. Include specific examples of your achievements in these areas and how they demonstrate your qualifications for the role.
  2. Prepare Your Portfolio: Showcase your experience with AI/ML infrastructure, cluster management, and capacity management. Include specific examples of your achievements in these areas and how they demonstrate your qualifications for the role.
  3. Research NVIDIA: Familiarize yourself with NVIDIA's products, services, and company culture to demonstrate your enthusiasm for the role and the organization.
  4. Practice Interview Questions: Prepare for technical and behavioral interview questions by reviewing the "💡 Interview Preparation" section of this document. Practice your responses and refine your presentation strategies to ensure a strong performance in the interview.

📝 Enhancement Note: NVIDIA's application process is designed to assess your technical expertise, leadership skills, and cultural fit within the organization. By tailoring your resume, preparing your portfolio, and researching the company, you can demonstrate your qualifications for the role and increase your chances of success.


Application Requirements

15+ years of program management experience, including 8+ years managing a team. Proven track record in delivering AI/ML infrastructure programs at scale with a strong grasp of capacity modeling and forecasting techniques.