Big Data Engineer (PySpark & Apache Iceberg) - C12 – AVP - Chennai
📍 Job Overview
- Job Title: Big Data Engineer (PySpark & Apache Iceberg) - C12 – AVP - Chennai
- Company: Citi
- Location: Chennai, Tamil Nadu, India
- Job Type: Full-time
- Category: Data Engineering
- Date Posted: July 31, 2025
- Experience Level: 10+ years
- Remote Status: On-site
🚀 Role Summary
- Design, develop, and maintain big data pipelines using PySpark and Apache Iceberg.
- Collaborate with data scientists and analysts to deliver actionable insights.
- Ensure data quality, consistency, and security across all pipelines.
- Optimize data workflows for performance and scalability.
- Monitor and troubleshoot data pipeline issues in production.
📝 Enhancement Note: This role requires a strong focus on data processing, transformation, and optimization. The candidate should be comfortable working with large datasets and have a solid understanding of big data technologies.
💻 Primary Responsibilities
- Pipeline Development: Design, develop, and maintain big data pipelines using PySpark and Apache Iceberg to process large datasets efficiently.
- Data Transformation: Implement data transformation, cleansing, and enrichment processes to ensure data quality and consistency.
- Workflow Optimization: Optimize data workflows for performance and scalability, considering factors like data volume, velocity, and variety.
- Collaboration: Work closely with data scientists and analysts to understand data requirements, deliver actionable insights, and support data-driven decision-making.
- Data Quality & Security: Ensure data quality, consistency, and security across all pipelines by implementing appropriate measures and following best practices.
- Monitoring & Troubleshooting: Monitor data pipeline performance and troubleshoot issues in production environments to minimize downtime and ensure data reliability.
📝 Enhancement Note: The candidate should have a proactive approach to problem-solving, with a strong focus on performance optimization and data quality.
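For orientation, here is a minimal sketch of the kind of pipeline this role owns: reading raw data, applying cleansing and enrichment, and writing to an Apache Iceberg table. The catalog configuration, table names, and transformation rules below are illustrative assumptions, not Citi systems.

```python
# Minimal sketch of a PySpark -> Apache Iceberg pipeline. Catalog config,
# table names, and cleansing rules are illustrative placeholders; the
# Iceberg runtime jar is assumed to be on the Spark classpath.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("transactions-curation")
    # Register a Hadoop-backed Iceberg catalog named "demo".
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

raw = spark.read.table("demo.raw.transactions")

# Cleanse and enrich: drop malformed rows, normalize the currency code,
# and stamp each row with its load date.
curated = (
    raw.filter(F.col("amount").isNotNull() & (F.col("amount") > 0))
       .withColumn("currency", F.upper(F.col("currency")))
       .withColumn("load_date", F.current_date())
)

# Iceberg writes are atomic and versioned; append() assumes the target
# table already exists (createOrReplace() would create it).
curated.writeTo("demo.curated.transactions").append()
```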
🎓 Skills & Qualifications
Education: Bachelor’s degree or equivalent experience
Experience: 10+ years of experience in big data engineering, with a strong focus on PySpark and Apache Iceberg
Required Skills:
- Technical Proficiency: Strong experience with PySpark and Apache Iceberg, along with proficiency in distributed data processing and big data technologies.
- Data Modeling: Knowledge of data modeling and schema design to create efficient and scalable data structures.
- Cloud Platforms: Familiarity with cloud platforms like AWS, Azure, or GCP for deploying and managing big data pipelines.
- Version Control: Experience with version control tools like Git for collaborative development and version tracking.
- Communication: Excellent communication skills to collaborate effectively with cross-functional teams and stakeholders.
Preferred Skills:
- Emerging Technologies: Familiarity with emerging big data technologies and trends to stay ahead of the curve.
- Scripting: Proficiency in scripting languages like Python or Bash for automating tasks and workflows.
- Data Visualization: Experience with data visualization tools to create meaningful insights and reports.
📝 Enhancement Note: The ideal candidate will have a strong technical background in big data engineering, with a proven track record of designing and implementing scalable data pipelines.
📊 Web Portfolio & Project Requirements
Portfolio Essentials:
- Pipeline Projects: Include examples of big data pipelines designed and implemented using PySpark and Apache Iceberg, highlighting performance optimization and scalability considerations.
- Data Transformation: Demonstrate data transformation, cleansing, and enrichment processes to showcase your ability to ensure data quality and consistency.
- Collaboration: Showcase projects where you collaborated with data scientists and analysts to deliver actionable insights and support data-driven decision-making.
- Monitoring & Troubleshooting: Include examples of monitoring and troubleshooting data pipeline performance in production environments.
Technical Documentation:
- Code Quality: Document your code with clear comments and follow best practices for code quality and maintainability.
- Version Control: Demonstrate your use of version control tools like Git for collaborative development and version tracking.
- Deployment Processes: Document your deployment processes, including any automation scripts or CI/CD pipelines used.
- Testing Methodologies: Describe your approach to testing big data pipelines, including any performance metrics or optimization techniques employed.
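As an illustration of the testing methodologies item above, the following is a minimal sketch of a pytest-based unit test for a PySpark transformation; the function under test and the fixture rows are hypothetical.

```python
# Sketch of a pytest unit test for a PySpark transformation. The function
# under test (normalize_currency) and the sample rows are hypothetical.
import pytest
from pyspark.sql import SparkSession, functions as F


def normalize_currency(df):
    """Drop non-positive amounts and upper-case the currency code."""
    return (
        df.filter(F.col("amount") > 0)
          .withColumn("currency", F.upper(F.col("currency")))
    )


@pytest.fixture(scope="module")
def spark():
    session = (
        SparkSession.builder.master("local[2]").appName("pipeline-tests").getOrCreate()
    )
    yield session
    session.stop()


def test_normalize_currency(spark):
    df = spark.createDataFrame(
        [(100.0, "usd"), (-5.0, "inr")], ["amount", "currency"]
    )
    rows = normalize_currency(df).collect()
    assert len(rows) == 1                  # negative amount filtered out
    assert rows[0]["currency"] == "USD"    # code normalized to upper case
```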
📝 Enhancement Note: The candidate's portfolio should highlight their ability to design, develop, and maintain big data pipelines using PySpark and Apache Iceberg, with a strong focus on performance optimization and data quality.
💵 Compensation & Benefits
Salary Range: INR 25,00,000 - 30,00,000 per annum (Based on experience and market standards for big data engineering roles in Chennai, India)
Benefits:
- Health, dental, and vision insurance
- Retirement savings plans
- Employee stock purchase plan
- Paid time off and holidays
- Employee discounts and perks
Working Hours: Full-time position with standard working hours, including flexibility for deployment windows and maintenance as required.
📝 Enhancement Note: The salary range provided is an estimate based on market research for big data engineering roles in Chennai, India. The actual salary may vary depending on the candidate's experience, skills, and the company's compensation structure.
🎯 Team & Company Context
Company Culture:
- Industry: Financial Services
- Company Size: Large (Over 200,000 employees)
- Founded: 1812
- Team Structure: The data engineering team consists of big data engineers, data architects, and data analysts, working collaboratively to deliver data-driven solutions.
Development Methodology:
- Agile/Scrum: The team follows Agile/Scrum methodologies for project management, with sprint planning, daily stand-ups, and regular retrospectives.
- Code Review: The team practices code reviews to ensure code quality, maintainability, and knowledge sharing.
- CI/CD Pipelines: The team uses CI/CD pipelines for automated deployment and testing, ensuring efficient and reliable data pipeline delivery.
Company Website: https://www.citi.com
📝 Enhancement Note: Citi is a large financial services company with a global presence, offering extensive opportunities for career growth and development in the data engineering field.
📈 Career & Growth Analysis
Big Data Engineering Career Level: This role sits at Citi's C12 (AVP) level, a senior position focused on designing, developing, and maintaining big data pipelines using PySpark and Apache Iceberg. The candidate will have significant experience in big data engineering and a strong track record of delivering high-quality, scalable data solutions.
Reporting Structure: The candidate will report to the Data Engineering Manager, working closely with data scientists, analysts, and other stakeholders to deliver actionable insights and support data-driven decision-making.
Technical Impact: The candidate will have a significant impact on data processing, transformation, and optimization, ensuring data quality, consistency, and security across all pipelines. Their work will enable the organization to make data-driven decisions, improve operational efficiency, and drive business growth.
Growth Opportunities:
- Technical Leadership: With experience and proven performance, the candidate may have the opportunity to move into a technical leadership role, mentoring junior team members and driving best practices in big data engineering.
- Architecture & Design: The candidate may have the opportunity to work on architecture and design projects, shaping the organization's data infrastructure and driving innovation in big data technologies.
- Emerging Technologies: The candidate will have the opportunity to stay up-to-date with emerging big data technologies and trends, expanding their skillset and driving innovation in the data engineering field.
📝 Enhancement Note: This role offers significant opportunities for career growth and development in the big data engineering field, with a focus on technical leadership, architecture, and design.
🌐 Work Environment
Office Type: The Chennai office is a modern, collaborative workspace designed to foster innovation and creativity, with state-of-the-art technology and amenities to support the team's success.
Office Location(s): The Chennai office is located in the heart of the city, with easy access to public transportation and nearby amenities.
Workspace Context:
- Collaboration: The workspace is designed to encourage collaboration and knowledge sharing, with open-plan offices, meeting rooms, and breakout spaces.
- Technology: The workspace is equipped with high-performance workstations, multiple monitors, and testing devices to support the team's development and testing activities.
- Flexibility: The workspace offers flexible arrangements such as hot-desking to support work-life balance and employee well-being; note that the role itself is designated on-site.
Work Schedule: The work schedule is typically Monday to Friday, with standard working hours and flexibility for deployment windows, maintenance, and project deadlines as required.
📝 Enhancement Note: The Chennai office offers a modern, collaborative workspace designed to support the success of the data engineering team, with a focus on innovation, creativity, and work-life balance.
📄 Application & Technical Interview Process
Interview Process:
- Phone/Video Screen: A brief phone or video call to discuss the candidate's background, experience, and fit for the role.
- Technical Assessment: A hands-on technical assessment covering PySpark, Apache Iceberg, and related big data technologies, with emphasis on performance optimization and data quality.
- Behavioral & Cultural Fit: An interview to assess the candidate's communication skills, problem-solving abilities, and cultural fit within the team.
- Final Interview: A final interview with the hiring manager or a panel of stakeholders to discuss the candidate's fit for the role and the organization's long-term goals.
Portfolio Review Tips:
- Pipeline Projects: Highlight your experience with PySpark and Apache Iceberg, demonstrating your ability to design, develop, and maintain big data pipelines.
- Data Transformation: Showcase your data transformation, cleansing, and enrichment processes, highlighting your focus on data quality and consistency.
- Collaboration: Demonstrate your ability to collaborate with data scientists and analysts, delivering actionable insights and supporting data-driven decision-making.
- Monitoring & Troubleshooting: Include examples of monitoring and troubleshooting data pipeline performance in production environments, highlighting your problem-solving skills and attention to detail.
Technical Challenge Preparation:
- Big Data Technologies: Brush up on your knowledge of PySpark, Apache Iceberg, and other big data technologies, with a focus on performance optimization and data quality.
- Cloud Platforms: Familiarize yourself with cloud platforms like AWS, Azure, or GCP, understanding their features, services, and best practices for big data engineering.
- Data Modeling: Review your knowledge of data modeling and schema design, ensuring you can create efficient and scalable data structures for big data pipelines.
- Problem-Solving: Practice problem-solving techniques and approaches, with a focus on performance optimization and data quality in big data engineering.
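To make the performance-optimization theme concrete, the sketch below shows two common PySpark levers that often come up in assessments like this: broadcasting a small dimension table to avoid a shuffle, and repartitioning by the aggregation key. All table and column names are illustrative assumptions.

```python
# Sketch of two common PySpark performance levers: broadcasting a small
# lookup table to avoid shuffling the large fact table, and repartitioning
# by the aggregation key before a wide groupBy. All names are illustrative
# and assume a Spark session with a configured catalog.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

facts = spark.read.table("demo.curated.transactions")   # large fact table
dims = spark.read.table("demo.ref.merchants")           # small lookup table

# Broadcast join: ships the small side to every executor, so the large
# fact table is never shuffled for the join.
joined = facts.join(F.broadcast(dims), on="merchant_id", how="left")

# Repartition on the grouping key so the shuffle for the aggregation is
# balanced across partitions.
daily_totals = (
    joined.repartition("load_date")
          .groupBy("load_date", "merchant_id")
          .agg(F.sum("amount").alias("total_amount"))
)
```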
ATS Keywords: (Organized by category)
- Programming Languages: Python, Scala, Java, SQL
- Big Data Technologies: Apache Spark, PySpark, Apache Iceberg, Hadoop, Hive, Pig, Spark Streaming, Kafka, Flink
- Cloud Platforms: AWS, Azure, GCP, AWS Glue, Amazon EMR, Amazon Redshift, Azure Databricks, Azure HDInsight, GCP BigQuery, GCP Dataproc
- Databases: MySQL, PostgreSQL, MongoDB, Cassandra, Redis, Amazon DynamoDB, Azure Cosmos DB, Google Cloud Spanner
- Tools: Git, JIRA, Confluence, Jenkins, Docker, Kubernetes, Terraform, AWS CloudFormation, Azure Resource Manager, Google Cloud Deployment Manager
- Methodologies: Agile, Scrum, Kanban, Waterfall, DevOps, CI/CD
- Soft Skills: Communication, Collaboration, Problem-Solving, Critical Thinking, Attention to Detail, Time Management
- Industry Terms: Data Pipeline, ETL, ELT, Data Warehouse, Data Mart, Data Lake, Data Governance, Data Quality, Data Security, Metadata Management
📝 Enhancement Note: The interview process for this role will focus on the candidate's technical proficiency in big data engineering, with a strong emphasis on PySpark, Apache Iceberg, and performance optimization. The candidate should be prepared to discuss their experience with big data technologies, cloud platforms, and data modeling, as well as their approach to problem-solving and collaboration.
🛠 Technology Stack & Web Infrastructure
Big Data Technologies:
- PySpark: PySpark, the Python API for Apache Spark, is the primary engine used for designing, developing, and maintaining big data pipelines. The candidate should have strong experience with PySpark and a solid understanding of its features, APIs, and best practices.
- Apache Iceberg: Apache Iceberg is an open table format for large analytic datasets that brings ACID transactions, schema evolution, hidden partitioning, and time travel to data lakes. The candidate should have experience with Apache Iceberg and a solid understanding of its features, APIs, and best practices (see the sketch after these bullets).
- Distributed Data Processing: The candidate should have experience with distributed data processing frameworks like Apache Hadoop, Apache Spark, or Apache Flink, with a focus on performance optimization and scalability.
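The following sketch shows what working with Iceberg through Spark SQL typically looks like: creating a table with a hidden partition transform and evolving its schema as a metadata-only change. The "demo" catalog and "db.events" table are assumptions, and the Iceberg runtime jar is presumed to be on the Spark classpath.

```python
# Sketch of Iceberg table management via Spark SQL. The "demo" catalog and
# "db.events" table are assumptions; Iceberg's SQL extensions must be enabled.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a partitioned table; days(event_ts) is a hidden partition
# transform, so queries never need to reference a partition column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id   BIGINT,
        event_type STRING,
        event_ts   TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Schema evolution in Iceberg is a metadata-only operation: no data rewrite.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN source STRING")
```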
Cloud Platforms:
- AWS: Amazon Web Services (AWS) is a popular cloud platform for big data engineering, offering a wide range of services for data storage, processing, and analysis. The candidate should have familiarity with AWS services like Amazon S3, Amazon Redshift, AWS Glue, and AWS EMR.
- Azure: Microsoft Azure is another popular cloud platform for big data engineering, offering services like Azure Databricks, Azure HDInsight, Azure Data Lake, and Azure Cosmos DB. The candidate should have familiarity with Azure services relevant to big data engineering.
- GCP: Google Cloud Platform (GCP) is a third popular cloud platform for big data engineering, offering services like BigQuery, Dataproc, Cloud Storage, and Cloud Pub/Sub. The candidate should have familiarity with GCP services relevant to big data engineering.
Version Control:
- Git: Git is a widely used version control system for collaborative development and version tracking. The candidate should have experience with Git, including branching, merging, and pull request workflows.
📝 Enhancement Note: The technology stack for this role focuses on big data technologies, cloud platforms, and version control, with a strong emphasis on PySpark, Apache Iceberg, and performance optimization. The candidate should have a solid understanding of these technologies and their best practices for big data engineering.
👥 Team Culture & Values
Big Data Engineering Values:
- Data Quality: Citi places a strong emphasis on data quality, consistency, and security. The candidate should have a deep understanding of data quality principles and best practices, with a commitment to ensuring high-quality data across all pipelines.
- Performance Optimization: Citi values performance optimization and scalability in big data engineering. The candidate should have a strong focus on performance optimization, with a commitment to improving data processing, transformation, and analysis efficiency.
- Collaboration: Citi fosters a culture of collaboration and knowledge sharing, with a focus on cross-functional teamwork and stakeholder engagement. The candidate should have excellent communication skills and a commitment to working effectively with data scientists, analysts, and other stakeholders.
- Innovation: Citi encourages innovation and continuous learning in big data engineering. The candidate should have a strong curiosity and a commitment to staying up-to-date with emerging big data technologies and trends.
Collaboration Style:
- Cross-Functional Integration: The big data engineering team works closely with data scientists, analysts, and other stakeholders to deliver actionable insights and support data-driven decision-making. The candidate should be comfortable working in a cross-functional environment and collaborating effectively with diverse teams.
- Code Review Culture: The team practices code reviews to ensure code quality, maintainability, and knowledge sharing. The candidate should be comfortable with code reviews and committed to contributing to the team's collective knowledge and expertise.
- Mentoring & Knowledge Sharing: The team encourages mentoring and knowledge sharing, with a focus on continuous learning and professional development. The candidate should be committed to mentoring junior team members and driving best practices in big data engineering.
📝 Enhancement Note: Citi's big data engineering team values data quality, performance optimization, collaboration, and innovation, with a strong emphasis on cross-functional teamwork and knowledge sharing. The candidate should be committed to these values and demonstrate a strong focus on big data engineering best practices.
⚡ Challenges & Growth Opportunities
Technical Challenges:
- Big Data Processing: Designing, developing, and maintaining big data pipelines using PySpark and Apache Iceberg, with a focus on performance optimization and scalability.
- Data Transformation: Implementing data transformation, cleansing, and enrichment processes to ensure data quality and consistency, with a focus on performance optimization and efficiency.
- Cloud Platforms: Deploying and managing big data pipelines on cloud platforms like AWS, Azure, or GCP, with a focus on cost optimization, security, and scalability.
- Data Governance: Ensuring data quality, consistency, and security across all pipelines, with a focus on metadata management, access control, and data lineage.
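On the data governance point, Iceberg's snapshot metadata and time-travel queries are a common way to support auditability and lineage; a minimal sketch follows, with a hypothetical table name and placeholder snapshot ID.

```python
# Sketch of Iceberg metadata inspection and time travel for audit/lineage
# work. Table name and snapshot ID are placeholders; TIMESTAMP/VERSION AS OF
# syntax requires Spark 3.3+ with the Iceberg extensions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Every commit to an Iceberg table is recorded as a snapshot.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM demo.db.events.snapshots
""").show()

# Query the table as it existed at an earlier point in time.
spark.sql(
    "SELECT COUNT(*) FROM demo.db.events TIMESTAMP AS OF '2025-07-01 00:00:00'"
).show()

# Or pin to a specific snapshot ID taken from the snapshots table above.
spark.sql("SELECT COUNT(*) FROM demo.db.events VERSION AS OF 1234567890").show()
```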
Learning & Development Opportunities:
- Technical Skill Development: Staying up-to-date with emerging big data technologies and trends, expanding your skillset, and driving innovation in the big data engineering field.
- Conference Attendance: Attending industry conferences, webinars, and workshops to learn from thought leaders, network with peers, and gain insights into emerging big data technologies and trends.
- Certification: Pursuing relevant certifications, such as AWS Certified Data Engineer – Associate, Microsoft Certified: Azure Data Engineer Associate, or Google Cloud Professional Data Engineer, to demonstrate your expertise and commitment to continuous learning.
- Mentorship: Seeking mentorship opportunities from experienced big data engineers, data architects, and data scientists to gain insights into best practices, architecture decisions, and career development strategies.
📝 Enhancement Note: This role offers significant technical challenges and growth opportunities in big data engineering, with a focus on performance optimization, data quality, and cloud platform management. The candidate should be committed to continuous learning, skill development, and driving innovation in the big data engineering field.
💡 Interview Preparation
Technical Questions:
- PySpark & Apache Iceberg: Describe your experience with PySpark and Apache Iceberg, highlighting your ability to design, develop, and maintain big data pipelines with a focus on performance optimization and data quality.
- Cloud Platforms: Discuss your familiarity with cloud platforms like AWS, Azure, or GCP, highlighting your ability to deploy and manage big data pipelines with a focus on cost optimization, security, and scalability.
- Data Transformation: Explain your approach to data transformation, cleansing, and enrichment processes, highlighting your focus on data quality, consistency, and performance optimization.
- Data Governance: Describe your understanding of data governance principles, with a focus on metadata management, access control, and data lineage in big data engineering.
Company & Culture Questions:
- Data-Driven Decision-Making: Discuss your experience with data-driven decision-making, highlighting your ability to collaborate with data scientists and analysts to deliver actionable insights and support data-driven decision-making.
- Agile Methodologies: Explain your experience with Agile methodologies, with a focus on sprint planning, daily stand-ups, and regular retrospectives in big data engineering.
- Stakeholder Management: Describe your approach to stakeholder management, with a focus on collaborating effectively with data scientists, analysts, and other stakeholders to deliver actionable insights and drive business value.
Portfolio Presentation Strategy:
- Pipeline Projects: Highlight your experience with PySpark and Apache Iceberg, demonstrating your ability to design, develop, and maintain big data pipelines with a focus on performance optimization and data quality.
- Data Transformation: Showcase your data transformation, cleansing, and enrichment processes, highlighting your focus on data quality, consistency, and performance optimization.
- Cloud Platforms: Demonstrate your familiarity with cloud platforms like AWS, Azure, or GCP, highlighting your ability to deploy and manage big data pipelines with a focus on cost optimization, security, and scalability.
- Data Governance: Include examples of metadata management, access control, and data lineage in big data engineering, highlighting your commitment to data quality, consistency, and security.
📝 Enhancement Note: The interview process for this role will focus on the candidate's technical proficiency in big data engineering, with a strong emphasis on PySpark, Apache Iceberg, and performance optimization. The candidate should be prepared to discuss their experience with big data technologies, cloud platforms, and data governance, as well as their approach to collaboration, stakeholder management, and data-driven decision-making.
📌 Application Steps
To apply for this big data engineering position:
- Customize Your Portfolio: Tailor your portfolio to highlight your experience with PySpark, Apache Iceberg, and big data engineering, with a focus on performance optimization, data quality, and cloud platform management.
- Resume Optimization: Optimize your resume for big data engineering roles, highlighting your experience with PySpark, Apache Iceberg, and relevant big data technologies, as well as your familiarity with cloud platforms and data governance principles.
- Technical Interview Preparation: Prepare for technical interviews by brushing up on your knowledge of PySpark, Apache Iceberg, and big data technologies, with a focus on performance optimization, data quality, and cloud platform management.
- Company Research: Research Citi's big data engineering team, understanding their culture, values, and approach to big data engineering, with a focus on collaboration, innovation, and data-driven decision-making.
⚠️ Important Notice: This enhanced job description includes AI-generated insights and big data engineering industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.
Application Requirements
The candidate should have strong experience with PySpark and Apache Iceberg, along with proficiency in distributed data processing and big data technologies. Familiarity with cloud platforms and version control tools is also required.