Pratham Yadav

Data Engineer | PySpark, Airflow, NiFi, Trino | ETL, Data Warehousing, Big Data | AWS, Azure

New Delhi, Delhi, India

About

I’m a Data Engineer with 2.6+ years of experience building scalable data pipelines and data platforms that turn complex datasets into actionable insights. Currently at KPMG India, I design and optimize end-to-end ETL workflows, enabling reliable, high-performance data systems for analytics and decision-making.

🚀 Key Achievements
• Built and scaled ETL pipelines processing 10M+ records daily using Apache NiFi and Python
• Reduced data quality issues by 95% through automated validation frameworks (Great Expectations)
• Improved query performance by up to 60% via optimized SQL and data modeling
• Increased pipeline reliability by 85% with monitoring, alerting, and retry mechanisms

🛠️ Core Skills
• Languages: Python, SQL
• Data Engineering: Apache Airflow, Apache NiFi, PySpark, ETL Pipelines
• Databases: PostgreSQL, MongoDB
• Visualization: Apache Superset, Power BI
• Cloud: AWS (S3, IAM, Lambda, Glue, Athena)
• Tools: Git, Docker, Linux/Unix

💡 I focus on building scalable, reliable, and cost-efficient data systems, with strong emphasis on data quality, automation, and performance optimization.

📌 Currently exploring opportunities to work on large-scale data platforms, real-time processing, and modern data stack technologies. Let’s connect if you’re working on data engineering, analytics, or building data-driven products.

Experience

Associate Consultant at KPMG India
Oct 2025 - Present · 9 mos
Data Engineer at Dhwani Rural Information Systems
Sep 2023 - Oct 2025 · 2 yrs 2 mos
Project: MGrant - Grant Allocation Tracking System
• Developed multiple Airflow DAGs to orchestrate data workflows for MGrant data processing and feeding Superset dashboards, ensuring automation and scalability.
• Engineered data migration workflows from MongoDB to PostgreSQL using Trino, improving query speed by 60%.
• Integrated Great Expectations for automated data validation, reducing quality issues by 95%.
• Designed real-time dashboards in Apache Superset for monitoring CSR grant allocations and project outcomes.

Project: ISDM DataSights - Open Source Data Platform
• Designed end-to-end ETL pipelines using Apache NiFi, PySpark, and Python, processing 10M+ daily records from AWS S3.
• Built multi-layer data architecture with Hive & PostgreSQL, reducing query execution time by 60%.
• Developed PySpark-based data quality and lineage framework, cutting data errors by 85%.
• Deployed ML models for classification and anomaly detection, improving efficiency by 70%.
Data Analyst at Atkins
Apr 2023 - Jun 2023 · 3 mos
• Led automation initiatives in Atkins Global’s finance department, using Python scripts to streamline daily and weekly tasks and improve operational efficiency and accuracy.
• Designed and developed finance dashboards from HR department data, providing analysis and reporting to support decision-making in the finance department.
Data Analyst at SequelString AI Private Limited
Jan 2023 - Mar 2023 · 3 mos
• Implemented data extraction pipelines in Python to parse information from PDFs, combined with web scraping using BeautifulSoup and Scrapy. Set up database connectivity and storage tailored to client specifications for efficient data management and retrieval.
• Provided foundational backend support in MongoDB during the initial phases of development, covering data storage, retrieval, and management for a scalable application architecture.