Sounak Roy

Senior Manager | Architect @ Capgemini | Designing Data Ecosystems

Kolkata metropolitan area, West Bengal, India

About

With 11+ years of experience in Data Engineering and Big Data platforms, I specialize in building scalable distributed data solutions using Apache Spark (Scala/PySpark), cloud technologies, and modern data architectures. Over the years, I have worked across BFSI, Retail, Pharmaceutical, and Media domains, designing and modernizing enterprise-scale data platforms on AWS, GCP, and Azure. My experience spans large-scale data ingestion, transformation, analytics, cloud migration, data quality frameworks, and graph-based knowledge systems.

Currently, I am focused on integrating AI and NLP capabilities into enterprise data platforms. My work includes developing Spark-Scala-based inference pipelines using transformer models such as BERT, RoBERTa, and DistilBERT, implementing model evaluation workflows with PyTorch, leveraging ONNX for optimized deployment, and building scalable AI-enabled data processing solutions. I have also worked with AWS Bedrock, SageMaker, Neptune, and Vespa to support intelligent data and knowledge-driven applications.

Key areas of expertise:
• Apache Spark (Scala & PySpark)
• Data Engineering & Distributed Systems
• AWS, GCP & Azure Data Platforms
• AI/ML Pipeline Integration
• NLP & Transformer Models (BERT, RoBERTa, DistilBERT)
• PyTorch & ONNX
• Graph & Knowledge Systems (AWS Neptune, Vespa)
• Data Platform Modernization
• Cloud Migration & Enterprise Architecture

I enjoy solving complex data challenges, modernizing legacy systems, and building scalable platforms that bridge the gap between Data Engineering and AI.

Experience

  • Senior Manager | Architect at Capgemini
    Dec 2023 - Present · 2 yrs 7 mos

    Working on large-scale data platforms for a global internet product organization, contributing to modernization, migration, and maintenance of complex data ecosystems.
    • Designing and maintaining Apache Spark (Scala & PySpark) pipelines for ingestion and transformation of XML, JSON, delimited text files, and Hive datasets.
    • Contributing to legacy platform modernization, including Apache Pig to Spark migration and on-prem Hadoop to GCP migration initiatives.
    • Implementing data quality and validation frameworks using Great Expectations (PySpark) and developing Python-based automation tools to improve reliability and operational efficiency.
    • Supporting workflow modernization efforts from Oozie to Airflow through design understanding and dependency mapping.
    • Modernizing and maintaining graph-based knowledge systems using GraphQL (Apollo) and AWS Neptune.
    • Recipient of the Capgemini Pioneer Award for two consecutive quarters for excellence in project delivery and performance.

  • Big Data Specialist at LTIMindtree
    Mar 2022 - Dec 2023 · 1 yr 10 mos

    ▪ Design and development of Apache Spark-Scala-based applications for ingesting data from various source systems: Azure ADLS, HDFS, Hive, and Oracle. Creating pipelines in Azure Databricks.
    ▪ Enhancement and maintenance of existing big data pipelines on the Cloudera CDP platform.
    ▪ Carrying out multiple POCs to implement Delta Lake for use cases that currently follow a data lake architecture.
    ▪ Data orchestration using Apache Airflow and NiFi.
    ▪ Migration and enhancement of legacy projects to newer tech stacks and development methodologies.
    ▪ Leading and mentoring a team of 3 to use best practices and reduce security vulnerabilities in code.
    ▪ Interviewing internal and external candidates for employment and project deployment.

  • Senior Associate at PwC India
    Dec 2020 - Mar 2022 · 1 yr 4 mos

    ▪ Gathering use cases from business stakeholders and delivering applications to support them, including data engineering (DE) pipelines from manufacturing source systems hosted on Azure ADLS, Oracle, SQL Server, SAP, and SFTP, built with NiFi, PySpark, and Scala-Spark to support reporting and analytical use cases.
    ▪ Development of analytics applications and APIs that use Spark to evaluate business-critical KPIs for downstream applications and dashboards.
    ▪ Development of a data quality (DQ) framework built on PySpark to enforce DQ checks and rules on DE pipelines across multiple sources.
    ▪ Daily monitoring and maintenance of 200+ data pipelines running on NiFi for consistency, and development and deployment of patches and fixes for any inconsistencies.

  • Assistant Manager IT at Indian Bank
    Sep 2017 - Dec 2020 · 3 yrs 4 mos

    • Submission of reports to senior management and regulatory bodies, prepared with machine learning and statistical tools on transaction-, customer-, and account-level data from Hadoop HDFS-based data lakes using Apache Spark, MapReduce, and Hive, to aid business decision-making and planning.
    • Implementation of fraud detection methods by analyzing patterns in transactions originating from Financial Inclusion endpoints.
    • Monitoring of daily cash intake and deposit records from transactional dumps to increase accuracy and reduce discrepancies.
    • Transformation, loading, and cleaning of data from varied sources, and design of data pipelines and workflows from RDBMS and file-based data sources to Hadoop data lakes and vice versa.
    • Planning, setup, and maintenance of an on-premises Hadoop-based data lake for transactional data processing and storage.

  • System Engineer at Tata Consultancy Services
    Sep 2014 - Aug 2017 · 3 yrs

    • Development of business-critical applications, primarily using Spring Integration for middleware integration.
    • Development of a web-based inventory management app for the client using the Spring Boot framework.
    • Development of REST API endpoints to serve external systems, and consumption of services from external systems to meet business needs.
    • Development of web-based and desktop Java apps for monitoring and support purposes.