Sudhanshu Kaushik

Senior Applied AI Scientist at SAP Labs India

Bengaluru, Karnataka, India

About

I like training large deep neural nets on graphics cards.

Skills

Primary Skills: Data Science, Machine Learning, Deep Learning, Deep Reinforcement Learning, Natural Language Processing, AI Engineering, Generative AI, Large Language Models, Big Data Technologies
Secondary Skills: Microservices, Distributed Systems, Backend Development, MLOps, Production Deployment, Cloud, System Design, CNCF Cloud-Native
Domain Knowledge: AIOps, Data Privacy and Protection
Programming Language: Python
Machine Learning: PyTorch, Hugging Face Transformers, PEFT / LoRA, HF TRL, Scikit-learn, spaCy, Implicit, Keras / TensorFlow, XGBoost, LightGBM, Apache Spark ML and MLlib, DeepFace, MLxtend, datasketch, Optuna, Databricks, ONNX Runtime
LLMs / GenAI: LangGraph, LangChain, PGVector, SentenceTransformers, llama.cpp, FAISS
NLP: spaCy, Yet Another Keyword Extractor (YAKE)
MLOps: MLflow, DVC, Evidently AI
Numerical Analysis: Pandas, NumPy, Pandas-Profiling
Web Development Framework: FastAPI, Flask, Python-Eve
Asynchronous Programming: asyncio, uvloop
ORMs: SQLAlchemy, SQLModel, ODMantic, Motor
Data Engineering: Pydantic
Big Data: Apache Spark, Kafka
Distributed Computing: PySpark, GraphFrames, GraphX, Apache Spark ML and MLlib, Dask, Ray
Graph Processing: GraphFrames, GraphX, NetworkX
Message Queue: ARQ (Redis), RabbitMQ, Apache Kafka
Workflow Orchestration and Task Scheduling: Rundeck, APScheduler, supervisord
Containerization: Docker, Docker Compose, K8s (basic)
Observability: OpenTelemetry
Cloud APIs: Boto3, apache-libcloud, Microsoft Azure
Testing: pytest, Locust, k6
Templating: Cookiecutter
Scripting: Bash
Query Languages: SQL
Dev Tools: Git, pre-commit
Object Storage: S3, BlobFuse, MinIO, SeaweedFS
Relational Databases: PostgreSQL
NoSQL Databases: MongoDB, Elasticsearch, Redis
Cache: Redis, cachetools
Vector Database: PGVector
Graph Database: Cayley DB, Neo4j
DB Migration: Flyway
Dependency Management: Poetry, venv
Blog: https://sudhanshukaushik.wordpress.com

Experience

Senior Data and Applied Scientist - T3L3 at SAP
Aug 2024 - Present · 1 yr 11 mos
Developing an Anonymization Service that performs anonymization (irreversible) and pseudonymization (reversible) of PII entities in unstructured, multilingual data such as text, files, PDFs, images, and videos, at SAP BTP scale. The system preserves semantic meaning (Differential Privacy) and downstream usability, enabling safe LLM prompting, training, and synthetic data generation while mitigating bias, re-identification risk, and PII leakage, and ensuring GDPR compliance.
- Trained and evaluated GLiNER (gliner_multi-v2.1) with an mDeBERTa backbone (mDeBERTa-v3-base) for multilingual NER using OntoNotes 5, CoNLL, UD, OpenNER, and SAP internal datasets.
- Used Distant Supervision to annotate datasets using Wikidata, SPARQL, and LangExtract.
- Generated synthetic data for address and phone number detection in the wild.
- Trained and evaluated a Cross-Encoder filter based on mDeBERTa to reduce NER false positives.
- Used the Aho–Corasick algorithm to post-process NER findings.
- Implemented Cross-Document Coreference Resolution for entity linkage in reversible anonymization, ensuring that each UUID consistently refers to the same entity.
- Implemented Short Text Language Detection using Lingua. Evaluated Lingua, FastText, Lang-Detect, and Ldig.
- Implemented and evaluated pre-tokenizers for CJK languages (SudachiPy, HanLP, MeCab-ko).
- Implemented Face Detection and Redaction using the SCRFD model, ResNet SSD, and Haar Cascades (Viola-Jones algorithm).
- Implemented OCR using the Chargrid model for text redaction in PDFs and images. Implemented Character Segmentation for multilingual text using the Character Region Awareness for Text Detection (CRAFT) model, Vertical Projection Profile, contours, and image processing techniques, in order to mask content in documents and images.
- Translated English datasets into various languages using SAP Translation Hub and a custom bitext word alignment task, i.e. word alignment framed as binary sequence labeling using multilingual pre-trained language models (mPLMs).
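The reversible (pseudonymization) path described above depends on every mention of an entity mapping to the same UUID. The following is a minimal Python sketch of that idea only. The `Pseudonymizer` class, its normalization rule (case-folding and whitespace collapsing), and the in-memory dictionaries are illustrative assumptions, not the service's actual design; real cross-document coreference resolution would replace the simple normalization used here.

```python
import re
import uuid


class Pseudonymizer:
    """Hypothetical helper: maps entity mentions to stable UUID placeholders."""

    def __init__(self):
        self._forward = {}  # normalized mention -> UUID string
        self._reverse = {}  # UUID string -> first surface form seen

    @staticmethod
    def _normalize(mention):
        # Assumption: mentions that differ only in case or whitespace
        # refer to the same entity.
        return re.sub(r"\s+", " ", mention).strip().casefold()

    def token_for(self, mention):
        key = self._normalize(mention)
        if key not in self._forward:
            token = str(uuid.uuid4())
            self._forward[key] = token
            self._reverse[token] = mention
        return self._forward[key]

    def pseudonymize(self, text, mentions):
        # Replace longer mentions first so shorter ones cannot split them.
        for mention in sorted(set(mentions), key=len, reverse=True):
            text = text.replace(mention, f"<{self.token_for(mention)}>")
        return text

    def restore(self, text):
        # Restores the first surface form recorded for each UUID.
        for token, original in self._reverse.items():
            text = text.replace(f"<{token}>", original)
        return text


p = Pseudonymizer()
doc_a = p.pseudonymize("Jane Doe met Bob.", ["Jane Doe", "Bob"])
doc_b = p.pseudonymize("Bob called Jane Doe.", ["Bob", "Jane Doe"])
```

In this sketch both documents carry the same placeholder for "Jane Doe", and `p.restore(doc_a)` returns the original sentence. A production system would persist the mapping securely rather than hold it in memory.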
Senior Data Scientist at Jio Platforms Limited (JPL)
Aug 2022 - Aug 2024 · 2 yrs 1 mo
Worked in the AI/ML team of the AIOps Engineering Department, on a Microservices Architecture in an Agile environment, developing AI/ML modules for the CloudXP Project (a Jio Platforms AIOps PaaS enterprise software product).
- Developed a Face Deduplication Engine to identify people with multiple SIMs from 400 million KYC passport-size images using Approximate Nearest Neighbor Search, PySpark, and the NVIDIA RAPIDS Accelerator for Apache Spark on GPU-enabled Spark clusters. (Similar to ASTR by DoT: https://indianexpress.com/article/explained/explained-sci-tech/govt-ai-face-recognition-tool-astr-detect-phone-frauds-8614162/)
- Log Analysis / QnA / RCA using LLMs (RAG and RAG Fusion).
- Named Entity Recognition for AIOps using spaCy en_core_web and custom-annotated, BILOU-tagged training data.
- Developed various ML use cases (Text Classification, Recommendation System, Association Rule Learning, Forecasting, Anomaly Detection, Dimensionality Reduction, Clustering, NLP, Document Search).
- Developed an End-to-End ML Platform consisting of model-trainer, model-registry, model-serving, and job-scheduler pods. It supports the following features:
  - Data Source Management: pull training data from MongoDB, PostgreSQL, Elasticsearch
  - Feedback Management
  - ML Model Management: Out-of-the-Box Models, Serve-Only Models, External ML Model APIs
  - ML Model Configuration Management
  - Parameterised ML Pipelines
  - ML Experiments Management
  - Continuous Training and Deployment as REST APIs
  - ML Model Registry
  - ML Model Version Management
  - ML Model Logging: Audit Trail, Control Boards
  - Scheduling Batch Training and Inference Jobs
  - Model Monitoring: Model and Data Drift
Technologies used: PySpark, GraphFrames, Apache Spark ML and MLlib, LangChain, pgvector, SentenceTransformers, llama.cpp, Ray Serve, Dask, MLflow, DVC, Evidently AI, ARQ (Redis), RabbitMQ, APScheduler, Implicit, spaCy, FastAPI, SQLAlchemy, SQLModel, Pydantic, Docker, Pandas, Scikit-learn, PostgreSQL, MongoDB, Elasticsearch, Python, Azure Cloud
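RAG Fusion, mentioned above, typically issues several reformulations of a query, retrieves a ranked list for each, and merges the lists with reciprocal rank fusion (RRF). The sketch below shows only the merging step. The function name and document IDs are hypothetical, and `k=60` is the constant commonly used for RRF, not a value taken from this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank is 1-based.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retrieval results for three reformulations of one query.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
])
```

Documents that rank well across several lists rise to the top of `fused`, which is then passed to the LLM as context.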
Senior Software Engineer - ML at Analog Devices
Jul 2021 - Aug 2022 · 1 yr 2 mos
Worked in the Software COE on the Customer Tools Experience (CTX) Project, on a Microservices Architecture in an Agile environment. Developed various back-end modules for the ADI vLab (virtual lab) Platform.
- Developed a recommendation engine for cross-selling ADI parts on a Neo4j graph database, using node embeddings, collaborative filtering, and personalized PageRank.
- Developed Back-end Service and Package Templates for scaffolding, with Async REST Interface, WebSocket Interface, Queue Interface (job workers pattern), Formatter, Linter, Docs generator, Data Access Layer, Storage (VFS libfuse), Distributed Tracing, Logging, Metrics, and Unit and Performance (Shift Left) Tests baked in.
- Developed an Execution Engine for Signal Chain Processing using Graph Algorithms, Azure Web PubSub, WebSockets, and Azure Queues.
- Developed a Data Access Layer (common interface for MongoDB) using the Adapter Pattern, the Singleton Pattern, and Motor, the async driver for MongoDB.
- Developed an Interceptor Engine (Observability) that collects metrics, traces, and logs using OpenTelemetry for product measurement and KPIs, enabling linkage to revenue measurement. Secured it using Keycloak Authentication Server and TLS (CFSSL).
Technologies used: Python, Bash, FastAPI, Pydantic, asyncio, uvloop, Uvicorn, Gunicorn, MongoDB, Motor, ODMantic, Cookiecutter, Jinja2, pre-commit, WebSockets, Poetry, Pytest, k6, VFS libfuse, BlobFuse, OpenTelemetry, Grafana Loki, OpenCensus, Keycloak, Azure (Azure Web PubSub, WebSockets, Azure Queues, Azure Monitor, Azure Application Insights, Blob Storage, Azure Kubernetes Service, Azure DevOps), AWS (Cognito), Docker, Git, Alpine Linux, Sphinx, Visual Studio Code.
Data Scientist at Fractal
Nov 2020 - Jul 2021 · 9 mos
- Pricing and Promotion Mix Modelling for the CPG Industry using LightGBM and Bayesian regression.
- MLOps using MLflow on Azure Databricks.
Technologies used: Python, SQL, MLflow, Azure Databricks, LightGBM, Bayesian regression, PyMC3, MCMC, Pandas, NumPy.
Associate - Data Scientist at Cognizant
Sep 2016 - Nov 2020 · 4 yrs 3 mos
Worked in the AI/ML team of Cognizant’s Automation Center (a Cognizant PaaS AIOps enterprise software product: https://www.cognizant.com/automation-center), on a Microservices Architecture in an Agile environment.
- Developed Machine Learning RESTful Engines (Recommender Systems, Text Classification, Time Series Forecasting, Text Summarizer, Outage Prediction).
- Developed an End-to-End, production-grade Machine Learning Module consisting of Data Feeder, Trainer, and Model Manager for Horizontal Scaling and Monitoring.
- Developed an API Engine web service for the project's database (Postgres). It acted as a layer between the Web App and the Database, and all CRUD operations on the database went through it. Password encryption (SCRAM-SHA-256) and DB Connection Pooling were also handled within this layer.
- Developed a File Storage Engine for persisting objects to Cloud or NAS Volume. Used the Facade Pattern to make it work across Cloud Providers. Used rsync for horizontal scaling.
- Developed a web service to store and retrieve entity relationships in a Graph Database (Cayley DB).
- Containerized the Machine Learning RESTful engines with Docker, using an Alpine Linux base image, for automated builds.
- Scheduled Machine Learning Training Batch Jobs using Rundeck.
- Analyzed and resolved security vulnerabilities and lint issues reported by SAST/DAST scans such as SonarQube (Static Code Analysis), Checkmarx, Twistlock, Anchore, AquaSec, Black Duck, and Fortify.
Technologies used: Python, SQL, Bash, FastAPI, Flask, SQLAlchemy, Rasa, spaCy, MLflow, Traefik, Supervisord, Azure (Blob Storage, Azure DevOps), AWS (S3), Boto3, Docker, Kubernetes, Rancher, Alpine Linux, Git, Scala, Apache Spark ML and MLlib, Implicit, scikit-learn, NumPy, Pandas, Cayley DB (Graph database), MinIO, Flyway, PostgreSQL, PgBouncer, Rundeck, SonarQube, Visual Studio Code.