Bengaluru, Karnataka, India
Learning something new has always been fun for me. I like to experiment, take on challenges, and solve problems! I'm a data engineer focused on finding more effective ways to get work done. The world of AI fascinates me: I love how it makes daily life easier, and I want to contribute my share to the industry.
• Modernized a customer-facing, SLA-critical data pipeline by migrating it from pandas to Spark, enabling weekly ingestion of 3M+ records into a customer-managed S3 data lake and reducing end-to-end latency by 90%.
• Built distributed PySpark transformations that process 4M+ records into 40k+ nested JSON files, joining, validating, and uploading them in parallel to a GCP data lake while ensuring correctness under immutable storage constraints.
• Optimized memory-intensive pandas workflows using lazy evaluation and Modin on Ray, cutting monthly cloud compute costs by 60% and improving processing efficiency.
• Achieved 99% job reliability by automating distributed workloads on legacy Ubuntu clusters with a custom SSH-based orchestration framework, eliminating manual intervention and recurring failures.
• Improved team productivity through code reviews and mentored junior engineers in data engineering and modular design.
• Designed scalable, fault-tolerant batch data ingestion systems handling 1M+ data points per day from APIs and unstructured web sources, ensuring timely data availability for external consumers.
• Partnered with the R&D team to reduce ingestion latency by 40% via adaptive IP rotation and distributed execution.
• Increased platform reliability and reduced delivery errors by 30% by implementing data quality checks, retries, and alerts that detect job failures and schema drift.