Singapore
I am a Data Engineer with a strong foundation in building scalable data architectures and a Master’s degree from the University of Arizona. Currently, I specialize in architecting Medallion Lakehouse structures on Databricks, where I have migrated 500+ tables and optimized pipelines to handle 200M+ daily records, achieving a 25% boost in processing efficiency. With experience ranging from large-scale data ingestion at TSMC to research in Graph Foundation Models, I bridge the gap between robust data engineering and advanced machine learning. My background as a Graduate Teaching Assistant in DBMS and Algorithms has honed my ability to communicate complex technical concepts clearly in high-pressure, international environments.
Core Expertise:
● Data Engineering: Databricks, Apache Spark (PySpark), ETL/ELT, CDC, Data Modeling.
● Infrastructure & DevOps: Cloud Migration, Docker, CI/CD, Pipeline Monitoring.
● Advanced Analytics: Graph AI, Multimodal Data, NLP, Machine Learning.
I am passionate about building data-driven solutions that scale, and I am always open to connecting with fellow data professionals and exploring global opportunities in the tech space.
● Architected a Medallion Architecture (Bronze/Silver/Gold) on Databricks to migrate 500+ tables; unified disparate data formats from DB2, PostgreSQL, and SQL Server into a high-performance cloud lakehouse.
● Developed custom Spark-based modules to handle complex data inconsistencies, including cross-system time zone normalization and rigorous null-handling strategies (distinguishing dummy values from literal strings), to ensure downstream data accuracy.
● Engineered robust ETL/ELT pipelines using PySpark and Spark Streaming; implemented CDC (Change Data Capture) logic to process 200M+ daily records, maintaining 100% data integrity during schema transitions.
● Built a centralized monitoring framework with categorized error reporting; cut manual troubleshooting time from 1 hour to minutes, enabling instant identification of table-level failures.
● Optimized resource utilization by refactoring logic for parallel execution and using Databricks' OPTIMIZE and VACUUM commands, resulting in a 25% faster runtime and reduced cloud infrastructure costs.
MIS 301 Data Structure & Algorithm
● Conducted problem-solving sessions for Python-based Data Structures and Algorithms; provided real-time debugging support and code reviews to reinforce students' fundamental programming logic.
● Addressed diverse technical inquiries regarding algorithm complexity and data manipulation, helping students master efficient coding practices.
● Developed a Graph Foundation Model integrating LLMs with graph-text contrastive learning to analyze 60K+ supply chain nodes; automated supplier identification for semiconductor manufacturers under shifting policy frameworks.
● Implemented a Multimodal LLM framework to process large-scale unstructured text and satellite imagery; built a pipeline to detect ESG risks in the EV supply chain, providing actionable insights for revenue loss mitigation.
● Streamlined data ingestion from public records using Python APIs and scikit-learn to curate training datasets for LLM-based ESG reporting; reduced data preparation time by 40% through automated cleaning and standardization.
MIS 331 Database Management Systems
● Facilitated weekly technical labs focusing on Advanced SQL and Database Design (ERD); provided hands-on guidance for 100+ students in solving complex query optimization and schema normalization problems.
● Demonstrated technical leadership by bridging the gap between theoretical database concepts and practical implementation, maintaining high teaching standards in a high-pressure, English-speaking academic environment.
● Automated competitive data collection by designing robust Python-based scraping pipelines with Selenium and Scrapy; processed 1.5M+ records weekly, reducing manual effort by 80% and enhancing strategic planning accuracy.
● Refined raw data structures by implementing custom cleaning and normalization logic to ensure high data fidelity for competitive intelligence analysis.
● Optimized text preprocessing workflows for 10K+ news articles using NLTK; automated tokenization and labeling processes, which halved manual labeling time and improved market analysis efficiency.
● Developed Power BI dashboards using SQL and regression models (Statsmodels) to analyze customer feedback, identifying key service drivers and improving satisfaction scores by 10%.
● Automated data pipelines using Python and Excel VBA for cleaning and aggregating interaction data, reducing manual errors by 30% and contributing to higher digital banking adoption.
Authored bi-weekly digital transformation reports for government agencies, providing strategic business insights across 5+ industries.