Greater Seattle Area
Highly analytical and results-driven AI/ML and Data Engineering professional with 7 years of experience designing, developing, and deploying end-to-end machine learning solutions at scale, with a focus on leveraging data to improve user-facing products. Proven expertise in modern GenAI and machine learning architectures, along with advanced skills in petabyte-scale ETL pipelines and data infrastructure. Adept at leading cross-functional teams and delivering innovative AI/ML solutions for complex business challenges. Passionate about using data as the voice of the customer to drive product impact through measurement and evaluation.
SKILLS
• CORE TECHNOLOGIES: Python, Java, SQL, Linux
• BIG DATA ECOSYSTEM: Apache Spark, Apache Airflow, Hadoop, Flink, Kafka, Delta Lake
• MODEL TRAINING: TensorFlow, PyTorch, Scikit-Learn, Spark MLlib
• DATA INFRASTRUCTURE: Distributed Systems, Real-time Streaming, Redshift, DynamoDB, S3, Glue
• AI EXPERTISE: Natural Language Processing, Deep Learning, Prompt Engineering, Hyperparameter Tuning, LLM Integration
• STATISTICAL ANALYSIS: Hypothesis Testing, A/B Testing, Prediction Analysis, Time Series Analysis
- Built a scalable data infrastructure that processes and analyzes multi-modal customer interaction data (text, voice, and click streams) to enhance our LLM's understanding of complex customer queries and thereby increase AI agents' success rates.
- Designed and implemented a distributed data processing framework that handles millions of customer-agent conversations daily to create high-quality training datasets for our LLMs.
- Developed an intelligent data pipeline that automatically identifies and extracts valuable conversation patterns, reducing LLM hallucination rates by 45% through better training data curation.
- Designed and implemented real-time, customer-facing recommendation data by clustering similar products based on item attributes. This streamlined decision-making for millions of global FBA sellers on Seller Central, leading to million-dollar cost reductions for Amazon FBA and removing dependencies on manual processes between Amazon FBA users and customer representatives. https://www.amazon-packaging.com/sellers
- Designed and developed an end-to-end data pipeline orchestration platform on Apache Airflow, which improved runtime by 37% for TP99 data pipelines at terabyte scale and enabled centralized management of data pipelines running on various AWS services, with a monitoring and alarming system that uses AI to dynamically predict and alert on anomalous patterns.
- Built and maintained machine learning infrastructure on AWS to serve repeated large-scale model training. Developed a modularized, parameterized user interface for self-service use by in-house data scientists, which significantly improved training data preparation, ML model deployment, and training orchestration.
- Automated metadata management and data retention across the data warehouse and data lake to identify personally identifiable information (PII) in accordance with global and local legal compliance requirements. Subsequently set up standard data governance procedures that operate at the PII granularity level.
- Operated and maintained petabyte-scale data infrastructure with AWS CDK for Amazon global packaging data, leveraging AWS data services such as Redshift, S3, Glue, and Step Functions, along with internal company developer tools.
- Developed and improved ETL pipelines in SQL on AWS Redshift to integrate and standardize multiple complex data sources, eliminating data silos and cutting the time spent generating daily business reports by 50%.
- Created customized API-based and streaming-based data pipelines that normalize external data sources from various third parties into structured data, using AWS services such as Step Functions and Lambda, so that internal analytics teams can understand users' packaging experiences.
- Developed and maintained ETL pipelines in SQL and Tableau Prep to integrate multiple data sources, such as HubSpot, Google Spreadsheet, and AWS, cutting the time spent generating patients’ health reports by 50%.
- Cleaned unstructured patient data to train a prediction model that identifies the root causes of lupus, using RDDs and PySpark in Jupyter Notebook, to reduce ramp-up time.
- Led by the Head of Product, built a product performance dashboard in Tableau Desktop for the company’s first pilot with an insurance company to fully present product effectiveness.
- Worked with the Marketing team and boosted B2C sales by 8% by providing insights derived from analysis and visualization of customer activity data on Mymee’s online platform, using Tableau Desktop.
- Designed KPIs and dashboards that let the Customer team quickly see the working status of 5 coaches, improving operational efficiency.
- Collaborated with the IT team to provide improvement plans for the Mymee mobile app’s user interface and added 3 needed features derived from observations of customer usage behavior in the app.
- Progressed from intern to a full-time position and cross-trained in all departments, gaining exposure to the company's management approach.
- Increased the reservation rate by 30% in 3 months by conducting market research and implementing tailored marketing tactics.
- Analyzed customer satisfaction survey results using SPSS and SAS to recommend business strategies for improving the customer retention rate.
- Responsible for A/B testing of marketing promotions in three branches in Haidian District, Beijing.
- Collected user requirements and cooperated with the SDE team to improve promotion performance, boosting click-through rate by 37%.
- Maintained the official social media account; wrote and posted 10+ articles, which increased exposure by 30k.