Toronto, Ontario, Canada
ELTL, Hevo + dbt, data modeling
Build data pipelines on GCP
• Collect data on an SFTP site and use replicator rules to relocate the raw data
• Process data with stable Spark pipelines that filter, clean, join related files, and aggregate at different levels, then save the files in cache for later use
• Load processed data for different end users such as ML models, reporting, and analytics
• Use Python libraries such as NumPy and math, along with self-defined (lambda) functions, to generate mathematical distributions of the data
• Design and execute A/B tests that aim for a 5% conversion lift, and report the results to clients with recommendations and guidelines
• Connect to cloud engines (AWS and GCP) and build Spark pipelines that are reliable and rerunnable
• Use Jenkins and Airflow to automate pipeline jobs
Wearable Optimizations
Build an interactive Tableau dashboard to understand sales of Samsung wearable products, and build a dynamic model to increase sell-out and keep Samsung competitive in the wearable market
• Used Tableau to capture the big picture of overall wearable device sales performance by region, product, time, and customer purchasing behavior over the past 5 years
• Clustered customers by customer value into 3 classes to identify potential wearable buyers, providing business insights to the marketing team for promotion decisions
• Analyzed product life cycles and past sales trends to predict future sell-out and help save promotion budget
• Enhanced the understanding of sell-out distribution by identifying the top-selling locations at the FSA level
• Achieved a better understanding of the customer base using demographic information (age, gender, and ethnicity) to provide tailored promotion strategies
Beam Data | Client: WeCloudData | Data Science Consultant
Learning Experience Optimizations
Build an end-to-end data pipeline that uses Stack Overflow data to return relevant posts based on students’ questions and improve their learning experience
• Collected question and answer text and its labels (Python, SQL, etc.) from Stack Overflow
• Applied regular expressions to clean the text and used a scikit-learn multiclass classification algorithm to predict the labels
• Created explicit sub-labels for the text with unsupervised topic modeling using LDA (Latent Dirichlet Allocation) to better categorize the text
Customer Propensity Model
Build a complex customer propensity ML model to understand Samsung customer purchasing and upgrading behavior
• Wrote SQL queries to extract purchase history, geolocation, and price promotion data from SQL Server
• Reduced the data imbalance rate by about 5% by identifying the target customers instead of using raw data
• Segmented customers who are more likely to upgrade their device in the next 3 months, with analytics focused on marketing campaign optimization
• Implemented tree-based classification models (Random Forest and XGBoost) to build the baseline model and used precision and recall to evaluate model performance
• Engineered new features and used a correlation map and Graphviz to determine the key features with the most impact on customers’ upgrading decisions
• Reported project progress to the Data Science team manager and documented feedback and changes in JIRA and Confluence weekly
• Found potential upgrading users by deploying the model and narrowed down the number by 49% with different filters to fit the budget