Yini Zhang

Senior Data & Applied Scientist at Microsoft

Seattle, Washington, United States

About

A self-motivated data scientist with 2+ years of working experience in machine learning and statistical analysis with Python, R, and MATLAB; 5 years of advanced academic experience in predictive modeling, including 2 years of machine learning experience on the Hadoop/Spark big data platform; and 2 years of database experience with SQL/Hive. Highly skilled in machine learning algorithms, including classification, clustering, and regression. Hands-on experience in image recognition with neural networks and TensorFlow. A team player and fast learner with a great passion for learning new technologies and algorithms.

Skill sets:
  • Programming Languages: R (proficient), Python (proficient), MATLAB, RStan, Apache Spark (PySpark), Hadoop
  • Machine Learning Skills: TensorFlow, Apache Spark MLlib, Keras, nltk, scikit-learn
  • Visualization Libraries/Tools: ggplot2 (proficient), matplotlib, RShiny, Plotly, Leaflet, LDAvis, TensorBoard
  • Database Systems: MySQL, MySQL Workbench, Hive, HBase
  • Software Skills: Jupyter Notebook (proficient), Anaconda (proficient), RStudio (proficient), SPSS
  • Documentation: Markdown (proficient), LaTeX (proficient)
  • Version Control: Git, GitHub Desktop
  • Operating Systems: Linux/Unix, macOS, Windows
  • Also familiar with: Tableau, SAS

Experience

  • Microsoft (Full-time · 8 yrs)
    • Senior Data & Applied Scientist
      Mar 2022 - Present · 4 yrs 4 mos

    • Data & Applied Scientist
      Jul 2018 - Mar 2022 · 3 yrs 9 mos

  • Columbia Business School (1 yr 3 mos)
    • Research Assistant
      Oct 2016 - Dec 2017 · 1 yr 3 mos

      Project: NLP Aspect Analysis

      This project aimed to extract key information from large volumes of unstructured text in order to identify respondents' reasons for choosing green energy or traditional energy, and to analyze how those reasons influenced their final choices. All data were in text format and collected from online surveys. I joined the project at the very beginning and was responsible for all of the text mining. Starting with cleaning and tokenizing the messy raw survey text, I extracted features at the respondent level through topic modeling and sentiment analysis of the corpus, then used the topic scores and sentiment scores to train a logistic regression model to predict choices. The analysis was carried out in R and Python.

      Responsibilities:
      1. Preprocessed and tokenized messy text data with R libraries such as ‘tidyr’, ‘tidytext’ and ‘dplyr’ to transform the data set into a clean format.
      2. Conducted sentence-level sentiment analysis using the ‘nltk’ library in Python and visualized the distribution of sentiment with violin plots.
      3. Implemented LDA topic modeling with cross-validation in R and evaluated the results by mean perplexity to choose the most reasonable number of topics.
      4. Explored the cleaned data with visualizations such as word clouds and bar plots, and with interactive D3 visualizations of topic-term relations using the R library ‘LDAvis’.
      5. Analyzed emotion differences between two groups of datasets based on the NRC sentiment lexicon, and visualized relations between emotions with dendrograms.

    • Research Assistant
      Oct 2016 - Dec 2017 · 1 yr 3 mos

      Project: Time Preference Model Parameter Simulation

      This project aimed to estimate time preference model parameters with two approaches, RStan and MATLAB, and to compare the results. I wrote RStan scripts to build a Bayesian hierarchical model, refactored MATLAB code into functions, ran multiple simulations and compared the results with SPLOM plots, and published detailed documentation for both the RStan and MATLAB code on the lab website. The simulations were run from the command line on a Linux system. The project was carried out in R and MATLAB.

      Responsibilities:
      1. Built a Bayesian logistic regression model in RStan that ran 10 times faster than the MATLAB version.
      2. Added new functions to the lab’s R library ‘decider’ that draw SPLOM plots with more features and more complex visualizations.
      3. Visualized the distributions of parameters simulated with MATLAB and RStan using SPLOM plots for comparison.
      4. Produced analytical reports and MATLAB/R code documentation with R Markdown (HTML/LaTeX) and reported to the lab manager.

    • Research Assistant
      Oct 2016 - Dec 2017 · 1 yr 3 mos

      Project: Meta-Analysis on Query Theory Research

      This project aimed to compare the results of multiple studies in the field of Query Theory. My job was to collect data from research databases by web scraping, derive Cohen's d and its variance from the raw data of each research paper, and draw forest plots to compare results. The analysis was carried out in Python and R.

      Responsibilities:
      1. Pulled titles, authors, and keywords from a research paper database by web scraping in Python.
      2. Conducted hypothesis testing (chi-square and t-test) on raw data collected from each research paper and calculated the effect size, Cohen’s d.
      3. Built a random-effects model in R using the library ‘metafor’, and visualized model results with forest plots and funnel plots.
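
As an illustration of the effect-size step in this project, here is a minimal Python sketch of how Cohen's d and its variance can be derived from two groups of raw scores and then pooled across studies. It is not the project's code: the function names `cohens_d` and `pool_random_effects` are hypothetical, the variance formula is the common large-sample approximation, and the pooling uses the DerSimonian-Laird estimator as a simple stand-in, since the profile does not say which estimator was used with ‘metafor’.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Return (d, var_d) for two independent samples of raw scores."""
    n1, n2 = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group_a)
                  + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
    # Common large-sample approximation to the sampling variance of d.
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return d, var_d


def pool_random_effects(effects, variances):
    """DerSimonian-Laird pooling (an assumed, simplified estimator).

    Returns (pooled_effect, standard_error, tau2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * di for wi, di in zip(w, effects)) / sum(w)
    # Cochran's Q measures how much the studies disagree.
    q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * di for wi, di in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2
```

In use, `cohens_d` would be called once per study on that study's two groups, and the resulting lists of effects and variances passed to `pool_random_effects`; the per-study values are also what a forest plot displays.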

  • Data Analyst at Beijing Municipal Bureau of Civil Affairs
    Sep 2013 - Aug 2014 · 1 yr

    This was a part-time job I held while pursuing my Master's degree in Quantitative Economics. I handled SQL queries and data cleaning, built multiple time series models to analyze long-term and short-term stochastic trends, predicted future financial expenses with an ADL model, and assisted in writing the final analytical reports.

    Responsibilities:
    1. Fetched and cleaned financial data from the database by writing SQL scripts.
    2. Built a cointegration model and a vector error correction model to model long-term and short-term relations between the real economy and the financial market.
    3. Implemented an autoregressive distributed lag (ADL) model in SPSS to predict the Bureau’s financial budget for the next 5 years.
    4. Wrote analytical reports and presented results to 5+ Bureau staff in weekly meetings.