Data Scientist: Walmart Labs
February 2019 – Present
- Lifted mapping coverage of prescription claims to PBM schedules from 70% to 95%+ by building a rule-based engine in Hive and a multi-class classification model in Spark using Python, with 96%+ recall and precision
- Developed price elasticity and optimization models for regional pricing of Walmart pharmacy’s prescription drugs, increasing monthly profit by 30%; engineered GPU-accelerated computing for the optimization model using Numba in RAPIDS, outpacing the CPU version by 700x
- Structured an event detection system in Python for EBT transaction outage triage, combining anomaly detection and clustering (Isolation Forest and DBSCAN) on streaming data; developed an alert notification app using the Slack API
- Built automated web-based dashboards in R Shiny and Dash as a feedback loop to monitor model performance
Utilized: Python & Dash, R & Shiny, SQL, Spark, Hive, CUDA, Linux, Docker, Tableau
Research Assistant: Olin Business School
May 2018 – December 2018
- Modeled and predicted customer re-purchase time interval probabilities using a discrete hazard model in R and Stan
- Implemented an HMM to predict customers’ churn probability and re-purchase time intervals in R
Utilized: R, Stan, Probit, HMM, Monte Carlo simulation
Data Analyst Intern: Bates Forum
June 2018 – December 2018
- Boosted the open-comment classification model’s F1-score from 0.2 to 0.8+ by improving feature extraction with n-gram collocation, ranking-based spell-checking, and hypernymy LDA in Python
- Piloted network visualization in R to expedite workplace spatial layout analysis on department-level adjacency preference interview data (3x improvement)
- Architected an MS Access and SQL database to improve stewardship of quantitative survey data
- Built automated data pipeline in R for client-specific strategic workplace gap analysis dashboards
Utilized: Python, R, SQL, Power BI, Unix commands, NLP, Network analysis
Project I: NLP for open comments in Python
At the time, this was a novel solution for the company, and it started from very little.
We sent online questionnaires to clients through SurveyGizmo to collect feedback. The final question of the survey was open-ended and asked what changes they expected for their workplace. Previously, our strategists manually read through the comments (~500 monthly) and summarized the common topics that interviewees mentioned. I found that the summary topics fell into a limited number of categories that changed little from survey to survey. The manual process also consumed a large amount of our strategists’ time and made them much less productive.
To me, this looked like a natural use case for text classification: each comment is one data instance, and the target variable is its category label. We had multiple categories (e.g., Lighting, Furniture, Acoustic, HVAC, Communication), and a single comment could mention any combination of them, so this is actually a multi-label classification problem in NLP.
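The multi-label framing can be illustrated with a minimal sketch of the target encoding, where each comment maps to one binary flag per category. The category list reuses the examples named above; the helper names (`encode_labels`, `decode_labels`) and the sample comments are hypothetical and are not taken from the original project code.

```python
# Categories named in the text; the real project may have used a longer list.
CATEGORIES = ["Lighting", "Furniture", "Acoustic", "HVAC", "Communication"]


def encode_labels(labels, categories=CATEGORIES):
    """Turn a set of category names into a binary indicator vector."""
    unknown = set(labels) - set(categories)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if category in labels else 0 for category in categories]


def decode_labels(vector, categories=CATEGORIES):
    """Turn a binary indicator vector back into a list of category names."""
    return [category for category, flag in zip(categories, vector) if flag]


# Made-up comments, for illustration only.
examples = [
    ("The lights are too dim and the chairs are uncomfortable.",
     {"Lighting", "Furniture"}),
    ("It is always freezing near the windows.", {"HVAC"}),
]

# One row per comment, one column per category.
targets = [encode_labels(labels) for _, labels in examples]
```

With targets in this shape, one common option is to train one binary classifier per column, which is one way to simplify the problem when training data is small.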
Multi-label classification is particularly hard in this context for two reasons.
- Training data was limited: we had only approx. 3k comments, which makes it hard for a model to learn all the labels at the same time.
- The comments were open-ended and written completely freely. As a result, the data was noisy, with many spelling errors and odd punctuation.
I simplified the problem and applied better feature extraction techniques so that the results were good enough for production use.
Data Cleaning & Feature Extraction
This was an interesting project because we created a network visualization from interview data. Our strategists held face-to-face interviews with a stratified sample from each department and collected survey data from each department. Interviewees expressed adjacency preferences on behalf of their groups or departments. Each preference had one of four levels: “primary”, “secondary”, “no preference”, and “cannot be by”.
Inspired by graph search and Dijkstra’s algorithm, I split the unstructured survey data into two new files. One stores only department information, such as location and size; the other stores each department’s requests, i.e., its adjacency preferences. In a nutshell, I transformed the survey data into department info (nodes) and adjacency requests (edges). This let us use the igraph R package to create both static and interactive network plots.
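The node/edge split can be sketched as follows. The original work was done in R with igraph; this Python version is only an illustration, and every field name (`dept`, `floor`, `headcount`, `wants`, `level`) and every row of data is a made-up assumption.

```python
# Hypothetical flattened survey rows: one row per adjacency request.
survey_rows = [
    {"dept": "Finance", "floor": 3, "headcount": 40,
     "wants": "Legal", "level": "primary"},
    {"dept": "Finance", "floor": 3, "headcount": 40,
     "wants": "Sales", "level": "cannot be by"},
    {"dept": "Legal", "floor": 2, "headcount": 15,
     "wants": "Finance", "level": "secondary"},
    {"dept": "Sales", "floor": 1, "headcount": 60,
     "wants": "Legal", "level": "no preference"},
]


def split_nodes_edges(rows):
    """Split survey rows into department info (nodes) and requests (edges)."""
    nodes = {}  # department name -> attributes, stored once per department
    edges = []  # (requesting department, target department, preference level)
    for row in rows:
        nodes.setdefault(
            row["dept"],
            {"floor": row["floor"], "headcount": row["headcount"]},
        )
        edges.append((row["dept"], row["wants"], row["level"]))
    return nodes, edges


nodes, edges = split_nodes_edges(survey_rows)
```

The two structures correspond to the two files described above: a node table with one record per department and an edge table with one record per adjacency request, which is the shape most graph libraries accept as input.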
This visualization in Python is intended to show employees’ preferences for places in their workplace. The data behind the plot comes from the question, “please select the top 3 places that you like in your workplace on the floor plan”. It encourages employees to express their likes and dislikes about the workplace by simply clicking locations on the floor plan image provided in the online survey, instead of picking from a long drop-down menu.
By using Python to collect and clean the coordinate data from the survey and to create a heat map overlaid on the original floor plan, we could get real-time updates on the most popular areas that need to be maintained. The same logic could be applied to disliked places.
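The counting step behind such a heat map can be sketched as below, assuming clicks arrive as (x, y) pixel coordinates on the floor plan image. The function name `bin_clicks`, the grid size, and the sample clicks are hypothetical; drawing the grid over the floor plan would be done with a plotting library and is not shown.

```python
def bin_clicks(clicks, width, height, cols, rows):
    """Count clicks per grid cell of a width x height floor plan image."""
    grid = [[0] * cols for _ in range(rows)]
    for x, y in clicks:
        # Basic cleaning: drop clicks that fall outside the image.
        if not (0 <= x < width and 0 <= y < height):
            continue
        col = int(x * cols / width)
        row = int(y * rows / height)
        grid[row][col] += 1
    return grid


# Made-up clicks on a 400 x 300 pixel floor plan; the last one is invalid.
clicks = [(10, 10), (15, 20), (390, 290), (200, 150), (-5, 10)]
grid = bin_clicks(clicks, width=400, height=300, cols=4, rows=3)
```

Cells with higher counts would be drawn in warmer colors; running the same function on the clicks for disliked places gives the second map.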
Data Scientist Intern: Deloitte
December 2016 – June 2017
- Created 360-degree customer profiles for precision marketing by tidying 2M customer demographics and behavioral data in R, and trained a k-medoids model to discern high-value segments
- Implemented feature selection using filter methods, lasso, and principal component analysis, and constructed logistic regression and CART models on 40K+ customer-level credit data with 500+ predictors in R and SPSS Modeler
- Democratized data through WeChat feeds; increased unique visitors by 30% and page views by 4x in 7 days
Utilized: R, SPSS Modeler, Tableau, k-medoids, Feature engineering