Data Projects

Alibaba Cloud Chinese NLP Challenge

February 2020 – March 2020

Top 8% on final leaderboard

Quarantined at home and faced abundant data regarding the prevailing coronavirus (COVID-19), I decided to experiment some new DS techniques on the data available. Coincidentally, my friend was looking for a teammate for a Chinese NLP competition hosted by Alibaba. The goal of the competition is to pair the similar semantic health/medical questions people searched on the Internet related to coronavirus.

Approx. 8k training data and 2k development data was provided. The data contains of four columns – the category of the questions, question_1, question_2 and label (1/0, 1 indicates semantics identical). We implemented different state-of-art pre-trained NLP models (BERT, NEZHA, ELECTRA, RoBERTa) on GCP VM and leveraged GPU computing. After data cleaning, data augmentation, K-fold Cross Validation, ensemble and fine-tuning, we got top 8% for our model prediction accuracy in the final leaderboard on the test dataset.

Utilized: Python (TensorFlow, Keras), Docker, GCP, Ali Cloud, BERT, NEZHA

Code Reference on Github



Geo-Location Clustering in Apache Spark

November 2018 – December 2018

A class team project for CSE427s – Cloud Computing and Big Data Application.Aim to improve the transportation experience and the operation efficiency of the taxi services in NYC, we choose the NYC Green Cab dataset from Kaggle, which contains 1,458,644 rows of green cab trip data in May 2018. Each individual trip record indicates the precise latitude and longitude of pick up location. In this project, we challenged ourselves to create a k-means clusters on only the longitude-latitude pairs.

Imported data from Amazon S3 and into Spark RDD, performed transformations and actions on RDDs. Set up k-means algorithm in pyspark from sample data locally (Cloudera pseudo cluster) to full data on Amazon EMR. Compared the clustering difference between Euclidean distance and Great Circle distance. Visualized the clustering result using Python bokeh and Google Map API.

Utilized: Spark,Python (bokeh, Google Map API), Unix, Java, svn, AWS (EC, EMR, S3), Cloudera, K-means (Euclidean & Great Circle Distance)

Code Reference on Github

Pre-processing and k-means set-up using PySpark on AWS

Data cleaning process could be done locally or through AWS EMR. After uploading a data cleaning Python script to S3, a step is added in EMR as the picture showed above. The first argument is the input and the second argument is the output. This step aims to select the columns (pickup_lon, pickup_lat) we wanted from the whole train data set.

After cleaning the data, run the k-means Python code for iteratively performing the clustering algorithm until converging.

How to choose hyper parameter k in the k-means?

K-means could be viewed as an optimization problem that given the number of clusters k, how to distribute the data points to different clusters to meet a certain criteria.

In our project, we minimize the inertia, which is the squared distance within each cluster. Set the number of clusters as the X axis and inertia as the Y axis, the elbow plot for this k-means is shown as follows. The optimal k is chosen to be 5 for 5 is at/near the changing point of the slope from steep to gradual and also computation and interpretation is taken into considerations.

\sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)

Based on the plot, k=5 is chosen

What we can learn from this application?

Euclidean distance VS Great Circle distance

Right is the initial result of using ‘Euclidean distance’ and left is the initial result of using ‘Great Circle distance’

There seems no noticeable difference between two distance calculation based on the cluster I got. The reason for it may due to the fact that when using different approaches to conduct the distance calculation, distance to centroids results for those dense data points in such small area is pretty much the same. In other words, in small areas, distance along the surface of the Earth (Great Circle distance) is approximated by straight line distance (Euclidean distance).



October 2018 – December 2018

This is one of my projects I did during my internship at Bates Forum. Using open comment surveys is one of the quantitative ways to analyze employees’ opinions about their workplace. Workplace strategists have accustomed to reading comments line by line and bucketing them into different categories by themselves. It is inefficient and unrepeatable. I am developing a tool to do text classification by machine instead of ‘labor intensive’ to decrease classification subjective bias and save more time.

Created SQL database for open comment text data stewardship. Performed text cleaning, tokenization, feature extraction in Python. Built Tree-based models for prediction, reaching 80%+ F1-score for some of the buckets. Currently, trying to use BERT to boost the performance and build BERT Python-friendly interface.

Utilized: Python, Shell, Unix, BERT, Neural Network, CART(Decision Tree, Random Forest, Boosted Tree), NLP(spell-checking, n-gram, wordnet, hypernymy, LDA, text classification)

Work in progress…


2018 Teradata Data Challenge

 January 2018 – October 2018

🏆Global Overall Winner🥇

Do you know about Multiple Sclerosis (MS)? Bike MS is the National MS Society’s largest fundraising campaign, engaging over 70,000 participants to raise $68 million in over 75 rides across the country. It is the largest charity cycling series in the United States.

Over the years, the charity cycling space has become very crowded with many events vying for the same audience. Bike MS participants (both new and old) have been having a steady decreasing trend since the peak in 2012, and thus causing the revenue to decline. Bike MS would like to find ways to increase new participant acquisition as well as retain registered participants to reverse the status quo.

The story comes in twofold: acquisition and retention. We used k-means clustering analysis to help Bike MS re-segment the city markets by three factors into new five tiers and built random forest to discover five KPIs for captain churn prediction. We also estimated by doing so, we could not only acquire more than 49% of the new participants and also increase Bike MS profit by 180%.

Utilized: R, SQL, Tableau, Google Analytics, K-means, Random forest, Feature engineering, Logit, LPM


For higher resolution of the poster:


*Note: it may take a while for the webpage to load pdf

Teradata photo

For more information about the competition,

2018 Teradata Data Challenge


We conducted a descriptive analysis using SQL to pull data from Teradata database, discovering that on average, 92% of the potential participants left the registration process at the first stage of acquisition funnel. Therefore, we believed that Bike MS launch marketing campaign to the wrong cities. So, we chose three predictors by feature engineering,

  • (#) Number of participants
  • ($) Fundraising per participant
  • ($) Cost per acquisition (CPA)

for each city to run K-means clustering analysis (K is set by “elbow point”) to provide a better target segment for Bike MS to better allocate their limited resources. And plotted this three dimensional clustering plot using Tableau.

Teradat k-means

The X-axis is CPA and the Y-axis represents the average fundraising per participants and the size of the bubble shows the number of the participants in this city. The bigger the higher. So, if the bubble is at the upper-left corner and the size is large will be a promising city.

k-means tier

Why use K-means?

Great question! The simple answer would be k-means is easy to implement in R, straightforward to be understood and thus powerful. The sophisticated one would be first, facing this business problem of market segmentation, we do not have a label in the data. And we did not create one because Bike MS’ strategy for the moment was not focusing on one specific business goal that could be quantified but evaluating different city markets. Second, follow Ockham’s razor principle.


Teradata retention

We made a biased but concrete decision that we only focus on retaining team captains. Captains are those participants who are team organizers, coordinators. After EDA, we discovered that comparing to a normal participant, captain has a higher fundraising per participant (10x) and lower CPA (0.2x). Hence, retaining captain is much more cost-effective.

Teradata Model

We lagged the data and created the binary label for churn. Based on the flat file, we used 2016’s captain data to predict captain churn in 2017 using three models and compared the results. 5 out of 83 predictors were standing out and chosen to be KPIs. They carried statistical significance, strongest predicative ability and actionable business insights. Therefore, on the retention part, we built a predictive tool for Bike MS that given a captain profile, our model could return this captain’s churn probability. Thus, we could rank the captains according to their churn rates in a descending order. And then Bike MS could lay emphasis on those captains with super high churn rate and try to retain them. In this way, we are optimizing our resources allocation and making the most productive effort.

Love to talk more about the lessons learned through this project if interested! 


Wells Fargo Advisors Practicum Project

January 2018 – May 2018

Wells Fargo Advisors provides investment advice and guidance to clients through nearly 14,600 Financial Advisors and referrals from around 4,300 Licensed Bankers in retail branches across the U.S. Wells Fargo Advisors administers $1.6 trillion in client assets.

With huge amounts of customer transaction data for each financial products, Wells Fargo Advisors (WFA) is curious about what insights could be drawn from the data. We are tasked to discover the key factors that drives product-level demand, and using these factors to build predicative models to develop product-level demand forecasts.

By Cleaning and transforming 6Gb+ customer-level monthly financial products transaction data in R, our team would be able to construct product-specific data matrices. Based on EDA, we tuned logistic regression to predict customer purchase behavior (0.85+ recall rate on average) and discovered key demand drivers and identifying opportunities for each financial product. Finally, we designed data-oriented business application for future prediction on both individual customer value and aggregated product demand in a given period by ensemble learning a logistic regression.

Utilized: R, SQL, Tableau, Feature selection, Predictive analytics, Logistic regression, Lasso

Code Reference on Github


By building a Logit model, we could answer the question of, given last month’s customer profile, what will be the probability that this account would purchase or sell a product this month?

WFA propensity model

We are provided with 4 data tables, which are the account information (AccountID as key and account identity is protected), Position table, Activity table and the FA (Financial Advisors, FA data is also protected) table. And I joined four tables into one account flat file as the input for the model, the DAG is shown below.

WFA data prep

The biggest challenge I had during the modeling was not performing a Logit, but understanding the relations between each tables our client provided and merged them into one with correct assumptions in R.

I will use one of the equity fund as a showcase, for confidentiality reason, the real number will be protected. We made the prediction using the Lasso and Logit, we could discern the driving factors for this product by the summary of our tuned model.

WFA Driving factors

Then we sum the probabilities of the prediction result as the prediction for the number customer that made purchase in a given month. The graph is made in Tableau.

WFA equity fund propensity

The prediction for the purchase side is more accurate than the sell side, partly because we have more data for purchase, i.e., purchase is a more frequently observed activity than sell for WFA. With a rather accurate prediction of whether this customer will make a purchase or not, we’d also like to know a question of, given this customer made a purchase, how much she/he would spend on this product? So, we built a conditional amount model for each product.

WFA conditional amount model

As we could see that the model input is the same as the propensity model, but the dependent variable was changed to the dollar amount that spent on certain product. So, under the Central Limit Theorem, we have the assumption that the customer would spend approximately the mean of the amount of money that previous purchases have been made. Hence, y = mean(dollar\, amount\,|\, purchase\, made).

WFA conditional amount model

Combining the propensity model and conditional amount model by multiplication, we can predict the dollar amount for each product in a given month. And with these analyses at hand, we came up with business applications for WFA.

Love to talk more about the lessons learned through this project if interested! 

TLDR...(very detail)

In-class Kaggle Sales Prediction Competition

October 2017 – December 2017

🏆1st out of 45

Imagine a grocery chain with multiple stores. In each store, they sell numerous products. The goal is to predict sales of 24 product categories for one store and the prediction accuracy will be judged based on MSE criterion.

Employed feature engineering techniques to enhance the model prediction accuracy (reduce MSE). Consolidated different model result to increase the robustness of the model.

Utilized: R, Feature extraction, Statistical Models (Linear regression, Lasso, Tree-based models, MARS), Ensemble learning

Code Reference on Github

In a nutshell, the challenge is to predict 24 products’ sales (Y) using 4 predictors (X) per product.

For a high level analysis, twenty-four product categories are considered. For each of the above listed product categories, weekly data were collected on the following variables: Unit Price (P), Promotion (PR), Display (D), Feature (F) and Volume of Sales (Y). Each row in the data represents a week.

in-class Kaggle

Data at a glance

Following are the product categories:

1.Carbonated Beverages, 2. Cigarettes, 3. Coffee, 4. Cold Cereal, 5. Deodorant, 6. Diapers, 7. Face Tissue, 8. Frozen Dinner Entree, 9. Frozen Pizza, 10. Hot Dog, 11. Laundry Detergent, 12. Margarine & Butter, 13. Mayonnaise, 14. Mustard & Ketchup, 15. Paper Towel, 16. Peanut Butter, 17. Shampoo, 18. Soup, 19. Spaghetti Sauce, 20. Sugar Substitute, 21. Toilet Tissues, 22. Tooth Paste, 23. Yogurt, and 24. Beer.

How to fine tune the model?

  • Other product’s performance matters – Use all features to predict each product’s sale
    • With the assumption that for product A, other products’ (such as B, C, D and etc.) features will have influence over A. So, I used all the features to predict each product’s sales, it increases the prediction accuracy as expected (results in a decrease in MSE).
  • Scan*Pro – take log transformation to the predictors and response
    • This is a marketing regression model that uses log(sales) as response and log(price) as one of predictors.
    • The coefficients of the log(price) is called price elasticity. Using a log transformation in the regression will not only make the linear regression more credible since price and sales will not be negative values but also help us normalize the distribution of the price and sales (reduce the difference) among different products and among different predictors.
  • Model Ensemble – average model results to reduce MSE
    • With the assumption that regression to the mean and some model may be better performing certain products and some other model may be good at others, so by taking the average of some good model results will have a more robust result.
search previous next tag category expand menu location phone mail time cart zoom edit close