Machine Learning / Predict the Customer segment in Advertising - Top 1% in IndiaHacks Machine Learning Competition

Indiahacks Machine Learning competition is an All India machine learning competition conducted once in a year. I participated in the qualification round and secured 6th position(out of 6000 participants), which is Top 1%. Only top 60 participants were selected to participate in offline zonal round. However, I was unable to participate in zonal round since I was traveling.

Note: Code is not production ready yet, so not sharing it on github. Will share it when I get some free time.

The challenge was to predict the segment(pos, neg) based on the given features:

ID: unique identifier variable
titles: format “title:watch_time”, titles of the shows watched by the user and watch_time on different titles
genres: same format as titles
cities: same format as titles
tod: total watch time of the user spread across different time of days (24 hours format)
dow: total watch time of the user spread across different days of week (7 days format)

Model Pipeline:

Segment Hotstar Competition Pipeline

Features extracted from text variables:

Titles variable: I used word embedding using word2vec, a deep learning technique which maps similar words to context after trying Bag of Words. Word embedding improved my validation score significantly.

Tod variable: Several features were extracted from total watch time column out of which I ended up using the following features:

tod_median_time
tod_min_time
dow_max_time
watch time counts at hours (0 to 23) mapped from t0 to t23
tod_start
tod_end
Days

Cities variable: Several features were extracted out of which I ended up using the following features:

cities_min_time
cities_count

Genres variable: Extracted the genres from this column and mapped each genre as a binary feature.

Apart from these, I extracted the following features:

genres_min_time
genres_max_time
genres_mean_time
genres_median_time

Other text related features which improved by validation score are:

titles length
titles count
cities strings

Features extracted from numeric columns:

dow variable:

Binary features for each day of the week starting from Monday to Sunday.
Watch time spent on each day from Monday to Sunday.

Model Evaluation:

I tried several different models which produced the following scores:

Model	Score
Logistic Regression	0.70
Linear Discriminant Analysis	0.79
K Nearest Neighbors	0.59
CART	0.55
AdaBoost	0.79
Gradient Boosting	0.80
Random Forests	0.69
Extra Trees Regressor	0.70
LightGBM(after hyper parameter tuning)	0.822
Xgboost(after hyper parameter tuning)	0.821

I ended up using LightGBM to generate my predictions.

Hyper parameter tuning:

Used hyperopt to automatically find the right hyper parameters which improved my validation score for LightGBM model

The competition rewarded contestants who did feature engineering.

Things that I tried which din't work:

Bag of Words approach.
Dimensionality reduction on Word2vec features.
Several extracted numerical and text features.

Machine Learning / Predict the Customer segment in Advertising - Top 1% in IndiaHacks Machine Learning Competition

Model Pipeline:

Features extracted from text variables:

Features extracted from numeric columns:

Model Evaluation:

Hyper parameter tuning:

Things that I tried which din't work:

Published

Category

Tags

Contact