This is documentation for the Predicting Email Opens challenge of the Machine Learning CodeSprint competition conducted by HackerRank.
We are provided with metadata for emails sent to HackerRank users over a certain period of time, describing each email and its recipient.
Download the zip file (MD5 checksum is eaed376534b4b5efa464214d9838a139) provided here. The zip file contains the files training_dataset.csv, test_dataset.csv, and attributes.pdf.
Solution to the Predicting Email Opens Challenge
Initially I started coding in Python, but later shifted to R because I wanted to improve my R programming skills and did not want to limit myself to Python for predictive analytics. I attempted this contest a couple of months back, when I was still a novice in data analytics; with further experience you can easily improve this model.
Importing the libraries:
Here we import the necessary libraries. We use the data.table package, which plays a role in R similar to pandas in Python (fast tabular data handling), and randomForest for the classifier.
library(utils)         # base utilities (attached by default; kept for explicitness)
library(data.table)    # fast CSV reading (fread) and by-group operations
library(randomForest)  # random forest classifier and na.roughfix imputation
Reading the data:
Read the training and test data from the CSV files using fread.
train <- fread('training_dataset.csv')
test <- fread('test_dataset.csv')
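After loading, a quick sanity check is worthwhile (my own diagnostic aside, not part of the original pipeline):
dim(train)   # number of rows and columns
str(train)   # column names and the types fread inferred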
Training Dataset Quality:
The dataset is small enough to load fully into memory on a single machine, and large enough to learn the model. Several features, such as 'mail_id', play no role in deciding the outcome. Several others exist only in the training data and not in the test data, such as 'click_time', 'clicked', 'open_time', and 'unsubscribe_time'; these have to be removed from the training data. The dataset is also very sparse: most values in features like the one-day submission count and the other short-window counts are zero. Finally, many features contain missing values, for example 'hacker_timezone', 'last_online', 'mail_type', and 'mail_category'.
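A quick way to quantify that missingness before preprocessing (a small diagnostic of my own, not from the original write-up):
sapply(train, function(col) sum(is.na(col)))  # NA count per column
Note that by default fread may leave empty strings ("") rather than NA in character columns, so a check like sum(train$mail_type == "") can also be useful.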
Data Preprocessing:
Boolean features such as 'opened' (true/false) are replaced by 1/0, and categorical features such as 'mail_type' and 'mail_category' are encoded as numbers (a sketch of that encoding appears after the merge below).
target <- train$opened
target[target=='false'] <- 0   # encode the boolean labels as 0/1
target[target=='true'] <- 1
target <- as.numeric(target)
train$opened <- target
test$opened <- NA_real_        # 'opened' is unknown for the test set
Remove the features from the training data which are not in the test data: 'click_time', 'open_time', 'clicked', 'unsubscribe_time', and 'unsubscribed'. Later you will see that only features available in both sets are fed to the training model.
train[,c("click_time","open_time","clicked","unsubscribe_time","unsubscribed"):=NULL]
total <- rbind(train, test)  # stack train and test so transformations apply consistently
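The numeric encoding of 'mail_type' and 'mail_category' mentioned earlier is not shown in the original code; a minimal sketch of one way to do it on the combined table (assuming the raw columns are strings) is:
total[, mail_type := as.integer(as.factor(mail_type))]          # arbitrary integer codes
total[, mail_category := as.integer(as.factor(mail_category))]
Doing this on the combined table keeps the codes consistent between train and test.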
Many features in the dataset contain missing values, for example 'hacker_timezone', 'last_online', 'mail_type', and 'mail_category', so they have to be replaced by some standard statistical measure such as the mode, median, or mean. Two small helpers compute the mode and the NA-ignoring mean:
Mode <- function(x) {
  ux <- unique(x[!is.na(x)])             # candidate values, ignoring NAs
  ux[which.max(tabulate(match(x, ux)))]  # most frequent value
}
Mean <- function(x) {
  if (all(is.na(x))) {
    return(NA)                # nothing to average
  }
  x <- as.numeric(x)
  mean(x[!is.na(x)])          # mean of the non-missing values
}
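The write-up does not show the helpers being applied; a hedged example of how they could be used for the imputation described above:
total[is.na(hacker_timezone), hacker_timezone := Mode(total$hacker_timezone)]  # categorical gap -> mode
total[is.na(last_online), last_online := Mean(total$last_online)]              # numeric gap -> mean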
One huge uplift to the score came from constructing a feature: the average of 'opened' grouped by user. We take the mean of 'opened' over each user on the merged table, then split train and test apart again. Because 'opened' is NA for the test rows, Mean() computes each user's average from the training rows only; users who appear only in the test set get NA, which na.roughfix imputes later.
total[ ,c("group_by_id_mean"):= Mean(opened), by = c("user_id")]
train <- total[1:nrow(train), ]     # first rows: the original training set
test <- total[-c(1:nrow(train)), ]  # remaining rows: the test set
train$hacker_confirmation <- as.integer(as.logical(train$hacker_confirmation))  # true/false -> 1/0
test$hacker_confirmation <- as.integer(as.logical(test$hacker_confirmation))
Model Fitting and Prediction:
After feature construction I applied a random forest classifier for model fitting. I also tried other classifiers such as logistic regression and AdaBoost; among these, the random forest gave the best result, and I tuned it via its main parameters such as ntree and mtry.
modelFit <- randomForest(
  as.factor(opened) ~ group_by_id_mean + sent_time + last_online +
    hacker_created_at + contest_login_count + contest_login_count_1_days +
    contest_login_count_7_days + contest_login_count_30_days +
    contest_participation_count + contest_participation_count_1_days +
    contest_participation_count_7_days + contest_participation_count_30_days +
    submissions_count_contest + hacker_confirmation,
  data = train,
  ntree = 10,                # small forest; trains quickly
  mtry = 5,                  # predictors tried at each split
  importance = TRUE,         # record variable importance
  na.action = na.roughfix)   # rough-impute NAs (median/mode) before fitting
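Since importance=TRUE was passed, you can inspect which features the forest relied on (a diagnostic aside, not part of the original pipeline):
importance(modelFit)   # per-feature mean decrease in accuracy / Gini
varImpPlot(modelFit)   # visual ranking of the same measures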
# predict.randomForest does not accept NAs in newdata, so rough-impute the
# remaining missing predictor values the same way before predicting
pred_vars <- setdiff(all.vars(modelFit$terms), "opened")
model_pred <- predict(modelFit, na.roughfix(test[, ..pred_vars]))
Writing to file:
Write the model predictions to a CSV file using write.csv.
write.csv(model_pred, file = "predR.csv")
In conclusion, I found that this model can be improved by using an XGBoost model. Further feature extraction, such as splitting 'sent_time' into date and time components, plus cross-validation on folds (sketched below), would give you better results. Nevertheless, a simple random forest is a good way to start.
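As one concrete direction, here is a minimal sketch of the k-fold cross-validation mentioned above (my own illustration with a deliberately shortened formula; it assumes the preprocessed train table from earlier):
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(train)))   # random fold assignment
acc <- sapply(1:k, function(i) {
  fit <- randomForest(as.factor(opened) ~ group_by_id_mean + contest_login_count,
                      data = train[folds != i], ntree = 10,
                      na.action = na.roughfix)
  held <- na.roughfix(train[folds == i, .(group_by_id_mean, contest_login_count)])
  # predictions are factors ("0"/"1"); == compares them against the 0/1 labels
  mean(predict(fit, held) == train[folds == i]$opened)
})
mean(acc)   # cross-validated accuracy estimate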