Predicting Email Opens using Random Forest

December 30, 2016    R Prediction Random Forest Classification

This post documents my solution to the Predicting Email Opens challenge from the Machine Learning Codesprint competition conducted by HackerRank.

The problem statement is as follows:

We will provide you with metadata for emails sent to HackerRank users over a certain period of time. This metadata contains specific information about:

    • The user the email was sent to.
    • The email that was sent.
    • The user's reaction to the email, including (among other things) whether or not the user opened the email.
    Given the metadata for additional future emails, you must predict whether or not each user will open an email.
Dataset details:

Download the zip file (MD5 checksum is eaed376534b4b5efa464214d9838a139) provided here. The zip file contains files named training_dataset.csv, test_dataset.csv, and attributes.pdf. The files are organized as follows:

    • training_dataset.csv: This file contains the details for various emails sent. If an email was opened, then the value of the opened attribute is 1; otherwise, its value is 0.
    • test_dataset.csv: This file contains the test dataset. All fields relevant to the user's reaction to the email are missing in this dataset. You must predict the value of the opened attribute for each email.
    • attributes.pdf: This file contains definitions for all the attributes given in the dataset.
From this, we can conclude that this is a Classification problem. Also, click here for the complete description of the problem statement and attribute details.

Solution to Predicting Email Opens Challenge

Initially I started coding in Python, but later switched to R because I wanted to improve my R programming skills and did not want to limit myself to Python for predictive analytics. I attempted this contest a couple of months back, when I was still a novice in data analytics; with further experience you can easily improve this model.

Importing the libraries:
Here we import the necessary libraries. We use the data.table package, which plays a role similar to pandas in Python for fast tabular data manipulation.
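The code below uses fread and randomForest, so the imports are presumably:

```r
# data.table provides fread() for fast CSV reading and by-group operations
library(data.table)
# randomForest provides the randomForest() classifier used below
library(randomForest)
```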


Reading the data:
Read training and test data from the csv files using fread.

train <- fread('training_dataset.csv')
test <- fread('test_dataset.csv')

Training Dataset Quality:

The dataset was small enough to learn the model from and could be loaded fully into memory on a single machine. Many features, such as 'mail_id' and 'sent_time', play little role in deciding the outcome of the model. Other features, such as 'click_time', 'clicked', 'open_time', and 'unsubscribe_time', exist only in the training data and not in the test data, so they have to be removed from the training set. The dataset was also very sparse, as most values in features like the count of submissions in the last 1 day (and other short-window counts) were zero. Finally, many features contain missing values, including 'hacker_timezone', 'last_online', 'mail_type', and 'mail_category'.

Data Preprocessing:

  • Conversion of categorical features:
  • Boolean features like 'opened', which take the values true/false, are replaced by 1/0, and categorical features like 'mail_type' and 'mail_category' are replaced by their respective numeric codes.

    target <- train$opened
    target[target == 'false'] <- 0
    target[target == 'true'] <- 1
    target <- as.numeric(target)
    # the test set has no known 'opened' values; fill the column with NA
    test$opened <- NA
  • Removal of unwanted features:
  • Remove the features from the training data that are not present in the test data: "click_time", "open_time", "clicked", "unsubscribe_time", and "unsubscribed". Later you will see that only the features available in both sets are fed to the training model.

    # drop the columns that exist only in the training data, then stack the sets
    train[, c("click_time", "open_time", "clicked", "unsubscribe_time", "unsubscribed") := NULL]
    total <- rbind(train, test)
  • Errors in the training dataset and missing values:
  • Many features in the dataset contain missing values, such as 'hacker_timezone', 'last_online', 'mail_type', and 'mail_category', so they have to be replaced by a standard statistical measure like the mode, median, or mean.

    # most frequent non-missing value
    Mode <- function(x) {
      ux <- x[!is.na(x)]
      ux[which.max(tabulate(match(x, ux)))]
    }
    # mean of the non-missing values (NA if the column is entirely missing)
    Mean <- function(x) {
      ux <- x[!is.na(x)]
      if (length(ux) == 0) return(NA)
      mean(ux)
    }
  • Feature construction:
  • One huge uplift to the score came from constructing a single feature: the mean of 'opened' grouped by user. Merge train and test into one table, take the mean over each user, and then split them up again.

    total[ ,c("group_by_id_mean"):= Mean(opened), by = c("user_id")]
    train <- total[1:nrow(train), ]
    test <- total[-c(1:nrow(train)), ]
    train$hacker_confirmation <- as.integer(as.logical(train$hacker_confirmation))
    test$hacker_confirmation <- as.integer(as.logical(test$hacker_confirmation))
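As a minimal illustration of the by-group mean used above (with made-up toy data, not the contest dataset):

```r
library(data.table)

# hypothetical example: three emails for user A, one for user B
toy <- data.table(user_id = c("A", "A", "A", "B"),
                  opened  = c(1, 0, 1, 0))

# same pattern as above: per-user mean of 'opened'
toy[, group_by_id_mean := mean(opened), by = user_id]
print(toy)
# user A's rows get 2/3, user B's row gets 0
```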

Model Fitting and Prediction:

After feature construction I applied a Random Forest classifier for model fitting. I also tried other classifiers, such as Logistic Regression and AdaBoost, among which Random Forest gave the best result, and I tuned the random forest through its parameters, e.g. 'ntree' = 10.

modelFit <- randomForest(as.factor(opened) ~ group_by_id_mean + sent_time + last_online + hacker_created_at + contest_login_count + contest_login_count_1_days + contest_login_count_7_days + contest_login_count_30_days + contest_participation_count + contest_participation_count_1_days + contest_participation_count_7_days + contest_participation_count_30_days + submissions_count_contest + hacker_confirmation,
                         data = train, ntree = 10, importance = TRUE, na.action = na.roughfix)
model_pred <- predict(modelFit, test)
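Since importance = TRUE is set, it is worth checking which predictors the forest actually relies on; the randomForest package exposes this through importance() and varImpPlot():

```r
# per-feature importance measures (mean decrease in accuracy / Gini)
imp <- importance(modelFit)
print(imp)
varImpPlot(modelFit)   # quick visual ranking of the predictors
```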

Writing to file:
Write the model's predictions to a CSV file using write.csv.

write.csv(model_pred, file = "predR.csv")

In conclusion, I found that this model could be improved by using an XGBoost model. Further feature extraction, such as splitting 'sent_time' into date, time, etc., plus cross-validation on folds, would give better results. Nevertheless, a simple Random Forest is a good way to start.
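For example, if 'sent_time' is a Unix timestamp (an assumption; the actual encoding should be checked against attributes.pdf), it could be split into calendar features along these lines:

```r
# assuming sent_time is seconds since the epoch
ts <- as.POSIXct(train$sent_time, origin = "1970-01-01", tz = "UTC")
train$sent_hour    <- as.integer(format(ts, "%H"))  # hour of day, 0-23
train$sent_weekday <- as.integer(format(ts, "%u"))  # 1 = Monday ... 7 = Sunday
```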
