Aggregate by column

aggregate(col_name ~ by_column_name, data = df_name, FUN = mean)

We can use other functions in place of mean. The variable after the tilde is the grouping variable: the aggregation is computed once for each of its values. For example, if by_column_name is continent, the code above calculates the mean of the given column (e.g. age) for each continent.
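
A minimal sketch with a made-up data frame (the columns age and continent are purely illustrative):

df <- data.frame(age       = c(23, 35, 41, 29, 52, 38),
                 continent = c("Asia", "Europe", "Asia", "Africa", "Europe", "Africa"))
aggregate(age ~ continent, data = df, FUN = mean)  # mean age per continent
#   continent  age
# 1    Africa 33.5
# 2      Asia 32.0
# 3    Europe 43.5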

Nearest Neighbors

library(class)
pred_classes <- knn(train = train_data, test = test_data, cl = vector_train_labels)  # cl = training labels; k defaults to 1
  • Confusion Matrix

    table(actual_classes, pred_classes)
    

    This cross-tabulates the actual and predicted classes, giving a confusion matrix of observation counts.

  • Choosing the value of k

    Larger values of k smooth over noise but can blur the distinction between classes, whereas smaller values of k follow individual noisy observations and tend to overfit. There is no fixed rule for which value of k to use; a common starting point is the square root of the number of training observations (see the sketch below).
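
A minimal end-to-end sketch on the built-in iris data (the variable names mirror the snippets above; the square-root heuristic for k is a common starting point, not a rule):

library(class)

set.seed(42)                                    # make the split reproducible
train_rows <- sample(nrow(iris), 100)

train_data          <- iris[train_rows, 1:4]    # numeric feature columns only
test_data           <- iris[-train_rows, 1:4]
vector_train_labels <- iris$Species[train_rows]
actual_classes      <- iris$Species[-train_rows]

k_val        <- round(sqrt(nrow(train_data)))   # sqrt heuristic: k = 10 here
pred_classes <- knn(train = train_data, test = test_data,
                    cl = vector_train_labels, k = k_val)

table(actual_classes, pred_classes)             # confusion matrix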

Bayesian Methods

library(naivebayes)
bayesian_model <- naive_bayes(formula, data)
bayesian_model <- naive_bayes(y ~ x, data = df_name)
bayesian_model <- naive_bayes(y ~ x1 + x2, data = df_name)

When using multiple columns as predictors for an outcome y, combine them in the formula as in the last line above.

  • Binning for numeric data

    Numeric predictors can be binned into categories (for example with cut()), which often makes them easier and more efficient for naive Bayes to handle.

  • Laplace Correction

    Adds a small constant to each count so that a conditional probability of zero (an event never seen in the training data) does not drive the whole product of probabilities to zero. With no correction, any unseen feature/class combination would be treated as impossible, which rarely generalizes. The sketch after this list combines binning and the Laplace correction.
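
A minimal sketch combining binning and the Laplace correction on the built-in iris data (the bin breaks and laplace = 1 are arbitrary illustrative choices):

library(naivebayes)

df_name <- iris
df_name$petal_bin <- cut(df_name$Petal.Length,          # bin a numeric column into categories
                         breaks = c(0, 2, 5, Inf),
                         labels = c("short", "medium", "long"))

# laplace = 1 adds one pseudo-count so an unseen bin/class combination
# does not get probability zero
bayesian_model <- naive_bayes(Species ~ petal_bin, data = df_name, laplace = 1)
predict(bayesian_model, newdata = df_name["petal_bin"])  # predicted classes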

Binary Predictions with Regression

model_var <- glm(var1 ~ var2 + var3, data = df_name, family = "binomial")
predicted_obj <- predict(model_var, newdata = new_df, type = "response")

For binary predictions we set family = "binomial" (logistic regression); by default, glm() uses the gaussian family. With type = "response", predict() returns probabilities between 0 and 1.
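
Because type = "response" returns probabilities, a common follow-up is to cut them at a threshold to obtain class predictions (0.5 below is purely an illustrative choice, and it assumes new_df also contains the actual outcome var1):

predicted_prob  <- predict(model_var, newdata = new_df, type = "response")
predicted_class <- ifelse(predicted_prob > 0.5, 1, 0)   # probability above 0.5 -> class 1
table(new_df$var1, predicted_class)                     # confusion matrix against the actual outcomes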

ROC Curves

  • A larger area under the ROC curve (AUC) generally indicates a better model, but on its own it can be misleading.
  • Two models can have a similar AUC while making very different predictions, because their ROC curves can have different shapes.
library(pROC)
roc_obj <- roc(actual, predicted)
plot(roc_obj, col = "some_color")  # col sets the curve color
auc(roc_obj)

roc() creates the ROC curve object and auc() returns the area under it.

Dummy Coding Categorical Data

  • Handling categorical data for linear regressions

        factor(df_name$col_name,
                levels = c(0, 1, 2),
                labels = c("Small", "Medium", "Large"))
    
  • Relevel factors

        relevel(factored_col, ref = "some_val")
    

    relevel() sets the reference level of a factor: the level named by ref is moved to the front and all other levels follow it. In a regression, the reference level is the baseline that the other levels' coefficients are compared against (see the sketch below).
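
A minimal sketch of dummy coding plus releveling before a regression (the data and the choice of "Medium" as reference are illustrative):

df_name <- data.frame(col_name = c(0, 1, 2, 1, 0, 2),
                      price    = c(10, 15, 22, 14, 9, 25))

df_name$size <- factor(df_name$col_name,
                       levels = c(0, 1, 2),
                       labels = c("Small", "Medium", "Large"))

df_name$size <- relevel(df_name$size, ref = "Medium")   # "Medium" becomes the baseline

lm(price ~ size, data = df_name)   # coefficients for Small and Large are offsets from Medium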

Automatic Feature Selection

This should only be used when there is little or no domain knowledge to guide which variables to include.

  • Stepwise Regression

    Backward elimination builds the full model with all independent variables and then removes them one at a time when they contribute little to the model.

    null_model <- glm(dep_var ~ 1, data = df_name, family = "binomial")
    full_model <- glm(dep_var ~ ., data = df_name, family = "binomial")
    step_model <- step(full_model, scope = list(lower = null_model, upper = full_model), direction = "backward")
    
  • Forward Selection

    Forward selection starts from the intercept-only model and adds independent variables one at a time, stopping when no addition gives a meaningful improvement in the model.

    null_model <- glm(dep_var ~ 1, data = df_name, family = "binomial")
    full_model <- glm(dep_var ~ ., data = df_name, family = "binomial")
    step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
    

Prediction using Models

predicted_obj <- predict(model_name, newdata = new_df, type = "prob")  # class probabilities
predict(model_name, newdata = new_df)                                  # predicted classes (default type)
attr(predicted_obj, "prob")                                            # probabilities stored as an attribute, where supported

Depending on the model and the type argument, predict() returns either the predicted classes or the probabilities associated with them. Some functions instead attach the probability to the returned predictions as an attribute named prob, which you extract with attr().
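
For example, knn() from the class package returns the predicted classes directly, and when called with prob = TRUE it attaches the winning-class proportion as a prob attribute (this illustrates the attr() pattern above; it is not how every model reports probabilities). The names reuse the nearest-neighbours sketch earlier:

library(class)

pred_classes <- knn(train = train_data, test = test_data,
                    cl = vector_train_labels, k = 5, prob = TRUE)

pred_classes                # the predicted classes themselves
attr(pred_classes, "prob")  # proportion of the k neighbours that voted for the winning class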

Decision Trees

  • More useful where transparency is required.
library(rpart)
m <- rpart(formula, data = df_name, method = "class")
  • Plotting a decision tree

    library(rpart.plot)
    rpart.plot(dec_tree_model)
    rpart.plot(model_name, type = num, box.palette = list_of_colors, fallen.leaves = TRUE)
    
    • fallen.leaves = TRUE draws all leaf nodes in a row at the bottom of the plot instead of at the depth where each branch ends.
    • box.palette sets the colors used for the node boxes.
    • type selects one of several predefined styles for how the nodes are labelled and drawn.
  • Growing larger decision trees

    • Single-variable rules produce axis-parallel splits, which are often not the best fit: approximating a diagonal boundary takes many of them. Combining two or more variables can make the tree more robust in its predictions, but also much more complex.
    • A more complex tree is also more likely to overfit; a complete worked example follows below.
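
A minimal worked example on the built-in iris data, putting rpart() and rpart.plot() together (the plotting settings are illustrative, not recommendations):

library(rpart)
library(rpart.plot)

m <- rpart(Species ~ ., data = iris, method = "class")      # grow a classification tree
rpart.plot(m, type = 3, box.palette = "Blues", fallen.leaves = TRUE)

predict(m, newdata = iris, type = "class")                   # predicted class for each row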

Train Test Splits

Splitting the data into training and testing rows lets you check how the model performs on data it has not seen, which helps guard against overfitting when new data needs to be predicted.

sample_rows <- sample(total_df_length, training_count)
sample_rows <- sample(nrow(df), 0.8 * nrow(df)) # sample 80% of the rows for training
train_data <- df[sample_rows, ]
test_data <- df[- sample_rows, ]

Pruning of trees

  • Pre Pruning

    Stops the tree from growing too far in the first place, based on a maximum depth or a minimum number of observations required before a split is attempted.

    prune_control <- rpart.control(maxdepth = 10, minsplit = 30)
    

    rpart.control() has parameters such as maxdepth and minsplit; passing the result to rpart() through its control argument limits how far the tree grows (see the sketch after this list).

  • Post Pruning

    Grow an overly complex tree first (set cp = 0) and then remove the branches that add complexity without reducing error.

    plotcp(m) # plot the cross-validated error against the complexity parameter
    pruned_model <- prune(m, cp = some_val) # cp value read off the plot
    

    The cp value where the cross-validated error stops improving is a good choice for the cutoff; a combined sketch of pre- and post-pruning follows below.
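
A minimal sketch of both approaches on the built-in iris data (the maxdepth, minsplit and cp values are arbitrary illustrations):

library(rpart)

# Pre-pruning: pass the control settings to rpart() while the tree is grown
prune_control <- rpart.control(maxdepth = 10, minsplit = 30)
m_pre <- rpart(Species ~ ., data = iris, method = "class", control = prune_control)

# Post-pruning: grow an overly complex tree (cp = 0), inspect the error, then prune
m_full <- rpart(Species ~ ., data = iris, method = "class",
                control = rpart.control(cp = 0))
plotcp(m_full)                            # cross-validated error vs complexity parameter
pruned_model <- prune(m_full, cp = 0.01)  # cp chosen from the plot (0.01 is illustrative)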

Forest from trees

A random forest is a collection of many simple trees that together form a more robust and powerful model; this kind of ensemble of predictors is useful for making decisions.

library(randomForest)
forest_model <- randomForest(formula,
                             data = df_name,
                             ntree = num_val,
                             mtry = sqrt(p))
predict(forest_model, df_test)

The mtry argument is the number of predictors randomly sampled as candidates at each split; sqrt(p), where p is the total number of predictors, is a common choice for classification. ntree sets how many trees are grown.
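
A minimal usage sketch on the built-in iris data (ntree = 500 is simply the package default written out, and the confusion matrix here is computed on the training data, so it is optimistic; split the data first in practice):

library(randomForest)

set.seed(42)
forest_model <- randomForest(Species ~ ., data = iris,
                             ntree = 500,
                             mtry  = floor(sqrt(ncol(iris) - 1)))  # p = 4 predictors

pred <- predict(forest_model, iris)
table(iris$Species, pred)   # confusion matrix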