R Programming 105 - Supervised Learning 1
Aggregate by column
aggregate(col_name ~ by_column_name, data = df_name, FUN = mean)
We can use other functions instead of mean as well. The variable before the tilde is the column being summarised, and the variable after the tilde is what the aggregation is grouped by. If by_column_name is continent, the code above calculates the mean of the given column (e.g. age) for each continent.
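A minimal sketch, assuming a hypothetical data frame with age and continent columns:
df <- data.frame(
  age = c(23, 45, 31, 52, 38, 27),
  continent = c("Asia", "Europe", "Asia", "Africa", "Europe", "Africa")
)
# Mean age per continent; swap mean for median, sum, etc.
aggregate(age ~ continent, data = df, FUN = mean)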
Nearest Neighbors
library(class)
knn(train = train_data, test = test_data, cl = vector_train_labels)
-
Confusion Matrix
table(actual_classes, pred_classes)
This creates a confusion matrix counting how many observations fall into each combination of actual and predicted class.
-
Choosing the value of k
Larger values of k smooth over the differences between classes, whereas smaller values of k are easily swayed by noisy individual observations. There is no fixed rule for which value of k to use, so it is common to try several and compare.
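A small sketch of such a comparison, assuming train_data, test_data, and the corresponding label vectors already exist:
library(class)
# Try a few candidate values of k and compare test-set accuracy
for (k in c(1, 5, 15)) {
  pred <- knn(train = train_data, test = test_data, cl = train_labels, k = k)
  cat("k =", k, "accuracy =", mean(pred == test_labels), "\n")
}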
Bayesian Methods
library(naivebayes)
bayesian_model <- naive_bayes(formula, data)
bayesian_model <- naive_bayes(y ~ x, data = df_name)
bayesian_model <- naive_bayes(y ~ x1 + x2, data = df_name)
When using multiple columns as predictors for an outcome y, we can write the formula as in the last line.
-
Binning for numeric data
Naive Bayes works with categorical predictors, so numeric columns are often binned into categories first; this keeps the model simple and efficient.
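For example, base R's cut() is one way to bin a numeric column; a sketch with made-up ages:
ages <- c(5, 17, 23, 41, 67, 80)
# Three bins with readable labels
age_bins <- cut(ages, breaks = c(0, 18, 65, Inf), labels = c("minor", "adult", "senior"))
table(age_bins)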
-
Laplace Correction
Add a small constant to each count so that the product of conditional probabilities does not become zero just because one of them is zero. With no correction, an event that never appeared in the training data is given zero probability, which often does not generalize.
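naive_bayes() takes a laplace argument for this; a sketch with the same placeholder names as above, assuming 1 is a reasonable amount to add:
library(naivebayes)
# Add 1 to every count so no conditional probability is exactly zero
bayesian_model <- naive_bayes(y ~ x1 + x2, data = df_name, laplace = 1)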
Binary Predictions with Regression
model_var <- glm(var1 ~ var2 + var3, data = df_name, family = "binomial")
predicted_obj <- predict(model_var, newdata = new_df, type = "response")
For binary predictions, we set the family to binomial (logistic regression). By default, glm uses the gaussian family.
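To turn the predicted probabilities into class labels, compare them against a cutoff; a sketch using the placeholder names above and an assumed 0.5 cutoff:
probs <- predict(model_var, newdata = new_df, type = "response")
pred_classes <- ifelse(probs > 0.5, 1, 0) # 1 when the predicted probability exceeds 0.5
table(pred_classes)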
ROC Curves
- A larger area under the ROC curve (AUC) generally indicates a better model, but it can sometimes be misleading.
- Two models can have a similar AUC yet make very different kinds of predictions, so the shape of the curve matters as well.
library(pROC)
roc_obj <- roc(actual, predicted)
plot(roc_obj, col = "some_color")
auc(roc_obj)
roc() creates the ROC curve object and auc() returns the area under it.
Dummy Coding Categorical Data
-
Handling categorical data for linear regressions
factor(df_name$col_name, levels = c(0, 1, 2), labels = c("Small", "Medium", "Large"))
-
Relevel factors
relevel(factored_col, ref = "some_val")
The value passed as ref becomes the reference level; all other levels are reordered so that it comes first.
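A sketch combining the two, with hypothetical column names, so that the regression reports Small and Large relative to the Medium baseline:
df_name$size <- factor(df_name$col_name, levels = c(0, 1, 2), labels = c("Small", "Medium", "Large"))
df_name$size <- relevel(df_name$size, ref = "Medium")
m <- lm(y ~ size, data = df_name) # coefficients are contrasts against "Medium"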
Automatic Feature Selection
This should only be used when there is little or no domain knowledge to start from.
-
Stepwise Regression
Starts from a model using all independent variables and removes them one by one if they do not contribute much to the output of the model (backward elimination).
null_model <- glm(dep_var ~ 1, data = df_name, family = "binomial")
full_model <- glm(dep_var ~ ., data = df_name, family = "binomial")
step_model <- step(full_model, scope = list(lower = null_model, upper = full_model), direction = "backward")
-
Forward Selection
Starts from the null model and adds independent variables one at a time, stopping when an addition no longer improves the predictions much.
null_model <- glm(dep_var ~ 1, data = df_name, family = "binomial")
full_model <- glm(dep_var ~ ., data = df_name, family = "binomial")
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
Prediction using Models
predicted_obj <- predict(model_name, newdata = new_df, type = "prob")
predict(model_name, newdata = new_df)
attr(predicted_obj, "prob")
You can get either the predicted classes (predict() without a type) or the probabilities associated with them (type = "prob" for models that support it). For knn() output, the winning-class proportion is stored as an attribute instead, so you retrieve it with attr(predicted_obj, "prob") after calling knn() with prob = TRUE.
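A sketch of the knn() case, assuming the same train/test objects as before; the attribute only exists when prob = TRUE is passed:
library(class)
pred <- knn(train = train_data, test = test_data, cl = train_labels, k = 5, prob = TRUE)
head(pred)               # predicted classes
head(attr(pred, "prob")) # proportion of votes for the winning class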
Decision Trees
- More useful where transparency is required, since the splitting rules can be read directly.
library(rpart)
m <- rpart(formula, data = df_name, method = "class")
-
Plotting a decision tree
library(rpart.plot)
rpart.plot(dec_tree_model)
rpart.plot(model_name, type = num, box.palette = list_of_colors, fallen.leaves = TRUE)
- fallen.leaves = TRUE places every leaf node at the bottom of the plot; FALSE leaves them at their natural depths in the tree.
- box.palette is used to set colors for the boxes.
- type specifies different visual aesthetics for the tree plot.
-
Growing larger decision trees
- Simple rules lead to axis-parallel splits, which are often not the best fit. Combining two or more variables can make the tree more robust in its predictions, but such trees can become very complex.
- The more complex the tree, the more likely it is to overfit.
Train Test Splits
Splitting the data into training and testing rows lets you check how the model performs on data it has never seen, so overfitting does not go unnoticed when new data has to be predicted.
sample_rows <- sample(total_df_length, training_count)
sample_rows <- sample(nrow(df), 0.8 * nrow(df)) # use 80% of rows for training
train_data <- df[sample_rows, ]
test_data <- df[-sample_rows, ]
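To make the split reproducible, you can fix the random seed before sampling; a sketch with the same placeholder df:
set.seed(123) # any fixed seed gives the same split on every run
sample_rows <- sample(nrow(df), 0.8 * nrow(df))
train_data <- df[sample_rows, ]
test_data <- df[-sample_rows, ]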
Pruning of trees
-
Pre Pruning
Stops the tree from growing once it reaches a maximum depth or once a node has too few observations to split.
prune_control <- rpart.control(maxdepth = 10, minsplit = 30)
rpart.control has parameters such as maxdepth and minsplit which control how far the tree is allowed to grow.
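The control object is then passed to rpart() through its control argument; a sketch with the placeholder names used above:
library(rpart)
prune_control <- rpart.control(maxdepth = 10, minsplit = 30)
m <- rpart(formula, data = df_name, method = "class", control = prune_control)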
-
Post Pruning
Grow an overly complex tree first (set cp to zero), then remove the branches that add complexity without reducing the error.
plotcp(m) # plot model complexity (cp) against the cross-validated error rate
pruned_model <- prune(m, cp = some_val) # choose cp from the plot
Forest from trees
A collection of many simple trees makes for a more robust and powerful model; an ensemble of predictors like this is useful in making decisions.
library(randomForest)
forest_model <- randomForest(formula,
data = df_name,
ntree = num_val,
mtry = sqrt(p))
predict(forest_model, df_test)
The mtry argument is the number of predictors randomly sampled at each split; sqrt(p), where p is the total number of predictors, is a common choice for classification.
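A quick way to check the forest is to cross-tabulate its predictions against the actual test labels; a sketch assuming dep_var is the outcome column in df_test:
pred <- predict(forest_model, df_test)
table(actual = df_test$dep_var, predicted = pred)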