TODO Gain Curve

Writing formulas

formula_obj <- var1 ~ var2
formula_obj <- var1 ~ var2 + var3
formula_obj <- as.formula("var1 ~ var2")
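A formula object can be passed straight to a modelling function. A minimal sketch, with a hypothetical data frame my_df and columns y, x1, x2:

formula_obj <- as.formula("y ~ x1 + x2")     # build the formula from a string
model_var <- lm(formula_obj, data = my_df)   # fit a linear model using the formula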

Getting information about the model

print(model_var)
summary(model_var)
broom::glance(model_var)
sigr::wrapFTest(model_var)
  • print shows the formula used to fit the model and the model coefficients.
  • summary shows the formula, coefficients, and residuals, along with fit statistics such as the R^2 value and so on.
  • glance returns the same summary information as a tidy one-row data frame, with columns such as df.residual, sigma, and so on.
  • wrapFTest reports the most relevant fit statistics for the model (R^2 and the F-test significance); a short sketch tying these together follows.
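A minimal sketch tying these calls together on the built-in mtcars data (assumes the broom and sigr packages are installed):

model_var <- lm(mpg ~ wt + hp, data = mtcars)

print(model_var)              # formula and coefficients
summary(model_var)            # coefficients, residuals, R^2, F-statistic
broom::glance(model_var)      # one-row data frame of fit statistics
sigr::wrapFTest(model_var)    # R^2 and F-test summary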

Predicting using the model

predict(model, newdata = new_data_vals)
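Continuing the mtcars sketch above, predictions can be stored as a new column:

new_data_vals <- mtcars                                     # in practice, new rows to score
new_data_vals$pred <- predict(model_var, newdata = new_data_vals)
head(new_data_vals[, c("mpg", "pred")])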

Collinearity

  • Variables assumed to be independent may in fact be correlated with one another.
  • Unusually high collinearity can make the model unstable (coefficients become very sensitive to small changes in the data).

Gain curve

The gain curve is useful when sorting the outcomes correctly matters more than predicting their exact values. It checks whether the model's predictions sort the data in the same order as the true outcome.

library(WVPlots)
GainCurvePlot(frame, xvar, truthVar, title)                                   # general signature
GainCurvePlot(df_name, "pred_col_name", "actual_col_name", "Title of plot")   # example call

Here, the column names need to be passed as strings or else these objects cannot be found when creating the plot.

RMSE

An RMSE much smaller than the outcome's standard deviation suggests that the model predicts well.
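A minimal sketch of that comparison, assuming a data frame df with hypothetical columns actual and pred:

res  <- df$pred - df$actual          # residuals
rmse <- sqrt(mean(res^2))            # root mean squared error
rmse < sd(df$actual)                 # TRUE suggests the model adds value over the mean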

R^2

Variance explained by the model: \[R^{2} = 1 - \frac{RSS}{SS_{Tot}}\] where RSS is the residual sum of squares (the variance left unexplained by the model) and \(SS_{Tot}\) is the total sum of squares (the variance of the data around its mean). A large R^2 means that the model fits the data well.

summary(model_var)$r.squared
glance(model_var)$r.squared

The summary() of a model fitted with lm() already includes the R^2 value.
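The same value can also be computed by hand from the definition above (again assuming hypothetical columns actual and pred):

rss    <- sum((df$actual - df$pred)^2)            # residual sum of squares
ss_tot <- sum((df$actual - mean(df$actual))^2)    # total sum of squares
r_squared <- 1 - rss / ss_tot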

Cross Validation

library(vtreat)
splitWay <- kWayCrossValidation(n_rows, n_splits, NULL, NULL)
splitWay[[index]]

The last two parameters (a data frame and an outcome column) exist only for compatibility purposes and can be passed as NULL.

for (i in 1:n_splits) {
  split_ind <- splitWay[[i]]
  model <- lm(formula, data = df[split_ind$train, ])
  df$predictions[split_ind$app] <- predict(model, newdata = df[split_ind$app, ])
}

The hold-out (application) indices are named app and are used as in the code snippet above.
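Putting it together on the built-in mtcars data, with a formula and fold count chosen purely for illustration:

library(vtreat)

df <- mtcars
fmla <- mpg ~ wt + hp
n_splits <- 3
splitWay <- kWayCrossValidation(nrow(df), n_splits, NULL, NULL)

df$predictions <- NA_real_
for (i in 1:n_splits) {
  split_ind <- splitWay[[i]]
  model <- lm(fmla, data = df[split_ind$train, ])
  df$predictions[split_ind$app] <- predict(model, newdata = df[split_ind$app, ])
}

sqrt(mean((df$predictions - df$mpg)^2))   # out-of-sample RMSE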

Random train test split

gp <- runif(nrow(df))            # one uniform draw per row
df_train <- df[gp < 0.75, ]      # roughly 75% of rows for training
df_test  <- df[gp >= 0.75, ]     # remaining ~25% for testing

One Hot Encoding

R handles categorical variables under the hood: a factor's levels are stored as integer codes, and model-fitting functions expand the factor into indicator (dummy) columns. With too many levels, however, this can become a problem.

mmat <- model.matrix(formula, data)

Calling model.matrix() shows exactly how R transforms the data frame before it is fitted to a model.
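For example, treating cyl in mtcars as a factor shows the indicator columns R would create (a small sketch):

df <- mtcars
df$cyl <- as.factor(df$cyl)                     # treat cyl as categorical
mmat <- model.matrix(mpg ~ cyl + wt, data = df)
head(mmat)                                      # columns: (Intercept), cyl6, cyl8, wt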

  • Creating treatment plans

    A treatment plan processes specific columns of the data in a consistent way, for example turning categorical levels into indicator variables and dealing with NAs in numerical columns.


    designTreatmentsZ(dframe, varlist, verbose = FALSE)
    prepare(treatmentPlan, dframe, varRestriction = new_vars_list)
    

    You need to apply the same treatment plan to the test data as well before predicting.
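    A minimal sketch of the workflow, assuming hypothetical frames df_train and df_test with input columns x_cat and x_num:

    library(vtreat)

    vars <- c("x_cat", "x_num")                              # hypothetical input columns
    treatmentPlan <- designTreatmentsZ(df_train, vars, verbose = FALSE)

    # keep only the clean numeric and level-indicator variables
    scoreFrame <- treatmentPlan$scoreFrame
    new_vars_list <- scoreFrame$varName[scoreFrame$code %in% c("clean", "lev")]

    train_treated <- prepare(treatmentPlan, df_train, varRestriction = new_vars_list)
    test_treated  <- prepare(treatmentPlan, df_test,  varRestriction = new_vars_list)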

Variable Interactions

When writing a formula, the predictor variables may interact with one another, and that interaction needs to be expressed in the formula itself.

fmla <- dep_var ~ var_1 + var_2 + var_1:var_2

The two variables separated by a colon interact with each other, so the formula models that relationship explicitly.

fmla <- dep_var ~ var_1 * var_2

The asterisk is shorthand for the formula written above (both main effects plus their interaction).

fmla <- dep_var ~ I(var_1 * var_2)

This expresses the product of the two variables as a single term in the formula.
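As a small sketch (hypothetical data frame my_df with columns y, a, b), the two interaction forms fit the same model:

m1 <- lm(y ~ a + b + a:b, data = my_df)   # explicit main effects plus interaction
m2 <- lm(y ~ a * b,       data = my_df)   # shorthand for exactly the same model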

Transforming Variables

Variables can be transformed before modelling based on prior knowledge about them, for example taking the logarithm of a heavily skewed variable.
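For example, a heavily skewed outcome can be modelled on the log scale and the predictions transformed back (hypothetical names):

model <- lm(log(y) ~ x1 + x2, data = df_train)           # model the log of the outcome
df_test$pred <- exp(predict(model, newdata = df_test))   # back-transform the predictions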

Logistic Regression

glm(formula, data, family = "binomial")
predict(model, newdata, type = "response")
  • Deviance and Null Deviance instead of SSE and SST.
sigr::wrapChiSqTest(df, "pred_col_name", "actual_col_name")   # data frame first, then prediction and outcome column names

We can also use GainCurvePlot to see the effectiveness of the model.

out_df <- glance(glm_model)
pseudo_r2 <- 1 - ( out_df$deviance / out_df$null.deviance )
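A small end-to-end sketch with a hypothetical binary outcome y and inputs x1, x2:

library(WVPlots)

glm_model <- glm(y ~ x1 + x2, data = df_train, family = "binomial")
df_test$pred <- predict(glm_model, newdata = df_test, type = "response")   # predicted probabilities

GainCurvePlot(df_test, "pred", "y", "logistic model")   # does the model rank the positive cases first?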

Poisson and Quasipoisson Regression

  • When the mean is equal to (or of a similar order as) the variance, the counts follow a Poisson distribution.
  • family = "poisson" or family = "quasipoisson" is used.
  • Used to predict counts.
  • The inputs are additive and linear in log(count).
  • When the variance is very different from the mean, use a quasipoisson model instead.
  • Similar to logistic regression, a pseudo R^2 can be computed from the deviances (see the sketch below).
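A sketch of a count model, with a hypothetical count outcome cnt and inputs x1, x2:

model <- glm(cnt ~ x1 + x2, data = df_train, family = "quasipoisson")
df_test$pred <- predict(model, newdata = df_test, type = "response")   # predicted counts

perf <- broom::glance(model)
pseudo_r2 <- 1 - (perf$deviance / perf$null.deviance)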

Generalized Additive Models

library(mgcv)
gam(formula, family, data)
s(col_name) # fit a smooth function
fmla <- dep_col ~ s(ind_col_1)
plot(gam_model)
  • family = "gaussian", "poisson", or "binomial"
  • Best suited to larger data sets, since GAMs are more complex models.
  • The smooth function s() can only be applied to continuous variables.
  • predict() on a gam model returns a matrix, so wrap it in as.numeric() to get a plain vector (see the sketch below).
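A sketch with mgcv, using a hypothetical continuous input x and outcome y:

library(mgcv)

fmla <- y ~ s(x)                                   # smooth the continuous input
gam_model <- gam(fmla, family = gaussian, data = df_train)
plot(gam_model)                                    # visualise the fitted smooth

df_test$pred <- as.numeric(predict(gam_model, newdata = df_test))   # matrix -> numeric vector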

Tree Based Methods

library(ranger)
ranger(fmla,
       data,
       num.trees = number,
       respect.unordered.factors = "order")
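A minimal sketch of fitting and predicting with ranger (formula and data frames hypothetical); note that predict() on a ranger model returns an object whose $predictions field holds the predicted values:

library(ranger)

rf_model <- ranger(y ~ x1 + x2,
                   data = df_train,
                   num.trees = 500,
                   respect.unordered.factors = "order",
                   seed = 123)

df_test$pred <- predict(rf_model, data = df_test)$predictions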
  • mtry : number of variables to try at each node.
  • Gradient Boosting

    1. Fit a shallow tree T1 to the data. M1 = T1
    2. Fit tree T2 to the residuals and find λ such that M2 = M1 + λ * T2
    3. For regularized learning, scale each new tree by a learning rate η between 0 and 1; a smaller η means slower learning (and less risk of overfitting), a larger η means faster learning.
    library(xgboost)
    xgb.cv() # evaluation_log holds the RMSE for each round and is used to find the optimum number of trees
    xgb.cv(data = as.matrix(treatedData),
           label = df$res_col,
           objective = "reg:linear",
           nrounds = 200,
           nfold = 3,
           eta = 0.2,
           max_depth = 8)
    
    • Arguments

      1. data: the treated data as a matrix.
      2. nrounds: the maximum number of trees to fit.
      3. eta: the learning rate.
      4. max_depth: the maximum depth of a single tree.
      5. nfold: how many cross-validation folds to make.
      6. objective: what kind of model to fit.
    • Finding best number of trees

      elog <- as.data.frame(cv$evaluation_log)
      nrounds <- which.min(elog$test_rmse_mean)
      

      First extract the evaluation log, then find the round with the minimum test RMSE. Use xgboost() with that number of rounds to create the final model for prediction. Note that predict() also expects a matrix when using an xgboost model; a sketch of this step follows.
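      A sketch of this final step, reusing the hypothetical treatedData and df from the xgb.cv() call above:

      library(xgboost)

      final_model <- xgboost(data = as.matrix(treatedData),
                             label = df$res_col,
                             objective = "reg:linear",
                             nrounds = nrounds,     # best round found from the evaluation log
                             eta = 0.2,
                             max_depth = 8,
                             verbose = 0)

      # predict() also expects a matrix of treated data
      df$pred <- predict(final_model, as.matrix(treatedData))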