R Programming 105 - Supervised Learning 2: Regression
Writing formulas
formula_obj <- var1 ~ var2
formula_obj <- var1 ~ var2 + var3
formula_obj <- as.formula("var1 ~ var2")
Getting information about the model
print(model_var)
summary(model_var)
broom::glance(model_var)
sigr::wrapFTest(model_var)
- print shows the formula used and the coefficients of the model.
- summary shows the formula, coefficients, and residuals, as well as statistics such as the R^2 value.
- glance returns the information from summary as a tidy one-row data frame, with columns such as df.residual, sigma, and so on.
- wrapFTest reports the F-test summary of the model, i.e. its R^2 and the corresponding p-value.
Predicting using the model
predict(model, newdata = new_data_vals)
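Putting the pieces together, a minimal sketch on the built-in mtcars data (the choice of variables is purely illustrative):
fmla <- mpg ~ wt + hp
model_var <- lm(fmla, data = mtcars)        # fit a linear model from the formula
summary(model_var)                          # coefficients, residuals, R^2
broom::glance(model_var)                    # the same information as a one-row data frame
new_data_vals <- data.frame(wt = 3.0, hp = 120)
predict(model_var, newdata = new_data_vals) # predicted mpg for the new observation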
Collinearity
- The variables assumed to be independent might not be independent after all.
- Unusually high collinearity can make the model unstable.
Gain curve
The gain curve matters when sorting the outcomes correctly is more important than getting every prediction exactly right: it checks whether the model's predictions sort the data in the same order as the true outcome.
library(WVPlots)
GainCurvePlot(frame, xvar, truthvar, title)
GainCurvePlot(df_name, "pred_col_name", "actual_col_name", "Title of plot")
The column names must be passed as strings, or else those objects cannot be found when creating the plot.
RMSE
RMSE much smaller than the outcome’s standard deviation suggests that a model predicts well.
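A minimal sketch of that comparison, assuming a data frame df with columns pred and actual:
err <- df$pred - df$actual
rmse <- sqrt(mean(err^2))   # root mean squared error
sd(df$actual)               # spread of the outcome; an RMSE well below this is a good sign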
R^2
Variance explained by the model:
\[R^{2} = 1 - \frac{RSS}{SS_{Tot}}\]
where RSS is the residual sum of squares (the variance the model fails to explain) and SS_Tot is the total sum of squares (the variance of the data). A large R^2 means that the model fits the data well.
summary(model_var)$r.squared
glance(model_var)$r.squared
For a linear model fit with lm(), summary() already reports the R^2 value.
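For models that do not report it directly, R^2 can be computed by hand; a small sketch, again assuming columns pred and actual in df:
rss <- sum((df$actual - df$pred)^2)            # residual sum of squares
ss_tot <- sum((df$actual - mean(df$actual))^2) # total sum of squares
r_squared <- 1 - rss / ss_tot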
Cross Validation
library(vtreat)
splitWay <- kWayCrossValidation(n_rows, n_splits, NULL, NULL)
splitWay[[index]]
The last two arguments (dframe and y) are unused here; they exist only for compatibility with other vtreat splitting functions.
for (i in 1:num_folds) {
  split_ind <- splitWay[[i]]
  model <- lm(formula, data = df[split_ind$train, ])
  df$predictions[split_ind$app] <- predict(model, newdata = df[split_ind$app, ])
}
The hold-out (application) indices of each fold are stored under $app and are used as shown in the code snippet above.
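Once the loop has filled df$predictions, the out-of-sample error can be scored like any other prediction; a small sketch assuming the true outcome is in df$y:
cv_rmse <- sqrt(mean((df$y - df$predictions)^2)) # cross-validated RMSE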
Random train test split
train_frac <- 0.75
tot_ind <- runif(nrow(df))
df_train <- df[tot_ind < train_frac, ]
df_test <- df[tot_ind >= train_frac, ]
One Hot Encoding
Under the hood, R expands every categorical variable into indicator (one-hot) columns when fitting a model, one column per level apart from the reference level. With too many levels, this can be a problem.
mmat <- model.matrix(formula, data)
Doing this shows exactly how R transforms the data frame before it is fitted to any model later.
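A minimal sketch of inspecting that expansion on toy data (column names are placeholders):
toy <- data.frame(y = c(1, 2, 3, 4), color = factor(c("red", "blue", "green", "red")))
mmat <- model.matrix(y ~ color, data = toy)
print(mmat) # 'color' becomes indicator columns, one per non-reference level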
Creating treatment plans
A treatment plan processes specific columns of the data in a defined way, such as creating indicator levels for categorical variables and handling NAs in numerical ones.
designTreatmentsZ(dframe, varslist, verbose = FALSE)
prepare(treatmentPlan, dframe, varRestriction = new_vars_list)
You need to apply the same treatment plan to the test data as well before predicting the output.
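A minimal sketch of that workflow (variable and column names are placeholders):
library(vtreat)
treatplan <- designTreatmentsZ(df_train, c("x_cat", "x_num"), verbose = FALSE)
score_frame <- treatplan$scoreFrame
new_vars_list <- score_frame$varName[score_frame$code %in% c("clean", "lev")]
train_treated <- prepare(treatplan, df_train, varRestriction = new_vars_list)
test_treated <- prepare(treatplan, df_test, varRestriction = new_vars_list) # same plan on the test data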
Variables Interactions
When writing variables into a formula, the variables may also interact with each other, and that relationship needs to be incorporated into the formula itself.
dep_var ~ var_1 + var_2 + var_1:var_2
The colon denotes the interaction between the two variables, so this formula models both main effects and their interaction.
dep_var ~ var_1 * var_2
The asterisk is shorthand for the same formula written above.
dep_var ~ I(var_1 * var_2)
I() expresses the plain product of the two variables as a single term in the formula.
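A minimal sketch on mtcars (the choice of variables is only illustrative):
m_interaction <- lm(mpg ~ wt * hp, data = mtcars)   # expands to wt + hp + wt:hp
m_product <- lm(mpg ~ I(wt * hp), data = mtcars)    # a single product term instead
summary(m_interaction)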
Transforming Variables
Variables can be transformed based on prior knowledge about the domain, for example taking the logarithm of a strongly skewed variable.
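For example, a log transform of a skewed outcome; a minimal sketch with hypothetical column names:
model_log <- lm(log(y) ~ x, data = df)          # model the outcome on the log scale
pred_y <- exp(predict(model_log, newdata = df)) # transform predictions back to the original scale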
Logistic Regression
glm(formula, data, family = "binomial")
predict(model, newdata, type = "response")
- Deviance and Null Deviance instead of SSE and SST.
wrapChiSqTest(df_name, "pred_col", "actual_col", "tgt_event")
We can also use GainCurvePlot to see the effectiveness of the model.
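A minimal sketch of the fit-and-evaluate steps (df, x1, x2 and the binary outcome y are placeholders):
glm_model <- glm(y ~ x1 + x2, data = df, family = "binomial")
df$pred <- predict(glm_model, newdata = df, type = "response") # predicted probabilities
GainCurvePlot(df, "pred", "y", "logistic regression model")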
out_df <- glance(glm_model)
pseudo_r2 <- 1 - ( out_df$deviance / out_df$null.deviance )
Poisson and Quasi-Poisson Regression
- A mean that is equal to (or of a similar order as) the variance suggests a Poisson distribution.
- family = "poisson" or family = "quasipoisson" is used.
- Used to predict counts.
- Inputs are additive and linear in log(count).
- In a quasi-Poisson distribution the variance differs from the mean.
- As with logistic regression, a pseudo R^2 can be found with the same process (see the sketch below).
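A minimal sketch for a count outcome (column names are placeholders):
pois_model <- glm(count_col ~ x1 + x2, data = df, family = "poisson") # or "quasipoisson"
df$pred <- predict(pois_model, newdata = df, type = "response")
perf <- broom::glance(pois_model)
pseudo_r2 <- 1 - (perf$deviance / perf$null.deviance)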
Generalized Additive Models
library(mgcv)
gam(formula, family, data)
s(col_name) # fit a smooth function
fmla <- dep_col ~ s(ind_col_1)
plot(gam_model)
- family = "gaussian", "poisson", or "binomial"
- Best suited to larger data sets, since GAMs are more complex and can overfit small ones.
- The smooth function s() can only be applied to continuous variables.
- predict() on a gam model returns a matrix, so use as.numeric() to get a plain vector of predictions (as in the sketch below).
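A minimal sketch, with hypothetical column names:
library(mgcv)
fmla_gam <- y ~ s(x_cont) + x_cat                     # smooth term only on the continuous input
gam_model <- gam(fmla_gam, family = gaussian, data = df)
plot(gam_model)                                       # inspect the fitted smooth
pred <- as.numeric(predict(gam_model, newdata = df))  # predict() returns a matrix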
Tree Based Methods
library(ranger)
ranger(fmla, data,
       num.trees = number,
       respect.unordered.factors = "order")
- mtry : number of variables to try at each node.
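A minimal sketch of fitting and predicting with ranger (formula and data names are placeholders):
library(ranger)
fmla <- y ~ x1 + x2 + x3
rf_model <- ranger(fmla, data = df_train,
                   num.trees = 500,
                   respect.unordered.factors = "order",
                   seed = 123)
df_test$pred <- predict(rf_model, data = df_test)$predictions # predictions live in $predictions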
Gradient Boosting
- Fit a shallow tree T1 to the data. M1 = T1
- Fit tree T2 to the residuals and find λ such that M2 = M1 + λ * T2
- For regularized learning, add a regularization parameter η ranging from 0 to 1 where smaller η means slow learning and higher means faster learning.
library(xgboost)
# evaluation_log has the rmse for each round and is used to find the optimum model
cv <- xgb.cv(data = as.matrix(treatedData),
             label = df$res_col,
             objective = "reg:linear",
             nrounds = 200,
             nfold = 3,
             eta = 0.2,
             max_depth = 8)
Arguments
- data: the training data, as a matrix.
- nrounds: the maximum number of trees to fit.
- eta: the learning rate.
- max_depth: the maximum depth of a single tree.
- nfold: how many cross-validation folds to make.
- objective: what kind of model to fit.
Finding best number of trees
elog <- as.data.frame(cv$evaluation_log)
nrounds <- which.min(elog$test_rmse_mean)
First get the evaluation log, then find the round with the minimum test RMSE. Then use xgboost() with that number of rounds to create the final model for prediction. Note that predict() also takes a matrix when using an xgboost model.
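A minimal sketch of the final fit and the prediction step (treatedData and treatedTestData are placeholders for the prepared matrices):
final_model <- xgboost(data = as.matrix(treatedData),
                       label = df$res_col,
                       objective = "reg:linear",
                       nrounds = nrounds,  # best round found above
                       eta = 0.2,
                       max_depth = 8,
                       verbose = 0)
test_pred <- predict(final_model, as.matrix(treatedTestData))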