Practice Course

  • Dropping unused levels from a categorical variable

    df %>%
      filter(col_name != "certain_level") %>%
      droplevels() # drops unused levels from every factor column of the piped data frame
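
    A minimal sketch of why the extra step matters, using a made-up data frame (df, col_name and the level names are illustrative only) and base R subsetting in place of filter():

    ```r
    # Hypothetical data: a factor with three levels
    df <- data.frame(col_name = factor(c("a", "b", "c", "a")), value = 1:4)

    # Base R equivalent of filter(col_name != "c")
    kept <- df[df$col_name != "c", ]
    levels(kept$col_name)   # "c" is still listed as a level
    table(kept$col_name)    # ... and still appears, with a count of 0

    # droplevels() removes the levels that no longer occur
    cleaned <- droplevels(kept)
    levels(cleaned$col_name)
    ```

    Without droplevels(), the empty level would still show up in tables and plot legends.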
    
  • Counts vs Proportions

    prop.table(table_obj)         # joint proportions: the whole table sums to 1
    prop.table(table_obj, margin) # condition on a margin
    prop.table(table_obj, 1)      # condition on rows: each row sums to 1
    prop.table(table_obj, 2)      # condition on columns: each column sums to 1
    

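    With a margin, prop.table() gives conditional distributions; without one, it gives joint proportions. Marginal distributions come from the row or column totals, e.g. prop.table(margin.table(table_obj, 1)).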
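
    As a small worked example, assuming a made-up 2x2 table of counts (the group and outcome labels are hypothetical), the different calls can be compared side by side:

    ```r
    # Hypothetical counts: rows = group, columns = outcome
    table_obj <- as.table(matrix(c(10, 30, 20, 40), nrow = 2,
                                 dimnames = list(group = c("g1", "g2"),
                                                 outcome = c("no", "yes"))))

    joint  <- prop.table(table_obj)     # every cell divided by the grand total
    by_row <- prop.table(table_obj, 1)  # every cell divided by its row total
    by_col <- prop.table(table_obj, 2)  # every cell divided by its column total

    # Marginal distribution of the row variable
    row_marginal <- prop.table(margin.table(table_obj, 1))

    joint
    by_row
    by_col
    row_marginal
    ```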

  • Numerical Data

    When looking at a plot of numerical data, note both the center of the distribution (e.g. its mean) and its variability (spread).

    • Density plot

      geom_density()
      geom_density(bw = number) # bandwidth: a larger value gives a smoother curve
      

      A density plot is useful when we only want to see the overall shape of the distribution rather than its details.

  • Distribution of One Variable

    • Marginal distribution
    • Conditional distribution
    • Vertical Box Plot

      ggplot(df, aes(x = 1, y = col_name)) + geom_boxplot()
      ggplot(df, aes(x = col_name)) + geom_boxplot()
      

      The first call draws a vertical boxplot; the second, which maps col_name to x, draws a horizontal one.

  • Measures of Center and Spread

    • The median and IQR are robust to outliers and to skewed (non-normal) data.
    • The mean and standard deviation suit roughly symmetric data without extreme outliers; both are sensitive to outliers and skew.
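
    A quick illustration with made-up numbers (the vector and the outlier value are hypothetical), showing how a single extreme value moves each statistic:

    ```r
    x     <- c(2, 3, 4, 5, 6)
    x_out <- c(x, 100)   # same data plus one hypothetical outlier

    mean(x);   mean(x_out)     # 4 and 20: the mean is pulled towards the outlier
    median(x); median(x_out)   # 4 and 4.5: the median barely moves

    sd(x);  sd(x_out)          # the standard deviation grows sharply
    IQR(x); IQR(x_out)         # the IQR changes only slightly
    ```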
  • Shape of Distribution

    • Modality
      • Unimodal
      • Bimodal
      • Multimodal
      • Uniform
    • Skew
      • Right Skewed
      • Left Skewed
      • Symmetric
  • Outliers
  • Zero Inflation Strategies

    • Analyze the two components (zeros and non-zero values) separately
    • Collapse into a two-level categorical variable (zero vs. non-zero)
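
    Both strategies can be sketched in base R, assuming a made-up zero-inflated vector (amount and the level labels are hypothetical names):

    ```r
    # Hypothetical zero-inflated variable
    amount <- c(0, 0, 0, 0, 12, 30, 0, 7, 0, 55)

    # Strategy 1: analyze the two components separately
    share_zero <- mean(amount == 0)     # proportion of zeros
    non_zero   <- amount[amount > 0]    # the non-zero part on its own
    share_zero
    summary(non_zero)

    # Strategy 2: collapse into a two-level categorical variable
    has_amount <- factor(ifelse(amount > 0, "non-zero", "zero"))
    table(has_amount)
    ```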

Case Study

Country Code Package

library(countrycode)
countrycode(code, "cown", "country.name") # "cown" = Correlates of War numeric code

See the package documentation for the other supported coding schemes.

Broom Package

library(broom)
tidy(model_obj)
bind_rows(tidy(model_obj1), tidy(model_obj2))

The printed output of a linear model is verbose and hard to work with programmatically. broom's tidy() converts a model object into a data frame with one row per term, and bind_rows() stacks the tidy outputs of several models into a single data frame.
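
A minimal sketch of that workflow, assuming the built-in mtcars data and two illustrative models (the model names and the added model column are not from the course):

```r
library(broom)
library(dplyr)

# Two hypothetical models fitted on the built-in mtcars data
model_wt <- lm(mpg ~ wt, data = mtcars)
model_hp <- lm(mpg ~ hp, data = mtcars)

# One row per coefficient: term, estimate, std.error, statistic, p.value
tidy(model_wt)

# Label each tidy output before stacking so the rows stay identifiable
coefs <- bind_rows(
  tidy(model_wt) %>% mutate(model = "mpg ~ wt"),
  tidy(model_hp) %>% mutate(model = "mpg ~ hp")
)
coefs
```

Adding the model column is a design choice: without it, the stacked rows from different models cannot be told apart.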

Splitting Data by Some column using Nest

library(tidyr)
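df %>%
  nest(data = -col_name) %>% # tidyr >= 1.0 syntax: other columns go into the list-column "data"
  unnest(data)               # expand the list-column back into regular rows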

Nesting packs all the other columns into a list-column of data frames, one per value of col_name; unnesting expands them back into regular rows and columns.
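
As a concrete sketch, assuming tidyr 1.0 or later and the built-in mtcars data split by its cyl column (by_cyl and restored are illustrative names):

```r
library(tidyr)

# One row per distinct cyl value; the remaining columns are packed into `data`
by_cyl <- mtcars %>%
  nest(data = -cyl)

by_cyl             # cyl plus a list-column of data frames
by_cyl$data[[1]]   # the rows belonging to the first cyl value

# unnest() expands the list-column back into ordinary rows and columns
restored <- by_cyl %>%
  unnest(data)
```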

Using map to operate over a list one by one

library(purrr)
x <- c(0, 1, 2)
map(x, ~ . + 1)
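map(list, function) # function is applied to each item of list

Here ~ starts a one-sided formula that purrr turns into an anonymous function, and . stands for the current item. map() passes the items to the function one at a time and returns a list of the results.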
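
The following sketch spells the shorthand out, comparing it with an explicit function and with map_dbl(), a typed variant from the same package (the result variable names are illustrative):

```r
library(purrr)

x <- c(0, 1, 2)

as_formula  <- map(x, ~ . + 1)                  # formula shorthand
as_function <- map(x, function(item) item + 1)  # the same thing, written out
as_numeric  <- map_dbl(x, ~ . + 1)              # returns a numeric vector, not a list

as_formula
as_numeric
```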

TODO P-value Adjustments

TODO Understand gather and how column names appear

Gathering data

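gather() reshapes wide data into long format as key-value pairs: the names of the gathered columns become the values of the key column, and their cell contents become the values of the value column. In tidyr 1.0 and later, pivot_longer() is the recommended replacement.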

library(tidyr)
gather(df, key_col, value_col, col_names)
gather(df, key_col, value_col, col_1:col_8)
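
To see how the column names end up in the key column, here is a small sketch with a made-up wide data frame (wide, long and the column names are hypothetical):

```r
library(tidyr)

# Hypothetical wide data: one column per year
wide <- data.frame(country = c("A", "B"),
                   y2000 = c(1, 2),
                   y2001 = c(3, 4))

# The names "y2000" and "y2001" become the values of `year`;
# the cell contents go into `value`
long <- gather(wide, key = "year", value = "value", y2000:y2001)
long

# Equivalent reshaping with the newer interface (tidyr >= 1.0)
long_new <- pivot_longer(wide, y2000:y2001,
                         names_to = "year", values_to = "value")
```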

Recoding column values

library(dplyr)
recode(column_name,
       "prev_value" = "New name",
       "prev_val2" = "New name2")
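
In practice recode() is usually applied inside mutate(). A minimal sketch, assuming a made-up data frame with a character column (the value "other" is added only to show what happens to unmatched values):

```r
library(dplyr)

# Hypothetical data frame with a character column to recode
df <- data.frame(column_name = c("prev_value", "prev_val2", "other"),
                 stringsAsFactors = FALSE)

# old value on the left, new value on the right
df <- df %>%
  mutate(column_name = recode(column_name,
                              "prev_value" = "New name",
                              "prev_val2" = "New name2"))

df$column_name   # values without a match ("other") are left unchanged
```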