R Programming 103 - Exploratory Data Analysis

Practice Course

Dropping levels from categorical dataset

df %>%
  filter(col_name != "certain_level") %>%  # remove the rows with that level
  droplevels()                             # then drop the now-unused level from the factor
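A minimal sketch of why the second step matters, assuming dplyr is installed. The data frame and its values are made up for illustration: filter() removes the rows, but the factor still remembers the unused level until droplevels() is called.

```r
library(dplyr)

# Hypothetical data: a factor column with three levels
df <- data.frame(
  col_name = factor(c("a", "b", "certain_level", "a")),
  value = 1:4
)

# filter() removes the rows, but the unused level stays attached to the factor
filtered <- df %>% filter(col_name != "certain_level")
levels(filtered$col_name)

# droplevels() with no extra arguments drops unused levels from every factor column
cleaned <- filtered %>% droplevels()
levels(cleaned$col_name)
```

Unused levels otherwise show up as empty rows in table() and as empty categories in plots.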

Counts vs Proportions

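prop.table(table_obj)         # proportions of the grand total (joint distribution)
prop.table(table_obj, margin) # margin picks the dimension to condition on
prop.table(table_obj, 1) # condition on rows: each row sums to 1
prop.table(table_obj, 2) # condition on columns: each column sums to 1

With a margin, prop.table() returns conditional distributions. For marginal distributions, sum over the other dimension first with margin.table(table_obj, 1) or margin.table(table_obj, 2), then pass the result to prop.table().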
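A small base R sketch of the difference between joint, conditional, and marginal proportions. It uses the built-in mtcars data purely as a stand-in; the variable names are illustrative.

```r
# Built-in mtcars data: cross-tabulate cylinders (rows) by gears (columns)
tab <- table(mtcars$cyl, mtcars$gear)

joint  <- prop.table(tab)     # all cells together sum to 1
by_row <- prop.table(tab, 1)  # each row sums to 1
by_col <- prop.table(tab, 2)  # each column sums to 1

# Marginal counts, then marginal proportions, for the row variable
row_counts <- margin.table(tab, 1)
row_props  <- prop.table(row_counts)
```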

Numerical Data

When looking at a plot of numerical data, note the center of the distribution (for example the mean) and its variability.
- Density plot
```
geom_density()
geom_density(bw = number) # bw sets the smoothing bandwidth; larger values give a smoother curve
```
  A density plot is useful when we want to see the overall shape of the distribution rather than the fine detail.
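A short sketch of the bandwidth effect, assuming ggplot2 is installed. It uses mtcars$mpg as a stand-in variable, and the value bw = 3 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Default bandwidth, chosen automatically from the data
p_default <- ggplot(mtcars, aes(x = mpg)) +
  geom_density()

# Larger bandwidth: smoother curve, less detail
p_smooth <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(bw = 3)

# Print either object to draw it, e.g. print(p_smooth)
```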

Distribution of One Variable
- Marginal distribution
- Conditional distribution
- Vertical Box Plot
```
ggplot(df, aes(x = 1, y = col_name)) + geom_boxplot()  # variable on y: vertical box
ggplot(df, aes(x = col_name)) + geom_boxplot()         # variable on x: horizontal box
```
  The first call draws a vertical boxplot; the second maps the variable to x, so it draws a horizontal one.
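The two orientations can be tried on the built-in mtcars data. This is a sketch that assumes a reasonably recent ggplot2, since the horizontal form relies on geom_boxplot() detecting the orientation from the aesthetics.

```r
library(ggplot2)

# Vertical box: dummy constant on x, the variable on y
p_vertical <- ggplot(mtcars, aes(x = 1, y = mpg)) +
  geom_boxplot()

# Horizontal box: the variable on x only
p_horizontal <- ggplot(mtcars, aes(x = mpg)) +
  geom_boxplot()
```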

Measures of Center
- Median and IQR are robust to outliers and skew, so they suit non-normal data.
- Mean and standard deviation are sensitive to outliers; they suit roughly symmetric data without extreme values.
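A quick base R check of the robustness claim, using made-up numbers: adding a single extreme value moves the mean and standard deviation a lot, while the median and IQR barely change.

```r
x     <- c(10, 11, 12, 13, 14)
x_out <- c(x, 100)   # same data plus one outlier

# Center: mean vs median
center <- c(mean = mean(x), mean_out = mean(x_out),
            median = median(x), median_out = median(x_out))

# Spread: standard deviation vs interquartile range
spread <- c(sd = sd(x), sd_out = sd(x_out),
            iqr = IQR(x), iqr_out = IQR(x_out))

center
spread
```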

Modality of Distribution
- Shape
  - Unimodal
  - Bimodal
  - Multimodal
  - Uniform
- Skew
  - Right Skewed
  - Left Skewed
  - Symmetric

Outliers

Zero Inflation Strategies
- Analyze two components separately
- Collapse into two level categorical variable
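Both strategies can be sketched in base R. The spend vector below is hypothetical data with many zeros, and the labels "zero" and "nonzero" are arbitrary.

```r
# Hypothetical zero-inflated variable
spend <- c(0, 0, 0, 0, 12, 30, 45, 0, 80, 0)

# Strategy 1: analyze the two components separately
prop_zero <- mean(spend == 0)    # share of zeros
positive  <- spend[spend > 0]    # distribution of the non-zero part
summary(positive)

# Strategy 2: collapse into a two-level categorical variable
spend_cat <- factor(ifelse(spend == 0, "zero", "nonzero"),
                    levels = c("zero", "nonzero"))
table(spend_cat)
```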

Case Study

Country Code Package

library(countrycode)
countrycode(code, "cown", "country.name") # origin "cown" = Correlates of War numeric code; "cnam" is not a valid scheme

See the package documentation for the other supported code schemes.
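A minimal usage sketch, assuming the countrycode package is installed. The origin schemes "cown" (Correlates of War numeric) and "iso3c" are picked here only as examples; the original note does not say which scheme the data uses.

```r
library(countrycode)

# Correlates of War numeric code to English country name
from_cow <- countrycode(20, "cown", "country.name")

# ISO 3-letter code to English country name
from_iso <- countrycode("CAN", "iso3c", "country.name")
```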

Broom Package

library(broom)
tidy(model_obj)
bind_rows(tidy(model_obj1), tidy(model_obj2))

The printed output of a linear model is verbose and hard to work with programmatically. tidy() from the broom package turns the coefficient table into a data frame, and bind_rows() (from dplyr) stacks the tidied output of several models into a single data frame.
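A small sketch of this workflow, assuming broom and dplyr are installed. The two models on mtcars and the extra model column are illustrative choices, not part of the course material.

```r
library(broom)
library(dplyr)

model_obj1 <- lm(mpg ~ wt, data = mtcars)
model_obj2 <- lm(mpg ~ hp, data = mtcars)

# One row per coefficient: term, estimate, std.error, statistic, p.value
tidy(model_obj1)

# Stack both results, labelling each row with the model it came from
combined <- bind_rows(
  tidy(model_obj1) %>% mutate(model = "mpg ~ wt"),
  tidy(model_obj2) %>% mutate(model = "mpg ~ hp")
)
combined
```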

Splitting Data by Some column using Nest

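library(tidyr)
df %>%
  nest(data = -col_name) %>%  # one row per value of col_name; other columns go into the list-column "data"
  unnest(data)                # expand the list-column back into regular rows and columns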

Nesting packs all the other columns into a list-column of data frames, one per value of col_name; unnesting expands them back to their original form.
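A round trip on the built-in mtcars data, as a sketch assuming tidyr 1.0 or later, where the list-column is named explicitly with nest(data = ...). Grouping by cyl is an arbitrary choice.

```r
library(dplyr)
library(tidyr)

# One row per value of cyl; everything else is packed into the list-column "data"
nested <- mtcars %>% nest(data = -cyl)
nested

# Each element of the list-column is a data frame for one group
nested$data[[1]]

# unnest() restores the original rows and columns
restored <- nested %>% unnest(data)
```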

Using map to operate over a list one by one

library(purrr)
x <- c(0, 1, 2)
map(x, ~ . + 1)
map(some_list, some_function) # calls some_function on each item of some_list

Here ~ starts an anonymous function and . stands for the current item. Each item of the list is passed to the function one at a time, and map() returns a list of the results.
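The formula shorthand and an ordinary function give the same result. The sketch below assumes purrr is installed; map_dbl() is shown as one typed variant that returns a numeric vector instead of a list.

```r
library(purrr)

x <- c(0, 1, 2)

# Formula shorthand: . (or .x) is the current item
plus_one <- map(x, ~ . + 1)

# Same thing with an explicit anonymous function
plus_one_fn <- map(x, function(item) item + 1)

# map() always returns a list; map_dbl() returns a numeric vector
plus_one_vec <- map_dbl(x, ~ .x + 1)
```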

TODO P value Adjustments

TODO Understand gather and how column names appear

Gathering data

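Gathers columns into key-value pairs (wide to long).

library(tidyr)
gather(df, key_col, value_col, col_names)    # key_col holds the old column names, value_col their values
gather(df, key_col, value_col, col_1:col_8)  # gather a range of columns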
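A tiny example showing where the old column names end up, assuming tidyr is installed. The wide data frame is made up. pivot_longer() is included because tidyr now recommends it over gather(), which is kept only for existing code.

```r
library(tidyr)

# Hypothetical wide data: one row per id, one column per measurement
wide <- data.frame(id = 1:2, col_1 = c(10, 20), col_2 = c(30, 40))

# The gathered column names become the values of key_col
long <- gather(wide, key_col, value_col, col_1:col_2)
long

# The same reshape with the newer pivot_longer()
long2 <- pivot_longer(wide, col_1:col_2,
                      names_to = "key_col", values_to = "value_col")
```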

Recoding columns differently

recode(column_name,
       "prev_value" = "New name",
       "prev_val2" = "New name2")
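A runnable version of the same call on a made-up character vector, assuming dplyr is installed. In recode() the old value goes on the left and the new value on the right; values that are not mentioned are left unchanged.

```r
library(dplyr)

column_name <- c("prev_value", "prev_val2", "other")

recoded <- recode(column_name,
                  "prev_value" = "New name",
                  "prev_val2" = "New name2")
recoded
```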
