R Programming 103 - Exploratory Data Analysis
Practice Course
-
Dropping levels from categorical dataset
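df %>% filter(col_name != "certain_level") %>% droplevels()
filter() removes the rows but keeps the now-unused level in the factor; droplevels() removes that level as well.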
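As a minimal sketch of the pattern above, the following uses a small made-up data frame (the names `df`, `species`, and `weight` are hypothetical) to show that the filtered-out level lingers until `droplevels()` is called.

```r
library(dplyr)

# Hypothetical data: a factor column with three levels
df <- data.frame(
  species = factor(c("cat", "dog", "bird", "dog")),
  weight  = c(4.1, 12.3, 0.2, 9.8)
)

# Rows with "bird" are gone, but "bird" is still a level of the factor
filtered <- df %>% filter(species != "bird")
levels(filtered$species)

# droplevels() removes levels that no longer occur in the data
cleaned <- filtered %>% droplevels()
levels(cleaned$species)
```

This matters mostly for tables and plots, where an unused level would otherwise show up as an empty row or bar.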
-
Counts vs Proportions
prop.table(table_obj) # proportions of the overall total
prop.table(table_obj, number) # used to condition on some factor
prop.table(table_obj, 1) # condition on rows
prop.table(table_obj, 2) # condition on columns
With a margin supplied, each row (or column) sums to 1, which gives us the conditional distributions; without it, the proportions are of the whole table.
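To make the difference between the three calls concrete, here is a small sketch in base R. The table `tab` and its variables `gender` and `smoker` are invented for illustration.

```r
# Hypothetical 2x2 table of counts
tab <- table(
  gender = c("F", "F", "M", "M", "M", "F"),
  smoker = c("no", "yes", "no", "no", "yes", "no")
)

joint  <- prop.table(tab)      # all cells together sum to 1
by_row <- prop.table(tab, 1)   # each row sums to 1 (conditioned on gender)
by_col <- prop.table(tab, 2)   # each column sums to 1 (conditioned on smoker)

rowSums(by_row)
colSums(by_col)
```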
-
Numerical Data
When looking at a plot of numerical data, note the center of the distribution (for example, the mean) and its variability (spread).
-
Density plot
geom_density()
geom_density(bw = number) # bandwidth: controls the smoothing of the curve
When we only want to see the overall pattern of a distribution rather than its details, we can use a density plot.
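A minimal sketch of both calls, assuming the built-in `mtcars` data and its `mpg` column as a stand-in for your own data. The value `bw = 5` is an arbitrary choice to show a visibly smoother curve.

```r
library(ggplot2)

# Default bandwidth, chosen automatically
p_default <- ggplot(mtcars, aes(x = mpg)) + geom_density()

# Larger bandwidth: smoother curve, less detail
p_smooth <- ggplot(mtcars, aes(x = mpg)) + geom_density(bw = 5)

# Print either object to draw it, e.g. print(p_smooth)
```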
-
Distribution of One Variable
- Marginal distribution
- Conditional distribution
-
Vertical Box Plot
ggplot(df, aes(x = 1, y = col_name)) + geom_boxplot()
ggplot(df, aes(x = col_name)) + geom_boxplot()
The first call draws a vertical boxplot. The second maps the variable to x, which in recent versions of ggplot2 draws a horizontal one by default.
-
Measures of Center
- Median and IQR are robust to outliers and skew, so they suit non-normal data.
- Mean and standard deviation are sensitive to outliers and work best for roughly symmetric data.
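The robustness claim can be checked with a quick sketch in base R. The vectors `x` and `x_out` are made up: the second is the first plus one extreme value.

```r
x     <- c(10, 12, 11, 13, 12)
x_out <- c(x, 100)   # same data plus one extreme value

# Mean and standard deviation shift a lot once the outlier is added
mean(x); mean(x_out)
sd(x);   sd(x_out)

# Median and IQR barely move
median(x); median(x_out)
IQR(x);    IQR(x_out)
```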
-
Modality of Distribution
- Shape
  - Unimodal
  - Bimodal
  - Multimodal
  - Uniform
- Skew
  - Right Skewed
  - Left Skewed
  - Symmetric
- Outliers
-
Zero Inflation Strategies
- Analyze two components separately
- Collapse into a two-level categorical variable (zero vs. non-zero)
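Both strategies can be sketched in a few lines of base R. The vector `purchases` is invented for illustration and stands for any variable with many zeros.

```r
# Hypothetical zero-inflated variable
purchases <- c(0, 0, 0, 3, 0, 7, 1, 0, 0, 12)

# Strategy 1: analyze the two components separately
prop_zero <- mean(purchases == 0)       # share of zeros
nonzero   <- purchases[purchases > 0]   # distribution of the non-zero part
summary(nonzero)

# Strategy 2: collapse into a two-level categorical variable
any_purchase <- factor(ifelse(purchases == 0, "zero", "nonzero"))
table(any_purchase)
```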
Case Study
Country Code Package
library(countrycode)
countrycode(code, "cown", "country.name") # "cown" = Correlates of War numeric code
See the package documentation for the other coding schemes it can convert between.
Broom Package
library(broom)
library(dplyr) # for bind_rows()
tidy(model_obj)
bind_rows(tidy(model_obj1), tidy(model_obj2))
The printed output of a linear model is verbose and hard to work with programmatically. tidy() from the broom package turns it into a data frame with one row per coefficient, and bind_rows() can then combine the tidied output of several models into a single data frame.
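A minimal sketch of that workflow, assuming two simple regressions on the built-in `mtcars` data. The models and the extra `model` label column are illustrative choices, not part of the course material.

```r
library(broom)
library(dplyr)

model_1 <- lm(mpg ~ wt, data = mtcars)
model_2 <- lm(mpg ~ hp, data = mtcars)

# Tidy each model, label it, and stack the results into one data frame
coefs <- bind_rows(
  tidy(model_1) %>% mutate(model = "mpg ~ wt"),
  tidy(model_2) %>% mutate(model = "mpg ~ hp")
)
coefs
```

Adding a label column before binding keeps track of which rows came from which model.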
Splitting Data by Some column using Nest
library(tidyr)
df %>% nest(data = -col_name) %>%
  unnest(data)
Nesting collapses all the other columns into a list-column (here named data) that holds one data frame per value of col_name; unnesting expands it back to its original form.
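The round trip can be sketched as follows, assuming tidyr 1.0 or later and using the built-in `mtcars` data split by its `cyl` column.

```r
library(dplyr)
library(tidyr)

# One row per value of cyl; the other columns live in the list-column `data`
nested <- mtcars %>% nest(data = -cyl)
nested

# Expand the list-column again: back to one row per car
restored <- nested %>% unnest(data)
```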
Using map to operate over a list one by one
library(purrr)
x <- c(0, 1, 2)
map(x, ~ . + 1)
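map(list, function) # the function is applied to each item of the list
Here ~ starts a formula that map() treats as an anonymous function, and . refers to the current item. Each item is passed to the function one by one, and the results come back as a list.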
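As a short sketch of the equivalence, the formula shorthand and a written-out function give the same result. `map_dbl()` is a purrr variant that returns a numeric vector instead of a list.

```r
library(purrr)

x <- c(0, 1, 2)

short  <- map(x, ~ . + 1)                   # formula shorthand, returns a list
full   <- map(x, function(item) item + 1)   # same thing, written out
as_dbl <- map_dbl(x, ~ . + 1)               # returns a numeric vector
```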
TODO P value Adjustments
TODO Understand gather and how column names appear
Gathering data
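Gathers columns into key-value pairs (wide to long): key_col receives the former column names and value_col their values.
library(tidyr)
df %>% gather(key_col, value_col, col_names)
df %>% gather(key_col, value_col, col_1:col_8)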
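To see where the column names end up, here is a small sketch with an invented data frame `wide`. The names `id`, `q1`, `q2`, `question`, and `score` are hypothetical. The last call uses `pivot_longer()`, the newer tidyr function for the same reshaping.

```r
library(tidyr)

wide <- data.frame(
  id = 1:2,
  q1 = c(5, 3),
  q2 = c(4, 2)
)

# The names q1 and q2 become values of `question`; their cells go to `score`
long <- gather(wide, key = "question", value = "score", q1:q2)
long

# Same reshaping with the newer interface
long2 <- pivot_longer(wide, q1:q2, names_to = "question", values_to = "score")
```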
Recoding columns differently
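library(dplyr)
recode(column_name,
  "prev_value" = "New name", # old value on the left, new value on the right
  "prev_val2" = "New name2")
Values that are not listed are left unchanged.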
2020-10-05 00:00 +0545