4 minutes
R Programming 102- Cleaning data
library(assertive)
assert_is_numeric(df$col_name)
assert_is_character(df$col_name)
class(df$col_name) # check data types
Similarly, we have more functions for more data type checks.
library(stringr)
str_remove(df$tbl_name, "char_to_be_removed")
Assert values within a range given by lower and upper arguments.
assert_all_are_in_closed_range(df$tbl_name, lower = num, upper = num)
Assert all dates are in the past
assert_all_are_in_past(tbl_name$col_name)
as.Date(dateCol) # to convert to date type
Uniqueness constraints for Full duplicates
duplicated(df) # finds full duplicates and returns logical values
sum(duplicated(df)) # count of full duplicates
Uniqueness constraints for partial duplicates
df %>% count(some_col) %>% filter(n > 1)
This can be done using dplyr packages functionality.
Remove partial duplicates using distinct function
credit_scores %>% distinct(col_name1, col_name2, ... , .keep_all = TRUE )
Using distinct we can choose number of columns as index and find duplicates by that index. Keep all returns all columns after distinct.
Replace partial duplicate by their summary: mean, median and so on.
df %>% group_by(index_col1, index_col2, ...) %>%
mutate(mean_col_value = mean(some_col)) %>%
distinct(index_col1, index_col2, ...) %>%
select(-some_col)
Here you first group by indexes, then calculate the mean value using those groupings, find distinct rows across those indexes and remove that original column from which mean was calculated.
Categorical Data
Removing rows with categorical variable of no sense. e.g. When the value should be “YES” or “NO” but it is “Light”.
semi_join(dest_tbl) # to get only useful rows
anti_join(dest_tbl) # to remove unuseful rows
When working with factors, you might have strings with spaces as well as case differences such as " col_name" and “col_name” OR “col_Name” and “COL_NAME”. You need to make sure that these all belong to one and the same group.
str_trim("string")
to_lower("string")
Collapse categories that have very few number of observations and can be mapped onto some other category.
collapse_categories <- c("category1", "category2")
mutate(col_collapsed = fct_collapse(col_name, categ_to_rename_to = collapse_categories))
Cleaning dirty text data
- Formatting inconsistencies: Phone numbers written differently
- Information Inconsistency: Full Names vs Only Last Names
str_remove_all("-")
str_remove("-") # removes the first occurence only
str_detect(col_name, fixed("("))
str_detect(tbl_name$col_name, "-")
str_replace_all(col_name, "to_be_replaced", "replace_with_this")
Specially for paranthesis, we need to use fixed function to use them inside our strings in R. Uniformity in data entries e.g. Money units, Temperatures and so on.
- Find why the data is not uniform by researching into possible causes.
- Infer based on other data in the dataset.
- Find which sources data come from. This makes it easier to find most possible reasons of things.
Working with date objects
library(lubridate)
formats <- c ("%Y/%m/%d", ...)
parse_date_time(col_name, orders = formats)
as.numeric(date_col %--% today(), "years") # get the number of years by subtraction
Find NA values.
sum(is.na(tbl_name))
vis_miss(tbl_name)
mean(col_name, na.rm = TRUE) # mean without NA values.
Types of Missing Data
- Missing completely at random: No relation to other variables.
- Missing at random: Systematic relation between missing data and observed variables.
- Missing not at random: Systematic relation between missing data and some unobserved variable.
Assert there are no NA values:
assert_all_are_not_na(tbl$col_name)
String Comparisons
library(stringdist)
stringdist("string_1", "string_2", method="any")
Some Methods of comparison are as follows:
-
Damerau Levenshtein distance
Minimum edit distance is the least number of steps [ Insertion, Deletion, Transposition, Substitution] that you need to do to get to the other string.
-
Levenshtein distance
No transposition of characters.
-
Longest Common Subsequence (LCS)
Only insertion and deletion
-
Fuzzy Join Package
library(fuzzyjoin) stringdist_left_join(one_tbl, another_tbl, by="some_col", method = "any", max_dist = num)
-
Record Linkage Package
Generate and compare pairs
-
Generate pairs
library(reclin) pair_blocking(df1, df2) pair_blocking(df1, df2, blocking_var = "any_var")When using multiple comparisons, if we compare each row to the other; there will be non scalable number of comparisons. So, we use blocking. Blocking means consider pairs only if the rows agree on the blocking variable. e.g. People’s information can be blocked by country.
-
Compare pairs
pair_blocking(...) %>% compare_pairs(by = c("col_name", "col_other"), default_comparator=lcs()) %>% score_simsum() score_problink() # scores probabilistically select_n_to_m() %>% link()On comparing with multiple columns, it will show the scores for each columns.
-
Backlinks
642 Words
2020-10-03 00:00 +0545