R Programming 102- Cleaning data ::

library(assertive)
assert_is_numeric(df$col_name)
assert_is_character(df$col_name)
class(df$col_name) # check data types

Similarly, we have more functions for more data type checks.

library(stringr)
str_remove(df$tbl_name, "char_to_be_removed")

Assert values within a range given by lower and upper arguments.

assert_all_are_in_closed_range(df$tbl_name, lower = num, upper = num)

Assert all dates are in the past

assert_all_are_in_past(tbl_name$col_name)
as.Date(dateCol) # to convert to date type

Uniqueness constraints for Full duplicates

duplicated(df) # finds full duplicates and returns logical values
sum(duplicated(df)) # count of full duplicates

Uniqueness constraints for partial duplicates

df %>% count(some_col) %>% filter(n > 1)

This can be done using dplyr packages functionality.

Remove partial duplicates using distinct function

credit_scores %>% distinct(col_name1, col_name2, ... , .keep_all = TRUE )

Using distinct we can choose number of columns as index and find duplicates by that index. Keep all returns all columns after distinct.

Replace partial duplicate by their summary: mean, median and so on.

df %>% group_by(index_col1, index_col2, ...) %>%
  mutate(mean_col_value = mean(some_col)) %>%
  distinct(index_col1, index_col2, ...) %>%
  select(-some_col)

Here you first group by indexes, then calculate the mean value using those groupings, find distinct rows across those indexes and remove that original column from which mean was calculated.

Categorical Data

Removing rows with categorical variable of no sense. e.g. When the value should be “YES” or “NO” but it is “Light”.

semi_join(dest_tbl) # to get only useful rows
anti_join(dest_tbl) # to remove unuseful rows

When working with factors, you might have strings with spaces as well as case differences such as " col_name" and “col_name” OR “col_Name” and “COL_NAME”. You need to make sure that these all belong to one and the same group.

str_trim("string")
to_lower("string")

Collapse categories that have very few number of observations and can be mapped onto some other category.

collapse_categories <- c("category1", "category2")
mutate(col_collapsed = fct_collapse(col_name, categ_to_rename_to = collapse_categories))

Cleaning dirty text data

Formatting inconsistencies: Phone numbers written differently
Information Inconsistency: Full Names vs Only Last Names

str_remove_all("-")
str_remove("-") # removes the first occurence only
str_detect(col_name, fixed("("))
str_detect(tbl_name$col_name, "-")
str_replace_all(col_name, "to_be_replaced", "replace_with_this")

Specially for paranthesis, we need to use fixed function to use them inside our strings in R. Uniformity in data entries e.g. Money units, Temperatures and so on.

Find why the data is not uniform by researching into possible causes.
Infer based on other data in the dataset.
Find which sources data come from. This makes it easier to find most possible reasons of things.

Working with date objects

library(lubridate)
formats <- c ("%Y/%m/%d", ...)
parse_date_time(col_name, orders = formats)
as.numeric(date_col %--% today(), "years") # get the number of years by subtraction

Find NA values.

sum(is.na(tbl_name))
vis_miss(tbl_name)
mean(col_name, na.rm = TRUE) # mean without NA values.

Types of Missing Data

Missing completely at random: No relation to other variables.
Missing at random: Systematic relation between missing data and observed variables.
Missing not at random: Systematic relation between missing data and some unobserved variable.

Assert there are no NA values:

assert_all_are_not_na(tbl$col_name)

String Comparisons

library(stringdist)
stringdist("string_1", "string_2", method="any")

Some Methods of comparison are as follows:

Damerau Levenshtein distance

Minimum edit distance is the least number of steps [ Insertion, Deletion, Transposition, Substitution] that you need to do to get to the other string.

Levenshtein distance

No transposition of characters.

Longest Common Subsequence (LCS)

Only insertion and deletion

Fuzzy Join Package

library(fuzzyjoin)
stringdist_left_join(one_tbl, another_tbl, by="some_col", method = "any", max_dist = num)

Record Linkage Package

Generate and compare pairs
- Generate pairs
```
library(reclin)
pair_blocking(df1, df2)
pair_blocking(df1, df2, blocking_var = "any_var")
```
  When using multiple comparisons, if we compare each row to the other; there will be non scalable number of comparisons. So, we use blocking. Blocking means consider pairs only if the rows agree on the blocking variable. e.g. People’s information can be blocked by country.
- Compare pairs
```
pair_blocking(...) %>%
  compare_pairs(by = c("col_name", "col_other"), default_comparator=lcs()) %>%
  score_simsum()
score_problink() # scores probabilistically
select_n_to_m() %>%
  link()
```
  On comparing with multiple columns, it will show the scores for each columns.

Backlinks

Data Scientist with R