R Programming 101: dplyr

Dplyr

Dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.

install.packages("dplyr")
library(dplyr)

The pipe operator passes the result of one step on as the input of the next, much like method chaining in Python and other languages. In R it is written as:

%>%

See an example:

gapminder %>% filter(year == 1957) %>% group_by(continent)

The command above first filters the dataset to the rows where year is 1957, then groups that output by continent.

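Grouping on its own does not change the rows. Once a summary such as the median population is computed per group, the result is a table of this shape:

Continent	MedianPop
Asia	300000000
Africa	200000000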
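A minimal sketch of how a pipeline like the one above can be extended with summarize() to give one row per continent. The pops table and its numbers are made up for illustration and only stand in for gapminder; tibble() is re-exported by dplyr.

```r
library(dplyr)

# Hypothetical stand-in for gapminder: made-up rows, not real data
pops <- tibble(
  continent = c("Asia", "Asia", "Africa", "Africa", "Asia"),
  year      = c(1957, 1957, 1957, 1957, 1962),
  pop       = c(100, 300, 50, 150, 999)
)

median_by_continent <- pops %>%
  filter(year == 1957) %>%             # keep only the 1957 rows
  group_by(continent) %>%              # one group per continent
  summarize(MedianPop = median(pop))   # collapse each group to one row

print(median_by_continent)
```

The 1962 row is dropped by filter() before grouping, so it does not affect the medians.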

count(column_name, sort = TRUE, wt = some_column)

This verb counts the rows for each distinct value of the given variables. Setting sort = TRUE orders the result from the largest count to the smallest, directly from this verb. The wt argument supplies a weight column: instead of counting rows, count() sums that column within each group.
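To make the difference between a plain count and a weighted count concrete, here is a small sketch. The sales table and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical data: one row per sale
sales <- tibble(
  region = c("east", "west", "east", "east", "west"),
  units  = c(10, 5, 20, 30, 100)
)

# Number of rows per region, largest count first
row_counts <- sales %>% count(region, sort = TRUE)

# Sum of units per region instead of a row count
unit_totals <- sales %>% count(region, wt = units, sort = TRUE)

print(row_counts)
print(unit_totals)
```

In both results the computed column is called n. With these rows, "east" has the most rows but "west" has the larger weighted total, so the two sorted outputs start with different regions.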

Top N

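Usually used together with group_by(). For each group, it returns the given number of rows with the largest values in the chosen column. It is mostly used to pick out the extremes of each group before plotting them. Since dplyr 1.0.0, top_n() is superseded by slice_max() and slice_min(), although it still works.

top_n(NUMBER, column_name)
top_n(5, population)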
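A short sketch of picking the top rows per group, once with top_n() and once with slice_max(), which needs dplyr 1.0.0 or later. The scores table is hypothetical.

```r
library(dplyr)

# Hypothetical data: three scores per team
scores <- tibble(
  team  = c("a", "a", "a", "b", "b", "b"),
  score = c(1, 5, 3, 9, 2, 7)
)

# Two highest scores within each team
top_two <- scores %>%
  group_by(team) %>%
  top_n(2, score) %>%
  ungroup()

# Same selection with the newer verb
top_two_new <- scores %>%
  group_by(team) %>%
  slice_max(score, n = 2) %>%
  ungroup()

print(top_two)
print(top_two_new)
```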

Ungroup

After grouping, call ungroup() to remove the grouping so that later verbs work on the whole table again. You will often group, compute some per-group values, then ungroup and treat the result as a plain dataset for further summaries.

ungroup()
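The sketch below shows why ungrouping matters: the same mutate() expression gives different results before and after ungroup(). The table and column names are hypothetical.

```r
library(dplyr)

# Hypothetical data
countries <- tibble(
  continent = c("Asia", "Asia", "Africa"),
  country   = c("A", "B", "C"),
  pop       = c(30, 10, 60)
)

shares <- countries %>%
  group_by(continent) %>%
  mutate(share_in_continent = pop / sum(pop)) %>%  # sum(pop) is per continent here
  ungroup() %>%
  mutate(share_overall = pop / sum(pop))           # sum(pop) is over all rows here

print(shares)
```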

Rename

This renames columns. The new name goes on the left of the equals sign and the existing name on the right.

dataframe %>% rename(new_name = original_name)
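A minimal example of rename() on a hypothetical two-column table; only the named column changes and the column order is preserved.

```r
library(dplyr)

# Hypothetical data
stats <- tibble(pop = c(1, 2), gdpPercap = c(3, 4))

# new_name = original_name
renamed <- stats %>% rename(population = pop)

print(names(renamed))
```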

Inner Join

inner_join(table_name, by = c("col_name_in_1st_table" = "col_name_in_2nd_table"), suffix = c("suffix_for_1st_table", "suffix_for_2nd_table"))

An inner join keeps only the rows that have a match in both tables. If the key column has the same name in both tables, by = "col_name" is enough to make the join. NOTE: inside the c() used in the by argument there is an equals sign between the two names, not a comma. I kept confusing it with a comma all the time.
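Here is a small sketch of an inner join where the key column has a different name in each table. The orders and customers tables are hypothetical.

```r
library(dplyr)

# Hypothetical tables whose key columns are named differently
orders <- tibble(order_id = c(1, 2, 3), cust = c(10, 20, 30))
customers <- tibble(customer_id = c(10, 20, 40), name = c("Ann", "Bob", "Dee"))

# Note the equals sign inside c(): "column in first table" = "column in second table"
joined <- orders %>%
  inner_join(customers, by = c("cust" = "customer_id"))

print(joined)
```

Only customers 10 and 20 appear in both tables, so the result has two rows. The key keeps the name it has in the first table.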

Left Join

Use left_join when you want to keep every row of the first table, together with the matching data from the second table. Rows of the first table without a match get NA in the columns that come from the second table.

left_join(table_2, by = …, suffix = …)

Right Join

Use right_join when you want to keep every row of the second table, together with the matching data from the first table. Rows of the second table without a match get NA in the columns that come from the first table.

right_join(table_2, by = …, suffix = …)

Full Join

Keeps all rows from both tables. Where a row has no match in the other table, the columns from that table are filled with NA.

full_join(table_2, by = …, suffix = …)
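To compare the left, right, and full joins side by side, the sketch below runs all three on the same pair of hypothetical tables x and y that share an id column.

```r
library(dplyr)

# Hypothetical tables: ids 2 and 3 appear in both
x <- tibble(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- tibble(id = c(2, 3, 4), b = c("y2", "y3", "y4"))

left  <- x %>% left_join(y, by = "id")   # every row of x; b is NA for id 1
right <- x %>% right_join(y, by = "id")  # every row of y; a is NA for id 4
full  <- x %>% full_join(y, by = "id")   # every row of both tables

print(left)
print(right)
print(full)
```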

Semi Join

A semi join returns the rows of the first table that have a match in the second table. Only the columns of the first table are kept, and no columns from the second table are added.

semi_join(table_2, by = …)

Anti Join

An anti join returns the rows of the first table that have no match in the second table. Only the columns of the first table are kept.

anti_join(table_2, by = …)
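A sketch of the semi join and the anti join on the same pair of hypothetical tables. Both return rows of the first table only, and the two results together cover the whole first table.

```r
library(dplyr)

# Hypothetical tables: ids 2 and 3 appear in both
x <- tibble(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- tibble(id = c(2, 3, 4), b = c("y2", "y3", "y4"))

semi <- x %>% semi_join(y, by = "id")  # rows of x that have a match in y
anti <- x %>% anti_join(y, by = "id")  # rows of x that have no match in y

print(semi)
print(anti)
```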

Replace NAs

replace_na(list(col_name = 0))

We often have NA values that we want to give meaning to. In this case, every NA in col_name is set to 0. Note that replace_na() comes from the tidyr package rather than dplyr, so load it with library(tidyr).
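A minimal sketch of replace_na() on a hypothetical table. It assumes the tidyr package is installed alongside dplyr.

```r
library(dplyr)
library(tidyr)  # replace_na() lives in tidyr

# Hypothetical data with one missing value
visits <- tibble(id = c(1, 2, 3), visits = c(5, NA, 2))

# Replace NA in the visits column with 0; other columns are untouched
filled <- visits %>% replace_na(list(visits = 0))

print(filled)
```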

Suffix Argument

any_join_type(table_name, by_condition, suffix = c("_replaceText", "_replaceTextforSecondMatch"))

To clarify the use of this argument, assume there are two tables that both contain a column called "name" which is not the join key. When R joins the two tables, it has to tell the two "name" columns apart. This is where suffix comes into play. You give suffix two strings: the first is appended to the column name from the first table and the second to the column name from the second table, so the call above yields name_replaceText and name_replaceTextforSecondMatch. If you leave suffix out, dplyr uses ".x" and ".y".
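The sketch below shows suffix on two hypothetical tables that both have a non-key column called name.

```r
library(dplyr)

# Hypothetical tables that share the non-key column "name"
players <- tibble(id = c(1, 2), name = c("Ann", "Bob"))
teams   <- tibble(id = c(1, 2), name = c("Reds", "Blues"))

joined <- players %>%
  inner_join(teams, by = "id", suffix = c("_player", "_team"))

print(names(joined))
```

The key column id keeps its name; only the clashing name columns receive the suffixes.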

n() in Summarize Columns

To count the number of rows in each group, use n() inside summarize().

summarize(countValue = n(), others…)
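A small sketch of n() next to another summary, using a hypothetical table. The totalPop column is only there to show that n() can sit beside other summaries.

```r
library(dplyr)

# Hypothetical data
countries <- tibble(
  continent = c("Asia", "Asia", "Africa"),
  pop       = c(1, 2, 3)
)

group_counts <- countries %>%
  group_by(continent) %>%
  summarize(countValue = n(),      # rows in each group
            totalPop   = sum(pop))

print(group_counts)
```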

Bind rows

This stacks tables on top of one another. Columns are matched by name, and a column missing from one table is filled with NA.

table_1 %>% bind_rows(table_2)
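A sketch of bind_rows() on two hypothetical tables whose columns only partly overlap, to show how the missing cells are filled.

```r
library(dplyr)

# Hypothetical tables: both have id, but x and y exist in only one each
table_1 <- tibble(id = c(1, 2), x = c("a", "b"))
table_2 <- tibble(id = 3, y = TRUE)

stacked <- table_1 %>% bind_rows(table_2)

print(stacked)
```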

Glimpse()

Gives a brief glimpse of the dataset. It prints the number of rows and columns, then one line per column with its name, its type, and as many of the first values as fit on the line. This gives an idea of the kind of values you will see in each column, and often of their range as well.
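A minimal call on a hypothetical table; the exact layout of the printout depends on the installed package versions and the console width.

```r
library(dplyr)

# Hypothetical data
people <- tibble(
  name = c("Ann", "Bob", "Cid"),
  age  = c(31, 45, 27)
)

# Prints the dimensions, then one line per column
glimpse(people)
```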