Chapter 3 Data wrangling using tidyverse
Follow this link to download everything you need for this unit. When you get to GitHub click on “Code” (green button) and select “download zip”. You will then save this to a local folder where you should do all of your work for this class. You will work through the “_blank.Rmd”. Always be sure to read the README.md files in the GitHub repo.
Once you have this folder saved where you would like it, open RStudio and navigate to the folder. Next, open the project (“.Rproj”). Doing so will set the folder as the working directory, make your life easier, and make everything generally work.
3.1 Introduction
We have messed around with plotting a bit and you’ve seen a little of what R can do. So now let’s review or introduce you to some basics. Even if you have worked in R before, it is good to be reminded of and practice this stuff, so stay tuned in!
Reading for this week:
3.2 Learning objectives:
Working through this exercise will help students:
- become more familiar with the RStudio IDE
- get in the habit of running single lines of code
- know what a tibble is
- know what the assignment operator is
- begin using base and dplyr functions
3.3 You can use R as a calculator
If you just type numbers and operators in, R will spit out the results.
It is generally good to run one line of code at a time. On a Mac you do that by putting your cursor on the line and hitting command + enter. On Windows/PC it is ctrl + enter.
Here is a link to info on Editing and Executing code in RStudio
Very handy link to all keyboard shortcuts Windows, Linux and Mac
## [1] 3
## [1] 4
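The output above could come from simple arithmetic like the sketch below (the blank chunk may use different expressions; these are just ones that reproduce the printed results):

```r
# type an expression, run the line, and R prints the result
1 + 2  # [1] 3
2 * 2  # [1] 4
```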
3.4 You can create new objects using <-
Yes, = does the same thing. But use <-. We will call <- the assignment operator. When we are coding in R we use <- to assign values to objects and = to set values for parameters in functions/equations/etc. Using <- helps us differentiate between the two. Norms for formatting are important because they help us understand what code is doing, especially when stuff gets complex.
Oh, one more thing: Surround operators with spaces.
x <- 1 is easier to read than x<-1
You can assign single numbers or entire chunks of data using <-
So if you had an object called my_data and wanted to copy it into my_new_data you could do:
my_new_data <- my_data
You can then recall/print the values in an object by just typing the name by itself.
In the code chunk below, assign a 3 to the object “y” and then print it out.
# This is a code chunk.
# Putting a pound sign in here allows me to type text that is not code.
# The stuff below is code. Not text.
y <- 3
y
## [1] 3
If you want to assign multiple values, you have to put them in the function c(). The c stands for combine. R doesn’t know what to do if you just give it a bunch of values separated by spaces or commas, but if you pass them as arguments to the combine function, it’ll make them into a vector.
Any time you need to use several values, even when passing them as an argument to a function, you have to put them in c() or it won’t work.
## [1] 1 2 3 4
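A minimal sketch that matches the output above (the object name my_numbers is just an illustrative choice):

```r
# c() combines individual values into a single vector
my_numbers <- c(1, 2, 3, 4)
my_numbers  # [1] 1 2 3 4
```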
When you are creating objects, try to give them meaningful names so you can remember what they are. You can’t have spaces or operators that mean something else as part of a name. And remember, everything is case sensitive.
Assign the value 5.4 to water_pH and then try to recall it by typing “water_ph”
## [1] 5.4
If we want to remove something from the environment we can use rm(). Try to remove water_pH.
You can also set objects equal to strings, or values that have letters in them. To do this you just have to put the value in quotes, otherwise R will think it is an object name and tell you it doesn’t exist.
Try: name <- “your name” and then name <- your name
What happens if you forget the ending parenthesis?
R can be cryptic with its error messages and other responses, but once you get used to them, you will know exactly what is wrong when they pop up.
As a note: when you go to the internet for example code, it will often say things like df <- your_data. This is similar to what I’ve written above: name <- “your name”. It means enter your name (or your data). As you progress you will get better at understanding example code and error messages.
3.5 Using functions
As an example, let’s try the seq() function, which creates a sequence of numbers.
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 10 9 8 7 6 5 4 3 2 1
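One set of calls that reproduces the four outputs above (there are several equivalent ways to call seq()):

```r
seq(1, 10)                      # positional arguments
seq(from = 1, to = 10)          # the same call with named arguments
seq(from = 1, to = 10, by = 1)  # explicit step size
seq(10, 1)                      # counts down when from > to
```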
3.6 Read in some data
First we will load the tidyverse library; everything we have done so far today is in base R. Next, let’s load a few dataframes and have a look at them. We will load the “PINE_NFDR_Jan-Mar_2010.csv” and “flashy_dat_all.csv” files.
Important: read_csv() is the tidyverse csv reading function, the base R function is read.csv(). read.csv() will not read your data in as a tibble, which is the format used by tidyverse functions. You should get in the habit of using the tidyverse versions such as read_csv().
## Rows: 4320 Columns: 8
## ── Column specification ───────────────────────────────────
## Delimiter: ","
## chr (2): StationID, surrogate
## dbl (5): cfs, year, quarter, month, day
## dttm (1): datetime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 1144 Columns: 26
## ── Column specification ───────────────────────────────────
## Delimiter: ","
## chr (5): STANAME, HUC02, STATE, CLASS, AGGECOREGION
## dbl (21): site_no, RBI, RBIrank, DRAIN_SQKM, LAT_GAGE, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
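The messages above come from two read_csv() calls along the lines of the sketch below. The file paths assume the csvs sit in the project folder (adjust if yours are in a subfolder), and the object names flow and rbi are the ones used through the rest of this chapter:

```r
library(tidyverse)

# read each file in as a tibble; read_csv() reports the column types it guessed
flow <- read_csv("PINE_NFDR_Jan-Mar_2010.csv")
rbi  <- read_csv("flashy_dat_all.csv")
```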
3.7 What is a tibble?
Good question. It’s a fancy way to store data that works well with tidyverse functions. Let’s look at the flow tibble with head() and str().
## # A tibble: 6 × 8
## StationID cfs surrogate datetime year
## <chr> <dbl> <chr> <dttm> <dbl>
## 1 PINE 11.6 N 2010-01-01 00:00:00 2010
## 2 PINE 11.6 N 2010-01-01 01:00:00 2010
## 3 PINE 11.2 N 2010-01-01 02:00:00 2010
## 4 PINE 11.2 N 2010-01-01 03:00:00 2010
## 5 PINE 11.2 N 2010-01-01 04:00:00 2010
## 6 PINE 11.2 N 2010-01-01 05:00:00 2010
## # ℹ 3 more variables: quarter <dbl>, month <dbl>,
## # day <dbl>
## spc_tbl_ [4,320 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ StationID: chr [1:4320] "PINE" "PINE" "PINE" "PINE" ...
## $ cfs : num [1:4320] 11.6 11.6 11.2 11.2 11.2 ...
## $ surrogate: chr [1:4320] "N" "N" "N" "N" ...
## $ datetime : POSIXct[1:4320], format: "2010-01-01 00:00:00" ...
## $ year : num [1:4320] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ quarter : num [1:4320] 1 1 1 1 1 1 1 1 1 1 ...
## $ month : num [1:4320] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : num [1:4320] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. StationID = col_character(),
## .. cfs = col_double(),
## .. surrogate = col_character(),
## .. datetime = col_datetime(format = ""),
## .. year = col_double(),
## .. quarter = col_double(),
## .. month = col_double(),
## .. day = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
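A sketch of the calls that produce the output above, assuming the flow object from the read-in step:

```r
head(flow)  # first six rows, printed as a tibble
str(flow)   # structure: dimensions, column names, types, and example values
```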
Now read in the same data with read.csv() which will NOT read the data as a tibble. How is it different? Output each one in the Console.
Knowing the data type for each column is super helpful for a few reasons…. let’s talk about them.
## StationID cfs surrogate datetime year
## 1 PINE 11.58 N 2010-01-01T00:00:00Z 2010
## 2 PINE 11.58 N 2010-01-01T01:00:00Z 2010
## 3 PINE 11.24 N 2010-01-01T02:00:00Z 2010
## 4 PINE 11.24 N 2010-01-01T03:00:00Z 2010
## 5 PINE 11.24 N 2010-01-01T04:00:00Z 2010
## 6 PINE 11.24 N 2010-01-01T05:00:00Z 2010
## quarter month day
## 1 1 1 1
## 2 1 1 1
## 3 1 1 1
## 4 1 1 1
## 5 1 1 1
## 6 1 1 1
## # A tibble: 6 × 8
## StationID cfs surrogate datetime year
## <chr> <dbl> <chr> <dttm> <dbl>
## 1 PINE 11.6 N 2010-01-01 00:00:00 2010
## 2 PINE 11.6 N 2010-01-01 01:00:00 2010
## 3 PINE 11.2 N 2010-01-01 02:00:00 2010
## 4 PINE 11.2 N 2010-01-01 03:00:00 2010
## 5 PINE 11.2 N 2010-01-01 04:00:00 2010
## 6 PINE 11.2 N 2010-01-01 05:00:00 2010
## # ℹ 3 more variables: quarter <dbl>, month <dbl>,
## # day <dbl>
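A sketch of the comparison, again assuming the csv sits in the project folder (the object name flow_df is an illustrative choice):

```r
# base-R reader: returns a plain data.frame, not a tibble
flow_df <- read.csv("PINE_NFDR_Jan-Mar_2010.csv")
head(flow_df)  # datetime comes in as character, and every column prints
head(flow)     # the tibble shows column types and truncates wide output
```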
3.8 Data wrangling in dplyr
If you forget syntax or what the following functions do, here is a cheat sheet: https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
We will demo five functions below (these are tidyverse/dplyr functions):
- filter() - returns rows that meet specified conditions
- arrange() - reorders rows
- select() - pull out variables (columns)
- mutate() - create new variables (columns) or reformat existing ones
- summarize() - collapse groups of values into summary stats
3.9 Filter
Write an expression that returns the rows of rbi for the state of Montana (MT).
## # A tibble: 17 × 26
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 6.04e6 0.0338 8 Madiso… 1126. 10U 44.7
## 2 6.04e6 0.0724 61 Gallat… 2120. 10U 45.5
## 3 6.07e6 0.104 123 Dearbo… 835. 10U 47.2
## 4 6.15e6 0.118 157 Little… 33.4 10U 48.0
## 5 6.19e6 0.110 143 Soda B… 73.1 10U 45.0
## 6 6.19e6 0.0627 46 Gardne… 514. 10U 45.0
## 7 6.19e6 0.0649 49 Yellow… 6784. 10U 45.1
## 8 6.29e6 0.0652 50 Little… 471. 10U 45.0
## 9 6.29e6 0.0923 94 Lodge … 218 10U 45.1
## 10 1.24e7 0.0922 93 Middle… 2939. 17 48.5
## 11 1.24e7 0.0848 77 Mill C… 50.8 17 47.8
## 12 1.24e7 0.0985 113 South … 19.7 17 47.5
## 13 1.24e7 0.118 153 Missio… 32.2 17 47.3
## 14 1.24e7 0.0698 55 South … 151. 17 47.2
## 15 1.24e7 0.0470 19 Big Kn… 17.7 17 47.1
## 16 1.24e7 0.100 117 Revais… 60.8 17 47.3
## 17 1.24e7 0.0738 64 Prospe… 470. 17 47.6
## # ℹ 19 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>,
## # T_MIN_SITE <dbl>, PET <dbl>, SNOW_PCT_PRECIP <dbl>, …
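One way to write that expression, using the STATE column shown in the output:

```r
filter(rbi, STATE == "MT")  # keep only rows where STATE is "MT"
```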
And one that keeps flows less than 100 cfs in the “flow” dataframe.
## # A tibble: 4,169 × 8
## StationID cfs surrogate datetime year
## <chr> <dbl> <chr> <dttm> <dbl>
## 1 PINE 11.6 N 2010-01-01 00:00:00 2010
## 2 PINE 11.6 N 2010-01-01 01:00:00 2010
## 3 PINE 11.2 N 2010-01-01 02:00:00 2010
## 4 PINE 11.2 N 2010-01-01 03:00:00 2010
## 5 PINE 11.2 N 2010-01-01 04:00:00 2010
## 6 PINE 11.2 N 2010-01-01 05:00:00 2010
## 7 PINE 11.2 N 2010-01-01 06:00:00 2010
## 8 PINE 11.2 N 2010-01-01 07:00:00 2010
## 9 PINE 10.9 N 2010-01-01 08:00:00 2010
## 10 PINE 10.9 N 2010-01-01 09:00:00 2010
## # ℹ 4,159 more rows
## # ℹ 3 more variables: quarter <dbl>, month <dbl>,
## # day <dbl>
Above we just executed the operation, but didn’t save it. Let’s save that work using the assignment operator.
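A sketch of both versions; the object name flow_100 is just an illustrative choice:

```r
filter(flow, cfs < 100)              # prints the result but does not save it
flow_100 <- filter(flow, cfs < 100)  # saves the filtered tibble for later use
```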
3.9.1 Multiple conditions
How many gages are there in Montana with an rbi greater than 0.05?
## # A tibble: 15 × 26
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 6.04e6 0.0724 61 Gallat… 2120. 10U 45.5
## 2 6.07e6 0.104 123 Dearbo… 835. 10U 47.2
## 3 6.15e6 0.118 157 Little… 33.4 10U 48.0
## 4 6.19e6 0.110 143 Soda B… 73.1 10U 45.0
## 5 6.19e6 0.0627 46 Gardne… 514. 10U 45.0
## 6 6.19e6 0.0649 49 Yellow… 6784. 10U 45.1
## 7 6.29e6 0.0652 50 Little… 471. 10U 45.0
## 8 6.29e6 0.0923 94 Lodge … 218 10U 45.1
## 9 1.24e7 0.0922 93 Middle… 2939. 17 48.5
## 10 1.24e7 0.0848 77 Mill C… 50.8 17 47.8
## 11 1.24e7 0.0985 113 South … 19.7 17 47.5
## 12 1.24e7 0.118 153 Missio… 32.2 17 47.3
## 13 1.24e7 0.0698 55 South … 151. 17 47.2
## 14 1.24e7 0.100 117 Revais… 60.8 17 47.3
## 15 1.24e7 0.0738 64 Prospe… 470. 17 47.6
## # ℹ 19 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>,
## # T_MIN_SITE <dbl>, PET <dbl>, SNOW_PCT_PRECIP <dbl>, …
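A sketch of the call; filter() treats comma-separated conditions as “and”:

```r
filter(rbi, STATE == "MT", RBI > 0.05)
# equivalent: filter(rbi, STATE == "MT" & RBI > 0.05)
```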
Challenge: Filter for flow less than 100 cfs just for the NFDR gauge in “flow”.
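One possible answer (a sketch), combining a condition on the StationID column with the flow threshold:

```r
filter(flow, StationID == "NFDR", cfs < 100)
```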
3.10 Arrange
Arrange sorts by a column in your dataset.
Sort the rbi data by the RBI column in ascending and then descending order
## # A tibble: 1,144 × 26
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 6.41e6 0.0108 1 RHOADS… 20.8 10U 44.1
## 2 6.43e6 0.0192 2 LITTLE… 71.8 10U 44.3
## 3 1.02e7 0.0230 3 STEPTO… 28.2 16 39.2
## 4 2.27e6 0.0286 4 CATFIS… 169. 3 28.0
## 5 6.78e6 0.0300 5 MIDDLE… 5460. 10L 41.8
## 6 6.43e6 0.0303 6 CROW C… 106. 10U 44.6
## 7 1.42e7 0.0306 7 MCKENZ… 237. 17 44.4
## 8 6.04e6 0.0338 8 Madiso… 1126. 10U 44.7
## 9 1.01e7 0.0382 9 COM F … 556. 16 41.7
## 10 1.02e7 0.0391 10 CLEVE … 83.5 16 39.2
## # ℹ 1,134 more rows
## # ℹ 19 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>, …
## # A tibble: 1,144 × 26
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 9486350 1.76 1144 CANADA… 675. 15 32.3
## 2 9513860 1.75 1143 SKUNK … 170. 15 33.7
## 3 8401200 1.62 1142 SOUTH … 534 13 32.6
## 4 9487000 1.58 1141 BRAWLE… 2028. 15 32.1
## 5 9535100 1.57 1140 SAN SI… 1483. 15 32.0
## 6 8202700 1.56 1139 Seco C… 435. 12 29.4
## 7 11065000 1.55 1138 Lytle … 365. 18 34.1
## 8 7019120 1.51 1137 Fishpo… 24.9 7 38.6
## 9 7233500 1.50 1136 Palo D… 2909. 11 36.2
## 10 6846500 1.40 1135 BEAVER… 4358. 10L 40.0
## # ℹ 1,134 more rows
## # ℹ 19 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>, …
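A sketch of the two sorts shown above:

```r
arrange(rbi, RBI)        # ascending order (the default)
arrange(rbi, desc(RBI))  # wrap the column in desc() for descending order
```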
3.11 Select
Look at the RBI dataframe. There are too many columns! You will often want to get rid of some columns and clean up the dataframe (df) for analysis.
Select Site name, state, and RBI from the rbi data
Note they come back in the order you list them in the function, not the order they appeared in the original data.
You can do a lot more with select(), especially when you need to select a bunch of columns but don’t want to type them all out. For example, if you want a group of adjacent columns you can specify the first and last with a colon in between (first:last) and it’ll return all of them. Select the rbi columns from site_no to DRAIN_SQKM. You can also remove a single column with select(-column). Remove the “surrogate” column from flow.
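A sketch of all three selections described above:

```r
select(rbi, STANAME, STATE, RBI)  # columns return in the order you list them
select(rbi, site_no:DRAIN_SQKM)   # first:last grabs a range of adjacent columns
select(flow, -surrogate)          # a minus sign drops a column
```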
3.12 Mutate
Use mutate() to add new columns based on existing ones. Common uses are to create a column of data in different units, or to calculate something from two other columns. You can also use it to update a column in place, by giving the new column the same name as the original one (but be careful, because you’ll lose the original!).
Create a new column in rbi called T_RANGE by subtracting T_MIN_SITE from T_MAX_SITE
## # A tibble: 1,144 × 27
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 1013500 0.0584 35 Fish R… 2253. 1 47.2
## 2 1021480 0.208 300 Old St… 76.7 1 44.9
## 3 1022500 0.198 286 Narrag… 574. 1 44.6
## 4 1029200 0.132 183 Seboei… 445. 1 46.1
## 5 1030500 0.114 147 Mattaw… 3676. 1 45.5
## 6 1031300 0.297 489 Piscat… 304. 1 45.3
## 7 1031500 0.320 545 Piscat… 769 1 45.2
## 8 1037380 0.318 537 Ducktr… 39 1 44.3
## 9 1044550 0.242 360 Spence… 500. 1 45.3
## 10 1047000 0.344 608 Carrab… 909. 1 44.9
## # ℹ 1,134 more rows
## # ℹ 20 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>, …
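A sketch of the mutate() call (assign the result back to rbi if you want to keep the new column):

```r
# new column = max site temperature minus min site temperature
mutate(rbi, T_RANGE = T_MAX_SITE - T_MIN_SITE)
```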
When downloading data from the USGS through R, you have to enter the gage ID as a character, even though gage IDs are all made up of numbers. To practice, update the site_no column to be a character datatype.
## # A tibble: 1,144 × 26
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 1013500 0.0584 35 Fish R… 2253. 1 47.2
## 2 1021480 0.208 300 Old St… 76.7 1 44.9
## 3 1022500 0.198 286 Narrag… 574. 1 44.6
## 4 1029200 0.132 183 Seboei… 445. 1 46.1
## 5 1030500 0.114 147 Mattaw… 3676. 1 45.5
## 6 1031300 0.297 489 Piscat… 304. 1 45.3
## 7 1031500 0.320 545 Piscat… 769 1 45.2
## 8 1037380 0.318 537 Ducktr… 39 1 44.3
## 9 1044550 0.242 360 Spence… 500. 1 45.3
## 10 1047000 0.344 608 Carrab… 909. 1 44.9
## # ℹ 1,134 more rows
## # ℹ 19 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>, …
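A sketch of the conversion; naming the new column the same as the existing one overwrites it in place:

```r
# as.character() converts the numeric gage IDs to strings
mutate(rbi, site_no = as.character(site_no))
```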
3.13 Summarize
Summarize will perform an operation on all of your data, or groups if you assign groups.
Use summarize to compute the mean, min, and max rbi
Now use the group_by() function to group rbi by state and then summarize in the same way as above, this time getting the stats for each state.
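A sketch of both steps; the summary column names (mean_rbi etc.) and the object name rbi_grouped are illustrative choices:

```r
# ungrouped: one row of summary stats for the whole dataset
summarize(rbi, mean_rbi = mean(RBI), min_rbi = min(RBI), max_rbi = max(RBI))

# grouped: one row of summary stats per state
rbi_grouped <- group_by(rbi, STATE)
summarize(rbi_grouped, mean_rbi = mean(RBI), min_rbi = min(RBI), max_rbi = max(RBI))
```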
3.14 Multiple operations with pipes
You will note that your environment is filling up with objects. We can eliminate many of those by using pipes.
The pipe operator %>% allows you to perform multiple operations in a sequence without saving intermediate steps. Not only is this more efficient, but structuring operations with pipes is also more intuitive than nesting functions within functions (the other way you can do multiple operations).
When you use the pipe, it basically takes whatever came out of the first function and puts it into the data argument for the next one, so:
rbi %>% group_by(STATE)
is the same as
group_by(rbi, STATE)
Take the group_by() and summarize() code from above and perform the operation using the pipe.
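The piped version of the grouped summary might look like this (column names are illustrative); each result flows straight into the next function with nothing saved in between:

```r
rbi %>%
  group_by(STATE) %>%
  summarize(mean_rbi = mean(RBI),
            min_rbi  = min(RBI),
            max_rbi  = max(RBI))
```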
3.15 A final comment on NAs
We will talk more about this when we discuss stats, but some operations will fail if there are NAs in the data. If appropriate, you can tell functions like mean() to ignore NAs by using na.rm = TRUE. You can also use drop_na() if you’re working with a tibble, but be aware that if you use it and save the result, drop_na() gets rid of the whole row, not just the NA. Because what would you replace it with…. an NA?
First, let’s create a small vector called x that includes 1, 2, 3, 4, and NA. How do we do that?
Next, let’s take the mean of x.
## [1] NA
How do you think we can fix this problem?
## [1] 2.5
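A sketch that reproduces both outputs above:

```r
x <- c(1, 2, 3, 4, NA)
mean(x)                # [1] NA: a single NA makes the whole mean NA
mean(x, na.rm = TRUE)  # [1] 2.5: drop the NA before averaging
```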