Chapter 3 Data wrangling using tidyverse
Follow this link to download everything you need for this unit. When you get to GitHub click on “Code” (green button) and select “download zip”. You will then save this to a local folder where you should do all of your work for this class. You will work through the “_blank.Rmd”. Always be sure to read the README.md files in the GitHub repo.
Once you have this folder saved where you would like it, open RStudio and navigate to the folder. Next, open the project (“.Rproj”). Doing so will set the folder as the working directory, make your life easier, and make everything generally work.
3.1 Introduction
We have messed around with plotting a bit and you’ve seen a little of what R can do. So now let’s review or introduce you to some basics. Even if you have worked in R before, it is good to be reminded of and practice this stuff, so stay tuned in!
Reading for this week:
3.2 Learning objectives:
Working through this exercise will help students:
- become more familiar with the RStudio IDE
- get in the habit of running single lines of code
- know what a tibble is
- know what the assignment operator is
- begin using base and dplyr functions
3.3 You can use R as a calculator
If you just type numbers and operators in, R will spit out the results.
It is generally good to run one line of code at a time. On a Mac you do that by putting your cursor on the line and hitting command + enter. On Windows/PC it is ctrl + enter.
Here is a link to info on Editing and Executing code in RStudio
Very handy link to all keyboard shortcuts Windows, Linux and Mac
## [1] 3
## [1] 4
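The output above could come from simple arithmetic like the sketch below (the blank chunk may use different expressions; these are just ones that reproduce the printed results):

```r
# type an expression, run the line, and R prints the result
1 + 2  # [1] 3
2 * 2  # [1] 4
```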
3.4 You can create new objects using <-
Yes, = does the same thing. But use <-. We will call <- the assignment operator. When we are coding in R we use <- to assign values to objects and = to set values for parameters in functions/equations/etc. Using <- helps us differentiate between the two. Norms for formatting are important because they help us understand what code is doing, especially when stuff gets complex.
Oh, one more thing: Surround operators with spaces.
x <- 1 is easier to read than x<-1
You can assign single numbers or entire chunks of data using <-
So if you had an object called my_data and wanted to copy it into my_new_data you could do:
my_new_data <- my_data
You can then recall/print the values in an object by just typing the name by itself.
In the code chunk below, assign a 3 to the object “y” and then print it out.
# This is a code chunk.
# Putting a pound sign in here allows me to type text that is not code.
# The stuff below is code. Not text.
y <- 3
y
## [1] 3
If you want to assign multiple values, you have to put them in the function c(). The c stands for combine. R doesn’t know what to do if you just give it a bunch of values separated by spaces or commas, but if you pass them as arguments to the combine function, it’ll make them into a vector.
Any time you need to use several values, even when passing them as an argument to a function, you have to put them in c() or it won’t work.
## [1] 1 2 3 4
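A minimal sketch that matches the output above (the object name my_numbers is just an illustrative choice):

```r
# c() combines individual values into a single vector
my_numbers <- c(1, 2, 3, 4)
my_numbers  # [1] 1 2 3 4
```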
When you are creating objects, try to give them meaningful names so you can remember what they are. You can’t have spaces or operators that mean something else as part of a name. And remember, everything is case sensitive.
Assign the value 5.4 to water_pH and then try to recall it by typing “water_ph”
## [1] 5.4
If we want to remove something from the environment we can use rm(). Try to remove water_pH.
You can also set objects equal to strings, or values that have letters in them. To do this you just have to put the value in quotes, otherwise R will think it is an object name and tell you it doesn’t exist.
Try: name <- “your name” and then name <- your name
What happens if you forget the ending parenthesis?
R can be cryptic with its error messages and other responses, but once you get used to them, you will know exactly what is wrong when they pop up.
As a note: when you go to the internet for example code, it will often say things like df <- your_data. This is similar to what I’ve written above: name <- “your name”. It means enter your name (or your data). As you progress you will get better at understanding example code and error messages.
3.5 Using functions
As an example, let’s try the seq() function, which creates a sequence of numbers.
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 10 9 8 7 6 5 4 3 2 1
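One set of calls that reproduces the four outputs above (there are several equivalent ways to call seq()):

```r
seq(1, 10)                      # positional arguments
seq(from = 1, to = 10)          # the same call with named arguments
seq(from = 1, to = 10, by = 1)  # explicit step size
seq(10, 1)                      # counts down when from > to
```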
3.6 Read in some data
First we will load the tidyverse library; everything we have done so far today is in base R. Next, let’s load a few dataframes and have a look at them. We will load the “PINE_NFDR_Jan-Mar_2010.csv” and “flashy_dat_all.csv” files.
Important: read_csv() is the tidyverse csv reading function, the base R function is read.csv(). read.csv() will not read your data in as a tibble, which is the format used by tidyverse functions. You should get in the habit of using the tidyverse versions such as read_csv().
## Rows: 4320 Columns: 8
## ── Column specification ───────────────────────────────────
## Delimiter: ","
## chr (2): StationID, surrogate
## dbl (5): cfs, year, quarter, month, day
## dttm (1): datetime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 1144 Columns: 26
## ── Column specification ───────────────────────────────────
## Delimiter: ","
## chr (5): STANAME, HUC02, STATE, CLASS, AGGECOREGION
## dbl (21): site_no, RBI, RBIrank, DRAIN_SQKM, LAT_GAGE, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
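The messages above come from two read_csv() calls along the lines of the sketch below. The file paths assume the csvs sit in the project folder (adjust if yours are in a subfolder), and the object names flow and rbi are the ones used through the rest of this chapter:

```r
library(tidyverse)

# read each file in as a tibble; read_csv() reports the column types it guessed
flow <- read_csv("PINE_NFDR_Jan-Mar_2010.csv")
rbi  <- read_csv("flashy_dat_all.csv")
```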
3.7 What is a tibble?
Good question. It’s a fancy way to store data that works well with tidyverse functions. Let’s look at the flow tibble with head() and str().
## # A tibble: 6 × 8
## StationID cfs surrogate datetime year
## <chr> <dbl> <chr> <dttm> <dbl>
## 1 PINE 11.6 N 2010-01-01 00:00:00 2010
## 2 PINE 11.6 N 2010-01-01 01:00:00 2010
## 3 PINE 11.2 N 2010-01-01 02:00:00 2010
## 4 PINE 11.2 N 2010-01-01 03:00:00 2010
## 5 PINE 11.2 N 2010-01-01 04:00:00 2010
## 6 PINE 11.2 N 2010-01-01 05:00:00 2010
## # ℹ 3 more variables: quarter <dbl>, month <dbl>,
## # day <dbl>
## spc_tbl_ [4,320 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ StationID: chr [1:4320] "PINE" "PINE" "PINE" "PINE" ...
## $ cfs : num [1:4320] 11.6 11.6 11.2 11.2 11.2 ...
## $ surrogate: chr [1:4320] "N" "N" "N" "N" ...
## $ datetime : POSIXct[1:4320], format: "2010-01-01 00:00:00" ...
## $ year : num [1:4320] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ quarter : num [1:4320] 1 1 1 1 1 1 1 1 1 1 ...
## $ month : num [1:4320] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : num [1:4320] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. StationID = col_character(),
## .. cfs = col_double(),
## .. surrogate = col_character(),
## .. datetime = col_datetime(format = ""),
## .. year = col_double(),
## .. quarter = col_double(),
## .. month = col_double(),
## .. day = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
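A sketch of the calls that produce the output above, assuming the flow object from the read-in step:

```r
head(flow)  # first six rows, printed as a tibble
str(flow)   # structure: dimensions, column names, types, and example values
```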
Now read in the same data with read.csv() which will NOT read the data as a tibble. How is it different? Output each one in the Console.
Knowing the data type for each column is super helpful for a few reasons…. let’s talk about them.
## StationID cfs surrogate datetime year
## 1 PINE 11.58 N 2010-01-01T00:00:00Z 2010
## 2 PINE 11.58 N 2010-01-01T01:00:00Z 2010
## 3 PINE 11.24 N 2010-01-01T02:00:00Z 2010
## 4 PINE 11.24 N 2010-01-01T03:00:00Z 2010
## 5 PINE 11.24 N 2010-01-01T04:00:00Z 2010
## 6 PINE 11.24 N 2010-01-01T05:00:00Z 2010
## quarter month day
## 1 1 1 1
## 2 1 1 1
## 3 1 1 1
## 4 1 1 1
## 5 1 1 1
## 6 1 1 1
## # A tibble: 6 × 8
## StationID cfs surrogate datetime year
## <chr> <dbl> <chr> <dttm> <dbl>
## 1 PINE 11.6 N 2010-01-01 00:00:00 2010
## 2 PINE 11.6 N 2010-01-01 01:00:00 2010
## 3 PINE 11.2 N 2010-01-01 02:00:00 2010
## 4 PINE 11.2 N 2010-01-01 03:00:00 2010
## 5 PINE 11.2 N 2010-01-01 04:00:00 2010
## 6 PINE 11.2 N 2010-01-01 05:00:00 2010
## # ℹ 3 more variables: quarter <dbl>, month <dbl>,
## # day <dbl>
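A sketch of the comparison, again assuming the csv sits in the project folder (the object name flow_df is an illustrative choice):

```r
# base-R reader: returns a plain data.frame, not a tibble
flow_df <- read.csv("PINE_NFDR_Jan-Mar_2010.csv")
head(flow_df)  # datetime comes in as character, and every column prints
head(flow)     # the tibble shows column types and truncates wide output
```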
3.8 Data wrangling in dplyr
If you forget syntax or what the following functions do, here is a cheat sheet: https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
We will demo five functions below (these are tidyverse/dplyr functions):
- filter() - returns rows that meet specified conditions
- arrange() - reorders rows
- select() - pull out variables (columns)
- mutate() - create new variables (columns) or reformat existing ones
- summarize() - collapse groups of values into summary stats
3.9 Filter
Write an expression that returns the rows of rbi for the state of Montana (MT).
## # A tibble: 17 × 26
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 6.04e6 0.0338 8 Madiso… 1126. 10U 44.7
## 2 6.04e6 0.0724 61 Gallat… 2120. 10U 45.5
## 3 6.07e6 0.104 123 Dearbo… 835. 10U 47.2
## 4 6.15e6 0.118 157 Little… 33.4 10U 48.0
## 5 6.19e6 0.110 143 Soda B… 73.1 10U 45.0
## 6 6.19e6 0.0627 46 Gardne… 514. 10U 45.0
## 7 6.19e6 0.0649 49 Yellow… 6784. 10U 45.1
## 8 6.29e6 0.0652 50 Little… 471. 10U 45.0
## 9 6.29e6 0.0923 94 Lodge … 218 10U 45.1
## 10 1.24e7 0.0922 93 Middle… 2939. 17 48.5
## 11 1.24e7 0.0848 77 Mill C… 50.8 17 47.8
## 12 1.24e7 0.0985 113 South … 19.7 17 47.5
## 13 1.24e7 0.118 153 Missio… 32.2 17 47.3
## 14 1.24e7 0.0698 55 South … 151. 17 47.2
## 15 1.24e7 0.0470 19 Big Kn… 17.7 17 47.1
## 16 1.24e7 0.100 117 Revais… 60.8 17 47.3
## 17 1.24e7 0.0738 64 Prospe… 470. 17 47.6
## # ℹ 19 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>,
## # T_MIN_SITE <dbl>, PET <dbl>, SNOW_PCT_PRECIP <dbl>, …
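One way to write that expression, using the STATE column shown in the output:

```r
filter(rbi, STATE == "MT")  # keep only rows where STATE is "MT"
```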
And one that keeps flows less than 100 cfs in the “flow” dataframe.
## # A tibble: 4,169 × 8
## StationID cfs surrogate datetime year
## <chr> <dbl> <chr> <dttm> <dbl>
## 1 PINE 11.6 N 2010-01-01 00:00:00 2010
## 2 PINE 11.6 N 2010-01-01 01:00:00 2010
## 3 PINE 11.2 N 2010-01-01 02:00:00 2010
## 4 PINE 11.2 N 2010-01-01 03:00:00 2010
## 5 PINE 11.2 N 2010-01-01 04:00:00 2010
## 6 PINE 11.2 N 2010-01-01 05:00:00 2010
## 7 PINE 11.2 N 2010-01-01 06:00:00 2010
## 8 PINE 11.2 N 2010-01-01 07:00:00 2010
## 9 PINE 10.9 N 2010-01-01 08:00:00 2010
## 10 PINE 10.9 N 2010-01-01 09:00:00 2010
## # ℹ 4,159 more rows
## # ℹ 3 more variables: quarter <dbl>, month <dbl>,
## # day <dbl>
Above we just executed the operation, but didn’t save it. Let’s save that work using the assignment operator.
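A sketch of both versions; the object name flow_100 is just an illustrative choice:

```r
filter(flow, cfs < 100)              # prints the result but does not save it
flow_100 <- filter(flow, cfs < 100)  # saves the filtered tibble for later use
```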
3.9.1 Multiple conditions
How many gages are there in Montana with an rbi greater than 0.05?
## # A tibble: 15 × 26
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 6.04e6 0.0724 61 Gallat… 2120. 10U 45.5
## 2 6.07e6 0.104 123 Dearbo… 835. 10U 47.2
## 3 6.15e6 0.118 157 Little… 33.4 10U 48.0
## 4 6.19e6 0.110 143 Soda B… 73.1 10U 45.0
## 5 6.19e6 0.0627 46 Gardne… 514. 10U 45.0
## 6 6.19e6 0.0649 49 Yellow… 6784. 10U 45.1
## 7 6.29e6 0.0652 50 Little… 471. 10U 45.0
## 8 6.29e6 0.0923 94 Lodge … 218 10U 45.1
## 9 1.24e7 0.0922 93 Middle… 2939. 17 48.5
## 10 1.24e7 0.0848 77 Mill C… 50.8 17 47.8
## 11 1.24e7 0.0985 113 South … 19.7 17 47.5
## 12 1.24e7 0.118 153 Missio… 32.2 17 47.3
## 13 1.24e7 0.0698 55 South … 151. 17 47.2
## 14 1.24e7 0.100 117 Revais… 60.8 17 47.3
## 15 1.24e7 0.0738 64 Prospe… 470. 17 47.6
## # ℹ 19 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>,
## # T_MIN_SITE <dbl>, PET <dbl>, SNOW_PCT_PRECIP <dbl>, …
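A sketch of the call; filter() treats comma-separated conditions as “and”:

```r
filter(rbi, STATE == "MT", RBI > 0.05)
# equivalent: filter(rbi, STATE == "MT" & RBI > 0.05)
```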
Challenge: Filter for flow less than 100 cfs just for the NFDR gauge in “flow”.
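One possible answer (a sketch), combining a condition on the StationID column with the flow threshold:

```r
filter(flow, StationID == "NFDR", cfs < 100)
```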
3.10 Arrange
Arrange sorts by a column in your dataset.
Sort the rbi data by the RBI column in ascending and then descending order
## # A tibble: 1,144 × 26
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 6.41e6 0.0108 1 RHOADS… 20.8 10U 44.1
## 2 6.43e6 0.0192 2 LITTLE… 71.8 10U 44.3
## 3 1.02e7 0.0230 3 STEPTO… 28.2 16 39.2
## 4 2.27e6 0.0286 4 CATFIS… 169. 3 28.0
## 5 6.78e6 0.0300 5 MIDDLE… 5460. 10L 41.8
## 6 6.43e6 0.0303 6 CROW C… 106. 10U 44.6
## 7 1.42e7 0.0306 7 MCKENZ… 237. 17 44.4
## 8 6.04e6 0.0338 8 Madiso… 1126. 10U 44.7
## 9 1.01e7 0.0382 9 COM F … 556. 16 41.7
## 10 1.02e7 0.0391 10 CLEVE … 83.5 16 39.2
## # ℹ 1,134 more rows
## # ℹ 19 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>, …
## # A tibble: 1,144 × 26
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 9486350 1.76 1144 CANADA… 675. 15 32.3
## 2 9513860 1.75 1143 SKUNK … 170. 15 33.7
## 3 8401200 1.62 1142 SOUTH … 534 13 32.6
## 4 9487000 1.58 1141 BRAWLE… 2028. 15 32.1
## 5 9535100 1.57 1140 SAN SI… 1483. 15 32.0
## 6 8202700 1.56 1139 Seco C… 435. 12 29.4
## 7 11065000 1.55 1138 Lytle … 365. 18 34.1
## 8 7019120 1.51 1137 Fishpo… 24.9 7 38.6
## 9 7233500 1.50 1136 Palo D… 2909. 11 36.2
## 10 6846500 1.40 1135 BEAVER… 4358. 10L 40.0
## # ℹ 1,134 more rows
## # ℹ 19 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>, …
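A sketch of the two sorts shown above:

```r
arrange(rbi, RBI)        # ascending order (the default)
arrange(rbi, desc(RBI))  # wrap the column in desc() for descending order
```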
3.11 Select
Look at the RBI dataframe. There are too many columns! You will often want to get rid of some columns and clean up the dataframe (df) for analysis.
Select Site name, state, and RBI from the rbi data
Note they come back in the order you list them in the function, not the order they appeared in the original data.
You can do a lot more with select(), especially when you need to select a bunch of columns but don’t want to type them all out. For example, if you want a group of adjacent columns you can specify the first and last with a colon in between (first:last) and it’ll return all of them. Select the rbi columns from site_no to DRAIN_SQKM. You can also remove a single column with select(-column). Remove the “surrogate” column from flow.
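A sketch of all three selections described above:

```r
select(rbi, STANAME, STATE, RBI)  # columns return in the order you list them
select(rbi, site_no:DRAIN_SQKM)   # first:last grabs a range of adjacent columns
select(flow, -surrogate)          # a minus sign drops a column
```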
3.12 Mutate
Use mutate() to add new columns based on existing ones. Common uses are to create a column of data in different units, or to calculate something from two other columns. You can also use it to update a column in place, by giving the new column the same name as the original one (but be careful, because you’ll lose the original!).
Create a new column in rbi called T_RANGE by subtracting T_MIN_SITE from T_MAX_SITE
## # A tibble: 1,144 × 27
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 1013500 0.0584 35 Fish R… 2253. 1 47.2
## 2 1021480 0.208 300 Old St… 76.7 1 44.9
## 3 1022500 0.198 286 Narrag… 574. 1 44.6
## 4 1029200 0.132 183 Seboei… 445. 1 46.1
## 5 1030500 0.114 147 Mattaw… 3676. 1 45.5
## 6 1031300 0.297 489 Piscat… 304. 1 45.3
## 7 1031500 0.320 545 Piscat… 769 1 45.2
## 8 1037380 0.318 537 Ducktr… 39 1 44.3
## 9 1044550 0.242 360 Spence… 500. 1 45.3
## 10 1047000 0.344 608 Carrab… 909. 1 44.9
## # ℹ 1,134 more rows
## # ℹ 20 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>, …
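A sketch of the mutate() call (assign the result back to rbi if you want to keep the new column):

```r
# new column = max site temperature minus min site temperature
mutate(rbi, T_RANGE = T_MAX_SITE - T_MIN_SITE)
```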
When downloading data from the USGS through R, you have to enter the gage ID as a character, even though gage IDs are all made up of numbers. To practice, update the site_no column to be a character datatype.
## # A tibble: 1,144 × 26
## site_no RBI RBIrank STANAME DRAIN_SQKM HUC02 LAT_GAGE
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 1013500 0.0584 35 Fish R… 2253. 1 47.2
## 2 1021480 0.208 300 Old St… 76.7 1 44.9
## 3 1022500 0.198 286 Narrag… 574. 1 44.6
## 4 1029200 0.132 183 Seboei… 445. 1 46.1
## 5 1030500 0.114 147 Mattaw… 3676. 1 45.5
## 6 1031300 0.297 489 Piscat… 304. 1 45.3
## 7 1031500 0.320 545 Piscat… 769 1 45.2
## 8 1037380 0.318 537 Ducktr… 39 1 44.3
## 9 1044550 0.242 360 Spence… 500. 1 45.3
## 10 1047000 0.344 608 Carrab… 909. 1 44.9
## # ℹ 1,134 more rows
## # ℹ 19 more variables: LNG_GAGE <dbl>, STATE <chr>,
## # CLASS <chr>, AGGECOREGION <chr>, PPTAVG_BASIN <dbl>,
## # PPTAVG_SITE <dbl>, T_AVG_BASIN <dbl>,
## # T_AVG_SITE <dbl>, T_MAX_BASIN <dbl>,
## # T_MAXSTD_BASIN <dbl>, T_MAX_SITE <dbl>,
## # T_MIN_BASIN <dbl>, T_MINSTD_BASIN <dbl>, …
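A sketch of the conversion; naming the new column the same as the existing one overwrites it in place:

```r
# as.character() converts the numeric gage IDs to strings
mutate(rbi, site_no = as.character(site_no))
```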
3.13 Summarize
Summarize will perform an operation on all of your data, or groups if you assign groups.
Use summarize to compute the mean, min, and max rbi
Now use the group_by() function to group rbi by state and then summarize in the same way as above, this time getting the stats for each state.
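A sketch of both steps; the summary column names (mean_rbi etc.) and the object name rbi_grouped are illustrative choices:

```r
# ungrouped: one row of summary stats for the whole dataset
summarize(rbi, mean_rbi = mean(RBI), min_rbi = min(RBI), max_rbi = max(RBI))

# grouped: one row of summary stats per state
rbi_grouped <- group_by(rbi, STATE)
summarize(rbi_grouped, mean_rbi = mean(RBI), min_rbi = min(RBI), max_rbi = max(RBI))
```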
3.14 Multiple operations with pipes
You will note that your environment is filling up with objects. We can eliminate many of those by using pipes.
The pipe operator %>% allows you to perform multiple operations in a sequence without saving intermediate steps. Not only is this more efficient, but structuring operations with pipes is also more intuitive than nesting functions within functions (the other way you can do multiple operations).
When you use the pipe, it basically takes whatever came out of the first function and puts it into the data argument for the next one, so:
rbi %>% group_by(STATE)
is the same as
group_by(rbi, STATE)
Take the group_by() and summarize() code from above and perform the operation using the pipe.
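The piped version of the grouped summary might look like this (column names are illustrative); each result flows straight into the next function with nothing saved in between:

```r
rbi %>%
  group_by(STATE) %>%
  summarize(mean_rbi = mean(RBI),
            min_rbi  = min(RBI),
            max_rbi  = max(RBI))
```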
3.15 A final comment on NAs
We will talk more about this when we discuss stats, but some operations will fail if there are NAs in the data. If appropriate, you can tell functions like mean() to ignore NAs by using na.rm = TRUE. You can also use drop_na() if you’re working with a tibble, but be aware that if you use it and save the result, drop_na() gets rid of the whole row, not just the NA. Because what would you replace it with…. an NA?
First, let’s create a small vector called x that includes 1, 2, 3, 4, and NA. How do we do that?
Next, let’s take the mean of x.
## [1] NA
How do you think we can fix this problem?
## [1] 2.5
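A sketch that reproduces both outputs above:

```r
x <- c(1, 2, 3, 4, NA)
mean(x)                # [1] NA: a single NA makes the whole mean NA
mean(x, na.rm = TRUE)  # [1] 2.5: drop the NA before averaging
```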