Chapter 5 Part 4: Spatial Models and GIS Integration

5.1 Gridded and tabular data retrieval

15 pnts

The repo for this module can be found here

5.1.1 Learning Module

5.1.1.1 Objectives

In this module you will:

- Retrieve flow data using the dataRetrieval package
- Learn to retrieve and manage digital elevation models
- Retrieve and import remotely sensed data for the watershed extent

5.1.1.2 Background

A refresher in GIS concepts: This module assumes familiarity with basic Geographic Information System (GIS) concepts such as coordinate systems (latitude and longitude, UTM). Some terms to recall:

Gridded data refers to spatial data in raster format. If you have performed any GIS work, you have likely used gridded data. Rasters organize geographical areas into a grid where each regularly spaced cell represents a specific location and contains some attribute representing that location. For example, digital elevation models (DEMs) are a form of gridded data. Imagine that we are working in a watershed and we drop a grid, or square-celled net, over the entire catchment. We can then model the elevation of that catchment by assigning each cell in the net the mean elevation of the area the cell represents. Similarly, we can assign any value we wish to model to the cells. If we were to drop another net and assign a maximum tree height to every cell, we would be adding another ‘layer’ to our model. The spatial resolution of gridded data refers to the length and width of the ground area that each cell represents. If we want to observe gridded data over time, we can generate a raster stack, where each raster layer represents a point in time. A raster stack adds a temporal dimension to our gridded data.

Vector data describes spatial features using points, lines, and polygons. These spatial features are represented as geometrical objects with associated attributes, like lines to represent linear features such as rivers, or points to represent single locations. We will work with a .shp file in this module that is a collection of points representing the boundary of a watershed.

Tabular data refers to data organized in a row-column format. In this class, each of our .csv datasets are a form of tabular data.

5.1.1.3 A refresher in remote sensing:

Remote sensing is the science of obtaining information about objects or areas from a distance, typically from aircraft or satellites. It involves the collection and interpretation of data gathered by sensors that detect electromagnetic radiation reflected or emitted from the Earth’s surface and atmosphere. As ecologists, many of us utilize remote sensing to observe and analyze the Earth’s surface and its dynamics, providing valuable insights into environmental processes, land use and land cover changes, natural disasters, and more. Some helpful terms to recall:

spatial resolution: Remotely sensed data can vary in its spatial resolution, which refers to the scale of the smallest unit, or level of visual detail.

temporal resolution: When used to describe remotely sensed data, we are often talking about how frequently a remote sensing system revisits the same area on the Earth’s surface. Systems with high temporal resolution revisit the same location frequently, allowing for monitoring of rapid changes in phenomena, whereas a lower temporal resolution may limit a system’s ability to capture short-term events.

We often find that data collection requires a balance of spatial and temporal resolution to select the most appropriate data for the coverage and detail that ecological work requires.

5.1.1.4 Ethical conduct

When utilizing publicly available datasets, it is important for us as users to:

- Be aware of any copyright or licensing agreements. There may be restrictions or usage terms imposed by data providers. Read provided agreements or terms to be sure we are in compliance with these when using datasets in our research. This may also include requirements for registration or subscription established by providers or agencies like USGS or NASA.

- Cite and acknowledge the contributions of data providers and researchers whose datasets we utilize. This also facilitates transparency and reproducibility in our research.

- Assess the quality and accuracy of geospatial data before using it for decision making purposes. It is important that we consider spatial resolution, temporal resolution and metadata documentation to critically evaluate the reliability of the datasets we use.

- Promote Open Data Principles: Try to explore and use publicly available datasets that support transparency and collaboration within our scientific and academic communities.

5.1.2 Labwork (20 pnts)

In this module we will import streamflow (tabular) data using USGS’s dataRetrieval package, then visualize vector data to delineate our area of interest and learn how to find spatial gridded data for the area.

5.1.2.1 Setup:

When importing packages and setting parameters, think about how the local working environment can vary among users and computers. A function that checks for and installs packages before loading them makes the script easier for others to run.
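As a minimal sketch of that idea (the same pattern reappears in the transfer-function lab later in this chapter; the package list here is only a placeholder):

pkgTest <- function(x) {
  # Install the package only if it is not already installed, then load it
  if (!x %in% rownames(installed.packages())) {
    install.packages(x, dependencies = TRUE)
  }
  library(x, character.only = TRUE)
}

# Placeholder vector of packages this module might need
neededPackages <- c("dataRetrieval", "sf", "terra", "tidyverse")
for (package in neededPackages) pkgTest(package)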

5.1.2.2 Selecting a watershed:

Here, we are using USGS’ dataRetrieval package to search and filter hydrologic datasets. There are many ways data can be searched; more can be found in this dataRetrieval vignette. In this example, we will fetch USGS site numbers for Colorado watersheds with streamflow (discharge) data for a specified time period.
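For instance, a hedged sketch of such a query might look like the following (parameter code 00060 is NWIS discharge; treat the exact call as an assumption and adjust it to your own search):

library(dataRetrieval)

# All Colorado NWIS sites that report discharge (parameter code 00060)
co_sites <- whatNWISsites(stateCd = "CO", parameterCd = "00060")
head(co_sites[, c("site_no", "station_nm")])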

Q1 (2 pts). If you were creating this script for your work team’s shared collaboration, or for yourself to use again a year from now, what might be the disadvantages of embedding search parameters directly into the function parameters?

ANSWER:

Q2 (3 pts). What is a potential use for whatNWISsites?

ANSWER:

EXTRA (1 pt): Why might the above script return an error if dplyr:: is not placed before select()?

ANSWER:

drain_area_va is a column value returned by readNWISsite. Other variables can be found here

Q3 (1 pt). In a sentence, what does this dataframe (all_site_data) tell you?

ANSWER:

Note that while your research interests may not require streamflow data, it can serve as a valuable resource for a wide array of data types.

For the remainder of this assignment, we’ll focus on a single watershed in central Colorado near the Experimental Forest that we explored in the Snowmelt module. There are several dataRetrieval functions we could utilize to see what data is available for a particular site number. Check the dataRetrieval docs to learn package functions or tools.
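One option is whatNWISdata(); a sketch is below (the site number is only an example, so swap in the gage you selected):

library(dataRetrieval)

site_no <- "06752260"   # example Colorado gage number; replace with your site of interest
DataAvailable <- whatNWISdata(siteNumber = site_no)
head(DataAvailable[, c("parm_cd", "data_type_cd", "begin_date", "end_date", "count_nu")])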

Q4 (4 pts). Find the longest running dataset in DataAvailable (i.e. either by count or by start and end date). Use https://help.waterdata.usgs.gov to look up the meaning of some of the column headers and the codes associated with them.
a. What kind of data does this long running dataset hold?
b. What column did you use to learn that?
c. What does ‘dv’ mean in data_type_cd?

ANSWER:

5.1.2.3 Flow data retrieval:

Let’s rename our columns. For this you will need to determine the units of the data.
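A hedged sketch of the retrieval and rename step, assuming daily values; the site number and dates are placeholders, and renameNWISColumns() is one convenient option (you could also rename manually with dplyr::rename() once you know the units):

library(dataRetrieval)

flow <- readNWISdv(siteNumbers = "06752260",   # placeholder gage; use your site number
                   parameterCd = "00060",      # discharge
                   startDate = "2015-10-01",   # placeholder period of record
                   endDate = "2020-09-30")
flow <- renameNWISColumns(flow)   # e.g., X_00060_00003 becomes Flow
head(flow)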

Q5 (2 pts). Based on the hydrograph plot, what type of precipitation events appear to influence the observed intra-annual variations in flow rates?

ANSWER:

5.1.2.4 Watershed boundary vector data:

A boundary shapefile has been provided to save time and improve repeatability. Watershed boundaries for gauging stations can be found at USGS StreamStats by zooming into the map and clicking ‘Delineate’ near your desired gage or point. Once your watershed is delineated, you can click on ‘Download Basin’ and select from a choice of file types. This can be a great tool to delineate river basins if all you require is a shapefile. For smaller streams, other analyses are available. We will cover those in the next module.

Let’s import a pre-created shapefile:

Let’s transform this to a projected CRS (flat surface) rather than a geographic (spherical surface) CRS. A projected CRS will depict the data on a flat surface which is required for distance or area measurements. A more detailed explanation can be found at the University of Colorado’s Earth Lab if needed.
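A minimal sketch with the sf package, assuming a placeholder file name; EPSG 26913 (NAD83 / UTM Zone 13N) is one reasonable projected CRS for central Colorado:

library(sf)

ws <- st_read("watershed_boundary.shp")   # placeholder path to the provided shapefile
st_crs(ws)                                # inspect the current (likely geographic) CRS

ws_utm <- st_transform(ws, crs = 26913)   # reproject to NAD83 / UTM Zone 13N
st_crs(ws_utm)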

5.1.2.5 Visualization - watershed boundary

i.e., sanity check

5.1.2.6 Digital Elevation Models (DEM)

A DEM has been provided for consistency among users.

However, if you are unfamiliar with the process of retrieving a DEM here are three methods that our team commonly uses:

  1. USDA’s Natural Resources Conservation Service (NRCS) website. Here you can order (free) DEMs by state, county or bounding box. It takes just minutes for them to be emailed to you.

  2. If you have a USGS account (fast and free) you can also find DEM data at https://earthexplorer.usgs.gov. See this video if interested. This is the method we used to provide this DEM and we will use this resource again to retrieve Landsat data.

  3. GIS or hydrological software like QGIS and SAGA offer tools for downloading DEMs directly into your project for analysis. This can be optimal when working over large geographic extents to streamline the data acquisition process. Both QGIS and SAGA are free and easy to learn if you have experience with ESRI products.

Let’s check out our DEM:
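A hedged sketch with the terra package; the DEM file name is a placeholder, and ws_utm is the boundary from the sf sketch above:

library(terra)
library(sf)

dem <- rast("dem.tif")    # placeholder path to the provided DEM
dem                       # prints extent, resolution, and CRS

# Put the watershed boundary in the DEM's CRS before plotting them together
ws_dem <- st_transform(ws_utm, crs = crs(dem))
plot(dem)
plot(st_geometry(ws_dem), add = TRUE)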

If the watershed and raster don’t overlap as you expect, be sure the projection and extent are the same for all data files plotted.

5.1.2.7 Other remotely sensed data:

Now that we have site and spatial data, let’s cover a couple of techniques or resources you can use to access satellite data for your site. In the event that you only need a vegetative map layer from a specific point in time, it is not too difficult to download actual Landsat imagery from usgs.gov. In the case of Cabin Creek, let’s look at a single summer image from 2020.

Navigate to https://ers.cr.usgs.gov/ and sign in, or quickly create an account if you don’t already have one. Once signed in, go to https://earthexplorer.usgs.gov, where you should see a surface map with tabs on the left. Start with the ‘Search Criteria’ tab. For this assignment the shapefile we have is too large to upload, so it is easiest to scroll down and choose ‘polygon’ or ‘circle’ to define your area. Under the circle tab you can use Center Latitude 39.9861, Center Longitude -105.719, and a radius of about 3.5 kilometers to capture our study area.

Let’s say we want an NDVI map layer from summer of 2020. In ‘date range’, select dates for the summer of 2020. In this high elevation area of Colorado, the greenest times of the year without snow cover are likely June and July with senescence beginning sometime in late July. First search from June 1 to July 15 2020. Then click on ‘Data Sets’ either at the bottom or top of the left-hand side menus. From the data sets we will select Landsat Collection 2 level 2, Landsat 8 - 9 OLI TIRS C2L2.

We can skip ‘Additional Criteria’ for now and go to ‘Results’. You should receive a set of about 11 images. For each image you can check out the metadata. That is the ‘pen and notepad’ icon in the middle of the image-specific menu. Here you can look at data like cloud cover. For these images, the values you see are a cloud coverage percent. Cloud coverage measurements and the codes to filter for certain cloud cover thresholds may change across missions. If you are retrieving Landsat data for a time series using code, it will be important to explore the different quality control parameters and what they represent. You don’t need to download anything for this assignment; we’ve provided it for you. However, to gain familiarity with this process, click on the ‘download’ icon for July 7, 2020 and click ‘select files’ on the Reflectance Bands option.

Q6 (1 pt): What 2 bands should we download to calculate NDVI?

Answer:

Q7 (3 pt): Find 2 other indices that can be used as indicators of vegetative health. What might each of these indices indicate? What are the bands used for each index?

ANSWER:

Check out other EarthExplorer resources while you are at the site. You will likely find resources here for your MS projects, even if it is just for making maps of your study area for presentations or papers.

5.1.2.7.1 Landsat

If you want to learn about Landsat missions or utilize Landsat data, USGS offers all of the reference information you need. The sensors on these missions can be a fantastic resource for characterizing land or water surfaces over large temporal scales. Landsat data provides 30 m spatial resolution, so it can capture spatial heterogeneity, even for smaller study areas. If calculating vegetation indices over different time periods, it is important to know which mission you are retrieving data from and check the band assignments for that mission. For example, on Landsat 7’s instrument, Band 4 is near-infrared, whereas Band 4 on Landsat 9’s instrument is red.

5.1.2.7.2 Moderate Resolution Imaging Spectroradiometer (MODIS):

is a sensor aboard NASA’s Terra and Aqua satellites which orbit Earth and collect data on a daily basis. MODIS data provides moderate spatial resolution (250m to 1km) and high temporal resolution. Therefore, MODIS data can be an optimal way to monitor or characterize rapid changes to Earth’s surface like land cover changes resulting from wildfire. Data can be accessed and downloaded directly through NASA’s Earthdata tool.

Let’s import our raster data.
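A hedged sketch of the import, assuming terra, a placeholder folder of .tif files, and the ws_utm boundary from earlier:

library(terra)
library(sf)

band_files <- list.files("landsat", pattern = "\\.tif$", full.names = TRUE)  # placeholder folder
landsat <- rast(band_files)   # stack the bands into one multilayer raster
landsat

# Crop and mask to the watershed, after matching the boundary to the raster's CRS
ws_ls <- st_transform(ws_utm, crs = crs(landsat))
landsat_ws <- mask(crop(landsat, vect(ws_ls)), vect(ws_ls))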

Q8 (2 pt) What does ‘SR’ in the .tif names tell us about the data?

ANSWER:

Q9 (2 pts) What is the difference between ‘mask’ and ‘crop’ when filtering large files to fit the area of interest?

ANSWER:

5.1.2.8 Earth surface changes over time

If considering changes over time, comparing individual rasters from different time periods directly may not always provide meaningful insights because the absolute values of vegetation indices can vary due to factors such as differences in solar angle, atmospheric conditions, and land cover changes. Instead, it’s often more informative to analyze trends in vegetation over time.

5.1.2.8.1 LandTrendr

One fun way to do this is with Oregon State University’s LandTrendr. With a Google account and access to Earth Engine, you can easily interact with the UI Applications in section 8. You can jump to the pixel timeseries here. This will allow you to view changes in vegetation indices over a specified time period within a single Landsat pixel.

5.1.2.8.2 Google Earth Engine

Another effective way to access and evaluate large spatial datasets is through Google Earth Engine (GEE). GEE is a cloud-based platform developed by Google for large scale environmental data analysis. It provides a massive archive of satellite imagery and other geospatial datasets, as well as computational tools to view and analyze them. GEE uses Google’s cloud to provide processing of data, so you don’t need powerful computing ability to run a potentially computationally expensive spatial analysis.

5.1.2.8.2.1 GEE Access

If spatial data is something you may need for your MS project, we recommend exploring this resource.

Sign Up: Go to the Google Earth Engine website (https://earthengine.google.com/) and sign in with your Google account. If you don’t have a Google account, you can sign up for access by filling out a request form.

Request Access: If you’re signing up for the first time, you’ll need to request access to Google Earth Engine. Provide information about your affiliation, research interests, and how you plan to use Earth Engine (learn and access data!).

Approval: After submitting your request, you’ll need to wait for approval from the Google Earth Engine team. Approval times may vary, but you should receive an email notification once your request has been approved.

Access Earth Engine: Once approved, you can access Google Earth Engine through the Code Editor interface in your web browser at code.earthengine.google.com to start exploring the available datasets, running analyses, and visualizing your results.

We have provided a basic script for retrieving and viewing NDVI data. This is a code snapshot, so you can manually copy the code from the snapshot and paste it into your own Google Earth Engine Code Editor. You are welcome to save this to your own GEE files and try changing the area of interest, vegetation indices, time periods or exploring other data.

Colorado State University also has a great tutorial here that will start you from scratch.

Note that JavaScript is the default language for the Code Editor interface in your web browser. If you are familiar with Python, GEE provides a Python API that allows you to use Python syntax. If neither of these languages are familiar, don’t give up yet! You can still use this tool with your R coding background and trial and error. We recommend finding other scripts that suit your needs (there are a lot of shared resources online) and adapting them to your area of interest. Once you understand coding concepts, like those in R, other languages can be picked up easily. You will find that you are learning JavaScript before you realize it.

5.2 DEM processing

5.2.2 Learning Module:

5.2.2.1 Objective:

Users will explore basic hydrological tools through the process of DEM preprocessing and defining stream networks for a watershed in the Fraser Experimental Forest in Colorado using Whitebox Tools for R. This exercise will also demonstrate methods for writing functions and efficiently handling multiple files simultaneously, including importing, processing, and exporting data within the context of Whitebox Tools for R.

Hydrological analysis preprocessing involves the use of digital elevation model (DEM) raster data to establish a watershed model and a simulation of surface hydrological processes. These steps enable us to quantify key parameters such as flow accumulation, stream network characteristics, and hydrological connectivity, which are essential for studying the dynamics of water movement within a landscape. Overall, preprocessing is the set of foundational steps in hydrological modeling and analysis.

Whitebox Tools is an advanced geospatial data analysis platform that can be used to perform common geographical information systems (GIS) analysis operations. This platform was developed within the Geomorphometry and Hydrogeomatics Research Group at the University of Guelph, so it is focused on hydrological analysis. With just a DEM, it allows us to produce a multitude of outputs that can be used for future analysis (Lindsay, 2016; doi:10.1016/j.cageo.2016.07.003). While we are demonstrating its use in R, these tools are also available in QGIS and Python platforms.

5.2.3 Labwork (20 pts)

5.2.3.1 Installing libraries

We are going to try installing the whitebox R package from CRAN as it should be the simplest method.

However, if this does not work for you, you can install the development version from GitHub by putting this inside a code chunk:

if (!require("remotes")) install.packages("remotes")
remotes::install_github("opengeos/whiteboxR", build = FALSE)

More information on installation can be found at: https://cran.r-project.org/web/packages/whitebox/readme/README.html
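Putting the CRAN route together, a minimal sketch (whitebox also needs to fetch the WhiteboxTools binary the first time it is used):

install.packages("whitebox")
whitebox::install_whitebox()   # downloads the WhiteboxTools binary
library(whitebox)
wbt_init()                     # should return TRUE once the binary is found
wbt_version()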

Helpful whitebox documentation can be found at https://jblindsay.github.io/wbt_book/preface.html. Essentially, we will pass input raster filepaths and output filepaths as arguments to various whitebox functions. The script is designed to perform functions on all rasters in a given folder at once.
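To illustrate that filepath-in, filepath-out pattern, here is a hedged sketch that batches a simple whitebox tool over a folder of rasters (the folder names are placeholders):

dem_files <- list.files("dems", pattern = "\\.tif$", full.names = TRUE)  # placeholder input folder
dir.create("hillshade", showWarnings = FALSE)                            # placeholder output folder

lapply(dem_files, function(f) {
  out <- file.path("hillshade", paste0(tools::file_path_sans_ext(basename(f)), "_hs.tif"))
  wbt_hillshade(dem = f, output = out)   # whitebox reads and writes on disk, not in memory
})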

When writing scripts, developers typically follow a standard workflow:
1. Import required libraries
2. Generate functions useful throughout the script
3. Establish working directories or paths to other directories if needed
4. Import data
5. Data Cleaning and Preprocessing - this may involve handling missing values, removing outliers, converting to preferred units, etc.
6. Exploratory Data Analysis - it is beneficial to explore data visually to help understand the characteristics of the data.
7. Apply functions, models, or other analytical techniques
8. Evaluate results - If modeling, this may involve comparing model predictions with observed data or conducting sensitivity analysis
9. Visualize and report - plots, maps and tables can be effective ways to communicate findings.

While you might find slight variations among collaborators, following this general workflow ensures that our scripts are structured in a way that facilitates easy sharing and reproducibility of results.

5.2.3.1.1 Generate functions

Since we have imported the required libraries, let’s generate some functions.

Q1.(2 pnt) What does date: “14 January, 2025” in the header do?

ANSWER:

Q2.(3 pnt) What does ‘recursive = TRUE’ do? What would ‘recursive = FALSE’ return? ANSWER:

EXTRA (1 pnt) : Rewrite this function to generate ‘splitbase’ with fewer lines. For example, what happens when you replace ‘basename’ in ‘splitbase <- strsplit(basename,’_‘)[[1]][1]’ with ‘splitpath[[1]][2]’?

Q3. (3 pnt) What is the point of writing a function? Why do you think it is advantageous to write functions early in your script? ANSWER:

5.2.3.1.2 Establish new directories (folders)

We will create new folders in our working directory to store the outputs of the Whitebox functions.
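A small sketch of one way to do this; the folder names are placeholders chosen to match the processing steps below:

out_dirs <- c("resampled", "filled", "breached", "d8pointer", "flowaccum", "streams")
for (d in out_dirs) {
  if (!dir.exists(d)) dir.create(d)   # create each output folder only if it doesn't exist yet
}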

Check your working directory, you should see the new folders there.

5.2.3.2 Resample DEMs

Here we will start with LiDAR data with 0.5m resolution. While this resolution has useful applications, a high resolution DEM can make hydrological models very computationally expensive with little or no improvement to the output. If you have the option of high resolution data in your work, you can test model outputs at different resolutions to determine what is the most efficient and effective resolution for your work. Here, we will resample our LiDAR data to a 10m resolution.
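A hedged sketch of the resampling step, assuming the LiDAR tiles sit in a placeholder 'dems' folder and that wbt_resample handles the interpolation (check the Whitebox manual for its method options):

dem_files <- list.files("dems", pattern = "\\.tif$", full.names = TRUE)

lapply(dem_files, function(f) {
  out <- file.path("resampled", basename(f))
  wbt_resample(input = f, output = out, cell_size = 10)   # resample to 10 m cells
})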

Q4. (3 pnt) Did we use the function extractsitenames in the above chunk? How? What did it do?

ANSWER:

Let’s quickly check our work by importing a resampled DEM and checking the resolution.
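For example, a quick check with terra (the file name is a placeholder):

library(terra)

chk <- rast(file.path("resampled", "example_dem.tif"))
res(chk)   # x and y cell size of the resampled DEM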

Note: This can also be done without importing the raster to the workspace by installing the library gdalUtils.

Q5.(2 pnt) What is the resolution of the resampled DEM? Where and how could we change the resolution to 30m if desired?

ANSWER:

5.2.3.3 Filling and breaching

When performing hydrological analysis on a DEM, the DEM usually needs to be pre-processed by ‘filling’ or ‘breaching’ any depressions or sinks to create a hydraulically connected and filled DEM. There are several depression or pit filling options available in whitebox. Breaching depressions can be a better option than just pit filling according to the whitebox documentation; however, some users argue that this can smooth too much, resulting in an altered watershed delineation. It is prudent to investigate different DEM pre-processing methods and their resulting DEMs. You can fill depressions directly, breach depressions and then fill them, or fill/breach single cell pits before breaching or filling depressions, and use the approach that generates the most reasonable watershed delineation results. Here we are going to make extra sure our flowpaths are uninhibited by first filling in single cell pits, and then breaching any larger depressions.
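A hedged sketch of that two-step sequence, reusing the placeholder folder names from above:

resampled <- list.files("resampled", pattern = "\\.tif$", full.names = TRUE)

lapply(resampled, function(f) {
  filled   <- file.path("filled", basename(f))
  breached <- file.path("breached", basename(f))
  wbt_fill_single_cell_pits(dem = f, output = filled)        # step 1: fill single-cell pits
  wbt_breach_depressions(dem = filled, output = breached)    # step 2: breach larger depressions
})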

Q6 (2 pnt) What is breach1? How is lapply using breach1? ANSWER:

5.2.3.4 Flow direction and accumulation rasters

Q7 (2 pnt) Check out the WhiteboxTools User Manual. What does a d8pointer raster tell us and what might we use it for? ANSWER:
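For reference, a hedged sketch of generating both rasters with whitebox, again using the placeholder folders:

breached <- list.files("breached", pattern = "\\.tif$", full.names = TRUE)

lapply(breached, function(f) {
  wbt_d8_pointer(dem = f, output = file.path("d8pointer", basename(f)))
  wbt_d8_flow_accumulation(input = f, output = file.path("flowaccum", basename(f)))
})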

Let’s visualize some of our work so far:

Q8. (3 pnt) What are the units for the values that you see in each of these legends? It may be helpful to check out the Manual again. ANSWER:

5.2.3.5 Streams

Define streams using the flow accumulation raster. The wbt_extract_streams function will return a raster in which only stream cells are indicated.
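A hedged sketch, with a placeholder accumulation threshold that you would tune to your resolution and area:

flowacc <- list.files("flowaccum", pattern = "\\.tif$", full.names = TRUE)

lapply(flowacc, function(f) {
  wbt_extract_streams(flow_accum = f,
                      output = file.path("streams", basename(f)),
                      threshold = 1000)   # placeholder threshold (in flow-accumulation units)
})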

Sometimes we would prefer to see the stream within the watershed boundary. Here we are using the extent of the flow accumulation raster to generate a raster with ‘0’ to indicate watershed cells, along with stream cells indicated by the streams.tif. This is a demonstration for one watershed.

5.2.3.6 Final thoughts:

There are many more hydrological preprocessing and analysis tools available through Whitebox for R. If you are interested in watershed delineation in R, there is a tutorial here that is fairly easy to follow. However, if you find that you use these tools frequently and do not use R much in other work, you may also consider these options for hydrological analysis:

1. SAGA: SAGA tools offer a versatile suite of geospatial processing capabilities accessible through both QGIS and ArcGIS plugins as well as their standalone GUI interface. Often I find the GUI easiest for preprocessing; then I will import SAGA’s output rasters to QGIS for formatting or map making, or into model scripts. SAGA has a robust online support community, so it can be a valuable resource for hydrological work.

2. Similarly, Whitebox GAT tools can be used as a plugin to QGIS and ArcGIS, providing Whitebox functionality directly within a GIS environment.
When using these tools, the order of operations is similar to our work above: fill the DEM, generate flow direction and flow accumulation rasters, identify channels, delineate watersheds, then move forward according to the specificity of your project. Ultimately, the choice of workflow is yours, but I suggest documenting your process as you proceed, including links or file paths to projects and scripts within the written workflow. It’s also important to carefully consider the organization and storage of your projects and files. For instance, files generated by a GIS project should be readily accessible to any associated scripts. Returning to a preprocessing step can be challenging if there’s no clear way to trace back your workflow and regenerate a crucial layer.

5.3 Transfer function rainfall-runoff models

5.3.1 Learning Module

5.3.1.1 Summary

In previous modules, we explored how watershed characteristics influence the flow of input water through or over hillslopes to quickly contribute to stormflow or to be stored for later contribution to baseflow. Therefore, the partitioning of flow into baseflow or stormflow can be determined by the time it spends in the watershed. Furthermore, the residence time of water in various pathways may affect weathering and solute transport within watersheds. To improve our understanding of water movement within a watershed, it can be crucial to incorporate water transit time into hydrological models. This consideration allows for a more realistic representation of how water moves through various storage compartments, such as soil, groundwater, and surface water, accounting for the time it takes for water to traverse these pathways.

In this module, we will model the temporal aspects of runoff response to input using a transfer function. First, please read:

TRANSEP - a combined tracer and runoff transfer function hydrograph separation model

This chapter will then step through key concepts in the paper to facilitate hands-on exploration of the rainfall-runoff portion of the TRANSEP model in the assessment. Finally, we will introduce examples of other transfer functions to demonstrate alternative ways of representing time-induced patterns in hydrological modeling, prompting you to consider response patterns in your study systems.

5.3.1.2 Overall Learning Objectives

At the end of this module, students should be able to describe several ways to model and identify transit time within hydrological models. They should have a general understanding of how water transit time may influence the timing and composition of runoff.

5.3.1.3 Terminology

In modeling a flow system, note that consideration of time may vary depending on the questions being asked. Transit time is the average time required for water to travel through the entire flow system, from input (e.g., rainfall on soil surface) to output (e.g., discharge). Residence time is a portion of transit time, describing the amount of time water spends within a specific component of the flow system, like storage (e.g., in soil, groundwater, or a lake).

Figure 6.3. Conceptual diagram of the lumped parameter transit time modeling approach (McGuire & McDonnell, 2006)



A transfer function (TF) is a mathematical representation of how a system responds to input signals. In a hydrological context, it describes the transformation of inputs (e.g. precipitation) to outputs (e.g. runoff). These models can be valuable tools for understanding the time-varying dynamics of a hydrological system.

5.3.1.4 The Linear Time-Invariant TF

We’ll begin the discussion in the context of a linear reservoir. Linear reservoirs are simple models designed to simulate the storage and discharge of water in a catchment. These models assume that the catchment can be represented as a single storage compartment or as a series of interconnected storage compartments, and that the change in the amount of water stored in the reservoir (or reservoirs) is directly proportional to the inflows and outflows. In other words, the linear relationship between storage and outflow means that the rate of water release is proportional to the amount of water stored in the reservoir.
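Written out (a standard single-reservoir formulation; \(S\) is storage, \(I\) inflow, \(Q\) outflow, and \(k\) a storage constant):

\[ \frac{dS}{dt} = I(t) - Q(t), \qquad Q(t) = kS(t) \]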



5.3.1.4.1 The Instantaneous Unit Hydrograph:

The Instantaneous Unit Hydrograph (IUH) represents the linear rainfall-runoff model used in the TRANSEP model. It is an approach to hydrograph separation that is useful for analyzing the temporal distribution of runoff in response to a ‘unit’ pulse of rainfall (e.g. uniform one-inch depth over a unit area represented by a unit hydrograph). In other words, it is a hydrograph that results from one unit (e.g. 1 mm) of effective rainfall uniformly distributed over the watershed and occurring in a short duration. Therefore, the following assumptions are made when the IUH is used as a transfer function:
1. the IUH reflects the ensemble of watershed characteristics
2. the shape characteristics of the unit hydrograph are independent of time
3. the output response is linearly proportional to the input



Chow, V.T., Maidment, D.R. and Mays, L.W. (1988) Applied Hydrology. International Edition, McGraw-Hill Book Company, New York.

By interpreting the IUH as a transfer function, we can model how the watershed translates rainfall into runoff. In the TRANSEP model, this transfer function is represented as \(g(\tau)\) and describes the rainfall-induced runoff response.

\[ g(\tau) = \frac{\tau^{\alpha-1}}{\beta^{\alpha}\Gamma(\alpha)}\exp\left(-\frac{\tau}{\beta}\right) \]

The linear portion of the TRANSEP model describes a convolution of the effective precipitation and a runoff transfer function.

\[ Q(t)= \int_{0}^{t} g(\tau)p_{\text{eff}}(t-\tau)d\tau \]

Whoa, wait…what? Tau, integrals, and convolution? Don’t worry about the details of the equations. Check out this video to have convolution described using dollars and cookies, then imagine each dollar as a rainfall unit and each cookie as a runoff unit. Review the equations again after the video.



knitr::include_url("https://www.youtube.com/embed/aEGboJxmq-w")
5.3.1.4.2 The Loss Function:

The loss function in the TRANSEP model converts measured precipitation into effective precipitation, which is then routed to runoff by the transfer function.

\[ s(t) = b_{1}\,p(t) + (1 - b_{2}^{-1})\,s(t - \Delta t) \]
\[ s(t=0) = b_{3} \]
\[ p_{\text{eff}}(t) = p(t)\,s(t) \]
where \(p_{\text{eff}}(t)\) is the effective precipitation.
\(s(t)\) is the antecedent precipitation index which is determined by giving more importance to recent precipitation and gradually reducing that importance as we go back in time.
The rate at which this importance decreases is controlled by the parameter \(b_{2}\).
The parameter \(b_{3}\) sets the initial antecedent precipitation index at the beginning of the simulated time series.

In other words, these equations are used to simulate the flow of water in a hydrological system over time. The first equation represents the change in stored water at each time step, taking into account precipitation, loss to runoff, and the system’s past state. The second equation sets the initial condition for the storage at the beginning of the simulation. The third equation calculates the effective precipitation, considering both precipitation and the current storage state.

5.3.1.4.3 How do we code this?

We will use a skeletal version of TRANSEP, focusing only on the rainfall-runoff piece which includes the loss-function and the gamma transfer function.

We will use rainfall and runoff data from TCEF to model annual streamflow at a daily time step. Then we can use this model as a jump-off point to start talking about model calibration and validation in future modules.

5.3.1.4.4 Final thoughts:

If during your modeling experience, you find yourself wading through a bog of complex physics and multiple layers of transfer functions to account for every drop of input into a system, it is time to revisit your objectives. Remember that a model is always ‘wrong’. Like a map, it provides a simplified representation of reality. It may not be entirely accurate, but it serves a valuable purpose. Models help us understand complex systems, make predictions, and gain insights even if they are not an exact replica of the real world. Check out this paper for more:

https://agupubs.onlinelibrary.wiley.com/doi/10.1029/93WR00877

5.3.2 Labwork (20 pts)

5.3.3 Download the repo for this lab HERE

In this homework/lab, you will write a simple, lumped rainfall-runoff model. The foundation for this model is the TRANSEP model from Weiler et al., 2003. Since TRANSEP (tracer transfer function hydrograph separation model) contains a tracer module that we don’t need, we will only use the loss function (Jakeman and Hornberger) and the gamma transfer function for the water routing.
The data for the model is from the Tenderfoot Creek Experimental Forest in central Montana.

Load packages with a function that checks and installs packages if needed:

#knitr::opts_chunk$set(echo = FALSE)
# 
# # Write your package testing function
# pkgTest <- function(x)
# {
#   if (x %in% rownames(installed.packages()) == FALSE) {
#     install.packages(x, dependencies= TRUE)
#   }
#   library(x, character.only = TRUE)
# }
# 
# # Make a vector of the packages you need
# neededPackages <- c('tidyverse', 'lubridate', 'tictoc', 'patchwork') #tools for plot titles 
# 
# # For every package in the vector, apply your pkgTest function
# for (package in neededPackages){pkgTest(package)}
# 
# # tictoc times the execution of different modeling methods
# # patchwork is used for side-by-side plots
# # Let's assume that you know you put your data file in the working directory, but cannot recall its name. Let's do some working directory exploration with script:
# 
# # Check your working directory:
# print(getwd())
# 
# # Check the datafile name or path by listing files in the working directory.
# filepaths <-list.files()
# 
# # Here is an option to list only .csv files in your working directory:
# csv_files <- filepaths[grepl(".csv$", filepaths, ignore.case = TRUE)]
# print(csv_files)
Read in the data, convert the column that has the date to a (lubridate) date, and add a column that contains the water year

# # Identify the path to the desired data. 
# filepath <- "P_Q_1996_2011.csv"
# indata <- read.csv(filepath)
# 
# indata <- indata %>%
#   mutate(Date = mdy(Date)) %>% # convert "Date" to a date object with mdy()
#   mutate(wtr_yr = if_else(month(Date) > 9, year(Date) + 1, year(Date)))

Define input year

We could use every year in the time series, but for starters, we’ll use 2006. Use filter() to extract the 2006 water year.

# PQ <- indata %>%
#   filter(wtr_yr == 2006) # extract the 2006 water year with filter()
# 
# # plot discharge for 2006
# ggplot(PQ, aes(x = Date, y = Discharge_mm)) +
#   geom_line()
# 
# # make flowtime correction - flowtime is a time-weighted cumulative flow, which aids in understanding the temporal distribution of flow, giving more weight to periods of higher discharge. flow time correction is relevant in hydrology when analyzing Q or time series data, so we can compare hydrological events on a standardized time scale. It can help to identify patterns, assess the duration of high or low flows, and compare behavior of watersheds over time.
# 
# PQ <- PQ %>% 
#   mutate(flowtime = cumsum(Discharge_mm)/mean(Discharge_mm)) %>% 
#   mutate(counter = 1:nrow(PQ))
# 
# ggplot() +
#   geom_line(data=PQ, aes(x = flowtime, y = Discharge_mm)) + 
#   geom_line(data=PQ, aes(x = counter, y = Discharge_mm), color="red")

1) QUESTION (1 pt): What does the function cumsum() do? ANSWER:

Define the initial inputs

This chunk defines the initial inputs for measured precip and measured runoff.

# tau <- 1:nrow(PQ) # simple timestep counter the same length as PQ
# Pobs <- PQ$RainMelt_mm # observed precipitation from PQ df
# Qobs <- PQ$Discharge_mm # observed streamflow from PQ df

Parameterization

We will use these parameters. You can change them if you want, but I’d suggest leaving them like this at least until you get the model to work. Even tiny changes can have a huge effect on the simulated runoff.

# # Loss function parameters
# b1 <- 0.0018 # volume control parameter (b1 in eq 4a)
# b2 <- 50 # backwards weighting parameter (b2 in eq 4a) 
# # determines how much weight or importance is given to past precipitation events when calculating an antecedent precipitation index. "Exponential weighting backward in time" means that the influence of past precipitation events diminishes as you move further back in time, and this diminishing effect follows an exponential pattern.
# 
# b3 <- 0.135 # initial s(t) value for s(t=0) (b3 in eq 4b) - Initial antecedent precipitation index value.
# 
# # Transfer function parameters
# a <- 1.84 # TF shape parameter
# b <- 3.29 # TF scale parameter

Loss function

This is the module for the Jakeman and Hornberger loss function where we turn our measured input precip into effective precipitation (p_eff). This part contains three steps.
1) Preallocate a vector p_eff: initiate an empty vector for effective precipitation that we will fill in with a loop using p_eff(t) = p(t)s(t). Effective precipitation is the portion of precipitation that generates streamflow and event water contribution to the stream. It is separated to produce event water and displace pre-event water into the stream.
2) Set the initial value for s: s(t) is an antecedent precipitation index. How much does antecedent precipitation affect effective precipitation?
3) Generate p_eff inside of a for-loop.
Please note: The Weiler et al. (2003) paper states that one of the loss function parameters (vol_c) can be determined from the measured input. That is actually not the case.

s(t) is the antecedent precipitation index that is calculated by exponentially weighting the precipitation backward in time according to the parameter b2, which acts as a ‘dial’ that places weight on past precipitation events.

# # preallocate the p_eff vector
# p_eff <- vector(mode = "double", length(tau))
# 
# s <- b3 # at this point, s is equal to b3, the start value of s
# 
# # loop with loss function
# for (i in 1:length(p_eff)) {
#   s <- b1 * Pobs[i] + (1 - 1/b2) * s # this is eq 4a from Weiler et al. (2003)
#   p_eff[i] <- Pobs[i] * s # this is eq 4c from Weiler et al. (2003)
# }
# 
# 
# # #### alternative way to calculate p_eff by populating an s vector  
# # preallocate s_alt vector
# # s_alt <- vector(mode = "double", length(tau))
# # s_alt[1] <- b3
# # # preallocate p_eff vector
# # p_eff_new <- vector(mode = "double", length(tau))
# # # start loop
# # for (i in 2:(length(p_eff))) {
# #   s_alt[i] <- b1 * Pobs[i] + (1 - 1/b2) * s_alt[i-1]
# #   p_eff_new[i] <- Pobs[i] * s_alt[i]
# # }
# ####
# # set a starting value for s
# 
# 
# ## plot observed and effective precipitation against the date
# # wide to long with pivot_longer()
# precip <- tibble(date = PQ$Date, obs = Pobs, sim = p_eff) %>%
#   pivot_longer(names_to = "key", values_to = "value", -date)
# 
# 
# # a good way to plot P time series is with a step function. ggplot2 has geom_step() for this.
# ggplot(precip, aes(x = date, y = value, color = key)) +
#   geom_step() +
#   labs(color = "P")
# 
# 
# ## plot the ratio of p_eff/Pobs (call that ratio "frac")
# precip_new <- tibble(date = PQ$Date, obs = Pobs, sim = p_eff, frac = p_eff/Pobs)
# 
# ggplot(data = precip_new, aes(x = date, y = frac)) +
#   geom_line()

2) QUESTION (3 pts): Interpret the two figures. What is the meaning of “frac”? For this answer, think about what the effective precipitation represents. When is frac “high”, when is “frac” low?

ANSWER:

Extra Credit (2 pt): Combine the two plots into one, with a meaningfully scaled secondary y-axis

Runoff transfer function

Routing module -

This short chunk sets up the TF used for the water routing. This part contains only two steps: 1) the calculation of the actual TF. 2) normalization of the TF so that the sum of the TF equals 1.

tau(0) is the mean residence time

# #gTF <- tau
# 
# # this is the "raw" transfer function. This is eq 13 from Weiler 2003. Use the time step for tau. The Gamma function in R is gamma(). NOTE THAT THIS EQ IN WEILER ET AL IS WRONG! It's exp(-tau / b) and not divided by alpha.
# g <- (tau^(a - 1)) / ((b^a * gamma(a))) * exp(-tau / b)
# 
# # normalize the TF, g, by the total sum of g.
# gTF <- g / sum(g)
# 
# # plot the TF as a line against tau. You need to put the two vectors tau and gTF into a new df/tibble for this. 
# tf_plot <- tibble(tau, gTF)
# ggplot(tf_plot, aes(tau, gTF)) +
#   geom_line()

3) QUESTION (2 pt): Why is it important to normalize the transfer function?
ANSWER:

4) QUESTION (4 pt): Describe the transfer function. What are the units and what does the shape of the transfer function mean for runoff generation?
ANSWER:

Convolution

This is the heart of the model. Here, we convolute the input with the TF to generate runoff. There is another thing you need to pay attention to: we are only interested in one year (365 days), but since our TF itself is 365 timesteps long, small parts of all inputs except for the first one will be turned into streamflow AFTER the water year of interest, that is, in the following year. In practice, this means you have two options to handle this.
1) You calculate q_all and then manually cut the matrix/vector to the correct length at the end, or
2) You only populate a vector at each time step and put the generated runoff per iteration in the correct locations within the vector. For this to work, you would need to trim the length of the generated runoff by one during each iteration. This approach is more difficult to code, but saves a lot of memory since you are only calculating/storing one vector of length 365 (or 366 during a leap year).
We will go with option 1). The code for option 2 is shown at the end for reference.

Convolution summarized (recall dollars and cookies): Each loop iteration fills a column of the matrix representing the response to the input at a specific time step. Each time step of p_eff is an iteration, and for each timestep the loop multiplies the effective precipitation at that timestep by the entire transfer function. The rows of the matrix (one per time step) are then summed and stored in the vector q_all. q_all_loop is an intermediate step in the convolution process and can help visualize how the convolution evolves over time. As an example, if we have precip at time step 1, we are interested in how this contributes to runoff at time steps 1, 2, 3, etc., all the way to 365 (or the end of our period). So the first column of the matrix q_all_loop represents the contribution of precipitation at timestep 1 to runoff at each timestep, and the second column represents the contribution of precipitation at time step 2 to runoff at each timestep. Then when we sum across each row, we get q_all, where each element represents the total runoff at a specific time step.

# tic()
# # preallocate qsim matrix with the correct dimensions. Remember that p_eff and gTF are the same length.
# q_all_loop <- matrix(0, length(p_eff) * 2, length(p_eff)) # set number of rows and columns
# 
# # convolution for-loop
# for (i in 1:length(p_eff)) { # loop through length of precipitation timeseries
#   q_all_loop[(i):(length(p_eff) + i - 1), i] <- p_eff[i] * gTF # populate the q_all matrix (this is the same code as the UH convolution problem with one tiny change because you need to reference gTF and not UH)
# }
# 
# 
# # add up the rows of the matrix to generate the final runoff and replace matrix with final Q
# q_all <- apply(q_all_loop, 1, sum, na.rm = TRUE)
# 
# # cut the vector to the appropriate length of one year (otherwise it won't fit into the original df with the observed data)
# q_all <- q_all[1:length(p_eff)]
# 
# # Write the final runoff vector into the PQ df
# PQ$Qsim <- q_all
# toc()

5) QUESTION (5 pts): We set the TF length to 365. What is the physical meaning of this (i.e., how well does this represent a real system and why)? Could the transfer function be shorter or longer than that?

ANSWER:

# ## THIS PART SAVES ALL INDIVIDUAL Q RESPONSE PULSES IN A NEW DF AND PLOTS THEM
# Qall <- as_tibble(q_all_loop, .name_repair = NULL)
# Qall <- Qall[ -as.numeric(which(apply(Qall, 2, var) == 0))]
# toc()
# 
# # Qall[Qall == 0] <- NA
# 
# Qall <- Qall %>%
#   mutate(Time_hrs = 1:nrow(Qall)) %>% # add the time vector
#   gather(key, value, -Time_hrs)
# 
# Qall$key <- as.factor(Qall$key)
# 
# 
# ggplot(Qall, aes(x = Time_hrs, y = value, fill = key)) +
#   geom_line(alpha = 0.2) +
#   theme(legend.position = "none") +
#   lims(x = c(1, 365)) +
#   labs(x = "DoY", y = "Q (mm/day)")

Plots

Plot the observed and simulated runoff. Include a legend and label the y-axis.

# # make long form
# PQ_long <- PQ %>%
#   select(-wtr_yr, -RainMelt_mm) %>%
#   pivot_longer(names_to = "key", values_to = "value", -Date)
# 
# # plot hydrographs (as lines) and label the y-axis
# ggplot(data = PQ_long, aes(x = Date, y = value, color = key)) +
#   geom_line() +
#   labs(x = {}, y = "Q (mm/day)", color = {}) +
#   lims(x = as.Date(c("2005-10-01", "2006-09-30")))

6) QUESTION (3 pt): Evaluate how good or bad the model performed (i.e., visually compare simulated and observed streamflow, e.g., low flows and peak flows).
ANSWER:

7) QUESTION (2 pt): Compare the effective precipitation total with the simulated runoff total and the observed runoff total. What is p_eff and how is it related to q_all? Discuss why there is a (small) mismatch between the sums of p_eff and q_all.
ANSWER:

# sum(Pobs) # observed P
# 
# sum(p_eff) # effective (modeled) P
# 
# sum(q_all) # modeled Q
# 
# sum(Qobs) # observed Q

THIS IS THE CODE FOR CONVOLUTION METHOD 2
This method saves storage requirements since only a vector is generated and not a full matrix that contains all response pulses. The workflow here is to generate a response vector for the first pulse. This will take up 365 time steps. On the second time step, we generate another response pulse with 364 steps that starts at t=2. We then add that vector to the first one. On the third time step, we generate a response pulse of length 363 that starts at t=3 and then add it to the existing one. And so on and so forth.

# # METHOD 2, Option 1
# tic()
# # preallocate qsim vector
# q_all <- vector(mode = "double", length(p_eff))
# # start convolution
# for (i in 1:length(p_eff)) {
#   A <- p_eff[i] * gTF # vector with current Q pulse
#   B <- vector(mode = "double", length(p_eff)) # reset/preallocate vector with 0s
#   B[i:length(p_eff)] <- A[1:(length(p_eff) - i + 1)] # iteration one uses the full A, iteration 2 uses the full length minus one, etc
#   q_all <- q_all + B # add new convoluted vector to total runoff
# }
# toc()
# 
# 
# # Method 2, Option 2
# tic()
# # preallocate qsim vector
# q_all_2 <- vector(mode = "double", length(p_eff))
# B <- vector(mode = "double", length(p_eff)) # preallocate vector with 0s
# 
# # start convolution
# for (i in 1:length(p_eff)) {
#   A <- p_eff[i] * gTF # vector with current Q pulse
#   B[i:length(p_eff)] <- A[1:(length(p_eff) - i + 1)] # iteration one uses the full A, iteration 2 uses the full length minus one, etc
#   q_all_2[i:length(p_eff)] <- q_all_2[i:length(p_eff)] + B[i:length(p_eff)] # add new convoluted vector to total runoff at correct locations
# }
# toc()
# 
# # plot to show the two Q timeseries are the same
# # build a comparison tibble from the two runoff vectors
# test_q_all <- tibble(time = 1:length(q_all), q1 = q_all, q2 = q_all_2)
# 
# # Create two ggplot objects
# plot_q1 <- ggplot(data=test_q_all, aes(x=time, y=q1)) +
#   geom_line() +
#   labs(title = "Q1 Plot")
# 
# plot_q2 <- ggplot(data=test_q_all, aes(x=time, y=q2)) +
#   geom_line(color="blue") +
#   labs(title = "Q2 Plot")
# 
# # Combine plots side by side
# combined_plots <- plot_q1 + plot_q2 + plot_layout(ncol = 2)
# 
# # Display combined plots
# combined_plots