Chapter 5 Term Project Assignment 1 - 10 pts

As part of the Professional Masters’ Program, you are required to develop a professional paper. The paper is intended to give you in-depth experience in the design, implementation, and completion of an original project. It will also be a way to showcase your research interests and accomplishments to potential employers or admissions committees. (See LRES 575 for more.)

In this assignment, you will develop the main goal and objectives of your term project and explore, evaluate and select data sources. The workflow for this assignment will help you develop your research question and identify the datasets you will use to address that question.

5.1 Brainstorming

Q1. What is the primary research question for your term project? (1-2 sentences) (1 pt)
Think of this as your first approach to developing a research question. It should reflect a clear purpose and be specific enough to guide your initial search for data, but it likely will evolve throughout this assignment. That is okay, even expected. You will be asked for a refined response at the end of the assignment.
Consider:
- The problem or issue you want to address
- The specific phenomenon within that issue you want to understand
- The measurable or observable outcomes you want to analyze
For example, instead of ‘How does climate change affect forests?’, you might consider ‘How has seasonal precipitation variability impacted the timing of peak NDVI in the Pacific Northwest from 2000 to 2020?’ Note that this question is specific enough to guide your search for the necessary data, such as precipitation records and NDVI time series.
ANSWER:

Q2. What are 2-3 types of data your research question requires? Address each sub-question. (2 pts)
Be as specific as you can. What types of variables or indicators are you looking for?
What format should the data be in (e.g., spatial datasets, time-series data, species counts)?
What might be the origin of this data? (e.g., satellite imagery, weather station records, survey data?)
What temporal length and resolution would be ideal to answer this question?
ANSWER:

5.2 Data exploration and selection

Now that you have an idea of what kind of data your question requires, the next step is to locate actual datasets that support your inquiry. Numerous datasets are available for public use from public repositories, government agencies, and academic institutions. Researchers also upload data to repositories like Zenodo and Figshare, which makes the data from specific studies discoverable and citable by others. It is up to you to explore a variety of sources and identify datasets that you can retrieve and work with, but here are some suggestions to get you started. Many of these platforms offer R packages that facilitate downloading, managing, and analyzing their data directly within R.

Climate or weather data:
NOAA, SNOTEL, and Google’s Climate Engine are all potential sources of earth data.

Hydrological data: USGS

Soil data: Web Soil Survey; try the R package soilDB

Water quality and air pollution: EPA Envirofacts

Global biodiversity: Global Biodiversity Information Facility with the R package rgbif; eBird with the R package auk

Fire: Monitoring Trends in Burn Severity

Remotely sensed spatial data: You may already have some experience with Google Earth Engine. This is a fast way to access an incredible library of resources. If you have used it in other courses but feel rusty, we can help! Also check out this e-book. However, if you only need a few images, USGS’s EarthExplorer and GloVis are also good resources.

5.2.1 Expand your search

This is not an exhaustive list of possibilities. Consider searching MSU’s library databases. For example, the Web of Science database contains organized, searchable datasets built by published researchers. Google searches that use advanced search operators to restrict results to specific file types (e.g., .csv, .xlsx, .geojson) can also be helpful. Engage with AI tools to explore potential datasets or repositories based on your specific topic. An active approach to searching and learning will help you discover datasets that will support your research question.

5.2.2 Adapt your research question

Once you spend a solid hour or two searching databases, you may find that you need to adapt your research question based on data availability. You will likely spend some time refining your initial question to fit the datasets you find and that is encouraged! Then resume the dataset search, and continue refining your question and searching until you can develop an executable methodology.

Q3. Explore your data (3 pts)

You are encouraged to explore many packages and write many more code chunks for your personal use. However, for the assignment submission, retrieve at least one dataset relevant to your question using an R package like dataRetrieval.

Install needed packages

#install.packages('')
#library()

Explore the available data using the vignettes or CRAN documentation for your specific dataset and package (for example, dataRetrieval’s documentation). Create a dataframe containing 4-6 columns and print the head of the dataframe.

#head(dfname)
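
One way to approach this step is sketched below, assuming the dataRetrieval package is installed and an internet connection is available. The helper name fetch_daily_flow, the site number, and the date range are illustrative assumptions and not part of the assignment; 00060 is the USGS parameter code for discharge. The retrieval call is left commented out, like the templates in this assignment, so the chunk can be knit offline.

```r
# Hypothetical helper (name and arguments are assumptions for illustration).
# Downloads USGS daily values for one site and gives the columns readable names.
fetch_daily_flow <- function(site_no, start_date, end_date) {
  raw <- dataRetrieval::readNWISdv(
    siteNumbers = site_no,
    parameterCd = "00060",   # USGS parameter code for discharge
    startDate   = start_date,
    endDate     = end_date
  )
  # Rename coded columns (e.g., X_00060_00003) to descriptive names
  dataRetrieval::renameNWISColumns(raw)
}

# Uncomment to run; requires internet access.
# The site number is an example only; replace it with a gauge relevant to you.
# flow_df <- fetch_daily_flow("06043500", "2020-01-01", "2020-12-31")
# head(flow_df)
```

Recent dataRetrieval releases may warn that the services behind readNWISdv are being retired, so check the package documentation for the currently recommended function before relying on this call.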

Write a brief summary (3-4 sentences) of what your dataframe describes.

Generate plots: Create at least one plot that summarizes the data and describe its usefulness to you. Are there gaps in the data? Do the data cover the time or space you are interested in? Are there significant outliers that need consideration?
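
The questions about gaps and outliers can also be checked numerically. The sketch below uses a made-up daily series (the column names Date and Value, the removed days, and the injected outlier are all assumptions for illustration) and flags outliers with the common 1.5 x IQR rule, which is only one of several reasonable choices.

```r
# Synthetic daily series standing in for a retrieved dataset (made-up values)
set.seed(42)
all_days <- seq(as.Date("2020-01-01"), as.Date("2020-01-31"), by = "day")
obs <- data.frame(Date = all_days,
                  Value = rnorm(length(all_days), mean = 10, sd = 2))
obs <- obs[-c(10, 11, 20), ]  # drop three days to simulate gaps
obs$Value[5] <- 60            # inject one obvious outlier

# Gaps: days in the full expected sequence that have no observation
expected_days <- seq(min(obs$Date), max(obs$Date), by = "day")
missing_days <- expected_days[!(expected_days %in% obs$Date)]

# Outliers: values more than 1.5 * IQR beyond the first or third quartile
q <- unname(quantile(obs$Value, c(0.25, 0.75)))
iqr <- q[2] - q[1]
is_outlier <- obs$Value < q[1] - 1.5 * iqr | obs$Value > q[2] + 1.5 * iqr
outliers <- obs[is_outlier, ]

print(missing_days)
print(outliers)
```

With your own data, replace obs with your dataframe and adjust the by = "day" step to match the expected sampling interval.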

Generate a histogram or density plot of at least one variable in your dataset. The script here will help start a density plot showing multiple variables. You may adapt or change this as needed.

# Packages used by the template below
#library(dplyr)
#library(tidyr)
#library(ggplot2)

# Keep only the variables you want to plot
#plottable_vars <- dfname %>%
#  dplyr::select(variable1, variable2,...)

# Reshape to long format: one row per variable-value pair
#long <- plottable_vars %>%
#  tidyr::pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Plot density plots for each variable
#ggplot(long, aes(x = Value, fill = Variable)) +
#  geom_density(alpha = 0.5) +  # Add transparency for overlapping densities
#  facet_wrap(~ Variable, scales = "free", ncol = 2) +  # Create separate panels per variable
#  theme_minimal() +
#  labs(title = "Density Plots for Variables",
#       x = "Value", y = "Density")

What does the above plot tell you about the distribution of your data? Is it as expected? Why do we need to consider the distribution of data when deciding on analysis methods?
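
As a starting point for thinking about the last question, the sketch below uses synthetic right-skewed values (generated with rlnorm, purely an assumption for illustration) to show two quick checks: comparing the mean with the median, and running a Shapiro-Wilk normality test before and after a log transform. It is not a substitute for inspecting your own plots.

```r
# Synthetic right-skewed values standing in for one of your variables
set.seed(1)
skewed <- rlnorm(200, meanlog = 0, sdlog = 1)

# In right-skewed data the mean is pulled above the median
center <- c(mean = mean(skewed), median = median(skewed))
print(center)

# Shapiro-Wilk test: a small p-value suggests the values are not normal
p_raw <- shapiro.test(skewed)$p.value
# The same test after a log transform, a common remedy for right skew
p_log <- shapiro.test(log(skewed))$p.value
print(c(raw = p_raw, log_transformed = p_log))
```

Many common methods, such as linear regression and t-tests, make assumptions about how values or residuals are distributed, so checks like these can inform whether to transform a variable or choose a different method.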

Q4. Putting it all together (3 pts). Create a table in R and export it as a .csv file to submit with your completed .rmd for this assignment.

In this table, include all datasets that you will use for your term project. You need at least two: one explanatory variable and one response variable, though you can have more of either. The script below specifies all of the sections required in this table, though you will need to change the names and information inside c() accordingly.

# Load necessary library
library(dplyr)

# Create a data frame
data_table <- data.frame(
  Dataset_Name = c("Dataset1", 'Dataset2'),
  Data_Source = c('NOAA', 'NRCS'), #Agency or institution name
  Data_Type = c("Climate", "Hydrology"), # what kind of data is it?
  Source_Link = c("https://www.link1/", 
                  "https://link2/"), # We will use these to access the sources
  Key_variables = c('precipitation', 'soil_conductivity'), 
  Temporal_range = c('2020-2021', '1979-today'),
  Spatial_Coverage = c('Montana', 'Juneau_AK'),
  Data_Quality = c("High", "Moderate"), # An example of high quality might be a dataset that has already been cleaned by an agency and is not missing data in the period or space of interest.
  Data_Quality_notes = c('QCd with no missing data', 'some unexpected values'),
  Feasibility = c("High", "Medium"), # How useable is this to you? Do you need help figuring out how to download it? Is there an R package that you need to learn to access the data?
  Feasibility_notes = c('note1', 'note2'),
  variable_type = c('explanatory', 'response')
)

# Display the table
print(data_table)

# Export the data table as a .csv to a file path of your choice:
#exportpath <- file.path(getwd(), "somefolder", "term_assign_table.csv") # replace "somefolder" with an actual folder name that exists in your local environment
#write.csv(data_table, exportpath, row.names = FALSE)

Once this is exported, you can format it for readability if desired and submit the .csv with your completed .rmd.
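
Before submitting, it can help to confirm that the export worked by reading the file back in. The sketch below is a minimal, self-contained illustration: it uses a small stand-in table (example_table, an assumption for illustration) and writes to a temporary folder so it does not touch your project files.

```r
# Small stand-in for the assignment table (illustrative values only)
example_table <- data.frame(
  Dataset_Name  = c("Dataset1", "Dataset2"),
  variable_type = c("explanatory", "response")
)

# Write to a temporary folder so this check leaves your project untouched
export_path <- file.path(tempdir(), "term_assign_table.csv")
write.csv(example_table, export_path, row.names = FALSE)

# Read the file back and confirm rows and column names survived the export
check <- read.csv(export_path)
stopifnot(nrow(check) == nrow(example_table))
stopifnot(identical(names(check), names(example_table)))
print(check)
```

For your own submission, point the read.csv() call at the path you exported to and compare the result against your full table.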

Q5. What is your updated research question? How, specifically, will the data sources listed above help you to answer that question? (1 pt)