3 Part 1. Data Wrangling
This section teaches some tips and tricks for data wrangling using the dplyr package. It reviews material from the more detailed introduction to dplyr in the BIOL 1002 labs. If you would like to review the lab manual for BIOL 1002, you can access it here.
A useful “cheatsheet” to print out and hang by your desk for this module is the dplyr cheatsheet.
3.1 Step 1: Install package and read in files
Install and load the package dplyr (if you don’t remember how to do this, refer to Section 4.10 of the Biology R Guide).
Read in the three data sets. The code is given below. You can choose to give the dataframes different names, but for ease of following along it would be wise to use the same names as here.
MACPsites <- read.csv("MACPsites.csv")
MACPspp <- read.csv("MACPspp.csv")
MACPtraits <- read.csv("MACPtraits.csv")
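If dplyr is not yet installed, the install-and-load step could look something like the sketch below. The requireNamespace() check is an addition for illustration only (it avoids reinstalling the package every time the script runs); Section 4.10 of the Biology R Guide remains the reference for this step.

```r
# Install dplyr only if it is missing. This check is an assumption added
# for illustration, not part of the lab instructions.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package for the current R session.
library(dplyr)
```

Once the three files are read in, base R functions such as head() and str() are a quick way to check that each dataframe looks the way you expect.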
3.2 Step 2: Create a new column and add it to a dataframe
Use the mutate function in dplyr to create a column which sums all the species at each site, and write it to a new dataframe called MACPspp_sum. Note that this is a row-wise operation and that I am using the pipe tool (denoted by the %>% symbol). If you would like to review using pipes, see Section 2.1.5 of the manual for BIOL 1002.
MACPspp_sum <- MACPspp %>%
  rowwise() %>%
  mutate(sppRich = sum(c_across(ABRhyp:XYLnig)))
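To see what the row-wise sum is doing, here is a minimal sketch on a made-up dataframe. The names toy_spp, toy_sum, sp1, sp2 and sp3 are hypothetical and are not part of the MACP data; the values are invented presence/absence records.

```r
library(dplyr)

# Hypothetical toy data: three sites and three species columns (1 = present).
toy_spp <- data.frame(
  site = c("A", "B", "C"),
  sp1  = c(1, 0, 1),
  sp2  = c(1, 1, 0),
  sp3  = c(0, 0, 1)
)

# rowwise() makes mutate() work one row at a time, and
# c_across(sp1:sp3) collects the values in columns sp1 to sp3 for that row.
toy_sum <- toy_spp %>%
  rowwise() %>%
  mutate(sppRich = sum(c_across(sp1:sp3))) %>%
  ungroup()  # drop the row-wise grouping afterwards

toy_sum
```

The ungroup() call at the end is optional here. It simply returns an ordinary dataframe so that later steps are not applied row by row by accident.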
3.3 Step 3: Querying and filtering the data
Use the filter function in dplyr to select only those sites that have more than 80 species, and use this to create a new dataframe called MACPhotspots. Check Section 2.1.1 of the manual for BIOL 1002 to review the code.
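As a reminder of the general pattern, the sketch below filters a made-up dataframe. The names toy_rich and toy_hotspots, the richness values, and the threshold of 8 are all hypothetical; you will need to adapt the idea to the MACP data yourself.

```r
library(dplyr)

# Hypothetical toy data: four sites with invented richness values.
toy_rich <- data.frame(
  site    = c("A", "B", "C", "D"),
  sppRich = c(12, 5, 9, 3)
)

# filter() keeps only the rows where the condition is TRUE.
toy_hotspots <- toy_rich %>%
  filter(sppRich > 8)

toy_hotspots
```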
HAND IN: Then try using a function to find the highest species richness across all sites (HINT: another word for highest richness is max richness). Describe which function you used (a code snippet would be fine!).
3.4 Step 4: Grouping and Joining Data
Because both the MACPsites data and the MACPspp data are arranged in rows, with one row per site, we can combine them into one dataframe that contains both the species and the environment data. We can do it like this:
MACPall <- inner_join(MACPsites, MACPspp_sum, by = "site")
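The behaviour of inner_join can be easier to see on a small made-up example. In the sketch below, toy_sites and toy_counts are hypothetical dataframes with invented values; only the column name site mirrors the real data.

```r
library(dplyr)

# Hypothetical site-level data (ER = ecoregion).
toy_sites <- data.frame(
  site = c("A", "B", "C"),
  ER   = c("X", "X", "Y")
)

# Hypothetical species-level data. Site "C" is missing and site "D" is extra.
toy_counts <- data.frame(
  site    = c("A", "B", "D"),
  sppRich = c(4, 7, 2)
)

# inner_join() keeps only rows whose key ("site") appears in BOTH dataframes,
# and places the columns from the two inputs side by side.
toy_all <- inner_join(toy_sites, toy_counts, by = "site")

toy_all
```

Sites that appear in only one of the two inputs are dropped by an inner join, which is worth keeping in mind when the two dataframes come from separate files.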
HAND IN: How would you confirm that you have done this correctly (there are different ways)?
Sometimes we might want to group and summarize the data in different ways. For example, we might want to group the data by Ecoregion and then see what the mean richness is per ecoregion, to see if some ecoregions have higher richness than others. In the next section we will use these summary data to make some graphs.
EcoR_spp <- group_by(MACPall, ER)
EcoR_spp_mean <- summarize(EcoR_spp, mean_rich = mean(sppRich))
EcoR_spp_se <- summarize(EcoR_spp, se_rich = sd(sppRich))
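Note that sd() returns the standard deviation of each group. If a standard error is what you want (as the name se_rich suggests), one common approach is to divide the standard deviation by the square root of the group size. The sketch below shows this on made-up data; toy_all, toy_summary and all the values are hypothetical, and computing everything in a single summarize() call is just one possible layout.

```r
library(dplyr)

# Hypothetical toy data: two ecoregions with two sites each.
toy_all <- data.frame(
  site    = c("A", "B", "C", "D"),
  ER      = c("X", "X", "Y", "Y"),
  sppRich = c(2, 4, 10, 14)
)

# group_by() defines the groups; summarize() then returns one row per group.
# n() gives the number of rows (sites) in the current group.
toy_summary <- toy_all %>%
  group_by(ER) %>%
  summarize(
    mean_rich = mean(sppRich),
    sd_rich   = sd(sppRich),
    se_rich   = sd(sppRich) / sqrt(n())
  )

toy_summary
```

Keeping the mean and the error estimate in the same dataframe can be convenient for plotting, since both are then available row by row for each ecoregion.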
3.5 What to hand in for Part 1.
Summarize the results from above where it says HAND IN. As well, in one or two sentences explain the difference between the select, filter and mutate functions in the package dplyr.