# Cleaning data using GBIF issues

#### 2020-06-29

rgbif can clean data retrieved from GBIF based on GBIF issues. These issues are returned with data retrieved from GBIF, e.g., through the occ_search() function. Inspired by magrittr, we’ve set up a workflow for cleaning data using the %>% operator. You don’t have to use it, but as we show below, it can make the process quite easy.

Note that you can also query based on issues, e.g., occ_search(taxonKey=1, issue='DEPTH_UNLIKELY'). However, we imagine you are more likely to search for occurrences by taxonomic name or geographic area than by issue, so it makes sense to pull data down first, then clean it as needed with occ_issues() using the workflow below.

Note that occ_issues() only affects the data element in the gbif class that is returned from a call to occ_search(). In a future version we may also remove the associated records from the hierarchy and media elements as they are removed from the data element.

occ_issues() also works with data from occ_download().

## Get rgbif

Install from CRAN

install.packages("rgbif")

Or install the development version from GitHub

remotes::install_github("ropensci/rgbif")

library('rgbif')

## Get some data

Get taxon key for Helianthus annuus

(key <- name_suggest(q='Helianthus annuus', rank='species')$key[1])

Then pass to occ_search()

(res <- occ_search(taxonKey=key, limit=100))

## Examine issues

The dataset gbifissues can be retrieved using the function gbif_issues(). The dataset’s first column code is a code that is used by default in the results from occ_search(), while the second column issue is the full issue name given by GBIF. The third column is a full description of the issue.

head(gbif_issues())

You can query to get certain issues

gbif_issues()[ gbif_issues()$code %in% c('cdround','cudc','gass84','txmathi'), ]

The code cdround represents the GBIF issue COORDINATE_ROUNDED, which means that:

> Original coordinate modified by rounding to 5 decimals.

This information comes from http://gbif.github.io/gbif-api/apidocs/org/gbif/api/vocabulary/OccurrenceIssue.html.
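To make the relationship between issue codes and full issue names concrete, here is a minimal base R sketch that does not call rgbif at all. The lookup table is a hand-built stand-in for a few rows of gbif_issues() (the column names code and issue follow the description above, and the depunl row assumes that code pairs with DEPTH_UNLIKELY). The helper expand_codes() is hypothetical and not part of rgbif.

```r
# Hand-built stand-in for a few rows of gbif_issues(); not the real dataset.
lookup <- data.frame(
  code  = c("cdround", "gass84", "depunl"),
  issue = c("COORDINATE_ROUNDED",
            "GEODETIC_DATUM_ASSUMED_WGS84",
            "DEPTH_UNLIKELY"),  # assumed pairing for depunl
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate a comma-separated string of issue codes
# into the matching full issue names. Unknown codes are kept unchanged.
expand_codes <- function(x, lookup) {
  codes <- strsplit(x, ",", fixed = TRUE)[[1]]
  idx <- match(codes, lookup$code)
  full <- ifelse(is.na(idx), codes, lookup$issue[idx])
  paste(full, collapse = ",")
}

expanded <- expand_codes("cdround,gass84", lookup)
print(expanded)
```

The sketch assumes that each record carries its issues as a single comma-separated string of codes, which is why the string is split before the lookup.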

## Parse data based on issues

Now that we know a bit about GBIF issues, we can parse the data based on them. Using the data retrieved above and the %>% operator imported from magrittr, we can keep only the records with the issue gass84, or GEODETIC_DATUM_ASSUMED_WGS84 (note how the number of records returned goes down to 98 from the initial 100).

res %>%
occ_issues(gass84)
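For intuition about what keeping records by issue amounts to, the following base R sketch filters a toy data frame by hand. It is not how occ_issues() is implemented. The data frame occ and the helper has_issue() are made up for illustration, and the sketch assumes the issues column holds comma-separated issue codes.

```r
# Toy stand-in for the data element returned by occ_search(); values are made up.
occ <- data.frame(
  key    = c(1, 2, 3, 4),
  issues = c("cdround,gass84", "gass84", "cdround", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for rows whose comma-separated issues string
# contains the given code as a whole token (not just as a substring).
has_issue <- function(issues, code) {
  vapply(strsplit(issues, ",", fixed = TRUE),
         function(codes) code %in% codes,
         logical(1))
}

# Keep only the records flagged with gass84.
kept <- occ[has_issue(occ$issues, "gass84"), ]
print(kept)
```

Splitting on commas before matching avoids false hits when one code happens to be a substring of another.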

Note also that we’ve set up occ_issues() so that you can pass in issue names without having to quote them, thereby speeding up data cleaning.

Next, we can remove data with certain issues just as easily by putting a - sign in front of the issue code. Here we remove records with the issues depunl and mdatunl.

res %>%
occ_issues(-depunl, -mdatunl)
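The same idea in reverse can be sketched in base R, under the assumption that each negated code removes every record that carries it. The toy data frame and the helper has_any_issue() are hypothetical and only illustrate the filtering logic, not the internals of occ_issues().

```r
# Toy stand-in for occurrence data; values are made up.
occ <- data.frame(
  key    = 1:5,
  issues = c("gass84", "depunl,gass84", "mdatunl", "cdround", "depunl,mdatunl"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for rows carrying at least one of the given codes.
has_any_issue <- function(issues, codes) {
  vapply(strsplit(issues, ",", fixed = TRUE),
         function(x) any(codes %in% x),
         logical(1))
}

# Drop every record flagged with depunl or mdatunl.
cleaned <- occ[!has_any_issue(occ$issues, c("depunl", "mdatunl")), ]
print(cleaned)
```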

## Expand issue codes to full names

Another thing we can do with occ_issues() is go from issue codes to full issue names in case you want those in your dataset (here, showing only a few columns to see the data better for this demo):

out <- res %>% occ_issues(mutate = "expand")
head(out$data[,c(1,5)])

## Add columns

Sometimes you may want to have each type of issue as a separate column.

Split out each issue type into a separate column, with number of columns equal to number of issue types

out <- res %>% occ_issues(mutate = "split")
head(out$data[,c(1,5:10)])
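As a rough picture of what one column per issue type looks like, here is a base R sketch on toy data. It is only an illustration: the data frame is made up, and encoding presence as TRUE/FALSE is a choice of this sketch, so the columns produced by occ_issues(mutate = "split") may be encoded differently.

```r
# Toy stand-in for occurrence data; values are made up.
occ <- data.frame(
  key    = 1:3,
  issues = c("cdround,gass84", "gass84", ""),
  stringsAsFactors = FALSE
)

# Split each record's issues string into its individual codes.
codes_per_row <- strsplit(occ$issues, ",", fixed = TRUE)

# The distinct issue types present in the data determine the new columns.
issue_types <- sort(unique(unlist(codes_per_row)))

# Build one logical column per issue type: TRUE where the record has it.
flags <- vapply(
  issue_types,
  function(code) vapply(codes_per_row, function(x) code %in% x, logical(1)),
  logical(nrow(occ))
)

wide <- cbind(occ["key"], as.data.frame(flags))
print(wide)
```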

Or expand each issue code into its full name and split each issue into a separate column in one step:

out <- res %>% occ_issues(mutate = "split_expand")
head(out$data[,c(1,5:10)])

We hope this helps users get just the data they want, and nothing more. Let us know if you have feedback on the data cleaning functionality in rgbif at https://github.com/ropensci/rgbif/issues.