Darwinizing biodiversity data in R

14 Mar

“Darwin Core (DwC) is a standard maintained by the Darwin Core maintenance group. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions. Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.” Darwin Core website

DwC is an evolving, community-developed biodiversity data standard (Wieczorek et al. 2012). In simple words, it is a list of about 200 common biodiversity data terms together with their definitions; for more details, see the DwC quick reference guide. Today, hundreds of millions of biodiversity records from around the world are published in DwC format and aggregated into various portals (e.g. GBIF, VertNet, iDigBio). Nonetheless, data publishers still struggle with the essential step of mapping the fields in their data to Darwin Core terms (Wieczorek et al. 2017a). Doing so requires a good understanding of both the data set and Darwin Core (Wieczorek et al. 2017b).

Related work

The remarkable Kurator project creates biodiversity data quality workflows, and its web interface (Kurator-Web) makes data quality control highly accessible. Thanks to this invaluable project, we have easy access to several lookup tables that aggregate rare and highly valuable data about DwC vocabularies. For example, the Darwin Cloud table accumulates knowledge about variations in DwC field names. Fully utilizing these data in the R environment can significantly enhance our ability to address more biodiversity data quality issues.

Details of project

Darwinizer workflow in R:

While DwC has been adopted by most biodiversity data publishers, its implementation is somewhat incomplete. Imposing a controlled vocabulary on millions of records is a complex and daunting task. For example, field names are inconsistent between data publishers. The CSV File Darwinizer Kurator workflow standardizes field names to the DwC standard names using the Darwin Cloud lookup file. By reproducing this workflow in R, we can easily take in a wider range of data from different publishers. This module needs to handle the various data files downloaded from different biodiversity portals.
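To make the idea of name standardization concrete, here is a minimal sketch in base R. It is not the Kurator Darwinizer itself: the lookup rows are illustrative and not copied from the real Darwin Cloud file, and the function name `darwinize_names` and the column names `fieldname` and `standard` are assumptions made for this example.

```r
# Hypothetical excerpt of a Darwin Cloud-style lookup table: each row maps a
# field-name variant to a standard DwC term.
# (Illustrative rows only; not copied from the real Darwin Cloud file.)
dwc_lookup <- data.frame(
  fieldname = c("lat", "latitude", "long", "longitude", "species", "sciname"),
  standard  = c("decimalLatitude", "decimalLatitude", "decimalLongitude",
                "decimalLongitude", "scientificName", "scientificName"),
  stringsAsFactors = FALSE
)

# Rename the columns of `data` to DwC terms where a match is found.
# Matching ignores case and surrounding whitespace; unmatched columns are kept.
darwinize_names <- function(data, lookup = dwc_lookup) {
  cleaned <- tolower(trimws(names(data)))
  idx     <- match(cleaned, tolower(lookup$fieldname))
  new     <- ifelse(is.na(idx), names(data), lookup$standard[idx])
  # Refuse to rename if two columns would collapse onto the same DwC term
  if (anyDuplicated(new)) {
    stop("Several columns map to the same DwC term: ",
         paste(unique(new[duplicated(new)]), collapse = ", "))
  }
  names(data) <- new
  data
}

# Made-up records with non-standard field names
records <- data.frame(
  Latitude = c(32.1, -4.5),
  Long     = c(34.8, 39.2),
  SciName  = c("Panthera leo", "Loxodonta africana"),
  recorder = c("A", "B"),
  stringsAsFactors = FALSE
)
darwinized <- darwinize_names(records)
names(darwinized)
```

Stopping on duplicate targets is a deliberate choice in this sketch: if a file contains both `lat` and `latitude`, a human should decide which column becomes `decimalLatitude`.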

Data checks:

Data checks must be specifically tailored to the structure of the data, in our case the DwC standard. Under this module we will address three major collections of data checks:

  • TDWG core suite of tests and assertions: Tests and rules that generate assertions at the record level are more fundamental than the tools or workflows built on them. Ideally, this core suite of data quality checks needs to be embraced by all data publishers (as a standard), and hopefully in the long term this will be the case. In the short term, since constructing many of them in R is quite feasible, we plan to do so. Furthermore, embracing this standard will improve our ability to properly manage data checks. Various data checks were developed in last year's GSoC projects by Ashwin and Thiloshon, but some adjustment and further development are still required.
  • Imposing controlled vocabulary on key data fields: Using Kurator’s vocabulary data, different DwC standardization procedures can be addressed. The challenge will be to assess and prioritize the development of these procedures.
  • New frontiers: Enriching DwC data (i.e. accurately joining external data) can greatly boost the capacity and diversity of data checks. For example, joining species trait data or retrieving climatic data for each record opens up a variety of new checks. Here we need to screen for robust data enrichment procedures in R and design exciting data checks around them.
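As a rough illustration of the first two collections, the base R sketch below pairs one record-level check with one controlled-vocabulary step. Neither is taken from the TDWG suite or from Kurator's vocabulary data: the function names, the small `basis_vocab` lookup, and the sample records are all hypothetical, and only the DwC terms `decimalLatitude`, `decimalLongitude`, and `basisOfRecord` come from the standard.

```r
# Record-level check: is the coordinate pair within the valid range?
# Returns one logical per record; records with missing coordinates fail.
check_coordinates_in_range <- function(data) {
  lat <- data$decimalLatitude
  lon <- data$decimalLongitude
  !is.na(lat) & !is.na(lon) & abs(lat) <= 90 & abs(lon) <= 180
}

# Hypothetical vocabulary lookup for basisOfRecord: lower-cased variants on
# the left, the DwC value on the right. (Illustrative entries only.)
basis_vocab <- c(
  "preservedspecimen"  = "PreservedSpecimen",
  "preserved specimen" = "PreservedSpecimen",
  "humanobservation"   = "HumanObservation",
  "human observation"  = "HumanObservation",
  "human_observation"  = "HumanObservation"
)

# Map raw values to the controlled vocabulary; unknown values become NA so
# they can be reviewed rather than silently guessed.
standardize_basis_of_record <- function(x, vocab = basis_vocab) {
  key <- tolower(trimws(x))
  unname(vocab[key])
}

# Made-up occurrence records
occ <- data.frame(
  decimalLatitude  = c(32.1, 95, NA),
  decimalLongitude = c(34.8, 10, 20),
  basisOfRecord    = c("Preserved Specimen", "HUMAN_OBSERVATION", "unknown"),
  stringsAsFactors = FALSE
)
occ$coordinatesInRange <- check_coordinates_in_range(occ)
occ$basisOfRecordStd   <- standardize_basis_of_record(occ$basisOfRecord)
occ
```

Both functions add a new column instead of overwriting the original values, so every assertion stays traceable to the raw data it was derived from.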

R based Kurator actors:

Following communication with the Kurator development team, we will explore the development of R actors (functions) that will hopefully be seamlessly integrated into the Kurator infrastructure. Handling some Java and Python code will be required.

Expected impact

Improving the quality of biodiversity research depends, in some measure, on improving user-level data cleaning tools and skills. Adopting a more comprehensive approach that incorporates data cleaning as part of data analysis will not only improve the quality of biodiversity data but also encourage more appropriate use of such data.

If you are a full-time university student and would like to participate in Google Summer of Code 2018 by helping us build this, the project idea is listed here, with further details on how to start.

