
Darwinizing biodiversity data in R

14 Mar

“Darwin Core (DwC) is a standard maintained by the Darwin Core maintenance group. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions. Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.” Darwin Core website

DwC is an evolving, community-developed biodiversity data standard (Wieczorek et al. 2012). In simple words, it is a list of about 200 common biodiversity data terms and their definitions; for more details, see the DwC quick reference guide. Today, hundreds of millions of biodiversity records from around the world are published in DwC format and aggregated into various portals (e.g. GBIF, VertNet, iDigBio). Nonetheless, data publishers still struggle with the essential step of mapping fields in their data to the terms in Darwin Core (Wieczorek et al. 2017a). Doing so requires a good understanding of both the data set and Darwin Core (Wieczorek et al. 2017b).

Related work

The remarkable Kurator project creates biodiversity data quality workflows, and its web interface (Kurator-Web) makes data quality control highly accessible. Thanks to this invaluable project, we have easy access to different lookup tables that aggregate rare and highly valuable data regarding DwC vocabularies. For example, the Darwin Cloud table accumulates knowledge about variations in DwC field names. Fully utilizing these precious data in the R environment can significantly enhance our ability to address more biodiversity data quality issues.

Details of project

Darwinizer workflow in R:

While DwC has been adopted by most biodiversity data publishers, its implementation is somewhat incomplete. Imposing controlled vocabulary on millions of records is a complex and daunting task. For example, there are inconsistencies regarding field names between different data publishers. The CSV File Darwinizer Kurator workflow standardizes field names to the DwC standard names using the Darwin Cloud lookup file. By reproducing this workflow in R, we can easily input a wider range of data from different publishers. This module needs to handle the various data files downloaded from different biodiversity portals.
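
The field-name mapping step can be illustrated with a minimal base R sketch. The lookup table here is a hypothetical, hand-written stand-in for the Darwin Cloud file (the real file's layout may differ), and `darwinize_names()` is an illustrative helper name, not a function from Kurator or any published package.

```r
# Hypothetical stand-in for the Darwin Cloud lookup: one row per known
# field-name variant, with the DwC term it should map to.
dwc_lookup <- data.frame(
  fieldname = c("lat", "latitude", "lon", "long", "species name", "collector"),
  standard  = c("decimalLatitude", "decimalLatitude", "decimalLongitude",
                "decimalLongitude", "scientificName", "recordedBy"),
  stringsAsFactors = FALSE
)

# Rename the columns of `data` to their DwC term when a variant is recognised;
# unrecognised columns are left untouched.
darwinize_names <- function(data, lookup) {
  # Compare names case-insensitively, ignoring surrounding whitespace
  key <- tolower(trimws(names(data)))
  hit <- match(key, tolower(lookup$fieldname))
  new_names <- ifelse(is.na(hit), names(data), lookup$standard[hit])
  # Refuse to create duplicate column names (e.g. both "lat" and "latitude")
  if (anyDuplicated(new_names)) {
    stop("Two or more columns map to the same DwC term")
  }
  names(data) <- new_names
  data
}

# Example input with publisher-specific column names (made-up records)
occ <- data.frame(Lat = c(32.1, 31.8), LON = c(34.8, 35.2),
                  `Species Name` = c("Gazella gazella", "Vulpes vulpes"),
                  notes = c("a", "b"), check.names = FALSE,
                  stringsAsFactors = FALSE)
occ_dwc <- darwinize_names(occ, dwc_lookup)
names(occ_dwc)
```

Stopping on duplicate targets rather than silently merging columns is a design choice made for this sketch; a real workflow might instead report the conflict and let the user pick.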

Data checks:

Data checks must be specifically tailored to the structure of the data, in our case the DwC standard. Under this module we will address three major collections of data checks:

  • TDWG core suite of tests and assertions: Tests and rules generating assertions at the record level are more fundamental than the tools or workflows that will be based on them. Ideally, this core suite of data quality checks needs to be embraced by all data publishers (as a standard), and hopefully, in the long term, this will be the case. In the short term, however, since constructing many of them in R is quite feasible, we plan to do so. Furthermore, embracing this standard will improve our ability to properly manage data checks. Various data checks were developed in last year’s GSoC projects by Ashwin and Thiloshon, though some adjustment and further development are still required.
  • Imposing controlled vocabulary on key data fields: Using Kurator’s vocabulary data, different DwC standardization procedures can be addressed. The challenge will be to assess and prioritize the development of these procedures.
  • New frontiers: Enriching DwC data (i.e. accurately joining external data) can greatly boost the capacity and diversity of data checks. For example, joining species trait data or retrieving climatic data for each record opens up a variety of checking capabilities. Here we need to screen for robust data enrichment procedures in R and design exciting data checks around them.
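
As an illustration of imposing a controlled vocabulary on one key field, here is a minimal sketch for the DwC term `basisOfRecord`. The variant table is hypothetical and typed by hand for the example; it is not Kurator's vocabulary data, and `standardize_vocab()` is an illustrative name.

```r
# Hypothetical variant table for the DwC term basisOfRecord; in practice this
# would come from a curated vocabulary lookup rather than being typed by hand.
bor_vocab <- c(
  "preservedspecimen"  = "PreservedSpecimen",
  "specimen"           = "PreservedSpecimen",
  "humanobservation"   = "HumanObservation",
  "observation"        = "HumanObservation",
  "machineobservation" = "MachineObservation"
)

# Map raw values onto the controlled vocabulary and record an assertion
# (TRUE = value recognised) instead of silently dropping anything.
standardize_vocab <- function(x, vocab) {
  key <- gsub("[^a-z]", "", tolower(x))   # ignore case, spaces, punctuation
  std <- unname(vocab[key])               # unknown or missing values become NA
  data.frame(original = x, standardized = std, recognised = !is.na(std),
             stringsAsFactors = FALSE)
}

res <- standardize_vocab(
  c("Preserved Specimen", "HUMAN_OBSERVATION", "specimen", "unknown", NA),
  bor_vocab
)
res
```

Keeping the original value next to the standardized one, together with a record-level assertion, lets the user review unrecognised values before deciding what to do with them.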

R based Kurator actors:

Following communication with the Kurator development team, we will explore the development of R actors (functions) that will hopefully be seamlessly integrated into the Kurator infrastructure. Handling some Java and Python code will be required.

Expected impact

Improving the quality of biodiversity research depends, in some measure, on improving user-level data cleaning tools and skills. Adopting a more comprehensive approach for incorporating data cleaning as part of data analysis will not only improve the quality of biodiversity data but also promote more appropriate use of such data.

If you are a full-time university student and would like to participate in Google Summer of Code 2018 and help us build this, the project idea, with further details on how to start, is listed here.


GSoC 2017 : Biodiversity data cleaning

17 May

By Ashwin Agrawal

URL of the Project Idea: https://github.com/rstats-gsoc/gsoc2017/wiki/Biodiversity-data-cleaning

Introduction

An increasing number of scientists are using R for their data analyses; however, the skill set required to handle biodiversity data in R varies considerably. Since users need to retrieve, manage and assess high-volume data with a complex structure (the Darwin Core standard, DwC), only users with an extremely sound R programming background can attempt this. Recently, various R packages dealing with biodiversity data, and specifically data cleaning, have been published (e.g. scrubr, biogeo, rgeospatialquality, assertr, and taxize). Though numerous new procedures are now available, implementing them requires users to learn each of these R packages and prepare the data according to its formats. The integration-related tasks, which would facilitate the data format conversions and smooth execution of all the available data cleaning functions from various packages, are being addressed in another GSoC project (link). The purpose of my project is to identify and address crucial missing functionality for handling biodiversity (big) data in R. Properly addressing these gaps will hopefully enable us to offer a more complete framework for data quality assessment in R.

Proposed components

1. Standardized flagging system:

Biodiversity quality assessment depends on a user’s ability to execute a variety of data checks. Thus, a well-designed flagging system will allow users to easily manage the results of their data checks, giving control over the desired quality level on the one hand and user flexibility on the other. I will assess several approaches for designing such a system, factoring in comprehensibility and programming complexity.

Any insights and ideas regarding this task will be highly appreciated (please create a GitHub issue).
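
One possible design, sketched here only as a starting point for discussion: every check is a function returning one logical value per record, and the results are stored next to the data as `flag_*` columns. All function names, the column naming scheme and the year bounds below are assumptions made for this illustration, not a decided design.

```r
# Each check takes the data frame and returns one logical per record:
# TRUE = passed, FALSE = failed, NA = could not be evaluated.
checks <- list(
  has_coordinates = function(d) !is.na(d$decimalLatitude) & !is.na(d$decimalLongitude),
  has_name        = function(d) !is.na(d$scientificName) & nzchar(d$scientificName),
  year_plausible  = function(d) d$year >= 1700 & d$year <= 2018  # arbitrary bounds
)

# Run every check and keep the results next to the data as flag_* columns,
# so nothing is removed until the user decides which flags matter.
add_flags <- function(data, checks) {
  for (nm in names(checks)) {
    data[[paste0("flag_", nm)]] <- checks[[nm]](data)
  }
  data
}

# Keep only the records that pass all of the checks the user selected;
# a check that could not be evaluated (NA) counts as not passed.
filter_by_flags <- function(data, use) {
  cols <- paste0("flag_", use)
  passed <- Reduce(`&`, lapply(cols, function(cl) data[[cl]] %in% TRUE))
  data[passed, , drop = FALSE]
}

# Made-up example records
occ <- data.frame(
  scientificName   = c("Vulpes vulpes", "", "Canis aureus", NA),
  decimalLatitude  = c(32.1, 31.5, NA, 30.2),
  decimalLongitude = c(34.8, 35.0, 35.1, 34.9),
  year             = c(2015, 1999, 2010, 1650),
  stringsAsFactors = FALSE
)

flagged      <- add_flags(occ, checks)
strict       <- filter_by_flags(flagged, names(checks))   # all checks
spatial_only <- filter_by_flags(flagged, "has_coordinates")
```

With this layout the same flagged table serves a strict user (all checks) and a user who only cares about coordinates, which is the kind of flexibility the flagging system is meant to offer.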

2. A DwC summary table:

When dealing with high-complexity, high-volume data, summary statistics of different fields and categories can be of immense value. I will develop a DwC summary table based on DwC fields and vocabulary. First, I will explore different R packages dealing with descriptive statistics and table visualizations. Then, I will map key DwC data fields and key categories for easy faceting of the summary table. In addition, the developed framework can be used to enhance the flagging system by utilizing its unique functionality to summarize the results of the data quality checks.
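
To make the idea concrete, here is a minimal base R sketch of a per-field summary. The function name `dwc_summary()` and the choice of statistics (completeness and number of distinct values) are assumptions for illustration; the eventual table may look quite different.

```r
# Summarise each field of an occurrence table: how complete it is and how
# many distinct values it holds. Empty strings are counted as missing.
dwc_summary <- function(data) {
  rows <- lapply(names(data), function(field) {
    x <- data[[field]]
    missing <- is.na(x) | as.character(x) %in% ""
    data.frame(
      field        = field,
      n_records    = length(x),
      n_missing    = sum(missing),
      pct_complete = round(100 * mean(!missing), 1),
      n_distinct   = length(unique(x[!missing])),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Made-up example records
occ <- data.frame(
  scientificName = c("Vulpes vulpes", "Vulpes vulpes", "", NA),
  countryCode    = c("IL", "IL", "JO", "IL"),
  year           = c(2015, NA, 2010, 2010),
  stringsAsFactors = FALSE
)

overall    <- dwc_summary(occ)
# Faceting: the same summary computed separately for each country code
by_country <- lapply(split(occ, occ$countryCode), dwc_summary)
overall
```

The same function could be applied to a table of flag columns to summarize the results of the quality checks.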

3. Outliers analysis:

Identifying spatial, temporal, and environmental outliers can single out erroneous records. However, identifying an outlier is a subjective exercise, and not all outliers are errors. I will develop a set of functions that will aid in the detection of outliers. Various statistical methods and techniques will be evaluated (e.g. Reverse Jackknife, Standard Deviations from the Mean, Alphahull).
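
As an example of the simplest of these techniques, standard deviations from the mean, here is a hedged sketch. The function name `sd_outliers()`, the default threshold of 3 and the latitude values are all assumptions for illustration. Because not all outliers are errors, the function only flags records and removes nothing.

```r
# Flag values lying more than `k` standard deviations from the mean.
# Returns TRUE for suspected outliers; NA inputs stay NA.
sd_outliers <- function(x, k = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  abs(z) > k
}

# Made-up latitudes: 20 records between 31.0 and 32.9, plus one record
# whose sign was flipped (-32.0), a common data-entry error.
lat <- c(31 + (0:19) / 10, -32.0)

suspect <- which(sd_outliers(lat))
suspect        # index of the record to review
lat[suspect]
```

Note that a single extreme value inflates both the mean and the standard deviation, so this method can miss outliers in small samples; that is one reason for also evaluating the other methods mentioned above.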

4. Developing new data quality checks and procedures

I will identify critically missing spatial, taxonomic and temporal data cleaning routines, factoring in the level of user need and programming complexity. Ideas and needs regarding this task will be highly appreciated (please create a GitHub issue).

Significance

Improving the quality of biodiversity research depends, in some measure, on improving user-level data cleaning tools and skills. Adopting a more comprehensive approach for incorporating data cleaning as part of data analysis will not only improve the quality of biodiversity data but also promote more appropriate use of such data. This can greatly serve the scientific community and, consequently, our ability to address urgent conservation issues more accurately.

Feedback

For feedback and suggestions, please post them on GitHub issues.

GSoC 2017 : Integrating biodiversity data curation functionality

8 May

By Thiloshon Nagarajah

Any data used in data science analyses, whether simple statistical inference or high-end machine learning, needs to meet a certain level of quality. Any dataset has ‘signal’, the answer we are trying to find, and ‘noise’, the disturbances and anomalies in the data. The important part of preparing data for any analysis is to make it easier to distinguish signal from noise. In biodiversity research, datasets can be very large, so there is a high probability of having a lot of noise. Giving researchers control over this noise reduction will provide clean and tidy data.

Biodiversity research spans a huge spectrum. It ranges from analyzing simple heredity, climate and ecosystem impacts on species to complex genome sequencing research. So the data requirements in each of these fields vary with the type of research. Taxonomic researchers will be interested in taxonomic fields and less so in the spatial or temporal aspects of the data. They will accept losing spatial data in exchange for better taxonomic data, whereas spatial researchers will be lenient about taxonomic fields but not about spatial fields. This application-specific control over cleaning and preparing data makes subsequent processes far more efficient.

The gist of this project and its sister project (Biodiversity data cleaning, by Ashwin Agrawal) is to provide this customizable control over the cleaning of the data. The cleaning and standardization are not done automatically on the fly; they are customized, tweaked and refined according to user needs. Our solution will provide the controls for that. A brief proposal of our solution is given below. For the full proposal, click here.

 

Introduction

The importance of data in biodiversity research has been repeatedly stressed in recent times, and various organizations have come together and followed one another's lead to provide data for advancing biodiversity research. But that is exactly where the main hiccup of biodiversity research lies. Since there are many such organizations, the data they aggregate vary in precision and quality. Further, although more researchers have recently started to use R for their data analyses, they need to retrieve, manage and assess high-volume data with a complex (DwC) structure, so only researchers with an extremely sound R programming background have been able to attempt this.

Various R packages created so far have focused on addressing only some elements of the entire process.

Thus, when a researcher decides to use these tools, they need to

  1. Know that these packages exist
  2. Understand what each package does
  3. Compare and contrast packages offering the same functionality and decide which is best for their needs
  4. Maintain compatibility between the packages and the datasets transferred between them.

What we propose:

We propose to create an R package that will function as the main data retrieval, cleaning, management and backup tool for researchers. The functionality of the package will be a culmination of various existing packages and enhancements to existing functions, rather than something built from scratch. This way we can build on existing resources, collaborative knowledge and skills, and also address the problem we identified efficiently. The package will also address the issue of researchers not having sound R programming skills.

Before we analyze solutions, it’s important to understand the stakeholders and scope of the project.

[Figure: Data Flow]

Proposed Package:

The package will cover major processes in the research pipeline.

  1. Getting Biodiversity data to the workspace
    The biodiversity data can be read from existing DwC archive files in various formats (DwCA, XML, CSV) or downloaded from online sources (GBIF, VertNet). So functions to read local files in XML, DwCA and CSV formats, to download data directly from GBIF, and to retrieve data from the respective APIs will be included. In case the user doesn’t know what data to retrieve, name suggestion functions will also be included. Converting common names to scientific names, scientific names to common names, and taxon keys to names will also be covered. Further, functions to convert results to simple data frames and to retrieve media associated with occurrences will be included.
  2. Flagging the data
    Biodiversity data is aggregated by various organizations, so it varies in precision and quality. Before using the data, it is necessary to check its quality and strip the records that lack the expected quality. Various packages built thus far, such as scrubr and rgeospatialquality, can check data for various discrepancies. Integrating the functionality of such packages to produce better quality control will benefit the community greatly. The data will be checked for the following discrepancies:

    1. Spatial: incorrect, impossible, incomplete and unlikely coordinates, and invalid countries and country codes
    2. Temporal: missing or incorrect dates in all time fields
    3. Taxonomic: epithet, scientific name and common name discrepancies, along with fixing scientific names
    4. Duplicates: duplicate records will be flagged.
  3. Cleaning data
    The process is done step by step to help the user configure and control the cleaning process.
    The data will first be flagged for various discrepancies, in any combination of spatial, temporal, taxonomic and duplicate flags that the user specifies. Then the user can view the data they would lose and decide whether to tweak the flags. When the data is high in volume, this stepwise cleaning helps the user for a number of reasons.

    1. When the user wants multiple flags, they can apply the quality checks one by one and decide after each check whether to remove the flagged data. If all flags were applied at once, the number of records to be removed would be high, and it would take much effort to go through all of them to decide whether each quality check is wanted.
    2. When the user views the flagged data, not all fields are shown, only the fields the quality check was done on and the flag result. This saves the user from having to deal with all the complex fields in the original data and from going back and forth through the data to check the flags.
      If they are satisfied with the records they will lose, the data can then be cleaned.

    Ex:

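    # Step 01: flag the data (proposed function names; library(magrittr) provides %>%)
    biodiversityData %>%
      coordinateIncompleteFlag() %>%
      allFlags("taxonomic")
    
    # Step 02: review the records that were flagged
    viewFlaggedData()
    
    # Step 03: remove the flagged records
    cleanAll()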
    
    
  4. Maintaining backups of data
    The original data the user retrieved and any subsequent results of the process can be backed up with versioning to maintain reproducibility. Functions for maintaining repositories, backing up to repositories, loading from repositories and archiving will be implemented.
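
The stepwise flag, view and clean process described in the list above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper names (`flag_impossible_coordinates()`, `flag_incomplete_coordinates()`, `view_flagged()`, `clean_flagged()`) are hypothetical stand-ins, not the functions of the proposed package, and the records are made up.

```r
# In this sketch TRUE means the record is flagged (it has a problem).
flag_incomplete_coordinates <- function(d) {
  is.na(d$decimalLatitude) | is.na(d$decimalLongitude)
}
flag_impossible_coordinates <- function(d) {
  # Latitude must lie in [-90, 90] and longitude in [-180, 180]
  bad <- abs(d$decimalLatitude) > 90 | abs(d$decimalLongitude) > 180
  bad %in% TRUE   # missing coordinates are left to the incompleteness check
}

# Show only the flagged rows and only the fields the check looked at
view_flagged <- function(d, flag, fields) {
  d[flag, fields, drop = FALSE]
}

# Drop flagged records once the user is satisfied with what will be lost
clean_flagged <- function(d, flag) {
  d[!flag, , drop = FALSE]
}

occ <- data.frame(
  occurrenceID     = c("a1", "a2", "a3", "a4"),
  scientificName   = c("Vulpes vulpes", "Canis aureus", "Hyaena hyaena", "Lepus capensis"),
  decimalLatitude  = c(32.1, 95.0, NA, 31.8),
  decimalLongitude = c(34.8, 35.0, 35.1, 35.2),
  stringsAsFactors = FALSE
)
coord_fields <- c("occurrenceID", "decimalLatitude", "decimalLongitude")

# First check: flag, inspect what would be lost, then clean
impossible <- flag_impossible_coordinates(occ)
view_flagged(occ, impossible, coord_fields)
occ <- clean_flagged(occ, impossible)

# Second check, applied to the already cleaned data
incomplete <- flag_incomplete_coordinates(occ)
view_flagged(occ, incomplete, coord_fields)
occ <- clean_flagged(occ, incomplete)
```

Applying one check at a time keeps each review step small, and showing only the coordinate fields keeps the user away from the many other columns of the original data.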

Underneath the package, we plan to implement these:

  1. Standardization of data
    The data retrieved will be standardized according to DwC formats. To maintain consistency and feasibility, I have decided to use GBIF fields as the standard. The reasons are:
    1. GBIF uses DwC as its standard, and the fields from the GBIF backbone comply with DwC terms.
    2. GBIF is a well-established organization, and using the fields from its backbone will assure consistency and acceptance by researchers.
    3. GBIF is a superset of all major biodiversity data available. Any data gathered from GBIF can be expected to be in other sources too.
  2. Unique fields grouping system for DwC fields, based on recommended grouping (see this and this)

Conclusion

We believe a centralized data retrieval, cleaning, management and backup tool will benefit the biodiversity research community and eliminate many shortcomings in the current research process. The package will be built over the course of the next few months, and any insights, guidance and contributions from the community will be greatly appreciated.

Feedback

Please give us your feedback and suggestions on GitHub issues.