Creating figures like the paper ‘Completeness of Digital Accessible Knowledge of Plants of Ghana’ Part 1

27 Oct

Recently I read the paper on Completeness of Digital Accessible Knowledge (DAK) by Alex Asase and A. Townsend Peterson. I really enjoyed the paper and liked the way the figures are presented. There is a lot of overlap between it and my work on the package bdvis (done, of course, under the guidance of Town Peterson), so I thought I would share some code snippets that recreate figures similar to the ones in the paper using bdvis.

Since I do not have a copy of the data used in the paper, I am using data downloaded from the GBIF website. I decided to use bird data for India.

To create Figure 1a, a graph showing the accumulation of records through time (years), we need to set the data in bdvis format and then use the function distrigraph.

library(bdvis)

# Download GBIF data from the data.gbif.org portal and
# extract the verbatim.txt file into the working folder
occ <- read.delim( 'verbatim.txt',
                          quote='', stringsAsFactors=FALSE)
# Construct Date field from day, month, year
occ$Date_collected <- as.Date( paste( occ$year,
                                      occ$month ,
                                      occ$day , sep = "." ),
                               format = "%Y.%m.%d" )
# Set configuration variables to format data
conf <- list(Latitude='decimalLatitude',
             Longitude='decimalLongitude',
             Date_collected='Date_collected',
             Scientific_name='specificEpithet')
occ <- format_bdvis(occ, config=conf)
# Keep only records with a valid date in a plausible range
occ_date <- occ[occ$Date_collected > as.Date("1500-01-01") &
           occ$Date_collected < as.Date("2017-01-01") &
           !is.na(occ$Date_collected) ,]
distrigraph(occ_date, ptype="efforts", type="h")

Now this created the following graph:

BirdDistriPlot1

The graph shows what we wanted to show, but we would like to modify it a bit so that it looks more like the figure in the paper. So let us exclude some more data and change the color and width of the lines in the graph.

occ_date1 <- occ[occ$Date_collected > as.Date("1900-01-01") &
               occ$Date_collected < as.Date("2015-01-01") &
               !is.na(occ$Date_collected) ,]
distrigraph(occ_date1, ptype="efforts", col="red",
            type="h", lwd=3)

Now this created the following graph:

BirdDistriPlot2
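
The date construction and filtering used in both snippets can be checked on a toy data frame before running them on a full GBIF download. The following is a minimal base R sketch; the five rows are made up for illustration, and the column names simply mirror the year, month and day fields used above.

```r
# Toy records (made up): two valid dates, an impossible day,
# a missing year and an impossible month
occ <- data.frame(year  = c(1995, 1995, 2010, NA, 2016),
                  month = c(5, 11, 2, 7, 13),
                  day   = c(12, 3, 30, 1, 1))

# Same construction as above; dates that cannot be parsed become NA
occ$Date_collected <- as.Date(paste(occ$year, occ$month, occ$day, sep = "."),
                              format = "%Y.%m.%d")

# Same filter as above: date range plus an explicit NA check
occ_date1 <- occ[occ$Date_collected > as.Date("1900-01-01") &
                 occ$Date_collected < as.Date("2015-01-01") &
                 !is.na(occ$Date_collected), ]

# Number of records per year, the kind of count the graph summarises
recs_per_year <- table(format(occ_date1$Date_collected, "%Y"))
print(recs_per_year)
```

Without the !is.na() term, rows whose date could not be parsed would come back from the subset as rows filled with NA, which is why the filter checks for it explicitly.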

Visualize completeness of biodiversity data

10 Jun

Package bdvis: Biodiversity data visualizations using R helps in understanding the completeness of a biodiversity inventory, the extent of its geographical, taxonomic and temporal coverage, and the gaps and biases in the data. Package bdvis version 0.2.6 is on CRAN now. This version has several features added since version 0.1.0. I plan to post a set of blog entries here to describe some of the key features of the package with some code snippets.

The function bdcomplete computes completeness values for each cell. After the extent of the dataset is divided into cells (via the getcellid function), this function calculates the Chao2 estimator of species richness. In simple terms, the function looks at the data records in each cell and the number of species they represent, and estimates how complete the data for that cell are.
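
To make the idea concrete, here is a minimal base R sketch of the classic Chao2 estimator applied to one cell. It is an illustration under assumptions, not the code inside bdcomplete: the helper name chao2, the choice of collecting days as sampling units, and the toy matrix are all made up for this example.

```r
# Incidence matrix for one cell: rows are sampling units (here, days),
# columns are species, 1 = species recorded in that unit
chao2 <- function(incidence) {
  freq  <- colSums(incidence > 0)  # number of units each species occurs in
  s_obs <- sum(freq > 0)           # observed species richness
  q1    <- sum(freq == 1)          # species seen in exactly one unit
  q2    <- sum(freq == 2)          # species seen in exactly two units
  # Classic Chao2, with the usual fallback when there are no duplicates
  s_est <- if (q2 > 0) s_obs + q1^2 / (2 * q2) else s_obs + q1 * (q1 - 1) / 2
  list(observed = s_obs, estimated = s_est, completeness = s_obs / s_est)
}

# Toy data (made up): four days, four species
toy <- matrix(c(1, 1, 0, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                1, 0, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("day", 1:4),
                              c("sp_a", "sp_b", "sp_c", "sp_d")))
res <- chao2(toy)
print(res)
```

In the toy cell four species are observed, two of them on a single day and one on exactly two days, so the estimate is 4 + 2^2 / (2 * 1) = 6 species and the completeness is 4 / 6.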

The following code snippet shows how to use data downloaded from the Global Biodiversity Information Facility (GBIF) Data Portal. The .zip file downloaded from the portal contains a file occurrence.txt, which holds the data records. Copy that file into the working folder and try the following script.

library(bdvis)

# Download GBIF data from the data.gbif.org portal and
# extract the occurrence.txt file into the working folder
occurrence <- read.delim( 'occurrence.txt',
                         quote='', stringsAsFactors=FALSE)
# Set configuration variables to format data
conf <- list(Latitude='decimalLatitude',
             Longitude='decimalLongitude',
             Date_collected='eventDate',
             Scientific_name='specificEpithet')
occurrence <- format_bdvis(occurrence, config=conf)
# Compute completeness and visualize using mapgrid
comp=bdcomplete(occurrence)
mapgrid(comp,ptype='complete')

The bdcomplete function produces a graph showing completeness vs number of species. More points in the higher range of the completeness index indicate better data.

Completeness vs Species

To see spatially whether any particular region needs better sampling, the function mapgrid can now be used with the ptype = “complete” parameter. This plots all the grid cells that have more data records than the recs parameter (default = 50), using a color range from light purple to dark blue. The darker the color, the better the data in that cell.

Completeness Visualization

Visualizing bdsns data using bdvis

12 Aug

One of the tasks in my Google Summer of Code 2015 project was to integrate the new package bdsns with the existing package bdvis to identify strengths and gaps in the data. This can be achieved in a few simple steps.

Begin by installing and loading both packages

library(devtools)
install_github("vijaybarve/bdsns")
install_github("vijaybarve/bdvis")
library(bdsns)
library(bdvis)

Get data for a few species of butterflies from Flickr using the bdsns package and store it in an SQLite database. Users need to get their own API key from the Flickr website. The following file contains the scientific names of a few butterfly species:

bflytest.txt
scname
Graphium agetes
Graphium antiphates 
Graphium aristeus
Colias nilagiriensis
Dercas verhuelli
Eurema andersoni 
Gonepteryx rhamni
Hebomoia glaucippe
Euripus nyctelius 
Hestinalis nama
Mimathyma ambica 
Ariadne merione
Byblia ilithyia
Abisara echerius
Abisara neophron 
Zemeros flegyas
Curetis thetis
Heliophorus epicles
Spalgis epeus
Hasora badra
Hasora chromus
Gangara lebadea
Gangara thyrsis

We are then all set to run the command that downloads the data and stores it in an SQLite database.

flickrtodatabase(myapikey,"bflytest.txt",
                  "scname","testdb")

Read in the sqlite database

dat=extract_flickrdb("testdb","t1.csv")

Set up the data for use in bdvis. The function format_bdvis sets the field names for scientific name, latitude, longitude and date in the bdvis format and also assigns grid cell ids. The function gettaxo fetches and stores the higher taxonomy of the species.

dat=format_bdvis(dat)
dat=gettaxo(dat)
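
To give a feel for what assigning grid cell ids means, here is a base R sketch of one possible numbering scheme for one-degree cells. The formula and the helper name cell_id_1deg are assumptions made for illustration and are not necessarily what bdvis uses internally; the coordinates are made up.

```r
# Number one-degree cells row by row, starting at 90 S, 180 W
# (hypothetical scheme for illustration only)
cell_id_1deg <- function(lat, lon) {
  (floor(lat) + 90) * 360 + (floor(lon) + 180)
}

# Toy coordinates (made up): the first two fall in the same cell
pts <- data.frame(Latitude  = c(18.52, 18.97, 12.97),
                  Longitude = c(73.85, 73.10, 77.59))
pts$cell <- cell_id_1deg(pts$Latitude, pts$Longitude)
print(pts)

# Records per cell, the kind of count a gridded map is built from
print(table(pts$cell))
```

Once every record carries a cell id, per-cell summaries such as record counts or species counts reduce to simple grouping operations.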

Now bdvis functions can be used for visualizations

mapgrid(dat)
tempolar(dat)
taxotree(dat)
chronohorogram(dat)
bdcalenderheat(dat)

Here is a sample of what this code will produce:

Butterfly MapGrid

MapGrid output of Butterfly Data

Temporal Butterfly

Temporal output of daily butterfly data

Taxotree output of butterfly data

Chronohorogram of butterfly data

Calendar heat map of butterfly data

Please note that the results may not match exactly, since new photographs are continuously being posted on Flickr.

Read more about bdsns here

Barve, V. (2014). Discovering and developing primary biodiversity data from social networking sites: A novel approach. Ecological Informatics, 24, 194–199. doi:10.1016/j.ecoinf.2014.08.008

package bdvis is on CRAN

8 May

We are happy to announce that package bdvis is on CRAN now. http://cran.r-project.org/web/packages/bdvis/index.html

bdvis: Biodiversity Data Visualizations

Biodiversity data visualizations using R would be helpful to understand completeness of biodiversity inventory, extent of geographical, taxonomic and temporal coverage, gaps and biases in data.

As part of Google Summer of Code 2014, we hope to make progress on the development of this package and the proposed additions are posted here.

If you have never used package bdvis the following code will give you a quick introduction of the capabilities of the package.

First, install the package

install.packages("bdvis")
library(bdvis)
# We use rinat package to get some data from
# iNaturalist project
# install.packages("rinat")
library(rinat)

Now let us get some data from the iNaturalist project ReptileIndia

inat=get_inat_obs_project("reptileindia")
239  Records
0-100-200-300

We need to convert the data in bdvis format.

  • Use the fixstr function to change the names of two fields.
  • Use the getcellid function to calculate the grid cell number for each record with coordinates.
  • Use the gettaxo function to fetch the higher taxonomy of each record. This function takes some time to run and might need some human interaction to resolve names, depending on the data.

# Function fixstr is now replaced with format_bdvis
# inat=fixstr(inat,DateCollected="Observed.on",SciName="Scientific.name")
inat=format_bdvis(inat,source='rinat')
inat=getcellid(inat)
inat=gettaxo(inat)

Our data is ready for trying out bdvis functions now. First a function to see what data we have.

bdsummary(inat)

The output should look something like this:

 Total no of records = 239 
 Date range of the records from  2004-07-31  to  2014-05-04 
 Bounding box of records  5.9241302618 , 72.933495  -  
30.475012 , 95.6058760174 
 Taxonomic summary... 
 No of Families :  16 
 No of Genus :  52 
 No of Species :  117 

Now let us generate a heat-map with geography superimposed. Since we know this project covers the Indian subcontinent, we list the countries we need to show on the map.

mapgrid(inat,ptype="records",
        bbox=c(60,100,5,40),
        region=c("India","Nepal","Bhutan",
                  "Pakistan","Bangladesh",
                   "Sri lanka", "Myanmar"),
        title="ReptileIndia records")

ReptileIndia mapgrid

For temporal visualization we can use the tempolar function, which plots the number of records on a polar plot. The data can be aggregated by day, week or month.

tempolar(inat, color="green", title="iNaturalist daily",
         plottype="r", timescale="d")
tempolar(inat, color="blue", title="iNaturalist weekly",
         plottype="p", timescale="w")
tempolar(inat, color="red", title="iNaturalist monthly",
         plottype="r", timescale="m")

ReptileIndia tempolar daily

ReptileIndia tempolar weekly

ReptileIndia tempolar monthly
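
The aggregation behind these three plots can be sketched in base R. This is an illustration of the idea only, not the tempolar implementation, and the dates are made up. Records are pooled across years, so the same calendar day in two different years lands in the same bin.

```r
# Toy observation dates (made up)
dates <- as.Date(c("2013-04-14", "2013-04-14", "2013-04-20",
                   "2013-07-01", "2014-04-14"))

# Count records per day of year, week of year and month
by_day   <- table(as.integer(format(dates, "%j")))
by_week  <- table(as.integer(format(dates, "%U")))
by_month <- table(as.integer(format(dates, "%m")))

print(by_day)
print(by_week)
print(by_month)
```

Coarser bins smooth out single busy days: the three records on day 104 stand out in the daily counts, while the monthly counts show April as a whole.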

Another interesting temporal visualization is the chronohorogram. This plots the number of records on each day, with colors indicating the value and a concentric circle for each year.

chronohorogram(inat)

ReptileIndia chronohorogram

And finally, for taxonomic visualization we can generate a tree-map of the records. Here the color of each box indicates the number of genera in the family, and the size of the box indicates the proportion of records in the data set that belong to that family.

taxotree(inat)

ReptileIndia taxotree

The large empty box at bottom center indicates there are several records which are not identified at family level.
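
The two quantities the tree-map encodes can be computed by hand. The following base R sketch uses a made-up toy table and assumes the taxonomy sits in columns named Family and Genus; it illustrates the summaries rather than the taxotree code itself.

```r
# Toy records (made up); the last one is not identified to family
recs <- data.frame(
  Family = c("Agamidae", "Agamidae", "Agamidae",
             "Gekkonidae", "Gekkonidae", NA),
  Genus  = c("Calotes", "Calotes", "Sitana",
             "Hemidactylus", "Hemidactylus", NA),
  stringsAsFactors = FALSE)

# Box size: share of records per family, keeping unidentified records
n_records    <- table(recs$Family, useNA = "ifany")
prop_records <- n_records / sum(n_records)

# Box color: number of distinct genera per family
n_genera <- tapply(recs$Genus, recs$Family, function(g) length(unique(g)))

print(prop_records)
print(n_genera)
```

The records with a missing family form their own group in the proportions, which corresponds to the empty box in the plot.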

Check the post GSoC Proposal 2014: package bdvis: Biodiversity Data Visualizations for what to expect in the near future. Comments and suggestions are always welcome.

GSoC Proposal 2014: package bdvis: Biodiversity Data Visualizations

17 Mar

Update: The proposal has been approved for participation in Google Summer of Code 2014. I will post updates on the progress on the blog once the coding phase starts.

I am applying for Google Summer of Code 2014 again with “Biodiversity Data Visualizations using R” proposal. We are proposing to take package bdvis to next level by adding more functions and making it available through CRAN. I am posting this idea to get feedback and suggestions from Biodiversity Informatics community.

[During next few days I will keep updating this to accommodate suggestions. The example visualizations here are crude examples of the ideas, and need lot of work to convert them into reusable functions.]

Background

Package bdvis is already under development and was a successful project in GSoC 2013. As of now the package has basic functionality for biodiversity data visualizations, but with a growing user base, requests for additional features are coming in. We propose to add the user-requested functionality and implement some new functions to take bdvis to the next level. The major tasks of the proposed project are the following.

  1. Fix currently reported bugs and complete documentation to submit package to CRAN.
  2. Implementation of additional features requested by users.
  3. Develop seamless data support.
  4. Additional functions for visualizations.
  5. Prepare detailed vignette.

User requested features

The features and functionality requested by users so far are the following:

  • A versatile function to subset the data based on taxonomy for a species, genus, family etc. or date like a particular year or range of years and so on.
  • Tempolar: ability to show average records per day/week/month rather than only the raw numbers shown currently.
  • Taxotree: additional parameters to control the diagram, such as title, legend and colors, plus the ability to choose a summary based on number of records, number of species or higher taxonomy.
  • bdsummary: number of grid cells covered by data records and percentage coverage of the bounding box.
  • Visualisation of the output of the completeness analysis function bdcomplete.
  • Improve gettaxo efficiency by adding ability to search by genus rather than current scientific name. This could be added as an option in case user needs to search by full scientific names for some reason.
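
As an illustration of the first request in the list above, a subsetting helper could look roughly like the following base R sketch. The function name subset_records, its arguments and the column names Genus and Date_collected are hypothetical and chosen only for this example; the toy records are made up.

```r
# Hypothetical helper: filter by genus and/or by a range of years
subset_records <- function(dat, genus = NULL, years = NULL) {
  keep <- rep(TRUE, nrow(dat))
  if (!is.null(genus)) {
    keep <- keep & !is.na(dat$Genus) & dat$Genus %in% genus
  }
  if (!is.null(years)) {
    yr   <- as.integer(format(dat$Date_collected, "%Y"))
    keep <- keep & !is.na(yr) & yr >= min(years) & yr <= max(years)
  }
  dat[keep, , drop = FALSE]
}

# Toy records (made up)
dat <- data.frame(
  Genus = c("Graphium", "Graphium", "Hasora", NA),
  Date_collected = as.Date(c("2012-05-01", "2014-06-10",
                             "2013-03-03", "2013-01-01")),
  stringsAsFactors = FALSE)

graphium_recent <- subset_records(dat, genus = "Graphium",
                                  years = c(2013, 2014))
only_2013 <- subset_records(dat, years = 2013)
print(graphium_recent)
print(only_2013)
```

Treating both filters as optional keeps a single function usable for taxonomic, temporal or combined subsets.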

Data formats support

Develop functions that let bdvis work seamlessly with the major biodiversity occurrence data formats available in the R environment. A preliminary list of packages that make such data available: rgbif, rvertnet, rinat, spocc. Get feedback from the user community on additional data sources they might be using and incorporate them into the worklist.

Additional visualizations

  • Distribution of collection efforts over time (line graph) [Fig 1 Soberon et al 2000]

Soberon_Fig_1

  • Distribution of number of records among taxon, cells (histogram) [Fig 3,4 Soberon et al 2000]

Soberon_Fig_3

  • Distribution of number of species among cells (histogram) [Fig 5 Soberon et al 2000]
  • Completeness vs number of species(scatterplot) [Fig 6 Soberon et al 2000]
  • Record densities for day of year and week of year [Otegui 2012]

RecordsPerDayofYear

  • Records per year dot plots [Otegui 2012]

RecPerYear

  • calenderHeat maps of number of records or species recorded

IndianMoths_calenderheat

Interactive Map of records

A function to plot records on an interactive map. The plan is to develop a function that generates a geoJSON-based map using an HTML/JavaScript file. The user can open the file in a web browser to explore the records. For performance reasons we might have to restrict the number of records for this function.

geoJSON example screenshot

Vignette preparation

Prepare test data sets for the vignette. Three data sets are planned: one with global geographical coverage and wide species coverage, a second with country-level geographical coverage and Class- or Order-level species coverage, and a final one with a narrow species selection, perhaps at genus level, to demonstrate functionality. Write up code and an explanation for each function in the package, and add result tables, graphs and maps to complete the vignette.

References

  • Otegui, J., & Ariño, A. H. (2012). BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics (Oxford, England), 28(16), 2207–8. doi:10.1093/bioinformatics/bts359
  • Soberón, J., Llorente, J., & Oñate, L. (2000). The use of specimen-label databases for conservation purposes: an example using Mexican Papilionid and Pierid butterflies. Biodiversity and Conservation, 9, 1441–1466. Retrieved from http://www.springerlink.com/index/H58022627013233W.pdf

Package rinat use case: map of iNaturalist project

11 Mar

iNaturalist projects are collections of records posted on iNaturalist. Now that we have an R package, rinat, from rOpenSci, I thought of playing around with the data. Here is a function I wrote to quickly map all the records of a project using the ggmap package.

library(ggmap)
library(rinat)

inatmap <- function(grpid){
  data1=get_inat_obs_project(grpid, type = "observations")
  data1=data1[which(!is.na(data1$Latitude)),]
  map <-get_map(location =c(min(data1$Longitude),
                            min(data1$Latitude),
                            max(data1$Longitude),
                            max(data1$Latitude)),
                messaging = FALSE)
  # Plot the observation points on top of the base map
  p <- ggmap(map) + geom_point(data=data1,
                               aes(x=as.numeric(Longitude),
                                   y=as.numeric(Latitude)))
  p
}

We use the get_inat_obs_project function from the rinat package to get all the observations from the specified project, the get_map function from the ggmap package to download the Google Maps base layer, and the ggplot2 machinery to actually plot the map with points.

Now a call to the function with a project name will produce a map with all the records in the project.

inatmap("birdindia")

inatmap_birdindia

We can use other ggplot options to add title, legend etc. to the map. This is just a simple example.
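
The bounding box that the function above passes to get_map is given in left, bottom, right, top order. A small base R sketch of that step on made-up coordinates is shown below; the helper name bbox_of is hypothetical, and unlike the function above it drops rows with a missing longitude as well as a missing latitude.

```r
# Hypothetical helper: bounding box in left, bottom, right, top order
bbox_of <- function(df) {
  df <- df[!is.na(df$Latitude) & !is.na(df$Longitude), ]
  c(left   = min(df$Longitude),
    bottom = min(df$Latitude),
    right  = max(df$Longitude),
    top    = max(df$Latitude))
}

# Toy observations (made up); the second has no latitude
obs <- data.frame(Latitude  = c(18.5, NA, 28.6),
                  Longitude = c(73.9, 77.6, 77.2))
bb <- bbox_of(obs)
print(bb)
```

Dropping incomplete coordinates first matters because min() and max() return NA as soon as one input is NA.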

Temporal visualization of records of IndianMoths project using bdvis

13 Aug

I was looking for a data set with some temporal bias, and thought of checking out the data from the iNaturalist project IndianMoths. This project is aimed at documenting moths from India. It was initiated in July 2012 but really picked up steam in January 2013, with members contributing regularly, a minimum of 100 records per month. Since the project has not yet completed one year, I thought it might have some bias from the missing months. Another reason for bias could be that moths are not seen in the same numbers throughout the year.

IndianMoths project on iNaturalist


To explore this data, I first downloaded the data in a .csv file and loaded into R.

The data summary looked like this:

Total no of records = 2958
Bounding box of records Inf , Inf - -Inf , -Inf
Taxonomic summary...
No of Families : 0
No of Genus : 0
No of Species : 0

This tells us that the package has read the data but has not fully understood its format, so we need some transformations before the data works with our package. Let us use the function fixstr to get the data into (somewhat) the required format.

imoth=fixstr(imoth,Latitude="latitude",
                 Longitude="longitude",
                 DateCollected="observed_on")
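
Conceptually, this call maps the column names of the download onto the field names the package expects. A base R sketch of that kind of renaming is given below. It is an assumption about the idea, not the fixstr implementation: the helper rename_fields is hypothetical, the target names are simply taken from the arguments of the call above, and the two rows are made up.

```r
# Hypothetical helper: 'mapping' names are the target field names,
# its values are the current column names
rename_fields <- function(dat, mapping) {
  current <- unlist(mapping)
  idx <- match(current, names(dat))
  if (anyNA(idx)) {
    stop("Column not found: ", paste(current[is.na(idx)], collapse = ", "))
  }
  names(dat)[idx] <- names(mapping)
  dat
}

# Toy download (made up) with iNaturalist-style column names
imoth_raw <- data.frame(latitude    = c(18.52, 12.97),
                        longitude   = c(73.85, 77.59),
                        observed_on = c("2013-04-14", "2013-07-01"),
                        stringsAsFactors = FALSE)

imoth_std <- rename_fields(imoth_raw,
                           list(Latitude      = "latitude",
                                Longitude     = "longitude",
                                DateCollected = "observed_on"))
# Dates arrive as text and need converting before any temporal plot
imoth_std$DateCollected <- as.Date(imoth_std$DateCollected)
str(imoth_std)
```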

Now let us check the summary again


 Total no of records = 2958
 Date range of the records from  0208-07-26  to  2013-08-07
 Bounding box of records  6.660428 , 72.8776559  -  32.5648529099 , 96.2124788761
 Taxonomic summary...
 No of Families :  0
 No of Genus :  0
 No of Species :  0

Now we have the dates and the latitude-longitude values in a form that our package can understand. A quick glance at this data summary shows that there is a problem with the dates: the data set has one record from the year 208 (which must be a typo for 2008). The bounding box values show that the records are all from in and around India.
We still need to get the taxonomy in place, but we will leave that for later and start working with this data. Let us create temporal plots of this data for daily, weekly and monthly timescales.

tempolar(imoth,title="Daily Records")
tempolar(imoth,title="Weekly Records",timescale="w")
tempolar(imoth,title="Monthly Records",timescale="m")

This produces the following three plots.

Indian Moths Daily Records

These are records per calendar day, and we see that 2-3 days in April have a very high number of records compared to other dates. This could be due to a targeted survey during that time. The plot also shows that we do not have many records from September to April.

Indian Moths Weekly Records

The weekly aggregation of the same records highlights the spike in April; otherwise the number of records seems to be fairly uniform.

Indian Moths Monthly Records

The monthly plot shows that April has more than 800 records, whereas no other month has more than 500.

This could be due to several reasons, but mainly because of the activity of this particular project.
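
The out-of-range date spotted in the data summary (year 208) can be flagged before plotting. The following is a minimal base R sketch; the cutoff date is an assumption chosen for illustration, and only the first and last dates come from the summary above, while the two in between are made up.

```r
# First and last dates are from the summary; the middle two are made up
observed_on <- as.Date(c("0208-07-26", "2012-07-15",
                         "2013-04-14", "2013-08-07"))

# Flag anything before an assumed cutoff or after the last known date
cutoff_early <- as.Date("1900-01-01")
cutoff_late  <- as.Date("2013-08-07")
suspect <- is.na(observed_on) |
           observed_on < cutoff_early |
           observed_on > cutoff_late

print(which(suspect))
clean_dates <- observed_on[!suspect]
```

Flagged records are better reviewed by hand than silently corrected, since a year such as 208 could stand for more than one intended value.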