Tag Archives: GSoC

Visualizing bdsns data using bdvis

12 Aug

One of the tasks in my Google Summer of Code 2015 project was to integrate the new package bdsns with the existing package bdvis to identify strengths and gaps in the data. This can be achieved in a few simple steps.

Begin by installing and loading both packages

library(devtools)
install_github("vijaybarve/bdsns")
install_github("vijaybarve/bdvis")
library(bdsns)
library(bdvis)

Get data for a few species of butterflies from Flickr using the bdsns package and store it in an SQLite database. You need to get your own API key from the Flickr website. Start with a file containing a few scientific names of butterfly species:

bflytest.txt
scname
Graphium agetes
Graphium antiphates 
Graphium aristeus
Colias nilagiriensis
Dercas verhuelli
Eurema andersoni 
Gonepteryx rhamni
Hebomoia glaucippe
Euripus nyctelius 
Hestinalis nama
Mimathyma ambica 
Ariadne merione
Byblia ilithyia
Abisara echerius
Abisara neophron 
Zemeros flegyas
Curetis thetis
Heliophorus epicles
Spalgis epeus
Hasora badra
Hasora chromus
Gangara lebadea
Gangara thyrsis
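As a minimal sketch, the two inputs described above (the names file and the API key) could be prepared from R like this. The temporary file location, the environment variable name FLICKR_API_KEY and the helper get_flickr_key are assumptions made for illustration; none of them come from bdsns.

```r
# 1. Species list: one name per line under a header that names the column.
#    trimws() removes stray trailing spaces from the names.
species <- trimws(c("Graphium agetes", "Graphium antiphates ", "Graphium aristeus"))
names_file <- file.path(tempdir(), "bflytest.txt")  # hypothetical location
writeLines(c("scname", species), names_file)

# Read the file back to confirm the column name and the cleaned values.
check <- read.csv(names_file, stringsAsFactors = FALSE)

# 2. API key: read it from an environment variable instead of hard-coding it.
#    FLICKR_API_KEY is a made-up variable name, not something bdsns defines.
get_flickr_key <- function(var = "FLICKR_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) stop("Set ", var, " to your Flickr API key")
  key
}
```

With the variable set, myapikey = get_flickr_key() would supply the key used in the download command.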

And then we are all set to run the command to download the data and store it in the SQLite database.

flickrtodatabase(myapikey,"bflytest.txt",
                  "scname","testdb")

Read in the sqlite database

dat=extract_flickrdb("testdb","t1.csv")

Set up the data for use in bdvis. The function format_bdvis sets the field names for scientific name, latitude, longitude and date in the bdvis format and also assigns grid cell ids. The function gettaxo fetches and stores the higher taxonomy of the species.

dat=format_bdvis(dat)
dat=gettaxo(dat)
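To give an idea of what assigning grid cell ids can mean, here is a small base R sketch that puts each record into a 1-degree cell. The numbering scheme and the helper name assign_cell_id are assumptions for illustration only; the scheme used inside format_bdvis may well differ.

```r
# Assign each record to a 1-degree grid cell (illustrative numbering only).
assign_cell_id <- function(lat, lon) {
  row <- floor(lat) + 90    # 0..179, counted from south to north
  col <- floor(lon) + 180   # 0..359, counted from west to east
  row * 360 + col
}

recs <- data.frame(
  Latitude  = c(18.52, 18.97, 28.61),
  Longitude = c(73.85, 73.10, 77.20)
)
recs$cell_id <- assign_cell_id(recs$Latitude, recs$Longitude)
```

The first two records fall in the same cell and so receive the same id, which is what lets a gridded map count records per cell.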

Now bdvis functions can be used for visualizations

mapgrid(dat)
tempolar(dat)
taxotree(dat)
chronohorogram(dat)
bdcalenderheat(dat)

Here is a sample of what this code will produce:

MapGrid output of Butterfly Data

Temporal output of daily butterfly data

Taxotree output of butterfly data

Chronohorogram of Butterfly data

Calendar Heat Map of Butterfly data

Please note the results may not exactly match, since new photographs are being posted continuously on Flickr.

Read more about bdsns here

Barve, V. (2014). Discovering and developing primary biodiversity data from social networking sites: A novel approach. Ecological Informatics, 24, 194–199. doi:10.1016/j.ecoinf.2014.08.008

GSoC Proposal 2014: package bdvis: Biodiversity Data Visualizations

17 Mar

Update: The proposal has been approved for participation in Google Summer of Code 2014. I will post updates on the progress on the blog once the coding phase starts.

I am applying for Google Summer of Code 2014 again with the “Biodiversity Data Visualizations using R” proposal. We are proposing to take the package bdvis to the next level by adding more functions and making it available through CRAN. I am posting this idea to get feedback and suggestions from the Biodiversity Informatics community.

[During the next few days I will keep updating this to accommodate suggestions. The example visualizations here are crude examples of the ideas, and need a lot of work to convert them into reusable functions.]

Background

Package bdvis is already under development and was a successful project in GSoC 2013. As of now the package has basic functionality to perform biodiversity data visualizations, but with a growing user base for the package, requests for additional features are coming up. We propose to add the user-requested functionality and implement some new functions to take bdvis to the next level. The major tasks of the proposed project are as follows.

  1. Fix currently reported bugs and complete documentation to submit package to CRAN.
  2. Implementation of additional features requested by users.
  3. Develop seamless data support.
  4. Additional functions for visualizations.
  5. Prepare detailed vignette.

User requested features

The features and functionality requested by users so far are the following:

  • A versatile function to subset the data based on taxonomy for a species, genus, family etc. or date like a particular year or range of years and so on.
  • Tempolar ability to show average records per day/week/month rather than just the raw numbers shown currently
  • Taxotree additional parameters to control the diagram like Title, Legend, Colors. Also to add ability to choose summary based on number of records, number of species or higher taxonomy
  • bdsummary number of grid cells covered by data records and % of coverage of the bounding box
  • Visualisation ability for the output of the completeness analysis function bdcomplete
  • Improve gettaxo efficiency by adding ability to search by genus rather than current scientific name. This could be added as an option in case user needs to search by full scientific names for some reason.
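The first request in the list, a versatile subset function, could look roughly like the sketch below. The function name subset_records and the column names Genus and Date_collected are hypothetical; this is not the bdvis implementation.

```r
# Hypothetical helper: keep records matching a genus and/or a range of years.
# Column names (Genus, Date_collected) are assumptions for this sketch.
subset_records <- function(dat, genus = NULL, years = NULL) {
  keep <- rep(TRUE, nrow(dat))
  if (!is.null(genus)) {
    keep <- keep & !is.na(dat$Genus) & dat$Genus %in% genus
  }
  if (!is.null(years)) {
    yr <- as.integer(format(as.Date(dat$Date_collected, format = "%Y-%m-%d"), "%Y"))
    keep <- keep & !is.na(yr) & yr >= min(years) & yr <= max(years)
  }
  dat[keep, , drop = FALSE]
}

dat <- data.frame(
  Genus = c("Graphium", "Graphium", "Hasora", "Abisara"),
  Date_collected = c("2012-05-01", "2014-06-10", "2013-07-15", "2013-08-20"),
  stringsAsFactors = FALSE
)
res <- subset_records(dat, genus = "Graphium", years = c(2013, 2014))
```

Passing a single year (years = 2013) selects that year only, while a vector is treated as a range from its minimum to its maximum.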

Data formats support

Develop functions for seamless support of the major Biodiversity occurrence data formats available in the R environment, so that they work with the bdvis package. A preliminary list of packages that make data available is rgbif, rvertnet, rinat and spocc. Get feedback from the user community on additional data sources they might be using and incorporate them into the worklist.

Additional visualizations

  • Distribution of collection efforts over time (line graph) [Fig 1 Soberon et al 2000]

Soberon_Fig_1

  • Distribution of number of records among taxon, cells (histogram) [Fig 3,4 Soberon et al 2000]

Soberon_Fig_3

  • Distribution of number of species among cells (histogram) [Fig 5 Soberon et al 2000]
  • Completeness vs number of species (scatterplot) [Fig 6 Soberon et al 2000]
  • Record densities for day of year and week of year [Otegui 2012]

RecordsPerDayofYear

  • Records per year dot plots [Otegui 2012]

RecPerYear

  • Calendar heat maps of number of records or species recorded

IndianMoths_calenderheat

Interactive Map of records

A function to plot records on an interactive map. The plan is to develop a function that will generate a GeoJSON-based map using an HTML/JavaScript file. The user can open the file in a web browser to explore the records. For performance reasons we might have to restrict the number of records for this function.

geoJSON example screenshot
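As a rough sketch of the data side of this idea, the function below turns a data frame of records into a GeoJSON FeatureCollection string in base R and caps the number of records. The function name to_geojson, the max_records argument and the column names are assumptions; names containing quotes or backslashes are not escaped in this simplified version.

```r
# Build a GeoJSON FeatureCollection of points from a data frame.
# GeoJSON positions are ordered longitude first, then latitude.
to_geojson <- function(dat, max_records = 1000) {
  dat <- head(dat, max_records)  # cap the number of records for performance
  features <- sprintf(
    '{"type":"Feature","geometry":{"type":"Point","coordinates":[%.6f,%.6f]},"properties":{"name":"%s"}}',
    dat$Longitude, dat$Latitude, as.character(dat$Scientific_name)
  )
  paste0('{"type":"FeatureCollection","features":[',
         paste(features, collapse = ","), "]}")
}

recs <- data.frame(
  Scientific_name = c("Danaus chrysippus", "Hasora badra"),
  Latitude  = c(18.52, 28.61),
  Longitude = c(73.85, 77.20),
  stringsAsFactors = FALSE
)
gj <- to_geojson(recs)
```

The resulting string can be saved with writeLines(gj, "records.geojson") and loaded by a web map page.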

Vignette preparation

Prepare test data sets for the vignette. There will be three data sets: the first with global geographical coverage and wide species coverage, the second with country-level geographical coverage and Class- or Order-level species coverage, and the third with a narrow species selection, perhaps at genus level, to demonstrate functionality. Write up code and an explanation of each function in the package, and add result tables, graphs and maps to complete the vignette.

References

  • Otegui, J., & Ariño, A. H. (2012). BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics (Oxford, England), 28(16), 2207–8. doi:10.1093/bioinformatics/bts359
  • Soberón, J., Llorente, J., & Oñate, L. (2000). The use of specimen-label databases for conservation purposes: an example using Mexican Papilionid and Pierid butterflies. Biodiversity and Conservation, 9, 1441–1466. Retrieved from http://www.springerlink.com/index/H58022627013233W.pdf

Temporal visualization of records of IndianMoths project using bdvis

13 Aug

I was looking for a data set with some bias in its temporal data, so I thought of checking out the data from the iNaturalist project IndianMoths. This project is aimed at documenting moths from India. It was initiated in July 2012 but really picked up steam in January 2013, with members contributing regularly, a minimum of 100 records per month. Because this project has not yet completed one year, I thought it might have some bias from the missed months. Another reason for bias could be the fact that moths are not seen in the same numbers throughout the year.

IndianMoths project on iNaturalist

To explore this data, I first downloaded the data in a .csv file and loaded into R.

The data summary looked like this:

Total no of records = 2958
Bounding box of records Inf , Inf - -Inf , -Inf
Taxonomic summary...
No of Families : 0
No of Genus : 0
No of Species : 0

This tells us that the data is read by the package, but it has not understood the format well, and we might have to do some transformations to get this going with our package. So let us use the function fixstr to get the data into (somewhat) the required format.

imoth=fixstr(imoth,Latitude="latitude",
                 Longitude="longitude",
                 DateCollected="observed_on")

Now let us check the summary again


 Total no of records = 2958
 Date range of the records from  0208-07-26  to  2013-08-07
 Bounding box of records  6.660428 , 72.8776559  -  32.5648529099 , 96.2124788761
 Taxonomic summary...
 No of Families :  0
 No of Genus :  0
 No of Species :  0

Now we have the date and latitude-longitude values in a form that our package can understand. A quick glance at this data summary shows us that there is some problem with dates: our data set has one record from year 208 (which must be a typo for year 2008). Looking at the bounding box values, the records are all from in and around India.
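A quick way to locate such records is to pull the year out of the date strings and flag anything outside a plausible window. This is a minimal sketch on made-up values; the cutoff of 1900 is an arbitrary assumption and should be adjusted to the data set at hand.

```r
# Flag date strings whose year is missing or outside a plausible window.
raw <- c("0208-07-26", "2012-07-15", "2013-08-07", "")
years <- suppressWarnings(as.integer(substr(raw, 1, 4)))
current_year <- as.integer(format(Sys.Date(), "%Y"))
suspect <- is.na(years) | years < 1900 | years > current_year

# Show the values that need a closer look.
print(raw[suspect])
```

Applied to the observed_on column, the same test would point at the single record dated in year 208.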
We still need to get the taxonomy in place, but we will leave that for later and start working with this data. Let us create temporal plots of this data for different timescales: daily, weekly and monthly.

tempolar(imoth,title="Daily Records")
tempolar(imoth,title="Weekly Records",timescale="w")
tempolar(imoth,title="Monthly Records",timescale="m")

This would produce the following three plots.

Indian Moths Daily Records

These are records per calendar day, and we see that 2-3 days in April have a very high number of records compared to other dates. This could be due to some targeted survey during that time. This also shows us that we do not have many records from September till April.

Indian Moths Weekly Records

The weekly aggregation of the same records highlights the fact that April does have a spike in numbers, and otherwise the number of records seems to be fairly uniform.

Indian Moths Monthly Records

The monthly plot shows that April has more than 800 records, whereas no other month has more than 500.

This could be due to several reasons, but mainly because of the activity of this particular project.
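The aggregation behind such plots can be checked by hand in base R. The sketch below counts made-up records per calendar month and per day of year, pooling all years together; it is meant as an illustration of the idea and is not taken from the tempolar source.

```r
# Count records per calendar month and per day of year, pooling all years.
dates <- as.Date(c("2013-04-02", "2013-04-02", "2013-04-15",
                   "2013-06-01", "2012-04-20"))

monthly <- tabulate(as.integer(format(dates, "%m")), nbins = 12)
names(monthly) <- month.abb

daily <- tabulate(as.integer(format(dates, "%j")), nbins = 366)

print(monthly)
```

Because years are pooled, the April total here combines records from 2012 and 2013, which is worth remembering when reading a spike in a single month.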

bdvis development version available for early feedback

31 Jul

Google Summer of Code 2013 is halfway through, and midterm evaluations are underway. I thought this was a good point for us to share what we have been doing in the Biodiversity Data Visualizations in R project and to open up the package for testing and some early feedback. We have named the package bdvis. The package is on GitHub, and I would appreciate it if you could install and test it. Feedback may be given in the comments here, through issues on GitHub, or by Twitter or email.

Getting data

The data was obtained from the data portal of the Global Biodiversity Information Facility (http://data.gbif.org). The data set we are looking for is the iNaturalist research grade records. We accessed the datasets page at http://data.gbif.org/datasets/ and selected the iNaturalist.org page from the alphabetic list, which is at http://data.gbif.org/datasets/provider/407. Once on this page, use the link Explore: Occurrences and then on the next page click Download: Spreadsheet of results. On this page make sure Comma separated values is selected and then press the Download Now button. The website may take a few minutes to make your download ready. Once it is ready, the download link will be provided. Typically the file will be named something like occurrence-search-12345.zip, where the number can have as many as 40 digits. Use the link to download the .zip file and then extract the data file occurrence-search-12345.csv into the working directory of R. Since this file has a long name, let us rename it to inat.csv for convenience.

Now we are ready to load our data.

inat = read.csv("inat.csv")
dim(inat)

If it shows something like

[1] 66581    47

we are on the right track. Our data is loaded into R. For the time being, this package handles only the GBIF-provided data format, but built-in functions to bring user-generated biodiversity data into this format are being worked out.

Package installation

Now let us install the bdvis package. First we need to get the devtools package, which will let us install packages from GitHub (rather than CRAN).

install.packages("devtools")
require(devtools)

install_github("vijaybarve/bdvis")
require(bdvis)

If this produces something like

Loading required package: bdvis

Attaching package: ‘bdvis’

The following object(s) are masked from ‘package:base’:

summary

we are on the right track. Our package is installed and loaded into R.

Package functions

1. summary

Let us start playing with the functions now. We have the data loaded in inat data frame.

bdvis::summary(inat)

Should produce something like:

Total no of records = 66581
Date range of the records from  1710-02-26  to  2012-12-31
Bounding box of records  -77.89309 , -177.37895  -  78.53431 , 179.2615
Taxonomic summary...
No of Families :  1394
No of Genus :  5089
No of Species :  11299

What does this tell us about our data?

  • We have 66581 records in the data set
  • The date range is from 1710 to 2012. (Do we really have a record from 1710? Looks like we have a problem there.)
  • The bounding box is almost the whole world. Yes, this is global data set.
  • We have so many Families, Genus and Species represented in this data set.
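For readers who want to check these numbers independently, the same kind of summary can be approximated in base R. The function summarise_records and the column names Scientific_name, Latitude, Longitude and Date_collected are assumptions made for this sketch; the real GBIF export and the bdvis summary function may use different names and logic.

```r
# Summarise record count, date range, bounding box and species count.
# Column names are assumptions for this sketch.
summarise_records <- function(dat) {
  dates <- as.Date(dat$Date_collected, format = "%Y-%m-%d")
  list(
    n_records  = nrow(dat),
    date_range = range(dates, na.rm = TRUE),
    bbox       = c(min_lat = min(dat$Latitude, na.rm = TRUE),
                   min_lon = min(dat$Longitude, na.rm = TRUE),
                   max_lat = max(dat$Latitude, na.rm = TRUE),
                   max_lon = max(dat$Longitude, na.rm = TRUE)),
    n_species  = length(unique(na.omit(dat$Scientific_name)))
  )
}

inat_like <- data.frame(
  Scientific_name = c("Danaus chrysippus", "Danaus chrysippus", "Cyanocitta cristata"),
  Latitude  = c(18.5, 12.9, 39.1),
  Longitude = c(73.8, 77.6, -94.6),
  Date_collected = c("1710-02-26", "2012-06-01", "2012-12-31"),
  stringsAsFactors = FALSE
)
s <- summarise_records(inat_like)
```

A suspiciously early first element of s$date_range is the same warning sign as the 1710 date discussed above.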

I have two questions here:

  1. What more would you like to get in the summary?
  2. Should I rename the function summary to something else, so it does not clash with the usual data frame summary function name?

2. mapgrid

Now let us generate a heat map of the records in this data set. This map will show us the density of records in different parts of the world. To generate this map:

mapgrid(inat,ptype="species")

mapgrid output for iNaturalist data

ptype could be records if we need the map with raw records rather than aggregated to species. Again the questions:

  • What more options would you like to see here?
  • Ability to zoom in certain region?
  • Control over color palette?
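The difference between the two ptype choices can be illustrated with a small base R sketch that counts, for each 1-degree cell, both the raw records and the distinct species. The cell labels and column names are assumptions for illustration and do not reflect how mapgrid is implemented.

```r
# Count raw records and distinct species per 1-degree cell.
recs <- data.frame(
  Scientific_name = c("Danaus chrysippus", "Danaus chrysippus",
                      "Hasora badra", "Hasora badra"),
  Latitude  = c(18.2, 18.7, 18.4, 28.6),
  Longitude = c(73.1, 73.9, 73.5, 77.2),
  stringsAsFactors = FALSE
)

# Label each cell by the integer degree of its south-west corner.
cell <- paste(floor(recs$Latitude), floor(recs$Longitude), sep = "_")

n_records <- tapply(recs$Scientific_name, cell, length)
n_species <- tapply(recs$Scientific_name, cell, function(x) length(unique(x)))
```

A cell with many photographs of one common species scores high on records but low on species, so the two maps can look quite different.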

3. tempolar

Now coming to temporal visualizations, the function tempolar makes polar plots of temporal data at daily, weekly and monthly timescales. The code and samples are as follows:

tempolar(inat,color="green",title="iNaturalist daily"
          ,plottype="r",timescale="d")
tempolar(inat,color="blue",title="iNaturalist weekly"
          ,plottype="p",timescale="w")
tempolar(inat,color="red",title="iNaturalist monthly"
          ,plottype="r",timescale="m")

Daily plot of Temporal data. Each line is records on each day of the year.

Weekly plot of Temporal data. Plottype polygon is used here.

Monthly plot of Temporal data. Each line is representing records in that month.

Here options to control color, title, plottype and of course timescale are provided.

We are less than half way through our original proposal, and will continue to actively build this package. As I build more functionality, I will post more information on the blog. Till that time keep the feedback flowing telling us what more you would like to see in this package.

GSoC Proposal 2013: Biodiversity Visualizations using R

29 Apr

I am applying for Google Summer of Code 2013 with this “Biodiversity Visualizations using R” proposal. I am posting this idea to get feedback and suggestions from the Biodiversity Informatics community.

[During the next few days I will keep updating this to accommodate suggestions. The example visualizations here are crude examples of the ideas, and need a lot of work to convert them into reusable functions.]

Background

R is increasingly being used in Biodiversity information analysis. There are several R packages like rgbif and rvertnet in rOpenSci suite to query, download and to some extent analyse the data within R workflow. We also have packages like dismo and SDMTools for modelling the data. It will be useful to have a package to quickly visualize biodiversity data. These visualizations would be helpful to understand extent of geographical, taxonomic and temporal coverage, gaps and biases in data.

The proposal is to work on an R package that provides functionality to quickly generate visualizations of the data set the user has gathered or generated.

The functions provided would be for the following tasks:

  • Data preparation – The data needs to be converted into suitable format for visualizations and analysis i.e. date format, taxonomic classification and geographical co-ordinates should be in uniform and usable formats.
  • Data summary – function(s) to quickly summarize the data set, telling the user the number of records, number of records with Lat Long values, bounding box of Lat Long values, date range and so on.
  • Geographic coverage – functions to visualize the data points on maps, density maps at different scales like Country level, Degree grid and so on.

Density of the records worldwide. Darker color indicates higher density of records.

Temporal coverage of the records. Each line represents number of records on that particular day.

  • Taxonomic coverage – functions to visualize the taxonomic coverage of data in Tree Map formats by Number of records per species and number of species covered.

Family wise records present in the data set. (White block indicates records with unassigned family)

  • Completeness analysis – functions to assess and visualize completeness of the biodiversity inventory of the region, or in other words a measure of how exhaustive the sampling is in the study area [Ref: http://dx.doi.org/10.1111/j.0906-7590.2007.04627.x]
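To make the completeness idea in the last item concrete, the sketch below compares the observed number of species with an estimate of the expected number, using the bias-corrected Chao1 estimator. Chao1 is chosen here only for illustration; it is an assumption and not necessarily the estimator used in the cited paper or in the proposed package. The function name chao1_completeness is hypothetical.

```r
# counts: number of records per species in one cell or region.
chao1_completeness <- function(counts) {
  counts <- counts[counts > 0]
  s_obs <- length(counts)   # species actually observed
  f1 <- sum(counts == 1)    # species seen exactly once (singletons)
  f2 <- sum(counts == 2)    # species seen exactly twice (doubletons)
  # Bias-corrected Chao1 estimate of total species richness.
  s_est <- s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
  c(observed = s_obs, estimated = s_est, completeness = s_obs / s_est)
}

res <- chao1_completeness(c(1, 1, 2, 5, 10))
```

A completeness close to 1 suggests that few species remain unrecorded, while many singletons pull the value down.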

Mentor(s): Javier Otegui

Data set: The data set used for the sample visualizations here is the records published by iNaturalist.org on the GBIF data portal. This data set contains Research Grade records (~46K) for all the organisms posted. The details of the data set are available here. The description on the GBIF data portal says “iNaturalist.org is a website where anyone can record their observations from nature. Members record observations for numerous reasons, including participation in citizen science projects, class projects, and personal fulfillment.”

References:

  • Chamberlain, S., & Barve, V. (2012). rvertnet: Search VertNet database from R. Retrieved from http://cran.r-project.org/package=rvertnet
  • Chamberlain, S., Boettiger, C., Ram, K., & Barve, V. (2013). rgbif: Interface to the Global Biodiversity Information Facility API methods. Retrieved from http://cran.r-project.org/package=rgbif
  • Hijmans, R. J., Phillips, S., Leathwick, J., & Elith, J. (2012). dismo: Species distribution modeling. Retrieved from http://cran.r-project.org/package=dismo
  • Otegui, J., & Ariño, A. H. (2012). BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics (Oxford, England), 28(16), 2207–8. doi:10.1093/bioinformatics/bts359
  • Soberón, J., Jiménez, R., Golubov, J., & Koleff, P. (2007). Assessing completeness of biodiversity databases at different spatial scales. Ecography, 30(1), 152–160. doi:10.1111/j.0906-7590.2007.04627.x
  • VanDerWal, J., Falconi, L., Januchowski, S., Shoo, L., & Storlie, C. (2012). SDMTools: Species Distribution Modelling Tools: Tools for processing data associated with species distribution modelling exercises. Retrieved from http://cran.r-project.org/package=SDMTools

Blue Jay and Scrub Jay : Using rvertnet to check the distributions in R

30 Jul

As part of my Google Summer of Code, I am also working on another package for R called rvertnet. This package is a wrapper in R for the VertNet websites. VertNet is a distributed database network for vertebrates consisting of FishNet2, MaNIS, HerpNET, and ORNIS. Of these, FishNet, HerpNET and ORNIS currently have their v2 portals serving data. rvertnet now has functions to access this data and import it into R data frames.

Some of my lab mates had difficulty downloading data for Scrub Jay (Aphelocoma spp.) due to the large number of records (220k+) on ORNIS, so I decided to try the rvertnet package, which was still in development at that time. The package really helped and got the results quickly. While exploring that data I came up with this case study.

To get data for Blue Jay (Cyanocitta cristata), which is distributed in the eastern USA, we use the vertoccurrence function, specifying the taxon with t="Cyanocitta cristata" and the group with grp="bird", since currently we have to access the data from three different VertNet websites for fishes, birds and herps (reptiles and amphibians). This fetches all the records for Blue Jay. Next we discard the records without latitude and longitude values by using the subset function on the data with Latitude != 0 & Longitude != 0. This gives us all geocoded records for Blue Jay, which we then map using the maps and ggplot2 packages as we did in an earlier post.

library(rvertnet)
bluej1=vertoccurrence(t="Cyanocitta cristata",grp="bird")
bluej2=subset(bluej1,Latitude !=0 & Longitude != 0)

library(maps)
library(ggplot2)
world  = map_data("world")
ggplot(world, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white",
               color = "gray40", size = .2) +
  geom_jitter(data = bluej2,
              aes(Longitude, Latitude), alpha=0.6, size = 4,
              color = "red") +
  ggtitle("Cyanocitta cristata (Blue Jay)")

The final output of the code snippet is as follows.
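A note on the subset step used above: subset treats a missing comparison result as FALSE, so the condition Latitude != 0 & Longitude != 0 also drops records whose coordinates are NA, not only those coded as zero. The small example below, on made-up coordinates, shows this and contrasts it with plain bracket indexing.

```r
# Made-up records: one good, one coded as 0/0, two with a missing coordinate.
recs <- data.frame(
  Latitude  = c(39.1, 0, NA, 41.5),
  Longitude = c(-94.6, 0, -90.0, NA)
)

# subset() drops rows where the condition is FALSE or NA.
geocoded <- subset(recs, Latitude != 0 & Longitude != 0)

# Bracket indexing with the same condition keeps NA rows as all-NA rows.
bracket <- recs[recs$Latitude != 0 & recs$Longitude != 0, ]
```

Here geocoded keeps only the first record, which is the behaviour the mapping code relies on.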

Now let us put Blue Jay and Scrub Jay side by side on the map to see how they are distributed in North America. This is the same as the earlier code except that we get data for both jays, and while plotting we use an additional geom_jitter for the other jay with a different color to distinguish the two. Also note the reduction in the point size from 4 to 1 in order to make the points clearly visible.

library(rvertnet)
bluej1=vertoccurrence(t="Cyanocitta cristata",grp="bird")
bluej2=subset(bluej1,Latitude !=0 & Longitude != 0)
scrubj1=vertoccurrence(t="Aphelocoma",grp="bird")
scrubj2=subset(scrubj1,Latitude !=0 & Longitude != 0)

library(maps)
library(ggplot2)
world = map_data("world")
ggplot(world, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white", color = "gray40",
               size = .2) +
  geom_jitter(data = bluej2,
              aes(Longitude, Latitude), alpha=0.6, size = 1,
              color = "blue") +
  ggtitle("Blue Jay and Scrub Jay") +
  geom_jitter(data = scrubj2,
              aes(Longitude, Latitude), alpha=0.6, size = 1,
              color = "brown")

The final output of the code snippet is as follows.

Map biodiversity records with rgbif and ggmap packages in R

23 Jul

When I attended useR! 2012 last month, there was an interesting presentation by Dr. David Kahle about the package ggmap. It is a package built on ggplot2 that helps us map spatial data over online maps like Google Maps or OpenStreetMap. I decided to give the ggmap package a try with biodiversity data.

So first let us create a map for the Plain Tiger or African Monarch butterfly (Danaus chrysippus). We use occurrencelist from the rgbif package again, as in the previous post.

We use the qmap function from the ggmap package to quickly pull up the base map from Google Maps. In essence, qmap collapses the two-step process of getting map data with map_data and then setting up the map display with ggplot into one step. We use geom_jitter to plot the occurrence points in the specified size (size = 4) and color (color = "red").

library(rgbif)
Dan_chr=occurrencelist(sciname = 'Danaus chrysippus',
                       coordinatestatus = TRUE,
                       maxresults = 1000,
                       latlongdf = TRUE, removeZeros = TRUE)
library(ggmap)
library(ggplot2)
wmap1 = qmap('India',zoom=2)
wmap1 +
      geom_jitter(data = Dan_chr,
                  aes(decimalLongitude, decimalLatitude),
                  alpha=0.6, size = 4, color = "red") +
      ggtitle("Danaus chrysippus")

Here is the output map of the code snippet:

Though we used geom_jitter in the earlier code, the high density of points in some regions is not clearly seen. To get a better idea of the number of points, we can try two-dimensional density maps using the stat_density2d function. It adds density lines on the map, showing higher density with darker circles.

library(rgbif)
Dan_chr=occurrencelist(sciname = 'Danaus chrysippus',
                       coordinatestatus = TRUE,
                       maxresults = 1000,
                       latlongdf = TRUE, removeZeros = TRUE)
library(ggmap)
library(ggplot2)
wmap1 = qmap('India',zoom=2)
wmap1 +
  stat_density2d(aes(x = decimalLongitude, y = decimalLatitude,
                     fill = ..level.., alpha = ..level..),
                 size = 4, bins = 6,
                 data = Dan_chr, geom = 'line') +
      geom_jitter(data = Dan_chr,
                  aes(decimalLongitude, decimalLatitude),
                  alpha=0.6, size = 4, color = "red") +
      ggtitle("Danaus chrysippus :: Density Plot")
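For a look at the numbers behind such density lines without needing a base map, a two-dimensional kernel density estimate can be computed directly with kde2d from the MASS package, which as far as I know is also what ggplot2 uses for stat_density2d. The coordinates below are simulated stand-ins for occurrence points, not GBIF data.

```r
library(MASS)  # recommended package, normally installed alongside R

# Simulated coordinates forming two clusters of points.
set.seed(1)
lon <- c(rnorm(50, mean = 77, sd = 1), rnorm(50, mean = 80, sd = 1))
lat <- c(rnorm(50, mean = 20, sd = 1), rnorm(50, mean = 13, sd = 1))

# Estimate the density on a 50 x 50 grid covering the points.
dens <- kde2d(lon, lat, n = 50)
dim(dens$z)
```

Calling contour(dens, nlevels = 6) on the result draws contour lines comparable to the density lines on the map, with the two clusters showing up as separate peaks.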