Tag Archives: rvertnet

GSoC Proposal 2014: package bdvis: Biodiversity Data Visualizations

17 Mar

Update: The proposal has been approved for participation in Google Summer of Code 2014. I will post updates on the progress on the blog once the coding phase starts.

I am applying for Google Summer of Code 2014 again, this time with the proposal “Biodiversity Data Visualizations using R”. We propose to take the bdvis package to the next level by adding more functions and making it available through CRAN. I am posting this idea to get feedback and suggestions from the biodiversity informatics community.

[Over the next few days I will keep updating this post to accommodate suggestions. The example visualizations here are crude illustrations of the ideas and need a lot of work to turn them into reusable functions.]

Background

The bdvis package is already under development and was a successful project in GSoC 2013. The package currently has basic functionality for biodiversity data visualization, but as its user base grows, requests for additional features are coming in. We propose to add the user-requested functionality and implement some new functions to take bdvis to the next level. The major tasks of the proposed project are the following.

  1. Fix currently reported bugs and complete the documentation in order to submit the package to CRAN.
  2. Implement additional features requested by users.
  3. Develop seamless data support.
  4. Add functions for further visualizations.
  5. Prepare a detailed vignette.

User requested features

The features and functionality requested by users so far are the following:

  • A versatile function to subset the data by taxonomy (a species, genus, family, etc.) or by date (a particular year, a range of years, and so on).
  • tempolar: ability to show average records per day/week/month rather than only the raw counts shown currently.
  • taxotree: additional parameters to control the diagram, such as title, legend and colors, plus the ability to choose a summary based on number of records, number of species or higher taxonomy.
  • bdsummary: number of grid cells covered by data records and percentage coverage of the bounding box.
  • Ability to visualize the output of the completeness analysis function bdcomplete.
  • gettaxo: improve efficiency by adding the ability to search by genus rather than by full scientific name as at present. Searching by full scientific name could remain available as an option in case the user needs it for some reason.
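
As a rough illustration of the first request, a subsetting helper might look something like the sketch below. The function name subset_records and the column names Family, Genus and Date_collected are assumptions made for this example; they are not part of the current bdvis API.

```r
# Hypothetical helper: filter occurrence records by taxon and/or year range.
# Column names (Family, Genus, Date_collected) are assumed for this sketch.
subset_records <- function(dat, family = NULL, genus = NULL, years = NULL) {
  keep <- rep(TRUE, nrow(dat))
  if (!is.null(family)) keep <- keep & dat$Family %in% family
  if (!is.null(genus))  keep <- keep & dat$Genus %in% genus
  if (!is.null(years)) {
    # Extract the year from the collection date and keep the requested range
    yr <- as.integer(format(as.Date(dat$Date_collected), "%Y"))
    keep <- keep & !is.na(yr) & yr >= min(years) & yr <= max(years)
  }
  dat[keep, , drop = FALSE]
}

# Toy records, invented for the example
records <- data.frame(
  Family = c("Papilionidae", "Pieridae", "Papilionidae", "Pieridae"),
  Genus = c("Papilio", "Pieris", "Graphium", "Colias"),
  Date_collected = c("2009-05-01", "2011-07-15", "2012-03-20", "2013-08-02"),
  stringsAsFactors = FALSE
)

# Papilionidae collected between 2010 and 2013
papilionids_recent <- subset_records(records, family = "Papilionidae",
                                     years = c(2010, 2013))
print(papilionids_recent)
```

Passing a single year, for example years = 2012, works the same way because the range is taken from the minimum and maximum of the vector.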

Data formats support

Develop functions that let the major biodiversity occurrence data formats available in the R environment work seamlessly with the bdvis package. A preliminary list of packages that make such data available: rgbif, rvertnet, rinat and spocc. Get feedback from the user community on additional data sources they might be using and add them to the work list.
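
One simple way to approach this is to rename whatever columns a data source returns to a single set of standard names. The sketch below assumes a lookup table of aliases; the alias lists and the function name standardize_columns are illustrative placeholders, not the column names those packages are guaranteed to return.

```r
# Hypothetical lookup: standard column name -> alternative names to recognize.
# The aliases are illustrative; check what each data source actually returns.
column_aliases <- list(
  Latitude        = c("latitude", "decimalLatitude", "lat"),
  Longitude       = c("longitude", "decimalLongitude", "lon", "lng"),
  Date_collected  = c("eventDate", "observed_on", "date"),
  Scientific_name = c("scientificName", "name", "taxon")
)

standardize_columns <- function(dat, aliases = column_aliases) {
  for (standard in names(aliases)) {
    hit <- which(names(dat) %in% aliases[[standard]])
    # Rename the first matching column unless the standard name already exists
    if (length(hit) > 0 && !(standard %in% names(dat))) {
      names(dat)[hit[1]] <- standard
    }
  }
  dat
}

# A one-row toy record using Darwin Core style column names
raw <- data.frame(scientificName = "Cyanocitta cristata",
                  decimalLatitude = 38.9, decimalLongitude = -95.2,
                  eventDate = "2012-06-01", stringsAsFactors = FALSE)
clean <- standardize_columns(raw)
print(names(clean))
```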

Additional visualizations

  • Distribution of collection efforts over time (line graph) [Fig 1 Soberon et al 2000]

Soberon_Fig_1

  • Distribution of number of records among taxa and cells (histogram) [Fig 3,4 Soberon et al 2000]

Soberon_Fig_3

  • Distribution of number of species among cells (histogram) [Fig 5 Soberon et al 2000]
  • Completeness vs number of species (scatterplot) [Fig 6 Soberon et al 2000]
  • Record densities for day of year and week of year [Otegui 2012]

RecordsPerDayofYear

  • Records per year dot plots [Otegui 2012]

RecPerYear

  • Calendar heat maps of the number of records or species recorded

IndianMoths_calenderheat
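
Most of the plots listed above start from the same step: counting records per time unit. A minimal base R sketch of that step, on invented dates, could look like this (the plot only opens in an interactive session):

```r
# Toy collection dates; real data would come from the occurrence records.
dates <- as.Date(c("2008-03-14", "2008-07-02", "2009-07-02",
                   "2010-01-20", "2010-07-03", "2010-11-11"))

# Records per year: the basis for a line graph or dot plot of collection effort.
per_year <- table(format(dates, "%Y"))

# Records per day of year (1-366), ignoring the year itself.
day_of_year <- as.integer(format(dates, "%j"))
per_day <- table(day_of_year)

print(per_year)
print(per_day)

# Simple line graph of collection effort over time
if (interactive()) {
  plot(as.integer(names(per_year)), as.vector(per_year), type = "b",
       xlab = "Year", ylab = "Number of records")
}
```

Swapping "%j" for "%U" or "%m" would give counts per week or per month in the same way.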

Interactive Map of records

A function to plot records on an interactive map. The plan is to develop a function that generates a GeoJSON-based map using an HTML/JavaScript file, which the user can open in a web browser to explore the records. For performance reasons we might have to restrict the number of records this function handles.

geoJSON example screenshot
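
The data side of the interactive map idea can be sketched in base R as below: the records are turned into a GeoJSON FeatureCollection of points, capped at a maximum number of records. The function name records_to_geojson, the column names and the default cap of 1000 are assumptions for this example, and the HTML/JavaScript page that would display the file is not shown.

```r
# Build a GeoJSON FeatureCollection of points from a data frame.
# Assumes numeric Latitude/Longitude columns and a Scientific_name column;
# names containing quotes or backslashes would need JSON escaping first.
records_to_geojson <- function(dat, max_records = 1000) {
  dat <- head(dat, max_records)  # cap the size for browser performance
  # GeoJSON stores coordinates as [longitude, latitude]
  features <- sprintf(
    '{"type":"Feature","geometry":{"type":"Point","coordinates":[%s,%s]},"properties":{"name":"%s"}}',
    dat$Longitude, dat$Latitude, dat$Scientific_name
  )
  paste0('{"type":"FeatureCollection","features":[',
         paste(features, collapse = ","), ']}')
}

# Toy records with invented coordinates
recs <- data.frame(
  Scientific_name = c("Cyanocitta cristata", "Aphelocoma californica"),
  Latitude = c(38.9, 37.8), Longitude = c(-95.2, -122.3),
  stringsAsFactors = FALSE
)
geojson <- records_to_geojson(recs)

# Write the file somewhere harmless and show the result
writeLines(geojson, file.path(tempdir(), "records.geojson"))
cat(geojson, "\n")
```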

Vignette preparation

Prepare test data sets for the vignette. Three data sets are planned: one with global geographical coverage and wide species coverage; a second with country-level geographical coverage and class- or order-level species coverage; and a third with a narrow species selection, perhaps at genus level, to demonstrate functionality. Write up code and an explanation of each function in the package, and add result tables, graphs and maps to complete the vignette.

References

  • Otegui, J., & Ariño, A. H. (2012). BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics (Oxford, England), 28(16), 2207–8. doi:10.1093/bioinformatics/bts359
  • Soberón, J., Llorente, J., & Oñate, L. (2000). The use of specimen-label databases for conservation purposes: an example using Mexican Papilionid and Pierid butterflies. Biodiversity and Conservation, 9, 1441–1466. Retrieved from http://www.springerlink.com/index/H58022627013233W.pdf

GSoC Proposal 2013: Biodiversity Visualizations using R

29 Apr

I am applying for Google Summer of Code 2013 with this proposal, “Biodiversity Visualizations using R”. I am posting the idea to get feedback and suggestions from the biodiversity informatics community.

[Over the next few days I will keep updating this post to accommodate suggestions. The example visualizations here are crude illustrations of the ideas and need a lot of work to turn them into reusable functions.]

Background

R is increasingly being used in biodiversity information analysis. Several R packages, such as rgbif and rvertnet in the rOpenSci suite, let users query, download and to some extent analyse the data within an R workflow. We also have packages like dismo and SDMTools for modelling the data. It would be useful to have a package for quickly visualizing biodiversity data. Such visualizations would help in understanding the extent of geographical, taxonomic and temporal coverage, and the gaps and biases in the data.

The proposal is to work on an R package that quickly generates visualizations of the data set the user has gathered or generated.

The functions provided would cover the following tasks:

  • Data preparation – the data need to be converted into a suitable format for visualization and analysis, i.e. the date format, taxonomic classification and geographical coordinates should be in uniform and usable formats.
  • Data summary – function(s) to quickly summarize the data set, telling the user the number of records, the number of records with latitude and longitude values, the bounding box of those values, the date range and so on.
  • Geographic coverage – functions to visualize the data points on maps, and density maps at different scales such as country level, degree grid and so on.

Density of the records worldwide. Darker color indicates higher density of records.

Temporal coverage of the records. Each line represents number of records on that particular day.

  • Taxonomic coverage – functions to visualize the taxonomic coverage of the data as tree maps, by number of records per species and by number of species covered.

Family wise records present in the data set. (White block indicates records with unassigned family)

  • Completeness analysis – functions to assess and visualize the completeness of the biodiversity inventory of a region, in other words a measure of how exhaustive the sampling in the study area is [Ref: http://dx.doi.org/10.1111/j.0906-7590.2007.04627.x]
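
To make the data summary task more concrete, here is a minimal sketch of what such a function could report. The function name summarize_records and the column names Latitude, Longitude and Date_collected are assumptions for this example only.

```r
# Hypothetical summary helper; column names are assumptions for this sketch.
summarize_records <- function(dat) {
  has_coords <- !is.na(dat$Latitude) & !is.na(dat$Longitude)
  dates <- as.Date(dat$Date_collected)
  list(
    n_records = nrow(dat),
    n_georeferenced = sum(has_coords),
    # Bounding box computed from georeferenced records only
    bounding_box = c(
      min_lat = min(dat$Latitude[has_coords]),
      max_lat = max(dat$Latitude[has_coords]),
      min_lon = min(dat$Longitude[has_coords]),
      max_lon = max(dat$Longitude[has_coords])
    ),
    date_range = range(dates, na.rm = TRUE)
  )
}

# Toy records: one lacks coordinates, one lacks a date
records <- data.frame(
  Latitude = c(12.5, NA, -3.2, 40.1),
  Longitude = c(77.6, NA, 36.8, -3.7),
  Date_collected = c("2010-01-05", "2011-06-30", NA, "2012-09-12"),
  stringsAsFactors = FALSE
)
s <- summarize_records(records)
str(s)
```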

Mentor(s): Javier Otegui

Data set: The data set used for the sample visualizations here consists of the records published by iNaturalist.org on the GBIF data portal. It contains Research Grade records (~46K) for all the organisms posted. The details of the data set are available here. The description on the GBIF data portal says “iNaturalist.org is a website where anyone can record their observations from nature. Members record observations for numerous reasons, including participation in citizen science projects, class projects, and personal fulfillment.”

References:

  • Chamberlain, S., & Barve, V. (2012). rvertnet: Search VertNet database from R. Retrieved from http://cran.r-project.org/package=rvertnet
  • Chamberlain, S., Boettiger, C., Ram, K., & Barve, V. (2013). rgbif: Interface to the Global Biodiversity Information Facility API methods. Retrieved from http://cran.r-project.org/package=rgbif
  • Hijmans, R. J., Phillips, S., Leathwick, J., & Elith, J. (2012). dismo: Species distribution modeling. Retrieved from http://cran.r-project.org/package=dismo
  • Otegui, J., & Ariño, A. H. (2012). BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics (Oxford, England), 28(16), 2207–8. doi:10.1093/bioinformatics/bts359
  • Soberón, J., Jiménez, R., Golubov, J., & Koleff, P. (2007). Assessing completeness of biodiversity databases at different spatial scales. Ecography, 30(1), 152–160. doi:10.1111/j.2006.0906-7590.04627.x
  • VanDerWal, J., Falconi, L., Januchowski, S., Shoo, L., & Storlie, C. (2012). SDMTools: Species Distribution Modelling Tools: Tools for processing data associated with species distribution modelling exercises. Retrieved from http://cran.r-project.org/package=SDMTools

Exploring distributions of Ensatina salamander subspecies using rvertnet by Neil Kelley

9 Aug

This week we have a guest blog post by Neil Kelley.


Last week, I stumbled on Vijay’s blog post demonstrating his new package rvertnet. Although I am a paleontologist, some of my research involves anatomical comparison between extinct species and extant relatives or ecological analogs, so I have some experience using VertNet to track down specimens of interest in museum collections around the country.

Although I have been using R off and on for years, I have always been a little terrified of it. This has less to do with R (which is really quite intuitive and forgiving once you get the hang of it) and more to do with my phobia of command-line interfaces. I’ve been spoiled by rich GUIs, dropdown menus and toolbars. The ominous pulse of a cursor in an empty console window can still send me into a cold sweat.

But lately that is starting to change thanks to my discovery of a number of resources including RStudio, sites like Quick-R and R Cookbook, the ever growing collection of valuable discussion threads on StackOverflow and the R-help mailing list, and of course the proliferation of R-themed blogs like Vijay’s. I suppose the true sign that I am overcoming my fear is that I find myself “playing” with R to learn and practice skills that I can use in my own research.

So I decided to have some fun with rvertnet after seeing what Vijay did with Scrub Jay and Blue Jay distributions. I wanted to take a look at the distribution of HerpNET records for the classic ring-species complex Ensatina eschscholtzii. My aim was to reproduce something like this map from the excellent California Herps site:

http://www.californiaherps.com/salamanders/maps/ensatinamap.jpg

I essentially copied Vijay’s approach (see his post for details on that) but I modified the basemap to focus just on California using the “state” database in the maps package, and by manually specifying latitude and longitude limits for the map using xlim() and ylim().

I picked colors to match those used on the California Herps map. A legend would be nice, but that is left as an exercise for the reader (i.e. if you know of an easy way to do it let me know, because I was too lazy to manually set all of the legend parameters myself).

The colors used on the map correspond to: Light Blue = E. e. croceater – Yellow-blotched Ensatina; Purple = E. e. eschscholtzii – Monterey Ensatina; Blue = E. e. klauberi – Large-blotched Ensatina; Red = E. e. oregonensis – Oregon Ensatina; Black = E. e. picta – Painted Ensatina; Orange = E. e. platensis – Sierra Nevada Ensatina; Yellow = E. e. xanthoptica – Yellow-eyed Ensatina.
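
One possible answer to the legend question, sketched here on invented coordinates rather than real HerpNET records: stack the per-subspecies data frames into a single data frame with a column naming the subspecies, map colour to that column, and let scale_colour_manual() supply the chosen colours. ggplot2 then draws the legend by itself. Only two subspecies are shown, and the plot is skipped if ggplot2 is not installed.

```r
# Toy stand-ins for two of the per-subspecies data frames (coordinates invented).
YBE2 <- data.frame(Latitude = c(35.1, 35.4), Longitude = c(-118.6, -118.9))
OE2  <- data.frame(Latitude = c(40.2, 41.0), Longitude = c(-123.5, -123.9))

# Stack them, recording which subspecies each row belongs to.
ensatina <- rbind(
  cbind(YBE2, subspecies = "E. e. croceater"),
  cbind(OE2,  subspecies = "E. e. oregonensis")
)

# Named vector: subspecies -> colour used on the map.
subspecies_colours <- c("E. e. croceater"   = "lightblue",
                        "E. e. oregonensis" = "red")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Mapping colour inside aes() is what makes ggplot2 build the legend
  p <- ggplot(ensatina, aes(Longitude, Latitude, colour = subspecies)) +
    geom_jitter(alpha = 0.3) +
    scale_colour_manual(values = subspecies_colours, name = "Subspecies")
  print(p)
}
```

The same geom_jitter() layer could be added on top of a base map instead of starting from an empty ggplot().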

Overall I was pretty happy with the results, though it is interesting to note that E. e. oregonensis seems to extend somewhat further south than shown on the California Herps map, overlapping completely with the western population of E. e. xanthoptica. According to California Herps, all E. e. oregonensis occurring in California are now regarded as intergrades, so this map may reflect some outdated taxonomic practices. There are also a few interesting “out-of-bounds” records, notably several E. e. eschscholtzii in Northern California far beyond the “normal” northern limit of their range near Monterey Bay. I have no idea if these might be misidentified or whether they are true vagrants.

I decided that it would be interesting to plot the location of hybrids on the map too, given that the presence/absence of hybridization zones is an important component of the ring species idea. It turns out that the ScientificName field from HerpNET was not as clean as I was hoping. When I downloaded the entire Ensatina dataset from HerpNET and checked the constituent taxa with levels() I got this:

 [1] "Ensatina  eschscholtzi"                                                  "Ensatina ensatina xonthoptica"                                          
 [3] "Ensatina escholtzi"                                                      "Ensatina eschschlotzii eschschlotzii"                                   
 [5] "Ensatina eschscholtzi"                                                   "ENSATINA ESCHSCHOLTZI"                                                  
 [7] "Ensatina eschscholtzi croceator"                                         "Ensatina eschscholtzi eschscholtzi"                                     
 [9] "Ensatina eschscholtzi eschscholtzi x xanthopicta"                        "Ensatina eschscholtzi klauberi"                                         
[11] "Ensatina eschscholtzi oregonensis"                                       "Ensatina eschscholtzi oregonensis x xanthopicta"                        
[13] "ENSATINA ESCHSCHOLTZI OREGONESIS"                                        "Ensatina eschscholtzi picta"                                            
[15] "Ensatina eschscholtzi picta x oregonensis"                               "Ensatina eschscholtzi platensis"                                        
[17] "Ensatina eschscholtzi platensis x croceator"                             "Ensatina eschscholtzi xanthoptica"                                      
[19] "Ensatina eschscholtzii"                                                  "ENSATINA ESCHSCHOLTZII"                                                 
[21] "Ensatina eschscholtzii cf oregonensis"                                   "Ensatina eschscholtzii croceater"                                       
[23] "Ensatina eschscholtzii croceator"                                        "Ensatina eschscholtzii escholtzi"                                       
[25] "Ensatina eschscholtzii eschscholtzi"                                     "Ensatina eschscholtzii eschscholtzii"                                   
[27] "ENSATINA ESCHSCHOLTZII ESCHSCHOLTZII"                                    "Ensatina eschscholtzii eschscholtzii x Ensatina eschscholtzii klauberi" 
[29] "Ensatina eschscholtzii eschscholtzii x eschscholtzii oregonensis"        "Ensatina eschscholtzii eschscholtzii x xanthoptica"                     
[31] "Ensatina eschscholtzii klauberi"                                         "Ensatina eschscholtzii oregonensis"                                     
[33] "ENSATINA ESCHSCHOLTZII OREGONENSIS"                                      "Ensatina eschscholtzii oregonensis x Ensatina eschscholtzii xanthoptica"
[35] "Ensatina eschscholtzii oregonensis x eschscholtzii picta"                "Ensatina eschscholtzii oregonensis X picta"                             
[37] "Ensatina eschscholtzii oregonensis x platensis"                          "Ensatina eschscholtzii oregonensis x xanthoptica"                       
[39] "Ensatina eschscholtzii picta"                                            "ENSATINA ESCHSCHOLTZII PICTA"                                           
[41] "Ensatina eschscholtzii picta x oregonensis"                              "Ensatina eschscholtzii platensis"                                       
[43] "ENSATINA ESCHSCHOLTZII PLATENSIS"                                        "Ensatina eschscholtzii platensis x Ensatina eschscholtzii xanthoptica"  
[45] "Ensatina eschscholtzii ssp."                                             "Ensatina eschscholtzii xanthoptica"                                     
[47] "ENSATINA ESCHSCHOLTZII XANTHOPTICA"                                      "Ensatina eschscholzii"                                                  
[49] "Ensatina sp."

Yikes. Note that “Ensatina eschscholtzii oregonensis x Ensatina eschscholtzii xanthoptica” and “Ensatina eschscholtzii oregonensis x xanthoptica” appear as separate levels. This variable formatting is probably worth being aware of if you work on species that hybridize. Likewise, I noticed that my original queries also returned hybrids when the complete parent taxon name was included in the taxonomic identification of the specimen. Thus an individual identified as “Ensatina eschscholtzii oregonensis x Ensatina eschscholtzii xanthoptica” would appear in queries for “Ensatina eschscholtzii oregonensis” and “Ensatina eschscholtzii xanthoptica” but a hybrid identified as “Ensatina eschscholtzii oregonensis x xanthoptica” would only appear in the former.
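
The query behaviour described above is what you would expect if the search works like a plain substring match on the name field. That is an assumption about the portal, not something documented here, but it can be mimicked locally with grepl():

```r
# The two ways the same hybrid is written in the data
hybrid_names <- c(
  "Ensatina eschscholtzii oregonensis x Ensatina eschscholtzii xanthoptica",
  "Ensatina eschscholtzii oregonensis x xanthoptica"
)
queries <- c("Ensatina eschscholtzii oregonensis",
             "Ensatina eschscholtzii xanthoptica")

# For each query, which hybrid names contain it as a literal substring?
matches <- sapply(queries, function(q) grepl(q, hybrid_names, fixed = TRUE))
rownames(matches) <- c("full parent names", "abbreviated second parent")
print(matches)
```

The first row matches both queries, while the second row matches only the oregonensis query, because the string "Ensatina eschscholtzii xanthoptica" never appears in the abbreviated form.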

I began cleaning this up using gsub() to standardize the formatting when eventually my wife came in the room and asked “have you been working on that salamander thing ALL day?” At which point I realized I should probably get back to my own dissertation work.
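
A minimal sketch of that kind of gsub() cleanup, run on a handful of the levels listed above. The function name clean_names is made up for this example, and the spelling fixes cover only the variants that appear in this particular list:

```r
# A few of the messy levels from the output above
raw_names <- c("Ensatina  eschscholtzi",
               "ENSATINA ESCHSCHOLTZII OREGONENSIS",
               "Ensatina eschscholtzii oregonensis X picta",
               "Ensatina eschscholtzi croceator",
               "Ensatina eschscholzii")

clean_names <- function(x) {
  x <- tolower(x)                      # unify case (also turns "X" into "x")
  x <- gsub("\\s+", " ", trimws(x))    # collapse repeated spaces
  # Misspelled species epithets seen in this list -> accepted spelling
  x <- gsub("\\besch(scholtzi|oltzi|schlotzii|scholzii)\\b", "eschscholtzii",
            x, perl = TRUE)
  x <- gsub("\\bcroceator\\b", "croceater", x, perl = TRUE)
  # Restore the capitalized genus name
  sub("^ensatina", "Ensatina", x)
}

cleaned <- clean_names(raw_names)
print(cleaned)
```

Expanding abbreviated hybrid names into the full "parent x parent" form would need an extra step and is not attempted here.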

Fun stuff though. Thanks Vijay!

library(rvertnet)
library(ggplot2)
library(maps)

# Download HerpNET records for each subspecies, then drop records
# whose coordinates are 0 (i.e. not georeferenced)
YBE<-vertoccurrence(t="Ensatina eschscholtzii croceater",grp="herp")
YBE2<-subset(YBE,Latitude !=0 & Longitude != 0)
ME<-vertoccurrence(t="Ensatina eschscholtzii eschscholtzii",grp="herp")
ME2<-subset(ME,Latitude !=0 & Longitude != 0)
LBE<-vertoccurrence(t="Ensatina eschscholtzii klauberi",grp="herp")
LBE2<-subset(LBE,Latitude !=0 & Longitude != 0)
OE<-vertoccurrence(t="Ensatina eschscholtzii oregonensis",grp="herp")
OE2<-subset(OE,Latitude !=0 & Longitude != 0)
PE<-vertoccurrence(t="Ensatina eschscholtzii picta",grp="herp")
PE2<-subset(PE,Latitude !=0 & Longitude != 0)
SNE<-vertoccurrence(t="Ensatina eschscholtzii platensis",grp="herp")
SNE2<-subset(SNE,Latitude !=0 & Longitude != 0)
YE<-vertoccurrence(t="Ensatina eschscholtzii xanthoptica",grp="herp")
YE2<-subset(YE,Latitude !=0 & Longitude != 0)

# Base map: outline of California from the "state" database
all_states<-map_data("state")
states <- subset(all_states, region %in% c("california") )
emap <- ggplot()
emap <- emap + geom_polygon( data=states, aes(x=long, y=lat, group = group),colour="white", fill="grey90" )+theme_bw()

# One jittered point layer per subspecies, colored as on the California Herps map
emap +
geom_jitter(data = YBE2,aes(Longitude, Latitude), alpha=0.3, color = "light blue") +
geom_jitter(data = ME2,aes(Longitude, Latitude), alpha=0.3, color = "purple")+
geom_jitter(data = LBE2, aes(Longitude, Latitude), alpha=0.3, color = "blue")+
geom_jitter(data = OE2,aes(Longitude, Latitude), alpha=0.3, color = "red")+
geom_jitter(data = PE2,aes(Longitude, Latitude), alpha=0.3, color = "black")+
geom_jitter(data = SNE2, aes(Longitude, Latitude), alpha=0.3, color = "orange")+
geom_jitter(data = YE2, aes(Longitude, Latitude), alpha=0.3, color = "yellow")+
ggtitle("Ensatina subspecies")+
xlim(c(-125,-113))+ylim(c(30,43))

Blue Jay and Scrub Jay : Using rvertnet to check the distributions in R

30 Jul

As part of my Google Summer of Code project, I am also working on another R package called rvertnet. This package is an R wrapper for the VertNet websites. VertNet is a distributed database network for vertebrates consisting of FishNet2, MaNIS, HerpNET, and ORNIS. Of these, FishNet2, HerpNET and ORNIS currently have their v2 portals serving data. rvertnet now has functions to access this data and import it into R data frames.

Some of my lab mates had difficulty downloading data for Scrub Jays (Aphelocoma spp.) from ORNIS because of the large number of records (220k+), so I decided to try the rvertnet package, which was still in development at the time. The package really helped and returned the results quickly. While exploring that data I came up with this case study.

To get data for the Blue Jay (Cyanocitta cristata), which is distributed in the eastern USA, we use the vertoccurrence function. We specify the taxon we are looking for with t="Cyanocitta cristata", and we specify that this is a bird species with grp="bird", since currently we have to access the data from three different VertNet websites for fishes, birds and herps (reptiles and amphibians). This fetches all the records for the Blue Jay. Next we want to discard the records without latitude and longitude values, so we use the subset function on the data with Latitude !=0 & Longitude != 0. This gives us all geocoded records for the Blue Jay, which we then map using the maps and ggplot2 packages as we did in an earlier post.

library(rvertnet)
# Fetch all Blue Jay records from the bird portal
bluej1=vertoccurrence(t="Cyanocitta cristata",grp="bird")
# Keep only georeferenced records
bluej2=subset(bluej1,Latitude !=0 & Longitude != 0)

library(maps)
library(ggplot2)
# Plot the records as red points over a world base map
world  = map_data("world")
ggplot(world, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white",
               color = "gray40", size = .2) +
  geom_jitter(data = bluej2,
              aes(Longitude, Latitude), alpha=0.6, size = 4,
              color = "red") +
  ggtitle("Cyanocitta cristata (Blue Jay)")

The final output of the code snippet is as follows.

Now let us put the Blue Jay and the Scrub Jay side by side on the map to see how they are distributed in North America. This is the same as the earlier code, except that we get data for both jays and, while plotting, add a second geom_jitter layer for the other jay in a different color to distinguish the two. Also note the reduction in point size from 4 to 1 to keep the points clearly visible.

library(rvertnet)
# Fetch and filter records for both jays
bluej1=vertoccurrence(t="Cyanocitta cristata",grp="bird")
bluej2=subset(bluej1,Latitude !=0 & Longitude != 0)
scrubj1=vertoccurrence(t="Aphelocoma",grp="bird")
scrubj2=subset(scrubj1,Latitude !=0 & Longitude != 0)

library(maps)
library(ggplot2)
# Blue Jay in blue, Scrub Jay in brown, over a world base map
world = map_data("world")
ggplot(world, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white", color = "gray40",
               size = .2) +
  geom_jitter(data = bluej2,
              aes(Longitude, Latitude), alpha=0.6, size = 1,
              color = "blue") +
  geom_jitter(data = scrubj2,
              aes(Longitude, Latitude), alpha=0.6, size = 1,
              color = "brown") +
  ggtitle("Blue Jay and Scrub Jay")

The final output of the code snippet is as follows.