GSoC Proposal 2013: Biodiversity Visualizations using R

29 Apr

I am applying for Google Summer of Code 2013 with this “Biodiversity Visualizations using R” proposal. I am posting the idea here to get feedback and suggestions from the Biodiversity Informatics community.

[Over the next few days I will keep updating this post to accommodate suggestions. The visualizations shown here are crude illustrations of the ideas and need a lot of work before they become reusable functions.]

Background

R is increasingly being used for biodiversity information analysis. Several R packages in the rOpenSci suite, such as rgbif and rvertnet, let users query, download and, to some extent, analyse data within an R workflow. Packages like dismo and SDMTools support modelling the data. A package for quickly visualizing biodiversity data would be a useful addition. Such visualizations would help in understanding the geographical, taxonomic and temporal coverage of the data, as well as its gaps and biases.

The proposal is to work on an R package that quickly generates visualizations of a data set the user has gathered or generated.

The functions provided would cover the following tasks:

  • Data preparation - The data needs to be converted into a format suitable for visualization and analysis, i.e. dates, taxonomic classification and geographical coordinates should be in uniform and usable formats.
  • Data summary - Function(s) to quickly summarize the data set, telling the user the number of records, the number of records with latitude and longitude values, the bounding box of those coordinates, the date range and so on.
  • Geographic coverage - Functions to visualize the data points on maps, and density maps at different scales such as country level, degree grid and so on.

Density of the records worldwide. Darker color indicates higher density of records.

Temporal coverage of the records. Each line represents the number of records on that particular day.

  • Taxonomic coverage - Functions to visualize the taxonomic coverage of the data as tree maps, by number of records per species and by number of species covered.

Family-wise records present in the data set. (The white block indicates records with no family assigned.)

  • Completeness analysis - Functions to assess and visualize the completeness of the biodiversity inventory of a region, in other words a measure of how exhaustive the sampling in the study area is [Ref: http://dx.doi.org/10.1111/j.0906-7590.2007.04627.x]
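
To make the data summary and degree-grid ideas in the list above a little more concrete, here is a minimal sketch in base R. It is only an illustration under stated assumptions, not the design of the proposed package: the function names summarize_records and grid_counts, the column names (latitude, longitude, date_collected) and the tiny example data frame are all hypothetical.

```r
# Hypothetical helper: summarise a data frame of occurrence records.
# Column names (latitude, longitude, date_collected) are assumptions.
summarize_records <- function(records) {
  has_coords <- !is.na(records$latitude) & !is.na(records$longitude)
  dates <- as.Date(records$date_collected)  # expects ISO dates, e.g. "2012-06-01"
  list(
    n_records = nrow(records),
    n_with_coords = sum(has_coords),
    bbox = c(
      min_lon = min(records$longitude[has_coords]),
      min_lat = min(records$latitude[has_coords]),
      max_lon = max(records$longitude[has_coords]),
      max_lat = max(records$latitude[has_coords])
    ),
    date_range = range(dates, na.rm = TRUE)
  )
}

# Hypothetical helper: count records per grid cell (one degree by default).
# Each cell is identified by its south-west corner; the resulting table is
# the kind of input a density map would need.
grid_counts <- function(records, cell_size = 1) {
  ok <- !is.na(records$latitude) & !is.na(records$longitude)
  cells <- data.frame(
    lon_cell = floor(records$longitude[ok] / cell_size) * cell_size,
    lat_cell = floor(records$latitude[ok] / cell_size) * cell_size
  )
  aggregate(n ~ lon_cell + lat_cell, data = cbind(cells, n = 1L), FUN = sum)
}

# Made-up example records, for illustration only.
records <- data.frame(
  scientific_name = c("Danaus plexippus", "Danaus plexippus",
                      "Apis mellifera", "Apis mellifera"),
  latitude = c(38.95, 38.20, NA, -33.87),
  longitude = c(-95.25, -95.90, NA, 151.21),
  date_collected = c("2012-06-01", "2012-06-15", "2011-03-20", "2013-01-05"),
  stringsAsFactors = FALSE
)

data_summary <- summarize_records(records)
cell_table <- grid_counts(records)
print(data_summary)
print(cell_table)
```

The per-cell counts could then be passed to any plotting function to draw the density map; records without coordinates are dropped from the grid but still counted in the overall summary.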

Mentor(s): Javier Otegui

Data set: The data set used for the sample visualizations here consists of the records published by iNaturalist.org on the GBIF data portal. It contains Research Grade records (~46K) for all the organisms posted. The details of the data set are available here. The description on the GBIF data portal says “iNaturalist.org is a website where anyone can record their observations from nature. Members record observations for numerous reasons, including participation in citizen science projects, class projects, and personal fulfillment.”

References:

  • Chamberlain, S., & Barve, V. (2012). rvertnet: Search VertNet database from R. Retrieved from http://cran.r-project.org/package=rvertnet
  • Chamberlain, S., Boettiger, C., Ram, K., & Barve, V. (2013). rgbif: Interface to the Global Biodiversity Information Facility API methods. Retrieved from http://cran.r-project.org/package=rgbif
  • Hijmans, R. J., Phillips, S., Leathwick, J., & Elith, J. (2012). dismo: Species distribution modeling. Retrieved from http://cran.r-project.org/package=dismo
  • Otegui, J., & Ariño, A. H. (2012). BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics, 28(16), 2207–2208. doi:10.1093/bioinformatics/bts359
  • Soberón, J., Jiménez, R., Golubov, J., & Koleff, P. (2007). Assessing completeness of biodiversity databases at different spatial scales. Ecography, 30(1), 152–160. doi:10.1111/j.2006.0906-7590.04627.x
  • VanDerWal, J., Falconi, L., Januchowski, S., Shoo, L., & Storlie, C. (2012). SDMTools: Species Distribution Modelling Tools: Tools for processing data associated with species distribution modelling exercises. Retrieved from http://cran.r-project.org/package=SDMTools

11 Responses to “GSoC Proposal 2013: Biodiversity Visualizations using R”

  1. Diego Barneche April 30, 2013 at 12:55 am #

    Hello Vijay, your ideas sound very interesting and it is great to see your enthusiasm with this project.
    Please forgive me if some of the suggestions I’m about to make are obvious:

    1) Be careful with whatever is already offered in R (overlaps with other packages). For instance, your first map could be easily done using package maps or even MASS? If it does overlap, what sort of novelties/arguments are you bringing to these functions?

    2) The temporal coverage seems a nice idea, but I’m still not sure if the temporal circle is the best way to present the data. Maybe it would be nice to add an inner circle on top of the red lines that show the minimum coverage common to the entire temporal coverage?

    3) On familywise records it is unlikely that you’ll have negative values? If that’s the case, then you may want to ignore the negative scale and expand the colors from zero to maximum to allow a better comparison between families – in the example you provided, all families are in different shades of green, very difficult to tell them apart. You may also want to consider alternatives for people who cannot distinguish colors properly?

    Cheers,

    Diego

    • vijaybarve April 30, 2013 at 9:38 pm #

      Hi Diego,

      Thank you very much for taking the time to go through my GSoC proposal and, more importantly, for giving me detailed feedback on the ideas. As I mentioned in the post, the visualizations are nowhere near final; they are really crude images. The purpose of posting these images is dual: first, to give a visual clue of how things might look, and second, as a proof of concept that these things can be done using R and, more importantly, that I can do it.

      1. You have raised a very important point that I should keep in mind: “Do not reinvent the wheel”. Reuse existing functions wherever possible. The idea behind this map is a world map overlaid with a kind of heat map. I am thinking of providing this functionality because it is a very useful visualization and might be used frequently. With the kind of biodiversity data we use, it is simple to produce but needs some repetitive code to get the map.

      2. Temporal coverage graph: I really liked your suggestion of adding a circle of minimum records. I am adding it to my function specification. I am also planning to provide more functions to plot histograms and other graphs for temporal data.

      3. The family-wise taxonomic coverage diagram needs a lot of work: first to understand how effectively I can display three parameters, and then to provide control over the color schemes and legend. During the coming days, I will try to add more example visualizations to make the proposal clearer to all.

      Thanks once again for your time and encouragement.

      • Diego Barneche May 27, 2013 at 9:37 am #

        Great Vijay, sorry I took so long to see your reply.

        What I really like about software development in Science is that you often find a lot of people commenting here and there, and I can see you are already benefiting from it. Keep up the good job.

        As a matter of curiosity, do you keep your things under version control in a public GIT repository? If so, would you be willing to share (maybe the outputs only) so we can happily follow the ideas through?

  2. Tim Appelhans April 30, 2013 at 7:27 am #

    Hello Vijay,
    I am not per se a biodiversity/ecology biologist, but I am a climatologist who is involved in a biggish biodiversity/ecology project at Mt. Kilimanjaro, Tanzania (https://www.kilimanjaro.biozentrum.uni-wuerzburg.de/).
    From my experience, a package that provides straightforward visualisation tools for the most common modelling outputs used in the community and also brings together the geographic/spatial possibilities of packages like sp and raster, would be very neat and much appreciated (at least among our PhD students)! There are plenty of examples of having a … package extended by a …Vis package. So, I would encourage you to go ahead with this.

    Given my involvement in the above mentioned project, I would also like to offer my help. I do have experience in R programming and package development (http://tim-salabim.github.io/metvurst/) and am especially interested in spatial analysis (I am a geographer) and visualisation (especially lattice and grid).

    I hope you go forward with this and if I can help in any way, please let me know.

    Cheers
    Tim

    • vijaybarve May 1, 2013 at 3:48 am #

      Hello Tim,

      Thanks for the detailed response and encouragement. Your help will be valuable in refining my ideas and function specifications and, once the coding starts, in testing the package at an early stage.

      As you have mentioned, the goal is to provide wrappers for functions with straightforward calls.

      Will be in touch with you.

      Regards,

      Vijay

  3. Scott Chamberlain April 30, 2013 at 5:32 pm #

    Hey Vijay, A few comments.

    Yeah, like Diego said, make sure to not reinvent the wheel. If an R package already does what you need you could just depend on that package, and make a light wrapper around it for your purposes. For a lot of data, it could be an option to allow users to run a shiny app with the data (they can be run locally in a browser) to interact with the data (e.g., adjust variables in the plot to see different data subsets, etc.).

    -S

    • vijaybarve May 1, 2013 at 3:52 am #

      Hi Scott,

      I will take a close look at Shiny and add some interactivity in data exploration.

      Thanks,

      Vijay

  4. Olmo April 30, 2013 at 6:23 pm #

    I do not write very often in blogs, but I hope to do it well. I only want to comment on the completeness functions: did you check the fossil package? I have used it for my completeness analysis in the past without problems (the function spp.est, for example). I may be wrong, but since you do not mention that package, I felt I should point it out.

    By the way, it is a very good project, and these are the kind of packages I use in my PhD, so I am happy to see someone in GSoC working on it.

    Best of luck,
    Olmo.

    • vijaybarve May 1, 2013 at 3:55 am #

      I have not used the fossil package so far, but I will take a look at it and see how much I can borrow from it for the functionality we have in mind for this package. Thanks for your encouragement. Looking forward to your help in the testing phase.

  5. vijaybarve May 3, 2013 at 4:43 am #

    Added Data Summary function to the proposal.

Trackbacks/Pingbacks

  1. bdvis development version available for early feedback | Vijay Barve - May 8, 2014

    […] are underway. I thought this is a good logical point for us to share what we have been doing for Biodiversity Data Visualizations in R project and open up the package for testing and some early feedback. We have named the package […]
