Tag Archives: GSoC

GSoC 2017 : Parser for Biodiversity Checklists

31 May

Guest post by Qingyue Xu

Compiling taxonomic checklists from varied sources of data is a common task that biodiversity informaticians encounter. In the GSoC 2017 project Parser for Biodiversity Checklists, my overall goal is to extract taxonomic names from given text into a tabular format, so that biodiversity data can be easily aggregated into a structured form for further processing.

I plan to build three major functions, which serve different purposes and take various sources of text into account.

However, before building these functions, we first need to identify and cover as many different formats of scientific names as possible. The inconsistencies among scientific names complicate the task. Scientific names commonly follow the order:

genus, [species], [subspecies], [author, year], [location]

Many components are optional, and components like author and location may occur once or several times. Therefore, when parsing the text, we need to analyze its structure, match it against all possible patterns of scientific names, and identify the most likely one. To resolve ambiguities more accurately, we can also draw on the NLTK (Natural Language Toolkit) package to recognize PERSON and LOCATION entities, so that we can analyze the components of scientific names more efficiently.
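
As a rough first cut, even a naive regular expression can pull out candidate names of the form "Genus species [Author, year]". The sketch below is illustrative only; both the pattern and find_candidates() are assumptions, not project code.

find_candidates <- function(text) {
  # capitalized genus, one or two lowercase epithets, optional "Author, year"
  pattern <- "\\b[A-Z][a-z]+ [a-z]+(?: [a-z]+)?(?: [A-Z][A-Za-z.]+,? \\d{4})?"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

find_candidates("specimens of Papilio machaon Linnaeus, 1758 were collected")
# [1] "Papilio machaon Linnaeus, 1758"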

Function: find_taxoname (url/input_file, output_file)

  • Objective: This function searches for scientific names in supplied text, and is especially suited to text that is not well structured.
  • Parameters: The first parameter is the URL of a web page (HTML based) or the file path of a PDF/TXT file, which is the source text in which to search for taxonomic names. The second parameter is the file path of the output file, which will be in a tabular format with columns for genus, species, subspecies, author and year.
  • Approach: Since this function is intended for unstructured text, we cannot rely on a fixed pattern to parse the taxonomic names. What we can do is use existing standard dictionaries of scientific names to locate the genus, species and subspecies. By analyzing the surrounding structure and patterns, we can then find the corresponding author and year, if they exist, and output the findings in tabular format.

Function: parse_taxolist(input_file, filetype, sep, output_file, location)

  • Objective: This function parses and extracts taxonomic names from a structured text file in which each row contains exactly one scientific-name entry. If location information is given, the function can also return the exact location (latitude and longitude) of the species. The output is again in tabular format, with columns for genus, species, subspecies, author(s), year(s), location(s), latitude(s) and longitude(s).
  • Parameters: The first parameter is the file path of the input file and the second is the file type, which can be TXT, PDF or CSV. The third parameter, ‘sep’, indicates the separator used in the input file to delimit the fields within a row. The fourth parameter is the intended file path of the output file. The last parameter is a Boolean indicating whether the input file contains location information; if ‘true’, the output will contain detailed location information.
  • Approach: The function will parse the input file, row by row and using the given separators, into a well-organized tabular format. An extended feature is to pinpoint the exact location of the species if the related information is given. Given location text such as “Mirik, West Bengal, India”, the function will return the exact latitude and longitude of this location, 26°53’7.07″N and 88°10’58.06″E. This can be realized by crawling a page such as https://www.distancesto.com/coordinates or by using the Google Maps API, and it also offers a way to decide whether a piece of text represents a location at all: if no latitude and longitude can be obtained, it is not treated as a location (see the sketch below). If a scientific name has no location information, the function will return NULL for the location fields; if it has multiple locations, the function will return them as a list, along with their latitudes and longitudes.
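
As a rough illustration of that location check, the sketch below geocodes a string with the free OpenStreetMap Nominatim service, swapped in here for the services named above; geocode() and its behavior are assumptions for illustration.

library(jsonlite)

geocode <- function(place) {
  url <- paste0("https://nominatim.openstreetmap.org/search?format=json&q=",
                URLencode(place, reserved = TRUE))
  res <- fromJSON(url)
  if (length(res) == 0) return(NULL)  # no hit: not treated as a location
  c(lat = as.numeric(res$lat[1]), lon = as.numeric(res$lon[1]))
}

geocode("Mirik, West Bengal, India")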

Function: recursive_crawler(url, htmlnode_taxo, htmlnode_next, num, output_file, location)

  • Objective: This function crawls web pages containing taxonomic names recursively. The start URL must be given, and the HTML node holding the scientific names must be indicated. If the text contains location info, the output will also include the detailed latitude and longitude.
  • Parameters: The first parameter is the start URL, and the following web pages must share the structure of the first one. The second parameter is the HTML node of the taxonomic names, such as “.SP .SN > li” (there are many tools that help users identify the HTML node for a given piece of content). The third parameter is the HTML node of the next-page link, which leads to the page for another genus. The fourth parameter, ‘num’, is the number of web pages the user wants crawled; if ‘num’ is not given, the function will keep crawling until htmlnode_next no longer returns a valid URL. The last two parameters are the same as in the two functions above.
  • Approach: The parsing and location steps follow the same approach as in the functions above. For the crawling part, given a series of similarly structured web pages, we can extract valid scientific names based on the given HTML nodes. Since the HTML node of the next-page link is also given, we can always obtain the URL of the next page from the page source; for example, the web page we used provides such a link leading to the next page. By recursively fetching the information on the current page and jumping to the following pages, we can output a single well-organized tabular file covering all the pages visited.
Other possible functionalities to be realized

Since inconsistencies exist in the formats of scientific names, I also need to build a function to normalize the names. The complication usually lies in the author part, and there are two approaches to the problem. The first is to analyze the structure of the scientific name and try to capture as many exceptions as possible, such as author names with multiple parts or names with two authors. The second is to draw on the NLTK package to identify likely PERSON names. However, when the name gets too complicated, the parsing result will not always be accurate. Therefore, we can add a parameter suggesting how reliable the result is, indicating the need for further manual processing when the parser cannot work reliably.
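
For the structural approach, a heuristic split between the name and the authorship might start like the sketch below; split_author() and its heuristic are assumptions for illustration, not project code.

split_author <- function(name) {
  # heuristic: authorship starts at the first capitalized word (or opening
  # parenthesis) after the lowercase epithets; it will misfire on many names
  m <- regexpr(" [A-Z(].*$", name)
  if (m == -1) return(c(name = name, author = NA))
  c(name = substr(name, 1, m - 1),
    author = trimws(substr(name, m, nchar(name))))
}

split_author("Papilio machaon Linnaeus, 1758")
# name: "Papilio machaon"   author: "Linnaeus, 1758"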


GSoC 2017 : Biodiversity data cleaning

17 May

By Ashwin Agrawal

URL of the Project Idea: https://github.com/rstats-gsoc/gsoc2017/wiki/Biodiversity-data-cleaning

Introduction

There is an increasing number of scientists using R for their data analyses; however, the skill set required to handle biodiversity data in R varies considerably. Since users need to retrieve, manage and assess high-volume data with a complex structure (the Darwin Core standard, DwC), only users with a very sound R programming background can attempt this. Recently, various R packages dealing with biodiversity data, and specifically with data cleaning, have been published (e.g. scrubr, biogeo, rgeospatialquality, assertr, and taxize). Though numerous new procedures are now available, implementing them requires users to prepare the data according to the formats of each of these packages and to learn each package. The integration-related tasks that would facilitate the data format conversions and the smooth execution of all the available data cleaning functions from the various packages are being addressed in another GSoC project (link). The purpose of my project is to identify and address missing crucial functionalities for handling biodiversity (big) data in R. Properly addressing these gaps will hopefully enable us to offer a more complete framework for data quality assessment in R.

Proposed components

1. Standardized flagging system:

Biodiversity data quality assessment rests on the user’s ability to execute a variety of data checks. A well-designed flagging system will let users easily manage the results of their data checks, facilitating control over the desired quality level on one hand and user flexibility on the other. I will assess several approaches for designing such a system, factoring in comprehensibility and programming complexity.

Any insights and ideas regarding this task will be highly appreciated (please create a GitHub issue).
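
One possible shape for such a system, sketched here purely for discussion (the two check functions and the flag_ column convention are assumptions): each check appends a logical column and leaves the data untouched, so checks compose freely and the user decides what to drop.

flag_missing_coords <- function(df) {
  df$flag_missing_coords <- is.na(df$decimalLatitude) |
                            is.na(df$decimalLongitude)
  df
}

flag_future_date <- function(df) {
  d <- as.Date(df$eventDate)
  df$flag_future_date <- !is.na(d) & d > Sys.Date()
  df
}

# drop every record that fails any check, in one step:
# df_clean <- df[!Reduce(`|`, df[grep("^flag_", names(df))]), ]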

2. A DwC summary table:

When dealing with high-complexity, high-volume data, summary statistics of different fields and categories can have immense value. I will develop a DwC summary table based on DwC fields and vocabulary. First, I will explore different R packages dealing with descriptive statistics and table visualizations. Then, I will map key DwC data fields and categories for easy faceting of the summary table. In addition, the developed framework can be used to enhance the flagging system, by using its functionality to summarize the results of the data quality checks.
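
To make the idea concrete, a bare-bones version might count records over two DwC fields commonly used for faceting; dwc_summary() is an assumed name, not the planned API.

dwc_summary <- function(df) {
  list(
    records          = nrow(df),
    basis_of_record  = table(df$basisOfRecord, useNA = "ifany"),
    records_per_year = table(format(as.Date(df$eventDate), "%Y"))
  )
}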

3. Outliers analysis:

Identifying spatial, temporal, and environmental outliers can single out erroneous records. However, identifying an outlier is a subjective exercise, and not all outliers are errors. I will develop a set of functions to aid in the detection of outliers, evaluating various statistical methods and techniques (e.g. reverse jackknife, standard deviations from the mean, alpha hull).
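
As one of the simpler candidates, a standard-deviations-from-the-mean check might look like the sketch below (illustrative only; the threshold of three standard deviations is arbitrary).

flag_sd_outlier <- function(x, n_sd = 3) {
  # TRUE where a value lies more than n_sd standard deviations from the mean
  abs(x - mean(x, na.rm = TRUE)) > n_sd * sd(x, na.rm = TRUE)
}

# e.g. flag_sd_outlier(df$decimalLatitude)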

4. Developing new data quality checks and procedure

I will identify critically missing spatial, taxonomic and temporal data cleaning routines, factoring in user needs and programming complexity. Ideas and needs regarding this task will be highly appreciated (please create a GitHub issue).

Significance

Improving the quality of biodiversity research rests, in some measure, on improving user-level data cleaning tools and skills. Adopting a more comprehensive approach to incorporating data cleaning into data analysis will not only improve the quality of biodiversity data, but will also promote more appropriate usage of such data. This can greatly serve the scientific community and, consequently, our ability to address urgent conservation issues more accurately.

Feedback

For feedback and suggestions, please post on GitHub issues.

GSoC 2017 : Integrating biodiversity data curation functionality

8 May

By Thiloshon Nagarajah

Any data used in data science analyses, whether simple statistical inference or high-end machine learning, needs to meet a certain quality. Any dataset has ‘signal’, the answer we are trying to find, and ‘noise’, the disturbances and anomalies in the data. An important part of preparing data for any analysis is making it easier to distinguish the noise from the signal. In biodiversity research, datasets can be very large, so there is a high probability of a lot of noise. Giving researchers control over this noise reduction will yield clean and tidy data.

Biodiversity research spans a huge spectrum, from analyzing heredity, climate and ecosystem impacts on species to complex genome sequencing research. The data requirements in each of these fields vary with the type of research. Taxonomic researchers will be interested in taxonomic fields and not so much in the spatial or temporal aspects of the data; they will be fine with losing spatial data in exchange for better taxonomic data, whereas spatially oriented researchers will be lax about taxonomic fields but not about spatial ones. This application-specific control over cleaning and preparing data makes the subsequent processing immensely more efficient.

The gist of this project and its sister project (Biodiversity data cleaning, by Ashwin Agrawal) is to provide this customizable control over the cleaning of the data. The cleaning and standardization are not done on the fly; they are customized, tweaked and refined as per user needs, and our solution provides the controls for that. The brief proposal of our solution is given below. For the full proposal, click here.


Introduction

The importance of data in biodiversity research has been repeatedly stressed in recent times, and various organizations have come together to provide data for advancing biodiversity research. But that is exactly where the main hiccup of biodiversity research lies: since there are many such organizations, the data aggregated by them vary in precision and quality. Further, though more researchers have recently started to use R for their data analyses, since they need to retrieve, manage and assess data with a complex (DwC) structure and high volume, only researchers with a very sound R programming background have been able to attempt this.

Various R packages created so far have each focused on addressing some elements of the entire process.

Thus when a researcher decides to use these tools, he needs to

  1. Know these packages exist
  2. Understand what each package does
  3. Compare and contrast packages offering same functionalities and decide the best for his needs
  4. Maintain compatibility between the packages and datasets transferred between packages.

What we propose:

We propose to create an R package that will function as the main data retrieval, cleaning, management and backup tool for researchers. The functionality of the package is a culmination of various existing packages and enhancements to existing functions, rather than something created from scratch. This way we can cultivate existing resources, collaborative knowledge and skills, and also address the problem we identified efficiently. The package will also serve researchers who do not have sound R programming skills.

Before we analyze solutions, it’s important to understand the stakeholders and scope of the project.


Data Flow

Proposed Package:

The package will cover major processes in the research pipeline.

  1. Getting biodiversity data into the workspace
    Biodiversity data can be read from existing DwC archive files in various formats (DwC-A, XML, CSV) or downloaded from online sources (GBIF, VertNet). So, functions to read local files in XML, DwC-A and CSV formats, to download data directly from GBIF, and to retrieve data from the respective APIs will be included. In case the user does not know what data to retrieve, name-suggestion functions will also be included: converting a common name to a scientific name, a scientific name to a common name, and names to taxon keys will all be covered. Further, functions to convert the data to simple data frames and to retrieve media associated with occurrences will also be included. (A minimal sketch of reading a local DwC archive appears just after this list.)
  2. Flagging the data
    Biodiversity data is aggregated by many different organizations, so the data vary in precision and quality. It is necessary to first check the quality of the data and strip the records which lack the expected quality before using it. Several packages built thus far, such as scrubr and rgeospatialquality, can check data for various discrepancies; integrating the functionality of such packages into a better quality control will benefit the community greatly. The data will be checked for the following discrepancies:

    1. In spatial fields – incorrect, impossible, incomplete and unlikely coordinates, and invalid countries and country codes
    2. In temporal fields – missing or incorrect dates in all time fields
    3. In taxonomic fields – epithet, scientific-name and common-name discrepancies, also fixing scientific names
    4. Duplicate records will also be flagged.
  3. Cleaning data
    The process is done step by step, to help the user configure and control the cleaning.
    The data will be flagged first for various discrepancies, using any combination of spatial, temporal, taxonomic and duplicate flags the user specifies. The user can then view the data he would lose and decide whether to tweak the flags. When the data volume is high, this procedural cleaning helps the user for a number of reasons:

    1. When the user wants multiple flags, he can apply each quality check one by one and decide whether to remove the flagged data after each check. If he applied all flags at once, the records to be removed would be large in number, and he would have to put in much effort going through all of them to judge each quality check.
    2. When the user views the flagged data, not all fields are shown: only the fields the quality check was done on, plus the flagged result. This saves the user from having to deal with all the complex fields in the original data, and from going back and forth through the data to check the flags.
      If he is satisfied with the records he will lose, the data can then be cleaned.

    Ex:

    # Step 01
    biodiversityData %>%
      coordinateIncompleteFlag() %>%
      allFlags(taxonomic)

    # Step 02
    viewFlaggedData()

    # Step 03
    cleanAll()
    
    
  4. Maintaining backups of data
    The original data the user retrieved, and any subsequent resultant data of his process, can be backed up with versioning to maintain reproducibility. Functions for maintaining repositories, backing up to repositories, loading from repositories and archiving will be implemented.
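
Returning to step 1 above, here is a minimal sketch of reading a local DwC archive’s occurrence core with base R alone; the file names follow the DwC-A convention, and no helper package is assumed.

# DwC-A text cores are tab-separated with a header row; quoting is disabled
# because DwC text fields can contain stray quote characters
occurrence <- read.delim(unz("dwca.zip", "occurrence.txt"),
                         quote = "", stringsAsFactors = FALSE)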

Underneath the package, we plan to implement these:

  1. Standardization of data
    The data retrieved will be standardized according to the DwC format. To maintain consistency and feasibility, I have decided to use the GBIF fields as the standard (a field-renaming sketch follows this list). The reasons are:
    1. GBIF uses DwC as its standard, and the fields from the GBIF backbone comply with DwC terms.
    2. GBIF is a well-established organization, and using the fields from its backbone will ensure consistency and acceptance by researchers.
    3. GBIF is a superset of all the major biodiversity data available, so data found in other sources can generally be expected to be in GBIF as well.
  2. A unique field-grouping system for DwC fields, based on recommended groupings (see this and this)
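
As a toy illustration of the standardization step in point 1, the sketch below renames assumed source columns to their DwC/GBIF-backbone equivalents; only the target names are taken from the standard, while the source names and the function itself are hypothetical.

standardize_fields <- function(df) {
  # map assumed source column names to DwC terms used by the GBIF backbone
  mapping <- c(lat = "decimalLatitude", lon = "decimalLongitude",
               date = "eventDate", species = "scientificName")
  hits <- names(df) %in% names(mapping)
  names(df)[hits] <- mapping[names(df)[hits]]
  df
}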

Conclusion

We believe a centralized data retrieval, cleaning, management and backup tool will benefit the biodiversity research community and remove many shortcomings of the current research process. The package will be built over the course of the next few months, and any insights, guidance and contributions from the community will be greatly appreciated.

Feedback

Please give us your feedback and suggestions on GitHub issues.

Visualizing bdsns data using bdvis

12 Aug

One of the tasks in my Google Summer of Code 2015 project was to integrate the new package bdsns with the existing package bdvis, to identify strengths and gaps in the data. This can be achieved with a few simple steps.

Begin by installing and loading both packages:

library(devtools)
install_github("vijaybarve/bdsns")
install_github("vijaybarve/bdvis")
library(bdsns)
library(bdvis)

Get data for a few species of butterflies from Flickr using the bdsns package, and store it in an SQLite database. The user needs to get their own API key from the Flickr website, here. We use a file containing a few scientific names of butterfly species:

bflytest.txt
scname
Graphium agetes
Graphium antiphates 
Graphium aristeus
Colias nilagiriensis
Dercas verhuelli
Eurema andersoni 
Gonepteryx rhamni
Hebomoia glaucippe
Euripus nyctelius 
Hestinalis nama
Mimathyma ambica 
Ariadne merione
Byblia ilithyia
Abisara echerius
Abisara neophron 
Zemeros flegyas
Curetis thetis
Heliophorus epicles
Spalgis epeus
Hasora badra
Hasora chromus
Gangara lebadea
Gangara thyrsis

And then we are all set to run the command to download and store the data in the SQLite database.

flickrtodatabase(myapikey,"bflytest.txt",
                  "scname","testdb")

Read the data back in from the SQLite database:

dat=extract_flickrdb("testdb","t1.csv")

Set up the data for use in bdvis. The function format_bdvis will set the field names for scientific name, latitude, longitude and date in the bdvis format, and also assign grid cell ids. The function gettaxo will fetch and store the higher taxonomy of the species.

dat=format_bdvis(dat)
dat=gettaxo(dat)

Now bdvis functions can be used for visualizations

mapgrid(dat)
tempolar(dat)
taxotree(dat)
chronohorogram(dat)
bdcalenderheat(dat)

Here is a sample of what this code will produce:


MapGrid output of Butterfly Data


Temporal output of daily butterfly data

Taxotree output of butterfly data

Chronohorogram of butterfly data

Calendar heat map of butterfly data

Please note that the results may not match these exactly, since new photographs are continuously being posted on Flickr.

Read more about bdsns here

Barve, V. (2014). Discovering and developing primary biodiversity data from social networking sites: A novel approach. Ecological Informatics, 24, 194–199. doi:10.1016/j.ecoinf.2014.08.008

GSoC Proposal 2014: package bdvis: Biodiversity Data Visualizations

17 Mar

Update: The proposal has been approved for participation in Google Summer of Code 2014. I will post updates on the progress on the blog once the coding phase starts.

I am applying for Google Summer of Code 2014 again, with the proposal “Biodiversity Data Visualizations using R”. We propose to take the package bdvis to the next level by adding more functions and making it available through CRAN. I am posting this idea to get feedback and suggestions from the biodiversity informatics community.

[Over the next few days I will keep updating this post to accommodate suggestions. The example visualizations here are crude illustrations of the ideas, and need a lot of work to become reusable functions.]

Background

The package bdvis is already under development and was a successful project in GSoC 2013. As of now the package has basic functionality for biodiversity data visualization, but with a growing user base, requests for additional features are coming up. We propose to add the user-requested functionality and implement some new functions to take bdvis to the next level. The following are the major tasks of the proposed project.

  1. Fix currently reported bugs and complete documentation to submit package to CRAN.
  2. Implementation of additional features requested by users.
  3. Develop seamless data support.
  4. Additional functions for visualizations.
  5. Prepare detailed vignette.

User requested features

The features and functionality requested by users so far are the following:

  • A versatile function to subset the data, based on taxonomy (a species, genus, family, etc.) or on dates (a particular year, a range of years, and so on).
  • tempolar: ability to show average records per day/week/month rather than just the raw numbers shown currently.
  • taxotree: additional parameters to control the diagram (title, legend, colors), plus the ability to base the summary on number of records, number of species or higher taxonomy.
  • bdsummary: the number of grid cells covered by data records, and the percentage of the bounding box covered.
  • Visualization of the output of the completeness analysis (the bdcomplete function).
  • Improved gettaxo efficiency, by adding the ability to search by genus rather than by full scientific name as at present. This could be an option, in case the user needs to search by full scientific names for some reason.

Data formats support

Develop functions for seamless support, within the R environment, of the major available biodiversity occurrence data formats, so they work with the bdvis package. A preliminary list of packages that make data available is rgbif, rvertnet, rinat and spocc. We will get feedback from the user community on additional data sources they might be using and incorporate them into the worklist.
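
As an example of what seamless support should eventually feel like, the sketch below pulls occurrences with rgbif’s occ_search and renames two columns; the renaming step and the field names bdvis expects are assumptions for illustration.

library(rgbif)

occ <- occ_search(scientificName = "Danaus plexippus", limit = 500)$data

# rename rgbif's DwC-style columns to names assumed here for bdvis
names(occ)[names(occ) == "decimalLatitude"]  <- "Latitude"
names(occ)[names(occ) == "decimalLongitude"] <- "Longitude"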

Additional visualizations

  • Distribution of collection efforts over time (line graph) [Fig 1, Soberon et al 2000]


  • Distribution of number of records among taxa and cells (histogram) [Figs 3–4, Soberon et al 2000]


  • Distribution of number of species among cells (histogram) [Fig 5, Soberon et al 2000]
  • Completeness vs number of species (scatterplot) [Fig 6, Soberon et al 2000]
  • Record densities by day of year and week of year [Otegui 2012]


  • Records per year dot plots [Otegui 2012]


  • Calendar heat maps of the number of records or species recorded


Interactive Map of records

A function to plot records on an interactive map. The plan is to develop a function that generates a geoJSON-based map in an HTML/JavaScript file; the user can then open the file in a web browser to explore the records. Considering performance, we might have to restrict the number of records for this function.

geoJSON example screenshot

Vignette preparation

Prepare test data sets for the vignette: three data sets, one with global geographical coverage and wide species coverage, a second with country-level geographical coverage and class- or order-level species coverage, and a final one with a narrow species selection, perhaps at genus level, to demonstrate the functionality. Write up code and an explanation of each function in the package, and add result tables, graphs and maps to complete the vignette.

References

  • Otegui, J., & Ariño, A. H. (2012). BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics (Oxford, England), 28(16), 2207–8. doi:10.1093/bioinformatics/bts359
  • Soberón, J., Llorente, J., & Oñate, L. (2000). The use of specimen-label databases for conservation purposes: an example using Mexican Papilionid and Pierid butterflies. Biodiversity and Conservation, 9, 1441–1466. Retrieved from http://www.springerlink.com/index/H58022627013233W.pdf

Temporal visualization of records of IndianMoths project using bdvis

13 Aug

I was looking for a data set with some bias in its temporal data, and thought of checking the data from the iNaturalist project IndianMoths, which is aimed at documenting the moths of India. The project was initiated in July 2012 but really picked up steam in January 2013, with members contributing regularly, a minimum of 100 records per month. Since the project has not yet completed one year, I thought it might have some bias from the missing months. Another source of bias could be the fact that moths are not seen in the same numbers throughout the year.

IndianMoths project on iNaturalist


To explore this data, I first downloaded it as a .csv file and loaded it into R.
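
Assuming the exported file is named IndianMoths.csv, the load step would be something like:

imoth = read.csv("IndianMoths.csv")
bdvis::summary(imoth)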

The data summary looked like this:

Total no of records = 2958
Bounding box of records Inf , Inf - -Inf , -Inf
Taxonomic summary...
No of Families : 0
No of Genus : 0
No of Species : 0

This tells us that the data is read by the package, but it has not understood the format well, and we might have to do some transformations to get it going with our package. So let us use the function fixstr to get the data into the (somewhat) required format.

imoth=fixstr(imoth,Latitude="latitude",
                 Longitude="longitude",
                 DateCollected="observed_on")

Now let us check the summary again


 Total no of records = 2958
 Date range of the records from  0208-07-26  to  2013-08-07
 Bounding box of records  6.660428 , 72.8776559  -  32.5648529099 , 96.2124788761
 Taxonomic summary...
 No of Families :  0
 No of Genus :  0
 No of Species :  0

Now we have the dates and latitude-longitudes in a form that our package can understand. A quick glance at this data summary shows that there is some problem with the dates: we have one record from the year 208, which must be a typo for 2008. And the data is all from in and around India, judging by the bounding box values.
We still need to get the taxonomy in place, but we will leave that for later and start working with this data. Let us create temporal plots of this data at daily, weekly and monthly timescales.

tempolar(imoth,title="Daily Records")
tempolar(imoth,title="Weekly Records",timescale="w")
tempolar(imoth,title="Monthly Records",timescale="m")

would produce the following three plots.

Indian Moths Daily Records

These are records per calendar day, and we see that two or three days in April have a very high number of records compared to other dates. This could be due to some targeted survey during that time. It also shows that we do not have many records from September through April.

Indian Moths Weekly Records

The weekly aggregation of the same records highlights the fact that April does have a spike in numbers; otherwise the number of records seems fairly uniform.

Indian Moths Monthly Records

The monthly plot shows that April has more than 800 records, whereas no other month has more than 500.

This could be due to several reasons, but mainly reflects the activity of this particular project.

bdvis development version available for early feedback

31 Jul

Google Summer of Code 2013 is halfway through, and midterm evaluations are underway. I thought this was a good logical point to share what we have been doing on the Biodiversity Data Visualizations in R project and to open up the package for testing and some early feedback. We have named the package bdvis. The package is on GitHub, and I would appreciate it if you could install and test it. Feedback may be given in the comments here, via issues on GitHub, by Twitter or by email.

Getting data

The data was obtained from the data portal of the Global Biodiversity Information Facility (http://data.gbif.org). The data set we are looking for is the iNaturalist research-grade records. We accessed the datasets page at http://data.gbif.org/datasets/ and selected the iNaturalist.org page from the alphabetic list, which is at http://data.gbif.org/datasets/provider/407. Once on this page, use the link Explore: Occurrences, and then, from the next page, click Download: Spreadsheet of results. On that page make sure Comma separated values is selected, then press the Download Now button. The website may take a few minutes to make your download ready; once it is, the download link will be provided. Typically the name of the file will be occurrence-search-12345.zip, where the number may have as many as 40 digits. Use the link to download the .zip file and then extract the data file occurrence-search-12345.csv into the working directory of R. Since this file has a long name, let us rename it to inat.csv for convenience.

Now we are ready to load our data.

inat = read.csv("inat.csv")
dim(inat)

If it shows something like

[1] 66581    47

we are on the right track: our data is loaded into R. For the time being this package handles only the GBIF-provided data format, but built-in functions for getting user-generated biodiversity data into this format are being worked out.

Package installation

Now let us install the bdvis package. First we need to get the devtools package, which will let us install packages from GitHub (rather than CRAN).

install.packages("devtools")
require(devtools)

install_github("bdvis", "vijaybarve")
require(bdvis)

if this produces something like

Loading required package: bdvis

Attaching package: ‘bdvis’

The following object(s) are masked from ‘package:base’:

summary

we are on the right track. Our package is installed and loaded into R.

Package functions

1. summary

Let us start playing with the functions now. We have the data loaded in the inat data frame.

bdvis::summary(inat)

Should produce something like:

Total no of records = 66581
Date range of the records from  1710-02-26  to  2012-12-31
Bounding box of records  -77.89309 , -177.37895  -  78.53431 , 179.2615
Taxonomic summary...
No of Families :  1394
No of Genus :  5089
No of Species :  11299

What does this tell us about our data?

  • We have 66581 records in the data set.
  • The date range is from 1710 to 2012. (Do we really have a record from 1710? Looks like we have a problem there.)
  • The bounding box is almost the whole world. Yes, this is a global data set.
  • Many families, genera and species are represented in this data set.

I have two questions here:

  1. What more would you like to get in the summary?
  2. Should I rename the function summary to something else, so that it does not clash with the usual data frame summary function name?

2. mapgrid

Now let us generate a heat map of the records in this data set, showing the density of records in different parts of the world. To generate this map:

mapgrid(inat,ptype="species")
mapgrid output for iNaturalist data

ptype could be records if we need the map based on raw records rather than records aggregated to species. Again, the questions:

  • What more options would you like to see here?
  • The ability to zoom into a certain region?
  • Control over the color palette?

3. tempolar

Now coming to the temporal visualizations: the function tempolar makes polar plots of temporal data at daily, weekly and monthly timescales. The code and samples are as follows:

tempolar(inat,color="green",title="iNaturalist daily"
          ,plottype="r",timescale="d")
tempolar(inat,color="blue",title="iNaturalist weekly"
          ,plottype="p",timescale="w")
tempolar(inat,color="red",title="iNaturalist monthly"
          ,plottype="r",timescale="m")
Daily plot of temporal data. Each line represents the records on one day of the year.

Weekly plot of temporal data. Plottype polygon is used here.

Monthly plot of temporal data. Each line represents the records in that month.

Here, options to control the color, title, plot type and of course the timescale are provided.

We are less than halfway through our original proposal and will continue actively building this package. As I add more functionality, I will post more information on the blog. Until then, keep the feedback flowing and tell us what more you would like to see in this package.