Archived - The Biological Observations, Specimens and Collections (BiOSC) Gateway
Archived information is provided for reference, research or recordkeeping purposes. It is not subject to the Government of Canada Web Standards and has not been altered or updated since it was archived. Please contact us to request a format other than those available.
and Derek Munro
Agriculture and Agri-Food Canada, Ottawa
Publication date: 26 July 2002
Note: The BiOSC Gateway was actively maintained between 2001 and 2004 and deactivated in 2006. This article is kept for historical purposes.
As part of the Canadian participation in the Global Biodiversity Information Facility (GBIF), Agriculture and Agri-Food Canada (AAFC) is introducing a prototype information gateway to Biological Observations, Specimens and Collections (BiOSC Gateway). The BiOSC Gateway is a global metadata search engine that cross-walks multiple distributed biodiversity networks and facilitates access to their combined holdings. As of early January 2002, the BiOSC Gateway contained approximately 2.4 million biological records. The gateway associates a biological names harvester with a multilingual taxonomic authority file (ITIS*North America), providing international users with access to biodiversity records by scientific name, vernacular name, synonymy, country of collection or geographic coordinates. Each individual metadata record (either an individual specimen or a single observation) is hyperlinked directly to its primary source on its native biodiversity network, ensuring that end-users access detailed records exclusively via the facility of the record owners, who remain in full control of their records. Interactive world maps are provided for those biological records that are associated with explicit latitude and longitude coordinates. This paper outlines the difficulties encountered and lessons learned when building a search engine for global biodiversity networks. Specific recommendations to overcome major obstacles in building the next-generation biodiversity records search engine are provided.
Biodiversity is distributed all over the Earth, with the highest concentration in the tropical regions, especially in developing countries, and in the oceans. In contrast, scientific information about biodiversity (including most type specimens) is largely concentrated in major centres in developed countries, especially in the scientific collections of the world's natural history museums, herbaria, genebanks and culture collections of microorganisms.
It is difficult for scientists, environmentalists, policy makers or any other users to determine what biodiversity is present in a given area. This determination is largely based on collected specimens housed in research facilities all over the world, as well as from environmental assessment reports. Traditionally, access to collection information has only been possible by either visiting the collections or by borrowing the specimens. Even in developed countries, a large proportion of relevant specimens are housed in foreign collections, frequently located on other continents.
The "holy grail" of biological specimen collections and observations would be the ability to find all the information about a specific taxon or about taxa from a defined location, no matter where in the world the information resides. In fact, plans for the Global Biodiversity Information Facility (GBIF) are to ultimately interconnect databases to provide information about all 1.8 million species of organisms - from bacteria to whales - that have received scientific names, including access to data on the approximately 3 billion specimens located in the world's natural history collections. One of the major goals is to obtain precisely geo-referenced data, because once location data of taxonomic entities are obtained, they can be submitted to biodiversity modelling techniques (such as WhyWhere - http://biodi.sdsc.edu/) to unravel correlations with other environmental layers (such as temperature, rainfall, soils, topography, etc.) in order to predict change or explain the distribution of biodiversity.
Over the last few decades, a number of institutions have initiated the computerization of their natural history collections. Many of these are available over the Internet via custom or stand-alone interfaces. A growing number of institutions are now joining efforts to build distributed systems, through which a single query form sends a request to many participating sites and combines the returned results into a single output that can be text- or map-based. Several of these networks coexist now, each of them using a slightly different approach and different technologies to access the data and make them available to end-users. Technologies associated with distributed queries are moving targets. There are already many "standards" to choose from. Undoubtedly each of these standards will change, and new ones will emerge as the technology and our ability to make use of it evolve. As of early 2002, examples of functional distributed biodiversity specimen or observation networks were:
- The Species Analyst (TSA)
Over 2 million specimens from 82 institutions; one shared data exchange model (second version in draft)
- The World Biodiversity Information Network
or "Red Mundial de Información sobre Biodiversidad (REMIB)"
Over 4 million specimens from 62 collections; one shared data exchange model, gathering in real-time from at least 6 different data models
- The European Natural History Specimen Information Network (ENHSIN)
Less than one million specimens from 6 institutions; one shared data exchange model
- Australia's Virtual Herbarium (not included in this prototype)
Targeting 6 million specimens from 6 Australian institutions; one shared data exchange model
Rationale for Prototyping the Biological Observations, Specimens and Collections Gateway
There is an obvious need to build a dedicated search engine that cross-walks all existing biodiversity networks and gathers information on the Internet location of primary biodiversity records and how to access them. Very few potential users are aware of the existence of these many networks. None of the current biodiversity networks can retrieve from all others. Each network has its own (often experimental) interface, its own data exchange model, its own data access protocol. Most of the models are still evolving, and access procedures change frequently. Sites often bend the rules of the data models that they are in principle adhering to (for example, storing alphanumeric text where numbers are expected, or country names where ISO codes are expected). Considering the still relatively small size of the current networks, distributed queries work relatively fast for now, but they are not expected to scale up very well as the number of participating institutions grows. Most of the current networks are very institution-centric rather than being taxa-centric or geo-centric. They typically display long lists of participating institutions, and users are expected to select those institutions to which they wish to submit a query. This is easy when there are only 5 or 6 institutions to choose from; it becomes tedious when there are more than 10, and soon ludicrous as the list keeps growing. As there are thousands of natural history collections and observation repositories around the globe, and as more and more of them are gradually made available over the Internet, it will quickly become almost impossible to query all potentially relevant sites in a single distributed query. Who wants to select manually amongst hundreds of sites to locate all known occurrences of a certain butterfly, or of a certain invasive species?
What is needed is a search engine that indexes biodiversity primary location records irrespective of network technology, transport protocol, physical location, holding institution, and underlying data model. A good search engine should work in a multilingual context and ultimately offer ways to query by continent, country, even state or province, independently of the language used on the original label. The search engine should also support entry points using vernacular names where possible, automatically identify common misspellings and synonyms, and dynamically suggest alternative names ranked by probability of correctness.
Current Internet search engines (such as Google, AltaVista, and others) do not index biodiversity networks. Many of the institutions participating in the current biodiversity networks simply do not yet wish to have all their content indexed by search engines for full-text retrieval or warehoused elsewhere. In addition, for now, search engines are simply blocked by a feature common to all four current biodiversity networks. Each network is publicly accessible via a query form that requires user input and the click of a mouse on a submit button. None of the networks is completely browsable in the way a standard web site can be automatically crawled by robots following chains of hyperlinks radiating from a home page. Technically, however, this impediment can be easily overcome.
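One way to overcome this impediment is for a harvester to submit each network's public query form programmatically rather than by mouse click. A minimal sketch in Python, using only the standard library; the endpoint URL and form field names here are hypothetical placeholders, not any network's actual interface:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_form_body(fields):
    """Encode form fields exactly as a browser would on Submit."""
    return urlencode(sorted(fields.items())).encode("ascii")

def query_network(form_url, fields):
    """POST the encoded form to a network's public query endpoint,
    standing in for the click on the submit button. The URL and field
    names passed in are illustrative assumptions."""
    request = Request(form_url, data=build_form_body(fields))
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

In practice the harvester must also cycle systematically through query values (for example, one genus at a time) so that every record is eventually retrieved.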
- Models underlying the networks were compared and the appropriate query syntax was derived from the default public query form on each of the selected networks.
- As most data owners are reluctant to see their collections fully indexed by search engines, a minimal number of common fields were selected to build the BiOSC prototype: taxonomic identification (usually a binomial or trinomial), primary geographic division (usually a country), date of collection, and longitude/latitude of collection (when explicitly provided).
- Each network was contacted to obtain formal permission to build the prototype index.
- A custom-built "harvester" or "robot" was designed to crawl each site of each network, systematically cycling through the data to make sure that all records were retrieved.
- Data gathered by the harvester were parsed in order to retain the content of fields required for the index, the collection source and the query path to each of the primary records, and loaded into an Oracle database.
- Partially normalized connecting tables were created to match harvested taxonomic names with the Integrated Taxonomic Information System (ITIS - http://www.cbif.gc.ca/pls/itisca/taxaget?p_ifx=cbif&p_lang=en) and to reduce the number of spelling variants in primary geographic divisions. This operation required the largest amount of human intervention.
- A public Web query interface to BiOSC was written using PL/SQL and PHP.
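The parsing step above can be sketched as a small mapping function that reduces a harvested record to the indexed fields plus the query path back to its primary source. The element names on the right are illustrative assumptions; each network's schema names these differently:

```python
def extract_metadata(raw, source_url):
    """Reduce a harvested record (a dict of schema elements) to the
    minimal fields indexed by the prototype. Element names such as
    'ScientificName' are placeholders, not a real network schema."""
    return {
        "taxon": raw.get("ScientificName"),
        "geo_division": raw.get("Country"),
        "collected": raw.get("YearCollected"),
        "longitude": raw.get("Longitude"),
        "latitude": raw.get("Latitude"),
        "source_url": source_url,  # query path back to the primary record
    }
```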
The BiOSC Gateway is a metadata search engine that provides direct access to approximately 2.4 million biological records from about 100 institutions available via three biodiversity networks. The gateway cross-walks the three networks and data-holding institutions within networks to get to specific records of interest to end-users. As there are many genus-species combinations that are not recorded in a taxonomic authority file such as ITIS, the taxonomic names gathered by the harvester component have to be directly accessible for query. Taxa can be retrieved for all of their occurrences, or filtered by some primary geopolitical divisions. Nearly 100,000 biological names have been harvested from the 3 networks. The 25,000 taxonomic names that occur in ITIS are fully integrated (from the Canadian sites of ITIS) with the BiOSC Gateway and the three biodiversity networks. Through ITIS, BiOSC records can be accessed via scientific names and synonyms or via vernacular names in three languages. Summary counts of BiOSC records by taxonomic groups are available through the browsable ITIS taxonomic hierarchy. All BiOSC metadata records are hyperlinked directly to their original label as provided in real time by data-holding institutions. Dynamic mapping capabilities have been added for those records supplied with explicit geographic coordinates (longitude and latitude). Clicking on points plotted on the maps brings up the actual record(s) retrieved directly from the original data-holding institution in real time. A bounding box query, retrieving all taxa collected from any given square grid (between 1 and 4 degrees in size), is also available for records provided with geographic coordinates. When a collection is offline, the BiOSC Gateway automatically stops serving hyperlinks to records from that collection.
There are many regional and international efforts to integrate access to biological information across data holders and networks. Most of them are still at the conceptual stage. Some of these initiatives include:
- CODATA Working Group on Biological Collection Data Access
- Distributed Generic Information Retrieval
- Biological Collection Access Service for Europe (BioCASE)
Each of these initiatives can benefit from the insights obtained in constructing the prototype BiOSC Gateway. Certainly, the feasibility of a global search engine for biodiversity occurrence records as envisioned by GBIF is demonstrated by the current prototype BiOSC Gateway. Much more and much better could be done, but there will be many obstacles to overcome. The main obstacles are cultural. Data owners are still reluctant to truly liberate and share their data. There are many reasons for this and most of them are beyond the scope of this paper, but data quality is a big concern. Progress is being made, but it will take years and true success stories to really build the community. Other related key issues are the need for global standards for the entry and exchange of new biodiversity data, and the obvious lack of tools to facilitate the correction of old entries and ensure their compliance with at least one standard.
The following "lessons learned" are meant to facilitate these and other initiatives in their efforts. Comments about the networks and examples from individual data holding institutions are meant to be constructive and in no way should be considered as complaints or negative comments.
- Metadata mining was done with formal permission from the following three biodiversity networks. Any additional indexing activity will require additional negotiations:
- The Species Analyst (TSA) and Species Analyst Canada
- World Biodiversity Information Network (REMIB)
- European Natural History Specimen Information Network (ENHSIN)
- TSA and REMIB are production systems. ENHSIN was initiated to develop and evaluate a pilot network as a demonstrator and basis for further discussions. It is not considered a production system at this time. TSA currently uses the Z39.50 protocol and the other networks use combinations of TCP/IP and HTTP protocols. Both TSA and ENHSIN can provide XML output. TSA is unique in that it allows querying on any element of the schema.
- TSA and ENHSIN return null fields (or missing elements with XML output). REMIB uses phrases such as nulo, Restrictado, and NDPR for text fields and impossible numbers for numeric fields (e.g. 99 for LatitudeDegrees and -999 for LongitudeDegrees). The use of the Restrictado element is useful because it indicates that there is *hidden* information available for those elements.
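A gateway therefore has to normalize these sentinel values before indexing. A hedged sketch, assuming only the sentinel strings and numbers listed above (a production list would be longer):

```python
# Sentinels taken from the examples in the text; assumed incomplete.
TEXT_SENTINELS = {"nulo", "ndpr", "restrictado"}
NUMERIC_SENTINELS = {99.0, -999.0, 9999.0}

def normalize_value(value, numeric=False):
    """Map network-specific null markers to None. The second return
    value flags 'Restrictado' fields, which exist but are hidden."""
    if value is None:
        return None, False
    if numeric:
        number = float(value)
        return (None, False) if number in NUMERIC_SENTINELS else (number, False)
    text = str(value).strip()
    if text.lower() == "restrictado":
        return None, True            # hidden but present
    if not text or text.lower() in TEXT_SENTINELS:
        return None, False
    return text, False
```

Keeping the restricted flag lets the gateway still expose the existence of a record whose details are reserved for authorized users.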
- The three networks currently have between 16 and 34 elements (fields) in their schemas for each biological record. There are few mandatory fields in the three schemas. The prototype BiOSC Gateway focuses on taxonomic identification, primary geographic division, longitude/latitude, and collection date. The following discussion is a reality check on the complexities encountered. Multilingualism is a major issue for most schema elements. It is difficult to expect most countries to remap their label data to a *standard* language.
- All three networks provide query fields for family, genus, and specific epithet. Additional hierarchical levels are included for ENHSIN and TSA, but are rarely populated. Many variations were encountered in the form of names. There were the expected infraspecific ranks such as subsp., ssp., var, f. There were also other qualifiers such as: ?, nr., nulo, cf, -cf, unidentified. These variations require significant human intervention to clean the names into a normal format (i.e. monomial/binomial/trinomial). This was done partially in order to compare the harvested names against ITIS. The process allows for seamless integration of BiOSC and ITIS when names or synonyms coincide, but requires a lot of human intervention and maintenance. For example, Anthidium n. sp. 2 "Mex." needs to be transformed into Anthidium for connection to ITIS, but the original form needs to be retained for display and requery of the data holding institution. Maintaining the connection tables is difficult and requires manual inspection of each variation encountered.
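The cleaning step can be approximated as follows. This is a heuristic sketch built from the rank markers and qualifiers listed above, not the actual rules used by the prototype:

```python
import re

# Markers taken from the examples in the text; assumed incomplete.
RANK_MARKERS = {"subsp.", "ssp.", "var", "var.", "f."}
UNCERTAIN = {"?", "nr.", "nulo", "cf", "cf.", "-cf", "unidentified"}

def clean_to_binomial(name):
    """Reduce a harvested name to monomial/binomial/trinomial form for
    matching against ITIS; the original string is retained elsewhere
    for display and requery. A heuristic sketch only."""
    tokens = []
    for tok in name.split():
        low = tok.lower()
        if low in RANK_MARKERS:
            continue                   # keep the epithet that follows
        if low in UNCERTAIN or not re.fullmatch(r"[A-Za-z-]+", tok):
            break                      # stop at '?', 'cf.', 'n. sp. 2', quotes
        tokens.append(tok)
    return " ".join(tokens[:3])
```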
- Records are only as good as the entered data. The following variations were found on a taxon: Apteronotus bonaparte, Apteronotus bonaparti, Apteronotus bonapartii, Apteronotus bonarparti. The number of spelling mistakes found on labels by the harvester is surprisingly high and emphasizes the need for truly global and complete taxonomic authority files, as well as a good taxonomic name service that would provide probability-ranked alternatives to common spelling mistakes and automated synonymy. Great care must be taken before automatically suggesting an alternative name, because names harvested on specimens in collections will include amongst other variations "collection names" that have never been published, and also many homonyms.
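A simple way to generate ranked alternatives is string-similarity matching against an authority file. A sketch using Python's standard difflib; the cutoff value is an assumption, and, as cautioned above, such suggestions should be reviewed by a person rather than applied automatically:

```python
import difflib

def suggest_names(name, authority_names, cutoff=0.85):
    """Return up to three authority-file names ranked by string
    similarity to a harvested name. Cutoff is an illustrative value;
    homonyms and unpublished 'collection names' make automatic
    replacement unsafe."""
    return difflib.get_close_matches(name, authority_names, n=3, cutoff=cutoff)
```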
- ENHSIN supports a finely dissected ("atomized") data model, but also provides a <text> element of the <taxon> node in the XML schema for information that does not fit into the other elements. This convenience has a perverse effect because collections can choose not to parse their data into the proper fields. For instance, some records on ENHSIN are copies of entire labels in the <text> element and do not make use of the atomized (structured) elements. The search engine has no "hook" to identify data elements with certainty (such as scientific name or primary geopolitical division) and build the appropriate metadata record. These records were simply ignored for the prototype BiOSC Gateway. It would be possible to make some use of these via a full-text index of the entire labels. This would be better than nothing, but extremely inefficient.
- The three biodiversity networks provide varying means of querying their networks for geographic locality. All three networks provide a "country" query that we call primary geographic division, because several collections have been using the country field to store province/state information, or something else. Additionally TSA allows querying all of the *locality* elements.
- Primary geographic division information is very complex. Country names are most commonly used, but there is a lot of variation. For example, there are several variations on Canada: Canada (fourteen data holdings), CA (three data holdings), CAN (one data holding), Canadá (one data holding). Similarly for the United States: US, USA, U S A, U.S.A, U.S.A. ?, U.S.A., Usa, Usa?, etats-unis. Then there are defunct country names (e.g. USSR, Czechoslovakia, Yugoslavia). Country names may be provided in the language of the data holding institution. One partial solution is to use a country name standard such as ISO 3166. ENHSIN provides a picklist based on the ISO 3166 country standard. However, the onus is on the individual data holders to input and interpret the country information to match the standard. For instance, one ENHSIN collection stores modified country names in that field instead of the expected normalized country ISO code. This makes the collection unsearchable using the ISO 3166 search functionality because the records contain strings that do not match the standard. This also raises the question of how far original label data can be *transformed* for standardization.
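One workable approach is a hand-maintained variant table mapping harvested strings to ISO 3166 alpha-2 codes. A sketch seeded only with the variants quoted above; the real connecting tables required far more entries and ongoing maintenance:

```python
# Variant table seeded from the examples in the text; assumed incomplete.
COUNTRY_VARIANTS = {
    "canada": "CA", "can": "CA", "ca": "CA", "canadá": "CA",
    "us": "US", "usa": "US", "u s a": "US", "u.s.a": "US",
    "etats-unis": "US",
}

def to_iso3166(raw):
    """Map a harvested primary-geographic-division string to an
    ISO 3166 alpha-2 code, or None when no equivalence is known
    (defunct countries such as USSR are left unresolved)."""
    key = raw.strip().lower().strip("`?. ")  # shed stray punctuation
    return COUNTRY_VARIANTS.get(key)
```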
- Obviously country information does not appear for collections from international waters or large bodies of water (seas or gulfs). Either geographic coordinates are provided or something descriptive such as: Ocean Indien Sud, Off Chile, Mediterranean Sea.
- Deriving a consistent metadata layer for geographic locations from the harvested records will require significant human intervention. There will be an ongoing effort to equate geographic variations to a *standard* (including abbreviations, multilingualism, and descriptive text such as Southern Pacific Ocean).
- Remapping defunct country names to current country names will require great effort and in some cases remapping will be impossible (e.g. USSR, Yugoslavia; it may be impossible to decide which current country the collection was from if more precise locality information was not provided).
- Many biological records from the data holding institutions are provided with geographic coordinates. TSA and ENHSIN provide coordinates in decimal degrees. REMIB uses degrees/minutes/seconds. For the purpose of dynamically mapping coordinates, REMIB coordinates are converted to decimal degrees. However, only TSA provides a means to query collections by GIS coordinates. A larger number of records are usually retrieved through ENHSIN or REMIB because querying is not allowed on specific GIS coordinates.
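The conversion applied to REMIB coordinates is straightforward; a minimal sketch, where southern and western hemispheres become negative decimal degrees:

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds (as served by REMIB) to the
    decimal degrees used for dynamic mapping. 'S' and 'W' yield
    negative values."""
    value = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    return -value if hemisphere in ("S", "W") else value
```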
- TSA allows bounding box queries as well. Bounding box queries are a powerful enabling technology for finding all known collections and observations across biological kingdoms from a defined geographic area.
- Unfortunately, several obvious errors are also found on coordinates. A common mistake is to omit the minus sign in front of west longitudes, resulting in maps showing collections from North America plotted in Europe. In other cases, one or both coordinates are obviously wrong (dots mapping in the wrong country). Also, some sites use zero (null) presumably to represent missing coordinates. These records plot on the equator at the meridian of Greenwich. The dynamic mapping component of the BiOSC Gateway can be used to help verify suspicious outliers.
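Such outliers can be flagged automatically before mapping. A sketch of the checks described above; the expect_west flag encodes external knowledge (for example, that a collection is North American) and is an assumption of this illustration:

```python
def flag_coordinates(lat, lon, expect_west=False):
    """Return the reasons a coordinate pair looks suspicious,
    reproducing the error patterns described in the text."""
    problems = []
    if lat == 0 and lon == 0:
        problems.append("0/0: likely missing coordinates, plots at the "
                        "equator on the Greenwich meridian")
    if not (-90 <= lat <= 90) or not (-180 <= lon <= 180):
        problems.append("out of range")
    if expect_west and lon > 0:
        problems.append("possible missing minus sign on west longitude")
    return problems
```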
- All three networks provide year, month, day elements in their schemas. In addition to being null or, in the case of REMIB, having impossible numbers (e.g. year = 9999), there were many other non-standard dates such as "purchased in 1960".
- Occasionally, odd numbers, looking like strange Julian dates or like the canonical internal representation of time in UNIX (for example "3157648") occurred in the year element.
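A harvester can reject implausible year values with a simple range check; sentinels like 9999 and UNIX-time-looking values like 3157648 fail it, and free-text entries fail to parse at all. The bounds below are illustrative assumptions, not the prototype's actual limits:

```python
def valid_collection_year(value, earliest=1700, latest=2002):
    """Return the year as an int if it parses and falls in a plausible
    range for a collection date, else None. Bounds are assumptions."""
    try:
        year = int(str(value).strip())
    except ValueError:
        return None                 # e.g. "purchased in 1960"
    return year if earliest <= year <= latest else None
```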
In order for GBIF to accurately answer a question as simple as "where has this species been found?", much more data and much better-performing search engines will be needed. The following recommendations are formulated to facilitate future developments.
- The three networks scanned by the prototype BiOSC Gateway include 16 to 34 elements in their schemas. Creating metadata tables from just four topics highlighted many data problems that required significant human intervention and interpretation. Several regional and international initiatives are underway to define theoretical schemas to capture digitized biological collections and observations information. Some of these theoretical schemas may be exhaustive. As a reality check, significant human intervention was required with only 6 or 7 atomized elements covering four topic areas. Any practical *standard* schemas should provide data holding institutions with a system that is manageable. If the system is too complex (with too many elements) it provides the data holders with an excuse not to participate. Even if they wish to participate, the degree of attribute dissection should not be intimidating (remembering that often data entry personnel are not experts in taxonomy and biology). Many institutions have developed their own proprietary online systems and will not participate in biodiversity networks if emerging standards are too complex. Many institutions will likely not rework their existing digitized information. It is unlikely that a schema with more than 50 elements will be functional in the short term.
- A "Date of Insert or Update" of each record (timestamp) should be a mandatory element. Currently there are no timestamp elements in any of the three schemas (though this is proposed for the draft TSA Darwin Core V.2). The timestamp represents either the insertion date or update date on each record. Without a timestamp on each record search engines will have to bulk collect everything each time they visit a site. For example Collection of Entomology (INBIO) currently serves 2,325,794 specimens through REMIB. Without an update stamp on each record it will be necessary to reaccess every record to find out if there have been any changes to existing records or newly added records. Being able to query the database by date of insert or update would considerably reduce the latency between the time a record is added or modified and its inclusion or update in the search engine and would also minimize network traffic and search engine harvestor load on servers.
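With a queryable timestamp, the harvest loop reduces to fetching only what has changed since the previous run. A sketch of that loop; fetch_since stands in for a network query filtered on the timestamp element (e.g. WHERE date_modified > :last_run), which no current schema yet supports:

```python
from datetime import datetime, timezone

def incremental_harvest(fetch_since, last_run):
    """Fetch only records inserted or updated since the previous run,
    instead of re-collecting every record on every visit. Returns the
    changed records plus the watermark for the next run."""
    watermark = datetime.now(timezone.utc)  # next run starts from here
    changed = fetch_since(last_run)         # timestamp-filtered network query
    return changed, watermark
```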
- A unique record identifier should be a mandatory element facilitating queries by search engines to retrieve a specific record in real time. In general the search is by biodiversity network/data holding institution/specific collection/individual record. Many collections have no unique record identifier. Neither ENHSIN nor REMIB allows searching by record identifier. Without a unique collection record identifier it is difficult for a search engine to return Internet clients to individual records at data holding institutions.
- It is important to expose collections even if some of the record information is restricted. Using an impossible string such as -9999 (either as a numeric or text value) in restricted elements allows data holding institutions to expose minimum information on taxa that are available only to authorized experts. Otherwise data holding institutions may restrict records entirely from their online information.
- It would be very useful if each of the biodiversity networks would enable query by any of their data elements. At the present time, only TSA supports this option.