The Biological Observations, Specimens and Collections (BiOSC) Gateway
by Guy Baillargeon
and Derek Munro
Agriculture and Agri-Food Canada, Ottawa
Publication date: 26 July 2002
Note: The BiOSC Gateway was actively maintained between 2001 and 2004 and deactivated in 2006. This article is kept for historical purposes.
As part of the Canadian participation in the Global Biodiversity Information Facility
(GBIF), Agriculture and Agri-Food Canada (AAFC) is introducing a prototype information gateway to Biological Observations, Specimens and
Collections (BiOSC Gateway). The BiOSC Gateway is a global metadata search engine that cross-walks multiple distributed biodiversity networks
and facilitates access to their combined holdings. As of early January 2002, the BiOSC Gateway contained approximately 2.4 million biological
records. The gateway associates a biological names harvester with a multilingual taxonomic authority file (ITIS*North America), providing
international users with access to biodiversity records by either scientific names, vernacular names, synonymy, country of collection or
geographic coordinates. Each individual metadata record (either an individual specimen or a single observation) is hyperlinked directly to
its primary source on its native biodiversity network, ensuring that end-users access detailed records exclusively via the facilities of the
record owners, who remain in full control of their records. Interactive world maps are provided for those biological records that are
associated with explicit latitude and longitude coordinates. This paper outlines the difficulties encountered and lessons learned when
building a search engine for global biodiversity networks. Specific recommendations to overcome major obstacles in building the next
generation biodiversity records search engine are provided.
Biodiversity is distributed all over the Earth, with the highest concentration in the tropical regions, especially in developing countries,
and in the oceans. In contrast, scientific information about biodiversity (including most type specimens) is largely concentrated in major
centres in developed countries, especially in the scientific collections of the world's natural history museums, herbaria, genebanks and
culture collections of microorganisms.
It is difficult for scientists, environmentalists, policy makers or any other users to determine what biodiversity is present in a given
area. This determination is largely based on collected specimens housed in research facilities all over the world, as well as from
environmental assessment reports. Traditionally, access to collection information has only been possible by either visiting the collections
or by borrowing the specimens. Even in developed countries, a large proportion of relevant specimens are housed in foreign collections,
frequently located on other continents.
The "holy grail" of biological specimen collections and observations would be the ability to find all the information about a specific taxon
or about taxa from a defined location, no matter where in the world the information resides. In fact, plans for the Global Biodiversity
Information Facility (GBIF) are to ultimately interconnect databases to provide information about all 1.8 million species of organisms - from
bacteria to whales - that have received scientific names, including access to data on the approximately 3 billion specimens located in the
world's natural history collections. One of the major goals is to obtain precisely geo-referenced data, because once location data of
taxonomic entities are obtained, they can be submitted to biodiversity modelling techniques (such as WhyWhere - http://biodi.sdsc.edu/) to unravel correlations with other environmental layers (such as temperature, rainfall,
soils, topography, etc.) in order to predict change or explain the distribution of biodiversity.
Over the last few decades, a number of institutions have initiated the computerization of their natural history collections. Many of these are
available over the Internet via custom or stand-alone interfaces. A growing number of institutions are now joining efforts to build
distributed systems, through which a single query form sends a request to many participating sites and combines the returned results into a
single output that can be text- or map-based. Several of these networks coexist now, each of them using a slightly different approach and
different technologies to access the data and make them available to end-users. Technologies associated with distributed queries are moving
targets. There are already many "standards" to choose from. Undoubtedly each of these standards will change, and new ones will emerge as the
technology and our ability to make use of them evolves. As of early 2002, examples of functional distributed biodiversity specimens or
observations networks were:
The Species Analyst (TSA)
Over 2 million specimens from 82 institutions; one shared data exchange model (second version in draft)
The World Biodiversity Information Network
or "Red Mundial de Información sobre Biodiversidad (REMIB)"
Over 4 million specimens from 62 collections; one shared data exchange model, gathering in real-time from at least 6 different data models
The European Natural History Specimen Information Network (ENHSIN)
Less than one million specimens from 6 institutions; one shared data exchange model
Australia's Virtual Herbarium (not included in this prototype)
Targeting 6 million specimens from 6 Australian institutions; one shared data exchange model
Rationale for prototyping the BiOSC Gateway
There is an obvious need to build a dedicated search engine that cross-walks all existing biodiversity networks and gathers information on
the Internet location of primary biodiversity records and how to access them. Very few potential users are aware of the existence of these
many networks. None of the current biodiversity networks can retrieve from all others. Each network has its own (often experimental)
interface, its own data exchange model, its own data access protocol. Most of the models are still evolving, access procedures are changing
frequently. Sites are often bending rules of the data models that they are in principle adhering to (for example, storing alphanumeric text
where numbers are expected, or country names where ISO codes are expected). Considering the still relatively small size of the current
networks, distributed queries work relatively fast for now, but these are not expected to scale up very well as the number of participating
institutions grows. Most of the current networks are very institution-centric rather than being taxa-centric or geo-centric. They typically
display long lists of participating institutions, and users are expected to select those institutions to which they wish to submit a query.
This is easy when there are only 5-6 institutions to choose from; it becomes tedious when there are more than 10 and soon ludicrous as the
list keeps growing. As there are thousands of natural history collections and observation repositories around the globe and as more and more
of them are gradually made available over the Internet, it will quickly become almost impossible to query all potentially relevant sites in a
single distributed query. Who wants to select manually amongst hundreds of sites to locate all known occurrences of a certain butterfly, or of
a certain invasive species? What is needed is a search engine that indexes biodiversity primary location records irrespective of network
technologies, transport protocol, physical location, holding institution, and underlying data model. A good search engine should work in a
multilingual context and ultimately offer ways to query by continent, country, even state or province, independently of the language used on
the original label. The search engine should also support entry points using vernacular names where possible, automatically identify common
misspellings and synonyms, and dynamically suggest alternative names ranked by probability of correctness.
Current Internet search engines (such as Google, AltaVista, and others) do not index biodiversity networks. Many of the institutions
participating in the current biodiversity networks simply do not yet wish to have all their content indexed by search engines for full text
retrieval or warehoused elsewhere. In addition, for now, search engines are simply blocked by a feature common to all four current
biodiversity networks. Each network is publicly accessible via a query form that requires user input and the click of a mouse on a submit
button. None of the networks is completely browsable in a way similar to a standard web site that can be automatically crawled by robots
following chains of hyperlinks radiating from a home page. Technically however, this impediment can be easily overcome.
- Models underlying the networks were compared and the appropriate query syntax was derived from the default public query form on each of
the selected networks.
- As most data owners are reluctant to see their collections fully indexed by search engines, a minimal number of common fields were
selected to build the BiOSC prototype: taxonomic identification (usually a binomial or trinomial), primary geographic division (usually a
country), date of collection, and longitude/latitude of collection (when explicitly provided).
- Each network was contacted to obtain formal permission to build the prototype index.
- A custom-built "harvester" or "robot" was designed to crawl each site of each network, systematically cycling through the data to make
sure that all records were retrieved.
- Data gathered by the harvester were parsed in order to retain the content of fields required for the index, the collection source and the
query path to each of the primary records, and loaded into an Oracle database.
- Partially normalized connecting tables were created to match harvested taxonomic names with the Integrated Taxonomic Information System
(ITIS - http://www.cbif.gc.ca/pls/itisca/taxaget?p_ifx=cbif ) and to reduce
the number of spelling variants in primary geographic divisions. This operation required the largest amount of human intervention.
- A public Web query interface to BiOSC was written using PL/SQL and PHP.
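The harvesting steps above can be sketched as follows. The endpoint URL, paging parameters, and source field names are hypothetical placeholders, since each network used its own interface and protocol; the sketch only illustrates the cycle-through-and-reduce approach.

```python
import json
import urllib.parse
import urllib.request

# Minimal fields retained for the index; the source field names on the
# right are hypothetical, as each network had its own schema.
FIELD_MAP = {
    "name": "ScientificName",
    "country": "Country",
    "date": "CollectionDate",
    "lat": "Latitude",
    "lon": "Longitude",
    "source_url": "RecordURL",
}

def extract_metadata(rec):
    """Keep only the index fields plus the query path to the primary record."""
    return {key: rec.get(src) for key, src in FIELD_MAP.items()}

def harvest_network(base_url, page_size=500):
    """Cycle through a (hypothetical) paged query endpoint until no
    records remain, yielding one metadata record per primary record."""
    offset = 0
    while True:
        params = urllib.parse.urlencode({"start": offset, "limit": page_size})
        with urllib.request.urlopen(f"{base_url}?{params}") as resp:
            records = json.load(resp)
        if not records:
            break  # all records retrieved
        for rec in records:
            yield extract_metadata(rec)
        offset += page_size
```

The reduction step is deliberately separated from the transport loop, since the same field mapping must be applied whatever protocol a given network uses.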
The BiOSC Gateway is a metadata search engine that provides direct access to approximately 2.4 million biological records from about 100
institutions available via three biodiversity networks. The gateway cross-walks the three networks and data-holding institutions within
networks to get to specific records of interest to end-users. As there are many genus-species combinations that are not recorded in a
taxonomic authority file such as ITIS, the taxonomic names gathered by the harvester component have to be directly accessible for query. Taxa
can be retrieved for all of their occurrences, or filtered by some primary geopolitical divisions. Nearly 100,000 biological names have been
harvested from the 3 networks. The 25,000 taxonomic names that occur in ITIS are fully integrated (from the Canadian sites of ITIS) with the
BiOSC Gateway and the three biodiversity networks. Through ITIS, BiOSC records can be accessed via scientific names and synonyms or via
vernacular names in three languages. Summary counts of BiOSC records by taxonomic groups are available through the browsable ITIS taxonomic
hierarchy. All BiOSC metadata records are hyperlinked directly to their original label as provided in real time by data holding institutions.
Dynamic mapping capabilities have been added for those records supplied with explicit geographic coordinates (longitude and latitude).
Clicking on points plotted on the maps brings up the actual record(s) retrieved directly from the original data holding institution in real
time. A bounding box query, retrieving all taxa collected from any given square grid (between 1 and 4 degrees in size) is also available for
records provided with geographic coordinates. When a collection is offline, the BiOSC Gateway automatically stops serving hyperlinks to
records from that collection.
There are many regional and international efforts to integrate access to biological information across data holders and networks. Most of
them are still at the conceptual stage. Some of these initiatives include:
- CODATA Working Group on Biological Collection Data Access
- Distributed Generic Information Retrieval
- Biological Collection Access Service for Europe (BioCASE)
Each of these initiatives can benefit from the insights obtained in constructing the prototype BiOSC Gateway. Certainly, the feasibility of a
global search engine for biodiversity occurrence records as envisioned by GBIF is demonstrated by the current prototype BiOSC Gateway. Much
more and much better could be done, but there will be many obstacles to overcome. The main obstacles are cultural. Data owners are still
reluctant to truly liberate and share their data. There are many reasons for this and most of them are beyond the scope of this paper, but
data quality is a big concern. Progress is being made, but it will take years and true success stories to really build the community. Other
related key issues are the need for global standards for the entry and exchange of new biodiversity data, and the obvious lack of tools to
facilitate the correction of old entries and ensure their compliance with at least one standard.
The following "lessons learned" are meant to facilitate these and other initiatives in their efforts. Comments about the networks and
examples from individual data holding institutions are meant to be constructive and in no way should be considered as complaints or negative criticism.
- Metadata mining was done with formal permission from the following three biodiversity networks. Any additional indexing activity will
require additional negotiations
- The Species Analyst (TSA) and Species Analyst Canada
- World Biodiversity Information Network (REMIB)
- European Natural History Specimen Information Network (ENHSIN)
- TSA and REMIB are production systems. ENHSIN was initiated to develop and evaluate a pilot network as a demonstrator and basis for
further discussions. It is not considered a production system at this time. TSA currently uses the Z39.50 protocol and the other networks
use combinations of TCP/IP and HTTP protocols. Both TSA and ENHSIN can provide XML output. TSA is unique in that it allows querying on any
element of the schema.
- TSA and ENHSIN return null fields (or missing elements with XML output). REMIB uses phrases such as nulo, Restrictado, NDPR
for text fields and impossible numbers for numeric fields (e.g. 99 for LatitudeDegrees and -999 for
LongitudeDegrees). The use of the Restrictado element is useful because it indicates that there is *hidden* information
available for those elements.
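A harvester has to fold each network's sentinel conventions into a single notion of "missing", while preserving the *restricted* signal. A minimal sketch, using only the example values quoted above; real per-field sentinel tables would be larger:

```python
# Placeholder strings and impossible numbers used by the networks to mark
# missing values (examples drawn from observed REMIB conventions).
TEXT_SENTINELS = {"nulo", "ndpr"}
NUMERIC_SENTINELS = {99, -999, 9999}

def normalize_value(value):
    """Return (normalized_value, is_restricted); sentinels become None."""
    if value is None:
        return None, False
    if isinstance(value, str):
        stripped = value.strip().lower()
        if stripped == "restrictado":
            # Restricted: hidden information exists but is not public.
            return None, True
        if stripped in TEXT_SENTINELS:
            return None, False
        return value, False
    if value in NUMERIC_SENTINELS:
        return None, False
    return value, False
```

Keeping the restricted flag separate from plain missingness matters, because a restricted element tells the end-user that more information can be requested from the data holder.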
- The three networks currently have between 16 and 34 elements (fields) in their schemas for each biological record. There are few
mandatory fields in the three schemas. The prototype BiOSC Gateway focuses on taxonomic identification, primary geographic division,
longitude/latitude, and collection date. The following discussion is a reality check on the complexities encountered. Multilingualism is a
major issue for most schema elements. It is difficult to expect most countries to remap their label data to a *standard* language.
- All three networks provide query fields for family, genus, and specific epithet. Additional hierarchical levels are included for ENHSIN
and TSA, but are rarely populated. Many variations were encountered in the form of names. There were the expected infraspecific ranks such as
subsp., ssp., var, f. There were also other qualifiers such as: ? , nr., nulo, cf, -cf, unidentified. These
variations require significant human intervention to clean the names into a normal format (i.e. monomial/binomial/trinomial). This was done
partially in order to compare the harvested names against ITIS. The process allows for seamless integration of BiOSC and ITIS when names or
synonyms coincide, but requires a lot of human intervention and maintenance. For example, Anthidium n. sp. 2 "Mex." needs to be
transformed into Anthidium for connection to ITIS, but the original form needs to be retained for display and requery of the
data holding institution. Maintaining the connection tables is difficult and requires manual inspection of each of the variations encountered.
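The cleaning step can be approximated with a token filter. The qualifier and rank-marker lists below hold only the examples cited above; a production version would need much larger tables and the human review discussed in the text:

```python
# Qualifiers after which the rest of the string is annotation, not a name
# (examples from harvested labels; list is illustrative, not complete).
QUALIFIERS = {"?", "nr.", "nulo", "cf", "cf.", "-cf", "unidentified",
              "n.", "sp.", "spp.", "aff."}
# Infraspecific rank markers to drop while keeping the epithet that follows.
RANK_MARKERS = {"subsp.", "ssp.", "var.", "var", "f."}

def clean_name(raw):
    """Reduce a harvested name to a monomial/binomial/trinomial suitable
    for matching against an authority file; the original string is kept
    alongside for display and requery of the data holding institution."""
    words = []
    for token in raw.replace('"', " ").split():
        low = token.lower()
        if low in QUALIFIERS:
            break
        if low in RANK_MARKERS:
            continue
        words.append(token)
    return {"original": raw, "clean": " ".join(words[:3])}
```

For example, the harvested string Anthidium n. sp. 2 "Mex." reduces to the genus Anthidium for the ITIS connection, while the original form is retained.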
- Records are only as good as the entered data. The following variations were found on a taxon: Apteronotus bonaparte, Apteronotus
bonaparti, Apteronotus bonapartii, Apteronotus bonarparti. The number of spelling mistakes found on labels by the harvester is
surprisingly high and emphasizes the need for truly global and complete taxonomic authority files, as well as a good taxonomic name service
that would provide probability ranked alternatives to common spelling mistakes and automated synonymy. Great care must be taken before
automatically suggesting an alternative name, because names harvested on specimens in collections will include amongst other variations
"collection names" that have never been published, and also many homonyms.
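A cautious name-suggestion service could rank candidates by string similarity against an authority list, with a high cutoff to limit false matches. A sketch using Python's standard difflib; the authority list here is a toy example, where a real service would draw on a complete authority file such as ITIS:

```python
import difflib

# Toy authority list; a real service would use a complete authority file.
AUTHORITY = ["Apteronotus bonapartii", "Apteronotus albifrons"]

def suggest_names(harvested, authority=AUTHORITY, n=3, cutoff=0.8):
    """Return authority names ranked by similarity to a harvested name.
    The conservative cutoff reflects the caution urged above: unpublished
    collection names and homonyms make aggressive matching dangerous."""
    return difflib.get_close_matches(harvested, authority, n=n, cutoff=cutoff)
```

A misspelling such as Apteronotus bonaparti then ranks the published spelling first, while unrelated names fall below the cutoff and yield no suggestion at all.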
- ENHSIN supports a finely dissected ("atomized") data model, but also provides a <text> element of the <taxon> node in the
XML schema for information that does not fit into the other elements. This convenience has a perverse effect because collections can choose
not to parse their data into the proper fields. For instance, some records on ENHSIN are copies of entire labels in the </text> element
and do not make use of the atomized (structured) elements. The search engine has no "hook" to identify data elements with certainty (such as
scientific name or primary geopolitical division) and build the appropriate metadata record. These records were simply ignored for the
prototype BiOSC Gateway. It would be possible to make some use of these via a full text index of the entire labels. This would be better
than nothing, but extremely inefficient.
- The three biodiversity networks provide varying means of querying their networks for geographic locality. All three networks provide a
"country" query that we call primary geographic division, because several collections have been using the country field to store
province/state information, or something else. Additionally TSA allows querying all of the *locality* elements.
- Primary geographic division information is very complex. Country names are most commonly used, but there is a lot of variation. For
example there are several variations on Canada: Canada (fourteen data holdings), CA (three data holdings),
CAN (one data holding), Canadá (one data holding). Similarly for the United States: US, USA, U S A, U.S.A,
U.S.A. ?, U.S.A., Usa, Usa?, etats-unis. Then there are defunct country names (e.g. USSR, Czechoslovakia,
Yugoslavia). Country names may be provided in the language of the data holding institution. One partial solution is to use a
country name standard such as ISO 3166. ENHSIN provides a picklist based on the ISO 3166 country standard. However, the onus is on the
individual data holders to input and interpret the country information to match the standard. For instance, one ENHSIN collection stores
modified country names in that field instead of the expected normalized ISO country code. This makes the collection unsearchable via the
ISO 3166 search functionality because the records contain strings that do not match the standard. This also raises the question of how far
original label data can be *transformed* for standardization.
- Obviously country information does not appear for collections from international waters or large bodies of water (seas or gulfs). Either
geographic coordinates are provided or something descriptive such as: Ocean Indien Sud, Off Chile, Mediterranean Sea.
- Deriving a consistent metadata layer for geographic locations from the harvested records will require significant human intervention.
There will be an ongoing effort to equate geographic variations to a *standard* (including abbreviations, multilingualism, and descriptive
text such as Southern Pacific Ocean).
- Remapping defunct country names to current country names will require great effort, and in some cases remapping will be impossible (e.g.
USSR, Yugoslavia): it may be impossible to decide which current country the collection was from if more precise locality information was not provided.
- Many biological records from the data holding institutions are provided with geographic coordinates. TSA and ENHSIN provide coordinates
in decimal degrees. REMIB uses degrees/minutes/seconds. For the purpose of dynamically mapping coordinates, REMIB coordinates are converted
to decimal degrees. However, only TSA provides a means to query collections by GIS coordinates. A larger number of records are usually
retrieved through ENHSIN or REMIB because querying is not allowed on specific GIS coordinates.
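The conversion applied to REMIB coordinates is straightforward; a sketch, assuming the hemisphere is supplied as a separate N/S/E/W letter:

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds (as used by REMIB) to the decimal
    degrees used for mapping; south and west values become negative."""
    value = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    return -value if hemisphere.upper() in ("S", "W") else value
```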
- TSA allows bounding box queries as well. Bounding box queries are a powerful enabling technology for finding all known collections and
observations across biological kingdoms from a defined geographic area.
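Once coordinates are indexed in decimal degrees, a bounding box query reduces to a simple range filter (in SQL, a pair of BETWEEN clauses over indexed latitude/longitude columns). A sketch over hypothetical index records:

```python
def in_bounding_box(record, lat_min, lat_max, lon_min, lon_max):
    """True when an index record's coordinates fall inside the query box;
    records without explicit coordinates are excluded."""
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        return False
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
```

This is the filter behind a "all taxa collected from any given square grid" query, applied across networks and kingdoms at once.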
- Unfortunately, several obvious errors are also found in coordinates. A common mistake is to omit the minus sign in front of a west
longitude, resulting in maps showing collections from North America plotting in Europe. In other cases, one or both coordinates are obviously
wrong (dots mapping in the wrong country). Also, some sites use zero (null) to presumably represent missing coordinates. These records
plot on the equator at the meridian of Greenwich. The dynamic mapping component of the BiOSC Gateway can be used to help verify such records.
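These plausibility checks can be automated before plotting. A sketch; the country-to-hemisphere heuristic is an assumption for illustration, not part of the BiOSC implementation:

```python
def flag_coordinates(lat, lon, country_code=None):
    """Return a list of plausibility warnings for a coordinate pair."""
    warnings = []
    if lat == 0 and lon == 0:
        warnings.append("0/0: likely missing coordinates entered as zero")
    if not -90 <= lat <= 90 or not -180 <= lon <= 180:
        warnings.append("out of range")
    # Heuristic (assumption): a North American record with a positive
    # longitude usually means a dropped minus sign on a west longitude.
    if country_code in ("CA", "US", "MX") and lon > 0:
        warnings.append("positive longitude for a North American country")
    return warnings
```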
- All three networks provide year, month, day elements in their schemas. In addition to being null or, in the case of REMIB, having
impossible numbers (e.g. year = 9999), there were many other non-standard dates such as "purchased in 1960".
- Occasionally, odd numbers, looking like strange Julian dates or like the canonical internal representation of time in UNIX (for example
"3157648") occurred in the year element.
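A simple validity filter catches the sentinel and implausible values described above; the plausible year window used here is an assumption (2002 being the harvest year of the prototype):

```python
def valid_collection_date(year, month, day):
    """Reject sentinel and implausible date parts, such as year 9999 or
    Julian-looking values like 3157648 appearing in the year element."""
    if year is None or not 1700 <= year <= 2002:  # assumed plausible window
        return False
    if month is not None and not 1 <= month <= 12:
        return False
    if day is not None and not 1 <= day <= 31:
        return False
    return True
```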
In order for GBIF to accurately answer a question as simple as "where has this species been found?", much more data and much higher-performing
search engines will be needed. The following recommendations are formulated to facilitate future developments.
- The three networks scanned by the prototype BiOSC Gateway include 16 to 34 elements in their schemas. Creating metadata tables from just
four topics highlighted many data problems that required significant human intervention and interpretation. Several regional and
international initiatives are underway to define theoretical schemas to capture digitized biological collections and observations
information. Some of these theoretical schemas may be exhaustive. As a reality check, significant human intervention was required with only 6
or 7 atomized elements covering four topic areas. Any practical *standard* schemas should provide data holding institutions with a system
that is manageable. If the system is too complex (with too many elements) it provides the data holders with an excuse not to participate.
Even if they wish to participate, the degree of attribute dissection should not be intimidating (remembering that often data entry personnel
are not experts in taxonomy and biology). Many institutions have developed their own proprietary online systems and will not participate in
biodiversity networks if emerging standards are too complex. Many institutions will likely not rework their existing digitized information.
It is unlikely that a schema with more than 50 elements will be functional in the short term.
- A "Date of Insert or Update" of each record (timestamp) should be a mandatory element. Currently there are no timestamp elements in any
of the three schemas (though this is proposed for the draft TSA Darwin Core V.2). The timestamp represents either the insertion date or
update date on each record. Without a timestamp on each record search engines will have to bulk collect everything each time they visit a
site. For example, the Collection of Entomology (INBIO) currently serves 2,325,794 specimens through REMIB. Without an update stamp on each record
it will be necessary to reaccess every record to find out if there have been any changes to existing records or newly added records. Being
able to query the database by date of insert or update would considerably reduce the latency between the time a record is added or modified
and its inclusion or update in the search engine and would also minimize network traffic and search engine harvester load on servers.
- A unique record identifier should be a mandatory element facilitating queries by search engines to retrieve a specific record in real
time. In general the search is by biodiversity network/data holding institution/specific collection/ individual record. Many collections have
no unique record identifier. Neither ENHSIN nor REMIB allows searching by record identifier. Without a unique collection record identifier
it is difficult for a search engine to return Internet clients to individual records at data holding institutions.
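Together, a timestamp and a unique record identifier are what make incremental harvesting possible. A sketch of the harvester-side logic, assuming (hypothetically) that a network can return a summary of (identifier, last-modified) pairs; timestamps are ISO-8601 strings, which compare chronologically as plain strings:

```python
def records_to_refresh(local_index, remote_summary):
    """Given a local index {record_id: timestamp_at_last_harvest} and a
    remote summary {record_id: last_modified_timestamp}, return the ids
    that must be (re)fetched: new records plus those modified since the
    previous harvest. Everything else can be skipped entirely."""
    return [rid for rid, modified in remote_summary.items()
            if rid not in local_index or modified > local_index[rid]]
```

Without these two mandatory elements, the only safe strategy is to bulk-collect every record on every visit, which is exactly the load problem described above.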
- It is important to expose collections even if some of the record information is restricted. Using an impossible string such as -9999
(either as a numeric or text value) in restricted elements allows data holding institutions to expose minimum information on taxa that are
available only to authorized experts. Otherwise, data holding institutions may restrict records entirely from their online information systems.
- It would be very useful if each of the biodiversity networks would enable query by any of their data elements. At the present time, only
TSA supports this option.