| Abstract Detail
Systematics Section/ASPT

Gottschalk, Stephen [1], Watson, Kimberly [1], Tulig, Melissa [1], Thiers, Barbara [1].

Developing a Semi-Automated Workflow for Specimen Record Completion.

The New York Botanical Garden Herbarium holds more than 200 years' worth of plant collections from all over the globe, amounting to over 7 million herbarium specimens. Optical character recognition (OCR) software has increased the rate at which specimen label data are captured, and much progress has been made on how best to incorporate OCR-generated text into a label transcription workflow. However, given the diversity of collectors, collection label types, languages, etc. represented in the NYBG collection, a fully automated "one size fits all" approach to specimen data capture through OCR and natural language processing is unlikely to succeed. Instead, the focus is on grouping label images based on the OCR text, enabling rapid data capture of key label data elements (e.g., collection number, collector, country). Records are then completed from grouped sets of label images with corresponding OCR text, leveraging where possible any digitized collector field book records and existing complete data from all available sources (e.g., GBIF, project partners). Furthermore, this grouping allows records to be routed to different methods of completion, including crowdsourcing legible labels to citizen scientists, sending difficult labels to a specialist, and targeting fully typed labels for natural language processing. Further integration and automation of these methods will lead to more efficient data extraction from physical herbarium specimens.
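The grouping step described in the abstract can be illustrated with a minimal Python sketch. It is not the NYBG implementation: the lookup lists, the function names (`find_first`, `group_labels`, `route`), the sample label text, and the routing rule are all hypothetical, chosen only to show how OCR text might be bucketed by key label elements and each bucket sent to a different completion method.

```python
from collections import defaultdict

# Hypothetical lookup lists (assumption); a real workflow would likely
# draw on collector and geography authority files instead.
KNOWN_COLLECTORS = ["J. Doe", "A. Smith"]
KNOWN_COUNTRIES = ["Brazil", "Mexico"]


def find_first(text, candidates):
    """Return the first candidate found in text (case-insensitive), else None."""
    lowered = text.lower()
    for candidate in candidates:
        if candidate.lower() in lowered:
            return candidate
    return None


def group_labels(ocr_by_image):
    """Group image IDs by the (collector, country) pair found in their OCR text."""
    groups = defaultdict(list)
    for image_id, text in ocr_by_image.items():
        key = (
            find_first(text, KNOWN_COLLECTORS),
            find_first(text, KNOWN_COUNTRIES),
        )
        groups[key].append(image_id)
    return dict(groups)


def route(key):
    """Illustrative routing rule (assumption): fewer matched fields means
    the group needs more human attention."""
    collector, country = key
    if collector and country:
        return "batch_entry"
    if collector or country:
        return "crowdsourcing"
    return "specialist"
```

For example, calling `group_labels` on a dictionary that maps image identifiers to OCR strings returns one list of images per (collector, country) pair, and `route` can then be applied to each key to pick a completion queue. Simple substring matching is used here only for brevity; noisy OCR output would probably call for fuzzy matching.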
1 - New York Botanical Garden, William and Lynda Steere Herbarium, 2900 Southern Blvd., Bronx, New York, 10458, United States
Keywords: Herbarium, specimen digitization, data management, data analysis, workflow.
Presentation Type: Oral Paper: Papers for Sections
Session: 4
Location: Payette/Boise Centre
Date: Monday, July 28th, 2014
Time: 9:00 AM
Number: 4005
Abstract ID: 418
Candidate for Awards: None |