Geosphere; February 2008; v. 4; no. 1;
p. 159-169; DOI: 10.1130/GES00140.1
© 2008 Geological Society of America
Automated extraction of data from text using an XML parser: An earth science example using fossil descriptions
Gordon B. Curry*,1 and Richard C.H. Connor2
1 Digital Geosciences Laboratory, Dept of Geographical and Earth Sciences, University of Glasgow, Gregory Building, Lilybank Gardens, Glasgow G12 8QQ, Scotland, UK
2 Department of Computer and Information Sciences, University of Strathclyde, Glasgow G1 1XH, Scotland, UK

Figure 1. First 18 lines of extensible markup language (XML) parsed taxonomic description of the genus Glottidia (from Williams et al., 2000). Blue text corresponds to the original published text; capitalized words in red enclosed within brackets (<...>) are the XML tags added automatically by the XML parser. Pink and green text is also added by the parser inside the XML tags as attributes. The figure shows how the name author, date, and taxonomic description of the taxon have been recognized and tagged in such a way as to allow complex queries. See text for details.

Figure 2. Section of the extensible markup language (XML) tagged taxonomic description of the genus Glottidia showing how the information on the overall stratigraphic distribution has been tagged by the XML parser. Blue text corresponds to the original published text; capitalized words in red enclosed within <...> brackets are the XML tags added automatically by the XML parser. Pink and green text is also added by the parser inside the XML tags as attributes. See text for details.

Figure 3. Section of the extensible markup language (XML) tagged taxonomic description of the genus Glottidia showing how the data on the detailed stratigraphic and geographical information present in the original description have been tagged by the XML parser. Blue text corresponds to the original published text; capitalized words in red enclosed within <...> brackets are the XML tags added automatically by the XML parser. Pink and green text is also added by the parser inside the XML tags as attributes. See text for details.

Figure 4. Section of the extensible markup language (XML) tagged taxonomic description of the genus Glottidia showing how the data on the illustrations provided along with the taxonomic description have been tagged by the XML parser. Blue text corresponds to the original published text; capitalized words in red enclosed within <...> brackets are the XML tags added automatically by the XML parser. See text for details.

Figure 5. Format of the extensible stylesheet language transformation (XSLT) document used to transform a series of extensible markup language (XML) tagged taxonomic descriptions of brachiopod genera from the Treatise on Invertebrate Paleontology (Williams et al., 2000) into a list of genus name, author, and date. See text for details.
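
The kind of extraction described in this caption can be sketched outside XSLT as well. The following is a minimal Python illustration using the standard-library ElementTree module rather than the authors' stylesheet. The tag names (DESCRIPTIONS, GENUS, NAME, AUTHOR, DATE) and the sample records are assumptions made for the example; they are not the schema produced by the parser described in the paper, nor data from the Treatise.

```python
import xml.etree.ElementTree as ET

# Placeholder records: the tag names and values below are assumptions for
# illustration only, not the paper's schema or data from the Treatise.
SAMPLE_XML = """
<DESCRIPTIONS>
  <GENUS><NAME>GenusA</NAME><AUTHOR>AuthorA</AUTHOR><DATE>1850</DATE></GENUS>
  <GENUS><NAME>GenusB</NAME><AUTHOR>AuthorB</AUTHOR><DATE>1932</DATE></GENUS>
</DESCRIPTIONS>
"""


def list_genera(xml_text):
    """Return (name, author, year) for every GENUS element, in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for genus in root.iter("GENUS"):
        name = genus.findtext("NAME", default="").strip()
        author = genus.findtext("AUTHOR", default="").strip()
        date_text = genus.findtext("DATE", default="").strip()
        # Keep the year as None when the DATE element is missing or not numeric.
        year = int(date_text) if date_text.isdigit() else None
        rows.append((name, author, year))
    return rows


if __name__ == "__main__":
    for name, author, year in list_genera(SAMPLE_XML):
        print(f"{name}\t{author}\t{year}")
```

An XSLT stylesheet expresses the same selection declaratively; the procedural version here is only meant to show which elements are read and in what order.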

Figure 6. View of part of the output generated by applying an extensible stylesheet language transformation (XSLT) (shown in Fig. 5) to a series of extensible markup language (XML) tagged taxonomic descriptions of brachiopod genera from the Treatise on Invertebrate Paleontology (Williams et al., 2000). See text for details.

Figure 7. Histogram showing the number of genera described in each decade from the 1790s to the 1990s. Data are extracted from extensible markup language (XML) tagged descriptions from the Treatise on Invertebrate Paleontology, part H, Volume 2 (Williams et al., 2000), and hence apply only to a subset (277) of the total genera assigned to the phylum (>5000). Note that data for the 1990s are not complete.
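
The per-decade binning behind a histogram of this kind can be sketched as follows. This is an illustrative Python fragment, not the authors' procedure: the function name genera_per_decade is hypothetical, and the example years are placeholders rather than dates extracted from the Treatise.

```python
from collections import Counter


def genera_per_decade(years):
    """Count description years per decade, e.g. 1870-1879 is stored under 1870."""
    counts = Counter((year // 10) * 10 for year in years)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Placeholder years, not data from the tagged descriptions.
    example_years = [1791, 1798, 1853, 1870, 1871, 1995]
    for decade, count in genera_per_decade(example_years).items():
        print(f"{decade}s {'#' * count}")
```

Decades with no described genera are simply absent from the result; a plotting step would need to fill them with zero.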

Figure 8. Graph showing the cumulative number of brachiopod genera in a sample subset of 277 taxa that were described over a period of 200 yr from the 1790s to the 1990s. See text for detail.
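
A cumulative curve of this kind is a running total over per-decade counts. The sketch below shows one way to compute it in Python. The function name cumulative_by_decade and the input mapping (decade start year to number of genera) are assumptions for illustration, and the values are placeholders rather than the 277-genus sample.

```python
from itertools import accumulate


def cumulative_by_decade(per_decade):
    """Turn {decade: count} into [(decade, running total)] in chronological order."""
    decades = sorted(per_decade)
    totals = accumulate(per_decade[decade] for decade in decades)
    return list(zip(decades, totals))


if __name__ == "__main__":
    # Placeholder counts, not values read from the figure.
    example_counts = {1790: 2, 1850: 1, 1870: 2}
    for decade, total in cumulative_by_decade(example_counts):
        print(f"{decade}s\t{total}")
```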

Figure 9. Front cover of one of the three monographs comprising Davidson's monographs on recent Brachiopoda (Davidson, 1886–1888).

Figure 10. Image from the Davidson (1886–1888, p. 173) monograph on recent Brachiopoda, showing taxonomic descriptions and illustrations.

Figure 11. Output of scanning page 173 of the Davidson (1886–1888) monograph using optical character recognition software. Compared with the original, there are six mistakes, which are highlighted using yellow shading. None of the mistakes interferes with extensible markup language (XML) parsing, and all but one are easily corrected using the learn function in the optical character recognition (OCR) dictionary. The large area of highlighting outlines an area in which the order of words in the original text has been very slightly altered. This latter feature is due to the presence of separate text boxes on the page, in particular the interaction of the text box containing the figure description text with the main taxonomic description. This feature was not significant for XML parsing of the text of the taxonomic description.