Department of Chemistry

Murray-Rust Research Group


Oscar3 is a tool for shallow, chemistry-specific parsing of chemical documents. It identifies (or attempts to identify):

  • Chemical names: singular and plural nouns, verbs, etc., as well as formulae, acronyms, and some enzyme and reaction names.
  • Ontology terms: if you can do it by string-matching, you can get OSCAR to do it.
  • Chemical data: spectra, melting/boiling points, yields, etc., in experimental sections.
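
To illustrate the string-matching approach to ontology terms mentioned above, here is a minimal Python sketch. It is not OSCAR3's own code (OSCAR3 is a separate tool with its own implementation); the ONTOLOGY table, its "EX:" identifiers, and the function names are all hypothetical and exist only for this example.

```python
import re

# Hypothetical mini-ontology: term -> identifier. The "EX:" identifiers are
# placeholders invented for this example, not real ontology IDs.
ONTOLOGY = {
    "oxidation": "EX:0001",
    "nucleophilic substitution": "EX:0002",
    "substitution": "EX:0003",
}


def build_matcher(terms):
    """Compile one case-insensitive regex that matches any of the terms."""
    # Longest terms first, so a multi-word term is preferred over a shorter
    # term that starts at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_terms(text, ontology=ONTOLOGY):
    """Return (start, end, matched text, identifier) for each term found."""
    matcher = build_matcher(ontology)
    return [
        (m.start(), m.end(), m.group(0), ontology[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]


if __name__ == "__main__":
    sentence = "Nucleophilic substitution followed by oxidation."
    for hit in find_terms(sentence):
        print(hit)
```

Matches do not overlap, so "substitution" is not reported a second time inside "Nucleophilic substitution".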

In addition, where possible, the detected chemical names are annotated with structures, either via lookup or via name-to-structure parsing ("OPSIN"), and with identifiers from the chemical ontology ChEBI.
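
The "lookup, otherwise parse the name" strategy described above can be sketched as follows. This is a toy illustration only: the lookup table, the straight-chain alkane rule, and the function names are assumptions made for this example, and the rule is far simpler than what OPSIN actually does.

```python
# Hypothetical lookup table: chemical name -> SMILES string.
LOOKUP = {"water": "O", "benzene": "c1ccccc1", "pyridine": "c1ccncc1"}

# Stems of the first six straight-chain alkanes and their carbon counts.
ALKANE_STEMS = {"meth": 1, "eth": 2, "prop": 3, "but": 4, "pent": 5, "hex": 6}


def parse_alkane(name):
    """Toy name-to-structure rule: '<stem>ane' -> a chain of n carbons."""
    if not name.endswith("ane"):
        return None
    n = ALKANE_STEMS.get(name[:-3])
    return "C" * n if n else None


def name_to_smiles(name):
    """Try the lookup table first, then fall back to the toy parser.

    Returns (smiles, source), where source is 'lookup', 'parsed', or
    'unresolved' (in which case smiles is None).
    """
    key = name.strip().lower()
    if key in LOOKUP:
        return LOOKUP[key], "lookup"
    smiles = parse_alkane(key)
    if smiles:
        return smiles, "parsed"
    return None, "unresolved"


if __name__ == "__main__":
    for name in ["Pyridine", "propane", "unobtainium"]:
        print(name, name_to_smiles(name))
```

Returning the source alongside the structure lets a caller tell dictionary hits from parsed names, which matters if the two routes are trusted differently.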

Current work on OSCAR3 by Peter Corbett focuses on its use in SciBorg, a framework for the deep parsing of chemical text.

OSCAR3 also includes the Oscar Server, a Jetty-powered set of servlets. These provide the following services:

  • Parsing of text/HTML by OSCAR.
  • Search of papers by text, InChI, SMILES, SMILES substructure, or SMILES similarity, coupled with keyword and ontology-based search, using Lucene and the CDK.
  • List of all names found / all names that co-occur with a search term or terms.
  • Online management of a chemical/stopword lexicon.
  • Manual editing of SciXML fragments containing named entities, for creating gold standards and training data.
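
As an illustration of the co-occurrence listing among the services above, the following Python sketch counts the names that appear in the same document as every given search term. It assumes co-occurrence is judged per document, which the description does not specify; the data layout and function name are hypothetical, and the Oscar Server itself is built on Lucene rather than anything like this.

```python
from collections import Counter


def cooccurring_names(docs, search_terms):
    """List names that share a document with all of the search terms.

    docs maps a document id to the set of names found in that document.
    Returns (name, document count) pairs, most frequent first.
    """
    wanted = {t.lower() for t in search_terms}
    counts = Counter()
    for names in docs.values():
        lowered = {n.lower() for n in names}
        # Only documents containing every search term contribute.
        if wanted <= lowered:
            counts.update(lowered - wanted)
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical index of names found per paper.
    papers = {
        "paper1": {"pyridine", "ethanol"},
        "paper2": {"Pyridine", "ethanol", "water"},
        "paper3": {"water"},
    }
    print(cooccurring_names(papers, ["pyridine"]))
```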

Oscar3 is available on SourceForge.

Oscar3-related publications

High-Throughput Identification of Chemistry in Life Science Texts. Peter Corbett and Peter Murray-Rust. CompLife 2006, LNBI 4216, pp. 107–118, 2006. Official publisher's site, self-archived PDF

Semantic Enrichment of Journal Articles Using Chemical Named Entity Recognition. Colin Batchelor and Peter Corbett. Proceedings of the ACL 2007 Demo and Poster Sessions. PDF

Annotation of Chemical Named Entities. Peter Corbett, Colin Batchelor and Simone Teufel. BioNLP 2007: Biological, translational, and clinical language processing. PDF

Pyridines, Pyridine and Pyridine Rings: Disambiguating Chemical Named Entities. Peter Corbett, Colin Batchelor and Ann Copestake. Proceedings of Building & evaluating resources for biomedical text mining (LREC 2008 workshop). PDF of workshop proceedings

Language Resources and Chemical Informatics. C.J. Rupp, Ann Copestake, Peter Corbett, Peter Murray-Rust, Advaith Siddharthan, Simone Teufel, Benjamin Waldron. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). PDF of proceedings

Cascaded Classifiers for Confidence-Based Chemical Named Entity Recognition. Peter Corbett and Ann Copestake. Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing (BioNLP 2008). PDF