Department of Chemistry

Murray-Rust Research Group


N.B. Current projects are denoted with *

AMI - The Chemist’s Amanuensis, or, The Intelligent Fume Cupboard

This project aims to explore new ways of interacting with chemical information. A range of low cost devices and pervasive technologies are coupled to recognition systems for presence, voice (natural language), gestures and sensors. The goal is that the laboratory environment can understand the chemistry being done in it, remember it and interpret it. You should be able to ask questions such as “who has used this solvent in the last 6 months?”
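The example question above can be read as a filter over a log of who used what, and when. The following is a minimal sketch of that idea only, not the AMI implementation; the `UsageRecord` type, the `who_used` function and the log entries are all hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class UsageRecord:
    """One hypothetical log entry: a person used a substance on a given day."""
    person: str
    substance: str
    used_on: date


def who_used(records, substance, since):
    """Return the sorted names of people who used `substance` on or after `since`."""
    wanted = substance.lower()
    return sorted({r.person for r in records
                   if r.substance.lower() == wanted and r.used_on >= since})


# Invented log entries, for illustration only.
log = [
    UsageRecord("alice", "dichloromethane", date(2010, 5, 1)),
    UsageRecord("bob", "Dichloromethane", date(2009, 1, 15)),
    UsageRecord("carol", "toluene", date(2010, 6, 20)),
]

today = date(2010, 7, 1)
six_months_ago = today - timedelta(days=182)  # roughly six months
recent_users = who_used(log, "dichloromethane", six_months_ago)
```

In a real laboratory the records would presumably come from the sensors and recognition systems described above rather than from a hand-written list.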

* Chem4Word

The Chem4Word add-in for Word makes it easier for students, chemists, and researchers to insert and modify chemical information, such as labels, formulas and 2-D depictions, from within Microsoft Office Word. In addition to authoring functionality, Chem4Word enables the creation of inline "chemical zones", the rendering of high-quality, print-ready depictions of chemical structures, and the storage and exposure of chemical information in a semantically rich manner.

It is available to download. More than 250,000 downloads so far!

* CheTA

The joint CheTA (Chemistry using Text Annotations) project between the PMR group and the National Centre for Text Mining in Manchester is funded by JISC and involves the integration of the OSCAR text-mining tool into the U-Compare workflow infrastructure.


The Chemical Laboratory Repository In/Organic Notebooks project is a JISC-funded project to develop a semantically enhanced repository for open chemical research data. The project will develop an embargo management tool that will facilitate the management of research data, making it easier for a scientist to release data to Open access. It is a two-year project (Apr'09-Mar'11).
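At its simplest, embargo management comes down to deciding, for each dataset, whether its embargo date has passed. The sketch below illustrates that decision under stated assumptions; it is not the project's tool, and the `Dataset` type, its fields and the function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    """A hypothetical repository entry with an optional embargo date."""
    identifier: str
    embargo_until: Optional[date]  # None means no embargo was set


def is_releasable(dataset, today):
    """A dataset may be released if it has no embargo or the embargo has expired."""
    return dataset.embargo_until is None or today >= dataset.embargo_until


def releasable(datasets, today):
    """Return the identifiers of all datasets that may be released on `today`."""
    return [d.identifier for d in datasets if is_releasable(d, today)]


# Invented entries, for illustration only.
datasets = [
    Dataset("nmr-001", None),
    Dataset("xray-002", date(2010, 6, 1)),
    Dataset("ms-003", date(2011, 3, 31)),
]
open_now = releasable(datasets, date(2010, 9, 1))
```

A real tool would also need to record who set the embargo and why; that is left out here.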

* CrystalEye

CrystalEye was developed as part of Nick Day’s PhD and is a fully-automated system for the reformulation of the fragmented crystallographic Web into a structured XML-based repository. Since then the service has continued to be well used as an Open Data resource, contributing data into the eCrystals Federation (JISC, led by University of Southampton) and OREChem (Microsoft, led by Cornell University) projects as part of our research contribution. The software has been developed into a demonstrator departmental crystallography repository (C3DER), and was consequently critical in the successful funding of the CLARION project. CrystalEye runs every night and continues to be an excellent test bed for our pioneering use of semantic web technologies in publishing chemistry data, combining web crawlers and publication using RDF, ORE and Atom technologies.

Experimental Data Checker

Originally developed by summer students in the Unilever Centre, the Experimental Data Checker extracts information such as compound appearance, melting points (if applicable), Rf, infra-red and NMR data, and mass spectral information from either a paragraph of experimental data or a full paper, and then runs checks to test the data for consistency. It consists of an application for authors and editors to use to check their data before publication, along with the toolkit, which can be used to develop other applications.
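To give a flavour of the "extract, then check" approach, here is a small sketch that pulls a melting point out of an experimental paragraph and applies two simple consistency checks. It is an illustration only and does not reproduce the Experimental Data Checker's own rules: the regular expression, the function names and the 5 °C width threshold are all assumptions made for this example.

```python
import re

# Matches forms such as "mp 121-123 °C" or "m.p. 98 °C" (an assumed, simplified syntax).
MP_PATTERN = re.compile(
    r"\bm\.?p\.?\s*(\d+(?:\.\d+)?)\s*(?:[-–]\s*(\d+(?:\.\d+)?))?\s*°\s*C",
    re.IGNORECASE,
)


def extract_melting_point(text):
    """Return (low, high) in °C, or None if no melting point is found."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    low = float(match.group(1))
    # A single value is treated as a zero-width range.
    high = float(match.group(2)) if match.group(2) else low
    return low, high


def check_melting_point(text, max_width=5.0):
    """Return a list of problems; max_width is an arbitrary illustrative threshold."""
    mp = extract_melting_point(text)
    if mp is None:
        return ["no melting point found"]
    low, high = mp
    problems = []
    if low > high:
        problems.append("melting range is reversed")
    if high - low > max_width:
        problems.append("melting range is unusually wide")
    return problems


report = check_melting_point("White solid; mp 121-123 °C; Rf 0.45")
```

The same pattern, one extractor plus a list of checks per data type, would extend naturally to Rf values or NMR peak lists.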

The Experimental Data Checker is freely available to download from the Royal Society of Chemistry website.

Published as: "Experimental data checker: better information for organic chemists" Adams, S. E.; Goodman, J. M.; Kidd, R. J.; McNaught, A. D.; Murray-Rust, P.; Norton, F. R.; Townsend, J. A.; Waudby, C. A. Org. Biomol. Chem., 2(21), 3067-3070; DOI: 10.1039/B411699M

Green Chain Reaction

The Green Chain Reaction was an initiative to conduct a scientific experiment over the period before and during the Science Online London 2010 meeting (hashtag #solo10). It was not necessary to be an attendee at #solo10 to take part in the interactive session held during the conference (on September 4th 2010), nor was it necessary to be a chemist. The initiative was underpinned by Open Data and the emerging tradition of Citizen Science.

The experiment assessed the feasibility of extracting meaning from chemical reaction information and data online to create new knowledge from sources that would otherwise remain "dark" (mainly patents). The focus was on determining the "green-ness" of chemical reactions in manufacturing and research, with an aim to increasing their acceptability. Both machines and humans were employed to collect and systematize chemical syntheses in the current scientific literature (journals, theses and patents) and to analyse the results.
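One crude way a machine might contribute to such an analysis is to tally which solvents are mentioned across a set of reaction descriptions. The sketch below shows that idea only; the solvent list, the function name and the sample sentences are assumptions for illustration, and solvent counts are just one possible indicator rather than the measure of "green-ness" the experiment actually used.

```python
import re
from collections import Counter

# A short, hypothetical solvent list; a real analysis would need a far larger one.
SOLVENTS = ["dichloromethane", "toluene", "ethanol", "water", "tetrahydrofuran"]


def count_solvents(texts):
    """Count, for each solvent, how many texts mention it at least once."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for solvent in SOLVENTS:
            # Whole-word match so that, e.g., "methanol" does not count as "ethanol".
            if re.search(r"\b" + re.escape(solvent) + r"\b", lowered):
                counts[solvent] += 1
    return counts


# Invented sentences, for illustration only.
texts = [
    "The residue was dissolved in dichloromethane and washed with water.",
    "Toluene (10 mL) was added and the mixture was heated.",
    "The mixture was extracted with dichloromethane.",
]
solvent_counts = count_solvents(texts)
```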

* Open Bibliography

A JISC-funded project that aims to transform existing bibliographic metadata (from Cambridge University Library and the British Library) into a substantial corpus of Linked Open Data.


OPSIN (Open Parser for Systematic IUPAC Nomenclature) interprets chemical names, especially IUPAC organic chemical names, into structures. It forms the basis of Daniel Lowe's PhD research, funded by Boehringer Ingelheim.


The OREChem project is a collaboration between chemistry scholars and information scientists to develop and deploy the infrastructure, services, and applications to enable new models for research and dissemination of scholarly materials in the chemistry community. Although the focus of the project is chemistry, the work is being undertaken with an attention to general cyber infrastructure for eScience, thereby enabling the linkages among disciplines that are required to solve today’s key scientific challenges such as global warming. A key aspect of this work, and a core aim of this project, is the design and implementation of an interoperability infrastructure that will allow chemistry scholars to share, reuse, manipulate, and enhance data that are located in repositories, databases, and Web services distributed across the network.

The project was funded by Microsoft External Research.


The suite of Open Source Chemistry Analysis Routines (OSCAR) has been under development since 2002/2003. It extracts and validates data from experimental chemistry data reports, identifies named chemical entities such as names of compounds and attempts to resolve the compound name to a chemical structure.

The original version of the OSCAR toolkit is available to download from the Royal Society of Chemistry.

Extensive development of OSCAR (the SciBorg project) resulted in Oscar3 (see also "High-Throughput Identification of Chemistry in Life Science Texts" Peter Corbett and Peter Murray-Rust, Lecture Notes in Computer Science, 2006, 4216, pp 107-118; DOI: 10.1007/11875741_11).

* PolyInfo

Nick England's PhD research, funded by Unilever.

* Reaction Extraction

In the modern age, chemists publish vast quantities of data per year. In the absence of semantic authoring, this information is either locked in the original documents or manually indexed and added to commercial databases. Using the patent literature as a sample corpus, the potential to use OSCAR, OPSIN and ChemicalTagger to automatically abstract reactions from chemical texts along with their associated spectra has been demonstrated.

Future development in the project will see the liberation of hundreds of thousands of reactions and the release of the resulting information as free and Open data.

This project formed the basis of David Jessop's PhD research, funded by Unilever.


"Submission, Preservation and Exposure of Chemistry Teaching and Research Data" - funded by JISC under the Digital Repositories program, SPECTRa ran from 2005 to 2007. It aimed to:

  • investigate the needs of the academic chemistry research community with respect to how data associated with theses and peer-reviewed publications may best be communicated.
  • demonstrate how these needs may best be co-ordinated with emerging institutional strategies for repositories handling both data and publications.

  • facilitate routine extraction of data in high volumes and their ingest into institutional repositories.
  • investigate the cultural issues in capturing and re-using scientific data.
  • explore interoperability issues involving archiving data in repositories.


"Submission, Preservation and Exposure of Chemistry Teaching and Research Data from Theses" - a JISC-funded project which ran from 2007 to 2008. The project built upon the earlier SPECTRa project and aimed to:

  • facilitate routine and automatic extraction of Knowledge Objects in high volumes, transformation into metadata and their ingest into institutional repositories.
  • survey current practice in the deposition of chemistry theses.
  • investigate the needs of the academic chemistry research community with respect to how data associated with theses may best be managed.
  • demonstrate how these needs may best be co-ordinated with emerging institutional strategies for repositories handling data-rich objects.
  • investigate the automatic discovery of data and data-rich documents in institutional repositories.
  • investigate the cultural issues in capturing and re-using scientific data.
  • explore interoperability issues involving preservation of data in repositories.
  • develop semantic querying of institutional repositories.

Published as: "SPECTRa-T: Machine-Based Data Extraction and Semantic Searching of Chemistry e-Theses" Jim Downing, Matt J. Harvey, Peter B. Morgan, Peter Murray-Rust, Henry S. Rzepa, Diana C. Stewart, Alan P. Tonge and Joe A. Townsend, J. Chem. Inf. Model., 2010, 50(2), pp 251–261; DOI: 10.1021/ci9003688


TheOREm was a JISC-funded project from mid 2008 to early 2009 aimed at:

  • Testing the applicability of the ORE (Object Reuse and Exchange) standard in a realistic scholarly setting - thesis description, submission and publication.
  • Demonstrating the advantages of the ORE approach in complex object publication, by combining it with existing web-standards compliant technologies.
  • Providing examples to fully exercise the ORE specification in order to provide validation and future direction.


The JISC-funded XYZ project aims to build a new workflow model for the publication of scientific data, which is all too often lost during the publication process.