CrystalEye (formerly CMLCrystBase) is a project that PeterMR and I (Nick Day) have been working on since the end of 2005. After a suggestion by Peter I decided that it would be a good idea to make some public documentation of the reasoning behind the project and what it is, does and will do.
The current state of this page is somewhat of a brain dump and certainly needs further work in places. I'll add more images and explanations in due course.
At present the system consists of three main parts, which will be discussed at greater length further down the page:
- aggregate the latest crystallographic data published on the web, in the form of CIF files, from remote web resources (currently just publishers' websites).
- convert the data (losslessly) into CML, from which we can perform many more file conversions and 'add value' to the crystallographic data (discussed below).
- disseminate the data via RSS and CMLRSS.
Part One - Fetching
Our robot is currently scraping CIF files from the journals at Acta Crystallographica and the Royal Society of Chemistry (who have kindly given us permission to do so).
Current Tactics For Scraping CIFs
Our robot only wants to gather the latest crystallographic data from each journal at each publisher's website; thus we had to come up with a way of pointing it directly to the table of contents for the latest issue of a journal. For RSC journals this is easy, as each journal has a special 'current-issue' URL which always points to the latest issue to have been published. For instance:
points to the latest issue of Dalton Transactions at the RSC. It is easy to construct these current issue URLs once you know the abbreviation for each journal; for instance, cc is ChemComm and ce is CrystEngComm so the desired URLs are respectively:
So, you can see that here we are at the mercy of the webmasters. If they did not provide this special URL, we would have to find another route for our robot to find the latest issue of each journal. This is the case for Acta Cryst, though thankfully a 'back-issues' page is provided for each journal. The URL for Section E's back issues page is:
From this the robot can find the link to the most recent issue and follow it. However, there is another complication: the latest issue link is often provided before the issue has been fully completed, and we don't want to scrape only part of an issue's data. An incomplete issue is indicated by the text 'in preparation' next to the link, so the robot needs to spot this and follow the link below it instead.
Once at the latest issue table of contents, the robot doesn't start scraping straight away as it might already have downloaded the CIFs on a previous visit. Instead, the robot searches the page for the year and issue ID, extracts them and then compares them against an XML log of the issues that have already been visited. An excerpt of this log looks like:
<log>
  <publisher abbreviation="acta" name="Acta Crystallographica">
    <journal abbreviation="e">
      <year id="2006">
        <issue id="05-00">
          <processed value="true"/>
          <disseminated value="true"/>
          <cifs number="281"/>
        </issue>
        ....
So each issue has an associated year, journal and publisher. The 'processed' and 'disseminated' flags can be ignored at this time, as they are used in the latter two stages of the system. Every time the robot finishes downloading the CIFs from a particular journal, it performs some simple validation to make sure all the files were written, and then adds a new entry to this log.
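As a sketch of the log lookup, using only the JDK's built-in XML support (the class and method names here are illustrative, not the production API):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class IssueLog {

    // Returns true if the given journal issue already has an entry in the
    // log, i.e. its CIFs were downloaded on a previous visit.
    public static boolean alreadyVisited(String logXml, String journalAbbrev,
                                         String year, String issueId) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(logXml.getBytes("UTF-8")));
        String expr = String.format(
                "count(//journal[@abbreviation='%s']/year[@id='%s']/issue[@id='%s'])",
                journalAbbrev, year, issueId);
        Double matches = (Double) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NUMBER);
        return matches > 0;
    }

    public static void main(String[] args) throws Exception {
        String log = "<log><publisher abbreviation='acta' name='Acta Crystallographica'>"
                + "<journal abbreviation='e'><year id='2006'>"
                + "<issue id='05-00'><processed value='true'/></issue>"
                + "</year></journal></publisher></log>";
        System.out.println(alreadyVisited(log, "e", "2006", "05-00"));
        System.out.println(alreadyVisited(log, "e", "2006", "06-00"));
    }
}
```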
If the robot has decided it needs to scrape a given issue then it performs the following steps:
- downloads the issue table of contents HTML (using Apache HttpClient)
- using Tagsoup, the HTML is tidied and then parsed into a XOM DOM.
- once it has the DOM representation the robot can query the document using a series of XPaths to find the links which will eventually lead to the CIF file**. This is far more reliable than using regular expressions, though still ultimately relies on the HTML structure of the page staying the same.
- note that wherever possible, I have tried to leave HTML structural elements out of the XPaths that I have used. Instead I have aimed to find something unique to the links that I am searching for and include that in the XPath: for instance, using .//a[contains(@href,'vi.gif')] rather than ./html/body/div/a[@href='../foo/bar.html']. This means that the XPaths do not rely on the HTML structure of the page staying the same (as websites tend to change things rather often), but only on the form of the URLs to the pages we want.
- when it finds a link to a CIF file the data is downloaded and written to the server file system (again using HttpClient). The DOI (Digital Object Identifier) is also stored, as this provides us with a method to link files created later permanently back to the original article.
**The page hierarchy and structure is different for each publisher, hence the route from issue homepage to CIF is also different. So for each publisher we have a set of XPaths that are used to extract the desired link(s) from a page before fetching the associated HTML page and then performing the next XPath on that, and so on until the CIF is reached.
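A minimal stand-in for the extraction step, using the JDK's DOM and XPath support rather than Tagsoup and XOM, and assuming the page is already well-formed XML:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class LinkExtractor {

    // Pull out the href of every <a> matched by an XPath keyed on the form
    // of the URL rather than on page structure. The production code tidies
    // real-world HTML with Tagsoup first; here the input is assumed to be
    // well-formed already.
    public static List<String> extractHrefs(String xhtml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
        NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> hrefs = new ArrayList<String>();
        for (int i = 0; i < matches.getLength(); i++) {
            hrefs.add(((Element) matches.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<a href='suppl/b123456sup1.cif'>CIF</a>"
                + "<a href='article.pdf'>PDF</a>"
                + "</body></html>";
        // a URL-based XPath survives page redesigns so long as CIF links
        // keep ending in '.cif'
        System.out.println(extractHrefs(page, "//a[contains(@href,'.cif')]"));
    }
}
```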
The links to the CIF files in an Acta Cryst journal can be found directly from the issue table of contents, whereas in an RSC journal the hierarchy looks like:
Recently, a better way of keeping track of the latest published crystallographic data has emerged.
RSS feeds for each journal at Acta and the RSC are now provided, which are updated each time a new article is published. The entries of these RSS feeds point directly to the article summary page at the respective site. We plan to rewrite the scraping robot so that it will read these RSS feeds, extracting the given article page links and then follow the links contained within to the CIF files. There are a couple of benefits to doing it this way:
- the feeds are updated on an article basis, so we can disseminate the crystallography as soon as each article is published rather than having to wait for the whole issue to be finished and then put on the site.
- the robot won't have to go looking to see if any new issues have been published, as the new content will be signalled by new entries in the feed it is reading.
- the robot will be relying on fewer URLs staying permanent for finding the content: namely, the URL for the RSS feed and the URL to the CIF from the article page, instead of the whole chain of pages that has to be followed when starting at the issue home page. The main problem with the current method is that it relies on too many URLs, any of which could change and break the system. Even worse, parsing and extracting information from HTML files to find links is very fragile, and the less of it done the better.
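Reading the article links out of an RSS 2.0 feed needs very little code; a JDK-only sketch (the class and method names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class FeedReader {

    // Collect the per-article links from an RSS 2.0 feed. In the planned
    // rewrite the robot would fetch each of these article pages and follow
    // the CIF link from there.
    public static List<String> articleLinks(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes("UTF-8")));
        NodeList links = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//item/link", doc, XPathConstants.NODESET);
        List<String> urls = new ArrayList<String>();
        for (int i = 0; i < links.getLength(); i++) {
            urls.add(links.item(i).getTextContent().trim());
        }
        return urls;
    }
}
```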
What Have We Got?
As of 2006-11-22, we believe we have gathered all CIFs available on the web from Acta Cryst, the RSC and CSOJ. The details of the number of CIFs scraped from each journal are:
- Acta Crystallographica
- Section A - Foundations of Crystallography: 40
- Section B - Structural Science: 1,706
- Section C - Crystal Structure Communications: 11,262
- Section D - Biological Crystallography: 3
- Section E - Structure Reports Online: 13,420
- Section F - Structural Biology and Crystallization Communications: 0
- Section J - Applied Crystallography: 39
- Section S - Synchrotron Radiation: 5
- total: 26,475
- Royal Society of Chemistry
- ChemComm: 2,474
- CrystEngComm: 218
- PCCP: 14
- Dalton Transactions: 3,448
- Journal of Materials Chemistry: 334
- New Journal of Chemistry: 672
- Organic & Biomolecular Chemistry: 414
- Perkin Transactions 1: 379
- Perkin Transactions 2: 225
- total: 8,178
- CSOJ
- total: 19,157
Overall no. of CIFs scraped: 53,810
Each CIF may also contain more than one datablock, with each datablock corresponding to a crystal structure. For all journals except Acta Cryst Section E, the trend seems to be that each CIF is likely to contain data on more than one structure. So I would predict that we could have anywhere up to 100,000 structures in total; though of course we cannot tell until all the CIFs have been parsed.
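As an illustration of the datablock counting involved, a naive sketch that treats every 'data_' header as one datablock (real CIFs can complicate this, e.g. with a data_global block of shared metadata):

```java
import java.util.ArrayList;
import java.util.List;

public class DatablockCounter {

    // A CIF datablock starts with a 'data_' header at the beginning of a
    // line; collecting the headers gives the datablock names, and the size
    // of the list gives the number of structures in the file.
    public static List<String> datablockNames(String cif) {
        List<String> names = new ArrayList<String>();
        for (String line : cif.split("\r?\n")) {
            if (line.startsWith("data_")) {
                names.add(line.trim().substring("data_".length()));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String cif = "data_I\n_cell_length_a 6.6686(10)\n"
                + "data_II\n_cell_length_a 7.1234(10)\n";
        System.out.println(datablockNames(cif)); // prints [I, II]
    }
}
```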
Part Two - Data Conversion
Again, when the cron job is run by the WWMM server, the robot will check the log file and look at the entry for each issue it holds:
<issue id="05-00">
  <processed value="true"/>
  <disseminated value="true"/>
  <cifs number="281"/>
</issue>
Depending on the value of the 'processed' element the robot will decide whether or not it needs to process the CIFs for that issue.
CIFDOM has been around in one form or another since 2002 (I think??) and can be used to losslessly parse a CIF file into an XML document. So an excerpt from a CIF file might look like:
loop_
_symmetry_equiv_pos_as_xyz
'x, y, z'
'-x, y+1/2, -z+1/2'
'-x, -y, -z'
'x, -y-1/2, z-1/2'
_cell_length_a                   6.6686(10)
_cell_length_b                   13.648(3)
_cell_length_c                   13.321(3)
_cell_angle_alpha                90
_cell_angle_beta                 95.794(18)
_cell_angle_gamma                90
_cell_formula_units_Z            4
_symmetry_space_group_name_H-M   'P 21/c'
When this is parsed using CIFDOM it would look like:
<loop names="_symmetry_equiv_pos_as_xyz">
  <row>
    <cell>x, y, z</cell>
  </row>
  <row>
    <cell>-x, y+1/2, -z+1/2</cell>
  </row>
  <row>
    <cell>-x, -y, -z</cell>
  </row>
  <row>
    <cell>x, -y-1/2, z-1/2</cell>
  </row>
</loop>
<item name="_cell_length_a">6.6686(10)</item>
<item name="_cell_length_b">13.648(3)</item>
<item name="_cell_length_c">13.321(3)</item>
<item name="_cell_angle_alpha">90</item>
<item name="_cell_angle_beta">95.794(18)</item>
<item name="_cell_angle_gamma">90</item>
<item name="_cell_formula_units_Z">4</item>
<item name="_symmetry_space_group_name_H-M">P 21/c</item>
Anyone who has tried extracting information from the CIF format knows how nasty it can be to associate a data value with its name when 'loops' and the various kinds of delimiters are involved. Thankfully, this has all been taken care of by CIFDOM, and extracting information from the XML representation is far easier, as you can use XPath expressions to find the desired data elements. Note that parsing a CIF document using CIFDOM takes just a few lines of code.
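For example, pulling a single cell parameter out of the CIFDOM-style XML shown above takes one XPath. This is a JDK-only sketch; CIFDOM itself is XOM-based, so the real code would query the XOM document instead:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class CifDomQuery {

    // Look up a single data item in XML laid out as in the excerpt above.
    // An XPath string evaluation returns the text of the first matching
    // element, or "" if there is no match.
    public static String itemValue(String cifdomXml, String itemName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(cifdomXml.getBytes("UTF-8")));
        return XPathFactory.newInstance().newXPath()
                .evaluate("//item[@name='" + itemName + "']", doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<cif>"
                + "<item name='_cell_length_a'>6.6686(10)</item>"
                + "<item name='_cell_length_b'>13.648(3)</item>"
                + "</cif>";
        System.out.println(itemValue(xml, "_cell_length_a")); // prints 6.6686(10)
    }
}
```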
CIF to Other Formats
As mentioned above, a CIF file may contain data on more than one structure. We don't want the extra complexity of having to deal with an unknown number of structures per CIF file, so the first thing we do is:
- take the CIF with n structures and convert it into a CIFDOM,
- split the CIFDOM up into n CIFDOMs with 1 structure,
- write out each CIFDOM as a new CIF file.
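The splitting step above can be sketched in plain text (the production code operates on the CIFDOM rather than on raw lines):

```java
import java.util.ArrayList;
import java.util.List;

public class CifSplitter {

    // Split a multi-structure CIF into one chunk per datablock, cutting at
    // each 'data_' header. Anything before the first header is dropped in
    // this simplified sketch.
    public static List<String> split(String cif) {
        List<String> blocks = new ArrayList<String>();
        StringBuilder current = null;
        for (String line : cif.split("\r?\n")) {
            if (line.startsWith("data_")) {
                if (current != null) {
                    blocks.add(current.toString());
                }
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            blocks.add(current.toString());
        }
        return blocks;
    }
}
```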
Once we have one structure per file, the processing of the crystallographic data begins. The steps taken are as follows:
- submit the CIF to the IUCr CheckCIF service, for which I have written a Java wrapper so that it can be used as a kind of 'Web Service'. Write the retrieved CheckCIF HTML to the file system.
- convert the CheckCIF HTML to CheckCIF XML using CheckCIFParser (not yet publicly available).
- from the CheckCIF XML, extract the link to the ORTEP ellipsoid image for the crystal structure. Fetch the file from the web and save to the file system (we do this as the ORTEP plot is only available for a short time after submitting your CIF to CheckCIF).
- convert the CIF into a CIFDOM.
- losslessly convert the data in the CIFDOM into CML (we will call this the 'complete CML') using JUMBO (quite an involved process - discussed further below).
- merge the CheckCIF XML into the complete CML. Write to the file system.
- extract the CML data for each moiety from the complete CML. Write each out to the file system.
- from the moiety CML, get the CML for fragments corresponding to:
- metal centres
- metal clusters
- ring-ring and ring-terminal linkers
- and write them to the file system. Note that we also 'sprout' each generated fragment twice; that is, we also generate and store the CML for the fragments given by growing one and two further ligand shells out from the original fragment found.
It is worth noting that in the above steps, whenever we create a CML file, we always generate the corresponding InChI (using the jni-inchi wrapper) and SMILES (using the CDK) and merge this information into the CML in <identifier> elements. The original article DOI is also always added to generated CML. We also always generate a 2D structure diagram as a PNG (using Renderer2D in the CDK).
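The merging of an identifier into a CML document can be sketched with the JDK's DOM support; the 'convention' attribute used here to label the identifier type is an assumption, not necessarily the convention CrystalEye uses:

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class IdentifierMerger {

    // Append an <identifier> child (e.g. an InChI or SMILES) to the root
    // element of a CML document and serialize the result.
    public static String addIdentifier(String cml, String convention, String value)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(cml.getBytes("UTF-8")));
        Element id = doc.createElement("identifier");
        id.setAttribute("convention", convention);
        id.setTextContent(value);
        doc.getDocumentElement().appendChild(id);
        Transformer tr = TransformerFactory.newInstance().newTransformer();
        tr.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        tr.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```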
The CIF2CML Process
As already mentioned, the CIF2CML process is lossless. In fact, the data remains unchanged in going from CIF->CIFDOM->CML (though we do add to it); it is just the markup that changes. Below is the CML representation of the example data from earlier in CIF and CIFDOM form.
<crystal z="4">
  <scalar dictRef="iucr:_cell_length_a" dataType="xsd:double" errorValue="0.0010">6.6686</scalar>
  <scalar dictRef="iucr:_cell_length_b" dataType="xsd:double" errorValue="0.0030">13.648</scalar>
  <scalar dictRef="iucr:_cell_length_c" dataType="xsd:double" errorValue="0.0030">13.321</scalar>
  <scalar dictRef="iucr:_cell_angle_alpha" dataType="xsd:double" errorValue="0.0">90.0</scalar>
  <scalar dictRef="iucr:_cell_angle_beta" dataType="xsd:double" errorValue="0.018">95.794</scalar>
  <scalar dictRef="iucr:_cell_angle_gamma" dataType="xsd:double" errorValue="0.0">90.0</scalar>
  <symmetry spaceGroup="P 21/c">
    <transform3>1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0</transform3>
    <transform3>-1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.5 0.0 0.0 -1.0 0.5 0.0 0.0 0.0 1.0</transform3>
    <transform3>-1.0 0.0 0.0 0.0 0.0 -1.0 0.0 0.0 0.0 0.0 -1.0 0.0 0.0 0.0 0.0 1.0</transform3>
    <transform3>1.0 0.0 0.0 0.0 0.0 -1.0 0.0 -0.5 0.0 0.0 1.0 -0.5 0.0 0.0 0.0 1.0</transform3>
  </symmetry>
</crystal>
It is simple to convert the author and experimental metadata and the experimental conditions to CML. The difficulty lies in creating the complete crystallochemical unit from the data in the CIF; we are only provided with a set of atoms (including H's) with their coordinates (which may or may not correspond to the whole unit cell contents, or even complete molecules!), the unit cell parameters and symmetry elements.
Completing the Molecules
So to calculate the complete connection table for the crystallographic formula unit, the following two steps are performed:
- process the disorder, if any, in the crystal structure, removing the disordered atoms of lowest occupancy.
- using the non-translational symmetry elements provided, and an orthogonalization matrix created following Rollett (Computing Methods in Crystallography, Pergamon, 1965, p. 23), calculate the symmetry-related molecules in the unit cell and add bonds where appropriate.
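The core of the symmetry expansion is applying each operator to every atom's fractional coordinates. A sketch using the row-major 4x4 layout of the <transform3> elements shown earlier:

```java
public class Symmetry {

    // Apply a 4x4 symmetry operator (16 values, row-major, as in a
    // <transform3> element) to a fractional coordinate triple.
    public static double[] apply(double[] op, double[] frac) {
        double[] out = new double[3];
        for (int i = 0; i < 3; i++) {
            out[i] = op[4 * i] * frac[0]
                   + op[4 * i + 1] * frac[1]
                   + op[4 * i + 2] * frac[2]
                   + op[4 * i + 3];          // translation component
        }
        return out;
    }

    public static void main(String[] args) {
        // the screw operator '-x, y+1/2, -z+1/2' from the example above
        double[] op = {-1, 0, 0, 0,
                        0, 1, 0, 0.5,
                        0, 0, -1, 0.5,
                        0, 0, 0, 1};
        double[] image = apply(op, new double[] {0.25, 0.25, 0.25});
        System.out.println(java.util.Arrays.toString(image)); // prints [-0.25, 0.75, 0.25]
    }
}
```

After each atom is mapped through every operator (and duplicates folded back into the unit cell), bonds can be added wherever symmetry-related atoms fall within bonding distance.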
Once we have the complete connection table, we need to calculate the correct bond orders and charges for each molecule in the unit cell. This is described here.
When this has finished, the robot performs some simple validation on the files generated, then returns to the log file and updates the 'processed' element for this issue to 'true'.
Part Three - Dissemination
At this point we have finished our processing of the crystallographic data. All that is left to do now is to let the world know about it, and then to use it for some research! Once more, when the cron job is run, the robot checks the 'disseminated' element for each issue. If the value is 'false' it will run the following steps.
Our two methods of dissemination are RSS and CMLRSS.
For each journal we aim to have an RSS feed, which will be updated each time we process the contents of the latest issue. The new entry of the feed will contain a link to a website that is automatically generated by our robot. An example of the homepage of one of these generated websites can be found here.
Generating the Website
The contents of the website provide a summary of the data from the scraped CIF files, as well as providing links to all files generated. The HTML files the robot generates for the website are as follows:
- the topmost page contains a table where each row corresponds to an article in the issue that has been scraped. This provides links back to the original articles, and also to,
- pages summarising some of the contents of each 'complete' CML file and also providing links to all the files generated in the data conversion step (example), namely:
- a page with a table where each row corresponds to a moiety in the unit cell, with links to the files generated for each, and
- a page with a table where each row corresponds to a fragment generated from the 'complete' CML.
Note that on each HTML page created from a CML file there is a 3D rendering of the structure provided in a Jmol applet.
Once all the HTML pages have been created, the robot automatically updates the RSS feed (using the ROME library), followed by the system log.
CMLRSS is an extension of RSS in which, instead of each entry merely supplying a URL pointing to web content, the content is contained within the feed itself. Thus a CMLRSS entry contains a complete CML document. For each journal we will provide a CMLRSS feed where each entry corresponds to one crystal structure.
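Embedding the CML in a feed entry can be sketched with the JDK's DOM support (the production feeds are built with ROME; the element layout here is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CmlRssEntry {

    // Build a minimal feed <item> with the CML document embedded as a
    // child element rather than linked by URL.
    public static String buildItem(String title, String cml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element item = doc.createElement("item");
        doc.appendChild(item);
        Element titleEl = doc.createElement("title");
        titleEl.setTextContent(title);
        item.appendChild(titleEl);
        // parse the CML and graft it into the item
        Document cmlDoc = builder.parse(new ByteArrayInputStream(cml.getBytes("UTF-8")));
        item.appendChild(doc.importNode(cmlDoc.getDocumentElement(), true));
        Transformer tr = TransformerFactory.newInstance().newTransformer();
        tr.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        tr.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```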
The benefit of producing CMLRSS is that the newly produced data can be read directly by third party applications. A great example of this is in Bioclipse, which has a CMLRSS reader built in. Bioclipse can read CMLRSS feeds and immediately render the structures in 2D and 3D forms; all while providing editors to view the information in the feed.
While we will be providing CMLRSS feeds of the total output of each journal, we are also keen to create feeds providing structures that match specific criteria, for instance:
- bonds between particular elements,
- unusual bond lengths,
- the lengths of unusual bonds, etc.
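A filter for the first of these criteria might look like the following sketch, which assumes a simplified, namespace-free CML layout (atoms carrying id and elementType, bonds referencing atoms via atomRefs2):

```java
import java.io.ByteArrayInputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class BondFilter {

    // True if the molecule contains a bond between the two given element
    // types (in either order).
    public static boolean hasBond(String cml, String el1, String el2) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(cml.getBytes("UTF-8")));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // map atom id -> element type
        Map<String, String> elementOf = new HashMap<String, String>();
        NodeList atoms = (NodeList) xpath.evaluate("//atom", doc, XPathConstants.NODESET);
        for (int i = 0; i < atoms.getLength(); i++) {
            Element atom = (Element) atoms.item(i);
            elementOf.put(atom.getAttribute("id"), atom.getAttribute("elementType"));
        }
        // check each bond's two atom references
        NodeList bonds = (NodeList) xpath.evaluate("//bond", doc, XPathConstants.NODESET);
        for (int i = 0; i < bonds.getLength(); i++) {
            String[] refs = ((Element) bonds.item(i)).getAttribute("atomRefs2").split(" ");
            String a = elementOf.get(refs[0]);
            String b = elementOf.get(refs[1]);
            if ((el1.equals(a) && el2.equals(b)) || (el1.equals(b) && el2.equals(a))) {
                return true;
            }
        }
        return false;
    }
}
```

A feed generator could run such a predicate over each structure's CML and include only the matches in the corresponding CMLRSS feed.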
What will you do with the data now?
Details to come...