Sunday, March 27, 2011

Towards the automated discovery of useful solubility applications

Last week, I came across (via David Bradley) a paper by an MIT group regarding the desalination of water using a very clever application of solubility behavior:
Anurag Bajpayee, Tengfei Luo, Andrew Muto and Gang Chen, "Very low temperature membrane-free desalination by directional solvent extraction", Energy Environ. Sci., 2011 (article, summary)
The technique simply involves heating saltwater with molten decanoic acid to 40-80 C. Some water dissolves into the decanoic acid, leaving the salt behind. The layers are then separated and, upon cooling to 34 C, sufficiently pure water separates out. Any traces of decanoic acid left behind are inconsequential, since this compound is already present in many foods at higher levels.

From a technological standpoint, I can't think of a reason why this solution could not have been discovered and implemented 100 years ago. It makes you wonder how many other elegant solutions to real problems could be uncovered by connecting the right pieces together.

To me, this is where the efforts of Open Science and the automation of the scientific process will pay off first. For this to happen on a global level, two key requirements must be met:
1) Information must be freely available, optimally as a web service (measurements if possible; otherwise predicted values, preferably from an Open Model)
2) There has to be a significantly automated way of identifying which problems are important enough to solve.
Since we have been working on fulfilling the first requirement for solubility data, I first looked at our available services to see if there was anything there that could have pointed towards this solution.

Although we have a measured (0.0004 M) and a predicted (0.001 M) room temperature solubility of decanoic acid in water, our best prediction service can't do the opposite: predict the solubility of water in decanoic acid. For that we would need the Abraham descriptors for decanoic acid as a solvent, and those are not yet available as far as I'm aware.
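To make the gap concrete, here is a minimal sketch of the calculation we would like to be able to run. The solute descriptors for water are the commonly tabulated values (quoted from memory, so verify before relying on them) and the solvent coefficients for molten decanoic acid are pure placeholders - those placeholders are exactly the missing piece.

# Sketch only: the Abraham LFER has the general form
#   log SP = c + e*E + s*S + a*A + b*B + v*V
# where E, S, A, B, V are solute descriptors and c, e, s, a, b, v
# characterize the solvent (or partitioning system).

def abraham_log_sp(solute, solvent):
    return (solvent["c"]
            + solvent["e"] * solute["E"]
            + solvent["s"] * solute["S"]
            + solvent["a"] * solute["A"]
            + solvent["b"] * solute["B"]
            + solvent["v"] * solute["V"])

# Commonly tabulated solute descriptors for water (illustrative; verify before use).
water = {"E": 0.00, "S": 0.45, "A": 0.82, "B": 0.35, "V": 0.1673}

# Solvent coefficients for molten decanoic acid: unknown, so placeholders only.
decanoic_acid = {"c": 0.0, "e": 0.0, "s": 0.0, "a": 0.0, "b": 0.0, "v": 0.0}

print(abraham_log_sp(water, decanoic_acid))  # meaningless until real coefficients exist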

Also, we use a model to predict solubility at different temperatures - but it assumes that the solute is miscible with the solvent at its melting point. This is probably a reasonable assumption for the most part but it fails when the solute and the solvent are too radically dissimilar (e.g. water/hydrophobic organic compounds). In this particular application, decanoic acid melts at 31 C and the process occurs in the 34-80 C range.
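For reference, one common way to write down that miscible-at-the-melting-point assumption (not necessarily the exact form of our model) is the ideal solubility equation. The sketch below uses it with illustrative numbers only - the enthalpy of fusion is a made-up round figure, not a literature value.

# Ideal solubility relation: ln x(T) = (dHfus/R) * (1/Tm - 1/T).
# At T = Tm this predicts a mole fraction of 1, i.e. complete miscibility,
# which is exactly the assumption that breaks down for water and decanoic acid.

import math

R = 8.314  # gas constant, J/(mol*K)

def ideal_mole_fraction(T, Tm, dHfus):
    """Ideal mole-fraction solubility of a solute at temperature T (kelvin), capped at 1."""
    x = math.exp((dHfus / R) * (1.0 / Tm - 1.0 / T))
    return min(x, 1.0)

# Decanoic acid melts near 31 C (about 304 K); at 40 C (313 K) the ideal model
# already predicts complete miscibility with any solvent - clearly wrong for water.
print(ideal_mole_fraction(T=313.0, Tm=304.0, dHfus=28000.0))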

But even if we had the necessary models (and corresponding web services) for the decanoic acid/water/NaCl system, could it have been flagged in an automated way as being potentially "useful" or even "interesting"?

For utility assessment, humans are still the best source. Luckily, they often record this information, flagged by common phrases, in the introductory paragraphs of scientific papers. (In fact, this is the origin of the UsefulChem project.) For example, if we run a Google search for "there is a pressing need for" AND solubility, most of the results provide reasonable answers to the question of what a useful application of solubility might be. I have summarized the initial results in this sheet.
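As a toy illustration of the idea - emphatically not the actual UsefulChem pipeline, and with a made-up phrase list and naive sentence splitting - the crudest version of the phrase search might look like this:

# Toy sketch: return "need" sentences from documents that also mention a topic keyword.
import re

NEED_PHRASES = [
    "there is a pressing need for",
    "there is an urgent need for",
    "remains a major challenge",
]

def flag_useful_problems(text, topic="solubility"):
    if topic.lower() not in text.lower():
        return []
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.strip() for s in sentences
            if any(phrase in s.lower() for phrase in NEED_PHRASES)]

example = ("There is a pressing need for new materials for efficient CO2 separation. "
           "The solubility of gases in polymers is therefore of great interest.")
print(flag_useful_problems(example))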

The first result is:
"there is a pressing need for new materials for efficient CO2 separation" from a Macromolecules article in 2005. The general problem needing solving would correspond to "global warming/CO2 sequestration" and the modeling challenge would be "gas solubility".

Analyzing the first 9 results in this way gives us the following problem types:
  1. global warming/CO2 sequestration
  2. fire control
  3. global warming/refrigeration fluid
  4. AIDS prevention
  5. iron absorption in developing countries
  6. agriculture/making phosphate from rock bioavailable
  7. water treatment/flocculation
  8. natural gas purification/environmental
  9. waste water treatment
and the following modeling challenges:
  1. gas solubility
  2. polymer solubility
  3. hydrofluoroether solubility
  4. solubility of drug in gels
  5. inorganics
  6. inorganics/pH dependence of solubility
  7. polymer solubility/flocculation/colloidal dispersions
  8. gas solubility
  9. inorganics
These preliminary results are instructive. The problem types are broad and varied - and I think they will be worth keeping in mind as we continue to work on solubility. The modeling challenges can be compared directly with our existing services - and none of them overlap at this time! All of these involve either gases, polymers, gels, salts, inorganics or colloids, while our services are strictly for small, non-ionic organic compounds in liquid solvents.

Part of the reason for our focus on these types of compounds relates to our ultimate objective of assessing and synthesizing drug-like compounds. But a more important consideration is what type of information is available and what can be processed with existing cheminformatics tools. Currently most of these tools deal only with organic chemicals, with essential sources such as ChemSpider and the CDK providing measurements, models, descriptors, etc.

Even though some inorganic compounds are on ChemSpider, most of the properties are unavailable. Consider the example of sodium chloride:


This doesn't mean that the situation is hopeless but it does make the challenge much more difficult. Solubility measurements and models for inorganic salts do exist (for example see Abdel-Halim et al.) but they are much more fragmented.

With the feedback we obtain from this search phrase approach - and hopefully help from experts in the chemistry community - we can piece together a federated service to provide reasonable estimates for most types of solubility behavior.

I think that this desalination solution will prove to be a good test for automated (or at least semi-automated) scientific discovery in the realm of open solubility information. In order to pass the test, the phrase-searching algorithm should eventually identify desalination as a "useful problem to solve" and should connect it with the predicted behavior of the water/NaCl/decanoic acid system (or a similar system).

Luckily we have Don Pellegrino on board. His expertise on automated scientific discovery should prove quite valuable for this approach.

Tuesday, March 22, 2011

Open modeling of melting point data

The contribution of Alfa Aesar melting point data to our open collection has made it possible to validate a significant portion of the entire dataset. However, this process of curation is never-ending. A good example is the discovery of an error in one of the sources for the melting point of warfarin. Following David Weinberger's post about our melting point explorer, his brother Andy noticed the problem, which enabled us to fix it.

In a way, creating an open environment to make it easy to find and report errors - as well as add new data - complicates scientific evaluation. In order to report a reproducible process and outcome, it is necessary to take a snapshot of the dataset. Choosing the exact composition of a dataset for a particular application is somewhat arbitrary. Aside from selecting a threshold for excluding measurements that deviate too much, compounds may be excluded based on their type.

For the sake of clarity, we archived the various datasets we created from multiple sources, with brief descriptions of the filtering and merging at each step. From the perspective of an organic chemist, ONSMP013 is probably the most useful at this time. It contains averaged measurements for 12,634 organic compounds and excludes salts, inorganics and organometallics. The original file provided by Alfa Aesar contained several of these excluded compounds and can be obtained from ONSMP000. It might be interesting at some point to create a collection of melting points for inorganics or salts. We would welcome contributions of melting point collections assembled with different filters.

One of the advantages of ONSMP013 is that it is possible to generate CDK descriptors for each entry (and these are included in the spreadsheet). Because no commercial software is needed to generate the descriptors, the modeling is fully transparent - and can be extended by anyone.

With this in mind, Andrew Lang has used ONSMP013 to generate a Random forest melting point model (MPM002). The most important descriptors turned out to be the number of hydrogen bond donors and the Topological Polar Surface Area (TPSA). The scatter plot below shows the correlation (R2 = 0.79) between the predicted and experimental values. (color represents TPSA and size relates to H-bond donors)


Andy has described in much more detail the rationale for selecting the Random forest approach over a linear model in MPM001. He has also compared the performance of CDK descriptors versus those from a commercial program for a small set of drug melting points in MPM003.
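For anyone who wants to try this kind of modeling themselves, here is a minimal sketch of the workflow. It is not Andy's actual MPM002 code: the file name and column names are made up, and scikit-learn is used simply because it is freely available.

# Sketch: fit a random forest to a table of descriptors and measured melting points.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

data = pd.read_csv("ONSMP013_descriptors.csv")    # hypothetical export of the dataset
X = data.drop(columns=["csid", "mp_celsius"])     # descriptor columns (names made up)
y = data["mp_celsius"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(n_estimators=500, random_state=42)
model.fit(X_train, y_train)

print("R2 on held-out data:", r2_score(y_test, model.predict(X_test)))
# In MPM002 the two most important descriptors were the H-bond donor count and TPSA.
print(sorted(zip(model.feature_importances_, X.columns), reverse=True)[:5])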

The Random forest model (MPM002) is also now available as a web service: simply enter the ChemSpiderID (CSID) of a compound in the URL. See this example for benzoic acid. If experimental results exist they will appear on top, and a link to obtain the predicted melting point will appear underneath.

Note that the current web service for predicting melting points can be slow - it may take a minute to process.
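For scripted access, something like the sketch below should work. The base URL is only a placeholder (the real address and URL pattern are in the benzoic acid example linked above), and the long timeout reflects the slow response just mentioned.

# Hedged sketch: query the melting point service for a given ChemSpiderID (CSID).
import requests

SERVICE_URL = "http://example.org/melting-point-service"  # placeholder, not the real address

def predicted_melting_point(csid):
    response = requests.get(SERVICE_URL, params={"csid": csid}, timeout=120)
    response.raise_for_status()
    return response.text

# Usage: predicted_melting_point(csid_of_interest)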

Additional web services for melting point data will be listed on the ONS web services wiki.

Friday, March 04, 2011

Validating Melting Point Data from Alfa Aesar, EPI and MDPI

I recently reported that Alfa Aesar publicly released their melting point dataset for us to use in taking temperature into account in solubility measurements. Since then, Andrew Lang, Antony Williams and I have had the opportunity to look into the details of this and other open melting point datasets. (See here for links and definitions of each dataset.)

An initial evaluation by Andy found that the Alfa Aesar collection yielded better correlations with selected molecular descriptors than the Karthikeyan dataset (originally from MDPI), an open collection of melting points used by several researchers to build predictive melting point models. This suggested that the quality of the Alfa Aesar dataset might be higher.

Inspection of the Karthikeyan dataset did reveal some anomalies that may account for the poorer correlations. First, there were several duplicates - identical compounds with different melting points, sometimes radically different (by up to 176 C). A total of 33 duplicates (66 measurements) were found with a difference in melting points greater than 10 C (see the ONSMP008 dataset). Some examples are shown after the short sketch below.
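The check itself is easy to script. A short sketch follows, assuming a CSV export of the dataset with made-up column names "smiles" and "mp_celsius":

# Sketch: find structures reported more than once with melting points that disagree by more than 10 C.
import pandas as pd

records = pd.read_csv("karthikeyan_meltingpoints.csv")   # hypothetical export
spread = records.groupby("smiles")["mp_celsius"].agg(["min", "max", "count"])
spread["range"] = spread["max"] - spread["min"]
conflicts = spread[(spread["count"] > 1) & (spread["range"] > 10)]
print(len(conflicts), "structures with conflicting melting points")
print(conflicts.sort_values("range", ascending=False).head())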


A second problem involved difficulty processing some of the SMILES strings in the Karthikeyan collection. Most of these involved SO2 groups. For example, attempting to view the following SMILES string in ChemSketch ends up with two extra hydrogens on the sulfur:
[S+2]([O-])([O-])(OCC#N)c1ccc(C)cc1
Other SMILES strings render with 5 bonds on a carbon and ChemSketch draws these with a red X on the problematic atom. See for example this SMILES string:
O=C(OC=1=C2C=CC=CC2=NC=1c1ccccc1)C


Note that the sulfur compounds appear to render correctly on Daylight's Depict site:

In total, 311 problematic SMILES from the Karthikeyan collection were removed (see ONSMP009).
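This kind of screening can also be automated with an open toolkit. The sketch below uses RDKit - not the software discussed above, and different toolkits will disagree on borderline cases like the charge-separated sulfone - to collect SMILES strings that fail to parse cleanly:

# Sketch: collect SMILES strings that an open-source parser cannot sanitize.
from rdkit import Chem

smiles_list = [
    "[S+2]([O-])([O-])(OCC#N)c1ccc(C)cc1",  # the charge-separated sulfone discussed above
    "O=C(OC=1=C2C=CC=CC2=NC=1c1ccccc1)C",   # the string that draws with five bonds on one carbon
    "OC(=O)Cc1ccccc1",                      # phenylacetic acid, which parses without trouble
]

problematic = []
for smi in smiles_list:
    # MolFromSmiles returns None (and logs a warning) when parsing or sanitization fails.
    if Chem.MolFromSmiles(smi) is None:
        problematic.append(smi)

print(len(problematic), "problematic SMILES:", problematic)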

With the accumulation of melting point sources, overlapping coverage is revealing likely incorrect values. For example, 5 measurements are reported for phenylacetic acid.

Four of the values cluster very close to 77 C and the other - from the Karthikeyan dataset - is clearly an outlier at 150 C.

In order to predict the temperature dependence of solubility for the solutes in our database, Andy collected the EPI experimental melting points, which are listed under the predicted properties tab in ChemSpider (and ultimately come from the EPA). (There are predicted EPI values there as well, but we only used the ones marked exp.)

This collection of 150 compounds was then listed in a spreadsheet (ONSMP010) and each entry was marked as having only an EPI value (44 compounds) or at least one other measurement from another source (106 compounds). Of those with at least one additional value, 10 showed significant differences (> 5 C) between the measurements. Upon investigation, many of these point strongly to the error lying with the EPI dataset. For example, the EPI melting point for phenyl salicylate is over 85 C higher than the value reported by both Sigma-Aldrich and Alfa Aesar.


These preliminary results suggest that as much as 10% of the EPI experimental melting point dataset is significantly in error. Only a systematic analysis over time will reveal the full extent of the deficiencies.
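That kind of analysis is straightforward to automate. As a sketch (with placeholder numbers, not the actual ONSMP010 entries), one could flag every compound whose EPI value sits more than 5 C from the median of the other reported values:

# Sketch: flag compounds where the EPI value deviates from the median of the other sources by more than 5 C.
from statistics import median

measurements = {
    # compound name: (EPI value in C, [values from other sources in C]) - placeholders only
    "compound A": (95.0, [41.5, 42.0]),
    "compound B": (122.0, [121.5, 122.5, 123.0]),
}

for name, (epi_value, others) in measurements.items():
    deviation = abs(epi_value - median(others))
    if deviation > 5:
        print(f"{name}: EPI value off by {deviation:.1f} C")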

So far the Alfa Aesar dataset has not produced many outliers when other sources are available for comparison. However, even here there are some surprising results. One of the best-studied organic compounds - ethanol - is listed with a melting point of -130 C by Alfa Aesar, clearly an outlier from the other values, which cluster around -114 C.

When the Karthikeyan dataset is downloaded from Cheminformatics.org, a Trust Level field indicates: "High - Original Author Data".

It would be nice if it were that simple. Unfortunately, there are no shortcuts. There is no place for trust in science. The best we can do is to collect several measurements from truly independent sources and look for consensus over time. Where consensus is not obvious and information sources are exhausted, performing new measurements will be the only way left to make progress.

The idea that a dataset has been validated - and can be trusted completely - simply because it is attached to a peer-reviewed paper is a dangerous one. This is perhaps the rationale used by projects such as Dryad, where datasets are not accepted unless they are associated with a peer-reviewed paper. Peer review was not designed to validate datasets - even if we wanted it to, reviewers don't typically have access to enough information to do so.

The usefulness of a measurement depends much more on the details in the raw data, uncovered by following the chain of provenance (when available), than on where it is published. To be fair, in the case of melting point measurements there really isn't that much additional experimental information to provide, except perhaps an NMR spectrum of the sample to prove that it was completely dry. In such cases, we have no choice but to rely on redundancy until a consensus value is finally reached.

Creative Commons Attribution Share-Alike 2.5 License