After getting annoyed with myself for re-curating datasets I had already curated before, I started with an obvious, simple, but imho important curation of the USPTO reaction datasets that have been “pre-curated” in the past into csv / rsmi format. I collected them and uploaded them here at figshare.
The public curation of the USPTO datasets has been going on for nearly a decade. The original USPTO data consists of a bunch of XML files, which certainly contain a number of errors but also a lot of useful data. The first obvious curation effort was made by Lowe during his Ph.D. work and later published on figshare. It even ended up as a commercial database. Other groups have used this dataset for chemical reaction prediction modeling and have done their own curation, disclosed when publishing their research; among them Jin (& Coley) et al. (MIT), followed by Schwaller et al. (IBM Zurich). In the end, because of “the” consortium that was founded, many of the names one currently sees in the ML/AI field of synthesis prediction are interconnected. It is certainly a great collaboration, though one can suspect that competition exists there just as in any other research field.
Back to the topic at hand though: the yield-curated datasets. In particular, I was annoyed by the ton of data with messy yield columns: two columns that contain faulty, partial, or no data. This had to be deconvoluted somehow into something useful. By the time the curation was finished, roughly 50% of the data had disappeared. So despite a valiant effort, especially by Lowe, the data is essentially full of noise. And that is before considering (in)correct structures! The alternative would of course be to curate the original USPTO XML files oneself, or perhaps the CML files by Lowe containing the original / more curated data. That is a different story though and not part of this discussion.
Here is an example of the kind of yield values found in the original files (left two columns). After cleaning these numbers (see later on), they end up as the “Cleaned columns” (not part of the final datasets), which are then combined to give the final Yield column (to the right).
I hacked together a Python script to do precisely this curation; you can find the code for it here on Github. As an alternative, since I also like working with Knime, and for people who do not want to or cannot program, I have implemented the same steps in a graphical Knime workflow (available on the Knime-hub). That Knime workflow also contains a bonus analysis step: a simple proof of concept on how to split the reaction SMILES (data not shared though, it was only a POC after all).
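For readers who prefer code over Knime, splitting a reaction SMILES could be sketched in Python roughly as follows. This is not the Knime workflow itself, only a minimal illustration: the function name `split_reaction_smiles` and the example string are made up for this sketch. It relies on the standard reaction SMILES layout `reactants>agents>products`, with individual molecules separated by dots.

```python
def split_reaction_smiles(rsmi: str) -> dict:
    """Split a reaction SMILES into reactant, agent and product lists.

    Hypothetical helper for illustration; not taken from the original
    script or the Knime workflow.
    """
    # Keep only the first whitespace-separated token, in case the
    # string carries trailing annotations after the reaction itself.
    core = rsmi.split()[0]

    # A reaction SMILES has exactly two '>' separators.
    reactants, agents, products = core.split(">")

    def to_list(part: str) -> list:
        # An empty section (e.g. no agents) becomes an empty list.
        return part.split(".") if part else []

    return {
        "reactants": to_list(reactants),
        "agents": to_list(agents),
        "products": to_list(products),
    }


# Made-up example reaction (acid-catalysed esterification).
parts = split_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
print(parts)
```

Note that splitting on dots also separates the ions of a salt into individual fragments, so this is really only a starting point.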
The Python script (and technically also the Knime workflow) consists of two main components. First comes the clean-up of all the >= and other text tokens (see also the table above), so that the values can be converted to numbers (I use a pandas df for the data; see also the original script & Readme for more details):
    data["CalculatedYield"] = data["CalculatedYield"].str.rstrip("%")
    data["TextMinedYield"] = data["TextMinedYield"].str.lstrip("~")
    data["TextMinedYield"] = data["TextMinedYield"].str.rstrip("%")
    data["TextMinedYield"] = data["TextMinedYield"].str.replace(">=", "", regex=True)
    data["TextMinedYield"] = data["TextMinedYield"].str.replace(">", "", regex=True)
    # ... etc ...
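As a compact illustration of this cleaning step, here is a minimal sketch that strips the tokens in one pass and then converts to numbers. It is not the original script: the helper name `clean_yield_column`, the character set in the regular expression, and the toy data are assumptions made for this example, and it assumes both columns were read in as strings.

```python
import pandas as pd


def clean_yield_column(series: pd.Series) -> pd.Series:
    """Strip common non-numeric tokens and convert a yield column to float.

    Hypothetical helper; the set of stripped characters is an assumption
    and may not cover every token handled by the original script.
    """
    cleaned = (
        series.str.replace(r"[~%<>=]", "", regex=True)  # drop ~, %, <, >, =
        .str.strip()                                    # drop stray whitespace
    )
    # Anything that still is not a number (e.g. free text) becomes NaN.
    return pd.to_numeric(cleaned, errors="coerce")


# Toy data, made up for illustration only.
data = pd.DataFrame(
    {
        "TextMinedYield": ["~95%", ">=80%", "quantitative", None, "45.5%"],
        "CalculatedYield": ["93.2%", None, "101.5%", "12%", "44%"],
    }
)

for col in ["TextMinedYield", "CalculatedYield"]:
    data[col] = clean_yield_column(data[col])

print(data)
```

Using `errors="coerce"` means entries that cannot be parsed end up as NaN instead of raising an error, which suits a noisy column like this one.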
Then the two numbers are compared to each other and only the larger one is kept, provided it lies between 0 and 100. Plain and simple:
    def deconvolute_yield(row):
        text_yield_data = row["TextMinedYield"]
        calc_yield_data = row["CalculatedYield"]
        my_text_yield = 0
        my_calc_yield = 0
        if 0 < text_yield_data <= 100:
            my_text_yield = text_yield_data
        if 0 < calc_yield_data <= 100:
            my_calc_yield = calc_yield_data
        out_yield = my_text_yield
        if my_calc_yield > my_text_yield:
            out_yield = my_calc_yield
        return out_yield
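The same logic can also be written in a vectorized way, which tends to be faster on large tables. The sketch below is an alternative formulation, not the code from the original script: the helper name `combine_yields` and the toy numbers are made up, and dropping rows whose final yield is 0 is my assumption about how the unusable entries get removed.

```python
import numpy as np
import pandas as pd


def combine_yields(data: pd.DataFrame) -> pd.Series:
    """Return the larger of the two yields, counting only values in (0, 100].

    Hypothetical vectorized counterpart of deconvolute_yield.
    """
    def valid_or_zero(s: pd.Series) -> pd.Series:
        # NaN fails both comparisons, so missing values also become 0.
        return s.where((s > 0) & (s <= 100), 0)

    text = valid_or_zero(data["TextMinedYield"])
    calc = valid_or_zero(data["CalculatedYield"])
    return np.maximum(text, calc)


# Toy data, made up for illustration only (already cleaned to numbers).
data = pd.DataFrame(
    {
        "TextMinedYield": [95.0, 80.0, np.nan, np.nan, 45.5],
        "CalculatedYield": [93.2, np.nan, 101.5, 12.0, 44.0],
    }
)

data["Yield"] = combine_yields(data)

# Assumption: rows without any usable yield (Yield == 0) are discarded.
curated = data[data["Yield"] > 0]
print(curated)
```

The row-wise function above could be applied in the same place with `data.apply(deconvolute_yield, axis=1)`; both routes should give the same Yield column.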
Go check it out if you have use for this type of data. Of course I do welcome feedback!
Anyway, in light of current big-data needs this is just a drop in the ocean, and perhaps not even necessary anymore considering the upcoming Open Reaction Database? We shall see what it will bring; afaik it is going to be officially launched any time now(?).