Curation of yield in the public USPTO reaction datasets – now on Figshare

After getting annoyed with myself for re-curating datasets I had already curated before, I started with an obvious, simple, but imho important curation of the USPTO reaction datasets that have been “pre-curated” in the past into csv / rsmi format. I collected them and uploaded them here on Figshare.


The public curation of the USPTO datasets is by now nearly a decade old. The original USPTO data consists of a bunch of XML files, which certainly contain a number of errors but also a lot of (useful) data. The first obvious curation effort was made by Lowe during his Ph.D. work and later published on Figshare. It even ended up as a commercial database. Other groups have used this dataset for chemical reaction prediction modeling and have done their own curation, disclosed when they published their research. Among them are Jin (& Coley) et al. (MIT), followed by Schwaller et al. (IBM Zurich). In the end, due to “the” consortium that was founded, many of the names one currently sees in the ML/AI field of synthesis prediction are interconnected. It is certainly a great collaboration, though one can suspect that there is competition too, just like in any other research field.

Back to the topic at hand though: the yield-curated datasets. In particular, I was annoyed by the ton of data with messy yield columns: two columns that hold faulty, partial, or no data. These had to be deconvoluted somehow into something useful. Once the curation is done, roughly 50% of the data has disappeared. So despite a valiant effort by esp. Lowe, the data is essentially full of noise. And that is before considering (in)correct structures! The alternative would of course be to curate the original USPTO XML files oneself, or perhaps the CML files by Lowe containing the original / more curated data. That is a different story though and not part of this discussion.

Here is an example of the kind of yield values found in the original files (left two columns). After cleaning these numbers (see later on) they end up in the “Cleaned columns” (not part of the final datasets), which are then combined to give the final Yield column (on the right).

I hacked together a Python script to do precisely this curation; you can find the code for it here on Github. As an alternative, since I also like working with Knime, and for people who might not want to or cannot program, I have built the same steps into a graphical Knime workflow (available on the Knime Hub). That Knime workflow also contains a bonus analysis step: a simple proof of concept on how to split the reaction SMILES (data not shared though, it was only a POC after all).

A snapshot of the Knime workflow.

The Python script (technically also the Knime workflow) consists of two main components. First comes the cleaning up of all the >= and other text tokens (see also the table above), so that the values can be converted to numbers (I use a pandas DataFrame for the data; see also the original script & Readme for more details):

# Strip the percent sign from the calculated yields
data["CalculatedYield"] = data["CalculatedYield"].str.rstrip("%")
# Strip "~", "%", ">=" and ">" from the text-mined yields
data["TextMinedYield"] = data["TextMinedYield"].str.lstrip("~")
data["TextMinedYield"] = data["TextMinedYield"].str.rstrip("%")
data["TextMinedYield"] = data["TextMinedYield"].str.replace(">=", "", regex=False)
data["TextMinedYield"] = data["TextMinedYield"].str.replace(">", "", regex=False)
# ... etc ...
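To make the clean-up step a bit more tangible, here is a minimal sketch of what it does to a single raw value. The helper `clean_yield_token` is hypothetical and not part of the original script; it only handles the tokens shown above (`~`, `%`, `>=`, `>`) and returns NaN for anything that still cannot be read as a number. On a whole pandas column, `pd.to_numeric(..., errors="coerce")` would be one way to do the same conversion.

```python
import math


def clean_yield_token(raw):
    """Hypothetical helper: turn one raw yield string into a float (NaN if unreadable)."""
    if raw is None:
        return math.nan
    text = str(raw).strip()
    # Same tokens as in the column-wise clean-up: "~" in front, "%" at the end
    text = text.lstrip("~").rstrip("%")
    # Remove ">=" before ">" so no stray "=" is left behind
    for token in (">=", ">"):
        text = text.replace(token, "")
    try:
        return float(text.strip())
    except ValueError:
        # Empty strings and leftover text cannot be converted
        return math.nan


for raw in ["~85%", ">=90%", "95.5%", "", "n/a"]:
    print(repr(raw), "->", clean_yield_token(raw))
```

Returning NaN rather than raising keeps the unreadable entries visible, so they can be counted or dropped in a later step.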

Then the two numbers are compared and only the larger one is kept; values outside the range 0 to 100 count as 0. Plain and simple:

def deconvolute_yield(row):
    text_yield_data = row["TextMinedYield"]
    calc_yield_data = row["CalculatedYield"]
    my_text_yield = 0
    my_calc_yield = 0

    if 0 < text_yield_data <= 100:
        my_text_yield = text_yield_data
    if 0 < calc_yield_data <= 100:
        my_calc_yield = calc_yield_data

    out_yield = my_text_yield
    if my_calc_yield > my_text_yield:
        out_yield = my_calc_yield

    return out_yield
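As a quick sanity check of that logic, here is a small self-contained sketch. It restates the function in compact form (my own rewording, meant to behave the same) and runs it on a few made-up rows; the values are purely illustrative and not taken from the datasets. Plain dictionaries stand in for DataFrame rows here; with pandas, the call would presumably be `data.apply(deconvolute_yield, axis=1)`.

```python
import math


def deconvolute_yield(row):
    """Compact restatement: keep the larger yield that lies within (0, 100], else 0."""
    candidates = [row["TextMinedYield"], row["CalculatedYield"]]
    # NaN fails every comparison, so missing values drop out here automatically
    valid = [value for value in candidates if 0 < value <= 100]
    return max(valid, default=0)


# Made-up example rows (illustrative only)
rows = [
    {"TextMinedYield": 85.0, "CalculatedYield": 92.3},       # both valid
    {"TextMinedYield": math.nan, "CalculatedYield": 40.0},   # one value missing
    {"TextMinedYield": 150.0, "CalculatedYield": math.nan},  # nothing usable
]

final_yields = [deconvolute_yield(row) for row in rows]
print(final_yields)  # [92.3, 40.0, 0]
```

Rows that end up with a yield of 0 carry no usable yield information, which is presumably where a large part of the data drops out.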

Go check it out if you have use for this type of data. Of course I do welcome feedback!

Anyway, in light of current big-data needs this is just a drop in the ocean, and perhaps not even necessary anymore considering the upcoming Open Reaction Database? We shall see what it brings; afaik it is going to be officially launched any time now(?).


AI made easy – an image classifier example for everyone

Artificial Intelligence (AI) is all the rage right now. For better or worse, it is here to stay. So why not have a look at it as a part of modern data science? A simple image classifier could do the trick!

It is almost too easy for anyone these days to work with AI or machine learning. Tools are aplenty, be it the graphical Knime or one of the more common scripting languages such as Python. Combine that with popular tools such as scikit-learn, PyTorch, etc., write a few lines of code, and you are done. Making a good AI model though, even with all the available tools, is another story for another time…

Moving on. What is it you are asking? You can’t/don’t want to get into advanced “stuff”? AI sounds complicated? Too much programming and statistics or whatnot? Forget all that. Not necessary. May I suggest this online course/book by Jeremy Howard and Rachel Thomas from the University of San Francisco (no connections or perks exist between us, I simply like their approach). Do start at their blog, https://www.fast.ai/, and choose “Practical Deep Learning for Coders”. It introduces you to all prerequisites in an easy and simple manner, and even gives tips on free cloud services if you don’t have the required hardware. The video sessions go through the book as Python notebooks (Jupyter) and introduce you to some basic programming at the same time. All with the attitude that you don’t need a Ph.D. to do AI. (Although, while that is true, a certain level of education or “human intelligence” is necessary to make useful and “safe” models; otherwise you end up with scandals or abuse of models. Check out e.g. Thomas’s course on tech ethics: https://ethics.fast.ai/.)

Taking from this course, I present here a very simple AI for image recognition: one that distinguishes (more or less well) between Bengal cats, “other cats”, and “cartoon cats”, because, why not. And since I have a Bengal myself… To test this, you won’t even need to install anything, simply use this MyBinder link:

Binder

This is a rather neat way to share code with others who don’t want to (or can’t) code, without having to jump through hoops to get it shared. One can even include a simplistic GUI by using something called ‘Voila’. It does have some drawbacks, but for the purpose of e.g. this blog, it is perfect.

(Regarding model creation: I won’t go into the actual creation, though you will find a separate notebook for it in the Git repository of this model. It uses a simple 18-layer model, ResNet-18.

For deeper explanations, you are probably better off viewing the pro description of how to do this. I followed these two notebooks from the fastai book: https://github.com/fastai/fastbook/blob/master/05_pet_breeds.ipyn and https://github.com/fastai/fastbook/blob/master/06_multicat.ipynb . )

My Bengal Classifier

Anyway, the final code and output simply look like this, where the actual “AI” is strictly speaking only one line, in paragraph 3 (learn.inf.predict(img)); everything else is preparation and output. Well, that, and the model that is loaded in paragraph 2 (load_learner(….)). This is the model created in the above-mentioned separate notebook.
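For readers who prefer text over screenshots, here is a rough sketch of how the load, predict, and output steps fit together. It is not the notebook's actual code: the function names, the variable name `learn_inf`, and the file names in the usage note are my assumptions, and I assume a single-label model, for which fastai's `predict` returns the predicted label, its index, and the class probabilities. fastai is imported inside the function so that the formatting helper can be tried without it.

```python
def format_prediction(label, probabilities, vocab):
    """Hypothetical helper: list the classes sorted by predicted probability."""
    ranked = sorted(zip(vocab, probabilities), key=lambda pair: pair[1], reverse=True)
    lines = [f"Prediction: {label}"]
    lines += [f"  {name}: {float(prob):.1%}" for name, prob in ranked]
    return "\n".join(lines)


def classify_image(model_path, image_path):
    """Load an exported fastai learner and classify a single image file."""
    # Imported lazily: only needed when a real model is used
    from fastai.vision.all import PILImage, load_learner

    learn_inf = load_learner(model_path)      # the "load the model" step
    img = PILImage.create(image_path)         # preparation
    label, _, probs = learn_inf.predict(img)  # the one line of actual "AI"
    return format_prediction(label, probs, learn_inf.dls.vocab)


# Made-up probabilities, just to show the output format
print(format_prediction("bengal", [0.9, 0.07, 0.03], ["bengal", "cartoon", "other"]))
```

With fastai installed and an exported model at hand, something like `print(classify_image("export.pkl", "my_cat.jpg"))` would run the whole chain; both file names are placeholders.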

You can have an even simpler view, if you use something called Voila (which is available in the referred notebook):

You can find all this on Github to test yourself. With MyBinder.org, though, you don’t need any local installation or know-how: simply click on the icon and wait for the (rather long) creation of the virtual image of this app (but hey, it’s free!). Or click directly here on the Binder link without the hassle of going through Github:

Binder

You will see something like this in your browser (in that window, click the “show/hide” text in “Build logs” to expand it and watch the (slow) startup status):

Once the build is done, you should have the notebook open and can run it there directly (click the run button multiple times, or choose the menu “Cell > Run all”; ignore the error messages).

Finally, upload an image via the Upload button.

Even simpler, if you don’t want to bother with code, or “Run”, click the “Voila” button (circled in red) and you will only see the text and the upload button (as shown above).

That’s it! Artificial Intelligence (AI) made easy! Although … I shouldn’t forget to at least touch upon something the mainstream usually forgets to mention: AI isn’t that intelligent at all. It’s actually pretty stupid and depends on (the intelligence of?) the person(s) who set up the system…. Anyway….

Of course, since I myself am interested in molecules, I want to use AI for different purposes, but that is something for another time.

Thanks for reading, hope you enjoyed the intro to creating your own AI app!

Oh, and Happy New Year!