Curation of yield in the public USPTO reaction datasets – now on Figshare

After getting annoyed with myself for re-curating datasets I had already curated before, I started with an obvious, simple, but imho important curation of the USPTO reaction datasets, which have been “pre-curated” in the past into csv / rsmi format. I collected them and uploaded them here at figshare.


The public curation of the USPTO datasets has by now nearly a decade behind it. In itself, the original USPTO data consists of a bunch of XML files, certainly containing a number of errors, but also a lot of (useful) data. The first obvious curation effort was made by Lowe during his Ph.D. work and later published on figshare. It even ended up as a commercial database. Other groups have been using this dataset for chemical reaction prediction modeling and have done their own curation, disclosed when publishing their research. Among them are Jin (& Coley) et al. (MIT), followed by Schwaller et al. (IBM Zurich). In the end, due to “the” consortium that was founded, many of the names one currently sees in the ML/AI field of synthesis prediction are interconnected. It is certainly a great collaboration, though one can suspect that competition exists, just like in any other research field.

Back to the topic at hand though: the yield-curated datasets. In particular, I was annoyed about the ton of data with messy yield columns: two columns that have faulty/partial/no data. This had to be somehow deconvoluted into something useful. By the time this was finally curated, roughly 50% of the data had disappeared. So despite a valiant effort by esp. Lowe, the data is essentially just full of noise. And that is before considering (in)correct structures! The alternative would of course be to curate the original USPTO xml files oneself, or perhaps the CML files by Lowe containing the original / more curated data. That is a different story though and not part of this discussion.

Here is an example of the type of Yield in the original files (left two columns). After cleaning these numbers (see later on) they end up as “Cleaned columns” (not part of the final datasets) and those will be combined to give the final Yield column (to the right).

I hacked together a Python script to do precisely this curation; you can find the code for it here on Github. As an alternative, since I also like working with Knime, and for people who might not want to or cannot program, I have made the same steps in a graphical Knime workflow (available on the Knime-hub). In that Knime workflow, one can find a bonus analysis step as well: a simple proof of concept on how to split the reaction smiles (data not shared though, it was only a POC after all).

A snapshot of the Knime workflow.

The Python script (technically also the Knime workflow) consists of two main components. First, the cleaning up of all the >= and other text tokens (see also table above), so that you can convert to a number (I use a pandas df for the data; see also the original script & Readme for more details):

data["CalculatedYield"] = data["CalculatedYield"].str.rstrip("%")
data["TextMinedYield"] = data["TextMinedYield"].str.lstrip("~")
data["TextMinedYield"] = data["TextMinedYield"].str.rstrip("%")
data["TextMinedYield"] = data["TextMinedYield"].str.replace(">=", "", regex=True)
data["TextMinedYield"] = data["TextMinedYield"].str.replace(">", "", regex=True)
... etc ...
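
After the stripping, each cell still has to be turned into a number. A minimal sketch of that last step in plain Python, assuming (my assumption, not necessarily what the original script does) that anything still not parseable should simply become NaN. The helper name `to_number` is hypothetical:

```python
import math

def to_number(token):
    """Strip the decorations shown above and return a float (NaN if not numeric)."""
    if token is None:
        return math.nan
    # Same order as the pandas calls above: "~" on the left, "%" on the right,
    # then the ">=" and ">" tokens.
    text = str(token).strip().lstrip("~").rstrip("%")
    text = text.replace(">=", "").replace(">", "")
    try:
        return float(text)
    except ValueError:
        # e.g. free text such as "quantitative"
        return math.nan

examples = ["~85%", ">=90.5%", "quantitative", None]
numbers = [to_number(token) for token in examples]
```

With a pandas column, the same helper could be applied via `data["TextMinedYield"].map(to_number)`.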

Then the numbers are compared to each other and only the largest one kept. Plain and simple:

def deconvolute_yield(row):
    text_yield_data = row["TextMinedYield"]
    calc_yield_data = row["CalculatedYield"]
    my_text_yield = 0
    my_calc_yield = 0

    if 0 < text_yield_data <= 100:
        my_text_yield = text_yield_data
    if 0 < calc_yield_data <= 100:
        my_calc_yield = calc_yield_data

    out_yield = my_text_yield
    if my_calc_yield > my_text_yield:
        out_yield = my_calc_yield

    return out_yield
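
To make the rule concrete, here is a compact restatement of the same logic with a few made-up value pairs. The name `pick_yield` and the sample numbers are only for illustration; the sketch assumes both inputs are already floats, with NaN for missing values:

```python
import math

def pick_yield(text_yield, calc_yield):
    """Keep the larger of the two yields, ignoring values outside (0, 100]."""
    # Comparisons with NaN are always False, so missing values drop out here.
    valid = [y for y in (text_yield, calc_yield) if 0 < y <= 100]
    return max(valid, default=0)

pairs = [
    (85.0, 83.2),          # both plausible: keep the larger one
    (math.nan, 45.0),      # only the calculated yield is usable
    (90.0, 101.5),         # a calculated yield above 100 is discarded
    (math.nan, math.nan),  # nothing usable: falls back to 0
]
results = [pick_yield(text, calc) for text, calc in pairs]
# results == [85.0, 45.0, 90.0, 0]
```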

Go check it out if you have use for this type of data. Of course I do welcome feedback!

Anyway, in light of current big-data needs this is just a drop in the sea and perhaps not necessary anymore considering the upcoming Open Reaction Database? We shall see what this will bring, afaik it is going to be officially launched any time now(?).

Open Reaction Database

Sweet, another publication! Machine Learning in Reaction predictions!

Well, sort-of-kind-of peer-reviewed. The publication in question went through several peer-review rounds before we decided (due to lack of time and resources) to go with ChemRxiv and leave it at that. But at least it’s out there.

Here is the link:

Machine Learning to Reduce Reaction Optimization Lead Time – Proof of Concept with Suzuki, Negishi and Buchwald-Hartwig Cross-Coupling Reactions

Fernando Huerta, Samuel Hallinder, Alexander Minidis

We are continuing our work on this, but will probably take it in a slightly different direction.


Noteworthy is that just around the same time, we had at my workplace, RISE AB, Södertälje, Sweden, an international online workshop on the topic:

Accelerating chemical design and synthesis using artificial intelligence

The presentations are available as binder with ISBN: 9789189167421

Workshop RISE

As a bit of a whining part (it’s my blog after all), regarding the paper, I have to say it’s been a rough ride – we started writing in fall 2019 and were ready just before Xmas. But while the peer-reviewers had some good points, there were some aspects where we felt they “didn’t get it” (meaning maybe we weren’t very clear). They also seemed to judge the paper not from the mixed-audience view, something we admittedly struggled with. And after we received questions that had been answered twice(!) in earlier revisions, and comments that had nothing to do with a publication, we figured that we should secure our publication date via ChemRxiv. Because, to be honest, we had the uneasy feeling that some of the reviewers might not be 100% ethical and might use some of our ideas. And we believe that has happened. There is a ChemRxiv publication by an expert in the field who used an eerily similar conceptual idea – Pd-catalyzed cross-coupling and splitting the data set into “good” and “bad” to obtain better separation. I am paraphrasing here, but some wordings just in the abstract alone were…. Anyway, that could be coincidence of course. So let’s leave it at that.

Covid-19 research at home? Why, of course you can!

Not having been active for a while on my blog, a friend of mine reminded me of the work we once did regarding the Zika virus, and that this should also be possible to do with Corona/Covid-19. So there is an idea!


You can try to do this at home yourself if you like. I will myself give it a try when I find some time this week-end(?). Most likely some of the Knime work-flows will need some updating, but hopefully that won’t be too big of an issue. If you are interested, check the blog entries I made about the Zika:

Another easy way to help support the Covid-19 research is Folding@Home distributed computing. Though I heard that this is so popular that there are no work-packages available as of this writing! Check it out for yourself!

Update: A perhaps simpler (and slightly more transparent) way to participate in distributed computing is the previously mentioned BOINC World Community Grid application with “Open Pandemics Research”. Give it a go! It’s simple and not at all resource demanding! I myself am at the time of the update at a (meager?) 49d of computing time. Compare that to Zika with > 1 year of computation time (all on a simple Intel i5 machine).
I don’t mind if you use this referral link (no perks such as money are involved, only a badge for me):

Github repository – version control also for Knime

Even if you are not a programmer, you might have heard of “Git”, “Github” or “Bitbucket”, etc. These are simply put code-repositories, used for version control of your code to track changes, collaborate, etc. It is also very useful for public domain (open source) code – a “future” proof place in the cloud where you can find code. Maybe you have encountered a similar system with documents, e.g. Word or Excel on Microsoft Sharepoint, or Dropbox for Teams – multiple people can work on a document, you have a history of older versions available, or you can make local copies of said document – a very similar idea!

In my case, let’s say my blog site might cease to exist in the future (of course it won’t, but hey, you never know): the code that is stored here will still be available elsewhere, maybe even maintained by others!

Now, currently I don’t have any “code” per se on this site, but there are Knime workflows, which are a form of code (imho). And since I felt the urge to play with Git also for other reasons, I decided to furthermore upload all my blog workflows there. You can find my repository here:

(I am updating the older blog entries in the next few days)

If you don’t (want to) know how to use git, don’t worry – you can still download the code by using the “download as zip” possibility: click the big green “Clone or download” button, then choose “Download ZIP”. Then import this into Knime as you would any external workflow.

In case you yourself would like to use a version control system (be it locally or in the cloud) with your Knime workflows, you might want to use the following content in a .gitignore file:


Something rare nowadays – a publication

As life continues after my years of research in R&D, there are and will be fewer and fewer publications. Therefore I am all the more excited and happy when I can contribute to some great scientific work.

Alf Claesson, the main author, and I have published a “Perspective” in the ACS journal Chemical Research in Toxicology, titled “Systematic Approach to Organizing Structural Alerts for Reactive Metabolite Formation from Potential Drugs”.

We believe it should be a good tool especially for medicinal chemists who design new compounds, but also for metabolic biologists who work with reactive metabolites. It is related to some extent to the software SpotRM+ by Awametox, which is to a certain extent the engine behind this paper.

Here is the full citation:

Systematic Approach to Organizing Structural Alerts for Reactive Metabolite Formation from Potential Drugs

Alf Claesson and Alexander Minidis
Chemical Research in Toxicology 2018 31 (6), 389-411

DOI: 10.1021/acs.chemrestox.8b00046

And the link:

Part 3: What disease should I …. ? Knime workflows

First off, apologies for the late entry of part 3 of this article. I hope you haven’t lost interest in this just yet.

Anyway, here, as promised, the Knime workflow used to retrieve the data as mentioned in part 2 & the presentation in Heidelberg last October.

The whole workflow looks like this – I will go through some of the details separately. During the course of the description here in part 3, I will zoom in via pictures to some of the metanodes (grey boxes with a green checkbox) but not all. If you want to dig into details, I will attach the full workflow for Knime for you to download and view explanations directly within.

Overview: Pubchem retrieval in Knime for Zika

Pubchem itself quotes these two free access references with regards to itself and API programming:

Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA, Wang J, Yu B, Zhang J, Bryant SH. PubChem Substance and Compound databases. Nucleic Acids Res. 2016 Jan 4; 44(D1):D1202-13. Epub 2015 Sep 22 [PubMed PMID: 26400175] doi: 10.1093/nar/gkv951.

Kim S, Thiessen PA, Bolton EE, Bryant SH. PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucleic Acids Res 2015 Jul 1;43(W1):W605-11. Epub 2015 Apr 30 [PubMed PMID: 25934803] doi: 10.1093/nar/gkv396.

1 Obtaining the CIDs (Compound IDs)

In order to obtain any data from Pubchem, we first require the CIDs; this simplifies searching compared to using synonyms. In this case the DrugBank IDs retrieved from ZikaVR (see Part 2) are used. There are “only” 15 of them, and we read them from a text file (a manual table within Knime would do as well). The Metanode Get CID is perhaps the portion that is most daunting for someone who doesn’t know or care for APIs and such. But in order to get data automatically from Pubchem, we do have to use the API. Let’s open this Metanode:

Metanode Get CID

First off, we convert the Drugbank ID number to a useful URL (String Manipulation node).

The URL should look like this:

i.e. checking for compounds whose name contain DB01693 and retrieve as XML.
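
For readers who prefer code over the String Manipulation node, here is a hedged Python sketch of such a URL builder. It follows the general PUG-REST pattern for name lookups; whether it matches the exact URL used in the workflow is an assumption on my part, and the function name is hypothetical:

```python
from urllib.parse import quote

# Base address of PubChem's PUG-REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def cid_lookup_url(name, output="XML"):
    """Build a URL that looks up compound records by name or synonym."""
    # quote() protects against spaces or other special characters in the name.
    return f"{PUG_REST}/compound/name/{quote(name)}/{output}"

url = cid_lookup_url("DB01693")
```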

Next up, the actual GET Request. In this instance it simply sends the URLs, with nothing else to set up except a delay, since we didn’t use async requests. After that, we keep only positive results (HTTP status code 200) and convert the XML based structure information to, well, a structure. The XPath node could, if desired, retrieve a whole lot more information, but here we simply retrieve the CID. Finally, we keep only certain columns and tell the system that the molecule text column is indeed a structure column.
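
The filter-then-XPath part can be sketched in Python as well. The XML below is a heavily simplified, hand-written stand-in for a PubChem compound record (an assumption, the real records are much larger), and `extract_cids` is a hypothetical helper that works on already-downloaded responses:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the XML returned by PubChem.
SAMPLE_XML = """<PC-Compounds xmlns="http://www.ncbi.nlm.nih.gov">
  <PC-Compound>
    <PC-Compound_id>
      <PC-CompoundType>
        <PC-CompoundType_id>
          <PC-CompoundType_id_cid>578447</PC-CompoundType_id_cid>
        </PC-CompoundType_id>
      </PC-CompoundType>
    </PC-Compound_id>
  </PC-Compound>
</PC-Compounds>"""

def extract_cids(responses):
    """Take (status_code, body) pairs and return the CIDs of successful ones."""
    cids = []
    for status, body in responses:
        if status != 200:  # keep only positive results
            continue
        root = ET.fromstring(body)
        # "{*}" matches the tag in any XML namespace (Python 3.8+).
        for node in root.findall(".//{*}PC-CompoundType_id_cid"):
            cids.append(int(node.text))
    return cids

cids = extract_cids([(200, SAMPLE_XML), (404, "<Fault/>")])
```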

2 Obtaining the AIDs (Assay IDs)

Reminder of parts 1, 2 and 3

The next step is to obtain the assay IDs from Pubchem. Since there is no direct way (as far as I can tell) to obtain a particular screen or target in the context of a compound, one has to retrieve all assays which have a reference to a particular compound, then analyze the assays.

Retrieve AIDs

Thus in this case, the URL sent to Pubchem via GET Request looks something like this:

i.e. retrieving all AIDs for the compound with CID 578447 in text format (which corresponds to the above Drugbank ID DB01693). The result we work with is in the form of a list of AIDs per compound, hence the Ungroup node following the Get AID metanode.
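
A rough Python equivalent of this step, again only as a sketch: the URL follows the PUG-REST pattern for assay IDs of a compound (assumed, not copied from the workflow), and the response body in the example is made up:

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def aids_url(cid):
    """URL asking for all assay IDs that reference one compound, as plain text."""
    return f"{PUG_REST}/compound/cid/{cid}/aids/TXT"

def ungroup(aids_by_cid):
    """Flatten one text block of AIDs per CID into (cid, aid) rows."""
    return [
        (cid, int(line))
        for cid, text in aids_by_cid.items()
        for line in text.split()  # the text format has one AID per line
    ]

# Made-up response body for illustration only.
sample = {578447: "1811\n2330\n"}
rows = ungroup(sample)
```

The flattening plays the role of Knime's Ungroup node: one row per compound/assay pair.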

Retrieve target names

Now that we have the AIDs, we can retrieve the actual target names, done here in the Get Target (from AID) metanode. Here, the Get Request URLs look like this:

i.e. retrieve as XML the information on the assay with AID #1811, etc.

From the XML we extract three values: ChemBL Link (optional), Target Type and Target Name. We finally filter these (via Row Splitter, or Row Filter if you prefer) to keep only “Single Proteins”, separating them from ambiguous things like “Tissue: Lymphoma Cells” or “Target Type: CELL-LINE”, etc. This leaves us in this instance with three compounds tested in “Kinesin-like protein 1”. Remember, this target was one of the targets identified earlier in Part 2.
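
The Row Splitter step boils down to a simple two-way filter. A sketch with made-up rows; the field names and the exact spelling of the target type values are assumptions:

```python
# Made-up rows standing in for the values extracted from the assay XML.
assays = [
    {"aid": 101, "target_type": "SINGLE PROTEIN", "target_name": "Kinesin-like protein 1"},
    {"aid": 102, "target_type": "CELL-LINE", "target_name": "HeLa"},
    {"aid": 103, "target_type": "TISSUE", "target_name": "Lymphoma Cells"},
]

def split_rows(rows, wanted="SINGLE PROTEIN"):
    """Return (matching, remaining) rows, like the two ports of a Row Splitter."""
    matching = [r for r in rows if r["target_type"].upper() == wanted]
    remaining = [r for r in rows if r["target_type"].upper() != wanted]
    return matching, remaining

single_proteins, others = split_rows(assays)
```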

Be aware that we now have the target name and some assay IDs for OUR compounds, but not all assays that have been submitted to Pubchem for the protein Kinesin-Like-Protein 11 (KIF11); in these pictures it is sometimes denoted as KLP11, stemming from the Pubchem code.

3 Retrieve all screened Pubchem cmpds and Comparison with other sources

Now we retrieve all assays that deal with KIF11 and can thus retrieve all structures mentioned in those assays, followed by comparing with other sources.  We start with the metanode Get KLP11:

Retrieve all structures tested in KLP11

At this point, the URLs required should be straightforward – here the URL for the target (one single get request in this case):

Next up, is the retrieval of the compounds as their CIDs mentioned in these assays, e.g.:

Now we end up with over 5700 compounds (for which we also retrieve the structures in its own sub-metanode, just as described earlier). At this point, to be able to compare with the original structures found in DrugDB stemming from ZikaVR, we cluster these compounds and make the “graph” (node Cluster Preparation) in parallel to the original input structures in node common skeletons. Clustering per se, especially (but not solely) in Knime, is a rather deep topic of discussion and will therefore not be described here. You can, however, go into the workflow and have a look at how we did it in this instance. Now that I write this, I guess this is a neat follow-up topic for this blog!

The final comparison DB vs PubChem cores is a simple affair based on Knime’s Joiner/Reference Row Splitter nodes – via the Inchi keys as comparison string (Inchi is a good way to sift through duplicates, despite some caveats when using Inchi).

There we have it – the top output port of the metanode gives us the common cores, the lower one the cores not found, in this case, in Pubchem.
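
The Joiner/Reference Row Splitter logic is essentially a set lookup on the InChI key. A small sketch with placeholder keys (real InChI keys are 27-character strings; the names and data here are invented for illustration):

```python
def split_by_reference(rows, reference_keys):
    """Return (common, not_found) rows depending on whether the key is in the reference."""
    reference = set(reference_keys)
    common = [r for r in rows if r["inchikey"] in reference]
    not_found = [r for r in rows if r["inchikey"] not in reference]
    return common, not_found

# Placeholder data: cores from the original list and keys of the PubChem cores.
drugbank_cores = [
    {"core": "core-1", "inchikey": "KEY-AAA"},
    {"core": "core-2", "inchikey": "KEY-BBB"},
]
pubchem_keys = ["KEY-BBB", "KEY-CCC"]

common, not_found = split_by_reference(drugbank_cores, pubchem_keys)
```

Here `common` corresponds to the top output port and `not_found` to the lower one.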

4 Substructures of DrugDB compounds in Pubchem

An approach not so dissimilar to 2 & 3 above is shown here in 4: retrieving all substructures of the ones we have in our original list, independent of any target. Specifically, which similar compounds are out there that have not been reported (in Pubchem) as screened on our targets of interest but might show activity anyway?

Part 4 overview

We need to start with removing explicit hydrogens from the structures retrieved. For efficiency, this should probably be done only once early on, since it was e.g. reused in section 3 (the common skeletons metanode contains this step again). This is not uncommon in the development of workflows though – you add on a new portion and end up with multiples of certain steps, which you might or might not bother to change later on. For easier reading and understanding it is simpler to actually work with the same node multiple times; remember – we are not programming a final super efficient workflow here at the moment.

Retrieve (relevant) substructures from Pubchem

Drilling down into the metanode Get Sub-Structures, we have to retrieve the substructures via an asynchronous (async) request – something shown here in probably the least nice and efficient way, but hey, it works. For substructure searches, Pubchem won’t give you back the whole list at once, only a reference to a list with the substructures. This is what the first two boxes do, PART1 and PART2.

The URL for substructure btw is:

The XML then contains a reference to a listkey, a long string of numbers, which you submit again via:
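
The two-step listkey dance can be sketched in Python as follows. The URL patterns follow the PUG-REST substructure and listkey endpoints as I understand them (an assumption, not the workflow's exact URLs), the regex-based XML handling is a deliberate simplification, and `fetch` is a hypothetical callable you supply yourself (e.g. a thin wrapper around an HTTP library):

```python
import re
import time
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def substructure_url(smiles):
    """First request: start the substructure search."""
    return f"{PUG_REST}/compound/substructure/smiles/{quote(smiles, safe='')}/XML"

def listkey_url(listkey):
    """Second request: ask for the CIDs behind a listkey."""
    return f"{PUG_REST}/compound/listkey/{listkey}/cids/XML"

def fetch_cids(smiles, fetch, delay=0.0, max_tries=10):
    """Run the search and poll until the CID list is ready.

    fetch is any callable that takes a URL and returns the response text.
    """
    body = fetch(substructure_url(smiles))
    listkey = re.search(r"<ListKey>(\d+)</ListKey>", body).group(1)
    for _ in range(max_tries):
        body = fetch(listkey_url(listkey))
        if "<Waiting" not in body:  # the list is ready
            return [int(cid) for cid in re.findall(r"<CID>(\d+)</CID>", body)]
        time.sleep(delay)  # be polite and wait before asking again
    raise TimeoutError("PubChem list was not ready in time")
```

Passing the fetch function in keeps the polling logic testable without any network access; in real use a delay of a few seconds between polls seems sensible.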

Now, each list contains a number of new CIDs; if we collect them all, we get more than 300 000 IDs, a bit too much to handle…. Thus a filtration was necessary, one that at this point was done manually, but certainly can be done more elegantly otherwise. In this case though, a manual table with the subgraphs of interest is used (Table Creator node). Needless to say, if you want to and can mine through the remaining compounds, you will certainly have an interesting resource for further analysis available (e.g. via other types of subgraphs, property filtration, etc.).

Finally, the structures themselves are retrieved for the ca. 1100 compounds (IDs) in our case (same way as described above).

Back to the main portion: Looking at the top row next to the Get Sub-Structures, this row (branch) is more of a confirmation addition – of all the substructures searched, how many of them mention KIF11, which leads back to compounds we have seen before.

The lower two branches check the similarity of the new substructures versus ours in terms of a high likelihood to show activity – in this case, let’s not forget the overall goal, against the Zika virus. This comparison is simply done by fingerprinting and comparing, here with two different methods, average or max similarity, with a Tanimoto cut-off of 0.7.
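
To illustrate the difference between the max and average variants, here is a toy sketch in plain Python. Fingerprints are represented as sets of “on” bit positions, and the data is invented; in practice the fingerprints would come from a cheminformatics toolkit:

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similar_candidates(candidates, references, cutoff=0.7, mode="max"):
    """Names of candidates whose max (or average) similarity reaches the cutoff."""
    hits = []
    for name, fingerprint in candidates.items():
        scores = [tanimoto(fingerprint, ref) for ref in references]
        score = max(scores) if mode == "max" else sum(scores) / len(scores)
        if score >= cutoff:
            hits.append(name)
    return hits

# Toy fingerprints, purely illustrative.
references = [{1, 2, 3, 4}, {1, 2, 5, 6}]
candidates = {"A": {1, 2, 3, 4, 7}, "B": {8, 9}}

max_hits = similar_candidates(candidates, references, mode="max")
avg_hits = similar_candidates(candidates, references, mode="average")
```

Candidate A is very close to one reference (0.8) but not to the other, so it passes the max criterion and fails the average one, which is why the two methods can give quite different hit lists.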

And “hopefully” all the results (numbers/graphs) should correspond to what was earlier described in the series of these blog entries.

If you have any questions or anything is unclear, don’t hesitate to contact me! And/or download the workflow and play around with it yourself!

Have fun!

Download the Knime workflow (including all data)

PS: don’t hesitate to contact me if you run into troubles with the workflow.

PPS: The excel file is now included within the workflow folder, you will have to adjust the path for it to be correct. Obviously other input methods are also possible, a manual table, a csv reader, etc.

Addendum: Workflow now also available on Github:




Part 1: What disease should I research @home? Zika-Virus as example

To dabble with basic Science@home is fun – though probably only up until the question arises “what to actually research about”?

This leads to the question on how to find or decide on any disease to start with. Since this is a rather entangled question (or rather, the answers can be), I will offer three of the simplest answers:

  • Choose whichever disease you are curious about (or have a relation to)
  • Pick a particular target that you heard (know) of and are interested in
  • Take from current news a “hot-topic” disease/target

This might sound somewhat naïve, but can be rather relevant and is used by many researchers within pharmaceutical development, at least as part of the starting point. As an example, my friend Fernando from InOutScience and I have been considering the Zika virus ourselves (hence, I will use “us/we” for the remainder of this blog series). Myself, I stumbled upon this due to the news last year (and a family interest, if you will).

What is Zika?

Zika hit the world news last year after an outbreak of epidemic proportions in South America. That the world took notice at all was (as usual?) down to economics: the 2016 Summer Olympics in Brazil and the spread to southern parts of the USA. By the end of the year, though, it declined nearly as fast as it had appeared. The reasons for this still seem to be unclear to epidemiologists.


The virus itself is a mosquito-borne virus transmitted by Aedes mosquitoes. It usually leads to harmless symptoms, the most common ones being headache, muscle and joint pain, mild fever, rash, and inflammation of the underside of the eyelid.
But: What brought this virus into the limelight is the fact that when transferred to pregnant women the fetus is at risk for birth defects!

The latter is the reason for efforts to find treatments (otherwise, basic flu treatment seems to do the trick).

You can find nicely summarized facts at the World Health Organization (WHO) webpage on Zika.

Unfortunately, as with any neglected disease (tropical ones fall most often into this category), there is no money to be made in finding new medications (research & development costs versus what you can make from it….). Therefore it falls on some smaller companies as well as academic groups to be the major players researching these, as is the case with Zika.
You yourself can participate indirectly if you like via the World Community Grid distributed computation project – see my blog entry here to see how.

If you want to know more about Zika, please check out these links:

Now that we have a disease to research on – how to continue? Part 2 now available, please click here.

PS: Part of this blog series will be presented at the “ICIC 2017“, the International Conference on Trends for Scientific Information Professionals, Heidelberg, October 23-24. The presentation will be made by Fernando Huerta from InOutScience .

MolPress – Open source chemistry plugin for WordPress

So much to discover and to do – yet so little time.
Here e.g. is such a nifty thing that I will have to try out at some point, time or not since it fits the @home perspective perfectly:

MolPress is an open source chemistry plugin for WordPress.

One of my new colleagues, Alex Clark, who has been a bit longer than me in the blogosphere, is putting work into this. No need for me to reiterate what he can describe best himself – check out the Molpress page, or his blog:

Cheminformatics2.0 – MolPress

Now – to just find the time to integrate this 😀


Abuse of open access tools and data?

As described in a previous blog entry of mine, it is rather simple to set up virtual compound design from the comfort of your home. Tools and data are easily accessible and hardware is cheap. Add to that a bit more hardware, maybe even a (garage) laboratory – it makes you wonder “What If”?

Is it possible that open access data is abused for criminal purposes, in particular recreational drugs? I reckon it would make sense (unfortunately) and I am sure there are more articles to be found other than the one I stumbled upon recently, dating back to 2013, by the Guardian. Though they don’t give any source or example for their (probably legitimate, imho) claim of what/where “clandestine” labs are.

Syntheses of known (recreational) drugs have been accessible since the days of Usenet newsgroups (seen them myself back in the days) and probably even BBSs. And then there is of course PiHKAL, perhaps one of the main sources for Usenet/BBS in those days, before the internet became bigger and more easily accessible. With that know-how also follows a list of how to replace certain ingredients, otherwise only accessible to laboratories, with household items/chemicals. It is so simple nowadays: a simple Google search will yield e.g. the recipe for crystal meth based on household chemicals; “Breaking Bad” in real life.

Combine the urge to do something like this with knowledge of pharmaceutical design and open access…..

Though as long as so-called designer drugs seem to be based on arbitrary testing of only slightly modified existing compounds – one of many examples fitting that picture seems to be acrylfentanyl – it doesn’t look like open access is the culprit (yet). It’s more the usual greed and stupidity, with as fast, simple and cheap a turn-over as possible – health and safety concerns have never been on the agenda. The only optimization probably is accessibility of starting materials. If there is anything valid in the above mentioned article, then of course the synthesis can go beyond your local garage and is done by “professionals” with expert equipment and chemicals. But hey, maybe I am naive and there are pro-labs doing all the typical design and test cycles as a pharmaceutical company would do…. Not that that is a good justification for illegal drugs.

It’s a rather scary thought – I am not sure what, if anything at all, can be done about this.

Perhaps the law-makers should start banning substances based on their pharmaceutical action, or generic structure (Markush like?), rather than one-by-one. I believe a similar problem exists in the area of sports & doping, where new “undetectable compounds” turn up faster than anyone has time to analyze and make new laws prohibiting previously identified ones.

I (obviously) can only recommend against making any type of existing or new drugs – not only from a substance abuse or legal perspective, but also from a plain health perspective – putting untested “shit” into your body will lead to – shitty results, plain and simple. And if you are not a chemist and are doing “shit” in your garage, well, count on “shit” happening.


Been having a few problems with the site, thus development isn’t going as expected, but should be resolved soon.

Aside from the web-page structure, I am working on an article on “Sugars in diets” (not mining related, indeed) and how to use the external tool in Knime in a Windows environment.