After getting annoyed at myself for re-curating datasets I had already curated before, I started with an obvious, simple, but imho important curation of the USPTO reaction datasets, which have been "pre-curated" in the past into csv/rsmi format. I collected them and uploaded them here at figshare.
The public curation of the USPTO datasets now has nearly a decade behind it. The original USPTO data consists of a bunch of XML files containing a fair number of errors, but also a lot of (useful) data. The first major curation effort was made by Lowe during his Ph.D. work, later published on figshare; it even ended up as a commercial database. Other groups have used this dataset for chemical reaction prediction modeling and disclosed their own curation when publishing their research, among them Jin (& Coley) et al. (MIT), followed by Schwaller et al. (IBM Zurich). In the end, thanks to the consortia that were founded, many of the names one currently sees in the ML/AI field of synthesis prediction are interconnected. It is certainly a great collaboration, though one can suspect that competition cannot be denied, just as in any other research field.
Back to the topic at hand though: the yield-curated datasets. In particular, I was annoyed by the ton of data sitting behind messy yield columns, two columns with faulty, partial, or missing data that somehow had to be deconvoluted into something useful. By the time these are finally cleaned up, roughly 50% of the data disappears. So despite a valiant effort by esp. Lowe, the data is essentially full of noise, and that is before considering (in)correct structures! The alternative would of course be to curate the original USPTO XML files oneself, or perhaps the CML files by Lowe containing the original / more curated data. That is a different story though and not part of this discussion.
Here is an example of the type of yield entries in the original files (left two columns). After cleaning (see later on) they end up as the "Cleaned columns" (not part of the final datasets), which are then combined to give the final Yield column (to the right).
I hacked together a Python script to do precisely this curation; you can find the code here on Github. As an alternative, since I also like working with Knime, and for people who might not want to or cannot program, I have built the same steps as a graphical Knime workflow (available on the Knime Hub). In that workflow you will also find a bonus analysis step: a simple proof of concept on how to split the reaction smiles (data not shared though, it was only a POC after all).
The Python script (and technically the Knime workflow too) consists of two main components: first, cleaning up all the >= and other text tokens (see also the table above), so that the values can be converted to numbers (I use a pandas DataFrame for the data; see the original script & Readme for more details):
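To give an idea of what that cleaning step looks like, here is a minimal pandas sketch. The column names and the exact token list are illustrative (the real script and Readme are the reference), but the pattern is the same: strip the text tokens, coerce to numbers, drop impossible yields, and combine the two columns.

```python
import pandas as pd

# Toy frame mimicking the two messy USPTO yield columns
# (column names here are illustrative, not the actual dataset headers).
df = pd.DataFrame({
    "TextMinedYield": ["~90%", ">= 95%", None, "50 to 60%", "12%"],
    "CalculatedYield": ["88%", None, "71%", None, "110%"],
})

def clean_yield(series: pd.Series) -> pd.Series:
    """Strip tokens like '~', '>=', '%' and ranges, then convert to float."""
    s = (series.astype(str)
               .str.replace(r"[~%]|>=|<=|[><]", "", regex=True)
               # for 'a to b' ranges keep the lower bound (a choice, not a rule)
               .str.split("to").str[0]
               .str.strip())
    s = pd.to_numeric(s, errors="coerce")
    # discard impossible yields (0 or above 100%)
    return s.where((s > 0) & (s <= 100))

# combine: text-mined value first, calculated value as fallback
df["Yield"] = clean_yield(df["TextMinedYield"]).fillna(
    clean_yield(df["CalculatedYield"]))
print(df["Yield"].tolist())  # -> [90.0, 95.0, 71.0, 50.0, 12.0]
```

Note the "110%" entry silently becomes missing data; that is exactly the kind of row that makes roughly half the dataset evaporate.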
Go check it out if you have use for this type of data. Of course I do welcome feedback!
Anyway, in light of current big-data needs this is just a drop in the sea, and perhaps not even necessary anymore considering the upcoming Open Reaction Database? We shall see what it will bring; afaik it is going to be officially launched any time now(?).
Well, sort-of-kind-of peer-reviewed. The publication in question went through several peer-review rounds before we decided (due to lack of time and resources) to go with ChemRxiv and leave it at that. But at least it’s out there.
We are continuing our work on this, but will probably take it in a slightly different direction.
Noteworthy is that at just around the same time, we held an international online workshop on the topic at my workplace, RISE AB, Södertälje, Sweden:
Accelerating chemical design and synthesis using artificial intelligence
The presentations are available as a binder with ISBN: 9789189167421
As a bit of a whining part (it's my blog after all), regarding the paper, I have to say it's been a rough ride. We started writing in fall 2019 and were ready just before Xmas. While the peer reviewers had some good points, there were some aspects where we felt they "didn't get it" (meaning maybe we weren't very clear). They also seemed to judge the paper not from the mixed-audience view, something we admittedly struggled with. And after we received questions that had been answered twice(!) in earlier revisions, and comments that had nothing to do with the publication, we figured we would secure our publication date via ChemRxiv. Because, to be honest, we had the uneasy feeling that some of the reviewers might not be 100% ethical and might use some of our ideas. And we believe that has happened. There is a ChemRxiv publication by an expert in the field who used an eerily similar conceptual idea: Pd-catalyzed cross-coupling and splitting the data set into "good" and "bad" to obtain better separation. I am paraphrasing here, but some wordings in the abstract alone were… Anyway, that could be coincidence of course. So let's leave it at that.
Not having been active for a while on my blog, a friend of mine reminded me of the work we once did regarding the Zika virus, and that something similar should also be possible with Corona/Covid-19. So there is an idea!
Another easy way to help support Covid-19 research is the Folding@Home distributed-computing project. Though I heard it is so popular that there are no work packages available as of this writing! Check it out for yourself!
Update: A perhaps simpler (and slightly more transparent) way to participate in distributed computing is the previously mentioned BOINC World Community Grid application with "Open Pandemics Research". Give it a go! It's simple and not at all resource demanding! At the time of this update I am at a (meager?) 49d of computing time; compare that to Zika with > 1 year of computation time (all on a simple Intel i5 machine). Don't mind if you use this referral link (no perks such as e.g. money are involved, only a badge for me): https://join.worldcommunitygrid.org?recruiterId=1039555
The whole workflow looks like this; I will go through some of the details separately. During the course of the description here in part 3, I will zoom in via pictures on some of the metanodes (grey boxes with a green checkbox), but not all. If you want to dig into the details, I attach the full workflow for Knime for you to download, with explanations to view directly within.
Pubchem itself quotes these two free access references with regards to itself and API programming:
In order to obtain any data from Pubchem, we first require the CIDs; this simplifies searching compared to using synonyms. In this case the DrugBank IDs retrieved from ZikaVR (see Part 2) are used; there are "only" 15 of them and we feed them in via a text file (a manual table within Knime would do as well). The metanode Get CID is perhaps the portion that is most daunting for someone who doesn't know or care about APIs and such. But to get data automatically from Pubchem, we do have to use the API. Let's open this metanode:
First off, we convert the DrugBank ID number to a useful URL (String Manipulation node).
i.e. checking for compounds whose name contains DB01693 and retrieving the result as XML.
Next up, the actual GET Request node. In this instance we simply send the URLs, with nothing else to set up except a delay, since we didn't use async requests. After that, we keep only positive results (HTTP code 200) and convert the XML-based structure information to, well, a structure. The XPath node could, if desired, retrieve a whole lot more information, but here we simply retrieve the CID. Finally, we keep only certain columns and tell the system that the molecule text column is indeed a structure column.
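For the script-inclined, the same three steps (build the URL, GET, extract the CID via XPath) can be sketched in a few lines of Python. The endpoint pattern below follows PubChem's PUG REST layout but is an assumption on my part, and instead of a live request I use a canned response so the parsing step can be shown without network access.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Hypothetical URL construction, mirroring the String Manipulation node;
# the exact endpoint/format is an assumption based on PubChem's PUG REST.
def cid_url(name: str) -> str:
    base = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name"
    return f"{base}/{urllib.parse.quote(name)}/cids/XML"

# Canned response standing in for a real GET request (HTTP 200),
# so the XPath-style extraction can run offline.
canned = """<?xml version="1.0"?>
<IdentifierList xmlns="http://pubchem.ncbi.nlm.nih.gov/pug_rest">
  <CID>578447</CID>
</IdentifierList>"""

ns = {"p": "http://pubchem.ncbi.nlm.nih.gov/pug_rest"}
root = ET.fromstring(canned)
cids = [int(e.text) for e in root.findall("p:CID", ns)]
print(cids)  # -> [578447]
```

In a real run you would fetch `cid_url("DB01693")` with your HTTP client of choice, keep only 200 responses, and parse the body exactly like this.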
2 Obtaining the AIDs (Assay IDs)
The next step is to obtain the assay IDs from Pubchem. Since there is no direct way (as far as I can tell) to obtain a particular screen or target in the context of a compound, one has to retrieve all assays which reference a particular compound and then analyze those assays.
Thus in this case, the URL sent to Pubchem via GET Request looks something like this:
i.e. retrieving all AIDs for the compound with CID 578447 in text format (which corresponds to the DrugBank ID DB01693 above). The result we work with is a list of AIDs per compound, hence the Ungroup node following the Get AID metanode.
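Scripted, the Get AID step plus the Ungroup flattening could look like this. Again, the endpoint pattern is my assumption based on PUG REST, and the responses are canned stand-ins for real GET requests (one AID per line in the TXT format).

```python
# Sketch of the "Get AID" step plus the Ungroup node that follows it.
def aid_url(cid: int) -> str:
    # assumed PUG REST pattern for "all AIDs for this CID, as text"
    return f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/aids/TXT"

# Canned TXT responses standing in for real GET requests;
# the second CID is a made-up example.
responses = {
    578447: "1811\n2057\n",
    11234:  "1811\n",
}

# "Ungroup": flatten the per-compound AID lists into (CID, AID) rows.
rows = [(cid, int(line))
        for cid, body in responses.items()
        for line in body.split()]
print(rows)  # -> [(578447, 1811), (578447, 2057), (11234, 1811)]
```

The flattened (CID, AID) pairs are exactly what the downstream "Get Target (from AID)" step iterates over.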
Now that we have the AIDs, we can retrieve the actual target names, done here in the Get Target (from AID) metanode. Here, the Get Request URLs look like this:
i.e. retrieve as XML the information on the assay with AID #1811, etc.
From the XML we extract three values: ChEMBL link (optional), target type and target name, which we then filter (via Row Splitter, or Filter if you prefer) to keep only "Single Proteins", separating out ambiguous results like "Tissue: Lymphoma Cells" or "Target Type: CELL-LINE", etc. This leaves us in this instance with three compounds tested on "Kinesin-like protein 1". Remember, this target was one of the targets identified earlier in Part 2.
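The extract-then-split logic is simple enough to show in stdlib Python. Be warned: the real PubChem assay XML is much richer and structured differently, so the tag names below are purely illustrative, only the shape of the step (pull three fields, keep single proteins) matches the workflow.

```python
import xml.etree.ElementTree as ET

# Toy assay record; tag names are illustrative, not PubChem's real schema.
assay_xml = """<Assay>
  <TargetType>SINGLE PROTEIN</TargetType>
  <TargetName>Kinesin-like protein 1</TargetName>
  <ChEMBLLink>https://www.ebi.ac.uk/chembl/target/EXAMPLE</ChEMBLLink>
</Assay>"""

root = ET.fromstring(assay_xml)
record = {tag: root.findtext(tag)
          for tag in ("TargetType", "TargetName", "ChEMBLLink")}

# The Row Splitter step: keep only single-protein targets,
# discarding cell lines, tissues, and other ambiguous entries.
keep = record["TargetType"] == "SINGLE PROTEIN"
print(keep, record["TargetName"])
```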
Be aware that we now have the target name and some assay IDs for OUR compounds, but not all assays that have been submitted to Pubchem for the protein kinesin-like protein 11 (KIF11); in these pictures it is sometimes denoted KLP11, stemming from the Pubchem code.
3 Retrieve all screened Pubchem cmpds and Comparison with other sources
Now we retrieve all assays that deal with KIF11, and can thus retrieve all structures mentioned in those assays, followed by a comparison with other sources. We start with the metanode Get KLP11:
At this point, the URLs required should be straightforward – here the URL for the target (one single get request in this case):
Now we end up with over 5700 compounds (for which we also retrieve the structures in their own sub-metanode, just as described earlier). At this point, to be able to compare with the original structures found in DrugDB stemming from ZikaVR, we cluster these compounds and make the "graph" (node Cluster Preparation), in parallel to the original input structures in the node common skeletons. Clustering per se, especially (but not solely) in Knime, is a rather deep topic of discussion and will therefore not be described here, though you can go into the workflow and have a look at how we did it in this instance. Now that I write this, I guess this is a neat follow-up topic for this blog!
The final comparison DB vs PubChem cores is a simple affair based on Knime's Joiner/Reference Row Splitter nodes, using the InChI keys as comparison strings (InChI is a good way to sift out duplicates, despite some caveats when using it).
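Conceptually the Joiner/Reference Row Splitter pair is just set intersection and difference on the InChIKey strings. A tiny sketch (the keys below are shortened placeholders, not real InChIKeys):

```python
# Sketch of the "DB vs PubChem cores" comparison via InChIKeys.
drugdb_cores  = {"KEY-AAA", "KEY-BBB", "KEY-CCC"}   # cores from DrugDB/ZikaVR
pubchem_cores = {"KEY-BBB", "KEY-CCC", "KEY-DDD"}   # cores from the KIF11 assays

common    = sorted(drugdb_cores & pubchem_cores)    # top output port
not_found = sorted(drugdb_cores - pubchem_cores)    # lower output port
print(common, not_found)  # -> ['KEY-BBB', 'KEY-CCC'] ['KEY-AAA']
```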
There we have it: the top output port of the metanode gives us the common cores; the lower one gives the cores not found, in this case, in Pubchem.
4 Substructures of DrugDB compounds in Pubchem
A not-so-dissimilar approach to the one in sections 2 & 3 above is shown here in section 4: retrieving all substructure analogues of the compounds in our original list, independent of any target. Specifically: which similar compounds are out there that have not been reported (in Pubchem) as screened against our targets of interest, but might show activity anyway?
We need to start by removing explicit hydrogens from the retrieved structures. For efficiency, this should probably be done only once, early on, since it is e.g. reused in section 3 (the common skeletons metanode contains this step again). This is not uncommon when developing workflows: you add a new portion and end up with duplicates of certain steps, which you may or may not bother to change later. For easier reading and understanding it can actually be simpler to use the same node multiple times; remember, we are not programming a final, super-efficient workflow here.
Drilling down into the metanode Get Sub-Structures, we have to retrieve the substructures via an asynchronous (async) request, something shown here in probably the least nice and efficient way, but hey, it works. For substructure searches, Pubchem won't give you back the whole list at once, only a reference to a list with the substructures. This is what the first two boxes, PART1 and PART2, do.
Now, each list contains a number of new CIDs; if we collect them all, we get more than 300 000 IDs, a bit too much to handle… thus a filtration was necessary, one that at this point was done manually but could certainly be done more elegantly. In this case a manual table with the subgraphs of interest is used (Table Creator node). Needless to say, if you want to and can mine through the remaining compounds, you will have an interesting resource for further analysis (e.g. via other types of subgraphs, property filtration, etc.).
Finally, the structures themselves are retrieved for the ca 1100 compounds (IDs) in our case (in the same way as described above).
Back to the main portion: the top row (branch) next to Get Sub-Structures is more of a confirmation addition: of all the substructures found, how many of them mention KIF11? This leads back to compounds we have seen before.
The lower two branches check the similarity of the new substructure hits versus ours, in terms of a high likelihood of showing activity, in this case (let's not forget the overall goal) against the Zika virus. This comparison is simply done by fingerprinting and comparing, here with two different methods, average or maximum similarity, with a Tanimoto cut-off of 0.7.
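The average-vs-maximum distinction is worth a small sketch. Here fingerprints are represented simply as sets of "on" bits; a real workflow would of course use proper chemical fingerprints (e.g. from a toolkit like RDKit), but the Tanimoto arithmetic is the same.

```python
# Minimal Tanimoto on fingerprints represented as sets of "on" bits.
def tanimoto(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Toy fingerprints: two query compounds from our list, one new candidate.
query_fps = {"q1": {1, 2, 3, 4}, "q2": {10, 11}}
new_fp    = {1, 2, 3, 4, 5}

sims = [tanimoto(new_fp, fp) for fp in query_fps.values()]
avg_sim, max_sim = sum(sims) / len(sims), max(sims)

# Keep the candidate if it passes the 0.7 cut-off under the chosen method.
print(max_sim >= 0.7, avg_sim >= 0.7)  # -> True False
```

As the toy numbers show, maximum similarity is the more permissive of the two methods: one close neighbour is enough, whereas the average is dragged down by dissimilar queries.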
And "hopefully" all the results (numbers/graphs) should correspond to what was described earlier in this series of blog entries.
If you have any questions or anything is unclear, don't hesitate to contact me! And/or download the workflow and play around with it yourself!
PS: don’t hesitate to contact me if you run into troubles with the workflow.
PPS: The Excel file is now included within the workflow folder; you will have to adjust the path for it to be correct. Obviously other input methods are also possible: a manual table, a csv reader, etc.
Parts of the blog series "What disease should I research @home? Zika Virus as Example" (Part 1 & Part 2) were presented at "ICIC 2017", the International Conference on Trends for Scientific Information Professionals, Heidelberg, October 23-24, by my friend Dr. Fernando Huerta from InOutScience.
This presentation was a combined effort between the two of us in terms of content, though as far as the slide deck and the presentation go, the main work was done by Fernando. Awesome job, don't you (the reader) agree?
Here is the slide deck (via Dr Haxel Congress and Event Management GmbH, Slideshare and LinkedIn):
Continuing from our introduction in part 1, where we established the "what", i.e. the Zika virus, we will now address the "how".
(disclaimer for the nit-picky ones: this is a casually written blog, not a full-fledged scientific article)
First: which target do we need to dig into? There is of course the virus itself, which is somewhat "easily" addressed by creating a vaccine against it. As far as we can tell, efforts toward this are ongoing, though they will still require many years to deliver any positive or negative clinical outcome. It won't be the focus of this little report; we are small-molecule people after all!
Thus, the alternative is to look into the biological pathways affected within the host by the Zika virus. There is currently a multitude of targets of interest. Generally speaking, all drugs that are developed affect a target within such a pathway (or several), if such pathway(s) are known. In the case of Zika, several dozen targets are under investigation which seem to show some impact on inhibiting/stopping the disease, some more promising than others.
After some digging around on the Web and reported literature, there are two sources we will be using:
This database has done a lot of the leg-work by collecting the relevant information in a single resource, e.g. results of the Zika genome compared to known viruses, which allows for quick identification of known drugs. These known drugs affect certain targets, which allows the assumption that the Zika virus affects said targets in the host as well. Of course, this remains to be proven in animal models/clinical trials, but it reduces the number of possible targets and allows for a more focused plan of attack until more details are known (e.g. new targets may emerge from the genome comparisons, or certain known drug targets may become less/more attractive for whatever reason, etc.).
There are numerous ways and strategies to use as a starting point to find new leads. Strategy can at times be a question of personal vs company taste, experience and even "philosophy". The overall (certainly non-exhaustive) gist in our case is a structural analysis with associated data.
Actually, there are two more general resources which we use:
Pubchem (https://pubchem.ncbi.nlm.nih.gov/) and DrugBank (https://www.drugbank.ca/): publicly accessible databases with a wealth of information. They can be searched via a web interface, downloaded via FTP, or accessed via their APIs. The latter is the way we access especially Pubchem, with the help of Knime. If you are wondering at this point about Knime, please see some of the previous blog entries on this site.
Analysis Part 1
The goal is to identify known drugs (pre-clinical, clinical, as well as marketed ones) and (related) drug targets. From a small-molecule perspective, the most interesting section of ZikaVR is the "Drug Target" section, as of October 2017 containing 464 targets: http://bioinfo.imtech.res.in/manojk/zikavr/drug_target1.php; the list of compounds is over 500 structures long! For easier and faster analysis, we exported this list containing DrugBank IDs and used Knime to continue working with them. We need to find the structures, check for duplicates, and do a clustering, if at all possible.
With the help of Knime, we end up with a list of only 14 compounds! (We will give details on the Knime workflows in a subsequent part of this blog series).
And looking more closely at the targets of these, we see that all of them belong to three main biological classes:
Interleukin-4 receptor subunit alpha (a cytokine, see https://en.wikipedia.org/wiki/Interleukin_4)
Genome polyprotein (which is a rather generic class for viral strains)
We will at this point not discuss a potential bias stemming from which compounds were entered into the ZikaVR database; we refer to the database itself for details on curation.
The next step in part one is to analyze the structures and check for any similarities. We do fingerprint & graph-based clustering (in simple terms: strip all attached groups from a molecule to keep the core, then replace all heteroatoms with carbon and finally make all bonds single bonds; again, see upcoming part 3) and end up with only three distinct cores, whereby six of the DrugDB compounds share one common core (subgraph), all corresponding to KIF11 inhibitors:
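The graph-reduction recipe in the parenthesis above can be sketched in plain Python on a toy molecule representation (atoms as an index-to-element dict, bonds as (i, j, order) triples). A real implementation would use a cheminformatics toolkit (e.g. RDKit's generic Murcko scaffolds do essentially this); the sketch just makes the three steps concrete.

```python
# Toy sketch: strip terminal atoms iteratively (remove attached groups),
# then turn every remaining atom into carbon and every bond into a single bond.
def core_graph(atoms: dict, bonds: set) -> tuple:
    """atoms: {idx: element}; bonds: {(i, j, order), ...}"""
    atoms, bonds = dict(atoms), set(bonds)
    while True:
        degree = {i: 0 for i in atoms}
        for i, j, _ in bonds:
            degree[i] += 1
            degree[j] += 1
        terminal = {i for i, d in degree.items() if d <= 1}
        if not terminal:
            break  # only ring/core atoms remain
        atoms = {i: e for i, e in atoms.items() if i not in terminal}
        bonds = {(i, j, o) for i, j, o in bonds
                 if i not in terminal and j not in terminal}
    # heteroatoms -> C, all bond orders -> 1
    return {i: "C" for i in atoms}, {(i, j, 1) for i, j, _ in bonds}

# A pyridine-like ring (atom 0 = N) with an amino substituent (atom 6).
atoms = {0: "N", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "N"}
bonds = {(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 0, 1), (2, 6, 1)}
core_atoms, core_bonds = core_graph(atoms, bonds)
print(len(core_atoms), len(core_bonds))  # -> 6 6 (an all-carbon, all-single-bond ring)
```

Two molecules sharing a common core then simply compare equal after this reduction, which is exactly what makes those six DrugDB compounds cluster together.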
Analysis Part 2
Looking into the above-mentioned article by Ming et al., we paid particular attention to the compound PHA-690509. It is a CDK-type inhibitor (see e.g. this WIKI entry) known to have antiviral activity. The authors disclose further, structurally unrelated CDKi compounds which inhibit Zika replication as well.
In the supplementary material one finds a list of 27 CDKi compounds, which we used as the basis for further analysis. If we do a subgraph analysis of these compounds, we find three such graphs covering all of them (with some overlap):
We have one structural "coincidence" based on KIF11 (plus two other target classes which we will not use here) and ten structurally unrelated CDKi compounds (different CDK targets). Thus we planned the following:
Search for bio-actives based on the subgraphs shown above
Search for KIF11 inhibitors (target/structure related)
Search for CDK inhibitors (target/structure related)
We searched through Pubchem either by target name or via the subgraphs (substructure-type searches). Also, in the case of CDKi's we used an activity cut-off of pIC50 > 6 (i.e. sub-micromolar activity). For target names we also had to take a detour and check the Assay IDs (AIDs) based on related structure IDs (CIDs); if you happen to know a better way, please contact us, we would be delighted to hear it.
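A quick note on that cut-off, since the unit conversion trips people up: pIC50 = -log10(IC50 in mol/L), so sub-micromolar (IC50 < 1 µM) is exactly pIC50 > 6. A one-liner filter (the compound names and IC50 values are made up):

```python
import math

# pIC50 = -log10(IC50 in mol/L); IC50 < 1 uM  <=>  pIC50 > 6.
def pic50_from_nM(ic50_nM: float) -> float:
    return -math.log10(ic50_nM * 1e-9)

assays = {"cmpdA": 50.0, "cmpdB": 5000.0, "cmpdC": 900.0}  # IC50 in nM
actives = [c for c, ic50 in assays.items() if pic50_from_nM(ic50) > 6]
print(actives)  # -> ['cmpdA', 'cmpdC']
```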
In the case of CDKi’s we found 257 structurally unrelated CDKi’s based on the subgraph search, such as
In total we found over 1450 CDKi’s (hitting the whole range of CDK1-CDK20).
In the case of KIF11 it wasn't equally straightforward. One of the reasons is that for KIF11 the alternative name KIF1 is used… Absolutely not confusing, right? So initially we got 0 hits when doing target-name/assay-related searches. With the graph-based search though, we do find over 1100 analogues. If we filter by fingerprint similarity (Tanimoto > 0.7, or even 0.8), we end up with a bit over 730 compounds, 18 of which are reported as KIF1 compounds (meaning KIF11).
We were able to systematically identify compounds that are not necessarily structurally related (which is good, since it removes some of the starting bias), many of which are active on the same targets. There is assay data available as well, thanks to Pubchem, something we haven't really used so far other than as a generic potency cut-off in one case.
For a simple exercise like the one done here, we would claim it is acceptable to neglect some details, e.g. the at-times varying curation quality of Pubchem. You might need to drill down further into things like target names and make sure you search through all alternative names, just in case; sometimes structures aren't 100% correct, etc. Don't get us wrong, in general the curation is very good, but you cannot trust it blindly, since "the devil at times lies in the details".
And now what?
Now? Only now would one really get started! We have some starting points and would want to make something of them. There are a "gazillion" further things to do with these findings: property analysis, pharmacophore modelling, 3d-shape modelling, etc. Perhaps we will revisit this in the future.
Comparison with commercial products
On a final note, if you have the possibility to search through commercial databases, you can do basically the same thing; it might even be a good idea to combine the two types of searches. A quick test with e.g. Reaxys revealed many more CDKi structures, but fewer KIF11 inhibitors. In the end, you have roughly an equal amount of data and would probably end up with the same conclusion in terms of what to focus on. For someone doing Science@home though, Pubchem and the other public databases are the go-to place.
Stay tuned for some practical examples from the above mentioned Knime work-flows in Part 3.
PS: Parts of this blog will be (or have been) presented at "ICIC 2017", the International Conference on Trends for Scientific Information Professionals, Heidelberg, October 23-24, by Dr. Fernando Huerta from InOutScience.
I'm assuming that you have (some) knowledge of how to search, what to look out for, and a workflow for the different steps required to do the job. It is otherwise a topic of its own for another time. Not that it hasn't been described before; see just one example here:
Nicola, G., Berthold, M. R., Hedrick, M. P., & Gilson, M. K. (2015). Connecting proteins with drug-like compounds: Open source drug discovery workflows with BindingDB and KNIME. Database: The Journal of Biological Databases and Curation, 2015, bav087. https://doi.org/10.1093/database/bav087
So you have identified something and want to test your hypothesis beyond in silico. Well, that is a bit tougher: you can't really handle and test compounds at home. Theoretically though, you could have someone else do this part for you (order commercial compounds, synthesize something new, test in a biological assay). That is (unfortunately) not for free.
Actual testing aside (it never hurts), what can you do with those cool results? Well, there are a number of things – the simplest one would be: write a blog! More involved and scientifically more appropriate – at the same time more difficult – write a publication in a scientific journal or present at a scientific meeting. You could even try and patent your findings, if you have the finances. It all depends on the impact you want to have.
To go beyond a publication, if you want to be part of / follow up on your findings, you can contact some of the initiatives by pharmaceutical companies that are open to collaboration on new findings, for example Johnson & Johnson [jnjinnovation.com/partner-with-us], AstraZeneca [openinnovation.astrazeneca.com], or the Medicines for Malaria Venture [www.mmv.org/partnering/our-partner-network], and many more. You can also find incubators within academia, but then you would need a contact in a research group there. The list of incubators, companies & universities is by now quite big and could be a topic for a separate blog entry.
If you are really in it for the money though, I think you will be disappointed. Doing drug research from home is more like a hobby, just for fun, in the best case for the greater good. Having said that, should you really find something interesting and contact any of the above-mentioned initiatives, intellectual property and reimbursement will most likely be on the table at some point.
In the current day and age of open-access information, combined with cheap computing power, it is rather simple to do (some) drug research from the comfort of your home, be it as a private person for fun or out of interest, or as a small (start-up) company. Actually, big pharma companies use some of the same resources, combined with their own in-house data and programs, so why shouldn't you?
Where is this data? What kind of data?
There are a number of public, so-called open-access, databases available these days, curated over many years by high-profile institutes, e.g. the National Institutes of Health (NIH) for Pubchem. Many more institutions and specific initiatives have evolved over the years, some appearing literally right now, depending on the field and data. Databases on chemical compounds (small molecules) have been around the longest, afaik, with structures, properties, literature references and associated biological data.
More focused for our purpose of drug research, you have sites such as PubChem, BindingDB, ZINC, or e.g. GuideToPharmacology. I'd say with these you can get pretty far. Curated from literature and also patents, these databases connect structures to biology: mechanism of action, structure of the target, how much is known about it (or not). All sites and databases are arranged differently; some you can search on the web, via an API, some by browsing, or a combination thereof. Then there are also semi-public databases, such as CDD Vault: you can register and search within the public data (all via the web, independent of your machine power), though you cannot download or batch-process on a free account. It might still be worth a look at times, considering you may find data which is not in the literature/patent-curated databases.
What will you need?
A certain understanding of the drug discovery process, chemistry, and some degree of biology. If not yourself, then a good friend who has that knowledge and can support you (though this seems like an unlikely scenario?). Some IT skills certainly don't hurt. Below I will focus on data mining as the core task of home research; methods such as docking or quantum mechanical calculations I will leave out for now.
A(ny) computer – Windows, Linux, Mac – doesn’t matter.
In my experience though, when it comes to chemistry, the Windows platform still offers a broader range of both commercial and freeware programs.
Simply put, it also doesn't matter. Sure, the more power, the smoother your experience, though for mining purposes I would go for more memory before processing power. An Intel i3 with (minimum) 16GB of RAM can get you pretty far with little money. Only for large data sets and more complicated calculations do I feel this becomes a bit of a bottleneck. If you have an i7 or Xeon available, good for you!
What about graphic cards? That actually doesn’t matter for data-mining and simple visualizations. Once you want to do some visual 3d-docking though, that’s another story.
An alternative, or even complementary, solution is a (powerful) workstation placed "anywhere", which could e.g. be shared with someone else to split the investment costs, and then accessed from any (simple) PC/laptop via remote access, e.g. TeamViewer. Cloud-computing@home, so to say.
Reasonably fast internet connection – for mining those web-services.
Knime (available on all platforms), allowing for flexible, visual and fast development of search and analysis workflows. Combined with some know-how in Java or XML, you have quite a powerful package. To start your journey, you can use some of the readily available (example) workflows before getting into details.
A chemical drawing program: there is a rather large number out there, so it is difficult to make a single good suggestion. Knime itself comes with a "myriad" of plugins for structural input and output, so you don't really need a separate program. Myself, I have the free Marvin package by Chemaxon installed.
DataWarrior – a great package for visually guided “manual” mining, sort of “Spotfire light”, if you will.
Excel, or similar, can be used as a lightweight DataWarrior alternative, but is also useful for sharing or storage (as would be Word or Powerpoint and their alternatives).
Scripting languages – R or Python – are not necessary, though they can make a good complement, depending on your requirements.
Java – also not necessary, but since Knime is built on Java, it sometimes can help for certain work-arounds.
XML, HTML, REST – some basics might be helpful when accessing certain services via network API.
What if you don't know Java and such? Don't fret; initially, I didn't either. If you are a "learning by doing" type of person, the knowledge will come automatically. Obviously, you can also learn these in courses.
Modelling and prediction of the toxicity of drug compounds has been, is, and will be a continuous area of interest. I won't go into the detailed literature here; I want to focus on SpotRM+'s contribution to that field:
This methodology focuses on reactive-metabolite formation and its avoidance as a means to reduce structure-based toxicity issues. It is also a computationally cheap method, since it is solely based on SMARTS, carefully hand-curated ones at that. In addition to identifying certain structural features, SpotRM+ delivers one-to-three-page monographs on the marketed (or withdrawn) reference compounds, including mechanistic summaries. So it is more about learning than pure black-box filtration.
SpotRM+ requires Bioclipse, a platform focused on chemical data mining. There is a disadvantage to this package: you can only run and analyze one compound at a time; batch mode isn't possible.
According to the company Awametox AB, batch-mode analysis is a feature requested by a number of customers, e.g. for design/synthesis prioritization. And yes, it is possible, IF you use script-based or workflow-based tools, one of the simpler ones being Knime. For this, you need access to the SpotRM+ database itself and the standard chemistry mining nodes in Knime.
[note that SpotRM+ is a commercial package, though there is a free demo available; both are based on Bioclipse. For the mining suggested here you need the database itself which can be purchased separately]
One of the drawbacks of the database and the SpotRM+ system with regards to batch analysis is that it isn't really designed for it. The readout usually consists of a traffic-light colouring of reference compounds and links to their analysis monographs. Thus, for batch mode to work, you need to decide what you want from it, e.g.:
Is a single “red” or “green” reference hit sufficient?
Do you want to summarize all the reference hits?
Combine with other data for further calculations?
In principle, anything goes – that’s the beauty of the flexibility of a package such as Knime. But would that be sufficient for you to make a decision? I can imagine that a batch-based “high quality” decision should be possible if you combine the output with, e.g., a model based on measured ADMET data (and/or reactive metabolite data).
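To make the policy questions above concrete, here is a minimal Python sketch of how per-compound traffic-light hits could be reduced to a single batch verdict. All names and data are hypothetical, not the actual SpotRM+ output; in Knime you would do the same with grouping/rule nodes.

```python
# Hypothetical aggregation of SpotRM+-style traffic-light hits.
# Each compound maps to a list of (reference compound, colour) hits.
hits = {
    "cmpd_1": [("drug_A", "red"), ("drug_B", "green")],
    "cmpd_2": [("drug_C", "green")],
    "cmpd_3": [],  # no alerts fired
}

def verdict(alerts, single_red_suffices=True):
    """Reduce a list of hits to one label, per the policy questions above."""
    colours = [colour for _, colour in alerts]
    if single_red_suffices and "red" in colours:
        return "flag"          # one red reference hit is enough to flag
    if not colours:
        return "no alert"      # nothing fired at all
    return "review"            # only green/ambiguous hits: read the monographs

summary = {name: verdict(alerts) for name, alerts in hits.items()}
print(summary)
```

Whether “flag” then means discard, deprioritize, or merely annotate is exactly the kind of decision that should be combined with measured data, as noted above.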
Independent of the latter, a basic workflow could look simply like this:
You can find more info and access to mentioned programs here:
SpotRM+: www.awametox.com (bioclipse included; recently updated to V1.2!)
Bioclipse: www.bioclipse.net (mainly for info, not required to download separately)
Knime is a fantastic tool for automating data handling without the need for programming. However, the more complicated the data becomes, the more useful it is to have, or to acquire, knowledge of Java, XML, SQL, etc.
Some limitations do exist with Knime, again depending on your data and the end result you require. In certain situations the need to use an external program may arise.
Now, there is an “External Tool” node in Knime, though officially it is designed for use in a Unix-type environment. But with some tweaking it does work under Windows!
The example I will give here involves recognising chemical structures from pictures, i.e. a chemistry-specific form of OCR.
As is often the case with Knime (or programming in general), there are always multiple ways to solve a problem. This is simply one solution.
What this workflow does:
It reads a directory of PNG (preferred) picture files of structures, even reactions, and converts them to SMILES files. The workflow creates a DOS batch file with the osra commands, which is then executed by the external tool node.
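As a rough Python stand-in for that batch-file-building metanode: the executable path mirrors the folder layout described in this post, everything else is illustrative, and the exact osra command-line options should be verified against the OSRA documentation (here I assume osra writes SMILES to stdout, which is redirected into one .smi file per image).

```python
import os

# Illustrative stand-ins for the workflow's configuration; adjust to
# your own set-up (these paths follow the layout described in this post).
OSRA_EXE = r"C:\osra\V2.0.0\osra-bin.exe"
IMAGE_DIR = r"C:\test"

def make_batch_lines(image_names):
    """One osra call per PNG, writing <name>.smi next to the image."""
    lines = []
    for name in image_names:
        if not name.lower().endswith(".png"):
            continue  # skip anything that isn't a picture file
        src = os.path.join(IMAGE_DIR, name)
        dst = os.path.join(IMAGE_DIR, os.path.splitext(name)[0] + ".smi")
        # redirect osra's stdout (the SMILES) into the output file
        lines.append('"{}" "{}" > "{}"'.format(OSRA_EXE, src, dst))
    return lines

print("\n".join(make_batch_lines(["benzene.png", "readme.txt"])))
```

In the actual workflow these lines are written into a .bat file, which the external tool node then executes.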
To obtain the best quality and to be aware of the recognition limitations, you should read about the OSRA tool via the links below.
Knime, preferably latest Version >= 3.3.x (though it should run with any V3; originally it was developed in V2.x, so it should run there as well, though I can’t test it anymore at this point). https://www.knime.org/
Nodes in Knime: Standard installation, including:
NGS tools [I like using their “Wait” node]
Erlwood Nodes [used for chemistry part]
In order for this workflow to work properly, you will need the following files in the following places (this is because some nodes, e.g. the external tool node, can’t be opened/executed for testing if certain entered data isn’t available). Thus it’s easiest to copy the enclosed files, or alternatively create them yourself (empty notepad files are sufficient) as described:
OSRA in following location:
C:\osra\V2.0.0\osra-bin.exe [I don’t make use of %PATH% or the batch file included in the OSRA distribution]
C:\osra\donotdelete\extToollGreen.txt [can be empty; used for giving the “clear” sign when the node is done]
C:\osra\donotdelete\ignore_me2.txt [can be empty; echoes the cmd line output, which can be ignored or potentially parsed – I don’t]
C:\test\ [currently; may be anywhere else; it contains your images and the resulting structures]
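If you’d rather not click empty notepad files together, a small Python helper can touch the placeholders; this is just a convenience sketch of the step described above (run it on the Windows machine with the paths from this post):

```python
import os

# The two text files the external tool node expects (empty is fine,
# as noted above); C:\test\ just needs to exist as a folder.
REQUIRED_FILES = [
    r"C:\osra\donotdelete\extToollGreen.txt",
    r"C:\osra\donotdelete\ignore_me2.txt",
]

def create_placeholders(paths):
    """Touch each file, creating its parent folder if needed."""
    for path in paths:
        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        open(path, "a").close()  # create if missing, leave alone otherwise

# On your Windows machine, run:
# create_placeholders(REQUIRED_FILES)
# os.makedirs(r"C:\test", exist_ok=True)
```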
The first metanode reads the names of the picture files and creates an executable batch file, which is called by the external tool node. [open the picture in a separate window to view full size]
The tricky part is the external tool node: should you do a full reset without having all the necessary files in place, you won’t be able to open it and compare the set-up.
And here the flow-variables:
The remaining portions are less tricky, and it is a matter of taste what you do with the obtained files. In my second metanode, I make a list of all the SMILES files (OSRA creates one output file per input file) and combine it with the original input file (resp. filename). [open the picture in a separate window to view full size]
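In plain Python, the pairing done by that second metanode could look roughly like this – one .smi per .png, matched by base name. This is purely illustrative, not the actual Knime nodes:

```python
import os

def pair_results(folder):
    """Match each PNG with its OSRA-generated .smi file and read the SMILES."""
    pairs = {}
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".png"):
            continue
        smi_path = os.path.join(folder, os.path.splitext(name)[0] + ".smi")
        if os.path.exists(smi_path):
            with open(smi_path) as fh:
                pairs[name] = fh.read().strip()
        else:
            pairs[name] = None  # OSRA produced no output for this image
    return pairs
```

The resulting table (filename plus recognised SMILES, or a gap where recognition failed) is exactly what gets passed on for the visual comparison.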
Finally, in the third node, the structure is drawn out to allow a visual comparison with the picture input. [open the picture in a separate window to view full size]
After that, it is up to you what you want to do with the results.
A zip file containing the workflow, the mentioned text files, and the folder structure can be downloaded from this link. Some examples of varying graphic quality are included.