Part 3: What disease should I …. ? Knime workflows

First off, apologies for the late arrival of part 3 of this article. I hope you haven’t lost interest just yet.

Anyway, here is, as promised, the Knime workflow used to retrieve the data mentioned in part 2 and in the presentation in Heidelberg last October.

The whole workflow looks like this – I will go through some of the details separately. During the course of the description here in part 3, I will zoom in via pictures on some of the metanodes (grey boxes with a green checkbox), but not all. If you want to dig into the details, the full Knime workflow is attached for you to download, with explanations to view directly within.

Overview Pubchem retrieval in Knime for Zika (click to expand)

Pubchem itself quotes these two free access references with regards to itself and API programming:

Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA, Wang J, Yu B, Zhang J, Bryant SH. PubChem Substance and Compound databases. Nucleic Acids Res. 2016 Jan 4; 44(D1):D1202-13. Epub 2015 Sep 22 [PubMed PMID: 26400175] doi: 10.1093/nar/gkv951.

Kim S, Thiessen PA, Bolton EE, Bryant SH. PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucleic Acids Res 2015 Jul 1;43(W1):W605-11. Epub 2015 Apr 30 [PubMed PMID: 25934803] doi: 10.1093/nar/gkv396.

1 Obtaining the CIDs (Compound IDs)

In order to obtain any data from Pubchem, we first require the CIDs; this simplifies searching compared to using synonyms. In this case the DrugBank IDs retrieved from ZikaVR (see Part 2) are used – there are “only” 15 of them, and we read them in via a text file (a manual table within Knime would do as well). The metanode Get CID is perhaps the portion that is most daunting for someone who doesn’t know or care about APIs and such. But to get data from Pubchem automatically, we do have to use the API. Let’s open this metanode:

Metanode Get CID – click to enlarge and view more details

First off, we convert the Drugbank ID number to a useful URL (String Manipulation node).

The URL should look like this:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/DB01693/XML

i.e. look up compounds whose name matches DB01693 and retrieve the result as XML.

Next up, the actual GET Request. In this instance we simply send the URLs – nothing else to set up here, except a delay, since we didn’t use async requests. After that, we keep only positive results (HTTP code 200) and convert the XML-based structure information to, well, a structure. The XPath node could, if desired, retrieve a whole lot more information, but here we simply retrieve the CID. Finally, we keep only certain columns and tell the system that the molecule text column is indeed a structure column.
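For readers who prefer to see the logic outside of Knime: here is a rough, hypothetical Python sketch of what the Get CID metanode does. The URL pattern is Pubchem’s PUG-REST as quoted above; the XML element name (`PC-CompoundType_id_cid`) is my assumption based on Pubchem’s PC-Compound records, and all function names are mine.

```python
"""Hypothetical stand-in for the Get CID metanode (not the actual Knime nodes)."""
import time
import xml.etree.ElementTree as ET

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def cid_url(drugbank_id):
    # String Manipulation node: turn a DrugBank ID into a name-lookup URL
    return f"{BASE}/compound/name/{drugbank_id}/XML"

def extract_cid(xml_text):
    # XPath node: pull the first CID out of the returned record,
    # matching on the local tag name to sidestep XML namespaces
    root = ET.fromstring(xml_text)
    for el in root.iter():
        if el.tag.endswith("PC-CompoundType_id_cid"):
            return el.text
    return None

if __name__ == "__main__":
    from urllib.request import urlopen
    for db_id in ["DB01693"]:
        with urlopen(cid_url(db_id)) as resp:          # GET Request node
            if resp.status == 200:                      # keep only positive results
                print(db_id, extract_cid(resp.read().decode()))
        time.sleep(0.3)                                 # the delay, since no async requests
```

The network part is kept under the `__main__` guard; the two helper functions mirror the URL-building and XPath steps one-to-one.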

2 Obtaining the AIDs (Assay IDs)
Reminder of parts 1, 2 and 3

The next step is to obtain the assay IDs from Pubchem. Since there is no direct way (as far as I can tell) to obtain a particular screen or target in the context of a compound, one has to retrieve all assays which reference a particular compound, then analyze those assays.

Retrieve AIDs

Thus in this case, the URL sent to Pubchem via GET Request looks something like this:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/578447/aids/TXT

i.e. retrieving all AIDs, in text format, for the compound with CID 578447 (which corresponds to the Drugbank ID DB01693 above). The result we work with is a list of AIDs per compound, hence the Ungroup node following the Get AID metanode.
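The same step expressed as a hedged Python sketch (function names are mine): the TXT endpoint simply returns one AID per line, and the Ungroup step flattens the per-compound lists into one row per (CID, AID) pair.

```python
"""Sketch of the Get AID step plus the Ungroup that follows it (illustrative only)."""
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def aids_url(cid):
    # one URL per compound, asking for its assay IDs as plain text
    return f"{BASE}/compound/cid/{cid}/aids/TXT"

def parse_aids(txt):
    # the TXT endpoint returns one AID per line
    return [int(line) for line in txt.splitlines() if line.strip()]

def ungroup(aids_per_cid):
    # what Knime's Ungroup node does: one (cid, aid) row per list entry
    return [(cid, aid) for cid, aids in aids_per_cid.items() for aid in aids]
```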

Retrieve target names

Now that we have the AIDs, we can retrieve the actual target names, done here in the Get Target (from AID) metanode. Here, the Get Request URLs look like this:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1811/XML

i.e. retrieve as XML the information on the assay with AID #1811, etc.

From the XML we extract three values: ChEMBL link (optional), Target Type and Target Name. We then filter these (via Row Splitter, or Filter, if you prefer) to keep only “Single Proteins”, separating out ambiguous results like “Tissue: Lymphoma Cells” or “Target Type: CELL-LINE”, leaving us in this instance with three compounds tested on “Kinesin-like protein 11”. Remember, this target was one of the targets identified earlier in Part 2.
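As a small illustrative sketch (not the actual Knime nodes; the row layout with `target_type`/`target_name` keys is assumed), the per-assay URL and the Row Splitter logic could look like:

```python
"""Hypothetical stand-in for Get Target (from AID) and the Row Splitter."""

def assay_url(aid):
    # the per-assay XML that the Get Target (from AID) metanode fetches
    return f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{aid}/XML"

def split_single_proteins(rows):
    # Row Splitter logic: top port keeps "Single Protein" targets,
    # bottom port collects the ambiguous rest (tissues, cell lines, ...)
    keep = [r for r in rows if r.get("target_type") == "Single Protein"]
    rest = [r for r in rows if r.get("target_type") != "Single Protein"]
    return keep, rest
```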

Be aware that we now have the target name and some assay IDs for OUR compounds, but not yet all assays submitted to Pubchem for the protein Kinesin-like protein 11 (KIF11); in these pictures it is sometimes denoted as KLP11, stemming from the Pubchem code.

3 Retrieve all screened Pubchem compounds and comparison with other sources

Now we retrieve all assays that deal with KIF11 and can thus retrieve all structures mentioned in those assays, followed by comparing with other sources.  We start with the metanode Get KLP11:

Retrieve all structures tested in KLP11

At this point, the URLs required should be straightforward – here the URL for the target (one single get request in this case):

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/target/genesymbol/KIF11/aids/TXT

Next up is the retrieval of the compounds (as their CIDs) mentioned in these assays, e.g.:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/242686/cids/TXT

Now we end up with over 5700 compounds (for which we also retrieve the structures in its own sub-metanode, just as described earlier). At this point, to be able to compare with the original structures found in DrugDB stemming from ZikaVR, we cluster these compounds and make the “graph” (node Cluster Preparation), in parallel to the original input structures in the node common skeletons. Clustering per se, especially (but not solely) in Knime, is a rather deep topic of discussion and will therefore not be described here – though you can go into the workflow and have a look at how we did it in this instance. Now that I write this, I guess this would make a neat follow-up topic for this blog!

The final comparison DB vs PubChem cores is a simple affair based on Knime’s Joiner/Reference Row Splitter nodes – via the InChI keys as comparison strings (InChI is a good way to sift out duplicates, despite some caveats when using it).
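If you wanted to reproduce that join outside Knime, a minimal sketch – assuming RDKit is available for InChIKey generation; function names are mine – could look like this:

```python
"""Illustrative InChIKey-based join, standing in for Joiner/Reference Row Splitter."""
from rdkit import Chem

def inchikey(smiles):
    # canonical comparison string; None for unparsable input
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToInchiKey(mol) if mol is not None else None

def split_by_reference(our_smiles, pubchem_smiles):
    # top port: common cores; bottom port: cores not found in the reference set
    ref_keys = {inchikey(s) for s in pubchem_smiles}
    common = [s for s in our_smiles if inchikey(s) in ref_keys]
    missing = [s for s in our_smiles if inchikey(s) not in ref_keys]
    return common, missing
```

Note that two differently written SMILES of the same molecule collapse to the same InChIKey, which is exactly why it works as a duplicate sieve.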

There we have it – the top output port of the metanode gives us the common cores; the lower one, the cores not found (in this case, in Pubchem).

4 Substructures of DrugDB compounds in Pubchem

A not dissimilar approach to sections 2 & 3 above is used here in section 4 to retrieve all substructures of the compounds in our original list, independent of any target. Specifically: which similar compounds are out there that have not been reported (in Pubchem) as screened on our targets of interest, but might show activity anyway?

Part 4 overview

We need to start by removing explicit hydrogens from the retrieved structures. For efficiency, this should probably be done only once, early on, since it is e.g. reused in section 3 (the common skeletons metanode contains this step again). This though is not uncommon in workflow development – you add a new portion and end up with duplicates of certain steps, which you may or may not bother to change later on. For easier reading and understanding it can actually be simpler to use the same node multiple times; remember – we are not building a super-efficient final workflow here.

Retrieve (relevant) substructures from Pubchem

Drilling down into the metanode Get Sub-Structures: we have to retrieve the substructures via an asynchronous (async) request – shown here in probably the least elegant and efficient way, but hey, it works. For substructure searches, Pubchem won’t give you back the whole list at once, only a reference to a list with the substructures. This is what the first two boxes, PART1 and PART2, do.

The URL for a substructure search, btw, is:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/followed_by_your_smiles_string/XML

The XML then contains a reference to a listkey – a long string of numbers – which you submit again via:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/333121232131244/cids/XML

Now, each list contains a number of new CIDs; if we collect them all, we get more than 300,000 IDs – a bit too much to handle… thus some filtration was necessary, one that at this point was done manually, but could certainly be done more elegantly. In this case, a manual table with the subgraphs of interest is used (Table Creator node). Needless to say, if you want to (and can) mine through the remaining compounds, you will certainly have an interesting resource for further analysis (e.g. via other types of subgraphs, property filtration, etc.).
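The two-step async dance (PART1/PART2) could be sketched in plain Python roughly like this – a deliberately naive version, where the `ListKey` element name is my assumption about Pubchem’s XML response and the fixed sleep stands in for a proper polling loop:

```python
"""Naive sketch of the async substructure search (PART1 submits, PART2 collects)."""
import time
import xml.etree.ElementTree as ET
from urllib.request import urlopen

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def substructure_url(smiles):
    # PART1: kick off the search; Pubchem answers with a listkey, not the hits
    return f"{BASE}/compound/substructure/smiles/{smiles}/XML"

def listkey_url(listkey):
    # PART2: fetch the finished CID list stored under that key
    return f"{BASE}/compound/listkey/{listkey}/cids/XML"

def extract_listkey(xml_text):
    # grab the ListKey element, ignoring namespaces
    root = ET.fromstring(xml_text)
    for el in root.iter():
        if el.tag.endswith("ListKey"):
            return el.text
    return None

if __name__ == "__main__":
    with urlopen(substructure_url("c1ccccc1")) as resp:
        key = extract_listkey(resp.read().decode())
    time.sleep(5)  # crude stand-in for polling until the search finishes
    with urlopen(listkey_url(key)) as resp:
        print(resp.read().decode()[:200])
```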

Finally, the structures themselves are retrieved for the ca. 1100 compounds (IDs) in our case (same way as described above).

Back to the main portion: looking at the top row next to Get Sub-Structures – this row (branch) is more of a confirmation addition: of all the substructures searched, how many of them mention KIF11? This leads back to compounds we have seen before.

The lower two branches check the similarity of the new substructures versus ours, in terms of a high likelihood to show activity – in this case, let’s not forget the overall goal, against the Zika virus. This comparison is simply done by fingerprinting and comparing, here with two different methods: average or max similarity, with a Tanimoto cut-off of 0.7.
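For illustration, the Tanimoto comparison itself is simple if you think of a fingerprint as the set of its “on” bits – a small stdlib-only sketch (the set representation is my simplification; Knime’s fingerprint nodes work on bit vectors):

```python
"""Tanimoto similarity and the max/average cut-off, on fingerprints as bit sets."""

def tanimoto(fp_a, fp_b):
    # Tanimoto coefficient: |intersection| / |union| of the "on" bits
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def passes_cutoff(query_fp, reference_fps, cutoff=0.7, mode="max"):
    # the two comparison flavours used in the workflow: max or average similarity
    sims = [tanimoto(query_fp, ref) for ref in reference_fps]
    score = max(sims) if mode == "max" else sum(sims) / len(sims)
    return score >= cutoff
```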

And “hopefully” all the results (numbers/graphs) correspond to what was described earlier in this blog series.

If you have any questions or anything is unclear, don’t hesitate to contact me! And/or download the workflow and play around with it yourself!

Have fun!

Download the Knime workflow (including all data)

PS: don’t hesitate to contact me if you run into troubles with the workflow.

PPS: The Excel file is now included within the workflow folder; you will have to adjust the path for it to be correct. Obviously other input methods are also possible: a manual table, a CSV reader, etc.

Addendum: Workflow now also available on Github: https://github.com/DocMinus/ZikaPubChem


Towards a drug discovery wiki? Pubchem, Reaxys & Zika

Parts of the blog on “What disease should I research @home? Zika Virus as Example” (Part 1 & Part 2) were presented at the “ICIC 2017”, the International Conference on Trends for Scientific Information Professionals, Heidelberg, October 23-24, by my friend Dr. Fernando Huerta from InOutScience.

This presentation was a combined effort between the two of us in terms of content, though as far as the slide-deck and the delivery go, the main work was done by Fernando – awesome job, don’t you (the reader) agree?

Here is the slide-deck (via Dr Haxel Congress and Event Management GmbH, SlideShare and LinkedIn):

Part 2: What disease should I …. ? Data mining Zika-virus information

Continuing from our introduction in part 1, where we established the “what” – i.e. the Zika virus – we will now address the “how”.
(disclaimer for the nit-picky ones: this is a casually written blog, not a full-fledged scientific article)

First – which target do we need to dig into? There is of course the virus itself. This is somewhat “easily” addressed by creating a vaccine against it. As far as we can tell, efforts for this are ongoing, though they will still require many years to deliver any positive or negative clinical outcome. It won’t be the focus of this little report – we are small-molecule people after all!

Thus, the alternative is to look into the biological pathways affected within the host by the Zika virus. There are currently a multitude of targets that are of interest. Generally speaking, all drugs that are developed affect a target within such a pathway (or several) – if such pathway(s) are known. In the case of Zika, several dozen targets are under investigation which seem to show some impact on inhibiting/stopping the disease. Some targets are more promising than others.

After some digging around on the Web and reported literature, there are two sources we will be using:

ZikaVR: http://bioinfo.imtech.res.in/manojk/zikavr/ a large database dedicated to this particular virus. It has been described in detail in this research article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5025660/ (you can access it for free).

This database has done a lot of the leg-work by collecting the relevant information in a single resource – e.g., results of comparing the Zika genome to known viruses, which allows for quick identification of known drugs. These known drugs affect certain targets, therefore allowing the assumption that the Zika virus affects said targets in the host as well. Of course, this is still to be proven in animal models/clinical trials. But it reduces the number of possible targets to allow for a more focused plan of attack until more details are known (e.g. new targets may emerge from the genome comparisons, or certain known drug targets may become less/more attractive for whatever reason, etc.).

There are numerous ways and strategies to use as starting point to find new leads. Strategies can be at times a question of personal vs company taste, experience and even “philosophy”. The overall (certainly non-exhaustive) gist in our case is a structural analysis with associated data.

The second resource we will be using is a recent article by Guo-Li Ming and coworkers, who describe in more detail some of the most advanced candidate molecules and discuss possible combination therapies: http://www.nature.com/nm/journal/v22/n10/full/nm.4184.html

Actually, there are two more general resources which we use:

Pubchem (https://pubchem.ncbi.nlm.nih.gov/) and DrugBank (https://www.drugbank.ca/) – publicly accessible databases with a wealth of information. They can be searched via a web interface, downloaded via FTP, or accessed via their APIs. The latter is how we access especially Pubchem, with the help of Knime. If you are wondering at this point about Knime – please see some of the previous blog entries on this site.


Analysis Part 1

The goal here: identify known drugs (pre-clinical, clinical, as well as marketed ones) and (related) drug targets. From a small-molecule perspective, the most interesting section of ZikaVR is the “Drug Target” section, containing, as of October 2017, 464 targets: http://bioinfo.imtech.res.in/manojk/zikavr/drug_target1.php – the list of compounds is over 500 structures long! For easier and faster analysis, we exported this list containing Drugbank IDs and used Knime to continue working with them. We need to find the structures, check for duplicates, and do a clustering – if at all possible.

With the help of Knime, we end up with a list of only 14 compounds! (We will give details on the Knime workflows in a subsequent part of this blog series).

And looking more closely at the targets of these, we see that all of them belong to three main biological classes:

We will at this point not discuss a potential bias stemming from the compounds entered into the ZikaVR database – for details on curation we refer to the database itself.

The next step in part one is to analyze the structures and check for any similarities. We apply fingerprint- & graph-based clustering (in simple terms: strip all attached groups from a molecule to keep the core, then replace all heteroatoms with carbon and finally make all bonds single bonds; again, see upcoming part 3) and end up with only three distinct cores, whereas six of the DrugDB compounds share one common core (subgraph), all corresponding to KIF11 inhibitors:
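For the curious: the “strip attached groups, carbon-ize, single bonds” recipe described above corresponds closely to RDKit’s generic Murcko scaffolds, so a hedged sketch (assuming RDKit is available; this is not the exact Knime node chain we used) could be:

```python
"""Generic-core extraction via RDKit Murcko scaffolds (illustrative sketch)."""
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def generic_core(smiles):
    # 1) strip all attached groups to keep the ring core,
    # 2) replace heteroatoms with carbon and make all bonds single
    mol = Chem.MolFromSmiles(smiles)
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)        # remove side chains
    generic = MurckoScaffold.MakeScaffoldGeneric(scaffold)  # C atoms, single bonds
    return Chem.MolToSmiles(generic)
```

With this, p-cresol and pyridine both reduce to the same cyclohexane core – exactly the kind of collapse that lets structurally different drugs share one subgraph.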

Analysis Part 2

Looking into the above-mentioned article by Ming et al., we paid particular attention to the compound PHA-690509. It is a CDK-type inhibitor (see e.g. this WIKI entry) known to have antiviral activity. The authors disclose further, structurally unrelated CDKi compounds which inhibit Zika replication as well.

In the supplementary material one finds a list of 27 CDKi compounds, which we used as the basis for further analysis. If we do a subgraph analysis of these compounds, we find three such graphs covering all of them (with some overlap):

So far…..

We have one structural “coincidence” based on KIF11 (plus two other target classes which we will not use here) and ten structurally unrelated CDKi compounds (different CDK targets). Thus we planned the following:

  • Search for bio-actives based on the subgraphs shown above
  • Search for KIF11 inhibitors (target/structure related)
  • Search for CDK inhibitors (target/structure related)

We searched through Pubchem either by target name or via the subgraphs (substructure-type search); additionally, in the case of CDKi’s, we used an activity cut-off of pIC50 > 6 (i.e. sub-micromolar activity). For target names we also had to take a detour and check the Assay IDs (AIDs) based on related structure IDs (CIDs) – if there is a better way that you happen to know of, please contact us, we would be delighted to hear it.

In the case of CDKi’s we found 257 structurally unrelated CDKi’s based on the subgraph search, such as

In total we found over 1450 CDKi’s (hitting the whole range of CDK1-CDK20).

In the case of KIF11 it wasn’t equally straightforward. One reason is that for KIF11 the alternative name KIF1 is used… absolutely not confusing, right? So initially we get 0 hits when we do target-name/assay-related searches. With the graph-based search though, we do find over 1100 analogues. If we filter by fingerprint similarity (Tanimoto > 0.7, or even 0.8) we end up with a bit over 730 compounds, of which 18 are reported as KIF1 compounds (meaning KIF11).


Summary

We were able to systematically identify compounds that are not necessarily structurally related (which is good, since it removes some of the starting bias), many of them active on the same targets. There is assay data available as well, thanks to Pubchem – something we haven’t really used so far, other than a generic potency cut-off in one case.

For a simple exercise like the one done here, we would claim it is acceptable to neglect some details, e.g. the at times varying curation quality of Pubchem. You might need to drill down further into things like target names and make sure you search through all alternative names, just in case. Sometimes structures aren’t 100% correct, etc. Don’t get us wrong: in general the curation is very good, but you cannot trust it blindly, since “the devil at times lies in the details”.

And now what?

Now? Only now would one really get started! We have some starting points and want to make something of them. There are a “gazillion” things one could do further with these findings: property analysis, pharmacophore modelling, 3D shape modelling, etc. Perhaps we will revisit this in the future.

Comparison with commercial products

On a final note, if you have the possibility to search through commercial databases, you can do basically the same thing. It might even be a good idea to combine the two types of searches. A quick test with e.g. Reaxys revealed many more CDKi structures, but fewer KIF11 inhibitors. In the end, you have roughly an equal amount of data and would probably reach the same conclusion in terms of what to focus on. For someone doing Science@home though, Pubchem and other public databases are the go-to place.

Stay tuned for some practical examples from the above mentioned Knime work-flows in Part 3.

PS: Part of this blog will be (have been) presented at the “ICIC 2017”, the International Conference on Trends for Scientific Information Professionals, Heidelberg, October 23-24, by Dr. Fernando Huerta from InOutScience.

Part 1: What disease should I research @home? Zika-Virus as example

Dabbling with basic Science@home is fun – though probably only up until the question arises: “what should I actually research?”

This leads to the question on how to find or decide on any disease to start with. Since this is a rather entangled question (or rather, the answers can be), I will offer three of the simplest answers:

  • Choose whichever disease you are curious about (or have a relation to)
  • Pick a particular target that you heard (know) of and are interested in
  • Take from current news a “hot-topic” disease/target

This might sound somewhat naïve, but it can be rather relevant, and it is used by many researchers within pharmaceutical development, at least as part of the starting point. As an example, my friend Fernando from InOutScience and I have been considering the Zika virus ourselves (hence, I will use “us/we” for the remainder of this blog series). Myself, I stumbled upon it via the news last year (and a family interest, if you will).

What is Zika?

Zika hit the world news last year after an outbreak of epidemic proportions in South America. That the world took notice at all was (as usual?) down to economics: the 2016 Summer Olympics in Brazil and the spread to the southern parts of the USA. The outbreak then declined by the end of the year nearly as fast as it had appeared. The reasons for this still seem to be unclear to epidemiologists.


The virus itself is mosquito-borne, transmitted by Aedes mosquitoes. It leads to harmless symptoms, the most common ones being headache, muscle and joint pain, mild fever, rash, and inflammation of the underside of the eyelid.
But: what brought this virus into the limelight is the fact that when transmitted to pregnant women, the fetus is at risk for birth defects!

The latter is the reason for the efforts to find treatments (otherwise, basic flu treatment seems to do the trick).

You can find nicely summarized facts on the World Health Organization (WHO) webpage on Zika.

Unfortunately, as with any neglected disease (tropical ones most often fall into this category), there is no money to be made in finding new medications (research & development costs versus what you can earn from it…). Therefore it falls on smaller companies as well as academic groups to be the major players researching these, as is the case with Zika.
You yourself can participate indirectly, if you like, via the World Community Grid distributed computing project – see my blog entry here to find out how.

If you want to know more about Zika, please check out these links:

Now that we have a disease to research on – how to continue? Part 2 now available, please click here.

PS: Part of this blog series will be presented at the “ICIC 2017“, the International Conference on Trends for Scientific Information Professionals, Heidelberg, October 23-24. The presentation will be made by Fernando Huerta from InOutScience .

Science@home for everyone – the quick and simple(st) way

Do you want to contribute to research but don’t have the time/nerve/know-how for any kind of deeper involvement? Of course you want to 😀 !
And yes, it is possible! The answer is – distributed or volunteer computing!

This is not a new phenomenon – it has been around for quite a long time now. One of the better-known projects is most likely SETI@home, where you help analyze radio signals from space in the search for extra-terrestrial life.
Today, the field of distributed computing encompasses all kinds of research areas, including drug discovery. One of many summaries on this subject can be found on this blog by the OpenScientist and of course Wikipedia, on Volunteer Computing.

Thus, by allowing your computer to calculate on behalf of whatever research is in question, you indirectly contribute to that project – without lifting a finger. The only things you need to do are install a program and register yourself as a user (for some you can even just run anonymously), with the tiny caveat that you also “contribute” electricity. But hey – it’s for science, right? In addition, some projects include a fancy-looking screen saver!
Don’t want to have your computer on all the time? Don’t want to be bothered while you are using your own machine? No problem – nearly all of them, AFAIK, offer several ways to restrict the client with regards to CPU/GPU usage or the times it may run.

Can’t decide what to contribute to? Want to contribute to multiple projects but not have multiple clients to install and keep track of? Then I can recommend the World Community Grid, which supports 7 different projects out of the box. And if I am not mistaken, with a wee bit of manual work you can make this client run other projects as well. And if you prefer doing something like this while playing a video game, even that is possible, for example in EVE Online or FoldIt (these though require a bit more “work”, with active input/analysis by the user, and thus go beyond the idea of “simple” distributed computing).

Myself, I am supporting the OpenZika project, due to some personal interest in this field. Come join me and many, many others!

Click here to get started!
(Note: this includes my referral ID – don’t worry, there is no money involved, it simply gives out “badges” for “recruitment”. Use the above World Community Grid link instead if you don’t like this idea).