Part 3: What disease should I …. ? Knime workflows

First off, apologies for the late arrival of part 3 of this article. I hope you haven’t lost interest just yet.

Anyway, here, as promised, is the Knime workflow used to retrieve the data mentioned in part 2 and in the presentation in Heidelberg last October.

The whole workflow looks like this – I will go through some of the details separately. During the course of the description here in part 3, I will zoom in via pictures on some of the metanodes (grey boxes with a green checkbox), but not all. If you want to dig into the details, the full Knime workflow is attached for download, with explanations directly within it.

Overview Pubchem retrieval in Knime for Zika (click to expand)

Pubchem itself cites these two open-access references regarding the database and API programming:

Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA, Wang J, Yu B, Zhang J, Bryant SH. PubChem Substance and Compound databases. Nucleic Acids Res. 2016 Jan 4; 44(D1):D1202-13. Epub 2015 Sep 22 [PubMed PMID: 26400175] doi: 10.1093/nar/gkv951.

Kim S, Thiessen PA, Bolton EE, Bryant SH. PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucleic Acids Res. 2015 Jul 1;43(W1):W605-11. Epub 2015 Apr 30 [PubMed PMID: 25934803] doi: 10.1093/nar/gkv396.

1 Obtaining the CIDs (Compound IDs)

In order to obtain any data from Pubchem, we first require the CIDs; searching by CID is simpler than searching via synonyms. In this case the DrugBank IDs retrieved from ZikaVR (see Part 2) are used; there are “only” 15 of them, and we read them in via a text file (a manual table within Knime would do as well). The metanode Get CID is perhaps the portion that is most daunting for someone who doesn’t know or care about APIs and such. But to get data from Pubchem automatically, we do have to use the API. Let’s open this metanode:

Metanode Get CID – click to enlarge and view more details

First off, we convert the DrugBank ID to a usable URL (String Manipulation node).

The URL should look like this:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/DB01693/XML

i.e. looking up compounds whose name matches DB01693 and retrieving the result as XML.

Next up, the actual GET Request node. Here we simply send the URLs; there is nothing else to set up, except a delay, since we didn’t use asynchronous requests. After that, we keep only positive results (HTTP status code 200) and convert the XML-based structure information to, well, a structure. The XPath node could, if desired, retrieve a whole lot more information, but here we simply retrieve the CID. Finally, we keep only certain columns and tell the system that the molecule text column is indeed a structure column.
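For those who prefer to see this step outside Knime: a minimal Python sketch of the same idea – building the name-to-CID URL and pulling the CID out of the XML response. The helper names are my own, and the sample XML below is a heavily trimmed, illustrative stand-in for a real PUG-REST response.

```python
import xml.etree.ElementTree as ET

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def name_to_cid_url(name: str) -> str:
    """Build the PUG-REST URL resolving a name (here: a DrugBank ID) to compound records as XML."""
    return f"{BASE}/compound/name/{name}/XML"

def extract_cids(xml_text: str) -> list[str]:
    """Pull all CID values out of a PUG-REST compound XML response.
    The XML carries a default namespace, so we match on the local tag name only."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if el.tag.split("}")[-1] == "PC-CompoundType_id_cid"]

# Tiny illustrative response (heavily trimmed; the real document is much larger):
sample = """<PC-Compounds xmlns="http://www.ncbi.nlm.nih.gov">
  <PC-Compound>
    <PC-Compound_id>
      <PC-CompoundType>
        <PC-CompoundType_id>
          <PC-CompoundType_id_cid>578447</PC-CompoundType_id_cid>
        </PC-CompoundType_id>
      </PC-CompoundType>
    </PC-Compound_id>
  </PC-Compound>
</PC-Compounds>"""

print(name_to_cid_url("DB01693"))  # the URL shown above
print(extract_cids(sample))        # ['578447']
```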

2 Obtaining the AIDs (Assay IDs)
Reminder of parts 1, 2 and 3

The next step is to obtain the assay IDs from Pubchem. Since there is no direct way (as far as I can tell) to obtain a particular screen or target in the context of a compound, one has to retrieve all assays which reference a particular compound, then analyze those assays.

Retrieve AIDs

Thus in this case, the URL sent to Pubchem via GET Request looks something like this:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/578447/aids/TXT

i.e. retrieving all AIDs for the compound with CID 578447 in text format (which corresponds to the above DrugBank ID DB01693). The result we work with is a list of AIDs per compound, hence the Ungroup node following the Get AID metanode.
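Sketched in Python (hypothetical helper names; the TXT body shown is made up), the URL construction plus the Ungroup step look roughly like this:

```python
def cid_to_aids_url(cid: int) -> str:
    # PUG-REST: all assay IDs that reference this compound, as plain text (one AID per line)
    return f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/aids/TXT"

def ungroup(aids_per_cid: dict[int, str]) -> list[tuple[int, int]]:
    """Mimic Knime's Ungroup node: turn {cid: "aid\\naid\\n..."} into one (cid, aid) row per assay."""
    rows = []
    for cid, txt in aids_per_cid.items():
        for line in txt.splitlines():
            if line.strip():
                rows.append((cid, int(line)))
    return rows

responses = {578447: "1811\n242686\n"}  # hypothetical TXT response body
print(cid_to_aids_url(578447))
print(ungroup(responses))  # [(578447, 1811), (578447, 242686)]
```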

Retrieve target names

Now that we have the AIDs, we can retrieve the actual target names, done here in the Get Target (from AID) metanode. Here, the GET Request URLs look like this:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1811/XML

i.e. retrieve as XML the information on the assay with AID #1811, etc.

From the XML we extract three values: ChEMBL link (optional), target type and target name, which we finally filter (via Row Splitter, or Filter if you prefer) to keep only “Single Proteins”, separating out ambiguous entries like “Tissue: Lymphoma Cells” or “Target Type: CELL-LINE”, etc. This leaves us in this instance with three compounds tested on “Kinesin-like protein 1”. Remember, this target was one of the targets identified earlier in Part 2.
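The Row Splitter logic itself is simple; a Python sketch with made-up rows (AID, target type, target name), keeping only the single-protein targets in the top "port":

```python
# Hypothetical rows as the XPath node might deliver them: (aid, target_type, target_name)
rows = [
    (1811, "SINGLE PROTEIN", "Kinesin-like protein 1"),
    (9999, "CELL-LINE", "Tissue: Lymphoma Cells"),
]

def split_single_proteins(rows):
    """Mimic the Row Splitter: first list = single-protein targets, second list = everything else."""
    keep = [r for r in rows if r[1].upper() == "SINGLE PROTEIN"]
    rest = [r for r in rows if r[1].upper() != "SINGLE PROTEIN"]
    return keep, rest

keep, rest = split_single_proteins(rows)
print(keep)  # [(1811, 'SINGLE PROTEIN', 'Kinesin-like protein 1')]
```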

Be aware that we now have the target name and some assay IDs for OUR compounds, but not all assays that have been submitted to Pubchem for the protein kinesin-like protein KIF11; in these pictures it is sometimes denoted KLP11, stemming from the Pubchem code.

3 Retrieve all screened Pubchem compounds and comparison with other sources

Now we retrieve all assays that deal with KIF11, and from those we can retrieve all structures mentioned in them, followed by a comparison with other sources. We start with the metanode Get KLP11:

Retrieve all structures tested in KLP11

At this point, the URLs required should be straightforward – here the URL for the target (one single get request in this case):

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/target/genesymbol/KIF11/aids/TXT

Next up is the retrieval of the compounds mentioned in these assays, as their CIDs, e.g.:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/242686/cids/TXT

Now we end up with over 5700 compounds (for which we also retrieve the structures in their own sub-metanode, just as described earlier). At this point, to be able to compare with the original structures found in DrugDB stemming from ZikaVR, we cluster these compounds and build the “graph” (node Cluster Preparation), in parallel to the original input structures in the node common skeletons. Clustering per se, especially (but not solely) in Knime, is a rather deep topic of discussion and will therefore not be described here, though you can go into the workflow and have a look at how it was done in this instance. Now that I write this, I guess this would be a neat follow-up topic for this blog!
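The collection and deduplication of CIDs across the many KIF11 assays is, stripped of the actual network calls, just a set union over the one-ID-per-line TXT responses. A sketch with hypothetical response bodies (real ones hold thousands of lines):

```python
def aid_to_cids_url(aid: int) -> str:
    # PUG-REST: all compound CIDs referenced by an assay, as plain text
    return f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{aid}/cids/TXT"

def collect_unique_cids(txt_responses: list[str]) -> set[int]:
    """Union the one-CID-per-line TXT responses of many assays into one deduplicated set."""
    cids = set()
    for txt in txt_responses:
        cids.update(int(line) for line in txt.splitlines() if line.strip())
    return cids

# Two made-up TXT bodies; note 12345 appears in both but is counted once:
print(sorted(collect_unique_cids(["578447\n12345\n", "12345\n67890\n"])))  # [12345, 67890, 578447]
```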

The final comparison DB vs PubChem cores is a simple affair based on Knime’s Joiner/Reference Row Splitter nodes, via the InChI keys as comparison strings (InChI is a good way to sift out duplicates, despite some caveats when using it).
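In plain Python, that Joiner/Reference Row Splitter combination amounts to splitting our rows on InChIKey membership. A sketch with shortened, made-up keys:

```python
def split_by_inchikey(ours: dict[str, str], pubchem_keys: set[str]):
    """Mimic Joiner / Reference Row Splitter keyed on InChIKey:
    first dict = cores also found in PubChem, second dict = cores not found there."""
    common = {k: v for k, v in ours.items() if k in pubchem_keys}
    missing = {k: v for k, v in ours.items() if k not in pubchem_keys}
    return common, missing

# Hypothetical InChIKey -> core-SMILES rows (keys shortened for readability):
ours = {"KEY-AAA": "c1ccccc1", "KEY-BBB": "C1CCNCC1"}
pubchem_keys = {"KEY-AAA"}
common, missing = split_by_inchikey(ours, pubchem_keys)
print(common)   # {'KEY-AAA': 'c1ccccc1'}
print(missing)  # {'KEY-BBB': 'C1CCNCC1'}
```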

There we have it – the top output port of the metanode gives us the common cores; the lower one gives the cores not found (in this case) in Pubchem.

4 Substructures of DrugDB compounds in Pubchem

An approach not dissimilar to sections 2 & 3 above is shown here in 4: retrieving all substructures of the compounds in our original list, independent of any target. Specifically: which similar compounds are out there that have not been reported (in Pubchem) as screened against our targets of interest, but might show activity anyway?

Part 4 overview

We need to start by removing explicit hydrogens from the retrieved structures. For efficiency, this should probably be done only once, early on, since the step is also used in section 3 (the common skeletons metanode contains it again). This is not uncommon in workflow development – you add a new portion and end up with duplicates of certain steps, which you may or may not bother to change later. For easier reading and understanding it is actually simpler to use the same node multiple times; remember – we are not building a super-efficient final workflow here at the moment.

Retrieve (relevant) substructures from Pubchem

Drilling down into the metanode Get Sub-Structures: we have to retrieve the substructures via an asynchronous (async) request – shown here in probably the least elegant and efficient way, but hey, it works. For substructure searches, Pubchem won’t give you back the whole hit list at once, only a reference to a list containing the substructures. This is what the first two boxes, PART1 and PART2, do.

The URL for a substructure search, btw, is:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/followed_by_your_smiles_string/XML

The XML then contains a reference to a listkey, a long string of numbers, which you submit again via:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/333121232131244/cids/XML

Now, each list contains a number of new CIDs; if we collect them all, we get more than 300,000 IDs, a bit too much to handle… thus a filtration was necessary, one that at this point was done manually, but it could certainly be done more elegantly. In this case a manual table with the subgraphs of interest is used (Table Creator node). Needless to say, if you want to and can mine through the remaining compounds, you will certainly have an interesting resource for further analysis (e.g. via other types of subgraphs, property filtration, etc.)
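The two-step listkey dance can be sketched as follows: one URL submits the search, the response references a listkey, and a second URL fetches the accumulated CIDs. The helper names are mine, and the “Waiting” XML below is a trimmed, hypothetical response; in a real run you would poll the second URL until the search has finished.

```python
import xml.etree.ElementTree as ET

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def substructure_search_url(smiles: str) -> str:
    # Step 1: submit the substructure query; the response references a listkey, not the hits
    return f"{BASE}/compound/substructure/smiles/{smiles}/XML"

def listkey_cids_url(listkey: str) -> str:
    # Step 2: retrieve the accumulated CIDs via the listkey
    return f"{BASE}/compound/listkey/{listkey}/cids/XML"

def extract_listkey(xml_text: str) -> str:
    """Find the ListKey element regardless of namespace."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        if el.tag.split("}")[-1] == "ListKey":
            return el.text
    raise ValueError("no listkey in response")

# Trimmed, hypothetical intermediate response:
sample = """<Waiting xmlns="http://pubchem.ncbi.nlm.nih.gov/pug_rest">
  <ListKey>333121232131244</ListKey>
</Waiting>"""

print(listkey_cids_url(extract_listkey(sample)))
```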

Finally, the structures themselves are retrieved for the ca. 1100 compounds (IDs) in our case, the same way as described above.

Back to the main portion: looking at the top row next to Get Sub-Structures, this branch is more of a confirmation step – of all the substructures searched, how many of them mention KIF11? This leads back to compounds we have seen before.

The lower two branches check the similarity of the new substructures versus ours, in terms of a high likelihood of showing activity – in this case, let’s not forget the overall goal, against Zika virus. This comparison is simply done by fingerprinting and comparing, here with two different methods, average or maximum similarity, with a Tanimoto cut-off of 0.7.
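For reference, the Tanimoto comparison boils down to a small amount of arithmetic once the fingerprints exist. A toy sketch in Python (the fingerprints here are made-up sets of on-bits; in the workflow they come from a Knime fingerprint node):

```python
def tanimoto(fp1: set[int], fp2: set[int]) -> float:
    """Tanimoto coefficient on fingerprints represented as sets of on-bits."""
    if not fp1 and not fp2:
        return 0.0
    return len(fp1 & fp2) / len(fp1 | fp2)

def best_similarity(query: set[int], references: list[set[int]], mode: str = "max") -> float:
    """Score a query against a reference set, either by maximum or by average similarity."""
    scores = [tanimoto(query, ref) for ref in references]
    return max(scores) if mode == "max" else sum(scores) / len(scores)

# Toy fingerprints:
refs = [{1, 2, 3, 4}, {2, 3, 5}]
q = {1, 2, 3, 5}
print(best_similarity(q, refs, "max"))          # 0.75
print(best_similarity(q, refs, "max") >= 0.7)   # True: passes the 0.7 cut-off
```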

And “hopefully” all the results (numbers/graphs) correspond to what was described earlier in this series of blog entries.

If you have any questions or anything is unclear, don’t hesitate to contact me! And/or download the workflow and play around with it yourself!

Have fun!

Download the Knime workflow (including all data)

PS: don’t hesitate to contact me if you run into troubles with the workflow.

PPS: The Excel file is now included within the workflow folder; you will have to adjust the path for it to be correct. Obviously other input methods are also possible: a manual table, a CSV reader, etc.

Addendum: Workflow now also available on Github: https://github.com/DocMinus/ZikaPubChem


Drug research at home – (how) is that possible? – Part 2

Continuing on after part 1:

What to do with the tools

I’m assuming that you have (some) knowledge of how to search, what to look out for, and a workflow for the different steps required to do the job. Otherwise that is a topic of its own for another time. Not that it hasn’t been described before – see just one example here:

Nicola, G., Berthold, M. R., Hedrick, M. P., & Gilson, M. K. (2015). Connecting proteins with drug-like compounds: Open source drug discovery workflows with BindingDB and KNIME. Database: The Journal of Biological Databases and Curation, 2015, bav087. https://doi.org/10.1093/database/bav087

Actual Compounds

So you identified something and want to test your hypothesis beyond in-silico. Well, that is a bit tougher – you can’t really handle and test compounds at home. Theoretically, though, you could have someone else do this part for you (order commercial compounds, synthesize something new, test in a biological assay). That is (unfortunately) not for free.

To obtain compounds, though, if you are in (or have connections to) academia or a (smaller) company, there are some interesting initiatives available, such as in malaria research at http://www.mmv.org/research-development/open-source-research/open-access-malaria-box, now extended more broadly to pathogens at http://www.pathogenbox.org/. Then there are the possibilities described in the next section.

Once you think you have something

Actual testing aside (it never hurts), what can you do with those cool results? Well, there are a number of things – the simplest one would be: write a blog! More involved and scientifically more appropriate – at the same time more difficult – write a publication in a scientific journal or present at a scientific meeting. You could even try and patent your findings, if you have the finances. It all depends on the impact you want to have.

To go beyond a publication, if you want to be part of, or follow up on, your findings, you can contact some of the initiatives by pharmaceutical companies who are open to collaboration on new findings. For example, Johnson & Johnson [jnjinnovation.com/partner-with-us], AstraZeneca [openinnovation.astrazeneca.com], or the Medicines for Malaria Venture [www.mmv.org/partnering/our-partner-network], and many more. You can also find incubators within academia, but then you would require some contact with a research group there. The list of incubators/companies & universities is nowadays quite long and could be a topic for a separate blog entry.

If you are really in it for the money, though, I think you will be disappointed. Doing drug research from home is more of a hobby, just for fun – in the best case also for the greater good. Having said that, should you really find something interesting and contact any of the above-mentioned initiatives, intellectual property and reimbursement will most likely be on the table at some point.

Now, start researching!

Drug research at home – (how) is that possible? – Part 1

In the current day and age of open access information, combined with cheap computing power, it is rather simple to do (some) drug research from the comfort of your home, be it as a private person for fun or out of interest, or as a small (start-up) company. Actually, big pharma companies use some of the same resources, combined with their own in-house data and programs – so why shouldn’t you?

Where is this data? What kind of data?

There are a number of public – so-called open access – databases available these days, curated over many years by high-profile institutes, e.g. the National Institutes of Health (NIH) for Pubchem. Many more institutions and specific initiatives have evolved over the years, some appearing literally right now, depending on the field and data. Databases on chemical compounds (small molecules) have been around the longest, afaik, with structures, properties, literature references and biological data associated.

Listing all of them would require an entire Wikipedia page (or more), and that work has already been done – you can find a substantial list here, for example: http://oad.simmons.edu/oadwiki/Data_repositories. In terms of life science, on this NIH site you can really knock yourself out: https://www.ncbi.nlm.nih.gov/guide/all/#databases_. The scientific literature regularly features articles on databases and software, as do many blogs, but that is outside this scope.

More focused for our purpose of drug research, you have sites such as PubChem, BindingDB, ZINC, or e.g. GuideToPharmacology. I’d say with these you can get pretty far. Curated from literature and also patents, these databases connect structures to biology, i.e. mechanism of action, structure of the target, how much is known about it (or not). All sites and databases are arranged differently: some you can search on the web, some via an API, some by browsing, or a combination thereof. Then there are also the semi-public databases, such as CDD Vault – you can register and search within the public databases (all via the web, independent of your machine power), though you cannot download or batch process on the free account. It might still be worth a look at times, considering you may find data which is not in the literature/patent-based curated databases.

What will you need?

A certain understanding of the drug discovery process, chemistry, and some degree of biology. If not yourself, then a good friend who has that knowledge and can support you (though this seems like an unlikely scenario?). Some IT skills certainly don’t hurt. Below I will focus on data mining as the core task of home research; methods such as docking or quantum mechanical calculations I will leave out for now.

Hardware
  • A(ny) computer – Windows, Linux, Mac – doesn’t matter.
    In my experience, though, when it comes to chemistry the Windows platform still offers a broader range of both commercial and freeware programs.
  • How powerful?
    Simply put, it also doesn’t matter. Sure, the more power, the smoother your experience, though for mining purposes I would go for more memory before processing power. An Intel i3 with (minimum) 16GB of RAM can get you pretty far for little money. Only with large data sets and more complicated calculations do I feel this becomes a bit of a bottleneck. If you have an i7 or Xeon available, good for you!
    What about graphic cards? That actually doesn’t matter for data-mining and simple visualizations. Once you want to do some visual 3d-docking though, that’s another story.
  • An alternative, or even complementary, solution is a (powerful) workstation placed “anywhere”, which could e.g. be shared with someone else to split the investment costs, and then accessed from any (simple) PC/laptop via remote access, e.g. TeamViewer. Cloud computing@home, so to say.
  • Reasonably fast internet connection  – for mining those web-services.
Software
  • Knime (available on all platforms), allowing for flexible, visual and fast development of search and analysis workflows. Combine it with some know-how of Java or XML and you have quite a powerful package. To start your journey, you can use some of the readily available (example) workflows before getting into the details.
  • A chemical drawing program – there are a rather large number out there; it is difficult to really make a good suggestion. Knime itself comes with a “myriad” of plugins for structural input and output, so you don’t really need a separate program. Myself, I have the free Marvin package by Chemaxon installed.
  • DataWarrior – a great package for visually guided “manual” mining, sort of “Spotfire light”, if you will.
  • Excel – or similar – can be used as a lightweight DataWarrior alternative, but is also useful for sharing or storage (as would be Word or Powerpoint, and alternatives).
  • Scripting languages – R or Python – are not necessary, though they can make a good complement, depending on your requirements.
  • Java – also not necessary, but since Knime is built on Java, it sometimes can help for certain work-arounds.
  • XML, HTML, REST – some basics might be helpful when accessing certain services via network API.

What if you don’t know Java and such? Don’t fret – initially, I didn’t either. If you are more of a “learning by doing” person, the knowledge will come automatically. Obviously, you can learn these in courses as well.

Continued in part 2.