In 2018, Russ Altman and I published the “Global Network of Biomedical Relationships (GNBR)”, a weighted, labeled network of all chemical-gene, gene-gene, gene-disease, and chemical-disease connections in Medline abstracts. The labels, or “themes”, come from a paper we published that same year, in which we used an algorithm I developed in my PhD dissertation (“Ensemble Biclustering for Classification”, or “EBC”) to identify clusters of descriptions of biomedical relationships that corresponded to distinct themes. The complete network is available on Zenodo, and descriptions of all of the themes are available both in the paper and in the README attached to the Zenodo dataset.
As of this writing, GNBR has been viewed over 13,000 times and downloaded over 10,500 times. People have done cool things with it, like build a searchable graph of all Medline sentences and use it to help identify drug repurposing opportunities for rare diseases. However, because we published GNBR under a very permissive license, I have no idea who the vast majority of these people are, and that makes me sad. (Please email me if you’re using GNBR! I would love to hear from you!)
Occasionally, however, someone contacts me asking for my code because they want to modify GNBR in some way and regenerate it, or simply better understand what we did. The truth is that while all the code is already online, GNBR evolved out of earlier work in a – let’s just say – “organic” fashion, and if you wanted to regenerate it from scratch, it would not be straightforward. So I decided to write a tutorial. I have included some comments on where I think the current GNBR falls short and how we might improve it in the future. If you’re interested in improving or adapting any part of GNBR, please reach out to me.
Step 1: PubTator
To learn the relationships that connect chemicals, genes, and diseases* in the biomedical literature, we must first be able to identify chemical, gene, and disease names in text. This is a process called named entity recognition (NER); for Medline, state-of-the-art NER is provided by an NCBI tool called PubTator. You’ll want to begin by downloading the latest set of PubTator annotations, with the text of the title and abstract included, from the PubTator FTP site. The file format looks like this:
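The PubTator format interleaves pipe-delimited title and abstract lines with tab-delimited annotation lines. Here is a minimal parsing sketch in Python; the record below is invented for illustration (the real files come from the FTP site), and `parse_pubtator` is just a stand-in name, not part of any PubTator tooling:

```python
# A minimal sketch of the PubTator annotation format: pipe-delimited title
# ("|t|") and abstract ("|a|") lines, then tab-delimited entity annotations.
# The sample record is invented for illustration.
SAMPLE = """\
12345678|t|Aspirin reduces inflammation.
12345678|a|We studied the effect of aspirin on COX-2 expression.
12345678\t0\t7\tAspirin\tChemical\tMESH:D001241
12345678\t36\t41\tCOX-2\tGene\t5743
"""

def parse_pubtator(text):
    """Group title/abstract lines and annotation lines by PMID."""
    docs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if "|t|" in line or "|a|" in line:
            pmid, field, content = line.split("|", 2)
            docs.setdefault(pmid, {"annotations": []})[field] = content
        else:
            pmid, start, end, mention, etype, concept_id = line.split("\t")
            docs[pmid]["annotations"].append(
                {"start": int(start), "end": int(end),
                 "mention": mention, "type": etype, "id": concept_id})
    return docs

docs = parse_pubtator(SAMPLE)
```

Each annotation gives you character offsets into the title + abstract, the entity type, and a normalized concept identifier, which is everything GNBR needs downstream.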
Comments: (1) PubTator is in the process of being retired in favor of PubTator Central, which contains annotations for both Medline abstracts and all of the full-text articles found in PubMed Central. Right now if you use the PubTator Central annotations file to create GNBR, you’ll miss all the annotations from the full text because the text itself (beyond the abstract) is not included in the file. (2) Unfortunately, basing the entire project on the PubTator annotations file inherently created a bottleneck for GNBR, especially if teams want to use annotation types that PubTator does not cover or want to use their own software for NER.
Step 2: Dependency Parsing
The next step is to look for all the sentences in PubTator that contain two (or more) named entities, parse them, and extract the dependency paths connecting the named entities. We did this using the Stanford Parser.
This is the step where most of you are going to give up, because parsing all of Medline is compute-intensive. Every six months or so, when I need to regenerate GNBR, I log on to the Stanford Sherlock cluster and re-parse the latest PubTator annotations file by dividing it into chunks and parsing the chunks in parallel. With 200 jobs running at a time, this monopolizes the entire rbaltman cluster partition for about four days. This usually results in a flurry of invective via text message from the current members of the Altman lab. In short: for most of you, it makes no sense to re-parse Medline yourselves. Just use the downloadable version of GNBR.
If you do want to re-parse the PubTator annotations file, however, all of my code for doing that is here. I recommend splitting the main annotations file into 500k or 1M line chunks first using this script and then running this script on each sub-file.
Comments: (1) Although I used (and continue to use) the Stanford Parser and a separate network library for dependency path extraction for GNBR, if I were redoing this project today, I might take a different approach. For example, the spaCy dependency parser is near-state-of-the-art while being considerably faster than other parsers. (2) The pubtator-nlp repository contains a lot of other utility code for counting entities and paths, etc. besides just the parsing code, in case folks find that useful. (3) The original Stanford Parser was written in Java, and my code for this step is also Java. I compile it using Maven. Most of the other code for GNBR is in Python.
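The core of this step, regardless of which parser you use, is finding the shortest path between two entity tokens in the dependency graph. Here is a rough sketch in plain Python (not the actual pubtator-nlp logic); the hand-written edge list stands in for real parser output, and the path notation is illustrative rather than GNBR's exact format:

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """BFS over an undirected view of a dependency graph.
    edges: {(head, dependent): label}, as a parser would produce."""
    graph = {}
    for (head, dep), label in edges.items():
        graph.setdefault(head, []).append((dep, label, ">"))   # head -> dependent
        graph.setdefault(dep, []).append((head, label, "<"))   # edge walked backwards
    queue = deque([(source, [source])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for nbr, label, direction in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, path + [f"{direction}{label}{direction}", nbr]))
    return None

# Hand-written parse of "aspirin inhibits COX-2 expression"
edges = {("inhibits", "aspirin"): "nsubj",
         ("inhibits", "expression"): "dobj",
         ("expression", "COX-2"): "compound"}
path = shortest_dependency_path(edges, "aspirin", "COX-2")
```

With a real parser you would build `edges` from its output (e.g. each token's head and dependency label) rather than by hand.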
Step 3: Creating Matrices for Each Relation Type
The next step is to turn the parsed dependency paths into relation type-specific matrices (e.g. one for chemical-disease relations, another for chemical-gene relations, etc.). The script that does this is here. The actual command I used to generate the chemical-gene matrix for the GNBR paper is here:
java -mx15g -cp /home/groups/rbaltman/blpercha/pubtator-nlp-1.0-SNAPSHOT-jar-with-dependencies.jar pubtator.matrix.MakeDependencyPathMatrices /home/groups/rbaltman/blpercha/pubtator-dependencies-final-20160430/ Chemical Gene /home/groups/rbaltman/blpercha/experiments-pubtator-v-20160430/Chemical-Gene-2x5.mat 2 5 100000 100000 2000
The jar I’m using here is the Maven-compiled pubtator-nlp jar from Step 2. A couple of things to notice: First, the script takes in a folder of files containing dependencies, not a single file. That’s because it assumes you’re going to split up the raw PubTator file and parse the chunks separately. Second, you can set the minimum and maximum number of occurrences of each (a) entity pair, e.g. chemical-gene pair, and (b) dependency path that you want to include in the matrix. In our paper, we set the minimum number of dependency path occurrences at a level that gave us around 700 dependency paths in each matrix, and we then randomly downsampled the entity pairs so that we were left with 2000. How did we downsample? The last argument to the script controls the downsampling (if it’s “-1”, no downsampling is performed).
The last thing to note is that the script is doing a lot behind the scenes. For example, it’s removing all paths of length two or less, all paths where the start and end entities are the same, and all paths containing the dependency type conj. At some point, I apparently also hard-coded a requirement that the dependency filenames include the string “pubtator”. Moral of the story: please look at the code before you use it.
The final matrix file you create should look something like this:
Comments: The only two columns you really need from the matrix file are column 3 (dependency path) and column 5 (entity pair). Column 4 is the count of the number of unique sentences in which the dependency path connected the entity pair; we didn’t use this information and treated each dependency path/entity pair co-occurrence as a binary yes/no.
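The thresholding and downsampling that MakeDependencyPathMatrices performs can be sketched in a few lines of Python. This is a simplified stand-in (the function and variable names are mine, and it omits the extra filters mentioned above, like dropping short paths and conj paths):

```python
import random
from collections import Counter

def build_matrix(observations, min_path=2, max_path=100000,
                 min_pair=2, max_pair=100000, n_pairs=None, seed=0):
    """observations: list of (dependency_path, entity_pair) tuples, one per
    sentence. Returns the kept (path, pair) co-occurrences as a binary set,
    mirroring how GNBR treats each co-occurrence as a yes/no."""
    obs = list(observations)
    path_counts = Counter(p for p, _ in obs)
    pair_counts = Counter(e for _, e in obs)
    paths = {p for p, c in path_counts.items() if min_path <= c <= max_path}
    pairs = {e for e, c in pair_counts.items() if min_pair <= c <= max_pair}
    if n_pairs is not None and len(pairs) > n_pairs:
        # random downsampling of entity pairs (the "2000" argument above)
        pairs = set(random.Random(seed).sample(sorted(pairs), n_pairs))
    return {(p, e) for p, e in obs if p in paths and e in pairs}

observations = [
    ("inhibits_path", ("aspirin", "COX2")),
    ("inhibits_path", ("tamoxifen", "ESR1")),
    ("binds_path", ("aspirin", "COX2")),
    ("binds_path", ("aspirin", "COX2")),
    ("rare_path", ("drugX", "geneY")),
]
kept = build_matrix(observations, min_path=2, min_pair=2)
```

In this toy example, the rare path and the two once-seen entity pairs are filtered out, leaving only the co-occurrences that meet both thresholds.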
Step 4: EBC and Hierarchical Clustering
Now that you have your matrices, it’s time to cluster them. The inner workings of the EBC algorithm are described in this paper, and the biclustering algorithm on which EBC depends (information-theoretic co-clustering, or ITCC) is described in detail here. EBC applies multiple rounds of ITCC to matrices like those generated in Step 3, followed by hierarchical clustering, to uncover groups of dependency paths that share a similar meaning. A Stanford colleague, Yuhao Zhang, and I implemented EBC in Python here.
Here’s the process that will get you from a matrix to one of the Stargate-like circular dendrograms below. My original (c. 2013) code for this was in Java, so I’m just going to sketch out the process you would use if you wanted to do this using the new Python version of EBC:
- Optimize numbers of row and column clusters (K and L). We developed a heuristic for finding the optimal numbers of row and column clusters to use when running ITCC. To implement that heuristic, read in the matrix data using the SparseMatrix class found here, and then use the shuffle() method to generate randomized copies of the matrix. Run EBC on the original matrix and on each randomized copy using:

    ebc = EBC(sparse_matrix, n_clusters=[K, L])
    cXY, objective, iter = ebc.run()

  …and then compare the objective between the real and randomized matrices. You’re looking for values of K and L where the absolute difference between the two objectives is greatest.
- Run at optimal K and L many times (around N=1000) and record cluster assignments. Your goal here is to keep track of which dependency paths (or entity pairs, depending on whether you’re clustering rows or columns) clustered together on each run. You can do that using this script.
- Convert cluster assignments to co-occurrence matrix. Take the output from (2) and convert it into a co-occurrence matrix. For example, if you start with the cluster assignments for 700 dependency paths over 1000 runs, you’ll end up with a 700 x 700 symmetric matrix where each element is a count between 0 and 1000.
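The conversion from repeated cluster assignments to a co-occurrence matrix is simple enough to sketch directly. A pure-Python stand-in (not the actual script):

```python
def cooccurrence_matrix(runs):
    """runs: list of cluster-assignment lists, one per EBC run;
    runs[r][i] is item i's cluster label on run r. Returns a symmetric
    matrix where M[i][j] counts the runs on which i and j co-clustered."""
    n = len(runs[0])
    M = [[0] * n for _ in range(n)]
    for assignment in runs:
        for i in range(n):
            for j in range(n):
                if assignment[i] == assignment[j]:
                    M[i][j] += 1
    return M

# Three toy runs over three items: items 0 and 1 co-cluster on 2 of 3 runs
runs = [[0, 0, 1], [0, 1, 1], [0, 0, 1]]
M = cooccurrence_matrix(runs)
```

With 700 dependency paths and 1000 runs, `M` is the 700 × 700 matrix of counts between 0 and 1000 described above.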
- Run hierarchical clustering and create dendrogram. You have some options here, because there are multiple ways to do hierarchical clustering. My specific choices are explained deep within the Methods section of the original EBC paper (it’s the section called “Building a dendrogram of drug-gene pairs based on EBC’s similarity assessments”). Recently I actually unearthed the original R script I used to make Figures 4 and 5 for this paper. It is an unholy mess, but at least you can see how I created the circular dendrograms.
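If you would rather stay in Python than dig through that R script, scipy can do the hierarchical clustering step. The linkage method and cut height below are placeholders, not the specific choices from the EBC paper’s Methods; the co-occurrence matrix is a toy:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

N_RUNS = 1000
# Toy 4x4 co-occurrence matrix: items 0-1 and 2-3 almost always co-cluster.
cooc = np.array([[1000,  900,   50,   40],
                 [ 900, 1000,   60,   30],
                 [  50,   60, 1000,  950],
                 [  40,   30,  950, 1000]])

dist = 1.0 - cooc / N_RUNS          # co-clustering frequency -> distance
np.fill_diagonal(dist, 0.0)         # squareform requires a zero diagonal
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
```

`Z` is the full dendrogram (which `scipy.cluster.hierarchy.dendrogram` can draw), and `fcluster` slices it at a chosen height, which is exactly the slicing operation Step 5 starts from.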
Step 5: Manual Review to Assign Thematic Labels
There’s no good way to say this. Once you have your dendrogram(s) and have sliced them at your chosen height, if you want to assign thematic labels to the clusters that result, you have to actually look at them. This table and this table were both created by me, not by any algorithm. I randomly selected dependency paths from each cluster, found examples of corresponding sentences in Medline, and labeled the clusters accordingly.
Comments: (1) There is an important distinction between the GNBR paper and its predecessor paper. In the first paper, we clustered drug-gene pairs, so the thematic labels were assigned to groups of drug-gene pairs. In the GNBR paper, we clustered dependency paths, so the thematic labels apply to the dependency paths directly. (2) In the discussion section of the first paper, I actually make an argument (that I still believe) against thematic labeling of individual dependency paths/sentences. True themes apply to entity pairs, and sentences can only provide evidence for or against different themes. However, for a variety of reasons, it’s often useful to label individual sentences, and this was the format most requested by collaborators. That’s why we switched to labeling dependency paths in GNBR: to enable people to easily return to the individual sentences providing evidence for each theme. (3) Of course, this complicated our evaluation strategy since we could only evaluate at the entity pair level using biomedical databases. You can get a taste of this complexity by reading the caption to Figure 2 in the GNBR paper.
Step 6: Flagship Paths and Supports
At this point, you’ve done all the hard stuff. You’ve got your labeled clusters (we refer to the dependency paths in these clusters as the “flagship paths”) and you’ve assigned thematic labels to individual clusters or groups of clusters. Now all that remains is to assign thematic labels to all of the dependency paths that were not part of your original clustering exercise (the non-flagship paths).
This is a simple matter of calculating co-occurrences of the other dependency paths with the flagship paths to establish “supports” for each theme. I do this using this script, which takes four arguments:
assign-remaining-paths.py
    arg1: config/chemical-gene-remote.txt <- theme letters followed by cluster numbers
    arg2: results-frames/chemical-gene-flagship-paths.txt <- flagship paths w/cluster numbers
    arg3: matrices/Chemical-Gene-1x1.mat <- a matrix containing all of the dependency paths you want to assign themes to
    arg4: results-frames/chemical-gene-path-theme-distributions.txt <- output (theme distributions)
The file for the first argument should look like this. The file for the second argument should look like this. The file for the third argument is a matrix file exactly like the one you generated in Step 3, but including all the paths (minimum number of occurrences for both entity pairs and dependency paths set to 1; no downsampling). Here’s an example of how you would create that matrix using the same script I used in Step 3.
java -mx15g -cp /home/groups/rbaltman/blpercha/pubtator-nlp-1.0-SNAPSHOT-jar-with-dependencies.jar pubtator.matrix.MakeDependencyPathMatrices /home/groups/rbaltman/blpercha/pubtator-dependencies-final-20190915/ Chemical Gene /home/groups/rbaltman/blpercha/experiments-pubtator-v-20190915/Chemical-Gene-1x1.mat 1 1 100000 100000 -1
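The support computation itself is just counting: for each non-flagship path, tally how often it shares an entity pair with flagship paths from each theme. A sketch (my own names, not the actual assign-remaining-paths.py logic; the paths, pairs, and theme letters are invented examples):

```python
from collections import Counter, defaultdict

def theme_supports(path_to_pairs, flagship_themes):
    """path_to_pairs: {path: set of entity pairs it connects}
    flagship_themes: {flagship_path: theme label}
    Returns {non-flagship path: Counter of theme -> co-occurrence count}."""
    # Index each entity pair by the themes of the flagship paths seen with it.
    pair_themes = defaultdict(list)
    for fpath, theme in flagship_themes.items():
        for pair in path_to_pairs.get(fpath, ()):
            pair_themes[pair].append(theme)
    supports = {}
    for path, pairs in path_to_pairs.items():
        if path in flagship_themes:
            continue
        counts = Counter()
        for pair in pairs:
            counts.update(pair_themes[pair])
        supports[path] = counts
    return supports

path_to_pairs = {"flag1": {("aspirin", "COX2")},
                 "flag2": {("tamoxifen", "ESR1")},
                 "new_path": {("aspirin", "COX2"), ("tamoxifen", "ESR1")}}
supports = theme_supports(path_to_pairs, {"flag1": "E-", "flag2": "A+"})
```

Normalizing each path’s counter gives a theme distribution for that path, which is what the output file (arg4 above) contains.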
Step 7: Creating GNBR Files
Now it’s just a matter of formatting. I follow the steps outlined here to:
- Consolidate all of the dependency paths from Step 2 into one giant file per relation type (chemical-gene, chemical-disease, etc.).
- Sort those files alphabetically by entity names so all the paths for a given entity pair are grouped together.
- Create subset files that only contain the paths to which we could assign themes (for orphan paths – paths that only occur once and only with one entity pair – we have no hope).
- Gzip the files for more efficient upload/download from Zenodo.
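The sort-and-gzip steps are straightforward; here is a minimal Python sketch (the function name and column positions are illustrative, and a real run would sort the giant files externally rather than in memory):

```python
import gzip
import tempfile

def sort_and_compress(lines, out_path, key_fields=(0, 1)):
    """Sort tab-separated path records so all rows for a given entity pair
    are adjacent, then write them gzipped (as for the Zenodo upload)."""
    rows = [line.rstrip("\n").split("\t") for line in lines]
    rows.sort(key=lambda r: tuple(r[i] for i in key_fields))
    with gzip.open(out_path, "wt") as f:
        for r in rows:
            f.write("\t".join(r) + "\n")

# Toy records: entity1, entity2, dependency path
lines = ["tamoxifen\tESR1\tpathB\n",
         "aspirin\tCOX2\tpathA\n",
         "aspirin\tCOX2\tpathC\n"]
out = tempfile.NamedTemporaryFile(suffix=".gz", delete=False).name
sort_and_compress(lines, out)
with gzip.open(out, "rt") as f:
    sorted_lines = f.read().splitlines()
```

After sorting, the two aspirin/COX2 rows are adjacent, which is the grouping the GNBR part-ii files rely on.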
Regenerating vs. Recreating
Whenever I want to regenerate GNBR, I follow this README that I wrote for myself a couple of years ago. It’s very similar in structure to this tutorial. However, Steps 3-5 are not included. That’s because I continue to use the same flagship paths from version 1 of GNBR, so there is no need to redo the clustering each time. The flagship paths from version 1 and the actual configuration files and other scripts I use are all here.
Some Final Thoughts
The creation of GNBR was a multi-year effort that incorporated code I had written at various times and for various purposes throughout my PhD. When Russ and I published it, we had no idea if it would be useful to anyone, so we published what was essentially a rough version and waited to see if anyone would use it. Now that they have, there are several things that I think could be improved: basic negation detection, removal of the dependency on PubTator (i.e. modularizing the NER portion so that different NER engines can be swapped in and out), and a re-evaluation of how we do the dependency parsing, including whether dependency paths even make sense anymore as the best features for assigning themes.
In short, there is still a lot to do. If you have ideas about how GNBR could be improved (or thrown away altogether and replaced by something better), by all means reach out. Thanks for reading.
* These terms are overloaded. A “chemical” can refer to a drug or any other small molecule. A “gene” refers to a gene either in the traditional sense (DNA), or the mRNA or protein product of that gene. A “disease” refers to a true disease (e.g. diabetes, lupus, etc.) or any other related phenotype, such as a side effect or symptom. We use the terms “chemical”, “gene”, and “disease” because these are the terms PubTator uses.