How to Identify and Visualize Clusters in Knowledge Graphs

In this post, we will identify and visualize different clusters of cancer types by analyzing the disease ontology as a knowledge graph. Specifically, we will set up neo4j in a docker container, import the ontology, generate graph clusters and embeddings, before using dimensionality reduction to plot these clusters and derive some insights. While we will use `disease_ontology` as an example, the same steps can be used to explore any ontology or graph database.

Cancer types viewed as embeddings and colored by cluster, image by author

Setting up the ontology

In a graph database, data is not stored as rows (as in a spreadsheet or relational database) but as nodes and relationships between nodes. For example, in the image below, we can see that melanoma and carcinoma are subcategories of cancer (represented by the SCO relationship). With this type of data, we can clearly see that melanoma and carcinoma are related, even though this is not explicitly stated in the data.

Example of a graph database, image by author
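To make this concrete, here is a minimal Python sketch (the triples are illustrative, not pulled from the ontology itself) that stores the same facts as (subject, relationship, object) triples and recovers that melanoma and carcinoma are related through their shared parent:

```python
# Store the graph as (subject, relationship, object) triples.
triples = [
    ("melanoma", "SCO", "cancer"),
    ("carcinoma", "SCO", "cancer"),
]

def parents(node, rel="SCO"):
    """Return every node reachable from `node` via one `rel` edge."""
    return {o for s, r, o in triples if s == node and r == rel}

# Melanoma and carcinoma share the parent "cancer", so they are
# related even though no triple links them to each other directly.
shared = parents("melanoma") & parents("carcinoma")
print(shared)  # {'cancer'}
```

A graph database like neo4j does essentially this, but at scale and with indexing, so relationship traversals stay fast.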

Ontologies are a formalized set of concepts and relationships between those concepts. They are much easier for computers to parse than free text and therefore easier to extract meaning from. Ontologies are widely used in the biological sciences and you can find an ontology that interests you at https://obofoundry.org/. Here we focus on the disease ontology which shows how different types of diseases relate to each other.

Neo4j is a tool for managing, querying and analyzing graph databases. To make it easier to set up, we use a docker container.

docker run \
-it --rm \
--publish=7474:7474 --publish=7687:7687 \
--env NEO4J_AUTH=neo4j/123456789 \
--env NEO4J_PLUGINS='["graph-data-science","apoc","n10s"]' \
neo4j:5.17.0

In the above command, the `--publish` flags expose ports so that Python can query the database directly and we can access it via a browser. The `NEO4J_PLUGINS` argument specifies which plugins to install. Unfortunately, the Windows docker image doesn’t seem to be able to handle the plugin installation, so to follow along you’ll need to install neo4j Desktop manually. But don’t worry, the other steps should all still work for you.

While neo4j is running, you can access your database by going to http://localhost:7474/ in your browser, or you can use the python driver to connect as shown below. Note that we are using the port we published with our docker command above, and we are authenticating with the username and password we defined above as well.

from neo4j import GraphDatabase

URI = "bolt://localhost:7687"
AUTH = ("neo4j", "123456789")
driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()

Once you have your neo4j database set up, it’s time to gather some data. The neo4j n10s plugin is built to import and process ontologies; you can use it to embed your data into an existing ontology or to explore the ontology itself. With the cypher commands below, we’ll first set up some configurations to make the results cleaner, then set a uniqueness constraint, and finally, we’ll actually import the disease ontology.

CALL n10s.graphconfig.init({ handleVocabUris: "IGNORE" });
CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE;
CALL n10s.onto.import.fetch("http://purl.obolibrary.org/obo/doid.owl", "RDF/XML");

To see how this can be done with the Python driver, see the full code here: https://github.com/DAWells/do_onto/blob/main/import_ontology.py

Now that we have imported the ontology, you can explore it by opening http://localhost:7474/ in your web browser. This lets you manually explore a small part of your ontology, but we are interested in the bigger picture, so let’s do some analysis. Specifically, we are going to perform Louvain clustering and generate Fast Random Projection embeddings.

Clusters and embeddings

Louvain clustering is a clustering algorithm for networks like this. In short, it identifies sets of nodes that are more connected to each other than to the broader set of nodes; this set is then defined as a cluster. When applied to an ontology, it is a fast way to identify a set of related concepts. Fast random projection, on the other hand, produces an embedding for each node, i.e., a numerical vector where more similar nodes have more similar vectors. With these tools, we can identify which diseases are similar and quantify that similarity.
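The core idea of Fast Random Projection can be sketched in a few lines of NumPy: project each node's adjacency row through a shared random matrix, so nodes with similar neighbourhoods end up with similar vectors. This toy version (a single propagation step on a made-up five-node graph, with no normalization) only illustrates the idea; it is not how the GDS plugin implements the algorithm:

```python
import numpy as np

rng = np.random.default_rng(42)

# Adjacency matrix of a toy 5-node graph: nodes 0 and 1 have exactly
# the same neighbours (2 and 3); node 4 connects only to node 3.
A = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Shared random projection matrix: 5 nodes -> 3 embedding dimensions.
R = rng.standard_normal((5, 3))

# One propagation step: each node's embedding is the projection of
# its neighbourhood.
emb = A @ R

# Nodes 0 and 1 have identical neighbourhoods, so identical embeddings.
print(np.allclose(emb[0], emb[1]))  # True
```

The real algorithm iterates this propagation over several hops and normalizes the intermediate results, but the key property is the same: structurally similar nodes get similar vectors.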

To generate embeddings and clusters, we need to “project” the parts of our graph that we are interested in. Since ontologies are typically very large, this subset is an easy way to speed up computations and avoid out-of-memory errors. In this example, we are only interested in cancers and not in other types of diseases. We do this with the cypher query below; we match the node labeled “cancer” and any nodes related to it by one or more SCO or SCO_RESTRICTION relationships. Since we want to include the relationships between cancer types, we have a second MATCH query that returns the connected cancer nodes and their relationships.

MATCH (cancer:Class {label:"cancer"})<-[:SCO|SCO_RESTRICTION*1..]-(n:Class)
WITH n
MATCH (n)-[:SCO|SCO_RESTRICTION]->(m:Class)
WITH gds.graph.project(
"proj", n, m, {}, {undirectedRelationshipTypes: ['*']}
) AS g
RETURN g.graphName AS graph, g.nodeCount AS nodes, g.relationshipCount AS rels

Once we have the projection (which we called “proj”) we can calculate the clusters and embeddings and write them back to the original graph. Finally, by querying the graph we can get the new embeddings and clusters for each cancer type which we can export to a csv file.

CALL gds.fastRP.write(
'proj',
{embeddingDimension: 128, randomSeed: 42, writeProperty: 'embedding'}
) YIELD nodePropertiesWritten

CALL gds.louvain.write(
"proj",
{writeProperty: "louvain"}
) YIELD communityCount

MATCH (cancer:Class {label:"cancer"})<-[:SCO|SCO_RESTRICTION*0..]-(n)
RETURN DISTINCT
n.label AS label,
n.embedding AS embedding,
n.louvain AS louvain

Results

Let’s look at some of these clusters to see which cancers are related. After loading the exported data into a pandas dataframe in python, we can inspect individual clusters.

Cluster 2168 is a group of pancreatic cancers.

nodes[nodes.louvain == 2168]["label"].values
#array(['"islet cell tumor"',
# '"non-functioning pancreatic endocrine tumor"',
# '"pancreatic ACTH hormone producing tumor"',
# '"pancreatic somatostatinoma"',
# '"pancreatic vasoactive intestinal peptide producing tumor"',
# '"pancreatic gastrinoma"', '"pancreatic delta cell neoplasm"',
# '"pancreatic endocrine carcinoma"',
# '"pancreatic non-functioning delta cell tumor"'], dtype=object)

Cluster 174 is a larger group of cancers, but they are mainly carcinomas.

nodes[nodes.louvain == 174]["label"].values
#array(['"head and neck cancer"', '"glottis carcinoma"',
# '"head and neck carcinoma"', '"squamous cell carcinoma"',
#...
# '"pancreatic squamous cell carcinoma"',
# '"pancreatic adenosquamous carcinoma"',
#...
# '"mixed epithelial/mesenchymal metaplastic breast carcinoma"',
# '"breast mucoepidermoid carcinoma"'], dtype=object)

These are sensible groupings, based on organ or cancer type, and will be useful for visualization. However, the embeddings are still too high-dimensional to visualize meaningfully. Fortunately, t-SNE is a very useful method for dimensionality reduction. Here, we use t-SNE to reduce the embedding from 128 dimensions to 2, while keeping closely related nodes close together. We can verify that this has worked by plotting these two dimensions as a scatter plot and coloring the points by their Louvain clusters. If the two methods agree, we should see nodes grouping by color.

import ast

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import Normalize
from sklearn.manifold import TSNE

nodes = pd.read_csv("export.csv")
nodes["louvain"] = pd.Categorical(nodes.louvain)

# The embedding column is stored as a string, so parse it back into lists
embedding = nodes.embedding.apply(ast.literal_eval)
embedding = pd.DataFrame(embedding.tolist())

tsne = TSNE()
X = tsne.fit_transform(embedding)

fig, axes = plt.subplots()
axes.scatter(
X[:, 0],
X[:, 1],
c=cm.tab20(Normalize()(nodes["louvain"].cat.codes))
)
plt.show()

t-SNE projection of cancer embeddings colored by cluster, image by author

This is exactly what we see: similar cancers are grouped together and appear as clusters of a single color. Note that some nodes of the same color are very far apart; this is because we have to reuse colors, as there are 29 clusters but only 20 colors. This gives us a good overview of the structure of our knowledge graph, but we can also add our own data.
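Since the tab20 palette only has 20 entries, one way to handle the extra clusters (a small sketch, not from the original post) is to cycle the categorical cluster codes through the palette with a modulo, accepting that some distant clusters will share a color:

```python
import numpy as np
from matplotlib import cm

# 29 cluster codes but only 20 colors in tab20: cycle through the
# palette, so clusters 0 and 20 end up sharing a color.
codes = np.arange(29)
colors = cm.tab20(codes % 20)  # one RGBA row per cluster code

print(np.allclose(colors[0], colors[20]))  # True
```

A qualitative palette like tab20 is preferable here to a continuous colormap, because cluster IDs are categories with no meaningful order.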

Below we plot cancer type frequency as node size and mortality rate as opacity (Bray et al., 2024). I only had access to this data for a few of the cancer types, so only those nodes are plotted. We can see that liver cancer does not have a particularly high incidence overall. However, its incidence is much higher than that of the other cancers in the same cluster (shown in purple), such as oropharyngeal, laryngeal, and nasopharyngeal cancers.

Frequency and mortality of cancers colored by cluster, image by author

Conclusions

Here we have used the disease ontology to group different cancers into clusters, which gives us the context to compare these diseases. Hopefully this little project has shown you how to visually explore an ontology and add that information to your own data.

The full code for this project can be viewed at https://github.com/DAWells/do_onto.

References

Bray, F., Laversanne, M., Sung, H., Ferlay, J., Siegel, R. L., Soerjomataram, I., & Jemal, A. (2024). World cancer statistics 2022: GLOBOCAN estimates of global incidence and mortality for 36 cancers in 185 countries. CA: a cancer journal for clinicians, 74(3), 229–263.


Exploring cancer types with neo4j was originally published in Towards Data Science on Medium.