We re-architected and rewrote the original Databricks solution accelerator, deployed it in our Databricks environment using low-cost alternatives, and it works! Our solution uses the rdflib, PySpark, networkx, and plotly libraries. We built a production-level knowledge graph connecting 6,000 MeSH biomedical concepts to 5,000 clinical trials without relying on any paid libraries.
Introduction
Knowledge Graphs (KGs) have become an essential component in the modern world of biomedical data. KGs let us ask complex questions by connecting pieces of data that were never intended to interact with each other. This is exactly what the original solution accelerator provided by Databricks achieved: it connected NIH MeSH data on biomedical concepts with ClinicalTrials.gov data on clinical trials. That connection allowed us to answer questions like: what treatments are being researched for brain diseases? And has anyone ever conducted a clinical trial on Galantamine in Lewy Body Dementia? The accelerator was originally written in Scala using the Graphster library. However, Graphster is a closed-source library that is not available on Maven Central and is not installable on modern Databricks. For our solution, we completely rewrote it in Python, replacing every Graphster component with an open-source alternative.
The Problem with Graphster
When we attempted to run the original notebooks on Databricks, the first roadblock was that Graphster is not available on Maven Central or in Spark Packages, and the WiseCube JAR is not publicly hosted anywhere Databricks can reach. The library is simply not installable on any modern Databricks runtime.
In addition to the issue with Graphster itself, the original notebooks had two other dependencies that needed to be replaced:
- AACT bulk downloads: The ClinicalTrials.gov bulk database available at aact.ctti-clinicaltrials.org requires registration and occasionally returns 500 errors. The data is distributed as flat CSV files, which are inconvenient to work with.
- Bellman SPARQL engine: The Graphster library includes a custom SPARQL query engine called Bellman. Without Graphster installed, there is no way to run SPARQL queries against Spark DataFrames.
The answer was to replace everything: not just Graphster itself, but the data sources and the query engine as well. The replacements had to be low- or no-cost, convenient, and always available.
The Python Stack
Here is the complete replacement map — every Graphster component and what we used instead:
| Graphster (Original) | Python Replacement | Purpose |
|---|---|---|
| URIGraphConf | rdflib.URIRef | URI node construction |
| LangLiteralGraphConf | rdflib.Literal(lang=) | Language-tagged string literals |
| DataLiteralGraphConf | rdflib.Literal(datatype=XSD.*) | Typed data literals (dates, IDs) |
| TripleMarker + TripleExtractor | extract_triples() (PySpark) | Extract (s,p,o) rows from DataFrames |
| GraphLinker | graph_linker() (PySpark join) | Link conditions/interventions to MeSH |
| MeSH.download() | NIH MeSH SPARQL endpoint | Fetch MeSH concept labels — No additional cost |
| ClinicalTrials.download() | ClinicalTrials.gov API v2 | Fetch trial data — no credentials needed |
| Bellman SPARQL engine | PySpark SQL | Query the knowledge graph at scale |
Pipeline Architecture
The pipeline is structured across four notebooks, each handling a distinct stage of Knowledge Graph construction.
Notebook 01 — RDF Configuration
All RDF graph configuration is defined using rdflib primitives. URIRef constructs URI nodes for trials, conditions, and interventions. Literal(lang='en') creates language-tagged MeSH labels. Literal(datatype=XSD.date) handles trial submission dates. Eight namespaces are registered: schema.org, rdfs, rdf, MeSH, MeSH vocab, NCT trials, and two custom fallback namespaces for conditions and interventions that could not be matched to MeSH.
Notebook 02 — Data Download and Staging
MeSH data is fetched from the NIH SPARQL endpoint in paginated batches of 50,000 triples using the requests library. The query retrieves rdfs:label triples for all MeSH concepts, giving us the vocabulary needed for entity matching. Results are written to the Delta table mesh_nct.mesh.
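A minimal sketch of that paginated fetch, assuming the public NIH endpoint at id.nlm.nih.gov/mesh/sparql and its query/format request parameters (the function and variable names are illustrative):

```python
import requests

SPARQL_ENDPOINT = "https://id.nlm.nih.gov/mesh/sparql"
PAGE_SIZE = 50_000

def label_query(offset: int, limit: int = PAGE_SIZE) -> str:
    """Build the SPARQL query for one page of rdfs:label triples."""
    return f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?concept ?label
        WHERE {{ ?concept rdfs:label ?label }}
        LIMIT {limit} OFFSET {offset}
    """

def fetch_page(offset: int):
    """Fetch one page of label bindings as JSON (network call)."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": label_query(offset), "format": "JSON"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]
```

Each page is then converted to a PySpark DataFrame and appended to the Delta table.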
Clinical trials data is fetched from the ClinicalTrials.gov REST API v2 across three endpoints: studies, conditions per study, and interventions per study. Results are written to mesh_nct.studies, mesh_nct.conditions, and mesh_nct.interventions. The final staged data contains:
| Table | Rows |
|---|---|
| mesh_nct.mesh | 6,000 |
| mesh_nct.studies | 5,000 |
| mesh_nct.conditions | 9,112 |
| mesh_nct.interventions | 8,432 |
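The trials download follows the API v2 paging convention (pageSize plus a nextPageToken cursor). A hedged sketch, with illustrative names and limits:

```python
import requests

API = "https://clinicaltrials.gov/api/v2/studies"

def page_params(page_size, token=None):
    """Request parameters for one page of the v2 /studies endpoint."""
    params = {"pageSize": page_size}
    if token:
        params["pageToken"] = token
    return params

def fetch_studies(max_studies=5_000, page_size=1_000):
    """Page through /studies until max_studies are collected (network call)."""
    studies, token = [], None
    while len(studies) < max_studies:
        resp = requests.get(API, params=page_params(page_size, token), timeout=60)
        resp.raise_for_status()
        payload = resp.json()
        studies.extend(payload.get("studies", []))
        token = payload.get("nextPageToken")
        if not token:
            break
    return studies[:max_studies]
```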
Notebook 03 — Knowledge Graph Construction
This is where data fusion occurs. A custom graph_linker() function joins the condition names with the MeSH rdfs:label values using a case-insensitive LOWER() match. When a match is found it assigns the MeSH URI; otherwise it assigns a custom fallback URI under wisecube.com.
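The matching rule is easiest to see in plain Python. This is a simplified in-memory stand-in for the PySpark join (the fallback URI prefix and MeSH identifier shown are illustrative):

```python
# Simplified stand-in for graph_linker(): case-insensitive label matching
# with a custom fallback URI for unmatched terms. The real function does
# this with a PySpark join on LOWER(condition) = LOWER(mesh_label).
FALLBACK = "https://wisecube.com/condition#"

def link_terms(terms, mesh_labels):
    """mesh_labels maps label -> MeSH URI. Returns {term -> URI}."""
    by_lower = {label.lower(): uri for label, uri in mesh_labels.items()}
    return {
        term: by_lower.get(term.lower(), FALLBACK + term.replace(" ", "_"))
        for term in terms
    }
```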
A custom extract_triples() function accepts a PySpark DataFrame and a list of triple specifications. Each specification names a subject column, a predicate URI, and an object column; the function returns a DataFrame of the extracted (s, p, o) triples. This replaces Graphster's TripleMarker and TripleExtractor.
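A minimal sketch of the same idea over plain Python rows (the real extract_triples() operates on a PySpark DataFrame, so the row and column names here are illustrative):

```python
def extract_triples(rows, specs):
    """rows: list of dicts; specs: (subject_col, predicate_uri, object_col).

    Emits one (s, p, o) tuple per spec per row, skipping missing values.
    """
    triples = []
    for row in rows:
        for subj_col, pred, obj_col in specs:
            s, o = row.get(subj_col), row.get(obj_col)
            if s is not None and o is not None:
                triples.append((s, pred, o))
    return triples
```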
The final fused graph is saved to a Delta table called mesh_nct.mesh_nct, which includes all the MeSH triples and all the NCT triples.
Notebook 04 — Querying
PySpark SQL is used to execute the queries against the Delta tables. This replaces the need for a SPARQL engine.
The brain diseases query joins the studies, conditions, and interventions to retrieve the number of trials for each condition/intervention pair.
A filter query retrieves all trials where the intervention name contains the string 'galantamine'. This surfaces trial NCT00230997, 'Safety and Efficacy of Galantamine in Patients with Dementia with Lewy Bodies', submitted in 2005.
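The shape of the galantamine filter query can be sketched as follows. To keep the sketch runnable without a cluster, it uses sqlite3 on toy rows; on Databricks the same SELECT (minus the setup) would be passed to spark.sql(), and the table and column names are illustrative:

```python
import sqlite3

# Toy in-memory tables standing in for the Delta tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE studies (nct_id TEXT, title TEXT);
    CREATE TABLE interventions (nct_id TEXT, name TEXT);
    INSERT INTO studies VALUES ('NCT00230997', 'Safety and Efficacy of Galantamine ...');
    INSERT INTO interventions VALUES ('NCT00230997', 'Galantamine'),
                                     ('NCT00230997', 'Placebo');
""")

# Case-insensitive substring match on the intervention name.
rows = con.execute("""
    SELECT DISTINCT s.nct_id, s.title
    FROM studies s
    JOIN interventions i ON i.nct_id = s.nct_id
    WHERE LOWER(i.name) LIKE '%galantamine%'
""").fetchall()
```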
Results
| Metric | Value |
|---|---|
| MeSH label triples | 6,000 |
| Clinical trial studies | 5,000 |
| Conditions | 9,112 |
| Interventions | 8,432 |
| Total graph triples | ~25,000 |
| Galantamine trial found | NCT00230997 (Lewy Body Dementia, 2005) |
The Knowledge Graph successfully identifies the overlap between the world of biomedical concepts in MeSH and the real world of clinical trials — exactly the kind of information a researcher would need when planning a new study or evaluating the existing landscape for a given treatment area.
Technical Lessons Learned
A number of non-obvious issues arose during the rewrite. They are worth documenting for anyone attempting something similar on Databricks:
MeSH NTriples file is 2.2 GB
The NIH MeSH NTriples file is too large to load into Python memory on the driver node. Instead, we use the SPARQL endpoint to fetch only the rdfs:label triples we need, shrinking the download from 2.2 GB to a few MB.
pyparsing conflict breaks rdflib SPARQL
The pyparsing version pre-installed on Databricks clusters conflicts with rdflib's SPARQL parser, producing AttributeError: 'module' object has no attribute 'DelimitedList'. The fix is to avoid rdflib's SPARQL parser entirely and use PySpark SQL instead.
spark is pre-loaded — never import SparkSession
In Databricks notebooks, spark, sc, and dbutils are pre-defined global variables. Importing SparkSession and calling SparkSession.builder.getOrCreate() is unnecessary and creates a second session that conflicts with the one already running on the cluster. Use the pre-defined spark variable instead.
CANNOT_DETERMINE_TYPE with nullable columns
When creating a PySpark DataFrame from a Python list, Spark cannot infer a column's type from None values. The fix is to use an empty string "" instead of None for nullable string columns, and to always pass an explicit StructType schema to spark.createDataFrame().
ORDER BY must use SELECT aliases
When a column is renamed with an alias in the SELECT clause, the ORDER BY must reference the alias rather than the original column name; using the original name throws a ColumnNotFound error.
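For example (table and column names illustrative), once COUNT(*) is aliased to n_trials, the ORDER BY must use that alias:

```python
# Correct: ORDER BY references the SELECT alias, not the source expression.
QUERY = """
    SELECT condition_name AS condition, COUNT(*) AS n_trials
    FROM mesh_nct.conditions
    GROUP BY condition_name
    ORDER BY n_trials DESC
"""
# On Databricks: spark.sql(QUERY)
```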
%md cells must have no prefix
In Databricks, %md must be the very first thing on line 1 of the cell, with no #, spaces, or quotes before it, and each %md block must live in its own cell. Writing # %md, or placing %md inside a Python cell, throws a SyntaxError.
How to Run It Yourself
The entire project runs on Azure Databricks Community Edition — no paid cluster required. Setup takes under five minutes:
- Create a new Databricks cluster (any runtime, Python 3.8+)
- Import the four notebooks into your workspace
- Run notebooks 01 through 04 in order — each installs its own dependencies via %pip at startup
- No Maven coordinates, no cluster init scripts, no external accounts needed
Libraries installed automatically at notebook start:
%pip install rdflib requests plotly networkx matplotlib
Conclusion
The original wscb-kg accelerator showed a compelling vision: using a knowledge graph to integrate ontology data with clinical trial records at scale on Databricks. However, the original code is now unusable because Graphster, the library it depended on, is no longer installable.
This Python version achieves the same vision while being installable, free of additional cost, and built entirely on open-source libraries: rdflib to create the RDF graph, PySpark for data processing and graph fusion, PySpark SQL for querying, and networkx and plotly for visualization. Every Graphster component used in the original code has a Python equivalent, and the results match what the original accelerator was intended to produce.
If you want to build biomedical knowledge graphs on Databricks, or need to perform data fusion in an RDF graph, this solution is an economical, reproducible, and reliable base to work from.
Built with rdflib | PySpark | networkx | plotly | Azure Databricks