Get high-value insights in no time.
Clinical trial groups pore over hundreds of patient charts to identify qualified subjects. Pharmacovigilance groups dig through prescription histories searching for dangerous drug combinations. Data science groups spend weeks writing SQL joins across six different databases just to answer the question “which patients have hypertension and diabetes?”
All of this can be accomplished in under 5 minutes using a knowledge graph. You don't have to shell out six figures in software costs to do it either.
We know this, because we just did it.
In this post, we will walk through exactly how we took 5,000 unstructured clinical notes and built a complete clinical knowledge graph using open-source Python libraries and a Databricks cluster. No JSL license required. No Neo4j required. We can also customize and deploy this simplified solution accelerator for you at low cost.
Why This Problem Is Harder Than It Looks
Ask any data scientist in the healthcare industry what their biggest pet peeve is, and they will tell you the same thing: the most important information in the healthcare industry is not stored in any database. It is stored in a paragraph of text.
Physician notes, discharge summaries, radiology reports, and consultation letters are where the most important information about a patient's health is stored. And they are all just free text.
Here is the kind of document that our pipeline has to work with:
From this single paragraph of text, any doctor can immediately understand the following: two symptoms, two diagnoses, three medications, two tests. A database query can understand nothing: it is just plain text, with nothing structured to query. Making these paragraphs and documents structured and queryable is the problem this project solves.
The Original Pipeline We Replicated
Our project is based on a solution accelerator published by Databricks and John Snow Labs. Their version is genuinely excellent. It uses:
- ner_jsl_slim — a deep learning Named Entity Recognition model trained on millions of annotated clinical documents, capable of identifying 20+ medical entity types with high accuracy
- Neo4j — a professional graph database with its own query language (Cypher), a visual browser, and enterprise-grade persistence
- JSL Posology model — a relation extraction model that links drugs to their dosages, frequencies, and routes of administration
It is also gated behind licenses that most teams cannot afford.
Our goal was simple: same architecture, same outputs, same clinical queries — at a fraction of the cost. Here is what we swapped:
| Paid Component | Replacement | What We Gave Up |
|---|---|---|
| JSL ner_jsl_slim model | Keyword extraction (Spark UDF) | Deep learning accuracy, negation handling |
| Neo4j graph database | NetworkX (Python) | Persistent storage, browser UI |
| JSL drug resolver | NIH RxNorm API | Brand name / abbreviation resolution |
| MIMIC clinical dataset | MTSamples (public) | Patient demographics, real visit dates |
We will be honest about every trade-off as we go. Spoiler: the gaps are real but manageable for research and analytical use cases.
The Data: 5,000 Clinical Transcriptions, Free of Charge
MTSamples is a publicly available dataset of 5,000 de-identified medical transcriptions from 40 different specialties. The text was written by real clinicians about real patient encounters.
It is not perfect. Unlike the MIMIC dataset used in the original accelerator, MTSamples lacks patient demographics, visit dates, and encounter IDs, so every patient in our pipeline is assigned a synthetic ID and the current date. That would be a problem in a real-world deployment; for demonstrating the full pipeline architecture and query capabilities, it is perfectly adequate.
The data is loaded into a Delta Lake Bronze table, the first layer of the Medallion Architecture, giving us a reliable source of record before any transformations are applied.
Stage 1: Extracting Medical Entities From Free Text
This is the stage where most people assume you need the expensive model. You do not — at least not for a well-defined clinical vocabulary.
We built a Spark UDF (User Defined Function) backed by a curated dictionary of 155 medical terms across six categories:
| Entity Type | Examples | Count |
|---|---|---|
| Symptom | pain, fever, dyspnea, chest pain, syncope | 29 terms |
| Disease / Disorder | diabetes, hypertension, cancer, COPD, stroke | 29 terms |
| Drug | aspirin, warfarin, metformin, lasix, ibuprofen | 39 terms |
| Procedure | surgery, biopsy, angioplasty, chemotherapy | 27 terms |
| Test | ECG, CBC, HbA1c, troponin, creatinine | 25 terms |
| Body Part | heart, lung, kidney, pancreas, thyroid | 30 terms |
The UDF runs distributed across all Spark partitions — meaning it processes all 5,000 transcriptions in parallel. Total extraction time: under 2 minutes on a scalable cluster.
The output is a Delta Lake gold table with one row per entity per patient — the structured data we need to build the graph.
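The extraction logic itself is simple enough to sketch in a few lines of plain Python. The function name and the cut-down dictionary below are illustrative, not the pipeline's actual code; in the notebook, a function like this is wrapped as a Spark UDF and applied to the transcription column:

```python
import re

# Cut-down, illustrative dictionary; the pipeline's curated list
# has 155 terms across six categories.
MEDICAL_TERMS = {
    "Symptom": ["chest pain", "fever", "dyspnea", "syncope"],
    "Disease": ["diabetes", "hypertension", "copd"],
    "Drug": ["aspirin", "warfarin", "metformin"],
    "Test": ["ecg", "cbc", "hba1c", "troponin"],
}

def extract_entities(text):
    """Return one (entity, type) record per dictionary term found in the note."""
    lower = text.lower()
    found = []
    for entity_type, terms in MEDICAL_TERMS.items():
        for term in terms:
            # Whole-word match, so 'pain' does not fire on 'Spain'.
            if re.search(rf"\b{re.escape(term)}\b", lower):
                found.append({"entity": term, "type": entity_type})
    return found
```

Because the function is pure Python with no per-row state, wrapping it with `pyspark.sql.functions.udf` lets Spark apply it to every transcription in parallel across partitions.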
The Honest Trade-Off
But we do want to be clear about what our keyword dictionary does not do.
“No chest pain” and “chest pain” look identical to our UDF. A deep learning model handles negation; ours does not. If a drug is written as “Coumadin” rather than “warfarin,” it will not be recognized unless we add it to our list. Any clinical abbreviation we did not anticipate goes unrecognized as well.
These are real limitations. For clinical use, this stage of the process should use a real medical NLP model. There are many good ones available from HuggingFace.
But what our keyword dictionary offers in return is total transparency and instant extensibility. Every decision is inspectable and easy to extend, and adding a new drug takes seconds. That is a real benefit.
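A quick sketch makes both halves of this trade-off concrete. The mini-dictionary and function name here are illustrative, not the pipeline's actual code:

```python
import re

DRUGS = ["warfarin", "aspirin"]

def find_drugs(text, drugs=DRUGS):
    """Return every dictionary drug mentioned in the note (whole-word match)."""
    lower = text.lower()
    return [d for d in drugs if re.search(rf"\b{re.escape(d)}\b", lower)]

# The limitation: negation is invisible to keyword matching.
find_drugs("Patient denies taking warfarin.")  # -> ['warfarin']

# The benefit: extending coverage is a one-element change to the list.
find_drugs("Continue Coumadin 5 mg daily.", DRUGS + ["coumadin"])  # -> ['coumadin']
```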
Stage 2: Building the Knowledge Graph
Here is where the project starts getting really interesting. And here is the insight that makes the graph structure so powerful:
Medical entities are shared across patients.
There is only one “warfarin” node in the graph. Every patient that was prescribed warfarin points to the same “warfarin” node. Finding all the patients prescribed warfarin is laughably simple: who points to the warfarin node? No joins. No subqueries. One lookup.
Using NetworkX, a Python library for graph creation and analysis, we build a directed graph with the same schema as the original Neo4j database:
Patient ──IS_SYMPTOM───► Symptom
Patient ──IS_DSD───────► Disease / Disorder
Patient ──IS_TEST──────► Test
Patient ──IS_BODYPART──► Body Part
Patient ──IS_PROCEDURE─► Procedure
Patient ──RXNORM_CODE──► Drug
After processing all 5,000 transcriptions the graph contains:
4,246 patient nodes · 155 entity nodes · ~180,000 edges
The graph lives in Python driver memory, unlike Neo4j, which persists to disk. We serialize it to the Databricks File System with Python's pickle module so it survives cluster restarts. Not as elegant as a dedicated graph database, but entirely functional.
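Here is a minimal sketch of that build-and-persist step, assuming the Stage 1 output has been collected to the driver as (patient_id, entity, relationship) rows. The row values and file path are illustrative:

```python
import pickle
import networkx as nx

# Illustrative rows from the Stage 1 gold table: (patient_id, entity, relationship).
rows = [
    ("patient_1", "chest pain", "IS_SYMPTOM"),
    ("patient_1", "warfarin",   "RXNORM_CODE"),
    ("patient_2", "warfarin",   "RXNORM_CODE"),
    ("patient_2", "diabetes",   "IS_DSD"),
]

G = nx.DiGraph()
for patient_id, entity, rel in rows:
    G.add_node(patient_id, node_type="Patient")
    G.add_node(entity, node_type="Entity")  # shared node: added once, reused by every patient
    G.add_edge(patient_id, entity, relationship=rel)

# Persist so the graph survives cluster restarts
# (on Databricks, a path under /dbfs/ lands on DBFS).
with open("clinical_graph.pkl", "wb") as f:
    pickle.dump(G, f)
```

Note that `add_node` is idempotent: adding the "warfarin" node a second time is a no-op, which is exactly how entity nodes end up shared across patients.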
Stage 3: Querying the Graph — Where the Value Lives
This is the part that makes everything above worth doing. Six clinical query types, all running in seconds.
Patient Journey
“Give me the complete clinical picture for patient 42.”
One function call. Returns every symptom, diagnosis, drug, procedure, test, and body part connected to that patient. What requires multi-table joins in SQL is a single 1-hop graph traversal.
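Assuming the pipeline's Patient ──► Entity edge shape with a `relationship` attribute, the traversal might look like this (node names are illustrative):

```python
import networkx as nx

# Toy graph with the pipeline's shape: Patient -> Entity edges
# tagged with a relationship type.
G = nx.DiGraph()
G.add_edge("patient_42", "chest pain", relationship="IS_SYMPTOM")
G.add_edge("patient_42", "hypertension", relationship="IS_DSD")
G.add_edge("patient_42", "warfarin", relationship="RXNORM_CODE")

def patient_journey(G, patient_id):
    """One-hop traversal: every entity connected to the patient, grouped by relationship."""
    journey = {}
    for _, entity, data in G.out_edges(patient_id, data=True):
        journey.setdefault(data["relationship"], []).append(entity)
    return journey
```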
Drug Cohorts
“Which patients have been prescribed warfarin?”
Find every patient node with an edge to the warfarin entity node. Works for any drug in the dictionary.
Dangerous Drug Combinations
“Which patients are co-prescribed an NSAID and warfarin?”
This is our favorite query because it best illustrates the power of the graph approach. Combining an NSAID (ibuprofen, naproxen, aspirin, diclofenac, and many others) with warfarin increases the risk of bleeding, and identifying such combinations is a core pharmacovigilance task.
In the classical database approach, you would need to enumerate the drugs in the NSAID class, perform the joins, and filter the results. In the graph approach, it takes two set lookups and one intersection:
- Set A: All patient entities linked to any NSAID node
- Set B: All patient entities linked to the warfarin node
- Result: Set A intersect Set B
That's it.
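In NetworkX terms, the whole query might be sketched like this (the toy edges and cut-down NSAID list are illustrative):

```python
import networkx as nx

G = nx.DiGraph()
# Illustrative edges; the full graph has ~180,000.
G.add_edge("patient_1", "ibuprofen")
G.add_edge("patient_1", "warfarin")
G.add_edge("patient_2", "warfarin")
G.add_edge("patient_3", "naproxen")

NSAIDS = {"ibuprofen", "naproxen", "aspirin", "diclofenac"}

# Set A: patients linked to any NSAID node present in the graph.
nsaid_patients = set()
for drug in NSAIDS & set(G.nodes):
    nsaid_patients |= set(G.predecessors(drug))

# Set B: patients linked to the warfarin node.
warfarin_patients = set(G.predecessors("warfarin"))

# Result: the co-prescribed cohort.
at_risk = nsaid_patients & warfarin_patients  # -> {'patient_1'}
```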
Co-occurrence Cohorts
“Which patients have both chest pain and shortness of breath?” “Which patients have both hypertension and diabetes?”
Two set lookups, one intersection. This is the starting point for clinical trial eligibility: finding patients who satisfy multiple criteria at once.
Procedure Cohorts
“All patients who had surgery, bypass, or cholecystectomy.”
Useful for surgical outcome tracking, readmission studies, and care pathway analysis.
Named Cohort Builder
A generic construct for creating and persisting a named patient cohort for any clinical term of interest. We pre-built twelve named cohorts (warfarin patients, diabetes patients, chest pain patients, surgical patients, asthma patients, and more), all persisted as Delta Lake gold tables for downstream analysis and reporting.
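A cohort builder along these lines can be sketched in a few lines. The function name, toy graph, and table name are illustrative; in the notebook, the actual persistence is a Spark DataFrame write:

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("patient_1", "warfarin")
G.add_edge("patient_2", "warfarin")
G.add_edge("patient_3", "diabetes")

def build_cohort(G, name, term):
    """Return (cohort_name, patient_id) rows for every patient linked to `term`."""
    if term not in G:
        return []
    return [(name, patient) for patient in G.predecessors(term)]

rows = build_cohort(G, "warfarin_cohort", "warfarin")

# In the notebook, these rows become a Delta gold table, e.g.:
# spark.createDataFrame(rows, ["cohort", "patient_id"]) \
#      .write.format("delta").saveAsTable("gold_cohort_warfarin")
```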
What the Results Look Like
There are three visualizations that tell you what is in the graph.
The Schema Diagram visualizes the architecture of the graph itself: the entities and the relationships between them, color-coded throughout. It is the open-source equivalent of Neo4j's CALL db.schema.visualization(), rendered with Matplotlib.
The Analytics Dashboard is the visualization that always sparks a reaction. The most connected entities tell you at a glance the dominant clinical language. “Pain” has over 4,000 patients connected to it. “Disease” and “surgery” are not far behind. You can see at once the clinical concepts that are general and the ones that are specific.
The Patient Journey Graph shows the complete clinical network of a single patient as a radial graph: the patient at the center, with connected entities grouped by type, drugs closest and body parts furthest out. It turns a mass of text into a visual clinical fingerprint. Show it to a doctor and they will immediately understand what they are looking at.
The Numbers That Matter
| Metric | Result |
|---|---|
| Transcriptions processed | 4,999 |
| Medical entities extracted | ~180,000 |
| Patient nodes in graph | 4,246 |
| Shared entity nodes | 155 |
| Named cohorts built | 12 |
| Extraction time | < 2 minutes |
| Graph build time | < 5 minutes |
| Total software cost | $0 |
What This Actually Enables
Take a step back from the technical details for a second and think about the implications of this pipeline for a data science or analytics team:
- Clinical trials: recruiting patients who meet complex inclusion criteria shrinks from weeks of manual effort to a single cohort query. For a trial requiring patients with diabetes, hypertension, and no history of surgery, that is three set operations and a result in under a second.
- Pharmacovigilance: proactively identifying dangerous drug combinations across the entire patient population before adverse events happen. The NSAID + warfarin query runs over all 4,246 patients in milliseconds.
- Quality improvement: creating cohorts of patients with certain conditions to monitor outcomes, compare with peers, or identify gaps in the care pathway.
- Research: answering epidemiological questions about co-occurring conditions, treatments, and trends for thousands of patients.
And all of this from unstructured clinical notes.
The Honest Verdict
Is this pipeline production-ready for a hospital or health system? No. It still needs a proper clinical NLP model in the NER step for the accuracy and negation handling the real world requires, and it still needs persistent storage for the graph in a production environment.
But is this a fully functional, architecturally correct, clinically relevant analytics system built entirely at low cost? Yes.
And, perhaps more importantly, does this pipeline prove that the architecture is correct, the methodology is valid, the results are relevant, and the benefits are real? Also yes. Going to production is a matter of replacing the keyword dictionary in the NER step with a stronger clinical NLP model and swapping NetworkX for a persistent graph store. Both upgrades can happen on your own schedule, without committing to licensing costs up front.
Sounds like the right place to start.
Try It Yourself
The full notebook, README, and documentation are available on GitHub. Everything runs on Databricks.
If you are working in healthcare data and want to explore what clinical NLP can do for your team without a procurement process, this is a practical, working starting point.
Built with Spark NLP (open source) · NetworkX · MTSamples · Databricks Community Edition
Apache 2.0 License — free for research and commercial use