Get high-value insights in no time.
Clinical trial groups pore over hundreds of patient charts to identify qualified subjects. Pharmacovigilance groups dig through prescription histories searching for dangerous drug combinations. Data science groups spend weeks writing SQL joins across six different databases just to answer the question “which patients have hypertension and diabetes?”
All of this can be accomplished in under 5 minutes using a knowledge graph. You don't have to shell out six figures in software costs to do it either.
We know this, because we just did it.
In this post, we will walk through exactly how we took 5,000 unstructured clinical notes and built a complete clinical knowledge graph using open-source Python libraries and a Databricks cluster. No JSL license required. No Neo4j required. We can also customize and deploy this simplified solution accelerator for you at low cost.
Why This Problem Is Harder Than It Looks
Ask any data scientist in the healthcare industry what their biggest pet peeve is, and they will tell you the same thing: the most important information in the healthcare industry is not stored in any database. It is stored in a paragraph of text.
Physician notes, discharge summaries, radiology reports, and consultation letters are where the most important information about a patient's health is stored. And they are all just free text.
Here is the kind of document that our pipeline has to work with:
From this single paragraph of text, any doctor can immediately understand the following: two symptoms, two diagnoses, three medications, two tests. A database query can understand nothing: it is just plain text, with nothing structured to query. Making these paragraphs and documents structured and queryable is the problem this project solves.
The Original Pipeline We Replicated
Our project is based on a solution accelerator published by Databricks and John Snow Labs. Their version is genuinely excellent. It uses:
- ner_jsl_slim — a deep learning Named Entity Recognition model trained on millions of annotated clinical documents, capable of identifying 20+ medical entity types with high accuracy
- Neo4j — a professional graph database with its own query language (Cypher), a visual browser, and enterprise-grade persistence
- JSL Posology model — a relation extraction model that links drugs to their dosages, frequencies, and routes of administration
It is also gated behind licenses that most teams cannot afford.
Our goal was simple: same architecture, same outputs, same clinical queries — at a fraction of the cost. Here is what we swapped:
| Paid Component | Replacement | What We Gave Up |
|---|---|---|
| JSL ner_jsl_slim model | Keyword extraction (Spark UDF) | Deep learning accuracy, negation handling |
| Neo4j graph database | NetworkX (Python) | Persistent storage, browser UI |
| JSL drug resolver | NIH RxNorm API | Brand name / abbreviation resolution |
| MIMIC clinical dataset | MTSamples (public) | Patient demographics, real visit dates |
We will be honest about every trade-off as we go. Spoiler: the gaps are real but manageable for research and analytical use cases.
The Data: 5,000 Clinical Transcriptions, Free of Charge
MTSamples is a publicly available dataset of 5,000 de-identified medical transcriptions from 40 different specialties. The text was written by real clinicians about real patient encounters.
It is not perfect. Unlike the MIMIC dataset used in the original accelerator, MTSamples lacks patient demographics, visit dates, and encounter IDs, so every patient in our pipeline is assigned a synthetic ID and the current date. That would be a problem in a real-world deployment; for demonstrating the full pipeline architecture and query capabilities, it is perfectly adequate.
The data is loaded into a Delta Lake Bronze table, the first layer of the Medallion Architecture, giving us a reliable source of record before any transformations are applied.
Stage 1: Extracting Medical Entities From Free Text
This is the stage where most people assume you need the expensive model. You do not — at least not for a well-defined clinical vocabulary.
We built a Spark UDF (User Defined Function) backed by a curated dictionary of 155 medical terms across six categories:
| Entity Type | Examples | Count |
|---|---|---|
| Symptom | pain, fever, dyspnea, chest pain, syncope | 29 terms |
| Disease / Disorder | diabetes, hypertension, cancer, COPD, stroke | 29 terms |
| Drug | aspirin, warfarin, metformin, lasix, ibuprofen | 39 terms |
| Procedure | surgery, biopsy, angioplasty, chemotherapy | 27 terms |
| Test | ECG, CBC, HbA1c, troponin, creatinine | 25 terms |
| Body Part | heart, lung, kidney, pancreas, thyroid | 30 terms |
The UDF runs distributed across all Spark partitions — meaning it processes all 5,000 transcriptions in parallel. Total extraction time: under 2 minutes on a scalable cluster.
The output is a Delta Lake gold table with one row per entity per patient — the structured data we need to build the graph.
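The extraction logic itself is simple enough to sketch in a few lines of plain Python. The function name and the cut-down dictionary below are illustrative, not the pipeline's actual code; in the notebook, a function like this is wrapped as a Spark UDF and applied to the transcription column:

```python
import re

# Cut-down, illustrative dictionary; the pipeline's curated list
# has 155 terms across six categories.
MEDICAL_TERMS = {
    "Symptom": ["chest pain", "fever", "dyspnea", "syncope"],
    "Disease": ["diabetes", "hypertension", "copd"],
    "Drug": ["aspirin", "warfarin", "metformin"],
    "Test": ["ecg", "cbc", "hba1c", "troponin"],
}

def extract_entities(text):
    """Return one (entity, type) record per dictionary term found in the note."""
    lower = text.lower()
    found = []
    for entity_type, terms in MEDICAL_TERMS.items():
        for term in terms:
            # Whole-word match, so 'pain' does not fire on 'Spain'.
            if re.search(rf"\b{re.escape(term)}\b", lower):
                found.append({"entity": term, "type": entity_type})
    return found
```

Because the function is pure Python with no per-row state, wrapping it with `pyspark.sql.functions.udf` lets Spark apply it to every transcription in parallel across partitions.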
The Honest Trade-Off
But we do want to be clear about what our keyword dictionary does not do.
“No chest pain” and “chest pain” look identical to our UDF. A deep learning model handles negation; ours does not. If a drug is written as “Coumadin” rather than “warfarin,” it will not be recognized unless we add it to our list. Any clinical abbreviation we did not anticipate goes unrecognized as well.
These are real limitations. For clinical use, this stage of the process should use a real medical NLP model. There are many good ones available from HuggingFace.
But what our keyword dictionary offers in return is total transparency and instant extensibility. Every decision is inspectable and easy to extend, and adding a new drug takes seconds. That is a real benefit.
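A quick sketch makes both halves of this trade-off concrete. The mini-dictionary and function name here are illustrative, not the pipeline's actual code:

```python
import re

DRUGS = ["warfarin", "aspirin"]

def find_drugs(text, drugs=DRUGS):
    """Return every dictionary drug mentioned in the note (whole-word match)."""
    lower = text.lower()
    return [d for d in drugs if re.search(rf"\b{re.escape(d)}\b", lower)]

# The limitation: negation is invisible to keyword matching.
find_drugs("Patient denies taking warfarin.")  # -> ['warfarin']

# The benefit: extending coverage is a one-element change to the list.
find_drugs("Continue Coumadin 5 mg daily.", DRUGS + ["coumadin"])  # -> ['coumadin']
```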
Stage 2: Building the Knowledge Graph
Here is where the project starts getting really interesting. And here is the insight that makes the graph structure so powerful:
Medical entities are shared across patients.
There is only one “warfarin” node in the graph. Every patient that was prescribed warfarin points to the same “warfarin” node. Finding all the patients prescribed warfarin is laughably simple: who points to the warfarin node? No joins. No subqueries. One lookup.
Using NetworkX, a Python library for graph creation and analysis, we build a directed graph with the same schema as the original Neo4j database:
Patient ──IS_SYMPTOM───► Symptom
Patient ──IS_DSD───────► Disease / Disorder
Patient ──IS_TEST──────► Test
Patient ──IS_BODYPART──► Body Part
Patient ──IS_PROCEDURE─► Procedure
Patient ──RXNORM_CODE──► Drug
After processing all 5,000 transcriptions the graph contains:
4,246 patient nodes · 155 entity nodes · ~180,000 edges
The graph lives in Python driver memory, unlike Neo4j, which persists to disk. We serialize it to the Databricks File System with Python's pickle module so it survives cluster restarts. Not as elegant as a dedicated graph database, but entirely functional.
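Here is a minimal sketch of that build-and-persist step, assuming the Stage 1 output has been collected to the driver as (patient_id, entity, relationship) rows. The row values and file path are illustrative:

```python
import pickle
import networkx as nx

# Illustrative rows from the Stage 1 gold table: (patient_id, entity, relationship).
rows = [
    ("patient_1", "chest pain", "IS_SYMPTOM"),
    ("patient_1", "warfarin",   "RXNORM_CODE"),
    ("patient_2", "warfarin",   "RXNORM_CODE"),
    ("patient_2", "diabetes",   "IS_DSD"),
]

G = nx.DiGraph()
for patient_id, entity, rel in rows:
    G.add_node(patient_id, node_type="Patient")
    G.add_node(entity, node_type="Entity")  # shared node: added once, reused by every patient
    G.add_edge(patient_id, entity, relationship=rel)

# Persist so the graph survives cluster restarts
# (on Databricks, a path under /dbfs/ lands on DBFS).
with open("clinical_graph.pkl", "wb") as f:
    pickle.dump(G, f)
```

Note that `add_node` is idempotent: adding the "warfarin" node a second time is a no-op, which is exactly how entity nodes end up shared across patients.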
Stage 3: Querying the Graph — Where the Value Lives
This is the part that makes everything above worth doing. Six clinical query types, all running in seconds.
Patient Journey
“Give me the complete clinical picture for patient 42.”
One function call. Returns every symptom, diagnosis, drug, procedure, test, and body part connected to that patient. What requires multi-table joins in SQL is a single 1-hop graph traversal.
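Assuming the pipeline's Patient ──► Entity edge shape with a `relationship` attribute, the traversal might look like this (node names are illustrative):

```python
import networkx as nx

# Toy graph with the pipeline's shape: Patient -> Entity edges
# tagged with a relationship type.
G = nx.DiGraph()
G.add_edge("patient_42", "chest pain", relationship="IS_SYMPTOM")
G.add_edge("patient_42", "hypertension", relationship="IS_DSD")
G.add_edge("patient_42", "warfarin", relationship="RXNORM_CODE")

def patient_journey(G, patient_id):
    """One-hop traversal: every entity connected to the patient, grouped by relationship."""
    journey = {}
    for _, entity, data in G.out_edges(patient_id, data=True):
        journey.setdefault(data["relationship"], []).append(entity)
    return journey
```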
Drug Cohorts
“Which patients have been prescribed warfarin?”
Find every patient node with an edge to the warfarin entity node. Works for any drug in the dictionary.
Dangerous Drug Combinations
“Which patients are co-prescribed an NSAID and warfarin?”
This is our favorite query because it best illustrates the power of the graph approach. Combining an NSAID (ibuprofen, naproxen, aspirin, diclofenac, and many others) with warfarin increases the risk of bleeding, and identifying such combinations is a core pharmacovigilance task.
In the classical database approach, you would need to enumerate the drugs in the NSAID class, perform the joins, and filter the results. In the graph approach, it takes two set lookups and one intersection:
- Set A: All patient entities linked to any NSAID node
- Set B: All patient entities linked to the warfarin node
- Result: Set A intersect Set B
That's it.
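In NetworkX terms, the whole query might be sketched like this (the toy edges and cut-down NSAID list are illustrative):

```python
import networkx as nx

G = nx.DiGraph()
# Illustrative edges; the full graph has ~180,000.
G.add_edge("patient_1", "ibuprofen")
G.add_edge("patient_1", "warfarin")
G.add_edge("patient_2", "warfarin")
G.add_edge("patient_3", "naproxen")

NSAIDS = {"ibuprofen", "naproxen", "aspirin", "diclofenac"}

# Set A: patients linked to any NSAID node present in the graph.
nsaid_patients = set()
for drug in NSAIDS & set(G.nodes):
    nsaid_patients |= set(G.predecessors(drug))

# Set B: patients linked to the warfarin node.
warfarin_patients = set(G.predecessors("warfarin"))

# Result: the co-prescribed cohort.
at_risk = nsaid_patients & warfarin_patients  # -> {'patient_1'}
```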
Co-occurrence Cohorts
“Which patients have both chest pain and shortness of breath?” “Which patients have both hypertension and diabetes?”
Two set lookups, one intersection. This is the starting point for clinical trial eligibility: finding patients who satisfy multiple criteria at once.
Procedure Cohorts
“All patients who had surgery, bypass, or cholecystectomy.”
Useful for surgical outcome tracking, readmission studies, and care pathway analysis.
Named Cohort Builder
A generic construct for creating and persisting a named patient cohort for any clinical term of interest. We pre-built twelve named cohorts (warfarin patients, diabetes patients, chest pain patients, surgical patients, asthma patients, and more), all persisted as Delta Lake gold tables for downstream analysis and reporting.
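A cohort builder along these lines can be sketched in a few lines. The function name, toy graph, and table name are illustrative; in the notebook, the actual persistence is a Spark DataFrame write:

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("patient_1", "warfarin")
G.add_edge("patient_2", "warfarin")
G.add_edge("patient_3", "diabetes")

def build_cohort(G, name, term):
    """Return (cohort_name, patient_id) rows for every patient linked to `term`."""
    if term not in G:
        return []
    return [(name, patient) for patient in G.predecessors(term)]

rows = build_cohort(G, "warfarin_cohort", "warfarin")

# In the notebook, these rows become a Delta gold table, e.g.:
# spark.createDataFrame(rows, ["cohort", "patient_id"]) \
#      .write.format("delta").saveAsTable("gold_cohort_warfarin")
```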
What the Results Look Like
There are three visualizations that tell you what is in the graph.
The Schema Diagram visualizes the architecture of the graph itself: the entities and the relationships between them, color-coded throughout. It is the open-source equivalent of Neo4j's CALL db.schema.visualization(), rendered with Matplotlib.
The Analytics Dashboard is the visualization that always sparks a reaction. The most connected entities tell you at a glance the dominant clinical language. “Pain” has over 4,000 patients connected to it. “Disease” and “surgery” are not far behind. You can see at once the clinical concepts that are general and the ones that are specific.
The Patient Journey Graph shows the complete clinical network of a single patient as a radial graph: the patient at the center, with connected entities grouped by type, drugs closest and body parts furthest out. It turns a mass of text into a visual clinical fingerprint. Show it to a doctor and they will immediately understand what they are looking at.
The Numbers That Matter
| Metric | Result |
|---|---|
| Transcriptions processed | 4,999 |
| Medical entities extracted | ~180,000 |
| Patient nodes in graph | 4,246 |
| Shared entity nodes | 155 |
| Named cohorts built | 12 |
| Extraction time | < 2 minutes |
| Graph build time | < 5 minutes |
| Total software cost | $0 |
What This Actually Enables
Take a step back from the technical details for a second and think about the implications of this pipeline for a data science or analytics team:
- Clinical trials: recruiting patients who meet complex inclusion criteria shrinks from weeks of manual effort to a single cohort query. For a trial requiring patients with diabetes, hypertension, and no history of surgery, that is three set operations and a result in under a second.
- Pharmacovigilance: proactively identifying dangerous drug combinations across the entire patient population before adverse events happen. The NSAID + warfarin query runs over all 4,246 patients in milliseconds.
- Quality improvement: creating cohorts of patients with certain conditions to monitor outcomes, compare with peers, or identify gaps in the care pathway.
- Research: answering epidemiological questions about co-occurring conditions, treatments, and trends for thousands of patients.
And all of this from unstructured clinical notes.
The Honest Verdict
Is this pipeline production-ready for a hospital or health system? No. It still needs a proper clinical NLP model in the NER step for the accuracy and negation handling the real world requires, and it still needs persistent storage for the graph in a production environment.
But is this a fully functional, architecturally correct, clinically relevant analytics system built entirely at low cost? Yes.
And, perhaps more importantly, does this pipeline prove that the architecture is correct, the methodology is valid, the results are relevant, and the benefits are real? Also yes. Going to production is a matter of replacing the keyword dictionary in the NER step with a stronger clinical NLP model and swapping NetworkX for a persistent graph store. Both upgrades can happen on your own schedule, without committing to licensing costs up front.
Sounds like the right place to start.
Try It Yourself
The full notebook, README, and documentation are available on GitHub. Everything runs on Databricks.
If you are working in healthcare data and want to explore what clinical NLP can do for your team without a procurement process, this is a practical, working starting point.
Built with Spark NLP (open source) · NetworkX · MTSamples · Databricks Community Edition
Apache 2.0 License — free for research and commercial use