We re-architected and rewrote the original Databricks solution accelerator, deployed it in our Databricks environment using low-cost alternatives, and it works! Our solution uses the rdflib, PySpark, networkx, and plotly libraries. We built a production-level knowledge graph connecting 6,000 MeSH biomedical concepts to 5,000 clinical trials without relying on any paid libraries.
Introduction
Knowledge Graphs (KGs) have become an essential component in the modern world of biomedical data. KGs let us ask complex questions by connecting pieces of data that were never intended to interact with each other. This is exactly what the original solution accelerator provided by Databricks achieved: it connected NIH MeSH data on biomedical concepts with ClinicalTrials.gov data on clinical trials. That connection allowed us to answer questions like: what treatments are being researched for brain diseases? And has anyone ever conducted a clinical trial on Galantamine in Lewy Body Dementia? The accelerator was originally written in Scala using the Graphster library. However, Graphster is a closed-source library that is not available on Maven Central and is not installable on modern Databricks. For our solution, we completely rewrote it in Python, replacing every Graphster component with an open-source alternative.
The Problem with Graphster
When we attempted to run the original notebooks on Databricks, the first roadblock was that Graphster is not available on Maven Central or in Spark Packages, and the WiseCube JAR is not publicly hosted anywhere Databricks can reach. The library is simply not installable on any modern Databricks runtime.
In addition to the issue with Graphster itself, the original notebooks had two other dependencies that needed to be replaced:
- AACT bulk downloads: The ClinicalTrials.gov bulk database available at aact.ctti-clinicaltrials.org requires registration and occasionally returns 500 errors. The data is distributed as flat CSV files, which are inconvenient to work with.
- Bellman SPARQL engine: The Graphster library includes a custom SPARQL query engine called Bellman. Without Graphster installed, there is no way to run SPARQL queries against Spark DataFrames.
The answer was to replace everything: not just Graphster itself, but the data sources and the query engine as well. The replacements had to be low- or no-cost, convenient, and always available.
The Python Stack
Here is the complete replacement map — every Graphster component and what we used instead:
| Graphster (Original) | Python Replacement | Purpose |
|---|---|---|
| URIGraphConf | rdflib.URIRef | URI node construction |
| LangLiteralGraphConf | rdflib.Literal(lang=) | Language-tagged string literals |
| DataLiteralGraphConf | rdflib.Literal(datatype=XSD.*) | Typed data literals (dates, IDs) |
| TripleMarker + TripleExtractor | extract_triples() (PySpark) | Extract (s,p,o) rows from DataFrames |
| GraphLinker | graph_linker() (PySpark join) | Link conditions/interventions to MeSH |
| MeSH.download() | NIH MeSH SPARQL endpoint | Fetch MeSH concept labels — No additional cost |
| ClinicalTrials.download() | ClinicalTrials.gov API v2 | Fetch trial data — no credentials needed |
| Bellman SPARQL engine | PySpark SQL | Query the knowledge graph at scale |
Pipeline Architecture
The pipeline is structured across four notebooks, each handling a distinct stage of Knowledge Graph construction.
Notebook 01 — RDF Configuration
All RDF graph configuration is defined using rdflib primitives. URIRef constructs URI nodes for trials, conditions, and interventions. Literal(lang='en') creates language-tagged MeSH labels. Literal(datatype=XSD.date) handles trial submission dates. Eight namespaces are registered: schema.org, rdfs, rdf, MeSH, MeSH vocab, NCT trials, and two custom fallback namespaces for conditions and interventions that could not be matched to MeSH.
Notebook 02 — Data Download and Staging
MeSH data is fetched from the NIH SPARQL endpoint in paginated batches of 50,000 triples using the requests library. The query retrieves rdfs:label triples for all MeSH concepts, giving us the vocabulary needed for entity matching. Results are written to the Delta table mesh_nct.mesh.
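A minimal sketch of that paginated fetch, assuming the public NIH endpoint at id.nlm.nih.gov/mesh/sparql and its query/format request parameters (the function and variable names are illustrative):

```python
import requests

SPARQL_ENDPOINT = "https://id.nlm.nih.gov/mesh/sparql"
PAGE_SIZE = 50_000

def label_query(offset: int, limit: int = PAGE_SIZE) -> str:
    """Build the SPARQL query for one page of rdfs:label triples."""
    return f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?concept ?label
        WHERE {{ ?concept rdfs:label ?label }}
        LIMIT {limit} OFFSET {offset}
    """

def fetch_page(offset: int):
    """Fetch one page of label bindings as JSON (network call)."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": label_query(offset), "format": "JSON"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]
```

Each page is then converted to a PySpark DataFrame and appended to the Delta table.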
Clinical trials data is fetched from the ClinicalTrials.gov REST API v2 across three endpoints: studies, conditions per study, and interventions per study. Results are written to mesh_nct.studies, mesh_nct.conditions, and mesh_nct.interventions. The final staged data contains:
| Table | Rows |
|---|---|
| mesh_nct.mesh | 6,000 |
| mesh_nct.studies | 5,000 |
| mesh_nct.conditions | 9,112 |
| mesh_nct.interventions | 8,432 |
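The trials download follows the API v2 paging convention (pageSize plus a nextPageToken cursor). A hedged sketch, with illustrative names and limits:

```python
import requests

API = "https://clinicaltrials.gov/api/v2/studies"

def page_params(page_size, token=None):
    """Request parameters for one page of the v2 /studies endpoint."""
    params = {"pageSize": page_size}
    if token:
        params["pageToken"] = token
    return params

def fetch_studies(max_studies=5_000, page_size=1_000):
    """Page through /studies until max_studies are collected (network call)."""
    studies, token = [], None
    while len(studies) < max_studies:
        resp = requests.get(API, params=page_params(page_size, token), timeout=60)
        resp.raise_for_status()
        payload = resp.json()
        studies.extend(payload.get("studies", []))
        token = payload.get("nextPageToken")
        if not token:
            break
    return studies[:max_studies]
```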
Notebook 03 — Knowledge Graph Construction
This is where data fusion occurs. A custom graph_linker() function joins the condition names with the MeSH rdfs:label values using a case-insensitive LOWER() match. When a match is found it assigns the MeSH URI; otherwise it assigns a custom fallback URI under wisecube.com.
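The matching rule is easiest to see in plain Python. This is a simplified in-memory stand-in for the PySpark join (the fallback URI prefix and MeSH identifier shown are illustrative):

```python
# Simplified stand-in for graph_linker(): case-insensitive label matching
# with a custom fallback URI for unmatched terms. The real function does
# this with a PySpark join on LOWER(condition) = LOWER(mesh_label).
FALLBACK = "https://wisecube.com/condition#"

def link_terms(terms, mesh_labels):
    """mesh_labels maps label -> MeSH URI. Returns {term -> URI}."""
    by_lower = {label.lower(): uri for label, uri in mesh_labels.items()}
    return {
        term: by_lower.get(term.lower(), FALLBACK + term.replace(" ", "_"))
        for term in terms
    }
```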
A custom extract_triples() function accepts a PySpark DataFrame and a list of triple specifications. Each specification names a subject column, a predicate URI, and an object column; the function returns a DataFrame of the extracted (s, p, o) triples. This replaces Graphster's TripleMarker and TripleExtractor.
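A minimal sketch of the same idea over plain Python rows (the real extract_triples() operates on a PySpark DataFrame, so the row and column names here are illustrative):

```python
def extract_triples(rows, specs):
    """rows: list of dicts; specs: (subject_col, predicate_uri, object_col).

    Emits one (s, p, o) tuple per spec per row, skipping missing values.
    """
    triples = []
    for row in rows:
        for subj_col, pred, obj_col in specs:
            s, o = row.get(subj_col), row.get(obj_col)
            if s is not None and o is not None:
                triples.append((s, pred, o))
    return triples
```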
The final fused graph is saved to a Delta table called mesh_nct.mesh_nct, which includes all the MeSH triples and all the NCT triples.
Notebook 04 — Querying
PySpark SQL is used to execute the queries against the Delta tables. This replaces the need for a SPARQL engine.
The brain diseases query joins the studies, conditions, and interventions to retrieve the number of trials for each condition/intervention pair.
A filter query retrieves all trials where the intervention name contains the string 'galantamine'. This surfaces trial NCT00230997, 'Safety and Efficacy of Galantamine in Patients with Dementia with Lewy Bodies', submitted in 2005.
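The shape of the galantamine filter query can be sketched as follows. To keep the sketch runnable without a cluster, it uses sqlite3 on toy rows; on Databricks the same SELECT (minus the setup) would be passed to spark.sql(), and the table and column names are illustrative:

```python
import sqlite3

# Toy in-memory tables standing in for the Delta tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE studies (nct_id TEXT, title TEXT);
    CREATE TABLE interventions (nct_id TEXT, name TEXT);
    INSERT INTO studies VALUES ('NCT00230997', 'Safety and Efficacy of Galantamine ...');
    INSERT INTO interventions VALUES ('NCT00230997', 'Galantamine'),
                                     ('NCT00230997', 'Placebo');
""")

# Case-insensitive substring match on the intervention name.
rows = con.execute("""
    SELECT DISTINCT s.nct_id, s.title
    FROM studies s
    JOIN interventions i ON i.nct_id = s.nct_id
    WHERE LOWER(i.name) LIKE '%galantamine%'
""").fetchall()
```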
Results
| Metric | Value |
|---|---|
| MeSH label triples | 6,000 |
| Clinical trial studies | 5,000 |
| Conditions | 9,112 |
| Interventions | 8,432 |
| Total graph triples | ~25,000 |
| Galantamine trial found | NCT00230997 (Lewy Body Dementia, 2005) |
The Knowledge Graph successfully identifies the overlap between the world of biomedical concepts in MeSH and the real world of clinical trials — exactly the kind of information a researcher would need when planning a new study or evaluating the existing landscape for a given treatment area.
Technical Lessons Learned
A number of non-obvious issues arose during the rewrite. They are worth documenting for anyone attempting something similar on Databricks:
MeSH NTriples file is 2.2 GB
The NIH MeSH NTriples file is too large to load into Python memory on the driver node. Instead, we use the SPARQL endpoint to fetch only the rdfs:label triples we need, shrinking the download from 2.2 GB to a few MB.
pyparsing conflict breaks rdflib SPARQL
The pyparsing version pre-installed on Databricks clusters conflicts with rdflib's SPARQL parser, producing AttributeError: 'module' object has no attribute 'DelimitedList'. The fix is to avoid rdflib's SPARQL parser entirely and use PySpark SQL instead.
spark is pre-loaded — never import SparkSession
In Databricks notebooks, spark, sc, and dbutils are pre-defined global variables. Importing SparkSession and calling SparkSession.builder.getOrCreate() is unnecessary and creates a second session that conflicts with the one already running on the cluster. Use the pre-defined spark variable instead.
CANNOT_DETERMINE_TYPE with nullable columns
When creating a PySpark DataFrame from a Python list, Spark cannot infer a column's type from None values. The fix is to use an empty string "" instead of None for nullable string columns, and to always pass an explicit StructType schema to spark.createDataFrame().
ORDER BY must use SELECT aliases
When a column is renamed with an alias in the SELECT clause, the ORDER BY must reference the alias rather than the original column name; using the original name throws a ColumnNotFound error.
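For example (table and column names illustrative), once COUNT(*) is aliased to n_trials, the ORDER BY must use that alias:

```python
# Correct: ORDER BY references the SELECT alias, not the source expression.
QUERY = """
    SELECT condition_name AS condition, COUNT(*) AS n_trials
    FROM mesh_nct.conditions
    GROUP BY condition_name
    ORDER BY n_trials DESC
"""
# On Databricks: spark.sql(QUERY)
```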
%md cells must have no prefix
In Databricks, %md must be the very first thing on line 1 of the cell, with no #, spaces, or quotes before it, and each %md block must live in its own cell. Writing # %md, or placing %md inside a Python cell, throws a SyntaxError.
How to Run It Yourself
The entire project runs on Azure Databricks Community Edition — no paid cluster required. Setup takes under five minutes:
- Create a new Databricks cluster (any runtime, Python 3.8+)
- Import the four notebooks into your workspace
- Run notebooks 01 through 04 in order — each installs its own dependencies via %pip at startup
- No Maven coordinates, no cluster init scripts, no external accounts needed
Libraries installed automatically at notebook start:
%pip install rdflib requests plotly networkx matplotlib
Conclusion
The original wscb-kg accelerator showed a compelling vision: using a knowledge graph to integrate ontology data with clinical trial records at scale on Databricks. However, the original code is now unusable because Graphster, the library it depended on, is no longer installable.
This Python version achieves the same vision while being installable, free of additional cost, and built entirely on open-source libraries: rdflib to create the RDF graph, PySpark for data processing and graph fusion, PySpark SQL for querying, and networkx and plotly for visualization. Every Graphster component used in the original code has a Python equivalent, and the results match what the original accelerator was intended to produce.
If you want to build biomedical knowledge graphs on Databricks, or need to perform data fusion in an RDF graph, this solution is an economical, reproducible, and reliable base to work from.
Built with rdflib | PySpark | networkx | plotly | Azure Databricks