Leverage LLMs for Graph Data Science Pipelines: 4 Steps to Avoid Pitfalls of ChatGPT

By Sean Robinson, MS / Director of Data Science

September 15, 2023

In this article, we’ll explore four crucial steps to harness the capabilities of large language models (LLMs) for enhancing graph data science pipelines. The power of LLMs is now undeniable as we see them rising to the top of every boardroom agenda more and more each month. While they are often associated with natural language processing tasks, their potential extends far beyond text analysis.

By combining the strengths of LLMs and graph data, you can unlock a wealth of insights and intelligence that traditional methods alone cannot provide. Whether you’re a data scientist, machine learning enthusiast, or a business professional looking to extract valuable knowledge from interconnected data, these steps will guide you towards building more robust and insightful graph-based applications.

Planning your Graph Data Science Pipeline

Before we can leverage any LLMs in our process, we first must engineer data science pipeline steps that support our use case. In this example, we’ll be using the Cora dataset to predict one of seven node classes.

[Figure: data science pipeline]

To do this, we will aim for a data science pipeline with the following steps:

  1. Generate a graph projection containing the nodes and relationships from the Cora dataset
  2. Generate node embeddings using FastRP
  3. Transfer the embeddings and labels from Neo4j to Python
  4. Split train and test sets
  5. Train a variety of classifiers on the data
  6. Compare their results

While these are relatively straightforward data science pipeline steps, we could modify them in a variety of ways, such as by introducing additional information through document embeddings, as I describe in previous posts on text to graph machine learning and graph embeddings, or by adding performance tuning through a grid search mechanism. However, for the purposes of this blog, we’ll keep things simple.

Engineering your Prompt

Next we must engineer our prompt so ChatGPT can do the heavy lifting of building our data science pipeline. However, as I discussed in my post on chain of thought prompting, we should keep a few things in mind for this prompt:

  • Be specific, even overly specific.
  • Provide plenty of punctuation to differentiate terms and sentence structure.
  • Specify the steps needed to complete the task.

After some testing, we arrive at the following prompt:

Given the labeled Cora dataset stored in neo4j with the following neo4j schema: (Paper)-[:CITES]->(Paper) write a python script which:

1) Create a directed graph projection of Papers with and CITES relationships named “papers”.
2) Use the graph data science library to create a fastrp node embedding using the “papers” projections
3) Returns the embedding and the node’s “subject” property as label to a pandas dataframe
4) Uses splits the data into a 70/30 train and test set
5) compares the recall, precision, and F1 scores of logistic regression, random forest, and XGboost to predict the label
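
A prompt of this shape (one context sentence followed by explicitly numbered steps) can also be assembled in code, which makes it easier to adjust a single step between attempts. The sketch below is only an illustration: `build_prompt` is a hypothetical helper, not part of any library, and the step wording is abbreviated from the prompt above.

```python
from typing import Sequence


def build_prompt(context: str, steps: Sequence[str]) -> str:
    """Join one context sentence with explicitly numbered steps."""
    numbered = [f"{i}) {step}" for i, step in enumerate(steps, start=1)]
    return context.strip() + "\n\n" + "\n".join(numbered)


# Hypothetical usage; the wording is abbreviated from the prompt in the article.
prompt = build_prompt(
    "Given the labeled Cora dataset stored in neo4j with the following neo4j "
    "schema: (Paper)-[:CITES]->(Paper) write a python script which:",
    [
        'Creates a directed graph projection of Papers and CITES relationships named "papers".',
        'Uses the graph data science library to create a fastrp node embedding using the "papers" projection.',
        'Returns the embedding and the node\'s "subject" property as label to a pandas dataframe.',
        "Splits the data into a 70/30 train and test set.",
        "Compares the recall, precision, and F1 scores of logistic regression, random forest, and XGBoost to predict the label.",
    ],
)
print(prompt)
```

Keeping the steps in a list means a later refinement only touches one entry instead of the whole prompt string.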

Analyzing and Testing your Result

After arriving at the prompt discussed above, we receive the following Python code:

import pandas as pd
import numpy as np
import neo4j
from neo4j import GraphDatabase
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Define your Neo4j connection parameters
neo4j_uri = "bolt://localhost:7687"  # Change to your Neo4j server URI
neo4j_user = "your_username"
neo4j_password = "your_password"

# Connect to Neo4j
driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_password))

# Create a directed graph projection of Papers with CITES relationships
with driver.session() as session:
    query = """
    CALL gds.graph.create('papers', 'Paper', 'CITES', { relationshipProperties: 'weight' })
    """
    session.run(query)

# Use the graph data science library to create FastRP node embeddings
with driver.session() as session:
    query = """
    CALL gds.fastRP.write('papers', {
        nodeLabels: ['Paper'],
        embeddingDimension: 128,
        relationshipWeightProperty: 'weight',
        writeProperty: 'embedding'
    })
    """
    session.run(query)

# Fetch the embeddings and subject labels
with driver.session() as session:
    query = """
    MATCH (p:Paper)
    RETURN p.embedding AS embedding, p.subject AS subject
    """
    result = session.run(query)
    data = [record for record in result]

# Convert data to a Pandas DataFrame
df = pd.DataFrame(data)

# Split data into 70/30 train and test sets
X = np.stack(df['embedding'].to_numpy())
y = df['subject'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Encode labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Define and train models
models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": xgb.XGBClassifier()
}

results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train_encoded)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test_encoded, y_pred)
    precision = precision_score(y_test_encoded, y_pred, average='weighted')
    recall = recall_score(y_test_encoded, y_pred, average='weighted')
    f1 = f1_score(y_test_encoded, y_pred, average='weighted')
    results[model_name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

# Display the results
results_df = pd.DataFrame(results).transpose()
print(results_df)

Now the savvy among you may have already noticed a crucial flaw in our resulting code. After populating the script with the needed Neo4j credentials, the issue makes itself known.

[Image: error message]

This error highlights what is possibly the most important aspect of utilizing LLMs to do our heavy lifting when programming: their limitations.

Since ChatGPT is trained on data up until 2021 (as of the writing of this blog), it fails to return the most up-to-date syntax when producing code. In this example, it attempts to use the gds.graph.create() function to create our projection. However, in March 2022, with the release of Neo4j’s GDS 2.0, gds.graph.create() was deprecated in favor of the current syntax, gds.graph.project().

It is these subtleties which highlight the weaknesses of LLMs in their current state and the ever-present need for expertise and experience with the underlying technology they interface with. Luckily, this issue is easy to detect and is resolved simply by replacing create with project.
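
As a minimal sketch of that fix, the snippet below applies the rename to the projection query from the generated script and runs it on an open session. The names `LEGACY_QUERY`, `FIXED_QUERY`, and `project_papers` are hypothetical, and the session is assumed to come from `driver.session()` as in the script above.

```python
# The projection query exactly as ChatGPT generated it (pre-2.0 GDS syntax).
LEGACY_QUERY = (
    "CALL gds.graph.create('papers', 'Paper', 'CITES', "
    "{ relationshipProperties: 'weight' })"
)

# The fix described in the article: swap create for project, nothing else.
FIXED_QUERY = LEGACY_QUERY.replace("gds.graph.create", "gds.graph.project")


def project_papers(session):
    """Run the corrected projection on an open neo4j session."""
    return session.run(FIXED_QUERY)


print(FIXED_QUERY)
```

It would be called as `project_papers(session)` inside the existing `with driver.session() as session:` block.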

Another issue you may find with the resulting code is the attempted use of a relationship property, weight, which does not exist in the schema provided to the prompt but does appear in the GDS documentation, which is likely a primary source for ChatGPT.

Failed to invoke procedure `gds.fastRP.write`: Caused by: java.lang.IllegalArgumentException: Relationship weight property `weight` not found in relationship types ['CITES']. Properties existing on all relationship types: []

We can resolve this by simply removing the following line from our FastRP call:

relationshipWeightProperty: 'weight',
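
With that line removed, the FastRP step might look like the sketch below. The remaining parameters are copied from the generated script; `FASTRP_QUERY` and `write_embeddings` are hypothetical names, and the session is again assumed to come from `driver.session()`.

```python
# FastRP write call from the generated script, minus the weight property
# that does not exist on CITES relationships.
FASTRP_QUERY = """
CALL gds.fastRP.write('papers', {
    nodeLabels: ['Paper'],
    embeddingDimension: 128,
    writeProperty: 'embedding'
})
"""


def write_embeddings(session):
    """Write FastRP embeddings to the 'embedding' node property."""
    return session.run(FASTRP_QUERY)
```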

After resolving these errors, our data science pipeline looks to be fully functional and returns the following results:

[Image: results]

Iterate and Refine

Here at Graphable, we are strong believers in the value and necessity of iteration when developing any solution, whether it be for a small internal project or for a Fortune 500 company. No matter the scale, continuously refining an approach yields the kind of improving results that are required to succeed. It is this same mentality which must be adopted when designing robust and reusable LLM prompts.

Once an initial successful prompt is identified, we encourage the practice of taking it further through iteration until the desired consistency, performance, or reproducibility is achieved. This could be in order to avoid the syntax issues identified earlier, or to incorporate additional data science pipeline steps to improve some aspect of the model’s performance. There are many design directions one can follow, just waiting to be found.
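
One lightweight way to support this iteration is to screen each generated script for syntax already known to be outdated before running it. The sketch below is a hypothetical helper rather than a GDS feature. Its lookup table contains only the rename discussed in this article and would need to be extended for other cases.

```python
import re

# Known renames to watch for. Only this entry comes from the article;
# extend the table as further issues are found during iteration.
LEGACY_CALLS = {
    "gds.graph.create": "gds.graph.project",
}


def find_legacy_calls(generated_code: str) -> list:
    """Return (legacy, replacement) pairs whose legacy name is called."""
    hits = []
    for legacy, replacement in LEGACY_CALLS.items():
        # Require "(" right after the name so longer procedure names
        # that merely share the prefix are not flagged.
        if re.search(re.escape(legacy) + r"\s*\(", generated_code):
            hits.append((legacy, replacement))
    return hits


sample = "CALL gds.graph.create('papers', 'Paper', 'CITES')"
print(find_legacy_calls(sample))
# → [('gds.graph.create', 'gds.graph.project')]
```

A non-empty result is a signal to refine the prompt, or to patch the script by hand, before spending time on a run that will fail.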

Conclusion

In this article, we walked through a simple example of how to use LLMs such as ChatGPT to automate the construction of a graph data science pipeline in Neo4j or other graph platforms. While these novel tools help to kickstart the process, it’s critical to remember that they are not magic bullets and often return erroneous information that developers must watch out for.

These errors may be as simple and obvious as syntax errors, but they can also manifest more dangerously as syntactically correct code with a hidden misunderstanding of the problem space or the techniques implemented. If you are looking to start exploring your data, or are ready to move to a production-grade data science pipeline, reach out to leverage our specialist knowledge of graphs, graph data science, or graph machine learning, with no ChatGPT / LLM pipeline drawbacks. Contact us here.

Graphable helps you make sense of your data by delivering expert analytics, data engineering, custom dev and applied data science services.
 
We are known for operating ethically, communicating well, and delivering on-time. With hundreds of successful projects across most industries, we have deep expertise in Financial Services, Life Sciences, Security/Intelligence, Transportation/Logistics, HighTech, and many others.
 
Thriving in the most challenging data integration and data science contexts, Graphable drives your analytics, data engineering, custom dev and applied data science success. Contact us to learn more about how we can help, or book a demo today.
