With the release of the Neo4j Graph Data Science (GDS) Python driver, it’s easier than ever to integrate the power of graph data science into your Python analysis workflows. In this video, Graphable Lead Data Scientist Sean Robinson takes a look at these new features in a working context by comparing a link prediction machine learning pipeline in GDS 1.8 to the brand new GDS 2.0 driver for Python.
Video Transcript: A Look at Neo4j Graph Data Science 2.0: Link Prediction Example
Today, I’m excited to do a code walkthrough comparing two ways of implementing a link prediction pipeline in the Neo4j Graph Data Science library. Specifically, we’re going to conduct a side-by-side comparison of how to implement this pipeline in GDS 1.8 versus the all-new GDS 2.0 Python driver. And as we’ll see, there’s a lot to be excited about. So, let’s jump right into exploring Neo4j Graph Data Science 2.0.
How to Implement a Link Prediction Pipeline in Neo4j Graph Data Science Library
To start, I encourage you to go over to GitHub, pip install the Graph Data Science driver, and take it for a spin. And with that, we will jump right into the code. First up is our setup, and you can see I’ve delineated the 1.8 approach versus what we’re doing in the driver in the headers above.
Instantiating a Graph Data Science Object
So largely, things are exactly the same, but there are a few differences. What’s the same is that I’m still just declaring my URI and my credentials, as you can see here on the first two lines of each cell. But instead of instantiating a graph database driver object, I’m instantiating my GraphDataScience object.
The other thing that I would previously do is implement my run_cypher function. This was to abstract away the need for dealing with drivers, transactions, and sessions, so I could just run my Cypher and get the results back easily. But you’ll see we’re not doing that here, because it’s natively embedded into the Neo4j Graph Data Science driver.
And with that said, we don’t even need to use that functionality, because we won’t be running any Cypher with the driver. We’re going to be working directly with the underlying objects and abstractions, so there’s no need for any Cypher on the driver’s part.
Creating a Link Prediction Pipeline in Python
Now, I’m going to start by creating my link prediction pipeline, which you can see here and here, using largely the same syntax, and that’s actually a benefit. If you’re already familiar with the Graph Data Science library, there’s a pattern to how you call certain functions, and that pattern is embedded in both the library and the driver.
So all of those will line up the same: if you’re used to the order in which you call functions, that is the exact order you’ll call them in the driver. We can see a couple of differences, like the fact that our link prediction pipeline has been upgraded to beta, which is great.
Adding Embeddings to Link Prediction Pipeline
Next, we’re going to look at adding our embeddings to our link prediction pipeline. And you can see a lot of the structure is exactly the same. We’re passing the same arguments in largely the same way, except we no longer need to specify things like which pipeline we’re going to add our embeddings to, because we’re calling the function directly on our pipeline object. Since we’re working with the underlying objects themselves, those details are already known.
Python Dynamics and Graph Data Science Library Implementation
Next, let’s look at how we can combine the dynamics of Python with the Graph Data Science library. Here, I’m adding PageRank and betweenness centrality, as I’ve done in previous videos. But in the driver version, I’m going to do that with a simple for-loop in two quick lines of code.
So previously, I would build these separate function calls as strings and then pass them to my run_cypher function. Now, I can just write a for-loop that calls pipe.addNodeProperty, and they’re easily added: two quick lines of code and done.
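The for-loop pattern described above can be sketched like this. Note that StubPipeline is a hypothetical stand-in written for illustration; in the real driver you would call addNodeProperty on the pipeline object returned by the GDS client.

```python
# Hypothetical stub standing in for the pipeline object the GDS driver
# returns; it only records each call so the loop pattern is visible.
class StubPipeline:
    def __init__(self):
        self.node_properties = []

    def addNodeProperty(self, procedure, **config):
        # The real driver would forward this to the GDS library;
        # here we just record what was requested.
        self.node_properties.append((procedure, config))

pipe = StubPipeline()

# Two quick lines: loop over the algorithms and add each as a node property.
for algo in ["pageRank", "betweenness"]:
    pipe.addNodeProperty(algo, mutateProperty=algo)
```

The same loop would work unchanged against a real pipeline object, since the driver exposes these operations as ordinary Python methods.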
Adding Features to Pipeline
Next, I’m going to add my features to my pipeline. Again, largely the same code, not much difference here, except for the fact that it’s a little less to type, and we’re calling it directly on our pipeline object. That holds true for splitting between train and test as well.
Where we can really start to see the convenience is in things like creating our projections. One thing to note is that in 2.0, the previous graph.create has been renamed to graph.project. And with the driver, it’s much simpler to parameterize the arguments to my function.
So here, I might want to change what my projection does, or run different experiments with certain nodes or certain edges included or excluded. Now, the savvy among you will say: well, you could do that with the string-based version, by using an f-string or .format(). But that introduces its own set of small challenges.
So if I were to change this to an f-string, we can see that the curly brackets, which are part of my Cypher syntax, now become part of the f-string syntax, and there’s confusion between these two sets of instructions when interpolating my underlying Cypher.
So for this, I would actually have to come in here and add a second pair of curly brackets. And again, this isn’t the end of the world, but it’s one more thing I have to worry about and track, and it makes it a little harder when I’m trying to use the underlying variables to switch things around.
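The curly-bracket clash is easy to demonstrate in plain Python: inside an f-string, every brace that belongs to the Cypher map syntax has to be doubled to come out as a literal brace. The query below is just an illustrative projection call, not taken from the video’s notebook.

```python
rel_type = "KNOWS"

# Plain string: the braces belong to the Cypher map syntax.
plain = "CALL gds.graph.project('g', 'Person', {KNOWS: {orientation: 'UNDIRECTED'}})"

# f-string: single braces would be read as interpolation slots, so every
# literal Cypher brace has to be doubled ({{ and }}).
interpolated = (
    f"CALL gds.graph.project('g', 'Person', "
    f"{{{rel_type}: {{orientation: 'UNDIRECTED'}}}})"
)
```

Both strings end up identical, but the doubled braces are exactly the extra bookkeeping the transcript is describing.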
With the new version of the Graph Data Science driver, I don’t need to worry about using string interpolation to pass in variables. I can just instantiate the variables right here, declare my list, declare my dictionary, switch those around as I want to change an experiment, and pass them to my function to create my embedding or my projection.
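As a minimal sketch of that idea, the projection arguments can live as plain Python objects. The label and relationship names below are illustrative, and the gds.graph.project call is shown only as a comment since it needs a live Neo4j instance.

```python
# Projection specs as ordinary Python objects (names are illustrative).
node_projection = ["Person", "Movie"]
relationship_projection = {"ACTED_IN": {"orientation": "UNDIRECTED"}}

# Switching experiments is just reassigning variables; no string
# templating or brace escaping is involved.
experiment_b_nodes = ["Person"]
experiment_b_rels = {"KNOWS": {"orientation": "UNDIRECTED"}}

# With a live driver, each experiment would then be projected with
# something like:
#   G, result = gds.graph.project("my-graph", node_projection,
#                                 relationship_projection)
```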
Training Your Link Prediction Model
So next, let’s look at training our link prediction model. And one of the things you can see right away is that it’s far less code: here, it’s taking me about eight lines of Cypher versus three lines of Python. In the old version, I have to declare things like which projection I want to use, spelling out the specific string, and I have to remember that name or keep it in some variable. In the Python driver, I already have it stored as an object, G, and I just pass that.
And the same goes for which pipeline. Again, I have to specify the specific pipeline I want in the old version, but now I just call train as a function on my pipeline object, making it even simpler. As for returning specific metrics and model info, I don’t have to worry about that either, because the driver already knows that’s what I want; it’s part of what my train function returns, so I get it automatically.
And when it comes to streaming the results back, it’s the same story: less code, simpler to do. No need for all my different Cypher calls, no need to specify the string that refers to the projection I want or the specific pipeline model. I already have those stored in my trained pipeline model object and my graph object G, and I can just stream my results back simply and easily.
And even better, these are returned back in a Pandas DataFrame. So if I want to then use these for some downstream tasks, or store them as a CSV or whatever that might be, that’s super easy to do. It’s a standard Pythonic object that I’m already familiar with as a data scientist.
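Because the results come back as a pandas DataFrame, downstream handling is standard pandas. The DataFrame below is a made-up stand-in for a streamed prediction result (the column names are illustrative, not the driver’s actual output schema):

```python
import pandas as pd

# Illustrative stand-in for a streamed link prediction result; with the
# driver, an equivalent DataFrame comes back from the stream call.
predictions = pd.DataFrame(
    {"node1": [0, 1], "node2": [2, 3], "probability": [0.91, 0.47]}
)

# Standard pandas from here on: filter to likely links, then serialize
# to CSV for a downstream task.
likely_links = predictions[predictions["probability"] > 0.5]
csv_text = likely_links.to_csv(index=False)
```

Anything you would normally do with a DataFrame, such as filtering, joining, or writing to disk, works on the streamed results with no conversion step.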
So, go pip install the Python driver, take it for a spin, and be sure to share what you create with us. I look forward to seeing what you do.
Graphable delivers insightful graph database (e.g. Neo4j consulting) / machine learning (ml) / natural language processing (nlp) projects as well as graph and Domo consulting for BI/analytics, with measurable impact. We are known for operating ethically, communicating well, and delivering on time. With hundreds of successful projects across most industries, we thrive in the most challenging data integration and data science contexts, driving analytics success.
Want to find out more about our Hume consulting on the Hume knowledge graph / insights platform? As the Americas principal reseller, we are happy to connect and tell you more. Book a demo by contacting us here.
Check out our article, What is a Graph Database? for more info.