Serving data to front-end developers who are not Cypher experts while also maintaining many API endpoints can be a hassle. In this article, we’ll explore core principles for optimal graph database schema design that make querying and writing data back to the graph much easier. We’ll also look at how best to format the returned data and share other lessons learned across different projects.
Graph Database Schema Design
As you may know, Neo4j is schemaless, and that presents many advantages. It’s extremely flexible, and it allows for rapid iteration during development and for evolution over time as an application’s needs change.
But difficulty can arise when all these changes happen really fast and the developer needs to keep up. Changes to the schema can potentially break everything on the API side. Over time, we’ve noticed that a large number of clients start out with no Neo4j naming conventions and no rules in place. This can lead to everyone spinning their wheels, trying to keep up with the changes instead of progressing and implementing new features.
One example we’ve seen is Joe the Undecided, who often renames properties. He’ll change date to start date, and that means that in the backend, I, the developer, need to go and update every reference to that property. Then two weeks later, he’ll change his mind and change it to start timestamp instead of start date, and again, I need to go update that. It makes it hard for me to be friends with Joe.
We also have Ted the Improver, who wants to move all the address properties off the nodes into a place node since they could be shared. This makes complete sense, and we want to do this, but that means that now I have to go through all my code and change all my queries to make sure that that happens.
Sometimes I also get comments at 6 a.m., like “Why haven’t you sent back all the new properties that we just added last night?” or “Why is this property always returning a null?” That one’s on me usually because it’s a misspelling.
There are also situations where suddenly we need to collect all the nodes when we want to return them because now it’s a one-to-many relationship instead of a one-to-one. So I end up finding myself continuously reworking queries and spending a lot of time doing that.
The solution we want to end up with sits somewhere between a GraphQL-style API, where the onus is on the frontend, and an API where everything is hard-coded. We essentially want APIs that reuse the maximum amount of code and queries, so we want everything to be standardized.
API Endpoints and Graph Database Schema Design
Our goal is to create a functional graph database schema design that’s generic enough to facilitate iterations of development, makes it easy to set up new API endpoints or edit existing ones, and maximizes the number of non-breaking changes that can be made to the graph schema, so that we can continue without breaking things. We also want to return all of our data in an easy-to-use, easy-to-understand, standardized format and maximize our generic code.
What ends up happening is that our schema becomes self-documenting (we’ll look at what that means a little later on) and it defines the API contract. Using that, the frontend developer, or any other developer, can just take a look at the graph database schema design and understand what objects are going to be returned, what the endpoints look like, etc.
Today we’ll go over some Neo4j naming conventions and some API rules that’ll help us keep all this in place, and then we’ll look at a few examples.
Neo4j Naming Conventions
For the Neo4j naming conventions on the graph schema side, there are just a couple of rules that we want to follow. We want to start with a verb for all our relationships. We want to have the relationships contain the name of the node that they point to. In this case, we can see that our verb is “AT” and it points to a FilmLocation node. So our relationship name is AT_FILM_LOCATION.
And then we want to start every relationship with a “HAS” if it’s a one-to-many relationship. This scene, for example, can have multiple actors, so that’s going to be “HAS” and then the name of the node, ACTOR. Anything else that’s not a “HAS” is going to be a one-to-one. This is going to really help us understand the relationships down the road. We’ll look at that.
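The two naming rules can be sketched in a few lines of Python. This is only an illustration of the convention, not code from the talk: the helper names `relationship_name` and `is_one_to_many` are hypothetical, and the sketch assumes node labels are written in CamelCase.

```python
import re


def to_upper_snake(label: str) -> str:
    """Convert a CamelCase label such as 'FilmLocation' to 'FILM_LOCATION'."""
    # Insert an underscore wherever a lowercase letter or digit meets an uppercase letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", label).upper()


def relationship_name(prefix: str, target_label: str) -> str:
    """Build a relationship name from its prefix and the label it points to."""
    return f"{prefix.upper()}_{to_upper_snake(target_label)}"


def is_one_to_many(rel_name: str) -> bool:
    """By convention, only relationships starting with HAS are one-to-many."""
    return rel_name.startswith("HAS_")


print(relationship_name("AT", "FilmLocation"))  # AT_FILM_LOCATION
print(relationship_name("HAS", "Actor"))        # HAS_ACTOR
print(is_one_to_many("HAS_ACTOR"))              # True
print(is_one_to_many("AT_FILM_LOCATION"))       # False
```

Because the cardinality is encoded in the name itself, any code that sees a relationship name can decide how to treat it without consulting a separate schema definition.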
API Property Rules
The only other rule that we have is on the properties: anything that’s a date needs to end with ‘_date’ so we know what it is.
On the API side, the first rule we’ve defined is that each API endpoint is named after the node_alias, which is essentially the node label, just underscored. Again, looking at FilmLocation, the endpoint becomes Film_Location.
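As a rough sketch of that mapping, the alias can be derived from the label mechanically, and the label can be recovered from the alias. The function names `node_alias` and `label_from_alias` are made up for this example, and CamelCase labels are assumed.

```python
import re


def node_alias(label: str) -> str:
    """Turn a CamelCase node label into its underscored endpoint name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", label)


def label_from_alias(alias: str) -> str:
    """Recover the node label from an endpoint name by dropping the underscores."""
    return alias.replace("_", "")


print(node_alias("FilmLocation"))         # Film_Location
print(node_alias("Movie"))                # Movie
print(label_from_alias("Film_Location"))  # FilmLocation
```

Since the mapping works in both directions, a single generic route handler can accept the alias from the URL and work out which label to query.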
The endpoints are going to return all of that node’s properties, and they’re also going to return all of the outgoing child nodes. So if the child node is connected by a “HAS” relationship, it’ll be returned as a list of objects, or if it’s just a one-to-one, it’ll be an object. We’re going to use the relationship name as our key for those objects.
We’re not going to do any type casting on the backend. We’re not going to check for existing properties. We’re just going to return exactly what’s in the database. This avoids issues with property name changes.
Now I don’t need to go and update those in all of my queries. I’m just returning them all regardless. It avoids issues with property type changes, because I’m not worrying about any of the types. Any new properties that are added are automatically returned, because we’re returning all properties. We also avoid returning null for nonexistent or misspelled properties.
So let’s take a look at one of our schemas. Right off the bat, because of the rules that we’ve already established, we can determine what the names of the endpoints for each of these nodes are going to be. We also know what each endpoint is going to return.
In this case, we know that Movie is going to be the movie endpoint. It’s going to return a studio object because that relationship is an “AT,” a company object because that relationship is a “FOR,” and a list of scene objects because a movie can have multiple scenes. The same goes for HAS_ACTOR: each scene can have multiple actors, so actors come back as a list.
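To make the difference concrete, here is a small sketch that contrasts a hard-coded response with a generic one. The node is represented as a plain dictionary, and the property names and values (`title`, `start_date`, `rating`) are invented for the example.

```python
from typing import Any


def hard_coded_view(node: dict[str, Any]) -> dict[str, Any]:
    # Names every property explicitly, so a renamed property silently becomes None.
    return {"title": node.get("title"), "date": node.get("date")}


def generic_view(node: dict[str, Any]) -> dict[str, Any]:
    # Copies whatever is stored, with no type casting and no property checks.
    return dict(node)


# Hypothetical Movie node after "date" was renamed to "start_date"
# and a new "rating" property was added.
movie = {"title": "Example Movie", "start_date": "2021-06-17", "rating": "PG"}

print(hard_coded_view(movie))  # {'title': 'Example Movie', 'date': None}
print(generic_view(movie))     # all three stored properties, unchanged
```

The hard-coded version is the one that produces the 6 a.m. question about a property that always returns null. The generic version picks up the rename and the new property without any code change.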
We know all this information because of our rules for the graph database schema design, and we know what it’s going to return because of our rules on the API. This just makes it really easy for anyone to come and look and see what they want and see where they need to get it from.
It also, on the backend, allows me to write super generic queries like match this movie, and then optionally match any outgoing relationship, and if it’s a “HAS,” return it as a list of objects. That allows generic endpoints that reuse code as much as possible.
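A generic endpoint along those lines might look roughly like the sketch below. It is not the query from the talk: the `uuid` property name, the function names, and the sample rows are assumptions, and it only walks one level of outgoing relationships. The label is interpolated into the query text, so it is validated first. The Cypher functions used (`properties()`, `type()`, `collect()`) are standard.

```python
from typing import Any


def build_get_query(label: str) -> str:
    """Return one generic Cypher query that works for any node label."""
    # The label becomes part of the query text, so reject anything unusual.
    if not label.isidentifier():
        raise ValueError(f"unsafe label: {label!r}")
    return (
        f"MATCH (n:{label} {{uuid: $uuid}})\n"
        "OPTIONAL MATCH (n)-[r]->(child)\n"
        "RETURN properties(n) AS props, type(r) AS rel, "
        "collect(properties(child)) AS children"
    )


def rows_to_payload(rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold the query rows (one per relationship type) into a single object."""
    if not rows:
        return {}
    payload = dict(rows[0]["props"])
    for row in rows:
        rel, children = row["rel"], row["children"]
        if rel is None:
            continue  # the node has no outgoing relationships
        if rel.startswith("HAS_"):
            payload[rel] = children      # one-to-many: list of objects
        elif children:
            payload[rel] = children[0]   # one-to-one: single object
    return payload


# Hypothetical rows, shaped like the result of the query above.
props = {"uuid": "m-1", "title": "Example Movie"}
rows = [
    {"props": props, "rel": "AT_STUDIO", "children": [{"uuid": "s-1"}]},
    {"props": props, "rel": "HAS_SCENE", "children": [{"uuid": "sc-1"}, {"uuid": "sc-2"}]},
]
print(build_get_query("Movie"))
print(rows_to_payload(rows))
```

Nothing in either function mentions Movie, Studio, or Scene by name, which is what lets the same code serve every endpoint.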
For the GET example here, we’re going to pass in the UUID. In this case, the frontend developer learns everything he needs to know about the schema on the backend from the payload that’s returned.
He knows that by calling this specific movie, he’s going to get all the properties. Because it’s a list of objects here, he knows that scene is a one-to-many, and he knows that this movie has a single relationship to studio and a single relationship to company, and he knows what these relationships are named. So based on our payload here, the frontend developer can even know what the schema is on the backend.
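Reading the schema back out of a payload can be sketched as follows. The payload here is invented for illustration, and the sketch assumes that nodes have no array-valued properties, so every list in the payload is a list of child objects.

```python
from typing import Any


def describe_payload(payload: dict[str, Any]) -> dict[str, list[str]]:
    """Classify each key of a payload by the shape of its value."""
    schema: dict[str, list[str]] = {"properties": [], "one_to_one": [], "one_to_many": []}
    for key, value in payload.items():
        if isinstance(value, list):
            schema["one_to_many"].append(key)   # list of objects
        elif isinstance(value, dict):
            schema["one_to_one"].append(key)    # single object
        else:
            schema["properties"].append(key)    # plain property
    return schema


# Hypothetical GET response for a movie.
payload = {
    "uuid": "m-1",
    "title": "Example Movie",
    "AT_STUDIO": {"uuid": "s-1", "name": "Example Studio"},
    "FOR_COMPANY": {"uuid": "c-1", "name": "Example Company"},
    "HAS_SCENE": [{"uuid": "sc-1"}, {"uuid": "sc-2"}],
}
print(describe_payload(payload))
```

For this payload the function reports two plain properties, two one-to-one relationships (AT_STUDIO and FOR_COMPANY), and one one-to-many relationship (HAS_SCENE), which matches what the naming convention says about them.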
The only other rules that we have for the API are going to be in our updating or creating new nodes. What we want to do here is we want to say that if the payload that the frontend developer sends to the backend contains a UUID, then we know that we’re matching that node.
If it doesn’t contain a UUID, then we know that we’re creating a new node. With that, we know what the node label needs to be based on the relationship or based on the endpoint if it’s at the top level. In this case, if he’s calling scene, we’ll know that he’s looking for this scene or updating this scene. If it didn’t include a UUID, we would know that he’s trying to create this scene.
Then the other rule that we have is that any property that’s passed will get updated. So if he doesn’t pass any properties, we won’t do any updating to that node. And any non-“HAS” relationship, so any other relationship that’s an object, we’re going to detach those so that we can attach the new relationship that’s passed.
So in this case, he might be trying to attach a new film location to this scene, so we’re going to detach any other existing film locations – it should be just one – and attach this one.
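The write rules can be sketched as a small planner that turns a payload into an ordered list of steps, without touching a database. Everything here is illustrative: the step names, `plan_write`, and `label_from_relationship` are hypothetical, the relationship prefix is assumed to be a single word (AT, FOR, HAS), and attaching HAS children without detaching the existing ones is our assumption, since the rules above only mention detaching for one-to-one relationships.

```python
from typing import Any


def label_from_relationship(rel_name: str) -> str:
    """'AT_FILM_LOCATION' -> 'FilmLocation' (drops the single-word prefix)."""
    return "".join(part.capitalize() for part in rel_name.split("_")[1:])


def plan_write(label: str, payload: dict[str, Any]) -> list[tuple]:
    """Translate a payload into ordered write steps, following the rules above."""
    steps: list[tuple] = []
    uuid = payload.get("uuid")
    # A UUID means "match this node"; no UUID means "create a new one".
    steps.append(("MATCH", label, uuid) if uuid else ("CREATE", label))
    # Only the properties that were actually passed get updated.
    props = {k: v for k, v in payload.items()
             if k != "uuid" and not isinstance(v, (dict, list))}
    if props:
        steps.append(("SET", label, props))
    for key, value in payload.items():
        if isinstance(value, dict):
            # One-to-one: detach whatever is there, then attach the new node.
            steps.append(("DETACH", label, key))
            steps.extend(plan_write(label_from_relationship(key), value))
            steps.append(("ATTACH", label, key))
        elif isinstance(value, list):
            # One-to-many: attach each child (assumed to be an object).
            for child in value:
                steps.extend(plan_write(label_from_relationship(key), child))
                steps.append(("ATTACH", label, key))
    return steps


# Attach an existing film location to an existing scene.
for step in plan_write("Scene", {"uuid": "scene-1", "AT_FILM_LOCATION": {"uuid": "loc-9"}}):
    print(step)
```

For the scene example, the plan is: match the Scene, detach its current AT_FILM_LOCATION, match the FilmLocation, and attach it. No SET step appears because no properties were passed.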
That’s basically it. Based on those rules, now on the PUT side, again, it’s just a really simple chunk of code that can handle any and all of our needs for updating the graph. Those are all the rules and conventions.
They allow us to predict the schema based on the payloads, and then conversely, they allow us to predict the payload based on the schema. And we’re able to use the same code for the majority of our endpoints, and we avoid breaking the API with most of our graph database schema design changes.
Access Full Graph Database Schema Design Video
Access the entire talk on demystifying graph database schema design, given by Graphable Senior Consultant Sam Chalvet for Neo4j’s Nodes 2021 (#nodes2021), by clicking here. The video comes from the Nodes 2021 online conference.