Application-Driven Graph Schema Design, Demystified
While graph databases like Neo4j are schemaless by default, we should still leverage the power of schemas in development by focusing on application-driven graph schema design. Being schemaless does allow for extreme flexibility, rapid development iterations, and easy evolution over time as applications need to change. However, for the developer of an API, this can be difficult to manage without the help of a schema. The fast-paced changes to the graph database lead to a continuous cycle of reworking queries to align with the revised approach. For an application that uses scores of endpoints, this can ultimately be very time-consuming.
By applying a schema to the database (with tools such as the Hume platform) and following a few simple design rules for the graph schema and API, we can mitigate most of these development woes and, as a serendipitous effect, end up with a self-documenting graph.
Common Graph-based AppDev Issues
The flexibility of a schemaless architecture yields tremendous value both during and after the development phase. However, over time we have encountered some common issues that result in a lot of time spent reworking completed tasks. These impact both the front-end and back-end.
Some examples are:
- Renaming property names.
- Removing/adding properties to a node.
- Moving properties to a child node (that can be shared with other nodes).
- Spelling mistakes on properties in the back-end resulting in NULL values.
- Renaming Labels and Relationships.
- Changing the schema from a one-to-one to a one-to-many.
All of these create a lag between what is stored in the database and what the API sends back, because each change first requires an API update before it is reflected in the responses.
With some simple rules and conventions we can eliminate the majority of the reworking required on the back-end, and simplify the process for the front-end.
The Goal in Graph Application-Driven Schema Design
We want to design a functional Graph Schema that is generic enough to:
- Facilitate fast development iterations.
- Easily set up new API endpoints or modify existing ones.
- Maximize the number of non-breaking changes that can be made to the graph schema design.
We also want to make sure to return all data in an easy-to-use, standardized format, and maximize reusable code by leveraging schema rules.
In this scenario we chose not to go with a GraphQL solution because we want our API calls to be as simple as possible for the front-end developer. This means that our backend needs to handle the Cypher generation to return the results.
By following a handful of rules and our approach as described below, our schema ends up being self-documenting and defines the API contract. This allows everyone, including front-end developers who may have very little knowledge of the database content, to know and understand each endpoint and the data it will return, simply based on the schema.
Step 1: Establish Naming Conventions
The first step is to establish some common naming and structure conventions to our graph schema design. There will usually be some exceptions predicated by the requirements but we want to strive to minimize those to reduce the number of custom API endpoints. A standard endpoint will be able to use common code to return the appropriate objects, while a custom one will need special handling, unique to that endpoint, which means that changes to the schema will possibly require changes in the code (usually to the Cypher query).
Node Relationships
- Relationships should start with a single verb.
- The verb HAS will be used to denote a one-to-many relationship.
- Any other verb is always a one-to-one relationship.
- They contain the name of the node that they point to.
- The Node Label should be split by an underscore and capitalized.
For example, a Scene node that points to a FilmLocation node will have a relationship like VERB_NODE_LABEL, in our case it’s AT_FILM_LOCATION.
By doing this we can infer the node label by splitting the relationship on underscore and combining the remaining parts in PascalCase (CamelCase with first letter capitalized). AT_FILM_LOCATION becomes FilmLocation.
This will be very helpful to the developers as they put together the payloads that are sent between the front-end and back-end, as we will see later.
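The conversion described above is mechanical, so the back-end can implement it once and reuse it everywhere. Here is a minimal sketch (in Python, assuming a Python back-end; the function name is ours, not from the article):

```python
def node_label_from_relationship(rel_name: str) -> str:
    """Infer the target node label from a relationship name.

    Per the naming convention, a relationship is VERB_NODE_LABEL,
    so we drop the leading verb and join the rest in PascalCase.
    """
    parts = rel_name.split("_")
    # Drop the leading verb (HAS, AT, ...) and PascalCase the remainder.
    return "".join(part.capitalize() for part in parts[1:])

print(node_label_from_relationship("AT_FILM_LOCATION"))  # FilmLocation
print(node_label_from_relationship("HAS_SCENE"))         # Scene
```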
Node Labels
- Are always in PascalCase.
Node Properties
- In snake_case (all spaces replaced by underscore)
- Anything that ends in “_date” is an ISO timestamp.
- Generic properties such as “name”, “code”, or “number” refer back to the node they belong to. For example, if “name” is a property of Scene then it is the Scene name.
- This rule keeps the property names simple and predictable.
- If the node label needs to change from Scene to FilmScene we don’t need to go back and also change all the properties from “scene_name” to “film_scene_name”.
Step 2: Establish API Rules
- The API endpoint name is the node label in snake_case. FilmLocation has an endpoint like /v1/film_location
- The endpoints return all of the node properties.
- The endpoints return the immediate outgoing child nodes and their properties.
- If the outgoing relationship starts with “HAS” then we know it is one-to-many and a list of objects is returned.
- Any non “HAS” relationship is returned as an Object.
- The relationship name is used as the KEY for those objects.
- The API doesn’t do any type casting and doesn’t check to see if the property exists, it simply returns all of them.
Given our previous example with a Scene that has a FilmLocation and Actors, we know that by accessing the endpoint /v1/scene our returned payload will be an Object with all of the Scene properties, a nested list with the Actor(s) properties and a nested object with the FilmLocation properties.
With these rules we avoid having to make updates to the query when any property is added, removed or renamed.
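Because the "HAS" prefix tells us the cardinality, the back-end can shape every response with one generic routine. A sketch of that post-processing step (our own illustration, assuming the relationship data comes back as a list of {rel_name, nodes} entries as in the reusable query shown later):

```python
def shape_payload(data: dict, relationships: list) -> dict:
    """Assemble an endpoint response from a node's properties and its
    outgoing relationships, per the API rules above.

    `relationships` is a list of {"rel_name": ..., "nodes": [...]} entries.
    The relationship name is used as the key for each nested value.
    """
    payload = {"data": dict(data)}
    for rel in relationships:
        rel_name = rel["rel_name"]
        nodes = rel["nodes"]
        if rel_name.startswith("HAS"):
            # One-to-many: return the full list of child objects.
            payload["data"][rel_name] = nodes
        else:
            # One-to-one: return a single object (or None if missing).
            payload["data"][rel_name] = nodes[0] if nodes else None
    return payload
```

Calling this for a Scene with two Actors and one FilmLocation yields a "HAS_ACTOR" list and an "AT_FILM_LOCATION" object, exactly as the rules predict.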
Example Schema
Given our set of schema and API rules, anyone can look at this graph schema design above and know:
- All of the endpoint names.
- Which relationships are one-to-many.
- The object and list of objects that each endpoint will return, including the Key that they will be stored under.
Examples of Common Changes and their Impact on the API
Thanks to our rules and setup in our graph schema design, most changes do not introduce any issues with the API; the rest typically require only small changes.
- Adding a new child node
- No impact. We return all child nodes so it will still be returned when calling the parent.
- Minimal impact. A new generic API endpoint can be setup to access the Child node and its children if needed (often not the case).
- Renaming a node Label
- Minimal impact. If an API endpoint exists for that node, renaming the endpoint is the only change needed.
- Changing a Relationship from a one-to-one to a one-to-many or vice-versa
- No impact, provided the relationship name is changed to match our rules.
- Changing a Relationship name
- No impact. We dynamically fetch the Relationship name and return it (see the example reusable Cypher listed below)
- Adding or removing a Relationship
- No impact. We return all outgoing relationships. It will either start showing up or cease being returned.
- Renaming, removing or adding properties to a node.
- No impact. All properties are returned.
Example API
Let’s take a look at what the payload looks like if we were to call the movie endpoint: GET v1/movie/5ad5ac65-a11f-406a-8b76-ed6e454d1bc5
Our “data” object in this case is the Movie. Our nested Objects are the child nodes. We can see that because the relationship between Movie and Scene starts with “HAS” we are returning a list of objects.
If we need more information about one of the Scene objects, we can use the “uuid” for that node returned in that object (not shown but always returned as one of the properties).
This is an example reusable chunk of Cypher that we can use to return any of our nodes and their outgoing child nodes based on our schema and API rules.
Given our rules and the endpoint called we can determine what our node_alias and node_label are going to be.
MATCH ($node_alias:$node_label {uuid: $uuid})
OPTIONAL MATCH ($node_alias)-[r]->(n)
WITH DISTINCT $node_alias, type(r) AS rel_type, COLLECT(DISTINCT properties(n)) AS nodes
WITH {data: properties($node_alias), relationships: COLLECT(DISTINCT {rel_name: rel_type, nodes: nodes})} AS results
RETURN apoc.convert.toJson(results) AS results
We can then perform some post-processing and check the rel_name to see if it should be sent as an object or an array of objects. The final result looks like the payload listed above.
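Since the endpoint name is just the node label in snake_case, the back-end can derive the label and alias from the URL and fill in the reusable query. A minimal sketch of that Cypher generation (names are ours; note that labels cannot be Cypher parameters, so they are interpolated into the query text while the uuid stays a proper $uuid parameter):

```python
CYPHER_TEMPLATE = """MATCH ({alias}:{label} {{uuid: $uuid}})
OPTIONAL MATCH ({alias})-[r]->(n)
WITH DISTINCT {alias}, type(r) AS rel_type, COLLECT(DISTINCT properties(n)) AS nodes
WITH {{data: properties({alias}), relationships: COLLECT(DISTINCT {{rel_name: rel_type, nodes: nodes}})}} AS results
RETURN apoc.convert.toJson(results) AS results"""

def endpoint_to_label(endpoint: str) -> str:
    """Map an endpoint path to its node label: /v1/film_location -> FilmLocation."""
    name = endpoint.rstrip("/").split("/")[-1]
    return "".join(part.capitalize() for part in name.split("_"))

def build_query(endpoint: str) -> str:
    """Fill the reusable query template for a given endpoint."""
    label = endpoint_to_label(endpoint)
    alias = label[0].lower() + label[1:]  # e.g. filmLocation
    return CYPHER_TEMPLATE.format(alias=alias, label=label)
```

Only the label and alias are interpolated; anything user-supplied (the uuid) remains parameterized, which keeps the generated query safe to reuse.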
A POST v1/movie will return the same results as above, but as a list for each of the Movie nodes.
Creating/Updating Nodes With PUT
To create or update nodes we have just a few other rules on how we will process the payload to create our Cypher query.
- If the object contains a “uuid” then we find the node with a MATCH. Otherwise we CREATE the node.
- Any non-uuid properties sent in the objects are updated via SET.
- If the relationship does NOT begin with “HAS_” then remove any previous existing relationships of that kind.
We need rule 3 to keep our one-to-one relationships one-to-one.
Given the example payload to the right, if a Scene is updated with an AT_FILM_LOCATION we know that the Scene can only have one location, so we first detach the previous location (if it exists) via an optional match.
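That detach-then-link step is the same for every one-to-one relationship, so it can also be generated. A sketch of the query builder (our own illustration of the approach, not the article's exact code):

```python
def one_to_one_link_query(alias: str, rel_name: str, child_label: str) -> str:
    """Build the Cypher that relinks a one-to-one (non-HAS) relationship.

    Any previous relationship of this type is deleted first, so the
    parent never ends up pointing at two children.
    """
    return (
        f"MATCH ({alias} {{uuid: $uuid}}) "
        f"OPTIONAL MATCH ({alias})-[old:{rel_name}]->() DELETE old "
        f"WITH {alias} "
        f"MATCH (child:{child_label} {{uuid: $child_uuid}}) "
        f"MERGE ({alias})-[:{rel_name}]->(child)"
    )

print(one_to_one_link_query("scene", "AT_FILM_LOCATION", "FilmLocation"))
```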
For one-to-many relationships we have found that it makes more sense to handle these separately.
We have also observed two common patterns for dealing with outgoing one-to-many relationships:
- Replace all the child nodes when the parent is updated
- ex: Parent node is Scene and child nodes are Actors in that Scene
- In this situation we handle it in the UI with a separate API. The user should remove the Actor which calls an API to detach the node from the parent.
- Append the new child nodes
- ex: Parent node is an Item and child nodes are different Price levels, but we need to keep the old price levels.
PUT v1/scene
In the example payload above the user is calling a PUT for a Scene.
Based on our rules we can generate the Cypher needed because we know:
- The user wants to update Scene with uuid = a7962197…
- We want to update the “code” property.
- We want to link it to a FilmLocation with uuid = 98a3c5cd…
- We want the Scene to have two Actors:
- One Actor with uuid = 8a2f3228… already exists.
- One is a new Actor with a property “name” = Bill. We will need to create this Actor and SET its properties.
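The MATCH-versus-CREATE decision for each object in the payload follows directly from the rules, so it too can be generated. A minimal sketch (our own names; randomUUID() is the Neo4j built-in, and properties are parameterized rather than inlined):

```python
def upsert_clause(alias: str, label: str, obj: dict) -> str:
    """Generate the MATCH-or-CREATE clause for one object in a PUT payload.

    If the object carries a "uuid" we MATCH the existing node;
    otherwise we CREATE it. Non-uuid properties are applied via SET.
    """
    if "uuid" in obj:
        clause = f"MATCH ({alias}:{label} {{uuid: $uuid_{alias}}})"
    else:
        clause = f"CREATE ({alias}:{label} {{uuid: randomUUID()}})"
    props = [key for key in obj if key != "uuid"]
    if props:
        sets = ", ".join(f"{alias}.{key} = ${alias}_{key}" for key in props)
        clause += f" SET {sets}"
    return clause
```

For the Scene payload described above this produces a MATCH for the existing Scene and Actor (each with a SET for any changed properties) and a CREATE plus SET for the new Actor named Bill.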
Exceptions to the rules
Exceptions will always exist in any project, even the most organized ones with strong graph schema designs. Even so, we have found that often, with some clever rethinking of the problems, we are able to adjust the schema and still follow the rules.
There is a fine line between adding complexity to the rules and just handling the exception with some custom code, both could be valid solutions depending on the circumstances. But we would caution to err on the side of the KISS principle. If the rules get too complex, they get harder to follow and you end up backing yourself into a corner with complex code that handles all your rules and still needing exceptions.
Other Benefits
Some other benefits not discussed above with application driven graph schema designs come into play when implementing filters, text searches, ordering results and permissions. These often depend on the project, but by being able to predict our node Labels and Relationships we can easily craft some reusable code that will generate our Cypher.
For example when setting up a new endpoint with full text search we can run the following:
CALL db.index.fulltext.createNodeIndex($node_alias + "Search", ["$node_label"], ["$prop1", "$prop2"])
And when calling the endpoint with a POST we can start our query with:
CALL db.index.fulltext.queryNodes($node_alias + "Search", "$search_term")
YIELD node
MATCH ($node_alias:$node_label)
WHERE id($node_alias) = id(node)
Conclusion
The unique value of application development on graph is becoming more evident every year, as graph databases are leveraged more and more by businesses and organizations of all types around the world. The flexibility in development (particularly the ability to leverage application-driven graph schema design), the speed with which applications can operate at scale on graphs, and even the data science required to drive value out of the information in those applications are all fueled by this evolution toward graph-based AppDev. Check out the blog for more info, and the graph-based AppDev series with webinars digging into additional aspects of this topic.
Check out additional Neo4j use cases including a contact tracing graph database, exploring bike data using the Uber H3 Grid System, graph databases in pro sports, graph database fraud detection, and analytics with ChatGPT (also about What is ChatGPT? and the ChatGPT topic in general).
Read Other Graphable AppDev Articles:
- What is an AWS Serverless Function?
- How to Configure an AWS Serverless API
- The Power of Graph-centered AppDev – Graph Database Applications
- Graph AppDev with GraphQL Relay
- Neo4j GraphQL Serverless Apps with SST
- Streamlit tutorial for building graph apps
- How to build a Domo app using React
Still learning? Check out a few of our introductory articles to learn more:
- What is a Graph Database?
- What is Neo4j (Graph Database)?
- What Is Domo (Analytics)?
- What is Hume (GraphAware)?
Additional discovery:
- Hume consulting / Hume (GraphAware) Platform
- Neo4j consulting / Graph database
- Domo consulting / Analytics - BI
We would also be happy to learn more about your current project and share how we might be able to help. Schedule a consultation with us today. We can also discuss pricing on these initial calls, including Neo4j pricing and Domo pricing. We look forward to speaking with you!