Join Graphable for the replay of the second episode about backend architecture and graph schema design in our year-long video series Graph-Centered AppDev, covering the common pitfalls of RDBMS-based applications and why and how to move to our new graph-based paradigm.
Top Two Backend Architecture Session Takeaways
- There’s a new and better way to design applications using a graph database at the core.
- In this particular session, you will see more specifically how the approach works on the backend of the application (e.g., DB design, APIs, etc.).
In the replay of this episode, we start right at the core, focusing specifically on the backend and exploring how a graph database can make your application faster, more resilient, more powerful, and easier to upgrade. Episodes in the series air during the first week of every month, on Tuesdays at 12:00 p.m. ET.
Format: Roundtable discussion on backend architecture and graph schema design with some supporting slides. This particular event is not a hands-on session, but if you think that would be a good idea for future events, please reach out to email@example.com.
Audience: The session on backend architecture and graph schema design should be interesting to anyone who has experienced the shortcomings of RDBMS-based applications and is in a position to impact their organization’s direction for new and re-platformed apps. Graph database and graph schema design enthusiasts, IT managers, architects, DBAs, and AppDev technical people will benefit uniquely from this content on backend architecture.
- It is not necessary to have seen the first session in this series, “What’s wrong with your application?”, since there will be a recap, but it would be helpful (replay, here).
- No specific technical skills in backend architecture or graph schema design are required, but to derive the most value from this session on backend architecture, you should at least understand appdev and database concepts and their implications.
Video transcript:
[00:00:00] Will Evans: I’m Will Evans, the VP of Consulting, and also on with me today is Lee Hong, our Director of Data Science as we discuss backend architecture and graph schema design.
What we’re going to go through today is Episode 2 of our webinar series on backend architecture and graph schema design. You can see looking at our schedule that we’re setting up a year-long series of videos this year on building graph-centered applications. We started last month with a general overview of what’s wrong with your application, all of the trials and tribulations of building relational-based applications. Today what we’re going to focus on is, let’s say we change up that paradigm and we use a graph as the backend. We’re going to focus on backend architecture.
As we go along throughout the year, we’re going to move on to some of the other benefits and some of the other pieces of the backend architecture that we recommend. So please tune in for further episodes of this series – and as you can see from things like the Star Wars theme song, we’re trying to make sure that these are fun and interesting and not just some disembodied heads reading from a PowerPoint. We’re going to quickly get out of PowerPoint and into some live data walkthroughs today as well.
We’re going to do a quick review of what we talked about previously. We’re going to go into an overview where we talk a little bit about the extended sports metaphor that we’re considering throughout your application build process. After that, we’re going to dive right into exploring some of the backend architecture models and looking at some schemas that we’ve created for demos, potentially some overviews of graph schema designs we’ve looked at for actual clients.
Then we’re going to talk about the whiteboard model, which is a really effective way to design graph schemas using a graph database. Then Lee is going to talk a little bit about schema design for Graph Data Science, and lastly, we’ll get into the Q&A to answer any questions on backend architecture or graph schema design.
I see for our poll results that a lot of people here are pretty new to graph and backend architecture. Not necessarily their first intro, but definitely early on. Most people haven’t deployed a production graph app in their org. That’s perfect. This session is for you, and we’re going to cover how you might work towards that as part of our backend architecture.
Looking at an overview – oh, and just briefly, actually, if you’re looking at your Demia [00:02:16] window, there’s a black bar at the top where Lee and I should be sitting. You can also hover your mouse over that and drag it up to make the presentation larger and our faces smaller if that’s helpful. There’s some detail in some of these images.
What you can see here is a sample application stack backend architecture that we put together. This is really the core of what we’re looking at throughout this entire year. At the top, you’ve got some of the technologies that we’re leveraging here – Neo4j, Hume, and Domo. This particular stack is built on top of AWS, but there’s a lot of different ways you can go about that.
In our fully-fledged backend architecture, we actually recommend using serverless, especially if you’re deploying consumer-facing applications with a web or SaaS-based offering. And you can see here on the backend, down in this corner, we have Neo4j in the bottom left-hand corner as well as some S3, potentially a DynamoDB instance, also a little bit of Redshift and MySQL. But the importance is to scale here, and we’re really looking at having Neo4j be the core of the application. We’re going to discuss why that’s important to backend architecture overall throughout the next 30 or 45 minutes here.
When we go into talking about our overview – bringing you out of the backend architecture presentation for a second here – what Lee and I discussed as we were coming up with planning for this webinar, we ended up settling on this concept of sports as a metaphor, and it really ended up working incredibly well.
Just to backtrack on that a little bit, what we talked about was that sports teams have a whole number of different moving pieces; they have a whole different number of controlling users. You’ve got a general manager, you’ve got a coach, you’ve got an owner, you’ve got teammates that all do different things. They all have different goals. It really translates well to an application. I think, Lee, you said in terms of building a sports team, you can’t take an NFL team and expect them to go win the World Series, and that’s also true. What we’re introducing here is not just a – you can’t take your application and go do anything with it, but you can build a really good baseball team that’s going to be able to succeed over the extended time period.
That’s related to a few things. Graph databases don’t create superstars in the way that relational databases do. In a relational database, you often have one most important table, and everything has to be close to that table; otherwise you can’t get to your data. So you end up with a star pitcher, or your orders table, in the middle of your schema, and it starts to be really valuable in the beginning, and then over time it becomes inflexible. It gets a little bit old, it gets a little bit slower, but your whole organization is built around that individual superstar or that table, and it’s very hard to change. [00:05:00] In graphs, because we don’t pay that penalty for moving away from our individual entities and being flexible, it makes it much easier for us to adapt over time as we keep our goals in mind.
I’m going to dive really straight in here, to not bore everyone in PowerPoint, into a screenshare. What I’m going to share is actually the Hume platform. What you’ll be seeing up on your screen is one of the schemas that we use in a number of our demos. This is an actual – it’s not necessarily a fully-fledged production schema, but this is a schema that we actually use in a number of our demos, and I just want to talk about it briefly in terms of its power here.
If I come over here and actually look at a perspective – or back in the schema – you can see in our schema that we’ve got a relatively simple outlay here. We’re looking at a use case where we’re trying to make good beer recommendations. In our schema, we have beers, we have user reviews, and we have users. That’s really the three core pieces of our schema.
Then we’re using some of our semantic rules. This is one of our first tips, and as one of our handouts we’re going to give everyone – right now that backend architecture diagram is available as a handout in the Demia [00:06:27] platform. So you can go in there and download that handout if you’re interested. At the end, we’ll share a handout on our top 5 tricks for graph schema design and backend architecture.
One of the first ones that we talk about is actually using some of these semantic rules. If I zoom in on this relationship, you can see that this relationship is called HAS_A because a beer has a review. One of the things we’ve found over time is that prefacing this specific relationship with “HAS” becomes incredibly powerful, because then we’ve set up a rule where we know from a semantic standpoint that beer will have one or more reviews. So we can use that rule throughout our schema to prefix a lot of those one-to-many relationships with “HAS,” and that’s really powerful over something like a traditional relational schema where you might have a join table somewhere that’s even obfuscating what the relationships are. In a graph, we can actually add those semantics in and increase our understanding.
We’re also following in the schema our subject-VERB-object paradigm where we have a subject (a beer), the relationship is a verb, and the ending entity is another object, or could be a subject in other cases. And then similarly, we have a user wrote a review. So this graph becomes very, very straightforward right off the bat.
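As a rough illustration of the subject-VERB-object convention described above, the beer schema can be written down as a list of triples. This is a minimal Python sketch, not the Hume or Neo4j schema format; `HAS_A` comes from the demo, while the relationship name `WROTE` and the helper functions are assumptions made for this example.

```python
# Hypothetical encoding of the demo schema as (subject, VERB, object) triples.
# HAS_A is named in the session; WROTE is an assumed name for "a user wrote a review".
SCHEMA = [
    ("Beer", "HAS_A", "Review"),
    ("User", "WROTE", "Review"),
]


def one_to_many_relationships(schema):
    """Return the triples whose verb follows the HAS prefix convention."""
    return [triple for triple in schema if triple[1].startswith("HAS")]


def describe(triple):
    """Render a triple in a Cypher-like pattern notation for readability."""
    subject, verb, obj = triple
    return f"(:{subject})-[:{verb}]->(:{obj})"


patterns = [describe(t) for t in SCHEMA]
has_rules = one_to_many_relationships(SCHEMA)
```

Because the prefix carries the one-to-many meaning, a reader (or a small script like this one) can pick out those relationships from the names alone.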
As we get out of this demo later, we’re going to talk about the whiteboard model. But this really is the whiteboard model in practice, where if I asked you to step back and said, “Okay, I have an application in mind. Users come in, they’re going to write reviews about beers, and then we’re going to make recommendations to them. Now draw that schema up on a whiteboard,” this is approximately what you would come up with. You would have a node for beers, you would have a circle for reviews and a circle for users, and they would be connected by lines. The graph just replicates that.
That’s really part of the real power here: the user understanding. At least I’ve, in a consulting role, come into a number of organizations where someone hands you an entity relationship diagram for some relational database and you have to spend five hours trying to figure out what it is because it’s hugely complicated. Instead, we can switch to simply this schema.
If we look at what the schema looks like in practice, we can start looking at some of the instances of this data. You can see we’ve got 1.6 million reviews and 66,000 beers, and if we start looking at some of these beers, I might look at my Killer Whale Stout. If we just focus on that for a moment, we can expand out and see how many reviews we have. If we look at a hierarchy view here, we can see that this Killer Whale Stout has six reviews, and I can also look at all of these in the same class. I can find all of the users that wrote these reviews. Then I could even keep extending out to find other reviews that these users have written – which would be 7,000, so I’m not going to load all of those, but we’ll load 500. We can see the other reviews that these users have written. If I zoom in here, we can start to actually look at maybe expanding back out to another beer.
So in this way, we can start from the top of our hierarchy up at this beer – [00:10:00] oh, this is now a user – we can start at the top of our hierarchy with a beer, traverse through users, and end up back at another beer. We’re just following that path throughout our schema. We’re starting with a beer, going out to a review, to the users, to their reviews, and back to beer. So we have a concept somewhat of a self-join as well as the ability to traverse throughout our graph with almost no cost. You saw how fast that query was.
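The beer-to-review-to-user-to-review-to-beer path described here can be sketched in plain Python with two adjacency indexes standing in for the relationships. The data is invented for illustration (only Killer Whale Stout appears in the demo), and this is an in-memory sketch rather than the query the platform actually runs.

```python
from collections import defaultdict

# Toy data, invented for illustration: one (user, beer) pair per review.
reviews = [
    ("ann", "Killer Whale Stout"),
    ("ann", "Hop Bomb IPA"),
    ("bob", "Killer Whale Stout"),
    ("bob", "Midnight Porter"),
    ("cat", "Hop Bomb IPA"),
]

# Adjacency indexes stand in for the relationships between beers, reviews, and users.
reviewers_of = defaultdict(set)  # beer -> users who reviewed it
beers_of = defaultdict(set)      # user -> beers they reviewed
for user, beer in reviews:
    reviewers_of[beer].add(user)
    beers_of[user].add(beer)


def co_reviewed(beer):
    """Walk beer -> reviews -> users -> their other reviews -> other beers."""
    found = set()
    for user in reviewers_of[beer]:
        found |= beers_of[user]
    found.discard(beer)  # drop the starting beer itself
    return sorted(found)


related = co_reviewed("Killer Whale Stout")
```

Each hop is a direct lookup from a node to its neighbours, which is the property the transcript refers to when it says the traversal comes at almost no cost.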
We could do something very similar if we’re looking at finding out what beers a user has reviewed. We come in here and we can look at his “Show reviewed beers” and see the schema directly. You can see that in our graph, we don’t have any duplicated data. So if you’re looking at normal forms in terms of how to use join tables to reduce duplication, because we’re using actual instances – these nodes are instances of – I’ll put up my properties so you can see. This is a style. This is an instance of the style node that has the name American Double Imperial Stout, and this individual node is being referenced twice.
What this looks like as we come over here to American IPA – this is another example that’s being referenced three times – we can change the name of this American IPA. We don’t have any data duplication issues, and we still have three beers that reference it. We could then expand out from here – there’s another 3,000, but we can load 500 of them again. We can load another 500 beers within that category quite easily. Let’s not do that for a second.
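The point about a single style node being referenced by several beers can be shown with a small Python sketch. The class names and beer names are hypothetical; the idea is only that each beer holds a reference to one shared style instance rather than its own copy of the style name.

```python
class Style:
    """A style node; exactly one instance exists per style."""

    def __init__(self, name):
        self.name = name


class Beer:
    """A beer node that points at a shared Style instead of copying its name."""

    def __init__(self, name, style):
        self.name = name
        self.style = style


# Three hypothetical beers reference the same style instance.
ipa = Style("American IPA")
beers = [Beer("Beer A", ipa), Beer("Beer B", ipa), Beer("Beer C", ipa)]

# Renaming the style once is visible from every beer that references it.
ipa.name = "American India Pale Ale"
style_names = {beer.style.name for beer in beers}
```

There is one place to change the name and no copies to keep in sync, which is the behaviour shown in the demo when the American IPA node is edited.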
That really starts to showcase some of the flexibility of our schema and the power of being able to navigate throughout our graph, keeping this property graph structure.
So as we start to think about our schema here, maybe there’s additional pieces that we want to look at, and that’s one of the biggest problems in relational design typically. You have to solve every problem at the outset, because changes coming later down the line are going to present a huge headache. That’s something that you can solve easily in a graph, and it’s actually one of our top tips. This isn’t your forever schema. This is not the one thing that you have to cast in stone and it’s going to be there forever. This is your first pass. It has to be workable and it has to solve your business cases, but over time, your business case is going to evolve and there’s going to be new requirements.
So we might start out with this platform where we just want users to review beers, but then once we have all of this data, maybe we’re going to create a website. On this website, we’re going to have a webpage, and now there’s actually going to be beers listed on a webpage. We could do this relationship in a couple different directions, but we can call this relationship something like EXISTS_ON. So now we have a beer that exists on a webpage, and we can have a user that viewed a webpage.
So we’re keeping that subject-VERB-object relational structure, but we’re easily extending our graph further out to include other aspects, including the webpage. Now we’ve built out our website and we can start to include some of this information in our query, and we’ve created another path, essentially, to navigate between beers to users or between beers to reviews. It enables us to move and traverse easily within our application.
This can really translate to a number of different use cases or can be incredibly flexible even going forward. So even in our beer entities, you can see we have some properties here that we require of beers, like a name, an ID, and a beer style, and we could also say, “Hey, maybe we’re going to start actually selling beers now, so I want to add an inventory property onto this beer.” Once I’ve added an inventory property, we can then include that in our queries to update and say, “Let’s only look at beers that we have inventory on.” It doesn’t become this massive schema redesign headache where our tables are fixed and we’ve got all these issues because we can easily navigate around our graph without penalty cost.
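To make the inventory example concrete, here is a minimal Python sketch in which nodes are plain dictionaries. The property names `id`, `name`, and `style` follow the transcript; the values, the `inventory` counts, and the `in_stock` helper are assumptions for illustration.

```python
# Beer nodes as property maps; values are invented for illustration.
beers = [
    {"id": 1, "name": "Killer Whale Stout", "style": "Stout"},
    {"id": 2, "name": "Hop Bomb IPA", "style": "American IPA"},
]

# A later requirement: start tracking inventory. The property is added only
# where it is known; other nodes are left untouched and nothing is migrated.
beers[0]["inventory"] = 12


def in_stock(nodes):
    """Return names of beers with inventory, treating a missing property as zero."""
    return [node["name"] for node in nodes if node.get("inventory", 0) > 0]


available = in_stock(beers)
```

Queries that ignore `inventory` keep working unchanged, while new queries can opt in to the property.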
And this applies to a number of different use cases. You can see some that we have up here. I’m going to bring in another schema that we’ve looked at that’s also incredibly simple. This is a regulation recommendation use case. We’re finding similarities between regulations or bills. So at the center of our universe is bills, and out from there we have some additional entities. This is actually a case where we’ve used NLP on top of these bills. So we have MENTIONS_CODE, MENTIONS_ORGANIZATION, and MENTIONS_PERSON. [00:15:00] And as I mentioned, we might want to extend this out. This is the data that we have today, but potentially later on, we’re introducing some concept of events or digital views where we have customers that have selected bills, but maybe we want to also check and see when customers have viewed bills. So we can extend that out incredibly easily.
Again, diving into this visualization to see some of what it might look like, we’ve got a lot of information up on the screen here. But if I clear the schema, you can see that we’ve got 1,200 bills, and these 1,200 bills have 7,800 relationships to 2,800 organizations. We actually have more organizations than we have bills. But because we don’t have a messy join table in the middle of here, we don’t take a penalty hit for needing to navigate between bills and organizations in our query. That becomes incredibly powerful if we think about our application, because I can look at a specific bill – like Bill 516 here – and I can make a recommendation or I can say I want to find similar bills based on code, organization, keyword, and person.
I’m going to run a query and it’s going to do that exploration throughout my graph. It’s going to navigate out from my bill to my codes and my organizations, and out to other bills within my system, and then return them based on a ranking. So having these organizations connected so easily, where we might have – looking here, I don’t know how many we have, but 10 organizations associated with a single bill. If that was the case in most join tables, this query would just blow up. It’d be untenable because we’d be looking at a lot of duplicates. We’d have a hard time with aggregation, trying to get from this table to my join table, from bill to organization, back to my bills table, back to organizations. It would become a mess.
In Cypher, this is an incredibly short query, or relatively straightforward query, where we can start at a bill and some custom weights that we’ve put in here, and then we can navigate out and then aggregate back on collect and distinct in our graph and return these results almost instantaneously. So that’s another instance of the power of both the flexibility and the power of graphs in terms of that relation.
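The actual Cypher query is not shown in the session, so the following is only a Python sketch of the idea it describes: start at one bill, walk out over each relationship type to shared entities, and rank the other bills by a weighted count. The bill contents and the weights are invented; only the relationship names and Bill 516 come from the demo.

```python
from collections import defaultdict

# Toy graph, invented for illustration: bill -> {relationship type -> entities}.
bills = {
    "Bill 516": {
        "MENTIONS_CODE": {"c1", "c2"},
        "MENTIONS_ORGANIZATION": {"o1"},
        "MENTIONS_PERSON": {"p1"},
    },
    "Bill 101": {
        "MENTIONS_CODE": {"c1"},
        "MENTIONS_ORGANIZATION": {"o1"},
        "MENTIONS_PERSON": set(),
    },
    "Bill 202": {
        "MENTIONS_CODE": {"c2"},
        "MENTIONS_ORGANIZATION": set(),
        "MENTIONS_PERSON": set(),
    },
    "Bill 303": {
        "MENTIONS_CODE": set(),
        "MENTIONS_ORGANIZATION": {"o9"},
        "MENTIONS_PERSON": {"p7"},
    },
}

# Hypothetical custom weights per relationship type.
WEIGHTS = {"MENTIONS_CODE": 1.0, "MENTIONS_ORGANIZATION": 2.0, "MENTIONS_PERSON": 3.0}


def similar_bills(start):
    """Rank other bills by the weighted number of distinct shared entities."""
    scores = defaultdict(float)
    for rel, entities in bills[start].items():
        for other, other_rels in bills.items():
            if other == start:
                continue
            shared = entities & other_rels.get(rel, set())  # sets keep entities distinct
            scores[other] += WEIGHTS[rel] * len(shared)
    ranked = sorted(((score, bill) for bill, score in scores.items() if score > 0), reverse=True)
    return [(bill, score) for score, bill in ranked]


ranking = similar_bills("Bill 516")
```

Using sets for the connected entities plays the role that collecting distinct values plays in the query: a shared organization is counted once per bill, however many paths lead to it.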
Just one other thing – and I don’t want to steal Lee’s thunder because he’s coming up here on the data science aspect of it – but the other thing that we can do based on that relation is it becomes really easy to compute rarity scores for our instances on our nodes. So we can look at something like the State Department of Agriculture, and you can see that we have a rarity score. This is something that we’ve computed within our graph based on, how many bills is this organization connected to out of the entire set of bills within our graph?
We can see looking at something like the State Department of Agriculture, that organization is a rarity score – this is normalized, so 1 is the highest. So this is a very rare organization with 0.998 rarity. Now we look at something like the 121st General Assembly. This is something that’s used in almost every bill because this is the instance of the General Assembly that they were introduced at. So it’s very popular, and therefore is less important to our recommendations.
In our graph, we can leverage that by extending out the relationship to the organization and then looking at the weighting of the rarity score of that organization. So we’re able to have properties on our entities and leverage those in our queries based on our relationships.
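The transcript does not give the formula behind the rarity score, only that it is normalized with 1 as the highest and that it depends on how many bills an organization is connected to out of all bills. One simple formula consistent with that description is sketched below in Python; the formula and the connection counts are assumptions, and the total of 1,200 bills is taken from the demo.

```python
def rarity(connected_bills, total_bills):
    """Assumed formula: 1.0 for an entity in almost no bills, 0.0 for one in every bill."""
    return 1.0 - connected_bills / total_bills


TOTAL_BILLS = 1200  # number of bills reported in the demo graph

# Hypothetical connection counts chosen to mirror the two examples discussed.
rare_org = rarity(2, TOTAL_BILLS)        # an organization mentioned in only 2 bills
common_org = rarity(1150, TOTAL_BILLS)   # an entity mentioned in almost every bill
```

Under this assumed formula, an organization tied to 2 of 1,200 bills scores about 0.998, in line with the value shown for the State Department of Agriculture, while something mentioned in nearly every bill scores close to 0 and contributes little to a recommendation.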
But we can take that even a step further. If I come into one of our other schemas in this instance in terms of movies, we can actually see in our visualization that we have – I’ll actually remove her from the screen briefly – I’ve got myself here as a user, and I can see what movies I’ve rated. I’ve only rated one movie. I’ve rated Top Gun, 1986, great movie. One of my personal favorites. You can tell because when I click on this relationship, I have a property where I’ve rated it a 5.
And this is incredibly powerful, and Lee will talk about this further as we get to data science. But just to highlight this part of the schema, a property graph model in Neo4j allows us to store properties not only on our entities, on our nodes, but also on our relationships. This allows us to increase the amount of information in our graph dramatically. Because again, in a relational system, this is probably a table of users [00:20:00] with some join table in the middle of your ratings to movies, and it becomes incredibly complex to try to aggregate and compare ratings of users. But within a graph, I can actually just store that property directly on my rating, and I can leverage this in similarity recommendation queries, but I can also leverage it in pathfinding, which is incredibly powerful. We’ll draw in one second a schema for a network use case where we have a dependency. You can actually put load balancers, for instance, accurately into your schema because you can determine how much traffic they’re routing.
When I look at this rating, I can leverage that property in a query as well to find my most similar user. So I’m going to navigate out my entire graph and find my most similar user. In this case, this is just someone else who has rated Top Gun very highly. And now I can take this and I can recommend other movies potentially that Virginia has watched to me.
As we explore that, we can see the power of this actually happening in real time. If I add another movie to my graph, I can actually create that rating – and maybe I hated this movie and I watched it yesterday and it was horrible. I can rate that movie. There we go. I can rate that movie and have it return on my graph, and now if I try to find my most similar user, I may get a different result depending on the movies that I’ve rated. Or I may need to rate one more movie before Virginia and I are not the most similar, so we’ll find License to Wed, which was a great movie, and we’ll rate that a rating of 5. And now I can again try to find my most similar user, and you can see that I now have a different most similar user based on these ratings. They’ve rated License to Wed similarly to me with a 4, and they also really enjoyed Top Gun with a 4.5 rating.
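The similarity measure used in the demo is not spelled out, so the Python sketch below uses a simple stand-in: each movie two users have both rated adds to the score, reduced by how far apart the two ratings are. The ratings are stored on the (user, movie) relationship, as in the property graph model described earlier. The user `sam`, the scoring formula, and the 5-point scale are assumptions; the rating values echo the ones mentioned in the walkthrough.

```python
# Ratings stored as properties on the (user, movie) relationship.
ratings = {
    ("will", "Top Gun"): 5.0,
    ("will", "License to Wed"): 5.0,
    ("virginia", "Top Gun"): 5.0,
    ("sam", "Top Gun"): 4.5,
    ("sam", "License to Wed"): 4.0,
}


def most_similar(user):
    """Return the other user with the highest agreement on shared movies."""
    mine = {movie: r for (u, movie), r in ratings.items() if u == user}
    best, best_score = None, float("-inf")
    for other in sorted({u for (u, _) in ratings if u != user}):
        theirs = {movie: r for (u, movie), r in ratings.items() if u == other}
        shared = mine.keys() & theirs.keys()
        # Each shared movie adds up to 1.0, reduced by the gap on an assumed 5-point scale.
        score = sum(1.0 - abs(mine[m] - theirs[m]) / 5.0 for m in shared)
        if score > best_score:
            best, best_score = other, score
    return best


current_match = most_similar("will")
```

With only the Top Gun rating in place, the closest match in this toy data is `virginia`; once the License to Wed rating exists, the user who rated both movies overtakes her, which mirrors the change seen live in the demo.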
So we can see the power of graphs in real time being able to make those recommendations and find similarities navigating out between these relationships. We didn’t cover this in the beginning, but this is a relatively large graph. It’s not massive, but we have 62,000 movies and 46,000 users who have rated 7 million times those 62,000 movies. So this is a relatively small graph, but we can still do those real-time recommendations.
And then briefly, I want to just focus on a new schema that we haven’t designed where we might have dependencies. We might be looking at a network dependency graph or business dependency. A lot of times we look at regulation in financial institutions where there’s regulation dependencies. So we could add a piece of hardware here, potentially. We could add a piece of hardware here and that might be depended on by a piece of software. We’ll spell “software” appropriately by editing this. And there may be another instance of hardware. This might be a server. I’ll make this a little more clear, call this a server. We might also have another piece of hardware like a load balancer. On top of that load balancer, we might have another server. We’ll call this Server 1. And we might have a software that depends on my load balancer, and then from my load balancer, I might also depend on Server 1 as well as my server.
That would become very hard to model in a relational system, having this n number of dependencies, but it would be impossible to add as an attribute the weighting. I can add on my weighting here as an integer, and now within my graph I can store what percent of traffic this load balancer routes to Server 1. So not only do I have a dependency chain, but I actually have a weighting scale dependency. I might have just a warm backup that gets 10% of my traffic and a production server that’s getting 90%. I can start to look at that chain within my query, and that nuanced representation becomes incredibly valuable within my graph schema design instead of being incredibly challenging within my relational database.
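The weighted dependency chain can be sketched in Python by putting the weighting on each dependency edge and multiplying weights along a path. The relationship name `DEPENDS_ON`, the assignment of 90% and 10% to particular servers, and the helper function are assumptions; the sketch also assumes the dependency graph has no cycles.

```python
# (source, target) -> properties on an assumed DEPENDS_ON relationship.
# Weights are percentages of traffic; edges without a weight carry all of it.
depends_on = {
    ("Software", "Load Balancer"): {},
    ("Load Balancer", "Server"): {"weight": 90},    # production server
    ("Load Balancer", "Server 1"): {"weight": 10},  # warm backup
}


def traffic_share(start):
    """Follow dependency edges from start, multiplying weights along each path.

    Assumes the dependency graph is acyclic.
    """
    shares = {}
    stack = [(start, 1.0)]
    while stack:
        node, share = stack.pop()
        for (src, dst), props in depends_on.items():
            if src == node:
                fraction = share * props.get("weight", 100) / 100
                shares[dst] = shares.get(dst, 0.0) + fraction
                stack.append((dst, fraction))
    return shares


shares = traffic_share("Software")
```

The result says how much of the software's traffic ultimately lands on each piece of hardware, which is the kind of question a weighted dependency chain lets a query answer.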
All right, so now I’m going to go back into the PowerPoint presentation on the overview slide and just talk briefly about the [00:25:00] whiteboard model. This is just another example of a different movies database that we talk about. Some of this is in Resources at the end as well, for people who want to follow along. But in our whiteboard modeling, as I talked about – this is movies, but we talked about it with beer as well – you might sit down at a whiteboard or stand up at a whiteboard and say, “Let’s draw this out. Let’s see what this looks like.”
You might say, all right, we’ve got Tom Hanks. He acted in a movie, or Hugo Weaving also acted in a movie. And we’re going to start to draw something that looks like a bunch of circles and lines. I can’t tell you how many times, even non-graph-related projects, I’ve gone to a client and we’ve stood at a whiteboard and drawn circles and lines. I think everyone in consulting and a lot of businesses have had that experience.
What we then do in our graph modeling process is we say, okay, let’s take these circles and lines as fact. Let’s turn them into nodes and edges in a structured schema. This is much, much simpler than the relational modeling process, where what you’re typically doing is drawing your circles and lines and then giving it to some engineer to spend 6 hours saying, “Okay, this ACTED_IN, this is going to have to be a join table. It’s going to have to be a one-to-many relationship. There’s going to have to be all these rules on it. We need the foreign key for actor in my join table for actor/movie, and then we need the foreign key for movie in my join table, and we can only ever have an actor appear in one role in one movie.”
Imagine you do your entire relational schema with actors appearing only once in a movie, and then you end up with instances of movies where actors are playing two different roles. How are we going to store that? Are we going to then expand our relational schema with another join table to role from actor/movie? It becomes incredibly complicated.
In our graph database, we can simply add a property to our ACTED_IN relationship of the role that they played and then create another relationship from Tom Hanks to Cloud Atlas. So we have this incredibly flexible schema that doesn’t require us to solve all of our problems upfront.
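A small Python sketch of this idea: each ACTED_IN relationship is its own record carrying a `role` property, so a second role in the same movie is just a second relationship between the same two nodes. The dataclass and helper are illustrative constructs, and the role names are included as examples from Cloud Atlas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActedIn:
    """One ACTED_IN relationship; the role is a property on the relationship."""

    actor: str
    movie: str
    role: str


acted_in = [
    ActedIn("Tom Hanks", "Cloud Atlas", "Zachry"),
    # A second role in the same movie is simply another relationship.
    ActedIn("Tom Hanks", "Cloud Atlas", "Dr. Henry Goose"),
    ActedIn("Hugo Weaving", "Cloud Atlas", "Bill Smoke"),
]


def roles(actor, movie):
    """Collect every role an actor played in a given movie."""
    return [rel.role for rel in acted_in if rel.actor == actor and rel.movie == movie]


hanks_roles = roles("Tom Hanks", "Cloud Atlas")
```

No new table or join structure is needed when the one-role-per-movie assumption turns out to be wrong; the existing relationship type simply appears more than once.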
If you take one thing away from this webinar, besides our handout at the end, I think one of the biggest things about creating graph schemas is that you don’t have to solve all of your problems in the beginning because they’re going to be flexible. And that’s super important for any application you build, whether it’s internal business-facing or external consumer-facing. Flexibility is key.
As we talked about, going back to our sports metaphor – I was trying to think of a Star Wars metaphor for May the 4th, but I couldn’t quite come up with one – sports evolve over time. Rules evolve over time. You look at the high jump, for those of you who like track and field. The high jump was jumped with people going belly-down for years and years and years, until one guy came around and went back-first and now that’s all anyone’s ever seen. You can look at the NHL, one of my personal favorites. They’ve changed the rules on contact. They’ve changed the rules on fighting. Players now have to wear visors. Look at the NFL, famous for changing rules basically all the time. I can’t keep up and say what rule is new, but there’s new pass interference. There’s all of these different changes that happen.
In your business application, that looks like consumer preferences changing, or there’s new demands from the business on your application. You’ve got financial fraud, and now they want your application to say, “What’s the scale of that fraud?” They want to say, “It’s two layers deep; now give me six layers deep. Don’t identify fraudulent individuals; identify fraud rings.” There’s all kinds of ways where the requirements are going to change, and if you have an inflexible schema, you won’t be able to adapt. Your options will be “Do I tear the whole thing down, or do I start doing duct tape and bubblegum patchwork to try to make it happen?” That’s typically what we see, and we end up coming into a mess that we’re trying to fix and often recommending we replace it with a graph at the center.
Now I want to turn it over to Lee for a bit here to talk about some of the data science considerations of modeling your graph.
Dr. Lee Hong: Yeah, so today, at this point, because it is May the 4th, I will throw in a Star Wars analogy about relational schemas and graph schemas.
Think about building relational schemas in your background. You’re very much taking an Empire-based approach. I’ve got these big Star Destroyers, each of those tables. They only work well if you have a ton of rows and you take all those rows. The individual TIE fighters have got to sit in that Star Destroyer. That’s the only way that it works. So you only have power in large, large numbers. [00:30:00] And then you realize that all of that culminates in a very large, planet-like instrument that can destroy other planets – but after you’ve built all of that and you’ve blown up a couple of planets, you realize there’s one gaping hole in your design that then makes your planet somewhat useless. Other than blowing up other planets, it really doesn’t do anything else.
From the graph database standpoint, it’s more like setting up something around the Rebellion, where the power of what you’re doing is at the entity level. Every unit of data I have matters to me because each of those is an individual entity, each with its own powers and capabilities. So as you take that “force” with you, what that really does from a data science perspective is it gives you this capability of being able to be flexible, to Will’s point about being able to adjust. So every time you get hit by the Empire, you run away to some other galaxy and you build something else and then you come back with some different technology, different people, and you’re back in business.
From a data science perspective, a lot of the work that we do is dependent on being able to engineer the right kinds of features. In order to be able to predict what’s coming up next, be able to know if somebody should be recommended a different thing, I need to be able to create those features that tell me either these two people are the same, or, given the history, that’s what the future’s going to look like.
In order to do that well, I need to have data cardinality, which means I need to know that each entity of data that I have is a specific entity. I can’t have a bunch of clones that are semi-similar running around that can break down very easily. I need to know when I aggregate to the level that what I have in that group is the right number. So I don’t have duplicated data; I’m not double-counting things. That’s extremely important.
And then the other thing is the flexibility of the pivot. In a table schema, if I’ve set my columns and my rows, transforming that into something from long to wide, where I take my columns and I need to make my columns my rows or make my rows my columns, that becomes extremely expensive and very, very slow if I have a big table. But with a graph, everything is at the entity level, so I’m connecting one entity to another. Those two things link to each other, and I never have to worry if I’m pivoting because I know what the higher order entity is and what the lower order entity is, and I always have that hierarchy in place.
The last thing is what Will was showing with the recommendation, whether it’s for beers or similar documents. One of the biggest, most expensive efforts whenever you have a relational database is the self-join. When I need to know if two things share similar properties or, more importantly, similar properties that are kinda sorta similar but not completely similar, where I want to weight things, like a rarity score or a similarity metric – however I compute that, I need to be able to do that fairly quickly and on the fly.
One example that Will was showing just now with the document database, I have to be able to compute these rarity scores on the fly, day by day, document by document, because if a new document comes in and uses a particular term or calls the General Assembly 1,000 times, that changes my background score. So I need to be able to compute all these things on the fly and know where my connections are. Having to create an aggregated table becomes very expensive. Then having to self-join that aggregated data together against itself to find me similar documents becomes very, very slow.
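Editor's sketch of finding similar documents by walking document-to-term-to-document connections instead of self-joining an aggregated table. The transcript does not define the rarity score, so the formula here (log of total documents over documents containing the term) is an assumption, and the documents and function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical documents and the terms each one mentions.
docs = {
    "d1": {"general assembly", "security council", "ceasefire"},
    "d2": {"general assembly", "ceasefire"},
    "d3": {"general assembly", "budget"},
}


def build_index(docs):
    """Link each term to the documents that mention it."""
    term_to_docs = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            term_to_docs[term].add(doc_id)
    return term_to_docs


def rarity(term, term_to_docs, n_docs):
    """Assumed rarity score: terms found in fewer documents weigh more."""
    return math.log(n_docs / len(term_to_docs[term]))


def similar_documents(doc_id, docs, term_to_docs):
    """Walk doc -> term -> doc and sum the rarity of every shared term."""
    scores = defaultdict(float)
    for term in docs[doc_id]:
        weight = rarity(term, term_to_docs, len(docs))
        for other in term_to_docs[term]:
            if other != doc_id:
                scores[other] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = build_index(docs)
ranked = similar_documents("d1", docs, index)
```

A term that every document uses, like "general assembly" here, contributes nothing to the score, and adding a new document changes the weights for the terms it touches, which is why Lee stresses computing these on the fly.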
So overall, on the data science side, from a graph perspective, it’s all about speed. It’s about flexibility. I’m going to hand it back to Will as we move on.
Will Evans: Thanks, Lee. Yeah, thinking about the power of graph for data science, the other piece that we sometimes talk about is that depending on the use case, sometimes you want to have separate schemas for data science and your production use case.
So in a lot of cases, especially in the beginning with smaller data, you can use the same schema. But in some production schemas, which I’ll share in a moment here from a high level – if I just bring this up – I’ll share one of our production schemas from – I see Rodrigo – I hope everyone else could hear Lee. I certainly could.
If I come in here to one of our production schemas to show what it might look like, we actually have a lot of relationships in this schema. You can see we have some nodes here with a number of different connections as well as a schema that extends out 6, 8, 10 joins [00:35:00] from some of these central nodes. And some of these relationships, we may not need in a graph data science application. One of the features of Neo4j is that you can actually create subgraphs, so we might choose to not include some of these relationships as we’re looking to do our data science approach because these may simply be instances that we want within our queries in order to return things quickly to users that we don’t need to be looking at as we’re traversing from a data science approach.
So we can look and see that we have some aspects of the graph that are able to be looked at in a production application viewpoint as well as some that you can create in a subgraph for graph data science. I’ll switch back to my presentation and then go to one of our next slides here.
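Editor's sketch of the subgraph idea Will describes, written in plain Python and not using Neo4j's own API. The edge list, relationship types, and function names are hypothetical. It keeps only the relationship types a data science traversal cares about and shows how that narrows what the traversal reaches.

```python
from collections import defaultdict, deque

# Hypothetical edges: (source, RELATIONSHIP_TYPE, target).
edges = [
    ("user:1", "RATED", "beer:1"),
    ("user:2", "RATED", "beer:1"),
    ("user:2", "RATED", "beer:2"),
    ("user:1", "HAS_SESSION", "session:9"),
    ("session:9", "HAS_CACHE_ENTRY", "cache:3"),
]


def project_subgraph(edges, keep_types):
    """Keep only the edges whose relationship type is in keep_types."""
    return [edge for edge in edges if edge[1] in keep_types]


def reachable(edges, start):
    """Breadth-first traversal that ignores edge direction."""
    neighbors = defaultdict(set)
    for source, _, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


full_reach = reachable(edges, "user:1")
science_reach = reachable(project_subgraph(edges, {"RATED"}), "user:1")
```

On the full edge list the traversal wanders into session and cache nodes that exist only to serve the application; on the projected subgraph it stays on users and beers.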
These are some of our schema design tips, just sort of a summary of the things that we’ve covered – which is also available in a handout which we’ll hand out shortly – that we really think are the key things for people to take away from this webinar.
With a graph, there isn’t any such thing as your forever schema. Your schema needs to continue to change over time based on your business needs. The inflexibility of relational schemas, with the esoteric rules they need in order to make things work, becomes their downfall in really short order, and that’s something you don’t have to deal with in a graph.
Create some basic semantic rules – things like HAS, creating for, defining some rules for your use case that make your graph understandable. Think about the next new developer coming onto your project and how quickly they’d be able to ramp up. Having those semantic rules makes things much, much faster.
Start at the center of your universe and make a sketch. These ones are really related together. You want to say, “What’s the most important thing to me?” and start from there.
You don’t have to necessarily consider all of the different aspects, because if you think about that production schema I just showed, you might have entities that are 6, 7, or 8 relationships away, and that’s okay in a graph. You don’t have to have everything tightly together in this really centralized star schema, ERD methodology that everyone’s used to. You can spread out. Your graph and your schema can sprawl a little bit, like things do in real life, like applications do in real life. You have a lot of different connections that you need to traverse.
And then leverage that subject-VERB-object model. Especially as you’re getting started, that’s really one of the most intuitive ways to make the switch from thinking about applications backed by a relational database to applications that are backed by a graph.
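Editor's sketch of the subject-VERB-object tip combined with the earlier tip about basic semantic rules. The `Triple` type, the example entities, and the naming rule (verbs in upper case with underscores) are assumptions for illustration; only the verb HAS comes from the session itself.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    verb: str
    obj: str


# Assumed semantic rule: relationship verbs are UPPER_CASE words joined by "_".
VERB_RULE = re.compile(r"^[A-Z]+(_[A-Z]+)*$")


def follows_rule(triple):
    """Check the verb against the team's naming rule."""
    return bool(VERB_RULE.match(triple.verb))


def describe(triple):
    """Render the triple the way it would be sketched on a whiteboard."""
    return f"({triple.subject})-[:{triple.verb}]->({triple.obj})"


# Hypothetical schema sketch, starting from the center of the universe.
schema = [
    Triple("Customer", "PLACED", "Order"),
    Triple("Order", "HAS", "LineItem"),
    Triple("LineItem", "refers to", "Product"),  # breaks the naming rule
]

violations = [describe(t) for t in schema if not follows_rule(t)]
```

Writing the schema as a list of triples keeps each relationship readable as a short sentence, and a check like `follows_rule` is one way a new developer could be pointed at names that drift from the agreed conventions.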
I’ll actually share that handout now for people who are interested in it, our Top 5 Schema Design Tips. Feel free to grab that. And then I’m going to switch to our next page.
These are some resources that you’ll be able to see at the end. I also want to go to Q&A now. I see that we have some questions already. One of the things is, “Is Hume simply the name of the GUI, or is it recommended by Graphable, or are there other options?” To answer this one, I’m going to come back into my backend architecture diagram here.
You can see that Hume fits into our visualization, analytics, and data science component. We look at Hume as really a huge number of things. It’s a very powerful platform. Where we typically see it in use cases is to have business users and data scientists exploring your graph directly. We’re using tools like Domo for BI, for executive reporting, looking at charts and tables going to C-level executives. A graph isn’t necessarily best for that. But for understanding and exploring the relationships in your data and doing data science as well as manipulating your graph, Hume is the most powerful tool on the market, and we would definitely recommend it.
If you have any other questions – you may not. That’s answered already. Any other questions on backend architecture or graph schema design, feel free to ask them. Otherwise, we’ll wrap up the backend architecture session in a moment here. And we’ll share these slides out with everyone who attended as well as be posting this recording to our YouTube if anyone wants to follow up with it.
If you’re interested, check out our other events that we have coming up. I want to thank Lee for taking the time to join me today on this webinar, and thank you all very much for joining our discussion on backend architecture and graph schema design. May the Fourth be with you, and have a great rest of your week.