In this episode replay we cover how effective Master Data Management (MDM) must extend beyond the basics of entity resolution to include the entire data lifecycle. To stay truly in control of your data you need to understand not only that two entities may represent the same thing, but also what is the source of these two entities? Have they changed over time? Who or what changed them? Graphs enable you to answer all of these questions and more. Join us in the third episode of our year-long series as we highlight the best practices for using graphs to truly master your data.
Format: Presentation/discussion with some show and tell. This particular event is not a hands-on session, but if you think that would be a good idea for future events, please contact us.
Audience: The session should be interesting to anyone who has experienced the shortcomings of RDBMS-based applications, and is in a position to impact their organization’s direction for new and re-platformed apps. Graph database enthusiasts, IT managers, architects, DBAs and AppDev technical people will benefit uniquely from this content.
- It is not necessary to have seen the first sessions in this series since there will be a recap, but it would be helpful. Replays here: Episode 1: What’s wrong with your app? | Episode 2: Backend dev
- No specific technical skills required, but to derive the most value from this session, you should at least understand appdev and database concepts and their implications.
Video transcript: [00:00:00] Will Evans: We’re a couple minutes fast. We’ll get started right away. What we’re talking about today is our Graph-Centered AppDev video series. We’re looking at Episode 3: Master Data Management. Really, the focus of this series is on how to create powerful applications with graph at their core.
If you look at the schedule, you can see that we are on our third session. We’ve done a session on what’s wrong with your application, focused on all of the ways that older application design with relational databases, tabular databases at the center, can present issues as you try to scale and grow over time to meet new business needs. And then we focused specifically on the strengths of graph databases for your backend as the core to an application rather than as an ancillary piece on a relational database.
Today what we’re going to focus on specifically is a really important aspect of having a strong business application: master data management. We’re going to take about 20 or 30 minutes today and talk about master data management, and we’ll talk about how this is done before graph and then we’ll talk about the power of graph for solving this use case and follow it up with some specific use cases that we’ve looked at with our customers and a demo of using the Hume platform to do master data management. And then some question and answer, of course.
As we look through this, we want to just start by defining master data management. I’m going to – let’s see if I can turn my camera on. I don’t seem to be showing anything on my camera. We’ll leave that off.
But if we go on towards the definition of master data management – sorry, if we start with a review of our graph-centered appdev, this is the typical application stack that we see with a lot of our clients. This is actually available as a handout that we can share right now and you’ll be able to download.
What we see in this diagram is that there’s a lot of different needs that applications serve. Serving that flexibility can present a major challenge for old style applications. Certainly for consumer-facing or SaaS applications where you have full control over the architecture, this is really the best practice that we are recommending to all of our clients now as we go through our application development projects. For internal business applications where maybe you’re not hosted in the cloud or there are other restrictions, there may be some modifications that we would make to this. This is really typical best practice that we would see.
So what is master data management? Well, if we take a look, we see that master data management really falls into two buckets based on our experience. Those two buckets are data resolution and data lineage. Data resolution really represents the “master” in master data management. These types of use cases require identifying a single master record from a collection of records.
So this is that combination of – customer 360 is sort of the buzzword that people talk about now. You can see on the right-hand side, we’ve got Google Ads, CRM, Snowplow, some event-driven systems. What we’re trying to do is we have someone who’s in our marketing funnel because they’ve clicked on an ad campaign of ours, and then they go through our marketing qualified leads and eventually become a sales qualified lead. But it turns out that they used their personal email address in the marketing funnel, so now as we go into the sales qualified funnel, we have their business email address. That’s still the same person, and if we want to understand the ROI on our marketing spend, we need to resolve those two separate entities, separated by an email address, to one unique person. That’s the topic of data resolution.
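As a rough illustration of the data resolution idea described here, the Python sketch below groups records from different systems into one person whenever they share an identifier value. The record fields, the sample values, and the choice of a shared phone number as the linking signal are all assumptions made for illustration; the webinar does not specify a matching rule.

```python
from collections import defaultdict

# Hypothetical records: one marketing lead and two CRM contacts.
RECORDS = [
    {"id": "ads-1", "source": "Google Ads", "email": "sam@personal.example", "phone": "555-0100"},
    {"id": "crm-7", "source": "CRM", "email": "sam.lee@business.example", "phone": "555-0100"},
    {"id": "crm-9", "source": "CRM", "email": "pat@business.example", "phone": "555-0199"},
]


def resolve(records, keys=("email", "phone")):
    """Group record ids that share any identifier value, using union-find."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    first_seen = {}  # (key, value) -> id of the first record carrying it
    for r in records:
        for key in keys:
            value = r.get(key)
            if value is None:
                continue
            if (key, value) in first_seen:
                # Same identifier value seen before: merge the two groups.
                parent[find(r["id"])] = find(first_seen[(key, value)])
            else:
                first_seen[(key, value)] = r["id"]

    groups = defaultdict(list)
    for r in records:
        groups[find(r["id"])].append(r["id"])
    return sorted(groups.values())


print(resolve(RECORDS))  # [['ads-1', 'crm-7'], ['crm-9']]
```

Note that the source records are grouped rather than merged or deleted, so each system's original record stays available.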
Data lineage is more specifically around, where did that data come from? And specifically looking at, okay, some of that data came from our marketing system and Google Ads, and then the sales data came from Salesforce. What databases, what tables, what physical data elements did those pieces of information come from as we try to unify this into a single record? That applies to data resolution, but it also applies heavily to reporting and business intelligence as well as regulatory use cases where companies are constantly creating reports, and they need to understand at a point in time, what comprised this report? [00:05:00] What tables did it go back to? What were the business rules? What were the SQL rules that were being created in order to make this report? And understanding how those change over time.
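A minimal sketch of the lineage question "what did this report go back to?" might look like the following. It models lineage as a small in-memory dependency graph and walks it upstream; the node names and the `report:`/`rule:`/`table:`/`column:` prefixes are hypothetical, and a real deployment would store this in a graph database rather than a Python dict.

```python
# Hypothetical lineage graph: each node maps to the things it was derived from.
LINEAGE = {
    "report:quarterly_risk": ["rule:exposure_calc"],
    "rule:exposure_calc": ["table:loans", "table:customers"],
    "table:loans": ["column:loans.balance"],
    "table:customers": ["column:customers.region"],
}


def upstream(node, graph):
    """Return every rule, table, and column that `node` depends on."""
    visited = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for dependency in graph.get(current, []):
            if dependency not in visited:
                visited.add(dependency)
                stack.append(dependency)
    return sorted(visited)


for item in upstream("report:quarterly_risk", LINEAGE):
    print(item)
```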
If we focus briefly a little bit closer into our data resolution use case, you can see that any company is going to have a number of different platforms, whether it’s their web team, data warehouses, NoSQL systems, data lakes, or even departments between CRM, sales, marketing, finance, HR, data formats, Excel files, schemas, unstructured data, object storage, and systems, like their ServiceNow or Salesforce, Slack, Office365, and even different divisions between product lines or groups of an organization.
Each of these systems or each of these groups is going to have its own different schemas, even within the platforms or even within your departments. There are going to be individual schemas in each group. If you move between all five of them, you might have at least five, if not hundreds of schemas, depending on the size of your organization, and you’re going to need data from all of them.
This is something that, as we’ve gone through our reporting focus as a company, one of the top complaints we find in working with new customers is that users don’t have access to the data they need to make decisions, and it’s hurting their ability to innovate and drive their business forward. They’re spending too much time trying to navigate between all of these individual schemas to get the results that they need rather than actually making decisions based on the data that they already have and that we can connect together in a graph format.
Using the graph, we can help to resolve all of these different schemas and understand a single version of truth of an individual entity.
As we look at data lineage as well, we can see that we think about data lineage as a cycle. So we have a report up at the top that we’re generating, and that report comes from regulatory requirements and business rules, therefore driving towards databases, tables, physical data elements, and eventually a reporting tool that’s generating the visualization.
And this is a cycle that’s happening constantly, so even when we don’t have this error resolution special event, this cycle happens constantly. Regulatory requirements change, business rules change, we adapt and improve our business, and how we define our reports is going to change over time. This repetitive and flexible cycle must be tracked at each step along the way. We’ve seen our clients struggle enough with just reporting today that requests like “Show me what that report looked like 9 months ago” often present an untenable challenge.
We can take this complex process and track that element of time using the graph database in order to understand how these interconnected systems produce these reports at any point in time.
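One way to picture tracking the element of time is to put a validity interval on each lineage relationship and filter on it before traversing. The sketch below does that in plain Python; the edge tuples, dates, and node names are invented for illustration and are not taken from any client system.

```python
from datetime import date

# Hypothetical lineage edges: (source, target, valid_from, valid_to).
# valid_to=None means the relationship is still current.
EDGES = [
    ("report:capital", "rule:v1", date(2019, 1, 1), date(2019, 9, 1)),
    ("report:capital", "rule:v2", date(2019, 9, 1), None),
    ("rule:v1", "table:positions_old", date(2019, 1, 1), date(2019, 9, 1)),
    ("rule:v2", "table:positions", date(2019, 9, 1), None),
]


def lineage_as_of(node, edges, when):
    """Return what `node` depended on at the date `when`."""
    # Keep only the relationships that were valid on that date.
    active = {}
    for source, target, start, end in edges:
        if start <= when and (end is None or when < end):
            active.setdefault(source, []).append(target)

    found, stack = set(), [node]
    while stack:
        for dependency in active.get(stack.pop(), []):
            if dependency not in found:
                found.add(dependency)
                stack.append(dependency)
    return sorted(found)


print(lineage_as_of("report:capital", EDGES, date(2019, 6, 1)))
print(lineage_as_of("report:capital", EDGES, date(2020, 1, 1)))
```

The same report resolves to different rules and tables depending on the date asked about, which is the "what did that report look like 9 months ago" question.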
If we look at the old way, how people handled master data management and specifically data lineage as well, for those of you who are experienced and have suffered the scars of business intelligence, you may recognize IBM Framework Manager on the right-hand side here. This is really the path of centralization, restriction, and glacially slow progress in companies.
As someone who’s been in BI for a while, I have sat down and spent more time than I would like to in Framework Manager, and it’s really indicative of the old way that companies try to deal with this, which is: bring all of our data together into a central location, define all of our joins, have IT handle the whole thing, and then when businesses want requests, it’s going to be an 8-month development process because the whole thing is so brittle and inflexible that one change has cascading impacts throughout the entire thing.
It really makes it difficult to do anything and make any progress because we can only connect our data in one specifically defined way, navigating from “far away” places in our schema – so if we have data from marketing, which is connected through sales, which is connected through our product orders, which is connected through our coupon system – getting from marketing to the coupon system is requiring a number of join tables, potentially calculations that are being done in memory, based on metadata from Framework Manager or other tools, and it can become incredibly slow as well as very difficult to understand the connections between the data to build reports and to get that information out.
If we look at this process, how can we then use graphs [00:10:00] in order to help make things better? When we look at the power of graph, the power really lies in these relationships that we see on our screen here. What we see is that using these relationships, we can track the synonyms and related elements for data resolution. We can take individual entities that are disparate, but potentially related, and bring them together with relationships. And no matter how long that path is, because the cost of traversal is so low within graphs, we can make that jump without any query cost. We don’t need 10 join tables. We aren’t going to be crippled by computing all of these things in memory and scanning through all of our tables at the same time. We’re able to simply look at one element and traverse down several relationships to the next element.
Similarly, we can do that tracking of history both for our data resolution and our data lineage. The term “slowly changing dimensions” is probably familiar to many of you, and that’s something that is incredibly challenging in a relational database. It introduces two major complications that you may be familiar with, which is you have to define a date for your report – often a specific date, but at least a date range, which is often hard to understand for users – and it requires us to define upfront all of the specific fields that we want to change. That’s because doing slowly changing dimensions in a relational database is so difficult that we want to do it on as few things as possible because it’s adding a ton of complexity and significant performance hit to our queries.
The last piece of why graphs are so good for master data management is that you can easily navigate these loose and temporal hierarchies. This is something that we’ve talked about in a couple of blog posts recently, but hierarchies are incredibly powerful; however, when they become too strict and we start forcing products or people or communities of people into groups that they don’t belong in, we start losing the power of the hierarchy because we’re basically form-fitting, we’re trying to put the square peg into the round hole, and form-fitting people into an area that they don’t belong in.
Graphs allow us to actually keep these loose hierarchies and have people be in two different components of that hierarchy and use weightings in our relationships, or have that hierarchy change over time. Potentially one category in our hierarchy is no longer needed. We want to modify it, change some of the grouping on it as well. This is really highlighting – and we’ll show in a second in Hume – how graphs can really drive master data management.
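To make the loose hierarchy idea concrete, here is a small sketch in which one item belongs to two categories at once, each with a weight on the relationship. The beer names, categories, and weights are made up for illustration.

```python
# Hypothetical weighted memberships: (item, category, weight).
MEMBERSHIPS = [
    ("Hazy Pale Ale", "IPA", 0.6),
    ("Hazy Pale Ale", "Pale Ale", 0.4),
    ("Dry Stout", "Stout", 1.0),
]


def categories_for(item, memberships, min_weight=0.0):
    """Return (category, weight) pairs for an item, strongest first."""
    matches = [
        (category, weight)
        for member, category, weight in memberships
        if member == item and weight >= min_weight
    ]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


print(categories_for("Hazy Pale Ale", MEMBERSHIPS))
print(categories_for("Hazy Pale Ale", MEMBERSHIPS, min_weight=0.5))
```

Nothing forces the item into a single parent; a strict view can still be derived by raising `min_weight`.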
Just talking briefly about a couple of our specific use cases for Hume for master data management, one of the top ones that we see is regulatory compliance. We’ve seen this at a number of top financial institutions, where companies or institutions are looking at “How am I satisfying regulatory requirements, and how do they change over time?” Often, this looks like understanding “How did I deliver this report back in January of 2018? What were the physical data elements and reporting structures that I was leveraging then?”
That can also look like “How do my business rules change over time in order to better satisfy these requirements?”, because you may get feedback from regulatory bodies saying, “This datapoint doesn’t sit well with your peers. Can you give us some more explanation as to how you got it?”, and they’ll have to come back and say, “Okay, let’s look at three months prior. How did we submit this report? What were the business rules? We can see that we have a physical data issue on this system, and that’s impacting this regulation.” That way it lets them get an answer quickly and efficiently to regulatory bodies without massive headache, months of time trying to go back in time and recreate things.
Then again on slowly changing dimensions, this is something we see across all use cases. Basically in every application that we build, there are things that we want to track time on. We want to look at how this quote changed over time, what is the history of this product, was its price going up or down, and all of these things are possible within a graph database using those relationships by creating chains of elements. So we actually will chain together a number of different elements to represent history, which allows us to keep our connections along the chain, but also always keep a reference to the most current aspect.
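The chain-of-elements approach can be sketched as a linked list of versions, where the entity always points at its most current version and each version points back at the one it replaced. The class and field names below are hypothetical and only meant to show the shape of the structure.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class PriceVersion:
    price: float
    valid_from: str  # ISO date string, e.g. "2020-01-01"
    previous: Optional["PriceVersion"] = None  # link to the version it replaced


class Product:
    def __init__(self, name: str, price: float, valid_from: str) -> None:
        self.name = name
        self.current = PriceVersion(price, valid_from)  # always the latest version

    def update_price(self, price: float, valid_from: str) -> None:
        # Prepend a new version; the old one stays reachable through `previous`.
        self.current = PriceVersion(price, valid_from, previous=self.current)

    def history(self) -> Iterator[PriceVersion]:
        """Walk the chain from the newest version back to the oldest."""
        node = self.current
        while node is not None:
            yield node
            node = node.previous


product = Product("Sample Lager", 10.0, "2020-01-01")
product.update_price(12.0, "2020-06-01")
print(product.current.price)                    # 12.0
print([v.price for v in product.history()])     # [12.0, 10.0]
```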
Another really strong use case that we’ve seen for master data management with graphs is data science, where we’re doing a lot of data manipulation prior to running our models. Especially with a lot of the [00:15:00] automatic models that we can run these days, you want to do a lot of cleansing, a lot of trying to unify disparate customers into a single customer, as well as connecting data from all of those different systems. Doing that process in an ETL flow or in something like Python or R can be extremely challenging, but using a graph, we can bring all of that data together, do our joins and complex processing in the graph database, and then take that data structure and push it into some of these more automated data science tools.
Then lastly, graphs assist in basic data processing and in ETL. We worked with one of our clients, another large financial institution that was receiving loan and banking documents from thousands of smaller financial institutions, a local-banks type of setup. They found over time that one of the biggest problems they had was that people had different names for fields. It might be “FIRST NAME, LAST NAME” or it might be “FIRST_NAME” and “LAST_NAME,” or it might be all joined together on “NAME.” They were having a really hard time, when there were issues with data, talking to these local institutions about it because they all were referring to the same column with different names.
So they actually used Neo4j to build a custom data processing pipeline where they could ingest all of these varieties of schemas and bring all of them to a unified standard schema, but keep that reference back to the original data source name and column name. And that did a number of things for them. One, it automated their process so that it was much faster to get these documents and have them show up in a unified schema, but it also meant that when there were errors, they could go talk easily and quickly to these partner institutions and get them resolved, and that ended up saving them huge amounts of effort over time.
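The core of that pipeline, mapping many column names onto one standard schema while keeping a reference back to the original source and column, can be sketched as follows. This is not the client's implementation: the synonym table, the `unify` function, the source names, and the naive split of a combined "NAME" column on the first space are all assumptions for illustration.

```python
# Hypothetical mapping from source column names to a standard schema.
COLUMN_SYNONYMS = {
    "FIRST NAME": "first_name",
    "FIRST_NAME": "first_name",
    "LAST NAME": "last_name",
    "LAST_NAME": "last_name",
}


def unify(row, source):
    """Map one source row onto the standard schema.

    Returns (unified_row, provenance), where provenance records the
    original (source, column) that each standard field came from.
    """
    unified, provenance = {}, {}
    for column, value in row.items():
        if column == "NAME":
            # Naive assumption: "First Last" split on the first space.
            first, _, last = value.partition(" ")
            parts = {"first_name": first, "last_name": last}
        elif column in COLUMN_SYNONYMS:
            parts = {COLUMN_SYNONYMS[column]: value}
        else:
            parts = {}  # unmapped columns are skipped in this sketch
        for field, part in parts.items():
            unified[field] = part
            provenance[field] = (source, column)
    return unified, provenance


print(unify({"FIRST NAME": "Ada", "LAST NAME": "Lovelace"}, "bank_a"))
print(unify({"NAME": "Ada Lovelace"}, "bank_b"))
```

Because the provenance travels with the unified row, a data issue in `first_name` can be raised with the partner institution using the column name they actually sent.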
I’m going to pause sharing on that slideshow for a moment here and start sharing a specific window, because I want to take a look at how we can use Hume to do some aspects of master data management. Hume is a graph-powered insights engine created by GraphAware that we resell in the U.S. I want to look at a couple of different aspects for it for master data management.
One of the most powerful things that we can do in master data management is to start looking at, what is our data going to look like when it’s unified together? When we start looking at this data – this is a beer use case where we have users who have reviewed beers, and we’ve done some natural language processing in Hume on that data. We’re just going to focus on users, reviews, and beers at the moment.
If we open up one of our users, specifically our Demo_User_1, we can see that they have reviewed a number of different beers. We’ll only load about 500 of these. And once we’ve loaded 500 of their reviews, we can see that we’ve already grouped these nodes based on certain properties. So we can actually see that these reviews are about the same style.
Once I turn off this grouping, we’ll see that this visualization becomes almost untenable, and this is the power of master data management: taking a visualization like this, which is meaningless, and turning it into something more powerful.
I’m actually going to add in as well the beers that are within this group, so we’ll get all of the beers that these reviews are about. You again will see that we have really a massive visualization here. Using Hume, we can leverage one of our powerful capabilities here called computed attributes. We can actually compute, at runtime, certain attributes of nodes, and this allows us to test some of our master data management queries at visualization time without having to write that information to the database. And we can rapidly iterate, as we might not have the perfect fit solution immediately, but we can develop it quite quickly within the Hume platform.
So we can look at something like a review style attribute where we’re actually taking the review and computing which beer style that review is about. And then once we have a computed attribute, we can set up our node grouping and group our nodes together. This falls under the data resolution use case that we were discussing earlier, where we want to know, okay, we have all of these different reviews about different styles, and I know that there are similar styles in here, but which ones are about which?
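Conceptually, a computed attribute followed by node grouping amounts to deriving a value for each node at read time and then bucketing nodes by that value, without writing anything back. The sketch below shows that idea in plain Python; it is not Hume's API, and the review and beer data, `review_style`, and `group_by` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical data: reviews point at beers, and beers carry a style.
BEERS = {
    "b1": {"name": "Hop Cloud", "style": "IPA"},
    "b2": {"name": "Night Shift", "style": "Stout"},
    "b3": {"name": "Citrus Haze", "style": "IPA"},
}
REVIEWS = [
    {"id": "r1", "beer": "b1"},
    {"id": "r2", "beer": "b2"},
    {"id": "r3", "beer": "b3"},
]


def review_style(review):
    """Computed attribute: the style of the beer a review is about."""
    return BEERS[review["beer"]]["style"]


def group_by(nodes, attribute):
    """Group node ids by the value of a computed attribute."""
    groups = defaultdict(list)
    for node in nodes:
        groups[attribute(node)].append(node["id"])
    return dict(groups)


print(group_by(REVIEWS, review_style))  # {'IPA': ['r1', 'r3'], 'Stout': ['r2']}
```

Since the attribute is just a function, it can be swapped for a different rule and the grouping rerun, which mirrors the rapid iteration described above.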
If I enable this grouping, you’ll see that we move from a visualization that is essentially meaningless to a very clear and optimized visualization. [00:20:00] Taking this beyond the beer use case to a broader picture, even though we’re looking at styles of beers here, this could just as easily be individual instances of the same user and references to that user in different systems. So we might have our marketing systems and our CRM systems and our order systems. These are all different instances of an identical user that we’ve now been able to resolve to a single instance and group in our visualization.
This is incredibly powerful because it means that, again, we can keep that data lineage. We don’t have to simply erase our user in our separate systems. We can keep them, but visualize them together as one individual entity. This is really powerful as you have different users within the system because you may be having your marketing users in there, and they’re going to know the marketing fields and the marketing database and want to reference all of their information using fields that they are familiar with.
That’s really the power of keeping that data lineage. You not only get the value of being able to leverage relationships to quickly connect things and get to a single source of truth, but you get the convenience of your users being able to understand the data in their own native language.
The efficiency of that to drive business forward really can’t be overstated. It’s something that we’ve seen time and time again as we go in to our clients. We’re often trying to say, “Okay, how does this Salesforce data connect into this Google AdWords data?” and we’re bringing together two departments that don’t talk to each other often enough, frankly. They don’t understand the other department’s data. So we come in and we bridge that gap. But then once we’ve left, they’re sometimes unable to keep bridging that gap. But when we introduce a graph database, they each can continue to talk about the data in their own language, and that graph is filled with master data management on Neo4j using the power of these relationships.
We can do a few things using the Hume platform. One of the things is we can start with that same process and we can search by some of these organization names. We’ll take one example here like the Federal Reserve. We can start searching for some entities that we know are in our database, so we can see the different examples that might be present. As we can see, we have a number of different Federal Reserve node instances here. We can actually start to just do queries on these as if they were the same entity, so I can understand what are the keywords for this organization, and we can see that for all of these instances of the Federal Reserve, these are the top keywords that are relevant to them.
That’s simply by leveraging the fact that we can expand these organizations into articles, and from these articles expand into keywords, navigating through all of these different connections incredibly rapidly with the limited cost of our relationships. As we’re moving between all of these different entities of Federal Reserve, we can navigate these specific keywords and understand that, despite the number of instances here, the different namings of the Federal Reserve refer to a few central keywords, which can get us both the most important keywords to the Federal Reserve and help us understand an individual instance of the Federal Reserve.
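The organization to article to keyword expansion can be sketched as a two-hop traversal followed by a count. The node names, article ids, and keywords below are invented for illustration and do not come from the demo database.

```python
from collections import Counter

# Hypothetical graph: organization instances -> articles -> keywords.
MENTIONS = {
    "Federal Reserve": ["a1", "a2"],
    "The Fed": ["a2", "a3"],
    "US Federal Reserve": ["a3"],
}
KEYWORDS = {
    "a1": ["interest rates", "inflation"],
    "a2": ["interest rates", "employment"],
    "a3": ["interest rates", "inflation"],
}


def top_keywords(org_nodes, mentions, keywords, n=2):
    """Treat several organization nodes as one entity and rank its keywords."""
    # Hop 1: collect each article once, even if several instances mention it.
    articles = set()
    for org in org_nodes:
        articles.update(mentions.get(org, []))

    # Hop 2: count keywords across those articles.
    counts = Counter()
    for article in articles:
        counts.update(keywords.get(article, []))
    return counts.most_common(n)


print(top_keywords(list(MENTIONS), MENTIONS, KEYWORDS))
# [('interest rates', 3), ('inflation', 2)]
```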
And then the last thing that we can do with the Hume platform is sometimes master data management actually can involve enriching data with other pieces of information. So we may want to be processing these RSS feeds and extracting organization names and then enriching that information with external ownership structure or external corporate data. One of the things that we’ve done in some of our use cases is use the power of Hume Orchestra. [00:25:00] We have within Orchestra the ability to create calls out to external APIs or even internal APIs, and we can use this HTTP enriching skill to actually, within the platform, enrich our data. We’ve looked at using things like the Open Corporate data in order to enrich information. So we might be able to do something like search “federal reserve,” and we can see a lot of different companies that may have the name “federal reserve” in them. And if we looked at some of our search results from Hume back in this visualization, some of our instances of “federal reserve” may actually resolve to these companies, or they may resolve specifically to the U.S. Federal Reserve.
But we can take this structured data and list of related companies and unify it alongside our unstructured data, and we can quickly extend out our schema to say things like we might have a related organization. So if I draw a relationship here, we might have RELATED_ORGANIZATION. Now in our queries, we can quickly navigate through that chain of related organizations to start identifying communities or individual instances of organizations.
I’ll go back to the presentation here. That wraps up our webinar on master data management and the power of graphs. We looked at how graph databases can help with master data management for both data resolution and for data lineage use cases or a combination of both.
At this point we will go to the chat to see if we have any questions. I see one in the chat here: “Should we migrate completely off our existing tools, or is this a complement?” We really see graph databases as a complement for most master data management use cases. While for graph-centered appdev, we would recommend building with graphs at the core, for master data management, the use case here is that you already have a number of existing systems, and it’s simply not feasible to replace all of them with a single graph database. So we’ll typically see this as a complement and serving as a sort of middleware for the interaction with data.
Looking at one other one here, “What kind of technical skill would be required from our side to implement an MDM solution like this?” Just a plug for us is that obviously we do graph database consulting, and I’ve worked with a number of different clients on this solution. So if you leverage our expertise, the technical expertise can be low. What we typically see is that these projects actually result in less effort for the clients long term than their current solutions do. But in terms of maintaining it, you’re looking at skilling up on some Neo4j skills, and depending on the technology we might leverage for the middleware, looking at those technical skills as well.
Then just the last one here is, “What are some benefits to basing the master data management paradigm around graph versus typical relational structure?” This is something that I think I’ve talked about, but just to really highlight the point, it’s the power of the relationships. That’s why we work with Neo4j, which is the leading enterprise property graph database.
The difference between a relational database and a graph database is, confusingly, that despite the fact that relational has “relation” in the name, relationships are actually really the Achilles heel of relational databases, because navigating between different tables often requires these intermediate join tables or complex SQL statements or sometimes just straight lookups and subqueries, whereas in a graph database, that relationship is really truly native. It’s stored on disk as pointers between the different elements, and that makes the traversal so much faster that it’s really not a competition. Graph databases really are the future, both for master data management and for most other use cases as well, in our opinion.
We do have another question here on “How do you manage restrictions on edges?” That’s a really good question. Actually, in Neo4j 4.0, [00:30:00] they released node and relationship level restrictions where we can let users traverse certain nodes but not actually return information on that. Depending on the use case for restrictions, that’s something that can be managed with properties on the edges, or something that also might get managed in the middle, if we’re looking at a data service API. That’s a great question.
If there aren’t any other questions, we’ll wrap up here. The slide deck also will be available – it should be emailed out to you at the end. We’ve got a couple resources here that we recommend on master data management, looking at graph databases, some of the solutions out there.
As always, please feel free to reach out to us at firstname.lastname@example.org, and head to graphable.ai/events to see more of our upcoming webinars or blog posts and some of the content that we’re putting out there constantly. Thank you guys very much. Appreciate you taking time on this Tuesday to learn about master data management with us. Everyone have a great day.