The process of preparing data for use in predictive models is often a significant barrier to successful deployment. Richer, more informative datasets tend to be more complex, making the engineering of features from the raw data cumbersome and opaque to business stakeholders. A novel solution is the use of a flexible database in the background that can accommodate complex relationships within the data while also allowing for transparent feature engineering. In this discussion, we hope to demystify feature engineering and data preparation for data science efforts while also demonstrating how a graph database can make model building more efficient and more transparent to business stakeholders.
In this session you’ll learn how to:
- Ingest complex relational tables into a graph database
- Leverage graphs for transparent data manipulation and feature engineering for machine learning
- Convert data science into a business-friendly process using Domo AutoML
- See the benefits of Domo’s “polyglot” data ingestion capabilities using connectors
Additional info and registration:
- More session details: Leveraging Graphs for Data Preparation and Feature Engineering for Domo AutoML
- All DP21 sessions: DP21 Agenda
- Register for conference & more conference details: Domopalooza Virtual Conference
Video transcript: [00:00:00] Dr. Lee Hong: Hello, everyone. Welcome to our presentation on leveraging graphs for data preparation and feature engineering for Domo AutoML. We are from Graphable. Rebecca and I will be walking through the process of taking data and putting it into a machine learning algorithm as painlessly and as seamlessly as possible using a graph database.
What we will cover today – the promise, the problems, and the pain points. Why is it so difficult to get a machine learning algorithm into production? Why are there so many pain points and high failure rates? And why are the reports telling us that it’s actually very, very difficult to get a machine learning and artificial intelligence project up and running and actually powering your business?
In the second half of our presentation today, we will be talking about the pain reduction process. How do we leverage Domo and Neo4j, the graph database, to make your process much more seamless and much easier, taking in a fairly complex dataset, manipulating that data, and then putting it into AutoML using our Domo Neo4j connector?
Today, I am Lee Hong. I’m the Director of Data Science at Graphable. I have been in the data science business probably for about the better part of a decade. I used to be a neuroscience professor, and I’ve done a lot of work with data and seen the ups and downs and the difficulties that everybody goes through in bringing these projects to life.
With that, I will hand it over to Rebecca really quickly for a quick introduction.
Rebecca Rabb: I’m Rebecca. I come from the consulting side at Graphable, so I work with Domo customers and many others on various use cases in the medical space, in the food space, in just operations, marketing, advertising. So I come from a space of wanting to make this approachable and usable for the everyday Domo user.
Dr. Lee Hong: As we start walking into the promise of machine learning, one of the things that we’ve started to learn, particularly as the pandemic came about, is this growth in popularity of artificial intelligence and machine learning in the business context. As more and more organizations were forced to move their business and their enterprise onto a digital platform, there became a bigger and bigger need to let the computer do the work that humans, for one, are not necessarily that good at, and also take a lot of that load off the human labor, making the business more efficient and working to get the things that your business needs to do to empower itself.
As we read a lot of the news, Harvard Business Review, VentureBeat, it consistently talks about the bright future of AI developments and deployments and bringing those into production and into a place where it can actually power your business and help your business move along more efficiently.
There are still many, many barriers to production and a lot of pain points that many businesses see when developing their AI, artificial intelligence and machine learning platform. Wall Street Journal is talking about a near 50% rate of failure of AI projects – about half. There’s a lot of discussion even with PwC needing that boost to get AI into production when such a high failure rate consistently hamstrings organizations from getting that going.
When you really boil down a lot of the problems that these organizations face, there are two major things. One is a lack of transparency. When we talk about transparency, it means the business doesn’t necessarily understand what the algorithm is doing, so it’s very difficult for the business side to understand why a model may not be as accurate, or maybe the model is behaving in a way that’s counterintuitive to the business. So being able to have that transparency and say “This is what we’re using to teach our model and this is why it’s producing the results it’s producing” is very, very important.
Then the second piece is the operational cost. Taking data, moving it, preparing it, cleaning it, and then feeding it into the machine learning algorithm is a fairly costly process. So when we read the overall discussion about AI and the failure of AI, it comes down to having the right problem, the right team, the right data, and an approach to handling any potential biases that may be living in your data that you may not know about. There’s a high cost of building a huge team to take data and move it. There’s a high cost of [00:05:00] getting somebody to actually communicate the concept of what’s going on in the machine learning and artificial intelligence space to the business side, particularly if you have people who aren’t that well-versed in the technology business.
So the first problem is we have to overcome the poor transparency, which is a disconnect between the business side of the house and the data science side of the house. The second thing that we have to overcome is the ability to engineer, move, prepare, and clean data easily.
The bottom line of our presentation is the concept of having this one-stop shopping approach where I have a place where I can take data, move it in, build my model, and then validate it in a very easily digestible way that both business and technical users can use, understand, and disseminate. This is where Domo comes in – being able to have that one single place where I can have my data, I can visualize my outputs, and I can also train my machine learning models.
As an example for today, we’re going to look at a business example where I need to know if I’m getting enough foot traffic through my door. The tricky part is, I don’t have any data from anybody else in the neighborhood or around me, so I need to be able to find a benchmark to say, “Yes, I should be hitting these targets.”
Now, I could just go out and say, “Everybody else is getting 20 people,” but is that the right number? Should I use the historical KPIs from my business? I got 30 last month; should I be getting 30 this month, or am I a seasonal business? And is anybody getting more foot traffic than I am?
One of the ways that you can solve this problem is to build a machine learning model that uses other people’s traffic to essentially guesstimate how much traffic I should be getting. So if other people are doing well in the same region, the same business area, I should be seeing fairly similar numbers. This is one of the ways to use machine learning as a benchmarking and forecasting approach and allowing me to take these insights and start to adjust my business strategy.
Really, at this point, we have to say a huge, huge thank you to the SafeGraph team for giving us a selected amount of data from the Columbus metro area. It’s been wonderful working with their data. It’s extremely rich, and we really have to thank them a lot for giving us this access.
The SafeGraph data is an excellent data source. It gives you so many dimensions and a very rich view into the foot traffic data in any given location. But what makes it rich also makes it complex, as you can see from this screen. It has data that’s nested in arrays. Sometimes you have fields that exist, sometimes you have fields that don’t. It’s just the nature of the beast. I might have a business that runs only five days a week, whereas somebody else has a business that runs seven days a week. Some people might be only open four. Some businesses are open 10 hours, some businesses are open eight.
Making that into any kind of functional schema from a table form would be pretty complicated. I would have to think about what my reference tables are for my days and my hours, and I would have to have bridge tables that connected things together. One of the difficult parts is if anything changes to my business or somebody else’s business, there might be some impact to my relational schema.
Constantly having to readjust and reengineer these models gets pretty difficult over time, and maintaining that data architecture so that it’s flexible enough for me to continue to use this for my upcoming machine learning algorithms, retrain my models and train future ones, becomes very, very difficult and very slow.
What we can then do is take that SafeGraph data and build that into a Neo4j graph database. What a graph database does is it takes those rows and columns and maps them onto what we call nodes and relationships. So each of these bubbles that you see here on the screen is a node, and then the line that connects the two is a relationship.
What’s nice about this is what we create is an entity-based model where I can say here I have a location that’s the center of my universe. That location is my store. That store belongs to some subcategory of business, and that subcategory of business belongs to a top category. Then that location also gives me information for my address. It tells me where my business is located, and then it also gives me information about the device types of the people who are visiting my location, and I can then check how popular my location is compared to other locations and by time of day. So it gives me a very, very rich dataset, but it allows me to nest some of the information inside those nodes and relationships. [00:10:00] That’s one of the interesting ways of storing data in a property graph. It allows me to store “properties,” which would normally be called fields or variables, as part of a node or part of a relationship.
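As an editorial illustration of the property graph idea described here, the following is a minimal in-memory sketch in Python, not Neo4j itself. The class names, the relationship types `IN_SUBCATEGORY`, `IN_TOP_CATEGORY`, and `HAS_DEVICE_TYPE`, and all values are assumptions made for the example; the talk does not give the exact schema names for these.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str  # kind of entity, e.g. "Location" or "SubCategory"
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    rel_type: str
    end: Node
    # Relationships can carry properties too, just like nodes.
    properties: dict = field(default_factory=dict)


def follow(node, rel_type, relationships):
    """Return the nodes reached from `node` over relationships of one type."""
    return [r.end for r in relationships
            if r.start is node and r.rel_type == rel_type]


# Hypothetical data: names and relationship types are illustrative only.
store = Node("Location", {"name": "Example Store"})
sub = Node("SubCategory", {"name": "Example Subcategory"})
top = Node("TopCategory", {"name": "Example Top Category"})
device = Node("DeviceType", {"name": "ios"})
relationships = [
    Relationship(store, "IN_SUBCATEGORY", sub),
    Relationship(sub, "IN_TOP_CATEGORY", top),
    Relationship(store, "HAS_DEVICE_TYPE", device, {"visitors": 12}),
]

# Walk Location -> SubCategory -> TopCategory.
top_categories = [
    t
    for s in follow(store, "IN_SUBCATEGORY", relationships)
    for t in follow(s, "IN_TOP_CATEGORY", relationships)
]
```

The point of the sketch is that a field such as the visitor count can live on the relationship itself rather than in a separate bridge table.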
With that, I’m now going to hand it off to Rebecca to start taking a view of a single location and how that data in that CSV file and from that table actually becomes converted into a specific graph.
Rebecca Rabb: Yeah, thanks, Lee. We wanted to step back here from Lee’s explanation of the schema and concentrate on what a single location would look like on our schema – for example, River Road Coffee House out at Granville, a favorite location in college. Just to walk through the different pieces of the schema and how it will be applied to the different features.
Starting in the middle, in the dark green will be the location, and then belonging to a particular subcategory in yellow, this one being under the snack and nonalcoholic beverage bars subcategory, and then also belonging to the top category of restaurants and other eating places. In this case, the coffee shop only belongs to one subcategory and one top category; however, with a graph, it would be fairly common for something to belong to many, and the walk along the graph stays pretty simple even if you wanted to hit multiple.
In addition, to the left in the pink bubbles, you can see the popularity by day of this given location. Again, in a star schema, Lee mentioned sometimes these businesses aren’t open particular days of the week; sometimes there are certain hours. This gives the ability to look at all days, whether they’re relevant or not. It’s a coffee shop, so it’s open seven days a week, and you can see all of the popularity data. In this case, it’s a rating associated with each.
In addition, near the bottom of the screen you’ll see the location information that we have. So we’re able to see it’s at 935 River Road and also located in Granville, being the city here, and the zip code of 43023. In addition, we’re able to capture the geolocation for the lat and long if you’re able to put that up on some sort of map. In this case, it will be your purple bubble.
To the right, the SAMEDAY_BRAND_LOCATION. So in Lee’s nod to the richness of the SafeGraph data, there’s quite a bit of detail in terms of other locations that people visited when they visited River Road Coffee House. In this example, people tended to visit GameStop and Sunoco, so one can make an inference that maybe it’s a younger crowd visiting here.
In addition, in the light blue bubble on the right-hand side, the HAS_VISIT relationship connects to say the number of visitors or the number of visits at that given location. In addition, finally, in the top right in the green is where the data is gathered from. So is it gathered from iOS or Android?
Then on this graph, you say, what’s the feature engineering process going to be like? In our Neo4j approach, we’ll use path queries. I mentioned being able to just semantically walk along the schema of the graph. We’ll use something called WITH clauses – those who are familiar with SQL, this is kind of just subqueries – along our schema. And then we’ll reflect our aggregation back to a single node or a single location in this case.
If you were to do this in SQL, it’s a little bit different. It would require many mapping tables, many dimension tables. You’ve got to keep track of your aggregations, your keys, that sort of thing. The schema becomes very cumbersome, very large, very complex. It takes a very narrow set of people who would be able to manage it. Neo4j really simplifies that process. Also noting the subquery and the aggregations, you would need to do some self-joins, some outer joins. That becomes incredibly messy with duplicates and ghost joins. It’s not pretty.
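One way to picture the duplicate problem mentioned here is a join fan-out: when two one-to-many tables are joined to the same location key, every row on one side is repeated for every row on the other, and a naive sum is inflated. The sketch below imitates that with plain Python tuples; the table contents and the `join_on_location` helper are hypothetical and are not taken from the talk.

```python
def join_on_location(visits, popularity):
    """Imitate an inner join of two tables on their location key."""
    return [
        (loc, count, day, rating)
        for (loc, count) in visits
        for (p_loc, day, rating) in popularity
        if loc == p_loc
    ]


# Hypothetical rows: two visit records and three popularity records
# for the same location.
visits = [("loc1", 100), ("loc1", 50)]
popularity = [("loc1", "Mon", 3), ("loc1", "Tue", 5), ("loc1", "Wed", 4)]

joined = join_on_location(visits, popularity)

# Summing visits after the join counts each visit row three times.
inflated_total = sum(row[1] for row in joined)

# Aggregating before combining gives the real total.
true_total = sum(count for _, count in visits)
```

Here the joined result has six rows, so the summed visits come out at three times the real total unless the aggregation is done first.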
Cypher is the language by which you query Neo4j and Neo4j consumes the query. In our case, in generating the features, I’ll just walk through some of the Cypher that we produced here.
The beginning statement, match on location, is just saying per location, and then each subsequent block is a feature that we created on a given location, the first one being the distance to nearest competitor. So we’re trying to say we should measure foot traffic on what their competitors are doing. As you can see, [00:15:00] it’s a walk along the graph from the geolocation to the subcategory from either side. So it’s saying of the location, which match is in that particular subcategory? We’re generating a latitude and longitude point from that – as I mentioned, that was a piece on our schema – and with that calculating the distance between these given items. Something like this in SQL becomes incredibly intensive to calculate given that you have to loop through every location. This becomes very simple in Neo4j in that it’s a walk along a node and it’s a very quick query.
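The Cypher shown on screen is not reproduced in the transcript, so the following is only a rough Python sketch of the "distance to nearest competitor" feature. It assumes a competitor is any other location in the same subcategory, and it swaps in the haversine great-circle formula for whatever distance function the original query used. The field names and coordinates are made up for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_competitor_km(target, all_locations):
    """Distance to the closest other location in the same subcategory.

    Returns None when the target has no competitor in its subcategory.
    """
    distances = [
        haversine_km(target["lat"], target["lon"], other["lat"], other["lon"])
        for other in all_locations
        if other is not target and other["subcategory"] == target["subcategory"]
    ]
    return min(distances) if distances else None


# Hypothetical locations; coordinates are illustrative only.
locations = [
    {"name": "Shop A", "subcategory": "coffee", "lat": 40.07, "lon": -82.52},
    {"name": "Shop B", "subcategory": "coffee", "lat": 40.08, "lon": -82.50},
    {"name": "Shop C", "subcategory": "coffee", "lat": 39.96, "lon": -83.00},
    {"name": "Garage D", "subcategory": "auto repair", "lat": 40.07, "lon": -82.51},
]

features = {loc["name"]: nearest_competitor_km(loc, locations) for loc in locations}
```

In the graph version described in the talk, the subcategory filter is the walk from one location through its subcategory node to the other locations, rather than a comparison across every pair of rows.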
The next one was the average day popularity rating, was what we called this. Along the left side of the schema, if you remember, there was popularity by given days, Sunday through Saturday. All this is saying per location, what was that average rating? In addition, we calculated the number of competitors in the subcategory in the same zip code. In this case, for the River Road Coffee House in 43023, how many other coffee houses or restaurant businesses exist within 43023?
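The two features described here can be sketched in Python as follows. This is an illustration under assumptions, not the query used in the talk: the record layout, the field names, and the sample values are hypothetical, and "competitors" is taken to mean other locations that share both the subcategory and the zip code.

```python
from collections import Counter
from statistics import mean


def average_day_popularity(location):
    """Mean of the per-day popularity values recorded for one location."""
    return mean(location["popularity_by_day"].values())


def competitor_counts(locations):
    """For each location id, count the other locations that share its
    subcategory and zip code."""
    group_sizes = Counter((loc["subcategory"], loc["zip"]) for loc in locations)
    return {
        loc["id"]: group_sizes[(loc["subcategory"], loc["zip"])] - 1
        for loc in locations
    }


# Hypothetical records; a location only lists the days it has data for.
locations = [
    {"id": "a", "subcategory": "coffee", "zip": "43023",
     "popularity_by_day": {"Mon": 10, "Tue": 20, "Wed": 30}},
    {"id": "b", "subcategory": "coffee", "zip": "43023",
     "popularity_by_day": {"Sat": 40, "Sun": 60}},
    {"id": "c", "subcategory": "coffee", "zip": "43055",
     "popularity_by_day": {"Mon": 5}},
]

avg_popularity = {loc["id"]: average_day_popularity(loc) for loc in locations}
n_competitors = competitor_counts(locations)
```

Because each location carries only the days it actually has, a five-day business and a seven-day business can be averaged with the same function and no schema change.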
We also pulled the top category here as a feature, pretty simple, to say if you’re in the same category, your foot traffic may be similar. And then we also wanted to just pull down the raw number of visits per location to have something to calc off of.
Then the work becomes, we have this schema, we have all of our SafeGraph data; how do we get it into Domo? We have the Neo4j connector, just like any connector from the 700+ in the store. You just search for Neo4j, and come down. The screen would look like so. Some simple basics in getting into your Neo4j connector, you’ll need a username and a password for your database, and then the instance URL where it lives.
From there, the only inputs – it’s not like a dropdown of reports or different things that you may be familiar in seeing in a connector. The only options here are to submit a query and then what that query would be. The Cypher that we were looking at before, you would simply copy and paste what that Cypher is into the below window, and that’s how you would access the database.
Once you have the dataset in Domo, it will come in via table like you’re familiar with on the Neo4j connector. In the top right of that dataset, you’ll have the ability to select AutoML and send it through the model. We generated the features. We may have some bias on our data; we may be generating the feature incorrectly or needing to go back and workshop it. Maybe we’re overfitting something in our prediction. So just keeping track of what we bring in and how we model off the data. Say that we bring in a feature we don’t want to use or columns that maybe will skew the data. Always feel free to send it through an ETL, for example, to sort through the relevant columns.
In AutoML, when you choose to say “I want to send this dataset of mine and predictive value through AutoML,” you’ll be prompted to select a few different items. In our case, we generated features on a location and we want to predict the visit counts. So we’re predicting raw visit counts, would be our field here. Then you’ll choose a task type as well. You’ll have a couple different options, but for here we chose to do a regression. You’ll also have a selection on the number of candidates to run in your model. From an efficiency point of view, fewer runs will go faster but may not be as accurate. A lot of runs may just take a ton of time.
So then you’ll want to run your model and somehow use the Domo platform to model and visualize your results. Domo becomes a really great one-stop shop for being able to deploy your model, run against your data, and then say, “Hey, were my results valid? Are they close? Are they biased?”, that sort of thing. It’s really important to be able to check this quickly, to be able to go back and troubleshoot as needed. As I mentioned, even in our experience, we created some features that added a little bit too much bias to the data. But what’s important is to rinse and repeat here and being able to modify as you go.
In the spirit of being able to validate your data in a transparent manner, in a manner that’s easy to consume by different users, we put together our results in a basic dashboard for the different stakeholders to QA. The first thing you see here, we want to see our mean absolute error on our model. That’s pretty important. As a reminder, we’re not here in the business of creating [00:20:00] a great model; we’re here to profess that graph is an easy, transparent way to bring it in. Our MAE isn’t great, but 73% is what we came out with when we ran our data. But interestingly enough, our net percent error was less than 1%. So we overpredicted quite a bit and we underpredicted quite a bit, but it all balanced out to almost zero, so that was interesting to see on the data.
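To see how a large absolute error and a near-zero net error can coexist, here is a small worked example in Python. The talk does not define its two metrics precisely, so the formulas below are assumptions: the absolute error is taken as the mean of per-location absolute percentage errors, and the net error as the difference between total predicted and total actual visits, relative to total actual visits. The numbers are invented.

```python
def mean_abs_pct_error(actual, predicted):
    """Average of per-location absolute errors, as a percentage of actual."""
    errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


def net_pct_error(actual, predicted):
    """Signed error of the totals, as a percentage of total actual visits."""
    return 100 * (sum(predicted) - sum(actual)) / sum(actual)


# Hypothetical visit counts for three locations.
actual = [100, 200, 50]
predicted = [150, 120, 80]  # one overprediction, one under, one over

abs_error = mean_abs_pct_error(actual, predicted)
net_error = net_pct_error(actual, predicted)
```

In this toy case every location is off by 40% to 60%, yet the predicted and actual totals are both 350, so the over- and underpredictions cancel and the net error is zero.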
In terms of the breadth of data, we had 3.79 million actual visits to our different locations. Our model, again, predicted pretty close to that, 3.8 million, but again, it vastly overestimated and vastly underestimated in different cases.
Below on the left, I’d like to point to the top category on our actual visits. Based on the area of the square here, just on a tree map, pretty simple, is to say that there was quite a bit of bias in our data in the actual visits and our number of locations for the given items, meaning that a very large portion of our data, just pointing to the blue squares for example, lay with restaurants, grocery stores, gas stations, real estate, that sort of item.
Also, this data was pulled in April of 2020, in the thick of coronavirus, so again, it’s very difficult to predict an anomaly in data. But something to note when building a good model is to say maybe throw out some outliers. In this case we had a large number of locations and large number of visits; maybe throw out the large one and throw out those little ones in the corner here that don’t even register on the data. Miscellaneous durable goods, specialty stores, postal service – these sorts of things aren’t really helping our data. We’d like to fall maybe in the middle of the distribution here.
As I come down here, we also did the prediction error across top categories. Again, that was one of our features, for example. We wanted to say which categories did our model perform best on and which categories did our model perform worst? As we see, machinery, equipment, accounting, other financial, the absolute error was over twice as many visits as there actually were. But you can also see under the number of records, there weren’t many locations there to predict off of. So as I mentioned before, maybe monitoring some outliers, throwing out some data that lives on the bottom of that distribution in order to improve it moving forward.
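One simple way to act on this suggestion is to drop categories that have too few locations before retraining. The sketch below shows that filter in Python; the `min_records` threshold, the field name `top_category`, and the sample rows are all hypothetical, and the talk does not say which cutoff, if any, was used.

```python
from collections import Counter


def drop_sparse_categories(rows, min_records):
    """Keep only rows whose top category has at least `min_records` rows."""
    counts = Counter(row["top_category"] for row in rows)
    return [row for row in rows if counts[row["top_category"]] >= min_records]


# Hypothetical training rows: one well-represented category, one sparse.
rows = [
    {"top_category": "Restaurants", "visits": 120},
    {"top_category": "Restaurants", "visits": 95},
    {"top_category": "Restaurants", "visits": 210},
    {"top_category": "Machinery", "visits": 4},
]

filtered = drop_sparse_categories(rows, min_records=2)
```

Whatever threshold is chosen, it is a modeling decision worth reviewing with the business side, since it changes which kinds of locations the benchmark is built from.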
In the best categories, interesting that we have ones where there are fewer records there. But in the scheme of things, an absolute error of say 40% isn’t great anyway. So being able to use the graph, go back, generate maybe some better features, maybe some more thorough features, but just keeping it transparent for everybody involved on how to continually improve the model.
Below here we have a visualization of the error by visits. It’s important to note here that we have a gentle exponential curve that we got given by the average number of actual visits in a given category versus the absolute error. So you can see in the top here with the highest error is our machinery, equipment, and supplies at 264%. Not a great prediction. And then down here, as you get to a lot of actual visits, it’s 63%. Not great, but this is our lessors of real estate.
In the spirit of monitoring our distribution and cutting out items that are skewing our results, we want to hover here in this cluster, and as the actual visits go up, for example, our prediction gets better. So we would want to hover around this 60% to 50% error in terms of tuning our model.
I’ll hand things off now to Lee for closing remarks.
Dr. Lee Hong: Thanks, Rebecca. As we come down to the end, under normal circumstances I’d be pretty depressed and pretty unhappy with a model that has a 73% error. But what we’ve done here is create our baseline benchmark that we can just keep rinsing and repeating and getting our data better and getting our model improved.
Two things out of that. One is using Neo4j, I can very quickly and easily modify the features and point my data in different directions, and I can keep recycling data through. But the more important thing is what Rebecca [00:25:00] was showing with the dashboards. I can have a discussion with the VP of Marketing and say, “Should we be benchmarking against machine stores? We’re a coffee shop. Is a machine shop a good comparison? Are there particular categories that we should be dropping for our model? Are there certain outliers or certain things we shouldn’t be doing given the fact that this dataset came in during a COVID pandemic period? Maybe we should be using data from three months ago or from 2019.”
These are all very interesting questions that now you can have that discussion with somebody else who doesn’t necessarily understand how the model works or even how AutoML works, but they can look at a graph and ask, “Hmm, we’ve got stores or categories of businesses that perform very, very well in our model because they had enough traffic. Maybe those are the categories that are similar to us.”
What’s nice about what we’ve done here is being able to map data – stores that are close to me, stores that are in the same category to mine. I can start looking at those and start making decisions of how we can handle our model updates over time.
The first thing we found is that preparing data for AutoML is very, very seamless using Domo and Neo4j. It’s very easy to take data from a fairly complex source, take that and break that down to its entity level, reaggregate it, transform it, and then push it through AutoML. What it does is it gives us a very short and transparent path to building my AI project so that I can feel comfortable knowing that I can engineer and reengineer and rebuild my process without a lot of friction. So I can actually do this without a huge team – take my data, break it down, rewrite it, and constantly go through this rinse and repeat flow.
And the last very, very critical piece is all this happens in Domo with the ability to visualize this and show this to any member of the executive team very, very transparently, very easily digestible, very easily understandable. And if over time my model changes, I can track those changes looking at the same dashboard as everyone else.
Really, in closing, the most important thing is the ability to quickly pivot and make updates to your model and track those over time, and being able to share that with the rest of the organization.
With that, I want to thank everyone for following along and being part of our data journey. Once again, I am Lee Hong, Director of Data Science at Graphable. Thank you very much.
Rebecca Rabb: And I’m Rebecca Rabb, consultant at Graphable. Thank you, everybody, for listening, and good luck on your Neo4j journey.