Thanks to Neo4j for a great event! Check out the replay of Will Evans, Graphable’s VP of Strategy & Innovation, presenting at Neo4j Nodes 2020.
Video transcript: [00:00:00] Anna Anisin: Hi, everyone. My name is Anna Anisin, and I’m very excited to be here at the NODES 2020 Virtual Conference with you today. I’m the founder and CEO of Formulatedby, the team behind the Data Science Salon. For those of you who don’t know about the Data Science Salon, we are a unique, vertical-focused data science conference which grew into a diverse community of senior data science, machine learning, and other technical specialists. We gather face-to-face, and these days virtually, to educate each other, illuminate best practices, and innovate new solutions in a really casual, chill, salon atmosphere. We have some really awesome events coming up in November and December, and I hope to see some of you there.
I’m really excited to announce our next speaker, Will Evans. Will is the VP of Strategy and Innovation at Graphable, and works with companies large and small to help them develop industry-leading technologies based on graph. His focus on event data, recommendations, and NLP helps guide companies to success.
Will’s presentation focuses on leveraging dimensionality for graph-based recommendations with sparse data. Thanks for being with us, Will. Please take it away.[00:01:18] Will Evans: All right, hello, everyone. Thank you for that introduction. I’m excited to be here as well, presenting at NODES 2020. As mentioned in the intro, I’ll be talking about leveraging dimensionality for graph-based recommendations with sparse data.
Quick extra introduction, I’m the VP of Strategy and Innovation at Graphable, as was mentioned. I’ve been working with the Neo4j technology for more than four years. Actually, in a little full circle endeavor, my first project with Neo4j was a recommendation engine for blogs to recommend further blogs on a corporate website – which I’m sure I did completely wrong – so this is going to be recommendations plus experience to help guide you in a better direction.
Graphable is a graph-based consultancy and the exclusive reseller of Hume in the U.S. So we help companies with their data and solving problems like this all the time.
What we’re going to be working through today is making recommendations on a graph filled with beer reviews. What we want to do is walk through a real soup-to-nuts example where we start with a dataset and a business goal and take you all the way from our schema design process to usable results that could be deployed into production.
We’re going to start with the use case and our dataset, take that into the schema design, and look at some of the things that we think about as we set up a design for highly dimensional sparse data, look at the data ingestion process briefly – though that’s really not the focus of today’s session – and then we’ll work and spend the bulk of the time on some of the Cypher development and the modules that we’re speaking about and looking at finding beer similarity for recommendations, before we get to a conclusion. So let’s jump right in.
What we’re looking at here is a recommendations use case. Recommendations are always critical for a good customer experience. Especially when a business has a large product assortment and people want product discovery, search alone is almost never sufficient, because you run into the problem of people not knowing what they don’t know. They can’t search for new products because they don’t know they exist, or they’re not aware of the options that are available to them.
Additionally, recommendation engines are critical to personalization and making sure that each individual customer is shown products that they actually want to buy. As we’re looking at increasing our cart size or increasing our cross-sell and upsell, we want to make sure we’re recommending relevant pieces of information to every customer.
The caveat here is that usually businesses that have a large assortment of products and a customer base that would benefit from that personalization have a sparse dataset that doesn’t lend itself to machine learning and have a customer pool of unique individuals with very diverse taste profiles.
When we talk about highly dimensional and sparse data for a second, something that we’ll go through in this presentation is progress, not perfection. Our goal here is not to end up with the perfect machine learning model or the perfect algorithm that’s going to give us 99% accuracy. It’s getting us good, repeatable recommendations with minimal investment. We want to make sure that we’re not throwing out all of this work because at the end, we used up our whole budget and didn’t get anything into production. We want to make sure we’re getting the dimensionality of the data to produce those results that we can use.
When we talk about sparse data with high dimensionality, this is the data we’re talking about. On the right, I have a screen capture from our sample dataset. You can see that it’s very sparse. The only real facts we have in the data are our users and review scores.[00:05:00] This is in contrast to situations like Google or Amazon or Netflix – real big-data situations where you have a really high number of touchpoints. In Netflix’s case, they can use a bandit algorithm because they’re looking at what parts of a movie people are watching and can recommend certain pieces of that to other types of users, because they have tons and tons of data.
When we’re leveraging something like reviews on a beer dataset, we have very sparse data where we have beer and beer style and brewer and review and text and glass type and flavors and all of these different things coming in from the reviews and from our data, creating our dimensions, but only reviews on aroma, palate, taste, and appearance to give us any defining characteristics. In data science, we also consider this P > N, where we have more parameters than our sample size.
As we look through this, you can see that on the left, I have the number of reviews that every user has written, and in general, that’s one, maybe two. Similarly on the beers, each beer typically has one or maybe two reviews, meaning that we’re going to have a very hard time with standard machine learning approaches like collaborative filtering.
So as we go through our schema design, when we’re working with sparse, high-dimensional data, the schema design is often very simple. In our instance, beer is the center of our universe – both in this example and sometimes after work as well – where we have beer as the central entity. Then we start asking questions about our beer, like where does it come from? Then we have a concept of beer being brewed by a brewer. And because brewers can brew multiple beers and it’s sort of a question of hierarchy and we don’t want to have brewer as a property, we make it its own separate entity.
Very similarly with style, we have beers that are part of a style, and it’s almost like a product category or a product hierarchy. So we’re going to pull style off of the beer and make it its own separate node as well.
Lastly in terms of entities, we come to users who aren’t related in any way to our beers or to our brewers. So we have them as a separate node, which we then connect through a review. The reviews only exist as an interaction of the user and beer, really even at a point in time, where the user might review the same beer multiple times as they try it over several months and their opinions change, or they want to confirm what they have previously. But really, it’s the cardinality of this relationship that determines its direction. Beers can have multiple reviews, but in our dataset, each review only mentions one beer. So it makes sense in this case to have the relationship from beer to review.
If you had a broader set where someone had blogs about beers and they could mention multiple beers, then we might want to change this relationship. But in our use case, this is just a good example of why the direction of relationship can matter sometimes.
Covering data ingestion very briefly, because there’s a lot of information out there already on it: you saw our original dataset. It’s a sort of weird format where we have 14 lines making up one data item, then a newline character as a separator, and then another data item. We actually used Orchestra, Hume’s data ingestion and ETL tool, to load the data. But you could just as easily write a transformation script in Java or Python or whatever you want and then use LOAD CSV or a bulk import for Neo4j. That would work just as well.
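As an illustration, a minimal Python transformation along those lines might look like this – the `key: value` block layout, the field names, and the sample values are assumed placeholders, not necessarily the dataset’s exact format:

```python
# Minimal sketch of the pre-load transformation. Assumed record format:
# blocks of "key: value" lines separated by a blank line (placeholder
# fields and values, not the real dataset's).
raw = """\
beer/name: Sausa Weizen
beer/style: Hefeweizen
review/overall: 1.5

beer/name: Red Moon
beer/style: English Strong Ale
review/overall: 3.0
"""

def parse_blocks(text):
    """Turn blank-line-separated key/value blocks into one dict per record."""
    rows = []
    for block in text.strip().split("\n\n"):
        row = {}
        for line in block.splitlines():
            key, _, value = line.partition(": ")
            row[key] = value
        rows.append(row)
    return rows

rows = parse_blocks(raw)
```

From here, something like `csv.DictWriter` can emit a file that Neo4j’s LOAD CSV picks up directly.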
Then going on to something that we did just for fun, but is a good example of how far you can get with some basic rules-based approach, is creating brewer names based on the beer names. In our dataset, it actually didn’t include the brewer names. They just had an ID. What we did is we said, let’s see how far we can get by just using this basic structure and creating some brewer names.
What we do is break out the beer names into individual words, and we basically are doing this three times. We’re looking at, for each beer and the brewer and the beers that they’re connected to, what words are in the beer names, and how often do they appear? We basically take this and we replicate it three times based on our expertise – which we’ll say a lot throughout this process, “our expertise,” because what we want to do is make our best decisions based on me understanding the beer market and what people are looking for in beers and how brewers name themselves.
As we go through that, we’re actually using the ordering of the names as well as the difference in popularity between the first word and the second word. If the first word is incredibly popular, used in every beer, and the second word is only used in[00:10:00] three of them, then likely that second word isn’t part of the beer name.
If we look forward through this, we can see that actually for a number of my favorite brewers, and some relatively large brewers, they include their names in all of their beers. So Allagash, Shipyard, Harpoon, Smuttynose, and then even two-word brewers like Sierra Nevada or Samuel Adams. We can accurately extract brewer names. There are some examples where this fails – for instance, New Belgium Brewing out of Fort Collins, Colorado. They don’t put their name into any of their beers, and so we can’t extract it from this dataset.
You could use more advanced NLP techniques in order to get this more accurate if this were very important and highly valuable to us, but for our initial pass, this was plenty good enough, and it really shows how far you can get with just some expertise and a rule-based approach.
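As a hedged sketch of that rule in Python – the `min_share` and `drop_ratio` thresholds here are invented for illustration, not the ones we used:

```python
from collections import Counter

def infer_brewer_name(beer_names, min_share=0.8, drop_ratio=0.5):
    """Guess a brewer's name from the leading words its beers share.

    Walks the beer names word by word: a word joins the brewer name only
    if it appears in most names (min_share) and its popularity has not
    collapsed relative to the previous word (drop_ratio).
    """
    tokenized = [name.split() for name in beer_names]
    prefix, prev_count = [], len(beer_names)
    for pos in range(min(len(t) for t in tokenized)):
        word, count = Counter(t[pos] for t in tokenized).most_common(1)[0]
        if count / len(beer_names) < min_share or count < prev_count * drop_ratio:
            break
        prefix.append(word)
        prev_count = count
    return " ".join(prefix)
```

On names like “Sierra Nevada Pale Ale” and “Sierra Nevada Torpedo” this yields “Sierra Nevada”, while New Belgium’s beers (“Fat Tire”, “Voodoo Ranger”) share no leading word and return an empty string – exactly the failure case mentioned above.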
As we get towards our Cypher development, I want to hammer home on that point for a second, which is: don’t boil the ocean. We see this a lot with data science projects as we come in. People have found a problem, they have a dataset, they spend a bunch of time and money trying to develop the perfect machine learning algorithm or reaching for other really high-level data science techniques, and they’re not getting the results they want. It often comes from the fact that they have high-dimensional sparse data and they’re really trying to trust the herd to give them the best results, whereas what we recommend is starting with individual models and unifying them at the end.
We want to look at splitting out our data based on X dimension and exploring the results, and just doing that exploration as someone who understands the use case I’m working with, and see what’s giving good results and what’s giving bad results. As we go through that, we can then start to aggregate and weight and bring these different modules together and create a better recommendation as we stack them.
Part of this comes with avoiding complexity if it’s not needed. If we have the dataset that supports a big ML algorithm and we need that kind of accuracy, then sure, we can go for it. But we can achieve good, repeatable results from a hard dataset using a collection of basic techniques, and we like to reference this as rule-based+. It’s not quite as simple as just rule-based, but it’s taking the expertise, taking some rule-based and simple algorithms, and creating good results.
Once you’ve gone through this, the more advanced algorithms and more advanced processes like clustering and node2vec and other machine learning type algorithms have diminishing returns, and the results depend heavily on the data size. They often require a lot more facts than we have. So we want to do a thorough cost analysis on the ROI of improving our accuracy compared to the effort. We almost always want to do this after we’ve already deployed and measured good, repeatable recommendations in production.
We want to know, what was the business aspect of recommending our users good beers? Did this increase our profit in some way or increase our margin or increase our revenue? Once we’ve done that, we can then say, okay, we know that increasing our recommendations led to X percent increase in revenue; what would increasing the accuracy be? What if we made better recommendations? And once you have recommendations in production, you can actually start collecting data on their efficacy and then use that to develop your new model.
Looking at our specific solution, we want to build a recommendation engine with sparse high-dimensional data, and it’s a tricky process. We have what’s not quite a Cold Start problem – we’ll call it a Cool Start problem – which is that many users have reviewed one beer, and many beers only have one review. This gives us very little overlap across the users and the beers that they reviewed, and it doesn’t give us a great pattern to walk between users and beers and find similar results.
Some brewers have many beers and some have few, and overlapping beer styles that don’t clearly identify beer flavor. This is a good hierarchical concept where we don’t actually have beers in multiple categories, but we have categories like an IPA and an APA, where the American Pale Ale and the India Pale Ale are quite similar flavors, though different styles. So we don’t want to rely on style as a simple identifier.
What we can’t and shouldn’t do is recommend beers based on co-review occurrence, recommend beers simply with higher scores, or simply parse review text and then recommend corresponding beers. We’re going to run into the law of large numbers here: the aggregation will usually filter away our smaller beers, and we end up recommending only the popular beers, only beers and brewers that users have had before, or only beers within the same style, when there are a lot more options out there.[00:15:00] We want to avoid presenting narrow and mismatched recommendations to our users so that we can give them things that they actually wouldn’t have thought of and that they might like.
To solve our Cold Start problem, what we want to do is introduce new products to people, but we don’t want to push the wrong products and have them not enjoy them. Since so many users have reviewed one or two beers, they would never get relevant beers recommended to them because of the issues we talked about on the last slide.
Our solution is to leverage the hierarchy of beer to style and find highly related styles, and then drill back down to beer. So we want to try to find similar styles, like American Pale Ale and India Pale Ale, and then expand out our options. Once we’ve done this, we’re going to get too many beers to return, because review scores are going to be banded in our top echelon – think of Uber ratings, where a driver with a 4.1 probably doesn’t drop you off in the right place, and a 4.9 is the only thing you’ll accept. Many of our beers may have the same score, so we want to be able to add another layer here.
This is when we talk about stacking modules. We’re going to add a little bit of NLP-without-NLP onto our review text and find words that our users mention in their reviews that show up in beer names. We’re using our expertise to decide that people – like me, for instance; I’ve been made fun of at Graphable because I will choose any beer off the shelf if it mentions oceans or dogs in its name – make decisions about taste that aren’t fully rational; branding and other aspects come in as well. So we can add that as a layer of our recommendation into the graph based on the review text and the beer text.
If we look at that query, we can see it basically goes through 10 steps. Looking at the first few lines, we want to take the styles that the user has reviewed – starting with the user MikeDrinksBeer2, we go to the reviews that he’s written of beers that are within a style. We take that style and split it into the individual words in the style name, and we find the highest-rated words from the style names for this user. We want to take an average of the appearance and other scores to get, basically, which words in the style names our user likes the most.
From there, we’re going to find other styles that have these words in them and find beers within those styles. This is the first piece that we talked about on the last slide, in here, where we’re looking at finding styles to find highly related styles and drill back down. And now we have one set. We could at this point just give them those reviews. We could say, “This is good enough. Let’s rank by review score. We’ll take the top 10, call it a day.” That would be plenty, and that might work in your use case.
But we’re going to add another layer onto this recommendation using the NLP: now let’s take our original user, break up all of the reviews that he’s written into individual words, and again take his favorite words – using a sum here. You can see we’re actually using a sum instead of an average. If he’s using words more frequently, we want that included in our information, so we use a sum and give weight to repetition, whereas we didn’t want to do that with the style scores.
From there, we’re going to pick the top 100 highest scoring review words, and we’re going to find beers within our large set that have those words in the title. Then we’re going to additionally use some sorting based on their review scores if we have a lot of them, and return that to the user. I’ll give a demonstration of this at the end of the presentation, after we’ve discussed some of the other queries that we’re going to look at.
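The average-versus-sum distinction at the heart of those two modules can be sketched in a few lines of Python (the reviews below are invented for illustration, not from the dataset):

```python
from collections import defaultdict

# Invented (style, review text, overall score) rows, purely for illustration.
reviews = [
    ("American IPA", "hoppy citrus hoppy", 4.5),
    ("American Pale Ale", "crisp hoppy", 4.0),
    ("English Porter", "roasty dark", 3.0),
]

# Style-name words: AVERAGE the scores, so repetition is not rewarded.
style_scores = defaultdict(list)
for style, _, score in reviews:
    for word in style.split():
        style_scores[word].append(score)
style_rank = {w: sum(s) / len(s) for w, s in style_scores.items()}

# Review-text words: SUM the scores, so a word the user keeps using
# gains weight with every repetition ("hoppy" appears three times).
text_rank = defaultdict(float)
for _, text, score in reviews:
    for word in text.split():
        text_rank[word] += score
```

Here “IPA” edges out “American” on average (4.5 vs. 4.25), while the summed text scores let the repeated word “hoppy” (13.0) dominate.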
Then we can look at other ways to recommend beers to users. Our last query was based on the user, and our next query will be based on the beers. We can use cosine similarity. This query is designed to take the user on a “Beer Walk” using review scores. We can do this because our review scores are multidimensional and can go in different directions. Some users might score the same beer highly on palate, whereas other people might prefer aroma or appearance. These different scores make each review a vector, and we can use cosine similarity. Therefore, we can match beers based on sharing similar patterns of scoring, not just the total average score.[00:20:00] When a user selects a beer, the algorithm can find beers that have a similar pattern of review scoring to the first. We will likely have ties, so we randomly select the next best beer, and over multiple iterations this takes the user on a weighted random walk through the beer landscape. So we can start with a beer that we know they like, and then we can move along the path to say, “You might like this beer,” and from there we can keep moving.
Again, we can pair this algorithm with our Cold Start in order to get similarities of beers as we’re recommending them to the user. Instead of going to the NLP, we could get our large style set to find the similarly reviewed beers and then reduce them down. We can pair these two together to give our user a universe of beers and pick the best matches while maintaining diversity.
So looking at the Cypher development for our cosine similarity query, we’re starting off at a beer this time instead of a user, and we’re finding this beer’s reviews and the styles. We’re going to aggregate its review scores for appearance, aroma, palate, and taste, and then we’re going to collect those so that we have a vector of this beer’s review. Once we’ve done that, we’re going to find other beers – all other beers – and which styles they’re in.
As you can see here on Line 14, we don’t actually need to do this, but we’re insisting that the new beers we recommend are not in the same style as the original beer. We’re just doing this to increase some diversity in our recommendations. We could remove this, and depending on your use case, this may not be the appropriate thing. But for beer reviews, we’re assuming that our users want to try new things, and we don’t want to recommend them the same styles over and over again.
Then we’re going to collect those beers and again create vectors of their scores based on aroma, palate, appearance, and taste, and then we’re going to use the graph data science library to compute the cosine similarity between our source beer and our destination beers, rank those by our top 10, and then as mentioned, we’re going to randomly select the top one and have them start their Beer Walk. Again, we’ll demonstrate this in a moment.
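As a plain-Python sketch of the similarity step the query delegates to the Graph Data Science library (the score vectors here are invented for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Averaged review scores per beer: [appearance, aroma, palate, taste].
source = [4.5, 4.0, 3.5, 4.0]
candidates = {
    "Beer A": [4.4, 4.1, 3.6, 3.9],  # similar scoring pattern
    "Beer B": [2.0, 4.5, 4.5, 2.0],  # same rough magnitude, different pattern
}
ranked = sorted(candidates, key=lambda b: cosine(source, candidates[b]), reverse=True)
```

From the top of `ranked`, the Beer Walk picks randomly among near-ties (e.g. with `random.choice`), which is why repeated runs can explore different paths.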
A key piece of all three of our queries – the style expansion, the NLP process, and the cosine similarity – is leveraging subject matter expertise. While human preference is personal, our palates and tastes change as we mature and move toward the level of connoisseur. Using large-scale machine learning trusts that the average level of “palate education” will conform to all beer drinkers.
There was a really interesting study on a similar dataset about wine expertise that showed basically that users’ wine tastes and choices changed over time, so depending where a user was in their wine journey, they would prefer and rank different wines differently. Without accommodating aspects like this, there’s a strong potential that the model aggregates away a lot of data.
To avoid this, using our parameterized dimensional data allows expert humans to define the key dimensions on which beers should be matched to individuals. Machine learning is a fundamentally regressive application and pegs users to their prior state. With our approach, we can leverage machines to speed up the searches across a broad range of products while leveraging the knowledge of beer experts instead of relying just on the herd or the crowd to make recommendations.
Again, this is really aiming at solving our original problems, like the Cold Start problem and our dimensionality problem. We don’t have enough data to do anything but recommend the most popular beers that have thousands of reviews, and we’ll never recommend beers that have 11 reviews, even if someone might really like them.
In terms of deploying to production, the answer often is “It depends.” It depends means that you need to determine your performance requirements and your data size, which will dictate if you need to precompute your queries or if you can just run them on the fly. Neo4j is incredibly performant at recommendation engines, so for smaller datasets, you can absolutely run them on the fly. If you need millisecond response times, you’re likely going to be precomputing. That is determined by where and how your recommendations are generated. Are they being sent in email to customers based on their purchases, or are you having users search for a beer and they need to immediately see 10 other beers that they might like?
Once we’re making these recommendations, we can use them to generate data that lets us leverage more types of machine learning algorithms. If we use our rules-based+ approach to make recommendations, we can then capture event information[00:25:00] such as which beers are actually clicked on. Users then review beers that we recommend, and so on, and that gives us a lot more facts that we can store back into our dataset.
Again, an advantage of the rules-based+ methodology: dimensions can evolve along their own timescales because we treat them individually. How a user’s style preferences change is handled separately from how they write reviews. In your standard basic ML algorithms, the dimensions must all evolve at the same rate, because the model essentially has to learn them together as opposed to individually.
Now I want to give just a few quick examples of how we’re going to run through this. In these examples, I’m using Hume to run some queries, but in real life, this would be a web interface for searching for beers or an iOS app where people are searching for beers and finding the recommendations.
What we’re looking at here is really our full dataset of beers and reviews. We can say we’ve got about 1.6 million reviews on 66,000 beers by 33,000 users. If I go back to our original query, where we’re looking at starting at a specific user, I’m going to search and find that same specific user that we started with, and I can quickly look and see, what beers has he had already? If we look at this, we see MikeDrinks at the top, his reviews down at our middle hierarchy, followed by beers and the styles in which he’s had them.
We can start to see some of the things that we discussed earlier, where we have an American Blonde Ale and an American IPA, a Russian Imperial Stout, the American Barleywine, American Pale Ale. So we’re getting some idea here that maybe actually Mike likes to drink beers that are based on an American style, though he has had an English Porter and some other beers as well, including some Belgian Strong Pale Ale and otherwise.
If we start with this original set of beers that he’s reviewed, we can run that original query where we break out these styles into their individual words, find words that he associates with strongly through his reviews, an expanded set of beers, and then filter that expanded set based on his review text. As we come in here, we can see that each review has some text that he’s written about all of his beers, and we’re going to try to find beers that have words from these reviews in their names.
If we come up and we run that query, we’re going to recommend beers to Mike. Again, I’ll use this layout. We can see, again, we have Mike at the top, followed by beers that he’s reviewed and already tasted, and then we can see all of the new beers that we’ve been able to recommend to him. Some of these are in existing styles that he’s tried, but we’re also able to recommend him beers that are in other styles. We have an American Double or an American Imperial – not sure what this is. We have some new beers where he’s tried before other types of IPAs, so we’re going to recommend a Belgian IPA, because he drinks both Belgian beers and IPAs. So this might be a really great fit for him. Similarly, we’ve got an English Stout. Might go up with the English Porter we have over here. And then some other recommendations at this side as well.
Really, what this is showcasing to us is how well we can do recommendations based on that one query. Again, if I pull that back up onto the screen – I’ll just pull it up here – it looks long, but it’s a relatively short query to get such great results that quickly, on a dataset of this size, from such highly dimensional, very sparse data. He’s only written 20 reviews, and we’re able to make really great recommendations for him.
We then start looking at our second recommendation path, where we’re starting with the beer. I’m going to search for another favorite of mine and look at the Sam Adams Boston Lager. Pull back that beer. We can use our cosine similarity to expand out from here and do our Beer Walk. If I start on this, I can do my Beer Walk. This query takes a slightly longer time to run, so again, when we talk about deploying into production or not, this might be an instance of a query that we want to precompute some results rather than run them real-time. We might want to store 10 relationships to the Sam Adams Beer Walk to the Kawanda Cream Ale.
The intention here is that there’s really two directions we can go. This originally returned a Kawanda Cream Ale, but because of that randomizer, we can actually now run Beer Walk again and we’ll get different results[00:30:00] because Sam Adams is a very popular beer. If I right-click on this, I can see that there are 2,400 reviews. The randomizer didn’t work this time, so I’ll run it a third time, and we’ll get a different beer for that source beer. Then we can continue our Beer Walk and start out at our Kawanda Cream Ale and explore out further from this and start to get an idea of how we might navigate through our graph.
This would be an example of places where we could potentially store this into our graph as we precompute our Beer Walk cosine similarity, and then maybe we actually want to run some clustering algorithms. Maybe we have some more data down the line where we can give weights to our Beer Walk based on the traversals people actually followed or next reviews that they’ve written. Then we can run a clustering algorithm on our beer to beer relationships and use that for recommendations in the future.
But our original dataset doesn’t support it, so we need to get something out there that lets us both make our customers happier as well as look at more data for creating better recommendations in the future.
In conclusion, we want to use our rules-based+ algorithm. This allows us to use the expertise of our use case experts to give the best dimensional recommendations to our users. It doesn’t involve us sinking all of our time and effort into ML projects that don’t have enough data to succeed from the start because we need to really be careful of our ROI on data science projects.
There’s a Gartner study out there – I think it’s widely accepted – that something like 80% of data science projects never make it into production, and that’s a problem that we’re really trying to solve. Don’t be afraid of cheap, good solutions to get you started and start making your clients’ and customers’ experience better.
A few combinations of these expertise-guided modules is likely better for highly dimensional data than ML because of the things that we talked about.
A few questions that we often get on this talk are things like: why is running an algorithm along independent dimensions, then ensembling the scores, a better option than running machine learning on all of the dimensions together? As we talked about earlier, this is because different dimensions evolve on different timescales. Capturing those changes with one ML model would require constant retraining, but treating each dimension independently and then unifying them – especially with a small sample size – allows each dimension to change on its own timescale and get reaggregated at the end, as opposed to everything having to train together or the model being constantly retrained.
Another question that we get relatively frequently is: if we had to choose one or two machine learning algorithms as an alternative to what we’ve presented here, what would they be? We would probably suggest some kind of vector embedding approach, using a combination of entities extracted from the styles, beer names, and review texts, and graph components such as country, beer type, etc.
A third question that we get is, how would you deploy NLP in this situation as a way to connect the free text with the structured data? You could use NLP in this situation to find sentiments within reviews, extract different named entities, and build a model to project what sets of words lead to different review dimension scores to link beers together. This is definitely something that you can do. We do it a lot with NLP for heavily textual data, and it can be very valuable.
But a key here as well, to continue our theme, is using the human expertise to extract the important named entities. We see a lot of cases where people want to just throw text and have it determine the keywords for us and basically give us all of the information, and it almost always returns bad results. So it’s always worth it to spend the time upfront to get some human expertise into the model, even if you’re using an NLP solution, in order to make it more relevant and more accurate based on your users’ data.
That wraps up our presentation. Hopefully that’s a helpful introduction[00:35:00] to leveraging dimensionality for graph-based recommendations. We want to still hammer home on using good, cheap solutions to get results out there, and then start recording data to make your recommendations better.
If you have any questions, feel free to reach out to me. My contact information should be on the NODES 2020 page. Thank you very much, and have a great rest of the conference.
Graphable delivers insightful graph database (e.g. Neo4j consulting), machine learning (ML), and natural language processing (NLP) projects, as well as graph and Domo consulting for BI/analytics, with measurable impact. We are known for operating ethically, communicating well, and delivering on time. With hundreds of successful projects across most industries, we thrive in the most challenging data integration and data science contexts, driving analytics success.
Want to find out more about our Hume consulting on the Hume knowledge graph / insights platform? As the Americas principal reseller, we are happy to connect and tell you more. Book a demo by contacting us here.
Check out our article, What is a Graph Database? for more info.