6 Steps to Activate the Value of Text to Graph Machine Learning Systems

By Sean Robinson, MS / Lead Data Scientist

October 27, 2022



This blog walks through how to construct a text to graph machine learning pipeline that brings the power of text data and graphs together. Natural Language Processing (NLP) and Graph Theory have remained two of the fastest-growing fields within data science for the last several years, and harmonizing the two is a natural next step in the evolution of machine learning.

Why Text to Graph Machine Learning?

There are lots of reasons, but the main one is NLP effectiveness. While these two domains can operate independently of one another, the value that graph brings to NLP raises a natural question: how can we carry the information in textual documents into a graph machine learning model and uniquely drive more value from them? Text to graph machine learning is also a fundamental building block for creating a knowledge graph, a topic that becomes more important every year.

Text Embeddings

Before we can infuse our graphs with the information from our text, we must first extract the meaning and value stored in that text. To do this, we will implement a simple embedding model to produce a feature vector for each instance of text. An instance could be an individual word, as would be the case if we were analyzing keywords from medical publications, or an entire document, if we were interested in comparing the abstracts and meaning of publications or other documents.

In this example, we’ll assume we are interested in understanding how different publications relate to one another based on the text of their abstracts and their related network of co-authors.

We could do this with a number of different models or services. Below I’ve listed a handful of the most common:

Word Embeddings

  • TF-IDF
  • Word2Vec
  • BERT

Document Embeddings

  • Doc2Vec
  • AWS SageMaker Object2Vec 
  • Word Mover’s Embedding (WME)
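
As a rough illustration of the embedding step, the sketch below builds TF-IDF vectors (the first option listed above) for a few abstracts using only the Python standard library. The abstracts, the article IDs, and the function names `tfidf_vectors` and `cosine` are made up for this example, and the whitespace tokenizer is a simplifying assumption. In practice you would more likely reach for an existing library or one of the other models in the list.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, vectors): one L2-normalized TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency, so no term gets a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocabulary, vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical abstracts keyed by a made-up article ID.
abstracts = {
    "article_1": "graph neural networks for citation analysis",
    "article_2": "graph neural networks for molecule property prediction",
    "article_3": "clinical trial outcomes for a new vaccine",
}
vocabulary, vectors = tfidf_vectors(list(abstracts.values()))
embeddings = dict(zip(abstracts, vectors))

print(cosine(embeddings["article_1"], embeddings["article_2"]))
print(cosine(embeddings["article_1"], embeddings["article_3"]))
```

The two abstracts about graph neural networks should score noticeably closer to each other than either does to the clinical trial abstract, which is the property we want to carry into the graph.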

Now that we have our document embeddings established, we can examine the graph machine learning portion of our problem.

Text to Graph Machine Learning
Setting the Stage with the Proper Projection

Before we can start training our graph ML model with our document embeddings, we first must describe the projection of the graph which it will train on. If you read our article What Is a Knowledge Graph, you will see that the native graph data store you’re using may contain several node types in addition to articles and authors, many of which could be connected to those articles and authors. Because of this, we must first distill our graph data into those nodes and relationships which are relevant to the problem at hand.

article abstracts stored in graph

We begin by ignoring the rest of the knowledge graph to focus exclusively on author nodes, article nodes, and the relationships which connect them to get a bipartite projection of our graph.

focusing in on article and author nodes and relationships

Next, we define a relationship weight to indicate which relationships are stronger than others. We then “fold” this bipartite graph into a monopartite graph of articles, where the weight of each relationship is governed by the number of authors the two articles share.
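
The projection and folding steps can be sketched in a few lines of plain Python. The edge list, the relationship names (`WROTE`, `PUBLISHED_IN`, `AFFILIATED_WITH`), and the helper functions are hypothetical stand-ins for whatever your graph store holds; a graph database would normally do this work with its own projection tooling.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical typed edges from a larger knowledge graph: (source, relationship, target).
edges = [
    ("alice", "WROTE", "article_1"),
    ("bob", "WROTE", "article_1"),
    ("alice", "WROTE", "article_2"),
    ("bob", "WROTE", "article_2"),
    ("carol", "WROTE", "article_2"),
    ("carol", "WROTE", "article_3"),
    ("article_1", "PUBLISHED_IN", "journal_x"),
    ("alice", "AFFILIATED_WITH", "university_y"),
]

def bipartite_projection(edges, relationship="WROTE"):
    """Keep only author -> article edges, dropping every other relationship type."""
    return [(src, dst) for src, rel, dst in edges if rel == relationship]

def fold_to_articles(author_article_pairs):
    """Fold the bipartite graph into article-article edges weighted by shared authors."""
    articles_by_author = defaultdict(set)
    for author, article in author_article_pairs:
        articles_by_author[author].add(article)
    weights = defaultdict(int)
    for articles in articles_by_author.values():
        # Each author adds one unit of weight to every pair of their articles.
        for a, b in combinations(sorted(articles), 2):
            weights[(a, b)] += 1
    return dict(weights)

author_article = bipartite_projection(edges)
article_graph = fold_to_articles(author_article)
print(article_graph)
```

Here `article_1` and `article_2` share two authors, so their relationship carries twice the weight of the one between `article_2` and `article_3`, which share only one.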

moving from bipartite graph into a monopartite graph

Next, with our projection established, we’re ready to consider which graph ML model to use in our text to graph machine learning pipeline.

choosing which graph ML model to use in our machine learning pipeline
Choosing a Graph Machine Learning Model

Now that we have our publication data represented in an embedding space, and our graph projection established, we can start to think about how to feed it downstream to a graph machine learning model. However, not just any graph ML model will do. We must consider those which can accept a set of feature vectors.

Many of the early graph ML models that use topological adjacency or random walks to produce similarity measures rely exclusively on the connections between nodes to produce a vector representation of each node in the graph. Since these methods do not utilize any underlying node features in their calculations, they are ill-suited for the task at hand. Instead, we must rely on those model architectures which accept node features as part of their design. Once again, I’ve listed some popular options for such a task:

In our example, we’ll use GraphSAGE in our text to graph machine learning pipeline. (See our related article on Convolutional Graph Neural Networks)

Finally, with our model architecture selected, we can begin to train and get some results.

train and get results on our ML model architecture
Passing Text to Graph Machine Learning Models

In order to capture both the document similarity and co-authorship network similarity in our documents, let’s complete our text to graph machine learning pipeline by creating node representations for each of our documents.

First, let’s make sure our monopartite projection contains all the document embeddings we created earlier as node features.
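
A quick sanity check along these lines might look like the sketch below. The function name `check_node_features` and the sample data are hypothetical; the idea is only to confirm that every article in the projection has an embedding and that all embeddings share one length before training starts.

```python
def check_node_features(nodes, features):
    """Return (missing, dimension): nodes lacking an embedding, and the shared vector length."""
    missing = sorted(node for node in nodes if node not in features)
    dimensions = {len(vector) for vector in features.values()}
    if len(dimensions) > 1:
        raise ValueError(f"inconsistent embedding sizes: {sorted(dimensions)}")
    dimension = dimensions.pop() if dimensions else 0
    return missing, dimension

# Hypothetical monopartite projection and document embeddings.
weighted_edges = {("article_1", "article_2"): 2, ("article_2", "article_3"): 1}
nodes = {node for pair in weighted_edges for node in pair}
features = {"article_1": [0.1, 0.9], "article_2": [0.2, 0.8]}

missing, dimension = check_node_features(nodes, features)
print(missing, dimension)
```

In this made-up data `article_3` has no embedding yet, so it would need to be embedded (or dropped from the projection) before training.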

ensure monopartite projection has all the document embeddings we created as node features

Next, we’ll train our GraphSAGE model against the projection with K = 2 to restrict the message passing to two degrees of separation. This should give us a stable message-passing flow between articles.

This message-passing architecture causes articles which are connected to one another to share their feature information. Through this, we pass the information captured by the document embedding model to other documents, weighted by the number of authors they share. This information passing then extends to those neighbors’ own neighbors to develop the embedding space which will contain our final representations of the articles.
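
To make the message-passing idea concrete, here is a deliberately simplified sketch. It is not GraphSAGE itself: GraphSAGE learns weight matrices, applies a nonlinearity, and typically samples neighbors, all of which are omitted here. The sketch only shows K rounds of weighted-mean neighbor aggregation, with hypothetical features and function names, so you can see why K = 2 lets information travel two degrees of separation.

```python
from collections import defaultdict

def build_adjacency(weighted_edges):
    """Turn {(a, b): weight} into a symmetric adjacency mapping."""
    adjacency = defaultdict(dict)
    for (a, b), weight in weighted_edges.items():
        adjacency[a][b] = weight
        adjacency[b][a] = weight
    return adjacency

def message_passing(features, weighted_edges, k=2):
    """Run k rounds of weighted-mean neighbor aggregation (no learned weights)."""
    adjacency = build_adjacency(weighted_edges)
    current = {node: list(vector) for node, vector in features.items()}
    for _ in range(k):
        updated = {}
        for node, own in current.items():
            neighbors = adjacency.get(node, {})
            total = sum(neighbors.values())
            if total == 0:
                updated[node] = list(own)  # an isolated node keeps its features
                continue
            neighbor_mean = [
                sum(w * current[nbr][i] for nbr, w in neighbors.items()) / total
                for i in range(len(own))
            ]
            # Blend the node's own vector with its neighborhood summary.
            updated[node] = [(s + m) / 2 for s, m in zip(own, neighbor_mean)]
        current = updated
    return current

# Hypothetical chain: article_1 - article_2 - article_3.
weighted_edges = {("article_1", "article_2"): 2, ("article_2", "article_3"): 1}
features = {"article_1": [1.0, 0.0], "article_2": [0.0, 0.0], "article_3": [0.0, 1.0]}

one_hop = message_passing(features, weighted_edges, k=1)
two_hop = message_passing(features, weighted_edges, k=2)
print(one_hop["article_3"], two_hop["article_3"])
```

After one round, `article_3` has seen nothing from `article_1` because they are not directly connected. After two rounds, part of `article_1`’s signal has reached it through `article_2`.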

develop the embedding space for our node embeddings
Utility of Text Powered Machine Learning

With our embeddings established, we now have a collection of documents represented as vectors using both their underlying text similarity and co-authorship connectivity. This provides us with a set of extremely informed embeddings which can be used for a number of downstream tasks. Let’s explore a few.

downstream applications for graph embeddings
Clustering for Recommender Systems

While it is the simplest of the options, a k-nearest neighbors (KNN) search lets us measure which documents are most similar to one another within the embedding space. Using this measure, we can easily supply a “Top N” recommendation based on the nearest neighbors for a given article.
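
A “Top N” lookup of this kind can be sketched as a brute-force nearest-neighbor search. The embeddings below are invented, cosine similarity is an assumed choice of distance, and `top_n_similar` is a hypothetical helper; larger collections would normally use an indexed nearest-neighbor library instead of comparing every pair.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_n_similar(target, embeddings, n=2):
    """Return the n articles closest to target by cosine similarity, best first."""
    scores = [
        (other, cosine_similarity(embeddings[target], vector))
        for other, vector in embeddings.items()
        if other != target
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:n]

# Hypothetical final article embeddings.
embeddings = {
    "article_1": [0.9, 0.1, 0.0],
    "article_2": [0.8, 0.2, 0.1],
    "article_3": [0.1, 0.9, 0.2],
    "article_4": [0.0, 0.2, 0.9],
}

recommendations = top_n_similar("article_1", embeddings, n=2)
print(recommendations)
```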


Classification

One of the most powerful final tasks for our text to graph machine learning pipeline is feeding our embeddings to a final classifier model such as logistic regression, random forest, or XGBoost. This would allow us to use historical labels to classify which subject a given article may be associated with.
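
As a minimal sketch of this step, the example below fits a binary logistic regression with plain gradient descent on a handful of invented two-dimensional embeddings. The labels (1 for a hypothetical “machine learning” subject, 0 for “medicine”), the hyperparameters, and the function names are all assumptions for illustration. A real pipeline would use an established library and far more data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(vectors, labels, learning_rate=0.5, epochs=500):
    """Fit binary logistic regression with plain batch gradient descent."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0

# Hypothetical labeled embeddings: 1 = "machine learning", 0 = "medicine".
train_vectors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
train_labels = [1, 1, 0, 0]

weights, bias = train_logistic_regression(train_vectors, train_labels)
print(predict(weights, bias, [0.85, 0.15]), predict(weights, bias, [0.15, 0.85]))
```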

Link Prediction

Lastly, we can look to predict relationships that do not yet exist within our graph. In this context, link prediction could help us find sets of authors who may want to consider doing research together in the future.
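
One simple way to approximate this, sketched below, is to rank pairs of authors who have never shared an article by how similar their work is. This is a similarity heuristic rather than a trained link prediction model, and it rests on an assumption that is not part of the pipeline above: each author is represented by the average of their articles’ embeddings. All names and data are hypothetical.

```python
import math
from itertools import combinations

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_candidate_links(authorship, article_embeddings):
    """Score author pairs with no shared article, most similar first."""
    author_vectors = {
        author: mean_vector([article_embeddings[a] for a in articles])
        for author, articles in authorship.items()
    }
    candidates = []
    for a, b in combinations(sorted(authorship), 2):
        if set(authorship[a]) & set(authorship[b]):
            continue  # already co-authors, so there is nothing to predict
        score = cosine_similarity(author_vectors[a], author_vectors[b])
        candidates.append((a, b, score))
    candidates.sort(key=lambda item: item[2], reverse=True)
    return candidates

# Hypothetical article embeddings and authorship records.
article_embeddings = {
    "article_1": [0.9, 0.1],
    "article_2": [0.8, 0.3],
    "article_3": [0.1, 0.9],
}
authorship = {
    "alice": ["article_1"],
    "bob": ["article_1"],
    "carol": ["article_2"],
    "dave": ["article_3"],
}

candidates = rank_candidate_links(authorship, article_embeddings)
for a, b, score in candidates:
    print(a, b, round(score, 3))
```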


Conclusion

While text to graph machine learning is not the only way to accomplish NLP, it is rapidly becoming the most effective and valuable due to the unique strengths of the graph database. By combining the unique value of both fields, there is now a more powerful and scalable way to drive value from all your textual data.

Graphable delivers insightful graph database (e.g. Neo4j consulting) / machine learning (ml) / natural language processing (nlp) projects as well as graph and Domo consulting for BI/analytics, with measurable impact. We are known for operating ethically, communicating well, and delivering on-time. With hundreds of successful projects across most industries, we thrive in the most challenging data integration and data science contexts, driving analytics success.
