In today’s data-driven landscape, information is collected in massive quantities, from transactions to server logs to comments on social media. While some of this information arrives with structure on which to base an analysis, much of it remains locked away in the ambiguous and difficult-to-define confines of natural language. That value can be unlocked consistently by following the correct order of steps in natural language understanding.
Note that we are talking about “natural language” and not just “textual data”. As I cover in depth in a previous post, “Natural Language is Structured Data”, natural language provides a distinctly linguistic structure that serves as the scaffolding on which we can base our analyses. Of the examples mentioned above, only the comments on social media qualify as the “natural language” that can provide us with this critical scaffolding.
Overview – So what is the order of steps in natural language understanding?
In this article we will examine the order of steps in natural language understanding, looking at the seven main steps that unlock previously hidden meaning within your natural language data:
Step 1) Assess Data Quality
From time immemorial, data analysts have repeated the same wisdom, regardless of the data being analyzed: “garbage in, garbage out!” In few disciplines does this age-old saying hold more true than in natural language processing. Given this inconvenient truth, our first step in natural language understanding must be to assess the quality of our data and set our expectations for the final results accordingly. This is especially important when managing stakeholder expectations, as it is often the stakeholders’ own source data that is low quality to begin with, even if they are not aware of it.
Step 2) Clean
After determining the quality of the natural language source data, cleaning that data is one of the few salves available for poor data quality, much as a craftsperson does their best to first plane warped lumber for a building project. This cleaning can happen in a variety of ways, but it generally revolves around the same concept: using heuristic logic to replace words or phrases that may be misspelled or inconsistent.
For example, an open text field used to answer a yes or no question may contain many variations of the word “yes”, such as “yeah”, “yea”, “yess”, “ye”, etc. A simple replace function or regex implementation can make quick work of such inconsistencies and transform the data from garbage quality to a usable state. Whether you use hand-coded regex and related techniques or a high-end data-quality tool to clean the data, the net result is higher-quality data that supports optimal results.
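To make the regex approach concrete, here is a minimal sketch of normalizing free-text answers to a yes/no question. The variant lists and the `normalize_response` function are illustrative assumptions, not an exhaustive solution:

```python
import re

# Illustrative patterns for common variants of "yes" and "no" -- extend these
# as you discover new variants in your own source data.
YES_RE = re.compile(r"^\s*(y|ye|yea|yeah|yes+|yep)\s*[.!]*\s*$", re.IGNORECASE)
NO_RE = re.compile(r"^\s*(n|no+|nah|nope)\s*[.!]*\s*$", re.IGNORECASE)

def normalize_response(raw):
    """Collapse a free-text answer into 'yes', 'no', or None if ambiguous."""
    if YES_RE.match(raw):
        return "yes"
    if NO_RE.match(raw):
        return "no"
    return None  # leave genuinely ambiguous answers for manual review
```

Returning `None` rather than guessing keeps ambiguous records visible for review, which is often preferable to silently coercing them.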
Step 3) Process
Once the quality of the underlying data has been improved as much as possible, we can begin processing the data for analysis by a downstream routine. This often takes the form of removing stop words, which would otherwise clutter the analysis due to their high frequency and lack of meaning.
Next in this step, we tokenize the data into individual elements that can be easily interpreted by a model. This could mean dividing individual words into a bag of words, parsing documents, or splitting a document into paragraphs and sentences using delimiters such as new lines and punctuation (commas, pipes, etc.).
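The stop-word removal and word-level tokenization described above can be sketched in a few lines. The stop-word list here is a tiny illustrative sample; real pipelines typically use a library-provided list (e.g. from NLTK or spaCy):

```python
import re

# A small illustrative stop-word list, not a production-ready one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "are", "in", "it"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `tokenize("The shipping is fast and the price is fair")` yields `["shipping", "fast", "price", "fair"]`, leaving only the meaning-bearing words for the model.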
Step 4) Model
Within the field of natural language processing (NLP) there is a wide variety of models that can derive additional understanding from your natural language data. For example, the “bag of words” approach mentioned above is the simplest of these models: it provides a count of each word present, and that count can then be represented as a vector. Another is LDA topic modeling, which is especially useful for surfacing the most important concepts within a collection of documents or for providing automated filtering mechanisms.
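The bag-of-words representation is simple enough to sketch directly; the vocabulary and documents below are made up for illustration (for LDA and other richer models you would reach for a library such as gensim or scikit-learn):

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Represent a token list as a count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Two toy "documents" represented over a shared vocabulary:
vocab = ["fast", "shipping", "price", "fair"]
doc_a = bag_of_words(["fast", "shipping", "fast"], vocab)   # [2, 1, 0, 0]
doc_b = bag_of_words(["price", "fair", "shipping"], vocab)  # [0, 1, 1, 1]
```

Fixing the vocabulary up front is what makes the vectors comparable: every document maps to the same dimensions, so downstream models and similarity metrics can treat them uniformly.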
Step 5) Analyze
After the natural language data has been modeled, we must analyze the results to understand the quality of our output. As mentioned in step 1, output quality will vary greatly depending on the quality and comprehensiveness of the source data. While data quality plays a significant role in the results of a model, other factors such as cleaning, processing, and model hyperparameters affect the final result as well. It is at this stage that metrics such as vector similarity enable testing examples that, based on relevant domain logic, should be similar. If they are not, adjustments to the cleaning, processing, and model hyperparameters may be necessary.
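Cosine similarity is a common choice for the vector-similarity check described above. A minimal stdlib-only version, assuming the count vectors produced by a bag-of-words model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means identical direction,
    0.0 means no overlap at all."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

If two documents that domain experts consider near-duplicates score close to 0.0 here, that is a signal to revisit the cleaning, processing, or hyperparameter choices upstream.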
Step 6) Visualize
If, after analyzing the model, its results are satisfactory, the next step is to visualize those results. Presenting the data in a bar chart or another kind of visualization can be especially effective in an initial proof of concept for stakeholders and others who may not be familiar with the specifics of natural language modeling. For example, a common and popular visualization for LDA is the pyLDAvis package’s interactive chart, which is designed to work well with Jupyter notebooks.
Step 7) Operationalize
Perhaps the most overlooked element in the order of steps in natural language understanding, and in all of data science, is operationalization. The most well-trained, accurate, and potentially value-producing model is useless if it is not operationalized in a way that produces actual, tangible value for those in the business for whom it is intended.
This can take many forms. Sometimes a simple endpoint called via an ETL process is sufficient to feed further analysis; other times, a model is further developed and hardened around a more specific purpose or problem that needs to be solved.
One of the most nuanced steps in operationalizing any NLP development process is delivering the model itself so that users can gain value from it in a seamless way. This is especially evident in how Amazon and Google both use topic modeling to provide automated filters for product review data: by surfacing keyword filters, they spare the user from having to dig through or search for the key concepts that matter for a given product or location.
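As one sketch of the “simple endpoint” style of operationalization, here is a minimal stdlib-only HTTP service. The `predict_topic` function and its keyword table are hypothetical placeholders; a real deployment would load a serialized trained model instead:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict_topic(text):
    """Hypothetical stand-in for a trained topic model."""
    keywords = {"ship": "shipping", "price": "pricing", "return": "returns"}
    for kw, topic in keywords.items():
        if kw in text.lower():
            return topic
    return "other"

class TopicHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, score it, and return the topic as JSON.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"topic": predict_topic(payload.get("text", ""))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To expose the model: HTTPServer(("", 8080), TopicHandler).serve_forever()
```

An ETL job or front-end filter widget can then POST raw review text to this endpoint and receive a topic label back, which is the basic shape of the review-filtering pattern described above.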
With the amount of natural language data increasing every day, we must consider how to approach natural language understanding in a way that is automated, scalable, and operationalized. While the order of steps in natural language understanding discussed above serves as an overview, there remain countless opportunities for nuanced implementations of these steps, as well as novel techniques yet to be discovered by researchers. Only time will reveal how we will continue to interpret and understand natural language in the future.
Graphable delivers insightful graph database (e.g. Neo4j consulting) / machine learning (ml) / natural language processing (nlp) projects as well as graph and Domo consulting for BI/analytics, with measurable impact. We are known for operating ethically, communicating well, and delivering on-time. With hundreds of successful projects across most industries, we thrive in the most challenging data integration and data science contexts, driving analytics success.
Want to find out more about our Hume consulting on the Hume knowledge graph / insights platform? As the Americas principal reseller, we are happy to connect and tell you more. Book a demo by contacting us here.
Check out our article, What is a Graph Database? for more info.