Graph Data Lineage for Financial Services: Avoiding Disaster

By Sean Robinson, MS / Lead Data Scientist

September 28, 2021

While graph data lineage capabilities (the combination of data lineage and a graph database) suit every industry, the unique regulatory burden in financial services makes the technology even more critical there. Read on to find out why.

Graph Data Lineage for Financial Services

As discussed in more detail in our article What is Data Lineage and Data Provenance, data lineage (and the related concept of data provenance) is critical for managing regulatory compliance and audits, which is the most financially impactful context. It can be just as important for reliable application development and analytics decision-making. In the financial services (FinServ) space, regulatory compliance requirements are significant and increasing (in response to many high-impact financial crises over time), and reporting and audit demands keep growing in both number and difficulty. Together they now represent hundreds of billions of dollars or more worldwide in penalties and effort, along with an unprecedented level of technical complexity.

Challenges of FinServ Regulatory Compliance

Beyond the many laws and regulations that apply to organizations across industries (e.g. SOX, GDPR, etc.), FinServ companies in particular face even more regulation, whether through regulatory and enforcement organizations such as the SEC, FINRA, FinCEN, OCC, CFPB, FTC and CFTC (and many more), or through the many laws passed to protect consumers and even the economy itself against risk (e.g. Dodd-Frank, GLBA and so on). The potential cost of failing to comply is often measured in millions or tens of millions of dollars, or more, per incident.

To illustrate the magnitude of the consequences, most of the top US banks have paid tens of billions of dollars in penalties over the last two decades, with the top offender paying over $80 billion. That is why all FinServ organizations, banks included, spend huge sums of money on avoiding compliance violations. Given the sheer scale of these institutions and the complexity of the data and regulatory landscape, this problem is best solved by combining data lineage with a graph database.

A Compliance Nightmare Scenario: Danske Bank

There are endless regulatory violation stories from across the globe, but a well-known one comes from Denmark and involves Danske Bank. In what has been called the single largest money-laundering scandal in Europe, Danske (the largest bank in Denmark) allowed about €200 billion (about $236 billion) in suspicious funds to move through its Estonia branch between 2007 and 2015. The results were catastrophic: the bank lost roughly half its market value, the CEO was made to resign, multiple branch employees were arrested, billions in fines are to be paid in Europe, the US and beyond, and Denmark itself passed legislation increasing related fines more than eight-fold.

Why is Graph Data Lineage the Best Solution?

Given the vast and complex data landscape within today’s financial organizations, something is required beyond traditional data lineage methods. An effective solution must abstract an organization’s data landscape accurately while still representing it in a way that is faithful and understandable.

It also needs to provide answers to evolving and multifaceted questions about the data’s lineage in as near real-time as is required by the business, while remaining flexible enough to adapt to the constant changes within an organization’s data landscape.

While heritage methods built on traditional relational database management systems (RDBMS) have served as the foundation of data lineage for years, the more rigid and linear nature of an RDBMS makes it ill-suited for data lineage efforts given the ever-growing complexity and data volumes in today’s organizations. The sample architecture below illustrates the complexity of locating and tracking Personally Identifiable Information (PII) in a credit card application and processing context:

[Figure: sample architecture for locating and tracking PII across credit card application and processing systems]

Heritage RDBMS-based approaches that attempt to track PII across the numerous systems and networks in the example above struggle to keep tracking and linking the locations and connectedness of data at scale, as volumes grow and architectural topologies become more and more complex.

This is why graph databases (e.g. Neo4j data lineage, as well as other graph databases) have excelled here. They combine agility in responding to changes in the data landscape and data flows with a graph-based structure that is ideal for modeling the interconnected nature of data sources, their downstream systems, and the associated regulatory reports.
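To make the modeling idea concrete, here is a minimal sketch in plain Python rather than in an actual graph database. The node names (`card_application_form`, `crm_db` and so on), the node kinds, and the helpers `add_node`, `flows_to` and `affected_reports` are all hypothetical, chosen only to illustrate how sources, downstream systems, and regulatory reports can be linked and traversed.

```python
from collections import defaultdict, deque

# Hypothetical lineage graph: every node has a kind, every edge means "data flows to".
node_kind = {}
edges = defaultdict(set)

def add_node(name, kind):
    node_kind[name] = kind

def flows_to(src, dst):
    edges[src].add(dst)

def downstream(start):
    """Return every node reachable from `start` by following data flows (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        current = queue.popleft()
        for nxt in edges[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def affected_reports(source):
    """Regulatory reports that ultimately depend on `source`."""
    return sorted(n for n in downstream(source) if node_kind[n] == "regulatory_report")

# Illustrative data only; these system names are assumptions, not a real architecture.
add_node("card_application_form", "source")
add_node("crm_db", "system")
add_node("risk_scoring", "system")
add_node("data_warehouse", "system")
add_node("aml_report", "regulatory_report")

flows_to("card_application_form", "crm_db")
flows_to("crm_db", "risk_scoring")
flows_to("crm_db", "data_warehouse")
flows_to("data_warehouse", "aml_report")

print(affected_reports("card_application_form"))
```

Adding a new system or report in this sketch only means adding nodes and edges, with no change to the traversal. A graph database would typically express the same question as a path query instead of hand-written traversal code.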

As an example of what is possible, the figure below illustrates how graph databases (for example, Neo4j data lineage) let organizations more easily understand where risky data such as persisted PII (e.g. SSN) is located across a data landscape. The underlying graph structure makes scalable tracking and even alerting easier. It also enables analytics for understanding impact, as well as visualizations that capture more holistically how these sensitive data elements are embedded across sources and transformations, so that proper security measures can be implemented in time to avoid massive consequences.

[Figure: graph data lineage view of where PII (e.g. SSN) is persisted across a data landscape]
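The kind of question described above ("where is SSN persisted, and where is it unprotected?") can be sketched with a tiny in-memory model. This is only an illustration under stated assumptions: the `DataStore` class, the store names, the `encrypted` flag, and the helper functions are all hypothetical and not taken from any real system or product.

```python
from dataclasses import dataclass, field

@dataclass
class DataStore:
    name: str
    columns: set                                # field names persisted in this store
    encrypted: bool = False                     # assumed, simplified protection flag
    feeds: list = field(default_factory=list)   # names of downstream stores

# Hypothetical data landscape, for illustration only.
stores = {
    s.name: s
    for s in [
        DataStore("application_api", {"ssn", "name", "income"}, encrypted=True, feeds=["crm_db"]),
        DataStore("crm_db", {"ssn", "name"}, encrypted=True,
                  feeds=["marketing_extract", "fraud_model_features"]),
        DataStore("marketing_extract", {"ssn", "name"}, encrypted=False),
        DataStore("fraud_model_features", {"income_band"}, encrypted=False),
    ]
}

def stores_holding(column):
    """Every store that persists `column`."""
    return sorted(s.name for s in stores.values() if column in s.columns)

def unprotected(column):
    """Stores that persist `column` without encryption: candidates for an alert."""
    return sorted(s.name for s in stores.values() if column in s.columns and not s.encrypted)

def upstream(target):
    """All stores whose data can reach `target`, found by walking feeds in reverse."""
    parents = {name: [] for name in stores}
    for s in stores.values():
        for child in s.feeds:
            parents[child].append(s.name)
    seen, stack = set(), [target]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)

print(stores_holding("ssn"))
print(unprotected("ssn"))
print(upstream("marketing_extract"))
```

Here `unprotected("ssn")` flags the store that would need attention, and `upstream` shows how the SSN got there, which is the impact-analysis half of the question.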

Organizations are pursuing such graphs to track critical data lineage requirements, making updates to the data lineage graph (the PII graph in the example above) part of their foundational application development requirements. When the time comes to establish the exact PII impact in an audit or other scenario, it is then faster and easier both to capture the context and to perform the required analyses.
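One way to picture lineage updates as part of application development is a small registration call that each pipeline makes when it is deployed. The sketch below is an assumption about how that could look, not a description of any specific tool: `register_flow`, `audit`, the team names, and the system names are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical lineage registry: (source, target) -> metadata about that flow.
lineage = {}

def register_flow(source, target, pii_fields, owner):
    """Idempotently record that `source` feeds `target`.

    Re-registering the same flow updates its metadata instead of adding a duplicate.
    """
    now = datetime.now(timezone.utc)
    entry = lineage.setdefault((source, target), {"first_seen": now})
    entry.update(pii_fields=frozenset(pii_fields), owner=owner, last_seen=now)
    return entry

def audit(pii_field):
    """List (source, target, owner) for every registered flow carrying `pii_field`."""
    return sorted(
        (src, dst, meta["owner"])
        for (src, dst), meta in lineage.items()
        if pii_field in meta["pii_fields"]
    )

# Illustrative deployments; the names are assumptions.
register_flow("crm_db", "statement_service", {"ssn", "address"}, "cards-team")
register_flow("crm_db", "statement_service", {"address"}, "cards-team")  # later release drops SSN
register_flow("crm_db", "collections_queue", {"ssn"}, "collections-team")

print(audit("ssn"))
```

Because registration happens at deploy time, the answer to an audit question reflects the flows as they currently exist rather than a diagram drawn once and left to go stale.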

Graph Data Lineage: Conclusion

It is not impossible to use other data stores for data provenance and data lineage, but it is becoming more difficult every year for those options to scale. The growing use of graph data lineage is ultimately a response to the growth in FinServ regulatory requirements, and to the associated technical complexity and data volumes within financial services, over the last decade and more.


Graphable delivers insightful graph database (e.g. Neo4j consulting) / machine learning (ML) / natural language processing (NLP) projects as well as graph and Domo consulting for BI/analytics, with measurable impact. We are known for operating ethically, communicating well, and delivering on time. With hundreds of successful projects across most industries, we thrive in the most challenging data integration and data science contexts, driving analytics success.

Want to find out more about our Hume consulting on the Hume knowledge graph / insights platform? As the Americas principal reseller, we are happy to connect and tell you more. Book a demo by contacting us here.

Check out our article, What is a Graph Database? for more info.

