The Pandora Papers findings were brought together by the International Consortium of Investigative Journalists (ICIJ). The leak earns its name, and we know all too well how the ancient Greek story of Pandora’s Box turned out. In a recent disclosure, roughly twelve million documents revealed hidden wealth, tax avoidance and, in some cases, money laundering by some of the world's richest and most powerful people. The files exposed how these people - including more than 330 politicians from 90 countries - use secret offshore companies to hide their wealth, uncovering dirty deeds that make honest folk cringe.
The mined data has revealed corruption, tax evasion, scandals, secretly owned companies, and huge hidden property deals. The journalists behind the Pandora stories may well be contenders for Pulitzer Prizes.
What’s in the Box
The sheer magnitude of the data is mind-blowing. Extracting information useful enough to support informed decisions requires a vast amount of organizing, tagging, sorting, and categorizing, followed by analysis of an immense number of combinations and permutations. Doing this in a timely manner is beyond the reach of traditional computing. The task is made harder by the unstructured nature of the Pandora Papers files, which include documents, images, emails, spreadsheets, video, audio, presentations, and more.
The leak holds about 2.94 terabytes of data in roughly 12 million files, nearly all of them unstructured apart from the various spreadsheet formats. More than half of the files (6.4 million) are text documents, including more than 4 million PDFs, some of them running to more than 10,000 pages. The documents include passports, bank statements, tax declarations, company incorporation records, real estate contracts and due diligence questionnaires. There are also more than 4.1 million images and emails in the leak (ICIJ, BBC).
Katana Graph reports that 80% of the world’s data is unstructured; the Pandora Papers are even less structured than that.
Extracting meaning from droves of different types of data is particularly challenging when a single document can contain many years’ worth of emails, charts, and attachments. Some providers digitized their records and structured them in spreadsheets, while others kept paper files that were scanned. PDFs made from scanned paper that included spreadsheets are exceptionally difficult to interpret programmatically.
Then there is the matter of languages. The Pandora Papers include documents in English, Spanish, Russian, French, Arabic, Korean and other languages, requiring extensive coordination among ICIJ partners.
Further, there are the usual combinatorial problems of connection mapping: the number of possible links grows far faster than the number of entities. In this case, Pandora analysts faced 27,000 companies and 29,000 so-called ultimate beneficial owners (BBC).
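To give a feel for that combinatorial growth, the short Python sketch below counts candidate links using only the two figures quoted above. The choice to count every company-owner pairing, and every unordered pair of entities, is an illustrative assumption about the worst case, not a description of how ICIJ or the BBC measured the problem.

```python
from math import comb

# Figures quoted in the text (BBC).
companies = 27_000
owners = 29_000

# Worst case 1: every company could in principle be linked to every owner.
candidate_links = companies * owners

# Worst case 2: any two entities could be linked, including
# company-to-company and owner-to-owner ties.
entity_pairs = comb(companies + owners, 2)

print(f"company-owner pairings: {candidate_links:,}")  # 783,000,000
print(f"all entity pairs:       {entity_pairs:,}")     # 1,567,972,000
```

Even before looking at paths that run through several intermediaries, the space of direct pairings alone is in the hundreds of millions.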
Sifting through this much data in a reasonable amount of time is beyond what people can do unaided. Gathering insights on timely topics could take weeks, months or even years, and would require thousands of people working collaboratively to make links across vastly different types of files from 90 different countries.
Serving Justice
Knowledge graphs are representations of networks of real-world entities (nodes) - in the case of Pandora, the people, documents, transactions and events described in the “papers” - and their relationships with each other (edges). Computing on knowledge graphs is now by far the best option for exploring connections and surfacing insights quickly enough for them to still be relevant. This requires that the structured and unstructured data be processed jointly to spot money laundering and the webs of relationships through which information, money, and property are transferred. Katana Graph's graph engine was designed to cut through exactly this sort of monumental data challenge, saving energy, resources, and, most crucially in this case, time, so that more justice might be served.
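As a rough illustration of the node-and-edge idea, here is a minimal Python sketch that stores a handful of relationships and searches for the chain connecting a person to a property. Everything in it is hypothetical: the entities, the relationship names, and the helper functions `build_graph` and `find_path` are invented for this example and are neither data from the leak nor Katana Graph's API. A production graph engine would run this kind of traversal at a vastly larger scale.

```python
from collections import defaultdict, deque

# Hypothetical (subject, relationship, object) triples; none come from the leak.
TRIPLES = [
    ("Person A", "owns", "Shell Co 1"),
    ("Shell Co 1", "owns", "Shell Co 2"),
    ("Shell Co 2", "holds_title_to", "Villa X"),
    ("Person B", "director_of", "Shell Co 2"),
    ("Document 17", "mentions", "Person A"),
]

def build_graph(triples):
    """Index each edge in both directions so a path can be followed either way."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
        graph[obj].append((f"inverse:{relation}", subject))
    return graph

def find_path(graph, start, goal):
    """Breadth-first search: shortest chain of nodes and edge labels, or None."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [relation, neighbor]))
    return None

graph = build_graph(TRIPLES)
path = find_path(graph, "Person A", "Villa X")
print(" -> ".join(path))
# Person A -> owns -> Shell Co 1 -> owns -> Shell Co 2 -> holds_title_to -> Villa X
```

The point of the sketch is that once people, companies, documents and assets share one graph, a question such as "how is this person connected to this property?" becomes a path search rather than a manual read through thousands of files.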