Data Interlinked

2018-02-18

This article contains some very minor spoilers for the movie Blade Runner 2049. If you have not seen it yet, by all means do; it is really good.

"And blood-black nothingness began to spin... A system of cells interlinked within cells interlinked within cells interlinked within one stem... And dreadfully distinct against the dark, a tall white fountain played."

Pale Fire by Vladimir Nabokov © Berkley Medallion

The above is a quote from the poem Pale Fire, which appears in the novel of the same name by Vladimir Nabokov and was recently used in the movie Blade Runner 2049. In the movie it is part of the Baseline test, a way to test the emotional response of a Replicant. Replicants undergo this process because their creators fear that the connections they might make in their lives would give them emotions that would interfere with their intended purpose.

Blade Runner 2049 Poster for the Movie © Columbia Pictures

In life, such relationships always surround us. They are not just between people, but in our work, in our beliefs, in our art and in the knowledge we represent.

Suppose we intend to describe the link between the movie Blade Runner and the book Pale Fire. We can summarize this information with a number of facts.

Blade Runner is a movie. 
Blade Runner has a character named K.
K is a Replicant.
Replicants must pass a Baseline Test.
Baseline Test is based on the poem Pale Fire.
Pale Fire is written by Vladimir Nabokov.

The above facts show the links that can tie together various pieces of knowledge. One can trace the connections from a simple description of a movie, released in 2017, to the author Vladimir Nabokov, as was intended by the writers of the movie.

Although the above recitation of facts is easy to follow, from a knowledge representation perspective one can find some issues with it.

First, the description is imprecise: "Blade Runner" could refer either to the newer movie Blade Runner 2049 or to the 1982 original titled Blade Runner.

Movie poster for the first Blade Runner Movie © 1982 The Ladd Company

Second, the set of facts is incomplete. The poem Pale Fire is indeed written by Vladimir Nabokov, but it is presented in the book Pale Fire, also written by Nabokov, as the work of the fictional poet John Shade. The set of facts here fails to make the explicit distinction between Pale Fire (poem) and Pale Fire (book), and fails to state that the poem is contained in the book.

Third, and perhaps most importantly, the above list of facts relies a lot on the reader's grasp of English. For a program, it can be surprisingly difficult to understand relationships such as "is a", "is based on", "named", etc. between the various elements in the text.

These issues may seem somewhat nit-picky, as this information can be derived from the rest of the article. However, this means that the knowledge in the summary does not stand on its own. If those facts are presented without the rest of the article, or if the reader is a machine rather than a person who can easily add some context, they might lead to incorrect or insufficient conclusions. The reader might wrongly conclude that the 1982 movie Blade Runner has a character named K, or fail to see that the poem is contained in the book by the same author. And although in the case of Blade Runner these issues might seem small, this is different if the knowledge relates to financial, legal or clinical domains. There, mistakes or omissions can be costly.

Having a larger list of more detailed facts can help with these issues, but to a certain extent they remain due to the ambiguity of natural language. In addition, it is often very easy to skip over implicit details. The issues remain especially for a computer, which still cannot make (enough) sense of this information.

A proposed solution to these issues is Linked Data, and in particular the Resource Description Framework (RDF), with which Linked Data can be expressed. These technologies allow us to represent the above facts in a more formal and precise way, one that both humans and machines can read and write.

Resource Description Framework (RDF) Logo © W3C

One significant feature of RDF is that it requires precise naming. Many of its elements are either an Internationalized Resource Identifier (IRI) or a plain data value (a literal). Good examples of the former are URLs, such as the link to this website, http://www.newresalhaider.com, which allow one to find a web resource. Examples of the latter are texts or numbers, such as "Blade Runner" or 15 respectively.

The other significant feature of RDF is that most knowledge is represented as a set of facts, where each fact is expressed as a subject, predicate, object triple. For example, the fact "Blade Runner is a movie" is expressed with the subject "Blade Runner", the predicate "is a" and the object "movie".
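
To make the shape of a triple concrete before turning to RDF syntax, here is a minimal Python sketch. It is only an illustration: the `Triple` class and the plain-text labels are assumptions made for this example, and real RDF would use IRIs or literals in their place.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact: a subject, a predicate and an object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical plain-text labels; real RDF would use IRIs or literals here.
fact = Triple(subject="Blade Runner", predicate="is a", obj="movie")

# A set of facts is then simply a collection of such triples.
facts = {
    fact,
    Triple("Blade Runner", "has a character named", "K"),
    Triple("K", "is a", "Replicant"),
}

# Everything stated about one subject can be found by filtering.
about_k = [t for t in facts if t.subject == "K"]
print(about_k)
```

Because every fact has the same three-part shape, a program can filter and combine facts without having to parse English sentences.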

Putting this together in RDF (using the Turtle notation) you would get a triple such as:

<http://www.newresalhaider.com/ontologies/bladerunner/blade-runner> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.newresalhaider.com/ontologies/bladerunner/movie>.

This example is an RDF way of saying "Blade Runner is a type of movie" or alternatively "Blade Runner is a movie". This type of representation shows us a couple of benefits. First, we are now being more precise, as each element in the triple refers to one specific resource, for example Blade Runner or Movie, and its IRI makes sure we do not confuse the term with anything else. Second, this also shows that you can link to resources from different places: the predicate "type" is from a completely different domain. This allows us to re-use knowledge that has already been defined. As one can expect, saying something is of a "type", for example that an apple is a type of fruit, is actually very common. This is one of the main strengths of Linked Data: one can re-use knowledge that has already been stated.

Typing out the full IRI each time can be pretty bothersome, and it does not help readability either. Thankfully, we can declare a base IRI once and then write only the last part of each IRI. Here we define a base and refer to the subject and object as "<#blade-runner>" and "<#movie>" respectively. (Strictly speaking, these resolve to IRIs ending in "#blade-runner" and "#movie" rather than the "/blade-runner" and "/movie" written out above, so from here on we settle on these slightly different names.)

@base <http://www.newresalhaider.com/ontologies/bladerunner> .
<#blade-runner> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <#movie>.

We can do something similar when linking to elements that have already been defined elsewhere. In this case we define a prefix to use as an abbreviation while writing:

@base <http://www.newresalhaider.com/ontologies/bladerunner> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<#blade-runner> rdf:type <#movie>.

In practice, "rdf:type" as a predicate is so common that there is an even simpler notation. We can use 'a' as a predicate, which is in line with what we intend to express: "Blade Runner is a movie".

The resulting RDF facts look as follows (note that the rdf prefix could be omitted here as the "a" abbreviation does not make it necessary):

@base <http://www.newresalhaider.com/ontologies/bladerunner> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<#blade-runner> a <#movie>.
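
These abbreviations are purely notational: each short form stands for a full IRI. The Python sketch below illustrates the expansion under simplifying assumptions. The `expand` helper is hypothetical and handles only the three forms used so far (a relative IRI in angle brackets, a prefixed name, and the keyword `a`); it is not a Turtle parser and ignores literals.

```python
from urllib.parse import urljoin

# The base IRI and prefix declared in the listing above.
BASE = "http://www.newresalhaider.com/ontologies/bladerunner"
PREFIXES = {"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"}
RDF_TYPE = PREFIXES["rdf"] + "type"


def expand(term: str) -> str:
    """Expand one abbreviated term into a full IRI (simplified sketch)."""
    if term == "a":
        # 'a' is shorthand for rdf:type.
        return RDF_TYPE
    if term.startswith("<") and term.endswith(">"):
        # Relative IRIs are resolved against the base IRI.
        return urljoin(BASE, term[1:-1])
    # Otherwise treat the term as a prefixed name such as rdf:type.
    prefix, _, local = term.partition(":")
    return PREFIXES[prefix] + local


short = ("<#blade-runner>", "a", "<#movie>")
full = tuple(expand(term) for term in short)
for iri in full:
    print(iri)
```

Both `expand("a")` and `expand("rdf:type")` give the same IRI, which is why the prefix declaration becomes optional once `a` is used.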

If we want to use plain text as an object, say when referring to the title of a movie, we can do that as well:

@base <http://www.newresalhaider.com/ontologies/bladerunner> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<#blade-runner> a <#movie>.
<#blade-runner> <#title> "Blade Runner 2049".

With this way of writing, we can actually rewrite our original set of facts as follows:

@base <http://www.newresalhaider.com/ontologies/bladerunner> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<#blade-runner> a <#movie>.
<#blade-runner> <#title> "Blade Runner 2049".
<#blade-runner> <#has-character> <#K>.
<#K> foaf:name "K".
<#K> a <#replicant>.
<#replicant> <#must-pass> <#baseline-test>.
<#baseline-test> <#based-on> <#pale-fire-poem>.
<#pale-fire-poem> <#included-in> <#pale-fire-book>.
<#pale-fire-book> <#written-by> <#nabokov>.
<#nabokov> foaf:name "Vladimir Nabokov".

With this version we have defined our list of facts in a more formal manner than before. This makes it much simpler for machines to process this set of facts. We also used the Friend of a Friend (FOAF) ontology for the notion of a name, the same notion that is used when describing relationships between people. One could argue that using an existing movie dataset, such as the Linked Movie Database, would have been even better; we will leave that as an exercise for the reader.
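
As a rough illustration of what a machine can do with such facts, the following Python sketch follows the links from the movie to the name of the author. It makes several simplifying assumptions: the triples are copied by hand with the IRIs shortened to local names, and the `objects` and `follow` helpers are hypothetical names written for this example, not part of any RDF library.

```python
# The facts from the Turtle listing, with IRIs shortened to local names
# for readability (a simplification, not how an RDF library stores them).
TRIPLES = [
    ("blade-runner", "a", "movie"),
    ("blade-runner", "title", "Blade Runner 2049"),
    ("blade-runner", "has-character", "K"),
    ("K", "name", "K"),
    ("K", "a", "replicant"),
    ("replicant", "must-pass", "baseline-test"),
    ("baseline-test", "based-on", "pale-fire-poem"),
    ("pale-fire-poem", "included-in", "pale-fire-book"),
    ("pale-fire-book", "written-by", "nabokov"),
    ("nabokov", "name", "Vladimir Nabokov"),
]


def objects(subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def follow(start, predicates):
    """Follow a chain of predicates from start; return the nodes visited."""
    path = [start]
    for predicate in predicates:
        matches = objects(path[-1], predicate)
        if not matches:
            raise LookupError(f"no {predicate!r} fact for {path[-1]!r}")
        # Take the first match; each step in this small example has only one.
        path.append(matches[0])
    return path


chain = ["has-character", "a", "must-pass", "based-on",
         "included-in", "written-by", "name"]
path = follow("blade-runner", chain)
print(" -> ".join(path))
```

No understanding of English is needed here: the program only compares subjects and predicates, which is exactly what the formal notation makes possible.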

Hopefully this example has given a glimpse of the possibilities of the Semantic Web, for which Linked Data forms the basis. Of course, the above only scratches the surface of what can be done with RDF and Linked Data. With each addition, our set of facts could grow. One could go beyond a single movie and build a collection of poems that are referenced in movies, or a knowledge base of the Blade Runner franchise. It might be easier than one expects, due to the fact that knowledge, much like people, is...

Interlinked.
