Working with Knowledge(-Graphs) instead of Data(-Bases)
Semantics enriches data with meaning. This meaning is stored in semantic graph databases. Graph databases open up new possibilities and perspectives for developers in many ways, and not just when it comes to modeling, classes, and their properties. In contrast to the well-known SQL and NoSQL databases with their tables and columns, listings and fields, graph databases extend the familiar view of schemas, objects, and inheritance with new features such as metadata, inference, and linking multiple graph databases to extensive knowledge graphs. With the appropriate tools, data can be linked to a semantic network. This is not new. Tim Berners-Lee coined the term "semantic web" as early as 2001. The W3C extended it in 2004 with standards such as OWL (Web Ontology Language, extended in 2012 with OWL2) and made it practical.
What was of rather scientific interest for a long time and led an isolated existence has now evolved into professional products that have reached a level of maturity in terms of quality, runtime behavior, availability, and scalability that is worth a closer look. The first thing to clarify would be in which scenarios the use of a semantic graph database is worthwhile and what advantages the new technologies bring for developers and users. The aim is to bridge the gap between current research and practical development, covering topics such as modern RDF triple stores, new types of classification and validation of objects, a new perspective on data properties and automatic learning, but also on interoperability, portability, reusability, harmonization, and data quality. This will require several articles, and by the end of this series, you will be able to create a semantic data model with classes and properties and create, read, update, and delete (CRUD) instances in an ontology using the SPARQL query language in Node.js instances.
Knowledge Graphs
Semantic graph databases or ontologies essentially consist of a class hierarchy, a taxonomy (also referred to as T-Box), and the so-called individuals, i.e., the instances of the classes (also referred to as A-Box). In addition, there are data properties with concrete values of various data types, the so-called literals, and the object properties with references to other individuals, the actual links. All these entities are also referred to as resources. Within a graph, all resources are identified using so-called URIs or IRIs; IRI stands for Internationalized Resource Identifier, more on this in the next section. URIs/IRIs are comparable to primary keys from the world of tables and collections but are unique across the entire graph and not just related to the instances of a particular class.
The semantic web is also referred to as the "web of data". One of the paradigms is that not only resources within an ontology can be linked to each other, but also any ontologies among themselves, including the resources contained therein. The combination of multiple ontologies, usually in hierarchical structures, is referred to as a knowledge graph. A special feature in semantic graph databases is the "reasoner" - a rule engine within the database itself with various standardized but also configurable rule sets. On the one hand, the reasoner maps the so-called inference: a function to automatically draw logical conclusions from already existing statements and make them available as new knowledge. On the other hand, it serves to validate and check for consistency, for example, to identify contradictions or rule violations, and thus also to ensure data quality.
Terms and Abbreviations
Anyone dealing with the semantic web comes across a number of new terms and abbreviations, the knowledge of which is important for understanding models, relations, and functions.
First, a brief summary of the differences between URL, URI, and IRI. A Uniform Resource Locator (URL) specifies the location of a specific resource, so it is used for localization. This includes addresses, i.e., references to content that may change constantly, such as an HTML page or a database. According to RFC 1738 [1] from 1994, only a subset of 60 US-ASCII characters is allowed for a URL, which represents a significant restriction for internationalized applications today. The term base URL is used to divide an address with a path into several sub-addresses, for example, http://my.baseurl.com/cat1/func2. The https scheme instead of http for the URL has a special meaning because it causes the transfer to be encrypted, usually between the web server and the browser.
The Uniform Resource Identifier (URI) is used to uniquely identify specific resources globally. On the web, these are, for example, specific files or interfaces to services. They consist of a URL followed by a file or service name, optionally with additional query parameters or hash values, such as http://my.domain/db/customers?id=7 or http://my.domain/db/product#46510; URLs are therefore a subset of URIs. According to RFC 3986 [2], URIs are subject to the same character set restrictions as URLs.
The third in the group is the IRI, the Internationalized Resource Identifier. It was standardized by the IETF (Internet Engineering Task Force) in 2005 to meet modern requirements for global use. It serves the same purpose as the URI, namely to uniquely identify a specific resource worldwide, but without the restriction to the ASCII character set. According to RFC 3987, Unicode characters are allowed for IRIs with a few exceptions [3]. In the context of the semantic web, IRIs are composed of a namespace, a separator, and a so-called local name. The #-character has established itself as the de facto standard separator, even though a few others are still allowed. Analogous to URL and URI, the namespace should actually be called IRL, i.e., Internationalized Resource Locator, but this term has never caught on. A valid IRI for a resource in an ontology is, for example: http://ont.enapso.com/erp#product_1. The namespace here is http://ont.enapso.com/erp# and the local name is product_1, referred to simply as name or identifier in the further course of this article. In contrast to URLs, the https scheme is not only uncommon for IRIs, but due to the purely identifying character of an IRI, it also does not cause any encryption, neither of the transport nor of the content.
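The decomposition of an IRI into namespace and local name can be illustrated with a few lines of Python. This is only a sketch: the helper name split_iri is hypothetical, and the rule "split at the last # or, failing that, at the last /" is an assumption that covers the common separators, not a rule taken from a specification.

```python
def split_iri(iri: str) -> tuple[str, str]:
    """Split an IRI into (namespace, local name).

    Assumption: the separator is the last '#', or the last '/' if no '#' exists.
    """
    for separator in ("#", "/"):
        position = iri.rfind(separator)
        if position != -1:
            # The separator itself belongs to the namespace.
            return iri[: position + 1], iri[position + 1 :]
    return "", iri  # no separator found: treat everything as the local name


namespace, local_name = split_iri("http://ont.enapso.com/erp#product_1")
print(namespace)   # http://ont.enapso.com/erp#
print(local_name)  # product_1
```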
Namespaces and Domains
For the interoperability of ontologies, it is considered good practice to use only those namespaces for which you are also the official domain owner. In the example ontology for this text, it could be, for example, http://ont.enapso.com/[...]. Although the validity of the namespaces is currently not checked by name servers (DNS) or by an official institution when publishing ontologies, you risk errors in applications or with users by using namespaces such as http://foo.bar/ because the resources can no longer be uniquely referenced globally and can therefore not be used in internationally composed knowledge graphs.
Names with Prefixes
If resources in RDF/OWL files or in SPARQL commands were always referenced by their full IRI - syntactically, even additional surrounding <- and >-characters are necessary - they could quickly become confusing and difficult to read. Consider the following triple (see the "RDF" section):
...
Within the framework of RDF, OWL, and SPARQL, this notation is referred to as a prefixed name.
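How a prefixed name relates to a full IRI can be sketched in Python as follows. The function name expand and the prefix table are hypothetical, and binding the prefix erp to the namespace http://ont.enapso.com/erp# is an assumption made for this illustration only.

```python
# Assumed prefix table: "erp" is bound to the example namespace used in this article.
PREFIXES = {
    "erp": "http://ont.enapso.com/erp#",
}


def expand(prefixed_name: str, prefixes: dict[str, str] = PREFIXES) -> str:
    """Expand a prefixed name such as 'erp:product_1' to a full IRI in angle brackets."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return f"<{prefixes[prefix]}{local_name}>"


print(expand("erp:product_1"))  # <http://ont.enapso.com/erp#product_1>
```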
Primary Keys and Relationships
In addition to the conventions for namespaces, there are other proven practices for IRIs. Since each resource has its own database-wide and internationally unique IRI, no additional primary keys are required in a graph; so no special ID fields need to be provided, for example, in the form of data properties. The existing IRIs are completely sufficient for this - de facto, these are the global primary keys. The same applies to relationships and foreign keys. In a 1:1 or 1:n relationship, one or more object properties of a master individual simply reference its child individuals using the child IRIs. The following example shows a 1:n relationship between invoices and products, represented here in so-called prefixed names notation:
...
erp:user_1 props:hasRole erp:role_1
erp:user_1 props:hasRole erp:role_2
erp:user_2 props:hasRole erp:role_1
erp:user_2 props:hasRole erp:role_2
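The user and role statements above can be read in both directions without any separate key or join table, because the IRIs themselves act as keys. The following Python sketch shows this with plain tuples; it is an illustration of the idea only and not how a graph database stores or indexes triples.

```python
from collections import defaultdict

# The four statements from the listing above as (subject, predicate, object) tuples.
triples = [
    ("erp:user_1", "props:hasRole", "erp:role_1"),
    ("erp:user_1", "props:hasRole", "erp:role_2"),
    ("erp:user_2", "props:hasRole", "erp:role_1"),
    ("erp:user_2", "props:hasRole", "erp:role_2"),
]

roles_by_user = defaultdict(set)
users_by_role = defaultdict(set)
for subject, predicate, obj in triples:
    if predicate == "props:hasRole":
        roles_by_user[subject].add(obj)   # forward direction: user -> roles
        users_by_role[obj].add(subject)   # backward direction: role -> users

print(sorted(roles_by_user["erp:user_1"]))  # ['erp:role_1', 'erp:role_2']
print(sorted(users_by_role["erp:role_1"]))  # ['erp:user_1', 'erp:user_2']
```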
RDF
RDF is the abbreviation for Resource Description Framework [4], a modeling concept standardized by the World Wide Web Consortium (W3C) for the semantic web that enables simple logical statements using triples of subject, predicate, and object, which can be formulated in directed graphs and easily read, understood, and visualized by machines.
Examples: Max hasAge 32 (data property) or Josef hasSpouse Maria (object property).
A collection of triples is also referred to as a triple store. At the lowest level, a semantic graph database is such a triple store. How the triples are managed internally as efficiently as possible is up to the database vendor. For import into and export from files, the RDF triples are represented in various formats. These can be, for example, Turtle, TriG, N-Triples, JSON-LD, or even RDF/XML, with their respective advantages and disadvantages. While Turtle (.ttl) is popular because of its compactness and easy readability, TriG (.trig) is more suitable for backup and restore of knowledge graphs since the information about subgraphs is also serialized and deserialized. As expected, JSON-LD is increasingly found in JavaScript environments and RDF/XML more in the Java world. However, in the spirit of interoperability, semantic graph databases such as GraphDB from Ontotext have import and export functions for all important formats.
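To make the idea of a triple store tangible, here is a deliberately naive in-memory sketch in Python. The class TripleStore and its methods are hypothetical names for this illustration; real products use indexes and persistent storage, which are not shown here.

```python
class TripleStore:
    """A naive in-memory collection of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        return [
            (s, p, o)
            for (s, p, o) in self._triples
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)
        ]


store = TripleStore()
store.add("Max", "hasAge", 32)              # data property: the object is a literal
store.add("Josef", "hasSpouse", "Maria")    # object property: the object is an individual

print(store.match(predicate="hasSpouse"))   # [('Josef', 'hasSpouse', 'Maria')]
print(store.match(subject="Max"))           # [('Max', 'hasAge', 32)]
```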
RDFS
The abbreviation RDFS stands for Resource Description Framework Schema [5], a semantic extension of the RDF vocabulary specifically for data modeling. It contains mechanisms especially for grouping resources and their relationships to each other. It is comparable to the class system known from object-oriented programming (OOP), but with an essential and important extension: While in OOP it is defined which properties a class has and may have, an RDF schema can specify for properties to which class an individual is automatically assigned if it contains or references the property in question. This includes support for subClassOf and subPropertyOf to organize classes and properties hierarchically. RDFS plus, finally, is an extended version of RDFS that supports symmetric, inverse, and transitive properties. These new concepts, in contrast to OOP, will be explained in more detail below. Many of the RDFS and RDFS plus components are also part of the even more expressive Web Ontology Language (OWL).
...
OWL
The abbreviation OWL stands for Web Ontology Language [6]. The language was specially designed for the semantic web to represent knowledge about objects and classes (as groups of objects) and their relationships to each other. Ontologies are based on RDF and OWL and can be read and modified with SPARQL as a query language. Reasoners support OWL.
Class Trees
The taxonomy is an essential part of a semantic database model. It corresponds more to the hierarchical class model from object-oriented programming than to the schemas of traditional SQL and NoSQL databases for tables or collections. The classes in the taxonomy are organized in a tree topology, and they can also inherit their properties.
OOP vs. Graphs - or the Inadequacy of Tree Topologies
...
However, with conventional class trees, it is poorly practicable to impossible to map real complex environments. Although they are suitable for implementing inheritance vertically, horizontal aspects, i.e., those that affect several subtrees, cannot be mapped in tree topologies. Take the employees of a company as an example. In a traditional class tree, Person might be at the top, then Employee and Freelancer might follow, and under Employee there might be Developer, Architect, and Manager.
This already shows the first problem: Because even if a freelancer might not be able to become a manager, why should he not also be a developer or software architect? Why not an architect and manager at the same time? Generalized, contract types are mixed here vertically with roles in a company horizontally (Figure 1). What can still be solved with class designators with multiple meanings in small models, quickly escalates into semantic chaos when the model grows. To address this problem, in classical OOP you now add fields such as Role. But where?
For the Person class, it is too specific, for Employee and Freelancer already redundant again - which increases the code and maintenance effort. Or you add another helper class called Personnel in between. Semantically, this is ambiguous and also additional code. But not only that; here again, two concepts are mixed, namely that of classification and that of properties. Straightforward would be if an individual could simultaneously be a member of multiple classes, in the example, for instance, descend from the superclass Person and at the same time be a member of a role class and a contract type class. Semantic databases support exactly that.
Solution of Multiple Classification
In OOP, classes are primarily used as schemas, i.e., as a kind of specification for instances. From this perspective, it is naturally difficult to mix schemas with any overlaps or contradictions. The same applies to the schemas of traditional relational databases. In ontologies, on the other hand, classes are primarily considered as groups of individuals. Individuals can be explicitly assigned to one or more classes or also implicitly by the use of certain properties.
In the following, therefore, it is better to talk about multiple classification than multiple inheritance. A good practice for the above example would first be to introduce only one class Person; then a class Role and below it the subclasses Developer, Architect, and Manager. In addition, there is the class Personnel and below it the subclasses Employee and Freelancer. Using this construct, a particular person, more precisely an individual of the class Person, can now be a member of the class Person, of exactly one of the classes Employee or Freelancer, and additionally of one or more role classes, i.e., Developer, Architect, or Manager (see Figure 2).
A single individual can thus be a member of many classes. This model can now be extended as desired. Think, for example, of the class WorkTimeModel with the subclasses FullTimeWorker and PartTimeWorker, which can be completely independent of role and contract type. In addition to easier modeling, another advantage is that all members of a particular class can also be queried just as easily with the query language SPARQL. This makes queries very clear and also easy to read.
...
Disjoint Classes
Now it often happens in practice that a single person holds multiple roles at the same time. In contrast, however, he or she normally cannot be an Employee and Freelancer at the same time. Therefore, these two classes are also referred to as "disjoint". This states that an individual cannot be a member of these two classes at the same time. The attempt to specify this would be recognized by the reasoner as a rule violation and identified as an inconsistency in the database.
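The kind of check a reasoner performs for disjoint classes can be approximated in a few lines of Python. This is a simplified sketch under stated assumptions: the individuals Alice and Bob and the function name disjointness_violations are hypothetical, and a real reasoner would also take inferred class memberships into account.

```python
# Pairs of classes declared disjoint (from the example above).
DISJOINT_PAIRS = [frozenset({"Employee", "Freelancer"})]

# Hypothetical individuals with their explicitly assigned classes.
memberships = {
    "Alice": {"Person", "Employee", "Developer", "Manager"},  # several roles: fine
    "Bob": {"Person", "Employee", "Freelancer"},              # violates disjointness
}


def disjointness_violations(memberships, disjoint_pairs):
    """Return (individual, conflicting classes) for every violated disjointness axiom."""
    violations = []
    for individual, classes in memberships.items():
        for pair in disjoint_pairs:
            if pair <= classes:  # both disjoint classes are assigned
                violations.append((individual, tuple(sorted(pair))))
    return violations


print(disjointness_violations(memberships, DISJOINT_PAIRS))
# [('Bob', ('Employee', 'Freelancer'))]
```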
Schemas
A developer from the OOP world would initially be tempted to view classes in an ontology as a schema. And indeed, a graph database can also be used in this sense. For each class, data and object properties are specified, and all individuals of this class can then receive values or IRI references for these properties. The emphasis here is on "can", because for ontologies the Open World Assumption (OWA) applies. This states that everything that is not explicitly specified is simply unknown, but not necessarily false.
This means: The reasoner will not identify the absence of, for example, a value for an individual in an ontology as a rule violation, because in the sense of OWA it does not mean that the missing value might not be defined in another referenced ontology. Theoretically, in a graph database, all individuals can use any properties. Also, individuals do not necessarily have to have one or more classifications. In other words: A graph database is initially schemaless. Nevertheless, the assignment of an individual to a class is, of course, useful, for example, to distinguish different types of individuals such as customers, products, or invoices in an ERP application.
Explicit and Implicit Classification
An explicit assignment of an individual to one or more classes is done via type statements. Here, corresponding statements per individual determine its memberships:
enapsodnp:Max rdf:type enapsodnp:Person
enapsodnp:Max rdf:type enapsodnp:Freelancer
enapsodnp:Max rdf:type enapsodnp:Developer
For graph databases with many instances without extensive semantics and the need to process these even without a reasoner with simple CRUD operations (Create, Read, Update, and Delete), this is already a sufficient and practicable approach. A great advantage of this type of modeling is that, referring to the above example, all persons, but alternatively also all freelancers or all developers can now be queried very easily with a single command. As a further advantage, you can also query all individuals of the class Personnel. For this, however, you need the reasoner with RDFS support. RDFS can specify the property subClassOf and thus allows the creation of class hierarchies or taxonomies:
enapsodnp:Employee rdfs:subClassOf enapsodnp:Personnel
enapsodnp:Freelancer rdfs:subClassOf enapsodnp:Personnel
Inference and Reasoning
An important realization from this is that it is now not necessary to explicitly define for each person that he or she is a member of the class Personnel, but that this is done implicitly and only once via the central subClassOf definition within the taxonomy. This task is performed by the reasoner. It uses two independent pieces of information, namely "Max is a member of Freelancer" and "Freelancer is a subclass of Personnel", and logically concludes: Max is a member of Personnel. This conclusion is called inference and represents one of the strengths of semantic databases. To efficiently query the knowledge, the database internally generates temporary triples and also manages them. This is also the reason why semantic graph databases require more memory than pure triple stores when inference is used intensively. However, exports can be performed with or without the inferred triples, for example, to convert OWL2 ontologies with reasoning support into simple RDF graphs without losing the automatically generated additional information. You do not have to worry about managing this information; the database with the reasoner takes care of it automatically.
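The subclass inference described here can be sketched as a small forward-chaining loop in Python. The triples are the ones from the listings above; the function name infer_types is hypothetical, and the loop only implements the single rule "member of a subclass implies member of the superclass", not a complete RDFS rule set.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

explicit = {
    ("enapsodnp:Max", RDF_TYPE, "enapsodnp:Person"),
    ("enapsodnp:Max", RDF_TYPE, "enapsodnp:Freelancer"),
    ("enapsodnp:Max", RDF_TYPE, "enapsodnp:Developer"),
    ("enapsodnp:Employee", SUBCLASS_OF, "enapsodnp:Personnel"),
    ("enapsodnp:Freelancer", SUBCLASS_OF, "enapsodnp:Personnel"),
}


def infer_types(triples):
    """Apply the subclass rule until no new triples appear; return only the inferred ones."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in known if p == SUBCLASS_OF]
        type_pairs = [(s, o) for s, p, o in known if p == RDF_TYPE]
        for individual, cls in type_pairs:
            for sub, sup in subclass_pairs:
                candidate = (individual, RDF_TYPE, sup)
                if cls == sub and candidate not in known:
                    known.add(candidate)
                    changed = True
    return known - set(triples)


print(infer_types(explicit))
# {('enapsodnp:Max', 'rdf:type', 'enapsodnp:Personnel')}
```

Returning only the difference between the inferred and the explicit set mirrors the distinction between exports with and without inferred triples.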
Domains and Ranges
Another special feature of semantic graph databases is the implicit classification of individuals using properties. Like all information in a graph, properties are also mapped via RDF triples, for example in the following form:
enapsodnp:EbnerVerlag rdf:type enapsodnp:Company
enapsodnp:SemanticDatabases rdf:type enapsodnp:Document
enapsodnp:EbnerVerlag enapsodnp:publishedArticle enapsodnp:SemanticDatabases
The individual EbnerVerlag is a member of the class Company and the individual SemanticDatabases belongs to the class Document. Suppose there were also the classes Publisher and Article. The property publishedArticle can be used to classify the subject, here EbnerVerlag, as well as the object, here SemanticDatabases. In RDF Schema, the so-called domain and range axioms were introduced for this purpose. Axioms describe facts, so the reasoner also considers them independently of the Open World Assumption. In particular, the range axiom is often confused with a value restriction, but this is not about restrictions or validations, but about the classification of individuals. While the domain axiom controls the classification of the subject, the range axiom determines the classification of the object. Here is an example:
enapsodnp:publishedArticle rdfs:domain enapsodnp:Publisher
enapsodnp:publishedArticle rdfs:range enapsodnp:Article
The first triple states that any subject that uses the property publishedArticle is a Publisher (axiom). The second triple expresses that any object referenced by this property is an Article. Through the following statement, EbnerVerlag is thus implicitly a member of the class Publisher and SemanticDatabases a member of the class Article:
enapsodnp:EbnerVerlag enapsodnp:publishedArticle enapsodnp:SemanticDatabases
Conversely, a query for Publisher also returns EbnerVerlag and a query for Article also SemanticDatabases, although this was not explicitly defined for either of the two individuals. What on the one hand is an extremely useful and welcome feature - after all, this saves many redundant declarations and thus ultimately a lot of maintenance effort - harbors a certain danger on the other hand. If, for example, the statement enapsodnp:EbnerVerlag enapsodnp:publishedArticle enapsodnp:Max is made, the reasoner automatically infers that Max is an Article.
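Both the benefit and the danger of domain and range axioms can be reproduced with a short Python sketch. The dictionaries DOMAIN and RANGE and the function name infer_from_domain_and_range are hypothetical; the sketch only applies the two axioms from the listing above and is not a full reasoner.

```python
RDF_TYPE = "rdf:type"
PUBLISHED = "enapsodnp:publishedArticle"

# The two axioms from the listing above.
DOMAIN = {PUBLISHED: "enapsodnp:Publisher"}
RANGE = {PUBLISHED: "enapsodnp:Article"}


def infer_from_domain_and_range(triples):
    """Classify subjects via rdfs:domain and objects via rdfs:range."""
    inferred = set()
    for subject, predicate, obj in triples:
        if predicate in DOMAIN:
            inferred.add((subject, RDF_TYPE, DOMAIN[predicate]))
        if predicate in RANGE:
            inferred.add((obj, RDF_TYPE, RANGE[predicate]))
    return inferred


statements = [
    ("enapsodnp:EbnerVerlag", PUBLISHED, "enapsodnp:SemanticDatabases"),
    ("enapsodnp:EbnerVerlag", PUBLISHED, "enapsodnp:Max"),  # the careless statement
]

for triple in sorted(infer_from_domain_and_range(statements)):
    print(triple)
# ('enapsodnp:EbnerVerlag', 'rdf:type', 'enapsodnp:Publisher')
# ('enapsodnp:Max', 'rdf:type', 'enapsodnp:Article')
# ('enapsodnp:SemanticDatabases', 'rdf:type', 'enapsodnp:Article')
```

Note that nothing in this logic rejects the second statement: the range axiom classifies, it does not validate.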
So a certain amount of care should be taken here. Since the reasoner does not automatically identify such semantic errors, they are difficult to find and fix later in extensive ontologies. The Shapes Constraint Language (SHACL [7]), which is based on RDF graphs, is a suitable tool to uncover type violations, among other things.
...
Semantic Reusability of Properties
In traditional SQL databases, each table has its own schema and thus also individual designators per column. However, it is not automatically clear whether the same designators actually mean the same thing or different designators actually mean something else. The columns and their data types are independent of each other here, and the few metadata for a column usually only determine whether a value must be present or unique within a table, whether default values should be used, or an automatic increment should be performed.
The disadvantage here is that the actual knowledge about the data is not in the database itself, but in the code of the application that uses this data. And there, the knowledge is difficult to extract and just as difficult to reuse for applications in other programming languages. In semantic databases, RDF graphs, and OWL ontologies, the properties are initially created independently of classes and instances - after all, it is the properties with the metadata that give the contents of a database their actual meaning. Classes subsequently use these properties together, which ensures that fields with the same name here actually always mean the same thing.
This not only avoids incompatibilities and inconsistencies but also significantly simplifies the readability and shared understanding of data between different users. The knowledge about the data (the metadata) is stored in the database itself and can thus be shared and reused independently of code and programming language.
Data Properties
While object properties are descriptions of relationships between individuals, data properties describe certain characteristics of a particular individual. They are comparable to the data fields of an object in OOP or value columns from SQL databases. In an RDF graph, data properties - like all other statements - are represented by triples. Example:
enapsodnp:EbnerVerlag enapsodnp:companyName "Ebner Media Group GmbH & Co. KG"^^xsd:string
Data Types
In OWL2, a variety of data types are available for data properties [8]. The numeric ones include the following:
owl:real
owl:rational
xsd:decimal
xsd:integer
xsd:nonNegativeInteger
xsd:nonPositiveInteger
xsd:positiveInteger
xsd:negativeInteger
xsd:long
xsd:int
xsd:short
xsd:byte
xsd:unsignedLong
xsd:unsignedInt
xsd:unsignedShort
xsd:unsignedByte
xsd:double
xsd:float
The string types include the following variants:
xsd:string
xsd:normalizedString
xsd:token
xsd:language
xsd:Name
xsd:NCName
xsd:NMTOKEN
Boolean, binary, and time types:
xsd:boolean
xsd:hexBinary
xsd:base64Binary
xsd:dateTime
xsd:dateTimeStamp
The character sequence "Hello World" would be "48656c6c6f20576f726c64" in hexBinary format and "SGVsbG8gV29ybGQ=" in base64. The two date/time types are noted in ISO 8601 format [9]. The difference between the two is that for dateTime the specification of the time zone is optional, but for dateTimeStamp it is required.
A valid dateTime value is, for example, 2020-02-15T12:45:00Z, as a dateTimeStamp value it is 2020-02-15T12:45:00-05:00. In OWL, it is also possible to create your own data types. A follow-up article will go into more detail on this topic and the benefits for applications.
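The binary encodings and the time zone rule can be checked with the Python standard library. This is a sketch: the helper name has_time_zone is hypothetical, and it only tests for the presence of a time zone, not for full conformance with the XSD lexical rules.

```python
import base64
from datetime import datetime

text = "Hello World".encode("utf-8")
hex_value = text.hex()                               # hexBinary representation
b64_value = base64.b64encode(text).decode("ascii")   # base64Binary representation

print(hex_value)  # 48656c6c6f20576f726c64
print(b64_value)  # SGVsbG8gV29ybGQ=


def has_time_zone(lexical: str) -> bool:
    """Return True if an ISO 8601 date-time string carries a time zone."""
    # Older Python versions do not accept a trailing 'Z', so it is normalized first.
    parsed = datetime.fromisoformat(lexical.replace("Z", "+00:00"))
    return parsed.tzinfo is not None


print(has_time_zone("2020-02-15T12:45:00-05:00"))  # True: usable as dateTimeStamp
print(has_time_zone("2020-02-15T12:45:00"))        # False: dateTime only
```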
Property Characteristics
A major advantage of semantic databases is the ability to enrich properties with extensive meta information and thus give them great expressiveness that can also be understood by machines in a standardized way. Table 1 shows the supported characteristics and their meaning:
Characteristic | Meaning, Restriction |
---|---|
functional | The property may be used at most once per individual (set restriction for the referenced objects). Example hasMother: A child B can have at most one mother A. |
inverse functional | The referenced object may not be referenced by more than one subject (set restriction for the referencing subjects). Example isMarriedWith: If person A is married to person B, then B cannot also be married to C (at least not in the Christian cultural sphere). |
transitive | If an individual A refers to an individual B and B to C, then A also refers to C. Example hasAncestor: If person A is a descendant of B and B is a descendant of C, then A is also a descendant of C. |
symmetric | If A is in a relationship with B, then B is in the same relationship with A. Example hasRelative: If person A is related to B, then person B is also related to A. |
asymmetric | If A is in a relationship with B, B cannot be in the same relationship with A. Example isFatherOf: If A is the father of B, B cannot simultaneously be the father of A. |
reflexive | Specifies that all individuals with this property always refer to themselves as well. Example knows: A person A always knows himself/herself as well. |
irreflexive | Defines that individuals with this object property cannot refer to themselves. Example hasSibling: A person A cannot have himself/herself as a brother or sister. |
inverse of | Defines that the property in question is the opposite of another property. Example: hasChild is an inverse property of hasParent, likewise hasWife is inverse to hasHusband. |
equivalent to | Specifies that the property in question has the same values for domain and range, for example, as the referenced property. Example: The property child has the same values as hasChild. |
In the full OWL profile [10], data properties are a subclass of object properties. Therefore, an inverse-functional property can also be defined for data properties here. In the OWL-DL profile, object properties and datatype properties are separate, so an inverse-functional property for datatype properties cannot be defined here. The W3C website provides a good overview of many OWL2 axioms and restrictions [11].
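How a reasoner derives new statements from the symmetric, transitive, and inverse characteristics can be sketched in Python. The function name apply_characteristics and the individuals A to D are hypothetical, hasAncestor is used here as an illustrative name for the descendant relationship, and the remaining characteristics from the table, which restrict rather than derive, are not covered.

```python
def apply_characteristics(triples, symmetric=(), transitive=(), inverse_of=None):
    """Derive statements implied by symmetric, transitive, and inverse properties."""
    inverses = dict(inverse_of or {})
    inverses.update({value: key for key, value in list(inverses.items())})  # both directions
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            if p in symmetric:
                new.add((o, p, s))
            if p in inverses:
                new.add((o, inverses[p], s))
            if p in transitive:
                for s2, p2, o2 in closure:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= closure:  # nothing new was derived: fixed point reached
            return closure
        closure |= new


facts = {
    ("A", "hasAncestor", "B"),
    ("B", "hasAncestor", "C"),
    ("A", "hasRelative", "B"),
    ("A", "hasChild", "D"),
}
result = apply_characteristics(
    facts,
    symmetric={"hasRelative"},
    transitive={"hasAncestor"},
    inverse_of={"hasChild": "hasParent"},
)
for triple in sorted(result - facts):
    print(triple)
# ('A', 'hasAncestor', 'C')
# ('B', 'hasRelative', 'A')
# ('D', 'hasParent', 'A')
```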
Property Restrictions
In addition to the characteristics of properties, so-called property restrictions can be defined in OWL2. For data properties, these statements are about number and data type, for object properties about number and class of the referenced individuals. Table 2 shows the supported restrictions.
Restriction Type | Meaning |
---|---|
some | There must be at least one property of the specified type. |
only | All properties must be of the specified type, no quantity is specified. |
min (minimum cardinality) | There must be at least the specified number of properties of the specified type. |
max (maximum cardinality) | There may be at most the specified number of properties of the specified type. |
exactly (exact cardinality) | There must be exactly the specified number of properties of the specified type. |
Due to the Open World Assumption, restrictions in OWL work differently than in SQL or NoSQL databases. The reasoner can detect, for example, when the maximum cardinality for a property is exceeded, because in this case there are indeed too many property values. However, due to the OWA, it does not identify falling below a minimum cardinality as an error, because the missing values could be defined in other ontologies. SHACL is better suited for validating individuals against schemas. One of our next articles will report on this in detail.
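The different treatment of minimum and maximum cardinality can be contrasted in a small Python sketch. Everything here is illustrative: the function check_cardinality, the individual erp:invoice_1, and the property props:hasItem are hypothetical, and the closed-world result only mimics the kind of verdict a schema validator would give; it is not SHACL.

```python
def check_cardinality(triples, individual, prop, minimum=0, maximum=None):
    """Compare an open-world and a closed-world verdict for one cardinality restriction."""
    # Simplification: differently named values are treated as distinct individuals.
    count = len({o for s, p, o in triples if s == individual and p == prop})
    too_many = maximum is not None and count > maximum
    too_few = count < minimum
    return {
        "count": count,
        "open_world_violation": too_many,               # missing values may exist elsewhere
        "closed_world_violation": too_many or too_few,  # missing values count as errors
    }


# Hypothetical invoice without any items, checked against "at least one item".
triples = [("erp:invoice_1", "rdf:type", "erp:Invoice")]
print(check_cardinality(triples, "erp:invoice_1", "props:hasItem", minimum=1))
# {'count': 0, 'open_world_violation': False, 'closed_world_violation': True}
```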
Conclusion
Graph databases, with the help of semantics and extensive metadata, contribute to storing and managing knowledge in databases instead of code. This makes knowledge easily understandable, machine-readable, and thus reusable. Reasoners automatically infer new knowledge from existing knowledge, the database in effect learns by machine, and developers declare facts once and centrally instead of redundantly and explicitly, leaving the rest to be implied. This reduces maintenance costs and susceptibility to errors. RDF, RDFS, and OWL offer powerful mechanisms for metadata that give meaning to the contents of a database. According to the motto "Good analyses need good data and good data need good metadata", these technologies contribute to improving the quality of data and to a better understanding of it.
W3C conformity ensures better interoperability, vendor independence, and thus also higher investment security. The next part of this article series will show how the modeling tool Protégé helps to create your own semantic data models and how you can manage your own knowledge graphs with the SPARQL language and Ontotext GraphDB Free.
References
[1] RFC 1738 – Uniform Resource Locators (URL), www.dotnetpro.de/SL2004Semantik1
[2] RFC 3986 – Uniform Resource Identifier (URI), Generic Syntax, www.dotnetpro.de/SL2004Semantik2
[3] RFC 3987 – Internationalized Resource Identifiers (IRIs), www.dotnetpro.de/SL2004Semantik3
[4] W3C, RDF, http://www.w3.org/RDF
[5] W3C, RDF Schema 1.1, www.w3.org/TR/rdf-schema
[6] W3C, OWL 2 Web Ontology Language Document Overview (Second Edition), http://www.dotnetpro.de/SL2004Semantik4
[7] W3C, Shapes Constraint Language (SHACL), http://www.w3.org/TR/shacl
[8] W3C, OWL 2 Web Ontology Language, Structural Specification and Functional-Style Syntax (Second Edition), Datatype Maps, http://www.dotnetpro.de/SL2004Semantik5
[9] ISO 8601 on Wikipedia, http://www.dotnetpro.de/SL2004Semantik6
[10] Language levels Lite, DL, and Full of the Web Ontology Language on Wikipedia, http://www.dotnetpro.de/SL2004Semantik7
[11] W3C, OWL 2 Web Ontology Language, Quick Reference Guide (Second Edition), http://www.dotnetpro.de/SL2004Semantik8