...
...
Working with Knowledge(-Graphs) instead of Data(-Bases)
Semantics enriches data with meaning, and this meaning is stored in semantic graph databases. Graph databases open up new possibilities and perspectives for developers, and not just when it comes to modeling classes and their properties. In contrast to the well-known SQL and NoSQL databases with their tables and columns, listings and fields, graph databases extend the familiar view of schemas, objects, and inheritance with new features such as metadata, inference, and the linking of multiple graph databases into extensive knowledge graphs. With the appropriate tools, data can be linked into a semantic network. This is not new. Tim Berners-Lee coined the term "semantic web" as early as 2001. The W3C extended it in 2004 with standards such as OWL (Web Ontology Language, extended in 2009 with OWL2, whose second edition followed in 2012) and made it practical.
What was of rather scientific interest for a long time and led an isolated existence has now evolved into professional products whose quality, runtime behavior, availability, and scalability have reached a level of maturity that is worth a closer look. The first questions to clarify are in which scenarios the use of a semantic graph database is worthwhile and what advantages the new technologies bring for developers and users. The aim is to bridge the gap between current research and practical development, covering topics such as modern RDF triple stores, new types of classification and validation of objects, a new perspective on data properties and automatic learning, but also interoperability, portability, reusability, harmonization, and data quality. This will require several articles. By the end of this series, you will be able to create a semantic data model with classes and properties and to create, read, update, and delete (CRUD) instances in an ontology from Node.js using the SPARQL query language.
Knowledge Graphs
Semantic graph databases or ontologies essentially consist of a class hierarchy, a taxonomy (also referred to as T-Box), and the so-called individuals, i.e., the instances of the classes (also referred to as A-Box). In addition, there are data properties with concrete values of various data types, the so-called literals, and the object properties with references to other individuals, the actual links. All these entities are also referred to as resources. Within a graph, all resources are identified using so-called URIs or IRIs; IRI stands for Internationalized Resource Identifier, more on this in the next section. URIs/IRIs are comparable to primary keys from the world of tables and collections but are unique across the entire graph and not just related to the instances of a particular class.
The semantic web is also referred to as the "web of data". One of its paradigms is that not only can resources within an ontology be linked to each other, but entire ontologies can also be linked among themselves, including the resources they contain. The combination of multiple ontologies, usually in hierarchical structures, is referred to as a knowledge graph. A special feature of semantic graph databases is the "reasoner" - a rule engine within the database itself with various standardized but also configurable rule sets. On the one hand, the reasoner performs so-called inference: it automatically draws logical conclusions from already existing statements and makes them available as new knowledge. On the other hand, it validates and checks for consistency, for example, to identify contradictions or rule violations, and thus also helps to ensure data quality.
Terms and Abbreviations
Anyone dealing with the semantic web comes across a number of new terms and abbreviations, the knowledge of which is important for understanding models, relations, and functions.
...
The namespace here is http://ont.enapso.com/erp# and the local name is product_1, referred to simply as the name or identifier in the rest of this article. In contrast to URLs, the https scheme is uncommon for IRIs. Because an IRI serves purely as an identifier, the scheme also does not trigger any encryption - neither of the transport nor of the content.
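The split into namespace and local name can be sketched in a few lines of Python. This is a minimal illustration that assumes the common convention that the namespace ends at the last "#" or, if there is none, at the last "/". The helper name split_iri is hypothetical and not part of any RDF library.

```python
def split_iri(iri: str) -> tuple[str, str]:
    """Split an IRI into (namespace, local name).

    Assumption: the namespace ends at the last '#', or at the last '/'
    if the IRI contains no '#'.
    """
    for separator in ("#", "/"):
        position = iri.rfind(separator)
        if position != -1:
            return iri[: position + 1], iri[position + 1 :]
    return "", iri  # no separator found: treat everything as the local name


namespace, local_name = split_iri("http://ont.enapso.com/erp#product_1")
print(namespace)   # http://ont.enapso.com/erp#
print(local_name)  # product_1
```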
Namespaces and Domains
For the interoperability of ontologies, it is considered good practice to use only namespaces for which you are the official domain owner. For the example ontology in this text, this could be, for example, http://ont.enapso.com/[...]. Although the validity of namespaces is currently not checked by name servers (DNS) or by an official institution when ontologies are published, namespaces such as http://foo.bar/ risk errors in applications or for users, because the resources can no longer be referenced in a globally unique way and therefore cannot be used in internationally composed knowledge graphs.
Names with Prefixes
If resources in RDF/OWL files or in SPARQL commands were always referenced by their full IRI, which syntactically also requires surrounding angle brackets (< and >), the statements would quickly become confusing and difficult to read. Consider the following triple (see the "RDF" section):
...
Within the framework of RDF, OWL, and SPARQL, this notation is referred to as a prefixed name.
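How a prefixed name maps to a full IRI and back can be sketched as follows. The prefix table is an assumption for illustration: it binds the prefix erp to the namespace used in this text, and the function names expand and shorten are hypothetical.

```python
# Assumed prefix table; the erp namespace is the one used in this text.
PREFIXES = {
    "erp": "http://ont.enapso.com/erp#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(prefixed_name: str) -> str:
    """Turn a prefixed name such as 'erp:product_1' into a full IRI."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return PREFIXES[prefix] + local_name


def shorten(iri: str) -> str:
    """Turn a full IRI back into a prefixed name where a prefix is known."""
    for prefix, namespace in PREFIXES.items():
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return f"<{iri}>"  # no matching prefix: keep the full IRI in angle brackets


print(expand("erp:product_1"))                         # http://ont.enapso.com/erp#product_1
print(shorten("http://ont.enapso.com/erp#product_1"))  # erp:product_1
```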
Primary Keys and Relationships
In addition to the conventions for namespaces, there are other proven practices for IRIs. Since each resource has its own database-wide and internationally unique IRI, no additional primary keys are required in a graph, so no special ID fields need to be provided, for example, in the form of data properties. The existing IRIs are completely sufficient for this - de facto, these are the global primary keys. The same applies to relationships and foreign keys. In a 1:1 or 1:n relationship, one or more object properties of a master individual simply reference its child individuals using the child IRIs. The following example shows a 1:n relationship between invoices and products, represented here in prefixed name notation:
...
erp:user_1 props:hasRole erp:role_1
erp:user_1 props:hasRole erp:role_2
erp:user_2 props:hasRole erp:role_1
erp:user_2 props:hasRole erp:role_2
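The four user and role statements above can be explored with a small in-memory sketch. It is not a triple store, only a list of tuples, and the helper names objects and subjects are hypothetical. It shows that the relationship can be followed in both directions purely via IRIs, without a join table or foreign key column.

```python
# The four statements from the text as (subject, predicate, object) tuples.
triples = [
    ("erp:user_1", "props:hasRole", "erp:role_1"),
    ("erp:user_1", "props:hasRole", "erp:role_2"),
    ("erp:user_2", "props:hasRole", "erp:role_1"),
    ("erp:user_2", "props:hasRole", "erp:role_2"),
]


def objects(subject, predicate):
    """All objects linked from a subject via a predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects that link to an object via a predicate."""
    return [s for s, p, o in triples if p == predicate and o == obj]


print(objects("erp:user_1", "props:hasRole"))   # ['erp:role_1', 'erp:role_2']
print(subjects("props:hasRole", "erp:role_1"))  # ['erp:user_1', 'erp:user_2']
```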
RDF
RDF is the abbreviation for Resource Description Framework [4], a modeling concept for the semantic web standardized by the World Wide Web Consortium (W3C). It expresses simple logical statements as triples of subject, predicate, and object. These triples form directed graphs that can easily be read, understood, and visualized by machines.
...
A database that manages a collection of triples is referred to as a triple store. At the lowest level, a semantic graph database is such a triple store. How the triples are managed internally as efficiently as possible is up to the database vendor. For import into and export from files, the RDF triples are represented in various formats. These can be, for example, Turtle, TriG, N-Triples, JSON-LD, or RDF/XML, each with its own advantages and disadvantages. While Turtle (.ttl) is popular because of its compactness and easy readability, TriG (.trig) is more suitable for backup and restore of knowledge graphs since the information about subgraphs is also serialized and deserialized. As expected, JSON-LD is increasingly found in JavaScript environments and RDF/XML more in the Java world. However, in the spirit of interoperability, semantic graph databases such as GraphDB from Ontotext have import and export functions for all important formats.
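To give an impression of the simplest of these formats, the following sketch writes triples as N-Triples, where each statement is one line of full IRIs in angle brackets terminated by a dot. It is a simplified illustration, not a complete serializer: it handles only IRIs and plain string literals, and the property names name and hasProduct as well as the individual invoice_1 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A plain string literal; anything else in a triple is treated as an IRI."""
    value: str


def term(node) -> str:
    """Render one node in N-Triples syntax."""
    if isinstance(node, Literal):
        escaped = (
            node.value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        return f'"{escaped}"'
    return f"<{node}>"


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


ERP = "http://ont.enapso.com/erp#"
triples = [
    (ERP + "product_1", ERP + "name", Literal('Desk "Pro"')),    # hypothetical property
    (ERP + "invoice_1", ERP + "hasProduct", ERP + "product_1"),  # hypothetical property
]
print(to_ntriples(triples))
```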
RDFS
The abbreviation RDFS stands for Resource Description Framework Schema [5], a semantic extension of the RDF vocabulary specifically for data modeling. It provides mechanisms for grouping resources and describing their relationships to each other. It is comparable to the class system known from object-oriented programming (OOP), but with an important extension: while OOP defines which properties a class has and may have, an RDF schema can specify for a property to which class an individual is automatically assigned if it contains or references that property. RDFS also supports subClassOf and subPropertyOf to organize classes and properties hierarchically. RDFS plus, finally, is an extended version of RDFS that supports symmetric, inverse, and transitive properties. These concepts, which are new compared to OOP, will be explained in more detail below. Many of the RDFS and RDFS plus components are also part of the even more expressive Web Ontology Language (OWL).
...
OWL
The abbreviation OWL stands for Web Ontology Language [6]. The language was specially designed for the semantic web to represent knowledge about objects and classes (as groups of objects) and their relationships to each other. Ontologies are based on RDF and OWL and can be read and modified with SPARQL as a query language. Reasoners support OWL.
Class Trees
The taxonomy is an essential part of a semantic database model. It corresponds more to the hierarchical class model from object-oriented programming than to the schemas of traditional SQL and NoSQL databases for tables or collections. The classes in the taxonomy are organized in a tree topology, and they can also inherit their properties.
OOP vs. Graphs - or the Inadequacy of Tree Topologies
However, conventional class trees make it impractical or even impossible to map complex real-world environments. Although they are suitable for implementing inheritance vertically, horizontal aspects, i.e., those that affect several subtrees, cannot be mapped in tree topologies. Take the employees of a company as an example. In a traditional class tree, Person might be at the top, then Employee and Freelancer might follow, and under Employee there might be Developer, Architect, and Manager.
...
For the Person class, it is too specific; for Employee and Freelancer, it is redundant again, which increases the code and maintenance effort. Alternatively, you add another helper class called Personnel in between. Semantically, this is ambiguous and also means additional code. But not only that: here again, two concepts are mixed, namely classification and properties. It would be more straightforward if an individual could be a member of multiple classes simultaneously, in the example, for instance, descending from the superclass Person while also being a member of a role class and a contract type class. Semantic databases support exactly that.
Solution of Multiple Classification
In OOP, classes are primarily used as schemas, i.e., as a kind of specification for instances. From this perspective, it is naturally difficult to mix schemas that overlap or contradict each other. The same applies to the schemas of traditional relational databases. In ontologies, on the other hand, classes are primarily considered as groups of individuals. Individuals can be assigned to one or more classes explicitly, or implicitly through the use of certain properties.
...
A single individual can thus be a member of many classes. This model can now be extended as desired. Think, for example, of the class WorkTimeModel with the subclasses FullTimeWorker and PartTimeWorker, which can be completely independent of role and contract type. In addition to easier modeling, another advantage is that all members of a particular class can be queried just as easily with the SPARQL query language. This keeps queries clear and easy to read.
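The idea of classes as groups of individuals, rather than as schemas, can be sketched with plain Python sets. The individuals Max, Anna, and Tom and the dnp prefix on their names are assumptions for illustration. A real graph database would answer the same question with a SPARQL query over type statements.

```python
from collections import defaultdict

# individual -> set of classes it belongs to (names of individuals are hypothetical)
memberships = defaultdict(set)


def classify(individual, *classes):
    """Assign an individual to any number of classes at once."""
    memberships[individual].update(classes)


def members_of(cls):
    """All individuals that are members of the given class."""
    return sorted(i for i, classes in memberships.items() if cls in classes)


# Contract type, role, and work time model are independent classifications.
classify("dnp:Max", "dnp:Employee", "dnp:Developer", "dnp:FullTimeWorker")
classify("dnp:Anna", "dnp:Freelancer", "dnp:Architect", "dnp:PartTimeWorker")
classify("dnp:Tom", "dnp:Employee", "dnp:Manager", "dnp:PartTimeWorker")

print(members_of("dnp:Employee"))        # ['dnp:Max', 'dnp:Tom']
print(members_of("dnp:PartTimeWorker"))  # ['dnp:Anna', 'dnp:Tom']
```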
...
Disjoint Classes
In practice, a single person often holds multiple roles at the same time. In contrast, he or she normally cannot be an Employee and a Freelancer at the same time. These two classes are therefore declared "disjoint", which states that an individual cannot be a member of both classes at the same time. An attempt to assert this anyway would be recognized by the reasoner as a rule violation and reported as an inconsistency in the database.
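What such a consistency check amounts to can be sketched as follows. This is a simplified stand-in for a reasoner that considers only explicitly stated memberships. The individuals Max and Anna are hypothetical, as is the function name find_disjoint_violations.

```python
from itertools import combinations

# Pairs of classes declared disjoint, as described in the text.
DISJOINT = {frozenset({"dnp:Employee", "dnp:Freelancer"})}


def find_disjoint_violations(memberships):
    """Return (individual, class_a, class_b) for every disjointness violation."""
    violations = []
    for individual, classes in memberships.items():
        for a, b in combinations(sorted(classes), 2):
            if frozenset({a, b}) in DISJOINT:
                violations.append((individual, a, b))
    return violations


memberships = {
    # Several roles at once are fine ...
    "dnp:Max": {"dnp:Employee", "dnp:Developer", "dnp:Manager"},
    # ... but Employee and Freelancer together are inconsistent.
    "dnp:Anna": {"dnp:Employee", "dnp:Freelancer"},
}
print(find_disjoint_violations(memberships))
# [('dnp:Anna', 'dnp:Employee', 'dnp:Freelancer')]
```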
Schemas
A developer from the OOP world would initially be tempted to view classes in an ontology as a schema. And indeed, a graph database can also be used in this sense. For each class, data and object properties are specified, and all individuals of this class can then receive values or IRI references for these properties. The emphasis here is on "can", because for ontologies the Open World Assumption (OWA) applies. This states that everything that is not explicitly specified is simply unknown, but not necessarily false.
This means that the reasoner will not flag the absence of, for example, a value for an individual as a rule violation, because under the OWA the missing value could still be defined in another referenced ontology. Theoretically, in a graph database, all individuals can use any properties. Also, individuals do not necessarily have to have one or more classifications. In other words: a graph database is initially schemaless. Nevertheless, the assignment of an individual to a class is, of course, useful, for example, to distinguish different types of individuals such as customers, products, or invoices in an ERP application.
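The difference between the closed world reading familiar from SQL and the open world reading of ontologies can be made concrete with a small sketch. The customers and the properties email and name are hypothetical. The point is only the different answer for a statement that is not present.

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# Hypothetical ERP statements: only customer_1 has a stated e-mail address.
facts = {
    ("erp:customer_1", "erp:email", "info@example.org"),
    ("erp:customer_2", "erp:name", "ACME"),
}


def has_value(subject, predicate):
    return any(s == subject and p == predicate for s, p, _ in facts)


def ask_closed_world(subject, predicate):
    """SQL-style reading: what is not stored is false."""
    return Answer.TRUE if has_value(subject, predicate) else Answer.FALSE


def ask_open_world(subject, predicate):
    """OWA reading: what is not stated is unknown, not false."""
    return Answer.TRUE if has_value(subject, predicate) else Answer.UNKNOWN


print(ask_closed_world("erp:customer_2", "erp:email"))  # Answer.FALSE
print(ask_open_world("erp:customer_2", "erp:email"))    # Answer.UNKNOWN
```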
Explicit and Implicit Classification
An explicit assignment of an individual to one or more classes is done via type statements. Here, corresponding statements per individual determine its memberships:
...
dnp:Employee rdfs:subClassOf dnp:Personnel
dnp:Freelancer rdfs:subClassOf dnp:Personnel
Inference and Reasoning
An important realization from this is that it is not necessary to explicitly define for each person that he or she is a member of the class Personnel; instead, this is done implicitly and only once via the central subClassOf definition within the taxonomy. This task is performed by the reasoner. It uses two independent pieces of information, namely "Max is a member of Employee" and "Employee is a subclass of Personnel", and logically concludes: Max is a member of Personnel. This conclusion is called inference and represents one of the strengths of semantic databases. To query the knowledge efficiently, the database internally generates temporary triples and manages them. This is also the reason why semantic graph databases require more memory than pure triple stores when inference is used intensively. Exports can be performed with or without the inferred triples, for example, to convert OWL2 ontologies with reasoning support into simple RDF graphs without losing the automatically generated additional information. You do not have to manage this information yourself; the database with its reasoner takes care of it automatically.
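The conclusion about Max can be reproduced with a naive forward-chaining sketch. It applies only two rules, subclass transitivity and membership propagation, and repeats them until nothing new is found. Real reasoners are far more capable and efficient, and the IRI dnp:Max is an assumption for illustration.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer(asserted):
    """Return only the triples that follow from the asserted ones."""
    closure = set(asserted)
    changed = True
    while changed:
        new = set()
        for s, p, o in closure:
            if p != SUBCLASS_OF:
                continue
            for s2, p2, o2 in closure:
                # Transitivity: A subClassOf B, B subClassOf C => A subClassOf C
                if p2 == SUBCLASS_OF and s2 == o:
                    new.add((s, SUBCLASS_OF, o2))
                # Membership: x type A, A subClassOf B => x type B
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        changed = not new <= closure
        closure |= new
    return closure - set(asserted)


asserted = {
    ("dnp:Employee", SUBCLASS_OF, "dnp:Personnel"),
    ("dnp:Freelancer", SUBCLASS_OF, "dnp:Personnel"),
    ("dnp:Max", RDF_TYPE, "dnp:Employee"),
}
print(infer(asserted))  # {('dnp:Max', 'rdf:type', 'dnp:Personnel')}
```

Keeping the inferred triples separate from the asserted ones, as the return value does here, mirrors the option of exporting a graph with or without the inferred statements.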
Domains and Ranges
Another special feature of semantic graph databases is the implicit classification of individuals using properties. Like all information in a graph, properties are also mapped via RDF triples, for example in the following form:
...
So a certain amount of care should be taken here. Since the reasoner does not automatically identify such semantic errors, they are difficult to find and fix later in extensive ontologies. The Shapes Constraint Language (SHACL [7]), which is based on RDF graphs, is a suitable tool to uncover type violations, among other things.
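Why care is needed can be illustrated with a sketch of the two standard RDFS rules for rdfs:domain and rdfs:range. The property hasContract, the class Contract, and all individuals are hypothetical. Note that the second statement, probably a modeling mistake, is not rejected but silently classifies a product as an employee.

```python
def infer_types(triples):
    """Derive rdf:type statements from rdfs:domain and rdfs:range declarations."""
    inferred = set()
    for prop, kind, cls in triples:
        if kind not in ("rdfs:domain", "rdfs:range"):
            continue
        for s, p, o in triples:
            if p == prop:
                # domain classifies the subject, range classifies the object
                target = s if kind == "rdfs:domain" else o
                inferred.add((target, "rdf:type", cls))
    return inferred


triples = [
    ("dnp:hasContract", "rdfs:domain", "dnp:Employee"),
    ("dnp:hasContract", "rdfs:range", "dnp:Contract"),
    ("dnp:Max", "dnp:hasContract", "dnp:contract_1"),
    ("dnp:product_1", "dnp:hasContract", "dnp:contract_2"),  # likely a mistake
]
for triple in sorted(infer_types(triples)):
    print(triple)
```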
Semantic Reusability of Properties
In traditional SQL databases, each table has its own schema and thus also individual designators per column. However, it is not automatically clear whether the same designators actually mean the same thing or different designators actually mean something else. The columns and their data types are independent of each other here, and the few metadata for a column usually only determine whether a value must be present or unique within a table, whether default values should be used, or an automatic increment should be performed.
...
This not only avoids incompatibilities and inconsistencies but also significantly simplifies the readability and shared understanding of data between different users. The knowledge about the data (the metadata) is stored in the database itself and can thus be shared and reused independently of code and programming language.
Data Properties
While object properties are descriptions of relationships between individuals, data properties describe certain characteristics of a particular individual. They are comparable to the data fields of an object in OOP or value columns from SQL databases. In an RDF graph, data properties - like all other statements - are represented by triples. Example:
dnp:EbnerVerlag dnp:companyName "Ebner Media Group GmbH & Co. KG"^^xsd:string
Data Types
In OWL2, a variety of data types are available for data properties [8]. The numeric ones include the following:
...
A valid dateTime value is, for example, 2020-02-15T12:45:00Z. A dateTimeStamp value must always include a time zone, for example 2020-02-15T12:45:00-05:00. In OWL, it is also possible to create your own data types. A follow-up article will go into more detail on this topic and the benefits for applications.
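The difference between the two types can be approximated in Python as follows. This is only a rough check under stated assumptions: datetime.fromisoformat does not implement the exact lexical rules of XML Schema, and the function names are hypothetical. What it does capture is that a dateTimeStamp requires a time zone, while a dateTime may omit it.

```python
from datetime import datetime


def parse_xsd_datetime(text: str) -> datetime:
    """Parse a dateTime-like string; raises ValueError if it cannot be parsed."""
    if "T" not in text:
        raise ValueError("a dateTime needs both a date and a time part")
    # datetime.fromisoformat() accepts a trailing 'Z' only from Python 3.11 on,
    # so it is rewritten to the equivalent '+00:00' offset first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def is_valid_datetime(text: str) -> bool:
    """Approximate check for xsd:dateTime (time zone optional)."""
    try:
        parse_xsd_datetime(text)
        return True
    except ValueError:
        return False


def is_valid_datetimestamp(text: str) -> bool:
    """Approximate check for xsd:dateTimeStamp (time zone required)."""
    try:
        return parse_xsd_datetime(text).tzinfo is not None
    except ValueError:
        return False


print(is_valid_datetime("2020-02-15T12:45:00Z"))                # True
print(is_valid_datetimestamp("2020-02-15T12:45:00-05:00"))      # True
print(is_valid_datetimestamp("2020-02-15T12:45:00"))            # False
```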
Property Characteristics
A major advantage of semantic databases is the ability to enrich properties with extensive meta information and thus give them great expressiveness that can also be understood by machines in a standardized way. Table 1 shows the supported characteristics and their meaning:
...
In the OWL Full profile [10], data properties are a subclass of object properties. Therefore, an inverse-functional property can also be defined for data properties here. In the OWL DL profile, object properties and datatype properties are separate, so an inverse-functional property cannot be defined for datatype properties. The W3C website provides a good overview of many OWL2 axioms and restrictions [11].
Property Restrictions
In addition to the characteristics of properties, so-called property restrictions can be defined in OWL2. For data properties, these statements are about number and data type, for object properties about number and class of the referenced individuals. Table 2 shows the supported restrictions.
...
Due to the Open World Assumption, restrictions in OWL work differently than in SQL or NoSQL databases. The reasoner can detect, for example, when a maximum cardinality for a property is exceeded, because in this case there are demonstrably too many values. However, due to the OWA, it does not treat falling below a minimum cardinality as an error, because the missing values could be defined in other ontologies. SHACL is better suited for validating individuals against schemas. One of the next articles will report on this in detail.
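This asymmetry can be sketched as a small check. It is a simplification that only covers data properties with distinct literal values; for object properties, a reasoner without further information may instead conclude that two differently named individuals are the same. The function name and the result strings are hypothetical.

```python
def check_cardinality(values, minimum=None, maximum=None):
    """Judge a set of distinct literal values under the Open World Assumption."""
    count = len(values)
    if maximum is not None and count > maximum:
        # Too many distinct values are a provable contradiction.
        return "inconsistent"
    if minimum is not None and count < minimum:
        # Too few values prove nothing: more may be stated in another ontology.
        return "consistent (further values may exist elsewhere)"
    return "consistent"


# Two different company names where at most one is allowed.
names = {"Ebner Media Group GmbH & Co. KG", "Ebner Verlag"}
print(check_cardinality(names, maximum=1))  # inconsistent

# No company name at all where at least one is required.
print(check_cardinality(set(), minimum=1))  # consistent (further values may exist elsewhere)
```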
Conclusion
Graph databases, with the help of semantics and extensive metadata, help to store and manage knowledge in databases instead of in code. This makes knowledge easily understandable, machine-readable, and thus reusable. Reasoners automatically infer new knowledge from existing knowledge, so databases learn mechanically, and developers no longer declare facts redundantly and explicitly, but only once and centrally, leaving the rest to be implied. This reduces maintenance costs and susceptibility to errors. RDF, RDFS, and OWL offer powerful mechanisms for metadata that give meaning to the contents of a database. According to the motto "Good analyses need good data and good data need good metadata", these technologies contribute to improving the quality of data and to a better understanding of it.
W3C conformity ensures better interoperability, vendor independence, and thus also higher investment security. The next part of this article series will show how the modeling tool Protégé helps to create your own semantic data models and how you can manage your own knowledge graphs with the SPARQL language and Ontotext GraphDB.
References
[1] RFC 1738 – Uniform Resource Locators (URL), www.dotnetpro.de/SL2004Semantik1
[2] RFC 3986 – Uniform Resource Identifier (URI), Generic Syntax, www.dotnetpro.de/SL2004Semantik2
[3] RFC 3987 – Internationalized Resource Identifiers (IRIs), www.dotnetpro.de/SL2004Semantik3
[4] W3C, RDF, http://www.w3.org/RDF
[5] W3C, RDF Schema 1.1, www.w3.org/TR/rdfschema
[6] W3C, OWL 2 Web Ontology Language Document Overview (Second Edition) http://www.dotnetpro.de/SL2004Semantik4
[7] W3C, Shapes Constraint Language (SHACL), http://www.w3.org/TR/shacl
[8] W3C, OWL 2 Web Ontology Language, Structural Specification and Functional-Style Syntax (Second Edition), Datatype Maps, http://www.dotnetpro.de/SL2004Semantik5
[9] ISO 8601 on Wikipedia, http://www.dotnetpro.de/SL2004Semantik6
[10] Language levels Lite, DL, and Full of the Web Ontology Language on Wikipedia, http://www.dotnetpro.de/SL2004Semantik7
[11] W3C, OWL 2 Web Ontology Language, Quick Reference Guide (Second Edition), http://www.dotnetpro.de/SL2004Semantik8
...