Semantic Web, Part 1 - Working with Knowledge(-Graphs) instead of Data(-Bases)

Semantics enriches data with meaning, and this meaning is stored in semantic graph databases. Graph databases open up new possibilities and perspectives for developers in many ways, and not just when it comes to modeling classes and their properties. In contrast to the well-known SQL and NoSQL databases with their tables and columns, collections and fields, graph databases extend the familiar view of schemas, objects, and inheritance with new features such as metadata, inference, and the linking of multiple graph databases into extensive knowledge graphs. With the appropriate tools, data can be linked into a semantic network. This is not new. Tim Berners-Lee popularized the term "semantic web" as early as 2001. The W3C made it practical in 2004 with standards such as OWL (Web Ontology Language), which was followed by OWL 2 in 2009 (second edition 2012).

What was mainly of scientific interest for a long time and led an isolated existence has now evolved into professional products that have reached a level of maturity in terms of quality, runtime behavior, availability, and scalability that is worth a closer look. The first thing to clarify is in which scenarios the use of a semantic graph database is worthwhile and what advantages the new technologies bring for developers and users. The aim is to bridge the gap between current research and practical development, covering topics such as modern RDF triple stores, new types of classification and validation of objects, a new perspective on data properties and automatic learning, but also interoperability, portability, reusability, harmonization, and data quality. This will require several articles, and by the end of this series, you will be able to create a semantic data model with classes and properties and create, read, update, and delete (CRUD) instances in an ontology using the SPARQL query language from Node.js.

Knowledge Graphs

Semantic graph databases or ontologies essentially consist of a class hierarchy, a taxonomy (also referred to as T-Box), and the so-called individuals, i.e., the instances of the classes (also referred to as A-Box). In addition, there are data properties with concrete values of various data types, the so-called literals, and the object properties with references to other individuals, the actual links. All these entities are also referred to as resources. Within a graph, all resources are identified using so-called URIs or IRIs; IRI stands for Internationalized Resource Identifier, more on this in the next section. URIs/IRIs are comparable to primary keys from the world of tables and collections but are unique across the entire graph and not just related to the instances of a particular class.

The semantic web is also referred to as the "web of data". One of its paradigms is that not only resources within an ontology can be linked to each other, but also any ontologies among themselves, including the resources contained therein. The combination of multiple ontologies, usually in hierarchical structures, is referred to as a knowledge graph. A special feature of semantic graph databases is the "reasoner", a rule engine within the database itself with various standardized but also configurable rule sets. On the one hand, the reasoner performs inference: it automatically draws logical conclusions from already existing statements and makes them available as new knowledge. On the other hand, it validates and checks for consistency, for example, to identify contradictions or rule violations, and thus also helps to ensure data quality.

Terms and Abbreviations

Anyone dealing with the semantic web comes across a number of new terms and abbreviations, the knowledge of which is important for understanding models, relations, and functions.

First, a brief summary of the differences between URL, URI, and IRI. A Uniform Resource Locator (URL) specifies the location of a specific resource, so it is used for localization. This includes addresses, i.e., references to content that may change constantly, such as an HTML page or a database. According to RFC 1738 [1] from 1994, only a limited subset of the US-ASCII characters may be used unencoded in a URL, which represents a significant restriction for internationalized applications today. The term base URL is used to divide an address with a path into several sub-addresses, for example, http://my.baseurl.com/cat1/func2. The https scheme instead of http for the URL has a special meaning because it causes the transfer to be encrypted, usually between the web server and the browser.

The Uniform Resource Identifier (URI) is used to uniquely identify specific resources globally. On the web, these are, for example, specific files or interfaces to services. They consist of a URL followed by a file or service name, optionally with additional query parameters or hash values, such as http://my.domain/db/customers?id=7 or http://my.domain/db/product#46510; URLs are therefore a subset of URIs. According to RFC 3986 [2], URIs are subject to the same character set restrictions as URLs.

The third in the group is the IRI, the Internationalized Resource Identifier. It was standardized by the IETF (Internet Engineering Task Force) in 2005 to meet modern requirements for global use. It serves the same purpose as the URI, namely to uniquely identify a specific resource worldwide, but without the restriction to the ASCII character set. According to RFC 3987, nearly all Unicode characters are allowed for IRIs, with a few exceptions [3]. In the context of the semantic web, IRIs are composed of a namespace, a separator, and a so-called local name. The # character has established itself as the de facto standard separator, even though a few others are still allowed. Analogous to URL and URI, the namespace should actually be called IRL, i.e., Internationalized Resource Locator, but this term has never caught on. A valid IRI for a resource in an ontology is, for example: http://ont.enapso.com/erp#product_1

The namespace here is http://ont.enapso.com/erp# and the local name is product_1, referred to simply as name or identifier in the further course of this article. In contrast to URLs, the https scheme is not only uncommon for IRIs, but due to the purely identificational character, it also does not cause any encryption - neither of the transport nor of the content.

Namespaces and Domains

For the interoperability of ontologies, it is considered good practice to use only those namespaces for which you are also the official domain owner. In the example ontology for this text, it could be, for example, http://ont.enapso.com/[...]. Although the validity of the namespaces is currently not checked by name servers (DNS) or by an official institution when publishing ontologies, you risk errors in applications or with users by using namespaces such as http://foo.bar/ because the resources can no longer be uniquely referenced globally and can therefore not be used in internationally composed knowledge graphs.

Names with Prefixes

If resources in RDF/OWL files or in SPARQL commands were always referenced by their full IRI (syntactically, surrounding < and > characters are also required), statements would quickly become confusing and difficult to read. Consider the following triple (see the "RDF" section):

<http://ont.enapso.com/erp#invoice_1> <http://ont.enapso.com/props#hasProduct> <http://ont.enapso.com/erp#product_1>

To abbreviate this cumbersome notation, so-called prefixes were introduced for the files as well as for SPARQL. These separate the IRI into an abbreviation, the prefix, followed by a colon and the actual identifier of the resource. These are declared at the beginning of an RDF/OWL file or a SPARQL command and can then be used as a placeholder for their long version:

  • prefix erp: <http://ont.enapso.com/erp#>

  • prefix props: <http://ont.enapso.com/props#>

Triples can then be easily noted in a significantly more readable short form:

erp:invoice_1 props:hasProduct erp:product_1

Within the framework of RDF, OWL, and SPARQL, this notation is referred to as a prefixed name.
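The mapping between prefixed names and full IRIs can be illustrated with a minimal Python sketch. The helper functions expand and compact are hypothetical names chosen for this illustration and are not part of any RDF library; only the two prefix declarations come from the example above.

```python
# Prefix declarations taken from the article's example.
PREFIXES = {
    "erp": "http://ont.enapso.com/erp#",
    "props": "http://ont.enapso.com/props#",
}

def expand(prefixed_name: str, prefixes: dict = PREFIXES) -> str:
    """Expand 'prefix:local' to a full IRI in angle brackets (hypothetical helper)."""
    prefix, local_name = prefixed_name.split(":", 1)
    return f"<{prefixes[prefix]}{local_name}>"

def compact(iri: str, prefixes: dict = PREFIXES) -> str:
    """Shorten a full IRI to a prefixed name if a declared namespace matches."""
    bare = iri.strip("<>")
    for prefix, namespace in prefixes.items():
        if bare.startswith(namespace):
            return f"{prefix}:{bare[len(namespace):]}"
    return iri  # no matching namespace: leave the IRI unchanged

triple = ("erp:invoice_1", "props:hasProduct", "erp:product_1")
full_triple = tuple(expand(term) for term in triple)
print(" ".join(full_triple))
```

Real parsers for Turtle or SPARQL perform this expansion internally; the sketch only shows that a prefixed name is a purely textual abbreviation of the full IRI.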

Primary Keys and Relationships

In addition to the conventions for namespaces, there are other proven practices for IRIs. Since each resource has its own database-wide and internationally unique IRI, no additional primary keys are required in a graph; so no special ID fields need to be provided, for example, in the form of data properties. The existing IRIs are completely sufficient for this - de facto, these are the global primary keys. The same applies to relationships and foreign keys. In a 1:1 or 1:n relationship, one or more object properties of a master individual simply reference its child individuals using the child IRIs. The following example shows a 1:n relationship between invoices and products, represented here in so-called prefixed names notation:

  • erp:invoice_1 props:hasProduct erp:product_1

  • erp:invoice_1 props:hasProduct erp:product_2

  • erp:invoice_1 props:hasProduct erp:product_3

The advantage of a graph is that even n:m relationships no longer require additional auxiliary tables or objects, as known from classic relational database architectures, since the same object property can be used multiple times by an individual. Users and roles, for example, can be related to each other in an RDF triple store as follows:

  • erp:user_1 props:hasRole erp:role_1

  • erp:user_1 props:hasRole erp:role_2

  • erp:user_2 props:hasRole erp:role_1

  • erp:user_2 props:hasRole erp:role_2
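To illustrate why no auxiliary table is needed, the following Python sketch keeps the four statements above in a plain set and queries them in both directions. The match function is a hypothetical helper for this illustration; a real triple store would answer the same questions with SPARQL.

```python
# A tiny in-memory triple set; None acts as a wildcard in match().
TRIPLES = {
    ("erp:user_1", "props:hasRole", "erp:role_1"),
    ("erp:user_1", "props:hasRole", "erp:role_2"),
    ("erp:user_2", "props:hasRole", "erp:role_1"),
    ("erp:user_2", "props:hasRole", "erp:role_2"),
}

def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern, sorted for stable output."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Forward direction: which roles does user_1 have?
roles_of_user_1 = [o for _, _, o in match(TRIPLES, s="erp:user_1", p="props:hasRole")]
# Backward direction: which users have role_2?
users_with_role_2 = [s for s, _, _ in match(TRIPLES, p="props:hasRole", o="erp:role_2")]
print(roles_of_user_1)
print(users_with_role_2)
```

Both directions of the n:m relationship are answered from the same set of statements, without a join table.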

RDF

RDF is the abbreviation for Resource Description Framework [4], a modeling concept standardized by the World Wide Web Consortium (W3C) for the semantic web that enables simple logical statements using triples of subject, predicate, and object, which can be formulated in directed graphs and easily read, understood, and visualized by machines.

Examples: Max hasAge 32 (data property) or Josef hasSpouse Maria (object property).

A database that manages a collection of triples is referred to as a triple store. At the lowest level, a semantic graph database is such a triple store. How the triples are managed internally as efficiently as possible is up to the database vendor. For import into and export from files, the RDF triples are represented in various formats. These can be, for example, Turtle, TriG, N-Triples, JSON-LD, or RDF/XML, each with its own advantages and disadvantages. While Turtle (.ttl) is popular because of its compactness and easy readability, TriG (.trig) is more suitable for backup and restore of knowledge graphs since the information about subgraphs is also serialized and deserialized. As expected, JSON-LD is increasingly found in JavaScript environments and RDF/XML more in the Java world. However, in the spirit of interoperability, semantic graph databases such as GraphDB from Ontotext have import and export functions for all important formats.
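As an illustration of the simplest of these formats, the following Python sketch writes the two example statements (Max hasAge 32, Josef hasSpouse Maria) as N-Triples lines. The namespace http://example.org/dnp# is a placeholder assumed for this illustration and does not come from the article; the sketch also omits the escaping of special characters that a complete serializer would need.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
NS = "http://example.org/dnp#"  # hypothetical namespace, assumed for this sketch

def iri(local_name):
    """Build a full IRI in angle brackets from a local name."""
    return f"<{NS}{local_name}>"

def literal(value, datatype):
    """Build a typed literal; escaping of quotes and backslashes is omitted."""
    return f'"{value}"^^<{XSD}{datatype}>'

def to_ntriples(triples):
    """One statement per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

triples = [
    (iri("Max"), iri("hasAge"), literal(32, "integer")),      # data property
    (iri("Josef"), iri("hasSpouse"), iri("Maria")),           # object property
]
document = to_ntriples(triples)
print(document)
```

N-Triples has no prefixes and no abbreviations, which makes it verbose but trivial to generate and to process line by line.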

RDFS

The abbreviation RDFS stands for Resource Description Framework Schema [5], a semantic extension of the RDF vocabulary specifically for data modeling. It contains mechanisms especially for grouping resources and their relationships to each other. It is comparable to the class system known from object-oriented programming (OOP), but with an essential and important extension: While in OOP it is defined which properties a class has and may have, an RDF schema can specify for properties to which class an individual is automatically assigned if it contains or references the property in question. This includes support for subClassOf and subPropertyOf to organize classes and properties hierarchically. RDFS plus, finally, is an extended version of RDFS that supports symmetric, inverse, and transitive properties. These new concepts, in contrast to OOP, will be explained in more detail below. Many of the RDFS and RDFS plus components are also part of the even more expressive Web Ontology Language (OWL).

OWL

The abbreviation OWL stands for Web Ontology Language [6]. The language was specially designed for the semantic web to represent knowledge about objects and classes (as groups of objects) and their relationships to each other. Ontologies are based on RDF and OWL and can be read and modified with SPARQL as a query language. Reasoners support OWL.

Class Trees

The taxonomy is an essential part of a semantic database model. It corresponds more to the hierarchical class model from object-oriented programming than to the schemas of traditional SQL and NoSQL databases for tables or collections. The classes in the taxonomy are organized in a tree topology, and they can also inherit their properties.

OOP vs. Graphs - or the Inadequacy of Tree Topologies

However, with conventional class trees, it is impractical, if not impossible, to map complex real-world environments. Although they are suitable for implementing inheritance vertically, horizontal aspects, i.e., those that affect several subtrees, cannot be mapped in tree topologies. Take the employees of a company as an example. In a traditional class tree, Person might be at the top, followed by Employee and Freelancer, and under Employee there might be Developer, Architect, and Manager.

This already shows the first problem: Because even if a freelancer might not be able to become a manager, why should he not also be a developer or software architect? Why not an architect and manager at the same time? Generalized, contract types are mixed here vertically with roles in a company horizontally (Figure 1). What can still be solved with class designators with multiple meanings in small models, quickly escalates into semantic chaos when the model grows. To address this problem, in classical OOP you now add fields such as Role. But where?

For the Person class, it is too specific; for Employee and Freelancer, it is already redundant, which increases the code and maintenance effort. Alternatively, you add another helper class called Personnel in between. Semantically, this is ambiguous and means additional code as well. But not only that: here again, two concepts are mixed, namely classification and properties. It would be more straightforward if an individual could be a member of multiple classes simultaneously, in the example, for instance, descend from the superclass Person and at the same time be a member of a role class and a contract type class. Semantic databases support exactly that.

Solution of Multiple Classification

In OOP, classes are primarily used as schemas, i.e., as a kind of specification for instances. From this perspective, it is naturally difficult to mix schemas with any overlaps or contradictions. The same applies to the schemas of traditional relational databases. In ontologies, on the other hand, classes are primarily considered as groups of individuals. Individuals can be explicitly assigned to one or more classes or also implicitly by the use of certain properties.

In the following, therefore, it is better to talk about multiple classification than multiple inheritance. A good practice for the above example would first be to introduce only one class Person; then a class Role and below it the subclasses Developer, Architect, and Manager. In addition, there is the class Personnel and below it the subclasses Employee and Freelancer. Using this construct, a particular person, more precisely an individual of the class Person, can now be a member of the classes Person and either the class Employee (exclusively) or Freelancer and additionally a member of one or more role classes, i.e., Developer, Architect, or Manager, see Figure 2.

A single individual can thus be a member of many classes. This model can now be extended as desired. Think, for example, of the class WorkTimeModel with the subclasses FullTimeWorker and PartTimeWorker, which can be completely independent of role and contract type. In addition to easier modeling, another advantage is that all members of a particular class can also be queried just as easily with the query language SPARQL. This makes queries very clear and also easy to read.

Disjoint Classes

Now it often happens in practice that a single person holds multiple roles at the same time. In contrast, however, he or she normally cannot be an Employee and Freelancer at the same time. Therefore, these two classes are also referred to as "disjoint". This states that an individual cannot be a member of these two classes at the same time. The attempt to specify this would be recognized by the reasoner as a rule violation and identified as an inconsistency in the database.
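The kind of check a reasoner performs for disjoint classes can be sketched in a few lines of Python. This is a simplified illustration, not how a real reasoner is implemented; the individual dnp:Anna and her class memberships are hypothetical and deliberately faulty.

```python
from itertools import combinations

# Explicit class memberships per individual.
# dnp:Anna is a hypothetical, deliberately inconsistent example.
TYPES = {
    "dnp:Max": {"dnp:Person", "dnp:Freelancer", "dnp:Developer", "dnp:Architect"},
    "dnp:Anna": {"dnp:Person", "dnp:Employee", "dnp:Freelancer"},
}

# Pairs of classes declared disjoint.
DISJOINT = {frozenset({"dnp:Employee", "dnp:Freelancer"})}

def find_disjoint_violations(types, disjoint_pairs):
    """Report every individual that is a member of two disjoint classes."""
    violations = []
    for individual, classes in sorted(types.items()):
        for a, b in combinations(sorted(classes), 2):
            if frozenset({a, b}) in disjoint_pairs:
                violations.append((individual, a, b))
    return violations

violations = find_disjoint_violations(TYPES, DISJOINT)
print(violations)
```

Max holds two roles at once without any conflict, whereas the combination of Employee and Freelancer is reported as an inconsistency.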

Schemas

A developer from the OOP world would initially be tempted to view classes in an ontology as a schema. And indeed, a graph database can also be used in this sense. For each class, data and object properties are specified, and all individuals of this class can then receive values or IRI references for these properties. The emphasis here is on "can", because for ontologies the Open World Assumption (OWA) applies. This states that everything that is not explicitly specified is simply unknown, but not necessarily false.

This means: The reasoner will not identify the absence of, for example, a value for an individual in an ontology as a rule violation, because in the sense of OWA it does not mean that the missing value might not be defined in another referenced ontology. Theoretically, in a graph database, all individuals can use any properties. Also, individuals do not necessarily have to have one or more classifications. In other words: A graph database is initially schemaless. Nevertheless, the assignment of an individual to a class is, of course, useful, for example, to distinguish different types of individuals such as customers, products, or invoices in an ERP application.

Explicit and Implicit Classification

An explicit assignment of an individual to one or more classes is done via type statements. Here, corresponding statements per individual determine its memberships:

  • dnp:Max rdf:type dnp:Person

  • dnp:Max rdf:type dnp:Freelancer

  • dnp:Max rdf:type dnp:Developer

For graph databases that hold many instances without extensive semantics and that must be processed with simple CRUD operations (Create, Read, Update, and Delete) even without a reasoner, this is already a sufficient and practicable approach. A great advantage of this type of modeling is that, referring to the above example, all persons, or alternatively all freelancers or all developers, can be queried very easily with a single command. As a further advantage, you can also query all individuals of the class Personnel. For this, however, you need a reasoner with RDFS support. RDFS provides the property subClassOf and thus allows the creation of class hierarchies or taxonomies:

  • dnp:Employee rdfs:subClassOf dnp:Personnel

  • dnp:Freelancer rdfs:subClassOf dnp:Personnel

Inference and Reasoning

An important realization from this is that it is not necessary to explicitly define for each person that he or she is a member of the class Personnel; this is done implicitly and only once via the central subClassOf definition within the taxonomy. This task is performed by the reasoner. It uses two independent pieces of information, namely "Max is a member of Freelancer" and "Freelancer is a subclass of Personnel", and logically concludes: Max is a member of Personnel. This conclusion is called inference and represents one of the strengths of semantic databases. To query this knowledge efficiently, the database internally generates temporary triples and also manages them. This is also the reason why semantic graph databases require more memory than pure triple stores when inference is used intensively. However, exports can be performed with or without the inferred triples, for example, to convert OWL2 ontologies with reasoning support into simple RDF graphs without losing the automatically generated additional information. You do not have to worry about managing this information; the database with the reasoner takes care of it automatically.
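The subclass inference described here can be sketched as a simple forward-chaining loop in Python. This is a minimal illustration under the assumption that only the rule "x is a member of C and C is a subclass of D implies x is a member of D" is applied; production reasoners use far more rules and optimized data structures. The triples are the type and subClassOf statements from the examples above.

```python
TRIPLES = {
    ("dnp:Max", "rdf:type", "dnp:Person"),
    ("dnp:Max", "rdf:type", "dnp:Freelancer"),
    ("dnp:Max", "rdf:type", "dnp:Developer"),
    ("dnp:Employee", "rdfs:subClassOf", "dnp:Personnel"),
    ("dnp:Freelancer", "rdfs:subClassOf", "dnp:Personnel"),
}

def infer_types(triples):
    """Apply (x type C) and (C subClassOf D) => (x type D) until nothing changes."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        for x, p, c in list(inferred):
            if p != "rdf:type":
                continue
            for child, parent in subclass_pairs:
                new_triple = (x, "rdf:type", parent)
                if child == c and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred

closure = infer_types(TRIPLES)
new_triples = closure - TRIPLES  # only the implicitly derived statements
print(sorted(new_triples))
```

The difference between closure and TRIPLES corresponds to the temporary triples that the database generates and manages internally.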

Domains and Ranges

Another special feature of semantic graph databases is the implicit classification of individuals using properties. Like all information in a graph, properties are also mapped via RDF triples, for example in the following form:

  • dnp:EbnerVerlag rdf:type dnp:Company

  • dnp:SemanticDatabases rdf:type dnp:Document

  • dnp:EbnerVerlag dnp:publishedArticle dnp:SemanticDatabases

The individual EbnerVerlag is a member of the class Company and the individual SemanticDatabases belongs to the class Document. Suppose there were also the classes Publisher and Article. The property publishedArticle can be used to classify the subject, here EbnerVerlag, as well as the object, here SemanticDatabases. In RDF Schema, the so-called domain and range axioms were introduced for this purpose. Axioms describe facts, so the reasoner also considers them independently of the Open World Assumption. In particular, the range axiom is often confused with a value restriction, but this is not about restrictions or validations, but about the classification of individuals. While the domain axiom controls the classification of the subject, the range axiom determines the classification of the object. Here is an example:

  • dnp:publishedArticle rdfs:domain dnp:Publisher

  • dnp:publishedArticle rdfs:range dnp:Article

The first triple states that any subject that uses the property publishedArticle is a Publisher (axiom). The second triple expresses that any object referenced by this property is an Article. Through the statement, EbnerVerlag is thus implicitly a member of the class Publisher and SemanticDatabases a member of the class Article:

  • dnp:EbnerVerlag dnp:publishedArticle dnp:SemanticDatabases

Conversely, a query for Publisher also returns EbnerVerlag and a query for Article also SemanticDatabases, although this was not explicitly defined for either of the two individuals. What on the one hand is an extremely useful and welcome feature - after all, this saves many redundant declarations and thus ultimately a lot of maintenance effort - harbors a certain danger on the other hand. If, for example, the statement dnp:EbnerVerlag dnp:publishedArticle dnp:Max is made, the reasoner automatically infers that Max is an Article.
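The effect of the domain and range axioms, including the pitfall just described, can be reproduced with a short Python sketch. The function name infer_from_domain_and_range is a hypothetical helper for this illustration; it applies only these two rules and nothing else a reasoner would do.

```python
DATA = [
    ("dnp:EbnerVerlag", "dnp:publishedArticle", "dnp:SemanticDatabases"),
    ("dnp:EbnerVerlag", "dnp:publishedArticle", "dnp:Max"),  # the careless statement
]
DOMAIN = {"dnp:publishedArticle": "dnp:Publisher"}  # classifies the subject
RANGE = {"dnp:publishedArticle": "dnp:Article"}     # classifies the object

def infer_from_domain_and_range(data, domain, range_):
    """Derive rdf:type statements for subjects (domain) and objects (range)."""
    inferred = set()
    for s, p, o in data:
        if p in domain:
            inferred.add((s, "rdf:type", domain[p]))
        if p in range_:
            inferred.add((o, "rdf:type", range_[p]))
    return inferred

inferred = infer_from_domain_and_range(DATA, DOMAIN, RANGE)
for triple in sorted(inferred):
    print(*triple)
```

The sketch raises no error for the second statement; it simply classifies Max as an Article, which is exactly the behavior that makes such mistakes hard to spot.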

So a certain amount of care should be taken here. Since the reasoner does not automatically identify such semantic errors, they are difficult to find and fix later in extensive ontologies. The Shapes Constraint Language (SHACL [7]), which validates RDF graphs, is a suitable tool to uncover type violations, among other things.

Semantic Reusability of Properties

In traditional SQL databases, each table has its own schema and thus also individual designators per column. However, it is not automatically clear whether the same designators actually mean the same thing or different designators actually mean something else. The columns and their data types are independent of each other here, and the few metadata for a column usually only determine whether a value must be present or unique within a table, whether default values should be used, or an automatic increment should be performed.

The disadvantage here is that the actual knowledge about the data is not in the database itself, but in the code of the application that uses this data. And there, the knowledge is difficult to extract and just as difficult to reuse for applications in other programming languages. In semantic databases, RDF graphs, and OWL ontologies, the properties are initially created independently of classes and instances - after all, it is the properties with the metadata that give the contents of a database their actual meaning. Classes subsequently use these properties together, which ensures that fields with the same name here actually always mean the same thing.

This not only avoids incompatibilities and inconsistencies but also significantly simplifies the readability and shared understanding of data between different users. The knowledge about the data (the metadata) is stored in the database itself and can thus be shared and reused independently of code and programming language.

Data Properties

While object properties are descriptions of relationships between individuals, data properties describe certain characteristics of a particular individual. They are comparable to the data fields of an object in OOP or value columns from SQL databases. In an RDF graph, data properties - like all other statements - are represented by triples. Example:

  • dnp:EbnerVerlag dnp:companyName "Ebner Media Group GmbH & Co. KG"^^xsd:string

Data Types

In OWL2, a variety of data types are available for data properties [8]. The numeric ones include the following:

  • owl:real

  • owl:rational

  • xsd:decimal

  • xsd:integer

  • xsd:nonNegativeInteger

  • xsd:nonPositiveInteger

  • xsd:positiveInteger

  • xsd:negativeInteger

  • xsd:long

  • xsd:int

  • xsd:short

  • xsd:byte

  • xsd:unsignedLong

  • xsd:unsignedInt

  • xsd:unsignedShort

  • xsd:unsignedByte

  • xsd:double

  • xsd:float

The string types include the following variants:

  • xsd:string

  • xsd:normalizedString

  • xsd:token

  • xsd:language

  • xsd:Name

  • xsd:NCName

  • xsd:NMTOKEN

Boolean, binary, and time types:

  • xsd:boolean

  • xsd:hexBinary

  • xsd:base64Binary

  • xsd:dateTime

  • xsd:dateTimeStamp

The character sequence “Hello World” would be “48656c6c6f20576f726c64” in hexBinary format and “SGVsbG8gV29ybGQ=” in base64. The two date/time types are noted in ISO-8601 format [9]. The difference between the two is that for dateTime the specification of the time zone is optional, but for dateTimeStamp it is required.

A valid dateTime value is, for example, 2020-02-15T12:45:00 (without a time zone) or 2020-02-15T12:45:00Z; a valid dateTimeStamp value is 2020-02-15T12:45:00-05:00. In OWL, it is also possible to create your own data types. A follow-up article will go into more detail on this topic and the benefits for applications.
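The lexical forms mentioned above can be reproduced with the Python standard library. The time zone check is a simplified assumption for this illustration (a trailing Z or a +hh:mm / -hh:mm offset) and is not a complete validator for the XSD date/time types.

```python
import base64
import re

text = "Hello World".encode("utf-8")
hex_binary = text.hex()                                  # xsd:hexBinary lexical form
base64_binary = base64.b64encode(text).decode("ascii")   # xsd:base64Binary lexical form

# Simplified check: the value ends in 'Z' or in a +hh:mm / -hh:mm offset.
TIME_ZONE = re.compile(r"(Z|[+-]\d{2}:\d{2})$")

def has_time_zone(lexical_value):
    """True if the value carries a time zone, as xsd:dateTimeStamp requires."""
    return TIME_ZONE.search(lexical_value) is not None

print(hex_binary)
print(base64_binary)
print(has_time_zone("2020-02-15T12:45:00"))
print(has_time_zone("2020-02-15T12:45:00-05:00"))
```

A value without a time zone is acceptable as dateTime but would not qualify as dateTimeStamp.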

Property Characteristics

A major advantage of semantic databases is the ability to enrich properties with extensive meta information and thus give them great expressiveness that can also be understood by machines in a standardized way. Table 1 shows the supported characteristics and their meaning:

| Characteristic | Meaning, Restriction |
| --- | --- |
| functional | The property may be used at most once per individual (set restriction for the referenced objects). Example hasMother: A child B can have at most one mother A. |
| inverse functional | The referenced object may not be referenced by more than one subject (set restriction for the referencing subjects). Example isMarriedWith: If person A is married to person B, then no other person C can also be married to B (at least not in the Christian cultural sphere). |
| transitive | If an individual A refers to an individual B and B to C, then A also refers to C. Example hasAncestor: If person A has the ancestor B and B has the ancestor C, then A also has the ancestor C. |
| symmetric | If A is in a relationship with B, then B is in the same relationship with A. Example hasRelative: If person A is related to B, then person B is also related to A. |
| asymmetric | If A is in a relationship with B, B cannot be in the same relationship with A. Example isFatherOf: If A is the father of B, B cannot simultaneously be the father of A. |
| reflexive | Specifies that all individuals with this property always refer to themselves as well. Example knows: A person A always knows himself/herself as well. |
| irreflexive | Defines that individuals with this object property cannot refer to themselves. Example hasSibling: A person A cannot have himself/herself as a brother or sister. |
| inverse of | Defines that the property in question is the opposite of another property. Example: hasChild is an inverse property of hasParent, likewise hasWife is inverse to hasHusband. |
| equivalent to | Specifies that the property in question relates exactly the same subjects and objects as the referenced property. Example: The property child has the same values as hasChild. |
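Three of these characteristics (symmetric, inverse of, and transitive) lead to new statements that a reasoner can derive. The following Python sketch shows the underlying set operations on pairs of individuals; the function names are hypothetical and the individuals A, B, and C are the placeholders used in the table.

```python
def symmetric_closure(pairs):
    """symmetric: p(A, B) implies p(B, A)."""
    return set(pairs) | {(b, a) for a, b in pairs}

def inverse(pairs):
    """inverse of: p(A, B) implies q(B, A) for the inverse property q."""
    return {(b, a) for a, b in pairs}

def transitive_closure(pairs):
    """transitive: p(A, B) and p(B, C) imply p(A, C), repeated until stable."""
    closure = set(pairs)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new

has_relative = symmetric_closure({("A", "B")})
has_parent = inverse({("A", "B")})  # hasChild(A, B) => hasParent(B, A)
has_ancestor = transitive_closure({("A", "B"), ("B", "C")})
print(sorted(has_relative))
print(sorted(has_parent))
print(sorted(has_ancestor))
```

The remaining characteristics, such as functional or irreflexive, do not add statements but restrict which statements are consistent.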

In OWL Full [10], datatype properties are a subclass of object properties, so a data property can also be declared inverse functional there. In OWL DL, object properties and datatype properties are kept separate, so a datatype property cannot be declared inverse functional. The W3C website provides a good overview of many OWL2 axioms and restrictions [11].

Property Restrictions

In addition to the characteristics of properties, so-called property restrictions can be defined in OWL2. For data properties, these statements are about number and data type, for object properties about number and class of the referenced individuals. Table 2 shows the supported restrictions.

| Restriction Type | Meaning |
| --- | --- |
| some | There must be at least one property of the specified type. |
| only | All properties must be of the specified type, no quantity is specified. |
| min (minimum cardinality) | There must be at least the specified number of properties of the specified type. |
| max (maximum cardinality) | There may be at most the specified number of properties of the specified type. |
| exactly (exact cardinality) | There must be exactly the specified number of properties of the specified type. |

Due to the Open World Assumption, restrictions in OWL work differently than in SQL or NoSQL databases. The reasoner can detect, for example, when a maximum cardinality is exceeded, because in this case there are demonstrably too many values. However, due to the OWA, it does not report falling below a minimum cardinality as an error, because the missing values could be defined in other ontologies. SHACL is better suited for validating individuals against schemas. One of the next articles will cover this in detail.
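The difference between the two views can be made concrete with a deliberately simplified Python sketch. The function check_cardinality and its closed_world switch are hypothetical constructs for this illustration; the sketch assumes that all values are distinct literals and does not reproduce the actual algorithms of an OWL reasoner or a SHACL engine.

```python
def check_cardinality(values, minimum=None, maximum=None, closed_world=False):
    """Return a list of findings for one property of one individual."""
    findings = []
    count = len(set(values))
    if maximum is not None and count > maximum:
        # Too many known values: detectable under both assumptions.
        findings.append("too many values")
    if minimum is not None and count < minimum and closed_world:
        # Only a closed-world validation treats missing values as an error.
        # Under the open world assumption they may be stated elsewhere.
        findings.append("too few values")
    return findings

open_world_result = check_cardinality([], minimum=1)
closed_world_result = check_cardinality([], minimum=1, closed_world=True)
too_many_result = check_cardinality(["a", "b"], maximum=1)
print(open_world_result, closed_world_result, too_many_result)
```

The missing value is reported only in the closed-world mode, which corresponds to the role the text assigns to SHACL.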

Conclusion

Graph databases, with the help of semantics and extensive metadata, make it possible to store and manage knowledge in databases instead of in code. This makes knowledge easily understandable, machine-readable, and thus reusable. Reasoners automatically derive new knowledge from existing knowledge through inference, so the database effectively learns on its own, and developers no longer have to declare everything explicitly and redundantly: they state a fact once, centrally, and leave the rest to be implied. This reduces maintenance costs and susceptibility to errors. RDF, RDFS, and OWL offer powerful mechanisms for metadata that give meaning to the contents of a database. In line with the motto "Good analyses need good data and good data need good metadata", these technologies contribute to improving the quality of data and to a better understanding of it.

W3C conformity ensures better interoperability, vendor independence, and thus also higher investment security. The next part of this article series will show how the modeling tool Protégé helps to create your own semantic data models and how you can manage your own knowledge graphs with the SPARQL language and Ontotext GraphDB.

References

 

(C) Copyright 2014-2024 INNOTRADE GmbH, Herzogenrath, NRW, Germany (all rights reserved)