Semantics enriches data with meaning. This meaning is stored in semantic graph databases. Graph databases open up new possibilities and perspectives for developers in many ways, and not just when it comes to modeling, classes, and their properties. In contrast to the well-known SQL and NoSQL databases with their tables and columns, listings and fields, graph databases extend the familiar view of schemas, objects, and inheritance with new features such as metadata, inference, and the linking of multiple graph databases into extensive knowledge graphs. With the appropriate tools, data can be linked into a so-called semantic network. This is not so new. Tim Berners-Lee coined the term "semantic web" as early as 2001. In 2004, the W3C made it practical with standards such as OWL (Web Ontology Language), which was followed by OWL 2 (W3C Recommendation in 2009, second edition in 2012).

What was of rather scientific interest for a long time and led an isolated existence has now evolved into professional products that have reached a level of maturity in terms of quality, runtime behavior, availability, and scalability that is worth a closer look. The first thing to clarify would be in which scenarios the use of a semantic graph database is worthwhile and what advantages the new technologies bring for developers and users. The aim is to bridge the gap between current research and practical development, covering topics such as modern RDF triple stores, new types of classification and validation of objects, a new perspective on data properties and automatic learning, but also on interoperability, portability, reusability, harmonization, and data quality. This will require several articles, and by the end of this series, you will be able to create a semantic data model with classes and properties and create, read, update, and delete (CRUD) instances in an ontology using the SPARQL query language in Node.js instances.

...

Semantic graph databases or ontologies essentially consist of a class hierarchy, a taxonomy (also referred to as T-Box), and the so-called individuals, i.e., the instances of the classes (also referred to as A-Box). In addition, there are data properties with concrete values of various data types, the so-called literals, and the object properties with references to other individuals, the actual links. All these entities are also referred to as resources. Within a graph, all resources are identified using so-called URIs or IRIs; IRI stands for Internationalized Resource Identifier, more on this in the next section. URIs/IRIs are comparable to primary keys from the world of tables and collections but are unique across the entire graph and not just related to the instances of a particular class.

The semantic web is also referred to as the "web of data". One of the paradigms is that not only resources within an ontology can be linked to each other, but also any ontologies among themselves, including the resources contained therein. The combination of multiple ontologies, usually in hierarchical structures, is referred to as a knowledge graph. A special feature in semantic graph databases is the "reasoner" - a rule engine within the database itself with various standardized but also configurable rule sets. On the one hand, the reasoner maps the so-called inference: a function to automatically draw logical conclusions from already existing statements and make them available as new knowledge. On the other hand, it serves to validate and check for consistency, for example, to identify contradictions or rule violations, and thus also to ensure data quality.

...

Anyone dealing with the semantic web comes across a number of new terms and abbreviations, the knowledge of which is important for understanding models, relations, and functions.

First, a brief summary of the differences between URL, URI, and IRI. A Uniform Resource Locator (URL) specifies the location of a specific resource, so it is used for localization. This includes addresses, i.e., references to content that may change constantly, such as an HTML page or a database. According to RFC 1738 [1] from 1994, only a subset of 60 US-ASCII characters is allowed for a URL, which represents a significant restriction for internationalized applications today. The term base URL is used to divide an address with a path into several sub-addresses, for example, http://my.baseurl.com/cat1/func2. The https scheme instead of http for the URL has a special meaning because it causes the transfer to be encrypted, usually between the web server and the browser.

The Uniform Resource Identifier (URI) is used to uniquely identify specific resources globally. On the web, these are, for example, specific files or interfaces to services. They consist of a URL followed by a file or service name, optionally with additional query parameters or hash values, such as http://my.domain/db/customers?id=7 or http://my.domain/db/product#46510; URLs are therefore a subset of URIs. According to RFC 3986 [2], URIs are subject to the same character set restrictions as URLs.
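As a rough illustration of the parts named above, the following sketch splits the two example URIs with `urlsplit` from Python's standard library. The helper name `describe` is invented for this example and is not part of any RDF tooling.

```python
from urllib.parse import urlsplit


def describe(uri):
    """Split a URI into its generic components (scheme, authority, path, query, fragment)."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,        # the part after "?"
        "fragment": parts.fragment,  # the "hash value" after "#"
    }


if __name__ == "__main__":
    for uri in ("http://my.domain/db/customers?id=7",
                "http://my.domain/db/product#46510"):
        print(describe(uri))
```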

Remaining as the third in the group is the IRI, the Internationalized Resource Identifier. It was standardized by the IETF (Internet Engineering Task Force) in 2005 to meet modern requirements for global use. Like the URI, it is used to uniquely identify a specific resource worldwide, but without the restriction to the ASCII character set. According to RFC 3987, almost all Unicode characters are allowed for IRIs, with a few exceptions [3]. In the context of the semantic web, IRIs are composed of a namespace, a separator, and a so-called local name. The # character has established itself as the de facto standard separator, even though a few others are still allowed. Analogous to URL and URI, the namespace should actually be called IRL, i.e., Internationalized Resource Locator, but this term has never caught on. A valid IRI for a resource in an ontology is, for example: http://ont.enapso.com/erp#product_1

The namespace here is http://ont.enapso.com/erp# and the local name is product_1, referred to simply as name or identifier in the further course of this article. In contrast to URLs, the https scheme is not only uncommon for IRIs, but due to the purely identificational character, it also does not cause any encryption - neither of the transport nor of the content.
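The split into namespace and local name can be sketched in a few lines of Python. This is a minimal illustration, assuming that the separator is the last # or, if the IRI contains no #, the last slash; the function name `split_iri` is hypothetical.

```python
def split_iri(iri):
    """Return (namespace, local_name) for an IRI.

    Assumption: the separator is the last '#', or the last '/' if no '#' occurs.
    """
    separator = "#" if "#" in iri else "/"
    namespace, sep, local_name = iri.rpartition(separator)
    if not sep or not local_name:
        raise ValueError(f"cannot split IRI: {iri!r}")
    # The separator belongs to the namespace, as in http://ont.enapso.com/erp#
    return namespace + sep, local_name


if __name__ == "__main__":
    print(split_iri("http://ont.enapso.com/erp#product_1"))
```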

...

For the interoperability of ontologies, it is considered good practice to use only those namespaces for which you are also the official domain owner. In the example ontology for this text, it could be, for example, http://ont.enapso.com/[...]. Although the validity of the namespaces is currently not checked by name servers (DNS) or by an official institution when publishing ontologies, you risk errors in applications or with users by using namespaces such as http://foo.bar/ because the resources can no longer be uniquely referenced globally and can therefore not be used in internationally composed knowledge graphs.

...

If resources in RDF/OWL files or in SPARQL commands were always referenced by their full IRI - syntactically, even additional surrounding <- and >-characters are necessary - they could quickly become confusing and difficult to read. Consider the following triple (see the "RDF" section):

<http://ont.enapso.com/erp#invoice_1> <http://ont.enapso.com/props#hasProduct> <http://ont.enapso.com/erp#product_1>

To abbreviate this cumbersome notation, so-called prefixes were introduced for the files as well as for SPARQL. These separate the IRI into an abbreviation, the prefix, followed by a colon and the actual identifier of the resource. These are declared at the beginning of an RDF/OWL file or a SPARQL command and can then be used as a placeholder for their long version:

...

  • prefix erp: <http://ont.enapso.com/erp#>
  • prefix props: <http://ont.enapso.com/props#>

Triples can then be easily noted in a significantly more readable short form:

erp:invoice_1 props:hasProduct erp:product_1

Within the framework of RDF, OWL, and SPARQL, this notation is referred to as a prefixed name.
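How a prefixed name maps to its full IRI and back can be sketched as follows. The two namespaces are the ones from the example; the function names `expand` and `compact` are made up for this illustration, and real RDF libraries offer their own equivalents.

```python
# Prefix declarations from the example ontology.
PREFIXES = {
    "erp": "http://ont.enapso.com/erp#",
    "props": "http://ont.enapso.com/props#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn a prefixed name such as erp:invoice_1 into its full IRI."""
    prefix, sep, local_name = prefixed_name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return prefixes[prefix] + local_name


def compact(iri, prefixes=PREFIXES):
    """Turn a full IRI into a prefixed name, or into <iri> if no prefix matches."""
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return f"<{iri}>"


if __name__ == "__main__":
    triple = ("erp:invoice_1", "props:hasProduct", "erp:product_1")
    print(" ".join(f"<{expand(term)}>" for term in triple))
```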

...

In addition to the conventions for namespaces, there are other proven practices for IRIs. Since each resource has its own database-wide and internationally unique IRI, no additional primary keys are required in a graph; so no special ID fields need to be provided, for example, in the form of data properties. The existing IRIs are completely sufficient for this - de facto, these are the global primary keys. The same applies to relationships and foreign keys. In a 1:1 or 1:n relationship, one or more object properties of a master individual simply reference its child individuals using the child IRIs. The following example shows a 1:n relationship between invoices and products, represented here in so-called prefixed names notation:

  • erp:invoice_1 props:hasProduct erp:product_1
  • erp:invoice_1 props:hasProduct erp:product_2
  • erp:invoice_1 props:hasProduct erp:product_3

The advantage of a graph is that even n:m relationships no longer require additional auxiliary tables or objects, as known from classic relational database architectures, since the same object property can be used multiple times by an individual. Users and roles, for example, can be related to each other in an RDF triple store as follows:

  • erp:user_1 props:hasRole erp:role_1
  • erp:user_1 props:hasRole erp:role_2
  • erp:user_2 props:hasRole erp:role_1
  • erp:user_2 props:hasRole erp:role_2
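The n:m relationship between users and roles needs no auxiliary structure because each link is simply one more triple. The following sketch models a triple store as a plain Python set of tuples, which is a strong simplification of a real database; the helper names `objects` and `subjects` are invented for this example.

```python
# Each (subject, predicate, object) tuple is one statement in the graph.
triples = {
    ("erp:user_1", "props:hasRole", "erp:role_1"),
    ("erp:user_1", "props:hasRole", "erp:role_2"),
    ("erp:user_2", "props:hasRole", "erp:role_1"),
    ("erp:user_2", "props:hasRole", "erp:role_2"),
}


def objects(store, subject, predicate):
    """All objects linked from a subject via a predicate (e.g. the roles of a user)."""
    return sorted(o for s, p, o in store if s == subject and p == predicate)


def subjects(store, predicate, obj):
    """All subjects linking to an object via a predicate (e.g. the users of a role)."""
    return sorted(s for s, p, o in store if p == predicate and o == obj)


if __name__ == "__main__":
    print(objects(triples, "erp:user_1", "props:hasRole"))
    print(subjects(triples, "props:hasRole", "erp:role_2"))
```

Both directions of the relationship are answered from the same four statements, without a join table.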

RDF

RDF is the abbreviation for Resource Description Framework [4], a modeling concept standardized by the World Wide Web Consortium (W3C) for the semantic web. It enables simple logical statements in the form of triples of subject, predicate, and object, which form directed graphs and can easily be read, processed, and visualized by machines. Examples: Max hasAge 32 (data property) or Josef hasSpouse Maria (object property). A collection of triples is also referred to as a triple store. At the lowest level, a semantic graph database is such a triple store. How the triples are managed internally as efficiently as possible is up to the database vendor. For import into and export from files, the RDF triples are represented in various formats. These can be, for example, Turtle, TriG, N-Triples, JSON-LD, or even RDF/XML, with their respective advantages and disadvantages. While Turtle (.ttl) is popular because of its compactness and easy readability, TriG (.trig) is more suitable for backup and restore of knowledge graphs since the information about subgraphs is also serialized and deserialized. As expected, JSON-LD is increasingly found in JavaScript environments and RDF/XML more in the Java world. However, in the spirit of interoperability, semantic graph databases such as GraphDB from Ontotext have import and export functions for all important formats.
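To make the triple structure tangible, the following sketch writes the two example statements in N-Triples, the simplest of the formats listed. The namespace http://example.org/people# is a placeholder chosen for this illustration only, and the helpers `iri`, `literal`, and `to_ntriples` are hypothetical; a production system would use an RDF library instead.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/people#"  # placeholder namespace, not from the article


def iri(value):
    """N-Triples writes every IRI in full, enclosed in angle brackets."""
    return f"<{value}>"


def literal(value, datatype):
    """A typed literal: quoted lexical form followed by ^^ and the datatype IRI."""
    escaped = (str(value).replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"^^<{XSD}{datatype}>'


def to_ntriples(triples):
    """One statement per line, terminated by ' .'"""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


statements = [
    # Data property: the object is a literal value.
    (iri(EX + "Max"), iri(EX + "hasAge"), literal(32, "integer")),
    # Object property: the object is another individual.
    (iri(EX + "Josef"), iri(EX + "hasSpouse"), iri(EX + "Maria")),
]

if __name__ == "__main__":
    print(to_ntriples(statements))
```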

...

An explicit assignment of an individual to one or more classes is done via type statements. Here, corresponding statements per individual determine its memberships:

  • enapso:Max rdf:type enapso:Person
  • enapso:Max rdf:type enapso:Freelancer
  • enapso:Max rdf:type enapso:Developer

For graph databases with many instances but without extensive semantics, which are to be processed with simple CRUD operations (Create, Read, Update, and Delete) even without a reasoner, this is already a sufficient and practicable approach. A great advantage of this type of modeling is that, referring to the above example, all persons, but alternatively also all freelancers or all developers, can now be queried very easily with a single command. As a further advantage, you can also query all individuals of the class Personnel. For this, however, you need the reasoner with RDFS support. RDFS provides the property rdfs:subClassOf and thus allows the creation of class hierarchies or taxonomies:

  • enapso:Employee rdfs:subClassOf enapso:Personnel
  • enapso:Freelancer rdfs:subClassOf enapso:Personnel

An important realization from this is that it is now not necessary to explicitly define for each person that he or she is a member of the class Personnel, but that this is done implicitly and only once via the central subClassOf definition within the taxonomy. This task is performed by the reasoner. It uses two independent pieces of information, namely "Max is a member of Freelancer" and "Freelancer is a subclass of Personnel", and logically concludes: Max is a member of Personnel. This conclusion is called inference and represents one of the strengths of semantic databases. To efficiently query the knowledge, the database internally generates temporary triples and also manages them. This is also the reason why semantic graph databases require more memory than pure triple stores when inference is used intensively. However, exports can be performed with or without the inferred triples, for example, to convert OWL2 ontologies with reasoning support into simple RDF graphs without losing the automatically generated additional information. You do not have to worry about managing this information yourself; the database with the reasoner takes care of it automatically.
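What the reasoner does here can be imitated with a naive loop. The sketch below applies the two RDFS entailment rules for class hierarchies (known as rdfs9 and rdfs11 in the RDF semantics) until nothing new appears. It works on prefixed names as plain strings and is only meant to show the principle; the function name `infer_types` is invented, and a real reasoner is far more efficient.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

explicit = {
    ("enapso:Max", RDF_TYPE, "enapso:Person"),
    ("enapso:Max", RDF_TYPE, "enapso:Freelancer"),
    ("enapso:Max", RDF_TYPE, "enapso:Developer"),
    ("enapso:Employee", SUBCLASS_OF, "enapso:Personnel"),
    ("enapso:Freelancer", SUBCLASS_OF, "enapso:Personnel"),
}


def infer_types(triples):
    """Return the input triples plus everything derivable via subClassOf."""
    closure = set(triples)
    while True:
        new = set()
        for sub, p, sup in closure:
            if p != SUBCLASS_OF:
                continue
            for s2, p2, o2 in closure:
                # rdfs11: subClassOf is transitive.
                if p2 == SUBCLASS_OF and s2 == sup:
                    new.add((sub, SUBCLASS_OF, o2))
                # rdfs9: a member of the subclass is a member of the superclass.
                if p2 == RDF_TYPE and o2 == sub:
                    new.add((s2, RDF_TYPE, sup))
        if new <= closure:  # fixpoint reached, nothing left to derive
            return closure
        closure |= new


if __name__ == "__main__":
    for triple in sorted(infer_types(explicit) - explicit):
        print("inferred:", *triple)
```

The difference between the closure and the explicit statements corresponds to the temporary triples the database manages internally.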

...

Another special feature of semantic graph databases is the implicit classification of individuals using properties. Like all information in a graph, properties are also mapped via RDF triples, for example in the following form:

  • enapso:EbnerVerlag rdf:type enapso:Company
  • enapso:SemanticDatabases rdf:type enapso:Document
  • enapso:EbnerVerlag enapso:publishedArticle enapso:SemanticDatabases

The individual EbnerVerlag is a member of the class Company and the individual SemanticDatabases belongs to the class Document. Suppose there were also the classes Publisher and Article. The property publishedArticle can be used to classify the subject, here EbnerVerlag, as well as the object, here SemanticDatabases. In RDF Schema, the so-called domain and range axioms were introduced for this purpose. Axioms describe facts, so the reasoner also considers them independently of the Open World Assumption. In particular, the range axiom is often confused with a value restriction, but this is not about restrictions or validations, but about the classification of individuals. While the domain axiom controls the classification of the subject, the range axiom determines the classification of the object. Here is an example:

  • enapso:publishedArticle rdfs:domain enapso:Publisher
  • enapso:publishedArticle rdfs:range enapso:Article

The first triple states that any subject that uses the property publishedArticle is a Publisher (axiom). The second triple expresses that any object referenced by this property is an Article. Through the statement, EbnerVerlag is thus implicitly a member of the class Publisher and SemanticDatabases a member of the class Article:

  • enapso:EbnerVerlag enapso:publishedArticle enapso:SemanticDatabases

Conversely, a query for Publisher also returns EbnerVerlag and a query for Article also SemanticDatabases, although this was not explicitly defined for either of the two individuals. What on the one hand is an extremely useful and welcome feature - after all, this saves many redundant declarations and thus ultimately a lot of maintenance effort - harbors a certain danger on the other hand. If, for example, the statement enapso:EbnerVerlag enapso:publishedArticle enapso:Max is made, the reasoner automatically infers that Max is an Article. So a certain amount of care should be taken here. Since the reasoner does not automatically identify such semantic errors, they are difficult to find and fix later in extensive ontologies. The Shapes Constraint Language (SHACL [7]), which is based on RDF graphs, is a suitable tool to uncover type violations, among other things.
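The classifying effect of domain and range, including the pitfall just described, can be reproduced with a short sketch. It implements only the two RDFS rules for rdfs:domain and rdfs:range (rdfs2 and rdfs3) on prefixed names as plain strings; the function name `infer_domain_range` is hypothetical.

```python
schema = {
    ("enapso:publishedArticle", "rdfs:domain", "enapso:Publisher"),
    ("enapso:publishedArticle", "rdfs:range", "enapso:Article"),
}


def infer_domain_range(triples):
    """Return the rdf:type statements that follow from domain and range axioms."""
    triples = set(triples)
    inferred = set()
    for prop, axiom, cls in triples:
        if axiom not in ("rdfs:domain", "rdfs:range"):
            continue
        for s, p, o in triples:
            if p != prop:
                continue
            # domain classifies the subject, range classifies the object
            member = s if axiom == "rdfs:domain" else o
            inferred.add((member, "rdf:type", cls))
    return inferred - triples


if __name__ == "__main__":
    data = schema | {
        ("enapso:EbnerVerlag", "enapso:publishedArticle", "enapso:SemanticDatabases"),
        # Faulty statement: Max is a person, yet nothing stops the inference.
        ("enapso:EbnerVerlag", "enapso:publishedArticle", "enapso:Max"),
    }
    for triple in sorted(infer_domain_range(data)):
        print("inferred:", *triple)
```

The sketch raises no error for the faulty statement; it simply classifies Max as an Article, which is exactly why a separate validation step is advisable.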

...

While object properties are descriptions of relationships between individuals, data properties describe certain characteristics of a particular individual. They are comparable to the data fields of an object in OOP or value columns from SQL databases. In an RDF graph, data properties - like all other statements - are represented by triples. Example:

  • enapso:EbnerVerlag enapso:companyName "Ebner Media Group GmbH & Co. KG"^^xsd:string

Data Types

In OWL2, a variety of data types are available for data properties [8]. The numeric ones include the following:

...

The character sequence Hello World would be 48656c6c6f20576f726c64 in hexBinary format and SGVsbG8gV29ybGQ= in base64. The two date/time types are noted in ISO 8601 format [9]. The difference between the two is that for dateTime the specification of the time zone is optional, but for dateTimeStamp it is required. A valid dateTime value is, for example, 2020-02-15T12:45:00Z, as a dateTimeStamp value it is 2020-02-15T12:45:00-05:00. In OWL, it is also possible to create your own data types. A follow-up article will go into more detail on this topic and the benefits for applications.
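The two binary encodings can be reproduced with Python's standard library, and the time zone rule can be approximated with a small pattern. The check `has_timezone` is a deliberately simplified assumption for this illustration: it only looks for a trailing Z or a +hh:mm / -hh:mm offset and does not validate the rest of the value.

```python
import base64
import re

raw = "Hello World".encode("utf-8")

hex_binary = raw.hex()                                  # xsd:hexBinary lexical form
base64_binary = base64.b64encode(raw).decode("ascii")   # xsd:base64Binary lexical form

# A trailing "Z" (UTC) or a numeric offset marks an explicit time zone.
TIMEZONE = re.compile(r"(Z|[+-]\d{2}:\d{2})$")


def has_timezone(lexical):
    """True if the value carries a time zone, as xsd:dateTimeStamp requires."""
    return TIMEZONE.search(lexical) is not None


if __name__ == "__main__":
    print(hex_binary)
    print(base64_binary)
    for value in ("2020-02-15T12:45:00", "2020-02-15T12:45:00Z",
                  "2020-02-15T12:45:00-05:00"):
        print(value, "->", "dateTimeStamp ok" if has_timezone(value) else "dateTime only")
```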

...

A major advantage of semantic databases is the ability to enrich properties with extensive meta information and thus give them great expressiveness that can also be understood by machines in a standardized way. Table 1 shows the supported characteristics and their meaning.

Characteristic | Meaning, Restriction
--- | ---
functional | The property may be used at most once per individual (set restriction for the referenced objects). Example hasMother: A child B can have at most one mother A.
inverse functional | The referenced object may not be referenced by more than one subject (set restriction for the referencing subjects). Example isMarriedWith: If person A is married to person B, then B cannot also be married to C (at least not in the Christian cultural sphere).
transitive | If an individual A refers to an individual B and B to C, then A also refers to C. Example hasSuccessor: If A has the successor B and B has the successor C, then C is also a successor of A.
symmetric | If A is in a relationship with B, then B is in the same relationship with A. Example hasRelative: If person A is related to B, then person B is also related to A.
asymmetric | If A is in a relationship with B, B cannot be in the same relationship with A. Example isFatherOf: If A is the father of B, B cannot simultaneously be the father of A.
reflexive | Specifies that all individuals with this property always refer to themselves as well. Example knows: A person A always knows himself/herself as well.
irreflexive | Defines that individuals with this object property cannot refer to themselves. Example hasSibling: A person A cannot have himself/herself as a brother or sister.
inverse of | Defines that the property in question is the opposite of another property. Example: hasChild is an inverse property of hasParent, likewise hasWife is inverse to hasHusband.
equivalent to | Specifies that the property in question has the same values, for example for domain and range, as the referenced property. Example: The property child has the same values as hasChild.
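How a reasoner derives new statements from some of these characteristics can be sketched with a naive fixpoint loop. The example covers only symmetric, transitive, and inverse properties, uses the property names from the table with a placeholder prefix ex:, and is not how an OWL reasoner is actually implemented; the function name `apply_characteristics` is invented.

```python
def apply_characteristics(triples, symmetric=(), transitive=(), inverse_of=None):
    """Add the statements implied by symmetric, transitive, and inverse properties."""
    inverses = dict(inverse_of or {})
    # An inverse declaration works in both directions.
    inverses.update({v: k for k, v in list(inverses.items())})
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            if p in symmetric:
                new.add((o, p, s))
            if p in inverses:
                new.add((o, inverses[p], s))
            if p in transitive:
                new.update((s, p, o2) for s2, p2, o2 in closure
                           if p2 == p and s2 == o)
        if new <= closure:  # nothing new, stop
            return closure
        closure |= new


facts = {
    ("ex:A", "ex:hasRelative", "ex:B"),
    ("ex:A", "ex:hasSuccessor", "ex:B"),
    ("ex:B", "ex:hasSuccessor", "ex:C"),
    ("ex:A", "ex:hasChild", "ex:B"),
}

if __name__ == "__main__":
    result = apply_characteristics(
        facts,
        symmetric={"ex:hasRelative"},
        transitive={"ex:hasSuccessor"},
        inverse_of={"ex:hasChild": "ex:hasParent"},
    )
    for triple in sorted(result - facts):
        print("inferred:", *triple)
```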

In the full OWL profile [10], data properties are a subclass of object properties. Therefore, an inverse-functional property can also be defined for data properties here. In the OWL DL profile, object properties and datatype properties are separate, so an inverse-functional property for datatype properties cannot be defined here. The W3C website provides a good overview of many OWL2 axioms and restrictions [11].

Property Restrictions

In addition to the characteristics of properties, so-called property restrictions can be defined in OWL2. For data properties, these statements are about number and data type, for object properties about number and class of the referenced individuals. Table 2 shows the supported restrictions.

Restriction Type | Meaning
--- | ---
some | There must be at least one property of the specified type.
only | All properties must be of the specified type, no quantity is specified.
min (minimum cardinality) | There must be at least the specified number of properties of the specified type.
max (maximum cardinality) | There may be at most the specified number of properties of the specified type.
exactly (exact cardinality) | There must be exactly the specified number of properties of the specified type.

Due to the Open World Assumption (OWA), restrictions in OWL work differently than in SQL or NoSQL databases. The reasoner can detect, for example, when the maximum cardinality of a property is exceeded, because in this case there are indeed too many property values. However, due to the OWA, it does not treat falling below a minimum cardinality as an error, because the missing values could be defined in other ontologies. SHACL is better suited for validating individuals against schemas. One of the next issues of dotnetpro will report on this in detail.
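The different treatment of minimum and maximum cardinality can be illustrated with a small check. This is a simplified sketch under two assumptions: all values are known to be distinct individuals (otherwise an OWL reasoner might conclude that two of them are the same), and the closed-world mode merely imitates the kind of check a SHACL validator performs. The function name `check_cardinality` and the invoice data are made up for this example.

```python
def check_cardinality(triples, subject, prop, minimum=0, maximum=None,
                      closed_world=False):
    """Return a list of violation messages for one subject and property."""
    values = {o for s, p, o in triples if s == subject and p == prop}
    problems = []
    # Too many values are a contradiction in both worlds.
    if maximum is not None and len(values) > maximum:
        problems.append(
            f"{subject} has {len(values)} values for {prop}, at most {maximum} allowed")
    # Too few values only count as an error if the data is considered complete.
    if closed_world and len(values) < minimum:
        problems.append(
            f"{subject} has {len(values)} values for {prop}, at least {minimum} required")
    return problems


store = {
    ("erp:invoice_1", "props:hasProduct", "erp:product_1"),
    ("erp:invoice_1", "props:hasProduct", "erp:product_2"),
    ("erp:invoice_1", "props:hasProduct", "erp:product_3"),
}

if __name__ == "__main__":
    # Open world: a missing product on invoice_2 is not reported.
    print(check_cardinality(store, "erp:invoice_2", "props:hasProduct", minimum=1))
    # Closed world: the same situation is a violation.
    print(check_cardinality(store, "erp:invoice_2", "props:hasProduct", minimum=1,
                            closed_world=True))
    # Exceeding a maximum is reported in either mode.
    print(check_cardinality(store, "erp:invoice_1", "props:hasProduct", maximum=2))
```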

...