Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Harmonization and Knowledge Management with a Semantic Single Source of Truth

Table of Contents
minLevel1
maxLevel6
outlinefalse
stylenone

...

default
typelist
printabletrue

Harmonizing heterogeneous data sources with semantic graph databases.

In large companies, especially those that have grown through acquisitions, there often exists a coexistence of the most diverse tool landscapes. There are various business areas and competencies, individual processes, and historically grown environments that cannot be replaced, migrated, or standardized in the short term or economically in practice. This involves technologies, but also the people who use them.

Current State

From demand and requirements management, product and application lifecycle management, development and version control, test, deployment, and continuous integration tools, to operations and monitoring tools: In practice, developers are confronted with complex heterogeneous environments, a variety of distributed data sources and quantities with the most diverse interfaces, data formats, quality grades, or availabilities.

...

For management, the smart harmonization of data offers higher transparency and quality as well as better analyses for their business decisions - across the boundaries of tools and departments. For us, it offers a common understanding and the reusability of knowledge, thus freeing up more space and time for creativity and innovation - across personal skills, different working models, and cultures.

Challenges

Many tasks in merging information are obvious and can usually be solved by so-called Extract-Transform-Load (ETL) tools. These include downloading data from the primary sources, transforming it into a standard format, optionally validating it, and finally making it available via APIs.

...

Especially when it comes to business analytics, reports, and KPIs, aspects such as correctness, completeness, and consistency of information are essential. The SSOT discussed in this article is based on the W3C-compatible semantic graph database GraphDB from Ontotext. Version 8.x The recent free version can be downloaded for free [1] for development purposes. It is limited to two simultaneous connections and in terms of clustering and external connectors, but not in terms of data volume. here.

Support not only in centralizing data but also in improving data quality in the primary systems is another crucial factor for the success of a Single Source of Truth (SSOT).

...

Once security and legal concerns are resolved, individual data sovereignty is added. Once these are resolved and there is a willingness to harmonize on all sides, the effort to create a common understanding of the available information can continue.

Harmonization versus Standardization

Complex heterogeneous environments pose diverse challenges. Applications, like people, often mean the same thing - but speak different languages, using not only different data but also different terminologies. And that leads to friction at the interfaces, both technical and human.

...

With a Single Source of Truth, it is therefore not about standardizing information and processes, but about harmonizing them, with the goal of creating compatibility instead of conformity and thus ultimately supporting strategic corporate goals.

Compatibility versus Conformity

As an example, let's consider the use of different task tracking systems, for simplicity's sake, GitLab and Jira. If you want to show across the board in a reporting tool for your project management how many tasks with which priority are open, this assumes that both systems have the same configurations and values for specifying the priority - this is rarely the case in practice.

...

The linking of information in a semantic graph creates attractive automation potentials, but only on the basis of the defined criteria; exceptional cases remain unconsidered. An assistance system should therefore not mean taking over control of all decisions, but rather it should support us. To stay with the priority, a manual intervention to control the ranking within the framework of valid values should certainly be maintained.

Common Language, Terminology

A strategic goal in large companies is often to improve internal collaboration. One measure to achieve this is to improve communication. To achieve this, a uniform terminology, a common language - technically a common semantic definition of identifiers and their synonyms - helps.

...

Eliminating or at least reducing ambiguities is one of the goals of an SSOT. A glossary is a useful feature here. It represents a central reference of identifiers including a human-understandable description, the actual meaning. You can find important identifiers for understanding the terminology in the Glossary box at the botton of this article.

Semantics

A semantic graph, also called an ontology, essentially consists of a class hierarchy, the so-called taxonomy or T-Box, and the individuals as well as the instances of the classes, the so-called A-Box. Multiple inheritance is also supported. In addition, there are the properties, which are divided into data properties and object properties, as well as the annotations.

...

That's it for now. In the following dotnetpro issuearticle, you will read the second part of this article, which presents we’ll delve into more details of a reference model as well as the mapping in the reference model.

Glossary

The most important terms for understanding graph databases:

  • A-Box: The Assertional Box is part of an ontology and contains knowledge about the concrete instances (individuals, objects) of a domain. It contains facts (see Axioms) about individuals and their properties, as well as their relationships to each other. The A-Box represents the state of a world modeled in the T-Box.

  • Annotation: Annotations are statements that can be attached to any entity, i.e., any class, individual, or property, without affecting its semantics. Annotations can be used for comments, specifying authors or version numbers, and for internationalizing ontologies, i.e., translating knowledge into different languages. Annotations are treated as data but not considered by the reasoner.

  • Assertions: Assertions are claims in an ontology that do not necessarily have to be true, complete, or consistent. For example, a person's age could be specified as negative or with two contradictory statements. The ontology does not reject invalid claims per se. However, the reasoner of a graph database identifies rule violations and semantic inconsistencies in the form of explanations and supports correction at a time chosen by the user, while the entire ontology remains available externally despite internal inconsistencies.

  • Axioms: Axioms are statements in an ontology that determine what is true in a domain. Example: Human is a subclass of Living Being. If Max is a human, he is also a living being. Axioms determine, among other things, classes, data and object properties, data types, or annotations. Axioms about individuals are often also referred to as facts.

  • Data Warehouse: A data warehouse is a central repository for structured data, prepared, filtered, validated, and transformed for a purpose. A schema is used when writing to the warehouse (Schema on Write). Contents are easy to understand, but changes are more complex, as consumers such as business intelligence tools use them directly, for example, for dashboards. The target group is more business professionals, and the purpose is quick analysis results.

  • Data Lake: A data lake is the consolidation of raw data in a central location, whose purpose and use are not yet determined, dynamic, unfiltered, extensive, and less organized. Contents are ideal for machine learning, harder to understand but easier to change. Navigation, data quality, and data governance are more difficult. A schema is only applied when reading from the lake (Schema on Read). The target group is more data scientists.

  • Domain: In the semantic web, a domain is referred to as a content-related or logically intertwined area of knowledge, an area of interest, or a collection of resources, people, or machines. It is identified by a name and should not be confused with an Internet domain.

  • Graph Database: A graph essentially consists of nodes and the connections between them, the so-called edges. A graph database is ideal for representing networked information and thus managing knowledge. RDF is the best-known concept for a semantic graph database. Here, statements are formulated in the form of so-called triples consisting of subject, predicate, and object (see RDF). Graph databases can usually contain multiple ontologies in different contexts (graphs) within a single database schema. Such a combination of ontologies, especially together with a lot of instance information, is often also referred to as a knowledge graph.

  • Individuals: In the context of the semantic web and ontologies, objects are usually referred to as individuals, technically understood as instances of classes, which are occasionally also referred to as concepts. Individuals can contain data and object properties and can be assigned to one or more classes. While in the object-oriented world, an instance of a particular class is usually created - for example, var Max = new Person() - in the semantic web, an individual can also be created without a class, but the class assignment(s) can be inferred via properties. Example: If an individual has the property hasWheels and hasWheels has the domain Vehicle, the individual is automatically a member of the class Vehicle without this having to be explicitly defined for it.

  • Inference: Inference means logically deriving new statements based on existing ones. Example: If A is equivalent to B and B is equivalent to C, it can be inferred that A is also equivalent to C. While the first two statements are explicitly formulated, the third is inferred - that is, implicit, automatically generated knowledge. The reasoner is responsible for the inference using sets of rules, so-called profiles. The W3C defines the scope of the profiles for OWL, but not how they are to be implemented. The major vendors of graph databases usually allow the adaptation of existing or the definition of custom rule sets in addition to the predefined profiles.

  • Ontology: An ontology includes the formal definition of concepts (classes), properties, and relationships between the entities of a domain. Ontologies are particularly suitable for sharing knowledge using a common vocabulary. They consist of a T-Box, the taxonomy (class hierarchy), and an A-Box, the individuals (instances). Reasoners are responsible for inference (logical conclusions) in ontologies. Annotations can be used for comments or internationalization of ontologies. Ontologies can reference each other and thus grow into extensive, self-learning knowledge databases.

  • Open World Assumption: The Closed World Assumption (CWA) states that everything that is not known to be true must be false. If a train schedule states that a train runs at 10 and 14 o'clock, this implies in CWA that it does not run at 12 o'clock. In contrast, the Open World Assumption (OWA) states that everything that is not known to be true is simply unknown. If a phone directory lists the numbers of two subscribers, this does not mean that no other subscribers exist. Ontologies in the Semantic Web work according to the Open World Assumption. This is advantageous because incorrect, contradictory, or rule-violating information entered does not restrict the functionality of the ontology as a whole, but can be identified, reported, and explained by the reasoner.

  • OWL: OWL stands for Web Ontology Language, a language designed for the Semantic Web to represent knowledge about objects and classes (as groups of objects) and their relationships to each other. Ontologies are based on RDF and OWL and can be read and modified with SPARQL as query language. Reasoners support RDF and OWL.

  • Profiles: OWL2 (http://www.w3.org/TR/owl2-profiles ) defines sets of inference rules, so-called profiles, with different expressiveness and efficiency for specific use cases. Prominent examples are EL for ontologies with many classes and properties, QL for ontologies with many instances, and RL for applications that require a balanced trade-off between scalable reasoning (inference) and expressiveness.

  • Properties: Properties describe the characteristics of individuals, the instances in a graph database. Data properties contain concrete values of various data types (www.3.org/TR/owl2-syntax/#Datatype_Maps), object properties describe the relations between individuals. The expressiveness of ontologies is determined, among other things, by so-called restrictions. There are transitive - if a customer A uses app B and app B uses service C, then it can be inferred: customer A also uses service C -, symmetric - if, for example, a service A interacts with a service B, then service B also interacts with service A - and inverse properties - an inverse property to App A uses Service B would be, for example, usedBy. A query for Service B usedBy App A would be answered by the reasoner with true.

  • Reasoner: A reasoner is a so-called inference engine. A software that is able to draw new logical conclusions from axioms and assertions without explicitly persisting them, but can implicitly provide them in SPARQL queries: A = B (explicit) and B = C (explicit) => A = C (implicit). Therefore, it can also happen that ontologies require considerably more space in memory than on disk due to the dynamically generated knowledge. W3C-compliant semantic graph databases offer different sets of inference rules, called profiles in OWL2. Reasoners are also responsible for reporting and explaining inferences and inconsistencies in an ontology.

  • RDF: RDF is the abbreviation for Resource Definition Framework (http://www.w3.org/RDF ), a modeling concept of the Semantic Web standardized by the W3C that uses triples of subject, predicate, and object to formulate simple logical statements in directed graphs that can be easily read, understood, and visualized by machines. Examples: Max hasAge 32 (data property) or Josef hasSpoose Maria (object property). The format is RDF/XML.

  • Semantic Web: According to the inventor of the Internet, Tim Berners-Lee, the Semantic Web is "the web of data that can be processed by machines". The W3C defines it as follows: "The Semantic Web provides a framework that allows data to be shared and reused across application, enterprise, and community boundaries." The Semantic Web is therefore also seen as an integrator across different contents, information applications, and systems.

  • SPARQL: Is the name of the protocol based on HTTP (http://www.w3.org/TR/sparql11-protocol ) and at the same time the abbreviation for SPARQL Protocol And RDF Query Language (http://www.w3.org/TR/sparql11-overview ), a query (SPARQL 1.0) and manipulation language (SPARQL 1.1) in the Semantic Web for RDF graphs standardized by the W3C. SPARQL is similar to SQL, supports the integration of multiple, even external RDF graphs and is optimized for RDF triple stores. With SPARQL, not only persisted explicit knowledge, but also implicitly generated knowledge through reasoners and inference can be queried. Implementations exist for almost all programming languages.

  • Taxonomy: In terms of ontologies and the Semantic Web, a taxonomy is a model for a hierarchical classification of objects, or in simple terms: a class hierarchy with classes and subclasses. It enables simple aggregations at the class level. Example: accident statistics of an insurance company for cars and trucks as subclasses of all vehicles.

  • T-Box: The Terminological Box is the conceptual component of a knowledge base, also called schema or vocabulary. It contains the knowledge about the classes of a domain as well as their hierarchy (taxonomy) and their characteristics (properties). There are powerful ontologies that consist only of a T-Box but contain no instances.

[1] GraphDB, http://graphdb.ontotext.com Author: Alexander Schulze, Translated by Ashesh Goplani, published October 2019