...

After discussing the challenges of managing large amounts of data and presenting semantic mapping as a possible solution in our previous article [1] (https://www.linkedin.com/pulse/harmonization-knowledge-management-semantic-single-source-zemwe), we now present a reference model for a graph database.

A semantic graph is an instrument to model the “real world”, its entities, and their relationships to each other. Although a graph database usually uses human-readable identifiers to keep the model maintainable for developers, the mapping itself is independent of any specific terminology or language. In principle, class names could also be unique identifiers, so-called UUIDs. Without additional information, however, these are difficult for a human to manage in a model, so the actual identifiers can be enriched with annotations.

...

For example, if a product on an e-commerce platform is to be found by its name, annotations can support different spellings for the same product without affecting the relations or semantics of the actual data set. Class equivalence in semantic graph databases is also worth mentioning. For example, if you define the classes Error, Bug, and Defect as equivalent, a query for all instances of the class Error will also return all instances of the classes Bug and Defect. Notably, this relies on a single, central equivalence declaration rather than one per query or even per instance.
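To illustrate the idea of a central equivalence declaration, here is a minimal sketch in plain Python. It is not the API of any particular graph database: the class TinyGraph and its methods are hypothetical names chosen for this illustration, and the equivalence classes are tracked with a simple union-find structure.

```python
from collections import defaultdict


class TinyGraph:
    """Toy in-memory stand-in for a semantic graph (hypothetical, for illustration only)."""

    def __init__(self):
        self._parent = {}                    # union-find forest over class names
        self._instances = defaultdict(set)   # class name -> instance identifiers

    def _find(self, cls):
        """Return the representative of the equivalence group that cls belongs to."""
        self._parent.setdefault(cls, cls)
        while self._parent[cls] != cls:
            self._parent[cls] = self._parent[self._parent[cls]]  # path halving
            cls = self._parent[cls]
        return cls

    def declare_equivalent(self, *classes):
        """Declare all given classes equivalent, once and centrally."""
        root = self._find(classes[0])
        for cls in classes[1:]:
            self._parent[self._find(cls)] = root

    def add_instance(self, cls, instance_id):
        self._find(cls)  # make sure the class is known
        self._instances[cls].add(instance_id)

    def instances_of(self, cls):
        """Return the instances of cls and of every class equivalent to it."""
        root = self._find(cls)
        result = set()
        for other, members in self._instances.items():
            if self._find(other) == root:
                result |= members
        return result


graph = TinyGraph()
graph.declare_equivalent("Error", "Bug", "Defect")  # the single declaration
graph.add_instance("Error", "E-1")
graph.add_instance("Bug", "B-7")
graph.add_instance("Defect", "D-3")
graph.add_instance("Feature", "F-2")

print(sorted(graph.instances_of("Error")))  # ['B-7', 'D-3', 'E-1']
```

The query for Error contains no knowledge of Bug or Defect. The equivalence lives in one place in the model, which is the point made above.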

...

Here's an example: In one environment, a field "estimation" could mean the effort in hours; in another, "effort" could mean the cost in euros. A field "cost" could mean an amount in euros or dollars, and a field "time" could mean a timestamp relative to another one or a duration, and thus perhaps implicitly also the effort. Even these few terms and their many ambiguities show how important a semantic reference model is.

Currently, most data sources do not yet provide machine-readable meta-information. However, heterogeneous data can only be compared, aggregated, or analyzed across tools if their meaning is unambiguously defined and harmonized. For example, two weight specifications 0.1 and 300 from different systems can only be summed meaningfully and correctly if both their units (e.g. kg and g) and the conversion factor (1 kg = 1000 g) are known. Adding the raw numbers yields 300.1, whereas the semantically correct result is 400 g or 0.4 kg.
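The weight example can be sketched in a few lines of Python. The names Weight, TO_GRAMS and sum_weights are assumptions made for this illustration, and the conversion table covers only the two units mentioned in the text.

```python
from dataclasses import dataclass

# Conversion factors to a common base unit (gram). Only kg and g are covered here.
TO_GRAMS = {"g": 1.0, "kg": 1000.0}


@dataclass(frozen=True)
class Weight:
    """A value together with its unit, i.e. the meta-information the raw number lacks."""
    value: float
    unit: str

    def in_grams(self) -> float:
        return self.value * TO_GRAMS[self.unit]


def sum_weights(weights, target_unit="g"):
    """Sum weights of mixed units by converting each one to grams first."""
    total_grams = sum(w.in_grams() for w in weights)
    return Weight(total_grams / TO_GRAMS[target_unit], target_unit)


# Pure data basis: the units are ignored and the result is meaningless.
naive_total = 0.1 + 300

# Semantically correct: both units and the conversion factor are known.
measurements = [Weight(0.1, "kg"), Weight(300, "g")]
total_g = sum_weights(measurements, "g")
total_kg = sum_weights(measurements, "kg")

print(naive_total, total_g, total_kg)
```

The raw sum silently mixes kilograms and grams, while the unit-aware sum arrives at roughly 400 g or 0.4 kg, up to floating-point rounding.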

...

So let's take a look at the architecture of an SSOT (Figure 1).

...

Inbound APIs

The path of data into an SSOT begins with the transport from the primary sources into the hub. The term "import" is deliberately avoided here because it implies persistence, which does not necessarily take place in a hub: the hub may simply pass the data through.

...

While the direct transformation is more similar to the data warehouse architecture with a central schema, the transformation via intermediate domain models is more reminiscent of the lake architecture with independent schemas in a superordinate database.

...

The SSOT can be supplied with data from the primary sources in a variety of ways: direct access to SQL or NoSQL databases; interfaces to big data services such as SAP4 HANA, Hadoop or Elasticsearch; files in Excel, CSV, TSV or JSON format; REST and SOAP APIs; message queues or log files; and external knowledge databases and ontologies on the web (Figure 2).

...

Mapping Adapters

Many of the operations required for a transformation can already be handled by simple mapping tables and configurable conversion functions, for example for field names, date specifications, string or number conversions. But where specific or more complex transformations are required, so-called mapping adapters help.
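A mapping table with configurable conversion functions might look like the following minimal sketch. The source field names, the target field names and the two converter functions are invented for this illustration and do not come from any specific tool.

```python
from datetime import datetime


def parse_dotted_date(text):
    """Convert a date such as '21.05.2024' to ISO format ('2024-05-21')."""
    return datetime.strptime(text, "%d.%m.%Y").date().isoformat()


def parse_decimal_comma(text):
    """Convert a number such as '1.250,5' (decimal comma) to a float."""
    return float(text.replace(".", "").replace(",", "."))


# Mapping table: source field -> (target field, conversion function).
FIELD_MAP = {
    "Title": ("name", str.strip),
    "Created": ("created", parse_dotted_date),
    "Estimation": ("effort_hours", parse_decimal_comma),
}


def transform(record, field_map=FIELD_MAP):
    """Rename and convert the mapped fields of one source record."""
    target = {}
    for source_field, value in record.items():
        if source_field not in field_map:
            continue  # unmapped fields are dropped in this sketch
        target_field, convert = field_map[source_field]
        target[target_field] = convert(value)
    return target


row = {"Title": "  Login page ", "Created": "21.05.2024",
       "Estimation": "1.250,5", "Internal": "x"}
print(transform(row))
# {'name': 'Login page', 'created': '2024-05-21', 'effort_hours': 1250.5}
```

Because the table is plain data, it can be changed by configuration. A mapping adapter only becomes necessary where a transformation cannot be expressed as such a field-by-field conversion.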

...

An offline cache is also useful for performance reasons. If queries to a primary system take a very long time, or burden it to such an extent that frequent queries from the SSOT degrade performance for its users, the required data can instead be retrieved quickly from the cache. Many reports do not require real-time data at all, so updating the offline cache daily, for example, can significantly relieve the overall system.
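The caching behaviour described here can be sketched as a small time-based cache. The class OfflineCache, its parameters and the slow_primary_query function are hypothetical and only illustrate the principle. A real hub would presumably persist the cache and refresh it on a schedule.

```python
import time


class OfflineCache:
    """Serve data from a local copy and query the primary system only when the copy is too old."""

    def __init__(self, fetch, max_age_seconds=24 * 60 * 60, clock=time.time):
        self._fetch = fetch              # function that queries the primary system
        self._max_age = max_age_seconds  # default: refresh at most once per day
        self._clock = clock              # injectable clock, useful for testing
        self._entries = {}               # key -> (fetched_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]              # fresh enough: no load on the primary system
        value = self._fetch(key)         # missing or expired: query the primary system
        self._entries[key] = (now, value)
        return value


def slow_primary_query(report_name):
    """Stand-in for an expensive query against a primary system."""
    time.sleep(0.1)
    return f"data for {report_name}"


cache = OfflineCache(slow_primary_query)
cache.get("monthly-report")  # first call queries the primary system
cache.get("monthly-report")  # later calls within a day are served from the cache
```

The maximum age is the tuning knob. A daily refresh suits reports that do not need real-time data, while a shorter interval trades more load on the primary system for fresher results.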

Central Services

...

Author: Alexander Schulze, first published October 2019. Translation and editorial work by Ashesh Goplani, thanks for your contribution!