Building a Knowledge Management Reference Model with a Semantic Single Source of Truth (SSOT)
Harmonization and Knowledge Management with a Semantic Single Source of Truth (SSOT)
After discussing the challenges of managing large amounts of data and presenting semantic mapping as a possible solution in our previous article [https://www.linkedin.com/pulse/harmonization-knowledge-management-semantic-single-source-zemwe], we now present a reference model for a graph database.
...
For example, if a product in an e-commerce platform is to be found by its name, annotations can be used to support different spellings for the same product without affecting the relations or semantics of the actual data set. Class equivalence in semantic graph databases is also worth mentioning. For example, if you define that the classes Error, Bug, and Defect are equivalent, a query for all instances of the class Error will also return all instances of the classes Bug and Defect - notably through a single, central equivalence declaration rather than per individual query or even per instance.
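The effect of a central equivalence declaration can be illustrated with a minimal Python sketch. It deliberately uses no graph database: the class names come from the example above, while `EQUIVALENT`, `INSTANCES`, `instances_of`, and the sample tickets are hypothetical names chosen for illustration. In an OWL-based store, the declaration would typically be expressed with owl:equivalentClass and resolved by the reasoner rather than by application code.

```python
# One central equivalence declaration (hypothetical data structure).
EQUIVALENT = [{"Error", "Bug", "Defect"}]

# Sample instances, each typed with exactly one class (illustrative data).
INSTANCES = {
    "ticket-1": "Error",
    "ticket-2": "Bug",
    "ticket-3": "Defect",
    "ticket-4": "Feature",
}


def equivalent_classes(cls):
    """Return cls together with every class declared equivalent to it."""
    for group in EQUIVALENT:
        if cls in group:
            return set(group)
    return {cls}


def instances_of(cls):
    """Return all instances of cls or of any equivalent class, sorted."""
    wanted = equivalent_classes(cls)
    return sorted(name for name, c in INSTANCES.items() if c in wanted)


print(instances_of("Error"))  # ['ticket-1', 'ticket-2', 'ticket-3']
```

The query itself never mentions Bug or Defect; changing the single declaration changes the result of every query that relies on it.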
Mapping in the Reference Model
While mapping source data to a central schema in a data warehouse is primarily done programmatically or declaratively at the data level, in a semantic Single Source of Truth (SSOT) it follows semantic conventions. A common terminology is required for uniform, harmonized processing of content.
...
If additional data sources are to be integrated into an existing SSOT, the reference model should ideally first be extended in a semantically meaningful way, with the mapping then built on top of it. If new classes or properties need to be created and integrated into the structure of the reference model, they should be clearly defined and documented immediately, and any equivalence with other entities should be specified, to avoid ambiguities from the beginning. The more ambiguities a reference model contains, the less effective it is for harmonization.
Demarcation of Database Architectures
When consolidating data, the terms data warehouse and data lake inevitably come into play. Both approaches differ in various aspects from a Single Source of Truth (SSOT).
...
So let's take a look at the architecture of an SSOT (Figure 1).
...
Inbound APIs
The path of data into an SSOT begins with the transport from the primary sources into the hub. The term "import" is deliberately not used here, because it implies persistence, which does not necessarily take place in a hub: the hub may simply pass the data through.
...
If the hub also supports push delivery, it can integrate more than automated, scheduled, or interval-controlled retrieval via classic APIs such as REST, JSONP, or SOAP. Systems that provide their data via FTP, SFTP, message queues, or file sharing can then also be integrated easily: event or file monitoring mechanisms pick up the data at exactly the time the primary source delivers it.
Import and Persistence vs. On-Demand Queries
Unlike a data warehouse, the hub character of the SSOT leaves it to the operator to decide whether to import and persist data or simply pass it on to consumers. Both options have advantages and disadvantages.
...
One advantage of persisting information within the SSOT is that all detailed information is available in a central location. Aggregated and filtered on-demand data from external sources cannot be broken down further; it must be requested anew, for example for drill-down reports. In contrast, all information persisted in the graph database can be linked and related arbitrarily with comparatively simple SPARQL queries - a decisive aspect, especially when the integration of different data sources in dashboards is accompanied by intensive user interactions.
Reference Model vs. Domain Models
Once the raw data from the primary sources has arrived at the SSOT, it can either be transformed directly into the reference model or first be transformed into a domain-specific model within the graph database and then mapped against the reference model.
...
Domain models do not compete with a reference model, however; they use it. The Priority class discussed earlier in this article sensibly belongs to the reference model. In SPARQL queries it serves, for example, as the central definition for the output format, and the contents of the relevant domain models are mapped to it "on the fly".
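The idea of mapping domain content to a central definition at query time can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the domain names, the local priority values, and the reference values "Low", "Medium", and "High" are hypothetical, and in an actual SSOT this mapping would be expressed in the graph and in SPARQL rather than in application code.

```python
# Per-domain mapping of local priority values to the reference model.
# All domain names and values are assumptions for illustration only.
DOMAIN_TO_REFERENCE = {
    "helpdesk": {"P3": "Low", "P2": "Medium", "P1": "High"},
    "bugtracker": {"minor": "Low", "major": "Medium", "blocker": "High"},
}


def to_reference_priority(domain, local_value):
    """Map a domain-specific priority to the reference model at query time."""
    try:
        return DOMAIN_TO_REFERENCE[domain][local_value]
    except KeyError:
        raise ValueError(
            f"no mapping for {local_value!r} in domain {domain!r}"
        ) from None


def harmonized(records):
    """Yield copies of the records with priorities in reference terms."""
    for rec in records:
        yield {**rec, "priority": to_reference_priority(rec["domain"], rec["priority"])}


rows = [
    {"id": 1, "domain": "helpdesk", "priority": "P1"},
    {"id": 2, "domain": "bugtracker", "priority": "minor"},
]
for row in harmonized(rows):
    print(row["id"], row["priority"])
```

The stored domain records stay untouched; only the output is expressed in the reference vocabulary, and an unmapped value fails loudly instead of slipping through.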
Inbound Transformation
The actual transformation of raw data from a specific format into a semantic model is more of a technical than a semantic task. The primary challenge here is supporting the variety of data formats that may occur in the primary sources.
...
The SSOT can be supplied with data from the primary sources in a variety of ways: direct access to SQL or NoSQL databases; interfaces to big data services such as SAP HANA, Hadoop, or Elasticsearch; files such as Excel workbooks or CSV, TSV, or JSON files; REST and SOAP APIs; message queues or log files; and external knowledge bases and ontologies on the web (Figure 2).
...
Mapping Adapters
Many of the operations required for a transformation can already be handled by simple mapping tables and configurable conversion functions, for example for field names, date specifications, string or number conversions. But where specific or more complex transformations are required, so-called mapping adapters help.
These receive the raw data in the format of the primary source and convert it into the intermediate format, here JSON. This is helpful where source data needs to be prepared, or where information needs to be merged or converted at the level of the source data, for example converting local times to UTC/GMT, special encodings to UTF-8, or special mappings via manually maintained auxiliary or cross-reference tables.
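A mapping adapter of this kind might look like the following Python sketch. It is not taken from any particular product: the semicolon-separated input format, the field names in `FIELD_MAP`, the Latin-1 source encoding, and the fixed UTC+2 offset are all assumptions made for the example.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumption: the source system reports local time at a fixed UTC+2 offset.
SOURCE_TZ = timezone(timedelta(hours=2))

# Simple mapping table for field names (hypothetical source and target names).
FIELD_MAP = {"Kunde": "customer", "Erstellt": "created"}


def adapt(raw: bytes) -> str:
    """Convert one 'key=value;key=value' Latin-1 record into JSON text."""
    text = raw.decode("latin-1")  # special source encoding -> Python str
    pairs = (item.split("=", 1) for item in text.split(";"))
    record = {FIELD_MAP[key]: value for key, value in pairs}
    # Parse the local timestamp and normalize it to UTC.
    local = datetime.strptime(record["created"], "%d.%m.%Y %H:%M")
    utc = local.replace(tzinfo=SOURCE_TZ).astimezone(timezone.utc)
    record["created"] = utc.isoformat()
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


raw = "Kunde=Müller;Erstellt=01.10.2019 14:30".encode("latin-1")
print(adapt(raw))
```

Field renaming is handled by a plain mapping table, while the time zone and encoding conversions are the adapter-specific part that a table alone could not express.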
Inbound Validation
The first stage of quality assurance in a Single Source of Truth already takes place during import. After the raw data has been transformed into the intermediate format, a downstream layer in the software stack checks the data for completeness and validity according to the domain-specific rules - at this level, not yet against the reference model.
For example, primary systems may be unreachable during a download attempt, or data may be loaded incompletely due to errors. In the event of an error, it is also practical to notify the owner of the data during import that a correction is required. Inbound validation thus prevents invalid data from reaching the SSOT in the first place, thereby avoiding subsequent errors.
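A completeness and validity check of this kind could be sketched as follows. The required fields, the `expected_count` parameter, and the `notify` callback are hypothetical; a real hub would take its rules from the domain model and route notifications to the data owner through whatever channel is in use.

```python
REQUIRED_FIELDS = ("id", "title", "created")  # hypothetical domain rule


def validate_batch(records, expected_count=None, notify=print):
    """Split records into accepted and rejected; report each problem."""
    # An incomplete load rejects the whole batch before any record check.
    if expected_count is not None and len(records) != expected_count:
        notify(f"incomplete load: got {len(records)} of {expected_count} records")
        return [], list(records)
    accepted, rejected = [], []
    for record in records:
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            notify(f"record {record.get('id', '?')}: missing {', '.join(missing)}")
            rejected.append(record)
        else:
            accepted.append(record)
    return accepted, rejected


batch = [
    {"id": "42", "title": "Login fails", "created": "2019-10-01"},
    {"id": "43", "title": "", "created": "2019-10-02"},
]
accepted, rejected = validate_batch(batch, expected_count=2)
print(len(accepted), "accepted,", len(rejected), "rejected")
```

Only the accepted records would be passed on towards the SSOT; the rejected ones stay outside until the source has been corrected.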
Offline Cache
A particular challenge arises when primary systems do not have historical data, evaluable log files, or audit trails and only provide snapshots of their current status. In this case, reports over time periods and trend analyses are not possible without further tools.
...
An offline cache is also useful for performance reasons. If queries to a primary system take a very long time, or burden that system to such an extent that frequent queries from the SSOT degrade performance for its users, the required data can be retrieved conveniently and quickly from the cache. Many reports do not require real-time data at all, so updating the offline cache on a daily basis, for example, can significantly relieve the overall system.
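The performance aspect can be illustrated with a small time-based cache in Python. This is a sketch under assumptions, not a description of a specific implementation: `OfflineCache`, `slow_primary_query`, and the one-day refresh interval are hypothetical, and the cache here lives in memory, whereas a real offline cache would persist its content.

```python
import time


class OfflineCache:
    """Serve a cached result and refresh it only after max_age_seconds."""

    def __init__(self, fetch, max_age_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable that queries the primary system
        self._max_age = max_age_seconds
        self._clock = clock          # injectable for testing
        self._value = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self._value = self._fetch()
            self._loaded_at = now
        return self._value


def slow_primary_query():
    """Stand-in for an expensive call to a primary system (hypothetical)."""
    return {"open_tickets": 17}


cache = OfflineCache(slow_primary_query, max_age_seconds=24 * 60 * 60)
print(cache.get())  # first call queries the primary system
print(cache.get())  # second call is served from the cache
```

With a daily refresh interval, the primary system is queried at most once per day regardless of how many reports read from the cache.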
Central Services
In general, a major advantage of consolidating data in a central location is that many services no longer need to be implemented multiple times per data source, but only once, namely centrally in the SSOT.
...
A central access control system via rights and roles or policies regulates who may access or change which knowledge in the SSOT and how. The SSOT thus becomes not only a valuable knowledge hub and enterprise asset, but also an effective collaboration tool.
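At its simplest, a role-based check of this kind could be sketched as below. The roles, actions, and user names are purely hypothetical, and the sketch ignores finer-grained policies, such as restricting access to particular classes or graphs.

```python
# Hypothetical roles and the actions each one grants.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
}

# Hypothetical assignment of users to roles.
USER_ROLES = {"alice": {"editor"}, "bob": {"viewer"}}


def is_allowed(user, action):
    """Return True if any of the user's roles grants the action."""
    return any(
        action in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


print(is_allowed("alice", "write"))  # True
print(is_allowed("bob", "write"))    # False
```

Because the check lives centrally, it applies to every consumer of the SSOT instead of being reimplemented per data source.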
Outbound Transformation
The outbound transformation serves to provide harmonized data to its consumers. In a semantic Single Source of Truth, the information is available as RDF triples. In order for it to be processed by the target systems such as BI tools, it must be transformed back into a supported data format.
As with the inbound transformation, corresponding adapters help here as well. The query language SPARQL returns its results as simply structured result sets in column and row format. These can be transformed into any conceivable data format, such as JSON, XML, CSV, or TSV. Customer-specific formats can easily be implemented via custom adapters, mostly as plug-ins.
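As an illustration, the following Python sketch converts a result set in the W3C SPARQL 1.1 Query Results JSON Format into CSV. The function name `bindings_to_csv` and the sample data are hypothetical; only the `head`/`vars` and `results`/`bindings` structure is taken from the standard.

```python
import csv
import io


def bindings_to_csv(result: dict) -> str:
    """Flatten a SPARQL JSON result set into CSV text with a header row."""
    columns = result["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for binding in result["results"]["bindings"]:
        # Unbound variables are simply absent from a binding; emit "".
        writer.writerow([binding.get(col, {}).get("value", "") for col in columns])
    return buffer.getvalue()


sample = {
    "head": {"vars": ["ticket", "priority"]},
    "results": {"bindings": [
        {"ticket": {"type": "literal", "value": "T-1"},
         "priority": {"type": "literal", "value": "High"}},
        {"ticket": {"type": "literal", "value": "T-2"}},  # priority unbound
    ]},
}
print(bindings_to_csv(sample))
```

An adapter for another target format would follow the same pattern: iterate over the bindings and write one output row per solution.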
Outbound Validation
The last stage before delivering result sets to consumers is outbound validation. Given the extensive checks and measures for data quality in the SSOT, the question arises as to why outbound validation is necessary.
...
On the other hand, it is an important additional security layer. Independently of all upstream software layers, and of any errors or manipulations introduced there by attacks, final checks can be carried out here for sensitive data that must not be delivered to consumers under any circumstances - combined with appropriate notification measures, a welcome protection against data leaks.
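Such a final check could be sketched as follows. The blocked column names, the `SECRET-` token pattern, and the `notify` callback are assumptions for illustration; an actual deployment would derive its rules from its own security policy.

```python
import re

# Hypothetical policy: columns that must never leave the SSOT.
BLOCKED_COLUMNS = {"password", "salary"}
# Hypothetical pattern for values that look like an internal secret token.
SECRET_PATTERN = re.compile(r"\bSECRET-[A-Z0-9]{6,}\b")


class OutboundViolation(Exception):
    """Raised when a result set must not be delivered."""


def check_outbound(rows, notify=print):
    """Return rows unchanged if clean; otherwise notify and raise."""
    problems = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if column.lower() in BLOCKED_COLUMNS:
                problems.append(f"row {index}: blocked column {column!r}")
            elif SECRET_PATTERN.search(str(value)):
                problems.append(f"row {index}: secret-like value in {column!r}")
    if problems:
        notify("outbound validation failed: " + "; ".join(problems))
        raise OutboundViolation(f"{len(problems)} problem(s) found")
    return rows


clean = [{"ticket": "T-1", "priority": "High"}]
print(check_outbound(clean))
```

The check runs on the final result set only, so it does not depend on any upstream layer having behaved correctly.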
Outbound APIs
After outbound transformation and validation, the data is ready in terms of content for delivery to the consumer. The outbound APIs are now available for transport to the target systems.
...
When it comes to larger amounts of data, it can first be pushed into SQL or NoSQL databases as intermediary instances - useful where only standard configurations such as JDBC/ODBC are supported in BI, charting, or reporting tools, or where only certified out-of-the-box adapters and no manual implementations may be used for security reasons.
Customized Post-Processing
One motivation for introducing an SSOT is to enable cross-tool reporting and to support central business decisions with cross-departmental key performance indicators (KPIs). This can be realized in the frontend in different ways. For example, BI tools such as Tableau or Power BI could be used, or alternatively a separate web app with a chart engine.
...
Depending on the requirements, the post-processing can therefore take place in a separate process within the SSOT platform, on a separate system, or even only on the target system itself. It is important to implement the processes as independently of each other as possible, for which cloud-based microservice structures are particularly suitable.
Security
In contrast to traditional SQL or NoSQL databases, in a graph database the information is not available as tables or collections, rows and columns, or documents, but as a multitude of triples that reference each other, that is, as "linked information".
...
Furthermore, the parameters required by the call can be queried. The app can thus react dynamically to changing requirements or possibilities, for example by automatically offering, in a reporting system, all filters that an API request supports.
Conclusion
The semantic Single Source of Truth combines a multitude of positive characteristics of existing database systems and architectures: It enriches data with meaning and creates the basis for a common understanding of it, and thus not only for better communication and more efficient collaboration between people, but also for greater interoperability and easier exchange of information between systems.
...
Author: Alexander Schulze, first published October 2019, translation and editorial work by Ashesh Goplani, thanks for your contribution!
ChatGPT Prompts
Image Prompt (ChatGPT 4o):
Create a photo-realistic image at a resolution of 1920*1080 of a multicultural group of 6 smiling employees, each giving a single "thumbs up"; three employees on the left side and three on the right side of the image, each holding a tablet.
...
Modify the previous image and replace the lettering in the background with "ENAPSO together", with "ENAPSO" exclusively in capital letters and "together" in lowercase letters in italics. "ENAPSO together" is a trademark and must be reproduced in exactly this spelling.
Summary Prompt for the Post (ChatGPT 4o)
Create a summary of the article with a maximum of 600 characters focusing on the key issues and key benefits of harmonizing data in a semantic single source of truth to become high-quality and reusable enterprise knowledge assets.
...