Semantic Single Source of Truth (SSOT), Part 2: Reference Model

Harmonizing heterogeneous data sources with semantic graph databases.

After discussing the challenges of managing large amounts of data and presenting semantic mapping as a possible solution in our previous article (https://www.linkedin.com/pulse/harmonization-knowledge-management-semantic-single-source-zemwe), we now present a reference model for a graph database.

A semantic graph is an instrument for modeling the “real world”: its entities and their relationships to each other. Although a graph database usually uses readable identifiers to keep it maintainable for developers, the model itself is independent of any specific terminology or language. In principle, class names could just as well be universally unique identifiers (UUIDs). Without additional information, however, these are difficult for a human to manage in a model, so the actual identifiers can be enriched with annotations.

All entities, classes, and instances can reference a variety of annotations. Because annotations can carry a country or language designation, a graph database is easy to internationalize. All entities can also be found via their annotations using the query language SPARQL.

For example, if a product in an e-commerce platform is to be found by its name, annotations can be used to support different spellings for the same product without affecting the relations or semantics of the actual data set. Class equivalence in semantic graph databases is also worth mentioning. For example, if you define that the classes Error, Bug, and Defect are equivalent, a query for all instances of the class Error will also return all instances of the classes Bug and Defect - notably using a single, central equivalence declaration and not one per individual query or even per instance.
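
The effect of a single equivalence declaration and of annotation-based lookup can be illustrated with a minimal sketch in plain Python. It does not use a real triple store or SPARQL; triples are simple tuples, and all identifiers (the ex: names, the labels, the function names) are assumptions made up for this illustration.

```python
# Triples are modelled as plain (subject, predicate, object) tuples.
# All identifiers and labels are invented for this illustration.
TRIPLES = {
    ("ex:Bug", "owl:equivalentClass", "ex:Error"),
    ("ex:Defect", "owl:equivalentClass", "ex:Error"),
    ("ex:issue1", "rdf:type", "ex:Error"),
    ("ex:issue2", "rdf:type", "ex:Bug"),
    ("ex:issue3", "rdf:type", "ex:Defect"),
    ("ex:issue4", "rdf:type", "ex:Feature"),
    # Language-tagged annotations: the object is a (text, language) pair.
    ("ex:Error", "rdfs:label", ("Error", "en")),
    ("ex:Error", "rdfs:label", ("Fehler", "de")),
}


def equivalent_classes(cls, triples):
    """Return cls together with all classes declared equivalent to it."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p != "owl:equivalentClass":
                continue
            # Equivalence is symmetric, so follow the declaration both ways.
            for a, b in ((s, o), (o, s)):
                if a in found and b not in found:
                    found.add(b)
                    changed = True
    return found


def instances_of(cls, triples):
    """Return all instances of cls or of any class equivalent to it."""
    classes = equivalent_classes(cls, triples)
    return sorted(s for s, p, o in triples if p == "rdf:type" and o in classes)


def find_by_label(text, triples):
    """Find entities by annotation text, ignoring case and language."""
    wanted = text.lower()
    return sorted(
        s for s, p, o in triples
        if p == "rdfs:label" and o[0].lower() == wanted
    )


print(instances_of("ex:Error", TRIPLES))
print(find_by_label("fehler", TRIPLES))
```

The equivalence is declared once in the data, so a query for Error, Bug, or Defect returns the same three instances without any per-query logic.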

Mapping in the Reference Model

While mapping source data to a central schema in a data warehouse primarily takes place programmatically or declaratively at the data level, in a semantic Single Source of Truth (SSOT) it is subject to semantic conventions. A common terminology is required for uniform, harmonized processing of content.

Here's an example: in one environment, a field "estimation" could mean the effort in hours, while in another a field "effort" could mean the cost in euros. A field "cost" could mean an amount in euros or dollars, and a field "time" could mean a timestamp relative to another or a duration, and thus implicitly perhaps also the effort. Even these few terms and their many ambiguities show how important a semantic reference model is.

Currently, most data sources do not yet provide machine-readable meta-information. However, heterogeneous data can only be compared, aggregated, or analyzed across tools if its meaning is unambiguously defined and harmonized. For example, two weight specifications 0.1 and 300 from different systems can only be summed meaningfully and correctly if both their units (e.g. kg and g) and the conversion factor (1 kg = 1000 g) are known. Summing the raw numbers yields 300.1, whereas the semantically correct result is 400 g or 0.4 kg.
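
The weight example can be reproduced in a few lines of Python. This is only a sketch: the conversion table and function names are assumptions, and Decimal is used so that the numbers match the text exactly.

```python
from decimal import Decimal

# Conversion factors to the base unit gram (1 kg = 1000 g).
# The table and the function names are assumptions for this sketch.
FACTORS_TO_GRAM = {"g": Decimal(1), "kg": Decimal(1000)}


def naive_sum(values):
    """Sum the raw numbers and ignore the units (semantically wrong)."""
    return sum((amount for amount, _unit in values), Decimal(0))


def sum_in_grams(values):
    """Convert every value to grams before summing."""
    return sum(
        (amount * FACTORS_TO_GRAM[unit] for amount, unit in values),
        Decimal(0),
    )


weights = [(Decimal("0.1"), "kg"), (Decimal("300"), "g")]
print(naive_sum(weights))                     # raw sum without units
print(sum_in_grams(weights))                  # harmonized sum in g
print(sum_in_grams(weights) / Decimal(1000))  # harmonized sum in kg
```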

In the reference model, for example, the data property "time" could be defined as "Duration in seconds", "cost" as "Amount of money to be paid", and "estimate" as "Estimated duration in days". The decisive factor is not how the terms are defined in an individual domain, but that they are defined. If necessary, further terms such as "timestamp" with "Point in time in GMT" or "timeInDays" with "DurationInDays" as a descendant of "time" can easily be added. A reference model evolves over time, but it should be well thought out, coordinated, and kept upward compatible as far as possible.

Source data always exists in the context of its respective domain model and thus also in the context of its domain-specific terminology. The art of mapping the primary data sources to a reference model lies, on the one hand, in mapping classes, instances, and properties of the domain models as completely as possible to the reference model - taking into account possible relationships and dependencies among them - and, on the other hand, in reusing classes and properties that already exist in the reference model as much as possible. The latter can become a challenge with extensive ontologies.

image-20240521-091819.png

If additional data sources are to be integrated into an existing SSOT, the reference model is ideally first extended in a semantically meaningful way and the mapping is then built on top of it. If new classes or properties need to be created and integrated into the structure of the reference model, they should be clearly defined and documented immediately, and any equivalence with other entities should be specified, to avoid ambiguities from the beginning. The more ambiguities occur in a reference model, the less effective it is in terms of harmonization.

Demarcation of Database Architectures

When consolidating data, the terms data warehouse and data lake inevitably come into play. Both approaches differ in various aspects from a Single Source of Truth (SSOT).

While a data warehouse serves as a central repository for structured data with a specific purpose, a data lake consolidates raw data at a central location whose purpose and use have not yet been determined. Before data is placed in a warehouse, it is prepared and is already subject to a schema when writing (schema-on-write). What makes the contents of a warehouse easy to understand also makes changes more complex, because consumers such as BI tools use them directly, for example for dashboards. The target group therefore tends to be business professionals who want to accelerate decision-making with fast analysis results.

Data in a lake is not subject to a fixed schema; it is usually very dynamic, unfiltered, extensive, and loosely organized. Its contents are easy to change and ideal for machine learning and for the target group of data scientists, but they are harder to understand. Navigation, data quality, and data governance are also more difficult. A schema is only applied when reading from the lake (schema-on-read).

The semantic SSOT approach presented here differs in several points from both approaches. Although a taxonomy in the form of a pure class hierarchy, and also the reference model, can be seen as a kind of schema, the individuals - the objects or records in database terminology - do not have to follow a specific schema, neither when writing nor when reading. A class assignment can be explicitly defined for an instance, but it can also be derived semantically from the use of certain properties, i.e., result implicitly.
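
How a class assignment can result implicitly from the use of a property can be sketched as follows. The sketch mimics, in a strongly simplified form, the kind of inference that a domain declaration on a property allows; the property names, class names, and the function are assumptions for illustration only.

```python
# Hypothetical domain declarations: using the property implies the class.
PROPERTY_DOMAINS = {
    "ex:weight": "ex:MaterialThing",
    "ex:dueDate": "ex:Task",
}


def classes_of(individual, triples):
    """Return explicitly asserted classes plus classes implied by properties."""
    classes = set()
    for s, p, o in triples:
        if s != individual:
            continue
        if p == "rdf:type":
            classes.add(o)                    # explicit class assignment
        elif p in PROPERTY_DOMAINS:
            classes.add(PROPERTY_DOMAINS[p])  # implicit, derived from usage
    return classes


triples = [
    ("ex:crate1", "rdf:type", "ex:Product"),
    ("ex:crate1", "ex:weight", 12.5),
]
print(sorted(classes_of("ex:crate1", triples)))
```

The individual ex:crate1 is never declared to be a material thing; the class follows from the fact that it has a weight.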

In an SSOT, the properties are usually provided with additional semantic information (the metadata), documented with the help of annotations, and shared by several individuals. One example is the property weight, defined with the base unit kg for all material things as the top class. This makes the knowledge in an SSOT very easy to understand, content changes and extensions are simple, and in the ideal case migrations can even be automated.

In addition, graph technology allows all information to be linked with each other, which makes an SSOT just as attractive for data analysts at the level of the query language SPARQL as it is for business professionals at the level of high-level knowledge APIs with corresponding parameters and security settings. The SSOT then follows the data hub pattern: information does not have to be persisted on hard disks, but can also simply be passed through. This is an essential aspect when it comes to security and data protection, and it does not mean giving up the welcome features of transformation and validation, and thus central quality management.

So let's take a look at the architecture of an SSOT (Figure 1).

image-20240521-092210.png

Inbound APIs

The path of data into an SSOT begins with the transport from the primary sources into the hub. The term "import" is deliberately not used here, as it is associated with a persistence that does not necessarily take place in a hub, since the hub may only pass the data through, if necessary.

Regardless of the type and scope of the data, adapters first ensure the physical connection of the primary sources. The adapters can not only pull the data, i.e., read it programmatically from existing APIs of the source, but also wait passively for its delivery via hooks.

If the hub also supports push delivery, then in addition to scheduled and interval-controlled retrieval via classic APIs such as REST, JSONP, or SOAP, systems that provide their data via FTP, SFTP, message queues, or file sharing can also be integrated easily. Event or file monitoring mechanisms then pick up the data exactly at the time it is delivered by the primary source.

Import and Persistence vs. On-Demand Queries

Unlike a data warehouse, the hub character of the SSOT leaves it to the operator to decide whether to import and persist data or simply to pass it on to consumers. Both options have advantages and disadvantages.

It seems clear that information from big data sources such as Hadoop or Elasticsearch cannot be meaningfully kept redundantly in an SSOT as a shadow system. Here, the semantic model in the SSOT supports formulating the right on-demand queries to the connected systems and preparing the data there - that is, filtering, aggregating, grouping, and sorting it there as needed, and finally processing only the desired data extract in the SSOT. In most cases, the big data engines will generate the extracts faster than would be possible in an SSOT anyway, because they have been specifically optimized for this purpose. The art lies in the optimal combination of both technologies.

Although on-demand queries can cause latencies and thus performance disadvantages, an intelligent offline cache, for example in MongoDB or Redis, can compensate for this. Anyone who has ever integrated data from SAP into an SSOT will appreciate such a cache.

One advantage of persisting information within the SSOT is that all detailed information is available in a central location. Aggregated and filtered on-demand data from external sources cannot be broken down further; for drill-down reports, for example, it must be requested anew with additional queries. In contrast, all information persisted in the graph database can be linked and related arbitrarily with comparatively simple SPARQL queries - a decisive aspect, especially when the integration of different data sources in dashboards is accompanied by intensive user interactions.

Reference Model vs. Domain Models

Once the raw data from the primary sources has arrived at the SSOT, it can optionally be transformed directly into the reference model or first into a domain-specific model within the graph database and then mapped against the reference model.

While the direct transformation is more similar to the data warehouse architecture with a central schema, the transformation via intermediate domain models is more reminiscent of the lake architecture with independent schemas in a superordinate database.

In a graph database, however, the advantages of both approaches can be very practically combined. With regard to the transformation of raw data into a semantic model, there is basically no difference whether it is transferred first into a domain model or directly into the reference model, because both are ultimately semantic models that follow the same conventions.

For reasons of easier maintainability, the instances of the different primary sources are often initially managed in separate graphs. This makes it easy to remove the contents of certain sources en masse from the graph database, re-import them, or process them independently of others. Side effects can thus be conveniently avoided.

SPARQL allows multiple graphs to be queried within a single query. This makes it easy to combine results from multiple sources into a single result set. Since the domain models are also already subject to semantic definitions, the integration of information from multiple semantic domain models is very simple. Therefore, there are large knowledge graphs that do not have a central reference model, but handle the data harmonization exclusively via SPARQL queries - one of the paradigms of the Semantic Web.
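
The idea of keeping each primary source in its own graph, removing a source en masse, and still querying across graphs can be sketched with a small in-memory store. This is a toy model in plain Python, not a triple store; the class GraphStore, its methods, and the graph names are assumptions for illustration.

```python
from collections import defaultdict


class GraphStore:
    """Toy store that keeps the triples of each source in a separate graph."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def insert(self, graph, triple):
        self._graphs[graph].add(triple)

    def drop(self, graph):
        """Remove all contents of one source without touching the others."""
        self._graphs.pop(graph, None)

    def match(self, predicate, graphs=None):
        """Collect (graph, subject, object) for a predicate across graphs."""
        names = list(self._graphs) if graphs is None else graphs
        result = []
        for name in names:
            for s, p, o in self._graphs.get(name, ()):
                if p == predicate:
                    result.append((name, s, o))
        return sorted(result)


store = GraphStore()
store.insert("graph:erp", ("ex:customer1", "ex:name", "ACME"))
store.insert("graph:crm", ("ex:customer2", "ex:name", "Globex"))

print(store.match("ex:name"))                       # one result set, two sources
print(store.match("ex:name", graphs=["graph:crm"]))  # restricted to one source
store.drop("graph:erp")                              # e.g. before a re-import
print(store.match("ex:name"))
```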

Another argument for upstream domain models is the easier identification of possible inconsistencies between them. If customers from orders are found in the ERP domain model that do not exist in the CRM domain model, this can be very easily found and reported with SPARQL.
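
The consistency check between two domain models boils down to a set difference. In an SSOT it would typically be expressed as a SPARQL query over the two graphs; the following plain Python sketch only shows the logic, with made-up record structures and field names.

```python
def customers_missing_in_crm(erp_orders, crm_customers):
    """Return customer ids referenced by ERP orders but unknown to the CRM."""
    known = {customer["id"] for customer in crm_customers}
    referenced = {order["customer_id"] for order in erp_orders}
    return sorted(referenced - known)


# Hypothetical extracts of the two domain models.
erp_orders = [
    {"order": "O-1", "customer_id": "C-1"},
    {"order": "O-2", "customer_id": "C-7"},
]
crm_customers = [{"id": "C-1"}, {"id": "C-2"}]

print(customers_missing_in_crm(erp_orders, crm_customers))
```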

Last but not least, independent domain models are advantageous when updates are made to the primary sources and migrations become necessary. In these cases, only the domain models and the involved SPARQL queries need to be adjusted. The other models and instances remain unaffected - an important aspect for the availability of an SSOT in productive operation.

However, domain models do not compete with a reference model, but use it. A class such as Priority, for example, is sensibly part of the reference model. With SPARQL, it can serve as a central definition for the output format, and the contents of the relevant domain models are mapped to it "on the fly".

Inbound Transformation

The actual transformation of raw data from a specific format into a semantic model is more of a technical than a semantic task. The primary challenge here is the implementation of a variety of different data formats that may occur in the primary sources.

Ultimately, it comes down to breaking down incoming data into triples and inserting them into the graph database using SPARQL's INSERT commands, optionally into the graph of a domain model or directly into that of the reference model.

A proven practice is to agree on an intermediate format between the data format of the primary source and the triples. Almost all programming languages support JSON and XML for structured or hierarchically organized data. If only flat structures occur, CSV can be another option. The purpose of the intermediate format is broad support by existing tools and libraries, as well as the ability to implement the conversion to RDF/OWL triples only once centrally. JSON has proven to be a simple, established, and widely supported format here.
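
A minimal sketch of the central conversion step, from a JSON record in the intermediate format to triples and on to a SPARQL INSERT DATA statement, could look as follows. The prefix ex:, the id field, the graph IRI, and the function names are assumptions; the sketch also assumes that the JSON keys are valid local names and that the values are flat.

```python
import json


def json_to_triples(record, prefix="ex:", id_field="id"):
    """Break a flat JSON object down into (subject, predicate, object) triples."""
    subject = f"{prefix}{record[id_field]}"
    return [
        (subject, f"{prefix}{key}", value)
        for key, value in record.items()
        if key != id_field
    ]


def literal(value):
    """Render a Python value as a simple SPARQL literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    return json.dumps(str(value))  # quoted and escaped string


def to_insert_data(triples, graph_iri):
    """Build one INSERT DATA statement that targets a named graph."""
    body = "\n".join(f"    {s} {p} {literal(o)} ." for s, p, o in triples)
    return (
        "PREFIX ex: <http://example.org/>\n"
        f"INSERT DATA {{ GRAPH <{graph_iri}> {{\n{body}\n}} }}"
    )


record = json.loads('{"id": "T1", "title": "Fix login", "estimate": 3}')
triples = json_to_triples(record)
print(to_insert_data(triples, "http://example.org/graph/tasks"))
```

Because every source is first converted to the same intermediate format, this conversion only has to be implemented once, regardless of how many primary sources are connected.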

The SSOT can be supplied with data from the primary sources in a variety of ways. These include direct access to SQL or NoSQL databases, interfaces to big data services such as SAP4 HANA, Hadoop, or Elasticsearch, files in Excel, CSV, TSV, or JSON format, REST and SOAP APIs, message queues or log files, and of course also external knowledge databases and ontologies on the web (Figure 2).

image-20240521-092658.png

Mapping Adapters

Many of the operations required for a transformation can already be handled by simple mapping tables and configurable conversion functions, for example for field names, date specifications, string or number conversions. But where specific or more complex transformations are required, so-called mapping adapters help.

These receive the raw data in the format of the primary source and convert it into the intermediate format, here JSON. This is helpful where source data needs to be prepared or information needs to be merged or converted already on the basis of the source data, for example local time zone adjustments into UTC/GMT formats, special encodings to UTF8, or special mappings using manual auxiliary or cross-reference tables.
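
A mapping adapter of this kind can be sketched as a small function that renames fields via a cross-reference table and normalizes a local timestamp to UTC. The field names, the date format, and the fixed UTC offset are assumptions chosen for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cross-reference table: source field name -> reference name.
FIELD_MAP = {"Aufwand": "effort", "Erstellt": "created"}


def adapt(raw, utc_offset_hours):
    """Convert one raw source record into the intermediate (JSON-like) format."""
    source_tz = timezone(timedelta(hours=utc_offset_hours))
    result = {}
    for key, value in raw.items():
        target = FIELD_MAP.get(key, key)  # unknown fields keep their name
        if target == "created":
            # Interpret the local time of the source and convert it to UTC.
            local = datetime.strptime(value, "%d.%m.%Y %H:%M")
            local = local.replace(tzinfo=source_tz)
            value = local.astimezone(timezone.utc).isoformat()
        result[target] = value
    return result


print(adapt({"Aufwand": 5, "Erstellt": "21.05.2024 09:18"}, utc_offset_hours=2))
```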

Inbound Validation

The first stage of quality assurance in a Single Source of Truth already takes place during import. After the raw data has been transformed into the intermediate format, a downstream layer in the software stack checks the data for completeness and validity according to the domain-specific rules - at this level, not yet against the reference model.

For example, primary systems may not be reachable during a download attempt, or data may be loaded incompletely due to errors. In the event of an error, it is also practical to notify the owner of the data already during import that a correction is required. The inbound validation thus prevents invalid data from reaching the SSOT in the first place, thereby avoiding any subsequent errors.
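
A minimal sketch of such a validation layer is shown below. It checks records in the intermediate format against simple domain-specific rules and separates accepted from rejected records, so that the rejected ones can be reported to the data owner. The rule table and function names are assumptions.

```python
# Hypothetical domain-specific rules: required field -> accepted type(s).
REQUIRED_FIELDS = {"id": str, "effort": (int, float)}


def validate_record(record):
    """Return a list of error messages; an empty list means the record is valid."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"invalid type for field: {field}")
    return errors


def split_valid(records):
    """Separate records that may enter the SSOT from those that may not."""
    accepted, rejected = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            rejected.append((record, errors))  # basis for a notification
        else:
            accepted.append(record)
    return accepted, rejected


accepted, rejected = split_valid([
    {"id": "T1", "effort": 3},
    {"id": "T2"},
    {"id": "T3", "effort": "three"},
])
print(accepted)
print(rejected)
```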

Offline Cache

A particular challenge arises when primary systems do not have historical data, evaluable log files, or audit trails and only provide snapshots of their current status. In this case, reports over time periods or trend analyses are not possible without further tools.

One of these tools is an offline cache, which saves the snapshots at regular intervals and can thus simulate a history. Even if primary systems are offline in complex environments or are updated without coordination with other parties, work can still continue for a while with an offline cache and the availability of the SSOT is maintained.

An offline cache is also useful for performance reasons. If the queries to a primary system take a very long time or burden the systems to such an extent that frequent queries from the SSOT lead to performance losses for its users, then the required data can be retrieved comfortably and quickly from the cache. For many reports, real-time statements are often not required at all, so updating the offline cache on a daily basis, for example, can significantly relieve the overall system.
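
How stored snapshots can simulate a history and serve queries while a primary system is unavailable can be sketched with a small in-memory class. A real cache would persist its data, for example in a database; the class name, its methods, and the sample data are assumptions.

```python
import bisect


class SnapshotCache:
    """Keeps timestamped snapshots of a source that has no history itself."""

    def __init__(self):
        self._times = []
        self._snapshots = []

    def store(self, timestamp, snapshot):
        """Insert a snapshot so that the list stays ordered by time."""
        index = bisect.bisect_right(self._times, timestamp)
        self._times.insert(index, timestamp)
        self._snapshots.insert(index, snapshot)

    def latest(self):
        """Answer from the cache, e.g. when the primary system is offline."""
        return self._snapshots[-1] if self._snapshots else None

    def at(self, timestamp):
        """Return the snapshot that was current at the given time."""
        index = bisect.bisect_right(self._times, timestamp)
        return self._snapshots[index - 1] if index else None

    def history(self, key):
        """Return the value of one key over time, e.g. for a trend chart."""
        return [(t, s.get(key)) for t, s in zip(self._times, self._snapshots)]


cache = SnapshotCache()
cache.store(1, {"open_issues": 5})
cache.store(2, {"open_issues": 3})
print(cache.history("open_issues"))
```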

Central Services

In general, a major advantage of consolidating data in a central location is that many services no longer need to be implemented multiple times per data source, but only once, namely centrally in the SSOT.

Good reports and performance indicators require good data, which gives data quality management a high priority. If many people work on central data at the same time, it can be synchronized centrally and colleagues can be informed of changes immediately via subscriptions.

If changes are made to data, a Single Source of Truth can not only provide version control and manage dependencies, but also initiate cross-departmental coordination or cross-application approval processes. It can support central audits, and information can be provided with labels for indexing or with attachments for additional details.

A central access control system via rights and roles or policies regulates who may access or change which knowledge in the SSOT and how. The SSOT thus becomes not only a valuable knowledge hub and enterprise asset, but also an effective collaboration tool.

Outbound Transformation

The outbound transformation serves to provide harmonized data to its consumers. In a semantic Single Source of Truth, the information is available as RDF triples. In order for it to be processed by the target systems such as BI tools, it must be transformed back into a supported data format.

Similar to the inbound transformation, corresponding adapters also help here. The query language SPARQL returns the results as simply structured result sets in column and row format. These can be transformed into any conceivable data format, such as JSON, XML, CSV, or TSV. Customer-specific formats can easily be implemented via custom adapters, mostly as plug-ins.
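
Since a result set is just columns and rows, outbound adapters for common formats are short. The following sketch uses only the Python standard library; the function names and the sample result set are assumptions.

```python
import csv
import io
import json


def to_json(columns, rows):
    """Render a column/row result set as a JSON array of objects."""
    return json.dumps([dict(zip(columns, row)) for row in rows])


def to_csv(columns, rows, delimiter=","):
    """Render a result set as CSV; pass delimiter="\t" for TSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


columns = ["name", "weight"]
rows = [["crate", 400], ["box", 120]]
print(to_json(columns, rows))
print(to_csv(columns, rows))
print(to_csv(columns, rows, delimiter="\t"))
```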

Outbound Validation

The last stage before delivering result sets to consumers is outbound validation. Given the extensive checks and measures for data quality in the SSOT, the question arises as to why outbound validation is necessary at all.

On the one hand, it is an (optional) additional verification level that ensures that the transfer to the target system is as error-free as possible. Especially in complex heterogeneous tool environments, different developers work on the various SSOT services. If, for example, one of the transformations is adapted after a model update, the outbound validation can reject an import in the event of an error and thus avoid subsequent errors in the target system. It is thus a kind of final testing and quality assurance instance.

On the other hand, it is an important additional security instance. Independently of all upstream software layers and any errors or manipulations there due to attacks, final checks can be carried out here on sensitive data that must not be delivered to consumers under any circumstances - combined with appropriate notification measures, a welcome protection against data leaks.
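
Both roles of outbound validation, the structural check and the last line of defense against leaking sensitive data, can be sketched in one function. The list of forbidden fields, the exception class, and the function name are assumptions for illustration.

```python
# Hypothetical fields that must never leave the SSOT.
FORBIDDEN_FIELDS = {"iban", "salary"}


class OutboundValidationError(Exception):
    """Raised when a result set must not be delivered to the consumer."""


def validate_outbound(rows, expected_columns):
    """Reject the delivery if a row leaks sensitive data or lacks columns."""
    for index, row in enumerate(rows):
        leaked = FORBIDDEN_FIELDS.intersection(row)
        if leaked:
            raise OutboundValidationError(
                f"row {index}: sensitive fields {sorted(leaked)}"
            )
        missing = set(expected_columns).difference(row)
        if missing:
            raise OutboundValidationError(
                f"row {index}: missing columns {sorted(missing)}"
            )
    return rows


ok = validate_outbound([{"name": "crate", "weight": 400}], ["name", "weight"])
print(ok)

try:
    validate_outbound([{"name": "Alice", "iban": "..."}], ["name"])
except OutboundValidationError as error:
    print("delivery rejected:", error)  # a notification could be sent here
```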

Outbound APIs

After outbound transformation and validation, the data is ready in terms of content for delivery to the consumer. The outbound APIs are now available for transport to the target systems.

For pulling by consumers, these can be, for example, REST or JSONP APIs, but also message queues or file shares for CSV/TSV or even Excel files. If the consumers provide hooks, i.e., upload APIs on their part, the SSOT can also actively push results. For this purpose, the outbound API can be provided with a scheduler to automate delivery processes.

Larger amounts of data can first be pushed into SQL or NoSQL databases as intermediary instances - useful where only standard configurations such as JDBC/ODBC are supported in BI, charting, or reporting tools, or where only certified out-of-the-box adapters and no manual implementations may be used for security reasons.

Customized Post-Processing

One motivation for introducing an SSOT is to enable cross-tool reporting and to support central business decisions with cross-departmental key performance indicators (KPIs). This can be realized in the frontend in different ways. For example, BI tools such as Tableau or Power BI could be used, alternatively a separate web app with a chart engine.

The two frontends have different requirements. While it is more suitable for Tableau to provide extensive amounts of data - possibly even in SQL or NoSQL buffers - which are then interactively filtered or aggregated in dashboards in the tool itself, a pure chart engine is usually only supplied with exactly the data required for a currently desired visualization. Think, for example, of adjusting timestamps to readable date information or smoothed average lines - transformations and calculations that are more practical to perform at the application level rather than the database level.

The use of server-side JavaScript proves to be practical here. A post-processing script can very easily enrich a JSON data set from the SSOT with additional information or perform special calculations before delivery. In addition, JavaScript can be easily modified at runtime, which even allows necessary adjustments with uninterrupted availability of the overall system.
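
The kind of post-processing meant here, readable dates and a smoothed average line, is independent of the scripting language. The article names server-side JavaScript; the following sketch expresses the same idea in Python, with made-up field names and a simple trailing moving average as the smoothing method.

```python
from datetime import datetime, timezone


def post_process(points, window=3):
    """Enrich (unix_timestamp, value) pairs for a chart engine."""
    result = []
    values = []
    for timestamp, value in points:
        values.append(value)
        recent = values[-window:]  # trailing window, shorter at the start
        moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        result.append({
            "date": moment.strftime("%Y-%m-%d"),
            "value": value,
            "moving_avg": sum(recent) / len(recent),
        })
    return result


# Hypothetical daily measurements as (unix timestamp, value).
points = [(0, 2), (86400, 4), (172800, 6), (259200, 8)]
for row in post_process(points, window=2):
    print(row)
```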

In certain cases, it may be necessary to add further data to the result set for a specific target that should not - or must not - be managed in the Single Source of Truth itself. Reasons for this can be, for example, corporate or security policies or legal restrictions: from a company perspective, for example, the protection of intellectual property, from a customer perspective the protection of payment information, or from an employee and legal perspective, labor law or the regulations of the GDPR.

Depending on the necessity, the post-processing can therefore take place in a separate process within the SSOT platform, on a separate system, or even only on the target system itself. It is important to realize the processes as independently of each other as possible, for which cloud-based microservice structures are particularly suitable.

Security

In contrast to traditional SQL or NoSQL databases, in a graph database the information is not available as tables or collections, rows and columns or documents, but as a multitude of referencing triples, precisely as "linked information".

Access control for a graph is correspondingly challenging, mainly because no schemas are prescribed, because of multiple inheritance, because everything can be linked with everything, and also because identical properties can be shared between a variety of different individuals.

These are desired and welcome features for the smart app developer, but they can give the chief security officer a headache.

The concept of the view, a fixed query implemented in the database, is known from the SQL world. A view has a defined scope regarding the data to be considered, possibly clearly defined and limited filters as query parameters, and above all is subject to configurable access control.

In a Single Source of Truth, we can use a similar concept with the help of a so-called facade, but go one step further. Semantic graph databases distinguish themselves from other databases in particular by their ability to process metadata, i.e., by the ability to disclose information about the data itself.

Access to the knowledge base is therefore usually not directly via SPARQL, but via the facade. This reads the list of views stored in the graph and their parameters and access permissions. Each app can thus determine for itself which APIs are available to it or to the consumer.

Furthermore, the parameters required by the call can be queried. Thus, the app can also react dynamically to changing requirements or possibilities. A reporting system, for example, can automatically offer for selection all filters that an API request supports.
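
The facade idea can be sketched as a small registry of views with their parameters and permitted roles. In an SSOT this metadata would be stored in the graph itself; here it is a plain dictionary, and the view names, roles, and functions are assumptions for illustration.

```python
# Hypothetical view metadata; in an SSOT this would be read from the graph.
VIEWS = {
    "open_issues": {
        "parameters": ["project", "priority"],
        "roles": {"analyst", "manager"},
    },
    "cost_report": {
        "parameters": ["quarter"],
        "roles": {"manager"},
    },
}


def available_views(role):
    """List the views (APIs) that a given role is allowed to call."""
    return sorted(name for name, view in VIEWS.items() if role in view["roles"])


def view_parameters(name, role):
    """Return the filters a view offers, after checking access permissions."""
    view = VIEWS[name]
    if role not in view["roles"]:
        raise PermissionError(f"role {role!r} may not access view {name!r}")
    return list(view["parameters"])


print(available_views("analyst"))
print(view_parameters("open_issues", "analyst"))
```

An app can call available_views at startup and build its filter dialogs from view_parameters, so that new views or parameters become usable without changes to the app.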

Conclusion

The semantic Single Source of Truth combines a multitude of positive characteristics of existing database systems and architectures: It enriches data with meaning and creates the basis for a common understanding of it, and thus not only for better communication and more efficient collaboration between people, but also for greater interoperability and easier exchange of information between systems.

Knowledge becomes centrally available and reusable, redundancies are eliminated, dependencies and inconsistencies are uncovered, transparency and better plannability emerge. Data quality is improved, operating, maintenance and infrastructure costs decrease.

Certainly, existing database systems do not have to be converted immediately, but a closer look at where a semantic Single Source of Truth can provide concrete support is certainly worthwhile. We will keep you up to date here with further insights into these technologies.

Author: Alexander Schulze, first published October 2019, translation and editorial work by Ashesh Goplani, thanks for your contribution!
