Future-Proofing Data Architecture for Better Data Quality
In this article, explore best practices for building robust data engineering systems to deliver high-quality data.
This is an article from DZone's 2023 Data Pipelines Trend Report. For more, read the report.
Data quality is an inseparable part of data engineering. Because any data insight can only be as good as its input data, building robust and resilient data systems that consistently deliver high-quality data is the data engineering team's most important responsibility. Achieving and maintaining adequate data quality is no easy task, and it requires data engineers to design data systems with data quality in mind. In a hybrid world of data at rest and data in motion, engineering data quality can look significantly different for batch and event streaming systems.
This article will cover key components in data engineering systems that are critical for delivering high-quality data:
- Monitoring data quality – Given any data pipeline, how to measure the correctness of the output data, and how to ensure the output is correct not only today but also in the foreseeable future.
- Data recovery and backfill – In case of application failures or data quality violations, how to perform data recovery to minimize impact on downstream users.
- Preventing data quality regressions – When data sources undergo changes or when adding new features to existing data applications, how to prevent unexpected regression.
Monitoring Data Quality
As the business evolves, the data also evolves. Measuring data quality is never a one-time task, and it is important to continuously monitor the quality of data in data pipelines to catch any regressions at the earliest stage possible. The very first step of monitoring data quality is defining data quality metrics based on the business use cases.
Defining Data Quality
Defining data quality means setting expectations for the output data and measuring, as quantitative metrics, how far the actual data deviates from those expectations. When defining data quality metrics, the very first thing data engineers should consider is, "What truth does the data represent?" For example, the output table should contain all advertisement impression events that happened on the retail website. The data quality metrics should be designed to ensure the data system accurately captures that truth.
In order to accurately measure the data quality of a data system, data engineers need to track not only the baseline application health and performance metrics (such as job failures, completion timestamp, processing latency, and consumer lag) but also customized metrics based on the business use cases the data system serves. Therefore, data engineers need to have a deep understanding of the downstream use cases and the underlying business problems.
As the business model determines the nature of the data, business context allows data engineers to grasp the meanings of the data, traffic patterns, and potential edge cases.
While every data system serves a different business use case, some common patterns in data quality metrics can be found in Table 1.
METRICS FOR MEASURING DATA QUALITY IN A DATA PIPELINE

| Type | Example Metric |
|---|---|
| Application health | The number of jobs succeeded or running (for streaming) should be N. |
| SLA/latency | The job completion time should be by 8 a.m. PST daily. The max event processing latency should be < 2 seconds (for streaming). |
| Schema | Column account_id should be INT type and can't be NULL. |
| Column values | Column account_id must be positive integers. Column account_type can only have the values: FREE, STANDARD, or MAX. |
| Comparison with history | The total number of confirmed orders on any date should be within +20%/-20% of the daily average of the last 30 days. |
| Comparison with other datasets | The number of shipped orders should correlate to the number of confirmed orders. |

Table 1
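As an illustration, below is a minimal PySpark sketch of how the column-value and comparison-with-history checks from Table 1 might be implemented. The table name orders, the column names, and the dates are assumptions for illustration only, not a prescribed implementation.

```python
# A minimal sketch of column-value and history-comparison checks,
# assuming a Spark table named "orders" with the columns from Table 1.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
orders = spark.table("orders")  # hypothetical table name

# Column-value checks: account_id must be a positive integer,
# account_type must be one of the allowed values.
bad_rows = orders.filter(
    F.col("account_id").isNull()
    | (F.col("account_id") <= 0)
    | ~F.col("account_type").isin("FREE", "STANDARD", "MAX")
).count()
assert bad_rows == 0, f"{bad_rows} rows violate column-value expectations"

# Comparison with history: today's confirmed orders should fall within
# +/-20% of the trailing 30-day daily average.
today_count = orders.filter(F.col("order_date") == "2023-10-01").count()
avg_count = (
    orders.filter(F.col("order_date").between("2023-09-01", "2023-09-30"))
    .groupBy("order_date").count()
    .agg(F.avg("count")).first()[0]
)
assert 0.8 * avg_count <= today_count <= 1.2 * avg_count, "Daily volume outside expected range"
```

In practice, such checks would be scheduled alongside the pipeline and wired into alerting rather than raised as bare assertions.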
Implementing Data Quality Monitors
Once a list of data quality metrics is defined, these metrics should be captured as part of the data system and metric monitors should be automated as much as possible. In case of any data quality violations, the on-call data engineers should be alerted to investigate further. In the current data world, data engineering teams often own a mixed bag of batched and streaming data applications, and the implementation of data quality metrics can be different for batched vs. streaming systems.
Batched Systems
The Write-Audit-Publish (WAP) pattern is a data engineering best practice widely used to monitor data quality in batched data pipelines. It emphasizes the importance of always evaluating data quality before releasing the data to downstream users.
Figure 1: Write-Audit-Publish pattern in batched data pipeline design
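Below is a minimal sketch of the WAP flow in PySpark. The staging and production table names, the partition date, and the audit query are illustrative assumptions; the point is the ordering: data is written to staging, audited, and only then published.

```python
# A minimal Write-Audit-Publish sketch, assuming Spark with a staging table
# and a production table; names, dates, and the audit query are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wap-example").getOrCreate()

# 1. Write: land the new data in a staging table that downstream users never read.
daily_output = spark.table("source_events").where("event_date = '2023-10-01'")
daily_output.write.mode("overwrite").saveAsTable("staging.ad_impressions_20231001")

# 2. Audit: run data quality checks against the staged data only.
null_ids = spark.sql(
    "SELECT COUNT(*) AS c FROM staging.ad_impressions_20231001 WHERE account_id IS NULL"
).first()["c"]
if null_ids > 0:
    raise ValueError(f"Audit failed: {null_ids} rows with NULL account_id")

# 3. Publish: only after all audits pass, move the data into the production table.
spark.table("staging.ad_impressions_20231001") \
    .write.insertInto("prod.ad_impressions", overwrite=True)
```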
Streaming Systems
Unfortunately, the WAP pattern is not applicable to data streams because event streaming applications have to process data nonstop, and pausing production streaming jobs to troubleshoot data quality issues would be unacceptable. In a Lambda architecture, the output of event streaming systems is also stored in lakehouse storage (e.g., an Apache Iceberg or Apache Hudi table) for batched usage. As a result, it is also common for data engineers to implement WAP-based batched data quality monitors on the lakehouse table.
To monitor data quality in near real time, one option is to implement data quality checks as real-time queries on the output, such as an Apache Kafka topic or an Apache Druid datasource. For large-scale output, sampling is typically applied to improve the query efficiency of aggregated metrics. Helper frameworks such as Schema Registry can also help ensure that output events conform to a compatible, expected schema.
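As one possible shape of such a check, here is a sketch that polls an Apache Druid datasource through its SQL HTTP API and alerts when event volume drops below a floor. The broker URL, datasource name, and threshold are assumptions for illustration.

```python
# A minimal sketch of a near real-time quality check against an Apache Druid
# datasource via its SQL HTTP API; endpoint, datasource, and threshold are assumed.
import requests

DRUID_SQL_URL = "http://druid-broker:8082/druid/v2/sql"  # hypothetical broker address

query = """
SELECT COUNT(*) AS events
FROM ad_impressions
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '5' MINUTE
"""
resp = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=30)
resp.raise_for_status()
events_last_5_min = resp.json()[0]["events"]

# Alert if the event volume drops below an expected floor for this time window.
EXPECTED_MIN_EVENTS = 10_000  # assumed threshold, tuned per traffic pattern
if events_last_5_min < EXPECTED_MIN_EVENTS:
    print(f"ALERT: only {events_last_5_min} events in the last 5 minutes")
```

A scheduler or monitoring agent would run this on a short interval and forward the alert to the on-call engineer.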
Another option is to capture data quality metrics in an event-by-event manner as part of the application logic and log the results in a time series data store. This option introduces an additional side output but allows more visibility into intermediate data stages/operations and easier troubleshooting. For example, assume the application logic drops events that have an invalid account_id, account_type, or order_id. If an upstream system release introduces a large number of events with an invalid account_id, the output-based data quality metrics will show a decline in the total number of output events, but it would be difficult to identify which filter logic or column is the root cause without metrics or logs on intermediate data stages/operations.
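The sketch below illustrates this event-by-event approach. The process_event and emit_metric functions and the validation rules are hypothetical stand-ins for the real application logic and metrics backend (e.g., StatsD, Prometheus, or an internal metrics service).

```python
# A minimal sketch of per-stage drop metrics inside stream-processing logic;
# emit_metric is a placeholder hook into whatever time series store is used.
from collections import Counter

drop_counts = Counter()

def emit_metric(name: str, value: int) -> None:
    # Placeholder: forward to the metrics backend of choice.
    print(f"metric {name}={value}")

def process_event(event: dict):
    # Validate each column separately so the failing filter is visible in metrics.
    if not isinstance(event.get("account_id"), int) or event["account_id"] <= 0:
        drop_counts["invalid_account_id"] += 1
        return None
    if event.get("account_type") not in {"FREE", "STANDARD", "MAX"}:
        drop_counts["invalid_account_type"] += 1
        return None
    if not event.get("order_id"):
        drop_counts["invalid_order_id"] += 1
        return None
    return event  # event passes all filters and flows to the output

def flush_metrics() -> None:
    # Called periodically; a spike in one counter points directly at the root cause.
    for reason, count in drop_counts.items():
        emit_metric(f"dropped_events.{reason}", count)
    drop_counts.clear()
```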
Data Recovery and Backfill
Every data pipeline will fail at some point. Some of the common failure causes include:
- Incompatible source data updates (e.g., critical columns were removed from source tables)
- Source or sink data systems failures (e.g., sink databases became unavailable)
- Altered truth in data (e.g., data processing logic became outdated after a new product release)
- Human errors (e.g., a new build introduces new edge-case errors left unhandled)
Therefore, all data systems should be able to be backfilled at all times in order to minimize the impact of potential failures on downstream business use cases. In addition, in event streaming systems, the ability to backfill is also required for bootstrapping large stateful stream processing jobs.
The data storage and processing frameworks used in batched and streaming architectures are usually different, and so are the challenges that lie behind supporting backfill.
Batched Systems
The storage solutions for batched systems, such as AWS S3 and GCP Cloud Storage, are relatively inexpensive and source data retention is usually not a limiting factor in backfill. Batched data are often written and read by event-time partitions, and data processing jobs are scheduled to run at certain intervals and have clear start and completion timestamps.
The main technical challenge in backfilling batched data pipelines is data lineage: which jobs updated or read which partitions at what timestamp. Clear data lineage enables data engineers to easily identify downstream jobs impacted by problematic data partitions. Modern lakehouse table formats such as Apache Iceberg provide queryable table-level changelogs and history snapshots, which allow users to revert any table to a specific version in case a recent data update contaminated the table. The less queryable the data lineage metadata is, the more manual work is required for impact estimation and data recovery.
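For example, with Apache Iceberg and Spark, the snapshot history can be inspected and a contaminated table reverted roughly as follows. This is a sketch assuming the Iceberg Spark SQL extensions are enabled; the catalog name, table name, and snapshot ID are illustrative.

```python
# A minimal sketch of inspecting and reverting an Apache Iceberg table with Spark SQL,
# assuming a catalog named "prod" and the Iceberg Spark extensions configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-recovery").getOrCreate()

# Inspect the table's changelog: which snapshot was committed when, by which operation.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM prod.db.orders.snapshots"
).show()

# If a recent write contaminated the table, revert it to a known-good snapshot.
spark.sql(
    "CALL prod.system.rollback_to_snapshot('db.orders', 123456789)"  # illustrative snapshot_id
)
```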
Streaming Systems
The source data used in streaming systems, such as Apache Kafka topics, often have limited retention due to the high cost of low-latency storage. For instance, for web-scale data streams, data retention is often set to several hours to keep costs reasonable. As troubleshooting failures can take data engineers hours if not days, the source data could have already expired before backfill. As a result, data retention is often a challenge in event streaming backfill.
Below are the common backfill methodologies for event streaming systems:
METHODS FOR BACKFILLING STREAMING DATA SYSTEMS

| Method | Description |
|---|---|
| Replaying source streams | Reprocess source data from the problematic time period before those events expire in source systems (e.g., Apache Kafka). Tiered storage can help reduce stream retention cost. |
| Lambda architecture | Maintain a parallel batched data application (e.g., Apache Spark) for backfill, reading source data from a lakehouse storage with long retention. |
| Kappa architecture | The event streaming application is capable of streaming data from both data streams (for production) and lakehouse storage (for backfill). |
| Unified batch and streaming | Data processing frameworks, such as Apache Beam, support both streaming (for production) and batch mode (for backfill). |

Table 2
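As a sketch of the unified batch-and-streaming approach from Table 2, the following Apache Beam pipeline reads from Kafka in production mode and from lakehouse files in backfill mode while sharing the same transformation logic. The topic, path, and parsing details are assumptions, and the attribution step is only a placeholder.

```python
# A minimal Apache Beam sketch: one pipeline definition, two read modes.
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

def build_pipeline(p, backfill: bool):
    if backfill:
        # Backfill mode: read historical events from long-retention lakehouse storage.
        events = p | "ReadBackfill" >> beam.io.ReadFromParquet(
            "s3://lake/ad_impressions/2023-10-01/*"  # hypothetical path
        )
    else:
        # Production mode: read the live stream.
        events = p | "ReadStream" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka:9092"},
            topics=["ad_impressions"],
        )
    # Shared transformation logic runs unchanged in both modes.
    return events | "Attribute" >> beam.Map(lambda e: e)  # placeholder for attribution logic

with beam.Pipeline() as p:
    build_pipeline(p, backfill=True)
```

Keeping the business logic identical across both read paths is what makes backfill results directly comparable to production output.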
Preventing Data Quality Regressions
Let's say a data pipeline has a comprehensive collection of data quality metrics implemented and a data recovery mechanism to ensure that reasonable historical data can be backfilled at any time. What could go wrong from here? Without prevention mechanisms, the data engineering team can only react passively to data quality issues, finding themselves busy putting out the same fire over and over again. To truly future-proof the data pipeline, data engineers must proactively establish programmatic data contracts to prevent data quality regression at the root.
Data quality issues can come either from upstream systems or from the application logic maintained by data engineers. In both cases, data contracts should be implemented programmatically, for example, as unit tests and/or integration tests that stop any contract-breaking changes from going into production.
For example, let's say that a data engineering team owns a data pipeline that consumes advertisement impression logs for an online retail store. The expectations for the impression data logging should be implemented as unit and/or regression tests in the client-side logging test suite, since that logging is owned by the client and data engineering teams. The advertisement impression logs are stored in a Kafka topic, and the expectation on the data schema is maintained in a Schema Registry to ensure the events have compatible data schemas for both producers and consumers.
As the main logic of the data pipeline is attributing advertisement click events to impression events, the data engineering team developed unit tests with mocked client-side logs and dependent services to validate the core attribution logic and integration tests to verify that all components of the data system together produce the correct final output.
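A toy version of such a contract test might look like the following. The attribute_clicks function, the event shapes, and the attribution window are illustrative stand-ins for the pipeline's real functions and schemas.

```python
# A minimal sketch of a data-contract style unit test for attribution logic,
# using mocked impression and click events; all names and rules are illustrative.
import unittest

def attribute_clicks(impressions, clicks, window_seconds=1800):
    """Toy attribution: a click is attributed to the most recent impression
    for the same ad within the attribution window."""
    attributed = []
    for click in clicks:
        candidates = [
            imp for imp in impressions
            if imp["ad_id"] == click["ad_id"]
            and 0 <= click["ts"] - imp["ts"] <= window_seconds
        ]
        if candidates:
            latest = max(candidates, key=lambda imp: imp["ts"])
            attributed.append({"click_id": click["click_id"],
                               "impression_id": latest["impression_id"]})
    return attributed

class AttributionContractTest(unittest.TestCase):
    def test_click_attributed_to_latest_impression_in_window(self):
        impressions = [
            {"impression_id": "i1", "ad_id": "ad1", "ts": 100},
            {"impression_id": "i2", "ad_id": "ad1", "ts": 200},
        ]
        clicks = [{"click_id": "c1", "ad_id": "ad1", "ts": 250}]
        self.assertEqual(attribute_clicks(impressions, clicks),
                         [{"click_id": "c1", "impression_id": "i2"}])

    def test_click_outside_window_is_not_attributed(self):
        impressions = [{"impression_id": "i1", "ad_id": "ad1", "ts": 100}]
        clicks = [{"click_id": "c1", "ad_id": "ad1", "ts": 100 + 3600}]
        self.assertEqual(attribute_clicks(impressions, clicks), [])

if __name__ == "__main__":
    unittest.main()
```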
Conclusion
Data quality should be the first priority of every data pipeline, and the data architecture should be designed with data quality in mind. The first step of building robust and resilient data systems is defining a set of data quality metrics based on the business use cases. Data quality metrics should be captured as part of the data system and monitored continuously, and the data should be able to be backfilled at all times to minimize the potential impact on downstream users in case of data quality issues. The implementation of data quality monitors and backfill methods can differ between batched and event streaming systems. Last but not least, data engineers should establish programmatic data contracts as code to proactively prevent data quality regressions.
Only when data engineering systems are future-proofed to deliver high-quality data can data-driven business decisions be made with confidence.