The cloud is empowering organizations to grow their data footprints faster than traditional data centers ever allowed. Concurrently, the rise of advanced data science tools is enabling broader audiences within organizations to analyze data and identify new insights. However, as data moves between different IT systems, with or without the cloud, small levels of error can creep in during transfer. Even a single part-per-million (PPM) error rate is significant given the hundreds of millions of records transferred every day: at 1 PPM, an organization moving 500 million records daily accumulates roughly 500 faulty records per day. These errors have the potential to impact the business if not handled proactively.
The intersection of cloud adoption and the rise of citizen data scientists requires that organizations deliver high-quality, documented data sets for analysts to consume. The promise of the cloud for data storage hinges on the certainty that the integrity of the data will be preserved as it is moved between different IT systems and platforms. This article outlines the requirements for achieving business-driven data quality.
Tolerance for Errors
Cloud adopters must be on the lookout for poor data quality permeating their ecosystems. Tolerance for error depends on the use case. For data that feeds targeted marketing, a higher error level in post-analysis data is acceptable, given the low cost of an individual targeted advertisement. However, when the data is critical from a PR, compliance or legal standpoint, errors can be expensive. For example, data required for corporate or industry governance at a financial services institution must maintain 100% accuracy to ensure that stock holdings are correctly attributed and taxes are calculated correctly.
The benefits of moving data to the cloud are very appealing. Major cloud providers offer services that are extraordinarily economical compared with on-premises costs. The cloud has introduced new opportunities, but those benefits come with data integration challenges. The biggest change when moving from on-premises data management to the cloud is the wide range of services available for storing, processing and analyzing data. Yet each service introduces new challenges for data manipulation, and the output may not meet the level of quality the business demands. So why move? Because cloud platforms offer the flexibility to store data in a variety of formats while optimizing the platform technology to the query and analysis patterns of the users.
But cloud providers, still in the early stages of maturity, do not yet have the tools to ensure data is fit for use when it comes out of the cloud. The basic checks we take for granted in an RDBMS environment are lacking. As data moves in and out of the cloud, its quality frequently deteriorates and it loses trustworthiness due to:
- Inability to validate (and alert on) bad or missing data sources day over day (a minimal sketch of such a check follows this list)
- Inability to enforce data constraints, such as primary keys, which lets duplicate records through
- Multiple data sources sending data to the cloud that drift out of sync over time
- Structural changes to data in upstream processes that downstream cloud pipelines do not expect
- The presence of multiple IT platforms (Hadoop, the data warehouse, the cloud)
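As a rough sketch of the first and fourth failure modes above, a scheduled job might compare today's extract against yesterday's baseline and alert when files shrink unexpectedly or the schema drifts. The file paths, the 20% row-drop threshold and the print-based alerting below are hypothetical placeholders, not a prescription:

```python
import pandas as pd

# Hypothetical daily extracts; paths and the 20% threshold are illustrative only.
YESTERDAY = "landing/orders_2024-05-01.csv"
TODAY = "landing/orders_2024-05-02.csv"
ROW_DROP_THRESHOLD = 0.20

def day_over_day_check(yesterday_path: str, today_path: str) -> list[str]:
    """Return human-readable alerts; an empty list means the feed looks healthy."""
    alerts = []
    prev, curr = pd.read_csv(yesterday_path), pd.read_csv(today_path)

    # Structural drift: columns added, dropped or renamed upstream.
    if list(prev.columns) != list(curr.columns):
        alerts.append(f"Schema drift in columns: {set(prev.columns) ^ set(curr.columns)}")

    # Missing or truncated feed: row count collapses versus the prior day.
    if len(curr) < len(prev) * (1 - ROW_DROP_THRESHOLD):
        alerts.append(f"Row count dropped from {len(prev)} to {len(curr)}")

    return alerts

if __name__ == "__main__":
    for alert in day_over_day_check(YESTERDAY, TODAY):
        print("ALERT:", alert)  # in practice, route to email, Slack or a pager
```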
Faulty processes, ad hoc data policies, poor discipline in capturing and storing data and a lack of control over some data sources all contribute to data inconsistencies between the cloud and on-premises systems.
Data quality is based on a range of metrics that vary depending on the industry and the use for the data. The common best practice for ensuring data integrity is to check the six core dimensions of data quality (a brief sketch of such checks follows the list):
- Completeness – Are all data sets and data items recorded in their entirety?
- Uniqueness – Are there any duplicates?
- Timeliness – Does the data represent reality as of the required point in time?
- Validity – Does the data match the rules? A data set is valid if it conforms to the syntax (format, type, range) of its definition.
- Accuracy/Reasonableness – Does the data reflect reality? This is the degree to which data correctly describes the “real world” object or event it represents.
- Consistency – Is the same data represented identically wherever it appears? Consistency is the absence of difference when comparing two or more representations of the same data between different data sets, and can also cover the relationships between different variables.
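To illustrate how the first few dimensions translate into concrete checks, the sketch below spot-checks completeness, uniqueness and validity on a tabular extract with pandas. The file path, column names and the email validity rule are hypothetical; a production framework would manage such rules centrally rather than in a one-off script:

```python
import pandas as pd

# Hypothetical customer extract; column names and rules are illustrative only.
df = pd.read_csv("landing/customers.csv")

# Completeness: are mandatory fields populated for every record?
missing = df[["customer_id", "email"]].isna().sum()

# Uniqueness: are there duplicate primary keys?
duplicate_keys = df["customer_id"].duplicated().sum()

# Validity: does each value conform to the syntax of its definition?
valid_email = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

print(f"Missing values per mandatory column:\n{missing}")
print(f"Duplicate customer_id values: {duplicate_keys}")
print(f"Invalid email addresses: {(~valid_email).sum()}")
```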
The traditional approach developers take to ensuring data quality is to follow these steps linearly:
- Anticipate points of failure (expected threats)
- Code to mitigate expected threats
- Test the code
- Put code in production
- Detect new points of failure that were unexpected
- Write new code for the unexpected failures, then test it and move it to production
- Maintain and update the rules for relevancy and accuracy
The most damaging data threats are usually the unexpected ones that proactive programming never mitigated. The biggest challenge of data validation is creating and maintaining thousands of data quality rules that must constantly evolve over time. Organizations need to establish a Data Quality Validation framework that lends itself to the validation of large and complex data flows across multiple platforms. Today, that onerous undertaking is a tedious, labor-intensive process prone to human error. As data flows in from many different sources, interrelationships and validations become intricate and complex, and unexpected errors increase exponentially.
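To make that maintenance burden concrete, the hand-coded approach typically looks something like the sketch below: each expected threat becomes a rule that someone must write, test and keep current. The feed, field names and rules are hypothetical examples, not any particular organization's checks:

```python
# Hypothetical hand-written rules for a single trade feed; an estate of
# thousands of such rules across hundreds of feeds is hard to maintain by hand.
HAND_CODED_RULES = {
    "trade_amount_non_negative": lambda row: row["trade_amount"] >= 0,
    "currency_in_reference_list": lambda row: row["currency"] in {"USD", "EUR", "GBP"},
    "settlement_after_trade_date": lambda row: row["settlement_date"] >= row["trade_date"],
}

def validate(row: dict) -> list[str]:
    """Return the names of every rule the record violates."""
    return [name for name, rule in HAND_CODED_RULES.items() if not rule(row)]
```

Multiply this by hundreds of feeds and every upstream schema change, and the rule estate quickly outgrows what a team can maintain by hand.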
The current data quality approaches are reasonably suitable to mitigate “expected threats.” However, they are not scalable or sustainable. They do not work when data hops across multiple platforms and are definitely not suitable in a big data/cloud initiative. Instead of retrofitting existing solutions to solve big data quality issues, organizations must choose an intelligent data validation solution, one that will continue to learn autonomously.
A New Paradigm of Data Quality
The next evolution in ensuring data quality for big data in the cloud must, at a minimum, satisfy these needs:
- Handle massive data volumes with a powerful underlying engine. (Even the Big Data editions of the major ETL solutions cannot do this yet.)
- Let users set up data quality tests with a minimum number of clicks.
- Look for threats beyond those the tool has been programmed to catch, autonomously learning the gamut of data quality rules specific to each dataset through self-learning algorithms (a rough sketch of the idea follows this list).
- Translate quality indicators into relevant metrics for different stakeholders, including executives, team leaders and data quality owners.
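One way to think about the self-learning requirement, as a rough conceptual sketch rather than a description of how any particular product works, is to profile historical loads of a column and then flag new loads whose statistics fall outside the learned band. The 3-sigma band and the null-rate doubling heuristic below are assumptions chosen purely for illustration:

```python
import pandas as pd

def learn_profile(history: pd.Series) -> dict:
    """Learn a simple statistical fingerprint of a numeric column from past loads."""
    return {"mean": history.mean(), "std": history.std(), "null_rate": history.isna().mean()}

def check_against_profile(new_load: pd.Series, profile: dict, z: float = 3.0) -> list[str]:
    """Flag drift in the new load relative to the learned profile (3-sigma band assumed)."""
    alerts = []
    if abs(new_load.mean() - profile["mean"]) > z * profile["std"]:
        alerts.append("Mean shifted beyond the learned band")
    if new_load.isna().mean() > profile["null_rate"] * 2 + 0.01:
        alerts.append("Null rate roughly doubled versus history")
    return alerts
```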
With cloud adoption accelerating to operationalize mission-critical functions and increase the use of third-party data, there is an urgent need for a systematic approach to ensuring the accuracy, completeness and quality of data used by cloud applications. In the absence of the appropriate level of data quality checks, organizations could run into regulatory or operational issues that negate the potential benefits of the cloud platform. When clients leverage AWS specifically, we recommend the following types of checks:
- Input Data Quality Checks: When data lands in Amazon S3, whether from on-premises or third-party systems, autonomous data quality checks need to be performed to flag duplicate records and files, anomalous records, incomplete records, and structural and semantic data drift.
- Data Completeness between on-premises systems, S3, Amazon EMR and Redshift: Ensure that no record is lost as data moves from the on-premises system to the landing zone (S3), on to the processing application (EMR) and finally to the warehousing system (Redshift). A minimal reconciliation sketch follows this list.
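The sketch below illustrates one such completeness check: a record-count reconciliation between the S3 landing zone and Redshift. It assumes CSV files with one header row and a trailing newline, a psycopg2 connection to Redshift, and hypothetical bucket, prefix, table and connection details:

```python
import boto3
import psycopg2

BUCKET, PREFIX = "my-landing-zone", "orders/2024-05-02/"   # hypothetical names
REDSHIFT_DSN = "host=example.redshift.amazonaws.com port=5439 dbname=dw user=etl password=..."

def count_s3_records(bucket: str, prefix: str) -> int:
    """Count data rows across all CSV objects under a prefix (one header line per file assumed)."""
    s3, total = boto3.client("s3"), 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            total += max(body.count(b"\n") - 1, 0)  # subtract the header row
    return total

def count_redshift_records(dsn: str, table: str) -> int:
    """Count the rows that actually reached the warehouse table."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]

if __name__ == "__main__":
    s3_count = count_s3_records(BUCKET, PREFIX)
    rs_count = count_redshift_records(REDSHIFT_DSN, "analytics.orders")
    if s3_count != rs_count:
        print(f"ALERT: landed {s3_count} records in S3 but loaded {rs_count} into Redshift")
```

The same count-and-compare pattern can be repeated at each hop (on-premises to S3, S3 to EMR output, EMR output to Redshift) so a loss is caught at the stage where it occurs.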
Before any major analytical deployment in the cloud, your organization should work to define the key standards for data quality to ensure your analysts are effective with the data available to them. The following tools can help to ensure your data maintains high quality and improves over time:
- DataBuck from FirstEigen is an autonomous, self-learning, big data and cloud data quality validation and reconciliation tool. It validates the integrity of data and reconciles the cloud with the on-premises source, while enforcing constraints by filtering out the bad data and sending alerts to the appropriate people.
- Informatica offers a full suite of Data as a Service (DaaS) products, including data management platforms, one-time defined business rules applied across platforms, contact record verification and data enrichment services.
- IBM offers data quality solutions (InfoSphere, BigInsights – BigQuality) that enable users to cleanse, assess and monitor data quality and maintain consistent views of key entities.
- SAS combines decades of data quality experience to provide users with tools that make it easy to identify problems, preview data and set up repeatable processes in a single management view across multiple sources.
Data quality tools in the cloud must, first, be deployed so they scale as data volumes scale and, second, support a variety of integration methods so that multiple analysis tools can work across the same data set.
The cloud provides new opportunities for increased efficiency and agility in big data storage and analysis, but your organization must ensure that data integrity is maintained throughout the process in order to realize those promises.