Data lakes have become a foundation for many organizations’ data environments. While these data lakes provide new capabilities, many enterprises are struggling to derive full value due to the operational overhead of managing multiple new interfaces, tools, data sets and integration points. These data lakes often become “data swamps” due to the large amount of data that is ingested with no clear method to find data sets, separate them as needed and identify the core elements of value to the business.
The Value of Data Catalogs
Why should you care about deploying a data catalog? Because while many organizations now grasp the importance of centralizing their enterprise data, they often have not yet grappled with how difficult it is to access that data efficiently and securely. This difficulty arises because the data is ingested from many different places, with varying amounts of structure.
Data catalogs are a critical element of all data lake deployments, ensuring that data sets are tracked, identifiable by business terms, governed and managed. Forbes contributor Dan Woods cautions organizations against relying on tribal knowledge as a strategy, because it cannot scale. Data catalogs crystallize corporate data governance policies into practice, becoming the engine for enforcement and the tool for auditing compliance. The inclusive nature of the data catalog enables it to be used for collaboration and for centralized sharing of information in a known location, accessible across the organization.
Data catalogs become the entry point for data scientists and other analytical users across the organization via the data engineers (Figure 1) who are focused on creating enriched data sets for analytical uses. Data catalogs ensure these dispersed teams can collaborate on data set quality, usage, and business descriptions.
Data catalogs can be deployed in one of two organizational models:
- Policy – “you must” – This model relies on organizational policies that mandate compliance and penalize departments and individuals who do not follow data governance policies. It is most effective in top-down organizations where staff are accustomed to centralized policy management and accept decisions handed down to them.
- Give to get – “you will be rewarded” – This model is targeted at self-organizing organizations where collaboration is used to ensure compliance, and it focuses on building trust between teams: if they put their data sets into a centralized data catalog, they gain access to enriched data sets from other departments. The underlying assumption is that each department can deliver products faster through access to other groups’ effort and data.
A data catalog acts as a technology tool which can be employed in a minimal, nearly standalone manner. However, it is most effective as a central part of a larger data governance effort. Figure 2 shows a data catalog in the context of other highly valuable technical and process components typically deployed as part of an implementation of data governance.
The most common surrounding functionalities to deploy include:
- Metadata – Metadata is the enrichment of primary data sets through descriptions, common elements for linking, quality metrics and other dataset-specific details that data consumers need to get full use out of the data sets. This is the core of what a data catalog provides: a commonly accessible metastore through which the various consumption and query tools can use a consistent set of data definitions to access the underlying enterprise data.
- Data Quality – As data is used to drive more and more automated processes and decision making, the quality of data should be calibrated in accordance with risk. Some lower risk categories such as targeted marketing do not require extreme data quality, because the impact of inaccuracy is comparatively low. But for other higher risk domains such as healthcare or personal financial information, the quality of data is paramount because inaccuracy carries great consequences.
- Business Glossary – Many organizations have their own language and acronyms they use to describe organizations, relationships, value chains and supporting business systems. The data catalog serves as the primary method to capture these organizationally specific descriptions and map them to associated data sets.
- Master Data – An important principle behind effective centralization of enterprise data is that a select set of highly critical data elements sits at the core of the majority of an organization’s business operations. Ensuring that these elements are maintained, referenced and relied upon in a consistent manner is the key purpose of master data.
- Data Lineage – The importance of tracking a data set’s history, branches and modifications cannot be overstated. As data engineers combine and manipulate data sets to simplify use for business users, the source data sets and subsequent changes must be tracked so that users can reference earlier versions of data sets when necessary.
- Classification – Data must be classified so that staff know the proper channels through which it can be shared, and the appropriate level of protections for each class of data. Data catalogs become a central point to ensure that all data sets are classified, access is controlled, and a complete audit trail is available for the use of that data.
- Data Licensing – Many data sets today are bought and sold, either in their entirety or as components. The entitlements of these purchased data sets must be tracked to ensure organizations limit access, comply with contracts and make proper payment for data access.
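To make the metadata function above concrete, here is a minimal sketch of what one catalog entry might hold. The record shape and all names (data set, owner, columns, metrics) are hypothetical illustrations, not the schema of any particular product:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A minimal, hypothetical metadata record for one data set."""
    name: str                    # fully qualified data set name
    description: str             # business-friendly description
    owner: str                   # team responsible for the data set
    columns: dict                # column name -> type; the linkable schema
    quality_metrics: dict = field(default_factory=dict)

# Registering an enriched data set so all query tools share one definition
orders = CatalogEntry(
    name="sales.orders_enriched",
    description="Orders joined with customer master data",
    owner="data-engineering",
    columns={"order_id": "string", "customer_id": "string", "total": "decimal"},
    quality_metrics={"completeness": 0.998},
)
```

Because every tool reads the same entry, a description or quality metric updated by one team is immediately visible to every other consumer.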
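The data lineage point above can also be sketched in code: each derived data set records the upstream versions it came from and the transformation applied, so users can trace back to earlier versions. The record shape and sample data set names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageRecord:
    dataset: str
    version: int
    parents: tuple      # upstream "dataset@version" references
    transform: str      # description of the change applied

history = [
    LineageRecord("orders_clean", 1, ("raw.orders@3",),
                  "dropped null customer_ids"),
    LineageRecord("orders_clean", 2, ("raw.orders@4", "crm.customers@7"),
                  "joined customer segment column"),
]

def ancestors(history, dataset, version):
    """Return the upstream data set versions a given version was built from."""
    rec = next(r for r in history
               if r.dataset == dataset and r.version == version)
    return rec.parents
```

With such records, a business user who sees a surprising change in `orders_clean` version 2 can identify the new upstream source that introduced it.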
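As one possible illustration of the classification point above, a catalog can pair an access check against classification levels with an audit trail of every decision. The level names, users and data set names below are hypothetical:

```python
from datetime import datetime, timezone

# Hypothetical classification levels, most to least restrictive
LEVELS = {"restricted": 3, "confidential": 2, "internal": 1, "public": 0}

audit_log = []  # every access decision is recorded for later review

def can_access(user_clearance, dataset_class, user, dataset):
    """Allow access only if the user's clearance covers the data set's class,
    and append the decision to the audit trail either way."""
    allowed = LEVELS[user_clearance] >= LEVELS[dataset_class]
    audit_log.append(
        (datetime.now(timezone.utc).isoformat(), user, dataset, allowed)
    )
    return allowed
```

Keeping the check and the audit record in one place is what lets the catalog answer both "who may see this?" and "who actually tried to?".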
Deployment of a data catalog, even in a data-light organization, is a non-trivial task that requires planning, investment and integration with existing systems. The reward is a powerful tool to enable data-driven organizations to effectively enforce policy, drive collaboration and ensure high quality data products are available for decision making. Some best practices for deploying data catalogs include:
- Don’t Boil the Ocean – The potential for any data catalog to enable an organization is never-ending. This can lead to confusion on deployment. Organizations should start with the core functions that are critical to enable the business and show viability, then continue to iterate and add capabilities.
- Decide When to Use Emerging Commercial Products – The commercial products for deploying fully functional data catalogs are still emerging. Many organizations today are opting to build their own data catalogs, using standard tools and APIs with the intention of moving to a commercial product when the right one becomes available.
- Cloud Platform Solutions – The cloud platform vendors see the need for this centralization of data and metadata and offer their own implementations; these can both facilitate migration to public cloud and simplify what needs to be custom built. AWS offers this service with the AWS Glue Data Catalog, and Azure through the Microsoft Azure Data Catalog.
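As a sketch of how such a cloud-hosted catalog is queried, the snippet below lists the table names in one AWS Glue Data Catalog database using Glue's paginated `get_tables` API via boto3. The client is passed in as a parameter so the function can be exercised without AWS credentials; the database name is a placeholder:

```python
def list_catalog_tables(glue, database):
    """List fully qualified table names from one Glue Data Catalog database.

    `glue` is a boto3 Glue client, e.g. boto3.client("glue"); paginating
    get_tables handles databases with more tables than one response holds.
    """
    names = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        for table in page["TableList"]:
            names.append(f"{database}.{table['Name']}")
    return names
```

With credentials and a region configured, this would be invoked as `list_catalog_tables(boto3.client("glue"), "analytics")`, where `analytics` is a hypothetical database name.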
Effective data-driven organizations require powerful collaboration and consistency in policy enforcement. Data catalogs provide that centralized functionality for ensuring all teams have visibility into available data, described in a way that is easy to consume, and centrally managed against corporate data governance policies. Deployment of data catalogs is not a one-time implementation, but an evolution to continue to add new functionality, prioritized by the needs of business users.