Predictive analytics and supporting technologies like machine learning require access to diverse data sets and powerful, scalable compute resources. These capabilities enable organizations to use large amounts of data from social media, online journeys, the Internet of Things (IoT) and other sources to make data-driven decisions across the organization. Storing this information in a data lake empowers staff across the organization to analyze data, test theories and drive changes to business processes, the customer experience and products.
A data lake is not meant to replace existing systems. Rather, it is an integration point between existing data platforms, to enable a seamless view into all of an organization’s data. A data lake will complement existing systems by ensuring that analytical workloads, development, testing and machine learning model creation will not impact production workloads in other performance optimized systems. Ultimately, a data lake is a concept, and while it has some specific technologies and workflows, its value lies in the connectivity between the core of the data lake and supporting business and operational systems.
Building a data lake requires organizations to assess data strategy, infrastructure architecture and workflows, to ensure the data available is of high quality, is linked for rapid analysis, and does not expose the organization to risk through data compromise or create compliance challenges. Figure 1 shows the common steps an organization takes as it begins a data lake project, and the key considerations, both technical and organizational, that must be addressed for a successful data lake implementation.
Machine learning is an emerging trend within the technology space, but not a new technology. Machine learning capabilities have been researched for decades and leveraged for many years by mature technology organizations. The difference now is the ability for more organizations to leverage the work in the machine learning community, including easy to consume APIs and pre-trained models specific to certain domains. Machine learning complements the growing work in the predictive analytics space by ensuring outcomes and recommendations are more accurate and highly personalized to the person, organization, domain and purpose.
Building a data lake in the cloud requires special consideration, but it provides advanced capabilities that are not economical in on-premises deployments, including elasticity, automated recovery, multi-zone availability and PaaS (Platform as a Service) based analytical services for data consumption.
Many organizations will evaluate the best location to deploy a data lake. Because of the need to ingest and integrate data from many existing systems, the location and connectivity of a data lake are key to its effectiveness and usability. Cloud based data lakes provide an advantage because of their ability to quickly spin up and down new resources, to connect to a variety of networks and data sources, and, most importantly, to leverage the powerful tools and expertise that the providers offer and have proven in running their own complex, world-wide services.
Business Value of a Data Lake
Figure 2 outlines the increasing maturity of big data adoption within an organization. As organizations mature through the different levels, there are technology, people and process components to address. The data lake is commonly deployed to support the movement from Level 3, through Level 4 and on to Level 5. The data lake provides a platform for execution of advanced technologies, and a place for staff to mature their skill sets in data analysis and data science.
Analytical Operational Model
The primary value of a data lake is enabling flexibility, through a scalable platform for analysis of complex data sets. Many different technologies will go into this analysis, including predictive analytics tools, data modeling, data quality and machine learning. The first part of any analytical workflow is the data process. Figure 3 shows the steps commonly followed to Ingest, Cluster, Index and ultimately Analyze data within a data lake. These steps are key to ensuring that high quality data is brought together, associated properly and organized to enable data scientists to analyze the prepared data.
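The four steps named above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the sample events, the field names and the reading of "Cluster" as grouping related records by customer are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events from two source systems (illustrative data only).
RAW_EVENTS = [
    {"source": "web", "customer": "C1", "amount": "20.0"},
    {"source": "iot", "customer": "C2", "amount": "5.5"},
    {"source": "web", "customer": "C1", "amount": "30.0"},
    {"source": "web", "customer": None, "amount": "9.9"},  # fails the quality check
]


def ingest(events):
    """Ingest: drop records without a customer and convert amounts to floats."""
    return [
        {**e, "amount": float(e["amount"])}
        for e in events
        if e["customer"] is not None
    ]


def cluster(records):
    """Cluster: group related records, here simply by customer identifier."""
    groups = defaultdict(list)
    for record in records:
        groups[record["customer"]].append(record)
    return dict(groups)


def index(groups):
    """Index: build a lookup from source system to the customers seen there."""
    lookup = defaultdict(set)
    for customer, records in groups.items():
        for record in records:
            lookup[record["source"]].add(customer)
    return dict(lookup)


def analyze(groups):
    """Analyze: compute the average amount per customer."""
    return {c: mean(r["amount"] for r in recs) for c, recs in groups.items()}


clean = ingest(RAW_EVENTS)
groups = cluster(clean)
source_index = index(groups)
averages = analyze(groups)
```

In a real data lake each step would typically be a separate, automated job; the point of the sketch is only that each stage consumes the output of the previous one, so quality problems are filtered out before analysis begins.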
Machine learning is a series of iterative steps that leverage a known, analyzed data set to train specific models for future execution on unknown data sets. Figure 4 shows the typical steps used by data scientists to train the necessary models when leveraging machine learning. Once the models have been trained, they can be leveraged in conjunction with a variety of analytical tools including R, SAS and open source tools written in Python.
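To make the train-then-execute idea concrete, the sketch below trains a very simple model on labeled data, checks it against held-out records, and then applies it to a new record. The nearest-centroid classifier, the sample values and the hold-out rule are assumptions chosen for brevity; they are not the specific steps shown in Figure 4.

```python
from math import dist

# Hypothetical labeled ("known") data: (features, label). Values are illustrative.
LABELED = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"), ((1.1, 1.2), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high"), ((5.1, 5.3), "high"),
]


def split(data, every=4):
    """Hold out every `every`-th record for validation; train on the rest."""
    train = [d for i, d in enumerate(data) if (i + 1) % every != 0]
    held_out = [d for i, d in enumerate(data) if (i + 1) % every == 0]
    return train, held_out


def train_centroids(train):
    """Train: compute one centroid (mean feature vector) per label."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(model, features):
    """Execute: assign the label of the nearest centroid."""
    return min(model, key=lambda label: dist(model[label], features))


def accuracy(model, data):
    """Validate: fraction of records whose predicted label matches the known label."""
    hits = sum(predict(model, features) == label for features, label in data)
    return hits / len(data)


train, held_out = split(LABELED)
model = train_centroids(train)
validation_accuracy = accuracy(model, held_out)

# Apply the trained model to a new, unlabeled ("unknown") record.
new_prediction = predict(model, (4.7, 5.4))
```

In practice the loop is repeated: if validation accuracy is too low, the data preparation, features or model choice are revised and the model is retrained before it is used on unknown data.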
Cloud based data lakes add the value of being able to leverage platform provided machine learning capabilities. Vendors, including AWS and Google, provide a rich set of trained models for immediate use against data sets, as well as the ability to train custom models for use against proprietary data sets. Both AWS and Google have deployed variants of the machine learning technologies they have used and refined internally over many years.
The technical architecture for a data lake must be a match for the dominant use cases being run on the platform. When designing the data lake solution, the key design factors are:
- Use Cases – Early identification of the use cases and workloads for the data lake will allow proper prioritization of different analysis engines, scalability considerations and data integration points.
- Operational Aspects – The data lake architecture should factor in the necessary tools for monitoring and response, as well as which technologies to leverage to ensure the system is maintainable by your IT organization.
- Scalability & Performance – As your organization grows and evolves, the use of the data lake will expand. Early technology decisions should have an eye towards the ability for the technology choices to scale without replacement.
These top three considerations then become several key design elements for the data lake:
- Data Access & Retrieval – Cloud providers make available a multitude of tools for accessing data using SQL interfaces, tools for storing data in JSON objects, platforms optimized for read-only workloads, as well as tools for batch processing unstructured data. These tools should be considered when designing a data lake, including the necessary interfaces for ingest and processing of data. Later in the paper we discuss specific technologies from AWS and Google for data access and retrieval. A common platform for metadata should also be designated for streamlined data access.
- Security Controls, Logging & Auditing – Security is a key element of a data lake; the identity management, auditing and access controls should be designed to meet the risk levels of the organization, as well as compliance needs. Access controls should be consistent between access methods.
- Deployment & Automation – Tremendous operational value comes from the ability to automate deployment and recovery in the cloud. All data lake functionality should be automated for deployment and recovery, to lower the operational burden on the IT team when making changes and responding to incidents.
- Advanced Capabilities – Advanced capabilities include APIs for data analysis, or development toolkits that quickly enable teams to mock up new analysis and reports.
Figure 5 shows the recommended design pattern for a cloud based data lake, including connectivity to traditional enterprise systems.
Building a data lake is an integration of complex technologies that work together to provide access to diverse data sets. The following are key functional areas that should be included in all data lake deployments:
- Data Processing – The ability for the data lake to seamlessly connect to other systems, provide clean mappings for data and move data around in an automated, highly reliable manner.
- Streaming – Capability for analyzing and making decisions on data that is in-flight.
- Rules/Matching – Ability to execute pattern matching against data for operations like de-identification or deduplication.
- ETL – An Extract-Transform-Load engine is key to integrating into existing RDBMS and EDW platforms.
- Governance – All governance should be consistently implemented at the edge of the data lake to ensure compliance and adherence to corporate policies.
- Data Storage & Retrieval – These are functional areas to enable developers to query data in standard formats, using standard APIs from the data lake.
- Batch – High throughput, high latency processing for data that is being analyzed, not commonly used for interactive workloads.
- Analytical – Commonly used for interactive workloads where the queries change over time.
- In-memory – Used to support very low latency queries that support interactive usage or other low latency needs.
- Search/Index – These support the ability to locate information and relationships quickly.
- OLTP – Targeted to support transactional systems commonly found within business units and operations teams.
- Object – An object store is a key component of a data lake for storing non-relational data, as well as historical copies of information for later analysis.
- Long Term – Long term storage, commonly a component of the object store, is necessary for archiving data that may not be used regularly, but is still required to be accessible. Commonly used for compliance policies and legal hold rules.
- Data Consumers – The data lake can support a variety of different interfaces for data consumers to access data and expose it to a variety of application types.
- Dashboards – Regularly updated sites with reporting on specific metrics and changes over time.
- E-commerce – Customer facing systems that are transactional in nature.
- Data Science – Individuals looking to develop predictive models, test theories or develop new types of reporting for the organization.
- Business Intelligence (BI) – Iterative tools for enabling power users to explore complex data sets and relationships.
- Mobile Apps – Apps designed to be used quickly by users on the go; response time and accuracy of data are key.
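As an illustration of the Rules/Matching functional area listed above, the following sketch applies one pattern-matching rule for de-identification and one for deduplication. The e-mail pattern, the record layout and the key are hypothetical, and replacing values with a keyed hash is pseudonymization, which is only one part of a full de-identification program.

```python
import hashlib
import hmac
import re

# Hypothetical secret; in practice this would come from a managed key store.
PSEUDONYM_KEY = b"example-key-not-for-production"

# Simplified e-mail pattern used as the matching rule (an assumption for this sketch).
EMAIL_RULE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value: str) -> str:
    """Replace a value with a keyed hash so equal inputs map to equal tokens."""
    digest = hmac.new(PSEUDONYM_KEY, value.lower().encode("utf-8"), hashlib.sha256)
    return "id_" + digest.hexdigest()[:16]


def de_identify(text: str) -> str:
    """Rule: replace every e-mail address found in free text with its pseudonym."""
    return EMAIL_RULE.sub(lambda match: pseudonymize(match.group(0)), text)


def deduplicate(records):
    """Rule: treat records as duplicates when their normalized e-mail matches."""
    seen = set()
    unique = []
    for record in records:
        key = record["email"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"email": "Ana@Example.com", "note": "Contact Ana@Example.com about renewal"},
    {"email": "ana@example.com ", "note": "Duplicate entry"},
    {"email": "bo@example.org", "note": "No address in this note"},
]

unique = deduplicate(records)
cleaned_notes = [de_identify(r["note"]) for r in unique]
```

Because the pseudonym is derived from a keyed hash, the same address always yields the same token, so records can still be linked across data sets without exposing the underlying value.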
Figures 6 and 7 show key technologies provided by AWS and Google for building a data lake. Each provider has unique capabilities that should be weighed when evaluating where to locate your data lake. Key considerations include regional availability, connectivity options, supportability and the need for different vendor and application certifications. Both AWS and Google provide valuable platforms that are scalable, secure and flexible to build a data lake in the cloud.
The Google technologies shown in Figure 6 are:
- Operational Aspects
  - Pub/Sub – Pub/Sub provides a seamless developer experience for the sharing of data between systems and tools.
- Scalability & Performance
  - BigQuery – BigQuery provides a highly scalable platform for analysis of data sets that are commonly read-heavy. BigQuery is a PaaS offering, ensuring low operational overhead on the IT organization.
- Data Access & Retrieval
  - Google Cloud Storage – Google Cloud Storage provides an object interface for storage of historical and archive data.
  - Hadoop on Google Compute Engine – Google provides multiple vendor solutions for running Hadoop on Google Compute Engine; this can be leveraged in a data lake as a scalable batch processing environment that feeds processed, prepared data to other systems, including BigQuery.
- Advanced Capabilities
  - Google Machine Learning – Google Machine Learning capabilities provide developers the ability to leverage pre-trained models, as well as train their own for rapid analysis of data.
  - Prediction API – Google Prediction API provides the ability to identify patterns in data quickly, without standing up additional servers or services.
Figure 7 shows a data lake architecture that leverages key technologies and capabilities from AWS.
- Operational Aspects
  - CloudFormation – AWS provides CloudFormation, an automated method for standing up services and configurations in a repeatable manner.
- Scalability & Performance
  - IAM – AWS provides strong Identity and Access Management (IAM) capabilities across its cloud portfolio, as well as the ability to integrate with existing LDAP or Active Directory infrastructures. This capability ensures consistent entitlements across the data access methods.
- Data Access & Retrieval
  - S3 – S3 is the object store platform for AWS; it provides a simple API for the storage and retrieval of data.
  - Redshift – Redshift is the AWS enterprise data warehouse platform; it provides high speed analytical access to large and complex data sets. Redshift is a PaaS capability, ensuring low operational overhead.
  - EMR – Elastic MapReduce (EMR) is the AWS managed Hadoop service, allowing for highly scalable batch processing of data that is sent to other systems for query and analysis.
  - DynamoDB – DynamoDB is a fully managed, low latency NoSQL platform that enables developers to create powerful and responsive applications that have a high level of data integrity supporting them.
- Advanced Capabilities
  - Amazon Machine Learning – Amazon Machine Learning provides both visualization tools and other UI components to build new models and train existing models.
Operational Aspects
With cloud based data lakes, two types of operational teams are required to ensure maximum availability, reliability and scalability of the platform. One is a traditional operations team that focuses on aspects including service availability, service performance, security and incident response. The second is a DataOps team, focused on data quality, automated workflows, data linking, metadata management and modeling of data-centric processes. The CTP Cloud Adoption model, shown in Figure 8, has been leveraged successfully for small to large migrations, and with data lake implementations there are specific considerations for each tenet:
- Strategy & Economics – Data lakes have specific elements for strategy and economics because of their ability to enable better decision making within an organization, to positively influence revenue and customer satisfaction.
- Security & Governance – Because of the multitude of data stored in a data lake, Security and Governance must consider the risks associated with data being combined, as well as analyzed, outside of traditional organizational roles or workflows.
- Application Portfolio Assessment – Any data lake project should include an evaluation of applications from a data usage perspective, including documentation of source of record evaluation.
- Application Migration – In the case of a data lake, very little application migration work will take place; rather, the focus will be around implementation of new capabilities for supporting the data lake, and integration with existing systems.
- DevOps – In the case of a data lake, DevOps models will allow anyone within an organization to develop analytical models and access a repository of curated data about the organization, allowing them to effectively manage their business and test theories.
- CloudOps – With data lakes, there are many moving pieces and interconnected systems. Strong CloudOps models for monitoring, response, incident management and staff training ensure stability. CloudOps also includes cost control elements to ensure services are properly started and stopped, and that costs are being monitored by management for alignment with organization goals and returns on investment.
- DataOps – Data quality is paramount in a data lake to ensure that decisions and recommendations made are grounded in truth. DataOps practices, including metadata management, data linking, quality, curation and archiving, are key elements of all data lake deployments.
Data Quality & Modeling
The primary function of a data lake is to provide a single repository of diverse data sets, easily accessible and of high quality and integrity. Data quality is paramount, as is the ability to easily find data sets and related data. There are a variety of best practices to use as measures for data quality within a data lake:
- Schema on Read – Because of the diverse nature of workloads and analytics patterns in a data lake, all schemas should be applied on read. This schema on read model ensures that each analyst can optimize their data views and relationships.
- Immutable data for all work – All work done in a data lake should be executed on immutable data; this will ensure that rogue processes or analyses can be removed without affecting the data quality for future analysis.
- De-identifying Data – Many organizations deal with sensitive data, including healthcare, financial or personal information. A data lake creates a unique risk, allowing many individuals to access data previously stored in silos. All data put into a data lake and allowed to be accessed by a wide audience should be de-identified to ensure that personal privacy is protected. Many data lakes have separate areas with de-identified and identifiable data, with each section accessible to the proper staff.
- Source of Record – A data lake will be pulling data from multiple sources, as well as feeding analytical results back to operational systems. This requires that organizations carefully track their Source of Record for each data type and understand how that information is moved between systems, as well as referenced, to ensure data integrity.
- Relationship Mapping – As organizations have grown their silos of data over many years, relationships in data have become complex. A successful data lake must ensure that data elements are properly mapped, so that reporting can span systems, time frames and business units.
- Metadata Catalog – To ensure that all data lake users can effectively locate required data, a metadata catalog should be deployed to provide information about data sets, relationships, data quality and historical information, including past analysis and results.
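Two of the practices above, schema on read and immutable data, can be illustrated together. In this sketch the raw records are kept exactly as written and exposed read-only, while each analyst applies a schema at read time to produce a separate view. The record layout, schema names and defaults are assumptions made for the example.

```python
import json
from types import MappingProxyType

# Raw records are stored as written (hypothetical JSON lines) and never modified.
RAW_LINES = (
    '{"id": "1", "ts": "2017-03-01", "amount": "19.99", "region": "EU"}',
    '{"id": "2", "ts": "2017-03-02", "amount": "5.00"}',
)


def read_raw(lines):
    """Parse raw lines into read-only mappings so analyses cannot alter them."""
    return tuple(MappingProxyType(json.loads(line)) for line in lines)


# Each analyst declares a schema at read time: field name -> (converter, default).
FINANCE_SCHEMA = {"id": (int, None), "amount": (float, 0.0)}
GEO_SCHEMA = {"id": (int, None), "region": (str, "UNKNOWN")}


def apply_schema(record, schema):
    """Project a raw record through a schema, producing a new dict."""
    return {
        field: convert(record[field]) if field in record else default
        for field, (convert, default) in schema.items()
    }


raw = read_raw(RAW_LINES)
finance_view = [apply_schema(r, FINANCE_SCHEMA) for r in raw]
geo_view = [apply_schema(r, GEO_SCHEMA) for r in raw]
```

Both views are derived from the same untouched records, so a flawed analysis can be discarded by deleting its view. Attempting to assign to a field of a raw record raises a `TypeError`, which is how this sketch enforces immutability.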
A key component of all data lake implementations is a strong set of security controls, backed by organizational governance policies. Because of the disparate data sets brought together in a data lake, and the variety of users accessing the data through both structured and ad hoc methods, the governance and security controls must be clear and automated, and must actively respond to business needs and outside threats.
Figure 9 outlines three best practices for data integration when building a data lake:
- Security Context – All security context, including access controls, tagging and ownership, should be carried with data when moved between systems. This will ensure that as data is imported/exported, the policies carried are consistent between systems.
- Identities – Identities should be consistent across all systems; inevitably data will be replicated to provide for performance needs, and consistent identities between systems will ensure data access can be properly audited and updated quickly, as organizational changes occur and staff leave.
- Data + Data – Data lakes present a unique risk to data security because of the ability to combine data previously stored in separate silos. Combined data can be more sensitive than its individual components. Data policies and controls need to account for data that becomes sensitive when combined, and ensure adequate controls are in place to warn users and control the flow of data.
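The Security Context and Data + Data practices can be sketched as a small policy check that runs whenever two data sets are combined. The sensitivity levels, the field names and the rule that ZIP code, birth date and gender together are identifying are assumptions for illustration; an organization would define its own policy.

```python
from dataclasses import dataclass

# Hypothetical policy: field combinations that are sensitive only when together.
SENSITIVE_COMBINATIONS = [
    frozenset({"zip_code", "birth_date", "gender"}),
]

# Hypothetical sensitivity levels, ordered from least to most strict.
LEVELS = ["public", "internal", "restricted"]


@dataclass(frozen=True)
class DataSet:
    name: str
    fields: frozenset
    sensitivity: str  # one of LEVELS
    owners: frozenset


def combine(left: DataSet, right: DataSet) -> DataSet:
    """Describe the join of two data sets, carrying security context forward."""
    fields = left.fields | right.fields
    # Start from the stricter of the two inputs.
    level = max(left.sensitivity, right.sensitivity, key=LEVELS.index)
    # Escalate if the combined fields complete a sensitive combination.
    if any(combo <= fields for combo in SENSITIVE_COMBINATIONS):
        level = "restricted"
    return DataSet(
        name=f"{left.name}+{right.name}",
        fields=fields,
        sensitivity=level,
        owners=left.owners | right.owners,  # ownership travels with the data
    )


visits = DataSet("visits", frozenset({"zip_code", "visit_date"}), "internal", frozenset({"ops"}))
members = DataSet("members", frozenset({"birth_date", "gender"}), "internal", frozenset({"marketing"}))
combined = combine(visits, members)
```

Neither input is restricted on its own, but the combined data set is, which is the situation the Data + Data practice is meant to catch before the data reaches a wider audience.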
Many organizations are looking to technologies like machine learning and predictive analytics to enable them to be more effective at targeting prospects, supporting customers, building effective products and responding to market needs. To effectively leverage these technologies, an organization must first develop a solid infrastructure to store data, execute analytical workloads and protect its data assets from unexpected change or compromise. A cloud based data lake provides organizations a flexible platform for data storage and processing, while providing near endless scalability, with high levels of availability.
When building a data lake, start with specific use cases that can be architected for and proven; those will enable the organization to grow capability effectively as the number of data lake users increases. Cloud based data lakes have the added benefits of gaining new capabilities easily as cloud providers expand their feature portfolios, and of leveraging the providers’ deep expertise in scale and security.