
As companies collect, generate and process more data to drive value through their organizations, they find themselves facing an important decision: where to put it all.
One option has been to create a data lake. Data lakes are particularly useful because they allow organizations to feed in large volumes of data in their original formats, with the assumption that someone will use it later. Companies have used, and continue to use, data lakes in on-premises environments, but lakes have become even more popular in the cloud. The inherent flexibility of cloud environments makes it easier both to move data around and to scale storage as volumes grow. In addition, cloud storage costs are falling dramatically, especially in comparison with on-premises systems such as NAS (network-attached storage) or SANs (storage area networks). As of this writing, AWS offers Amazon S3 Glacier Deep Archive redundant storage with 99.999999999% durability for roughly $1 per terabyte per month.
Still, data lakes pose challenges. Because these data stores tend to be unmanaged and available to large numbers of people, they can be security nightmares. In a single targeted attack, a hacker could steal critical customer or operational data. Companies planning to use data lakes, whether by migrating data into the cloud or by building a whole new cloud platform, have to create a comprehensive plan to secure the data inside those lakes.
A data lake security plan needs to address five important challenges: data access control, data protection, data lake usage, data leak prevention, and data governance and compliance. Here is a rundown on how to secure your data lake in AWS cloud environments.
Data Access Control
An effective data access control plan has three facets: an authentication/authorization system, access auditing and read/write permissions.
The data lake platform can leverage the existing authentication/authorization system a company has already implemented for its cloud resources. The challenge is that data is stored via the object storage model, where each file object can contain a huge amount of data with many different properties. Access control for each object alone is a challenge — and data lakes can contain billions of these file objects.
In traditional database systems, audit data is fairly easy to interpret, because data access is controlled via schema restrictions at the table or column level, where table and column names are well defined. In data lakes, however, data sets are not segmented by clear boundaries. Someone who has access to a particular file object can modify it without leaving a trail of what was modified, beyond the information that the file was altered in some unspecified way.
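One way to restore part of that audit trail is to log object-level API activity. Here is a minimal sketch using boto3 to turn on S3 data-event logging, assuming a pre-existing CloudTrail trail and a lake bucket (both names are hypothetical):

    import boto3

    cloudtrail = boto3.client("cloudtrail")

    # Record every read and write against objects in the lake bucket, so an
    # alteration at least leaves a record of who touched which object and when.
    cloudtrail.put_event_selectors(
        TrailName="data-lake-trail",  # hypothetical, pre-existing trail
        EventSelectors=[{
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [{
                "Type": "AWS::S3::Object",
                "Values": ["arn:aws:s3:::my-data-lake/"],  # hypothetical bucket
            }],
        }],
    )

Note that this records which objects were read or written, not what changed inside them; field-level auditing still requires a schema layer on top.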
Controlling read and write access to the data lake is another challenge. You can take a lockdown approach, restricting ingestion and extraction to a small set of user roles or processes. Or you can institute fine-grained access control at the object level, where individuals or groups are granted access to specific sets of data objects.
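As a rough illustration of the lockdown approach, the following sketch uses boto3 to attach a bucket policy that denies object writes to everyone except a dedicated ingestion role (the bucket name, account ID and role name are hypothetical):

    import json
    import boto3

    s3 = boto3.client("s3")

    # Deny writes and deletes to every principal except the ingestion role.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "LockDownWrites",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::my-data-lake/*",  # hypothetical bucket
            "Condition": {
                "StringNotLike": {
                    "aws:PrincipalArn": "arn:aws:iam::123456789012:role/lake-ingest-role"  # hypothetical role
                }
            },
        }],
    }

    s3.put_bucket_policy(Bucket="my-data-lake", Policy=json.dumps(policy))

Because it is an explicit deny, the statement overrides any broader allow granted elsewhere in IAM.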
One key approach companies use is to severely restrict direct file object access, allowing entry only through higher-level tools that can enforce granular permissions and authorizations. Typically, this is done through something like a Hive metastore that provides a schema layer. Other tools then use the Hive schema layer to enforce more granular authorization controls at the table or row level. Some popular tools in this layer are Athena, QuickSight and Tableau for data consumption, with Hive schema control. The complexity of Hive schemas can be handled with tools such as Collibra, Immuta and the AWS Glue Data Catalog.
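When access is funneled through the schema layer, consumers query governed tables rather than raw objects. A minimal sketch of submitting such a query through Athena with boto3 (the database, table and results location are hypothetical):

    import boto3

    athena = boto3.client("athena")

    # Users query a table defined in the Glue/Hive catalog,
    # never the underlying S3 objects directly.
    response = athena.start_query_execution(
        QueryString="SELECT customer_id, order_total FROM orders LIMIT 10",
        QueryExecutionContext={"Database": "lake_analytics"},  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-query-results/"},  # hypothetical bucket
    )
    print(response["QueryExecutionId"])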
Figure 1: Data Lake Components
Data Protection
The underlying technologies to protect data at rest and data in transit are mature and widely available in public cloud platforms. The challenge in a data lake, or in any data platform in the cloud, is how to actually implement and manage all of these controls. At the data-at-rest layer, you may transform and move data into different systems for analytics, such as an RDBMS, a NoSQL store or a data warehouse; encryption procedures and keys must be consistent across these systems as well. For data in transit, you can use existing PKI management or cloud solutions, but keeping certificates up to date requires coordinated effort.
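For consistent encryption at rest on the S3 layer, one common option is default bucket encryption with a shared KMS key. A sketch with boto3, assuming a hypothetical bucket and KMS key ARN:

    import boto3

    s3 = boto3.client("s3")

    # Every new object is encrypted with the same KMS key, which keeps key
    # management consistent across the systems that read from this bucket.
    s3.put_bucket_encryption(
        Bucket="my-data-lake",  # hypothetical bucket
        ServerSideEncryptionConfiguration={
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/example-key-id",  # hypothetical key
                }
            }]
        },
    )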
A good set of backup and restore standards and procedures comes into play if you detect a data compromise. Whenever possible, leverage native automated backup, such as Amazon S3 with versioning, cross-region replication and lifecycle retention, and archive with Amazon S3 Glacier storage. A backup is not complete unless the restore procedure is well tested and certified.
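A minimal sketch of those native controls, again with a hypothetical bucket name: turn on versioning, then archive older object versions to S3 Glacier Deep Archive with a lifecycle rule.

    import boto3

    s3 = boto3.client("s3")

    # Versioning keeps prior copies of every object so a compromise can be rolled back.
    s3.put_bucket_versioning(
        Bucket="my-data-lake",  # hypothetical bucket
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Move noncurrent versions to Deep Archive after 30 days to keep storage cheap.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-data-lake",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-old-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionTransitions": [
                    {"NoncurrentDays": 30, "StorageClass": "DEEP_ARCHIVE"}
                ],
            }]
        },
    )

None of this counts as a working backup until you have actually restored from it and verified the result.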
Data Lake Usage
Data lakes often contain data from many sources in the enterprise, and making the lake usable and maintainable means overcoming several challenges. Each data set is often associated with a data owner and a data custodian, so communication between those owners and the data lake owners should be governed by enterprise data governance policies and agreements. The cleansing and transformation process creates other challenges, such as how to produce and maintain data lineage from the original data to the newly transformed data. And a common interface for the data user community requires ongoing development and maintenance effort.
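One lightweight way to preserve lineage is to stamp each transformed object with metadata pointing back to its sources. A rough sketch (the bucket, keys and job name are all hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # Write the transformed data set with user metadata identifying its raw
    # inputs and the job that produced it, so lineage can be traced backward.
    with open("orders-2024-01.parquet", "rb") as data:
        s3.put_object(
            Bucket="my-data-lake",  # hypothetical bucket
            Key="curated/orders/2024-01.parquet",  # hypothetical curated key
            Body=data,
            Metadata={
                "source-objects": "raw/orders/2024-01.csv",  # hypothetical raw input
                "transform-job": "orders-cleansing-v2",  # hypothetical job name
            },
        )

Dedicated catalog and lineage tools scale this idea much further, but even simple object metadata gives auditors a thread to pull.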
Data Leak Prevention
The more data you put in one place, the more security controls you need to protect it. Data protection requires awareness and collaboration from many areas of an enterprise. Data leaks can be caused by insiders with access to data, or by a lack of data security controls. Security tools can prevent data leakage in certain ways, but they do not guarantee data protection. A common pattern is to start with a preventive approach: lock down access, then deploy active monitoring systems that alert you to data egress or alteration, tuned so that normal daily usage does not generate too many false positives.
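As one example of pairing a preventive control with monitoring, the sketch below (hypothetical bucket name) blocks all public access to the lake bucket and then checks CloudTrail for recent changes to its bucket policy:

    from datetime import datetime, timedelta

    import boto3

    s3 = boto3.client("s3")
    cloudtrail = boto3.client("cloudtrail")

    # Preventive: refuse any public ACL or public bucket policy on the lake bucket.
    s3.put_public_access_block(
        Bucket="my-data-lake",  # hypothetical bucket
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )

    # Detective: list who changed the bucket policy in the last 24 hours.
    events = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "PutBucketPolicy"}],
        StartTime=datetime.utcnow() - timedelta(days=1),
        EndTime=datetime.utcnow(),
    )
    for event in events["Events"]:
        print(event.get("Username"), event["EventTime"])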
Data Governance and Compliance
Because data tends to move across borders, you need to define and understand your data sovereignty so your data lake will have the protections mandated by local laws. You also have to understand and identify the compliance requirements for your data: for example, PII may be subject to privacy regulations such as the GDPR, payment card data must be protected according to PCI DSS, and PHI has to be handled in compliance with HIPAA. The importance of data compliance is not limited to regulated industries. A proper enterprise data classification standard helps any company design, deploy and operate systems securely and efficiently, whether on-premises or in the cloud. The footprint of a system deployment depends largely on the data classification requirements. For example, a system with a critical data rating will require redundant deployment of resources, with active data backup and restore capabilities, which will require a higher budget. A non-critical system can bypass many of those redundant deployments, which can save a substantial sum of money.
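A simple sketch of attaching a classification label that downstream automation and deployment decisions can key off of (the bucket name and tag values are hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # Tag the lake bucket with its classification so automation can decide,
    # for example, whether redundant deployment and backups are mandatory.
    s3.put_bucket_tagging(
        Bucket="my-data-lake",  # hypothetical bucket
        Tagging={
            "TagSet": [
                {"Key": "DataClassification", "Value": "critical"},  # hypothetical scheme
                {"Key": "Compliance", "Value": "hipaa"},  # hypothetical tag
            ]
        },
    )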
On the governance side, data lakes introduce new challenges for your data governance team. One of the first steps is education and awareness training about the data lake for that team.
Conclusion
Throughout the world, companies are leveraging data more strategically to move their businesses forward. Data lakes have emerged as important tools for storing increasingly large volumes of information. Companies that do a good job of securing their data lakes – following enterprise security requirements and leveraging independent security assessments – can position themselves to take full advantage of the data-related projects that are critical to their future.