Cloud Security for Data Lakes in AWS Environments

Companies that do a good job of securing their data lakes can position themselves to take full advantage of data-related projects that are critical to their future.
Bao Quach, Principal Cloud Architect
May 22, 2019 | THE DOPPLER

As companies collect, generate and process more data to drive value throughout their organizations, they find themselves facing an important decision: where to put it all.

One option has been to create a data lake. Data lakes are particularly useful because they allow organizations to feed in large volumes of data in their original formats, with the assumption that someone will use it later. Companies have used, and continue to use, data lakes in on-premises environments. But lakes have become even more popular in the cloud. The inherent flexibility of cloud environments makes it easier both to move data around and to scale storage up. In addition, storage costs in the cloud are shrinking dramatically, especially in comparison with on-premises data systems such as NAS (network-attached storage) or SANs (storage area networks). As of this writing, AWS offers Amazon Glacier Deep Archive redundant storage with 99.999999999% durability for roughly $1 per terabyte per month.

Still, data lakes pose challenges. Because these data stores tend to be unmanaged and available to large numbers of people, they can be security nightmares. In one targeted attack, a hacker could steal critical customer and/or operational data. Companies planning to use data lakes, either by migrating data into the cloud, or by building a whole new cloud platform, have to create a comprehensive plan to secure the data inside those lakes.

A data lake security plan needs to address five important challenges: data access control, data protection, data lake usage, data leak prevention, and data governance and compliance. Here is a rundown on how to secure your data lake in AWS cloud environments.


Data Access Control

An effective data access control plan has three facets: an authentication/authorization system, access auditing and read/write permissions.

The data lake platform can leverage the existing authentication/authorization system a company has already implemented for its cloud resources. The challenge is that data is stored via the object storage model, where each file object can contain a huge amount of data with many different properties. Access control for each object alone is a challenge — and data lakes can contain billions of these file objects.

In traditional database systems, auditing data can be interpreted fairly clearly, because data access is controlled via schema restrictions at table or column levels, where table and column names are well defined. In data lakes, however, data sets are not segmented by clear boundaries. Someone who has access to a particular file object can modify it without creating a trail of what was modified, beyond the information that the file was altered in some unspecified way.
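
Object-level audit trails have to be switched on explicitly. As a minimal sketch, assuming an existing CloudTrail trail and boto3 (the trail and bucket names below are hypothetical), the following enables S3 data events so that every object read and write against the lake bucket is at least recorded:

```python
import boto3

# Hypothetical names; substitute your own trail and lake bucket.
TRAIL_NAME = "data-lake-audit-trail"
LAKE_BUCKET = "example-data-lake"

cloudtrail = boto3.client("cloudtrail")

# Record object-level (data plane) events for the data lake bucket,
# so every GetObject/PutObject/DeleteObject call leaves an audit entry.
cloudtrail.put_event_selectors(
    TrailName=TRAIL_NAME,
    EventSelectors=[
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                {
                    "Type": "AWS::S3::Object",
                    # The trailing slash captures every object in the bucket.
                    "Values": [f"arn:aws:s3:::{LAKE_BUCKET}/"],
                }
            ],
        }
    ],
)
```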

Controlling read and write access to the data lake is another challenge. You can take a lockdown approach, restricting ingestion and extraction to a handful of user roles or processes. Or you can institute fine-grained access control at the object level, where individuals or groups are assigned to a certain set of data objects.
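
For the lockdown approach, one way to express it is an S3 bucket policy that denies object access to every principal except a short list of approved roles. The sketch below uses boto3; the bucket name and role ARNs are placeholders for illustration:

```python
import json
import boto3

# Hypothetical identifiers for illustration.
LAKE_BUCKET = "example-data-lake"
INGEST_ROLE = "arn:aws:iam::111122223333:role/lake-ingest"
EXTRACT_ROLE = "arn:aws:iam::111122223333:role/lake-extract"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Deny object access to everyone except the approved roles.
            # Make sure this list includes any admin role you rely on,
            # or you can lock yourself out of the bucket.
            "Sid": "LockDownToApprovedRoles",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{LAKE_BUCKET}/*",
            "Condition": {
                "StringNotLike": {
                    "aws:PrincipalArn": [INGEST_ROLE, EXTRACT_ROLE]
                }
            },
        }
    ],
}

boto3.client("s3").put_bucket_policy(
    Bucket=LAKE_BUCKET, Policy=json.dumps(policy)
)
```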

One key approach companies use is to severely restrict direct file object access, allowing entry only through higher-level tools that can enforce granular permissions and authorizations. Typically, this is done through something like a Hive metastore that provides a schema layer. Other tools then use that schema layer to enforce finer-grained authorization controls, such as table- or row-level permissions. Popular consumption tools in this layer include Athena, QuickSight and Tableau, all governed by the Hive schema, while the complexity of the schemas themselves can be managed with tools such as Collibra, Immuta and the AWS Glue Data Catalog.
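
As a concrete illustration of that schema layer, the sketch below registers a table in the AWS Glue Data Catalog over a curated S3 prefix, so consumers such as Athena query the table definition rather than raw objects. The database, table and column names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical database and table; the point is that consumers query
# this schema layer (e.g., through Athena) instead of raw S3 objects.
glue.create_database(DatabaseInput={"Name": "lake_analytics"})
glue.create_table(
    DatabaseName="lake_analytics",
    TableInput={
        "Name": "customer_events",
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "customer_id", "Type": "string"},
                {"Name": "event_type", "Type": "string"},
                {"Name": "event_ts", "Type": "timestamp"},
            ],
            "Location": "s3://example-data-lake/curated/customer_events/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```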

Figure 1: Data Lake Components

Data Protection

The underlying technologies to protect data at rest and data in transit are mature and widely available in public cloud platforms. The challenge in a data lake, or in any data platform in the cloud, is how to actually implement and manage all these controls. For data at rest, remember that data is often transformed and moved to other systems for analytics, such as an RDBMS, a NoSQL store or a data warehouse, and encryption procedures and keys must remain consistent across all of them. For data in transit, you can use existing PKI management or cloud-native solutions, but keeping certificates up to date requires coordinated effort.
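
As a minimal sketch of both layers with boto3, the following sets default SSE-KMS encryption on a lake bucket and denies any request that is not made over TLS. The bucket name and KMS key ARN are placeholders:

```python
import json
import boto3

s3 = boto3.client("s3")
LAKE_BUCKET = "example-data-lake"  # hypothetical bucket name
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"  # hypothetical key

# Data at rest: encrypt every new object with a KMS-managed key by default,
# so the same key policy governs the lake and its downstream copies.
s3.put_bucket_encryption(
    Bucket=LAKE_BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ARN,
                }
            }
        ]
    },
)

# Data in transit: refuse any request that does not arrive over TLS.
# (In practice this statement would be merged into the bucket's full policy.)
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{LAKE_BUCKET}",
                f"arn:aws:s3:::{LAKE_BUCKET}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket=LAKE_BUCKET, Policy=json.dumps(policy))
```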

A good set of backup and restore standards and procedures comes into play if you detect data compromise. Whenever possible, leverage native automated backups, such as Amazon S3 versioning, cross-region replication and lifecycle retention, with archiving to Amazon Glacier storage. A backup is not complete until the restore procedure has been tested and certified.
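
A minimal boto3 sketch of those native controls: enable versioning on the lake bucket and add a lifecycle rule that archives the raw zone to Glacier Deep Archive. The bucket name, prefix and retention periods are illustrative:

```python
import boto3

s3 = boto3.client("s3")
LAKE_BUCKET = "example-data-lake"  # hypothetical

# Versioning keeps prior object versions, so an accidental or malicious
# overwrite can be rolled back.
s3.put_bucket_versioning(
    Bucket=LAKE_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Move raw data to Glacier Deep Archive after 90 days and expire
# superseded versions after a year (retention numbers are illustrative).
# Cross-region replication is configured separately via
# put_bucket_replication and an IAM replication role.
s3.put_bucket_lifecycle_configuration(
    Bucket=LAKE_BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-zone",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 90, "StorageClass": "DEEP_ARCHIVE"}
                ],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
            }
        ]
    },
)
```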

Data Lake Usage

Data lakes often contain data from many sources across the enterprise, and making the lake usable and maintainable means overcoming several organizational challenges. Each data set is typically associated with a data owner and a data custodian, so communication between those owners and the data lake owners should be governed by enterprise data-governance agreements. The cleansing and transformation process creates further challenges, such as how to produce and maintain data lineage from the original data to the newly transformed data. And a common interface for the data user community requires ongoing development and maintenance effort.

Data Leak Prevention

The more data you put in one place, the more security controls you need to protect it. Data protection requires cognizance and collaboration from many areas of an enterprise. Data leaks can be caused by insiders with legitimate access to data, or by the lack of data security controls. Security tools can prevent data leakage in certain ways, but they do not guarantee data protection. Common security controls start with a preventive approach: deploy access lock-down controls along with active monitoring systems that alert you to data egress or alteration, tuned so that normal usage does not generate too many false positives on a daily basis.
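
One hedged example of active egress monitoring: enable S3 request metrics, then alarm on an unusual volume of downloads. The bucket name, SNS topic and threshold below are assumptions to be tuned against your normal usage:

```python
import boto3

LAKE_BUCKET = "example-data-lake"  # hypothetical
ALERT_TOPIC = "arn:aws:sns:us-east-1:111122223333:security-alerts"  # hypothetical

# Request metrics must be enabled before S3 publishes BytesDownloaded.
boto3.client("s3").put_bucket_metrics_configuration(
    Bucket=LAKE_BUCKET,
    Id="EntireBucket",
    MetricsConfiguration={"Id": "EntireBucket"},
)

# Alarm when more than ~50 GB leaves the bucket in an hour; tune the
# threshold so routine workloads do not flood the team with alerts.
boto3.client("cloudwatch").put_metric_alarm(
    AlarmName="data-lake-egress-spike",
    Namespace="AWS/S3",
    MetricName="BytesDownloaded",
    Dimensions=[
        {"Name": "BucketName", "Value": LAKE_BUCKET},
        {"Name": "FilterId", "Value": "EntireBucket"},
    ],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=50 * 1024**3,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALERT_TOPIC],
)
```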

Data Governance and Compliance

Because data tends to move across borders, you need to define and understand your data sovereignty requirements so your data lake has the protections mandated by local laws. You also have to understand and identify the compliance requirements for your data; for example, payment card data must be protected to PCI DSS standards, and PHI (protected health information) must be handled in compliance with HIPAA. The importance of data compliance is not limited to regulated industries: a proper enterprise data classification standard helps any company operate securely and efficiently, on-premises or in the cloud, by informing system design and deployment. The footprint of a system deployment depends largely on its data classification requirements. For example, a system holding data rated critical will require redundant deployment of resources, with active data backup and restore capabilities, and therefore a higher budget. A non-critical system can bypass many of those redundant deployments, which can save a substantial sum of money.
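
One practical way to make a classification standard actionable in AWS is to tag lake buckets (or objects) with their classification so other controls and reports can key off the tags. The tag taxonomy below is purely illustrative:

```python
import boto3

# Tag lake buckets with an enterprise classification so downstream
# tooling (and people) can apply the right controls. The keys and
# values here are illustrative; use your organization's taxonomy.
boto3.client("s3").put_bucket_tagging(
    Bucket="example-data-lake",  # hypothetical
    Tagging={
        "TagSet": [
            {"Key": "DataClassification", "Value": "confidential"},
            {"Key": "Compliance", "Value": "hipaa"},
            {"Key": "DataOwner", "Value": "analytics-team"},
        ]
    },
)
```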

On the governance side, data lakes introduce new challenges for your data governance team. One of the key first steps is education and awareness training about the data lake for that governance team.

Conclusion

Throughout the world, companies are leveraging data more strategically to move their businesses forward. Data lakes have emerged as important tools for storing increasingly large volumes of information. Companies that do a good job of securing their data lakes, following enterprise security requirements and leveraging independent security assessments, can position themselves to take full advantage of the data-related projects that are critical to their future.

Bao Quach

Bao Quach is a Principal Cloud Architect at Cloud Technology Partners.
