The term “data gravity” refers to the growing “weight” of data as it accumulates into an ever-growing mass, the additional data and services that are pulled closer to that mass, and the inertia that must be overcome in order to move it. To be useful, data must, of course, be accessed (and often updated): e.g., transactions, queries, reports, analyses and data science insights. The applications used to do this usually work best when located close to the data. Why? Because the greater the network distance between data and application, the longer each interaction takes – i.e., the greater the latency. Application performance is why we care about latency.
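To make the latency point concrete, here is a small back-of-the-envelope sketch in Python. The round-trip times and query count are illustrative assumptions, not measurements:

```python
# Illustrative round-trip times (RTTs): roughly 1 ms when the application
# sits in the same network as its data, vs. tens of milliseconds when a
# lower-speed link separates them. Both values are assumptions.
SAME_NETWORK_RTT_MS = 1.0
CROSS_NETWORK_RTT_MS = 40.0

def network_wait_ms(round_trips: int, rtt_ms: float) -> float:
    """Time a 'chatty' interaction spends just waiting on the network."""
    return round_trips * rtt_ms

# A screen that issues 50 sequential queries against its data store:
queries = 50
print(network_wait_ms(queries, SAME_NETWORK_RTT_MS))   # 50.0 (ms)
print(network_wait_ms(queries, CROSS_NETWORK_RTT_MS))  # 2000.0 (ms)
```

Under these assumed numbers, the same application logic becomes 40 times slower purely because of where it runs relative to the data, which is why placement matters.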
Principles, Conflicts and Trade-offs
While technology changes constantly, certain principles about data remain consistent. We would propose the following principles as examples:
- Data from more sources is better than data from fewer sources
- Consolidated data is better than siloed data
- Low latency (close network proximity between data and applications) is better than high latency (greater network distance between data and applications)
- The more knowledge you have of your data assets, the better
- No single data structure can optimally serve all consumption patterns
These principles guide our data architectures. Under some circumstances, they also reveal trade-offs that must be considered. Why? Because core data principles can conflict with each other. As an example, the first principle (data from more sources is better than data from fewer sources) and the second (consolidated data is better than siloed data) are both generally accepted as true, yet in architectural terms they represent a potential conflict. If all your data is consolidated into a single location and structure, it cannot at the same time be stored in different places and forms to serve all your disparate applications and consumption use cases.
Data Gravity, Latency and the Hybrid Cloud
Straddling the gap between on-premises and cloud service providers (CSPs) is a highly relevant example of a common architecture that presents data gravity and latency challenges. Nearly all large enterprises are now in various stages of what could be described as hybrid cloud: at least some use cases, workloads and data stores reside in CSPs, while others remain on-premises or in private hosting centers. Many enterprises see hybrid cloud as the optimal high-level architecture, while others aim to operate largely from within CSPs, recognizing that a full transition from on-premises to CSPs will take a substantial amount of time.
This is an oversimplification, but it is useful to think of hybrid cloud architectures as two high-speed networks (i.e., on-prem and CSP) with a lower-speed connection between them. This simple description provides a context for us to look at use cases that illustrate situations where hybrid cloud might (or might not!) present challenges around data gravity.
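One way to picture this two-network model is as a small lookup table we can reason against in the use cases below; the location names and RTT values are assumptions for illustration:

```python
# Toy model of hybrid cloud: two fast networks ("onprem" and "csp")
# joined by a slower link. All RTT values are illustrative assumptions.
ASSUMED_RTT_MS = {
    ("onprem", "onprem"): 1.0,   # within the on-prem network
    ("csp", "csp"): 1.0,         # within the cloud provider's network
    ("onprem", "csp"): 40.0,     # across the lower-speed connection
    ("csp", "onprem"): 40.0,
}

def rtt_ms(app_location: str, data_location: str) -> float:
    """Round-trip time between an application and its data store."""
    return ASSUMED_RTT_MS[(app_location, data_location)]

print(rtt_ms("csp", "csp"))     # 1.0  -- app adjacent to its data
print(rtt_ms("csp", "onprem"))  # 40.0 -- app separated from its data
```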
Use Case #1
Let us take the hypothetical but typical example of Company A, a financial firm that manages investments for its retail customers. Like many large enterprises, Company A has been conducting business for decades, with a large on-prem collection of data stores upon which dozens of company applications depend. At one point in time, all applications were located on-premises as well. Thus, the data and applications were all resident in the same high-speed, on-prem network. However, consistent with a recently announced cloud-first philosophy, it is assumed that a newly proposed end-user application, which needs to interact with on-premises data, will be hosted in the cloud (the other high-speed network). So, where should the data for this application reside?
One obvious position is that any data that must be accessed or written by the application should be close to it. On the other hand, the on-prem data stores this application depends on also serve dozens of existing applications and cannot easily be moved; as a result, the application and the data reside in two different locations separated by a low-speed connection. We are facing a “rock and a hard place” situation: the application should be adjacent to the data for latency and performance reasons, yet it cannot be adjacent to the data due to data gravity. Under the described circumstances, a solution might be to create a data store in the CSP that is populated with the necessary content, and then synchronize data back and forth with the on-prem data store.
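A rough sketch of that synchronization pattern might look like the following; the stores are represented as plain dictionaries, and the record shape, field names and last-writer-wins reconciliation are all assumptions made for illustration:

```python
import time

def sync(source: dict, target: dict) -> None:
    """Copy any record whose 'updated_at' is newer than the target's copy."""
    for key, record in source.items():
        if key not in target or record["updated_at"] > target[key]["updated_at"]:
            target[key] = dict(record)

# Hypothetical system of record on-prem, plus an initially empty CSP copy.
onprem_store = {"acct-1001": {"balance": 2500.00, "updated_at": time.time()}}
csp_store = {}

# One reconciliation cycle in each direction. A production pipeline would
# also need conflict resolution for records changed on both sides between
# cycles, which is part of the cost of this workaround.
sync(onprem_store, csp_store)   # on-prem -> cloud
sync(csp_store, onprem_store)   # cloud -> on-prem
print(csp_store["acct-1001"]["balance"])  # 2500.0
```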
Use Case #2
Let’s look at a different example. Company B is facing excessive storage costs and operational overhead from maintaining multiple copies of data (including offsite tape storage) for disaster recovery (DR) purposes. Seeking to take advantage of both the comparatively lower cost of object storage in the cloud and the simplicity of managing backups without maintaining hardware, Company B has decided to use a CSP as the destination for all offsite DR backups.
Does this hybrid cloud pattern present any challenges at a basic principle level? Do we care that the DR data is in the cloud, while the application (backup software and associated operational management) resides on-premises? Unlike Use Case #1, this is fairly straightforward. While one of our previous principles states that low latency is better than high latency, in this case, the latency between the two high-speed networks simply is not relevant, because neither storing nor retrieving data located in the cloud needs to be performed on a moment-by-moment basis. Just as with Company B’s previous architecture, the DR data need not be part of the “center of gravity,” so CSP-based DR backups work well.
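To illustrate how simple this pattern can be, here is a minimal sketch that ships a backup file to cloud object storage, using AWS S3 via boto3 as one example of a CSP object store; the bucket, file names and storage class are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical nightly DR backup produced by the on-prem backup software.
s3.upload_file(
    Filename="/backups/nightly.dump",   # hypothetical local backup file
    Bucket="companyb-dr-backups",       # hypothetical bucket name
    Key="dr/nightly.dump",
    # A cold storage class keeps costs low; slow retrieval is acceptable
    # because DR data is not accessed on a moment-by-moment basis.
    ExtraArgs={"StorageClass": "DEEP_ARCHIVE"},
)
```

Note that the cold tier's high retrieval latency is acceptable for exactly the reason given above: this consumption pattern does not care about latency.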
Use Case #3
Company C is attempting to reduce its commercial RDBMS footprint, which is currently based on Oracle RAC in its on-prem data center. The intention is to migrate to cloud-native database services in order to reduce costs and simplify operational management. However, as with most companies, the complexity of its on-prem database environments dictates that it is neither possible nor desirable to convert and move the entire database infrastructure all at once. Thus, no option compatible with the overall database migration business strategy avoids the need to have database resources in two separate locations for an extended period of time.
Some possible approaches might include ongoing background synchronization of the data selected for workloads initially moved to the cloud, enabling low-latency access to the data at the cost of some replication lag and duplicated storage. Another approach, under the right circumstances, might be to completely migrate logical portions of the data to the cloud, carving away some of the on-prem data gravity; this would provide low-latency cloud-based access at the cost of higher-latency queries from on-premises. Once again, Company C’s use case provides another example of the need to deal creatively with data gravity challenges in the context of hybrid cloud.
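The “carve away logical portions” approach can be sketched as a simple routing rule: each query goes to whichever database currently owns the schema it touches. The schema names and endpoints below are assumptions for illustration:

```python
# Schemas whose data has already been fully migrated to the CSP.
MIGRATED_SCHEMAS = {"reporting", "customer_profile"}

def endpoint_for(schema: str) -> str:
    """Route a query to the database that owns this schema today."""
    if schema in MIGRATED_SCHEMAS:
        return "clouddb.example.internal"   # hypothetical cloud-native service
    return "oraclerac.example.internal"     # hypothetical on-prem Oracle RAC

print(endpoint_for("reporting"))         # clouddb.example.internal
print(endpoint_for("trade_settlement"))  # oraclerac.example.internal
```

As each logical portion migrates, its schemas move into the migrated set: workloads adjacent to their data get low-latency access, while cross-location queries pay the higher-latency price noted above.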
We have already established that, for the foreseeable future, hybrid cloud is a given for the majority of companies: staying completely off the cloud is a no-go for most organizations, and immediate, fully realized cloud-native operation is usually an option only for relatively young companies. So where does this leave us?
When it comes to enterprise data designs, there simply is no perfect overarching architecture that serves all principles and priorities. Instead, it is necessary to evaluate the company’s goals and design solutions in light of the conflicting priorities created by the higher-level architectural constraints that business strategy imposes.
In the end, there are two key guidelines that can help us navigate the challenges presented by data gravity in the context of hybrid cloud:
- Every overarching enterprise data architecture requires a unique balance of data principle trade-offs
- The primary drivers for achieving each enterprise’s optimum balance of data principle trade-offs are data gravity and latency, measured against consumption use cases
As your large enterprise navigates challenges from the perspective of your hybrid cloud posture, be sure to evaluate your designs, vendors and architectures with these guidelines in mind. They should help you ask the right questions, make better decisions and derive better value from your data and analytics.