
In the technology world, buzzwords and analogies flood the landscape. Analogies help us explain unfamiliar concepts in familiar terms, and when an analogy strikes a chord, it often launches a buzzword. This can be helpful in bridging the communication gap between technology and business. “Data gravity” is one of those buzzwords that now permeate our discussions around digital innovation.
Data gravity was coined by Dave McCrory, a software engineer, in a blog post almost ten years ago. It uses the basic concept of gravity to help explain the behaviors and patterns resulting from the ever-growing disruptions caused by the cloud and big data. The law of gravity states that the attraction between two objects is proportional to the product of their masses. McCrory posited that in ever-growing cloud and big data environments, the mass of accumulated data increases the number of applications, services and consumers drawn to that data. Put simply, the more data that accumulates, the more applications and services come into the orbit of that data to consume it.
This phenomenon becomes key when architecting, designing and building out your cloud and big data environments. Where the data resides in relation to the applications and services drawn to it affects system performance, cost and reliability. Consumers of the data may reside in the same cloud as the data, in on-premises systems, or in another cloud platform. It is a hybrid cloud world, and that must be considered when implementing solutions. The explosive growth of applications and services caused by increasing data gravity can quickly overwhelm a system, degrading performance, increasing costs and even reducing the quality of results when the reliability of the content suffers.
Data Entanglement to the Rescue
Enter a new analogy and buzzword to save the day. “Data entanglement” plays off the concept of quantum entanglement, again drawing on the world of physics. In simplified terms, quantum entanglement describes how two particles can remain connected across great distances, so that a change observed in one is reflected in the other. In the data world, data entanglement means that when two data stores share common information, a change in one is reflected in the other.
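To make the idea concrete, here is a minimal sketch, assuming simple in-memory stores standing in for real databases (all names here are hypothetical), of two “entangled” stores where a change written to either is reflected in the other:

```python
# Minimal illustration of "entangled" data stores: a change written to one
# store is propagated to every store that shares the same record.
# The stores are plain in-memory dictionaries -- stand-ins for real databases.

class EntangledStore:
    def __init__(self, name):
        self.name = name
        self.records = {}   # key -> value
        self.peers = []     # other stores sharing this data

    def entangle(self, other):
        """Link two stores so writes to either are reflected in both."""
        self.peers.append(other)
        other.peers.append(self)

    def write(self, key, value, _source=None):
        self.records[key] = value
        # Propagate the change to every entangled peer, skipping the store
        # the change came from to avoid an infinite loop.
        for peer in self.peers:
            if peer is not _source:
                peer.write(key, value, _source=self)

edge = EntangledStore("edge")
cloud = EntangledStore("cloud")
edge.entangle(cloud)

edge.write("sensor-42/temp", 71.3)
print(cloud.records["sensor-42/temp"])  # 71.3 -- the change shows up in the cloud store
```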
Yes, we are fundamentally talking about data replication, but there is a benefit to thinking about its challenges from the perspective of entangled systems. It encourages you, when working through use case scenarios, to take all the potential impacts into account. Let us look at some high-level scenarios and why we should think in terms of entanglements.
Living on the Edge in the World of IoT
With IoT devices, sensors of all types generate massive amounts of data. These devices live out on the edge of the cloud, and create multiple data entanglement impact scenarios, as follows:
- Real-time system scenarios – In manufacturing, data from IoT sensors frequently demands real-time acquisition and processing. Immediate response is required, so manufacturers cannot wait for the data to make it to the cloud and back. This means the data store needs to be on or near the edge, close to the services and applications doing the data processing. (For example, when sensors in a manufacturing line report an issue in one part of the system, the response must be immediate. Or when a new smart car senses an adverse driving condition, the analysis and response cannot be delayed by latencies back to the cloud.)
- Long-term analysis scenarios – It is often beneficial to analyze IoT device data over the long term — for example, when doing predictive maintenance. Such applications and services do not need real-time data access and capabilities, so inherent latency is not an issue. The original data at the edge is entangled with the data stores used by the long-term analytical applications, which can be in an entirely different location within the cloud.
- Feedback/updates to devices scenarios – Based on the various analyses performed on the data, it may be important to send feedback or updates to the IoT devices. (In the smart car example, the analysis may produce data that improves the performance of the smart car features, so you would want that data to upgrade every car in the fleet.) The devices and back-end systems are inextricably entangled: changes are propagated, replicated and sometimes transformed on their way back to the devices involved.
As you can see from these high-level scenarios, data is indeed entangled between systems, so we need to ensure the right data is in the right place at the right time.
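As a rough illustration of the first two scenarios, here is a minimal Python sketch (the class, method names and threshold are assumptions for illustration, not a real framework) of an edge node that reacts to readings locally in real time while queuing the same data for later replication to a cloud store used for long-term analysis:

```python
import time
from collections import deque

class EdgeNode:
    """Keeps data at the edge for real-time response, and queues the same
    readings for asynchronous replication to a cloud analytics store."""

    def __init__(self, alarm_threshold):
        self.alarm_threshold = alarm_threshold
        self.local_store = []             # recent readings, kept at the edge
        self.replication_queue = deque()  # readings awaiting upload

    def ingest(self, sensor_id, value):
        reading = {"sensor": sensor_id, "value": value, "ts": time.time()}
        self.local_store.append(reading)
        self.replication_queue.append(reading)
        # Real-time path: act immediately, without a round trip to the cloud.
        if value > self.alarm_threshold:
            self.trigger_alarm(reading)

    def trigger_alarm(self, reading):
        print(f"ALARM: {reading['sensor']} reported {reading['value']}")

    def flush_to_cloud(self, cloud_store):
        """Long-term path: replicate queued readings when bandwidth allows."""
        while self.replication_queue:
            cloud_store.append(self.replication_queue.popleft())

cloud_history = []                  # stand-in for the cloud analytics store
node = EdgeNode(alarm_threshold=90.0)
node.ingest("line-3/temp", 72.5)
node.ingest("line-3/temp", 95.1)    # exceeds threshold -> immediate local alarm
node.flush_to_cloud(cloud_history)  # the entangled copy reaches the cloud later
```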
Data Replication Scenarios
Most replication needs fall into a small set of scenarios (although, as usual, there are exceptions).
- Data synchronization scenarios – One of the best-known examples is the full synchronization of data between two or more databases, typically in near real time. All database systems have some level of this capability built in. Potential negative impacts include cost, resource consumption and performance. In this model, all systems perform reads and updates and are kept in sync. Race conditions are always a risk, so significant resources are required to ensure data integrity. While conceptually the simplest solution, it can be overkill for most needs.
- Snapshot scenarios – The snapshot replication of one or more data tables at predetermined time intervals can be useful when there is no real-time need for data access and destinations only require read access. For example, in the IoT examples above, feedback/updates could potentially be handled with the snapshot technique. (A minimal sketch of this approach follows this list.)
- Transactional scenarios – Transactional replication is a step down from full data synchronization. Data is copied from a master system to slave systems in or near real time. This is usually thought of as incremental updates on top of an initial snapshot, and it is frequently used as a mechanism for backup and passive system availability.
- Read-only scenarios – Read-only copies are created in a similar manner to transactional ones, usually for performance reasons in systems doing heavy analysis. In most cases, applications can fall back to the active master system if the read-only system is not available.
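To make the snapshot scenario concrete, here is a minimal Python sketch using SQLite (the database file names and table name are made up for illustration) that copies a table wholesale into a reporting copy on a schedule, rather than replicating each transaction as it happens:

```python
import sqlite3

def snapshot_table(source_path, dest_path, table):
    """Copy a full table from the source database into the destination.
    This is a snapshot: readers of the destination see the table as it
    existed at the moment of the copy, not a live replica."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(dest_path)
    rows = src.execute(f"SELECT * FROM {table}").fetchall()
    cols = [d[0] for d in src.execute(f"SELECT * FROM {table} LIMIT 0").description]
    dst.execute(f"DROP TABLE IF EXISTS {table}")
    dst.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    dst.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' for _ in cols)})", rows
    )
    dst.commit()
    src.close()
    dst.close()

if __name__ == "__main__":
    # Set up a toy source database so the sketch runs end to end.
    src = sqlite3.connect("factory.db")
    src.execute("CREATE TABLE IF NOT EXISTS sensor_readings (sensor TEXT, value REAL)")
    src.execute("INSERT INTO sensor_readings VALUES ('line-3/temp', 72.5)")
    src.commit()
    src.close()

    # In practice this copy would run as a scheduled job (e.g. hourly);
    # here we take a single snapshot into the read-only reporting database.
    snapshot_table("factory.db", "reporting.db", "sensor_readings")
```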
Security and Privacy Considerations
One critical piece that is often not addressed, or even considered, when implementing replication/entanglement scenarios is data security and privacy. When data is replicated from one environment to another, the security requirements surrounding that data must follow it. People often assume security constraints are set up identically in both environments, but that is not necessarily the case.
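As a small illustration, a replication step can carry the data's security requirements with it and check them against the destination before copying, rather than assuming both environments are configured the same way. Everything in this sketch, including the policy fields, is a made-up example:

```python
# Hypothetical sketch: replicate a dataset together with its access policy,
# and refuse the copy if the destination cannot enforce the same requirements.

source_dataset = {
    "rows": [{"patient": "p-17", "diagnosis": "hypertension"}],
    "policy": {"allowed_roles": {"clinician"}, "encrypted_at_rest": True},
}

def replicate(dataset, destination):
    policy = dataset["policy"]
    if policy["encrypted_at_rest"] and not destination["supports_encryption"]:
        raise RuntimeError("Destination cannot meet the source's security requirements")
    if not policy["allowed_roles"] <= destination["enforceable_roles"]:
        raise RuntimeError("Destination cannot enforce the source's access roles")
    # Copy the data *and* the policy, so the requirements travel with the data.
    return {"rows": list(dataset["rows"]), "policy": dict(policy)}

analytics_env = {"supports_encryption": True, "enforceable_roles": {"clinician", "analyst"}}
copy = replicate(source_dataset, analytics_env)
print("replicated with policy:", copy["policy"])
```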
A second challenge in replication revolves around data privacy. As an organization, you are responsible for any data passing through or replicated into your environment. A simple example is the GDPR “right to be forgotten” rule. If you hold personal information about an individual who has requested to be removed, that data must be deleted from any and all systems where it resides. If their information is entangled throughout your environment, you must identify all those connections in order to remain compliant.
The best way to address both these considerations is to have a comprehensive data governance process in place. With copies of data moving throughout your system, it is critical to know what data you have and where you have it. In many organizations, data replications and entanglements have grown organically, and without good governance, data can be forgotten, putting the organization at risk.
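As a minimal sketch of what that governance can look like in code (the catalog layout, store names and function are assumptions, not a real product's API), a simple catalog that records every store holding personal data lets an erasure request reach each entangled copy:

```python
# A toy data catalog: it records which stores hold which categories of
# personal data, so an erasure ("right to be forgotten") request can be
# routed to every copy. Store contents are plain dicts keyed by subject id.

catalog = {
    "customer_pii": ["crm_primary", "crm_replica", "analytics_warehouse"],
}

stores = {
    "crm_primary":         {"alice@example.com": {"name": "Alice"}},
    "crm_replica":         {"alice@example.com": {"name": "Alice"}},
    "analytics_warehouse": {"alice@example.com": {"segment": "premium"}},
}

def erase_subject(subject_id, category="customer_pii"):
    """Delete a data subject from every store the catalog says holds that data."""
    for store_name in catalog[category]:
        removed = stores[store_name].pop(subject_id, None)
        print(f"{store_name}: {'erased' if removed else 'no record found'}")

erase_subject("alice@example.com")
```

Without the catalog, the analytics warehouse copy in this example would be easy to forget, which is exactly how organically grown entanglements put compliance at risk.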
No Technology Negates the Need for Good Design and Planning
In the rapidly changing world of digital disruption, the ever-growing number of technology tools, along with the pressure to implement new solutions quickly, makes it easy to fall into a “ready, fire, aim” approach. Hybrid cloud and huge volumes of data provide tremendous opportunities to add value to your business, but the need for speed does not negate the need for good design and planning.
We as technologists have a responsibility to understand where the entanglements of data are, what their impacts are, and what is required to replicate that information to the right place at the right time, taking into account performance, cost, security and privacy. Only then can we truly provide lasting and scalable business value.