The software industry has been building hybrid clouds for more than a decade now, and we’ve gotten pretty good at it. Strategies for managing them have started to crystallize. The industry is no longer just playing with cloud-based applications; we’re forming best practices for cloud operations, or CloudOps.
For the purposes of this article, let’s break the concept of a hybrid cloud down to its component parts and discuss how best to manage a hybrid cloud in terms of performance. Once we do that, we can combine the parts into a complete hybrid cloud architecture and analyze the holistic performance management features. Then we’ll look at how all of this fits into CloudOps, which is a bit more strategic.
The power of distribution is key
Today’s hybrid clouds are not the same as the ones we deployed just a few years ago. At one time, workloads were statically bound to either a private cloud or a public cloud. These days, workloads can be moved between private and public clouds pretty much at will. We can do this dynamically in near real time or as a manual process, such as moving raw code and data.
The ability to move workloads between private and public clouds is a key feature.
Performance management is about making the components of an application architecture run as efficiently as possible. If you bundle these components into workloads (e.g., database, program, user interface), it’s easier to optimize applications because they are grouped in a way that’s easier to work with.
There are different ways to create workloads, including virtual machines, containers, or applications that are bound together. A detailed discussion about how to move these types of workloads is beyond the scope of this article, but it’s helpful to walk through the process of moving containers.
Containers are self-contained bundles of components that can include application code, databases, middleware, security services, or all of the above. You can have multiple containers that form a single application, or you can have a single container that includes the entire system. Either way, the containers can run on public and private clouds.
The idea we’re attempting to get across here is that there are many options for balancing workloads across private and public clouds. The ways you distribute the workloads between private and public clouds are basically limitless. When it comes to managing performance, you have the power of distribution on your side. So make sure to leverage it.
The components of service monitoring
Because the application’s bits and pieces can run within the private cloud or the public cloud, it’s helpful to manage performance through the services (or microservices) that are externalized by the application or the cloud platform itself. To manage hybrid cloud performance as sets of services, you should consider using the following logical components, or technologies:
1. Service agents
2. Service repository
3. Communications manager
4. Performance analytics engine
5. Time series database
6. Alert management (see Figure 1)
I’ll explore each. Please keep in mind that they are logical concepts, and we’re not mapping these concepts to technology yet.
Figure 1: The conceptual/logical view of the hybrid cloud service performance monitoring system. The idea is to use this as a basis to select the right technology, or technologies, you’ll require.
Before we look at the components and technologies, here are several things you’ll want to have when implementing service performance metrics and monitoring for a hybrid cloud:
- Monitoring for all relevant services that gather performance metrics, including uptime, performance, dependencies, and trends
- Proactive analysis for trending data to determine the likelihood of current or future issues and potentially correct those issues before they occur
- Consideration for service dependencies such as analyzing the performance of services as linked groups
- Performance metrics and monitoring within a sound service governance program
- “Self-healing” — automatic performance issue resolution with systems that learn and become better at taking corrective actions
- Dynamic monitoring that changes its frequency to avoid performance issues when there’s high load on an application
- Centralized reporting, analytics, visual monitoring, and alerting
- Direct business value tracking from the performance metrics and monitoring subsystems
- Historical performance data organization, analysis, visualization, and storage that can track performance issues and resolutions to proactively avoid said issues in the future
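One item in the list above, dynamic monitoring that changes its frequency under load, can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name, the 50 percent load cutoff, and the default intervals are all assumptions made for the example.

```python
def next_poll_interval(load: float, base_seconds: float = 5.0,
                       max_seconds: float = 60.0) -> float:
    """Return how long to wait before the next metrics poll.

    `load` is a hypothetical utilization figure from 0.0 (idle) to 1.0
    (saturated). At or below 50% load we poll at the base rate; above it
    the interval grows linearly toward `max_seconds`, so monitoring backs
    off as the application gets busier.
    """
    load = min(max(load, 0.0), 1.0)  # clamp out-of-range readings
    if load <= 0.5:
        return base_seconds
    fraction = (load - 0.5) / 0.5  # 0.0 at 50% load, 1.0 at 100% load
    return base_seconds + fraction * (max_seconds - base_seconds)
```

A real system would likely derive the load figure from the same metrics it is collecting, and might smooth it over time to avoid oscillating between intervals.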
Service agents
Service agents are software components that run side by side with, or inside of, an existing service, on either public or private clouds (see Figure 1). There could be many services that are bound to a single agent, or a single service bound to a single agent. It’s the job of the service agents to:
- Interact with the point of monitoring in the service during production/operations.
- Interact with the service repository to determine service identity and current dynamic performance thresholds and adhere to those thresholds.
- Update the time series database (see below) as defined in the service repository, at a predefined and dynamic frequency.
- Manage communications and alerts with other components (push).
- Work with the identity management system, or other security subsystems, to manage authentication services.
- Work with the service governance system to leverage policies.
The best way to think about service agents is that they interact with the services in order to determine performance during operations. Also, keep in mind that the agents themselves may cause performance issues, so use them sparingly and only to monitor services that are critical to performance.
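To make the agent's role concrete, here is a minimal sketch of an agent that times a single call to a service and raises an alert when the call exceeds a latency threshold. The class names, the `probe` callable, and the injectable `clock` are hypothetical. In the architecture described above, the threshold would be read from the service repository and the alert pushed through the communications manager rather than returned to the caller.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Alert:
    service_id: str
    latency_ms: float
    threshold_ms: float


class ServiceAgent:
    """Hypothetical agent that times one probe call against a latency threshold."""

    def __init__(self, service_id: str, probe: Callable[[], object],
                 threshold_ms: float,
                 clock: Callable[[], float] = time.perf_counter) -> None:
        self.service_id = service_id
        self.probe = probe                # callable that exercises the service
        self.threshold_ms = threshold_ms  # assumed to come from the repository
        self.clock = clock                # injectable so the agent can be tested
        self.samples: List[float] = []    # stand-in for time series writes

    def check(self) -> Optional[Alert]:
        """Run the probe once, record the latency, and alert on a breach."""
        start = self.clock()
        self.probe()
        latency_ms = (self.clock() - start) * 1000.0
        self.samples.append(latency_ms)
        if latency_ms > self.threshold_ms:
            return Alert(self.service_id, latency_ms, self.threshold_ms)
        return None
```

Keeping the agent this small reflects the caution above: the less work an agent does inline, the less likely it is to become a performance problem itself.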
Service repository
The service repository maintains all service attributes, policies, and identities, providing a single point of service discovery on either the public or private clouds. Typically, this is part of a service governance system, but it could be created specifically for the purpose of performance monitoring and replicate directly out of the service governance repository. It’s the job of the service repository to:
- Provide a place to define current and past service performance thresholds that are read and acted upon by the agents; these span both private and public clouds and can be dynamically altered using APIs.
- Provide up-to-date service identity, including dependencies with other services or systems. This allows for the definition of groups of services (such as composites) that act together to perform a single function. The services are monitored individually and as a group.
- Define the location and binding information for agents that represent each service.
- Store other information relevant to managing the performance of the services.
These days, you should not have to build a service repository from scratch. There are many open-source and proprietary solutions available, and in some instances you may have to adapt to a repository that is already in place.
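Even if you adopt an existing product, it helps to see the kind of record a repository holds. The sketch below is purely illustrative and assumes an in-memory store. The field names and methods are hypothetical and do not reflect any particular product's schema. It shows per-service thresholds that can be changed at run time and dependency information that lets services be monitored as a group.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ServiceRecord:
    service_id: str
    location: str          # e.g. "private" or "public" (assumed labels)
    threshold_ms: float    # current performance threshold read by agents
    depends_on: List[str] = field(default_factory=list)


class ServiceRepository:
    """Hypothetical in-memory repository of service identities and thresholds."""

    def __init__(self) -> None:
        self._records: Dict[str, ServiceRecord] = {}

    def register(self, record: ServiceRecord) -> None:
        self._records[record.service_id] = record

    def get(self, service_id: str) -> ServiceRecord:
        return self._records[service_id]

    def set_threshold(self, service_id: str, threshold_ms: float) -> None:
        """Alter a threshold dynamically, as an API call would."""
        self._records[service_id].threshold_ms = threshold_ms

    def group(self, service_id: str) -> Set[str]:
        """Return the service plus everything it transitively depends on."""
        seen: Set[str] = set()
        stack = [service_id]
        while stack:
            current = stack.pop()
            if current in seen or current not in self._records:
                continue
            seen.add(current)
            stack.extend(self._records[current].depends_on)
        return seen
```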
Communications manager
The communications manager deals with all communications between agents, services, service repositories, databases, analytics, and other components charged with monitoring and managing service performance. Typically, this will be a queue (or some other high-speed middleware layer) that will allow messages to be both produced and consumed by each component in the system. It’s the job of the communications manager to:
- Connect with each component of the service performance management system; this includes providing authentication and validation.
- Consume information from each component, deliver that information to the correct target, and produce that information for the target.
- Maintain high-performance data rates to instantly react to service performance alerts and responses.
- Log all communications for current and future analytics.
- Recover quickly from communications failures, including rolling forward and rolling back.
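The produce/consume pattern described above can be sketched with an in-memory, topic-based queue. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, and it omits the authentication, persistence, and failure recovery that real messaging middleware would supply.

```python
from collections import defaultdict, deque
from typing import Any, DefaultDict, Deque, List, Optional, Tuple


class CommunicationsManager:
    """Hypothetical in-memory message hub with one FIFO queue per topic."""

    def __init__(self) -> None:
        self._queues: DefaultDict[str, Deque[Any]] = defaultdict(deque)
        self.log: List[Tuple[str, Any]] = []  # every message, kept for analytics

    def produce(self, topic: str, message: Any) -> None:
        """Log the message, then queue it for whichever component consumes the topic."""
        self.log.append((topic, message))
        self._queues[topic].append(message)

    def consume(self, topic: str) -> Optional[Any]:
        """Return the oldest pending message on the topic, or None if it is empty."""
        queue = self._queues[topic]
        return queue.popleft() if queue else None
```

For example, an agent could call `produce("alerts", ...)` while the alert management system polls `consume("alerts")`, so neither component needs to know where the other runs.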
Performance analytics engine
The performance analytics engine is a “pluggable” software component that provides “embedded” analytical services. You leverage these analytics to dynamically manage service performance during production. It’s the job of the performance analytics engine to:
- Provide real-time analytical services around the performance of all connected services and recommend changes in threshold, capacity, or behavior. (For instance, if a service’s performance falls below its threshold and the agent generates an alert, the analytics engine can determine an automatic course of action based on that service’s current performance data from the time series database and its profile from the service repository. The resulting action could be to dynamically increase the cache size of the database, reroute to another server, or alert a human.)
- Provide ad hoc reporting on service performance and trending over time.
- Dynamically learn as it gathers data, understanding cause and effect as performance issues are identified and resolved.
- Provide the administrative console as well as APIs for integration with other system management consoles.
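As a rough illustration of how an engine might turn recent performance data into a recommended action, consider the rule-based sketch below. The function name, the action labels, and the 1.5x and 3x cutoffs are assumptions chosen for the example. A real engine would learn such rules from historical data rather than hard-code them.

```python
from statistics import mean
from typing import Sequence


def recommend_action(recent_latencies_ms: Sequence[float],
                     threshold_ms: float) -> str:
    """Map recent latency samples to a hypothetical corrective action."""
    if not recent_latencies_ms:
        return "alert_human"        # no data to reason about, so escalate
    average = mean(recent_latencies_ms)
    if average <= threshold_ms:
        return "no_action"          # the service is within its threshold
    if average <= 1.5 * threshold_ms:
        return "increase_cache"     # mild breach: try a cheap tuning step
    if average <= 3.0 * threshold_ms:
        return "reroute"            # larger breach: move traffic elsewhere
    return "alert_human"            # severe breach: hand off to a person
```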
Time series database
The time series database deals with both structured and unstructured complex data. This database stores all raw data that is recorded around service performance, such as time, service response, database response, network latency, and other information that could be used in the service’s performance profile. There are two key roles of the time series database:
- Storing massive amounts of time series data to actively monitor and analyze performance
- Recording all performance issues (e.g., alerts) and the solutions to those issues so that the system can respond right away the next time they occur
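The first role, storing timestamped measurements and reading them back by time range, can be sketched with a small in-memory store. This is an assumption-laden illustration: the class and method names are hypothetical, and a production system would use a purpose-built time series database rather than Python lists.

```python
import bisect
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Point = Tuple[float, float]  # (timestamp, value)


class TimeSeriesStore:
    """Hypothetical in-memory store that keeps each metric sorted by timestamp."""

    def __init__(self) -> None:
        self._series: DefaultDict[str, List[Point]] = defaultdict(list)

    def write(self, metric: str, timestamp: float, value: float) -> None:
        """Insert a point in timestamp order, even if it arrives late."""
        bisect.insort(self._series[metric], (timestamp, value))

    def query(self, metric: str, start: float, end: float) -> List[Point]:
        """Return all points with start <= timestamp <= end, oldest first."""
        points = self._series.get(metric, [])
        low = bisect.bisect_left(points, (start, float("-inf")))
        high = bisect.bisect_right(points, (end, float("inf")))
        return points[low:high]
```

Keeping points sorted on write makes range queries a pair of binary searches, which matters when the analytics engine reads recent windows of data far more often than agents write individual points.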
Alert management
The alert management system is a piece of software that deals with services placed into an alert status by their respective agents, making sure to deal with the alerts per the predetermined policies that are stored in the service repository. It’s the job of the alert management system to:
- Capture alerts transmitted through the communications manager from the agents; typically, these are alerts generated by services falling out of thresholds, or failing altogether.
- Evaluate each alert in terms of severity and connect to the analytics engine for an analysis of the issue and potential automatic corrective action. The alert management system then generates corrective action, if instructed to do so by the analytics engine. It can also alert humans.
- Record each alert, including cause and resolution, in the time series database to aid in future analysis and determine the right path to fix future performance problems.
- Trace through paths to better determine the origin of the alert and other services that should be dealt with in the resolution of the problem.
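The evaluate-and-record steps above might look something like the following sketch. The severity levels, the 2x cutoff, and the action names are hypothetical. The sketch also simplifies the flow: it picks an action locally, whereas the system described here would consult the analytics engine and record outcomes in the time series database.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ServiceAlert:
    service_id: str
    observed_ms: float
    threshold_ms: float
    failed: bool = False  # True when the service stopped responding entirely


def severity(alert: ServiceAlert) -> str:
    """Classify an alert using assumed cutoffs: failure, 2x breach, or less."""
    if alert.failed:
        return "critical"
    ratio = alert.observed_ms / alert.threshold_ms
    return "major" if ratio >= 2.0 else "minor"


def handle(alert: ServiceAlert, history: List[Dict[str, str]]) -> str:
    """Choose a response for the alert and record it for future analysis."""
    level = severity(alert)
    action = "notify_human" if level == "critical" else "request_corrective_action"
    history.append({"service": alert.service_id, "severity": level, "action": action})
    return action
```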
Keep in mind that what we define here is conceptual in nature. However, it’s a good idea to work from your conceptual requirements to your actual performance management solutions. Selecting technology will be the final step in this conceptual process. You have dozens of open-source and proprietary systems to consider, including, but not limited to:
- Umpire alerting-controller
There are no common patterns to these tools, and they are typically purpose-built for specific aspects of performance management. You’ll likely have to select several to build your ultimate solution.
This is not easy stuff, as you may have gathered. While we’ve been doing hybrid cloud computing for some time, performance has been somewhat of an afterthought. That said, as we focus more on CloudOps and production-quality systems that exist on hybrid clouds, performance management needs to be a core part of the deployment strategy of CloudOps. Solve this issue now, or you’ll have a huge problem in the future.