In a few short years, containers have established themselves as indispensable tools for managing portable, stateless applications like Web servers and microservices. But they have taken time to catch on in the world of data science, where they have been viewed as too lightweight to package and manage complex, stateful services that deal with big data.
This perception is changing. Users and vendors are starting to embrace containers and Kubernetes, the most popular orchestration platform, as tools to facilitate deployments of big data systems and applications. It is still early in the evolution, but experts see Kubernetes as the foundation for a new generation of machine learning (ML), artificial intelligence (AI), data management and distributed storage uses in cloud native environments. Some even say these technologies are paving the way for a whole new field with a flashy name – DataOps.
What happened? How did Kubernetes’ reputation build up so quickly in the data field? And what has to happen for containers to really become the backbone of data science going forward?
Kubernetes Gains Ground
Part of the story is the general market acceptance of Kubernetes. Measured by release velocity and by the number of contributors, vendors and users adopting it, the orchestrator is among the fastest growing open source projects ever. It is supported by all three major cloud platforms – AWS, Microsoft Azure and Google Cloud Platform – and providers Alibaba, Huawei, Oracle and Tencent all offer Kubernetes-as-a-Service products. More than 50 vendors ship Kubernetes distributions certified by the Cloud Native Computing Foundation (CNCF).
But the story goes beyond all that. It ties to the ways data scientists are dealing with data itself.
In the past, the data science discussion focused on which big data architecture you were running. The way you managed data depended on where the data was and how you worked with big data frameworks like Hadoop or Spark. Now, with the help of containers, data science is becoming less dependent on where and how the underlying data is stored. As long as users can reach the data efficiently, it does not matter where it lives. That gives data scientists more freedom to build models, blend resources and analyze data.
Consequently, data scientists are looking for new ways to leverage their data. Up until now, most of it was stored in data lakes. Now, users are looking at ways to create hybrid data lakes, where some data is stored in on-premises Hadoop clusters, and other data sets are stored in the cloud. Using containers to store data models and Kubernetes to manage their delivery, data scientists have more flexibility to process and analyze data.
Containers offer other benefits as well. They achieve isolation with less overhead than virtual machines (VMs) or physical servers, so the same hardware can run roughly four to six times as many application instances as it could with traditional VMs. Plus, once IT has a container image, data scientists across the organization can use that image to create new environments as needed. An IT team managing the work of hundreds of data scientists can use containers to ease the development of data science environments and models covering a wide range of tools and languages.
Using Kubernetes for Data Projects
Big data relies on a number of projects and services to get where it needs to go. YARN handles resource scheduling, and ZooKeeper provides coordination and service discovery. These programs work well in on-premises environments, but they have not advanced at the pace of technologies built for the cloud. In contrast, scheduling, consistency, service discovery and infrastructure management were all designed into Kubernetes as part of the core platform from day one. Kubernetes offers plug-ins for each function and supports alternative schedulers, giving data scientists a much wider variety of options.
Using Kubernetes as a common orchestration layer for all containerized apps has several benefits:
- Better resource utilization through centralized scheduling of data science and other containerized applications
- Portability for workloads
- A single scheduling solution for multiple environments, on-premises or in multiple clouds
- The ability for IT to create self-service environments for data scientists and other data users
Several strategic product introductions in recent years have accelerated the use of containers in data science applications. The 2.3 release of Apache Spark, with native Kubernetes support, made Kubernetes much more accessible to data scientists, enterprises and startups trying to make sense of data. Mesosphere, the company behind the rival DC/OS orchestration platform, announced its support for Kubernetes at the end of 2017.
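To illustrate, since version 2.3 a Spark job can be submitted directly to a Kubernetes API server, which then launches the driver and executors as pods. A sketch of such a submission, with the API server address and container image left as placeholders:

```shell
# Submit Spark's bundled SparkPi example to a Kubernetes cluster.
# <k8s-apiserver-host> and <your-spark-image> are placeholders.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

Because the executors are ordinary pods, Spark workloads share the cluster's scheduler and resource quotas with every other containerized application.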
The two most influential developments were the advancement of the Kubeflow project and the introduction of Kubernetes on NVIDIA GPUs. Both of these changed the whole outlook of learning models.
Google developed Kubeflow, a machine learning stack that grew out of its popular TensorFlow ML framework. It is designed to simplify and scale the framework-agnostic modeling, training, serving and management of containerized AI models across multicloud Kubernetes environments, with the goal of embedding AI-driven intelligence in edge, hub and cloud services alike.
This has made it easier to set up and productionize ML workloads on Kubernetes. It changes the game by letting engineers deploy the entire lifecycle of a model consistently – from setting up Jupyter Notebooks and training environments, to packaging and serving trained models in production – using a single framework. Kubeflow abstracts the underlying resources, so the same deployment works in any environment.
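In practice, a training run becomes a declarative Kubernetes resource. A minimal sketch of a Kubeflow TFJob manifest, assuming the current `kubeflow.org/v1` API; the image name and training script are hypothetical placeholders:

```yaml
# Hypothetical two-worker TensorFlow training job managed by Kubeflow.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2          # Kubeflow creates one pod per replica
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.com/mnist-train:latest  # placeholder image
              command: ["python", "/opt/train.py"]   # placeholder script
```

Applying the manifest with `kubectl apply -f` hands the whole training topology to the cluster, the same way any other containerized app is deployed.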
In the past, data scientists developed their algorithms in complete isolation, using esoteric, proprietary systems and languages. Kubeflow technology enables the data scientist and software engineer to share roles, creating a new system of DataOps collaboration.
Kubernetes on NVIDIA GPUs extends the orchestration platform with GPU acceleration across multicloud environments. A GPU is a specialized processor that accelerates highly parallel, computationally intensive workloads, which makes it particularly well suited to deep learning. With NVIDIA's CUDA drivers and device plugin installed on cluster nodes, teams can automate the deployment, maintenance, scheduling and operation of multiple GPU-accelerated application containers across clusters of nodes.
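Once the device plugin advertises GPUs to the cluster, a pod claims one through the `nvidia.com/gpu` extended resource and the scheduler places it on a GPU node automatically. A sketch, with a placeholder container image:

```yaml
# Pod requesting one NVIDIA GPU via the device plugin's extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-training
spec:
  containers:
    - name: trainer
      image: example.com/cuda-trainer:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1  # scheduler binds the pod to a node with a free GPU
```

The data scientist never names a specific machine; the GPU is just another schedulable resource alongside CPU and memory.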
For the most part, data applications still live in the old world of IT, on Hadoop or Spark platforms in on-premises environments. Companies have too much invested to rip and replace everything overnight. But hybrid data environments are coming – quickly – and early adopters are benefitting from the change. They are confident that containers running in Kubernetes clusters will accelerate big data development by enabling system and app code to be reused. The data paradigm is evolving, and the Kubernetes community is driving the change.