The traditional approach to building a big data platform is to deploy a Hadoop cluster, install additional tools on it, and build the data platform on top. But this approach has several limitations: the Storage and Compute layers cannot be separated, and it is difficult to scale and to isolate environments for different applications. Even when Hadoop is rented from a cloud provider as a service, this approach is still not much different from deploying on your own hardware.
However, there is another, Cloud-Native approach to working with big data. It solves these problems and also unlocks additional capabilities of cloud technologies. It is built around Kubernetes, integrated with various tools.
I am Alexander Volynsky, an architect of the Mail.ru Cloud Solutions cloud platform. I will explain how Kubernetes helps with Big Data, which tools are used, and what advantages you gain over traditional deployment. You can also watch a video presentation at the Big Data: Not HYIP, but an Industry meetup.
A Little About The Cloud-Native Approach In Data Science
The Cloud-Native approach to working with big data differs from the traditional one in that it:
- Separates the Storage and Compute layers. In Hadoop, each node belongs to both the Storage and the Compute layer. To increase HDFS capacity, you have to add new nodes to the cluster, and with them processor cores that you may not need. In addition, Hadoop, Arenadata DB, and ClickHouse are relatively hard to scale: it is difficult to add and remove nodes quickly.
- Uses distributed, fault-tolerant S3 storage, which is cheaper than HDFS. By moving data to S3, we separate the Storage and Compute layers, which solves the problem of growing storage independently. In addition, with data in S3 we consume storage as a service: there is no need to size nodes, manage hardware, monitor cluster capacity, or decide when it is time to scale.
- Provides benefits such as autoscaling, environment isolation, and tool integration. This is possible because some of the tools run in Kubernetes.
Let’s take a closer look using specific tools as examples:
- Spark – for data processing,
- Presto and Hive Metastore – for data access,
- Superset – for building dashboards,
- Airflow – for managing workflows,
- Amundsen – for Data Discovery,
- JupyterHub – for training models and experiments in Data Science,
- Kubeflow – for building MLOps in Kubernetes.
Spark In Kubernetes For Data Processing
The current industry standard for big data processing is Spark, a fast and versatile data processing engine.
Benefits Of Running Spark On Kubernetes:
- Flexible scaling: Kubernetes as a service in the cloud gives you a highly scalable system. For example, ten cores are usually enough for your Spark application, but once a week, for complex calculations, you need 100 or 1,000 cores. When the load from the application increases, Kubernetes as a service in the cloud provides the necessary resources through automatic scaling; when the load decreases, these resources automatically return to the cloud. So you only pay for the time during which you consume these resources.
- Environment isolation: A typical problem with a large Hadoop cluster is the compatibility of different versions of programs and libraries. For example, you are currently using Spark 2.4; all your pipelines and applications have been tested and work with this version. Version 3 comes out, and updating Spark on the cluster requires testing all applications, some of which may need to be reworked. Kubernetes solves these problems through containerization. Containers create separate, independent environments that do not affect their neighbors in any way. This way, you can run several different versions of Spark or other libraries and applications in a cluster simultaneously. And when a new version comes out, you only need to build a new image, without updating or revising old applications.
How To Run Spark On Kubernetes
- Spark-Submit is the Spark way. Here, the Kubernetes cluster is specified as the master. Kubernetes does not know that Spark is running inside it; to Kubernetes, it is just another application. This makes it harder to access the logs and check the status of your applications.
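As a rough illustration of the first approach, a submission might look like the sketch below. The API server address, registry, and jar path are placeholders, not values from this article; the `k8s://` master URL, `--deploy-mode`, and `spark.kubernetes.*` options are standard Spark configuration:

```shell
# Submit directly to Kubernetes: the k8s:// master URL points at the API server.
# <k8s-apiserver-host> and <registry> are placeholders for your own values.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<registry>/spark:3.1.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
```

Note that the driver and executors run as ordinary pods, so afterwards you inspect them with generic `kubectl` commands rather than Spark-aware tooling.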
- Kubernetes Operator For Spark is the Kubernetes way. This is the proper way for Kubernetes and Spark to communicate. Here, Kubernetes knows that Spark is running inside it, which solves the problems of accessing logs and getting the current status of your applications. It also lets us describe applications declaratively: we describe the desired state, and Kubernetes itself works to bring the application to that state. We recommend using this method.
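With the operator, the declarative description is a `SparkApplication` custom resource. Below is a minimal sketch of such a manifest; the image and file paths are placeholders, and exact field names may vary between operator versions:

```yaml
# Declarative Spark job for the Kubernetes Operator for Apache Spark.
# <registry> is a placeholder for your container registry.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: <registry>/spark:3.1.1
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    cores: 1
    instances: 3
    memory: 512m
```

You apply it with `kubectl apply -f` like any other resource, and the operator reconciles the cluster toward the described state; the application's status is then visible through `kubectl get sparkapplications`.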