There are fundamental differences between working with databases and working with data lakes. We have translated a short article on how a data lake is designed. It is useful for those who do not have a lot of experience with relational databases.
Storage And Compute Servers Are Independent Of Each Other
The storage and compute servers operate separately, which is the key difference between a data lake and a database.
In traditional databases (and in the earliest Hadoop-based data lakes), storage is tightly coupled with the compute servers: either the storage is built into the server, or the server is directly attached to it.
In today’s cloud data lake architecture, storage is platform agnostic. Data is stored in cloud object storage – usually in an open format like Parquet. Stateless servers are used for computing; they can be turned on and off as needed.
The advantages of this approach:
- Reduced computing costs. The servers do not have to run all the time; they can be turned off while idle, which lowers operating costs.
- Scalability. You do not need to purchase equipment for peak loads. The number of servers, processors, and memory modules can be increased or decreased as needed.
- Autonomy. Several compute servers and clusters can read the same data at the same time, so different teams can work with the same data in parallel without interfering with each other.
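The separation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not a real deployment: a temporary local directory stands in for cloud object storage, JSON Lines files stand in for Parquet, and the names `write_object`, `read_objects`, `total_revenue`, and `orders_per_country` are hypothetical. Each compute function holds no data of its own and reads everything from the shared storage.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def write_object(store, key, rows):
    """Write rows as a JSON Lines file under the given key (stand-in for an object-store upload)."""
    path = store / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(json.dumps(row) for row in rows), encoding="utf-8")


def read_objects(store, prefix):
    """Yield every row stored under a key prefix; the reader keeps no state between calls."""
    for path in sorted((store / prefix).glob("*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            yield json.loads(line)


def total_revenue(store):
    """Compute job used by one team."""
    return sum(row["amount"] for row in read_objects(store, "orders"))


def orders_per_country(store):
    """Independent compute job used by another team, reading the same objects."""
    return Counter(row["country"] for row in read_objects(store, "orders"))


# The "storage layer": it exists independently of any compute job that reads it.
store = Path(tempfile.mkdtemp())
write_object(store, "orders/part-0.jsonl", [
    {"country": "DE", "amount": 10.0},
    {"country": "FR", "amount": 5.5},
])
write_object(store, "orders/part-1.jsonl", [{"country": "DE", "amount": 4.5}])

# Two unrelated jobs read the same data without coordinating with each other.
revenue = total_revenue(store)
by_country = orders_per_country(store)
print(revenue, dict(by_country))
```

Because neither job owns the data, either one can be started, stopped, or scaled without affecting the other or the stored files.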
Raw Data Is More Important Than Processed Data
In a database, data is taken from source systems, transformed, and loaded into a table, after which the raw data is no longer used. In a data lake, raw data is kept indefinitely and is treated as a valuable asset.
But business users generally cannot work with raw data, so it is processed to improve its quality and make it structured and usable. The processed data is then stored for use by analysts and business users.
Business users only see processed data and therefore value it much more than the raw data from which it was obtained. But the actual value of a data lake lies in the raw data and in how you work with it. In a sense, the processed data is like a materialized view that can be refreshed at any time. This means that:
- at any time, the necessary data can be recreated from the original;
- it can be recreated using improved processing techniques;
- data can be presented in different ways depending on the characteristics of a particular analysis.
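The materialized-view analogy can be illustrated with a small sketch. The sample records and the names `process_v1`, `process_v2`, and `totals` are hypothetical; the point is only that the raw data stays untouched while the processed view is rebuilt from it with better logic.

```python
# Raw events are kept exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    {"user": " Alice ", "amount": "10.50"},
    {"user": "bob", "amount": "n/a"},
    {"user": "ALICE", "amount": "4.50"},
]


def process_v1(raw):
    """First processing version: parse amounts and drop rows that fail to parse."""
    view = []
    for event in raw:
        try:
            view.append({"user": event["user"], "amount": float(event["amount"])})
        except ValueError:
            continue  # unparseable amount, skip the row
    return view


def process_v2(raw):
    """Improved version: additionally normalizes user names so totals group correctly."""
    return [
        {"user": row["user"].strip().lower(), "amount": row["amount"]}
        for row in process_v1(raw)
    ]


def totals(view):
    """One possible presentation of the processed data: amount per user."""
    result = {}
    for row in view:
        result[row["user"]] = result.get(row["user"], 0.0) + row["amount"]
    return result


view_v1 = process_v1(RAW_EVENTS)  # the initial "materialized view"
view_v2 = process_v2(RAW_EVENTS)  # refreshed from the same raw data with better logic
print(totals(view_v1))
print(totals(view_v2))
```

With the first version, the two spellings of the same user are counted separately; rebuilding the view with the second version fixes that without touching the raw events.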
The Processing Schema Can Be Changed At Any Time
Information requirements change frequently, and later it may be necessary to analyze data that was not included in the original schema. With a database, raw data that was not saved is irretrievably lost.
Data lakes work differently: if you decide today that certain data does not need to be loaded into the processing system, nothing terrible will happen, because you can add it later. All raw data is securely stored in the data lake, and the processed output can be recreated from it at any time. As a result:
- you do not need to create one general data processing schema for all occasions if you do not need it right now;
- you can build the data processing schema iteratively, adding only those fields that are needed right now;
- if you need additional fields, you can add them at any time and repeat the process.
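The iterative approach from the list above might look like the following sketch. The sample records, the field names, and the helper `apply_schema` are hypothetical and serve only as an illustration: the schema is just a list of fields applied when the raw records are read, so extending it means reprocessing the same raw data.

```python
import json

# Raw records are stored unchanged, including fields nobody uses yet (hypothetical data).
RAW_LINES = [
    '{"id": 1, "email": "a@example.com", "plan": "pro", "country": "DE"}',
    '{"id": 2, "email": "b@example.com", "plan": "free"}',
]


def apply_schema(raw_lines, fields):
    """Project each raw record onto the requested fields; missing fields become None."""
    return [
        {name: record.get(name) for name in fields}
        for record in map(json.loads, raw_lines)
    ]


# Iteration 1: only the fields needed today.
schema = ["id", "plan"]
table_v1 = apply_schema(RAW_LINES, schema)

# Iteration 2: a new question needs "country", so extend the schema and reprocess.
schema = schema + ["country"]
table_v2 = apply_schema(RAW_LINES, schema)

print(table_v1)
print(table_v2)
```

Nothing had to be decided about `country` up front; the field was available later because the raw records were never discarded.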
Data lakes do not replace databases; each tool has its strengths and weaknesses. It makes as little sense to use a data lake for OLTP as it does to use a database for storing unstructured data. I hope my article helped you understand the differences between the two systems.