Abstract
As RAM capacity grows exponentially, RAM is becoming an important instrument for big data processing. When possible, we prefer to store an entire dataset in RAM and perform in-memory computing and data analytics on it. Such an approach can speed up business processes substantially. Data analysis programs need to retrieve data from a database management system (DBMS) before analyzing it. In addition, a typical data analysis application involves a number of stages that cooperate to produce the analysis results, and these stages often need to exchange large volumes of data. Despite the prevalence of in-memory computing in data analytics, the traditional architecture for transmitting data from the database to the data analysis program has not changed, and it has become a performance bottleneck for in-memory data analytics. One of the key reasons for this bottleneck is that the inter-process communication (IPC) support of modern operating systems is inadequate. Pipes and sockets are slow. Shared memory is fast but difficult to manage, because memory allocation and data synchronization must be handled carefully. Some systems, such as SAP HANA, try to avoid inter-process data exchange by injecting data processing programs into the process space of the DBMS. However, such tight coupling does not suit all applications. We implemented a new IPC method named SWING in the Linux kernel. It is fast and convenient, and it enables loose coupling between data processing programs. With SWING, any set of processes can share a segment of memory, and all of them can read the same contents. When one of them writes to the segment, the operating system applies copy-on-write for that process, so the write does not affect the other processes.
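The copy-on-write behavior described here can be illustrated with standard tools, although not with SWING itself, whose kernel interface the abstract does not specify. The sketch below is only an analogy: it uses Python's `mmap` module with `ACCESS_COPY` to create two private, copy-on-write views of one temporary file. The two views stand in for two processes sharing a segment, and the function name `cow_views_demo` is hypothetical.

```python
import mmap
import tempfile


def cow_views_demo(initial: bytes, patch: bytes):
    """Map one file twice copy-on-write and write through one view only.

    Assumes `initial` is non-empty and `patch` is no longer than `initial`.
    Returns (writer_view, reader_view, file_contents) as bytes.
    """
    with tempfile.TemporaryFile() as f:
        f.write(initial)
        f.flush()
        # ACCESS_COPY: writes change this mapping's private pages only.
        writer = mmap.mmap(f.fileno(), len(initial), access=mmap.ACCESS_COPY)
        reader = mmap.mmap(f.fileno(), len(initial), access=mmap.ACCESS_COPY)
        try:
            # The write triggers a private copy of the touched page.
            writer[:len(patch)] = patch
            f.seek(0)
            return writer[:], reader[:], f.read()
        finally:
            writer.close()
            reader.close()


if __name__ == "__main__":
    w, r, on_disk = cow_views_demo(b"snapshot-data", b"SNAP")
    print(w, r, on_disk)
```

Only the writing view observes the change. The other view and the backing file keep the original contents, which is the same isolation the abstract describes for a process that writes to a SWING segment.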
SWING is similar to fork, but it also works for processes outside a parent-child relationship. In effect, more than one process can share multiple segments of memory with the same process, which cannot be achieved with fork. Based on SWING, we developed a memory allocator named SwingMalloc, which makes SWING easy to use. Each call to SWING allocates 512 GB of virtual memory space, so using it directly may waste a lot of logical address space. SwingMalloc allows fine-grained space allocation, which makes it friendlier to processes. SwingMalloc divides a copy-on-write (COW) memory area into two parts. One part is managed with the buddy memory allocation algorithm. The other part is divided into blocks of several fixed sizes, which are allocated to processes according to their needs. Based on SwingMalloc, we developed a new in-memory embedded DBMS called SwingDB. With SwingDB, each process accesses a database in its own memory space, without incurring inter-process communication. The data in a SwingDB instance is stored entirely in a SWING memory area, so independent processes can share snapshots of their database instances through the SWING mechanism. SwingDB is especially suitable for multi-stage in-memory data processing, in which several loosely coupled programs cooperate to perform data analysis. These applications can access the entire database in their own memory space, instead of resorting to expensive traditional IPC methods.
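The abstract names the buddy memory allocation algorithm but gives no details of SwingMalloc's implementation. As a rough illustration of that algorithm only, the following is a minimal textbook-style sketch over byte offsets. The class name `BuddyAllocator`, its parameters, and the minimum block size are assumptions made for this example, and the fixed-size block region is not modeled.

```python
class BuddyAllocator:
    """Toy buddy allocator over the offset range [0, 2**max_order)."""

    def __init__(self, max_order: int, min_order: int = 4):
        self.max_order = max_order
        self.min_order = min_order
        # free[k] holds the start offsets of free blocks of size 2**k.
        self.free = {k: set() for k in range(min_order, max_order + 1)}
        self.free[max_order].add(0)
        self.allocated = {}  # start offset -> order of the allocated block

    def _order_for(self, size: int) -> int:
        """Smallest order whose block size 2**order can hold `size` bytes."""
        order = self.min_order
        while (1 << order) < size:
            order += 1
        return order

    def alloc(self, size: int) -> int:
        if size <= 0:
            raise ValueError("size must be positive")
        order = self._order_for(size)
        # Find the smallest free block that is large enough.
        k = order
        while k <= self.max_order and not self.free[k]:
            k += 1
        if k > self.max_order:
            raise MemoryError("no free block large enough")
        offset = min(self.free[k])  # lowest address, for determinism
        self.free[k].remove(offset)
        # Split repeatedly; the upper half of each split becomes free.
        while k > order:
            k -= 1
            self.free[k].add(offset + (1 << k))
        self.allocated[offset] = order
        return offset

    def free_block(self, offset: int) -> None:
        order = self.allocated.pop(offset)
        # Merge with the buddy for as long as the buddy is also free.
        while order < self.max_order:
            buddy = offset ^ (1 << order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            offset = min(offset, buddy)
            order += 1
        self.free[order].add(offset)


if __name__ == "__main__":
    heap = BuddyAllocator(max_order=10)  # a 1024-byte arena
    a = heap.alloc(100)
    b = heap.alloc(100)
    print(a, b)
    heap.free_block(a)
    heap.free_block(b)
```

Each request is rounded up to a power-of-two block size, which is why an allocator of this kind is often paired with a separate region of fixed-size blocks for small objects, as the abstract describes for SwingMalloc.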
| Translated title of the contribution | Memory Instant Snapshot Sharing Mechanism and Its Application in Database |
|---|---|
| Original language | Chinese (Traditional) |
| Pages (from-to) | 1912-1927 |
| Number of pages | 16 |
| Journal | Jisuanji Xuebao/Chinese Journal of Computers |
| Volume | 41 |
| Issue number | 8 |
| DOIs | |
| State | Published - 1 Aug 2018 |