OMR: Out-of-Core MapReduce for Large Data Sets
While single-machine MapReduce systems can squeeze maximum performance out of the available cores, they are often limited by the size of main memory and can thus process only small datasets. Our experience shows that the state-of-the-art single-machine in-memory MapReduce system Metis frequently crashes with out-of-memory errors. Even though today's computers are equipped with efficient secondary storage devices, these frameworks do not utilize them, mainly because disk access latencies are much higher than those of main memory. As a result, the single-machine setup of the Hadoop system runs much slower when it is presented with datasets larger than main memory. Moreover, such frameworks require tuning many parameters, which puts an added burden on the programmer. In this paper we present OMR, an Out-of-core MapReduce system that not only handles datasets far larger than main memory but also guarantees linear scaling with growing data sizes. OMR actively minimizes the amount of data read from and written to disk via on-the-fly aggregation, and it uses block-sequential disk reads and writes whenever disk accesses become necessary to avoid running out of memory. We theoretically prove OMR's linear scalability and empirically demonstrate it by processing datasets that are up to 5x larger than main memory. Our experiments show that, in comparison to the standalone single-machine setup of the Hadoop system, OMR delivers far higher performance. In contrast to Metis, OMR also avoids out-of-memory crashes on large datasets and delivers higher performance when datasets are small enough to fit in main memory.
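The abstract names two ideas: aggregating values on the fly so that less data reaches the disk, and touching the disk only through sequential reads and writes. The sketch below illustrates that general pattern for a word count. It is not OMR's implementation; the function names, the `max_keys` memory budget, and the tab-separated run-file format are all assumptions made for illustration.

```python
import heapq
import os
import tempfile
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def spill(partial, run_dir):
    """Write one in-memory table to disk as a key-sorted run (sequential write)."""
    fd, path = tempfile.mkstemp(dir=run_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for key in sorted(partial):
            f.write(f"{key}\t{partial[key]}\n")
    return path


def read_run(path):
    """Stream (key, count) pairs back from a run file in key order."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t")
            yield key, int(value)


def out_of_core_word_count(records, max_keys=1000):
    """Count words while holding at most max_keys distinct keys in memory.

    max_keys is a hypothetical stand-in for a real memory budget.
    """
    with tempfile.TemporaryDirectory() as run_dir:
        partial = defaultdict(int)
        runs = []
        for record in records:
            for word in record.split():
                # On-the-fly aggregation: combine before anything hits disk.
                partial[word] += 1
                if len(partial) >= max_keys:
                    runs.append(spill(partial, run_dir))
                    partial.clear()
        if partial:
            runs.append(spill(partial, run_dir))

        # Merge the sorted runs with one sequential pass over each file.
        merged = heapq.merge(*(read_run(p) for p in runs), key=itemgetter(0))
        # Returned as a dict for brevity; a real system would stream results out.
        return {
            key: sum(count for _, count in group)
            for key, group in groupby(merged, key=itemgetter(0))
        }
```

For example, `out_of_core_word_count(["a b a", "c b a"], max_keys=2)` spills three small runs and merges them into the final counts. Because each run is already aggregated and sorted, the merge never needs random disk access.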
Mon 18 Jun
Displayed time zone: Eastern Time (US & Canada)
14:00 - 15:30
14:00 | 30m | Talk | Hardware-Software Co-optimization of Memory Management in Dynamic Languages | ISMM 2018
14:30 | 30m | Talk | Dynamic Vertical Memory Scalability for OpenJDK Cloud Applications | ISMM 2018 | Rodrigo Bruno (INESC-ID / Instituto Superior Técnico, University of Lisbon), Paulo Ferreira (INESC-ID / Instituto Superior Técnico, University of Lisbon), Ruslan Synytsky (Jelastic, n.n.), Tetiana Fydorenchyk (Jelastic, n.n.), Jia Rao (University of Texas at Arlington, USA), Hang Huang (Huazhong University of Science and Technology, China), Song Wu (Huazhong University of Science and Technology, China)
15:00 | 30m | Talk | OMR: Out-of-Core MapReduce for Large Data Sets | ISMM 2018 | Gurneet Kaur, Keval Vora (University of California, Riverside), Sai Charan Koduru (University of California, Riverside), Rajiv Gupta (UC Riverside)