Support vector regression model for BigData systems

Published 5 Dec 2016 in cs.DC, cs.LG, and cs.PF | (1612.01458v1)

Abstract: Nowadays Big Data are becoming more and more important. Many sectors of our economy are now guided by data-driven decision processes. Big Data and business intelligence applications are facilitated by the MapReduce programming model while, at infrastructural layer, cloud computing provides flexible and cost effective solutions for allocating on demand large clusters. In such systems, capacity allocation, which is the ability to optimally size minimal resources for achieve a certain level of performance, is a key challenge to enhance performance for MapReduce jobs and minimize cloud resource costs. In order to do so, one of the biggest challenge is to build an accurate performance model to estimate job execution time of MapReduce systems. Previous works applied simulation based models for modeling such systems. Although this approach can accurately describe the behavior of Big Data clusters, it is too computationally expensive and does not scale to large system. We try to overcome these issues by applying machine learning techniques. More precisely we focus on Support Vector Regression (SVR) which is intrinsically more robust w.r.t other techniques, like, e.g., neural networks, and less sensitive to outliers in the training set. To better investigate these benefits, we compare SVR to linear regression.