Efficient Routing for Cost Effective Scale-out Data Architectures

Published 28 Jun 2016 in cs.DB | (1606.08884v1)

Abstract: Efficient retrieval of information is of key importance when using Big Data systems. In large scale-out data architectures, data are distributed and replicated across several machines. Queries/tasks to such data architectures are sent to a router, which determines the machines containing the requested data. Ideally, to reduce the overall cost of analytics, the router should return the smallest set of machines required to satisfy the query. Mathematically, this can be modeled as the set cover problem, which is NP-hard, thus making the routing process a balance between optimality and performance. Even though an efficient greedy approximation algorithm for routing a single query exists, there is currently no better method for processing multiple queries than running the greedy set cover algorithm repeatedly for each query. This method is impractical for Big Data systems, and the state-of-the-art techniques route a query to all machines and choose as a cover the machines that respond fastest. In this paper, we propose an efficient technique to speed up the routing of a large number of real-time queries while minimizing the number of machines that each query touches (query span). We demonstrate that by analyzing the correlation between known queries and performing query clustering, we can reduce the set cover computation time, thereby significantly speeding up routing of unknown queries. Experiments show that our incremental set cover-based routing is 2.5 times faster and can return on average 50% fewer machines per query when compared to repeated greedy set cover and baseline routing techniques.
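
To make the set cover formulation concrete, the following is a minimal sketch of the standard greedy approximation applied to a single query. It is not the incremental, clustering-based technique proposed in the paper, only the per-query baseline the abstract refers to. The function name `greedy_route`, the `placement` mapping, and the machine and item names are all hypothetical and chosen for illustration.

```python
from typing import Dict, List, Set


def greedy_route(query: Set[str], placement: Dict[str, Set[str]]) -> List[str]:
    """Return a small list of machines whose data together cover `query`.

    Standard greedy set cover: repeatedly pick the machine that holds the
    most still-uncovered items. `placement` maps a (hypothetical) machine
    name to the set of data items stored on it.
    """
    uncovered = set(query)
    chosen: List[str] = []
    while uncovered:
        # Iterate over sorted names so ties are broken deterministically.
        best = max(sorted(placement), key=lambda m: len(placement[m] & uncovered))
        gained = placement[best] & uncovered
        if not gained:
            raise ValueError(f"items not stored on any machine: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Illustrative placement with replication (made-up data).
placement = {
    "m1": {"a", "b", "c"},
    "m2": {"c", "d"},
    "m3": {"d", "e", "f"},
    "m4": {"a", "f"},
}
cover = greedy_route({"a", "b", "d", "e"}, placement)
print(cover)  # the query span is len(cover)
```

In this sketch every query rescans all machines on each iteration, which illustrates why repeating the greedy computation independently for each of a large number of queries becomes costly.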