Balanced allocation: considerations from large scale service environments

1 points by asb

Abstract: We study d-way balanced allocation, which assigns each incoming job to the lightest loaded among d randomly chosen servers. While prior work has extensively studied the performance of the basic scheme, there has been less published work on adapting this technique to many aspects of large-scale systems. Based on our experience in building and running planet-scale cloud applications, we extend the understanding of d-way balanced allocation along the following dimensions:

(i) Bursts: Events such as breaking news can produce bursts of requests that may temporarily exceed the servicing capacity of the system. Thus, we explore what happens during a burst and how long it takes for the system to recover from such bursts.

(ii) Priorities: Production systems need to handle jobs with a mix of priorities (e.g., user facing requests may be high priority while other requests may be low priority). We extend d-way balanced allocation to handle multiple priorities.

(iii) Noise: Production systems are often typically distributed and thus d-way balanced allocation must work with stale or incorrect information. Thus we explore the impact of noisy information and their interactions with bursts and priorities.

We explore the above using both extensive simulations and analytical arguments. Specifically we show, (i) using simulations, that d-way balanced allocation quickly recovers from bursts and can gracefully handle priorities and noise; and (ii) that analysis of the underlying generative models complements our simulations and provides insight into our simulation results.