Concurrent systems face a delicate balance: enabling high throughput under contention while maintaining low latency and predictable behavior. Traditional locking strategies and naive parallelization often introduce performance bottlenecks through cache contention, false sharing, and excessive context switching. These issues are further exacerbated in Cloud and virtualized environments, where performance unpredictability, noisy neighbors, and limited control over physical resources introduce additional challenges.
This post explores a small set of techniques that lead to more scalable and efficient solutions in Java-based systems, highlighting the differences that Cloud environments introduce.
The code snippets are provided for illustrative purposes and should not be used in production.
Bits of theory of batching
Batching techniques are essential in concurrent systems to enhance throughput and efficiency by amortizing the fixed overhead associated with each operation. Network requests, disk I/O, or lock acquisitions incur baseline costs (serialization, context switching, lock contention, etc.). Multiple operations can be grouped into a single batch, reducing per-operation costs. For instance, database engines and remote APIs can perform substantially better when multiple queries or updates are bundled into a batch, as this minimizes redundant overhead and exploits data locality or pipeline optimizations.
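As a concrete illustration, the sketch below uses standard JDBC batching (PreparedStatement.addBatch and executeBatch) to bundle multiple inserts into a single round trip; the connection URL and the events table are hypothetical placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class JdbcBatchExample {

    // Inserts all events in one JDBC batch instead of one statement per row.
    // The JDBC URL and the "events" table are placeholders for illustration only.
    static void insertEvents(List<String> events) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/demo");
             PreparedStatement ps = conn.prepareStatement("INSERT INTO events(payload) VALUES (?)")) {
            for (String event : events) {
                ps.setString(1, event);
                ps.addBatch();          // accumulate the statement client-side
            }
            ps.executeBatch();          // one round trip for the whole batch
        }
    }
}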
Some technical aspects of batching
Batching typically involves a dedicated thread or asynchronous task that accumulates items and applies a flushing strategy. Choosing the optimal batching strategy depends on the application’s latency requirements and workload patterns. Common flushing strategies are:
- Size-based batching: flush when the batch reaches a predefined number of items. This ensures full batch utilization, which is ideal for throughput. However, if task arrivals are sparse, the batch might never fill and tasks can wait indefinitely.
- Time-based batching: flush at regular intervals (e.g., every 100 ms), regardless of how many items are in the batch. This bounds the maximum wait time for any task, keeping latency low. However, during periods of low load it can produce underutilized or empty batches, which is inefficient.
- Hybrid (adaptive) batching: flush either when the batch reaches a size threshold or when a timeout expires, whichever comes first. This balances latency and throughput more effectively, especially under variable load conditions.
The third strategy leads to the most complete form of so-called micro-batching, where small workloads are accumulated into small, time-bounded batches. It brings several benefits:
- Predictable behavior under moderate load
- Good fit for stateless services in containers or microservices
- Reduces syscall and network pressure
A more complex alternative is called smart batching. It builds on micro-batching by dynamically adjusting thresholds using system metrics (e.g., queue length, observed latencies, or CPU load); a minimal sketch follows the micro-batching example below. While it offers flexibility and adaptive performance tuning, smart batching introduces:
- Complexity and tuning overhead
- Risk of instability under incorrect heuristics
- Greater need for observability and tracing
What follows is a draft implementation of micro-batching:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MicroBatchProcessor {

    private static final int MAX_BATCH_SIZE = 100;
    private static final long MAX_BATCH_TIME_MS = 100;

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final LinkedBlockingQueue<String> batch = new LinkedBlockingQueue<>();

    public void start() {
        // Time-based trigger: the flush runs every MAX_BATCH_TIME_MS.
        scheduler.scheduleAtFixedRate(this::flushActivity, 0, MAX_BATCH_TIME_MS, TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }

    public void add(String item) {
        batch.offer(item);
    }

    private void flushActivity() {
        // Size-based trigger: drain in chunks of at most MAX_BATCH_SIZE, and also flush
        // partially filled batches so no item waits longer than MAX_BATCH_TIME_MS.
        while (!batch.isEmpty()) {
            List<String> toProcess = new ArrayList<>(MAX_BATCH_SIZE);
            batch.drainTo(toProcess, MAX_BATCH_SIZE);
            processBatch(toProcess);
        }
    }

    private void processBatch(List<String> items) {
        System.out.println("processing batch - size: " + items.size());
        // processing the items
    }
}
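Building on the micro-batch processor above, the following minimal sketch suggests how smart batching might adapt the flush threshold from a simple runtime signal (here, the observed queue depth). The bounds, the doubling/halving heuristic, and the class name are illustrative assumptions, not a production-ready policy.
public class AdaptiveBatchSizer {

    // Illustrative bounds and starting point; a real system would derive these from measurements.
    private static final int MIN_BATCH_SIZE = 10;
    private static final int MAX_BATCH_SIZE = 1_000;

    private volatile int currentBatchSize = 100;

    // Called before each flush with the current queue depth (e.g., batch.size()).
    public int nextBatchSize(int queueDepth) {
        if (queueDepth > currentBatchSize * 2) {
            // Backlog is growing: allow larger batches to favor throughput.
            currentBatchSize = Math.min(MAX_BATCH_SIZE, currentBatchSize * 2);
        } else if (queueDepth < currentBatchSize / 2) {
            // Load is light: shrink batches to keep latency low.
            currentBatchSize = Math.max(MIN_BATCH_SIZE, currentBatchSize / 2);
        }
        return currentBatchSize;
    }
}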
A side note on concurrency: the single-writer principle
An often-overlooked benefit of using a dedicated batching thread is its alignment with the single-writer principle, a concurrency design pattern in which only one thread performs mutations on shared state. Since the batching thread is the sole actor responsible for collecting, modifying, and flushing batch contents, it eliminates the need for fine-grained locking or atomic operations around the batch data structure, which simplifies the concurrency model. This doesn't mean single-threaded processing; readers and other writers can exist, but only one thread mutates a particular piece of data. This design is used, for example, by frameworks such as the LMAX Disruptor and Aeron.
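To make the idea concrete, here is a minimal, illustrative sketch of a single-writer loop, assuming many producer threads and one dedicated writer: producers only enqueue commands, and the writer thread is the only one that ever touches the mutable state, so no lock is needed around it.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SingleWriterCounter {

    private final BlockingQueue<String> commands = new LinkedBlockingQueue<>();
    // Mutated exclusively by the writer thread, so no synchronization is needed around it.
    private final Map<String, Long> countsByKey = new HashMap<>();

    private final Thread writer = new Thread(() -> {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String key = commands.take();           // producers hand work over via the queue
                countsByKey.merge(key, 1L, Long::sum);  // single writer: the only mutation point
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }, "single-writer");

    public void start()            { writer.start(); }
    public void stop()             { writer.interrupt(); }
    public void record(String key) { commands.offer(key); } // callable from any thread
}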
Request Coalescing
Request coalescing (sometimes called "single-flight" or "collapsed forwarding") means merging multiple identical operations into one. In concurrent systems, this typically refers to situations where different threads or requests simultaneously ask for the same resource or query result. Instead of each thread launching its own database or network call, one thread performs the work while the others wait for the result. A well-known Java example comes from Netflix: the Hystrix library formalizes this design under the name request collapsing. Request coalescing is also a way to mitigate the thundering herd problem in caching, which occurs when a large number of requests for the same resource arrive at the same time, often due to a cache expiration or a sudden surge in demand, and all of them hit the backend at once. This surge can overload the backend, leading to performance degradation or even system failure. As usual, there are trade-offs:
- It's not for dissimilar tasks: only use coalescing when multiple threads truly need the same operation.
- Threads that arrive after the coalesced work has started must wait for it to finish.
- The system must keep track of in-flight requests (typically in a map) and lists of waiters. This adds complexity and memory usage.
- If the single operation fails, you must propagate errors to all waiters.
What follows is a draft implementation of request coalescing:
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RequestCoalescer<K, V> {

    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Tracks in-flight computations so that concurrent callers share a single Future per key.
    private final ConcurrentHashMap<K, Future<V>> inFlight = new ConcurrentHashMap<>();

    public V getOrCompute(K key, Callable<V> computation) {
        // Only the first caller for a key submits the work; the others reuse the same Future.
        var task = inFlight.computeIfAbsent(key, k -> worker.submit(computation));
        var start = System.currentTimeMillis();
        try {
            return task.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while waiting for computation!", e);
        } catch (ExecutionException e) {
            // Note: a failed Future stays cached until flush(key) is called.
            throw new RuntimeException("Problem on computation!", e);
        } finally {
            System.out.printf("%s waited for: %sms %n",
                    Thread.currentThread().getName(), System.currentTimeMillis() - start);
        }
    }

    public void flush(K key) {
        inFlight.remove(key);
    }

    public static void main(String[] args) throws Exception {
        String testKey = "foo";
        ExecutorService requesters = Executors.newFixedThreadPool(5);
        RequestCoalescer<String, String> coalescer = new RequestCoalescer<>();

        Callable<String> request = () -> {
            var result = coalescer.getOrCompute(testKey, () -> {
                Thread.sleep(500); // simulate an expensive backend call
                return "[value_for_%s]".formatted(testKey);
            });
            return Thread.currentThread().getName() + " received: " + result;
        };

        submitRequestsAndShowResults(requesters, request);
        submitRequestsAndShowResults(requesters, request);
        coalescer.flush(testKey);
        submitRequestsAndShowResults(requesters, request);

        requesters.shutdown();         // allow the JVM to exit
        coalescer.worker.shutdown();
    }

    // submitRequestsAndShowResults(...) omitted: helper that submits the request to the pool and prints the results
    ...
}
Flat Combining
Flat combining is a powerful synchronization paradigm for in-memory data structures. Instead of having threads contend on a lock or on atomic operations, flat combining lets a single combiner thread perform a batch of operations on behalf of the others.
An optimization of flat combining is to use per-thread buffers (often via ThreadLocal) to collect operations and reduce the allocation rate. Each thread accumulates requests in its local buffer, and a combiner thread occasionally scans the thread-local buffers and processes the pending tasks. This design comes with two big caveats:
- It maintains correct concurrency semantics but may degrade to sequential processing of operations.
- It happens transparently between caller threads, but it is a coupling strategy over shared memory, unlike batching, which is a decoupling strategy.
What follows is a draft implementation of flat combining, reusing a per-thread Operation via ThreadLocal:
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

public class FlatCombiningBatcher_ThreadLocal {

    private static class Operation {
        int incrementBy;
        volatile boolean completed;
        volatile int result;

        Operation(int incrementBy) {
            this.incrementBy = incrementBy;
        }
    }

    // Holds the thread currently acting as combiner (null when the role is free).
    private final AtomicReference<Thread> combiner = new AtomicReference<>();
    private final Queue<Operation> queue = new ConcurrentLinkedQueue<>();
    // Mutated only by the thread that holds the combiner role.
    private volatile int sharedCounter = 0;
    // Each thread reuses its own Operation instance to reduce the allocation rate.
    private final ThreadLocal<Operation> threadLocalOp = ThreadLocal.withInitial(() -> new Operation(0));

    public int increment(int value) {
        Operation op = threadLocalOp.get();
        op.incrementBy = value;
        op.completed = false;
        queue.add(op);

        // Spin until our operation has been applied, retrying to take over the combiner
        // role whenever it is free; otherwise an op enqueued just after the current
        // combiner drained the queue could be left waiting forever.
        while (!op.completed) {
            if (combiner.compareAndSet(null, Thread.currentThread())) {
                try {
                    System.out.println(Thread.currentThread().getName() + " acts as combiner");
                    processBatch();
                } finally {
                    combiner.set(null);
                }
            } else {
                Thread.onSpinWait();
            }
        }
        System.out.println(Thread.currentThread().getName() + " incremented with " + value);
        return op.result;
    }

    private void processBatch() {
        // Drain all pending operations into a local batch.
        List<Operation> batch = new ArrayList<>();
        Operation pending;
        while ((pending = queue.poll()) != null) {
            batch.add(pending);
        }
        if (!batch.isEmpty()) {
            // Apply the whole batch with a single update of the shared counter.
            int sum = 0;
            for (Operation op : batch) {
                sum += op.incrementBy;
            }
            sharedCounter += sum;
            // Publish the result and release the waiting threads.
            for (Operation op : batch) {
                op.result = sharedCounter;
                op.completed = true;
            }
        }
    }

    public int getCounter() {
        return sharedCounter;
    }
}
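A short usage sketch, with an arbitrary thread count and iteration count: several threads increment concurrently, and whichever thread grabs the combiner role applies the whole pending batch in one pass.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FlatCombiningDemo {
    public static void main(String[] args) throws InterruptedException {
        var batcher = new FlatCombiningBatcher_ThreadLocal();
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Operations from different threads arriving together end up combined
        // into a single update of the shared counter.
        for (int t = 0; t < 4; t++) {
            pool.submit(() -> {
                for (int i = 0; i < 100; i++) {
                    batcher.increment(1);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("final counter: " + batcher.getCounter()); // expected 400
    }
}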
What happens in Cloud environments?
In Cloud and virtualized environments, naive synchronization and traditional blocking strategies not only perform worse, they often fail to scale, due to unpredictable latencies, interference from shared resources, and reduced hardware determinism. Designing resilient, high-throughput Java systems for the Cloud means balancing the concurrency approaches seen above with infrastructure components such as autoscaling and queuing. It is also worth noting that compare-and-set primitives in tight loops can be problematic, as can using large numbers of threads. With this in mind, we can highlight a few considerations:
- Prefer batching and micro-batching for predictable cost reduction
- Use adaptive techniques such as smart batching with caution, and constantly monitor their impact
- Replace locks with single-writer models, actor-based designs, or virtual threads to minimize contention
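As an illustration of the last point, here is a minimal sketch that replaces a large pool of platform threads with virtual threads (Java 21+); the blocking call simulates a remote dependency and the numbers are arbitrary.
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualThreadsSketch {
    public static void main(String[] args) {
        // One cheap virtual thread per task instead of a large pool of OS threads.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                int requestId = i;
                executor.submit(() -> {
                    try {
                        // Blocking parks the virtual thread without holding an OS thread.
                        Thread.sleep(Duration.ofMillis(50)); // simulated remote call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    return "response for request " + requestId;
                });
            }
        } // close() waits for the submitted tasks to finish
    }
}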
Side Note: Thread-Per-Core
A thread-per-core architecture assigns a single thread to each CPU core, often pinning it to that core. This minimizes context switching and allows coarse-grained scheduling, enabling efficient, low-latency processing that is particularly beneficial for latency-sensitive applications. The approach, however, has side effects to consider: it can significantly reduce tail latency, but only under the right conditions (proper IRQ setup, evenly distributed load, and hardware/OS support). Without these, the architecture's benefits may be diminished or negated.
Conclusion
Using batching well is not a simple task, but various techniques and approaches exist to make it a tool that remains valuable even in Cloud applications. Balancing resources, measurement, and infrastructure becomes the essential mindset for a new generation of more elastic and better-performing applications!