
Part 2. Work Stealing Thread Pools

Last updated: Jan 26, 2026

The Limitation: When Our Pizza Kitchen Breaks Down

A New Challenge

  • Tasks that produce their own subtasks, e.g. recursive problems
  • Example: Fibonacci pizza recipe (silly but memorable)

The Basic Thread Pool Struggles

  • Benchmark: Fibonacci(35) with basic thread pool
  • Profiling: CPU utilization showing threads blocked
  • Problems: Lock contention + poor cache locality

The Insight

  • Consider
    • “What if threads could take work from each other?”
    • “What if tasks stayed close to their parent tasks?” - explain improvement in profiling nums?
  • Preview of work stealing benefits

The Work Stealing Concept

The Restaurant Analogy

  • Multiple prep stations (local queues)
  • Each cook has their own station
  • Cooks work on their own tasks first
  • Idle cooks “steal” from busy cooks
  • Visual: Restaurant layout diagram

Key Principles

  1. Local queues - Each thread has its own deque
  2. LIFO for owner - Pop from front (cache locality)
  3. FIFO for thieves - Steal from back (avoid conflicts)
  4. No blocking - Spin/yield instead of condition variables

Why This Works

  • Reduced contention (N+1 locks vs 1 lock)
  • Cache locality (parent-child tasks on same thread)
  • Load balancing (idle threads help busy ones)
  • Visual: Comparison diagram (shared queue vs local queues)

Arch details: Load-balanced workflow

Component Overview

  • Global queue (fallback for external submissions)
  • Per-thread local queues (deques)
    • todo: how is deque implemented under the hood? i think it’s a linked list of fixed-size arrays rather than of individual nodes? the arrays inside the deque give good cache locality, so what are the problems with one global deque whose internals are used as thread locals… too complex to sync with global queue workflows, i guess? explore more
  • Thread-local storage (fast access)
  • No condition variables (active polling)
    • sucks at first glance, but i think active polling reduces wakeup time for a thread. so we burn cpu cycles but get a thread that’s always ready to work?
  • Visual: Complete architecture diagram

Intro to deque

  • Why deque? (efficient push/pop from both ends)
    • explain why it’s useful when it comes to contention etc. (toy sketch after this list)
  • Owner operations (front): push_front, pop_front (LIFO)
  • Thief operations (back): pop_back (FIFO)
  • Visual: Deque with arrows showing operations
  • Code: WorkStealingQueue class structure
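A toy sketch of the direction convention on a bare std::deque (no locking here; task_t stands in for task_wrapper):

#include <deque>
#include <functional>

using task_t = std::function<void()>; // stand-in for task_wrapper

int main() {
    std::deque<task_t> tasks;

    // The owner pushes to the front, so the newest task sits at the front.
    tasks.push_front([] { /* older task  */ });
    tasks.push_front([] { /* newest task */ });

    // Owner end (front): LIFO — pop the newest task, whose parent data
    // is still hot in this core's cache.
    task_t own = std::move(tasks.front());
    tasks.pop_front();

    // Thief end (back): FIFO — steal the oldest task, at the opposite
    // end from where the owner works, so the two rarely conflict.
    task_t stolen = std::move(tasks.back());
    tasks.pop_back();

    own();
    stolen();
}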

Thread-Local Storage Strategy

static thread_local WorkStealingQueue* local_queue_;
static thread_local size_t idx_;
  • What is TLS? (per-thread variables)
  • Why pointers? (late initialization — definitions sketch below)
  • Performance benefit (zero-cost access)
  • Visual: Memory layout showing TLS
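These static thread_local members also need out-of-class definitions — a minimal sketch, assuming the declarations above live in ThreadPoolWorkStealing:

// Out-of-class definitions for the static thread_local members.
// Every thread starts with a null local queue: workers set it in
// worker_loop, while external threads keep nullptr — exactly what
// submit() checks later to pick the global-queue slow path.
thread_local WorkStealingQueue* ThreadPoolWorkStealing::local_queue_ = nullptr;
thread_local size_t ThreadPoolWorkStealing::idx_ = 0;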

Task Execution Priority

  1. Check own local queue (fast, no lock)
  2. Check global queue (one lock)
  3. Steal from other threads (N locks, but rare)
  4. Yield CPU (no work available)
  • Visual: Decision tree flowchart

Impl: Work Stealing Pool

The WorkStealingQueue Class

class WorkStealingQueue {
private:
    mutable std::mutex mtx_;
    std::deque<task_wrapper> tasks_;

public:
    void push(task_wrapper task);
    bool try_pop(task_wrapper& task);
    bool try_steal(task_wrapper& task);
    bool empty() const;
};
  • Implementation of each method and what each method does (see sketch below)
  • Worth talking about lock granularity? trying to keep it minimal here
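A sketch of how each method could be implemented with the one-mutex-per-queue design (my guess at the minimal version; lock granularity is one short critical section per call):

void WorkStealingQueue::push(task_wrapper task) {
    std::lock_guard<std::mutex> lock(mtx_);
    tasks_.push_front(std::move(task)); // owner end
}

bool WorkStealingQueue::try_pop(task_wrapper& task) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (tasks_.empty()) return false;
    task = std::move(tasks_.front()); // owner takes the newest task (LIFO)
    tasks_.pop_front();
    return true;
}

bool WorkStealingQueue::try_steal(task_wrapper& task) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (tasks_.empty()) return false;
    task = std::move(tasks_.back()); // thief takes the oldest task (FIFO)
    tasks_.pop_back();
    return true;
}

bool WorkStealingQueue::empty() const {
    std::lock_guard<std::mutex> lock(mtx_); // mtx_ is mutable, so this compiles
    return tasks_.empty();
}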

The ThreadPoolWorkStealing Class Structure

class ThreadPoolWorkStealing {
private:
    std::mutex mtx_;
    std::queue<task_wrapper> global_queue_;
    std::vector<std::unique_ptr<WorkStealingQueue>> worker_queues_;
    std::vector<std::jthread> workers_;
    size_t num_threads_;

    static thread_local WorkStealingQueue* local_queue_;
    static thread_local size_t idx_;

public:
    ThreadPoolWorkStealing(size_t num_threads);
    template<typename F, typename Ret = std::invoke_result_t<F>>
    auto submit(F&& f) -> std::future<Ret>;
    void run_pending_task(); // public: lets waiting tasks help out
};
  • Why unique_ptr for queues?
  • Static thread_local explained… might need an appendix for such a deep dive? :(
  • Comparison to basic ThreadPool

Constructor: Setting Up the Infrastructure

ThreadPoolWorkStealing(size_t num_threads) : num_threads_(num_threads) {
    // Create local queues first: the workers dereference them immediately
    for (size_t i = 0; i < num_threads_; i++) {
        worker_queues_.push_back(std::make_unique<WorkStealingQueue>());
    }

    // Spawn workers
    for (size_t i = 0; i < num_threads_; i++) {
        workers_.emplace_back([this, i](std::stop_token stoken) {
            worker_loop(stoken, i);
        });
    }
}
  • Ctor infra
    • queues must exist before the worker threads start (each worker dereferences worker_queues_[idx] immediately)
    • member order matters on teardown too: workers_ is declared after worker_queues_, so the jthreads join before the queues are destroyed

Worker Loop: The Active Polling Pattern

void worker_loop(std::stop_token stoken, size_t idx) {
    idx_ = idx; // Set TLS
    local_queue_ = worker_queues_[idx].get();

    while (!stoken.stop_requested()) {
        run_pending_task(stoken);
    }
}

void run_pending_task(std::stop_token& stoken) {
    if (stoken.stop_requested()) return;

    task_wrapper task;
    if (pop_from_local(task) ||
        pop_from_global(task) ||
        pop_from_other_thread_queues(task)) {
        task();
    } else {
        std::this_thread::yield();
    }
}
  • TLS initialization
  • Short-circuit eval - needed in writeup TODO?
  • yield instead of block…
  • CPU usage trade-off (burn v wakeup delay)

Task Submission

template<typename F, typename Ret = std::invoke_result_t<F>>
auto submit(F&& f) -> std::future<Ret> {
    std::packaged_task<Ret()> task(std::forward<F>(f));
    auto fut = task.get_future();

    if (local_queue_) {
        // Fast path: we're a worker thread, push to our own queue (no lock)
        local_queue_->push([task = std::move(task)]() mutable {
            task();
        });
    } else {
        // Slow path: external thread, go through the global queue
        std::lock_guard lock(mtx_);
        global_queue_.emplace([task = std::move(task)]() mutable {
            task();
        });
    }

    return fut;
}
  • Fast path vs slow path (usage sketch below)
  • Why check local_queue_?
  • Performance difference - worth getting numbers? measure from time of original submission, or from submission after the lock is acquired (if that branch takes a lock, ofc)?
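A quick usage sketch of the two paths (hedged — assumes the classes above compile as sketched):

#include <thread>

int main() {
    ThreadPoolWorkStealing pool(std::thread::hardware_concurrency());

    // External thread: local_queue_ is nullptr here, so this takes the
    // slow path through the global queue (one lock).
    auto fut = pool.submit([] { return 6 * 7; });

    // Inside a running task, submit() would take the fast path instead:
    // local_queue_ is non-null on worker threads, so no lock at all.

    return fut.get() == 42 ? 0 : 1; // safe: we're not a worker thread
}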

The Stealing Algorithm

bool pop_from_other_thread_queues(task_wrapper& task) {
    for (size_t i = 0; i < worker_queues_.size(); i++) {
        auto idx = (idx_ + i + 1) % worker_queues_.size();
        if (worker_queues_[idx]->try_steal(task)) {
            return true;
        }
    }
    return false;
}
  • Round-robin stealing pattern
  • Why (idx_ + i + 1) % num_workers: each thief starts at its next neighbour rather than everyone hammering worker 0, so stealing attempts fan out across the pool
  • Load distribution benefits
  • viz: stealing pattern diagram

The Helper Methods

bool pop_from_local(task_wrapper& task) {
    return local_queue_ && local_queue_->try_pop(task);
}

bool pop_from_global(task_wrapper& task) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (global_queue_.empty()) return false;
    task = std::move(global_queue_.front());
    global_queue_.pop();
    return true;
}
  • Null check for local_queue
  • Lock scope minimization

The Magic of Cache Locality: Why LIFO Matters

The Problem with Random Distribution

  • Basic thread pool: tasks go to random threads
  • Parent and child tasks on different CPUs
  • Cache miss penalty? nums provided? (real / estimate)
  • Visual: would be cool to visualise a CPU cache diagram? consult the Red Hat paper — i think there should be something there to reference

The LIFO Solution

  • Worker submits task → push to front of local queue
  • Worker continues → pops from front (LIFO)
  • Result: Parent and child on same CPU
  • Cache hit… X cycles?
  • Visual: Timeline showing task execution on same CPU

The Numbers

example:...

Fibonacci(35) execution:
- Basic ThreadPool: ~XXXms (5x speedup)
- WorkStealing: ~Yms (25x speedup)

Why? Cache locality can provide a 10-50x win per task!

Recursive Algorithm Example

long fib(int n, ThreadPoolWorkStealing& pool) {
    if (n <= 1) return n;

    auto left = pool.submit([&] { return fib(n - 1, pool); });
    long right = fib(n - 2, pool); // Execute inline

    // Don't block in get() on a worker thread — help out instead.
    // LIFO means fib(n-1) is usually popped from the front of our own
    // queue, so fib(n) and fib(n-1) end up on the same CPU.
    while (left.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
        pool.run_pending_task();
    }

    return left.get() + right;
}
  • Visual: Task tree showing parent-child affinity

Benchmarking: Four Algorithms, Four Stories

  • These are generic algos, could just offload impl to AI
  • Do we need a briefer on the algos? (timing harness sketch below)
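For the benchmark numbers, a minimal timing harness might be enough (a sketch; time_ms is a made-up helper, not from any library):

#include <chrono>
#include <cstdio>

template <typename Fn>
double time_ms(Fn&& fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn(); // run the workload once
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage, assuming the fib example and pool classes from earlier:
// ThreadPoolWorkStealing pool(8);
// std::printf("fib(35): %.1f ms\n", time_ms([&] { fib(35, pool); }));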

Analysis: When Does Work Stealing Shine?

  • Best for: Recursive algorithms, variable task sizes
  • Good for: Any parallel workload
  • Overkill for: Independent, uniform-duration tasks (Pizza Kitchen)
  • Table: Algorithm characteristics vs speedup

This section:

  • Proves perf improvement w/real data
  • Shows different workload patterns
  • Teaches when to use which approach

The Trade-offs: Nothing Is Free

Increased Complexity

  • More code to write and maintain
  • Harder to debug (distributed state)
  • More subtle bugs possible… any way to explore?
  • When is simplicity worth more?

Memory Overhead

  • N local queues vs 1 global queue
  • Thread-local storage
  • How does it affect memory?

CPU Usage When Idle

  • Basic ThreadPool: sleeping
  • Work stealing: spinning / yielding
  • Trade-off: Responsiveness vs power consumption
  • When does this matter?

Lock-Free Alternatives

  • Could use lock-free deques (more complex)
  • ABA problem, memory ordering
  • Diminishing returns for most workloads
  • Would have to research more (AW’s book gave me trauma with lock-free data structures)

Comparison: Basic vs Work Stealing

Feature Comparison Table

Feature         | Basic ThreadPool   | WorkStealing
Complexity      | Low                | High
Memory          | ~1 KB              | ~N KB
Contention      | High (1 lock)      | Low (N+1 locks)
Cache Locality  | Poor               | Excellent
Scalability     | 4-8 threads        | 16+ threads
Idle CPU        | 0%                 | x%
Best For        | Independent tasks  | Recursive tasks

Decision Matrix

  • Use Basic when: Simplicity matters, less than 8 threads, independent tasks
  • Use WorkStealing when: Performance critical, recursive, many threads
  • Visual: Decision tree

Guidance

If your tasks spawn other tasks, use work stealing. Otherwise, a simple thread pool is probably enough.


Real-World Implementations

Intel TBB (Threading Building Blocks)

  • tbb::task_group uses work stealing
  • Cache-oblivious algorithms
  • NUMA-aware scheduling

Advanced Topics: Going Further

Lock-Free Deques

  • Chase-Lev algorithm - worth trying to profile?
  • ABA problem and solutions

NUMA-Aware Stealing

  • Prefer stealing from same socket
  • Topology-aware thread placement

Adaptive Stealing Strategies

  • Track which workers have work; exponential backoff; etc. — do more research (sketch below)
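One possible shape (a hedged sketch; the thresholds are made up and would need tuning):

#include <chrono>
#include <thread>

// Hypothetical idle-backoff helper: spin briefly, then yield, then sleep.
// Reset `misses` to 0 whenever a task is found, so a busy pool stays in
// the cheap hot-spin regime.
void idle_backoff(unsigned& misses) {
    ++misses;
    if (misses < 16) {
        // hot spin: burns CPU, but lowest wakeup latency
    } else if (misses < 64) {
        std::this_thread::yield(); // give up the timeslice
    } else {
        std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
}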

Key Takeaways and Exercises

What We Learned

  • Work stealing reduces contention through distributed queues
  • LIFO for owner provides cache locality
  • FIFO for thieves avoids conflicts
  • Active polling trades CPU for responsiveness
  • Big speedups in algorithm runtime (numbers TBD)

Ideas for future dev

  1. Implement lock-free deque
  2. Add NUMA-aware stealing
  3. Benchmark on your workload
  4. Add priority queues
  5. Implement adaptive backoff

Reflection Questions

  • When would you choose basic over work stealing?
  • How would you debug a deadlock in the stealing loop?
  • What workloads benefit most from cache locality?

Complete Code Listing (Appendix)

Content:

  • Full WorkStealingQueue class
  • Full ThreadPoolWorkStealing class
  • All four benchmark algorithms
  • Build instructions
  • How to run benchmarks

Further Reading and References (Appendix)

Papers

  • “Dynamic Circular Work-Stealing Deque” (Chase & Lev)
  • “The Data Locality of Work Stealing” (Acar et al.)

Books

  • C++ Concurrency in Action (Anthony Williams) - Chapter 9

Talks

  • “Designing Concurrent C++ Applications” (Herb Sutter)
  • “Lock-Free Programming” (Fedor Pikus)

Visuals todo

  1. Restaurant with prep stations (work stealing analogy)
  2. Shared queue vs local queues (architecture comparison)
  3. Deque operations diagram (push/pop from both ends)
  4. Thread-local storage memory layout
  5. Task execution priority flowchart
  6. CPU cache hierarchy
  7. Parent-child task affinity timeline
  8. Stealing pattern diagram (round-robin)
  9. Four benchmark bar charts (basic vs work stealing)
  10. Decision tree (when to use which)
  11. Lock contention comparison (basic vs work stealing)
