
Part 2. Work Stealing Thread Pools

Last updated: Jan 26, 2026

The Limitation: When Our Pizza Kitchen Breaks Down

A New Challenge

  • Tasks that produce their own subtasks, e.g. recursive problems
  • Example: Fibonacci pizza recipe (silly but memorable)

The Basic Thread Pool Struggles

  • Benchmark: Fibonacci(35) with basic thread pool
  • Profiling: CPU utilization showing threads blocked
  • Problems: Lock contention + poor cache locality

The Insight

  • Consider
    • “What if threads could take work from each other?”
    • “What if tasks stayed close to their parent tasks?” - explain improvement in profiling nums?
  • Preview of work stealing benefits

The Work Stealing Concept

The Restaurant Analogy

  • Multiple prep stations (local queues)
  • Each cook has their own station
  • Cooks work on their own tasks first
  • Idle cooks “steal” from busy cooks
  • Visual: Restaurant layout diagram

Key Principles

  1. Local queues - Each thread has its own deque
  2. LIFO for owner - Pop from front (cache locality)
  3. FIFO for thieves - Steal from back (avoid conflicts)
  4. No blocking - Spin/yield instead of condition variables

Why This Works

  • Reduced contention (N+1 locks vs 1 lock)
  • Cache locality (parent-child tasks on same thread)
  • Load balancing (idle threads help busy ones)
  • Visual: Comparison diagram (shared queue vs local queues)

Arch details: Load-balanced workflow

Component Overview

  • Global queue (fallback for external submissions)
  • Per-thread local queues (deques)
    • todo: how is deque implemented under the hood? i think it’s a linked list of fixed-size arrays rather than of individual nodes? the arrays inside the deque give good cache locality, so what are the problems with one global deque whose internals are used as thread locals… too complex to sync with global queue workflows, i guess? explore more
  • Thread-local storage (fast access)
  • No condition variables (active polling)
    • sucks at first glance, but i think active polling reduces wakeup time for a thread. so we burn cpu cycles but get a thread that’s always ready to work?
  • Visual: Complete architecture diagram

Intro to deque

  • Why deque? (efficient push/pop from both ends)
    • explain why it’s useful when it comes to contention etc. (toy sketch after this list)
  • Owner operations (front): push_front, pop_front (LIFO)
  • Thief operations (back): pop_back (FIFO)
  • Visual: Deque with arrows showing operations
  • Code: WorkStealingQueue class structure
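A toy sketch of the direction convention on a bare std::deque (no locking here; task_t stands in for task_wrapper):

#include <deque>
#include <functional>

using task_t = std::function<void()>; // stand-in for task_wrapper

int main() {
    std::deque<task_t> tasks;

    // The owner pushes to the front, so the newest task sits at the front.
    tasks.push_front([] { /* older task  */ });
    tasks.push_front([] { /* newest task */ });

    // Owner end (front): LIFO — pop the newest task, whose parent data
    // is still hot in this core's cache.
    task_t own = std::move(tasks.front());
    tasks.pop_front();

    // Thief end (back): FIFO — steal the oldest task, at the opposite
    // end from where the owner works, so the two rarely conflict.
    task_t stolen = std::move(tasks.back());
    tasks.pop_back();

    own();
    stolen();
}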

Thread-Local Storage Strategy

static thread_local WorkStealingQueue* local_queue_;
static thread_local size_t idx_;
  • What is TLS? (per-thread variables)
  • Why pointers? (late initialization — definitions sketch below)
  • Performance benefit (zero-cost access)
  • Visual: Memory layout showing TLS
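These static thread_local members also need out-of-class definitions — a minimal sketch, assuming the declarations above live in ThreadPoolWorkStealing:

// Out-of-class definitions for the static thread_local members.
// Every thread starts with a null local queue: workers set it in
// worker_loop, while external threads keep nullptr — exactly what
// submit() checks later to pick the global-queue slow path.
thread_local WorkStealingQueue* ThreadPoolWorkStealing::local_queue_ = nullptr;
thread_local size_t ThreadPoolWorkStealing::idx_ = 0;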

Task Execution Priority

  1. Check own local queue (fast, no lock)
  2. Check global queue (one lock)
  3. Steal from other threads (N locks, but rare)
  4. Yield CPU (no work available)
  • Visual: Decision tree flowchart

Impl: Work Stealing Pool

The WorkStealingQueue Class

class WorkStealingQueue {
private:
    mutable std::mutex mtx_;
    std::deque<task_wrapper> tasks_;

public:
    void push(task_wrapper task);
    bool try_pop(task_wrapper& task);
    bool try_steal(task_wrapper& task);
    bool empty() const;
};
  • Implementation of each method and what each method does (see sketch below)
  • Worth talking about lock granularity? trying to keep it minimal here
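A sketch of how each method could be implemented with the one-mutex-per-queue design (my guess at the minimal version; lock granularity is one short critical section per call):

void WorkStealingQueue::push(task_wrapper task) {
    std::lock_guard<std::mutex> lock(mtx_);
    tasks_.push_front(std::move(task)); // owner end
}

bool WorkStealingQueue::try_pop(task_wrapper& task) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (tasks_.empty()) return false;
    task = std::move(tasks_.front()); // owner takes the newest task (LIFO)
    tasks_.pop_front();
    return true;
}

bool WorkStealingQueue::try_steal(task_wrapper& task) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (tasks_.empty()) return false;
    task = std::move(tasks_.back()); // thief takes the oldest task (FIFO)
    tasks_.pop_back();
    return true;
}

bool WorkStealingQueue::empty() const {
    std::lock_guard<std::mutex> lock(mtx_); // mtx_ is mutable, so this compiles
    return tasks_.empty();
}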

The ThreadPoolWorkStealing Class Structure

class ThreadPoolWorkStealing {
private:
    std::mutex mtx_;
    std::queue<task_wrapper> global_queue_;
    std::vector<std::unique_ptr<WorkStealingQueue>> worker_queues_;
    std::vector<std::jthread> workers_;
    size_t num_threads_;

    static thread_local WorkStealingQueue* local_queue_;
    static thread_local size_t idx_;

public:
    ThreadPoolWorkStealing(size_t num_threads);
    template<typename F, typename Ret = std::invoke_result_t<F>>
    auto submit(F&& f) -> std::future<Ret>;
    void run_pending_task(); // public: lets waiting tasks help out
};
  • Why unique_ptr for queues?
  • Static thread_local explained… might need an appendix for such a deep dive? :(
  • Comparison to basic ThreadPool

Constructor: Setting Up the Infrastructure

ThreadPoolWorkStealing(size_t num_threads) : num_threads_(num_threads) {
    // Create local queues first: the workers dereference them immediately
    for (size_t i = 0; i < num_threads_; i++) {
        worker_queues_.push_back(std::make_unique<WorkStealingQueue>());
    }

    // Spawn workers
    for (size_t i = 0; i < num_threads_; i++) {
        workers_.emplace_back([this, i](std::stop_token stoken) {
            worker_loop(stoken, i);
        });
    }
}
  • Ctor infra
    • queues must exist before the worker threads start (each worker dereferences worker_queues_[idx] immediately)
    • member order matters on teardown too: workers_ is declared after worker_queues_, so the jthreads join before the queues are destroyed

Worker Loop: The Active Polling Pattern

void worker_loop(std::stop_token stoken, size_t idx) {
    idx_ = idx; // Set TLS
    local_queue_ = worker_queues_[idx].get();

    while (!stoken.stop_requested()) {
        run_pending_task(stoken);
    }
}

void run_pending_task(std::stop_token& stoken) {
    if (stoken.stop_requested()) return;

    task_wrapper task;
    if (pop_from_local(task) ||
        pop_from_global(task) ||
        pop_from_other_thread_queues(task)) {
        task();
    } else {
        std::this_thread::yield();
    }
}
  • TLS initialization
  • Short-circuit eval - needed in writeup TODO?
  • yield instead of block…
  • CPU usage trade-off (burn v wakeup delay)

Task Submission

template<typename F, typename Ret = std::invoke_result_t<F>>
auto submit(F&& f) -> std::future<Ret> {
    std::packaged_task<Ret()> task(std::forward<F>(f));
    auto fut = task.get_future();

    if (local_queue_) {
        // Fast path: we're a worker thread, push to our own queue (no lock)
        local_queue_->push([task = std::move(task)]() mutable {
            task();
        });
    } else {
        // Slow path: external thread, go through the global queue
        std::lock_guard lock(mtx_);
        global_queue_.emplace([task = std::move(task)]() mutable {
            task();
        });
    }

    return fut;
}
  • Fast path vs slow path (usage sketch below)
  • Why check local_queue_?
  • Performance difference - worth getting numbers? measure from time of original submission, or from submission after the lock is acquired (if that branch takes a lock, ofc)?
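A quick usage sketch of the two paths (hedged — assumes the classes above compile as sketched):

#include <thread>

int main() {
    ThreadPoolWorkStealing pool(std::thread::hardware_concurrency());

    // External thread: local_queue_ is nullptr here, so this takes the
    // slow path through the global queue (one lock).
    auto fut = pool.submit([] { return 6 * 7; });

    // Inside a running task, submit() would take the fast path instead:
    // local_queue_ is non-null on worker threads, so no lock at all.

    return fut.get() == 42 ? 0 : 1; // safe: we're not a worker thread
}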

The Stealing Algorithm

bool pop_from_other_thread_queues(task_wrapper& task) {
    for (size_t i = 0; i < worker_queues_.size(); i++) {
        auto idx = (idx_ + i + 1) % worker_queues_.size();
        if (worker_queues_[idx]->try_steal(task)) {
            return true;
        }
    }
    return false;
}
  • Round-robin stealing pattern
  • Why (idx_ + i + 1) % num_workers: each thief starts at its next neighbour rather than everyone hammering worker 0, so stealing attempts fan out across the pool
  • Load distribution benefits
  • viz: stealing pattern diagram

The Helper Methods

bool pop_from_local(task_wrapper& task) {
    return local_queue_ && local_queue_->try_pop(task);
}

bool pop_from_global(task_wrapper& task) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (global_queue_.empty()) return false;
    task = std::move(global_queue_.front());
    global_queue_.pop();
    return true;
}
  • Null check for local_queue
  • Lock scope minimization

The Magic of Cache Locality: Why LIFO Matters

The Problem with Random Distribution

  • Basic thread pool: tasks go to random threads
  • Parent and child tasks on different CPUs
  • Cache miss penalty? nums provided? (real / estimate)
  • Visual: would be cool to visualise a CPU cache diagram? consult the Red Hat paper — i think there should be something there to reference

The LIFO Solution

  • Worker submits task → push to front of local queue
  • Worker continues → pops from front (LIFO)
  • Result: Parent and child on same CPU
  • Cache hit… X cycles?
  • Visual: Timeline showing task execution on same CPU

The Numbers

example:...

Fibonacci(35) execution:
- Basic ThreadPool: ~XXXms (5x speedup)
- WorkStealing: ~Yms (25x speedup)

Why? Cache locality can provide a 10-50x win per task!

Recursive Algorithm Example

long fib(int n, ThreadPoolWorkStealing& pool) {
    if (n <= 1) return n;

    auto left = pool.submit([&] { return fib(n - 1, pool); });
    long right = fib(n - 2, pool); // Execute inline

    // Don't block in get() on a worker thread — help out instead.
    // LIFO means fib(n-1) is usually popped from the front of our own
    // queue, so fib(n) and fib(n-1) end up on the same CPU.
    while (left.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
        pool.run_pending_task();
    }

    return left.get() + right;
}
  • Visual: Task tree showing parent-child affinity

Benchmarking: Four Algorithms, Four Stories

  • These are generic algos, could just offload impl to AI
  • Do we need a briefer on the algos? (timing harness sketch below)
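For the benchmark numbers, a minimal timing harness might be enough (a sketch; time_ms is a made-up helper, not from any library):

#include <chrono>
#include <cstdio>

template <typename Fn>
double time_ms(Fn&& fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn(); // run the workload once
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage, assuming the fib example and pool classes from earlier:
// ThreadPoolWorkStealing pool(8);
// std::printf("fib(35): %.1f ms\n", time_ms([&] { fib(35, pool); }));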

Analysis: When Does Work Stealing Shine?

  • Best for: Recursive algorithms, variable task sizes
  • Good for: Any parallel workload
  • Overkill for: Independent, uniform-duration tasks (Pizza Kitchen)
  • Table: Algorithm characteristics vs speedup

This section:

  • Proves perf improvement w/real data
  • Shows different workload patterns
  • Teaches when to use which approach

The Trade-offs: Nothing Is Free

Increased Complexity

  • More code to write and maintain
  • Harder to debug (distributed state)
  • More subtle bugs possible… any way to explore?
  • When is simplicity worth more?

Memory Overhead

  • N local queues vs 1 global queue
  • Thread-local storage
  • How does it affect memory?

CPU Usage When Idle

  • Basic ThreadPool: sleeping
  • Work stealing: spinning / yielding
  • Trade-off: Responsiveness vs power consumption
  • When does this matter?

Lock-Free Alternatives

  • Could use lock-free deques (more complex)
  • ABA problem, memory ordering
  • Diminishing returns for most workloads
  • Would have to research more (AW’s book gave me trauma with lock-free data structures)

Comparison: Basic vs Work Stealing

Feature Comparison Table

Feature         | Basic ThreadPool   | WorkStealing
Complexity      | Low                | High
Memory          | ~1 KB              | ~N KB
Contention      | High (1 lock)      | Low (N+1 locks)
Cache Locality  | Poor               | Excellent
Scalability     | 4-8 threads        | 16+ threads
Idle CPU        | 0%                 | x%
Best For        | Independent tasks  | Recursive tasks

Decision Matrix

  • Use Basic when: Simplicity matters, less than 8 threads, independent tasks
  • Use WorkStealing when: Performance critical, recursive, many threads
  • Visual: Decision tree

Guidance

If your tasks spawn other tasks, use work stealing. Otherwise, a simple thread pool is probably enough.


Real-World Implementations

Intel TBB (Threading Building Blocks)

  • tbb::task_group uses work stealing
  • Cache-oblivious algorithms
  • NUMA-aware scheduling

Advanced Topics: Going Further

Lock-Free Deques

  • Chase-Lev algorithm - worth trying to profile?
  • ABA problem and solutions

NUMA-Aware Stealing

  • Prefer stealing from same socket
  • Topology-aware thread placement

Adaptive Stealing Strategies

  • Track which workers have work; exponential backoff; etc. — do more research (sketch below)
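One possible shape (a hedged sketch; the thresholds are made up and would need tuning):

#include <chrono>
#include <thread>

// Hypothetical idle-backoff helper: spin briefly, then yield, then sleep.
// Reset `misses` to 0 whenever a task is found, so a busy pool stays in
// the cheap hot-spin regime.
void idle_backoff(unsigned& misses) {
    ++misses;
    if (misses < 16) {
        // hot spin: burns CPU, but lowest wakeup latency
    } else if (misses < 64) {
        std::this_thread::yield(); // give up the timeslice
    } else {
        std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
}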

Key Takeaways and Exercises

What We Learned

  • Work stealing reduces contention through distributed queues
  • LIFO for owner provides cache locality
  • FIFO for thieves avoids conflicts
  • Active polling trades CPU for responsiveness
  • Big speedups in algorithm runtime (numbers TBD)

Ideas for future dev

  1. Implement lock-free deque
  2. Add NUMA-aware stealing
  3. Benchmark on your workload
  4. Add priority queues
  5. Implement adaptive backoff

Reflection Questions

  • When would you choose basic over work stealing?
  • How would you debug a deadlock in the stealing loop?
  • What workloads benefit most from cache locality?

Complete Code Listing (Appendix)

Content:

  • Full WorkStealingQueue class
  • Full ThreadPoolWorkStealing class
  • All four benchmark algorithms
  • Build instructions
  • How to run benchmarks

Further Reading and References (Appendix)

Papers

  • “Dynamic Circular Work-Stealing Deque” (Chase & Lev)
  • “The Data Locality of Work Stealing” (Acar et al.)

Books

  • C++ Concurrency in Action (Anthony Williams) - Chapter 9

Talks

  • “Designing Concurrent C++ Applications” (Herb Sutter)
  • “Lock-Free Programming” (Fedor Pikus)

Visuals todo

  1. Restaurant with prep stations (work stealing analogy)
  2. Shared queue vs local queues (architecture comparison)
  3. Deque operations diagram (push/pop from both ends)
  4. Thread-local storage memory layout
  5. Task execution priority flowchart
  6. CPU cache hierarchy
  7. Parent-child task affinity timeline
  8. Stealing pattern diagram (round-robin)
  9. Four benchmark bar charts (basic vs work stealing)
  10. Decision tree (when to use which)
  11. Lock contention comparison (basic vs work stealing)
