Part 2. Work Stealing Thread Pools
Table of contents
- The Limitation: When Our Pizza Kitchen Breaks Down
- The Work Stealing Concept
- Arch details: Load balanced workflow
- Impl: Work Stealing Pool
- The Magic of Cache Locality: Why LIFO Matters
- Benchmarking: Four Algorithms, Four Stories
- The Trade-offs: Nothing Is Free
- Comparison: Basic vs Work Stealing
- Real-World Implementations
- Advanced Topics: Going Further
- Key Takeaways and Exercises
- Complete Code Listing (Appendix)
- Further Reading and References (Appendix)
- Visuals todo
The Limitation: When Our Pizza Kitchen Breaks Down
A New Challenge
- Tasks that produce their own subtasks, e.g. recursive problems
- Example: Fibonacci pizza recipe (silly but memorable)
The Basic Thread Pool Struggles
- Benchmark: Fibonacci(35) with basic thread pool
- Profiling: CPU utilization showing threads blocked
- Problems: Lock contention + poor cache locality
The Insight
- Consider
- “What if threads could take work from each other?”
- “What if tasks stayed close to their parent tasks?” - explain improvement in profiling nums?
- Preview of work stealing benefits
The Work Stealing Concept
The Restaurant Analogy
- Multiple prep stations (local queues)
- Each cook has their own station
- Cooks work on their own tasks first
- Idle cooks “steal” from busy cooks
- Visual: Restaurant layout diagram
Key Principles
- Local queues - Each thread has its own deque
- LIFO for owner - Pop from front (cache locality)
- FIFO for thieves - Steal from back (avoid conflicts)
- No blocking - Spin/yield instead of condition variables
Why This Works
- Reduced contention (N+1 locks vs 1 lock)
- Cache locality (parent-child tasks on same thread)
- Load balancing (idle threads help busy ones)
- Visual: Comparison diagram (shared queue vs local queues)
Arch details: Load balanced workflow
Component Overview
- Global queue (fallback for external submissions)
- Per-thread local queues (deques)
- todo: how is a deque implemented under the hood? i think it's a linked list of fixed-size arrays rather than individual nodes; the arrays give good cache locality. so what would go wrong with one global deque whose internal blocks get used as thread locals? probably the global-queue workflows and syncs get complex. explore more
- Thread-local storage (fast access)
- No condition variables (active polling)
- looks wasteful at first glance, but i think active polling removes condition-variable wakeup latency: we burn CPU cycles in exchange for a thread that's always ready to work
- Visual: Complete architecture diagram
Intro to the deque
- Why deque? (efficient push/pop from both ends)
- explain why it's useful when it comes to contention etc.
- Owner operations (front): push_front, pop_front (LIFO)
- Thief operations (back): pop_back (FIFO)
- Visual: Deque with arrows showing operations
- Code: WorkStealingQueue class structure (full implementation in the next section)
Thread-Local Storage Strategy
static thread_local WorkStealingQueue* local_queue_;
static thread_local size_t idx_;
- What is TLS? (per-thread variables)
- Why pointers? (late initialization; see the definition sketch below)
- Performance benefit (zero-cost access)
- Visual: Memory layout showing TLS
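One detail the snippet above hides: static thread_local members still need out-of-line definitions, and the pointer starts out null on every thread. A minimal sketch, assuming the ThreadPoolWorkStealing class defined below:

// In the .cpp: every thread gets its own copy, initially null.
// Worker threads set these in worker_loop(); external threads never do,
// which is exactly what submit() later checks to pick fast vs slow path.
thread_local WorkStealingQueue* ThreadPoolWorkStealing::local_queue_ = nullptr;
thread_local size_t ThreadPoolWorkStealing::idx_ = 0;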
Task Execution Priority
- Check own local queue (fast, no lock)
- Check global queue (one lock)
- Steal from other threads (N locks, but rare)
- Yield CPU (no work available)
- Visual: Decision tree flowchart
Impl: Work Stealing Pool
The WorkStealingQueue Class
class WorkStealingQueue {
private:
    mutable std::mutex mtx_;             // mutable: empty() is const but must lock
    std::deque<task_wrapper> tasks_;
public:
    void push(task_wrapper task);        // owner: push to the front
    bool try_pop(task_wrapper& task);    // owner: pop from the front (LIFO)
    bool try_steal(task_wrapper& task);  // thief: pop from the back (FIFO)
    bool empty() const;
};
- Implementation of each method and what it does (sketch below)
- Worth talking about lock granularity? trying to keep it minimal here
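A minimal sketch of the method bodies, one mutex per queue (similar to the locking deque in Williams' book; task_wrapper is the move-only callable sketched later):

void WorkStealingQueue::push(task_wrapper task) {
    std::lock_guard<std::mutex> lock(mtx_);
    tasks_.push_front(std::move(task));   // owner adds to the front
}

bool WorkStealingQueue::try_pop(task_wrapper& task) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (tasks_.empty()) return false;
    task = std::move(tasks_.front());     // owner takes from the front (LIFO)
    tasks_.pop_front();
    return true;
}

bool WorkStealingQueue::try_steal(task_wrapper& task) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (tasks_.empty()) return false;
    task = std::move(tasks_.back());      // thieves take from the back (FIFO)
    tasks_.pop_back();
    return true;
}

bool WorkStealingQueue::empty() const {
    std::lock_guard<std::mutex> lock(mtx_);
    return tasks_.empty();
}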
The ThreadPoolWorkStealing Class Structure
class ThreadPoolWorkStealing {
private:
std::mutex mtx_;
std::queue<task_wrapper> global_queue_;
std::vector<std::unique_ptr<WorkStealingQueue>> worker_queues_;
std::vector<std::jthread> workers_;
size_t num_threads_;
static thread_local WorkStealingQueue* local_queue_;
static thread_local size_t idx_;
public:
ThreadPoolWorkStealing(size_t num_threads);
template<typename F, typename Ret = std::invoke_result_t<F>>
auto submit(F&& f) -> std::future<Ret>;
void run_pending_task();  // public: tasks call this while waiting on futures
};
- Why unique_ptr for the queues? (std::mutex is neither copyable nor movable, so WorkStealingQueue can't live directly in a vector that reallocates; pointers keep queue addresses stable)
- Static thread_local explained... might need an appendix for such a deep dive? :(
- Comparison to basic ThreadPool
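One gap in these notes: task_wrapper is used everywhere but never defined. It has to be a move-only type-erased callable, because submit() stores a std::packaged_task, which std::function can't hold (std::function requires copyable targets). A minimal sketch:

#include <memory>
#include <utility>

class task_wrapper {
    struct callable_base {
        virtual void call() = 0;
        virtual ~callable_base() = default;
    };
    template<typename F>
    struct callable : callable_base {
        F f;
        explicit callable(F f_) : f(std::move(f_)) {}
        void call() override { f(); }
    };
    std::unique_ptr<callable_base> impl_;
public:
    task_wrapper() = default;
    template<typename F>
    task_wrapper(F f) : impl_(std::make_unique<callable<F>>(std::move(f))) {}
    void operator()() { impl_->call(); }  // precondition: non-empty
    // Move-only, because std::packaged_task cannot be copied
    task_wrapper(task_wrapper&&) = default;
    task_wrapper& operator=(task_wrapper&&) = default;
    task_wrapper(const task_wrapper&) = delete;
    task_wrapper& operator=(const task_wrapper&) = delete;
};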
Constructor: Setting Up the Infrastructure
ThreadPoolWorkStealing(size_t num_threads) : num_threads_(num_threads) {
// Create local queues
for (size_t i = 0; i < num_threads_; i++) {
worker_queues_.push_back(std::make_unique<WorkStealingQueue>());
}
// Spawn workers
for (size_t i = 0; i < num_threads_; i++) {
workers_.emplace_back([this, i](std::stop_token stoken) {
worker_loop(stoken, i);
});
}
}
- Ctor infrastructure
- queues must exist before any thread starts: a freshly spawned worker may immediately try to steal from any queue
Worker Loop: The Active Polling Pattern
void worker_loop(std::stop_token stoken, size_t idx) {
idx_ = idx; // Set TLS
local_queue_ = worker_queues_[idx].get();
while (!stoken.stop_requested()) {
run_pending_task(stoken);
}
}
void run_pending_task(std::stop_token& stoken) {
if (stoken.stop_requested()) return;
task_wrapper task;
// Short-circuits: local first (no contention), then global, then stealing
if (pop_from_local(task) ||
pop_from_global(task) ||
pop_from_other_thread_queues(task)) {
task();
} else {
std::this_thread::yield();
}
}
- TLS initialization
- Short-circuit evaluation - needed in writeup TODO?
- yield instead of blocking...
- CPU usage trade-off (burning cycles vs wakeup latency)
Task Submission
template<typename F, typename Ret = std::invoke_result_t<F>>
auto submit(F&& f) -> std::future<Ret> {
std::packaged_task<Ret()> task(std::forward<F>(f));
auto fut = task.get_future();
if (local_queue_) {
// Fast path: We're a worker thread
local_queue_->push([task = std::move(task)]() mutable {
task();
});
} else {
// Slow path: External thread
std::lock_guard lock(mtx_);
global_queue_.emplace([task = std::move(task)]() mutable {
task();
});
}
return fut;
}
- Fast path vs slow path
- Why check local_queue_? (it's non-null only on worker threads; external threads fall through to the global queue)
- Performance difference - worth getting numbers? from time of original submission, or time of submission after the lock is acquired (if a lock exists for that branch, of course)
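A quick usage sketch of both paths (assuming the pool class above):

// External (non-worker) thread: local_queue_ is null here, so this
// takes the slow path through the mutex-protected global queue.
ThreadPoolWorkStealing pool(4);
auto fut = pool.submit([] { return 6 * 7; });
int answer = fut.get();  // 42

// Any submit() called from code already running on a worker thread sees
// a non-null local_queue_ and takes the fast path into that worker's own
// deque; the recursive fib() example later relies on exactly this.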
The Stealing Algorithm
bool pop_from_other_thread_queues(task_wrapper& task) {
for (size_t i = 0; i < worker_queues_.size(); i++) {
// Start at our right neighbour so threads probe queues in different orders
auto idx = (idx_ + i + 1) % worker_queues_.size();
if (worker_queues_[idx]->try_steal(task)) {
return true;
}
}
return false;
}
- Round-robin stealing pattern
- Explain why (idx_ + i + 1) % num_workers
- Load distribution benefits
- viz: stealing pattern diagram
The Helper Methods
bool pop_from_local(task_wrapper& task) {
return local_queue_ && local_queue_->try_pop(task);
}
bool pop_from_global(task_wrapper& task) {
std::lock_guard<std::mutex> lock(mtx_);
if (global_queue_.empty()) return false;
task = std::move(global_queue_.front());
global_queue_.pop();
return true;
}
- Null check for local_queue
- Lock scope minimization
The Magic of Cache Locality: Why LIFO Matters
The Problem with Random Distribution
- Basic thread pool: tasks go to random threads
- Parent and child tasks on different CPUs
- Cache miss penalty? numbers provided? (real / estimate)
- Visual: would be cool to visualise a CPU cache diagram; consult the Red Hat memory paper, i think there should be something there to reference
The LIFO Solution
- Worker submits task → push to front of local queue
- Worker continues → pops from front (LIFO)
- Result: Parent and child on same CPU
- Cache hit… X cycles?
- Visual: Timeline showing task execution on same CPU
The Numbers
Example (placeholder numbers, to be measured):
Fibonacci(35) execution:
- Basic ThreadPool: ~XXXms (5x speedup)
- WorkStealing: ~Yms (25x speedup)
Why? Cache locality (todo: verify the 10-50x per-task claim)
Recursive Algorithm Example
long fib(int n, ThreadPoolWorkStealing& pool) {
    if (n <= 1) return n;
    auto left = pool.submit([&] { return fib(n - 1, pool); });
    long right = fib(n - 2, pool);  // Execute inline
    // Don't block in get(): help run pending tasks instead. The worker
    // pops fib(n-1) from the front of its own queue (LIFO), so fib(n)
    // and fib(n-1) land on the same CPU; blocking here could deadlock
    // once every worker is waiting on a future.
    while (left.wait_for(std::chrono::seconds(0)) == std::future_status::timeout)
        pool.run_pending_task();
    return left.get() + right;
}
- Visual: Task tree showing parent-child affinity
Benchmarking: Four Algorithms, Four Stories
- These are generic algos, could just offload impl to AI
- Do we need briefer on the algos?
Analysis: When Does Work Stealing Shine?
- Best for: Recursive algorithms, variable task sizes
- Good for: Any parallel workload
- Overkill for: Independent, uniform-duration tasks (Pizza Kitchen)
- Table: Algorithm characteristics vs speedup
This section:
- Proves perf improvement w/real data
- Shows different workload patterns
- Teaches when to use which approach
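No numbers committed yet, but a minimal harness sketch for the Fibonacci case (assumes the fib() example and the pool class from earlier in this note):

#include <chrono>
#include <iostream>
#include <thread>

int main() {
    ThreadPoolWorkStealing pool(std::thread::hardware_concurrency());
    auto t0 = std::chrono::steady_clock::now();
    long result = fib(35, pool);
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
    std::cout << "fib(35) = " << result << " in " << ms.count() << " ms\n";
}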
The Trade-offs: Nothing Is Free
Increased Complexity
- More code to write and maintain
- Harder to debug (distributed state)
- More subtle bugs possible… any way to explore?
- When is simplicity worth more?
Memory Overhead
- N local queues vs 1 global queue
- Thread-local storage
- How does it affect memory?
CPU Usage When Idle
- Basic ThreadPool: sleeping
- Work stealing: spinning / yielding
- Trade-off: Responsiveness vs power consumption
- When does this matter?
Lock-Free Alternatives
- Could use lock-free deques (more complex)
- ABA problem, memory ordering
- Diminishing returns for most workloads
- Would have to research more (Anthony Williams' book gave me trauma with lock-free data structures)
Comparison: Basic vs Work Stealing
Feature Comparison Table
| Feature | Basic ThreadPool | WorkStealing |
|---|---|---|
| Complexity | Low | High |
| Memory | ~1KB | ~N KB |
| Contention | High (1 lock) | Low (N+1 locks) |
| Cache Locality | Poor | Excellent |
| Scalability | 4-8 threads | 16+ threads |
| Idle CPU | 0% | x% |
| Best For | Independent tasks | Recursive tasks |
Decision Matrix
- Use Basic when: Simplicity matters, less than 8 threads, independent tasks
- Use WorkStealing when: Performance critical, recursive, many threads
- Visual: Decision tree
Guidance
If your tasks spawn other tasks, use work stealing. Otherwise, a basic thread pool is probably enough.
Real-World Implementations
Intel TBB (Threading Building Blocks)
- tbb::task_group uses work stealing
- Cache-oblivious algorithms
- NUMA-aware scheduling
Advanced Topics: Going Further
Lock-Free Deques
- Chase-Lev algorithm - worth trying to profile?
- ABA problem and solutions
NUMA-Aware Stealing
- Prefer stealing from the same socket (rough sketch below)
- Topology-aware thread placement
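A hedged sketch of same-socket preference; node_of() is a stand-in stub (real code would ask hwloc or libnuma for the topology), while idx_ and worker_queues_ are from the pool above:

// Stub topology: pretend every 8 consecutive workers share a NUMA node.
size_t node_of(size_t worker) { return worker / 8; }

// Hypothetical NUMA-aware variant of pop_from_other_thread_queues():
// try same-node victims first, only then cross the interconnect.
bool steal_numa_aware(task_wrapper& task) {
    const auto my_node = node_of(idx_);
    for (int pass = 0; pass < 2; ++pass) {
        for (size_t i = 0; i < worker_queues_.size(); ++i) {
            auto victim = (idx_ + i + 1) % worker_queues_.size();
            bool same_node = (node_of(victim) == my_node);
            // pass 0: same-node victims only; pass 1: anyone
            if ((pass == 1 || same_node) && worker_queues_[victim]->try_steal(task))
                return true;
        }
    }
    return false;
}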
Adaptive Stealing Strategies
- Track which workers have work, exponential backoff, etc.; do more research (rough sketch below)
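My own guess at the shape of an exponential-backoff idle loop, not a vetted design: spin a little, yield for a while, then sleep with growing intervals, and reset whenever a task is found.

#include <algorithm>
#include <chrono>
#include <thread>

class Backoff {
    int failures_ = 0;
public:
    void idle() {
        if (failures_ < 16) {
            // Stage 1: busy-spin; thread stays hot (lowest wakeup latency)
        } else if (failures_ < 64) {
            std::this_thread::yield();  // Stage 2: give up the timeslice
        } else {
            // Stage 3: sleep, doubling from 1us up to ~1ms
            int e = std::min(failures_ - 64, 10);
            std::this_thread::sleep_for(std::chrono::microseconds(1 << e));
        }
        ++failures_;
    }
    void reset() { failures_ = 0; }  // call after successfully popping a task
};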
Key Takeaways and Exercises
What We Learned
- Work stealing reduces contention through distributed queues
- LIFO for owner provides cache locality
- FIFO for thieves avoids conflicts
- Active polling trades CPU for responsiveness
- Speedup numbers depend on the algorithm's task structure
Ideas for future dev
- Implement lock-free deque
- Add NUMA-aware stealing
- Benchmark on your workload
- Add priority queues
- Implement adaptive backoff
Reflection Questions
- When would you choose basic over work stealing?
- How would you debug a deadlock in the stealing loop?
- What workloads benefit most from cache locality?
Complete Code Listing (Appendix)
Content:
- Full WorkStealingQueue class
- Full ThreadPoolWorkStealing class
- All four benchmark algorithms
- Build instructions
- How to run benchmarks
Further Reading and References (Appendix)
Papers
- “Dynamic Circular Work-Stealing Deque” (Chase & Lev)
- “The Data Locality of Work Stealing” (Acar et al.)
Books
- C++ Concurrency in Action (Anthony Williams) - Chapter 9
Talks
- “Designing Concurrent C++ Applications” (Herb Sutter)
- “Lock-Free Programming” (Fedor Pikus)
Visuals todo
- Restaurant with prep stations (work stealing analogy)
- Shared queue vs local queues (architecture comparison)
- Deque operations diagram (push/pop from both ends)
- Thread-local storage memory layout
- Task execution priority flowchart
- CPU cache hierarchy
- Parent-child task affinity timeline
- Stealing pattern diagram (round-robin)
- Four benchmark bar charts (basic vs work stealing)
- Decision tree (when to use which)
- Lock contention comparison (basic vs work stealing)