Part 1. Thread Pools
Table of contents
- Intro: The Pizza Kitchen Problem
- Basic example (sequential processing)
- Concurrency Fundamentals: Foundational knowledge
- Designing Our Thread Pool: The Architecture
- Potential solution: create a move-only wrapper
- visuals
Intro: The Pizza Kitchen Problem
- Story: A pizza kitchen with one cook making pizzas sequentially
- Introduce the problem: 50 orders, each taking 100-500ms to prepare
- Sequential execution time: ~10 seconds total
- What if we had multiple cooks working in parallel?
- Transition: This is exactly what thread pools solve in software
Basic example (sequential processing)
- Code snippet: sequential pizza processing (sketch below)
- Benchmark results showing ~10s runtime
- Visualise?
- …lead into the next section: CPU cores sitting idle
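A minimal sketch of the sequential baseline; make_pizza and its 100-500ms sleep are hypothetical stand-ins for real prep work:

#include <chrono>
#include <iostream>
#include <random>
#include <thread>

// hypothetical stand-in for real work: each pizza takes 100-500ms
void make_pizza(int /*order*/) {
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> prep_ms(100, 500);
    std::this_thread::sleep_for(std::chrono::milliseconds(prep_ms(rng)));
}

int main() {
    const auto start = std::chrono::steady_clock::now();
    for (int order = 0; order < 50; ++order)
        make_pizza(order);  // one cook: strictly one pizza at a time
    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << "50 pizzas in " << ms << "ms\n";  // roughly 50 x average prep time
}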
Concurrency Fundamentals: Foundational knowledge
Threads vs Processes
- Visual: Process with multiple threads diagram
Challenges with shared state if we split work with threads
- Race conditions explained with a simple counter example (sketch below)
- Why we need synchronisation primitives and the overhead they add
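A minimal sketch of the classic data race, using nothing beyond the standard library: two threads increment a shared counter without synchronisation, so increments get lost.

#include <iostream>
#include <thread>

int main() {
    int counter = 0;  // shared, unsynchronised state
    auto hammer = [&counter] {
        for (int i = 0; i < 100'000; ++i)
            ++counter;  // load, add, store: three steps that interleave
    };
    std::jthread t1(hammer);
    std::jthread t2(hammer);
    t1.join();
    t2.join();
    std::cout << counter << '\n';  // almost always < 200000: updates were lost
}

A std::mutex around the increment (or std::atomic<int>) restores the correct count, and that cost is exactly the synchronisation overhead the bullet above refers to.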
C++20 Threading Primitives
- std::jthread - auto-joining threads (before/after C++20 demo worth it? e.g. a thread-joiner class?)
- std::mutex - worth discussing lock-free or nah?
- std::condition_variable - thread signalling
- std::stop_token - cooperative cancellation
- Code snippets for each with comments (combined sketch below)
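A combined sketch, assuming only the standard library: a jthread worker waits on a condition_variable_any and wakes for either new work or a stop request.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

int main() {
    std::mutex mtx;
    std::condition_variable_any cv;
    std::queue<int> work;

    // std::jthread passes a std::stop_token to the callable, and its
    // destructor requests stop and joins automatically
    std::jthread worker([&](std::stop_token stoken) {
        while (true) {
            std::unique_lock lock(mtx);
            // the stop_token-aware wait also wakes on request_stop()
            if (!cv.wait(lock, stoken, [&] { return !work.empty(); }))
                return;  // stop requested while waiting
            int item = work.front();
            work.pop();
            lock.unlock();
            std::cout << "processed " << item << '\n';
        }
    });

    {
        std::lock_guard lock(mtx);
        work.push(42);
    }
    cv.notify_one();
    // worker's destructor calls request_stop() and join() for us
}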
Designing Our Thread Pool: The Architecture
I think at this point it's worth explaining the purpose of each thing we introduce; a reader who isn't experienced in multithreading could otherwise get overwhelmed.
Core Components
- Shared task queue (what work needs to be done)
- Worker threads (who does the work)
- Synchronization primitives (how they help coordinate concurrent jobs safely)
- Visual: arch diagram with labeled components?
Design Decisions
- Why a shared queue? (simplicity, fairness)
- How many threads? (hardware_concurrency; see the one-liner after this list)
    - deeper problem though - out of scope for this article? can still do experiments and visualise the effect of different thread counts
- FIFO vs LIFO task ordering
- priority / scheduling important with its own problems.
- problem dependent - pizzas fifo most likely. can’t think otherwise
- How to handle shutdown gracefully
    - cv approach is cool - wonder if there are options other than cv/stop_token
    - worth researching more?
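One-liner for the thread-count default, assuming only the standard library: hardware_concurrency() is just a hint and may legally return 0.

#include <algorithm>
#include <cstddef>
#include <thread>

// hardware_concurrency() may return 0 when the count is unknown;
// fall back to a single thread in that case
const std::size_t default_threads =
    std::max<std::size_t>(std::size_t{1}, std::thread::hardware_concurrency());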
The Task Wrapper Challenge
- Problem: Storing different function types in one queue
- Solution: a type-erased task
- Code: simple task_wrapper implementation? Could reference C++ Concurrency in Action & C++ Software Design (Klaus Iglberger)
- Why std::function isn't enough for type erasure here (move-only semantics) - elaborate??
#include <functional>
#include <future>
#include <queue>

// std::packaged_task wraps the callable together with a std::promise,
// so it is move-only: copying it would duplicate the promise
std::packaged_task<int()> task([]{ return 1; });

// std::function requires a copyable callable, but packaged_task is
// move-only --> this line does not compile:
// std::function<void()> func = std::move(task);  // error

// and even where std::function works, the return value is lost:
std::queue<std::function<void()>> tasks;
tasks.push([]{ return 42; });  // compiles, but 42 goes nowhere
Potential solution: create a move-only wrapper
#include <memory>
#include <type_traits>
#include <utility>

class task_wrapper {
    // type-erasure base: any callable hides behind this interface
    struct callable_base {
        virtual ~callable_base() = default;
        virtual void call() = 0;
    };
    template<typename F>
    struct callable_impl : callable_base {
        F func;
        explicit callable_impl(F f) : func(std::move(f)) {}
        void call() override { func(); }
    };
    std::unique_ptr<callable_base> impl;
public:
    template<typename F>
    task_wrapper(F&& f)
        : impl(std::make_unique<callable_impl<std::decay_t<F>>>(std::forward<F>(f))) {}
    // move-only by design: the unique_ptr forbids copies, which is the point
    task_wrapper(task_wrapper&&) = default;
    task_wrapper& operator=(task_wrapper&&) = default;
    void operator()() { impl->call(); }
};
// use packaged_task correctly this time...
std::packaged_task<int()> task([]{ return 42; });
auto fut = task.get_future();
std::queue<task_wrapper> tasks;

// lambda call operators are const by default, so mark it mutable:
// packaged_task::operator() modifies internal state (the stored promise)
tasks.emplace([task = std::move(task)]() mutable { task(); });

// a worker would pop and run the task; simulate that here
task_wrapper work = std::move(tasks.front());
tasks.pop();
work();

// ...retrieve the result as normal
int result = fut.get();  // 42
Implementation: Building the Thread Pool
The Class Structure
#include <condition_variable>
#include <cstddef>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
private:
    std::mutex mtx_;                    // guards tasks_
    std::condition_variable_any cv_;    // _any: has the stop_token-aware wait
    std::vector<std::jthread> workers_; // request stop + join automatically
    std::queue<task_wrapper> tasks_;    // type-erased, move-only tasks
    void worker_loop(std::stop_token stoken);
public:
    explicit ThreadPool(std::size_t num_threads);
    template<typename F> auto submit(F&& f) -> std::future<decltype(f())>;
    void run_pending_task();
};
- Explanation of each member variable
- Why these specific types?
Constructor: Spawning Workers
ThreadPool::ThreadPool(std::size_t num_threads) {
    workers_.reserve(num_threads);
    for (std::size_t i = 0; i < num_threads; i++) {
        // std::jthread passes a stop_token to the callable; stop is
        // requested automatically when the jthread is destroyed
        workers_.emplace_back([this](std::stop_token stoken) {
            worker_loop(stoken);
        });
    }
}
- Step-by-step breakdown
- Lambda capture explained
- Why emplace_back vs push_back
The Worker Loop: The Heart of the Pool
void ThreadPool::worker_loop(std::stop_token stoken) {
    while (!stoken.stop_requested()) {
        std::unique_lock<std::mutex> lock(mtx_);
        // wakes when there is work OR when stop is requested
        cv_.wait(lock, stoken, [this] { return !tasks_.empty(); });
        if (stoken.stop_requested()) return;
        if (tasks_.empty()) continue;  // defensive: nothing to do
        task_wrapper task = std::move(tasks_.front());
        tasks_.pop();
        lock.unlock();  // never run user code while holding the lock
        task();
    }
}
- Line-by-line explanation
- Why unlock before executing task? (critical!)
- Condition variable mechanics
- Stop token for graceful shutdown
Task Submission: Adding Work
template<typename F>
auto ThreadPool::submit(F&& f) -> std::future<decltype(f())> {
    using ret_type = decltype(f());
    std::packaged_task<ret_type()> task(std::forward<F>(f));
    auto fut = task.get_future();
    {
        std::unique_lock<std::mutex> lock(mtx_);
        // mutable: packaged_task::operator() mutates the stored promise
        tasks_.emplace([task = std::move(task)]() mutable {
            task();
        });
    }
    cv_.notify_one();  // after unlocking, so the woken thread isn't blocked on mtx_
    return fut;
}
- Template mechanics explained
- std::packaged_task for result retrieval
- Perfect forwarding
- Why notify after unlocking
The Helper: run_pending_task
- Why this exists (avoid blocking main thread)
- Implementation walkthrough (sketch below)
- Use case in recursive algorithms
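A minimal sketch, assuming the members from the class above: pop one task if available and run it, otherwise yield so the caller doesn't spin.

void ThreadPool::run_pending_task() {
    std::unique_lock<std::mutex> lock(mtx_);
    if (tasks_.empty()) {
        lock.unlock();
        std::this_thread::yield();  // nothing queued; give up the timeslice
        return;
    }
    task_wrapper task = std::move(tasks_.front());
    tasks_.pop();
    lock.unlock();  // same rule as the worker loop: run outside the lock
    task();
}

This lets a thread that is waiting on a future make progress by executing queued work itself instead of blocking, which is what recursive algorithms need.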
Why this section:
- Core learning content - most time spent here
- Incremental complexity (constructor → loop → submit)
- Explains “why” for every “what”
- Code comments are teaching tools
- Builds muscle memory through repetition
Putting It to Work: The Pizza Kitchen Revisited
The Code
ThreadPool pool(8);
std::vector<std::future<void>> futures;
for (const auto& order : orders) {
futures.push_back(pool.submit([order]() {
make_pizza(order);
}));
}
for (auto& future : futures) {
future.wait();
}
The Results
- Sequential: ~10,000ms
- Thread Pool (8 threads): ~1,500ms
- Speedup: 6.7x
- Visual: Bar chart comparing execution times
- Why not 8x? (overhead, Amdahl’s law)
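Rough arithmetic, assuming the numbers above: Amdahl's law gives speedup = 1 / (s + (1 - s)/N) for serial fraction s on N threads. Solving 1/6.7 ≈ s + (1 - s)/8 gives s ≈ 2.8%, so even a few percent of serial work (queue locking, task dispatch) is enough to eat the missing 1.3x.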
Observing the Threads
- How to see threads in action (debugger, htop)
- CPU utilization before/after
- Visual: CPU usage graph
Understanding the Bottleneck: Lock Contention
The Problem
- All threads compete for one mutex
- Condition variable causes kernel-level blocking
- Visual: Timeline showing threads waiting for lock
Measuring Contention
- How to detect lock contention (profiling tools)
- Expected vs actual speedup
- When does it break down? (16+ threads)
The Trade-off
- Simplicity vs scalability
- When is this design good enough?
- Foreshadowing: “There’s a better way…” (work stealing)
Common Pitfalls and How to Avoid Them
Deadlocks
- Example: Holding the lock while executing a task (sketch below)
- How to debug
- Prevention strategies
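A minimal sketch of the trap, assuming a buggy variant of the worker loop that never calls lock.unlock() before task(): a task that submits more work needs the same mutex its own worker is still holding.

// suppose the worker loop forgot lock.unlock() before task():
// the nested submit() below then locks mtx_ again on the same
// thread --> deadlock, since std::mutex is not recursive
void trigger_deadlock(ThreadPool& pool) {
    pool.submit([&pool] {
        pool.submit([] { /* more work */ });  // blocks forever on mtx_
    });
}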
Exception Safety
- What happens if a task throws?
- std::packaged_task stores the exception in the shared state
- Still need to call .get() on the future to see it (sketch below)
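A minimal sketch, assuming a pool named pool from this article: packaged_task catches whatever the callable throws, and future::get() rethrows it.

#include <stdexcept>

auto fut = pool.submit([]() -> int {
    throw std::runtime_error("oven on fire");
});
try {
    fut.get();  // the stored exception is rethrown here
} catch (const std::runtime_error& e) {
    // handle / log; without calling .get(), the failure stays invisible
}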
Lifetime Issues
- Thread pool must outlive submitted tasks
- Dangling references in lambdas
- Use shared_ptr when needed (sketch below)
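A minimal sketch of the dangling-reference trap and the shared_ptr fix; Order and this make_pizza overload are hypothetical.

#include <memory>

void take_order(ThreadPool& pool) {
    Order local{/* ... */};
    // BAD: captures the local by reference; the task may run after
    // take_order() has returned, leaving a dangling reference:
    // pool.submit([&local] { make_pizza(local); });

    // GOOD: a shared_ptr copy in the capture keeps the Order alive
    // until the last task holding it finishes
    auto order = std::make_shared<Order>(local);
    pool.submit([order] { make_pizza(*order); });
}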
Forgetting to Wait
- Fire-and-forget pitfall
- Destructor races
- Always join or wait
Extending the Design: Production Considerations
What’s Missing?
- Priority queues? not sure tbh - might be better used in a scheduling article - could even go into os concepts too
- Task cancellation
- Dynamic thread count
- Exception logging
- Metrics/observability
When to Use This Design
- Independent tasks
- Uniform task duration
- Low thread count (less than 8)
- Simplicity matters
When to Look Elsewhere
- Recursive algorithms (work stealing better)
- Very high thread counts
- I/O-bound workloads (async I/O better)
Key Takeaways and Next Steps
What We Learned
- Thread pools reuse threads for efficiency
- Shared queue + condition variable = simple coordination
- Type erasure enables heterogeneous task storage
- Lock contention is the main bottleneck
Exercises for the Reader
- Modify to support priority tasks
- Add task cancellation
- Implement dynamic thread count
- Add performance metrics
Coming Next
- Teaser for Blog Post 2: Work Stealing
- “What if threads could help each other?”
- Preview of 10-100x speedup for recursive algorithms
Complete Code Listing
Content:
- Full ThreadPool class
- Full task_wrapper class
- Pizza kitchen benchmark code
- Build instructions (CMake)
Why this section:
- Readers can copy-paste and experiment
- Removes friction to trying it out
- Serves as reference during exercises
Further Reading (Appendix)
Content:
- C++ Concurrency in Action (Anthony Williams)
- C++ threading documentation (maybe overkill - where would the reader even start?)
- Herb Sutter’s talks on concurrency
- Intel TBB documentation
- Links to other easy-to-use pools? e.g. Boost, std::async (though the programmer has little control)?
visuals
- Pizza kitchen timeline (sequential vs parallel)
- Process/thread memory diagram
- Thread pool architecture diagram
- Worker thread state machine
- Lock contention visualisation; profiling visuals would be cool too
- Performance bar chart (sequential vs parallel)
- CPU utilization graph
- Task flow diagram (submit → queue → execute)
Other notes about Concurrency and/or C++
- 🌲 Part 2. Work Stealing Thread Pools (C++20)
- 🌿 Developing a deep learning framework
- 🌱 MPMC Queue
- 🌱 C++ low-latency design patterns
- 🌱 Atomics
- 🌿 SPSC Thread-Safe Queue
- 🌿 Implementing STL's std::shared_ptr
- 🌿 Implementing STL's std::unique_ptr
- 🌿 Implementing STL's std::vector
- 🌿 Type Erasure in C++