All Notes

Part 1. Thread Pools

Last updated

Jan 26, 2026

Intro: The Pizza Kitchen Problem

  • Story: A pizza kitchen with one cook making pizzas sequentially
  • Introduce the problem: 50 orders, each taking 100-500ms to prepare
  • Sequential execution time: ~10 seconds total
  • What if we had multiple cooks working in parallel?
  • Transition: This is exactly what thread pools solve in software

Basic example (sequential processing)

  • Code snippet: sequential pizza processing (rough sketch after this list)
  • Benchmark results showing ~10s runtime
  • Visualise?
  • …lead on with CPU cores sitting idle
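
A rough sketch of the baseline (Order, make_pizza, and the sleep-based "work" are placeholders, not the final benchmark code):

#include <chrono>
#include <random>
#include <thread>
#include <vector>

struct Order { int id; std::chrono::milliseconds prep_time; };

// placeholder: stand in for real prep work by sleeping
void make_pizza(const Order& order) {
    std::this_thread::sleep_for(order.prep_time);
}

int main() {
    // 50 orders taking 100-500ms each, as in the intro
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(100, 500);
    std::vector<Order> orders;
    for (int i = 0; i < 50; ++i)
        orders.push_back({i, std::chrono::milliseconds(dist(rng))});

    // one cook: each pizza waits for the previous one to finish
    for (const auto& order : orders)
        make_pizza(order);
}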

Concurrency Fundamentals: the background we need

Threads vs Processes

  • Visual: Process with multiple threads diagram

Challenges with shared state when we split work across threads

  • Race conditions explained with simple counter example (sketch after this list)
  • Why we need synchronisation primitives and any overhead this adds
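
A minimal sketch of the counter race (the final count varies run to run, which is the whole point):

#include <iostream>
#include <thread>

int main() {
    int counter = 0;  // shared, no synchronisation

    auto bump = [&counter] {
        for (int i = 0; i < 100000; ++i)
            ++counter;  // read-modify-write: the two threads' updates
                        // interleave and overwrite each other
    };

    std::thread t1(bump);
    std::thread t2(bump);
    t1.join();
    t2.join();

    // almost always prints less than 200000: increments were lost
    std::cout << counter << '\n';
}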

C++20 Threading Primitives

  • std::jthread - auto-joining threads (before/after C++20 demo worth it? e.g. a hand-rolled joining-thread RAII class?)
  • std::mutex - worth discussing lock-free or nah?
  • std::condition_variable - thread signalling
  • std::stop_token - cooperative cancellation
  • Code snippets for each with comments (one combined sketch below for now)
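
One combined, hedged sketch covering all four (could be split per primitive in the final post):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::mutex mtx;                   // guards `jobs`
std::condition_variable_any cv;   // _any overload accepts a stop_token
std::queue<int> jobs;

int main() {
    // std::jthread: auto-joins and hands the body a stop_token
    std::jthread worker([](std::stop_token stoken) {
        while (true) {
            std::unique_lock lock(mtx);
            // wakes on notify; returns false if stop was requested
            // while the predicate is still unsatisfied
            if (!cv.wait(lock, stoken, [] { return !jobs.empty(); }))
                return;  // cooperative cancellation
            int job = jobs.front();
            jobs.pop();
            lock.unlock();
            std::cout << "job " << job << '\n';
        }
    });

    {
        std::lock_guard lock(mtx);
        jobs.push(1);
    }
    cv.notify_one();  // signal the waiting worker

    // worker's destructor: request_stop() then join()
}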

Designing Our Thread Pool: The Architecture

I think at this point it's worth explaining the purpose of each thing we introduce; a reader without much multithreading experience could get overwhelmed otherwise.

Core Components

  • Shared task queue (what work needs to be done)
  • Worker threads (who does the work)
  • Synchronisation primitives (how they help coordinate concurrent jobs safely)
  • Visual: arch diagram with labeled components?

Design Decisions

  • Why a shared queue? (simplicity, fairness)
  • How many threads? (hardware_concurrency; small sketch after this list)
    • deeper problem though - out of scope for this article? can still do experiments and visualise the effect of different numbers of threads
  • FIFO vs LIFO task ordering
    • priority / scheduling important with its own problems.
    • problem dependent - FIFO most likely for pizzas; can't think of a case here where it wouldn't be
  • How to handle shutdown gracefully
    • cv cool - wonder if there are options other than cv/stop_token
    • worth researching more?
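
A small sketch for the thread-count default (the n == 0 fallback is needed because the standard allows hardware_concurrency() to return 0 when it can't tell):

#include <cstddef>
#include <thread>

// hardware_concurrency() is only a hint and may legally return 0
std::size_t default_thread_count() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 2 : n;  // assumption: 2 is a sane floor
}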

The Task Wrapper Challenge

  • Problem: Storing different function types in one queue
  • Solution: a type-erased task wrapper
  • Code: Simple task_wrapper implementation?
    • could reference C++ Concurrency in Action & Klaus Iglberger's C++ Software Design
  • Why std::function isn't enough for type erasure here: it requires a copyable callable, but std::packaged_task is move-only (it owns the promise behind the future)
#include <functional>
#include <future>
#include <queue>

// std::packaged_task is move-only: it owns the promise feeding its
// future, so copying would mean two owners of one shared state
std::packaged_task<int()> task([]{ return 1; });

// std::function requires a copyable callable, but packaged_task can
// only be moved --> compile error
std::function<void()> func = std::move(task); // ERROR: does not compile

// plain lambdas fit fine, but then the result has no future to travel
// through - the return value (and any exception) is simply lost
std::queue<std::function<void()>> tasks;
tasks.push([]{ return 42; }); // return value lost

Potential solution: write our own type-erased wrapper that only requires the callable to be movable, not copyable

#include <memory>
#include <type_traits>
#include <utility>

class task_wrapper {
    // type erasure: one abstract base the queue can hold
    struct callable_base {
        virtual ~callable_base() = default;
        virtual void call() = 0;
    };

    // one concrete wrapper generated per callable type F
    // (takes F by value so both lvalue and rvalue callables work)
    template<typename F>
    struct callable_impl : callable_base {
        F func;
        explicit callable_impl(F f) : func(std::move(f)) {}
        void call() override { func(); }
    };

    std::unique_ptr<callable_base> impl;

public:
    template<typename F>
    task_wrapper(F&& f)
        : impl(std::make_unique<callable_impl<std::decay_t<F>>>(std::forward<F>(f))) {}

    // move-only, to match the move-only callables it stores
    task_wrapper(task_wrapper&&) = default;
    task_wrapper& operator=(task_wrapper&&) = default;

    void operator()() { impl->call(); }
};

// use packaged_task correctly this time...
std::packaged_task<int()> task([]{ return 42; });
auto fut = task.get_future();

std::queue<task_wrapper> tasks;

// a lambda's operator() is const by default, so mark it mutable:
// packaged_task::operator() is non-const (it fulfils the promise)
tasks.emplace([task = std::move(task)]() mutable { task(); });

// someone still has to run the task (in the pool, a worker thread);
// otherwise fut.get() below would block forever
task_wrapper t = std::move(tasks.front());
tasks.pop();
t();

// ...and collect the result as normal
int result = fut.get(); // 42

Implementation: Building the Thread Pool

The Class Structure

class ThreadPool {
private:
    std::mutex mtx_;                     // guards tasks_
    std::condition_variable_any cv_;     // _any: supports stop_token waits
    std::vector<std::jthread> workers_;  // auto-joining worker threads
    std::queue<task_wrapper> tasks_;     // shared FIFO of pending work

    void worker_loop(std::stop_token stoken);

public:
    ThreadPool(size_t num_threads);

    template<typename F>
    auto submit(F&& f) -> std::future<decltype(f())>;

    void run_pending_task();
};
  • Explanation of each member variable
  • Why these specific types?

Constructor: Spawning Workers

ThreadPool(size_t num_threads) {
    workers_.reserve(num_threads);
    for (size_t i = 0; i < num_threads; i++) {
        // jthread passes its stop_token as the first argument when the
        // callable accepts one
        workers_.emplace_back([this](std::stop_token stoken) {
            worker_loop(stoken);
        });
    }
}
  • Step-by-step breakdown
  • Lambda capture explained
  • Why emplace_back vs push_back

The Worker Loop: The Heart of the Pool

void worker_loop(std::stop_token stoken) {
    while (!stoken.stop_requested()) {
        std::unique_lock<std::mutex> lock(mtx_);

        // sleeps until notified with work available, or stop is requested
        cv_.wait(lock, stoken, [this] { return !tasks_.empty(); });

        if (stoken.stop_requested()) return;  // woken by shutdown
        if (tasks_.empty()) continue;         // lost the race for the task

        task_wrapper task = std::move(tasks_.front());
        tasks_.pop();
        lock.unlock();  // critical: never run user code while holding the lock

        task();
    }
}
  • Line-by-line explanation
  • Why unlock before executing task? (critical!)
  • Condition variable mechanics
  • Stop token for graceful shutdown

Task Submission: Adding Work

template<typename F>
auto submit(F&& f) -> std::future<decltype(f())> {
    using ret_type = decltype(f());

    // packaged_task gives us a future plus exception propagation for free
    std::packaged_task<ret_type()> task(std::forward<F>(f));
    auto fut = task.get_future();
    {
        std::unique_lock<std::mutex> lock(mtx_);
        tasks_.emplace([task = std::move(task)]() mutable {
            task();
        });
    }   // unlock before notifying so the woken worker isn't blocked on mtx_
    cv_.notify_one();
    return fut;
}
  • Template mechanics explained
  • std::packaged_task for result retrieval
  • Perfect forwarding
  • Why notify after unlocking

The Helper: run_pending_task

  • Why this exists (avoid blocking main thread)
  • Implementation walkthrough (sketch below)
  • Use case in recursive algorithms
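
A sketch of what run_pending_task could look like, following the pattern from C++ Concurrency in Action (assumes the members from the class structure above):

// lets the calling thread lend a hand instead of blocking, e.g. while
// it waits on a future produced by a subtask it submitted
void run_pending_task() {
    std::unique_lock<std::mutex> lock(mtx_);
    if (tasks_.empty()) {
        lock.unlock();
        std::this_thread::yield();  // nothing to do; don't spin hot
        return;
    }
    task_wrapper task = std::move(tasks_.front());
    tasks_.pop();
    lock.unlock();  // same rule as the worker loop: never run under the lock
    task();
}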

Why this section:

  • Core learning content - most time spent here
  • Incremental complexity (constructor → loop → submit)
  • Explains “why” for every “what”
  • Code comments are teaching tools
  • Builds muscle memory through repetition

Putting It to Work: The Pizza Kitchen Revisited

The Code

ThreadPool pool(8);
std::vector<std::future<void>> futures;

for (const auto& order : orders) {
    futures.push_back(pool.submit([order]() {
        make_pizza(order);
    }));
}

for (auto& future : futures) {
    future.wait();
}

The Results

  • Sequential: ~10,000ms
  • Thread Pool (8 threads): ~1,500ms
  • Speedup: 6.7x
  • Visual: Bar chart comparing execution times
  • Why not 8x? (overhead, Amdahl's law; quick sanity check below)
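
Quick Amdahl sanity check (treating the serialised fraction s as lock/submit overhead): speedup = 1 / (s + (1 - s)/N). With N = 8 and the measured 6.7x, s comes out around 0.03 - i.e. roughly 3% of serialised work already costs more than a full core's worth of throughput.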

Observing the Threads

  • How to see threads in action (debugger, htop)
  • CPU utilization before/after
  • Visual: CPU usage graph

Understanding the Bottleneck: Lock Contention

The Problem

  • All threads compete for one mutex
  • Condition variable causes kernel-level blocking
  • Visual: Timeline showing threads waiting for lock

Measuring Contention

  • How to detect lock contention (profiling tools)
  • Expected vs actual speedup
  • When does it break down? (16+ threads)

The Trade-off

  • Simplicity vs scalability
  • When is this design good enough?
  • Foreshadowing: “There’s a better way…” (work stealing)

Common Pitfalls and How to Avoid Them

Deadlocks

  • Example: Holding lock while executing task (anti-pattern sketch below)
  • How to debug
  • Prevention strategies
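
A hedged sketch of the anti-pattern, inside a hypothetical worker loop (this is exactly what our worker loop's early unlock avoids):

// anti-pattern: running the task while still holding the pool mutex
std::unique_lock<std::mutex> lock(mtx_);
task_wrapper task = std::move(tasks_.front());
tasks_.pop();
task();  // WRONG: still holding mtx_. If the task calls submit(),
         // it locks mtx_ again --> deadlock on a non-recursive mutex;
         // and every other worker stalls behind this one either way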

Exception Safety

  • What happens if a task throws? (sketch below)
  • std::packaged_task handles this
  • Still need to call .get() on future
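
A minimal sketch of the rethrow path, assuming the submit() from above:

#include <iostream>
#include <stdexcept>

auto fut = pool.submit([]() -> int {
    throw std::runtime_error("burnt pizza");  // escapes the task body
});

try {
    fut.get();  // packaged_task stored the exception; get() rethrows here
} catch (const std::runtime_error& e) {
    std::cerr << "task failed: " << e.what() << '\n';
}
// skip the .get() and the failure is silently swallowed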

Lifetime Issues

  • Thread pool must outlive submitted tasks
  • Dangling references in lambdas (sketch below)
  • Use shared_ptr when needed
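
A hedged sketch (Order and next_order are placeholders from the benchmark):

{
    Order order = next_order();
    // BUG: captures a reference to a local; the task may run after
    // `order` is destroyed --> dangling reference, undefined behaviour
    pool.submit([&order] { make_pizza(order); });
}   // order dies here, possibly before the task runs

// fix: capture by value, or use shared_ptr for heavyweight shared state
auto order_ptr = std::make_shared<Order>(next_order());
pool.submit([order_ptr] { make_pizza(*order_ptr); });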

Forgetting to Wait

  • Fire-and-forget pitfall
  • Destructor races (sketch below)
  • Always join or wait
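
A sketch of the destructor race, assuming the jthread-based shutdown above (I believe an unrun packaged_task abandons its shared state, so get() throws broken_promise rather than blocking forever, but worth verifying):

std::future<int> fut;
{
    ThreadPool pool(4);
    fut = pool.submit([] { return 42; });
}   // jthread dtors request_stop() + join(); tasks still queued are
    // destroyed along with the queue, never having run

// if the task never ran, its packaged_task died unfulfilled, so this
// throws std::future_error (broken_promise) instead of returning 42
int answer = fut.get();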

Extending the Design: Production Considerations

What’s Missing?

  • Priority queues? not sure tbh - might be better suited to a scheduling article - could even go into OS concepts too
  • Task cancellation
  • Dynamic thread count
  • Exception logging
  • Metrics/observability

When to Use This Design

  • Independent tasks
  • Uniform task duration
  • Low thread count (less than 8)
  • Simplicity matters

When to Look Elsewhere

  • Recursive algorithms (work stealing better)
  • Very high thread counts
  • I/O-bound workloads (async I/O better)

Key Takeaways and Next Steps

What We Learned

  • Thread pools reuse threads for efficiency
  • Shared queue + condition variable = simple coordination
  • Type erasure enables heterogeneous task storage
  • Lock contention is the main bottleneck

Exercises for the Reader

  1. Modify to support priority tasks
  2. Add task cancellation
  3. Implement dynamic thread count
  4. Add performance metrics

Coming Next

  • Teaser for Blog Post 2: Work Stealing
  • “What if threads could help each other?”
  • Preview of 10-100x speedup for recursive algorithms

Complete Code Listing

Content:

  • Full ThreadPool class
  • Full task_wrapper class
  • Pizza kitchen benchmark code
  • Build instructions (CMake)

Why this section:

  • Readers can copy-paste and experiment
  • Removes friction to trying it out
  • Serves as reference during exercises

Further Reading (Appendix)

Content:

  • C++ Concurrency in Action (Anthony Williams)
  • C++ threading documentation (maybe overkill - where would the reader even start?)
  • Herb Sutter’s talks on concurrency
  • Intel TBB documentation
  • Links to other easy-to-use pools? e.g. Boost, std::async (though the programmer has little control)?

visuals

  1. Pizza kitchen timeline (sequential vs parallel)
  2. Process/thread memory diagram
  3. Thread pool architecture diagram
  4. Worker thread state machine
  5. Lock contention visualisation (profiling visuals would be cool too)
  6. Performance bar chart (sequential vs parallel)
  7. CPU utilization graph
  8. Task flow diagram (submit → queue → execute)

Other notes about Concurrency and/or C++