Part 1. Thread Pools
Table of contents
- Intro: The Pizza Kitchen Problem
- Basic example (sequential processing)
- Concurrency Fundamentals: Foundational knowledge
- Designing Our Thread Pool: The Architecture
- Potential solution: create a move-only wrapper
- visuals
Intro: The Pizza Kitchen Problem
- Story: A pizza kitchen with one cook making pizzas sequentially
- Introduce the problem: 50 orders, each taking 100-500ms to prepare
- Sequential execution time: ~10 seconds total
- What if we had multiple cooks working in parallel?
- Transition: This is exactly what thread pools solve in software
Basic example (sequential processing)
- Code snippet: sequential pizza processing (sketch below)
- Benchmark results showing ~10s runtime
- Visualise?
- …lead into the next section: CPU cores sitting idle
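A minimal sketch of the sequential baseline; make_pizza and its 100-500ms sleep are hypothetical stand-ins for real prep work:

#include <chrono>
#include <iostream>
#include <random>
#include <thread>

// hypothetical stand-in for real work: each pizza takes 100-500ms
void make_pizza(int /*order*/) {
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> prep_ms(100, 500);
    std::this_thread::sleep_for(std::chrono::milliseconds(prep_ms(rng)));
}

int main() {
    const auto start = std::chrono::steady_clock::now();
    for (int order = 0; order < 50; ++order)
        make_pizza(order);  // one cook: strictly one pizza at a time
    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << "50 pizzas in " << ms << "ms\n";  // roughly 50 x average prep time
}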
Concurrency Fundamentals: Foundational knowledge
Threads vs Processes
- Visual: Process with multiple threads diagram
Challenges with shared state if we split work with threads
- Race conditions explained with a simple counter example (sketch below)
- Why we need synchronisation primitives and the overhead they add
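A minimal sketch of the classic data race, using nothing beyond the standard library: two threads increment a shared counter without synchronisation, so increments get lost.

#include <iostream>
#include <thread>

int main() {
    int counter = 0;  // shared, unsynchronised state
    auto hammer = [&counter] {
        for (int i = 0; i < 100'000; ++i)
            ++counter;  // load, add, store: three steps that interleave
    };
    std::jthread t1(hammer);
    std::jthread t2(hammer);
    t1.join();
    t2.join();
    std::cout << counter << '\n';  // almost always < 200000: updates were lost
}

A std::mutex around the increment (or std::atomic<int>) restores the correct count, and that cost is exactly the synchronisation overhead the bullet above refers to.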
C++20 Threading Primitives
- std::jthread - auto-joining threads (before/after C++20 demo worth it? e.g. a thread-joiner class?)
- std::mutex - worth discussing lock-free or nah?
- std::condition_variable - thread signalling
- std::stop_token - cooperative cancellation
- Code snippets for each with comments (combined sketch below)
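A combined sketch, assuming only the standard library: a jthread worker waits on a condition_variable_any and wakes for either new work or a stop request.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

int main() {
    std::mutex mtx;
    std::condition_variable_any cv;
    std::queue<int> work;

    // std::jthread passes a std::stop_token to the callable, and its
    // destructor requests stop and joins automatically
    std::jthread worker([&](std::stop_token stoken) {
        while (true) {
            std::unique_lock lock(mtx);
            // the stop_token-aware wait also wakes on request_stop()
            if (!cv.wait(lock, stoken, [&] { return !work.empty(); }))
                return;  // stop requested while waiting
            int item = work.front();
            work.pop();
            lock.unlock();
            std::cout << "processed " << item << '\n';
        }
    });

    {
        std::lock_guard lock(mtx);
        work.push(42);
    }
    cv.notify_one();
    // worker's destructor calls request_stop() and join() for us
}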
Designing Our Thread Pool: The Architecture
I think at this point it's worth explaining the purpose of each thing we introduce; a reader who isn't experienced in multithreading could otherwise get overwhelmed.
Core Components
- Shared task queue (what work needs to be done)
- Worker threads (who does the work)
- Synchronization primitives (how they help coordinate concurrent jobs safely)
- Visual: arch diagram with labeled components?
Design Decisions
- Why a shared queue? (simplicity, fairness)
- How many threads? (hardware_concurrency; see the one-liner after this list)
    - deeper problem though - out of scope for this article? can still do experiments and visualise the effect of different thread counts
- FIFO vs LIFO task ordering
- priority / scheduling important with its own problems.
- problem dependent - pizzas fifo most likely. can’t think otherwise
- How to handle shutdown gracefully
    - cv approach is cool - wonder if there are options other than cv/stop_token
    - worth researching more?
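One-liner for the thread-count default, assuming only the standard library: hardware_concurrency() is just a hint and may legally return 0.

#include <algorithm>
#include <cstddef>
#include <thread>

// hardware_concurrency() may return 0 when the count is unknown;
// fall back to a single thread in that case
const std::size_t default_threads =
    std::max<std::size_t>(std::size_t{1}, std::thread::hardware_concurrency());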
The Task Wrapper Challenge
- Problem: Storing different function types in one queue
- Solution: a type-erased task
- Code: simple task_wrapper implementation? Could reference C++ Concurrency in Action & C++ Software Design (Klaus Iglberger)
- Why std::function isn't enough for type erasure here (move-only semantics) - elaborate??
#include <functional>
#include <future>
#include <queue>

// std::packaged_task wraps the callable together with a std::promise,
// so it is move-only: copying it would duplicate the promise
std::packaged_task<int()> task([]{ return 1; });

// std::function requires a copyable callable, but packaged_task is
// move-only --> this line does not compile:
// std::function<void()> func = std::move(task);  // error

// and even where std::function works, the return value is lost:
std::queue<std::function<void()>> tasks;
tasks.push([]{ return 42; });  // compiles, but 42 goes nowhere
Potential solution: create a move-only wrapper
#include <memory>
#include <type_traits>
#include <utility>

class task_wrapper {
    // type-erasure base: any callable hides behind this interface
    struct callable_base {
        virtual ~callable_base() = default;
        virtual void call() = 0;
    };
    template<typename F>
    struct callable_impl : callable_base {
        F func;
        explicit callable_impl(F f) : func(std::move(f)) {}
        void call() override { func(); }
    };
    std::unique_ptr<callable_base> impl;
public:
    template<typename F>
    task_wrapper(F&& f)
        : impl(std::make_unique<callable_impl<std::decay_t<F>>>(std::forward<F>(f))) {}
    // move-only by design: the unique_ptr forbids copies, which is the point
    task_wrapper(task_wrapper&&) = default;
    task_wrapper& operator=(task_wrapper&&) = default;
    void operator()() { impl->call(); }
};
// use packaged_task correctly this time...
std::packaged_task<int()> task([]{ return 42; });
auto fut = task.get_future();
std::queue<task_wrapper> tasks;

// lambda call operators are const by default, so mark it mutable:
// packaged_task::operator() modifies internal state (the stored promise)
tasks.emplace([task = std::move(task)]() mutable { task(); });

// a worker would pop and run the task; simulate that here
task_wrapper work = std::move(tasks.front());
tasks.pop();
work();

// ...retrieve the result as normal
int result = fut.get();  // 42
Implementation: Building the Thread Pool
The Class Structure
#include <condition_variable>
#include <cstddef>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
private:
    std::mutex mtx_;                    // guards tasks_
    std::condition_variable_any cv_;    // _any: has the stop_token-aware wait
    std::vector<std::jthread> workers_; // request stop + join automatically
    std::queue<task_wrapper> tasks_;    // type-erased, move-only tasks
    void worker_loop(std::stop_token stoken);
public:
    explicit ThreadPool(std::size_t num_threads);
    template<typename F> auto submit(F&& f) -> std::future<decltype(f())>;
    void run_pending_task();
};
- Explanation of each member variable
- Why these specific types?
Constructor: Spawning Workers
ThreadPool::ThreadPool(std::size_t num_threads) {
    workers_.reserve(num_threads);
    for (std::size_t i = 0; i < num_threads; i++) {
        // std::jthread passes a stop_token to the callable; stop is
        // requested automatically when the jthread is destroyed
        workers_.emplace_back([this](std::stop_token stoken) {
            worker_loop(stoken);
        });
    }
}
- Step-by-step breakdown
- Lambda capture explained
- Why emplace_back vs push_back
The Worker Loop: The Heart of the Pool
void ThreadPool::worker_loop(std::stop_token stoken) {
    while (!stoken.stop_requested()) {
        std::unique_lock<std::mutex> lock(mtx_);
        // wakes when there is work OR when stop is requested
        cv_.wait(lock, stoken, [this] { return !tasks_.empty(); });
        if (stoken.stop_requested()) return;
        if (tasks_.empty()) continue;  // defensive: nothing to do
        task_wrapper task = std::move(tasks_.front());
        tasks_.pop();
        lock.unlock();  // never run user code while holding the lock
        task();
    }
}
- Line-by-line explanation
- Why unlock before executing task? (critical!)
- Condition variable mechanics
- Stop token for graceful shutdown
Task Submission: Adding Work
template<typename F>
auto ThreadPool::submit(F&& f) -> std::future<decltype(f())> {
    using ret_type = decltype(f());
    std::packaged_task<ret_type()> task(std::forward<F>(f));
    auto fut = task.get_future();
    {
        std::unique_lock<std::mutex> lock(mtx_);
        // mutable: packaged_task::operator() mutates the stored promise
        tasks_.emplace([task = std::move(task)]() mutable {
            task();
        });
    }
    cv_.notify_one();  // after unlocking, so the woken thread isn't blocked on mtx_
    return fut;
}
- Template mechanics explained
- std::packaged_task for result retrieval
- Perfect forwarding
- Why notify after unlocking
The Helper: run_pending_task
- Why this exists (avoid blocking main thread)
- Implementation walkthrough (sketch below)
- Use case in recursive algorithms
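A minimal sketch, assuming the members from the class above: pop one task if available and run it, otherwise yield so the caller doesn't spin.

void ThreadPool::run_pending_task() {
    std::unique_lock<std::mutex> lock(mtx_);
    if (tasks_.empty()) {
        lock.unlock();
        std::this_thread::yield();  // nothing queued; give up the timeslice
        return;
    }
    task_wrapper task = std::move(tasks_.front());
    tasks_.pop();
    lock.unlock();  // same rule as the worker loop: run outside the lock
    task();
}

This lets a thread that is waiting on a future make progress by executing queued work itself instead of blocking, which is what recursive algorithms need.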
Why this section:
- Core learning content - most time spent here
- Incremental complexity (constructor → loop → submit)
- Explains “why” for every “what”
- Code comments are teaching tools
- Builds muscle memory through repetition
Putting It to Work: The Pizza Kitchen Revisited
The Code
ThreadPool pool(8);
std::vector<std::future<void>> futures;
for (const auto& order : orders) {
futures.push_back(pool.submit([order]() {
make_pizza(order);
}));
}
for (auto& future : futures) {
future.wait();
}
The Results
- Sequential: ~10,000ms
- Thread Pool (8 threads): ~1,500ms
- Speedup: 6.7x
- Visual: Bar chart comparing execution times
- Why not 8x? (overhead, Amdahl’s law)
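Rough arithmetic, assuming the numbers above: Amdahl's law gives speedup = 1 / (s + (1 - s)/N) for serial fraction s on N threads. Solving 1/6.7 ≈ s + (1 - s)/8 gives s ≈ 2.8%, so even a few percent of serial work (queue locking, task dispatch) is enough to eat the missing 1.3x.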
Observing the Threads
- How to see threads in action (debugger, htop)
- CPU utilization before/after
- Visual: CPU usage graph
Understanding the Bottleneck: Lock Contention
The Problem
- All threads compete for one mutex
- Condition variable causes kernel-level blocking
- Visual: Timeline showing threads waiting for lock
Measuring Contention
- How to detect lock contention (profiling tools)
- Expected vs actual speedup
- When does it break down? (16+ threads)
The Trade-off
- Simplicity vs scalability
- When is this design good enough?
- Foreshadowing: “There’s a better way…” (work stealing)
Common Pitfalls and How to Avoid Them
Deadlocks
- Example: Holding the lock while executing a task (sketch below)
- How to debug
- Prevention strategies
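A minimal sketch of the trap, assuming a buggy variant of the worker loop that never calls lock.unlock() before task(): a task that submits more work needs the same mutex its own worker is still holding.

// suppose the worker loop forgot lock.unlock() before task():
// the nested submit() below then locks mtx_ again on the same
// thread --> deadlock, since std::mutex is not recursive
void trigger_deadlock(ThreadPool& pool) {
    pool.submit([&pool] {
        pool.submit([] { /* more work */ });  // blocks forever on mtx_
    });
}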
Exception Safety
- What happens if a task throws?
- std::packaged_task stores the exception in the shared state
- Still need to call .get() on the future to see it (sketch below)
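A minimal sketch, assuming a pool named pool from this article: packaged_task catches whatever the callable throws, and future::get() rethrows it.

#include <stdexcept>

auto fut = pool.submit([]() -> int {
    throw std::runtime_error("oven on fire");
});
try {
    fut.get();  // the stored exception is rethrown here
} catch (const std::runtime_error& e) {
    // handle / log; without calling .get(), the failure stays invisible
}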
Lifetime Issues
- Thread pool must outlive submitted tasks
- Dangling references in lambdas
- Use shared_ptr when needed (sketch below)
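A minimal sketch of the dangling-reference trap and the shared_ptr fix; Order and this make_pizza overload are hypothetical.

#include <memory>

void take_order(ThreadPool& pool) {
    Order local{/* ... */};
    // BAD: captures the local by reference; the task may run after
    // take_order() has returned, leaving a dangling reference:
    // pool.submit([&local] { make_pizza(local); });

    // GOOD: a shared_ptr copy in the capture keeps the Order alive
    // until the last task holding it finishes
    auto order = std::make_shared<Order>(local);
    pool.submit([order] { make_pizza(*order); });
}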
Forgetting to Wait
- Fire-and-forget pitfall
- Destructor races
- Always join or wait
Extending the Design: Production Considerations
What’s Missing?
- Priority queues? not sure tbh - might be better used in a scheduling article - could even go into os concepts too
- Task cancellation
- Dynamic thread count
- Exception logging
- Metrics/observability
When to Use This Design
- Independent tasks
- Uniform task duration
- Low thread count (less than 8)
- Simplicity matters
When to Look Elsewhere
- Recursive algorithms (work stealing better)
- Very high thread counts
- I/O-bound workloads (async I/O better)
Key Takeaways and Next Steps
What We Learned
- Thread pools reuse threads for efficiency
- Shared queue + condition variable = simple coordination
- Type erasure enables heterogeneous task storage
- Lock contention is the main bottleneck
Exercises for the Reader
- Modify to support priority tasks
- Add task cancellation
- Implement dynamic thread count
- Add performance metrics
Coming Next
- Teaser for Blog Post 2: Work Stealing
- “What if threads could help each other?”
- Preview of 10-100x speedup for recursive algorithms
Complete Code Listing
Content:
- Full ThreadPool class
- Full task_wrapper class
- Pizza kitchen benchmark code
- Build instructions (CMake)
Why this section:
- Readers can copy-paste and experiment
- Removes friction to trying it out
- Serves as reference during exercises
Further Reading (Appendix)
Content:
- C++ Concurrency in Action (Anthony Williams)
- C++ threading documentation (maybe overkill - where would the reader even start?)
- Herb Sutter’s talks on concurrency
- Intel TBB documentation
- Links to other easy-to-use pools? e.g. Boost, std::async (though the programmer has little control)?
visuals
- Pizza kitchen timeline (sequential vs parallel)
- Process/thread memory diagram
- Thread pool architecture diagram
- Worker thread state machine
- Lock contention visualisation; profiling visuals would be cool too
- Performance bar chart (sequential vs parallel)
- CPU utilization graph
- Task flow diagram (submit → queue → execute)
Other notes about Concurrency and/or C++
- 🌲 Part 2. Work Stealing Thread Pools (C++20)
- 🌿 Developing a deep learning framework
- 🌱 MPMC Queue
- 🌱 C++ low-latency design patterns
- 🌱 Atomics
- 🌿 SPSC Thread-Safe Queue
- 🌿 Implementing STL's std::shared_ptr
- 🌿 Implementing STL's std::unique_ptr
- 🌿 Implementing STL's std::vector
- 🌿 Type Erasure in C++