The authors introduce a complete, highly accessible pattern language that will help any experienced developer "think parallel"-and start writing effective parallel code almost immediately. Instead of formal theory, they deliver proven solutions to the challenges faced by parallel programmers, and pragmatic guidance for using today's parallel APIs in the real world. Coverage includes:. Patterns have helped thousands of programmers master object-oriented development and other complex programming technologies.
With this book, you will learn that they're the best way to master parallel programming too. Stay ahead with the world's most comprehensive technology and business learning platform. With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more. Start Free Trial No credit card required. Patterns for Parallel Programming by Berna L. Massingill, Beverly A. Sanders, Timothy G. Proven design patterns are another source of help. This guide introduces you to the most important and frequently used patterns of parallel programming and provides executable code samples for them, using PPL.
When thinking about where to begin, a good place to start is to review the patterns in this book. See if your problem has any attributes that match the six patterns presented in the following chapters. If it does, delve more deeply into the relevant pattern or patterns and study the sample code. Attention impatient readers: you can skip ahead to the table of patterns and when they can be used. Writing parallel programs has the reputation of being hard, but help has arrived.
- Dealing With Depression: In 12 Step Recovery (Fellow travelers series);
- WaveFront Strategies!
- Patterns for Parallel Programming [Book];
- Counter-Movements in the Sciences: The Sociology of the Alternatives to Big Science.
- "Parallel Design Patterns and Program Performance" by Yu Zhao.
- System 80+ Standard [nucl. powerplnt] Design - Vol 18.
- Software design pattern?
The Importance of Potential Parallelism Declaring the potential parallelism of your program allows the execution environment to run the program on all available cores, whether one or many. The patterns in this book are ways to express potential parallelism. If you correctly structure your code, the run-time environment can automatically adapt to the workload on a particular computer. This is why the patterns in this book only express potential parallelism.
They do not guarantee parallel execution in every situation. It deserves some explanation. Some parallel applications can be written for specific hardware. For example, creators of programs for a console gaming platform have detailed knowledge about the hardware resources that will be available at run time. They know the number of cores and the details of the memory architecture in advance.
The game can be written to exploit the exact level of parallelism provided by the platform. Complete knowledge of the hardware environment is also a characteristic of some embedded applications, such as industrial process control. The life cycle of such programs matches the life cycle of the specific hardware they were designed to use.
In contrast, when you write programs that run on general-purpose computing platforms, such as desktop workstations and servers, there is less predictability about the hardware features. You may not always know how many cores will be available. You also may be unable to predict what other software could be running at the same time as your application. In the past, programmers assumed that their applications would automatically run faster on later generations of hardware. You could rely on this assumption because processor clock speeds kept increasing.
With multicore processors, clock speeds on newer hardware are not increasing as much as they did in the past. Instead, the trend in processor design is toward more cores. If you want your application to benefit from hardware advances in the multicore world, you need to adapt your programming model. You should introduction expect that the programs you write today will run on computers with many more cores within a few years. Finally, you must plan for these contingencies in a way that does not penalize users who might not have access to the latest hardware.
You want your parallel application to run as fast on a single-core computer as an application that was written using only sequential code. In other words, you want scalable performance from one to many cores. Allowing your application to adapt to varying hardware capabilities, both now and in the future, is the motivation for potential parallelism. For many common scenarios, the speed of the loop will be approximately proportional to the number of cores.
A well-written parallel program runs at approximately the same speed as a sequential program when there is only one core available. Decomposition, Coordination, and Scalable Sharing The patterns in this book contain some common themes. The patterns described in this guide are design patterns. You can apply them when you design and implement your algorithms and when you think about the overall structure of your application. Although the example applications are small, the principles they demonstrate apply equally well to the architectures of large applications.
Understanding Tasks Tasks are sequential operations that work together to perform a larger operation. If the chosen granularity is too fine, the overhead of managing tasks will dominate. In general, tasks should be as large as possible, but they should remain independent of each other, and there should be enough tasks to keep the cores busy. You may also need to consider the heuristics that will be used for task scheduling. Tasks are sequential units of work. Tasks should be large, independent, and numerous enough to keep all cores busy.
Tasks and threads take very different approaches to scheduling. Tasks are much more compatible with the concept of potential parallelism than threads are. While a new thread immediately introduces additional concurrency to your application, a new task introduces only the potential for additional concurrency. Decomposing a problem into tasks requires a good understanding of the algorithmic and structural aspects of your application.
An example of these guidelines at work can be seen in a parallel ray tracing application. A ray tracer constructs a synthetic image by simulating the path of each ray of light in a scene. The individual ray simulations are a good level of granularity for parallelism. Breaking the tasks into smaller units, for example, by trying to decompose the ray simulation itself into independent tasks, only adds overhead, because the number of ray simulations is already large enough to keep all cores occupied.
If your tasks vary greatly in duration, you generally want more of them in order to fill in the gaps. Another advantage to grouping work into larger and fewer tasks is that larger tasks are often more independent of each other than are smaller tasks. Larger tasks are less likely than smaller tasks to share local variables or fields. Unfortunately, in applications that rely on large mutable object graphs, such as applications that expose a large object model with many public classes, methods, and properties, the opposite may be true.
In these cases, the larger the task, the more chance there is for unexpected sharing of data or other side effects. The overall goal is to decompose the problem into independent tasks that do not share data, while providing a sufficient number of tasks to occupy the number of cores available. When considering the number of cores, you should take into account that future generations of hardware will have more cores.
Tasks that are independent of one another can run in parallel, while some tasks can begin only after other tasks complete. Constraints can arise from control flow the steps of the algorithm or data flow the availability of inputs and outputs. Various mechanisms for coordinating tasks are possible. The way tasks are coordinated depends on which parallel pattern you use. Regardless of the mechanism you choose for coordinating tasks, in order to have a successful design, you must understand the dependencies between tasks. The problem is that when a program is running in parallel, different parts of the program may be racing against each other to perform updates on the same memory location.
The result of such unintended data races can be catastrophic. The solution to the problem of data races includes techniques for synchronizing threads. You may already be familiar with techniques that synchronize concurrent threads by blocking their execution in certain circumstances. Examples include locks, atomic compare-and-swap operations, and semaphores. All of these techniques have the effect of serializing access to shared resources. Although your first impulse for data sharing might be to add locks or other kinds of synchronization, adding synchronization reduces the parallelism of your application.
Every form of synchronization is a form of serialization. Your tasks can end up contending over the locks instead of doing the work you want them to do. Programming with locks is also error-prone. These techniques include the use of immutable, readonly data, sending messages instead of updating shared variables, and introducing new steps in your algorithm that merge local versions of mutable state at appropriate checkpoints.
Techniques for scalable sharing may involve changes to an existing algorithm. Conventional object-oriented designs can have complex and highly interconnected in-memory graphs of object references. As a result, traditional object-oriented programming styles can be very difficult to adapt to scalable parallel execution.
Your first impulse might be to consider all fields of a large, interconnected object graph as mutable shared state, and to wrap access to these fields in serializing locks whenever there is the possibility that they may be shared by multiple tasks. Unfortunately, this is not a scalable approach to sharing. Locks can often negatively affect the performance of all cores. Locks force cores to pause and communicate, which takes time, and they introduce serial regions in the code, which reduces the potential for parallelism.
As the number of cores gets larger, the cost of lock contention can increase. As more and more tasks are added that share the same data, the overhead associated with locks can dominate the computation. In addition to performance problems, programs that rely on complex synchronization are prone to a variety of problems, including deadlock. Deadlock occurs when two or more tasks are waiting for each other to release a lock. Most of the horror stories about parallel programming are actually about the incorrect use of shared mutable state or locking protocols.
Scalable sharing may involve changes to your algorithm. Adding synchronization locks can reduce the scalability of your application. This book uses synchronization sparingly. You should, too. Locks can be thought of as the goto statements of parallel programming: they are error prone but necessary in certain situations, and they are best left, when possible, to compilers and libraries.
First and foremost, the code still needs to be correct. Use patterns. This is a particularly tempting approach when you parallelize an existing sequential application. Although this may give you some initial improvements in performance, it has many pitfalls, such as those described in the previous section. As a result, traditional profile-and-optimize techniques may not produce the best results.
A far better approach is to understand your problem or application and look for potential parallelism across the entire application as a whole. What you discover may lead you to adopt a different architecture or algorithm that better exposes the areas of potential parallelism in your application. Instead, prepare your program for parallel execution by making structural changes. Techniques for decomposition, coordination, and scalable sharing are interrelated. You need to consider all of these aspects together when choosing your approach for a particular application. After reading the preceding description, you might complain that it all seems vague.
How specifically do you divide your problem into tasks? Exactly what kinds of coordination techniques should you use? Questions like these are best answered by the patterns described in this book. Patterns are a true shortcut to understanding. As you begin to see the design motivations behind the patterns, you will also develop your intuition about how the patterns and their variations can be applied to your own applications. The following section gives more details about how to select the right pattern. The Parallel Loop pattern Chapter 2 Do you have distinct operations with well-defined control dependencies?
Are these operations largely free of serializing dependencies? The Parallel Task pattern Chapter 3 Do you need to summarize data by applying some kind of combination operator? Do you have loops with steps that are not fully independent? The Parallel Aggregation pattern Chapter 4 Does the ordering of steps in your algorithm depend on data flow constraints?
The Futures pattern Chapter 5 Does your algorithm divide the problem domain dynamically during the run? Do you operate on recursive data structures such as graphs? The Dynamic Task Parallelism pattern Chapter 6 Does your application perform a sequence of operations repetitively? Does the input data have streaming characteristics? Does the order of processing matter? The Pipeline pattern Chapter 7 Parallel loops apply an independent operation to multiple inputs simultaneously.
Parallel tasks allow you to establish parallel control flow in the style of fork and join. Parallel aggregation introduces special steps in the algorithm for merging partial results. Futures make the data flow dependencies between tasks explicit. This pattern is also referred to as the Task Graph pattern. This pattern takes a divide-andconquer approach and spawns new tasks on demand. Pipelines consist of components that are connected by queues, in the style of producers and consumers. All the components run in parallel even though the order of inputs is respected. One way to become familiar with the possibilities of the six patterns is to read the first page or two of each chapter.
This will give you an overview of approaches that have been proven to work in a wide variety of applications. Then go back and more deeply explore patterns that may apply in your situation. This book makes a distinction between the two terms. It usually refers to the existence of multiple threads of execution that may each get a slice of time to execute before being preempted by another thread, which also gets a slice of time.
Concurrency is necessary in order for a program to react to external stimuli such as user input, devices, and sensors. Operating systems and games, by their very nature, are concurrent, even on one core. With parallelism, concurrent threads execute at the same time on multiple cores. Parallel programming focuses on improving the performance of applications that use a lot of processor power and are not constantly interrupted when multiple cores are available.
The goals of concurrency and parallelism are distinct. The main goal of concurrency is to reduce latency by never allowing long periods of time to go by without at least some computation being performed by each unblocked thread. In other words, the goal of concurrency is to prevent thread starvation. Concurrency is required operationally. For example, an operating system with a graphical user interface must support concurrency if more than one window at a time can update its display area on a single-core computer.
Parallelism, on the other hand, is only about throughput. Its goal is to maximize processor usage across all available cores; to do this, it uses scheduling algorithms that are not preemptive, such as algorithms that process queues or stacks of work to be done. This may, at first, seem counterintuitive. Figure 1 illustrates this. Even with fewer cores, you can see that the expected speed-up is not linear.
Figure 2 illustrates this. The elapsed time spent in sequential processing is constant. The illustration also shows why you might be satisfied with a 2x speed-up on a four-core computer for actual applications, as opposed to sample programs. The important question is always how scalable the application is. Scalability depends on the amount of time spent doing work that is inherently sequential in nature. Performance can influence the mix of application features. As the number of cores increases, the overhead incurred by accessing shared memory also increases.
Also, parallel algorithms may include overhead for coordination that would not be necessary for the sequential case. Profiling tools, such as the Visual Studio Concurrency Visualizer, can help you understand how effective your use of parallelism is. In summary, because an application consists of parts that must run sequentially as well as parts that can run in parallel, the application overall will rarely see a linear increase in performance with a linear increase in the number of cores, even if certain parts of the application see a near linear speed-up.
A Few Tips Always try for the simplest approach. These libraries were written by experts and have been thoroughly tested; they help you avoid many of the common problems that arise in parallel programming. While this may produce some improvement, it does not necessarily give you the best results. If you do share data, use one of the containers provided by the API you are using, such as a shared queue. Raise the level of abstraction from threads to tasks in your applications.
Exercises 1. What are some of the tradeoffs between decomposing a problem into many small tasks and decomposing it into larger tasks? What is the maximum potential speed-up of a program that spends 10 percent of its time in sequential processing when you move it from one to four cores? What is the difference between parallelism and concurrency?
For More Information If you are interested in better understanding the terminology used in the text, refer to the glossary at the end of this book. The design patterns presented in this book are consistent with classifications of parallel patterns developed by groups in both industry and academia.
In the terminology of these groups, the patterns in this book would be considered to be algorithm or implementation patterns. Classification approaches for parallel patterns can be found in the book by Mattson, et al. This book attempts to be consistent with the terminology of these sources. In cases where this is not possible, an explanation appears in the text. Duffy, Joe. Concurrent Programming on Windows, AddisonWesley, Mattson, Timothy G. Sanders, and Berna L. Patterns for Parallel Programming. Addison-Wesley, Steps often take place at the same time, in parallel. Sometimes, two steps take place in the opposite order than they would if the loop were sequential.
This is because it can be hard to tell if the steps are actually independent of each other. It takes practice to learn how to recognize when one step is dependent on another step. Sometimes, using this pattern on a loop with dependent steps causes the program to behave in a completely unexpected way, and perhaps to stop responding.
Other times, it introduces a subtle bug that only appears once in a million runs. Instead, the run-time environment executes the steps of the loop at the same time on as many cores as it can. The loop works correctly no matter how many cores are available. If there is only one core and assuming the work performed by each iteration is not too small, then the performance is close to perhaps within a few percentage points of the sequential equivalent.
If there are multiple cores, performance improves; in many cases, performance improves proportionately with the number of cores. The steps must not communicate by writing to shared variables. The first argument is the lowest index of the loop. The second argument is the exclusive upper bound, or the largest index plus one.
Lambda expressions denote function objects that can capture variables from their enclosing scope. Unlike a sequential loop, some higher-valued indices may be processed before some lower-valued indices.
- Build a Backyard Natural Grass Putting Green.
- design patterns on parallel programming.
- Designing Online Learning: A Primer for Librarians.
The first argument is an iterator that references the position of the first element in the range to be operated on. The second argument is an iterator that references the position one past the final element in the range. You must choose the right granularity. Robust exception handling is an important aspect of parallel loop processing.
Check carefully for dependencies between loop iterations! You must be extremely cautious when getting data from accessors. Large object models are known for sharing mutable state in unbelievably convoluted ways. By default, the degree of parallelism that is, how many iterations run at the same time in hardware depends on the number of available cores.
How much faster depends on the kind of work your loop does. If the body of a loop writes to a shared variable, there is a loop body dependency. This is a common case that occurs when you are aggregating values. Here is an example, where total is shared across iterations. Any variable that is declared outside of the scope of the loop body is a shared variable. Shared references to types such as classes or arrays will implicitly allow all fields or array elements to be shared.
Parameters that are passed by reference or by pointer result in shared variables, as do variables captured by reference in a lambda expression. For example, an accessor method named GetParent is likely to refer to global state. GetParent is a reference to a single shared object. This is shown in the following code example. The loop body references data[i] and data[i — 1]. Your best options are to look elsewhere in your program for opportunities for parallelism or to analyze your algorithm and see if it matches some of the advanced parallel patterns that occur in scientific computing.
Parallel scan and parallel dynamic programming are examples of these patterns. Fabrikam Shipping extends credit to its commercial accounts. It uses customer credit trends to identify accounts that might pose a credit risk. Each customer account includes a history of past balance-due amounts. To identify at-risk accounts, Fabrikam uses statistical trend analysis to calculate a projected credit balance for each account. In the application, a top-level loop iterates over customers in the account repository. The body of the loop fits a trend line to the balance history, extrapolates the projected balance, compares it to the credit limit, and assigns the warning flag if necessary.
Only you can do that. The Fit method is a utility function that uses the statistical least squares method to create a trend line from an array of numbers. The Fit method is a pure function. The prediction is a three-month projection based on the trend. If a prediction is more negative than the overdraft limit credit balances are negative numbers in the accounting system , the account is flagged for review. Timing numbers vary; you may want to run the online samples on your own computer.
Variations The credit analysis example shows a typical way to use parallel loops, but there can be variations. This section introduces some of the most important ones. Breaking out of Loops Early Breaking out of loops is a familiar part of sequential iteration. However, you can break out of a parallel loop by canceling the task group that contains it. The example code shows that if you want to break out of a parallel loop early, you need to create a task group object and execute the parallel loop within that task group.
You should keep in mind that parallel loops may execute steps out of order. Exception Handling Throwing an unhandled exception prevents new iterations from starting. If the body of a parallel loop throws an unhandled exception, the parallel loop no longer begins any new steps. By default, iterations that are executing at the time of the exception, other than the iteration that threw the exception, will complete. After they finish, the parallel loop will throw an exception in the context of the thread that invoked it.
Because the loop runs in parallel, there may be more than one exception. If more than one exception has occurred, the parallel loop will nondeterministically choose one of the exceptions to throw. The remaining exceptions will not be externally observable. The reason for this is that there are two types of overhead that are introduced when processing a loop: the cost of managing worker threads and the cost of invoking the function object.
In most situations, these costs are negligible, but with very small loop bodies they can be significant. Iterating with step sizes greater than one lets you embed a sequential loop within your parallel loop. Each iteration of the outer parallel loop handles a range of indices instead of individual indices. By grouping iterations into ranges, you can avoid some of the overhead of a normal parallel loop.
When the amount of work in each iteration is large or Consider using partitioning strategies when you have many iterations that each perform a small amount of work. Generally, you would only use the more complicated syntax after profiling or in the case where loop bodies are extremely small and the number of iterations large. The number of ranges that you use will normally depend on the number of cores in your computer. A good default number of ranges is approximately three to ten times the number of cores.
When the amount of work for each item is small, the cost of load balancing can become significant. Controlling the Degree of Parallelism You can control the maximum number of active threads used concurrently by a parallel loop. Reducing the degree of parallelism is often done in performance testing to simulate less capable hardware. The term degree of parallelism refers to the number of cores that are used to process iterations simultaneously. The degree of parallelism is automatically managed by the underlying components of the system. They highlight issues that need to be carefully considered as well as problem areas.
Here are some issues to think about when you implement a parallel loop. Hidden Loop Body Dependencies Incorrect analysis of loop dependencies is a frequent source of software defects. Be careful that all parallel loop bodies do not contain hidden dependencies. The case of trying to share an instance of a class which is not thread safe across parallel iterations is an example of a subtle dependency. You should also be careful when you share state by using reference variables from the enclosing lexical scope in a lambda expression. When loop bodies are not fully independent of each other, it may still be possible to use parallel loops.
However, in these cases, you must make sure that all shared variables are protected and synchronized, and you must understand the performance characteristics of any synchronization you add. Adding synchronization can greatly reduce the performance of a parallel program, but forgetting to add necessary synchronization can result in a program with bugs that are catastrophic and difficult to reproduce.
In this case, the overhead required by the parallel loop itself will dominate the calculation. Cooperative blocking operations should be performed infrequently within the parallel loop. Which of the following problems could be solved using the parallel loop techniques taught in this chapter? Sorting an in-memory array of numbers with a million elements. Adding together all the numbers in one collection to obtain a single sum. Adding numbers from two collections pair-wise to obtain a collection of sums.
Counting the total number of occurrences of each word in a collection of text files. Finding the word that occurs most frequently in each file in a collection of text files. Choose a suitable problem from Exercise 1. Code two solutions, using a sequential loop and a parallel loop. Use command line options to independently vary the number of iterations the number of accounts and the amount of work done in the body of each iteration the number of months in the credit history.
Record the execution times reported by the program for all three versions, using several different combinations of numbers of accounts and months. Repeat the tests on different computers with different numbers of cores and with different execution loads from other applications. The book by Mattson, et al.
Mattson, T. Sanders, and B. Messmer, B. This is data parallelism. Chapter 3 explains what happens when there are distinct asynchronous operations that can run simultaneously. This is task parallelism. Data parallelism and task parallelism are two ends of a spectrum.
Data parallelism occurs when a single operation is applied to many inputs. Task parallelism uses multiple operations, each with its own input. Scheduling is an important aspect of parallel tasks. Instead, they are placed in a work queue. Tasks run when their associated task scheduler removes them from the queue, usually as processor resources become available. In this way, tasks embody the concept of potential parallelism that was introduced in Chapter 1. Another important aspect of task-based applications is how they handle exceptions.
In PPL, an unhandled exception that occurs during the execution of a task is deferred for later observation. At that time, the exception is 27 Parallel tasks are asynchronous operations that can run at the same time. This approach is also known as task parallelism. This allows you to use the same exception handling approach in parallel programs that you use in sequential programs. The Basics Each task is a sequential operation; however, tasks can often run in parallel. This means that neither method writes to memory locations or files that the other method might read.
This is shown in the following code. The function creates a new task group with new parallel tasks for each lambda expression in its argument list. Depending on the current work load and system configuration, tasks might be scheduled to run one after another, or they might run at the same time. If more than one of the work functions throws an exception, the runtime chooses one of the exceptions to be rethrown. The remaining exceptions will not be externally observed.
You can reproduce this functionality by creating a task group object and calling its run and wait methods. In other words, the argument can be any object that supports the function call operator with the signature void operator. It is also possible to combine the run and wait steps into a single operation. An Example An example of task parallelism is an image processing application where images are created with layers.
Separate images from different sources are processed independently and then combined with a process known as alpha blending. This process superimposes semitransparent layers to form a single image. The source images that are combined are different, and different image processing operations are performed on each of them.
Use the wait method to block the current context until all of the tasks in a task group have completed. Lambda expressions and pointers to static functions do not require explicit deletion and are therefore easier to use than class-type functors. In the example, there are only two source images, and the operations are simple: conversion to gray scale and rotation. In a more realistic example, there might be more source images and more complicated operations.
The SetToGray and Rotate methods are entirely independent of each other. This means that you can execute them in separate tasks. If two or more cores are available, the tasks might run in parallel, and the image processing operations might complete in less elapsed time than a sequential version would. This call does not return until all of the tasks complete. It then calls task group run methods to create and run two tasks that execute SetToGray and Rotate.
Coordinating Tasks with Cooperative Blocking The classes and functions in the Concurrency namespace implement a task-coordination feature known as cooperative blocking.
With cooperative blocking, your task can suspend its execution and relinquish control to the task scheduler until a specific condition is met. This usually occurs when another task performs an action that the first task needs. Cooperative blocking provides a robust and highly programmable way to coordinate tasks. Cooperative blocking can improve the performance of a parallel application by enabling fuller use of processor resources.
A cooperatively blocked task represents an opportunity for the task scheduler to apply processor resources to other tasks. If you are going to use PPL effectively, it is important to understand the interaction of coop- Cooperative blocking provides a robust and highly programmable way to coordinate tasks. Using lower-level blocking operations of the operating system is sometimes called noncooperative blocking. If you do any type of blocking, you will need to consider whether cooperative or noncooperative blocking is most appropriate. In general, cooperative blocking has the advantage of better coordination with the task scheduler.
The following table lists the most important operations that the runtime considers to be cooperative blocking operations.
CSS Parallel Programming in Grid and Cloud - Autumn
Note that these operations are not guaranteed to block every time they are called; instead, they have the potential to block if certain conditions exist. All of the classes are in the Concurrency namespace. Acquiring the lock may block the current task if the lock is in use by another task. Only one task at a time can possess the lock. Invoking the event::wait method may block if the event has not yet been set. This operation is used by libraries to implement new task-coordination control structures.
It is not normally used by application code. Context::Yield method This method suspends the current thread so that another worker thread may be given the opportunity to resume execution. Although Yield potentially suspends execution of the current thread, it never results in a blocked context. Therefore, Yield can only loosely be considered a blocking operation.
The asend function is expected to be nonblocking in future versions of the runtime. It may block if the messaging block does not yet have a value to provide. In such cases you must ensure that all possible code paths are deadlock free. Tasks in the task group that have not yet started are not allowed to run. New tasks that are added to the task group by the run method are ignored after the cancel method has been called. Tasks in the task group that have started before cancellation is signaled continue to run, but their behavior may change.
If a task of a task group that is being canceled invokes any function in the Concurrency namespace, an exception may be thrown. The specific set of functions in the Concurrency namespace that will throw exceptions when a task group is undergoing cancellation and the specific set of circumstances that will cause such exceptions to be thrown are intentionally not specified. Therefore, your application logic should not depend on whether any particular library function throws or does not throw an exception when a task group is being canceled.
In the current version of PPL, canceling a task group with cooperatively or noncooperatively blocked tasks may result in deadlock in certain cases. For example, consider the case where you create two tasks in a task group that share an instance E of the cooperatively blocking event class. One of the tasks calls the wait method of event E, and the other task calls the signal method of event E.
Cancellation will automatically propagate across task groups in certain situations. For example, cancellation will be propagated if a task in task group A is waiting for the tasks of task group B to complete. The task in task group A that is waiting on task group B remains blocked until task group B has no more running tasks. Of course, if task group B is very fast, its wait function might return normally before there is a chance to propagate a cancellation from task group A. Note that the runtime only returns the enumerated value canceled for the wait method of the top-most task group in a tree of dependent task groups.
The other stack frames will have an internal exception passed through them. For example, in an interactive GUI-based application, checking for cancellation more than once per second is probably a good idea. An application that runs in the background could poll for cancellation less frequently, perhaps every two to ten seconds.
Profile your application to collect performance data that will help you determine the best places to test for cancellation requests in your code. In particular, look for places in your code where you can poll for cancellation at evenly spaced intervals. The runtime catches the exception and records its details in internal data structures. Then, the runtime cancels the task group that contains the faulted task. Under certain conditions, the cancellation is automatically propagated to other task groups.
If there is more than one exception that is, if more than one task threw an unhandled exception , the runtime will choose one of the exceptions to rethrow. The remaining exceptions will not be observed. If you need to record all of the exceptions, you can implement custom logging or tracing code. The task group has the same behavior as if it were newly created. When a task of a task group throws an unhandled exception, the runtime cancels that task group.
Speculative execution occurs when you perform an operation in anticipation of a particular result. For example, you might predict that the current computation will produce the value If the first computation results in something other than 42, you can restart the second operation using the correct value as the input. Another example of speculative execution occurs when you execute more than one asynchronous operation in parallel but need just one of the operations to complete before proceeding.
Imagine, for example, that you use three different search tasks to search for an item. In cases like this you wait for the first task to complete and usually cancel the remaining tasks. However, you should always observe any exceptions that might have occurred in any of the tasks. You want to continue if any of the three functions completes. To make this possible, the code passes a reference to the task group object to each of the work functions. Inside of the work functions, the code cancels the task group when it completes its work. The following code shows how to accomplish this inside of the SearchLeft function.
For performance reasons the code only polls at a specified number of loop iterations. Anti-Patterns Here are some things to watch out for when you use task groups. Closures can refer to variables defined outside of their lexical scope, such as local variables that were declared in a scope that contains the closure. Problems occur when you reference a variable without considering its scope.
For example, you might see 4, 4, 4, 4. The reason is that the variable i is captured by reference and shared by all the closures created by the steps of the for loop. By the time the tasks start, the value of the single, shared variable i will probably be different from the value of i when the task was created. The solution is to capture the variable by value in the appropriate scope. This version prints the numbers 1, 2, 3, 4 in an arbitrary order, as was originally intended.
The reason is that the value of the variable i is passed to the closure. Effectively, a new variable named i is instantiated with each iteration of the for loop. Unintended Propagation of Cancellation Requests Be aware that implicit parent-child relationships can influence the behavior of cancellation and exception handling, especially with libraries. If you use a library that is implemented with PPL, the API calls into that library may create task groups that are internal to the library.
You avoid cases of unintended propagation of runtime context by using a neutral, non-PPL thread context to call into any library functions that may depend on task group wait operations. For example, you could use a lightweight task to invoke library functions. A lightweight task is a lower-level type of task that does not result in the propagation of cancellation requests. They are described in Appendix A.
However, programmers often underestimate how much serializing operations can degrade performance. Well-designed applications require explicit synchronization operations less often than poorly designed applications. Design Notes This section describes some of the design considerations that were used to create the Parallel Patterns Library, along with some recommended coding practices. The missing wait exception only occurs in the normal flow of execution; it will not be thrown if you unwind the stack due to an application exception.
Therefore, you do not need an RAII wrapper that calls the wait method. However, if the task group is canceled while the tasks are running, only one of the wait functions will return the canceled status value. The other invocation will return a normal status, due to interleaving. To avoid performance bottlenecks, review your use of locks and other synchronization operations carefully. The task will not migrate among threads at run time. This is a useful guarantee because it lets you use thread-affine abstractions, such as critical sections, without having to worry, for example, that the EnterCriticalSection function will be executed in a different thread than the LeaveCriticalSection function.
In general, creating a new task is a much less resource-intensive operation than creating a new thread. It is possible for an application to create hundreds of thousands or even millions of tasks and still run efficiently. You may want to profile your application as you experiment with strategies for using tasks in your application.
If your tasks are too fine grained, you will incur overhead for task management that may hurt performance. For example, a task that performs a single arithmetic operation is not going to improve performance. However, if you decompose your application into too few tasks, you will not fully exploit the potential parallelism of the application.
How Tasks Are Scheduled The techniques for scheduling tasks and scheduling threads demonstrate fundamentally different goals. The operating system generally schedules threads in a way that minimizes latency. Preemptive thread scheduling allows users to interact with a busy system with very little perceived delay, even on a system with only one core. In contrast, the task scheduler tries to maximize overall throughput rather than ensure low latency for any given task.
In other words, when you decompose a computationally intensive operation into tasks that can potentially run in parallel, you want to complete the overall operation as quickly as possible without concern for the scheduling delays that any given task might experience. For task-based systems such as PPL and the underlying Concurrency Runtime, the measure of success is the speed of the overall result. The goal is to optimize the use of processor resources.
These techniques mainly focus on keeping cores busy and on an effective use of system caches. Structured task groups have lower overhead than the task groups, but there are restrictions on how structured task groups can be used. These restrictions require stack-based work functions and strict nesting of subordinate structured task groups. Although the task groups are recommended for most application programming, structured task groups are important for implementing parallel libraries.
It encapsulates a work function used by a task. Most applications will never need access to task handles; however, you must use task handles with structured task groups. Unlike lambda expressions, task handles require explicit memory management by your application. The image blender example in this chapter uses task parallelism: a different task processes each image layer.
A typical strategy in image processing uses data parallelism: the same computation processes different portions of an image or different images. Is there a way to use data parallelism in the image blender example? If there is, what are the advantages and disadvantages, compared to the task parallelism discussed here?
Patterns for Parallel Programming
In the image blender sample, the image processing methods SetToGray and Rotate are void methods that do not return results, but they save their results by updating their second argument. Further Reading Leijen et al. Leijen, D. Schulte, and S. Arora and G. ACM, ConcRTExtras software. However, not all parallel loops have loop bodies that execute independently.
For example, a sequential loop that calculates a sum does not have independent steps. All the steps accumulate their results in a single variable that represents the sum calculated up to that point. This accumulated value is an aggregation. If you were to convert the sequential loop to a parallel loop without making any other changes, your code would fail to produce the expected result.
Parallel reads and writes of the single variable would corrupt its state. Nonetheless, there is a way for an aggregation operation to use a parallel loop. This is the Parallel Aggregation pattern. Although calculating a sum is an example of aggregation, the pattern is more general than that. It works for any binary operation that is associative. However, some implementations of the Parallel Aggregation pattern also expect the operations to be commutative. The Parallel Aggregation pattern uses unshared, local variables that are merged at the end of the computation to give the final result.
Using unshared, local variables for partial, locally calculated results makes the steps of a loop independent of each other. The Parallel Aggregation pattern is also known as the Parallel Reduction pattern because it combines multiple inputs into a single output. This is a typical sequential for loop.
In this example and the ones that follow, IsPrime is a user-provided function that determines if its argument is a prime number. The result is a count of how many prime numbers are contained in the input sequence. How can sequential accumulation be adapted for parallel processing?
You might also be tempted to wrap a critical section around the operation that increments the count variable. The critical section would prevent parallel iterations from performing conflicting reads and writes, but the performance of that approach would be much, much worse than the sequential version you are trying to optimize. The cost of synchronization would be prohibitive. In fact, programmers often underestimate the performance cost of synchronization operations. Instead, redesign the algorithm to use a two-phase approach. First, subdivide the problem into as many tasks as you have cores and calculate partial results locally on a per-core basis.
Then, once all of the per-task partial results are ready, sequentially merge the results into one final accumulated value. PPL provides a special data structure that makes it easy to create per-task local results in parallel and merge them as a final sequential step. This data structure is the combinable class. The following code examples show how to use the combinable class to implement the Parallel Aggregation pattern. To compute the initial, local values the constructor of the combinable class takes a function as an argument. The number of tasks depends on the level of concurrency available in the current context.
In this example the combination function is integer addition. The return value of the combine method is the final aggregated value. The easiest way to understand how these functions work is to compare them with their corresponding sequential operations in STL. STL provides a very simple way to express sequential aggregation with iterators. Here is an example. The first and second arguments to the accumulate function give the iteration bounds.
The third argument is the initial value of the accumulation variable, and the fourth argument is a binary reduction function that will be successively applied to the accumulation variable and to each iterated value. The job of the reduction function is to combine two input values. Here is the implementation of the reduction function, IncrementIfPrime. The combinable class assumes that the operation provided as an argument to the combine method is commutative. The following code gives an example. The first two arguments give the iteration bounds.
If the reduction is based on addition, this element will be 0. For multiplicative reduction, the identity element is 1. For reductions such as aggregate set union, the identity element is the empty set. The fourth argument is a function object that can be applied on a subrange of an iterator to produce a local partial aggregation. This example uses a functor created by instantiating the CountPrimes class. The return value of the function object is the local partial result from the first phase of the Parallel Aggregation pattern.
The fifth argument is a reduction function that will combine the partial results that have been calculated for each of the subranges. Here is the implementation of the CountPrimes class-type functor. There will be enough ranges to compensate for the effects of uneven workloads, but not so many ranges that the overhead of calculating them dominates the computation. PPL determines how many ranges to create.
In this example, the CountPrimes function object will be invoked one time for each of the ranges. It executes a sequential accumulation operation on the subrange and collects the result. Its declarative nature makes it less prone to error than other approaches, and its performance on multicore computers is competitive with them. Instead, all the synchronization occurs internally. Also, a call to the combinable::local method inside of a parallel loop adds the cost of a hash table lookup to each iteration of the loop. In general, use parallel aggregation to increase performance when iterations perform complex computations.
It arises in many other application contexts. The example is of a social network service, where subscribers can designate other subscribers as friends. The site recommends new friends to each subscriber by identifying other subscribers who are friends of friends. To limit the number of recommendations, the service only recommends the candidates who have the largest number of mutual friends. Candidates can be identified in independent parallel operations, and then candidates are ranked and selected in an aggregation operation.
Subscribers are identified by integer ID numbers.