When all branches complete, the control joins back to unify the flows from the branches. Results computed by the branches are typically read from memory and merged at the join point. Parallel regions can fork and join recursively in the same manner that divide and conquer programs split and join recursively. In this sense, fork join is the divide and conquer of parallel computing. As we will see, it is often possible to extend an existing language with support for fork-join parallelism by providing libraries or compiler extensions that support a few simple primitives.
Such extensions to a language make it easy to derive a sequential program from a parallel program by syntactically substituting the parallelism annotations with corresponding serial annotations. This in turn enables reasoning about the semantics or the meaning of parallel programs by essentially "ignoring" parallelism.
In PASL, fork join is expressed by application of the fork2 function. The function expects two arguments: one for each of the two branches. In the sample code below, the first branch writes the value 1 into the cell b1 and the second 2 into b2 ; at the join point, the sum of the contents of b1 and b2 is written into the cell j. When this code runs, the two branches of the fork join are both run to completion. The branches may or may not run in parallel i.
In general, the choice of whether or not any two such branches are run in parallel is chosen by the PASL runtime system. The join point is scheduled to run by the PASL runtime only after both branches complete. Before both branches complete, the join point is effectively blocked. Later, we will explain in some more detail the scheduling algorithms that the PASL uses to handle such load balancing and synchronization duties. In fork-join programs, a thread is a sequence of instructions that do not contain calls to fork2.
A thread is essentially a piece of sequential computation. The two branches passed to fork2 in the example above correspond, for example, to two independent threads. Moreover, the statement following the join point i. All writes performed by the branches of the binary fork join are guaranteed by the PASL runtime to commit all of the changes that they make to memory before the join statement runs.
In terms of our code snippet, all writes performed by two branches of fork2 are committed to memory before the join point is scheduled. The PASL runtime guarantees this property by using a local barrier. Such barriers are efficient, because they involve just a single dynamic synchronization point between at most two processors. In the example below, both writes into b1 and b2 are guaranteed to be performed before the print statement.
PASL provides no guarantee on the visibility of writes between any two parallel branches. In the code just above, for example, writes performed by the first branch e. Now, we have all the tools we need to describe our first parallel code: the recursive Fibonacci function.
Although useless as a program because of efficiency issues, this example is the "hello world" program of parallel computing. Let us start by considering a sequential algorithm. This function for computing the Fibonacci numbers is inefficient because the algorithm takes exponential time, whereas there exist dynamic programming solutions that take linear time.
To write a parallel version, we remark that the two recursive calls are completely independent : they do not depend on each other neither uses a piece of data generated or written by another. We can therefore perform the recursive calls in parallel. In general, any two independent functions can be run in parallel.
To indicate that two functions can be run in parallel, we use fork2. Suppose that we wish to map an array to another by incrementing each element by one. This is not a good parallel algorithm but it is not difficult to give a parallel algorithm for incrementing an array. The code for such an algorithm is given below. In the Fibonacci example, we started with a sequential algorithm and derived a parallel algorithm by annotating independent functions.
It is also possible to go the other way and derive a sequential algorithm from a parallel one. As you have probably guessed this direction is easier, because all we have to do is remove the calls to the fork2 function. The sequential elision of our parallel Fibonacci code can be written by replacing the call to fork2 with a statement that performs the two calls arguments of fork2 sequentially as follows.
The sequential elision is often useful for debugging and for optimization. It is useful for debugging because it is usually easier to find bugs in sequential runs of parallel code than in parallel runs of the same code. It is useful in optimization because the sequentialized code helps us to isolate the purely algorithmic overheads that are introduced by parallelism. By isolating these costs, we can more effectively pinpoint inefficiencies in our code. We defined fork-join programs as a subclass case of multithreaded programs.
Since, a fork-join program does not explicitly manipulate threads, it is not immediately clear what a thread refers to. To define threads, we can partition a fork-join computation into pieces of serial computations, each of which constitutes a thread. What we mean by a serial computation is a computation that runs serially and also that does not involve any synchronization with other threads except at the start and at the end. More specifically, for fork-join programs, we can define a piece of serial computation a thread , if it executes without performing parallel operations fork2 except perhaps as its last action.
When partitioning the computation into threads, it is important for threads to be maximal; technically a thread can be as small as a single instruction. A thread is a maximal computation consisting of a sequence of instructions that do not contain calls to fork2 except perhaps at the very end. We partition each invocation of this function into two threads labeled by "M" and "C" respectively. The second corresponds to the "continuation" of the fork2 , which is in this case includes no computation.
Based on this dag, we can create another dag, where each thread is replaced by the sequence of instructions that it represents. This would give us a picture similar to the dag we drew before for general multithreaded programs. Such a dag representation, where we represent each instruction by a vertex, gives us a direct way to calculate the work and span of the computation. If we want to calculate work and span ond the dag of threads, we can label each vertex with a weight that corresponds to the number of instruction in that thread.
The computation dag of a fork-join program applied to an input unfolds dynamically as the program executes. An execution of a fork-join program can generate a massive number of threads. In our example, each thread performs either a conditional, sometimes an addition and a fork operation or performs no actual computation continuation threads. To do this, we can use an online scheduling algorithm. The following is a schedule for the dag shown in this Figure assuming that each thread takes unit time.
The "async-finish" approach offers another mechanism for structured parallelism. It is similar to fork-join parallelism but is more flexible, because it allows essentially any number of parallel branches to synchronize at the same point called a "finish". Each "async" creates a separate parallel branch, which by construction must signal a "finish. As with fork-join, it is often possible to extend an existing language with support for async-finish parallelism by providing libraries or compiler extensions that support a few simple primitives.
This in turn enables reasoning about the semantics or the meaning of parallel programs by essentially "ignoring" parallelism, e. To indicate that two functions can be run in parallel, we use async. It is a requirement that each async takes place in the context of a finish.
- Jorge L.C. Sanz (Author of Opportunities and Constraints of Parallel Computing)!
- History of modern non-Marxian economics, from marginalist revolution through the Keynesian revolution to contemporary monetarist counter-revolution?
- 1. Preface.
- Bonjour Tristesse.
All async 'ed computations that takes place in the context of a finish synchronize at that finish , i. The important point about a finish is that it can serve as the synchronization point for an arbitrary number of async 'ed computations. This is not quite useful in Fibonacci because there are only two computations to synchronize at a time, but it will be useful in our next example.
Recall our example for mapping an array to another by incrementing each element by one. It is helpful to compare this to fork-join, where we had to synchronize parallel branches in pairs. Here, we are able to synchronize them all at a single point. Each invocation of this function corresponds to a thread labeled by "M". Note that all threads synchronize at the single finish node. If we want to calculate work and span on the dag of threads, we can label each vertex with a weight that corresponds to the number of instruction in that thread.
In practice, however, the async-finish version creates fewer threads by about a factor of two , which can make a difference. Futures were first used for expressing parallelism in the context of functional languages, because they permit a parallel computation to be a first-class value. Such a value can be treated just like any other value in the language. For example, it can be placed into a data structure, passed to other functions as arguments. Another crucial difference between futures and fork-join and async-finish parallelism is synchronization: in the latter synchronization is guided by "control dependencies", which are made apparent by the code.
By inspecting the code we can identify the synchronization points for a piece of parallel computation, which correspond to join and finish. In futures, synchronization can be more complex because it is based on "data dependencies. If the computation has not completed, then it will be executed to completion. If it has completed, its result will be used. Using futures, we can parallelize essentially any piece of computation, even if it depends on other parallel computations.
Recall that when using fork-join and async-finish, we have had to ensure that the parallel computations being spawned are indeed independent. This is not necessary with futures. To indicate a parallel computations, we shall use the future construct, which takes an expression as an argument and starts a parallel computation that will compute the value of that expression sometime in the future. The construct returns a "future" value representing the computation, which can be used just like any other value of the language.
When the time comes to use the value of the future, we demand its completion by using the force construct. Note that this code is very similar to the fork-join version. In fact, we can directly map any use of fork-join parallelism to futures. Much the same way that we represent fork-join and async-finish computations with a dag of threads, we can also represent future-based computations with a dag.
The idea is to spawn a subdag for each future and for each force add an edge from the terminus of that subdag to the vertex that corresponds to the force. For example, for the Fibonacci function, the dags with futures are identical to those with fork-join. As we shall see, however, we can create more interesting dags with futures.
We can write a parallel version of this code using futures by following the same approach that we did for Fibonacci. Suppose that the array is given to us an array of future values. To compute each value of the array, the function force 's the corresponding element of the source inside a future. Futures enable expressing parallelism at a very fine level of granularity, at the level of individual data dependencies. This in turn allows parallelizing computations at a similarly finer grain, enabling a technique called pipelining.
As a result, if we have a collection of the same kinds of tasks that can be decomposed in the same way, we can have multiple of them "in flight" without having to wait for one to complete. When computing sequentially, pipelining does not help performance because we can perform one computation at each time step. But when using parallel hardware, we can keep multiple processors busy by pipelining. Suppose furthermore that these task depend on each other, that is a later task use the results of an earlier task.
Thus it might seem that there is no parallelism that we can exploit. We can then execute these tasks in parallel as shown below. This idea of pipelining turns out to be quite important in some algorithms, leading sometimes to asymptotic improvements in run-time. It can, however, be painful to design the algorithm to take advantage of pipelining, especially if all we have available at our disposal are fork-join and async-finish parallelism, which require parallel computations to be independent.
When using these constructs, we might therefore have to redesign the algorithm so that independent computations can be structurally separated and spawned in parallel. On the other hand, futures make it trivial to express pipelined algorithms because we can express data dependencies and ignore how exactly the individual computations may need to be executed in parallel. For example, in our hypothetical example, all we have to do is create a future for each sub-task, and force the relevant subtask as needed, leaving it to the scheduler to take advantage of the parallelism made available.
While we shall not discuss this in detail, it is possible to improve the asymptotic complexity of certain algorithms by using futures and pipelining. An important research question regarding futures is their scheduling. One challenge is primarily contention. When using futures, it is easy to create dag vertices, whose out-degree is non-constant. Another challenge is data locality. Async-finish and fork-join programs can be scheduled to exhibit good data locality, for example, using work stealing. With futures, this is more tricky.
It can be shown for example, that even a single scheduling action can cause a large negative impact on the data locality of the computation. In a multithreaded program, a critical section is a part of the program that may not be executed by more than one thread at the same time. Critical sections typically contain code that alters shared objects, such as shared e.
This means that the a critical section requires mutual exclusion : only one thread can be inside the critical section at any time. If threads do not coordinate and multiple threads enter the critical section at the same time, we say that a race condition occurs, because the outcome of the program depends on the relative timing of the threads, and thus can vary from one execution to another.
Race conditions are sometimes benign but usually not so, because they can lead to incorrect behavior. Spectacular examples of race conditions' effects include the "Northeast Blackout" of , which affected 45 million people in the US and 10 million people in Canada. It can be extremely difficult to find a race condition, because of the non-determinacy of execution. A race condition may lead to an incorrect behavior only a tiny fraction of the time, making it extremely difficult to observe and reproduce it.
For example, the software fault that lead to the Northeast blackout took software engineers "weeks of poring through millions of lines of code and data to find it" according to one of the companies involved. The problem of designing algorithms or protocols for ensuring mutual exclusion is called the mutual exclusion problem or the critical section problem.
There are many ways of solving instances of the mutual exclusion problem. But broadly, we can distinguish two categories: spin-locks and blocking-locks. The idea in spin locks is to busy wait until the critical section is clear of other threads. Solutions based on blocking locks is similar except that instead of waiting, threads simply block. When the critical section is clear, a blocked thread receives a signal that allows it to proceed.
The term mutex , short for "mutual exclusion" is sometimes used to refer to a lock. Mutual exclusions problems have been studied extensively in the context of several areas of computer science. For example, in operating systems research, processes, which like threads are independent threads of control, belonging usually but not always to different programs, can share certain systems' resources. To enable such sharing safely and efficiently, researchers have proposed various forms of locks such as semaphores , which accepts both a busy-waiting and blocking semantics.
Another class of locks, called condition variables enable blocking synchronization by conditioning an the value of a variable. In parallel programming, mutual exclusion problems do not have to arise. For example, if we program in a purely functional language extended with structured multithreading primitives such as fork-join and futures, programs remain purely functional and mutual-exclusion problems, and hence race conditions, do not arise.
If we program in an imperative language, however, where memory is always a shared resource, even when it is not intended to be so, threads can easily share memory objects, even unintentionally, leading to race conditions. Writing to the same location in parallel. In the code below, both branches of fork2 are writing into b. What should then the output of this program be? At the time of the print, the contents of b is determined by the last write. Thus depending on which of the two branches perform the write, we can see both possibilities:.
Consider the following alternative implementation of the Fibonacci function. By "inlining" the plus operation in both branches, the programmer got rid of the addition operation after the fork2. As in the example shows, separate threads are updating the value result but it might look like this is not a race condition because the update consists of an addition operation, which reads the value and then writes to i.
When written in this way, it is clear that these two parallel threads are not independent: they both read result and write to result. Thus the outcome depends on the order in which these reads and writes are performed, as shown in the next example. The number to the left of each instruction describes the time at which the instruction is executed. Note that since this is a parallel program, multiple instructions can be executed at the same time. The particular execution that we have in this example gives us a bogus result: the result is 0, not 1 as it should be.
The reason we get a bogus result is that both threads read the initial value of result at the same time and thus do not see each others write. In this example, the second thread "wins the race" and writes into result. The value 1 written by the first thread is effectively lost by being overwritten by the second thread. Since mutual exclusion is a common problem in computer science, many hardware systems provide specific synchronization operations that can help solve instances of the problem.
These operations may allow, for example, testing the contents of a machine word then modifying it, perhaps by swapping it with another word. Such operations are sometimes called atomic read-modify-write or RMW , for short, operations. A handful of different RMW operations have been proposed. They typically take the memory location x , and a value v and replace the value of stored at x with f x,v. For example, the fetch-and-add operation takes the location x and the increment-amount, and atomically increments the value at that location by the specified amount, i. The compare-and-swap operation takes the location x and takes a pair of values a,b as the second argument, and stores b into x if the value in x is a , i.
The operation "compare-and-swap" is a reasonably powerful synchronization operation: it can be used by arbitrarily many threads to agree reach consensus on a value. Access to the contents of any given cell is achieved by the load and store methods. If the contents equals the contents of expected , then writes new into the target and returns true. As another example use of the atomic class, recall our Fibonacci example with the race condition. In that example, race condition arises because of concurrent writes to the result variable.
The following implementation of Fibonacci is not safe because the variable result is shared and updated by multiple threads. We can solve this problem by declaring result to be an atomic type and using a standard busy-waiting protocol based on compare-and-swap. The idea behind the solution is to load the current value of result and atomically update result only if it has not been modified by another thread since it was loaded.
This guarantees that the result is always updated read and modified correctly without missing an update from another thread. The example above illustrates a typical use of the compare-and-swap operation. In this particular example, we can probably prove our code is correct. But this is not always as easy due to a problem called the "ABA problem.
While reasonably powerful, compare-and-swap suffers from the so-called ABA problem. In the mean time some other thread also reads result and performs some operations on it, setting it back to 2 after it is done. The trouble is that the value has actually changed and has been changed back to the value that it used to be.
Thus, compare-and-swap was not able to detect this change because it only relies on a simple shallow notion of equality. If for example, the value stored in result was a pointer, the fact that the pointer remains the same does not mean that values accessible from the pointer has not been modified; if for example, the pointer led to a tree structure, an update deep in the tree could leave the pointer unchanged, even though the tree has changed.
The ABA problem is an important limitation of compare-and-swap: the operation itself is not atomic but is able to behave as if it is atomic if it can be ensured that the equality test of the subject memory cell suffices for correctness. In the example below, ABA problem may happen if the counter is incremented and decremented again in between a load and a store but it is impossible to observe because it is harmless. If however, the compare-and-swap was on a memory object with references, the ABA problem could have had observable effects.
The ABA problem can be exploited to give seemingly correct implementations that are in fact incorrect. To reduce the changes of bugs due to the ABA problem, memory objects subject to compare-and-swap are usually tagged with an additional field that counts the number of updates.
This solves the basic problem but only up to a point because the counter itself can also wrap around. We are now going to study the practical performance of our parallel algorithms written with PASL on multicore computers. You need to replace these settings with your own where appropriate. The PASL sources that we are going to use are part of a branch that we created specifically for this course. You can access the sources either via the tarball linked by the github webpage or, if you have git , via the command below. You can skip this section if you are using a computer already setup by us or you have installed an image file containing our software.
To skip this part and use installed binaries, see the heading "Starting with installed binaries", below. Currently, the software associated with this course supports Linux only. Any machine that is configured with a recent version of Linux and has access to at least two processors should be fine for the purposes of this course.
Before we can get started, however, the following packages need to be installed on your system. The rest of this section explains what are the optional software dependencies and how to configure PASL to use them. At the time of writing this document, the system-default implementations of malloc and free that are provided by Linux distributions do not scale well with even moderately large amounts of concurrent allocations. Fortunately, for this reason, organizations, such as Google and Facebook, have implemented their own scalable allocators that serve as drop-in replacements for malloc and free.
Using tcmalloc for your own experiements is easy. The following lines need to be in the settings. This assignment can be issued either at the command line or in the environment loader script, e.
If your system has a non-uniform memory architecture i. The particular policy that works best for our applications is round-robin NUMA page allocation. Do not worry if that term is unfamiliar: all it does is disable NUMA support, anyway! If the output that you see is something like the following, then your machine has NUMA.
Otherwise, it probably does not. To configure PASL to use hwloc , add the following lines to the settings. At this point, you have either installed all the necessary software to work with PASL or these are installed for you. In either case, make sure that your PATH variable makes the software visible. For setting up your PATH variable on andrew. We have installed much of the needed software on andrew. So you need to go through a relatively minimal set up.
It is important that this is at the beginning of the PATH variable. To make interaction easier, we also added the relative path. We are going to use two command-line tools to help us to run experiments and to analyze the data. These tools are part of a library that we developed, which is named pbench.
The pbench sources are available via github. The tarball of the sources can be downloaded from the github page. The following command builds the tools, namely prun and pplot. The former handles the collection of data and the latter the human-readable output e. Make sure that the build succeeded by checking the pbench directory for the files prun and pplot.
If these files do not appear, then the build failed. It will be convenient for you to make these aliases persistent, so that next time you log in, the aliases will be set. Add the commands above to your shell configuration file. When we are tuning our parallel algorithms, it can be helpful to visualize their processor utilization over time, just in case there are patterns that help to assign blame to certain regions of code.
Later, we are going to use the utilization visualizer that comes packaged along with PASL. To build the tool, run the following make command. We recommend that you make this alias persistent by putting it into your shell configuration file as you did above for the pbench tools. PASL comes equipped with a Makefile that can generate several different kinds of executables. These different kinds of executables and how they can be generated is described below for a benchmark program pgm.
To speed up the build process, add to the make command the option -j e. This option enables make to parallelize the build process. Note that, if the build fails, the error messages that are printed to the terminal may be somewhat garbled. As such, it is better to use -j only if after the debugging process is complete. We are going to start our experimentation with three different instances of the same program, namely bench.
This program serves as a "driver" for the benchmarks that we have implemented. These implementations are good parallel codes that we expect to deliver good performance. We first build the baseline version. The file extension. We can now run the baseline for one of our benchmarks, say Fibonacci by using the -bench argument to specify the benchmark and the -n argument to specify the input value for the Fibonacci function. The exectime indicates the wall-clock time in seconds that is taken by the benchmark.
In general, this time measures only the time taken by the benchmark under consideration. It does not include the time taken to generate the input data, for example. The utilization relates to the utilization of the processors available to the program. We will return to this measure soon. The result field reports a value computed by the benchmark. However, all instances of fork2 are erased as described in an earlier chapter. The run time of the sequential elision in this case is similar to the run time of the sequential baseline because the two are similar codes.
However, for most other algorithms, the baseline will typically be at least a little faster. Unless specified otherwise, however, the parallel binary uses just one processor. Because our machine has 40 processors, we can run the same application using all available processors. Before running this command, please adjust the -proc option to match the number of cores that your machine has. Note that you can use any number of cores up to the number you have available.
You can use nproc or lscpu to determine the number of cores your machine has. We see from the output of the processor run that our program ran faster than the sequential runs. We may ask at this point: What is the improvement that we just observed from the parallel run of our program? One common way to answer this question is to measure the "speedup".
Note that speedup is defined with respect to a baseline program. How exactly should this baseline program be chosen? One option is to take the sequential elision as a baseline. The speedup curve with such a baseline can be helpful in determining the scalability of a parallel algorithm but it can also be misleading, especially if speedups are taken as a indicator of good performance, which they are not because they are only relative to a specific baseline.
For speedups to be a valid indication of good performance, they must be calculated against an optimized implementation of the best serial algorithm for the same problem. The speedup at a given number of processors is a good starting point on the way to evaluating the scalability of the implementation of a parallel algorithm. The next step typically involves considering speedups taken from varying numbers of processors available to the program. The data collected from such a speedup experiment yields a speedup curve , which is a curve that plots the trend of the speedup as the number of processors increases.
The shape of the speedup curve provides valuable clues for performance and possibly for tuning: a flattening curve suggests lack of parallelism; a curve that arcs up and then downward suggests that processors may be wasting time by accessing a shared resource in an inefficient manner e.
Although not linear i. Let us see what a speedup curve can tell us about our parallel Fibonacci program. We need to first get some data. The following command performs a sequence of runs of the Fibonacci program for varying numbers of processors. You can now run the command yourself. If successful, the command generates a file named plots. The output should look something like the plot in speedup plot below. The plot shows that our Fibonacci application scales well, up to about twenty processors.
As expected, at twenty processors, the curve dips downward somewhat. We know that the problem size is the primary factor leading to this dip. How much does the problem size matter? The speedup plot in the Figure below shows clearly the trend. The run time that a given parallel program takes to solve the same problem can vary noticeably because of certain effects that are not under our control, such as OS scheduling, cache effects, paging, etc.
We can consider such noise in our experiments random noise. Noise can be a problem for us because noise can lead us to make incorrect conclusions when, say, comparing the performance of two algorithms that perform roughly the same. To deal with randomness, we can perform multiple runs for each data point that we want to measure and consider the mean over these runs. The prun tool enables taking multiple runs via the -runs argument. Moreover, the pplot tool by default shows mean values for any given set of runs and optionally shows error bars.
The documentation for these tools gives more detail on how to use the statistics-related features. Suppose that, on our processor machine, the speedup that we observe is larger than 40x. It might sound improbable or even impossible. But it can happen. Ordinary circumstances should preclude such a superlinear speedup , because, after all, we have only forty processors helping to speed up the computation. Superlinear speedups often indicate that the sequential baseline program is suboptimal.
This situation is easy to check: just compare its run time with that of the sequential elision. If the sequential elision is faster, then the baseline is suboptimal. Other factors can cause superlinear speedup: sometimes parallel programs running on multiple processors with private caches benefit from the larger cache capacity. These issues are, however, outside the scope of this course. As a rule of thumb, superlinear speedups should be regarded with suspicion and the cause should be investigated.
To put this hunch to the test, let us examine the utilization of the processors in our system. We need to first build a binary that collects and outputs logging data. Each of these fields can be useful for tracking down inefficiencies. In the present case, however, none of the new values shown above are highly suspicious, considering that there are all at most in the thousands. Since we have not yet found the problem, let us look at the visualization of the processor utilization using our pview tool. To get the necessary logging data, we need to run our program again, this time passing the argument --pview.
Every time we run with --pview this binary file is overwritten. To see the visualization of the log data, we call the visualizer tool from the same directory. The output we see on our processor machine is shown in the Figure below. The window shows one bar per processor. Time goes from left to right. Idle time is represented by red and time spent busy with work by grey. You can zoom in any part of the plot by clicking on the region with the mouse.
To reset to the original plot, press the space bar. From the visualization, we can see that most of the time, particularly in the middle, all of the processors keep busy. However, there is a lot of idle time in the beginning and end of the run. This pattern suggests that there just is not enough parallelism in the early and late stages of our Fibonacci computation. We are pretty sure that or Fibonacci program is not scaling as well is it could. What is important is to know more precisely what it is that we want our Fibonacci program to achieve. To this end, let us consider a distinction that is important in high-performance computing: the distinction between strong and weak scaling.
In general, strong scaling concerns how the run time varies with the number of processors for a fixed problem size. Sometimes strong scaling is either too ambitious, owing to hardware limitations, or not necessary, because the programmer is happy to live with a looser notion of scaling, namely weak scaling. In weak scaling, the programmer considers a fixed-size problem per processor. We are going to consider something similar to weak scaling. In the Figure below , we have a plot showing how processor utilization varies with the input size.
- Emma: 200th-Anniversary Annotated Edition (Penguin Classics Deluxe Edition);
- The Welsh in America: Letters from the Immigrants.
- Enumeration Type Documentation.
- Jeb Stuart.
- Made in Mexico: Tradition, Tourism, and Political Ferment in Oaxaca.
The scenario that we just observed is typical of multicore systems. For computations that perform lots of highly parallel work, such limitations are barely noticeable, because processors spend most of their time performing useful work. The utilization plot is shown in the Figure below. We have seen in this lab how to build, run, and evaluate our parallel programs. Concepts that we have seen, such as speedup curves, are going to be useful for evaluating the scalability of our future solutions. Strong scaling is the gold standard for a parallel implementation.
But as we have seen, weak scaling is a more realistic target in most cases. In many cases, a parallel algorithm which solves a given problem performs more work than the fastest sequential algorithm that solves the same problem. This extra work deserves careful consideration for several reasons. First, since it performs additional work with respect to the serial algorithm, a parallel algorithm will generally require more resources such as time and energy.
By using more processors, it may be possible to reduce the time penalty, but only by using more hardware resources. Assuming perfect scaling, we can reduce the time penalty by using more processors.
Sometimes, a parallel algorithm has the same asymptotic complexity of the best serial algorithm for the problem but it has larger constant factors. This is generally true because scheduling friction, especially the cost of creating threads, can be significant. In addition to friction, parallel algorithms can incur more communication overhead than serial algorithms because data and processors may be placed far away in hardware. These considerations motivate considering "work efficiency" of parallel algorithm. Work efficiency is a measure of the extra work performed by the parallel algorithm with respect to the serial algorithm.
We define two types of work efficiency: asymptotic work efficiency and observed work efficiency. The former relates to the asymptotic performance of a parallel algorithm relative to the fastest sequential algorithm. The latter relates to running time of a parallel algorithm relative to that of the fastest sequential algorithm. An algorithm is asymptotically work efficient if the work of the algorithm is the same as the work of the best known serial algorithm.
The parallel array increment algorithm that we consider in an earlier Chapter is asymptotically work efficient, because it performs linear work, which is optimal any sequential algorithm must perform at least linear work. We consider such algorithms unacceptable, as they are too slow and wasteful.
We consider such algorithms to be acceptable. We build this code by using the special optfp "force parallel" file extension. This special file extension forces parallelism to be exposed all the way down to the base cases. Later, we will see how to use this special binary mode for other purposes.
In practice, observed work efficiency is a major concern. First, the whole effort of parallel computing is wasted if parallel algorithms consistently require more work than the best sequential algorithms. In other words, in parallel computing, both asymptotic complexity and constant factors matter.
Based on these discussions, we define a good parallel algorithm as follows. We say that a parallel algorithm is good if it has the following three characteristics:. For example, a parallel algorithm that performs linear work and has logarithmic span leads to average parallelism in the orders of thousands with the small input size of one million. For such a small problem size, we usually would not need to employ thousands of processors.
It would be sufficient to limit the parallelism so as to feed tens of processors and as a result reduce impact of excess parallelism on work efficiency. In many parallel algorithms such as the algorithms based on divide-and-conquer, there is a simple way to achieve this goal: switch from parallel to sequential algorithm when the problem size falls below a certain threshold. This technique is sometimes called coarsening or granularity control.
But which code should we switch to: one idea is to simply switch to the sequential elision, which we always have available in PASL. If, however, the parallel algorithm is asymptotically work inefficient, this would be ineffective. In such cases, we can specify a separate sequential algorithm for small instances. Optimizing the practical efficiency of a parallel algorithm by controlling its parallelism is sometimes called optimization , sometimes it is called performance engineering , and sometimes performance tuning or simply tuning.
In the rest of this document, we use the term "tuning. In fact, there is barely a difference between the serial and the parallel runs. The tuning is actually done automatically here by using an automatic-granularity-control technique described in the section. Our parallel program on a single processor is one percent slower than the sequential baseline. Such work efficiency is excellent. The basic idea behind coarsening or granularity control is to revert to a fast serial algorithm when the input size falls below a certain threshold.
To determine the optimal threshold, we can simply perform a search by running the code with different threshold settings. While this approach can help find the right threshold on the particular machine that we performed the search, there is no guarantee that the same threshold would work on another machine. In fact, there are examples in the literature that show that such optimizations are not portable , i. In the general case, determining the right threshold is even more difficult.
To see the difficulty consider a generic polymorphic , higher-order function such as map that takes a sequence and a function and applies the function to the sequence. The problem is that the threshold depends both on the type of the elements of the sequence and the function to be mapped over the sequence. For example, if each element itself is a sequence the sequence is nested , the threshold can be relatively small.
If, however, the elements are integers, then the threshold will likely need to be relatively large. This makes it difficult to determine the threshold because it depends on arguments that are unknown at compile time. Essentially the same argument applies to the function being mapped over the sequence: if the function is expensive, then the threshold can be relatively small, but otherwise it will need to be relatively large. As we describe in this chapter , it is sometimes possible to determine the threshold completely automatically.
There has been significant research into determining the right threshold for a particular algorithm. This problem, known as the granularity-control problem , turns out to be a rather difficult one, especially if we wish to find a technique that can ensure close-to-optimal performance across different architectures. In this section, we present a technique for automatically controlling granularity by using asymptotic cost functions. In general, the value returned by the complexity function need only be precise with respect to the asymptotic complexity class of the associated computation.
The complexity function above is preferable, however, because it is simpler. In other words, when expressing work costs, we only need to be precise up to the asymptotic complexity class of the work. In PASL, a controlled statement , or cstmt , is an annotation in the program text that activates automatic granularity control for a specified region of code.
To support such automatic granularity control PASL uses a prediction algorithm to map the asymptotic work cost as returned by the complexity function to actual processor cycles. When the predicted processor cycles of a particular instance of the controlled statement falls below a threshold determined automatically for the specific machine , then that instance is sequentialized, by turning off the ability to spawn parallel threads for the execution of that instance. If the predicted processor cycle count is higher than the threshold, then the statement instance is executed in parallel.
In other words, the reader can think of a controlled statement as a statement that executes in parallel when the benefits of parallel execution far outweigh its cost and that executes sequentially in a way similar to the sequential elision of the body of the controlled statement would if the cost of parallelism exceeds its benefits.
We note that while the sequential exection is similar to a sequential elision, it is not exactly the same, because every call to fork2 must check whether it should create parallel threads or run sequentially. Thus the execution may differ from the sequential elision in terms of performance but not in terms of behavior or semantics. The code below uses a controlled statement to automatically select, at run time, the threshold size for our parallel array-increment function. The controlled statement takes three arguments, whose requirements are specified below, and returns nothing i.
The effectiveness of the granularity controller may be compromised if any of the requirements are not met. The first argument is a reference to the controller object. The controller object is used by the controlled statement to collect profiling data from the program as the program runs. The label must be unique to the particular controller.
Moreover, the controller must be declared as a global variable. The second argument is the complexity function. The type of the return value should be long. The third argument is the body of the controlled statement. The return type of the controlled statement should be void. When the controlled statement chooses sequential evaluation for its body the effect is similar to the effect where in the code above the input size falls below the threshold size: the body and the recursion tree rooted there is sequentialized.
When the controlled statement chooses parallel evaluation, the calls to fork2 create parallel threads.
Introduction to Parallel Computing
It is not unusual for a divide-and-conquer algorithm to switch to a different algorithm at the leaves of its recursion tree. For example, sorting algorithms, such as quicksort, may switch to insertion sort at small problem sizes. In the same way, it is not unusual for parallel algorithms to switch to different sequential algorithms for handling small problem sizes.
Client refers to a CPU kernel that performs and assignments parallel tasks. Worker refers to the other CPU cores that run parallel code, and all workers work at the same time to achieve accelerated computing. Cycle decomposition diagram. The key method is running parallel for-loops on workers within contingencies.
The loop variable represents the number of times the loop is executed. The parallel strategy in this paper is used in the expected multi-contingency.
The large variable array contains all contingencies can be segmented into each worker by using sliced variable in the implementation of the parallel for-loops program i and therefore the data transmission between worker and client is reduced. The array contains all the contingency information is actually a contingency variable in this paper.
Data arrays contains nodal admittance matrix, generator operating data and parameters are classified as broadcast variable which will be shared by all the workers. First of all, the conventional optimal power flow calculation is carried out without considering the transient stability constraints of formula 20 , only the calculation of formula 17 — 19 are solved.
Then taking the optimal solution of the conventional optimal power flow as the initial value, the transient stability analysis is carried out to determine whether the formula 20 is satisfied. If the optimal solution is satisfied, the calculation is terminated; otherwise, the next step is taken. The initial values of the parameters under each contingency are passed to each worker, and each worker calculates the Jacobian and Hessian matrix independently to complete the iterative calculation of the interior point method.
If both satisfied, the calculation is terminated; otherwise, the calculation process above is repeated. In this section, the NE system, the IEEE bus system and bus systems are used for simulation from the aspects including computing time, speed-up and efficiency. Furthermore, the simulations are run on 32 Cores 1. Formula 16 is used as transient stability constraint to solve the OTS model 17— In order to meet the requirements of transient stability, economy is sacrificed properly, and more iteration is added to adjust the system operation mode meanwhile. The swing curves of three test systems are shown in Figs.
From Table 5 to Table 7 , iterations of the NE system and the IEEE system changes from 17 to 55 and from 26 to 56 respectively and iterations of the C system changes little when considering multi-contingency. In particular, with the same number of iterations, the calculation time increases approximate linearly as the number of contingency increases, reflecting the good computational characteristics of the model in this paper. According to the results of C system, when considering three contingency, the results of objective function is consistent with the results of the first contingency which shows that the first contingency plays a dominant position when considering the previous three contingency.
The system adjustment mainly aims to the constraints of the first contingency which can meet the constraints of other contingency as well. However, when considering 6 contingency, the 4th contingency becomes the dominant contingency. Therefore, if the leading contingency in the sets can be found, a great amount of computation of OTS problem can be reduced effectively, which can be another research goal in the future.
Speed-up refers to the ratio of computation time that the same task running on a single-processor and parallel processor system, which is used to measure the performance of parallel programs. In the field of parallel computing, the speed-up ratio is defined as. Where T s and T p refers to the serial execution time and parallel execution time, respectively.
It is also observed that the acceleration is not proportional to the number of contingencies. This phenomenon mainly has the following two reasons:. Communication Loss. As a result, the parallel efficiency decreased with the increase of contingency. In fact, in the same number of contingencies, the smaller the system size, the corresponding efficiency tends to be lower as shown in Table 9. Serial Calculation. The parallel method is used to solve the coefficient matrix of the modified equation merely, so the other process of the algorithm still uses the serial solution.
In general, as the number of contingencies increases, the amount of data required for serial computation is increasing as well, which is another reason the speed-up ratio tends to be saturated. Nevertheless, the speed-up ratio still shows a good upward trend. It is shown that the parallel method of this paper can adapt to the calculation and analysis of large power systems, which can effectively reduce the computing time of OTS.
Furthermore, compared with the existing approach in [ 17 , 18 , 19 ], for example, the computing time in 2 contingencies in IEEE system is In addition, compared with the existing approach, the proposed algorithm has the following advantages:. The method proposed in this paper has better acceleration effect, especially in larger systems or more contingencies, which shows excellent calculation performance.
This algorithm can handle even more than 32 contingencies. Theoretically, as long as the number of cores is sufficient, more contingencies can be processed at the same time and a higher speed-up ratio can be achieved. The simulation data also shows that the algorithm has strong convergence for any number of contingency. The parallel part of the algorithm is mainly based on parallel computing technology. In hardware, without changing the algorithm structure, the increase in the number of cores and the frequency can easily achieve the performance extension.
In this paper, the OTS model based on the criteria according to the swing curves of generator rotor and the characteristics of transient stability analysis is proposed. Tests results in three different systems shown that the computing time of OTS is reduced efficiently without obvious changes on the number of iterations and the optimal solution. The algorithm proposed in this paper is simple in constraint and good in convergence. The multi-core processors parallel computing adopted in this paper is basically the same as the single CPU because of its hardware structure, which can improve the computing capacity without changing the hardware structure.
Therefore, the method is versatile and scalable and is not limited to a specific algorithm or architecture. El-Hawary, M. Optimal power flow: Solution technologies, requirement and challenges [R]. Tong, X. A semismooth Newton method for solving optimal power flow. Journal of Industrial and Management Optimization, 3 3 , — Luo, K. Decoupled semismooth Newton algorithm for optimal power flow problems. Control and Decision, 21 5 , — New model and algorithm for solving the KKT system of optimal power flow. Control Theory and Applications, 23 2 , — De-qiang, G.
Stability constrained optimal power flow. Momoh, J. Challenges to optimal power flow. La Scala, M. On-line dynamic preventive control: An algorithm for transient security dispatch [J]. Chen, L. Optimal operation solutions of power systems with transient stability constraints.
- What is Parallel Computing?;
- Sascha Hunold?
- Leadership for the Twenty-First Century.
- Parallel Computing Works! - 1st Edition;
- Death of an Ordinary Man: A Novel.
Ming bo, L. Calculation of available transfer capability with transient stability constraints. Proceedings of the CSEE, 23 9 , 28— Vecchiola, C. Gomez, A. Implementation of the fast decoupled load flow on a vector computer [J]. Sasaki, H. A parallel computation algorithm for static state estimation by means of matrix inversion lemma. Li, Y. Abur, A. Sutter, H. Software and the concurrency revolution. Q focus: Multiprocessors, 3 7 , 54— Jiang, Q. A reduced-space interior point method for transient stability constrained optimal power flow. Mo, N.
Transient stability constrained optimal power flow using particle swarm optimisation. Geng, G. Parallel transient stability-constrained optimal power flow using GPU as coprocessor[J]. Download references. YD Yang conceived and designed the study. All authors read and approve the final manuscript. Correspondence to Yude Yang.