|
ECE 566 Parallel Processing Instructor: Prof. Ashfaq Khokhar Textbook: Vipin Kumar, et al., Introduction to Parallel Computing.---Design and Analysis of Algorithms, 2nd Ed .Benjamin/Cummings and Addison Wesley, 2003. Course Assignments Finding
Maximum of N numbers over P processors assuming the underlying communication
network is (a). A ring Interconnection Network (b) An all to all (fully
connected) interconnection network. The report I submitted describing the solution to the assignment can be found here. The source code in MPI (Message Passing Interface) is also part of this report. The solution I gave for the problem is based upon a greedy approach as for as reduction of the inter process communication is concerned. Note: The
solution assumes that every process at the start of the programme has the
required data present in its local memory. This assumption means that we
generate locally the random data (obviously not practical) and then
communicate to get the maximum out of all the processes. This approach means
that by changing the number of processes at command line will simply increase
the problem size too and the scalability of the solution can not be adjudged
fully. It however does allow us to see how the communication overhead scales
with increasing problem size as well as increasing processes. A better
solution would have been to keep all the required data at a central location
local to one (master) process who then distributes the data evenly among all
processes. This will give the true picture of scalability when the numbers of
processes are increased with the fixed problem size. [Download Report.] Selected Problems from Chapter 2 of Text Book. P 2.20 is an interesting problem which asks you to modify the structure of the 2D mesh with wrap-around connections so that every link has the same length. Every row and every column arranged in a diamond configuration gives us the required result. MPI
Implementation of Quicksort over Borg. Borg
is a 14-processor HP V-server (a shared memory machine) at UIC. The problem
was to be solved in the spring break. I had time enough to first develop my
own cluster in Multimedia Research Lab. The report that contains the
implementation of Quicksort over Multimedia Lab cluster can be found here. The comparison of the Run
times of my quick sort implementation over Borg and MLC can be found here.My solution employs a load
balancing scheme which is not very good. Use of the prefix-scan operation to
judge the available load can be used to develop better load balancing scheme.
Note of assignment1 is applicable to this one as well. Selected
Problems from Chapter 3. Chapter 3 of the text book is a
classic text on Principles of Parallel Algorithm Design. If one goes through
the chapter thoroughly, the end-of-the-chapter problems are pretty simple to
solve. Selected Problems from Chapter 4: Chapter 4 of the book lacks in Visualization approach. I think the communication Operations in Parallel Processing Environments can be best explained through comprehensive Visualization, I found the Visualization schemes employed by the author less interesting. A picture is worth a 1000 words, lots of things that are explained in words, are not well supported by interesting pictures of the communication process. Midterm: I learnt the importance of using collective communication operations provided by MPI. Chapter 5. Excellent source on analytical modeling of parallel programmes. There were no assignments given from chapter 5 specifically. But the presentations and the final take home exam involved issues discussed in this chapter . Presentations: The class was divided in groups. Each group was assigned a specific problem to give presentations on. These problems are discussed in end chapters of the text book. Our group (me, Padma and Hairong) was assigned the problem of Parallel Matrix Multiplication. Padma gave presentation on Cannon’s Algorithm (8.2.2) and Hairong presented DNS (8.2.3). I discussed the effects of Matrix Layouts in memory on the performance of Matrix Multiplication. The presentation was adapted from the paper “Recursive Array Layouts and Fast Matrix Multiplication” by Chatterjee et al, IEEE TPDS November, 2002. Final Project: A Parallel Scalable Decision Tree Classifier for Data Mining. It
was also a part of my research activity at Multimedia Research
Lab. I surveyed the decision tree based classifiers (my research page)
and decided to implement ScalParC by Vipin Kumar et al. The report
that discusses the first two phases of the classifier can be found here. The complete solution with
source code will soon be made available here. Final Take home Exam. The problem 1
was about experimenting with multithreaded Matrix Multiplication on a shared
memory machine. I concluded that the present implementation of OpenMP by HP
that we are using on Borg does not provide nested parallelism. The
performance comparison between different levels of the parallelism for matrix
multiplication in the form of graphs can be found here. It does not include
discussion and conclusions drawn from these graphs, as these were submitted
handwritten and I don’t have it in digitized version. An important problem in
the Final exam was to give the MIMD version of Levialdi’s Algorithm ‘On
shrinking Binary Picture Patterns’. The Algorithm can be found in the paper
by Levialdi [‘On shrinking binary picture patterns. ACM Communications
January 1972]. My solution borrows ideas of neighborhood and shrinking from
Levialdi but modifies the Shrinking Operator which now specifically deals
with my self defined data structures. These data structures take care of the
global components (those components who are scattered across number of
processors) in such a way so as when the shrinking procedure is applied on
them, they give the required shrunk image with the connectivity (both 4 and
8) of the individual components in tact. Comments and
Suggestions wahmad at ieee dot org |
|
|