[HOME]

[Personal Info]

[Academics]

[Research]

[CV]

 

ECE 566  Parallel Processing

 

Instructor: Prof. Ashfaq Khokhar

Textbook:  Vipin Kumar, et al., Introduction to Parallel Computing.---Design and Analysis of Algorithms, 2nd Ed .Benjamin/Cummings and Addison Wesley, 2003.

 

Course Assignments

 

Finding Maximum of N numbers over P processors assuming the underlying communication network is (a). A ring Interconnection Network (b) An all to all (fully connected) interconnection network.

        The report I submitted describing the solution to the assignment can be found here. The source code in MPI (Message Passing Interface) is also part of this report. The solution I gave for the problem is based upon a greedy approach as for as reduction of the inter process communication is concerned.

          Note: The solution assumes that every process at the start of the programme has the required data present in its local memory. This assumption means that we generate locally the random data (obviously not practical) and then communicate to get the maximum out of all the processes. This approach means that by changing the number of processes at command line will simply increase the problem size too and the scalability of the solution can not be adjudged fully. It however does allow us to see how the communication overhead scales with increasing problem size as well as increasing processes. A better solution would have been to keep all the required data at a central location local to one (master) process who then distributes the data evenly among all processes. This will give the true picture of scalability when the numbers of processes are increased with the fixed problem size. [Download Report.]

Selected Problems from Chapter 2 of Text Book. P 2.20 is an interesting problem which asks you to modify the structure of the 2D mesh with wrap-around connections so that every link has the same length. Every row and every column arranged in a diamond configuration gives us the required result.

MPI Implementation of Quicksort over Borg. Borg is a 14-processor HP V-server (a shared memory machine) at UIC. The problem was to be solved in the spring break. I had time enough to first develop my own cluster in Multimedia Research Lab. The report that contains the implementation of Quicksort over Multimedia Lab cluster can be found here. The comparison of the Run times of my quick sort implementation over Borg and MLC can be found here.My solution employs a load balancing scheme which is not very good. Use of the prefix-scan operation to judge the available load can be used to develop better load balancing scheme. Note of assignment1 is applicable to this one as well.    

Selected Problems from Chapter 3.  Chapter 3 of the text book is a classic text on Principles of Parallel Algorithm Design. If one goes through the chapter thoroughly, the end-of-the-chapter problems are pretty simple to solve.

Selected Problems from Chapter 4:  Chapter 4 of the book lacks in Visualization approach. I think the communication Operations in Parallel Processing Environments can be best explained through comprehensive Visualization, I found the Visualization schemes employed by the author less interesting. A picture is worth a 1000 words, lots of things that are explained in words, are not well supported by interesting pictures of the communication process.

Midterm: I learnt the importance of using collective communication operations provided by MPI.

Chapter 5. Excellent source on analytical modeling of parallel programmes. There were no assignments given from chapter 5 specifically. But the presentations and the final take home exam involved issues discussed in this chapter .

Presentations: The class was divided in groups. Each group was assigned a specific problem to give presentations on. These problems are discussed in end chapters of the text book. Our group (me, Padma and Hairong) was assigned the problem of Parallel Matrix Multiplication. Padma gave presentation on Cannon’s Algorithm (8.2.2) and Hairong presented DNS (8.2.3). I discussed the effects of Matrix Layouts in memory on the performance of Matrix Multiplication. The presentation was adapted from the paper “Recursive Array Layouts and Fast Matrix Multiplication” by Chatterjee et al, IEEE TPDS November, 2002.

Final Project: A Parallel Scalable Decision Tree Classifier for Data Mining. It was also a part of my research activity at Multimedia Research Lab. I surveyed the decision tree based classifiers (my research page) and decided to implement ScalParC by Vipin Kumar et al. The report that discusses the first two phases of the classifier can be found here. The complete solution with source code will soon be made available here.     

 Final Take home Exam.  The problem 1 was about experimenting with multithreaded Matrix Multiplication on a shared memory machine. I concluded that the present implementation of OpenMP by HP that we are using on Borg does not provide nested parallelism. The performance comparison between different levels of the parallelism for matrix multiplication in the form of graphs can be found here. It does not include discussion and conclusions drawn from these graphs, as these were submitted handwritten and I don’t have it in digitized version. An important problem in the Final exam was to give the MIMD version of Levialdi’s Algorithm ‘On shrinking Binary Picture Patterns’. The Algorithm can be found in the paper by Levialdi [‘On shrinking binary picture patterns. ACM Communications January 1972]. My solution borrows ideas of neighborhood and shrinking from Levialdi but modifies the Shrinking Operator which now specifically deals with my self defined data structures. These data structures take care of the global components (those components who are scattered across number of processors) in such a way so as when the shrinking procedure is applied on them, they give the required shrunk image with the connectivity (both 4 and 8) of the individual components in tact.

 

Comments and Suggestions      wahmad at ieee dot org