Dense Matrix Algorithms (Chapter 8) Alexandre David B2-206

Dense Matrix Algorithms. Dense (full) matrices: few known zeros; sparse matrices call for other algorithms. Square matrices are used for pedagogical purposes only; the algorithms generalize. Data decomposition is natural here (cf. 3.2.2 on partitioning input/output/intermediate data and 3.4.1 on mapping schemes based on data partitioning).

Today: Matrix*Vector, Matrix*Matrix, and solving systems of linear equations.

Matrix*Vector. Recall y = A x, with y_i = Σ_{k=1}^{n} a_{ik} x_k. Serial algorithm: n² multiplications and additions, so W = n².

Matrix*Matrix. Recall C = A B, with c_{ij} = Σ_{k=1}^{n} a_{ik} b_{kj}. Serial algorithm: n³ multiplications and additions, so W = n³.

Matrix*Vector Serial Algorithm. Computes y_i = Σ_{k=1}^{n} a_{ik} x_k:

procedure MAT_VEC(A, x, y)
  for i := 0 to n-1 do
    y[i] := 0
    for k := 0 to n-1 do
      y[i] := y[i] + A[i,k]*x[k]
    done
  done
endproc

How to parallelize?

Matrix*Matrix Serial Algorithm. Computes c_{ij} = Σ_{k=1}^{n} a_{ik} b_{kj}:

procedure MAT_MULT(A, B, C)
  for i := 0 to n-1 do
    for j := 0 to n-1 do
      C[i,j] := 0
      for k := 0 to n-1 do
        C[i,j] := C[i,j] + A[i,k]*B[k,j]
      done
    done
  done
endproc

How to parallelize?

Matrix*Vector Row-wise 1-D Partitioning. Initial distribution: each process has one row of the n×n matrix and one element of the n×1 vector, and is responsible for computing one element of the result.

Matrix*Vector 1-D: y = A x on n processes. But every process needs the entire vector → all-to-all broadcast.

All-to-All Broadcast of the vector x [figure].

Parallel Computation: y = A x on n processes. Each process computes one y_i = Σ_{k=1}^{n} a_{ik} x_k, in parallel on the n processes.

Example: Matrix*Vector (Program 6.4). Partition on rows, Allgather (all-to-all broadcast) the vector, then multiply locally; a sketch of the pattern follows.
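
The communication pattern of the row-wise 1-D algorithm can be sketched in MPI as below. This is an illustrative sketch, not the book's Program 6.4: the function name, the row-major layout, and the assumption that n is divisible by the number of processes are mine.

```c
#include <mpi.h>
#include <stdlib.h>

/* Row-wise 1-D matrix-vector multiply: y_local = A_local * x.
 * A_local holds nlocal = n/p rows (row-major), x_local holds nlocal entries of x.
 * Illustrative sketch assuming n % p == 0; not the book's Program 6.4. */
void matvec_rowwise_1d(int n, const double *A_local, const double *x_local,
                       double *y_local, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nlocal = n / p;

    /* Every process needs the whole vector: all-to-all broadcast. */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(x_local, nlocal, MPI_DOUBLE,
                  x,       nlocal, MPI_DOUBLE, comm);

    /* Local multiplication on the owned rows. */
    for (int i = 0; i < nlocal; i++) {
        y_local[i] = 0.0;
        for (int k = 0; k < n; k++)
            y_local[i] += A_local[i * n + k] * x[k];
    }
    free(x);
}
```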

Analysis. The all-to-all broadcast and the multiplications are both Θ(n), so T_P = Θ(n). For n processes, W = n² = n·T_P, so the algorithm is cost-optimal. (A parallel system is cost-optimal iff p·T_P = Θ(W).)

Performance Metrics (recall). Efficiency E = S/p measures the fraction of time spent doing useful work; in the earlier sum example, E = Θ(1/log n). Cost C = p·T_P, a.k.a. work or processor-time product. Note E = T_S/C; the system is cost-optimal if E is a constant.

Using Fewer Processes. Brent's scheduling principle says it is possible. With p processes: n/p rows per process. Communication time = t_s·log p + t_w·(n/p)(p-1) ≈ t_s·log p + t_w·n = Θ(n). Computation: n²/p. Hence p·T_P = Θ(n²) = W: still cost-optimal (the algebra is spelled out below).
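
Writing out the slide's bound as a check (nothing new, just the algebra):

```latex
T_P \;\approx\; \frac{n^2}{p} + t_s \log p + t_w n,
\qquad
p\,T_P \;\approx\; n^2 + t_s\, p \log p + t_w\, n p .
```

For p = O(n), both overhead terms are O(n²), so p·T_P = Θ(n²) = W and the algorithm remains cost-optimal.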

Scalability (recall). Efficiency increases with the problem size and decreases with the number of processors. Scalability measures the ability to keep increasing the speedup as a function of p.

Isoefficiency Function. For scalable systems, efficiency can be kept constant if T_0/W is kept constant. For a target efficiency E, keep this ratio constant: the isoefficiency function is W = K·T_0(W,p).
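
Where the constant K comes from (standard derivation, using T_0 = p·T_P − W and taking T_S = W):

```latex
E = \frac{W}{p\,T_P} = \frac{W}{W + T_0(W,p)} = \frac{1}{1 + T_0/W}
\quad\Longrightarrow\quad
W = \frac{E}{1-E}\,T_0(W,p) = K\,T_0(W,p), \qquad K = \frac{E}{1-E}.
```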

Is Our Algorithm Scalable? T_0 = p·T_P − W = t_s·p·log p + t_w·n·p. We want W = K·T_0; try the two terms separately: W = K·t_s·p·log p, and W = K·t_w·n·p = n² ⇒ W = (K·t_w·p)². Bound from concurrency: p = O(n) ⇒ W = Ω(p²). So W = Θ(p²) is the asymptotic isoefficiency function: to maintain a fixed efficiency, the problem size must grow with p so that p = Θ(n).

Matrix*Vector 2-D. The n×n matrix is partitioned onto n×n processes; the n×1 vector is distributed in the last (or first) column. As before, we also want fewer processes: blocks of (n/√p)² elements.

Matrix*Vector 2-D. Processes in column i need the vector element x_i, initially stored in row i. Steps: 1. Distribute the vector onto the diagonal. 2. One-to-all broadcast along columns. 3. Local multiplication. 4. All-to-one sum reduction (+) along rows.

Example: Matrix*Vector (Program 6.8). Partition; build row and column sub-topologies; distribute the vector; local multiplication; sum-reduce on rows. A sketch of the pattern follows.
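
A hedged MPI sketch of the same pattern: row/column sub-communicators via MPI_Comm_split, broadcast down each column, sum-reduction along each row. The function name, data layout, and the assumptions that p = q² and n is divisible by q are illustrative; this is not Program 6.8 itself.

```c
#include <mpi.h>
#include <stdlib.h>

/* 2-D partitioned matrix-vector multiply on a q x q process grid (p = q*q).
 * Each process owns an (n/q) x (n/q) block of A (row-major).
 * Assumes the piece of x for a process column already sits on that column's
 * diagonal process, and collects y on the first process of each row. */
void matvec_2d(int n, int q, const double *A_block, const double *x_piece,
               double *y_piece, MPI_Comm comm)
{
    int rank, nb = n / q;
    MPI_Comm_rank(comm, &rank);
    int row = rank / q;                 /* process row index    */
    int col = rank % q;                 /* process column index */

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm);  /* same process row    */
    MPI_Comm_split(comm, col, row, &col_comm);  /* same process column */

    /* Step 2: one-to-all broadcast of the x piece down each process column,
     * rooted at the diagonal process (row == col, rank col in col_comm). */
    double *x = malloc(nb * sizeof(double));
    if (row == col)
        for (int i = 0; i < nb; i++) x[i] = x_piece[i];
    MPI_Bcast(x, nb, MPI_DOUBLE, col, col_comm);

    /* Step 3: local multiplication of the block with the x piece. */
    double *partial = calloc(nb, sizeof(double));
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++)
            partial[i] += A_block[i * nb + k] * x[k];

    /* Step 4: all-to-one sum reduction along each process row. */
    MPI_Reduce(partial, y_piece, nb, MPI_DOUBLE, MPI_SUM, 0, row_comm);

    free(x); free(partial);
    MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
}
```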

Which one is better? 1-D or 2-D?

Analysis. Communications: one-to-one Θ(1) + one-to-all broadcast Θ(log n) + all-to-one reduction Θ(log n), plus multiplications Θ(1), so T_P = Θ(log n). With n² processes, p·T_P = Θ(n²·log n) ≠ Θ(W): not cost-optimal. Brent's scheduling principle?

Using Fewer Processes. Blocks of (n/√p)² elements. Costs: one-to-one in t_s + t_w·n/√p, one-to-all broadcast in (t_s + t_w·n/√p)·log p, all-to-one reduction in (t_s + t_w·n/√p)·log p, computation in (n/√p)². Total ≈ n²/p + t_s·log p + (t_w·n/√p)·log p. Hence p·T_P = Θ(n²): cost-optimal.

Scalability Analysis. T_0 = p·T_P − W = t_s·p·log p + t_w·n·√p·log p. As before, isoefficiency analysis term by term: W = K·t_s·p·log p, and W = K·t_w·n·√p·log p = n² ⇒ W = (K·t_w·√p·log p)². Bound from concurrency: p = O(n²) ⇒ W = Ω(p). Overall W = Θ(p·log²p). As a function of n: p·log²p = Θ(n²) ⇒ p = Θ(n²/log²n).

Which One Is Better? Parallel time: 1-D: T_P ≈ n²/p + t_s·log p + t_w·n; 2-D: T_P ≈ n²/p + t_s·log p + (t_w·n/√p)·log p. Isoefficiency: 1-D: W = Θ(p²); 2-D: W = Θ(p·log²p). Degree of concurrency: up to n processes for 1-D versus n² for 2-D.

Block Matrix*Matrix. The matrix is viewed as q×q blocks of (n/q)×(n/q) submatrices; A[i,k], B[k,j], C[i,j] below denote blocks. The algorithm still performs n³ additions and multiplications in total:

procedure BLOCK_MAT_MULT(A, B, C)
  for i := 0 to q-1 do
    for j := 0 to q-1 do
      C[i,j] := 0
      for k := 0 to q-1 do
        C[i,j] := C[i,j] + A[i,k]*B[k,j]
      done
    done
  done
endproc
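
A plain C rendering of the block formulation, as a sketch: it assumes n divisible by q, row-major n×n arrays, and a zero-initialized C. Same n³ multiply-adds as the naive loop; only the loop order changes.

```c
/* Blocked matrix multiply: C += A*B with q x q blocks of size nb = n/q.
 * Sketch assuming n % q == 0, row-major storage, and C initialized to zero. */
void block_mat_mult(int n, int q, const double *A, const double *B, double *C)
{
    int nb = n / q;
    for (int bi = 0; bi < q; bi++)
        for (int bj = 0; bj < q; bj++)
            for (int bk = 0; bk < q; bk++)
                /* Block update: C[bi,bj] += A[bi,bk] * B[bk,bj] */
                for (int i = bi * nb; i < (bi + 1) * nb; i++)
                    for (int j = bj * nb; j < (bj + 1) * nb; j++)
                        for (int k = bk * nb; k < (bk + 1) * nb; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```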

A Simple Parallel Algorithm. Map the block algorithm onto p = q² processes, one block C[i,j] per process. To compute C[i,j], a process needs all blocks A[i,k] and B[k,j]. Steps: all-to-all broadcast of the A blocks within rows, all-to-all broadcast of the B blocks within columns, then local multiplications.

Analysis. Costs: two all-to-all broadcasts among groups of √p processes with messages of n²/p elements each = 2·(t_s·log √p + t_w·(n²/p)·(√p − 1)), plus computation = √p multiplications of (n/√p)×(n/√p) matrices, i.e., n³/p in total. p·T_P = Θ(n³) for p = O(n²) ⇒ cost-optimal. Isoefficiency: W = Θ(p^{3/2}). Drawback: total memory requirement of Θ(n²·√p). Can we do better?

Cannon's Algorithm. Idea: re-schedule the computations to avoid memory contention. At any time, the processes in row i each hold a different block A[i,k], and the processes in column j each hold a different block B[k,j]. The blocks are rotated, so each process needs only two submatrices (one of A, one of B) at any time → memory efficient, Θ(n²) in total.
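
A hedged MPI sketch of Cannon's shifting pattern; the wraparound shifts use a periodic Cartesian communicator. The function name and the assumption that each process already holds its nb×nb blocks (row-major, C starting at zero) are illustrative.

```c
#include <mpi.h>

/* Cannon's algorithm on a q x q periodic process grid (p = q*q).
 * Each process holds one nb x nb block of A, B, and C (nb = n/q). */
void cannon(int nb, int q, double *A, double *B, double *C, MPI_Comm comm)
{
    int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2];
    int rank, src, dst;
    MPI_Comm grid;
    MPI_Status st;

    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);   /* coords[0]=row, coords[1]=col */

    /* Initial alignment: shift A left by its row index,
     * B up by its column index (with wraparound). */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);

    for (int step = 0; step < q; step++) {
        /* Local multiply-accumulate: C += A * B on the current blocks. */
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++)
                for (int k = 0; k < nb; k++)
                    C[i * nb + j] += A[i * nb + k] * B[k * nb + j];

        /* Single-step shifts: A one block left, B one block up. */
        MPI_Cart_shift(grid, 1, -1, &src, &dst);
        MPI_Sendrecv_replace(A, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(B, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);
    }
    MPI_Comm_free(&grid);
}
```

A complete implementation would also shift A and B back afterwards if the input blocks must be returned to their original owners.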

Align A & B [figure].

Analysis. Costs: √p single-step shifts for each of A and B = 2·(t_s + t_w·n²/p)·√p, plus √p multiplications of (n/√p)×(n/√p) submatrices = n³/p. Cost-optimal, with the same isoefficiency function as before.

The DNS Algorithm: 3-D partitioning! Picture a cube whose faces correspond to A, B, and C; internal node P_{i,j,k} performs the multiplication a_{ik}·b_{kj}. Multiplications take time Θ(1), additions (reductions) take time Θ(log n), plus communication. It can use up to n³ processes → better concurrency.

[Figure: the DNS process cube, with axes i, j, k.]

Communication Steps. Move the columns of A and the rows of B into the cube, one-to-all broadcast along the j and i axes, then all-to-one sum reduction (+) along the k axis. Each communication involves groups of n processes and takes time Θ(log n). Not cost-optimal for n³ processes (see below).
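
In numbers, assuming n³ processes as on the slide:

```latex
W = \Theta(n^3), \qquad T_P = \Theta(\log n), \qquad
p\,T_P = n^3 \cdot \Theta(\log n) = \Theta(n^3 \log n) \neq \Theta(W).
```

Brent's scheduling principle on the next slides shows how to recover cost-optimality with p = O(n³/log³n) processes.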

Brent's Scheduling Principle. Theorem: if a parallel computation consists of k phases, where phase i takes time t_i and uses a_i processors (i = 1, …, k), then the computation can be done in time O(a/p + t) using p processors, where t = Σ_i t_i and a = Σ_i a_i·t_i.

Look At One Dimension (the reduction along one axis). k = log n phases, each of constant time t_i, using a_i = n/2, n/4, …, 1 processors. With q processors, Brent's principle gives time O(log n + n/q). Choosing q = O(n/log n) yields time O(log n), and this is optimal! In 3-D: use p = q³ = O(n³/log³n) processes.

Systems of Linear Equations: A x = b, i.e.

a_{0,0} x_0 + a_{0,1} x_1 + … + a_{0,n-1} x_{n-1} = b_0
⋮
a_{n-1,0} x_0 + a_{n-1,1} x_1 + … + a_{n-1,n-1} x_{n-1} = b_{n-1}

Solving Systems of Linear Equations. Step 1: reduce the original system to an upper-triangular system U x = y. Step 2: solve by back-substitution from x_{n-1} down to x_0.
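
In formulas, step 2 computes, for the upper-triangular system U x = y:

```latex
x_{n-1} = \frac{y_{n-1}}{u_{n-1,n-1}}, \qquad
x_i = \frac{1}{u_{ii}}\Bigl(y_i - \sum_{j=i+1}^{n-1} u_{ij}\,x_j\Bigr),
\quad i = n-2, \dots, 0 .
```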

Technical Issues. The matrix must be non-singular. For numerical precision (is the solution numerically stable?), rows or columns are permuted; in particular, no division by zero, thanks. The procedure is known as Gaussian elimination with partial pivoting.

Gaussian Elimination [figure: the elimination algorithm]. W = 2n³/3.
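
A compact serial sketch of the elimination step, with partial pivoting implemented as row interchanges (one common variant); the function name and the row-major layout are assumptions, not the book's program. The elimination loops account for the W ≈ 2n³/3 multiply-add operations on the slide.

```c
#include <math.h>

/* Gaussian elimination with partial pivoting on the system A x = b,
 * reducing A in place to upper-triangular U (and b to y).
 * A is row-major n x n; returns 0 on success, -1 if A is singular. */
int gaussian_eliminate(int n, double *A, double *b)
{
    for (int k = 0; k < n; k++) {
        /* Partial pivoting: pick the largest |A[i,k]|, i >= k, as pivot row. */
        int piv = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i * n + k]) > fabs(A[piv * n + k])) piv = i;
        if (A[piv * n + k] == 0.0) return -1;          /* singular matrix */
        if (piv != k) {                                /* swap rows k and piv */
            for (int j = 0; j < n; j++) {
                double t = A[k * n + j];
                A[k * n + j] = A[piv * n + j];
                A[piv * n + j] = t;
            }
            double t = b[k]; b[k] = b[piv]; b[piv] = t;
        }
        /* Eliminate column k below the pivot. */
        for (int i = k + 1; i < n; i++) {
            double m = A[i * n + k] / A[k * n + k];
            A[i * n + k] = 0.0;
            for (int j = k + 1; j < n; j++)
                A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    return 0;
}
```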

Parallel Gaussian Elimination. 1-D partitioning: one process per row; process i stores and updates row A[i,*]. Cost (computation + communication) = Θ(n³·log n) ⇒ not cost-optimal. All processes work on the same iteration: iteration k+1 starts only when iteration k is complete. Improvement: a pipelined/asynchronous version.

Pipelined Version [figure]: P_k forwards and does not wait; the P_j's forward and do not wait either.

Pipelined Gaussian Elimination. No log n factor in the communication (no broadcast), and the rest of the computation is the same, so the pipelined version is cost-optimal. With fewer processes: block 1-D partitioning loses efficiency because of idle processes (load-balance problem); cyclic 1-D partitioning is better.

Gaussian Elimination with 2-D Partitioning. Similar to before: the pipelined version is cost-optimal, and it is more scalable than 1-D.

Finally: Back-Substitution. An intrinsically serial algorithm; the pipelined parallel version is not cost-optimal. This does not matter, because its work is of a lower order of magnitude than the elimination (Θ(n²) vs. Θ(n³)).
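
For completeness, a serial back-substitution sketch (Θ(n²) operations), under the same row-major assumptions as the elimination sketch above:

```c
/* Back-substitution for the upper-triangular system U x = y.
 * U is a row-major n x n matrix with a nonzero diagonal. */
void back_substitute(int n, const double *U, const double *y, double *x)
{
    for (int i = n - 1; i >= 0; i--) {
        double s = y[i];
        for (int j = i + 1; j < n; j++)
            s -= U[i * n + j] * x[j];      /* subtract already-known terms */
        x[i] = s / U[i * n + i];
    }
}
```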