Weblog analysis on Hadoop

Topics of the talk
  The life cycle of a Hadoop job
  The Weblog project

What is Hadoop?
  Hadoop is an implementation of a parallel programming scheme.

The programming scheme: MapReduce

MapReduce

The distributed storage: HDFS
  Source: developer.yahoo.com/hadoop/tutorial

The life cycle of a Hadoop job
  Source: http://infosys.cs.uni-saarland.de/publications/dqj+10crv2.pdf

The life cycle of a Hadoop job

Pluggable elements
  Implement Java interfaces or subclass the provided base classes
  The chosen implementation must be specified in the job configuration,
  e.g.: conf.setMapperClass(Map1.class);

Pluggable elements
  Mapper: map(key_in, value_in, output_collector)
  Reducer: reduce(key_in, list_of_values, output_collector)
  Combiner = Reducer
    same interface, different role
    the number of invocations is not guaranteed
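The map, reduce and combiner roles above can be illustrated with a small in-memory sketch. It is plain Python, not the Hadoop API; the helper run_job and the word-count style mapper are hypothetical names chosen for this illustration. The point it tries to show is that a combiner has to leave the final result unchanged whether or not it runs.

```python
from collections import defaultdict

def run_job(records, mapper, reducer, combiner=None):
    """Run one map/shuffle/reduce round in memory (illustrative, not Hadoop)."""
    groups = defaultdict(list)
    for key_in, value_in in records:
        for key, value in mapper(key_in, value_in):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):             # keys reach the reducer in sorted order
        values = groups[key]
        if combiner is not None:
            # A real framework may call the combiner zero or more times,
            # so the job must not rely on it having run.
            values = [v for _, v in combiner(key, values)]
        output.extend(reducer(key, values))
    return output

def count_mapper(key_in, line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word, 1

def sum_reducer(key, values):
    # Summation is associative and commutative, so it can double as a combiner.
    yield key, sum(values)

lines = [(0, "get post get"), (1, "get")]
with_combiner = run_job(lines, count_mapper, sum_reducer, combiner=sum_reducer)
without_combiner = run_job(lines, count_mapper, sum_reducer)
```

Both runs give the same counts here because summing partial sums is the same as summing the raw values; a reducer without that property could not be reused as a combiner.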

Pluggable elements
  Type of key and value: Writable
    readFields(input_stream)
    write(output_stream)
  Type of key: WritableComparable
    compareTo(other)
  Optional: WritableComparator
    compares two items in their serialized form
    must be registered! (WritableComparator.define(...))
    used for sorting and grouping
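A minimal sketch of why comparing keys in serialized form can work, assuming a hypothetical composite key of two unsigned integers. This is plain Python that mimics the write / readFields / compareTo roles; it is not the Hadoop Writable API, and the class and function names are invented for the example.

```python
import struct
from functools import cmp_to_key

class DateUserKey:
    """Hypothetical composite key (day number, user id) in Writable style."""
    FORMAT = ">II"  # two big-endian unsigned 32-bit integers

    def __init__(self, day=0, user=0):
        self.day, self.user = day, user

    def write(self, out):
        # Serialize the fields into the output buffer.
        out.extend(struct.pack(self.FORMAT, self.day, self.user))

    def read_fields(self, data):
        # Restore the fields from their serialized form.
        self.day, self.user = struct.unpack(self.FORMAT, data)

    def compare_to(self, other):
        a, b = (self.day, self.user), (other.day, other.user)
        return (a > b) - (a < b)

def raw_compare(left, right):
    """Compare two serialized keys without deserializing them.
    Valid here because big-endian unsigned integers sort like their bytes."""
    return (left > right) - (left < right)

def serialize(key):
    buf = bytearray()
    key.write(buf)
    return bytes(buf)

def deserialize(data):
    key = DateUserKey()
    key.read_fields(data)
    return key

keys = [DateUserKey(2, 1), DateUserKey(1, 300), DateUserKey(1, 7)]
raw_sorted = sorted((serialize(k) for k in keys), key=cmp_to_key(raw_compare))
ordered = [(k.day, k.user) for k in map(deserialize, raw_sorted)]
```

The byte-wise order matches the object order only because of the chosen encoding; signed or little-endian fields would need a smarter raw comparison.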

Pluggable elements
  InputFormat
    splits the input logically among the map tasks
    reads the records
  OutputFormat
    writes the records

Pluggable elements
  Built-in formats:
    TextInputFormat/OutputFormat
    SequenceFileInputFormat/OutputFormat
    DBInputFormat/OutputFormat

Pluggable elements
  Partitioner
    distributes the keys among the reduce tasks
    getPartition(key, value, numPartitions)
    Default: HashPartitioner
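The idea behind hash partitioning can be sketched as follows. This is a Python illustration rather than Hadoop code: stable_hash is a hypothetical stand-in for Java's hashCode(), and the sample keys are invented. The mask-then-modulo step follows the shape of the default hash partitioner.

```python
import zlib

def stable_hash(key):
    """Deterministic stand-in for Java's hashCode() (an assumption of this sketch)."""
    return zlib.crc32(repr(key).encode("utf-8"))

def get_partition(key, value, num_partitions):
    # Mask to a non-negative 31-bit integer, then take it modulo the
    # number of reduce tasks; the value plays no role in the decision.
    return (stable_hash(key) & 0x7FFFFFFF) % num_partitions

records = [
    (("2010-09-01", "alice"), 1),
    (("2010-09-01", "bob"), 1),
    (("2010-09-01", "alice"), 1),
]
partitions = [get_partition(k, v, 33) for k, v in records]
```

Equal keys always land in the same partition, which is what lets a single reduce task see every value of a key.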

The Weblog project

The database today
  Diagram labels: log file, preprocessing, buffer table, star schema, statistics, reporting, web servers, relational database

Similar databases (daily averages)
                        Weblog1                          IT-log                             Weblog2
period                  September 2010                   November 2010                      October 2008
Raw events
number of events        1.3 million                      3 million                          ~60 million
compressed file         0.04 GB gzip2                    1.8 GB gzip2                       17 GB bzip2
uncompressed file       ~0.77 GB                         ~32 GB                             ~115 GB
Preprocessed, filtered events
number of events        0.7 million                      3 million                          ~30 million
uncompressed file       0.25 GB                          1.8 GB                             ~11 GB
compressed file
Oracle table            0.4 GB compressed basic          1.8 GB compressed basic            ~17 GB
Processing
preprocessing           -                                                                   ~1:15 h
dimension update        30 min                                                              ~1 h
fact table update       10 min                                                              ~1:30 h
number of statistics    10                               5                                  ~8
statistics update       20 min                                                              ~3 h
Environment
architecture            Intel Pentium 2.9 GHz dual core  2 x Intel Xeon 2.27 GHz quad core  4 processors, some Sun server, circa 2004
memory                  8 GB                             18 GB                              ~4 GB
disk                    RAID                             ~2.6 TB RAID                       ~2 TB SCSI
operating system        Debian Linux                     CentOS Linux                       AIX
database                Oracle 10                        Oracle 11.2                        Oracle 9

Problematic elements
  Diagram labels: log file, preprocessing, buffer table, star schema, statistics, reporting, web servers, relational database

The plan
  Diagram labels: Hadoop, log file, preprocessing, buffer table, star schema, statistics, reporting, web servers, relational database

The plan
  Statistics: on Hadoop
    a single, large table
    MapReduce jobs instead of SQL queries
  Pays off for data volumes larger than the Weblog1 data

The test task
  select date, source_type, host, http_status_code, http_method, ip_type,
         sum(1),
         sum(decode(agent_type, 'ROBOT', 1, 0)),
         sum(decode(agent_type, 'STANDARD_BROWSER', 1, 0)),
         count(distinct user),
         count(distinct ip),
         count(distinct resource),
         count(distinct referrer_resource),
         count(distinct referrer_host)
  from <joins>
  group by date, source_type,
           cube(host, http_status_code, http_method, ip_type)

The test task
  SQL modified for testing purposes
    contains no join
  Data source
    Weblog1, main fact table, uncompressed preprocessed data
    600 days
    34 GB
  Hadoop configuration: 33 nodes
  Run times on the Oracle database
    Original SQL: 17h 50m
    Modified SQL: 15h 10m

A first attempt at a solution: what is MapReduce good for?

Sum('ROBOT') group by date
  Map: (date, agent_type) -> if agent_type = 'robot': (date, 1)
  Reduce: (date, 1), (date, 1), (date, 1) -> (date, 3)
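The filtered sum above can be sketched in memory as follows. It is a Python illustration, not Hadoop code: the run helper and the sample log records are hypothetical, and unlike Hadoop the helper does not sort the keys.

```python
from collections import defaultdict

def run(records, mapper, reducer):
    """One in-memory map/shuffle/reduce round; returns (key, value) pairs."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

def map_robot(record):
    # Emit (date, 1) only for robot hits; other records produce nothing.
    date, agent_type = record
    if agent_type == "robot":
        yield date, 1

def reduce_sum(date, ones):
    yield date, sum(ones)

log = [
    ("2010-09-01", "robot"),
    ("2010-09-01", "browser"),
    ("2010-09-01", "robot"),
    ("2010-09-01", "robot"),
    ("2010-09-02", "robot"),
]
robot_hits = run(log, map_robot, reduce_sum)
```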

Count(distinct user) group by date
  Map1: (date, user) -> (date, user, 1)
  Reduce1: (date, user, 1), (date, user, 1), (date, user, 1) -> (date, user, 1)
  Map2: (date, user, 1) -> (date, 1)
  Reduce2: (date, 1), (date, 1), (date, 1), (date, 1) -> (date, 4)
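A distinct count needs two rounds in this scheme: the first collapses duplicates, the second counts what is left. The sketch below illustrates that in plain Python under the assumption of a tiny hypothetical log; the run helper is not part of Hadoop.

```python
from collections import defaultdict

def run(records, mapper, reducer):
    """One in-memory map/shuffle/reduce round; returns (key, value) pairs."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

def map1(record):
    date, user = record
    yield (date, user), 1

def reduce1(key, ones):
    # However many times the pair occurred, emit it exactly once.
    yield key, 1

def map2(record):
    (date, user), one = record
    yield date, one          # drop the user, keep one marker per distinct user

def reduce2(date, ones):
    yield date, sum(ones)

log = [("d1", "u1"), ("d1", "u1"), ("d1", "u2"), ("d2", "u1"), ("d1", "u3")]
distinct_pairs = run(log, map1, reduce1)
users_per_day = run(distinct_pairs, map2, reduce2)
```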

Sum('ROBOT'), Count(distinct user) group by date
  Map1: (date, agent_type, user) ->
    stat_id=1, if agent_type = 'robot': (stat_id=1, date, null, 1)
    stat_id=2: (stat_id=2, date, user, 1)
  Reduce1:
    stat_id=1: three inputs (stat_id=1, date, null, 1) -> (stat_id=1, date, null, 3)
    stat_id=2: four inputs (stat_id=2, date, user, 1) -> (stat_id=2, date, user, 1)

Sum('ROBOT'), Count(distinct user) group by date
  Map2: the value becomes one column per statistic (stat_id=1, stat_id=2)
    (stat_id=1, date, null, 3) -> (date, 3, 0)
    (stat_id=2, date, user, 1) -> (date, 0, 1)
  Reduce2: (date, 0, 1), (date, 3, 0), (date, 0, 1) -> (date, 3, 2)
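Computing both statistics in the same pair of jobs relies on tagging every intermediate record with a statistic id and, in the second round, turning each value into a row that is zero everywhere except in its own column. A hedged Python sketch of that idea, with a hypothetical run helper and invented sample data:

```python
from collections import defaultdict

def run(records, mapper, reducer):
    """One in-memory map/shuffle/reduce round; returns (key, value) pairs."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

def map1(record):
    date, agent_type, user = record
    if agent_type == "robot":
        yield (1, date, None), 1      # stat_id=1: robot hits
    yield (2, date, user), 1          # stat_id=2: distinct users

def reduce1(key, ones):
    stat_id = key[0]
    # Sum for the robot count, collapse duplicates for the distinct count.
    yield key, (sum(ones) if stat_id == 1 else 1)

def map2(record):
    (stat_id, date, _), value = record
    row = [0, 0]
    row[stat_id - 1] = value          # zeros except in this statistic's column
    yield date, row

def reduce2(date, rows):
    yield date, [sum(column) for column in zip(*rows)]

log = [("d1", "robot", "u1"), ("d1", "robot", "u1"),
       ("d1", "robot", "u2"), ("d1", "browser", "u2")]
stats = run(run(log, map1, reduce1), map2, reduce2)
```

Because every row has the same shape, the second reducer needs no knowledge of the individual statistics; it only adds columns.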

Count(distinct user) group by date, cube(host, http_status_code)
  Map1: (date, host, http_sc, user) ->
    (date, host, http_sc, user, 1)
    (date, *, http_sc, user, 1)
    (date, host, *, user, 1)
    (date, *, *, user, 1)
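The cube expansion in the mapper amounts to emitting one key per subset of the cube dimensions, with the dimensions outside the subset replaced by a wildcard. A small Python sketch, assuming "*" as the wildcard marker and a hypothetical sample record:

```python
from itertools import product

def cube_keys(dims):
    """Yield all 2**len(dims) variants of dims with any subset replaced by '*'."""
    for mask in product((False, True), repeat=len(dims)):
        yield tuple("*" if hide else d for d, hide in zip(dims, mask))

def map1(record):
    date, host, http_sc, user = record
    for host_key, sc_key in cube_keys((host, http_sc)):
        yield (date, host_key, sc_key, user), 1

emitted = list(map1(("d1", "example.org", 200, "u1")))
four_dimensions = list(cube_keys(("example.org", 200, "GET", "v4")))
```

With two cube dimensions each input record turns into four outputs; with four dimensions it turns into sixteen, which is why the intermediate data grows so quickly.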

The complete MapReduce job
  Map1: 128 outputs
    for every statistic (8)
    for every subset of the cube dimensions (2^4 = 16)
  Reduce1: differs per statistic
  Map2: converts to a summable format
    Key: the group by dimensions
      one key = one row of the output
    Value: an array
      zeros except for the value of the given statistic
  Reduce2: sums

Test results I.
                        Map1            Reduce1         Map2            Reduce2         Total
Number of Tasks         271             33              990             33
Data read from disk     912 GB          595 GB          119 GB          113 GB          1740 GB
Data written to disk    1260 GB         595 GB          227 GB          113 GB          2196 GB
Input bytes             34 GB           365 GB          61.5 GB         111 GB          572 GB
Output bytes            365 GB          61.5 GB         111 GB          129 MB          538 GB
Input records           88.7 million    9 877 million   1 286 million   1 286 million   5 477 million
Output records          9 877 million   1 286 million   1 286 million   3.27 million    12 451 million
Time                    1h 17m          42m             14m 12s         7m 19s          2h 21m

A second attempt: what does not need MapReduce?

Map
  Input: (date, source_type, host, http_status_code, http_method, ip_type, agent_type, user, ip, resource, referrer_resource, referrer_host)
  Outputs: copies of the record with cube dimensions replaced by *; the fields agent_type, user, ip, resource, referrer_resource, referrer_host (written as "values" here) are carried along unchanged
    (date, source_type, host, http_status_code, http_method, ip_type, values)
    (date, source_type, *, http_status_code, http_method, ip_type, values)
    (date, source_type, host, *, http_method, ip_type, values)
    (date, source_type, *, *, *, *, values)

Reduce
  Input: all records of one key, e.g. (date, source_type, host, *, *, ip_type), with the values
    (agent_type1, user1, ip1, resource1, referrer_resource1, referrer_host1)
    (agent_type2, user2, ip2, resource2, referrer_resource2, referrer_host2)
    (agent_type3, user3, ip3, resource3, referrer_resource3, referrer_host3)
    (agent_type_n, user_n, ip_n, resource_n, referrer_resource_n, referrer_host_n)
  Output: one row for the key (date, source_type, host, *, *, ip_type) with
    sum(1), sum('robot'), sum('browser'),
    count(distinct user), count(distinct ip), count(distinct resource),
    count(distinct referrer_resource), count(distinct referrer_host)
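The single-job variant can be sketched as follows: the mapper only does the cube expansion, and the reducer computes all eight statistics of one output row in memory. This is an illustrative Python sketch, not the presenter's code; the run helper, the sample records and the use of sets for the distinct counts are assumptions of the example.

```python
from collections import defaultdict
from itertools import product

def run(records, mapper, reducer):
    """One in-memory map/shuffle/reduce round; returns (key, value) pairs."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

def map_cube(record):
    date, source_type, host, status, method, ip_type, *measures = record
    # One output per subset of the four cube dimensions (16 in total).
    for mask in product((False, True), repeat=4):
        cube = tuple("*" if hide else d
                     for d, hide in zip((host, status, method, ip_type), mask))
        yield (date, source_type) + cube, tuple(measures)

def reduce_stats(key, rows):
    total = robots = browsers = 0
    # One set per distinct count: user, ip, resource, referrer_resource, referrer_host.
    distinct = [set() for _ in range(5)]
    for agent_type, *fields in rows:
        total += 1
        robots += agent_type == "ROBOT"
        browsers += agent_type == "STANDARD_BROWSER"
        for seen, field in zip(distinct, fields):
            seen.add(field)
    yield key, (total, robots, browsers, *(len(s) for s in distinct))

log = [
    ("d1", "web", "h1", 200, "GET", "v4", "ROBOT", "u1", "ip1", "/a", "/r", "rh1"),
    ("d1", "web", "h1", 404, "GET", "v4", "STANDARD_BROWSER", "u2", "ip1", "/b", "/r", "rh2"),
]
result = dict(run(log, map_cube, reduce_stats))
```

The trade-off is memory: the sets in the reducer must hold every distinct value of a group, which is the price for avoiding the second job and the per-statistic fan-out.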

The results...

Test results II.
                        Map             Reduce          Total
Number of Tasks         271             33              33
Data read from disk     162 GB          104 GB          266 GB
Data written to disk    261 GB          104 GB          364 GB
Input bytes             34 GB           101 GB          135 GB
Output bytes            101 GB          129 MB          101 GB
Input records           88.7 million    1 419 million   1 507 million
Output records          1 419 million   3.28 million    1 422 million
Time                    11m 23s         4m 43s          16m 06s

Summary
                        More MapReduce  More memory
Data read from disk     1740 GB         266 GB
Data written to disk    2196 GB         364 GB
Time                    2h 21m          16m 06s
  Use the MapReduce scheme only for what it is strictly necessary for!
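The gap between the two attempts can be put into ratios with a few lines of arithmetic. The sketch below uses only the totals reported on the slides; the variable names are hypothetical. The Oracle figure is the 15h 10m run time of the modified SQL and comes from different hardware, so that ratio is only a rough orientation.

```python
def seconds(hours=0, minutes=0, secs=0):
    return hours * 3600 + minutes * 60 + secs

# Totals of the two-job attempt ("more MapReduce").
first = {"read_gb": 1740, "written_gb": 2196, "time_s": seconds(hours=2, minutes=21)}
# Totals of the single-job attempt ("more memory").
second = {"read_gb": 266, "written_gb": 364, "time_s": seconds(minutes=16, secs=6)}
# Modified SQL on the Oracle database (different hardware).
oracle_modified_s = seconds(hours=15, minutes=10)

ratios = {name: first[name] / second[name] for name in first}
oracle_vs_second = oracle_modified_s / second["time_s"]
```

The second attempt reads and writes roughly six times less data and finishes in under an eighth of the time.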

Thank you for your attention!
  Gosztonyi Balázs
  gosztonyi@ilab.sztaki.hu