Technical Report Series on Corpus Building




Vol. 5 (April 2013): Hungarian Corpora
Uwe Quasthoff, Dirk Goldhahn, Zita Hollós
Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig

Affiliation of the authors:
Uwe Quasthoff, Dirk Goldhahn: Institut für Informatik, Universität Leipzig, {quasthoff, dgoldhahn}@informatik.uni-leipzig.de
Zita Hollós: Károli Gáspár Református Egyetem (Budapest), hollos.zita@kre.hu

Copyright: Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig, http://asv.informatik.uni-leipzig.de/

Technical Report Series on Corpus Building
Vol. 1: Deutscher Wortschatz 2013
Vol. 2: Danish Corpora
Vol. 3: Dutch Corpora
Vol. 4: Icelandic Corpora
Vol. 5: Hungarian Corpora

This PDF document was created using the open source tool mwlib. For more information, see http://code.pediapress.com/
PDF generated at: Tue, 15 May 2013

Contents

Hungarian corpora
  Introduction to corpus creation
  HUN - a processing related language description
  HUN corpora
  HUN corpus comparison

The appendices are titled "Appendix to hun <corpus>: <appendix type>" and are given once per corpus. "All corpora" below means hun news 2007, 2008, 2009, 2010 and 2011, hun newscrawl 2011, hun wikipedia 2007, hun wikipedia 2012, hun web 2003, hun web 2011 and hun mixed 2012.

Processing details
  Database summary (all corpora)
Content details
  Size of different TLDs (all corpora except the two Wikipedia corpora)
  Size of largest domains (all corpora except the two Wikipedia corpora)
  Number of sources by time period (hun news 2007 to 2011 only)
Word details
  Words by length without multiplicity (all corpora)
  Words by length with multiplicity (all corpora)
  The most frequent 50 words (all corpora)
  Longest words in top-1.000 by rank (all corpora)
Character N-gram details
  Alphabet as used in the top-100.000 words (all corpora)
Abbreviation details
  Most frequent abbreviations (all corpora)
  Left neighbors of the full stop (all corpora)
  Left neighbors of the full stop with additional internal full stops (all corpora)
Sentences details
  Shortest sentences (all corpora)
  Longest sentences (all corpora)
  Length of sentences in characters (all corpora)
  Length of sentences in words (all corpora)
Oddities details
  Longest words (all corpora)
  Sentences with high average word length (all corpora)
  Problems with sentence segmentation - words ending in a stopword (all corpora)

Hungarian corpora

Introduction to corpus creation

The Leipzig Corpora Collection (LCC) collects Web-based corpora for many different languages. The main text genres are newspaper texts, Wikipedias and randomly collected web pages. All corpora are processed in the same way:

- Crawling Web pages
- HTML stripping
- Language identification
- Sentence segmentation
- Cleaning: removal of ill-formed sentences
- Duplicate removal
- Calculation of word frequencies and word co-occurrences

As a result we have a corpus containing only well-formed sentences in the language under consideration. The sentences are in random order; hence, sharing the corpus does not violate copyright law because it is impossible to reconstruct the original texts.

The pre-processing contains both language-independent steps (like HTML stripping and duplicate removal) and language-dependent steps (like language identification and sentence segmentation). The language-specific parts in particular are vulnerable to specific processing problems. The aim of the paper is to identify possible problems and evaluate the results. The following topics are addressed:

- A processing-focused language description
- Language size: How much text is available for this language? What are the biggest sources?
- Corpus description: genre, size, crawling and processing date.
- Possible problems in language identification: Which languages are similar?
- Character set and alphabet
- Inspecting the word list: most frequent words, longer high-frequency words and the longest words overall. Word length distribution.
- Can abbreviations confuse sentence segmentation? Information about the abbreviation list.
- Inspecting sentences: inspect the shortest and longest sentences to identify possible segmentation problems. Sentence length distribution.

The paper describes the results of these inspections; the appendices show the exact results for the different corpora. This helps to compare the corpora with respect to quality. In the section Quality Overview, an overall quality description for each corpus is given. All corpora contain only minor problems which are irrelevant for most applications. Where larger problems were found, the corpus creation was repeated.
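The later pipeline steps (cleaning, duplicate removal, random ordering and word frequency counting) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LCC implementation: the function names, the well-formedness heuristic and the whitespace tokenization are hypothetical, and word co-occurrences are left out.

```python
import random
from collections import Counter


def looks_well_formed(sentence, max_chars=256):
    """Assumed heuristic: non-empty, at most max_chars characters (not bytes),
    starts with an uppercase letter and ends with sentence punctuation."""
    s = sentence.strip()
    return 0 < len(s) <= max_chars and s[0].isupper() and s[-1] in ".!?"


def build_corpus(sentences, seed=0):
    """Clean, deduplicate and shuffle sentences, then count word frequencies."""
    cleaned = [s.strip() for s in sentences if looks_well_formed(s)]
    # dict.fromkeys keeps the first occurrence of every sentence
    unique = list(dict.fromkeys(cleaned))
    # Random order, so that the original texts cannot be reconstructed
    random.Random(seed).shuffle(unique)
    # Naive whitespace tokenization; punctuation stays attached to the words
    word_freq = Counter(w for s in unique for w in s.split())
    return unique, word_freq


raw = ["Ez egy mondat.", "Ez egy mondat.", "rossz mondat", "Ez is jó!"]
corpus, freq = build_corpus(raw)
```

Here the duplicate and the lowercase fragment are dropped, so two sentences remain.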

HUN - a processing related language description

General properties of the language

Native name: Magyar
Classification: Uralic
Total number of speakers: 12.5M
Largest countries with number of speakers: Hungary (9.5M), Romania (1.5M), Serbia (0.5M), Slovakia (0.5M)
Source: http://www.ethnologue.org/show_language.asp?code=hun

Processing summary

- Latin alphabet with some additional characters
- The full stop is used as sentence boundary and for abbreviations
- Apostrophes are used very rarely

Properties important for processing

Alphabet and punctuation

Alphabet: A Á B C Cs D Dz Dzs E É F G Gy H I Í J K L Ly M N Ny O Ó Ö Ő P (Q) R S Sz T Ty U Ú Ü Ű V (W) (X) (Y) Z Zs
Characters in parentheses appear only in loanwords and proper names. The digraphs and trigraphs above are considered single letters.
Punctuation: usual Latin punctuation
Source: http://de.wikipedia.org/wiki/Ungarische_Sprache#Alphabet

Due to keyboard limitations, some authors use incorrect alternative characters:

- ô and õ instead of ő (up to 10% each)
- û instead of ű (up to 20%)

Usage of uppercase letters: at sentence beginnings and for proper names. Titles like doktor are often written in lowercase.
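The alternative characters listed above can be measured and, if desired, mapped back to the correct letters. The following Python sketch is an assumption about how this could be done, not a description of the LCC processing chain; the names ALT_CHARS, normalize_hungarian and alt_char_share are hypothetical. The mapping is only safe for text already identified as Hungarian, since ô, õ and û are legitimate letters in other languages.

```python
import unicodedata
from collections import Counter

# Only the substitutions named in the text, plus their uppercase forms
ALT_CHARS = {"ô": "ő", "õ": "ő", "û": "ű", "Ô": "Ő", "Õ": "Ő", "Û": "Ű"}
_TABLE = str.maketrans(ALT_CHARS)


def normalize_hungarian(text):
    """Replace the incorrect alternative characters by ő and ű."""
    # NFC first, so that letters written with combining accents are composed
    return unicodedata.normalize("NFC", text).translate(_TABLE)


def alt_char_share(text, correct, alternatives):
    """Share of alternative characters among all occurrences of the letter."""
    counts = Counter(unicodedata.normalize("NFC", text).lower())
    alt = sum(counts[c] for c in alternatives)
    total = alt + counts[correct]
    return alt / total if total else 0.0


fixed = normalize_hungarian("erõs fûtõ")
share = alt_char_share("ő õ õ ô", "ő", "ôõ")
```

A share computed per corpus in this way would be comparable to the "up to 10%" and "up to 20%" figures quoted above.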

Sentence segmentation and word tokenization

Abbreviations

Abbreviations that can be confused with sentence boundaries: a special abbreviation list is used. It has to be cleaned.
Sources for abbreviations: machine generated / ZH
Abbreviations with a full stop may appear in the word list without the full stop.

Apostrophes

Use of apostrophes: very rare, in foreign proper names.
Frequency ratio compared with the comma in hun_newscrawl_2011: '/, = 23.360 / 14.775.507

Multiwords

Number of multiwords: 23.058
Sources: Wikipedia
For a list of some high-frequency multiwords, see Appendix 1: Statscript 3.12.2 applied to hun_mixed_2012

Sources and ranking (2012)

Estimated number of webpages containing text:
Google.com top-5 words: 261,000,000 results for "az" "és" "A" "hogy" "is"
Google.com top-10 words: 61,300,000 results for "az" "és" "A" "hogy" "is" "nem" "Az" "egy" "meg" "volt"

Rank according to number of speakers (Ethnologue): 61
Rank according to Wikipedia size (2012-04-02, see http://de.wikipedia.org/wiki/Wikipedia:Sprachen): rank 19 with 213.891 articles
Rank according to number of newspapers as found by AbyZ (2012): 90 newspapers, rank 23
Rank according to number of newspapers with RSS feeds (2012): 102 newspapers, rank 15
Rank according to our corpus size (5/2012): 8
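The apostrophe-to-comma ratio reported above for hun_newscrawl_2011 works out to roughly 0.0016, or about one apostrophe per 630 commas. A minimal Python sketch of how such a ratio could be counted on a list of sentences follows; the function name char_ratio is hypothetical, and only the plain ASCII apostrophe is counted, which is an assumption.

```python
def char_ratio(sentences, char_a="'", char_b=","):
    """Count two characters over all sentences and return (a, b, a / b)."""
    a = sum(s.count(char_a) for s in sentences)
    b = sum(s.count(char_b) for s in sentences)
    return a, b, (a / b if b else float("nan"))


# Counts as reported in the text for hun_newscrawl_2011
reported_ratio = 23_360 / 14_775_507

# Toy input, not corpus data
a, b, ratio = char_ratio(["O'Neill jött, látott, győzött.", "Igen, persze."])
```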

HUN corpora

Quality Overview

Quality Ratings

A: Very good quality. Ready to use (or already used) for a frequency dictionary.
  - Size as large as possible
  - Only minimal errors
  - Multiple genres (if possible)
A-: Small problems identified. They should not affect usage.
B: Native speaker quality.
  - Information about abbreviations and sentence boundaries by a native speaker
  - Resulting statistics checked by a native speaker, possible errors corrected
C: Non-native speaker quality.
  - Obvious problems shown in corpus statistics are corrected
D: First version.
  - Pre-processing with default abbreviation list and default sentence boundaries
E: Poor quality: old, outdated or faulty.

Corpus Quality

Due to keyboard limitations, some authors use incorrect alternative characters within words.

+--------------------+----------------+---------------------------------------------------------+--------+
| Corpus             | Quality rating | Known problems                                          | To-dos |
+--------------------+----------------+---------------------------------------------------------+--------+
| hun_news_2007      | B              | top-50 words not clean, maximal sentence length problem | -      |
| hun_news_2008      | A-             | top-50 words not clean                                  | -      |
| hun_news_2009      | A-             | top-50 words not clean                                  | -      |
| hun_news_2010      | A-             | top-50 words not clean                                  | -      |
| hun_news_2011      | A-             | top-50 words not clean                                  | -      |
| hun_newscrawl_2011 | A-             | top-50 words not clean                                  | -      |
| hun_wikipedia_2007 | A-             | top-50 words not clean                                  | -      |
| hun_wikipedia_2010 | A-             | top-50 words not clean                                  | -      |
| hun_web_2003       | A-             | top-50 words not clean                                  | -      |
| hun_web_2011       | A-             | top-50 words not clean                                  | -      |
| hun_mixed_2012     | A-             | top-50 words not clean                                  | -      |
+--------------------+----------------+---------------------------------------------------------+--------+

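Processing Overview

For more details, see Appendix: Database Summary and Appendix: Number of sources by time period.

+--------------------+--------------------+------------------------+------------+-----------------------------------------+-----------------+
| Corpus             | Size (M sentences) | Size (M running words) | Multiwords | Crawling date                           | Production date |
+--------------------+--------------------+------------------------+------------+-----------------------------------------+-----------------+
| hun_news_2007      | 0.7                | 13                     | 8316       | daily 2007, mainly from May to December | 2012            |
| hun_news_2008      | 1.6                | 30                     | 12686      | daily 2008                              | 2012            |
| hun_news_2009      | 1.3                | 25                     | 11549      | daily 2009                              | 2012            |
| hun_news_2010      | 1.7                | 32                     | 11868      | daily 2010                              | 2012            |
| hun_news_2011      | 1.8                | 34                     | 11677      | daily 2011                              | 2012            |
| hun_newscrawl_2011 | 10.6               | 176                    | 23058      | batch crawling 2011                     | 2012            |
| hun_wikipedia_2007 | 0.6                | 8                      | 19114      | dump 2007                               | 2007            |
| hun_wikipedia_2012 | 2.5                | 38                     | 41709      | dump 2012                               | 2012            |
| hun_web_2003       | 18.2               | 254                    | 20317      | (2002)                                  | 2007            |
| hun_web_2011       | 3.2                | 48                     | 15981      | randomly 2011                           | 2012            |
| hun_mixed_2012     | 40.7               | 622                    | 50852      | -                                       | 2012            |
+--------------------+--------------------+------------------------+------------+-----------------------------------------+-----------------+

Content Overview

For more details, see Appendix: Size of different TLDs and Appendix: Size of different domains.

+--------------------+-----------------+----------------+-------------------+-----------------+-------------------------+
| Corpus             | Type of sources | Countries      | Number of sources | Publishing date | Biggest source          |
+--------------------+-----------------+----------------+-------------------+-----------------+-------------------------+
| hun_news_2007      | News            | hu             | 39 newspapers     | 2007            | www.origo.hu            |
| hun_news_2008      | News            | hu             | 78 newspapers     | 2008            | www.origo.hu            |
| hun_news_2009      | News            | hu             | 82 newspapers     | 2009            | www.origo.hu            |
| hun_news_2010      | News            | hu             | 68 newspapers     | 2010            | www.origo.hu            |
| hun_news_2011      | News            | hu             | 59 newspapers     | 2011            | hvg.hu                  |
| hun_newscrawl_2011 | News            | hu             | 66 newspapers     | 2011 and before | www.hhrf.org            |
| hun_wikipedia_2007 | Wikipedia       | -              | 1                 | 2007 and before | wikipedia.org           |
| hun_wikipedia_2010 | Wikipedia       | -              | 1                 | 2010 and before | wikipedia.org           |
| hun_web_2003       | Web             | hu             | unknown           | 2002 and before | unknown                 |
| hun_web_2011       | Web             | hu, ro, sk,... | 33992 domains     | 2011 and before | www.termeszetvilaga.hu/ |
| hun_mixed_2012     | combined        | combined       | 34113 domains     | 2011 and before | www.hhrf.org/           |
+--------------------+-----------------+----------------+-------------------+-----------------+-------------------------+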

Words

Appendix: Words by Length without multiplicity shows a plot of the corresponding length distribution. A smooth asymmetric bell-shaped curve is expected.
Appendix: Words by Length with multiplicity shows a plot of the corresponding length distribution. A smooth asymmetric bell-shaped curve is expected.
Appendix: The Most Frequent 50 Words shows the most frequent stopwords as well as one or more words related to the region.
Appendix: Longest Words in Top-1000 by rank shows the 25 longest words within the top-1000. They usually give an impression of the main topics treated in the corpus.
Appendix: Longest Words with minimum frequency 2 should give an idea of very long words. In the case of processing problems, different types of non-words may appear. This might help to improve the word definition.

+--------------------+----------------------------------------+-------------------------------------+----------------------------+----------------------------------+---------------------------------------------+
| Corpus             | Word length graph without multiplicity | Word length graph with multiplicity | Most Frequent 50 Words     | Longest Words in Top-1000        | Longest Words with minimum frequency 2      |
+--------------------+----------------------------------------+-------------------------------------+----------------------------+----------------------------------+---------------------------------------------+
| hun_news_2007      | okay                                   | okay                                | at rank 34                 | okay                             | routes, Trojan-Downloader.Win32.Conhook.gen |
| hun_news_2008      | okay                                   | okay                                | at rank 38                 | okay                             | routes                                      |
| hun_news_2009      | okay                                   | okay                                | at rank 38                 | okay                             | missing blanks, routes                      |
| hun_news_2010      | okay                                   | okay                                | and included               | okay                             | missing blanks, routes                      |
| hun_news_2011      | okay                                   | okay                                | and included               | okay                             | missing blanks, routes                      |
| hun_newscrawl_2011 | okay                                   | okay                                | , and is. included         | okay                             | missing blanks, routes                      |
| hun_wikipedia_2007 | okay                                   | okay                                | and felhasználó(k included | Wikipédia-felhasználó(k included | routes                                      |
| hun_wikipedia_2010 | okay                                   | okay                                | included                   | okay                             | routes                                      |
| hun_web_2003       | okay                                   | okay                                | okay                       | okay                             | missing blanks, routes, special characters  |
| hun_web_2011       | okay                                   | okay                                | and is. included           | okay                             | missing blanks, routes, special characters  |
| hun_mixed_2012     | okay                                   | okay                                | okay                       | okay                             | all above                                   |
+--------------------+----------------------------------------+-------------------------------------+----------------------------+----------------------------------+---------------------------------------------+

Abbreviations

A full stop that ends a known abbreviation is usually not treated as a sentence boundary. Conversely, abbreviations missing from the list lead to over-generated sentence boundaries. Due to limitations in the processing chain, the list of abbreviations used for sentence boundary detection can differ from the abbreviations in the word list.
Appendix: Most Frequent Abbreviations shows possible under-generation of sentence boundaries caused by wrong abbreviations (i.e. words ending in a full stop) in the word list.
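The effect of the abbreviation list on sentence boundaries can be illustrated with a small rule-based splitter in Python. It is a sketch under stated assumptions, not the segmenter used for these corpora: the function name split_sentences, the boundary rule (punctuation, whitespace, uppercase letter) and the five-entry abbreviation list are hypothetical.

```python
import re

# Tiny illustrative list; a real abbreviation list is far larger
ABBREVIATIONS = frozenset({"dr.", "pl.", "stb.", "kb.", "ún."})

# Candidate boundary: ., ! or ? followed by whitespace and an uppercase letter
_BOUNDARY = re.compile(r"[.!?](?=\s+[A-ZÁÉÍÓÖŐÚÜŰ])")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split at candidate boundaries unless the last token is an abbreviation."""
    sentences, start = [], 0
    for match in _BOUNDARY.finditer(text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in abbreviations:
            continue  # full stop belongs to an abbreviation, no boundary
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Ma dr. Kovács beszélt. Holnap folytatjuk."
with_list = split_sentences(text)
without_list = split_sentences(text, abbreviations=frozenset())
```

With the list, the text is split into two sentences; without it, the missing abbreviation "dr." produces a spurious third sentence, which is the over-generation described above. An abbreviation such as "stb." that really ends a sentence would be under-split by this rule.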

Sentences

Appendix: Shortest sentences shows the shortest declarative, exclamatory and interrogative sentences. In preprocessing, a minimum sentence length can be specified. Missing abbreviations are often visible as faulty sentence endings.

Appendix: Longest sentences shows the longest declarative, exclamatory and interrogative sentences. Usually, the maximum sentence length is set to 256 characters (not 256 bytes). Very long exclamatory or interrogative sentences often contain an overlooked sentence boundary.

Appendix: Length of sentences in characters shows the distribution of sentence lengths. A large and balanced corpus results in a smooth, bell-shaped curve. Isolated local maxima usually result from large sets of near-duplicate sentences.

+--------------------+--------------------+-------------------+---------------------------------------------------------------------+--------------------------------+
| Corpus             | Shortest sentences | Longest sentences | Length distribution (in characters)                                 | Length distribution (in words) |
+--------------------+--------------------+-------------------+---------------------------------------------------------------------+--------------------------------+
| hun_news_2007      | okay               | char_length=249   | too few long sentences, maximum 256 bytes instead of 256 characters | okay                           |
| hun_news_2008      | okay               | okay              | okay                                                                | okay                           |
| hun_news_2009      | okay               | okay              | okay                                                                | okay                           |
| hun_news_2010      | okay               | okay              | okay                                                                | okay                           |
| hun_news_2011      | okay               | okay              | okay                                                                | okay                           |
| hun_newscrawl_2011 | okay               | okay              | okay                                                                | okay                           |
| hun_wikipedia_2007 | okay               | okay              | okay                                                                | okay                           |
| hun_wikipedia_2010 | okay               | okay              | okay                                                                | okay                           |
| hun_web_2003       | okay               | okay              | okay                                                                | okay                           |
| hun_web_2011       | okay               | okay              | okay                                                                | okay                           |
| hun_mixed_2012     | okay               | okay              | okay                                                                | okay                           |
+--------------------+--------------------+-------------------+---------------------------------------------------------------------+--------------------------------+

Oddities

Appendix: Sentences with high average word length: Typical sentences contain many stopwords, and stopwords are usually short, so they keep the average word length of a sentence low. Conversely, sentences with a high average word length are often ill-formed. They can be used to improve preprocessing.

Appendix: Problems with sentence segmentation - Words ending in a stopword: If many word or sentence boundaries are ill-formed, with no blank between two words, the merged strings show up as new ill-formed words. The appendix shows the most frequent words ending in an uppercase stopword. If such words are infrequent, the data were of high quality.
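The checks described in this section can be sketched in a few lines. This is a hedged illustration rather than the preprocessing code behind the report: all function names are hypothetical, the stopword set is a tiny made-up sample, and reading "uppercase stopword" as a capitalised stopword that directly follows a lowercase letter is an assumption.

```python
MAX_CHARS = 256  # limit named in the text, counted in characters

def length_ok(sentence, max_chars=MAX_CHARS):
    # len() on a str counts characters (code points), not bytes
    return len(sentence) <= max_chars

def byte_length(sentence):
    # UTF-8 size; accented Hungarian letters take two bytes each
    return len(sentence.encode("utf-8"))

def average_word_length(sentence):
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def ends_in_uppercase_stopword(word, stopwords):
    """True if a capitalised stopword is glued to the end of `word`
    (assumed heuristic: the letter before the stopword is lowercase)."""
    for stop in stopwords:
        cap = stop.capitalize()
        head = word[: -len(cap)] if cap else ""
        if head and word.endswith(cap) and head[-1].islower():
            return True
    return False

sample = "é" * 200                 # 200 characters, 400 bytes in UTF-8
stopwords = {"az", "és", "a"}      # illustrative sample, not the report's list
```

A sentence such as `sample` passes a 256-character limit but would be rejected by a 256-byte limit, which is the kind of mismatch noted for hun_news_2007 in the table above.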

+--------------------+------------------------------------------+----------------------------+
| Corpus             | Sentences with high average word length  | Words ending in a stopword |
+--------------------+------------------------------------------+----------------------------+
| hun_news_2007      | okay                                     | maxfreq=25                 |
| hun_news_2008      | okay                                     | maxfreq=28                 |
| hun_news_2009      | okay                                     | maxfreq=18                 |
| hun_news_2010      | okay                                     | maxfreq=22                 |
| hun_news_2011      | okay                                     | maxfreq=43                 |
| hun_newscrawl_2011 | special characters included              | maxfreq=24                 |
| hun_wikipedia_2007 | okay                                     | okay                       |
| hun_wikipedia_2010 | URLs in sentences                        | okay                       |
| hun_web_2003       | missing blanks                           | maxfreq=24                 |
| hun_web_2011       | special characters, missing blanks       | maxfreq=10                 |
| hun_mixed_2012     | all above                                | maxfreq=64                 |
+--------------------+------------------------------------------+----------------------------+

POS Tagging

HunPOS provides POS tagging and is used for Hungarian and Swedish. If it is applied to a corpus, frequencies for words with POS tags are provided.

HUN corpus comparison

Automated Corpus comparison

For the following comparisons, these tests on the top-1000 words are performed:

- Vectors based on the frequencies of the top-1000 words are created for the analysed languages.
- The comparison is based on the cosine of the angle between these vectors and is reported as a distance: identical languages receive a value of 0, distinct languages a value of 1.
- The same analysis is conducted using the frequencies of the top-1000 typical letter trigrams of the languages.

Monolingual word list comparison (top-1000 words)

As one can expect, the comparisons show:

- The different news corpora have different word lists, with a maximum distance of 0.17 (hun_newscrawl_2011 and hun_news_2007).
- The wikipedia corpora are similar, with a maximum distance of 0.12.
- The web corpora have a distance of 0.15.
- The mixed corpus hun_mixed_2012 holds a central position, with maximum distances of 0.33 to the other corpora.
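The comparison of frequency vectors can be sketched as follows. The report does not state how the cosine is turned into a distance or how frequencies are weighted, so two choices here are assumptions: the distance is taken as 1 minus the cosine, and frequencies are log-scaled with log(1 + f), a guess prompted by the column name `cos_logfreq` in the result tables. The function names are hypothetical.

```python
import math
from collections import Counter

def top_n(freqs, n=1000):
    # keep only the n most frequent items
    return dict(Counter(freqs).most_common(n))

def cosine_distance(freqs_a, freqs_b, n=1000):
    """Assumed distance: 1 - cosine of log-scaled top-n frequency vectors."""
    wa = {w: math.log1p(f) for w, f in top_n(freqs_a, n).items()}
    wb = {w: math.log1p(f) for w, f in top_n(freqs_b, n).items()}
    dot = sum(wa[w] * wb[w] for w in wa.keys() & wb.keys())
    norm_a = math.sqrt(sum(v * v for v in wa.values()))
    norm_b = math.sqrt(sum(v * v for v in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # an empty list shares nothing with any other list
    return 1.0 - dot / (norm_a * norm_b)

def trigram_freqs(word_freqs):
    # letter trigrams of each word, weighted by the word's frequency
    counts = Counter()
    for word, f in word_freqs.items():
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += f
    return counts

# made-up toy frequencies, not taken from the corpora
corpus_a = {"a": 50, "az": 30, "és": 20}
corpus_b = {"the": 60, "of": 25, "and": 15}
```

Under these assumptions two identical word lists get a distance of 0 and two lists without any shared word get a distance of 1, matching the scale described above. The trigram variant applies `cosine_distance` to the output of `trigram_freqs`.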

Multilingual word list comparison (top-1000 words)

Both the comparison of the top-1000 words and the comparison of the letter trigrams used in these words show that there is no similar language in our data. The distances of the mixed corpus to the nearest languages are 0.85 for words and 0.85 for letter trigrams. Both distances are so large that they do not indicate similarity. On average the value for the most similar language is 0.58 for trigrams.

The most similar languages based on words: Palauan, Konkani, Catalan-Valencian-Balear

+--------+---------------------+--------------------------+-------------+
| source | language_short_name | language_name            | cos_logfreq |
+--------+---------------------+--------------------------+-------------+
| hun    | pau                 | Palauan                  |    0.853886 |
| hun    | knn                 | Konkani                  |    0.864753 |
| hun    | cat                 | Catalan-Valencian-Balear |    0.880592 |
| hun    | ibo                 | Igbo                     |    0.894747 |
| hun    | slk                 | Slovak                   |    0.900509 |
+--------+---------------------+--------------------------+-------------+

The most similar languages based on letter trigrams: Norwegian (Bokmål), Dutch, Palauan

+--------+---------------------+--------------------+-------------+
| source | language_short_name | language_name      | cos_logfreq |
+--------+---------------------+--------------------+-------------+
| hun    | nob                 | Norwegian, Bokmål  |    0.848486 |
| hun    | nld                 | Dutch              |    0.849091 |
| hun    | pau                 | Palauan            |    0.854097 |
| hun    | bre                 | Breton             |     0.85495 |
| hun    | swe                 | Swedish            |    0.856901 |
+--------+---------------------+--------------------+-------------+

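Processing details

Appendix to hun news 2007: Database summary

Values for some general parameters
+-----------------------------------------+-----------+
| Parameter                               | Value     |
+-----------------------------------------+-----------+
| Number of sentences                     |    714822 |
| Number of running word forms            |  12956111 |
| Number of distinct word forms           |    714183 |
| Number of multiwords                    |      8315 |
| Percentage of words with frequency=1    |   55.3007 |
| Number of sentence based co-occurrences |   3362610 |
| Number of neighbour co-occurrences      |    438308 |
+-----------------------------------------+-----------+

Appendix to hun news 2008: Database summary

Values for some general parameters
+-----------------------------------------+-----------+
| Parameter                               | Value     |
+-----------------------------------------+-----------+
| Number of sentences                     |   1634391 |
| Number of running word forms            |  29628596 |
| Number of distinct word forms           |   1196175 |
| Number of multiwords                    |     12319 |
| Percentage of words with frequency=1    |   54.4390 |
| Number of sentence based co-occurrences |   8265434 |
| Number of neighbour co-occurrences      |    953417 |
+-----------------------------------------+-----------+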

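Appendix to hun news 2009: Database summary

Values for some general parameters
+-----------------------------------------+-----------+
| Parameter                               | Value     |
+-----------------------------------------+-----------+
| Number of sentences                     |   1346971 |
| Number of running word forms            |  24709716 |
| Number of distinct word forms           |   1045980 |
| Number of multiwords                    |     11212 |
| Percentage of words with frequency=1    |   54.1993 |
| Number of sentence based co-occurrences |   6846292 |
| Number of neighbour co-occurrences      |    801470 |
+-----------------------------------------+-----------+

Appendix to hun news 2010: Database summary

Values for some general parameters
+-----------------------------------------+-----------+
| Parameter                               | Value     |
+-----------------------------------------+-----------+
| Number of sentences                     |   1738375 |
| Number of running word forms            |  31773034 |
| Number of distinct word forms           |   1190558 |
| Number of multiwords                    |     11465 |
| Percentage of words with frequency=1    |   53.2940 |
| Number of sentence based co-occurrences |   9710382 |
| Number of neighbour co-occurrences      |   1026032 |
+-----------------------------------------+-----------+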

Appendix to hun news 2011: Database summary

Values for some general parameters
+-----------------------------------------+-----------+
| Parameter                               | Value     |
+-----------------------------------------+-----------+
| Number of sentences                     |   1842418 |
| Number of running word forms            |  33794981 |
| Number of distinct word forms           |   1189327 |
| Number of multiwords                    |     11314 |
| Percentage of words with frequency=1    |   52.4479 |
| Number of sentence based co-occurrences |  10588798 |
| Number of neighbour co-occurrences      |   1097482 |
+-----------------------------------------+-----------+

Appendix to hun newscrawl 2011: Database summary

Values for some general parameters
+-----------------------------------------+-----------+
| Parameter                               | Value     |
+-----------------------------------------+-----------+
| Number of sentences                     |  10565916 |
| Number of running word forms            | 175950790 |
| Number of distinct word forms           |   4205279 |
| Number of multiwords                    |     23058 |
| Percentage of words with frequency=1    |   57.6108 |
| Number of sentence based co-occurrences |  41983602 |
| Number of neighbour co-occurrences      |   4639348 |
+-----------------------------------------+-----------+

Appendix to hun wikipedia 2007: Database summary

Values for some general parameters
+-----------------------------------------+-----------+
| Parameter                               | Value     |
+-----------------------------------------+-----------+
| Number of sentences                     |    576936 |
| Number of running word forms            |   8490788 |
| Number of distinct word forms           |    755994 |
| Number of multiwords                    |     18635 |
| Percentage of words with frequency=1    |   61.2925 |
| Number of sentence based co-occurrences |   1784520 |
| Number of neighbour co-occurrences      |    249066 |
+-----------------------------------------+-----------+

Appendix to hun wikipedia 2012: Database summary

Values for some general parameters
+-----------------------------------------+-----------+
| Parameter                               | Value     |
+-----------------------------------------+-----------+
| Number of sentences                     |   2538545 |
| Number of running word forms            |  37908657 |
| Number of distinct word forms           |   1990194 |
| Number of multiwords                    |     40586 |
| Percentage of words with frequency=1    |   59.7776 |
| Number of sentence based co-occurrences |   8686698 |
| Number of neighbour co-occurrences      |   1109807 |
+-----------------------------------------+-----------+

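Appendix to hun web 2003: Database summary

Values for some general parameters
+-----------------------------------------+-----------+
| Parameter                               | Value     |
+-----------------------------------------+-----------+
| Number of sentences                     |  18201276 |
| Number of running word forms            | 253599346 |
| Number of distinct word forms           |   5460919 |
| Number of multiwords                    |     20312 |
| Percentage of words with frequency=1    |   58.1748 |
| Number of sentence based co-occurrences |  49192226 |
| Number of neighbour co-occurrences      |   6022000 |
+-----------------------------------------+-----------+

Appendix to hun web 2011: Database summary

Values for some general parameters
+-----------------------------------------+-----------+
| Parameter                               | Value     |
+-----------------------------------------+-----------+
| Number of sentences                     |   3154647 |
| Number of running word forms            |  48490550 |
| Number of distinct word forms           |   2399258 |
| Number of multiwords                    |     15981 |
| Percentage of words with frequency=1    |   58.4376 |
| Number of sentence based co-occurrences |  12126608 |
| Number of neighbour co-occurrences      |   1461387 |
+-----------------------------------------+-----------+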

Appendix to hun mixed 2012: Database summary

Values for some general parameters
+-----------------------------------------+-----------+
| Parameter                               | Value     |
+-----------------------------------------+-----------+
| Number of sentences                     |  40696055 |
| Number of running word forms            | 622441617 |
| Number of distinct word forms           |  10309904 |
| Number of multiwords                    |     50852 |
| Percentage of words with frequency=1    |   58.5375 |
| Number of sentence based co-occurrences | 133013288 |
| Number of neighbour co-occurrences      |  13557699 |
+-----------------------------------------+-----------+
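Several of the summary parameters can be derived directly from tokenised sentences. The sketch below is illustrative only: `database_summary` is a hypothetical helper, the input is a made-up toy corpus, and it assumes that "Percentage of words with frequency=1" is taken relative to the number of distinct word forms. The two co-occurrence counts are left out because their definition is not given in this section.

```python
from collections import Counter

def database_summary(sentences):
    """sentences: list of token lists (hypothetical input format)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    running = sum(counts.values())   # running word forms (tokens)
    distinct = len(counts)           # distinct word forms (types)
    once = sum(1 for c in counts.values() if c == 1)
    # assumption: the percentage is relative to distinct word forms
    share = round(100.0 * once / distinct, 4) if distinct else 0.0
    return {
        "Number of sentences": len(sentences),
        "Number of running word forms": running,
        "Number of distinct word forms": distinct,
        "Percentage of words with frequency=1": share,
    }

toy = [["a", "b", "a"], ["c", "a"]]   # made-up toy corpus
summary = database_summary(toy)
```

In the toy corpus, two of the three distinct word forms occur once. The values in the tables above (between 52% and 61%) are of the same kind, on a much larger scale.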

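Content details

Appendix to hun news 2007: Size of different TLDs

TLDs larger than 1%
+------+--------------+--------+
| TLD  | # of sources | %      |
+------+--------------+--------+
| .hu/ |        82288 |  98.57 |
| net/ |         1193 |   1.43 |
+------+--------------+--------+

Appendix to hun news 2008: Size of different TLDs

TLDs larger than 1%
+------+--------------+--------+
| TLD  | # of sources | %      |
+------+--------------+--------+
| .hu/ |       166170 |  98.53 |
| net/ |         2393 |   1.42 |
+------+--------------+--------+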

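Appendix to hun news 2009: Size of different TLDs

TLDs larger than 1%
+------+--------------+--------+
| TLD  | # of sources | %      |
+------+--------------+--------+
| .hu/ |       142853 |  98.90 |
| net/ |         1479 |   1.02 |
+------+--------------+--------+

Appendix to hun news 2010: Size of different TLDs

TLDs larger than 1%
+------+--------------+--------+
| TLD  | # of sources | %      |
+------+--------------+--------+
| .hu/ |       210879 |  98.08 |
| com/ |         4090 |   1.90 |
+------+--------------+--------+

Appendix to hun news 2011: Size of different TLDs

TLDs larger than 1%
+------+--------------+--------+
| TLD  | # of sources | %      |
+------+--------------+--------+
| .hu/ |       225455 |  98.04 |
| com/ |         4499 |   1.96 |
+------+--------------+--------+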

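Appendix to hun newscrawl 2011: Size of different TLDs

TLDs larger than 1%
+------+--------------+--------+
| TLD  | # of sources | %      |
+------+--------------+--------+
| .hu/ |       719961 |  91.25 |
| org/ |        54595 |   6.92 |
| com/ |        14103 |   1.79 |
+------+--------------+--------+

Appendix to hun web 2003: Size of different TLDs

TLDs larger than 1%
+------+--------------+--------+
| TLD  | # of sources | %      |
+------+--------------+--------+
| .hu/ |            1 | 100.00 |
+------+--------------+--------+

Appendix to hun web 2011: Size of different TLDs

TLDs larger than 1%
+------+--------------+--------+
| TLD  | # of sources | %      |
+------+--------------+--------+
| .hu/ |       303190 |  89.83 |
| .ro/ |         6845 |   2.03 |
| com/ |         6237 |   1.85 |
| .sk/ |         4555 |   1.35 |
| .eu/ |         4463 |   1.32 |
+------+--------------+--------+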

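Appendix to hun mixed 2012: Size of different TLDs

TLDs larger than 1%
+------+--------------+--------+
| TLD  | # of sources | %      |
+------+--------------+--------+
| .hu/ |      1806682 |  91.56 |
| org/ |       103631 |   5.25 |
| com/ |        29002 |   1.47 |
+------+--------------+--------+

Appendix to hun news 2007: Size of largest domains

Largest domains
+-------------------------+----------------+
| Source                  | # of sentences |
+-------------------------+----------------+
| www.origo.hu/           |         136623 |
| inforadio.hu/           |          95917 |
| vg.hu/                  |          90051 |
| hvg.hu/                 |          81590 |
| www.fn.hu/              |          73227 |
| ma.hu/                  |          63818 |
| nol.hu/                 |          60633 |
| www.hwsw.hu/            |          13187 |
| www.computerworld.hu/   |          12225 |
| www.mobilport.hu/       |           9433 |
+-------------------------+----------------+
# of distinct sources: 39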

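Appendix to hun news 2008: Size of largest domains

Largest domains
+-------------------------+----------------+
| Source                  | # of sentences |
+-------------------------+----------------+
| www.origo.hu/           |         297386 |
| nol.hu/                 |         222498 |
| vg.hu/                  |         201612 |
| inforadio.hu/           |         186952 |
| www.fn.hu/              |         153306 |
| bulvar.ma.hu/           |          58171 |
| belfold.ma.hu/          |          48732 |
| eletmod.hu/             |          33618 |
| www.mult-kor.hu/        |          28446 |
| www.f1hirek.hu/         |          23267 |
+-------------------------+----------------+
# of distinct sources: 78

Appendix to hun news 2009: Size of largest domains

Largest domains
+-------------------------+----------------+
| Source                  | # of sentences |
+-------------------------+----------------+
| www.origo.hu/           |         266628 |
| inforadio.hu/           |         163956 |
| vg.hu/                  |         160084 |
| www.fn.hu/              |         155419 |
| bulvar.ma.hu/           |          63894 |
| www.noilapozo.hu/       |          44271 |
| belfold.ma.hu/          |          37837 |
| www.vg.hu/              |          25917 |
| www.hwsw.hu/            |          25269 |
| www.mult-kor.hu/        |          24987 |
+-------------------------+----------------+
# of distinct sources: 82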

Appendix to hun news 2010: Size of largest domains

Largest domains
+-------------------------+----------------+
| Source                  | # of sentences |
+-------------------------+----------------+
| www.origo.hu/           |         330951 |
| hvg.hu/                 |         312316 |
| vg.hu/                  |         194296 |
| inforadio.hu/           |         176676 |
| www.fn.hu/              |         162180 |
| vg.hu.feedsportal.com/  |          35102 |
| belfold.ma.hu/          |          34898 |
| www.mult-kor.hu/        |          34586 |
| prohardver.hu/          |          33482 |
| eletmod.hu/             |          31213 |
+-------------------------+----------------+
# of distinct sources: 68

Appendix to hun news 2011: Size of largest domains

Largest domains
+-------------------------+----------------+
| Source                  | # of sentences |
+-------------------------+----------------+
| hvg.hu/                 |         533993 |
| www.origo.hu/           |         353074 |
| inforadio.hu/           |         165280 |
| www.fn.hu/              |         103730 |
| vg.hu/                  |         103264 |
| www.vg.hu/              |         101239 |
| belfold.ma.hu/          |          37868 |
| www.mult-kor.hu/        |          35710 |
| webbulvar.hu/           |          33642 |
| kulfold.ma.hu/          |          32150 |
+-------------------------+----------------+
# of distinct sources: 59

Appendix to hun newscrawl 2011: Size of largest domains

Largest domains
+-------------------------+----------------+
| Source                  | # of sentences |
+-------------------------+----------------+
| www.hhrf.org/           |        3555353 |
| www.inforadio.hu/       |         606089 |
| www.vg.hu/              |         603946 |
| www.borsonline.hu/      |         591559 |
| www.pecsiujsag.hu/      |         534121 |
| www.metropol.hu/        |         494560 |
| www.hir24.hu/           |         377990 |
| www.pecsinapilap.hu/    |         346658 |
| www.delmagyar.hu/       |         321883 |
| www.hirkereso.hu/       |         236754 |
+-------------------------+----------------+
# of distinct sources: 66

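Appendix to hun web 2003: Size of largest domains

Largest domains
+-------------------------+----------------+
| Source                  | # of sentences |
+-------------------------+----------------+
| mokk.bme.hu/            |       18201276 |
+-------------------------+----------------+
# of distinct sources: 1

Appendix to hun web 2011: Size of largest domains

Largest domains
+-------------------------+----------------+
| Source                  | # of sentences |
+-------------------------+----------------+
| www.termeszetvilaga.hu/ |          15077 |
| www.hetnap.rs/          |          14954 |
| www.bedoe.de/           |          14822 |
| www.literatura.hu/      |          13738 |
| www.panoramada.co.rs/   |          13664 |
| www.karizmatikus.hu/    |          11241 |
| www.c3.hu/              |          10292 |
| www.amarodrom.hu/       |           8890 |
| www.ligetgaleria.c3.hu/ |           8073 |
| www.matud.iif.hu/       |           7462 |
+-------------------------+----------------+
# of distinct sources: 33992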

Appendix to hun mixed 2012: Size of largest domains

Largest domains
+-------------------------+----------------+
| Source                  | # of sentences |
+-------------------------+----------------+
| mokk.bme.hu/            |       18175878 |
| www.hhrf.org/           |        3410680 |
|                         |        2300915 |
| www.origo.hu/           |        1441254 |
| hvg.hu/                 |         939325 |
| inforadio.hu/           |         785583 |
| vg.hu/                  |         750708 |
| www.fn.hu/              |         656361 |
| www.borsonline.hu/      |         583199 |
| hu.wikipedia.org/       |         565951 |
+-------------------------+----------------+
# of distinct sources: 34112
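Tables of this kind can be derived from the source URLs stored with a corpus. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the URLs are made up, each URL is expected to carry a scheme such as `http://`, and the TLD is simply taken as the last label of the host name.

```python
from collections import Counter
from urllib.parse import urlsplit

def host_of(url):
    # urlsplit only finds the host if the URL has a scheme
    return (urlsplit(url).hostname or "").lower()

def tld_shares(source_urls, min_percent=1.0):
    """Return (TLD, number of sources, percentage) for TLDs above min_percent."""
    hosts = [h for h in map(host_of, source_urls) if h]
    tlds = Counter("." + h.rsplit(".", 1)[-1] for h in hosts)
    total = sum(tlds.values())
    return [(tld, n, round(100.0 * n / total, 2))
            for tld, n in tlds.most_common()
            if 100.0 * n / total > min_percent]

def largest_domains(sentence_sources, k=10):
    """sentence_sources: one source URL per sentence (assumed input format)."""
    return Counter(host_of(u) for u in sentence_sources).most_common(k)

# made-up example URLs
sources = ["http://www.origo.hu/a", "http://hvg.hu/b",
           "http://vg.hu/c", "http://example.com/d"]
per_sentence = ["http://hvg.hu/1", "http://hvg.hu/2", "http://vg.hu/1"]
```

A single dominating domain, such as mokk.bme.hu/ in hun_web_2003, is easy to detect with such a count, since it appears as the only entry in both tables.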

Appendix to hun news 2007: Number of sources by time period

Number of sources by year, month, and day

Number of sources per year
+------+--------------+--------+
| year | # of sources | %      |
+------+--------------+--------+
| 2007 |        83481 | 100.00 |
+------+--------------+--------+

Number of sources per month
+---------+--------------+--------+
| month   | # of sources | %      |
+---------+--------------+--------+
| 2007-05 |         8070 |   9.67 |
| 2007-06 |        11057 |  13.24 |
| 2007-07 |         7978 |   9.56 |
| 2007-08 |        11558 |  13.85 |
| 2007-09 |        10177 |  12.19 |
| 2007-10 |        12831 |  15.37 |
| 2007-11 |        11624 |  13.92 |
| 2007-12 |         9615 |  11.52 |
+---------+--------------+--------+

Appendix to hun news 2008: Number of sources by time period

Number of sources by year, month, and day

Number of sources per year
+------+--------------+--------+
| year | # of sources | %      |
+------+--------------+--------+
| 2008 |       168648 | 100.00 |
+------+--------------+--------+

Number of sources per month
+---------+--------------+--------+
| month   | # of sources | %      |
+---------+--------------+--------+
| 2008-01 |        14408 |   8.54 |
| 2008-02 |        10939 |   6.49 |
| 2008-03 |        14043 |   8.33 |
| 2008-04 |        15252 |   9.04 |
| 2008-05 |        14191 |   8.41 |
| 2008-06 |        14783 |   8.77 |
| 2008-07 |        15410 |   9.14 |
| 2008-08 |        14149 |   8.39 |
| 2008-09 |        15203 |   9.01 |
| 2008-10 |        15500 |   9.19 |
| 2008-11 |        12951 |   7.68 |
| 2008-12 |        11819 |   7.01 |
+---------+--------------+--------+

Appendix to hun news 2009: Number of sources by time period

Number of sources by year, month, and day

Number of sources per year
+------+--------------+--------+
| year | # of sources | %      |
+------+--------------+--------+
| 2009 |       144439 | 100.00 |
+------+--------------+--------+

Number of sources per month
+---------+--------------+--------+
| month   | # of sources | %      |
+---------+--------------+--------+
| 2009-01 |        13655 |   9.45 |
| 2009-02 |        13429 |   9.30 |
| 2009-03 |        14320 |   9.91 |
| 2009-04 |        13468 |   9.32 |
| 2009-05 |        11041 |   7.64 |
| 2009-06 |        12629 |   8.74 |
| 2009-07 |        13069 |   9.05 |
| 2009-08 |        12610 |   8.73 |
| 2009-09 |        13284 |   9.20 |
| 2009-10 |        13026 |   9.02 |
| 2009-11 |         3613 |   2.50 |
| 2009-12 |        10295 |   7.13 |
+---------+--------------+--------+

Appendix to hun news 2010: Number of sources by time period

Number of sources by year, month, and day

Number of sources per year
+------+--------------+--------+
| year | # of sources | %      |
+------+--------------+--------+
| 2010 |       215015 | 100.00 |
+------+--------------+--------+

Number of sources per month
+---------+--------------+--------+
| month   | # of sources | %      |
+---------+--------------+--------+
| 2010-01 |        12563 |   5.84 |
| 2010-02 |        13733 |   6.39 |
| 2010-03 |        16484 |   7.67 |
| 2010-04 |        15343 |   7.14 |
| 2010-05 |        15013 |   6.98 |
| 2010-06 |        17485 |   8.13 |
| 2010-07 |        17070 |   7.94 |
| 2010-08 |        27797 |  12.93 |
| 2010-09 |        21000 |   9.77 |
| 2010-10 |        19563 |   9.10 |
| 2010-11 |        19566 |   9.10 |
| 2010-12 |        19398 |   9.02 |
+---------+--------------+--------+

Appendix to hun news 2011: Number of sources by time period

Number of sources by year, month, and day

Number of sources per year
+------+--------------+--------+
| year | # of sources | %      |
+------+--------------+--------+
| 2011 |       229971 | 100.00 |
+------+--------------+--------+

Number of sources per month
+---------+--------------+--------+
| month   | # of sources | %      |
+---------+--------------+--------+
| 2011-01 |        19167 |   8.33 |
| 2011-02 |        18711 |   8.14 |
| 2011-03 |        18235 |   7.93 |
| 2011-04 |        20623 |   8.97 |
| 2011-05 |        21459 |   9.33 |
| 2011-06 |        17014 |   7.40 |
| 2011-07 |        20048 |   8.72 |
| 2011-08 |        21294 |   9.26 |
| 2011-09 |        21210 |   9.22 |
| 2011-10 |        19923 |   8.66 |
| 2011-11 |        19224 |   8.36 |
| 2011-12 |        13063 |   5.68 |
+---------+--------------+--------+
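The monthly breakdown can be reproduced from the download dates of the sources. The sketch below assumes that one date per source is available; the helper name `sources_per_month` and the sample dates are made up for illustration.

```python
from collections import Counter
from datetime import date

def sources_per_month(dates):
    """Return (month, number of sources, percentage) in chronological order."""
    counts = Counter(d.strftime("%Y-%m") for d in dates)
    total = sum(counts.values())
    return [(month, n, round(100.0 * n / total, 2))
            for month, n in sorted(counts.items())]

# made-up dates, one per source
sample_dates = [date(2007, 5, 1), date(2007, 5, 20),
                date(2007, 6, 3), date(2007, 12, 31)]
monthly = sources_per_month(sample_dates)
```

Months with an unusually small share, such as 2009-11 with 2.50% of the hun_news_2009 sources, stand out in such a listing and may point to a gap in the crawl.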

32 Word details Appendix to hun news 2007: Words by length without multiplicity Percentage of words of fixed length in characters, counted without multiplicty Average word length 10.7910 word length percentage 1 0.0175 2 0.1885 3 1.0601 4 2.2298 5 4.0295 6 5.5293 7 7.5329 8 9.1070 9 10.1342

Appendix to hun news 2007: Words by length without multiplicity 33 10 10.5437 11 10.2983 12 9.4571 13 7.9139 14 6.4579 15 4.8850 16 3.5947 17 2.5876 18 1.7854 19 1.2214 20 0.8611 21 0.5697 22 0.4020 23 0.2614 24 0.1627 25 0.1057 26 0.0731 27 0.0473 28 0.0316 29 0.0213 30 0.0161 31 0.0099 32 0.0088 33 0.0060 34 0.0046 35 0.0029 36 0.0022 37 0.0015 38 0.0013 39 0.0003 40 0.0006 41 0.0007 42 0.0013 43 0.0003 44 0.0003 45 0.0003 46 0.0006 47 0.0003 48 0.0001

Appendix to hun news 2007: Words by length without multiplicity 34 49 0.0004 Appendix to hun news 2008: Words by length without multiplicity Percentage of words of fixed length in characters, counted without multiplicty Average word length 11.0121 word length percentage 1 0.0115 2 0.1484 3 0.8978 4 2.0060 5 3.9656 6 5.2979 7 7.1896 8 8.7829 9 9.7668 10 10.2091

Appendix to hun news 2008: Words by length without multiplicity 35 11 10.1170 12 9.4490 13 8.0719 14 6.7007 15 5.2186 16 3.9089 17 2.8328 18 2.0055 19 1.4166 20 0.9749 21 0.6711 22 0.4534 23 0.3018 24 0.2081 25 0.1388 26 0.0930 27 0.0586 28 0.0439 29 0.0275 30 0.0181 31 0.0110 32 0.0078 33 0.0064 34 0.0043 35 0.0037 36 0.0032 37 0.0015 38 0.0019 39 0.0010 40 0.0009 41 0.0008 42 0.0010 43 0.0004 44 0.0005 45 0.0004 46 0.0005 47 0.0006 48 0.0002 49 0.0005

Appendix to hun news 2008: Words by length without multiplicity 36 50 0.0001 Appendix to hun news 2009: Words by length without multiplicity Percentage of words of fixed length in characters, counted without multiplicty Average word length 10.8835 word length percentage 1 0.0130 2 0.1609 3 0.9633 4 2.0999 5 4.0981 6 5.7137 7 7.3575 8 8.9859 9 9.9066 10 10.2997

Appendix to hun news 2009: Words by length without multiplicity 37 11 10.1197 12 9.3615 13 7.9563 14 6.5136 15 5.0633 16 3.7325 17 2.6924 18 1.8924 19 1.3129 20 0.9120 21 0.6247 22 0.4183 23 0.2814 24 0.1825 25 0.1243 26 0.0888 27 0.0628 28 0.0361 29 0.0267 30 0.0196 31 0.0142 32 0.0078 33 0.0065 34 0.0042 35 0.0038 36 0.0033 37 0.0030 38 0.0020 39 0.0013 40 0.0009 41 0.0003 42 0.0016 43 0.0011 44 0.0005 45 0.0011 46 0.0008 47 0.0008 48 0.0002 49 0.0004

Appendix to hun news 2009: Words by length without multiplicity 38 50 0.0005 Appendix to hun news 2010: Words by length without multiplicity Percentage of words of fixed length in characters, counted without multiplicty Average word length 10.9363 word length percentage 1 0.0120 2 0.1586 3 0.9429 4 2.0799 5 4.0837 6 5.7803 7 7.2510 8 8.8547 9 9.7846 10 10.1886

Appendix to hun news 2010: Words by length without multiplicity 39 11 10.0286 12 9.3081 13 7.9665 14 6.5479 15 5.0932 16 3.8064 17 2.7742 18 1.9598 19 1.3620 20 0.9434 21 0.6462 22 0.4500 23 0.2986 24 0.2031 25 0.1357 26 0.0942 27 0.0619 28 0.0393 29 0.0298 30 0.0208 31 0.0135 32 0.0120 33 0.0068 34 0.0050 35 0.0042 36 0.0038 37 0.0022 38 0.0015 39 0.0011 40 0.0009 41 0.0008 42 0.0013 43 0.0008 44 0.0006 45 0.0007 46 0.0008 47 0.0006 48 0.0003 49 0.0005

Appendix to hun news 2010: Words by length without multiplicity 40 50 0.0002 Appendix to hun news 2011: Words by length without multiplicity Percentage of words of fixed length in characters, counted without multiplicty Average word length 10.9884 word length percentage 1 0.0119 2 0.1504 3 0.9175 4 2.0123 5 4.1014 6 5.9481 7 7.2345 8 8.7390 9 9.5968 10 10.0894

Appendix to hun news 2011: Words by length without multiplicity 41 11 9.9823 12 9.2430 13 7.9067 14 6.5620 15 5.1424 16 3.8619 17 2.8447 18 2.0173 19 1.4120 20 0.9936 21 0.6800 22 0.4807 23 0.3145 24 0.2118 25 0.1479 26 0.0977 27 0.0639 28 0.0467 29 0.0338 30 0.0206 31 0.0156 32 0.0117 33 0.0098 34 0.0061 35 0.0051 36 0.0036 37 0.0046 38 0.0033 39 0.0018 40 0.0025 41 0.0015 42 0.0020 43 0.0014 44 0.0014 45 0.0009 46 0.0016 47 0.0015 48 0.0006 49 0.0011

Appendix to hun news 2011: Words by length without multiplicity 42 50 0.0010 Appendix to hun newscrawl 2011: Words by length without multiplicity Percentage of words of fixed length in characters, counted without multiplicty Average word length 11.4199 word length percentage 0 0.0001 1 0.0056 2 0.0776 3 0.5610 4 1.5075 5 3.3445 6 4.8434 7 6.6160 8 8.2364 9 9.2869

Appendix to hun newscrawl 2011: Words by length without multiplicity 43 10 9.9473 11 10.1384 12 9.5309 13 8.4166 14 7.0876 15 5.6789 16 4.3307 17 3.2258 18 2.3234 19 1.6365 20 1.1547 21 0.8006 22 0.5519 23 0.3757 24 0.2513 25 0.1777 26 0.1239 27 0.0834 28 0.0587 29 0.0420 30 0.0295 31 0.0205 32 0.0159 33 0.0123 34 0.0087 35 0.0074 36 0.0059 37 0.0048 38 0.0035 39 0.0028 40 0.0035 41 0.0021 42 0.0018 43 0.0014 44 0.0014 45 0.0014 46 0.0015 47 0.0011 48 0.0009

Appendix to hun newscrawl 2011: Words by length without multiplicity 44 49 0.0007 Appendix to hun wikipedia 2007: Words by length without multiplicity Percentage of words of fixed length in characters, counted without multiplicty Average word length 10.1330 word length percentage 1 0.0607 2 0.3164 3 1.3221 4 2.7984 5 4.7686 6 6.7770 7 9.1960 8 10.4683 9 11.1370 10 10.7591

Appendix to hun wikipedia 2007: Words by length without multiplicity 45 11 10.1257 12 8.8521 13 7.1553 14 5.6192 15 4.0812 16 2.9084 17 2.0023 18 1.3484 19 0.8792 20 0.5828 21 0.3841 22 0.2709 23 0.1713 24 0.1259 25 0.0865 26 0.0673 27 0.0440 28 0.0344 29 0.0246 30 0.0194 31 0.0172 32 0.0116 33 0.0111 34 0.0095 35 0.0066 36 0.0073 37 0.0042 38 0.0032 39 0.0020 40 0.0020 41 0.0016 42 0.0012 43 0.0011 44 0.0015 45 0.0009 46 0.0009 47 0.0008 48 0.0003 49 0.0003

Appendix to hun wikipedia 2007: Words by length without multiplicity 46 50 0.0003 Appendix to hun wikipedia 2012: Words by length without multiplicity Percentage of words of fixed length in characters, counted without multiplicty Average word length 10.6209 word length percentage 1 0.0376 2 0.2007 3 1.0566 4 2.3085 5 4.2301 6 6.0149 7 8.3116 8 9.5587 9 10.6302 10 10.3331

Appendix to hun wikipedia 2012: Words by length without multiplicity 47 11 9.9490 12 9.0434 13 7.6963 14 6.2370 15 4.7923 16 3.5551 17 2.5467 18 1.7644 19 1.1957 20 0.8029 21 0.5421 22 0.3702 23 0.2446 24 0.1674 25 0.1179 26 0.0847 27 0.0558 28 0.0412 29 0.0318 30 0.0249 31 0.0185 32 0.0136 33 0.0112 34 0.0098 35 0.0078 36 0.0066 37 0.0046 38 0.0042 39 0.0021 40 0.0030 41 0.0023 42 0.0022 43 0.0011 44 0.0013 45 0.0019 46 0.0013 47 0.0012 48 0.0009 49 0.0006

Appendix to hun wikipedia 2012: Words by length without multiplicity 48 50 0.0004 Appendix to hun web 2003: Words by length without multiplicity Percentage of words of fixed length in characters, counted without multiplicty Average word length 12.0009 word length percentage 1 0.0031 2 0.0653 3 0.4910 4 1.1992 5 2.4167 6 3.5336 7 5.2950 8 7.0422 9 8.4468 10 9.5285

Appendix to hun web 2003: Words by length without multiplicity 49 11 10.1834 12 10.0057 13 9.1915 14 7.9689 15 6.5255 16 5.1051 17 3.8916 18 2.8511 19 2.0439 20 1.4419 21 0.9997 22 0.6789 23 0.4614 24 0.3090 25 0.2121 26 0.1424 27 0.0939 28 0.0648 29 0.0462 30 0.0325 31 0.0234 32 0.0155 33 0.0118 34 0.0096 35 0.0070 36 0.0058 37 0.0046 38 0.0034 39 0.0029 40 0.0023 41 0.0022 42 0.0018 43 0.0014 44 0.0014 45 0.0011 46 0.0008 47 0.0008 48 0.0007 49 0.0005

Appendix to hun web 2003: Words by length without multiplicity 50 50 0.0005 Appendix to hun web 2011: Words by length without multiplicity Percentage of words of fixed length in characters, counted without multiplicty Average word length 11.1426 word length percentage 1 0.0139 2 0.1397 3 0.8852 4 1.9202 5 3.4928 6 4.9107 7 6.8416 8 8.6281 9 9.6471 10 10.2540

Appendix to hun web 2011: Words by length without multiplicity 51 11 10.2832 12 9.5825 13 8.3118 14 6.8602 15 5.3757 16 4.0056 17 2.9485 18 2.0787 19 1.4360 20 0.9804 21 0.6586 22 0.4429 23 0.2949 24 0.1911 25 0.1356 26 0.0932 27 0.0648 28 0.0449 29 0.0338 30 0.0241 31 0.0167 32 0.0133 33 0.0107 34 0.0072 35 0.0064 36 0.0047 37 0.0042 38 0.0040 39 0.0028 40 0.0023 41 0.0022 42 0.0021 43 0.0014 44 0.0011 45 0.0009 46 0.0008 47 0.0009 48 0.0008 49 0.0007

Appendix to hun web 2011: Words by length without multiplicity 52 50 0.0005 Appendix to hun mixed 2012: Words by length without multiplicity Percentage of words of fixed length in characters, counted without multiplicty Average word length 11.9199 word length percentage 1 0.0090 2 0.0600 3 0.4680 4 1.2708 5 2.9164 6 4.2836 7 5.9756 8 7.4197 9 8.5243 10 9.1628

Appendix to hun mixed 2012: Words by length without multiplicity 53 11 9.5632 12 9.3395 13 8.6278 14 7.5529 15 6.2875 16 5.0143 17 3.8927 18 2.9039 19 2.1136 20 1.5210 21 1.0790 22 0.7544 23 0.5240 24 0.3577 25 0.2527 26 0.1724 27 0.1172 28 0.0829 29 0.0593 30 0.0421 31 0.0304 32 0.0220 33 0.0168 34 0.0126 35 0.0102 36 0.0081 37 0.0065 38 0.0051 39 0.0039 40 0.0039 41 0.0032 42 0.0025 43 0.0022 44 0.0019 45 0.0018 46 0.0014 47 0.0014 48 0.0010 49 0.0010

Appendix to hun mixed 2012: Words by length without multiplicity 54 50 0.0009 Appendix to hun news 2007: Words by length with multiplicity Percentage of words of fixed length in characters, counted with multiplicty Average word length 6.2632 word length percentage 1 10.5348 2 9.7344 3 7.3793 4 8.6251 5 9.6600 6 9.3720 7 9.6020 8 8.2627 9 7.2115 10 5.7667

Appendix to hun news 2007: Words by length with multiplicity 55 11 4.3645 12 3.3182 13 2.1719 14 1.4659 15 0.9114 16 0.5957 17 0.3651 18 0.2178 19 0.1551 20 0.0945 21 0.0637 22 0.0385 23 0.0253 24 0.0195 25 0.0090 26 0.0092 27 0.0047 28 0.0027 29 0.0070 30 0.0022

Appendix to hun news 2008: Words by length with multiplicity 56 Appendix to hun news 2008: Words by length with multiplicity Percentage of words of fixed length in characters, counted with multiplicty Average word length 6.2514 word length percentage 1 10.3686 2 9.7782 3 7.4001 4 8.5566 5 9.8859 6 9.5245 7 9.5421 8 8.3801 9 7.1540 10 5.7695 11 4.3193 12 3.2796