Context-Aware Correction of Spelling Errors in Hungarian Medical Documents

Borbála Siklósi 1, Attila Novák 1,2, Gábor Prószéky 1,2
1 Pázmány Péter Catholic University, Faculty of Information Technology
2 MTA-PPKE Language Technology Research Group
{siklosi.borbala, novak.attila, proszeky}@itk.ppke.hu
This work was partially supported by TÁMOP 4.2.1.B 11/2/KMR-2011 0002.

Outline
- Spelling errors
- SMT applied to automatically correct errors
- Translation models
- Language model
- Decoding
- Results
- Problems

Spelling errors
[Figure: excerpt of a clinical record with annotated problem spots. Callouts: Misspelling? Series of abbreviations. Abbreviation? Latin word? End of sentence? Abbreviation or missing comma?]

Spelling errors
- mistyping: accidentally swapping letters, inserting extra letters, or omitting some;
- missing or improper punctuation (e.g. unmarked sentence boundaries, missing commas, no space between punctuation and the neighboring words);
- grammatical errors;
- sentence fragments;
- domain-specific, often ad hoc abbreviations that usually do not correspond to any standard.

Previous work
Generating correction candidates:
- word forms at edit distance one from the original
- suggestions of the morphology
Scores:
- weighted language models
- stopword list
- automatically generated abbreviation list
- judgement of the morphology: accepted form; non-accepted form (if frequent, then correct)
- general and domain-specific word lists
- weighted edit distance (neighbouring keys, etc.)
- features of the original form
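The candidate generation step listed above (word forms at edit distance one) can be illustrated with a minimal sketch. The alphabet, the tiny lexicon, and the function names are assumptions made for illustration; the slides do not specify how the original generator was implemented, and the weighting by neighbouring keys is not modelled here.

```python
# Hypothetical alphabet: lowercase Hungarian letters (an assumption, not the authors' inventory).
ALPHABET = "aábcdeéfghiíjklmnoóöőpqrstuúüűvwxyz"


def edits1(word, alphabet=ALPHABET):
    """Return all strings one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + swaps + replaces + inserts)


def candidates(word, lexicon):
    """Keep only the edit-distance-one forms that a (hypothetical) lexicon accepts."""
    return sorted(edits1(word) & lexicon)


# Toy lexicon standing in for the word lists and the morphology.
lexicon = {"hosszúságú", "hosszúsága", "hosszúsági", "hosszúság", "nyálkahártyák"}
print(candidates("hosszúságu", lexicon))
print(candidates("nyállkahártyák", lexicon))
```

In a fuller system each surviving candidate would then be scored with the features listed under "Scores"; this sketch only covers the generation half.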

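Application of a machine translation system
Erroneous sentence: $E = e_1, e_2, \ldots, e_n$
Correct sentence: $C = c_1, c_2, \ldots, c_k$
Corrected sentence: $\hat{C} = c_1, c_2, \ldots, c_k$
$$\hat{C} = \operatorname{argmax}_C P(C \mid E) = \operatorname{argmax}_C \frac{P(E \mid C)\,P(C)}{P(E)}$$
The argmax is the decoding step, $P(E \mid C)$ is the translation (correction) model, and $P(C)$ is the language model.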
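The noisy-channel formula can be made concrete with a small sketch. Since P(E) is constant for a given input, it drops out of the argmax, and the decoder only needs to maximise log P(E|C) + log P(C). The code below assumes word-by-word, monotone decoding with a bigram language model; all probabilities and names are toy values chosen for illustration, not figures from the presented system.

```python
import math


def viterbi_correct(tokens, channel, bigram_logp):
    """Monotone noisy-channel decoding.

    channel[e] maps an observed token e to {candidate c: P(e | c)};
    bigram_logp(prev, cur) returns log P(cur | prev).
    """
    # best[c] = (score, path) of the best correction ending in candidate c
    best = {"<s>": (0.0, [])}
    for e in tokens:
        options = channel.get(e, {e: 1.0})  # unknown tokens are kept unchanged
        new_best = {}
        for c, p_e_given_c in options.items():
            new_best[c] = max(
                (score + math.log(p_e_given_c) + bigram_logp(prev, c), path + [c])
                for prev, (score, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]


# Toy language model: a few bigram probabilities, a small floor for everything else.
BIGRAMS = {("cseppent", "előírás"): 0.5, ("előírás", "szerint"): 0.6}


def toy_bigram_logp(prev, cur):
    return math.log(BIGRAMS.get((prev, cur), 0.01))


# Toy correction model P(observed | intended).
channel = {
    "csppent": {"cseppent": 0.9, "csppent": 0.1},
    "előírés": {"előírás": 0.6, "előírés": 0.4},
}

result = viterbi_correct(["csppent", "előírés", "szerint"], channel, toy_bigram_logp)
print(result)
```

The presented system delegates this search to an SMT decoder; the sketch only shows how the two model scores combine in the argmax.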

Translation models
- TM for general words
- TM for abbreviations
- TM for joining errors

Translation model for general words
Instead of being learned from a parallel corpus, the model uses the ranked suggestions of the correction candidate generator.
The first 20 suggestions are kept, and their scores are converted into a quasi-probability distribution.
Original     Suggestion    Score
hosszúságu   hosszúsági    0.01649
hosszúságu   hosszúságú    0.01560
hosszúságu   hosszúsága    0.01353
hosszúságu   hosszúságuk   0.01317
hosszúságu   hosszúságul   0.01292
hosszúságu   hosszúságé    0.01284
hosszúságu   hosszúság     0.01034
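One way to turn ranked candidate scores into phrase-table entries is sketched below. The slides do not say how the scores are transformed, so the simple rescaling used here (divide by the sum of the kept scores) is an assumption, as are the raw scores and the function names. The output uses the `source ||| target ||| score` layout of Moses-style phrase tables.

```python
def to_quasi_probabilities(scored, top_n=20):
    """Keep the top_n candidates and rescale their (positive) scores to sum to 1."""
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    total = sum(score for _, score in ranked)
    return [(cand, score / total) for cand, score in ranked]


def phrase_table_lines(source, scored, top_n=20):
    """Format the rescaled candidates as 'source ||| target ||| score' lines."""
    return [
        f"{source} ||| {cand} ||| {p:.5f}"
        for cand, p in to_quasi_probabilities(scored, top_n)
    ]


# Hypothetical raw generator scores for one misspelled word.
raw_scores = {"hosszúsági": 3.2, "hosszúságú": 3.0, "hosszúsága": 2.6, "hosszúság": 2.0}
for line in phrase_table_lines("hosszúságu", raw_scores):
    print(line)
```

The scores shown on the slide do not sum to 1 over the listed candidates, which is consistent with the "quasi-probability" wording; the exact scaling used by the authors may differ from this sketch.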

Translation model for abbreviations
Problems:
- the same word or phrase usually appears in several different abbreviated forms in the text
- the suggestion generator would prefer to transform the original abbreviation into a very frequent, similar common word

Solution: Translation model for abbreviations
- Collecting abbreviations and integrating them into the morphological analyzer
- Collecting potential variations for each abbreviation and deriving a probability score based on their frequencies
Original   Variant   Score
conj.      conj.     0.6078
conj       conj.     0.8696
conj       conj      0.1303
mko        mko       0.4891
mko        mko.      0.9970
mko.       mko.      0.9993
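A frequency-based score for abbreviation variants could be derived roughly as follows. This is only one plausible reading of "deriving a probability score based on their frequencies": the sketch uses plain relative frequencies per written form, and the counts are invented. The slide's own scores may have been computed differently.

```python
from collections import Counter


def variant_probabilities(observations):
    """Relative frequency of each (written, normalized) pair among pairs sharing `written`.

    observations: iterable of (written_form, normalized_form) pairs.
    """
    pair_counts = Counter(observations)
    written_totals = Counter(written for written, _ in pair_counts.elements())
    return {
        (written, normalized): count / written_totals[written]
        for (written, normalized), count in pair_counts.items()
    }


# Hypothetical counts: "conj" was normalized to "conj." three times and left as "conj" once.
observations = [("conj", "conj.")] * 3 + [("conj", "conj")] + [("mko.", "mko.")] * 2
probs = variant_probabilities(observations)
print(probs)
```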

Translation model for joining errors
- Exploits the possibilities of phrase-based translation
- When a space is inserted, a score is calculated for each resulting word, and their average is used as the score of the phrase
Original     Suggestion    Score
soronkívül   soron kívül   0.02074
soronkívül   soronkívül    0.01459
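The averaging rule for split candidates can be sketched as below. The per-word scores and the function name are hypothetical; only the rule itself (score each word, average over the phrase) comes from the slide.

```python
def split_candidates(word, word_score):
    """Return (phrase, score) for every two-word split whose parts both have a score.

    The phrase score is the average of the two word scores.
    """
    out = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in word_score and right in word_score:
            average = (word_score[left] + word_score[right]) / 2
            out.append((f"{left} {right}", average))
    return sorted(out, key=lambda item: item[1], reverse=True)


# Hypothetical per-word scores from the candidate generator.
word_score = {"soron": 0.03, "kívül": 0.01}
splits = split_candidates("soronkívül", word_score)
print(splits)
```

Because the result is a multi-word target for a single-word source, it fits naturally into a phrase table, which is what makes the phrase-based setup convenient for this error type.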

Language model
- It weights the sequences of corrected words according to how they actually occur in the language (lexical context)
- Problem: the language model can only be built from noisy texts, but clinical text contains far fewer distinct n-grams than general text:
                1-gram    2-gram    3-gram
General text    873951   4794135   7886616
Clinical text   275609   1409290   2440636
Number of distinct n-grams occurring in general Hungarian and in medical texts (in a corpus of 800,000 sentences)
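Counts of distinct n-grams like those in the table can be obtained with a few lines of code. The sketch assumes whitespace tokenization and no sentence-boundary markers, neither of which is stated in the slides, so the exact numbers it yields on a real corpus may differ from the reported ones.

```python
def distinct_ngram_counts(sentences, max_n=3):
    """Count distinct n-grams (n = 1..max_n) over whitespace-tokenized sentences."""
    seen = {n: set() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in seen:
            for i in range(len(tokens) - n + 1):
                seen[n].add(tuple(tokens[i:i + n]))
    return {n: len(grams) for n, grams in seen.items()}


counts = distinct_ngram_counts(["panasz nem volt .", "panasza nem volt ."])
print(counts)
```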

Decoding
Moses toolkit. Parameters can be set in the configuration file:
- weighting of the translation models (independent, high)
- weighting of the language model (lower)
- distortion limit (not allowed)
- penalty for difference in the length of the sentence
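How such weights interact can be shown with a small log-linear scoring sketch. This is not Moses code and not its configuration syntax: the weight values, the feature names, and the way the length penalty is computed (absolute difference in token count) are assumptions that only mirror the qualitative settings on the slide. The language model probabilities are toy values; the translation scores are taken from the joining-error example.

```python
import math

# Hypothetical weights: translation model high, language model lower, plus a length penalty.
WEIGHTS = {"tm": 1.0, "lm": 0.5, "length": 0.2}


def loglinear_score(hyp, source_len, weights=WEIGHTS):
    """Weighted sum of log model scores minus a penalty for changing sentence length.

    hyp: dict with 'tokens', 'tm_prob' and 'lm_prob'.
    """
    length_diff = abs(len(hyp["tokens"]) - source_len)
    return (
        weights["tm"] * math.log(hyp["tm_prob"])
        + weights["lm"] * math.log(hyp["lm_prob"])
        - weights["length"] * length_diff
    )


def best_hypothesis(hyps, source_len, weights=WEIGHTS):
    """Pick the hypothesis with the highest log-linear score."""
    return max(hyps, key=lambda h: loglinear_score(h, source_len, weights))


hyps = [
    {"tokens": ["soron", "kívül"], "tm_prob": 0.02074, "lm_prob": 0.3},
    {"tokens": ["soronkívül"], "tm_prob": 0.01459, "lm_prob": 0.001},
]
best = best_hypothesis(hyps, source_len=1)
print(best["tokens"])
```

Hypotheses are built left to right without reordering here, which corresponds to the "distortion not allowed" setting: correcting spelling never requires moving words.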

Results
Manually corrected test set of 2000 sentences (correctness is hard to judge)
System                    Accuracy
1st best suggestion       72.5%
SMT-based context-aware   88.28%

Examples
Example 1
Original sentence:   csppent előírés szerint,
Baseline correction: cseppent előír és szerint,
SMT correction:      cseppent előírás szerint,
Reference:           cseppent előírás szerint,
Example 2
Original sentence:   th : mko tovább 1 x duotrav 3 ü-1 rec, íb : 2 x azoipt 3 ü-1 rec
Baseline correction: th : mko tovább 1 x duotrav 3 ü-1 sec, kb : 2 x azoipt 3 ü-1 sec
SMT correction:      th. : mko tovább 1 x duotrav 3 ü-1 rec, kb : 2 x azopt 3 ü-1 rec
Reference:           th. : mko tovább 1 x duotrav 3 ü-1 rec, kb : 2 x azopt 3 ü-1 rec
Example 3
Original sentence:   /alsó m?fogsor.
Baseline correction: /alsó műfogsor.
SMT correction:      alsó műfogsor.
Reference:           alsó műfogsor.
Example 4
Original sentence:   vértelt nyállkahártyák, kp erezett conjuctiva, fehér sclera.
Baseline correction: vértelt nyálkahártyák, kp erezett conjunctiva, fehér sclera.
SMT correction:      vértelt nyálkahártyák, kp. erezett conjunctiva, fehér sclera.
Reference:           vértelt nyálkahártyák, kp. erezett conjunctiva, fehér sclera.

Problematic cases
Transforming a correct word into another correct one:
original sentence:  homályos látást panaszol. (s/he complains about blurred vision)
corrected sentence: homályos látás panaszok. (complaints of blurred vision)
original sentence:  panasz nem volt. (there were no complaints)
corrected sentence: panasza nem volt. (s/he didn't have any complaints)
Multiple errors in one word:
original sentence:  gyógyógyszerei : ld lázlap
corrected sentence: gyógyógyszerei : ld lázlap
reference sentence: gyógyszerei : ld. lázlap

Conclusion
In very noisy, domain-specific Hungarian clinical texts that are full of abbreviations, spelling errors were corrected automatically with high accuracy.

Conclusion
Further plans:
- integrating more information into the translation models, such as part-of-speech tags
- iteratively rebuilding the language model after correcting the corpus
- correcting multiple errors

Thank you! siklosi.borbala@itk.ppke.hu