Significance assessment in local sequence alignment with gaps. Ralf Bundschuh Department of Physics, The Ohio State University

Hasonló dokumentumok
Mapping Sequencing Reads to a Reference Genome

Statistical Inference

Correlation & Linear Regression in SPSS

On The Number Of Slim Semimodular Lattices

Performance Modeling of Intelligent Car Parking Systems

Phenotype. Genotype. It is like any other experiment! What is a bioinformatics experiment? Remember the Goal. Infectious Disease Paradigm

Genome 373: Hidden Markov Models I. Doug Fowler

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet. Hypothesis Testing. Petra Petrovics.

Cluster Analysis. Potyó László

CLUSTALW Multiple Sequence Alignment

Bioinformatics: Blending. Biology and Computer Science

Correlation & Linear Regression in SPSS

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet. Correlation & Linear. Petra Petrovics.

Construction of a cube given with its centre and a sideline

Választási modellek 3

Klaszterezés, 2. rész

A logaritmikus legkisebb négyzetek módszerének karakterizációi

Computer Architecture


Supporting Information

Supporting Information

Csima Judit április 9.

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet Nonparametric Tests

Bird species status and trends reporting format for the period (Annex 2)

Angol Középfokú Nyelvvizsgázók Bibliája: Nyelvtani összefoglalás, 30 kidolgozott szóbeli tétel, esszé és minta levelek + rendhagyó igék jelentéssel

Dependency preservation

Bevezetés a kvantum-informatikába és kommunikációba 2015/2016 tavasz

Longman Exams Dictionary egynyelvű angol szótár nyelvvizsgára készülőknek

Pletykaalapú gépi tanulás teljesen elosztott környezetben

Ensemble Kalman Filters Part 1: The basics

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet. Correlation & Regression

Lecture 11: Genetic Algorithms

discosnp demo - Peterlongo Pierre 1 DISCOSNP++: Live demo

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet Factor Analysis

Statistical Dependence

Report on the main results of the surveillance under article 11 for annex II, IV and V species (Annex B)

Geokémia gyakorlat. 1. Geokémiai adatok értelmezése: egyszerű statisztikai módszerek. Geológus szakirány (BSc) Dr. Lukács Réka

A jövőbeli hatások vizsgálatához felhasznált klímamodell-adatok Climate model data used for future impact studies Szépszó Gabriella

Local fluctuations of critical Mandelbrot cascades. Konrad Kolesko

Report on the main results of the surveillance under article 11 for annex II, IV and V species (Annex B)

A rosszindulatú daganatos halálozás változása 1975 és 2001 között Magyarországon

Unification of functional renormalization group equations

Report on the main results of the surveillance under article 11 for annex II, IV and V species (Annex B)

MATEMATIKA ANGOL NYELVEN

Tudományos Ismeretterjesztő Társulat

EN United in diversity EN A8-0206/419. Amendment

STUDENT LOGBOOK. 1 week general practice course for the 6 th year medical students SEMMELWEIS EGYETEM. Name of the student:

SZOFTVEREK A SORBANÁLLÁSI ELMÉLET OKTATÁSÁBAN

Unit 10: In Context 55. In Context. What's the Exam Task? Mediation Task B 2: Translation of an informal letter from Hungarian to English.

A modern e-learning lehetőségei a tűzoltók oktatásának fejlesztésében. Dicse Jenő üzletfejlesztési igazgató

Sebastián Sáez Senior Trade Economist INTERNATIONAL TRADE DEPARTMENT WORLD BANK

Eladni könnyedén? Oracle Sales Cloud. Horváth Tünde Principal Sales Consultant március 23.

Véges szavak általánosított részszó-bonyolultsága

Report on the main results of the surveillance under article 11 for annex II, IV and V species (Annex B)

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet. Nonparametric Tests. Petra Petrovics.

Professional competence, autonomy and their effects

MATEMATIKA ANGOL NYELVEN MATHEMATICS

Manuscript Title: Identification of a thermostable fungal lytic polysaccharide monooxygenase and

Using the CW-Net in a user defined IP network

OROSZ MÁRTA DR., GÁLFFY GABRIELLA DR., KOVÁCS DOROTTYA ÁGH TAMÁS DR., MÉSZÁROS ÁGNES DR.

Descriptive Statistics

First experiences with Gd fuel assemblies in. Tamás Parkó, Botond Beliczai AER Symposium

EEA, Eionet and Country visits. Bernt Röndell - SES

ELEKTRONIKAI ALAPISMERETEK ANGOL NYELVEN

Számítógéppel irányított rendszerek elmélete. A rendszer- és irányításelmélet legfontosabb részterületei. Hangos Katalin. Budapest

Lopocsi Istvánné MINTA DOLGOZATOK FELTÉTELES MONDATOK. (1 st, 2 nd, 3 rd CONDITIONAL) + ANSWER KEY PRESENT PERFECT + ANSWER KEY

Effect of the different parameters to the surface roughness in freeform surface milling

7 th Iron Smelting Symposium 2010, Holland

PETER PAZMANY CATHOLIC UNIVERSITY Consortium members SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER

Dinamikus rendszerek identifikációja genetikus programozással

Supplementary Table 1. Cystometric parameters in sham-operated wild type and Trpv4 -/- rats during saline infusion and

FÖLDRAJZ ANGOL NYELVEN GEOGRAPHY

Tudományos Ismeretterjesztő Társulat

FÖLDRAJZ ANGOL NYELVEN

Decision where Process Based OpRisk Management. made the difference. Norbert Kozma Head of Operational Risk Control. Erste Bank Hungary

Report on the main results of the surveillance under article 11 for annex II, IV and V species (Annex B)

Utolsó frissítés / Last update: február Szerkesztő / Editor: Csatlós Árpádné

Efficient symmetric key private authentication

Néhány folyóiratkereső rendszer felsorolása és példa segítségével vázlatos bemutatása Sasvári Péter

Meteorológiai ensemble elırejelzések hidrológiai célú alkalmazásai

FOSS4G-CEE Prágra, 2012 május. Márta Gergely Sándor Csaba

ANGOL NYELV KÖZÉPSZINT SZÓBELI VIZSGA I. VIZSGÁZTATÓI PÉLDÁNY

KN-CP50. MANUAL (p. 2) Digital compass. ANLEITUNG (s. 4) Digitaler Kompass. GEBRUIKSAANWIJZING (p. 10) Digitaal kompas

Word and Polygon List for Obtuse Triangular Billiards II

Adatbázisok 1. Rekurzió a Datalogban és SQL-99

(NGB_TA024_1) MÉRÉSI JEGYZŐKÖNYV

A BÜKKI KARSZTVÍZSZINT ÉSZLELŐ RENDSZER KERETÉBEN GYŰJTÖTT HIDROMETEOROLÓGIAI ADATOK ELEMZÉSE

ANGOL NYELV KÖZÉPSZINT SZÓBELI VIZSGA I. VIZSGÁZTATÓI PÉLDÁNY

KELET-ÁZSIAI DUPLANÁDAS HANGSZEREK ÉS A HICHIRIKI HASZNÁLATA A 20. SZÁZADI ÉS A KORTÁRS ZENÉBEN

Rezgésdiagnosztika. Diagnosztika

ENROLLMENT FORM / BEIRATKOZÁSI ADATLAP

Kvantum-informatika és kommunikáció 2015/2016 ősz. A kvantuminformatika jelölésrendszere szeptember 11.

KELER KSZF Zrt. bankgarancia-befogadási kondíciói. Hatályos: július 8.

Dense Matrix Algorithms (Chapter 8) Alexandre David B2-206

FÖLDRAJZ ANGOL NYELVEN GEOGRAPHY

: az i -ik esélyhányados, i = 2, 3,..I

Probabilistic Analysis and Randomized Algorithms. Alexandre David B2-206

Számítógéppel irányított rendszerek elmélete. Gyakorlat - Mintavételezés, DT-LTI rendszermodellek

Expansion of Red Deer and afforestation in Hungary

FIATAL MŰSZAKIAK TUDOMÁNYOS ÜLÉSSZAKA

Átírás:

Significance assessment in local sequence alignment with gaps Ralf Bundschuh Department of Physics, The Ohio State University Collaborators: T. Hwa, R. Olsen, Physics Department, UC San Diego S. Altschul, Y.K. Yu, National Center for Biotechnology Information, NIH Outline: Biological sequences and what to do with them Sequence alignment Statistical significance Central conjecture Applications Conclusions Funding: NSF, DAAD, Beckman foundation

Biological sequences and what to do with them I The data: a piece of human chromosome...tctacatgtaaaaatatgtatttttaaaaattggatgtcatgggctgggtgtggtggctcatgcctgtaatcccagcactttgggaggccgaggcgggtggatctcctgaggtcg GGAGTTCGAGACCAGCCTGACCAACATGGAGAAACCCCGTCTCTACTAAAAACACAAAATTATCCAGGCATGGTGGCACATGACTGTAATCCCAGCTACTAGGGAGGCTGAGGCAGGA GAAACACTTGAACCTGGGAGGCGGAGGTTGAGGTGAGCCGAGATCGCGCCATTGCACTCCAGCCTGGGCAACAAGAGTGAAACTCGGTCTCAAAAAAAAAAAATTTGATGTTATGGAA GTAGGGAGACAAAAAATGCTCTACAACTATTAACTGATGCTTTTCTGGTTTTGTTCTCCAGACACCATTCGCTTTTCACCCAAGATGATTTGATGTCTTATAAAACTCTGATGAACCA TGATGGCTACACAGACATTAAGTATAGACAGCTATCAAGATGGGCAACAGGTGAGCTTGAACTTGATTCTGCATTCTAATTACAAATCAACCTGGCACTCAAGCATGAACATTGCTTT GTATACTTGCAATTCAATTGCCATGAGGTTGCATGCTCAGTGTTAGTGTATTATGCATTTATTGTACATTCGTGTTCAGAAAAAAAGCCATAGAATAATACTATTTCGTTAACTGATA CCAAGATTGCCAGGAATCTTGACTTCCCTAAGTCATATGACAGTTTCTTGGGAATTTACCTTTTTAATGTCAGTGTTAATTAGCACTGTTACTTTGAAAGAAAACCCGGTTGATTTTC ATGATGACAGATTCCCATGTTGACTGGTGGCTCTTCTGAGTGTCTAACTGGATCAGCTTTTGAATGGGAATCTTGTAGCCTCGTCTCCCCAGTTGTAGGCATGAGAGGGGCTGTCCCA GTAATGAATTTGCAGGGGCCCCAGTGCTCTATCTTTGTACCTTGCTCGTGCTTGGATGGTTGTGCCATACACGGGCAGCTCTCCATTGCCCTCCCACCATAGATGAGACTTTGTTCTC CTTTCTTGGCCTGGAAGCTGTGGTGTTTTGTGCTTTTGAGTATCTGAGTGTTTTGTGTTCTGTGACCTGAATGAATTGAGGAGCAGGTGGATCGAGACTTGGCTGAGGCCCTTGTGGT TTGCGATCTTGTTAAACACGGTGTTCTGAACCCACTGGCATTTGGCTCATCATCCCACTGACTCTGGAGCCAGTGAAGGGATTTGGCCCTGCCCTTTACTTTCCTGCCCAGCAGGCAG GGGCAGTGCAGTACACCCCCCTCGGCTCTCCTCCCCACCTCGAGGACTTCGGTGGCAAGGATCAGGCTCCGGAAAACTCACTGGAGCCATGCTGGTGAGGTCTGAGGAGGGGTTAGGA GCTGAGGCGCTGGGTCCCCCTTTCCCTGGTGGTTAGTTTTACCAACCAGTCCTTGTTCAGTTCCTGTGGCAGAGATTTTTGTTGTTGGTGGTGGTGGTGTTAGTGTTTTTTTTTCTCT GTATAGCAATTAAAGGAGGGAGATTCTGTGATGTAGTCAGCCTGCTTCCTTAGCCTAGAAGTCCTTAGTCCTTTGGTATTTCCAATTGACTTTTTTTTTTTTTTCTAAAATGCAAATC TAATAATGTCCCGCCTGAGCTCTCCAGTGGCTCCCTGTGGATTCCTGTGGGTTTCATGCAAAGACTGAGCTGCTCTGTGGCCCGAATCGTCTGGCCCCTCTGGACCCCAGGACGCCCC CAACATCTCTGCCTGGCATATCTTGGGCACCTCTCTGCCTGCCCTGGAGCACCGGCCTCACTGTTCCCATCACTCTTCTCCCTCCTGCCTGCCAGGTCTTTGCTCCGACCCCACTGCT GCCTCCTGTGCACCAAGGCACGGTGACCACCTCCAACACAGCCTGGTTGCTACCAGCCACCTCCTCCCAGGCAGCTGTGCCAGGTGCAGATGACACCTGGAGCACTGCCCTTTTCATA CCCGAGTGTTTCCAAGGGCCTCGGAAGTGTTTAATCAGCATTATTTTAAATAAACATTGAAATATATCTACAGCGTAGACCTATCATAATTATTTTGCCATTTTTCCAAGGTTGAACA TTTAGGTTTCTCTCTTTTCACAATCATTTTTTTTTCAAATACTGAAATGAATCTTTTAAGGCTTCTTATTTTTTATTATTTATTTATTTACTTATTTATTTATTTTGAGACAGAGTCT TGCTCTGTTGCCCCGGCTGGAGTGCAGTGGTGTGATCTCAGCTCACTGCAACCTCCGTCTCCCAGGTTCAAGCAATTCTCCTGTCTCAGCCTCCTGAGTACCTGGGATTACAGGTGTG TGCCACCACGCCCAGCTAATTTTTTTGTATTTTTAGTAGAGACAGGGTTTCACCATGTTGGCCAGGCTGATCTTGAACCCCTGACCTCAGGTGATCCGCCCACCTTGGCCTCCCAAAG TGCTGGGATCACAGGCATAAGCCACTGTGCCTGGCCTTTTTAAGGCTTTTTATACATGCTGGCAGATTGCCTTACACAAATACTGTGTCCACTTAGGCTTTATTGCTTTTATTTTTTG GATTTTTTAAGAGAAACATAAACAGTTTTCCTAATATGTTGTACCATTTAAAGGCAGCAGAATAGAAGTCATCTTATTGCAAAAACAAGACATTGGAGAGAGAGCACAGGGCTGGAGG ATGTGAGAGGCGTCCTGTGCGGGTGGGCGTTCATGGCTGGCCCCCAGTCTGTCTGGACAGTGGGGATGGCCCCGCTCCCATGAGGTCTCCCCGCCCCCGCTGCCCCAAGCTGCTTCCT CAAGGGGCAGAAGCATGGCCAAATCCACCGCGGGAGAAATGGCCCGTCCTGGTCCTGAGGAAGCTGAGGTCAGGACAGTCTAATCTGCTGCTCATGGATAACTAGAAGTTTACTTTCA CGAAATTTTGTTTTTGTAAACTGATTTTTTTTAACGATTTAAATGTTTTTTACCTAAATGACAAAGGCATTGCTTGTTTAAAGCAGTTTAAATGATAGTATCTTTTAAGGCTTTAAGT AAACACAGCTGGCCTTTTCCTTTCTGAATGCAGTGACATTTTTATGGCTATGTATTGCTGAGGTTTGAGGGTAGATATGGGAGAAGTTCAACCTTGTCCCAAATATGTAGCGTATGGG TTAGGTTGTGTCTGTGACATGGTAAGAAGACCTTGGACTATTTGTCTATGCTTCTCTGACTGTAGTATTGCTACACGTGAGGAGATATAGCAATAGACGGAGGAGTAGATTGCACGTG TGCCGTTCTCTTCATTAGGCCATCGCCAGTGTTATTTTAAAATCATGAGCATTTCACAGACATTGGTTATCTCAGTTCTTAAGCAGTTTGGAAATTTTACCTTTATTTTAAAGCACAT TAAATTCAAAGCTGTTTTGTTGGTAGCATTCTTTACATATGATTGCAGTCATACTGCTGTCTGGCAATTCCCCACTCCAACTGCATTCCTGTTGGAGCATGGAATACCTCGTAGTAGT TTGTGTTTGCTAATAAGAATTATCTTCCTCCTTTTACAGATGCAAGTAGTAACAGAGTTAAAGACAGAACAAGATCCAAACTGCTCTGAACCCGATGCAGAAGGAGTGAGCCCTCCCC CTGTGGAGTCTCAGACCCCGATGGATGTGGACAAGCAGGCCATTTATAGGTAGTGCCCGGGGTGCCGGCTGTGGGGGCCTCAGAGGAGTGAGGAGTGTCCACAAGTTTGTTAAAAGGA ACACCAGAACTAGTAGCGAGGCAGAGTTGTGAGGTGCCATTATGATGTGATAACTGCTCGGCTGAAAATTTTCTCAGATTAAGATGTTATTTCTGGCCAGGTGTGGTGGCTCATGCCT GTAATCCTAGCACTTTGGGAGGCTGAGGCGGGCGGATCACTTGAGGTTAAGAGTTTGAGACCAGCCTGGCCAACATGGTGAAACTCCATCTCTACTAA... Sequence of 3 billion A, C, G, and T s The problem: What does it mean?

Biological seqeunces and what to do with them II 3 How do we find out what a sequence does? First step: Determine which part codes for proteins and determine protein sequence solved, e.g., myoglobin: MGLSDGEWQLVLHVWAKVEADVAGHGQDILIRLF... Second step: compare sequence to sequences of genes (of other organisms) with known function Sequence alignment algorithms...lvlhvw......lvagdw... Σ= Calculate score Σ for every pair of sequences. global: Needleman and Wunsch, 97, local: Smith and Waterman, 98 Sequences similar enough putative functional similarity Putative function to be confirmed by experiment Everybody does it: paper presenting most widely used program BLAST is most cited paper in all of science written in the 9 s Altschul et al., 99

Sequence alignment I 4 How do sequence alignment algorithms work? AGMKCYDHPSARQAW Want to find sequence similarities, e.g., KDAGVMKYEHPSQRW Need way to compare individual letters scoring matrix s a,b Has to represent similarities between letters DNA-DNA comparison: for a = b s a,b = µ for a b Protein-protein: BLOSUM or PAM matrices Many parameters a p( a ) sa,b for PAM A R N D C Q E G H I L K M F P S T W Y V 3.5%.3%.%.4%.87%.9%.84% 3.3%.99%.3% 4.6%.59%.%.74%.34% 3.%.63%.6%.45%.9% 6 6 3 4 3 6 7 6 8 4 3 4 7 5 7 6 5 4 6 4 N(s) 5 4 3 5 5 s 6 9 7 6 6 3 7 6 PAM Combine entries from the scoring matrix to an alignment score Σ 4

Sequence alignment II 5 Simplest algorithm: gapless alignment Given pair of sequences a...a N and b...b N Assign score to pair of substrings a i l+...a i and b j l+...b j S[i, j, l] l k= s ai k,b j k i= l=6 AGMKC YDHPS ARQAW KDAGVMK YEHPS QRW j=3 Optimal alignment has score Σ max i,j,l S(,3,6)=+3+6+6++=7 S[i, j, l] Clever way to find optimal alignment S i,j max l S[i, j, l] S i,j = max{s i,j + s ai,b j, } Σ = max i,j S i,j

Statistical Significance I 6 Problem: Algorithms assign score Σ to every sequence pair, even random ones Which Σ implies biological relationship? Statistical answer: find score distribution P (Σ) if algorithm is applied to random sequences random random AWDQR EAIPST P( Σ) Σ= Σ=7 Σ=3 Σ= = Σ =5 Σ Σ can assign probability p(σ) =Pr{Σ σ} = Choose highest score across a database of size N σ P (Σ)dΣ to each score new sequence known sequence Σ= known sequence Σ = P( Σ) largest Σ p > /N not similar enough known sequence k Σ=5 Σ p p < /N biologically relevant! known sequence N Σ=8

Significance Assessment II 7 Need score distribution P (Σ) for random sequences Obtain distribution by numerical simulation (shuffling method) random random AWDQR EAIPST P( Σ) Σ= Σ=7 Σ=3 Σ= = Σ =5 Σ Σ Generate many pairs of random sequences with correct letter composition Align each pair Take histogram of alignment scores Very time consuming (hours)

Significance Assessment III 8 Precompute distribution But: distribution depends on Scoring parameters, e.g. s a,b Sequence ensemble, e.g. p a Pre-computation only possible for some fixed set of scoring systems at overall amino acid/base pair frequencies Problems: unusual amino acid composition Iterative schemes (PSI-BLAST) Query Sequence Database Sequences Alignment Score Functions refine score fn Opt. Alignment signficance evaluation homologous sequences Position dependent scoring parameters Need fast numerical way or analytical theory to get score distribution

Significance Assessment IV 9 Analytical theory for gapless alignment: Gumbel or extreme value distribution Karlin and Altschul, Proc. Natl. Acad. Sci. 99 P( Σ ) P (Σ) =κλ exp[ λσ κe λσ ] e λ S Σ < Σ> ~ log λ κ universal shape Dependence on scoring system and letter frequencies in parameters κ, λ κ, λ known: e λs a,b p a p b e λs a,b = Corrections due to finite sequence length analytically known Makes gapless alignment powerful tool: BLAST Altschul et al., J. Mol. Biol. 99

Sequence alignment III Problem: during evolution pieces of sequences are inserted and deleted alignment algorithm has to compensate for this do gapped alignments to find weak similarities First: gapped global alignment Alignment = way of inserting gaps in sequences Score gaps by δ --AG-MKCYDHPSARQAW KDAGVMK-YEHPS--QRW Score S[A] for each alignment A S[A] = (a,b) A s a,b δn g In practice: affine gap cost δ + ɛk Find alignment with highest score Σ = max A S[A] Exponentially many alignments for each pair of sequences!

Sequence alignment IV Representation on an alignment grid: --AG-MKCYDHPSARQAW KDAGVMK-YEHPS--QRW Dynamic programming: O(N ) algorithm Needleman and Wunsch, 97 Σ = h(, N) C K Y D H P S A R Q W A (r,t)= (6,4) A G M (r,t)= (r,t)= (,) K (,3) D A G V M K Y E H P S Q (r,t)= R (,7) W h(r, t + ) = max{h(r ±,t) δ, h(r, t ) + s(r, t)} r t

Sequence alignment V Often only pieces of two sequences related local alignment with gaps Look at all pairs of subsequences a i...a j and b k...b l of the original sequences Calculate gapped global alignment score Σ(i, j, k, l) for each pair using Needleman-Wunsch algorithm Total alignment score is best of these Σ = max Σ(i, j, k, l) i,j,k,l O(N 6 ) algorithm? Again dynamic programming solution Smith and Waterman, 98 S j,l max Σ(i, j, k, l) i,k S j,l = max{s j,l δ, S j,l δ, S j,l + s aj,b k, } Σ = max j,k S j,k Practically used sequence comparison programs (BLAST, FASTA) are approximations of this algorithm

Significance Assessment V 3 Problem: Significance assessment less clear for alignment with gaps Shape of distribution numerically verified to be still Gumbel What are the values of the non-universal parameters λ and κ? Only very limited theories and heuristics Zhang, 995, Siegmund and Yakir, 999, Mott and Tribe, 999, Mott, Statistical physics problem: Find distribution of observable in disordered system "disorder" "physical system" random random AWDQR EAIPST "observable" Σ= Σ=7 Σ=3 Σ= Σ= Σ =5 Use statistical physics methods

Central conjecture 4 Central conjecture: The distribution of Σ is of the Gumbel form with the Gumbel parameter λ given by e λh(,n) = Properties: h(, N) free energy e λh(,n) Z λ replicated partition function hard to calculate analytically h(, N) < typical sequence pairs contribute zero to e λh(,n) rare events dominate e λh(,n) numerically hard Reduces to proven formula e λs = for gapless alignment Not proven, but applications work

Applications I 5 Application I: DNA-DNA comparison with µ = δ: n prob p Mapping possible to asymmetric exclusion process Model of highway traffic Exactly solvable model of non-equilibrium physics W r Can calculate e λh(,n) Gumbel λ given by + p exp[ λ ( + µ)] + p exp[ λ ( + µ)] exp[ λ µ]= ( ) λ.5 Compare calculated with simulated λ (p =/4, DNA) Provides test case for heuristic approaches.5 Equation (*) Simulation: uncorrelated Simulation: sequences 3 4 5 6 7 8 µ=δ BLAST p-value calculation changed since version..3 Altschul, RB, Olsen, and Hwa, Nucleic Acids Res.

Applications II 6 Application II: Smart numerics Estimate e λh(,n) by numerical sampling Instead of random sequence pairs (with h(, N) < ) use correlated sequence pairs with h(, N) > such that e λh(,n) corr is not dominated by rare events The difference between and corr can be handled analytically Estimate e λh(,n) for N = 6, N = 8, N = and extrapolate to N Results: Scoring system λ reference λ this algorithm time (min) BLOSUM45/4/.96±.8.978±.5 3: BLOSUM6//.67±..669±.6 5:49 BLOSUM8//.993±..34±.5 : PAM7//.9±.3.9±.3 :6 PAM3/9/.963±..954±. :7

Applications III 7 Application III: e λh(,n) = in general hard to fulfill Instead of calculating complicated quantity for Smith-Waterman alignment change the algorithm Smith-Waterman S(r +,t) δ, S(r,t) δ S(r, t+)=max S(r, t ) + s(r, t), Replace by hybrid algorithm, Σ = max r,t S(r, t) Z(r, t+) = Z(r+,t)e λ glδ + Z(r,t)e λ glδ, Σ = max r,t +Z(r, t )( e λ glδ )e λ gls(r,t) + log Z(r, t) Similar to: Viterbi probabilistic HMM (forward backward) RNA minimal energy (Zuker) partition function (Vienna) Guarantees e h(,n) = λ =independent of scoring system

Applications IV 8 Numerical test of hybrid statistics 5 i.i.d. amino acid sequences of length N, PAM- scoring matrix, + k gap cost D(Σ) Gumbel Fit % λ.5..5. Simulation Theory λ 5 5 5 Σ.95 4 6 8 N Score histogram is of Gumbel form λ =for large N Even sequence length dependence theoretically understood

Applications V 9 Works even for position-specific scoring systems E.g., protein family Hidden Markov Models from Pfam database Bateman et al., Nucleic Acids Res. D(Σ) Gumbel fit 5 5 Σ # occurences 6 5 4 3.6.8..4 λ Score histogram is of Gumbel form λ =within ±% for all, 6 models

Applications VI How is the performance in terms of sensitivity? Test algorithm on standard database: PDB9D-B Brenner et al., Proc. Natl. Acad. Sci. 998 Use SCOP as gold standard (known relations between sequences) Murzin et al., J. Mol. Biol. 995 Do pairwise alignments of all sequences in database Vary p-value cutoff and measure Number of relations found Coverage=, Number of total relations Ideal: high coverage at low errors per query Hybrid alignment s sensitivity is at least as good as that of other methods of wrong relations Errors per Query=Number Coverage.45.4.35.3.5 Hybrid gapped BLAST Smith Waterman SSEARCH FASTA ktup= FASTA ktup= Number of sequences ungapped BLAST Errors Per Query

Conclusions and outlook Sequence alignment is standard tool in...lvlhvw......lvagdw... molecular biology Σ= Sequence alignment algorithms rely on dynamic programming methods.5 Statistics of sequence alignments are important and poorly understood Statistical physics methods can be applied to characterize and improve sequence alignment Future directions Other analytical solutions for λ Better importance sampling to estimate λ Incorporate hybrid algorithm into PSI-BLAST λ # occurences.5 6 5 4 3 Equation (*) Simulation: uncorrelated Simulation: sequences 3 4 5 6 7 8 µ=δ.6.8..4 λ Statistics of other algorithms of computational biology (RNA folding, gene finding, clustering,...)