Significance assessment in local sequence alignment with gaps. Ralf Bundschuh Department of Physics, The Ohio State University

Significance assessment in local sequence alignment with gaps Ralf Bundschuh Department of Physics, The Ohio State University Collaborators: T. Hwa, R. Olsen, Physics Department, UC San Diego S. Altschul, Y.K. Yu, National Center for Biotechnology Information, NIH Outline: Biological sequences and what to do with them Sequence alignment Statistical significance Central conjecture Applications Conclusions Funding: NSF, DAAD, Beckman foundation

Biological sequences and what to do with them I The data: a piece of human chromosome...tctacatgtaaaaatatgtatttttaaaaattggatgtcatgggctgggtgtggtggctcatgcctgtaatcccagcactttgggaggccgaggcgggtggatctcctgaggtcg GGAGTTCGAGACCAGCCTGACCAACATGGAGAAACCCCGTCTCTACTAAAAACACAAAATTATCCAGGCATGGTGGCACATGACTGTAATCCCAGCTACTAGGGAGGCTGAGGCAGGA GAAACACTTGAACCTGGGAGGCGGAGGTTGAGGTGAGCCGAGATCGCGCCATTGCACTCCAGCCTGGGCAACAAGAGTGAAACTCGGTCTCAAAAAAAAAAAATTTGATGTTATGGAA GTAGGGAGACAAAAAATGCTCTACAACTATTAACTGATGCTTTTCTGGTTTTGTTCTCCAGACACCATTCGCTTTTCACCCAAGATGATTTGATGTCTTATAAAACTCTGATGAACCA TGATGGCTACACAGACATTAAGTATAGACAGCTATCAAGATGGGCAACAGGTGAGCTTGAACTTGATTCTGCATTCTAATTACAAATCAACCTGGCACTCAAGCATGAACATTGCTTT GTATACTTGCAATTCAATTGCCATGAGGTTGCATGCTCAGTGTTAGTGTATTATGCATTTATTGTACATTCGTGTTCAGAAAAAAAGCCATAGAATAATACTATTTCGTTAACTGATA CCAAGATTGCCAGGAATCTTGACTTCCCTAAGTCATATGACAGTTTCTTGGGAATTTACCTTTTTAATGTCAGTGTTAATTAGCACTGTTACTTTGAAAGAAAACCCGGTTGATTTTC ATGATGACAGATTCCCATGTTGACTGGTGGCTCTTCTGAGTGTCTAACTGGATCAGCTTTTGAATGGGAATCTTGTAGCCTCGTCTCCCCAGTTGTAGGCATGAGAGGGGCTGTCCCA GTAATGAATTTGCAGGGGCCCCAGTGCTCTATCTTTGTACCTTGCTCGTGCTTGGATGGTTGTGCCATACACGGGCAGCTCTCCATTGCCCTCCCACCATAGATGAGACTTTGTTCTC CTTTCTTGGCCTGGAAGCTGTGGTGTTTTGTGCTTTTGAGTATCTGAGTGTTTTGTGTTCTGTGACCTGAATGAATTGAGGAGCAGGTGGATCGAGACTTGGCTGAGGCCCTTGTGGT TTGCGATCTTGTTAAACACGGTGTTCTGAACCCACTGGCATTTGGCTCATCATCCCACTGACTCTGGAGCCAGTGAAGGGATTTGGCCCTGCCCTTTACTTTCCTGCCCAGCAGGCAG GGGCAGTGCAGTACACCCCCCTCGGCTCTCCTCCCCACCTCGAGGACTTCGGTGGCAAGGATCAGGCTCCGGAAAACTCACTGGAGCCATGCTGGTGAGGTCTGAGGAGGGGTTAGGA GCTGAGGCGCTGGGTCCCCCTTTCCCTGGTGGTTAGTTTTACCAACCAGTCCTTGTTCAGTTCCTGTGGCAGAGATTTTTGTTGTTGGTGGTGGTGGTGTTAGTGTTTTTTTTTCTCT GTATAGCAATTAAAGGAGGGAGATTCTGTGATGTAGTCAGCCTGCTTCCTTAGCCTAGAAGTCCTTAGTCCTTTGGTATTTCCAATTGACTTTTTTTTTTTTTTCTAAAATGCAAATC TAATAATGTCCCGCCTGAGCTCTCCAGTGGCTCCCTGTGGATTCCTGTGGGTTTCATGCAAAGACTGAGCTGCTCTGTGGCCCGAATCGTCTGGCCCCTCTGGACCCCAGGACGCCCC CAACATCTCTGCCTGGCATATCTTGGGCACCTCTCTGCCTGCCCTGGAGCACCGGCCTCACTGTTCCCATCACTCTTCTCCCTCCTGCCTGCCAGGTCTTTGCTCCGACCCCACTGCT GCCTCCTGTGCACCAAGGCACGGTGACCACCTCCAACACAGCCTGGTTGCTACCAGCCACCTCCTCCCAGGCAGCTGTGCCAGGTGCAGATGACACCTGGAGCACTGCCCTTTTCATA CCCGAGTGTTTCCAAGGGCCTCGGAAGTGTTTAATCAGCATTATTTTAAATAAACATTGAAATATATCTACAGCGTAGACCTATCATAATTATTTTGCCATTTTTCCAAGGTTGAACA TTTAGGTTTCTCTCTTTTCACAATCATTTTTTTTTCAAATACTGAAATGAATCTTTTAAGGCTTCTTATTTTTTATTATTTATTTATTTACTTATTTATTTATTTTGAGACAGAGTCT TGCTCTGTTGCCCCGGCTGGAGTGCAGTGGTGTGATCTCAGCTCACTGCAACCTCCGTCTCCCAGGTTCAAGCAATTCTCCTGTCTCAGCCTCCTGAGTACCTGGGATTACAGGTGTG TGCCACCACGCCCAGCTAATTTTTTTGTATTTTTAGTAGAGACAGGGTTTCACCATGTTGGCCAGGCTGATCTTGAACCCCTGACCTCAGGTGATCCGCCCACCTTGGCCTCCCAAAG TGCTGGGATCACAGGCATAAGCCACTGTGCCTGGCCTTTTTAAGGCTTTTTATACATGCTGGCAGATTGCCTTACACAAATACTGTGTCCACTTAGGCTTTATTGCTTTTATTTTTTG GATTTTTTAAGAGAAACATAAACAGTTTTCCTAATATGTTGTACCATTTAAAGGCAGCAGAATAGAAGTCATCTTATTGCAAAAACAAGACATTGGAGAGAGAGCACAGGGCTGGAGG ATGTGAGAGGCGTCCTGTGCGGGTGGGCGTTCATGGCTGGCCCCCAGTCTGTCTGGACAGTGGGGATGGCCCCGCTCCCATGAGGTCTCCCCGCCCCCGCTGCCCCAAGCTGCTTCCT CAAGGGGCAGAAGCATGGCCAAATCCACCGCGGGAGAAATGGCCCGTCCTGGTCCTGAGGAAGCTGAGGTCAGGACAGTCTAATCTGCTGCTCATGGATAACTAGAAGTTTACTTTCA CGAAATTTTGTTTTTGTAAACTGATTTTTTTTAACGATTTAAATGTTTTTTACCTAAATGACAAAGGCATTGCTTGTTTAAAGCAGTTTAAATGATAGTATCTTTTAAGGCTTTAAGT AAACACAGCTGGCCTTTTCCTTTCTGAATGCAGTGACATTTTTATGGCTATGTATTGCTGAGGTTTGAGGGTAGATATGGGAGAAGTTCAACCTTGTCCCAAATATGTAGCGTATGGG TTAGGTTGTGTCTGTGACATGGTAAGAAGACCTTGGACTATTTGTCTATGCTTCTCTGACTGTAGTATTGCTACACGTGAGGAGATATAGCAATAGACGGAGGAGTAGATTGCACGTG TGCCGTTCTCTTCATTAGGCCATCGCCAGTGTTATTTTAAAATCATGAGCATTTCACAGACATTGGTTATCTCAGTTCTTAAGCAGTTTGGAAATTTTACCTTTATTTTAAAGCACAT TAAATTCAAAGCTGTTTTGTTGGTAGCATTCTTTACATATGATTGCAGTCATACTGCTGTCTGGCAATTCCCCACTCCAACTGCATTCCTGTTGGAGCATGGAATACCTCGTAGTAGT TTGTGTTTGCTAATAAGAATTATCTTCCTCCTTTTACAGATGCAAGTAGTAACAGAGTTAAAGACAGAACAAGATCCAAACTGCTCTGAACCCGATGCAGAAGGAGTGAGCCCTCCCC CTGTGGAGTCTCAGACCCCGATGGATGTGGACAAGCAGGCCATTTATAGGTAGTGCCCGGGGTGCCGGCTGTGGGGGCCTCAGAGGAGTGAGGAGTGTCCACAAGTTTGTTAAAAGGA ACACCAGAACTAGTAGCGAGGCAGAGTTGTGAGGTGCCATTATGATGTGATAACTGCTCGGCTGAAAATTTTCTCAGATTAAGATGTTATTTCTGGCCAGGTGTGGTGGCTCATGCCT GTAATCCTAGCACTTTGGGAGGCTGAGGCGGGCGGATCACTTGAGGTTAAGAGTTTGAGACCAGCCTGGCCAACATGGTGAAACTCCATCTCTACTAA... Sequence of 3 billion A, C, G, and T s The problem: What does it mean?

Biological seqeunces and what to do with them II 3 How do we find out what a sequence does? First step: Determine which part codes for proteins and determine protein sequence solved, e.g., myoglobin: MGLSDGEWQLVLHVWAKVEADVAGHGQDILIRLF... Second step: compare sequence to sequences of genes (of other organisms) with known function Sequence alignment algorithms...lvlhvw......lvagdw... Σ= Calculate score Σ for every pair of sequences. global: Needleman and Wunsch, 97, local: Smith and Waterman, 98 Sequences similar enough putative functional similarity Putative function to be confirmed by experiment Everybody does it: paper presenting most widely used program BLAST is most cited paper in all of science written in the 9 s Altschul et al., 99

Sequence alignment I 4 How do sequence alignment algorithms work? AGMKCYDHPSARQAW Want to find sequence similarities, e.g., KDAGVMKYEHPSQRW Need way to compare individual letters scoring matrix s a,b Has to represent similarities between letters DNA-DNA comparison: for a = b s a,b = µ for a b Protein-protein: BLOSUM or PAM matrices Many parameters a p( a ) sa,b for PAM A R N D C Q E G H I L K M F P S T W Y V 3.5%.3%.%.4%.87%.9%.84% 3.3%.99%.3% 4.6%.59%.%.74%.34% 3.%.63%.6%.45%.9% 6 6 3 4 3 6 7 6 8 4 3 4 7 5 7 6 5 4 6 4 N(s) 5 4 3 5 5 s 6 9 7 6 6 3 7 6 PAM Combine entries from the scoring matrix to an alignment score Σ 4

Sequence alignment II 5 Simplest algorithm: gapless alignment Given pair of sequences a...a N and b...b N Assign score to pair of substrings a i l+...a i and b j l+...b j S[i, j, l] l k= s ai k,b j k i= l=6 AGMKC YDHPS ARQAW KDAGVMK YEHPS QRW j=3 Optimal alignment has score Σ max i,j,l S(,3,6)=+3+6+6++=7 S[i, j, l] Clever way to find optimal alignment S i,j max l S[i, j, l] S i,j = max{s i,j + s ai,b j, } Σ = max i,j S i,j

Statistical Significance I 6 Problem: Algorithms assign score Σ to every sequence pair, even random ones Which Σ implies biological relationship? Statistical answer: find score distribution P (Σ) if algorithm is applied to random sequences random random AWDQR EAIPST P( Σ) Σ= Σ=7 Σ=3 Σ= = Σ =5 Σ Σ can assign probability p(σ) =Pr{Σ σ} = Choose highest score across a database of size N σ P (Σ)dΣ to each score new sequence known sequence Σ= known sequence Σ = P( Σ) largest Σ p > /N not similar enough known sequence k Σ=5 Σ p p < /N biologically relevant! known sequence N Σ=8

Significance Assessment II 7 Need score distribution P (Σ) for random sequences Obtain distribution by numerical simulation (shuffling method) random random AWDQR EAIPST P( Σ) Σ= Σ=7 Σ=3 Σ= = Σ =5 Σ Σ Generate many pairs of random sequences with correct letter composition Align each pair Take histogram of alignment scores Very time consuming (hours)

Significance Assessment III 8 Precompute distribution But: distribution depends on Scoring parameters, e.g. s a,b Sequence ensemble, e.g. p a Pre-computation only possible for some fixed set of scoring systems at overall amino acid/base pair frequencies Problems: unusual amino acid composition Iterative schemes (PSI-BLAST) Query Sequence Database Sequences Alignment Score Functions refine score fn Opt. Alignment signficance evaluation homologous sequences Position dependent scoring parameters Need fast numerical way or analytical theory to get score distribution

Significance Assessment IV 9 Analytical theory for gapless alignment: Gumbel or extreme value distribution Karlin and Altschul, Proc. Natl. Acad. Sci. 99 P( Σ ) P (Σ) =κλ exp[ λσ κe λσ ] e λ S Σ < Σ> ~ log λ κ universal shape Dependence on scoring system and letter frequencies in parameters κ, λ κ, λ known: e λs a,b p a p b e λs a,b = Corrections due to finite sequence length analytically known Makes gapless alignment powerful tool: BLAST Altschul et al., J. Mol. Biol. 99

Sequence alignment III Problem: during evolution pieces of sequences are inserted and deleted alignment algorithm has to compensate for this do gapped alignments to find weak similarities First: gapped global alignment Alignment = way of inserting gaps in sequences Score gaps by δ --AG-MKCYDHPSARQAW KDAGVMK-YEHPS--QRW Score S[A] for each alignment A S[A] = (a,b) A s a,b δn g In practice: affine gap cost δ + ɛk Find alignment with highest score Σ = max A S[A] Exponentially many alignments for each pair of sequences!

Sequence alignment IV Representation on an alignment grid: --AG-MKCYDHPSARQAW KDAGVMK-YEHPS--QRW Dynamic programming: O(N ) algorithm Needleman and Wunsch, 97 Σ = h(, N) C K Y D H P S A R Q W A (r,t)= (6,4) A G M (r,t)= (r,t)= (,) K (,3) D A G V M K Y E H P S Q (r,t)= R (,7) W h(r, t + ) = max{h(r ±,t) δ, h(r, t ) + s(r, t)} r t

Sequence alignment V Often only pieces of two sequences related local alignment with gaps Look at all pairs of subsequences a i...a j and b k...b l of the original sequences Calculate gapped global alignment score Σ(i, j, k, l) for each pair using Needleman-Wunsch algorithm Total alignment score is best of these Σ = max Σ(i, j, k, l) i,j,k,l O(N 6 ) algorithm? Again dynamic programming solution Smith and Waterman, 98 S j,l max Σ(i, j, k, l) i,k S j,l = max{s j,l δ, S j,l δ, S j,l + s aj,b k, } Σ = max j,k S j,k Practically used sequence comparison programs (BLAST, FASTA) are approximations of this algorithm

Significance Assessment V 3 Problem: Significance assessment less clear for alignment with gaps Shape of distribution numerically verified to be still Gumbel What are the values of the non-universal parameters λ and κ? Only very limited theories and heuristics Zhang, 995, Siegmund and Yakir, 999, Mott and Tribe, 999, Mott, Statistical physics problem: Find distribution of observable in disordered system "disorder" "physical system" random random AWDQR EAIPST "observable" Σ= Σ=7 Σ=3 Σ= Σ= Σ =5 Use statistical physics methods

Central conjecture 4 Central conjecture: The distribution of Σ is of the Gumbel form with the Gumbel parameter λ given by e λh(,n) = Properties: h(, N) free energy e λh(,n) Z λ replicated partition function hard to calculate analytically h(, N) < typical sequence pairs contribute zero to e λh(,n) rare events dominate e λh(,n) numerically hard Reduces to proven formula e λs = for gapless alignment Not proven, but applications work

Applications I 5 Application I: DNA-DNA comparison with µ = δ: n prob p Mapping possible to asymmetric exclusion process Model of highway traffic Exactly solvable model of non-equilibrium physics W r Can calculate e λh(,n) Gumbel λ given by + p exp[ λ ( + µ)] + p exp[ λ ( + µ)] exp[ λ µ]= ( ) λ.5 Compare calculated with simulated λ (p =/4, DNA) Provides test case for heuristic approaches.5 Equation (*) Simulation: uncorrelated Simulation: sequences 3 4 5 6 7 8 µ=δ BLAST p-value calculation changed since version..3 Altschul, RB, Olsen, and Hwa, Nucleic Acids Res.

Applications II 6 Application II: Smart numerics Estimate e λh(,n) by numerical sampling Instead of random sequence pairs (with h(, N) < ) use correlated sequence pairs with h(, N) > such that e λh(,n) corr is not dominated by rare events The difference between and corr can be handled analytically Estimate e λh(,n) for N = 6, N = 8, N = and extrapolate to N Results: Scoring system λ reference λ this algorithm time (min) BLOSUM45/4/.96±.8.978±.5 3: BLOSUM6//.67±..669±.6 5:49 BLOSUM8//.993±..34±.5 : PAM7//.9±.3.9±.3 :6 PAM3/9/.963±..954±. :7

Applications III 7 Application III: e λh(,n) = in general hard to fulfill Instead of calculating complicated quantity for Smith-Waterman alignment change the algorithm Smith-Waterman S(r +,t) δ, S(r,t) δ S(r, t+)=max S(r, t ) + s(r, t), Replace by hybrid algorithm, Σ = max r,t S(r, t) Z(r, t+) = Z(r+,t)e λ glδ + Z(r,t)e λ glδ, Σ = max r,t +Z(r, t )( e λ glδ )e λ gls(r,t) + log Z(r, t) Similar to: Viterbi probabilistic HMM (forward backward) RNA minimal energy (Zuker) partition function (Vienna) Guarantees e h(,n) = λ =independent of scoring system

Applications IV 8 Numerical test of hybrid statistics 5 i.i.d. amino acid sequences of length N, PAM- scoring matrix, + k gap cost D(Σ) Gumbel Fit % λ.5..5. Simulation Theory λ 5 5 5 Σ.95 4 6 8 N Score histogram is of Gumbel form λ =for large N Even sequence length dependence theoretically understood

Applications V 9 Works even for position-specific scoring systems E.g., protein family Hidden Markov Models from Pfam database Bateman et al., Nucleic Acids Res. D(Σ) Gumbel fit 5 5 Σ # occurences 6 5 4 3.6.8..4 λ Score histogram is of Gumbel form λ =within ±% for all, 6 models

Applications VI How is the performance in terms of sensitivity? Test algorithm on standard database: PDB9D-B Brenner et al., Proc. Natl. Acad. Sci. 998 Use SCOP as gold standard (known relations between sequences) Murzin et al., J. Mol. Biol. 995 Do pairwise alignments of all sequences in database Vary p-value cutoff and measure Number of relations found Coverage=, Number of total relations Ideal: high coverage at low errors per query Hybrid alignment s sensitivity is at least as good as that of other methods of wrong relations Errors per Query=Number Coverage.45.4.35.3.5 Hybrid gapped BLAST Smith Waterman SSEARCH FASTA ktup= FASTA ktup= Number of sequences ungapped BLAST Errors Per Query

Conclusions and outlook Sequence alignment is standard tool in...lvlhvw......lvagdw... molecular biology Σ= Sequence alignment algorithms rely on dynamic programming methods.5 Statistics of sequence alignments are important and poorly understood Statistical physics methods can be applied to characterize and improve sequence alignment Future directions Other analytical solutions for λ Better importance sampling to estimate λ Incorporate hybrid algorithm into PSI-BLAST λ # occurences.5 6 5 4 3 Equation (*) Simulation: uncorrelated Simulation: sequences 3 4 5 6 7 8 µ=δ.6.8..4 λ Statistics of other algorithms of computational biology (RNA folding, gene finding, clustering,...)