Genome 373: Hidden Markov Models I Doug Fowler
Review From Gene Prediction I transcriptional start site, ATG, open reading frame, transcriptional termination site, promoter, 5′ untranslated region, 3′ untranslated region. We briefly revisited what a gene is and what the key parts of genes are
Review From Gene Prediction I Given a sequence, we want to be able to predict the major features of genes in the sequence (e.g. create gene models) [Figure: a DNA sequence annotated with Start, Exon 1, Intron 1, Exon 2, and Stop]
Review From Gene Prediction I We want a model that can predict whether each base in a sequence is in one of a known set of states (intergenic, start, exon, intron, stop) [Figure: the same annotated DNA sequence]
An Ad Hoc Model We could just build an ad hoc model that would incorporate each of the pieces of information we talked about last time (e.g. start, stop, length of ORF, splice site motifs, etc)
An Ad Hoc Model For example, we could label all starts, stops and potential ORFs. Then we could slide across 100 base pair windows and compute the probability of splice site motifs. Finally, we could combine these two pieces of information to find genes
An Ad Hoc Model What are the problems here?
An Ad Hoc Model Many problems arise with this strategy: How should we weight each part of the model? What happens if we want to add new information (alternative splicing, etc)? Ad hoc models get messy very quickly!
An Overview of Markov Models Markov models are a formal framework for assigning states to a linear sequence of symbols (like DNA) [Figure: a short DNA sequence with one position labeled state = start and another labeled state = stop]
An Overview of Markov Models Markov models are probabilistic, meaning that we can use them to pick out the most likely states for a particular sequence (this is exactly what we want to do to find genes!)
An Overview of Markov Models Markov models have diverse applications in genomics including gene finding, sequence alignment, regulatory site identification, protein secondary structure prediction, etc
Outline Markov Chains/Models Hidden Markov Models
Markov Chain A Markov chain is a random process of transitions from one state to another in a state space
Markov Chain [Diagram: two states, A and T; each has a self-transition probability of 0.9 and a 0.1 probability of switching to the other state] This model describes a Markov chain with two states, A and T
Markov Chain There are four possible transitions: A→A, A→T, T→A, T→T
Markov Chain The transitions describe the linear order in which we expect states to occur
Markov Chain This model describes a sequence composed of As and Ts, and you could get any sequence from this model
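The generative view of the chain above can be sketched in a few lines of Python. This is an illustrative sketch, not from the slides: the function name is my own, and the 0.9/0.1 probabilities are the ones shown in the diagram.

```python
import random

# Sketch of sampling from the two-state chain on the slide:
# states A and T, stay with probability 0.9, switch with probability 0.1.
def sample_chain(length, start="A", p_stay=0.9, seed=0):
    rng = random.Random(seed)
    seq = [start]
    for _ in range(length - 1):
        if rng.random() < p_stay:
            seq.append(seq[-1])                         # self-transition
        else:
            seq.append("T" if seq[-1] == "A" else "A")  # switch states
    return "".join(seq)

print(sample_chain(20))  # long runs of the same letter are typical
```

Because the self-transition probability is high, sampled sequences tend to contain long runs of one symbol, which matches the intuition developed on the following slides.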
Markov Chain [Diagram: A and T with only the transitions A→T and T→A] What type of sequence would this model describe?
Markov Chain One that alternated between A and T
Markov Chain [Diagram: A with a 0.9 self-transition and a 0.1 transition to T; T transitions back to A] And this one?
Markov Chain Runs of As interrupted by single Ts
Markov Model A Markov chain is a random process of transitions from one state to another in a state space. In other words, transitions between states are probabilistic
Markov Model Formally, a transition between two states s and t is associated with a probability (a_st, the transition probability): a_st = P(x_i = t | x_(i-1) = s)
Markov Model This expresses a key property of a Markov chain: the probability of any symbol x_i depends only on the previous symbol x_(i-1)
Markov Model This is also referred to as the Markov property
Markov Model Given that we start with an A, we can write down the probability of any sequence of symbols: P(sequence) = ?
Markov Model For a run of As, P(sequence) = 0.9 × 0.9 × ... × 0.9
Markov Model Formally, the probability of observing any particular sequence is the product of the transition probabilities for the sequence: P(sequence) = P(x_1) × ∏_(i=2..L) a_(x_(i-1) x_i)
Markov Model Here P(x_1) is the probability of the beginning state, and the product runs over the second through the L-th transition probabilities
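The product formula translates directly into code. A minimal sketch, assuming the two-state transition table from the slide (stay 0.9, switch 0.1) and taking P(x_1) = 1 since we condition on the first symbol; the function and table names are my own.

```python
# P(sequence) = P(x1) * prod_{i=2..L} a_{x_{i-1}, x_i}
def chain_probability(seq, trans, p_first=1.0):
    p = p_first  # P(x1): 1 here, since we condition on the first symbol
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[(prev, cur)]  # transition probability a_{x_{i-1}, x_i}
    return p

# Transition table assumed from the slide's two-state model
TRANS = {("A", "A"): 0.9, ("A", "T"): 0.1,
         ("T", "A"): 0.1, ("T", "T"): 0.9}

print(chain_probability("AAAAAAA", TRANS))  # 0.9**6, about 0.53
```

Each additional symbol multiplies in one more transition probability, which is why long sequences quickly become individually improbable even under the best model.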
A Markov Model Can Tell Us the Most Likely Sequence Which is the more likely sequence given our model: AAAAAAA or ATATATAT?
A Markov Model Can Tell Us the Most Likely Sequence Clearly, the first is more likely to occur, and we can write down the exact probability of each! You all calculate them!
A Markov Model Can Tell Us the Most Likely Sequence P(AAAAAAA) = 0.9^6 ≈ 0.53; P(ATATATAT) = 0.1^7 = 0.0000001
A Markov Model Can Tell Us the Most Likely Sequence And, starting with an A, what is the most likely eight-symbol sequence of all?
A Markov Model Can Tell Us the Most Likely Sequence AAAAAAAA, with P = 0.9^7 ≈ 0.48
Beginning and Ending States in a Markov Model We can add begin (B) and end (E) states with their own transition probabilities, a_Bs and a_sE
Beginning and Ending States in a Markov Model What is the consequence of modeling the end state?
Beginning and Ending States in a Markov Model We add sequence length to the model (at every position there is a non-zero probability that the next state is the end state)
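One way to see the length consequence concretely: if the probability of transitioning to End is the same constant at every step (call it p_end; the specific value below is my own illustration, not from the slides), then the sequence length follows a geometric distribution.

```python
# Sketch: with a constant end probability p_end at every step,
# P(L = n) = (1 - p_end)^(n-1) * p_end  (a geometric distribution).
def length_pmf(n, p_end=0.1):
    """Probability the chain emits exactly n symbols before hitting End."""
    return (1 - p_end) ** (n - 1) * p_end

# The expected length is 1 / p_end, i.e. 10 symbols for p_end = 0.1.
expected = sum(n * length_pmf(n) for n in range(1, 10_000))
print(round(expected, 2))  # 10.0
```

So the end state does more than terminate sequences: it imposes a particular (geometric) length distribution, a point that matters later when modeling features like exon lengths.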
Outline Markov Chains Hidden Markov Models
What is Hidden in an HMM?
What is Hidden in an HMM? In our simple Markov model we had full knowledge of both the symbols (x_i) and the model states
What is Hidden in an HMM? In fact, they were identical and we talked about them interchangeably: the symbol sequence and the state sequence were the same string of As and Ts
What is Hidden in an HMM? In a hidden Markov model (HMM), the model states are unknown (i.e. hidden from us). We will see that given a set of transition probabilities and a set of symbols we can use an HMM to identify the most likely sequence of states, and that this will let us solve our gene finding problem!
HMM for A-Rich vs. T-Rich Regions Let's extend our initial example to one where, given a sequence composed of As and Ts, we want to discriminate between A-rich and T-rich regions
HMM for A-Rich vs. T-Rich Regions [Diagram: states "A rich" and "T rich", each with a 0.9 self-transition and a 0.1 switch; emissions: A rich emits A: 0.8, T: 0.2; T rich emits A: 0.2, T: 0.8] Now we have a model where there are two states: A rich (a) and T rich (t)
HMM for A-Rich vs. T-Rich Regions The states no longer correspond directly to the symbols A or T. In an A-rich region, for example, we'll still observe some Ts, and vice versa
HMM for A-Rich vs. T-Rich Regions Instead, the states are associated with emission probabilities that dictate the frequency with which A or T will be observed
HMM for A-Rich vs. T-Rich Regions That is, when in the A-rich state the model will emit an A 80% of the time and a T 20% of the time
HMM for A-Rich vs. T-Rich Regions Formally, we denote the probability that we will see the symbol b when the model is in state k as e_k(b) = P(x_i = b | π_i = k), where π is the sequence of model states
HMM for A-Rich vs. T-Rich Regions Just like before, we can use the model to generate sequence
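Generating from the HMM now takes two random draws per position: one for the emitted symbol and one for the next state. A minimal sketch of this two-state A-rich/T-rich model, assuming a uniform choice of initial state (the slides do not specify one); the function name is my own.

```python
import random

def sample_hmm(length, seed=0):
    """Sample a state path and symbol sequence from the A-rich/T-rich HMM:
    stay in the current state with probability 0.9; each state emits its
    favored base with probability 0.8."""
    rng = random.Random(seed)
    emit_a = {"a": 0.8, "t": 0.2}    # probability of emitting A in each state
    state = rng.choice(["a", "t"])   # assumed: uniform initial state
    states, symbols = [], []
    for _ in range(length):
        states.append(state)
        # emission draw: A with the state's emission probability, else T
        symbols.append("A" if rng.random() < emit_a[state] else "T")
        # transition draw: switch states with probability 0.1
        if rng.random() >= 0.9:
            state = "t" if state == "a" else "a"
    return "".join(states), "".join(symbols)

path, seq = sample_hmm(12)
print(path)
print(seq)
```

Notice that the printed state path and symbol sequence usually differ at some positions: the A-rich state still emits Ts 20% of the time, which is exactly why the states are hidden.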
HMM for A-Rich vs. T-Rich Regions Sequence: AAAATTTT State path #1: a a a a t t t t However, now multiple state paths (π) could give rise to a particular sequence
HMM for A-Rich vs. T-Rich Regions Sequence: AAAATTTT State path #1: a a a a t t t t State path #2: t t t t a a a a Given the model, transition probabilities, emission probabilities and a sequence of symbols, we can begin to think about the most likely state path
HMM for A-Rich vs. T-Rich Regions Intuitively, it's pretty easy to figure out. Which of these two is the most likely?
HMM for A-Rich vs. T-Rich Regions State path #1 is highly likely; state path #2 is unlikely. This is the basic idea of an HMM: figure out the most likely state path given a sequence, a model, and its transition and emission probabilities
Probability of a Given Sequence and State Path Formally, the joint probability of a given sequence x and a state path π is given by: P(x, π) = a_(0 π_1) × ∏_(i=1..L) e_(π_i)(x_i) × a_(π_i π_(i+1))
Probability of a Given Sequence and State Path In this formula, a_(0 π_1) is the probability of the initial state, e_(π_i)(x_i) is the probability of emitting symbol x_i in state π_i, and a_(π_i π_(i+1)) is the probability of transitioning from state π_i to state π_(i+1)
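The joint probability formula translates term by term into code. A sketch with the slide's emission and transition tables hard-coded, and with the begin and end transitions taken as 1 (an assumption unless otherwise stated); the names are my own.

```python
# P(x, pi) = a_{0, pi_1} * prod_i [ e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}} ]
EMIT = {"a": {"A": 0.8, "T": 0.2},   # A-rich state emissions
        "t": {"A": 0.2, "T": 0.8}}   # T-rich state emissions
TRANS = {("a", "a"): 0.9, ("a", "t"): 0.1,
         ("t", "a"): 0.1, ("t", "t"): 0.9}

def joint_probability(symbols, states, a_begin=1.0, a_end=1.0):
    p = a_begin  # a_{0, pi_1}: probability of the initial state
    for i, (s, x) in enumerate(zip(states, symbols)):
        p *= EMIT[s][x]                     # emission term e_{pi_i}(x_i)
        if i + 1 < len(states):
            p *= TRANS[(s, states[i + 1])]  # transition a_{pi_i, pi_{i+1}}
    return p * a_end

p1 = joint_probability("AAAATTTT", "aaaatttt")
p2 = joint_probability("AAAATTTT", "ttttaaaa")
print(p1, p2)  # path 1 is vastly more likely than path 2
```

Running this for the two candidate paths reproduces the intuition from the previous slide: the path whose states match the composition of the sequence dominates by several orders of magnitude.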
Example State Path Probability Calculation Sequence: AAAATTTT State path #1: a a a a t t t t State path #2: t t t t a a a a P(path 1) = (0.8 × 0.9) × ... × (0.8 × 0.1) × ... × (0.8 × 0.9) ≈ 0.008 P(path 2) = (0.2 × 0.9) × ... × (0.2 × 0.1) × ... × (0.2 × 0.9) ≈ 1.2 × 10^-7 Let's start at the beginning: i = 1, with (x_1 = A, π_1 = a) for path 1 and (x_1 = A, π_1 = t) for path 2. We multiply the emission and transition probabilities
Example State Path Probability Calculation And continue doing that for the whole sequence and each state path, getting the probability of each state path given the observed sequence
What Does the Most Likely Path Mean? It turns out that state path #1 is the most likely path for this model. So, what can we say?
What Does the Most Likely Path Mean? That the first four positions in the sequence are likely from an A-rich region and the last four are from a T-rich region!
Example State Path Probability Calculation Sequence: ATATTTTT State path: a t a a t t t a Now, you all take a minute and try to calculate the likelihood of this state path, given that the transition probability into the first state (a_(0 π_1)) is 1
Example State Path Probability Calculation P = 1 × (0.8 × 0.1)(0.8 × 0.1)(0.8 × 0.9)(0.2 × 0.1)(0.8 × 0.9)(0.8 × 0.9)(0.8 × 0.1)(0.2 × 1) = 7.6 × 10^-7, given that the transition probabilities into the first state (a_(0 π_1)) and to the end state are both 1
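The hand calculation above can be checked in code. The sequence ATATTTTT and the state path a, t, a, a, t, t, t, a are reconstructed here from the slide's factors (so treat them as an assumption), with the begin and end transitions set to 1 as stated.

```python
# Emission and transition tables from the A-rich/T-rich model
EMIT = {"a": {"A": 0.8, "T": 0.2}, "t": {"A": 0.2, "T": 0.8}}
TRANS = {("a", "a"): 0.9, ("a", "t"): 0.1,
         ("t", "a"): 0.1, ("t", "t"): 0.9}

symbols = "ATATTTTT"   # reconstructed example sequence
states = "ataattta"    # reconstructed state path

p = 1.0  # a_{0, pi_1} = 1, as given on the slide
for i in range(len(states)):
    p *= EMIT[states[i]][symbols[i]]              # emission term
    if i + 1 < len(states):
        p *= TRANS[(states[i], states[i + 1])]    # transition term
p *= 1.0  # transition to the end state = 1, as given

print(f"{p:.2g}")  # 7.6e-07
```

Multiplying the eight emission terms and seven transition terms reproduces the slide's value of 7.6 × 10^-7.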
Summary We learned that a Markov chain is a random process of transitions from one state to another in a state space, and that we could write down a model to describe a Markov chain. We saw how a simple Markov model could tell us the most likely sequence (e.g. AAAAAAAA, with P = 0.9^7 ≈ 0.48)
Summary We learned that in a hidden Markov model, the states are unknown to us and are associated with a set of emission probabilities, so that many different state paths can generate a given sequence (e.g. Sequence: AAAATTTT; State path #1: a a a a t t t t; State path #2: t t t t a a a a)
Summary We saw how we could use an HMM to calculate the probability of any (hidden) state path given a sequence: P(x, π) = a_(0 π_1) × ∏_(i=1..L) e_(π_i)(x_i) × a_(π_i π_(i+1))
Next Time The Viterbi Algorithm (or, how can we find the most probable state path?) A toy gene finding example Generating a gene finding HMM