VL Bioinformatik für Nebenfächler SS2018 Woche 2

Hasonló dokumentumok
Where in the Genome does Replication Begin?

Where in the Genome does Replication Begin?

Lopocsi Istvánné MINTA DOLGOZATOK FELTÉTELES MONDATOK. (1 st, 2 nd, 3 rd CONDITIONAL) + ANSWER KEY PRESENT PERFECT + ANSWER KEY

Mapping Sequencing Reads to a Reference Genome

Tudományos Ismeretterjesztő Társulat

Angol Középfokú Nyelvvizsgázók Bibliája: Nyelvtani összefoglalás, 30 kidolgozott szóbeli tétel, esszé és minta levelek + rendhagyó igék jelentéssel

Előszó.2. Starter exercises. 3. Exercises for kids.. 9. Our comic...17

On The Number Of Slim Semimodular Lattices

Minta ANGOL NYELV KÖZÉPSZINT SZÓBELI VIZSGA II. Minta VIZSGÁZTATÓI PÉLDÁNY

ANGOL NYELVI SZINTFELMÉRŐ 2013 A CSOPORT. on of for from in by with up to at

Genome 373: Hidden Markov Models I. Doug Fowler

Széchenyi István Egyetem

ANGOL NYELV KÖZÉPSZINT SZÓBELI VIZSGA I. VIZSGÁZTATÓI PÉLDÁNY

ANGOL NYELV KÖZÉPSZINT SZÓBELI VIZSGA I. VIZSGÁZTATÓI PÉLDÁNY

Construction of a cube given with its centre and a sideline

Emelt szint SZÓBELI VIZSGA VIZSGÁZTATÓI PÉLDÁNY VIZSGÁZTATÓI. (A részfeladat tanulmányozására a vizsgázónak fél perc áll a rendelkezésére.

Using the CW-Net in a user defined IP network

Proxer 7 Manager szoftver felhasználói leírás

Phenotype. Genotype. It is like any other experiment! What is a bioinformatics experiment? Remember the Goal. Infectious Disease Paradigm

Angol érettségi témakörök 12.KL, 13.KM, 12.F

Travel Getting Around

STUDENT LOGBOOK. 1 week general practice course for the 6 th year medical students SEMMELWEIS EGYETEM. Name of the student:

ANGOL NYELVI SZINTFELMÉRŐ 2012 A CSOPORT. to into after of about on for in at from

USER MANUAL Guest user

(Asking for permission) (-hatok/-hetek?; Szabad ni? Lehet ni?) Az engedélykérés kifejezésére a következő segédigéket használhatjuk: vagy vagy vagy

FAMILY STRUCTURES THROUGH THE LIFE CYCLE

1. MINTAFELADATSOR KÖZÉPSZINT. Az írásbeli vizsga időtartama: 30 perc. III. Hallott szöveg értése

TestLine - Angol teszt Minta feladatsor

Correlation & Linear Regression in SPSS

Tudományos Ismeretterjesztő Társulat

Tutorial 1 The Central Dogma of molecular biology

discosnp demo - Peterlongo Pierre 1 DISCOSNP++: Live demo

7. osztály Angol nyelv

3. MINTAFELADATSOR KÖZÉPSZINT. Az írásbeli vizsga időtartama: 30 perc. III. Hallott szöveg értése

16F628A megszakítás kezelése

Szundikáló macska Sleeping kitty

FOSS4G-CEE Prágra, 2012 május. Márta Gergely Sándor Csaba

Lesson 1 On the train

3. MINTAFELADATSOR EMELT SZINT. Az írásbeli vizsga időtartama: 30 perc. III. Hallott szöveg értése


Excel vagy Given-When-Then? Vagy mindkettő?

fátyolka tojásgy jtœ lap [CHRegg] összeszereléséhez

ANGOL NYELVI SZINTFELMÉRŐ 2011 B CSOPORT. for on off by to at from

ANGOL NYELVI SZINTFELMÉRŐ 2014 A CSOPORT

A rosszindulatú daganatos halálozás változása 1975 és 2001 között Magyarországon

ANGOL NYELVI SZINTFELMÉRŐ 2008 A CSOPORT

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet. Hypothesis Testing. Petra Petrovics.

EXKLUZÍV AJÁNDÉKANYAGOD A Phrasal Verb hadsereg! 2. rész

7. Bemutatom a barátomat, Pétert. 8. Hogy hívják Önt? 9. Nagyon örülök, hogy találkoztunk. 10. Nem értem Önt. 11. Kérem, beszéljen lassabban.

EGY KIS ZŰRZAVAR. Lecke (Középhaladó 1. / 1.) SOMETIMES, SOMETIME VAGY SOME TIME?

A modern e-learning lehetőségei a tűzoltók oktatásának fejlesztésében. Dicse Jenő üzletfejlesztési igazgató

ENROLLMENT FORM / BEIRATKOZÁSI ADATLAP

T Á J É K O Z T A T Ó. A 1108INT számú nyomtatvány a webcímen a Letöltések Nyomtatványkitöltő programok fülön érhető el.

Utazás Szállás. Szállás - Keresés. Szállás - Foglalás. Útbaigazítás kérése. ... kiadó szoba?... a room to rent? szállásfajta.

Vakáció végi akció Ukrajnában

Create & validate a signature

Budapest By Vince Kiado, Klösz György

Intézményi IKI Gazdasági Nyelvi Vizsga

ANGOL NYELVI SZINTFELMÉRŐ 2015 B CSOPORT

OB I. Reining Gyerek, Green, Novice Youth, Novice Amateur, Junior AQHA #5

OLYMPICS! SUMMER CAMP

ANGOL NYELV Helyi tanterv

HALLÁS UTÁNI SZÖVEGÉRTÉS 25. Beauty and the Beast Chapter Two Beauty s rose

It Could be Worse. tried megpróbált while miközben. terrifying. curtain függöny

Please stay here. Peter asked me to stay there. He asked me if I could do it then. Can you do it now?

PAST ÉS PAST PERFECT SUBJUNCTIVE (múlt idejű kötőmód)

Tudományos Ismeretterjesztő Társulat

Társasjáték az Instant Tanulókártya csomagokhoz

PONTOS IDŐ MEGADÁSA. Néha szükséges lehet megjelölni, hogy délelőtti vagy délutáni / esti időpontról van-e szó. Ezt kétféle képpen tehetjük meg:

Unit 10: In Context 55. In Context. What's the Exam Task? Mediation Task B 2: Translation of an informal letter from Hungarian to English.

Computer Architecture

Na de ennyire részletesen nem fogok belemenni, lássuk a lényeget, és ha kérdésed van, akkor majd tedd fel külön, négyszemközt.

mondat ami nélkül ne indulj el külföldre

Where are the parrots? (Hol vannak a papagájok?)

First experiences with Gd fuel assemblies in. Tamás Parkó, Botond Beliczai AER Symposium

Magyar - Angol Orvosi Szotar - Hungarian English Medical Dictionary (English And Hungarian Edition) READ ONLINE

Relative Clauses Alárendelő mellékmondat

(NGB_TA024_1) MÉRÉSI JEGYZŐKÖNYV

UTÓKÉRDÉSEK (Tag-questions)

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet Factor Analysis

A MUTATÓNÉVMÁSOK. A mutatónévmások az angolban is (mint a magyarban) betölthetik a mondatban

Travel General. General - Essentials. General - Conversation. Asking for help. Asking if a person speaks English

JEROMOS A BARATOM PDF

Ültetési és öntözési javaslatok. Planting and watering instructions

Learn how to get started with Dropbox: Take your stuff anywhere. Send large files. Keep your files safe. Work on files together. Welcome to Dropbox!

Megyei Angol csapatverseny 2015/16. MOVERS - 3. forduló

Ady Endre: Párizsban járt az Ősz című versének és angol fordításainak alakzatvizsgálata

1. feladat: Hallgasd meg az angol szöveget, legalább egyszer.

KERÜLETI DIÁKHETEK VERSENYKIÍRÁS 2017.

MINDENGYEREK KONFERENCIA

Az Open Data jogi háttere. Dr. Telek Eszter

Daloló Fülelő Halász Judit Szabó T. Anna: Tatoktatok Javasolt nyelvi szint: A2 B1 / Resommended European Language Level: A2 B1

Cluster Analysis. Potyó László

2-5 játékos részére, 10 éves kortól

Business Opening. Very formal, recipient has a special title that must be used in place of their name

Statistical Inference

Rezgésdiagnosztika. Diagnosztika

Cashback 2015 Deposit Promotion teljes szabályzat

MATEMATIKA ANGOL NYELVEN MATHEMATICS

KN-CP50. MANUAL (p. 2) Digital compass. ANLEITUNG (s. 4) Digitaler Kompass. GEBRUIKSAANWIJZING (p. 10) Digitaal kompas

Átírás:

VL Bioinformatik für Nebenfächler SS2018 Woche 2 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, FU Berlin Based on slides by P. Compeau and P. Pevzner, authors of

VL Homepage Script-Sprache Tutorien Algorithm basics for biological problems Tim Conrad, VL Bioinformatik f. Nebenfächler, SS2018

Zentrales Dogma: von DNA zu Proteinen Tim Conrad, VL Bioinformatik f. Nebenfächler, SS2018

Ashanthi DeSilva and Michael Blaese in 2013

Before a Cell Divides, it Must Replicate its Genome

Replication begins in a region called the replication origin (oric) Where in a genome does it all begin?

Outline Search for Hidden Messages in Replication Origin What is a Hidden Message in Replication Origin? Some Hidden Messages are More Surprising than Others Clumps of Hidden Messages From a Biological Insight toward an Algorithm for Finding Replication Origin Asymmetry of Replication Why would a computer scientist care about assymetry of replication? Skew Diagrams Finding Frequent Words with Mismatches Open Problems

Finding Origin of Replication Finding oric Problem: Finding oric in a genome. Input. A genome. Output. The location of oric in the genome. OK let s cut out this DNA fragment. Can the genome replicate without it? This is not a computational problem!

How Does the Cell Know to Begin Replication in Short oric? Replication origin of Vibrio cholerae ( 500 nucleotides): atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc There must be a hidden message telling the cell to start replication here.

The Hidden Message Problem Hidden Message Problem. Finding a hidden message in a string. Input. A string Text (representing replication origin). Output. A hidden message in Text. This is not a computational problem either! The notion of hidden message is not precisely defined.

The Gold-Bug Problem 53++!305))6*;4826)4+.)4+);806*;4 8!8`60))85;]8*:+*8!83(88)5*!;46( ;88*96*?;8)*+(;485);5*!2:*+(;495 6*2(5*4)8`8*;4069285);)6!8)4++;1 (+9;48081;8:8+1;48!85;4)485!5288 06*81(+9;48;(88;4(+?34;48)4+;161 ;:188;+?; A secret message left by pirates ( The Gold-Bug by Edgar Allan Poe)

Why is ;48 so Frequent? Hint: The message is in English 53++!305))6*;4826)4+.)4+);806*48!8`60))85;]8*:+*8!83(88)5*!46(88 *96*?;8)*+(;485);5*!2:*+(;4956*2 (5*4)8`8*;4069285);)6!8)4++;1(+9 ;48081;8:8+1;48!85;4)485 528806*81(+9;48;(88;4(+?34;48)4+ ;161;:188;+?;

THE is the Most Frequent English Word 53++!305))6*THE26)4+.)4+)806*THE!8`60))85;]8*:+*8!83(88)5*!;46(; 88*96*?;8)*+(THE5);5*!2:*+(;4956 *2(5*4)8`8*;4069285);)6!8)4++;1( +9THE081;8:8+1THE!85;4)485!52880 6*81(+9THE;(88;4(+?34THE)4+;161; :188;+?;

Could you Complete Decoding the Message? 53++!305))6*THE26)H+.)H+)806*THE!E`60))E5;]E*:+*E!E3(EE)5*!TH6(T EE*96*?;E)*+(THE5)T5*!2:*+(TH956 *2(5*H)E`E*TH0692E5)T)6!E)H++T1( +9THE0E1TE:E+1THE!E5T4)HE5!52880 6*E1(+9THET(EETH(+?34THE)H+T161T :1EET+?T

The Hidden Message Problem Revisited Hidden Message Problem. Finding a hidden message in a string. Input. A string Text (representing oric). Output. A hidden message in Text. This is not a computational problem either! The notion of hidden message is not precisely defined. Hint: For various biological signals, certain words appear surprisingly frequently in small regions of the genome. AATTT is a surprisingly frequent 5-mer in: ACAAATTTGCATAATTTCGGGAAATTTCCT

The Frequent Words Problem Frequent Words Problem. Finding most frequent k-mers in a string. Input. A string Text and an integer k. Output. All most frequent k-mers in Text. This is better, but where is the definition of a most frequent k-mer?

The Frequent Words Problem Frequent Words Problem. Finding most frequent k-mers in a string. Input. A string Text and an integer k. Output. All most frequent k-mers in Text. A k-mer Pattern is a most frequent k-mer in a text if no other k-mer is more frequent than Pattern. AATTT is a most frequent 5-mer in: ACAAATTTGCATAATTTCGGGAAATTTCCT

Does the Frequent Words Problem Make Sense to Biologists? Frequent Words Problem. Finding most frequent k-mers in a string. Input. A string Text and an integer k. Output. All most frequent k-mers in Text. Replication is performed by DNA polymerase and the initiation of replication is mediated by a protein called DnaA. DnaA binds to short (typically 9 nucleotides long) segments within the replication origin known as a DnaA box. A DnaA box is a hidden message telling DnaA: bind here! And DnaA wants to see multiple DnaA boxes.

DNA Polymerase

What is the Runtime of Your Algorithm? Frequent Words Problem. Finding most frequent k-mers in a string. Input. A string Text and an integer k. Output. All most frequent k-mers in Text. Text 2 k 4 k + Text k Text k log( Text ) Text??? You will later see how a naive and slow algorithm with Text 2 k runtime can be turned into a fast algorithm with Text runtime ( Text stands for the length of string Text)

Outline Search for Hidden Messages in Replication Origin What is a Hidden Message in Replication Origin? Some Hidden Messages are More Surprising than Others Clumps of Hidden Messages From a Biological Insight toward an Algorithm for Finding Replication Origin Asymmetry of Replication Why would a computer scientist care about assymetry of replication? Skew Diagrams Finding Frequent Words with Mismatches Open Problems

oric of Vibrio cholerae atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaacctgagtgg atgacatcaagataggtcgttgtatctccttcctctcgtactctcatgaccacggaaagatgatcaag agaggatgatttcttggccatatcgcaatgaatacttgtgacttgtgcttccaattgacatcttcagc gccatattgcgctggccaaggtgacggagcgggattacgaaagcatgatcatggctgttgttctgttt atcttgttttgactgagacttgttaggatagacggtttttcatcactgactagccaaagccttactct gcctgacatcgaccgtaaattgataatgaatttacatgcttccgcgacgatttacctcttgatcatcg atccgattgaagatcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatca tgtttccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc

Too Many Frequent Words Which One is a Hidden Message? atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaacctgagtgg atgacatcaagataggtcgttgtatctccttcctctcgtactctcatgaccacggaaagatgatcaag agaggatgatttcttggccatatcgcaatgaatacttgtgacttgtgcttccaattgacatcttcagc gccatattgcgctggccaaggtgacggagcgggattacgaaagcatgatcatggctgttgttctgttt atcttgttttgactgagacttgttaggatagacggtttttcatcactgactagccaaagccttactct gcctgacatcgaccgtaaattgataatgaatttacatgcttccgcgacgatttacctcttgatcatcg atccgattgaagatcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatca TgtttccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc Most frequent 9-mers in this oric (all appear 3 times): ATGATCAAG, CTTGATCAT, TCTTGGATCA, CTCTTGATC Is it STATISTICALLY surprising to find a 9-mer appearing 3 or more times within 500 nucleotides?

Hidden Message Found! atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaacctgagtgg atgacatcaagataggtcgttgtatctccttcctctcgtactctcatgaccacggaaagatgatcaag agaggatgatttcttggccatatcgcaatgaatacttgtgacttgtgcttccaattgacatcttcagc gccatattgcgctggccaaggtgacggagcgggattacgaaagcatgatcatggctgttgttctgttt atcttgttttgactgagacttgttaggatagacggtttttcatcactgactagccaaagccttactct gcctgacatcgaccgtaaattgataatgaatttacatgcttccgcgacgatttacctcttgatcatcg atccgattgaagatcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatca TgtttccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc ATGATCAAG are reverse complements and likely DnaA boxes TACTAGTTC (DnaA does not care what strand to bind to) It is VERY SURPRISING to find a 9-mer appearing 6 or more times (counting reverse complements) within a short 500 nucleotides.

Can we Now Find Hidden Messages in Thermotoga petrophila? aactctatacctcctttttgtcgaatttgtgtgatttatagagaaaatcttattaactgaaactaa aatggtaggtttggtggtaggttttgtgtacattttgtagtatctgatttttaattacataccgta tattgtattaaattgacgaacaattgcatggaattgaatatatgcaaaacaaacctaccaccaaac tctgtattgaccattttaggacaacttcagggtggtaggtttctgaagctctcatcaatagactat tttagtctttacaaacaatattaccgttcagattcaagattctacaacgctgttttaatgggcgtt gcagaaaacttaccacctaaaatccagtatccaagccgatttcagagaaacctaccacttacctac cacttacctaccacccgggtggtaagttgcagacattattaaaaacctcatcagaagcttgttcaa aaatttcaatactcgaaacctaccacctgcgtcccctattatttactactactaataatagcagta taattgatctgaaaagaggtggtaaaaaa No single occurrence of ATGATCAAG or CTTGATCAT from Vibrio Cholerae!!! Applying the Frequent Words Problem to this replication origin: AACCTACCA, ACCTACCAC, GGTAGGTTT, TGGTAGGTT, AAACCTACC, CCTACCACC Different genomes different hidden messages (DnaA boxes)

Hidden Messages in Thermotoga petrophila aactctatacctcctttttgtcgaatttgtgtgatttatagagaaaatcttattaactgaaactaa aatggtaggtttggtggtaggttttgtgtacattttgtagtatctgatttttaattacataccgta tattgtattaaattgacgaacaattgcatggaattgaatatatgcaaaacaaacctaccaccaaac tctgtattgaccattttaggacaacttcagggtggtaggtttctgaagctctcatcaatagactat tttagtctttacaaacaatattaccgttcagattcaagattctacaacgctgttttaatgggcgtt gcagaaaacttaccacctaaaatccagtatccaagccgatttcagagaaacctaccacttacctac cacttacctaccacccgggtggtaagttgcagacattattaaaaacctcatcagaagcttgttcaa aaatttcaatactcgaaacctaccacctgcgtcccctattatttactactactaataatagcagta taattgatctgaaaagaggtggtaaaaaa Ori-Finder software confirms that CCTACCACC are candidate hidden messages. GGATGGTGG We learned how to find hidden messages IF oric is given. But we have no clue WHERE oric is located in a (long) genome.

Outline Search for Hidden Messages in Replication Origin What is a Hidden Message in Replication Origin? Some Hidden Messages are More Surprising than Others Clumps of Hidden Messages From a Biological Insight toward an Algorithm for Finding Replication Origin Asymmetry of Replication Why would a computer scientist care about assymetry of replication? Skew Diagrams Finding Frequent Words with Mismatches Open Problems

Finding Replication Origin Our strategy BEFORE: given a previously known oric (a 500- nucleotide window), find frequent words (clumps) in oric as candidate DnaA boxes. replication origin frequent words

Finding Replication Origin Our strategy BEFORE: given previously known oric (a 500- nucleotide window), find frequent words (clumps) in oric as candidate DnaA boxes. replication origin frequent words But what if the position of the replication origin within a genome is unknown!

Finding Replication Origin Our strategy BEFORE: given previously known oric (a 500- nucleotide window), find frequent words (clumps) in oric as candidate DnaA boxes. replication origin frequent words NEW strategy: find frequent words in ALL windows within a genome. Windows with clumps of frequent words are candidate replication origins. frequent words replication origin

What is a Clump? Intuitive: Formal: A A k-mer forms an a (L, clump t)-clump inside inside Genome Genome if there if there is a is short a short interval (length of Genome L) interval of which Genome it appears in which many it appears times. many (at least t) times. Clump Finding Problem. Find patterns forming clumps in a string. Input. A string Genome and integers k (length of a pattern), L (window length), and t (number of patterns in a clump). Output. All k-mers forming (L, t)-clumps in Genome. There exist 1904 different 9-mers forming (500,3)-clumps in E. coli genome. It is absolutely unclear which of them point to the replication origin

Outline Search for Hidden Messages in Replication Origin What is a Hidden Message in Replication Origin? Some Hidden Messages are More Surprising than Others Clumps of Hidden Messages From a Biological Insight toward an Algorithm for Finding Replication Origin Asymmetry of Replication Why would a computer scientist care about assymetry of replication? Skew Diagrams Finding Frequent Words with Mismatches Open Problems

DNA Strands Have Directions! 5 oric 3 3 oric 5 The two strands run in opposite directions (from 5 to 3 ): Blue Strand Clockwise, Green Strand Counter-Clockwise terc terc

DNA Strands Have Directions 5 oric 3 3 oric 5 If you were a DNA Polymerase, how would you replicate a genome??? terc terc

Four DNA Polymerases Do the Job oric 5 3 3 5 oric terc terc

Continue as Replication Fork Enlarges 5 3 3 5 Simple but wrong: DNA polymerases are unidirectional: they can only traverse a parent strand in the opposite (3 5 ) direction.

If you Were a UNIDIRECTIONAL DNA Polymerase, how Would you Replicate a Genome? 5 3 3 5 Reverse (leading) half-strand Forward (lagging) half-strand Reverse (leading) half-strand Forward (lagging) half-strand No Big problem replicating reverse forward (leading) (lagging) half-strands (thick (thin lines).

If you Were a UNIDIRECTIONAL DNA Polymerase, How Would you Replicate a Genome??? 5 3 3 5 Reverse (leading) half-strand Forward (lagging) half-strand Reverse (leading) half-strand Forward (lagging) half-strand

Wait until the Fork Opens and 5 3 3 5

Wait until the Fork Opens and Replicate 5 3 3 5

Wait until the Fork Opens and Replicate Wait until the Fork Opens Even More and Okazaki fragments

Wait until the Fork Opens and Replicate Wait until the Fork Opens Even More and Okazaki fragments Okazaki fragments REPLICATE! Instead of copying the entire half-strand, many Okazaki fragments are replicated.

Okazaki Fragments Need to be Ligated to Fill in the Gaps Okazaki fragments The genome has been replicated!

Different Lifestyles of Reverse and Forward Half-Strands The reverse (leading) half-strand lives a double-stranded life most of the time. The forward (lagging) half-strand spends a large portion of its life single-stranded, waiting to be replicated. waiting waiting But why would a computer scientist care?

Outline Search for Hidden Messages in Replication Origin What is a Hidden Message in Replication Origin? Some Hidden Messages are More Surprising than Others Clumps of Hidden Messages From a Biological Insight toward an Algorithm for Finding Replication Origin Asymmetry of Replication Why would a computer scientist care about assymetry of replication? Skew Diagrams Finding Frequent Words with Mismatches Open Problems

Asymmetry of Replication Affects Nucleotide Frequencies Single-stranded DNA has a much higher mutation rate than double-stranded DNA. Thus, if one nucleotide has a greater mutation rate, then we should observe its shortage on the forward half-strand that lives single-stranded life! Which nucleotide (A/C/G/T) has the highest mutation rate? Why?

The Peculiar Statistics of #G - #C Cytosine (C) rapidly mutates into thymine (T) through deamination; deamination rates rise 100-fold when DNA is single stranded! Forward half-strand (single-stranded life): shortage of C, normal G Reverse half-strand (double-stranded life): shortage of G, normal C #C #G #G - #C Reverse half-strand 219518 201634-17884 Forward half-strand 207901 211607 +3706 Difference +11617-9973

Take a Walk Along the Genome #G - #C is DECREASING #G - #C is INCREASING 5 3 3 oric 5 C high G low You walk along the genome and see that #G - #C have been decreasing and then suddenly starts increasing. WHERE ARE YOU IN THE GENOME? C low G high terc C high/g low #G - #C is DECREASING as we walk along the REVERSE half-strand C low/g high #G - #C is INCREASING as we walk along the FORWARD half-strand

Outline Search for Hidden Messages in Replication Origin What is a Hidden Message in Replication Origin? Some Hidden Messages are More Surprising than Others Clumps of Hidden Messages From a Biological Insight toward an Algorithm for Finding Replication Origin Asymmetry of Replication Why would a computer scientist care about assymetry of replication? Skew Diagrams Finding Frequent Words with Mismatches Open Problems

Skew Diagram Skew(k): #G - #C for the first k nucleotides of Genome. Skew diagram: Plot Skew(k) against k CATGGGCATCGGCCATACGCC

Skew Diagram of E. Coli: Where is the Origin of Replication? oric You walk along the genome and see that #G - #C have been decreasing and then suddenly starts increasing: WHERE ARE YOU IN THE GENOME?

We Found the Replication Origin in E. Coli BUT The minimum of the Skew Diagram points to this region in E. coli: aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgcataacgcggta tgaaaatggattgaagcccgggccgtggattctactcaactttgtcggcttgagaaagacc tgggatcctgggtattaaaaagaagatctatttatttagagatctgttctattgtgatctc ttattaggatcgcactgccctgtggataacaaggatccggcttttaagatcaacaacctgg aaaggatcattaactgtgaatgatcggtgatcctggaccgtataagctgggatcagaatga ggggttatacacaactcaaaaactgaacaacagttgttctttggataactaccggttgatc caagcttcctgacagagttatccacagtagatcgcacgatctgtatacttatttgagtaaa ttaacccacgatcccagccattcttctgccggatcttccggaatgtcgtgatcaagaatgt tgatcttcagtg But there are no frequent 9-mers (that appear three or more times) in this region! SHOULD WE GIVE UP?