Where in the Genome does Replication Begin?

Where in the Genome does Replication Begin?
Chapter 1, Bioinformatics Algorithms, Phillip Compeau and Pavel Pevzner
ebook link: https://stepic.org/course/bioinformatics-algorithms-2/

DNA Replication
Replication origin (oriC)
DNA polymerase
Viral vectors
Frost-resistant tomatoes, pesticide-resistant corn
Gene therapy

Gene Therapy Vector
Origin of replication, multicloning site, selectable marker

The Problem
Finding Origin of Replication Problem
Input: The DNA string Genome.
Output: The location of oriC in Genome.
Is the Finding Origin of Replication Problem a clearly stated computational problem?

Hidden Messages in oriC
Bacterial genome: single circular chromosome
Vibrio cholerae: 1,108,250 nucleotides
Vibrio cholerae oriC:
atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc

Hidden Messages in oriC
DnaA: replication is mediated by this protein.
DnaA box: DnaA binds here.
Multiple DnaA boxes help DnaA bind better.
Vibrio cholerae oriC:
atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc

Hidden Message Problem
Find a hidden message in the replication origin.
Input: A string Text (representing the replication origin of a genome).
Output: A hidden message in Text.
Is the Hidden Message Problem a clearly stated computational problem?

The Eureka Moment

The Eureka Moment
"It may well be doubted whether human ingenuity can construct an enigma of the kind which human ingenuity may not, by proper application, resolve." -- Edgar Allan Poe (through Legrand)
;48 = THE

Hidden Messages
Are there frequent words in the oriC?
ACAACTATGCATACTATCGGGAACTATCCT
k-mer: String of length k.
Count(Text, Pattern): Number of times the k-mer Pattern appears as a substring of Text.
Count(ACAACTATGCATACTATCGGGAACTATCCT, ACTAT) = 3
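The Count function defined on this slide can be sketched in a few lines of Python. This is only an illustration: the name `pattern_count` is hypothetical (it does not come from the slides), and overlapping occurrences are counted, which is what a sliding window over every start position implies.

```python
def pattern_count(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlaps included.

    `pattern_count` is a hypothetical name chosen for this sketch.
    """
    k = len(pattern)
    # Slide a window of length k over every start position in text.
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


print(pattern_count("ACAACTATGCATACTATCGGGAACTATCCT", "ACTAT"))  # 3
```

The three matches start at 0-based positions 3, 12 and 22 of the example string.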

Frequent Words Problem
Find the most frequent k-mers in a string.
Input: A string Text and an integer k.
Output: All most frequent k-mers in Text.

Frequent Words Naive Solution
Total k-mers = $|Text| - k + 1$.
Each k-mer is compared with at most $|Text| - k$ other k-mers.
Each comparison compares at most k characters.
Time complexity: $O(|Text|^2 \cdot k)$.
Other implementations: $O(4^k + |Text| \cdot k)$, $O(|Text| \cdot k \cdot \log |Text|)$, $O(|Text|)$.
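As an illustration of how the Frequent Words Problem can be solved without the quadratic comparison, here is a minimal sketch that tallies k-mers in a hash map. It is not one of the implementations listed on the slide (those use a frequency array, sorting, or a suffix structure); the function name `frequent_words` is an assumption made for this example.

```python
from collections import Counter


def frequent_words(text, k):
    """Return (set of most frequent k-mers, their count).

    Hash-map based sketch; `frequent_words` is a hypothetical name.
    """
    # Tally every k-mer starting at positions 0 .. len(text) - k.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return set(), 0
    best = max(counts.values())
    return {kmer for kmer, c in counts.items() if c == best}, best


print(frequent_words("ACAACTATGCATACTATCGGGAACTATCCT", 5))  # ({'ACTAT'}, 3)
```

Each k-mer is sliced and hashed once, so the work grows roughly with $|Text| \cdot k$ rather than $|Text|^2 \cdot k$.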

Frequent Words The Encoding Mystery

k   count   k-mers
3   25      tga
4   11      atga tgat
5   8       gatca tgatc
6   8       tgatca
7   5       atgatca
8   4       atgatcaa
9   3       atgatcaag cttgatcat tcttgatca ctcttgatc

atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc

Frequent Words The Encoding Mystery

k   count   k-mers
3   25      tga
4   11      atga tgat
5   8       gatca tgatc
6   8       tgatca
7   5       atgatca
8   4       atgatcaa
9   3       atgatcaag cttgatcat tcttgatca ctcttgatc

Which of these results are statistically significant?

Frequent Words The Encoding Mystery
DnaA boxes are usually 9 nucleotides long.
Frequent 9-mers: ATGATCAAG, CTTGATCAT, TCTTGATCA, CTCTTGATC
Probability that a 9-mer appears 3 or more times in a randomly generated DNA string of length 500: 1/1300
Is one of the four 9-mers the DnaA box? If so, which one of the four?

Frequent Words The Encoding Mystery Which one of the four is more surprising compared to the others? Consider ATGATCAAG and CTTGATCAT. Reverse complements!!!

Frequent Words The Encoding Mystery
Which one of the four is more surprising compared to the others?
Consider ATGATCAAG and CTTGATCAT. Reverse complements!!!
5' ATGATCAAG 3'
3' TACTAGTTC 5'
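The reverse-complement relation between ATGATCAAG and CTTGATCAT can be checked with a short sketch. The helper name `reverse_complement` and the translation table are assumptions of this example, not something defined in the slides.

```python
# Map each nucleotide to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(pattern: str) -> str:
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return pattern.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGATCAAG"))  # CTTGATCAT
```

Because DnaA can bind either strand, occurrences of a candidate box and of its reverse complement are counted together in the slides that follow.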

Frequent Words The Encoding Mystery Which one of the four is more surprising compared to the others? Consider ATGATCAAG and CTTGATCAT. Reverse complements!!! atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc

Frequent Words The Encoding Mystery
6 occurrences of a 9-mer in a string of 500 nucleotides are statistically more significant than 3 occurrences.
Is ATGATCAAG the DnaA box?
atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc

Vibrio cholerae DnaA Box
How confident are we that the DnaA box has been found?
What if ATGATCAAG occurs along the entire genome?
Check for all occurrences of ATGATCAAG in the genome.

Pattern Matching Problem
Find all occurrences of a pattern in a string.
Input: Strings Pattern and Genome.
Output: All starting positions in Genome where Pattern appears as a substring.
116556, 149355, 151913, 152013, 152394, 186189, 194276, 200076, 224527, 307692, 479770, 610980, 653338, 679985, 768828, 878903, 985368
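A minimal sketch of the Pattern Matching Problem follows. The function name `pattern_positions` and the toy input are assumptions made for illustration; reproducing the positions listed on the slide would require the full Vibrio cholerae genome, which is not included here.

```python
def pattern_positions(pattern: str, genome: str) -> list:
    """Return all 0-based start positions where pattern occurs in genome."""
    k = len(pattern)
    # Compare the pattern against the window starting at each position.
    return [i for i in range(len(genome) - k + 1) if genome[i:i + k] == pattern]


print(pattern_positions("ATAT", "GATATATGCATATACTT"))  # [1, 3, 9]
```

Overlapping matches (positions 1 and 3 in the toy input) are reported separately.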

Clumps
Positions of ATGATCAAG form a clump in positions 151913, 152013, and 152394. There are no other clumps.
We now have strong computational and statistical evidence that ATGATCAAG/CTTGATCAT is the DnaA box in Vibrio cholerae.

More Insights...
Is ATGATCAAG/CTTGATCAT the DnaA box for all bacteria?
Is the clumping effect of ATGATCAAG/CTTGATCAT just a statistical fluke in Vibrio cholerae?
Do other bacteria have other DnaA boxes?

oriC of Thermotoga petrophila
aactctatacctcctttttgtcgaatttgtgtgatttatagagaaaatcttattaactga
aactaaaatggtaggtttggtggtaggttttgtgtacattttgtagtatctgatttttaa
ttacataccgtatattgtattaaattgacgaacaattgcatggaattgaatatatgcaaa
acaaacctaccaccaaactctgtattgaccattttaggacaacttcagggtggtaggttt
ctgaagctctcatcaatagactattttagtctttacaaacaatattaccgttcagattca
agattctacaacgctgttttaatgggcgttgcagaaaacttaccacctaaaatccagtat
ccaagccgatttcagagaaacctaccacttacctaccacttacctaccacccgggtggta
agttgcagacattattaaaaacctcatcagaagcttgttcaaaaatttcaatactcgaaa
cctaccacctgcgtcccctattatttactactactaataatagcagtataattgatctga

Thermotoga petrophila
ATGATCAAG or CTTGATCAT does not occur at all!
6 different 9-mers appear 3 or more times: AACCTACCA, AAACCTACC, ACCTACCAC, CCTACCACC, GGTAGGTTT, TGGTAGGTT.
The occurrence of 6 different 9-mers 3 or more times in a sequence of 500 nucleotides is extremely unlikely.
From the Ori-Finder tool, the DnaA box is CCTACCACC/GGTGGTAGG.

Thermotoga petrophila
GGTGGTAGG/CCTACCACC
aactctatacctcctttttgtcgaatttgtgtgatttatagagaaaatcttattaactga
aactaaaatggtaggtttggtggtaggttttgtgtacattttgtagtatctgatttttaa
ttacataccgtatattgtattaaattgacgaacaattgcatggaattgaatatatgcaaa
acaaacctaccaccaaactctgtattgaccattttaggacaacttcagggtggtaggttt
ctgaagctctcatcaatagactattttagtctttacaaacaatattaccgttcagattca
agattctacaacgctgttttaatgggcgttgcagaaaacttaccacctaaaatccagtat
ccaagccgatttcagagaaacctaccacttacctaccacttacctaccacccgggtggta
agttgcagacattattaaaaacctcatcagaagcttgttcaaaaatttcaatactcgaaa
CCTACCACCtgcgtcccctattatttactactactaataatagcagtataattgatctga

Now What?
Unlikely that ATGATCAAG/CTTGATCAT or GGTGGTAGG/CCTACCACC are DnaA boxes for a newly sequenced bacterium.
Most frequent 9-mers in T. petrophila did not give any special indication to identify the DnaA box.
Does that mean that our heuristic for finding the DnaA boxes has just failed?
Step back a bit...

What are we trying to solve?
Where is the oriC?
Big clue: identify the DnaA box.
From our experience with Vibrio cholerae:
The DnaA box is (most likely) the k-mer that occurs in clumps in a short sequence of the genome.
This sequence is (most likely) in the neighbourhood of the oriC.

Clump Finding
Find every k-mer that forms a clump in the genome in a window of size L.
Given integers L and t, a k-mer Pattern forms an (L, t)-clump inside a (larger) string Genome if there is an interval of Genome of length L in which this k-mer appears at least t times.
ATGATCAAG forms a (500, 3)-clump in the Vibrio cholerae genome.

Clump Finding Problem
Find patterns forming clumps in a string.
Input: A string Genome, and integers k, L and t.
Output: All distinct k-mers forming (L, t)-clumps in Genome.
Continuing from the naive frequent words algorithm: $O(L^2 \cdot k \cdot |Genome|)$.
Can you come up with an algorithm that takes $O(4^k + k \cdot |Genome|)$?
$L < 1000$, $k < 15$, $|Genome| < 10^7$
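One way to approach the Clump Finding Problem is sketched below. It is not the $O(4^k + k \cdot |Genome|)$ frequency-array algorithm the slide asks for. Instead it records the start positions of every k-mer in a hash map and checks whether any t consecutive occurrences fit inside a window of length L. The name `find_clumps` and the toy genome are assumptions of this example.

```python
from collections import defaultdict


def find_clumps(genome: str, k: int, L: int, t: int) -> set:
    """Return all distinct k-mers forming an (L, t)-clump in genome.

    Position-list sketch; `find_clumps` is a hypothetical name.
    """
    # Record the (sorted) start positions of every k-mer.
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)

    clumps = set()
    for kmer, pos in positions.items():
        # t occurrences fit in a window of length L when the span from the
        # first start to the end of the t-th occurrence is at most L.
        for j in range(len(pos) - t + 1):
            if pos[j + t - 1] + k - pos[j] <= L:
                clumps.add(kmer)
                break
    return clumps


# ATG starts at 0, 4 and 9, so three copies span exactly 12 characters.
print(find_clumps("ATGCATGTTATG", k=3, L=12, t=3))  # {'ATG'}
```

Under this definition, the three ATGATCAAG occurrences reported earlier (151913, 152013, 152394) span 152394 + 9 - 151913 = 490 nucleotides, which fits in a window of 500.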

Clump Finding in E. coli
More than 1904 different 9-mers form (500, 3)-clumps in E. coli!
Each is as likely a candidate as the other for the DnaA box. What now?
Biological insights into the replication process might help...

DNA Replication Revisited Replication Terminus

DNA Replication

DNA Replication Revisited
DNA polymerase can read the strand from 3' → 5' only.
www.dnalc.org

DNA Replication
How does the unidirectional polymerase replicate the entire circular genome?
How many DNA polymerases are required? Why?

DNA Replication

Asymmetry in DNA Replication

Asymmetry in DNA Replication
Leading or Reverse half-strand (3' → 5'):
DNA polymerase works non-stop.
Completes replication sooner than the Forward half-strand.
Lives double-stranded most of its life.
Lagging or Forward half-strand (5' → 3'):
DNA polymerases work in stop-go fashion on Okazaki fragments.
DNA ligase binds Okazaki fragments.
Waits longer for the DNA polymerase to attach and replicate.
Lives single-stranded most of the time.

Asymmetry in DNA Replication

Asymmetry in DNA Replication
Does this asymmetry provide clues for identification of the oriC?
The forward half-strand undergoes more mutations during its time as a single strand than the reverse half-strand.
Which among A, C, G, T has the highest mutation rate?

                      C        G        A        T
Entire Strand         427419   413241   491488   491363
Reverse Half-Strand   219518   201634   243963   246641
Forward Half-Strand   207901   211607   247525   244722
Difference            +11617   -9973    -3562    +1919

Peculiar Statistics of Half-Strands

                      C        G        A        T
Entire Strand         427419   413241   491488   491363
Reverse Half-Strand   219518   201634   243963   246641
Forward Half-Strand   207901   211607   247525   244722
Difference            +11617   -9973    -3562    +1919

Deamination rises 100-fold in forward strands: Cytosine → Thymine.
T-G bonds are corrected to T-A in the 2nd round of replication.
Forward HS (single-stranded life): shortage of C, normal G.
Reverse HS (double-stranded life): shortage of G, normal C.

Peculiar Statistics of Half-Strands

Skew Diagram
$Skew_i(Genome)$ plotted for $i$ from 0 to $|Genome|$

E. coli Skew Diagram
Where is the oriC?

Minimum Skew Problem
Find a position in a genome minimizing the skew.
Input: A DNA string Genome.
Output: All integer(s) i minimizing $Skew_i(Genome)$ among all values of i (from 0 to $|Genome|$).
Does the min skew position change based on varying initial positions?
Approximate position of the oriC in E. coli: 3923620
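The Minimum Skew Problem can be sketched as follows. The transcript does not preserve a definition of the skew, so this example assumes the usual one: $Skew_i(Genome)$ is the number of G minus the number of C among the first i nucleotides. That choice is consistent with the shortage of G on the reverse half-strand and of C on the forward half-strand noted earlier. The names `skew` and `minimum_skew` are hypothetical.

```python
def skew(genome: str) -> list:
    """Return [Skew_0, ..., Skew_n], assuming Skew_i = #G - #C in genome[:i]."""
    values = [0]
    for base in genome.upper():
        # +1 for G, -1 for C, unchanged for A and T.
        values.append(values[-1] + (base == "G") - (base == "C"))
    return values


def minimum_skew(genome: str) -> list:
    """Return every index i at which the skew reaches its minimum."""
    values = skew(genome)
    lowest = min(values)
    return [i for i, v in enumerate(values) if v == lowest]


sample = "TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"
print(minimum_skew(sample))  # [11, 24]
```

A single left-to-right pass suffices, so the sketch runs in time linear in $|Genome|$. On a circular genome the skew is expected to fall along the reverse half-strand and rise along the forward half-strand, which is why its minimum is taken as a candidate location for oriC.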

DnaA box of E. coli
Approximate position of the oriC in E. coli: 3923620
aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgc
ataacgcggtatgaaaatggattgaagcccgggccgtggattctactcaa
ctttgtcggcttgagaaagacctgggatcctgggtattaaaaagaagatc
tatttatttagagatctgttctattgtgatctcttattaggatcgcactg
ccctgtggataacaaggatccggcttttaagatcaacaacctggaaagga
tcattaactgtgaatgatcggtgatcctggaccgtataagctgggatcag
aatgaggggttatacacaactcaaaaactgaacaacagttgttctttgga
taactaccggttgatccaagcttcctgacagagttatccacagtagatcg
cacgatctgtatacttatttgagtaaattaacccacgatcccagccattc
ttctgccggatcttccggaatgtcgtgatcaagaatgttgatcttcagtg
No 9-mer occurs 3 or more times in this oriC!

Revisit Vibrio cholerae
atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc
Observe the 9-mers ATGATCAAC and CATGATCAT.

Previously Invisible DnaA Boxes
atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc
Finding 8 approximate occurrences of 9-mers is statistically more surprising.
ATGATCAAG, CTTGATCAT, ATGATCAAC, CATGATCAT
DnaA can bind to slight modifications of the DnaA boxes!

Approximate Pattern Matching Problem
Find all approximate occurrences of a pattern in a string.
Input: Strings Pattern and Text along with an integer d.
Output: All starting positions where Pattern appears as a substring of Text with at most d mismatches.
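A minimal sketch of the Approximate Pattern Matching Problem follows, counting mismatches with the Hamming distance between the pattern and each window of the text. The function names and the toy input are assumptions made for this example.

```python
def hamming_distance(p: str, q: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(p, q))


def approximate_positions(pattern: str, text: str, d: int) -> list:
    """0-based starts where pattern matches text with at most d mismatches."""
    k = len(pattern)
    return [
        i
        for i in range(len(text) - k + 1)
        if hamming_distance(pattern, text[i:i + k]) <= d
    ]


print(hamming_distance("ATGATCAAG", "ATGATCAAC"))  # 1
print(approximate_positions("ACG", "ACGTAGGACT", 1))  # [0, 4, 7]
```

With d = 1, the 9-mer ATGATCAAC counts as an approximate occurrence of ATGATCAAG, which is how the previously invisible DnaA boxes become visible.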

DnaA Boxes in E. coli
aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgc
ataacgcggtatgaaaatggattgaagcccgggccgtggattctactcaa
ctttgtcggcttgagaaagacctgggatcctgggtattaaaaagaagatc
tatttatttagagatctgttctattgtgatctcttattaggatcgcactg
ccctgtggataacaaggatccggcttttaagatcaacaacctggaaagga
tcattaactgtgaatgatcggtgatcctggaccgtataagctgggatcag
aatgaggggttatacacaactcaaaaactgaacaacagttgttctttgga
TAACtaccggttgatccaagcttcctgacagagTTATCCACAgtagatcg
cacgatctgtatacttatttgagtaaattaacccacgatcccagccattc
ttctgccggatcttccggaatgtcgtgatcaagaatgttgatcttcagtg
DnaA box of E. coli: TTATCCACA

Epilogue
Hidden messages cluster in a genome: clumps.
DnaA boxes may not be perfect.

Complications
Some bacteria have fewer DnaA boxes.
Frequent Words Problem does not work!
The terminus of replication is often not located directly opposite to oriC.
The skew diagram is often more complex than E. coli's.
Skew diagram of T. petrophila

Open Problems
Multiple origins of replication in a bacterial genome
Finding oriC in Archaea and Yeast
Computing probabilities of patterns in a string

Multiple Origins of Replication
Biologists long believed that each bacterial chromosome has a single oriC.
Xia (2012) argued that some bacteria may have multiple replication origins.
Bacteria would be able to replicate faster.
Skew diagram of Wigglesworthia glossinidia
Does a bacterial genome have multiple origins of replication?
Xia, DNA Replication and Strand Asymmetry in Prokaryotic and Mitochondrial Genomes, Current Genomics, 13(1), 2012

Multiple Origins of Replication
Genome rearrangements can cause multiple local minima in the skew diagram.
Reversal: a segment of chromosome is flipped and switched into the opposite strand.
Horizontal gene transfer: a gene from the forward half-strand of one genome is transferred to the reverse half-strand of another.

Finding oriCs in Archaea and Yeast
Archaea have multiple oriCs.
Skew diagram of Sulfolobus solfataricus
Yeast have hundreds of oriCs.
Coordinated replication
Develop an algorithm to reliably locate oriCs in Archaea and Yeast.

Computing Probabilities of Patterns in a String
Is it statistically surprising to find a 9-mer appearing 3 or more times within 500 nucleotides?
The probability that 01 (11) appears in a random binary string of length 4 is 11/16 (8/16).
The overlapping words paradox: pattern 11 overlaps with itself, but 01 does not.
$Pr_d(N, A, Pattern, t)$, $Pr(N, A, k, t)$, ...
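The 11/16 and 8/16 figures can be verified by brute force, enumerating all 16 binary strings of length 4. The sketch below does this; the helper name `count_containing` is an assumption made for this example, and exhaustive enumeration is only feasible for very short strings.

```python
from itertools import product


def count_containing(pattern: str, n: int, alphabet: str = "01") -> tuple:
    """Return (strings of length n containing pattern, total strings)."""
    total = hits = 0
    for chars in product(alphabet, repeat=n):
        total += 1
        # bool adds as 0 or 1: count the string if pattern occurs in it.
        hits += pattern in "".join(chars)
    return hits, total


print(count_containing("01", 4))  # (11, 16)
print(count_containing("11", 4))  # (8, 16)
```

The two patterns have the same length, yet 11 appears in fewer strings: its occurrences can overlap (as in 111), so they tend to bunch together inside the same string.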