Crash Course in Omics Terminology, Concepts & Data Types

Hasonló dokumentumok
Crash Course in Omics Terminology, Concepts & Data Types

Crash Course in Omics Terminology, Concepts & Data Types

Welcome! EuPathDB Workshop Crash Course in Omics Terminology, Concepts & Data Types

Phenotype. Genotype. It is like any other experiment! What is a bioinformatics experiment? Remember the Goal. Infectious Disease Paradigm

Welcome! EuPathDB Workshop 2010

Crash Course in Omics Terminology and Concepts

Trinucleotide Repeat Diseases: CRISPR Cas9 PacBio no PCR Sequencing MFMER slide-1

Mapping Sequencing Reads to a Reference Genome

Tutorial 1 The Central Dogma of molecular biology

Genome 373: Hidden Markov Models I. Doug Fowler

Crash Course in Omics Terminology and Concepts. Genome assembly. Clone-by-Clone Genome Sequencing. Shotgun DNA Sequencing (Technology)

Supporting Information

SOLiD Technology. library preparation & Sequencing Chemistry (sequencing by ligation!) Imaging and analysis. Application specific sample preparation

discosnp demo - Peterlongo Pierre 1 DISCOSNP++: Live demo

Bioinformatics: Blending. Biology and Computer Science

Correlation & Linear Regression in SPSS

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet Nonparametric Tests

Nan Wang, Qingming Dong, Jingjing Li, Rohit K. Jangra, Meiyun Fan, Allan R. Brasier, Stanley M. Lemon, Lawrence M. Pfeffer, Kui Li

On The Number Of Slim Semimodular Lattices

3. MINTAFELADATSOR KÖZÉPSZINT. Az írásbeli vizsga időtartama: 30 perc. III. Hallott szöveg értése

Supplementary Figure 1

Flowering time. Col C24 Cvi C24xCol C24xCvi ColxCvi

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet Factor Analysis

Construction of a cube given with its centre and a sideline

First experiences with Gd fuel assemblies in. Tamás Parkó, Botond Beliczai AER Symposium

Supplemental Table S1. Overview of MYB transcription factor genes analyzed for expression in red and pink tomato fruit.

Correlation & Linear Regression in SPSS

Orvosi Genomtudomány 2014 Medical Genomics Április 8 Május 22 8th April 22nd May

Supporting Information

EN United in diversity EN A8-0206/419. Amendment

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet. Correlation & Linear. Petra Petrovics.

Cluster Analysis. Potyó László

ANGOL NYELV KÖZÉPSZINT SZÓBELI VIZSGA I. VIZSGÁZTATÓI PÉLDÁNY

Performance Modeling of Intelligent Car Parking Systems

The beet R locus encodes a new cytochrome P450 required for red. betalain production.

Sebastián Sáez Senior Trade Economist INTERNATIONAL TRADE DEPARTMENT WORLD BANK

EEA, Eionet and Country visits. Bernt Röndell - SES

STUDENT LOGBOOK. 1 week general practice course for the 6 th year medical students SEMMELWEIS EGYETEM. Name of the student:

Bird species status and trends reporting format for the period (Annex 2)

Architecture of the Trypanosome RNA Editing Accessory Complex, MRB1

Angol Középfokú Nyelvvizsgázók Bibliája: Nyelvtani összefoglalás, 30 kidolgozott szóbeli tétel, esszé és minta levelek + rendhagyó igék jelentéssel

Supplementary Table 1. Cystometric parameters in sham-operated wild type and Trpv4 -/- rats during saline infusion and

ANGOL NYELV KÖZÉPSZINT SZÓBELI VIZSGA I. VIZSGÁZTATÓI PÉLDÁNY

A modern e-learning lehetőségei a tűzoltók oktatásának fejlesztésében. Dicse Jenő üzletfejlesztési igazgató

Statistical Inference

Expression analysis of PIN genes in root tips and nodules of Lotus japonicus

Budapest By Vince Kiado, Klösz György

Rezgésdiagnosztika. Diagnosztika

Smaller Pleasures. Apróbb örömök. Keleti lakk tárgyak Répás János Sándor mûhelyébõl Lacquerware from the workshop of Répás János Sándor

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet. Nonparametric Tests. Petra Petrovics.

USER MANUAL Guest user

Expansion of Red Deer and afforestation in Hungary

Involvement of ER Stress in Dysmyelination of Pelizaeus-Merzbacher Disease with PLP1 Missense Mutations Shown by ipsc-derived Oligodendrocytes

Statistical Dependence

Contact us Toll free (800) fax (800)

ELIXIR-Magyarország: lehetőségek és kihívások: Bálint Bálint L, Debreceni Egyetem, ELIXIR-Magyarország oktatási koordinátor

kpis(ppk20) kpis(ppk46)

Supporting Information

7 th Iron Smelting Symposium 2010, Holland

Using the CW-Net in a user defined IP network

Forensic SNP Genotyping using Nanopore MinION Sequencing

ELEKTRONIKAI ALAPISMERETEK ANGOL NYELVEN

Minta ANGOL NYELV KÖZÉPSZINT SZÓBELI VIZSGA II. Minta VIZSGÁZTATÓI PÉLDÁNY

ANGOL NYELVI SZINTFELMÉRŐ 2013 A CSOPORT. on of for from in by with up to at

Gottsegen National Institute of Cardiology. Prof. A. JÁNOSI

Utolsó frissítés / Last update: február Szerkesztő / Editor: Csatlós Árpádné

Széchenyi István Egyetem

Pletykaalapú gépi tanulás teljesen elosztott környezetben

Tudományos Ismeretterjesztő Társulat

Manuscript Title: Identification of a thermostable fungal lytic polysaccharide monooxygenase and

Felnőttképzés Európában

már mindenben úgy kell eljárnunk, mint bármilyen viaszveszejtéses öntés esetén. A kapott öntvény kidolgozásánál még mindig van lehetőségünk

A cell-based screening system for RNA Polymerase I inhibitors

A rosszindulatú daganatos halálozás változása 1975 és 2001 között Magyarországon

Create & validate a signature

Az Open Data jogi háttere. Dr. Telek Eszter

University of Bristol - Explore Bristol Research

Választási modellek 3

Can/be able to. Using Can in Present, Past, and Future. A Can jelen, múlt és jövő idejű használata

Proxer 7 Manager szoftver felhasználói leírás

A jövedelem alakulásának vizsgálata az észak-alföldi régióban az évi adatok alapján

Magyar - Angol Orvosi Szotar - Hungarian English Medical Dictionary (English And Hungarian Edition) READ ONLINE

Utolsó frissítés / Last update: Szeptember / September Szerkesztő / Editor: Csatlós Árpádné

Longman Exams Dictionary egynyelvű angol szótár nyelvvizsgára készülőknek

Cashback 2015 Deposit Promotion teljes szabályzat

ELEKTRONIKAI ALAPISMERETEK ANGOL NYELVEN

1. MINTAFELADATSOR KÖZÉPSZINT. Az írásbeli vizsga időtartama: 30 perc. III. Hallott szöveg értése

Tudományos Ismeretterjesztő Társulat

Report on esi Scientific plans 7 th EU Framework Program. José Castell Vice-President ecopa, ES

A fő egészségügyi kihívások

Tavaszi Sporttábor / Spring Sports Camp május (péntek vasárnap) May 2016 (Friday Sunday)

SQL/PSM kurzorok rész

JEROMOS A BARATOM PDF

Markerless Escherichia coli rrn Deletion Strains for Genetic Determination of Ribosomal Binding Sites

Supplemental Information. Investigation of Penicillin Binding Protein. (PBP)-like Peptide Cyclase and Hydrolase

István Micsinai Csaba Molnár: Analysing Parliamentary Data in Hungarian

Lopocsi Istvánné MINTA DOLGOZATOK FELTÉTELES MONDATOK. (1 st, 2 nd, 3 rd CONDITIONAL) + ANSWER KEY PRESENT PERFECT + ANSWER KEY

Effect of the different parameters to the surface roughness in freeform surface milling

EPILEPSY TREATMENT: VAGUS NERVE STIMULATION. Sakoun Phommavongsa November 12, 2013

Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet. Hypothesis Testing. Petra Petrovics.

Átírás:

Crash Course in Omics Terminology, Concepts & Data Types 14 th Annual EuPathDB Workshop Jessie Kissinger June 2, 2019 FungiDB MicrosporidiaDB AmoebaDB PlasmoDB ToxoDB CryptoDB PiroplasmaDB TrichDB GiardiaDB TriTrypDB 380 genome sequences in Release 43! Pawlowski BMC Biology 2013 1

Infectious Disease Paradigm Experimental systems Host Pathogen Vector Protein Expression RNA Sequencing Genome and Annotation Climate Structure Interactions Host distribution BLAST Similarities Phylogenetics parasitophorous! vacuole! Orthologs SNPs ChIP-Chip Host- Pathogen transcription factors! Subcellular Location (FOS, EGR2, etc)! Phenotype apoptosis! mitochondrial function! cytokines, signaling, etc! sterol biosynthesis! Comparative Genomics cell cycle, DNA replication/repair! Expression profiling Putative arrays Function (many platforms) (GO) host cell! nucleus! Pathways Models Host response Isolates Modified from slide provided by David Roos 2

Genomes:Important Considerations Biological Size Mb Gb Ploidy Haploid Diploid Tetraploid Repeat content Retrotransposons Big gene families AT content Clone vs population Technical Read length Short Long Coverage 5X 100X Read Quality 20 30 40 Bias? DNA sequencing technologies: 2006 2016 Elaine R Mardis NATURE PROTOCOLS VOL.12 NO.2 2017 213 Seq types chart download (11 MB) https://www.dropbox.com/s/kfkkft5glmxd68z/forallyouseqmethods.pdf?dl=0 (PDF figures for next two slides) 3

https://github.com/ecological-and-evolutionary-genomics/eeg2016/wiki/mar-21-exercise-7----spades-assembler 4

https://www.biostars.org/p/104218/ https://github.com/trinityrnaseq/naplesworkshop2016/wiki/day_2 5

Pairs Give Order & Orientation Scaffold https://www.biostars.org/p/104218/ http://www.cs.cmu.edu/~sssykim/teaching/s13/slides/lecturesvii.pdf 6

FASTQ format for reads https://www.abmgood.com/marketing/knowledge_base/next_generation_sequencing_data_analysis.php https://www.aphl.org/conferences/proceedings/documents/2018/4_eija%20trees.pdf https://www.aphl.org/conferences/proceedings/documents/2018/4_eija%20trees.pdf 7

Hybrid Assembly = short + long reads from differing technologies It is a VERY useful approach for correcting genome assemblies https://genome.cshlp.org/content/ early/2017/01/27/gr.213405.116.full.pdf 30,000 ft View - Annotation I II III IV Genes A B B C D E F 8

9 AAGCTTCGCCAGGCTGTAAATCCCGTGAGTCGTCCTCACAAATCATCAAGCAGGTGTCCTCAGGGAGACTGCCTGACTGAGTTATGCTAATTCCTTTCTACTTTGGCGTGGTCACGTGTA ACCATATCCGAATCATTTCTCTAGCCCTACGAACAGGTAAGAGCGCTAGGGATGTCCGTGGAGTAGTGTGCTTACTCGATAATATTCAGTTGGGACTACCAGCGAGGCGCTCGCTTTGCT CACGCAATGCCTGAGACAGTTGCAGAATGAATGGTAACCGACAAACGCGTTCATATGCGTTTTCAAACTTAGTAGACGCGTACTGTCTGAAACTGGCGGTCACAGGCACCAGATAACGCC CTTGGCATCGGCATGTCTCGTACAGAGGTCCGTATGTAGTGCCACGACTTCTAAATCCGGCGACAGGCTGGTCTTTTGTCTTACCACGTATTAGCCCGCGTGCGATTTCTCGGAGCGCAC CTGTTCAACACTAGAAAACGGAGTTTCCTGATCGAGAAGCCACCACCTTTCCAGAAGTTGAACGCTAGCATGTCATTCGATTTTCACCCCCCGCGTAGTTCCTGTGTGTCATTCGTTGTC GAGACAACTCTGTCCCGCCCCGGTGCTGTTCCATATGCGTGACTTTCCCGCAATTTTTTCAGACTTTCAGGAAAGACAGGCTCCGGAACGATCTCGTCCATGACTGGTAAATCCACGACA CCGCAATGGCCCCCAGCACCTCTATCTCTCGTGCCAGGGGACTAACGTTGTATGCGTCTGCGTCTTGTCTTTTTGCATTCGCTTTCCAAAAAAGAGAGCCATCCGTTCCCCCGCACATTC AACGCCGCGAGTGCGGTTTTTGTCTTTTTTGAGTGGTAGGACGCTTTTCATGCGCGAACTACGTGGACATTAAGTTCCATTCTCTTTTTCGACAGCACGAAACCTTGCATTCAAACCCGC CCGCGGAAGATCCGATCTTGCTGCTGTTCGCAGTCCCAGTAGCGTCCTGTCGGCCGCGCCGTCTCTGTTGGTGGGCAGCCGCTACACCTGTTATCTGACTGCCGTGCGCGAAAATGACGC CATTTTTGGGAAAATCGGGGAACTTCATTCTTTAAAAGTATGCGGAGGTTTCCTTTTTCTTCTGTTCGTTTCTTTTTCTCGGGTTTGATAACCGTGTTCGATGTAAGCACTTTCCGTCTC TCCTCCGTGCTTTGTTCGACATCGAGACCAGGTGTGCAGATCCTTCGCTTGTCGATCCGGAGACGCGTGTCTCGTAGAACCTTTTCATTTTACCACACGGCAGTGCGGAGCACTGCTCTG AGTGCAGCAGGGACGGGTGAAGTTTCGCTTTAGTAGTGCGTTTCTGCTCTACGGGGCGTTGTCGTGTCTGGGAAGATGCAGAAACCGGTGTGTCTGGTCGTCGCGATGACCCCCAAGAGG GGCATCGGCATCAACAACGGCCTCCCGTGGCCCCACTTGACCACAGATTTCAAACACTTTTCTCGTGTGACAAAAACGACGCCCGAAGAAGCCAGTCGCCTGAACGGGTGGCTTCCCAGG AAATTTGCAAAGACGGGCGACTCTGGACTTCCCTCTCCATCAGTCGGCAAGAGATTCAACGCCGTTGTCATGGGACGGAAAACCTGGGAAAGCATGCCTCGAAAGTTTAGACCCCTCGTG GACAGATTGAACATCGTCGTTTCCTCTTCCCTGTGAGCACACACAGTAGTCGCCACACGCTGTTTGAGACGTGTCAATCTCCAAGAGTGTGGACGCTGTTCCACGTCTTCAAATGTTTCC CAACATCCGTCGTCTAGTAGACACACCAACAAAAAGCACACGGCGAATCTGCTCATCGGAGGGAGGAGCCGGGGGGCACACAACTATCCTCAACTCTCGAACGAACATATCCGGGGCCGC GAAGACGTCCAGTCTCTCAAATCCAACCCGGAACGCAAACATTTCTGCATCAAGTCACGATTGCGCCGGTACCTCCATGTGTAAGCAGTTCCATGAAACCTCCGATATTACACACGACTG TGGATATGAATTATATGCAGATGCATATATACTGAGACGCCGATGCAACTATAGGTTTCCTGGCCCTCCATGGATATTTCAGACCTTCCTCTCACATTTGGTTTGCCCGTACACCTCCGT TACGCTTTTTTTCTGGCTTTCTTCTTCGTCTCTGTTTATCAGCAAAGAAGAAGACATTGCGGCGGAGAAGCCTCAAGCTGAAGGCCAGCAGCGCGTCCGAGTCTGTGCTTCACTCCCAGC AGCTCTCAGCCTTCTGGAGGAAGAGTACAAGGATTCTGTCGACCAGATTTTTGTCGTGGGTATGTTGTCCTAAACTCCTTGGAACTCCATTCTTGGTCAGAAACGTACTGAAACTGTATA CATGTATATACAGATGTATGGATAATATCTAGAGAAGATACAGGGAAGACTGGCAAGGATGAAAAGACATGCAGCTTTAACGAAGCAGAGGGCATTGGCGAGAGGGACGCCCGTTATGCT GTGTGATGTGGCTGTGAATCTTACCTCGCCGTTTGACTTGCTGCAGCGCTTTGTCCACTTGAACGTGACTTCTTGTTTCTACCTTCCCCAACGCCTTCTATTCCCTTCACTGCGAAAGCG CGCTCAGTGGGCCGTCACCGAACACCCTTGGTTCTTTCGTTCAGCTGTTGTCCTCTTTCTCGCGTTGCTTCCTGTGGCGTCGTGGCTCGGCTTCTCTCTCTTTCCTGTTGGTGCGTCCAG ACTATGTCGCCTGTTTCCCCACCCTTCTCGGCTTGTGCTTTCAGGAGGAGCGGGACTGTACGAGGCAGCGCTGTCTCTGGGCGTTGCCTCTCACCTGTACATCACGCGTGTAGCCCGCGA GTTTCCGTGCGACGTTTTCTTCCCTGCGTTCCCCGGAGATGACATTCTTTCAAACAAATCAACTGCTGCGCAGGCTGCAGCTCCTGCCGAGTCTGTGTTCGTTCCCTTTTGTCCGGAGCT CGGAAGAGAGAAGGACAATGAAGCGACGTATCGACCCATCTTCATTTCCAAGACCTTCTCAGACAACGGGGTACCCTACGACTTTGTGGTTCTCGAGAAGAGAAGGAAGACTGACGACGC AGCCACTGCGGAACCGGTAAGAGGCAACCGAAGCGCGTAGATAAGAAAAACAACAAAGAGAAGGTGAAACACGAAGAGAAGGGAAAATGCGGAGAAACCGTGGATTTACAAAGATATCAA GAGCAATGCTTTGTGGAGATTTTTTTTAATTCAGTAGAGACACCCGCCGTGCGAGGTGTGTAGAAATAACTGCGACCCTGGAGACAGAGATGCCGCGAGTACACCACTTGTCGTTTTTCC TCCTATGTTCATGACGGGTGCTGAACGTCTATCGTACTTAATTGGAGGAGTCGTCTCCGAAGCAGCTTTGGCTGGCCATCCGTGTGTTTGCCTTGTTCCTGAAAAGCCAGAAGGCGCTCC ACAGTGAGGCGATATACAGGGACGCCTACCGGAGCCCCGTTTTCTGCCTTTGTCGACTCTTGCAGAGCAACGCAATGAGCTCCTTGACGTCCACGAGGGAGACAACTCCCGTGCACGGGT TGCAGGCTCCTTCTTCGGCCGCAGCCATTGCCCCGGTGTTGGCGTGGATGGACGAAGAAGACCGGAAAAAACGCGAGCAAAAGGAACTGATTCGGGCCGTTCCGCATGTTCACTTTAGAG GCCATGAAGAATTCCAGTACCTTGATCTCATTGCCGACATTATTAACAATGGAAGGACAATGGATGACCGAACGGGTAACGGCGACTGCGAGAAAAAGCCACACCGTTTTCTCCTGTGAT TCTGTCCGCAAGCCCTCTTTTGCTTCATCCACCCTTTGCTATTCTCCGCCGCCTTCCTTTTCTGCTCCATGTTCAATTCGTTCGCTTCTTCAGTCTTTCCATCTTCCCCTGTTACCTCTG TCATTCGTTTTCTTGCCTCTATTTAACTGTGTTCTACTCACAGTCTGCATTCCGCGATAGACGAGCTTCCACGTCTTGCGTCTCGACAAGCAACTGTCATTTGTACGCGCCTCCCTCCAC CGTGAATCGGATTGTCGGTTCGCCGGTTCCTGGGTCAGAAAAGGCCTGCGCCAGTATTCTGAATAATACCCTTCGCCATTGTAAAGAGGCGAAGGAACAAAGAGATATTTCGGCGCATCT TTTGTGCGGCGCGTTTCCTCGTGCTTCACACCGATGCCCTTCTGTGCATGTCTTCTGCTCCTCGTCCTTCTCTCTTTTTCCCTGTTTAGGCGTTGGTGTCATCTCCAAATTCGGCTGCAC TATGCGCTACTCGCTGGATCAGGCCTTTCCACTTCTCACCACAAAGCGTGTGTTCTGGAAAGGGTAAGGGCGTCTTCAGTGAATGCATATATTTGACTTCAGACATTCTTAACTGTTTGA CAACCAACGTACAAATTTGTTTGTCCGTGTGCGTGTTCGACATGTCAAGTATGTGAAGAGTCGCTACTGTAGACTAACGCACGAACCAGATTTGTTTATCTGCATGCGCTGTGCACCCGT TTCTGAGTGTCTGGAGTTTCCGCAACCTTCCTTTGAATTTCTGGGTTCGTTTTTTTATGCGCGCACTGGTTTGCATGTGGCCTGAGAGAGCACAGATCGAAGGTGGGGTGATGTGGCGTC GCTGCAGAGAAACTCCGGCGAAGGCGACAGATAAAGGAGAGTGGAAATCATTGAACAGTGTCGGTCGTCTGTTGTTTCGCAGGGTCCTCGAAGAGTTGCTGTGGTTCATTCGCGGCGACA CGAACGCAAACCATCTTTCTGAGAAGGGCGTGAAGGCAAGTCTACGTTGTACCTCTTGTCTCTGCCGAAGCTCAGATGTCTCCACGGCGTTGGTTTCTTTTCGTTTTTGCTTTCGTGGCA TTACCATCGAGTCACCACTCATAGTTGCGTGTGTCTACATGTTTTCTAGAACGTCCGTTGTGTTGCCTCGTGGCGACCGGCGCGAGTGTATGTACCCTGCGCTGTCAGAAGTTGATCCTT Six Frame Translation ORF-finding ORFs Genes

The Coding Sequence - CDS ATGCAGAAACCGGTGTGTCTGGTCGTCGCGATGACCCCCAAGAGGGGCATCGGCATCAACAACGGCCTCCCGTGGCCCC ACTTGACCACAGATTTCAAACACTTTTCTCGTGTGACAAAAACGACGCCCGAAGAAGCCAGTCGCCTGAACGGGTGGCT TCCCAGGAAATTTGCAAAGACGGGCGACTCTGGACTTCCCTCTCCATCAGTCGGCAAGAGATTCAACGCCGTTGTCATG GGACGGAAAACCTGGGAAAGCATGCCTCGAAAGTTTAGACCCCTCGTGGACAGATTGAACATCGTCGTTTCCTCTTCCC TCAAAGAAGAAGACATTGCGGCGGAGAAGCCTCAAGCTGAAGGCCAGCAGCGCGTCCGAGTCTGTGCTTCACTCCCAGC AGCTCTCAGCCTTCTGGAGGAAGAGTACAAGGATTCTGTCGACCAGATTTTTGTCGTGGGAGGAGCGGGACTGTACGAG GCAGCGCTGTCTCTGGGCGTTGCCTCTCACCTGTACATCACGCGTGTAGCCCGCGAGTTTCCGTGCGACGTTTTCTTCC CTGCGTTCCCCGGAGATGACATTCTTTCAAACAAATCAACTGCTGCGCAGGCTGCAGCTCCTGCCGAGTCTGTGTTCGT TCCCTTTTGTCCGGAGCTCGGAAGAGAGAAGGACAATGAAGCGACGTATCGACCCATCTTCATTTCCAAGACCTTCTCA GACAACGGGGTACCCTACGACTTTGTGGTTCTCGAGAAGAGAAGGAAGACTGACGACGCAGCCACTGCGGAACCGAGCA ACGCAATGAGCTCCTTGACGTCCACGAGGGAGACAACTCCCGTGCACGGGTTGCAGGCTCCTTCTTCGGCCGCAGCCAT TGCCCCGGTGTTGGCGTGGATGGACGAAGAAGACCGGAAAAAACGCGAGCAAAAGGAACTGATTCGGGCCGTTCCGCAT GTTCACTTTAGAGGCCATGAAGAATTCCAGTACCTTGATCTCATTGCCGACATTATTAACAATGGAAGGACAATGGATG ACCGAACGG >Translation Frame 1 MQKPVCLVVAMTPKRGIGINNGLPWPHLTTDFKHFSRVTKTTPEEASRLN GWLPRKFAKTGDSGLPSPSVGKRFNAVVMGRKTWESMPRKFRPLVDRLNI VVSSSLKEEDIAAEKPQAEGQQRVRVCASLPAALSLLEEEYKDSVDQIFV VGGAGLYEAALSLGVASHLYITRVAREFPCDVFFPAFPGDDILSNKSTAA QAAAPAESVFVPFCPELGREKDNEATYRPIFISKTFSDNGVPYDFVVLEK RRKTDDAATAEPSNAMSSLTSTRETTPVHGLQAPSSAAAIAPVLAWMDEE DRKKREQKELIRAVPHVHFRGHEEFQYLDLIADIINNGRTMDDRT Terminology 10

30,000 ft View - Synteny I II III IV Species 1 A B B C D E F Species 2 A B B C D E F Synteny among Plasmodia 11

Homology Early Globin Gene α-chain gene Gene Duplication β-chain gene α frog α chick α mouse β mouse β chick β frog PARALOGS ORTHOLOGS ORTHOLOGS HOMOLOGS Synteny shows relationships in positioning: Ontologies show relationships in meaning The Gene Ontology GO provides terms to link genes with similar functions and/or locations in the cell. An ontology was needed because the cultural traditions in different organisms led to different gene naming schemes that made it difficult to identify orthologous genes with the same function. 12

For Example: D. melanogaster gene CG3340 annotated as: Kruppel and P. falciparum gene PF3D7_1209300 annotated a putative KROX1 Can both be annotated with GO term: GO:0003705 (RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity) Both proteins, functionally, are Zinc Fingers despite their different names Note that the Gene Ontologies themselves contain only information about terms in the ontology and their relationships to other terms 13

Expression Profiles (RNA and Protein) The pattern of expression of one or more genes over time or a set of experimental conditions, e.g. during development or a drug treatment or in a genetic mutant such as a gene knock-out. Always has a time and location component RNA expression RNA-Seq (NGS) Little sequence bias Quantitative Usually are strandspecific Can be used to identify UTR s and exon splice junctions Expressed Sequence Tags, ESTs Usually represent partial cdna Often clustered Come from libraries that may, or may not be normalized Often used to identify genes in genomes and locations of introns SAGE tags Serial Analysis of Gene Expression 14

30,000 ft View RNA-Seq Annotation of genome features I II III IV RNA-Seq reads A B B C D E F FPKM = Fragments per kilobase of exon per million fragments mapped Genes can be located on either DNA strand Convention - Gene location = non-template strand, i.e. same as the mrna 15

Complex patterns of eukaryotic mrna splicing: What is a Gene? Figure 8-14 RNA-seq identifies splice junctions if present (remember context dependent) https://bioconductor.org/packages/devel/workflows/vignettes/recountworkflow/inst/doc/recount-workflow.html 16

RNA-seq data are very powerful http://bioinfo.vanderbilt.edu/vangard/services-rnaseq.html 17

Protein Expression/Sequence Data MW-Isoelectric point MW Sequence/spans Technology 2D gel electrophoresis Mass spectrometry Tandem MS (MS-MS, LC MS-MS etc) Typical 2 D gel How Tandem MS Works Liquid chromatography Collision Inducted Dissociation (CID) Complex mixture Protein + + + + + + + + + + + + + + Ionized Peptides Isolation Fragmentation Measurement Peptides 18

Tandem MS protein data Sequest Database Search Mass Spectrometer Protein Database Nucleic Acid Database EST Database Tandem Mass Spectrum Theoretical Mass Spectrum Correlation Analysis Ranked Score of Matched Peptides 19

Peptide database ENNPCKLQYDYNTNVTHGFGQEYPCETDIVERFSDTEGAQCDKKKIKDNSEGACAPYRRL HVCVRNLENINDYSKINNKHNLLVEVCLAAKYEGESITGRYPQHQETNPDTKSQLCTVLA RSFADIGDIIRGKDLYRGGNTKEKKKRKKLEENLKTIFGHIYDELKNGKTNGEEELQKRY RGDKDNDFYQLREDWWDANRETVWKAITCNAGSYQYSQPTCGRGEIPYVTLSKCQCIAGE VPTYFDYVPQYLRWFEEWAEDFCRKKKKKIPNVKTNCRQVQRGKEKYCDRDGYNCDGTIR KQYIYRLDTDCTKCSLACKTFAEWIDNQKEQFDKQKQKYQNEISGGGGRRQKRSTHSTKE YEGYEKHFNEELRNEGKDVRSFLQLLSKEKICKERIQVGEETANYGNFENESNTFSHTEY CDRCPLCGVDCSSDNCRKKPDKSCDEQITDKEYPPENTTKIPKLTAEKRKTGILKKYEKF CKNSDGNNGGQIKKWECHYEKNDKDDGNGDINNCIQGDWKTSKNVYYPISYYSFFYGSII DMLNESIEWRERLKSCINDAKLGKCRKGCKNPCECYKRWVEKKKDEWDKIKEFFRKQKDL LKDIAGMDAGELLEFYLENIFLEDMKNANGDPKVIEKFKEILGKENEEVQDPLKTKKTID DFLEKELNEAKNCVEKNPDNECPKQKAPGDGAAPSDPPREDITHHDGEHSSDEDEEEEEE EEQQPPAEGTEQGEEKSESKEVVEQQETPQKDTEKTVPTTTPTVDVCDTVKTALADTGSL NAACSLKYVTGKNYGWRCIAPSGTTSGKDGAICVPPRTQELCLYYLKELSDTTQKGLREA FIKTAAQETYLLWQKYKEDKQNETASTELDIDDPQTQLNGGEIPEDFKRQMFYTFGDYRD LFLGRYIGNDLDKVNNNITAVFQNGDHIPNGQKTDRQRQEFWGTYGKDIWKGMLCALQEA GGKKTLTETYNYSNVTFNGHLTGTKLNEFASRPSFLRWMTEWGDQFCRERITQLQILKER CMVYQYNGDKGKDDKKEKCTEACTYYKEWLTNWQDNYKKQNQRYTEVKGTSPYKEDSDVK ESKYAHGYLRKILKNIICTSGTDIAYCNCMEGTSTTDSSNNDNIPESLKYPPIEIEEGCT CKDPSPGEVIPEKKVPEPKVLPKPPKLPKRQPKERDFPTPALKNAMLSSTIMWSIGIGFA TFTYFYLKKKTKSTIDLLRVINIPKSDYDIPTKLSPNRYIPYTSGKYRGKRYIYLEGDSG TDSGYTDHYSDITSSSESEYEELDINDIYAPRAPKYKTLIEVVLEPSGNNTTASGNNTPS DTQNDIQNDGIPSSKITDNEWNTLKDEFISQYLQSEQPNDVPNDYSSGDIPLNTQPNTLY FDNPDEKPFITSIHDRDLYSGEEYSYNVNMVNTNNDIPISGKNGTYSGIDLINDSLNSNN Note: ORFs in addition to predicted Genes must be searched 30,000 ft View - Proteomics I II III IV A B B C D E F Proteomic Sequence spans 20

Mass Spectrometry can be used to measure metabolic and other chemical compounds Complex mixtures can be analyzed and interpreted 21

Metabolites can be linked to metabolic pathways and enzymes 22

Alleles and Phenotype Some phenotypes are caused by a single locus in the genome and a single allele at that locus (e.g. some flower colors, or Drosophila eye color) Other phenotypes (Type-I diabetes, heart disease are multi-locus or complex (i.e. many genes are involved, each potentially with many alleles) Homologous chromsomes (in a diploid) A a AAGCCTCATC ACGCCTCATC SNP =Single Nucleotide Polymorphism 23

30,000 ft View- NGS SNPs NGS sequence reads from different clinical/environ mental isolates I II III IV = SNP Population data Data Single Nucleotide Polymorphisms, SNPs Alleles Allele frequency Haplotypes Technology Chip-Seq NGS 24

Alleles have frequencies in different populations Populations and alleles can have geographic boundaries A parasite isolate comes from a particular population, a particular location and will have a specific haplotype (e.g. representation of alleles) often characterized via SNPs 25

Bioinformatics uses algorithms Algorithms are sets of rules for solving problems or identifying patterns Algorithms can be general or case specific and often need to be trained Computational analysis, like wet-bench analyses are only as good as the tools, techniques and material allow, and all interpretations come with caveats (like the experimental conditions, often call parameters in bioinformatics. How to find an intron Usually begins with GT and end with AG Must be longer than 19 nucleotides Must contain a branchpoint A Donor GT often followed by a sequence pattern. This pattern is species-specific Acceptor AG often proceeded by pyrimidine stretch Has a mean length of X as is observed in this species 26

Toxoplasma gondii Intron Patterns Different prediction methods often generate different results Prediction 1 Prediction 2 We provide lots evidence so that you can decide or design an experiment to confirm! 27

Metadata The next Frontier Data about the data are critical What makes a data set valuable? (The reason it was generated but often this is missing) Introducing the data set How can you find the data set you need? Pull down Menu? A search of data set properties? Data generator Clinical outcome Geographic location Phenotype Data sharing standards 28

Global All-Staff Retreat, 4-6 June 2018 (Athens GA) The End If you have questions, I and the other instructors will be around and we are happy to talk to you. These slides are available to you as a PDF on the workshop web site. 29

Overview of transcription: Either strand can serve as a template for a gene Figure 8-4 https://en.wikipedia.org/wiki/rna-seq 30

https://www.researchgate.net/publication/279179474_annual_report_on_zoonoses_in_denmark_2013/figures?lo=1 31

https://slideplayer.com/slide/12967081/ https://www.aphl.org/conferences/proceedings/documents/2018/4_eija%20trees.pdf 32