Bioinformatics: Blending Biology and Computer Science

MDNMSITNTPTSNDACLSIVHSLMCHRQ
GGESETFAKRAIESLVKKLKEKKDELDSL
ITAITTNGAHPSKCVTIQRTLDGRLQVAG
RKGFPHVIYARLWRWPDLHKNELKHVK
YCQYAFDLKCDSVCVNPYHYERVVSPGI
DLSGLTLQSNAPSSMMVKDEYVHDFEG
QPSLSTEGHSIQTIQHPPSNRASTETYST
PALLAPSESNATSTANFPNIPVASTSQPA
SILGGSHSEGLLQIASGPQPGQQQNGFT
GQPATYHHNSTTTWTGSRTAPYTPNLP
HHQNGHLQHHPPMPPHPGHYWPVHNE
LAFQPPISNHPAPEYWCSIAYFEMDVQV
GETFKVPSSCPIVTVDGYVDPSGGDRFC
LGQLSNVHRTEAIERAR
Techniques
- BLAST Database Searches
- Entrez Database Data Mining
- Multiple Sequence Alignments
- Motif Searches
- 3D Structure Analysis
The Challenge
Your colleagues have given you a DNA sequence of unknown origin.
What is it? What does it do? BLAST it!

Mystery DNA Sequence
   1 actctgctgg tggcctcgcg taccactgtg gccaagcggt agctggaacg tgcagccgac
  61 caccatgggg agtagcaaga gcaagcctaa ggaccccagc cagcgccggc gcagcctgga
 121 gccacccgac agcacccacc acgggggatt cccagcctcg cagaccccca acaagacagc
 181 agcccccgac acgcaccgca cccccagccg ctccttcggg accgtggcca ccgagcccaa
 241 gctcttcgag gacttcaaca cttctgacac cgttacgtcg ccgcagcgtg ccggggcact
 301 ggctggcggc gtcaccactt tcgtggctct ctacgactac gagtcctgga ttgaaacgga
 361 cttgtccttc aagaaaggag aacgcctgca gattgtcaac aacacggaag gtaactggtg
 421 gctggctcat tccgtgacta caggacagac gggctacatc cccagtaact atgtcgcgcc
 481 ctcagactcc atccaggctg aagagtggta ctttgggaag atcactcgtc gggagtccga
 541 gcggctgctg ctcaaccccg aaaacccccg gggaaccttc ttggtccggg agagcgagac
 601 gacaaaaggt gcctattgcc tctccgtttc tgactttgac aacgccaagg ggctcaatgt
 661 gaagcactac aagatccgca agctggacag cggcggcttc tacatcacct cacgcacaca
 721 gttcagcagc ctgcagcagc tggtggccta ctactccaaa catgctgatg gcttgtgcca
 781 ccgcctgacc aacgtctgcc ccacgtccaa gccccagacc cagggactcg ccaaggacgc
 841 gtgggaaatc ccccgggagt cgctgcggct ggaggtgaag ctggggcagg gctgctttgg
 901 agaggtctgg atggggacct ggaacggcac caccagagtg gccataaaga ctctgaagcc
 961 cggcaccatg tccccggagg ccttcctgca ggaagcccaa gtgatgaaga agctccagca
1021 tgagaagctg gttcaactgt acgcagtcgt gtcggaagag cccatctaca tcgtcattga
1081 gtacatgagc aaggggagcc tcctggattt cctgaaggga gagatgggca agtacctgcg
1141 gctgccacag ctcgttgata tggctgatca gattgcatcc ggcatggcct atgtggagag
1201 gatgaactac gtgcaccgag acctgcgggc ggccaacatc ctggtggggg agaacctggt
1261 gtgcaaggtg gctgactttg ggctggcacg cctcatcgag gacaacgagt acacagcacg
1321 gcaaggtgcc aagttcccca tcaagtggac agcccccgag gcagccctct atggccggtt
1381 caccatcaag tcggatgtct ggtccttcgg catcctgctg actgagctga ccaccaaggg
1441 ccggatgcca tacccaggga tgggcaacgg ggaggtgctg gaccgggtgg agaggggcta
1501 ccgcatgccc tgcccgcccg agtgccccga gtcgctgcat gaccttatgt gccagtgctg
1561 gcggagggac cctgaggagc ggcccacttt tgagtacctg caggcccagc tgctccctgc
1621 ttgtgtgttg gaggtcgctg agtagtgcgc gagcaaaatt taagctacaa caaggcaagg
1681 cttggccgac aattgcatga agaatctgct tagggttagg cgttttgcgc tgcttcgcga
1741 tgtacgggcc agatatacgc gtatctgagg ggactagggt gtgtttaggc gaaaagcggg
1801 g
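One way to get a first look at what the mystery DNA might encode is to translate it from its first ATG. The sketch below is only an illustration: it assumes the standard genetic code, reads a single forward frame, and uses bases 61-120 of the mystery sequence as input. The function name `translate_from_first_atg` is hypothetical, not part of any tool named in these slides.

```python
# Minimal sketch: translate DNA starting at the first ATG (standard genetic code).
# Assumption: forward strand only, one reading frame, no handling of ambiguous bases.

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the codon table: codons enumerated in TCAG order map onto AMINO_ACIDS.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_from_first_atg(dna: str) -> str:
    """Translate from the first ATG until a stop codon or the last full codon."""
    seq = dna.upper().replace(" ", "")
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(seq) - 2, 3):
        residue = CODON_TABLE[seq[pos:pos + 3]]
        if residue == "*":  # stop codon ends the translation
            break
        protein.append(residue)
    return "".join(protein)


# Bases 61-120 of the mystery sequence, copied from the slide.
chunk = "caccatgggg agtagcaaga gcaagcctaa ggaccccagc cagcgccggc gcagcctgga"
print(translate_from_first_atg(chunk))
```

The translated residues can be compared by eye with the start of the protein sequences in the alignment slide.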
ENTREZ/BLAST
- ENTREZ (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi): an all-purpose data-mining tool for biomedical research.
- BLAST (http://www.ncbi.nlm.nih.gov/blast): Basic Local Alignment Search Tool, used to explore sequence databases.
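The "local alignment" in BLAST's name can be illustrated with a small scoring routine. The sketch below is not BLAST itself: BLAST uses a faster seed-and-extend heuristic, while this is the exhaustive Smith-Waterman dynamic program that defines what a best local alignment score is. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration.

```python
# Minimal sketch of Smith-Waterman local alignment scoring (score only, no traceback).
# Assumption: simple match/mismatch scores and a linear gap penalty.

def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]  # first column is always zero
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

Only the shared stretch contributes to the score; the unrelated flanking bases are ignored, which is what makes the alignment "local".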
BLAST Results

Identification Number  DB   Accession    Locus     Name                                Score  E value
gi 210270              gb   M33292.1     ALRVSRC   Rous sarcoma virus (Schmidt-R...     3570  0.0
gi 61498               emb  V01169.1     REASV5    Avian sarcoma virus src gene a...    3459  0.0
gi 212700              gb   J00844.1     CHKSRC    Chicken c-src gene, complete c...    2977  0.0
gi 210264              gb   M21526.1     ALRSRCAC  Rous sarcoma virus defective...      2970  0.0
gi 61706               emb  X15345.1     RERSVH19  Hamster H-19 proviral DNA (L...      2902  0.0
gi 61896               emb  X51861.1     RSVPRSRC  Duck adapted Rous sarcoma vi...      2839  0.0
gi 4885608             ref  NM_005417.1            Homo sapiens v-src sarcoma (Sch...   1160  0.0
gi 15321730            gb   M24704.2     XELSRCA   Xenopus laevis pp60c-src pr...        375  e-100
gi 338458              gb   K03218.1     HUMSRC11  Human c-src-1 proto-oncogene...       174  2e-4

Collect the corresponding protein sequences: use the identification number (210270) to search ENTREZ (http://www.ncbi.nlm.nih.gov/entrez/).
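The Entrez lookup can also be scripted. The sketch below only builds a request URL for NCBI's E-utilities `efetch` service and does not contact the network. It assumes the top hit is requested by its accession (M33292.1) from the `nuccore` database in FASTA format; the helper name `efetch_url` is hypothetical.

```python
# Minimal sketch: build an NCBI E-utilities efetch URL for a BLAST hit.
# Assumptions: record requested by accession, nucleotide database, FASTA text output.
from urllib.parse import urlencode

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, db: str = "nuccore") -> str:
    """Return an efetch URL that would retrieve the record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_BASE + "?" + urlencode(params)


# Top hit from the BLAST results: Rous sarcoma virus, accession M33292.1.
print(efetch_url("M33292.1"))
```

Passing the resulting URL to `urllib.request.urlopen` would fetch the record; that step is left out here so the example runs offline.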
Multiple Sequence Alignments
Compare a series of proteins to determine how similar they are to each other.
http://pir.georgetown.edu/pirwww/search/multaln.html
How's that for similarity?
Why are viral proteins similar to human and chicken proteins? ONCOGENES

v-src       MGSSKSKPKDPSQRRRSLEPPDSTHHGG---FPASQTPNKTAAPDTHRTPSRSFGTVATE
avian_src   MGSSKSKPKDPSQRRRSLEPPDSTHHGG---FPASQTPNKTAAPDTHRTPSRSFGTVATE
chicken_src MGSSKSKPKDPSQRRRSLEPPDSTHHGG---FPASQTPNKTAAPDTHRTPSRSFGTVATE
human_c-src MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE
            ***.******.*********.:..*.* *******.*.*:.* ** ** :*...*:*

v-src       PKLFEDFNTSDTVTSPQRAGALAGGVTTFVALYDYESWIETDLSFKKGERLQIVNNTEGN
avian_src   PKLFGGFNTSDTVTSPQRAGALAGGVTTFVALYDYESWIETDLSFKKGERLQIVNNTEGN
chicken_src PKLFGGFNTSDTVTSPQRAGALAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGD
human_c-src PKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGD
            ****.**:***********.**************** ********************:

v-src       WWLAHSVTTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNPENPRGTFLVRES
avian_src   WWLAHSLTTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNPENPRGTFLVRES
chicken_src WWLAHSLTTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNPENPRGTFLVRES
human_c-src WWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRES
            ******::***************************************.************

v-src       ETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFSSLQQLVAYYSKHADGL
avian_src   ETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFSSLQQLVAYYSKHADGL
chicken_src ETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFSSLQQLVAYYSKHADGL
human_c-src ETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGL
            *******************************************.****************

v-src       CHRLTNVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTL
avian_src   CHRLTNVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTL
chicken_src CHRLTNVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTL
human_c-src CHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTL
            *****.******************************************************

v-src       KPGTMSPEAFLQEAQVMKKLQHEKLVQLYAVVSEEPIYIVIEYMSKGSLLDFLKGEMGKY
avian_src   KPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVIEYMSKGSLLDFLKGEMGKY
chicken_src KPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGEMGKY
human_c-src KPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKY
            ********************:******************* *************** ***

v-src       LRLPQLVDMADQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYT
avian_src   LRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYT
chicken_src LRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYT
human_c-src LRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYT
            ********** *************************************************

v-src       ARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRMPYPGMGNGEVLDRVER
avian_src   ARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMGNGEVLDRVER
chicken_src ARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVER
human_c-src ARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVER
            *******************************************:***** * ****:***

v-src       GYRMPCPPECPESLHDLMCQCWRRDPEERPTFEYLQAQLLPACVLEVAE-------
avian_src   GYRMPCPPECPESLHDLMCQCWRRDPEERPTFEYLQAQLLPACVLEVAE-------
chicken_src GYRMPCPPECPESLHDLMCQCWRRDPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
human_c-src GYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
            ***********************::************ *..:

97% identical
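A figure such as "97% identical" comes from counting matching columns in an alignment. The sketch below shows one possible way to compute pairwise percent identity between two already-aligned rows. It assumes that columns containing a gap in either row are skipped, which is only one of several conventions, and it uses a single 60-residue block (v-src versus human c-src) rather than the full alignment, so its result is not expected to reproduce the slide's overall figure.

```python
# Minimal sketch: pairwise percent identity between two aligned sequences.
# Assumption: columns with a gap ("-") in either sequence are excluded from the count.

def percent_identity(a: str, b: str) -> float:
    """Return percent identity over the gap-free columns of two aligned rows."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# One block of the alignment, copied from the slide.
V_SRC = "ETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFSSLQQLVAYYSKHADGL"
HUMAN = "ETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGL"
print(f"{percent_identity(V_SRC, HUMAN):.2f}% identical in this block")
```

Running the same function over every pair of full-length rows would give a complete identity matrix for the four proteins.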
Motifs
Regions of conserved sequence and function.
http://pfam.wustl.edu/hmmsearch.shtml
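A very simple motif search can be sketched with a regular expression. This is not how the Pfam search linked above works (Pfam scores sequences against profile hidden Markov models); it only illustrates the idea of scanning for a conserved pattern. The pattern used is the glycine-rich loop G-x-G-x-x-G found in many protein kinases, and the function name `find_motif` is hypothetical.

```python
# Minimal sketch: scan a protein sequence for a short conserved pattern.
# Assumption: the motif is expressed as a regular expression; "." matches any residue.
import re

# Glycine-rich loop (G-x-G-x-x-G); the lookahead allows overlapping matches.
GLY_LOOP = re.compile(r"(?=(G.G..G))")


def find_motif(sequence: str, pattern: re.Pattern = GLY_LOOP) -> list:
    """Return (1-based position, matched residues) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence)]


# One 60-residue block of the v-src protein from the alignment slide.
V_SRC_BLOCK = "CHRLTNVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTL"
print(find_motif(V_SRC_BLOCK))
```

Positions are reported relative to the block passed in, not to the full-length protein.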
Protein Structures
Why are protein structures valuable to research?
- Visualize how your protein looks
- Identify common domains
- Locate important amino acid positions
- Predict potential functions
- Model mutations
Summary
- BLAST Database Searches
- Entrez Database Data Mining
- Multiple Sequence Alignments
- Motif Searches
- 3D Structure Analysis

Find It On The Web: http://web.wi.mit.edu/proteins/ai/home.html