GBIO0002-1 Genetics and Bioinformatics
Introduction to DB and R

Instructors
Practical sessions: Kyrylo Bessonov (Kirill)
Office: B37 1/16
[email protected]
Office hours: by appointment

Overview
1. Intro to basic R
2. Introduction to public databases
3. The submission system

Bioinformatics
Definition 1: the collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied to molecular genetics and genomics (Merriam-Webster dictionary)
Definition 2: a field that works on problems at the intersection of biology, computer science, and statistics

Practical classes. Why?
- During these classes we will look at practical aspects of the topics introduced in the theory classes.
- You are encouraged to run the sample R scripts and demonstrations on your own PC.
- Homework assignments are of 2 types (graded).
- Homework assignments result in a group report and can be handed in electronically in French or English.

Introduction to R
A basic tutorial

Statistical languages GUIs
SAS, SPSS
R GUI
Less fancy and no frills, but free!

Definition
- R is a free software environment for statistical computing and graphics (1).
- R is one of the most widely used languages among statisticians, data miners, bioinformaticians, and others.
- R is a free implementation of the S language.
- Commercial alternatives include SPSS, SAS, and MATLAB.
(1) R Core Team, R: A Language and Environment for Statistical Computing, Vienna, Austria (http://www.R-project.org/)

Why learn R?
- It is free and open source, and widely used by bioinformaticians and statisticians.
- It is multiplatform.
- It has a very wide selection of additional libraries that allow it to be used in many domains, including bioinformatics.
- Main library repositories: CRAN and BioConductor.

Programming? Should I be scared?
- R is a scripting language and, as such, is much easier to learn than compiled languages such as C.
- R has reasonably well-written documentation (vignettes).
- R syntax is simple and intuitive for anyone with basic statistics skills.
- R scripts will be provided and explained in class.

Topics covered in this tutorial
- Operators / Variables
- Main object types
- Plotting and plot modification functions
- Writing and reading data to/from files

Variables/Operators
- A variable can store a single element:
> x <- 25
Here the variable x is assigned the value 25.
- Check the value assigned to the variable x:
> x
[1] 25
- Basic mathematical operators that can be applied to variables: (+), (-), (/), (*)
- Use parentheses to obtain the desired order of mathematical operations.

Arithmetic operators
What is the value of lowercase z here? (R is case sensitive, so z and Z are different variables.)
> x <- 25
> y <- 15
> z <- (x + y)*2
> Z <- z*z
> z
[1] 80

Vectors
- Vectors have only 1 dimension and represent an ordered sequence of data.
- They can also store variables.
> v1 <- c(1, 2, 3, 4, 5)
> mean(v1)
[1] 3
- The elements of a vector are selected or modified with square brackets (e.g. [number]):
> v1[1] <- 48
> v1
[1] 48  2  3  4  5

Logical operators
- These operators mostly work on vectors, matrices, and other data types.
- The type of data is not important: the same operators are used for numeric and character data.

Operator   Description
<          less than
<=         less than or equal to
>          greater than
>=         greater than or equal to
==         exactly equal to
!=         not equal to
!x         not x
x | y      x OR y
x & y      x AND y

Logical operators
They can be applied to vectors element by element; each returned value is either TRUE or FALSE:
> v1
[1] 48  2  3  4  5
> v1 <= 3
[1] FALSE  TRUE  TRUE FALSE FALSE

R workspace
Display all workspace objects (variables, vectors, etc.) via ls():
> ls()
[1] "Z"  "v1" "x"  "y"  "z"
Useful tip: to save the workspace to a file and restore it later, use:
> save.image(file = "workplace.rda")
> load(file = "workplace.rda")

How to find help info?
- Every function in R has help information.
- To invoke help, use the ? sign or help():
?function_name
?mean
help(mean, try.all.packages = TRUE)
- To search all packages in your R installation (not only the loaded ones), use try.all.packages = TRUE in help().
- To search for a keyword in the R documentation, use help.search():
help.search("mean")

Basic data types
Data can be of 3 basic types:
- numeric
- character
- logical
Numeric variable type:
> x <- 1
> mode(x)
[1] "numeric"

Basic data types
Logical variable type (TRUE/FALSE):
> y <- 3 < 4
> mode(y)
[1] "logical"
Character variable type:
> z <- "Hello class"
> mode(z)
[1] "character"

Data structures
The main data objects in R are:
- Matrices (single data type)
- Data frames (support various data types)
- Lists (contain a set of vectors)
- Other more complex objects

Matrices are 2D objects (rows/columns):
> m <- matrix(0, 2, 3)
> m
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

Lists
Lists contain various vectors. Each vector in the list can be accessed with double square brackets [[number]]:
> x <- c(1, 2, 3, 4)
> y <- c(2, 3, 4)
> L1 <- list(x, y)
> L1
[[1]]
[1] 1 2 3 4

[[2]]
[1] 2 3 4
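The difference between double and single square brackets on a list can be sketched as follows. This is a minimal illustration only; the names `first` and `sub` are not from the slides.

```r
# Build a list from two vectors of different lengths
x <- c(1, 2, 3, 4)
y <- c(2, 3, 4)
L1 <- list(x, y)

# [[ ]] returns the stored vector itself
first <- L1[[1]]

# [ ] returns a sub-list (still a list) that holds that vector
sub <- L1[1]

# An element of a stored vector can be modified in place
L1[[2]][1] <- 20

print(first)
print(class(sub))
print(L1[[2]])
```

Using `[[ ]]` is usually what is wanted when the vector is passed on to a function such as `mean()`.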
Data Frames
Data frames are similar to matrices but can contain various data types:
> x <- c(1, 5, 10)
> y <- c("A", "B", "C")
> z <- data.frame(x, y)
> z
   x y
1  1 A
2  5 B
3 10 C
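A short sketch of how the columns of such a data frame might be inspected and renamed. The names `col_types`, `x_mean`, `cell`, and the new column labels are illustrative assumptions; `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
x <- c(1, 5, 10)
y <- c("A", "B", "C")
z <- data.frame(x, y, stringsAsFactors = FALSE)

# Each column keeps its own type
col_types <- sapply(z, class)

# Access a whole column by name, and a single cell by row and column
x_mean <- mean(z$x)
cell <- z[2, "y"]

# Rename the columns
colnames(z) <- c("value", "label")

print(col_types)
print(cell)
str(z)
```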
To get or change column and row names, use colnames() and rownames().

Input/Output
- To read data into R from a text file, use read.table(); read help(read.table) to learn more. scan() is a more flexible alternative.
> raw_data <- read.table(file = "data_file.txt")
- To write data from R to a text file, use write.table():
> write.table(mydata, "data_file.txt")

Conversion between data types
One can convert (coerce) data of one type into another using as.xxx(), where xxx is a data type (for example as.numeric() or as.character()).

Plots generation in R
- R provides a very rich set of plotting possibilities.
- The basic command is plot().
- Many libraries provide their own version of the plot() function.
- When R plots graphics it opens a graphical device, which can be either a window or a file.

Plotting functions
R offers the following plotting functions:

Function      Description
plot(x)       plots the values of x on the y axis (against their index on the x axis)
plot(x,y)     bivariate plot of x and y values (both axes scaled to the values of x and y)
pie(y)        circular pie chart
boxplot(x)    box plot showing variables via their quantiles
hist(x)       histogram (bar plot)

Plot modification functions
- Often the default R plots are not optimal, and one would like to add colors, move the legend, or make other modifications.
- R has an array of graphical parameters that are a bit complex to learn at first glance. The full list is documented in help(par).
- Some graphical parameters can be specified inside plot() or in other graphical functions such as lines().

Plot modification functions

Function                                   Description
points(x,y)                                adds points to the plot at the coordinates given in the x and y vectors
lines(x,y)                                 adds a line through the coordinates given in x and y
mtext(text,side=3)                         adds text to the margin specified by the side number
hist(x)                                    a histogram that bins the values of x into categories represented as bars
arrows(x0,y0,x1,y1, angle=30, code=1)      adds an arrow from (x0, y0) to (x1, y1); angle sets the angle between the shaft and the edges of the arrowhead, and code specifies at which end the arrowhead is drawn
abline(h=y)                                draws a horizontal line at the y coordinate
rect(x1, y1, x2, y2)                       draws a rectangle with corners at (x1, y1) and (x2, y2)
legend(x,y)                                draws the legend of the plot at the position specified by x and y
title()                                    adds a title to the plot
axis(side, vect)                           adds an axis on the chosen one of the 4 sides; vect specifies where the tick marks are drawn
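The functions in the two tables above can be combined as in the following sketch. The data, colors, and coordinates are arbitrary assumptions chosen for illustration, and the plot is written to a temporary PDF file (a file-based graphical device) so that the example also runs without a display.

```r
# Open a file-based graphical device
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

x <- 1:10
y <- x^2

plot(x, y, col = "blue", pch = 19)            # bivariate scatter plot
lines(x, y, col = "grey")                     # line through the same points
abline(h = 50, lty = 2)                       # horizontal line at y = 50
arrows(2, 80, 7, 52, angle = 30, code = 2)    # arrowhead drawn at (7, 52)
mtext("margin text", side = 4)                # text in the right margin
title(main = "y = x^2")                       # plot title
legend(1, 100, legend = "y = x^2", col = "blue", pch = 19)

# Close the device so that the file is written
dev.off()
```

Replacing `pdf(out_file)` and `dev.off()` with nothing draws the same plot in an on-screen window in an interactive session.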
Installation of new libraries
There are two main R repositories:
- CRAN
- BioConductor
To install a package/library from CRAN:
install.packages("seqinr")
To install packages from BioConductor:
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
(Bioconductor 3.8 and later replace biocLite() with BiocManager::install().)
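Before installing, it can be useful to check which packages are already available. The sketch below assumes a hypothetical helper `is_installed()` (not part of R or of the slides) built on the base function `requireNamespace()`; the actual installation call is left commented out because it needs network access.

```r
# Hypothetical helper: TRUE if the package can be loaded, without attaching it
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R; "seqinr" may or may not be installed
pkgs <- c("stats", "seqinr")
status <- vapply(pkgs, is_installed, logical(1))
print(status)

# Install only the CRAN packages that are missing:
# install.packages(names(status)[!status])
```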
Installation of new libraries
- Download and install the latest R version on your PC from http://cran.r-project.org/
- Install the following libraries by running:
install.packages(c("seqinr", "ape", "GenABEL"))
source("http://bioconductor.org/biocLite.R")
biocLite(c("limma", "affy", "hgu133plus2.db", "Biostrings", "muscle"))

What are we looking for?
Data & databases

Biologists Collect Lots of Data
- Hundreds of thousands of species to explore
- Millions of articles in scientific journals
- Detailed genetic information:
  - gene names
  - phenotype of mutants
  - location of genes/mutations on chromosomes
  - linkage (distances between genes)

High Throughput lab technologies
- PCR
- Rapid inexpensive DNA sequencing (Illumina HiSeq)
- Microarrays (Affymetrix)
- Genome-wide SNP chips / SNP arrays (Illumina)
Data must be stored such that:
- Minimum data quality is checked
- It is well annotated according to standards
- It is made available to the wide public to foster research

What is a database?
- Organized collection of data
- Information is stored in "records", "fields", and tables
- Fields are categories; each must contain data of the same type (e.g. columns below)
- Records contain data related to one object (e.g. a protein or SNP) (e.g. rows below)

SNP ID       SNPSeqID              Gene
M-05554_1    12.MMHAP31FLD3.seq    procollagen, type III, alpha

Sequence fields of the example table:
AAGGTAAAAGGCAA TCAGCACAGCC
TCAACCTGGAGTCAGA GGCT
TGCGCAGAAGCTGA AGTCTA
TTTTGAGGTGTTAATG GTTCT

Genome sequencing generates lots of data

Biological Databases
The number of databases is constantly growing!
- OBRC: Online Bioinformatics Resources Collection lists over 2826 databases (as of 2013)

Main databases by category
Literature
- PubMed: scientific & medical abstracts/citations
Health
- OMIM: Online Mendelian Inheritance in Man
Nucleotide Sequences
- Nucleotide: DNA and RNA sequences
Genomes
- Genome: genome sequencing projects by organism
- dbSNP: short genetic variations
Genes
- Protein: protein sequences
- UniProt: protein sequences and related information
Chemicals
- PubChem Compound: chemical information with structures, information and links
Pathways
- BioSystems: molecular pathways with links to genes, proteins
- KEGG Pathway: information on main biological pathways

Growth of UniProtKB database
- UniProtKB contains mainly protein sequences (entries).
- The database growth is exponential.
- Data management issues? (e.g. storage, search, indexing)
[Figure: growth in the number of entries]
Source: http://www.ebi.ac.uk/uniprot/TrEMBLstats

Primary and Secondary Databases
Primary databases: REAL EXPERIMENTAL DATA (raw)
Biomolecular sequences or structures and associated annotation information (organism, function, mutation linked to disease, functional/structural patterns, bibliographic etc.)
Secondary databases: DERIVED INFORMATION (analyzed and annotated)
Results of analyses of primary data in the primary sources (patterns, blocks, profiles etc. which represent the most conserved features of multiple alignments)

Primary Databases
Sequence Information
- DNA: EMBL, GenBank, DDBJ
- Protein: SwissProt, TrEMBL, PIR, OWL
Genome Information
- GDB, MGD, ACeDB
Structure Information
- PDB, NDB, CCDB/CSD

Secondary Databases
Sequence-related Information
- ProSite, Enzyme, REBase
Genome-related Information
- OMIM, TransFac
Structure-related Information
- DSSP, HSSP, FSSP, PDBFinder
Pathway Information
- KEGG, Pathways

GenBank database
- Contains all DNA and protein sequences described in the scientific literature or collected in publicly funded research
- One can search by protein name to get DNA/mRNA sequences
- Search results can be filtered by species and other parameters
GenBank main fields
NCBI Databases contain more than just DNA & protein sequences
NCBI main portal: http://www.ncbi.nlm.nih.gov/

FASTA format to store sequences
The FASTA format is the de facto standard for databases and software that handle DNA and protein sequences.
Specifications:
- One header line that starts with > and ends with [return]
Example: Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, GenBank: JQ288018.1
>gi|380876362|gb|JQ288018.1| Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, partial cds
TGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAACACCCTGTTCTTTTGACTGAAGCTCCA
ATGAACCCTAAATCAAACAGAGAAAAGATGACTCAAATTATGTTTGAAACTTTCAACGTTCCAGCCTTCTAC
GTTTCCATCCAAGCCGTTTTGTCCTTGTACTCTTCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGAT
GGTGTTACTCACGTCGTTCCAATTTACGCTGGTTTCTCTCTACCTCACGCCATTTTGAGAATCGATTTGGCC
GGTAGAGATTTGACTGACTACTTGATGAAGATCTTGAGTGAACGTGGTTACTCTTTCTCCACCACTGCTGAA
AGAGAAATTGTCCGTGACATCAAGGAAAAACTATGTTACGTCGCCTTGGACTTCGAGCAAGAAATGCAAACC
GCTGCTCAATCTTCTTCAATTGAAAAATCCTACGAACTTCCAGATGGTCAAGTCATCACTATTGGTAAC

OMIM database
- Online Mendelian Inheritance in Man (OMIM)
- Information on all known Mendelian disorders linked to over 12,000 genes
- Started in the 1960s by Dr. Victor A. McKusick as a catalog of Mendelian traits and disorders
- Linked disease data
- Links disease phenotypes and causative genes
- Used by physicians and geneticists

OMIM basic search
- Online Tutorial: http://www.openhelix.com/OMIM
- Each search result entry has a *, +, # or % symbol
- # entries are the most informative, as the molecular basis of the phenotype-genotype association is known
- We will search for: Ankylosing spondylitis (AS)
- AS is characterized by chronic inflammation of the spine

OMIM-search results
- Look for the entries that link to genes. Apply filters if needed
- Filter results if a known SNP is associated with the entry
- Some of the interesting entries. Try to look for the ones with the # sign

OMIM-entries
OMIM Gene ID -entries
OMIM-Finding disease linked genes
- Read the report of the top gene linked to the phenotype
- Mapping / Linkage heterogeneity section
- Go back to the original results
- Previously seen entry *607562 IL23R

PubMed database
- PubMed is one of the best-known databases in the whole scientific community
- Most biology-related literature from all the related fields is indexed by this database
- It has a very powerful mechanism for constructing search queries:
  - Many search fields
  - Logical operators (AND, OR)
- Provides electronic links to most journals
- Example: searching by author for articles published in 2012-2013

References
- Durinck, Steffen, et al. "BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis." Bioinformatics 21.16 (2005): 3439-3440.
- Hamosh, Ada, et al. "Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders." Nucleic Acids Research 30.1 (2002): 52-55.
- Ihaka, Ross, and Robert Gentleman. "R: a language for data analysis and graphics." Journal of Computational and Graphical Statistics 5.3 (1996): 299-314.

Assignment Submission
Step by Step Guide

Assignment submission
- All assignments should be zipped into one file (*.zip) and submitted online

Create a submission account

Account creation
- Any member of the group can submit the assignment
- Account details will be emailed to you automatically
- All GBIO009-1 students should create an account

Submit your assignment
- After creating an account, log in to the submission page
- The remaining time to the deadline is displayed. It is a good idea to check it from time to time to stay on top of things
- The file extension should be zip
- You can submit the assignment as many times as you wish

See you next class: Tue, Sept 22nd!
Next class: form groups of 2-3 persons to work on HW