The R language specializes in statistical data analysis, and is also quite useful for visualizing large datasets. This third part covers the basics of R as a programming language (data types, if-statements, functions, loops and when to use them) as well as techniques for large-scale, multi-test analyses. Other topics include S3 classes and data visualization with ggplot2.
Thumbnail: Logo for R. (CC BY-SA 4.0; Hadley Wickham and others at RStudio via https://www.r-project.org/logo/RStudio)
Ten simple rules for biologists learning to program
Copyright: © 2018 Carey, Papin. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Jason A. Papin is co-Editor-in-Chief of PLOS Computational Biology.
Matrix multiplication is one of many matrix operations available in R. R has two multiplication operators for matrices. The first is *, the ordinary multiplication sign; applied to two matrices of the same dimensions, it multiplies them element by element.
The second operator is %*%, which performs true matrix multiplication (rows of the first matrix by columns of the second).
Note that the order of the matrices matters in matrix multiplication. The number of columns of the first matrix must equal the number of rows of the second, and the result is the product matrix. If you reverse the order of the two matrices, the result will generally differ from the original product, and it may not be defined at all if the dimensions no longer conform.
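The difference between the two operators, and the effect of swapping the operands, can be sketched with two small matrices. The values below are illustrative only and are not taken from the original article.

```r
# Two small 2 x 2 matrices (illustrative values, filled column-wise)
A <- matrix(1:4, nrow = 2)             # columns are (1, 2) and (3, 4)
B <- matrix(c(0, 1, 1, 0), nrow = 2)   # swaps rows or columns when multiplied

elementwise <- A * B     # multiplies matching entries
product_ab  <- A %*% B   # row-by-column matrix product
product_ba  <- B %*% A   # same matrices, reversed order

print(elementwise)
print(product_ab)
print(product_ba)
identical(product_ab, product_ba)   # FALSE: the order matters
```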
Data Science interview coding questions + solution code
24) What are factor variables in R language?
Factor variables are categorical variables that hold either string or numeric values. Factor variables are used in various types of graphics and particularly for statistical modelling where the correct number of degrees of freedom is assigned to them.
25) What is the memory limit in R?
On 64-bit builds the limit is 8 TB of addressable memory (as documented for Windows), and on 32-bit builds it is roughly 3 GB.
26) What are the data types in R on which binary operators can be applied?
Scalars, matrices and vectors.
27) How do you create log-linear models in R language?
Using the loglm() function from the MASS package.
28) What will be the class of the resulting vector if you concatenate a number and NA?
29) What is meant by K-nearest neighbour?
K-Nearest Neighbour is one of the simplest machine learning classification algorithms that is a subset of supervised learning based on lazy learning. In this algorithm the function is approximated locally and any computations are deferred until classification.
30) What will be the class of the resulting vector if you concatenate a number and a character?
31) Write code to build an R function powered by C?
32) Suppose you want to know all the values in c(1, 3, 5, 7, 10) that are not in c(1, 5, 10, 12, 14). Which built-in function in R can be used to do this? How can the same result be achieved without the built-in function?
Using the built-in function: setdiff(c(1, 3, 5, 7, 10), c(1, 5, 10, 12, 14))
Without the built-in function: c(1, 3, 5, 7, 10)[!c(1, 3, 5, 7, 10) %in% c(1, 5, 10, 12, 14)]
33) How can you debug and test R programming code?
R code can be tested using Hadley Wickham's testthat package, and debugged with base tools such as traceback(), browser() and debug().
34) What will be the class of the resulting vector if you concatenate a number and a logical?
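The three concatenation questions above (a number with NA, with a character, and with a logical) all come down to R's coercion rules for atomic vectors. A minimal sketch that checks each case directly:

```r
# c() coerces all elements to the most general type present
cls_na  <- class(c(1, NA))     # NA is logical by default and is promoted
cls_chr <- class(c(1, "a"))    # the number is converted to a string
cls_lgl <- class(c(1, TRUE))   # TRUE is converted to 1

print(c(cls_na, cls_chr, cls_lgl))   # "numeric" "character" "numeric"
```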
35) Write a function in R language to replace the missing value in a vector with the mean of that vector.
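One possible answer is sketched below. The function name replace_na_with_mean and the sample vector are illustrative choices, not part of the original question.

```r
# Replace every NA in a numeric vector with the mean of the non-missing values
replace_na_with_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

v <- c(2, NA, 4, 6, NA)
filled <- replace_na_with_mean(v)
print(filled)   # 2 4 4 6 4
```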
36) What happens if the application object is not able to handle an event?
The event is dispatched to the delegate for processing.
37) Differentiate between lapply and sapply.
If the programmer wants the output as a vector or matrix, sapply is used, because it simplifies the result where possible; if the programmer wants the output as a list, lapply is used. A third function, vapply, is often preferred over sapply because it lets the programmer specify the expected output type, so an unexpected result raises an error instead of silently changing shape. The drawback of vapply is that it is more verbose.
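The three functions can be compared on the same input. This is a small sketch with made-up data:

```r
x <- list(a = 1:3, b = 4:6)

as_list   <- lapply(x, mean)                           # always a list
as_vector <- sapply(x, mean)                           # simplified to a named numeric vector
checked   <- vapply(x, mean, FUN.VALUE = numeric(1))   # like sapply, but the type is declared

str(as_list)
print(as_vector)
print(checked)
```

If the function passed to vapply returned something other than a single number, the call would stop with an error rather than returning a differently shaped object.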
38) Differentiate between seq (6) and seq_along (6)
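seq(6) produces the sequence from 1 to 6, c(1, 2, 3, 4, 5, 6). seq_along(6) produces a sequence as long as its argument; since 6 is a vector of length one, the result is just 1. seq_along is normally given a longer vector: for example, seq_along(c("a", "b", "c")) returns 1, 2, 3.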
39) How will you read a .csv file in R language?
The read.csv() function is used to read a .csv file in R language. Below is a simple example:
filecontent <- read.csv("sample.csv")
40) How do you write R commands?
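R commands are typed at the console prompt or saved in a script file and run from there. A line that begins with a hash symbol (#) is a comment and is ignored by the interpreter.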
41) How can you verify if a given object "X" is a matrix data object?
If the function call is.matrix(X) returns TRUE, then X is a matrix data object; otherwise it is not.
42) What do you understand by element recycling in R?
If an operation is applied to two vectors of different lengths, the elements of the shorter vector are reused to complete the operation. This is referred to as element recycling.
Example: with vector A <- c(1, 2, 0, 4) and vector B <- c(3, 6), the result of A * B is (3, 12, 0, 24). Here 3 and 6 of vector B are repeated when computing the result.
44) How will you measure the probability of a binary response variable in R language?
Logistic regression can be used for this and the function glm () in R language provides this functionality.
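A minimal sketch of such a model, assuming the built-in mtcars data set as a stand-in: the binary response am (manual versus automatic transmission) is modelled from car weight wt. The choice of variables is illustrative only.

```r
# Logistic regression: binomial family with the default logit link
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of a manual transmission for a car with wt = 3
p <- predict(fit, newdata = data.frame(wt = 3), type = "response")
print(p)
```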
45) What is the use of sample and subset functions in R programming language?
The sample() function can be used to select a random sample of size n from a large dataset.
The subset() function is used to select variables and observations from a given dataset.
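Both functions can be tried on the built-in mtcars data set. The seed, sample size and filter condition below are arbitrary choices for illustration.

```r
set.seed(123)                       # makes the random draw reproducible
rows   <- sample(nrow(mtcars), 5)   # 5 row numbers drawn without replacement
picked <- mtcars[rows, ]            # the sampled observations

# Keep only heavy cars (wt > 5) and only two of the variables
heavy <- subset(mtcars, wt > 5, select = c(mpg, wt))

print(picked)
print(heavy)
```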
46) Given the function fn <- function(a, b, c, d, e) a + b * c - d / e, write the code to call fn on the vector c(1, 2, 3, 4, 5) such that the output is the same as fn(1, 2, 3, 4, 5).
do.call(fn, as.list(c(1, 2, 3, 4, 5)))
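The answer can be checked by defining fn as in the question and comparing the two calls:

```r
fn <- function(a, b, c, d, e) a + b * c - d / e

args_vec <- c(1, 2, 3, 4, 5)
via_do_call <- do.call(fn, as.list(args_vec))   # each element becomes one argument
direct      <- fn(1, 2, 3, 4, 5)

print(via_do_call)   # 1 + 2 * 3 - 4 / 5 = 6.2
```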
47) How can you resample statistical tests in R language?
The coin package in R provides various options for re-randomization and permutation-based statistical tests. When the assumptions of classical tests cannot be met, this package is a good alternative, as it does not assume random sampling from well-defined populations.
48) What is the purpose of the next statement in R language?
If a developer wants to skip the current iteration of a loop without terminating the loop, they can use the next statement. Whenever the R interpreter reaches a next statement, it skips the rest of the loop body and jumps to the next iteration.
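A short sketch of next inside a for loop, collecting only the odd numbers from 1 to 6:

```r
odds <- c()
for (i in 1:6) {
  if (i %% 2 == 0) next   # even number: skip the rest of this iteration
  odds <- c(odds, i)      # reached only for odd i
}
print(odds)   # 1 3 5
```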
49) How will you create scatterplot matrices in R language?
A matrix of scatterplots can be produced using the pairs() function, which takes parameters such as formula, data, subset and labels.
The two key parameters required to build a scatterplot matrix are:
- formula- A formula basically like
50) How will you check if an element 25 is present in a vector?
There are various ways to do this:
- Using the match() function, which returns the position of the first appearance of a particular element (or NA if it is absent).
- Using %in%, which returns a logical value, TRUE or FALSE.
- Using the is.element() function, which also returns TRUE or FALSE depending on whether the element is present in the vector.
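The three approaches listed above can be compared on a small example vector (the values are made up):

```r
v <- c(10, 25, 40, 25)

pos     <- match(25, v)        # position of the first 25
found   <- 25 %in% v           # TRUE if 25 occurs anywhere in v
found2  <- is.element(25, v)   # same test, function form
missing <- match(99, v)        # NA: 99 is not in v

print(c(pos, found, found2, missing))
```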
51) What is the difference between library() and require() functions in R language?
Both functions load a package, and there is little practical difference when the call is not made inside a function. require() is usually used inside functions: it returns FALSE and issues a warning if the package is not found, so the caller can react. library(), on the other hand, stops with an error if the package cannot be loaded.
52) What are the rules to define a variable name in R programming language?
A variable name in R programming language can contain letters and digits along with the dot (.) and underscore (_) characters. Variable names can begin with a letter or a dot. However, if the name begins with a dot, the dot must not be followed by a digit.
53) What do you understand by a workspace in R programming language?
The current R working environment of a user that has user defined objects like lists, vectors, etc. is referred to as Workspace in R language.
54) Which function helps you perform sorting in R language?
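The answer is missing from the text; in base R the usual choices are sort(), which returns the sorted values, and order(), which returns the positions that would sort the vector. A brief sketch with made-up values:

```r
x <- c(3, 1, 2)

ascending  <- sort(x)                      # 1 2 3
descending <- sort(x, decreasing = TRUE)   # 3 2 1
positions  <- order(x)                     # 2 3 1: x[positions] is sorted

print(ascending)
print(descending)
print(positions)
```

order() is the one to use for sorting the rows of a data frame by a column, for example df[order(df$col), ].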
55) How will you list all the data sets available in all R packages?
Using the below line of code-
data(package = .packages(all.available = TRUE))
56) Which function is used to create a histogram visualisation in R programming language?
57) Write the syntax to set the path for current working directory in R environment.
58) How will you drop variables using indices in a data frame?
Let's take a data frame df <- data.frame(v1 = c(1:5), v2 = c(2:6), v3 = c(3:7), v4 = c(4:8)).
Suppose we want to drop variables v2 and v3. They can be dropped using negative indices as follows:
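A sketch of the negative-index approach on the data frame defined above (the name df_small is an illustrative choice):

```r
df <- data.frame(v1 = c(1:5), v2 = c(2:6), v3 = c(3:7), v4 = c(4:8))

# Negative column indices drop the 2nd and 3rd variables (v2 and v3)
df_small <- df[, -c(2, 3)]
print(names(df_small))   # "v1" "v4"
```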
59) What will be the output of runif(7)?
It will generate 7 random numbers between 0 and 1.
60) What is the difference between the rnorm and runif functions?
The rnorm function generates n normally distributed random numbers based on the mean and standard deviation arguments passed to the function.
Syntax of the rnorm function: rnorm(n, mean = 0, sd = 1)
The runif function generates n uniformly distributed random numbers in the interval between the minimum and maximum values passed to the function.
Syntax of the runif function: runif(n, min = 0, max = 1)
61) What will be the output on executing the following R programming code?
62) How will you combine multiple different strings like "Data", "Science", "in", "R", "Programming" into a single string "Data_Science_in_R_Programming"?
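One way to answer this, as a sketch, is paste() with an underscore separator; when the words are already in a vector, the collapse argument does the same job.

```r
# Separate arguments joined with sep
joined <- paste("Data", "Science", "in", "R", "Programming", sep = "_")

# The same words held in one vector, joined with collapse
words   <- c("Data", "Science", "in", "R", "Programming")
joined2 <- paste(words, collapse = "_")

print(joined)
```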
63) Write a function to extract the first name from the string "Mr. Tom White".
substr("Mr. Tom White", start = 5, stop = 7)
64) Can you tell if the equation given below is linear or not?
Emp_sal = 2000 + 2.5(emp_age)^2
Yes, it is a linear equation in the regression sense: the model is linear in its coefficients, even though emp_age enters as a squared term.
65) What will be the output of the following R programming code ?
66) What will be the output of the following R programming code?
print("X is an even number")
print("X is an odd number")
Executing the above code will result in an error as shown below -
## 3: print("X is an even number")
R does not know that the else belongs to the preceding if, because the first if() is a complete command on its own by the time the line ends.
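The original code for this question did not survive, so the snippet below is a reconstruction of the general pattern and not the author's exact code. It shows that an else starting its own line at top level fails to parse, and that wrapping the branches in braces with "} else {" on one line avoids the problem.

```r
# An 'else' at the start of a new top-level line cannot be parsed
bad_code <- 'if (x %% 2 == 0) print("X is an even number")\nelse print("X is an odd number")'
parses <- tryCatch({ parse(text = bad_code); TRUE }, error = function(e) FALSE)
print(parses)   # FALSE

# Keeping '} else {' on one line tells R the statement is not finished
x <- 7
if (x %% 2 == 0) {
  msg <- "X is an even number"
} else {
  msg <- "X is an odd number"
}
print(msg)
```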
67) I have a string "[email protected]". Which string function can be used to split the string into two different strings "[email protected]" and "com"?
This can be accomplished using the strsplit() function, which splits a string on the separator given in the function call. The output of strsplit() is a list.
Output of the strsplit function is -
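Because the address in the question is obscured, the sketch below uses a hypothetical placeholder, "user@example.com". Note the fixed = TRUE argument: without it the dot would be treated as a regular expression that matches any character.

```r
address <- "user@example.com"   # hypothetical stand-in for the obscured string

parts <- strsplit(address, ".", fixed = TRUE)   # returns a list with one element
print(parts[[1]])   # "user@example" "com"
```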
68) What is R Base package?
The R base package is loaded by default whenever the R programming environment starts. It provides basic functionality such as arithmetic calculations and input/output.
69) How will you merge two data frames in R programming language?
The merge() function combines two data frames by identifying the columns (or row names) they have in common. By default it returns the intersection of the two data sets, that is, only the rows whose keys match in both.
The merge() function takes a long list of arguments; the most important are shown below.
Syntax for using the merge function in R language:
merge(x, y, by.x, by.y, all.x, all.y, all)
- x represents the first data frame.
- y represents the second data frame.
- by.x - Name of the key variable in data frame x that matches a variable in y.
- by.y - Name of the key variable in data frame y that matches a variable in x.
- all.x - A logical value that specifies the type of merge. Set all.x to TRUE to keep all the observations from data frame x. This results in a left join.
- all.y - A logical value that specifies the type of merge. Set all.y to TRUE to keep all the observations from data frame y. This results in a right join.
- all - The default value is FALSE, which means that only matching rows are returned, resulting in an inner join. Set it to TRUE to keep all the observations from both x and y, resulting in an outer join.
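The four join types described in the list above can be sketched with two small data frames. The column names and values are invented for illustration.

```r
x <- data.frame(id = c(1, 2, 3), height = c(150, 160, 170))
y <- data.frame(id = c(2, 3, 4), weight = c(55, 65, 75))

inner <- merge(x, y, by = "id")                 # ids 2 and 3 only
left  <- merge(x, y, by = "id", all.x = TRUE)   # all rows of x, NA weight for id 1
right <- merge(x, y, by = "id", all.y = TRUE)   # all rows of y, NA height for id 4
outer <- merge(x, y, by = "id", all = TRUE)     # ids 1 to 4

print(outer)
```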
70) Write the R programming code for an array of words so that the output is displayed in decreasing frequency order.
R Programming Code to display output in decreasing frequency order -
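The original code is missing here. One common approach, sketched below with a made-up word vector, is to count the words with table() and sort the counts in decreasing order.

```r
words <- c("gene", "cell", "gene", "protein", "cell", "gene")   # illustrative data

freq <- sort(table(words), decreasing = TRUE)   # counts, largest first
print(freq)
```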
71) How to check the frequency distribution of a categorical variable?
The frequency distribution of a categorical variable can be checked using the table() function in R language, which counts the occurrences of each category.
Output of the above R code:
Programmers can also calculate the percentage of values in each category by storing the output in a data frame and adding a percent column as shown below:
gender = factor(c("f","m","m","f","m","f")) # Sample data
t = data.frame(table(gender))
t$percent = round(t$Freq / sum(t$Freq) * 100, 2)
72) What is the procedure to check the cumulative frequency distribution of a categorical variable?
The cumulative frequency distribution of a categorical variable can be checked using the cumsum() function in R language.
gender = factor(c("f","m","m","f","m","f"))
y = table(gender)
cumsum(y) # Running total of the category counts
Output of the above R code: f = 3, m = 6.
73) What will be the result of multiplying two vectors in R having different lengths?
The multiplication is still performed, and the output is displayed with a warning message: "longer object length is not a multiple of shorter object length". Suppose a <- c(1, 2, 3) and b <- c(2, 3); then a * b gives 2 6 6 with the warning. The multiplication is performed element by element, and because the lengths differ, the first element of the shorter vector b is reused and multiplied with the last element of the longer vector a.
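The behaviour described above can be reproduced directly. In this sketch the warning text is captured with tryCatch so that both the result and the message can be inspected.

```r
a <- c(1, 2, 3)
b <- c(2, 3)

# The product is still computed: b is recycled as (2, 3, 2)
result <- suppressWarnings(a * b)

# Capture the text of the warning raised by the same operation
warn_text <- tryCatch(a * b, warning = function(w) conditionMessage(w))

print(result)      # 2 6 6
print(warn_text)
```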
74) R programming language has several packages for data science which are meant to solve a specific problem, how do you decide which one to use?
The CRAN package repository held more than 6000 packages at the time of writing, so a data scientist needs a well-defined process and clear criteria to select the right one for a specific task. When looking for a package on CRAN, first list all the requirements and issues, so that candidate packages can be judged against them.
A good approach is to look for an R package that follows sound software development principles and practices. For example, check the quality of its documentation and unit tests. The next step is to see how the package is used in practice and to read reviews posted by other users. It is important to know whether other data scientists or data analysts have been able to solve a problem similar to yours. When in doubt about a particular package, ask for feedback from R community members or colleagues to make sure you are making the right choice.
75) How can you merge two data frames in R language?
Data frames in R language can be merged manually using cbind () functions or by using the merge () function on common rows or columns.
76) Explain the usage of which() function in R language.
The which() function returns the positions of the elements of a logical vector that are TRUE. In the example below, we find the row number at which the maximum value of variable v1 is recorded.
mydata = data.frame(v1 = c(2, 4, 12, 3, 6))
which(mydata$v1 == max(mydata$v1))
It returns 3, as 12 is the maximum value and it is in the 3rd row of variable v1.
77) How will you convert a factor variable to numeric in R language?
A factor variable can be converted to numeric using the as.numeric() function. However, the variable first needs to be converted to character, because as.numeric() applied directly to a factor does not return the original values; it returns the internal integer codes of the factor levels.
X <- factor(c(4, 5, 6, 6, 4))
X1 = as.numeric(as.character(X))
78) Explain the significance of R programming language for data science.
i) Most calculations are vectorized, so data scientists can apply a function to a whole vector without writing a loop.
ii) It is a Turing-complete language that can be used for any kind of data science task, whether in genetics, statistics or biology.
iii) Being an interpreted language, it does not require a separate compilation step, which makes developing code easier.
79) What is power analysis ?
Power analysis is the process used to determine the sample size needed to detect an effect of a given size with a given degree of confidence, and it is generally used for experimental design. The pwr package in R is used for power analysis.
80) Explain the usage of abline() function.
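The abline() function in R is used to add a reference line to an existing graph. Its main arguments are abline(a, b, h, v, reg): a and b give the intercept and slope of a line, h and v give the positions of horizontal and vertical lines, and reg accepts a fitted model object such as the result of lm().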
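A short sketch of the common uses of abline(), assuming the built-in mtcars data set as example data. The call pdf(NULL) only keeps the example from writing a plot file; in an interactive session it can be omitted.

```r
pdf(NULL)                          # draw to a null device (no file is written)

plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
abline(h = 20, lty = 2)            # horizontal reference line at mpg = 20
abline(v = 3, lty = 3)             # vertical reference line at wt = 3

fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red")           # regression line taken from the fitted model

invisible(dev.off())
```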
81) What is the usage of lattice package in R ?
Lattice package helps enhance base R graphics by providing better defaults and helps easily display multi-variate relationships.
R Interview Questions for Data Science
1) What is the need of factorizing variables in R?
2) List some of your favorite functions in R programming language along with their usage.
3) Explain the differences between Python and R.
4) What is multi-threading and how can you implement it in R programming language?
5) Implement string operations in R language.
6) dplyr <- "ggplot2" library(dplyr). Which package will be loaded on executing the command and why?
7) Why you should use R language for statistical work?
8) What according to you are disadvantages of R Programming over Python?
9) Which R objects have you most frequently worked with?
10) Build a binary search tree in R language.
11) How can you produce correlations and covariances in R language?
12) How can you develop a package in R language and do version control?
This list of 100 data science interview questions is not an exhaustive one and we know that we have not gotten all the answers here. We request the data science community to help us out with the questions that we did not get the answers to. Please do chime in with any data science interview questions related to R programming that you think ought to be here. We will add it in.
82) How will you save the output of an R plot?
Ans: To create a pdf, you can use the pdf() function, and if one wishes to save the plot in jpeg format, they can use the jpeg() function.
83) Suppose you have a dataset 'CallRecords.csv' that contains two columns, 'dur_min' and 'balance'. How will you plot a graph of the two variables?
The key point to remember is that to refer to a variable in R, we are required to type the dataset and the variable name joined with a $ symbol as R does not know to look for the variables in the dataset automatically.
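A sketch of the answer follows. Since the file CallRecords.csv is not available here, a small data frame with invented values stands in for the result of read.csv("CallRecords.csv"); pdf(NULL) just suppresses the plot file.

```r
# Stand-in for: CallRecords <- read.csv("CallRecords.csv")
CallRecords <- data.frame(dur_min = c(2, 5, 9, 12),     # invented values
                          balance = c(40, 32, 25, 11))

pdf(NULL)
# Each variable is referred to as dataset$variable
plot(CallRecords$dur_min, CallRecords$balance,
     xlab = "dur_min", ylab = "balance")
invisible(dev.off())
```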
84) Write the code to implement linear regression over all the variables of a dataset 'd' except one of the variables, 'age', to predict the value of variable 'y'.
In R, a special identifier exists that one can use in a formula to mean all the variables: the '.' identifier.
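The formula can be sketched as y ~ . - age. The data frame d below is simulated purely for illustration; the columns x1 and x2 are hypothetical predictors.

```r
set.seed(1)
d <- data.frame(age = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(20, sd = 0.1)

# '.' means every column except the response; '- age' removes age again
fit <- lm(y ~ . - age, data = d)
print(names(coef(fit)))   # "(Intercept)" "x1" "x2"
```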
85) What do (principal) diagonal and non diagonal elements of a confusion matrix printed using table() function in R represent?
Ans: The diagonal elements of the confusion matrix represent correct predictions for a given target variable while the non-diagonal elements represent incorrect predictions.
86) How many inputs does knn() function of the class library in R require? Explain each of them briefly.
Ans: The knn() function requires following four inputs:
A matrix that consists of feature values from the training data.
A matrix that consists of feature values from testing data, for which we want to make predictions.
A vector that contains target values or class labels for the training data.
A value of K that specifies the number of nearest neighbors to be used by the algorithm.
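The four inputs map onto the train, test, cl and k arguments of knn(). The sketch below assumes the class package is installed and uses the built-in iris data with an arbitrary 100/50 split; the resulting table is a confusion matrix of the kind discussed in question 85.

```r
library(class)

set.seed(42)
idx <- sample(nrow(iris), 100)     # rows used for training

train_x <- iris[idx, 1:4]          # feature values, training data
test_x  <- iris[-idx, 1:4]         # feature values to predict for
train_y <- iris$Species[idx]       # class labels of the training data

pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)

# Confusion matrix: correct predictions are on the diagonal
print(table(predicted = pred, actual = iris$Species[-idx]))
```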
3: Programming in R - Biology
A practical copy & paste format, though not always easy to read, has been chosen throughout this manual. In this format all commands are represented in code boxes, where the comments are given in blue color. To save space, several commands are often concatenated on one line and separated with a semicolon ';'. All comments/explanations start with the standard comment sign '#' to prevent them from being interpreted by R as commands. This way several commands can be pasted with their comment text into the R console to demo the different functions and analysis steps. Commands starting with a '$' sign need to be executed from a Unix or Linux shell. Windows users can simply ignore them. Commands highlighted in red color are considered essential knowledge. They are important for someone interested in a quick start with R and Bioconductor. Where relevant, the output generated by R is given in green color.
- The installation instructions are provided in the Administrative Section of this manual.
- R working environments with syntax highlighting support and utilities to send code to the R console:
- Basic R code editors provided by Rguis , Rgedit, RKWard, Eclipse, Tinn-R, Notepad++ (NppToR) : R working environment based on vim and tmux (ESS add-on package)
- Kelly Black's R Tutorial
- Kim Seefeld's R-introduction for Biostatistics
- Peter Dalgaard's book Introductory Statistics with R
by Wim Krijnen.
- References on R programming are listed in the 'Programming in R' chapter of this manual.
- vectors: ordered collection of numeric, character, complex and logical values.
- factors: special type vectors with grouping information of its components
- data frames: two dimensional structures with different data types
- matrices: two dimensional structures with data of same type
- arrays: multidimensional arrays of vectors
- lists: general form of vectors with different types of elements
- functions: piece of code
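One object of each type in the list above can be created as follows. All names and values are illustrative.

```r
v  <- c(1.5, 2, 3)                                    # vector
f  <- factor(c("a", "b", "a"))                        # factor
df <- data.frame(id = 1:3, name = c("x", "y", "z"))   # data frame: mixed column types
m  <- matrix(1:6, nrow = 2)                           # matrix: one data type
a  <- array(1:24, dim = c(2, 3, 4))                   # array with three dimensions
l  <- list(num = v, chr = "text", mat = m)            # list: components of different types
square <- function(x) x^2                             # function

str(l)
print(square(v))
```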
- Object, row and column names should not start with a number.
- Avoid spaces in object, row and column names.
- Avoid special characters like '#'.
- : excellent choice for beginners (Cheat Sheet)
The R environment is controlled by hidden files in the startup directory: .RData, .Rhistory and .Rprofile (optional)
- by Rafael Irizarry and Michael Love ,
Basics on Functions and Packages
R contains most arithmetic functions like mean, median, sum, prod, sqrt, length, log, etc. An extensive list of R functions can be found on the function and variable index page. Many R functions and datasets are stored in separate packages, which are only available after loading them into an R session. Information about installing new packages can be found in the administrative section of this manual.
Loading of libraries/packages
Information and management of objects
System commands under Linux
Reading and Writing Data from/to Files
Interfacing with Google Docs
Data and Object Types
General Subsetting Rules
(1) Subsetting by positive or negative index/position numbers
(2) Subsetting by same length logical vectors
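The two subsetting rules above can be sketched on a small named vector (values invented for illustration):

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

by_pos  <- x[c(1, 3)]                         # (1) positive positions: keeps a and c
by_neg  <- x[-2]                              # (1) negative position: drops b
by_lgl  <- x[c(TRUE, FALSE, TRUE, FALSE)]     # (2) logical vector of the same length
by_cond <- x[x > 15]                          # (2) logical vector built from a condition

print(by_pos); print(by_neg); print(by_lgl); print(by_cond)
```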
Basic Operators and Calculations
- Comparison operators
- equal: ==
- not equal: !=
- greater/less than: > <
- greater/less than or equal: >= <= Example:
- Calculations [ Function Index ]
- Four basic arithmetic functions: addition, subtraction, multiplication and division
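The operators listed above behave as follows on two example values; comparisons are applied element by element when given a vector.

```r
x <- 10
y <- 4

comparisons <- c(equal = x == y, not_equal = x != y,
                 greater = x > y, less_or_equal = x <= y)
vectorized  <- c(1, 5, 9) >= 5                 # compared element by element

arithmetic <- c(sum = x + y, difference = x - y,
                product = x * y, quotient = x / y)

print(comparisons)
print(vectorized)
print(arithmetic)
```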
R's regular expression utilities work similarly to those in other languages. To learn how to use them in R, consult the main help page on this topic with: ?regexp.
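A few of the base R regular expression functions, sketched on a made-up vector of identifiers:

```r
ids <- c("gene_001", "protein_12", "gene_2")   # illustrative strings

starts_gene <- grepl("^gene", ids)             # which strings start with "gene"
prefixes    <- sub("_[0-9]+$", "", ids)        # strip the trailing "_<digits>"
numbers     <- regmatches(ids, regexpr("[0-9]+", ids))   # first run of digits

print(starts_gene)
print(prefixes)
print(numbers)
```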
Function tapply applies calculation on all members of a level
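As a sketch of tapply, the mean sepal length can be computed for each level of the Species factor in the built-in iris data set:

```r
# Mean of Sepal.Length within each level of Species
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
print(species_means)   # setosa 5.006, versicolor 5.936, virginica 6.588
```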
Matrices and Arrays
Merging arrays: example for building location tables for microtiter plates
Script for mapping 24/48/96 to 384 well plates
Reformatting data frames with reshape2 and splitting/apply routines with plyr
## reshape2 ##
## Some of these operations are important for plotting routines with the ggplot2 library.
## melt: rbinds many columns
(iris_mean <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=mean))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
(df_mean <- melt(iris_mean, id.vars=c("Species"), variable.name = "Samples")) # See ?melt.data.frame for details.
Species Samples value
1 setosa Sepal.Length 5.006
2 versicolor Sepal.Length 5.936
3 virginica Sepal.Length 6.588
4 setosa Sepal.Width 3.428
5 versicolor Sepal.Width 2.770
6 virginica Sepal.Width 2.974
## dcast: cbinds row aggregates (reverses melt result)
dcast(df_mean, formula = Species ~ Samples)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
## colsplit: splits strings into columns
x <- c("a_1_4", "a_2_3", "b_2_5", "c_3_9")
colsplit(x, "_", c("trt", "time1", "time2"))
trt time1 time2
1 a 1 4
2 a 2 3
3 b 2 5
4 c 3 9
## plyr ##
ddply(.data=iris, .variables=c("Species"), mean=mean(Sepal.Length), summarize)
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
ddply(.data=iris, .variables=c("Species"), mean=mean(Sepal.Length), transform)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species mean
1 5.1 3.5 1.4 0.2 setosa 5.006
2 4.9 3.0 1.4 0.2 setosa 5.006
3 4.7 3.2 1.3 0.2 setosa 5.006
4 4.6 3.1 1.5 0.2 setosa 5.006
## Usage with parallel package
library(parallel); library(doMC); registerDoMC(2) # 2 cores
test <- ddply(.data=iris, .variables=c("Species"), mean=mean(Sepal.Length), summarize, .parallel=TRUE)
Lists are ordered collections of objects that can be of different modes (e.g. numeric vector, array, etc.). A name can be assigned to each list component.
Some Great R Functions
The table() function counts the occurrence of entries in a vector. It is the most basic "clustering function":
The combn() function creates all combinations of elements:
The aggregate() function computes any type of summary statistics of data subsets that are grouped together:
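The three functions described above (table, combn and aggregate) can be tried with the following sketch. The label vector is invented, and iris is used as example data.

```r
labels <- c("a", "b", "a", "c", "a")
counts <- table(labels)                   # occurrences of each entry
print(counts)

pairs_abc <- combn(c("A", "B", "C"), 2)   # all 2-element combinations, one per column
print(pairs_abc)

# Mean sepal length for each species
species_mean <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(species_mean)
```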
The apply() function simplifies the coding of iterative tasks (loops). In the following example, apply() is first used to compute a row-wise calculation on a matrix. In a second example apply() is called by the custom function colAg() to perform similar calculations, but allowing the user to select any combination of column aggregates with an easy to handle grouping vector.
tgirke/Documents/R_BioCond/My_R_Scripts/colAg.R") # Imports the colAg() function.
colAg(myMA=myMA, group=c(1,1,1,2,2,2,3,3,4,4), myfct=mean)[1:4,]
# Applies a computation given under the myfct argument (here: mean) for the column aggregates specified under the group argument
# (here: c(1,1,1,2,2,2,3,3,4,4)). The columns in the resulting object are named after the chosen aggregates. Note: the function can
# only perform those calculations that can be applied to sets of two or more values, such as mean, sum, sd, min and max. Much faster
# to compute, but less flexible, alternatives are given in the data frame section .
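The plain row-wise and column-wise use of apply() mentioned above can be sketched without the external colAg() script. The matrix here is a small stand-in, not the myMA object from the original example.

```r
myMA <- matrix(1:12, nrow = 3)        # 3 x 4 matrix, filled column-wise

row_means <- apply(myMA, 1, mean)     # MARGIN = 1: apply mean to each row
col_sums  <- apply(myMA, 2, sum)      # MARGIN = 2: apply sum to each column

print(row_means)   # 5.5 6.5 7.5
print(col_sums)    # 6 15 24 33
```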
The %in% operator tests, for each element of the first vector, whether it occurs in the second, so it can be used to obtain the intersect between two vectors. In a subsetting context with '[ ]', it can be used to intersect matrices, data frames and lists:
The merge() function joins data frames based on a common key column:
Introduction to R Graphics
- plot: generic x-y plotting
- barplot: bar plots
- boxplot: box-and-whisker plot
- hist: histograms
- pie: pie charts
- dotchart: cleveland dot plots
- image, heatmap, contour, persp: functions to generate image-like plots
- qqnorm, qqline, qqplot: distribution comparison plots
- pairs, coplot: display of multivariant data
y[,1], data=y[,1:2]); abline(myline, lwd=2) # Adds a regression line to the plot.
summary(myline) # Prints summary about regression line.
plot(y[,1], y[,2], log="xy") # Generates the same plot as above, but on log scale.
plot(y[,1], y[,2]); text(y[1,1], y[1,2], expression(sum(frac(1,sqrt(x^2*pi)))), cex=1.3) # Adds a mathematical formula to the plot.
plot(y) # Produces all possible scatter plots for all-against-all columns in a matrix or a data frame. The column headers of
# the matrix or data frame are used as axis titles.
pairs(y) # Alternative way to produce all possible scatter plots for all-against-all columns in a matrix or a data frame.
library(scatterplot3d); scatterplot3d(y[,1:3], pch=20, color="red") # Plots a 3D scatter plot for first three columns in y.
library(geneplotter); smoothScatter(y[,1], y[,2]) # Same as above, but generates a smooth scatter plot that shows the density
# of the data points.
1:10) # Simple scatter plot.
1:10 | rep(LETTERS[1:5], each=2), as.table=TRUE) # Plots subcomponents specified by grouping vector after '|' in separate panels. The argument as.table controls the order of the panels.
myplot <- xyplot(Petal.Width ~ Sepal.Width | Species, data = iris); print(myplot) # Assigns plotting function to an object and executes it.
Sepal.Width | Species , data = iris, layout = c(3, 1, 1)) # Changes layout of individual plots.
## Change plotting parameters
show.settings() # Shows global plotting parameters in a set of sample plots.
default <- trellis.par.get(); mytheme <- default; names(mytheme) # Stores the global plotting parameters in list mytheme and prints its component titles.
mytheme$background$col <- "grey" # Sets background to grey
mytheme$strip.background$col <- "transparent" # Sets background of title bars to transparent.
trellis.par.set(mytheme) # Sets global parameters to 'mytheme'.
show.settings() # Shows custom settings.
1:10 | rep(LETTERS[1:5], each=2), as.table=TRUE, layout=c(1,5,1), col=c("red", "blue"))
Sepal.Width | Species, data=iris, type="a", layout=c(1,3,1))
iris[1:4] | Species, iris) # Plots data for each species in iris data set in separate line plot.
iris[1:4] | Species, iris, horizontal.axis = FALSE, layout = c(1, 3, 1)) # Changes layout of plot.
Species, ncol=1) # Plots three line plots, one for each sample in Species column.
# Imports a function that plots a loan amortization table as bar plot.
## (A) Sample Set: the following transforms the iris data set into a ggplot2-friendly format
iris_mean <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=mean) # Calculates the mean values for the aggregates given by the Species column in the iris data set.
iris_sd <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=sd) # Calculates the standard deviations for the aggregates given by the Species column in the iris data set.
convertDF <- function(df=df, mycolnames=c("Species", "Values", "Samples")) { myfactor <- rep(colnames(df)[-1], each=length(df[,1])); mydata <- as.vector(as.matrix(df[,-1])); df <- data.frame(df[,1], mydata, myfactor); colnames(df) <- mycolnames; return(df) } # Defines function to convert data frames into ggplot2-friendly format.
df_mean <- convertDF(iris_mean, mycolnames=c("Species", "Values", "Samples")) # Converts iris_mean.
df_sd <- convertDF(iris_sd, mycolnames=c("Species", "Values", "Samples")) # Converts iris_sd.
limits <- aes(ymax = df_mean[,2] + df_sd[,2], ymin=df_mean[,2] - df_sd[,2]) # Define standard deviation limits.
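The wide-to-long conversion performed by convertDF can be checked on its own. The following is a minimal base R sketch, not the original function: the helper to_long and its argument names are hypothetical and written only to show which shape the plotting code expects (one row per species and measurement).

```r
# Mean of each measurement per species (wide format: one column per measurement)
iris_mean <- aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)

# Hypothetical helper: stacks all non-ID columns into a single value column
to_long <- function(df, id_col = 1, value_name = "Values", key_name = "Samples") {
  measure_cols <- colnames(df)[-id_col]
  long <- data.frame(
    df[rep(seq_len(nrow(df)), times = length(measure_cols)), id_col, drop = FALSE],
    unlist(df[measure_cols], use.names = FALSE),   # values, column after column
    rep(measure_cols, each = nrow(df)),            # matching measurement names
    stringsAsFactors = FALSE
  )
  colnames(long) <- c(colnames(df)[id_col], value_name, key_name)
  rownames(long) <- NULL
  long
}

df_long <- to_long(iris_mean)
dim(df_long)      # 3 species x 4 measurements = 12 rows, 3 columns
head(df_long, 3)
```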
## (B) Bar plots of data stored in df_mean
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="dodge") # Plots bar sets defined by 'Species' column next to each other. stat="identity" is required because the bar heights are given by 'Values'.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="dodge") + coord_flip() + theme(axis.text.y=element_text(angle=0, hjust=1)) # Plots bars and labels sideways.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="stack") # Plots same data set as stacked bars.
ggplot(df_mean, aes(Samples, Values)) + geom_bar(aes(fill = Species), stat="identity") + facet_wrap(~Species, ncol=1) # Plots data sets below each other.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="dodge") + geom_errorbar(limits, position="dodge") # Generates the same plot as before, but with error bars.
# (C) Customizing colors
library(RColorBrewer); display.brewer.all() # Select a color scheme and pass it on to 'scale_*' arguments.
ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) + geom_bar(stat="identity", position="dodge") + geom_errorbar(limits, position="dodge") + scale_fill_brewer(palette="Greys") + scale_color_brewer(palette="Greys") # Generates the same plot as before, but with grey color scheme.
ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) + geom_bar(stat="identity", position="dodge") + geom_errorbar(limits, position="dodge") + scale_fill_manual(values=c("red", "green3", "blue")) + scale_color_manual(values=c("red", "green3", "blue")) # Uses custom colors passed on as vectors.
## Sample data
y <- lapply(1:4, function(x) matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))))
## Plot single heatmap:
## Arrange several heatmaps in one plot
x1 <- levelplot(y[[1]], col.regions=colorpanel(40, "darkblue", "yellow", "white"), main="colorpanel")
x2 <- levelplot(y[[2]], col.regions=heat.colors(75), main="heat.colors")
x3 <- levelplot(y[[3]], col.regions=rainbow(75), main="rainbow")
x4 <- levelplot(y[[4]], col.regions=redgreen(75), main="redgreen")
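print(x1, split=c(1,1,2,2)) # split=c(column, row, number of columns, number of rows); the first plot starts a new page.
print(x2, split=c(2,1,2,2), newpage=FALSE)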
print(x3, split=c(1,2,2,2), newpage=FALSE)
print(x4, split=c(2,2,2,2), newpage=FALSE)
Arrange several heatmaps in one plot with 'nearly' fixed row height
## Sample matrix with rows sorted by clustering dendrogram
y <- matrix(rnorm(100), 20, 5, dimnames=list(paste("g", 1:20, sep=""), paste("t", 1:5, sep="")))
hr <- hclust(as.dist(1-cor(t(y), method="pearson")), method="complete") # Hierarchical clustering of rows
y <- y[hr$order,] # Orders rows in matrix by clustering dendrogram
y <- t(scale(t(y))) # Scaling is necessary since levelplot() does not scale the data
## Plot matrix as heatmap with levelplot
library(lattice); library(gplots); library(grid)
x1 <- levelplot(t(y), col.regions=colorpanel(40, "darkblue", "yellow", "white"), colorkey=list(TRUE, space="bottom"), scales=list(x=list(cex=1.3), y=list(cex=1.3)), xlab="", ylab="", main="20 genes", aspect="fill")
x2 <- levelplot(t(y[11:20,]), col.regions=colorpanel(40, "darkblue", "yellow", "white"), colorkey=FALSE, scales=list(x=list(cex=1.3), y=list(cex=1.3)), xlab="", ylab="", main="10 genes", aspect="fill")
x3 <- levelplot(t(y[1:5,]), col.regions=colorpanel(40, "darkblue", "yellow", "white"), colorkey=FALSE, scales=list(x=list(cex=1.3), y=list(cex=1.3)), xlab="", ylab="", main="5 genes", aspect="fill")
## Arrange plots in single view
grid.newpage() # Open a new page on grid device
pushViewport(viewport(layout = grid.layout(4, 2, heights=unit(c(0.55, 0.01, 0.37, 0.07), "npc")))) # Define plotting grid and height of each plotting panel. The latter defines the height of each heatmap.
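pushViewport(viewport(layout.pos.row=1:4, layout.pos.col=1)) # Moves into the left column, spanning all four rows.
print(x1, newpage=FALSE) # Draws the first heatmap in the current viewport.
popViewport() # Returns to the layout viewport so that the next panel is not nested inside the previous one.
pushViewport(viewport(layout.pos.row=1, layout.pos.col=2))
print(x2, newpage=FALSE)
popViewport()
pushViewport(viewport(layout.pos.row=3, layout.pos.col=2))
print(x3, newpage=FALSE)
popViewport()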
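The row ordering and row scaling steps used for the sample matrix above can be verified without any plotting. This is a small base R sketch. The seed and the names y_ord and y_scaled are arbitrary choices made here, not part of the original example.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
y <- matrix(rnorm(100), 20, 5,
            dimnames = list(paste0("g", 1:20), paste0("t", 1:5)))

# Cluster rows on correlation distance and reorder them by the dendrogram
hr <- hclust(as.dist(1 - cor(t(y), method = "pearson")), method = "complete")
y_ord <- y[hr$order, ]

# scale() standardizes columns, so transpose, scale, and transpose back to scale rows
y_scaled <- t(scale(t(y_ord)))

range(rowMeans(y_scaled))        # row means are numerically zero
range(apply(y_scaled, 1, sd))    # row standard deviations are one
```

Because each row is standardized separately, the color scale of the heatmap compares patterns across columns within a row rather than absolute values between rows.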
Computation of Venn intersects
The following imports several functions from the overLapper.R script for computing Venn intersects and plotting Venn diagrams (old version: vennDia.R). These functions are relatively generic and scalable: they support the computation of Venn intersects for 2-20 or more samples. The upper limit of around 20 samples is unavoidable because the number of Venn intersects increases exponentially with the sample number n according to this relationship: (2^n) - 1. A useful feature of the plotting step is the possibility to combine the counts from several Venn comparisons with the same number of test sets in a single Venn diagram. The overall workflow is to first compute the Venn intersects for a list of sample sets with the overLapper function, which organizes the result sets in a list object. Subsequently, the Venn counts are computed and plotted as bar plots or Venn diagrams. The current implementation of the plotting function, vennPlot, supports Venn diagrams for 2-5 sample sets. To analyze larger numbers of sample sets, the Intersect Plot methods often provide reasonable alternatives. These methods are much more scalable than Venn diagrams, but lack their restrictive intersect logic. Additional Venn diagram resources are provided by limma, gplots, vennerable, eVenn, VennDiagram, shapes, C Seidel (online) and Venny (online).
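To make the (2^n) - 1 relationship concrete, the sketch below computes Venn intersects for three small sets in base R. It is not the overLapper implementation: the helper venn_intersects is hypothetical and only illustrates the idea that each Venn intersect holds the entries shared by one combination of sets and absent from all other sets.

```r
# Hypothetical helper: returns the exclusive ("Venn") intersect for every
# non-empty combination of the named sets in 'setlist'
venn_intersects <- function(setlist, sep = "_") {
  setlist <- lapply(setlist, unique)            # duplicates are removed first
  result <- list()
  for (k in seq_along(setlist)) {
    for (combo in combn(names(setlist), k, simplify = FALSE)) {
      inside <- Reduce(intersect, setlist[combo])          # shared by all sets in combo
      outside <- unlist(setlist[setdiff(names(setlist), combo)], use.names = FALSE)
      result[[paste(combo, collapse = sep)]] <-
        if (length(outside)) setdiff(inside, outside) else inside
    }
  }
  result
}

setlist <- list(A = c("a", "b", "c", "d"), B = c("c", "d", "e"), C = c("d", "e", "f"))
vs <- venn_intersects(setlist)
length(vs)            # 2^3 - 1 = 7 intersect sets
sapply(vs, length)    # Venn counts: A=2, B=0, C=1, A_B=1, A_C=0, B_C=1, A_B_C=1
```

Since every entry falls into exactly one Venn intersect, the counts add up to the size of the union of all sets (here 6). With 20 sets the same loop would have to visit 1,048,575 combinations, which is why the approach stops scaling around that point.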
tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R") # Imports required functions.
setlist <- list(A=sample(letters, 18), B=sample(letters, 16), C=sample(letters, 20), D=sample(letters, 22), E=sample(letters, 18), F=sample(letters, 22, replace=T))
# To work with the overLapper function, the sample sets (here six) need to be stored in a list object where the different
# components are named by unique identifiers, here 'A to F'. These names are used as sample labels in all subsequent data
# sets and plots.
sets <- read.delim("http://faculty.ucr.edu/
setlistImp <- lapply(colnames(sets), function(x) as.character(sets[sets[,x]!="", x]))
names(setlistImp) <- colnames(sets)
# Example how a list of test sets can be imported from an external table file stored in tab delimited format. Such
# a file can be easily created from a spreadsheet program, such as Excel. As a reminder, copy & paste from external
# programs into R is also possible (see read.delim function).
OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets"); OLlist; names(OLlist)
# With the setting type="vennsets", the overLapper function computes all Venn Intersects for the six test samples in
# setlist and stores the results in the Venn_List component of the returned OLlist object. By default, duplicates are
# removed from the test sets. The setting keepdups=TRUE will retain duplicates by appending a counter to each entry. When
# assigning the value "intersects" to the type argument then the function will compute Regular
# Intersects instead of Venn Intersects. The Regular Intersect approach (not compatible with Venn diagrams!) is described
# in the next section. Both analyses return a present-absent matrix in the Intersect_Matrix component of OLlist. Each overlap
# set in the Venn_List data set is labeled according to the sample names provided in setlist. For instance, the composite
# name 'ABC' indicates that the entries are restricted to A, B and C. The separator used for naming the intersect samples
# can be specified under the sep argument. By adding the argument cleanup=TRUE, one can minimize formatting issues in the
# sample sets. This setting will convert all characters in the sample sets to upper case and remove leading/trailing spaces.
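The present-absent matrix mentioned above can be pictured with a few lines of base R. This is an illustrative sketch only. The object names items, pa and labels are assumptions made here, and the real Intersect_Matrix component of OLlist may be organized differently.

```r
setlist <- list(A = c("a", "b", "c", "d"), B = c("c", "d", "e"), C = c("d", "e", "f"))

# One row per unique entry, one column per sample set, 1 = present, 0 = absent
items <- sort(unique(unlist(setlist)))
pa <- sapply(setlist, function(s) as.integer(items %in% s))
rownames(pa) <- items
pa

# Composite label per entry: the names of all sets that contain it, joined by the separator
labels <- apply(pa, 1, function(r) paste(colnames(pa)[r == 1], collapse = "_"))
table(labels)   # number of entries per Venn intersect
```

Grouping the rows by their presence pattern yields the same sets as a Venn intersect computation, which is why one matrix can serve both analysis types.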
## Bar plot of Venn counts ##
olBarplot(OLlist=OLlist, horiz=T, las=1, cex.names=0.6, main="Venn Bar Plot")
# Generates a bar plot for the Venn counts of the six test sample sets. In contrast to Venn diagrams, bar plots scale
# to larger numbers of sample sets. The layout of the plot can be adjusted by changing the default values of the argument:
# margins=c(4,10,3,1). The minimum number of counts to consider in the plot can be set with the mincount argument
# (default is 0). The bars themselves are colored by complexity levels using the default setting: mycol=OLlist$Complexity_Levels.
## 2-way Venn diagrams ##
setlist2 <- setlist[1:2]; OLlist2 <- overLapper(setlist=setlist2, sep="_", type="vennsets")
OLlist2$Venn_List; counts <- sapply(OLlist2$Venn_List, length); vennPlot(counts=counts)
# Plots a non-proportional 2-way Venn diagram. The main graphics features of the vennPlot function can be controlled by
# the following arguments (here with 2-way defaults): mymain="Venn Diagram": main title; mysub="default": subtitle;
# ccol=c("black","black","red"): color of counts; lcol=c("red","green"): label color; lines=c("red","green"):
# line color; mylwd=3: line width; ccex=1.0: font size of counts; lcex=1.0: font size of labels. Note: the vector
# lengths provided for the arguments ccol, lcol and lines should match the number of their corresponding features
# in the plot, e.g. 3 ccol values for a 2-way Venn diagram and 7 for a 3-way Venn diagram. The argument setlabels
# accepts a vector of custom sample labels. However, assigning the proper names in the original test set list
# is much more effective for tracking purposes.
## 3-way Venn diagrams ##
setlist3 <- setlist[1:3]; OLlist3 <- overLapper(setlist=setlist3, sep="_", type="vennsets")
counts <- list(sapply(OLlist3$Venn_List, length), sapply(OLlist3$Venn_List, length))
vennPlot(counts=counts, mysub="Top: var1 Bottom: var2", yoffset=c(0.3, -0.2))
# Plots a non-proportional 3-way Venn diagram. The results from several Venn comparisons can be combined in a
# single Venn diagram by assigning to the counts argument a list with several count vectors. The positional offset
# of the count sets in the plot can be controlled with the yoffset argument. The argument setting colmode=2 allows
# assigning different colors to each count set. For instance, with colmode=2 one can assign to ccol a color vector
# or a list, such as ccol=c("blue", "red") or ccol=list(1:8, 8:1).
## 4-way Venn diagrams ##
setlist4 <- setlist[1:4]
OLlist4 <- overLapper(setlist=setlist4, sep="_", type="vennsets")
counts <- list(sapply(OLlist4$Venn_List, length), sapply(OLlist4$Venn_List, length))
vennPlot(counts=counts, mysub="Top: var1 Bottom: var2", yoffset=c(0.3, -0.2))
# Plots a non-proportional 4-way Venn diagram. The setting type="circle" returns an incomplete 4-way Venn diagram as
# circles. This representation misses two overlap sectors, but is sometimes easier to navigate than the default
# ellipse version.
## 5-way Venn diagrams ##
setlist5 <- setlist[1:5]; OLlist5 <- overLapper(setlist=setlist5, sep="_", type="vennsets")
counts <- sapply(OLlist5$Venn_List, length)
vennPlot(counts=counts, ccol=c(rep(1,30),2), lcex=1.5, ccex=c(rep(1.5,5), rep(0.6,25),1.5)) # Plots a non-proportional 5-way Venn diagram.