
R warm up session for the Bioinformatics Summer School

        gilles.hunault@univ-angers.fr

Remember that, with more than 2 million functions distributed as packages and grouped in task views, R is a fantastic tool... but not the only one.

There is even a site dedicated to R and bioinformatics called Bioconductor.

Official documentation (many languages) is here.


Very very short lectures

About R and bioinformatics

About R and [bio]statistics

 

Follow this link for a short reference card for R and RStudio. A local copy is here.

 

Choose the section that suits you best:

  1. I don't know R

  2. I think I know R for statistics and bioinformatics

  3. I think I know how to program in R

  4. I know Python better than R

 

1. I don't know R

OK, you need a gentle introduction to R, and practice is the best way to get one. Read the document r-intro, based on Frédéric PROIA's warm up document, and type the code in RStudio to see what R answers.

You can load the file warmup.r, which contains all the R code, if you don't want to type it, but writing things out usually helps you remember them. You can also cut and paste the following lines, one instruction at a time:

Then try to solve these exercises:

Exercise 1

Use the iris dataset. Compute the mean of the fourth column. Why is the summary function a «good but limited function»?

Now compute the median of the fourth column. Does R help you decide between the mean and the median (or another computation) as the best descriptor of the values? Does R show the unit of the petal width?
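A minimal sketch of the first computations, assuming only the built-in iris dataset. The variable names petalWidth, m and med are illustrative and not part of the exercise.

```r
# Fourth column of the built-in iris dataset (Petal.Width)
petalWidth <- iris[, 4]

m   <- mean(petalWidth)     # arithmetic mean
med <- median(petalWidth)   # median

print(c(mean = m, median = med))

# summary() reports both, plus the quartiles and the extremes,
# but it does not say which descriptor to prefer or which unit is used
print(summary(petalWidth))
```

The interpretation (which descriptor is best, and why) is still up to you.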

Exercise 2

Read the elf.dar data file using the explanations given at the end of the page elf.htm. Convert the SEXE column of the data frame into a factor: 0=male and 1=female. Use the table and prop.table functions to compute absolute and relative counts. How do you sort them in decreasing order?
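The recoding and counting steps can be sketched as follows. The vector sexe is a hypothetical stand-in for the SEXE column, since reading elf.dar depends on the explanations in elf.htm.

```r
# Hypothetical stand-in for the SEXE column (0 = male, 1 = female)
sexe <- c(0, 1, 1, 0, 1, 1, 1, 0)

# Convert the codes into a labelled factor
sexeF <- factor(sexe, levels = c(0, 1), labels = c("male", "female"))

counts <- table(sexeF)          # absolute counts
props  <- prop.table(counts)    # relative counts (proportions)

# sort() works directly on tables
print(sort(counts, decreasing = TRUE))
print(sort(props,  decreasing = TRUE))
```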

Exercise 3

Compute the GC content of gene X94991.1. Use R's bioinformatics functions to do it in as few instructions as possible.

Hint: use the R code from R15 and install the ape package.

Exercise 4

Use the elf data again. Compare the ages of women and men with the help of the t.test function. Comment on the output.

Does R check that the assumptions of the test are fulfilled?

Exercise 5

What is the purpose of the msaR and AlignStat packages?

Exercise 6

Explain in which cases it is better to use beanplots than boxplots. Use R to show it. You have to install the beanplot package.

Exercise 7

Install and then load the rms package. Why does the installation take so long? Which datasets are included?

This package is associated with a Springer book. What is its name?

Exercise 8

Install and then load the faraway package. Which datasets are included?

This package is associated with a CRC book. What is its name?

How do you remove all rows with NA values from the diabetes dataset of this package in R? Is it a good idea to do so?
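A minimal sketch of the removal step, using a hypothetical toy data frame (patients) so that it runs without the faraway package. The column names are made up for illustration.

```r
# Hypothetical toy data frame with missing values
patients <- data.frame(id   = 1:5,
                       chol = c(203, NA, 228, 78, NA),
                       age  = c(46, 29, 58, NA, 64))

# Keep only the rows without any NA
complete <- na.omit(patients)

# Equivalent formulation with complete.cases()
complete2 <- patients[complete.cases(patients), ]

cat(nrow(patients) - nrow(complete), "of", nrow(patients), "rows were dropped\n")
```

Counting how many rows disappear, as in the last line, is one way to start thinking about whether the removal is a good idea.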

Exercise 9

Load the survival package. Check that you don't need to install it. Why? Is there also a book associated with this package?

Try to find the class and the dimensions of two of its datasets, named kidney and leukemia.

How can you find the class and the dimensions of all the datasets of this package?

Exercise 10

Why should you avoid for loops as much as possible in R when you are dealing with the columns or rows of data frames?
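To make the question concrete, here is a small sketch that computes the same column means twice, once with an explicit loop and once with a vectorised call. It only illustrates the two styles on the built-in iris dataset and does not claim anything about timings.

```r
# Loop version: one explicit iteration per column
loopMeans <- numeric(4)
for (j in 1:4) {
  loopMeans[j] <- mean(iris[, j])
}
names(loopMeans) <- names(iris)[1:4]

# Vectorised version: one call, no explicit loop
vecMeans <- colMeans(iris[, 1:4])

# Both approaches give the same values
print(all.equal(loopMeans, vecMeans))
```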

2. I think I know R for statistics and bioinformatics

There are some good and some bad practices in R to compute statistics and produce bioinformatics results. Check how you do things with these exercises.

No programming is needed here. Use RStudio to edit and run your code.

Exercise 1

Use the iris dataset. Compute the means of the first four columns with a single instruction.

Can you apply the summary function on the first column for each species with a single instruction?
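One possible sketch, using colMeans and tapply from base R. Other single-instruction answers exist (sapply, aggregate, by...), so take this as one option among several.

```r
# Means of the first four columns in a single instruction
means <- colMeans(iris[, 1:4])
print(means)

# summary() of the first column for each species, also in one instruction
bySpecies <- tapply(iris$Sepal.Length, iris$Species, summary)
print(bySpecies)
```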

Exercise 2

Use the cars dataset. Write a single instruction to have the row names showing Car001 Car002... No loop accepted.
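A sketch of one loop-free answer, relying on the fact that sprintf is vectorised. The copy carsNamed is only there to leave the built-in dataset untouched.

```r
carsNamed <- cars   # work on a copy of the built-in dataset

# sprintf() formats all the row numbers at once, padded with zeros to 3 digits
rownames(carsNamed) <- sprintf("Car%03d", seq_len(nrow(carsNamed)))

print(head(carsNamed, 3))
```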

Exercise 3

What is the shortest way to see the column names and their indices, as in this example:


     [1,] Sepal.Length
     [2,] Sepal.Width
     [3,] Petal.Length
     [4,] Petal.Width
     [5,] Species
     

You may use the iris dataset. Remember: no programming and no for loop here.
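A short candidate, not necessarily the shortest one: turning the names into a one-column matrix makes R print the row indices for free. The printed header differs slightly from the example above.

```r
# cbind() turns the vector of names into a one-column matrix,
# whose row labels [1,], [2,], ... are the indices
colIndex <- cbind(names(iris))

# quote = FALSE drops the quotation marks around the names
print(colIndex, quote = FALSE)
```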

Exercise 4

Read the diabetes dataset at the address http://forge.info.univ-angers.fr/~gh/wstat/Eda/diabetes.dar.

Beware that the first line holds the column names and that the first column holds the row names.

Remove all the rows with NA for the bp.1s variable. Which column then has the most NA values?

Hint: use apply and an anonymous function.
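The hint can be sketched on a hypothetical toy data frame (measures), so that the code runs without downloading the file. Only the column name bp.1s comes from the exercise; the other columns and all the values are made up.

```r
# Hypothetical toy data standing in for the diabetes file
measures <- data.frame(bp.1s = c(118, NA, 190, 110, 138),
                       bp.2s = c(NA, NA, 185, NA, NA),
                       chol  = c(203, 165, NA, 78, 249))

# Remove the rows where bp.1s is missing
kept <- measures[!is.na(measures$bp.1s), ]

# Count the NA values of each column with an anonymous function
naCounts <- apply(kept, 2, function(column) sum(is.na(column)))
print(naCounts)

# Name of the column with the most NA values
print(names(which.max(naCounts)))
```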

Exercise 5

Use the diabetes dataset again. With a single instruction, compute the categorical variable ageCL, based on the rule 'young' if age<18, 'old' otherwise, and add it to the dataset.
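A sketch of the rule with base R's transform and ifelse. The data frame people and its ages are hypothetical; with the real data you would apply the same instruction to the diabetes data frame.

```r
# Hypothetical ages standing in for the age column of the dataset
people <- data.frame(age = c(12, 17, 18, 45, 63))

# transform() adds the new column; ifelse() applies the rule to all rows at once
people <- transform(people, ageCL = ifelse(age < 18, "young", "old"))
print(people)
```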

When you modify a data frame, what are the differences between the transform and mutate functions?

Exercise 6

What is the most efficient way to compute and display the maximal value of a very long vector and the number of times it occurs? How can you prove it?
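One way to approach the "prove it" part is to time several candidates on the same data. The sketch below compares two hypothetical candidates, direct and viaTable, with system.time. Which one wins on your machine is for you to check, and a benchmarking package would give more reliable figures.

```r
set.seed(1)                                   # reproducible example
x <- sample(1:1000, 1e6, replace = TRUE)      # a long vector

# Candidate 1: two vectorised passes over the data
direct <- function(v) {
  m <- max(v)
  c(max = m, count = sum(v == m))
}

# Candidate 2: build the full frequency table, then read its last cell
viaTable <- function(v) {
  tab <- table(v)
  c(max = as.numeric(names(tab)[length(tab)]),
    count = as.integer(tab[length(tab)]))
}

print(system.time(r1 <- direct(x)))
print(system.time(r2 <- viaTable(x)))
```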

Exercise 7

Describe an ordinal variable with counts, percentages and cumulative frequencies, as in the following table, without any for loop. Don't forget the NA values.


     Frequency table for the variable  METAVIR_F
                         0      1      2      3      4   <NA>
     Count           250.0  800.0  516.0  364.0  380.0  647.0
     Sums            250.0 1050.0 1566.0 1930.0 2310.0 2957.0
     Percentages       8.5   27.1   17.5   12.3   12.9   21.9
     Cumulative        8.5   35.5   53.0   65.3   78.1  100.0
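
A possible sketch of such a table in base R. The vector metavir is rebuilt from the counts shown above purely for illustration; the real METAVIR_F data are not available here.

```r
# Rebuild a vector with the counts of the sample table (illustrative data)
metavir <- rep(c(0:4, NA), times = c(250, 800, 516, 364, 380, 647))

counts <- table(metavir, useNA = "always")   # keeps the NA category
pct    <- 100 * prop.table(counts)           # percentages, NA included

freqTable <- rbind(Count       = counts,
                   Sums        = cumsum(counts),
                   Percentages = round(pct, 1),
                   Cumulative  = round(cumsum(pct), 1))
print(freqTable)
```

The option useNA = "always" is what keeps the missing values in the table; without it, table silently drops them.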
     

Exercise 8

Build a graphical description of a continuous variable showing the estimated density curve and the candidate normal curve.

       [Figure: estimated density curve and candidate normal curve for a continuous variable]

Turn your instructions into a function. Which parameters are needed?

Exercise 9

What are the pros and cons of the ggplot2 and lattice packages compared to classical plots in R?

Exercise 10

Are you able to use Shiny and a Jupyter R notebook to have a small app in R? Prove it.

3. I think I know how to program in R

Let's check it. Use RStudio to edit, run, debug and profile (you said you were a programmer, right?) your code.

Exercise 1

Use the iris dataset. Apply the summary function for all numeric columns for each species with a single instruction.

Hint: use an anonymous function for tapply.

Exercise 2

Create a cats function that underlines a string with a given character, using "=" as the default.

Example:


     > cats("First part: descriptive statistics")
     
     First part: descriptive statistics
     ==================================
     
     > cats("Second part: inferential statistics","-")
     
     Second part: inferential statistics
     -----------------------------------
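
A minimal sketch of such a function, assuming the underline has the same length as the string. The parameter names text and char are illustrative.

```r
cats <- function(text, char = "=") {
  # Repeat the underline character once per character of the string
  underline <- strrep(char, nchar(text))
  cat("\n", text, "\n", underline, "\n\n", sep = "")
  invisible(underline)
}

cats("First part: descriptive statistics")
cats("Second part: inferential statistics", "-")
```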
     
     
     

Exercise 3

Create a timer function that prints the date and time before and after executing some code and that computes the duration of the execution.


     > timer( myFunction( 10**3 ) )
     
     Start: 08 june 2018 11:13:02 CEST
     [...] output
     Stop:  08 june 2018 11:13:02 CEST
     
     Time difference of 0.001301765 secs
     

Hint: use the ellipsis for the parameter of the function.
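The hint can be sketched as follows. This is one possible design: because R evaluates arguments lazily, the code passed through the ellipsis only runs when list(...) forces it, that is between the two time stamps. The date format string is an assumption that imitates the example above.

```r
timer <- function(...) {
  fmt   <- "%d %B %Y %H:%M:%S %Z"
  start <- Sys.time()
  cat("Start:", format(start, fmt), "\n")

  # Arguments are lazy: list(...) forces the code passed to timer()
  # to run here, between the two time stamps
  results <- list(...)

  end <- Sys.time()
  cat("Stop: ", format(end, fmt), "\n\n")

  duration <- end - start      # a difftime object
  print(duration)
  invisible(duration)
}

timer(Sys.sleep(0.2))
```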

Exercise 4

Create a function extractPvalue that extracts the p-value of a t-test. For example:


     
     > extractPvalue( t.test(iris2$Sepal.Length ~ iris2$Species2) )
     
     1.866144e-07
     

Create a function extractPvalues that produces a table of all t-tests for the columns of a data frame, using the name of the factor as a parameter. For example:


     
     > extractPvalues( iris2, "Species2" )
     
     Variable        p-value
     Sepal.Length    1.866144e-07
     Sepal.Width     0.001819
     Petal.Length    ...
     Petal.Width
     

The iris2 data frame corresponds to the iris dataset without the "setosa" flowers. Define it with a single instruction.

Hint: in extractPvalues use apply with an anonymous function that calls the extractPvalue function from previous exercise.
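A sketch under stated assumptions: iris2 keeps the original Species column (the examples above use a column named Species2, whose construction is not shown), and sapply over the numeric column names replaces the apply call suggested in the hint.

```r
# iris without the "setosa" flowers, defined with a single instruction
iris2 <- droplevels(subset(iris, Species != "setosa"))

# The object returned by t.test() stores the p-value in its p.value component
extractPvalue <- function(test) test$p.value

extractPvalues <- function(df, factorName) {
  # Only the numeric columns can be tested
  numericCols <- names(df)[sapply(df, is.numeric)]
  pvalues <- sapply(numericCols,
                    function(v) extractPvalue(t.test(df[[v]] ~ df[[factorName]])))
  data.frame(Variable = numericCols, p.value = pvalues, row.names = NULL)
}

print(extractPvalue(t.test(iris2$Sepal.Length ~ iris2$Species)))
print(extractPvalues(iris2, "Species"))
```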

Exercise 5

Define a function that uses a quantitative variable and two factors and displays the boxplots side by side as in the example of the tooth growth for the guinea pigs found in example("boxplot"). Don't forget the legend.

Exercise 6

The chi-square test function computes only the value of the test statistic but doesn't show where the main differences between the theoretical and the observed values are. Define a function that details the contribution (theo-obs)²/theo for each level and that sorts them by relative importance. As usual, no loops. Here is an example:


     Details of the chi-square statistic test value:
     
       Ind.    The Obs     Dif      Cntr       Pct    Cumul
          2 27.500  55 -27.500 27.500000 42.060623 45.50227
          1  6.875  18 -11.125 18.002273 27.534066 18.00227
          3 41.250  21  20.250  9.940909 15.204394 55.44318
          4 27.500  12  15.500  8.736364 13.362069 64.17955
          5  6.875   4   2.875  1.202273  1.838849 65.38182
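
A loop-free sketch of such a function, here called chi2Details (a hypothetical name). The observed counts are copied from the example above and the probabilities are inferred from its theoretical column, which is an assumption. Note that Cumul is accumulated here after sorting, so that column will not match the example line by line.

```r
chi2Details <- function(obs, prob) {
  theo <- sum(obs) * prob              # theoretical counts
  cntr <- (theo - obs)^2 / theo        # contribution of each level
  res  <- data.frame(Ind  = seq_along(obs),
                     The  = theo,
                     Obs  = obs,
                     Dif  = theo - obs,
                     Cntr = cntr,
                     Pct  = 100 * cntr / sum(cntr))
  res <- res[order(res$Cntr, decreasing = TRUE), ]   # largest contribution first
  res$Cumul <- cumsum(res$Cntr)                      # running total after sorting
  res
}

obs  <- c(18, 55, 21, 12, 4)          # observed counts from the example
prob <- c(1, 4, 6, 4, 1) / 16         # assumed theoretical probabilities
details <- chi2Details(obs, prob)
print(details, row.names = FALSE)
```

The sum of the Cntr column is the chi-square statistic itself, which gives a quick check against chisq.test(obs, p = prob).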
     

Exercise 7

Define a function that displays all the subsets of a given set. For example for givenSet <- c("a","b","c") it must display something like:


     1/8  empty set
     2/8  { a }
     3/8  { b }
     4/8  { c }
     5/8  { a , b }
     6/8  { a , c }
     7/8  { b , c }
     8/8  { a , b , c }
     

The subsets must be numbered, produced with an increasing number of elements (variant: decreasing) and displayed in alphabetical order.

Which areas of biostatistics or bioinformatics may need these subsets? Try to produce both iterative and recursive solutions.
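As a starting point, here is a sketch built on combn from the utils package. It is neither the iterative nor the recursive solution asked for, only a reference output to compare yours with. The name showSubsets is hypothetical.

```r
showSubsets <- function(givenSet) {
  items <- sort(givenSet)              # alphabetical order
  n     <- length(items)

  # Subsets of size k = 1..n, in increasing size; combn() keeps the order of items
  nonEmpty <- unlist(lapply(seq_len(n),
                            function(k) combn(items, k, simplify = FALSE)),
                     recursive = FALSE)

  labels <- c("empty set",
              vapply(nonEmpty,
                     function(s) paste("{", paste(s, collapse = " , "), "}"),
                     character(1)))

  total <- 2^n                         # a set with n elements has 2^n subsets
  cat(sprintf("%d/%d  %s", seq_along(labels), total, labels), sep = "\n")
  invisible(labels)
}

showSubsets(c("a", "b", "c"))
```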

Exercise 8

Run the library("gdata") instruction and check that it displays some information. Define a function .library (yes, with a dot at the beginning) that loads the library silently, that is, that suppresses the warning outputs. What is the use of the dot at the beginning of the function's name?
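A minimal sketch, assuming the package name is passed as a character string (the parameter name pkg is illustrative). It is tried below on stats, which ships with R, rather than on gdata, which may not be installed.

```r
.library <- function(pkg) {
  # pkg is the package name as a character string
  suppressWarnings(
    suppressPackageStartupMessages(
      library(pkg, character.only = TRUE)
    )
  )
  invisible(pkg)
}

.library("stats")   # any installed package name works here
```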

Exercise 9

Write a function that finds the longest common subsequence of n (greater than two) amino acid sequences. Beware that this is not the same problem as finding the longest common substring. First describe the method and its complexity using big-O notation. Is it possible to use it for DNA sequences, which may be very long? Prove that it is fast by profiling your code.

Exercise 10

Build a small example for a class of statistical objects (continuous, factor...) with basic methods (size, describe, plot...) using the S4, R5 and R6 formalisms in order to explain the pros and cons of these three object-oriented mechanisms in R.
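To fix ideas, here is a sketch of the same tiny "continuous variable" class written twice with the methods package: once as an S4 class and once as a Reference Class (often called R5). R6 needs the R6 package from CRAN and is left out so that the code runs on a plain R installation. All class and method names are hypothetical.

```r
library(methods)

# S4: the method lives outside the object and is dispatched by a generic
setClass("ContinuousS4", slots = c(values = "numeric"))
setGeneric("describe", function(object) standardGeneric("describe"))
setMethod("describe", "ContinuousS4",
          function(object) c(size = length(object@values),
                             mean = mean(object@values)))

# Reference Class: the method lives inside the object and is called with $
ContinuousRC <- setRefClass("ContinuousRC",
  fields  = list(values = "numeric"),
  methods = list(
    describe = function() c(size = length(values), mean = mean(values))
  ))

s4obj <- new("ContinuousS4", values = c(1, 2, 3, 6))
rcobj <- ContinuousRC$new(values = c(1, 2, 3, 6))

print(describe(s4obj))     # functional call style
print(rcobj$describe())    # message-passing call style
```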

Why must every R programmer decide to learn tidyverse or not?

What is the best R package to run "serious" tests (unit tests, integration tests...) in R?

4. I know Python better than R

Great. So use Python to solve the exercises from section 2 (« I think I know R for statistics and bioinformatics ») and from section 3 (« I think I know how to program in R »). Then send your scripts to me at gilles.hunault@univ-angers.fr so I can check your programming skills.

 

Final note: selected answers to the exercises are hidden but clickable on this page. Do you like to play hide and seek?
