## Data Science in R

Nine students joined our class, all but one of them in high school. They found R easy to learn and gave us fantastic reviews:

**"I initially thought R was a very complicated language (in terms of syntax). However, I found it quite easy to learn."** - A.S.

**"I especially liked how it is very easy to pick up. It is very helpful."** - J.M.

**"I was surprised at what R could accomplish."** - H.R.

**"I enjoyed the very fast paced lessons in which we learned a lot."** - A.C.

**"R was so efficient and easy to use."** - A.M.

**"I enjoyed how the instructor individually helped us if we had trouble."** - S.B.

## What is R?

R is a powerful and popular tool for biological and medical data analysis. It was co-developed by Ross Ihaka and Robert Gentleman. Dr. Gentleman, formerly of the Fred Hutchinson Cancer Research Center, later joined the DNA ancestry company 23andMe.

The utility of R is easier to see next to its common alternatives, Excel and MATLAB. For the kinds of data analysis described here, R does more than either of them, and it is absolutely free. Who can compete with that?

In our daily life, be it in scientific research, business analytics or hunting for Pokémon, we often face the problem of finding patterns in tables of data. The common solution is to load the tables into Excel, run functions, draw figures and perform statistical analysis. R is more efficient than Excel at all of that, and it does a lot more. Here are some tasks that R handles better than the popular alternatives:

- Vector math
- Finding information by joining multiple spreadsheets
- Generating random data and doing complex statistics
- Analyzing characters and strings, as required in genetics
- Drawing beautiful, professional-quality plots

We must add that the R syntax for these tasks is rather easy to learn, as the student reviews show.

## Day 1

We introduce students to R syntax and its concept of a vector. The students use R to compute sums like (1 + 2 + 3 + … + N) or (1·2 + 2·3 + 3·4 + … + N·(N+1)) in a breeze.
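A minimal sketch of how such sums might be computed with vectors is shown here. The value `N <- 100` is only an example we picked for illustration; it is not taken from the class material.

```r
# Example value of N, chosen only for illustration
N <- 100

# 1:N builds the vector 1, 2, ..., N in a single step
x <- 1:N

# 1 + 2 + ... + N
sum_simple <- sum(x)

# 1*2 + 2*3 + ... + N*(N+1): arithmetic is applied element by element
sum_products <- sum(x * (x + 1))

print(sum_simple)
print(sum_products)
```

Both results can be checked against the closed forms N(N+1)/2 and N(N+1)(N+2)/3. No explicit loop is needed, which is the main point of vector math.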

We also show how R can construct vectors of random data, such as simulated coin-tossing or dice-throwing experiments. Students see the statistical distribution of the outcomes, which is a very effective way to learn probability and statistics.
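An experiment of this kind could look roughly like the sketch below. The seed, the number of throws and the variable names are our own assumptions for illustration, not the exact class exercise.

```r
# Assumption: a fixed seed so the sketch gives the same numbers on every run
set.seed(42)

# Assumption: 1000 throws per experiment
n_throws <- 1000

# Toss a fair coin n_throws times
tosses <- sample(c("H", "T"), size = n_throws, replace = TRUE)
print(table(tosses))

# Roll one die n_throws times and look at the observed proportions
rolls <- sample(1:6, size = n_throws, replace = TRUE)
print(table(rolls) / n_throws)

# Roll two dice and add them: the sums are no longer equally likely
two_dice <- sample(1:6, n_throws, replace = TRUE) +
  sample(1:6, n_throws, replace = TRUE)
print(table(two_dice))
```

Passing the last table to `barplot()` draws the distribution, where sums near 7 should show up far more often than 2 or 12.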

Students then use these skills to estimate pi by throwing virtual darts at a board. This is a fun problem, and the underlying method, known as Monte Carlo simulation, is widely used in research.
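One possible version of the dart experiment is sketched below. It assumes a square board from -1 to 1 with a circle of radius 1 drawn inside it; the fraction of darts landing inside the circle then approximates pi/4. The seed and the number of darts are arbitrary choices of ours.

```r
# Assumption: fixed seed and 100,000 darts, chosen for illustration
set.seed(1)
n_darts <- 100000

# Each dart lands at a uniformly random point on the square [-1, 1] x [-1, 1]
x <- runif(n_darts, min = -1, max = 1)
y <- runif(n_darts, min = -1, max = 1)

# A dart is inside the circle if its distance from the center is at most 1
inside <- x^2 + y^2 <= 1

# Area of circle / area of square = pi / 4, so multiply the hit rate by 4
pi_estimate <- 4 * mean(inside)
print(pi_estimate)
```

The estimate gets closer to the true value as the number of darts grows, but only slowly: roughly, a hundred times more darts buys one more correct digit.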

## Day 2

## Day 3

On the third day, the class is introduced to the Bioconductor libraries for analyzing DNA and protein sequences. They download their favorite microbial genomes from NCBI, count nucleotides, identify doublets and triplets, and translate nucleotide sequences into proteins.
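To give a flavor of these tasks, here is a small sketch in base R only, so it runs without installing anything. It is not the Bioconductor code used in class, and the short sequence is made up rather than taken from a real genome. The Biostrings package from Bioconductor offers ready-made functions for the same jobs, such as `alphabetFrequency()`, `dinucleotideFrequency()` and `translate()`.

```r
# Made-up toy sequence, not a real genome
dna <- "ATGGCGTACGCTTAG"
n <- nchar(dna)

# Count single nucleotides
bases <- strsplit(dna, "")[[1]]
base_counts <- table(bases)
print(base_counts)

# Overlapping doublets: characters 1-2, 2-3, 3-4, ...
doublets <- substring(dna, 1:(n - 1), 2:n)
doublet_counts <- table(doublets)
print(doublet_counts)

# Non-overlapping triplets (codons), reading from the first base
codons <- substring(dna, seq(1, n - 2, by = 3), seq(3, n, by = 3))
print(codons)
```

Translating the codons into amino acids needs a lookup table for the genetic code, which is exactly the kind of detail a dedicated library takes care of.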

They also see another immensely powerful aspect of R: joining multiple tables or spreadsheets to find patterns. For this, they again use Pokémon examples, but researchers can use similar commands to select genes expressed above a cutoff together with their annotations.
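A join of this kind might look like the sketch below, which uses base R's `merge()`. Both tables are tiny and invented for illustration; in particular, the sighting counts are hypothetical numbers, not class data.

```r
# Table 1: a few Pokémon and their types
pokemon <- data.frame(
  name = c("Pikachu", "Charmander", "Squirtle", "Bulbasaur"),
  type = c("Electric", "Fire", "Water", "Grass")
)

# Table 2: hypothetical sighting counts (invented numbers)
sightings <- data.frame(
  name  = c("Pikachu", "Squirtle", "Bulbasaur", "Eevee"),
  count = c(12, 3, 7, 5)
)

# Join the two tables on the shared 'name' column.
# By default merge() keeps only names present in both tables.
joined <- merge(pokemon, sightings, by = "name")
print(joined)

# Keep only the rows above a cutoff, as one would for gene expression
frequent <- joined[joined$count > 4, ]
print(frequent)
```

Replacing the Pokémon names with gene identifiers, the types with annotations and the counts with expression levels gives the research version of the same query.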

## Day 4

On day 4, the students practice a wide range of R tools useful for analyzing medical or research data. They are introduced to the normal distribution and the t-test.

Researchers often have to compare multiple data sets to find out whether they differ in a statistically significant way. For example, they may want to know whether a drug is effective. Students can solve these kinds of problems easily in R by conducting the above tests.
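As a rough sketch of the drug example, the code below compares two simulated groups with `t.test()`. All the numbers (group size, means, standard deviation) are assumptions invented for illustration; they are not real clinical data.

```r
# Assumption: fixed seed so the simulated data is reproducible
set.seed(7)

# Simulated measurements, drawn from normal distributions.
# Assumption: 30 patients per group, invented means and spread.
placebo <- rnorm(30, mean = 120, sd = 10)
drug    <- rnorm(30, mean = 110, sd = 10)

# Two-sample t-test; R uses Welch's unequal-variance version by default
result <- t.test(drug, placebo)

print(result$estimate)  # the two sample means
print(result$p.value)   # a small p-value suggests the groups really differ
```

A common convention is to call the difference significant when the p-value is below 0.05, though the threshold is a choice rather than a law of nature.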

Students also learn about linear regression, correlation versus causation, and Type I and Type II errors.
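A linear regression in R can be as short as the sketch below. The data is simulated from a straight line with slope 2 and intercept 5 plus random noise; these values are our own assumptions, chosen so that we know what the fit should recover.

```r
# Assumption: fixed seed and invented line parameters
set.seed(3)

x <- 1:50
# True relationship: y = 2x + 5, plus normally distributed noise
y <- 2 * x + 5 + rnorm(50, mean = 0, sd = 3)

# Fit a straight line y = a + b*x by least squares
fit <- lm(y ~ x)
print(coef(fit))  # estimated intercept and slope

# Correlation between x and y
r <- cor(x, y)
print(r)
```

A high correlation here only says that y moves with x in this data set. Whether x causes y is a separate question that the numbers alone cannot answer.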

## Day 5

On day 5, students learn how to split their data into multiple tables in normalized form. They also learn the R scripting constructs ('for', 'while' and 'if' statements). They use these skills to run the t-test under multiple parameter settings and to understand how a small sample size affects the test.
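The effect of sample size can be explored with a loop like the one sketched below. The sample sizes, the size of the true difference and the number of repeated trials are all assumptions we picked for illustration.

```r
# Assumption: fixed seed, invented sample sizes and effect size
set.seed(11)

sample_sizes <- c(5, 20, 80)
n_trials <- 200
detect_rate <- numeric(length(sample_sizes))

for (i in seq_along(sample_sizes)) {
  n <- sample_sizes[i]
  hits <- 0
  for (trial in 1:n_trials) {
    # Two groups whose true means differ by 0.5
    a <- rnorm(n, mean = 0, sd = 1)
    b <- rnorm(n, mean = 0.5, sd = 1)
    # Count how often the t-test detects the difference
    if (t.test(a, b)$p.value < 0.05) {
      hits <- hits + 1
    }
  }
  detect_rate[i] <- hits / n_trials
}

names(detect_rate) <- sample_sizes
print(detect_rate)
```

The difference between the groups is real in every trial, yet with small samples the test should miss it most of the time. A missed real difference is a Type II error.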

Finally, to get more practice with the spreadsheet commands, they analyze international soccer data to list all the historical wins and losses of the Brazilian team.
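The idea can be sketched on a toy table. The four matches below are invented placeholders, with opponents called "Team A" and "Team B"; they are not real results, and the column names are our own assumptions rather than those of the class data set.

```r
# Invented placeholder matches, not real results
matches <- data.frame(
  home       = c("Brazil", "Team A", "Brazil", "Team B"),
  away       = c("Team A", "Brazil", "Team B", "Team A"),
  home_goals = c(2, 1, 0, 3),
  away_goals = c(0, 1, 1, 2)
)

# Keep only the matches Brazil played, at home or away
brazil <- matches[matches$home == "Brazil" | matches$away == "Brazil", ]

# Work out goals scored and conceded from Brazil's point of view
goals_for     <- ifelse(brazil$home == "Brazil", brazil$home_goals, brazil$away_goals)
goals_against <- ifelse(brazil$home == "Brazil", brazil$away_goals, brazil$home_goals)

# Label each match as a win, loss or draw
brazil$result <- ifelse(goals_for > goals_against, "win",
                        ifelse(goals_for < goals_against, "loss", "draw"))

print(brazil)
print(table(brazil$result))
```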

We end the day with an ice cream social.