Data Visualization: Theory and Practice (in R)
Athit Kao, PhD
UCI Bioinformatics Support Group - May 23, 2018
whoami
R enthusiast
PhD, Biomedical Sciences, UCI
IGB Biomedical Informatics Training program alumnus
Focus on proteomics mass spectrometry and bioinformatics
Prerequisites
Fundamental awareness of R and spreadsheet software
No programming experience necessary
R and RStudio installed (for 2nd half)
whoareyou
Who uses spreadsheet software to visualize data?
What types of figures (plots, charts, graphs, etc.) do you make?
Who has made a figure for a poster, presentation, or manuscript?
How many people have programmed before? What languages?
Comparison of Data Visualization Methods
Spreadsheet Software
- Many options (e.g. Excel, Sheets, Calc, Numbers)
- Quick editing and prototyping
- WYSIWYG plots
- Manual editing/manipulation error-prone
- Worksheet size limits, e.g. Excel allows at most 1,048,576 rows by 16,384 columns
R Programming Language
- In addition to data visualization, you can do almost anything in R
- Rapid and reproducible visualizations for data exploration
- No fixed row/column limit; data size is bounded mainly by available memory (RAM)
- Steep learning curve
Data Formats and Workflows
Click drag.. click.. click click...
What can we visualize here?
mtcars data from R
Histogram, Frequency Plot, Density Plot
- One-dimensional, not typically used to compare multiple groups
- Area is proportional to frequency of variable
- E.g. comparison of horsepower between all, manual, and automatic vehicles
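A minimal base R sketch of the horsepower comparison, assuming the `am` column of mtcars codes 0 for automatic and 1 for manual (as documented in `?mtcars`). The variable names and the 50-unit bin width are illustrative choices, not taken from the slides.

```r
# mtcars ships with R (datasets package), so nothing needs to be installed.
hp_all = mtcars$hp
hp_automatic = mtcars$hp[mtcars$am == 0]  # am == 0: automatic
hp_manual = mtcars$hp[mtcars$am == 1]     # am == 1: manual

# Shared breaks keep the three histograms directly comparable.
breaks = seq(50, 350, by = 50)

old_par = par(mfrow = c(1, 3))  # three panels side by side
h_all = hist(hp_all, breaks = breaks, main = "All", xlab = "Horsepower")
h_automatic = hist(hp_automatic, breaks = breaks, main = "Automatic", xlab = "Horsepower")
h_manual = hist(hp_manual, breaks = breaks, main = "Manual", xlab = "Horsepower")
par(old_par)  # restore the previous plotting layout
```

Each call to `hist()` also returns the bin counts, so the numbers behind the bars can be inspected with, for example, `h_all$counts`.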
Box Plot, Box-and-whisker Plot
- One-dimensional, typically used to compare multiple groups
- Minimally shows: first/third quartile, median, min/max (whiskers; outliers may be drawn as separate points)
- Mirrored density plot + box plot = violin plot
- E.g. comparison of fuel efficiency across engine cylinder count
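The fuel efficiency example can be sketched in base R with `boxplot()` and a formula. This is one possible version, not necessarily the code shown in the talk; the axis labels and title are illustrative.

```r
# cyl takes the values 4, 6, and 8 in mtcars; the formula splits mpg by cyl.
bp = boxplot(mpg ~ cyl, data = mtcars,
             xlab = "Cylinders", ylab = "Miles per gallon",
             main = "Fuel efficiency by cylinder count")

# Per group, the rows of bp$stats are: lower whisker, first quartile (hinge),
# median, third quartile (hinge), upper whisker.
bp$stats
```

By default the whiskers stop at the most extreme point within 1.5 times the interquartile range of the box, so they match the min/max only when a group has no outliers.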
Scatter Plot, Scattergram
- Two-dimensional, can be used to compare multiple groups
- Each group has each datapoint's x and y values plotted
- E.g. visualize the effect of engine displacement on horsepower across engine cylinder count
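A rough base R sketch of this scatter plot, with one color per cylinder group. The color choices and the names `cyl_groups` and `group_colors` are assumptions made for illustration.

```r
# Treat cylinder count as a grouping factor with levels "4", "6", "8".
cyl_groups = factor(mtcars$cyl)
group_colors = c("blue", "darkgreen", "red")  # one color per level

# Indexing the colors by the factor picks the color for each car's group.
plot(mtcars$disp, mtcars$hp,
     col = group_colors[cyl_groups], pch = 19,
     xlab = "Displacement (cu. in.)", ylab = "Gross horsepower",
     main = "Displacement vs. horsepower by cylinder count")
legend("topleft", legend = levels(cyl_groups),
       col = group_colors, pch = 19, title = "Cylinders")
```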
Writing Script vs. Using Console
Pre-loaded Data Sets in R
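As a quick illustration, the data sets bundled with R can be listed and inspected from the console. The snippet below is a sketch using only base functions; the variable name `available` is illustrative.

```r
# Names of the data sets that ship with the datasets package.
available = data(package = "datasets")$results[, "Item"]
head(available)

# Get a feel for mtcars before plotting it.
head(mtcars)         # first six rows
str(mtcars)          # column names and types
dim(mtcars)          # number of rows and columns
summary(mtcars$mpg)  # five-number summary plus mean
```

Typing `?mtcars` in the console opens the help page that describes each column.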
Why are we typing out everything?...
- So you can make mistakes:
- “That was case sensitive?”
- “I forgot a parenthesis/bracket?”
- “That wasn't a period/comma/semi-colon?”
- We learn better from mistakes
- We can help each other out here
Follow along with the red line numbers
Exercise #1b: Density Plot
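The original exercise code is not reproduced here. The following is one possible base R version of a density plot, assuming the exercise compares horsepower for all, automatic, and manual vehicles as in the earlier example; the colors and variable names are illustrative.

```r
# Kernel density estimates for each group (am: 0 = automatic, 1 = manual).
d_all = density(mtcars$hp)
d_automatic = density(mtcars$hp[mtcars$am == 0])
d_manual = density(mtcars$hp[mtcars$am == 1])

# Use a common y-axis limit so no curve is clipped.
y_max = max(d_all$y, d_automatic$y, d_manual$y)

plot(d_all, ylim = c(0, y_max),
     main = "Horsepower density", xlab = "Horsepower")
lines(d_automatic, col = "blue")
lines(d_manual, col = "red")
legend("topright", legend = c("All", "Automatic", "Manual"),
       col = c("black", "blue", "red"), lty = 1)
```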
Exercise #3: Scatter Plot
Summarize for me...
- What are some advantages R has over Excel for data visualization?
- Why is it important to have code define a visualization?
- Anything weird or counterintuitive?
- T/F: Shared data and code will output the same exact visualization
- T/F: Anything spreadsheet software can visualize, R can visualize better
- T/F: You need to pay for “Premium Enterprise R” to make violin plots
- #4, TRUE; #5, TRUE; #6, FALSE!!!
Questions?
- Just try new code and see what happens; this isn't wet lab
- Google: Don't just search with “R”, use “R language”
- Stack Overflow: Use tag “[r]”
Keep going, you got this!
App: The Assignment Operator
- Use “<-” or “=”?
- I use “=” for consistency (with other programming languages) with no issues
- Consistency even within R
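A small sketch of how the two operators behave. At the top level they are interchangeable; inside a function call, "=" names an argument while "<-" still performs an assignment. The variable names are illustrative.

```r
# At the top level, both operators create a variable.
x <- 5
y = 5
identical(x, y)

# Inside a call, "=" matches the argument named x; the global x is untouched.
m1 = mean(x = c(1, 2, 3))

# Inside a call, "<-" assigns as a side effect: z now exists in the workspace.
m2 = mean(z <- c(1, 2, 3))
```

After running this, `x` is still 5, while `z` holds the vector passed to `mean()`.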
App: Tilde Operator ("~")
- Operator for use in “formulas”, used in many different functions and packages
- E.g. “y ~ x” means y (or whatever is on the left) is the response to x (or whatever is on the right)
- E.g. “mpg ~ cyl + am” means mpg depends on cylinder count and transmission type (am: 0 = automatic, 1 = manual)
> help("~")
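To make the formula idea concrete, here is a short sketch that stores “mpg ~ cyl + am” as an object and passes it to two base functions. Treating `cyl` and `am` as plain numeric predictors in `lm()` is a simplification chosen for illustration.

```r
# A formula is an ordinary R object that can be stored and reused.
f = mpg ~ cyl + am
class(f)     # the object's class
all.vars(f)  # variable names appearing in the formula

# Mean mpg for every combination of cylinder count and transmission type.
means = aggregate(f, data = mtcars, FUN = mean)
means

# The same formula describes a linear model: mpg as the response to cyl and am.
fit = lm(f, data = mtcars)
coef(fit)
```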
App: NIH Proficiency Scale
- 1. Fundamental Awareness (basic knowledge): Common knowledge/understanding of basic techniques/concepts
- 2. Novice (limited experience): Expected to need help when performing this skill
- 3. Intermediate (practical application): Able to successfully complete tasks; expert required occasionally
- 4. Advanced (applied theory): Able to successfully complete tasks without assistance
- 5. Expert (recognized authority): Can provide guidance, troubleshooting, and answers related to this skill
- Source: https://hr.nih.gov/working-nih/competencies/competencies-proficiency-scale