Data Visualization: Theory and Application (in R)
Athit Kao, PhD
UCI Bioinformatics Support Group - May 23, 2018
whoami
R enthusiast
PhD, Biomedical Sciences, UCI
IGB Biomedical Informatics Training program alumnus
Focus on proteomics mass spectrometry and bioinformatics
Prerequisites
Understand content from previous deck: http://learn.athitkao.com/presentation_datavis1.html
Basic programming experience necessary
R and RStudio installed (for 2nd half)
Consider difficulties encountered using spreadsheet programs, e.g.
- Generating a scatter plot with tens of thousands of datapoints
- Replicating a specific figure formatting style for new data
- Documenting changes made to a figure's source data
Visualizing Exploratory Data Analysis
Dive in
Visualize everything
You might find something new and interesting
You can do this a lot quicker in R than in spreadsheet software
Reminder: Quick visuals don't need to be fancy or polished
Information Design: K.I.S.S.
Concept: Layering
- This concept is used in base and ggplot2 plotting systems
- Build up to your final plot by adding layers of customizations
- ggplot2 automatically rescales the axes as layers add data; base keeps the axes set by the first plot() call
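The axis behavior noted above can be checked with a small sketch. The two data frames, `first` and `second`, are made up for illustration and are not from the session; the second one deliberately lies outside the range of the first.

```r
library(ggplot2)

# Hypothetical data: `second` lies entirely outside the range of `first`
first  <- data.frame(x = 1:10,  y = 1:10)
second <- data.frame(x = 11:20, y = 11:20)

# Base: plot() sets the axis limits from `first`. points() draws onto the
# existing canvas, so `second` falls outside the plot region and is not shown.
plot(first$x, first$y)
points(second$x, second$y, col = "red")

# ggplot2: each `+` adds a layer. The scales are computed from all layers
# when the plot is drawn, so the axes cover both data frames.
p <- ggplot(first, aes(x = x, y = y)) +
  geom_point() +
  geom_point(data = second, colour = "red")
print(p)
```

In base, the usual workaround is to set `xlim` and `ylim` in the first `plot()` call so that later layers fit.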
Data Formats for Repeated Measures
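Repeated measures are usually stored either wide (one row per subject, one column per measurement) or long (one row per subject and measurement); ggplot2 generally expects the long form. A minimal sketch of the conversion with base R's `reshape()`, using made-up values for three hypothetical subjects:

```r
# Hypothetical repeated-measures data in wide format:
# 3 subjects, each measured at 3 time points
wide <- data.frame(
  subject = c("s1", "s2", "s3"),
  t0 = c(5.1, 4.8, 5.5),
  t1 = c(5.9, 5.2, 6.1),
  t2 = c(6.4, 5.0, 6.8)
)

# Wide to long: the three measurement columns are stacked into a single
# `value` column, and `time` records which column each value came from
long <- reshape(
  wide,
  direction = "long",
  varying   = c("t0", "t1", "t2"),
  v.names   = "value",
  timevar   = "time",
  times     = c("t0", "t1", "t2"),
  idvar     = "subject"
)

# Tidy up: sort by subject and time, drop the generated row names
long <- long[order(long$subject, long$time), ]
rownames(long) <- NULL
print(long)
```

Packages such as tidyr offer the same conversion with a shorter call; base R is used here only to avoid an extra install.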
Limitations of Excel for Modern STEM
- Fixed visualization types
- Good luck plotting hundreds to thousands of megabytes of data (Excel has a limit of 32K datapoints/series)
- E.g. Scatterplot with ~100K datapoints (3 x 32K; 3.7 MB)
- E.g. Scatterplot with 1 million datapoints (3 x 333K; 40 MB; didn't attempt with Excel)
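For comparison, a rough sketch of a one-million-point scatter plot in base R. The data are simulated here (an assumption; the dataset behind the figures above is not shown), and the output goes to a temporary PNG file.

```r
# Simulated data: one million correlated (x, y) pairs
set.seed(1)
n <- 1e6
x <- rnorm(n)
y <- x + rnorm(n)

# Write to a PNG file rather than the screen
out_file <- file.path(tempdir(), "scatter_1e6.png")
png(out_file, width = 800, height = 800)

# pch = "." draws the smallest marker; a low alpha lets dense regions
# show up darker instead of becoming one solid blob
plot(x, y, pch = ".", col = rgb(0, 0, 0, alpha = 0.1),
     main = "One million simulated datapoints")
dev.off()
```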
Visualization Systems in R
- Three main systems, with new ones always being developed
- base: Original system for R
- Build up from a blank canvas
- Used in the previous presentation
- lattice:
- Visualizations made with a single function call
- Not covered here
- ggplot2: install.packages("ggplot2")
- Hybrid between base and lattice, highly customizable
- Covered in this presentation
Exercise #1: Data Formats
Exercise #2: Density Plot (ggplot2)
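A minimal sketch of a ggplot2 density plot, assuming the built-in `iris` dataset rather than the data used in the session:

```r
library(ggplot2)

# One density curve per species; alpha makes overlapping curves visible
p_density <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5) +
  labs(x = "Sepal length (cm)", y = "Density")
print(p_density)
```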
Exercise #3: Box Plot (ggplot2)
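A minimal sketch of a ggplot2 box plot, again assuming the built-in `iris` dataset as a stand-in for the session data:

```r
library(ggplot2)

# One box per species: a categorical x and a numeric y
p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(x = "Species", y = "Sepal length (cm)")
print(p_box)
```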
Exercise #4: Scatter Plot with Layering (ggplot2)
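A sketch of building a scatter plot one layer at a time, assuming the built-in `mtcars` dataset in place of the session data. Printing `p_scatter` after any step shows the plot as it stands at that point.

```r
library(ggplot2)

# Start with the data and the aesthetic mapping only (nothing is drawn yet)
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg))

# Layer 1: points, coloured by number of cylinders
p_scatter <- p_scatter + geom_point(aes(colour = factor(cyl)))

# Layer 2: a straight-line fit across all points, without the confidence band
p_scatter <- p_scatter + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Customizations: labels and a plain theme
p_scatter <- p_scatter +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders") +
  theme_bw()
print(p_scatter)
```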
Saving Visualizations, part II
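A sketch of two common ways to save a figure, assuming the built-in `mtcars` dataset and temporary output paths (both are placeholders). `ggsave()` writes a ggplot2 object to a file; base plots are saved by opening a graphics device, plotting, and closing the device.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# ggplot2: the file type is taken from the extension; size is in inches
out_png <- file.path(tempdir(), "scatter.png")
ggsave(out_png, plot = p, width = 6, height = 4, dpi = 300)

# Base: open a device, draw, then close the device to finish the file
out_pdf <- file.path(tempdir(), "scatter_base.pdf")
pdf(out_pdf, width = 6, height = 4)
plot(mtcars$wt, mtcars$mpg)
dev.off()
```

Saving from code means the same script reproduces the figure at the same size every time.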
Summarize for me...
- Why are we typing out everything instead of copy/paste from the presentation?
- Why would you still use the base graphics system?
- What is an aspect of clean information design?
- T/F: You can effectively explore new data without visuals
- T/F: You can generate a ggplot2 visual in multiple lines of code
- T/F: You can generate a ggplot2 visual in just one line of code
- Answers: #4, FALSE; #5, TRUE; #6, TRUE
Questions?
- Just try new code and see what happens; this isn't wet lab
- Google: Don't just search with “R”, use “R language”
- Stack Overflow: Use tag “[r]”
Keep going, you got this!
App: One Million Datapoints
App: NIH Proficiency Scale
- 1. Fundamental Awareness (basic knowledge): Common knowledge/understanding of basic techniques/concepts
- 2. Novice (limited experience): Expected to need help when performing this skill
- 3. Intermediate (practical application): Able to successfully complete tasks; expert required occasionally
- 4. Advanced (applied theory): Able to successfully complete tasks without assistance
- 5. Expert (recognized authority): Can provide guidance, troubleshooting, and answers related to this skill
- Source: https://hr.nih.gov/working-nih/competencies/competencies-proficiency-scale