Data Visualization: Theory and Practice (in R)

Athit Kao, PhD
UCI Bioinformatics Support Group - May 23, 2018

whoami

  • R enthusiast

  • PhD, Biomedical Sciences, UCI

  • IGB Biomedical Informatics Training program alumnus

  • Focus on proteomics mass spectrometry and bioinformatics

Prerequisites

  • Fundamental awareness of R and spreadsheet software

  • No programming experience necessary

  • R and RStudio installed (for 2nd half)

whoareyou

  • Who uses spreadsheet software to visualize data?

  • What types of figures (plots, charts, graphs, etc.) do you make?

  • Who has made a figure for a poster, presentation, or manuscript?

  • How many people have programmed before? What languages?

Comparison of Data Visualization Methods

Spreadsheet Software

  • Many options (e.g. Sheets, Calc, Numbers, etc.)
  • Quick editing and prototyping
  • WYSIWYG plots
  • Manual editing/manipulation error-prone
  • File size limits, e.g. an Excel worksheet is capped at 1,048,576 rows by 16,384 columns

R Programming Language

  • In addition to data visualization, you can do almost anything in R
  • Rapid and reproducible visualizations for data exploration
  • No fixed row/column cap; data size is limited mainly by available memory
  • Steep learning curve

Data Formats and Workflows

Figure 4

Click drag.. click.. click click...

Figure 4

What can we visualize here?

Figure 0

Histogram, Frequency Plot, Density Plot

  • One-dimensional, not typically used to compare multiple groups
  • Area is proportional to frequency of variable
  • E.g. comparison of horsepower between all, manual, and automatic vehicles

Figure 9
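
A minimal sketch of the horsepower example, assuming base R graphics and the built-in mtcars data set (where am is coded 0 = automatic, 1 = manual). The variable names and colours are illustrative choices, not taken from the slides.

```r
# Built-in data set; am is coded 0 = automatic, 1 = manual
data(mtcars)

hp_all       = mtcars$hp
hp_automatic = mtcars$hp[mtcars$am == 0]
hp_manual    = mtcars$hp[mtcars$am == 1]

# Histogram: with equal-width bins, bar height is the number of cars per bin
h = hist(hp_all, breaks = 10, col = "grey80",
         main = "Horsepower, all vehicles", xlab = "Gross horsepower")

# Density plots: smoothed curves, overlaid so the three groups can be compared
dens_all  = density(hp_all)
dens_auto = density(hp_automatic)
dens_man  = density(hp_manual)
y_max     = max(dens_all$y, dens_auto$y, dens_man$y)

plot(dens_all, ylim = c(0, y_max),
     main = "Horsepower density by transmission", xlab = "Gross horsepower")
lines(dens_auto, col = "blue")
lines(dens_man, col = "red")
legend("topright", legend = c("All", "Automatic", "Manual"),
       col = c("black", "blue", "red"), lty = 1)
```

Overlaying density curves is one way to compare a few groups on a single axis; stacked or side-by-side histograms would be an alternative.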

Box Plot, Box-and-whisker Plot

  • One-dimensional, typically used to compare multiple groups
  • At minimum shows: first and third quartiles, median, min/max
  • Density plot + box plot = violin plot
  • E.g. comparison of fuel efficiency across engine cylinder count

Figure 9c
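
A minimal sketch of the fuel efficiency example, assuming base R's boxplot() and the built-in mtcars data set. Note that base R draws whiskers to the most extreme points within 1.5 times the interquartile range of the box by default, and plots anything beyond that as individual points, so the whiskers are not always the true min/max.

```r
data(mtcars)

# Formula interface: mpg (left of ~) is grouped by cyl (right of ~)
bp = boxplot(mpg ~ cyl, data = mtcars,
             main = "Fuel efficiency by cylinder count",
             xlab = "Number of cylinders", ylab = "Miles per gallon")

# boxplot() invisibly returns what it drew. Each column of bp$stats is one
# group; rows are lower whisker, lower hinge (about the first quartile),
# median, upper hinge (about the third quartile), upper whisker.
bp$names
bp$stats
```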

Scatter Plot, Scattergram

  • Two-dimensional, can be used to compare multiple groups
  • Each group has each datapoint's x and y values plotted
  • E.g. visualize the effect of engine displacement on horsepower across engine cylinder count

Figure 9d
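
A minimal sketch of the displacement vs. horsepower example, assuming base R's plot() and the built-in mtcars data set. The colour palette is a hypothetical choice; any three distinguishable colours would do.

```r
data(mtcars)

cyl_group  = factor(mtcars$cyl)                # levels "4", "6", "8"
group_cols = c("darkgreen", "orange", "red")   # hypothetical colours, one per level

# Indexing the colour vector by the factor gives one colour per data point
plot(mtcars$disp, mtcars$hp,
     col = group_cols[cyl_group], pch = 19,
     main = "Horsepower vs. displacement",
     xlab = "Displacement (cu. in.)", ylab = "Gross horsepower")
legend("topleft", legend = levels(cyl_group), col = group_cols, pch = 19,
       title = "Cylinders")
```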

Writing Script vs. Using Console

Figure 0

Pre-loaded Data Sets in R
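
A short sketch of how the pre-loaded data sets can be inspected. It assumes the vehicle examples in this deck (horsepower, cylinder count, transmission) come from the built-in mtcars data set, which the slides do not name explicitly.

```r
# Names of the data sets shipped with the 'datasets' package
available = data(package = "datasets")$results[, "Item"]
head(available)

# mtcars is available without loading any package or file
dim(mtcars)    # number of rows (cars) and columns (variables)
head(mtcars)   # first six rows
str(mtcars)    # type and first values of each column

# help(mtcars) describes each column, e.g. am: 0 = automatic, 1 = manual
```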

Why are we typing out everything?...

  • So you can make mistakes:
    • “That was case sensitive?”
    • “I forgot a parenthesis/bracket?”
    • “That wasn't a period/comma/semi-colon?”
  • We learn better from mistakes
  • We can help each other out here

Follow along with the red line numbers

Exercise #1a: Histogram

Figure 11b

Exercise #1b: Density Plot

Figure 11c

Exercise #2: Box Plot

Figure 12b

Exercise #3: Scatter Plot

Figure 13b

Basic Aesthetics

Figure 16b
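
A sketch of commonly used base R plotting arguments, assuming the built-in mtcars data set. Which aesthetics the slide actually demonstrates is not recoverable from this text, so the particular arguments and values below are illustrative only.

```r
data(mtcars)

plot(mtcars$disp, mtcars$hp,
     main = "Horsepower vs. displacement",   # plot title
     xlab = "Displacement (cu. in.)",        # x-axis label
     ylab = "Gross horsepower",              # y-axis label
     col  = "steelblue",                     # point colour
     pch  = 17,                              # point shape (filled triangle)
     cex  = 1.5,                             # point size multiplier
     xlim = c(0, 500), ylim = c(0, 350))     # axis ranges

# Add a dashed linear trend line on top of the points
fit = lm(hp ~ disp, data = mtcars)
abline(fit, col = "red", lty = 2, lwd = 2)
```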

Saving Visualizations

Figure 0
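
A minimal sketch of saving a plot from code rather than through the RStudio export menu, assuming base R graphics devices. The file name is hypothetical, and the temporary directory is used only so the sketch does not clutter a project folder.

```r
out_file = file.path(tempdir(), "mpg_by_cyl.pdf")   # hypothetical file name

pdf(out_file, width = 7, height = 5)   # open a PDF device (size in inches)
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders", ylab = "Miles per gallon")
dev.off()                              # close the device so the file is written

file.exists(out_file)
```

png(), jpeg(), and tiff() follow the same open, plot, dev.off() pattern, with width and height given in pixels by default.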

Summarize for me...

  1. What are some advantages R has over Excel for data visualization?
  2. Why is it important to have code define a visualization?
  3. Anything weird or counterintuitive?
  4. T/F: Shared data and code will output the same exact visualization
  5. T/F: Anything spreadsheet software can visualize, R can visualize better
  6. T/F: You need to pay for “Premium Enterprise R” to make violin plots
  7. #4, TRUE; #5, TRUE; #6, FALSE!!!

Questions?

  • Just try new code and see what happens; this isn't wet lab
  • Google: Don't just search with “R”, use “R language”
  • Stack Overflow: Use tag “[r]”

Figure F1

Keep going, you got this!

Figure 0

App: The Assignment Operator

  • Use “<-” or “=”?
  • I use “=” for consistency (with other programming languages) with no issues
  • Consistency even within R: function arguments are already written with “=”

Figure S2
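
A small sketch of how the two operators behave, using only base R. At the top level they are interchangeable; the one difference worth knowing is inside a function call, where “=” names an argument and “<-” performs an assignment.

```r
x = 5
y <- 5
identical(x, y)             # TRUE: both forms assign at the top level

# Inside a call, = names an argument and does not touch the global x
m = mean(x = c(1, 2, 3))
x                           # still 5

# Inside a call, <- assigns z in the calling environment, then passes the value
m2 = mean(z <- c(1, 2, 3))
z                           # 1 2 3
```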

App: Tilde Operator ("~")

  • Operator for use in “formulas”, used in many different functions and packages
  • E.g. “y ~ x” means y (or whatever is on the left) is the response to x (or whatever is on the right)
  • E.g. “mpg ~ cyl + am” means mpg depends on cylinder count and transmission type (automatic or manual)

> help("~")
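
A short sketch showing that a formula is an ordinary R object that can be stored once and passed to different functions. It assumes the built-in mtcars data set; the variable names f, group_means, and fit are illustrative.

```r
data(mtcars)

f = mpg ~ cyl + am
class(f)       # "formula"
all.vars(f)    # "mpg" "cyl" "am"

# The same formula drives a plot, a grouped summary, and a linear model
boxplot(f, data = mtcars,
        xlab = "Cylinders.Transmission (0 = automatic, 1 = manual)",
        ylab = "Miles per gallon")
group_means = aggregate(f, data = mtcars, FUN = mean)
fit         = lm(f, data = mtcars)

group_means
coef(fit)
```

Each function interprets the right-hand side in its own way: boxplot() and aggregate() treat cyl and am as grouping variables, while lm() treats them as predictors.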

App: NIH Proficiency Scale

  • 1. Fundamental Awareness (basic knowledge): Common knowledge/understanding of basic techniques/concepts
  • 2. Novice (limited experience): Expected to need help when performing this skill
  • 3. Intermediate (practical application): Able to successfully complete tasks; expert required occasionally
  • 4. Advanced (applied theory): Able to successfully complete tasks without assistance
  • 5. Expert (recognized authority): Can provide guidance, troubleshooting, and answers related to this skill
  • Source: https://hr.nih.gov/working-nih/competencies/competencies-proficiency-scale