Data Visualization: Theory and Application (in R)

Athit Kao, PhD
UCI Bioinformatics Support Group - May 23, 2018

whoami

  • R enthusiast

  • PhD, Biomedical Sciences, UCI

  • IGB Biomedical Informatics Training program alumnus

  • Focus on proteomics mass spectrometry and bioinformatics

Prerequisites

  • Understand content from previous deck: http://learn.athitkao.com/presentation_datavis1.html

  • Basic programming experience necessary

  • R and RStudio installed (for 2nd half)

  • Consider difficulties encountered using spreadsheet programs, e.g.

    • Generating a scatter plot with tens of thousands of datapoints
    • Replicating a specific figure formatting style for new data
    • Documenting changes made to source data of figure

Visualizing Exploratory Data Analysis

  • Dive in

  • Visualize everything

  • You might find something new and interesting

  • You can do this much faster in R than in spreadsheet software

  • Reminder: Quick visuals don't need to be fancy or polished
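As a rough illustration of "quick, not polished", the sketch below takes a first look at a dataset using only base R. The built-in iris dataset is an assumption chosen for convenience; it is not necessarily the data used in the talk.

```r
# Quick, unpolished looks at a built-in dataset (iris ships with R).
data(iris)

str(iris)       # column types and a preview of the values
summary(iris)   # per-column summaries

# One line per plot, default styling everywhere.
hist(iris$Sepal.Length)                        # distribution of one variable
boxplot(Sepal.Length ~ Species, data = iris)   # one variable split by group
pairs(iris[, 1:4], col = iris$Species)         # all pairwise scatter plots
```

None of these plots is publication-ready, and that is the point: each takes seconds to produce and may reveal a pattern worth a closer look.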

Information Design: K.I.S.S.

Figure 20c

Concept: Layering

Figure 21

  • This concept is used in base and ggplot2 plotting systems
  • Build up to your final plot by adding layers of customizations
  • ggplot2 automatically adjusts the axes as layers add data; base keeps the axes set by the first plot() call
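The axis behavior described above can be sketched with two small series, where the second extends past the range of the first. The data frames `first` and `second` are made up for illustration.

```r
library(ggplot2)

# Two small made-up series; the second extends beyond the range of the first.
first  <- data.frame(x = 1:5,  y = c(2, 4, 3, 5, 4))
second <- data.frame(x = 4:10, y = c(6, 7, 9, 8, 10, 12, 11))

# base: plot() fixes the axis limits from `first`. points() draws onto that
# canvas, so values of `second` outside the limits are not visible.
plot(first$x, first$y, pch = 16)
points(second$x, second$y, pch = 16, col = "red")

# base, adjusted by hand: choose limits up front that cover both series.
plot(first$x, first$y, pch = 16,
     xlim = range(first$x, second$x),
     ylim = range(first$y, second$y))
points(second$x, second$y, pch = 16, col = "red")

# ggplot2: each `+` adds a layer, and the axes cover the data of all layers.
p_layers <- ggplot(first, aes(x = x, y = y)) +
  geom_point() +
  geom_point(data = second, colour = "red") +
  labs(title = "Two layers, one shared set of axes")
print(p_layers)
```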

Figure 0

Data Formats for Repeated Measures

Figure 18b

Limitations of Excel for Modern STEM

  • Fixed visualization types
  • Good luck plotting hundreds to thousands of megabytes of data (Excel has a limit of 32K datapoints/series)
  • E.g. Scatterplot with ~100K datapoints (3 x 32K = 3.7 MB)

Figure 22
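For comparison, a scatter plot of the same size (3 series x 32K points) takes a few lines in base R. This is a minimal sketch with simulated data; the series names and distributions are assumptions, not the data behind the figure.

```r
set.seed(1)              # reproducible random numbers
n_per_series <- 32000    # 3 x 32K points, as in the example above
series <- rep(c("A", "B", "C"), each = n_per_series)

# Hypothetical data: three overlapping 2-D normal clouds.
centre <- c(A = 0, B = 2, C = 4)
d <- data.frame(
  series = series,
  x = rnorm(3 * n_per_series, mean = centre[series]),
  y = rnorm(3 * n_per_series, mean = centre[series])
)

# pch = "." draws a very small dot per point, which keeps rendering fast.
plot(d$x, d$y, pch = ".",
     col = as.integer(factor(d$series)) + 1,   # colours 2, 3, 4
     xlab = "x", ylab = "y", main = "96,000 points in base R")
legend("topleft", legend = c("A", "B", "C"), col = 2:4, pch = 16)
```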

Figure 0

Limitations of Excel for Modern STEM

  • E.g. Scatterplot with 1 million datapoints (3 x 333K = 40 MB, didn't attempt with Excel)

Figure 23a
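At one million points, overplotting becomes the main problem. The sketch below shows two common ways to handle it in ggplot2, using simulated data; the data frame `big` and the bin count are illustrative assumptions.

```r
library(ggplot2)

set.seed(1)
n <- 1e6                          # one million points, simulated
big <- data.frame(x = rnorm(n), y = rnorm(n))
big$y <- big$y + 0.5 * big$x      # add a mild correlation

# Option 1: draw every point, tiny and semi-transparent.
p_points <- ggplot(big, aes(x, y)) +
  geom_point(shape = ".", alpha = 0.1)

# Option 2: bin the points into a 2-D grid and colour each cell by count.
p_binned <- ggplot(big, aes(x, y)) +
  geom_bin2d(bins = 200)

print(p_binned)   # print(p_points) also works, but takes longer to draw
```

Binning usually renders faster and produces smaller files, at the cost of hiding individual points.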

Figure 0

Visualization Systems in R

  • Three main systems with new ones always being developed
  • base: Original system for R
    • Build up from a blank canvas
    • Used in the previous presentation
  • lattice:
    • Visualizations made with a single function call
    • Not covered here
  • ggplot2: install.packages("ggplot2")
    • Hybrid between base and lattice, highly customizable
    • Covered in this presentation
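To make the differences concrete, here is the same scatter plot with a fitted line in each of the three systems. The built-in mtcars dataset is an assumption chosen for convenience; lattice is installed with R, while ggplot2 needs the install step above.

```r
# base: start from a canvas, then add to it with further calls
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(lm(mpg ~ wt, data = mtcars), col = "red")

# lattice: a single function call describes the whole figure
library(lattice)
p_lattice <- xyplot(mpg ~ wt, data = mtcars,
                    type = c("p", "r"),   # points plus a regression line
                    xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
print(p_lattice)

# ggplot2: a plot object built up from layers
library(ggplot2)
p_gg <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "red") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
print(p_gg)
```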

Exercise #1: Data Formats

Figure 0
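The exercise code itself is not reproduced here. As a stand-in, the sketch below assumes the two formats in question are "wide" (one column per measurement) and "long" (one row per measurement), and converts between them with base R's reshape(). The data frame and its column names are hypothetical.

```r
# Hypothetical repeated-measures data: 3 subjects measured at 3 time points.
# Wide format: one row per subject, one column per time point.
wide <- data.frame(
  subject = c("s1", "s2", "s3"),
  t0 = c(5.1, 4.8, 5.5),
  t1 = c(5.9, 5.0, 6.1),
  t2 = c(6.4, 5.3, 6.8)
)

# Long format: one row per subject-and-time combination.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("t0", "t1", "t2"),   # columns to stack
  v.names   = "value",               # name of the stacked column
  timevar   = "time",                # name of the new label column
  times     = c("t0", "t1", "t2"),   # labels written into `time`
  idvar     = "subject"
)
rownames(long) <- NULL               # drop the generated row names
long
```

ggplot2 generally expects the long format, since each aesthetic (x, y, colour) maps to a single column.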

Exercise #2: Density Plot (ggplot2)

Figure 0
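The exercise code is not reproduced here. A minimal density plot in ggplot2 might look like the following sketch, which assumes the built-in iris dataset rather than the data used in the talk.

```r
library(ggplot2)

# One density curve per species; alpha makes overlapping areas visible.
p_density <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.4) +
  labs(x = "Sepal length (cm)", y = "Density")
print(p_density)
```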

Exercise #3: Box Plot (ggplot2)

Figure 0
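The exercise code is not reproduced here. A minimal box plot in ggplot2 might look like the following sketch, again assuming the built-in iris dataset. Overlaying the raw points is an optional extra, not something the exercise is known to include.

```r
library(ggplot2)

p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(outlier.shape = NA) +                 # hide outlier marks; raw points are drawn next
  geom_jitter(width = 0.15, height = 0, alpha = 0.5) +  # raw data, spread horizontally only
  labs(x = "Species", y = "Sepal length (cm)")
print(p_box)
```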

Exercise #4: Scatter Plot with Layering (ggplot2)

Figure 0
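The exercise code is not reproduced here. The sketch below shows one way to build a scatter plot layer by layer, saving each step so it can be printed and inspected. The mtcars dataset and the specific layers are assumptions.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only; this draws empty axes.
p1 <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: add the points, coloured by number of cylinders.
p2 <- p1 + geom_point(aes(colour = factor(cyl)))

# Step 3: add a linear fit across all points.
p3 <- p2 + geom_smooth(method = "lm", formula = y ~ x,
                       se = FALSE, colour = "black")

# Step 4: labels and theme are customizations, not data layers.
p4 <- p3 +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders") +
  theme_bw()

print(p4)   # print p1, p2, p3 in turn to see the plot build up
```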

Saving Visualizations, part II

Figure 0
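The slide content is not reproduced here. As a hedged sketch, these are two standard ways to write a plot to a file: ggsave() for ggplot2 objects and a graphics device for base plots. The file names and the use of tempdir() are placeholders; point `out_dir` at a project folder in practice.

```r
library(ggplot2)

p_save <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

out_dir <- tempdir()   # placeholder output folder

# ggplot2: ggsave() picks the file type from the extension.
# Width and height are in inches by default.
gg_file <- file.path(out_dir, "scatter_ggplot2.pdf")
ggsave(gg_file, plot = p_save, width = 6, height = 4)

# base: open a device, draw, then close the device to finish the file.
base_file <- file.path(out_dir, "scatter_base.pdf")
pdf(base_file, width = 6, height = 4)
plot(mtcars$wt, mtcars$mpg)
dev.off()
```

Changing the extension to ".png" in the ggsave() call (and setting dpi) produces a raster image instead; for base plots, png() takes the place of pdf().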

Summarize for me...

  1. Why are we typing out everything instead of copy/paste from the presentation?
  2. Why would you still use the base graphics system?
  3. What is an aspect of clean information design?
  4. T/F: You can effectively explore new data without visuals
  5. T/F: You can generate a ggplot2 visual in multiple lines of code
  6. T/F: You can generate a ggplot2 visual in just one line of code
  7. Answers: #4, FALSE; #5, TRUE; #6, TRUE

Questions?

  • Just try new code and see what happens; this isn't wet lab
  • Google: Don't just search with “R”, use “R language”
  • Stack Overflow: Use tag “[r]”

Figure F1

Keep going, you got this!

Figure 0

App: ~100K Datapoints

Figure 0

App: One Million Datapoints

Figure 0

App: NIH Proficiency Scale

  • 1. Fundamental Awareness (basic knowledge): Common knowledge/understanding of basic techniques/concepts
  • 2. Novice (limited experience): Expected to need help when performing this skill
  • 3. Intermediate (practical application): Able to successfully complete tasks; expert required occasionally
  • 4. Advanced (applied theory): Able to successfully complete tasks without assistance
  • 5. Expert (recognized authority): Can provide guidance, troubleshooting, and answers related to this skill
  • Source: https://hr.nih.gov/working-nih/competencies/competencies-proficiency-scale