Functions: Theory and Application (in R)

Athit Kao, PhD
UCI Bioinformatics Support Group - April 25, 2018

whoami

  • R enthusiast

  • PhD, Biomedical Sciences, UCI

  • IGB Biomedical Informatics Training program alumnus

  • Focus on proteomics mass spectrometry and bioinformatics

Prerequisites

  • Understand content from previous deck: http://learn.athitkao.com/presentation_functions1.html

  • Basic programming experience necessary

  • R and RStudio installed

  • Consider a simple use case for programming in your field, e.g.:

    • Filter specific data from file
    • Munge proprietary data format into CSV
    • Calculate RPKM for sequencing data
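As one possible illustration of the last use case, here is a minimal sketch of an RPKM function. It assumes the usual definition (reads mapped to a gene, times 1e9, divided by the product of gene length in base pairs and total mapped reads). The function name calculate_rpkm, its argument names, and the two example genes are hypothetical.

```r
# Hypothetical helper: RPKM = reads * 1e9 / (gene length in bp * total mapped reads).
# total_reads defaults to the sum of the supplied counts (an assumption for this sketch).
calculate_rpkm = function(counts, gene_lengths, total_reads = sum(counts)) {
  counts * 1e9 / (gene_lengths * total_reads)
}

# Made-up example values, not real data
counts = c(geneA = 500, geneB = 1500)
gene_lengths = c(geneA = 1000, geneB = 3000)

rpkm = calculate_rpkm(counts, gene_lengths)
print(rpkm)
```

Both genes come out with the same RPKM here because geneB has three times the reads but is also three times as long.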

Readability, part II

  • Encapsulation via functions is an aspect of code legibility
  • Others include:
    • Documentation (comments and READMEs)
    • Formatting (indentation and line wrapping)
    • Variable naming (obvious and consistent names)
  • There is a balance appropriate for you/your audience: Figure 10

Generalized Concept of Functions

Figure 2 Figure 4

Function Argument Order
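A short sketch of how R matches arguments may help here. R matches named arguments first and fills the remaining ones by position, so order only matters for unnamed arguments. The function power and its arguments are hypothetical names chosen for illustration.

```r
# Hypothetical function with one required and one default argument
power = function(base, exponent = 2) {
  base ^ exponent
}

by_default    = power(3)                      # exponent falls back to its default
by_position   = power(2, 3)                   # base = 2, exponent = 3
by_name       = power(exponent = 3, base = 2) # order is irrelevant when named
mixed         = power(exponent = 3, 2)        # named first, then 2 fills base
swapped_order = power(3, 2)                   # base = 3, exponent = 2

print(c(by_default, by_position, by_name, mixed, swapped_order))
```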

Anonymous Functions (in lapply)
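A minimal sketch of an anonymous function passed to lapply follows. The function is defined inline and never given a name. The list values and its contents are made up for illustration.

```r
# Made-up input: a named list of numeric vectors
values = list(a = 1:3, b = c(10, 40))

# The anonymous function is applied to each element in turn
ranges = lapply(values, function(v) max(v) - min(v))

print(ranges)
```

lapply always returns a list with the same names as its input, which is why ranges can be indexed as ranges$a and ranges$b.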

Multiple Arguments (in lapply)

  • First argument always the iterated variable
  • Must explicitly name additional arguments Figure 19
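The two points above can be sketched as follows. Each list element is passed as the first argument of the function, and the remaining arguments are supplied by name after the function in the lapply call. The function scale_and_shift and its arguments are hypothetical.

```r
# Hypothetical function: v receives each list element, the rest are extra arguments
scale_and_shift = function(v, factor, offset = 0) {
  v * factor + offset
}

values = list(a = 1:3, b = 4:6)

# Extra arguments are named so there is no doubt which parameter they fill
result = lapply(values, scale_and_shift, factor = 10, offset = 1)

print(result)
```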

The Assignment Operator

  • Use “<-” or “=”?
  • I use “=” for consistency (with other programming languages) with no issues
  • Consistency even within R
  • Figure 30
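A small sketch of where the two operators behave the same and where they differ. At the top level both assign a variable. Inside a function call, "=" names an argument while "<-" still performs an assignment in the calling environment. All variable names here are made up for illustration.

```r
# At the top level the two operators are interchangeable
a = c(1, 2, 3)
b <- c(1, 2, 3)
same = identical(a, b)

# Inside a call, "=" only names the argument: no variable "x" is created
mean_named = mean(x = c(1, 2, 3))
x_exists_after_equals = exists("x", inherits = FALSE)

# Inside a call, "<-" assigns "x" in the calling environment, then passes its value
mean_arrow = mean(x <- c(1, 2, 3))
x_exists_after_arrow = exists("x", inherits = FALSE)

print(c(same, x_exists_after_equals, x_exists_after_arrow))
```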

Environments and Variable Scoping in R

  • Scoping refers to the visibility of variables in different environments
  • Global: Can be referenced from anywhere
  • Local: Accessible only within its environment
  • http://adv-r.had.co.nz/Environments.html
  • R uses “lexical” (a.k.a. static) scoping rules
  • Figure 31
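The scoping rules above can be sketched in a few lines. Under lexical scoping, a function looks up free variables in the environment where it was defined, not where it is called. All names below are hypothetical.

```r
scale_factor = 10                  # global: visible from anywhere

make_scaler = function() {
  scale_factor = 2                 # local: exists only inside make_scaler
  function(v) v * scale_factor     # defined here, so it sees the local value
}

scaler = make_scaler()
lexical_result = scaler(5)         # uses scale_factor = 2 from the defining environment

use_global = function(v) v * scale_factor
global_result = use_global(5)      # no local scale_factor, so the global one is found

local_only = function() {
  value_inside_function = 1        # never visible outside this function
  value_inside_function
}
returned_value = local_only()
leaked = exists("value_inside_function")

print(c(lexical_result, global_result, leaked))
```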

Microsoft Excel

Pros

  • Quick editing and prototyping
  • WYSIWYG plots
  • Widely known and used program
  • Many similar alternatives (e.g. Sheets, Calc, Numbers, etc.)
  • Programming/automation with Visual Basic

Cons

  • Limit of 1,048,576 rows by 16,384 columns per worksheet
  • 32-bit version limited to 2 GB of memory
  • Manual editing error-prone
  • Complex visualizations not possible
  • Third-party packages close to non-existent (e.g. advanced statistics, machine learning, etc.)
  • Closed source

R Programming Language

Pros

  • Quick prototyping towards minimum viable product
  • Can generate complex interactive visualizations (plots, reports, etc.)
  • No practical limit on data size
  • Easily reproduce results and adapt code to changes
  • You can do almost anything purely in R

Cons

  • Steep learning curve
  • Third-party package developers not forced to follow a set standard
  • Open source

Split-Apply-Combine Paradigm

Figure 0
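A minimal base-R sketch of the three steps: split the data by group, apply a function to each piece, and combine the results. The data frame measurements and its columns are made-up illustration data.

```r
# Made-up data: intensities for two samples
measurements = data.frame(
  sample    = c("A", "A", "B", "B", "B"),
  intensity = c(1, 3, 2, 4, 6)
)

# Split: one numeric vector per sample
pieces = split(measurements$intensity, measurements$sample)

# Apply: compute the mean of each piece
means = lapply(pieces, mean)

# Combine: collapse the list back into a named vector
combined = unlist(means)

print(combined)
```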

Multithreading Split-Apply-Combine
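One way to parallelize the apply step is with the parallel package that ships with R. This is a sketch under assumptions: it uses two worker processes (not threads in the strict sense) through makeCluster and parLapply, and the input list is made up. Other approaches, such as mclapply on Unix-like systems, are also possible.

```r
library(parallel)

# Made-up pieces, as if produced by a previous split step
pieces = list(a = 1:3, b = 4:6, c = 7:9)

# Start two worker processes (the count is an arbitrary choice for this sketch)
cluster = makeCluster(2)

# Apply: each piece is sent to a worker; results come back in input order
results = parLapply(cluster, pieces, function(v) sum(v^2))

# Always release the workers when done
stopCluster(cluster)

# Combine
combined = unlist(results)
print(combined)
```

For work this small the cost of starting workers outweighs any gain; the pattern pays off when each piece takes a noticeable amount of time.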