Towards a reproducible and reusable publication and analysis workflow

Rad Suchecki

Nathan Watson-Haigh

Stuart Stephen

Alex Whan

Tuesday, 29 May, 2018

Reproducible manuscript - why?

  • To avoid errors
    • Inconsistencies between published results and the reported methodology are widely documented
  • To promote computational reproducibility
    • Other people (and you!) can take your data and get the same numbers that are in your paper
    • Document must specify where ALL the numbers come from
    • Otherwise the numbers are hard to recover, even in the absence of errors
  • To create documents which can be revised easily
    • New data, updated software or requests from reviewers can be incorporated much more easily

Ingredients for reproducibility in data science

  • data
>1 dna:chromosome chromosome:TAIR10:1:1:30427671:1 REF
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATGAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTTATCGTTTTTATGTAATTGCTTA
TTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTACATTTGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTT
ATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTTGGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATTTAGTTGTAGGGATGAAGTCTTTCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGG
  • code
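#!/bin/bash
# Count records where fields 4 and 5 of the "|"-delimited first column equal column 3 and column 4 minus 1.
# Prints: matching count, total records, fraction matching (0 if there is no input).
awk -v OFS="\t" '{split($1,sim,"|");if(sim[4]==$3 && sim[5]==$4-1){count++}};END{print count+0,NR,(NR?count/NR:0)}' "${@:-/dev/stdin}"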
  • compute environment
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  base     

other attached packages:
[1] revealjs_0.9     kableExtra_0.9.0 rmarkdown_1.9   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17      rstudioapi_0.7    knitr_1.20       
 [4] xml2_1.2.0        magrittr_1.5      hms_0.4.2        
 [7] rvest_0.3.2       munsell_0.4.3     viridisLite_0.3.0
[10] colorspace_1.3-2  R6_2.2.2          rlang_0.2.0      
[13] plyr_1.8.4        stringr_1.3.1     httr_1.3.1       
[16] tools_3.4.4       htmltools_0.3.6   yaml_2.1.18      
[19] rprojroot_1.3-2   digest_0.6.15     tibble_1.4.2     
[22] readr_1.1.1       evaluate_0.10.1   stringi_1.2.2    
[25] compiler_3.4.4    pillar_1.2.1      methods_3.4.4    
[28] scales_0.5.0      backports_1.1.2   pkgconfig_2.0.1  
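
The three ingredients above can be recorded next to a result so that it can later be traced back to its inputs. The following is a minimal sketch only, not part of the workflow presented here: the function name `record_provenance`, its arguments and the tab-separated output layout are all hypothetical, and it assumes `sha256sum` (GNU coreutils) and `uname` are available.

```shell
#!/bin/bash
# Hypothetical helper (an illustration, not the authors' method).
# Usage: record_provenance <data_file> <script_file> <output_file>
record_provenance() {
  local data_file="$1" script_file="$2" out_file="$3"
  {
    # data: a checksum pins the exact input file
    printf 'data\t%s\n' "$(sha256sum "$data_file")"
    # code: a checksum pins the exact script that was run
    printf 'code\t%s\n' "$(sha256sum "$script_file")"
    # compute environment: kernel name/release and shell version
    printf 'kernel\t%s\n' "$(uname -sr)"
    printf 'bash\t%s\n' "$BASH_VERSION"
  } > "$out_file"
}
```

Calling `record_provenance genome.fa count.sh provenance.tsv` (file names are placeholders) would write four labelled lines. Checksums and version strings only identify what was used; reproducing the environment itself still needs something like the full session information shown above.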

BioKanga

Aims

  • evaluation of BioKanga's sequence alignment module vs state-of-the-art tools
  • turn-key reproducibility
  • re-usability