Wednesday, July 10, 2019

Snakemake

I've been evaluating different workflow languages, and it turns out I like Snakemake a lot!

I've converted the variant calling pipeline at https://gencore.bio.nyu.edu/variant-calling-pipeline/ to Snakemake; the result is at https://github.com/arthurtsang/variant-calling-pipeline

What I like about it:


  1. The rule syntax is clean and easy to follow. It reads a lot like YAML, although a Snakefile is really a Python-based DSL.
  2. Wildcards are very useful and easy to understand.
    • If one rule declares "{sample}.fastq" as its output and another rule lists "A.fastq", "B.fastq" and "C.fastq" as inputs, the first rule is executed three times, once each for A, B and C.
  3. The expand function is also very helpful.
    • It tripped me up a bit at first that a wildcard inside expand cannot refer to a global variable directly, i.e. if we have SAMPLES=["A", "B", "C"] defined, expand( "/some/directory/{SAMPLES}.fastq") won't work. The correct syntax is expand( "/some/directory/{samples}.fastq", samples = SAMPLES ).
  4. Integration with Conda.
    • Unfortunately, having a conda directive in every rule clutters the code a bit, but it works out really nicely and is easy to use.
    • I haven't tried it, but it should be possible to host a custom channel somewhere in our infrastructure to distribute private binaries.
  5. Using filenames to build the DAG.
    • Rules are connected by matching the output of one rule to the input of another. It's kind of like how Spring works out which bean to create first.
    • The limitation is that everything is file based, i.e. if a step doesn't really need to generate a file, we have to touch an empty state file so that Snakemake can build the DAG.
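    • Also, Linux-style pipes are not the default. Rules hand data to each other through files, so you can't simply pipe the result of one rule into another. Snakemake does document a pipe() output flag for streaming between rules, but I haven't tried it.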
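
To make points 2 and 3 a bit more concrete, here is a rough plain-Python sketch of the two ideas. It is not Snakemake's implementation, and the helper names expand_sketch and match_wildcards are made up for illustration. It only shows why expand needs the values passed in as a keyword argument, and how matching a requested filename against an output pattern yields one job per sample.

```python
import re
from itertools import product

SAMPLES = ["A", "B", "C"]


def expand_sketch(pattern, **wildcards):
    """Fill every {name} in pattern with each combination of the given values.

    Hypothetical stand-in for the idea behind expand(); values come only
    from the keyword arguments, never from global variables.
    """
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


def match_wildcards(output_pattern, target):
    """Return the wildcard values that make output_pattern equal target, or None."""
    # Splitting on {name} leaves literal text at even indices
    # and wildcard names at odd indices.
    parts = re.split(r"\{(\w+)\}", output_pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


# Point 3: the values are handed over explicitly as a keyword argument.
targets = expand_sketch("/some/directory/{samples}.fastq", samples=SAMPLES)

# Point 2: each requested file is matched against the output pattern,
# which gives one job (one set of wildcard values) per sample.
jobs = [match_wildcards("{sample}.fastq", name + ".fastq") for name in SAMPLES]
```

In this sketch, calling expand_sketch("/some/directory/{SAMPLES}.fastq") without a keyword argument raises a KeyError, because the placeholder is only ever looked up among the arguments and never among the globals. That is the same kind of mistake that tripped me up with the real expand.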
