I've converted the variant-calling pipeline at https://gencore.bio.nyu.edu/variant-calling-pipeline/ to Snakemake; the result is at https://github.com/arthurtsang/variant-calling-pipeline.
What I like about it:
- The YAML-like rule syntax (strictly speaking a Python-based DSL) is clean and easy to follow.
- Wildcards are very useful and easy to understand.
  - If one rule declares "{sample}.fastq" as its output and another rule takes "A.fastq", "B.fastq" and "C.fastq" as input, the first rule runs three times, once each for A, B and C.
- expand is also very helpful.
  - It tripped me up a bit at first that a wildcard in the expand function is not filled in from a global variable of the same name. For example, with SAMPLES = ["A", "B", "C"] defined, expand("/some/directory/{SAMPLES}.fastq") won't work. The correct syntax is expand("/some/directory/{samples}.fastq", samples=SAMPLES).
- Integration with Conda.
  - Unfortunately, having a conda directive in every rule clutters the code a bit, but it works really well and is easy to use.
  - I haven't tried it, but it should be possible to host a custom channel somewhere in our infrastructure to distribute private binaries.
- Using filenames to build the DAG.
  - Rules are connected by matching the output of one rule to the input of another. It's kind of like how Spring works out which bean to create first.
  - The limitation is that everything is file-based: if a step doesn't really need to generate a file, we have to touch an empty state file so that Snakemake can build the DAG.
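  - Also, I didn't find support for Linux-style pipes, so I couldn't pipe the result of one rule straight into another. Newer Snakemake versions do provide a pipe() output flag for this, which I haven't tried.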
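
To make the wildcard and expand points above concrete, here is a minimal sketch in plain Python that mimics the behaviour. It is not Snakemake's actual implementation: `expand_sketch` and `match_wildcards` are hypothetical helpers written for illustration, and they only handle simple patterns in which each wildcard appears once.

```python
import re
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill every {name} in pattern with each combination of the given values.

    Simplified stand-in for Snakemake's expand(), for illustration only.
    """
    names = list(wildcards)
    # Accept a single string as well as a list of values.
    value_lists = [[v] if isinstance(v, str) else list(v) for v in wildcards.values()]
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*value_lists)
    ]


def match_wildcards(pattern, filename):
    """Return the wildcard values that turn pattern into filename, or None."""
    # Split "{sample}.fastq" into literal text and wildcard names, then build
    # a regex with one named group per wildcard.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None


SAMPLES = ["A", "B", "C"]

# The placeholder name is bound by the keyword argument, not by a global.
targets = expand_sketch("/some/directory/{samples}.fastq", samples=SAMPLES)
print(targets)

# Each requested file is matched against the rule's output pattern, which is
# why a "{sample}.fastq" rule runs once per requested sample.
for requested in ["A.fastq", "B.fastq", "C.fastq"]:
    print(requested, "->", match_wildcards("{sample}.fastq", requested))

# Without a keyword argument there is nothing to fill {SAMPLES} with,
# even though a global called SAMPLES exists.
try:
    expand_sketch("/some/directory/{SAMPLES}.fastq")
except KeyError as err:
    print("no value given for wildcard", err)
```

The sketch fails with a plain KeyError in the last case; Snakemake reports its own error for a wildcard with no value, but the cause is the same: the name inside the braces is only looked up among the keyword arguments passed to expand.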