I've converted the variant calling pipeline at https://gencore.bio.nyu.edu/variant-calling-pipeline/ to use Snakemake; the result is at https://github.com/arthurtsang/variant-calling-pipeline.
What I like about it:
- The YAML-like syntax is clean and easy to follow.
- Wildcards are very useful and easy to understand.
  - If one rule has "{sample}.fastq" as its output and another rule takes "A.fastq", "B.fastq", and "C.fastq" as input, the first rule is executed once each for A, B, and C.
- expand is also very helpful.
  - It tripped me up a bit initially that a wildcard in the expand function cannot refer to a global variable directly. That is, if we have SAMPLES=["A", "B", "C"] defined, expand("/some/directory/{SAMPLES}.fastq") won't work. The correct syntax is expand("/some/directory/{samples}.fastq", samples=SAMPLES).
- Integration with Conda.
  - Unfortunately, it clutters the code a bit to have a conda directive in every rule, but it works out really nicely and easily.
  - I haven't tried it, but it should be possible to host a custom channel somewhere in the infrastructure to distribute private binaries.
- Using filenames to build the DAG.
  - Rules are connected by matching the output of one rule to the input of another. It's a bit like how Spring works out which bean to create first.
  - The limitation is that everything is file-based. That is, if a step doesn't really need to generate a file, we have to touch an empty state file for Snakemake to build the DAG.
  - Also, I found no support for Linux-style pipes: you can't pipe the result of one rule into another.
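
To make the wildcard and expand points above concrete, here is a minimal plain-Python sketch of the behaviour. It is not Snakemake's implementation: expand_sketch and match_wildcards are hypothetical helpers written only for illustration, and they assume each wildcard name appears at most once per pattern. The sketch shows that a pattern's placeholders are filled only from the keyword arguments you pass, never from a global like SAMPLES, and that one "{sample}.fastq" output pattern matches each requested file separately.

```python
import itertools
import re
import string

SAMPLES = ["A", "B", "C"]


def expand_sketch(pattern, **wildcards):
    """Hypothetical stand-in for expand: fill each {name} from the keyword arguments."""
    names = list(wildcards)
    # Accept a single string as well as a list of values.
    value_lists = [[v] if isinstance(v, str) else list(v) for v in wildcards.values()]
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in itertools.product(*value_lists)
    ]


def match_wildcards(pattern, filename):
    """Hypothetical helper: return the wildcard values that turn pattern into filename."""
    regex = ""
    for literal, field, _spec, _conversion in string.Formatter().parse(pattern):
        regex += re.escape(literal)
        if field is not None:
            regex += f"(?P<{field}>.+)"
    match = re.fullmatch(regex, filename)
    return match.groupdict() if match else None


# One output pattern, three requested inputs: the rule would run once per match.
for fastq in ["A.fastq", "B.fastq", "C.fastq"]:
    print(fastq, match_wildcards("{sample}.fastq", fastq))

# The placeholder name is bound through a keyword argument.
print(expand_sketch("/some/directory/{samples}.fastq", samples=SAMPLES))

# Naming the global inside the braces binds nothing, so formatting fails.
try:
    expand_sketch("/some/directory/{SAMPLES}.fastq")
except KeyError as missing:
    print("no value supplied for wildcard", missing)
```

In this sketch the failing call surfaces as a KeyError; Snakemake reports the problem in its own way, but the cause is the same: the name inside the braces is a placeholder, not a variable lookup.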
Hello Arthur, have you tried Nextflow (a workflow framework)? I'd be curious to know your thoughts.
I just published an updated version of the pipeline (and post) using GATK4 and Nextflow: https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/
I'm a fan of Nextflow, and the pipeline linked in the above comment is a great example.