Wednesday, July 10, 2019

Snakemake

I've been evaluating different workflow languages, and it turns out I like Snakemake a lot!

I've converted https://gencore.bio.nyu.edu/variant-calling-pipeline/ to use Snakemake: https://github.com/arthurtsang/variant-calling-pipeline

What I like about it:


  1. the YAML-like rule syntax is clean and easy to follow.
  2. wildcards are very useful and easy to understand.
    • If one rule has "{sample}.fastq" as its output and another rule takes "A.fastq", "B.fastq" and "C.fastq" as input, the first rule is executed once each for A, B and C.
  3. expand is also very helpful.
    • It tripped me up a bit initially that a wildcard in the expand function cannot refer to a global variable, i.e. if we have SAMPLES=["A", "B", "C"] defined, expand("/some/directory/{SAMPLES}.fastq") won't work.  The correct syntax is expand("/some/directory/{samples}.fastq", samples=SAMPLES).
  4. integration with Conda.  
    • unfortunately, it pollutes the code a bit by requiring a conda directive in every rule, but it works out really nicely and easily.
    • I haven't tried it, but it should be possible to host a custom channel somewhere in the infrastructure to distribute private binaries.
  5. using the filename to build DAG
    • rules are connected by matching the output of one rule to the input of another.  It's kinda like how Spring works out which bean to create first.
    • the limitation is that everything is file based, i.e. if a step doesn't really need to generate a file, we'll have to touch an empty state file for Snakemake to build the DAG.
    • Also, Linux-style pipes are not the default.  Rules hand results to each other through files; streaming the result of one rule into another needs the pipe() output flag (Snakemake 5.0 and later).
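
To make the expand point above concrete, here is a minimal pure-Python sketch of roughly what expand does with keyword arguments. It is an illustration only, not Snakemake's implementation: `expand_sketch` is a hypothetical helper, and the sample names and path are the ones from the example above.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill every combination of the given wildcard values into pattern.

    Hypothetical stand-in for Snakemake's expand(); it only covers the
    simple keyword-argument case described above.
    """
    names = list(wildcards)
    # Accept a single string as well as a list of values.
    value_lists = [
        [v] if isinstance(v, str) else list(v) for v in wildcards.values()
    ]
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*value_lists)
    ]


SAMPLES = ["A", "B", "C"]

# The wildcard name in the pattern is bound by the keyword argument,
# not by a global variable that happens to share its name.
fastq_files = expand_sketch("/some/directory/{samples}.fastq", samples=SAMPLES)
print(fastq_files)
```

In this sketch, calling `expand_sketch("/some/directory/{SAMPLES}.fastq")` without a keyword argument fails with a `KeyError`, which mirrors why the global-variable form does not work: nothing tells the function what `{SAMPLES}` should be replaced with.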

EPIPE write EPIPE error when using Protractor with control flow turned off

I was hit with an `EPIPE write EPIPE` error when running Protractor with control flow turned off (https://github.com/angular/protractor/issues/4294).  It is caused by a missing `await` somewhere in the code.
tslint/IntelliJ is pretty good at warning the developer that a returned promise has been ignored.  However, one case it didn't catch, and which took me by surprise, is the `ElementArrayFinder` returned by `element.all()`.  It's not a promise, but if you try to use it directly, like `element.all().find()` or my favorite `element.all().getText()`, you'll run into the EPIPE error about 0.1% of the time.
Unfortunately, the test I'm working on makes that call about 8000 times…  so it always fails after an hour of running.
Anyway, from the GitHub issue, it appears to be a bug in the selenium driver that is fixed in the latest 4.0.0-beta driver, which Protractor is not using.  The solution is to `await` the `ElementArrayFinder` too; it'll return an `ElementFinder[]` which you can loop through.

Running PlantUML on AWS Lambda

I couldn't get that to work, well, not entirely.  Anyway, here are the issues that I've come across.

My project just started using DDD, and since we have an overseas team, it is a problem that they can't physically participate in the sticky-note exercise.  Inspired by webeventstorming.com (https://www.slideshare.net/OuzhanSoykan/domain-driven-design-developer-summit-turkey/27), I decided to build a plantuml.com-like web page that supports an event storming syntax on top of plantuml.  To be able to work on a diagram together, I used https://firepad.io with `ACE` as the editor; Firepad uses https://firebase.google.com as its realtime database.  The end result is pretty cool, but integrating that (esp. `firebase`) into an Angular app caused quite a bit of pain.

  1. firebase uses a very old version of `grpc` which only supports node 9.x.
  2. npm modules that depend on a specific node version are a mess.  They almost never document which module version supports which node version.
  3. firebase rules (you can set indexing and ACLs on docs) only provide the basics.  You'll have to build the whole user management yourself if you want user groups.
  4. the app used to set `window.location.href` to update the encoded UML in the URL path.  Once authentication was enabled, it went into a loop.
  5. `ng2-ace-editor` didn't really document how to load the theme and mode (for syntax highlighting); luckily we have Google.



After jumping through a few hoops, now I have a home page listing all documents and can edit them collaboratively.

Then it struck me that I could host the app on `AWS`, putting the Angular app on S3 and breaking the server down into API Gateway and Lambda functions (after all, only 2 APIs are needed).  This decision caused me more pain than firebase…

  1. deploying the Angular app on `S3` is very straightforward.  However, since Angular uses the path for routes, the app only works if you navigate from `/`; you get a 404 or 403 if you enter an Angular route in the browser directly (as S3 tries to find a file with that name).  I read somewhere that we can use CloudFront to map the error page to `index.html`, but I couldn't get that to work.
  2. API Gateway and Lambda have a lot of caveats.
    1. the lambda points to the zip/jar file in an S3 bucket.  However, every time you update the zip/jar file, you have to update the lambda again with the same URL to the S3 object.
    2. uploading a large file (10MB) through the web console sucks.
    3. when you're creating a lambda function, ignore the designer's option to set up an API Gateway trigger.  It looks like you can set up the relationship to API Gateway there, but it's a chicken-and-egg problem: you need to publish your API before you can configure the trigger, but you need the lambda set up before you can publish the API.
    4. `plantuml` uses `graphviz`, which is not present in the lambda container.  I packaged it in the zip file and set an environment variable in the lambda configuration to define where to find graphviz.
    5. the package can't be a jar file, as the packaged binaries would lose their execute permission.
    6. maven also needs a bit of a hack to restore the execute permission when moving files with maven-resources-plugin.
    7. the zip file is extracted to /var/task in the lambda container.  I couldn't find where that is documented, so I wrote log messages to `CloudWatch` to find out.  And yes, logs go to `CloudWatch`.
    8. it's close to impossible to rely on a lot of external libraries.  Even though graphviz supports static linking, some libraries don't provide a static library to link against.
    9. I couldn't figure out a way to write the lambda function in `kotlin` implementing the Handler interface that provides the API Gateway event (as most examples you can find do).  Thus, there's no way to set the response content-type or use other HTTP-related functions from the function itself.
    10. don't forget to define all response codes in the method response.  I once forgot 200 and got a weird error message which definitely didn't tell me about the missing config.
    11. API Gateway assumes everything is JSON.  I have an API that takes in the UML text (text/plain) and returns an encoded string (text/plain).  I have to convert the POST body into JSON (using API Gateway's mapping template) before passing it to the lambda function.
    12. outputting a binary stream is also annoying.  I ended up having to base64-encode the image file, only to have API Gateway decode it right away.  Also, there's a `$util.base64Decode` function in the mapping template which certainly doesn't work.  Luckily the option to convert to binary does work.
    13. setting the response content-type is also a bit tricky.  It's done in the same place where you define the supported HTTP response codes.
    14. a cold-start lambda can take anywhere from 5 to 10 seconds, which is very slow.  Also, each container is only used for a certain time before it is recycled, i.e. we hit cold starts more often than necessary.
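
Items 9, 12 and 13 are the kind of thing API Gateway's Lambda proxy integration is meant to cover, where the function itself returns the status code, headers and a base64 body. The setup described above does not use it, so the following is only a sketch of that response shape under that assumption, written in Python rather than Kotlin for brevity. `build_png_response` and `FAKE_PNG` are hypothetical names, and the rendering step is left out.

```python
import base64

# Placeholder bytes (the PNG file signature) standing in for a rendered diagram.
FAKE_PNG = b"\x89PNG\r\n\x1a\n"


def build_png_response(png_bytes):
    """Return a dict in the shape used by Lambda proxy integration,
    carrying the image as base64 text."""
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "image/png"},
        "body": base64.b64encode(png_bytes).decode("ascii"),
        "isBase64Encoded": True,
    }


def handler(event, context):
    """Lambda entry point; the real rendering step is omitted here."""
    return build_png_response(FAKE_PNG)


response = handler({}, None)
print(response["headers"], response["isBase64Encoded"])
```

Even with this shape, API Gateway still has to be told which media types are binary before it decodes the body, so this would move the content-type handling into the function rather than remove the gateway configuration entirely.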

I've got it sort of working, but a lot of plantuml diagrams don't work because of missing libraries.

PlantUML tricks


I've been using plantuml a lot to draw various diagrams.  Here are a few tricks I've picked up.

Increase the size of the image

# define PLANTUML_LIMIT_SIZE to increase max size
java -DPLANTUML_LIMIT_SIZE=1024000 -jar plantuml.jar > png

Reading from pipe

grep "xxx" puml | java -jar plantuml.jar -pipe > png
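
The same pipe mode can be driven from a script. Below is a minimal Python sketch that combines it with the size-limit property from the previous trick. `plantuml_command` and `render_png` are hypothetical helper names, the jar path is assumed to be `plantuml.jar` in the current directory, and 8192 is an arbitrary example value.

```python
import subprocess


def plantuml_command(jar="plantuml.jar", limit_size=None):
    """Build the java command line for PlantUML in -pipe mode."""
    cmd = ["java"]
    if limit_size is not None:
        # JVM system properties must come before -jar.
        cmd.append(f"-DPLANTUML_LIMIT_SIZE={limit_size}")
    cmd += ["-jar", jar, "-pipe"]
    return cmd


def render_png(source, jar="plantuml.jar", limit_size=None):
    """Send diagram source to PlantUML on stdin and return the image bytes."""
    result = subprocess.run(
        plantuml_command(jar, limit_size),
        input=source.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise if java or plantuml exits with an error
    )
    return result.stdout


print(plantuml_command(limit_size=8192))
```

Calling `render_png("@startuml\na -> b\n@enduml\n")` needs java and the jar to be available, so only the command builder is exercised above.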

Increase node separation or rank separation

@startuml
skinparam nodesep 10
skinparam ranksep 20
@enduml

Notes on the same rank

@startuml
!pragma teoz true
note over a: hi
/ note over b: yea
/ note over c: yo
a -> b: x
& b -> c: y
@enduml

Arrows on the same rank

@startuml
!pragma teoz true
a -> b: x
& b -> c: y
@enduml

Rank is set automatically based on the length of the arrow.

->  is the same rank

--> is one rank lower

---> is two ranks lower

You can use -[norank]> to create arrows between nodes that are always on the same rank, independent of arrow length.


Rude email

I've received a lot of rude emails in the past.  Here are a couple of "interesting" ones.

cause: two teams were discussing who should fix a small bug.  Here's the response I got when I pressed on.


I personally do not want to contribute to this mess, but also have no time to waste on protecting "2 lines of code" against expanding entropy.
So, if you really believe that <a component> is the right place where to fix <redacted> issues, feel free to change the description in the java code yourself.

cause: a team decided to implement a new feature without consulting anyone and put it into the product.


Sorry I will not profession in this email.
Sorry for wasting your time with this feature. Only somebody with everyday contact with <software name> can understand value of it, which has been developed during weekend. Also I don’t understand to call something as scientific project, if other products including operating systems have it for decade.
After tech talk about month ago where he was (I don’t remember the exact wording) talking about brave people who innovate. My understanding (may be wrong) was that he talks about <a country> (brave -> innovate), <another country> (not brave -> don’t innovate).
I am also sad that people in <software name> don’t understand what is aligned with priorities and what is not. And accuse us that <team name> is not aligned.
There can be at least three explanation:

  1. They don’t know what are priorities (unlikely) 
  2. They don’t know what <team name> is doing (then we don’t need to track our work in <something like jira> because it’s wasting of our time). 
  3. They don’t understand what <team name> is doing (it’s my fault I am not able to explain it)