Tuesday, September 24, 2019

Setup HTTP Git Server using Nginx on Docker

I was working on the Kubernetes executor for nextflow, which appeared to only be able to pull the pipeline from GitHub or Bitbucket (which turned out not to be true).  However, the pipeline script is proprietary, and it's company policy to house the source code on an internal GHE.  Thus, I decided to set up a Git server just as described here: https://www.howtoforge.com/tutorial/ubuntu-git-server-installation/

(Note that git's own website also describes the use of `git-http-backend`: https://git-scm.com/book/en/v2/Git-on-the-Server-Smart-HTTP.)

Also, since I'm running the nextflow script in a pod, the git server needs to run in a pod too.

Here's the Dockerfile I used to build the Docker image.
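Roughly, something like this (the base image and package names are assumptions; `nginx.conf` and `run.sh` are the files described in the rest of this post):

```dockerfile
# sketch: nginx + fcgiwrap + git on an Ubuntu base
FROM ubuntu:18.04
RUN apt-get update && \
    apt-get install -y nginx fcgiwrap git && \
    rm -rf /var/lib/apt/lists/*
COPY nginx.conf /etc/nginx/sites-available/default
COPY run.sh /run.sh
RUN chmod +x /run.sh && mkdir -p /srv/git
EXPOSE 80
CMD ["/run.sh"]
```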



The nginx config is mostly the same as the one in the howtoforge tutorial.
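The relevant part is the location block that hands every request to `git-http-backend` through fcgiwrap; something along these lines (the repo root `/srv/git` and the backend path assume Ubuntu's git package):

```nginx
server {
    listen 80;

    location ~ (/.*) {
        include       fastcgi_params;
        fastcgi_pass  unix:/var/run/fcgiwrap.socket;
        # export all repos under the project root without a git-daemon-export-ok file
        fastcgi_param GIT_HTTP_EXPORT_ALL "";
        fastcgi_param GIT_PROJECT_ROOT    /srv/git;
        fastcgi_param SCRIPT_FILENAME     /usr/lib/git-core/git-http-backend;
        fastcgi_param PATH_INFO           $1;
    }
}
```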

The run.sh file (since there's no systemd in a Docker container, we have to launch nginx and fcgiwrap in the background):
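Something along these lines (the fcgiwrap init script path is an assumption from the Ubuntu package):

```sh
#!/bin/bash
# no systemd in the container, so start the services ourselves
/etc/init.d/fcgiwrap start
nginx                       # forks into the background by default (daemon on)
exec nextflow run "$@"      # hand off to nextflow as the main process
```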


Since my purpose is to launch nextflow, I start fcgiwrap and nginx in the background and then launch nextflow.  If you'd like to run a pod that just serves files over git, you can launch run.sh as the CMD in the Dockerfile and start nginx with `daemon off;`.

Friday, August 30, 2019

Circuit Breaker and Bulkhead on Microservice

(this is a very old post that I forgot to publish)

I'm thinking about how we can leverage circuit breakers in our services.  Let's say we have the following services:

UI -> svc A -> svc B

Putting a circuit breaker on svc A would make it fail fast and give svc B a chance to recover.

And say we have a pool of svc B instances, and we leverage some service discovery to "find" a svc B when svc A starts up (we probably don't want to do that every time svc B is needed).

If that instance of svc B fails, we shouldn't just trip the breaker but find another svc B from the service registry.  In that case, the circuit breaker should really be implemented in the service registry (to mark that instance as open), and we'd need some way to close (half-open) the breaker for that instance.

Now, instead of having a direct connection to svc B, we put a load balancer in front of svc B:

UI -> svc A -> LB -> (svc B)xN

Now, putting a circuit breaker on svc A actually doesn't make (too) much sense.  If a couple of svc B instances got very busy and timed out, svc A might open the circuit breaker while some of the svc B instances are actually fine.  And if just one of them is slow, the circuit breaker on svc A might never open and we'll suffer intermittent performance issues.  We could instead put the circuit breaker on the LB: unless the LB itself or all of svc B are dead, svc A won't trip its breaker.  However, timeouts would behave differently.  Since the LB will trip the breaker for the failing svc B instance, tripping the breaker on svc A would just fail requests for no reason (assuming there are more svc B instances available).

Using mesos and marathon, we can do either service discovery (the consuming service looks up the producing service directly) or load balancing (marathon has haproxy integration, and if using consul, there's nginx integration too, which I read is quicker at updating its config).  We'll have to make a decision, and that'll affect the docker/mesos/marathon exercise I'm working on (I'll make sure the scenario we pick works).

And a more general question: should we implement the circuit breaker per service (server) or per api (url)?  I don't know how hystrix implements it, but I'd think per service would be good enough, i.e. any failing API could trip the breaker.
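To make that concrete, here's a minimal per-service circuit breaker sketch (illustrative only, not how hystrix implements it; all names are made up).  It trips after a number of consecutive failures, fails fast while open, and lets one trial call through after a cooldown (half-open):

```javascript
// Minimal per-service circuit breaker sketch (illustrative, not Hystrix).
class CircuitBreaker {
  constructor({ failureThreshold = 3, cooldownMs = 1000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.state = 'CLOSED';        // CLOSED -> OPEN -> HALF_OPEN
    this.openedAt = 0;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error('circuit open: failing fast');
      }
      this.state = 'HALF_OPEN';   // cooldown elapsed: let one trial request through
    }
    try {
      const result = await fn();
      this.failures = 0;          // success closes the breaker again
      this.state = 'CLOSED';
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

A per-api breaker would just mean keeping one of these per URL instead of one per downstream service.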

As for bulkhead, microservices are basically the bulkhead pattern at the service layer, and nodejs, accidentally, at the process level.  I couldn't find any documentation, but a nodejs process appears to bind to one processor.  And the example the book keeps using, the self-denial attack, is something we can prepare for ahead of time, and I really doubt our customers will have that use case.  But if there's anything to do, we might have to do it with our orchestration framework (I'd recommend mesos for now as it's the most mature framework; most other solutions are either built on top of mesos or new, like Google Kubernetes or ClusterHQ Flocker).

BTW, it appears deployment is well thought out on Marathon: https://mesosphere.github.io/marathon/docs/deployments.html  Should be really fun to try out.

Also, if we use the same LB for all services, it will become a hotspot; we might want to have an LB for each service, or one per few services.

With all these services and LBs, plus monitoring and logging, it's vital that we get the orchestration piece done; an installer just won't cut it.




Gmail's plus sign trick

I am working on a test for user registration on my website.  The problem is that I need to create a new account every time, and that's tied to the user's email address (Gmail).  I couldn't create a new email account every time I ran the test: not only would I create a lot of email accounts (even if I could automate that), but enabling the API for each new account would also be a lot of work.

Turns out there's a plus sign trick that makes seemingly different email addresses deliver to the same email account: someone@gmail.com and someone+12345@gmail.com will both deliver to the same someone@gmail.com inbox!

With that, I can generate Gmail addresses and search the inbox with "to: someone+12345@gmail.com" to retrieve the email for the test.
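A tiny sketch of the helpers the test could use (function names are mine):

```javascript
// Generate a unique registration address using Gmail's plus sign trick.
// Everything between '+' and '@' is ignored for delivery but kept in the To: header.
function plusAddress(base, tag) {
  const [user, domain] = base.split('@');
  return `${user}+${tag}@${domain}`;
}

// Gmail search query to find the email for a given test run.
function searchQuery(address) {
  return `to: ${address}`;
}
```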

Getting HTML from the Gmail body using the Gmail API and protractor

Part of the automated tests we are building involves checking email, verifying its contents, and clicking a link to continue the registration process.  To do that, I set up a new Gmail account and followed Google's Quickstart instructions to enable the Gmail API.  Well, all you really have to do is click the "ENABLE THE GMAIL API" button on the page.  But before you do that, make sure you have selected the correct Google account in the upper right corner.

Now that we have enabled the API and downloaded the credentials JSON file, we can follow the example in the quickstart to authenticate to the Gmail API.  However, getNewToken simply displays a URL on the console, and you're supposed to manually go to that URL in a browser and copy the code back to the program.  But we're writing an automated test with protractor; let's automate that too!

It's mostly the same as the example, except that when token.json is not found, it'll open a new browser window and grab the code automatically.  Also, note that the code uses an `AppPo` class; it's just a simple utility class I use to check that a button exists before clicking it.

The Gmail API's list method returns a list of emails with only the message id and thread id; we have to call the get method to retrieve the content.  The structure it returns is too complicated for my taste, and after all, I just want to grab the HTML content, so I decided to get the raw content and reconstruct the HTML.  Now we can call searchEmail and pass its result to getHtmlFromEmailBody to get the HTML content of all the emails returned.  With cheerio, we can easily find all the links like this:

const $ = gmailUtils.getHtmlFromEmailBody(...);
$('a').each( (i,a) => console.log( $(a).attr('href') ) );
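For what it's worth, the core of the reconstruction is just base64url-decoding the raw message returned by the get method (a sketch; `decodeRawBody` is a made-up name, and the real code also has to skip the RFC 2822 headers before handing the HTML to cheerio):

```javascript
// Gmail's get method with format: 'raw' returns the whole message as a
// base64url-encoded string; decoding it yields the RFC 2822 text,
// including the HTML body that cheerio can then parse.
function decodeRawBody(raw) {
  return Buffer.from(raw, 'base64url').toString('utf8');
}
```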

Wednesday, July 10, 2019

Snakemake

I'm doing some evaluation of different workflow languages, and it turns out I like Snakemake a lot!

I've converted https://gencore.bio.nyu.edu/variant-calling-pipeline/ to use Snakemake, https://github.com/arthurtsang/variant-calling-pipeline

What I like about it,


  1. The Python-based syntax is clean and easy to follow.
  2. Wildcards are very useful and easy to understand.
    • If I have a rule with "{sample}.fastq" as the output, and another rule has "A.fastq, B.fastq, C.fastq" as input, the first rule will be executed once each for A, B and C.
  3. expand is also very helpful.
    • It tripped me up a bit initially that a wildcard in the expand function cannot use a global variable directly.  I.e. if we have SAMPLES=["A", "B", "C"] defined, expand( "/some/directory/{SAMPLES}.fastq" ) won't work; the correct syntax is expand( "/some/directory/{samples}.fastq", samples = SAMPLES ).
  4. Integration with Conda.
    • Unfortunately, it pollutes the code a bit by requiring a conda directive in every rule, but it works out really nicely.
    • I haven't tried it, but it should be possible to build a custom channel, hosted somewhere in the infrastructure, for distributing private binaries.
  5. Using filenames to build the DAG.
    • Rules are connected by matching the output of one rule to an input of another.  It's kind of like how Spring figures out which bean to create first.
    • The limitation is that everything is file based.  I.e. if a step doesn't really need to generate a file, we'll have to touch an empty state file for Snakemake to build the DAG.
    • Also, there is no support for Linux-style pipes.  You can't really pipe the result of one rule into another.
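Points 2 to 5 in one minimal Snakefile sketch (the directory names, env file and seqtk command are just for illustration):

```snakemake
SAMPLES = ["A", "B", "C"]

rule all:
    # expand() needs the wildcard bound explicitly: sample = SAMPLES
    input: expand("trimmed/{sample}.fastq", sample=SAMPLES)

# Runs once per sample: the {sample} wildcard in the output connects
# this rule to each file requested by "rule all".
rule trim:
    input: "raw/{sample}.fastq"
    output: "trimmed/{sample}.fastq"
    conda: "envs/seqtk.yaml"
    shell: "seqtk trimfq {input} > {output}"
```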

EPIPE write EPIPE error when using Protractor with control flow turned off

I was hit with an `EPIPE write EPIPE` error when running protractor with control flow turned off (https://github.com/angular/protractor/issues/4294).  It is caused by misusing await somewhere in the code.
tslint/IntelliJ is pretty good at warning the developer that a returned promise has been ignored.  However, one case it didn't catch, and that caught me by surprise, is the `ElementArrayFinder` returned by `element.all()`.  It's not a promise, but if you try to use it directly, like `element.all().find()` or my favorite `element.all().getText()`, you'll run into the EPIPE error about 0.1% of the time.
Unfortunately, the test I'm working on calls that about 8000 times…  so it always fails after an hour of running.
Anyway, from the GitHub issue, it appears to be a bug in the selenium driver that's fixed in the latest 4.0.0-beta driver, which protractor is not using.  The workaround is to `await` the `ElementArrayFinder` too; it'll return `ElementFinder[]`, which you can loop through.

Running Plantuml on AWS Lambda

I couldn't get it to work, well, not entirely.  Anyway, here are the issues I came across.

My project just started using DDD, and since we have an overseas team, it's a problem that they can't physically participate in the note-sticking exercise.  Inspired by webeventstorming.com (https://www.slideshare.net/OuzhanSoykan/domain-driven-design-developer-summit-turkey/27), I decided to build a plantuml.com-like web page supporting the eventstorming syntax on top of plantuml.  To be able to work on a diagram together, I used https://firepad.io with `ACE` as the editor, which uses https://firebase.google.com as a realtime database.  The end result is pretty cool, but integrating that (esp. `firebase`) into an angular app induced quite a bit of pain.

  1. firebase uses a very old version of `grpc` which only supports node 9.x.
  2. npm modules that depend on the node version are a mess; they almost never document which version supports which node version.
  3. firebase rules (you can set indexing and ACLs on docs) only provide the basics; you'll have to build the whole user management yourself if you want user groups.
  4. the app used to set `window.location.href` to update the encoded UML in the URL path; when authentication was enabled, that caused a redirect loop.
  5. `ng2-ace-editor` didn't really document how to load the theme and mode (for syntax highlighting); luckily we have Google.



After jumping through a few hoops, now I have a home page listing all documents and can edit them collaboratively.

Then it struck me that I could host the app on `AWS`, putting the angular app on s3 and breaking the server down into API gateway and lambda functions (after all, there are only 2 APIs needed).  This decision caused me more pain than firebase…

  1. deploying the angular app to `S3` is very straightforward; however, since angular uses paths for routes, the app only works if you navigate from `/`, and you'll get a 404 or 403 if you put an angular route in the browser directly (as it tries to find that file on s3).  I read somewhere that we can use CloudFront to map the error page to `index.html`, but I couldn't get that to work.
  2. API-gateway and lambda have a lot of caveats.  
    1. the lambda points to a zip/jar file in an s3 bucket; however, every time you update the zip/jar file, you have to update the lambda again with the same URL to the s3 object.
    2. uploading a large file (10MB) via the web console sucks.
    3. when you're creating a lambda function, ignore the designer's option to set up a trigger to the API gateway.  it looks like you can set up the relationship there, but it's a chicken-and-egg problem: you need to publish your API before you can configure the trigger here, but you need the lambda set up before you can publish the API.
    4. `plantuml` uses `graphviz`, which is not present on the lambda container.  we can package it in the zip file and set an env variable in the lambda config to define where to find graphviz.
    5. it can't be a jar file, as that loses the execute permission.
    6. maven also needs a bit of a hack to restore the execute permission when moving files with the maven-resources-plugin.
    7. the zip file is extracted to /var/task on the lambda container.  I couldn't find where that's documented; I wrote log messages to `CloudWatch` to find it out.  and yes, logs go to `CloudWatch`.
    8. it'd be close to impossible to rely on a lot of external libraries.  even though graphviz supports static linking, some libraries don't provide a static library to link against.
    9. I couldn't figure out a way to write the lambda function in `kotlin` implementing the Handler interface (as most examples do) that provides the API gateway event; thus, there's no way to set the response content-type or other HTTP-related attributes.
    10. don't forget to define all response codes in the method response.  I once forgot 200, and got a weird error message that definitely didn't point to the missing config.
    11. the API gateway assumes everything is JSON.  I have an API that takes in the UML text (text/plain) and returns an encoded string (text/plain), so I have to convert the post body into JSON (using the API gateway's mapping template) before passing it to the lambda function.
    12. outputting a binary stream is also annoying.  I ended up having to base64-encode the image file, only to have the API gateway decode it right away.  also, there's a `$util.base64Decode` function in the mapping template which certainly doesn't work; luckily the option to convert to binary does work.
    13. setting the response content-type is also a bit tricky; it's configured where you define the supported HTTP response codes.
    14. a cold-start lambda could take anywhere from 5 to 10 seconds, which is very slow.  also, each container is only reused for a certain time before it's recycled, i.e. we'll hit cold starts more often than necessary.
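For point 11, the request mapping template ends up being a small piece of VTL along these lines (the `uml` field name is my own choice, not from the original setup):

```vtl
## API Gateway request mapping template for Content-Type: text/plain
## Wraps the raw text body in a JSON object for the lambda function.
{
  "uml": "$util.escapeJavaScript($input.body)"
}
```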

I've got it sort of working, but a lot of plantuml diagrams don't work because of missing libraries.