PolyglotConf 2014: Notes on Introduction to Distributed Systems

The session on Introduction to Distributed Systems was definitely one of my favorites this year. I'd argue it was along the same lines as the “how to become a better programmer” session at Polyglot 2013, because I came away with a lot of calls to action.

The session was led by Fred Hebert (author of Learn You Some Erlang) and Jeremy Pierre (Askuity’s distributed systems guru), who explained the core problems you may find in distributed systems in layman's terms. The first concept they tackled was the CAP Theorem, for which Fred used the metaphor of two groups of people stranded on an island, separated from each other and trying to communicate and coordinate. The second one, brought up by Jeremy, was the Two Generals' Problem (closely related to, though distinct from, the Byzantine Generals Problem), which was really enlightening.

After those first two intros, a lot of recommendations came from the session leaders and Saem Ghani. Here is a summary of the resources you should read to get a better understanding of distributed systems:

1) End-To-End Arguments in System Design[pdf] by J.H. Saltzer and others.

2) Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services[pdf] by Seth Gilbert and Nancy Lynch.

3) Virtual Time and Global States of Distributed Systems[pdf] by Friedemann Mattern.

4) Two-Phase Commit Protocol. I also read this blog post, which covers the concept with some illustrations.

5) In Search of an Understandable Consensus Algorithm[pdf], a.k.a. the Raft algorithm, by Diego Ongaro and John Ousterhout.

6) Idempotence is Not a Medical Condition by Pat Helland.

Bonus blog-posts and papers:

8) The 8 Fallacies of Distributed Computing[pdf]. This is an explained version written by Arnon Rotem-Gal-Oz.

9) Dynamo: Amazon’s Highly Available Key-value Store[pdf] by a bunch of people. This paper was emphasized as a suggested reading by Jeremy and Saem.

10) Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications[pdf] by a bunch of people. This paper was emphasized as a suggested reading by Saem.

11) Notes on Distributed Systems for Young Bloods[pdf] by Jeff Hodges. Jeremy praised this blog post and said he tries to read it at least twice a year.

The session was an amazing experience, and I'd recommend that anyone who missed PolyglotConf this year try to make the next one.

“If we want to make programming concurrent and parallel software easier, we need to embrace the idea that different problems require different tools; a single tool just doesn’t cut it. Image processing is naturally expressed in terms of parallel array operations, whereas threads are good fit in the case of a concurrent web server. So in Haskell, we aim to provide the right tool for the job, for as many jobs as possible.”
— Simon Marlow in Parallel and Concurrent Programming in Haskell
“The meaning of a referentially transparent (RT) expression does not depend on context and may be reasoned about locally, whereas the meaning of non-RT expression is context dependent and requires more global reasoning. For instance, the meaning of the RT expression 42 + 5 doesn’t depend on the larger expression it’s embedded in (it’s always and forever equal to 47). But the meaning of the expression throw new Exception("fail") is very context-dependent, it takes on different meanings depending on which try block it is nested within.”
— Functional Programming in Scala by Paul Chiusano
“Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn’t even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt.”
— Release It by Michael T. Nygard

I love this saying in Spanish…


El chivo que más mea.

Translation: The goat that pisses the most, meaning someone that is in charge/high-ranking/the boss.

Example: Roger and Sally wouldn’t dare disobey Tammy’s orders because she’s the goat that pisses the most.

“The biggest issue in changing a monolith into microservices lies in changing the communication pattern. A naive conversion from in-memory method calls to RPC leads to chatty communications which don’t perform well. Instead you need to replace the fine-grained communication with a coarser-grained approach”
— Microservices by Martin Fowler
“A partition among your application servers can cause clustered, in-memory sessions to diverge, leading to invariants being violated both server-side and in users’ web browsers as they are fed data from different divergent copies of their session. Increased latency (whatever cause) can trip timeouts between different parts of a system, leading to spiking reconnect attempts and thus cascading, catastrophic latency.”
— Distributed Systems and The End of the API

Introducing navorski

The Terminal

TL;DR: An easy way to manage well-defined terminals (github link)

One of the reasons I moved from Vim to Emacs was that Emacs has better support for async processes and lets you run a bash terminal inside your editor. With those features in mind, I’m always looking for opportunities to automate workflows where a terminal can help, such as running a program each time I save a file, or running a REPL for a language that doesn’t have an inferior mode.

After some time, I realized that I was writing the same code over and over again, so I decided to create a small utility library called navorski.el. This library defines common functions you may want to use to interact with a terminal that has a well-established purpose. I think the best way to explain how this works is with a story of how I use it.

I started working with Scala not long ago, and one thing I really liked about it was SBT, especially the ‘~’ operator, which reruns an SBT command every time I save a file that belongs to the SBT project. I wanted to have my tests running while I change the code, and also a Scala REPL active at all times[1].

sbt-mode could help me with this, but only halfway, because I can only have one of those two within a single sbt-mode buffer. My solution? Create a navorski profile that runs the SBT tests.

(nav/defterminal scala-sbt-tester
  :buffer-name "sbt-tester"
  :program-path "/usr/local/bin/sbt"
  :program-args "~ test"
  :modify-default-directory be/sbt-root-directory)

When this form is evaluated, it defines the following functions:

  • nav/scala-sbt-tester-pop-to-buffer
  • nav/scala-sbt-tester-send-string
  • nav/scala-sbt-tester-send-region
  • nav/scala-sbt-tester-kill-buffer

When I call M-x nav/scala-sbt-tester-pop-to-buffer, it automatically creates a buffer that runs the program found in /usr/local/bin/sbt with the argument '~ test'. The program runs in the directory that the be/sbt-root-directory function returns[2]. I add a keybinding for the generated pop-to-buffer function and I'm done: the test runner is always one keybinding away.

This is one particular use case among many:

  • Persist running programs using GNU screen (check the :screen-session-name option)
  • Create terminals for remote connections over SSH (check the :remote-host option)
  • Easily create “inferior-mode”-like buffers by using the send-string and send-region functions.
  • etc…

If this sounds awesome, give it a try, and if you love living in your terminal, don’t forget to star it on GitHub.

[1] I have a Haskell/Clojure background, so yes, all the time!

[2] This function receives the current default-directory as a parameter when the command is executed, and returns a modified path where you want the program to run. In this particular example, that is the SBT root dir.

tiempo - A library for time units/intervals in Haskell

TL;DR: A library to specify time units easily

It has come to my attention that, as a library developer, there is sometimes no easy way to specify time in asynchronous functions. One particular example comes to mind:

threadDelay :: Int -> IO ()

One thing I find disturbing about this signature is that, just from its types, I can’t tell what unit the function expects. Is it milliseconds, seconds, minutes? Most of the Haskell standard library functions that work with time in one way or another use microseconds as the standard unit. Sadly, the standard library doesn’t provide an alias for Int so that the signature could serve as the documentation.
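To make the point concrete, here is a minimal sketch of what a unit-carrying signature could look like. The Microseconds newtype and the delayFor wrapper are hypothetical names invented for this illustration; they are not part of base or of any library mentioned in this post.

```haskell
import Control.Concurrent (threadDelay)

-- Hypothetical newtype (not in base): the unit is now part of the type.
newtype Microseconds = Microseconds { getMicroseconds :: Int }
  deriving (Show, Eq, Ord)

-- Hypothetical wrapper: the signature alone tells the caller which unit to pass.
delayFor :: Microseconds -> IO ()
delayFor = threadDelay . getMicroseconds

main :: IO ()
main = do
  delayFor (Microseconds 500000) -- half a second
  putStrLn "done"
```

With a wrapper like this, passing a bare Int is a type error, so the caller has to state the unit explicitly.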

Another problem I often run into: what happens if I’m developing a library that works with time on top of an internal API that works in milliseconds? Should I take the same Int and expect milliseconds, or should I expect microseconds and do the conversion myself inside the function? Is this the right way to go?

After playing with distributed-process-platform (a.k.a. Cloud Haskell’s OTP), one thing that got my attention was that they use a type for the different time units. Replicating their effort in a far less ambitious library, I came up with this:

import Tiempo

main :: IO ()
main = do
  threadDelay (microSeconds 3000)
  threadDelay (milliSeconds 500)
  threadDelay (seconds 3)
  threadDelay (minutes 1)
  threadDelay (hours 1)

Providing time intervals this way looks more appealing to me: I can specify the units myself at the call site, with no prior knowledge of the function's convention required. If you are a library developer and need one specific time unit, you can get it easily using a conversion function.

-- threadDelay is already implemented by Tiempo, but I'm  
-- providing the definition here for the sake of completeness

import Tiempo
import qualified Control.Concurrent as Concurrent

threadDelay :: TimeInterval -> IO () 
threadDelay interval = Concurrent.threadDelay (toMicroSeconds interval)

Independently of the unit specified by the client code, I can get the unit I require to satisfy the spec of a lower API.
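For readers curious how such a type might work underneath, here is a minimal sketch under my own assumptions. The Interval and TimeUnit types and the toMicros and delay functions are hypothetical names chosen for this illustration; Tiempo's actual internals may look quite different.

```haskell
import Control.Concurrent (threadDelay)

-- Hypothetical unit tag; not Tiempo's real representation.
data TimeUnit = Micros | Millis | Seconds | Minutes | Hours
  deriving (Show, Eq)

-- An amount paired with the unit it was expressed in.
data Interval = Interval TimeUnit Int
  deriving (Show, Eq)

-- Normalize any interval to microseconds, the unit threadDelay expects.
toMicros :: Interval -> Int
toMicros (Interval unit n) = n * factor unit
  where
    factor Micros  = 1
    factor Millis  = 1000
    factor Seconds = 1000000
    factor Minutes = 60 * 1000000
    factor Hours   = 3600 * 1000000

-- A delay that accepts any unit and converts at the boundary.
delay :: Interval -> IO ()
delay = threadDelay . toMicros

main :: IO ()
main = do
  print (toMicros (Interval Millis 500))
  print (toMicros (Interval Minutes 1))
  delay (Interval Millis 10)
```

The design choice is to keep the unit alongside the number until the last moment, so the conversion happens once, right where the lower-level API is called.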

Hopefully you will find this library as useful as I do, and kudos to the Cloud Haskell devs for this small yet awesome idea.

Gonzo enjoying the afternoon sun