Sunday, May 26, 2013

A taste of molecular phylogenetics in Julia

I've been meaning for some time to try out Julia, the up-and-coming scientific computing language/environment that might eventually give R, MATLAB, Mathematica, and SciPy all a run for their money. Julia feels familiar if you've used those systems, but it has a lot of more modern language features, an LLVM back-end that produces performant machine code, and integrated support for parallel and distributed computing. The project incubated from 2009-2012 and, with a strong push from applied math groups at MIT, has been gaining steam quickly in the last year.

As far as I could tell via Google, no phylogenetic sequence analysis code has been written in Julia, so this seemed like an interesting place to start. In this post, I'll build up some toy code for commonly-used models of molecular sequence evolution as continuous-time Markov processes on a phylogenetic tree. This will enable a little implementation of Felsenstein's algorithm.

In case it wasn't obvious, I must warn that these are my first lines of Julia, so they might be completely non-idiomatic. There's also no warranty as to correctness, as I banged it all out on the weekend while watching TV!

Codes and alphabets

I started with a simple abstraction to allow us to work equally well with nucleotide, codon and amino acid sequences. We mainly need a way to easily map an observed sequence letter to an array index: 

module Code
export T, dna, aa, codon61, codon64

# a code is defined by an alphabet; indices is a reverse lookup from
# character to array index
type T

  T(chs) = begin
    idxs = (String=>Integer)[chs[i] => i for i = 1:Base.length(chs)]
    assert(length(idxs) == length(chs))

# size of the code
import Base.length
length(c::T) = length(c.alphabet)

dna = T(["A","G","C","T"])

# amino acid sequences
aa = T(["A","R","N","D","C","Q","E","G","H","I",

# not shown: codons


I've used Julia's module system to encapsulate a composite data type, $\operatorname{Code}.T$, which provides an array of character states (letters) and the aforementioned mapping from letters back to array indices. It has a constructor that, given the array of characters, computes the reverse mapping using a nice comprehension syntax to create an associative data structure (dict). We also "overload" the global $\operatorname{length}$ method so that it works on this type, returning the number of letters in the alphabet - 4 for nucleotides, 20 for amino acids, and either 61 or 64 for codons.

julia> Code.length(Code.dna)

julia> map((ch) -> Code.dna.indices[ch], ["C","G","T","A"])
4-element Int64 Array:

The scheme of defining a module centered around a data type, which is called $T$ within the module, is based on the ML/OCaml convention. I'll continue to use this below, but I don't know if this is idiomatic for Julia. In fact, I suspect it is not, since the built-in $\operatorname{show}$ function doesn't qualify instances of $T$ with the corresponding module name:

julia> show(Code.dna)
T(["A", "G", "C", "T"],["G"=>2,"A"=>1,"C"=>3,"T"=>4])


I next defined a data structure for rooted phylogenetic trees centered around an array of parent indices, i.e. an array $p$ where $p_i$ is the index of the parent of node $i$. If the tree has $k$ leaves and it's bifurcating, then it has a total of $2k-1$ nodes and the length of $p$ is $2k-2$. But the same representation handles multifurcating nodes easily.

(Indeed- that's a napkin.)

module Tree
export T, size, leaves

# A tree with k leaves is defined by an array p of length l, l ∈
# [k,2k-2]. p[i] is the index of the parent of node i. p[1:k] are the
# leaves, and i < p[i]. The total number of nodes is n=l+1, and p[i] ∈
# [1,n].
type T
  children::Array{Set{Integer},1}  # inverts parents

  T(k,pts) = new(k,pts,children_from_parents(k,pts),[])
  T(k,pts,bls) = begin
    assert(length(bls) == length(pts))
    assert(all((t) -> isfinite(isfinite(t) && t>=0), bls))

leaves(t::T) = t.k
size(t::T) = length(t.parents)+1

function children_from_parents(k,pts)
  assert(length(pts) >= k)
  n = length(pts)+1
  children = [Set{Integer}() for i in 1:n]
  for i in 1:length(pts)
    assert(pts[i] > i && pts[i] <= n && (i>k||pts[i]>k))
    add!(children[pts[i]], i)


Given $k$ and $p$, the $\operatorname{Tree}.T$ constructor automatically computes and stores the reverse mapping from parent to children, calling upon the helper function $\operatorname{children\_from\_parents}$, which also validates $p$. There are two constructors - one that takes an optional set of branch lengths, and one that does not. The latter just stores an empty array in place of branch lengths. Realistically it ought to take an array of species names too, but they won't be needed for this example.

julia> t = Tree.T(3,[5,4,4,5],[1.5,0.5,0.5,1.0])
T(3,[5, 4, 4, 5],[Set{Integer}(), Set{Integer}(), Set{Integer}()  …  Set{Integer}(4,1)],[1.5, 0.5, 0.5, 1.0])

Rate matrices and substitution matrices

Now for the part of this business that made me, as a grad student, really wish I'd paid better attention to linear algebra and stochastic processes as an undergraduate! We model molecular evolution as a homogeneous, reversible, continuous-time Markov process running along the tree. This process is governed by a rate matrix $Q$ giving the instantaneous rates of substitution from one character to another. Along a given branch of the tree with length $t$, the conditional probability of substitution from one character to another is found in the matrix $P^{(t)} = \exp(Qt)$. Example:

Q = \begin{bmatrix}
{-.11} & .06 & .03 & .02 \\

 .04 &  {-.09}  &  .03  & .02 \\
  .02 & .03 &  {-.09} & .04 \\
  .02 & .03  & .06 & {-.11}

P^{(1.5)} = \exp(1.5\cdot Q) \approx \begin{bmatrix}

 0.85 &  0.081 & 0.042&  0.028 \\
 0.053 & 0.88  & 0.042 & 0.028 \\
 0.028 & 0.042 & 0.88  & 0.053 \\
 0.028 & 0.042&  0.079 & 0.85


Under the nucleotide order A,G,C,T, the entry $q_{1,2}$ is the instantaneous rate of substitution from A to G per unit time. $p^{(1.5)}_{1,2}$ is the probability of substitution from A to G in 1.5 units of time, conditional on starting as A - 8.1% in the above example. (The rows of $Q$ sum to zero, and the rows of $P$ sum to one.) This particular $Q$ matrix implies a slightly GC-rich sequence, with a higher rate of transitions than transversions. $Q$ is not quite symmetric, instead obeying the constraint of detailed balance. Typically one also comes up with a rule for normalizing the magnitude of $Q$ to remove an unnecessary degree of freedom with the branch lengths.

The matrix exponential $\exp(Qt)$ is not trivial to compute, but the eigendecomposition of $Q$ permits relatively efficient computation of $P^{(t)}$ for multiple values of $t$, e.g. all the branches of the tree. With the matrix $S$ of eigenvectors in the rows, its inverse $S^{-1}$, and the diagonal matrix $\Lambda$ of eigenvectors,

\exp(Qt) = S\exp(\Lambda t)S^{-1}

where the diagonal matrix $\exp(\Lambda t)$ is as simple as filling in each diagonal entry $i$ with $e^{\lambda_i t}$, where $\lambda_i$ is the $i$th eigenvalue of $Q$. Thus by precomputing $S$, $S^{-1}$, and $\Lambda$, we can compute $P^{(t)}$ for any $t$ with a couple of matrix multiplications.

While this example is 4-by-4 for nucleotides, the math also holds, and our code will also work, for 20-by-20 amino acid or 61-by-61 codon matrices. All that's needed is a consistent mapping of the alphabet onto matrix indices.

The Julia code for all of this is nice and brief:

module RateMatrix
export T, to_p

type T

  T(m) = begin
    vals, vecs = eig(m)
    ivecs = inv(vecs)
    raw_pi = reshape(ivecs[indmin(abs(vals)),:], size(m,1))

to_p(q,t) = q.evecs*diagm(exp(q.evals*t))*q.inv_evecs


We define a type for the rate matrix and export a function $\operatorname{to\_p}$ to compute $P^{(t)}$ for a given $t$. The $\operatorname{RateMatrix}.T$ constructor also uses the eigendecomposition to get $\pi$, the equilibrium frequencies of the characters implicit in the rate matrix - it's the normalized eigenvector corresponding to the eigenvalue with the smallest (essentially zero) magnitude.

I must point out that the $\operatorname{reshape}$ call is a bit ugly - it coerces a 1-by-$n$ two-dimensional array to a length-$n$ one-dimensional array. There must be a more elegant incantation. (Also the $\operatorname{eig}$ function may have been renamed in newer versions of Julia; I'm using the version provided in Ubuntu 13.04.)

Phylogenetic model: tree + rate matrix

Slap together a tree and a rate matrix, and you've got yourself a complete model.

module PhyloModel
export T
using Tree, RateMatrix

type T

  T(t,q) = begin
    ps = [RateMatrix.to_p(q,x) for x in t.branch_lengths]

show(io::IO, m::T) = begin
  print(io, "tree = " ), m.tree)
  print(io, "\nq = "), m.q_matrix.matrix)
  print(io, "\npi = "), m.q_matrix.pi)

t = Tree.T(3,[5,4,4,5],[1.5,0.5,0.5,1.0])
q = RateMatrix.T([
 -0.11   0.06   0.03   0.02;
  0.04  -0.09   0.03   0.02;
  0.02   0.03  -0.09   0.04;
  0.02   0.03   0.06  -0.11;
model = PhyloModel.T(t,q)

By adding a new $\operatorname{show}$ method defined over $\operatorname{PhyloModel}.T$, we customize how values of this type are displayed in the toplevel:

julia> model
tree = T(3,[5, 4, 4, 5],[Set{Integer}(), Set{Integer}(), Set{Integer}()  …  Set{Integer}(4,1)],[1.5, 0.5, 0.5, 1.0])
q = 4x4 Float64 Array:
 -0.11   0.06   0.03   0.02
  0.04  -0.09   0.03   0.02
  0.02   0.03  -0.09   0.04
  0.02   0.03   0.06  -0.11
pi = [0.2, 0.3, 0.3, 0.2]

Simulating evolution with the generative model

The first interesting thing we can make our model do is generate a random alignment site from the 3-species phylogeny according to the model's substitution process. Julia's standard library surprisingly seems to lack a function to make a weighted random choice, so I had to write a kinda lousy one.

# Given nonnegative weights [w_1 w_2, ..., w_n], choose an index from
# [1,n] with probability proportional to the weights
function random_choice(weights)
  assert(all((x) -> isfinite(x) && x >= 0, weights))

  # choose random x in [0,sum(weights)]
  x = rand()*sum(weights)

  # find the smallest i s.t. sum(weights[1:i]) >= x
  for i in 1:(length(weights)-1)
    x -= weights[i]
    if x <= 0
      return i
  return length(weights)

function sim_site(code,model)
  n = Tree.size(model.tree)
  chs = [-1 for i in 1:n]

  # sample the root state from the equilibrium distribution
  chs[n] = random_choice(model.q_matrix.pi)

  # work down the tree, sampling the state at each node by applying
  # the random substitution process to the previously-sampled state of
  # is parent
  for i in (n-1):-1:1
    pt_ch = chs[model.tree.parents[i]]
    chs[i] = random_choice(model.p_matrices[i][pt_ch,:])
  # take the leaves and convert the state indices to the corresponding
  # strings
  return map((ch) -> code.alphabet[ch],

A few examples:

julia> sim_site(Code.dna,model)
3-element ASCIIString Array:

julia> sim_site(Code.dna,model)
3-element ASCIIString Array:

julia> sim_site(Code.dna,model)
3-element ASCIIString Array:

Felsenstein's algorithm

Now for the more interesting reverse question: given a specific alignment site, what's the probability of $\operatorname{sim\_site}$ generating it? Also known as the likelihood of the model given the site, and efficiently computed with the dynamic programming procedure known in this field as Felsenstein's algorithm, which can also be understood as a special case of one part of belief propagation.

Since this will involve multiplying a lot of probabilities together, we'll do all the calculations on log-probabilities in order to avoid numerical underflow problems. It's debatable whether this is necessary for individual sites, at least for alignments of (on the order of) dozens of taxa - but it's certainly necessary if we go on to analyze millions of sites, and we might prevent future confusion/errors by doing everything in log space as a matter of course. Since Felsenstein's algorithm involves summing probability quantities, writing it in log space necessitates the log-sum-exp trick.

# logsumexp: given x = [log(a_1), log(a_2), ..., log(a_n)]
#   compute log(a_1 + a_2 + ... + a_n)
function logsumexp(x)
  mx = max(x)

function felsenstein(code::Code.T, model::PhyloModel.T, site)
  # initialize the alpha matrix with log(0)
  a = Array(Float64, Tree.size(model.tree), length(code))

  # initialize the rows corresponding to the leaves with log(1) in the
  # column corresponding to the observed letter
  for i in 1:Tree.leaves(model.tree)
   a[i,code.indices[site[i]]] = 0

  # fill the rows corresponding to the nodes
  for pt in (Tree.leaves(model.tree)+1):Tree.size(model.tree)
    # for each possible letter at node pt...
    for x in 1:length(code)
      # for each child of node pt...
      for ch in model.tree.children[pt]
        # include the likelihood contribution of the branch from pt to
        # ch, conditional on x in pt
        lk = logsumexp(log(model.p_matrices[ch][x,:]) + a[ch,:])
        a[pt,x] = (isinf(a[pt,x]) ? 0 : a[pt,x]) + lk

  # return the now-filled-in alpha matrix

function site_likelihood(code::Code.T, model::PhyloModel.T, site)
  a = felsenstein(code,model,site)
  a_root = reshape(a[Tree.size(model.tree),:],length(code))
  logsumexp(log(model.q_matrix.pi) + a_root)

Here's an example of running this on an alignment site A; G; G:

julia> println(felsenstein(Code.dna, model, ["A", "G", "G"]))
5x4 Float64 Array:
    0.0      -Inf          -Inf        -Inf      
 -Inf           0.0        -Inf        -Inf      
 -Inf           0.0        -Inf        -Inf      
   -7.10013    -0.0890196    -8.4492     -8.4492 
   -3.13521    -3.11947      -7.21741    -7.21741

julia> println(site_likelihood(Code.dna, model, ["A", "G", "G"]))

The matrix produced by the Felsenstein algorithm has five rows, one for each node in the tree, and four columns, one for each nucleotide letter. The first three rows, corresponding to the leaves of the tree, each have the entry corresponding to the observed letter set to log(1), and others set to log(0). The algorithm then works up the ancestral nodes in the tree, computing the likelihood of the subtrees conditional on a certain ancestral letter. At the root of the tree (the bottom row of the matrix), the numbers indicate that ancestral A and G have approximately equal likelihood, and ancestral C and T have much smaller likelihoods. (That sentence had to be phrased carefully - these likelihoods are not posterior probabilities!) This seems sensible given the alignment A; G; G and the topology of our tree.

The final answer says that, if we run $\operatorname{sim\_site}$ many times, we should see the alignment A; G; G about $e^{-3.8} = 2.2\%$ of the time. Indeed,

julia> n = 1000000
       k = 0
       s = ["A","G","G"]
       for i in 1:n
         if sim_site(Code.dna, model) == s
           k += 1


Recommended reading

I'll have to stop there for now - if I had time to work on this much further, I might next look at some straightforward performance optimizations for the $\operatorname{felsenstein}$ function, defining a framework for constrained parameterizations of the rate matrix, and then optimizing such parameterizations by maximum likelihood. Some of the challenges involved in this (e.g. symbolic differentiation) would provide highly complementary exercises of the language's features, so I hope I get to try it one of these days!


Julia: final thoughts

I've really enjoyed my tour of Julia so far, and I hope to see it succeed in the long run. I'm really interested in seeing further development of its plotting (really the main reason we tolerate R, isn't it?) and parallel/distributed computing capabilities.

As a longtime OCaml enthusiast, I would normally direct some polemics at Julia's dynamic type system, but thanks to Steve Yegge I'm happy to largely hand-wave this as a liberal/conservative thing. While I've been very pleased with the blend of type-safety, clean abstraction, and native-code performance OCaml brings to my scientific programming, its lack of concise syntax for declaring and accessing arrays/matrices is definitely a problem (e.g. compare $\operatorname{RateMatrix}$ above to my previous more-capable, but also far longer OCaml version). Julia therefore certainly seems competitive, with its mix of math-syntactic sugar, LLVM performance, and (though lacking static type safety) runtime type inference and convenient runtime type assertion syntax.

Miscellaneous minor complaints, some probably reflecting my ignorance of the language:
  • No multi-line comments
  • No pattern matching, although the support for macros partly makes up for this. More generally, it's unclear whether algebraic data types are intended to be idiomatic...
  • No lexer/parser generator - I could not see a convenient way to write a Newick tree parser
  • No advertised unit testing framework - especially disconcerting for a dynamically-typed language!
  • Have to sprinkle 'end' everywhere (spoiled by Python, CoffeeScript, and ocaml+twt)
  • $\operatorname{show}$ on a composite type should print the field names, in addition to the values
  • Documentation doesn't specify asymptotic runtimes of common operations

Appendix: complete code listing