# James Hensman’s Weblog

## June 14, 2010

### Multiple Matrix Multiplication in numpy

Filed under: Uncategorized — jameshensman @ 10:45 am

There are two ways to deal with matrices in numpy. The standard numpy array in it 2D form can do all kinds of matrixy stuff, like dot products, transposes, inverses, or factorisations, though the syntax can be a little clumsy. For those who just can’t let go of matlab, there’s a matrix object which prettifies the syntax somewhat. For example, to do $A = B^\top C^{-1}$

import numpy as np

#using arrays
B = np.random.randn(2,2) # random array
C = np.random.randn(2,2)

A = np.dot(B.T, np.linalg.inv(C))

#using matrices
B = np.mat(B) #cast as matrix
C = np.mat(C)

A = B.T*C.I


Despite the nicer syntax in the matrix version, I prefer to use arrays. This is in part because they play nicer with the rest of numpy, but mostly because they behave better in higher dimensions. This is really useful when you have to deal with lots of matrices, which seems to occur all the time in my work:

A = np.random.randn(100,2,2) # a hundred 2x2 arrays
A2 = [np.mat(np.random.randn(2,2)) for i in range(100)] # a hundered 2x2 matrices


## Transposing

Transposing the list of matrices is easy: just use the .T operator. This doesn’t work for the 100x2x2 array though, since it switches the axis in a way we don;t want. The solution is to manually specify which axes to switch.

A2T = [a.T for a in A2] # matrix version
AT = np.transpose(A,(0,2,1)) #array version 1
#or
AT = np.rollaxis(A,-2,-1)# array version 2


## Multiplication

Suppose you have a series of matrices which you want to (right) multiply by another matrix. This is messy with the matrix object, where you need to do list comprehension, but nice as pie with the array object.

#matrix version
A = [np.mat(np.random.randn(2,2)) for i in range(100)]
B = np.mat(np.random.randn(2,2))
AB = [a*B for a in A]

#array version
A = np.random.randn(100,2,2)
B = np.random.randn(2,2)
AB = np.dot(A,B)


Left-multiplication is a little harder, but possible using a transpose trick: $(BA)^\top = A^\top B^\top$

#matrix version
BA = [Ba for a in A]

#array version
BA = np.transpose(np.dot(np.transpose(A,(0,2,1)),B.T),(0,2,1))


Okay, the syntax is getting ugly there, I’ll admit. Suppose now that you had two sets of matrices, and wanted the product of each element, as in $C_i = A_i B_i, \, i=1 \ldots 100$

#matrix version
A = [np.mat(np.random.randn(2,2)) for i in range(100)]
B = [np.mat(np.random.randn(2,2)) for i in range(100)]
AB = [a*b for a,b in zip(A,B)]

#array version
A = np.random.randn(100,2,2)
B = np.random.randn(100,2,2)
np.sum(np.transpose(A,(0,2,1)).reshape(100,2,2,1)*B.reshape(100,2,1,2),-3)


I’ll admit that the syntax of this is very weird. To see how it works, consider taking the (matrix) product of two arrays in a similar manner:

A = np.random.randn(2,2)
B = np.random.randn(2,2)
np.sum(A.T.reshape(2,2,1)*B.reshape(2,1,2),0)


The reshaping persuades numpy to broadcast the multiplication, which results in a 2x2x2 cube of numbers. Summing the numbers along the first dimension of the cube results in matrix multiplication. I have some scribbles which illustrate this, which I’ll post if anyone wants.

## Speed

The main motivation for using arrays in this manner is speed. I find for loops in python to be rather slow (including within list comps), so I prefer to use numpy array methods whenever possible. The following runs a quick test, multiplying 1000 3×3 matrices together. It’s a little crude, but it shows the numpy.array method to be 10 times faster than the list comp of np.matrix.

import numpy as np
import timeit

#compare multiple matrix multiplication using list coms of matrices and deep arrays

#1) the matrix method
setup1 = """
import numpy as np
A = [np.mat(np.random.randn(3,3)) for i in range(1000)]
B = [np.mat(np.random.randn(3,3)) for i in range(1000)]
"""

test1 = """
AB = [a*b for a,b in zip(A,B)]
"""

timer1 = timeit.Timer(test1,setup1)
print timer1.timeit(100)

#2) the array method
setup2 = """
import numpy as np
A = np.random.randn(1000,3,3)
B = np.random.randn(1000,3,3)
"""

test2 = """
AB = np.sum(np.transpose(A,(0,2,1)).reshape(1000,3,3,1)*B.reshape(1000,3,1,3),0)
"""

timer2 = timeit.Timer(test2,setup2)
print timer2.timeit(100)


In the likely event that there’s a better way to do this, or I’ve screwed up some code somewhere, please leave me a comment 🙂

## May 15, 2009

### Tracking a finger

Filed under: Uncategorized — jameshensman @ 11:29 am

Right. Having broken my avi file into pngs, how do I keep track of an object in the vid? Check this out to see what I mean:

I’m not sure what the tapping of the finger is about (at the start of the vid), but I have data from the experiment which I’d like to correlate to the position of the finger on the rig. the idea is to follow that black blob.

Having broken the vid into png images, I loaded them into python using PIL:

from PIL import Image
images = [Image.open('./tmp/'+f) for f in imagenames]


and then made them into numpy arrays, slicing off only the red channel and the middle part of the image(by inspection, the dot was most visible on the red channel):

images = [np.asarray(image)[upper:lower,:,0] for image in images]#0 indexes red


Now, scipy.ndimage has an awesome tool for identifying regions of an image. I used a threshold (set by inspection) first:

images = [imagehttp://www.youtube.com/watch?v=Cu08aLAUVuY%5D the top image is the (sliced) original image, with a green dot where I think the block pen mark is. The middle plot is a pylab.imshow() of the regions, and the final subplot is the same after applying my set of rules. Thanks to Sarah T. for letting me post the videos. Edit: corrected some code

### Making and Breaking Videos into pngs

Filed under: Uncategorized — jameshensman @ 10:59 am

I’ve got some videos that I’d like to do a little bit of processing on. nothing spectacular, I just want to track a black blob.

Luckily numpy/Scipy has all the tools I need to get started… on images. I need to break the (avi) video file into a series of png files (i could use jpeg, I suppose, but it’s not like hdd space is limited here…). The tool I’ve been using for this is ffmpeg.

Now, ffmpeg is a phenomenal beast. It can re-scale videos, re-encode them, crop them, and all kinds of other things that I’m not interested in. The command I’ve used to break a avi file into pngs is:

ffmpeg -i ./tmp/%03d.png

It’s nice that ffmpeg recognises the %03d format string, huh? Now that I’ve got some images, I can load them into numpy and process them (future blog post!)

After doing said processing, ffmpeg should be able to re-code my processed images into a video file for convenience. However, I’m really lazy, and can’t be bothered to read the whole of the ffmpeg doc. So I’m copying Mike here and using:

mencoder ‘mf://tmp/*.png’ -mf type=png:fps=10 -ovc lavc -lavcopts vcodec=wmv2 -oac copy -o <whatever>.mpg

I’m sure ffmpeg can do the job too, but it just isn’t working for me.

In fact, I’m wrapping both of those commands up in python, using the os module.

## February 12, 2009

### Expected Values (Matrices)

Filed under: Uncategorized — jameshensman @ 9:41 am

I’ve been fiddling around in python, implementing a Variational Bayesian Principal Component Analysis (VBPCA) algorithm. It’s true, I could do this happily in vibes , but where’s the fun in that? Besides, I think I learned more doing it this way.

Happily, there’s a nice paper by Chris Bishop, which explains clearly what’s going on, and gives you the update equations. For example:

${\bf m_x}^{(n)} = \langle \tau \rangle {\bf \Sigma_x} \langle {\bf W}^\top \rangle (t_n - \langle {\mathbf \mu} \rangle)$

which involves taking the expected value of $\tau$, ${\bf W}$ and $\mathbf \mu$. These three are straightforward:

The distribution for $\tau$ is $p(\tau) = \textit{Gamma}(\tau \mid a,b)$, and so the expected value of $\tau$ is $a/b$.

The distribution for $\mu$ is $p(\mu) = \mathcal{N}(\mu \mid {\bf m_\mu},{\bf \Sigma_\mu})$, and so the expected value of $\mu$ is simply ${\bf m_\mu}$.

The distribution for $\bf W$ is a slightly more complex beast: $\bf W$ is a non-square matrix, and we have a Gaussian distribution for each row. That is: $p({\bf W}) = \prod_{i=1}^d \mathcal{N}({\bf w}_i \mid {\bf m}_w^{(i)}, {\bf \Sigma}_w)$. Bizarrely, the rows share a covariance matrix. The expected value of ${\bf W}$ is simple: is just a matrix made of a stack of ${\bf m}_w^{(i)}$s.

The tricky bit comes in a different update equation, where we need to evaluate $\langle {\bf W}^\top {\bf W} \rangle$. The first thing to notice is that (where ${\bf W}$ is a d by q matrix):

${\bf W^\top W} = \sum_{i=1}^d {\bf w}_i {\bf w}_i^\top$.

Since $p({\bf w}_i)$ is Gaussian, $\langle {\bf w}_i {\bf w}_i^\top \rangle = {\bf m}_i{\bf m}_i^\top + {\bf \Sigma}_w$. Now we can write:

$\langle {\bf W^\top W} \rangle = \langle \sum_{i=1}^d {\bf w}_i {\bf w}_i^\top \rangle = \sum_{i=1}^d \langle {\bf w}_i {\bf w}_i^\top \rangle = \sum_{i=1}^d \left({\bf m}_i{\bf m}_i^\top + {\bf \Sigma}_w \right)$

Simple when you know how.

Edit: Anyone know how to get a bold $\mu$? I’ve tried {\bf \mu} and \mathbf{\mu}.

## February 4, 2009

### Back to school for me…

Filed under: Uncategorized — jameshensman @ 6:32 pm

I’ve been having a preliminary flick through Andrew Gelman’s book, (amazon) which so far seems excellent. I thought I’d have a shot at the questions in the introductory chapter.

First question. Easy, no problem.

Second question. Ooh, bit trickier. Got there in the end.

Third question. This has taken me 40 minutes, which seems like justification for posting a solution on this ‘ere weblog.

Suppose that in each individual in a population there is a pair of genes, each of which can be either X or x, that controls eye colour: those with xx have blue eyes, those with XX or Xx or xX have brown eyes. Those with Xx are known as heterozygotes.

The proportion of individuals with blue eyes (xx) is $a^2$, and the proportion of heterozygotes is $2a(1-a)$.

Each parent transmits one gene to the child: if the parent is a heterozygote, the probability that they transmit X is $0.5$. Assuming random mating, show that amongst brown eyed parents with brown eyed children, the proportion of heterozygotes is $2a/(1+2a)$.

Okay says I, it’s just Bayes’ rule, no? Let’s denote all heterozygotes as Xx (this should save significant keypresses…), children as ch and parents as pa.

Under Bayes rule we need a likelihood $p(\text{ch}=Xx | \text{pa})$, prior $p(\text{pa})$ and a ‘marginal likelihood’ term, which we get by summing the above. So we’re going to get something like:

$p(\text{ch}=Xx) = \frac{p(\text{ch}=Xx | \text{pa}) p(\text{pa})} {\sum_\text{ch} p(\text{ch}=Xx | \text{pa}) p(\text{pa})}$

We also need to make sure that we only consider brown eyed individuals (XX, Xx, and xX. not xx). let’s have a look at the priors:

$p(xx) = a^2$
$p(Xx) = 2a(1-a)$
and a little manipulation yeilds:
$p(XX) = (1-a)^2$

So looking at combinations of parents who have brown eyes:

$p(\text{pa}=XX,XX) = (1-a)^4$
$p(\text{pa}=XX,Xx) = 2*2a(1-a)*(1-a)^2$
$p(\text{pa}=Xx,xx) = 4a^2(1-a)^2$

Each of which can be considered a prior, with correspondings likelihoods (of the child being Xx):

$p(\text{ch}=Xx | \text{pa}=XX,XX) = 0$
$p(\text{ch}=Xx | \text{pa}=XX,Xx) = 0.5$
$p(\text{ch}=Xx | \text{pa}=Xx,xx) = 0.5$

Now, the top line of Thomas Bayes’ famous rule looks like:

$0.5*4a(1-a)^3 + 0.5*4a^2(1-a)^2$

remembering that the bottom line must consist of all brown eyed children of brown eyed parents (not just the heterozygotes), the bottom line looks like:

$(1-a)^4 + 4a(1-a)^3 + 0.75*4a^2(1-a)^2$

Phew. Cancelling a few terms does indeed leave

$p(\text{ch}=Xx | \text{pa}=(Xx,XX),(XX,XX),(Xx,Xx)) = \frac{2a}{1+2a}$

Not going to school for years switches your brain off: It took me as long to figure out how to do this as it did to blog it…

There’s actually a second part to the problem, (first set by Lindley, 1965, apparently), which turns out to have some rather messy terms in. I think I’ll save blogging that for another day.

## January 29, 2009

### ipython %bg coolness

Filed under: python — jameshensman @ 3:17 pm

in ipython, I just found the magic function ‘%bg’, or background. If you want to run a long process, but keep the window active, this threads the task and stores the result for you to pick up later. e.g.

In [1]: from numpy import matlib as ml

In [2]: %bg ml.rand(2,2)
Starting job # 0 in a separate thread.

In [3]: %bg ml.zeros(3)
Starting job # 1 in a separate thread.

In [4]: jobs[0].result
Out[4]:
matrix([[ 0.97556473, 0.67794221],
[ 0.9331659 , 0.78887001]])

In [5]: jobs[1].result
Out[5]: matrix([[ 0., 0., 0.]])

Mucho coolness, especially if you have a long process (or lots of them) to complete. I wonder if it’s actually multi-threaded, as in using both of my CPUs?

### python.copy()

Filed under: python — jameshensman @ 10:48 am
Tags: ,

Assignment in python is exactly that. Assignment, not copying. For those of us who have switched to the language of the gods from some inferior mortal language (Matlab), this can lead to some frustration.

For example, within my Variational Factor Analysis (VBFA) class, I need to keep a record of something I’m calling b_phi_hat. One of the the methods in the class involves the update of this little vector, which depends on its initial (prior) value, b_phi. Like this:

class VBFA:
import numpy as np
def __init__(self):
self.b_phi = np.mat(np.zeros((5,1)))
#blah blah...

def update_phi(self):
self.b_phi_hat = self.b_phi
for i in range(5):
self.b_phi_hat[i] = self.something()


update_phi() get called 100s of times when the class is used. Spot the problem? It’s on line 8, where b_phi_hat is assigned to b_phi. When the loop runs on the next two lines, it’s modifying the original, not just a copy of the original, i.e. after the first iteration line 8 doesn’t ‘refresh’ b_phi_hat, it keeps it at its current value.

What I should have written is:

import numpy as np
class VBFA:
def __init__(self):
self.b_phi = np.mat(np.zeros((5,1)))
#blah blah...

def update_phi(self):
self.b_phi_hat = self.b_phi.copy()
for i in range(5):
self.b_phi_hat[i] = self.something()


which explicitly makes a copy of the original on line 8, refreshing  b_phi_hat with every iteration.

## January 28, 2009

### Some notes on Factor Analysis (FA)

Filed under: Uncategorized — jameshensman @ 9:21 am

Factor analysis is a statistical technique which can uncover (linear) latent structures in a set of data.  It is very similar to PCA, but with a different noise model.  The model conssits of a set of observed parameters $\{\bf{x}_n \}_{n=1}^N$ (this is your collected data), some latent paramenters $\{\bf{z}_n \}_{n=1}^N$, with distribution $p(\bf{z_n}) = \mathcal{N}(\bf{0}, \bf{I})$.  There exists a noisy linear map $\bf{A}$ from the latent space to the observed variables: $p(\bf{x}_n) = \mathcal{N}\left(\bf{Az}_n, \bf{\Psi}\right)$, where $\bf{\Psi}$ is a diagonal matrix.

It’s also possible to assume a mean vector for the observed variables, but I’m going to ignore that for a moment for clarity.

Some simple algebra yields $p(\bf{x_n}) = \mathcal{N}\left(\bf{0}, \bf{AA}^\top + \bf{\Psi}\right)$. It should now be clear that we are modeling the distribution of $\bf{X}$ as a Gaussian distribution with limited degrees of freedom.

Variational Approach

The variational approach involved placing conjugate priors over the model parameters ($\bf{A}$ and $\bf{\Psi}$), and finding a factorised approximation to the posterior.

The variational approach to FA yields a distinct advantage: by placing an ARD prior over the columns of $A$, unnecessary components get ‘switched off’, and the corresponding column of A goes to zero. This is dead useful: you don’t need to know the dimension of the latent space beforehand: it just drops out of the model.

Papers
There is a super Masters thesis on variational FA here by a chap called Frederik Brink Nielsen.

An immediate extension of factor analysis springs to mind: if the distribution of the data is just a Gaussian, why not have a mixture of them? It seems Ghahramani was there first: GIYF

More recently, Zhao and Yu proposed an alteration to the Variational FA model, which apparently achieves a tighter bound by making $\bf{A}$ dependent on $\bf{\Psi}$. this apparently makes the model less prone to under-fitting (i.e. it drops factors more easily). Neural Networks Journal

## January 26, 2009

### Working with log likelihoods

Filed under: python — jameshensman @ 1:41 pm
Tags: ,

Sometimes, when you have a series of number representing the log-probability of something, you need to add up the probabilities. Perhaps to normalise them, or perhaps to weight them… whatever.  You end up writing (or Mike ends up writing):

logsumexp = lambda x: np.log(sum([np.exp(xi) for xi in x]))

Which is going to suck when them members of x are small. Small enough that the precision of the 64 bit float you holding them runs out, and they exponentiate to zero (somewhere near -700).  Your code is going to barf when it get to the np.log part, and finds it can’t log zero.

One solution is to add a constant to each member of x, so that you don’t work so close to the limits of the precision, and remove the constant later:

def logsumexp(x):
x += 700
x = np.sum(np.exp(x))
return np.log(x) - 700


Okay, so my choice of 700 is a little arbitrary, but that (-700) is where the precision starts to run out, and it works for me. Of course, if your numbers are way smaller than that, you may have a problem.

Edit: grammar. And I’m getting used to this whole weblog shenanigan. Oh, and <code>blah</code> looks rubbish: I'm going to stop doing that. 

## January 23, 2009

### A place for putting things

Filed under: Uncategorized — jameshensman @ 5:19 pm

This is a place for putting things, so that they can be found.  Some of these things include thoughts on what I’m doing at the moment, which consists largely of

• python
• probabilistic models of data
• learning
• teaching

I’m hoping that by storing my thoughts, they’ll begin to make more sense. I may even start writing thoughts on my thoughts, but mostly I’m just going to put them here for safekeeping.

Blog at WordPress.com.