GSoC Mentor Summit

I’m in the airport waiting for my flight home from the Google Summer of Code Mentor Summit, which took place this weekend. Two or three mentors from many of the GSoC projects came out to the Google campus to participate in an un-conference about GSoC. Google and SymPy were kind enough to send me and two others (Stefan Krastanov and Gilbert Gede), so I thought I’d slightly repay the favor by reporting my experiences. I’ll list some of the ideas and thoughts it generated below; they range from the application process, to SymPy and scientific Python in general, to some meta-thoughts about the conference itself.

Application process

How should we improve our application process to attract good students, detect good students, and match good students to appropriate mentors?

We should ask indirect questions that probe for curiosity and passion.

Negative examples:

  • Why do you want to work on SymPy?
  • Why do you like Math?
  • How long have you been programming?

Positive examples:

  • Copy your favorite snippet of SymPy code here and tell us why you like it.
  • What aspects of SymPy do you like the most?
  • What editor do you use and why?

The shared experience was that direct questions tend to have low information content (everyone says the same thing), while indirect questions will be ignored by the lazy but engage the engaged. You often want to test for curiosity and passion more than for actual experience in the domain.

We should match mathematically strong students with strong-programmer mentors, and strong-programmer students with mathematically strong mentors. We often do the opposite because of shared interests, but complementary pairing might produce more well-rounded contributions.

Funding

Other people have funding. Should we? What would we do with it? How would we get it? It might not be as hard as we think. Who uses us? Can we get a grant? Are there companies who might be willing to fund directed work on SymPy?

Interactions with SymPians

This was my first time physically interacting with SymPy contributors other than my old mentor, and it was a really positive experience. As a community we’re pretty distributed, both geographically and across applications/modules. Getting together and talking about SymPy was oddly fascinating; we should do it more. It made us think about SymPy at a larger scale.

Some thoughts

  • Do we want to organize a SymPy meetup (perhaps co-located with some other conference like SciPy)? What would this accomplish?
  • What is our big plan for SymPy? Do we have one, or are we all just a bunch of hobbyists who work on our own projects? Are we actively pursuing a long-term vision? I think that we could be more cohesive and generate more forward momentum, and that occasional co-location could create this.
  • This could also be accomplished by some sort of digital meetup that’s more intense than the mailing list or IRC channel. An easy test version would be a monthly video conference.

Broader community

I’m accustomed to academic conferences. I recently had a different experience at the SciPy conference, which mixed academic research with code; I really liked that blend of theory and application and had a great time there. GSoC amplified this change, replacing most of the academics with attendees interested purely in code. This was personally very strange for me; I felt like an outsider.

The scientific/numeric Python community doesn’t care as intensely about many of the issues that are religion to a substantial fraction of the open source world. My disinterest in those topics, and my interest in more esoteric/academic ones, also made me feel foreign. There were still people like me, though, and they were very fun to find, just a bit rarer.

This is the first conference I’ve been to where I was one of the better dressed attendees :)

Local Community

Other projects of our size exist under an umbrella organization like the Apache Foundation. I see our local community as the numpy/scipy/matplotlib stack. How can we more tightly integrate ourselves with this community? NumFOCUS was started recently; should we engage with and use it more? How can we make use of, and how can we support, our local community?

Meta-Mentor Summit

This section includes my thoughts about the summit itself, which is distinctly structured; I’ll share my opinions about that structure.

The informal meeting spaces were excellent. Far better than the average academic conference. I felt very comfortable introducing myself and my project to everyone. It was a very social and outgoing crowd.

Some of the sessions were really productive and helpful. The unconference structure had a few strong successes.

There were a lot of sessions that could have been better organized.

  • Frequently we didn’t have a goal in mind; this can be OK, but I felt that in many cases a clear goal would have kept the conversation on topic.
  • People very often wanted to share their experiences from events in their organization. This is good, we need to share experiences, but often people wouldn’t filter out org-specific details. We need to be mindful about holding the floor; the groups are really diverse, and I’m pretty sure that the KDE guys don’t want to hear the details of symbolic algebra algorithms.
  • Sessions were sometimes dominated by one person.
  • In general I think that we should use neutral meeting facilitators in the larger sessions. I think they could be much more productive with a light amount of moderation.

Specific Interactions with other Orgs

It was really cool to put physical humans to all of the software projects I’ve benefited from over the years. It’s awesome to realize that it’s all built by people, not by some abstract force. I had a number of positive experiences with orgs like Sage and SciLab that are strongly related to SymPy, as well as orgs that are completely unrelated, like OpenIntents, Scala, and Tor.

Conclusions

I had a good time and came away with thoughts of the future. We have something pretty cool here and I think that we should think more aggressively about where we want to take it.


Simplifying Sets

SymPy’s sets module is a pleasure to work on. The math is approachable and well structured. There are basic sets (Intervals, FiniteSets), compound sets (Unions, Intersections, Cartesian Products), and operations (contains, complement, measure, subset). Because the problem is easy to understand and intrinsically simple, sets is a great project for practicing coding. Can we write code that is as simple as the problem we’re solving?

Historically I have been bad at this. I am guilty of writing needlessly complex code. A friend recently sent me a talk by Rich Hickey, the creator of Clojure, about simplicity versus ease. I decided to try to make the SymPy.Sets code simpler as an educational project.

The current issue with sets is that many classes contain code to interact with every other type of class. That is, we have code that looks like this:

def operation(self, other):
    # dispatch on the type of `other`; each branch carries its own special cases
    if other.is_FiniteSet:
        ...
    elif other.is_Interval:
        ...
    elif other.is_ProductSet:
        ...

This is because the rules to, say, join the FiniteSet {1, 2, 3, 4} with the Interval [2, 3) can be complex. The sets module handles this all marvelously well and produces [2, 3] U {1, 4}, a nice answer. The code that does it, however, is atrocious and filled with nests of rules and special cases. Much of this code is in the Union and RealUnion classes, but some of it is in FiniteSet and some in Interval as well. Everything works; it’s just complex.
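
For concreteness, here is roughly how that example looks at the interactive prompt (the exact printed form may vary between SymPy versions):

>>> from sympy import FiniteSet, Interval, Union
>>> Union(FiniteSet(1, 2, 3, 4), Interval(2, 3, False, True))  # {1,2,3,4} with [2, 3)
[2, 3] U {1, 4}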

This is similar to the situation in Mul.flatten and friends.

So what is the solution for Sets? How do we simplify Union and Intersection?

First, let’s acknowledge that Union/Intersection serve two purposes:

  1. They serve as a container of sets
  2. They simplify these sets using known rules

We separate these two aspects and solve them independently.

We separate these two in the same way Mul and Add handle it: we create a reduce/flatten method and, while we call it by default, it is now separate from the construction logic. There has been talk about separating these two parts of our container classes even further, with container classes that only contain and simplifiers/canonicalizers that only simplify/canonicalize.

We need a simple way to manage all of the special rules we know for simplifying collections of sets. The issue is that there are a lot of special cases: FiniteSets can do some things, Intervals others, and how do we anticipate not-yet-defined sets? Our solution is as follows.

Every set class has methods _union(self, other) and _intersect(self, other). These methods contain local simplification rules; that is, if self knows how to interact with other it returns a new, simplified set, otherwise it returns None, meaning “I don’t know what to do in this situation.” For example, Intervals know how to intersect themselves with other Intervals but not with FiniteSets; luckily, FiniteSets know how to do this. Together they can handle any situation between them.

Here are the local interaction methods for EmptySet.

def _union(self, other):
    # the empty set adds nothing to a union
    return other

def _intersect(self, other):
    # nothing intersects with the empty set
    return S.EmptySet

These are particularly simple, are known only by EmptySet, and yet produce proper behavior in any interaction. When we add EmptySet to the family of Sets we don’t need to add code to Union or Intersection. Everything is nicely contained.
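
For a less trivial illustration, here is a sketch of what an Interval’s local intersection rule might look like. This is illustrative rather than the actual module code, and it glosses over ties and open-endpoint subtleties:

def _intersect(self, other):
    # Only handle the Interval-Interval case locally; returning None
    # defers anything else to the other set's rules.
    if not other.is_Interval:
        return None
    # Disjoint intervals intersect to the empty set.
    if self.end < other.start or other.end < self.start:
        return S.EmptySet
    # Otherwise the intersection runs from the larger start to the
    # smaller end, each endpoint keeping its open/closed flag.
    start, end = Max(self.start, other.start), Min(self.end, other.end)
    left_open = other.left_open if start == other.start else self.left_open
    right_open = other.right_open if end == other.end else self.right_open
    return Interval(start, end, left_open, right_open)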

When they simplify, the Union and Intersection classes do two things:

  1. They walk over the collection of sets and use local rules to perform simplifications
  2. They also contain a few “global rules” that can accelerate the process by looking at the entire collection of sets at once.
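
To make the first step concrete, here is a minimal sketch of that pairwise walk (reduce_sets is a hypothetical helper, not the actual Union internals):

import itertools

def reduce_sets(sets, rule='_union'):
    # Repeatedly look for a pair that knows how to merge, replacing the
    # pair with its simplified result, until nothing more changes.
    sets = list(sets)
    changed = True
    while changed:
        changed = False
        for a, b in itertools.combinations(sets, 2):
            # Ask each side in turn; None means "I don't know".
            new = getattr(a, rule)(b)
            if new is None:
                new = getattr(b, rule)(a)
            if new is not None:
                sets.remove(a)
                sets.remove(b)
                sets.append(new)
                changed = True
                break
    return sets  # the irreducible sets, to be wrapped in a container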

In this way it is very easy to extend the Sets module with new classes without breaking Union and Intersection. Additionally, the old nest of code has been cleanly separated and placed into the relevant classes. Unions and Intersections no longer need to know every possible interaction between every possible Set. Instead they manage interactions and let Sets simplify themselves.

A final note: I like this idea of managing many small simplification rules. I stole it from Theano, a symbolic/numeric Python library. They go one step further and separate the rule from the container class; that is, rather than telling Intervals how to interact with Intervals, they write a separate rule and register it with a separate simplifying manager. If this idea interests you I suggest you look at their documentation on optimizations.


sympy.stats is in

Development

It seems there was a flurry of development over the winter holidays.

Tom’s Meijer-G integration code was merged into master giving SymPy an incredibly powerful definite integration engine. This encouraged me to finish up the pull request for random variables.

Earlier this morning we finally merged it in and sympy.stats is now in master. If you’re interested please play with it and generate feedback. At the very least it should be able to solve many of your introductory stats homework problems :)

Actually, I tried using it on a non-trivial example last month and generated an integral that killed the integration engine (mostly due to a combination of trigonometric and delta functions). However, I still really wanted the result. The standard solution to analytically intractable statistics problems is to sample, and this pushed me to build a Monte Carlo engine into sympy.stats.

Sampling

The family of stats functions P, E, Var, Density, and Given now has a new member: Sample. You can generate a random sample of any random expression as follows:

>>> from sympy.stats import *
>>> X, Y = Die(6), Die(6)
>>> roll = X+Y
>>> Sample(roll)
10
>>> Sample(roll)
5
>>> Sample(X, roll>10) # Sample X given that X+Y>10
6

Sampling is, of course, more foolproof than solving integrals, so expressions can be made arbitrarily complex without issue. This sampling mechanism is also built into the probability and expectation functions through the keyword “numsamples”:

>>> from sympy.stats import *
>>> X, Y = Normal(0, 1), Normal(0, 1)
>>> P(X>Y)
1/2
>>> P(X>Y, numsamples = 1000)
499 
────
1000
>>> E(X+Y)
0
>>> E(X+Y, numsamples = 1000)
-0.0334982435603208

Future

GSoC 2012 was announced a couple days ago. I’m excited to see what projects are proposed.


Week 12: Pull Requests

As before, not much to report. Slow plodding through testing, bug fixing, etc….

I have a pull request here for Matrix Expressions

https://github.com/sympy/sympy/pull/532

My branch for Finite and Continuous Random Variables is below. It doesn’t have a pull request yet (I’m waiting for Tom’s code to get in), but I’d be thrilled if anyone wanted to look it over in the meantime.

https://github.com/mrocklin/sympy/tree/rv2

There is another branch for Multivariate Random Normals that depends on the previous two. I suspect that it might have to change based on feedback from the previous two branches. It’s probably not worth reviewing at this point but, if you’re interested, here it is.

https://github.com/mrocklin/sympy/tree/mvn_rv


Week 11: Testing, cleaning

I’m increasing test coverage and fixing errors in my random variables branch.

I’m not sure how to proceed with the matrix expressions ideas. On one hand I should wait until the community comes to a consensus about what SymPy Matrix Expressions should be (or even if they should be at all). On the other hand I don’t ever see this consensus happening. How do I spur on a decision here?


Week 10 for Random Variables

I’ve been neglecting my GSoC project this week. This is what’s on the burner though:

  1. Write up a blogpost on my implementation of Matrix Expressions. What they can and can’t do. I’d like to generate discussion on this topic.
  2. Test my code against Tom’s integration code. This has been happening over the last 24 hours actually. It’s cool to see lots of new things work and work well – I feel like I’m driving a sports car. I think that this cross-branch testing has been helpful to locate bugs in both of our codebases.
  3. After I check what will and won’t work with Tom’s code I need to fill out tests and polish documentation for my main Discrete and Continuous RV branch. It’d be nice to have it presentable to the community for review.


Multivariate Normal Random Variables

Multivariate Normal Random Variables are extraordinarily convenient.

The probability density of a multivariate normal random variable is proportional to the following:

f(X) \propto e^{- \frac{1}{2} \left(X - \mu\right)^T \Sigma^{-1} \left(X - \mu\right)}

where X is an n-dimensional state vector, \mu is the mean of the distribution, and \Sigma is an n by n covariance matrix. Pictorially, a 2-D density might be represented like this:

With contour lines showing probability levels dropping off around the mean (blue x). This distribution is entirely defined by two quantities: \mu, which gives the center of the distribution, and \Sigma, which effectively gives the shape of the ellipses. That is, rather than carry around the functional form above, we can simply write X \sim N(\mu, \Sigma) and forget the rest.

Multivariate normals are convenient for three reasons:

  1. They are easy to represent – we only need a mean and covariance matrix
  2. Linear transformations of normals are again normals
  3. All operations are represented with linear algebra

First off, multivariate normals are simple to represent. This ends up being a big deal for functions on very high-dimensional spaces: a general function of 1000 variables is hopeless to write down, while a normal over the same variables needs only the 1000 entries of \mu and the roughly 500,000 distinct entries of the symmetric \Sigma.

Second, linear functions of normals are again normals. This is huge. For example, it means we could project the image above onto one of the coordinate axes (or any axis) and get out our old friend the bell curve. As we operate on our random variables, the three conveniences remain true.

Third, these linear transformations of random variables are computed solely through linear algebra on the mean and covariance matrix. Fortunately, linear algebra is something about which we know quite a bit.
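
Concretely, the second and third points combine into a single identity. If X \sim N(\mu, \Sigma) and H is a compatible matrix, then

H X \sim N\left(H \mu,\ H \Sigma H^T\right)

and adding independent noise \eta \sim N(0, R) gives H X + \eta \sim N\left(H \mu,\ H \Sigma H^T + R\right), which is exactly what the code below computes.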

So, as long as we’re willing to say that our variables are normally distributed (which is often not far from the truth) we can efficiently represent and compute on huge spaces of interconnected variables.


Multivariate Normals (MVNs) have been a goal of mine for some time while working on this project. They’re where this project starts to intersect with my actual work. I do lots of manipulations on MVNs and would like to stop dealing with all the matrix algebra.

In order to build them correctly it was clear I would need a relatively powerful symbolic matrix expression system. I’ve been working on something over at this branch.

Now we can represent symbolic matrices and, using them, MVN random variables:

# Let's make a Multivariate Normal Random Variable
>>> mu = MatrixSymbol('mu', n, 1) # n by 1 mean vector
>>> Sigma = MatrixSymbol('Sigma', n, n) # n by n covariance matrix
>>> X = Normal(mu, Sigma, 'X') # a multivariate normal random variable

# Density is represented just by the mean and covariance
>>> Density(X)
(μ, Σ)

>>> H = MatrixSymbol('H', k, n) # A linear operator
>>> Density(H*X) # What is the density of X after being transformed by H?
(H⋅μ, H⋅Σ⋅H')

# Let's make some measurement noise
>>> zerok = ZeroMatrix(k, 1) # mean zero
>>> R = MatrixSymbol('R', k, k) # symbolic covariance matrix
>>> noise = Normal(zerok, R, 'eta')

# Density after noise added in?
>>> Density(H*X + noise) # This is a Block matrix
⎛[H  I]⋅⎡μ⎤, [H  I]⋅⎡Σ  0⎤⋅⎡H'⎤⎞
⎜       ⎢ ⎥        ⎢    ⎥ ⎢  ⎥⎟
⎝       ⎣0⎦        ⎣0  R⎦ ⎣I ⎦⎠

# When we collapse the above expression it looks much nicer
>>> block_collapse(Density(H*X + noise))
(H⋅μ, R + H⋅Σ⋅H')

# Now let's imagine that we observe some value of HX+noise.
# What does that tell us about X? How does our prior distribution change?
>>> data = MatrixSymbol('data', k, 1)
>>> Density(X, Eq(H*X + noise, data))  # Density of X given HX+noise == data
# I'm switching to the latex printer for this

\left[\begin{smallmatrix}\mathbb{I} & \bold{0}\end{smallmatrix}\right] \left(\left[\begin{smallmatrix}\Sigma & \bold{0}\\\bold{0} & R\end{smallmatrix}\right] \left[\begin{smallmatrix}H^T\\\mathbb{I}\end{smallmatrix}\right] \left(\left[\begin{smallmatrix}H & \mathbb{I}\end{smallmatrix}\right] \left[\begin{smallmatrix}\Sigma & \bold{0}\\\bold{0} & R\end{smallmatrix}\right] \left[\begin{smallmatrix}H^T\\\mathbb{I}\end{smallmatrix}\right]\right)^{-1} \left( \left[\begin{smallmatrix}H & \mathbb{I}\end{smallmatrix}\right] \left[\begin{smallmatrix}\mu\\\bold{0}\end{smallmatrix}\right] - data\right) + \left[\begin{smallmatrix}\mu\\\bold{0}\end{smallmatrix}\right]\right)

\left[\begin{smallmatrix}\mathbb{I} & \bold{0}\end{smallmatrix}\right] \left(\mathbb{I} - \left[\begin{smallmatrix}\Sigma & \bold{0}\\\bold{0} & R\end{smallmatrix}\right] \left[\begin{smallmatrix}H^T\\\mathbb{I}\end{smallmatrix}\right] \left(\left[\begin{smallmatrix}H & \mathbb{I}\end{smallmatrix}\right] \left[\begin{smallmatrix}\Sigma & \bold{0}\\\bold{0} & R\end{smallmatrix}\right] \left[\begin{smallmatrix}H^T\\\mathbb{I}\end{smallmatrix}\right]\right)^{-1} \left[\begin{smallmatrix}H & \mathbb{I}\end{smallmatrix}\right]\right) \left[\begin{smallmatrix}\Sigma & \bold{0}\\\bold{0} & R\end{smallmatrix}\right] \left[\begin{smallmatrix}\mathbb{I}\\\bold{0}\end{smallmatrix}\right]

# Again, this block matrix expression can be collapsed to the following
>>> block_collapse(Density(X, Eq(H*X + noise, data)))
μ + Σ⋅H'⋅(R + H⋅Σ⋅H')^-1⋅(-H⋅μ + -data) ,
(I + -Σ⋅H'⋅(R + H⋅Σ⋅H')^-1⋅H)⋅Σ

\begin{smallmatrix}\mu + \Sigma H^T \left(R + H \Sigma H^T\right)^{-1} \left(  H \mu - data\right)\end{smallmatrix}

\begin{smallmatrix}\left(\mathbb{I} - \Sigma H^T \left(R + H \Sigma H^T\right)^{-1} H\right) \Sigma\end{smallmatrix}

This is the multivariate case of my previous post on data assimilation. Effectively, all I’ve done here is bake in the logic behind the Kalman filter and expose it through my statistics operators Density, Given, etc., so that it becomes more approachable.
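
For reference, the collapsed expressions above are, up to sign conventions, the standard Kalman filter measurement update with gain

K = \Sigma H^T \left(R + H \Sigma H^T\right)^{-1}

giving the updated mean and covariance

\mu' = \mu + K \left(data - H \mu\right) \qquad \Sigma' = \left(\mathbb{I} - K H\right) \Sigma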

Some disclaimers:

  1. This is all untested. Please let me know if something is wrong. Already I see an error with the latex printing.
  2. For organizational reasons it seems unlikely that Matrix Expressions will make it into SymPy in their current form. As a result this code probably won’t make it into SymPy any time soon.

My active branch is over here:
https://github.com/mrocklin/sympy/tree/mvn_rv/

with the multivariate normal code here:
https://github.com/mrocklin/sympy/tree/mvn_rv/sympy/statistics/mvnrv.py

The matrices live here:
https://github.com/mrocklin/sympy/tree/matrix_expr/sympy/matrices
