How should we improve our application process to attract good students, detect good students, and match good students to appropriate mentors?

We should ask indirect questions to query for curiosity and passion.

Negative examples:

- Why do you want to work on SymPy?
- Why do you like Math?
- How long have you been programming?

Positive examples:

- Copy the favorite snippet of SymPy code you’ve seen here and tell us why you like it.
- What aspects of SymPy do you like the most?
- What editor do you use and why?

Our experience was that direct questions tend to have low information content (everyone says the same thing). Indirect questions will be ignored by the lazy but engage the engaged. You often want to test for curiosity and passion more than for actual experience in the domain.

We should match mathematically strong students with strong-programmer mentors, and strong-programmer students with mathematically strong mentors. We often do the opposite because of shared interests, but complementary pairings might result in more ideal contributions.

Other people have funding. Should we? What would we do with it? How would we get it? It might not be as hard as we think. Who uses us? Can we get a grant? Are there companies who might be willing to fund directed work on SymPy?

This is my first time physically interacting with SymPy contributors other than my old mentor. It was a really positive experience. As a community we’re pretty distributed, both geographically and in applications/modules. Getting together and talking about SymPy was oddly fascinating. We should do it more. It made us think about SymPy at a bigger scale.

Some thoughts

- Do we want to organize a SymPy meetup (perhaps collocated with some other conference like SciPy)? What would this accomplish?
- What is our big plan for SymPy? Do we have one or are we all just a bunch of hobbyists who work on our own projects? Are we actively pursuing a long term vision? I think that we could be more cohesive and generate more forward momentum. I think that this can be created by occasional collocation.
- This could also be accomplished by some sort of digital meetup that’s more intense than the e-mail/IRC list. An easy test version of this could be a monthly video conference.

I’m accustomed to academic conferences. I recently had a different experience at the SciPy conference, which mixed academic research with code. I really liked this mix of theory and application and had a great time at SciPy. The GSoC summit amplified this change, replacing a lot of the academics with attendees who were purely interested in code. This was personally very strange for me; I felt like an outsider.

The scientific/numeric python community doesn’t care as intensely about many of the issues that are religion to a substantial fraction of the open source world. My disinterest in these topics and my interest in more esoteric/academic topics also made me feel foreign. There were still people like me though and they were very fun to find, just a bit rarer.

This is the first conference I’ve been to where I was one of the better dressed attendees :)

Other projects of our size exist under an umbrella organization like the Apache foundation. I see our local community as the numpy/scipy/matplotlib stack. How can we more tightly integrate ourselves with this community? NumFocus was started up recently. Should we engage/use NumFocus more? How can we make use of and how can we support our local community?

This section contains my thoughts about the summit itself. The summit had a distinctive structure, and I’ll share my opinions about that structure.

The informal meeting spaces were excellent. Far better than the average academic conference. I felt very comfortable introducing myself and my project to everyone. It was a very social and outgoing crowd.

Some of the sessions were really productive and helpful. The unconference structure had a few strong successes.

There were a lot of sessions that could have been better organized.

- Frequently we didn’t have a goal in mind; this can be ok but I felt that in many cases a clear goal would have kept conversation on topic.
- People very often wanted to share their experiences from events in their organization. This is good, we need to share experiences, but often people wouldn’t filter out org-specific details. We need to be mindful about holding the floor. We have really diverse groups and I’m pretty sure that the KDE guys don’t want to hear the details of symbolic algebra algorithms.
- Sessions were sometimes dominated by one person.
- In general I think that we should use neutral meeting facilitators within the larger sessions. I think that they could be much more productive with a light amount of control.

It was really cool to associate physical humans to all of the software projects I’ve benefitted from over the years. It’s awesome to realize that it’s all built by people, and not by some abstract force. I had a number of positive experiences with orgs like Sage and SciLab that are strongly related to SymPy as well as orgs that are completely unrelated like OpenIntents, Scala, and Tor.

I had a good time and came away with thoughts of the future. We have something pretty cool here and I think that we should think more aggressively about where we want to take it.

---

Historically I have been bad at this. I am guilty of writing needlessly complex code. A friend recently sent me a talk by Rich Hickey, the creator of Clojure, about simplicity versus ease. I decided to try to make the SymPy.Sets code simpler as an educational project.

The current issue with sets is that many classes contain code to interact with every other type of class. I.e. we have code that looks like this:

```python
def operation(self, other):
    if other.is_FiniteSet:
        ...
    if other.is_Interval:
        ...
    if other.is_ProductSet:
        ...
```

This is because the rules to, say, join the FiniteSet `{1,2,3,4}` with the Interval `[2, 3)` can be complex. The sets module handles this all marvelously well and produces `[2, 3] U {1, 4}`, a nice answer. The code to do it, however, is atrocious, filled with nests of rules and special cases. Much of this code is in the Union and RealUnion classes, but some of it is in FiniteSet and some in Interval as well. Everything works; it’s just complex.

This is similar to the situation in `Mul.flatten` and friends.

So what is the solution for Sets? How do we simplify Union and Intersection?

First, let’s acknowledge that Union/Intersection serve two purposes:

- They serve as a container of sets
- They simplify these sets using known rules

We separate these two aspects and solve them independently.

We separate these two in the same way Mul and Add handle it: we create a reduce/flatten method and, while we call it by default, it is now separate from the construction logic. There has been talk about separating these two parts of our container classes even further, with container classes that only contain and simplifiers/canonicalizers that only simplify/canonicalize.
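As a sketch of what separating construction from reduction looks like (hypothetical names, not SymPy’s actual code):

```python
# A minimal sketch of splitting a container class into "contain" and
# "canonicalize" parts, the way Mul and Add do with their flatten step.
# Construction merely stores arguments; a separate reduce method, called
# by default, applies the simplification rules.

EMPTY = frozenset()  # stand-in for S.EmptySet

class Union:
    def __init__(self, *sets, evaluate=True):
        # reduction is separate from construction and can be skipped
        self.args = self.reduce(sets) if evaluate else tuple(sets)

    @staticmethod
    def reduce(sets):
        # one trivial "known rule": empty sets never change a union;
        # real code would apply many such rules
        return tuple(s for s in sets if s != EMPTY)

u = Union(frozenset({1, 2}), EMPTY, frozenset({3}))
# u.args == (frozenset({1, 2}), frozenset({3}))
```

Passing `evaluate=False` gives back the raw, unsimplified container, which is exactly the separation being described.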

We need a simple way to manage all of the special rules we know for simplifying collections of sets. The issue is that there are a lot of special cases; FiniteSets can do some things, Intervals others, and how do we anticipate not-yet-defined sets? Our solution is as follows.

Every set class has methods `_union(self, other)` and `_intersect(self, other)`. These methods contain local simplification rules. I.e. if `self` knows how to interact with `other` it returns a new, simplified set; otherwise it returns `None` for “I don’t know what to do in this situation”. For example `Intervals` know how to intersect themselves with other `Intervals` but they don’t know how to interact with `FiniteSets`; luckily `FiniteSets` know how to do this. Together they know how to handle any situation between them.

Here are the local interaction methods for `EmptySet`:

```python
def _union(self, other):
    return other

def _intersect(self, other):
    return S.EmptySet
```

These are particularly simple, are known only by EmptySet, and yet produce proper behavior in any interaction. When we add EmptySet to the family of Sets we don’t need to add code to Union or Intersection. Everything is nicely contained.

When they simplify, the Union and Intersection classes do two things.

- They walk over the collection of sets and use local rules to perform simplifications
- They also contain a few “global rules” that can accelerate the process by looking at the entire collection of sets at once.
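The first of these, walking the collection with local rules, might look like this in plain Python (hypothetical classes, not SymPy’s actual implementation):

```python
# A sketch of the walk-and-apply-local-rules step: keep trying pairwise
# _union rules, in both directions, until no simplification applies.

class Set:
    def _union(self, other):
        return None  # "I don't know how to combine with other"

class EmptySet(Set):
    def _union(self, other):
        return other  # the empty set never changes a union

class FiniteSet(Set):
    def __init__(self, *elems):
        self.elems = frozenset(elems)
    def _union(self, other):
        if isinstance(other, FiniteSet):
            return FiniteSet(*(self.elems | other.elems))
        return None  # defer to the other class's rules

def union_reduce(sets):
    """Apply local _union rules pairwise until a fixed point is reached."""
    sets = list(sets)
    changed = True
    while changed:
        changed = False
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                a, b = sets[i], sets[j]
                new = a._union(b)
                if new is None:
                    new = b._union(a)  # try the rule in the other direction
                if new is not None:
                    sets = [s for k, s in enumerate(sets) if k not in (i, j)]
                    sets.append(new)
                    changed = True
                    break
            if changed:
                break
    return sets

result = union_reduce([FiniteSet(1, 2), FiniteSet(2, 3), EmptySet()])
# result is a single FiniteSet containing {1, 2, 3}
```

Note that adding a new Set subclass only requires writing its own `_union`; the reducer never needs to change.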

In this way it is very easy to extend the Sets module with new classes without breaking Union and Intersection. Additionally, the old nest of code has been cleanly separated and placed into the relevant classes. Unions and Intersections no longer need to know every possible interaction between every possible Set. Instead they manage interactions and let Sets simplify themselves.

A final note. I like this idea of managing many small simplification rules. I stole this idea from Theano, a symbolic/numeric python library. They go one step further though and separate the rule from the container class. I.e. rather than telling Intervals how to interact with Intervals they make a separate rule and include it in some separate simplifying manager. If this idea interests you I suggest you look at their documentation on optimizations.

---

It seems there was a flurry of development over the winter holidays.

Tom’s Meijer-G integration code was merged into master giving SymPy an incredibly powerful definite integration engine. This encouraged me to finish up the pull request for random variables.

Earlier this morning we finally merged it in and sympy.stats is now in master. If you’re interested please play with it and generate feedback. At the very least it should be able to solve many of your introductory stats homework problems :)

Actually, I tried using it for a non-trivial example last month and generated an integral which killed the integration engine (mostly due to a combination of trigonometric and delta functions). However, I still really wanted the result. The standard solution to analytically intractable statistics problems is to sample. This pushed me to build a Monte Carlo engine into sympy.stats.

The family of stats functions (P, E, Var, Density, Given) now has a new member: Sample. You can generate a random sample of any random expression as follows:

```
>>> from sympy.stats import *
>>> X, Y = Die(6), Die(6)
>>> roll = X + Y
>>> Sample(roll)
10
>>> Sample(roll)
5
>>> Sample(X, roll > 10)  # Sample X given that X+Y > 10
6
```

Sampling is, of course, more robust than solving integrals, so expressions can be made arbitrarily complex without issue. This sampling mechanism is also built into the probability and expectation functions through the keyword `numsamples`:

```
>>> from sympy.stats import *
>>> X, Y = Normal(0, 1), Normal(0, 1)
>>> P(X > Y)
1/2
>>> P(X > Y, numsamples=1000)
 499
────
1000
>>> E(X + Y)
0
>>> E(X + Y, numsamples=1000)
-0.0334982435603208
```
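The idea behind `numsamples` can be sketched in a few lines of plain Python (an illustration of the technique, not sympy.stats internals):

```python
# Estimate P(X > Y) by drawing samples instead of solving an integral.
import random

def estimate_probability(event, nsamples=100_000, seed=0):
    """Estimate P(event) by Monte Carlo sampling."""
    rng = random.Random(seed)
    hits = sum(event(rng) for _ in range(nsamples))
    return hits / nsamples

# X, Y ~ Normal(0, 1); by symmetry the exact answer is 1/2
p = estimate_probability(lambda rng: rng.gauss(0, 1) > rng.gauss(0, 1))
```

The estimate converges at the usual 1/sqrt(n) Monte Carlo rate, which is why the `numsamples=1000` answers above hover near, but not exactly at, the analytic values.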

GSoC 2012 was announced a couple days ago. I’m excited to see what projects are proposed.

---

I have a pull request here for Matrix Expressions

https://github.com/sympy/sympy/pull/532

My branch for Finite and Continuous Random Variables is below. It doesn’t have a pull request yet (I’m waiting for Tom’s code to get in) but I’d be thrilled if anyone wanted to look it over in the meantime.

https://github.com/mrocklin/sympy/tree/rv2

There is another branch for Multivariate Random Normals that depends on the previous two. I suspect that it might have to change based on feedback from the previous two branches. It’s probably not worth reviewing at this point but, if you’re interested, here it is.

https://github.com/mrocklin/sympy/tree/mvn_rv

---

I’m not sure how to proceed with the matrix expressions ideas. On one hand I should wait until the community comes to a consensus about what SymPy Matrix Expressions should be (or even if they should be at all). On the other hand I don’t ever see this consensus happening. How do I spur on a decision here?

---

- Write up a blogpost on my implementation of Matrix Expressions. What they can and can’t do. I’d like to generate discussion on this topic.
- Test my code against Tom’s integration code. This has been happening over the last 24 hours actually. It’s cool to see lots of new things work and work well – I feel like I’m driving a sports car. I think that this cross-branch testing has been helpful to locate bugs in both of our codebases.
- After I check what will and won’t work with Tom’s code I need to fill out tests and polish documentation for my main Discrete and Continuous RV branch. It’d be nice to have it presentable to the community for review.

---

The probability density of a multivariate normal random variable is proportional to the following:

exp(−(x − μ)′ Σ⁻¹ (x − μ) / 2)

where x is an n-dimensional state vector, μ is the mean of the distribution, and Σ is an n by n covariance matrix. Pictorially, a 2-D density might be represented like this:

With contour lines showing probability levels dropping off around the mean (blue x). This distribution is entirely defined by two quantities: μ, which gives the center of the distribution, and Σ, which effectively gives the shape of the ellipses. That is, rather than carry around the functional form above, we can simply define X as (μ, Σ) and forget the rest.

Multivariate normals are convenient for three reasons

- They are easy to represent – we only need a mean and covariance matrix
- Linear transformations of normals are again normals
- All operations are represented with linear algebra

First off, multivariate normals are simple to represent. This ends up being a big deal for functions on very high dimensional spaces. Imagine writing down a general function on 1000 variables.

Second, linear functions of normals are again normals. This is huge. For example this means that we could project the image above to one of the coordinate axes (or any axis) and get out our old friend the bell curve. As we work on our random variables the three conveniences remain true.

Third, the computation to perform these linear transformations of random variables is done solely through linear algebra on the mean and covariance matrices. Fortunately, linear algebra is something about which we know quite a bit.
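Conveniences (2) and (3) together say that a normal stays normal under a linear map and that the update is pure linear algebra on (μ, Σ): if X ~ N(μ, Σ) then H·X ~ N(H·μ, H·Σ·H'). A small numeric sketch (a NumPy illustration of mine, not SymPy code):

```python
# Push a multivariate normal through a linear map by updating only
# the mean vector and covariance matrix.
import numpy as np

def transform_normal(mu, Sigma, H):
    """If X ~ N(mu, Sigma), return the (mean, cov) of H @ X."""
    return H @ mu, H @ Sigma @ H.T

mu = np.array([[1.0], [2.0]])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
H = np.array([[1.0, 1.0]])  # project onto the sum of the coordinates

new_mu, new_Sigma = transform_normal(mu, Sigma, H)
# new_mu == [[3.0]]
# new_Sigma == [[4.0]], i.e. Var(X1) + Var(X2) + 2*Cov(X1, X2)
```

No density function ever appears; the whole computation lives in the mean and covariance, which is the point.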

So, as long as we’re willing to say that our variables are normally distributed (which is often not far from the truth) we can efficiently represent and compute on huge spaces of interconnected variables.

Multivariate Normals (MVNs) have been a goal of mine for some time while working on this project. They’re where this project starts to intersect with my actual work. I do lots of manipulations on MVNs and would like to stop dealing with all the matrix algebra.

In order to build them correctly it was clear I would need a relatively powerful symbolic matrix expression system. I’ve been working on something over at this branch.

Now we can represent symbolic matrices and, using them, represent MVN Random Variables

```
# Lets make a Multivariate Normal Random Variable
>>> mu = MatrixSymbol('mu', n, 1)        # n by 1 mean vector
>>> Sigma = MatrixSymbol('Sigma', n, n)  # n by n covariance matrix
>>> X = Normal(mu, Sigma, 'X')           # a multivariate normal random variable

# Density is represented just by the mean and covariance
>>> Density(X)
(μ, Σ)

>>> H = MatrixSymbol('H', k, n)  # A linear operator
>>> Density(H*X)  # What is the density of X after being transformed by H?
(H⋅μ, H⋅Σ⋅H')

# Lets make some measurement noise
>>> zerok = ZeroMatrix(k, 1)     # mean zero
>>> R = MatrixSymbol('R', k, k)  # symbolic covariance matrix
>>> noise = Normal(zerok, R, 'eta')

# Density after noise added in?
>>> Density(H*X + noise)  # This is a Block matrix
⎛[H I]⋅⎡μ⎤, [H I]⋅⎡Σ 0⎤⋅⎡H'⎤⎞
⎝      ⎣0⎦       ⎣0 R⎦ ⎣I ⎦⎠

# When we collapse the above expression it looks much nicer
>>> block_collapse(Density(H*X + noise))
(H⋅μ, R + H⋅Σ⋅H')

# Now lets imagine that we observe some value of HX + noise.
# What does that tell us about X? How does our prior distribution change?
>>> data = MatrixSymbol('data', k, 1)
>>> Density(X, Eq(H*X + noise, data))  # Density of X given HX+noise == data
# I'm switching to the latex printer for this
```

```
# Again, this block matrix expression can be collapsed to the following
>>> block_collapse(Density(X, Eq(H*X + noise, data)))
(μ + Σ⋅H'⋅(R + H⋅Σ⋅H')^-1⋅(-H⋅μ + -data),  (I + -Σ⋅H'⋅(R + H⋅Σ⋅H')^-1⋅H)⋅Σ)
```

This is the multivariate case of my previous post on data assimilation. Effectively all I’ve done here is baked in the logic behind the Kalman Filter and exposed it through my statistics operators Density, Given, etc… so that it has become more approachable.
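The collapsed expression is exactly the standard Kalman/Gaussian-conditioning update. As a sanity check, here is a small numeric version using the usual textbook formulas (a NumPy sketch of mine, not SymPy code):

```python
# Condition a normal prior N(mu, Sigma) on an observation data = H x + eta,
# with eta ~ N(0, R), using the standard Kalman update:
#   posterior mean = mu + Sigma H' (R + H Sigma H')^-1 (data - H mu)
#   posterior cov  = (I - Sigma H' (R + H Sigma H')^-1 H) Sigma
import numpy as np

def condition_normal(mu, Sigma, H, R, data):
    S = R + H @ Sigma @ H.T             # innovation covariance
    K = Sigma @ H.T @ np.linalg.inv(S)  # gain
    return mu + K @ (data - H @ mu), (np.eye(mu.shape[0]) - K @ H) @ Sigma

# 1-D check: prior N(30, 9), observation noise variance 2.25, observed 26
post_mu, post_Sigma = condition_normal(
    np.array([[30.0]]), np.array([[9.0]]),
    np.array([[1.0]]), np.array([[2.25]]), np.array([[26.0]]))
# post_mu ≈ [[26.8]], post_Sigma ≈ [[1.8]]
```

The symbolic output above encodes this same update for matrices of arbitrary symbolic shape.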

Some disclaimers.

1) This is all untested. Please let me know if something is wrong. Already I see an error with the latex printing.

2) For organizational reasons it seems unlikely that Matrix Expressions will make it into SymPy in their current form. As a result this code probably won’t make it into SymPy any time soon.

My active branch is over here:

https://github.com/mrocklin/sympy/tree/mvn_rv/

with the multivariate normal code here:

https://github.com/mrocklin/sympy/tree/mvn_rv/sympy/statistics/mvnrv.py

The matrices live here:

https://github.com/mrocklin/sympy/tree/matrix_expr/sympy/matrices

---

Matrices are used in a number of contexts and, understandably, SymPy represents them in a few ways. You can represent a matrix or linear operator with a Symbol or you can write out a matrix’s components explicitly with a Matrix object. Thanks to recent work by Sherjil over here, Matrix objects are quickly becoming more powerful.

Recently I’ve wanted to build up purely symbolic matrix expressions using Symbol but kept running into problems because I didn’t want to add things to the SymPy core Expr that were specific to matrices. The standard SymPy Expr wasn’t really designed with Matrices in mind and I found that this was holding me back a bit.

I decided to branch off a MatrixExpr class that, while much less stable, is open to experimentation. It’s been lots of fun so far. I’ve used it for my GSoC project to build up large expressions using block matrices.

I’ll have examples in a future post related to my GSoC project. For now if you’d like to check it out my code resides here:

https://github.com/mrocklin/sympy/tree/matrix_expr/sympy/matrices

There is a MatrixExpr class with associated MatrixSymbol, MatAdd, MatMul, MatPow, Inverse, Transpose, Identity, ZeroMatrix objects. All the things you need for basic expressions. Most of the logic still depends on the subclassed Add, Mul, Pow classes with a little bit added on.

Also, because my GSoC project needed it, I built a fun BlockMatrix class that holds MatrixExprs and can be freely mixed with normal MatrixExprs in an expression.

---

The next big step is to write up a Multivariate Normal Random Variable. Operations on these will generate expressions in Linear Algebra. I’m still working out the best way to integrate this into SymPy.

A bit less exciting (though arguably just as important) I’ve started writing tests and filling in gaps for Discrete and Continuous RVs.

---

Ok, I went outside and felt the air temperature. I think that it is 30C but I’m not very good at telling the temperature when it’s humid out; we’ll say that it’s 30C with a standard deviation of 3C. That is, we wouldn’t be surprised if it were 27C or 33C, but we’re confident that it’s not 20C or 40C. Rather than represent the temperature as a number like 30, let’s represent it with a Normal random variable with mean 30 and standard deviation 3.

`>>> T = Normal(30, 3)`

We represent this pictorially as follows.

Hopefully you find this representation intuitive. If you’re a math guy you might like the functional form of this:

```
>>> Density(T)
sqrt(2)*exp(-(x - 30)**2/18)/(6*sqrt(pi))
```

You’ll see that as x gets away from the mean, 30, the probability starts to drop off rapidly. The speed of the drop off is moderated by 2 times the square of the standard deviation, 18, present in the denominator of the exponent.

We’ll call this curve the prior distribution of T. Prior to what you might ask? Prior to me going outside with a thermometer to measure the actual temperature.

My thermometer is small and the lines are very close together. This makes it difficult to get a precise measurement. I think I can only measure the temperature to the nearest one or two degrees. We’ll describe this measurement noise by another Normal Random Variable with standard deviation 1.5.

`>>> noise = Normal(0, 1.5)`

`>>> Density(noise)`

Before I go outside lets think about the value I might measure on this thermometer. Given our understanding of the temperature we expect to get a value around 30C. This value might vary though both because our original estimate of T might be off (the weather might actually be 33C) and because I won’t measure it correctly because the lines are small. We can describe this value+variability as the sum of two random variables

>>> observation = T + noise

Ok, I just came back from outside. I measured 26C on the thermometer; reasonable but a bit lower than I expected. We remember that this measurement has some uncertainty attached to it due to the noise and represent it with a random variable rather than just a number

>>> data = 26 + noise

Note how the data curve looks skinnier. This is because it’s more precise around the mean value, 26. It’s taller so that the area under the curve remains equal to one.

After we make this measurement how should our estimate of the temperature change? We have the original estimate, 30C +- 3C, and the new measurement 26C +- 1.5C. We could take the thermometer’s reading because it is more precise but this doesn’t use the original information at all. The old measurement still has useful information and it would be best to cleanly assimilate the new bit of data (26C+-1.5C) into our prior understanding (30C +-3C).

This is the problem of Data Assimilation. We want to assimilate a new measurement (data) into our previous understanding (prior) to form a new and better-informed understanding (posterior). Really, we want to compute

*<<T_new = T_old given that observation == 26 >>*

We can do this in SymPy as follows

`>>> T_new = Given( T , Eq(observation, 26) )`

This posterior is represented below both visually and mathematically

The equation tells us that the probability of the temperature, here represented by x, drops off both as it gets away from the prior mean 30C and as it gets away from the measured value 26C. The value with maximum likelihood is somewhere in between. Visually, it looks like the blue curve is described by 27C ± 1.3C or so.

We notice that the posterior is a judicious compromise between the two, weighting the data more heavily because it was more precise. The astute statistician might notice that the variance of the posterior is lower than either of the other two (the blue curve is skinnier and so varies less). That is, by combining both measurements we were able to reduce uncertainty below the best of either of them. It’s worth noting that this solution wasn’t built into SymPy-stats as some standard probabilistic trick. This is *the* answer given the most basic and fundamental rules.
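For the curious, the posterior can be checked numerically with the standard precision-weighted formulas for combining two Gaussian estimates (my own check, independent of sympy-stats):

```python
# Combine prior 30 ± 3 with measurement 26 ± 1.5: precisions (inverse
# variances) add, and the posterior mean is the precision-weighted average.
prior_mean, prior_var = 30.0, 3.0 ** 2   # 30C with std dev 3
data_mean, noise_var = 26.0, 1.5 ** 2    # 26C with std dev 1.5

post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
post_mean = post_var * (prior_mean / prior_var + data_mean / noise_var)
post_std = post_var ** 0.5
# post_mean ≈ 26.8, post_std ≈ 1.34 -- matching the eyeballed 27C ± 1.3C
```

Note that post_var (1.8) is smaller than both 9 and 2.25, which is the variance-reduction effect described above.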

The code for this demo is available here. It depends on this statistics branch of SymPy and on matplotlib for the plotting.

Data assimilation is a very active research topic these days. Common applications include weather prediction, automated aircraft navigation, and phone GPS navigation.

Phone/GPS navigation is a great example. Your phone has three ways of telling where it is: cell towers, GPS satellites, and the accelerometer. Cell towers are reliable but low precision. GPS satellites are precise to a few meters but need some help to find out roughly where they are and update relatively infrequently. The accelerometer has fantastic real-time precision but is incapable of large-scale location detection. Merging all three types of information on the fly creates a fantastic product that magically tells us to “turn left here” at just the right time.

---