### Development

It seems there was a flurry of development over the winter holidays.

Tom’s Meijer-G integration code was merged into master giving SymPy an incredibly powerful definite integration engine. This encouraged me to finish up the pull request for random variables.

Earlier this morning we finally merged it in, and sympy.stats is now in master. If you’re interested, please play with it and send feedback. At the very least it should be able to solve many of your introductory stats homework problems :)

Actually, I tried using it for a non-trivial example last month and generated an integral which killed the integration engine (mostly due to a combination of trigonometric and delta functions). However, I still really wanted the result. The standard solution to analytically intractable statistics problems is to sample, and this pushed me to build a Monte Carlo engine into sympy.stats.

### Sampling

The family of stats functions (P, E, Var, Density, Given) now has a new member: Sample. You can generate a random sample of any random expression as follows:

```
>>> from sympy.stats import *
>>> X, Y = Die(6), Die(6)
>>> roll = X + Y
>>> Sample(roll)
10
>>> Sample(roll)
5
>>> Sample(X, roll > 10)  # Sample X given that X+Y > 10
6
```
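Conditional sampling, as in the Sample(X, roll > 10) call above, can be done by rejection sampling: draw from the unconditioned variables and keep only draws that satisfy the condition. Here is a minimal plain-Python sketch of that idea for two dice (an illustration of the technique, not a description of sympy.stats' internals):

```python
import random

random.seed(42)

def die():
    """One roll of a fair six-sided die."""
    return random.randint(1, 6)

def sample_given(condition):
    """Rejection sampling: redraw (x, y) until the condition holds."""
    while True:
        x, y = die(), die()
        if condition(x, y):
            return x, y

# A draw of X conditioned on X + Y > 10; only 5 or 6 is possible,
# since X + Y > 10 requires (x, y) in {(5,6), (6,5), (6,6)}.
x, y = sample_given(lambda x, y: x + y > 10)
print(x)
```

Rejection sampling is wasteful when the condition is rarely satisfied, but it is simple and works for any condition you can evaluate on a draw.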

Sampling is, of course, far more robust than solving integrals, so expressions can be made arbitrarily complex without issue. This sampling mechanism is also built into the probability and expectation functions via the keyword numsamples:

```
>>> from sympy.stats import *
>>> X, Y = Normal(0, 1), Normal(0, 1)
>>> P(X > Y)
1/2
>>> P(X > Y, numsamples=1000)
 499
────
1000
>>> E(X + Y)
0
>>> E(X + Y, numsamples=1000)
-0.0334982435603208
```

### Future

GSoC 2012 was announced a couple days ago. I’m excited to see what projects are proposed.

Thanks a lot for sympy.stats this looks very useful. :-)

The syntax in your examples seems strange to me, though.

P(X>Y, numsamples = 1000) seems to mix the terminology of a statistical property (the probability) with an estimator of that probability (which is just a fraction).

Similarly with E(X+Y, numsamples = 1000): E refers to a property of the statistical distribution (the expected value), but combined with numsamples it provides the mean of the sample (again, an estimator).

Anyway, thanks a lot for providing this functionality!

Hi linuxuser,

Thanks for the comment. I should confess that I am not much of a statistician and that my knowledge does not extend greatly beyond the scope of this project. I appreciate that more knowledgeable folks are around to provide some needed critical feedback!

I’m open to a name-change if you have a suggestion. The virtue of the current syntax is that it is simple and intuitive, even if it may be technically misleading. I suspect that this error is unlikely to affect most of the potential users while the simple syntax is likely to be important. Can you think of a convention which is more accurate without adding significantly more user-complexity?

-Matt

Sorry for the late reply. Here are my suggestions:

First, the E(X+Y, numsamples = 1000) case. I suggest using the term Mean instead of E; then you could express the law of large numbers with nice terminology:

The mean of a sample (sample mean) converges to the expected value if the sample size goes to infinity.
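That convergence is easy to see numerically. A small plain-Python sketch (my own illustration, not sympy.stats code): the sample mean of X + Y for two fair dice approaches the expected value E[X + Y] = 7 as the sample size grows.

```python
import random

random.seed(0)

def sample_mean(n):
    """Mean of n simulated draws of X + Y for two fair six-sided dice."""
    return sum(random.randint(1, 6) + random.randint(1, 6)
               for _ in range(n)) / n

# By the law of large numbers, these values approach E[X + Y] = 7.
for n in (10, 1000, 100000):
    print(n, sample_mean(n))
```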

For the other example, P(X>Y, numsamples = 1000), I guess you allow the comparison operators <, >, =, !=. I would call the result the fraction of observations for which the condition is satisfied. Note that in finite samples this fraction typically differs from the underlying probability.

Mean(X+Y, numsamples = 1000)

Fraction(X>Y, numsamples = 1000)
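The suggested estimator is straightforward to sketch in plain Python (the name Fraction here is the commenter's proposal, not an existing sympy.stats function): count the simulated draws satisfying the condition and divide by the sample size. For two fair dice, this estimates P(X > Y) = 15/36 ≈ 0.4167 but will generally not equal it exactly.

```python
import random

random.seed(1)

def fraction(condition, numsamples=1000):
    """Fraction of simulated (x, y) die rolls satisfying the condition.

    An estimator of the probability, not the probability itself.
    """
    hits = sum(condition(random.randint(1, 6), random.randint(1, 6))
               for _ in range(numsamples))
    return hits / numsamples

# Estimates P(X > Y) = 15/36; the estimate fluctuates around 0.4167.
print(fraction(lambda x, y: x > y, numsamples=1000))
```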