Article 64RSY Using Python as a statistical calculator

Using Python as a statistical calculator

by
John
from John D. Cook on (#64RSY)

This post is for someone unfamiliar with SciPy who wants to use Python to do basic statistical calculations. More specifically, this post will look at working with common families of random variables-normal, exponential, gamma, etc.-and how to calculate probabilities with these.

All the statistical functions you need will be in the stats subpackage of SciPy. You could import everything with

 >>> from scipy.stats import *

This will make software developers cringe because it's good software development practice to only import what you need [1]. But we're not writing software here; we're using Python as a calculator.

Distributions

Every distribution in SciPy has a location parameter loc which defaults to 0, and scale parameter scale that defaults to 1.

The table below lists additional parameters that some common distributions have.

DistributionSciPy name Parameters
normalnorm
exponentialexpon
betabetashape1, shape2
binomialbinomsize, prob
chi-squaredchi2df
Ffdf1, df2
gammagammashape
hypergeometrichypergeomM, n, N
Poissonpoissonlambda
Student ttdf

When using any statistical software its good to verify that the software is using the parameterization that you expect. For example, the exponential distribution can be parameterized by rate or by scale. SciPy does everything by scale. One way to test the parameterization is to calculate the mean. For example, if the mean of an exp(100) random variable is 100, you're software is using the scale paraemterization. If the mean is 1/100, it;s using the rate.

Functions

For a random variable X from one of the families above, we'll show how to compute Prob(X x), Prob(X >= x), and how to solve for x given one of these probabilities. And for discrete distributions, we'll show how to find Prob(X = p).

CDF: Cumulative distribution function

The probability Prob(X x) is the CDF of X at x. For example, the probability that a standard normal random variable is less than 2 is

 norm.cdf(2)

We didn't have to specify the location or scale because the standard normal uses default parameters. If X had mean 3 and standard deviation 10 we'd call

 norm.cdf(2, loc=3, scale=10)

For another example, suppose we're working with a gamma distribution with shape 3 and scale 5 and want to know the probability of such a random variable taking on a value less than 2. Then we'd call

 gamma.cdf(2, 3, scale=5)

The second argument, 2, is the shape parameter.

CCDF: Complementary cumulative distribution function

The probability Prob(X >= x) is the complementary CDF of X at x, which is 1 minus the CDF at x. SciPy uses the name sf for the CCDF. Here sf" stands for survival function.

Why have both a CDF and CCDF function? One reason is convenience, but a more important reason is numerical accuracy. If a probability p is very near 1, then 1 - p will be partially or completely inaccurate.

For example, let X be a standard normal random variable. The probability that X is greater than 9 is

 norm.sf(9)

which is on the order of 10-19. But if we compute

 1 - norm.cdf(9)

we get exactly 0. Floating point numbers have 16 figures of precision, and we need 19 to represent the difference between our probability and 1. More on why mathematical libraries have functions that seem unnecessary at first here.

Inverse CDF

SciPy calls the inverse CDF function ppf for percentile point function.

For example, suppose X is a gamma random variable with shape 3 and scale 5, and we want to solve for the value of x such that Prob(X x) is 0.7. We could do this with

 gamma.ppf(0.7, 3, scale=5)
Inverse CCDF

SciPy calls the inverse CCDF function isf for inverse survival function. So, for example,

 gamma.isf(0.3, 3, scale=5)

should return the same value as the example above because the point where the right tail has probability 0.3 is the same as the point where the left tail has probability 0.7.

Probability mass function

For a discrete random variable X we can compute the probability mass function, the probability that X takes on a value x. Unsurprisingly SciPy calls this function pmf.

For example, the probability of seeing 6 heads out of 10 flips of a fair coin is computed by

 binom.pmf(6, 10, 0.5)
Getting help

To find out more about a probability distribution, call help with that distribution as an argument. For example, you could call

 help(hypergeom)

to read more about the hypergeometric distribution.

For much more information, including topics not addressed here, see the documentation for scipy.stats.

***

[1] The reason to not import more than you need is to reduce namespace confusion. If you see a function foo, is that the foo function from package A or from package B. The most common way this comes up in statistics is confusion over the gamma function and the gamma distribution. In SciPy, the gamma function is in scipy.special while the gamma distribution is in scipy.stats. If you've imported everything from scipy.stats then gamma means scipy.stats.gamma. You could bring the gamma function into your environment by importing it with another name. For example

 from scipy.special import gamma as gammaf

would import the gamma function giving it the name gammaf.

The post Using Python as a statistical calculator first appeared on John D. Cook.
External Content
Source RSS or Atom Feed
Feed Location http://feeds.feedburner.com/TheEndeavour?format=xml
Feed Title John D. Cook
Feed Link https://www.johndcook.com/blog
Reply 0 comments