# 7 multivariate probability

##### Pdf File 882.37KByte,

7. Multivariate Probability

Chris Piech and Mehran Sahami Oct 2017

Often you will work on problems where there are several random variables (often interacting with one another). We are going to start to formally look at how those interactions play out.

For now we will think of joint probabilities with two random variables X and Y .

1 Discrete Joint Distributions

In the discrete case a joint probability mass function tells you the probability of any combination of events X = a and Y = b:

pX,Y (a, b) = P(X = a,Y = b)

This function tells you the probability of all combinations of events (the "," means "and"). If you want to back calculate the probability of an event only for one variable you can calculate a "marginal" from the joint probability mass function:

pX (a) = P(X = a) = PX,Y (a, y) y

pY (b) = P(Y = b) = PX,Y (x, b) x

In the continuous case a joint probability density function tells you the relative probability of any combination of events X = a and Y = y. In the discrete case, we can define the function pX,Y non-parametrically. Instead of using a formula for p we simply state the probability of each possible outcome.

2 Continuous Joint Distributions

Random variables X and Y are Jointly Continuous if there exists a Probability Density Function (PDF) fX,Y such that:

a2 b2

P(a1 < X a2, b1 < Y b2) =

fX,Y (x, y)dy dx

a1 b1

Using the PDF we can compute marginal probability densities:

fX (a) = fX,Y (a, y)dy

-

fY (b) = fX,Y (x, b)dx

-

Let F(a, b) be the Cumulative Density Function (CDF):

P(a1 < X a2, b1 < Y b2) = F(a2, b2) - F(a1, b2) + F(a1, b1) - F(a2, b1)

1

3 Multinomial Distribution

Say you perform n independent trials of an experiment where each trial results in one of m outcomes, with respective probabilities: p1, p2, . . . , pm (constrained so that i pi = 1). Define Xi to be the number of trials with outcome i. A multinomial distribution is a closed form function that answers the question: What is the probability that there are ci trials with outcome i. Mathematically:

P(X1 = c1, X2 = c2, . . . , Xm = cm) =

n c1, c2, . . . , cm

pc11 pc22 . . . pcmm

Example 1

A 6-sided die is rolled 7 times. What is the probability that you roll: 1 one, 1 two, 0 threes, 2 fours, 0 fives, 3 sixes (disregarding order).

7! 1 1 1 1 1 0 1 2 1 0 1 3

P(X1 = 1, X2 = 1, X3 = 0, X4 = 2, X5 = 0, X6 = 3) = 2!3! 6

6

6

6

6

6

17 = 420

6

Fedaralist Papers

In class we wrote a program to decide whether or not James Madison or Alexander Hamilton wrote Fedaralist Paper 49. Both men have claimed to be have written it, and hence the authorship is in dispute. First we used historical essays to estimate pi, the probability that Hamilton generates the word i (independent of all previous and future choices or words). Similarly we estimated qi, the probability that Madison generates the word i. For each word i we observe the number of times that word occurs in Fedaralist Paper 49 (we call that count ci). We assume that, given no evidence, the paper is equally likely to be written by Madison or Hamilton.

Define three events: H is the event that Hamilton wrote the paper, M is the event that Madison wrote the paper, and D is the event that a paper has the collection of words observed in Fedaralist Paper 49. We would like to know whether P(H|D) is larger than P(M|D). This is equivalent to trying to decide if P(H|D)/P(M|D) is larger than 1.

The event D|H is a multinomial parameterized by the values p. The event D|M is also a multinomial, this time parameterized by the values q.

Using Bayes Rule we can simplify the desired probability.

P(D|H )P(H )

P(H|D) =

P(D)

P(D|H)P(H) P(D|H)

=

=

P(M|D) P(D|M)P(M) P(D|M)P(M) P(D|M)

P(D)

=

n c1 ,c2 ,...,cm

n c1 ,c2 ,...,cm

i pci i i qci i

=

i pci i i qci i

This seems great! We have our desired probability statement expressed in terms of a product of values we have already estimated. However, when we plug this into a computer, both the numerator and denominator come out to be zero. The product of many numbers close to zero is too hard for a computer to represent. To fix this problem, we use a standard trick in computational probability: we apply a log to both sides and apply

2

some basic rules of logs.

P(H|D) log

P(M|D)

= log

i pci i i qci i

= log( pci i ) - log( qci i )

i

i

= log(pci i ) - log(qci i )

i

i

= cilog(pi) - cilog(qi)

i

i

This expression is "numerically stable" and my computer returned that the answer was a negative number. We can use exponentiation to solve for P(H|D)/P(M|D). Since the exponent of a negative number is a number smaller than 1, this implies that P(H|D)/P(M|D) is smaller than 1. As a result, we conclude that Madison was more likely to have written Federalist Paper 49.

4 Expectation with Multiple RVs

Expectation over a joint isn't nicely defined because it is not clear how to compose the multiple variables.

However, expectations over functions of random variables (for example sums or multiplications) are nicely defined: E[g(X,Y )] = x,y g(x, y)p(x, y) for any function g(X,Y ). When you expand that result for the function g(X,Y ) = X +Y you get a beautiful result:

E[X +Y ] = E[g(X,Y )] = g(x, y)p(x, y) = [x + y]p(x, y)

x,y

x,y

= xp(x, y) + yp(x, y)

x,y

x,y

= x p(x, y) + y p(x, y)

xy

yx

= xp(x) + yp(y)

x

y

= E[X] + E[Y ]

This can be generalized to multiple variables:

n

n

E Xi = E[Xi]

i=1

i=1

Expectations of Products Lemma

Unfortunately the expectation of the product of two random variables only has a nice decomposition in the case where the random variables are independent of one another.

E[g(X)h(Y )] = E[g(X)]E[h(Y )]

if and only if X and Y are independent

Example 3

A disk surface is a circle of radius R. A single point imperfection is uniformly distributed on the disk with joint PDF:

1

fX,Y (x, y) = R2

if x2 + y2 R2

0 else

Let D be the distance from the origin: D = X2 +Y 2. What is E[D]? Hint: use the lemmas

3

5 Independence with Multiple RVs

Discrete

Two discrete random variables X and Y are called independent if: P(X = x,Y = y) = P(X = x)P(Y = y) for all x, y

Intuitively: knowing the value of X tells us nothing about the distribution of Y . If two variables are not independent, they are called dependent. This is a similar conceptually to independent events, but we are dealing with multiple variables. Make sure to keep your events and variables distinct.

Continuous

Two continuous random variables X and Y are called independent if: P(X a,Y b) = P(X a)P(Y b) for all a, b

This can be stated equivalently as: FX,Y (a, b) = FX (a)FY (b) for all a, b fX,Y (a, b) = fX (a) fY (b) for all a, b

More generally, if you can factor the joint density function then your continuous random variable are independent:

fX,Y (x, y) = h(x)g(y) where - < x, y <

Example 2

Let N be the # of requests to a web server/day and that N Poi( ). Each request comes from a human (probability = p) or from a "bot" (probability = (1?p)), independently. Define X to be the # of requests from humans/day and Y to be the # of requests from bots/day.

Since requests come in independently, the probability of X conditioned on knowing the number of requests is a Binomial. Specifically:

(X|N) Bin(N, p) (Y |N) Bin(N, 1 - p)

Calculate the probability of getting exactly i human requests and j bot requests. Start by expanding using the chain rule:

P(X = i,Y = j) = P(X = i,Y = j|X +Y = i + j)P(X +Y = i + j)

We can calculate each term in this expression:

P(X = i,Y = j|X +Y = i + j) = i + j pi(1 - p) j i

P(X +Y = i + j) = e- i+ j (i + j)!

Now we can put those together and simplify:

P(X = i,Y = j) = i + j pi(1 - p) je- i+ j

i

(i + j)!

As an exercise you can simplify this expression into two independent Poisson distributions.

4

Symmetry of Independence

Independence is symmetric. That means that if random variables X and Y are independent, X is independent of Y and Y is independent of X. This claim may seem meaningless but it can be very useful. Imagine a sequence of events X1, X2, . . . . Let Ai be the event that Xi is a "record value" (eg it is larger than all previous values). Is An+1 independent of An? It is easier to answer that An is independent of An+1. By symmetry of independence both claims must be true.

6 Convolution of Distributions

Convolution is the result of adding two different random variables together. For some particular random variables computing convolution has intuitive closed form equations. Importantly convolution is the sum of the random variables themselves, not the addition of the probability density functions (PDF)s that correspond to the random variables.

Independent Binomials with equal p

For any two Binomial random variables with the same "success" probability: X Bin(n1, p) and Y Bin(n2, p) the sum of those two random variables is another binomial: X + Y Bin(n1 + n2, p). This does not hold when the two distribution have different parameters p.

Independent Poissons

For any two Poisson random variables: X Poi(1) and Y Poi(2) the sum of those two random variables is another Poisson: X +Y Poi(1 + 2). This holds when 1 is not the same as 2.

Independent Normals

For any two normal random variables X N (?1, 12) and Y N (?2, 22) the sum of those two random variables is another normal: X +Y N (?1 + ?2, 12 + 22).

General Independent Case

For two general independent random variables (aka cases of independent random variables that don't fit the above special situations) you can calculate the CDF or the PDF of the sum of two random variables using the following formulas:

FX+Y (a) = P(X +Y a) =

FX (a - y) fY (y)dy

y=-

fX+Y (a) =

fX (a - y) fY (y)dy

y=-

There are direct analogies in the discrete case where you replace the integrals with sums and change notation for CDF and PDF.

5

Example 1

Calculate the PDF of X + Y for independent uniform random variables X Uni(0, 1) and Y Uni(0, 1)? First plug in the equation for general convolution of independent random variables:

1

fX+Y (a) = fX (a - y) fY (y)dy

y=0

1

fX+Y (a) = fX (a - y)dy

y=0

Because fY (y) = 1

It turns out that is not the easiest thing to integrate. By trying a few different values of a in the range [0, 2] we can observe that the PDF we are trying to calculate is discontinuous at the point a = 1 and thus will be easier to think about as two cases: a < 1 and a > 1. If we calculate fX+Y for both cases and correctly constrain the bounds of the integral we get simple closed forms for each case:

a

fX+Y (a) = 2 - a

0

if 0 < a 1 if 1 < a 2 else

7 Conditional Distributions

Before we looked at conditional probabilities for events. Here we formally go over conditional probabilities for random variables. The equations for both the discrete and continuous case are intuitive extensions of our understanding of conditional probability:

Discrete

The conditional probability mass function (PMF) for the discrete case:

pX|Y (x|y)

=

P(X

=

x|Y

=

y)

=

P(X = x,Y = P(Y = y)

y)

=

PX,Y (x, y) pY (y)

The conditional cumulative density function (CDF) for the discrete case:

FX|Y (a|y)

=

P(X

a|Y

=

y)

=

xa pX,Y (x, y) pY (y)

=

pX|Y (x|y)

xa

Continuous

The conditional probability density function (PDF) for the continuous case:

fX|Y (x|y) =

fX,Y (x, y) fY (y)

The conditional cumulative density function (CDF) for the continuous case:

a

FX|Y (a|y) = P(X a|Y = y) = fX|Y (x|y)dx

-

Mixing Discrete and Continuous

These equations are straightforward once you have your head around the notation for probability density functions ( fX (x)) and probability mass functions (pX (x)).

6