# SPTK: Random Variables

Our toolkit expands to include basic probability theory.

Previous SPTK Post: Complex Envelopes Next SPTK Post: Examples of Random Variables

In this Signal Processing ToolKit post, we examine the concept of a random variable.

[Jump straight to ‘Significance of Random Variables in CSP’ below.]

What is a random variable? As the authors of The Literature [R153] say, “a random variable is actually neither random nor a variable.” Good start! They go on to give the usual definition of a random variable, which is a mapping between events in a probability space and the real or complex numbers. We need to review the basics of probability so that we can grasp these key ideas of mappings, events, and probability space.

### Axiomatic Probability

Probability theory is often motivated by games of chance, such as those involving dice and cards, and by the associated familiar actions of counting things and forming ratios. When we want to understand what the “chances” are of throwing a six-sided die and seeing that the top face shows two pips, we count all the possible occurrences, or outcomes, associated with throwing the die and divide the number of desired occurrences by the obtained number of possible occurrences to arrive at the desired chance or probability. In this dice example, that chance would be $1/6$: one outcome corresponds to two pips and there are exactly six possible outcomes. The notion of repeating an experiment, such as throwing dice, a large number of times and recording the outcomes in order to create a table of probabilities is aligned with the frequentist school of thought on probability, and it is intuitive, because we intuitively understand the notions of experimentation, repetition, and counting.

Each time we perform the experiment (also called a trial), we obtain an outcome or event. Suppose our experiment can have some outcome called $A$. We perform $N$ trials of the experiment and observe $N_A$ of the $A$ events. This implies that $0 \leq N_A \leq N$. We define an observed probability of the event $A$ by the ratio of $N_A$ to $N$,

$\displaystyle \hat{P} (A) = \frac{N_A}{N}. \hfill (1)$

We must have, then

$\displaystyle 0 \leq \hat{P}(A) \leq 1, \hfill (2)$

which is the origin of the fact that probabilities lie between zero and one. The frequentist then imagines letting the number of experimental trials increase without bound, $N \rightarrow \infty$. The observed probability then passes to a theoretical probability,

$\displaystyle P(A) = \lim_{N\rightarrow\infty} \hat{P}(A) = \lim_{N\rightarrow\infty} \frac{N_A}{N}. \hfill (3)$
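The relative-frequency idea in (1) and (3) is easy to try numerically. Here is a minimal Python sketch, not part of the original development: it simulates throws of a fair six-sided die with the standard `random` module and estimates the probability of seeing two pips. The function name `observed_probability`, the seed, and the trial counts are illustrative assumptions.

```python
import random

def observed_probability(num_trials, target=2, seed=0):
    """Estimate P(die shows `target`) as the ratio N_A / N, as in (1)."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    n_a = sum(1 for _ in range(num_trials) if rng.randint(1, 6) == target)
    return n_a / num_trials

# Larger N should give ratios closer to the theoretical value 1/6.
for n in (100, 10_000, 100_000):
    print(n, observed_probability(n))
```

As $N$ grows, the printed ratio should settle near $1/6 \approx 0.1667$, although any finite run only approximates the limit in (3).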

A complication in interpretation arises for the cases of $P(A) = 0$ and $P(A) =1$. Imagine doing an infinite number of trials and observing the event $A$ during exactly one of those trials. Then the probability of $A$ is zero by (3), but $A$ does actually occur. So we cannot claim that a zero-probability event is impossible. It just happens ‘with probability zero,’ which is also referred to as ‘almost never’ happening. Similarly, for $P(A) = 1$, we cannot interpret this to mean that $A$ always happens. It just happens with probability one, or ‘almost surely.’

Let the experiment have $M$ possible outcomes $A_j$, and only these outcomes. Moreover, suppose these outcomes can only happen by themselves–we never see event $A_1$ and event $A_3$ happen simultaneously, for example. Then these particular events are mutually exclusive. The observed probability for event $A_j$ is

$\displaystyle \hat{P}(A_j) = \frac{N_{A_j}}{N}. \hfill (4)$

Now, an experimental trial always produces an outcome so that

$\displaystyle \sum_{j=1}^M N_{A_j} = N. \hfill (5)$

When we pass to the limit of an infinite number of trials, we obtain

$\displaystyle P(A_j) = \lim_{N\rightarrow\infty} \hat{P}(A_j) \hfill (6)$

with

$\displaystyle \sum_{j=1}^M P(A_j) = 1. \hfill (7)$

We can consider more complicated experiments and associated events (outcomes). We might, for example, wish to know the probability of the occurrence of two events, should the experiment allow that kind of situation. This leads to the notion of the joint probability of two events $A$ and $B$. Here the word ‘joint’ is associated with the notion of ‘and.’ We seek the probability that event $A$ happens and, simultaneously, event $B$ happens. In the frequentist approach, we simply count the number of experimental trials that result in observing both $A$ and $B$:

$\displaystyle P(A \mbox{\rm\ and\ } B) = P(A \cap B) = \lim_{N\rightarrow\infty} \frac{N_{A,B}}{N}. \hfill (8)$

This joint probability may or may not be equal to the product of the individual probabilities $P(A) P(B)$. For example, consider the experiment consisting of throwing two fair six-sided dice. One joint event is ‘the sum is $12$ on one throw and is also $12$ on the next.’ We intuitively think this should be the probability of getting a sum of $12$ times itself, because the two throws do not influence each other. We can construct a big table of all possible events relating to two consecutive throws and find that the probability is indeed $(1/36)(1/36)$.

Other events are more tangled up together. Consider next the joint event that on a single throw the sum is seven (event $A$) and one die shows a six (event $B$). For event $A$, there are six possibilities out of thirty-six total, and for event $B$ there are eleven possibilities. So $P(A) = 1/6$ and $P(B) = 11/36$. But looking at the thirty-six possibilities, there are only two that correspond to $A \cap B$. So in this case, $P(A\cap B) \neq P(A)P(B)$. We need another kind of probability: conditional probability.
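The counting argument for the two-dice example can be checked by brute-force enumeration. The following Python sketch lists all thirty-six outcomes and uses exact fractions; the helper `prob` and the event function names are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of one throw of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on (die1, die2)."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def sum_is_seven(o):      # event A
    return o[0] + o[1] == 7

def shows_a_six(o):       # event B: at least one die shows six
    return 6 in o

p_a = prob(sum_is_seven)
p_b = prob(shows_a_six)
p_ab = prob(lambda o: sum_is_seven(o) and shows_a_six(o))

# The joint probability and the product of the individual probabilities.
print(p_a, p_b, p_ab, p_a * p_b)
```

The joint probability comes out as $2/36 = 1/18$, whereas the product of the individual probabilities is $11/216$, so the two events are not independent.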

The conditional probability is the probability of one event given that another event happened, which is denoted by $P(A|B)$, and we say ‘probability of $A$ given $B$.’ This probability is defined by

$\displaystyle P(A|B) = P(A\cap B)/P(B), \hfill (9)$

which is the probability of the joint event as a fraction of the probability of the conditioning event. This leads immediately to

$P(A \cap B) = P(A|B)P(B). \hfill (10)$

The symmetry of this relation with respect to events $A$ and $B$ leads to Bayes’ rule

$\displaystyle P(A|B)P(B) = P(B|A)P(A). \hfill (11)$

In the important case for which the conditioning event has no effect on the probability,

$\displaystyle P(A|B) = P(A), \hfill (12)$

we have

$\displaystyle P(A \cap B) = P(A)P(B), \hfill(13)$

and we say that the two events are statistically independent.
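To make (9) through (13) concrete, here is a small Python sketch that reuses the two-dice experiment. It is only an illustration under assumed names (`prob`, `conditional`, and the events `A`, `B`, `C` are hypothetical); event `C`, 'the first die is even,' is introduced here solely to exhibit a pair of independent events.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # two fair dice, one throw

def prob(event):
    """Probability of an event given as a predicate on (die1, die2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def conditional(event_a, event_b):
    """P(A|B) = P(A and B) / P(B), as in (9); requires P(B) > 0."""
    return prob(lambda o: event_a(o) and event_b(o)) / prob(event_b)

A = lambda o: o[0] + o[1] == 7      # sum is seven
B = lambda o: 6 in o                # at least one die shows six
C = lambda o: o[0] % 2 == 0         # first die is even (assumed extra event)

p_a_given_b = conditional(A, B)
p_b_given_a = conditional(B, A)

# Bayes' rule (11): both sides equal the joint probability P(A and B).
bayes_lhs = p_a_given_b * prob(B)
bayes_rhs = p_b_given_a * prob(A)

# A and C are independent: conditioning on C leaves P(A) unchanged, as in (12).
p_a_given_c = conditional(A, C)

print(p_a_given_b, p_b_given_a, bayes_lhs == bayes_rhs, p_a_given_c == prob(A))
```

Knowing that a six is showing changes the probability of a sum of seven from $1/6$ to $2/11$, while knowing only that the first die is even does not change it at all.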

This frequentist, or frequency-of-occurrence, approach to probability is usually too cumbersome to apply to complex problems, such as analyzing the probabilistic behavior of radio signals (although we will eventually cover one such attempt, called fraction-of-time probability, at the CSP Blog). It is therefore relegated to the role of motivator, and a more abstract axiomatic probability theory is embraced.

The axioms of axiomatic probability are the following:

$\displaystyle P(A) \ge 0, \hfill (14)$

$\displaystyle P(S) = 1, \hfill (15)$

$\displaystyle A \cap B = \emptyset \Rightarrow P(A \cup B) = P(A) + P(B). \hfill (16)$

The key abstraction is the probability space $S$, which contains all possible events of interest. It is a set. Some event must happen when an experiment is conducted, so that the probability of the event that corresponds to the entire probability space is one.

The axioms can be used to develop joint and conditional probabilities, and they are consistent with Bayes’ rule. In probability theory, the union of two sets in $S$ is denoted by $A \cup B$ or $A + B$: this is ‘$A$ or $B$.’ The intersection is denoted by $A \cap B$ or $A, B$: this is ‘$A$ and $B$.’

From the axioms, we can develop the following relationships

$\displaystyle P(\emptyset) = 0, \hfill (17)$

$\displaystyle P(\bar{A}) = 1 - P(A), \hfill (18)$

$\displaystyle P(A \cup B) = P(A) + P(B) - P(A \cap B), \hfill (19)$

$\displaystyle P(A \cup B) = P(A) + P(B) \ \ \ (A, B \mbox{\rm \ mutually\ exclusive}), \hfill (20)$

$\displaystyle P(A|B) = P(A \cap B) / P(B)\ \ \ (\mbox{\rm definition}). \hfill (21)$

Here the bar on $A$ in $\bar{A}$ denotes the complement of $A$: $\bar{A} \cup A = S$ and $\bar{A} \cap A = \emptyset$.

If $A$ and $B$ are statistically independent, then $P(A \cap B) = P(A)P(B)$. For independent events,

$\displaystyle P(A \cup B) = P(A) + P(B) - P(A)P(B). \hfill (22)$
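The derived relationships can be spot-checked on a small finite probability space using set operations. The sketch below assumes a standard 52-card deck with one card drawn at random; the events 'hearts' and 'face card' and the helper `prob` are illustrative choices, not taken from the text above.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))                  # probability space S

def prob(event):
    """Probability of an event (a subset of the deck) for one random draw."""
    return Fraction(len(event), len(deck))

a = {card for card in deck if card[1] == "hearts"}          # event A
b = {card for card in deck if card[0] in ("J", "Q", "K")}   # event B

# Each entry compares the two sides of one relationship from the text.
checks = {
    "empty set (17)": prob(set()) == 0,
    "complement (18)": prob(deck - a) == 1 - prob(a),
    "union (19)": prob(a | b) == prob(a) + prob(b) - prob(a & b),
    "independence (13)": prob(a & b) == prob(a) * prob(b),
    "union, independent (22)": prob(a | b) == prob(a) + prob(b) - prob(a) * prob(b),
}

print(prob(a | b))
for name, holds in checks.items():
    print(name, holds)
```

Suit and rank happen to be independent for a single draw from a full deck, which is why (22) holds here; for dependent events only (19) would apply.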

### Random Variables

A random variable is a function that maps the elements of the probability space $S$ to real numbers ($\mathbb{R}$) or to complex numbers ($\mathbb{C}$).

Once we have a probability space $S$ and a random variable $X$, we have a characterization of the experiment that involves numerical values, so we can potentially apply arithmetic and all those mathematical tools that are built on arithmetic to making inferences or predictions regarding future trials of the experiment. And of course, once we are in a numerical domain, we can use computers to do all of our calculations.

The issue now becomes characterization of a random variable–what is it like? There are high-level properties of a random variable, such as whether it is real-valued or complex-valued. If the range of numbers for the random variable is finite or countable (such as the integers), the random variable is called discrete. Otherwise it is continuous. The next step in characterization is to connect the probability theory to the numerical values embodying the random variable.

### The Cumulative Distribution Function

A simple way to start a characterization of a random variable $X$ is to look at the probability that the variable is less than or equal to some constant $x$. In other words, we seek the probability of the event $X \leq x$. Such an event implicitly assumes $X$ is a real-valued random variable; extensions to complex variables are straightforward.

$\displaystyle F_X(x) \triangleq P\left[X \leq x \right]. \hfill (23)$

This function, $F_X(x)$, which is associated with the random variable $X$ and is a function of the number $x$, is called the cumulative distribution function (CDF). Since the CDF is a probability, it must obey the following restriction

$\displaystyle 0 \leq F_X(x) \leq 1, \hfill (24)$

which follows from the axioms. The function must approach zero as $x$ approaches $-\infty$, because the probability that the random variable is less than $-\infty$ must be zero:

$\displaystyle \lim_{x\rightarrow -\infty} F_X(x) = 0. \hfill (25)$

Similarly,

$\displaystyle \lim_{x\rightarrow \infty} F_X(x) = 1. \hfill (26)$

The CDF is a non-decreasing function of $x$ because the probability that $X \leq x_2$ cannot be smaller than the probability that $X \leq x_1$ if $x_2 \ge x_1$,

$\displaystyle x_2 \ge x_1 \Rightarrow F_X(x_2) \ge F_X(x_1). \hfill (27)$

We can use the CDF to find the probability that the random variable lies on some interval, say $x_1 < X \leq x_2$. Let $A$ be the event $(X \leq x_1) \cup (X > x_2)$ and let $B$ be the complement of $A$, $(X > x_1) \cap (X \le x_2)$. The events $A$ and $B$ are mutually exclusive (note the precise definitions of the intervals) and $A \cup B = S$ so that

$\displaystyle P(S) = P(A) + P(B) = 1 \Rightarrow P(B) = 1 - P(A). \hfill (28)$

The event $A$ is itself made up of mutually exclusive events so that

$\displaystyle P(A) = P(X \le x_1) + P(X > x_2), \hfill (29)$

$\displaystyle = F_X(x_1) + (1 - P(X \leq x_2)), \hfill (30)$

$\displaystyle = F_X(x_1) - F_X(x_2) + 1. \hfill (31)$

Combining (28) with (31) yields

$\displaystyle P(B) = 1 - P(A) = 1 - F_X(x_1) + F_X(x_2) - 1, \hfill (32)$

or

$\displaystyle P(x_1 < X \leq x_2) = F_X(x_2) - F_X(x_1). \hfill (33)$

So we can find the probability that the random variable lies on any interval by using the CDF. Further, we can combine different intervals to find the probability that the random variable meets all sorts of conditions, using the basic axioms and derived rules of probability. In this sense, the CDF is fundamental.

Notice that if $F_X(x)$ is continuous, the probability that the random variable takes on any particular value $x_2$ is zero because

$\displaystyle \lim_{x_1 \rightarrow x_2} F_X(x_1) = F_X(x_2). \hfill (34)$

This mathematically formalizes the notion that if one chooses a real number ‘at random’ from the interval $(0, 1)$, the probability that that number is any particular value $x_0$ is zero. The number of possibilities is an uncountable infinity, but our frequency of occurrence is defined by a countable infinity of trials, so we can repeat the experiment forever and still not pick $x_0$, or perhaps pick it a few times, but the probability of picking it is still zero. Nevertheless, after the fact, some number was chosen in each trial.

When the CDF possesses step discontinuities, however, we can see that the random variable can take on the value of the independent variable $x$ at the discontinuity location with a non-zero probability. Discrete random variables have this property–their CDFs are piecewise constant functions, characterized by step-function discontinuities. For example, if we consider throwing a single fair six-sided die and the event ‘number of pips showing on the upper face after the die comes to a stop,’ we can easily map the events to the six integers $\{1, 2, 3, 4, 5, 6\}$. Then the CDF for this random variable is the stair-step function shown in Figure 2.
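The stair-step CDF for the die, and the interval formula (33), can be sketched in a few lines of Python. The function names `die_cdf` and `interval_probability` are hypothetical; exact fractions are used so that the step sizes are visible without rounding.

```python
from fractions import Fraction

def die_cdf(x):
    """F_X(x) = P[X <= x] for the number of pips on a fair six-sided die."""
    count = sum(1 for face in range(1, 7) if face <= x)
    return Fraction(count, 6)

def interval_probability(x1, x2):
    """P(x1 < X <= x2) = F_X(x2) - F_X(x1), as in (33)."""
    return die_cdf(x2) - die_cdf(x1)

# The CDF is flat between integers and jumps by 1/6 at each face value.
print([die_cdf(x) for x in (0, 1, 3, 3.5, 6, 10)])
print(interval_probability(2, 5))       # covers faces 3, 4, 5
print(interval_probability(2.999, 3))   # size of the step at x = 3
```

Shrinking the interval around $x = 3$ does not drive the probability to zero, which is the discrete counterpart of the continuous case in (34): the jump of $1/6$ is the probability that $X = 3$.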

The CDF is one complete way of quantifying the probabilistic behavior of a random variable, but in many kinds of calculations we favor working with its derivative, the probability density function.

### The Probability Density Function

The PDF is defined as the derivative of the CDF,

$\displaystyle f_X(x) = \frac{d}{dx} F_X(x). \hfill (35)$

It follows that the CDF is the integral of the PDF,

$\displaystyle F_X(x) = \int_{-\infty}^x f_X(u) \, du. \hfill (36)$

Recalling the fundamental theorem of calculus, which says that

$\displaystyle \int_a^b g(x) \, dx = G(b) - G(a), \hfill (37)$

where $g(x) = \frac{d}{dx} G(x)$, we have

$\displaystyle \int_{x_1}^{x_2} f_X(x) \, dx = F_X(x_2) - F_X(x_1), \hfill (38)$

$\displaystyle = P(x_1 < X \le x_2), \hfill (39)$

provided that $x_1$ and $x_2$ are not locations of impulses in $f_X(x)$. This is why $f_X(x)$ is called a probability density: $f_X(x) \, dx$ is interpreted as the probability that $X$ takes on values in some small neighborhood near $x$.

The PDF for the CDF in Figure 2, which corresponds to throwing a fair six-sided die, is shown in Figure 3. Since the random variable $X$ here is a discrete random variable, its PDF is purely impulsive. In such cases, the PDF is often replaced with the closely related probability mass function (PMF), where the ‘mass’ at a value $x$ is equal to the corresponding area in the impulsive PDF. In the next SPTK post we’ll look at a variety of PDFs for random variables that arise in the context of communication signals, their propagation channels, and their receivers.

The value $f_X(x)$ is not, as we’ve mentioned, a probability, and not just any function can qualify as a PDF. Here are some important (for analysis and signal-processing practice) properties of PDFs.

Non-Negativity. $\displaystyle f_X(x) \ge 0.$ This arises from the non-decreasing property of the CDF. Thus its derivative, the PDF, must be either zero (when the CDF is constant over some stretch of $x$), or positive (when the CDF increases over some stretch of $x$).

Unit Total Area. $\displaystyle \int_{-\infty}^\infty f_X(x) \, dx = 1$. This follows from the behavior of the CDF at $\pm \infty$.

Integrates to a CDF. $\displaystyle F_X(x) = \int_{-\infty}^x f_X(u)\, du$. This follows from the definition of the PDF.

Subinterval Integrals are Probabilities. $\displaystyle \int_a^b f_X(x)\, dx = P(a < X \leq b)$. This follows from the definition of the PDF and the properties of the CDF.
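These properties can be checked numerically for a simple density. The Python sketch below assumes a hypothetical PDF, $f_X(x) = 2x$ on $[0, 1]$ and zero elsewhere (so that $F_X(x) = x^2$ on that interval), and uses a plain midpoint-rule integrator; neither the density nor the helper names come from the text above.

```python
def pdf(x):
    """Hypothetical density: f_X(x) = 2x on [0, 1] and zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(func, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * width) for k in range(steps)) * width

def cdf(x, lower=-1.0):
    """F_X(x) as the running integral of the PDF. The density is zero
    below 0, so starting at lower = -1 stands in for minus infinity."""
    return integrate(pdf, lower, x) if x > lower else 0.0

total_area = integrate(pdf, -1.0, 2.0)      # unit total area
interval_prob = integrate(pdf, 0.2, 0.5)    # P(0.2 < X <= 0.5) from the PDF
cdf_difference = cdf(0.5) - cdf(0.2)        # same probability from the CDF

print(total_area, interval_prob, cdf_difference)
```

Up to numerical integration error, the total area should be close to one and both interval calculations should be close to $0.5^2 - 0.2^2 = 0.21$.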

### Expectation and Moments

The notion of averaging over time to characterize the gross behavior of a time function is intuitive and we use it a lot in signal processing, and in CSP. You simply add up all the values of the time function over some interval of time $[t_1, t_2]$ with length $T = t_2 - t_1$ and divide by $T$ to obtain the temporal average.

When dealing with a random variable defined on a probability space, there is no time to average over. To find the gross properties of a random variable we must average over the only thing we’ve got: the probability space $S$.

The average value of a random variable $X$ with PDF $f_X(x)$ is given by the expected value defined by

$\displaystyle \bar{X} = E[X] = \int_{-\infty}^\infty x f_X(x) \, dx. \hfill (40)$

This is a reasonable average because we are taking the probability that the random variable takes on values near $x$, $f_X(x)\, dx$, and multiplying that probability by $x$, and summing up. So values of the random variable that are more probable get represented at a higher level in the sum. So in that sense the value of $E[X]$ is the value that we expect, hence the terminology of expected value.

Similarly, the expected value of any function of $X$ is given by

$\displaystyle E[g(X)] = \int_{-\infty}^\infty g(x) f_X(x) \, dx. \hfill (41)$

The most important and useful functions $g(\cdot)$ are the homogeneous nonlinearities $g(X) = X^n$, for $n=1, 2, \ldots$. The corresponding expected values are called the moments of $X$. The mean $\bar{X}$ is the first moment of $X$. The second moment is

$\displaystyle E[X^2] = \int_{-\infty}^\infty x^2 f_X(x) \, dx. \hfill (42)$

More useful are the centralized moments, which are the expected values of $[X - \bar{X}]^n$. The second centralized moment is called the variance:

$\displaystyle \sigma_X^2 \triangleq E \left[ (X - \bar{X})^2 \right] = \int_{-\infty}^\infty (x-\bar{X})^2 f_X(x) \, dx. \hfill (43)$

The square root of the variance is called the standard deviation. An important property of the expectation is that it is linear: $E[g(X) + h(Y)] = E[g(X)] + E[h(Y)]$. We can easily find the relationship between the second moment (also called the mean-square), the mean, and the variance:

$\displaystyle \sigma_X^2 = E\left[(X - \bar{X})^2\right] \hfill (44)$

$\displaystyle = E\left[ X^2 + \bar{X}^2 -2X\bar{X} \right] \hfill (45)$

$\displaystyle = E[X^2] + \bar{X}^2 -2\bar{X}E[X] \hfill (46)$

$\displaystyle = E[X^2] - \bar{X}^2, \hfill (47)$

so that the variance is the mean-square minus the squared mean.
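For a discrete random variable the PDF is impulsive, so the integrals in (40) through (43) reduce to weighted sums over the possible values. The following Python sketch computes the moments of the fair-die random variable exactly and compares the variance definition with the shortcut (47). The names `pmf` and `expectation` are illustrative.

```python
from fractions import Fraction

# Probability mass function of a fair six-sided die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(g, pmf):
    """E[g(X)] for a discrete X: the integral in (41) becomes a weighted sum."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(lambda x: x, pmf)                          # first moment
mean_square = expectation(lambda x: x**2, pmf)                # second moment
variance_direct = expectation(lambda x: (x - mean)**2, pmf)   # definition (43)
variance_shortcut = mean_square - mean**2                     # relation (47)

print(mean, mean_square, variance_direct, variance_shortcut)
```

Both routes give a variance of $35/12$, with mean $7/2$ and mean-square $91/6$.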

### Canonical Example: The Uniform Distribution

Quite a few examples of random-variable density functions are provided in the next SPTK post. Here we consider only two: the uniform and the Gaussian. First the uniform, which is simple and so provides clear basic illustrations of the fundamental random-variable characterizations of CDF, PDF, mean, and variance.

The uniform distribution has a PDF that is a rectangle on some interval $[a, b]$ with $b > a$. Because the density must integrate to one, the uniform density is completely specified by the two numbers $a$ and $b$:

$\displaystyle f_X(x) = \frac{1}{b-a} \mbox{\rm rect}\left(\frac{x - (a+b)/2}{b-a}\right), \hfill (48)$

where $\mbox{\rm rect}(x)$ is our usual rectangle function with unit width, unit height, and center of $x=0$. The name of uniform is apt because the probability is uniform across the interval $[a, b]$ in the sense that the probability that the random variable takes on values in any subinterval of length $\delta$ is $\delta/(b-a)$.

Let’s illustrate the use of the PDF and expectation by calculating the first- and second-order moments for this simple random variable. The mean value is given by

$\displaystyle E[X] = \bar{X} = \int_{-\infty}^\infty x f_X(x) \, dx \hfill (49)$

$\displaystyle = \int_a^b \frac{x}{b-a} \, dx = \left. \frac{1}{b-a}\frac{x^2}{2} \right|_{x=a}^b \hfill (50)$

$\displaystyle = \frac{a+b}{2}, \hfill (51)$

which is the midpoint of the interval $[a, b]$, matching intuition. For example, if $a= 0$ and $b = 1$, as in MATLAB’s rand.m function, then the mean value is $0.5$.

The second moment, or mean-square, is given by

$\displaystyle E[X^2] = \int_{-\infty}^\infty x^2 f_X(x) \, dx \hfill (52)$

$\displaystyle = \int_a^b \frac{x^2}{b-a} \, dx \hfill (53)$

$\displaystyle = \left. \frac{1}{b-a} \frac{x^3}{3} \right|_{x=a}^b \hfill (54)$

$\displaystyle = \frac{b^2 + ab + a^2}{3}. \hfill (55)$

We can use (47), together with (51) and (55), to find the variance,

$\displaystyle \sigma_X^2 = \frac{(b-a)^2}{12}, \hfill (56)$

and the standard deviation

$\displaystyle \sigma_X = \frac{b-a}{\sqrt{12}}. \hfill (57)$

These values become quite simple for the rand.m case of $a= 0$ and $b = 1$:

$\displaystyle \bar{X} = 1/2$

$\displaystyle E[X^2] = 1/3$

$\displaystyle \sigma_X^2 = 1/12$

$\displaystyle \sigma_X = 1/(2\sqrt{3}).$

In the next SPTK post we’ll verify these formulas by using MATLAB’s random-number generator and a histogram function.
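In the meantime, a quick sanity check is possible with a few lines of Python (used here instead of MATLAB purely for illustration). The sketch compares the formulas (51), (55), and (56) with sample averages from the standard `random` module; the interval $[2, 5]$, the seed, and the sample count are arbitrary assumptions.

```python
import random

def uniform_moments(a, b):
    """Theoretical mean, mean-square, and variance from (51), (55), (56)."""
    mean = (a + b) / 2
    mean_square = (b**2 + a * b + a**2) / 3
    variance = (b - a)**2 / 12
    return mean, mean_square, variance

def sample_moments(a, b, num_samples=200_000, seed=1):
    """Sample estimates of the same three quantities from random draws."""
    rng = random.Random(seed)
    xs = [rng.uniform(a, b) for _ in range(num_samples)]
    mean = sum(xs) / num_samples
    mean_square = sum(x * x for x in xs) / num_samples
    return mean, mean_square, mean_square - mean**2

theory = uniform_moments(2.0, 5.0)
estimate = sample_moments(2.0, 5.0)
print(theory)
print(estimate)
```

The sample values should land close to the theoretical ones, with the agreement improving as the number of samples grows.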

### A Ubiquitous Random Variable: The Gaussian Distribution

From numerous observations of random phenomena, it has been found that many physical processes are well-modeled by using Gaussian random variables. The Gaussian distribution is also known as the normal distribution (because it is so common) and the Gaussian probability density function is the well-known and oft-cited ‘bell-shaped curve.’

The PDF for the Gaussian random variable is explicitly a function of the mean value $\bar{X} = E[X]$ and the variance $\sigma_X^2 = E[(X - \bar{X})^2]$:

$\displaystyle f_X(x) = \frac{1}{\sqrt{2\pi \sigma_X^2}} e^{-(x-\bar{X})^2/(2\sigma_X^2)}. \hfill (58)$

The ubiquity of the Gaussian distribution is due to the interesting fact that when a large number of independent random variables are added together, the distribution of the resulting sum approaches a Gaussian, largely regardless of the particular distributions of the variables. This fact is summarized in the central limit theorem, which we will briefly touch on near the end of the post. The Gaussian PDF and CDF are illustrated in Figure 4. Note that most of the ‘probability action’ lies within one standard deviation of the mean value (about $0.84 - 0.16 = 0.68$). That is, $P(\bar{X}-\sigma_X < X \leq \bar{X}+\sigma_X) \approx 0.68$.
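The one-standard-deviation probability can be computed directly from the Gaussian CDF, which has no elementary closed form but can be written using the error function. The Python sketch below uses `math.erf` from the standard library; the function names and the particular mean and standard deviation are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mean=0.0, sigma=1.0):
    """Gaussian density (58) with mean `mean` and standard deviation `sigma`."""
    return math.exp(-(x - mean)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def gaussian_cdf(x, mean=0.0, sigma=1.0):
    """Gaussian CDF written in terms of the error function math.erf."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

mean, sigma = 3.0, 2.0      # arbitrary illustrative values
upper = gaussian_cdf(mean + sigma, mean, sigma)
lower = gaussian_cdf(mean - sigma, mean, sigma)

# CDF one sigma above the mean, one sigma below, and their difference.
print(round(upper, 4), round(lower, 4), round(upper - lower, 4))
```

The difference is about $0.68$ for any choice of mean and standard deviation, because the normalization $(x - \bar{X})/\sigma_X$ removes both parameters.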

### More Than One Random Variable: Joint Densities and Correlation

In many situations we are interested in the relationship between two or more random variables. We must therefore generalize our definitions of the cumulative distribution and probability density functions.

Let’s consider two random variables $X$ and $Y$. The natural extension of the CDF from one variable to two is to consider the event where $X \leq x$ and also $Y \leq y$:

$\displaystyle F_{XY}(x, y) = P \left[ (X \leq x) \cap (Y \leq y) \right]. \hfill (59)$

The function $F_{XY}(x,y)$ is called the joint cumulative distribution function. The corresponding joint density function is the mixed second-order partial derivative of the distribution function,

$\displaystyle f_{XY}(x, y) = \frac{\partial^2}{\partial x \, \partial y} F_{XY}(x, y). \hfill (60)$

The properties of $F_{XY}(x, y)$ and $f_{XY}(x,y)$ are similar to those of their univariate counterparts. In addition, the marginal densities $f_X(x)$ and $f_Y(y)$ can be obtained from the joint density by integration,

$\displaystyle f_X(x) = \int_{-\infty}^\infty f_{XY}(x,y) \, dy, \hfill (61)$

$\displaystyle f_Y(y) = \int_{-\infty}^\infty f_{XY}(x,y) \, dx. \hfill (62)$

Expectation works the same way as before,

$\displaystyle E[g(X, Y)] = \int_{-\infty}^\infty \int_{-\infty}^\infty g(x, y) f_{XY}(x,y) \, dx \, dy. \hfill (63)$

The expected value of $XY$ is called the correlation (for complex-valued random variables, it is the expectation of $XY^*$, as we’ve seen in the context of the non-conjugate and conjugate spectral correlation functions).

The conditional probability density functions are related to the joint density functions as follows

$\displaystyle F_{X|Y}(x|y) = P(X \leq x | Y = y) \hfill (64)$

$\displaystyle f_{X|Y}(x|y) = \frac{d}{dx} F_{X|Y}(x|y) = \frac{f_{XY}(x, y)}{f_Y(y)} \hfill (65)$

$\displaystyle f_{Y|X}(y|x) = \frac{f_{X|Y}(x|y) f_Y(y)}{f_X(x)} \hfill (66)$

$\displaystyle f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x) f_X(x)}{f_Y(y)}. \hfill (67)$

Two random variables $X$ and $Y$ are said to be statistically independent if their joint density function factors into the product of the marginal density functions (compare with (13) above),

$\displaystyle \mbox{\rm S.I.} \Rightarrow f_{XY}(x,y) = f_X(x) f_Y(y). \hfill (68)$

For statistically independent variables, the conditioning makes no difference to the density function

$\displaystyle \mbox{\rm S.I.} \Rightarrow f_{X|Y}(x|y) = f_X(x). \hfill (69)$

The cross correlation between two statistically independent random variables $X$ and $Y$ is simply the product of their mean values,

$\displaystyle E[XY] = \int_{-\infty}^\infty \int_{-\infty}^\infty xy f_{XY}(x,y) \, dx\, dy, \hfill (70)$

$\displaystyle = \int_{-\infty}^\infty \int_{-\infty}^\infty xy f_X(x)f_Y(y) \, dx\, dy, \hfill (71)$

$\displaystyle = \bar{X}\bar{Y}. \hfill (72)$

If either $\bar{X} = 0$ or $\bar{Y} = 0$, then the correlation is zero. We can conclude that the correlation between two statistically independent zero-mean random variables is zero.
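A discrete analog makes the marginalization in (61) and (62) and the factorization result (72) easy to verify exactly. The Python sketch below assumes two independent fair dice, so the integrals become sums over a joint probability mass function; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
# Joint PMF of two independent fair dice: every pair has probability 1/36.
joint = {(x, y): Fraction(1, 36) for x, y in product(faces, faces)}

# Marginals: sum the joint PMF over the other variable, mirroring (61)-(62).
marginal_x = {x: sum(joint[(x, y)] for y in faces) for x in faces}
marginal_y = {y: sum(joint[(x, y)] for x in faces) for y in faces}

mean_x = sum(x * p for x, p in marginal_x.items())
mean_y = sum(y * p for y, p in marginal_y.items())
correlation = sum(x * y * p for (x, y), p in joint.items())   # E[XY]

# Independence: the joint PMF equals the product of the marginals everywhere.
factors = all(joint[(x, y)] == marginal_x[x] * marginal_y[y] for x, y in joint)
print(mean_x, mean_y, correlation, factors)
```

Here the correlation equals the product of the means, $(7/2)(7/2) = 49/4$, which is not zero because neither variable has zero mean.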

The cross correlation, $R_{XY} = E[XY]$, is the expected value of the product of two random variables, but it can be more revealing about their relationship to look at the covariance, which is the expected value of the product of the two random variables with their means removed, $K_{XY} = E[(X-\bar{X})(Y-\bar{Y})]$. However, the two variables could have quite different scales (approximate length of the support of their PDFs), and so we might consider also scaling each variable so that their variances are equal. It is easy to show that if a random variable $X$ has variance $\sigma_X^2$ and mean $\bar{X}$, then the variable $Z = (X-\bar{X})/\sigma_X$ has a mean of zero and a unit variance. Therefore, the cross correlation of normalized variables is

$\displaystyle \rho = E\left[ \left(\frac{X-\bar{X}}{\sigma_X}\right) \left(\frac{Y-\bar{Y}}{\sigma_Y}\right) \right], \hfill (73)$

which is called the correlation coefficient. The correlation coefficient always lies between $-1$ and $1$. We can see that by defining $\alpha = (X-\bar{X})/\sigma_X$ and $\beta = (Y-\bar{Y})/\sigma_Y$ and looking at the expected value of $(\alpha + \beta)^2$ and $(\alpha - \beta)^2$:

$\displaystyle E[(\alpha \pm \beta)^2] = E[\alpha^2 \pm 2\alpha\beta + \beta^2] \hfill (74)$

$\displaystyle = E[\alpha^2] \pm 2E[\alpha\beta] + E[\beta^2] \hfill (75)$

$\displaystyle = 1 \pm 2E[\alpha\beta] + 1 = 2 \pm 2E[\alpha\beta]. \hfill (76)$

Now the correlation coefficient is equal to $\rho = E[\alpha\beta]$ here, so

$\displaystyle E[(\alpha \pm \beta)^2] = 2 \pm 2\rho = 2(1\pm \rho). \hfill (77)$

We must have $E[(\alpha\pm\beta)^2] \ge 0$ (why?), so

$\displaystyle 2(1\pm\rho) \ge 0 \Rightarrow |\rho| \leq 1. \hfill (78)$

Therefore the correlation coefficient between any two random variables must always be between $-1$ and $+1$. Moreover, if the two random variables $X$ and $Y$ are statistically independent, then $\rho = 0$ because the expected value of the product in (73) factors into the product of the expected values, and $E[(X-\bar{X})/\sigma_X] = E[(Y-\bar{Y})/\sigma_Y] = 0$.

At one extreme, if $Y = aX$ for $a > 0$, then $\rho = 1$. At the other extreme, if $Y = bX$ for $b < 0$, then $\rho = -1$. The correlation coefficient gives an indication of the degree to which two random variables are linearly related, and also an indication of the sign of that relationship. It is a useful exercise to compute the correlation coefficient between a random variable $X$ and another random variable $Y = aX +b$, where $a$ and $b$ are constants.
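A numerical check can complement the pencil-and-paper exercise. The Python sketch below estimates the correlation coefficient (73) from samples for two linear relationships and for an independent pair. The Gaussian samples, the constants, the seed, and the function name `correlation_coefficient` are all illustrative assumptions.

```python
import math
import random

def correlation_coefficient(xs, ys):
    """Sample version of (73): average product of the normalized variables."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x)**2 for x in xs) / n
    var_y = sum((y - mean_y)**2 for y in ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(2)
xs = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
independent = [rng.gauss(0.0, 1.0) for _ in range(50_000)]

rho_positive = correlation_coefficient(xs, [3.0 * x + 1.0 for x in xs])
rho_negative = correlation_coefficient(xs, [-0.5 * x + 4.0 for x in xs])
rho_independent = correlation_coefficient(xs, independent)
print(rho_positive, rho_negative, rho_independent)
```

The two linear cases should give values at the extremes of the allowed range (up to floating-point rounding), while the independent pair should give a value near zero that shrinks as the sample count increases.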

### The Central Limit Theorem

Suppose we have a sequence of statistically independent random variables $X_k$, $k = 1, 2, \ldots$ with identical density functions and finite variance. If we form the sum of the first $N$,

$\displaystyle Z_N = \sum_{k=1}^N X_k, \hfill (79)$

then we’ve simply summed up $N$ random variables. Consider the normalized version of each $Z_N$, as we did for the correlation coefficient

$\displaystyle Y_N = \frac{Z_N - \bar{Z}_N}{\sigma_{Z_N}}, \hfill (80)$

where $\bar{Z}_N$ is the mean value of $Z_N$ and $\sigma_{Z_N}^2$ is the variance of $Z_N$. Then, clearly, $E[Y_N] = 0$ and $\sigma_{Y_N}^2 = 1$. The central limit theorem says that the probability density function for $Y_N$ approaches the standard normal distribution as $N \rightarrow \infty$. The standard normal distribution is just the Gaussian distribution with zero mean and unit variance, so that the theorem says that

$\displaystyle f_Y(y) = \lim_{N\rightarrow\infty} f_{Y_N}(y) = \frac{1}{\sqrt{2\pi}} e^{-y^2/2}. \hfill (81)$

The implication of the central limit theorem is that when a random variable is modeled as the sum of a large number of independent similar events, the distribution of that random variable tends to the Gaussian. A relevant example for the CSP Blog is the electric field value produced by the many electrons in a conductor (thermal noise). Turning to signals, and as a preview, if we consider the reception of a large number of interfering signals at a single radio receiver, we can see that the resulting composite signal will tend to a Gaussian signal no matter the distributions of the involved interferers.
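The tendency toward the Gaussian can be observed in a small simulation. The Python sketch below forms the normalized sum (80) for independent uniform variables on $(0, 1)$ and measures how often it falls within one standard deviation of its mean. The choice of uniform variables, the trial count, the seed, and the function names are assumptions made only for this illustration.

```python
import math
import random

def normalized_sum(rng, num_terms):
    """One draw of Y_N in (80) for a sum of `num_terms` uniform(0, 1) variables."""
    z = sum(rng.random() for _ in range(num_terms))
    mean_z = num_terms / 2.0                 # each term has mean 1/2
    sigma_z = math.sqrt(num_terms / 12.0)    # independent terms: variances add
    return (z - mean_z) / sigma_z

def fraction_within_one_sigma(num_terms, num_trials=20_000, seed=3):
    """Fraction of normalized sums that land in the interval (-1, 1]."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(num_trials)
               if -1.0 < normalized_sum(rng, num_terms) <= 1.0)
    return hits / num_trials

for n in (1, 2, 30):
    print(n, fraction_within_one_sigma(n))
```

For a single uniform variable the fraction is about $1/\sqrt{3} \approx 0.58$; as more terms are added it should move toward the Gaussian value of about $0.68$.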

There is a lot more to say about the theory of random variables, and there are many textbooks that treat the topic. I suggest The Literature [R149] and [R156] as good starting points. In the next Signal Processing ToolKit post, we’ll look at several kinds of random variables by using MATLAB to generate them and to investigate their parameters: PDF, mean, variance, correlation, etc.

### Significance of Random Variables in CSP

A single random variable, such as that corresponding to a coin toss, a die roll, or even a noise voltage, isn’t central to CSP. As we’ve documented at the CSP Blog, cyclostationary signal processing is about the properties of observable signals, which we conceptualize as functions of time. A random process (also called a stochastic process or a random signal) is what we need to bridge the gap between abstract probability theory and concrete sampled signals.

A random process is a time-indexed collection of random variables (or space-indexed or indexed by any other independent variable, but for us, time-indexed is most appropriate). It is associated with a probability space, just like a random variable. We’ll look into random processes in a future SPTK post. Cyclostationary signals are most often defined as a certain class of random process. I use ‘cyclostationary signal’ instead of ‘cyclostationary random process’ on the CSP Blog because I want to emphasize signal processing rather than probability theory; we’re practitioners here.

Note that the correlation between random variables is used in the spectral correlation function, and that the spectral coherence function is a correlation coefficient. The temporal moment function is a higher-order moment that happens to be periodically time varying, and so has Fourier series coefficients that are the cyclic temporal moments. All of this is connected to cumulants and cyclic cumulants, and cyclic polyspectra too, but we’ll wait to make that connection concrete until we introduce the Fourier transform of the probability density function, which is called the characteristic function.
