In this Signal Processing ToolKit post, we examine the concept of a random variable.
What is a random variable? As the authors of The Literature [R153] say, “a random variable is actually neither random nor a variable.” Good start! They go on to give the usual definition of a random variable, which is a mapping between events in a probability space and the real or complex numbers. We need to review the basics of probability so that we can grasp these key ideas of mappings, events, and probability space.
Probability theory is often motivated by games of chance, such as those involving dice and cards, and by the associated familiar actions of counting things and forming ratios. When we want to understand what the “chances” are of throwing a fair six-sided die and seeing that the top face shows two pips, we count all the possible occurrences, or outcomes, associated with throwing the die and divide the number of desired occurrences by the number of possible occurrences to arrive at the desired chance or probability. In this dice example, that chance would be $1/6$: one outcome corresponds to two pips and there are exactly six possible outcomes. The notion of repeating an experiment, such as throwing dice, a large number of times and recording the outcomes in order to create a table of probabilities is aligned with the frequentist school of thought on probability, and it is intuitive, because we readily understand the notions of experimentation, repetition, and counting.
Each time we perform the experiment (each performance is called a trial), we obtain an outcome or event. Suppose our experiment can have some outcome called $A$. We perform $N$ trials of the experiment and observe $N_A$ occurrences of the event $A$. This implies that $0 \le N_A \le N$. We define an observed probability of the event $A$ by the ratio of $N_A$ to $N$,

$$P_N(A) = \frac{N_A}{N}.$$

We must have, then,

$$0 \le P_N(A) \le 1,$$

which is the origin of the fact that probabilities lie between zero and one. The frequentist then imagines letting the number of experimental trials increase without bound, $N \rightarrow \infty$. The observed probability then passes to a theoretical probability,

$$P(A) = \lim_{N \rightarrow \infty} \frac{N_A}{N}. \tag{3}$$

A complication in interpretation arises for the cases of $P(A) = 0$ and $P(A) = 1$. Imagine doing an infinite number of trials and observing the event $A$ during exactly one of those trials. Then the probability of $A$ is zero by (3), but $A$ does actually occur. So we cannot claim that a zero-probability event is impossible. It just happens ‘with probability zero,’ which is also referred to as ‘almost never’ happening. Similarly, for $P(A) = 1$, we cannot interpret this to mean that $A$ always happens. It just happens with probability one, or ‘almost surely.’
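Let the experiment have $K$ possible outcomes $A_1, A_2, \ldots, A_K$, and only these outcomes. Moreover, suppose these outcomes can only happen by themselves: we never see event $A_1$ and event $A_2$ happen simultaneously, for example. Then these particular events are mutually exclusive. If $N_j$ of the $N$ trials produce event $A_j$, the observed probability for event $A_j$ is

$$P_N(A_j) = \frac{N_j}{N}.$$

Now, an experimental trial always produces an outcome so that

$$\sum_{j=1}^K N_j = N \quad \Longrightarrow \quad \sum_{j=1}^K P_N(A_j) = 1.$$

When we pass to the limit of an infinite number of trials, we obtain

$$\sum_{j=1}^K P(A_j) = 1.$$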
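The frequency-of-occurrence idea is easy to try numerically. The following is a minimal sketch, assuming a simulated fair die from Python's standard `random` module; the function name `observed_probabilities` and the seed are illustrative choices, not anything from the post. It forms the ratio of occurrences to trials for each face and shows that the ratios sum to one for any number of trials.

```python
import random
from collections import Counter

def observed_probabilities(num_trials, seed=0):
    """Throw a simulated fair die num_trials times; return N_j / N for each face j."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(num_trials))
    return {face: counts[face] / num_trials for face in range(1, 7)}

for num_trials in (100, 10_000, 100_000):
    probs = observed_probabilities(num_trials)
    # The ratio for "two pips" drifts toward 1/6 as the trial count grows,
    # and the six ratios sum to one because every trial produces an outcome.
    print(num_trials, round(probs[2], 4), round(sum(probs.values()), 6))
```

A finite simulation only ever produces observed probabilities; the limiting value is approached, never reached.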
We can consider more complicated experiments and associated events (outcomes). We might, for example, wish to know the probability of the occurrence of two events, should the experiment allow that kind of situation. This leads to the notion of the joint probability of two events $A$ and $B$. Here the word ‘joint’ is associated with the notion of ‘and.’ We seek the probability that event $A$ happens and, simultaneously, event $B$ happens. In the frequentist approach, we simply count the number of experimental trials $N_{AB}$ that result in observing both $A$ and $B$,

$$P(A, B) = \lim_{N \rightarrow \infty} \frac{N_{AB}}{N}.$$

This joint probability may or may not be equal to the product of the individual probabilities $P(A)P(B)$. For example, consider the experiment consisting of throwing two fair six-sided dice. One joint event is ‘the sum is $s$ on one throw and is also $s$ on the next,’ for some particular value $s$. We intuitively think this should be the probability of getting a sum of $s$ times itself, because the two throws do not influence each other. We can construct a big table of all possible events relating to two consecutive throws and find that the probability is indeed the square of the single-throw probability.

Other events are more tangled up together. Consider next the joint event that on a single throw the sum is seven (event $A$) and at least one die shows a six (event $B$). For event $A$, there are six possibilities out of thirty-six total, and for event $B$ there are eleven possibilities. So $P(A) = 6/36 = 1/6$ and $P(B) = 11/36$. But looking at the thirty-six possibilities, there are only two that correspond to both $A$ and $B$: $(1, 6)$ and $(6, 1)$. So in this case, $P(A, B) = 2/36 \neq P(A)P(B) = 11/216$. We need another kind of probability: conditional probability.
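The counting in the two-dice example can be checked by brute-force enumeration. This is a small sketch using exact fractions; the helper `prob` and the predicate names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die 1, die 2) results of a single throw of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a true/false test on one outcome."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def sum_is_seven(o):
    return o[0] + o[1] == 7

def shows_a_six(o):
    return 6 in o  # at least one die shows six pips

p_seven = prob(sum_is_seven)
p_six = prob(shows_a_six)
p_joint = prob(lambda o: sum_is_seven(o) and shows_a_six(o))

print(p_seven, p_six, p_joint)    # 1/6 11/36 1/18
print(p_joint == p_seven * p_six)  # False: the joint is not the product
```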
The conditional probability is the probability of one event given that another event happened, which is denoted by $P(A | B)$, and we say ‘probability of $A$ given $B$.’ This probability is defined by

$$P(A | B) = \frac{P(A, B)}{P(B)},$$

which is the probability of the joint event as a fraction of the probability of the conditioning event. This leads immediately to

$$P(A, B) = P(A | B) P(B).$$

The symmetry of this relation with respect to events $A$ and $B$, $P(A | B) P(B) = P(B | A) P(A)$, leads to Bayes’ rule

$$P(A | B) = \frac{P(B | A) P(A)}{P(B)}.$$

In the important case for which the conditioning event has no effect on the probability, $P(A | B) = P(A)$, we have

$$P(A, B) = P(A) P(B), \tag{13}$$
and we say that the two events are statistically independent.
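Continuing with the two-dice throw, here is a hedged sketch of conditional probability, Bayes' rule, and an independence check by enumeration. The helper names (`prob`, `conditional`) and the second pair of events (first die even, second die six) are illustrative assumptions rather than examples from the post.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely results

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def conditional(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    joint = prob(lambda o: event(o) and given(o))
    return joint / prob(given)

def sum_is_seven(o):
    return o[0] + o[1] == 7

def shows_a_six(o):
    return 6 in o

def first_is_even(o):
    return o[0] % 2 == 0

def second_is_six(o):
    return o[1] == 6

p_seven_given_six = conditional(sum_is_seven, shows_a_six)  # 2/11
p_six_given_seven = conditional(shows_a_six, sum_is_seven)  # 1/3

# Bayes' rule recovers one conditional probability from the other.
bayes = p_six_given_seven * prob(sum_is_seven) / prob(shows_a_six)
print(p_seven_given_six, bayes)

# Conditioning on the second die tells us nothing about the first die.
print(conditional(first_is_even, second_is_six) == prob(first_is_even))  # True
```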
This frequentist, or frequency of occurrence, approach to probability is usually too cumbersome to apply to complex problems, such as analyzing the probabilistic behavior of radio signals (but eventually we will cover the attempt, called fraction-of-time probability, at the CSP Blog), so it is relegated to its role as a motivator, and a more abstract axiomatic probability theory is embraced.
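The axioms of axiomatic probability are the following, stated for events $A$ and $B$ that are subsets of the probability space $\Omega$:

$$P(A) \ge 0,$$

$$P(\Omega) = 1,$$

$$P(A \cup B) = P(A) + P(B) \quad \mbox{if } A \cap B = \emptyset.$$

The key abstraction is the probability space $\Omega$, which contains all possible events of interest. It is a set. Some event must happen when an experiment is conducted, so that the probability of the event that corresponds to the entire probability space is one.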
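The axioms can be used to develop joint probabilities and conditional probabilities, and they are consistent with Bayes’ rule. In probability theory, the union of two sets $A$ and $B$ in the probability space $\Omega$ is denoted by $A \cup B$ or $A + B$: this is ‘$A$ or $B$.’ The intersection is denoted by $A \cap B$ or $AB$: this is ‘$A$ and $B$.’

From the axioms, we can develop the following relationships

$$P(\emptyset) = 0,$$

$$P(\bar{A}) = 1 - P(A),$$

$$P(A \cup B) = P(A) + P(B) - P(AB).$$

Here the bar on $A$ in $\bar{A}$ denotes the complement of $A$:

$$\bar{A} = \{\omega \in \Omega : \omega \notin A\}.$$

If $A$ and $B$ are statistically independent, then $P(AB) = P(A)P(B)$. For independent events,

$$P(A \cup B) = P(A) + P(B) - P(A)P(B).$$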
A random variable $X$ is a function that maps the elements of the probability space $\Omega$ to real numbers ($X: \Omega \rightarrow \mathbb{R}$) or to complex numbers ($X: \Omega \rightarrow \mathbb{C}$).
Once we have a probability space $\Omega$ and a random variable $X$, we have a characterization of the experiment that involves numerical values, so we can potentially apply arithmetic, and all those mathematical tools that are built on arithmetic, to making inferences or predictions regarding future trials of the experiment. And of course, once we are in a numerical domain, we can use computers to do all of our calculations.
The issue now becomes characterization of a random variable–what is it like? There are high-level properties of a random variable, such as whether it is real-valued or complex-valued. If the range of numbers for the random variable is finite or countable (such as the integers), the random variable is called discrete. Otherwise it is continuous. The next step in characterization is to connect the probability theory to the numerical values embodying the random variable.
The Cumulative Distribution Function
A simple way to start a characterization of a random variable is to look at the probability that the variable is less than or equal to some constant $x$. (Is there a simpler way? Is this fundamental?) In other words, we seek the probability of the event $\{X \le x\}$. Such an event implicitly assumes $X$ is a real-valued random variable; extensions to complex variables are straightforward.

$$F_X(x) = P(X \le x).$$

This function, $F_X(x)$, which is associated with the random variable $X$ and is a function of the number $x$, is called the cumulative distribution function (CDF). Since the CDF is a probability, it must obey the following restriction

$$0 \le F_X(x) \le 1,$$

which follows from the axioms. The function must approach zero as $x$ approaches $-\infty$, because the probability that the random variable is less than $-\infty$ must be zero:

$$\lim_{x \rightarrow -\infty} F_X(x) = 0.$$

By the same reasoning, $F_X(x)$ approaches one as $x$ approaches $+\infty$. The CDF is a non-decreasing function of $x$ because the probability that $X \le x_2$ cannot be smaller than the probability that $X \le x_1$ if $x_1 < x_2$,

$$F_X(x_1) \le F_X(x_2), \quad x_1 < x_2.$$
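We can use the CDF to find the probability that the random variable lies on some interval, say $(x_1, x_2]$. Let $A$ be the event $\{X \le x_1\}$ and let $\bar{A}$ be the complement of $A$, $\{X > x_1\}$. The events $A$ and $\bar{A}$ are mutually exclusive (note the precise definitions of the intervals) and $A \cup \bar{A}$ is the entire probability space, so that

$$P(X \le x_1) + P(X > x_1) = 1. \tag{28}$$

The event $\bar{A}$ is itself made up of the mutually exclusive events $\{x_1 < X \le x_2\}$ and $\{X > x_2\}$, so that

$$P(X > x_1) = P(x_1 < X \le x_2) + P(X > x_2) = P(x_1 < X \le x_2) + 1 - F_X(x_2). \tag{31}$$

Combining (28) with (31) yields

$$P(x_1 < X \le x_2) = F_X(x_2) - F_X(x_1).$$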
So we can find the probability that the random variable lies on any interval by using the CDF. Further, we can combine different intervals to find the probability that the random variable meets all sorts of conditions, using the basic axioms and derived rules of probability. In this sense, the CDF is fundamental.
Notice that if $F_X(x)$ is continuous, the probability that the random variable takes on any particular value $x_0$ is zero because

$$P(X = x_0) = \lim_{\epsilon \rightarrow 0^+} \left[ F_X(x_0) - F_X(x_0 - \epsilon) \right] = 0.$$

This mathematically formalizes the notion that if one chooses a real number ‘at random’ from some interval $[a, b]$, the probability that that number is any particular value is zero. The number of possibilities is an uncountable infinity, but our frequency of occurrence is defined by a countable infinity of trials, so we can repeat the experiment forever and still not pick $x_0$, or perhaps pick it a few times, but the probability of picking it is still zero. Nevertheless, after the fact, some number was chosen in each trial.
When the CDF possesses step discontinuities, however, we can see that the random variable can take on the value of the independent variable at the discontinuity location with a non-zero probability. Discrete random variables have this property: their CDFs are piecewise constant functions, characterized by step-function discontinuities. For example, if we consider throwing a single fair six-sided die and the event ‘number of pips showing on the upper face after the die comes to a stop,’ we can easily map the events to the six integers $\{1, 2, 3, 4, 5, 6\}$. Then the CDF for this random variable is the stair-step function shown in Figure 2.
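The stair-step CDF for the fair die is simple enough to write down directly. The sketch below is one possible rendering; the function names `die_cdf` and `interval_probability` are hypothetical. It also uses the difference of two CDF values to get an interval probability and the size of the jump at a face value.

```python
import math
from fractions import Fraction

def die_cdf(x):
    """P(X <= x) for the number of pips X showing on a fair six-sided die."""
    faces_at_or_below = min(max(math.floor(x), 0), 6)
    return Fraction(faces_at_or_below, 6)

def interval_probability(x1, x2):
    """P(x1 < X <= x2), computed as the difference of two CDF values."""
    return die_cdf(x2) - die_cdf(x1)

print(die_cdf(0.5), die_cdf(1), die_cdf(3.7), die_cdf(100))  # 0 1/6 1/2 1
print(interval_probability(2, 5))                            # 1/2 (faces 3, 4, 5)

# The jump at a face value is the probability of exactly that face.
print(die_cdf(4) - die_cdf(4 - 1e-9))                        # 1/6
```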
The CDF is one complete way of quantifying the probabilistic behavior of a random variable, but in many kinds of calculations we favor working with its derivative, the probability density function (PDF).
The Probability Density Function
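The PDF is defined as the derivative of the CDF,

$$f_X(x) = \frac{d}{dx} F_X(x).$$

It follows that the CDF is the integral of the PDF,

$$F_X(x) = \int_{-\infty}^{x} f_X(u) \, du.$$

Recalling the fundamental theorem of calculus, which says that

$$\int_{a}^{b} g(u) \, du = G(b) - G(a),$$

where $G^\prime(u) = g(u)$, we have

$$P(x_1 < X \le x_2) = F_X(x_2) - F_X(x_1) = \int_{x_1}^{x_2} f_X(u) \, du,$$

provided that $x_1$ and $x_2$ are not locations of impulses in $f_X(x)$. This is why $f_X(x)$ is a probability density: $f_X(x) \, dx$ is interpreted as the probability that $X$ takes on values in some small neighborhood near $x$.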
The PDF for the CDF in Figure 2, which corresponds to throwing a fair six-sided die, is shown in Figure 3. Since the random variable here is a discrete random variable, its PDF is purely impulsive. In such cases, the PDF is often replaced with the closely related probability mass function (PMF), where the ‘mass’ at a value is equal to the corresponding area in the impulsive PDF. In the next SPTK post we’ll look at a variety of PDFs for random variables that arise in the context of communication signals, their propagation channels, and their receivers.
The value $f_X(x)$ is not, as we’ve mentioned, a probability, but not just any function can qualify as a PDF. Here are some important (for analysis and signal-processing practice) properties of PDFs.
Non-Negativity. $f_X(x) \ge 0$. This arises from the non-decreasing property of the CDF. Thus its derivative, the PDF, must be either zero (when the CDF is constant over some stretch of $x$), or positive (when the CDF increases over some stretch of $x$).
Unit Total Area. $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$. This follows from the behavior of the CDF at $\pm\infty$.
Integrates to a CDF. $\int_{-\infty}^{x} f_X(u) \, du = F_X(x)$. This follows from the definition of the PDF.
Subinterval Integrals are Probabilities. $\int_{x_1}^{x_2} f_X(x) \, dx = P(x_1 < X \le x_2)$. This follows from the definition of the PDF and the properties of the CDF.
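These properties can be spot-checked numerically for any candidate density. The sketch below assumes a made-up example, the triangular density $f_X(x) = 2x$ on $[0, 1]$ with CDF $F_X(x) = x^2$, which is not taken from the post, and uses a plain midpoint-rule integrator (`integrate` is a hypothetical helper).

```python
def example_pdf(x):
    """Hypothetical density: f(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def example_cdf(x):
    """Matching CDF: F(x) = x^2 on [0, 1]."""
    if x < 0.0:
        return 0.0
    return x * x if x <= 1.0 else 1.0

def integrate(func, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (k + 0.5) * width) for k in range(steps))

# Non-negativity on a grid of test points.
print(all(example_pdf(k / 100) >= 0.0 for k in range(-100, 200)))
# Unit total area (the density is zero outside [0, 1]).
print(round(integrate(example_pdf, -1.0, 2.0), 3))
# A subinterval integral matches the CDF difference F(0.5) - F(0.2).
print(round(integrate(example_pdf, 0.2, 0.5), 6), example_cdf(0.5) - example_cdf(0.2))
```

A grid check like this can only reject a bad candidate; it is not a proof that a function is a valid PDF.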
Expectation and Moments
The notion of averaging over time to characterize the gross behavior of a time function is intuitive and we use it a lot in signal processing, and in CSP. You simply add up (integrate) all the values of the time function over some interval of time with length $T$ and divide by $T$ to obtain the temporal average.
When dealing with a random variable defined on a probability space, there is no time to average over. To find the gross properties of a random variable we must average over the only thing we’ve got: the probability space $\Omega$.
The average value of a random variable $X$ with PDF $f_X(x)$ is given by the expected value $E[X]$ defined by

$$\mu_X = E[X] = \int_{-\infty}^{\infty} x f_X(x) \, dx.$$

This is a reasonable average because we are taking the probability that the random variable takes on values near $x$, $f_X(x) \, dx$, and multiplying that probability by $x$, and summing up. So values of the random variable that are more probable get represented at a higher level in the sum. So in that sense the value of $E[X]$ is the value that we expect, hence the terminology of expected value.
Similarly, the expected value of any function $g(X)$ of $X$ is given by

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx.$$
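The most important and useful functions are the homogeneous nonlinearities $g(X) = X^n$, for $n = 1, 2, \ldots$. The corresponding expected values $E[X^n]$ are called the moments of $X$. The mean $\mu_X$ is the first moment of $X$. The second moment is

$$E[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x) \, dx.$$

More useful are the centralized moments, which are the expected values of $(X - \mu_X)^n$. The second centralized moment is called the variance:

$$\sigma_X^2 = E[(X - \mu_X)^2] = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x) \, dx.$$

The square root of the variance is called the standard deviation. An important property of the expectation is that it is linear: $E[aX + bY] = aE[X] + bE[Y]$. We can easily find the relationship between the second moment (also called the mean-square), the mean, and the variance:

$$\sigma_X^2 = E[X^2 - 2\mu_X X + \mu_X^2] = E[X^2] - 2\mu_X E[X] + \mu_X^2 = E[X^2] - \mu_X^2, \tag{47}$$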
so that the variance is the mean-square minus the squared mean.
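For a discrete random variable the expectation integral collapses to a weighted sum over the probability masses. Here is a minimal sketch for the fair die, with exact fractions, that checks the mean-square, mean, and variance relationship; the `expectation` helper is an illustrative name.

```python
from fractions import Fraction

# Probability mass function of a fair six-sided die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(g):
    """E[g(X)]: sum of g(x) weighted by the probability mass at x."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(lambda x: x)                     # first moment
mean_square = expectation(lambda x: x * x)          # second moment
variance = expectation(lambda x: (x - mean) ** 2)   # second centralized moment

print(mean, mean_square, variance)           # 7/2 91/6 35/12
print(variance == mean_square - mean ** 2)   # True
```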
Canonical Example: The Uniform Distribution
Quite a few examples of random-variable density functions are provided in the next SPTK post. Here we consider only two: the uniform and the Gaussian. First the uniform, which is simple and so provides clear basic illustrations of the fundamental random-variable characterizations of CDF, PDF, mean, and variance.
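The uniform distribution has a PDF that is a rectangle on some interval $[a, b]$ with $b > a$. Because the density must integrate to one, the uniform density is completely specified by the two numbers $a$ and $b$:

$$f_X(x) = \frac{1}{b-a} \, \mbox{rect}\left( \frac{x - (a+b)/2}{b-a} \right),$$

where $\mbox{rect}(\cdot)$ is our usual rectangle function with unit width, unit height, and center of zero. The name of uniform is apt because the probability is uniform across the interval in the sense that the probability that the random variable takes on values in any subinterval of length $\Delta$ is $\Delta/(b-a)$.
Let’s illustrate the use of the PDF and expectation by calculating the first- and second-order moments for this simple random variable. The mean value is given by

$$\mu_X = E[X] = \int_a^b \frac{x}{b-a} \, dx = \frac{b^2 - a^2}{2(b-a)} = \frac{a+b}{2},$$

which is the midpoint of the interval $[a, b]$, matching intuition. For example if $a = 0$ and $b = 1$, as in MATLAB’s rand.m function, then the mean value is $1/2$.
The second moment, or mean-square, is given by

$$E[X^2] = \int_a^b \frac{x^2}{b-a} \, dx = \frac{b^3 - a^3}{3(b-a)} = \frac{a^2 + ab + b^2}{3}.$$

We can use (47) to find the variance,

$$\sigma_X^2 = E[X^2] - \mu_X^2 = \frac{a^2 + ab + b^2}{3} - \frac{(a+b)^2}{4} = \frac{(b-a)^2}{12},$$

and the standard deviation

$$\sigma_X = \frac{b-a}{2\sqrt{3}}.$$

These values become quite simple for the rand.m case of $a = 0$ and $b = 1$:

$$\mu_X = \frac{1}{2}, \quad E[X^2] = \frac{1}{3}, \quad \sigma_X^2 = \frac{1}{12}, \quad \sigma_X = \frac{1}{2\sqrt{3}} \approx 0.289.$$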
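The closed-form uniform moments can be compared with sample averages. The sketch below uses Python's `random.Random.uniform` in place of MATLAB's rand.m, which is a substitution on my part; the function names, sample count, and seed are arbitrary.

```python
import math
import random

def uniform_moments(a, b):
    """Closed-form mean, mean-square, variance, and standard deviation."""
    mean = (a + b) / 2
    mean_square = (a * a + a * b + b * b) / 3
    variance = (b - a) ** 2 / 12
    return mean, mean_square, variance, math.sqrt(variance)

def sample_moments(a, b, num_samples=200_000, seed=1):
    """The same four quantities estimated from pseudo-random samples."""
    rng = random.Random(seed)
    samples = [rng.uniform(a, b) for _ in range(num_samples)]
    mean = sum(samples) / num_samples
    mean_square = sum(x * x for x in samples) / num_samples
    variance = mean_square - mean ** 2  # mean-square minus squared mean
    return mean, mean_square, variance, math.sqrt(variance)

print([round(v, 4) for v in uniform_moments(0.0, 1.0)])  # [0.5, 0.3333, 0.0833, 0.2887]
print([round(v, 3) for v in sample_moments(0.0, 1.0)])   # close to the line above
```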
A Ubiquitous Random Variable: The Gaussian Distribution
From numerous observations of random phenomena, it has been found that many physical processes are well-modeled by using Gaussian random variables. The Gaussian distribution is also known as the normal distribution (because it is so common) and the Gaussian probability density function is the well-known and oft-cited ‘bell-shaped curve.’
The PDF for the Gaussian random variable is explicitly a function of the mean value $\mu$ and the variance $\sigma^2$:

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$

The ubiquity of the Gaussian distribution is due to the interesting fact that when a large number of random variables are added together, the distribution of the resulting sum approaches a Gaussian regardless of the particular distributions of the variables. This fact is summarized in the central limit theorem, which we will briefly touch on near the end of the post. The Gaussian PDF and CDF are illustrated in Figure 4. Note that most of the ‘probability action’ lies within one standard deviation of the mean value (about $68\%$). That is, $P(\mu - \sigma \le X \le \mu + \sigma) \approx 0.68$.
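The one-standard-deviation figure can be reproduced from the Gaussian CDF, which has a closed form in terms of the error function. This is a small sketch using `math.erf` from the standard library; the function names are illustrative.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_within(k, mu=0.0, sigma=1.0):
    """P(mu - k*sigma <= X <= mu + k*sigma) as a difference of CDF values."""
    return gaussian_cdf(mu + k * sigma, mu, sigma) - gaussian_cdf(mu - k * sigma, mu, sigma)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))  # 0.6827, 0.9545, 0.9973

# The result does not depend on the particular mean and variance.
print(round(prob_within(1, mu=3.0, sigma=2.0), 4))  # 0.6827
```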
More Than One Random Variable: Joint Densities and Correlation
In many situations we are interested in the relationship between two or more random variables. We must therefore generalize our definitions of the cumulative distribution and probability density functions.
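Let’s consider two random variables $X$ and $Y$. The natural extension of the CDF from one variable to two is to consider the event where $X \le x$ and also $Y \le y$:

$$F_{XY}(x, y) = P(X \le x, Y \le y).$$

The function $F_{XY}(x, y)$ is called the joint cumulative distribution function. The corresponding density function is the mixed partial derivative of the distribution function

$$f_{XY}(x, y) = \frac{\partial^2}{\partial x \, \partial y} F_{XY}(x, y).$$

The properties of $F_{XY}(x, y)$ and $f_{XY}(x, y)$ are similar to those of their univariate counterparts. In addition, the marginal densities $f_X(x)$ and $f_Y(y)$ can be obtained from the joint density by integration,

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y) \, dy, \quad f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y) \, dx.$$

Expectation works the same way as before,

$$E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y) \, dx \, dy.$$

The expected value of $XY$ is called the correlation (for complex-valued random variables, it is the expectation of $XY^*$, as we’ve seen in the context of the non-conjugate and conjugate spectral correlation functions).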
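The conditional probability density functions are related to the joint density functions as follows

$$f_{X|Y}(x | y) = \frac{f_{XY}(x, y)}{f_Y(y)}, \quad f_{Y|X}(y | x) = \frac{f_{XY}(x, y)}{f_X(x)}.$$

Two random variables $X$ and $Y$ are said to be statistically independent if their joint density function factors into the product of the marginal density functions (compare with (13) above),

$$f_{XY}(x, y) = f_X(x) f_Y(y).$$

For statistically independent variables, the conditioning makes no difference to the density function

$$f_{X|Y}(x | y) = f_X(x), \quad f_{Y|X}(y | x) = f_Y(y).$$

The cross correlation between two statistically independent random variables $X$ and $Y$ is simply the product of their mean values,

$$E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y f_X(x) f_Y(y) \, dx \, dy = E[X] E[Y] = \mu_X \mu_Y.$$

If either $\mu_X = 0$ or $\mu_Y = 0$, then the correlation is zero. We can conclude that the correlation between two statistically independent zero-mean random variables is zero.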
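The product-of-means result is easy to confirm for a pair of independent discrete variables, where the double integral becomes a double sum. This sketch returns to two fair dice, with exact fractions; the dependent pair (one die and the sum of both) is an extra illustration that is not in the post.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
# Independence: the joint mass is the product of the marginals, (1/6)(1/6).
joint_pmf = {(x, y): Fraction(1, 36) for x, y in product(faces, faces)}

def expectation(g):
    """E[g(X, Y)] as a double sum over the joint probability masses."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())

mean_x = expectation(lambda x, y: x)
mean_y = expectation(lambda x, y: y)
correlation = expectation(lambda x, y: x * y)
print(correlation, mean_x * mean_y)  # 49/4 49/4

# Removing the means gives independent zero-mean variables: correlation zero.
print(expectation(lambda x, y: (x - mean_x) * (y - mean_y)))  # 0

# A dependent pair, X and the sum X + Y, does not obey the product rule.
mean_sum = expectation(lambda x, y: x + y)
print(expectation(lambda x, y: x * (x + y)), mean_x * mean_sum)  # 329/12 49/2
```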
The cross correlation, $E[XY]$, is the expected value of the product of two random variables, but it can be more revealing about their relationship to look at the covariance, which is the expected value of the product of the two random variables with their means removed, $E[(X - \mu_X)(Y - \mu_Y)]$. However, the two variables could have quite different scales (approximate length of the support of their PDFs), and so we might consider also scaling each variable so that their variances are equal. It is easy to show that if a random variable $X$ has variance $\sigma_X^2$ and mean $\mu_X$, then the variable $(X - \mu_X)/\sigma_X$ has a mean of zero and a unit variance. Therefore, the cross correlation of normalized variables is

$$\rho = E\left[ \left( \frac{X - \mu_X}{\sigma_X} \right) \left( \frac{Y - \mu_Y}{\sigma_Y} \right) \right] = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y},$$

which is called the correlation coefficient. The correlation coefficient always lies between $-1$ and $1$. We can see that by defining $U = (X - \mu_X)/\sigma_X$ and $V = (Y - \mu_Y)/\sigma_Y$ and looking at the expected value of $(U + V)^2$ and $(U - V)^2$:

$$E[(U \pm V)^2] = E[U^2] \pm 2E[UV] + E[V^2] = 2 \pm 2E[UV].$$

Now the correlation coefficient $\rho$ is equal to $E[UV]$ here, so

$$E[(U \pm V)^2] = 2(1 \pm \rho).$$

We must have $E[(U \pm V)^2] \ge 0$ (why?), so

$$1 \pm \rho \ge 0 \quad \Longrightarrow \quad -1 \le \rho \le 1.$$

Therefore the correlation coefficient between any two random variables must always be between $-1$ and $1$. Moreover, if the two random variables $X$ and $Y$ are statistically independent, then $\rho = 0$ because $E[(X - \mu_X)(Y - \mu_Y)] = E[X - \mu_X] E[Y - \mu_Y] = 0$.
At one extreme, if $Y = aX$ for $a > 0$, then $\rho = 1$. At the other extreme, if $Y = aX$ for $a < 0$, then $\rho = -1$. The correlation coefficient gives an indication of the degree to which two random variables are linearly related, and also an indication of the sign of that relationship. It is a useful exercise to compute the correlation coefficient between a random variable $X$ and another random variable $Y = aX + b$, where $a$ and $b$ are constants.
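A sample-based estimate of the correlation coefficient shows the two extremes and the independent case side by side. The sketch below assumes Gaussian samples from `random.Random.gauss`; the particular slopes, offsets, seed, and the function name `correlation_coefficient` are arbitrary choices for illustration. It is a numerical check, not a substitute for working the exercise analytically.

```python
import math
import random

def correlation_coefficient(xs, ys):
    """Sample estimate of E[(X - mu_X)(Y - mu_Y)] / (sigma_X * sigma_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(2)
xs = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
other = [rng.gauss(0.0, 1.0) for _ in range(50_000)]  # drawn independently of xs

rho_pos = correlation_coefficient(xs, [3.0 * x + 1.0 for x in xs])   # positive slope
rho_neg = correlation_coefficient(xs, [-0.5 * x + 4.0 for x in xs])  # negative slope
rho_ind = correlation_coefficient(xs, other)                         # no relationship

print(round(rho_pos, 6), round(rho_neg, 6), round(rho_ind, 2))
```

In this sketch the offset shifts only the mean and the slope magnitude cancels in the normalization, so only the sign of the slope survives.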
The Central Limit Theorem
Suppose we have a sequence of independent random variables $X_1, X_2, \ldots$, with identical density functions. If we form the sum of the first $N$,

$$S_N = \sum_{j=1}^{N} X_j,$$

then we’ve simply summed up $N$ random variables. Consider the normalized version of each $S_N$, as we did for the correlation coefficient

$$Z_N = \frac{S_N - \mu_N}{\sigma_N},$$

where $\mu_N$ is the mean value of $S_N$ and $\sigma_N^2$ is the variance of $S_N$. Then, clearly, $E[Z_N] = 0$ and $E[Z_N^2] = 1$. The central limit theorem says that the probability density function for $Z_N$ approaches the standard normal distribution as $N \rightarrow \infty$. The standard normal distribution is just the Gaussian distribution with zero mean and unit variance, so that the theorem says that

$$\lim_{N \rightarrow \infty} f_{Z_N}(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}.$$
The implication of the central limit theorem is that when a random variable is modeled as the sum of a large number of independent similar events, the distribution of that random variable tends to the Gaussian. A relevant example for the CSP Blog is the electric field value produced by the many electrons in a conductor (thermal noise). Turning to signals, and as a preview, if we consider the reception of a large number of interfering signals at a single radio receiver, we can see that the resulting composite signal will tend to a Gaussian signal no matter the distributions of the involved interferers.
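A quick simulation gives a feel for the theorem. The sketch below assumes independent uniform variables on $[0, 1)$, each with mean $1/2$ and variance $1/12$, so the sum of `num_terms` of them has mean `num_terms / 2` and variance `num_terms / 12`. It compares the fraction of normalized sums within one standard deviation of zero to the Gaussian value; the function names, trial count, and seed are arbitrary.

```python
import math
import random

def normalized_sum(num_terms, rng):
    """Sum num_terms uniform [0, 1) variables, then remove the mean and scale to unit variance."""
    total = sum(rng.random() for _ in range(num_terms))
    mean = num_terms / 2.0
    std = math.sqrt(num_terms / 12.0)
    return (total - mean) / std

def fraction_within_one_sigma(num_terms, num_trials=20_000, seed=3):
    """Observed probability that the normalized sum lies in [-1, 1]."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(num_trials) if abs(normalized_sum(num_terms, rng)) <= 1.0)
    return hits / num_trials

gaussian_value = math.erf(1.0 / math.sqrt(2.0))  # about 0.6827

for num_terms in (1, 2, 30):
    print(num_terms, round(fraction_within_one_sigma(num_terms), 3), round(gaussian_value, 3))
```

For a single uniform variable the fraction is near $1/\sqrt{3} \approx 0.577$, and it moves toward the Gaussian value as more terms are summed.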
There is a lot more to say about the theory of random variables, and there are many textbooks that treat the topic. I suggest The Literature [R149] and [R156] as good starting points. In the next Signal Processing ToolKit post, we’ll look at several kinds of random variables by using MATLAB to generate them and to investigate their parameters: PDF, mean, variance, correlation, etc.
Significance of Random Variables in CSP
A single random variable, such as that corresponding to a coin toss, a die roll, or even a noise voltage, isn’t central to CSP. As we’ve documented at the CSP Blog, cyclostationary signal processing is about the properties of observable signals, which we conceptualize as functions of time. A random process (also called a stochastic process or a random signal) is what we need to bridge the gap between abstract probability theory and concrete sampled signals.
A random process is a time-indexed collection of random variables (or space-indexed or indexed by any other independent variable, but for us, time-indexed is most appropriate). It is associated with a probability space, just like a random variable. We’ll look into random processes in a future SPTK post. Cyclostationary signals are most often defined as a certain class of random process. I use ‘cyclostationary signal’ instead of ‘cyclostationary random process’ on the CSP Blog because I want to emphasize signal processing rather than probability theory; we’re practitioners here.
Note that the correlation between random variables is used in the spectral correlation function, and that the spectral coherence function is a correlation coefficient. The temporal moment function is a higher-order moment that happens to be periodically time varying, and so has Fourier series coefficients that are the cyclic temporal moments. All of this is connected to cumulants and cyclic cumulants, and cyclic polyspectra too, but we’ll wait to make that connection concrete until we introduce the Fourier transform of the probability density function, which is called the characteristic function.