Update June 2020
I’ll be adding new papers to this post as I find them. At the end of the original post there is a sequence of date-labeled updates that briefly describe the relevant aspects of the newly found papers. Some machine-learning modulation-recognition papers deserve their own post, so check back at the CSP Blog from time to time for “Comments On …” posts.
This is another post about machine learning (ML) and modulation recognition (MR). Previously we looked at the basic idea of MR and why it is a difficult signal-processing problem to solve. We also looked at several papers in the engineering literature that apply neural-network-based ML processing to the MR problem. Finally, I posted a large simulated communication-signal dataset to the CSP Blog as a challenge to the Machine Learners, together with a corresponding set of processing results I obtained by applying non-machine-learning CSP-based MR and parameter-estimation algorithms to the posted data set.
In this post, I want to point out the kinds of data sets that are used in various modulation-recognition ML papers and ask questions about their fidelity, appropriateness, and utility. I’ve been especially puzzled to read the common refrain about how ML algorithms produce performance better than “conventional methods.” Rarely are these conventional methods described in any detail, and when details are provided they are garbled or insufficient. What is most curious, though, is that the training and testing data sets used in the ML MR papers are narrow in scope, yet conventional methods of MR are often not narrow in scope. If the ML MR algorithm is trained and tested on a set of modulated signals all having symbol rate kHz (this happens; keep reading), what is the corresponding conventional MR method with which to compare? Is it a generic method that is provided prior information on the known rate? Or is it some highly specialized conventional method that is derived or built from the ground up using that prior information about the symbol rate? What are they talking about when they claim superiority to the conventional method in such cases?
So I think there is a sort of gap between what the Machine Learners think of as modulation recognition and what the “conventional” researchers and practitioners mean. When I say I can recognize a BPSK signal, I mean I can recognize all BPSK signals.
‘All BPSK signals’ is a bit of an exaggeration. There are many pathological cases, such as a BPSK signal with symbol rate so low that it would take days to collect even a couple symbols. In general, I mean all BPSK signals that can fit within my receiver bandwidth and for which I can capture many symbols worth of data in a reasonable time–all practical BPSK signals.
What I am getting at is the idea that the particular values of the symbol rate, carrier offset, power, and pulse shaping aren’t terribly important. Yes, if the power is made small enough, we won’t be able to detect or recognize the signal, nor estimate any of its parameters. Yes, if the carrier offset is too large, the signal will be distorted by the receiver filter. Yes, if the symbol rate is too large, the signal will not be fully captured by the receiver. What’s important is the inherent structure of the BPSK signal that supplies its “BPSK-ness” which is what allows us to distinguish it from all other signals.
But even with those caveats, ‘all BPSK signals’ is an awful lot of signals. Just consider all BPSK signals with a fixed carrier frequency offset, fixed pulse function, symbol rate that is any real number in the interval Hz, and a receiver with passband bandwidth of MHz. That’s an infinite number of BPSK signals. If you are uncomfortable with BPSK signals having irrational symbol rates (and I don’t blame you if you are), then I’ll back off and offer only the rational symbol rates in Hz. The set is still infinite. Countable, but infinite.
The power of a probabilistic approach to recognizing BPSK signals (My Papers [25, 26, 28, 38, 43, 44]) is that their probabilistic parameters (moments, cumulants, probability density functions [PDFs], etc.) all exhibit the same pattern. So if you can recognize the probability pattern, you can recognize the signal type, and thereby recognize all BPSK signals. The probability structure of the BPSK random process is what provides the distinguishing characteristic of BPSK-ness. It would be cool if a neural network could learn the probability structure from sampled data using training. Can a machine learn the BPSK PDF?
For clarity, and fun, here is a video that shows blind estimation of the cycle frequencies for BPSK signals with randomly chosen rates, carrier frequency offsets, excess bandwidths (SRRC roll-offs), and power levels. The cycle-frequency pattern is always the same: non-conjugate cycle frequencies are $k f_{sym}$ and conjugate cycle frequencies are $2 f_c + k f_{sym}$, where $f_{sym}$ is the symbol rate, $f_c$ is the carrier frequency offset, and $k \in \{-1, 0, 1\}$ for SRRC pulses.
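As a rough numerical illustration of that pattern (cycle frequencies tied to the symbol rate, and conjugate cycle frequencies shifted by twice the carrier offset), the sketch below builds a toy rectangular-pulse BPSK signal and evaluates the cyclic autocorrelation at a few candidate cycle frequencies. It is not the estimator used in the video. The function names, the rectangular pulse, the rate of 8 samples per symbol, the carrier offset of 0.05, and the half-symbol lag are all assumptions chosen for the example.

```python
import numpy as np

def make_rect_bpsk(num_symbols, sps, fc, seed=0):
    """Toy rectangular-pulse BPSK with carrier offset fc (cycles/sample)."""
    rng = np.random.default_rng(seed)
    symbols = rng.choice([-1.0, 1.0], size=num_symbols)
    baseband = np.repeat(symbols, sps)
    n = np.arange(baseband.size)
    return baseband * np.exp(2j * np.pi * fc * n)

def cyclic_autocorr(x, alpha, lag, conjugate=False):
    """Time-average estimate of the (conjugate) cyclic autocorrelation at one lag."""
    a = x[lag:]
    b = x[:x.size - lag]
    prod = a * b if conjugate else a * np.conj(b)
    n = np.arange(prod.size)
    return np.mean(prod * np.exp(-2j * np.pi * alpha * n))

sps, fc = 8, 0.05          # assumed samples per symbol and carrier offset
f_sym = 1.0 / sps
lag = sps // 2             # half-symbol lag gives a strong symbol-rate feature
x = make_rect_bpsk(4096, sps, fc)

# Non-conjugate: expect a feature at the symbol rate, nothing at an unrelated frequency.
nonconj_on = abs(cyclic_autocorr(x, f_sym, lag))
nonconj_off = abs(cyclic_autocorr(x, 0.1, lag))
# Conjugate: expect a feature at 2*fc + f_sym, nothing at f_sym alone.
conj_on = abs(cyclic_autocorr(x, 2 * fc + f_sym, lag, conjugate=True))
conj_off = abs(cyclic_autocorr(x, f_sym, lag, conjugate=True))
print(nonconj_on, nonconj_off, conj_on, conj_off)
```

Changing `sps` or `fc` moves the features but does not change the pattern, which is the sense in which the structure, rather than the particular parameter values, identifies the signal type.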
Signals Used in Machine-Learning Modulation-Recognition Papers
Let’s look at the scope of signals that the Machine Learners have been using in their papers.
[R138] Convolutional Radio Modulation Recognition Networks (O’Shea et al)
Let’s start with [R138] because all the other papers I consider here cite it. (See also my original criticism of this paper.)
The data set is associated with O’Shea and his company, DeepSig, and you can get the data set by following a link found on DeepSig’s website:
The question at hand is what are the signals and signal-parameters involved with RML2016.10a.tar.bz2? I believe this is the data set most often used by the papers cited later in this post. How many BPSK signals are in there?
In the authors’ various papers (The Literature [R100, R137, R138, R139, R140]), they mention that the digital signals have “roughly” eight samples per symbol, which means a symbol rate of “roughly” 1/8 in normalized Hz. So that’s one BPSK signal. Note that each signal example is only 128 samples long. It is difficult to do any kind of statistical cross-check on these signals to see if their cyclostationarity conforms to known cycle-frequency patterns and cyclic-cumulant magnitudes, so instead let’s try to infer as much as we can from PSDs.
The signal samples are stored compactly in a pickle file, which can be read most conveniently using Python. Here is the Python program I wrote to extract the signals from the archive so that I could take a look at them:
    import pickle
    import numpy as np

    # The archive was pickled under Python 2; latin1 decoding lets Python 3 read it.
    with open("RML2016.10a_dict.pkl", "rb") as f:
        Xd = pickle.load(f, encoding="latin1")
    # Keys are (modulation, SNR) tuples; collect the sorted unique values of each.
    snrs, mods = map(lambda j: sorted(list(set(map(lambda x: x[j], Xd.keys())))), [1, 0])
    for mod in mods:
        print("Modulation Type ", mod)
        for snr in snrs:
            print(" SNR ", snr)
            X = Xd[(mod, snr)]
            print(" Number of files ", X.shape[0])
            # for ind in range(X.shape[0]):
            for ind in range(100):
                Y = X[ind]
                # Row 0 holds the in-phase samples, row 1 the quadrature samples.
                Z = np.zeros((1, Y.shape[1]), dtype=complex)
                for c in range(Y.shape[1]):
                    Z[0, c] = complex(Y[0, c], Y[1, c])
                # Create a filename string from the mod type and the SNR.
                fn = 'rml_' + mod + '_' + str(snr) + '_' + str(ind) + '.tim'
                # Open a file for writing using the created string
                fn_fid = open(fn, "w")
                # Write the data in ASCII CMS format.
                dummy = str(2) + '\n'
                fn_fid.write(dummy)
                dummy = str(Y.shape[1]) + '\n'
                fn_fid.write(dummy)
                for c in range(Y.shape[1]):
                    dummy = str(Z[0, c].real) + ' '
                    fn_fid.write(dummy)
                    dummy = str(Z[0, c].imag) + '\n'
                    fn_fid.write(dummy)
                # Close the file.
                fn_fid.close()
I just extracted the first 100 files for each combination of signal type and SNR parameter. The description of the data set in R138 is:
with eleven signal types and a constant signal power of unity (0 dB),
so that SNR is modified by varying the noise power. The authors offer up some time- and frequency-domain views of one example for each of the eleven types using these tiny unit-less graphs:
I’ve remarked on the strangeness of these plots elsewhere, so here we’ll just take our own look at the data. To aid us mathematically, the authors provide the following signal model:
Equation (4) in R138 is inscrutable, but at least we learn that only one value of excess bandwidth is chosen (a practical value). In terms of BPSK signals, we have an SRRC BPSK signal with a single rate (roughly 1/8) and a single EBW. Let’s see if PSD plots bear out these values. The occupied bandwidth of an SRRC PSK signal with rate $f_{sym}$ and EBW $\beta$ is

$$B = (1 + \beta) f_{sym}. \qquad \text{(A)}$$
Here is a plot of the BPSK PSDs for the largest of the SNR parameter values ():
The occupied bandwidth (say, the -dB bandwidth) is at its largest about , so that checks out with (A) above. Some of the traces have narrower bandwidth, but a propagation channel is applied to each, so it is plausible that sometimes the bandwidth is narrower than nominal.
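The bandwidth check relies on the standard relation for SRRC-shaped PSK/QAM: the occupied bandwidth is the symbol rate times one plus the roll-off. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the roll-off values in the loop are illustrative placeholders, not the value used in the RML data set.

```python
def srrc_occupied_bandwidth(symbol_rate, rolloff):
    """Nominal occupied bandwidth of an SRRC-shaped PSK/QAM signal."""
    if not 0.0 <= rolloff <= 1.0:
        raise ValueError("rolloff must lie in [0, 1]")
    return (1.0 + rolloff) * symbol_rate

symbol_rate = 1.0 / 8.0            # "roughly" eight samples per symbol, normalized Hz
for rolloff in (0.0, 0.5, 1.0):    # illustrative roll-offs only
    print(rolloff, srrc_occupied_bandwidth(symbol_rate, rolloff))
```

For a fixed symbol rate of 1/8, the nominal bandwidth can therefore only range between 0.125 and 0.25 in normalized Hz, which is why PSD traces that are clearly wider than nominal are hard to reconcile with the stated signal model.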
There are two strange things, though, about these BPSK PSD traces. The first is the noise floor. For many of the PSD traces, the out-of-band energy has a very smooth appearance, implying there is no added noise. These are the traces with values above about dB. For the remainder, the out-of-band energy is erratic, like noise usually appears in these kinds of PSD plots, but the value of the noise is tens of decibels lower than the apparently noise-free cases. In other words, for this particular SNR parameter (), the out-of-band noise energy varies by dB.
The second thing is the existence of traces that don’t appear to conform to the stated signal model. You can see a couple signals result in PSDs that are both wider and taller than nominal. Moreover, they have a smooth bimodal appearance. I can see how the application of randomized channels might cause a boost in power, but not an increase in bandwidth. The SRRC signal has very little out-of-band energy to boost (mathematically zero).
For several of the SNR parameters, some of the PSD traces do not appear to be related to a BPSK signal. For example, here are the BPSK PSDs for the SNR parameter of :
I can’t grasp how the SNR parameter is causing the SNR to change. Here are four of the BPSK sequences, corresponding to SNR parameters and :
Going from SNR parameter to appears to decrease the SNR, and then going from to appears to increase the SNR much more than by dB.
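For reference, under the stated convention (unit signal power, with the SNR set only by the noise power), the relationship between an SNR parameter and the noise level is simple, and the noise floor should move by exactly the change in the SNR parameter. The sketch below shows what I would expect, assuming complex white Gaussian noise. The function names, the placeholder tone, and the SNR values in the loop are assumptions for illustration and are not taken from the data set.

```python
import numpy as np

def noise_power_for_snr(snr_db, signal_power=1.0):
    """Noise power that yields the requested SNR (dB) for a given signal power."""
    return signal_power / (10.0 ** (snr_db / 10.0))

def add_awgn(signal, snr_db, rng):
    """Add complex white Gaussian noise to a unit-power signal to hit the requested SNR."""
    scale = np.sqrt(noise_power_for_snr(snr_db) / 2.0)
    noise = scale * (rng.standard_normal(signal.size)
                     + 1j * rng.standard_normal(signal.size))
    return signal + noise

rng = np.random.default_rng(0)
signal = np.exp(2j * np.pi * 0.1 * np.arange(100_000))   # unit-power placeholder signal
for snr_db in (0, 2, 4):                                  # hypothetical SNR parameter values
    noisy = add_awgn(signal, snr_db, rng)
    measured = np.mean(np.abs(noisy - signal) ** 2)
    print(snr_db, 10 * np.log10(measured))                # noise power in dB, about -snr_db
```

With this convention, an increase in the SNR parameter can never lower the apparent SNR, which is what makes the behavior in the plots above so puzzling.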
Here are videos for each of the eleven signals, showing PSDs for each signal and SNR-parameter combination:
The QAM and PSK signal types all have the same basic PSD, and that PSD is consistent with a single stated symbol rate (‘roughly’ 1/8) and a single SRRC EBW (). They all have some weird outlier PSDs where there doesn’t seem to be a signal component and the energy is more-or-less even across the frequency band. Maybe I’m doing something wrong with the pickle file, but the outlier PSDs do not appear in any systematic way as the vectors are extracted from the file, so that’s hard to believe.
A couple other things are worth mentioning about this data set.
64-QAM. For 64-QAM, one of the SNR parameters results in what looks like two different noise-floor values:
But you’ll see from the 64-QAM movie above that this dual-noise-floor behavior doesn’t happen for the other values of the SNR parameter in the pickle file.
AM-DSB. For AM-DSB, the signal is essentially a sine wave. Double-sideband AM can be “suppressed carrier” or “transmitted carrier”, the former possessing no finite-strength sine-wave components, the latter possessing one with frequency equal to the carrier frequency and power level set by the AM modulation index (so there is a family of AM-DSB-TC signals indexed by the modulation-index parameter). Here is an example AM-DSB PSD from the RML data set that shows the presence of a tone and little else:
The tone appears rectangular in the PSD estimate because I’m using the FSM with a rectangular smoothing window.
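To see why a tone comes out rectangular, here is a minimal sketch of a frequency-smoothing method (FSM) PSD estimate: the periodogram is convolved with a rectangular window, so an impulse in frequency becomes a rectangle whose width equals the smoothing width. The function name, the circular edge handling, the 128-sample record, the tone frequency, and the 9-bin window are all assumptions for the example and are not necessarily the settings I used for the plots.

```python
import numpy as np

def fsm_psd(x, smooth_bins):
    """Periodogram smoothed in frequency with a rectangular window (odd width)."""
    if smooth_bins % 2 == 0:
        raise ValueError("use an odd smoothing width so the window is centered")
    n = x.size
    periodogram = np.abs(np.fft.fft(x)) ** 2 / n
    half = smooth_bins // 2
    # Wrap the periodogram so the smoothing is circular across the band edges.
    padded = np.pad(periodogram, half, mode="wrap")
    window = np.ones(smooth_bins) / smooth_bins
    smoothed = np.convolve(padded, window, mode="valid")
    freqs = np.fft.fftshift(np.fft.fftfreq(n))
    return freqs, np.fft.fftshift(smoothed)

n = np.arange(128)
tone = np.exp(2j * np.pi * (16 / 128) * n)    # noise-free tone on an FFT bin
freqs, psd = fsm_psd(tone, smooth_bins=9)
tone_width_bins = int(np.sum(psd > 0.5 * psd.max()))
print(tone_width_bins)
```

The flat-topped rectangle is an artifact of the estimator, not a property of the signal, so a rectangular feature in these PSD plots should be read as a sine-wave component.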
AM-SSB. There does not appear to be any signal component for any SNR parameter. Here is the graph for the largest SNR parameter of :
WBFM. The wideband FM signal type appears to be very narrowband, little more than a sine-wave in noise:
CPFSK. There are many variants of CPFSK, involving different choices for the modulation index, the alphabet size for the pulse-amplitude-modulated signal that drives the sinusoid’s phase, and the pulse function for that PAM signal, whether it is partial-response or full-response. A common choice is for a modulation index of 1/2, full-response rectangular pulse, and a binary PAM alphabet, which is mathematically equivalent to minimum-shift keying (MSK). But there are plenty of others. No information is provided about which one we have here, but there are obvious sidelobes, unlike for the SRRC signals,
although not all the traces show evidence of the sidelobes. The CPFSK signal here illustrates the strange behavior of the SNR parameter (see CPFSK movie above): For SNR parameters of and above, the peak of the signal’s PSD is at about dB and the noise floor is at about dB. I would characterize all of the CPFSK signals as ‘high SNR’ for SNR parameters of and above (this is more-or-less true for all the signal types).
I can’t be completely sure at this time that I’ve correctly extracted the various 128-sample signal snippets from the pickle file, but many aspects of the estimated PSDs align with the signal description in [R138]. The occupied bandwidths of the various PSK, PAM, and QAM signals are the same, and are consistent with a single symbol rate of 1/8 and SRRC excess bandwidth parameter (roll-off) of .
If you train your machine using this data set, you’ll have attempted to teach it to recognize one BPSK signal.
[R133] Automatic Modulation Classification of Cochannel Signals Using Deep Learning (Sun et al)
This paper tackles the problem of jointly recognizing each of two signals that share the same frequency band and overlap in time as well: the cochannel-signal case (see also My Papers [25,26,28]). Here is the stated signal model:
which is a great start. Note that the two PSK/QAM signals can have different power levels (), carrier frequencies (), carrier phases (), and constellations (). If we take the mathematical model in R133’s (1) seriously, though, the two signals have the same symbol-clock phase , and the same pulse-shaping function . The former seems unrealistic, the latter is plausible. The two signals also have exactly the same symbol rate .
Later, after all the neural-network description stuff, the signals used in the simulation are elaborated upon:
Note that there is no mention of the values of the two parameters and which control the power levels of the two signals, and therefore the signal-to-interference ratio (SIR). If you do the arithmetic suggested by the various parameter settings, you get
So we have to assume that the ratio is fixed over all the entries, and is probably unity. The SNR appears to be relative to the signal that is the sum of the two interfering signals.
Most importantly for our topic of all BPSK signals, ‘the baud rate of the signal is set to 40,000.’ Presumably this means that each of the two interfering signals has the same symbol rate of 40 kHz. The carrier frequency offset is said to be within Hz of kHz, meaning that the two interfering signals have pretty much the same carrier-frequency offset: it will be very difficult to resolve using their short signal segments. The symbol-clock phases of the two signals are random variables with small variance centered at samples, so they have highly similar, but not identical, symbol-clock phase parameters.
I don’t understand (failure of my impoverished imagination?) the physical setup that would lead to these parameter choices–especially the identical power levels. The range of possible inputs to the MR system is highly circumscribed–there is a lot of prior information to use in the creation of a non-ML algorithm, but there is no comparison to a non-ML algorithm in the paper.
So here, in R133, we have one BPSK signal.
[R134] Modulated Autocorrelation Convolution Networks for Automatic Modulation Classification Based on Small Sample Set (Zhang et al)
This next ML-MR paper also features a single BPSK signal. It also makes clear that the authors don’t know much about communication signals or signal theory, and aren’t afraid to let us know it.
“cycle-stationary moments” are used for signals with periodic components–you won’t find that assertion on the CSP Blog. Most MR work is focused on signals that don’t have periodic components, which includes most communication signals.
Where is this superior performance documented? I wonder about the “periodic representation of communication signals.”
Equation R134 (11) is not going to enlighten anyone: the left side is a function of some index , the right both a function of and time . I’m wondering where the symbol variable went to in (11) and how the inphase and quadrature amplitudes work–they both multiply everything.
So, once again, a single symbol rate is used ( kHz, or samples per symbol at a sample rate of MHz).
I can’t imagine how the FSK signals work with (11).
I couldn’t find a definition of in the paper; presumably it is the standard deviation for a random variable that models some aspect of a clock running in the SDRs used to transmit the signals. A normal distribution with zero mean and variance is used for both the time-domain parameter and the frequency . Hard to reconcile those two choices.
I’m declaring that there is one BPSK signal here.
[R135] End-to-End Learning from Spectrum Data: A Deep Learning Approach for Wireless Signal Identification in Spectrum Monitoring Applications (Kulin et al)
One BPSK signal.
[R136] Deep Learning Models for Wireless Signal Classification with Distributed Low-Cost Spectrum Sensors (Rajendran et al)
The next paper uses the simple but error-free model formulation shown in Equation (1) below:
The machine will be trained using the RML data set of R138 (see above), as well as a modified version that adds another symbol rate. I think Table I should list ‘8’ as the Samples per symbol parameter, so the modified version of the RML data set includes a second rate of .
The authors describe the data set that we analyzed for R138, although I am confident that the symbol rate is not in that data set.
They describe their motivation for including a second rate as that of “evaluating the sample-rate dependencies of the” ML model. I wonder why they think two rates (one half the other) are enough?
So here we’ve moved up to two BPSK signals.
[R137] Over the Air Deep Learning Based Radio Signal Classification (O’Shea et al)
This O’Shea paper started out promising, in the context of all BPSK signals, because the authors state in the Abstract that they want to consider the effects of symbol rate.
I searched through the paper, but could not find any description of how (or if) the symbol rate was varied. I came to the conclusion that it was not. (Leave a Comment below if I’m wrong!)
Equation (1) in R137 is OK. It doesn’t consider random processes (functions of time ), or include a multi-dimensional delay vector, but OK, these are what people use when they model their cyclostationary signals as stationary. But then the “cumulantss” are described, the fourth-order cumulant is called a moment, and then (2) contains a mysterious square-root operation. These digressions don’t get us any further along on deciding how many BPSK signals are considered in the paper, but I couldn’t resist pointing out the short shrift given to statistics.
OK! So we will have lots of BPSK signals because the roll-off is varied.
This data set is analyzed in a separate post. The symbol rate is not varied, as far as I can tell, but the carrier offset is allowed to be a random variable with normal distribution having variance . Two values are considered: and . The former is deemed moderate carrier offset, the latter minor carrier offset. I believe only a single symbol rate is used, although the discussion around this parameter is not clear. I believe the parameter in Table I is a symbol-clock phase parameter, else the uniform distribution on doesn’t make much sense.
So here in R137 we definitely have more than one BPSK signal. We have many: the excess bandwidth in the SRRC pulse is varied and there is a small random carrier-frequency offset.
Update February 2021. I estimated the PSDs for signals of the same type and with the largest SNR from the [R137] data set. I then plotted the PSDs for PSK signals having excess bandwidths (roll-offs) of and . Here is the result:
I conclude that the symbol rate, excess bandwidth, and carrier offset are not varied.
So, one BPSK signal after all. Longer data records would allow me (and you) to examine these signals in more detail, so as to verify the distributions in Table I.
[R139] Semi-Supervised Radio Signal Identification (O’Shea et al)
Another of O’Shea’s papers uses the RML data set.
So, one BPSK signal.
[R141] Interference Classification Using Deep Neural Networks (Yu et al)
Maybe I shouldn’t include this paper in the present post, because I’m not quite clear on whether or not there is a BPSK signal involved. Nevertheless, there are some lessons about applying CSP that are likely valuable to many of the readers of the CSP Blog. So let’s check it out.
In the Abstract, we see that the authors want to perform modulation recognition, but the signals of interest are interferers, not the signals involved in their communication link:
And they preview their result: CSP fails. Uh-oh. The data model is signal plus interferer plus noise:
I’m not sure why these assumptions are needed, but here they are:
There is apparently cochannel interference, but somehow also perfect synchronization has been achieved. Here is the universe of interferers:
Interferers 1, 3, and 4 are not cyclostationary signals. Interferer 2 is trivially cyclostationary, but can also be easily detected, characterized, and removed by linear (Fourier) methods. Interferer 5 could be any of the signals, I suppose, that we have studied at the CSP Blog, including BPSK. So that’s where a BPSK signal might be lurking.
The mathematics is a bit sloppy. Equation (3) is almost the periodogram, but you need the factor of . We’ve got several temporal parameters: , , and . I think should correspond to and the width of should correspond to . The cycle frequency should be to be consistent with convention. Finally, you can’t maximize the complex-valued quantity in (6); you need the magnitude.
OK, so the “cyclic spectrum” here is (6), and is what people (and the CSP Blog) usually call the cyclic-domain profile. Here are the authors’ plots for the CDPs for the various interferers:
This is confusing for a couple reasons. The y-axis is labeled “maximum cyclic spectral coherence,” but the authors haven’t defined or mentioned spectral coherence.
The plot for the signal of interest does not have any prominent peaks, yet the signal of interest is said to be an MPSK signal, which always has a non-conjugate CF equal to the symbol rate. Unless the signal is filtered and sampled at a rate less than or equal to the symbol rate! Or the raw symbol sequence is processed…
The non-conjugate CDP for the sine-wave signal should not have any peaks aside from the one corresponding to $\alpha = 0$. A complex-valued sine wave has only one non-conjugate CF (zero) and one conjugate CF (twice the sine-wave frequency). Here is a non-conjugate CDP for a sine-wave with frequency Hz:
and the conjugate CDP:
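The single-cycle-frequency claim for a complex sine wave is easy to check numerically. The sketch below is not the CDP computation used for the plots. It looks only at the zero-lag cyclic autocorrelation and its conjugate counterpart, evaluated on an FFT grid of candidate cycle frequencies. The record length and the tone frequency are hypothetical, chosen so the tone sits exactly on an FFT bin.

```python
import numpy as np

n_samples = 1024
f0 = 64 / n_samples                    # hypothetical tone frequency (0.0625, on a bin)
n = np.arange(n_samples)
x = np.exp(2j * np.pi * f0 * n)

alphas = np.fft.fftfreq(n_samples)     # candidate cycle frequencies
# Zero-lag cyclic autocorrelation magnitude versus alpha (non-conjugate).
nonconj = np.abs(np.fft.fft(x * np.conj(x))) / n_samples
# Zero-lag conjugate cyclic autocorrelation magnitude versus alpha.
conj = np.abs(np.fft.fft(x * x)) / n_samples

nonconj_cf = alphas[np.argmax(nonconj)]
conj_cf = alphas[np.argmax(conj)]
print(nonconj_cf, conj_cf)             # one feature each: alpha = 0 and alpha = 2*f0
```

Any additional peaks in a sine-wave CDP therefore point to a problem in the estimator or the plotting, not to structure in the signal.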
The CDP for the Unknown Modulated Signal is difficult to assess since the signal set is not described.
I also wonder why there are only points (or so) in the cycle-frequency dimension.
Moving on to the signal parameters, we notice that the signals are generated and then simply decimated by a factor of . This will introduce aliasing and make it difficult to understand any resulting CDP.
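A small sketch makes the aliasing concern concrete. It compares simple decimation (keep every D-th sample) with filter-then-decimate on a tone that lies outside the post-decimation band. The decimation factor of 4, the tone frequency, and the Hamming-windowed sinc filter are assumptions for the illustration. The paper's actual factor and signals are not reproduced here.

```python
import numpy as np

def lowpass_fir(cutoff, num_taps=101):
    """Hamming-windowed sinc lowpass filter; cutoff in cycles/sample."""
    m = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * m) * np.hamming(num_taps)
    return h / h.sum()                     # unity gain at zero frequency

def decimate(x, factor, antialias=True):
    """Keep every factor-th sample, optionally lowpass filtering first."""
    if antialias:
        x = np.convolve(x, lowpass_fir(0.4 / factor), mode="same")
    return x[::factor]

n = np.arange(4096)
tone = np.exp(2j * np.pi * 0.3 * n)        # outside the new band (|f| < 0.125) for factor 4
naive = decimate(tone, 4, antialias=False)
filtered = decimate(tone, 4, antialias=True)

naive_power = np.mean(np.abs(naive) ** 2)
filtered_power = np.mean(np.abs(filtered) ** 2)
alias_freq = np.fft.fftfreq(naive.size)[np.argmax(np.abs(np.fft.fft(naive)))]
print(naive_power, filtered_power, alias_freq)
```

Without the filter, the out-of-band tone survives at full power and shows up at a new, wrong frequency (near 0.2 cycles per decimated sample in this example). The same folding happens to any out-of-band signal energy and noise, which then contaminates the CDP.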
After the machine is trained using all these downsampled signals, the conclusion is that the PSD works best:
which seems reasonable to me. There are hardly any cyclostationary signals here.
R141 is perhaps an extreme example, but it illustrates the trend: Quickly get through signal definition, analysis, and generation, then throw whatever you have into the machine. If the labels come out to your liking, publish. If they don’t, iterate. The relationship between the output-label performance and the truth of the input labels doesn’t really matter, because no one can look at the trained neural network and say what it has used to make its decisions.
Due to the Unknown Modulated Signal mystery, I’m going with an imaginary number of BPSK signals for R141.
[R143] Fast Deep Learning for Automatic Modulation Classification (Ramjee et al)
Finally, R143 uses the RML data set, which contains PSK, QAM, AM, FM, and CPFSK signals, but posits a signal model that is valid only for the PSK and QAM signals. So, one BPSK signal here.
What’s Going on Here?
The researchers are mostly focused on very narrow problems (“one BPSK signal”) for two reasons. The first is that the machines have enormous appetites and if you consider varying several parameters (modulation type, symbol rate, carrier offset, SNR, pulse shaping, etc.) you end up with an impractical training set. The topic is still modulation recognition, so you can’t reduce the number of modulation types too much, but you can select one value of all the other parameters. You still have a data-set size problem, because you have to generate many examples of the signal by using different random transmitted symbol sequences.
The second reason is that these researchers don’t know much signal theory and they don’t know much about the details of communication signals and systems. This is evidenced by the various strange and mangled mathematical expressions for signals and for features. I think they want to get the training set creation over with as soon as possible so they can hand-craft their machine hyperparameters.
A second serious issue is that posted data sets do not appear to be carefully vetted by their producers. I know this is a tedious task. I had a small problem with the set I posted for the Machine Learner Challenge; a kind CSP Blog reader pointed it out. I was able to describe the issue in the Challenge data post, and it is minor enough that I didn’t need to pull the data set from the Downloads page. But I will pull it if serious problems are found.
A final thought. I’ve uncovered several major problems with the RML data set (The Literature [R138]) in this post. I’ve also documented here that multiple ML researchers are relying on that data set to train and test their algorithms. If one uses a flawed data set for ML and the recognition performance is good, what does that mean about the ML algorithm? It must mean that, at least some of the time, the machine is using idiosyncrasies or flaws in the data set as valid classification features. This makes it vital that the data sets are carefully vetted. A major obstacle to vetting the RML data set is the extreme shortness of the data vectors (128 samples). This prevents a statistical analysis of the vectors. That is, if the BPSK signals in the data set were longer, I could look at their spectral correlation functions and higher-order cyclic cumulants and verify that they have the BPSK-ness properties we’ve established mathematically. And longer data vectors in the data set would not prevent ML researchers from using short segments: just use successive chunks of samples from each vector.
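Splitting a long record into short training vectors is a one-liner, which is why long posted vectors cost the Machine Learners nothing. A minimal sketch, assuming non-overlapping chunks and a hypothetical helper name:

```python
import numpy as np

def successive_chunks(x, chunk_len=128):
    """Split a long record into non-overlapping chunks, dropping any partial tail."""
    num_chunks = x.size // chunk_len
    return x[:num_chunks * chunk_len].reshape(num_chunks, chunk_len)

long_record = np.arange(1000, dtype=complex)   # stand-in for a long signal vector
chunks = successive_chunks(long_record)
print(chunks.shape)
```

The full-length record stays available for vetting with spectral correlation and cyclic cumulants, while each row of `chunks` can be fed to the network as a short input.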
“Garbage-In => Garbage-Out” might not apply to ML MR. It doesn’t matter to ML recognition performance whether the input vectors adhere to the established mathematical models of real-world communication signal types. All that matters is that there are sufficiently many differences between the vectors for each class. The machine will find them, because it doesn’t care if the BPSK vectors are consistent with BPSK-ness, unlike MR methods that are based on probability models. It just cares if the BPSK vectors are different in any consistently measurable way from the vectors in the other classes. But then when that machine is applied to a different data set, with different idiosyncrasies or none, poor performance will result.
Update June 8, 2020
The new paper The Literature [R146] also uses the RML 2016a data set:
The description of the RML 2016.10a data set is consistent with DeepSig’s description and my understanding of it. I hope that the researchers applied an antialiasing filter prior to downsampling, else they will alias a lot of noise into the center of their band prior to ML processing.
For the highest RML SNR parameter of dB, the following confusion matrix is presented:
It is a bit hard to see, but the true (input) labels are the rows and the ML assigned signal-type labels are the columns; the third row and column correspond to AM-SSB. So, most of the time the machine assigns the ‘SSB’ label to inputs that are labeled ‘SSB.’ But remember that there is no signal component in the SSB-labeled signals in the RML data set. To verify, I looked at the first SSB signals, instead of the first 100 that I did earlier in this post:
So this particular machine learns to recognize noise as AM-SSB. It doesn’t matter that the signal isn’t there because ‘AM-SSB’ness doesn’t matter to the machine. But this idea can extend to all other elements of the confusion matrix. How much of any of them is due to the character of the signal and how much is due to idiosyncrasies of the training/testing data sets that are unintentionally introduced by the data-set creator?
This unfortunate behavior for AM-SSB could have been detected by the researchers if they also included a twelfth labeled input: AWGN.
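Adding such a class is cheap. The sketch below generates noise-only examples in the same two-row (I and Q) by 128-sample layout that the extraction program above reads from the pickle file. The function name, the label string, the example count, and the unit noise power are hypothetical choices for illustration.

```python
import numpy as np

def make_awgn_examples(num_examples, num_samples=128, noise_power=1.0, seed=0):
    """Noise-only examples shaped (example, I/Q row, sample)."""
    rng = np.random.default_rng(seed)
    # Split the complex noise power equally between the I and Q rows.
    scale = np.sqrt(noise_power / 2.0)
    return scale * rng.standard_normal((num_examples, 2, num_samples))

awgn = make_awgn_examples(1000)
awgn_label = "AWGN"                                    # hypothetical twelfth class label
measured_power = np.mean(awgn[:, 0, :] ** 2 + awgn[:, 1, :] ** 2)
print(awgn.shape, measured_power)
```

If a trained machine cannot separate these vectors from the ones labeled AM-SSB, that is a strong hint that the AM-SSB class contains no signal.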
Comments, corrections, recommendations, criticisms are welcome. Enter them below.