One Last Time …

We take a quick look at a fourth DeepSig dataset called 2016.04C.multisnr.tar.bz2 in the context of the data-shift problem in machine learning.

And if we get this right,

We’re gonna teach ’em how to say

Goodbye …

You and I.

Lin-Manuel Miranda, “One Last Time,” Hamilton

I didn’t expect to have to do this, but I am going to analyze yet another DeepSig dataset. One last time. This one is called 2016.04C.multisnr.tar.bz2, and is described thusly on the DeepSig website:

Figure 1. Description of various DeepSig data sets found on the DeepSig website as of November 2021.

I’ve analyzed the 2018 dataset here, the RML2016.10b.tar.bz2 dataset here, and the RML2016.10a.tar.bz2 dataset here.

Now I’ve come across a manuscript-in-review in which both the RML2016.10a and RML2016.04c datasets are used. The idea is that these two datasets are sufficiently distinct to be good candidates for use in a data-shift study involving trained neural-network modulation-recognition systems.

The data-shift problem is, as one researcher puts it:

Data shift or data drift, concept shift, changing environments, data fractures are all similar terms that describe the same phenomenon: the different distribution of data between train and test sets

Georgios Sarantitis

But … are they really all that different?

The Dataset

Our first clue that these aren’t actually good datasets for a data-shift study is that DeepSig says they are quite similar. The 10a and 10b datasets are ‘cleaner and more normalized’ versions of the 4C dataset, and they are meant to ‘supersede’ the 4C dataset, not complement it. But let’s take a look anyway.

The 4C dataset is again in pickle form, and I read the I/Q samples out using a slightly modified version of the Python program I used for the other datasets. In this post, I only look at 100 signal instances for each combination of modulation-type label and SNR label. The two kinds of labels are read directly from the pickle file–they are DeepSig’s labels.

The dataset contains 11 signal-type labels: BPSK, QPSK, 8PSK, CPFSK, GFSK, AM-DSB, AM-SSB, QAM16, QAM64, PAM4, and WBFM. And as before, each is associated with SNR labels that range from -20 to +18 in steps of 2. Each signal instance consists of 128 complex samples (in-phase and quadrature pairs).
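
For reference, here is a minimal sketch of how the I/Q samples can be pulled out of the pickle, assuming the 4C file uses the same dictionary layout as the other 2016 DeepSig pickles: keys are (modulation, SNR) tuples and the values are arrays of shape (num_instances, 2, 128) holding the I and Q rows. The filename and the latin1 encoding (the 2016 pickles were written with Python 2) are assumptions here, not guarantees.

```python
import pickle

# Placeholder filename for the pickle extracted from 2016.04C.multisnr.tar.bz2.
with open('2016.04C.multisnr.pkl', 'rb') as f:
    data = pickle.load(f, encoding='latin1')   # latin1: pickle written by Python 2

mods = sorted({mod for (mod, snr) in data})    # the 11 modulation-type labels
snrs = sorted({snr for (mod, snr) in data})    # SNR labels, -20 to +18 in steps of 2

# Keep the first 100 instances per (modulation, SNR) pair and form complex
# signals from the I (row 0) and Q (row 1) samples.
signals = {}
for (mod, snr), arr in data.items():
    signals[(mod, snr)] = arr[:100, 0, :] + 1j * arr[:100, 1, :]
```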

Power Spectra

We see the same kinds of problems as in the other 2016 datasets: lots of instances that appear to be just noise, and no discernible signal component for the SSB signal type at any SNR label. There does not appear to be a coherent way to understand the SNR label in terms of measurable total or inband SNRs. But overall the dataset is similar to the others and will not make a good data-shift complement to 10a or 10b. Moreover, the dataset is flawed and should not be used in any study (SSB is missing!).
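
To make concrete what I mean by a measurable inband SNR, here is a minimal sketch of the kind of estimate one might form from a single 128-sample instance. The band edges are an assumption (in practice they would come from inspecting the PSD), and the out-of-band bins are treated as noise-only.

```python
import numpy as np

def inband_snr_db(iq, band=(-0.25, 0.25)):
    """Crude inband SNR estimate: out-of-band periodogram bins are taken as
    noise-only, and that noise level is subtracted from the inband power."""
    N = len(iq)
    freqs = np.fft.fftshift(np.fft.fftfreq(N))                 # normalized frequency
    pgram = np.abs(np.fft.fftshift(np.fft.fft(iq))) ** 2 / N   # periodogram
    inband = (freqs >= band[0]) & (freqs <= band[1])
    noise_per_bin = np.mean(pgram[~inband])
    noise_inband = noise_per_bin * np.count_nonzero(inband)
    signal_inband = max(np.sum(pgram[inband]) - noise_inband, 0.0)
    return 10.0 * np.log10(signal_inband / (noise_inband + 1e-12) + 1e-12)
```

Estimates along these lines, applied label by label, are one way to check whether the stored SNR labels correspond to anything measurable; for this dataset they do not appear to.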

Here are 100 PSD estimates for BPSK and each of the SNR labels found in the pickle file:

Video 1: 100 PSD estimates for BPSK for each encountered SNR label in the 4C pickle file.

As in the 2016-A data set, there is no signal component to the AM-SSB signal instances. Even for the largest SNR label of 18, the PSDs are plainly just noise.

Figure 2. 100 PSDs for the AM-SSB signal-type label and the largest SNR label of 18. No signal component is evident and all the other AM-SSB PSDs for the remaining SNR labels look just like these.

The AM-DSB and WBFM signal types are simply very narrowband signals, as evidenced by their PSDs, which are near-perfect rectangles. I’m using the frequency-smoothing method (FSM) of spectrum estimation, together with a rectangular smoothing window, so that any impulse in the PSD will look just like a rectangle (here with width 0.1).
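
For concreteness, here is a minimal sketch of an FSM-style PSD estimate, assuming the instance is a complex NumPy array; it illustrates the method but is not the exact code behind the plots.

```python
import numpy as np

def fsm_psd_db(x, window_width=0.1):
    """Frequency-smoothing-method PSD estimate: a periodogram convolved with a
    rectangular smoothing window of width `window_width` in normalized frequency."""
    N = len(x)
    pgram = np.abs(np.fft.fftshift(np.fft.fft(x))) ** 2 / N    # periodogram
    M = max(1, int(round(window_width * N)))                   # window length in bins
    window = np.ones(M) / M                                    # rectangular, unit area
    return 10.0 * np.log10(np.convolve(pgram, window, mode='same') + 1e-12)
```

With a smoothing width of 0.1, any spectral component much narrower than the window is smeared into a rectangle of width 0.1, which is the shape seen in Figures 3 and 4.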

Figure 3. 100 PSD estimates for the AM-DSB signal label and the largest SNR label of 18.
Figure 4. 100 PSD estimates for the WBFM signal label and the largest SNR label of 18.

It is unclear whether any noise is actually added to the CPFSK signal for the SNR label of 18: the out-of-band spectral components vary by tens of dB from instance to instance, as in Figure 5, whereas a common added-noise floor would fix the out-of-band level.

Figure 5. 100 PSD estimates for the CPFSK signal label and the SNR label of 18.

The other signal labels produce various curious PSD estimates–you can see all of the PSDs I generated by viewing the movies at the end of this post.

Comparison to 10a and 10b

Our primary purpose here is to assess whether the 4C dataset is sufficiently different from the A (10a) or B (10b) dataset to be useful in a data-shift study that pairs it with one of them. So let’s take a look at some PSDs for each of the three datasets side-by-side.

Figure 6. 100 PSDs for each data set and the signal label BPSK for the SNR label of 18.
Figure 7. 100 PSDs for each data set for the signal label AM-DSB and the SNR label of 18.
Figure 8. 100 PSDs for each data set for the signal label CPFSK and the SNR label of 18.
Figure 9. 100 PSDs for each data set for the signal label CPFSK and the SNR label of 18.

I created many more of these three-way PSD comparison plots. You can find them in a zip archive on the Downloads page.

Discussion

Comparing the PSDs for a particular signal label across the three datasets, it appears that the signal parameters do not vary significantly from one dataset to the next. I see no evidence of differences in symbol rate, carrier offset, pulse type, or pulse roll-off. The set of considered modulation types is identical, except that the SSB signal label is not present in dataset B, and the SSB signal has zero power in datasets A and 4C.

We could do a more complete statistical (CSP) analysis if the signal instances were significantly longer than their length of 128 samples.

So if you take the datasets ‘as is,’ they are not good candidates for a data-shift machine-learning study–the probability distributions of the underlying random variables appear to be too similar. However, one might convert one of the datasets into a more suitable dataset by taking the higher-SNR elements, performing filtering, frequency-shifting, and resampling operations, and then adding noise to recreate the SNR range of the original datasets. This does not remove the fundamental problem that the datasets all originate from the same researchers, and so will likely jointly possess whatever idiosyncrasies those researchers have wittingly or unwittingly introduced through their signal-generation process.
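
To be concrete about the kind of conversion I mean, here is a minimal sketch that frequency-shifts, resamples, and re-noises a single high-SNR instance; all parameter values are arbitrary illustrations, and the code assumes the instance is a complex NumPy array whose existing noise is negligible.

```python
import numpy as np
from scipy.signal import resample_poly

def shift_resample_renoise(iq, freq_offset=0.05, up=3, down=4,
                           target_snr_db=10.0, rng=None):
    """Frequency-shift by `freq_offset` cycles/sample, resample by up/down,
    then add complex white Gaussian noise to reach `target_snr_db`."""
    rng = np.random.default_rng() if rng is None else rng
    n = np.arange(len(iq))
    shifted = iq * np.exp(2j * np.pi * freq_offset * n)         # carrier offset
    resampled = resample_poly(shifted, up, down)                # new sampling rate
    sig_power = np.mean(np.abs(resampled) ** 2)
    noise_power = sig_power / 10.0 ** (target_snr_db / 10.0)
    noise = np.sqrt(noise_power / 2.0) * (rng.standard_normal(resampled.shape)
                                          + 1j * rng.standard_normal(resampled.shape))
    return resampled + noise
```

A band-limiting filter and a trim back to 128 samples would also be needed to match the original format, and even then the common-origin caveat above still applies.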

Or a data-shift (generalization, in machine-learning terms) researcher could compare one of these datasets to one of their own creation, or try pairing one of these with a different publicly available dataset.

PSD Videos

Video 2. PSDs for the QPSK signal label.
Video 3. PSDs for the WBFM signal label.
Video 4. PSDs for the QAM64 signal label.
Video 5. PSDs for the QAM16 signal label.
Video 6. PSDs for the PAM4 signal label.
Video 7. PSDs for the GFSK signal label.
Video 8. PSDs for the CPFSK signal label.
Video 9. PSDs for the AM-SSB signal label.
Video 10. PSDs for the AM-DSB signal label.
Video 11. PSDs for the 8PSK signal label.
