Quality datasets containing digital signals with varied parameters and lengths sufficient to permit many kinds of validation checks by signal-processing experts remain in short supply. In this post, we continue our efforts to provide such datasets by offering a companion unlabeled dataset to CSPB.ML.2023.
CSPB.ML.2023 is a two-part dataset with 120,000 binary data files. The first 60,000 are single-signal files and the last 60,000 are two-signal files created by combining pairs of the single-signal files. This means that many of the two-signal files contain cochannel signals, but not all: in some cases the two combined signals are narrowband enough, and have sufficiently different carrier-frequency offsets, that they do not overlap in frequency. All metadata for the CSPB.ML.2023 data is provided in the original post, facilitating supervised machine learning.
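The no-overlap condition can be made concrete. Here is a minimal sketch in Python; the function name and the normalized-frequency convention are illustrative choices of mine, not part of the dataset definition:

```python
def bands_overlap(cfo1, bw1, cfo2, bw2):
    """Return True if the occupied bands [cfo - bw/2, cfo + bw/2] of two
    signals intersect. Each cfo is a normalized carrier-frequency offset
    and each bw is an occupied bandwidth (both in cycles/sample)."""
    return abs(cfo1 - cfo2) < (bw1 + bw2) / 2.0

# Narrowband signals with well-separated offsets: not cochannel
print(bands_overlap(0.15, 0.05, -0.15, 0.05))  # False
# Wider signals at the same offsets do overlap: cochannel
print(bands_overlap(0.15, 0.4, -0.15, 0.4))    # True
```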
The new dataset is CSPB.ML.2023G1 and it is aimed at, yet again, facilitating generalization studies. Therefore the labels and all metadata are withheld. The concept is that if you think you have a neural network or other supervised-learning structure that is well trained on CSPB.ML.2023 (or any other digital-signal dataset), AND you think that your structure has a high degree of generalization, you can test it using CSPB.ML.2023G1.
CSPB.ML.2023G1 is different from CSPB.ML.2023 in that the random-number generator is seeded differently and the probability density function for the carrier-frequency offset parameter is altered. The signal types are the same as in CSPB.ML.2023: BPSK, QPSK, 8PSK, 16QAM, 64QAM, SQPSK, MSK, and GMSK. So three PSKs, two square-constellation QAMs, and three offset-QPSK variants.
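For readers less familiar with the carrier-frequency-offset parameter, here is a hedged sketch of how it enters the complex-baseband signal model. The uniform density and its range below are placeholders only; the actual altered density used for CSPB.ML.2023G1 is withheld on purpose:

```python
import numpy as np

rng = np.random.default_rng(2023)  # seed choice is arbitrary in this sketch

def apply_cfo(x, f0):
    """Apply a normalized carrier-frequency offset f0 (cycles/sample)
    to a complex-baseband signal x by multiplying by a complex sinusoid."""
    n = np.arange(len(x))
    return x * np.exp(2j * np.pi * f0 * n)

# Draw the offset from some density -- uniform here is a placeholder choice,
# not the dataset's actual pdf
f0 = rng.uniform(-0.25, 0.25)
y = apply_cfo(np.ones(16, dtype=complex), f0)
```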
A selection of estimated power spectral density plots for the new dataset is shown in Video 1.
A CSP-based blind modulation-recognition and parameter-estimation system (My Papers [25,26,28,43]) is applied, as it was for the CSPB.ML.{2018,2022,2023} datasets, to provide a non-machine-learning performance data point. The results for the 60,000 single-signal files in CSPB.ML.2023G1 are summarized in Figure 1.
The CSP-based results for the two-signal portion of the dataset are shown in Figure 2. These are close to the same results as for CSPB.ML.2023, which we would expect because the CSP processing is invariant to the particulars of the density functions for symbol rate, carrier frequency offset, excess bandwidth, etc. (Compare to Figure 4 in the CSPB.ML.2023 post.)
Dataset Zip Files
First 5000 single-signal files, in five batches:
First 5000 two-signal files, in five batches:
The remaining files are available by request in the comments below. See the original CSPB.ML.2023 post for details on the data-file format and a link to a MATLAB file reader.
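For readers who would rather not use MATLAB, the following reader sketch may help. It assumes each file stores interleaved real/imaginary 32-bit little-endian floats; that assumption is mine, so verify it against the format description in the original CSPB.ML.2023 post before trusting the output:

```python
import numpy as np

def read_signal_file(path):
    """Read one binary signal file as a complex vector.
    ASSUMPTION: samples are stored as interleaved real/imag float32,
    little-endian. Check the CSPB.ML.2023 post's format notes first."""
    raw = np.fromfile(path, dtype="<f4")
    return raw[0::2] + 1j * raw[1::2]
```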
Should you apply a modulation recognizer to this data and want to know how you did, contact me in the comments.
Hey Chad!
Question for you. Do you ever plan to and/or see value in publishing a synthetic dataset that adds multipath effects? Correct me if I am wrong, but the CSPB.ML datasets are all multipath-free, right?
One challenge that arises is that to achieve good coverage over all scenarios, the parameter space explodes. Not only do we already want sufficient examples of each modulation, but there are comparably many (maybe more?) channel realizations one could use if you vary delay spread, Doppler spread, coherence time, tap-delay-line distributions… what else? Regardless, it is a lot of data.
There is also the argument to evaluate against “hardware impairments” that people (machine learners in particular) love to cite but rarely define. I think you have exited synthetic data territory when you hit that level of fidelity, but just my opinion.
Curious what your thoughts are.
Cheers!
Stephan
Yes, all CSPB.ML.* datasets posted so far are free from channel effects. That is, they all correspond to a propagation channel whose impulse response is a single impulse, and the noise is always additive white Gaussian noise.
I do see value in creating, posting, and using a dataset that includes the parameters I already vary (excess bandwidth, the bit sequence, the carrier-frequency offset, the symbol rate) but that also includes a non-trivial random propagation channel. I would likely use a discrete multipath channel model, where I vary the number of rays, the complex amplitude of each ray, and the delay spread of the rays.
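A minimal version of that discrete multipath model might look like the sketch below. The number of rays, the gain distribution, and the delay-spread bound are illustrative choices of mine, not a committed dataset specification:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for the sketch

def random_multipath_channel(max_rays=4, max_delay=20):
    """Draw a random discrete multipath channel impulse response:
    a random number of rays, each with a random complex gain and an
    integer sample delay bounded by max_delay (the delay spread)."""
    n_rays = int(rng.integers(2, max_rays + 1))
    delays = rng.integers(0, max_delay + 1, size=n_rays)
    delays[0] = 0  # anchor the first ray at zero delay
    gains = (rng.standard_normal(n_rays)
             + 1j * rng.standard_normal(n_rays)) / np.sqrt(2)
    h = np.zeros(int(delays.max()) + 1, dtype=complex)
    np.add.at(h, delays, gains)  # coincident delays add coherently
    return h

def apply_channel(x, h):
    """Convolve the signal with the channel, truncated to the input length."""
    return np.convolve(x, h)[: len(x)]
```

Passing a unit impulse through `apply_channel` returns the impulse response itself, which is a quick sanity check on the convolution.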
I fully agree that unless the random variables controlling those discrete-multipath parameters are severely limited in scope (e.g., limited to two rays), then creating a reasonably sized dataset that also fully samples those random variables will be a computational and storage challenge (nightmare?). So that’s one reason I haven’t done it yet.
Another reason is more personal. My current modulation-recognition algorithms and their software implementations do not explicitly handle multipath, and classification performance will suffer in the presence of significant frequency-selective effects. However, I do have a mathematical (CSP) approach to joint channel estimation and modulation recognition, and it should be able to handle cochannel situations. That is, it will do parameter estimation and joint modulation recognition for two cochannel signals that experience frequency-selective channel effects. I’ve been dragging my feet on generating and posting the kinds of datasets you have brought up because when I do, I want to also post results for an SP/CSP approach for the Learners to compare to. But I don’t have that fully implemented.
Regarding hardware impairments, I think that is a bit of a distraction or some kind of feint. I capture a lot of data from Ettus radios as well as from embedded ADC boards, and I receive much more data from various customers, partners, and randos. For the vast majority of these datasets, there is no problematic or even discernible hardware impairment. I can tell because my systems do more than CSP: they can blindly extract digital-QAM/PSK constellations. These often approach textbook quality, which cannot happen if there is an unequalized frequency-selective channel present, or significant hardware impairments in either the transmitter or receiver (such as IQ imbalance or bad phase noise).
By far the most common “impairment” I face as a data analyst, algorithm developer, and SP/CSP software developer is irregular noise floors. So my systems employ automatic spectral segmentation algorithms that feature a lot of special modes to handle noise-floor tilts, divots, ripples, pedestals, etc. But don’t take that as “Chad has solved it all”; the irregular-noise-floor problem is ever evolving and always challenging, and I just try to adapt.
I’ve been fully focused on providing datasets that push SP and ML mod-rec toward the more difficult problems (e.g., cochannel) but also that push on the pain points of the ML approach, specifically generalization. Until we can make SP/CSP and/or ML algorithms that generalize well (and we can with SP/CSP, actually), there isn’t a need to crank up the realism with channels. If you can’t solve the simpler problems involving tiny changes in some of the governing random variables, you can’t solve the larger problems that vastly increase the number of involved random variables.