CSPB.ML.2023G1

Another dataset aimed at the continuing problem of generalization in machine-learning-based modulation recognition. This one is a companion to CSPB.ML.2023, which features cochannel situations.

Quality datasets containing digital signals with varied parameters and lengths sufficient to permit many kinds of validation checks by signal-processing experts remain in short supply. In this post, we continue our efforts to provide such datasets by offering a companion unlabeled dataset to CSPB.ML.2023.

CSPB.ML.2023 is a two-part dataset with 120,000 binary data files. The first 60,000 are single-signal files and the last 60,000 are two-signal files created by combining pairs of the single-signal files. This means that many of the two-signal files contain cochannel signals, but not all: in some cases the two combined signals are narrowband enough, and their carrier-frequency offsets different enough, that they do not overlap in frequency. All metadata for the CSPB.ML.2023 data is provided in the original post, facilitating supervised machine learning.
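If the two-signal records are formed by simple complex-sample addition of two single-signal records (an assumption here; the exact combining recipe is documented in the CSPB.ML.2023 post), one quick way to judge whether a given pair is cochannel is to compare the occupied bands of the two constituents. A minimal MATLAB sketch, assuming x1 and x2 are complex baseband vectors loaded with the reader linked in the CSPB.ML.2023 post and that a 99%-power band is an adequate definition of occupancy:

```matlab
% Hypothetical overlap check for a candidate signal pair (x1, x2).
% The normalized sampling rate of 1 and the 99%-power band definition
% are illustrative assumptions, not dataset specifications.
nfft = 4096;
[P1, f] = pwelch(x1, hann(nfft), nfft/2, nfft, 1, 'centered');
[P2, ~] = pwelch(x2, hann(nfft), nfft/2, nfft, 1, 'centered');

c1 = cumsum(P1)/sum(P1);                     % normalized cumulative power
c2 = cumsum(P2)/sum(P2);
lo1 = f(find(c1 >= 0.005, 1));  hi1 = f(find(c1 >= 0.995, 1));
lo2 = f(find(c2 >= 0.005, 1));  hi2 = f(find(c2 >= 0.995, 1));

isCochannel = (lo1 <= hi2) && (lo2 <= hi1);  % do the occupied bands overlap?
```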

The new dataset is CSPB.ML.2023G1 and it is aimed at, yet again, facilitating generalization studies. Therefore the labels and all metadata are withheld. The concept is that if you think you have a neural network or other supervised-learning structure that is well trained on CSPB.ML.2023 (or any other digital-signal dataset), AND you think that your structure has a high degree of generalization, you can test it using CSPB.ML.2023G1.

CSPB.ML.2023G1 is different from CSPB.ML.2023 in that the random-number generator is seeded differently and the probability density function for the carrier-frequency offset parameter is altered. The signal types are the same as in CSPB.ML.2023: BPSK, QPSK, 8PSK, 16QAM, 64QAM, SQPSK, MSK, and GMSK. So three PSKs, two square-constellation QAMs, and three offset-QPSK variants.

A selection of estimated power spectral density plots for the new dataset is shown in Video 1.

Video 1. Every thousandth PSD for CSPB.ML.2023G1. The first half of the video shows single-signal PSDs and the second half shows the two-signal PSDs, which often contain cochannel signals.
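For readers who want to reproduce frames like those in the video, a minimal sketch of a single PSD estimate follows. It assumes the record has already been loaded into a complex column vector x with the MATLAB reader linked in the CSPB.ML.2023 post; the window length, overlap, and normalized sampling rate are illustrative choices, not the settings used to make Video 1.

```matlab
% Sketch: Welch PSD estimate for one CSPB.ML.2023G1 record (complex vector x).
nfft = 8192;
[Pxx, f] = pwelch(x, hann(nfft), nfft/2, nfft, 1, 'centered');
plot(f, 10*log10(Pxx));
xlabel('Normalized Frequency (cycles/sample)');
ylabel('Estimated PSD (dB)');
grid on;
```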

A CSP-based blind modulation-recognition and parameter-estimation system (My Papers [25,26,28,43]) is applied, as it was for the CSPB.ML.{2018,2022,2023} datasets, to provide a non-machine-learning performance data point. The results for the 60,000 single-signal files in CSPB.ML.2023G1 are summarized in Figure 1.

Figure 1. CSP-based modulation recognition and parameter estimation results for the single-signal portion of CSPB.ML.2023G1. Here T denotes the number of processed samples per member of the dataset.

The CSP-based results for the two-signal portion of the dataset are shown in Figure 2. These are close to the same results as for CSPB.ML.2023, which we would expect because the CSP processing is invariant to the particulars of the density functions for symbol rate, carrier frequency offset, excess bandwidth, etc. (Compare to Figure 4 in the CSPB.ML.2023 post.)

Figure 2. CSP-based modulation recognition and parameter estimation results for the two-signal members of the CSPB.ML.2023G1 dataset. Note that the performance here is inferior to that for the single-signal case of Figure 1 (due to much lower SINR) but that performance increases monotonically with increasing processing length T.

Dataset Zip Files

First 5000 single-signal files, in five batches:

CSPB.ML.2023G1 Batch 1.

CSPB.ML.2023G1 Batch 2.

CSPB.ML.2023G1 Batch 3.

CSPB.ML.2023G1 Batch 4.

CSPB.ML.2023G1 Batch 5.

First 5000 two-signal files, in five batches:

CSPB.ML.2023G1 Batch 61.

CSPB.ML.2023G1 Batch 62.

CSPB.ML.2023G1 Batch 63.

CSPB.ML.2023G1 Batch 64.

CSPB.ML.2023G1 Batch 65.

The remaining files are available by request in the comments below. See the original CSPB.ML.2023 post for details on the data-file format and a link to a MATLAB file reader.

Should you apply a modulation recognizer to this data and want to know how you did, contact me in the comments.

Author: Chad Spooner

I'm a signal processing researcher specializing in cyclostationary signal processing (CSP) for communication signals. I hope to use this blog to help others with their cyclo-projects and to learn more about how CSP is being used and extended worldwide.

2 thoughts on “CSPB.ML.2023G1”

  1. Hey Chad!

    Question for you. Do you ever plan to publish, and/or see value in publishing, a synthetic dataset that adds multipath effects? Correct me if I am wrong, but the CSPB.ML datasets are all multipath-free, right?

    One challenge that arises is that to achieve good coverage over all scenarios, the parameter space explodes. Not only do we already want sufficient examples of each modulation, but there are comparably many (maybe more?) channel realizations one could use if you vary delay spread, Doppler spread, coherence time, tapped-delay-line distributions… what else? Regardless, it is a lot of data.

    There is also the argument to evaluate against “hardware impairments” that people (machine learners in particular) love to cite but rarely define. I think you have exited synthetic-data territory when you hit that level of fidelity, but that’s just my opinion.

    Curious what your thoughts are.

    Cheers!
    Stephan

    1. Yes, all CSPB.ML.* datasets posted so far are free from channel effects. That is, they all correspond to a propagation channel that is impulsive,

      $h(t) = \delta(t)$

      and the noise is always additive white Gaussian noise.

      I do see value in creating, posting, and using a dataset that includes the things I already vary (excess bandwidth, the bit sequence, the carrier-frequency offset, the symbol rate) but that also includes a non-trivial random propagation channel. I would likely use a discrete multipath channel model, where I vary the number of rays, the complex amplitude of each ray, and the delay spread of the rays. A rough sketch of what I mean is below.
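      As a very rough sketch (all parameter ranges here are placeholders for illustration, not a plan for any particular CSPB.ML dataset), such a channel could be drawn and applied like this in MATLAB:

      ```matlab
      % Illustrative discrete multipath channel: random number of rays, random
      % complex ray amplitudes, random delays within a maximum delay spread.
      % All ranges below are assumptions for the sketch only.
      maxRays        = 5;        % assumed maximum number of rays
      maxDelaySpread = 20;       % assumed maximum delay spread in samples
      numRays = randi([2 maxRays]);
      delays  = [0; sort(randperm(maxDelaySpread, numRays-1)')];  % unique ray delays
      gains   = (randn(numRays,1) + 1j*randn(numRays,1))/sqrt(2); % complex ray amplitudes
      gains   = gains/norm(gains);                                % unit-energy channel

      h = zeros(max(delays)+1, 1);
      h(delays+1) = gains;       % discrete-time impulse response

      y = conv(x, h);            % apply the channel to a signal record x
      y = y(1:length(x));        % retain the original record length
      ```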

      I fully agree that unless the random variables controlling those discrete-multipath parameters are severely limited in scope (e.g., limited to two rays), creating a reasonably sized dataset that also fully samples those random variables will be a computational and storage challenge (nightmare?). So that’s one reason I haven’t done it yet.

      Another reason is more personal. My current modulation-recognition algorithms and their software implementations do not explicitly handle multipath, and classification performance will suffer in the presence of significant frequency-selective effects. However, I do have a mathematical (CSP) approach to joint channel estimation and modulation recognition, and it should be able to handle cochannel situations. That is, it will do parameter estimation and joint modulation recognition for two cochannel signals that experience frequency-selective channel effects. I’ve been dragging my feet on generating and posting the kinds of datasets you have brought up because when I do, I want to also post results for an SP/CSP approach for the Learners to compare to. But I don’t have that fully implemented.

      Regarding hardware impairments, I think that is a bit of a distraction or some kind of feint. I capture a lot of data from Ettus radios as well as from embedded ADC boards, and I receive much more data from various customers, partners, and randos. For the vast majority of these datasets, there is no problematic or even discernible hardware impairment. I can tell because my systems do more than CSP: they can blindly extract digital-QAM/PSK constellations. These often approach textbook quality, which cannot happen if there is an unequalized frequency-selective channel present, or significant hardware impairments in either the transmitter or receiver (such as IQ imbalance or bad phase noise).

      By far the most common “impairment” I face as a data analyst, algorithm developer, and SP/CSP software developer is irregular noise floors. So my systems employ automatic spectral segmentation algorithms that feature a lot of special modes that can handle noise-floor tilts, divots, ripples, pedestals, etc. But don’t take that as “Chad has solved it all”; the irregular-noise-floor problem is ever-evolving and always challenging, and I just try to adapt.

      I’ve been fully focused on providing datasets that push SP and ML mod-rec toward the more difficult problems (e.g., cochannel) but also that push on the pain points of the ML approach, specifically generalization. Until we can make SP/CSP and/or ML algorithms that can generalize well (and we can with SP/CSP, actually), there isn’t a need to crank up the realism with channels. If you can’t solve the simpler problems involving tiny changes in some of the governing random variables, you can’t solve the larger problems that vastly increase the number of involved random variables.
