Dataset for the Machine-Learning Challenge [CSPB.ML.2018]

A PSK/QAM/SQPSK data set with randomized symbol rate, inband SNR, carrier-frequency offset, and pulse roll-off.

Update September 2023: A randomization flaw has been found and fixed for CSPB.ML.2018, resulting in CSPB.ML.2018R2. Use that one going forward.

Update February 2023: I’ve posted a third challenge dataset here. It is CSPB.ML.2023 and features cochannel signals.

Update April 2022: I’ve posted a second dataset here. This new dataset is similar to the original ML Challenge dataset, except that the random variable representing the carrier-frequency offset has a slightly different distribution.

If you refer to either of the posted datasets in a published paper, please use the following designators, which I am also using in papers I’m attempting to publish:

Original ML Challenge Dataset: CSPB.ML.2018.

Shifted ML Challenge Dataset: CSPB.ML.2022.

Update September 2020: I made a mistake when I created the signal-parameter “truth” files signal_record.txt and signal_record_first_20000.txt. As with the DeepSig RML datasets that I analyzed on the CSP Blog here and here, the SNR parameter in the truth files did not match the actual SNR of the signals in the data files. I’ve updated the truth files and the links below. You can still use the original files for all other signal parameters; only the SNR parameter was in error.

Update July 2020: I originally posted 20,000 signals in the dataset. I’ve now added another 92,000 for a total of 112,000 signals. The original signals are contained in Batches 1-5; the additional signals are in Batches 6-28, which I’ve placed at the end of the post to preserve the original post’s content.

Overview of Dataset

The original 20,000 signals are stored in five zip files, each containing 4,000 individual signal files:

Batch 1

Batch 2

Batch 3

Batch 4

Batch 5

The zip files are each about 1 GB in size.

The modulation-type labels for the signals, such as “BPSK” or “MSK,” are contained in the text file:

signal_record_first_20000.txt

Each signal is stored in a binary file of interleaved real and imaginary sample values, a format I call ‘.tim’. You can read a .tim file into MATLAB using read_binary.m, or use the code inside read_binary.m as a guide to writing your own data reader; the format is quite simple.
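For readers working outside MATLAB, here is a minimal Python sketch of such a reader. The interleaving follows the description above, but the sample precision and byte order are my assumptions (little-endian 32-bit floats); consult read_binary.m for the authoritative format. The name read_tim is hypothetical.

import numpy as np

def read_tim(filename, dtype="<f4"):
    # Interleaved real/imaginary samples. The dtype "<f4" (little-endian
    # 32-bit float) is an assumption -- check read_binary.m to confirm
    # the precision and byte order used to write the files.
    raw = np.fromfile(filename, dtype=dtype)
    return raw[0::2] + 1j * raw[1::2]   # de-interleave into complex samples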

The Label and Parameter File

Let’s look at the format of the truth/label file. The first line of signal_record_first_20000.txt is

1 bpsk  11  -7.4433467080e-04  9.8977795076e-01  10  9  7.8834556169e+00  0.0

which comprises nine fields. All temporal and spectral parameters (times and frequencies) are normalized with respect to the sampling rate; in other words, the sampling rate can be taken to be unity in this dataset. The fields are described in the following list, and a short parsing sketch follows it:

  1. Signal index. In the case above this is 1, which means the file containing the signal is called signal_1.tim. In general, the nth signal is contained in the file signal_n.tim. The Batch 1 zip file contains signal_1.tim through signal_4000.tim.
  2. Signal type. A string indicating the modulation format of the signal in the file. For this dataset, I’ve only got eight modulation types: BPSK, QPSK, 8PSK, \pi/4-DQPSK, 16QAM, 64QAM, 256QAM, and MSK. These are denoted by the strings bpsk, qpsk, 8psk, dqpsk, 16qam, 64qam, 256qam, and msk, respectively.
  3. Base symbol period. In the example above (line one of the truth file), the base symbol period is T_0 = 11.
  4. Carrier offset. In this case, it is -7.4433467080\times 10^{-4}.
  5. Excess bandwidth. The excess bandwidth parameter, or square-root raised-cosine roll-off parameter, applies to all of the signal types except MSK. Here it is 9.8977795076\times 10^{-1}. It can be any real number between 0.1 and 1.0.
  6. Upsample factor. The sixth field is an upsampling parameter U.
  7. Downsample factor. The seventh field is a downsampling parameter D. The actual symbol rate of the signal in the file is computed from the base symbol period, upsample factor, and downsample factor: \displaystyle f_{sym} = \frac{1}{T_0} \cdot \frac{D}{U}. So the BPSK signal in signal_1.tim has symbol rate (1/11)(9/10) = 0.08181818. If the downsample factor is zero in the truth-parameters file, no resampling was done to the signal, which is equivalent to having U = D = 1.
  8. Inband SNR (dB). The ratio of the signal power to the noise power within the signal’s bandwidth, taking into account the signal type and the excess bandwidth parameter.
  9. Noise spectral density (dB). It is always 0 dB, so the various SNRs are generated by varying the signal power.
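To make the field definitions concrete, here is a minimal Python parsing sketch. It is not part of the dataset distribution; the field order follows the list above, and the zero-downsample convention follows the note in field 7. The function name parse_truth_line is hypothetical.

def parse_truth_line(line):
    # Whitespace-delimited fields: index, modulation, base symbol period T0,
    # carrier offset, excess bandwidth, upsample U, downsample D,
    # inband SNR (dB), noise spectral density (dB).
    f = line.split()
    T0, U, D = float(f[2]), float(f[5]), float(f[6])
    # D == 0 means no resampling was applied (equivalent to U = D = 1).
    symbol_rate = 1.0 / T0 if D == 0 else (1.0 / T0) * (D / U)
    return {
        "index": int(f[0]),
        "modulation": f[1],
        "T0": T0,
        "carrier_offset": float(f[3]),
        "excess_bandwidth": float(f[4]),
        "upsample": U,
        "downsample": D,
        "inband_snr_db": float(f[7]),
        "nsd_db": float(f[8]),
        "symbol_rate": symbol_rate,
    }

rec = parse_truth_line("1 bpsk 11 -7.4433467080e-04 9.8977795076e-01 10 9 7.8834556169e+00 0.0")
print(rec["symbol_rate"])   # 0.08181818..., i.e., (1/11)*(9/10)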

To help you verify that you have correctly downloaded and interpreted my data files, I provide PSD plots and the first few actual sample values for two of the files.

signal_1.tim

The line from the truth file is:

1 bpsk  11  -7.4433467080e-04  9.8977795076e-01  10  9  7.8834556169e+00  0.0

The first ten samples of the file are:

-5.703014e-02   -6.163056e-01
-1.285231e-01   -6.318392e-01
6.664069e-01    -7.007506e-02
7.731103e-01    -1.164615e+00
3.502680e-01    -1.097872e+00
7.825349e-01    -3.721564e-01
1.094809e+00    -3.123962e-01
4.146149e-01    -5.890701e-01
1.444665e+00    7.358724e-01
-2.217039e-01   -1.305001e+00
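With a reader like the hypothetical read_tim() sketched earlier, the following lines should reproduce the ten values listed above (assuming the sample precision was guessed correctly):

x = read_tim("signal_1.tim")
for s in x[:10]:
    # Print real and imaginary parts in the same format as the listing above.
    print(f"{s.real:.6e}  {s.imag:.6e}")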

An FSM-based PSD estimate for signal_1.tim is:

[Figure psd_1: FSM-based PSD estimate of signal_1.tim]
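If you want to generate a comparable plot yourself, a generic frequency-smoothed periodogram along the following lines should suffice. This is a minimal sketch under the unity-sampling-rate convention, not the exact code behind the figure; the function name fsm_psd and the smoothing_fraction parameter are my inventions.

import numpy as np

def fsm_psd(x, smoothing_fraction=0.02):
    # Periodogram smoothed across frequency with a rectangular window
    # (the frequency-smoothing idea); fs = 1, so freqs lie in [-0.5, 0.5).
    N = len(x)
    periodogram = np.abs(np.fft.fftshift(np.fft.fft(x)))**2 / N
    width = max(1, int(smoothing_fraction * N))
    smoothed = np.convolve(periodogram, np.ones(width) / width, mode="same")
    freqs = np.fft.fftshift(np.fft.fftfreq(N))
    return freqs, 10.0 * np.log10(smoothed)   # PSD in dB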

And the blindly estimated cycle frequencies (using the SSCA) are:

[Figure cfs_signal_1: blindly estimated (SSCA) cycle frequencies for signal_1.tim]

The previous plot corresponds to the numerical values:

Non-conjugate (\alpha, C, S):

8.181762695e-02  7.480e-01  5.406e+00

Conjugate (\alpha, C, S):

8.032470942e-02  7.800e-01  4.978e+00
-1.493096002e-03  8.576e-01  1.098e+01
-8.331298083e-02  7.090e-01  5.039e+00
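These blind estimates agree with the standard second-order cycle-frequency pattern for BPSK: a non-conjugate feature at the symbol rate, and conjugate features at the doubled carrier offset and at the doubled carrier offset plus and minus the symbol rate. A quick arithmetic check using the truth-file parameters for signal_1.tim:

f_sym = (1 / 11) * (9 / 10)     # symbol rate from T0, U, D: 8.1818e-02
f_c = -7.4433467080e-04         # carrier offset from the truth file

print(f_sym)               # non-conjugate CF; estimated  8.181762695e-02
print(2 * f_c + f_sym)     # conjugate CF;     estimated  8.032470942e-02
print(2 * f_c)             # conjugate CF;     estimated -1.493096002e-03
print(2 * f_c - f_sym)     # conjugate CF;     estimated -8.331298083e-02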

signal_4000.tim

The line from the truth file is

4000 256qam  9  8.3914849139e-04  7.2367959637e-01  9  8  7.6893849108e+00  0.0

which means the symbol rate is (1/9)(8/9) = 8/81 = 0.09876543209. The carrier offset is 0.000839 and the excess bandwidth is 0.723. Because the signal type is 256QAM, it has a single (non-zero) non-conjugate cycle frequency of 0.098765 and no conjugate cycle frequencies. But the square of the signal has cycle frequencies related to the quadrupled carrier:

[Figure cfs_signal_4000: estimated cycle frequencies for signal_4000.tim]
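As a rough way to see the quadrupled-carrier feature without SSCA software, you can raise the signal to the fourth power and look for spectral lines near four times the carrier offset (the classic fourth-power carrier trick for QAM). This hedged sketch reuses the hypothetical read_tim() reader from earlier; note that the strongest fourth-power line may sit at 4 f_c offset by a multiple of the symbol rate, so interpret the peak location accordingly.

import numpy as np

x = read_tim("signal_4000.tim")        # hypothetical reader sketched above
spectrum = np.abs(np.fft.fft(x**4))    # fourth power regenerates carrier lines for QAM
freqs = np.fft.fftfreq(len(x))
print(freqs[np.argmax(spectrum)])      # expect a peak near 4*f_c = 3.3566e-03,
                                       # possibly offset by k * 0.098765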

Final Thoughts

Is 20,000 waveforms a large enough dataset? Maybe not. I have generated tens of thousands more, but will not post them until there is a good reason to do so (update: see Batches 6-28 below). And that, my friends, is up to you!

That’s about it. I think that gives you enough information to ensure that you’ve interpreted the data and the labels correctly. What remains is experimentation, machine-learning or otherwise. Please get back to me and the readers of the CSP Blog with any interesting results, using the Comments section of this post or the Challenge post.

For my analysis of a commonly used machine-learning modulation-recognition dataset (RML), see the All BPSK Signals post. I also analyze two other datasets from the RML authors (DeepSig Inc.) here and here.

Additional Batches of Signals:

Batch 6

Batch 7

Batch 8

Batch 9

Batch 10

Batch 11

Batch 12

Batch 13

Batch 14

Batch 15

Batch 16

Batch 17

Batch 18

Batch 19

Batch 20

Batch 21

Batch 22

Batch 23

Batch 24

Batch 25

Batch 26

Batch 27

Batch 28

Signal parameters text file

Author: Chad Spooner

I'm a signal processing researcher specializing in cyclostationary signal processing (CSP) for communication signals. I hope to use this blog to help others with their cyclo-projects and to learn more about how CSP is being used and extended worldwide.

10 thoughts on “Dataset for the Machine-Learning Challenge [CSPB.ML.2018]”

    1. Well, the data set I’ve posted is unlikely to be augmented with any other signals, such as universal filtered multicarrier (assuming that’s what you are asking about). It is by no means a comprehensive data set for modulation recognition. I suppose that’s part of the point: modulation recognition is a hard problem with a wide variety of possible inputs (not even counting propagation-channel effects!), and the input class is growing all the time as new RF communication physical-layer technologies are developed and deployed. We’ll all have trouble keeping up…

  1. Dr. Chad, thank you for the great contribution! My doctoral research is based on the combination of ML and SP (signal processing). Recently some great ideas occurred to me, and luckily I found this blog. How to use ML to learn the Fourier transform is an interesting topic, and my ideas have something in common with it. Maybe there is no need to rely entirely on ML or entirely on SP expertise; how about combining them, as in some synthesized signal-processing methods? Hoping for the next discussion!

    1. Ym S: Thanks for checking out the CSP Blog!

      The next few posts that will appear won’t have to do with Machine Learning. I’m hoping that sometime, somewhere, someone will take up the challenge and post their results. It is likely my non-ML methods will eventually be inferior to some ML method, but so far nobody has shown me their results, even though many have downloaded the data set. We’ll see! At that time, I’ll probably post more on ML and CSP.

  2. Dr. Spooner,
    Given the formula for the actual symbol rate given above using base symbol period, upsample factor, and downsample factor, a downsample rate of 0 does not make sense to me, and yet some of the signals have this value in the truth file. Were such signals resampled differently from those with a downsample rate of 1?

    1. JVB: Thanks for visiting the CSP Blog and for paying close attention to my ML-Challenge data set!

      I’ve verified your observation. Looking at the data files, when downsample was (inadvertently) set to zero, no resampling took place. So this is equivalent to having the upsample and downsample factors both equal one.

      This zero value was also overlooked in my code that computes the error statistics for the CFO estimates, meaning that my errors are slightly better than shown in the figures in the post. Having compared new and old, I don’t think it is worthwhile to replace the figures.

      Let me know what else you find!

  3. Hi Chad – great blog!

    I have recently downloaded your dataset with 112,000 signals (28 batches of 4,000 signals each). For each signal (a vector of 32,768 complex samples) I have computed the RMS (root-mean-square) value. Assuming that my software is bug-free, I have found that for the 1,288 signals that have exactly zero CFO (carrier-frequency offset), the RMS values all lie in the interval [1.015, 6.38], while for the remaining 110,712 signals the RMS values lie in the narrow interval [0.988, 1.012].

    Is there any particular reason for this correlation between CFO and RMS-value?

  4. Hello,

    Can you confirm that the SNR values in the files are the corrected/updated ones? It appears that the example values provided in the blog post may be outdated, as they differ from the current SNR values (7.883 in the file versus 5.453 on the blog).

    Additionally, could you kindly explain how you determined the noise spectral density in dB, as I have never encountered an NSD in raw dB units?

    I’m also interested in exploring the potential for collaboration.
    Would it be possible to get in touch with you to discuss this further?

    Thank you

    1. AdaBull:

      Thanks for checking out the CSP Blog and the comment! Very helpful.

      Yes, I still believe that the inband SNRs in the truth file are correct; they do correspond to the signals in the files. But, yes, the example lines from the truth file that appeared within the body of the post were taken from the original (uncorrected) truth file. I fixed them, so now the SNRs shown in the examples within the body of the post match those found in the truth file.

      Additionally, could you kindly explain how you determined the noise spectral density in dB, as I have never encountered an NSD in raw dB units?

      I am guessing that what you are asking is why don’t I provide the noise spectral density value in Watts/Hz. The reason is that in this situation, for this dataset, and also for most of my signal-processing work, the signals are processed as if the sampling rate is unity, f_s = 1 Hz. In such a case, the variance of the noise sequence \sigma_n^2 is equal to the spectral density N_0 in Watts/Hz.

      \displaystyle R_n^0(0) = \sigma_n^2 = \int_{-1/2}^{1/2} N_0 \, df = N_0

      Here I set the noise variance to unity, which means the noise spectral density is unity because f_s = 1.

      In many situations in RF signal processing, we have sampled data, and we can write code and equations for processing that data as if f_s = 1; only at the end, if we want to report a time or frequency, do we scale by the physical value of the sampling rate (for example, a normalized symbol rate of 0.0818 corresponds to 818 kHz when the physical sampling rate is 10 MHz).

      What do you say? Make sense?

      Would it be possible to get in touch with you to discuss this further?

      I will contact you.

