**Update April 2022**. I’ve also posted a second dataset here. This new dataset is similar to the original ML Challenge dataset except the random variable representing the carrier frequency offset has a slightly different distribution.

If you refer to either of the posted datasets in a published paper, please use the following designators, which I am also using in papers I’m attempting to publish:

Original ML Challenge Dataset: CSPB.ML.2018.

Shifted ML Challenge Dataset: CSPB.ML.2022.

**Update September 2020**. I made a mistake when I created the signal-parameter “truth” files signal_record.txt and signal_record_first_20000.txt. As with the DeepSig RML data sets that I analyzed on the CSP Blog here and here, the SNR parameter in the truth files did not match the actual SNR of the signals in the data files. I’ve updated the truth files and the links below. You can still use the original files for all other signal parameters; only the SNR parameter was in error.

**Update July 2020**. I originally posted 20,000 signals in this data set. I’ve now added another 92,000 for a total of 112,000 signals. The original signals are contained in Batches 1-5, the additional signals in Batches 6-28. I’ve placed these additional Batches at the end of the post to preserve the original post’s content.

### Overview of Data Set

The signals are stored in five zip files, each containing 4000 individual signal files:

The zip files are each about 1 GB in size.

The modulation-type labels for the signals, such as “BPSK” or “MSK,” are contained in the text file:

Each signal file is stored in a binary format involving interleaved real and imaginary parts, which I call ‘.tim’ files. You can read a .tim file into MATLAB using read_binary.m. Or use the code inside read_binary.m to write your own data-reader; the format is quite simple.
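If you’d rather work in Python than MATLAB, a reader could look something like the minimal sketch below. The post only says that the real and imaginary parts are interleaved, so the sample type (`float32` here) and the absence of a header (`header_bytes=0`) are assumptions rather than facts taken from read_binary.m. The function name `read_tim` and both parameters are hypothetical. Check them against read_binary.m and against the sample values listed later in this post for signal_1.tim before trusting the output.

```python
import numpy as np


def read_tim(path, dtype=np.float32, header_bytes=0):
    """Read interleaved real/imaginary samples into a complex NumPy array.

    ASSUMPTIONS (not stated in the post): the on-disk sample type `dtype`
    and the size of any file header, `header_bytes`. Verify both against
    read_binary.m before relying on this function.
    """
    raw = np.fromfile(path, dtype=dtype, offset=header_bytes)
    if raw.size % 2:
        raise ValueError("odd number of values; check dtype and header_bytes")
    # Even positions hold the real parts, odd positions the imaginary parts.
    return raw[0::2] + 1j * raw[1::2]
```

If the first few values returned for signal_1.tim do not match the samples printed in the signal_1.tim section, the assumed `dtype` or `header_bytes` is the first thing to revisit.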

### The Label and Parameter File

Let’s look at the format of the truth/label file. The first line of signal_record_first_20000.txt is

1 bpsk 11 -7.4433467080e-04 9.8977795076e-01 10 9 5.4532617590e+00 0.0

which comprises 9 fields. All temporal and spectral parameters (times and frequencies) are normalized with respect to the sampling rate. In other words, the sampling rate can be taken to be unity in this data set. These fields are described in the following list:

1. **Signal index**. In the case above this is ‘1’, which means the file containing the signal is called signal_1.tim. In general, the $n$th signal is contained in the file signal_n.tim. The Batch 1 zip file contains signal_1.tim through signal_4000.tim.
2. **Signal type**. A string indicating the modulation format of the signal in the file. For this data set, I’ve only got eight modulation types: BPSK, QPSK, 8PSK, $\pi/4$-DQPSK, 16QAM, 64QAM, 256QAM, and MSK. These are denoted by the strings bpsk, qpsk, 8psk, dqpsk, 16qam, 64qam, 256qam, and msk, respectively.
3. **Base symbol period**. In the example above (line one of the truth file), the base symbol period is $T_0 = 11$.
4. **Carrier offset**. In this case, it is $-7.4433467080 \times 10^{-4}$.
5. **Excess bandwidth**. The excess bandwidth parameter, or square-root raised-cosine roll-off parameter, applies to all of the signal types except MSK. Here it is $0.98977795076$. It can be any real number within a fixed range.
6. **Upsample factor**. The sixth field is an upsampling parameter $U$.
7. **Downsample factor**. The seventh field is a downsampling parameter $D$. The actual symbol rate of the signal in the file is computed from the base symbol period, upsample factor, and downsample factor: $f_{sym} = (1/T_0)(D/U)$. So the BPSK signal in signal_1.tim has rate $(1/11)(9/10) = 9/110 \approx 0.0818$. **If the downsample factor is zero in the truth-parameters file, no resampling was done to the signal.**
8. **Inband SNR (dB)**. The ratio of the signal power to the noise power within the signal’s bandwidth, taking into account the signal type and the excess bandwidth parameter.
9. **Noise spectral density (dB)**. It is always $0$ dB. So the various SNRs are generated by varying the signal power.
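The field layout above can be turned into a small parser. The following Python sketch is one way to do it, not code from the post: the names `SignalRecord` and `parse_record` are hypothetical. The symbol-rate rule it uses is the downsample factor divided by the product of the upsample factor and the base symbol period. That rule reproduces the symbol-rate cycle frequency measured for signal_1.tim (about 8.18e-02) and treats a zero downsample factor as “no resampling.”

```python
from dataclasses import dataclass


@dataclass
class SignalRecord:
    index: int
    mod_type: str
    base_period: float
    carrier_offset: float
    excess_bw: float
    upsample: int
    downsample: int
    inband_snr_db: float
    noise_psd_db: float

    @property
    def filename(self):
        return f"signal_{self.index}.tim"

    @property
    def symbol_rate(self):
        # A zero downsample factor means the signal was not resampled.
        if self.downsample == 0:
            return 1.0 / self.base_period
        return self.downsample / (self.upsample * self.base_period)


def parse_record(line):
    """Parse one whitespace-separated line of the truth file."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 fields, got {len(f)}")
    return SignalRecord(int(f[0]), f[1], float(f[2]), float(f[3]),
                        float(f[4]), int(f[5]), int(f[6]),
                        float(f[7]), float(f[8]))


rec = parse_record(
    "1 bpsk 11 -7.4433467080e-04 9.8977795076e-01 10 9 5.4532617590e+00 0.0")
print(rec.filename, rec.mod_type, rec.symbol_rate)
```

All quantities stay in normalized units (sampling rate of one), matching the convention of the truth file.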

To ensure that you have correctly downloaded and interpreted my data files, I’m going to provide some PSD plots and a couple of the actual sample values for a couple of the files.

### signal_1.tim

The line from the truth file is:

1 bpsk 11 -7.4433467080e-04 9.8977795076e-01 10 9 5.4532617590e+00 0.0

The first ten samples of the file are:

-5.703014e-02 -6.163056e-01

-1.285231e-01 -6.318392e-01

6.664069e-01 -7.007506e-02

7.731103e-01 -1.164615e+00

3.502680e-01 -1.097872e+00

7.825349e-01 -3.721564e-01

1.094809e+00 -3.123962e-01

4.146149e-01 -5.890701e-01

1.444665e+00 7.358724e-01

-2.217039e-01 -1.305001e+00

An FSM-based PSD estimate for signal_1.tim is:

And the blindly estimated cycle frequencies (using the SSCA) are:

The previous plot corresponds to the numerical values:

Non-conjugate:

8.181762695e-02 7.480e-01 5.406e+00

Conjugate:

8.032470942e-02 7.800e-01 4.978e+00

-1.493096002e-03 8.576e-01 1.098e+01

-8.331298083e-02 7.090e-01 5.039e+00
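These estimates can be checked against what the truth-file parameters predict. The Python sketch below uses the textbook second-order result for BPSK with square-root raised-cosine pulses: one non-conjugate cycle frequency at the symbol rate, and conjugate cycle frequencies at twice the carrier offset and at twice the carrier offset plus or minus the symbol rate. The function name is hypothetical, and the symbol-rate rule (downsample factor over the product of upsample factor and base symbol period) is the one that matches the non-conjugate estimate above.

```python
def bpsk_cycle_frequencies(base_period, upsample, downsample, carrier_offset):
    """Predicted second-order cycle frequencies for BPSK (normalized units)."""
    # A zero downsample factor means no resampling was done.
    ratio = downsample / upsample if downsample else 1.0
    f_sym = ratio / base_period
    non_conj = [f_sym]
    conj = [2 * carrier_offset + k * f_sym for k in (1, 0, -1)]
    return non_conj, conj


# Truth parameters for signal_1.tim and the blind estimates listed above.
non_conj, conj = bpsk_cycle_frequencies(11, 10, 9, -7.4433467080e-04)
estimated_non_conj = [8.181762695e-02]
estimated_conj = [8.032470942e-02, -1.493096002e-03, -8.331298083e-02]

for pred, est in zip(non_conj + conj, estimated_non_conj + estimated_conj):
    print(f"predicted {pred:+.6e}  estimated {est:+.6e}  diff {pred - est:+.1e}")
```

For this signal, every predicted value lands within about 1e-5 (normalized frequency) of the corresponding blind estimate.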

### signal_4000.tim

The line from the truth file is

4000 256qam 9 8.3914849139e-04 7.2367959637e-01 9 8 1.0566301192e+01 0.0

which means the symbol rate is given by $(1/9)(8/9) = 8/81 \approx 0.0988$. The carrier offset is $8.3914849139 \times 10^{-4}$ and the excess bandwidth is $0.72367959637$. Because the signal type is 256QAM, it has a single (non-zero) non-conjugate cycle frequency of $0.0988$ and no conjugate cycle frequencies. But the square of the signal has cycle frequencies related to the quadrupled carrier:
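To put a number on that quadrupled carrier, here is a short worked calculation from the truth-file line above. It assumes the usual fourth-order pattern for QAM, in which these cycle frequencies sit at four times the carrier offset plus integer multiples of the symbol rate. The symbols $f_c$, $f_{sym}$, and $\alpha_k$ are notation introduced here for illustration, and the truth file does not say which harmonics $k$ are strong enough to detect.

$$
4 f_c = 4 \times 8.3914849139 \times 10^{-4} \approx 3.3566 \times 10^{-3},
\qquad
\alpha_k = 4 f_c + k f_{sym},
\qquad
f_{sym} = \frac{8}{81} \approx 0.0988 .
$$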

### Final Thoughts

Is 20,000 waveforms a large enough data set? Maybe not. I have generated tens of thousands more, but will not post until there is a good reason to do so. And that, my friends, is up to you!

That’s about it. I think that gives you enough information to ensure that you’ve interpreted the data and the labels correctly. What remains is experimentation, machine-learning or otherwise I suppose. Please get back to me and the readers of the CSP Blog with any interesting results using the Comments section of this post or the Challenge post.

For my analysis of a commonly used machine-learning modulation-recognition data set (RML), see the All BPSK Signals post. I also analyze two other data sets from the RML authors (DeepSig Inc.) here and here.

I need the UFMC modulation signal in the time domain. Thanks in advance.

Well, the data set I’ve posted is unlikely to be augmented with any other signals, such as universal filter multicarrier (assuming that’s what you are asking about). It is by no means a comprehensive data set for modulation recognition. I suppose that’s part of the point: modulation recognition is a hard problem with a wide variety of possible inputs (not even counting propagation-channel effects!), and the input class is growing all the time as new RF communication physical-layer technologies are developed and deployed. We’ll all have trouble keeping up…

Dr. Chad, thank you for the great contribution! My doctoral research is based on the combination of ML and SP (signal processing). Recently some great ideas occurred to me, and luckily I found this blog. How to use ML to learn the Fourier transform is an interesting topic, and my ideas have something in common with it. I think maybe there is no need to rely entirely on ML or on SP expertise. How about combining them, as in some synthesized signal-processing methods? Hoping for the next discussion!

Ym S: Thanks for checking out the CSP Blog!

The next few posts that will appear won’t have to do with Machine Learning. I’m hoping that sometime, somewhere, someone will take up the challenge and post their results. It is likely my non-ML methods will eventually be inferior to some ML method, but so far nobody has shown me their results, even though many have downloaded the data set. We’ll see! At that time, I’ll probably post more on ML and CSP.

Dr. Spooner,

Given the formula for the actual symbol rate given above using base symbol period, upsample factor, and downsample factor, a downsample rate of 0 does not make sense to me, and yet some of the signals have this value in the truth file. Were such signals resampled differently from those with a downsample rate of 1?

JVB: Thanks for visiting the CSP Blog and for paying close attention to my ML-Challenge data set!

I’ve verified your observation. Looking at the data files, when downsample was (inadvertently) set to zero, no resampling took place. So this is equivalent to having the upsample and downsample factors both equal one.

The same oversight carried over into my code that computes the error statistics for the CFO estimates, meaning that my errors are slightly better than shown in the figures in the post. Comparing new and old, I don’t think it is worthwhile to replace the figures.

Let me know what else you find!

Hi Chad – great blog!

I have recently downloaded your dataset with 112000 signals (28 batches of 4000 signals each). For each signal (which is a vector of 32768 complex samples) I have extracted the RMS (root-mean-square)-value. (Assuming that my software is bug-free) I have found that for the 1288 signals that have exactly zero CFO (carrier-frequency-offset), the corresponding RMS-values are all in the interval [1.015, 6.38] – while for the remaining 110712 signals the RMS-values lie in the “narrow” interval [0.988, 1.012].

Is there any particular reason for this correlation between CFO and RMS-value?