CSPB.ML.2018R2: Correcting an RNG Flaw in CSPB.ML.2018

KIRK: Everything that is in error must be sterilised.
NOMAD: There are no exceptions.
KIRK: Nomad, I made an error in creating you.
NOMAD: The creation of perfection is no error.
KIRK: I did not create perfection. I created error.

I’ve had to update the original Challenge for the Machine Learners post, and the associated dataset post, a couple times due to flaws in my metadata (truth) files. Those were fairly minor, so I just updated the original posts.

But a new flaw in CSPB.ML.2018 and CSPB.ML.2022 has come to light due to the work of the estimable research engineers at Expedition Technology. The problem is not with labeling or the fundamental correctness of the modulation types, pulse functions, etc., but with the way a random-number generator was applied in my multi-threaded dataset-generation technique.

I’ll explain after the fold, and this post will provide links to an updated version of the dataset, CSPB.ML.2018R2. I’ll keep the original up for continuity and also place a link to this post there. Moreover, the descriptions of the truth files over at CSPB.ML.2018 are still valid–the truth file posted here has the same format as the truth files available on the CSPB.ML.2018 and CSPB.ML.2022 posts.

The basic flaw is that the random-number generator seeds for the various separate signal-generation processes were not sufficiently independent. This led to unintended duplicates of the randomly chosen parameters across the 112,000 sample signals. Not all of the parameter selections were repeated–many are unique. If we search for duplicates of the very first entry in the metadata truth file,

1  bpsk 11 -7.4433467080e-04  9.8977795076e-01  10 9 7.8834556169e+00 0.0

we will find no duplicates. But if we search for the 1001st entry, which is

1001 bpsk  10 4.8916485770e-04  9.5661746052e-01  4  3  8.3343281183e+00 0.0

we will find several duplicates,

1001 bpsk 10 4.8916485770e-04 9.5661746052e-01 4 3 8.3343281183e+00 0.0
16913 bpsk 10 4.8916485770e-04 9.5661746052e-01 4 3 8.3343281183e+00 0.0
32825 bpsk 10 4.8916485770e-04 9.5661746052e-01 4 3 8.3343281183e+00 0.0
48737 bpsk 10 4.8916485770e-04 9.5661746052e-01 4 3 8.3343281183e+00 0.0
64649 bpsk 10 4.8916485770e-04 9.5661746052e-01 4 3 8.3343281183e+00 0.0
80561 bpsk 10 4.8916485770e-04 9.5661746052e-01 4 3 8.3343281183e+00 0.0
96473 bpsk 10 4.8916485770e-04 9.5661746052e-01 4 3 8.3343281183e+00 0.0
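A search like this is easy to script. Here is a minimal sketch, assuming each truth-file row is whitespace-separated with the signal index in the first field; the function name `find_duplicate_rows` is just illustrative. It compares the remaining fields as text, so it only flags rows whose formatted values match exactly, and it is run here on the rows quoted above rather than on the full truth file.

```python
from collections import defaultdict

def find_duplicate_rows(lines):
    """Group truth-file rows by every field after the leading signal index."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        groups[tuple(fields[1:])].append(int(fields[0]))
    # Keep only the parameter sets that appear under more than one index.
    return {key: indices for key, indices in groups.items() if len(indices) > 1}

# Rows copied from the listings above (not the full truth file).
params = "bpsk 10 4.8916485770e-04 9.5661746052e-01 4 3 8.3343281183e+00 0.0"
rows = ["1  bpsk 11 -7.4433467080e-04  9.8977795076e-01  10 9 7.8834556169e+00 0.0"]
rows += [f"{i} {params}" for i in (1001, 16913, 32825, 48737, 64649, 80561, 96473)]

duplicates = find_duplicate_rows(rows)
for key, indices in duplicates.items():
    # Spacing between consecutive duplicate indices.
    spacings = [b - a for a, b in zip(indices, indices[1:])]
    print(key[0], indices, spacings)
```

For the seven rows quoted above, the spacing between consecutive duplicate indices comes out to 15912 every time, which is the kind of regularity you would expect from a seeding problem rather than from chance.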

In these duplications, the noise is not duplicated, but the signal is. This duplication isn’t a problem for the CSP method I use to provide a performance target for the Learners. It would be best if everything were randomized, though, even for that method, so that the performance averages really were averaged over a very large number of different parameter values. However, these duplications don’t have any impact whatsoever on the algorithm itself, since it is constructed using mathematical models for cyclostationary signals that are completely independent of any particular dataset. (In my view, that’s a virtue, but that is a subjective judgment.)

The duplications may, though, have a serious impact on the training of a neural network for modulation recognition. To guard against overfitting, one typically wants to avoid spurious correlations between training samples, and this kind of duplication introduces exactly such correlations. Imagine if all the BPSK instances were the same signal: the network would have no chance at learning ‘BPSKness.’ Nevertheless, in all the network training and testing I’ve done with CSPB.ML.2018, I have not suspected memorization or other bad training effects.
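To make the seeding issue concrete, here is a toy sketch of how seeds that are not sufficiently independent can produce repeated parameter draws across parallel workers. It is not the generation code used for the dataset: the seed rules, the parameter ranges, and the names `draw_parameters`, `generate`, and `MASTER_SEED` are all hypothetical, and Python's standard `random` module stands in for whatever generator was actually used.

```python
import random

def draw_parameters(rng):
    """Draw one hypothetical (CFO, excess-bandwidth) pair; ranges are illustrative only."""
    cfo = rng.uniform(-0.001, 0.001)
    excess_bw = rng.uniform(0.1, 1.0)
    return (cfo, excess_bw)

def generate(num_workers, seed_for_worker):
    """Each worker builds its own generator from the seed it is handed."""
    return [draw_parameters(random.Random(seed_for_worker(w)))
            for w in range(num_workers)]

# Hypothetical flawed rule: the seed wraps around, so workers 0 and 4,
# 1 and 5, and so on start from the same generator state.
flawed = generate(8, lambda w: w % 4)

# One simple alternative: give every worker a distinct seed derived from
# a master seed and the worker index.
MASTER_SEED = 20180101  # arbitrary illustrative value
fixed = generate(8, lambda w: f"{MASTER_SEED}-{w}")

print(len(set(flawed)), "distinct parameter sets with the flawed rule")
print(len(set(fixed)), "distinct parameter sets with per-worker seeds")
```

With the flawed rule, only four distinct parameter sets come out of eight workers. Any noise drawn from a separately seeded generator would still differ between the repeats, which is consistent with signals being duplicated while the noise is not.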

I’ve redone the signal generation and confirmed that the new dataset does not contain repetitions. I then processed all 112,000 new signals using the same method as in 2018. The results are nearly identical, so I won’t post them all like I did before. Here are the key results, though.

The confusion matrix for a block length of 32,768 samples (the length of each generated file) is shown in Figure 1.

Figure 1. Confusion matrix for CSP-based modulation-recognition method applied to CSPB.ML.2018R2. The overall probability of correct classification is about 0.82, the same as for CSPB.ML.2018.

The carrier-frequency-offset (CFO) estimation accuracy is shown in Figure 2. Recall that estimating the CFO was the original intent of the dataset.

Figure 2. Carrier-frequency-offset estimation performance for CSPB.ML.2018R2. This is similar to the performance for CSPB.ML.2018. As in CSPB.ML.2018, the average CFO error dips below the basic Fourier resolution starting at about 16,384 samples.

Histograms for the randomly (I hope!) generated parameters, using 56 bins across the span of values, are shown in Figure 3. For a perfectly uniform distribution, each bar in the bar graph would have height 2000.

Figure 3. Histograms of the parameters in the truth file for CSPB.ML.2018R2. The carrier-frequency offset, signal power, and square-root raised-cosine excess bandwidth are intended to be uniform random variables. The symbol rates are intended to be randomized but cluster around 0.1, and the inband SNR distribution follows from the constant noise-floor value of 0 dB, the symbol rate, and the randomly chosen signal power.
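The height of 2000 is just 112,000 signals divided by 56 bins. The sketch below does that arithmetic and then bins a stand-in set of uniform random values to show how much bin-to-bin scatter a genuinely uniform parameter still produces. The values come from Python's `random` module with an arbitrary seed, not from the dataset's truth file, and the helper `histogram` is only illustrative.

```python
import random

NUM_SIGNALS = 112_000
NUM_BINS = 56
expected_per_bin = NUM_SIGNALS / NUM_BINS  # ideal bar height for a uniform parameter

def histogram(values, lo, hi, num_bins):
    """Count values into num_bins equal-width bins spanning [lo, hi]."""
    counts = [0] * num_bins
    for v in values:
        k = int((v - lo) / (hi - lo) * num_bins)
        counts[min(k, num_bins - 1)] += 1  # clamp v == hi into the last bin
    return counts

rng = random.Random(0)  # arbitrary seed, stand-in data only
values = [rng.random() for _ in range(NUM_SIGNALS)]
counts = histogram(values, 0.0, 1.0, NUM_BINS)

# Largest departure of any bar from the ideal height.
worst = max(abs(c - expected_per_bin) for c in counts)
print(expected_per_bin, min(counts), max(counts), worst)
```

Even for ideal uniform draws the bars will not sit exactly at 2000; fluctuations on the order of a few tens of counts per bin are normal at this sample size.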

The new metadata (true labels and parameters) can be found here.

Over the coming days I’ll be adding the CSPB.ML.2018R2 zip files to this post–it takes a while to upload them all to WordPress. Then I’ll do the same for CSPB.ML.2022. The cochannel dataset CSPB.ML.2023 is not affected by this parallel-processing flaw (but may have other flaws–let us all know if you find some).

Here are the 28 zip files:

signal_31986.tim from Batch 8

Author: Chad Spooner

I'm a signal processing researcher specializing in cyclostationary signal processing (CSP) for communication signals. I hope to use this blog to help others with their cyclo-projects and to learn more about how CSP is being used and extended worldwide.

6 thoughts on “CSPB.ML.2018R2: Correcting an RNG Flaw in CSPB.ML.2018”

  1. Both the transparency and correction are commendable. Yet another thing that makes the CSPB a valuable resource to the community.

    1. Thank you very much Todd. As I’ve said elsewhere, the CSP Blog is a mixture of mathematics and criticism. I can’t rightly ask to have my criticism taken seriously if I can’t admit and correct my own errors. Plus, it is just the right thing to do in the spirit of science and, at its best, engineering.

  2. Hello Chad,
    Hope you are well!
    Thanks for the update. I fear that the Batch_Dir_8/signal_31987.tim is missing from the dataset 🙁
    Thanks and have a nice day!

    1. Hi Rob! I unzipped the archive locally and the file is in the archive. I suppose the zip file could have been corrupted when I uploaded it to WordPress. Have you noticed any other problems with the other 27 zip files?

      1. Hello!

        Actually it was signal_31986.tim (typo in my msg), sorry for that.

        From a quick test:

        # Each of the 28 batch directories should hold 4000 consecutively numbered files.
        for i in {1..28}
        do
          a=$(( (i - 1) * 4000 + 1 ))
          b=$(( i * 4000 ))
          for j in $(seq "$a" "$b")
          do
            ls -l "Batch_Dir_$i/signal_$j.tim" > /tmp/foo
          done
        done
        ls: cannot access ‘Batch_Dir_8/signal_31986.tim’: No such file or directory

        This seems to be the only file that is missing…
        Thanks!! 🙂

        1. I’ve updated the post to include a direct link to the missing file, which is zipped, as an alternative to uploading a new version of the 1-GB Batch-8 zip. Until I hear from you (or others) again with new errors/issues, it looks like the dataset is complete!

          Thanks again Rob.

