Or any transform for that matter. Or the power spectrum? Autocorrelation function? Cyclic moment? Cyclic cumulant?
I ask because the Machine Learners want to do away with what they call Expert Features in multiple areas involving classification, such as modulation recognition, image classification, facial recognition, etc. The idea is to train the machine (and by machine they seem to almost always mean an artificial neural network, or just neural network for short) by applying labeled data (supervised learning) where the data is the raw data involved in the classification application area. For us, here at the CSP Blog, that means complex-valued data samples obtained through standard RF signal reception techniques. In other words, the samples that we start with in all of our CSP algorithms, such as the frequency-smoothing method, the time-smoothing method, the strip spectral correlation analyzer, the cycle detectors, the time-delay estimators, automatic spectral segmentation, etc.
This is an interesting and potentially valuable line of inquiry, even if it does lead to the superfluousness of my work and the CSP Blog itself. Oh well, gotta face reality.
So can we start with complex samples (commonly called “I-Q samples”, which is short for “inphase and quadrature samples”) corresponding to labeled examples of the involved classes (BPSK, QPSK, AM, FM, etc.) and end up with a classifier with performance that exceeds that of the best Expert Feature classifier? From my point of view, that means that the machine has to learn cyclic cumulants or something even better. I have a hard time imagining something better (that is just a statement about my mental limitations, not about what might exist in the world), so I shift to asking Can a Machine Learn the Cyclic Cumulant?
Let’s recall our formula for the cyclic temporal cumulant function (“cyclic cumulant”):
Each of the functions is a cyclic temporal moment function, which is a Fourier component of the periodically time-variant temporal moment function, and which can be estimated directly from the signal itself,
In (2), recall that $n$ is the order of the cyclic moment, $m$ is the number of optional conjugations that are used, $\alpha$ is the (impure) cycle frequency, and the delay vector is $\boldsymbol{\tau} = [\tau_1\ \tau_2\ \ldots\ \tau_n]$. The cyclic moments can be usefully estimated by Fourier transformation of the delay product followed by peak-picking.
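For reference, here is a sketch of how the cyclic cumulant (1) and the cyclic moment (2) are commonly written in the cyclostationary signal-processing literature. The notation is an assumption on my part: $n$ is the order, $m$ is the number of conjugated factors, $\alpha$ is the cycle frequency, $\boldsymbol{\tau}$ is the delay vector, $P$ ranges over the distinct partitions of the index set $\{1, 2, \ldots, n\}$, $p$ is the number of elements in a partition, and $(n_j, m_j)$ are the order and number of conjugations for the $j$th partition element. The inner sum in (1) runs over all lower-order cycle-frequency vectors $\boldsymbol{\beta}$ whose elements sum to $\alpha$.

```latex
\begin{equation}
C_x^{\alpha}(\boldsymbol{\tau}; n, m) = \sum_{P} \left[ (-1)^{p-1} (p-1)!
  \sum_{\boldsymbol{\beta}^{\dagger} \mathbf{1} = \alpha} \;
  \prod_{j=1}^{p} R_x^{\beta_j}(\boldsymbol{\tau}_j; n_j, m_j) \right] \tag{1}
\end{equation}

\begin{equation}
R_x^{\alpha}(\boldsymbol{\tau}; n, m) = \lim_{T \rightarrow \infty} \frac{1}{T}
  \int_{-T/2}^{T/2} \prod_{j=1}^{n} x^{(*)_j}(t + \tau_j) \, e^{-i 2 \pi \alpha t} \, dt \tag{2}
\end{equation}
```

Here $(*)_j$ denotes an optional conjugation of the $j$th factor, with $m$ of the $n$ factors conjugated.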
The cyclic cumulant (1) is fiendishly complex, especially for higher orders () and/or signals with a large number of lower-order cycle frequencies (for example, DSSS). Lucky for us, we can get a lot of modulation-classification mileage by combining cyclic cumulants with orders into feature vectors, but still the cyclic cumulant for order is quite involved. (See My Papers [25,26,28,44]).
So can a Machine Learn the Cyclic Cumulant (1) as a means of classifying arbitrary QAM and PSK signals? That strikes me as a premature question, so I’m going to back off a bit more and wonder whether a machine could learn components of the cyclic cumulant. Like the cyclic moment. Even that might be too involved at this stage—it would require something like (A) creating the (nonlinear) delay product, complete with appropriate conjugated factors, (B) Fourier transformation, and (C) peak-picking with good thresholding to find the cyclic moments at the correct (useful) cycle frequencies.
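To make steps (A) through (C) concrete, here is a minimal second-order sketch in plain Python. It is only an illustration under assumptions I am choosing for the example: a noise-free rectangular-pulse BPSK signal with 8 samples per symbol, a single lag of half a symbol, a circular delay, and a crude "largest bin wins" peak-picker with no thresholding. The function names are hypothetical and are not taken from any CSP Blog code.

```python
import cmath
import random

def delay_product(x, tau, conjugate_second=True):
    """(A) Second-order delay product x[t] * conj(x[t + tau]), circular in t."""
    n = len(x)
    out = []
    for t in range(n):
        second = x[(t + tau) % n]
        out.append(x[t] * (second.conjugate() if conjugate_second else second))
    return out

def dft_bins(y, bins):
    """(B) Direct DFT of y, scaled by 1/N, evaluated only at the requested bins."""
    n = len(y)
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    return {k: sum(y[t] * twiddle[(k * t) % n] for t in range(n)) / n
            for k in bins}

def pick_peak(spectrum):
    """(C) Crude peak-picking: return the bin with the largest magnitude."""
    return max(spectrum, key=lambda k: abs(spectrum[k]))

# Assumed test signal: rectangular-pulse BPSK, 8 samples per symbol, no noise.
random.seed(0)
samples_per_symbol = 8
num_symbols = 128
symbols = [random.choice((-1.0, 1.0)) for _ in range(num_symbols)]
x = [complex(s, 0.0) for s in symbols for _ in range(samples_per_symbol)]
n = len(x)

# Use a lag of half a symbol; at zero lag this constant-modulus signal
# has no symbol-rate feature in its delay product.
lag = samples_per_symbol // 2
product = delay_product(x, tau=lag)

# Search positive, nonzero cycle frequencies only (bin k <-> alpha = k / n).
spectrum = dft_bins(product, range(1, n // 2))
peak_bin = pick_peak(spectrum)
peak_alpha = peak_bin / n  # normalized cycle frequency of the strongest feature
print(peak_bin, peak_alpha, abs(spectrum[peak_bin]))
```

For this signal the strongest nonzero feature should sit at the symbol rate, that is, at a normalized cycle frequency of one over the number of samples per symbol. A machine that learned the cyclic moment from raw samples would have to discover all three of these steps on its own.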
OK, so how about just the Fourier transform? Can a machine learn that in a supervised learning context? That’s where all my wondering ended up. Maybe if we could train a machine to perform the Fourier transform, we could build up to having it learn the cyclic cumulants just from labeled sets of I-Q data records.
Looking around the Web for information on learning the Fourier transform, I see that the Fourier transform is used in Machine Learning, but I can’t find any work that describes trying to force (or coax? cajole? beg?) a machine to cough up the Fourier transform. So, to be clear, I want a machine that uses I-Q labeled inputs in a supervised learning setting to adjust its internal settings such that the effective function that it represents is the (discrete) Fourier transform.
We know from theoretical work that neural networks can represent many interesting and useful functions. But can we make the networks learn them?
Attempting to Learn the 64-Point Discrete Fourier Transform
I’m using the Neural Network and Machine Learning toolboxes in MATLAB (version R2017a). The basic idea is to create a rather large set of 64-point complex-valued input sequences, compute the discrete Fourier transform for each using MATLAB’s fft.m, then apply these inputs and desired outputs to a neural network machine with complex-valued outputs.
The neural network tools don’t like complex numbers, so I actually convert the input and output sequences into 128-point real-valued sequences, with the real values followed by the imaginary values.
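The real-then-imaginary packing can be sketched in a few lines of plain Python. This is not the MATLAB code used for the experiment; the helper names (`pack`, `unpack`, `real_dft_matrix`) are hypothetical, and the direct O(N²) DFT stands in for fft.m. The sketch also shows that, in this stacked form, the 64-point DFT is a single 128-by-128 real matrix multiplication, which is the function the network is being asked to reproduce.

```python
import cmath
import math
import random

N = 64  # DFT length used in the text

def dft(x):
    """Direct O(N^2) DFT, a stand-in here for MATLAB's fft.m."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def pack(z):
    """Complex sequence -> real vector: all real parts, then all imaginary parts."""
    return [v.real for v in z] + [v.imag for v in z]

def unpack(v):
    """Inverse of pack: rebuild the complex sequence from the stacked vector."""
    half = len(v) // 2
    return [complex(v[i], v[half + i]) for i in range(half)]

def real_dft_matrix(n):
    """2n-by-2n real matrix W such that pack(dft(x)) equals W times pack(x)."""
    c = [[math.cos(2 * math.pi * k * t / n) for t in range(n)] for k in range(n)]
    s = [[math.sin(2 * math.pi * k * t / n) for t in range(n)] for k in range(n)]
    top = [c[k] + s[k] for k in range(n)]                    # Re X =  C xr + S xi
    bottom = [[-v for v in s[k]] + c[k] for k in range(n)]   # Im X = -S xr + C xi
    return top + bottom

def matvec(w, v):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]

random.seed(1)
x = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]

net_input = pack(x)        # 128 real values presented to the network
target = pack(dft(x))      # 128 real values the network should produce

W = real_dft_matrix(N)
linear_output = matvec(W, net_input)
max_error = max(abs(a - b) for a, b in zip(linear_output, target))
print(len(net_input), len(target), max_error)
```

The maximum error between the matrix product and the packed DFT should be at the level of floating-point rounding, so the target is a purely linear function of the packed input.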
I realize a big issue in applying neural-network-based machine-learning algorithms is the selection of good hyperparameters. These entities define the structure of the machine, whose internal weights are then optimized through the training process. I accepted most of the default hyperparameters in MATLAB’s neural-network function-fitting tool (nftool), but I did vary the number of nodes in the hidden layer, and of course the output layer has to have 128 elements.
For hidden-layer sizes of and , the training terminated after too many validation-check failures. Once I specified nodes in the hidden layer, training proceeded until I terminated it after iterations. These iterations took about elapsed-time hours on a -core high-performance linux workstation (it does not have any significant GPUs). At termination, the “Performance” parameter (MSE) was at and the “Gradient” was about . There were no reported validation failures.
The training inputs consisted of various sine waves, rectangles, pulse trains, binary sequences, and good old-fashioned white Gaussian noise. The testing inputs had some of these as well as a few other kinds of sequences, such as ramps and sinc functions.
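Here is a rough sketch of how input families like these might be generated. It is not the code in the zip file; every parameter choice below (complex-valued sine waves, the amplitude range, the pulse widths and periods, the number of examples) is an assumption made only for illustration.

```python
import math
import random

N = 64  # sequence length, matching the 64-point DFT in the text

def sine_wave(cycles, amplitude=1.0):
    """Complex sine wave completing `cycles` periods over the N samples."""
    return [amplitude * complex(math.cos(2 * math.pi * cycles * t / N),
                                math.sin(2 * math.pi * cycles * t / N))
            for t in range(N)]

def rectangle(width, amplitude=1.0):
    """Single rectangle: `width` samples at `amplitude`, then zeros."""
    return [complex(amplitude if t < width else 0.0, 0.0) for t in range(N)]

def pulse_train(period, width, amplitude=1.0):
    """Rectangular pulses of `width` samples repeating every `period` samples."""
    return [complex(amplitude if (t % period) < width else 0.0, 0.0)
            for t in range(N)]

def binary_sequence(rng, amplitude=1.0):
    """Independent +/- amplitude values, one per sample."""
    return [complex(amplitude * rng.choice((-1.0, 1.0)), 0.0) for _ in range(N)]

def white_gaussian_noise(rng, sigma=1.0):
    """Complex white Gaussian noise with standard deviation sigma per component."""
    return [complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for _ in range(N)]

def random_training_input(rng):
    """Draw one labeled input from a randomly chosen family (assumed ranges)."""
    family = rng.choice(("sine", "rectangle", "pulse_train", "binary", "noise"))
    amplitude = 10.0 ** rng.uniform(-1.0, 1.0)  # spread of input powers
    if family == "sine":
        return family, sine_wave(rng.randrange(0, N), amplitude)
    if family == "rectangle":
        return family, rectangle(rng.randrange(1, N), amplitude)
    if family == "pulse_train":
        period = rng.randrange(2, N // 2)
        return family, pulse_train(period, rng.randrange(1, period), amplitude)
    if family == "binary":
        return family, binary_sequence(rng, amplitude)
    return family, white_gaussian_noise(rng, amplitude)

rng = random.Random(2)
training_set = [random_training_input(rng) for _ in range(1000)]
print(training_set[0][0], len(training_set[0][1]))
```

Varying the amplitude matters for the results that follow, because the learned function is evaluated on both low-power and high-power inputs.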
Here is the MSE plot for the training and the testing data sets:
And here is the corresponding normalized MSE result (each MSE is divided by the power of the associated input):
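For concreteness, the two error measures can be sketched as follows. The definitions are my assumption of what the plots show: MSE is the average squared magnitude of the difference between the network output and the true DFT, and the normalized version divides that by the average power of the input sequence. The function names are hypothetical.

```python
def mse(estimate, target):
    """Mean-squared error between two equal-length (possibly complex) sequences."""
    return sum(abs(e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

def power(x):
    """Average power of a sequence: mean of |x[t]|^2."""
    return sum(abs(v) ** 2 for v in x) / len(x)

def normalized_mse(estimate, target, network_input):
    """MSE divided by the power of the associated input."""
    return mse(estimate, target) / power(network_input)

# Tiny 4-point example: the DFT of [1, 0, -1, 0] is [0, 2, 0, 2].
x = [1 + 0j, 0j, -1 + 0j, 0j]
true_dft = [0j, 2 + 0j, 0j, 2 + 0j]
network_output = [0.1 + 0j, 2 + 0j, 0j, 1.9 + 0j]  # made-up imperfect estimate

raw = mse(network_output, true_dft)                 # about 0.005
norm = normalized_mse(network_output, true_dft, x)  # about 0.01
print(raw, norm)
```

Normalizing this way lets errors on weak and strong inputs be compared on the same footing, which is why the two plots can tell different stories.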
Generally speaking, the learned function has a difficult time producing low normalized MSE for both low-power and high-power signals. I believe there is an option in the training tools to use normalized MSE as the performance metric for function fitting, but I’ve not yet done that. But notice that although the normalized MSE is typically larger for the low-power inputs, it can also be large for the highest-power inputs (look at the results for Sine Waves in the Testing data). So did I manage to make a neural network learn the 64-point discrete Fourier transform? No.
The code to create the training and testing data sets and the code that MATLAB produced that supposedly corresponds to what the fitting tool did are in this zip file.
Can You do Better?
You probably can! Let me know how. Of particular interest is the generalization property of the learning. We say we have learned the Fourier transform when we know the formula for an arbitrary number of samples and can successfully apply the formula symbolically and/or numerically. Once we’ve learned, in the usual sense of the word, we are not limited to calculating the Fourier transform of those signals we’ve already done it for: we can apply it to new signals that are utterly different from those that we’ve encountered so far.
For a machine to learn the Fourier transform, I think it should exhibit a high degree of generalization. If it performs well on a training set, no matter how big, and then fails for signals outside the training set, then it cannot be said to have learned. Moreover, of what use is it if the generalization is low?
The emphasis here on a high degree of generalization in a trained machine is not due to a desire to harp on a known weakness. We need generalization. In a lot of cases of professional interest to me, I use CSP to analyze data for which the contents are not known. We don’t know how many signals are present, or their types, or their parameters, or their cochannel combination properties. In some cases we find a signal with cyclostationarity properties that we’ve simply never encountered before. We are able to do this because we have highly general tools (that gets us to the never-before-seen patterns) and we have experience with lots of known signal types (that gets us to the conclusion that the obtained pattern is new).
It would be great to train a machine to do that, primarily because one could then automate the process of analyzing new data sets, which would likely speed up the process by orders of magnitude.