In an earlier post, I considered whether a machine (a neural network) could learn the (64-point, complex-valued) Fourier transform. I used MATLAB’s Neural Network Toolbox and failed to get good learning results because I did not properly set the machine’s hyperparameters. A kind reader named Vito Dantona left a comment on that original post containing good hyperparameter selections, and I’m going to report the new results here in this post.
Since the Fourier transform is linear, the machine should be set up to do linear processing. It can’t just figure that out for itself. Once I used Vito’s suggested hyperparameters to force the machine to be linear, the results became much better:
Looking at the normalized MSE graphs, we can see that a few inputs with low power give rise to learned Fourier transforms with error variance comparable to the input power. But for the most part, the squared error in the learned transform is orders of magnitude smaller than the input power, indicating an accurate result. The lesson is that if we create a neural network that is linear, we can make it learn the Fourier transform over some interval of amplitudes and for signals in the training and testing sets. Pretty good!
However, did the machine learn the Fourier transform in our usual human sense? If it did, it should be able to produce accurate results for signals that are not in the training set and for scaled versions of the training-set input signals where the scaling is not represented in the training set. I think this is called the ability of the machine to generalize.
So, for example, if we consider a simple 20-sample rectangle as our input (20 ones followed by 44 zeros), we can scale it with a variety of scale factors and compare the output of our learned function with the output of MATLAB’s fft.m. The result is:
When the scaling factor is very small, the learned function deviates from the Fourier transform, so even though we got a lot farther with Vito’s help, the machine still didn’t really learn the Fourier transform. Near the left edge of the graph above, we obtain the individual results:
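For anyone who wants to reproduce the scaling experiment, here is a minimal harness. The `learned_ft` function is a placeholder for the trained network’s forward pass; I substitute the exact FFT so the harness runs end to end.

```python
import numpy as np

def normalized_mse(est, ref):
    # Squared error normalized by the power of the reference transform.
    return np.sum(np.abs(est - ref) ** 2) / np.sum(np.abs(ref) ** 2)

def rect_input(scale, N=64, width=20):
    # 'width' ones followed by N - width zeros, then scaled.
    x = np.zeros(N, dtype=complex)
    x[:width] = scale
    return x

def learned_ft(x):
    # Placeholder for the trained network's forward pass.
    return np.fft.fft(x)

for scale in 10.0 ** np.arange(-6, 1):
    x = rect_input(scale)
    nmse = normalized_mse(learned_ft(x), np.fft.fft(x))
    print(f"scale={scale:g}  NMSE={nmse:.3e}")
```

With the placeholder, the NMSE is of course zero at every scale; the interesting experiment is swapping in the trained network and watching the NMSE grow as the scale factor leaves the training range.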
This brings me to a paper I came across recently that applies machine learning to the modulation classification problem. It is by M. Kulin, T. Kazaz, I. Moerman, and E. de Poorter and can be found here. The title is “End-to-End Learning from Spectrum Data: A Deep Learning Approach for Wireless Signal Identification in Spectrum Monitoring Applications.”
There is the usual evangelical fervor about how machine learning will solve everything without the need for any human experts. They’re always wanting to get rid of the experts! Except the machine-learning experts, of course. Here’s why I say that. The authors state several times that one of their objectives is to finally free humanity from the shackles of expertly “hand-crafted features:”
Abstract: “without requiring design of hand-crafted expert features like higher order cyclic moments”
Introduction: “completely eliminates the need for designing expert features such as higher order cyclic moments”
Section IB: “The design of these specialized solutions have proven to be time-demanding as they typically rely on manual extraction of expert features for which a significant amount of domain knowledge and engineering is required.”
Section II: “without requiring design of hand-crafted expert features like higher order cyclic moments”
Section V: [End-to-end learning] “can be applied to various wireless signals to effectively detect the presence of radio emitters in a unified way with [sic] requiring design of expert features.”
But we cannot do without the expertise of the machine learners! To wit:
Section IB: “The technical approach depicted in this paper is deeply interdisciplinary and systematic, calling for the synergy of expertise of computer scientists, wireless communications engineers, signal processing and machine learning experts …”
And (this kills me), the laborious and error-prone selection of the machine hyperparameters is just the way it is:
Section IIIB: “The number of filters per layer is a tunable parameter called a hyper-parameter. Other tunable parameters are the filter size, the number of layers, etc. The selection of values for hyper-parameters may be quite difficult, and finding it [sic] commonly is much [sic] an art as it is science. An optimal choice may only be feasible by trial and error.”
Smells a bit like hand-crafting, don’t you think?
I think these guys are behind the times. What we really need is a machine that can automatically learn good hyperparameters for another machine. Well, and a machine just before that machine, that sets its hyperparameters. And, uh, well, I suppose it’s turtles all the way down.
OK, back to the technical ideas here.
First, the authors don’t even cite a paper that uses higher-order cyclic moments or cumulants. They do cite a paper of mine, written with some Virginia Tech researchers a while back, that uses the cyclic domain profile (My Papers ), which is essentially the spectral correlation or coherence magnitude viewed edge-on, so that the x-axis is cycle frequency and the y-axis is magnitude. Only the spectral correlation function and the spectral coherence are used in that paper. No higher-order moments in sight. So they neglect to cite any of my other papers on the topic of higher-order moments/cumulants (either the theory or the application), and they don’t cite anybody else appropriate either, like Antonio Napolitano or Octavia Dobre (see The Literature for examples). So I question whether they know what a higher-order cyclic moment is.
Second, higher-order cyclic moments aren’t “hand crafted.” In My Papers [5,6], I show that the higher-order cyclic moments are just components of the coefficients in the series expansion of the various nth-order characteristic functions for a cyclostationary random process. And the higher-order cyclic cumulants are coefficients in the series expansion of the logarithm of the characteristic functions. And, of course, the characteristic functions are nothing more than the Fourier transforms of the nth-order probability density functions for the random process. So cyclic moments and cyclic cumulants are intimately connected to the fundamental probabilistic structure of communication signals.
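In symbols (a sketch of the standard definitions, not a reproduction of the derivations in My Papers [5,6]):

```latex
% nth-order characteristic function of x(t) at lags \tau_1, \ldots, \tau_n:
\Phi_x(\boldsymbol{\omega}; t) =
  E\left\{ \exp\left( i \sum_{k=1}^{n} \omega_k \, x(t + \tau_k) \right) \right\}
% Moments are coefficients in the Taylor expansion of \Phi_x;
% cumulants are coefficients in the Taylor expansion of \log \Phi_x.
% For a cyclostationary process, the time-varying nth-order cumulant is
% (almost) periodic in t, so it possesses a Fourier series whose
% coefficients are the cyclic cumulants C_x^{\alpha}(\boldsymbol{\tau}):
C_x(t; \boldsymbol{\tau}) = \sum_{\alpha} C_x^{\alpha}(\boldsymbol{\tau}) \, e^{i 2 \pi \alpha t}
```

Nothing in that chain is hand-crafted; each quantity follows mechanically from the probability structure of the process.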
But now let’s get to why this paper is relevant to our ongoing search for a machine that can learn the Fourier transform. The authors place a heavy importance on what they call the “signal representation.” One representation is the one we use quite a lot here at the CSP Blog: uniformly sampled inphase and quadrature (I/Q) signal values over time. A second signal representation that they consider is magnitude-phase (M/P). Here each I/Q sample is represented by its (standard) magnitude and phase. The third signal representation is Fourier transformation (FT) of the I/Q values. As near as I can tell from their mathematical description, these are all invertible. We can recover the I/Q samples from the M/P samples or the FT samples. So, presumably, there is no loss of information between the three representations.
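A quick numerical check of that invertibility claim, using numpy stand-ins for the three representations:

```python
import numpy as np

rng = np.random.default_rng(1)
iq = rng.standard_normal(64) + 1j * rng.standard_normal(64)  # raw I/Q samples

# Magnitude/phase (M/P) representation and its inverse.
mag, phase = np.abs(iq), np.angle(iq)
iq_from_mp = mag * np.exp(1j * phase)

# Fourier-transform (FT) representation and its inverse.
ft = np.fft.fft(iq)
iq_from_ft = np.fft.ifft(ft)

print(np.allclose(iq, iq_from_mp), np.allclose(iq, iq_from_ft))  # True True
```

Both round trips recover the I/Q samples to machine precision, so the three representations carry the same information.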
The issue is that the performance of the trained machines depends on the signal representation used as input to the machine. Let me provide some key quotes:
Abstract: “From our analysis we prove that the wireless data representation impacts the accuracy depending on the specifics and similarities of the wireless signals that need to be differentiated, with different data representations resulting in accuracy variations of up to 29%.”
Section VE: “However, we noticed that the amplitude/phase representation helped the model discriminate the modulation formats better compared to raw IQ time-series data for high SNR scenarios.”
Section VE: “However, again we noticed that the amplitude/phase representation is beneficial for discriminating signals compared to raw IQ data. But the IF identification classifier performed best on FFT data representations.”
Here are two relevant graphs from the paper:
So … why doesn’t the machine learn to immediately take a Fourier transform or immediately take an inverse Fourier transform if it is so beneficial to do so? Here at the CSP Blog, we learned that we must set up the machine to be linear to have any hope that it could learn the Fourier transform. Presumably Kulin’s machine is nonlinear, so it cannot take the Fourier transform as a first step. The machine needs to be simultaneously linear and nonlinear. Or, perhaps there needs to be a series connection of a linear machine followed by a nonlinear machine.
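One way to realize that series connection is a linear front end initialized to (or fixed at) the DFT matrix, feeding a small nonlinear stage. The shapes and layer choices below are my illustrative assumptions, not the architecture from the paper:

```python
import numpy as np

N = 64
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)  # 64-point DFT matrix

# Hypothetical readout stage; the output size 16 is arbitrary.
W2 = np.random.default_rng(2).standard_normal((16, N))

def forward(x):
    spec = F @ x                       # linear stage: exact Fourier transform
    feats = np.abs(spec)               # nonlinearity: spectral magnitude
    return np.maximum(W2 @ feats, 0)   # nonlinearity: ReLU readout

x = np.random.default_rng(3).standard_normal(N) + 0j
print(forward(x).shape)  # (16,)
```

If `F` is left trainable, the question becomes whether gradient descent ever discovers the DFT on its own; initializing it to the DFT sidesteps that question entirely.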
I think this highlights the hyperparameter selection problem. Another indication of that problem is the authors’ admission that they could not replicate the results in , which is a paper by O’Shea, Corgan, and Clancy:
Section VE: The authors in  used IQ data and reported higher accuracy then [sic] the results we obtained. We were not able to reproduce their results after various attempts on the IQ data, which may be due to the difference in the dataset (e.g. number of training examples), train/test split and hyper-parameter tuning.
The data used for this evaluation is the same data set as used in . You may recall I made some remarks about another paper by the authors of  in this post. It’s probably the same paper; I got my copy from their post to arxiv.org.
Also, these results raise the question of the optimal signal representation for any particular machine-learning problem. I believe there are many candidates. Laplace transform? How about converting the complex I/Q data to real-valued data by interpolating, frequency shifting, and taking the real part? Etc., etc.
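For concreteness, here is one numpy sketch of that last conversion: interpolate by two via frequency-domain zero padding, shift the spectrum up to a quarter of the new sampling rate, and take the real part. The upsampling step is simplified (it does not treat the Nyquist bin carefully), so treat it as an illustration rather than a production resampler.

```python
import numpy as np

def complex_to_real(x):
    # Interpolate by 2 via frequency-domain zero padding (simplified:
    # the Nyquist bin is not split, which is fine for this sketch).
    N = len(x)
    X = np.fft.fft(x)
    Xup = np.concatenate([X[:N // 2], np.zeros(N), X[N // 2:]]) * 2
    xup = np.fft.ifft(Xup)
    # Frequency shift to a quarter of the new rate, then keep the real part.
    n = np.arange(2 * N)
    return np.real(xup * np.exp(2j * np.pi * 0.25 * n))

x = np.exp(2j * np.pi * (4 / 64) * np.arange(64))  # complex tone at bin 4
y = complex_to_real(x)
print(len(y))  # twice as many samples, now real-valued
```

A complex tone at bin 4 of the original 64-point grid lands at bin 36 of the 128-point real signal’s spectrum (4 from interpolation plus 32 from the quarter-rate shift), which is one easy way to sanity-check the conversion.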
Finally, how do we know how well the expert-feature-based methods do relative to the authors’ machines? They don’t compare. Even better, use the cyclic moments or cumulants as inputs to a machine and see how that machine does relative to the machines trained on the authors’ three signal representations.
Can a machine learn the Fourier transform? Apparently not when it matters!
Comments, corrections, compliments, and disagreements are welcome …