I first considered whether a machine (neural network) could learn the (64-point, complex-valued) Fourier transform in this post. I used MATLAB’s Neural Network Toolbox and I failed to get good learning results because I did not properly set the machine’s hyperparameters. A kind reader named Vito Dantona provided a comment to that original post that contained good hyperparameter selections, and I’m going to report the new results here in this post.

Since the Fourier transform is linear, the machine should be set up to do linear processing. It can’t just figure that out for itself. Once I used Vito’s suggested hyperparameters to force the machine to be linear, the results became much better:

Looking at the normalized MSE graphs, we can see that a few inputs with low power give rise to learned Fourier transforms with error variance comparable to the input power. But for the most part, the squared error in the learned transform is orders of magnitude smaller than the input power, indicating an accurate result. The lesson is that if we create a neural network that is linear, we can make it learn the Fourier transform over some interval of amplitudes and for signals in the training and testing sets. Pretty good!

However, did the machine learn the Fourier transform in our usual human sense? If it did, it should be able to produce accurate results for signals that are not in the training set and for scaled versions of the training-set input signals where the scaling is not represented in the training set. I think this is called the ability of the machine to generalize.

So, for example, if we consider a simple 20-sample rectangle as our input (20 ones followed by 44 zeros), we can scale it with a variety of scale factors and compare the output of our learned function with the output of MATLAB’s fft.m. The result is:

When the scaling factor is very small, the learned function deviates from the Fourier transform, so even though we got a lot farther with Vito’s help, the machine still didn’t *really* learn the Fourier transform. Near the left edge of the graph above, we obtain the individual results:

This brings me to a paper I came across recently that applies machine learning to the modulation classification problem. It is by M. Kulin, T. Kazaz, I. Moerman, and E. de Poorter and can be found here. The title is “End-to-End Learning from Spectrum Data: A Deep Learning Approach for Wireless Signal Identification in Spectrum Monitoring Applications.”

There is the usual evangelical fervor about how machine learning will solve everything without the need for any human experts. They’re always wanting to get rid of the experts! Except the machine-learning experts, of course. Here’s why I say that. The authors state several times that one of their objectives is to finally free humanity from the shackles of expertly “hand-crafted features:”

**Abstract:** “without requiring design of hand-crafted expert features like higher order cyclic moments”

**Introduction:** “completely eliminates the need for designing expert features such as higher order cyclic moments”

**Section IB:** “The design of these specialized solutions have proven to be time-demanding as they typically rely on manual extraction of expert features for which a significant amount of domain knowledge and engineering is required.”

**Section II:** “without requiring design of hand-crafted expert features like higher order cyclic moments”

**Section V:** [End-to-end learning] “can be applied to various wireless signals to effectively detect the presence of radio emitters in a unified way with requiring design of expert features.”

But we cannot do without the expertise of the machine learners! To wit:

**Section IB:** “The technical approach depicted in this paper is deeply interdisciplinary and systematic, calling for the synergy of expertise of computer scientists, wireless communications engineers, signal processing and machine learning experts …”

And (this kills me), the laborious and error-prone selection of the machine hyperparameters is just the way it is:

**Section IIIB:** “The number of filters per layer is a tunable parameter called a hyper-parameter. Other tunable parameters are the filter size, the number of layers, etc. The selection of values for hyper-parameters may be quite difficult, and finding it [sic] commonly is much [sic] an art as it is science. An optimal choice may only be feasible by trial and error.”

Smells a bit like hand-crafting, don’t you think?

I think these authors are behind the times. What we really need is a machine that can automatically learn good hyperparameters for another machine. Well, and a machine just before that machine, that sets *its* hyperparameters. And, uh, well, I suppose it’s turtles all the way down.

OK, back to the technical ideas here.

First, the authors don’t even cite a paper that uses higher-order cyclic moments or cumulants. They do cite a paper of mine, written with some Virginia Tech researchers a while back, that uses the cyclic domain profile (My Papers [30]), which is essentially the spectral correlation or coherence magnitude viewed edge-on, so that the x-axis is cycle frequency and the y-axis is magnitude. Only the spectral correlation function and the spectral coherence are used in that paper. No higher-order moments in sight. So they neglect to cite any of my other papers on the topic of higher-order moments/cumulants (either the theory or the application), and they don’t cite anybody else appropriate either, like Antonio Napolitano or Octavia Dobre (see The Literature for examples). So I question whether they know what a higher-order cyclic moment is.

Second, higher-order cyclic moments aren’t “hand crafted.” In My Papers [5,6], I show that the higher-order cyclic moments are just components of the coefficients in the series expansion of the various th-order characteristic functions for a cyclostationary random process. And the higher-order cyclic cumulants are coefficients in the series expansion the logarithm of the characteristic functions. And, of course, the characteristic functions are nothing more than the Fourier transforms of the th-order probability density functions for the random process. So cyclic moments and cyclic cumulants are intimately connected to the fundamental probabilistic structure of communication signals.

But now let’s get to why this paper is relevant to our ongoing search for a machine that can learn the Fourier transform. The authors place a heavy importance on what they call the “signal representation.” One representation is the one we use quite a lot here at the CSP Blog: uniformly sampled inphase and quadrature (I/Q) signal values over time. A second signal representation that they consider is magnitude-phase (M/P). Here each I/Q sample is represented by its (standard) magnitude and phase. The third signal representation is Fourier transformation (FT) of the I/Q values. As near as I can tell from their mathematical description, these are all invertible. We can recover the I/Q samples from the M/P samples or the FT samples. So, presumably, there is no loss of information between the three representations.

The issue is that the performance of the trained machines depends on the signal representation used as input to the machine. Let me provide some key quotes:

**Abstract:** “From our analysis we prove that the wireless data representation impacts the accuracy depending on the specifics and similarities of the wireless signals that need to be differentiated, with different data representations resulting in accuracy variations of up to 29%.”

**Section VE:** “However, we noticed that the amplitude/phase representation helped the model discriminate the modulation formats better compared to raw IQ time-series data for high SNR scenarios.”

**Section VE:** “However, again we noticed that the amplitude/phase representation is beneficial for discriminating signals compared to raw IQ data. But the IF identification classifier performed best on FFT data representations.”

Here are two relevant graphs from the paper:

So … why doesn’t the machine learn to immediately take a Fourier transform or immediately take an inverse Fourier transform if it is so beneficial to do so? Here at the CSP Blog, we learned that we must set up the machine to be linear to have any hope that it could learn the Fourier transform. Presumably Kulin’s machine is nonlinear, so it cannot take the Fourier transform as a first step. The machine needs to be simultaneously linear and nonlinear. Or, perhaps there needs to be a series connection of a linear machine followed by a nonlinear machine.

I think this highlights the hyperparameter selection problem. Another indication of that problem is the authors’ admission that they could not replicate the results in [9], which is a paper by O’Shea, Corgan, and Clancy:

**Section VE:** The authors in [9] used IQ data and reported higher accuracy then [sic] the results we obtained. We were not able to reproduce their results after various attempts on the IQ data, which may be due to the difference in the dataset (e.g. number of training examples), train/test split and hyper-parameter tuning.

The data used for this evaluation is the same data set as used in [9]. You may recall I made some remarks about another paper by the authors of [9] in this post. It’s probably the same paper; I got my copy from their post to arxiv.org.

Also, these results beg the question of the optimal signal representation for any particular machine-learning problem. I believe there are many. LaPlace transform? How about converting the complex I/Q data to real-valued data by interpolating, frequency shifting, and taking the real part? Etc., etc.

Finally, how do we know how well the expert-feature-based methods do relative to the authors’ machines? They don’t compare. Even better, use the cyclic moments or cumulants as inputs to a machine and see how that machine does relative to the machines trained on the authors’ three signal representations.

Can a machine learn the Fourier transform? Apparently not when it matters!

Comments, corrections, compliments, and disagreements are welcome …

Wow, I kind of got obsessed thinking about this problem of getting a NN to learn the DFT (or specifically a 64-pt DFT) in the last 24 hours. I haven’t actually tried to implement a network to do this, but I believe my thoughts on the matter might be of interest to you and your readers. I’ll start by saying I’m not sure that this “inside-out” way of thinking is “good” when thinking about applying machine learning, but I’m going to run with it here for a while. (By “inside-out” I mean expecting the NN to implement some recognizable method, such as the FT, as part of solving a bigger problem.)

Now on to the idea of getting an NN to “discover” the DFT. As you and some of your readers have noted, the DFT is linear, so we can help the NN out by forcing the NN to also be linear. It’s actually worth examining that a little deeper. In a fully connected traditional NN, which is what I believe you are using, a “layer” implements the following function:

y = g(Ax + b)

where g() is the activation function (nonlinearity), A is the weight matrix and b is the bias vector. In your most recent effort, you dispensed with g(), so:

y = Ax +b

What wasn’t clear to me was did you stop using your hidden layer? Since obviously repeated applications of linear layers can be reduced to a single linear layer, which in this case would be the output layer, there is nothing gained except more training time and variance by using a hidden layer.

Since we know that the DFT, X, of a time-series, x, can be expressed using a “DFT-matrix”, W, as:

X = Wx

then we should probably get rid of the bias vector, b, also. Uh, oh. In the case of the DFT, X, W, and x are all complex-valued, but when discussing the NN, everything is real-valued. How is the network dealing with that?

You handled this in the input layer by creating an input-layer to the NN that is a stacked vector with real-values on ‘top’ and imaginary values on the ‘bottom’. I think in general it might be better to actually use a network with complex weights, but I don’t know of canned packages that do this. In this pure-linear case, however, everything works out fine. (People have developed complex-input, complex-weight, complex-output networks with complex versions of activation, such as a complex version of ReLU. Back-propagation can be modified to work in the complex-case, too. It’s not an area that I specialize in, but there is on going work here as far as I can tell.)

To see what is happening in the network with a stacked complex vector input layer, let’s revisit the expression involving the DFT-matrix. I’ll use the notation x_r and x_i to denote real and imaginary parts of a vector, respectively. Then our DFT expression is:

X_r = (W_r *x_r – W_i*x_i)

X_i = (W_r * x_i + W_i*x_r)

With the stacked input vector, the network weights in a fully connected network should be equivalent to the following matrix with W_r and W_i as submatrices

W_r -W_i

W_i W_r

This looks good, except that as far as I can tell if we use a standard NN package, the network is going to try to learn W_r and W_i submatricies twice and it isn’t likely to maintain proper symmetry when doing so.

Can we do better? Yes. The first step is not to use a fully connected network. Break the network in half so that the top-half is learning W_r and the bottom-half is learning W_i. By “break in half” I simply mean that a given linear-activation function either accepts input from the top-half of your input vector or the bottom half, but not both as in the case of fully connected network.

For training, your output will either be an example of X_r or X_i. When the output example is X_r, your stacked input will be x_r on top and -x_i on the bottom. When the output example is X_i, your stacked input will be x_i on the top and x_r on the bottom. By training this way, I believe you’ll make the best use of your training data.

Now I’m sure at this point, you’re wondering, just how much are we going to help this machine???? That’s certainly a valid point. I think the answer should be – as much as we can. And that, sir, is how we stay relevant as this Machine Learning scourge sweeps the land ;>) Both Convolutional Neural Nets and Recurrent Neural Nets were developed to solve deficiencies in traditional NN for their target problems (image analysis and sequence analysis, respectively). Perhaps some new modifications will need to take place to effectively apply these tools to problems in statistical signal processing?

Keep up the good work, Chad. This blog is always a joy to read.

Awesome comment Todd! Thanks so much for contributing your expertise. This will help my readers, I am certain.

I kept using the hidden layer–never got rid of it. I hadn’t thought of my question (Can a Machine Learn the Fourier Transform?) as inside-out thinking as much as simplified thinking. OK, I wonder if a machine can learn a general cyclic cumulant or spectral correlation function, but maybe that is going to be too difficult to set up and/or interpret. So, can it learn something more basic? My approach to things is usually pretty simple-minded.

I’ve been working on Can a Machine Learn a PSD Estimator? question for a while. No luck, and I’ll post what I did and the data set soon. Thinking of the frequency-smoothed periodogram method, a machine could arrive at a PSD estimator by first taking a Fourier transform, then squaring, the applying a convolution operation. But I sense trouble ahead trying to combine linear and nonlinear steps in a single machine. I think I need a more complicated machine (“new modifications”). Something along the lines of what you suggest for the FT above. A parallel mix of a linear component and a non-linear component, perhaps with a delay inbetween. Is that helping the machine too much? Yes of course it is! But I like your attitude.

Thanks again.

Great post which made me think a lot!

You are right. In the comments of the previous post, I had just proposed some solutions in order to get Chad’s code to work with minimal modifications. That’s why the hidden layer was still there, but it obviously becomes redundant if everything is linear.

I see your starting point, but I still think that you are implicitly assuming that a machine would decompose the problem into the same intermediate steps an expert engineer would take. The machine is much more stupid! But it could also be much more flexible, if enough layers with meaningful parameters are provided.

However, if this were the case in the reviewed paper, I fully agree with you that the complex signal representation should be irrelevant. It is not required that the machine learns the FFT *explicitly* as a separate building block, but it should learn it at least *implicitly*, so that providing the input in either domain should yield the same output.

My feeling is that the machine is wasting a lot of resources just to try to learn what a complex number is! So, I agree with Todd’s final remark that perhaps the standard real-valued tools are just not suitable for complex signal processing.

I agree with you Vito. I admit I am stuck in the rut of thinking about decomposing a problem into solvable steps that can be attacked using our formidable signal-processing toolkit. I should have said something like “Can a machine implicitly learn the Fourier transform?” in the context of the Kulin paper. And it does not appear to have implicitly or explicitly learned the transform. I didn’t see this coming when I wrote the original post, but I couldn’t pass up the chance to connect Kulin’s paper to our (my) ongoing struggle to get machines to do things *I* think are useful. Following on from your and Todd’s comments, I might do well to revise what I think of as useful…

Vito’s new question for me to ponder: “Can a machine learn the square root of negative one?” Hah!