One of the things the machine learners never tire of saying is that their neural-network approach to classification is superior to previous methods because, in part, those older methods use hand-crafted features. They put it in different ways, but somewhere in the introductory section of a machine-learning modulation-recognition (ML/MR) paper, you’ll likely see the claim. You can look through the ML/MR papers I’ve cited in The Literature ([R133]-[R146]) if you are curious, but I’ll extract a couple here just to illustrate the idea.
Let’s start with O’Shea’s comment to my original ML/MR post: he laments the excessive time it would take to explain my “careful manual feature engineering” and he notes that “we are not all extreme experts in high order moment engineering.”
From Kulin et al. ([R135]), “without requiring design of hand-crafted expert features like higher-order cyclic moments,” and “design of these specialized solutions have proven to be time-demanding as they typically rely on manual extraction of expert features.”
From Rajendran et al. ([R136]), “without requiring expert features like higher-order cyclic moments,” and “This manual selection of expert features is tedious …”.
Dogberry: but truly, for mine own part, if I were as tedious as a king, I could find it in my heart to bestow it all of your worship. Leonato: All thy tediousness on me, ah? –Shakespeare
But all of these, and more, seem to derive from [R138] itself. We see statements like
This [good MR] is a significant challenge in the community as expert systems designed to perform well on specialized tasks often lack flexibility and can be expensive and tedious to develop analytically.
I’ve also seen similar monikers in unpublished technical reports, where the phrases “deterministic feature” and “engineered feature” are used to distinguish anything that is used for MR that is not a result of training a neural network.
As I’ve remarked before, almost nobody advocates using higher-order cyclic moments to do MR. Many advocate stationary-signal cumulants (The Literature [R90] and [R147] for example) and many advocate cyclic cumulants (My Papers [25,26,28]). And there is a decades-long sequence of papers that investigate the use of second-order cyclic moments for MR; such moments can be viewed in almost all cases as cyclic cumulants. That just shows that these researchers don’t really pay attention to prior work (too long; didn’t read).
But my task here is not to reprise my criticism of [R138]; this is a lighthearted post. The question at hand is what these researchers mean by a hand-crafted or engineered feature. And then to make our own assessment: are cyclic moments hand-crafted? Are cyclic cumulants engineered? What about probability density functions?
What Could an ‘Engineered Feature’ or ‘Hand-Crafted Feature’ Mean?
Let’s start where one usually starts with this kind of musing: The dictionary. Dictionary.com defines ‘handcrafted’ as ‘made by handicraft.’ Not too helpful. Looking at the synonyms and antonyms is more revealing. Synonyms are ‘homespun’ and ‘homemade’ and antonyms are ‘factory-made’ and … wait for it … ‘machine-made.’
Well, this is 2020, these are old words, the CSP Blog is international, and so we have to dig deeper: what exactly does homespun mean? From Oxford Languages and Merriam-Webster, we find it means ‘simple and unsophisticated’ and, sadly, ‘homely,’ which means ugly. Ouch.
Taking this all in, the learners are saying any feature-based modulation-recognition method is not machine-made (OK!) and is simple, unsophisticated, and quite possibly ugly. I don’t have much criticism for calling CSP ugly since beauty is in the eye of the beholder. But … simple? CSP and CSP-based modulation recognition are simple and unsophisticated? To me, this is starting to look like a diss rather than any kind of substantive naming or accurate technical description.
Turning to ‘engineered feature,’ if we do a Google search of that phrase we find lots of links to the juxtaposed phrase ‘feature engineering.’ From Wikipedia:
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.
https://en.wikipedia.org/wiki/Feature_engineering
Uh-oh! So when they talk about engineered features, perhaps they are talking about feature engineering in machine learning, but features like the cyclic cumulants, spectral correlation function, PSD, cyclic polyspectra, etc., are not found or created by ‘data-mining techniques.’ They are developed or discovered using mathematical modeling and mathematical analysis–analysis that is independent of the modulation-recognition problem. But other quantities derived from a signal or signal model might have no clear connection to mathematical analysis or modeling. I might say I want to use the feature that is equal to the floor of the Bessel function applied to the real part of every twelfth sample, except on Tuesday when it is every thirteenth (The name of the game is called Fizzbin. -James T. Kirk). Seems fine to call that hand-crafted and simple.
My conclusion about ‘engineered features’ is that this is a bucket into which the learners place anything that might be of value for the problem at hand but that is found by rooting around in voluminous data rather than by directly training a neural network. When they see things like moments and cumulants, they throw them into the bucket too.
PDFs and Features
I sense a gap in this taxonomy where we’ve got inscrutable machine-created neural-network connections on one hand and mysterious made-up hand-crafted features on the other. That gap is called decision theory. This is a branch of mathematics that combines probability, statistics, and estimation theory. Running all through decision theory is the mathematical object called the probability density function (PDF). You can characterize a random variable, a collection of random variables, and random processes by one or more PDFs. In the simplest case of a single random variable, you just need its PDF. For collections of random variables, and random processes, you need a set of joint PDFs.
Crucially, the cyclic autocorrelation functions, spectral correlation functions, cyclic moments, cyclic cumulants, and cyclic polyspectra are all intimately related to PDFs. If you know the set of moments for a random variable, you can (under mild technical conditions) construct the PDF. If you know the PDF, you can find any moment you want, provided it exists. If you know the moments, you know the cumulants. If you know the moments or cumulants, you know the polyspectra. If you know the moments or cumulants, you know the spectral correlation function.
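To make the "if you know the moments, you know the cumulants" step concrete, here is a minimal sketch for an ordinary scalar random variable (not the cyclic versions used in CSP). It applies the standard moment-to-cumulant relations up to order four; the function name and the two test distributions are my own choices for illustration.

```python
def cumulants_from_moments(m1, m2, m3, m4):
    """Return the first four cumulants given the raw moments m_k = E[X^k].

    These are the standard moment-to-cumulant relations for a scalar
    random variable (illustrative helper, not from the original post).
    """
    k1 = m1
    k2 = m2 - m1**2
    k3 = m3 - 3 * m2 * m1 + 2 * m1**3
    k4 = m4 - 4 * m3 * m1 - 3 * m2**2 + 12 * m2 * m1**2 - 6 * m1**4
    return k1, k2, k3, k4


# Standard Gaussian: raw moments 0, 1, 0, 3.
gauss = cumulants_from_moments(0.0, 1.0, 0.0, 3.0)

# Unit-rate exponential: raw moments k! = 1, 2, 6, 24.
expo = cumulants_from_moments(1.0, 2.0, 6.0, 24.0)

print(gauss)  # (0.0, 1.0, 0.0, 0.0)
print(expo)   # (1.0, 1.0, 2.0, 6.0)
```

The Gaussian result shows the familiar property that all cumulants above order two vanish, which is one reason cumulants are often preferred over moments as features.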
We use moments and cumulants instead of PDFs because they are much easier to estimate when we restrict our attention to just a few of them, such as all the cyclic cumulants for orders one and two, which is equivalent to the information in the mean and autocorrelation functions. We also use them because sometimes an optimization problem (involving PDFs) leads to an estimator that is closely approximated by truncating some series expansion, which involves moments and/or their estimates. So a weak-signal detector might have a structure that involves an infinite number of moments (thereby taking into account the entire set of involved PDFs), but most of the terms in the structure turn out to be negligible. Out pop just a few key moments. If someone just presents this final truncated answer, it might look like that person labored in their garage, hand-crafting these cryptic seemingly-out-of-the-blue lovingly engineered functions for your delectation and approval.
This line of thought leads me to wonder if the learners think that PDFs themselves are ‘hand-crafted’ or ‘engineered’ features. Scoff scoff. Well … um, er, … are they?
Someone, somewhere, at some time had the idea of characterizing a random quantity with mathematics. The random quantity is random because it is not completely predictable. We don’t know the value of the variable in advance of observing it. This contrasts with a deterministic function like $f(t) = at + b$, where we know the value for all $t$ given we know the values of $a$ and $b$ (think of $t$ as time here).
The PDF is the derivative of the cumulative distribution function (CDF), which is the probability that a random variable takes on a value less than or equal to some constant $x$. The CDF is a collection of probabilities of events. The events are just things like ‘the random variable is less than ten,’ and ‘the random variable is less than one million,’ etc. Clearly this is a powerful set of events to analyze, and two highly useful functions arise from that event analysis: the CDF and its derivative the PDF.
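The derivative relationship between the CDF and the PDF can be checked numerically. The following is a small sketch, assuming a zero-mean, unit-variance Gaussian random variable as the example (my choice, not the post's); it differentiates the CDF with a central difference and compares the result with the closed-form PDF. The helper names and the step size are illustrative.

```python
import math


def normal_cdf(x):
    """CDF of a zero-mean, unit-variance Gaussian: P(X <= x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    """Closed-form PDF of the same Gaussian."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def numerical_derivative(func, x, h=1e-5):
    """Central-difference approximation to the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


# Compare d/dx CDF(x) with PDF(x) at a few arbitrary points.
points = [-2.0, -0.5, 0.0, 1.0, 2.5]
max_error = max(
    abs(numerical_derivative(normal_cdf, x) - normal_pdf(x)) for x in points
)
print(max_error)  # small: limited only by the finite-difference step
```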
But maybe there are other sets of events that are even better than those involved in the CDF. Maybe analyzing those events would lead to even more powerful functions and analysis than can be afforded by the CDF and PDF. And if so, I suppose that might lend credibility to the putative claim that even PDFs and CDFs are just made-up things, just hand-crafted items not really distinguishable from a bunch of other hand-crafted similar items. Not fundamental, and certainly not machine-made.
Alternatives to the CDF?
The most basic event I can think of is the equality event
P(X = x) \hfill (1)
where $X$ is a real-valued random variable, $x$ is some real number, and $P(\cdot)$ denotes the probability of the event described by $\cdot$.
The problem with the equality event in (1) is that the probability is zero for many kinds of random variables $X$. Consider a random variable that can take any real value on a finite interval $[a, b]$ with equal probability. Then the probability that it takes on any particular value is zero. So the equality event doesn’t help us with the mathematical characterization of an informal idea like ‘equally probable outcomes on a real interval.’
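A short worked version of that claim, as a sketch: assume the random variable is uniform on an interval written $[a, b]$ with $a < b$, and take a point $x$ with $a < x \leq b$. The symbols $a$, $b$, and $\epsilon$ are my notation, introduced only for this illustration.

```latex
% Assumption: X is uniform on [a, b], a < b, and a < x <= b.
\begin{align*}
P(X \leq x) &= \frac{x - a}{b - a}, \qquad a \leq x \leq b, \\
P(X = x) &= \lim_{\epsilon \to 0^+} P(x - \epsilon < X \leq x) \\
         &= \lim_{\epsilon \to 0^+} \left[ P(X \leq x) - P(X \leq x - \epsilon) \right] \\
         &= \lim_{\epsilon \to 0^+} \frac{\epsilon}{b - a} = 0 .
\end{align*}
```

The equality event is squeezed between ever-narrower interval events, each of which has probability proportional to its width, so nothing is left in the limit.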
OK, then, the next simplest is whether the random variable is greater than $x$,
P(X > x) \hfill (2)
which is simply related to the probability that the random variable is less than or equal to $x$,
P(X > x) = 1 - P(X \leq x)
because either it is greater than $x$ or it is less-than-or-equal-to, not both.
We could conceive of building a probability theory from events on an interval, such as
P(x_1 < X \leq x_2)
but again this kind of event can be decomposed into expressions involving only probabilities of the form $P(X \leq x)$.
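Both reductions (the greater-than event and the interval event) can be checked on data by counting. This is a minimal sketch, assuming Gaussian samples and arbitrary interval endpoints `a` and `b` that I picked for illustration; it works with raw counts, so the identities hold exactly rather than approximately.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
samples = [random.gauss(0.0, 1.0) for _ in range(10_000)]
n = len(samples)


def count_leq(x):
    """Number of samples <= x; dividing by n gives the empirical CDF at x."""
    return sum(1 for s in samples if s <= x)


# Hypothetical interval endpoints chosen only for this example.
a, b = -0.5, 1.2

count_interval = sum(1 for s in samples if a < s <= b)
count_greater = sum(1 for s in samples if s > b)

# Interval event: #(a < X <= b) = #(X <= b) - #(X <= a)
interval_ok = count_interval == count_leq(b) - count_leq(a)

# Greater-than event: #(X > b) = n - #(X <= b)
complement_ok = count_greater == n - count_leq(b)

print(interval_ok, complement_ok)  # True True
```

Nothing here depends on the Gaussian choice; any real-valued samples would satisfy the same counting identities, which is the sense in which the less-than-or-equal event is the basic building block.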
We could conceive of building the theory using events that involve some function $g(\cdot)$ operating on $X$, such as
P(g(X) \leq x)
But this is surely less fundamental than (2). So it seems to me that the CDF, which is simply
C_X(x) = P(X \leq x) \hfill (4)
is basic to the mathematical modeling of uncertain–random–variables. What is more basic?
Just a Diss
In the end, then, the CDF and PDF are fundamental to the mathematical modeling of random phenomena, which includes the communication signals that are of primary interest to the CSP Blog, and which comprise the inputs to the learners’ recognizers just the same as they comprise the inputs to statistics-based or feature-based recognizers.
The refrain of ‘save us, machines, from the horrors of hand-crafted tediously extracted painstakingly manually created features’ really seems to boil down to ‘your math and your features are sooo old-fashioned, man … try to keep up, will ya?’ Jus’ a diss, is all. It would sting if I ever saw evidence that the learners actually understand decision theory or the structure of communication signal models. It may sting yet.
Gully Foyle is my name; And Terra is my nation; Deep space is my dwelling place; The stars my destination. -Alfred Bester
‘Hand-crafted’ is a synonym of ‘homespun,’ which means simple and unsophisticated, and is an antonym of ‘machine-made,’ which is what the learners are striving for. So I suggest that if we in the statistics community want to return the diss favor, we might refer to the machine-learning modulation-recognition work as ‘factory-farmed MR’.