Desultory CSP: The Human-Genome Edition

And now for something completely different …

Let’s take an excursion outside of “Understanding and Using the Statistics of Communication Signals” by looking at a naturally occurring signal: the human genome.

Anyone can download a complete sequence of nucleic acids that make up the twenty-three human chromosomes at a website maintained by the National Center for Biotechnology Information, which is a branch of the National Institutes for Health in the USA.

Adding up the number of nucleic-acid pairs across the chromosomes yields a sequence of about three billion letters–A, C, G, T–for humans, and other organisms have more or less.

The four letters correspond to four nucleic acids: adenine (A), cytosine (C), guanine (G), and thymine (T). So the genome is a sequence of these acids, arranged in pairs in a double helix, and broken up into separate physical chunks called chromosomes.

Figure 1. From www2.nau.edu. Illustration of the double-helix structure of DNA with the helpful addition of distances between adjacent base-pairs and between successive twists in the helix.

But me, I see a signal. And so I want to process that signal with my CSP tools to see what might be revealed that is not obvious from simpler processing, such as spectrum estimation or autocorrelation estimation.

The Downloaded Files

I’ve downloaded two files: GRCh37_latest_genomic.fna and GRCh38_latest_genomic.fna. These are giant text files that consist of the four ASCII characters G, T, A, C, but also g, t, a, and c. There are some stretches of the files that contain sequences of the letter N as well. Why the upper and lower case versions of the four nucleic acids? It isn’t completely clear to me, but it seems to have to do with repeated sequences throughout the genome. Try to figure it out here. Explanations are welcome in the Comments below.

The N character is used when the acids are not clearly identified by whoever is doing the identification. They are fairly rare and I just skip over them and continue encoding the letters into numbers (see below) without introducing any intervening numbers. A typical stretch of letters from the downloaded files is shown in Figure 2.

Figure 2. Typical stretch of nucleic acid indicators from GRCh37_latest_genomic.fna.

Converting from DNA Letters to Numbers

Now none of my signal-processing tools can process a sequence of letters, so we have to choose a mapping between the four nucleic acids G, A, T, and C and real or complex numbers. The problem is a bit more involved than that, though, because the DNA files contain the four letters G, A, T, and C but also g, a, t, and c, their lowercase versions, as we saw above.

I first tried to encode the letters using a QPSK constellation, and I treated the uppercase and lowercase letters as identical. That produced some mildly interesting spectral correlation plots, but later I tried an 8ASK constellation, where I treat the uppercase and lowercase separately. Maybe that isn’t cool–let me and my readers know if you think that was a mistake.

The 8ASK encoding is: t = -7, T = 7, a = -5, A = 5, c = -3, C = 3, g = -1, and G = 1. I haven’t encoded the entire genome file, just the first 83 million acids. So there’s a lot I’ve not looked at.

CSP Video

Once I had the nucleic acids coded as integers, I had a signal I could process! As usual at the CSP Blog, I used a sliding-block style of processing and assembled the results for each block into a movie. The blocks are 131072 letters long and successive blocks are overlapped by 50%.

We are not processing a function of time here, but of space. Figure 1 comes in handy. Normally we use a sampling rate that is expressed in units of Hertz, which arises from a sampling increment–the separation in time between two successive time-series (signal) values–that is expressed in units of seconds. Here for the genome we have a ‘sampled’ signal that corresponds to a sampling rate expressed in units of reciprocal meters (rm), which arises from a sampling increment–the separation in space between two successive distance-series (signal) values–that is expressed in units of meters. So our sampling increment is 0.34 nm, giving rise to a sampling rate of 2.94 Grm, where ‘Grm’ means giga reciprocal meters.

The blind CSP applied to GRCh37_latest_genomic.fna is shown in Video 1. For each block, I first apply a sine-wave detector to determine if there are any additive sine-wave components. If so, I remove them prior to CSP. The sine-wave-expurgated block is then used as input to the SSCA, producing estimates of the cycle frequencies for the block. These are then used by the FSM to estimate the spectral correlation function, and these spectral correlation function slices are arranged and shown in the uppermost plot. I’ve reduced the magnitude scale to focus on the non-zero cycle frequencies at the expense of truncating the PSD.

Video 1. Sliding-block processing of an 8ASK-encoded human genome. The processing is blind: the cycle frequencies for each block are blindly estimated using the SSCA, then the spectral correlation surface for the blindly estimated cycle frequencies is computed using the FSM. The cyclic-domain profile is shown in the second plot, any detected sine-wave components (first-order cyclic cumulants) are shown in the third plot, and a history of the number of detected cycle frequencies and tone frequencies is shown in the fourth plot.

There are many fascinating frames in the movie, and many cycle frequencies are detected. What do they mean? I don’t know.

A particularly intriguing frame occurs around 0.802 mm, as shown in Figure 3.

Figure 3. A frame from the video in Video 1. Note the regular spacing of the cycle frequencies in the CDP and the overall BPSK-like appearance of the SCF surface!

Anyway. This is desultory, so that’s all I have for now.

Comments appreciated!

Author: Chad Spooner

I'm a signal processing researcher specializing in cyclostationary signal processing (CSP) for communication signals. I hope to use this blog to help others with their cyclo-projects and to learn more about how CSP is being used and extended worldwide.

One thought on “Desultory CSP: The Human-Genome Edition”

  1. Now this is an interesting post. One of the hard parts about describing CSP is presenting a “real world” example that is both cyclostationary and easy to understand; that usually ends up being the temperature. But this is a fascinating example I had not even considered. Thank you!

Leave a Comment, Ask a Question, or Point out an Error

%d