On Noisy Duplication Channels
with Markov Sources
Abstract
Channels with noisy duplications have recently been used to model the nanopore sequencer. This paper extends some foundational information-theoretic results to this new scenario. We prove the asymptotic equipartition property (AEP) for noisy duplication processes based on ergodic Markov processes. A consequence is that the noisy duplication channel is information stable for ergodic Markov sources, and therefore the channel capacity constrained to Markov sources is the Markov-constrained Shannon capacity. We use the AEP to estimate lower bounds on the capacity of the binary symmetric channel with Bernoulli and geometric duplications using Monte Carlo simulations. In addition, we relate the AEP for noisy duplication processes to the AEP for hidden semi-Markov processes.
I Introduction
Recently, the nanopore sequencer in DNA storage systems has motivated the study of channels with noisy duplications [10]. In this application, a sequence of nucleotide molecules passes through a microscopic pore that outputs a current response that depends on the nucleotides inside the pore. The mechanism that feeds the nucleotides through the pore varies in speed, so the sampled current signal contains a random number of samples per nucleotide. In the presence of measurement noise, this imperfect mechanism results in a channel with noisy duplications. This paper is an information-theoretic analysis of generic noisy duplication channels, not specifically related to the model of the nanopore sequencer.
Numerical capacity bounds of noisy duplication channels were computed in [11, 13] using Monte Carlo simulations to characterize the theoretical performance of the nanopore sequencer. However, the results were based on the assumption that the asymptotic equipartition property (AEP) holds for the channel processes of noisy duplication channels. The key contribution of this paper is proving that the AEP holds for discrete-time ergodic Markov sources on a finite state space. A consequence of our results is that the Markov-constrained channel capacity is the Markov-constrained Shannon capacity , which is a lower bound on the channel capacity for arbitrary sources. We compute numerical lower bounds on for the binary symmetric channel (BSC) with Bernoulli and geometric duplications by choosing a source. These extend the numerical capacity bounds for the sticky channels in [14] that correspond to the noiseless cases. In addition, we relate the AEP for noisy duplication processes to the AEP for hidden semi-Markov processes (HSMPs), which has largely been studied only in the special case of semi-Markov processes (SMPs) [5, 6].
I-A Noisy duplication channel
The noisy duplication channel for the special case of Markov sources is conveniently described using SMPs. For a sequence of channel inputs , the sequence of channel states is the discrete-time homogeneous Markov process (MP) on a finite state space . Since there is a one-to-one correspondence between the channel inputs and the channel states, we will conveniently consider the latter as being the input to the channel. This Markov source is ergodic if its Markov transition probability matrix is irreducible. For an arbitrary initial state , their relationship is visualised as
for channel-uses, where each arrow corresponds to an input that triggers a change in channel state. Next, the channel states are duplicated according to the i.i.d. state duration process on a discrete support , and then sequentially concatenated together to form the SMP on . For channel-uses, we have the duplicated states
where for convenience we additionally define the jump time for the index of the last sample in the -th segment, forming the jump time process . This process is semi-Markov since the Markov property only holds at the jump time between each segment of duplications. Each duplicated state in undergoes the channel mapping , which is identical for each duplicated state. Then it is corrupted by memoryless noise to give the HSMP . For a channel input of channel-uses, the channel output is and the channel joint input-output is .
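The channel model above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes binary states, an identity channel mapping, and a BSC as the memoryless noise. The function name `simulate_noisy_duplication` and all of its parameters are hypothetical.

```python
import random


def simulate_noisy_duplication(n, P, duration_pmf, crossover, seed=0):
    """Simulate n channel-uses of a binary noisy duplication channel.

    P            : Markov transition matrix over states {0, 1} (list of rows)
    duration_pmf : dict mapping a duration d >= 1 to its probability
    crossover    : BSC crossover probability applied to every duplicated sample
    Returns (states, durations, output).

    Assumptions (not taken from the paper): identity channel mapping,
    binary alphabet, and initial state 0.
    """
    rng = random.Random(seed)
    support = list(duration_pmf)
    weights = [duration_pmf[d] for d in support]

    state = 0  # arbitrary initial state
    states, durations, output = [], [], []
    for _ in range(n):
        # One channel input triggers one Markov state transition.
        state = rng.choices(range(len(P)), weights=P[state])[0]
        # Draw an i.i.d. duration for this state.
        d = rng.choices(support, weights=weights)[0]
        states.append(state)
        durations.append(d)
        # Duplicate the state d times; each copy passes through the BSC.
        for _ in range(d):
            flip = rng.random() < crossover
            output.append(state ^ int(flip))
    return states, durations, output
```

The jump times are the running sums of `durations`, so the output length equals the last jump time. Setting `crossover=0.0` recovers the noiseless semi-Markov process of duplicated states.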
I-B Entropy rates
For the MP , the entropy rate is . The HSMP has entropy rate . When this HSMP is randomly indexed as using i.i.d. indexing process , the entropy rate is . Similarly, we have the entropy rate of the joint process as . The existence of these two randomly-indexed entropy rates is proven in Lemma 1. Finally, let the (mutual) information rate of the noisy duplication channel with a Markov source be .
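For the Markov source alone, the entropy rate has the standard closed form: the stationary average of the row entropies of the transition matrix. The sketch below computes it in bits. The function names are illustrative, and the stationary distribution is found by iterating a lazy version of the chain, which is a convenience choice rather than anything prescribed by the paper.

```python
import math


def stationary_distribution(P, iters=100_000, tol=1e-14):
    """Stationary distribution of an irreducible transition matrix P."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        # The lazy step (I + P) / 2 has the same stationary distribution
        # as P and converges even when P is periodic.
        nxt = [
            0.5 * pi[j] + 0.5 * sum(pi[i] * P[i][j] for i in range(k))
            for j in range(k)
        ]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def markov_entropy_rate(P):
    """Entropy rate (bits per symbol) of a stationary Markov process."""
    pi = stationary_distribution(P)
    k = len(P)
    return -sum(
        pi[i] * P[i][j] * math.log2(P[i][j])
        for i in range(k)
        for j in range(k)
        if P[i][j] > 0
    )
```

For independent and uniformly distributed binary inputs, `markov_entropy_rate([[0.5, 0.5], [0.5, 0.5]])` evaluates to one bit per symbol. The randomly indexed output and joint entropy rates have no such closed form, which is why they are treated through the AEP.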
Lemma 1.
The entropy rates and exist.
Proof.
Let . Observe that
(1)
(2)
(3)
(4)
(5)
then is sub-additive. By Fekete’s lemma [4, Lemma 4A.2], we have that exists. Each of the above steps similarly applies when setting . ∎
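For reference, Fekete's lemma in its standard form is stated below. The symbol a_n is illustrative and stands for whichever sub-additive sequence of block entropies the proof works with.

```latex
% Fekete's lemma for a sub-additive real sequence (a_n); a_n is an assumed symbol.
a_{m+n} \le a_m + a_n \quad \text{for all } m, n \ge 1
\qquad \Longrightarrow \qquad
\lim_{n \to \infty} \frac{a_n}{n} = \inf_{n \ge 1} \frac{a_n}{n}.
```

Since entropies are non-negative, the infimum is finite, so the limit is a well-defined entropy rate.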
II Asymptotic equipartition property
II-A Noisy duplication processes
In this section, we consider the AEP for the output entropy rate and the joint input-output entropy rate . These results will be proven using almost identical techniques based on genie-aided markers. Since we cannot directly use ergodicity to prove the AEP through the Shannon-McMillan-Breiman theorem (SMB) [1], we use the side-information of genie-aided markers to force ergodicity in the noisy duplication processes with respect to the sub-sequences between the markers in order to apply SMB and indirectly prove the AEP.
Theorem 1 (Output AEP).
If is ergodic, then
(6)
Proof sketch.
Let be the markers up to the -th symbol separated by symbols. Let be the number of markers up to , then we have for . Let the sample entropy rate of be , and let the entropy rate be with limit (which exists from Lemma 1). For a fixed , let the sample entropy rate of be , and let the entropy rate be with limit .
• Split problem into three parts: We need to show that as . By the triangle inequality, we need only show the chain of convergences for some as .
- •
• Convergence of sample entropy rates with markers: Observe that is decreasing towards for increasing since the side-information of markers reduces the number of terms in the marginalisation of . Then and Markov’s inequality says they converge since .
• Shannon-McMillan-Breiman theorem with markers: Form a hidden MP (HMP) with based on the MP with in between the markers. The HMP is ergodic since the MP is ergodic [9]. Then as by SMB.
∎
Theorem 2 (Joint AEP).
If is ergodic, then
(7)
Proof.
There are many consequences of these AEP results. In Section III, we will use them to derive achievable lower bounds of channel capacity and demonstrate how they can be estimated in practice. Before that, let us explore another AEP with an interesting connection to the output AEP.
II-B Hidden semi-Markov processes
For noisy duplication channels, we were interested in the case where there are inputs and duplicated samples at the output. Alternatively, we can choose the number of outputs to be samples, and therefore the number of inputs is a random variable much smaller than . We consider the AEP for discrete-time HSMPs over a finite state space, extending the AEP for discrete-time SMPs over a finite state space [5], which was later extended to a Borel state space [6] (not considered in this paper). The central idea behind proving our AEP is to embed the SMP in a MP, for which we can leverage existing results.
Definition 1 (Embedded SMP [7]).
The embedded SMP is a MP with transition probabilities
(8)
for all and for all .
An important property of the embedded duplication process is that it preserves ergodicity. Observe that the embedded durations for each channel input form a sequence of states with only one possible transition, the “extending” transition, until the “terminating” transition is reached. Therefore, the embedding MP is irreducible and thus obeys the ergodic theorem. This embedding of the SMP will be used to prove the AEP for HSMPs in Theorem 3. Further, this HSMP AEP will be related to the output AEP in Theorem 1 through its entropy rate using Lemma 2.
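One common way to realise such an embedding tracks the pair (state, elapsed duration). It is sketched here under assumptions and is not necessarily the exact construction in Definition 1. The “extending” transition increments the elapsed count, and the “terminating” transition draws the next state from the Markov matrix. The function name `embed_smp` and the finite duration support are assumptions.

```python
def embed_smp(P, duration_pmf):
    """Embed a semi-Markov process into a Markov process on (state, elapsed).

    P            : Markov transition matrix (list of rows)
    duration_pmf : dict mapping a duration d >= 1 to its probability
    Returns (states, T), where T[i][j] is the probability of moving from
    embedded state states[i] to embedded state states[j].
    """
    d_max = max(d for d, p in duration_pmf.items() if p > 0)
    k = len(P)
    states = [(x, d) for x in range(k) for d in range(1, d_max + 1)]
    index = {s: i for i, s in enumerate(states)}
    T = [[0.0] * len(states) for _ in states]
    for (x, d), i in index.items():
        # Hazard: probability the segment ends now, given it has lasted d samples.
        survival = sum(p for dd, p in duration_pmf.items() if dd >= d)
        hazard = duration_pmf.get(d, 0.0) / survival
        if d < d_max:
            # "Extending" transition: stay in state x, one more duplicate.
            T[i][index[(x, d + 1)]] = 1.0 - hazard
        for x_next in range(k):
            # "Terminating" transition: a new segment starts in state x_next.
            T[i][index[(x_next, 1)]] += hazard * P[x][x_next]
    return states, T
```

Under this construction, the embedded states with elapsed count 1 are exactly those that begin a new segment, so counting visits to them recovers the number of inputs consumed so far.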
Lemma 2 (Randomly indexed entropy rate).
Let be the entropy rate for fixed-length inputs and variable-length outputs, and let be the entropy rate for variable-length inputs and fixed-length outputs. Then .
Proof.
Let be an embedding of the randomly indexed MP . Observe that the number of random inputs for output samples is where is the set of embedding states that begin a new segment. With this, we will show that the entropy rate goes to zero as .
By Hoeffding’s inequality for MPs [3], we have the concentration inequality for some constant that depends on the convergence speed of the MP to its stationary distribution. Let , then we can partition the entropy of and bound it as
(9)
(10)
which uses the conditioning property, Hoeffding’s inequality, and that since conditioned on is supported on the interval .
Observe that for constant and for all . Then is arbitrarily close to for sufficiently large , and
(11)
(12)
(13)
(14)
where (14) uses a sandwich bound on the conditional entropy , given by
(15)
which is derived from . By the (randomly-indexed) law of large numbers as , we can squeeze and show it has the limit . Since is increasing in , the left-hand tail of is arbitrarily small up to any given for sufficiently large , showing convergence of (14).∎
Theorem 3 (HSMP AEP).
If is ergodic, then
(16)
Proof.
This theorem, combined with the AEP for the output of the noisy duplication channel, shows that the two AEPs are equivalent. An interesting consequence is that the true block length , which is unknown from the observed samples, dominates the marginalisation of in the entropy rate. Whether this has practical implications is unclear. It does, however, allow us to use the classical forward algorithm to estimate with the embedding using techniques for finite-state channels [15, 2].
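The forward-algorithm estimate mentioned above can be sketched for a generic finite-state hidden Markov model. For the HSMP, the hidden state would be an embedded (state, elapsed duration) pair and the emission would depend only on the state component. The function name and argument layout below are assumptions, and the per-step normalisation is the usual numerical safeguard against underflow.

```python
import math


def sample_entropy_rate(T, init, emit, obs):
    """Return -(1/N) log2 P(obs) via the scaled forward recursion.

    T[i][j]    : hidden-state transition probability
    init[i]    : probability of the first hidden state
    emit[i][y] : probability that hidden state i emits symbol y
    obs        : observed symbols, a non-empty sequence of length N
    """
    k = len(T)
    alpha = [init[i] * emit[i][obs[0]] for i in range(k)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Predict the next hidden state, then weight by the emission.
            alpha = [
                sum(alpha[i] * T[i][j] for i in range(k)) * emit[j][obs[t]]
                for j in range(k)
            ]
        # The scale factor is P(obs[t] | obs[0..t-1]); its logs sum to log P(obs).
        scale = sum(alpha)
        log_prob += math.log2(scale)
        alpha = [a / scale for a in alpha]
    return -log_prob / len(obs)
```

By the AEP, evaluating this quantity on one long simulated output sequence gives an estimate of the corresponding entropy rate. An observation with zero probability under the model makes `scale` zero and raises an error, which the sketch does not handle.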
III Markov-constrained channel capacity
An open problem for the noisy duplication channel is its channel capacity. If the channel were ergodic, the channel capacity would be given by the Shannon capacity formula. However, this must be proven directly since ergodicity is not given. In the special case of ergodic Markov sources, this result turns out to be a corollary of the AEPs in Theorem 1 and Theorem 2 since they imply information stability.
Theorem 4.
If the Markov source with Markov transition matrix is ergodic, then the Markov-constrained channel capacity is the Markov-constrained Shannon capacity
(17)
where is the set of all ergodic Markov transition matrices.
Proof.
Consider the information density
whose expectation is the finite-letter mutual information rate . Observe that the first and second terms converge in probability to and , respectively, due to Theorem 1 and Theorem 2, and that the third term converges to due to SMB. Consequently, the noisy duplication channel with an ergodic Markov source is information stable. This is a special case of the general capacity formula in [16], and implies that the Markov-constrained Shannon capacity is achievable using the capacity-achieving Markov source. ∎
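The three-term structure used in the proof can be written out as follows. The notation is assumed for illustration only: x^n denotes the block of n inputs, y the corresponding (randomly long) output block, and H_X, H_Y, H_{XY} the source, output, and joint entropy rates.

```latex
% Assumed notation: x^n is the input block, y the output block it produces.
i_n(x^n; y)
  = \frac{1}{n} \log \frac{P(x^n, y)}{P(x^n)\, P(y)}
  = \underbrace{-\tfrac{1}{n} \log P(y)}_{\to\, H_Y}
  \;-\; \underbrace{\left( -\tfrac{1}{n} \log P(x^n, y) \right)}_{\to\, H_{XY}}
  \;+\; \underbrace{-\tfrac{1}{n} \log P(x^n)}_{\to\, H_X} .
```

Under this notation, the information density converges in probability to H_X + H_Y - H_{XY}, the information rate, which is what information stability requires.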
For channel capacity with an arbitrary source, we have since the optimal source may not be Markov. Even in the restricted case of Markov sources, finding the Markov source that achieves is an open problem. However, there exist heuristic algorithms [11, 13] based on the generalised Blahut-Arimoto algorithm (GBAA) [8, 20] for finite-state channels that can significantly improve upon the benchmark Markov source with independent and uniformly distributed inputs (i.e., the maximum-entropy Markov source).
In the following examples, we consider some simple capacity bounds of a noisy duplication channel that concatenates the BSC with the sticky channel in [14] for Bernoulli and geometric duplications.
III-A BSC with Bernoulli duplications
The simplest non-trivial noisy duplication channel is a BSC with Bernoulli duplications. In particular, we consider a BSC with crossover probability and durations on with duplication probability . Each source bit is duplicated with probability , and then each bit goes through the BSC that corrupts it with probability .
In Fig. 1, we use the Monte Carlo technique from [13] with a block length of to compute the achievable information rate of this channel in the case of a source. When , the rate is equal to the capacity of the BSC, . When , the rate is equal to the capacity of the BSC with looks, . The red curve is when , which starts at and ends at . The blue curve is when , which starts at and ends at . When the BSC with Bernoulli duplications simplifies into a sticky channel with Bernoulli duplications with rate for a source and channel capacity for the capacity-achieving source, which can be computed numerically [14]. The capacity of the sticky channel with Bernoulli duplications is an upper bound on the capacity of the BSC with Bernoulli duplications.
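The two endpoint values quoted above can be checked numerically. The sketch below uses the standard BSC capacity, one minus the binary entropy of the crossover probability, and computes the mutual information between a uniform input bit and several independent BSC observations of it. A uniform input is capacity-achieving here by symmetry. The function names and the `looks` parameter are illustrative; for example, `looks=2` would apply if every bit is observed exactly twice.

```python
import math


def h2(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(p):
    """Capacity of the BSC with crossover probability p."""
    return 1.0 - h2(p)


def bsc_looks_capacity(p, looks):
    """Capacity of a BSC(p) whose input bit is observed `looks` times.

    The number of ones k among the observations is a sufficient statistic,
    so the mutual information is computed between the input bit and k.
    """
    info = 0.0
    for k in range(looks + 1):
        c = math.comb(looks, k)
        pk0 = c * p**k * (1 - p) ** (looks - k)  # P(k ones | input 0)
        pk1 = c * (1 - p) ** k * p ** (looks - k)  # P(k ones | input 1)
        pk = 0.5 * (pk0 + pk1)
        for cond in (pk0, pk1):
            if cond > 0:
                info += 0.5 * cond * math.log2(cond / pk)
    return info
```

With a single look the two functions agree, and extra looks can only increase the rate for a fixed crossover probability.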
III-B BSC with geometric duplications
We now consider a BSC with geometric duplications, which is a slightly more complex channel compared to the BSC with Bernoulli duplications. In particular, we consider a BSC with crossover probability and durations on with duplication probability . Each source bit is repeatedly duplicated with probability until the first non-duplication with probability , and then each bit goes through the BSC that corrupts it with probability .
In Fig. 2, we once again use the Monte Carlo technique from [13] with a block length of to compute the achievable information rate of this channel in the case of a source. As noted in [14], accurately computing rates with geometric duplications and a high is challenging, and therefore we stop at (just over duplications on average) as they did. In addition, the geometric distribution is truncated after samples. Analogously to the previous example, when the BSC with geometric duplications simplifies into a sticky channel with geometric duplications with rate for a source and channel capacity for the capacity-achieving source, which can be computed numerically [14]. The capacity of the sticky channel with geometric duplications is an upper bound on the capacity of the BSC with geometric duplications.
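The truncated geometric duration distribution can be sketched as follows. Renormalising after truncation is an assumption, since the text does not say how the truncated tail mass is handled, and the names `truncated_geometric_pmf`, `q`, and `d_max` are illustrative.

```python
def truncated_geometric_pmf(q, d_max):
    """Duration pmf on {1, ..., d_max} with P(D = d) proportional to q**(d-1) * (1-q).

    q     : probability of one further duplication
    d_max : truncation point (assumed to be renormalised, not clipped)
    """
    weights = [q ** (d - 1) * (1 - q) for d in range(1, d_max + 1)]
    total = sum(weights)
    return {d: w / total for d, w in zip(range(1, d_max + 1), weights)}


def mean_duration(pmf):
    """Average number of output samples per source bit."""
    return sum(d * p for d, p in pmf.items())
```

Without truncation, the mean number of samples per source bit is 1/(1-q), which grows quickly as q approaches one. This is consistent with the remark that high duplication probabilities are hard to simulate accurately.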
IV Conclusion
This paper studied the noisy duplication channel and established its channel capacity as the Markov-constrained Shannon capacity in the special case of ergodic Markov sources. This was proven through the AEP for noisy duplication processes, which was related to the AEP for HSMPs. The motivation for these results was to provide a theoretical underpinning for the numerical results in [11, 13]. A significant remaining open problem is the construction of codes that achieve the Markov-constrained capacity in practice; work on this is still in its early stages [17, 18, 12, 19]. Further progress in this area could greatly advance DNA storage systems based on nanopore sequencing.
Appendix A Preliminary lemmas
Lemma A1.
.
Proof.
We start from the inequality [4, Lemma 4A.1]. Then . ∎
Lemma A2.
.
Proof.
By Hoeffding’s inequality, we have the concentration inequality for some constant . Then the proof follows using the same argument as in Lemma 2. Alternatively, we could use the Gaussian approximation.∎
Appendix B Proof of Theorem 1
Split problem into three parts. We want to show that for any and we have for sufficiently large . Let , , and . Then we have the bound
(18)
(19)
(20)
and we want to show that each term is bounded by , where is fixed and does not affect our choice of . We will prove this in three parts and then combine the results.
Convergence of entropy rates with markers. Observe that is decreasing towards for increasing (with equality at ), then and . From Lemma A1, for a fixed we have the bound uniformly in . Then take the limit in to get . By Lemma A2, there exists a such that and then occurs with probability for all . This proves almost sure convergence for Term 3.
Convergence of sample entropies with markers. For any there exists a such that and for all . Recalling that , we can apply Markov’s inequality to get
(21)
(22)
(23)
for all . This proves convergence in probability for Term 1, which will now be linked with the convergence result of Term 3 by showing the convergence of Term 2.
Shannon-McMillan-Breiman theorem with markers. Consider the -step MP where . Since it embeds the ergodic MP , which is a one-to-one mapping between probability spaces with identical probability measure, it is also ergodic. Then we have the HMP where . Since the -step MP is ergodic, then the resulting HMP is ergodic [9]. Apply SMB to get almost surely as for a fixed (hence ). Then there exists a sequence such that for all with probability