On Noisy Duplication Channels
with Markov Sources

Brendon McBain, James Saunderson, and Emanuele Viterbo
Department of Electrical & Computer Systems Engineering, Monash University, Clayton, Australia

Abstract

Channels with noisy duplications have recently been used to model the nanopore sequencer. This paper extends some foundational information-theoretic results to this new scenario. We prove the asymptotic equipartition property (AEP) for noisy duplication processes based on ergodic Markov processes. A consequence is that the noisy duplication channel is information stable for ergodic Markov sources, and therefore the channel capacity constrained to Markov sources is the Markov-constrained Shannon capacity. We use the AEP to estimate lower bounds on the capacity of the binary symmetric channel with Bernoulli and geometric duplications using Monte Carlo simulations. In addition, we relate the AEP for noisy duplication processes to the AEP for hidden semi-Markov processes.

I Introduction

Recently, the nanopore sequencer in DNA storage systems has motivated the study of channels with noisy duplications [10]. In this application, a sequence of nucleotides from $\{\mathsf{A},\mathsf{T},\mathsf{C},\mathsf{G}\}$ passes through a microscopic pore that outputs a current response dependent on the nucleotides inside the pore. The mechanism that feeds the nucleotides through the pore varies in speed, so the sampled current signal contains a random number of samples per nucleotide. In the presence of measurement noise, this imperfect mechanism results in a channel with noisy duplications. This paper is an information-theoretic analysis of generic noisy duplication channels and is not tied to a specific model of the nanopore sequencer.

Numerical capacity bounds of noisy duplication channels were computed in [11, 13] using Monte Carlo simulations to characterize the theoretical performance of the nanopore sequencer. However, the results were based on the assumption that the asymptotic equipartition property (AEP) holds for the channel processes of noisy duplication channels. The key contribution of this paper is proving that the AEP holds for discrete-time ergodic Markov sources on a finite state space. A consequence of our results is that the Markov-constrained channel capacity is the Markov-constrained Shannon capacity $C_{\mathsf{Markov}}$ , which is a lower bound on the channel capacity $C$ for arbitrary sources. We compute numerical lower bounds on $C_{\mathsf{Markov}}$ for the binary symmetric channel (BSC) with Bernoulli and geometric duplications by choosing a $\mathsf{Ber}(1/2)$ source. These extend the numerical capacity bounds for the sticky channels in [14] that correspond to the noiseless cases. In addition, we relate the AEP for noisy duplication processes with the AEP for hidden semi-Markov processes (HSMPs), which has largely only been studied in the special case of semi-Markov processes (SMPs) [5, 6].

I-A Noisy duplication channel

The noisy duplication channel for the special case of Markov sources is conveniently described using SMPs. For a sequence of channel inputs $\{X_{i}\}$ , the sequence of channel states is the discrete-time homogeneous Markov process (MP) $\{S_{\ell}\}$ on a finite state space $\Omega$ . Since there is a one-to-one correspondence between the channel inputs and the channel states, we will conveniently consider the latter as being the input to the channel. This Markov source is ergodic if its Markov transition probability matrix $P$ is irreducible. For an arbitrary initial state $S_{0}$ , their relationship is visualised as

$$S_{0}\overset{X_{1}}{\longrightarrow}S_{1}\overset{X_{2}}{\longrightarrow}\ldots\overset{X_{m}}{\longrightarrow}S_{m}$$

for $m$ channel-uses, where each arrow corresponds to an input that triggers a change in channel state. Next, the channel states are duplicated according to the i.i.d. state duration process $\{K_{\ell}\}$ on a discrete support $\Lambda$ , and then sequentially concatenated together to form the SMP $\{Z_{t}\}$ on $\Omega$ . For $m$ channel-uses, we have the duplicated states

$$\underbrace{Z_{1},Z_{2},\ldots,Z_{T_{1}}}_{K_{1}},\underbrace{Z_{T_{1}+1},\ldots,Z_{T_{2}}}_{K_{2}},\ldots,\underbrace{Z_{T_{m-1}+1},\ldots,Z_{T_{m}}}_{K_{m}}$$

where for convenience we additionally define the jump time $T_{\ell}=\sum_{i\leq\ell}K_{i}$ for the index of the last sample in the $\ell$ -th segment, forming the jump time process $\{T_{\ell}\}$ . This process is semi-Markov since the Markov property only holds at the jump time between each segment of duplications. Each duplicated state in $\{Z_{t}\}$ undergoes the channel mapping $f:\Omega\rightarrow\mathbb{R}$ , which is identical for each duplicated state. Then it is corrupted by memoryless noise to give the HSMP $\{Y_{t}\}$ . For a channel input $S_{1}^{m}$ of $m$ channel-uses, the channel output is $Y_{1}^{T_{m}}$ and the channel joint input-output is $(S_{1}^{m},Y_{1}^{T_{m}})$ .
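As a rough illustration of the model described above, the following Python sketch simulates one block of channel uses. It assumes a binary state space, an identity mapping $f$, and BSC noise; the function name `simulate_noisy_duplication` and its parameters are hypothetical and chosen only for this example.

```python
import random

def simulate_noisy_duplication(m, P, duration_pmf, crossover, seed=0):
    """Simulate m channel uses of a binary noisy duplication channel.

    P[s][s2]     : Markov transition probabilities on the states {0, 1}
    duration_pmf : dict mapping a duration k >= 1 to P(K = k)
    crossover    : BSC flip probability (the memoryless noise)
    Returns (states, durations, outputs).
    """
    rng = random.Random(seed)
    ks, ws = zip(*sorted(duration_pmf.items()))
    state = rng.randrange(len(P))            # arbitrary initial state S_0
    states, durations, outputs = [], [], []
    for _ in range(m):
        # Markov step S_{l-1} -> S_l, triggered by the input X_l
        state = rng.choices(range(len(P)), weights=P[state])[0]
        # i.i.d. state duration K_l
        k = rng.choices(ks, weights=ws)[0]
        states.append(state)
        durations.append(k)
        # duplicate the state k times (f is the identity here), then add noise
        for _dup in range(k):
            flip = rng.random() < crossover
            outputs.append(state ^ int(flip))
    return states, durations, outputs
```

The output list has length $T_{m}=\sum_{\ell\leq m}K_{\ell}$, so its length is itself random even though the number of channel uses $m$ is fixed.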

I-B Entropy rates

For the MP $\mathbb{S}=\{S_{\ell}\}$, the entropy rate is $H(\mathbb{S})=\lim_{m\rightarrow\infty}\frac{1}{m}H(S_{1}^{m})$. The HSMP $\mathbb{Y}=\{Y_{t}\}$ has entropy rate $H(\mathbb{Y})=\lim_{t\rightarrow\infty}\frac{1}{t}H(Y_{1}^{t})$. When this HSMP is randomly indexed as $Y_{1}^{T_{m}}$ using the jump time process $\mathbb{T}=\{T_{\ell}\}$, which has i.i.d. increments $\{K_{\ell}\}$, the entropy rate is $H(\mathbb{Y}^{\mathbb{T}})=\lim_{m\rightarrow\infty}\frac{1}{m}H(Y_{1}^{T_{m}})$. Similarly, we have the entropy rate of the joint process $(S_{1}^{m},Y_{1}^{T_{m}})$ as $H(\mathbb{S},\mathbb{Y}^{\mathbb{T}})=\lim_{m\rightarrow\infty}\frac{1}{m}H(S_{1}^{m},Y_{1}^{T_{m}})$. The existence of these two randomly-indexed entropy rates is proven in Lemma 1. Finally, let the (mutual) information rate of the noisy duplication channel with a Markov source be $I(\mathbb{S};\mathbb{Y}^{\mathbb{T}})=H(\mathbb{S})+H(\mathbb{Y}^{\mathbb{T}})-H(\mathbb{S},\mathbb{Y}^{\mathbb{T}})$.

Lemma 1.

The entropy rates $H(\mathbb{Y}^{\mathbb{T}})$ and $H(\mathbb{S},\mathbb{Y}^{\mathbb{T}})$ exist.

Proof.

Let $H_{m}=\frac{1}{m}H(Y_{1}^{T_{m}})$ . Observe that

\begin{align}
mH_{m} &= H(Y_{1}^{T_{m}}) \tag{1}\\
&\leq H(Y_{1}^{T_{\ell}},Y_{T_{\ell}+1}^{T_{m}})\quad\text{for any }\ell<m \tag{2}\\
&= H(Y_{1}^{T_{\ell}})+H(Y_{T_{\ell}+1}^{T_{m}}|Y_{1}^{T_{\ell}}) \tag{3}\\
&\leq H(Y_{1}^{T_{\ell}})+H(Y_{T_{\ell}+1}^{T_{m}}) \tag{4}\\
&= \ell H_{\ell}+(m-\ell)H_{m-\ell} \tag{5}
\end{align}

so $\{mH_{m}\}$ is sub-additive. By Fekete’s lemma [4, Lemma 4A.2], the limit $\lim_{m\rightarrow\infty}H_{m}=\liminf_{m\rightarrow\infty}H_{m}$ exists. Each of the above steps applies in the same way when setting $H_{m}=\frac{1}{m}H(S_{1}^{m},Y_{1}^{T_{m}})$. ∎

II Asymptotic equipartition property

II-A Noisy duplication processes

In this section, we consider the AEP for the output entropy rate $H(\mathbb{Y}^{\mathbb{T}})$ and the joint input-output entropy rate $H(\mathbb{S},\mathbb{Y}^{\mathbb{T}})$ . These results will be proven using almost identical techniques based on genie-aided markers. Since we cannot directly use ergodicity to prove the AEP through the Shannon-McMillan-Breiman theorem (SMB) [1], we use the side-information of genie-aided markers to force ergodicity in the noisy duplication processes with respect to the sub-sequences between the markers in order to apply SMB and indirectly prove the AEP.

Theorem 1 (Output AEP).

If $\{S_{\ell}\}$ is ergodic, then

$$-\frac{1}{m}\log\mathbb{P}(Y_{1}^{T_{m}})\xrightarrow[]{\mathbb{P}}H(\mathbb{Y}^{\mathbb{T}}). \tag{6}$$

The full proof of Theorem 1 is given in Appendix B, which we now summarise as a proof sketch:

Proof sketch.

Let $\mathcal{M}_{m,d}=\{T_{nd}:nd\leq m\}$ be the markers up to the $m$ -th symbol separated by $d$ symbols. Let $n=|\mathcal{M}_{m,d}|$ be the number of markers up to $m$ , then we have $m=nd+i$ for $0\leq i<d$ . Let the sample entropy rate of $Y_{1}^{T_{m}}$ be $g_{m}=-\frac{1}{m}\log\mathbb{P}(Y_{1}^{T_{m}})$ , and let the entropy rate be $H_{m}=\mathbb{E}[g_{m}]=\frac{1}{m}H(Y_{1}^{T_{m}})$ with limit $H=\lim_{m\rightarrow\infty}H_{m}$ (which exists from Lemma 1). For a fixed $d$ , let the sample entropy rate of $(Y_{1}^{T_{m}},\mathcal{M}_{m,d})$ be $g_{m,d}=-\frac{1}{m}\log\mathbb{P}(Y_{1}^{T_{m}},\mathcal{M}_{m,d})$ , and let the entropy rate be $H_{m,d}=\mathbb{E}[g_{m,d}]=\frac{1}{m}H(Y_{1}^{T_{m}},\mathcal{M}_{m,d})$ with limit $H_{\infty,d}=\lim_{m\rightarrow\infty}H_{m,d}$ .

• Split problem into three parts: We need to show that $g_{m}\rightarrow H$ as $m\rightarrow\infty$. By the triangle inequality, we need only show the chain of convergences $g_{m}\rightarrow g_{m,d}\rightarrow H_{\infty,d}\rightarrow H$ for some $d=d(m)$ as $m\rightarrow\infty$.

• Convergence of entropy rates with markers: Adding markers can only increase the entropy rate, such that $H_{m,d}-H_{m}\geq 0$. This increase in entropy rate is bounded as $H_{m,d}-H_{m}\leq\frac{2}{d}H(T_{d})$ by Lemma A1 and converges to zero as the marker distance $d$ increases by Lemma A2.

• Convergence of sample entropy rates with markers: Observe that $g_{m,d}$ is decreasing towards $g_{m}$ for increasing $d$ since the side-information of markers reduces the number of terms in the marginalisation of $T_{1}^{m}$. Then $g_{m,d}-g_{m}\geq 0$ and Markov’s inequality says they converge since $\mathbb{E}[g_{m,d}-g_{m}]=H_{m,d}-H_{m}$.

• Shannon-McMillan-Breiman theorem with markers: Form a hidden MP (HMP) $\{W_{n}\}$ with $W_{n}=Y_{T_{(n-1)d}+1}^{T_{nd}}$ based on the MP $\{V_{n}\}$ with $V_{n}=S_{(n-1)d+1}^{nd}$ in between the markers. The HMP is ergodic since the MP is ergodic [9]. Then $g_{m,d}\rightarrow H_{\infty,d}$ as $m\rightarrow\infty$ by SMB.

∎

Theorem 2 (Joint AEP).

If $\{S_{\ell}\}$ is ergodic, then

$$-\frac{1}{m}\log\mathbb{P}(S_{1}^{m},Y_{1}^{T_{m}})\xrightarrow[]{\mathbb{P}}H(\mathbb{S},\mathbb{Y}^{\mathbb{T}}). \tag{7}$$

Proof.

This proof differs from that of Theorem 1 only in that, instead of showing ergodicity of the HMP $\{W_{n}\}$, we need to show ergodicity of the process $\{U_{n}\}$ with $U_{n}=(V_{n},W_{n})$. This is also an ergodic HMP with respect to the ergodic MP $\{V_{n}\}$. It is in fact a MP, but it is more convenient to invoke the ergodicity of HMPs [9]. ∎

There are many consequences of these AEP results. In Section III, we will use them to derive achievable lower bounds of channel capacity and demonstrate how they can be estimated in practice. Before that, let us explore another AEP with an interesting connection to the output AEP.

II-B Hidden semi-Markov processes

For noisy duplication channels, we were interested in the case where there are $m$ inputs and $T_{m}$ duplicated samples at the output. Alternatively, we can fix the number of output samples to be $t$, in which case the number of inputs is a random variable $M(t)$ that is at most $t$. We consider the AEP for discrete-time HSMPs over a finite state-space, extending the AEP for discrete-time SMPs over a finite state-space [5] (the latter was subsequently extended to the Borel state-space [6], which is not considered in this paper). The central idea behind proving our AEP is to embed the SMP in a MP, for which we can leverage existing results.

Definition 1 (Embedded SMP [7]).

The embedded SMP is a MP with transition probabilities

$$P(s,k|s^{\prime},k^{\prime})=\begin{cases}P(s|s^{\prime})\mathbb{P}(K=k)&k=1\\ \mathbb{P}(K>k|K>k-1)&k=k^{\prime}+1,\ s=s^{\prime}\\ 0&\text{otherwise}\end{cases} \tag{8}$$

for all $k^{\prime},k\in\Lambda$ and for all $s^{\prime},s\in\Omega$ .

An important property of the embedded duplication process is that it preserves ergodicity. Observe that the embedded duration for each channel input forms a sequence of states with only one possible transition, the “extending” transition, until it reaches the “terminating” transition. Therefore, the embedding MP is irreducible and thus obeys the ergodic theorem. This embedding of the SMP will be used to prove the AEP for HSMPs in Theorem 3. Further, this HSMP AEP will be related to the output AEP in Theorem 1 through its entropy rate using Lemma 2.
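To make the extending and terminating transitions concrete, here is a minimal Python sketch of one possible embedding. It uses a hazard-rate indexing convention chosen for this example, which may differ from the convention in (8): the pair $(s,k)$ is read as "the $k$-th sample of a segment in state $s$". The helper names `embed_smp` and `stationary` are hypothetical.

```python
def embed_smp(P, duration_pmf):
    """Build an embedded Markov chain on pairs (s, k).

    Convention (an assumption of this sketch): from (s, k) the chain either
    extends to (s, k + 1) with probability P(K > k | K >= k), or terminates
    and starts a new segment (s2, 1) with probability
    P(K = k | K >= k) * P[s][s2].
    """
    kmax = max(duration_pmf)
    tail = {k: sum(p for j, p in duration_pmf.items() if j >= k)
            for k in range(1, kmax + 1)}               # tail[k] = P(K >= k)
    states = [(s, k) for s in range(len(P)) for k in range(1, kmax + 1)
              if tail[k] > 0]
    index = {sk: i for i, sk in enumerate(states)}
    Q = [[0.0] * len(states) for _ in states]
    for (s, k), i in index.items():
        stop = duration_pmf.get(k, 0.0) / tail[k]      # terminating hazard
        if k < kmax and tail[k + 1] > 0:
            Q[i][index[(s, k + 1)]] = 1.0 - stop       # "extending" transition
        for s2 in range(len(P)):
            Q[i][index[(s2, 1)]] += stop * P[s][s2]    # "terminating" transition
    return states, Q

def stationary(Q, iters=2000):
    """Stationary distribution by power iteration (assumes an aperiodic chain)."""
    n = len(Q)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * Q[i][j] for i in range(n)) for j in range(n)]
    return pi
```

Under this convention, the stationary mass on the states with $k=1$ is the long-run fraction of output samples that begin a new segment, which works out to $1/\mathbb{E}[K]$. This is the same scaling that appears between the two entropy rates in Lemma 2.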

Lemma 2 (Randomly indexed entropy rate).

Let $H(\mathbb{Y}^{\mathbb{T}})=\lim_{m\rightarrow\infty}\frac{1}{m}H(Y_{1}^{T_{m}})$ be the entropy rate for fixed-length inputs and variable-length outputs, and let $H(\mathbb{Y})=\lim_{t\rightarrow\infty}\frac{1}{t}H(Y_{1}^{t})$ be the entropy rate for variable-length inputs and fixed-length outputs. Then $H(\mathbb{Y}^{\mathbb{T}})=\mathbb{E}[K]H(\mathbb{Y})$ .

Proof.

Let $\tilde{Z}_{1}^{t}$ be an embedding of the randomly indexed MP $S_{1}^{M(t)}$. Observe that the number of random inputs for $t$ output samples is $M(t)=\sum_{\tau=1}^{t}\mathbf{1}_{\mathcal{D}}[\tilde{Z}_{\tau}]$ where $\mathcal{D}=\{(s,k)\in\Omega\times\Lambda:k=1\}$ is the set of embedding states that begin a new segment. With this, we will show that the entropy rate $H(M(t))/t$ goes to zero as $t\rightarrow\infty$.

By Hoeffding’s inequality for MPs [3], we have the concentration inequality $\mathbb{P}(|M(t)-\mathbb{E}[M(t)]|>t)\leq 2\exp(-Ct)$ for some constant $C>0$ that depends on the convergence speed of the MP to its stationary distribution. Let $\mathcal{A}_{t}=\{|M(t)-\mathbb{E}[M(t)]|>t\}$ , then we can partition the entropy of $M(t)$ and bound it as

\begin{align}
\frac{1}{t}H(M(t)) &= \frac{1}{t}H(M(t)|\mathcal{A}_{t})\mathbb{P}(\mathcal{A}_{t})+\frac{1}{t}H(M(t)|\mathcal{A}^{c}_{t})(1-\mathbb{P}(\mathcal{A}_{t})) \tag{9}\\
&\leq \frac{1}{t}H(M(t))2e^{-Ct}+\frac{1}{t}\log(2Ct)(1-2e^{-Ct}) \tag{10}
\end{align}

which uses the conditioning property, Hoeffding’s inequality, and that $H(M(t)|\mathcal{A}^{c}_{t})\leq\log(2Ct)$ since $M(t)$ conditioned on $\mathcal{A}^{c}_{t}$ is supported on the interval $[\mathbb{E}[M(t)]-t,\mathbb{E}[M(t)]+t]$ .

Observe that $|t-T_{M(t)}|\leq a$ for constant $a=\max\Lambda$ and for all $t$ . Then $H(Y_{1}^{t})/t$ is arbitrarily close to $H(Y_{1}^{T_{M(t)}})/t$ for sufficiently large $t$ , and

\begin{align}
H(\mathbb{Y}) &= \lim_{t\rightarrow\infty}\frac{1}{t}H(Y_{1}^{T_{M(t)}}) \tag{11}\\
&= \lim_{t\rightarrow\infty}\frac{1}{t}[H(Y_{1}^{T_{M(t)}}|M(t))+H(M(t))] \tag{12}\\
&= \lim_{t\rightarrow\infty}\frac{1}{t}H(Y_{1}^{T_{M(t)}}|M(t)) \tag{13}\\
&= \lim_{t\rightarrow\infty}\sum_{m}\mathbb{P}(M(t)=m)\left[\frac{H(Y_{1}^{T_{m}})}{m\mathbb{E}[K]}\right] \tag{14}
\end{align}

where (14) uses a sandwich bound on the conditional entropy $mH_{m}=H(Y_{1}^{T_{m}})=H(Y_{1}^{T_{M(t)}}|M(t)=m)$ , given by

$$\left(\frac{m}{T_{M(t)}+a}\right)H_{m}\leq\frac{1}{t}H(Y_{1}^{T_{m}})\leq\left(\frac{m}{T_{M(t)}-a}\right)H_{m} \tag{15}$$

which is derived from $|t-T_{M(t)}|\leq a$ . By the (randomly-indexed) law of large numbers $T_{M(t)}/M(t)\rightarrow\mathbb{E}[K]$ as $t\rightarrow\infty$ , we can squeeze $H(Y_{1}^{T_{m}})/t$ and show it has the limit $H(\mathbb{Y}^{\mathbb{T}})/\mathbb{E}[K]$ . Since $\mathbb{E}[M(t)]$ is increasing in $t$ , the left-hand tail of $\mathbb{P}(M(t)=m)$ is arbitrarily small up to any given $m$ for sufficiently large $t$ , showing convergence of (14).∎
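The renewal-type scaling used in this proof can be checked empirically. The following Python sketch counts $M(t)$ for i.i.d. durations and compares $M(t)/t$ with $1/\mathbb{E}[K]$. It is only a numerical illustration under an assumed duration distribution, not part of the proof, and the function name `count_segments` is hypothetical.

```python
import random

def count_segments(t, duration_pmf, seed=0):
    """Return M(t): the number of segments that start within the first t samples."""
    rng = random.Random(seed)
    ks, ws = zip(*sorted(duration_pmf.items()))
    samples, segments = 0, 0
    while samples < t:
        samples += rng.choices(ks, weights=ws)[0]   # draw the next duration K
        segments += 1
    return segments

pmf = {1: 0.7, 2: 0.3}                       # assumed example, E[K] = 1.3
mean_k = sum(k * q for k, q in pmf.items())
t = 200_000
ratio = count_segments(t, pmf) / t           # should be close to 1 / E[K]
```

For this choice of durations, `ratio` should settle near $1/1.3$ as $t$ grows, in line with $T_{M(t)}/M(t)\rightarrow\mathbb{E}[K]$.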

Theorem 3 (HSMP AEP).

If $\{Z_{t}\}$ is ergodic, then

$$-\frac{1}{t}\log\mathbb{P}(Y_{1}^{t})\xrightarrow[]{a.s.}H(\mathbb{Y})=\frac{H(\mathbb{Y}^{\mathbb{T}})}{\mathbb{E}[K]}. \tag{16}$$

Proof.

Embed the SMP $Z_{1}^{t}$ on $\Omega$ into the MP $\tilde{Z}_{1}^{t}$ on $\Omega\times\Lambda$ , such that the corresponding embedding output process $\tilde{Y}_{1}^{t}$ is a HMP. Since the embedding MP is ergodic, the AEP follows from the ergodicity of HMPs [9] and SMB, combined with Lemma 2. ∎

This theorem combined with the AEP for the output of the noisy duplication channel shows equivalence between the two AEPs. An interesting consequence of this observation is that it implies the true block length $M(t)$ , which is unknown from the $t$ observed samples, dominates the marginalisation of $T_{1}^{M(t)}$ in the entropy rate. It is not clear if there are any practical implications from this, but it is nonetheless an interesting observation. It does, however, allow us to use the classical forward algorithm to estimate $H(\mathbb{Y}^{\mathbb{T}})$ with the embedding using techniques for finite-state channels [15, 2].

III Markov-constrained channel capacity

An open problem for the noisy duplication channel is determining its channel capacity. If the channel were ergodic, the channel capacity would be given by the Shannon capacity formula. However, this must be proven directly since ergodicity is not given. In the special case of ergodic Markov sources, this result turns out to be a corollary of the AEPs in Theorem 1 and Theorem 2 since they imply information stability.

Theorem 4.

If the Markov source $\{S_{\ell}\}$ with Markov transition matrix $P$ is ergodic, then the Markov-constrained channel capacity is the Markov-constrained Shannon capacity

$$C_{\text{Markov}}=\sup_{P\in\mathcal{P}}I(\mathbb{S};\mathbb{Y}^{\mathbb{T}}) \tag{17}$$

where $\mathcal{P}$ is the set of all ergodic Markov transition matrices.

Proof.

Consider the information density

$$i_{m}=\frac{1}{m}\log\mathbb{P}(S_{1}^{m},Y_{1}^{T_{m}})-\frac{1}{m}\log\mathbb{P}(Y_{1}^{T_{m}})-\frac{1}{m}\log\mathbb{P}(S_{1}^{m}),$$

whose expectation is the finite-letter mutual information rate $\frac{1}{m}I(S_{1}^{m};Y_{1}^{T_{m}})$. Observe that the first term converges in probability to $-H(\mathbb{S},\mathbb{Y}^{\mathbb{T}})$ by Theorem 2, the second term converges in probability to $H(\mathbb{Y}^{\mathbb{T}})$ by Theorem 1, and the third term converges to $H(\mathbb{S})$ by SMB. Hence $i_{m}$ converges in probability to $I(\mathbb{S};\mathbb{Y}^{\mathbb{T}})=H(\mathbb{S})+H(\mathbb{Y}^{\mathbb{T}})-H(\mathbb{S},\mathbb{Y}^{\mathbb{T}})$. Consequently, the noisy duplication channel with an ergodic Markov source is information stable. This is a special case of the general capacity formula in [16], and implies that the Markov-constrained Shannon capacity is achievable using the capacity-achieving Markov source. ∎

For channel capacity $C$ with an arbitrary source, we have $C_{\text{Markov}}\leq C$ since the optimal source may not be Markov. Even in the restricted case of Markov sources, finding the Markov source that achieves $C_{\text{Markov}}$ is an open problem. However, there exist heuristic algorithms [11, 13] based on the generalised Blahut-Arimoto algorithm (GBAA) [8, 20] for finite-state channels that can significantly improve upon the benchmark Markov source with independent and uniformly distributed inputs (i.e., the maximum-entropy Markov source).

In the following examples, we consider some simple capacity bounds of a noisy duplication channel that concatenates the BSC with the sticky channel in [14] for Bernoulli and geometric duplications.

III-A BSC with Bernoulli duplications

The simplest non-trivial noisy duplication channel is a BSC with Bernoulli duplications. In particular, we consider a BSC with crossover probability $p$ and durations $K_{\ell}\sim\mathsf{Ber}(p_{d})$ on $\Lambda=\{1,2\}$ with duplication probability $p_{d}$ . Each source bit is duplicated with probability $p_{d}$ , and then each bit goes through the BSC that corrupts it with probability $p$ .

In Fig. 1, we use the Monte Carlo technique from [13] with a block length of $m=10^{6}$ to compute the achievable information rate $I_{\text{BSCD,Ber}}(p,p_{d})$ of this channel in the case of a $\mathsf{Ber}(1/2)$ source. When $p_{d}=0$ , the rate is equal to the capacity of the BSC, $C_{\text{BSC}}(p)$ . When $p_{d}=1$ , the rate is equal to the capacity of the BSC with $2$ looks, $C_{\text{BSC}^{2}}(p)$ . The red curve is when $p=0.1$ , which starts at $C_{\text{BSC}}(0.1)=0.5310$ and ends at $C_{\text{BSC}^{2}}(0.1)=0.7421$ . The blue curve is when $p=0.01$ , which starts at $C_{\text{BSC}}(0.01)=0.9192$ and ends at $C_{\text{BSC}^{2}}(0.01)=0.9787$ . When $p=0$ the BSC with Bernoulli duplications simplifies into a sticky channel with Bernoulli duplications with rate $I_{\text{SC,Ber}}(p_{d})$ for a $\mathsf{Ber}(1/2)$ source and channel capacity $C_{\text{SC,Ber}}(p_{d})$ for the capacity-achieving source, which can be computed numerically [14]. The capacity of the sticky channel with Bernoulli duplications is an upper bound on the capacity of the BSC with Bernoulli duplications.
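The two endpoint values quoted above can be reproduced in closed form. The Python sketch below evaluates the BSC capacity $1-h(p)$ and the mutual information of a uniform input observed through two independent BSC looks; by symmetry, the uniform input is taken to be capacity-achieving in the two-look case. The function names are hypothetical.

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of the BSC with crossover probability p."""
    return 1.0 - h2(p)

def bsc_two_looks_capacity(p):
    """I(X; Y1, Y2) for a uniform bit X sent twice through independent BSC(p)."""
    same = 0.5 * ((1 - p) ** 2 + p ** 2)   # P(Y1 = Y2 = y) for each y in {0, 1}
    diff = p * (1 - p)                     # P(Y1, Y2) for each unequal pair
    h_out = -2 * sum(q * math.log2(q) for q in (same, diff) if q > 0)
    return h_out - 2 * h2(p)               # H(Y1, Y2) - H(Y1, Y2 | X)

for p in (0.1, 0.01):
    print(p, round(bsc_capacity(p), 4), round(bsc_two_looks_capacity(p), 4))
```

These closed-form endpoints give a quick sanity check for a Monte Carlo estimate at $p_{d}=0$ and $p_{d}=1$.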

III-B BSC with geometric duplications

We now consider a BSC with geometric duplications, which is a slightly more complex channel compared to the BSC with Bernoulli duplications. In particular, we consider a BSC with crossover probability $p$ and durations $K_{\ell}\sim\mathsf{Geom}(p_{d})$ on $\Lambda=\{1,2,\ldots\}$ with duplication probability $p_{d}$ . Each source bit is repeatedly duplicated with probability $p_{d}$ until the first non-duplication with probability $1-p_{d}$ , and then each bit goes through the BSC that corrupts it with probability $p$ .

In Fig. 2, we once again use the Monte Carlo technique from [13] with a block length of $m=10^{6}$ to compute the achievable information rate $I_{\text{BSCD,geom}}(p,p_{d})$ of this channel in the case of a $\mathsf{Ber}(1/2)$ source. As noted in [14], accurately computing rates with geometric duplications and a high $p_{d}$ is challenging, and therefore we stop at $p_{d}=0.6$ (just over $2$ duplications on average) as they did. In addition, the geometric distribution is truncated after $15$ samples. Analogously to the previous example, when $p=0$ the BSC with geometric duplications simplifies into a sticky channel with geometric duplications with rate $I_{\text{SC,geom}}(p_{d})$ for a $\mathsf{Ber}(1/2)$ source and channel capacity $C_{\text{SC,geom}}(p_{d})$ for the capacity-achieving source, which can be computed numerically [14]. The capacity of the sticky channel with geometric duplications is an upper bound on the capacity of the BSC with geometric duplications.
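To give a feel for the effect of truncating the geometric distribution, the Python sketch below computes the probability mass dropped by cutting the durations off at a maximum length, together with the mean of the truncated distribution. It assumes $\mathbb{P}(K=k)=(1-p_{d})p_{d}^{k-1}$ and renormalisation after truncation; the source does not state how the truncated mass is handled, so this is only one plausible reading.

```python
def truncated_geometric(p_d, k_max):
    """Renormalised pmf of K on {1, ..., k_max} and the dropped tail mass."""
    raw = [(1 - p_d) * p_d ** (k - 1) for k in range(1, k_max + 1)]
    dropped = 1.0 - sum(raw)               # tail mass P(K > k_max) = p_d ** k_max
    pmf = [q / (1.0 - dropped) for q in raw]
    return pmf, dropped

pmf, dropped = truncated_geometric(0.6, 15)
mean = sum(k * q for k, q in enumerate(pmf, start=1))
print(f"dropped mass = {dropped:.2e}, truncated mean = {mean:.4f}")
```

At $p_{d}=0.6$ the dropped mass is $0.6^{15}$, which is below $10^{-3}$, so the truncation perturbs the duration distribution only slightly under these assumptions.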

Figure 1: Monte Carlo estimates of the information rate $I_{\text{BSCD,Ber}}(p,p_{d})$ of the BSC with error probability $p$, Bernoulli duplications with probability $p_{d}$, and a $\mathsf{Ber}(1/2)$ source. The information rate $I_{\text{SC,Ber}}(p_{d})$ is the case when $p=0$, which corresponds to a sticky channel with capacity $C_{\text{SC,Ber}}(p_{d})$ and is computed numerically [14].

Figure 2: Monte Carlo estimates of the information rate $I_{\text{BSCD,geom}}(p,p_{d})$ of the BSC with error probability $p$, geometric duplications with probability $p_{d}$, and a $\mathsf{Ber}(1/2)$ source. The information rate $I_{\text{SC,geom}}(p_{d})$ is the case when $p=0$, which corresponds to a sticky channel with capacity $C_{\text{SC,geom}}(p_{d})$ and is computed numerically [14].

IV Conclusion

This paper studied the noisy duplication channel and established its channel capacity as the Markov-constrained Shannon capacity in the special case of ergodic Markov sources. It was proven through the AEP for noisy duplication processes, which was related to the AEP for HSMPs. The motivation for these results was to provide a theoretical underpinning to the numerical results in [11, 13]. A significant open problem that remains is the construction of codes that can achieve the Markov-constrained capacity in practice, which is still in its early stages [17, 18, 12, 19]. Further progress in this area could greatly advance DNA storage systems based on nanopore sequencing.

Appendix A Preliminary lemmas

Lemma A1.

$H(Y,A)-H(Y)\leq 2H(A)$ .

Proof.

We start from the inequality $|H(Y)-H(Y|A)|\leq H(A)$ [4, Lemma 4A.1]. Then $H(Y,A)-H(Y)=H(Y|A)+H(A)-H(Y)=|H(Y|A)+H(A)-H(Y)|\leq|H(Y|A)-H(Y)|+H(A)\leq 2H(A)$ . ∎

Lemma A2.

$\lim_{d\rightarrow\infty}\frac{1}{d}H(T_{d})=0$ .

Proof.

By Hoeffding’s inequality, we have the concentration inequality $\mathbb{P}(|T_{d}-\mathbb{E}[T_{d}]|>d)\leq 2\exp(-Cd)$ for some constant $C>0$ . Then the proof follows using the same argument as in Lemma 2. Alternatively, we could use the Gaussian approximation.∎

Appendix B Proof of Theorem 1

Split problem into three parts. We want to show that for any $\epsilon>0$ and $\delta>0$ we have $\mathbb{P}(|g_{m}-H|>\epsilon)<\delta$ for sufficiently large $m$ . Let $\Delta_{1}=\{|g_{m,d}-g_{m}|>\epsilon/3\}$ , $\Delta_{2}=\{|g_{m,d}-H_{\infty,d}|>\epsilon/3\}$ , and $\Delta_{3}=\{|H_{\infty,d}-H|>\epsilon/3\}$ . Then we have the bound

\begin{align}
&\mathbb{P}(|g_{m}-H|>\epsilon) \tag{18}\\
&\leq\mathbb{P}(|g_{m,d}-g_{m}|+|g_{m,d}-H_{\infty,d}|+|H_{\infty,d}-H|>\epsilon) \tag{19}\\
&\leq\underbrace{\mathbb{P}(\Delta_{1})}_{\text{Term 1}}+\underbrace{\mathbb{P}(\Delta_{2})}_{\text{Term 2}}+\underbrace{\mathbb{P}(\Delta_{3})}_{\text{Term 3}} \tag{20}
\end{align}

and we want to show that each term is bounded by $\delta/3$ , where $d$ is fixed and does not affect our choice of $m$ . We will prove this in three parts and then combine the results.

Convergence of entropy rates with markers. Observe that $g_{m,d}$ is decreasing towards $g_{m}$ for increasing $d$ (with equality at $d=m$), so $g_{m,d}-g_{m}\geq 0$ and $\mathbb{E}[g_{m,d}-g_{m}]\geq 0$. From Lemma A1, for a fixed $d$ we have the bound $\mathbb{E}[g_{m,d}-g_{m}]=H_{m,d}-H_{m}\leq\frac{2}{m}H(\mathcal{M}_{m,d})=\frac{2}{d}H(T_{d})$ uniformly in $m$. Then take the limit in $m$ to get $H_{\infty,d}-H\leq\frac{2}{d}H(T_{d})$. By Lemma A2, there exists a $D$ such that $H_{\infty,d}-H\leq\epsilon/3$ for all $d\geq D$, and then $\Delta_{3}$ occurs with probability $0$. This proves almost sure convergence for Term 3.

Convergence of sample entropies with markers. For any $\delta^{\prime}>0$ there exists a $D^{\prime}$ such that $H_{m,d}-H_{m}\leq\delta^{\prime}$ for all $d\geq D^{\prime}$. Choose $\delta^{\prime}$ such that $3\delta^{\prime}/\epsilon\leq\delta/3$. Recalling that $g_{m,d}-g_{m}\geq 0$, we can apply Markov’s inequality to get

\begin{align}
\mathbb{P}(\Delta_{1}) &\leq 3\mathbb{E}[g_{m,d}-g_{m}]/\epsilon \tag{21}\\
&= 3(H_{m,d}-H_{m})/\epsilon \tag{22}\\
&\leq 3\delta^{\prime}/\epsilon\leq\delta/3 \tag{23}
\end{align}

for all $d\geq D^{\prime}$ . This proves convergence in probability for Term 1, which will now be linked with the convergence result of Term 3 by showing the convergence of Term 2.

Shannon-McMillan-Breiman theorem with markers. Consider the $d$ -step MP $\{V_{n}\}$ where $V_{n}=S_{(n-1)d+1}^{nd}$ . Since it embeds the ergodic MP $\{S_{\ell}\}$ , which is a one-to-one mapping between probability spaces with identical probability measure, it is also ergodic. Then we have the HMP $\{W_{n}\}$ where $W_{n}=Y_{T_{(n-1)d}+1}^{T_{nd}}$ . Since the $d$ -step MP is ergodic, then the resulting HMP is ergodic [9]. Apply SMB to get $g_{nd,d}\rightarrow H_{\infty,d}$ almost surely as $n\rightarrow\infty$ for a fixed $d$ (hence $m\rightarrow\infty$ ). Then there exists a sequence $\{N(d)\}$ such that $|g_{nd,d}-H_{\infty,d}|\leq\epsilon/6$ for all $n\geq\overline{N}(d)=\max\{N(n_{1}):n_{1}\leq d\}$ with probability $1$
