Over-the-Air Edge Inference via End-to-End Metasurfaces-Integrated Artificial Neural Networks

Kyriakos Stylianopoulos,  Paolo Di Lorenzo, 
and George C. Alexandropoulos
This work has been supported by the Smart Networks and Services Joint Undertaking projects TERRAMETA, 6G-DISAC, and 6G-GOALS under the European Union’s Horizon Europe research and innovation programme under Grant Agreement numbers 101097101, 101139130, and 101139232, respectively. TERRAMETA also includes top-up funding by UK Research and Innovation under the UK government’s Horizon Europe funding guarantee. K. Stylianopoulos and G. C. Alexandropoulos are with the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, 16122 Athens, Greece (e-mails: {kstylianop, alexandg}@di.uoa.gr). P. Di Lorenzo is with the Department of Information Engineering, Electronics, and Telecommunications, Sapienza University, Italy and CNIT, Italy (e-mail: paolo.dilorenzo@uniroma1.it).
Abstract

In the Edge Inference (EI) paradigm, where a Deep Neural Network (DNN) is split across the transceivers to wirelessly communicate goal-defined features in solving a computational task, the wireless medium has been commonly treated as a source of noise. In this paper, motivated by the emerging technologies of Reconfigurable Intelligent Surfaces (RISs) and Stacked Intelligent Metasurfaces (SIM) that offer programmable propagation of wireless signals, either through controllable reflections or diffractions, we optimize the RIS/SIM-enabled smart wireless environment as a means of over-the-air computing, resembling the operations of DNN layers. We propose a framework of Metasurfaces-Integrated Neural Networks (MINNs) for EI, presenting its modeling, training through a backpropagation variation for fading channels, and deployment aspects. The overall end-to-end DNN architecture is general enough to admit RIS and SIM devices, through controllable reconfiguration before each transmission or fixed configurations after training, while both channel-aware and channel-agnostic transceivers are considered. Our numerical evaluation showcases metasurfaces to be instrumental in performing image classification under link budgets that impede conventional communications or metasurface-free systems. It is demonstrated that our MINN framework can significantly simplify EI requirements, achieving near-optimal performance with $50$ dB lower testing signal-to-noise ratio compared to training, even without transceiver channel knowledge.

Index Terms:
Edge learning, reconfigurable intelligent surface, stacked intelligent metasurfaces, goal-oriented communications, deep learning, over-the-air computing.

I Introduction

In emerging Device-to-Device (D2D) and Internet of Things (IoT) networks, where distributed devices are expected to support various functionalities, such as Integrated Sensing and Communications (ISAC) [1], stringent energy efficiency requirements impose limitations on their communication and computation capabilities. Various sub-networks are envisioned to enable a wide range of high-level applications and services, such as digital twinning through object recognition and computational imaging, or indoor positioning [2], each tied up with a diverse set of requirements to be fulfilled. As a result, it is crucial for the D2D system design to take into consideration the underlying application. To this end, devising novel communication stacks tailored to the application, that break the traditional network layer taxonomy, comes with reduced implementation overheads.

Goal-Oriented Communications (GOC) [3] is a framework that is gaining popularity in D2D systems, since only the necessary goal-specific information needs to be transmitted, reducing messaging overheads and simplifying system architecture [4]. In fact, when performing Edge Inference (EI) tasks, where the Receiver (RX) wishes to obtain only an estimated label of the data the Transmitter (TX) actually observes and transmits, semantic and GOC approaches that split the layers of a Deep artificial Neural Network (DNN) at the endpoints provide significant benefits in terms of communication requirements [5]. According to those approaches, low-dimensional feature vectors, which are outputs of intermediate DNN layers, are transmitted over the channel [6, 7, 8]. The motivation behind such practices is that GOC deviates from the standard Shannon-type communications [9], in that the goal of the system is an arbitrary computational target function, rather than objectives derived from the Mutual Information (MI) that call for bit-wise reconstruction of the input data. In this way, DNNs are employed, similar to the Joint Source Channel Coding (JSCC) paradigm, and trained to capture the joint or conditional data-channel-target distributions.

Current literature has adopted GOC and EI for a wide variety of problems and wireless systems (see [6, 8, 10, 11] and the references therein). However, a common practice across those works is to treat the wireless environment as a source of noise, whose effects need to be negated at the RX side. The rapid developments of Meta-Surface (MS) technologies for precise Radio-Frequency (RF) domain control open the potential for the wireless propagation medium to be dynamically reconfigurable via reflective Reconfigurable Intelligent Surfaces (RISs) [12] or diffractive Stacked Intelligent Metasurfaces (SIM) [13] with low operational costs. Such MSs have been incorporated in GOC and semantic systems to reduce the transceiver hardware complexity [14] or to enhance the system’s rate [15]. Under another research direction, DNN implementations entirely in the RF domain, capitalizing on MS-based solutions, have been devised for controlled laboratory environments [16, 17, 18].

Figure 1: A Metasurfaces-Integrated artificial Neural Network (MINN) performing Edge Inference (EI) by controlling the wireless propagation channel, which is treated as one or more hidden network layers.

In this paper, motivated by the elaborate wireless medium control offered by MS technologies [12, 13], in conjunction with their analog computational capabilities, we design MS-controllable wireless channels that perform Over-the-Air Computation (OAC), under which the comprising metamaterials are treated as hidden artificial neurons that control the wireless medium to perform multi-layer non-linear signal processing toward solving EI tasks, as illustrated in Fig. 1. The main contributions of this paper are summarized as follows.

  1. We present a novel generic End-to-End (E2E) DNN framework, titled Metasurfaces-Integrated Neural Network (MINN), that admits either controllable MSs or MSs with trainable, yet fixed, response configurations, implemented as either RISs or SIM, with or without channel knowledge. We detail the framework’s modeling, backpropagation-based training, and deployment from both theoretical and system architecture perspectives.

  2. We elaborate on the critical role of reconfiguration of wireless channels as a degree of freedom in optimizing DNN-based EI models that capitalize on programmable OAC, instead of treating the wireless propagation environment as a source of noise. To this end, learning RIS and SIM configurations holds substantial potential.

  3. We present an extensive numerical evaluation of the proposed MINN framework on an image classification task. Compared with conventional communication systems, as well as with a baseline in the absence of an MS, this evaluation showcases that MS-enabled OAC allows for successful EI with much lower transmit power requirements, even without channel knowledge at the communication ends.

The remainder of the paper is structured as follows. Section II details the relevant pieces of theory regarding EI as well as the possible improvements brought by controlling the wireless environment, while Section III provides a comprehensive review of the relevant literature. Section IV includes the models used for the considered MS technologies: RIS and SIM, as well as the model for the received signal for both cases. The proposed MINN architecture for EI is presented in Section V, whereas Section VI details its training procedure and discusses the deployment and network considerations. Our numerical investigations are presented in Section VII. Finally, Section VIII includes the paper’s concluding remarks.

Notation: Vectors, matrices, and sets are expressed in lowercase bold (e.g., $\boldsymbol{x}$), uppercase bold (e.g., $\boldsymbol{X}$), and uppercase calligraphic typefaces (e.g., $\mathcal{X}$ or $\boldsymbol{\mathcal{X}}$), respectively. Apart from $\boldsymbol{X}^{\ast}$, $\boldsymbol{X}^{\dagger}$, and $\boldsymbol{X}^{\top}$ that denote the conjugate, conjugate transpose, and transpose of $\boldsymbol{X}$, superscripts and subscripts are used to denote different versions of variables or enumeration over collections of variables depending on the context. $[\boldsymbol{x}]_{i}$ and $[\boldsymbol{X}]_{i,j}$ are used to denote the $i$-th element of $\boldsymbol{x}$ and the $(i,j)$-th element of $\boldsymbol{X}$, respectively. We use variations of the notation $f_{\boldsymbol{w}}(\cdot)$ to represent neural network functions that are parameterized by their weight matrix $\boldsymbol{w}$, noting that $f$ may be seen either as a function of its arbitrary input variables (during inference) or as a function of $\boldsymbol{w}$ (during optimization).
$|\mathcal{X}|$ represents the cardinality of the set $\mathcal{X}$, ${\rm diag}(\boldsymbol{x})$ creates a square matrix with the elements of $\boldsymbol{x}$ placed along its main diagonal, and ${\rm vec}(\boldsymbol{X})$ transforms $\boldsymbol{X}$ into a column vector in a row-by-row fashion. $\otimes$ denotes the Kronecker product, $\{a,b\}$ expresses a set or collection containing $a$ and $b$, while $\mathds{1}_{\texttt{cond}}$ is the indicator function, equal to $1$ if condition cond holds and to $0$ otherwise. Finally, $\mathbb{E}_{\boldsymbol{X}}[\cdot]$ is the expectation operator with respect to the distribution of the random $\boldsymbol{X}$, and $\jmath\triangleq\sqrt{-1}$.

II Prerequisites

II-A Probabilistic Inference

Given an input observation $\boldsymbol{x}$, the objective of an inference procedure is to compute and output an associated target attribute value $\boldsymbol{o}=l(\boldsymbol{x})$. The mapping function $l(\boldsymbol{x})$ is considered unknown and intractable to express analytically; therefore, inference involves approximating this relationship through examples of $(\boldsymbol{x},\boldsymbol{o})$ tuples. From a probabilistic perspective, one may fit the conditional Probability Density Function (PDF) of the target $p(\boldsymbol{o}|\boldsymbol{x})$ on the available data. In typical settings, only point estimates are required; thus, the problem reduces to predicting the most likely value of $\boldsymbol{o}$ for a given observation. In the machine learning regime, the previous distribution (or its point estimates) can be approximated by a $\boldsymbol{w}$-parameterized model $\hat{\boldsymbol{o}}\triangleq f_{\boldsymbol{w}}(\boldsymbol{x})$ that outputs its prediction of the target value $\hat{\boldsymbol{o}}$ for a given input. Consequently, to solve the inference problem, the parameter values $\boldsymbol{w}$ need to be optimized.
This is achieved by collecting a data set of training tuples $\mathcal{D}\triangleq\{(\boldsymbol{x}_{i},\boldsymbol{o}_{i})\}_{i=1}^{|\mathcal{D}|}$, and then minimizing an amortized cost function over the training instances:

$$J(\boldsymbol{w})\triangleq\frac{1}{|\mathcal{D}|}\sum_{i=1}^{|\mathcal{D}|}J_{\rm str}(\boldsymbol{o}_{i},\hat{\boldsymbol{o}}_{i}),\quad\text{where }\hat{\boldsymbol{o}}_{i}=f_{\boldsymbol{w}}(\boldsymbol{x}_{i}). \tag{1}$$

The per-instance cost function values $J_{\rm str}(\boldsymbol{o}_{i},\hat{\boldsymbol{o}}_{i})$ quantify the error in the model’s predictions and they often have a probabilistic interpretation. For instance, in classification settings where each observation belongs to one of $d_{\rm cl}$ classes indexed by the natural number $c$, the target value is defined via one-hot encoding as $\boldsymbol{o}=[\mathds{1}_{c=1},\mathds{1}_{c=2},\dots,\mathds{1}_{c=d_{\rm cl}}]^{\top}\in\mathbb{N}^{d_{\rm cl}\times 1}$. The Cross Entropy (CE) loss function is defined as follows:

$$J_{\rm CE}(\boldsymbol{o}_{i},\hat{\boldsymbol{o}}_{i})\triangleq-\sum_{j=1}^{d_{\rm cl}}[\boldsymbol{o}_{i}]_{j}\log[\hat{\boldsymbol{o}}_{i}]_{j}. \tag{2}$$

Minimizing (2) over $\mathcal{D}$ is equivalent to performing maximum likelihood estimation of $\boldsymbol{w}$ on $p(\boldsymbol{o}|\boldsymbol{x})$ under the assumption that this PDF is a multivariate Bernoulli distribution. Similarly, in regression tasks, where $\boldsymbol{o},\hat{\boldsymbol{o}}\in\mathbb{R}^{d_{\rm out}\times 1}$, the Mean Squared Error (MSE) metric, defined as $\frac{1}{d_{\rm out}}\|\boldsymbol{o}-\hat{\boldsymbol{o}}\|^{2}$, implies that $p(\boldsymbol{o}|\boldsymbol{x})$ is a Gaussian conditional PDF.
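As a small numerical illustration of (1), (2), and the MSE metric, the following Python sketch evaluates the per-instance losses and their average over a toy data set. The function names, the stabilizing constant `eps`, and the numerical values are assumptions introduced here for illustration and do not come from the paper.

```python
import numpy as np

def one_hot(c, d_cl):
    """One-hot target for a class index c in {1, ..., d_cl}."""
    o = np.zeros(d_cl)
    o[c - 1] = 1.0
    return o

def cross_entropy(o, o_hat, eps=1e-12):
    """Per-instance CE loss of (2); eps guards against log(0) (added safeguard)."""
    return -np.sum(o * np.log(o_hat + eps))

def mse(o, o_hat):
    """Per-instance MSE: (1 / d_out) * ||o - o_hat||^2."""
    return np.sum((o - o_hat) ** 2) / o.size

def amortized_cost(targets, predictions, loss):
    """Average of the per-instance losses over the data set, as in (1)."""
    return np.mean([loss(o, o_hat) for o, o_hat in zip(targets, predictions)])

# Illustrative values only: three classes, two training instances.
targets = [one_hot(1, 3), one_hot(3, 3)]
predictions = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.1, 0.8])]
J = amortized_cost(targets, predictions, cross_entropy)
```

Because the targets are one-hot, only the predicted probability of the true class contributes to each CE term, which is what links (2) to a negative log-likelihood.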

II-B Artificial Neural Networks

While a wide range of parameterized families of functions is available to use as $f_{\boldsymbol{w}}(\cdot)$, the current state of the art considers DNN models, capitalizing on their diverse benefits, including high expressivity, a diverse selection of architectural components for different tasks, and parallelizable computations that enable real-time inference on pertinent hardware. Mathematically, let a neural network be expressed as a composition of $L$ layers such that:

$$f_{\boldsymbol{w}}(\boldsymbol{x})\triangleq f^{L}_{\boldsymbol{w}^{L}}\left(f^{L-1}_{\boldsymbol{w}^{L-1}}\left(\ldots f^{1}_{\boldsymbol{w}^{1}}(\boldsymbol{x})\ldots\right)\right), \tag{3}$$

so that the $l$-th layer ($l=1,2,\ldots,L$) is parameterized by $\boldsymbol{w}_{l}$, and its output $\bar{\boldsymbol{o}}^{l}$ constitutes the input to the $(l+1)$-th layer; this process can be represented recursively by $\bar{\boldsymbol{o}}^{l}\triangleq f^{l}_{\boldsymbol{w}^{l}}(\bar{\boldsymbol{o}}^{(l-1)})$, where we have set $\bar{\boldsymbol{o}}^{0}=\boldsymbol{x}$. For convenience, let us also define the vector $\boldsymbol{w}\triangleq[\boldsymbol{w}^{\top}_{1},\boldsymbol{w}^{\top}_{2},\dots,\boldsymbol{w}^{\top}_{L}]^{\top}$.

We leave aside precise definitions of individual layer functions, as there has been an impressive body of work focusing on layers that perform data-specific computations and capture high-level patterns in data sets [19]. Nevertheless, we highlight the fact that each $f^{l}_{\boldsymbol{w}_{l}}(\cdot)$ is required to be a non-linear function, and, in fact, it is often required to be discriminatory or sigmoidal [20]. Those properties guarantee that artificial neural networks with at least two layers (of potentially infinite width) are universal approximators and can, therefore, be used to approximate any arbitrary $\boldsymbol{o}=l(\boldsymbol{x})$ mapping, i.e., $f_{\boldsymbol{w}}(\boldsymbol{x})\cong l(\boldsymbol{x})$ [20].

Besides theoretical guarantees, the problem of obtaining $\boldsymbol{w}$ values that perform successful inference can be efficiently solved by substituting the neural network expression from (3) into (2) (using classification as an example), and subsequently into (1). The latter may be solved through one of the many variations of the Stochastic Gradient Descent (SGD) approach by computing the gradients $\partial J(\boldsymbol{w})/\partial\boldsymbol{w}$, and propagating them through the layers of $f_{\boldsymbol{w}}(\boldsymbol{x})$ by taking advantage of the chain rule; this leads to the celebrated backpropagation algorithm [21] that constitutes the backbone of deep learning.
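To make the chain-rule argument concrete, the sketch below implements the forward pass, the CE loss, and hand-derived gradients for a two-layer instance of (3), followed by a plain SGD update. The architecture (a tanh hidden layer and a softmax output), the layer sizes, and the learning rate are assumptions chosen only for illustration, not the networks used in the paper.

```python
import numpy as np

def forward(w, x):
    """Two-layer network f^2(f^1(x)): tanh hidden layer, softmax output."""
    W1, W2 = w
    h = np.tanh(W1 @ x)                    # output of layer 1
    z = W2 @ h
    z = z - z.max()                        # stabilizes the softmax numerically
    o_hat = np.exp(z) / np.exp(z).sum()    # output of layer 2
    return h, o_hat

def loss(w, x, o):
    """CE loss of (2) for a single instance."""
    _, o_hat = forward(w, x)
    return -np.sum(o * np.log(o_hat))

def backprop(w, x, o):
    """Gradients of the CE loss w.r.t. W1 and W2, obtained via the chain rule."""
    W1, W2 = w
    h, o_hat = forward(w, x)
    dz = o_hat - o                         # dJ/dz for softmax + CE with one-hot o
    gW2 = np.outer(dz, h)
    dh = W2.T @ dz                         # gradient propagated back to layer 1
    gW1 = np.outer(dh * (1.0 - h ** 2), x) # tanh'(a) = 1 - tanh(a)^2
    return gW1, gW2

def sgd_step(w, x, o, lr=0.1):
    """One SGD update on a single training instance."""
    gW1, gW2 = backprop(w, x, o)
    return (w[0] - lr * gW1, w[1] - lr * gW2)

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 classes.
rng = np.random.default_rng(0)
w = (0.5 * rng.standard_normal((4, 3)), 0.5 * rng.standard_normal((2, 4)))
x, o = rng.standard_normal(3), np.array([1.0, 0.0])
for _ in range(100):
    w = sgd_step(w, x, o)
```

In practice, automatic differentiation libraries carry out the `backprop` step; it is written out here only to expose how the gradient of each layer reuses the gradient of the layer after it.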

II-C Edge Inference (EI)

The rising field of EI considers the implementation and training of inference tasks over a wireless communication network. In that regard, consider an uplink setup where a multi-antenna TX observes $\boldsymbol{x}$ and wishes to convey its estimation $\hat{\boldsymbol{o}}$ of the target value $\boldsymbol{o}$ to an RX. At first glance, this task appears straightforward to implement within the existing frameworks of machine learning and wireless communications under the following two paradigm options:

II-C1 “Infer-then-transmit”

The TX may first compute $\hat{\boldsymbol{o}}=f_{\boldsymbol{w}}(\boldsymbol{x})$ and then transmit $\hat{\boldsymbol{o}}$ over the wireless medium upon performing source coding (i.e., data compression) and channel coding (i.e., modulation and beamforming) to derive the transmission signal $\boldsymbol{s}$. The above coding computations ensure that a satisfactory communication rate is achievable by the system so that $\hat{\boldsymbol{o}}$ may be reconstructed at the RX’s side via decoding the received signal and decompressing the data. In modern high-complexity wireless systems, those operations require their own optimization procedures and necessitate additional computational costs as well as channel state information knowledge. For most inference tasks, the target values are of a much smaller dimension than the observations; as a result, this option involves small rate requirements, but comes at a computational cost on the TX’s side, as the device must be endowed with hardware capable of executing DNN computations locally. Since EI tasks are envisioned specifically for cases where IoT or other lightweight devices with low complexity and minute power consumption send messages to a collection/fusion center, the assumption of computational capabilities for local DNN inference is rather optimistic.

II-C2 “Transmit-then-infer”

This converse approach is also possible: the TX performs source and channel coding but no DNN-based EI computations, so that the original observation $\boldsymbol{x}$ is transmitted instead. The RX performs decoding to obtain the data point, which is then fed to its local $f_{\boldsymbol{w}}(\cdot)$ to perform inference. While in uplink settings, it is reasonable to assume that the RX has sufficient power and hardware capabilities to support a DNN, the rate required for transmitting the original observation may impose high link budget demands that are not readily facilitated.

A compromise between the above two options may be obtained by exploiting the sequential nature of the DNN structure appearing in (3) [22]. To this end, the following EI option is possible:

II-C3 “Infer-while-transmitting” (DNN splitting)

The intermediate representations $\bar{\boldsymbol{o}}^{l}$ $\forall l=1,2,\ldots,L-1$ can be of arbitrary dimensions, and it is not uncommon to devise architectures with one or more small-sized intermediate layers. In fact, various deep learning models, such as auto-encoders [23] and U-Net [24], are designed specifically to contain such bottleneck layers as a form of compression to keep only relevant information. From that perspective, one may choose to split the DNN so that its first $L^{\prime}<L$ layers reside at the TX; then, $\bar{\boldsymbol{o}}^{L^{\prime}}$ is transmitted over the network and passed through the $(L^{\prime}+1)$-th up to the $L$-th layer in sequence at the RX.

Evidently, the latter paradigm of EI is the most flexible, since one has the option to balance trade-offs between computation and communication resources between the TX and RX. In the remainder of this paper, we will be adopting this paradigm of EI as the default case and provide further elaboration and extensions, revisiting the two extreme previous cases as baselines in our numerical comparisons.
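The splitting idea can be sketched in a few lines of Python. The layer widths, the tanh activations, and the split index below are hypothetical choices; the only points being illustrated are that the TX transmits the low-dimensional bottleneck output and that, over an ideal (distortion-free) link, the split network reproduces the output of the unsplit one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer widths; the width-4 layer acts as the bottleneck.
dims = [64, 32, 4, 16, 10]
layers = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
          for i in range(len(dims) - 1)]

def run_layers(x, weights):
    """Apply the given layers in sequence, as in the recursion of (3)."""
    for W in weights:
        x = np.tanh(W @ x)
    return x

def split(weights, L_prime):
    """First L' layers stay at the TX, the remaining ones go to the RX."""
    return weights[:L_prime], weights[L_prime:]

L_prime = 2
tx_layers, rx_layers = split(layers, L_prime)

x = rng.standard_normal(dims[0])
feature = run_layers(x, tx_layers)            # transmitted intermediate output
o_hat_split = run_layers(feature, rx_layers)  # RX completes the inference
o_hat_full = run_layers(x, layers)            # reference: unsplit network
```

With these widths, "infer-then-transmit" would send 10 values after running all layers at the TX, "transmit-then-infer" would send all 64 input values, and the split at the bottleneck sends only 4 values while running half of the layers at each end.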

II-D Computational Considerations

When performing DNN splitting over a wireless channel with realistic characteristics (i.e., large- and small-scale fading, as well as Additive White Gaussian Noise (AWGN)), the transmitted output $\bar{\boldsymbol{o}}^{L^{\prime}}$ of the $L^{\prime}$-th DNN layer will be distorted when arriving at the RX. By representing the channel state with an abstract random variable $\boldsymbol{\mathcal{H}}$, one needs to minimize the same $\boldsymbol{w}$-parameterized objective function as before, while accounting for the stochastic nature of the wireless environment (i.e., with respect to the distribution of $\boldsymbol{\mathcal{H}}$), namely, to solve:

$$\mathcal{OP}_{\rm EI}:\min_{\boldsymbol{w}}\mathbb{E}_{\boldsymbol{\mathcal{H}}}[J(\boldsymbol{w})],$$

where each instantaneous channel realization affects the value of the objective function by distorting the value of the transmitted $\bar{\boldsymbol{o}}^{L^{\prime}}$. The precise definition of $\boldsymbol{\mathcal{H}}$ will be given in Section IV, while the considered fading distributions are discussed during the numerical evaluation in Section VII.
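Since the expectation in the above problem is rarely available in closed form, one common way to approximate it is by Monte Carlo averaging over sampled channel realizations. The sketch below does this for a single-layer TX and a single-layer RX. The i.i.d. Rayleigh-fading matrix, the signal model $\boldsymbol{y}=\boldsymbol{H}\boldsymbol{s}+\boldsymbol{n}$, the real-to-complex feature mapping, and all dimensions are placeholder assumptions for illustration; the paper's actual channel model is the one defined in its Section IV.

```python
import numpy as np

def rayleigh_channel(rng, n_rx, n_tx):
    """One i.i.d. Rayleigh-fading realization with unit-variance complex entries."""
    re = rng.standard_normal((n_rx, n_tx))
    im = rng.standard_normal((n_rx, n_tx))
    return (re + 1j * im) / np.sqrt(2)

def transmit(rng, s, H, noise_std):
    """Received signal y = H s + n with circularly symmetric complex AWGN."""
    n_rx = H.shape[0]
    n = (rng.standard_normal(n_rx) + 1j * rng.standard_normal(n_rx)) / np.sqrt(2)
    return H @ s + noise_std * n

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def e2e_loss(w, x, o, H, noise_std, rng):
    """CE loss of a split network whose intermediate output crosses the channel."""
    W_tx, W_rx = w
    s = np.tanh(W_tx @ x).astype(complex)       # TX-side layer output (assumed mapping)
    y = transmit(rng, s, H, noise_std)          # distortion by fading and noise
    y_real = np.concatenate([y.real, y.imag])   # RX works on real-valued inputs
    o_hat = softmax(W_rx @ y_real)
    return -np.sum(o * np.log(o_hat + 1e-12))

def expected_loss(w, data, n_rx, noise_std, n_channels, rng):
    """Monte Carlo estimate of E_H[J(w)]: average J(w) over sampled channels."""
    n_tx = w[0].shape[0]
    total = 0.0
    for _ in range(n_channels):
        H = rayleigh_channel(rng, n_rx, n_tx)
        total += np.mean([e2e_loss(w, x, o, H, noise_std, rng) for x, o in data])
    return total / n_channels

# Hypothetical sizes and randomly generated toy data.
rng = np.random.default_rng(0)
d_in, d_feat, n_rx, d_cl = 8, 4, 6, 3
w = (rng.standard_normal((d_feat, d_in)), rng.standard_normal((d_cl, 2 * n_rx)))
data = [(rng.standard_normal(d_in), np.eye(d_cl)[rng.integers(d_cl)]) for _ in range(5)]
J_est = expected_loss(w, data, n_rx, noise_std=0.1, n_channels=20, rng=rng)
```

During training, the same idea is typically applied per mini-batch, drawing a fresh channel realization for each batch so that the stochastic gradient is also an estimate with respect to the channel distribution.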

Assuming the wireless channel has sufficient capacity, optimizing both endpoints to perform source and channel encoding and decoding, under the standard paradigm of wireless communications, will indeed nullify the distortion on the received version of $\bar{\boldsymbol{o}}^{L^{\prime}}$. The decoding output may then be passed to the $(L^{\prime}+1)$-th layer of the network as normal; note that the presence of the channel is effectively hidden from the neural network’s perspective. This approach aligns more with the standard practices of both the wireless communications and machine learning paradigms, as each problem is treated individually, and it has been shown to indeed exhibit satisfactory results [8, 15]. However, the above practice of optimizing the system for reconstruction of the received signal may result in higher computational overheads under the following considerations:

  1. 1.

    Input reconstruction is not the objective of EI, rather the computation of an arbitrary function of it. Under this point of view, allocating computational resources in reconstructing intermediate variables is not always the most efficient way of solving the problem. In fact, one may regard GOC as a particular instance of lossy compression between the (unseen) target value 𝒐𝒐\boldsymbol{o}bold_italic_o and its estimation 𝒐^^𝒐\hat{\boldsymbol{o}}over^ start_ARG bold_italic_o end_ARG, where Jstr(𝒐,𝒐^)subscript𝐽str𝒐^𝒐J_{\rm str}(\boldsymbol{o},\hat{\boldsymbol{o}})italic_J start_POSTSUBSCRIPT roman_str end_POSTSUBSCRIPT ( bold_italic_o , over^ start_ARG bold_italic_o end_ARG ) plays the role of the distortion metric. From an information-theoretic perspective, variations of the distortion-rate functions may then be studied, as proposed by [25], indicating that the channel rate required to transmit intermediate variables, while ensuring a desired error threshold, is typically lower than the actual channel capacity which assumes reconstruction with arbitrary small error probabilities.

  2. From an engineering perspective, since the received signal is to be fed to subsequent neural network layers, perfect reconstruction may not be required. The employed neural network architectures are commonly designed to account for noisy inputs in light of the inherent stochasticity of inference problems. Besides, manually injecting noise into the activations of intermediate layers, both during training [26, 27] and inference [28, 29], is known to enhance the model’s regularization and uncertainty estimation properties.

  3. Finally, the wireless channel, which can be regarded as a (naturally stochastic) function over the transmitted data, imposes its own computations. While this function is not, in general, controllable by the E2E system, OAC approaches leverage the superposition of wireless signals to implement certain families of computational functions on top of the wireless medium. Interestingly, the controllability offered by emerging MS technological solutions has the potential to enable more elaborate computations over-the-air, essentially offloading computations from the communication network’s endpoints.

III Relevant State-of-the-Art

The implementation of DNNs at the transceivers has been studied under the JSCC paradigm, with results that outperform traditional communication systems due to the neural networks’ ability to learn useful patterns from the data and channel distributions [30, 31]. However, JSCC is limited to data reconstruction as an objective. Conversely, deep semantic communication approaches [32] endow the communication system with the purpose of transmitting the meaning of the data, rather than their bit representations, which can be interpreted as a GOC objective. Notably, the DeepSC architecture of [8] proposes a DNN splitting approach, with separate source and channel coding sub-modules that are trained independently. The channel encoder and decoder are trained to maximize an MI objective, which has two drawbacks: the MI is difficult to evaluate analytically for fading channels, and maximizing it implies that the effects of the channel are to be negated instead of being used for computations. In [10], a GOC approach with separate source and channel coding modules was developed for image retrieval under AWGN and Rayleigh fading, which further exemplified the benefits of training each component separately rather than E2E. However, this training is possible due to the inserted rate-like part of the loss function, which also treats the channel as noise instead of a computational entity. Besides, the idea of DNN splitting for EI tasks has been investigated in [22, 6] from the information bottleneck viewpoint, to derive optimal network partitioning in uncontrollable wireless channels.

Traditionally, OAC methods have been developed to compute aggregate functions over multiple-access channels based on the superposition of signals [33], and have been utilized for federated learning tasks, as in [34]. Both of these OAC approaches, however, focused on computing a limited, yet useful, family of analytical functions that are fundamentally different from the computations that take place in hidden neural network layers. More relevant to the present work is the methodology of [7], where the wireless channel is treated as a hidden network layer encompassed by DNN layers at the transceivers. While this E2E treatment can be utilized in the context of GOC, the computations imposed by wireless channels are not controllable in the absence of an MS, which limits the benefits of this approach. In essence, the MINN framework proposed in this paper is general enough to accommodate this setup as a special case, once the MS-induced links are ignored in the overall system model. In the numerical evaluation section later on, we compare against this variation to illustrate the benefits of integrating MS(s) as hidden layer(s) for effective mixed digital/analog computation.

The joint consideration of MSs and deep learning is the subject of a growing body of research. Apart from works that introduce deep learning algorithms to control RISs [35, 36, 37] or SIM [38], cascaded MSs have been introduced as DNN layers of diffractive implementations in [16, 17, 18], towards analog computing hardware that is envisioned to exhibit notable benefits in computational speed and power consumption. In this paper, we are motivated by all-optical neural network implementations, but our framework is differentiated by the fact that the SIM layers are assumed to reside within the wireless environment, so that the E2E system needs to account for time-varying wireless fading when passing information between the SIM and digital network layers at the TX/RX endpoints [39]. It is noted that purely optical DNNs have the limitation that the input data need to be transferred to the RF domain via techniques like holography, illumination, or traditional modulation, which are currently impractical for real-life deployment of such architectures. Finally, in the context of wireless networks, other works have capitalized on sophisticated designs of the responses of the elements of MSs to implement wave-domain signal processing [40, 41] or multi-access edge computing [42] tasks.

Regarding approaches that specifically consider GOC or semantic communication problems with the inclusion of MSs, a GOC approach was recently introduced in [14], according to which the DNN layers at the transceivers were implemented via SIM layers, in contrast to having the SIM as part of the environment, as proposed in this work. Indeed, performing DNN computations in the RF regime even at the endpoints has the aforementioned benefits; however, the hardware design of such transceivers is far from trivial for low-cost IoT devices. Besides, an MS placed directly inside the wireless environment may offer more precise control of the propagation medium. Finally, semantic communications were performed with the assistance of an RIS placed in the environment in [15]. In that approach, the RIS was optimized to maximize an equivalent Signal-to-Noise Ratio (SNR) objective, similar to [8], whereas we herein propose to treat the problem in an E2E manner. In the numerical evaluation of Section VII, we include a baseline where the RIS alongside other components is optimized with respect to the achievable Shannon rate, and we show that this approach is less effective compared to our E2E treatment under the considered system, especially in the low-SNR regime. Despite the related literature on E2E architectures for EI that potentially take advantage of MS capabilities, to the best of our knowledge, the proposed MINN framework is the first to highlight the importance of treating the MS-enabled smart wireless channel as favorable computational machinery embedded inside an E2E DNN architecture.

IV System and MS Components Modeling

IV-A System and Received Signal Models

We consider the uplink of a point-to-point Multiple-Input Multiple-Output (MIMO) communication system, where a TX equipped with $N_t$ transmit antennas wishes to transmit its data to an $N_r$-antenna RX on a frame-by-frame basis, where the frames are indexed as $t=1,2,\ldots$ and are possibly unevenly spaced. This communication is enhanced via an MS (either an RIS or SIM), deployed as a standalone node in the wireless environment, whose configuration may change at every discrete time step $t$ upon the command of an abstract controller unit. Without loss of generality, let us assume that the SIM consists of $M$ thin diffractive layers, each with $N_m$ unit elements, so that it contains $N_{\rm SIM}\triangleq MN_m$ phase-tunable elements in total. For ease of notation and to present a comprehensive system model, we will additionally use $N_m$ to denote the number of RIS tunable elements.
Furthermore, let us define the Channel Frequency Response (CFR) matrices at each $t$-th time instance for the TX-RX, the TX-MS, and the MS-RX links as $\mathbf{H}_{\rm D}(t)\in\mathbb{C}^{N_r\times N_t}$, $\mathbf{H}_{1}(t)\in\mathbb{C}^{N_t\times N_m}$, and $\mathbf{H}_{2}(t)\in\mathbb{C}^{N_r\times N_m}$, respectively. The transmitted signal is expressed as $\boldsymbol{s}(t)\in\mathbb{C}^{N_t\times 1}$, which satisfies a power budget constraint $P\triangleq\mathbb{E}[\|\boldsymbol{s}(t)\|^2]$.
In fact, we suggest that $\boldsymbol{s}(t)$ represents both the intended, source-coded, and modulated data stream (according to the MIMO spatial multiplexing principle, the number of data symbols needs to satisfy $d\leq\min\{N_t,N_r\}$), as well as potential beamforming weights, without making any specific assumptions about the underlying procedures that produced the transmitted signal or the distribution of symbols.

During each $t$-th frame transmission, the MS is characterized by its controllable phase configuration vector $\boldsymbol{\omega}(t)\in\mathbb{C}^{N_m\times 1}$ in the case of an RIS and $\boldsymbol{\omega}(t)\in\mathbb{C}^{N_{\rm SIM}\times 1}$ in the case of SIM, while its resulting response configuration is modeled for the idealized case of unit amplitude as $\boldsymbol{\varphi}(t)\triangleq\exp(-\jmath\boldsymbol{\omega}(t))$. The effects of the responses of the metamaterials in the cascaded channel are captured via the matrix $\boldsymbol{\Omega}(t)\in\mathbb{C}^{N_m\times N_m}$, the structure of which will be detailed in the next subsection. In the remainder of this work, $\boldsymbol{\Omega}(t)$, $\boldsymbol{\varphi}(t)$, and $\boldsymbol{\omega}(t)$ are used as generic notation to describe any of the RIS and SIM cases, while device-specific notation is introduced wherever needed. Using the above, the baseband received signal at the RX antennas is expressed as follows:

\begin{align}
\mathbf{y}(t) &\triangleq \left(\mathbf{H}_{\rm D}(t)+\mathbf{H}_{2}(t)\boldsymbol{\Omega}(t)\mathbf{H}_{1}^{\dagger}(t)\right)\boldsymbol{s}(t)+\tilde{\boldsymbol{n}} \tag{4}\\
&\triangleq \mathcal{T}(\boldsymbol{\mathcal{H}}(t),\boldsymbol{\varphi}(t),\boldsymbol{s}(t)), \tag{5}
\end{align}

where $\tilde{\boldsymbol{n}}\in\mathbb{C}^{N_r\times 1}$ denotes the AWGN at the RX, comprising independent and identically distributed (i.i.d.) samples drawn from the complex normal distribution $\mathcal{CN}(0,\sigma^2)$. In the sequel, we will be making use of the transmission function $\mathcal{T}(\boldsymbol{\mathcal{H}}(t),\boldsymbol{\varphi}(t),\boldsymbol{s}(t))$ as an abstraction, emphasizing that the wireless medium is treated as a programmable computation. In this definition, we use the notation $\boldsymbol{\mathcal{H}}(t)\triangleq\{\mathbf{H}_{\rm D}(t),\mathbf{H}_{1}(t),\mathbf{H}_{2}(t)\}$ for the instantaneous Channel State Information (CSI), which is assumed to be readily available to all system nodes. Obviously, this availability implies a recurring channel estimation phase at each $t$-th time step, which may be a challenging prerequisite (see [36] and references therein). Nevertheless, this assumption allows the focus of this work to be on the training and evaluation of the proposed MINN architecture. Logical future extensions could incorporate the channel estimation phase in the DNN transceiver modules themselves, following ISAC principles [12]. Alternatively, channel-agnostic variations of transceivers will also be proposed and evaluated in the following sections to illustrate the performance trade-offs when integrating MSs as over-the-air neural network layers.
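As a rough illustration of the transmission function $\mathcal{T}(\cdot)$ in (4), the following NumPy sketch evaluates the received signal for a single frame. The function name `transmit`, the helper `crandn`, the randomly drawn channel matrices, and all dimensions are assumptions made for this example only; they are not part of the system model above.

```python
import numpy as np

def crandn(rng, *shape):
    """Draw i.i.d. CN(0, 1) entries (hypothetical helper for this example)."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def transmit(H_D, H_1, H_2, Omega, s, sigma2, rng):
    """Received signal of Eq. (4): y = (H_D + H_2 Omega H_1^dagger) s + n."""
    H_eff = H_D + H_2 @ Omega @ H_1.conj().T             # effective N_r x N_t channel
    noise = np.sqrt(sigma2) * crandn(rng, H_D.shape[0])  # i.i.d. CN(0, sigma2) samples
    return H_eff @ s + noise

rng = np.random.default_rng(0)
N_t, N_r, N_m = 4, 2, 16                 # assumed toy dimensions
H_D = crandn(rng, N_r, N_t)              # TX-RX link
H_1 = crandn(rng, N_t, N_m)              # TX-MS link
H_2 = crandn(rng, N_r, N_m)              # MS-RX link
# RIS-type response: diagonal matrix of unit-amplitude phase shifts.
Omega = np.diag(np.exp(-1j * rng.uniform(0, 2 * np.pi, N_m)))
s = crandn(rng, N_t)                     # arbitrary transmitted vector
y = transmit(H_D, H_1, H_2, Omega, s, sigma2=0.0, rng=rng)
print(y.shape)
```

Setting `sigma2=0.0` isolates the deterministic part of (4), which makes it easy to check that the MS-induced link simply adds to the direct channel.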

IV-B RIS and SIM Models

Considering first an RIS, let its phase configuration vector at time $t$ be denoted as $\boldsymbol{\theta}(t)\triangleq[\theta_1(t),\theta_2(t),\ldots,\theta_{N_m}(t)]^{\top}$ ($\equiv\boldsymbol{\omega}(t)$ in (5)), so that the phase state of its $n$-th unit element ($n=1,2,\ldots,N_m$) is expressed as $\theta_n(t)\in[0,2\pi)$. Then, the induced weights at the response configuration vector are given as $\boldsymbol{\phi}(t)\triangleq\exp(-\jmath\boldsymbol{\theta}(t))$ ($\equiv\boldsymbol{\varphi}(t)$). In this case, using the diagonal matrix definition $\boldsymbol{\Phi}(t)\triangleq{\rm diag}(\boldsymbol{\phi}(t))\in\mathbb{C}^{N_m\times N_m}$, it holds that $\boldsymbol{\Omega}(t)\equiv\boldsymbol{\Phi}(t)$ in (4).
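The RIS mapping from phases to the response matrix can be sketched in a few lines. The function name `ris_response` and the four-element phase vector below are hypothetical and serve only to illustrate the unit-amplitude, diagonal structure of $\boldsymbol{\Phi}(t)$.

```python
import numpy as np

def ris_response(theta):
    """Map RIS phases theta (radians) to the vector phi and the diagonal matrix Phi."""
    phi = np.exp(-1j * np.asarray(theta, dtype=float))    # unit-amplitude weights
    return phi, np.diag(phi)

theta = np.array([0.0, np.pi / 2, np.pi, 3 * np.pi / 2])  # assumed 4-element RIS
phi, Phi = ris_response(theta)
print(np.round(phi, 3))
```

Because every diagonal entry has unit modulus, the resulting matrix is unitary, which reflects the idealized lossless phase-shifting assumption.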

Proceeding to the introduction of the SIM into the system model, we first assume that all $M$ layers ($m=1,2,\ldots,M$) are closely stacked and aligned parallel to each other, with their shared normal vector oriented perpendicular to the line connecting the TX and RX positions. Under this placement, the signal from the TX arrives at the first layer of the SIM and undergoes diffraction and controllable phase shifting by the consecutive $M-1$ layers, before being finally diffracted towards the RX. Let us define the distance between any consecutive MS layers as $d_M$ and the area of each unit element as $S_M$. Due to the compact placement of the layers, the layer-to-layer propagation can be accurately modeled via the Rayleigh-Sommerfeld diffraction equation [14, 13]. Namely, we define the propagation coefficient matrix from each $(m-1)$-th to the $m$-th layer as $\boldsymbol{\Xi}_m\in\mathbb{C}^{N_m\times N_m}$, so that its $(n,n')$-th entry ($n,n'=1,2,\ldots,N_m$) includes the propagation gain between the (arbitrarily ordered) $n$-th unit element of the $(m-1)$-th layer and the $n'$-th element of the next layer, as follows [13]:

$$[\boldsymbol{\Xi}_m]_{n,n'} \triangleq \frac{d_M S_M}{(d_{n,n'})^2}\left(\frac{1}{2\pi d_{n,n'}}-\frac{\jmath}{\lambda}\right)\exp\left(\frac{\jmath 2\pi d_{n,n'}}{\lambda}\right), \tag{6}$$

where $d_{n,n'}$ denotes the distance between the centers of the $n$-th and $n'$-th elements of the two consecutive layers, and $\lambda$ is the carrier wavelength.
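The propagation coefficients of (6) can be evaluated numerically once a layer geometry is fixed. The sketch below assumes, purely for illustration, that each layer is a uniform square grid whose elements have area equal to the squared element spacing, that consecutive layers are perfectly aligned, and that the phase term takes the usual Rayleigh-Sommerfeld form $2\pi d_{n,n'}/\lambda$. The function names, grid size, and carrier wavelength are all hypothetical.

```python
import numpy as np

def layer_positions(n_side, spacing):
    """Centers of an n_side x n_side square grid of elements (assumed geometry)."""
    idx = np.arange(n_side) * spacing
    xx, yy = np.meshgrid(idx, idx, indexing="ij")
    return np.stack([xx.ravel(), yy.ravel()], axis=1)     # shape (N_m, 2)

def propagation_matrix(n_side, spacing, d_M, wavelength):
    """Layer-to-layer coefficients following Eq. (6) for two aligned layers."""
    pos = layer_positions(n_side, spacing)
    # In-plane offset between every pair of elements of the two layers.
    lateral = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    d = np.sqrt(d_M**2 + lateral**2)                      # center-to-center distances
    S_M = spacing**2                                      # assumed element area
    amplitude = d_M * S_M / d**2
    return (amplitude * (1.0 / (2 * np.pi * d) - 1j / wavelength)
            * np.exp(1j * 2 * np.pi * d / wavelength))

wavelength = 0.01                 # assumed 30 GHz carrier, roughly 1 cm
spacing = wavelength / 2          # assumed half-wavelength element spacing
d_M = 5 * wavelength              # assumed inter-layer distance
Xi = propagation_matrix(n_side=4, spacing=spacing, d_M=d_M, wavelength=wavelength)
print(Xi.shape)
```

Under these assumptions the matrix is symmetric and identical for every pair of consecutive layers, so it only needs to be computed once per SIM geometry.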

Apart from diffracting, each $n$-th element of each $m$-th SIM layer introduces a controllable weight, similar to the RIS modeling, denoted as $[\boldsymbol{\psi}_m(t)]_n\triangleq\exp(-\jmath\vartheta^m_n(t))$. We also introduce the vector $\boldsymbol{\psi}_m(t)$, which includes the response configuration at the $m$-th layer, the overall response configuration $\boldsymbol{\psi}(t)=[\boldsymbol{\psi}^{\top}_1(t),\boldsymbol{\psi}^{\top}_2(t),\dots,\boldsymbol{\psi}^{\top}_M(t)]^{\top}\in\mathbb{C}^{N_{\rm SIM}\times 1}$, and the phase configuration vector of the SIM $\boldsymbol{\vartheta}(t)\triangleq[\vartheta^1_1(t),\dots,\vartheta^1_{N_m}(t),\dots,\vartheta^M_1(t),\dots,\vartheta^M_{N_m}(t)]^{\top}\in\mathbb{C}^{N_{\rm SIM}\times 1}$. By defining $\boldsymbol{\Psi}_m(t)\triangleq{\rm diag}(\boldsymbol{\psi}_m(t))$, the overall SIM response can be mathematically expressed via the following matrix [39]:

$$\boldsymbol{\Upsilon}(t)\triangleq\left(\prod_{m=M}^{2}\boldsymbol{\Psi}_m(t)\boldsymbol{\Xi}_m\right)\boldsymbol{\Psi}_1(t)\in\mathbb{C}^{N_m\times N_m}. \tag{7}$$

Note that, for $\boldsymbol{\Omega}(t)\equiv\boldsymbol{\Upsilon}(t)$, (4) holds for the SIM case.
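The ordering of the matrix product in (7) is easy to get wrong, so a short sketch may help. It reads the descending product as $\boldsymbol{\Psi}_M\boldsymbol{\Xi}_M\cdots\boldsymbol{\Psi}_2\boldsymbol{\Xi}_2\boldsymbol{\Psi}_1$, which is an interpretation of the notation rather than a statement from the text. The function name `sim_response` is hypothetical, and the propagation matrices are random placeholders standing in for coefficients that would normally come from (6).

```python
import numpy as np

def sim_response(vartheta, Xi_list):
    """Eq. (7): Upsilon = Psi_M Xi_M ... Psi_2 Xi_2 Psi_1.

    vartheta: (M, N_m) array of phases, one row per layer.
    Xi_list:  list [Xi_2, ..., Xi_M] of (N_m, N_m) propagation matrices.
    """
    M = vartheta.shape[0]
    assert len(Xi_list) == M - 1
    Upsilon = np.diag(np.exp(-1j * vartheta[0]))       # Psi_1: first layer response
    for m in range(1, M):                              # layers 2, ..., M
        Psi_m = np.diag(np.exp(-1j * vartheta[m]))
        Upsilon = Psi_m @ Xi_list[m - 1] @ Upsilon     # diffract, then phase-shift
    return Upsilon

rng = np.random.default_rng(0)
M, N_m = 3, 4                                          # assumed toy SIM size
vartheta = rng.uniform(0, 2 * np.pi, (M, N_m))
# Placeholder propagation matrices (random, for illustration only).
Xi_list = [rng.standard_normal((N_m, N_m)) + 1j * rng.standard_normal((N_m, N_m))
           for _ in range(M - 1)]
Upsilon = sim_response(vartheta, Xi_list)
print(Upsilon.shape)
```

With a single layer and no propagation matrices, the function returns a diagonal phase matrix, mirroring the RIS case.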

Revisiting the previously introduced generic notations of $\boldsymbol{\Omega}(t)$ and $\boldsymbol{\omega}(t)$, they can now be expressed concretely for each of the RIS and SIM cases as $\boldsymbol{\Omega}(t)\in\{\boldsymbol{\Phi}(t),\boldsymbol{\Upsilon}(t)\}$ and $\boldsymbol{\omega}(t)\in\{\boldsymbol{\theta}(t),\boldsymbol{\vartheta}(t)\}$. In the rest of this paper, $\boldsymbol{\Omega}(t)$, $\boldsymbol{\omega}(t)$, and $\boldsymbol{\varphi}(t)$ will be used when the underlying operations are agnostic of the MS type, while $\boldsymbol{\Phi}(t)$, $\boldsymbol{\Upsilon}(t)$, and their respective vectors will be explicitly utilized when the operations need to discriminate between RISs and SIM.

V Metasurface-Integrated Neural Networks

Figure 2: Block diagram and computation flow for the proposed E2E GOC framework where the metasurface-parametrizable channel acts as an intermediate DNN component. Both the cases of reconfigurable and static metasurfaces are included, entailing different procedures during the forward and backward passes.

V-A Transceiver Modules

As initially discussed in Section II-C, EI entails two computational modules, collocated at the transceiver endpoints. To implement the “infer-while-transmitting” methodology, the TX utilizes an encoder DNN, $f^{\rm e}_{\boldsymbol{w}_{\rm e}}(\cdot)$, providing the output $\boldsymbol{s}(t)$, while the RX operates a decoder DNN, $f^{\rm d}_{\boldsymbol{w}_{\rm d}}(\cdot)$, deriving the output $\hat{\boldsymbol{o}}(t)$. These blocks are tasked with performing compression, encoding and decoding, error resilience and correction, and potential transmit and receive beamforming alongside probabilistic inference. The exact layer architecture of these models is purposely left unspecified at this stage as a practitioner’s choice, depending, in general, on: i) the nature of the wireless environment; ii) the type of input and target data; iii) the computational capabilities of the transceivers’ hardware; as well as iv) the current state-of-the-art. We only note that different sub-modules may be used for each of the above operations, while, typically for uplink scenarios, $f^{\rm d}_{\boldsymbol{w}_{\rm d}}(\cdot)$ can be implemented with larger DNN structures due to the constant power supply at base stations. In addition, regardless of the choice of neural network, we impose a final fixed post-processing step at the encoder’s output $\boldsymbol{s}(t)$ to satisfy the TX system’s power budget, as follows:

$$\boldsymbol{s}(t)\leftarrow\sqrt{P}\,\frac{\boldsymbol{s}(t)}{\|\boldsymbol{s}(t)\|}. \tag{8}$$
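The normalization step of (8) amounts to rescaling the encoder output so that its squared norm equals $P$. A minimal sketch follows; the function name `normalize_power` and the small `eps` guard against an all-zero vector are additions for this example and do not appear in the original formulation.

```python
import numpy as np

def normalize_power(s, P, eps=1e-12):
    """Rescale s so that ||s||^2 = P, as in Eq. (8).

    eps is an assumed safeguard against division by zero.
    """
    norm = np.linalg.norm(s)
    return np.sqrt(P) * s / max(norm, eps)

s_raw = np.array([3 + 4j, 0, 0])          # toy encoder output with norm 5
s = normalize_power(s_raw, P=4.0)
print(s, np.linalg.norm(s) ** 2)
```

Since the operation is differentiable almost everywhere, it can sit at the end of the encoder without obstructing gradient-based training.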

Considering the concrete input arguments of the encoder functions, two different variations may be defined depending on whether CSI is available to each of the endpoints.

V-A1 Channel-Agnostic Transceivers

An instance of the data variables $\boldsymbol{x}(t)$ is observed at the TX and passed to the encoder to construct the transmitted signal, while the decoder DNN observes the received signal and produces an estimate of the unseen target variable $\boldsymbol{o}(t)$; this can be described as:

\begin{align}
\boldsymbol{s}(t) &= f^{\rm e}_{\boldsymbol{w}_{\rm e}}(\boldsymbol{x}(t)), \tag{9}\\
\hat{\boldsymbol{o}}(t) &= f^{\rm d}_{\boldsymbol{w}_{\rm d}}(\mathbf{y}(t)). \tag{10}
\end{align}

Since no CSI is used by the endpoints, we highlight the similarity of this design to source-only coding, despite the fact that the encoder may need to add redundancy to the transmitted signal, which is traditionally considered channel coding. Evidently, both processes need to guarantee that the inference procedure performs sufficiently well irrespective of the current channel conditions, which may be a demanding requirement. Nevertheless, not requiring CSI is a strong simplification of the system architecture, and this variation is therefore included in our investigations later in this paper.
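To make the data flow of (9) and (10) concrete, the sketch below wires a toy encoder and decoder around a fixed, noise-free channel. Everything in it is an assumption for illustration: the single-layer networks, the random untrained weights `W_e` and `W_d`, the real-to-complex packing, the softmax output, and the dimensions. The text deliberately leaves the actual architectures unspecified.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, N_t, N_r, n_classes, P = 8, 4, 2, 3, 1.0    # assumed toy dimensions

W_e = rng.standard_normal((2 * N_t, d_in))        # hypothetical encoder weights w_e
W_d = rng.standard_normal((n_classes, 2 * N_r))   # hypothetical decoder weights w_d

def encoder(x):
    """Eq. (9): map the input x to a complex transmit vector s, then apply Eq. (8)."""
    z = np.tanh(W_e @ x)
    s = z[:N_t] + 1j * z[N_t:]                    # pack 2*N_t reals into N_t complex values
    return np.sqrt(P) * s / np.linalg.norm(s)

def decoder(y):
    """Eq. (10): map the received vector y to class probabilities."""
    feats = np.concatenate([y.real, y.imag])      # complex -> real features
    logits = W_d @ feats
    p = np.exp(logits - logits.max())             # numerically stable softmax
    return p / p.sum()

x = rng.standard_normal(d_in)
H = (rng.standard_normal((N_r, N_t)) + 1j * rng.standard_normal((N_r, N_t))) / np.sqrt(2)
s = encoder(x)
y = H @ s                                         # noise-free stand-in for the channel
o_hat = decoder(y)
print(o_hat)
```

Note that neither function receives the channel matrix, which is precisely what distinguishes this variant from the channel-aware one.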

V-A2 Channel-Aware Transceivers

Let us assume a quasi-static fading channel and a channel estimation procedure that takes place within each $t$-th channel frame before data transmission, based on which the TX and RX modules obtain accurate estimates of the CFR matrices $\boldsymbol{\mathcal{H}}(t)$. Each module may receive $\boldsymbol{\mathcal{H}}(t)$ as an additional input, yielding respectively the following representations for the encoder/decoder DNNs:

\begin{align}
\boldsymbol{s}(t) &= f^{\rm e}_{\boldsymbol{w}_{\rm e}}(\boldsymbol{x}(t),\boldsymbol{\mathcal{H}}(t)), \tag{11}\\
\hat{\boldsymbol{o}}(t) &= f^{\rm d}_{\boldsymbol{w}_{\rm d}}(\mathbf{y}(t),\boldsymbol{\mathcal{H}}(t)). \tag{12}
\end{align}

Endowing the TX/RX modules with CSI leads to more resilient transmission schemes that closely resemble JSCC [43]. The main difference is that JSCC focuses on estimating $\boldsymbol{x}(t)$, while EI deals with approximating $\boldsymbol{o}(t) = l(\boldsymbol{x}(t))$.$^{1}$ In this paper, we assume that channel estimation takes place transparently before every transmission, both during training and inference, and results in noise-free estimates of $\boldsymbol{\mathcal{H}}(t)$. Accounting for noisy estimates, or even incorporating the estimation into the procedure under ISAC paradigms, leads to exciting research directions, which we will study in future work.

$^{1}$ Those two problems can be considered equivalent by setting the mapping function $l(\cdot)$ to be the identity function, yielding $\boldsymbol{o}(t) = \boldsymbol{x}(t)$, and adopting the MSE as the objective function $J_{\rm str}(\cdot)$. As a result, EI is a more general problem formulation than communications, which focus on data reconstruction.
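As a minimal sketch of how the channel-agnostic form (9) and the channel-aware form (11) differ at the input level, the following NumPy snippet builds a toy encoder with and without CSI. The single dense layer, the layer sizes, the unit-power normalization, and the choice of feeding the channel as stacked real and imaginary parts of a single matrix are illustrative assumptions, not details specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def csi_features(H):
    """Flatten a complex channel matrix into a real feature vector."""
    return np.concatenate([H.real.ravel(), H.imag.ravel()])

def encoder(x, W, H=None):
    """One dense layer mapping the input (and optionally CSI) to complex symbols.

    H=None gives the channel-agnostic form s = f_e(x) of eq. (9);
    passing H gives the channel-aware form s = f_e(x, H) of eq. (11).
    """
    z = x if H is None else np.concatenate([x, csi_features(H)])
    a = np.tanh(W @ z)                # real activations, length 2 * n_tx
    n_tx = a.size // 2
    s = a[:n_tx] + 1j * a[n_tx:]      # pair activations into complex symbols
    return s / np.linalg.norm(s)      # unit transmit power (assumed constraint)

n_in, n_tx, n_rx = 8, 2, 3
x = rng.standard_normal(n_in)
H = (rng.standard_normal((n_rx, n_tx))
     + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)

# Hypothetical weights: the channel-aware layer simply has a wider input.
W_agnostic = rng.standard_normal((2 * n_tx, n_in))
W_aware = rng.standard_normal((2 * n_tx, n_in + 2 * n_rx * n_tx))

s_agnostic = encoder(x, W_agnostic)    # eq. (9)
s_aware = encoder(x, W_aware, H)       # eq. (11)
```

In this sketch the only architectural difference between the two variants is the width of the first layer, which is one simple way to realize the "additional input" described above; the decoder in (10) and (12) could be extended in the same manner.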

V-B Control Module for Reconfigurable Metasurfaces

When CSI is available, as in most wireless communication settings, the MS changes its response configuration at every transmission frame to optimize the system’s objective [44]. To incorporate this mode of operation into our E2E architecture, $\boldsymbol{\varphi}(t)$ is treated as a controllable variable that is the output of a third DNN module. Specifically, we define the metasurface controller as the following neural network:

$$\boldsymbol{\varphi}(t) = f^{\rm m}_{\boldsymbol{w_{\rm m}}}(\boldsymbol{\mathcal{H}}(t)), \tag{13}$$

imposing that the final layer performs the operation $\boldsymbol{\varphi}(t) = \exp(-\jmath\boldsymbol{\hat{\vartheta}})$. As stated before, either $\boldsymbol{\phi}(t)$ or $\boldsymbol{\psi}(t)$ may be the actual output of the module, depending on the selected type of MS; however, we keep the abstract notation $\boldsymbol{\varphi}(t)$ to provide a general framework. Under this viewpoint, the MS is a controllable entity that can be adapted dynamically to offer favorable wave-domain computation at every channel realization. This treatment allows for fine-grained control over the reprogrammability of the environment, at the cost of an additional neural network module and the associated hardware requirements. Plugging the three trained modules from expressions (9)–(13) into the received signal in (4), we can derive the E2E inference model $\boldsymbol{\hat{o}}(t) = f^{\rm r}_{\boldsymbol{w_{\rm r}}}(\boldsymbol{x}(t), \boldsymbol{\mathcal{H}}(t))$, for the two channel knowledge cases, as follows:

$$\boldsymbol{\hat{o}}(t) = \begin{cases} \underbrace{f^{\rm d}_{\boldsymbol{w_{\rm d}}}\Big(\mathcal{T}\big(\boldsymbol{\mathcal{H}}(t), f^{\rm m}_{\boldsymbol{w_{\rm m}}}(\boldsymbol{\mathcal{H}}(t)), f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t))\big)\Big),}_{\text{channel-agnostic transceivers}} \\ \underbrace{f^{\rm d}_{\boldsymbol{w_{\rm d}}}\Big(\mathcal{T}\big(\boldsymbol{\mathcal{H}}(t), f^{\rm m}_{\boldsymbol{w_{\rm m}}}(\boldsymbol{\mathcal{H}}(t)), f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t), \boldsymbol{\mathcal{H}}(t))\big), \boldsymbol{\mathcal{H}}(t)\Big),}_{\text{channel-aware transceivers}} \end{cases} \tag{14}$$

where the overall trainable weights of this reconfigurable architecture are represented as $\boldsymbol{w_{\rm r}} \triangleq \{\boldsymbol{w_{\rm d}}, \boldsymbol{w_{\rm e}}, \boldsymbol{w_{\rm m}}\}$, which can be trained together under the same objective function and backward passes, as will be detailed in Section VI.
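The controller of (13) can be sketched as a small network whose last operation is the complex exponential of a real phase vector. The hidden-layer size, the `tanh` activation, the single-matrix channel, and the weight names `W1` and `W2` below are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def ms_controller(H, W1, W2):
    """Map CSI to a unit-modulus metasurface response, as in eq. (13)."""
    h = np.concatenate([H.real.ravel(), H.imag.ravel()])  # real-valued CSI features
    hidden = np.tanh(W1 @ h)
    theta_hat = W2 @ hidden           # unconstrained real phase shifts
    return np.exp(-1j * theta_hat)    # final layer: phi = exp(-j * theta_hat)

n_elems, n_hidden = 16, 32            # hypothetical sizes
H = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
W1 = rng.standard_normal((n_hidden, 2 * H.size))
W2 = rng.standard_normal((n_elems, n_hidden))

phi = ms_controller(H, W1, W2)
```

Because the exponential of a purely imaginary argument always has unit magnitude, every output of this sketch satisfies the unit-modulus property by construction, so gradient updates on `W1` and `W2` need no additional projection step.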

V-C Metasurfaces with Trainable Fixed Response

As an alternative approach, one may choose to directly learn a fixed configuration for the MS; let us denote this as $\boldsymbol{\bar{\omega}}$. While the training process may iteratively evaluate multiple candidate values for $\boldsymbol{\bar{\omega}}$, once the training is complete, the learned configuration is loaded onto the MS to maintain a constant (static) response configuration $\boldsymbol{\varphi}(t) \equiv \boldsymbol{\bar{\varphi}} \triangleq \exp(-\jmath\boldsymbol{\bar{\omega}})$ over time, irrespective of the channel conditions or input data. This description is more akin to the idea that the effective phase configurations are treated similarly to DNN weights, as they too remain fixed after the completion of the training procedure, and are used to perform the same computational operations over varying input instances. To this end, denote the fixed phase configurations of the RIS and SIM as $\boldsymbol{\bar{\theta}}$ and $\boldsymbol{\bar{\vartheta}}$, respectively. The training procedure optimizes $\boldsymbol{\bar{\omega}}$ directly, i.e., the trainable weights are $\boldsymbol{w_{\rm s}} \triangleq \{\boldsymbol{w_{\rm d}}, \boldsymbol{w_{\rm e}}, \boldsymbol{\bar{\omega}}\}$; therefore, the E2E static architecture can now be expressed as follows:

$$\boldsymbol{\hat{o}}(t) = \begin{cases} \underbrace{f^{\rm d}_{\boldsymbol{w_{\rm d}}}\Big(\mathcal{T}\big(\boldsymbol{\mathcal{H}}(t), \boldsymbol{\bar{\varphi}}, f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t))\big)\Big),}_{\text{channel-agnostic transceivers}} \\ \underbrace{f^{\rm d}_{\boldsymbol{w_{\rm d}}}\Big(\mathcal{T}\big(\boldsymbol{\mathcal{H}}(t), \boldsymbol{\bar{\varphi}}, f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t), \boldsymbol{\mathcal{H}}(t))\big), \boldsymbol{\mathcal{H}}(t)\Big).}_{\text{channel-aware transceivers}} \end{cases} \tag{15}$$

It is noted that, while reconfigurable MSs may offer more precise control in shaping the exact form of $\mathcal{T}(\cdot)$, the extra hidden layers required by the inclusion of the MS controller module may well hinder the training capabilities of the proposed MINN compared to the current variation. Besides, assuming wireless systems of reasonably limited variability, such as Line-of-Sight (LoS) dominant environments with fixed transceivers, static MS configurations may offer satisfactory performance. The next section addresses the systemic requirements for all variations of this section, while performance trade-offs are investigated under our numerical evaluations.
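To make the static variant concrete, the sketch below keeps one trainable phase vector, maps it to a unit-modulus response, and reuses that response across several channel realizations. The cascaded single-layer channel `transmit` is a hypothetical stand-in for $\mathcal{T}(\cdot)$, whose actual definition is given earlier in the paper; the dimensions and Rayleigh-type channel draws are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def crandn(*shape):
    """Circularly-symmetric complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def transmit(H_tx, H_rx, phi, s, noise_std=0.0):
    """Toy stand-in for T(.): y = H_rx diag(phi) H_tx s + noise."""
    y = H_rx @ (phi * (H_tx @ s))     # element-wise product implements diag(phi)
    return y + noise_std * crandn(*y.shape)

n_tx, n_ms, n_rx = 2, 16, 3           # hypothetical dimensions
omega_bar = rng.uniform(0, 2 * np.pi, n_ms)   # trainable phases, frozen after training
phi_bar = np.exp(-1j * omega_bar)             # static response, reused at every t

s = crandn(n_tx)                      # some transmitted signal
outputs = []
for t in range(3):                    # three independent channel realizations
    H_tx, H_rx = crandn(n_ms, n_tx), crandn(n_rx, n_ms)
    outputs.append(transmit(H_tx, H_rx, phi_bar, s))
```

Here `phi_bar` plays the role of an ordinary weight vector: it is the same object in every loop iteration, whereas a reconfigurable design would recompute the response from each new channel.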

The final aspect to consider about the proposed MINN framework is its theoretical approximation guarantees, especially its universal approximation capabilities. Presenting a concrete analysis of this property lies well outside the scope of this paper, as it involves proving that functionals derived from the modeling of Section IV are discriminatory and dense in the space of continuous complex functions as in [20], while also accounting for the stochastic nature of both the fading components and the AWGN. Nevertheless, we argue that, since our MINN architecture includes two typical DNNs at the start and the end of the cascaded computations, which are in fact universal approximators, the E2E architecture should also possess this property, at least in the infinite-SNR regime. Intuitively, as long as $\mathcal{T}(\cdot)$ does not lose information due to the noisy channel, the decoder DNN should be capable of approximating the optimal decoder of $\mathbf{y}(t)$ to obtain $\boldsymbol{s}(t)$, so that the channel is completely negated, and the E2E MINN reduces to $f^{\rm d}_{\boldsymbol{w_{\rm d}}}\big(f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t), \boldsymbol{\mathcal{H}}(t)), \boldsymbol{\mathcal{H}}(t)\big)$, which is indeed a universal approximator consisting of standard DNN layers.

VI MINN Training and Deployment

To perform neural network training at any of the wireless communication system nodes, a separate data collection step is carried out to generate a set of $|\mathcal{D}|$ labeled data instances: $\mathcal{D} \triangleq \{(\boldsymbol{x}_i, \boldsymbol{o}_i)\}_{i=1}^{|\mathcal{D}|}$. Let us also assume the availability of a set of $|\mathcal{C}|$ channel sample estimates (in respective coherent time instances): $\mathcal{C} \triangleq \{\boldsymbol{\mathcal{H}}(t)\}_{t=1}^{|\mathcal{C}|}$, not necessarily equally spaced. In this paper, we make the assumption that the channel realizations are conditionally independent$^{2}$ from $\mathcal{D}$’s data instances. It is noted that, while this is a rather lenient assumption, it is crucial in permitting the evaluation of the expectation in $\mathcal{OP}_{\rm EI}$’s objective via i.i.d. Monte Carlo samples.

$^{2}$ In certain scenarios, the data realizations and the statistical properties of the channel can be statistically dependent. For example, in a target detection system where the observations $\boldsymbol{x}_i$ contain sensory inputs, while $\boldsymbol{o}_i$ is a binary variable indicating the existence of a target in the area of interest, deep fading may be encountered more often when a target is subject to signal blockages. In such cases, the two collection processes of channel measurements and observed data must be synchronous, and a more detailed formulation of the EI objective is required. However, the inference problem itself may be potentially computationally easier, since the CSI observation provides additional information regarding the target value.
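The role of the independence assumption can be illustrated numerically. The following sketch, which uses a toy scalar model and synthetic data that are not part of the paper, compares the exhaustive average over all data-channel pairs with a Monte Carlo estimate obtained by drawing data and channel indices independently and uniformly.

```python
import numpy as np

rng = np.random.default_rng(3)

n_data, n_chan = 50, 40
x = rng.standard_normal(n_data)            # toy scalar inputs x_i
o = 2.0 * x                                # toy targets o_i
h = rng.rayleigh(scale=1.0, size=n_chan)   # toy scalar channel gains H(t)
w = 1.5                                    # a fixed toy parameter

def loss(i, t):
    """Squared error of a toy model o_hat = w * h[t] * x[i]."""
    return (o[i] - w * h[t] * x[i]) ** 2

# Exhaustive average over every (channel, data) pair: right-hand side of eq. (16).
full = np.mean([[loss(i, t) for i in range(n_data)] for t in range(n_chan)])

# Independent uniform draws from D and C, as permitted by the independence assumption.
n_draws = 200_000
i_s = rng.integers(0, n_data, n_draws)
t_s = rng.integers(0, n_chan, n_draws)
mc = np.mean(loss(i_s, t_s))               # Monte Carlo estimate of the same average
```

If data and channels were statistically coupled, as in the target detection example of the footnote, the pairs would have to be drawn jointly rather than from two separate index streams.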

VI-A Backpropagation Over the Wireless Channel

The training procedure can be described as a variation of the standard gradient descent approach for neural network training, with the inclusion of channel samples. To provide a comprehensive framework, let us use the generic parameter vector $\boldsymbol{w_{\rm k}}$ with $\mathbf{k} \in \{\mathbf{r}, \mathbf{s}\}$, taking the form of either $\boldsymbol{w_{\rm r}}$ or $\boldsymbol{w_{\rm s}}$ depending on whether reconfigurable or static MSs are used. Similar to standard deep learning practices, our E2E MINN architecture may be optimized using SGD over the collected data and channel instances. Specifically, let us express the loss for a data-channel pair as $J_{\rm str}(\boldsymbol{o}_i, \boldsymbol{\hat{o}}_i) = J_{\rm str}(\boldsymbol{o}_i, f^{\rm k}_{\boldsymbol{w_{\rm k}}}(\boldsymbol{x}_i, \boldsymbol{\mathcal{H}}(t)))$ to explicitly show the dependence of the loss function on the instantaneous wireless channel conditions. To this end, leveraging the previously described conditional independence assumption, $\mathcal{OP}_{\rm EI}$’s objective can be approximated as follows:

$$\mathbb{E}_{\boldsymbol{\mathcal{H}}}[J(\boldsymbol{w_{\rm k}})] \cong \frac{1}{|\mathcal{C}||\mathcal{D}|}\sum_{t=1}^{|\mathcal{C}|}\sum_{i=1}^{|\mathcal{D}|} J_{\rm str}(\boldsymbol{o}_i, f^{\rm k}_{\boldsymbol{w_{\rm k}}}(\boldsymbol{x}_i, \boldsymbol{\mathcal{H}}(t))). \tag{16}$$

In the online version of SGD, at every time $t$, one may select a single data point and channel instance to evaluate (16), and accordingly update the parameter vector as follows:

$$\boldsymbol{w_{\rm k}} \leftarrow \boldsymbol{w_{\rm k}} - \eta\nabla_{\boldsymbol{w_{\rm k}}} J_{\rm str}(\boldsymbol{o}(t), f^{\rm k}_{\boldsymbol{w_{\rm k}}}(\boldsymbol{x}(t), \boldsymbol{\mathcal{H}}(t))), \tag{17}$$

for some chosen learning rate $\eta$, with the gradient at each case being defined by one of the two following expressions:

$$\nabla J_{\rm str} = \underbrace{\bigg[\Big[\frac{\partial J_{\rm str}}{\partial \boldsymbol{w_{\rm d}}}\Big]^{\top}, \Big[\frac{\partial J_{\rm str}}{\partial \boldsymbol{w_{\rm e}}}\Big]^{\top}, \Big[\frac{\partial J_{\rm str}}{\partial \boldsymbol{w_{\rm m}}}\Big]^{\top}\bigg]^{\top}}_{\text{reconfigurable metasurface}} \tag{18}$$
$$\nabla J_{\rm str} = \underbrace{\bigg[\Big[\frac{\partial J_{\rm str}}{\partial \boldsymbol{w_{\rm d}}}\Big]^{\top}, \Big[\frac{\partial J_{\rm str}}{\partial \boldsymbol{w_{\rm e}}}\Big]^{\top}, \Big[\frac{\partial J_{\rm str}}{\partial \boldsymbol{\bar{\omega}}}\Big]^{\top}\bigg]^{\top}}_{\text{metasurface with trainable fixed response}}. \tag{19}$$

Under the i.i.d. sampling assumption, the consecutive evaluations of the gradient of the objective function in (17) at each training instance $t$ are unbiased estimators of the true gradient of the objective in (16). Therefore, following the stochastic approximation framework [45], the repetition of this procedure will converge to the true value of the expectation with probability $1$ up to a precision of $O(\eta)$ around it, when using a constant step size [46]. The complete training procedure is detailed in Algorithm 1, which supports all variations of channel-agnostic/-aware transceivers, static/reconfigurable MS controllers, and RIS/SIM structures. Lines 6-9 implement our MINN architecture as defined in (14) and (15). Naturally, batched gradient descent versions may be used alongside more elaborate gradient updates, such as momentum, weight decay (regularization), and adaptive rates [47]; however, such implementation details have been left out for ease of presentation.

Algorithm 1 Training of the Proposed E2E MINN
1: Construct the DNN weight vector $\boldsymbol{w_{\rm k}}$ as one of the following:
      i) $\boldsymbol{w_{\rm k}} = {\rm concat}(\boldsymbol{w_{\rm d}}, \boldsymbol{w_{\rm e}}, \boldsymbol{w_{\rm m}})$.    $\triangleright$ $\boldsymbol{w_{\rm k}} \leftarrow \boldsymbol{w_{\rm r}}$
      ii) $\boldsymbol{w_{\rm k}} = {\rm concat}(\boldsymbol{w_{\rm d}}, \boldsymbol{w_{\rm e}}, \boldsymbol{\bar{\omega}})$.    $\triangleright$ $\boldsymbol{w_{\rm k}} \leftarrow \boldsymbol{w_{\rm s}}$
2: Initialize $\boldsymbol{w_{\rm k}}$ randomly.
3: for $t = 1, 2, \ldots,$ until convergence do
4:     Sample $(\boldsymbol{x}(t), \boldsymbol{o}(t))$ from $\mathcal{D}$.
5:     Sample $\boldsymbol{\mathcal{H}}(t)$ from $\mathcal{C}$.
6:     Compute $\boldsymbol{s}(t)$ using one of the following:
          i) $\boldsymbol{s}(t) = f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t))$.    $\triangleright$ eq. (9)
          ii) $\boldsymbol{s}(t) = f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t), \boldsymbol{\mathcal{H}}(t))$.    $\triangleright$ eq. (11)
7:     Compute $\boldsymbol{\varphi}(t)$ using one of the following:
          i) $\boldsymbol{\varphi}(t) = f^{\rm m}_{\boldsymbol{w_{\rm m}}}(\boldsymbol{\mathcal{H}}(t))$.    $\triangleright$ eq. (13)
          ii) $\boldsymbol{\varphi}(t) = \boldsymbol{\bar{\varphi}}$.
8:     Transmit $\boldsymbol{s}(t)$ to receive $\mathbf{y}(t)$:
          $\mathbf{y}(t) = \mathcal{T}(\boldsymbol{\mathcal{H}}(t), \boldsymbol{\varphi}(t), \boldsymbol{s}(t))$.    $\triangleright$ eq. (5)
9:     Compute $\boldsymbol{\hat{o}}(t)$ using one of the following:
          i) $\boldsymbol{\hat{o}}(t) = f^{\rm d}_{\boldsymbol{w_{\rm d}}}(\mathbf{y}(t))$.    $\triangleright$ eq. (10)
          ii) $\boldsymbol{\hat{o}}(t) = f^{\rm d}_{\boldsymbol{w_{\rm d}}}(\mathbf{y}(t), \boldsymbol{\mathcal{H}}(t))$.    $\triangleright$ eq. (12)
10:    Set $\boldsymbol{w_{\rm k}} \leftarrow \boldsymbol{w_{\rm k}} - \eta\nabla_{\boldsymbol{w_{\rm k}}} J_{\rm str}(\boldsymbol{o}(t), f^{\rm k}_{\boldsymbol{w_{\rm k}}}(\boldsymbol{x}(t), \boldsymbol{\mathcal{H}}(t)))$.
11: end for
12: return $\boldsymbol{w_{\rm k}}$
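The loop of Algorithm 1 can be exercised end to end on a toy problem. The sketch below instantiates the static-MS, channel-agnostic variant with a linear encoder and decoder, a scalar regression target, a hypothetical LoS-dominant channel, and a squared-error loss; all of these are assumptions for illustration. Central finite differences stand in for the backpropagated gradient purely to keep the example dependency-free; in practice an automatic differentiation tool would compute it.

```python
import numpy as np

rng = np.random.default_rng(4)
n_ms, n_in = 4, 2

def crandn(n):
    return (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

# Hypothetical LoS-dominant links: fixed unit-modulus part plus weak scattering.
los_tx = np.exp(1j * rng.uniform(0, 2 * np.pi, n_ms))
los_rx = np.exp(1j * rng.uniform(0, 2 * np.pi, n_ms))

def sample_channel():
    return los_tx + 0.1 * crandn(n_ms), los_rx + 0.1 * crandn(n_ms)

def sample_data():
    x = rng.standard_normal(n_in)
    return x, x[0] + 0.5 * x[1]                      # toy target o = l(x)

def forward(w, x, H, noise):
    """Static-MS, channel-agnostic MINN: decoder(T(H, phi_bar, encoder(x)))."""
    w_e, w_d, omega_bar = w[:2], w[2:4], w[4:]
    h_tx, h_rx = H
    s = w_e @ x                                      # linear encoder, eq. (9)
    phi_bar = np.exp(-1j * omega_bar)                # fixed MS response
    y = np.mean(h_rx * phi_bar * h_tx) * s + noise   # toy stand-in for T(.)
    return w_d[0] * y.real + w_d[1] * y.imag         # linear decoder, eq. (10)

def loss(w, x, o, H, noise):
    return (o - forward(w, x, H, noise)) ** 2

def grad(w, *sample, eps=1e-6):
    """Central finite differences, standing in for backpropagation."""
    g = np.zeros_like(w)
    for k in range(w.size):
        d = np.zeros_like(w)
        d[k] = eps
        g[k] = (loss(w + d, *sample) - loss(w - d, *sample)) / (2 * eps)
    return g

def draw():
    """Independently sample a data pair, a channel, and a noise realization."""
    x, o = sample_data()
    return x, o, sample_channel(), 0.01 * crandn(1)[0]

eval_set = [draw() for _ in range(500)]              # fixed set for monitoring only

def mean_loss(w):
    return np.mean([loss(w, *smp) for smp in eval_set])

# w_k = concat(w_e, w_d, omega_bar), initialized randomly.
w = np.concatenate([0.3 * rng.standard_normal(4), rng.uniform(0, 2 * np.pi, n_ms)])
loss_before = mean_loss(w)
eta = 0.02
for t in range(3000):                                # main loop of Algorithm 1
    w = w - eta * grad(w, *draw())                   # SGD update, eq. (17)
loss_after = mean_loss(w)
```

Comparing `loss_before` and `loss_after` on the held-out pairs gives a simple check that the joint update of encoder weights, decoder weights, and metasurface phases reduces the objective under fading.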

The crux of the training procedure is the gradient update mechanism of (17). Since (14) and (15) are differentiable operations with respect to $\boldsymbol{w}_{\rm s}$ or $\boldsymbol{w}_{\rm r}$, the partial derivatives may be computed via automatic differentiation tools, by applying the chain rule on the underlying computational graph. Nevertheless, for the sake of completeness, we provide the derivations for the partial derivatives of the various modules, treating the implementation-defined derivatives of the classical neural network components (i.e., $\partial f_{\boldsymbol{w}_{\mathrm{e}}}^{\mathrm{e}}/\partial\boldsymbol{w}_{\mathrm{e}}$, $\partial f_{\boldsymbol{w}_{\mathrm{d}}}^{\mathrm{d}}/\partial\boldsymbol{w}_{\mathrm{d}}$, $\partial f_{\boldsymbol{w}_{\mathrm{m}}}^{\mathrm{m}}/\partial\boldsymbol{w}_{\mathrm{m}}$, and $\partial f_{\boldsymbol{w}_{\mathrm{d}}}^{\mathrm{d}}/\partial\mathbf{y}(t)$) as known. In what follows, we make use of the identity $\mathrm{vec}(\mathbf{A}\mathbf{X}\mathbf{B})=(\mathbf{B}^{\top}\otimes\mathbf{A})\,\mathrm{vec}(\mathbf{X})$ and of the fact that, for an $n$-element vector $\boldsymbol{x}$ and $\boldsymbol{X}={\rm diag}(\boldsymbol{x})$, the vectorization of $\boldsymbol{X}$ can be expressed using matrix operations as ${\rm vec}(\boldsymbol{X})=\boldsymbol{D}\boldsymbol{x}$, where $\boldsymbol{D}\triangleq[\boldsymbol{D}_{1},\boldsymbol{D}_{2},\ldots,\boldsymbol{D}_{n}]$ is an $n^{2}\times n$ matrix used for selecting the diagonal elements, in which $\boldsymbol{D}_{i}$ is an $n\times n$ matrix with binary elements having $1$ at its $(i,i)$-th element and $0$ elsewhere.
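As a quick numerical sanity check of the two identities above, the following NumPy sketch may be helpful. It assumes the column-major (column-stacking) convention for ${\rm vec}(\cdot)$ and interprets the blocks $\boldsymbol{D}_i$ as being stacked vertically, which is what the stated $n^2\times n$ size implies; the helper names `vec` and `diag_selection_matrix` and all sizes are illustrative choices, not part of the paper.

```python
import numpy as np

def vec(X):
    """Column-major vectorization: stack the columns of X."""
    return X.reshape(-1, order="F")

def diag_selection_matrix(n):
    """Build the n^2 x n binary matrix D with vec(diag(x)) = D @ x.

    Block D_i is n x n with a single 1 at (i, i); the blocks are
    stacked vertically (an assumption consistent with the n^2 x n size).
    """
    blocks = []
    for i in range(n):
        D_i = np.zeros((n, n))
        D_i[i, i] = 1.0
        blocks.append(D_i)
    return np.vstack(blocks)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
X = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
x = rng.standard_normal(4) + 1j * rng.standard_normal(4)

# Identity 1: vec(A X B) = (B^T kron A) vec(X)
identity_ok = np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))

# Identity 2: vec(diag(x)) = D x
D = diag_selection_matrix(4)
selection_ok = np.allclose(vec(np.diag(x)), D @ x)
```

Both flags evaluate to `True` for these random inputs. Note that the first identity uses the plain transpose $\mathbf{B}^{\top}$, not the conjugate transpose, even for complex matrices.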

For the case of a reconfigurable MS (either an RIS or SIM), $\boldsymbol{\hat{o}}$ is computed via (14). Applying backpropagation yields the following derivatives:

\begin{align}
\frac{\partial J_{\rm str}}{\partial \boldsymbol{w}_{\rm d}} &= \frac{\partial J_{\rm str}}{\partial \hat{\boldsymbol{o}}(t)} \cdot \frac{\partial f_{\boldsymbol{w}_{\rm d}}^{\rm d}}{\partial \boldsymbol{w}_{\rm d}}, \tag{20}\\
\frac{\partial J_{\rm str}}{\partial \boldsymbol{w}_{\rm m}} &= \frac{\partial J_{\rm str}}{\partial \hat{\boldsymbol{o}}(t)} \cdot \frac{\partial f_{\boldsymbol{w}_{\rm d}}^{\rm d}}{\partial \mathbf{y}(t)} \cdot \frac{\partial \mathbf{y}(t)}{\partial f^{\rm m}_{\boldsymbol{w}_{\rm m}}} \cdot \frac{\partial f^{\rm m}_{\boldsymbol{w}_{\rm m}}}{\partial \boldsymbol{w}_{\rm m}}, \tag{21}\\
\frac{\partial J_{\rm str}}{\partial \boldsymbol{w}_{\rm e}} &= \frac{\partial J_{\rm str}}{\partial \hat{\boldsymbol{o}}(t)} \cdot \frac{\partial f_{\boldsymbol{w}_{\rm d}}^{\rm d}}{\partial \mathbf{y}(t)} \cdot \frac{\partial \mathbf{y}(t)}{\partial f_{\boldsymbol{w}_{\rm e}}^{\rm e}} \cdot \frac{\partial f_{\boldsymbol{w}_{\rm e}}^{\rm e}}{\partial \boldsymbol{w}_{\rm e}}, \tag{22}
\end{align}

where $\partial J_{\rm str}/\partial\hat{\boldsymbol{o}}(t)$ is the gradient of the problem-defined loss function with respect to the network’s output, which, for the example of the CE loss of (2), is computed as $-\boldsymbol{o}(t)/\boldsymbol{\hat{o}}(t)$, with the division taken element-wise. The remaining terms are defined as follows:

\begin{align}
\frac{\partial \mathbf{y}(t)}{\partial f_{\boldsymbol{w}_{\rm e}}^{\rm e}} &= \mathbf{H}_{2}(t)\boldsymbol{\Phi}(t)\mathbf{H}_{1}^{\dagger}(t) + \mathbf{H}_{\rm D}(t), \tag{23}\\
\frac{\partial \mathbf{y}(t)}{\partial f^{\rm m}_{\boldsymbol{w}_{\rm m}}} &= \frac{\partial \mathbf{y}(t)}{\partial \boldsymbol{\varphi}(t)} = \big((\boldsymbol{s}^{\top}(t)\mathbf{H}_{1}^{\ast}(t)) \otimes \mathbf{H}_{2}(t)\big)\boldsymbol{D}, \tag{24}
\end{align}

with $\boldsymbol{D}$ being the $N_{m}^{2}\times N_{m}$ binary selection matrix.
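The Jacobians in (23) and (24) can be checked numerically. The sketch below assumes a noiseless received signal of the form $\mathbf{y}=(\mathbf{H}_2\,{\rm diag}(\boldsymbol{\varphi})\,\mathbf{H}_1^{\dagger}+\mathbf{H}_{\rm D})\boldsymbol{s}$, which is inferred from the structure of (23) rather than quoted from the system model; the sizes `N_t`, `N_r`, `N_m`, the helper `crandn`, and the random channel values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N_t, N_r, N_m = 2, 3, 4  # hypothetical sizes: TX antennas, RX antennas, MS elements

def crandn(*shape):
    """Random complex Gaussian array (illustrative channel/signal values)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

H1 = crandn(N_t, N_m)  # H1^dagger maps the N_t-dim signal onto the N_m elements
H2 = crandn(N_r, N_m)
HD = crandn(N_r, N_t)
s = crandn(N_t)
phi = np.exp(-1j * rng.uniform(0.0, 2.0 * np.pi, N_m))

def received(s_vec, phi_vec):
    """Assumed noiseless model: y = (H2 diag(phi) H1^dagger + HD) s."""
    return (H2 @ np.diag(phi_vec) @ H1.conj().T + HD) @ s_vec

# Selection matrix D with vec(diag(phi)) = D @ phi (column-major vec).
D = np.zeros((N_m * N_m, N_m))
D[np.arange(N_m) * (N_m + 1), np.arange(N_m)] = 1.0

J_s = H2 @ np.diag(phi) @ H1.conj().T + HD          # Jacobian of (23)
J_phi = np.kron((s @ H1.conj())[None, :], H2) @ D   # Jacobian of (24)

# y is linear in s and affine in phi, so Jacobian-times-perturbation
# reproduces the change in y exactly (up to floating-point error).
ds, dphi = crandn(N_t), crandn(N_m)
jac_s_ok = np.allclose(received(s + ds, phi) - received(s, phi), J_s @ ds)
jac_phi_ok = np.allclose(received(s, phi + dphi) - received(s, phi), J_phi @ dphi)
```

Under these assumptions both checks pass. The product with $\boldsymbol{D}$ simply keeps the $N_m$ columns of the Kronecker product that correspond to the diagonal of $\boldsymbol{\Phi}$, so in practice the $N_r\times N_m^2$ intermediate matrix need not be formed explicitly.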

In the fixed-configuration RIS case, $\boldsymbol{\hat{o}}$ is computed via (15), and the trainable configuration may be concretely expressed as $\boldsymbol{\bar{\omega}}\equiv\boldsymbol{\bar{\theta}}$; the phase shift vector is $\boldsymbol{\bar{\varphi}}\equiv\boldsymbol{\bar{\phi}}$. The quantities $\partial J_{\rm str}/\partial\boldsymbol{w}_{\rm e}$ and $\partial J_{\rm str}/\partial\boldsymbol{w}_{\rm d}$ remain the same, while the gradient with respect to the configuration is given by:

\begin{align}
\frac{\partial J_{\rm str}}{\partial \boldsymbol{\bar{\omega}}} = \frac{\partial J_{\rm str}}{\partial \boldsymbol{\bar{\theta}}} = \frac{\partial J_{\rm str}}{\partial \hat{\boldsymbol{o}}(t)} \cdot \frac{\partial f_{\boldsymbol{w}_{\rm d}}^{\mathrm{d}}}{\partial \mathbf{y}(t)} \cdot \frac{\partial \mathbf{y}(t)}{\partial \boldsymbol{\bar{\phi}}} \cdot \frac{\partial \boldsymbol{\bar{\phi}}}{\partial \boldsymbol{\bar{\theta}}}, \tag{25}
\end{align}

where $\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\phi}}$ can be computed as in (24), while $\partial\boldsymbol{\bar{\phi}}/\partial\boldsymbol{\bar{\theta}}=-\jmath\exp(-\jmath\boldsymbol{\bar{\theta}})$.
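To illustrate the last two factors of (25), the following sketch compares the stated phase derivative against central finite differences and chains it with a Jacobian of the form (24). It assumes the phase-shift convention $\boldsymbol{\bar{\phi}}=\exp(-\jmath\boldsymbol{\bar{\theta}})$, as implied by the derivative above; the matrix `J_phi` is a random stand-in for (24), and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N_r, N_m = 3, 4  # hypothetical sizes
theta = rng.uniform(0.0, 2.0 * np.pi, N_m)

# Stand-in for the Jacobian of (24); random values, for illustration only.
J_phi = rng.standard_normal((N_r, N_m)) + 1j * rng.standard_normal((N_r, N_m))

def phase_response(th):
    """Assumed convention: phi = exp(-j * theta), applied element-wise."""
    return np.exp(-1j * th)

# Element-wise derivative; as a Jacobian it is the diagonal matrix diag(dphi_dtheta).
dphi_dtheta = -1j * np.exp(-1j * theta)
J_theta = J_phi * dphi_dtheta[None, :]  # equals J_phi @ np.diag(dphi_dtheta)

eps = 1e-6
d = rng.standard_normal(N_m)  # random real perturbation direction

# Central differences for the phase map and for the chained map theta -> J_phi @ phi.
fd_phi = (phase_response(theta + eps) - phase_response(theta - eps)) / (2 * eps)
fd_y = J_phi @ (phase_response(theta + eps * d) - phase_response(theta - eps * d)) / (2 * eps)

derivative_ok = np.allclose(fd_phi, dphi_dtheta, atol=1e-7)
chain_ok = np.allclose(fd_y, J_theta @ d, atol=1e-6)
```

Since the phase map acts element-wise, multiplying by its Jacobian amounts to scaling the columns of the preceding factor, which is why broadcasting is used here instead of an explicit diagonal matrix.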

For the fixed-configuration SIM case, the trainable configuration is expressed as $\boldsymbol{\bar{\omega}}\equiv\boldsymbol{\bar{\vartheta}}$, while the static phase shifts are $\boldsymbol{\bar{\varphi}}\equiv\boldsymbol{\bar{\psi}}$, and again, $\boldsymbol{\hat{o}}$ is computed via (15). Similarly, the following derivations hold:

\begin{align}
\frac{\partial J_{\rm str}}{\partial \boldsymbol{\bar{\omega}}} = \frac{\partial J_{\rm str}}{\partial \boldsymbol{\bar{\vartheta}}} = \frac{\partial J_{\rm str}}{\partial \hat{\boldsymbol{o}}(t)} \cdot \frac{\partial f_{\boldsymbol{w}_{\rm d}}^{\mathrm{d}}}{\partial \mathbf{y}(t)} \cdot \frac{\partial \mathbf{y}(t)}{\partial \boldsymbol{\bar{\psi}}} \cdot \frac{\partial \boldsymbol{\bar{\psi}}}{\partial \boldsymbol{\bar{\vartheta}}}, \tag{26}
\end{align}

where $\partial J_{\rm str}/\partial\hat{\boldsymbol{o}}(t)$ and $\partial f_{\boldsymbol{w}_{\mathrm{d}}}^{\mathrm{d}}/\partial\mathbf{y}(t)$ are the same as before, while, similarly, $\partial\boldsymbol{\bar{\psi}}/\partial\boldsymbol{\bar{\vartheta}}=-\jmath\exp(-\jmath\boldsymbol{\bar{\vartheta}})$. Since $\mathcal{T}(\cdot)$ now involves the SIM system model, $\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\psi}}$ requires further derivations.
Following the same procedure as in (24), and denoting the response matrix of each of the $M$ SIM elements as $\boldsymbol{\bar{\Psi}}_{m}\triangleq{\rm diag}(\boldsymbol{\bar{\psi}}_{m})$, it is deduced that $\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\psi}}=\big[[\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\psi}}_{1}]^{\top},[\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\psi}}_{2}]^{\top},\ldots,[\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\psi}}_{M}]^{\top}\big]^{\top}$, with:

𝐲(t)𝝍¯𝒎={(𝒔(t)𝐇1(t)