Over-the-Air Edge Inference via End-to-End Metasurfaces-Integrated Artificial Neural Networks

Kyriakos Stylianopoulos, Paolo Di Lorenzo,
and George C. Alexandropoulos

This work has been supported by the Smart Networks and Services Joint Undertaking projects TERRAMETA, 6G-DISAC, and 6G-GOALS under the European Union’s Horizon Europe research and innovation programme under Grant Agreement numbers 101097101, 101139130, and 101139232 respectively. TERRAMETA also includes top-up funding by UK Research and Innovation under the UK government’s Horizon Europe funding guarantee. K. Stylianopoulos and G. C. Alexandropoulos are with the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, 16122 Athens, Greece (e-mails: {kstylianop, alexandg}@di.uoa.gr). P. Di Lorenzo is with the Department of Information Engineering, Electronics, and Telecommunications, Sapienza University, Italy and CNIT, Italy (e-mail: paolo.dilorenzo@uniroma1.it).

Abstract

In the Edge Inference (EI) paradigm, where a Deep Neural Network (DNN) is split across the transceivers to wirelessly communicate goal-defined features in solving a computational task, the wireless medium has been commonly treated as a source of noise. In this paper, motivated by the emerging technologies of Reconfigurable Intelligent Surfaces (RISs) and Stacked Intelligent Metasurfaces (SIM) that offer programmable propagation of wireless signals, either through controllable reflections or diffractions, we optimize the RIS/SIM-enabled smart wireless environment as a means of over-the-air computing, resembling the operations of DNN layers. We propose a framework of Metasurfaces-Integrated Neural Networks (MINNs) for EI, presenting its modeling, training through a backpropagation variation for fading channels, and deployment aspects. The overall end-to-end DNN architecture is general enough to admit RIS and SIM devices, through controllable reconfiguration before each transmission or fixed configurations after training, while both channel-aware and channel-agnostic transceivers are considered. Our numerical evaluation showcases metasurfaces to be instrumental in performing image classification under link budgets that impede conventional communications or metasurface-free systems. It is demonstrated that our MINN framework can significantly simplify EI requirements, achieving near-optimal performance with $50$ dB lower testing signal-to-noise ratio compared to training, even without transceiver channel knowledge.

Index Terms:

Edge learning, reconfigurable intelligent surface, stacked intelligent metasurfaces, goal-oriented communications, deep learning, over-the-air computing.

I Introduction

In emerging Device-to-Device (D2D) and Internet of Things (IoT) networks, where distributed devices are expected to support various functionalities, such as Integrated Sensing and Communications (ISAC) [1], stringent energy efficiency requirements impose limitations on their communication and computation capabilities. Various sub-networks are envisioned to enable a wide range of high-level applications and services, such as digital twinning through object recognition and computational imaging, or indoor positioning [2], each tied to a diverse set of requirements to be fulfilled. As a result, it is crucial for the D2D system design to take the underlying application into consideration. To this end, devising novel communication stacks that are tailored to the application and break the traditional network layer taxonomy comes with reduced implementation overheads.

Goal-Oriented Communications (GOC) [3] is a framework that is gaining popularity in D2D systems, since transmissions of only the necessary goal-specific information need to take place, reducing messaging overheads and simplifying system architecture [4]. In fact, when performing Edge Inference (EI) tasks, where the Receiver (RX) wishes to obtain only an estimated label of the data the Transmitter (TX) actually observes and transmits, semantic and GOC approaches that split the layers of a Deep artificial Neural Network (DNN) at the endpoints provide significant benefits in terms of communication requirements [5]. According to those approaches, low-dimensional feature vectors, that are outputs of intermediate DNN layers, are transmitted over the channel [6, 7, 8]. The motivation behind such practices is that GOC deviates from the standard Shannon-type communications [9], in that the goal of the system is an arbitrary computational target function, rather than objectives derived from the Mutual Information (MI) that call for bit-wise reconstruction of the input data. In this way, DNNs are employed, similar to the Joint Source Channel Coding (JSCC) paradigm, and trained to capture the joint or conditional data-channel-target distributions.

Current literature has adopted GOC and EI for a wide variety of problems and wireless systems (see [6, 8, 10, 11] and the references therein). However, a common practice across those works is to treat the wireless environment as a source of noise, whose effects need to be negated at the RX side. The rapid developments of Meta-Surface (MS) technologies for precise Radio-Frequency (RF) domain control open the potential for the wireless propagation medium to be dynamically reconfigurable via reflective Reconfigurable Intelligent Surfaces (RISs) [12] or diffractive Stacked Intelligent Metasurfaces (SIM) [13] with low operational costs. Such MSs have been incorporated in GOC and semantic systems to reduce the transceiver hardware complexity [14] or to enhance the system’s rate [15]. Under another research direction, DNN implementations entirely in the RF domain, capitalizing on MS-based solutions, have been devised for controlled laboratory environments [16, 17, 18].

Figure 1: A Metasurfaces-Integrated artificial Neural Network (MINN) performing Edge Inference (EI) by controlling the wireless propagation channel, which is treated as one or more hidden network layers.

In this paper, motivated by the elaborate wireless medium control offered by MS technologies [12, 13], in conjunction with their analog computational capabilities, we design MS-controllable wireless channels that perform Over-the-Air Computation (OAC). Under this design, the constituent metamaterials are treated as hidden artificial neurons that control the wireless medium to perform multi-layer non-linear signal processing toward solving EI tasks, as illustrated in Fig. 1. The main contributions of this paper are summarized as follows.

1. We present a novel generic End-to-End (E2E) DNN framework, titled Metasurfaces-Integrated Neural Network (MINN), that admits variations of controllable MSs or MSs of trainable, yet fixed response configurations, of either RISs or SIM and with or without channel knowledge. We detail the framework’s modeling, backpropagation-based training, and deployment from both a theoretical and a system architecture perspective.

2. We elaborate on the critical role of reconfiguration of wireless channels as a degree of freedom in optimizing DNN-based EI models that capitalize on programmable OAC, instead of treating the wireless propagation environment as a source of noise. To this end, learning RISs and SIM configurations promises substantial potential.

3. We present an extensive numerical evaluation of the proposed MINN framework on an image classification task. Compared with conventional communication systems, as well as with a baseline in the absence of an MS, the evaluation showcases that MS-enabled OAC allows for successful EI with much lower transmit power requirements, even without channel knowledge at the communication ends.

The remainder of the paper is structured as follows. Section II details the relevant pieces of theory regarding EI as well as the possible improvements brought by controlling the wireless environment, while Section III provides a comprehensive review of the relevant literature. Section IV includes the models used for the considered MS technologies: RIS and SIM, as well as the model for the received signal for both cases. The proposed MINN architecture for EI is presented in Section V, whereas Section VI details its training procedure and discusses the deployment and network considerations. Our numerical investigations are presented in Section VII. Finally, Section VIII includes the paper’s concluding remarks.

Notation: Vectors, matrices, and sets are expressed in lowercase bold (e.g., $\boldsymbol{x}$), uppercase bold (e.g., $\boldsymbol{X}$), and uppercase calligraphic typefaces (e.g., $\mathcal{X}$ or $\boldsymbol{\mathcal{X}}$), respectively. Apart from $\boldsymbol{X}^{\ast}$, $\boldsymbol{X}^{\dagger}$, and $\boldsymbol{X}^{\top}$ that denote the conjugate, conjugate transpose, and transpose of $\boldsymbol{X}$, superscripts and subscripts are used to denote different versions of variables or enumeration over collections of variables depending on the context. $[\boldsymbol{x}]_{i}$ and $[\boldsymbol{X}]_{i,j}$ are used to denote the $i$-th element of $\boldsymbol{x}$ and the $(i,j)$-th element of $\boldsymbol{X}$, respectively. We use variations of the notation $f_{\boldsymbol{w}}(\cdot)$ to represent neural network functions that are parameterized by their weight vector $\boldsymbol{w}$, noting that $f$ may be seen either as a function of its arbitrary input variables (during inference) or as a function of $\boldsymbol{w}$ (during optimization). $|\mathcal{X}|$ represents the cardinality of the set $\mathcal{X}$, ${\rm diag}(\boldsymbol{x})$ creates a square matrix with the elements of $\boldsymbol{x}$ placed along its main diagonal, and ${\rm vec}(\boldsymbol{X})$ transforms $\boldsymbol{X}$ into a column vector in a row-by-row fashion. $\otimes$ denotes the Kronecker product, $\{a,b\}$ expresses a set or collection containing $a$ and $b$, while $\mathds{1}_{\texttt{cond}}$ is the indicator function that equals $1$ if condition cond holds and $0$ otherwise. Finally, $\mathbb{E}_{\boldsymbol{X}}[\cdot]$ is the expectation operator with respect to the distribution of the random $\boldsymbol{X}$, and $\jmath\triangleq\sqrt{-1}$.

II Prerequisites

II-A Probabilistic Inference

Given an input observation $\boldsymbol{x}$, the objective of an inference procedure is to compute and output an associated target attribute value $\boldsymbol{o}=l(\boldsymbol{x})$. The mapping function $l(\boldsymbol{x})$ is considered unknown and intractable to express analytically; therefore, inference involves approximating this relationship through examples of $(\boldsymbol{x},\boldsymbol{o})$ tuples. From a probabilistic perspective, one may fit the conditional Probability Density Function (PDF) of the target $p(\boldsymbol{o}|\boldsymbol{x})$ on the available data. In typical settings, only point estimates are required; thus, the problem reduces to predicting the most likely value of $\boldsymbol{o}$ for a given observation. In the machine learning regime, the previous distribution (or its point estimates) can be approximated by a $\boldsymbol{w}$-parameterized model $\hat{\boldsymbol{o}}\triangleq f_{\boldsymbol{w}}(\boldsymbol{x})$ that outputs its prediction of the target value $\hat{\boldsymbol{o}}$ for a given input. Consequently, to solve the inference problem, the parameter values $\boldsymbol{w}$ need to be optimized. This is achieved by collecting a data set of training tuples $\mathcal{D}\triangleq\{(\boldsymbol{x}_{i},\boldsymbol{o}_{i})\}_{i=1}^{|\mathcal{D}|}$, and then minimizing an amortized cost function over the training instances:

$$J(\boldsymbol{w})\triangleq\frac{1}{|\mathcal{D}|}\sum_{i=1}^{|\mathcal{D}|}J_{\rm str}(\boldsymbol{o}_{i},\hat{\boldsymbol{o}}_{i}),\quad\text{where }\hat{\boldsymbol{o}}_{i}=f_{\boldsymbol{w}}(\boldsymbol{x}_{i}). \tag{1}$$

The per-instance cost function values $J_{\rm str}(\boldsymbol{o}_{i},\hat{\boldsymbol{o}}_{i})$ quantify the error in the model’s predictions and they often have a probabilistic interpretation. For instance, in classification settings where each observation belongs to one of $d_{\rm cl}$ classes indexed by the natural number $c$, the target value is defined via one-hot encoding as $\boldsymbol{o}=[\mathds{1}_{c=1},\mathds{1}_{c=2},\dots,\mathds{1}_{c=d_{\rm cl}}]^{\top}\in\mathbb{N}^{d_{\rm cl}\times 1}$. The Cross Entropy (CE) loss function is defined as follows:

$$J_{\rm CE}(\boldsymbol{o}_{i},\hat{\boldsymbol{o}}_{i})\triangleq-\sum_{j=1}^{d_{\rm cl}}[\boldsymbol{o}_{i}]_{j}\log[\hat{\boldsymbol{o}}_{i}]_{j}. \tag{2}$$

Minimizing (2) over $\mathcal{D}$ is equivalent to performing maximum likelihood estimation of $\boldsymbol{w}$ on $p(\boldsymbol{o}|\boldsymbol{x})$ under the assumption that this distribution is categorical (multinoulli) over the $d_{\rm cl}$ classes. Similarly, in regression tasks, where $\boldsymbol{o},\hat{\boldsymbol{o}}\in\mathbb{R}^{d_{\rm out}\times 1}$, the Mean Squared Error (MSE) metric, defined as $\frac{1}{d_{\rm out}}\|\boldsymbol{o}-\hat{\boldsymbol{o}}\|^{2}$, implies that $p(\boldsymbol{o}|\boldsymbol{x})$ is a Gaussian conditional PDF.
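As an illustration of (1) and (2), the following minimal Python sketch evaluates the per-instance CE and MSE losses and averages them over a toy data set. The helper names (`cross_entropy`, `mse`, `amortized_cost`, `one_hot`), the constant `toy_model`, and the clipping constant `eps` are assumptions introduced here for illustration only and are not part of the paper.

```python
import math

def cross_entropy(o, o_hat, eps=1e-12):
    """Per-instance CE loss of (2): -sum_j [o]_j * log [o_hat]_j."""
    # eps guards against log(0); it is an implementation assumption, not part of (2).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(o, o_hat))

def mse(o, o_hat):
    """Per-instance MSE: (1 / d_out) * ||o - o_hat||^2."""
    return sum((t - p) ** 2 for t, p in zip(o, o_hat)) / len(o)

def amortized_cost(dataset, model, per_instance_loss):
    """Cost of (1): average of the per-instance losses over the data set."""
    return sum(per_instance_loss(o, model(x)) for x, o in dataset) / len(dataset)

def one_hot(c, d_cl):
    """One-hot target for the class index c in {1, ..., d_cl}."""
    return [1 if j == c else 0 for j in range(1, d_cl + 1)]

# Hypothetical toy model: ignores its input and always predicts the same distribution.
def toy_model(x):
    return [0.7, 0.2, 0.1]

# Two hypothetical training tuples (x_i, o_i) with three classes.
toy_data = [([0.0], one_hot(1, 3)), ([1.0], one_hot(2, 3))]
cost = amortized_cost(toy_data, toy_model, cross_entropy)
```

With a one-hot target, only the term of the true class survives in the sum of (2), so the CE loss of each instance is the negative log-probability that the model assigns to the correct class.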

II-B Artificial Neural Networks

While a wide range of parameterized families of functions is available to use as $f_{\boldsymbol{w}}(\cdot)$ , the current state of the art considers DNN models, capitalizing on their diverse benefits including high expressivity, diverse selection of architectural components for different tasks, and parallelizable computations that offer real-time computational cost during inference on pertinent hardware. Mathematically, let a neural network be expressed as a composition of $L$ layers such that:

$$f_{\boldsymbol{w}}(\boldsymbol{x})\triangleq f^{L}_{\boldsymbol{w}^{L}}\left(f^{L-1}_{\boldsymbol{w}^{L-1}}\left(\ldots f^{1}_{\boldsymbol{w}^{1}}(\boldsymbol{x})\ldots\right)\right), \tag{3}$$

so that the $l$-th layer ($l=1,2,\ldots,L$) is parameterized by $\boldsymbol{w}^{l}$, and its output $\bar{\boldsymbol{o}}^{l}$ constitutes the input to the $(l+1)$-th layer; this process can be represented recursively by $\bar{\boldsymbol{o}}^{l}\triangleq f^{l}_{\boldsymbol{w}^{l}}(\bar{\boldsymbol{o}}^{l-1})$, where we have set $\bar{\boldsymbol{o}}^{0}=\boldsymbol{x}$. For convenience, let us also define the vector $\boldsymbol{w}\triangleq[(\boldsymbol{w}^{1})^{\top},(\boldsymbol{w}^{2})^{\top},\dots,(\boldsymbol{w}^{L})^{\top}]^{\top}$.
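The recursion above can be sketched in a few lines of Python. The dense-plus-tanh layer used here is only an illustrative assumption (the text leaves the individual layer functions unspecified), and the names `dense_tanh` and `forward` are hypothetical.

```python
import math

def dense_tanh(w):
    """Build one layer f^l with parameters w = (weights, biases).

    The dense map followed by tanh is an illustrative choice, not the paper's layer.
    """
    weights, biases = w

    def layer(o_prev):
        return [math.tanh(sum(a * v for a, v in zip(row, o_prev)) + b)
                for row, b in zip(weights, biases)]

    return layer

def forward(layers, x):
    """Evaluate (3) via the recursion o^l = f^l(o^{l-1}), starting from o^0 = x."""
    outputs = [x]
    for f in layers:
        outputs.append(f(outputs[-1]))
    return outputs  # outputs[l] holds the l-th intermediate representation

# Hypothetical two-layer network (L = 2) acting on a two-dimensional input.
w1 = ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])
w2 = ([[1.0, 1.0]], [0.0])
layers = [dense_tanh(w1), dense_tanh(w2)]
x = [1.0, 1.0]
outputs = forward(layers, x)
o_hat = outputs[-1]
```

Keeping every intermediate output in a list makes the representations $\bar{\boldsymbol{o}}^{l}$ directly accessible, which is convenient when one of them is later selected for transmission.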

We leave aside precise definitions of individual layer functions, as there has been an impressive body of work focusing on layers that perform data-specific computations and capture high-level patterns in data sets [19]. Nevertheless, we highlight the fact that each $f^{l}_{\boldsymbol{w}^{l}}(\cdot)$ is required to be a non-linear function, and, in fact, it is often required to be discriminatory or sigmoidal [20]. Those properties guarantee that artificial neural networks with at least two layers (of potentially infinite width) are universal approximators and can, therefore, be used to approximate any arbitrary $\boldsymbol{o}=l(\boldsymbol{x})$ mapping, i.e., $f_{\boldsymbol{w}}(\boldsymbol{x})\cong l(\boldsymbol{x})$ [20].

Besides theoretical guarantees, the problem of obtaining $\boldsymbol{w}$ values that perform successful inference can be efficiently solved by substituting the neural network expression from (3) into (2) (using classification as an example), and subsequently into (1). The latter may be solved through one of the many variations of the Stochastic Gradient Descent (SGD) approach by computing the gradients $\partial J(\boldsymbol{w})/\partial\boldsymbol{w}$ , and propagating them through the layers of $f_{\boldsymbol{w}}(\boldsymbol{x})$ by taking advantage of the chain rule; this leads to the celebrated backpropagation algorithm [21] that constitutes the backbone of deep learning.
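To make the chain-rule computation concrete, the sketch below backpropagates a squared-error loss through a scalar two-layer network $\hat{o}=w_{2}\tanh(w_{1}x)$ and applies one gradient step. The network, the loss, the learning rate, and the numerical values are toy assumptions chosen for illustration; they are not the models or the training variant used in this paper.

```python
import math

def forward(w1, w2, x):
    """Two scalar layers: hidden = tanh(w1 * x), o_hat = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    return hidden, w2 * hidden

def loss(w1, w2, x, o):
    """Squared-error loss for a single (x, o) tuple."""
    return (o - forward(w1, w2, x)[1]) ** 2

def backprop(w1, w2, x, o):
    """Gradients of the loss w.r.t. (w1, w2), propagated from the output backwards."""
    hidden, o_hat = forward(w1, w2, x)
    d_o_hat = 2.0 * (o_hat - o)               # dJ / d o_hat
    grad_w2 = d_o_hat * hidden                # last layer: d o_hat / d w2 = hidden
    d_hidden = d_o_hat * w2                   # error signal passed to the first layer
    grad_w1 = d_hidden * (1.0 - hidden ** 2) * x  # tanh'(a) = 1 - tanh(a)^2
    return grad_w1, grad_w2

def sgd_step(w1, w2, x, o, lr=0.1):
    """One plain gradient-descent update on a single training tuple."""
    g1, g2 = backprop(w1, w2, x, o)
    return w1 - lr * g1, w2 - lr * g2

# Hypothetical starting point and training tuple.
w1, w2, x, o = 0.5, -0.3, 1.0, 1.0
loss_before = loss(w1, w2, x, o)
new_w1, new_w2 = sgd_step(w1, w2, x, o)
loss_after = loss(new_w1, new_w2, x, o)
```

The error signal `d_hidden` is computed once and reused by the earlier layer, which is the step that lets backpropagation scale to deep compositions such as (3).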

II-C Edge Inference (EI)

The rising field of EI considers the implementation and training of inference tasks over a wireless communication network. In that regard, consider an uplink setup where a multi-antenna TX observes $\boldsymbol{x}$ and wishes to convey its estimation $\hat{\boldsymbol{o}}$ of the target value $\boldsymbol{o}$ to an RX. At first glance, this task appears straightforward to implement within the existing frameworks of machine learning and wireless communications under the following two paradigm options:

II-C1 “Infer-then-transmit”

The TX may first compute $\hat{\boldsymbol{o}}=f_{\boldsymbol{w}}(\boldsymbol{x})$ and then transmit $\hat{\boldsymbol{o}}$ over the wireless medium upon performing source coding (i.e., data compression) and channel coding (i.e., modulation and beamforming) to derive the transmission signal $\boldsymbol{s}$. The above coding computations ensure that a satisfactory communication rate is achievable by the system so that $\hat{\boldsymbol{o}}$ may be reconstructed at the RX’s side via decoding the received signal and decompressing the data. In modern high-complexity wireless systems, those operations require their own optimization procedures and necessitate additional computational costs as well as channel state information knowledge. For most inference tasks, the target values are of a much smaller dimension than the observations. As a result, this option has small rate requirements, but comes at a computational cost on the TX’s side, as the device must be endowed with hardware capable of executing DNN computations locally. Since EI tasks are envisioned specifically for cases where IoT or other lightweight devices with low complexity and minute power consumption send messages to a collection/fusion center, the assumption of computational capabilities for local DNN inference is rather optimistic.

II-C2 “Transmit-then-infer”

This converse approach is also possible: the TX performs source and channel coding but no DNN-based EI computations, so that the original observation $\boldsymbol{x}$ is transmitted instead. The RX performs decoding to obtain the data point, which is then fed to its local $f_{\boldsymbol{w}}(\cdot)$ to perform inference. While in uplink settings, it is reasonable to assume that the RX has sufficient power and hardware capabilities to support a DNN, the rate required for transmitting the original observation may impose high link budget demands that are not readily facilitated.

A compromise between the above two options may be obtained by exploiting the sequential nature of the DNN structure appearing in (3) [22]. To this end, the following EI option is possible:

II-C3 “Infer-while-transmitting” (DNN splitting)

The intermediate representations $\bar{\boldsymbol{o}}^{l}$ $\forall l=1,2,\ldots,L-1$ can be of arbitrary dimensions, and it is not uncommon to devise architectures with one or more small-sized intermediate layers. In fact, various deep learning models, such as auto-encoders [23] and U-Net [24], are designed specifically to contain such bottleneck layers as a form of compression to keep only relevant information. From that perspective, one may choose to split the DNN so that its first $L^{\prime}<L$ layers reside at the TX. In this way, $\bar{\boldsymbol{o}}^{L^{\prime}}$ is transmitted over the network and is then passed through the $(L^{\prime}+1)$-th up to the $L$-th layer in sequence at the RX.
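A minimal sketch of this splitting is given below, assuming for simplicity an ideal (noiseless, distortion-free) link between the TX and the RX. The function names `split_dnn` and `run`, as well as the three toy layers, are hypothetical and serve only to show where the network is cut.

```python
def split_dnn(layers, l_prime):
    """Split an L-layer network into a TX part (first L' layers) and an RX part."""
    if not 0 < l_prime < len(layers):
        raise ValueError("the split point must satisfy 0 < L' < L")
    return layers[:l_prime], layers[l_prime:]

def run(layers, o):
    """Apply the given layers in sequence to the input o."""
    for f in layers:
        o = f(o)
    return o

# Hypothetical three-layer network (L = 3) whose second layer is a bottleneck.
layers = [
    lambda v: [2 * a for a in v],   # layer 1: element-wise scaling
    lambda v: [sum(v)],             # layer 2: one-dimensional bottleneck
    lambda v: [v[0] + 1],           # layer 3: output layer
]

x = [1, 2, 3]
tx_layers, rx_layers = split_dnn(layers, l_prime=2)
feature = run(tx_layers, x)        # intermediate output that the TX would transmit
o_hat = run(rx_layers, feature)    # the RX completes the inference
```

In this toy case the transmitted feature has a single entry whereas the observation has three, which is the kind of rate reduction that bottleneck layers are meant to provide.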

Evidently, the latter paradigm of EI is the most flexible, since one has the option to balance trade-offs between computation and communication resources between the TX and RX. In the remainder of this paper, we will be adopting this paradigm of EI as the default case and provide further elaboration and extensions, revisiting the two extreme previous cases as baselines in our numerical comparisons.

II-D Computational Considerations

When performing DNN splitting over a wireless channel with realistic characteristics (i.e., large- and small-scale fading, as well as Additive White Gaussian Noise (AWGN)), the transmitted output $\bar{\boldsymbol{o}}^{L^{\prime}}$ of the $L^{\prime}$-th DNN layer will be distorted when arriving at the RX. By representing the channel state with an abstract random variable $\boldsymbol{\mathcal{H}}$, one needs to minimize the same $\boldsymbol{w}$-parameterized objective function as before, while accounting for the stochastic nature of the wireless environment (i.e., taking the expectation with respect to the distribution of $\boldsymbol{\mathcal{H}}$). That is, one needs to solve:

$$\mathcal{OP}_{\rm EI}:\ \min_{\boldsymbol{w}}\ \mathbb{E}_{\boldsymbol{\mathcal{H}}}[J(\boldsymbol{w})],$$

where each instantaneous channel realization affects the value of the objective function by distorting the value of the transmitted $\bar{\boldsymbol{o}}^{L^{\prime}}$. The precise definition of $\boldsymbol{\mathcal{H}}$ will be given in Section IV, while the considered fading distributions are discussed during the numerical evaluation in Section VII.
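Since the expectation over the channel is rarely available in closed form, a common way to evaluate such an objective is a Monte Carlo average over sampled channel realizations. The sketch below illustrates this for fixed parameters, under the assumption of a scalar Rayleigh-fading coefficient $h\sim\mathcal{CN}(0,1)$ standing in for $\boldsymbol{\mathcal{H}}$ and a toy distortion cost. The names `sample_channel`, `expected_cost`, and `toy_cost` are hypothetical, and this is not the training procedure of the paper.

```python
import random

def sample_channel(rng):
    """Draw one scalar coefficient h ~ CN(0, 1), a toy stand-in for the channel state."""
    std = 0.5 ** 0.5  # each of the real and imaginary parts has variance 1/2
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))

def expected_cost(cost_given_channel, num_draws, seed=0):
    """Monte Carlo estimate of E_H[J(w)] for fixed parameters w."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        total += cost_given_channel(sample_channel(rng))
    return total / num_draws

def toy_cost(h, feature=1.0):
    """Toy cost: squared distortion between the sent feature and its faded version."""
    return abs(h * feature - feature) ** 2

# For h ~ CN(0, 1) and a unit feature, E|h - 1|^2 = E|h|^2 + 1 = 2.
estimate = expected_cost(toy_cost, num_draws=20000)
```

In a training loop, the same idea typically appears as drawing a fresh channel realization for every mini-batch, so that stochastic gradient steps follow the expected cost rather than the cost under a single channel.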

Assuming the wireless channel has sufficient capacity, optimizing both endpoints to perform source and channel encoding and decoding, under the standard paradigm of wireless communications, will indeed nullify the distortion on the received version of $\bar{\boldsymbol{o}}^{L^{\prime}}$. The decoding output may then be passed to the $(L^{\prime}+1)$-th layer of the network as normal; note that the presence of the channel is effectively hidden from the neural network’s perspective. This approach aligns more with the standard practices of both the wireless communications and machine learning paradigms, as each problem is treated individually, and has indeed been shown to exhibit satisfactory results [8, 15]. However, the above practice of optimizing the system for reconstruction of the received signal may result in higher computational overheads under the following considerations:

1. Input reconstruction is not the objective of EI; rather, the objective is the computation of an arbitrary function of the input. Under this point of view, allocating computational resources in reconstructing intermediate variables is not always the most efficient way of solving the problem. In fact, one may regard GOC as a particular instance of lossy compression between the (unseen) target value $\boldsymbol{o}$ and its estimation $\hat{\boldsymbol{o}}$, where $J_{\rm str}(\boldsymbol{o},\hat{\boldsymbol{o}})$ plays the role of the distortion metric. From an information-theoretic perspective, variations of the distortion-rate functions may then be studied, as proposed by [25], indicating that the channel rate required to transmit intermediate variables, while ensuring a desired error threshold, is typically lower than the actual channel capacity, which assumes reconstruction with arbitrarily small error probabilities.

2. From an engineering perspective, since the received signal is to be fed to subsequent neural network layers, perfect reconstruction may not be required. The employed neural network architectures are commonly designed to account for noisy inputs in light of the inherent stochasticity of inference problems. Besides, manually injecting noise into the activations of intermediate layers, both during training [26, 27] and inference [28, 29], is known to enhance the model’s regularization and uncertainty estimation properties.

3. Finally, the wireless channel, which can be regarded as a (naturally stochastic) function over the transmitted data, imposes its own computations. While this function is not, in general, controllable by the E2E system, OAC approaches leverage the superposition of wireless signals to implement certain families of computational functions on top of the wireless medium. Interestingly, the controllability offered by emerging MS technological solutions has the potential to enable more elaborate computations over-the-air, essentially offloading computations from the communication network’s endpoints.

III Relevant State-of-the-Art

The implementation of DNNs at the transceivers has been studied under the JSCC paradigm, with results that outperform traditional communication systems due to the neural networks’ ability to learn useful patterns from the data and channel distributions [30, 31]. However, JSCC is limited to data reconstruction as an objective. Conversely, deep semantic communication approaches [32] endow the communication system with the purpose of transmitting the meaning of the data, rather than their bit representations, which can be interpreted as a GOC objective. Notably, the DeepSC architecture of [8] proposes a DNN splitting approach, with separate source and channel coding sub-modules that are trained independently. The channel encoder and decoder are trained to maximize an MI objective, which, on the one hand, is difficult to evaluate analytically for fading channels, while, on the other hand, maximizing the MI additionally implies that the effects of the channel are to be negated instead of being used for computations. In [10], a GOC approach with separate source and channel coding modules was developed for image retrieval under AWGN and Rayleigh fading, which further exemplified the benefits of separate training of each component rather than E2E. However, this training is possible due to the inserted rate-like part of the loss function that also treats the channel as noise, instead of a computational entity. Besides, the idea of DNN splitting for EI tasks has been investigated in [22, 6] from the information bottleneck viewpoint, to derive optimal network partitioning in uncontrollable wireless channels.

Traditionally, OAC methods have been developed to compute aggregate functions over multiple-access channels based on the superpositions of signals [33], and have been utilized for federated learning tasks, as in [34]. Both of these OAC approaches, however, focused on computing a limited, yet useful family of analytical functions that are fundamentally different from the computations that take place in hidden neural network layers. More relevant to the present work is the methodology of [7], where the wireless channel is treated as a hidden network layer encompassed by DNN layers at the transceivers. While this E2E treatment can be utilized under the context of GOC, it provides limited benefits, because the computations imposed by wireless channels are not controllable in the absence of an MS. In essence, the MINN framework proposed in this paper is general enough to accommodate this setup as a special case, once the MS-induced links are ignored in the overall system model. In the numerical evaluation section later on, we compare with this variation to illustrate the benefits of integrating MS(s) as hidden layer(s) for effective mixed digital/analog computation.

The joint consideration of MSs and deep learning is a growing body of research. Apart from works that introduce deep learning algorithms to control RISs [35, 36, 37] or SIM [38], cascaded MSs have been introduced as DNN layers of diffractive implementations in [16, 17, 18], towards analog computing hardware that is envisioned to exhibit notable benefits in computational speed and power consumption. In this paper, we are motivated by all-optical neural network implementations, but our framework is differentiated by the fact that the SIM layers are assumed to reside within the wireless environment, so that the E2E system needs to account for time-varying wireless fading when passing information between the SIM and digital network layers at the TX/RX endpoints [39]. It is noted that purely optical DNNs have the limitation that the input data needs to be transferred to the RF domain via techniques like holography, illumination, or traditional modulation, which are currently impractical for real-life deployment of such architectures. Finally, in the context of wireless networks, other works have capitalized on sophisticated designs of the responses of the elements of MSs to implement wave-domain signal processing [40, 41] or multi-access edge computing [42] tasks.

Regarding approaches that specifically consider GOC or semantic communication problems with the inclusion of MSs, a GOC approach was recently introduced in [14], according to which the DNN layers at the transceivers were implemented via SIM layers, in contrast to having the SIM as part of the environment, as proposed by this work. Indeed, performing DNN computations in the RF regime even at the endpoints has the aforementioned benefits; however, the hardware design of such transceivers is far from trivial for low-cost IoT devices. Besides, an MS placed directly inside the wireless environment may offer more precise control of the propagation medium. Finally, semantic communications were performed through the assistance of an RIS placed in the environment in [15]. In that approach, the RIS was optimized to maximize an equivalent Signal-to-Noise Ratio (SNR) objective, similar to [8], whereas we herein propose to treat the problem in an E2E manner. In the numerical evaluation of Section VII, we include a baseline where the RIS, alongside other components, is optimized with respect to the achievable Shannon rate, and we show that this approach is less effective compared to our E2E treatment under the considered system, especially in the low-SNR regime. Despite the related literature of E2E architectures for EI that potentially take advantage of MS capabilities, to the best of our knowledge, the proposed MINN framework is the first work to highlight the importance of treating the MS-enabled smart wireless channel as a favorable computational machinery embedded inside an E2E DNN architecture.

IV System and MS Components Modeling

IV-A System and Received Signal Models

We consider the uplink of a point-to-point Multiple-Input Multiple-Output (MIMO) communication system, where a TX equipped with $N_{t}$ transmit antennas wishes to transmit its data to an $N_{r}$-antenna RX on a frame-by-frame basis, where the frames are indexed as $t=1,2,\ldots$ and are possibly unevenly spaced. This communication is enhanced via an MS (either an RIS or SIM), deployed as a standalone node in the wireless environment, whose configuration may change at every discrete time step $t$ upon the command of an abstract controller unit. Without loss of generality, let us assume that the SIM consists of $M$ thin diffractive layers, each with $N_{m}$ unit elements, so that it contains $N_{\rm SIM}\triangleq MN_{m}$ phase-tunable elements in total. For ease of notation and to present a comprehensive system model, we will additionally use $N_{m}$ to denote the number of RIS tunable elements. Furthermore, let us define the Channel Frequency Response (CFR) matrices at each $t$-th time instance for the TX-RX, the TX-MS, and the MS-RX links as $\mathbf{H}_{\rm D}(t)\in\mathbb{C}^{N_{r}\times N_{t}}$, $\mathbf{H}_{\rm 1}(t)\in\mathbb{C}^{N_{t}\times N_{m}}$, and $\mathbf{H}_{\rm 2}(t)\in\mathbb{C}^{N_{r}\times N_{m}}$, respectively. The transmitted signal is expressed as $\boldsymbol{s}(t)\in\mathbb{C}^{N_{t}\times 1}$, which satisfies the power budget constraint $P\triangleq\mathbb{E}[\|\boldsymbol{s}(t)\|^{2}]$. In fact, we let $\boldsymbol{s}(t)$ represent both the intended, source-coded, and modulated data stream (according to the MIMO spatial multiplexing principle, the number of data symbols needs to be $d\leq\min\{N_{t},N_{r}\}$), as well as potential beamforming weights, without making any specific assumptions about the underlying procedures that produced the transmitted signal or the distribution of symbols.

During each $t$-th frame transmission, the MS is characterized by its controllable phase configuration vector $\boldsymbol{\omega}(t)\in\mathbb{C}^{N_{m}\times 1}$ in the case of an RIS and $\boldsymbol{\omega}(t)\in\mathbb{C}^{N_{\rm SIM}\times 1}$ in the case of SIM, while its resulting response configuration is modeled for the idealized case of unit amplitude as $\boldsymbol{\varphi}(t)\triangleq\exp(-\jmath\boldsymbol{\omega}(t))$. The effects of the responses of the metamaterials in the cascaded channel are captured via the matrix $\boldsymbol{\Omega}(t)\in\mathbb{C}^{N_{m}\times N_{m}}$, the structure of which will be detailed in the next subsection. In the remainder of this work, $\boldsymbol{\Omega}(t)$, $\boldsymbol{\varphi}(t)$, and $\boldsymbol{\omega}(t)$ are used as generic notation to describe any of the RIS and SIM cases, while device-specific notation is introduced wherever needed. Using the above, the baseband received signal at the RX antennas is expressed as follows:

	$\displaystyle\mathbf{y}(t)$	$\displaystyle\triangleq\left(\mathbf{H}_{\rm D}(t)+\mathbf{H}_{\rm 2}(t)\boldsymbol{\Omega}(t)\mathbf{H}_{\rm 1}^{\dagger}(t)\right)\boldsymbol{s}(t)+\boldsymbol{\tilde{n}}$		(4)
		$\displaystyle\triangleq\mathcal{T}(\boldsymbol{\mathcal{H}}(t),\boldsymbol{\varphi}(t),\boldsymbol{s}(t)),$		(5)

where $\boldsymbol{\tilde{n}}\in\mathbb{C}^{N_{r}\times 1}$ denotes the AWGN at the RX, comprising independent and identically distributed (i.i.d.) samples drawn from the zero-mean complex normal distribution $\mathcal{CN}(0,\sigma^{2})$. In the sequel, we will be making use of the transmission function $\mathcal{T}(\boldsymbol{\mathcal{H}}(t),\boldsymbol{\varphi}(t),\boldsymbol{s}(t))$ as an abstraction, emphasizing that the wireless medium is treated as a programmable computation. In this definition, we use the notation $\boldsymbol{\mathcal{H}}(t)\triangleq\{\mathbf{H}_{\rm D}(t),\mathbf{H}_{\rm 1}(t),\mathbf{H}_{\rm 2}(t)\}$ for the instantaneous Channel State Information (CSI), which is assumed to be readily available to all system nodes. Obviously, this availability implies a recurring channel estimation phase at each $t$-th time step, which may be a challenging prerequisite (see [36] and references therein). Nevertheless, this assumption allows the focus of this work to be on the training and evaluation of the proposed MINN architecture. Logical future extensions could incorporate the channel estimation phase in the DNN transceiver modules themselves, following ISAC principles [12]. Alternatively, channel-agnostic variations of the transceivers will also be proposed and evaluated in the following sections to illustrate the performance trade-offs when integrating MSs as over-the-air neural network layers.
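As an illustration, the transmission function of (4) can be sketched numerically as follows. This is a minimal NumPy sketch and not the authors' implementation: the helper names `transmission` and `crandn`, the array sizes, and the Rayleigh-like random channels are assumptions made purely for demonstration.

```python
import numpy as np

def crandn(rng, *shape):
    """Hypothetical helper: i.i.d. CN(0, 1) samples of the given shape."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def transmission(H_D, H_1, H_2, Omega, s, sigma2, rng):
    """Received signal y = (H_D + H_2 Omega H_1^H) s + n, following (4)."""
    n_r = H_D.shape[0]
    noise = np.sqrt(sigma2) * crandn(rng, n_r)      # i.i.d. CN(0, sigma2) samples
    cascaded = H_2 @ Omega @ H_1.conj().T           # MS-assisted link
    return (H_D + cascaded) @ s + noise

rng = np.random.default_rng(0)
N_t, N_r, N_m = 2, 3, 8                             # assumed toy dimensions
H_D = crandn(rng, N_r, N_t)
H_1 = crandn(rng, N_t, N_m)
H_2 = crandn(rng, N_r, N_m)

theta = rng.uniform(0, 2 * np.pi, N_m)              # RIS phase configuration
Phi = np.diag(np.exp(-1j * theta))                  # unit-amplitude RIS response
s = crandn(rng, N_t)

y = transmission(H_D, H_1, H_2, Phi, s, sigma2=0.0, rng=rng)
```

Setting `sigma2=0.0` isolates the deterministic part of the channel, which is linear in $\boldsymbol{s}(t)$ for a fixed response configuration.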

IV-B RIS and SIM Models

Considering first an RIS, let its phase configuration vector at time $t$ be denoted as $\boldsymbol{\theta}(t)\triangleq[\theta_{1}(t),\theta_{2}(t),\ldots,\theta_{N_{m}}(t)]^{\top}$ ($\equiv\boldsymbol{\omega}(t)$), so that the phase state of its $n$-th unit element ($n=1,2,\ldots,N_{m}$) is expressed as $\theta_{n}(t)\in[0,2\pi)$. Then, the induced weights at the response configuration vector are given as $\boldsymbol{\phi}(t)\triangleq\exp(-\jmath\boldsymbol{\theta}(t))$ ($\equiv\boldsymbol{\varphi}(t)$ in (5)). In this case, using the diagonal matrix definition $\boldsymbol{\Phi}(t)\triangleq{\rm diag}(\boldsymbol{\phi}(t))\in\mathbb{C}^{N_{m}\times N_{m}}$, it holds that $\boldsymbol{\Omega}(t)\equiv\boldsymbol{\Phi}(t)$ in (4).

Proceeding to the introduction of the SIM into the system model, we first assume that all $M$ layers ( $m=1,2,\ldots,M$ ) are closely stacked and aligned parallel to each other, with their shared normal vector oriented perpendicular to the line connecting the TX and RX positions. Under this placement, the signal from the TX arrives at the first layer of the SIM, undergoes diffraction and controllable phase shifting by the consecutive $M-1$ layers, before being finally diffracted towards the RX. Let us define the distance between any consecutive MS layers as $d_{M}$ and the area of each unit element as $S_{M}$ . Due to the compact placement of the layers, the layer-to-layer propagation can be accurately modeled via the Rayleigh-Sommerfeld diffraction equation [14, 13]. Namely, we define the propagation coefficient matrix from each $(m-1)$ -th to the $m$ -th layer as $\boldsymbol{\Xi}_{m}\in\mathbb{C}^{N_{m}\times N_{m}}$ , so that its $(n,n^{\prime})$ -th entry ( $n,n^{\prime}=1,2,\ldots,N_{m}$ ) includes the propagation gain between the (arbitrarily ordered) $n$ -th unit element of the $(m-1)$ -th layer and the $n^{\prime}$ -th element of the next layer, as follows [13]:

[\boldsymbol{\Xi}_{m}]_{n,n^{\prime}}\triangleq\frac{d_{M}S_{M}}{(d_{n,n^{\prime}})^{2}}\Big(\frac{1}{2\pi d_{n,n^{\prime}}}-\frac{\jmath}{\lambda}\Big)\exp\Big(\frac{\jmath 2\pi d_{n,n^{\prime}}}{\lambda}\Big),

(6)

where $d_{n,n^{\prime}}$ denotes the distance between the centers of the $n$-th element of the $(m-1)$-th layer and the $n^{\prime}$-th element of the $m$-th layer, and $\lambda$ is the carrier wavelength.
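A possible numerical construction of the propagation matrix in (6) is sketched below, using the standard Rayleigh-Sommerfeld phase term $\exp(\jmath 2\pi d_{n,n^{\prime}}/\lambda)$. The geometry is an assumption for illustration only: two aligned square layers on a uniform grid, with the hypothetical function name `layer_propagation` and hypothetical values for the element spacing, layer distance, and wavelength.

```python
import numpy as np

def layer_propagation(n_side, spacing, d_layer, wavelength):
    """Layer-to-layer coefficients of (6) for two aligned n_side x n_side layers."""
    # Element centers on a uniform planar grid (assumed geometry)
    grid = np.arange(n_side) * spacing
    xx, yy = np.meshgrid(grid, grid, indexing="ij")
    pos = np.stack([xx.ravel(), yy.ravel()], axis=1)                  # (N_m, 2)
    planar = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    d = np.sqrt(planar**2 + d_layer**2)        # center-to-center distances d_{n,n'}
    area = spacing**2                          # element area S_M (assumed square)
    amplitude = d_layer * area / d**2
    return amplitude * (1 / (2 * np.pi * d) - 1j / wavelength) * np.exp(1j * 2 * np.pi * d / wavelength)

wavelength = 0.01                              # metres, hypothetical value
Xi = layer_propagation(n_side=3, spacing=wavelength / 2,
                       d_layer=wavelength, wavelength=wavelength)
```

Under this aligned geometry the distance matrix is symmetric, so the resulting $\boldsymbol{\Xi}_{m}$ is symmetric as well, and its largest-magnitude entries lie on the diagonal, where the elements face each other directly.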

Apart from diffracting, each $n$-th element of each $m$-th SIM layer introduces a controllable weight, similar to the RIS modeling, denoted as $[\boldsymbol{\psi}_{m}(t)]_{n}\triangleq\exp(-\jmath\vartheta^{m}_{n}(t))$. We also introduce $\boldsymbol{\psi}_{m}(t)$ including the response configuration at the $m$-th layer, the overall response configuration $\boldsymbol{\psi}(t)=[\boldsymbol{\psi}^{\top}_{1}(t),\boldsymbol{\psi}^{\top}_{2}(t),\dots,\boldsymbol{\psi}^{\top}_{M}(t)]^{\top}\in\mathbb{C}^{N_{\rm SIM}\times 1}$, and the phase configuration vector of the SIM $\boldsymbol{\vartheta}(t)\triangleq[\vartheta^{1}_{1}(t),\dots,\vartheta^{1}_{N_{m}}(t),\dots,\vartheta^{M}_{1}(t),\dots,\vartheta^{M}_{N_{m}}(t)]^{\top}\in\mathbb{C}^{N_{\rm SIM}\times 1}$. By defining $\boldsymbol{\Psi}_{m}(t)\triangleq{\rm diag}(\boldsymbol{\psi}_{m}(t))$, the overall SIM response can be mathematically expressed via the following matrix [39]:

\boldsymbol{\Upsilon}(t)\triangleq\left(\prod_{m=M}^{2}\boldsymbol{\Psi}_{m}(t)\boldsymbol{\Xi}_{m}\right)\boldsymbol{\Psi}_{1}(t)\in\mathbb{C}^{N_{m}\times N_{m}}.

(7)

Note that, for $\boldsymbol{\Omega}(t)\equiv\boldsymbol{\Upsilon}(t)$ , (4) holds for the SIM case.
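The layered product in (7) can be accumulated from the first layer outwards, as in the following sketch. The function name `sim_response` and the zero-based storage convention (`Xi[m - 1]` holding the propagation matrix into the paper's layer $m+1$) are assumptions of this example; the random matrices merely stand in for the coefficients of (6).

```python
import numpy as np

def sim_response(vartheta, Xi):
    """Upsilon = Psi_M Xi_M ... Psi_2 Xi_2 Psi_1, following (7).

    vartheta: (M, N_m) real phases, one row per layer.
    Xi:       (M - 1, N_m, N_m) propagation matrices Xi_2, ..., Xi_M.
    """
    psi = np.exp(-1j * vartheta)                  # unit-amplitude layer responses
    Upsilon = np.diag(psi[0])                     # Psi_1
    for m in range(1, psi.shape[0]):              # layers 2..M in the paper's numbering
        Upsilon = np.diag(psi[m]) @ Xi[m - 1] @ Upsilon
    return Upsilon

rng = np.random.default_rng(0)
M, N_m = 3, 4                                     # assumed toy dimensions
vartheta = rng.uniform(0, 2 * np.pi, (M, N_m))
Xi = rng.standard_normal((M - 1, N_m, N_m)) + 1j * rng.standard_normal((M - 1, N_m, N_m))
Upsilon = sim_response(vartheta, Xi)
```

As a sanity check, replacing every propagation matrix with the identity collapses the SIM into a single diagonal layer whose phases are the per-element sums of the layer phases.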

Revisiting the previously introduced generic notations of $\boldsymbol{\Omega}(t)$ and $\boldsymbol{\omega}(t)$ , they can now be expressed concretely for each of the RIS and SIM cases as $\boldsymbol{\Omega}(t)\in\{\boldsymbol{\Phi}(t),\boldsymbol{\Upsilon}(t)\}$ and $\boldsymbol{\omega}(t)\in\{\boldsymbol{\theta}(t),\boldsymbol{\vartheta}(t)\}$ . In the rest of this paper, $\boldsymbol{\Omega}(t)$ , $\boldsymbol{\omega}(t)$ , and $\boldsymbol{\varphi}(t)$ will be used when the underlying operations are agnostic of the MS type, while $\boldsymbol{\Phi}(t)$ , $\boldsymbol{\Upsilon}(t)$ , and their respective vectors will be explicitly utilized when the operations need to discriminate between RISs and SIM.

V Metasurface-Integrated Neural Networks

V-A Transceiver Modules

As initially discussed in Section II-C, EI entails two computational modules, collocated at the transceiver endpoints. To implement the “infer-while-transmitting” methodology, the TX utilizes an encoder DNN, $f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\cdot)$ , providing the output $\boldsymbol{s}(t)$ , while the RX operates a decoder DNN, $f^{\rm d}_{\boldsymbol{w_{\rm d}}}(\cdot)$ , deriving the output $\boldsymbol{\hat{o}}(t)$ . Those blocks are tasked with performing compression, encoding and decoding, error resilience and correction, and potential transmit and receive beamforming alongside probabilistic inference. The exact layer architecture of those models is purposely left unspecified at this stage as a practitioner’s choice, depending, in general, on: i) the nature of the wireless environment; ii) the type of input and target data; iii) computational capabilities of the transceivers’ hardware; as well as iv) the current state-of-the-art. We only note that different sub-modules may be used for each of the above operations, while, typically for uplink scenarios, $f^{\rm d}_{\boldsymbol{w_{\rm d}}}(\cdot)$ can be implemented with larger DNN structures due to the constant power supply at base stations. In addition, regardless of the choice of neural network, we impose a final fixed post-processing step at the encoder’s output $\boldsymbol{s}(t)$ to satisfy the TX system’s power budget, as follows:

\boldsymbol{s}(t)\leftarrow\sqrt{P}\frac{\boldsymbol{s}(t)}{\|\boldsymbol{s}(t)\|}.

(8)
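The post-processing step of (8) amounts to a rescaling of the encoder output, sketched below under the assumption that the output is nonzero (an all-zero vector would need separate handling). The function name `normalize_power` is hypothetical.

```python
import numpy as np

def normalize_power(s, P):
    """Rescale s so that its squared norm equals the power budget P, as in (8)."""
    return np.sqrt(P) * s / np.linalg.norm(s)

raw = np.array([1.0 + 2.0j, -0.5j, 3.0])     # arbitrary encoder output
s = normalize_power(raw, P=4.0)
```

The rescaled vector has norm $\sqrt{P}$ and keeps the direction of the raw encoder output.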

Considering the concrete input arguments of the encoder functions, two different variations may be defined depending on whether CSI is available to each of the endpoints.

V-A1 Channel-Agnostic Transceivers

An instance of the data variables $\boldsymbol{x}(t)$ is observed at the TX and is passed to the encoder to construct the transmitted signal, while the decoder DNN observes the received signal and produces an estimate of the unseen target variable $\boldsymbol{o}(t)$; this can be described as:

	$\displaystyle\boldsymbol{s}(t)$	$\displaystyle=f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t)),$		(9)
	$\displaystyle\boldsymbol{\hat{o}}(t)$	$\displaystyle=f^{\rm d}_{\boldsymbol{w_{\rm d}}}(\mathbf{y}(t)).$		(10)

Since no CSI is used by the endpoints, we highlight the similarity of this design to source-only coding, despite the fact that the encoder may need to add redundancy to the transmitted signal, which is traditionally considered as channel coding. Evidently, both processes need to guarantee that the inference procedure performs adequately irrespective of the current channel conditions, which may be a demanding requirement. Nevertheless, not requiring CSI is a strong simplification of the system architecture; therefore, this variation is included in our investigations later in this paper.

V-A2 Channel-Aware Transceivers

Let us assume a quasi-static fading channel and a channel estimation procedure that takes place within each $t$ -th channel frame before data transmission, based on which the TX and RX modules obtain accurate estimates of the CFR matrices $\boldsymbol{\mathcal{H}}(t)$ . Each module may receive $\boldsymbol{\mathcal{H}}(t)$ as an additional input, yielding respectively the following representations for the encoder/decoder DNNs:

	$\displaystyle\boldsymbol{s}(t)$	$\displaystyle=f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t),\boldsymbol{\mathcal{H}}(t)),$		(11)
	$\displaystyle\boldsymbol{\hat{o}}(t)$	$\displaystyle=f^{\rm d}_{\boldsymbol{w_{\rm d}}}(\mathbf{y}(t),\boldsymbol{\mathcal{H}}(t)).$		(12)

Endowing the TX/RX modules with CSI leads to more resilient transmission schemes that closely resemble JSCC [43]. The main difference is that JSCC focuses on estimating $\boldsymbol{x}(t)$, while EI deals with approximating $\boldsymbol{o}(t)=l(\boldsymbol{x}(t))$.¹ In this paper, we assume that channel estimation takes place transparently before every transmission, both during training and inference, and results in noise-free estimates of $\boldsymbol{\mathcal{H}}(t)$. Accounting for noisy estimates, or even incorporating the estimation into the procedure under ISAC paradigms, leads to exciting research directions, which we will study in future work.

¹ Those two problems can be considered equivalent by setting the mapping function $l(\cdot)$ to be the identity function, yielding $\boldsymbol{o}(t)=\boldsymbol{x}(t)$, and adopting MSE as the objective function $J_{\rm str}(\cdot)$. As a result, EI is a more general problem formulation than communications that focus on data reconstruction.

V-B Control Module for Reconfigurable Metasurfaces

When CSI is available, as in most wireless communication settings, the MS changes its response configuration at every transmission frame to optimize the system’s objective [44]. To incorporate this mode of operation into our E2E architecture, $\boldsymbol{\varphi}(t)$ is treated as a controllable variable that is the output of a third DNN module. Specifically, we define the metasurface controller as the following neural network:

\boldsymbol{\varphi}(t)=f^{\rm m}_{\boldsymbol{w_{\rm m}}}(\boldsymbol{\mathcal{H}}(t)),

(13)

imposing that the final layer performs the operation $\boldsymbol{\varphi}(t)=\exp(-\jmath\boldsymbol{\hat{\vartheta}})$ (with $\boldsymbol{\hat{\vartheta}}$ denoting the phase vector fed to that layer). As stated before, either $\boldsymbol{\phi}(t)$ or $\boldsymbol{\psi}(t)$ may be the actual output of the module depending on the selected type of MS; however, we keep the abstract notation of $\boldsymbol{\varphi}(t)$ to provide a general framework. Under this viewpoint, the MS is a controllable entity that can be adapted dynamically to offer favorable wave-domain computation at every channel realization. This treatment allows for fine-grained control over the reprogrammability of the environment, at the cost of an additional neural network module and the associated hardware requirements. Plugging the three trained modules from expressions (9)–(13) into the received signal in (4), we can derive the E2E inference model $\boldsymbol{\hat{o}}(t)=f^{\rm r}_{\boldsymbol{w_{\rm r}}}(\boldsymbol{x}(t),\boldsymbol{\mathcal{H}}(t))$, for the two channel knowledge cases, as follows:

\boldsymbol{\hat{o}}(t)\!=\!\begin{cases}\underbrace{f^{\rm d}_{\boldsymbol{w_{\rm d}}}\Big(\mathcal{T}\big(\boldsymbol{\mathcal{H}}(t),f^{\rm m}_{\boldsymbol{w_{\rm m}}}(\boldsymbol{\mathcal{H}}(t)),f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t))\big)\Big),}_{\text{channel-agnostic transceivers}}\\ \underbrace{f^{\rm d}_{\boldsymbol{w_{\rm d}}}\Big(\mathcal{T}\big(\boldsymbol{\mathcal{H}}(t),f^{\rm m}_{\boldsymbol{w_{\rm m}}}(\boldsymbol{\mathcal{H}}(t)),f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t),\boldsymbol{\mathcal{H}}(t))\big),\boldsymbol{\mathcal{H}}(t)\Big),}_{\text{channel-aware transceivers}}\end{cases}

(14)

where the overall trainable weights of this reconfigurable architecture have been represented as $\boldsymbol{w_{\rm r}}\triangleq\{\boldsymbol{w_{\rm d}},\boldsymbol{w_{\rm e}},\boldsymbol{w_{\rm m}}\}$, which can be trained together under the same objective functions and backward passes, as will be detailed in Section VI.
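To make the composition in (14) concrete, the following sketch chains an encoder, an MS controller, the transmission function, and a decoder for the channel-agnostic transceiver case with an RIS. All three modules are hypothetical single-layer stand-ins with random weights (the paper deliberately leaves the layer architectures unspecified), and the helper names, dimensions, and noise level are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N_t, N_r, N_m, n_in, n_cls = 2, 4, 6, 5, 3        # assumed toy dimensions

def crandn(*shape):
    """Hypothetical helper: i.i.d. CN(0, 1) samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def to_real(*arrays):
    """Stack real and imaginary parts of complex arrays into one real vector."""
    flat = np.concatenate([np.ravel(a) for a in arrays])
    return np.concatenate([flat.real, flat.imag])

# Random single-layer stand-ins for w_e, w_m, and w_d
n_csi = 2 * (N_r * N_t + N_t * N_m + N_r * N_m)
W_e = rng.standard_normal((2 * N_t, n_in))
W_m = rng.standard_normal((N_m, n_csi))
W_d = rng.standard_normal((n_cls, 2 * N_r))

def encoder(x, P=1.0):
    z = W_e @ x
    s = z[:N_t] + 1j * z[N_t:]
    return np.sqrt(P) * s / np.linalg.norm(s)     # power step of (8)

def controller(H_D, H_1, H_2):
    phases = W_m @ to_real(H_D, H_1, H_2)
    return np.exp(-1j * phases)                   # unit-modulus final layer of (13)

def decoder(y):
    logits = W_d @ to_real(y)
    e = np.exp(logits - logits.max())
    return e / e.sum()                            # class probabilities

def minn_forward(x, H_D, H_1, H_2, sigma2=1e-2):
    s = encoder(x)                                # channel-agnostic encoder, (9)
    phi = controller(H_D, H_1, H_2)               # reconfigurable MS response
    noise = np.sqrt(sigma2) * crandn(N_r)
    y = (H_D + H_2 @ np.diag(phi) @ H_1.conj().T) @ s + noise   # (4)
    return decoder(y)                             # channel-agnostic decoder, (10)

x = rng.standard_normal(n_in)
H_D, H_1, H_2 = crandn(N_r, N_t), crandn(N_t, N_m), crandn(N_r, N_m)
o_hat = minn_forward(x, H_D, H_1, H_2)
```

The channel-aware variant would additionally pass the real-valued CSI vector to `encoder` and `decoder`, at the cost of larger input layers.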

V-C Metasurfaces with Trainable Fixed Response

As an alternative approach, one may choose to directly learn a fixed configuration for the MS; let us denote this as $\boldsymbol{\bar{\omega}}$. While the training process may iteratively evaluate multiple candidate values for $\boldsymbol{\bar{\omega}}$, once the training is complete, the learned configuration is loaded onto the MS to maintain a constant (static) response configuration $\boldsymbol{\varphi}(t)\equiv\boldsymbol{\bar{\varphi}}\triangleq\exp(-\jmath\boldsymbol{\bar{\omega}})$ over time, irrespective of the channel conditions or input data. This mode of operation is closer to the idea of treating the effective phase configurations as DNN weights, since they too remain fixed after the completion of the training procedure and are used to perform the same computational operations over varying input instances. To this end, denote the fixed phase configurations of the RIS and SIM as $\boldsymbol{\bar{\theta}}$ and $\boldsymbol{\bar{\vartheta}}$, respectively. The training procedure optimizes $\boldsymbol{\bar{\omega}}$ directly as part of the trainable weights $\boldsymbol{w_{\rm s}}\triangleq\{\boldsymbol{w_{\rm d}},\boldsymbol{w_{\rm e}},\boldsymbol{\bar{\omega}}\}$; therefore, the E2E static architecture can now be expressed as follows:

\boldsymbol{\hat{o}}(t)=\begin{cases}\underbrace{f^{\rm d}_{\boldsymbol{w_{\rm d}}}\Big(\mathcal{T}\big(\boldsymbol{\mathcal{H}}(t),\boldsymbol{\bar{\varphi}},f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t))\big)\Big),}_{\text{channel-agnostic transceivers}}\\ \underbrace{f^{\rm d}_{\boldsymbol{w_{\rm d}}}\Big(\mathcal{T}\big(\boldsymbol{\mathcal{H}}(t),\boldsymbol{\bar{\varphi}},f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t),\boldsymbol{\mathcal{H}}(t))\big),\boldsymbol{\mathcal{H}}(t)\Big).}_{\text{channel-aware transceivers}}\end{cases}

(15)

It is noted that, while reconfigurable MSs may offer more precise control in shaping the exact form of $\mathcal{T}(\cdot)$ , the extra hidden layers required by the inclusion of the MS controller module may well hinder the training capabilities of the proposed MINN compared to the current variation. Besides, assuming wireless systems of reasonably limited variability, such as Line-of-Sight (LoS) dominant environments with fixed transceivers, static MS configurations may offer satisfactory performance. The next section addresses the systemic requirements for all variations of this section, while performance trade-offs are investigated under our numerical evaluations.

The final aspect to consider regarding the proposed MINN framework is its theoretical approximation guarantees, especially its universal approximation capabilities. Presenting a concrete analysis of this property lies well outside the scope of this paper, as it involves proving that functionals derived from the modeling of Section IV are discriminatory and dense in the space of continuous complex functions as in [20], while also accounting for the stochastic nature of both the fading components and the AWGN. Nevertheless, we argue that, since our MINN architecture includes two typical DNNs at the start and the end of the cascaded computations, which are in fact universal approximators, the E2E architecture should also possess this property, at least in the infinite-SNR regime. Intuitively, as long as $\mathcal{T}(\cdot)$ does not lose information due to the noisy channel, the decoder DNN should be capable of approximating the optimal decoder of $\mathbf{y}(t)$ to obtain $\boldsymbol{s}(t)$, so that the effect of the channel is completely negated, and the E2E MINN reduces to $f^{\rm d}_{\boldsymbol{w_{\rm d}}}\big(f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t),\boldsymbol{\mathcal{H}}(t)),\boldsymbol{\mathcal{H}}(t)\big)$, which is indeed a universal approximator consisting of standard DNN layers.

VI MINN Training and Deployment

To perform neural network training at any of the wireless communication system nodes, a separate data collection step is carried out to generate a set of $|\mathcal{D}|$ labeled data instances: $\mathcal{D}\triangleq\{(\boldsymbol{x}_{i},\boldsymbol{o}_{i})\}_{i=1}^{|\mathcal{D}|}$. Let us also assume the availability of a set of $|\mathcal{C}|$ channel sample estimates (in respective coherent time instances): $\mathcal{C}\triangleq\{\boldsymbol{\mathcal{H}}(t)\}_{t=1}^{|\mathcal{C}|}$, not necessarily equally spaced. In this paper, we make the assumption that the channel realizations are conditionally independent² from $\mathcal{D}$’s data instances. It is noted that, while this is a rather lenient assumption, it is crucial in permitting the evaluation of the expectation in $\mathcal{OP}_{\rm EI}$’s objective via i.i.d. Monte Carlo samples.

² In certain scenarios, the data realizations and the statistical properties of the channel can be statistically dependent. For example, in a target detection system where the observations $\boldsymbol{x}_{i}$ contain sensory inputs, while $\boldsymbol{o}_{i}$ is a binary variable indicating the existence of a target in the area of interest, deep fading may be encountered more often when a target is subject to signal blockages. In such cases, the two collection processes of channel measurements and observed data must be synchronous, and a more detailed formulation of the EI objective is required. However, the inference problem itself may be potentially computationally easier, since the CSI observation provides additional information regarding the target value.

VI-A Backpropagation Over the Wireless Channel

The training procedure can be described as a variation of the standard gradient descent approach for neural network training, with the inclusion of channel samples. To provide a comprehensive framework, let us use the generic parameter vector $\boldsymbol{w_{\rm k}}$ with $\mathbf{k}\in\{\mathbf{r},\mathbf{s}\}$, taking the form of either $\boldsymbol{w_{\rm r}}$ or $\boldsymbol{w_{\rm s}}$ depending on whether reconfigurable or static MSs are used. Similar to standard deep learning practices, our E2E MINN architecture may be optimized using SGD over the collected data and channel instances. Specifically, let us express the loss for a data-channel pair as $J_{\rm str}(\boldsymbol{o}_{i},\boldsymbol{\hat{o}}_{i})=J_{\rm str}(\boldsymbol{o}_{i},f^{\rm k}_{\boldsymbol{w_{\rm k}}}(\boldsymbol{x}_{i},\boldsymbol{\mathcal{H}}(t)))$ to explicitly show the dependence of the loss function on the instantaneous wireless channel conditions. To this end, leveraging the previously described conditional independence assumption, $\mathcal{OP}_{\rm EI}$’s objective can be approximated as follows:

\mathbb{E}_{\boldsymbol{\mathcal{H}}}[J(\boldsymbol{w_{\rm k}})]\!\cong\!\frac{1}{|\mathcal{C}||\mathcal{D}|}\sum_{t=1}^{|\mathcal{C}|}\sum_{i=1}^{|\mathcal{D}|}J_{\rm str}(\boldsymbol{o}_{i},f^{\rm k}_{\boldsymbol{w_{\rm k}}}(\boldsymbol{x}_{i},\boldsymbol{\mathcal{H}}(t))).

(16)
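The double average in (16) can be written as a short routine, sketched here with a scalar toy model chosen only so that the result is easy to verify by hand. The names `empirical_objective`, `model_fn`, and `loss_fn` are hypothetical placeholders for the E2E MINN and the task loss.

```python
def empirical_objective(loss_fn, model_fn, data, channels):
    """Average loss over every (channel, data) pair, as in (16)."""
    total = 0.0
    for H in channels:                     # t = 1, ..., |C|
        for x, o in data:                  # i = 1, ..., |D|
            total += loss_fn(o, model_fn(x, H))
    return total / (len(channels) * len(data))

# Toy scalar stand-ins: the "network" multiplies its input by the channel value
data = [(1.0, 2.0), (2.0, 4.0)]
channels = [1.0, 2.0, 3.0]
objective = empirical_objective(
    loss_fn=lambda o, o_hat: (o - o_hat) ** 2,
    model_fn=lambda x, H: x * H,
    data=data,
    channels=channels,
)
```

In this toy instance the six squared errors sum to 10, so the estimate equals 10/6. Pairing every channel sample with every data instance in this way relies on the independence assumption stated above.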

In the online version of SGD, at every time $t$ , one may select a single data point and channel instance to evaluate (16), and accordingly update the parameter vector as follows:

\boldsymbol{w_{\rm k}}\leftarrow\boldsymbol{w_{\rm k}}-\eta\nabla_{\boldsymbol{w_{\rm k}}}J_{\rm str}(\boldsymbol{o}(t),f^{\rm k}_{\boldsymbol{w_{\rm k}}}(\boldsymbol{x}(t),\boldsymbol{\mathcal{H}}(t))),

(17)

for some chosen learning rate $\eta$ , with the gradient at each case being defined by one of the two following expressions:

	$\displaystyle\nabla J_{\rm str}$	$\displaystyle=\underbrace{\bigg[\Big[\frac{\partial J_{\rm str}}{\partial\boldsymbol{w_{\rm d}}}\Big]^{\top},\Big[\frac{\partial J_{\rm str}}{\partial\boldsymbol{w_{\rm e}}}\Big]^{\top},\Big[\frac{\partial J_{\rm str}}{\partial\boldsymbol{w_{\rm m}}}\Big]^{\top}\bigg]^{\top}}_{\text{reconfigurable metasurface}}$		(18)
	$\displaystyle\nabla J_{\rm str}$	$\displaystyle=\underbrace{\bigg[\Big[\frac{\partial J_{\rm str}}{\partial\boldsymbol{w_{\rm d}}}\Big]^{\top},\Big[\frac{\partial J_{\rm str}}{\partial\boldsymbol{w_{\rm e}}}\Big]^{\top},\Big[\frac{\partial J_{\rm str}}{\partial\boldsymbol{\bar{\omega}}}\Big]^{\top}\bigg]^{\top}}_{\text{metasurface with trainable fixed response}}.$		(19)

Under the i.i.d. sampling assumption, the consecutive evaluations of the gradient of the objective function in (17) at each training instance $t$ are unbiased estimators of the true gradient of the objective in (16). Therefore, following the stochastic approximation framework [45], the repetition of this procedure will converge to the true value of the expectation with probability $1$ up to a precision of $O(\eta)$ around it, using constant step size [46]. The complete training procedure is detailed in Algorithm 1, which supports all variations of channel-agnostic/-aware transceivers, static/reconfigurable MS controllers, and RIS/SIM structure. Lines $8$-$18$ implement our MINN architecture as defined in (14) and (15). Naturally, batched gradient descent versions may be used alongside more elaborate gradient updates, such as momentum, weight decay (regularization), and adaptive rates [47]; however, such implementation details have been left out for ease of presentation.

Algorithm 1 Training of the Proposed E2E MINN

1: Construct DNN weight vector $\boldsymbol{w}$ as one of the following:
2:   i) $\boldsymbol{w_{\rm k}}={\rm concat}(\boldsymbol{w_{\rm d}},\boldsymbol{w_{\rm e}},\boldsymbol{w_{\rm m}})$ $\triangleright$ $\boldsymbol{w_{\rm k}}\leftarrow\boldsymbol{w_{\rm r}}$
3:   ii) $\boldsymbol{w_{\rm k}}={\rm concat}(\boldsymbol{w_{\rm d}},\boldsymbol{w_{\rm e}},\boldsymbol{\bar{\omega}})$ $\triangleright$ $\boldsymbol{w_{\rm k}}\leftarrow\boldsymbol{w_{\rm s}}$
4: Initialize $\boldsymbol{w}$ randomly.
5: for $t=1,2,\ldots,$ until convergence do
6:   Sample $(\boldsymbol{x}(t),\boldsymbol{o}(t))$ from $\mathcal{D}$
7:   Sample $\boldsymbol{\mathcal{H}}(t)$ from $\mathcal{C}$
8:   Compute $\boldsymbol{s}(t)$ using one of the following:
9:     i) $\boldsymbol{s}(t)=f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t))$ $\triangleright$ eq. (9)
10:    ii) $\boldsymbol{s}(t)=f^{\rm e}_{\boldsymbol{w_{\rm e}}}(\boldsymbol{x}(t),\boldsymbol{\mathcal{H}}(t))$ $\triangleright$ eq. (11)
11:   Compute $\boldsymbol{\varphi}(t)$ using one of the following:
12:    i) $\boldsymbol{\varphi}(t)=f^{\rm m}_{\boldsymbol{w_{\rm m}}}(\boldsymbol{\mathcal{H}}(t))$ $\triangleright$ eq. (13)
13:    ii) $\boldsymbol{\varphi}(t)=\boldsymbol{\bar{\varphi}}$
14:   Transmit $\boldsymbol{s}(t)$ to receive $\mathbf{y}(t)$
15:    $\mathbf{y}(t)=\mathcal{T}(\boldsymbol{\mathcal{H}}(t),\boldsymbol{\varphi}(t),\boldsymbol{s}(t))$ $\triangleright$ eq. (5)
16:   Compute $\boldsymbol{\hat{o}}(t)$ using one of the following:
17:    i) $\boldsymbol{\hat{o}}(t)=f^{\rm d}_{\boldsymbol{w_{\rm d}}}(\mathbf{y}(t))$ $\triangleright$ eq. (10)
18:    ii) $\boldsymbol{\hat{o}}(t)=f^{\rm d}_{\boldsymbol{w_{\rm d}}}(\mathbf{y}(t),\boldsymbol{\mathcal{H}}(t))$ $\triangleright$ eq. (12)
19:   Set $\boldsymbol{w_{\rm k}}\!\leftarrow\!\boldsymbol{w_{\rm k}}\!-\!\eta\nabla_{\boldsymbol{w_{\rm k}}}J_{\rm str}(\boldsymbol{o}(t),f^{\rm k}_{\boldsymbol{w_{\rm k}}}(\boldsymbol{x}(t),\boldsymbol{\mathcal{H}}(t)))$
20: end for
21: return $\boldsymbol{w}$
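The loop of Algorithm 1 may be sketched as follows for one specific variation: channel-agnostic transceivers with an RIS of trainable fixed response. Several simplifications are assumptions of this example and not part of the proposed method: the encoder and decoder are single linear layers, the data set and channel set are synthetic, the iteration count is fixed instead of checking convergence, and the gradient is obtained by central finite differences instead of backpropagation, purely to keep the sketch dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)
N_t, N_r, N_m, n_in, n_cls = 2, 2, 4, 3, 2
sizes = [2 * N_t * n_in, n_cls * 2 * N_r, N_m]        # |w_e|, |w_d|, |omega_bar|

def crandn(*shape):
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def forward(w, x, H, noise):
    """Static-MS, channel-agnostic architecture of (15) with linear stand-in DNNs."""
    w_e, w_d, omega = np.split(w, np.cumsum(sizes)[:-1])
    z = w_e.reshape(2 * N_t, n_in) @ x
    s = z[:N_t] + 1j * z[N_t:]
    s = s / np.linalg.norm(s)                          # power step (8) with P = 1
    H_D, H_1, H_2 = H
    y = (H_D + H_2 @ np.diag(np.exp(-1j * omega)) @ H_1.conj().T) @ s + noise
    logits = w_d.reshape(n_cls, 2 * N_r) @ np.concatenate([y.real, y.imag])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def loss(w, x, o, H, noise):
    return -np.log(forward(w, x, H, noise)[o] + 1e-12)  # cross-entropy for label o

def numerical_grad(w, *args, eps=1e-6):
    """Central-difference gradient (a stand-in for automatic differentiation)."""
    g = np.zeros_like(w)
    for k in range(w.size):
        d = np.zeros_like(w)
        d[k] = eps
        g[k] = (loss(w + d, *args) - loss(w - d, *args)) / (2 * eps)
    return g

# Synthetic data set D (label = sign of the first feature) and channel set C
D = [(x, int(x[0] > 0)) for x in rng.standard_normal((64, n_in))]
C = [(crandn(N_r, N_t), crandn(N_t, N_m), crandn(N_r, N_m)) for _ in range(8)]

w0 = rng.standard_normal(sum(sizes))                   # random initialization
w, eta = w0.copy(), 0.05
for t in range(200):
    x, o = D[rng.integers(len(D))]                     # sample (x(t), o(t)) from D
    H = C[rng.integers(len(C))]                        # sample H(t) from C
    noise = 0.1 * crandn(N_r)                          # one AWGN draw per step
    w = w - eta * numerical_grad(w, x, o, H, noise)    # SGD update of (17)
```

In practice, the finite-difference gradient would be replaced by an automatic differentiation framework, and the reconfigurable variation would swap the trainable phase vector for the output of a controller network.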

The crux of the training procedure is the gradient update mechanism of (17). Since (14) and (15) are differentiable operations with respect to $\boldsymbol{w_{\rm s}}$ or $\boldsymbol{w_{\rm r}}$, the partial derivatives may be computed via automatic differentiation tools, by applying the chain rule on the underlying computational graph. Nevertheless, for the sake of completeness, we provide the derivations for the partial derivatives of the various modules, treating the implementation-defined derivatives of the classical neural network components (i.e., $\partial f_{\boldsymbol{w}_{\mathrm{e}}}^{\mathrm{e}}/\partial\boldsymbol{w}_{\mathrm{e}}$, $\partial f_{\boldsymbol{w}_{\mathrm{d}}}^{\mathrm{d}}/\partial\boldsymbol{w}_{\mathrm{d}}$, $\partial f_{\boldsymbol{w}_{\mathrm{m}}}^{\mathrm{m}}/\partial\boldsymbol{w}_{\mathrm{m}}$, and $\partial f_{\boldsymbol{w}_{\mathrm{d}}}^{\mathrm{d}}/\partial\mathbf{y}(t)$) as known. In what follows, we will make use of the identity $\textrm{vec}(\mathbf{A}\mathbf{X}\mathbf{B})=(\mathbf{B}^{\top}\otimes\mathbf{A})\textrm{vec}(\mathbf{X})$ and of the fact that, for an $n$-element vector $\boldsymbol{x}$ and $\boldsymbol{X}={\rm diag}(\boldsymbol{x})$, the vectorization operation on $\boldsymbol{X}$ can be expressed using matrix operations as ${\rm vec}(\boldsymbol{X})=\boldsymbol{D}\boldsymbol{x}$, where $\boldsymbol{D}\triangleq[\boldsymbol{D}_{1}^{\top},\boldsymbol{D}_{2}^{\top},\ldots,\boldsymbol{D}_{n}^{\top}]^{\top}$ is an $n^{2}\times n$ matrix used for selecting the diagonal elements, in which $\boldsymbol{D}_{i}$ is an $n\times n$ matrix with binary elements having $1$ at its $(i,i)$-th element and $0$ elsewhere.
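Both identities can be checked numerically, as in the short sketch below, which assumes the column-stacking convention for ${\rm vec}(\cdot)$ and builds $\boldsymbol{D}$ by stacking the blocks $\boldsymbol{D}_{i}$ vertically. The helper names `vec` and `diag_selection` are hypothetical.

```python
import numpy as np

def vec(X):
    """Column-stacking vectorization."""
    return X.flatten(order="F")

def diag_selection(n):
    """n^2 x n binary matrix D such that vec(diag(x)) = D @ x."""
    D = np.zeros((n * n, n))
    for i in range(n):
        D[i * n + i, i] = 1.0          # block i carries a single 1 at position (i, i)
    return D

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2)) + 1j * rng.standard_normal((5, 2))
x = rng.standard_normal(4)

lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)         # plain transpose, no conjugation
```

Note that the identity uses the plain transpose of $\mathbf{B}$ even for complex matrices, which is why $\mathbf{H}_{\rm 1}^{\ast}(t)$ and not $\mathbf{H}_{\rm 1}^{\dagger}(t)$ appears once it is applied to the cascaded channel.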

For the case of a reconfigurable MS (either an RIS or SIM), $\boldsymbol{\hat{o}}$ is computed via (14). By applying backpropagation, the following expressions are obtained:

$\displaystyle\frac{\partial J_{\rm str}}{\partial\boldsymbol{w_{\rm d}}}$	$\displaystyle=\frac{\partial J_{\rm str}}{\partial\hat{\boldsymbol{o}}(t)}\cdot\frac{\partial f_{\boldsymbol{w_{\rm d}}}^{\rm d}}{\partial\boldsymbol{w_{\rm d}}},$	(20)
$\displaystyle\frac{\partial J_{\rm str}}{\partial\boldsymbol{w_{\rm m}}}$	$\displaystyle=\frac{\partial J_{\rm str}}{\partial\hat{\boldsymbol{o}}(t)}\cdot\frac{\partial f_{\boldsymbol{w_{\rm d}}}^{\rm d}}{\partial\mathbf{y}(t)}\cdot\frac{\partial\mathbf{y}(t)}{\partial f^{\rm m}_{\boldsymbol{w_{\rm m}}}}\cdot\frac{\partial f^{\rm m}_{\boldsymbol{w_{\rm m}}}}{\partial\boldsymbol{w_{\rm m}}},$	(21)
$\displaystyle\frac{\partial J_{\rm str}}{\partial\boldsymbol{w_{\rm e}}}$	$\displaystyle=\frac{\partial J_{\rm str}}{\partial\hat{\boldsymbol{o}}(t)}\cdot\frac{\partial f_{\boldsymbol{w_{\rm d}}}^{\rm d}}{\partial\mathbf{y}(t)}\cdot\frac{\partial\mathbf{y}(t)}{\partial f_{\boldsymbol{w_{\rm e}}}^{\rm e}}\cdot\frac{\partial f_{\boldsymbol{w_{\rm e}}}^{\rm e}}{\partial\boldsymbol{w_{\rm e}}},$	(22)

where $\partial J_{\rm str}/\partial\hat{\boldsymbol{o}}(t)$ concerns the gradient of the problem-defined loss function with respect to the network’s output, which, for the example of the CE loss of (2), is computed as $-\boldsymbol{o}(t)/\boldsymbol{\hat{o}}(t)$ . The remaining terms are defined as follows:

	$\displaystyle\frac{\partial\mathbf{y}(t)}{\partial f_{\boldsymbol{w_{\rm e}}}^{\rm e}}$	$\displaystyle=\mathbf{H}_{\rm 2}(t)\boldsymbol{\Phi}(t)\mathbf{H}_{\rm 1}^{\dagger}(t)+\mathbf{H}_{\rm D}(t),$		(23)
	$\displaystyle\frac{\partial\mathbf{y}(t)}{\partial f^{\rm m}_{\boldsymbol{w_{\rm m}}}}$	$\displaystyle=\frac{\partial\mathbf{y}(t)}{\partial\boldsymbol{\varphi}(t)}=\big((\boldsymbol{s}^{\top}(t)\mathbf{H}_{\rm 1}^{\ast}(t))\otimes\mathbf{H}_{\rm 2}(t)\big)\boldsymbol{D},$		(24)

with $\boldsymbol{D}$ being the $N_{m}^{2}\times N_{m}$ binary selection matrix.
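Since the noise-free received signal is linear in the RIS response vector, the Jacobian of (24) can be verified exactly by perturbing $\boldsymbol{\varphi}(t)$, as in the following sketch. The random channels, the dimensions, and the helper names are assumptions made for this check only.

```python
import numpy as np

rng = np.random.default_rng(0)
N_t, N_r, N_m = 2, 3, 5                              # assumed toy dimensions

def crandn(*shape):
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

H_D, H_1, H_2 = crandn(N_r, N_t), crandn(N_t, N_m), crandn(N_r, N_m)
s = crandn(N_t)

def received(phi):
    """Noise-free y of (4) for an RIS with response vector phi."""
    return (H_D + H_2 @ np.diag(phi) @ H_1.conj().T) @ s

# Binary selection matrix D: a single 1 in row i * N_m + i of column i
D = np.zeros((N_m * N_m, N_m))
D[np.arange(N_m) * (N_m + 1), np.arange(N_m)] = 1.0

# Jacobian of (24): ((s^T H_1^*) kron H_2) D, of size N_r x N_m
row = (s @ H_1.conj())[None, :]                      # s^T H_1^* as a 1 x N_m row
J = np.kron(row, H_2) @ D

phi = np.exp(-1j * rng.uniform(0, 2 * np.pi, N_m))
delta = crandn(N_m)                                  # arbitrary perturbation
difference = received(phi + delta) - received(phi)
```

Each column of `J` is the corresponding column of $\mathbf{H}_{\rm 2}(t)$ scaled by one entry of $\mathbf{H}_{\rm 1}^{\dagger}(t)\boldsymbol{s}(t)$, so the Kronecker product never has to be formed explicitly in an efficient implementation.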

In the fixed-configuration RIS case, $\boldsymbol{\hat{o}}$ is computed via (15), and the trainable configuration may be concretely expressed as $\boldsymbol{\bar{\omega}}\equiv\boldsymbol{\bar{\theta}}$; the phase shift vector is $\boldsymbol{\bar{\varphi}}\equiv\boldsymbol{\bar{\phi}}$. The quantities $\partial J_{\rm str}/\partial\boldsymbol{w_{\rm d}}$ and $\partial J_{\rm str}/\partial\boldsymbol{w_{\rm e}}$ remain the same, while the derivative with respect to the trainable configuration is given by:

\frac{\partial J_{\rm str}}{\partial\boldsymbol{\bar{\omega}}}=\frac{\partial J_{\rm str}}{\partial\boldsymbol{\bar{\theta}}}=\frac{\partial J_{\rm str}}{\partial\hat{\boldsymbol{o}}(t)}\cdot\frac{\partial f_{\boldsymbol{w_{\rm d}}}^{\mathrm{d}}}{\partial\mathbf{y}(t)}\cdot\frac{\partial\mathbf{y}(t)}{\partial\boldsymbol{\bar{\phi}}}\cdot\frac{\partial\boldsymbol{\bar{\phi}}}{\partial\boldsymbol{\bar{\theta}}},

(25)

where $\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\phi}}$ can be computed as in (24), while $\partial\boldsymbol{\bar{\phi}}/\partial\boldsymbol{\bar{\theta}}=-\jmath\exp(-\jmath\boldsymbol{\bar{\theta}})$.

For the fixed-configuration SIM case, the trainable configuration is expressed as $\boldsymbol{\bar{\omega}}\equiv\boldsymbol{\bar{\vartheta}}$ , while the static phase shifts are $\boldsymbol{\bar{\varphi}}\equiv\boldsymbol{\bar{\psi}}$ , and again, $\boldsymbol{\hat{o}}$ is computed via (15). Similarly, the following derivations hold:

\frac{\partial J_{\rm str}}{\partial\boldsymbol{\bar{\omega}}}=\frac{\partial J_{\rm str}}{\partial\boldsymbol{\bar{\vartheta}}}=\frac{\partial J_{\rm str}}{\partial\hat{\boldsymbol{o}}(t)}\cdot\frac{\partial f_{\boldsymbol{w_{\rm d}}}^{\mathrm{d}}}{\partial\mathbf{y}(t)}\cdot\frac{\partial\mathbf{y}(t)}{\partial\boldsymbol{\bar{\psi}}}\cdot\frac{\partial\boldsymbol{\bar{\psi}}}{\partial\boldsymbol{\bar{\vartheta}}},

(26)

where $\partial J_{\rm str}/\partial\hat{\boldsymbol{o}}(t)$ and $\partial f_{\boldsymbol{w}_{\mathrm{d}}}^{\mathrm{d}}/\partial\mathbf{y}(t)$ are the same as before, while, similarly, $\partial\boldsymbol{\bar{\psi}}/\partial\boldsymbol{\bar{\vartheta}}=-\jmath\exp(-\jmath\boldsymbol{\bar{\vartheta}})$. Since now $\mathcal{T}(\cdot)$ involves the SIM system model, $\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\psi}}$ requires further derivations. Following the same procedure as in (24), and by denoting the response matrix of each of the $M$ SIM layers as $\boldsymbol{\bar{\Psi}}_{m}\triangleq{\rm diag}(\boldsymbol{\bar{\psi}}_{m})$, it is deduced that $\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\psi}}=[[\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\psi}}_{1}]^{\top},[\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\psi}}_{2}]^{\top},\ldots,[\partial\mathbf{y}(t)/\partial\boldsymbol{\bar{\psi}}_{M}]^{\top}]^{\top}$ with:

\frac{\partial\mathbf{y}(t)}{\partial\boldsymbol{\bar{\psi}}_{m}}=\begin{cases}\Big(\big(\boldsymbol{s}^{\top}(t)\mathbf{H}_{\rm 1}^{\ast}(t)\big)\otimes\big(\mathbf{H}_{\rm 2}(t)\prod_{m^{\prime}=M}^{2}\boldsymbol{\bar{\Psi}}_{m^{\prime}}\boldsymbol{\Xi}_{m^{\prime}}\big)\Big)\boldsymbol{D},&m=1\\ \Big(\big(\big(\prod_{m^{\prime}=m}^{2}\boldsymbol{\Xi}_{m^{\prime}}\boldsymbol{\bar{\Psi}}_{m^{\prime}-1}\big)\mathbf{H}_{\rm 1}^{\dagger}(t)\boldsymbol{s}(t)\big)^{\top}\otimes\big(\mathbf{H}_{\rm 2}(t)\prod_{m^{\prime}=M}^{m+1}\boldsymbol{\bar{\Psi}}_{m^{\prime}}\boldsymbol{\Xi}_{m^{\prime}}\big)\Big)\boldsymbol{D},&m=2,\dots,M\end{cases}.
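The per-layer Jacobian blocks above can be checked in the same manner as (24), because the noise-free received signal is linear in each $\boldsymbol{\bar{\psi}}_{m}$ when the other layers are held fixed. The sketch below is a verification aid under assumed toy dimensions; the random matrices stored in `Xi` are placeholders for the coefficients of (6), with `Xi[m - 2]` holding $\boldsymbol{\Xi}_{m}$ in the paper's numbering.

```python
import numpy as np

rng = np.random.default_rng(0)
N_t, N_r, N_m, M = 2, 3, 4, 3                       # assumed toy dimensions

def crandn(*shape):
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

H_1, H_2, s = crandn(N_t, N_m), crandn(N_r, N_m), crandn(N_t)
Xi = crandn(M - 1, N_m, N_m)                        # placeholders for Xi_2, ..., Xi_M
psi = np.exp(-1j * rng.uniform(0, 2 * np.pi, (M, N_m)))

def cascaded(psi):
    """H_2 Upsilon H_1^H s with Upsilon from (7); the direct link is psi-independent."""
    v = psi[0] * (H_1.conj().T @ s)
    for m in range(1, M):
        v = psi[m] * (Xi[m - 1] @ v)
    return H_2 @ v

D = np.zeros((N_m * N_m, N_m))
D[np.arange(N_m) * (N_m + 1), np.arange(N_m)] = 1.0

def jacobian_block(psi, m):
    """d y / d psi_m for layer m = 1, ..., M (paper's numbering)."""
    # Right factor: signal impinging on layer m before its phase shift
    b = H_1.conj().T @ s
    for k in range(1, m):                           # apply Psi_k, then Xi_{k+1}
        b = Xi[k - 1] @ (psi[k - 1] * b)
    # Left factor: H_2 Psi_M Xi_M ... Psi_{m+1} Xi_{m+1}
    A = H_2.copy()
    for k in range(M, m, -1):
        A = A @ np.diag(psi[k - 1]) @ Xi[k - 2]
    return np.kron(b[None, :], A) @ D

errors = []
for m in range(1, M + 1):
    delta = crandn(N_m)
    perturbed = psi.copy()
    perturbed[m - 1] += delta                       # perturb only layer m
    gap = cascaded(perturbed) - cascaded(psi) - jacobian_block(psi, m) @ delta
    errors.append(np.max(np.abs(gap)))
```

All entries of `errors` should vanish up to floating-point precision, which confirms that the Kronecker product must be formed before the multiplication with $\boldsymbol{D}$.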