Abstract
A high efficiency hardware integration of neural networks benefits from realizing nonlinearity, network connectivity and learning fully in a physical substrate. Multiple systems have recently implemented some or all of these operations, yet the focus was placed on addressing technological challenges. Fundamental questions regarding learning in hardware neural networks remain largely unexplored. Noise in particular is unavoidable in such architectures, and here we experimentally and theoretically investigate its interaction with a learning algorithm using an opto-electronic recurrent neural network. We find that noise strongly modifies the system’s path during convergence, and surprisingly fully decorrelates the final readout weight matrices. This highlights the importance of understanding architecture, noise and learning algorithm as interacting players, and therefore identifies the need for mathematical tools for noisy, analogue system optimization.
1 Introduction
In recent years, neural networks (NNs) have taken centre-stage in advancing computation [1]. Optimized by training, such learning machines provide key advantages for solving abstract computational problems and already outperform humans in numerous tasks previously deemed impossible for classically (algorithmically) programmed computers [1], [2], [3].
However, NNs are still mostly emulated by traditional Turing/von Neumann computers. The absence of computing hardware supporting fully parallel NNs reduces energy efficiency and overall speed, and new paradigms addressing these problems are desirable. The lack of a parallel network substrate is a fundamental roadblock and has been an active area of research for decades, with current analogue hardware implementing either the full network [4], [5], [6] or the neurons [7], [8], [9], [10], [11]. An implementation of nonlinear neurons, fully parallel information transduction and learning on a substrate level promises a revolution, and photonic NNs [12], [13] remain a highly promising avenue [4], [5].
Noise is an inseparable companion of analogue hardware [14], yet the fundamental aspects of optimizing a noisy NN [15], [16], [17] have so far hardly been explored, neither in experiments [18], [19] nor in theory [14]. Here, we investigate the interactions between noise, learning rules and the topology of an error landscape for the first time. We experimentally implement a NN with 961 electro-optical neurons via a spatial light modulator (SLM) [17], use diffraction [4], [5], [10], [20] to physically realize the network’s internal connections, and use a digital micro-mirror device (DMD) for programmable Boolean readout weights [17]. The particular NN task consists of one-step-ahead prediction of the chaotic Mackey–Glass time series. Learning exclusively modifies the readout connections [21] via an evolutionary Boolean algorithm based on the error gradient only, and optimization uses either fully random (Markovian) or structured (greedy) exploration.
The statistics of the experimentally obtained learning trajectories prove that noise and exploration strategy strongly interact. Noise induces a kind of random forcing upon the descent algorithm, which strongly modifies the system’s path during its convergence towards a local minimum. We find that noise decorrelates the final weight configurations: starting from identical weight configurations and exploring the error landscape’s dimensions in identical sequences always leads to clearly differentiated local minima. Quite astonishingly, all minima are spaced at an almost constant distance from each other, which for the generally non-trivial error landscape topologies is unusual at the least. Noise therefore appears to arrange minimizers in periodic positions, much like competitive Brownian walkers with non-local interactions [22]. These fundamental effects highlight the importance of considering hardware architecture, noise and learning algorithm as intimately linked.
2 Neural network hardware
A recurrent NN inspired by reservoir computing, illustrated in Figure 1a, was our experimentally realized NN test bench. Figure 1b schematically depicts the experiment. An optical plane wave E0 illuminates the SLM’s pixels, and the reflected field is filtered by a polarizing beam splitter (PBS). The SLM combined with the PBS creates a cos(·) non-linearity and the SLM’s pixels physically encode the NN’s state. A quarter wave plate located between the PBS and the mirror directs the signal towards a camera, and a double pass through the diffractive optical element (DOE) establishes the recurrent connections WDOE [10], [17], [20]. Camera state
is combined with external input information u (n+1) and sent to the SLM, creating the network’s state according to
Here, N = 961 is the recurrent layer’s number of nodes,
The polarization reflected by the PBS is imaged onto the DMD, whose mirrors are programmed to fixed angles of ±12° from normal incidence. A photodiode only detects optical signals reflected off mirrors at −12°, thus implementing a Boolean readout weight matrix
Here, k is the learning epoch and
As the network is constructed of physical neurons it harbours noise, which can be either additive or multiplicative, as well as correlated or uncorrelated [14]. The main sources of noise in our experiment are the SLM and the camera, in relation to which the illumination laser and output detector can be considered noiseless, as can the internal coupling and readout matrices implemented by the DOE and DMD, respectively. All relevant noise sources are therefore reservoir-internal, and our following discussion is by no means limited to systems where the readout layer is implemented physically. More details about the theoretical treatment and propagation of noise in NNs, as well as the individual noise sources and their respective amplitudes and statistics, can be found in [14].
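To make the four noise categories named above concrete, the following minimal Python sketch applies each of them to a vector of node states. The function name `add_noise`, the Gaussian statistics and the amplitude `std` are illustrative assumptions; they are not the experimentally characterised noise of the SLM and camera, for which Ref. [14] should be consulted.

```python
import numpy as np

def add_noise(x, rng, std, kind="additive", correlated=False):
    """Return a noisy copy of the 1-D node-state vector x.

    correlated=True  -> one random draw shared by all nodes
    correlated=False -> one independent draw per node
    (Gaussian noise of amplitude `std` is an assumption of this sketch.)
    """
    n_draws = 1 if correlated else x.size
    xi = std * rng.standard_normal(n_draws)   # shape (1,) broadcasts over x
    if kind == "additive":
        return x + xi
    if kind == "multiplicative":
        return x * (1.0 + xi)
    raise ValueError("kind must be 'additive' or 'multiplicative'")

rng = np.random.default_rng(0)
x = np.linspace(0.1, 1.0, 8)                  # hypothetical node states
for kind in ("additive", "multiplicative"):
    for correlated in (False, True):
        print(kind, correlated, add_noise(x, rng, 0.05, kind, correlated))
```

Additive noise perturbs every node regardless of its amplitude, whereas multiplicative noise scales with the node's state, so a node at zero remains at zero.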
3 Boolean evolutionary learning
Most current learning techniques require complete knowledge of the internal network’s state [21], all connection weights and potentially all gradients [1]. In a hardware network this demands probing (and most probably externally storing) the value of each node and connection, which necessitates auxiliary circuitry of a complexity potentially exceeding the actual neural network. This jeopardizes precisely the benefits one targets when mapping a neural network onto hardware. We therefore employ learning that only tracks the computation error’s evolution, and hence imposes no constraint on the type of neurons, and more broadly, on hidden layers as a whole. Such an implementation’s complexity does not depend on, and hence does not limit the NN’s size, which is crucial considering the importance of scalability for computing.
Here, we optimize the DMD’s configuration simply by measuring the impact of output mirrors’ modifications onto computing error
3.1 Mutation
We create a vector with N random elements, independently and identically distributed between 0 and 1 (rand (N)). Wbias offers the possibility of modifying the otherwise stochastic
A fully stochastic Markovian descent is obtained with Wbias = 1 and excluding Eq. (6). However, here we also investigate exploration which makes mutating a particular connection in near succession unlikely. There, Wbias is randomly initialized at k = 1, and at each epoch Eq. (6) increases the bias of all connections by 1/N, while the currently modified connection’s bias is set to zero. On average, the probability of again probing a particular weight reaches unity only after N learning epochs have passed, and we therefore refer to this biased exploration as greedy learning.
According to these instructions, our algorithm only probes, hence potentially mutates one mirror at a time. We have considered updating more than one mirror at each k, yet simplified numerical simulations indicate that convergence was significantly faster for updating only one weight at a time.
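The mutation step described in this section can be sketched in Python as follows. The sketch assumes that the mirror to probe is the position of the largest element of rand(N) weighted by Wbias; this selection rule, like the function names `select_mirror`, `update_bias` and `mutate`, is an assumption made for illustration. The bias bookkeeping (add 1/N everywhere, reset the probed entry to zero) follows the description of greedy learning given above.

```python
import numpy as np

def select_mirror(w_bias, rng):
    """Index l(k) of the single readout weight to probe at this epoch.
    Assumption: the largest element of rand(N) * W_bias is selected."""
    w_select = rng.random(w_bias.size) * w_bias
    return int(np.argmax(w_select))

def update_bias(w_bias, l):
    """Greedy bookkeeping: raise every bias by 1/N, reset the probed one."""
    w_bias = w_bias + 1.0 / w_bias.size
    w_bias[l] = 0.0
    return w_bias

def mutate(w_dmd, l):
    """Return a copy of the Boolean readout vector with mirror l flipped."""
    w_new = w_dmd.copy()
    w_new[l] = np.logical_not(w_new[l])
    return w_new

rng = np.random.default_rng(0)
N = 961
w_dmd = rng.random(N) < 0.5        # random Boolean readout configuration
w_bias = rng.random(N)             # greedy learning: random bias at k = 1
probed = []
for k in range(N):
    l = select_mirror(w_bias, rng)
    w_dmd = mutate(w_dmd, l)       # acceptance or rejection is decided later
    w_bias = update_bias(w_bias, l)
    probed.append(l)

print("distinct mirrors probed in N epochs:", len(set(probed)))
```

For the fully stochastic Markovian variant one would keep `w_bias = np.ones(N)` and skip the call to `update_bias`, so that every mirror is equally likely to be probed at every epoch.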
3.2 Error and reward signals
Mean square error
3.3 Descent action
Based on reward r(k), the DMD’s current configuration either accepts or rejects the previous modification, Eq. (11). For a noise-less system, reward r(k) is therefore simply based on the gradient found at position l(k). We will refer to this hypothetical gradient of a noise-less system as the systematic gradient.
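As a minimal sketch of the accept-or-reject descent, the Python code below runs the loop on a toy linear Boolean readout. The toy states, the hidden target configuration `w_true`, the additive Gaussian noise and the rule "keep the flip only if the measured error decreased" are all assumptions made for illustration; they stand in for the optical experiment and for the reward and descent equations referred to above.

```python
import numpy as np

def measure_error(states, w, target, noise_std, rng):
    """Normalised mean square error of a noisy Boolean readout."""
    noisy = states + noise_std * rng.standard_normal(states.shape)
    y_out = noisy @ w.astype(float)
    return float(np.mean((y_out - target) ** 2) / np.var(target))

def boolean_descent(states, target, w_init, epochs, noise_std, seed):
    """Flip one weight per epoch and keep the flip only if it is rewarded."""
    rng = np.random.default_rng(seed)
    w = w_init.copy()
    err = measure_error(states, w, target, noise_std, rng)
    history = [err]
    for _ in range(epochs):
        l = int(rng.integers(w.size))      # Markovian exploration
        w[l] = not w[l]                    # mutation
        new_err = measure_error(states, w, target, noise_std, rng)
        if new_err < err:                  # reward: accept the mutation
            err = new_err
        else:                              # no reward: revert the mutation
            w[l] = not w[l]
        history.append(err)
    return w, history

data_rng = np.random.default_rng(1)
T, N = 170, 64                             # toy sizes, not the experiment's
states = data_rng.random((T, N))
w_true = data_rng.random(N) < 0.5          # hypothetical optimal readout
target = states @ w_true.astype(float)
w_init = data_rng.random(N) < 0.5

w_final, history = boolean_descent(states, target, w_init, 500, 0.0, seed=2)
print("initial error:", history[0], "final error:", history[-1])
```

With `noise_std = 0` every decision reflects the systematic gradient. Setting `noise_std > 0` and repeating the run with different seeds from the same `w_init` gives a simple way to observe how noise can invert individual accept-or-reject decisions.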
4 Results
While such Boolean learning has been applied to a wide range of computational problems, recurrent neural networks have a particular relevance for dynamical signal processing, and we therefore explore one-step-ahead prediction of the chaotic Mackey–Glass sequence with a Lyapunov exponent of ∼3·10^−3. This particular input is a commonly employed benchmark test and our results are therefore directly comparable to other works such as Mackey–Glass prediction based on a semiconductor laser delay reservoir with weights optimized and applied in an offline procedure [23], as well as the seminal work on RC [21], where, however, the time step was twice as large.
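For readers who wish to reproduce the task numerically, the following Python sketch generates a Mackey–Glass sequence and arranges it into input and target pairs for one-step-ahead prediction. The delay-equation parameters (tau = 17, beta = 0.2, gamma = 0.1, exponent 10), the Euler integration step and the sampling interval are the commonly used benchmark values and are assumptions here, since the text does not state them; only the zero-mean, unit-standard-deviation normalisation and the training length T = 170 are taken from the paper.

```python
import numpy as np

def mackey_glass(n_samples, tau=17.0, beta=0.2, gamma=0.1, power=10,
                 dt=0.1, subsample=10, x0=1.2, discard=1000):
    """Euler integration of dx/dt = beta*x(t-tau)/(1+x(t-tau)**power) - gamma*x.

    One output sample is kept every `subsample` integration steps, and the
    first `discard` samples are dropped as a transient.
    """
    delay = int(round(tau / dt))
    total = (n_samples + discard) * subsample
    x = np.full(total + delay, x0)          # constant initial history
    for i in range(delay, total + delay - 1):
        x_tau = x[i - delay]
        x[i + 1] = x[i] + dt * (beta * x_tau / (1.0 + x_tau ** power)
                                - gamma * x[i])
    return x[delay::subsample][discard:discard + n_samples]

series = mackey_glass(171)
u = (series - series.mean()) / series.std()   # zero mean, unit std
inputs, targets = u[:-1], u[1:]               # predict u(n+1) from u(n)
print(len(inputs), "training pairs")
```

Because the sequence is normalised to unit standard deviation, a mean square error computed on it can be read directly as a normalised error.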
The chaotic sequence acting as input information u(n+1) has zero mean and is normalized to its standard deviation, making error
Understanding why generalization is possible for a training set size (T = 170) not orders of magnitude larger than the number of weights to be optimized (N = 961) is an interesting question. Recent results on deep neural networks, triggered by the insightful analysis from Ref. [24], show that overparametrization may not preclude generalization. See Ref. [25] for an account of this phenomenon using random matrix theory, starting from simple linear models and generalizing to kernel estimation. In our setting, however, we might additionally postulate that we work below the overparametrization barrier due to the Boolean entries of WDMD(k), which brings substantial rigidity into play. The price one pays is making the problem harder from a computational optimization viewpoint [26].
Typically, the main metrics for evaluating learning are speed of convergence kmin and final inference error
4.1 Average and local features of convergence and minima
On average, the error landscape topology excellently follows an exponential decay for both exploration strategies, see fit (blue line) to the average error (red crosses) in Figure 2. Comparing individual trajectories, however, reveals strong inter-trial differences significantly exceeding the noise level. This diversity corresponds to the error landscape’s topological richness probed by the different random descents, and trajectories range from rather smooth descents to paths including steep drops. No correlation between the starting
Nevertheless, despite the small deviations of
5 Noise sensitivity
To further investigate this phenomenon, we reduce the number of uncertainties during learning. We measure three minimizer paths starting at the same
Results are shown in Figure 4a. The blue, green and red lines correspond to the different errors
To understand this behaviour, we therefore have to consider the impact of noise and learning upon the system’s error
and
The fact that according to Eq. (12) noise (
The objective of modifying a readout weight is to probe the error landscape’s systematic gradient. However, this action is contaminated by noise which can potentially exceed the systematic gradient in the opposite direction. The consequence is a change in the sign of
Probability C is the driving force behind the growing separation between two identical minimizers, and two situations are relevant. The first situation occurs when r(k) for one minimizer is inverted by noise while the other preserves its systematic value, which has a probability of C(1−C) + (1−C)C = 2C(1−C). The other situation is that both minimizers have an identical reward r(k), which can be the consequence either of both retaining their systematic result or of both being inverted by noise, with a combined probability of (1−C)^2 + C^2 = 1−2C(1−C). The first situation leads to H(k + 1) ≠ H(k), the second to H(k + 1) = H(k), and the Hamming distance’s rate equation is
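The two probabilities above can be checked numerically. The Python sketch below draws independent noise-induced reward inversions for two minimizers and compares the observed fraction of disagreeing rewards with 2C(1−C). The value C = 0.2 and the assumption that the two inversions are statistically independent are illustrative choices, not measured quantities.

```python
import numpy as np

def disagreement_probability(c):
    """Exactly one of the two rewards is inverted by noise."""
    return 2.0 * c * (1.0 - c)

def agreement_probability(c):
    """Both rewards keep their systematic value, or both are inverted."""
    return (1.0 - c) ** 2 + c ** 2

rng = np.random.default_rng(0)
C = 0.2                                   # hypothetical inversion probability
trials = 200_000
inverted_a = rng.random(trials) < C       # minimizer a: reward inverted?
inverted_b = rng.random(trials) < C       # minimizer b: reward inverted?
empirical = float(np.mean(inverted_a != inverted_b))

print("empirical disagreement:", empirical)
print("2C(1-C):               ", disagreement_probability(C))
print("sum of both cases:     ",
      disagreement_probability(C) + agreement_probability(C))
```

The two cases are exhaustive, so their probabilities sum to one for any C between 0 and 1.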
Here,
where
The Hamming distance’s evolution is therefore governed by noise quantified through constant
For fully random mutation, the probability of a weight to be selected is identical at every k, and hence the Hamming distance at the previous epoch k determines the probability of two weights being opposite in their configuration:
Figure 4b shows the evolution of Hamming distance
Different minimizers therefore always arrive at final readout configurations which share no common feature. This suggests a closer look into the role and relevance of individual weights: how many induce a systematic contribution to convergence at all, and if their gradients depend on the sequence of previous mutations. We optimized readout weights via two minimizers starting at different random positions WDMD,a(1) and WDMD,b(1), which arrived at two distinct local minima
Weights insensitive (sensitive) to preceding optimizations correspond to linearly independent (linearly dependent) NN dimensions. Linearly independent NN dimensions must always induce the same gradient, regardless of the preceding optimization path, and they therefore have to be located on the red diagonal line in Figure 5b. The figure’s green area indicates the linear-independence criterion when considering the impact of noise
6 Discussion
Our experimental findings and analytical descriptions are the first of their kind and stimulate a fundamental discussion. Equation (12) is of interesting consequence for noisy hardware NNs comprising linear readout weights. It links the susceptibility of
We would like to also propose alternative noise-mitigation approaches which are a derivative of our findings. Simply suppressing noise on a hardware level is potentially expensive, and topological requirements can limit mitigation based on connectivity statistics [14]. One might therefore curb the impact of noise by modified learning strategies. Noise will first of all limit the absolute performance, but also cause this performance to fluctuate, which is an effect one could address for example by amending an optimization’s cost function by the gradients encountered in the proximity of a neighborhood. According to Eq. (12) the local gradients
Equation (15) shows that for C > 0 the Hamming distance between readout weights of two systems will always tend towards complete decorrelation as
We have shown that the large majority of our NN’s dimensions are most likely linearly dependent. In practical terms, this means that each modification of a weight has to be interpreted in the context of all previous modifications. Each configuration WDMD(k) therefore encodes the history of modifications to the reward due to noise during the previous learning epochs.
One direct consequence for applications is that one cannot simply transfer or swap weight configurations between optimized analogue neural networks, even for potentially available identical twin networks. The reason is that optimized configurations are not only the consequence of error landscape, system properties and noise, but also of the precise history of noise during an exploration path. Even almost perfectly reproducible hardware networks will therefore always have to be individually trained for optimal performance; simply uploading a configuration will potentially not work. A ‘school’ in which each neural network learns individually might therefore be required. Finally, our findings open a new field where such twin-minimizers could be considered for probing and interrogating unknown hardware neural networks. The average divergences shown in Figure 4b agree exceptionally well with our model, and based on such data one can therefore make accurate inferences about the noise properties of a hardware NN and about its error landscape exploration strategy.
7 Conclusions
In our work we have investigated the intricate interactions between different learning concepts and the noise inherently present in analogue neural networks. We experimentally showed that trajectories of individual minimizers (i.e. learning trajectories) strongly diverge, and were able to analytically link this divergence to a constant ratio between output error and noise susceptibility. Our analytical description only assumes a linear multiplication between a NN’s state and its readout weights, and hence should be generally applicable to this wide class of analogue hardware NNs.
Funding source: Region Bourgogne Franche-Comté
Award Identifier / Grant number: ANR-14-OHRI-0002-02, ANR-17-EURE-0002
Funding source: H2020 Marie Skłodowska-Curie Actions
Award Identifier / Grant number: 713694 (MULTIPLY), 860830 (POST DIGITAL)
Funding source: Volkswagen Foundation
Award Identifier / Grant number: NeuroQNet
Acknowledgement
The authors acknowledge the support of the Region Bourgogne Franche-Comté. This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No. 860360 (POST DIGITAL) and No. 713694 (MULTIPLY). This work was also supported by the EUR EIPHI program (Contract No. ANR-17-EURE-0002), the BiPhoProc project (Contract No. ANR-14-OHRI-0002-02), and by the Volkswagen Foundation (NeuroQNet).
Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: This research was funded by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No. 860830 (POST DIGITAL) and No. 713694 (MULTIPLY). This work was also supported by the EUR EIPHI program (Contract No. ANR-17-EURE-0002), the BiPhoProc project (Contract No. ANR-14-OHRI-0002-02), and by the Volkswagen Foundation (NeuroQNet).
Conflict of interest statement: The authors declare no conflicts of interest regarding this article.
References
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015, https://doi.org/10.1038/nature14539.
[2] B. Amos, B. Ludwiczuk, and M. Satyanarayanan, OpenFace: A General-Purpose Face Recognition Library with Mobile Applications, Pittsburgh, Carnegie Mellon University, Tech. Rep., 2016.
[3] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, IEEE, 2013, pp. 6645–6649, https://doi.org/10.1109/ICASSP.2013.6638947.
[4] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, M. Jarrahi, and A. Ozcan, “All-optical machine learning using diffractive deep neural networks,” Science, vol. 26, pp. 1–20, 2018, https://doi.org/10.1126/science.aat8084.
[5] Y. Shen, N. C. Harris, S. Skirlo, et al., “Deep learning with coherent nanophotonic circuits,” Nat. Photonics, vol. 11, pp. 441–446, 2017, https://doi.org/10.1038/nphoton.2017.93.
[6] A. N. Tait, S. Member, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Broadcast and weight: an integrated network for scalable photonic spike processing,” J. Lightwave Technol., vol. 32, no. 21, pp. 3427–3439, 2014, https://doi.org/10.1109/JLT.2014.2345652.
[7] L. Appeltant, M. C. Soriano, G. V. D. Sande, et al., “Information processing using a single dynamical node as complex system,” Nat. Commun., vol. 2, p. 468, 2011, https://doi.org/10.1038/ncomms1476.
[8] F. Duport, B. Schneider, A. Smerieri, M. Haelterman, and S. Massar, “All-optical reservoir computing,” Opt. Express, vol. 20, pp. 22783–22795, 2012, https://doi.org/10.1364/OE.20.022783.
[9] L. Larger, M. C. Soriano, D. Brunner, et al., “Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing,” Opt. Express, vol. 20, pp. 3241–3249, 2012, https://doi.org/10.1364/OE.20.003241.
[10] D. Brunner and I. Fischer, “Reconfigurable semiconductor laser networks based on diffractive coupling,” Opt. Lett., vol. 40, pp. 3854–3857, 2015, https://doi.org/10.1364/OL.40.003854.
[11] J. Torrejon, M. Riou, F. A. Araujo, et al., “Neuromorphic computing with nanoscale spintronic oscillators,” Nature, vol. 547, pp. 428–431, 2017, https://doi.org/10.1038/nature23011.
[12] N. H. Farhat, D. Psaltis, A. Prata, and E. Paek, “Optical implementation of the Hopfield model,” Appl. Opt., vol. 24, no. 10, pp. 1469–1475, 1985, https://doi.org/10.1364/AO.24.001469.
[13] D. Psaltis and N. Farhat, “Optical information processing based on an associative-memory model of neural nets with thresholding and feedback,” Opt. Lett., vol. 10, pp. 98–100, 1985, https://doi.org/10.1364/ol.10.000098.
[14] N. Semenova, X. Porte, L. Andreoli, M. Jacquot, L. Larger, and D. Brunner, “Fundamental aspects of noise in analog-hardware neural networks,” Chaos, vol. 29, no. 10, p. 103128, 2019, https://doi.org/10.1063/1.5120824.
[15] M. Hermans, P. Antonik, M. Haelterman, and S. Massar, “Embodiment of learning in electro-optical signal processors,” Phys. Rev. Lett., vol. 117, 2016, Art no. 128301, https://doi.org/10.1103/PhysRevLett.117.128301.
[16] P. Antonik, M. Haelterman, and S. Massar, “Brain-inspired photonic signal processor for generating periodic patterns and emulating chaotic systems,” Phys. Rev. Appl., vol. 7, no. 5, 2017, Art no. 054014, https://doi.org/10.1103/PhysRevApplied.7.054014.
[17] J. Bueno, S. Maktoobi, L. Froehly, et al., “Reinforcement learning in a large scale photonic recurrent neural network,” Optica, vol. 5, pp. 756–760, 2018, https://doi.org/10.1364/OPTICA.5.000756.
[18] R. Alata, J. Pauwels, M. Haelterman, and S. Massar, “Phase noise robustness of a coherent spatially parallel optical reservoir,” IEEE J. Select. Top. Quant. Electron., vol. 26, no. 1, pp. 1–10, 2020, https://doi.org/10.1109/CLEOE-EQEC.2019.8872207.
[19] M. C. Soriano, S. Ortín, D. Brunner, et al., “Optoelectronic reservoir computing: tackling noise-induced performance degradation,” Opt. Express, vol. 21, pp. 12–20, 2013, https://doi.org/10.1364/OE.21.000012.
[20] S. Maktoobi, L. Froehly, L. Andreoli, et al., “Diffractive coupling for photonic networks: how big can we go?,” IEEE J. Select. Top. Quant. Electron., vol. 26, no. 1, pp. 1–8, 2020, https://doi.org/10.1109/JSTQE.2019.2930454.
[21] H. Jaeger and H. Haas, “Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication,” Science, vol. 304, pp. 78–80, 2004, https://doi.org/10.1126/science.1091277.
[22] E. Heinsalu, E. Hernández-García, and C. López, “Competitive Brownian and Lévy walkers,” Phys. Rev. E Stat. Nonlinear Soft Matter Phys., vol. 85, no. 4, pp. 1–10, 2012, https://doi.org/10.1103/PhysRevE.85.041105.
[23] J. Bueno, D. Brunner, M. Soriano, and I. Fischer, “Conditions for reservoir computing performance using semiconductor lasers with delayed optical feedback,” Opt. Express, vol. 25, no. 3, pp. 2401–2412, 2017, https://doi.org/10.1364/OE.25.002401.
[24] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine learning and the bias-variance trade-off,” Proc. Natl. Acad. Sci., vol. 116, no. 32, pp. 15849–15854, 2018, https://doi.org/10.1073/pnas.1903070116.
[25] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, “Surprises in high-dimensional ridgeless least squares interpolation,” arXiv preprint arXiv:1903.08560, 2019.
[26] F. Hadaeghi and H. Jaeger, “Computing optimal discrete readout weights in reservoir computing is NP-hard,” Neurocomputing, vol. 338, pp. 233–236, 2019, https://doi.org/10.1016/j.neucom.2019.02.009.
[27] X. Porte, L. Andreoli, M. Jacquot, L. Larger, and D. Brunner, “Reservoir-size dependent learning in analogue neural networks,” in Artificial Neural Networks and Machine Learning – ICANN 2019: Workshop and Special Sessions, I. V. Tetko, V. Kůrková, P. Karpov, and F. Theis, Eds., Cham, Springer International Publishing, 2019, pp. 184–192, https://doi.org/10.1007/978-3-030-30493-5_21.
[28] S. Liu, B. Kailkhura, P.-Y. Chen, P. Ting, S. Chang, and L. Amini, “Zeroth-order stochastic variance reduction for nonconvex optimization,” in Advances in Neural Information Processing Systems, 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada, 2018, pp. 3727–3737, https://doi.org/10.1109/GlobalSIP.2018.8646618.
[29] M. Freiberger, A. Katumba, P. Bienstman, and J. Dambre, “Training passive photonic reservoirs with integrated optical readout,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 7, pp. 1943–1953, 2019, https://doi.org/10.1109/TNNLS.2018.2874571.
[30] A. Suarez-Perez, G. Gabriel, B. Rebollo, et al., “Quantification of signal-to-noise ratio in cerebral cortex recordings using flexible MEAs with co-localized platinum black, carbon nanotubes, and gold electrodes,” Front. Neurosci., vol. 12, pp. 1–12, 2018, https://doi.org/10.3389/fnins.2018.00862.
© 2020 Louis Andreoli et al., published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.