Introduction
Deep learning has solved numerous problems in computer vision [1], [2], [3], [4], [5], [6], speech recognition [7], [8], [9], and natural language processing [10], [11], [12], [13], [14]. Neural networks have been instrumental in outperforming world champions in a diverse range of games from Go to StarCraft [15], [16], and they are now surpassing the diagnostic capability of clinical specialists in numerous medical tasks [17], [18], [19], [20], [21]. However, for all the state-of-the-art models designed every day, a Kaggle [22] contest for state-of-the-art energy efficiency would go to the brain, every time. A new generation of brain-inspired spiking neural networks (SNNs) is poised to bridge this efficiency gap.
The amount of computational power required to run top-performing deep learning models has increased at a rate that far outpaces improvements in hardware efficiency, widening the very energy gap that neuromorphic approaches aim to close.
A. Neuromorphic Computing: A Quick Snapshot
Neuromorphic (“brain-like”) engineering strives to imitate the computational principles of the brain to drive down the energy cost of artificial intelligence systems. To replicate a biological system, we build on three parts.
Neuromorphic sensors that take inspiration from biological sensors, such as the retina or cochlea, and typically record changes in a signal instead of sampling it at regular intervals. Signals are only generated when a change occurs, and the signal is referred to as a “spike.”
Neuromorphic algorithms that learn to make sense of spikes are known as SNNs. Instead of floating point values, SNNs work with single-bit, binary activations (spikes) that encode information over time, rather than in magnitude. As such, SNNs take advantage of low-precision parameters and high spatial and temporal sparsity.
Neuromorphic hardware is specialized for the power-efficient execution of these models. Sparse activations reduce data movement both on and off a chip to accelerate neuromorphic workloads, which can lead to large power and latency gains compared to the same task on conventional hardware.
Armed with these three components, neuromorphic systems are equipped to bridge the efficiency gap between today’s and future intelligent systems.
What lessons can be learned from the brain to build more efficient neural networks? Should we replicate the genetic makeup of a neuron right down to the molecular level [29], [30]? Do we look at the way memory and processing coalesce within neurons and synapses [31], [32]? Or should we aim to extract the learning algorithms that underpin the brain [33]? This article homes in on the intricacies of training brain-inspired neuromorphic algorithms, ultimately moving toward the goal of harnessing natural intelligence to further improve our use of artificial intelligence. SNNs can already be optimized using the tools available to the deep learning community. However, the brain-inspired nature of these emerging sensors, neuron models, and training methods is different enough to warrant a deep dive into biologically inspired neural networks.
B. Neuromorphic Systems in the Wild
The overarching aim is to combine artificial neural networks (ANNs), which have already proven their worth in a broad range of domains, with the potential efficiency of SNNs [34]. So far, SNNs have staked their claim to a range of applications where power efficiency is of utmost importance.
Fig. 1 offers a small window into the uses of SNNs, and their domain only continues to expand. Spiking algorithms have been used to implement low-power artificial intelligence algorithms across the medical, robotics, and mixed-reality domains, among many other fields. Given their power efficiency, initial commercial products often target edge computing applications, close to where the data are recorded.
SNNs have pervaded many streams of deep learning that call for low-power, resource-constrained, and often portable operation. The utility of SNNs even extends to modeling neural dynamics, from individual neurons up to higher level neural systems.
In biosignal monitoring, nerve implants for brain–machine or biosignal interfaces have to preprocess information locally at minimum power and lack the bandwidth to transmit data for cloud computation. Work in this direction using SNNs includes on-chip spike sorting [35], [36], biosignal anomaly detection [37], [38], [39], [40], and brain–machine interfaces [41], [42], [43]. Beyond biomedical intervention, SNN models are also used in robotics, both to make robots more human-like and to drive down their cost of operation [44], [45], [46]. Unmanned aerial vehicles must also operate on tight power budgets to extract as much value as possible from lightweight batteries and have benefited from using neuromorphic processors [47]. Audio signals can be processed with submilliwatt power consumption and low latency on neuromorphic hardware, as SNNs provide an efficient computational mechanism for temporal signal processing [48].
A plethora of efficient computer vision applications using SNNs are reviewed in [49]. SNNs are equally suited to tracking objects such as satellites in the sky for space situational awareness [50], [51] and have been researched to promote sustainable uses of artificial intelligence, such as monitoring material strain in smart buildings [52] and forecasting wind power in remote areas that face power delivery challenges [53]. At the 2018–19 Telluride Neuromorphic and Cognition Workshops, a neuromorphic robot was even built to play foosball [54]!
Beyond neuromorphic applications, SNNs are also used to test theories about how natural intelligence may arise, from the higher level learning rules of the brain [55] and how memories are formed [56] down to the lower level dynamics at the neuronal and synaptic layers [57].
C. Overview of This Article
The brain’s neural circuitry is a physical manifestation of its neural algorithm; understanding one will likely lead to an understanding of the other. This article will home in on one particular aspect of neural models: those that are compatible with modern deep learning. Fig. 2 provides an illustrated overview of the structure of this article, and we will start from the ground up.
In Section II, we will rationalize the commonly accepted advantages of using spikes and derive a spiking neuron model from basic principles.
These spikes are assigned meaning in Section III by exploring various spike encoding strategies, how they impact the learning process, and how objective and regularization functions are used to sway the spiking patterns of an SNN.
In Section IV, the challenges of training SNNs using gradient-based optimization are explored, and several solutions are derived. These include defining derivatives at spike times and using approximations of the gradient.
In doing so, a subtle link between the backpropagation algorithm and the spike timing-dependent plasticity (STDP) learning rule emerges and is used in the subsequent section to derive online variants of backprop, which move toward biologically plausible learning mechanisms.
With that being said, it is time to dive into how we might combine the potential efficiency of SNNs with the high performance of ANNs.
From Artificial to Spiking Neural Networks
The neural code refers to how the brain represents information, and while many theories exist, the code is yet to be cracked. There are several persistent themes across these theories, which can be distilled down to “the three S’s”: spikes, sparsity, and static suppression. These traits are a good starting point to show why the neural code might improve the efficiency of ANNs. Our first observation is given as follows.
Spikes (biological neurons interact via spikes): Neurons primarily process and communicate with action potentials or “spikes,” which are electrical impulses of approximately 100 mV in amplitude. In most neurons, the occurrence of an action potential is far more important than the subtle variations of the action potential [58]. Many computational models of neurons simplify the representation of a spike to a discrete, single-bit, all-or-nothing event [see Fig. 3(a)–(c)]. Communicating high-precision activations between layers and routing them around and between chips are expensive undertakings. Multiplying a high-precision activation with a high-precision weight requires converting the values into integer representations and decomposing the multiplication into multiple additions, which introduces a carry propagation delay. On the other hand, a spike-based approach only requires a weight to be multiplied by a spike (“1”). This trades the cumbersome multiplication process for a simple memory readout of the weight value.
Despite the activation being constrained to a single bit, spiking networks are vastly different from binarized neural networks. What actually matters is the timing of the spike. Time is not a binarized quantity and can be implemented using clock signals that are already distributed across a digital circuit. After all, why not use what is already available?
Sparsity: Biological neurons spend most of their time at rest, silencing a majority of activations to zero at any given time.
Sparse tensors are cheap to store. The space that a simple data structure requires to store a matrix grows with the number of entries to store. In contrast, a data structure to store a sparse matrix only consumes memory in proportion to the number of nonzero elements. Take the following list as an example:
\begin{equation*} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1].\end{equation*}
Since most of the entries are zero, writing out only the nonzero elements is a far more efficient representation, as would occur in run-length encoding (indexing from zero):
\begin{equation*} ``\textit {1 at position 10; 1 at position 20.}''\end{equation*}
The sparser the list, the more space can be saved. For example, Fig. 3(c) shows how a single action potential can be represented by a sparsely populated vector.
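To make the storage saving concrete, here is a minimal sketch in Python of reducing the dense spike vector above to its nonzero indices; the variable names are purely illustrative:

```python
import numpy as np

# Dense binary spike train from the example above: mostly zeros.
dense = np.zeros(21, dtype=np.uint8)
dense[[10, 20]] = 1

# Sparse representation: store only the positions of the nonzero entries,
# i.e., "1 at position 10; 1 at position 20."
sparse = np.flatnonzero(dense)  # -> array([10, 20])

# 21 stored entries collapse to 2 indices; sparser vectors save even more.
print(dense.size, "entries ->", sparse.size, "indices")
```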
Static suppression (a.k.a., event-driven processing): The sensory system is more responsive to changes than to static input.
The sensory periphery features several mechanisms that promote neuron excitability when subject to dynamic, changing stimuli, while suppressing its response to static, unchanging information. In retinal ganglion cells and the primary visual cortex, the spatiotemporal receptive fields of neurons promote excitable responses to regions of spatial contrast (or edges) over regions of spatial invariance [59]. Analogous mechanisms in early auditory processing include spectrotemporal receptive fields that cause neurons to respond more favorably to changing frequencies in sound over static frequencies [60]. These processes occur on short timescales (milliseconds), while perceptual adaptation has also been observed on longer timescales (seconds) [61], [62], [63], causing neurons to become less responsive to prolonged exposure to fixed stimuli.
A real-world engineering example of event-driven processing is the dynamic vision sensor (DVS), or the “silicon retina,” which is a camera that reports changes in brightness and stays silent otherwise [see Fig. 3(d) and (e)] [64], [65], [66], [67], [68]. This also means that each pixel activates independently of all other pixels, as opposed to waiting for a global shutter to produce a still frame. The reduction of active pixels leads to huge energy savings compared to conventional CMOS image sensors. This mix of low-power and asynchronous pixels allows for fast clock speeds, giving commercially available DVS cameras a microsecond temporal resolution without breaking a sweat [69]. The difference between a conventional frame-based camera and an event-based camera is illustrated in Fig. 4.
Neurons communicate via spikes. (a) Diagram of a neuron. (b) Measuring an action potential propagated along the axon of a neuron. Fluctuating subthreshold voltages are present in the soma but become severely attenuated over distances beyond 1 mm [58]. Only the action potential is detectable along the axon. (c) Neuron’s spike is approximated with a binary representation. (d) Event-driven processing. Only dynamic segments of a scene are passed to the output (“1”), while static regions are suppressed (“0”). (e) Active pixel sensor and DVS.
Functional difference between a conventional frame-based camera (top) and an event-based camera/silicon retina (bottom). The former records the scene as a sequence of images at a fixed frame rate. It operates independently of activity in the scene and can result in motion blur due to the global shutter. The silicon retina’s output is directly driven by visual activity in the scene, as every pixel reacts to a change in illuminance.
A. Spiking Neurons
ANNs and SNNs can model the same types of network topologies, but SNNs swap the artificial neuron model for a spiking neuron model instead (see Fig. 5). Much like the artificial neuron model [70], spiking neurons operate on a weighted sum of inputs. Rather than passing the result through a sigmoid or rectified linear unit (ReLU) nonlinearity, the weighted sum contributes to the membrane potential $U(t)$ of the neuron.
Leaky IF neuron model. (a) Insulating bilipid membrane separates the intracellular and extracellular medium. Gated ion channels allow charge carriers, such as Na+, to diffuse through the membrane. (b) Capacitive membrane and resistive ion channels form an RC circuit. When the membrane potential exceeds a threshold $\theta$, a spike is generated.
These dynamics were quantified back in 1907 [75]. Lapicque stimulated the nerve fiber of a frog leg using a hacked-together current source and observed how long it took the frog leg to twitch based on the amplitude and duration of the driving current $I_{\textrm{in}}(t)$. Treating the membrane as the RC circuit described above yields the passive membrane dynamics
\begin{equation*} \tau \frac {dU(t)}{dt} = -U(t) + I_{\textrm {in}}(t)R \tag{1}\end{equation*}
where $\tau = RC$ is the time constant of the membrane.
For a constant input current, the general solution of (1) is
\begin{equation*} U(t) = I_{\textrm {in}}R + [U_{0} - I_{\textrm {in}}R]e^{-\frac {t}{\tau }} \tag{2}\end{equation*}
where $U_{0}$ is the initial membrane potential at $t = 0$.
To make this compatible with a discrete-time simulation, the forward Euler method can be used to approximate the solution
\begin{equation*} U[t]=\beta U[t-1] + (1-\beta)I_{\textrm {in}}[t] \tag{3}\end{equation*}
where $\beta$ is the decay rate of the membrane potential.
In deep learning, the weighting factor of an input is typically a learnable parameter. Relaxing the physically viable assumptions made thus far, the coefficient of the input current in (3), $(1-\beta)$, is subsumed into a learnable weight $W$, and the input current is replaced by the weighted input $WX[t]$. Adding a reset term, which activates in the step after an output spike is triggered, yields
\begin{equation*} U[t] = \underbrace {\beta U[t-1]}_{\textrm {decay}} + \underbrace {WX[t]}_{\textrm {input}} - \underbrace {S_{\textrm {out}}[t-1]\theta }_{\textrm {reset}}. \tag{4}\end{equation*}
An output spike $S_{\textrm{out}}[t]$ is generated whenever the membrane potential exceeds the threshold $\theta$
\begin{align*} S_{\textrm {out}}[t] = \begin{cases} \displaystyle 1, & \textrm {if $U[t] > \theta $} \\ \displaystyle 0, & \textrm {otherwise.} \end{cases} \tag{5}\end{align*}
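The discrete-time dynamics of (4) and (5) reduce to a few lines of code. A minimal sketch in Python, with $\beta$, $\theta$, the weight, and the input all chosen arbitrarily for illustration:

```python
import torch

beta, theta = 0.9, 1.0      # decay rate and firing threshold (illustrative values)
W = torch.tensor(0.3)       # learnable input weight
X = torch.ones(100)         # input spike train: constant input for illustration

U = torch.tensor(0.0)       # membrane potential
S_out = torch.tensor(0.0)   # output spike from the previous time step
spikes = []
for t in range(100):
    U = beta * U + W * X[t] - S_out * theta  # (4): decay + weighted input - soft reset
    S_out = (U > theta).float()              # (5): fire when U exceeds the threshold
    spikes.append(S_out.item())
```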
In your exploration of leaky IF neurons, you may come across slight variants.
The spike threshold may be applied before updating the membrane potential. This induces a one-step delay between the input signal $X$ and when it can trigger a spike.
The above derivations use a “reset-by-subtraction” (or soft reset) mechanism. An alternative, shown in Appendix A1, is a “reset-to-zero” (or hard reset) mechanism.
The factor $(1-\beta)$ from (3) may be included as a coefficient to the input term, $WX$. This will allow you to simulate a neuron model with realistic time constants but does not offer any advantages when ultimately applied to deep learning.
An extensive list of alternative neuron types is detailed in Section II-B, along with a brief overview of their use cases.
A graphical depiction of the LIF neuron is provided in Fig. 6. The recurrent neuron in (a) is “unrolled” across time steps in (b), where the reset mechanism is included via the $-S_{\textrm{out}}\theta$ term from (4).
Computational steps in solving the leaky IF neuron model. (a) Recurrent representation of a spiking neuron. Hidden state decay is referred to as “implicit recurrence,” and external feedback from the spike is “explicit recurrence.” (b) The same neuron unrolled across time steps.
B. Alternative Spiking Neuron Models
The leaky IF neuron is one of the many spiking neuron models. Some other models that you might encounter are listed as follows.
IF: The leakage mechanism is removed; $\beta = 1$ in (4).
Current-based: Often referred to as CuBa neuron models, these incorporate synaptic conductance variation into leaky IF neurons. If the default LIF neuron is a first-order low-pass filter, then CuBa neurons are a second-order low-pass filter. The input spike train undergoes two rounds of “smoothing,” which means that the membrane potential has a finite rise time rather than experiencing discontinuous jumps in response to incoming spikes [78], [79], [80]. A depiction of such a neuron with a finite rise time of membrane potential is shown in Fig. 5(d).
Recurrent neurons: The output spikes of a neuron are routed back to the input, labeled in Fig. 6(a) with explicit recurrence. Rather than an alternative model, recurrence is a topology that can be applied to any other neuron and can be implemented in different ways: one-to-one recurrence, where each neuron routes its own spike to itself, or all-to-all recurrence, where the output spikes of a full layer are weighted and summed (e.g., via a dense or convolutional layer), before being fed back to the full layer [81].
Kernel-based models: Also known as the spike-response model, where a predefined kernel (such as the “alpha function”: see Appendix C1) is convolved with input spikes [72], [73], [74]. Having the option to define the kernel to be any shape offers significant flexibility.
Deep learning inspired spiking neurons: Rather than drawing upon neuroscience, it is just as possible to start with primitives from deep learning and apply spiking thresholds. This helps with extending the short-term capacity of basic recurrent neurons. A couple of examples include spiking LSTMs [39] and Legendre memory units [82]. More recently, transformers have been used to further improve long-range memory dependencies in data. In a similar manner, SpikeGPT approximated self-attention into a recurrent model, providing the first demonstration of natural language generation with SNNs [83].
Higher complexity neuroscience-inspired models: A large variety of more detailed neuron models are out there. These account for biophysical realism and/or morphological details that are not represented in simple leaky integrators. The most renowned models include the Hodgkin–Huxley model [77] and the Izhikevich (or resonator) model [84], which can reproduce electrophysiological results with better accuracy.
The main takeaway is given as follows: use the neuron model that suits your task. Power-efficient deep learning will call for LIF models. Improving performance may call for using recurrent SNNs. Driving performance even further (often at the expense of efficiency) may demand methods derived from deep learning, such as spiking LSTMs and recurrent spiking transformers [83]. Or perhaps, deep learning is not your goal. If you are aiming to construct a brain model or are tasked with an exploration of linking low-level dynamics (ionic, conductance-driven, or otherwise) with higher order brain function, then perhaps, more detailed, biophysically accurate models will be your friend.
Having formulated a spiking neuron in a discrete-time, recursive form, we can now “borrow” the developments in training recurrent neural networks (RNNs) and sequence-based models. This recursion is illustrated using an “implicit” recurrent connection for the decay of the membrane potential and is distinguished from “explicit” recurrence, where the output spikes $S_{\textrm{out}}$ are fed back to the input.
While there are plenty more physiologically accurate neuron models [77], the leaky IF model is the most prevalent in gradient-based learning due to its computational efficiency and ease of training. Before moving onto training SNNs in Section IV, let us gain some insight into what spikes actually mean and how they might represent information in Section III.
Neural Code
Light is what we see when the retina converts photons into spikes. Odors are what we smell when the nose processes volatilized molecules into spikes. Tactile perceptions are what we feel when our nerve endings turn pressure into spikes. The brain trades in the global currency of the spike. If all spikes are treated identically, then how do they carry meaning? With respect to spike encoding, there are two parts of a neural network that must be treated separately (see Fig. 7).
Input encoding: Conversion of input data into spikes, which are then passed to a neural network.
Output decoding: Training the output of a network to spike in a way that is meaningful and informative.
Input data to an SNN may be converted into a firing rate, a firing time, or the data can be delta modulated. Alternatively, the input to the network can also be passed in without conversion, which experimentally represents a direct or variable current source applied to the input layer of neurons. The network itself may be trained to enable the correct class to have the highest firing rate or to fire first, among many other encoding strategies.
A. Input Encoding
Input data to an SNN do not necessarily have to be encoded into spikes. It is acceptable to pass continuous values as input, much like how the perception of light starts with a continuous stream of photons impinging upon our photoreceptor cells.
Static data, such as an image, can be treated as a direct current (dc) input with the same features passed to the input layer of the SNN at every time step. However, this does not exploit the way SNNs extract meaning from temporal data. In general, three encoding mechanisms have been popularized with respect to input data.
Rate coding converts input intensity into a firing rate or spike count.
Latency (or temporal) coding converts input intensity to a spike time.
Delta modulation converts a temporal change of input intensity into spikes and otherwise remains silent.
1) Rate-Coded Inputs:
How does the sensory periphery encode information about the world into spikes? When bright light is incident upon our photoreceptor cells, the retina triggers a spike train to the visual cortex. Hubel and Wiesel’s Nobel prize-winning research on visual processing indicates that a brighter input or a favorable orientation of light corresponds to a higher firing rate [59]. As a rudimentary example, a bright pixel is encoded into a high-frequency firing rate, whereas a dark pixel would result in low-frequency firing. Measuring the firing rate of a neuron can become quite nuanced. The simplest approach is to apply an input stimulus to a neuron, count up the total number of action potentials it generates, and divide that by the duration of the trial. Although straightforward, the problem here is that the neuronal dynamics vary across time. There is no guarantee that the firing rate at the start of the trial is anything near the rate at the end.
An alternative method counts the spikes over a very short time interval $\Delta t$ and averages this count over multiple trials or over a population of neurons, smoothing out the variability of any single neuron’s response.
This representation is quite convenient for sequential neural networks. Each discrete-time step in an RNN can be thought of as lasting for a brief duration $\Delta t$, during which a spike either occurs or does not.
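A common way to realize rate coding in practice is to treat each time step as a Bernoulli trial whose success probability is the normalized input intensity. A minimal sketch (the image size and step count are placeholder assumptions):

```python
import torch

num_steps = 100
img = torch.rand(28, 28)  # normalized pixel intensities in [0, 1]

# One independent Bernoulli trial per time step: P(spike) = pixel intensity.
spike_train = torch.bernoulli(img.expand(num_steps, 28, 28))

# The empirical firing rate converges to the intensity as num_steps grows.
print((spike_train.mean(dim=0) - img).abs().max())
```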
2) Latency-Coded Inputs:
A latency, or temporal, code is concerned with the timing of a spike. The total number of spikes is no longer consequential. Rather, when the spike occurs is what matters. For example, a time-to-first-spike mechanism encodes a bright pixel as an early spike, whereas a dark input will spike last or simply never spike at all. Compared to the rate code, latency-encoding mechanisms assign much more meaning to each individual spike.
Neurons can respond to sensory stimuli over an enormous dynamic range. In the retina, neurons can detect anything from individual photons to an influx of millions of photons [96], [97], [98], [99], [100]. To handle such widely varying stimuli, sensory transduction systems likely compress stimulus intensity with a logarithmic dependency. For this reason, a logarithmic relation between spike times and input feature intensity is ubiquitous in the literature (see Appendix B2) [101], [102].
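A sketch of such a logarithmic time-to-first-spike encoding, where brighter (larger) inputs fire earlier; the clamping and normalization choices here are assumptions for illustration, not a fixed convention:

```python
import torch

def latency_encode(x, num_steps=100, eps=1e-7):
    """Map intensities in (0, 1] to spike times: bright -> early, dark -> late."""
    t = -torch.log(x.clamp(min=eps))             # logarithmic intensity-to-time map
    t = (t / t.max() * (num_steps - 1)).long()   # normalize into the simulation window
    spikes = torch.zeros(num_steps, *x.shape)
    spikes.scatter_(0, t.unsqueeze(0), 1.0)      # one spike per input feature
    return spikes

spike_train = latency_encode(torch.rand(28, 28))  # brightest pixels spike first
```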
Although sensory pathways appear to transmit rate-coded spike trains to our brains, it is likely that temporal codes dominate the actual processing that goes on within the brain. More on this in Section III-B4.
3) Delta Modulated Inputs:
Delta modulation is based on the notion that neurons thrive on change, which underpins the operation of the silicon retina camera that only generates an input when there has been a sufficient change of input intensity over time. If there is no change in your field of view, then your photoreceptor cells are much less prone to firing. Computationally, this would take a time-series input and feed a thresholded matrix difference to the network. While the precise implementation may vary, a common approach requires the difference to be both positive and greater than some predefined threshold for a spike to be generated. This encoding technique is also referred to as “threshold crossing.” Alternatively, changes in intensity can be tracked over multiple time steps, and other approaches account for negative changes. For an illustration, see Fig. 4, where the “background” is not captured over time. Only the moving blocks are recorded, as it is those pixels that are changing.
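A minimal sketch of this thresholded-difference scheme on a one-dimensional time series, taking only positive changes as described above; the threshold value is an arbitrary assumption:

```python
import torch

def delta_modulate(x, threshold=0.05):
    """Spike wherever the signal rises by more than `threshold` between steps."""
    diff = x[1:] - x[:-1]               # temporal difference between adjacent steps
    return (diff > threshold).float()   # spike only on sufficient positive change

signal = torch.sin(torch.linspace(0, 6.28, 100))
spikes = delta_modulate(signal)         # fires only on sufficiently rising segments
```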
The previous techniques tend to “convert” data into spikes. However, it is more efficient to natively capture data in “preencoded,” spiking form. Each pixel in a DVS camera and channel in a silicon cochlear uses delta modulation to record changes in the visual or audio scene. Some examples of neuromorphic benchmark datasets are described in Table 1. A comprehensive series of neuromorphic-relevant datasets are accounted for in NeuroBench [103]. Many of these are readily available for use with open-source libraries, such as Tonic [104].
B. Output Decoding
Encoding input data into spikes can be thought of as how the sensory periphery transmits signals to the brain. On the other side of the same coin, decoding these spikes provides insight into how the brain handles these encoded signals. In the context of training an SNN, the encoding mechanism does not constrain the decoding mechanism. Shifting our attention from the input of an SNN, how might we interpret the firing behavior of output neurons?
Rate coding chooses the output neuron with the highest firing rate, or spike count, as the predicted class.
Latency (or temporal) coding chooses the output neuron that fires first as the predicted class.
Population coding relies on multiple neurons per class. This is typically used in conjunction with rate coding, rank order coding, or N-of-M coding [105], [106].
1) Rate-Coded Outputs:
Consider a multiclass classification problem with one output neuron per class. Under a rate code, the network is trained such that the neuron assigned to the correct class fires with the highest spike count (or frequency) across the simulation runtime, and the prediction is read out as the class of the most active neuron.
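In code, rate-coded decoding amounts to a spike count followed by an argmax. A sketch, assuming the SNN output is a binary tensor of shape (num_steps, batch, num_classes); the random tensor is a stand-in for real network output:

```python
import torch

spk_out = torch.randint(0, 2, (100, 32, 10)).float()  # stand-in for SNN output spikes

spike_count = spk_out.sum(dim=0)       # total spikes per output neuron
predicted = spike_count.argmax(dim=1)  # predicted class = most active neuron
```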
2) Latency-Coded Outputs:
There are numerous ways a neuron might encode data in the timing of a spike. As in the case of latency-coded inputs, it could be that a neuron representing the correct class fires first. This addresses the energy burden that arises from the multiple spikes needed in rate codes. In hardware, the need for fewer spikes reduces the frequency of memory accesses, which is another computational burden in deep learning accelerators.
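Decoding by first spike is similarly compact. A sketch under the same output-shape assumption as before, where neurons that never fire are assigned the final time step so that they are never selected:

```python
import torch

spk_out = torch.randint(0, 2, (100, 32, 10)).float()  # stand-in for SNN output spikes

num_steps = spk_out.shape[0]
t = torch.arange(num_steps).view(-1, 1, 1).float()
# Silent neurons receive time = num_steps, so they can never fire "first."
times = torch.where(spk_out > 0, t, torch.full_like(spk_out, num_steps))
first_spike = times.min(dim=0).values   # earliest spike time per output neuron
predicted = first_spike.argmin(dim=1)   # predicted class = neuron that fired first
```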
Biologically, does it make sense for neurons to operate on a time to first spike principle? How might we define “first” if our brains are not constantly resetting to some initial, default state? This is quite easy to conceptually address. The idea of a latency or temporal code is motivated by our response to a sudden input stimulus. For example, when viewing a static, unchanging visual scene, the retina undergoes rapid, yet subtle, saccadic motion. The scene projected onto the retina changes every few hundreds of milliseconds. It could very well be the case that the first spike must occur with respect to the reference signal generated by this saccade.
3) Population-Coded Outputs:
We know that the reaction time of a human is roughly in the ballpark of 250 ms. If the average firing rate of a neuron in the human brain is on the order of 10 Hz, then we can only process about two to three spikes within our reaction time. However, this often cited 10-Hz assumption should be treated as an upper limit. When experimentalists are hunting for neurons using a single microelectrode, low-rate neurons might be bypassed as they do not generate enough spikes for data analysis or are simply missed altogether. As a result, high-rate neurons may have become significantly overrepresented in the literature. This is supported by data collected from chronic implants in the macaque hippocampus, which routinely yields neurons with background firing rates below 0.1 Hz [114]. One would have to wait at least 10 s before observing a single spike!
This can be addressed by using a distributed representation of information across a population of neurons: if a single neuron is limited in its spike count within a brief time window, then just use more neurons [85]. The spikes from a subgroup of neurons can be pooled together to make more rapid decisions. Interestingly, population codes trade sequential processing for parallelism, which maps well onto GPUs when training SNNs [81].
4) Rate Versus Latency Code:
Whether neurons encode information as a rate, as a latency, or as something wholly different is a topic of much controversy. We do not seek to crack the neural code here but instead aim to provide intuition on when SNNs might benefit from one code over the other.
Advantages of Rate Codes:
Error tolerance: If a neuron fails to fire, there are ideally many more spikes to reduce the burden of this error.
More spiking promotes more learning: Additional spikes provide a stronger gradient signal for learning via error backpropagation. As will be described in Section IV, the absence of spiking can impede learning convergence (more commonly referred to as the “dead neuron problem”).
Advantages of Latency Codes:
Power consumption: Generating and communicating fewer spikes means less dynamic power dissipation in tailored hardware. It also reduces memory access frequency due to sparsity, as a vector–matrix product for an all-zero input vector returns a zero output.
Speed: The reaction time of a human is roughly in the ballpark of 250 ms. If the average firing rate of a neuron in the human brain is on the order of 10 Hz (which is likely an overestimation [114]), then one can only process about two to three spikes in this reaction time window. In contrast, latency codes rely on a single spike to represent information. This issue with rate codes may be addressed by coupling it with a population code: if a single neuron is limited in its spike count within a brief time window, then just use more neurons [85]. This comes at the expense of further exacerbating the power consumption problem of rate codes.
The power consumption benefit of latency codes is also supported by observations in biology, where nature optimizes for efficiency. Olshausen and Field’s [114] work in “What is the other 85% of V1 doing?” methodically demonstrates that rate-coding can only explain, at most, the activity of 15% of neurons in the primary visual cortex (V1). If our neurons indiscriminately defaulted to a rate code, this would consume an order of magnitude more energy than a temporal code. The mean firing rate of our cortical neurons must necessarily be rather low, which is supported by temporal codes.
Lesser explored encoding mechanisms in gradient-based SNNs include using spikes to represent a prediction or reconstruction error [115]. The brain may be perceived as an anticipatory machine that takes action based on its predictions. When these predictions do not match reality, spikes are triggered to update the system.
Some assert that the true code must lie between rate and temporal codes [116], while others argue that the two may coexist and only differ based on the timescale of observation: rates are observed for long timescales and latency for short timescales [117]. Some reject rate codes entirely [118]. This is one of those instances where a deep learning practitioner might be less concerned with what the brain does and prefers to focus on what is most useful.
C. Objective Functions
While it is unlikely that our brains use something as explicit as a cross-entropy loss function, it is fair to say that humans and animals have baseline objectives [121]. Biological variables, such as dopamine release, have been meaningfully related to objective functions from reinforcement learning [122]. Predictive coding models often aim to minimize the information entropy of sensory encodings such that the brain can actively predict incoming signals and inhibit what it already expects [123]. The multifaceted nature of the brain’s function likely calls for the existence of multiple objectives [124]. How the brain can be optimized using these objectives remains a mystery, though we might gain insight from multiobjective optimization [125].
A variety of loss functions can be used to encourage the output layer of a network to fire as a rate or temporal code. The optimal choice is largely unsettled and tends to be a function of the network hyperparameters and the complexity of the task at hand. All objective functions described in the following have successfully trained networks to competitive results on a variety of datasets though they come with their own tradeoffs.
1) Spike Rate Objective Functions:
A summary of approaches commonly adopted in supervised learning classification tasks with SNNs to promote the correct neuron class to fire with the highest frequency is provided in Table 2. In general, either the cross-entropy loss or the mean square error (MSE) is applied to the spike count or the membrane potential of the output layer of neurons.
With a sufficient number of time steps, passing the spike count to the objective function is more widely adopted, as it operates directly on spikes. The membrane potential acts only as a proxy for increasing the spike count, and it is not considered an observable variable, so operating on it may partially offset the computational benefits of using spikes.
Cross-entropy approaches aim to suppress the spikes from incorrect classes, which may drive weights in a network to zero. This could cause neurons to go quiet in the absence of additional regularization. By using the mean square spike rate, which specifies a target number of spikes for each class, output neurons can be placed on the cusp of firing. Therefore, the network may adapt to changing inputs with a faster response time than neurons that have their firing completely suppressed.
In networks that simulate a constrained number of time steps, a small change in weights is unlikely to cause a change in the spike count of the output. It might be preferable to apply the loss function directly to a more “continuous” signal, such as the membrane potential instead. This comes at the expense of operating on a full precision hidden state, rather than on spikes. Alternatively, using population coding can distribute the cost burden over multiple neurons to increase the probability that a weight update will alter the spiking behavior of the output layer. It also increases the number of pathways through which error backpropagation may take place and improve the chance that a weight update will generate a change in the global loss.
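As an illustration of the spike count approach, a cross-entropy loss can treat the summed spikes per class as logits. A minimal sketch with placeholder tensors (a random stand-in replaces recorded output spikes so the example is self-contained):

```python
import torch
import torch.nn.functional as F

spk_out = torch.rand(100, 32, 10, requires_grad=True)  # stand-in for output spikes
targets = torch.randint(0, 10, (32,))

logits = spk_out.sum(dim=0)              # spike count per output neuron
loss = F.cross_entropy(logits, targets)  # push the correct class to fire most
loss.backward()
```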
2) Spike Time Objectives:
Loss functions that implement spike timing objectives are less commonly used than rate-coded objectives. Two possible reasons may explain why: 1) error rates are typically perceived to be the most important metric in deep learning literature, and rate codes are more tolerant to noise and 2) temporal codes are marginally more difficult to implement. A summary of approaches is provided in Table 3.
The use cases of these objectives are analogous to the spike rate objectives. A subtle challenge with using spike times is that the default implementation assumes each neuron spikes at least once, which is not necessarily the case. This can be handled by forcing a spike at the final time step in the event when a neuron does not fire [120].
Several state-of-the-art models wholly abandon spiking neurons at the output and train their models using a “read-out layer.” This often consists of an IF layer with infinitely high thresholds (i.e., they will never fire) or with typical artificial neurons that use standard activation functions (ReLU, sigmoid, softmax, and so on). While this often improves accuracy, this may not qualify as a fully spiking network. Does this actually matter? If one can still achieve power efficiency, then engineers will be happy and that is often all that matters.
D. Learning Rules
1) Spatial and Temporal Credit Assignment:
Once a loss has been determined, it must somehow be used to update the network parameters with the hope that the network will iteratively improve at the trained task. Each weight takes some blame for its contribution to the total loss, and this is known as “credit assignment.” This can be split into the spatial and temporal credit assignment problems. Spatial credit assignment aims to find the spatial location of the weight contributing to the error, while the temporal credit assignment problem aims to find the time at which the weight contributes to the error. Backpropagation has proven to be an extremely robust way to address credit assignment, but the brain is far more constrained in developing solutions to these challenges.
Backpropagation solves spatial credit assignment by applying a distinct backward pass after a forward pass during the learning process [126]. The backward pass mirrors the forward pass, such that the computational pathway of the forward pass must be recalled. In contrast, action potential propagation along an axon is considered to be unidirectional, which may reject the plausibility of backprop taking place in the brain. Spatial credit assignment is not only concerned with calculating the weight’s contribution to an error but also assigning the error back to the weight. Even if the brain could somehow calculate the gradient (or an approximation), a major challenge would be projecting that gradient back to the synapse and knowing which gradient belongs to which synapse.
This constraint of neurons acting as directed edges is increasingly being relaxed, which could be a mechanism by which errors are assigned to synapses [127]. Numerous bidirectional, nonlinear phenomena occur within individual neurons, which may contribute toward helping errors find their way to the right synapse. For example, feedback connections are observed in most places where there are feedforward connections [128].
2) Biologically Motivated Learning Rules:
With a plethora of neuronal dynamics that might embed variants of backpropagation, what options are there for modifying backprop to relax some of the challenges associated with biologically plausible spatial credit assignment? In general, the more broadly adopted approaches rely on either trading parts of the gradient calculation for stochasticity or otherwise swapping a global error signal for localized errors (see Fig. 8). Conjuring alternative methods to credit assignment that a real-time machine such as the brain can implement is not only useful for developing insight into biological learning [129] but also reduces the cost of data communication in hardware [130]. For example, using local errors can reduce the length a signal must travel across a chip. Stochastic approaches can trade computation with naturally arising circuit noise [131], [132], [133]. A brief summary of several common approaches to mitigating the spatial credit assignment problem is provided in the following [134].
Perturbation learning: A random perturbation of network weights is used to measure the change in error. If the error is reduced, the change is accepted. Otherwise, it is rejected [135], [136], [137]. The difficulty of learning scales with the number of weights, where the effect of a single weight change is dominated by the noise from all other weight changes. In practice, it may take a huge number of trials to average this noise away [55].
Random feedback: Backpropagation requires sequentially transporting the error signal through multiple layers, scaled by the forward weights of each layer. Random feedback replaces the forward weight matrices with random matrices, reducing the dependence of each weight update on distributed components of the network. While this does not fully solve the spatial credit assignment problem, it quells the weight transport problem [138], which is specifically concerned with a weight update in one layer depending upon the weights of far-away layers. Forward- and backward-propagating data are scaled by symmetric weight matrices, a mechanism that is absent in the brain. Random feedback has shown similar performance to backpropagation on simple networks and tasks, which gives hope that a precise gradient may not be necessary for good performance [138]. Random feedback has struggled with more complex tasks though variants have been proposed that reduce the gap [139], [140], [141], [142]. Nonetheless, the mere fact that such a core piece of the backpropagation algorithm can be replaced with random noise and yet somehow still work is a marvel. It is indicative that we still have much left to understand about gradient backpropagation.
Local losses: It could be that the six layers of the cortex are each supplied with their own cost function, rather than a global signal that governs a unified goal for the brain [124]. Early visual regions may try to minimize the prediction error in constituent visual features, such as orientations, while higher areas use cost functions that target abstractions and concepts. For example, a baby learns how to interpret receptive fields before consolidating them into facial recognition. In deep learning, greedy layerwise training assigns a cost function to each layer independently [143], so that only a shallow network is ever trained at any one time. Target propagation is similarly motivated by assigning a reconstruction criterion to each layer [115]. Such approaches exploit the fact that training a shallow network is easier than training a deep one and aim to address spatial credit assignment by ensuring that the error signal does not need to propagate too far [127], [144].
Forward–forward error propagation: The backward pass of a model is replaced with a second forward pass where the input signal is altered based on error or some related metric. Initially proposed by Dellaferrera and Kreiman [145], Hinton’s [146] forward–forward learning algorithm generated more traction soon after. These have not been ported to SNNs at the time of writing, though someone is bound to take up the mantle soon.
A variety of learning rules can be used to train a network. (a) Objective functions. Gradient backpropagation: an unbiased gradient estimator of the loss is derived with respect to each weight. Perturbation learning: weights are randomly perturbed, and changes that reduce the loss are retained. (b) Activity regularization, applied at either the neuron or population level.
These approaches to learning are illustrated in Fig. 8(a). While they are described in the context of supervised learning, many theories of learning place emphasis on self-organization and unsupervised approaches. Hebbian plasticity is a prominent example [147]. However, an intersection may exist in self-supervised learning, where the target of the network is a direct function of the data itself. Some types of neurons may be representative of facts, features, or concepts, only firing when exposed to the right type of stimuli. Other neurons may fire with the purpose of reducing a reconstruction error [148], [149]. By accounting for spatial and temporal correlations that naturally exist around us, such neurons may fire with the intent to predict what happens next. A more rigorous treatment of biological plausibility in objective functions can be found in [124].
E. Activity Regularization
A huge motivator behind using SNNs comes from the power efficiency when processed on appropriately tailored hardware. This benefit is not only from single-bit interlayer communication via spikes but also the sparse occurrence of spikes. Some of the loss functions above, in particular those that promote rate codes, will indiscriminately increase the membrane potential and/or firing frequency without an upper bound, if left unchecked. Regularization of the loss can be used to penalize excessive spiking (or alternatively, penalize insufficient spiking, which is great for discouraging dead neurons). Conventionally, regularization is used to constrain the solution space of loss minimization, thus leading to a reduction in variance at the cost of increasing bias. Care must be taken, as too much activity regularization can lead to excessively high bias. Activity regularization can be applied to alter the behavior of individual neurons or populations of neurons, as depicted in Fig. 8(b).
Population level regularization: This is useful when the metric to optimize is a function of aggregate behavior. For example, the metric may be power efficiency, which is strongly linked to the total number of spikes from an entire network. L1-regularization can be applied to the total number of spikes emitted at the output layer to penalize excessive firing, which encourages sparse activity at the output [150]. Alternatively, for more fine-grain control over the network, an upper activity threshold can be applied. If the total number of spikes for all neurons in a layer exceeds the threshold, only then does the regularization penalty kick in [110], [113] (see Appendix B11).
Neuron level regularization: If neurons completely cease to fire, then learning may become significantly more difficult. Regularization may also be applied at the individual neuron level by adding a penalty for each neuron. A lower activity threshold specifies the lower permissible limit of firing for each neuron before the regularization penalty is applied (see Appendix B12).
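Both levels can be expressed as simple penalty terms added to the task loss. A sketch with assumed thresholds and weightings (the exact forms are given in Appendixes B11 and B12; the values here are illustrative):

```python
import torch

spk_rec = torch.rand(100, 32, 256).round()  # (num_steps, batch, hidden) spike record

# Population level: penalize layerwise spike counts above an upper threshold.
upper_thresh = 500.0                                     # illustrative value
spikes_per_sample = spk_rec.sum(dim=(0, 2))              # total spikes per sample
pop_penalty = torch.relu(spikes_per_sample - upper_thresh).mean()

# Neuron level: penalize neurons whose firing falls below a lower threshold.
lower_thresh = 1.0                                       # illustrative value
count_per_neuron = spk_rec.sum(dim=0).mean(dim=0)        # average count per neuron
neuron_penalty = torch.relu(lower_thresh - count_per_neuron).sum()

loss_reg = 1e-3 * pop_penalty + 1e-3 * neuron_penalty    # added to the task loss
```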
Recent experiments have shown that rate-coded networks (at the output) are robust to sparsity-promoting regularization terms [110], [111], [113]. However, networks that rely on time-to-first-spike schemes have had less success, which is unsurprising given that temporal outputs are already sparse.
Encouraging each neuron to have a baseline spike count helps with the backpropagation of errors through pathways that would otherwise be inactive. Together, the upper and lower limit regularization terms can be used to find the sweet spot of firing activity at each layer. As explained in detail in [151], the variance of activations should be as close as possible to “1” to avoid vanishing and exploding gradients. While modern deep learning practices rely on appropriate parameter initialization to achieve this, these approaches were not designed for nondifferentiable activation functions, such as spikes. By monitoring and appropriately compensating for neuron activity, this may turn out to be a key ingredient to successfully training deep SNNs.
Training Spiking Neural Networks
The rich temporal dynamics of SNNs give rise to a variety of ways in which a neuron’s firing pattern can be interpreted. Naturally, this means that there are several methods for training SNNs. They can generally be classified into the following methods.
Shadow training: A nonspiking ANN is trained and converted into an SNN by interpreting the activations as a firing rate or spike time.
Backpropagation using spikes: The SNN is natively trained using error backpropagation, typically through time as is done with sequential models.
Local learning rules: Weight updates are a function of signals that are spatially and temporally local to the weight, rather than from a global signal as in error backpropagation.
Each approach has a time and place where it outshines the others. We will focus on approaches that apply backprop directly to an SNN, but useful insights can be attained by exploring shadow training and various local learning rules.
The goal of the backpropagation algorithm is loss minimization. To achieve this, the gradient of the loss is computed with respect to each learnable parameter by applying the chain rule from the final layer back to each weight [152], [153], [154]. The gradient is then used to update the weights such that the error is ideally always decreased. If this gradient is “0,” there is no weight update. This has been one of the main roadblocks to training SNNs using error backpropagation due to the nondifferentiability of spikes. This is also known as the dreaded “dead neuron” problem. There is a subtle, but important, difference between “vanishing gradients” and “dead neurons,” which will be explained in Section IV-C.
To gain deeper insight into the nondifferentiability of spikes, recall the discretized solution of the membrane potential of the leaky IF neuron in (4) and the threshold condition in (5).
Addressing the dead neuron problem. Only one time step is shown, where temporal connections and subscripts from Fig. 6 have been omitted for simplicity. (a) Dead neuron problem: the analytical derivative of the spike with respect to the membrane potential is zero almost everywhere and undefined at the threshold, prohibiting learning. (b) Shadow training. (c) Derivative taken at spike times. (d) Surrogate gradients.
A. Shadow Training
The dead neuron problem can be completely circumvented by instead training on a shadow ANN and converting it into an SNN [see Fig. 9(b)]. The high-precision activation function of each neuron is converted into either a spike rate [155], [156], [157], [158], [159] or a latency code [160]. One of the most compelling reasons to use shadow training is that advances in conventional deep learning can be directly applied to SNNs. For this reason, ANN-to-SNN conversion currently takes the crown for static image classification tasks on complex datasets, such as CIFAR-10 and ImageNet. Where inference efficiency is more important than training efficiency, and if input data are not time-varying, then shadow training could be the optimal way to go.
In addition to the inefficient training process, there are several drawbacks. First, the types of tasks that are most commonly benchmarked do not make use of the temporal dynamics of SNNs, and the conversion of sequential neural networks to SNNs is an underexplored area [157]. Second, converting high-precision activations into spikes typically requires a large number of simulation time steps, which may offset the power/latency benefits initially sought from SNNs. However, what really motivates doing away with ANNs is that the conversion process is necessarily an approximation. Therefore, a shadow-trained SNN is very unlikely to reach the performance of the original network.
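One widely used ingredient of rate-based conversion pipelines is data-based weight normalization: each layer's weights are rescaled by the maximum activation observed on calibration data so that ReLU outputs map onto feasible firing rates. A simplified sketch for a stack of fully connected layers; the function and its details are illustrative, not a complete conversion pipeline:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def normalize_weights(model, calib_data):
    """Rescale Linear layers so the max ReLU activation on calib_data becomes 1."""
    x = calib_data
    prev_scale = 1.0
    for layer in model:                       # assumes nn.Sequential(Linear, ReLU, ...)
        if isinstance(layer, nn.Linear):
            a = layer(x * prev_scale)         # activation at the original scale
            scale = a.relu().max().clamp(min=1e-8).item()
            layer.weight.mul_(prev_scale / scale)
            layer.bias.mul_(1.0 / scale)
            prev_scale = scale
        x = layer(x)                          # propagate under the normalized weights

# Usage sketch:
# normalize_weights(nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
#                                 nn.Linear(128, 10)), x_calib)
```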
The issue of long time sequences can be partially addressed by using a hybrid approach: start with a shadow-trained SNN and then perform backpropagation on the converted SNN [161]. Although this degrades accuracy (as reported on CIFAR-10 and ImageNet), it is possible to reduce the required number of time steps by an order of magnitude. A more rigorous treatment of shadow training techniques and challenges can be found in [162].
B. Backpropagation Using Spike Times
An alternative method to sidestep the dead neuron problem is to instead take the derivative at spike times. In fact, this was the first proposed method for training multilayer SNNs using backpropagation [119]. The original approach in SpikeProp observes that, while spikes may be discontinuous, time is continuous. Therefore, taking the derivative of spike timing with respect to the weights achieves functional results. A thorough description is provided in Appendix C1.
Intuitively, SpikeProp calculates the gradient of the error with respect to the spike time. A small change to the weight causes a small change in the membrane potential, which in turn shifts the time at which the membrane crosses the threshold. Because the spike time varies smoothly with the weights, this derivative is well defined.
Several drawbacks arise. Once neurons become inactive, their weights become frozen. In most instances, no closed-form solutions exist to solve for the gradient if there is no spiking [167]. SpikeProp tackles this by modifying parameter initialization (i.e., increasing weights until a spike is triggered). However, since the inception of SpikeProp in 2002, the deep learning community’s understanding of weight initialization has gradually matured. We now know initialization aims to set a constant activation variance between layers, the absence of which can lead to vanishing and exploding gradients through space and time [151], [168]. Modifying weights to promote spiking may detract from this. Instead, a more effective way to overcome the lack of firing is to lower the firing thresholds of the neurons. One may consider applying activity regularization to encourage firing in hidden layers though this has degraded classification accuracy when taking the derivative at spike times. This result is unsurprising, as regularization can only be applied at the spike time rather than when the neuron is quiet.
Another challenge is that it enforces stringent priors upon the network (e.g., each neuron must fire only once), which are incompatible with dynamically changing input data. This may be addressed by using periodic temporal codes that refresh at given intervals, in a similar manner to how visual saccades may set a reference time. However, it is the only approach that enables the calculation of an unbiased gradient without any approximations in multilayer SNNs. Whether this precision is necessary is a matter of further exploration on a broader range of tasks [165].
C. Backpropagation Using Spikes
Instead of computing the gradient with respect to spike times, the most commonly adopted approach over the past several years is to apply the generalized backpropagation algorithm to the unrolled computational graph [see Fig. 6(b)] [73], [107], [156], [169], [170], i.e., backpropagation through time (BPTT). Working backward from the final output of the network, the gradient flows from the loss to all descendants. In this way, computing the gradient through an SNN is mostly the same as that of an RNN by iterative application of the chain rule. Fig. 10(a) depicts the various pathways of the gradient from the loss back to the weight, across both immediate and prior time steps.
BPTT. (a) Present time application of $W$, which exerts an immediate influence on the loss at that step. (b) Prior applications of $W$ at earlier time steps, whose influence on the loss is carried forward through the membrane potential.
Finding the derivative of the total loss with respect to the parameters allows the use of gradient descent to train the network, so the goal is to find $\partial\mathcal{L}/\partial W$. With $W[s]$ denoting the application of the weight at step $s$
\begin{equation*} \frac {\partial \mathcal {L}}{\partial W} = \sum _{t} \frac {\partial \mathcal {L}[t]}{\partial W} = \sum _{t} \sum _{s\leq t}\frac {\partial \mathcal {L}[t]}{\partial W[s]} \frac {\partial W[s]}{\partial W}. \tag{6}\end{equation*}
Since a recurrent system constrains the weight to be shared across all time steps, $\partial W[s]/\partial W = 1$, and (6) simplifies to
\begin{equation*} \frac {\partial \mathcal {L}}{\partial W} = \sum _{t} \sum _{s\leq t}\frac {\partial \mathcal {L}[t]}{\partial W[s]}. \tag{7}\end{equation*}
Thankfully, gradients rarely need to be calculated by hand as most deep learning packages come with an automatic differentiation engine. Isolating the immediate influence at a single time step as in Fig. 9(c) makes it clear that we run into the spike nondifferentiability problem in the term $\partial S/\partial U$, the derivative of the spike with respect to the membrane potential.
The solution is actually quite simple. During the forward pass, as per usual, apply the Heaviside operator to $U[t]$ to determine whether the neuron spikes. During the backward pass, substitute the derivative of the Heaviside operator with that of a continuous, smooth function. This is the surrogate gradient approach, treated in detail in Section IV-D.
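In PyTorch, this forward/backward substitution is naturally expressed as a custom autograd function. A minimal sketch using the threshold-shifted sigmoid derivative that Section IV-D formalizes in (9), with the threshold fixed at $\theta = 1$ for illustration:

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    threshold = 1.0  # theta, fixed for illustration

    @staticmethod
    def forward(ctx, U):
        ctx.save_for_backward(U)
        return (U > SurrogateSpike.threshold).float()  # Heaviside forward pass

    @staticmethod
    def backward(ctx, grad_output):
        (U,) = ctx.saved_tensors
        # Backward pass: threshold-shifted sigmoid derivative, as in (9).
        shifted = torch.exp(SurrogateSpike.threshold - U)
        return grad_output * shifted / (shifted + 1.0) ** 2

spike_fn = SurrogateSpike.apply  # drop-in replacement for the Heaviside step
```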
D. Surrogate Gradients
A major advantage of surrogate gradients is that they help with overcoming the dead neuron problem. To make the dead neuron problem more concrete, consider a neuron with a threshold of $\theta$. Its membrane potential can fall into one of three cases.
Case 1: The membrane potential is below the threshold, $U < \theta$.
Case 2: The membrane potential is above the threshold, $U > \theta$.
Case 3: The membrane potential is exactly at the threshold, $U = \theta$.
In Case 1, no spike is elicited, and the derivative is $\partial S/\partial U = 0$. In Case 2, a spike fires, but the derivative is still $0$. In Case 3, the derivative is undefined, tending to infinity at the discontinuity. The gradient of the spike with respect to the membrane potential is, therefore, zero almost everywhere, and no learning signal can flow: this is the dead neuron problem. Surrogate gradients address it by substituting this term with a smooth function during the backward pass.
One example is to replace the nondifferentiable term with the threshold-shifted sigmoid function but only during the backward pass. This is illustrated in Fig. 9(d). More formally \begin{equation*} \sigma (\cdot) = \frac {1}{1+e^{\theta -U}} \tag{8}\end{equation*}
\begin{equation*} \frac {\partial S}{\partial U} \leftarrow \frac {\partial \tilde {S}}{\partial U} = \sigma '(\cdot) = \frac {e^{\theta - U}}{(e^{\theta -U}+1)^{2}}. \tag{9}\end{equation*}
This means that learning only takes place if there is spiking activity. Consider a synaptic weight $W_{\textrm{in}}$ attached to the input of a spiking neuron and a weight $W_{\textrm{out}}$ attached to its output. The signal chain is as follows.
An input spike, $S_{\textrm{in}}$, is scaled by $W_{\textrm{in}}$.
The weighted spike is added as an input current injection to the spiking neuron [see (4)].
This may cause the neuron to trigger a spike, $S_{\textrm{out}}$.
The output spike is weighted by the output weight, $W_{\textrm{out}}$.
This weighted output spike varies some arbitrary loss function, $\mathcal{L}$.
Let the loss function be the Manhattan distance between a target value $y$ and the weighted output spike \begin{equation*} \mathcal {L} = | W_{\textrm {out}} S_{\textrm {out}} - y|\end{equation*} such that the derivative with respect to the output weight is (up to its sign) \begin{equation*} \frac {\partial \mathcal {L}}{\partial W_{\textrm {out}}} = S_{\textrm {out}}.\end{equation*}
More generally, a spike must be triggered for a weight to be updated. The surrogate gradient does not change this.
Now, consider the case for updating the input weight $W_{\textrm{in}}$ \begin{equation*} \frac {\partial \mathcal {L}}{\partial W_{\textrm {in}}} = \underbrace {\frac {\partial \mathcal {L}}{\partial S_{\textrm {out}}}}_{A} \underbrace {\frac {\partial S_{\textrm {out}}}{\partial U}}_{B} \underbrace {\frac {\partial U}{\partial W_{\textrm {in}}}}_{C}.\end{equation*}

1) Term A is simply $W_{\textrm{out}}$, based on the above equation for $\mathcal{L}$.
2) Term B would almost always be 0, unless substituted for a surrogate gradient.
3) Term C is $S_{\textrm{in}}$ [see (4), where $X = S_{\textrm{in}}$].
To summarize, the surrogate gradient enables errors to propagate to earlier layers, regardless of spiking. However, spiking is still needed to trigger a weight update.
As a practical note, various works empirically explore different surrogate gradients. These include triangular functions, fast sigmoid and sigmoid functions, straight-through estimators, and various other weird shapes. Is there a best surrogate gradient? In our experience, we have found the following function to be the best starting point:\begin{equation*} \frac {\partial \tilde {S}}{\partial U} = \frac {1}{\pi } \frac {1}{1+(\pi U)^{2}}.\end{equation*}
You might see this referred to as the “arctan” surrogate gradient, first proposed in [171]. This is because the integral of this function is \begin{equation*} \tilde {S} = \frac {1}{\pi }\textrm {arctan}(\pi U).\end{equation*}
As of 2023, this is the default surrogate gradient in snnTorch, and it is not wholly clear why it works so well.
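As a rough sketch of how this can be implemented, consider the following custom autograd function, which applies the Heaviside operator on the forward pass and the arctan surrogate on the backward pass (the class name and threshold handling here are illustrative, not snnTorch's own implementation):

import math
import torch

class ArcTanSpike(torch.autograd.Function):
    """Heaviside on the forward pass; arctan surrogate on the backward pass."""

    @staticmethod
    def forward(ctx, mem):
        # mem is assumed to be the threshold-shifted membrane potential, U - theta
        ctx.save_for_backward(mem)
        return (mem > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (mem,) = ctx.saved_tensors
        # d(S~)/dU = (1/pi) * 1 / (1 + (pi*U)^2)
        surrogate = 1.0 / (math.pi * (1.0 + (math.pi * mem) ** 2))
        return grad_output * surrogate

mem = torch.tensor([0.8, 1.3], requires_grad=True)
spk = ArcTanSpike.apply(mem - 1.0)  # spike wherever the membrane exceeds a threshold of 1.0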
To reiterate, surrogate gradients will not enable learning in the absence of spiking. This provokes an important distinction between the dead neuron problem and the vanishing gradient problem. A dead neuron is one that does not fire and, therefore, does not contribute to the loss. This means that the weights attached to that neuron have no “credit” in the credit assignment problem. The relevant gradient terms during the training process will remain at zero. Therefore, the neuron cannot learn to fire later on and so is stuck forever, not contributing to learning.
On the other hand, vanishing gradients can arise in ANNs and SNNs. For deep networks, the gradients of the loss function can become vanishingly small as they are successively scaled by values less than “1” when using several common activation functions (e.g., a sigmoid unit). In much the same way, RNNs are highly susceptible to vanishing gradients because they introduce an additional layer to the unrolled computational graph at each time step. Each layer adds another multiplicative factor in calculating the gradient, which makes it susceptible to vanishing if the factor is less than “1” or exploding if greater than “1.” The ReLU activation became broadly adopted to reduce the impact of vanishing gradients but remains underutilized in surrogate gradient implementations [151].
Surrogate gradients have been used in most state-of-the-art experiments that natively train an SNN [73], [107], [156], [169], [170]. A variety of surrogate gradient functions have been used to varying degrees of success, and the choice of function can be treated as a hyperparameter. While several studies have explored the impact of various surrogates on the learning process [113], [172], our understanding tends to be limited to what is known about biased gradient estimators. There is a lot left unanswered here. For example, if we can get away with approximating gradients, then, perhaps, surrogate gradients can be used in tandem with random feedback alignment. This involves replacing weights with random matrices during the backward pass. Rather than pure randomness, perhaps, local approximations can be made that follow the same spirit of a surrogate gradient.
In summary, taking the gradient only at spike times provides an unbiased estimator of the gradient at the expense of losing the ability to train dead neurons. Surrogate gradient descent flips this around, enabling dead neurons to backpropagate error signals by introducing a biased estimator of the gradient. There is a tug-of-war between bringing dead neurons back to life and introducing bias. Given how prevalent surrogate gradients have become, we will linger a little longer on the topic in describing their relation to model quantization. Understanding how approximations in gradient descent impact learning will very likely lead to a deeper understanding of why surrogate gradients are so effective, how they might be improved, and how backpropagation can be simplified by making approximations that reduce the cost of training without harming an objective.
E. Bag of Tricks in BPTT With SNNs
Many advances in deep learning stem from a series of incremental techniques that bolster the learning capacity of models. These techniques are applied in conjunction to boost model performance. For example, He et al.’s [173] work in “Bag of tricks for image classification with convolutional neural networks” not only captures the honest state of deep learning in the title alone but also performs an ablation study of “hacks” that can be combined to improve optimization during training. Some of these techniques can be ported straight from deep learning to SNNs, while others are SNN-specific. A nonexhaustive list of these techniques is provided in this section. These techniques are quite empirical, and each bullet would have its own “Practical Note” text box, but then this article would just turn into a bunch of boxes.
The reset mechanism in (4) is a function of the spike and is also nondifferentiable. It is important to ensure that the surrogate gradient is not cloned into the reset function as it has been empirically shown to degrade network performance [113]. Quite simply, we ignore it during the backward pass. snnTorch does this automatically by detaching the reset term in (4) from the computational graph by calling the “.detach()” function.
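A sketch of what the detached reset looks like when written out by hand (reusing the illustrative ArcTanSpike function from above; the update ordering is one of several reasonable choices):

def lif_step(x, mem, beta=0.9, threshold=1.0):
    """One step of a leaky integrate-and-fire neuron with a detached reset."""
    mem = beta * mem + x                       # decay and integrate the input, as in (4)
    spk = ArcTanSpike.apply(mem - threshold)   # surrogate-gradient spike
    mem = mem - (spk * threshold).detach()     # reset by subtraction, excluded from backprop
    return spk, mem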
Residual connections work remarkably well for nonspiking nets and spiking models alike. Direct paths between layers are created by allowing the output of an earlier layer to be added to the output of a later layer, effectively skipping one or more layers in between. They are used to address the vanishing gradient problem and improve the flow of information during both forward propagation and backward propagation, which enabled the neural network community to construct far deeper architectures, starting with the ResNet family of models and now commonly used in transformers [174]. Unsurprisingly, they work extremely well for SNNs too [171].
Learnable decay: Rather than treating the decay rates of neurons as hyperparameters, it is also common practice to make them learnable parameters. This makes SNNs resemble conventional RNNs much more closely. Doing so has been shown to improve testing performance on datasets with time-varying features [57].
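One way to make the decay learnable while keeping it in a stable range is to pass a raw parameter through a sigmoid; a minimal sketch (all names are illustrative):

import torch
import torch.nn as nn

class LeakyLearnableDecay(nn.Module):
    """Leaky integration step with a learnable decay rate."""

    def __init__(self):
        super().__init__()
        self.beta_raw = nn.Parameter(torch.tensor(2.0))  # sigmoid(2.0) ~ 0.88

    def forward(self, x, mem):
        beta = torch.sigmoid(self.beta_raw)  # constrain the decay rate to (0, 1)
        return beta * mem + x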
Graded spikes: Passive dendritic properties can attenuate action potentials, as can the cable-like properties of the axon. This feature can be coarsely accounted for as graded spikes. Each neuron has an additional learnable parameter that determines how to scale an output spike, so neuronal activations are no longer constrained to $\{0, 1\}$. Can this still be thought of as an SNN? From an engineering standpoint, if a spike must be broadcast to a variety of downstream neurons with an 8- or 16-bit destination address, then adding another several bits to the payload can be worth it. The second-generation Loihi chip from Intel Labs incorporates graded spikes in such a way that sparsity is preserved. Furthermore, the vector of learned values scales linearly with the number of neurons in a network, rather than quadratically as the weights do. It, therefore, contributes a minor cost in comparison to other components of an SNN [175].

Learnable thresholds: These have not been shown to help the training process, likely due to the discrete nature of thresholds, which gives rise to nondifferentiable operators in a computational graph. On the other hand, normalizing the values that are passed into a threshold significantly helps. Adopting batch normalization in convolutional networks helps boost performance, and learnable normalization approaches may act as an effective surrogate for learnable thresholds [176], [177], [178].
Pooling is effective for downsampling large spatial dimensions in convolutional networks and achieving translational invariance. If max pooling is applied to a sparse, spiking tensor, then tie-breaking between 1’s and 0’s does not make much sense. One might expect that we can borrow ideas from training binarized neural networks, where pooling is applied to the activations before they are thresholded to binarized quantities. This corresponds to applying pooling to the membrane potential in a manner that resembles a form of “local lateral inhibition.” However, this does not necessarily lead to optimal performance in SNNs. Interestingly, Fang et al. applied pooling to the spikes instead. Where multiple spikes occurred in a pooling window, a tie-break would occur randomly among them [171]. While no reason was given for doing this, it, nonetheless, achieved state-of-the-art (at the time) performance on a series of computer vision problems. Our best guess is that this randomness acted as a type of regularization. Whether max pooling or average pooling is used can be treated as a hyperparameter. As an alternative, SynSense’s neuromorphic hardware adopts sum pooling, where spatial dimensions are reduced by rerouting the spikes in a receptive field to a common postsynaptic neuron.
Optimizer: Most SNNs default to the Adam optimizer, as it has classically been shown to be robust when used with sequential models [179]. As SNNs become deeper, stochastic gradient descent with momentum seems to increase in prevalence over Adam. The reader is referred to Godbole et al.'s [180] deep learning tuning playbook for a systematic approach to hyperparameter optimization that applies generally.
F. Intersection Between Backprop and Local Learning
An interesting result arises when comparing backpropagation pathways that traverse varying durations of time. The derivative of the hidden state over time is $\partial U[t]/\partial U[t-1] = \beta$ [see (4)], so a gradient that traverses $n$ time steps is scaled by $\beta^{n}$. Weight updates associated with a presynaptic spike that precedes a postsynaptic spike by $n$ steps, therefore, decay exponentially with the spike time difference, mirroring one-half of the exponential learning window of STDP [see Fig. 10(b)].
Is this link just a coincidence? BPTT was derived from function optimization. STDP is a model of biological observation. Despite being developed via completely independent means, they converge upon an identical result. This could have immediate practical implications, where hardware accelerators that train models can excise a chunk of BPTT and replace it with the significantly cheaper and local STDP rule. Adopting such an approach might be thought of as an online variant of BPTT or as a gradient-modulated form of STDP.
G. Long-Term Temporal Dependencies
Neural and synaptic time constants typically span timescales on the order of one to hundreds of milliseconds. With such timescales, it is difficult to solve problems that require long-range associations extending beyond the slowest neuronal or synaptic time constant. Such problems are common in natural language processing and reinforcement learning and are key to understanding behavior and decision-making in humans. This challenge is a huge burden on the learning process, where vanishing gradients drastically slow the convergence of the neural network. LSTMs [181] and, later, GRUs [182] introduced slow dynamics designed to overcome memory and vanishing gradient problems in RNNs. Thus, a natural solution for networks of spiking neurons is to complement the fast timescales of neural dynamics with a variety of slower dynamics. Mixing discrete and continuous dynamics may enable SNNs to learn features that occur across a vast range of timescales. Examples of slower dynamics include the following.
Adaptive thresholds: After a neuron fires, it enters a refractory period during which it is more difficult to elicit further spikes from the neuron. This can be modeled by increasing the firing threshold $\theta$ of the neuron every time it emits a spike. After a sufficient time in which the neuron has not spiked, the threshold relaxes back to a steady-state value. Homeostatic thresholds are known to promote neuronal stability in correlated learning rules, such as STDP, which favors long-term potentiation at high frequencies regardless of spike timing [183], [184]. More recently, adaptive thresholds have been found to benefit gradient-based learning in SNNs as well [169] (see Appendix C3).

Recurrent attention: Hugely popularized in natural language generation, self-attention finds correlations between tokens of vast sequence lengths by feeding a model all sequential inputs at once. This representation of data is not quite how the brain processes data. Several approaches have approximated self-attention as a sequence of recurrent operations; SpikeGPT is the first application in the spiking domain and successfully achieved language generation [83]. In addition to more complex state-based computation, SpikeGPT additionally employs dynamical weights that vary over time.
Axonal delays: The wide variety of axon lengths means that there is a wide range of spike propagation delays. Some neurons have axons as short as 1 mm, whereas those in the sciatic nerve can extend up to a meter in length. The axonal delay can be a learned parameter spanning multiple time steps [73], [185], [186]. A lesser explored approach accounts for the varying delays in not only axons but also across the dendritic tree of a neuron. Coupling axonal and dendritic delays together allows for a fixed delay per synapse.
Membrane dynamics: We already know how the membrane potential can trigger spiking, but how does spiking impact the membrane? Rapid changes in voltage cause an electric field build-up that leads to temperature changes in cells. Joule heating scales quadratically with voltage changes, which affects the geometric structure of neurons and cascades into a change in membrane capacitance (and, thus, time constants). Decay rate modulation as a function of spike emission can act as a second-order mechanism to generate neuron-specific refractory dynamics [187].
Multistable neural activity: Strong recurrent connections in biological neural networks can support multistable dynamics [188], which facilitates stable information storage over time. Such dynamics, often called attractor neural networks [189], are believed to underpin working memory in the brain [190], [191] and are often attributed to the prefrontal cortex. The training of such networks using gradient descent is challenging and has not been attempted using SNNs as of yet [192].
Several rudimentary slow timescale dynamics have been tested in gradient-based approaches to training SNNs with a good deal of success [73], [169], but there are several neuronal dynamics that are yet to be explored. LSTMs showed us the importance of temporal regulation of information and effectively cured the short-term memory problem that plagued RNNs. Translating more nuanced neuronal features into gradient-based learning frameworks can undoubtedly strengthen the ability of SNNs to represent dynamical data in an efficient manner.
Online Learning
A. Temporal Locality
As incredible as our brains are, sadly, they are not time machines. It is highly unlikely that our neurons breach the space-time continuum to explicitly reference historical states in order to run the BPTT algorithm. As with all computers, brains operate on a physical substrate that dictates the operations they can handle and where memory is located. Conventional computers operate on an abstraction layer: memory is delocalized and communicated on demand, paying a considerable price in latency and energy. Brains, in contrast, are believed to operate on local information, which means that the best-performing approaches in temporal deep learning, namely, BPTT, are biologically implausible. This is because BPTT requires the storage of past inputs and states in memory, so its memory requirement scales with time, a property that limits BPTT to small temporal dependencies. To work around this, BPTT assumes a finite sequence length before making an update while truncating the gradients in time, which severely restricts the temporal dependencies that can be learned.
The constraint imposed on brain-inspired learning algorithms is that the calculation of a gradient should, much like the forward pass, be temporally local, i.e., it should depend only on values available at either the present time step $t$ or the one immediately prior, $t-1$.
B. Real-Time Recurrent Learning
RTRL estimates the same gradients as BPTT but relies on a set of different computations that make it temporally, but not spatially, local [193]. Since RTRL's memory requirement does not grow with time, why is it not used in favor of BPTT? BPTT's memory usage scales with the product of time and the number of neurons: it is $\mathcal{O}(nT)$ for $n$ neurons and $T$ time steps. RTRL trades away the time dependence for a memory cost of $\mathcal{O}(n^{3})$ and a computational cost of $\mathcal{O}(n^{4})$ per step, which is far worse for large networks even though it does not grow with time.
Let us derive what new information needs to be propagated forward to enable real-time gradient calculation for an SNN. As in (7), the gradient of the total loss is the sum of instantaneous gradients $\partial\mathcal{L}[t]/\partial W$, each of which we now aim to compute online at time $t$.

First, we define the influence of parameter $W$ on the membrane potential $U[t]$ as $m[t]$ and unpack it into prior and immediate components \begin{equation*} m[t] = \frac {\partial U[t]}{\partial W} = \sum _{s\leq t}\frac {\partial U[t]}{\partial W[s]} = \underbrace {\sum _{s\leq t-1}\frac {\partial U[t]}{\partial W[s]}}_{\textrm {prior}} + \underbrace {\frac {\partial U[t]}{\partial W[t]}}_{\textrm {immediate}}. \tag{10}\end{equation*}

The immediate and prior influence components are graphically illustrated in Fig. 10(a). The immediate influence is natural to calculate online and evaluates to the unweighted input to the neuron, $x[t]$. The prior influence relies on historical components of the network and can be unpacked recursively \begin{equation*} \sum _{s\leq t-1}\frac {\partial U[t]}{\partial W[s]} = \sum _{s\leq t-1}\underbrace {\frac {\partial U[t]}{\partial U[t-1]}}_{\textrm {temporal}}\frac {\partial U[t-1]}{\partial W[s]}. \tag{11}\end{equation*}

Based on (4), in the absence of explicitly recurrent connections, the temporal term evaluates to $\beta$. Substituting the immediate and prior influences back into (10) yields a recursive update for the influence \begin{equation*} m[t] = \beta m[t-1] + x[t]. \tag{12}\end{equation*}

This recursive formula is updated online by passing the unweighted input directly to the influence value $m[t]$ at each step. The gradient of the instantaneous loss is then obtained by combining the influence with the credit assigned to the neuron, $\bar {c}[t] \equiv \partial \mathcal {L}[t]/\partial U[t]$, which depends only on present-time values \begin{equation*} \frac {\partial \mathcal {L}[t]}{\partial W} = \frac {\partial \mathcal {L}[t]}{\partial U[t]}\frac {\partial U[t]}{\partial W} \equiv \bar {c}[t] m[t]. \tag{13}\end{equation*}
RTRL gradient pathways.
An intuitive, though incomplete, way to think about RTRL is given as follows. By reference to Fig. 12, at each time step, a backward pass that does not account for the history of weight updates is applied; the recursively updated influence value carries that history forward instead, as sketched in the example below.
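A minimal sketch of this procedure for a single leaky neuron and a single weight, following (10)-(13); spiking and reset are omitted for clarity, and a per-step mse loss is assumed:

def rtrl_step(x, y, mem, influence, weight, beta=0.9, lr=1e-2):
    """One online RTRL-style update; no history needs to be stored."""
    mem = beta * mem + weight * x              # U[t] = beta*U[t-1] + W*x[t]
    influence = beta * influence + x           # m[t] = beta*m[t-1] + x[t], from (12)
    credit = 2.0 * (mem - y)                   # c[t] = dL[t]/dU[t] for L[t] = (U[t] - y[t])^2
    weight = weight - lr * credit * influence  # dL[t]/dW = c[t]*m[t], from (13)
    return mem, influence, weight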
In the example above, the RTRL approach to training SNNs was only derived for a single neuron and a single parameter. A full-scale neural network replaces the influence value with an influence matrix that tracks how every parameter affects every hidden state; storing this matrix is what gives rise to RTRL's heavy memory cost.
Recent focus on online learning aims to reduce the memory and computational demands of RTRL. This is generally achieved by decomposing the influence matrix into simpler parts or by approximating the calculation of the influence matrix itself.
C. RTRL Variants in SNNs
Since 2020, a flurry of forward-mode learning algorithms has been tailored to SNNs [200]. All such works either modify, rederive, or approximate RTRL.
e-Prop [109]: RTRL is combined with surrogate gradient descent. Recurrent spiking neurons are used, where output spikes are linearly transformed and then fed back to the input of the same neurons. The computational graph is detached at the explicit recurrent operation but retained for implicit recurrence (i.e., where the membrane potential evolves over time). Projecting output spikes into a higher dimensional recurrent space acts like a reservoir, though it leads to biased gradient estimators that underperform compared to BPTT.
decolle [110]: “Deep continuous online learning” also combines RTRL with surrogate gradient descent. This time, greedy local losses are applied at every layer [143]. As such, errors only need to be propagated back to a single layer at a time. This means that errors do not need to traverse through a huge network, which reduces the burden of the spatial credit assignment problem. This brings about two challenges: not many problems can be cast into a form with definable local losses and greedy local learning prioritizes immediate gains without considering an overall objective.
OSTL [201]: “Online spatiotemporal learning” rederives RTRL. The spatial components of backpropagation and temporal components are factored into two separate terms, e.g., one that tracks the “immediate” influence and the other one that tracks the “prior influence” from (10).
ETLP [202]: “Event-based three-factor local plasticity” combines e-prop with direct random target projection (DRTP) [142]. In other words, the weights in the final layer are updated based on an approximation of RTRL, while earlier layers are updated based on partial derivatives that do not rely on a global loss and are spatially “local” to the layer. Instead, the target output is used to modulate these gradients. This addresses spatial credit assignment by using signals from a target, rather than backpropagating gradients in the immediate influence term of (10). The cost is that ETLP inherits drawbacks from both e-prop and DRTP: like greedy local learning, DRTP prioritizes immediate gains without considering an overall objective.
OSTTP [203]: “Online spatiotemporal learning with target projection” combines OSTL (functionally equivalent to RTRL) with DRTP. It inherits the drawbacks of DRTP while addressing the spatial credit assignment problem.
FPTT [204]: “Forward propagation through time” considers RTRL for sequence-to-sequence models with time-varying losses. A regularization term is applied to the loss at each step to ensure stability during the training process. Yin et al. [205] subsequently applied FPTT to SNNs using more complex neuron models with richer dynamics.
This is a nonexhaustive list of RTRL alternatives and can appear quite daunting at first. However, all approaches effectively stem from RTRL. The dominant trends include the following:
approximating RTRL to test how much of an approximation the training procedure can tolerate without completely failing [109];
replacing the immediate influence with global modulation of a loss or target to address spatial credit assignment [110], [202], [203];
modifying the objective to promote stable training dynamics [204];
identifying similarities to biology by factorizing RTRL into eligibility traces and/or three-factor learning rules [109], [202], [205].
Several RTRL variants claim to outperform BPTT in terms of loss minimization, though we take caution with such claims, as the two approaches become effectively identical when weight updates are deferred to the end of a sequence. We also caution against claims of improvements over RTRL, as RTRL can be thought of as the most general case of forward-mode learning applied to any generic architecture. Most reductions in computational complexity arise because they are narrowly considered for specific architectures or otherwise introduce approximations into their models. In contrast, Tallec and Ollivier [194] developed an “unbiased online recurrent optimization” scheme where stochastic noise is used and ultimately canceled out, leading to quadratic (rather than cubic) computational complexity with network size.
D. Spatial Locality
While temporal locality relies on a learning rule that depends only on the present state of the network, spatial locality requires each update to be derived from a node immediately adjacent to the parameter. The biologically motivated learning rules described in Section III-D address the spatial credit assignment problem by either replacing the global error signal with local errors or replacing analytical/numerical derivatives with random noise [138].
The more “natural” approach to online learning is perceived to be via unsupervised learning with synaptic plasticity rules, such as STDP [33], [206] and variants of STDP (see Appendix C2) [207], [208], [209], [210]. These approaches are directly inspired by experimental relationships between spike times and changes to synaptic conductance. Input data are fed to a network, and weights are updated based on the order and firing times of each pair of connected neurons [see Fig. 10(b)]. The interpretation is that, if a neuron causes another neuron to fire, then their synaptic strength should be increased. If a pair of neurons appears uncorrelated, their synaptic strength should be decreased. It follows the Hebbian mantra of “neurons that fire together wire together” [147].
There is a common misconception that backprop and STDP-like learning rules are at odds with one another, competing to be the long-term solution for training connectionist networks. On the one hand, it is thought that STDP deserves more attention as it scales with less complexity than backprop. STDP adheres to temporal and spatial locality, as each synaptic update only relies on information from immediately adjacent nodes. However, this relationship necessarily arises as STDP was reported using data from “immediately adjacent” neurons. On the other hand, STDP fails to compete with backprop on remotely challenging datasets. However, backprop was designed with function optimization in mind, while STDP emerged as a physiological observation. The mere fact that STDP is capable at all of obtaining competitive results on tasks originally intended for supervised learning (such as classifying the MNIST dataset), no matter how simple, is quite a wonder. Rather than focusing on what divides backprop and STDP, the pursuit of more effective learning rules will more likely benefit by understanding how the two intersect.
We demonstrated in Section IV-F how surrogate gradient descent via BPTT subsumes the effect of STDP. Spike time differences result in exponentially decaying weight update magnitudes such that half of the learning window of STDP is already accounted for within the BPTT algorithm [see Fig. 10(b)]. Bengio et al. [211] previously made the case that STDP resembles stochastic gradient descent, provided that STDP is supplemented with gradient feedback [212]. This specifically relates to the case where a neuron's firing rate is interpreted as its activation. Here, we have demonstrated that no modification needs to be made to the BPTT algorithm for it to account for STDP-like effects and that it is not limited to any specific neural code, such as the firing rate. The common theme is that STDP may benefit from integrating error-triggered plasticity to provide meaningful feedback to training a network [213].
Outlook
Designing a neural network was once thought to be strictly an engineering problem, whereas mapping the brain was a scientific curiosity [214]. With the intersection between deep learning and neuroscience broadening, and brains being able to solve complex problems much more efficiently, this view is poised to change. From the scientist’s view, deep learning and brain activity have shown many correlates, which leads us to believe that there is much untapped insight that ANNs can offer in the ambitious quest of understanding biological learning. For example, the activity across layers of a neural network has repeatedly shown similarities to experimental activity in the brain. This includes links between convolutional neural networks and measured activity from the visual cortex [215], [216], [217] and auditory processing regions [218]. Activity levels across populations of neurons have been quantified in many studies, but SNNs might inform us of the specific nature of such activity.
From the engineer’s perspective, neuron models derived from experimental results have allowed us to design extremely energy-efficient networks when running on hardware tailored to SNNs [219], [220], [221], [222], [223], [224], [225]. Improvements in energy consumption of up to two to three orders of magnitude have been reported when compared to conventional ANN acceleration on embedded hardware, which provides empirical validation of the benefits available from the three S’s: spikes, sparsity, and static data suppression (or event-driven processing) [20], [226], [227], [228], [229], [230]. These energy and latency benefits are derived from simply applying neuron models to connectionist networks, but there is so much more left to explore.
It is safe to say that the energy benefits afforded by spikes are uncontroversial. However, a more challenging question to address is: are spikes actually good for computation? It could be that years of evolution determined that spikes solve the long-range signal transmission problem in living organisms, and everything else had to adapt to fit this constraint. If this were true, then spike-based computation would be Pareto optimal, with a proclivity toward energy efficiency and latency. However, until we amass more evidence of a spike's purpose, we have some intuition as to where spikes shine in computation.
Hybrid dynamical systems: SNNs can model a broad class of dynamical systems by coupling discrete and continuous time dynamics into one system. Discontinuities are present in many physical systems, and spiking neuron models are a natural fit to model such dynamics.
Discrete function approximators: Neural networks are universal function approximators, where discrete functions are considered to be modeled sufficiently well by continuous approximations. Spikes are capable of precisely defining discrete functions without approximation.
Multiplexing: Spikes can encode different information in spike rate, times, or burst counts. Repurposing the same spikes offers a sensible way to condense the amount of computation required by a system.
Message packets: By compressing the representation of information, spikes can be thought of as packets of messages that are unlikely to collide as they travel across a network. In contrast, a digital system requires a synchronous clock to signal that a communication channel is available for a message to pass through (even when modeling asynchronous systems).
Coincidence detection: Neural information can be encoded based on spatially disparate but temporally proximate input spikes on a target neuron. It may be the case that isolated input spikes are insufficient to elicit a spike from the output neuron. However, if two incident spikes occur on a timescale faster than the target neuron membrane potential decay rate, this could push the potential beyond the threshold and trigger an output spike. In such a case, associative learning is taking place across neurons that are not directly connected. Although coincidence detection can be programmed in a continuous-time system without spikes, a theoretical analysis has shown that the processing rate of a coincidence detector neuron is faster than the rate at which information is passed to a neuron [231], [232].
Noise robustness: While analog signals are highly susceptible to noise, digital signals are far more robust in long-range communication. Neurons seem to have figured this out by performing analog computation via integration at the soma and digital communication along the axon. It is possible that any noise incident during analog computation at the soma is subsumed into the subthreshold dynamics of the neuron and, therefore, eliminated. A similar analogy can be made between spike rates and spike times in neural coding: pathways that are susceptible to adversarial attacks or timing perturbations could learn to be represented as a rate, which mitigates the timing disturbances that would otherwise corrupt temporal codes.
Modality normalization: A unified representation of sensory input (e.g., vision and auditory) as spikes is nature’s way of normalizing data. While this benefit is not exclusive to spikes (i.e., continuous data streams in nonspiking networks may also be normalized), early empirical evidence has shown instances where multimodal SNNs outperform convolutional neural networks on equivalent tasks [20], [228].
Mixed-mode differentiation: While most modern deep learning frameworks rely on reverse-mode autodifferentiation [233], it is in stark contrast to how the spatial credit assignment problem is treated in biological organisms. If we are to draw parallels between backpropagation and the brain, it is far more likely that approximations of forward-mode autodifferentiation are being used instead. Equation (12) describes how to propagate gradient-related terms forward in time to implement online learning, where such terms could be approximated by eligibility traces that keep track of presynaptic neuron activity in the form of calcium ions and fades over time [109], [234]. SNNs offer a natural way to use mixed-mode differentiation by projecting temporal terms in the gradient calculation from (11) into the future via forward-mode differentiation while taking advantage of the computational complexity of reverse-mode autodifferentiation for spatial terms [71], [110].
A better understanding of the problems spikes are best suited for, beyond addressing just energy efficiency, will be important in directing SNNs to meaningful tasks. The above list is a nonexhaustive start to intuit where that might be. Thus far, we have primarily viewed the benefits of SNNs by examining individual spikes. For example, the advantages derived from sparsity and single-bit communication arise at the level of an individual spiking neuron: how a spike promotes sparsity, how it contributes to a neural encoding strategy, and how it can be used in conjunction with modern deep learning, backprop, and gradient descent. Despite the advances yielded by this spike-centric view, it is important not to develop tunnel vision. New advances are likely to come from a deeper understanding of spikes acting collectively, much like the progression from atoms to waves in physics.
Designing learning rules that operate with brain-like performance is far less trivial than substituting a set of artificial neurons with spiking neurons. It would be incredibly elegant if a unified principle governed how the brain learns. However, the diversity of neurons, functions, and brain regions implies that a heterogeneous system rich in objectives and synaptic update rules is more likely and might require us to use all of the weapons in our arsenal of machine learning tools. It is likely that a better understanding of biological learning will be amassed by observing the behavior of a collection of spikes distributed across brain regions. Ongoing advances in procuring large-scale electrophysiological recordings at the neuron level can give us a window into observing how populations of spikes are orchestrated to handle credit assignment so efficiently and, at the very least, give us a more refined toolkit to developing theories that may advance deep learning [235], [236]. After all, it was not a single atom that led to the silicon revolution but, rather, a mass of particles and their collective fields. A stronger understanding of the computational benefits of spikes may require us to think at a larger scale in terms of the “fields” of spikes.
As the known benefits of SNNs manifest in the physical quantities of energy and latency, it will take more than just a machine learning mind to navigate the tangled highways of 100 trillion synapses. It will take a concerted effort between machine learning engineers, neuroscientists, and circuit designers to put spikes in the front seat of deep learning.
Additional Materials
A series of interactive tutorials complementary to this article are available in the documentation for our Python package designed for gradient-based learning using SNNs, snnTorch [237], at the following link: https://snntorch.readthedocs.io/en/latest/tutorials/index.html.
We invite additional contributions and tutorial content from the neuromorphic community.
ACKNOWLEDGMENT
The authors would like to thank Sumit Bam Shrestha, Garrick Orchard, and Albert Albesa-González for their insightful discussions over the course of putting together this article and iDataMap Corporation for their support.
Appendix
From Artificial to Spiking Neural Networks
1) Forward Euler Method for Solving Spiking Neuron Models:
The time derivative $dU(t)/dt$ of the passive membrane model is replaced with a finite difference approximation, without taking the limit $\Delta t \to 0$ \begin{equation*} \tau \frac {U(t+\Delta t) - U(t)}{\Delta t} = -U(t) + I_{\textrm {in}}(t)R. \tag{14}\end{equation*}

Isolating the membrane potential at the next time step gives \begin{equation*} U(t+\Delta t) = \left({1-\frac {\Delta t}{\tau }}\right)U(t) + \frac {\Delta t}{\tau }I_{\textrm {in}}(t)R. \tag{15}\end{equation*}

In the absence of input current [$I_{\textrm{in}}(t) = 0$] \begin{equation*} U(t+\Delta t) = \left({1-\frac {\Delta t}{\tau }}\right)U(t). \tag{16}\end{equation*}

Assume that $\Delta t = 1$ and $R = 1$, and let the decay rate $\beta$ be the factor scaling the membrane potential between subsequent steps \begin{equation*} \beta = \left({1-\frac {1}{\tau }}\right) \implies U[t+1] = \beta U[t] + (1-\beta)I_{\textrm {in}}[t+1]. \tag{17}\end{equation*}

Alternatively, $\beta$ can be derived from the analytical solution of the membrane potential in the absence of input \begin{equation*} U(t) = U_{0}e^{-t/\tau } \tag{18}\end{equation*} by taking the ratio of the membrane potential across two subsequent time steps \begin{align*} \beta = &\frac {U_{0}e^{-(t+\Delta t)/\tau }}{U_{0}e^{-t/\tau }} = \frac {U_{0}e^{-(t+2\Delta t)/\tau }}{U_{0}e^{-(t+\Delta t)/\tau }} = \ldots \\ \implies & \beta = e^{-\Delta t/\tau }. \tag{19}\end{align*}

A second nonphysiological assumption is made, where the effect of the $(1-\beta)$ input scaling is lumped into a learnable weight $W$ applied to the input $X[t]$ \begin{equation*} WX[t] = I_{\textrm {in}}[t]. \tag{20}\end{equation*}

This results in the simplified update \begin{equation*} U[t+1] = \beta U[t] + WX[t+1] \tag{21}\end{equation*}
To arrive at (4), a reset function is appended, which activates every time an output spike is triggered. The reset mechanism can be implemented by either subtracting the threshold at the onset of a spike as in (4) or by forcing the membrane potential to zero (Fig. 13) \begin{equation*} U[t+1] =\underbrace {\beta U[t]}_{\textrm {decay}} + \underbrace {WX[t]}_{\textrm {input}} - \underbrace {S_{\textrm {out}}(\beta U[t] + WX[t])}_{\textrm {reset-to-zero}}. \tag{22}\end{equation*}
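The discretized dynamics of (22) can be simulated in a handful of lines; a sketch in Python with assumed hyperparameters:

def lif_reset_to_zero(x_seq, W=0.5, beta=0.9, threshold=1.0):
    """Simulate (22): a discretized LIF neuron with a reset-to-zero mechanism."""
    mem, spikes = 0.0, []
    for x in x_seq:
        mem = beta * mem + W * x       # decay + weighted input
        spk = float(mem >= threshold)  # Heaviside threshold
        mem = mem * (1.0 - spk)        # reset to zero whenever a spike fires
        spikes.append(spk)
    return spikes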
Spike Encoding
The following spike encoding mechanisms and loss functions are described with respect to a single sample of data. They can be generalized to multiple samples as is common practice in deep learning to process data in batches.
1) Rate-Coded Input Conversion:
An example of the conversion of an input sample to a rate-coded spike train follows (Fig. 14). Let the normalized input feature $X_{ij} \in [0, 1]$ set the probability that a spike occurs at each time step, such that each element of the spike train $R_{ijk}$ is a Bernoulli trial \begin{equation*} {} {P}(R_{ijk}=1) = X_{ij} = 1 - {P}(R_{ijk}=0). \tag{23}\end{equation*}
Rate-coded input pixel. An input pixel of greater intensity corresponds to a higher firing rate.
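A sketch of (23) in PyTorch, where normalized intensities in [0, 1] act as per-step Bernoulli spike probabilities (num_steps is an assumed argument):

import torch

def rate_encode(image, num_steps=100):
    """Rate-coded spike train: one Bernoulli trial per feature per time step."""
    # spike wherever a uniform draw falls below the feature intensity
    return (torch.rand(num_steps, *image.shape) < image).float()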
2) Latency-Coded Input Conversion:
The logarithmic dependence between input feature intensity and spike timing can be derived using an RC membrane model driven by a constant current injection $I_{\textrm{in}}$ switched on at $t=0$ \begin{equation*} U(t) = I_{\textrm {in}}R(1 - e^{-t/\tau }). \tag{24}\end{equation*}

Setting $U(t)$ to the threshold $\theta$ and solving for $t$ gives the time taken to fire \begin{equation*} t=\tau \left[{\textrm {ln}\left({\frac {I_{\textrm {in}}R}{I_{\textrm {in}}R-\theta }}\right)}\right]. \tag{25}\end{equation*}

Generalizing to an arbitrary input feature $x$ (absorbing $R$ into $x$), a spike can only be emitted if the input is strong enough to push the membrane past the threshold \begin{align*} t(x) = \begin{cases} \displaystyle \tau \left[{\textrm {ln}\left({\frac {x}{x-\theta }}\right)}\right], & x > \theta \\ \displaystyle \infty, & \textrm {otherwise.} \end{cases} \tag{26}\end{align*}
Latency-coded input pixel. An input pixel of greater intensity corresponds to an earlier spike time.
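A sketch of the conversion in (26), with tau and theta as assumed hyperparameters:

import math

def latency_encode(x, tau=5.0, theta=0.01):
    """Latency coding per (26): stronger inputs fire earlier."""
    if x > theta:
        return tau * math.log(x / (x - theta))
    return math.inf  # subthreshold inputs never elicit a spike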
3) Rate-Coded Outputs:
A vectorized implementation of determining the predicted class from rate-coded output spike trains is described (Fig. 16). Let $\vec{S}[t]$ be the vector of output spikes at time step $t$. The spike count of each output neuron is accumulated over all $T$ time steps \begin{equation*} \vec {c} = \sum _{t=0}^{T}\vec {S}[t]. \tag{27}\end{equation*}

The predicted class $\hat{y}$ is the index of the neuron with the largest spike count \begin{equation*} \hat {y} = \mathop {\mathrm {arg\,max}} _{i}c_{i}. \tag{28}\end{equation*}
Rate-coded outputs.
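A vectorized sketch of (27) and (28), where spk_rec is an assumed [T, N_C] record of output spikes:

def predict_class(spk_rec):
    """Sum spikes over time (27) and take the argmax (28)."""
    counts = spk_rec.sum(dim=0)    # spike count per output neuron
    return counts.argmax().item()  # index of the predicted class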
4) Cross-Entropy Spike Rate:
The spike count of the output layer from (27) is treated as a set of logits and passed through the softmax function to obtain the predicted probability of each class \begin{equation*} p_{i}=\frac {e^{c_{i}}}{\sum _{j=1}^{N_{C}}e^{c_{j}}} \tag{29}\end{equation*} where $N_C$ is the number of output classes. The cross entropy between these probabilities and the one-hot target $y_i$ is then minimized \begin{equation*} \mathcal {L}_{CE} = -\sum _{i=0}^{N_{C}}y_{i}\textrm {log}(p_{i}). \tag{30}\end{equation*}
Cross-entropy spike rate. The target vector specifies the correct class; training encourages the correct class to fire more and the incorrect classes to fire less.
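In practice, (29) and (30) amount to passing the spike counts to a standard softmax cross-entropy loss; a sketch with assumed shapes and class index:

import torch
import torch.nn.functional as F

spk_rec = (torch.rand(100, 10) < 0.1).float()        # assumed [T, N_C] output spikes
counts = spk_rec.sum(dim=0)                          # c_i, as in (27)
target = torch.tensor([3])                           # index of the correct class
loss = F.cross_entropy(counts.unsqueeze(0), target)  # softmax (29) + cross entropy (30)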
5) Mean Square Spike Rate:
As in (27), the spike count of the output layer is accumulated over time. The target $y_i$ specifies the total desired number of spikes from each output neuron, and the mean square error between the two is summed across all classes \begin{equation*} \mathcal {L}_{\textrm {mse}} = \sum _{i}^{N_{C}}(y_{i} - c_{i})^{2}. \tag{31}\end{equation*}
Mean square spike rate. The target vector specifies a desired spike count for every output neuron, including those of the incorrect classes.
6) Maximum Membrane:
The logits $\vec{m}$ are taken as the maximum membrane potential of each output neuron over time \begin{equation*} \vec {m} = \textrm {max}_{t}\vec {U}[t] \tag{32}\end{equation*} and are substituted for the spike count in the softmax cross-entropy loss of (29) and (30).
Maximum membrane. The peak membrane potential for each neuron is used in the cross-entropy loss function. This encourages the peak of the correct class to grow, while that of the incorrect class is suppressed. The effect of this is to promote more firing from the correct class and less from the incorrect class.
Alternatively, the membrane potential is summed over time to obtain the logits \begin{equation*} \vec {m} = \sum _{t}^{T}\vec {U}[t]. \tag{33}\end{equation*}
7) Mean Square Membrane:
Let $y_i[t]$ be the target membrane potential of the $i$th output neuron at time $t$. The mean square error is applied across both time steps and output neurons \begin{equation*} \mathcal {L}_{\textrm {mse}} = \sum _{i}^{N_{C}}\sum _{t}^{T}(y_{i}[t]-U_{i}[t])^{2}. \tag{34}\end{equation*}
Mean square membrane. The membrane potential at each time step is applied to the mse loss function. This allows a defined membrane target. The example above sets the target at all time steps at the firing threshold for the correct class and to zero for incorrect classes.
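A sketch of (34) that sets the target of the correct class to the firing threshold at every time step and the incorrect classes to zero (the shapes and class index are assumptions):

import torch

T, num_classes, theta = 100, 10, 1.0
mem_rec = torch.rand(T, num_classes)  # assumed record of output membrane potentials
y = torch.zeros(T, num_classes)       # incorrect classes target zero
y[:, 3] = theta                       # correct class (index 3) targets the threshold
loss = ((y - mem_rec) ** 2).sum()     # mse over time steps and neurons, as in (34)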
8) Cross-Entropy Latency Code:
Let $f_i$ be the first spike time of the $i$th output neuron. For the softmax and cross-entropy loss of (29) and (30) to promote early firing of the correct class, the spike times must first be transformed such that earlier spikes produce larger logits. One option is to negate the spike times \begin{equation*} \vec {f}:=-\vec {f}. \tag{35}\end{equation*}

Another option is to take the inverse of each spike time \begin{equation*} f_{i}:= \frac {1}{f_{i}}. \tag{36}\end{equation*}
Cross-entropy latency code. Applying the inverse (or negated) spike time to the cross-entropy loss pushes the correct class to fire first and the incorrect classes to fire later.
9) Mean Square Spike Time:
The spike time(s) of all neurons are specified as targets (Fig. 22). In the case where only the first spike matters, the mean square error between the target $y_i$ and the first spike time $f_i$ of each output neuron is minimized \begin{equation*} \mathcal {L}_{\textrm {mse}} = \sum _{i}^{N_{C}}(y_{i} - f_{i})^{2}. \tag{37}\end{equation*}

This can be generalized to account for multiple spikes per neuron, where the $k$th spike of the $i$th neuron is assigned its own target \begin{equation*} \mathcal {L}_{\textrm {mse}} = \sum _{k}^{n}\sum _{i}^{N_{C}}(y_{i,k} - f_{i,k})^{2}. \tag{38}\end{equation*}
Mean square spike time. The timing of all spikes is iterated over and sequentially applied to the mse loss function. This enables the timing for multiple spikes to be precisely defined.
10) Mean Square Relative Spike Time:
The difference between the spike times of correct and incorrect neurons is specified as a target (Fig. 23). As in Appendix B9, the mean square error between spike times and their targets is minimized, but the targets are now defined relative to the spike time of the correct class.
Mean square relative spike time. The relative timing between all spikes is applied to the mse loss function, enabling a defined time window $\gamma$ to separate the firing of the correct class from that of the incorrect classes.
Let the minimum possible spike time be $f_0$, which serves as the target for the correct class. Incorrect neurons need only fire at least $\gamma$ steps later: any incorrect neuron that fires within this window has its target pushed back to $f_0 + \gamma$, while neurons that already fire late enough incur no penalty \begin{align*} y_{i} = \begin{cases} \displaystyle f_{0} + \gamma, & \textrm {if $f_{i} < f_{{0}}+\gamma $}\\ \displaystyle f_{i}, & \textrm {if $f_{i} \geq f_{{0}}+\gamma $.} \end{cases} \tag{39}\end{align*}
11) Population Level Regularization:
L1-regularization can be applied to the total number of spikes emitted at the output layer to penalize excessive firing [150], thus encouraging sparse activity at the output \begin{equation*} \mathcal {L}_{L1} = \lambda _{1}\sum _{t}^{T}\sum _{i}^{N_{C}}S_{i}[t] \tag{40}\end{equation*} where $\lambda_1$ is a hyperparameter controlling the strength of the penalty.

Alternatively, an upper activity threshold $\theta_U$ can be applied to the total spike count of layer $l$, such that only the excess above this threshold is penalized \begin{equation*} \mathcal {L}_{U} = \lambda _{U}\left({\left[{\sum _{i}^{N}c_{i}^{(l)} - \theta _{U}}\right]_{+}}\right)^{L} \tag{41}\end{equation*} where $[\cdot]_+$ denotes linear rectification and $\lambda_U$ and the exponent $L$ are hyperparameters.
12) Neuron Level Regularization:
A lower activity threshold $\theta_L$ specifies the minimum number of spikes each neuron in layer $l$ should emit; neurons that fire below this threshold are penalized \begin{equation*} \mathcal {L}_{L} = \frac {\lambda _{L}}{N}\sum _{i}^{N}\Big (\Big [\theta _{L} - c_{i}^{(l)}\Big]_{+}\Big)^{2}. \tag{42}\end{equation*}

The rectification $[\cdot]_+$ ensures that the penalty is applied only to neurons whose spike count falls below $\theta_L$.
Training Spiking Neural Networks
1) Backpropagation Using Spike Times:
A visual depiction of the following derivation is provided in Fig. 24. In the original description of SpikeProp from [119], a spike response model is used \begin{align*} U_{j}(t) &= \sum _{i,k} W_{i,j}I_{i}^{(k)}(t) \\ I_{i}^{(k)}(t) & =\epsilon (t-f_{i}^{(k)})\tag{43}\end{align*} where $f_i^{(k)}$ is the time of the $k$th spike from the $i$th presynaptic neuron and the postsynaptic current kernel is \begin{equation*} \epsilon (t) = \frac {t}{\tau }e^{1-\frac {t}{\tau }}\Theta \left ({t}\right) \tag{44}\end{equation*} in which $\Theta(t)$ is the Heaviside step function.
Calculation of the derivative of the membrane potential with respect to spike time.
Consider an SNN where each target specifies the timing of the output spike emitted from the $j$th output neuron. Applying the chain rule to the loss gives \begin{equation*} \frac {\partial \mathcal {L}}{\partial W_{i,j}} = \frac {\partial \mathcal {L}}{\partial f_{j}}\frac {\partial f_{j}}{\partial U_{j}}\frac {\partial U_{j}}{\partial W_{i,j}}\Bigr |_{t=f_{j}}. \tag{45}\end{equation*}

The derivative of the loss with respect to the output spike time is \begin{equation*} \frac {\partial \mathcal {L}}{\partial f_{j}} = 2(y_{j} - f_{j}). \tag{46}\end{equation*}

The membrane potential's dependence on the weight, evaluated at the spike time, is the summed postsynaptic current \begin{equation*} \frac {\partial U_{j}}{\partial W_{i,j}}\Bigr |_{t=f_{j}} = \sum _{k}I_{i}^{(k)}(f_{j}) = \sum _{k}\epsilon (f_{j}-f_{i}^{(k)}). \tag{47}\end{equation*}

The remaining term is approximated by linearizing the membrane potential about the spike time, such that the spike time's sensitivity to the membrane potential is the inverse of the potential's rate of change \begin{align*} \frac {\partial f_{j}}{\partial U_{j}} & \leftarrow \left ({\frac {\partial U_{j}}{\partial t}\Bigr |_{t=f_{j}}}\right)^{-1} \\ & = \left ({\sum _{i,k} W_{i,j}\frac {\partial I_{i}^{(k)}}{\partial t}\Bigr |_{t=f_{j}}}\right)^{-1} \\ & = \left ({\sum _{i,k}W_{i,j}\frac {f_{j} - f_{i}^{(k)}-\tau }{\tau ^{2}}\left({e^{\frac {f_{j} - f_{i}^{(k)}}{\tau }-1}}\right)}\right)^{-1}. \tag{48}\end{align*}

Combining these terms gives the final SpikeProp gradient \begin{equation*} \frac {\partial \mathcal {L}}{\partial W_{i,j}} = \frac {2(y_{j}-f_{j})\sum _{k} I_{i}^{(k)}(f_{j})}{\sum _{i,k}W_{i,j}(\partial I_{j}^{(k)}/\partial t)\Bigr |_{t=f_{j}}}. \tag{49}\end{equation*}
2) Backpropagation Using Spikes:
STDP: The connection between a pair of neurons can be altered by the spikes emitted by both neurons. Several experiments have shown that the relative timing of spikes between presynaptic and postsynaptic neurons can be used to define a learning rule for updating the synaptic weight [33]. Let $t_{\textrm{pre}}$ and $t_{\textrm{post}}$ be the presynaptic and postsynaptic spike times, respectively, with the time difference between them defined as \begin{equation*} \Delta t = t_{\textrm {pre}} - t_{\textrm {post}}. \tag{50}\end{equation*}

The weight update is an exponential function of this time difference \begin{align*} \Delta W = \begin{cases} \displaystyle A_{+}e^{\Delta t/\tau _{+}}, & \textrm {if $t_{\textrm {post}} > t_{\textrm {pre}}$} \\ \displaystyle A_{-}e^{-\Delta t/\tau _{-}}, & \textrm {if $t_{\textrm {post}} < t_{\textrm {pre}}$} \end{cases} \tag{51}\end{align*} where $A_+$ and $A_-$ scale the strength of potentiation and depression, respectively, and $\tau_+$ and $\tau_-$ set the width of the learning window.
STDP learning window. If the presynaptic neuron spikes before the postsynaptic neuron, the synaptic weight is potentiated; if the order is reversed, the weight is depressed.
For a strong, excitatory synaptic connection, a presynaptic spike will trigger a large postsynaptic potential, making it more likely that the postsynaptic neuron fires shortly after the presynaptic one. By (51), this causal ordering further potentiates the connection.
Input sensory data are typically correlated in both space and time, so a network’s response to a correlated spike train will increase the weights much faster than uncorrelated spike trains. This is a direct result of causal spiking. Intuitively, a group of correlated spikes from multiple presynaptic neurons will arrive at a postsynaptic neuron within a close time interval, causing stronger depolarization of the neuron membrane potential, and a higher probability of a postsynaptic spike being triggered.
However, without an upper bound, this will lead to unstable and indefinitely large growth of the synaptic weight. In practice, an upper limit should be applied to constrain potentiation. Alternatively, homeostatic mechanisms can also be used to offset this unbounded growth, such as an adaptive threshold that increases each time a spike is triggered from the neuron (see Appendix C3).
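A sketch of the pairwise rule in (50) and (51); the amplitudes and time constants are assumed hyperparameters, and the final line clamps the weight to implement the upper bound discussed above:

import math

def stdp_dw(t_pre, t_post, A_plus=0.01, A_minus=-0.012,
            tau_plus=20.0, tau_minus=20.0):
    """Weight change for a single pre/post spike pair, per (50) and (51)."""
    dt = t_pre - t_post                         # (50)
    if t_post > t_pre:                          # causal pair: potentiate
        return A_plus * math.exp(dt / tau_plus)
    return A_minus * math.exp(-dt / tau_minus)  # acausal pair: depress

w = min(0.5 + stdp_dw(t_pre=10.0, t_post=12.0), 1.0)  # clamp to an upper limit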
3) Long-Term Temporal Dependencies:
One of the simplest implementations of an adaptive threshold is to choose a steady-state threshold $\theta_0$ and add to it a variable component $b[t]$ that jumps each time the neuron emits a spike and decays back toward zero otherwise \begin{align*} \theta [t] &= \theta _{0} + b[t] \tag{52}\\ b[t+1] &= \alpha b[t] + (1-\alpha)S_{\textrm {out}}[t] \tag{53}\end{align*} where $\alpha$ is the decay rate of the threshold's adaptive component.
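A sketch of (52) and (53) as a per-step update, with theta_0 and alpha as assumed hyperparameters:

def adaptive_threshold_step(b, spk, theta_0=1.0, alpha=0.95):
    """Update the adaptive threshold after one time step."""
    theta = theta_0 + b                   # (52): effective firing threshold
    b = alpha * b + (1.0 - alpha) * spk   # (53): jump on spike, decay otherwise
    return theta, b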