Critical Learning Periods in Deep Networks
Alessandro Achille, Matteo Rovere, Stefano Soatto
- Title : Critical Learning Periods in Deep Neural Networks
- Authors : Alessandro Achille, Matteo Rovere, Stefano Soatto
- Link : https://arxiv.org/abs/1711.08856
- Tags : Neural Network
- Year : 2017
- They find that artificial neural networks show critical periods, similar to biological neural networks.
- E.g., a disease causing blurry vision for some time during early development can permanently harm the ability to interpret visual stimuli, while the same disease very early or later in life causes no permanent damage (after healing).
- They train networks on CIFAR-10 and blur all input images by downscaling and then upscaling them (see the sketch after this list).
- They vary the epoch at which the blurring is removed.
- They always train for 300 more epochs after the blurring is removed, so the network always sees the unaltered input images for at least the same number of epochs.
- Same as blurring, but they vertically flip the image.
- This is expected to keep low- and mid-level statistics the same; only the final layers have to change.
- This is expected to be easy to adapt to, even if the flipping is removed fairly late into the training.
- They randomly permute the class labels and remove the effect after some epochs.
- This is expected to only affect the last layer and hence should have similar effects to vertical flipping.
- They test the effect of sensory deprivation on neural nets.
- They make the input of the networks uninformative by replacing it with Gaussian noise.
- This is assumed to have less effect than adding blur, because the network does not learn significantly wrong statistics (due to the input being uninformative).
- They add random Gaussian noise to each layer's output (or to the weights - not really clear).
- They allow the network to learn the variance of that noise.
- They add a regularization based on mutual information. This adds a "cost proportional to the quantity of mutual information I(w;D) that the weights retain about the training data D after the learning process" (?).
- So the network can retain more information, but has to pay for that.
- It is expected to set the variance to low values for layers which are critical for the predictions.
- Removing the blur early leads to only a small loss in final accuracy.
- Removing the blur too late leads to a large and permanent loss in final accuracy.
- The decline in accuracy is not linear with respect to when the blur is removed.
- The effect is similar to biological neural nets.

- Making the network deeper does not help, but instead worsens the effect.
- Fixing the learning rate doesn't help either.

- This is basically like starting to let the network learn once the blur is removed, but using a weirdly bad initialization. (Weird in the sense that it starts with great accuracy, but is barely able to improve.)
- As expected, adding vertical flips does not significantly affect long term accuracy.
- Same as for vertical flipping, only minor effect.
- This has worse effects than vertical flipping / label permutation.
- Overall less decrease in accuracy than with blur.
- The effect is more linear with respect to the epoch (remove early: hardly any decline in accuracy, remove after half of training: medium decline, remove late: strong decline).
- Without deficit, the network will put most weight (least amount of noise) on the middle layers (3-5 of 7).
- With deficit, it will put more weight on the last layers and is only able to partially reconfigure if the deficit is removed early enough.
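Below is a minimal PyTorch-style sketch of the blur-deficit setup described in the notes above: inputs are blurred by down- and then upscaling for the first `deficit_epochs` epochs, after which training continues on clean images for 300 more epochs. The 8×8 intermediate resolution, the optimizer, and the hyperparameters are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def blur(images, low_res=8):
    """Remove high-frequency content by downscaling and then upscaling."""
    small = F.interpolate(images, size=low_res, mode="bilinear", align_corners=False)
    return F.interpolate(small, size=images.shape[-1], mode="bilinear", align_corners=False)

def train_with_deficit(model, loader, deficit_epochs, extra_epochs=300, device="cuda"):
    opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=5e-4)
    for epoch in range(deficit_epochs + extra_epochs):
        apply_deficit = epoch < deficit_epochs
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            if apply_deficit:
                x = blur(x)                      # corrupted low/mid-level statistics
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

Sweeping `deficit_epochs` reproduces the "when is the blur removed" experiment: the later the deficit is removed, the larger the permanent drop in final accuracy.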


On the Importance of Critical Period in Multi-stage Reinforcement Learning

The initial years of an infant's life are known as the critical period, during which the overall development of learning performance is significantly impacted due to neural plasticity. In recent studies, an AI agent with a deep neural network mimicking mechanisms of actual neurons exhibited a learning period similar to the human critical period. Especially during this initial period, the appropriate stimuli play a vital role in developing learning ability. However, transforming human cognitive bias into an appropriate shaping reward is quite challenging, and prior works on the critical period do not focus on finding the appropriate stimulus. To take a step further, we propose multi-stage reinforcement learning to emphasize finding the "appropriate stimulus" around the critical period. Inspired by humans' early cognitive-developmental stage, we use multi-stage guidance near the critical period and demonstrate the appropriate shaping reward (stage-2 guidance) in terms of the AI agent's performance, efficiency, and stability.

Junseok Park, Inwoo Hwang, Min Whoo Lee, Hyunseok Oh, Youngki Lee, Byoung-Tak Zhang
Critical learning periods for multisensory integration in Deep Networks
We show that the ability of a neural network to integrate information from diverse sources hinges critically on being exposed to properly correlated signals during the early phases of training. Interfering with the learning process during this initial stage can permanently impair the development of a skill, both in artificial and biological systems, where the phenomenon is known as a critical learning period. We show that critical periods arise from the complex and unstable early transient dynamics, which are decisive for the final performance of the trained system and its learned representations. This evidence challenges the view, engendered by analysis of wide and shallow networks, that the early learning dynamics of neural networks are simple, akin to those of a linear model. Indeed, we show that even deep linear networks exhibit critical learning periods for multi-source integration, while shallow networks do not. To better understand how the internal representations change in response to disturbances or sensory deficits, we introduce a new measure of source sensitivity, which allows us to track the inhibition and integration of sources during training. Our analysis of inhibition suggests cross-source reconstruction as a natural auxiliary training objective, and indeed we show that architectures trained with cross-sensor reconstruction objectives are remarkably more resilient to critical periods. Our findings suggest that the recent success of self-supervised multi-modal training compared to previous supervised efforts may be in part due to more robust learning dynamics and not solely due to better architectures and/or more data.
1 Introduction
Learning generally benefits from exposure to diverse sources of information, including different sensory modalities, views, or features. Multiple sources can be more informative than the sum of their parts. For instance, both views of a random-dot stereogram are needed to extract the synergistic information , which is absent in each individual view (Julesz, 1960 ) . More generally, multiple sources can help identify latent common factors of variation relevant to the task, and separate them from source-specific nuisance variability, as done in contrastive learning.
Much information fusion work in Deep Learning focuses on the design of the architecture, as different sources may require different architectural biases to be efficiently encoded. We instead focus on the learning dynamics , since effective fusion of different sources relies on complex phenomena beginning during the early epochs of training. In fact, even slight interference with the learning process during this critical period can permanently damage a network’s ability to harvest synergistic information. Even in animals, which excel at multi-sensor fusion, a temporary deficit in one source during early development can permanently impair the learning process: congenital strabismus in humans can cause permanent loss of stereopsis if not corrected sufficiently early; similarly, visual/auditory misalignment can impair the ability of barn owls to localize prey (Kandel et al., 2013 ) . In artificial networks, the challenge of integrating different sources has been noted in visual question answering (VQA), where the model often resorts to encoding less rich but more readily accessible textual information (Agrawal et al., 2016 ; Cadene et al., 2019 ) , ignoring the visual modality, or in audio-visual processing, where acoustic information is often washed out by visual information (Wang et al., 2020 ) .
Such failures are commonly attributed to the mismatch in learning speed between sources, or their “information asymmetry” for the task. It has also been suggested, based on limiting analysis for wide networks, that the initial dynamics of DNNs are very simple (Hu et al., 2020 ) , seemingly in contrast with evidence from biology. In this paper, we instead argue that the early learning dynamics of information fusion in deep networks are both highly complex and brittle, to the point of exhibiting critical learning periods similar to biological systems.

In Sect. 2 , we show that shallow networks do not exhibit critical periods when learning to fuse diverse sources of information, but deep networks do. Even though, unlike animals, artificial networks do not age, their learning success is still decided during the early phases of training. The existence of critical learning periods for information fusion is not an artifact of annealing the learning rate or other details of the optimizer and the architecture. In fact, we show that critical periods for fusing information are present even in a simple deep linear network. This refutes the idea that deep networks exhibit trivial early dynamics (Hu et al., 2020 ; Lee et al., 2019 ) . We provide an interpretation for critical periods in linear networks in terms of mutual inhibition/reinforcement between sources, manifest through sharp transitions in the learning dynamics, which in turn are related to the intrinsic structure of the underlying data distribution.
In Sect. 3, we introduce a metric called "Relative Source Variance" to quantify the dependence of units in a representation on individual sources, allowing us to better understand inhibition and fusion between sources. Through it, in Sect. 4, we show that temporarily reducing the information in one source, or breaking the correlation between sources, can permanently change the overall amount of information in the learned representation. Moreover, even when downstream performance is not significantly affected, such temporary changes result in units that are highly polarized and process information only from one source or the other. Surprisingly, we find that the final representations of artificial networks exposed to a temporary deficit mirror single-unit animal representations exposed to analogous deficits (Fig. 4, Fig. 6).
We hypothesize that features inhibit each other because they are competing to solve the task. But if the competitive effect is reduced, for example through an auxiliary cross-source reconstruction task, the different sources can interact synergistically. This supports cross-modal reconstruction as a practical self-supervision criterion. In Sect. 4.4, we show that auxiliary cross-source reconstruction can indeed stabilize the learning dynamics and prevent critical periods. This lends an alternate interpretation of the recent achievements in multi-modal learning, attributing them to the improved stability of the early learning dynamics induced by auxiliary cross-modal reconstruction tasks, rather than to the design of the architecture.
Empirically, we show the existence of critical learning periods for multi-source integration using state-of-the-art architectures (Sect. 4.3-4.4). To isolate the different factors that may contribute to low performance on multi-modal tasks (mismatched training dynamics, different informativeness), we focus on tasks where the sources of information are symmetric and homogeneous, in particular stereo and multi-view imagery. Even in this highly controlled setting, we observe the effect of critical periods both in downstream performance and/or in unit polarization. Our analysis suggests that pre-training on one modality, for instance text, and then adding additional pre-trained backbones, for instance visual and acoustic, as advocated in recent trends with Foundation Models, yields representations that fail to encode synergistic information. Instead, training should be performed across modalities at the outset. Our work also suggests that asymptotic analysis is irrelevant for deep network fusion, as their fate is sealed during the initial transient of learning. Also, conclusions drawn from wide and shallow networks do not transfer to the deep networks used in practice.
1.1 Related Work
Multi-sensor learning. There is a large literature on sensor fusion in early development (Smith & Gasser, 2005), including homogeneous sensors that are spatially dislocated (e.g., two eyes) or time-separated (e.g., motion), and heterogeneous sources (e.g., optical and acoustic, or visual and tactile). Indeed, given normal learning, humans and other animals have the remarkable ability to integrate multi-sensory data, such as visual stimuli arriving at the two eyes, together with corresponding haptic and audio stimuli. Monkeys have been shown to be adept at combining and leveraging arbitrary sensory feedback information (Dadarlat et al., 2015).
In deep learning, multi-modal (or multi-view) learning typically falls into two broad categories: learning a joint representation (fusion of information) and learning an aligned representation (leveraging coordinated information in the multiple views) (Baltrušaitis et al., 2018). A fusion-based approach is beneficial if there is synergistic information available in the different views, while an alignment-based approach is helpful if there is shared information common to the different views (Fig. 1). Such a division of information typically affects architectural and model choices: synergistic information requires the information from the different modalities to be fused or combined, whereas shared information often serves as a self-supervised signal that can align information from the different modalities, as in contrastive learning (Tian et al., 2020a; 2020b; Chen et al., 2020) or correlation-based approaches (Andrew et al., 2013).
Critical periods in animals and deep networks: Such architectural considerations often neglect the impact of multisensory learning dynamics, where information can be learned at different speeds from each sensor (Wu et al., 2022). Indeed, Wiesel (1982) showed that humans and animals are particularly sensitive to changes in the distribution of sensory information early in training, a phenomenon known as critical periods. Critical periods have since been described in many different species and sensory organs. For example, barn owls initially exposed to misaligned auditory and visual information cannot properly localize prey (Knudsen & Knudsen, 1990). Somewhat surprisingly, similar critical periods for learning have also been observed in deep networks. Achille et al. (2019) found that the early period of training is critical for determining the asymptotic network behavior. Additionally, it was found that the timing of regularization is important for determining asymptotic performance (Golatkar et al., 2019), with regularization during the initial stages of training having the most influential effect.
Masked/de-noising Autoencoders: Reconstructing an input from a noisy or partial observation has long been used as a form of supervision. Recently, in part due to the successful use of transformers in language (Vaswani et al., 2017) and vision tasks (Dosovitskiy et al., 2020), such a pre-training strategy has been successfully applied to text (Devlin et al., 2018) and vision (He et al., 2021). An extension of this approach has recently been applied to multi-modal data (Bachmann et al., 2022).
Models of learning dynamics: We consider two approaches to gain analytic insight into the learning dynamics of deep networks. Saxe et al. (2013; 2019) assume that the input-output mapping is realized by a deep linear network; we show that under this model critical periods may exist. Hu et al. (2020) and Lee et al. (2019) assume instead infinitely wide networks, resulting in a model that is linear with respect to the parameters; in this latter case, no critical period is predicted, contradicting our empirical observations on finite networks.
2 A model for critical periods in sensor-fusion

We want to establish what the difference is, in terms of learning dynamics, between learning to use two sources of information at the same time and learning to solve a task with each modality separately and then merging the results. In particular we consider the counterfactual question: if we disable sensor A during training, would this change how we learn to use sensor B? To start, let's consider the simple case of a linear regression model $y = Wx$ trained with the mean squared error loss
$$L = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\,\big\|y^{(i)} - W x^{(i)}\big\|^2,$$
where $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ is a training set of i.i.d. samples. In this simplified setting, we consider each component $x_k$ of $x$ as coming from a different sensor or source. To simplify even further, we assume that the inputs have been whitened, so that the input correlation matrix $\Sigma_x = \frac{1}{N}\sum_i x^{(i)} x^{(i)T} = I$.
In this case, the learning dynamics of any source are independent of the others. In fact, the gradient of the weight $w_{jk}$ connecting $x_k$ to $y_j$ is given by
$$\frac{\partial L}{\partial w_{jk}} = w_{jk} - \frac{1}{N}\sum_i y_j^{(i)} x_k^{(i)}$$
(using $\Sigma_x = I$), and it does not depend on any $w_{hl}$ with $w_{hl} \neq w_{jk}$. The answer to the counterfactual question is thus negative in this setting: adding or removing one source of information (or output) will not change how the model learns to extract information from the other sources. However, we now show that the addition of depth, even without introducing non-linearities, makes the situation radically different.
To this effect, consider a deep linear network with one hidden layer, $y = W_2 W_1 x$. This network has the same expressive power (and the same global optimum) as the previous model. However, it introduces a mutual dependency between sensors (due to the shared layer) that can ultimately lead to critical periods in cross-sensor learning. To see this, we use an analytical expression of the learning dynamics for two-layer deep networks (Saxe et al., 2013; 2019). Let $\Sigma_{yx} = \frac{1}{N}\sum_{i=1}^{N} y^{(i)} x^{(i)T}$ be the cross-correlation matrix between the inputs $x$ and the target vector $y$ (note that $W = \Sigma_{yx}$ is also the global minimum of the MSE loss $L$ above), and let $\Sigma_{yx} = U S V^T$ be its singular-value decomposition (SVD). Saxe et al. (2019) show that the total weight $W(t) = W_2(t) W_1(t)$ assigned to each source at time $t$ during training can be written in closed form in terms of the singular modes of $\Sigma_{yx}$.
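Up to notation, and under the standard assumptions of that analysis (whitened inputs and small, balanced initial weights), the closed-form solution quoted here from the Saxe et al. papers, rather than from this text, takes the form
$$W(t) = \sum_\alpha a_\alpha(t)\, u_\alpha v_\alpha^T, \qquad a_\alpha(t) = \frac{s_\alpha\, e^{2 s_\alpha t/\tau}}{e^{2 s_\alpha t/\tau} - 1 + s_\alpha / a_\alpha(0)},$$
where $u_\alpha$, $v_\alpha$, $s_\alpha$ are the singular vectors and values of $\Sigma_{yx}$, $a_\alpha(0)$ is the initial strength of mode $\alpha$, and $\tau$ is inversely proportional to the learning rate: each mode follows a sigmoidal trajectory that saturates at the corresponding singular value $s_\alpha$.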
This leads to non-linear learning dynamics in which different features are learned at sharply distinct points in time (Saxe et al., 2019). Moreover, it entangles the learning dynamics of the different sources, since the singular vectors $v_\alpha$ mix multiple sources.
Disabling (or adding) a source of information corresponds to removing (or adding) a column of the matrix $\Sigma_{yx}$, which in turn affects its singular-value decomposition and the corresponding learning dynamics. To see how this change may affect the learning dynamics, in Fig. 2 we compare the weights associated with each sensor during training on one particular task. Solid lines show the dynamics with all sensors active at the same time; dashed lines show the dynamics when one of the sensors is disabled. We see that disabling a sensor (green in the figure) can completely inhibit learning of other task-relevant features (e.g., the gray feature) during the initial transient. This should be compared with the learning dynamics of a shallow one-layer network (Fig. 2, left), where all task-relevant features are learned at the same time and removal of a source does not affect the others.
In deep linear networks, the suboptimal configuration learned during the initial transient is eventually discarded, and the network reverts to the globally optimal solution. In the following we show this is not the case for standard non-linear deep networks. While the non-trivial initial interactions between sources of information remain, non-linear networks are unable to unlearn the suboptimal configurations learned at the beginning (owing to the highly non-convex landscape). This can result in permanent impairments if a source of information is removed during the initial transient of learning, mirroring the critical periods observed in animals.
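The following NumPy sketch illustrates the argument of this section: in a shallow linear model the per-sensor weight trajectories are unaffected by removing another sensor, whereas in a two-layer linear network they are coupled through the shared layer. The particular cross-correlation matrix, initialization, and step count below are arbitrary illustrative choices, not the settings behind Fig. 2.

```python
import numpy as np

# Arbitrary illustrative cross-correlation matrix (rows: outputs, columns: sensors).
Sigma_yx = np.array([[1.0, 0.8, 0.6],
                     [0.2, 1.0, 0.4],
                     [0.1, 0.3, 1.0]])

def train_linear(Sigma, deep, steps=4000, lr=0.01, hidden=16, seed=0):
    """Gradient descent on the MSE loss with whitened inputs; returns the
    per-sensor weight norm (one column norm per input) at every step."""
    rng = np.random.default_rng(seed)
    out_dim, in_dim = Sigma.shape
    if deep:
        W1 = 1e-2 * rng.standard_normal((hidden, in_dim))
        W2 = 1e-2 * rng.standard_normal((out_dim, hidden))
    else:
        W = 1e-2 * rng.standard_normal((out_dim, in_dim))
    history = []
    for _ in range(steps):
        if deep:
            W_tot = W2 @ W1
            grad = W_tot - Sigma          # dL/dW_tot for whitened inputs
            g2, g1 = grad @ W1.T, W2.T @ grad
            W2 -= lr * g2
            W1 -= lr * g1
            history.append(np.linalg.norm(W2 @ W1, axis=0))
        else:
            W -= lr * (W - Sigma)
            history.append(np.linalg.norm(W, axis=0))
    return np.array(history)

full    = train_linear(Sigma_yx, deep=True)           # all sensors active
ablated = train_linear(Sigma_yx[:, :2], deep=True)    # "disable" the third sensor
# With deep=False the trajectories of the remaining columns are identical with or
# without the third sensor; with deep=True they differ during the transient.
```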
3 Single Neuron Sensitivity Analysis

Before studying the empirical behavior of real networks on multi-sensor tasks, we should consider how to quantify the effect of a deficit on a downstream task. One way is to look at the final performance of the model on the task. For example, animals reared with a monocular deprivation deficit have reduced accuracy on a visual acuity test and, similarly, deep networks may show reduced classification accuracy (Achille et al., 2019). However, in some cases deficits may not drastically impair accuracy but may still affect how the model is organized internally. Individuals with strabismus or amblyopia can perform just as well on most tasks, since the information coming from each sensor separately is enough to compensate. But the connectivity scheme of the synapses may change so that neurons eventually process information only from one sensor or the other, and not from both together, as observed in individuals without deficits (Wiesel, 1982).
To understand whether units in a representation of multisensory inputs depend on both sensors or only on a particular sensor, we introduce a measure of Relative Source Variance. We first define the Source Variance (SV) for unit $i$ of a representation due to sensor A, conditioned on an example $b$, as
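A natural formalization, consistent with the sampling procedure described below (an assumption on our part rather than a verbatim quote of the paper's equation), is the variance over the other source:
$$SV_i(A, b) = \operatorname{Var}_{a \sim p(a)}\big[f_i(a, b)\big],$$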
where $f$ denotes the mapping from multisensory inputs to the representation and $i$ indexes the unit of the representation. We note that the value of $SV_i(A, b)$ depends on the example $b$. We use the analogous formula for $SV_i(B, a)$.
Typically, we are interested in the distribution of the Source Variance of the units $i$ in a representation, as a function of many examples $a$ and $b$. To capture this, we define the Relative Source Variance (RSV) for unit $i$ as:
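One definition consistent with the range and interpretation stated next (again an assumption, not a verbatim quote of the paper's display equation) is the normalized difference of the two source variances:
$$RSV_i(a, b) = \frac{SV_i(A, b) - SV_i(B, a)}{SV_i(A, b) + SV_i(B, a)}.$$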
If $RSV_i(a, b) = 1$, the unit is only sensitive to sensor A; if $RSV_i(a, b) = -1$, it is only sensitive to sensor B; and if $RSV_i(a, b) = 0$, it is equally sensitive to both sensors. Note that $-1 \le RSV_i(a, b) \le 1$. To estimate $SV(A, b)$ (and analogously $SV(B, a)$) from samples, we fix a sample $b$ and vary the inputs $a$, sampling $a \sim p(a)$; we repeat this for multiple fixed samples $b$, performing the computation over a batch. We compute $RSV_i(a, b)$ for all units $i$ of a representation and for many examples $a$ and $b$, and then plot the distribution of RSVs aggregated across all units (see, e.g., Fig. 4-6). In particular, we track how the distribution changes as a result of sensory deficits and perturbations, as well as during normal training. For controlled simulations (see Appendix A.1), Fig. 3 shows the variety of unit distributions that the RSV can capture.
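A minimal PyTorch sketch of the sampling procedure just described, assuming the normalized-difference form of the RSV given above; the function `f`, the sample batches, and the fixed examples are placeholders supplied by the caller. In practice the returned per-unit values would be aggregated over many fixed examples before plotting their distribution.

```python
import torch

@torch.no_grad()
def rsv(f, a_samples, b_samples, a_fixed, b_fixed, eps=1e-12):
    """Estimate the Relative Source Variance of every unit of f(a, b).

    f         : callable mapping batched inputs (a, b) -> representation [batch, units]
    a_samples : batch of samples a ~ p(a)
    b_samples : batch of samples b ~ p(b)
    a_fixed   : one fixed example a (used when varying b)
    b_fixed   : one fixed example b (used when varying a)
    """
    na, nb = a_samples.shape[0], b_samples.shape[0]
    # SV_i(A, b): fix b, vary a, take the per-unit variance of the representation.
    rep_vary_a = f(a_samples, b_fixed.unsqueeze(0).expand(na, *b_fixed.shape))
    sv_a = rep_vary_a.var(dim=0)
    # SV_i(B, a): fix a, vary b.
    rep_vary_b = f(a_fixed.unsqueeze(0).expand(nb, *a_fixed.shape), b_samples)
    sv_b = rep_vary_b.var(dim=0)
    # RSV in [-1, 1]: +1 -> only sensitive to source A, -1 -> only to source B.
    return (sv_a - sv_b) / (sv_a + sv_b + eps)
```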
4 Critical learning periods in deep multi-sensor networks
In this section, we investigate the learning dynamics of deep networks during the initial learning transient when multiple sources of information are present. We evaluate how temporary perturbations of the relation between the two sensors during training can change the final outcome. To exclude possible confounding factors, in all our experiments the two input sources are perfectly symmetric (same data distribution and same informativeness for the task), which ensures that any asymmetry observed in the final model is due to the perturbation.
4.1 Inhibition of a weak source
Uncorrected vision problems in one eye during early childhood can cause permanent visual impairment in humans: even after correction, the patient sees only through the unaffected eye and does not recover vision in the affected eye (amblyopia, or lazy eye). We explore whether such inhibition of a sensor can happen in DNNs, following an experimental setup similar to Achille et al. (2019). To simulate binocular data from single images, we partition each image into a left and a right crop and feed them to two separate pathways of the network, which are fused additively at a later stage. For each initial pathway, we use the early stages of a ResNet-18 backbone. We then simulate the blurry vision of a weak eye by downsampling the input of the right pathway by 4x and then resizing the image to the original size. After training for $t_0$ initial epochs with the blur deficit, we remove it and train for a further 180 epochs to ensure convergence (see Appendix for details). Here we focus on the simple CIFAR-10 classification dataset; later we examine different architectures, datasets, and learning approaches.
At the end of training, both sensors work well and contain partially disjoint information about the task variable, so the network would benefit from using both of them. However, in Fig. 4 (top) we see from the RSV that weakening the right sensor by blurring it during the initial transient permanently inhibits its use, even after the deficit is removed. More specifically, at the end of normal training the units in the network attend equally to either sensor (leftmost panel). However, in the network trained with a short deficit the neurons only encode information about the "initially good" left sensor (the RSV of the units concentrates around -1, rightmost panel). This mirrors the ocular dominance findings in monkeys reared with a cataract (Wiesel, 1982, Fig. 7). Similarly, the longer the deficit is present during the initial training, the more the downstream performance on the CIFAR-10 classification task is impaired (Fig. 5, left). However, the reduction in performance is not as drastic as the RSV change, since the network can compensate and achieve good accuracy on the task using only the good sensor.
Dependency on depth. In Sect. 2 we noted that depth is fundamental for critical periods to emerge in multi-sensor networks. We further claim that increasing the depth of the network makes critical periods more evident. Indeed, in Fig. 5 (right) we show that increasingly deep networks suffer increasingly marked permanent impairments as a result of a temporary deficit.
4.2 Learning synergistic information
We have seen that temporarily weakening one sensor may completely inhibit its learning. We now consider an alternative deficit in which the two sensors both work well but are initially trained on uncorrelated data and only later trained together. This situation is common in everyday machine learning, for example when pre-training backbones on different modalities separately (e.g., a text and a vision backbone) and then fine-tuning them together on a downstream task.
Dissociation deficit. To keep the two modalities symmetric, we consider a setup similar to the previous one, where we feed the left and right crops of an image to the two pathways of the network. Both crops are now always full-resolution. However, we introduce a dissociation deficit, during which the right crop is sampled from a different image than the left one. During the dissociation, the task is to predict, with probability 0.5, either the class of the left image or that of the right image. This deficit removes any synergistic information between the two pathways, but still encourages each pathway to extract any unique information from its input.
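A sketch of how the dissociation deficit could be implemented on a mini-batch: the right crop is drawn from an independently permuted image and the target is taken from either view with probability 0.5, as described above. The CIFAR-style half-image crops and the helper name are illustrative assumptions, not the paper's code.

```python
import torch

def dissociation_batch(images, labels, deficit_active):
    """Build (left, right, target) crops for one training step."""
    w = images.shape[-1] // 2
    left, right = images[..., :w], images[..., w:]
    target = labels
    if deficit_active:
        perm = torch.randperm(images.shape[0], device=images.device)
        right = images[perm][..., w:]                          # break left/right correlation
        use_right = torch.rand(images.shape[0], device=labels.device) < 0.5
        target = torch.where(use_right, labels[perm], labels)  # label from either view, p = 0.5
    return left, right, target
```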
We observe that this setup too has a critical period. In Fig. 6, we see that, after normal training, the units are equally sensitive to both the left and right inputs (the histogram clusters around zero). However, after training with an increasingly long dissociation deficit, the histogram becomes increasingly polarized around ±1, suggesting that each unit encodes information only about the right or the left image. This precludes the possibility that the network is extracting synergistic information from the two views (which would require units that process information from both sensors). This mirrors the ocular dominance representations observed in strabismic monkeys (Wiesel, 1982, Fig. 10-12): in strabismus, as in our dissociation deficit, the eyes are misaligned, breaking the normal correlation between the views. The dissociation deficit also produces a permanent impairment in downstream performance (Fig. 6, top), but again the effect is not as drastic as in the RSV plot, since the network compensates by using each pathway separately (although synergistic information is lost).
4.3 Synergistic information in videos
So far we have seen that supervised deep networks, similar to humans and animals, have critical periods for learning correspondences between multi-view data. We confirmed this both at the behavioural level (measured in terms of performance and visual acuity for deep networks and animals, respectively) and at the representation level, quantified by the neuron sensitivity. We now investigate whether this phenomenon generalizes across learning strategies, architectures, and tasks.
Multi-View Transformer. Aside from integrating information from different sensors, animals and artificial networks need to be able to integrate information through time. We can think of the frames of a video as different views or sources of information that are correlated through time, and we can study how a network learns to integrate such information. We opted for a more flexible transformer-based visual architecture, which has recently achieved state-of-the-art results in computer vision (Dosovitskiy et al., 2020; He et al., 2021) and language tasks (Vaswani et al., 2017; Devlin et al., 2018). Visual transformers are typically trained either with a supervised loss (Dosovitskiy et al., 2020) or with a masking-based objective followed by fine-tuning (He et al., 2021). We focus now on the first case and analyze the second in the next section. In order to process multiple frames of a video, we use a modified Multi-Modal Masked Auto-Encoder (Bachmann et al., 2022), which we train in a fully supervised fashion. We refer to this as a Multi-View Transformer.
To capture multiple views of a scene, we use the Kinetics action classification video dataset (Carreira & Zisserman, 2017), which consists of classifying a video clip into one of 400 possible actions. To adapt the task to our setting, from each video we select two random frames that are a multiple of 0.33 seconds apart to form our two views, and feed them to the Multi-View Transformer. Due to their temporal correlation, the two frames together contain more information (the motion) than either frame individually. We use a dissociation deficit similar to that of the previous section: during the deficit period, we sample the two frames from independent videos in order to break their temporal correlation. In this case, the classification label comes from either view with $p = 0.5$ (see Appendix for training details).
Even with a largely different architecture (a transformer instead of a ResNet) and a more complex task (action classification on natural video instead of CIFAR-10), in Fig. 8 we observe the same trends as in the previous section. Training with a temporary dissociation deficit permanently prevents the network from extracting synergistic temporal information from the frames. Unlike in the previous experiment, since the synergistic information is fundamental for the action classification task, the network cannot compensate for the deficit, and perturbations during the critical period also result in a harsh decrease of up to 20% in the final test accuracy (Fig. 8, left).
4.4 Overcoming critical periods with cross-sensor reconstruction

Our previous experiments suggest that critical periods can be caused by competition between sensors, which increases the selectivity of the units. If this is the case, we may hypothesize that adding a cross-sensor reconstruction objective to training may help force the units to learn how to encode cross-sensor information. To test this hypothesis, we train the Multi-View Transformer of Sect. 4.3 using the cross-sensor masking-based reconstruction objective of Bachmann et al. (2022) and compare it with the supervised case. The self-supervised masked-image reconstruction task could encourage correspondences to be learned (if un-occluded parts of one view are helpful for reconstructing the other view), and may force the network to learn synergistic information irrespective of the initial transient. In Fig. 7, we show that the masking-based pre-training is indeed successful in using information from one source to predict masked patches of the other.
We use the same protocol as in Sect. 4.3 to pre-train the Multi-View Transformer with the cross-reconstruction objective, and then fine-tune for 20 epochs on the downstream supervised classification task (see Appendix for details). In Fig. 8 we see that the self-supervised network is much more robust to perturbations early in training, whereas the supervised one is not. To understand whether this robustness was due to large changes to the representation during fine-tuning, we applied the RSV to the output of the encoder and found that, while the resulting distribution became slightly more symmetrically balanced, it retained a bimodal distribution similar to that of the pre-trained representation (Fig. 14).
5 Discussion
We have shown, in a variety of architectures and tasks, the existence of critical learning periods for multi-source integration: a temporary initial perturbation of an input source may permanently inhibit that source, or prevent the model from learning how to combine multiple sources. These trends replicate similar phenomena in animals, and point to the underlying complexity and brittleness of the learning dynamics that allow a network (or an animal) to fuse information. To simplify the analysis of the learning dynamics, we focused on tasks with homogeneous sources (stereo, video); we leave to future work the role played by asymmetry between sources (e.g., different informativeness or ease of learning). Our theoretical and empirical analysis leads to several suggestions. Pre-training different backbones separately on each modality, as advocated for some foundation models, may yield representations that ultimately fail to encode synergistic information; instead, training should be performed across modalities from the outset. On the theoretical side, our work suggests that analysis "at convergence" of the learning dynamics of a network is irrelevant for sensor fusion, as its fate is sealed during the initial transient of learning. It also suggests that conclusions drawn from wide and shallow networks may not transfer to the deep networks in current use.
- Achille et al. (2019) Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep networks. In International Conference on Learning Representations , 2019. URL https://openreview.net/forum?id=BkeStsCcKQ .
- Agrawal et al. (2016) Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual question answering models. arXiv preprint arXiv:1606.07356 , 2016.
- Andrew et al. (2013) Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International conference on machine learning , pp. 1247–1255. PMLR, 2013.
- Bachmann et al. (2022) Roman Bachmann, David Mizrahi, Andrei Atanov, and Amir Zamir. Multimae: Multi-modal multi-task masked autoencoders. arXiv preprint arXiv:2204.01678 , 2022.
- Baltrušaitis et al. (2018) Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443, 2018.
- Cadene et al. (2019) Remi Cadene, Corentin Dancette, Matthieu Cord, Devi Parikh, et al. Rubi: Reducing unimodal biases for visual question answering. Advances in neural information processing systems , 32, 2019.
- Carreira & Zisserman (2017) Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 6299–6308, 2017.
- Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning , pp. 1597–1607. PMLR, 2020.
- Dadarlat et al. (2015) Maria C Dadarlat, Joseph E O’doherty, and Philip N Sabes. A learning-based approach to artificial sensory feedback leads to optimal integration. Nature neuroscience , 18(1):138–144, 2015.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 , 2018.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 , 2020.
- Golatkar et al. (2019) Aditya Sharad Golatkar, Alessandro Achille, and Stefano Soatto. Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence. In Advances in Neural Information Processing Systems 32 , pp. 10677–10687. Curran Associates, Inc., 2019.
- Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 , 2017.
- He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90.
- He et al. (2021) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 , 2021.
- Hu et al. (2020) Wei Hu, Lechao Xiao, Ben Adlam, and Jeffrey Pennington. The surprising simplicity of the early-time learning dynamics of neural networks. Advances in Neural Information Processing Systems , 33:17116–17128, 2020.
- Julesz (1960) Bela Julesz. Binocular depth perception of computer-generated patterns. Bell System Technical Journal , 39(5):1125–1162, 1960.
- Kandel et al. (2013) Eric R Kandel, James H Schwartz, Thomas M Jessell, Steven A Siegelbaum, and A J Hudspeth. Principles of neural science . McGraw-Hill, New York, fifth edition, 2013.
- Kleinman et al. (2021) Michael Kleinman, Alessandro Achille, Daksh Idnani, and Jonathan Kao. Usable information and evolution of optimal representations during training. In International Conference on Learning Representations , 2021. URL https://openreview.net/forum?id=p8agn6bmTbr .
- Knudsen & Knudsen (1990) Eric I Knudsen and Phyllis F Knudsen. Sensitive and critical periods for visual calibration of sound localization by barn owls. Journal of Neuroscience , 10(1):222–232, 1990.
- Lee et al. (2019) Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems , 32, 2019.
- Loshchilov & Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations , 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7 .
- Saxe et al. (2013) Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120 , 2013.
- Saxe et al. (2019) Andrew M Saxe, James L McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences , 116(23):11537–11546, 2019.
- Smith & Gasser (2005) Linda Smith and Michael Gasser. The development of embodied cognition: Six lessons from babies. Artificial life , 11(1-2):13–29, 2005.
- Springenberg et al. (2014) Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806 , 2014.
- Tian et al. (2020a) Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In European conference on computer vision , pp. 776–794. Springer, 2020a.
- Tian et al. (2020b) Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? Advances in Neural Information Processing Systems , 33:6827–6839, 2020b.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems , 30, 2017.
- Wang et al. (2020) Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal classification networks hard? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pp. 12695–12705, 2020.
- Wiesel (1982) Torsten N Wiesel. Postnatal development of the visual cortex and the influence of environment. Nature , 299(5884):583–591, 1982.
- Wu et al. (2022) Nan Wu, Stanislaw Jastrzebski, Kyunghyun Cho, and Krzysztof J Geras. Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. In International Conference on Machine Learning , pp. 24043–24055. PMLR, 2022.
- Xu et al. (2020) Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. A theory of usable information under computational constraints. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 . OpenReview.net, 2020. URL https://openreview.net/forum?id=r1eBeyHFDH .
Appendix A Additional Details
A.1 Description of simulated RSV distributions
When evaluating the RSV on a synthetic distribution, we considered the following generative model, consisting of a common component $x_0$ with additive noise:
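A form consistent with the description below (the specific noise distributions are an assumption on our part, e.g. independent Gaussian terms $n_a$, $n_b$) is
$$x_a = x_0 + n_a, \qquad x_b = x_0 + n_b, \qquad z_i = w_i\, x_a + (1 - w_i)\, x_b, \qquad w_i \sim \mathrm{Beta}(\alpha, \beta).$$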
Depending on the values of $\alpha$ and $\beta$, the Beta distribution from which the weights $w_i$ are drawn takes different shapes, changing how units in the representation $z$ vary with the inputs $x_a$ and $x_b$. We find that the distributions of RSVs in Fig. 3 reflect the full spectrum of these cases: the resulting RSVs can vary from an approximately Gaussian distribution, where units vary equally with both modalities, to polarized representations, where units vary uniquely with one modality.
For this synthetic simulation, we can derive a closed form expression for the RSV. In particular (and dropping the subscript i for clarity),
and note that z will be distributed as a normal distribution. Then,
We know that
since x 0 , n a , and n b are independent. Finally,
We also know that
We can then solve for S V i by plugging Eq 9, 16, 17 into Eq 8 and obtain:
We assumed that the representation $z_i$ for half of the units was sampled from the above generative model, while for the other half $z_i$ was sampled from the reverse convex combination of the inputs, i.e., $z_i = w_i x_b + (1 - w_i) x_a$.
For simulations 2-4, we set $\beta = 20$ and varied $\alpha$ over $[1, 20, 30]$, respectively. We considered a representation of $N = 20000$ units. For the first simulation we only considered the half of the units following the generative model above, with $\alpha = 10$.
A.2 Generalization of the RSV to an arbitrary number of sensors
We can naturally generalize the RSV to an arbitrary number $n$ of sources. To do so, define the source variance $SV_i(j)$ of unit $i$ with respect to each source $j$ as before, and collect the individual source variances into a vector $SV_i$ of size $n$. The normalized sensor variance would then be
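One natural normalization consistent with the description that follows (the hat notation is ours, not the paper's) divides each component by the sum:
$$\widehat{SV}_i(j) = \frac{SV_i(j)}{\sum_{k=1}^{n} SV_i(k)}, \qquad j = 1, \dots, n,$$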
which provides a normalized quantification (between 0 and 1) of how much an individual unit varies with each sensor modality $j$.
A.3 Description of the deep linear network experiment
We considered the original input-output correlation (before dropping a sensor) to be
Our perturbation involved dropping a sensor, in this case the third column, leading to
Using the analytical equations for the learning dynamics given by Saxe et al. (2013; 2019) for the shallow and deep network, we investigated how learning the task (row 5) was affected (Fig. 2), finding that such a perturbation had a significant effect on the dynamics of sensor learning in the deep, but not the shallow, network.
A.4 Description of architectures and training
Most of our experiments are based on the ResNet-18 architecture (He et al., 2016). We modified the architecture to process multi-sensor input, in what we call an SResNet-18: we separately process two initial pathways, which we combine in an additive manner. In particular, each initial pathway follows the architecture of He et al. (2016) directly up to (and including) conv3_x (see Table 1 of He et al. (2016)). After combining the pathways, the remaining layers follow the ResNet-18 architecture directly.
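A sketch of the two-pathway additive-fusion idea (what the text calls an SResNet-18), built here from torchvision's ResNet-18 as an implementation convenience; the exact split point and layer wiring in the paper's code may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SResNet18(nn.Module):
    """Two-pathway ResNet-18: separate early stages per sensor, additive fusion."""

    def __init__(self, num_classes=10):
        super().__init__()
        def make_pathway():
            m = resnet18(num_classes=num_classes)
            # Early stages up to and including layer2 (conv3_x in He et al.'s Table 1).
            return nn.Sequential(m.conv1, m.bn1, m.relu, m.maxpool, m.layer1, m.layer2)
        self.left, self.right = make_pathway(), make_pathway()
        trunk = resnet18(num_classes=num_classes)
        self.shared = nn.Sequential(trunk.layer3, trunk.layer4, trunk.avgpool)
        self.fc = nn.Linear(trunk.fc.in_features, num_classes)

    def forward(self, x_left, x_right):
        fused = self.left(x_left) + self.right(x_right)   # additive fusion
        features = torch.flatten(self.shared(fused), 1)
        return self.fc(features)
```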
To examine the effect of depth, we modified the All-CNN architecture (Springenberg et al., 2014 ) , following Achille et al. ( 2019 ) . In particular we processed each pathway with the following architecture:
$$\text{conv } 96 \;-\; \big[\text{conv } 96\cdot 2^{\,i-1} \;-\; \text{conv } 96\cdot 2^{\,i}\ \text{s2}\big]_{i=1}^{n} \;-\; \text{conv } 96\cdot 2^{\,n} \;-\; \text{conv1 } 96\cdot 2^{\,n} \;-\; \text{conv1 } 10$$
where s2 refers to a stride of 2. We then merged the final representations from each pathway in an additive manner. We examined the settings $n = 1, 2, 3$. We used a fixed learning rate of 0.001 in these experiments.
A.5 Description of Blurring Experiments (Fig. 4)
We simulated a cataract-like deficit by blurring the image fed to one pathway: we reduced the resolution of the image passed to that pathway by first resizing the CIFAR images to 8 × 8, and then resizing them back to the original size (32 × 32 pixels), decreasing the available information.
While training, we applied standard data augmentation to the uncorrupted pathway (random translation of up to 4 pixels, and random horizontal flipping). We then retained a width $w$ of the leftmost and rightmost pixels for the uncorrupted and corrupted pathways respectively, setting $w = 16$ unless otherwise stated. At inference time, no data augmentation was applied and the leftmost $w$ pixels and rightmost $w$ pixels were supplied to the respective pathways. We used an initial learning rate of 0.075, decaying smoothly at each epoch with a scale factor of 0.97.
To quantify the information contained in the representation, we randomly masked out each pathway with $p = 0.1$ during training, and computed the usable information $I_u$ contained in the representation $Z$ about the task $Y$, following Kleinman et al. (2021) and Xu et al. (2020), as $I_u(Z;Y) = H(Y) - L_{CE}$, where $H(Y)$ is known and equal to $\log_2 10$ since the distribution of targets is uniform, and $L_{CE}$ is the cross-entropy loss on the test set. We report the corresponding RSV plots and network performance in Appendix Fig. 9, which reveal similar performance trends and polarization of units when pre-training with the random masking as in Fig. 4.
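A small sketch of the usable-information computation described above; `logits` and `targets` are placeholders for the test-set predictions and labels.

```python
import math
import torch.nn.functional as F

def usable_information(logits, targets, num_classes=10):
    """I_u(Z; Y) = H(Y) - L_CE, in bits, assuming a uniform label distribution."""
    ce_nats = F.cross_entropy(logits, targets).item()  # test-set cross-entropy (nats)
    ce_bits = ce_nats / math.log(2)
    return math.log2(num_classes) - ce_bits
```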

A.6 Description of Independent Pathways Experiment (Fig. 6)
We followed the same setup as above, but instead randomly permuted the images fed to the 'right' pathway across the batch, breaking the correlation between the views. We trained using an initial learning rate of 0.05, decaying smoothly with a scale factor of 0.97. When training with the deficit, we randomly sampled the target from the two views with $p = 0.5$. We also modified the architecture to produce multiple classification outputs, corresponding to a classification based on both views or on each pathway separately; this modification was helpful for interpreting the polarization plots. While training, the loss function was applied to the head that received the proper input-target correspondence. After the deficit, and during inference, only the head corresponding to both views was used.
A.7 Description of Masking + Supervised MultiViT training
These experiments were based on the MultiMAE architecture (Bachmann et al., 2022), using their implementation and closely following their default settings. We adapted the implementation to process two separate RGB views coming from the Kinetics-400 dataset (Carreira & Zisserman, 2017). We used a patch size of 16 in all experiments and the AdamW optimizer (Loshchilov & Hutter, 2019). All inputs were first resized to 224 × 224 pixels. Our learning rate followed the linear scaling rule (Goyal et al., 2017).
For the masking sensitivity experiments in Fig. 8, we used a fixed delay of 1.33 seconds (4 frames) between frames, and trained with an initial base learning rate of 0.0001, with 40 epochs of learning-rate warmup. We trained for 800 epochs, with a 200-epoch deficit of independent frames starting at different epochs of pre-training. We used a masking ratio of 0.75. We pre-trained with a batch size of 256 per GPU on 8 GPUs. After pre-training, we fine-tuned for 20 epochs with all the tokens and the corresponding action classification label, on 8 GPUs with a batch size of 32, with a learning rate of 0.0005 and 5 epochs of warmup.
For the supervised experiments, we trained our networks with an initial base learning rate of 0.01 for 120 epochs using all the tokens, with 20 epochs of warmup. We applied a temporary deficit of independent frames for 20 epochs, starting at various epochs during training. We used cutmix (1.0) and mixup (0.8), applied to each view, while training, and we used a random baseline between frames. For the supervised experiments, we used a batch size of 64 per GPU.
In both the masking and supervised experiments in Fig. 8, we report the difference between networks trained with a deficit starting at different epochs of training and a corresponding model trained without any deficit. In Fig. 7, we show example reconstructions from our Multi-View Transformer pre-trained without a deficit for 800 epochs with a random baseline between frames.
Appendix B Additional Plots
Critical Learning Periods in Deep Neural Networks
24 Nov 2017 · Alessandro Achille, Matteo Rovere, Stefano Soatto
Similar to humans and animals, deep artificial neural networks exhibit critical periods during which a temporary stimulus deficit can impair the development of a skill. The extent of the impairment depends on the onset and length of the deficit window, as in animal models, and on the size of the neural network. Deficits that do not affect low-level statistics, such as vertical flipping of the images, have no lasting effect on performance and can be overcome with further training. To better understand this phenomenon, we use the Fisher Information of the weights to measure the effective connectivity between layers of a network during training. Counterintuitively, information rises rapidly in the early phases of training, and then decreases, preventing redistribution of information resources in a phenomenon we refer to as a loss of "Information Plasticity". Our analysis suggests that the first few epochs are critical for the creation of strong connections that are optimal relative to the input data distribution. Once such strong connections are created, they do not appear to change during additional training. These findings suggest that the initial learning transient, under-scrutinized compared to asymptotic behavior, plays a key role in determining the outcome of the training process. Our findings, combined with recent theoretical results in the literature, also suggest that forgetting (decrease of information in the weights) is critical to achieving invariance and disentanglement in representation learning. Finally, critical periods are not restricted to biological systems, but can emerge naturally in learning systems, whether biological or artificial, due to fundamental constraints arising from learning dynamics and information processing.
- Open Access
- Published: 24 May 2023
A study on computer vision for facial emotion recognition
Zi-Yu Huang 1, Chia-Chin Chiang 1, Jian-Hao Chen 2, Yi-Chian Chen 3, Hsin-Lung Chung 1, Yu-Ping Cai 4 & Hsiu-Chuan Hsu 2,5
Scientific Reports volume 13, Article number: 8425 (2023)
Artificial intelligence has been successfully applied in various fields, one of which is computer vision. In this study, a deep neural network (DNN) was adopted for facial emotion recognition (FER). One of the objectives of this study is to identify the critical facial features on which the DNN model focuses for FER. In particular, we utilized a convolutional neural network (CNN), a combination of the squeeze-and-excitation network and the residual neural network, for the task of FER. We utilized AffectNet and the Real-World Affective Faces Database (RAF-DB) as the facial expression databases that provide learning samples for the CNN. The feature maps were extracted from the residual blocks for further analysis. Our analysis shows that the features around the nose and mouth are critical facial landmarks for the neural networks. Cross-database validations were conducted between the databases. The network model trained on AffectNet achieved 77.37% accuracy when validated on the RAF-DB, while the network model pretrained on AffectNet and then transfer-learned on the RAF-DB reached a validation accuracy of 83.37%. The outcomes of this study would improve the understanding of neural networks and assist with improving computer vision accuracy.
Introduction
In human communications, facial expressions contain critical nonverbal information that can provide additional clues and meanings to verbal communications 1 . Some studies have suggested that 60–80% of communication is nonverbal 2 . This nonverbal information includes facial expressions, eye contact, tones of voice, hand gestures and physical distancing. In particular, facial expression analysis has become a popular research topic 3 . Facial emotional recognition (FER) has been applied in the field of human–computer interaction (HCI) in areas such as autopilot, education, medical treatment, psychological treatment 4 , surveillance and psychological analysis in computer vision 5 , 6 .
In psychology and computer vision, emotions are classified as categorical or dimensional (valence and arousal) models 7 , 8 , 9 . In the categorical model, Ekman et al . 7 defined basic human emotions as happiness, anger, disgust, fear, sadness, and surprise. In the dimensional model, the emotion is evaluated by continuous numerical scales for determination of valence and arousal. FER is an important task in computer vision that has numerous practical applications and the number of studies on FER has increased in recent years 10 , 11 , 12 , 13 , benefiting from the advances provided by deep neural networks. In particular, convolutional neural networks (CNNs) have attained excellent results in terms of extracting features. For example, He et al . 14 proposed the residual neural network (ResNet) architecture in 2015, which added residual learning to a CNN to resolve the issues of vanishing gradient and decreasing accuracy of deep networks.
Several authors have applied neural network models to classify emotions according to categorical models 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 and dimensional models 15 , 23 , 24 , 25 , 26 . Huang 27 applied a residual block architecture to a VGG CNN to perform emotion recognition and obtained improved accuracy. Mao et al . 28 proposed a new FER model called POSTER V2, which aims to improve the performance of the state-of-the-art technique and reduce the required computational cost by introducing window-based cross attention mechanism and facial landmarks’ multi-scale features. To incorporate more information into the automatic emotion recognition process, some recent studies have fused several modalities, such as the temporal, audio and visual modalities 10 , 17 , 18 , 23 , 25 , into the algorithm. Moreover, attention mechanisms have been adopted by several studies 17 , 18 , 19 , 20 , 22 , 25 for FER tasks. Zhang et al . 19 applied class activation mapping to analyze the attention maps learned by their model. It was found that the model could be regularized by flipping its attention map and randomly erasing part of the input images. Wang et al. 22 introduced an attention branch to learn a face mask that highlights the discriminative parts for FER. These studies show that attention mechanisms play a critical role in FER. Several approaches for FER utilize self-attention mechanisms to capture both local and global contexts through a set of convolutional layers for feature extraction 29 , 30 , 31 . The extracted features are then used as the inputs of a relation attention module, which utilizes self-attention to capture the relationships between different patches and the context.
However, the practical deployment of facial recognition systems remains a challenging task because of noise, ambiguous annotations 32 , and complicated scenes in real-world settings 33 , 34 , 35 . Since attention modules have proven effective for computer vision tasks, applying them to FER tasks is of great interest. Moreover, in psychology, the facial features that humans use for FER have been analyzed. The results presented by Beaudry et al. 35 suggest that the mouth is the major landmark when observing a happy emotion and that the eyes are the major landmarks when observing a sad emotion. Similarly, the DNN model extracts discriminative features for FER, and it is beneficial to apply class activation mapping to identify the discriminative features learned by the network at each layer. It has been shown that class activation mapping can be used to localize regions around the eyes for movement analysis purposes 37 , 38 . The produced feature maps could provide a better understanding of the performance of the developed model.
In this study, the squeeze-and-excitation module (SENet) was used with ResNet-18 to obtain a relatively light model for FER. This model has fewer trainable parameters (approximately 11.27 million) than the approximately 23 million required by ResNet-50 and the approximately 86 million of the vision transformer. The effectiveness of the proposed approach was evaluated on two FER datasets, namely, AffectNet and the Real-World Affective Faces Database (RAF-DB). Both datasets contain a large quantity of facial emotion data from various cultures and races. The number of images in AffectNet is about 20 times that of RAF-DB, and the images in AffectNet are more diverse and wilder than those in RAF-DB. The neural network was trained to extract emotional information from AffectNet and RAF-DB, and a cross-database validation between the AffectNet dataset and the RAF-DB was conducted. The results show that a training accuracy of 79.08% and a validation accuracy of 56.54% were achieved with AffectNet, and a training accuracy of 76.51% and a validation accuracy of 65.67% were achieved with RAF-DB. Transfer learning was then applied on RAF-DB with pretrained weights obtained from AffectNet, and the prediction accuracy on the RAF-DB dataset increased dramatically. The results suggest that transfer learning can be conducted for a smaller dataset from a particular culture, region, or social setting 36 for specific applications: it enables the model to learn the facial emotions of a particular population with a smaller database and still achieve accurate results. Moreover, the images in AffectNet and RAF-DB with softmax scores exceeding 90% were selected to identify the important facial landmarks captured by the network. It is found that in the shallow layers the dominant extracted features are fine lines, whereas in the deep layers the regions near the mouth and nose are more important.
Database and model
The AffectNet database contains 456,349 images of facial emotions obtained from three search engines, Google, Bing and Yahoo, in six different languages. The images were labeled with the following 11 emotions: neutrality, happiness, sadness, surprise, fear, disgust, anger, contempt, none, uncertain, and nonface. Among these emotions, “uncertain” means that the given image cannot be classified into one of the other categories, and “nonface” means that the image contains exaggerated expressions, animations, drawings, or watermarks. Mollahosseini et al . 15 hired annotators to manually classify emotions defined in AffectNet. In addition, AffectNet is heavily imbalanced in terms of the number of images of each emotion category. For example, the number of images representing “happy” is almost 30 times higher than the number of images representing “disgust”. The number of images for each category is shown in Table 1 . Figure 1 shows sample images for the 11 emotions contained in AffectNet. In this study, we use seven categories, surprise, fear, disgust, anger, sadness, happiness and neutrality, in AffectNet.

Image categories of the faces contained in the AffectNet database 12 .
The RAF-DB is provided by the Pattern Recognition and Intelligent System Laboratory (PRIS Lab) of the Beijing University of Posts and Telecommunications 39 . The database consists of more than 300,000 facial images sourced from the internet, which are classified into seven categories: surprise, fear, disgust, anger, sadness, happiness and neutrality. Each of the images contains 5 accurate landmark locations and 37 automatic landmark locations. The RAF-DB also contains a wide variety of information in terms of ages, races, head gestures, light exposure levels and blocking. The training set contains five times as many images as the test set. Figure 2 shows sample images for the seven emotions contained in the RAF-DB. Table 1 shows the number of images used in this article for each emotion from each database.

Image categories of the faces contained in the RAF-DB database 37 .
SENet is a new image recognition architecture developed in 2017 40 . The network reinforces critical features by comparing the correlations among feature channels to achieve increased classification accuracy. Figure 3 shows the SENet architecture, which contains three major operations. The squeeze operation extracts global feature information from the previous convolution layer and conducts global average pooling on the feature map to obtain a feature tensor Z of size 1 × 1 × C (the number of channels), in which the c-th element is calculated by

$$z_{c} = F_{sq}(u_{c}) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_{c}(i, j), \qquad (1)$$

where \(F_{sq}\) is the global average pooling operation, \(u_{c}\) is the c-th 2-dimensional feature map, W × H represents the spatial dimensions of each channel, and C is the number of channels.

The schema of the SENet inception module.
Equation (1) is followed by two fully connected layers. The first layer reduces the number of channels from C to C/r to reduce the required number of computations (r is the compression rate), and the second layer increases the number of channels back to C. The excitation operation is defined as follows:

$$s = F_{ex}(z, W) = \sigma\left(W_{2}\, \delta(W_{1} z)\right), \qquad (2)$$

where \(\sigma\) is the sigmoid activation function, \(\delta\) is the rectified linear unit (ReLU) activation function, and \(W_{1}\) and \(W_{2}\) are the weights for reducing and increasing the dimensionality, respectively.
The scale operation multiplies the feature tensor by the excitation. This operation captures the significance of each channel via feature learning. The corresponding channel is then multiplied by the gained weight to discern the major and minor information for the computer 38 . The formula for the scale operation, which is used to obtain the final output of the block, is as follows:

$$\tilde{x}_{c} = F_{scale}(u_{c}, s_{c}) = s_{c} \cdot u_{c}, \qquad (3)$$

where the dot is the channelwise multiplication operation and \(s_{c}\) is the output of the excitation operation.
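A compact PyTorch sketch of the three operations above (squeeze, excitation, scale) is given below. This is an illustrative reimplementation based on Eqs. (1)-(3), not the authors' code; the reduction ratio r = 16 is a common default and is an assumption here.

```python
import torch
from torch import nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block: global average pooling (squeeze), two
    fully connected layers with ReLU and sigmoid (excitation), and
    channelwise rescaling of the input feature map (scale)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # C -> C/r
        self.fc2 = nn.Linear(channels // reduction, channels)  # C/r -> C

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                                # squeeze, Eq. (1)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # excitation, Eq. (2)
        return u * s.view(n, c, 1, 1)                         # scale, Eq. (3)

# Quick shape check on a random feature map.
x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```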
ResNet was proposed by He et al. 14 to solve the vanishing gradient problem in deep networks. ResNet introduces a residual block to a conventional CNN. Figure 4 shows the residual block in the ResNet architecture. The concept of a residual block is to combine the output from the previous convolutional layer with that of the next convolutional layer in the ResNet. Several studies have shown that residual blocks relieve the vanishing gradient issue encountered by deeper networks, and they have therefore been adopted in several architectures 37 , 38 .

Residual block of the ResNet architecture.
SE-ResNet combines the SENet and ResNet architectures presented above and adds the SE block from SENet to ResNet. The SE block is used to capture the significance of each channel to determine whether it contains major or minor information. The feature information from the previous convolutional layer is then combined with the next layer by the residual block. This method can mitigate the decreasing accuracy caused by the vanishing gradient problem that occurs while increasing the network layers. Figure 5 shows the network architecture of SE-ResNet.

The schema of the SE-ResNet module.
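Following this description, the sketch below inserts the SE stage into a standard ResNet basic block, applying the channel reweighting before the residual addition. This placement follows the common SE-ResNet design; it is an illustration rather than the exact architecture used in this study, and the layer sizes are placeholders.

```python
import torch
from torch import nn

class SEBasicBlock(nn.Module):
    """ResNet basic block with a squeeze-and-excitation stage applied to the
    block output before the residual (shortcut) addition."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, reduction: int = 16):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Compact SE stage: global pooling (squeeze) and two 1x1 convolutions
        # acting as the fully connected excitation layers.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, 1), nn.Sigmoid(),
        )
        # 1x1 projection on the shortcut when the shape changes.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out * self.se(out)                   # channelwise reweighting
        return torch.relu(out + self.shortcut(x))  # residual connection

print(SEBasicBlock(64, 128, stride=2)(torch.randn(1, 64, 56, 56)).shape)
```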
Experimental method
In this study, we extracted seven categories from AffectNet to ensure that AffectNet and the RAF-DB were validated with identical categories. The SE-ResNet architecture was adopted as the neural network model for training and testing. A comparison and cross-database validation were conducted between RAF-DB and AffectNet. To achieve better performance, the transfer learning technique was used. The model trained on AffectNet was applied as the pretrained model to train RAF-DB.
The feature maps derived from each SE block were printed to determine which facial landmarks contain major information for the network. Only facial emotion images with softmax score exceeding 90% were adopted to ensure objectivity and accuracy. Examples of the feature maps printed from AffectNet are shown in Fig. 6 . The feature maps printed from the RAF-DB are shown in Fig. 7 .

Feature maps of different SE block layers (AffectNet).

Feature maps of different SE block layers (RAF-DB).
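One way to reproduce this inspection is to register forward hooks on the SE blocks and keep only the images whose top softmax score exceeds 0.9. The sketch below is a generic illustration; the stand-in model and the modules being hooked are assumptions rather than the authors' exact pipeline.

```python
import torch
from torch import nn
import torch.nn.functional as F

def collect_feature_maps(model, hooked_modules, x, threshold=0.9):
    """Run one forward pass, record the output of each hooked module, and keep
    the maps only for samples whose top softmax score exceeds `threshold`."""
    maps, handles = {}, []
    for name, module in hooked_modules.items():
        handles.append(module.register_forward_hook(
            lambda mod, inp, out, name=name: maps.__setitem__(name, out.detach())))
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
    for h in handles:
        h.remove()
    confident = probs.max(dim=1).values > threshold
    return {name: fmap[confident] for name, fmap in maps.items()}, confident

# Toy usage with a small stand-in network; in the study the hooks would sit on
# the SE blocks of SE-ResNet and x would be batches of 224x224 face crops.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 7))
maps, keep = collect_feature_maps(model, {"conv": model[0]}, torch.randn(4, 3, 32, 32))
print(keep, {k: v.shape for k, v in maps.items()})
```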
In this experiment, the training hardware was an NVIDIA TITAN RTX 24-GB GPU. The input image size was 256 × 256 pixels, with data augmentation. For training, the tones of the input images were changed, the images were randomly rotated within ±30 degrees, and each image was cropped at the four corners and the center into five images of size 224 × 224 pixels. For validation, the input images were center-cropped to a final size of 224 × 224 pixels. The optimization algorithm and loss function were stochastic gradient descent and the cross-entropy loss, respectively. Twenty epochs were used, the initial learning rate was set to 0.01, the momentum was 0.9, and the training batch size was 100.
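The training configuration above translates roughly into the following torchvision/PyTorch setup. The exact tone-change and five-crop handling is not specified in code here, so the transform choices (ColorJitter for the tone change, a single random crop in place of the five-crop expansion, and its jitter values) are simplifying assumptions.

```python
import torch
from torch import nn
from torchvision import models, transforms

# Training-time augmentation: tone change, random rotation within +/-30 degrees,
# then a 224x224 crop from the 256x256 input. The paper expands each image into
# five corner/center crops; a single random crop is used here for simplicity.
train_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(30),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])
# Validation: center crop only.
val_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Stand-in backbone (the study uses SE-ResNet-18), SGD with lr 0.01 and
# momentum 0.9, and cross-entropy loss; batch size 100 and 20 epochs would be
# set in the DataLoader and training loop.
model = models.resnet18(num_classes=7)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```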
Results and discussion
Cross-database validation.
The AffectNet dataset and the RAF-DB were cross-database validated in this study: the model trained on AffectNet was used to predict the RAF-DB, and the model trained on the RAF-DB was used to predict AffectNet. The results are shown in Table 2. Because AffectNet exhibits more diversity in its facial emotion data and contains more images, the model trained on AffectNet achieved an accuracy of 77.37% on the RAF-DB, significantly higher than the accuracy achieved by training directly on the RAF-DB (65.67%). In contrast, a low accuracy (42.6%) was obtained on AffectNet by the model trained on the RAF-DB. The difference can be attributed to the fact that the images in AffectNet are both more numerous and more complex.
The accuracies achieved on AffectNet and RAF-DB by SE-ResNet were also compared in this study. RAF-DB yields a higher accuracy than AffectNet, as shown in Table 3. However, this was expected, since the RAF-DB dataset contains more constrained images. The accuracy of the proposed model on AffectNet is 56%, which is slightly lower than the 58% accuracy reported in the original paper 15 that proposed AffectNet. However, as mentioned in that paper 15 , the agreement between two human annotators was 60% over 36,000 images, and our result is comparable to this agreement rate.
Additionally, we performed transfer learning by pretraining the model on AffectNet and then training on the RAF-DB. As shown in Table 4, the validation accuracy on the RAF-DB increased by 26.95% relative to the model trained directly on the RAF-DB ((83.37 - 65.67) / 65.67 × 100%). Compared to the accuracy of 76.73% obtained in 21 by a multi-region ensemble CNN, transfer learning with a single network performs better than the ensemble CNN that utilizes global and local features. This result indicates that AffectNet provides useful pretrained weights because of the wide diversity of the dataset. The diverse cultural and racial backgrounds of the images in AffectNet provide a more representative and inclusive training set, leading to a more robust and accurate recognition system. The result highlights the significance of considering data diversity and transfer learning in the development and deployment of FER algorithms.
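The transfer-learning step (pretraining on AffectNet, then fine-tuning on the RAF-DB) follows the usual weight-reuse recipe sketched below. The checkpoint path and the plain ResNet-18 constructor are placeholders, not artifacts released with this study.

```python
import torch
from torch import nn
from torchvision import models

# Stand-in backbone; the study uses SE-ResNet-18, whose constructor would be
# substituted here.
model = models.resnet18(num_classes=7)

# Load weights learned on AffectNet (hypothetical checkpoint path), then
# fine-tune all layers on the RAF-DB with the same 7-way classification head.
state = torch.load("affectnet_pretrained.pth", map_location="cpu")
model.load_state_dict(state)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()
# ... standard training loop over the RAF-DB training set ...
```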
The normalized confusion matrices predicted by the model trained on AffectNet for AffectNet and RAF-DB are shown in Fig. 8 a and b, respectively. The normalized confusion matrix predicted by the model after transfer learning on the RAF-DB is given in Fig. 8 c. Figure 8 a and b show that the model tends to falsely classify images as “neutral”, which suggests that the discriminative features learned from AffectNet are similar between “neutral” and the other categories. Moreover, the comparison between Fig. 8 b and c shows that, after transfer learning, the model classifies the emotions in the RAF-DB more accurately and more evenly.

Normalized confusion matrix for AffectNet and RAF-DB ( a ) AffectNet, ( b ) RAF-DB and ( c ) RAF-DB with pretrained model.
It can be seen from the normalized confusion matrices that the classification accuracy is positively correlated with the number of images in the dataset, as given in Table 1. In Fig. 8 a, the AffectNet dataset contains the fewest “disgust” images, which results in the lowest accuracy in the normalized confusion matrix, whereas the “happy” category has the most images in AffectNet and therefore yields the highest accuracy. The same conclusion can be drawn from Fig. 8 b and c for RAF-DB.
Feature maps
This study examines the important features that the network learns in order to classify facial emotions. The feature maps in AffectNet with softmax scores (P) exceeding 90% are visualized in Fig. 9. They show that the mouth, nose, and other facial lines carry the major information, while the eyes and ears carry minor information. This is similar to the finding of Beaudry et al. 35 that the mouth is the major landmark when the neural network predicts a happy emotion. The feature maps of misclassified images are also visualized in Fig. 10 for comparison with those that were correctly classified. The important features in the misclassified images are similar to those in the correctly classified images. It can be observed from Figs. 9 and 10 that the network tends to detect edges and lines in the shallow layers and focuses more on local features, such as the mouth and nose, in the deeper layers.

Feature maps with a softmax score greater than 90% (AffectNet).

Misclassified feature maps (AffectNet).
Asian facial emotion
The Asian facial emotion dataset 41 consists of images of 29 actors aged from 19 to 67 years old. The images were taken from frontal, 3/4 sideways, and sideways angles. Figure 11 shows some example images from the Asian facial emotion dataset, and the number of images in each class is given in Table 5. Only six categories are labeled in this dataset; the “neutrality” category is not provided. Therefore, in the output layer of the model, which was trained to predict the probabilities of 7 categories, the probability for “neutrality” was specified as zero.

Example images from the Asian facial emotion dataset 39 .
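Because the Asian facial emotion dataset lacks a “neutrality” class, the 7-way model's output for that class is fixed to zero at evaluation time. A minimal way to do this, assuming the neutral class sits at a known index, is sketched below; whether the remaining probabilities were renormalized is not stated in the text, so the renormalization here is an assumption.

```python
import torch
import torch.nn.functional as F

NEUTRAL_IDX = 6  # assumed position of the "neutrality" class in the 7-way head

def predict_without_neutral(logits: torch.Tensor) -> torch.Tensor:
    """Zero out the neutral-class probability and renormalize over the
    remaining six emotions before taking the argmax."""
    probs = F.softmax(logits, dim=1)
    probs[:, NEUTRAL_IDX] = 0.0
    probs = probs / probs.sum(dim=1, keepdim=True)
    return probs.argmax(dim=1)

print(predict_without_neutral(torch.randn(4, 7)))
```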
The Asian facial emotion dataset was tested with the model trained on AffectNet. The images were resized to 256 × 256 pixels and then cropped to 224 × 224 pixels with their faces centered. The derived average accuracy was 61.99%, which was slightly higher than that of AffectNet. Similar to the validation results of AffectNet, the “happy” category yielded the highest score, while “fear” and “disgust” had the lowest scores. The normalized confusion matrix is shown in Fig. 12 , and the feature maps are shown in Fig. 13 . In contrast with the feature maps of AffectNet, the discriminative locations were not centered around the mouth and nose but were located more on the right half of the face. It shows that the model lacked generalizability for Asian faces in the laboratory setting. This experiment shows that the model trained on AffectNet has limited prediction performance on other datasets.

Normalized confusion matrix produced for the Asian facial emotion dataset tested with the model trained on AffectNet.

Feature maps produced for the Asian facial emotion dataset.
The process of interpreting facial expressions is also subject to cultural and individual differences that are not considered by the model during the training phase. The feature maps in Figs. 9 and 10 show that the proposed model focused more on the mouth and nose but less on the eyes. To obtain correct FER results, subtle features such as wrinkles and the eyes may also be critical; however, the proposed model does not capture features that are far from the mouth or nose. The test results obtained on the Asian facial emotion dataset show that the discriminative regions are skewed toward the right half of the face, which indicates the limited generalizability of the model to Asian faces in the laboratory setting. Although AffectNet is a diverse dataset containing representations from various cultures and races, it still covers only a tiny portion of the global population. In contrast, the RAF-DB contains ethnic groups and settings similar to those in AffectNet. The validation accuracy obtained on the RAF-DB (77.37%) is better than that on the Asian facial emotion dataset. The results show that, for datasets with similar ethnic groups, a model trained on a more diverse and wilder dataset (AffectNet) yields better predictions on a more constrained dataset (the RAF-DB in this work).
This study addresses how the neural network model learns to identify facial emotions. The features displayed in emotion images were derived with a CNN, and these emotional features were visualized to determine the facial landmarks that contain major information. Conclusions drawn from the findings are listed below.
A cross-database validation experiment was conducted for AffectNet and RAF-DB. An accuracy of 77.37% was achieved when the RAF-DB was predicted by the model trained on AffectNet. The accuracy is comparable to the result in 21 . An accuracy of 42.6% was achieved when AffectNet was predicted by the model trained on RAF-DB. These results agree with the fact that AffectNet exhibits more diversity than RAF-DB in terms of facial emotion images. Moreover, transfer learning dramatically increases the accuracy by 26.95% for RAF-DB. The finding highlights the significance of using transfer learning to improve the performance of FER algorithms by training the associated models on AffectNet for pretrained weights.
The visualized emotion feature maps show that the mouth and nose contain the major information, while the eyes and ears contain minor information, when the neural network learns to perform FER. This paradigm is similar to how humans observe emotions.
When comparing the feature maps that were correctly classified (those with softmax scores exceeding 90%) with those that were incorrectly classified, it can be seen that the network model focuses on similar features with no major differences. This result indicates that FER requires the observation of large patches near distinctive areas on a face.
Data availability
The datasets applied in this study are available with authorization from the following websites for AffectNet ( http://mohammadmahoor.com/affectnet/ ), the Real-World Affective Faces Database (RAF-DB; http://www.whdeng.cn/raf/model1.html ) and the Asian facial emotion dataset ( http://mil.psy.ntu.edu.tw/ssnredb/logging.php?action=login ). However, restrictions apply to the availability of these data, which were used under license for the current study and thus are not publicly available. The data are, however, available from the authors upon reasonable request and with permission from AffectNet, the RAF-DB and the Asian facial emotion dataset. The training and analysis processes are discussed in the research methodology.
Vo, T. H., Lee, G. S., Yang, H. J. & Kim, S. H. Pyramid with super resolution for in-the-wild facial expression recognition. IEEE Access 8 , 131988–132001 (2020).
Mehrabian, A. Nonverbal communication (Aldine Transaction, 2007).
Ekman, P. Darwin, deception, and facial expression. Ann. N. Y. Acad. Sci. 1000, 205–2 (2006).
Farzaneh, A. H. & Qi, X. Facial expression recognition in the wild via deep attentive center loss in 2021 IEEE winter conference on applications of computer vision (WACV) 2401–2410 (IEEE, 2021).
Alnuaim, A. A. et al. Human-computer interaction for recognizing speech emotions using multilayer perceptron classifier. J. Healthc. Eng. 2022 , 6005446 (2022).
Kumari, H. M. L. S. Facial expression recognition using convolutional neural network along with data augmentation and transfer learning (2022).
Ekman, P., Dalgleish, T. & Power, M. Handbook of cognition and emotion (Wiley, 1999).
Ekman, P. Are there basic emotions?. Psychol. Rev. 99 , 550–553 (1992).
Russell, J. A. A circumplex model of affect. J. Pers. Soc. Psychol. 39 , 1161–1178 (1980).
Goodfellow, I. J. et al. Challenges in representation learning: A report on three machine learning contests in Neural information processing (eds. Lee, M., Hirose, A., Hou, Z. & Kil, R) 117–124 (Springer, 2013).
Maithri, M. et al. Automated emotion recognition: Current trends and future perspectives. Comput. Method Prog. Biomed. 215 , 106646 (2022).
Li, S. & Deng, W. Deep facial expression recognition: A survey. IEEE Trans. Affect. Comput. 13 , 1195–1215 (2022).
Canal, F. Z. et al. A survey on facial emotion recognition techniques: A state-of-the-art literature review. Inf. Sci. 582 , 593–617 (2022).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition in 2016 IEEE conference on computer vision and pattern recognition (CVPR) 770–778 (IEEE, 2016).
Mollahosseini, A., Hasani, B. & Mahoor, M. H. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 10 , 18–31 (2019).
Schoneveld, L. & Othmani, A. Towards a general deep feature extractor for facial expression recognition in 2021 IEEE international conference on image processing (ICIP) 2339–2342 (IEEE, 2021).
Rajan, V., Brutti, A. & Cavallaro, A. Is cross-attention preferable to self-attention for multi-modal emotion recognition? in ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) 4693–4697 (IEEE, 2022).
Zhuang, X., Liu, F., Hou, J., Hao, J. & Cai, X. Transformer-based interactive multi-modal attention network for video sentiment detection. Neural Process. Lett. 54 , 1943–1960 (2022).
Zhang, Y., Wang, C., Ling, X. & Deng, W. Learn from all: Erasing attention consistency for noisy label facial expression recognition in Lecture notes in computer science (eds. Avidan, S., Brostow, G., Cissé, M., Farinella, G. M. & Hassner T.) 418–434 (Springer, 2022).
Savchenko, A. V., Savchenko, L. V. & Makarov, I. Classifying emotions and engagement in online learning based on a single facial expression recognition neural network. IEEE Trans. Affect. Comput. 13 , 2132–2143 (2022).
Fan, Y., Lam, J. C. K. & Li, V. O. K. Multi-region ensemble convolutional neural network for facial expression recognition in Artificial neural networks and machine learning—ICANN 2018 (eds. Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L. & Maglogiannis, I.) 84–94 (Springer International Publishing, 2018).
Wang, Z., Zeng, F., Liu, S. & Zeng, B. OAENet: Oriented attention ensemble for accurate facial expression recognition. Pattern Recognit. 112 , 107694 (2021).
Schoneveld, L., Othmani, A. & Abdelkawy, H. Leveraging recent advances in deep learning for audio-Visual emotion recognition. Pattern Recognit. Lett. 146 , 1–7 (2021).
Hwooi, S. K. W., Othmani, A. & Sabri, A. Q. M. Deep learning-based approach for continuous affect prediction from facial expression images in valence-arousal space. IEEE Access 10 , 96053–96065 (2022).
Sun, L., Lian, Z., Tao, J., Liu, B. & Niu, M. Multi-modal continuous dimensional emotion recognition using recurrent neural network and self-attention mechanism in Proceedings of the 1st international on multimodal sentiment analysis in real-life media challenge and workshop 27–34 (ACM, 2020).
Allognon, S. O. C., de S. Britto, A. & Koerich, A. L. Continuous emotion recognition via deep convolutional autoencoder and support vector regressor in 2020 international joint conference on neural networks (IJCNN) 1–8 (IEEE, 2020).
Huang, C. Combining convolutional neural networks for emotion recognition in 2017 IEEE MIT undergraduate research technology conference (URTC) 1–4 (IEEE, 2017).
Mao, J. et al. POSTER V2: A simpler and stronger facial expression recognition network. arXiv preprint arXiv:2301.12149 (2023).
Le, N. et al. Uncertainty-aware label distribution learning for facial expression recognition in 2023 IEEE/CVF winter conference on applications of computer vision (WACV) 6088–6097 (IEEE, 2023).
Singh, S. & Prasad, S. V. A. V. Techniques and challenges of face recognition: A critical review. Proc. Comput. Sci. 143 , 536–543 (2018).
Kortli, Y., Jridi, M., Falou, A. A. & Atri, M. Face recognition systems: A survey. Sensors (Basel, Switzerland) 20 , 342 (2020).
Shirazi, M. S. & Bati, S. Evaluation of the off-the-shelf CNNs for facial expression recognition in Lecture notes in networks and systems (ed. Arai, K.) 466–473 (Springer, 2022).
Chen, D., Wen, G., Li, H., Chen, R. & Li, C. Multi-relations aware network for in-the-wild facial expression recognition. IEEE Trans. Circuits Syst. Video Technol. https://doi.org/10.1109/tcsvt.2023.3234312 (2023).
Heidari, N. & Iosifidis, A. Learning diversified feature representations for facial expression recognition in the wild. arXiv preprint arXiv:2210.09381 (2022).
Beaudry, O., Roy-Charland, A., Perron, M., Cormier, I. & Tapp, R. Featural processing in recognition of emotional facial expressions. Cogn. Emot. 28 , 416–432 (2013).
Bhattacharyya, A. et al. A deep learning model for classifying human facial expressions from infrared thermal images. Sci. Rep. 11 , 20696 (2021).
Alp, N. & Ozkan, H. Neural correlates of integration processes during dynamic face perception. Sci. Rep. 12 , 118 (2022).
Siddiqi, M. H. Accurate and robust facial expression recognition system using real-time YouTube-based datasets. Appl. Intell. 48 , 2912–2929 (2018).
Li, S., Deng, W. H. & Du, J. P. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild in 2017 IEEE conference on computer vision and pattern recognition (CVPR) 2584–2593 (IEEE, 2017).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks in 2018 IEEE/CVF conference on computer vision and pattern recognition 7132–7141 (IEEE, 2018).
Chen, C. C., Cho, S. L. & Tseng, R. Y. Taiwan corpora of Chinese emotions and relevant psychophysiological data-Behavioral evaluation norm for facial expressions of professional performer. Chin. J. Psychol. 55 , 439–454 (2013).
Acknowledgements
This work was funded in part by National Science and Technology Council (project number MOST 111-2635-E-242-001 -).
Author information
Authors and affiliations.
Department of Mechanical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
Zi-Yu Huang, Chia-Chin Chiang & Hsin-Lung Chung
Graduate Institute of Applied Physics, National Chengchi University, Taipei, Taiwan
Jian-Hao Chen & Hsiu-Chuan Hsu
Department of Occupational Safety and Hygiene, Fooyin University, Kaohsiung, Taiwan
Yi-Chian Chen
Department of Nursing, Hsin Sheng Junior College of Medical Care and Management, Taoyuan, Taiwan
Yu-Ping Cai
Department of Computer Science, National Chengchi University, Taipei, Taiwan
Hsiu-Chuan Hsu
Contributions
Z.-Y. Huang contributed to writing the manuscript. C.-C. Chiang contributed to overseeing and finalizing the paper. J.-H. Chen conducted all computations and contributed equally as the first author. Y.-C. Chen contributed to designing the research and editing the manuscript. H.-L. Chung contributed to editing the manuscript. Y.-P. C. assessed the emotion classification field and contributed to the literature review. H.-C. H. designed the study and provided conceptual guidance. All authors discussed and reviewed the manuscript.
Corresponding authors
Correspondence to Yi-Chian Chen or Hsiu-Chuan Hsu .
Ethics declarations
Competing interests.
The authors declare no competing interests.
About this article
Cite this article.
Huang, ZY., Chiang, CC., Chen, JH. et al. A study on computer vision for facial emotion recognition. Sci Rep 13 , 8425 (2023). https://doi.org/10.1038/s41598-023-35446-4
Received : 08 December 2022
Accepted : 18 May 2023
Published : 24 May 2023
DOI : https://doi.org/10.1038/s41598-023-35446-4

- Research Paper
- Published: 23 May 2023
Gaussian process regression based on deep neural network for reliability analysis in high dimensions
- Tong Zhou 1 , 2 &
- Yongbo Peng ORCID: orcid.org/0000-0002-8110-8536 1 , 3 , 4
Structural and Multidisciplinary Optimization volume 66 , Article number: 131 ( 2023 ) Cite this article
An efficient method is proposed to deal with high-dimensional reliability problems. The critical contribution lies in developing an advanced DGPR model, which effectively combines deep neural network (DNN)-based dimension reduction and Gaussian process regression (GPR) model. Specifically, the parameters of both the DNN and the GPR are calibrated through a joint-training scheme, and the network architecture of the DGPR is optimally selected via a grid-search scheme coupled with five-fold cross validation. In this regard, both the supervised extraction of low-dimensional latent space and the training of GPR in the latent space are intrinsically achieved by the DGPR. Then, an active learning strategy is adopted to combine the DGPR and the subset simulation for reliability analysis. To verify the efficacy of the proposed approach, three numerical examples are investigated and comparisons are made against other reliability methods. Numerical results demonstrate that the proposed approach gains reasonable computational cost savings whilst maintaining satisfactory accuracy of reliability results.
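The joint DNN-plus-GPR construction described in this abstract is in the spirit of deep kernel learning; a minimal sketch using PyTorch and GPyTorch is given below. It is not the authors' DGPR implementation: the layer widths, latent dimension, and optimizer settings are assumptions, and the grid-search/cross-validation scheme and the active-learning/subset-simulation components are omitted.

```python
import torch
from torch import nn
import gpytorch

class FeatureExtractor(nn.Sequential):
    """DNN that maps high-dimensional inputs to a low-dimensional latent space."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__(nn.Linear(in_dim, 64), nn.ReLU(),
                         nn.Linear(64, 32), nn.ReLU(),
                         nn.Linear(32, latent_dim))

class DeepGPR(gpytorch.models.ExactGP):
    """GP regression whose kernel acts on the DNN latent space; DNN weights and
    GP hyperparameters are trained jointly on the marginal likelihood."""
    def __init__(self, train_x, train_y, likelihood, in_dim, latent_dim=4):
        super().__init__(train_x, train_y, likelihood)
        self.feature_extractor = FeatureExtractor(in_dim, latent_dim)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=latent_dim))

    def forward(self, x):
        z = self.feature_extractor(x)
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z))

# Joint training on a toy high-dimensional problem.
train_x = torch.randn(100, 40)
train_y = torch.sin(train_x[:, :2].sum(dim=1)) + 0.05 * torch.randn(100)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = DeepGPR(train_x, train_y, likelihood, in_dim=40)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
model.train(); likelihood.train()
for _ in range(200):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```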
Acknowledgements
The supports of the Committee of Science and Technology of Shanghai China (Grant Nos. 22160713000; 21ZR1425500) and the Ministry of Science and Technology of China (Grant No. SLDRCE19-B-26) are highly appreciated.
Author information
Authors and affiliations.
State Key Laboratory of Disaster Reduction in Civil Engineering, Tongji University, Shanghai, 200092, People’s Republic of China
Tong Zhou & Yongbo Peng
College of Civil Engineering, Tongji University, Shanghai, 200092, People’s Republic of China
Shanghai Institute of Disaster Prevention and Relief, Tongji University, Shanghai, 200092, People’s Republic of China
Yongbo Peng
The Key Laboratory of Urban Security and Disaster Engineering of Ministry of Education, Beijing University of Technology, Beijing, 100124, People’s Republic of China
Corresponding author
Correspondence to Yongbo Peng .
Ethics declarations
Conflict of interest.
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Replication of results
The details of the proposed methodology and of the specific values of the parameters considered have been provided in the paper. Hence, we are confident that the results can be reproduced. Readers interested in the source code are encouraged to contact the authors by e-mail.
Additional information
Responsible editor: Zhen Hu
About this article
Cite this article.
Zhou, T., Peng, Y. Gaussian process regression based on deep neural network for reliability analysis in high dimensions. Struct Multidisc Optim 66 , 131 (2023). https://doi.org/10.1007/s00158-023-03582-0
Received : 29 September 2021
Revised : 14 April 2023
Accepted : 01 May 2023
Published : 23 May 2023
DOI : https://doi.org/10.1007/s00158-023-03582-0
- Reliability analysis
- Deep neural network
- Gaussian process regression
- Dimension reduction
- Subset simulation
- High dimensions