Valerie Goffaux, Judith Peters, Julie Haubrechts, Christine Schiltz, Bernadette Jansma, Rainer Goebel, From Coarse to Fine? Spatial and Temporal Dynamics of Cortical Face Processing, Cerebral Cortex, Volume 21, Issue 2, February 2011, Pages 467–476, https://doi.org/10.1093/cercor/bhq112
Abstract
Primary vision segregates information along 2 main dimensions: orientation and spatial frequency (SF). An important question is how this primary visual information is integrated to support high-level representations. It is generally assumed that the information carried by different SF is combined following a coarse-to-fine sequence. We directly addressed this assumption by investigating how the network of face-preferring cortical regions processes distinct SF over time. Face stimuli were flashed for 75, 150, or 300 ms and masked. They were filtered to preserve low SF (LSF), middle SF (MSF), or high SF (HSF). Most face-preferring regions robustly responded to coarse LSF face information in early stages of visual processing (i.e., until 75 ms of exposure duration). LSF processing decayed as a function of exposure duration (mostly until 150 ms). In contrast, the processing of fine HSF face information became more robust over time in the bilateral fusiform face regions and in the right occipital face area. The present evidence suggests the coarse-to-fine strategy as a plausible modus operandi in high-level visual cortex.
Introduction
Primary steps of human vision decompose the retinal input along 2 main dimensions: orientation and spatial frequencies (SF). This primary visual information is assumed to be combined in higher level visual regions located in inferior temporal cortex, yielding complex representations thought to underlie the perception of a rich and coherent environment. While there is extensive knowledge on the primary processing of SF in V1 (De Valois et al. 1982; Hess 2004), it is still not known how this primary visual information is integrated in higher level visual cortex.
A number of theoretical models assume that the visual system combines the information carried by different SF following a coarse-to-fine sequence (Marr 1982; Watt 1987; Bullier 2001; Bar 2007; see also Hochstein and Ahissar 2002). It is proposed that the coarse structure of a stimulus, which is carried by low SF (LSF), is processed before the fine details transmitted by high SF (HSF). For example, once the coarse structure of a face is detected, it would be used as an index into the fine facial structure. Such a strategy would be very efficient since the LSF structure provides a stable representation of the image before the noisier HSF structure is extracted. Electrophysiological evidence of such a coarse-to-fine scenario has been reported in V1 (Bredfeldt and Ringach 2002; Mazer et al. 2002; Frazor et al. 2004). Moreover, coarse-to-fine temporal dynamics have been described with a variety of stimuli, ranging from lines, dots, and gratings (Musselwhite and Jeffreys 1985; Parker and Dutch 1987; Watt 1987; Hughes et al. 1996; Mihaylova et al. 1999) to complex stimuli such as faces (McCarthy et al. 1999; Halit et al. 2006; Vlamings et al. 2009) or natural scenes (Parker et al. 1992, 1997; Schyns and Oliva 1994; Peyrin et al. 2006). Such dynamics have also been documented in other sensory modalities (Narayan et al. 2005; Sripati et al. 2006), suggesting that coarse-to-fine processing is a general property of signal processing in the brain (Allen and Freeman 2006; see Hegde 2008).
Within the visual domain, evidence for coarse-to-fine processing at high-level processing stages is, however, still lacking. The few past studies addressing coarse-to-fine processing in the human brain (Peyrin et al. 2005, 2010; Bar et al. 2006) did not explore the LSF over HSF processing precedence in high-level visual cortex (see Discussion). By manipulating exposure duration and SF content of filtered images, the present study investigated the differential contribution of SF during the build-up of the visual representation of complex stimuli, namely faces. Faces constitute an ideal visual category to tackle spatiotemporal dynamics of high-level vision. The ubiquity and social importance of faces in human life have pushed the visual system to adopt extremely fast and efficient strategies to extract face information. Moreover, several aspects suggest that face perception is more sensitive to SF than the visual processing of other complex visual categories (Biederman and Kalocsai 1997; Liu et al. 2000; Fiser et al. 2001; Goffaux et al. 2003; Collin et al. 2004; Yue et al. 2006; Williams et al. 2009). First, the integration of face cues into a global, so-called holistic, face representation relies on the processing of LSF face information (below 8 cycles per face, cpf; Collishaw and Hole 2000; Goffaux and Rossion 2006; Goffaux 2009; but see Cheung et al. 2008). Second, the extraction of face identity relies on intermediate SF situated around 12 cpf (e.g., Gold et al. 1999; Nasanen 1999; Tanskanen et al. 2005). Finally, the analysis of face local details is based on HSF (above 32 cpf; Goffaux and Rossion 2006).
Human functional magnetic resonance imaging (fMRI) evidence portrays higher level visual cortex as a mosaic of category-preferring regions tuned to global object properties (Lerner et al. 2001). In particular, the fusiform face area (FFA) responds more robustly to faces than other object categories (Sergent et al. 1992; Kanwisher et al. 1997). The FFA, especially in the right hemisphere (right fusiform face area [rFFA]), is thought to represent the identity of faces based on the robust integration of local cues in a so-called holistic representation (Schiltz and Rossion 2006; Goffaux et al. 2009). However, how primary visual information is combined to yield high-level face representations in the rFFA is an unanswered question.
Here, we compared the activation of face-preferring regions by faces that were filtered to selectively preserve LSF, middle SF (MSF), or HSF. These stimuli were presented for 75, 150, or 300 ms and subsequently masked (Figure 1). We observe that the processing of face information in most face-preferring regions, especially in rFFA, initially relies on LSF; with increasing exposure time, face-preferring regions attenuate LSF processing in favor of HSF processing. Our findings thus indicate the existence of a coarse-to-fine sequence of SF processing in face-preferring cortical regions. The ventral lateral occipital complex (LOC), a general-purpose high-level visual region encoding complex shape properties with no preference for any given visual category, failed to reveal such a coarse-to-fine sequential processing, suggesting that this scenario selectively applies to high-level, category-preferring visual regions.
Methods
fMRI Acquisition
Thirteen adult subjects (normal or corrected-to-normal vision; mean age 26 ± 4, 4 males, 2 left handed; no history of neurological disease) performed 2 scanning sessions on different days (spread over 2 weeks, on average). In this paper, we report the results of 2 experiments, namely, the localizer and the SF experiments. The order of experiments and runs was counterbalanced across subjects.
Imaging was performed on a 3-T head scanner at the University of Maastricht (Allegra, Siemens Medical Systems) equipped with a standard head coil. T2*-weighted echo-planar imaging was performed using blood oxygen level–dependent (BOLD) contrast as an indirect marker of local neuronal activity.
In the localizer experiment, twenty-five 3.5-mm oblique coronal slices were acquired (no gap, time repetition [TR] = 1500 ms, time echo [TE] = 28 ms, flip angle [FA] = 67°, matrix size = 64 × 64, field of view [FOV] = 224 mm, in-plane resolution 3.5 × 3.5 mm). Each subject performed 2 localizer runs of 265 TRs each (approximately 400 s).
In the SF experiment, twenty-one 3.5-mm oblique coronal slices (no gap, TR = 1250 ms, TE = 28 ms, FA = 67°) were acquired. Each subject performed 4 experimental runs of 690 TRs each (approximately 862.5 s).
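The approximate run durations quoted for the two experiments follow directly from the number of volumes and the TR. A minimal sketch of that arithmetic (the function name is only illustrative):

```python
def run_duration_s(n_volumes, tr_ms):
    """Duration of a functional run in seconds: number of volumes times TR."""
    return n_volumes * tr_ms / 1000.0


# Values taken from the acquisition description above.
localizer_run_s = run_duration_s(n_volumes=265, tr_ms=1500)  # quoted as approximately 400 s
sf_run_s = run_duration_s(n_volumes=690, tr_ms=1250)         # quoted as approximately 862.5 s
```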
A high-resolution T1-weighted anatomical data set encompassing the whole head was acquired in each session by means of a “modified driven equilibrium Fourier transform” sequence (TR = 2250 ms, TE = 26 ms, FA = 9°, matrix size = 256 × 256, FOV = 256 mm², 192 slices, slice thickness = 1 mm, no gap, total run time = 8 min 26 s).
Visual Stimulation
Visual stimuli were presented using Eprime 1.1 on a uniformly gray background. They were projected onto a translucent screen at the head of the scanner bore by means of a liquid crystal display projector and viewed by the subjects through a mirror placed within the radio frequency coil at a viewing distance of 57 cm. Stimulus size was 256 × 256 pixels. At a resolution of 1024 × 768 pixels, all stimuli subtended a visual angle of 5.8 × 5.8 degrees. Behavioral responses were collected during acquisition via a button box.
Face images were first normalized to obtain a global luminance with zero mean and a standard deviation (i.e., root mean square [RMS] contrast) equal to 1 using MatLab 7.5. Subsequently, filtered stimuli were generated by fast Fourier transforming the image and multiplying the Fourier energy with Gaussian filters. In the localizer experiment, stimuli were filtered using a broadband Gaussian filter (preserving information between 2 and 128 cycles per image, cpi, or 0.34–22 cycles per degree, cpd) in order to exclude SF below 2 cpi. In the main experiment (i.e., SF experiment), 2-octave-wide bandpass Gaussian filters were applied to the face images to filter the LSF (from 2 to 8 cpi or 0.34 to 1.35 cpd), MSF (from 8 to 32 cpi, 1.35 to 5.4 cpd), and HSF (from 32 to 128 cpi or 5.4 to 22 cpd; see Figure 1a).
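The filtering step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' MatLab code: the exact filter shape is not given in the text, so the sketch assumes a Gaussian gain on a log2-frequency axis with a full width at half maximum of 2 octaves, centered on 4, 16, or 64 cpi (the centers of the LSF, MSF, and HSF bands). All function names are hypothetical, and the image is assumed to be square.

```python
import numpy as np


def radial_frequency(size):
    """Radial spatial frequency (cycles per image) of every FFT bin of a square image."""
    f = np.fft.fftfreq(size) * size  # integer cycles per image along one axis
    fx, fy = np.meshgrid(f, f)
    return np.hypot(fx, fy)


def log_gaussian_bandpass(size, center_cpi, fwhm_octaves=2.0):
    """Gaussian gain on a log2-frequency axis (assumed filter shape, see lead-in)."""
    radius = radial_frequency(size)
    sigma = fwhm_octaves / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # FWHM -> standard deviation
    gain = np.zeros_like(radius)
    nonzero = radius > 0  # the DC gain stays 0, so the filtered image has zero mean
    octaves_from_center = np.log2(radius[nonzero] / center_cpi)
    gain[nonzero] = np.exp(-0.5 * (octaves_from_center / sigma) ** 2)
    return gain


def bandpass_filter(image, center_cpi, fwhm_octaves=2.0):
    """Multiply the image spectrum by the gain and transform back to image space."""
    gain = log_gaussian_bandpass(image.shape[0], center_cpi, fwhm_octaves)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * gain))


def cpi_to_cpd(cpi, image_deg=5.8):
    """Convert cycles per image to cycles per degree for a 5.8-degree-wide image."""
    return cpi / image_deg
```

With these assumptions, `bandpass_filter(image, 4.0)` would give an LSF version of a 256 × 256 image, and `cpi_to_cpd(2)` reproduces the 0.34 cpd lower bound quoted above.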
In natural images such as face or scene pictures, amplitude typically decays as a function of SF. This decay obeys 1/f^α with 0.7 < α < 2 (Field 1987; Tolhurst et al. 1992; see Figure 2). As a consequence, bandpass filters centered on lower versus higher ends of the SF spectrum will pass information of high versus low energy, respectively. Since we were interested in BOLD modulations related to high-level processing of different SF ranges, we avoided this potential confound by attributing the same global luminance and RMS contrast to LSF, MSF, and HSF images (intact or scrambled). This control is necessary since RMS contrast has been shown to be the best index for perceived contrast in natural images (Bex and Makous 2002) and to largely drive neural activation in the visual cortex (Boynton et al. 1996). Without any control of this parameter, one cannot ascertain that all SF are equally visible to the observer, thus severely hampering conclusions about spatial scale processing per se.
The phase of the face images was scrambled in the Fourier domain via random permutation, a procedure known to preserve orientation content (Dakin et al. 2002). To substantiate this point, Figure 3 highlights the high similarity of SF and orientation spectra of stimulus images before and after phase scrambling.
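A minimal sketch of phase scrambling follows. Note the substitution: instead of the random permutation of phases described above, this version borrows the phase spectrum of a real white-noise image, a common variant that keeps the result real valued and leaves the amplitude spectrum (and hence the SF and orientation content) unchanged. The function name and the use of NumPy are assumptions.

```python
import numpy as np


def phase_scramble(image, rng):
    """Replace the phase spectrum of `image` with that of a white-noise image.

    The noise image is real, so its phase spectrum has the conjugate symmetry
    needed for the inverse transform to be real as well.
    """
    spectrum = np.fft.fft2(image)
    noise_phase = np.angle(np.fft.fft2(rng.standard_normal(image.shape)))
    scrambled = np.abs(spectrum) * np.exp(1j * noise_phase)
    scrambled[0, 0] = spectrum[0, 0]  # keep the DC term, i.e., the mean luminance
    return np.real(np.fft.ifft2(scrambled))
```

Comparing `np.abs(np.fft.fft2(...))` before and after scrambling is a quick way to confirm that only the phase has changed.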
After the inverse Fourier transform, the mean (i.e., the global luminance) and standard deviation (i.e., global RMS contrast) of each image were adjusted to match the average global luminance and RMS contrast of the original image set (Figure 2). This procedure is conventionally used to warrant equal global luminance and RMS contrast values across SF conditions (e.g., Vlamings et al. 2009). Luminance (intact LSF: 0.52 ± 0.00003; intact MSF: 0.52 ± 0.00004; intact HSF: 0.52 ± 0.00001; scrambled LSF: 0.52 ± 0; scrambled MSF: 0.52 ± 0; scrambled HSF: 0.52 ± 0) and contrast values (intact LSF: 0.1 ± 0; intact MSF: 0.1 ± 0.009; intact HSF: 0.09 ± 0.002; scrambled LSF: 0.1 ± 0; scrambled MSF: 0.1 ± 0; scrambled HSF: 0.1 ± 0) were highly similar between the stimulus and SF conditions of the SF experiment and barely varied within conditions, indicating the high efficiency of our equalization procedure. Figure 2 further illustrates that equalization does not alter the SF spectral envelope. A 2-pixel light gray border surrounded all stimuli to minimize global shape differences between intact and scrambled stimuli.
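The equalization described above amounts to a linear rescaling of pixel values. A minimal sketch, assuming NumPy arrays with luminance in arbitrary units (the function name is hypothetical, and clipping to the displayable range is not handled here):

```python
import numpy as np


def equalize(image, target_mean, target_rms):
    """Rescale an image to a given global luminance (mean) and RMS contrast (SD).

    Assumes the image is not constant; a zero standard deviation would divide by zero.
    """
    centered = image - image.mean()
    return centered / centered.std() * target_rms + target_mean
```

For instance, `equalize(filtered, 0.52, 0.1)` would bring any filtered or scrambled image to the luminance and contrast values reported above.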
A localizer run comprised 16-s blocks of 20 gray-scale images: intact faces, intact cars, scrambled faces, or scrambled cars. Face pictures used in the localizer runs were not shown during the experimental runs. Within a block, each stimulus appeared for 600 ms at a random x,y position (±10 pixels away from screen center), followed by a blank screen of 200 ms. During each block, subjects performed a one-back matching task. Blocks were interleaved with 15-s fixation pauses. There were 3 blocks per condition per run.
The SF experiment was a slow event-related design comprising 18 different conditions: SF (LSF, MSF, HSF) × exposure (75, 150, 300 ms) × stimulus (intact, scrambled). All conditions were randomly interleaved within a run. There were 5 trials per condition per run and there were 4 runs in total, giving a total of 20 trials per condition. The start of a trial was announced by a transiently brighter fixation cross (average duration: 1685 ms). Either an intact or a scrambled face then appeared for 75, 150, or 300 ms, immediately followed by a Gaussian noise mask (duration: 300 ms; 256 × 256 pixels) to eliminate any retinally persisting image of the stimulus and to limit processing time to exposure duration (Keysers and Perrett 2002). To maximize masking, the SF content of the mask was adjusted to fit the stimulus center SF: squares of 64 × 64 pixels were used in LSF conditions (i.e., 4 cpi in a 256 × 256 pixel image), squares of 16 × 16 pixels in MSF conditions (i.e., 16 cpi), and squares of 4 × 4 pixels in HSF conditions (i.e., 64 cpi). Intact and scrambled conditions were matched for luminance, RMS contrast, and spectral composition; they were also matched with respect to the mask since different Gaussian masks were paired with different faces but were identical across intact and scrambled conditions. Our findings, which mostly rely on intact–scrambled comparisons across SF and exposure duration, thus cannot be due to divergent masking parameters. The mask was followed by a long fixation pause (8.125 s on average). Subjects had to perform an intact versus scrambled categorization task by pressing 1 of 2 buttons with their right index or middle fingers. Within a run, a given face appeared in both intact and scrambled versions. Over the 4 runs, all faces were equally often presented in the LSF, MSF, or HSF range. However, to avoid face-priming effects across SF, a given face appeared in only one SF range within a run.
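The factorial design and the mask geometry can be checked with a short standard-library sketch. The condition labels and square sizes come from the text; the mask generator is only an assumption about what a blocky Gaussian noise mask could look like (one Gaussian luminance value per square), and its `mean` and `sd` defaults are hypothetical.

```python
import itertools
import random

SPATIAL_FREQUENCIES = ("LSF", "MSF", "HSF")
EXPOSURES_MS = (75, 150, 300)
STIMULI = ("intact", "scrambled")
IMAGE_SIZE = 256  # pixels
MASK_SQUARE_PX = {"LSF": 64, "MSF": 16, "HSF": 4}

# Full crossing of the 3 factors gives the 18 conditions.
conditions = list(itertools.product(SPATIAL_FREQUENCIES, EXPOSURES_MS, STIMULI))
trials_per_condition = 5 * 4  # 5 trials per run, 4 runs


def squares_per_image(sf):
    """Number of mask squares across the image width (equated with cpi in the text)."""
    return IMAGE_SIZE // MASK_SQUARE_PX[sf]


def block_noise_mask(sf, rng, mean=0.52, sd=0.1):
    """Blocky noise mask: each square gets one Gaussian luminance value (assumed scheme)."""
    square = MASK_SQUARE_PX[sf]
    n = IMAGE_SIZE // square
    coarse = [[rng.gauss(mean, sd) for _ in range(n)] for _ in range(n)]
    return [[coarse[y // square][x // square] for x in range(IMAGE_SIZE)]
            for y in range(IMAGE_SIZE)]
```

Calling `squares_per_image` for the three ranges returns 4, 16, and 64, matching the center SF values given for the LSF, MSF, and HSF masks.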
Localizer Behavioral Performance
In the localizer experiment, hits and correct rejections in the one-back task were combined to compute a standard sensitivity estimate (d′) for each subject. One-back sensitivity was high in all conditions (intact faces: 3.9 ± 0.17; intact cars: 3.55 ± 0.23; scrambled faces: 3.25 ± 0.16; scrambled cars: 3.08 ± 0.22) but was significantly affected by category (faces vs. cars; F(1,11) = 7.07, P < 0.03) and stimulus (intact vs. scrambled; F(1,11) = 12.02, P < 0.007), as subjects performed less accurately for cars than faces and for scrambled than intact stimuli. There was no significant difference between face and car conditions when intact and scrambled conditions were considered separately (Ps > 0.07).
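For readers unfamiliar with d′, the standard definition is the difference between the z-transformed hit rate and false-alarm rate, where the false-alarm rate is 1 minus the correct-rejection rate. The sketch below uses only the Python standard library; the log-linear correction for rates of 0 or 1 is an assumption, since the text does not say how such cases were handled.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to each count) keeps both rates away
    from 0 and 1, where the z-transform is undefined (assumed correction).
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)
```

An observer who responds at chance (equal hit and false-alarm rates) obtains d′ = 0, and values above 3, as reported here, indicate near-perfect discrimination.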
fMRI Data Analyses
Functional and anatomical images were analyzed using BrainVoyager QX (version 1.10, Brain Innovation). The first 4 volumes were skipped to avoid T1 saturation effects. Functional runs then underwent several preprocessing steps: correction of interslice scan time differences, linear trend removal, temporal high-pass filtering (to remove frequencies lower than 3 cycles per time course), smoothing with a Gaussian kernel of 6-mm full width at half maximum, and correction for interscan head motion (translation and rotation of functional volumes to align them to a reference volume). Anatomical and functional data were spatially normalized to the Talairach coordinate system (Talairach and Tournoux 1988) with a resolution of 3 × 3 × 3 mm using sinc interpolation.
Individual regions of interest (ROIs) were isolated based on 2 localizer scans. The fMRI signal in the localizer runs was analyzed using a single-participant general linear model. The predictor time courses for stimulation blocks were constructed as box-car functions filtered through a linear model indirectly relating neural activity and BOLD response (Boynton et al. 1996). For anatomical reference, the statistical maps were overlaid on Talairach-normalized individual anatomical volumes. The areas responding preferentially to faces were defined independently for each participant by the (intact faces – intact cars) contrast. Significant voxel clusters on individual t maps were selected as ROIs for further analysis. Face-preferring voxel clusters were located in bilateral middle fusiform gyri (rFFA and left fusiform face area [lFFA]; selected at a q[false discovery rate, FDR] < 0.01), superior temporal sulci (STS; q[FDR] < 0.01), anterior inferotemporal cortex (AIT; q[FDR] < 0.05), and right inferior occipital gyrus (the right occipital face area [rOFA]; q[FDR] < 0.01). The left occipital face area (lOFA) was only found in 6 out of the 13 subjects and was discarded from the analyses. Right- and left-lateralized AIT activation foci were only found in 9 and 7 subjects, respectively, and were consequently collapsed in subjects showing bilateral foci (7 out of 9 subjects). We localized ventral LOC in both hemispheres in all the subjects using the contrast (intact cars – scrambled cars) at P(Bonferroni) < 0.001. To ascertain that the LOC ROIs also process face information, individual z-scored beta weights from rLOC and lLOC were extracted in each condition of the localizer experiment and submitted to a repeated-measure analysis of variance (ANOVA) with stimulus (intact, scrambled) and category (face, car) as factors. Afterward, post hoc Fisher's least significant difference (LSD) tests were used to compare conditions 2 × 2. We found a significant intact–scrambled difference for each category (Ps < 0.0002).
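The box-car predictors mentioned above can be illustrated with a small standard-library sketch. The gamma-shaped impulse response follows the general form used by Boynton et al. (1996), but the parameter values (`n`, `tau`, `delay`), the 0.5-s time grid, and the block onsets in the usage note are assumptions chosen for illustration, not the settings used by BrainVoyager or the authors.

```python
import math

DT = 0.5  # s, fine time grid used to build the predictor
TR = 1.5  # s, localizer repetition time


def gamma_hrf(duration=20.0, n=3, tau=1.25, delay=2.5):
    """Gamma-shaped hemodynamic impulse response, normalized to unit sum."""
    samples = []
    for i in range(int(duration / DT)):
        t = i * DT - delay
        if t <= 0:
            samples.append(0.0)
        else:
            samples.append((t / tau) ** (n - 1) * math.exp(-t / tau)
                           / (tau * math.factorial(n - 1)))
    total = sum(samples)
    return [value / total for value in samples]


def boxcar(onsets_s, block_s, run_s):
    """1 during stimulation blocks, 0 elsewhere, sampled every DT seconds."""
    return [1.0 if any(on <= i * DT < on + block_s for on in onsets_s) else 0.0
            for i in range(int(run_s / DT))]


def convolve(signal, kernel):
    """Causal discrete convolution truncated to the length of the signal."""
    out = [0.0] * len(signal)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, k in enumerate(kernel):
            if i + j < len(out):
                out[i + j] += s * k
    return out


def predictor(onsets_s, block_s=16.0, run_s=120.0):
    """Box-car convolved with the HRF, then sampled once per TR."""
    fine = convolve(boxcar(onsets_s, block_s, run_s), gamma_hrf())
    return fine[::int(round(TR / DT))]
```

For example, `predictor([15.0, 46.0])` models two 16-s blocks separated by a 15-s fixation pause; the predicted response rises a few seconds after block onset and plateaus near 1.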
Talairach coordinates of ROIs were consistent with previous studies (see Table 1).
Table 1. Mean Talairach coordinates (± standard error) of face-preferring and ventral LOC voxel clusters
ROI | N | Mean x | Mean y | Mean z | No. of voxels
rFFA | 12 | 39 ± 1 | –42 ± 2 | –19 ± 1 | 517 ± 155
lFFA | 12 | –39 ± 1 | –44 ± 2 | –19 ± 1 | 346 ± 108
rOFA | 9 | 34 ± 2 | –75 ± 3 | –9 ± 2 | 418 ± 195
rSTS | 10 | 51 ± 1 | –46 ± 3 | 5 ± 1 | 840 ± 219
lSTS | 10 | –52 ± 2 | –52 ± 2 | 8 ± 2 | 713 ± 255
rAIT | 9 | 37 ± 3 | 8 ± 3 | –22 ± 2 | 454 ± 273
lAIT | 7 | –34 ± 2 | 3 ± 4 | –25 ± 1 | 71 ± 35
rLOC | 13 | 39 ± 1 | –71 ± 1 | –12 ± 1 | 1795 ± 458
lLOC | 13 | –38 ± 1 | –75 ± 1 | –12 ± 1 | 604 ± 167
We extracted individual z-scored beta weights from these individual ROIs for each condition of the SF experiment. Beta weights were subjected to a repeated-measure ANOVA with stimulus (intact, scrambled), SF (LSF, MSF, HSF), and exposure duration (75, 150, 300 ms) as factors. Post hoc Fisher’s LSD tests were used to compare conditions 2 × 2.
Scrambled conditions were used as control conditions, from which no face representation can be extracted despite identical luminance, RMS contrast, and SF spectrum (Figure 3). To gain more insight into high-level visual processing, we compared ROI activation in intact and scrambled conditions across SF and exposure conditions in 2 ways. First, since all face-preferring ROIs except the lOFA responded more strongly to intact faces than scrambled faces shown in the SF experiment, we ran separate ANOVAs for intact and scrambled conditions with SF (low, middle, high) and exposure duration (75, 150, 300 ms) as within-subject factors. Second, we directly compared ROI activation in intact and scrambled conditions using planned comparisons. We estimated the magnitude of this difference using partial eta squared (partial η²). Partial η² quantifies the percentage of variance explained by a given factor (here, stimulus) when excluding the contribution of intersubject variance. Partial η² was used to estimate the percentage of BOLD variance related to the processing of face information across SF and time while avoiding unwarranted computations of face-related activation (Baker et al. 2007; Simmons et al. 2007).
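Partial eta squared is defined as the effect sum of squares divided by the sum of the effect and error sums of squares, which can equivalently be computed from an F value and its degrees of freedom. A minimal sketch (function names are illustrative):

```python
def partial_eta_squared_from_ss(ss_effect, ss_error):
    """Partial eta squared from the effect and error sums of squares."""
    return ss_effect / (ss_effect + ss_error)


def partial_eta_squared(f_value, df_effect, df_error):
    """Same quantity recovered from an F statistic and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)
```

As a worked example, the main effect of stimulus in rFFA reported in the Results (F = 18.2 with 1 and 11 degrees of freedom) corresponds to `partial_eta_squared(18.2, 1, 11)`, that is, roughly 62% of the variance once intersubject variance is set aside.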
In addition to these ROI analyses, we performed a random-effects (RFX) whole-brain analysis by computing (intact face – scrambled face) contrasts for each SF and duration (see Supplementary Data 2). We restricted this analysis to the subspace of all subjects’ brains resulting from intersecting the scanned functional volumes.
Results
In a slow event-related design, subjects viewed intact and scrambled faces that were filtered to preserve only LSF, MSF, or HSF. Intact and scrambled faces were presented at 3 different exposure durations (75, 150, 300 ms), immediately followed by a Gaussian mask (see Figure 1a). Subjects performed an easy intact–scrambled categorization task, which yielded comparable accuracy across SF conditions. This allowed us to avoid potential confounds (e.g., attentional, decisional, and/or motor load) due to task difficulty. Performance accuracy in intact–scrambled categorization was at ceiling and was not influenced by SF, exposure, or stimulus factors (Figure 1b). Correct response times (computed with respect to stimulus onset) were shorter for intact than scrambled conditions (F(1,12) = 11.2, P < 0.006), and they significantly increased at 300-ms exposure duration compared with 75- and 150-ms exposure conditions (F(2,24) = 18.3, P < 0.0001).
Furthermore, all conditions were randomly interleaved within a run, ruling out SF differences in terms of cognitive strategies as alternative accounts of our findings. In addition, all conditions were perfectly matched with respect to masking parameters and physical properties of the stimuli (i.e., luminance, RMS contrast, orientation, and SF composition, see Methods) such that our findings are not influenced by low-level visual processing differences and therefore can be related to the high-level processing of face information.
Coarse-to-Fine Processing in the rFFA
Individual rFFAs were defined based on an independent localizer and a standard comparison of activations between faces and cars (see Methods). The omnibus ANOVA revealed a significant main effect of stimulus as intact faces induced larger rFFA activity than scrambled faces (F(1,11) = 18.2, P < 0.001; Figure 4a).
In intact conditions, exposure duration significantly interacted with SF (F(4,44) = 2.8, P < 0.03). At 75 ms of exposure, HSF faces induced a weaker response than LSF and MSF faces (Ps < 0.05). However, this pattern reversed for 150-ms exposure as the weakest activation was observed for LSF as compared with MSF and HSF faces (Ps < 0.05). In contrast, there was no difference between SF with the 300-ms-long stimuli.
These findings indicate different temporal dynamics of SF processing in rFFA. While LSF processing was initially strong and attenuated at 150 ms of exposure, HSF processing increased with exposure time. Polynomial contrasts showed a quadratic trend for activations induced by LSF stimuli across time (P < 0.02), confirming the strong attenuation at intermediate exposure duration. In contrast, a linear trend was found for HSF processing over exposure duration (P < 0.04). Importantly, none of these trends were significant in scrambled conditions (Ps > 0.2), suggesting that they specifically relate to the processing of complex and structured face information (Figure 4a).
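The trend analysis can be sketched as follows. With 3 levels of exposure duration, the linear and quadratic polynomial contrasts use the weights (−1, 0, 1) and (1, −2, 1); each subject's betas are reduced to a contrast score that is then tested against zero. The equal-spacing weights and the one-sample t statistic are assumptions about how the trends were computed, and no data from the study are used here.

```python
import math
from statistics import mean, stdev

LINEAR = (-1, 0, 1)     # monotonic change across the 3 exposure durations
QUADRATIC = (1, -2, 1)  # dip or bump at the intermediate duration


def contrast_scores(betas_by_subject, weights):
    """One weighted sum per subject; each row holds the betas at 75, 150, and 300 ms."""
    return [sum(w * b for w, b in zip(weights, betas)) for betas in betas_by_subject]


def one_sample_t(scores):
    """t statistic testing whether the mean contrast score differs from zero."""
    return mean(scores) / (stdev(scores) / math.sqrt(len(scores)))
```

A U-shaped profile such as the LSF response (strong at 75 ms, attenuated at 150 ms) yields positive quadratic scores, whereas a steadily increasing profile such as the HSF response yields positive linear scores.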
In order to estimate the magnitude of BOLD response related to the processing of complex face information, we directly compared intact and scrambled conditions, in each SF and exposure condition, via planned comparisons (see Supplementary Table 1 and Figure 4b) and computed the effect size (partial η²; see Methods) of this difference.
When stimuli were presented for 75 ms, the intact–scrambled difference was significant in LSF and MSF (P < 0.002 and P < 0.03, respectively) but not in HSF (P = 0.08; see Figure 4b). Even though significant in both LSF and MSF, the intact–scrambled difference of activation was almost twice as large in LSF (60% of rFFA signal variance) as in MSF (36%). At 150 ms, this pattern strikingly reversed as the intact–scrambled difference was significant in MSF and HSF (Ps < 0.008) but not in LSF (P = 0.06). Effect sizes reveal that HSF face processing explained 68% of rFFA signal variance, while signal variance related to MSF face processing was approximately 48%. The contribution of LSF at 150 ms was marginal and half as strong as in the 75-ms duration condition. After an exposure of 300 ms, the intact–scrambled difference was significant in every SF (LSF: P < 0.03; MSF: P < 0.0008; HSF: P < 0.0003). Yet, MSF and HSF each accounted for twice as much rFFA response variance as LSF.
These results indicate that the contribution of SF to rFFA face processing dynamically changes over time. At the shortest exposure duration, the processing of face information is strongest in LSF. At longer exposures, LSF processing decreases, whereas face processing in MSF and HSF gets more robust. The use of scrambled controls allows us to conclude that the bias observed in SF processing over time is related to high-level representations, here faces, and not to more general or low-level aspects of SF processing (see also averaged time course of rFFA activity; Figure 4c).
Processing of SF over Time in Other Face-preferring Regions
The above analyses focused on rFFA, which is the main cortical site assumed to be involved in the holistic processing of face identity (Schiltz and Rossion 2006). Yet, besides rFFA, other face-preferring regions have proven essential for normal face perception (Haxby et al. 2000; Rossion et al. 2003). Using the same “faces minus cars” contrast as for rFFA, face-preferring regions were individually localized in the left FFA (lFFA) as well as in bilateral STS, OFA, and AIT (Kriegeskorte et al. 2007; Rajimehr et al. 2009; see Methods). Since left OFA failed to show a significantly larger response to intact than scrambled faces in the SF experiment, it was discarded from the subsequent analyses. Full statistical analyses are presented in Supplementary Data 1.
The processing of LSF face information engaged most face-preferring regions (see Supplementary Table 1, Fig. 3a) at short exposure duration. At longer exposure durations (150 and 300 ms), the LSF intact–scrambled differential response was only significant in lFFA (in addition to above-mentioned rFFA). Though significant, the bilateral FFA response to LSF face information was weaker at 150 and 300 ms than at the 75-ms exposure duration. These results support the coarse-to-fine hypothesis of visual processing in FFA, which assumes that LSF processing decays over time, in favor of finer-grained processing. Our results indicate that LSF face processing mainly decayed from 75 to 150 ms in bilateral FFA and largely stabilized after 150 ms of processing. Interestingly, rOFA did not engage in the processing of face information based on LSF, at any exposure duration.
Neural activation related to MSF face processing was robust in bilateral FFA and rOFA, at all durations (Ps < 0.02; Figure 5a). The temporal dynamics of MSF processing in these regions was mixed. In rFFA, MSF processing steadily increased from 75 to 300 ms, suggesting the progressive accumulation of face identity cues over time in this region. In the lFFA, MSF processing mainly increased from 150 to 300 ms of exposure. In contrast, MSF processing decreased from 75 to 150 ms of exposure in rOFA.
In contrast to LSF and MSF, activations to HSF faces mainly spread across face-preferring regions over time; at 75 ms, the intact–scrambled difference was only significant in lFFA; at 150 ms, the intact–scrambled differential response extended to rFFA (see above); and at 300 ms, it was significant also in rOFA. Effect size estimates suggest that HSF processing temporal dynamics differed across these regions. In rFFA, the processing of HSF face content became more robust from 75 to 150 ms of exposure duration, whereas it mainly strengthened from 150 to 300 ms of exposure duration in rOFA and lFFA.
Bilateral AIT failed to show a coarse-to-fine profile over time. Actually, the intact–scrambled contrast was only significant for brief LSF stimuli. Intact and scrambled conditions did not differ in any other condition. This finding indicates that anterior face-preferring clusters of the ventral pathway are mostly responsive to brief and coarse input. As for left STS, it mainly activated to short MSF stimuli; it did not reveal any trend for coarse-to-fine processing dynamics.
Coarse-to-fine models of vision predict that processing resources dedicated to the processing of LSF input initially dominate but then progressively decrease, while they become increasingly devoted to the processing of finer spatial scales over processing time. Our findings largely corroborate this view as most face-preferring regions disclosed coarse-to-fine temporal dynamics (see Supplementary Table 1). Neural activity to LSF was strong in early stages of visual processing but decayed as a function of time (mostly until 150 ms of processing). Moreover, the processing of HSF face information strengthened at different temporal intervals depending on the region. In contrast, neural responses to MSF face information were strong in bilateral FFA and rOFA, already at the shortest exposure duration.
These findings were confirmed by a whole-brain analysis of intact–scrambled differential activations (see Supplementary Data 2).
No Coarse-to-Fine Processing in Ventral LOC
Do the spatiotemporal processing dynamics observed in the face-preferring network, especially in rFFA, apply to high-level, noncategory-preferring, visual regions? To answer this question, the ventral LOC was localized using an “intact cars minus scrambled cars” contrast in each individual subject (see Methods). This region is a more general-purpose high-level visual area as it responds to any shape with no preference for a given category (Malach et al. 1995). As a matter of fact, there was no difference of activation between intact faces and cars in bilateral LOC (Ps > 0.2).
As expected, lLOC and rLOC responded more strongly to intact than scrambled faces in the SF experiment (rLOC: F(1,12) = 27.3, P < 0.0002; lLOC: F(1,12) = 22.63, P < 0.0005; see Figure 5b). Both regions were largely driven by MSF and HSF at all exposure durations. In contrast to face-preferring regions, the BOLD response to LSF was not larger than the response to HSF in initial stages of processing (see Supplementary Table 1).
In the face-preferring network, both the whole-brain and the ROI analyses revealed that distinct SF were processed at different time points during the processing of face information (see Supplementary Figure 1 and Supplementary Table 2). Specifically, LSF processing was initially strong but was progressively attenuated, while BOLD responses to HSF face information increased over time. In contrast, MSF processing was robust at all durations in most face-preferring regions. Importantly, LSF and HSF spatiotemporal dynamics did not generalize to the adjacent LOC regions, which are engaged in general aspects of object encoding. However, a marked advantage for processing MSF information was observed in LOC at all durations, indicating that the large response to MSF is a general trait of high-level visual processing.
Discussion
The present study shows, for the first time, that the human brain regions responsible for high-level face representations rely on different SF over time. The temporal dynamics of SF processing were coarse to fine in most face-preferring regions.
Coarse-to-fine models of visual processing propose that LSF are extracted mainly in the first stages of visual processing. Accordingly, all face-preferring regions (except the rOFA) robustly responded to LSF in early stages of visual processing (until 75 ms of exposure duration), and this response decayed over time (mostly until 150 ms of processing). Coarse-to-fine models further suggest that visual processing becomes finer grained over processing time and increasingly relies on the processing of HSF information. Indeed, the processing of HSF face information became more robust over time in bilateral FFA and rOFA. In contrast, MSF face processing was strong in bilateral FFA and rOFA, already at the shortest exposure duration. Neural activity related to MSF face processing increased over time in bilateral FFA (though in different temporal intervals in the 2 hemispheres), while it decreased in rOFA.
Interestingly, these spatiotemporal processing dynamics revealed in face-preferring cortex were not observed in LOC, a high-level visual region showing no visual object category preference. This suggests that coarse-to-fine processing is a special signature of category-preferring brain regions (but see below). This would agree with Bar's theoretical framework, which proposes that inferences generated in prefrontal cortex based on early LSF input are sent back to high-level/category-preferring regions of the ventral pathway to guide visual processing (e.g., Bar 2007). In contrast, the HSF content of a scene is thought to be processed in posterior visual regions, which project to category-preferring regions of the ventral pathway. Accordingly, LOC may not belong to the coarse-to-fine network of visual processing and may rather engage in the slow encoding of fine image content. Indeed, bilateral LOC responded more robustly to MSF and HSF than to LSF, irrespective of exposure duration.
Past fMRI studies investigated coarse-to-fine processing dynamics using nonface stimuli such as scenes and objects. In a recent combined fMRI and event-related potentials study, Peyrin et al. (2010) presented sequences of SF-filtered natural scenes. Sequences followed either a coarse-to-fine (i.e., LSF-to-HSF) or a fine-to-coarse (i.e., HSF-to-LSF) order. They showed that coarse-to-fine sequences induce an initial increase of activity in prefrontal cortex, followed by enhanced occipital responses to HSF. However, it is unclear from this and a previous study by the same authors (Peyrin et al. 2005) whether the scene-preferring high-level regions situated in parahippocampal gyrus (Epstein et al. 1999) would also show a coarse-to-fine dynamic over processing time, because scene-preferring regions were not explored by Peyrin and colleagues.
Studies by Bar and colleagues also addressed the question of coarse-to-fine processing in the human brain. One study from this group (Kveraga et al. 2007) suggested that prefrontal regions, thought to facilitate visual processing via feedback, receive visual input from primary visual cortex very rapidly after stimulus onset via the M pathway. Counterintuitively, however, the authors reported larger prefrontal deactivations to HSF than LSF stimuli (Bar et al. 2006). Bar and colleagues mostly explored temporal dynamics in prefrontal regions; they did not address whether activations in object-preferring regions follow a coarse-to-fine temporal dynamic. More generally, the findings of Bar et al. do not provide unequivocal evidence of coarse-to-fine processing in the human brain, for several reasons (see Hegde 2008). First, Bar's framework relies on the assumption that the M pathway selectively carries LSF information, an assumption that is not supported by the literature (Kaplan 2004). Moreover, luminance and contrast differed markedly between the LSF and HSF stimuli used by Bar et al. (2006), whereas they were highly similar between unfiltered and LSF stimuli. The differential activations observed across SF in prefrontal cortex may thus be due to these differences in input properties rather than spatial scale per se.
The present study reports evidence for coarse-to-fine processing in high-level visual face-preferring regions while strictly equating stimulus and cognitive properties across SF conditions. The coarse-to-fine strategy may apply more to the processing of faces than to other object categories, for several reasons. First, behavioral and fMRI evidence jointly indicate that face processing depends more strongly on SF than object processing does (Collin et al. 2004; Yue et al. 2006; Williams et al. 2009). It has been suggested that, especially for faces, the SF-dependent representations generated in primary visual cortex are kept segregated at high-level processing stages (Biederman 1987). Second, and relatedly, previous publications showed that holistic processing relies on the processing of LSF face information (Collishaw and Hole 2000; Goffaux et al. 2003, 2005; Goffaux and Rossion 2006; Goffaux 2009; though see Cheung et al. 2008). Holistic processing emerges very early during face processing (Richler et al. 2009; see also Singer and Sheinberg 2006). It plays a key role in, and is highly specific for, face perception. When holistic processing is disrupted, face recognition is dramatically impaired (Sergent and Signoret 1992; Barton et al. 2002; though see Konar et al. 2010). Schiltz and Rossion (2006) showed that holistic face representations emerge in high-level face-preferring visual cortex and especially in the rFFA. We speculate that the early and strong rFFA responses to LSF face information observed in the present study may serve the generation of holistic face representations. However, further research is needed to support this proposal. The key contribution of LSF to early face perception is also indicated by the observation that the human N170, an electrophysiological component known to be stronger in response to faces than other visual categories (Rossion et al. 2000), is stronger in response to LSF faces than HSF faces (Goffaux et al. 2003; Flevaris et al. 2008). Another aspect that likely favors the coarse-to-fine strategy for faces is related to development. Faces are ubiquitous in the human visual environment from the first minutes of life, and newborns show an exceptional ability to discriminate faces. Due to the immaturity of their visual system, newborns individuate faces mainly based on LSF (de Heering et al. 2008). The predominance of LSF-based face processing early in life may contribute to the importance of this band of information for the early processing stages in adulthood (Le Grand et al. 2001).
Given that face perception is more affected by SF content than the processing of other visual categories (see above), it is unclear whether our findings can be generalized to other high-level, category-preferring regions. However, since coarse-to-fine processing has been evidenced with simple stimuli (e.g., Watt 1987; Bredfeldt and Ringach 2002), and complex visual stimuli like natural scenes (e.g., Peyrin et al. 2010), one might speculate that it generalizes to low- and high-processing levels of vision. Nevertheless, further research is necessary to tackle this issue.
Our results resolve the empirical divergence between past fMRI explorations of SF processing in face-preferring cortical regions. While some papers reported overall larger BOLD responses to HSF than LSF (Vuilleumier et al. 2003; Eger et al. 2004; Iidaka et al. 2004), others observed no BOLD response difference between LSF and HSF (Gauthier et al. 2005). Despite the pervasive assumption that SF processing is time-dependent, the potential role of exposure duration was not addressed in any of these earlier studies. The strong initial response to LSF face information was thus likely obscured in these past studies, which used long exposure durations (>200 ms).
Another important new finding relates to the large cortical response measured in some face-preferring regions in response to MSF face information, already at the shortest exposures. This finding is without precedent since no fMRI study explored cortical processing of face MSF information so far. In contrast to LSF, robust MSF responses were also observed early in bilateral LOC. They may thus reflect the general peak of human visual acuity centered at intermediate SF (De Valois et al. 1974; Tanskanen et al. 2005).
Our finding that BOLD responses to faces depend on different SF over time suggests that face-preferring cells are tuned to different SF ranges of face information. Accordingly, our whole-brain analyses (Supplementary Data 2) indicate that different voxel clusters respond to distinct ranges of SF, suggesting that SF are segregated until high-level stages of face processing. This is further supported by electrophysiological studies in the monkey brain, showing that face-preferring cells located in the inferotemporal cortex are sensitive to SF (Rolls et al. 1985; Bermudez et al. 2009).
The present evidence suggests the coarse-to-fine strategy as a plausible modus operandi in high-level visual cortex. Because LSF are processed earlier than—and independently from—HSF, they may be used for an initial coarse segmentation of the stimulus, to be later refined by the slower accumulation of HSF information. This is further supported by electrophysiological evidence in the monkey brain that inferotemporal cells respond to the global, coarse image structure before encoding local, fine information (Sugase et al. 1999; Sripati and Olson 2009). By revealing the spatial and temporal dynamics in high-level visual cortex dedicated to face perception, the present study opens a new avenue for investigating the composition of high-level visual representations in the human brain (see Hegde 2008).
(a) LSF, MSF, and HSF faces were presented at 3 exposure durations, immediately followed by a Gaussian mask. The phase of face stimuli was either intact or scrambled in the Fourier domain. All conditions were equated for luminance, RMS contrast, and spectral composition. They were randomly interleaved within a run and subjects categorized each trial as an intact or a scrambled one. (b) Performance accuracy in intact-scrambled categorization was at ceiling and was not influenced by SF, exposure, or stimulus factors. In contrast, correct response times were shorter for intact than scrambled conditions, and significantly increased at 300-ms exposure duration compared with 75- and 150-ms exposure conditions.
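The stimulus manipulations named in this caption (Fourier phase scrambling, then equating luminance and RMS contrast) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `phase_scramble` and `equalize`, the random placeholder image, and the target values 0.5 and 0.1 are all hypothetical, and RMS contrast is taken here to be the standard deviation of pixel values, which is one common definition and an assumption on our part.

```python
import numpy as np

def phase_scramble(image, rng):
    """Randomize Fourier phase while keeping the amplitude spectrum."""
    spectrum = np.fft.fft2(image)
    # The phase of a real-valued noise image is conjugate-antisymmetric, so
    # adding it keeps the inverse transform real (up to rounding error).
    random_phase = np.angle(np.fft.fft2(rng.standard_normal(image.shape)))
    random_phase[0, 0] = 0.0  # leave the DC term alone: mean luminance is kept
    scrambled = np.abs(spectrum) * np.exp(1j * (np.angle(spectrum) + random_phase))
    return np.fft.ifft2(scrambled).real

def equalize(image, mean_luminance, rms_contrast):
    """Rescale pixel values to a target mean and standard deviation."""
    centered = image - image.mean()
    return centered / centered.std() * rms_contrast + mean_luminance

rng = np.random.default_rng(0)
face = rng.random((64, 64))  # hypothetical placeholder for a filtered face image
scrambled = phase_scramble(face, rng)

# Bring both versions to the same (assumed) luminance and RMS contrast.
intact_eq = equalize(face, 0.5, 0.1)
scrambled_eq = equalize(scrambled, 0.5, 0.1)
```

Because only the phase is altered, the scrambled image keeps the amplitude at every spatial frequency and orientation, which is what makes it a suitable control for the intact image.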
Amplitude spectrum as a function of SF in unfiltered, LSF, MSF, and HSF stimuli, before and after luminance and RMS contrast have been equalized. Note that luminance and contrast equalization did not alter spectral envelope.
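A plot of amplitude against SF of this kind is typically obtained by averaging the 2D Fourier amplitude over orientation. The sketch below shows one way to do this in NumPy; it is an illustration under our own assumptions (square images, integer bins in cycles per image, the hypothetical function name `radial_amplitude_spectrum`), not the Matlab code used for the figure.

```python
import numpy as np

def radial_amplitude_spectrum(image):
    """Mean Fourier amplitude per integer SF bin, in cycles per image."""
    height, width = image.shape
    amplitude = np.abs(np.fft.fft2(image))
    # fftfreq returns cycles per pixel; scaling by size gives cycles per image.
    fy = np.fft.fftfreq(height) * height
    fx = np.fft.fftfreq(width) * width
    radius = np.hypot(fx[np.newaxis, :], fy[:, np.newaxis])
    bins = np.rint(radius).astype(int).ravel()
    totals = np.bincount(bins, weights=amplitude.ravel())
    counts = np.bincount(bins)
    return totals / np.maximum(counts, 1)

size = 32
x = np.arange(size)
# Test pattern: a horizontal grating with 4 cycles per image.
grating = np.tile(np.cos(2 * np.pi * 4 * x / size), (size, 1))
spectrum = radial_amplitude_spectrum(grating)

# Changing contrast (scale) and luminance (offset) rescales the spectrum and
# moves the DC bin, but leaves the shape of the envelope above DC unchanged.
rescaled = radial_amplitude_spectrum(0.5 * grating + 10.0)
```

In this toy case the spectrum peaks in the bin for 4 cycles per image, and the rescaled image has the same envelope multiplied by 0.5, which mirrors the caption's point that luminance and contrast equalization does not alter the spectral envelope.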
Left: Fourier amplitude is plotted as a function of orientation, revealing the similar orientation content across intact and scrambled conditions in each SF condition separately. These plots are based on a single measurement and therefore do not take into account the lack of a set of continuous orientation vectors in the Fourier domain (e.g., Hansen and Essock 2004). Right: Fourier amplitude plotted as a function of SF. Note the high similarity between intact and scrambled spectra.
Average BOLD activity in the rFFA. (a) Normalized beta weights in the rFFA (bars = mean intrasubject variance). (b) Effect size of the difference between intact and scrambled faces in separate SF and exposure duration conditions. (c) Grand averaged event-related time course of intact and scrambled face processing in the rFFA. Time courses are expressed in percent signal change relative to fixation baseline activity (baseline interval: from −2 to +2 TR around preparatory cue onset). The activity time courses shown in (c) reflect the findings based on the beta weights.
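The conversion of a raw time course to percent signal change relative to a baseline window can be sketched as follows. The function name `percent_signal_change`, the `baseline_halfwidth` parameter, and the example values are hypothetical; the window is expressed in TRs around the cue, and the exact window used in the study should be taken from the Methods rather than from this sketch.

```python
import numpy as np

def percent_signal_change(timecourse, cue_index, baseline_halfwidth=2):
    """Express a time course as percent change from a baseline around the cue."""
    signal = np.asarray(timecourse, dtype=float)
    # Baseline window: a few TRs on either side of the cue, clipped to the data.
    start = max(cue_index - baseline_halfwidth, 0)
    stop = min(cue_index + baseline_halfwidth + 1, signal.size)
    baseline = signal[start:stop].mean()
    return 100.0 * (signal - baseline) / baseline

# Hypothetical raw signal sampled once per TR, with the cue at index 2.
raw = [100.0, 100.0, 100.0, 100.0, 100.0, 102.0, 104.0, 101.0]
psc = percent_signal_change(raw, cue_index=2)
```

Normalizing each event to its own local baseline makes time courses comparable across runs and subjects whose raw signal levels differ.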
Effect size plots in (a) face-preferring regions (lFFA, rOFA, rSTS, lSTS, bilateral AIT) and (b) object-preferring regions (right and left ventral LOC).
The authors are grateful to Steven C. Dakin for providing the Matlab codes used to plot the orientation and SF spectrum. We also would like to thank Bruno Rossion for providing face and car stimuli, Armin Heineke for his support during fMRI data analyses, and Marieke Mur for her interesting suggestions on a previous version of the manuscript. Conflict of Interest: None declared.