End-to-end topographic networks as models of
cortical map formation and human visual behaviour:
moving beyond convolutions
Zejin Lu* 1,4 , Adrien Doerig* 1 , Victoria Bosch* 1 , Bas Krahmer,
Daniel Kaiser +2,3 , Radoslaw M Cichy +4 , & Tim C Kietzmann +1
1 Machine Learning, Institute for Cognitive Science, Osnabrück University, Osnabrück, Germany.
2 Neural Computation Group, Mathematical Institute, Justus-Liebig-Universität Gießen, Gießen, Germany.
3 Center for Mind Brain and Behavior, Philipps-Universität Marburg and Justus-Liebig-Universität Gießen,
Marburg, Germany
4 Neural Dynamics of Visual Cognition Group, Department of Education and Psychology, Freie Universität Berlin,
Berlin, Germany.
* Shared first authorship.
+ Shared last authorship.
Abstract
Computational models are an essential tool for understanding the origin and functions of the
topographic organisation of the primate visual system. Yet, vision is most commonly
modelled by convolutional neural networks that ignore topography by learning identical
features across space. Here, we overcome this limitation by developing All-Topographic
Neural Networks (All-TNNs). Trained on visual input, several features of primate topography
emerge in All-TNNs: smooth orientation maps and cortical magnification in their first layer,
and category-selective areas in their final layer. In addition, we introduce a novel dataset of
human spatial biases in object recognition, which enables us to directly link models to
behaviour. We demonstrate that All-TNNs align significantly better with human behaviour
than previous state-of-the-art convolutional models, owing to their topographic nature. All-TNNs
thereby mark an important step forward in understanding the spatial organisation of the
visual brain and how it mediates visual behaviour.
Introduction
Artificial neural networks (ANNs) have enabled the investigation of neuroscientific questions
that were previously beyond the scope of traditional modelling and experimental techniques
by offering a way to design models that are image-computable and task-performing, while
bridging levels of explanation from single neurons to behaviour 1–3 . In vision, the most
commonly used networks are convolutional neural networks (CNNs), a powerful and efficient
architecture type that has been successful at predicting primate neural activity across
multiple hierarchical levels of the ventral stream 4–7 and at accounting for complex visual
behaviour 8–11 .
However, a crucial limitation on the future prospects of CNNs as neuroscientific models is
the architecture’s reliance on weight sharing, i.e. CNNs use identical features across visual
space. This strong inductive bias is sensible for engineering purposes, because it facilitates
efficient learning and enables spatially invariant object recognition. Yet, this
architectural design choice limits their ability to model fundamental aspects of biological
vision. A key example is the origin and function of cortical topography 12,13 and its relation to
behaviour - a central area of research in visual neuroscience for which modelling promises
important insights that cannot easily be obtained using experimental approaches alone.
In the brain, topographic organisation refers to the fact that the spatial arrangement of
neurons on the cortical sheet is highly structured with respect to their tuning profiles. For
example, early visual cortex is thought to be organised into columnar structures
(hypercolumns) with repeating motifs of orientation sensitivity that vary smoothly across the
surface 14–16 . In higher-level visual cortex, clusters of neurons that respond preferentially to
abstract stimulus categories, such as faces 17,18 , bodies 19 , and scenes 20–22 are observed,
among other spatial organisational structures based on a variety of visual and conceptual
stimulus properties 23–25 . Human visual behaviour, too, exhibits spatial regularities, with
objects being more easily recognized when displayed in their typical spatial position 26–28 ,
likely arising from the topographic organisation of visual cortex with joint spatial tuning and
feature tuning determining visual efficiency.
The emergence of topographies and their interrelation with behaviour cannot be directly
modelled using CNNs due to their spatially enforced weight sharing, leading to three
modelling limitations. First, synaptic changes in CNNs are globally orchestrated to form
identical feature selectivity across space, which is in stark contrast to the brain where
synaptic changes across the cortical topography are local. Second, CNNs lack a clear
spatial arrangement similar to the brain’s cortical sheet. Third, CNNs do not exhibit spatially
smooth neural tuning transitions as found in the brain. Here we overcome these limitations
with a new topographic model architecture, which we term All-Topographic Neural Network
(All-TNN). All-TNNs fulfil the following desiderata for modelling cortical topographic
organisation:
1) Locality: Units in the model need to have local receptive fields (RFs) that are learnt
individually and not enforced to be exact duplicates of other RFs.
2) Arrangement along the cortical sheet: The spatial smoothness across cortex is
thought to be due to a smooth decay in connectivity between neurons with increasing
cortical distance 29,30 . Units in the model, therefore, need to be arranged along an
artificial cortical sheet, and the spatial distance between units is to be measured in
this space.
3) Smoothness: There is a continuum from spatially discontinuous models where each
unit detects a different feature, to spatially uniform models where all units detect the
same feature. A biologically plausible model of topographic organisation should
reflect the biological observation that the brain operates in between these two
extremes 25,31–33 .
Across a set of experiments with this architecture and a novel dataset of spatial biases in
human visual object recognition, we demonstrate that All-TNNs, in contrast to current
state-of-the-art CNNs, more accurately capture topographic representations in the primate
visual system. First, they reproduce important properties (smooth orientation selectivity
maps, cortical magnification and category-selective regions) of neural topography when
trained on visual input. Second, All-TNNs align significantly better with human behaviour by
reproducing visual field biases in object perception. Lastly, we show that the visual behaviour
of All-TNNs is directly linked to their topography.
Results
To study the emergence of feature topographies and their impact on behaviour, we
developed a fully topographic neural network, the All-TNN. In contrast to CNNs, All-TNNs can
learn locally specific weight kernels to detect different features across visual space (Fig. 1,
desideratum 1). As a result, units at different spatial locations are free to learn different
features, making it possible to directly compare the topographies of spatial selectivity maps
on the model’s cortical sheets with properties of topographies found in the brain. In addition,
units of each network layer are arranged along a 2D cortical sheet (desideratum 2), in a
structure resembling cortical hypercolumns: all units of a hypercolumn share the same local
receptive field location (i.e., they only receive connections from the same spatially limited
area in the layer below). To navigate the continuum from spatially discontinuous to spatially
uniform feature selectivity, we use a tunable spatial similarity loss that acts as a regularizer
encouraging neighbouring units to detect similar features (desideratum 3; see Methods).
In our experiments, we trained All-TNNs on ecoset, an object classification dataset that
contains 565 object categories selected to be representative of concrete categories that are
of importance to humans 34 . Training for object categorisation performance while satisfying
the spatial similarity loss results in a dual-loss objective, which forces the network to trade off
between i) learning varied feature selectivity required to accomplish a difficult object
categorisation task, and ii) preserving similar feature selectivity between neighbouring units.
In our experiments, we contrasted multiple instances of All-TNN with two control models:
purely locally connected networks (LCN, i.e. All-TNNs without spatial similarity loss), and
CNNs. To make sure we single out architectural differences in our analyses, these models
have matching numbers of units, identical hyperparameter settings, and are trained on the
same dataset and task (see Methods). We train multiple seeds of each network type that we
treat as experimental subjects.
To confirm the capacity of All-TNN to model cortical topography and its relation with
behaviour, we perform in-silico electrophysiology analyses hierarchically from low-level to
high-level neural topographical characteristics, and then move onto behavioural
experiments.
Figure 1 | The All-Topographic Neural Network (All-TNN). Example topographic layer
(top) with the properties of local connectivity, 2D arrangement and spatial weight similarity
loss. Units are arranged retinotopically into ‘hypercolumns’, and units in a given hypercolumn
share the same local receptive field location. The spatial similarity loss is applied to weight
kernels of neighbouring units within a layer (as illustrated for one unit by green arrows). The
All-TNN architecture (bottom) used in our experiments consists of 6 topographic layers of
different dimensions (indicated by numbers along the layer depictions) and kernel sizes
(indicated by the numbers along the kernel depiction), of which layers 1, 3 and 5 are
followed by pooling layers, and the last layer is followed by a category readout (565
categories) with softmax. The network’s learning objective is the sum of a classification loss
and a loss favouring local spatial similarity between unit kernels.
Topographical features of the ventral stream emerge in All-TNNs
V1 orientation selectivity maps are a topographical hallmark of the primate visual system 12 .
We thus begin our investigation by determining whether the model’s first layer reproduces
the features of smooth orientation selectivity maps in V1. To determine orientation selectivity
for each unit in the layer, we follow the standard analysis procedure in biological systems 14
by presenting the network with sine-wave gratings of different angles and phases (Fig. 2a) and
determining the angle for which each unit is most responsive (see Methods). We find that the
first All-TNN layer exhibits a smooth distribution of orientation selectivities, mirroring
orientation selectivity maps in primate V1 (see Fig. 2a, left panel). By contrast, CNN
architectures do not exhibit such topography, because feature selectivity is, by definition,
identical across all locations in the layer (Fig. 2a, bottom right panel). Importantly, V1-like
selectivity maps also did not emerge in the locally connected control network, suggesting
that such a topographical organisation does not emerge purely from learning to categorise
natural objects under the influence of the autocorrelation of the input statistics (Fig. 2a). In
addition, while V1-like feature selectivity maps require training (they are not present in the
untrained network), they emerge early and remain stable after only a few epochs (Fig. S1a),
even though performance keeps increasing after the topography has stabilized. This
suggests that the topographical organisation emerges in the network quickly, with further
training only finetuning selectivities within this topography rather than bringing about broad
changes in the overall structure. This is in line with early maturation of topographical
structures in visual cortex of infants, thought to provide scaffolding for functional
selectivity 35,36 .
Figure 2 | All-TNNs mirror key features of the visual system’s topography. a. The first
layer of All-TNN (example network instance) shows a V1-like organisation of orientation
selectivities, while the two control architectures, a locally connected control network and a
convolutional network, do not. b. Entropy visualisation of an All-TNN instance mirrors
foveation and cortical magnification in the first layer. Entropy decreases with eccentricity in
all seeds of All-TNN, but remains constant for CNN and LCN (data averaged across all
seeds, shaded region shows the 95% confidence interval; curves overlap for LCN and CNN).
Selective entropy-based lesioning confirms cortical magnification (data shown averaged
across all seeds, error bars show the variance). Classification accuracy is more affected
when lesioning 50% of units in high-entropy (i.e. varied selectivity) regions of All-TNN, than
lesions performed to units in low-entropy (i.e. homogeneous selectivity) regions. This effect
of cortical magnification is neither observed for the locally connected control network nor the
CNN. c. The last layer of All-TNNs shows clustering of high-level category-based
selectivities (d’) for tools, scenes, and faces, whereas the locally connected control network
and the CNN do not show clustering of similarly selective units. Results and maps for all
seeds can be viewed in Fig. S1-3.
Interestingly, the All-TNNs’ orientation selectivity maps exhibit a strong centre-periphery
organisation, with increasingly smooth feature selectivity towards the periphery. To quantify
the observation of a foveal region with a higher diversity of feature selectivities, we
computed feature-entropy at various spatial eccentricities (Fig. 2b, top left panel), as well as
for the control models (Fig. 2b, middle right panel). Indeed, we observe a marked decline in
entropy in All-TNNs, in line with less varied feature selectivity in the periphery (Fig. 2b, top
right panel). In contrast, entropy is consistently high and does not vary for control networks,
which are not able to pick up on the image statistics in their topography. This aspect may be
surprising since All-TNNs have no central bias in their architecture or loss terms; the position
of this “foveal” region must therefore result from an interplay between the network’s
training objective and the statistics of the training dataset.
the images contains crucial information for the categorization task, which forces the network
to learn varied features in this region at the expense of feature smoothness. In contrast, the
network favours more homogenous visual features in the periphery, because less
categorization-relevant information is present there.
This foveal bias is reminiscent of cortical magnification in humans, with more neurons per
degree of visual angle in the foveal region of V1 than in the periphery, leading to better
acuity in the fovea 37–39 . This allocation of more resources to the fovea goes hand in hand
with the fact that humans fixate on relevant regions of the visual field through eye
movements. We tested whether the greater diversity of foveal selectivities in All-TNNs
reflects additional computational resources contributing to task performance, similar to
human cortical magnification, in an in-silico lesion study. We quantified the diversity of
selectivities at each location on the sheet by computing the local entropy of orientation
selectivity in a sliding window. We then lesioned 50% of units in regions with homogenous
selectivities (low entropy lesions) or varied selectivity (high entropy lesions) (Fig. 2b, bottom
panel; see Methods). We found that low-entropy lesions have a minor effect on All-TNN
classification performance (accuracy dropping to 90.91% of the unlesioned All-TNN
performance). In contrast, high-entropy lesions strongly deteriorate classification
performance (accuracy dropping to 57.87% of the unlesioned All-TNN performance). This shows that
All-TNNs can afford to lose units in the peripheral regions with homogenous selectivities, but
not in the foveal region with diverse selectivities. Neither convolutional nor locally connected
control models show this effect, because their selectivity profiles are homogeneous (Fig. 2b,
middle right panel), in contrast to the strong topographical structure of All-TNNs. Hence, only
All-TNNs can learn a topography mirroring the training image statistics to selectively process
relevant parts of the visual field, leading to a spatial organisation reminiscent of cortical
magnification.
Having investigated lower-level topographic features of All-TNNs, we focus on higher-level
representations and analyse whether the networks’ last layer reproduces topographical
features characterizing primate higher-level visual cortex. To do so, we contrast unit
activations in response to faces, tools, and places (d', 500 stimuli each, see Methods) -
image classes that yield clusters of neural activation in primate higher-level visual
areas 17,18,20,21,23,36,40,41 . After model training, we observe smooth model regions selective for
faces, places and tools (Fig. 2c, left panel) in the All-TNN. Neither of the control models
shows a comparable topography. Instead, they develop an unstructured salt-and-pepper
selectivity map without any clusters (Fig. 2c, bottom right panel). Similarly to the emergence
of orientation selectivity in the first layer of All-TNN, this topographic organisation into
category-selective regions requires training and stabilises early on (Fig. S3a).
Taken together, these results demonstrate that All-TNNs consistently reproduce key
characteristics of primate neural topography at early and higher-level visual cortex, including
V1-like smooth orientation maps, cortical magnification, and category-selective regions in the
final model layer.
All-TNNs better align with human spatial visual behaviour
Humans reliably show visual field biases, i.e. objects are better detected and recognized
when they appear in the locations they are most often experienced in 26,27,42 . Given that, akin
to cortical maps, All-TNNs have the ability to detect different features in different parts of the
visual field, the question arises whether All-TNNs exhibit human-like effects of spatial
position in their classification performance.
To investigate this question, we collected a novel dataset of spatial biases in human visual
object recognition (Fig. 3a). Participants (n=30) classified 80 objects from 16 classes of the
COCO dataset 43 , which were presented for 40ms in a random location of a 5x5 grid, followed
by a Mondrian mask (see Methods). Masking prevents ceiling effects in performance, and
has been proposed to limit recurrent processing in humans 44 , which is ideal as a testbed for
our current set of feedforward models. Based on these data, we computed spatial
classification accuracy maps for each individual and each object class that capture the
spatial distribution of human classification performance (Fig. 3b).
We performed the same experiment on All-TNNs and our control models (see Methods; note
that LCN control models are not considered further because they do not exhibit meaningful
topographic organisation). Model testing was based on all images in COCO for a given
class, instead of only 5 exemplars used for the human participants. In analogy to the human
analyses, accuracy maps were computed for each model type, instance, and object
category.
Figure 3 | An experiment testing spatial biases in human visual behaviour. a . To create
stimulus materials for the behavioural experiment, objects from COCO are segmented from
their background and placed on a 5x5 grid. 16 object categories consisting of 5 object
exemplars were included in this behavioural dataset (example segmentations shown). Each
trial contained a brief stimulus presentation at one of 25 locations on the screen, followed by
a Mondrian mask. A response screen showing the 16 target category labels was presented
after the stimulus and mask to collect participant responses. b. Accuracy maps for all 16
categories, averaged across participants (n=30).
As a first step to understanding spatial visual biases in both humans and models, we verified
that object classification performance across the visual field is aligned with the
corresponding object’s occurrence statistics. To do so, we correlated the object occurrence
frequency maps, as obtained from COCO (see Methods; Fig. S4), with the classification
accuracy maps (Fig. 4a, left panel). For humans, we observe a positive relationship between
accuracy maps and COCO occurrence frequency maps (Pearson r=0.56; Fig. 4a, right
panel), consistent with previous research 26,27 . All-TNNs and CNNs, too, exhibit a significant
correlation between accuracy and occurrence frequency (permutation test for both All-TNN
and CNN, n=1e5; p <0.001). However, the strength of the effect observed in All-TNNs was
significantly closer to humans than that of CNN controls (Pearson’s r; r=0.45 for All-TNN,
r=0.23 for CNN, permutation test, n=1e5; p <0.001). This indicates that the alignment of
performance and occurrence statistics is more human-like for All-TNNs than for CNNs.
Figure 4 | All-TNNs capture spatial statistics of objects, matching human behaviour.
Data from human participants (n=30), All-TNN instances (n=10) and CNN controls (n=10)
shown. a . Humans and All-TNNs show a stronger alignment (Pearson correlation) between
their accuracy maps and occurrence frequency maps obtained from COCO for each of our
selected 16 categories, as compared to CNNs. b . All-TNNs, similarly to humans, have less
peaked accuracy maps when the location of objects is more variable. Positional uncertainty
for each of our 16 categories was computed as the image area over which object
occurrence frequency in COCO exceeds 90% of its maximum. Positional variance in classification performance was
computed as the accuracy ratio between the best and worst classification accuracy for a
given object. Robust regression indicates a significant relationship between positional
variance in classification performance and positional uncertainty across 16 categories for
both humans and All-TNNs, but not CNNs. A negative slope indicates a decreasing accuracy
ratio as a function of positional uncertainty.
To further characterise positional effects and their relation to occurrence statistics, we
investigated whether positional uncertainty of object categories in natural scenes had an
effect on human and model accuracy maps. We hypothesised that object categories with
stereotypical locations, i.e. with low positional uncertainty, should exhibit stronger
behavioural differences across space due to stronger position-dependent tuning. For objects
that occur in more diverse and unpredictable locations, however, behaviour should show
weaker positional effects. We operationalise positional uncertainty as the size of the region
where the object’s occurrence frequency exceeds 90% of its maximum frequency, and
positional effects on classification performance as the ratio between the locations with best
and worst classification accuracy for each object class (Fig. 4b, left). We then tested for a
relationship between these two measures via robust linear regression, and analysed the
estimated slopes for average humans, All-TNNs, and CNN controls (Fig. 4b, middle and
right). For human observers we observe a negative relationship: objects with low positional
uncertainty exhibited stronger positional effects on accuracy (robust regression; avg. slope =
-11.91; 95% CI, -22.56 to -0.18). Similar effects were also observed for All-TNNs (robust
regression; avg. slope = -9.13; 95% CI, -18.86 to -0.40; permutation test, n=1e5; p <0.001). In
contrast, CNN control models showed much-reduced effect sizes (robust regression; avg.
slope = -2.48; 95% CI, -7.15 to 4.95; permutation test, n=1e5; p <0.001). These analyses
indicate that the magnitude of the positional effect on classification performance varies as a
function of how uniformly distributed object occurrences are, with All-TNNs again aligning
more closely to human behavioural patterns than CNN control models.
Figure 5 | All-TNNs mirror spatial biases in human visual behaviour. a. Alignment of
accuracy maps of humans and models was quantified by a Pearson correlation. b. All-TNNs
exhibit significantly better alignment with human accuracy maps than CNNs. Shown are the
noise-ceiling corrected agreements of All-TNNs and CNNs with human accuracy maps. The
error bars show 95% confidence intervals.
Having verified that humans and All-TNN models are able to mirror spatial occurrence
statistics in their behavioural patterns, we next tested the models’ ability to accurately predict
human accuracy maps, i.e. variations in human object categorizations across the visual field.
For this, we directly compared human behavioural and model accuracy maps through
category-wise Pearson correlation analysis (Fig. 5a). We find that All-TNNs correlate
significantly more strongly with human behavioural patterns than CNN controls ( p <0.001; n=10;
Fig. 5b).
Could this pattern of results be caused by a simple centre bias instead of richer structure
with different maps for different categories? To determine whether this is the case, or
whether the behavioural alignment is more precise and indeed object specific, we
constructed accuracy dissimilarity matrices (ADMs), by computing the Pearson correlation
distance between all object-specific accuracy maps (Fig. 6a). This analysis is akin to
representational similarity analysis 45 , but based on our accuracy maps, highlighting to what
extent pairs of categories exhibit similar or different accuracy maps. ADMs are compared
between humans and models via Spearman correlation. Correlating ADMs focuses on
differences between accuracy maps. Indeed, features shared between accuracy maps, such
as central biases, would show up as a constant in the ADM (all ADM cells are impacted by
this one shared aspect). Constants, however, do not affect ADM comparisons using
Spearman correlations. In short, our analysis approach of comparing ADMs focuses on
differences in accuracy maps across object categories and thereby moves the analysis
beyond similarities that are category-agnostic. As shown in Figure 6b, ADM agreement
between human data and All-TNNs is significantly higher than between humans and CNN
controls (permutation test, n=1e5; p <0.01). This confirms that visual classification behaviour
of All-TNNs aligns with human behaviour in a category-specific manner rather than merely
reflecting a central bias effect.
Figure 6 | All-TNNs mirror object-specific spatial biases in human visual behaviour. a.
Accuracy dissimilarity matrices (ADMs) were created to capture the differences between
accuracy maps of all objects using Pearson correlation distance (left). Behaviourally more
similar objects have a low dissimilarity in their accuracy maps, whereas objects yielding
behaviourally distinct accuracy maps have high dissimilarity. To relate the ADMs of average
humans, All-TNNs, and CNNs with each other they are correlated using Spearman
correlation (right). b . ADMs of All-TNNs align significantly better with human data than those
of CNNs.
Engagement of stereotypical unit activation patterns links topography to behaviour
When trained on natural scenes, All-TNNs develop units that are selective for different
categories in different spatial locations. The behavioural experiment, however, relied on
small cropped objects presented in different locations - a setting quite different from the
natural images of ecoset (Fig. 7a, top panel). This discrepancy offers the unique opportunity
to directly link the human-like behaviour of All-TNNs to their topographical arrangement.
Figure 7 | Human-like accuracy patterns are linked to topography in All-TNNs. a . The
engagement of stereotypical unit activation patterns by objects presented in various
locations was determined by computing a cosine similarity between the corresponding
activity maps of the respective network instances (All-TNNs, and CNN controls). In addition,
we extracted classification accuracy for all locations. b . Relating engagement to accuracy via
linear regression with bootstrapping demonstrates that All-TNNs exhibit a positive
relationship: stronger engagement of stereotypical unit patterns yields higher classification
accuracy. This was not observed for CNNs.
To investigate this aspect, we recorded average unit activation maps of the final network
layer for each object category, based on the ecoset test set. We call these stereotypical
activation maps, as they reflect the model’s response pattern to images from the distribution
of images of a given category on which it was trained. We then recorded the unit activation
maps for each stimulus from the behavioural experiment (i.e. each object category,
presented in each of the 25 locations of the behavioural experiment; see Methods), and
averaged responses across images for each location and category. We call these the
experimental unit activation maps, reflecting the model’s responses to ouf-of-distribition
images used in the behavioural experiment. Comparisons of the stereotypical activation
maps to the activation patterns observed during the experimental setting enabled us to test
to what extent the network’s better performance for some locations was due to it engaging the
right topographic features. Category-specific activation maps for the stereotypical setting
were compared to the 25 experimental locations using cosine similarity (Fig. 7a). If the
behavioural effects observed in All-TNNs are driven by their topographical layout, then
locations with a better alignment of the stereotypical and experimental activations maps
should have a better classification accuracy. Collapsing data across categories and
locations, we find that this is indeed the case (linear regression with bootstrapping,
slope=0.88; permutation test, n=5e5; p <0.001; Fig. 7b). In other words, stimuli are well
classified when they successfully engage the right topographic regions of the last layer. In
contrast to this, we found that CNNs do not show a significant relationship between unit
activation patterns and recognition accuracy (linear regression with bootstrapping,
slope=0.40; permutation test, n=5e5; p >0.05). These results tie together the two previous
results of the paper: the human-like neural topographies of All-TNNs explain their more
human-like behavioural patterns.
Discussion
Here we introduce and test a new artificial neural network architecture, All-TNN, that is
capable of modelling topographic aspects of primate vision and their behavioural
consequences. All-TNNs fulfill three desiderata for topographical models of the visual
system: (1) units with local receptive fields and independently learnt kernels, (2) units
arranged on a 2D cortical sheet and (3) spatially smooth feature selectivity. This endows
All-TNNs with genuine topography that goes beyond current state-of-the-art CNN models
that instead copy and paste identical features across locations.
Using in-silico electrophysiology across the hierarchical levels of All-TNNs, we show that
they develop key features of topographic organisation reminiscent of both lower- and
higher-level areas of the visual cortex. We find that, unlike non-topographical control
networks, All-TNNs tune to spatial occurrence statistics, produce smooth orientation
selectivity maps, and develop category-specific topographical organisation. This shows that our
simple spatial smoothness constraint is sufficient for All-TNNs trained on natural images to
model primate topographies across levels. Our modelling results lend support to the idea
that topography in the visual cortex may result from the tendency of spatially neighbouring
neurons to learn similar features, which leads to a clustering of neurons. Indeed, the spatial
loss that we impose upon the network is in line with experiments that show that neurons
preferentially connect to neurons with similar orientation selectivity in mammals 46–48 .
A surprising property of All-TNNs is a strong centre-periphery topography, with increasingly
smooth feature selectivity towards the periphery. This pattern is reminiscent of cortical
magnification and could be explained by the trade-off between the classification and spatial
loss: affected by dataset statistics, the network learns to detect varied features at the
expense of spatial smoothness in the foveal region. The observation that All-TNNs can
afford to lose units in the periphery becomes of interest when considering that the brain
operates under a limited energy budget and hence likely optimises for energy
consumption 49,50 . Our results support the hypothesis that cortical magnification emerged
through evolution as an optimal topography to trade off visual performance and neural
energy consumption in animals that foveally fixate on relevant objects 39 .
Similarly, the finding that All-TNNs show structured, yet varied, representations in the last
layer can be tied to the hypothesis that the high-level organisation in IT balances feature
variety and homogeneity 24 . Again, the emergence of category-selective clusters can be seen
as satisfying a trade-off between the need for varied feature detectors and the tendency to
have smooth selectivities.
Our results further reveal a similarity in the developmental principles and trajectory of
All-TNNs and the primate brain. The topographic organisation throughout the layers of
All-TNN arises early during training and stabilises quickly, while the network continues to
increase its classification performance subsequently. This indicates that the topographical
organisation offers a stable structure while still allowing for enough flexibility for task learning
throughout training. This is similar to the primate visual cortex where early maturation of
important architectural structures is thought to provide scaffolding for functional
selectivity 23,35,36 .
LCN controls, which have an identical architecture to All-TNNs but lack the spatial similarity
loss, do not develop topographic features similar to those of the brain, despite being trained
on the same dataset and categorization objective. This ties into an important current debate
about which aspects of the visual system's structure are genetically hardwired, and which
require experience 36,51 , and lends support to the idea that both visual experience (here:
dataset) and the right inductive biases (here: architecture and spatial loss) are necessary
driving factors for the emergence of functional topographies in the brain. All-TNNs thus invite
further modelling of the visual cortex and beyond 52 , taking developmental genesis into
account through systematic manipulations of architectural features, loss functions, or training
datasets to uncover principles and mechanisms underlying the maturation of brain structure
and behaviour in the visual system 53 .
Importantly, the topographic features of All-TNNs are central to replicating important aspects
of human behaviour. Humans can exploit spatial regularities in the typical locations of
objects, which may be an adaptive strategy to reduce the computational load and enhance
visual efficiency in complex environments 26 . In line with this strategy, neural representations
are better decodable and perceptual sensitivity is higher when objects appear at locations
that match their typical positions in the world, allowing objects to be more easily detected
and recognized when presented in expected locations 26,54 . We find that All-TNNs are closer
to human visual behaviour in this setting, due to how their topographical organisation
impacts object recognition.
The ability of All-TNNs to link topographies to behaviour allows for new research directions.
An obvious hypothesis to test is that if certain objects have more importance than others
during training, they may take up more space in the topography, which in turn may account
for biases in behaviour 55–57 . As another example, the spatial topography of All-TNNs allows
for targeted lesioning of its organised parts, such as the face-selective units in the later
layers. This allows using All-TNNs to model brain lesions 58 with potential for clinical impact,
and to model virtual lesioning methods such as Transcranial Magnetic Stimulation (TMS) 59 ,
providing insights into the underlying mechanisms and effects of such experimental
interventions.
All-TNNs complement other recent approaches to modelling topography in the visual
system 30,41,60–66 , which have greatly contributed to our understanding of cortical map
formation. One limitation of most existing models of topographic organisation is that they are
either truly topographic but not task-performing or task-performing but not truly topographic.
Examples of the former are hand-crafted self-organising maps 13,67,68 . The latter are most
often, if not always, based on augmenting CNNs, for example by adding a spatial remapping
to their units, building self-organising maps based on unit activities, or creating multiple CNN
streams 30,60,61,64–66 . While we strongly agree that CNNs can provide important insight into
functional organisation, such models do rely on biologically implausible weight sharing rather
than genuine topography. All-TNNs are a promising approach to overcome this limitation, as
they are both mechanistically truly topographic and task performing.
The current work has several limitations. First, the spatial similarity loss that we used to
encourage neighbouring units to learn similar features may not capture the mechanism of
smoothness in topographic organization and the exact spatial arrangement of topography in
the brain. Future work using All-TNNs as a starting point can explore how smooth maps can
emerge naturally from model training, without imposing a secondary spatial similarity
constraint explicitly. Possible avenues include using more biologically plausible constraints,
such as wiring length optimization 30,64,69 , energy constraints 50 , cortical size 13,70 , or recurrent
connectivity patterns. Second, the current set of models is trained on a supervised
classification objective, whereas primate visual learning likely relies on unsupervised signals,
too 71–73 .
In conclusion, All-TNNs are a promising new class of models, which address questions that
are beyond the scope of CNNs, and could serve as more accurate models of functional
organisation in the visual cortex and its behavioural consequences.
Methods
Neural network architectures and training
All-Topographic Neural Network
Each layer in All-TNNs is arranged as a 2D sheet, where each unit has its own receptive
field and set of weights (i.e., there is no weight sharing, unlike in CNNs). Units in the same
“hypercolumn” share the same receptive field. These aspects mirror well-known
characteristics of the visual system. In practice, this is implemented by subclassing
TensorFlow’s LocallyConnected2D layer. This layer is arranged in a 3D
height × width × channels structure, identical to CNNs, but without weight sharing.
To convert this 3D locally connected layer to our 2D All-TNN layer, we “unfold” each channel
to a 2D square, giving rise to a channels·height × channels·width 2D sheet with the
desired characteristics.
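For illustration, a minimal NumPy sketch of this unfolding is given below; the arrangement of channels into a square grid within each hypercolumn, and all names, are our assumptions rather than the released implementation.

import numpy as np

def unfold_to_sheet(x, c_rows, c_cols):
    # x: (height, width, channels) tensor from a locally connected layer,
    # with channels == c_rows * c_cols units per hypercolumn (assumption)
    h, w, c = x.shape
    assert c == c_rows * c_cols
    # place each hypercolumn's channels into a contiguous c_rows x c_cols
    # patch, preserving retinotopy at the hypercolumn level
    return x.reshape(h, w, c_rows, c_cols).transpose(0, 2, 1, 3).reshape(
        h * c_rows, w * c_cols)

# e.g. a 10x10 layer with 16 channels becomes a 40x40 cortical sheet
sheet = unfold_to_sheet(np.random.rand(10, 10, 16), 4, 4)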
In addition, each layer has a spatial similarity loss, which promotes similar selectivity for
neighbouring units and is crucial for the emergence of topography (as evident in Figure 2). In
detail, this spatial similarity loss is computed as the average cosine distance between the
weight kernels of neighbouring units in each layer.
The total loss that the model aims to minimize is a composite of two losses: the
categorization cross-entropy loss (equation 1) and the spatial similarity loss (summed over
all layers; equation 2). The spatial loss is multiplied by a factor α that determines the
additive weight of the spatial loss.
$$\mathcal{L}_{CE} = -\sum_{c=1}^{M} y_c \log(p_c) \qquad (1)$$

$$\mathcal{L}_{S} = \sum_{l=1}^{L} \frac{1}{N_l} \sum_{n=1}^{N_l} \alpha_l \cdot \frac{\mathrm{cos\_dist}(w_{i,j}, w_{i,j+1}) + \mathrm{cos\_dist}(w_{i,j}, w_{i+1,j})}{2} \qquad (2)$$

$$\mathcal{L}_{total} = \mathcal{L}_{CE} + \mathcal{L}_{S} \qquad (3)$$
where $w_{i,j}$ is the weight kernel of the unit at position $(i,j)$ on the 2D sheet, $N_l$ denotes the
total number of units in layer $l$, $\alpha_l$ is a hyperparameter that determines the magnitude of the
spatial similarity loss in layer $l$, and $L$ is the total number of layers in the network.
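As a minimal sketch of equation (2) for a single layer, assuming the flattened weight kernels have already been arranged on the 2D sheet (the array layout and names are ours, not the released code):

import numpy as np

def spatial_similarity_loss(w, alpha):
    # w: (rows, cols, d) array; w[i, j] is the flattened kernel of the unit
    # at sheet position (i, j)
    def cos_dist(a, b):
        dots = np.sum(a * b, axis=-1)
        norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
        return 1.0 - dots / (norms + 1e-8)
    right = cos_dist(w[:, :-1], w[:, 1:])  # distance to right-hand neighbour
    down = cos_dist(w[:-1, :], w[1:, :])   # distance to neighbour below
    # average the two neighbour distances per unit, then over the sheet
    # (boundary units contribute one neighbour term each in this sketch)
    return alpha * 0.5 * (right.mean() + down.mean())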
The All-TNNs we used consist of 6 such layers, of which layers 1, 3, and 5 are followed by 2
by 2 pooling layers. Each layer is subject to L2 regularisation with a factor of 1e-6, and is
followed by layer normalisation and a rectified linear unit. We used a spatial loss of α = 10
in all layers except the final layer, for which we used a larger α, due to the increased
smoothness observed in the higher visual cortex. We used an Adam optimiser with a learning rate of
0.001 and ϵ = 0.1, and a regularisation ratio of 1e-6. Weights are initialised with Xavier
initialisation. A dropout of 0.2 is applied to all layers during training. See Figure 1 for the
specific layer and kernel sizes.
Locally connected control model
As a control for the effect of the spatial loss, we also train two All-TNNs with identical
hyperparameters but without the spatial loss (α = 0), meaning that the model trains with only
the task loss (cross-entropy).
Convolutional control model
Our convolutional controls have the same number of layers, number of units and
hyperparameters as our All-TNNs. The spatial similarity loss is not (and cannot be) enforced
in this model. It is thus trained with only the cross-entropy loss.
Dataset & training
Each model is trained on the ecoset training set (see subsection Stimuli ). The images were
input to the networks with a resolution of 150x150 pixels. Given that “individual differences”
exist between ANNs 74 , we trained multiple instances of each network type with different
random seeds, which are treated as experimental subjects. For All-TNNs and CNNs, we
trained 10 instances. For LCNs (i.e. All-TNNs without spatial loss), we only trained two
network instances due to resource constraints. All models are trained for 600 epochs.
All models are custom-made, implemented and trained in Python v.3.10 with Tensorflow
v.2.8 using NVIDIA A100 GPUs.
Stimuli
Training dataset
The All-TNN models and CNN control models were trained using the ecoset dataset 34 . The
ecoset dataset consists of 1.5 million ecologically motivated images from 565 categories. It
was shown that networks trained on ecoset better predict activities in the human higher
visual cortex than networks trained on Imagenet (ILSVRC-2012), making it a good choice for
modelling the influence of natural image statistics on the emergence of cortical topography.
Selectivity analysis dataset
To determine high-level object selectivity in the last layer of the models, we selected 500
images for each superclass used to test selectivity (faces, places and tools). The places and
tools images are selected from the 10 most common classes for the respective superclass
found in the ecoset validation set. The faces are taken from the VGG-Face dataset 75 .
Places : ‘House’, ‘City’, ‘Kitchen’, ‘Mountain’, ‘Road’, ‘River’, ‘Jail’, ‘Castle’, ‘Lake’, ‘Iceberg’
Tools : ‘Phone’, ‘Gun’, ‘Book’, ‘Table’, ‘Clock’, ‘Camera’, ‘Cup’, ‘Key’, ‘Computer’, ‘Knife’
Faces : 10 identities taken from the VGG-Face dataset 75 .
Stimuli for the behavioural experiment
The spatial classification behaviour of humans, All-TNNs, and control models, was evaluated
using segmented objects from the COCO (Common Objects in Context) dataset 43 . This is a
large-scale image dataset gathered from everyday scenes containing common objects; the
objects in these natural scenes are precisely labelled and provided with segmentation
masks, bounding box and keypoint annotations.
Easy segmentation of objects from pixel-wise segmentation masks in COCO motivates the
use of this dataset for behavioural testing. The stimulus set consists of 16 categories that
occur in both ecoset and COCO. These categories are: 'airplane', 'bear', 'broccoli', 'bus', 'cat',
'elephant', 'giraffe', 'kite', 'laptop', 'motorcycle', 'pizza', 'refrigerator', 'scissors', 'toilet', 'train',
and 'zebra'. Each category contains 5 exemplars, selected for having similar illumination. To
control for visual confounds, all stimulus images were resized to equal size and placed onto
grey backgrounds. Note that we only used 5 stimuli per class in our human experiment due
to time limitations, but when we conduct a similar experiment on our networks as described
below, we use all COCO stimuli for each class, at each 5x5 location instead.
Behavioural experiment
To assess the correspondence between the behaviour of All-TNNs, the CNN control models,
and humans, we collected human behavioural responses to a single image dataset.
This allowed behavioural responses to the same stimuli to be directly compared between
All-TNNs, the CNN control models, and humans.
Participants
30 healthy adults (aged 21-30 years, mean=25.47 years, SD=2.5 years; 17 female)
participated and completed the visual classification task. All participants had normal or
corrected-to-normal visual acuity. Prior to participation, ethical approval for the study was
obtained to ensure compliance with ethical guidelines. All participants provided written
informed consent and received monetary compensation or course credits for their
participation.
Experiment design
The participants were asked to detect and classify the object presented on screen. The
images used were 215x215 pixels, calculated based on a 5-degree visual angle and a 65 cm
screen distance. Stimuli randomly occurred in one of the 25 positions on a 5 by 5 grid on the
screen for 40ms. The location of the stimulus was then masked with a Mondrian mask for
300ms. The participants were then presented with a response panel showing the 16
category names after the mask disappeared, after which they had 2150 ms to click on the
category name that matched what they saw before the mask. Feedback was given after
each trial: the right category name was displayed in a green font colour if the participants
were correct, and in red if they were incorrect. The experiment consisted of 2000 trials in
total: each object exemplar was shown one time in each location (i.e. for each of the 16
categories, all 5 object exemplars are presented 25 times). The order of stimulus
presentation was randomised between participants. Each trial took around 2.5s, and after
each set of 200 trials, there was a 2-minute pause. The experiment took 3 hours in total.
Data recording and processing
The stimulus presentation in the behavioural experiment was controlled using the
Psychtoolbox 76 in Matlab. We recorded the human behavioural data for each object class in
each location into 5x5 accuracy maps. We also collected the same data for our models. This
involved presenting the models with each object exemplar at each location and recording
their classification performance for each position into accuracy maps.
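As a sketch, assuming each trial is stored as a grid position plus a correctness flag (the data format and names here are illustrative, not the actual recording pipeline):

import numpy as np

def accuracy_map(trials, grid=5):
    # trials: iterable of (row, col, correct) tuples for one observer and
    # category; every location is assumed to be sampled at least once
    hits = np.zeros((grid, grid))
    counts = np.zeros((grid, grid))
    for row, col, correct in trials:
        hits[row, col] += correct
        counts[row, col] += 1
    return hits / counts  # proportion correct at each of the 25 locations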
Data analysis
Orientation selectivity
To determine the orientation selectivity of units in the first layer of the models, we present
grating stimuli at 8 angles (equally spaced between 0 and 180 degrees) to the models and
record the elicited activity. The gratings have a spatial wavelength of 3 pixels, allowing for >1
cycle within each receptive field in the first layer of our networks. For each grating angle, and
for each network unit, we present various phases, and pick the phase that maximizes the
unit’s activation response to the grating, to find the best alignment between the stimulus and
weight kernel. We combine the resulting 8 activities (one per angle) vectorially, projecting
each angle onto a circle, multiplying it by the corresponding activity, and taking a
weighted sum of these vectors (a widespread method for measuring orientation selectivity in
electrophysiology 77,78 ). Intuitively, if all stimulus angles elicit similar activities, these vectors
will cancel out, while there will be a clear winner otherwise. To visualise orientation maps (as
well as the entropy and category selectivity maps) for CNNs, the kernels are flattened into a
2D sheet in a manner identical to the flattening of the All-TNN prior to training.
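A sketch of this vector-sum computation is given below; doubling the angles accounts for the 180-degree periodicity of orientation (a standard step in this method), and all names are our assumptions:

import numpy as np

def preferred_orientation(responses, angles_deg):
    # responses: (n_angles, n_units) max-over-phase activities, assumed
    # non-negative (e.g. post-ReLU); angles_deg spans [0, 180)
    theta = np.deg2rad(2 * np.asarray(angles_deg))  # orientation is 180-periodic
    vecs = (responses * np.exp(1j * theta)[:, None]).sum(axis=0)
    preferred = (np.rad2deg(np.angle(vecs)) / 2) % 180  # back to [0, 180)
    # ~0 if all angles respond equally (vectors cancel), ~1 if sharply tuned
    tuning_strength = np.abs(vecs) / (responses.sum(axis=0) + 1e-8)
    return preferred, tuning_strength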
Cortical Magnification
To quantify the diversity of selectivities at each location in the first layer of our models, we
calculated entropy in a 3x3 sliding window for each retinotopic position on the orientation
selectivity maps. Units that do not respond to any of the grating stimuli are excluded from
this analysis. Preferred orientations, computed as described above, are discretized into 8
equally spaced orientation bins, which are used to compute the entropy. This yields a map
showing how varied unit responses are (e.g. the entropy is low if all units in the sliding
window have similar preferred orientations).
To test if the network can afford to lose units in regions with lower entropy (i.e. more
redundant coding), we perform an entropy-based lesion experiment. We use the entropy
map described above, lesion the 50% lowest-entropy units, and measure the
performance of the lesioned network on the validation set of ecoset. As controls, we perform
the same test, but lesion the 50% highest entropy units.
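A sketch of the sliding-window entropy map (window size, binning and the exclusion of unresponsive units follow the description above; the names are ours):

import numpy as np

def entropy_map(pref, n_bins=8, win=3):
    # pref: (rows, cols) preferred orientations in [0, 180); NaN marks units
    # that do not respond to any grating
    bins = np.floor(pref / (180.0 / n_bins))  # NaN propagates through floor
    rows, cols = bins.shape
    out = np.full((rows, cols), np.nan)
    r = win // 2
    for i in range(r, rows - r):
        for j in range(r, cols - r):
            window = bins[i - r:i + r + 1, j - r:j + r + 1].ravel()
            valid = window[~np.isnan(window)]
            if valid.size == 0:
                continue
            p = np.bincount(valid.astype(int), minlength=n_bins) / valid.size
            p = p[p > 0]
            out[i, j] = -(p * np.log2(p)).sum()  # Shannon entropy in bits
    return out

# lesions then zero out the 50% of units with the lowest (or highest) entropy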
Category selectivity
We computed the selectivity of each unit in the last layer of our networks to scenes, tools
and faces using the d’ signal detection measure (see subsection Stimuli for the images we
used):
$$d' = \frac{|\mu_{in} - \mu_{out}|}{\mathrm{mean}(\sigma_{in}, \sigma_{out})}$$
where $\mu_{in}$ and $\sigma_{in}$ are the mean and variance of the activations of the unit in response to
stimuli of the category of interest, and $\mu_{out}$ and $\sigma_{out}$ are the mean and variance of the
activations of the unit in response to stimuli from the other categories. The variances of both
distributions are assumed to be in a similar range; we therefore average over the variances of
the two distributions in the denominator.
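A direct transcription of this measure (note that, following the text, σ here denotes the variance of the activation distribution):

import numpy as np

def d_prime(acts_in, acts_out):
    # acts_in: one unit's activations to stimuli of the category of interest
    # acts_out: the same unit's activations to stimuli of all other categories
    mu_in, mu_out = acts_in.mean(), acts_out.mean()
    sigma_in, sigma_out = acts_in.var(), acts_out.var()
    return np.abs(mu_in - mu_out) / ((sigma_in + sigma_out) / 2)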
Positional occurrence maps
We generated maps that quantify the frequency of occurrence at each image location of
each of the 16 COCO categories we use in the behavioural experiment (see Fig. S4a). To do
so, we use the binary segmentation masks provided by COCO for each category, and take
their average across the whole dataset. These maps are then downsampled to 5x5 using
average pooling, to match the dimensionality of accuracy maps derived from behavioural
experiments.
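A sketch of this two-step computation, assuming the binary masks of one category have been stacked into a single array (names and layout are ours):

import numpy as np

def occurrence_map(masks, grid=5):
    # masks: (n_images, H, W) binary segmentation masks for one COCO category
    freq = masks.mean(axis=0)          # per-pixel occurrence frequency
    h, w = freq.shape
    h, w = h - h % grid, w - w % grid  # crop so the grid divides evenly
    # average-pool to a grid x grid map matching the 5x5 accuracy maps
    pooled = freq[:h, :w].reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
    return freq, pooled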
Positional Uncertainty
Positional uncertainty is derived from the positional occurrence maps described above
before downsampling to 5x5. We define positional uncertainty as the area of the locations for
which the occurrence frequency is larger than 90% of the highest frequency position (see
examples in Fig. S5). A small positional uncertainty thus means that the stimulus instances
of an object category often occur in the same positions, while a large positional uncertainty
means that the object appears in varied positions.
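Given the full-resolution frequency map from the sketch above, positional uncertainty reduces to a thresholded area; expressing the area as a fraction of the image (rather than in pixels) is our choice here:

def positional_uncertainty(freq):
    # freq: full-resolution occurrence frequency map (before downsampling)
    return float((freq > 0.9 * freq.max()).mean())  # fraction of image area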
Accuracy Ratio
The accuracy ratio measures the overall magnitude of positional effects on behavioural
performance as the accuracy at the location with the best accuracy divided by the accuracy
at the location with the worst accuracy in the accuracy map.
$$\mathrm{accuracy\;ratio} = \frac{\mathrm{accuracy}_{best}}{\mathrm{accuracy}_{worst}}$$
Accuracy Map Agreement
Accuracy map agreement quantifies the alignment of positional dependencies in visual
behaviour between humans and All-TNNs/CNNs. We computed the Pearson correlation
coefficient between the corresponding maps for humans and All-TNNs/CNNs for each of our
16 object categories. The mean noise-ceiling-corrected (see below) correlation score is
calculated across these 16 categorical accuracy maps, with significance assessed using a
permutation test.
Accuracy Dissimilarity Matrix Agreement
We use Accuracy Dissimilarity Matrices (ADMs) to quantify the dissimilarity between pairs of
accuracy maps of different categories, in a similar spirit to the well-known Representational
Dissimilarity Matrices (RDMs) 45 , which compare the dissimilarity between representations of
pairs of stimuli. Objects yielding behaviourally distinct accuracy maps have high dissimilarity
in our ADM, and vice-versa. ADMs were created using the Pearson correlation distance
between each pair of category accuracy maps. To quantify the agreement between model
and human ADMs, we use noise-ceiling-corrected (see below) Spearman correlations
between the ADMs of participants (n=30) and All-TNNs (n=10) or CNNs (n=10), using
permutation tests. A higher correlation indicates a higher alignment of the structure of
class-wise positional dependencies.
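A sketch of the ADM construction and comparison; restricting the Spearman correlation to the off-diagonal upper triangle (ADMs are symmetric) is our assumption:

import numpy as np
from scipy.stats import spearmanr

def adm(acc_maps):
    # acc_maps: (n_categories, 25) flattened accuracy maps
    return 1.0 - np.corrcoef(acc_maps)  # Pearson correlation distance

def adm_agreement(adm_a, adm_b):
    iu = np.triu_indices_from(adm_a, k=1)  # off-diagonal upper triangle
    return spearmanr(adm_a[iu], adm_b[iu]).correlation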
Noise Ceiling Analysis
Behaviour is noisy, and therefore even humans do not correlate perfectly with each other.
Consequently, we cannot expect models to correlate perfectly with our participants either.
To account for this, we compute a noise ceiling by iteratively leaving out one participant and
computing how well their data correlate with those of the other 29. This yields 30 correlations, of which we take
the mean (this is technically referred to as the noise ceiling lower bound). We then divide our
model-human correlations by this value.
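A sketch of the lower-bound noise ceiling, assuming each left-out participant is correlated with the average map of the remaining 29 (the standard leave-one-out construction; names are ours):

import numpy as np

def noise_ceiling(human_maps):
    # human_maps: (n_subjects, 25) flattened accuracy maps for one category
    rs = []
    for s in range(len(human_maps)):
        rest = np.delete(human_maps, s, axis=0).mean(axis=0)
        rs.append(np.corrcoef(human_maps[s], rest)[0, 1])
    return np.mean(rs)  # lower bound of the noise ceiling

# corrected agreement = r(model map, human map) / noise_ceiling(human_maps)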
Stereotypical unit activation patterns and link to behaviour
To ask whether the drop in performance at different locations of the 5x5 accuracy map can
be explained by the fact that All-TNNs fail to engage the appropriate unit population, we
extracted activation maps from the last layer of All-TNNs and CNN control models in
response to both ecoset test set images for our 16 classes, and stimuli from COCO
displayed on the 5x5 experimental grid. We consider the responses to ecoset test images
“stereotypical”, and quantify the engagement of these stereotypical unit activation patterns
for each class and location in the experimental 5x5 grid. In detail, for each of our 16 classes,
we compute the cosine similarity between the “stereotypical” activation map and the
“experimental” activation map elicited by presenting our experimental stimuli at each
location. We then relate this stereotypical engagement to classification accuracy by using a
linear regression with bootstrapping.
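A sketch of the engagement measure, with scipy's linregress standing in for the bootstrapped regression described above (names are ours):

import numpy as np
from scipy.stats import linregress

def engagement(stereotypical_map, experimental_map):
    # cosine similarity between two flattened last-layer activation maps
    a, b = stereotypical_map.ravel(), experimental_map.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# pooled across categories and the 25 grid locations:
# slope = linregress(engagement_values, accuracy_values).slope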
Code and data availability
All analyses of human and model data were performed in custom Python software, making
use of Numpy and/or scikit-learn packages. The code and data required to reproduce our
results will be released upon journal publication of this paper.
Acknowledgements
The authors acknowledge support by the following grants: A.D. is supported by SNF grant
n.203018. T.C.K., V.B., and A.D. are supported by the ERC STG grant 101039524 TIME.
D.K. is supported by the Deutsche Forschungsgemeinschaft (SFB/TRR 135, project number
222641018), an ERC Starting Grant (ERC-2022-STG 101076057), and by “The Adaptive
Mind,” funded by the Excellence Program of the Hessian Ministry of Higher Education,
Science, Research and Art. Z.L. is supported by a CSC grant (202106120015) and R.M.C. is
supported by DFG grants CI241/1-1, CI241/3-1, CI241/7-1 and ERC STG grant 803370.
Bibliography
1. Doerig, A. et al. The neuroconnectionist research programme. Nat. Rev. Neurosci. 1–20 (2023) doi:10.1038/s41583-023-00705-w.
2. Cichy, R. M. & Kaiser, D. Deep Neural Networks as Scientific Models. Trends Cogn. Sci. 23, 305–317 (2019).
3. Richards, B. A. et al. A deep learning framework for neuroscience. Nat. Neurosci. 22, 1761–1770 (2019).
4. Khaligh-Razavi, S.-M. & Kriegeskorte, N. Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation. PLoS Comput. Biol. 10, e1003915 (2014).
5. Yamins, D., Hong, H., Cadieu, C. & DiCarlo, J. J. Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream. Adv. Neural Inf. Process. Syst. (2013).
6. Yamins, D. L. K. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci. 111, 8619–8624 (2014).
7. Güçlü, U. & Gerven, M. A. J. van. Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream. J. Neurosci. 35, 10005–10014 (2015).
8. Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V. & McDermott, J. H. A Task-Optimized Neural Network Replicates Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cortical Processing Hierarchy. Neuron 98, 630–644.e16 (2018).
9. Spoerer, C. J., Kietzmann, T. C., Mehrer, J., Charest, I. & Kriegeskorte, N. Recurrent neural networks can explain flexible trading of speed and accuracy in biological vision. PLOS Comput. Biol. 16, e1008215 (2020).
10. Kar, K., Kubilius, J., Schmidt, K., Issa, E. B. & DiCarlo, J. J. Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior. Nat. Neurosci. 22, 974–983 (2019).
11. Rust, N. C. & Mehrpour, V. Understanding Image Memorability. Trends Cogn. Sci. 24, 557–568 (2020).
12. Kaschube, M. et al. Universality in the Evolution of Orientation Columns in the Visual Cortex. Science 330, 1113–1116 (2010).
13. Najafian, S. et al. A theory of cortical map formation in the visual brain. Nat. Commun. 13, 2303 (2022).
14. Hubel, D. H. & Wiesel, T. N. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. 160, 106–154 (1962).
15. Jung, Y. J. et al. Orientation pinwheels in primary visual cortex of a highly visual marsupial. Sci. Adv. 8, eabn0954 (2022).
16. Hubel, D. H. & Wiesel, T. N. Ferrier lecture - Functional architecture of macaque monkey visual cortex. Proc. R. Soc. Lond. B Biol. Sci. 198, 1–59 (1977).
17. Kanwisher, N., McDermott, J. & Chun, M. M. The Fusiform Face Area: A Module in Human Extrastriate Cortex Specialized for Face Perception. J. Neurosci. 17, 4302–4311 (1997).
18. Tsao, D. Y., Freiwald, W. A., Tootell, R. B. H. & Livingstone, M. S. A cortical region consisting entirely of face-selective cells. Science 311, 670–674 (2006).
19. Peelen, M. V. & Downing, P. E. Selectivity for the human body in the fusiform gyrus. J. Neurophysiol. 93, 603–608 (2005).
20. Dilks, D. D., Julian, J. B., Paunov, A. M. & Kanwisher, N. The Occipital Place Area Is Causally and Selectively Involved in Scene Perception. J. Neurosci. 33, 1331–1336 (2013).
21. Nasr, S. et al. Scene-Selective Cortical Regions in Human and Nonhuman Primates. J. Neurosci. 31, 13771–13785 (2011).
22. Epstein, R. & Kanwisher, N. A cortical representation of the local visual environment. Nature 392, 598–601 (1998).
23. Deen, B. et al. Organization of high-level visual cortex in human infants. Nat. Commun. 8, 13995 (2017).
24. Tanaka, K. Columns for complex visual object features in the inferotemporal cortex: clustering of cells with similar but slightly different stimulus selectivities. Cereb. Cortex 13, 90–99 (2003).
25. Konkle, T. & Caramazza, A. Tripartite Organization of the Ventral Stream by Animacy and Object Size. J. Neurosci. 33, 10235–10242 (2013).
26. Kaiser, D., Quek, G. L., Cichy, R. M. & Peelen, M. V. Object Vision in a Structured World. Trends Cogn. Sci. 23, 672–685 (2019).
27. Kaiser, D. & Cichy, R. M. Typical visual-field locations enhance processing in object-selective channels of human occipital cortex. J. Neurophysiol. 120, 848–853 (2018).
28. Bar, M. Visual objects in context. Nat. Rev. Neurosci. 5, 617–629 (2004).
29. Song, H. F., Kennedy, H. & Wang, X.-J. Spatial embedding of structural similarity in the cerebral cortex. Proc. Natl. Acad. Sci. 111, 16580–16585 (2014).
30. Blauch, N. M., Behrmann, M. & Plaut, D. C. A connectivity-constrained computational account of topographic organization in primate high-level visual cortex. Proc. Natl. Acad. Sci. 119, e2112566119 (2022).
31. Finzi, D. et al. Differential spatial computations in ventral and lateral face-selective regions are scaffolded by structural connections. Nat. Commun. 12, 2278 (2021).
32. Dumoulin, S. O. & Wandell, B. A. Population receptive field estimates in human visual cortex. NeuroImage 39, 647–660 (2008).
33. Aflalo, T. N. & Graziano, M. S. A. Organization of the Macaque Extrastriate Visual Cortex Re-Examined Using the Principle of Spatial Continuity of Function. J. Neurophysiol. 105, 305–320 (2011).
34. Mehrer, J., Spoerer, C. J., Jones, E. C., Kriegeskorte, N. & Kietzmann, T. C. An ecologically motivated image dataset for deep learning yields better models of human vision. Proc. Natl. Acad. Sci. 118, e2011417118 (2021).
35. Ellis, C. T. et al. Retinotopic organization of visual cortex in human infants. Neuron 109, 2616–2626.e6 (2021).
36. Arcaro, M. J., Schade, P. F., Vincent, J. L., Ponce, C. R. & Livingstone, M. S. Seeing faces is necessary for face-patch formation. Nat. Neurosci. 20, 1404–1412 (2017).
37. Cowey, A. & Rolls, E. T. Human cortical magnification factor and its relation to visual acuity. Exp. Brain Res. 21, 447–454 (1974).
38. Levi, D. M., Klein, S. A. & Aitsebaomo, A. P. Vernier acuity, crowding and cortical magnification. Vision Res. 25, 963–977 (1985).
39. Provis, J. M., Dubis, A. M., Maddess, T. & Carroll, J. Adaptation of the Central Retina for High Acuity Vision: Cones, the Fovea and the Avascular Zone. Prog. Retin. Eye Res. 35, 63–81 (2013).
40. Kanwisher, N. Functional specificity in the human brain: A window into the functional architecture of the mind. Proc. Natl. Acad. Sci. 107, 11163–11170 (2010).
41. Bao, P., She, L., McGill, M. & Tsao, D. Y. A map of object space in primate inferotemporal cortex. Nature 583, 103–108 (2020).
42. de Haas, B. et al. Perception and Processing of Faces in the Human Brain Is Tuned to Typical Feature Locations. J. Neurosci. 36, 9289–9302 (2016).
43. Lin, T.-Y. et al. Microsoft COCO: Common Objects in Context. Preprint at http://arxiv.org/abs/1405.0312 (2015).
44. Fahrenfort, J. J., Scholte, H. S. & Lamme, V. A. F. Masking disrupts reentrant processing in human visual cortex. J. Cogn. Neurosci. 19, 1488–1497 (2007).
45. Nili, H. et al. A Toolbox for Representational Similarity Analysis. PLOS Comput. Biol. 10, e1003553 (2014).
46. Malach, R., Amir, Y., Harel, M. & Grinvald, A. Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocytin injections in primate striate cortex. Proc. Natl. Acad. Sci. 90, 10469–10473 (1993).
47. Gilbert, C. D. & Wiesel, T. N. Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci. 9, 2432–2442 (1989).
48. Fitzpatrick, D. The Functional Organization of Local Circuits in Visual Cortex: Insights from the Study of Tree Shrew Striate Cortex. Cereb. Cortex 6, 329–341 (1996).
49. Olshausen, B. A. & Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996).
50. Ali, A., Ahmad, N., Groot, E. de, Gerven, M. A. J. van & Kietzmann, T. C. Predictive coding is a consequence of energy efficiency in recurrent neural networks. Patterns 3 (2022).
51. Ratan Murty, N. A. et al. Visual experience is not necessary for the development of face-selectivity in the lateral fusiform gyrus. Proc. Natl. Acad. Sci. 117, 23011–23020 (2020).
52. Eggermont, J. J. The Role of Sound in Adult and Developmental Auditory Cortical Plasticity. Ear Hear. 29, 819–829 (2008).
53. Ibbotson, M. & Jung, Y. J. Origins of Functional Organization in the Visual Cortex. Front. Syst. Neurosci. 14, 10 (2020).
54. Kaiser, D. & Cichy, R. M. Typical visual-field locations facilitate access to awareness for everyday objects. Cognition 180, 118–122 (2018).
55. Mahon, B. Z. & Caramazza, A. What drives the organization of object knowledge in the brain? The distributed domain-specific hypothesis. Trends Cogn. Sci. 15, 97–103 (2011).
56. Bonner, M. F. & Epstein, R. A. Coding of navigational affordances in the human visual system. Proc. Natl. Acad. Sci. 114, 4793–4798 (2017).
57. Op de Beeck, H. P., Pillet, I. & Ritchie, J. B. Factors Determining Where Category-Selective Areas Emerge in Visual Cortex. Trends Cogn. Sci. 23, 784–797 (2019).
58. Liu, T. T. & Behrmann, M. Functional outcomes following lesions in visual cortex: Implications for plasticity of high-level vision. Neuropsychologia 105, 197–214 (2017).
59. Silvanto, J. & Cattaneo, Z. Common framework for “virtual lesion” and state-dependent TMS: The facilitatory/suppressive range model of online TMS effects on behavior. Brain Cogn. 119, 32–38 (2017).
60. Doshi, F. R. & Konkle, T. Cortical topographic motifs emerge in a self-organized map of object space. Sci. Adv. 9, eade8187 (2023).
61. Margalit, E. et al. A Unifying Principle for the Functional Organization of Visual Cortex. Preprint at https://doi.org/10.1101/2023.05.18.541361 (2023).
62. Kanwisher, N., Gupta, P. & Dobs, K. CNNs reveal the computational implausibility of the expertise hypothesis. iScience 26, 105976 (2023).
63. Keller, T. A. & Welling, M. Topographic VAEs learn Equivariant Capsules. in Advances in Neural Information Processing Systems vol. 34, 28585–28597 (Curran Associates, Inc., 2021).
64. Zhang, Y., Zhou, K., Bao, P. & Liu, J. Principles governing the topological organization of object selectivities in ventral temporal cortex. Preprint at https://doi.org/10.1101/2021.09.15.460220 (2021).
65. Dobs, K., Martinez, J., Kell, A. J. E. & Kanwisher, N. Brain-like functional specialization emerges spontaneously in deep neural networks. Sci. Adv. 8, eabl8913 (2022).
66. Lindsey, J., Ocko, S. A., Ganguli, S. & Deny, S. A Unified Theory of Early Visual Representations from Retina to Cortex through Anatomically Constrained Deep CNNs. Preprint at https://doi.org/10.48550/arXiv.1901.00945 (2019).
67. Kohonen, T. Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59–69 (1982).
68. Swindale, N. V. & Bauer, H.-U. Application of Kohonen’s self-organizing feature map algorithm to cortical maps of orientation and direction preference. Proc. R. Soc. B Biol. Sci. 265, 827–838 (1998).
69. Koulakov, A. A. & Chklovskii, D. B. Orientation Preference Patterns in Mammalian Visual Cortex: A Wire Length Minimization Approach. Neuron 29, 519–527 (2001).
70. Weigand, M., Sartori, F. & Cuntz, H. Universal transition from unstructured to structured neural maps. Proc. Natl. Acad. Sci. 114, E4057–E4064 (2017).
71. Zhuang, C. et al. Unsupervised neural network models of the ventral visual stream. Proc. Natl. Acad. Sci. 118 (2021).
72. Konkle, T. & Alvarez, G. A. A self-supervised domain-general learning framework for human ventral stream representation. Nat. Commun. 13, 491 (2022).
73. Storrs, K. R., Anderson, B. L. & Fleming, R. W. Unsupervised learning predicts human perception and misperception of gloss. Nat. Hum. Behav. 5, 1402–1417 (2021).
74. Mehrer, J., Spoerer, C. J., Kriegeskorte, N. & Kietzmann, T. C. Individual differences among deep neural network models. Nat. Commun. 11, 5725 (2020).
75. Parkhi, O. M., Vedaldi, A. & Zisserman, A. Deep Face Recognition. in Proceedings of the British Machine Vision Conference 2015, 41.1–41.12 (British Machine Vision Association, 2015). doi:10.5244/C.29.41.
76. Brainard, D. H. The Psychophysics Toolbox. Spat. Vis. 10, 433–436 (1997).
77. Hübener, M., Shoham, D., Grinvald, A. & Bonhoeffer, T. Spatial Relationships among Three Columnar Systems in Cat Area 17. J. Neurosci. 17, 9270–9284 (1997).
78. Kaschube, M., Schnabel, M. & Wolf, F. Self-organization and the selection of pinwheel density in visual cortical development. New J. Phys. 10, 015009 (2008).
Supplementary Material
Figure S1 | Orientation selectivity maps across training epochs and model instances.
a. The V1-like organisation of orientation selectivities in the first layer of All-TNNs emerges
within the first training epochs and remains stable thereafter. b. The organisation of
orientation selectivities in All-TNNs is consistent across all trained network seeds. c.
Unstructured, salt-and-pepper orientation selectivity maps emerge in all trained CNN seeds.
d. Salt-and-pepper orientation selectivity maps likewise emerge in all trained LCN seeds.
Figure S2 | Entropy analysis across model instances. a. The cortical magnification
visible in the entropy maps of the first layer of All-TNNs emerges consistently across
network seeds. b. Homogeneous, unstructured entropy maps emerge in all trained CNN
seeds. c. Homogeneous entropy maps emerge in all trained LCN seeds.
Figure S3 | Category selectivity maps across training epochs and model instances. a.
The clustering of high-level category selectivities (d′) for tools, scenes, and faces in the last
layer of All-TNNs emerges over the course of training. b. The emergence of
category-selective clusters in All-TNNs is consistent across all trained network seeds. c.
Category selectivity maps are unstructured in all trained CNN seeds. d. Category selectivity
maps are unstructured in all trained LCN seeds.
Figure S4 | Categorical spatial priors and accuracy maps. a. High-resolution spatial
occurrence frequency maps for the different object categories, derived from the COCO
dataset, are downsampled to 5x5 resolution to match the resolution of the accuracy maps
obtained in the behavioural experiment. b. Accuracy maps for all 16 categories for the
human average, All-TNNs, and CNNs.
Figure S5 | Positional uncertainty. Categorical positional uncertainty is computed as the
size of the area in which a category's occurrence frequency exceeds a threshold (90% of
the maximum frequency) in its positional occurrence map. The larger this high-frequency
region, the greater the categorical positional uncertainty.
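For concreteness, a minimal sketch of this thresholding (assuming the area is counted in map positions; the function name is ours):

```python
import numpy as np

def positional_uncertainty(occurrence_map, rel_threshold=0.9):
    """Number of map positions whose occurrence frequency exceeds 90% of
    the map's maximum; a larger area means a more spatially diffuse
    (more uncertain) typical location for that category."""
    return int(np.sum(occurrence_map >= rel_threshold * occurrence_map.max()))
```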