Uninformed Students: Student–Teacher Anomaly Detection
with Discriminative Latent Embeddings
Paul Bergmann
Michael Fauser
David Sattlegger
Carsten Steger
MVTec Software GmbH
www.mvtec.com
{paul.bergmann, fauser, sattlegger, steger}@mvtec.com
Abstract
We introduce a powerful student–teacher framework for
the challenging problem of unsupervised anomaly detection
and pixel-precise anomaly segmentation in high-resolution
images. Student networks are trained to regress the out-
put of a descriptive teacher network that was pretrained on
a large dataset of patches from natural images. This cir-
cumvents the need for prior data annotation. Anomalies
are detected when the outputs of the student networks differ from the output of the teacher network. This happens when
they fail to generalize outside the manifold of anomaly-free
training data. The intrinsic uncertainty in the student net-
works is used as an additional scoring function that indi-
cates anomalies. We compare our method to a large number
of existing deep learning based methods for unsupervised
anomaly detection. Our experiments demonstrate improve-
ments over state-of-the-art methods on a number of real-
world datasets, including the recently introduced MVTec
Anomaly Detection dataset that was specifically designed
to benchmark anomaly segmentation algorithms.
1. Introduction
Unsupervised pixel-precise segmentation of regions that
appear anomalous or novel to a machine learning model is
an important and challenging task in many domains of com-
puter vision. In automated industrial inspection scenarios,
it is often desirable to train models solely on a single class
of anomaly-free images to segment defective regions during
inference. In an active learning setting, regions that are de-
tected as previously unknown by the current model can be
included in the training set to improve the model’s perfor-
mance.
Recently, efforts have been made to improve anomaly
detection for one-class or multi-class classification [2, 3,
10, 11, 21, 28, 29]. However, these algorithms assume that
anomalies manifest themselves in the form of images of an entirely different class, and that only a simple binary image-level decision must be made as to whether or not an image is anomalous.

Figure 1: Qualitative results of our anomaly detection method on the MVTec Anomaly Detection dataset. Top row: Input images containing defects. Center row: Ground truth regions of defects in red. Bottom row: Anomaly scores for each image pixel predicted by our algorithm.
Little work has been directed towards the development of
methods that can segment anomalous regions that only dif-
fer in a very subtle way from the training data. Bergmann
et al. [7] provide benchmarks for several state-of-the-art al-
gorithms and identify considerable room for improvement.
Existing work predominantly focuses on generative al-
gorithms such as Generative Adversarial Networks (GANs)
[31, 32] or Variational Autoencoders (VAEs) [5, 36]. These
detect anomalies using per-pixel reconstruction errors or by
evaluating the density obtained from the model’s probabil-
ity distribution. This has been shown to be problematic
due to inaccurate reconstructions or poorly calibrated like-
lihoods [8, 22].
The performance of many supervised computer vision
algorithms [16, 34] is improved by transfer learning, i.e. by
using discriminative embeddings from pretrained networks.
For unsupervised anomaly detection, such approaches have
not been thoroughly explored so far. Recent work suggests
that these feature spaces generalize well for anomaly detection and that even simple baselines outperform generative deep learning approaches [10, 26]. However, the performance of existing methods on large high-resolution image datasets is hampered by the use of shallow machine learning pipelines that require a dimensionality reduction of the used feature space. Moreover, they rely on heavy subsampling of the training data, since their capacity does not suffice to model highly complex data distributions with a large number of training samples.

Figure 2: Schematic overview of our approach. Input images are fed through a teacher network that densely extracts features for local image regions. An ensemble of M student networks is trained to regress the output of the teacher on anomaly-free data. During inference, the students yield increased regression errors e and predictive uncertainties v in pixels whose receptive field covers anomalous regions. Anomaly maps generated with different receptive fields can be combined for anomaly segmentation at multiple scales.
We propose to circumvent these limitations of shallow
models by implicitly modeling the distribution of train-
ing features with a student–teacher approach. This lever-
ages the high capacity of deep neural networks and frames
anomaly detection as a feature regression problem. Given a
descriptive feature extractor pretrained on a large dataset of
patches from natural images (the teacher), we train an en-
semble of student networks on anomaly-free training data
to mimic the teacher’s output. During inference, the stu-
dents’ predictive uncertainty together with their regression
error with respect to the teacher are combined to yield
dense anomaly scores for each input pixel. Our intuition
is that students will generalize poorly outside the manifold
of anomaly-free training data and start to make wrong pre-
dictions. Figure 1 shows qualitative results of our method
when applied to images selected from the MVTec Anomaly
Detection dataset [7]. A schematic overview of the entire
anomaly detection process is given in Figure 2. Our main
contributions are:
• We propose a novel framework for unsupervised
anomaly detection based on student–teacher learning.
Local descriptors from a pretrained teacher network
serve as surrogate labels for an ensemble of students.
Our models can be trained end-to-end on large un-
labeled image datasets and make use of all available
training data.
• We introduce scoring functions based on the students’
predictive variance and regression error to obtain dense
anomaly maps for the segmentation of anomalous re-
gions in natural images. We describe how to extend
our approach to segment anomalies at multiple scales
by adapting the students’ and teacher’s receptive fields.
• We demonstrate state-of-the-art performance on three
real-world computer vision datasets. We compare our
method to a number of shallow machine learning clas-
sifiers and deep generative models that are fitted di-
rectly to the teacher’s feature distribution. We also
compare it to recently introduced deep learning based
methods for unsupervised anomaly segmentation.
2. Related Work
There exists an abundance of literature on anomaly de-
tection [27]. Deep learning based methods for the segmen-
tation of anomalies strongly focus on generative models
such as autoencoders [1, 8] or GANs [32]. These attempt
to learn representations from scratch, leveraging no prior
knowledge about the nature of natural images, and segment
anomalies by comparing the input image to a reconstruction
in pixel space. This can result in poor anomaly detection
performance due to simple per-pixel comparisons or imper-
fect reconstructions [8].
2.1. Anomaly Detection with Pretrained Networks
Promising results have been achieved by transferring dis-
criminative embedding vectors of pretrained networks to the
task of anomaly detection by fitting shallow machine learn-
ing models to the features of anomaly-free training data.
Andrews et al. [3] use activations from different layers of a
pretrained VGG network and model the anomaly-free train-
ing distribution with a ν-SVM. However, they only apply
their method to image classification and do not consider the
segmentation of anomalous regions. Similar experiments
have been performed by Burlina et al. [10]. They report su-
perior performance of discriminative embeddings compared
to feature spaces obtained from generative models.
Nazare et al. [24] investigate the performance of dif-
ferent off-the-shelf feature extractors pretrained on an im-
age classification task for the segmentation of anomalies
in surveillance videos. Their approach trains a 1-Nearest-
Neighbor (1-NN) classifier on embedding vectors extracted
from a large number of anomaly-free training patches. Prior
to the training of the shallow classifier, the dimensional-
ity of the network’s activations is reduced using Principal
Component Analysis (PCA). To obtain a spatial anomaly
map during inference, the classifier must be evaluated for
a large number of overlapping patches, which quickly be-
comes a performance bottleneck and results in rather coarse
anomaly maps. Similarly, Napoletano et al. [23] extract
activations from a pretrained ResNet-18 for a large num-
ber of cropped training patches and model their distribution
using K-Means clustering after prior dimensionality reduc-
tion with PCA. They also perform strided evaluation of test
images during inference. Both approaches sample training
patches from the input images and therefore do not make
use of all possible training features. This is necessary since,
in their framework, feature extraction is computationally
expensive due to the use of very deep networks that out-
put only a single descriptor per patch. Furthermore, since
shallow models are employed for learning the feature distri-
bution of anomaly-free patches, the available training infor-
mation must be strongly reduced.
To circumvent the need for cropping patches and to
speed up feature extraction, Sabokrou et al. [30] extract de-
scriptors from early feature maps of a pretrained AlexNet
in a fully convolutional fashion and fit a unimodal Gaussian
distribution to all available training vectors of anomaly-free
images. Even though feature extraction is achieved more ef-
ficiently in their framework, pooling layers lead to a down-
sampling of the input image. This strongly decreases the
resolution of the final anomaly map, especially when using
descriptive features of deeper network layers with larger re-
ceptive fields. In addition, unimodal Gaussian distributions
will fail to model the training feature distribution as soon as
the problem complexity rises.
2.2. Open-Set Recognition with Uncertainty Estimates
Our work draws some inspiration from the recent success
of open-set recognition in supervised settings such as image
classification or semantic segmentation, where uncertainty
estimates of deep neural networks have been exploited to
detect out-of-distribution inputs using MC Dropout [14] or
deep ensembles [19]. Seeböck et al. [33] demonstrate that
uncertainties from segmentation networks trained with MC
Dropout can be used to detect anomalies in retinal OCT im-
ages. Beluch et al. [6] show that the variance of network
ensembles trained on an image classification task serves as
an effective acquisition function for active learning. Inputs
that appear anomalous to the current model are added to the
training set to quickly enhance its performance.
Such algorithms, however, demand prior labeling of im-
ages by domain experts for a supervised task, which is not
always possible or desirable. In our work, we utilize feature
vectors of pretrained networks as surrogate labels for the
training of an ensemble of student networks. The predictive
variance together with the regression error of the ensem-
ble’s output mixture distribution is then used as a scoring
function to segment anomalous regions in test images.
3. Student–Teacher Anomaly Detection
This section describes the core principles of our
proposed method.
Given a training dataset $\mathcal{D} = \{I_1, I_2, \ldots, I_N\}$ of anomaly-free images, our goal is to create an ensemble of student networks $S_i$ that can later detect anomalies in test images $J$. This means that they can assign a score to each pixel indicating how much it deviates from the training data manifold. For this, the student models are trained against regression targets obtained from a descriptive teacher network $T$ pretrained on a large dataset of natural images. After training, anomaly scores can be derived for each image pixel from the students' regression error and predictive variance. Given an input image $I \in \mathbb{R}^{w \times h \times C}$ of width $w$, height $h$, and number of channels $C$, each student $S_i$ in the ensemble outputs a feature map $S_i(I) \in \mathbb{R}^{w \times h \times d}$. It contains descriptors $y_{(r,c)} \in \mathbb{R}^d$ of dimension $d$ for each input image pixel at row $r$ and column $c$. By design, we limit the students' receptive field, such that $y_{(r,c)}$ describes a square local image region $\mathbf{p}_{(r,c)}$ of $I$ of side length $p$, centered at $(r, c)$. The teacher $T$ has the same network architecture as the student networks. However, it remains constant and extracts descriptive embedding vectors for each pixel of the input image $I$ that serve as deterministic regression targets during student training.
3.1. Learning Local Patch Descriptors
We begin by describing how to efficiently construct a
descriptive teacher network T using metric learning and
knowledge distillation techniques. In existing work for
anomaly detection with pretrained networks, feature extrac-
tors only output single feature vectors for patch-sized inputs
or spatially heavily downsampled feature maps [23, 30].
In contrast, our teacher network T efficiently outputs de-
scriptors for every possible square of side length p within
the input image. T is obtained by first training a network
$\hat{T}$ to embed patch-sized images $\mathbf{p} \in \mathbb{R}^{p \times p \times C}$ into a metric space of dimension $d$ using only convolution and max-pooling layers. Fast dense local feature extraction for an entire input image can then be achieved by a deterministic network transformation of $\hat{T}$ to $T$ as described in [4]. This yields significant speedups compared to previously introduced methods that perform patch-based strided evaluations. To let $\hat{T}$ output semantically strong descriptors, we investigate both self-supervised metric learning techniques as well as distilling knowledge from a descriptive but computationally inefficient pretrained network. A large number of training patches $\mathbf{p}$ can be obtained by random crops from any image database. Here, we use ImageNet [18].

Figure 3: Pretraining of the teacher network $\hat{T}$ to output descriptive embedding vectors for patch-sized inputs. The knowledge of a powerful but computationally inefficient network $P$ is distilled into $\hat{T}$ by decoding the latent vectors to match the descriptors of $P$. We also experiment with embeddings obtained using self-supervised metric learning techniques based on triplet learning. Information within each feature dimension is maximized by decorrelating the feature dimensions within a minibatch.
Knowledge Distillation. Patch descriptors obtained from deep layers of CNNs trained on image classification tasks perform well for anomaly detection when modeling their distribution with shallow machine learning models [23, 24]. However, the architectures of such CNNs are usually highly complex and computationally inefficient for the extraction of local patch descriptors. Therefore, we distill the knowledge of a powerful pretrained network $P$ into $\hat{T}$ by matching the output of $P$ with a decoded version of the descriptor obtained from $\hat{T}$:

$$L_k(\hat{T}) = \| D(\hat{T}(\mathbf{p})) - P(\mathbf{p}) \|^2. \quad (1)$$

$D$ denotes a fully connected network that decodes the $d$-dimensional output of $\hat{T}$ to the output dimension of the pretrained network's descriptor.
Metric Learning. If for some reason pretrained networks are unavailable, one can also learn local image descriptors in a fully self-supervised way [12]. Here, we investigate the performance of discriminative embeddings obtained using triplet learning. For every randomly cropped patch $\mathbf{p}$, a triplet of patches $(\mathbf{p}, \mathbf{p}^+, \mathbf{p}^-)$ is augmented. Positive patches $\mathbf{p}^+$ are obtained by small random translations around $\mathbf{p}$, changes in image luminance, and the addition of Gaussian noise. The negative patch $\mathbf{p}^-$ is created by a random crop from a randomly chosen different image. In-triplet hard negative mining with anchor swapping [37] is used as a loss function for learning an embedding sensitive to the $\ell_2$ metric:

$$L_m(\hat{T}) = \max\{0, \delta + \delta^+ - \delta^-\}, \quad (2)$$

where $\delta > 0$ denotes the margin parameter and the in-triplet distances $\delta^+$ and $\delta^-$ are defined as:

$$\delta^+ = \|\hat{T}(\mathbf{p}) - \hat{T}(\mathbf{p}^+)\|^2 \quad (3)$$

$$\delta^- = \min\{\|\hat{T}(\mathbf{p}) - \hat{T}(\mathbf{p}^-)\|^2, \|\hat{T}(\mathbf{p}^+) - \hat{T}(\mathbf{p}^-)\|^2\} \quad (4)$$

Figure 4: Embedding vectors visualized for ten samples of the MNIST dataset. Larger circles around the students' mean predictions indicate increased predictive variance. Being only trained on a single class of training images, the students manage to accurately regress the features solely for this class (green). They yield large regression errors and predictive uncertainties for images of other classes (red). Anomaly scores for the entire dataset are displayed in the bottom histogram.
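A minimal sketch of the triplet objective in Eqs. (2)-(4), again in PyTorch with the illustrative `t_hat` embedding network from above; the anchor swap simply takes the harder (smaller) of the two available negative distances:

```python
import torch

def triplet_loss(t_hat, p, p_pos, p_neg, margin=1.0):
    """Eqs. (2)-(4): in-triplet hard negative mining with anchor
    swapping. p, p_pos, p_neg: patch batches of shape (B, C, p, p).
    The margin corresponds to delta in Eq. (2)."""
    a, pos, neg = t_hat(p), t_hat(p_pos), t_hat(p_neg)
    delta_pos = ((a - pos) ** 2).sum(dim=1)                # Eq. (3)
    delta_neg = torch.minimum(((a - neg) ** 2).sum(dim=1), # Eq. (4):
                              ((pos - neg) ** 2).sum(dim=1))  # anchor swap
    return torch.clamp(margin + delta_pos - delta_neg, min=0).mean()
```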
Descriptor Compactness. As proposed by Tian et al. [35], we minimize the correlation between descriptors within one minibatch of inputs $\mathbf{p}$ to increase the descriptors' compactness and remove unnecessary redundancy:

$$L_c(\hat{T}) = \sum_{i \neq j} c_{ij}, \quad (5)$$

where $c_{ij}$ denotes the entries of the correlation matrix computed over all descriptors $\hat{T}(\mathbf{p})$ in the current minibatch. The final training loss for $\hat{T}$ is then given as

$$L(\hat{T}) = \lambda_k L_k(\hat{T}) + \lambda_m L_m(\hat{T}) + \lambda_c L_c(\hat{T}), \quad (6)$$

where $\lambda_k, \lambda_m, \lambda_c \geq 0$ are weighting factors for the individual loss terms. Figure 3 summarizes the entire learning process for the teacher's discriminative embedding.
3.2. Ensemble of Student Networks for Deep Anomaly Detection
Next, we describe how to train student networks $S_i$ to predict the teacher's output on anomaly-free training data. We then derive anomaly scores from the students' predictive uncertainty and regression error during inference. First, the vector of component-wise means $\mu \in \mathbb{R}^d$ and standard deviations $\sigma \in \mathbb{R}^d$ over all training descriptors is computed for data normalization. Descriptors are extracted by applying $T$ to each image in the dataset $\mathcal{D}$. We then train an ensemble of $M \geq 1$ randomly initialized student networks $S_i$, $i \in \{1, \ldots, M\}$, that possess the identical network architecture as the teacher $T$. For an input image $I$, each student outputs its predictive distribution over the space of possible regression targets for each local image region $\mathbf{p}_{(r,c)}$ centered at row $r$ and column $c$. Note that the students' architecture with limited receptive field of size $p$ allows us to obtain dense predictions for each image pixel with only a single forward pass, without having to actually crop the patches $\mathbf{p}_{(r,c)}$. The students' output vectors are modeled as a Gaussian distribution $\Pr(y \mid \mathbf{p}_{(r,c)}) = \mathcal{N}(y \mid \mu^{S_i}_{(r,c)}, s)$ with constant covariance $s \in \mathbb{R}$, where $\mu^{S_i}_{(r,c)}$ denotes the prediction made by $S_i$ for the pixel at $(r, c)$. Let $y^T_{(r,c)}$ denote the teacher's respective descriptor that is to be predicted by the students. The log-likelihood training criterion $L(S_i)$ for each student network then simplifies to the squared $\ell_2$-distance in feature space:

$$L(S_i) = \frac{1}{wh} \sum_{(r,c)} \left\| \mu^{S_i}_{(r,c)} - \left( y^T_{(r,c)} - \mu \right) \mathrm{diag}(\sigma)^{-1} \right\|_2^2, \quad (7)$$

where $\mathrm{diag}(\sigma)^{-1}$ denotes the inverse of the diagonal matrix filled with the values in $\sigma$.
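A sketch of the training criterion in Eq. (7), assuming fully convolutional `student` and `teacher` modules that map (1, C, h, w) images to (1, d, h', w') feature maps, and precomputed statistics `mu`, `sigma` of shape (d,); all names are illustrative:

```python
import torch

def student_loss(student, teacher, image, mu, sigma):
    """Eq. (7): squared l2 distance between the student prediction
    and the normalized teacher descriptor, averaged over all pixels."""
    with torch.no_grad():                       # the teacher stays frozen
        y = teacher(image)                      # (1, d, h', w')
        y = (y - mu[None, :, None, None]) / sigma[None, :, None, None]
    pred = student(image)                       # (1, d, h', w')
    return ((pred - y) ** 2).sum(dim=1).mean()  # sum over d, mean over pixels
```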
Scoring Functions for Anomaly Detection. Having trained each student to convergence, a mixture of Gaussians can be obtained at each image pixel by equally weighting the ensemble's predictive distributions. From it, measures of anomaly can be obtained in two ways: First, we propose to compute the regression error of the mixture's mean $\mu_{(r,c)}$ with respect to the teacher's surrogate label:

$$e_{(r,c)} = \left\| \mu_{(r,c)} - \left( y^T_{(r,c)} - \mu \right) \mathrm{diag}(\sigma)^{-1} \right\|_2^2 \quad (8)$$

$$= \left\| \frac{1}{M} \sum_{i=1}^{M} \mu^{S_i}_{(r,c)} - \left( y^T_{(r,c)} - \mu \right) \mathrm{diag}(\sigma)^{-1} \right\|_2^2. \quad (9)$$

The intuition behind this score is that the student networks will fail to regress the teacher's output within anomalous regions during inference, since the corresponding descriptors have not been observed during training. Note that $e_{(r,c)}$ is non-constant even for $M = 1$, where only a single student is trained and anomaly scores can be efficiently obtained with only a single forward pass through the student and teacher network, respectively.
As a second measure of anomaly, we compute for each pixel the predictive uncertainty of the Gaussian mixture as defined by Kendall and Gal [14], assuming that the student networks generalize similarly for anomaly-free regions and differently in regions that contain novel information unseen during training:

$$v_{(r,c)} = \frac{1}{M} \sum_{i=1}^{M} \left\| \mu^{S_i}_{(r,c)} \right\|_2^2 - \left\| \mu_{(r,c)} \right\|_2^2. \quad (10)$$
To combine the two scores, we compute the means $e_\mu, v_\mu$ and standard deviations $e_\sigma, v_\sigma$ of all $e_{(r,c)}$ and $v_{(r,c)}$, respectively, over a validation set of anomaly-free images. Summation of the normalized scores then yields the final anomaly score:

$$\tilde{e}_{(r,c)} + \tilde{v}_{(r,c)} = \frac{e_{(r,c)} - e_\mu}{e_\sigma} + \frac{v_{(r,c)} - v_\mu}{v_\sigma}. \quad (11)$$
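Putting Eqs. (8)-(11) together, a minimal inference sketch might look as follows, with `students` a list of trained networks, `e_stats` and `v_stats` the (mean, std) pairs estimated on the anomaly-free validation set, and the other names as in the previous sketch:

```python
import torch

def anomaly_map(students, teacher, image, mu, sigma, e_stats, v_stats):
    """Eqs. (8)-(11): combined normalized regression error and
    predictive variance for every pixel of the input image."""
    with torch.no_grad():
        y = (teacher(image) - mu[None, :, None, None]) \
            / sigma[None, :, None, None]
        preds = torch.stack([s(image) for s in students])  # (M, 1, d, h, w)
    mean = preds.mean(dim=0)                               # mixture mean
    e = ((mean - y) ** 2).sum(dim=1)                       # Eqs. (8)-(9)
    v = (preds ** 2).sum(dim=2).mean(dim=0) \
        - (mean ** 2).sum(dim=1)                           # Eq. (10)
    (e_mu, e_sigma), (v_mu, v_sigma) = e_stats, v_stats
    return (e - e_mu) / e_sigma + (v - v_mu) / v_sigma     # Eq. (11)
```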
Figure 4 illustrates the basic principles of our anomaly
detection method on the MNIST dataset, where images with
label 0 were treated as the normal class and all other classes
were treated as anomalous. Since the images of this dataset
are very small, we extracted a single feature vector for each
image using $\hat{T}$ and trained an ensemble of $M = 5$ patch-
sized students to regress the teacher’s output. This results
in a single anomaly score for each input image. Feature
descriptors were embedded into 2D using multidimensional
scaling [9] to preserve their relative distances.
3.3. Multi-Scale Anomaly Segmentation
If an anomaly only covers a small part of the teacher’s
receptive field of size p, the extracted feature vector pre-
dominantly describes anomaly-free traits of the local image
region. Consequently, the descriptor can be predicted well
by the students and anomaly detection performance will de-
crease. One could tackle this problem by downsampling the
input image. This would, however, lead to an undesirable
loss in resolution of the output anomaly map.
Our framework allows for explicit control over the size
of the students’ and teacher’s receptive field p. Therefore,
we can detect anomalies at various scales by training mul-
tiple student–teacher ensemble pairs with varying values of
p. At each scale, an anomaly map with the same size as
the input image is computed. Given $L$ student–teacher ensemble pairs with different receptive fields, the normalized anomaly scores $\tilde{e}^{(l)}_{(r,c)}$ and $\tilde{v}^{(l)}_{(r,c)}$ of each scale $l$ can be combined by simple averaging:

$$\frac{1}{L} \sum_{l=1}^{L} \left( \tilde{e}^{(l)}_{(r,c)} + \tilde{v}^{(l)}_{(r,c)} \right). \quad (12)$$
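A short sketch of the multi-scale combination in Eq. (12), where `score_fns` is a hypothetical list of per-scale scoring functions (e.g. the `anomaly_map` sketch above, one per receptive field), each returning a map at input resolution:

```python
import torch

def multiscale_anomaly_map(score_fns, image):
    """Eq. (12): average the normalized anomaly maps of all
    L student-teacher ensemble pairs."""
    maps = [fn(image) for fn in score_fns]
    return torch.stack(maps).mean(dim=0)
```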
| Category | Ours (p = 65) | 1-NN | OC-SVM | K-Means | ℓ2-AE | VAE | SSIM-AE | AnoGAN | CNN-Feature Dictionary |
|---|---|---|---|---|---|---|---|---|---|
| **Textures** | | | | | | | | | |
| Carpet | **0.695** | 0.512 | 0.355 | 0.253 | 0.456 | 0.501 | 0.647 | 0.204 | 0.469 |
| Grid | 0.819 | 0.228 | 0.125 | 0.107 | 0.582 | 0.224 | **0.849** | 0.226 | 0.183 |
| Leather | **0.819** | 0.446 | 0.306 | 0.308 | **0.819** | 0.635 | 0.561 | 0.378 | 0.641 |
| Tile | **0.912** | 0.822 | 0.722 | 0.779 | 0.897 | 0.870 | 0.175 | 0.177 | 0.797 |
| Wood | 0.725 | 0.502 | 0.336 | 0.411 | **0.727** | 0.628 | 0.605 | 0.386 | 0.621 |
| **Objects** | | | | | | | | | |
| Bottle | **0.918** | 0.898 | 0.850 | 0.495 | 0.910 | 0.897 | 0.834 | 0.620 | 0.742 |
| Cable | **0.865** | 0.806 | 0.431 | 0.513 | 0.825 | 0.654 | 0.478 | 0.383 | 0.558 |
| Capsule | **0.916** | 0.631 | 0.554 | 0.387 | 0.862 | 0.526 | 0.860 | 0.306 | 0.306 |
| Hazelnut | **0.937** | 0.861 | 0.616 | 0.698 | 0.917 | 0.878 | 0.916 | 0.698 | 0.844 |
| Metal nut | **0.895** | 0.705 | 0.319 | 0.351 | 0.830 | 0.576 | 0.603 | 0.320 | 0.358 |
| Pill | **0.935** | 0.725 | 0.544 | 0.514 | 0.893 | 0.769 | 0.830 | 0.776 | 0.460 |
| Screw | **0.928** | 0.604 | 0.644 | 0.550 | 0.754 | 0.559 | 0.887 | 0.466 | 0.277 |
| Toothbrush | **0.863** | 0.675 | 0.538 | 0.337 | 0.822 | 0.693 | 0.784 | 0.749 | 0.151 |
| Transistor | 0.701 | 0.680 | 0.496 | 0.399 | **0.728** | 0.626 | 0.725 | 0.549 | 0.628 |
| Zipper | **0.933** | 0.512 | 0.355 | 0.253 | 0.839 | 0.549 | 0.665 | 0.467 | 0.703 |
| Mean | **0.857** | 0.640 | 0.479 | 0.423 | 0.790 | 0.639 | 0.694 | 0.443 | 0.515 |

Table 1: Results on the MVTec Anomaly Detection dataset. For each dataset category, the normalized area under the PRO curve up to an average per-pixel false-positive rate of 30% is given. It measures the average overlap of each ground-truth region with the predicted anomaly regions for multiple thresholds. The best-performing method for each dataset category is highlighted in boldface.
4. Experiments
To demonstrate the effectiveness of our approach, an ex-
tensive evaluation on a number of datasets is performed.
We measure the performance of our student–teacher frame-
work against existing pipelines that use shallow machine
learning algorithms to model the feature distribution of pre-
trained networks. To do so, we compare to a K-Means clas-
sifier, a One-Class SVM (OC-SVM), and a 1-NN classifier.
They are fitted to the distribution of the teacher’s descrip-
tors after prior dimensionality reduction using PCA. We
also experiment with deterministic and variational autoen-
coders as deep distribution models over the teacher’s dis-
criminative embedding. The ℓ2-reconstruction error [13]
and reconstruction probability [2] are used as the anomaly
score, respectively. We further compare our method to re-
cently introduced generative and discriminative deep learn-
ing based anomaly detection models and report improved
performance over the state of the art. We want to stress
that the teacher has not observed images of the evaluated
datasets during pretraining to avoid an unfair bias.
As a first experiment, we perform an ablation study to
find suitable hyperparameters. Our algorithm is applied to
a one-class classification setting on the MNIST [20] and
CIFAR-10 [17] datasets. We then evaluate on the much
more challenging MVTec Anomaly Detection (MVTec AD)
dataset, which was specifically designed to benchmark al-
gorithms for the segmentation of anomalous regions. It
provides over 5000 high-resolution images divided into ten
object and five texture categories. To highlight the benefit
of our multi-scale approach, an additional ablation study is
performed on MVTec AD, which investigates the impact of
different receptive fields on the anomaly detection perfor-
mance.
For our experiments, we use identical network architectures for the student and teacher networks, with receptive field sizes $p \in \{17, 33, 65\}$. All architectures are simple CNNs with only convolutional and max-pooling layers, using leaky rectified linear units with slope $5 \times 10^{-3}$ as activation function. Table 4 shows the specific architecture used for $p = 65$. For $p = 17$ and $p = 33$, similar architectures are given in our supplementary material.

For the pretraining of the teacher networks $\hat{T}$, triplets augmented from the ImageNet dataset are used. Images are zoomed to equal width and height sampled from $\{4p, 4p+1, \ldots, 16p\}$ and a patch of side length $p$ is cropped at a random location. A positive patch $\mathbf{p}^+$ for each triplet is then constructed by randomly translating the crop location within the interval $\{-\frac{p-1}{4}, \ldots, \frac{p-1}{4}\}$. Gaussian noise with standard deviation 0.1 is added to $\mathbf{p}^+$. All images within a triplet are randomly converted to grayscale with a probability of 0.1. For knowledge distillation, we extract 512-dimensional feature vectors from the fully connected layer of a ResNet-18 that was pretrained for classification on the ImageNet dataset. For network optimization, we use the Adam optimizer [15] with an initial learning rate of $2 \times 10^{-4}$, a weight decay of $10^{-5}$, and a batch size of 64. Each teacher network outputs descriptors of dimension $d = 128$ and is trained for $5 \times 10^4$ iterations.
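As an illustration of the augmentation just described, a simplified NumPy sketch (the luminance change is omitted; images are assumed to be float arrays in [0, 1] that were already zoomed to a side length from {4p, ..., 16p}; all names are ours, not the paper's):

```python
import numpy as np

def sample_triplet(image, other_image, p, rng=None):
    """Crop an anchor, a translated and noised positive, and a
    negative from a different image, following the text above."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    r, c = rng.integers(0, h - p + 1), rng.integers(0, w - p + 1)
    anchor = image[r:r + p, c:c + p]
    t = (p - 1) // 4                  # translation range {-(p-1)/4, ..., (p-1)/4}
    pr = int(np.clip(r + rng.integers(-t, t + 1), 0, h - p))
    pc = int(np.clip(c + rng.integers(-t, t + 1), 0, w - p))
    positive = image[pr:pr + p, pc:pc + p] \
        + rng.normal(0.0, 0.1, image[pr:pr + p, pc:pc + p].shape)
    nh, nw = other_image.shape[:2]    # negative: crop from a different image
    nr, nc = rng.integers(0, nh - p + 1), rng.integers(0, nw - p + 1)
    negative = other_image[nr:nr + p, nc:nc + p]
    return anchor, positive, negative
```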
Figure 5: Anomaly detection at multiple scales: Architectures with receptive field of size p = 17 manage to accurately segment the small
scratch on the capsule (top row). However, defects at a larger scale such as the missing imprint (bottom row) become problematic. For
increasingly larger receptive fields, the segmentation performance for the larger anomaly increases while it decreases for the smaller one.
Our multiscale architecture mitigates this problem by combining multiple receptive fields.
4.1. MNIST and CIFAR-10
Before considering the problem of anomaly segmenta-
tion, we evaluate our method on the MNIST and CIFAR-
10 datasets, adapted for one-class classification. Five stu-
dents are trained on only a single class of the dataset, while
during inference images of the other classes must be de-
tected as anomalous. Each image is zoomed to the students’
and teacher’s input size p and a single feature vector is ex-
tracted by passing it through the patch-sized networks ˆT and
ˆSi. We examine different teacher networks by varying the
weights λkmc in the teacher’s loss function L( ˆT). The
patch size for the experiments in this subsection is set to
p = 33. As a measure of anomaly detection performance,
the area under the ROC curve is evaluated. Shallow and
deep distributions models are trained on the teacher’s de-
scriptors of all available in-distribution samples. We ad-
ditionally report numbers for OCGAN [25], a recently pro-
posed generative model directly trained on the input images.
Detailed information on training parameters for all methods on these datasets can be found in our supplementary material.
| Method | $L_k$ | $L_m$ | $L_c$ | MNIST | CIFAR-10 |
|---|---|---|---|---|---|
| OCGAN [25] | | | | 0.9750 | 0.6566 |
| 1-NN | | | | 0.9753 | 0.8189 |
| K-Means | | | | 0.9457 | 0.7592 |
| OC-SVM | | | | 0.9463 | 0.7388 |
| ℓ2-AE | | | | 0.9832 | 0.7898 |
| VAE | | | | 0.9535 | 0.7502 |
| Ours | ✓ | | ✓ | 0.9935 | 0.8196 |
| Ours | ✓ | ✓ | ✓ | 0.9926 | 0.8035 |
| Ours | | ✓ | ✓ | 0.9935 | 0.7940 |
| Ours | | ✓ | | 0.9917 | 0.8021 |

Table 2: Results on MNIST and CIFAR-10. For each method, the average area under the ROC curve is given, computed across each dataset category. For our algorithm, we evaluate teacher networks trained with different loss functions. A ✓ corresponds to setting the respective loss weight to 1; otherwise it is set to 0.
Table 2 shows our results. Our approach outperforms
the other methods for a variety of hyperparameter settings.
Distilling the knowledge of the pretrained ResNet-18 into
the teacher’s descriptor yields slightly better performance
than training the teacher in a fully self-supervised way us-
ing triplet learning. Reducing descriptor redundancy by
minimizing the correlation matrix yields improved results.
On average, shallow models and autoencoders fitted to our
teacher’s feature distribution outperform OCGAN but do
not reach the performance of our approach. Since for 1-NN,
every single training vector can be stored, it performs ex-
ceptionally well on these small datasets. On average, how-
ever, our method still outperforms all evaluated approaches.
4.2. MVTec Anomaly Detection Dataset
For all our experiments on MVTec AD, input images are
zoomed to w = h = 256 pixels. We train on anomaly-free
images for 100 epochs with batch size 1. This is equivalent
to training on a large number of patches per batch due to the
limited size of the networks’ receptive field. We use Adam
with initial learning rate $10^{-4}$ and weight decay $10^{-5}$.
Teacher networks were trained with $\lambda_k = \lambda_c = 1$ and $\lambda_m = 0$, as this configuration performed best on MNIST and CIFAR-10. Ensembles contain $M = 3$ students.
To train shallow classifiers on the teacher’s output de-
scriptors, a subset of vectors is randomly sampled from
the teacher’s feature maps. Their dimension is then re-
duced by PCA, retaining 95% of the variance. The varia-
tional and deterministic autoencoders are implemented us-
ing a simple fully connected architecture and are trained
on all available descriptors. In addition to fitting the mod-
els directly to the teacher’s feature distribution, we bench-
mark our approach against the best performing deep learn-
ing based methods presented by Bergmann et al. [7] on this
dataset. These methods include the CNN-Feature Dictio-
nary [23], the SSIM-Autoencoder [8], and AnoGAN [32].
All hyperparameters are listed in detail in our supplemen-
tary material.
| Category | p = 17 | p = 33 | p = 65 | Multiscale |
|---|---|---|---|---|
| **Textures** | | | | |
| Carpet | 0.795 | 0.893 | 0.695 | 0.879 |
| Grid | 0.920 | 0.949 | 0.819 | 0.952 |
| Leather | 0.935 | 0.956 | 0.819 | 0.945 |
| Tile | 0.936 | 0.950 | 0.912 | 0.946 |
| Wood | 0.943 | 0.929 | 0.725 | 0.911 |
| **Objects** | | | | |
| Bottle | 0.814 | 0.890 | 0.918 | 0.931 |
| Cable | 0.671 | 0.764 | 0.865 | 0.818 |
| Capsule | 0.935 | 0.963 | 0.916 | 0.968 |
| Hazelnut | 0.971 | 0.965 | 0.937 | 0.965 |
| Metal nut | 0.891 | 0.928 | 0.895 | 0.942 |
| Pill | 0.931 | 0.959 | 0.935 | 0.961 |
| Screw | 0.915 | 0.937 | 0.928 | 0.942 |
| Toothbrush | 0.946 | 0.944 | 0.863 | 0.933 |
| Transistor | 0.540 | 0.611 | 0.701 | 0.666 |
| Zipper | 0.848 | 0.942 | 0.933 | 0.951 |
| Mean | 0.866 | 0.900 | 0.857 | 0.914 |

Table 3: Performance of our algorithm on the MVTec AD dataset for different receptive field sizes $p$. Combining anomaly scores across multiple receptive fields shows increased performance for many of the dataset's categories. We report the normalized area under the PRO curve up to an average per-pixel false-positive rate of 30%.
We compute a threshold-independent evaluation met-
ric based on the per-region-overlap (PRO), which weights
ground-truth regions of different size equally. This is in
contrast to simple per-pixel measures, such as ROC, for
which a single large region that is segmented correctly can
make up for many incorrectly segmented small ones. It
was also used by Bergmann et al. in [7]. For computing
the PRO metric, anomaly scores are first thresholded to
make a binary decision for each pixel whether an anomaly
is present or not. For each connected component within
the ground truth, the relative overlap with the thresholded
anomaly region is computed. We evaluate the PRO value
for a large number of increasing thresholds until an average
per-pixel false-positive rate of 30% for the entire dataset is
reached and use the area under the PRO curve as a mea-
sure of anomaly detection performance. Note that for high
false-positive rates, large parts of the input images would be
wrongly labeled as anomalous and even perfect PRO values
would no longer be meaningful. We normalize the inte-
grated area to a maximum achievable value of 1.
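For reference, one point of the PRO curve can be computed as in the following sketch (using SciPy's connected-component labeling; the full metric sweeps thresholds until the average per-pixel false-positive rate reaches 30% and normalizes the integrated area):

```python
import numpy as np
from scipy import ndimage

def pro_point(anomaly_maps, ground_truths, threshold):
    """Mean per-region overlap (PRO) and per-pixel false-positive
    rate at one threshold. anomaly_maps: list of (h, w) score maps;
    ground_truths: list of (h, w) binary defect masks."""
    overlaps, fp, neg = [], 0, 0
    for scores, gt in zip(anomaly_maps, ground_truths):
        pred = scores >= threshold
        labels, n = ndimage.label(gt)     # connected ground-truth regions
        for k in range(1, n + 1):         # each region counts equally
            region = labels == k
            overlaps.append((pred & region).sum() / region.sum())
        fp += np.logical_and(pred, ~gt.astype(bool)).sum()
        neg += (~gt.astype(bool)).sum()
    return float(np.mean(overlaps)), fp / neg
```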
Table 1 shows our results when training each algorithm with
a receptive field of p = 65 for comparability. Our method
consistently outperforms all other evaluated algorithms for
almost every dataset category. The shallow machine learn-
ing algorithms fitted directly to the teacher’s descriptors
after applying PCA do not manage to perform satisfacto-
rily for most of the dataset categories. This shows that
their capacity does not suffice to accurately model the large
number of available training samples. The same can be
observed for the CNN-Feature Dictionary. As was the case in our previous experiment on MNIST and CIFAR-10, 1-NN yields the best results amongst the shallow models. Utilizing a large number of training features together with deterministic autoencoders increases the performance, but still does not match that of our approach. Current generative methods for anomaly segmentation, such as AnoGAN and the SSIM-autoencoder, perform similarly to the shallow methods fitted to the discriminative embedding of the teacher. This indicates that there is indeed a gap between methods that learn representations for anomaly detection from scratch and methods that leverage discriminative embeddings as prior knowledge.

| Layer | Output Size | Kernel | Stride |
|---|---|---|---|
| Input | 65×65×3 | | |
| Conv1 | 61×61×128 | 5×5 | 1 |
| MaxPool | 30×30×128 | 2×2 | 2 |
| Conv2 | 26×26×128 | 5×5 | 1 |
| MaxPool | 13×13×128 | 2×2 | 2 |
| Conv3 | 9×9×128 | 5×5 | 1 |
| MaxPool | 4×4×256 | 2×2 | 2 |
| Conv4 | 1×1×256 | 4×4 | 1 |
| Conv5 | 1×1×128 | 3×3 | 1 |
| Decode | 1×1×512 | 1×1 | 1 |

Table 4: General outline of our network architecture for training teachers $\hat{T}$ with receptive field size $p = 65$. Leaky rectified linear units with slope $5 \times 10^{-3}$ are applied as activation functions after each convolutional layer. Architectures for $p = 17$ and $p = 33$ are given in our supplementary material.
Table 3 shows the performance of our algorithm for dif-
ferent receptive field sizes p ∈ {17, 33, 65} and when com-
bining multiple scales. For some objects, such as bottle and
cable, larger receptive fields yield better results. For oth-
ers, such as wood and toothbrush, the inverse behavior can
be observed. Combining multiple scales enhances the per-
formance for many of the dataset categories. A qualitative
example highlighting the benefit of our multi-scale anomaly
segmentation is visualized in Figure 5.
5. Conclusion
We have proposed a novel framework for the challeng-
ing problem of unsupervised anomaly segmentation in natu-
ral images. Anomaly scores are derived from the predictive
variance and regression error of an ensemble of student net-
works, trained against embedding vectors from a descriptive
teacher network. Ensemble training can be performed end-
to-end and purely on anomaly-free training data without re-
quiring prior data annotation. Our approach can be easily
extended to detect anomalies at multiple scales. We demon-
strate improvements over current state-of-the-art methods
on a number of real-world computer vision datasets for one-
class classification and anomaly segmentation.
References
[1] D. Abati, A. Porrello, S. Calderara, and R. Cucchiara. La-
tent space autoregression for novelty detection. In 2019
IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 481–490, June 2019.
[2] Jinwon An and Sungzoon Cho. Variational Autoencoder
based Anomaly Detection using Reconstruction Probability.
SNU Data Mining Center, Tech. Rep., 2015.
[3] Jerone TA Andrews, Thomas Tanay, Edward J Morton,
and Lewis D Griffin. Transfer Representation-Learning for
Anomaly Detection. In Anomaly Detection Workshop at
ICML2016, 2016.
[4] Christian Bailer, Tewodros A Habtegebrial, Kiran Varanasi,
and Didier Stricker. Fast Dense Feature Extraction with
CNNs that have Pooling or Striding Layers. In British Ma-
chine Vision Conference (BMVC), 2017.
[5] Christoph Baur, Benedikt Wiestler, Shadi Albarqouni, and
Nassir Navab. Deep Autoencoding Models for Unsupervised
Anomaly Segmentation in Brain MR Images. arXiv preprint
arXiv:1804.04488, 2018.
[6] William H. Beluch, Tim Genewein, Andreas Nürnberger,
and Jan M. Köhler. The power of ensembles for active learn-
ing in image classification. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2018.
[7] Paul Bergmann, Michael Fauser, David Sattlegger, and
Carsten Steger. MVTec AD – A Comprehensive Real-World
Dataset for Unsupervised Anomaly Detection. In IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), pages 9592–9600, 2019.
[8] Paul Bergmann, Sindy Löwe, Michael Fauser, David Satt-
legger, and Carsten Steger. Improving Unsupervised Defect
Segmentation by Applying Structural Similarity to Autoen-
coders. Proceedings of the 14th International Joint Confer-
ence on Computer Vision, Imaging and Computer Graphics
Theory and Applications, February 2019.
[9] Ingwer Borg and Patrick Groenen. Modern multidimensional
scaling: Theory and applications. Journal of Educational
Measurement, 40(3):277–280, 2003.
[10] Philippe Burlina, Neil Joshi, and I-Jeng Wang. Where’s
Wally Now? Deep Generative and Discriminative Embed-
dings for Novelty Detection. In IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), June 2019.
[11] Raghavendra Chalapathy, Aditya Krishna Menon, and San-
jay Chawla. Anomaly detection using one-class neural net-
works. arXiv preprint arXiv:1802.06360, 2018.
[12] Dov Danon, Hadar Averbuch-Elor, Ohad Fried, and Daniel
Cohen-Or. Unsupervised natural image patch learning. Com-
putational Visual Media, 5(3):229–237, Sep 2019.
[13] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimension-
ality reduction by learning an invariant mapping. In IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), volume 2, pages 1735–1742. IEEE, 2006.
[14] Alex Kendall and Yarin Gal. What Uncertainties Do We
Need in Bayesian Deep Learning for Computer Vision?
In Advances in Neural Information Processing Systems 30,
pages 5574–5584, 2017.
[15] Diederik P Kingma and Jimmy Ba. Adam: A Method for
Stochastic Optimization. 3rd International Conference for
Learning Representations, 2015.
[16] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do bet-
ter imagenet models transfer better? In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June
2019.
[17] Alex Krizhevsky and Geoffrey Hinton. Learning multiple
layers of features from tiny images. Technical report, Cite-
seer, 2009.
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Im-
ageNet Classification With Deep Convolutional Neural Net-
works. In Advances in Neural Information Processing Sys-
tems, pages 1097–1105, 2012.
[19] Balaji Lakshminarayanan, Alexander Pritzel, and Charles
Blundell. Simple and Scalable Predictive Uncertainty Es-
timation using Deep Ensembles. In Advances in Neural In-
formation Processing Systems 30, pages 6402–6413, 2017.
[20] Yann LeCun and Corinna Cortes. MNIST handwritten digit
database. 2010.
[21] Marc Masana, Idoia Ruiz, Joan Serrat, Joost van de Wei-
jer, and Antonio M Lopez. Metric Learning for Novelty and
Anomaly Detection. In British Machine Vision Conference
(BMVC), 2018.
[22] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan
Gorur, and Balaji Lakshminarayanan. Do Deep Generative
Models Know What They Don't Know? arXiv preprint arXiv:1810.09136, 2018.
[23] Paolo Napoletano, Flavio Piccoli, and Raimondo Schet-
tini. Anomaly Detection in Nanofibrous Materials by CNN-
Based Self-Similarity. Sensors, 18(1):209, 2018.
[24] Tiago S Nazare, Rodrigo F de Mello, and Moacir A
Ponti. Are pre-trained cnns good feature extractors for
anomaly detection in surveillance videos? arXiv preprint
arXiv:1811.08495, 2018.
[25] Pramuditha Perera, Ramesh Nallapati, and Bing Xiang. OC-
GAN: One-class novelty detection using GANs with con-
strained latent representations. In IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), June 2019.
[26] Pramuditha Perera and Vishal M. Patel. Deep transfer learn-
ing for multiple class novelty detection. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June
2019.
[27] Marco AF Pimentel, David A Clifton, Lei Clifton, and Li-
onel Tarassenko. A review of novelty detection. Signal Pro-
cessing, 99:215–249, 2014.
[28] Alina Roitberg, Ziad Al-Halah, and Rainer Stiefelhagen. In-
formed democracy: Voting-based novelty detection for ac-
tion recognition. In 29th British Machine Vision Conference:
BMVC 2018, Northumbria University, Newcastle, UK, 3-6
September 2018. BMVA Press, Durham, 2019.
[29] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas
Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Em-
manuel Müller, and Marius Kloft. Deep one-class classifi-
cation. In Jennifer Dy and Andreas Krause, editors, Pro-
ceedings of the 35th International Conference on Machine
Learning, volume 80 of Proceedings of Machine Learning
Research, pages 4393–4402, Stockholmsmässan, Stockholm
Sweden, 10–15 Jul 2018. PMLR.
[30] Mohammad Sabokrou, Mohsen Fayyaz, Mahmood Fathy,
Zahra Moayed, and Reinhard Klette. Deep-anomaly: Fully
convolutional neural network for fast anomaly detection in
crowded scenes. Computer Vision and Image Understand-
ing, 172:88–97, 2018.
[31] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein,
Georg Langs, and Ursula Schmidt-Erfurth. f-AnoGAN: Fast
unsupervised anomaly detection with generative adversarial
networks. Medical image analysis, 54:30–44, 2019.
[32] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein,
Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised
Anomaly Detection with Generative Adversarial Networks
to Guide Marker Discovery. In International Conference on
Information Processing in Medical Imaging, pages 146–157.
Springer, 2017.
[33] Philipp Seeböck, José Ignacio Orlando, Thomas Schlegl, Se-
bastian M Waldstein, Hrvoje Bogunovic, Sophie Klimscha,
Georg Langs, and Ursula Schmidt-Erfurth. Exploiting Epis-
temic Uncertainty of Anatomy Segmentation for Anomaly
Detection in Retinal OCT. IEEE transactions on medical
imaging, 2019.
[34] Ruoqi Sun, Xinge Zhu, Chongruo Wu, Chen Huang, Jian-
ping Shi, and Lizhuang Ma. Not all areas are equal: Transfer
learning for semantic segmentation via hierarchical region
selection. In IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), June 2019.
[35] Yurun Tian, Bin Fan, and Fuchao Wu. L2-Net: Deep Learn-
ing of Discriminative Patch Descriptor in Euclidean Space.
In IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 6128–6136, 2017.
[36] A. Vasilev, V. Golkov, M. Meissner, I. Lipp, E. Sgarlata,
V. Tomassini, D.K. Jones, and D. Cremers. q-Space Nov-
elty Detection with Variational Autoencoders. MICCAI 2019
International Workshop on Computational Diffusion MRI,
2019.
[37] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 119.1–119.11. BMVA Press, September 2016.