A Diffusion-Based Framework for Multi-Class Anomaly Detection
Haoyang He1*, Jiangning Zhang1,2*, Hongxu Chen1, Xuhai Chen1, Zhishan Li1,
Xu Chen2, Yabiao Wang2, Chengjie Wang2, Lei Xie1†
1College of Control Science and Engineering, Zhejiang University
2Youtu Lab, Tencent
{haoyanghe,186368,chenhongxu,22232044,zhishanli}@zju.edu.cn,
{cxxuchen,caseywang,jasoncjwang}@tencent.com, leix@iipc.zju.edu.cn
Abstract
Reconstruction-based approaches have achieved remarkable outcomes in anomaly detection. The exceptional image reconstruction capabilities of recently popular diffusion models have sparked research efforts to utilize them for enhanced reconstruction of anomalous images. Nonetheless, these methods might face challenges related to the preservation of image categories and pixel-wise structural integrity in the more practical multi-class setting. To solve the above problems, we propose a Diffusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection, which consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to the stable diffusion's denoising network, and a feature-space pre-trained feature extractor. First, the SG network is proposed for reconstructing anomalous regions while preserving the original image's semantic information. Second, we introduce a Spatial-aware Feature Fusion (SFF) block to maximize reconstruction accuracy when dealing with extensively reconstructed areas. Third, the input and reconstructed images are processed by a pre-trained feature extractor to generate anomaly maps based on features extracted at different scales. Experiments on the MVTec-AD and VisA datasets demonstrate the effectiveness of our approach, which surpasses state-of-the-art methods, e.g., achieving 96.8/52.6 and 97.2/99.0 (AUROC/AP) for localization and detection respectively on the multi-class MVTec-AD dataset. Code is available at https://lewandofskee.github.io/projects/diad.
Introduction
Anomaly detection is a crucial task in computer vision and industrial applications (Tao et al. 2022; Salehi et al. 2022; Liu et al. 2023); the goal of visual anomaly detection is to identify anomalous images and accurately locate the anomalous regions. Existing anomaly detection models (Liznerski et al. 2021; Yi and Yoon 2020; Yu et al. 2021) mostly correspond to a single class, which requires a large amount of storage space and training time as the number of classes increases. There is thus a critical need for a robust unsupervised multi-class anomaly detection model.
*These authors contributed equally.
†Corresponding author.
Copyright © 2024, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Figure 1: An analysis of different diffusion models for multi-class anomaly detection. The image above shows various denoising network architectures, while the images below demonstrate the results reconstructed by different methods for the same input image. a) DDPM suffers from categorical errors. b) LDM exhibits semantic errors. c) Our approach effectively reconstructs the anomalous regions while preserving the semantic information of the original image.
The current mainstream unsupervised anomaly detection methods can be divided into three categories: synthesizing-based (Zavrtanik, Kristan, and Skocaj 2021a; Li et al. 2021), embedding-based (Defard et al. 2021; Roth et al. 2022; Xie et al. 2023) and reconstruction-based (Liu et al. 2022; Liang et al. 2023) methods. The core idea of reconstruction-based methods is that during training, the model learns only from normal images; during testing, the trained model reconstructs abnormal images into normal ones. Therefore, by comparing the reconstructed image with the input image, we can determine the location of anomalies. Traditional reconstruction-based methods, including AEs (Zavrtanik, Kristan, and Skocaj 2021b), VAEs (Kingma and Welling 2022), and GANs (Liang et al. 2023; Yan et al. 2021), can learn the distribution of normal samples and reconstruct abnormal regions during the testing phase. However, these models have limited reconstruction capabilities, especially for large-scale defects or missing regions. Hence, models with stronger reconstruction capability are required to effectively tackle multi-class anomaly detection.
Recently, diffusion models (Ho, Jain, and Abbeel 2020; Rombach et al. 2022; Zhang and Agrawala 2023) have demonstrated their powerful image-generation capability. However, directly using current mainstream diffusion
models cannot effectively address multi-class anomaly detection problems. 1) When the Denoising Diffusion Probabilistic Model (DDPM) (Ho, Jain, and Abbeel 2020) in Fig. 1-(a) is applied in the multi-class setting, it may misclassify image categories: after adding noise for T timesteps, the original class information of the input image is lost. During inference, denoising is performed from this Gaussian-noise-like distribution, which may generate images belonging to different categories. 2) The Latent Diffusion Model (LDM) (Rombach et al. 2022) has an embedder as a class condition as shown in Fig. 1-(b), which overcomes the misclassification problem of DDPM. However, LDM cannot address the issue of semantic loss in generated images: it cannot preserve the semantic information of the input image while reconstructing the anomalous regions. For example, it may fail to maintain directional consistency with the input image for objects such as screws and hazelnuts.
To address these problems, we propose DiAD for multi-class anomaly detection in Fig. 2, which comprises a pixel-space autoencoder, a latent-space denoising network and a feature-space pre-trained model. To maintain semantic information consistent with the original image while reconstructing the anomalous regions, we propose the Semantic-Guided (SG) network with a connection to the Stable Diffusion (SD) denoising network. To further enhance the preservation of fine details in the original image, we propose the Spatial-aware Feature Fusion (SFF) block to integrate features at different scales. Finally, features are extracted from the reconstructed and input images through a pre-trained model to compute anomaly scores. We summarize our contributions as follows:
• We propose a novel diffusion-based framework, DiAD, for multi-class anomaly detection, which is the first to tackle the problem of existing denoising networks in diffusion-based methods failing to correctly reconstruct anomalies.
• We construct an SG network connected to the SD denoising network to maintain consistent semantic information while reconstructing anomalies.
• We propose an SFF block to integrate features from different scales to further improve reconstruction ability.
• Extensive experiments demonstrate the superiority of DiAD over SOTA methods.
Related Work
Diffusion Model. The diffusion model has gained widespread attention owing to its remarkable reconstruction ability. It has demonstrated excellent performance in various applications such as image generation (Zhang and Agrawala 2023), video generation (Ho et al. 2022), object detection (Chen et al. 2022), image segmentation (Amit et al. 2022), etc. LDM (Rombach et al. 2022) introduces conditions through cross-attention to control generation.
Anomaly Detection. AD covers a variety of settings, e.g., open-set (Ding, Pang, and Shen 2022), noisy learning (Tan et al. 2021; Yoon et al. 2022), zero-/few-shot (Huang et al. 2022; Jeong et al. 2023; Cao et al. 2023; Chen, Han, and Zhang 2023; Chen et al. 2023b; Zhang et al. 2023b), 3D AD (Wang et al. 2023; Chen et al. 2023a), etc. Unsupervised anomaly detection can primarily be categorized into three major methodologies:
1) Synthesizing-based methods synthesize anomalies on normal image samples. During the training phase, both normal images and synthetically generated abnormal images are input into the network, which aids anomaly detection and localization. DRAEM (Zavrtanik, Kristan, and Skocaj 2021a) is an end-to-end network composed of a reconstruction sub-network and a discriminative sub-network, which synthesizes just-out-of-distribution samples. However, due to the diversity and unpredictability of anomalies in real-world scenarios, it is impossible to synthesize all types of anomalies.
2) Embedding-based methods encode the original image's three-dimensional information into a multidimensional feature space (Roth et al. 2022; Cao et al. 2022; Gu et al. 2023). Most methods employ networks (He et al. 2016; Tan and Le 2019; Zhang et al. 2022, 2023c; Wu et al. 2023) pre-trained on ImageNet (Deng et al. 2009) for feature extraction. RD4AD (Deng and Li 2022) utilizes a WideResNet50 (Zagoruyko and Komodakis 2016) as the teacher model for feature extraction and employs a structurally identical network in reverse as the student model, computing the cosine similarity of corresponding features as anomaly scores. However, due to significant differences between industrial images and the ImageNet data distribution, the extracted features might not be suitable for industrial anomaly detection.
3) Reconstruction-based methods aim to train a model on an anomaly-free dataset; the model learns to identify patterns and characteristics of the normal data. OCR-GAN (Liang et al. 2023) decouples images into different frequencies and uses a GAN for reconstruction. EdgRec (Liu et al. 2022) achieves good reconstruction results by first synthesizing anomalies and then extracting grayscale edge information from images, which is finally input into a reconstruction network. However, such methods have certain limitations in reconstructing large-area anomalies, and their anomaly localization accuracy is also insufficient.
Recently, some studies have applied diffusion models to anomaly detection. AnoDDPM (Wyatt et al. 2022) is the first approach to employ a diffusion model for medical anomaly detection. DiffusionAD (Zhang et al. 2023a) utilizes an anomaly synthesis strategy to generate anomalous samples and labels, along with two sub-networks dedicated to the tasks of denoising and segmentation. DDAD (Mousakhan, Brox, and Tayyub 2023) employs a score-based pre-trained diffusion model to generate normal samples while fine-tuning the pre-trained feature extractor to achieve domain transfer. However, these approaches only add limited steps of noise and perform few denoising steps, which leaves them unable to reconstruct large-scale defects. To overcome the aforementioned problems, we propose a diffusion-based framework, DiAD, for multi-class anomaly detection, which is the first to tackle the problem of existing diffusion-based methods failing to correctly reconstruct anomalies.
Figure 2: Framework of the proposed DiAD, which contains three parts: 1) a pixel-space autoencoder {E, D}; 2) a latent-space Semantic-Guided (SG) network with a connection to the Stable Diffusion (SD) denoising network; and 3) a feature-space pre-trained feature extractor Ψ. During training, the input x0 and the latent variable zT are fed into the SG network and the SD denoising network, respectively; the MSE loss between the output noise and the input noise is computed for gradient optimization. During testing, x0 and the reconstructed image x̂0 are fed into the same pre-trained feature extraction network to obtain feature maps {f1, f2, f3} at different scales, and their anomaly scores S are calculated.
Preliminaries
Denoising Diffusion Probabilistic Model. The Denoising Diffusion Probabilistic Model (DDPM) consists of two processes: the forward diffusion process and the reverse denoising process. During the forward process, a noisy sample $x_t$ is generated using a Markov chain that incrementally adds Gaussian-distributed noise to an initial data sample $x_0$. The forward diffusion process can be characterized as follows:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I), \tag{1}$$
where $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i = \prod_{i=1}^{t}(1-\beta_i)$, and $\beta_i$ represents the noise schedule used to regulate the quantity of noise added at each timestep.
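To make the forward process concrete, below is a minimal PyTorch sketch of Eq. 1, assuming an illustrative linear β schedule (the schedule values and tensor shapes are assumptions, not the paper's exact configuration):

```python
import torch

# Illustrative linear noise schedule; the actual schedule is a design choice.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)     # alpha_bar_t = prod of alpha_i

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Forward diffusion (Eq. 1): x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)   # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

x0 = torch.randn(4, 3, 256, 256)              # a batch of images
t = torch.randint(0, T, (4,))                 # a random timestep per sample
xt = q_sample(x0, t, torch.randn_like(x0))    # noised samples at those timesteps
```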
In the reverse denoising process, $x_T$ is first sampled from Eq. 1 and $x_{t-1}$ is reconstructed from $x_t$ and the model prediction $\epsilon_\theta(x_t, t)$ with the formulation:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z, \tag{2}$$
where $z \sim \mathcal{N}(0, I)$, $\sigma_t$ is a fixed constant related to the variance schedule, $\epsilon_\theta(x_t, t)$ is a U-Net (Ronneberger, Fischer, and Brox 2015) network that predicts the noise, and $\theta$ is the learnable parameter optimized as:
$$\min_\theta\; \mathbb{E}_{x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, I),\, t}\, \|\epsilon - \epsilon_\theta(x_t, t)\|_2^2. \tag{3}$$
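As a sketch of how Eq. 2 is applied at inference, the following function performs one reverse step; the call signature of `model` and the choice σ_t = √β_t (a common fixed variance) are assumptions for illustration:

```python
import torch

@torch.no_grad()
def p_sample_step(model, xt, t_idx, betas, alphas, alpha_bars):
    """One reverse denoising step (Eq. 2); `model(x, t)` is assumed to
    predict the noise eps_theta(x_t, t)."""
    t = torch.full((xt.size(0),), t_idx, device=xt.device, dtype=torch.long)
    eps_pred = model(xt, t)
    coef = (1.0 - alphas[t_idx]) / (1.0 - alpha_bars[t_idx]).sqrt()
    mean = (xt - coef * eps_pred) / alphas[t_idx].sqrt()
    if t_idx == 0:
        return mean                            # no noise added at the last step
    sigma = betas[t_idx].sqrt()                # a common fixed choice for sigma_t
    return mean + sigma * torch.randn_like(xt)
```

Iterating this step from t = T-1 down to 0 recovers a sample from pure noise; training minimizes Eq. 3 by regressing the injected noise.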
Latent Diffusion Model. The Latent Diffusion Model (LDM) operates in a low-dimensional latent space with conditioning mechanisms. LDM consists of a pre-trained autoencoder model and a denoising U-Net-like attention-based network. The network compresses images using an encoder, conducts diffusion and denoising operations in the latent representation space, and subsequently reconstructs the images back to the original pixel space using a decoder. The training optimization objective is:
$$L_{LDM} = \mathbb{E}_{z_0, t, c,\, \epsilon \sim \mathcal{N}(0, 1)}\left[\|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2\right], \tag{4}$$
where $c$ represents the conditioning mechanism, which can consist of multimodal inputs such as text or images connected to the model through cross-attention, and $z_t$ represents the latent-space variable.
Method
The proposed DiAD pipeline is shown in Fig. 2. First, the pre-trained encoder downsamples the input image into a latent-space representation. Then, noise is added to the latent representation, followed by the denoising process using an SD denoising network with a connection to the SG network; the denoising process is repeated for the same number of timesteps as the diffusion process. Finally, the reconstructed latent representation is restored to the original image level using the pre-trained decoder. For anomaly detection and localization, the input and reconstructed images are fed into the same pre-trained model to extract features at different scales and calculate the differences between these features.
Semantic-Guided Network
As discussed earlier, DDPM and LDM each have specific problems when addressing multi-class anomaly detection tasks. In response to these issues and to the multi-class task itself, we propose an SG network to address the problem of LDM's inability to effectively reconstruct anomalies while preserving the semantic information of the input image.
Given an input image $x_0 \in \mathbb{R}^{3 \times H \times W}$ in pixel space, the pre-trained encoder $\mathcal{E}$ encodes $x_0$ into a latent-space representation $z = \mathcal{E}(x_0)$, where $z \in \mathbb{R}^{c \times h \times w}$. Similar to Eq. 1,
where the original pixel-space variable $x$ is replaced by the latent representation $z$, the forward diffusion process can now be characterized as follows:
$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I). \tag{5}$$
The perturbed representation $z_T$ and the input $x_0$ are simultaneously fed into the SD denoising network and the SG network, respectively. After T steps of the reverse denoising process, the final variable $z$ is restored to the reconstructed image $\hat{x}_0$ by the pre-trained decoder $\mathcal{D}$, giving $\hat{x}_0 = \mathcal{D}(z)$. The training objective of DiAD is:
$$L_{DiAD} = \mathbb{E}_{z_0, t, c_i,\, \epsilon \sim \mathcal{N}(0, 1)}\left[\|\epsilon - \epsilon_\theta(z_t, t, c_i)\|_2^2\right]. \tag{6}$$
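A compact training-step sketch of Eqs. 5-6 is given below; the call signatures of `sd_unet` and `sg_net` are illustrative assumptions, not the authors' exact API:

```python
import torch
import torch.nn.functional as F

def diad_training_loss(sd_unet, sg_net, z0, x0, t, alpha_bars):
    """One training step of Eq. 6: noise the latent z0 (Eq. 5), let the SG
    network guide the SD denoising network, and regress the injected noise.
    The signatures of `sd_unet` and `sg_net` are illustrative only."""
    eps = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps   # Eq. 5
    guidance = sg_net(zt, x0, t)          # semantic guidance from the input image
    eps_pred = sd_unet(zt, t, guidance)   # SD denoising network with SG connection
    return F.mse_loss(eps_pred, eps)      # Eq. 6
```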
The denoising network consists of a pre-trained SD denoising network and an SG network whose parameters are initialized by replicating those of the SD network, as shown in Fig. 2. The pre-trained SD denoising network comprises four encoder blocks, one middle block and four decoder blocks. Here, 'block' denotes a frequently utilized unit in the construction of neural network layers, e.g., a 'resnet' block, a transformer block, a multi-head cross-attention block, etc.
The input image $x_0 \in \mathbb{R}^{3 \times H \times W}$ is transformed into $x \in \mathbb{R}^{d \times h \times w}$ by a set of 'conv-silu' layers $C$ in the SG network in order to keep the same dimension as the latent representations in SD Encoder Block 1 $E_{SD_1}$. Then, the sum of $x$ and $z$ is input into the SG Encoder Blocks (SGEBs). After continuous downsampling by the encoder $E_{SG}$, the results pass through the SG middle block $M_{SG}$ and are added to the output of the SD middle block $M_{SD}$. Additionally, to address multi-class tasks across different scenarios and categories, the results of the SG Decoder Blocks (SGDBs) $D_{SG}$ are also added to the results of the SD decoder $D_{SD}$ through an SFF block, which will be explained in the next section. The output $G$ of the denoising network is characterized as:
$$G = D_{SD}\big(M_{SD}(E_{SD}(z_t)) + M_{SG}(E_{SG}(z + C(x_0)))\big) + D_{SG_j}\big(M_{SG}(E_{SG}(z + C(x_0)))\big), \tag{7}$$
where $z$ represents the noise-perturbed latent representation, $x_0$ represents the input image, $C(\cdot)$ represents a set of 'conv-silu' layers in the SG network, $E_{SD}(\cdot)$ represents all the SD encoder blocks (SDEBs), $E_{SG}(\cdot)$ represents all the SGEBs, $M_{SG}(\cdot)$ and $M_{SD}(\cdot)$ represent the SG and SD middle blocks respectively, $D_{SD}(\cdot)$ represents all the SD decoder blocks (SDDBs), and $D_{SG_j}(\cdot)$ represents the $j$-th SGDB.
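Read as plain function composition, Eq. 7 can be sketched as follows, with each block passed in as a callable; this is a structural sketch only, and how the SD decoder consumes the SG skip features is simplified:

```python
def denoise_output(z_t, x0, E_SD, M_SD, D_SD, E_SG, M_SG, D_SG, C):
    """Structural sketch of Eq. 7. Every argument after x0 is a callable
    standing in for the corresponding block; shapes/skips are simplified."""
    h_sg = E_SG(z_t + C(x0))      # SG encoder on noisy latent plus image features
    m_sg = M_SG(h_sg)             # SG middle block output
    m = M_SD(E_SD(z_t)) + m_sg    # added to the SD middle block output
    return D_SD(m) + D_SG(m_sg)   # SG decoder features added to the SD decoder
```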
Spatial-Aware Feature Fusion Block
When adding several layers of decoder blocks from SGEBs to SDDBs during the experiment, as shown in Table 5, we found it challenging to solve multi-class anomaly detection. This is because the dataset contains various types of data, such as objects and textures. For texture-related cases, the anomalies are generally smaller, so it is necessary to preserve the original textures. For object-related cases, on the other hand, the defects often cover larger areas, requiring stronger reconstruction capabilities. It is therefore extremely challenging to simultaneously preserve the normal information of the original samples and reconstruct the abnormal locations in different scenarios.
Figure 3: Schematic diagram of the SFF block. Each layer in SGDB4 is obtained by adding the corresponding SGEB4 layer to every SGEB3 layer after a Conv Block is applied.
Hence, we propose a Spatial-aware Feature Fusion (SFF) block that integrates high-scale semantic information into the low-scale features. This ultimately enables the model to both preserve the information of the original normal samples and reconstruct large-scale abnormal regions. The structure of the SFF block is shown in Fig. 3. Each SGEB consists of three sub-layers; the SFF block therefore integrates the features of each layer in SGEB3 into each layer in SGEB4 and adds the fused features to the original features, so that the final output of each layer of SGEB4 is the sum of its original features and the Conv Block-processed features from every layer of SGEB3.
Since Batch Normalization (BN) (Ioffe and Szegedy 2015) computes normalization statistics over all images within a batch, it leads to a loss of unique details in each sample. BN is suitable for relatively large mini-batches with similar data distributions. However, for multi-class anomaly detection, where data distributions differ significantly among categories, normalizing the entire batch is not suitable. Since the results generated by SD mainly depend on the input image instance, using Instance Normalization (IN) (Ulyanov, Vedaldi, and Lempitsky 2017) can not only accelerate model convergence but also maintain the independence of each image instance. In addition, for the activation function we use SiLU (Elfwing, Uchibe, and Doya 2018) instead of the commonly used ReLU (Hahnloser et al. 2000), which preserves more input information. Experimental results in Table 5 show that performance improves when IN and SiLU are used together instead of the combination of BN and ReLU.
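A minimal PyTorch sketch of the Conv Block (Conv 3×3 + IN + SiLU) and the SFF fusion it enables is given below; equal channel counts and a 2× scale gap between the SGEB3 and SGEB4 stages are simplifying assumptions:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv 3x3 + IN + SiLU, the unit used inside the SFF block."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.InstanceNorm2d(channels)   # IN instead of BN (Table 5)
        self.act = nn.SiLU()                      # SiLU instead of ReLU (Table 5)

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class SFFBlock(nn.Module):
    """Sketch of Spatial-aware Feature Fusion: every SGEB3 layer is processed
    by a Conv Block and added to each SGEB4 layer (channel counts and the
    downsampling between the two stages are simplifying assumptions)."""
    def __init__(self, channels: int, num_layers: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(ConvBlock(channels) for _ in range(num_layers))
        self.down = nn.AvgPool2d(2)               # match SGEB3 maps to SGEB4 size

    def forward(self, sgeb4_feats, sgeb3_feats):
        fused = []
        for f4 in sgeb4_feats:                    # each layer of SGEB4
            out = f4
            for blk, f3 in zip(self.blocks, sgeb3_feats):
                out = out + self.down(blk(f3))    # add every fused SGEB3 layer
            fused.append(out)
        return fused
```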
Anomaly Localization and Detection
During the inference stage, the reconstructed image is obtained through the diffusion and denoising processes in the latent space. For anomaly localization and detection, we use the same ImageNet pre-trained feature extractor Ψ to extract features from both the input image $x_0$ and the reconstructed image $\hat{x}_0$.
| Category | PaDiM | DRAEM | RD4AD | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|---|---|
| Bottle | 97.9/- | 97.5/99.2/96.1 | 99.6/99.9/98.4 | 99.7/100./100. | 63.6/71.8/86.3 | 93.8/98.7/93.7 | 99.7/96.5/91.8 |
| Cable | 70.9/- | 57.8/74.0/76.3 | 84.1/89.5/82.5 | 95.2/95.9/88.0 | 55.6/69.7/76.0 | 55.7/74.8/77.7 | 94.8/98.8/95.2 |
| Capsule | 73.4/- | 65.3/92.5/90.4 | 94.1/96.9/96.9 | 86.9/97.8/94.4 | 52.9/82.0/90.5 | 60.5/81.4/90.5 | 89.0/97.5/95.5 |
| Hazelnut | 85.5/- | 93.7/97.5/92.3 | 60.8/69.8/86.4 | 99.8/100./99.3 | 87.0/90.4/88.1 | 93.0/95.8/89.8 | 99.5/99.7/97.3 |
| Metal Nut | 88.0/- | 72.8/95.0/92.0 | 100./100./99.5 | 99.2/99.9/99.5 | 60.0/74.4/89.4 | 53.0/80.1/89.4 | 99.1/96.0/91.6 |
| Pill | 68.8/- | 82.2/94.9/92.4 | 97.5/99.6/96.8 | 93.7/98.7/95.7 | 55.8/84.0/91.6 | 62.1/93.1/91.6 | 95.7/98.5/94.5 |
| Screw | 56.9/- | 92.0/95.7/89.9 | 97.7/99.3/95.8 | 87.5/96.5/89.0 | 53.6/71.9/85.9 | 58.7/81.9/85.6 | 90.7/99.7/97.9 |
| Toothbrush | 95.3/- | 90.6/96.8/90.0 | 97.2/99.0/94.7 | 94.2/97.4/95.2 | 57.5/68.0/83.3 | 78.6/83.9/83.3 | 99.7/99.9/99.2 |
| Transistor | 86.6/- | 74.8/77.4/71.1 | 94.2/95.2/90.0 | 99.8/98.0/93.8 | 57.8/44.6/57.1 | 61.0/57.8/59.1 | 99.8/99.6/97.4 |
| Zipper | 79.7/- | 98.8/99.9/99.2 | 99.5/99.9/99.2 | 95.8/99.5/97.1 | 64.9/77.4/88.1 | 73.6/89.5/90.6 | 95.1/99.1/94.4 |
| Carpet | 93.8/- | 98.0/99.1/96.7 | 98.5/99.6/97.2 | 99.8/99.9/99.4 | 95.5/98.7/91.0 | 99.4/99.8/99.4 | 99.4/99.9/98.3 |
| Grid | 73.9/- | 99.3/99.7/98.2 | 98.0/99.4/96.5 | 98.2/99.5/97.3 | 83.5/93.9/86.9 | 67.3/82.6/84.4 | 98.5/99.8/97.7 |
| Leather | 99.9/- | 98.7/99.3/95.0 | 100./100./100. | 100./100./100. | 98.4/99.5/96.3 | 97.4/99.0/96.3 | 99.8/99.7/97.6 |
| Tile | 93.3/- | 99.8/100./100. | 98.3/99.3/96.4 | 99.3/99.8/98.2 | 93.6/97.5/92.0 | 97.1/98.7/94.1 | 96.8/99.9/98.4 |
| Wood | 98.4/- | 99.8/100./100. | 99.2/99.8/98.3 | 98.6/99.6/96.6 | 98.6/99.6/97.5 | 97.8/99.4/95.9 | 99.7/100./100. |
| Mean | 84.2/- | 88.1/94.7/92.0 | 94.6/96.5/95.2 | 96.5/98.8/96.2 | 71.9/81.6/86.6 | 76.6/87.8/88.1 | 97.2/99.0/96.5 |

Table 1: Image-level multi-class anomaly classification results (AUROC-cls/AP-cls/F1max-cls) on MVTec-AD. Bottle through Zipper are object categories and Carpet through Wood are texture categories; PaDiM, DRAEM, RD4AD and UniAD are non-diffusion methods, while DDPM, LDM and Ours are diffusion-based.
| Metric | DRAEM | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|
| AUROC-cls | 79.1 | 85.5 | 54.5 | 56.7 | 86.8 |
| AP-cls | 81.9 | 85.5 | 57.9 | 61.4 | 88.3 |
| F1max-cls | 78.9 | 84.4 | 72.3 | 73.1 | 85.1 |
| AUROC-seg | 91.3 | 95.9 | 79.7 | 86.6 | 96.0 |
| AP-seg | 23.5 | 21.0 | 2.2 | 6.0 | 26.1 |
| F1max-seg | 29.5 | 27.0 | 4.5 | 9.9 | 33.0 |
| PRO | 58.8 | 75.6 | 46.8 | 55.0 | 75.2 |

Table 2: Quantitative comparisons on the VisA dataset. DRAEM and UniAD are non-diffusion methods; DDPM, LDM and Ours are diffusion-based.
The anomaly map $M_n$ on the $n$-th-scale feature maps is calculated using cosine similarity:
$$M_n(x_0, \hat{x}_0) = 1 - \frac{\Psi_n(x_0)^T \cdot \Psi_n(\hat{x}_0)}{\|\Psi_n(x_0)\|\, \|\Psi_n(\hat{x}_0)\|}, \tag{8}$$
where $n$ denotes the $n$-th feature layer $f_n$. The anomaly score $S$ of an input pair for anomaly localization is:
$$S = \sum_{n \in N} \sigma_n\, M_n(x_0, \hat{x}_0), \tag{9}$$
where $\sigma_n$ indicates the upsampling factor used to keep the same dimension as the pixel-space image and $N$ indicates the set of feature layers used during inference.
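A sketch of Eqs. 8-9 in PyTorch follows; the bilinear upsampling stands in for the factor σ_n, and the feature lists are assumed to come from the pre-trained extractor Ψ:

```python
import torch
import torch.nn.functional as F

def anomaly_score_map(feats_in, feats_rec, out_size=(256, 256)):
    """Eqs. 8-9: per-layer cosine distance between features of the input and
    the reconstruction, upsampled to image resolution and summed over layers.
    `feats_in` / `feats_rec` are lists of (B, C, H, W) feature maps."""
    score = torch.zeros(feats_in[0].size(0), 1, *out_size)
    for f_in, f_rec in zip(feats_in, feats_rec):
        m = 1.0 - F.cosine_similarity(f_in, f_rec, dim=1)   # Eq. 8, (B, H, W)
        m = F.interpolate(m.unsqueeze(1), size=out_size,
                          mode='bilinear', align_corners=False)  # sigma_n
        score = score + m                                    # Eq. 9, sum over n
    return score
```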
Experiment
Datasets and Evaluation Metrics
MVTec-AD Dataset. The MVTec-AD (Bergmann et al. 2019) dataset simulates real-world industrial production scenarios, filling a gap in unsupervised anomaly detection. It consists of 5 types of textures and 10 types of objects, comprising 5,354 high-resolution images from different domains. The training set contains 3,629 images with only anomaly-free samples. The test set consists of 1,725 images, including both normal and abnormal samples. Pixel-level annotations are provided for anomaly localization evaluation.
VisA Dataset. The VisA (Zou et al. 2022) dataset consists of 10,821 high-resolution images in total, including 9,621 normal images and 1,200 anomalous images with 78 types of anomalies. It comprises 12 subsets, each corresponding to a distinct object; the 12 objects can be categorized into three object types: complex structure, multiple instances, and single instance.
Evaluation Metrics. Following prior works, the Area Under the Receiver Operating Characteristic curve (AUROC), Average Precision (AP) and max F1-score (F1max) are used for both anomaly detection and anomaly localization, where cls denotes image-level anomaly detection and seg denotes pixel-level anomaly localization. Per-Region-Overlap (PRO) is also used for anomaly localization.
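For reference, a minimal sketch of the image-level metrics with scikit-learn (PRO is omitted here, as it requires per-region computation over connected components of the ground-truth masks):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

def detection_metrics(y_true, y_score):
    """AUROC, AP and F1max for image-level detection; applying the same
    functions to flattened masks and score maps gives the pixel-level ones."""
    auroc = roc_auc_score(y_true, y_score)
    ap = average_precision_score(y_true, y_score)
    prec, rec, _ = precision_recall_curve(y_true, y_score)
    f1 = 2 * prec * rec / np.clip(prec + rec, 1e-8, None)  # F1 at each threshold
    return auroc, ap, f1.max()

print(detection_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```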
Implementation Details
All images in MVTec-AD and VisA are resized to 256 × 256. For the denoising network, we adopt the 4th block of the SGDB for connection to the SDDB. We adopt ResNet50 as the feature extraction network and choose n ∈ {2, 3, 4} as the feature layers used in calculating the anomaly localization. We utilize the KL-regularized autoencoder and fine-tune it before training the denoising network. We train for 1,000 epochs on a single NVIDIA Tesla V100 32GB with a batch size of 12, using the Adam optimizer (Loshchilov and Hutter 2019) with a learning rate of 1e-5. A Gaussian filter with σ = 5 is used to smooth the anomaly localization score. For anomaly detection, the anomaly score of an image is the maximum value of the average-pooled anomaly localization map, which undergoes 8 rounds of average pooling with a kernel size of 8 × 8. During inference, the initial denoising timestep T is set to 1,000, and we use DDIM (Song, Meng, and Ermon 2021) as the sampler with 10 steps by default.
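A sketch of this scoring recipe follows; the stride of the pooling is an assumption, since the paper specifies only the 8 rounds of 8×8 average pooling:

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import gaussian_filter

def image_score(loc_map: torch.Tensor, rounds: int = 8) -> torch.Tensor:
    """Image-level anomaly score: Gaussian-smooth the (H, W) localization map,
    apply repeated 8x8 average pooling, then take the maximum value."""
    smoothed = torch.from_numpy(gaussian_filter(loc_map.numpy(), sigma=5))
    s = smoothed[None, None]                          # (1, 1, H, W) for pooling
    for _ in range(rounds):
        s = F.avg_pool2d(s, kernel_size=8, stride=1)  # stride is an assumption
    return s.max()

score = image_score(torch.rand(256, 256))  # scalar anomaly score for one image
```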
| Category | PaDiM | DRAEM | RD4AD | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|---|---|
| Bottle | 96.1/- | 87.6/62.5/56.9 | 97.8/68.2/67.6 | 98.1/66.0/69.2 | 59.9/4.9/11.7 | 86.9/49.1/50.0 | 98.4/52.2/54.8 |
| Cable | 81.0/- | 71.3/14.7/17.8 | 85.1/26.3/33.6 | 97.3/39.9/45.2 | 66.5/6.7/10.6 | 89.3/18.5/26.2 | 96.8/50.1/57.8 |
| Capsule | 96.9/- | 50.5/6.0/10.0 | 98.8/43.4/50.0 | 98.5/42.7/46.5 | 63.1/6.2/9.7 | 90.0/7.9/27.3 | 97.1/42.0/45.3 |
| Hazelnut | 96.3/- | 96.9/70.0/60.5 | 97.9/36.2/51.6 | 98.1/55.2/56.8 | 91.2/24.1/28.3 | 95.1/51.2/53.5 | 98.3/79.2/80.4 |
| Metal Nut | 84.8/- | 62.2/31.1/21.0 | 93.8/62.3/65.4 | 94.8/55.5/66.4 | 62.7/14.6/29.2 | 70.5/19.3/30.7 | 97.3/30.0/38.3 |
| Pill | 87.7/- | 94.4/59.1/44.1 | 97.5/63.4/65.2 | 95.0/44.0/53.9 | 55.3/4.0/8.4 | 74.9/10.2/15.0 | 95.7/46.0/51.4 |
| Screw | 94.1/- | 95.5/33.8/40.6 | 99.4/40.2/44.6 | 98.3/28.7/37.6 | 91.1/1.8/3.8 | 91.7/2.2/4.6 | 97.9/60.6/59.6 |
| Toothbrush | 95.6/- | 97.7/55.2/55.8 | 99.0/53.6/58.8 | 98.4/34.9/45.7 | 76.9/4.0/7.7 | 93.7/20.4/9.8 | 99.0/78.7/72.8 |
| Transistor | 92.3/- | 64.5/23.6/15.1 | 85.9/42.3/45.2 | 97.9/59.5/64.6 | 53.2/5.8/11.4 | 85.5/25.0/30.7 | 95.1/15.6/31.7 |
| Zipper | 94.8/- | 98.3/74.3/69.3 | 98.5/53.9/60.3 | 96.8/40.1/49.9 | 67.4/3.5/7.6 | 66.9/5.3/7.4 | 96.2/60.7/60.0 |
| Carpet | 97.6/- | 98.6/78.7/73.1 | 99.0/58.5/60.4 | 98.5/49.9/51.1 | 89.2/18.8/44.3 | 99.1/70.6/66.0 | 98.6/42.2/46.4 |
| Grid | 71.0/- | 98.7/44.5/46.2 | 99.2/46.0/47.4 | 96.5/23.0/28.4 | 63.1/0.7/1.9 | 52.4/1.1/1.9 | 96.6/66.0/64.1 |
| Leather | 84.8/- | 97.3/60.3/57.4 | 99.3/38.0/45.1 | 98.8/32.9/34.4 | 97.3/38.9/43.2 | 99.0/45.9/44.0 | 98.8/56.1/62.3 |
| Tile | 80.5/- | 98.0/93.6/86.0 | 95.3/48.5/60.5 | 91.8/42.1/50.6 | 87.0/35.2/36.6 | 90.1/43.9/51.6 | 92.4/65.7/64.1 |
| Wood | 89.1/- | 96.0/81.4/74.6 | 95.3/47.8/51.0 | 93.2/37.2/41.5 | 84.7/30.9/37.3 | 92.3/44.1/46.6 | 93.3/43.3/43.5 |
| Mean | 89.5/- | 87.2/52.5/48.6 | 96.1/48.6/53.8 | 96.8/43.4/49.5 | 75.6/13.3/19.5 | 85.1/27.6/31.0 | 96.8/52.6/55.5 |

Table 3: Pixel-level multi-class anomaly segmentation results (AUROC-seg/AP-seg/F1max-seg) on MVTec-AD. Bottle through Zipper are object categories and Carpet through Wood are texture categories; PaDiM, DRAEM, RD4AD and UniAD are non-diffusion methods, while DDPM, LDM and Ours are diffusion-based.
| Metric | DRAEM | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|
| PRO | 71.1 | 90.4 | 49.0 | 66.3 | 90.7 |

Table 4: Multi-class anomaly segmentation results with the PRO metric on MVTec-AD. DRAEM and UniAD are non-diffusion methods; DDPM, LDM and Ours are diffusion-based.
Comparison with SOTAs
We conduct and analyze a range of qualitative and quantitative comparison experiments on the MVTec-AD, VisA, MVTec-3D and medical datasets. We choose the synthesizing-based method DRAEM (Zavrtanik, Kristan, and Skocaj 2021a), two embedding-based methods, PaDiM (Defard et al. 2021) and RD4AD (Deng and Li 2022), the reconstruction-based method EdgRec (Liu et al. 2022), the unified SOTA method UniAD (You et al. 2022), and the diffusion-based DDPM and LDM methods. Specifically, we categorize the aforementioned methods into two types: non-diffusion and diffusion-based methods.
Qualitative Results. We conducted substantial qualitative experiments on the MVTec-AD and VisA datasets to visually demonstrate the superiority of our method in image reconstruction and the accuracy of anomaly localization. As shown in Figure 4, our method exhibits better reconstruction capabilities for anomalous regions than EdgRec on the MVTec-AD dataset. In comparison to UniAD, as shown in Figure 5, our method exhibits more accurate anomaly localization on the VisA dataset.
Quantitative Results. As shown in Table 1 and Table 3, our method achieves SOTA AUROC/AP/F1max metrics of 97.2/99.0/96.5 and 96.8/52.6/55.5 image-wise and pixel-wise respectively in the multi-class setting on the MVTec-AD dataset.
Figure 4: Qualitative illustration on the MVTec-AD dataset.
Among diffusion-based methods, our approach significantly outperforms the existing DDPM and LDM methods, with gains of 11.7↑ in AUROC and 25.0↑ in AP for anomaly localization. Among non-diffusion methods, our approach surpasses existing methods in both metrics, especially at the pixel level, where our method exceeds UniAD by 9.2↑/6.0↑ in AP/F1max. Our method also demonstrates its superiority on the VisA dataset, as shown in Table 2. Our approach exhibits significant improvements over diffusion-based methods, surpassing LDM by 30.1↑/9.4↑ in image/pixel AUROC. It also performs well compared to UniAD, with gains of 4.9↑/6.0↑ in pixel AP/F1max.
Ablation Studies
The Architecture Design of DiAD. We investigate the importance of each module in DiAD, as shown in Table 5.
| Metric | SD | MSG | SGEB3 | SGEB4 | BN+ReLU | IN+SiLU |
|---|---|---|---|---|---|---|
| AUROC-cls | 79.3 | 95.1 | 95.3 | 93.8 | 96.7 | 97.2 |
| AUROC-seg | 89.5 | 91.1 | 89.1 | 91.2 | 96.7 | 96.8 |

Table 5: Ablation studies on the design of DiAD with AUROC metrics.
SD indicates the diffusion model alone, without connection to the SG network, i.e., the LDM architecture. MSG indicates that only the middle block of the SG network is added to the SD middle block. SGEB3 and SGEB4 indicate direct skip-connections to the corresponding SDDBs. When connecting SGDB3 and SGDB4 at the same time, more details of the original images are preserved in terms of texture, but the reconstruction ability for large anomaly areas decreases. Using the combination of IN+SiLU in the SFF block yields better results than using BN+ReLU.
Effect of Pre-trained Feature Extractors. Table 6 shows the quantitative comparison of different pre-trained feature extraction networks. ResNet50 achieves the best performance on the anomaly classification metric, while WideResNet101 excels in anomaly segmentation.
| Backbone | AUROC-cls | AUROC-seg | PRO |
|---|---|---|---|
| VGG16 | 91.8 | 92.1 | 80.1 |
| VGG19 | 91.3 | 92.3 | 80.4 |
| ResNet18 | 94.7 | 96.0 | 89.1 |
| ResNet34 | 95.2 | 96.2 | 89.6 |
| ResNet50 | 97.2 | 96.8 | 90.7 |
| ResNet101 | 96.2 | 96.9 | 91.2 |
| WideResNet50 | 95.9 | 96.4 | 89.3 |
| WideResNet101 | 95.6 | 96.9 | 91.4 |
| EfficientNet-b0 | 93.5 | 94.0 | 84.0 |
| EfficientNet-b2 | 94.2 | 94.1 | 84.2 |
| EfficientNet-b4 | 92.8 | 93.6 | 83.5 |

Table 6: Ablation studies on different feature extractors.
Effect of Feature Layers Used in Anomaly Score Calculation. After extracting feature maps at 5 different scales using a pre-trained backbone, anomaly scores are calculated by computing the cosine similarity between feature maps from different layers. The experimental results, shown in the Appendix, indicate that using feature maps from layers f2, f3, and f4 (with corresponding sizes of 64 × 64, 32 × 32, and 16 × 16) yields the best performance.
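As one consistent mapping of these layers onto a standard backbone, the sketch below extracts 64×64, 32×32 and 16×16 feature maps from an ImageNet pre-trained ResNet50 with torchvision; the layer1-layer3 naming is an assumption about how f2-f4 line up with ResNet stages:

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# For a 256x256 input, ResNet50 stages layer1/layer2/layer3 produce 64x64,
# 32x32 and 16x16 maps, matching the sizes stated for f2, f3 and f4.
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()
extractor = create_feature_extractor(
    backbone, return_nodes={'layer1': 'f2', 'layer2': 'f3', 'layer3': 'f4'})

with torch.no_grad():
    feats = extractor(torch.randn(1, 3, 256, 256))
print({k: tuple(v.shape) for k, v in feats.items()})
# {'f2': (1, 256, 64, 64), 'f3': (1, 512, 32, 32), 'f4': (1, 1024, 16, 16)}
```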
Effect of Forward Diffusion Timesteps. Increasing the number of diffusion steps in the forward process impacts the performance of image reconstruction. The experimental results, depicted in Figure 6, indicate that as the number of forward diffusion steps increases, the image approaches pure Gaussian noise, while the anomaly reconstruction ability improves. Nevertheless, when the number of forward diffusion steps is less than 600, a significant decline in performance occurs because the number of steps is insufficient for anomaly reconstruction.
Figure 5: Qualitative results on the VisA dataset (columns: Input, Ours Rec., GT, UniAD Loc., Ours Loc.).
Figure 6: Ablation studies on different diffusion timesteps.
Conclusion
This paper proposes a diffusion-based framework, DiAD, to address the issue of category and semantic loss in the stable diffusion model for multi-class anomaly detection. We propose the Semantic-Guided network and the Spatial-aware Feature Fusion block to better reconstruct abnormal regions while maintaining the same semantic information as the input image. Our approach achieves state-of-the-art performance on the MVTec-AD and VisA datasets, significantly outperforming both non-diffusion and diffusion-based methods.
Limitation. Although our method has demonstrated exceptional performance in reconstructing anomalies, it can be susceptible to the influence of background impurities, resulting in errors in localization and classification. In the future, we will further explore diffusion models and enhance the background's anti-interference capability for multi-class anomaly detection. Additionally, we will incorporate multimodal assistance into our anomaly detection. Lastly, we will utilize larger models to enhance reconstruction performance.
Acknowledgments
This work was supported by Jianbing Lingyan Foundation
of Zhejiang Province, P.R. China (Grant No. 2023C01022).
References
Amit, T.; Shaharbany, T.; Nachmani, E.; and Wolf, L. 2022. SegDiff: Image Segmentation with Diffusion Probabilistic Models. arXiv:2112.00390.
Bergmann, P.; Fauser, M.; Sattlegger, D.; and Steger, C. 2019. MVTec AD – A comprehensive real-world dataset for unsupervised anomaly detection. In CVPR, 9592–9600.
Cao, Y.; Wan, Q.; Shen, W.; and Gao, L. 2022. Informative knowledge distillation for image anomaly segmentation. Knowledge-Based Systems, 248: 108846.
Cao, Y.; Xu, X.; Sun, C.; Cheng, Y.; Du, Z.; Gao, L.; and Shen, W. 2023. Segment Any Anomaly without Training via Hybrid Prompt Regularization. arXiv preprint arXiv:2305.10724.
Chen, R.; Xie, G.; Liu, J.; Wang, J.; Luo, Z.; Wang, J.; and Zheng, F. 2023a. EasyNet: An easy network for 3D industrial anomaly detection. In ACM MM, 7038–7046.
Chen, S.; Sun, P.; Song, Y.; and Luo, P. 2022. DiffusionDet: Diffusion Model for Object Detection. arXiv:2211.09788.
Chen, X.; Han, Y.; and Zhang, J. 2023. A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on Zero-shot AD and 4th Place on Few-shot AD. arXiv preprint arXiv:2305.17382.
Chen, X.; Zhang, J.; Tian, G.; He, H.; Zhang, W.; Wang, Y.; Wang, C.; Wu, Y.; and Liu, Y. 2023b. CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection. arXiv preprint arXiv:2311.00453.
Defard, T.; Setkov, A.; Loesch, A.; and Audigier, R. 2021. PaDiM: a patch distribution modeling framework for anomaly detection and localization. In ICPR, 475–489. Springer.
Deng, H.; and Li, X. 2022. Anomaly detection via reverse distillation from one-class embedding. In CVPR, 9737–9746.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.
Ding, C.; Pang, G.; and Shen, C. 2022. Catching both gray and black swans: Open-set supervised anomaly detection. In CVPR, 7388–7398.
Elfwing, S.; Uchibe, E.; and Doya, K. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107: 3–11.
Gu, Z.; Liu, L.; Chen, X.; Yi, R.; Zhang, J.; Wang, Y.; Wang, C.; Shu, A.; Jiang, G.; and Ma, L. 2023. Remembering Normality: Memory-guided Knowledge Distillation for Unsupervised Anomaly Detection. In ICCV, 16401–16409.
Hahnloser, R. H.; Sarpeshkar, R.; Mahowald, M. A.; Douglas, R. J.; and Seung, H. S. 2000. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789): 947–951.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.; Kingma, D. P.; Poole, B.; Norouzi, M.; Fleet, D. J.; and Salimans, T. 2022. Imagen Video: High Definition Video Generation with Diffusion Models. arXiv:2210.02303.
Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising Diffusion Probabilistic Models. In NeurIPS, volume 33, 6840–6851.
Huang, C.; Guan, H.; Jiang, A.; Zhang, Y.; Spratling, M.; and Wang, Y.-F. 2022. Registration based few-shot anomaly detection. In ECCV, 303–319. Springer.
Ioffe, S.; and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, 448–456. JMLR.org.
Jeong, J.; Zou, Y.; Kim, T.; Zhang, D.; Ravichandran, A.; and Dabeer, O. 2023. WinCLIP: Zero-/few-shot anomaly classification and segmentation. In CVPR, 19606–19616.
Kingma, D. P.; and Welling, M. 2022. Auto-Encoding Variational Bayes. arXiv:1312.6114.
Li, C.-L.; Sohn, K.; Yoon, J.; and Pfister, T. 2021. CutPaste: Self-supervised learning for anomaly detection and localization. In CVPR, 9664–9674.
Liang, Y.; Zhang, J.; Zhao, S.; Wu, R.; Liu, Y.; and Pan, S. 2023. Omni-frequency channel-selection representations for unsupervised anomaly detection. IEEE Transactions on Image Processing.
Liu, J.; Xie, G.; Wang, J.; Li, S.; Wang, C.; Zheng, F.; and Jin, Y. 2023. Deep Industrial Image Anomaly Detection: A Survey. arXiv preprint arXiv:2301.11514.
Liu, T.; Li, B.; Zhao, Z.; Du, X.; Jiang, B.; and Geng, L. 2022. Reconstruction from edge image combined with color and gradient difference for industrial surface anomaly detection. arXiv:2210.14485.
Liznerski, P.; Ruff, L.; Vandermeulen, R. A.; Franks, B. J.; Kloft, M.; and Müller, K. 2021. Explainable Deep One-Class Classification. In ICLR.
Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. arXiv:1711.05101.
Mousakhan, A.; Brox, T.; and Tayyub, J. 2023. Anomaly Detection with Conditioned Denoising Diffusion Models. arXiv:2305.15956.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 234–241. Springer.
Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; and Gehler, P. 2022. Towards total recall in industrial anomaly detection. In CVPR, 14318–14328.
Salehi, M.; Mirzaei, H.; Hendrycks, D.; Li, Y.; Rohban, M. H.; and Sabokrou, M. 2022. A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges. arXiv:2110.14051.
Salehi, M.; Sadjadi, N.; Baselizadeh, S.; Rohban, M. H.; and Rabiee, H. R. 2021. Multiresolution knowledge distillation for anomaly detection. In CVPR, 14902–14912.
Song, J.; Meng, C.; and Ermon, S. 2021. Denoising Diffusion Implicit Models. In ICLR. OpenReview.net.
Tan, D. S.; Chen, Y.-C.; Chen, T. P.-C.; and Chen, W.-C. 2021. TrustMAE: A noise-resilient defect classification framework using memory-augmented auto-encoders with trust regions. In WACV, 276–285.
Tan, M.; and Le, Q. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 6105–6114. PMLR.
Tao, X.; Gong, X.; Zhang, X.; Yan, S.; and Adak, C. 2022. Deep Learning for Unsupervised Anomaly Localization in Industrial Images: A Survey. IEEE Transactions on Instrumentation and Measurement, 71: 1–21.
Ulyanov, D.; Vedaldi, A.; and Lempitsky, V. 2017. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv:1607.08022.
Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; and Wang, C. 2023. Multimodal Industrial Anomaly Detection via Hybrid Fusion. In CVPR, 8032–8041.
Wu, J.; Li, J.; Zhang, J.; Zhang, B.; Chi, M.; Wang, Y.; and Wang, C. 2023. PVG: Progressive Vision Graph for Vision Recognition. arXiv preprint arXiv:2308.00574.
Wyatt, J.; Leach, A.; Schmon, S. M.; and Willcocks, C. G. 2022. AnoDDPM: Anomaly Detection with Denoising Diffusion Probabilistic Models using Simplex Noise. In CVPR Workshops, 649–655. IEEE.
Xie, G.; Wang, J.; Liu, J.; Jin, Y.; and Zheng, F. 2023. Pushing the Limits of Fewshot Anomaly Detection in Industry Vision: GraphCore. In ICLR.
Yan, X.; Zhang, H.; Xu, X.; Hu, X.; and Heng, P. 2021. Learning Semantic Context from Normal Samples for Unsupervised Anomaly Detection. In AAAI, 3110–3118.
Yi, J.; and Yoon, S. 2020. Patch SVDD: Patch-level SVDD for Anomaly Detection and Segmentation. In ACCV.
Yoon, J.; Sohn, K.; Li, C.-L.; Arik, S. O.; Lee, C.-Y.; and Pfister, T. 2022. Self-supervise, Refine, Repeat: Improving Unsupervised Anomaly Detection. Transactions on Machine Learning Research.
You, Z.; Cui, L.; Shen, Y.; Yang, K.; Lu, X.; Zheng, Y.; and Le, X. 2022. A Unified Model for Multi-class Anomaly Detection. In NeurIPS, volume 35, 4571–4584.
Yu, J.; Zheng, Y.; Wang, X.; Li, W.; Wu, Y.; Zhao, R.; and Wu, L. 2021. FastFlow: Unsupervised Anomaly Detection and Localization via 2D Normalizing Flows. arXiv:2111.07677.
Zagoruyko, S.; and Komodakis, N. 2016. Wide Residual Networks. In BMVC. BMVA Press.
Zavrtanik, V.; Kristan, M.; and Skocaj, D. 2021a. DRAEM – a discriminatively trained reconstruction embedding for surface anomaly detection. In ICCV, 8330–8339.
Zavrtanik, V.; Kristan, M.; and Skocaj, D. 2021b. Reconstruction by inpainting for visual anomaly detection. Pattern Recognition, 112: 107706.
Zhang, H.; Wang, Z.; Wu, Z.; and Jiang, Y.-G. 2023a. DiffusionAD: Denoising Diffusion for Anomaly Detection. arXiv:2303.08730.
Zhang, J.; Chen, X.; Xue, Z.; Wang, Y.; Wang, C.; and Liu, Y. 2023b. Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection. arXiv preprint arXiv:2311.02612.
Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; and Wang, C. 2023c. Rethinking Mobile Block for Efficient Attention-based Models. In ICCV, 1389–1400.
Zhang, J.; Li, X.; Wang, Y.; Wang, C.; Yang, Y.; Liu, Y.; and Tao, D. 2022. EATFormer: Improving vision transformer inspired by evolutionary algorithm. arXiv preprint arXiv:2206.09325.
Zhang, L.; and Agrawala, M. 2023. Adding Conditional Control to Text-to-Image Diffusion Models. arXiv:2302.05543.
Zou, Y.; Jeong, J.; Pemula, L.; Zhang, D.; and Dabeer, O. 2022. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In ECCV, 392–408. Springer.