A Diffusion-Based Framework for Multi-Class Anomaly Detection
Haoyang He1*, Jiangning Zhang1,2*, Hongxu Chen1, Xuhai Chen1, Zhishan Li1,
Xu Chen2, Yabiao Wang2, Chengjie Wang2, Lei Xie1†
1College of Control Science and Engineering, Zhejiang University
2Youtu Lab, Tencent
{haoyanghe,186368,chenhongxu,22232044,zhishanli}@zju.edu.cn,
{cxxuchen,caseywang,jasoncjwang}@tencent.com, leix@iipc.zju.edu.cn
Abstract
Reconstruction-based approaches have achieved remarkable outcomes in anomaly detection. The exceptional image reconstruction capabilities of recently popular diffusion models have sparked research efforts to utilize them for enhanced reconstruction of anomalous images. Nonetheless, these methods might face challenges related to the preservation of image categories and pixel-wise structural integrity in the more practical multi-class setting. To solve the above problems, we propose a Diffusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection, which consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to the stable diffusion's denoising network, and a feature-space pre-trained feature extractor. First, the SG network is proposed for reconstructing anomalous regions while preserving the original image's semantic information. Second, we introduce a Spatial-aware Feature Fusion (SFF) block to maximize reconstruction accuracy when dealing with extensively reconstructed areas. Third, the input and reconstructed images are processed by a pre-trained feature extractor to generate anomaly maps based on features extracted at different scales. Experiments on the MVTec-AD and VisA datasets demonstrate the effectiveness of our approach, which surpasses state-of-the-art methods, e.g., achieving 96.8/52.6 and 97.2/99.0 (AUROC/AP) for localization and detection respectively on the multi-class MVTec-AD dataset. Code is available at https://lewandofskee.github.io/projects/diad.
Introduction
Anomaly detection is a crucial task in computer vision and industrial applications (Tao et al. 2022; Salehi et al. 2022; Liu et al. 2023); the goal of visual anomaly detection is to identify anomalous images and accurately locate the anomalous regions. Existing anomaly detection models (Liznerski et al. 2021; Yi and Yoon 2020; Yu et al. 2021) mostly correspond to a single class, which requires a large amount of storage space and training time as the number of classes increases. There is thus a critical need for a robust unsupervised multi-class anomaly detection model.
*These authors contributed equally.
†Corresponding author.
Copyright © 2024, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Figure 1: An analysis of different diffusion models for multi-class anomaly detection. The image above shows various denoising network architectures, while the images below demonstrate the results reconstructed by different methods for the same input image. a) DDPM suffers from categorical errors. b) LDM exhibits semantic errors. c) Our approach effectively reconstructs the anomalous regions while preserving the semantic information of the original image.
The current mainstream unsupervised anomaly detection methods can be divided into three categories: synthesizing-based (Zavrtanik, Kristan, and Skocaj 2021a; Li et al. 2021), embedding-based (Defard et al. 2021; Roth et al. 2022; Xie et al. 2023) and reconstruction-based (Liu et al. 2022; Liang et al. 2023) methods. The core idea of reconstruction-based methods is that during training, the model learns only from normal images; during testing, the trained model reconstructs abnormal images into normal ones. Therefore, by comparing the reconstructed image with the input image, we can determine the location of anomalies. Traditional reconstruction-based methods, including AEs (Zavrtanik, Kristan, and Skocaj 2021b), VAEs (Kingma and Welling 2022), and GANs (Liang et al. 2023; Yan et al. 2021), can learn the distribution of normal samples and reconstruct abnormal regions during the testing phase. However, these models have limited reconstruction capabilities, especially for large-scale defects or missing regions. Hence, models with stronger reconstruction capability are required to effectively tackle multi-class anomaly detection.
Recently, diffusion models (Ho, Jain, and Abbeel 2020; Rombach et al. 2022; Zhang and Agrawala 2023) have demonstrated their powerful image-generation capability. However, directly using current mainstream diffusion
models cannot effectively address multi-class anomaly detection problems. 1) When the Denoising Diffusion Probabilistic Model (DDPM) (Ho, Jain, and Abbeel 2020) in Fig. 1-(a) is applied in the multi-class setting, it may misclassify image categories: after adding noise for T timesteps, the original class information of the input image is lost. During inference, denoising is performed from this Gaussian-noise-like distribution, which may generate images belonging to different categories. 2) The Latent Diffusion Model (LDM) (Rombach et al. 2022) has an embedder as a class condition as shown in Fig. 1-(b), which overcomes the misclassification problem of DDPM. However, LDM cannot address the issue of semantic loss in generated images: it cannot preserve the semantic information of the input image while reconstructing the anomalous regions. For example, it may fail to maintain directional consistency with the input image for objects such as screws and hazelnuts.
To address these problems, we propose DiAD for multi-class anomaly detection in Fig. 2, which comprises a pixel-space autoencoder, a latent-space denoising network and a feature-space pre-trained model. To maintain semantic information consistent with the original image while reconstructing the anomalous regions, we propose the Semantic-Guided (SG) network with a connection to the Stable Diffusion (SD) denoising network. To further enhance the preservation of fine details in the original image, we propose the Spatial-aware Feature Fusion (SFF) block to integrate features at different scales. Finally, features are extracted from the reconstructed and input images through a pre-trained model to compute anomaly scores. We summarize our contributions as follows:
• We propose a novel diffusion-based framework, DiAD, for multi-class anomaly detection, which is the first to tackle the problem of existing denoising networks in diffusion-based methods failing to correctly reconstruct anomalies.
• We construct an SG network connected to the SD denoising network to maintain consistent semantic information while reconstructing anomalies.
• We propose an SFF block to integrate features from different scales to further improve reconstruction ability.
• Extensive experiments demonstrate the superiority of DiAD over SOTA methods.
Related Work
Diffusion Model. The diffusion model has gained widespread attention owing to its remarkable reconstruction ability. It has demonstrated excellent performance in various applications such as image generation (Zhang and Agrawala 2023), video generation (Ho et al. 2022), object detection (Chen et al. 2022), image segmentation (Amit et al. 2022), etc. LDM (Rombach et al. 2022) introduces conditions through cross-attention to control generation.
Anomaly Detection. AD covers a variety of settings, e.g., open-set (Ding, Pang, and Shen 2022), noisy learning (Tan et al. 2021; Yoon et al. 2022), zero-/few-shot (Huang et al. 2022; Jeong et al. 2023; Cao et al. 2023; Chen, Han, and Zhang 2023; Chen et al. 2023b; Zhang et al. 2023b), 3D AD (Wang et al. 2023; Chen et al. 2023a), etc. Unsupervised anomaly detection can primarily be categorized into three major methodologies:
1) Synthesizing-based methods synthesize anomalies on normal image samples. During the training phase, both normal images and synthetically generated abnormal images are input into the network, which aids anomaly detection and localization. DRAEM (Zavrtanik, Kristan, and Skocaj 2021a) is an end-to-end network composed of a reconstruction sub-network and a discriminative sub-network, which synthesizes just-out-of-distribution samples. However, due to the diversity and unpredictability of anomalies in real-world scenarios, it is impossible to synthesize all types of anomalies.
2) Embedding-based methods encode the original image's three-dimensional information into a multidimensional feature space (Roth et al. 2022; Cao et al. 2022; Gu et al. 2023). Most methods employ networks (He et al. 2016; Tan and Le 2019; Zhang et al. 2022, 2023c; Wu et al. 2023) pre-trained on ImageNet (Deng et al. 2009) for feature extraction. RD4AD (Deng and Li 2022) utilizes a WideResNet50 (Zagoruyko and Komodakis 2016) as the teacher model for feature extraction and employs a structurally identical network in reverse as the student model, computing the cosine similarity of corresponding features as anomaly scores. However, due to significant differences between industrial images and the ImageNet data distribution, the extracted features might not be suitable for industrial anomaly detection.
3) Reconstruction-based methods aim to train a model on an anomaly-free dataset; the model learns to identify patterns and characteristics of the normal data. OCR-GAN (Liang et al. 2023) decouples images into different frequencies and uses a GAN for reconstruction. EdgRec (Liu et al. 2022) achieves good reconstruction results by first synthesizing anomalies and then extracting grayscale edge information from images, which is finally input into a reconstruction network. However, such methods have certain limitations in reconstructing large-area anomalies, and their anomaly localization accuracy is also insufficient.
Recently, some studies have applied diffusion models to anomaly detection. AnoDDPM (Wyatt et al. 2022) is the first approach to employ a diffusion model for medical anomaly detection. DiffusionAD (Zhang et al. 2023a) utilizes an anomaly synthesis strategy to generate anomalous samples and labels, along with two sub-networks dedicated to the tasks of denoising and segmentation. DDAD (Mousakhan, Brox, and Tayyub 2023) employs a score-based pre-trained diffusion model to generate normal samples while fine-tuning the pre-trained feature extractor to achieve domain transfer. However, these approaches only add limited steps of noise and perform few denoising steps, which leaves them unable to reconstruct large-scale defects. To overcome the aforementioned problems, we propose a diffusion-based framework, DiAD, for multi-class anomaly detection, which is the first to tackle the problem of existing diffusion-based methods failing to correctly reconstruct anomalies.
Figure 2: Framework of the proposed DiAD, which contains three parts: 1) a pixel-space autoencoder {E, D}; 2) a latent-space Semantic-Guided (SG) network with a connection to the Stable Diffusion (SD) denoising network; and 3) a feature-space pre-trained feature extractor Ψ. During training, the input x0 and the latent variable zT are fed into the SG network and the SD denoising network, respectively; the MSE loss between the output noise and the input noise is computed for gradient optimization. During testing, x0 and the reconstructed image x̂0 are fed into the same pre-trained feature extraction network to obtain feature maps {f1, f2, f3} at different scales, and their anomaly scores S are calculated.
Preliminaries
Denoising Diffusion Probabilistic Model. The Denoising Diffusion Probabilistic Model (DDPM) consists of two processes: the forward diffusion process and the reverse denoising process. During the forward process, a noisy sample $x_t$ is generated using a Markov chain that incrementally adds Gaussian-distributed noise to an initial data sample $x_0$. The forward diffusion process can be characterized as follows:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I), \tag{1}$$
where $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i = \prod_{i=1}^{t}(1-\beta_i)$, and $\beta_i$ represents the noise schedule used to regulate the quantity of noise added at each timestep.
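To make the forward process concrete, below is a minimal PyTorch sketch of Eq. 1, assuming an illustrative linear β schedule (the schedule values and tensor shapes are assumptions, not the paper's exact configuration):

```python
import torch

# Illustrative linear noise schedule; the actual schedule is a design choice.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)     # alpha_bar_t = prod of alpha_i

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Forward diffusion (Eq. 1): x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)   # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

x0 = torch.randn(4, 3, 256, 256)              # a batch of images
t = torch.randint(0, T, (4,))                 # a random timestep per sample
xt = q_sample(x0, t, torch.randn_like(x0))    # noised samples at those timesteps
```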
In the reverse denoising process, $x_T$ is first sampled from Eq. 1 and $x_{t-1}$ is reconstructed from $x_t$ and the model prediction $\epsilon_\theta(x_t, t)$ with the formulation:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z, \tag{2}$$
where $z \sim \mathcal{N}(0, I)$, $\sigma_t$ is a fixed constant related to the variance schedule, $\epsilon_\theta(x_t, t)$ is a U-Net (Ronneberger, Fischer, and Brox 2015) network that predicts the noise, and $\theta$ is the learnable parameter optimized as:
$$\min_\theta\; \mathbb{E}_{x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, I),\, t}\, \|\epsilon - \epsilon_\theta(x_t, t)\|_2^2. \tag{3}$$
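As a sketch of how Eq. 2 is applied at inference, the following function performs one reverse step; the call signature of `model` and the choice σ_t = √β_t (a common fixed variance) are assumptions for illustration:

```python
import torch

@torch.no_grad()
def p_sample_step(model, xt, t_idx, betas, alphas, alpha_bars):
    """One reverse denoising step (Eq. 2); `model(x, t)` is assumed to
    predict the noise eps_theta(x_t, t)."""
    t = torch.full((xt.size(0),), t_idx, device=xt.device, dtype=torch.long)
    eps_pred = model(xt, t)
    coef = (1.0 - alphas[t_idx]) / (1.0 - alpha_bars[t_idx]).sqrt()
    mean = (xt - coef * eps_pred) / alphas[t_idx].sqrt()
    if t_idx == 0:
        return mean                            # no noise added at the last step
    sigma = betas[t_idx].sqrt()                # a common fixed choice for sigma_t
    return mean + sigma * torch.randn_like(xt)
```

Iterating this step from t = T-1 down to 0 recovers a sample from pure noise; training minimizes Eq. 3 by regressing the injected noise.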
Latent Diffusion Model. The Latent Diffusion Model (LDM) operates in a low-dimensional latent space with conditioning mechanisms. LDM consists of a pre-trained autoencoder model and a denoising U-Net-like attention-based network. The network compresses images using an encoder, conducts diffusion and denoising operations in the latent representation space, and subsequently reconstructs the images back to the original pixel space using a decoder. The training optimization objective is:
$$L_{LDM} = \mathbb{E}_{z_0, t, c,\, \epsilon \sim \mathcal{N}(0, 1)}\left[\|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2\right], \tag{4}$$
where $c$ represents the conditioning mechanism, which can consist of multimodal inputs such as text or images connected to the model through cross-attention, and $z_t$ represents the latent-space variable.
Method
The proposed DiAD pipeline is shown in Fig. 2. First, the pre-trained encoder downsamples the input image into a latent-space representation. Then, noise is added to the latent representation, followed by the denoising process using an SD denoising network with a connection to the SG network; the denoising process is repeated for the same number of timesteps as the diffusion process. Finally, the reconstructed latent representation is restored to the original image level using the pre-trained decoder. For anomaly detection and localization, the input and reconstructed images are fed into the same pre-trained model to extract features at different scales and calculate the differences between these features.
Semantic-Guided Network
As discussed earlier, DDPM and LDM each have specific problems when addressing multi-class anomaly detection tasks. In response to these issues and to the multi-class task itself, we propose an SG network to address the problem of LDM's inability to effectively reconstruct anomalies while preserving the semantic information of the input image.
Given an input image $x_0 \in \mathbb{R}^{3 \times H \times W}$ in pixel space, the pre-trained encoder $\mathcal{E}$ encodes $x_0$ into a latent-space representation $z = \mathcal{E}(x_0)$, where $z \in \mathbb{R}^{c \times h \times w}$. Similar to Eq. 1,
where the original pixel-space variable $x$ is replaced by the latent representation $z$, the forward diffusion process can now be characterized as follows:
$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I). \tag{5}$$
The perturbed representation $z_T$ and the input $x_0$ are simultaneously fed into the SD denoising network and the SG network, respectively. After T steps of the reverse denoising process, the final variable $z$ is restored to the reconstructed image $\hat{x}_0$ by the pre-trained decoder $\mathcal{D}$, giving $\hat{x}_0 = \mathcal{D}(z)$. The training objective of DiAD is:
$$L_{DiAD} = \mathbb{E}_{z_0, t, c_i,\, \epsilon \sim \mathcal{N}(0, 1)}\left[\|\epsilon - \epsilon_\theta(z_t, t, c_i)\|_2^2\right]. \tag{6}$$
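A compact training-step sketch of Eqs. 5-6 is given below; the call signatures of `sd_unet` and `sg_net` are illustrative assumptions, not the authors' exact API:

```python
import torch
import torch.nn.functional as F

def diad_training_loss(sd_unet, sg_net, z0, x0, t, alpha_bars):
    """One training step of Eq. 6: noise the latent z0 (Eq. 5), let the SG
    network guide the SD denoising network, and regress the injected noise.
    The signatures of `sd_unet` and `sg_net` are illustrative only."""
    eps = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps   # Eq. 5
    guidance = sg_net(zt, x0, t)          # semantic guidance from the input image
    eps_pred = sd_unet(zt, t, guidance)   # SD denoising network with SG connection
    return F.mse_loss(eps_pred, eps)      # Eq. 6
```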
The denoising network consists of a pre-trained SD denoising network and an SG network whose parameters are initialized by replicating those of the SD network, as shown in Fig. 2. The pre-trained SD denoising network comprises four encoder blocks, one middle block and four decoder blocks. Here, 'block' denotes a frequently utilized unit in the construction of neural network layers, e.g., a 'resnet' block, a transformer block, a multi-head cross-attention block, etc.
The input image $x_0 \in \mathbb{R}^{3 \times H \times W}$ is transformed into $x \in \mathbb{R}^{d \times h \times w}$ by a set of 'conv-silu' layers $C$ in the SG network in order to keep the same dimension as the latent representations in SD Encoder Block 1 $E_{SD_1}$. Then, the sum of $x$ and $z$ is input into the SG Encoder Blocks (SGEBs). After continuous downsampling by the encoder $E_{SG}$, the results pass through the SG middle block $M_{SG}$ and are added to the output of the SD middle block $M_{SD}$. Additionally, to address multi-class tasks across different scenarios and categories, the results of the SG Decoder Blocks (SGDBs) $D_{SG}$ are also added to the results of the SD decoder $D_{SD}$ through an SFF block, which will be explained in the next section. The output $G$ of the denoising network is characterized as:
$$G = D_{SD}\big(M_{SD}(E_{SD}(z_t)) + M_{SG}(E_{SG}(z + C(x_0)))\big) + D_{SG_j}\big(M_{SG}(E_{SG}(z + C(x_0)))\big), \tag{7}$$
where $z$ represents the noise-perturbed latent representation, $x_0$ represents the input image, $C(\cdot)$ represents a set of 'conv-silu' layers in the SG network, $E_{SD}(\cdot)$ represents all the SD encoder blocks (SDEBs), $E_{SG}(\cdot)$ represents all the SGEBs, $M_{SG}(\cdot)$ and $M_{SD}(\cdot)$ represent the SG and SD middle blocks respectively, $D_{SD}(\cdot)$ represents all the SD decoder blocks (SDDBs), and $D_{SG_j}(\cdot)$ represents the $j$-th SGDB.
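Read as plain function composition, Eq. 7 can be sketched as follows, with each block passed in as a callable; this is a structural sketch only, and how the SD decoder consumes the SG skip features is simplified:

```python
def denoise_output(z_t, x0, E_SD, M_SD, D_SD, E_SG, M_SG, D_SG, C):
    """Structural sketch of Eq. 7. Every argument after x0 is a callable
    standing in for the corresponding block; shapes/skips are simplified."""
    h_sg = E_SG(z_t + C(x0))      # SG encoder on noisy latent plus image features
    m_sg = M_SG(h_sg)             # SG middle block output
    m = M_SD(E_SD(z_t)) + m_sg    # added to the SD middle block output
    return D_SD(m) + D_SG(m_sg)   # SG decoder features added to the SD decoder
```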
Spatial-Aware Feature Fusion Block
When adding several layers of decoder blocks from SGEBs to SDDBs during the experiment, as shown in Table 5, we found it challenging to solve multi-class anomaly detection. This is because the dataset contains various types of data, such as objects and textures. For texture-related cases, the anomalies are generally smaller, so it is necessary to preserve the original textures. For object-related cases, on the other hand, the defects often cover larger areas, requiring stronger reconstruction capabilities. It is therefore extremely challenging to simultaneously preserve the normal information of the original samples and reconstruct the abnormal locations in different scenarios.
Figure 3: Schematic diagram of the SFF block. Each layer in SGDB4 is obtained by adding the corresponding SGEB4 layer to every SGEB3 layer after a Conv Block is applied.
Hence, we propose a Spatial-aware Feature Fusion (SFF) block that integrates high-scale semantic information into the low-scale features. This ultimately enables the model to both preserve the information of the original normal samples and reconstruct large-scale abnormal regions. The structure of the SFF block is shown in Fig. 3. Each SGEB consists of three sub-layers; the SFF block therefore integrates the features of each layer in SGEB3 into each layer in SGEB4 and adds the fused features to the original features, so that the final output of each layer of SGEB4 is the sum of its original features and the Conv Block-processed features from every layer of SGEB3.
Since Batch Normalization (BN) (Ioffe and Szegedy 2015) computes normalization statistics over all images within a batch, it leads to a loss of unique details in each sample. BN is suitable for relatively large mini-batches with similar data distributions. However, for multi-class anomaly detection, where data distributions differ significantly among categories, normalizing the entire batch is not suitable. Since the results generated by SD mainly depend on the input image instance, using Instance Normalization (IN) (Ulyanov, Vedaldi, and Lempitsky 2017) can not only accelerate model convergence but also maintain the independence of each image instance. In addition, for the activation function we use SiLU (Elfwing, Uchibe, and Doya 2018) instead of the commonly used ReLU (Hahnloser et al. 2000), which preserves more input information. Experimental results in Table 5 show that performance improves when IN and SiLU are used together instead of the combination of BN and ReLU.
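A minimal PyTorch sketch of the Conv Block (Conv 3×3 + IN + SiLU) and the SFF fusion it enables is given below; equal channel counts and a 2× scale gap between the SGEB3 and SGEB4 stages are simplifying assumptions:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv 3x3 + IN + SiLU, the unit used inside the SFF block."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.InstanceNorm2d(channels)   # IN instead of BN (Table 5)
        self.act = nn.SiLU()                      # SiLU instead of ReLU (Table 5)

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class SFFBlock(nn.Module):
    """Sketch of Spatial-aware Feature Fusion: every SGEB3 layer is processed
    by a Conv Block and added to each SGEB4 layer (channel counts and the
    downsampling between the two stages are simplifying assumptions)."""
    def __init__(self, channels: int, num_layers: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(ConvBlock(channels) for _ in range(num_layers))
        self.down = nn.AvgPool2d(2)               # match SGEB3 maps to SGEB4 size

    def forward(self, sgeb4_feats, sgeb3_feats):
        fused = []
        for f4 in sgeb4_feats:                    # each layer of SGEB4
            out = f4
            for blk, f3 in zip(self.blocks, sgeb3_feats):
                out = out + self.down(blk(f3))    # add every fused SGEB3 layer
            fused.append(out)
        return fused
```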
Anomaly Localization and Detection
During the inference stage, the reconstructed image is obtained through the diffusion and denoising processes in the latent space. For anomaly localization and detection, we use the same ImageNet pre-trained feature extractor Ψ to extract features from both the input image $x_0$ and the reconstructed image $\hat{x}_0$.
| Category | PaDiM | DRAEM | RD4AD | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|---|---|
| Bottle | 97.9/- | 97.5/99.2/96.1 | 99.6/99.9/98.4 | 99.7/100./100. | 63.6/71.8/86.3 | 93.8/98.7/93.7 | 99.7/96.5/91.8 |
| Cable | 70.9/- | 57.8/74.0/76.3 | 84.1/89.5/82.5 | 95.2/95.9/88.0 | 55.6/69.7/76.0 | 55.7/74.8/77.7 | 94.8/98.8/95.2 |
| Capsule | 73.4/- | 65.3/92.5/90.4 | 94.1/96.9/96.9 | 86.9/97.8/94.4 | 52.9/82.0/90.5 | 60.5/81.4/90.5 | 89.0/97.5/95.5 |
| Hazelnut | 85.5/- | 93.7/97.5/92.3 | 60.8/69.8/86.4 | 99.8/100./99.3 | 87.0/90.4/88.1 | 93.0/95.8/89.8 | 99.5/99.7/97.3 |
| Metal Nut | 88.0/- | 72.8/95.0/92.0 | 100./100./99.5 | 99.2/99.9/99.5 | 60.0/74.4/89.4 | 53.0/80.1/89.4 | 99.1/96.0/91.6 |
| Pill | 68.8/- | 82.2/94.9/92.4 | 97.5/99.6/96.8 | 93.7/98.7/95.7 | 55.8/84.0/91.6 | 62.1/93.1/91.6 | 95.7/98.5/94.5 |
| Screw | 56.9/- | 92.0/95.7/89.9 | 97.7/99.3/95.8 | 87.5/96.5/89.0 | 53.6/71.9/85.9 | 58.7/81.9/85.6 | 90.7/99.7/97.9 |
| Toothbrush | 95.3/- | 90.6/96.8/90.0 | 97.2/99.0/94.7 | 94.2/97.4/95.2 | 57.5/68.0/83.3 | 78.6/83.9/83.3 | 99.7/99.9/99.2 |
| Transistor | 86.6/- | 74.8/77.4/71.1 | 94.2/95.2/90.0 | 99.8/98.0/93.8 | 57.8/44.6/57.1 | 61.0/57.8/59.1 | 99.8/99.6/97.4 |
| Zipper | 79.7/- | 98.8/99.9/99.2 | 99.5/99.9/99.2 | 95.8/99.5/97.1 | 64.9/77.4/88.1 | 73.6/89.5/90.6 | 95.1/99.1/94.4 |
| Carpet | 93.8/- | 98.0/99.1/96.7 | 98.5/99.6/97.2 | 99.8/99.9/99.4 | 95.5/98.7/91.0 | 99.4/99.8/99.4 | 99.4/99.9/98.3 |
| Grid | 73.9/- | 99.3/99.7/98.2 | 98.0/99.4/96.5 | 98.2/99.5/97.3 | 83.5/93.9/86.9 | 67.3/82.6/84.4 | 98.5/99.8/97.7 |
| Leather | 99.9/- | 98.7/99.3/95.0 | 100./100./100. | 100./100./100. | 98.4/99.5/96.3 | 97.4/99.0/96.3 | 99.8/99.7/97.6 |
| Tile | 93.3/- | 99.8/100./100. | 98.3/99.3/96.4 | 99.3/99.8/98.2 | 93.6/97.5/92.0 | 97.1/98.7/94.1 | 96.8/99.9/98.4 |
| Wood | 98.4/- | 99.8/100./100. | 99.2/99.8/98.3 | 98.6/99.6/96.6 | 98.6/99.6/97.5 | 97.8/99.4/95.9 | 99.7/100./100. |
| Mean | 84.2/- | 88.1/94.7/92.0 | 94.6/96.5/95.2 | 96.5/98.8/96.2 | 71.9/81.6/86.6 | 76.6/87.8/88.1 | 97.2/99.0/96.5 |

Table 1: Image-level multi-class anomaly classification results (AUROC-cls/AP-cls/F1max-cls) on MVTec-AD. Bottle through Zipper are object categories and Carpet through Wood are texture categories; PaDiM, DRAEM, RD4AD and UniAD are non-diffusion methods, while DDPM, LDM and Ours are diffusion-based.
| Metric | DRAEM | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|
| AUROC-cls | 79.1 | 85.5 | 54.5 | 56.7 | 86.8 |
| AP-cls | 81.9 | 85.5 | 57.9 | 61.4 | 88.3 |
| F1max-cls | 78.9 | 84.4 | 72.3 | 73.1 | 85.1 |
| AUROC-seg | 91.3 | 95.9 | 79.7 | 86.6 | 96.0 |
| AP-seg | 23.5 | 21.0 | 2.2 | 6.0 | 26.1 |
| F1max-seg | 29.5 | 27.0 | 4.5 | 9.9 | 33.0 |
| PRO | 58.8 | 75.6 | 46.8 | 55.0 | 75.2 |

Table 2: Quantitative comparisons on the VisA dataset. DRAEM and UniAD are non-diffusion methods; DDPM, LDM and Ours are diffusion-based.
The anomaly map $M_n$ on the $n$-th-scale feature maps is calculated using cosine similarity:
$$M_n(x_0, \hat{x}_0) = 1 - \frac{\Psi_n(x_0)^T \cdot \Psi_n(\hat{x}_0)}{\|\Psi_n(x_0)\|\, \|\Psi_n(\hat{x}_0)\|}, \tag{8}$$
where $n$ denotes the $n$-th feature layer $f_n$. The anomaly score $S$ of an input pair for anomaly localization is:
$$S = \sum_{n \in N} \sigma_n\, M_n(x_0, \hat{x}_0), \tag{9}$$
where $\sigma_n$ indicates the upsampling factor used to keep the same dimension as the pixel-space image and $N$ indicates the set of feature layers used during inference.
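A sketch of Eqs. 8-9 in PyTorch follows; the bilinear upsampling stands in for the factor σ_n, and the feature lists are assumed to come from the pre-trained extractor Ψ:

```python
import torch
import torch.nn.functional as F

def anomaly_score_map(feats_in, feats_rec, out_size=(256, 256)):
    """Eqs. 8-9: per-layer cosine distance between features of the input and
    the reconstruction, upsampled to image resolution and summed over layers.
    `feats_in` / `feats_rec` are lists of (B, C, H, W) feature maps."""
    score = torch.zeros(feats_in[0].size(0), 1, *out_size)
    for f_in, f_rec in zip(feats_in, feats_rec):
        m = 1.0 - F.cosine_similarity(f_in, f_rec, dim=1)   # Eq. 8, (B, H, W)
        m = F.interpolate(m.unsqueeze(1), size=out_size,
                          mode='bilinear', align_corners=False)  # sigma_n
        score = score + m                                    # Eq. 9, sum over n
    return score
```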
Experiment
Datasets and Evaluation Metrics
MVTec-AD Dataset. The MVTec-AD (Bergmann et al. 2019) dataset simulates real-world industrial production scenarios, filling a gap in unsupervised anomaly detection. It consists of 5 types of textures and 10 types of objects, comprising 5,354 high-resolution images from different domains. The training set contains 3,629 images with only anomaly-free samples. The test set consists of 1,725 images, including both normal and abnormal samples. Pixel-level annotations are provided for anomaly localization evaluation.
VisA Dataset. The VisA (Zou et al. 2022) dataset consists of 10,821 high-resolution images in total, including 9,621 normal images and 1,200 anomalous images with 78 types of anomalies. It comprises 12 subsets, each corresponding to a distinct object; the 12 objects can be categorized into three object types: complex structure, multiple instances, and single instance.
Evaluation Metrics. Following prior works, the Area Under the Receiver Operating Characteristic curve (AUROC), Average Precision (AP) and max F1-score (F1max) are used for both anomaly detection and anomaly localization, where cls denotes image-level anomaly detection and seg denotes pixel-level anomaly localization. Per-Region-Overlap (PRO) is also used for anomaly localization.
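For reference, a minimal sketch of the image-level metrics with scikit-learn (PRO is omitted here, as it requires per-region computation over connected components of the ground-truth masks):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

def detection_metrics(y_true, y_score):
    """AUROC, AP and F1max for image-level detection; applying the same
    functions to flattened masks and score maps gives the pixel-level ones."""
    auroc = roc_auc_score(y_true, y_score)
    ap = average_precision_score(y_true, y_score)
    prec, rec, _ = precision_recall_curve(y_true, y_score)
    f1 = 2 * prec * rec / np.clip(prec + rec, 1e-8, None)  # F1 at each threshold
    return auroc, ap, f1.max()

print(detection_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```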
Implementation Details
All images in MVTec-AD and VisA are resized to 256 × 256. For the denoising network, we adopt the 4th block of the SGDB for connection to the SDDB. We adopt ResNet50 as the feature extraction network and choose n ∈ {2, 3, 4} as the feature layers used in calculating the anomaly localization. We utilize the KL-regularized autoencoder and fine-tune it before training the denoising network. We train for 1,000 epochs on a single NVIDIA Tesla V100 32GB with a batch size of 12, using the Adam optimizer (Loshchilov and Hutter 2019) with a learning rate of 1e-5. A Gaussian filter with σ = 5 is used to smooth the anomaly localization score. For anomaly detection, the anomaly score of an image is the maximum value of the average-pooled anomaly localization map, which undergoes 8 rounds of average pooling with a kernel size of 8 × 8. During inference, the initial denoising timestep T is set to 1,000, and we use DDIM (Song, Meng, and Ermon 2021) as the sampler with 10 steps by default.
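A sketch of this scoring recipe follows; the stride of the pooling is an assumption, since the paper specifies only the 8 rounds of 8×8 average pooling:

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import gaussian_filter

def image_score(loc_map: torch.Tensor, rounds: int = 8) -> torch.Tensor:
    """Image-level anomaly score: Gaussian-smooth the (H, W) localization map,
    apply repeated 8x8 average pooling, then take the maximum value."""
    smoothed = torch.from_numpy(gaussian_filter(loc_map.numpy(), sigma=5))
    s = smoothed[None, None]                          # (1, 1, H, W) for pooling
    for _ in range(rounds):
        s = F.avg_pool2d(s, kernel_size=8, stride=1)  # stride is an assumption
    return s.max()

score = image_score(torch.rand(256, 256))  # scalar anomaly score for one image
```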
| Category | PaDiM | DRAEM | RD4AD | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|---|---|
| Bottle | 96.1/- | 87.6/62.5/56.9 | 97.8/68.2/67.6 | 98.1/66.0/69.2 | 59.9/4.9/11.7 | 86.9/49.1/50.0 | 98.4/52.2/54.8 |
| Cable | 81.0/- | 71.3/14.7/17.8 | 85.1/26.3/33.6 | 97.3/39.9/45.2 | 66.5/6.7/10.6 | 89.3/18.5/26.2 | 96.8/50.1/57.8 |
| Capsule | 96.9/- | 50.5/6.0/10.0 | 98.8/43.4/50.0 | 98.5/42.7/46.5 | 63.1/6.2/9.7 | 90.0/7.9/27.3 | 97.1/42.0/45.3 |
| Hazelnut | 96.3/- | 96.9/70.0/60.5 | 97.9/36.2/51.6 | 98.1/55.2/56.8 | 91.2/24.1/28.3 | 95.1/51.2/53.5 | 98.3/79.2/80.4 |
| Metal Nut | 84.8/- | 62.2/31.1/21.0 | 93.8/62.3/65.4 | 94.8/55.5/66.4 | 62.7/14.6/29.2 | 70.5/19.3/30.7 | 97.3/30.0/38.3 |
| Pill | 87.7/- | 94.4/59.1/44.1 | 97.5/63.4/65.2 | 95.0/44.0/53.9 | 55.3/4.0/8.4 | 74.9/10.2/15.0 | 95.7/46.0/51.4 |
| Screw | 94.1/- | 95.5/33.8/40.6 | 99.4/40.2/44.6 | 98.3/28.7/37.6 | 91.1/1.8/3.8 | 91.7/2.2/4.6 | 97.9/60.6/59.6 |
| Toothbrush | 95.6/- | 97.7/55.2/55.8 | 99.0/53.6/58.8 | 98.4/34.9/45.7 | 76.9/4.0/7.7 | 93.7/20.4/9.8 | 99.0/78.7/72.8 |
| Transistor | 92.3/- | 64.5/23.6/15.1 | 85.9/42.3/45.2 | 97.9/59.5/64.6 | 53.2/5.8/11.4 | 85.5/25.0/30.7 | 95.1/15.6/31.7 |
| Zipper | 94.8/- | 98.3/74.3/69.3 | 98.5/53.9/60.3 | 96.8/40.1/49.9 | 67.4/3.5/7.6 | 66.9/5.3/7.4 | 96.2/60.7/60.0 |
| Carpet | 97.6/- | 98.6/78.7/73.1 | 99.0/58.5/60.4 | 98.5/49.9/51.1 | 89.2/18.8/44.3 | 99.1/70.6/66.0 | 98.6/42.2/46.4 |
| Grid | 71.0/- | 98.7/44.5/46.2 | 99.2/46.0/47.4 | 96.5/23.0/28.4 | 63.1/0.7/1.9 | 52.4/1.1/1.9 | 96.6/66.0/64.1 |
| Leather | 84.8/- | 97.3/60.3/57.4 | 99.3/38.0/45.1 | 98.8/32.9/34.4 | 97.3/38.9/43.2 | 99.0/45.9/44.0 | 98.8/56.1/62.3 |
| Tile | 80.5/- | 98.0/93.6/86.0 | 95.3/48.5/60.5 | 91.8/42.1/50.6 | 87.0/35.2/36.6 | 90.1/43.9/51.6 | 92.4/65.7/64.1 |
| Wood | 89.1/- | 96.0/81.4/74.6 | 95.3/47.8/51.0 | 93.2/37.2/41.5 | 84.7/30.9/37.3 | 92.3/44.1/46.6 | 93.3/43.3/43.5 |
| Mean | 89.5/- | 87.2/52.5/48.6 | 96.1/48.6/53.8 | 96.8/43.4/49.5 | 75.6/13.3/19.5 | 85.1/27.6/31.0 | 96.8/52.6/55.5 |

Table 3: Pixel-level multi-class anomaly segmentation results (AUROC-seg/AP-seg/F1max-seg) on MVTec-AD. Bottle through Zipper are object categories and Carpet through Wood are texture categories; PaDiM, DRAEM, RD4AD and UniAD are non-diffusion methods, while DDPM, LDM and Ours are diffusion-based.
| Metric | DRAEM | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|
| PRO | 71.1 | 90.4 | 49.0 | 66.3 | 90.7 |

Table 4: Multi-class anomaly segmentation results with the PRO metric on MVTec-AD. DRAEM and UniAD are non-diffusion methods; DDPM, LDM and Ours are diffusion-based.
Comparison with SOTAs
We conduct and analyze a range of qualitative and quantitative comparison experiments on the MVTec-AD, VisA, MVTec-3D and medical datasets. We choose the synthesizing-based method DRAEM (Zavrtanik, Kristan, and Skocaj 2021a), two embedding-based methods, PaDiM (Defard et al. 2021) and RD4AD (Deng and Li 2022), the reconstruction-based method EdgRec (Liu et al. 2022), the unified SOTA method UniAD (You et al. 2022), and the diffusion-based DDPM and LDM methods. Specifically, we categorize the aforementioned methods into two types: non-diffusion and diffusion-based methods.
Qualitative Results. We conducted substantial qualitative experiments on the MVTec-AD and VisA datasets to visually demonstrate the superiority of our method in image reconstruction and the accuracy of anomaly localization. As shown in Figure 4, our method exhibits better reconstruction capabilities for anomalous regions than EdgRec on the MVTec-AD dataset. In comparison to UniAD, as shown in Figure 5, our method exhibits more accurate anomaly localization on the VisA dataset.
Quantitative Results. As shown in Table 1 and Table 3, our method achieves SOTA AUROC/AP/F1max metrics of 97.2/99.0/96.5 and 96.8/52.6/55.5 image-wise and pixel-wise respectively in the multi-class setting on the MVTec-AD dataset.
Figure 4: Qualitative illustration on the MVTec-AD dataset.
Among diffusion-based methods, our approach significantly outperforms the existing DDPM and LDM methods, with gains of 11.7↑ in AUROC and 25.0↑ in AP for anomaly localization. Among non-diffusion methods, our approach surpasses existing methods in both metrics, especially at the pixel level, where our method exceeds UniAD by 9.2↑/6.0↑ in AP/F1max. Our method also demonstrates its superiority on the VisA dataset, as shown in Table 2. Our approach exhibits significant improvements over diffusion-based methods, surpassing LDM by 30.1↑/9.4↑ in image/pixel AUROC. It also performs well compared to UniAD, with gains of 4.9↑/6.0↑ in pixel AP/F1max.
Ablation Studies
The Architecture Design of DiAD. We investigate the importance of each module in DiAD, as shown in Table 5.
| Metric | SD | MSG | SGEB3 | SGEB4 | BN+ReLU | IN+SiLU |
|---|---|---|---|---|---|---|
| AUROC-cls | 79.3 | 95.1 | 95.3 | 93.8 | 96.7 | 97.2 |
| AUROC-seg | 89.5 | 91.1 | 89.1 | 91.2 | 96.7 | 96.8 |

Table 5: Ablation studies on the design of DiAD with AUROC metrics.
SD indicates the diffusion model alone, without connection to the SG network, i.e., the LDM architecture. MSG indicates that only the middle block of the SG network is added to the SD middle block. SGEB3 and SGEB4 indicate direct skip-connections to the corresponding SDDBs. When connecting SGDB3 and SGDB4 at the same time, more details of the original images are preserved in terms of texture, but the reconstruction ability for large anomaly areas decreases. Using the combination of IN+SiLU in the SFF block yields better results than using BN+ReLU.
Effect of Pre-trained Feature Extractors. Table 6 shows the quantitative comparison of different pre-trained feature extraction networks. ResNet50 achieves the best performance on the anomaly classification metric, while WideResNet101 excels in anomaly segmentation.
| Backbone | AUROC-cls | AUROC-seg | PRO |
|---|---|---|---|
| VGG16 | 91.8 | 92.1 | 80.1 |
| VGG19 | 91.3 | 92.3 | 80.4 |
| ResNet18 | 94.7 | 96.0 | 89.1 |
| ResNet34 | 95.2 | 96.2 | 89.6 |
| ResNet50 | 97.2 | 96.8 | 90.7 |
| ResNet101 | 96.2 | 96.9 | 91.2 |
| WideResNet50 | 95.9 | 96.4 | 89.3 |
| WideResNet101 | 95.6 | 96.9 | 91.4 |
| EfficientNet-b0 | 93.5 | 94.0 | 84.0 |
| EfficientNet-b2 | 94.2 | 94.1 | 84.2 |
| EfficientNet-b4 | 92.8 | 93.6 | 83.5 |

Table 6: Ablation studies on different feature extractors.
Effect of Feature Layers Used in Anomaly Score Calculation. After extracting feature maps at 5 different scales using a pre-trained backbone, anomaly scores are calculated by computing the cosine similarity between feature maps from different layers. The experimental results, shown in the Appendix, indicate that using feature maps from layers f2, f3, and f4 (with corresponding sizes of 64 × 64, 32 × 32, and 16 × 16) yields the best performance.
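As one consistent mapping of these layers onto a standard backbone, the sketch below extracts 64×64, 32×32 and 16×16 feature maps from an ImageNet pre-trained ResNet50 with torchvision; the layer1-layer3 naming is an assumption about how f2-f4 line up with ResNet stages:

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# For a 256x256 input, ResNet50 stages layer1/layer2/layer3 produce 64x64,
# 32x32 and 16x16 maps, matching the sizes stated for f2, f3 and f4.
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()
extractor = create_feature_extractor(
    backbone, return_nodes={'layer1': 'f2', 'layer2': 'f3', 'layer3': 'f4'})

with torch.no_grad():
    feats = extractor(torch.randn(1, 3, 256, 256))
print({k: tuple(v.shape) for k, v in feats.items()})
# {'f2': (1, 256, 64, 64), 'f3': (1, 512, 32, 32), 'f4': (1, 1024, 16, 16)}
```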
Effect of Forward Diffusion Timesteps. Increasing the number of diffusion steps in the forward process impacts the performance of image reconstruction. The experimental results, depicted in Figure 6, indicate that as the number of forward diffusion steps increases, the image approaches pure Gaussian noise, while the anomaly reconstruction ability improves. Nevertheless, when the number of forward diffusion steps is less than 600, a significant decline in performance occurs because the number of steps is insufficient for anomaly reconstruction.
Figure 5: Qualitative results on the VisA dataset (columns: Input, Ours Rec., GT, UniAD Loc., Ours Loc.).
Figure 6: Ablation studies on different diffusion timesteps.
Conclusion
This paper proposes a diffusion-based framework, DiAD, to address the issue of category and semantic loss in the stable diffusion model for multi-class anomaly detection. We propose the Semantic-Guided network and the Spatial-aware Feature Fusion block to better reconstruct abnormal regions while maintaining the same semantic information as the input image. Our approach achieves state-of-the-art performance on the MVTec-AD and VisA datasets, significantly outperforming both non-diffusion and diffusion-based methods.
Limitation. Although our method has demonstrated exceptional performance in reconstructing anomalies, it can be susceptible to the influence of background impurities, resulting in errors in localization and classification. In the future, we will further explore diffusion models and enhance the background's anti-interference capability for multi-class anomaly detection. Additionally, we will incorporate multimodal assistance into our anomaly detection. Lastly, we will utilize larger models to enhance reconstruction performance.
Acknowledgments
This work was supported by Jianbing Lingyan Foundation
of Zhejiang Province, P.R. China (Grant No. 2023C01022).
References
Amit, T.; Shaharbany, T.; Nachmani, E.; and Wolf, L. 2022. SegDiff: Image Segmentation with Diffusion Probabilistic Models. arXiv:2112.00390.
Bergmann, P.; Fauser, M.; Sattlegger, D.; and Steger, C. 2019. MVTec AD – A comprehensive real-world dataset for unsupervised anomaly detection. In CVPR, 9592–9600.
Cao, Y.; Wan, Q.; Shen, W.; and Gao, L. 2022. Informative knowledge distillation for image anomaly segmentation. Knowledge-Based Systems, 248: 108846.
Cao, Y.; Xu, X.; Sun, C.; Cheng, Y.; Du, Z.; Gao, L.; and Shen, W. 2023. Segment Any Anomaly without Training via Hybrid Prompt Regularization. arXiv preprint arXiv:2305.10724.
Chen, R.; Xie, G.; Liu, J.; Wang, J.; Luo, Z.; Wang, J.; and Zheng, F. 2023a. EasyNet: An easy network for 3D industrial anomaly detection. In ACM MM, 7038–7046.
Chen, S.; Sun, P.; Song, Y.; and Luo, P. 2022. DiffusionDet: Diffusion Model for Object Detection. arXiv:2211.09788.
Chen, X.; Han, Y.; and Zhang, J. 2023. A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on Zero-shot AD and 4th Place on Few-shot AD. arXiv preprint arXiv:2305.17382.
Chen, X.; Zhang, J.; Tian, G.; He, H.; Zhang, W.; Wang, Y.; Wang, C.; Wu, Y.; and Liu, Y. 2023b. CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection. arXiv preprint arXiv:2311.00453.
Defard, T.; Setkov, A.; Loesch, A.; and Audigier, R. 2021. PaDiM: a patch distribution modeling framework for anomaly detection and localization. In ICPR, 475–489. Springer.
Deng, H.; and Li, X. 2022. Anomaly detection via reverse distillation from one-class embedding. In CVPR, 9737–9746.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.
Ding, C.; Pang, G.; and Shen, C. 2022. Catching both gray and black swans: Open-set supervised anomaly detection. In CVPR, 7388–7398.
Elfwing, S.; Uchibe, E.; and Doya, K. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107: 3–11.
Gu, Z.; Liu, L.; Chen, X.; Yi, R.; Zhang, J.; Wang, Y.; Wang, C.; Shu, A.; Jiang, G.; and Ma, L. 2023. Remembering Normality: Memory-guided Knowledge Distillation for Unsupervised Anomaly Detection. In ICCV, 16401–16409.
Hahnloser, R. H.; Sarpeshkar, R.; Mahowald, M. A.; Douglas, R. J.; and Seung, H. S. 2000. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789): 947–951.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.; Kingma, D. P.; Poole, B.; Norouzi, M.; Fleet, D. J.; and Salimans, T. 2022. Imagen Video: High Definition Video Generation with Diffusion Models. arXiv:2210.02303.
Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising Diffusion Probabilistic Models. In NeurIPS, volume 33, 6840–6851.
Huang, C.; Guan, H.; Jiang, A.; Zhang, Y.; Spratling, M.; and Wang, Y.-F. 2022. Registration based few-shot anomaly detection. In ECCV, 303–319. Springer.
Ioffe, S.; and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, 448–456. JMLR.org.
Jeong, J.; Zou, Y.; Kim, T.; Zhang, D.; Ravichandran, A.; and Dabeer, O. 2023. WinCLIP: Zero-/few-shot anomaly classification and segmentation. In CVPR, 19606–19616.
Kingma, D. P.; and Welling, M. 2022. Auto-Encoding Variational Bayes. arXiv:1312.6114.
Li, C.-L.; Sohn, K.; Yoon, J.; and Pfister, T. 2021. CutPaste: Self-supervised learning for anomaly detection and localization. In CVPR, 9664–9674.
Liang, Y.; Zhang, J.; Zhao, S.; Wu, R.; Liu, Y.; and Pan, S. 2023. Omni-frequency channel-selection representations for unsupervised anomaly detection. IEEE Transactions on Image Processing.
Liu, J.; Xie, G.; Wang, J.; Li, S.; Wang, C.; Zheng, F.; and Jin, Y. 2023. Deep Industrial Image Anomaly Detection: A Survey. arXiv preprint arXiv:2301.11514.
Liu, T.; Li, B.; Zhao, Z.; Du, X.; Jiang, B.; and Geng, L. 2022. Reconstruction from edge image combined with color and gradient difference for industrial surface anomaly detection. arXiv:2210.14485.
Liznerski, P.; Ruff, L.; Vandermeulen, R. A.; Franks, B. J.; Kloft, M.; and Müller, K. 2021. Explainable Deep One-Class Classification. In ICLR.
Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. arXiv:1711.05101.
Mousakhan, A.; Brox, T.; and Tayyub, J. 2023. Anomaly Detection with Conditioned Denoising Diffusion Models. arXiv:2305.15956.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 234–241. Springer.
Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; and Gehler, P. 2022. Towards total recall in industrial anomaly detection. In CVPR, 14318–14328.
Salehi, M.; Mirzaei, H.; Hendrycks, D.; Li, Y.; Rohban, M. H.; and Sabokrou, M. 2022. A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges. arXiv:2110.14051.
Salehi, M.; Sadjadi, N.; Baselizadeh, S.; Rohban, M. H.; and Rabiee, H. R. 2021. Multiresolution knowledge distillation for anomaly detection. In CVPR, 14902–14912.
Song, J.; Meng, C.; and Ermon, S. 2021. Denoising Diffusion Implicit Models. In ICLR. OpenReview.net.
Tan, D. S.; Chen, Y.-C.; Chen, T. P.-C.; and Chen, W.-C. 2021. TrustMAE: A noise-resilient defect classification framework using memory-augmented auto-encoders with trust regions. In WACV, 276–285.
Tan, M.; and Le, Q. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 6105–6114. PMLR.
Tao, X.; Gong, X.; Zhang, X.; Yan, S.; and Adak, C. 2022. Deep Learning for Unsupervised Anomaly Localization in Industrial Images: A Survey. IEEE Transactions on Instrumentation and Measurement, 71: 1–21.
Ulyanov, D.; Vedaldi, A.; and Lempitsky, V. 2017. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv:1607.08022.
Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; and Wang, C. 2023. Multimodal Industrial Anomaly Detection via Hybrid Fusion. In CVPR, 8032–8041.
Wu, J.; Li, J.; Zhang, J.; Zhang, B.; Chi, M.; Wang, Y.; and Wang, C. 2023. PVG: Progressive Vision Graph for Vision Recognition. arXiv preprint arXiv:2308.00574.
Wyatt, J.; Leach, A.; Schmon, S. M.; and Willcocks, C. G. 2022. AnoDDPM: Anomaly Detection with Denoising Diffusion Probabilistic Models using Simplex Noise. In CVPR Workshops, 649–655. IEEE.
Xie, G.; Wang, J.; Liu, J.; Jin, Y.; and Zheng, F. 2023. Pushing the Limits of Fewshot Anomaly Detection in Industry Vision: GraphCore. In ICLR.
Yan, X.; Zhang, H.; Xu, X.; Hu, X.; and Heng, P. 2021. Learning Semantic Context from Normal Samples for Unsupervised Anomaly Detection. In AAAI, 3110–3118.
Yi, J.; and Yoon, S. 2020. Patch SVDD: Patch-level SVDD for Anomaly Detection and Segmentation. In ACCV.
Yoon, J.; Sohn, K.; Li, C.-L.; Arik, S. O.; Lee, C.-Y.; and Pfister, T. 2022. Self-supervise, Refine, Repeat: Improving Unsupervised Anomaly Detection. Transactions on Machine Learning Research.
You, Z.; Cui, L.; Shen, Y.; Yang, K.; Lu, X.; Zheng, Y.; and Le, X. 2022. A Unified Model for Multi-class Anomaly Detection. In NeurIPS, volume 35, 4571–4584.
Yu, J.; Zheng, Y.; Wang, X.; Li, W.; Wu, Y.; Zhao, R.; and Wu, L. 2021. FastFlow: Unsupervised Anomaly Detection and Localization via 2D Normalizing Flows. arXiv:2111.07677.
Zagoruyko, S.; and Komodakis, N. 2016. Wide Residual Networks. In BMVC. BMVA Press.
Zavrtanik, V.; Kristan, M.; and Skocaj, D. 2021a. DRAEM – a discriminatively trained reconstruction embedding for surface anomaly detection. In ICCV, 8330–8339.
Zavrtanik, V.; Kristan, M.; and Skocaj, D. 2021b. Reconstruction by inpainting for visual anomaly detection. Pattern Recognition, 112: 107706.
Zhang, H.; Wang, Z.; Wu, Z.; and Jiang, Y.-G. 2023a. DiffusionAD: Denoising Diffusion for Anomaly Detection. arXiv:2303.08730.
Zhang, J.; Chen, X.; Xue, Z.; Wang, Y.; Wang, C.; and Liu, Y. 2023b. Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection. arXiv preprint arXiv:2311.02612.
Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; and Wang, C. 2023c. Rethinking Mobile Block for Efficient Attention-based Models. In ICCV, 1389–1400.
Zhang, J.; Li, X.; Wang, Y.; Wang, C.; Yang, Y.; Liu, Y.; and Tao, D. 2022. EATFormer: Improving vision transformer inspired by evolutionary algorithm. arXiv preprint arXiv:2206.09325.
Zhang, L.; and Agrawala, M. 2023. Adding Conditional Control to Text-to-Image Diffusion Models. arXiv:2302.05543.
Zou, Y.; Jeong, J.; Pemula, L.; Zhang, D.; and Dabeer, O. 2022. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In ECCV, 392–408. Springer.