DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation
Leonid Pishchulin1, Eldar Insafutdinov1, Siyu Tang1, Bjoern Andres1,
Mykhaylo Andriluka1,3, Peter Gehler2, and Bernt Schiele1
1Max Planck Institute for Informatics, Germany
2Max Planck Institute for Intelligent Systems, Germany
3Stanford University, USA
Abstract
This paper considers the task of articulated human pose estimation of multiple people in real world images. We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity of each other. This joint formulation is in contrast to previous strategies that address the problem by first detecting people and subsequently estimating their body pose. We propose a partitioning and labeling formulation of a set of body-part hypotheses generated with CNN-based part detectors. Our formulation, an instance of an integer linear program, implicitly performs non-maximum suppression on the set of part candidates and groups them to form configurations of body parts respecting geometric and appearance constraints. Experiments on four different datasets demonstrate state-of-the-art results for both single person and multi person pose estimation.¹

¹Models and code available at http://pose.mpi-inf.mpg.de
1. Introduction
Human body pose estimation methods have become increasingly reliable. Powerful body part detectors [29] in combination with tree-structured body models [30, 7] show impressive results on diverse datasets [18, 3, 26]. These benchmarks promote pose estimation of single pre-localized persons but exclude scenes with multiple people. This problem definition has been a driver for progress, but also falls short of representing a realistic sample of real-world images. Many photographs contain multiple people of interest (see Fig. 1) and it is unclear whether single pose approaches generalize directly. We argue that the multi person case deserves more attention since it is an important real-world task.
Figure 1. Method overview: (a) initial detections (= part candidates) and pairwise terms (graph) between all detections, which (b) are jointly clustered into persons (one colored subgraph = one person), with each part labeled according to its part class (different colors and symbols correspond to different body parts); (c) the predicted pose sticks.

Key challenges inherent to multi person pose estimation are the partial visibility of some people, significant overlap of bounding box regions of people, and the a priori unknown number of people in an image. The problem thus is to infer the number of persons and to assign part detections to person instances while respecting geometric and appearance constraints. Most strategies use a two-stage inference process [23, 15, 28] to first detect and then independently estimate poses. This is unsuited for cases when people are in close proximity, since it permits simultaneous assignment of the same body-part candidates to multiple people hypotheses.
As a principled solution for multi person pose estimation, a model is proposed that jointly estimates the poses of all people present in an image by minimizing a joint objective. The formulation is based on partitioning and labeling an initial pool of body part candidates into subsets that correspond to sets of mutually consistent body-part candidates and abide by mutual consistency and exclusion constraints. The proposed method has a number of appealing properties. (1) The formulation is able to deal with an unknown number of people, and also infers this number by linking part hypotheses. (2) The formulation allows either deactivating or merging part hypotheses in the initial set of part candidates, hence effectively performing non-maximum suppression (NMS). In contrast to NMS performed on individual part candidates, the model incorporates evidence from all other parts, making the process more reliable. (3) The problem is cast in the form of an Integer Linear Program (ILP). Although the problem is NP-hard, the ILP formulation facilitates the computation of bounds and feasible solutions with a certified optimality gap.

This paper makes the following contributions. The main contribution is the derivation of a joint detection and pose estimation formulation cast as an integer linear program. Further, two CNN variants are proposed to generate representative sets of body part candidates. These, combined with the model, obtain state-of-the-art results for both single-person and multi-person pose estimation on different datasets.
Related work. Most work on pose estimation targets the single person case. Methods progressed from simple part detectors and elaborate body models [25, 24, 16] to tree-structured pictorial structures (PS) models with strong part detectors [22, 34, 7, 26]. Impressive results are obtained by predicting locations of parts with convolutional neural networks (CNN) [31, 29]. While body models are not a necessary component for effective part localization, constraints among parts allow assembling independent detections into body configurations, as demonstrated in [7] by combining CNN-based body part detectors with a body model [34].
A popular approach to multi-person pose estimation is to detect people first and then estimate each body pose independently [28, 23, 34, 15]. [34] proposes a flexible mixture-of-parts model for detection and pose estimation; it obtains multiple pose hypotheses corresponding to different root part positions and then performs non-maximum suppression. [15] detects people using a flexible configuration of poselets, and the body pose is predicted as a weighted average of activated poselets. [23] detects people and then predicts the pose of each person using a PS model. [5] estimates poses of multiple people in 3D by constructing a shared space of 3D body part hypotheses, but uses 2D person detections to establish the number of people in the scene. These approaches are limited to cases with people sufficiently far from each other who do not have overlapping body parts.
Our work is closely related to [12, 21], who also propose a joint objective to estimate poses of multiple people. [12] proposes a multi-person PS model that explicitly models depth ordering and person-person occlusions. Our formulation is not limited by a number of occlusion states among people. [21] proposes a joint model for pose estimation and body segmentation, coupling pose estimates of individuals by image segmentation. [12, 21] use a person detector to generate initial hypotheses for the joint model. [21] resorts to a greedy approach of adding one person hypothesis at a time as long as the joint objective can be reduced, whereas our formulation can be solved with a certified optimality gap. In addition, [21] relies on expensive labeling of body part segmentation, which the proposed approach does not require.
Similarly to [8], we aim to distinguish between visible and occluded body parts. [8] primarily focuses on the single-person case and handles multi-person scenes akin to [34]. We consider the more difficult problem of full-body pose estimation, whereas [12, 8] focus on upper-body poses and consider the simplified case of people seen from the front.
Our work is related to early work on pose estimation that also relies on integer linear programming to assemble candidate body part hypotheses into valid configurations [16]. Their single person method employs a tree graph augmented with weaker non-tree repulsive edges and expects the same number of parts for every person. In contrast, our novel formulation relies on a fully connected model to deal with an unknown number of people per image and body parts per person.
The Minimum Cost Multicut Problem [9, 11], known in
machine learning as correlation clustering [4], has been used
in computer vision for image segmentation [1, 2, 19, 35] but
has not been used before in the context of pose estimation.
It is known to be NP-hard [10].
2. Problem Formulation
In this section, the problem of estimating articulated poses of an unknown number of people in an image is cast as an optimization problem. The goal of this formulation is to state three problems jointly: 1. The selection of a subset of body parts from a set D of body part candidates, estimated from an image as described in Section 4 and depicted as nodes of a graph in Fig. 1(a). 2. The labeling of each selected body part with one of C body part classes, e.g., “arm”, “leg”, “torso”, as depicted in Fig. 1(c). 3. The partitioning of body parts into those that belong to the same person, as depicted in Fig. 1(b).
2.1. Feasible Solutions
We encode labelings of the three problems jointly through triples (x, y, z) of binary random variables with domains x ∈ {0,1}^{D×C}, y ∈ {0,1}^{\binom{D}{2}} and z ∈ {0,1}^{\binom{D}{2}×C²}. Here, x_dc = 1 indicates that body part candidate d is of class c, y_dd' = 1 indicates that the body part candidates d and d' belong to the same person, and z_dd'cc' are auxiliary variables that relate x and y through z_dd'cc' = x_dc x_d'c' y_dd'. Thus, z_dd'cc' = 1 indicates that body part candidate d is of class c (x_dc = 1), body part candidate d' is of class c' (x_d'c' = 1), and body part candidates d and d' belong to the same person (y_dd' = 1).

In order to constrain the 01-labelings (x, y, z) to well-defined articulated poses of one or more people, we impose the linear inequalities (1)–(4) stated below. Here, the inequalities (1) guarantee that every body part is labeled with at most one body part class. (If it is labeled with no body part class, it is suppressed.) The inequalities (2) guarantee that distinct body parts d and d' belong to the same person only if neither d nor d' is suppressed. The inequalities (3) guarantee, for any three pairwise distinct body parts d, d' and d'', that if d and d' are the same person (as indicated by y_dd' = 1) and d' and d'' are the same person (as indicated by y_d'd'' = 1), then also d and d'' are the same person (y_dd'' = 1), that is, transitivity, cf. [9]. Finally, the inequalities (4) guarantee, for any dd' ∈ \binom{D}{2} and any cc' ∈ C², that z_dd'cc' = x_dc x_d'c' y_dd'. These constraints allow us to write an objective function as a linear form in z that would otherwise be written as a cubic form in x and y. We denote by X_DC the set of all (x, y, z) that satisfy all inequalities, i.e., the set of feasible solutions.

∀d ∈ D ∀cc' ∈ \binom{C}{2}:  x_dc + x_dc' ≤ 1                                  (1)
∀dd' ∈ \binom{D}{2}:  y_dd' ≤ Σ_{c∈C} x_dc   and   y_dd' ≤ Σ_{c∈C} x_d'c       (2)
∀dd'd'' ∈ \binom{D}{3}:  y_dd' + y_d'd'' − 1 ≤ y_dd''                          (3)
∀dd' ∈ \binom{D}{2} ∀cc' ∈ C²:  x_dc + x_d'c' + y_dd' − 2 ≤ z_dd'cc'
                                z_dd'cc' ≤ x_dc
                                z_dd'cc' ≤ x_d'c'
                                z_dd'cc' ≤ y_dd'                               (4)

When at most one person is in an image, we further constrain the feasible solutions to a well-defined pose of a single person. This is achieved by an additional class of inequalities which guarantee, for any two distinct body parts that are not suppressed, that they must be clustered together:

∀dd' ∈ \binom{D}{2} ∀cc' ∈ C²:  x_dc + x_d'c' − 1 ≤ y_dd'                      (5)
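For concreteness, a minimal feasibility check for the inequalities (1)–(4) might look as follows (our own sketch, not the authors' code; x, y, z are binary dictionaries indexed as above, and the single-person inequalities (5) could be checked analogously):

from itertools import combinations, product

def is_feasible(x, y, z, D, C):
    """Check inequalities (1)-(4) for binary dicts:
    x[d, c], y[d, d1] (with d < d1), z[d, d1, c, c1] in {0, 1}."""
    # (1): at most one class per part candidate
    for d in D:
        for c, c1 in combinations(C, 2):
            if x[d, c] + x[d, c1] > 1:
                return False
    for d, d1 in combinations(D, 2):
        # (2): clustered parts must not be suppressed
        if y[d, d1] > sum(x[d, c] for c in C):
            return False
        if y[d, d1] > sum(x[d1, c] for c in C):
            return False
        # (4): z couples x and y, i.e. z = x * x' * y
        for c, c1 in product(C, C):
            if x[d, c] + x[d1, c1] + y[d, d1] - 2 > z[d, d1, c, c1]:
                return False
            if z[d, d1, c, c1] > min(x[d, c], x[d1, c1], y[d, d1]):
                return False
    # (3): transitivity of the person partition
    for d, d1, d2 in combinations(D, 3):
        if y[d, d1] + y[d1, d2] - 1 > y[d, d2]:
            return False
        if y[d, d1] + y[d, d2] - 1 > y[d1, d2]:
            return False
        if y[d, d2] + y[d1, d2] - 1 > y[d, d1]:
            return False
    return True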
2.2. Objective Function
For every pair (d, c) ∈ D × C, we will estimate a probability p_dc ∈ [0, 1] of the body part candidate d being of class c. In the context of CRFs, these probabilities are called part unaries; we detail their estimation in Section 4.

For every dd' ∈ \binom{D}{2} and every cc' ∈ C², we consider a probability p_dd'cc' ∈ (0, 1): the conditional probability of d and d' belonging to the same person, given that d and d' are body parts of classes c and c', respectively. For c ≠ c', these probabilities p_dd'cc' are the pairwise terms in a graphical model of the human body. In contrast to the classic pictorial structures model, our model allows for a fully connected graph where each body part is connected to all other parts in the entire set D by a pairwise term. For c = c', p_dd'cc' is the probability of the part candidates d and d' representing the same part of the same person. This facilitates clustering of multiple part candidates of the same part of the same person and has a repulsive property that prevents nearby part candidates of the same type from being associated to different people.
The optimization problem that we call the subset partition and labeling problem (SPLP) is the ILP that minimizes over the set of feasible solutions X_DC:

min_{(x,y,z) ∈ X_DC}  ⟨α, x⟩ + ⟨β, z⟩,                                (6)

where we use the short-hand notation

α_dc := log((1 − p_dc) / p_dc)                                        (7)
β_dd'cc' := log((1 − p_dd'cc') / p_dd'cc')                            (8)
⟨α, x⟩ := Σ_{d∈D} Σ_{c∈C} α_dc x_dc                                   (9)
⟨β, z⟩ := Σ_{dd'∈\binom{D}{2}} Σ_{c,c'∈C} β_dd'cc' z_dd'cc'.          (10)
The objective (6)–(10) is the MAP estimate of a probability measure of joint detections x and clusterings y, z of body parts, where the prior probabilities p_dc and p_dd'cc' are estimated independently from data, and the likelihood is a positive constant if (x, y, z) satisfies (1)–(4), and is 0 otherwise. The exact form (6)–(10) is obtained by minimizing the negative logarithm of this probability measure.
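As a small illustration of Eqs. (6)–(8), the coefficients and the objective value of a fixed labeling can be computed directly from the estimated probabilities (a sketch under the array layout stated in the comments; not the authors' code):

import numpy as np

def logit_cost(p, eps=1e-9):
    """Eq. (7)/(8): alpha or beta as the negative log-odds of p.
    p close to 1 yields a large negative cost (rewarding selection)."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log((1.0 - p) / p)

def objective(p_unary, p_pair, x, z):
    """Value of Eq. (6) for a fixed feasible labeling.

    p_unary, x: arrays of shape (|D|, |C|); p_pair, z: arrays of
    shape (|D|, |D|, |C|, |C|), with only entries d < d' used."""
    alpha = logit_cost(p_unary)
    beta = logit_cost(p_pair)
    n = p_unary.shape[0]
    iu = np.triu_indices(n, k=1)          # pairs dd' with d < d'
    return float((alpha * x).sum() + (beta[iu] * z[iu]).sum())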
2.3. Optimization
In order to obtain feasible solutions of the ILP (6) with
guaranteed bounds, we separate the inequalities (1)–(5) in
the branch-and-cut loop of the state-of-the-art ILP solver
Gurobi. More precisely, we solve a sequence of relaxations
of the problem (6), starting with the (trivial) unconstrained
problem. Each problem is solved using the cuts proposed
by Gurobi. Once an integer feasible solution is found, we
identify violated inequalities (1)–(5), if any, by breadth-first-
search, add these to the constraint pool and re-solve the
tightened relaxation. Once an integer solution satisfying
all inequalities is found, together with a lower bound that
certifies an optimality gap below 1%, we terminate.
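A naive separation routine for the transitivity inequalities (3) on an integer solution y could be sketched as follows; the paper's implementation identifies violated inequalities more efficiently by breadth-first search inside Gurobi's branch-and-cut loop, so this enumeration is only illustrative:

from itertools import combinations

def violated_transitivity(D, y):
    """Return triples violating inequality (3) for an integer solution.

    D: list of detection ids; y: dict mapping frozenset({d, d1}) -> 0/1.
    A triple (a, b, c) is violated if y_ab = y_bc = 1 but y_ac = 0."""
    def Y(a, b):
        return y[frozenset((a, b))]

    cuts = []
    for d, d1, d2 in combinations(D, 3):
        # check all three rotations of the triangle inequality
        for a, b, c in ((d, d1, d2), (d1, d2, d), (d2, d, d1)):
            if Y(a, b) + Y(b, c) - 1 > Y(a, c):
                cuts.append((a, b, c))  # add y_ab + y_bc - 1 <= y_ac
    return cuts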
3. Pairwise Probabilities
Here we describe the estimation of the pairwise terms. We define pairwise features f_dd' for the variable z_dd'cc' (Sec. 2). Each part detection d includes the probabilities f^p_dc (Sec. 4.4), its location (x_d, y_d), scale h_d and bounding box B_d coordinates. Given two detections d and d', and the corresponding features (f^p_dc, x_d, y_d, h_d, B_d) and (f^p_d'c, x_d', y_d', h_d', B_d'), we define two sets of auxiliary variables for z_dd'cc', one set for c = c' (same body part class clustering) and one for c ≠ c' (across two body part classes labeling). These features capture the proximity, kinematic relation and appearance similarity between body parts.
The same body part class (c = c'). Two detections denoting the same body part of the same person should be in close proximity to each other. We introduce the following auxiliary variables that capture the spatial relations: Δx = |x_d − x_d'|/h̄, Δy = |y_d − y_d'|/h̄, Δh = |h_d − h_d'|/h̄, IOUnion, IOMin, IOMax. The latter three are intersections over union/minimum/maximum of the two detection boxes, respectively, and h̄ = (h_d + h_d')/2.
Non-linear Mapping. We augment the feature representation by appending quadratic and exponential terms. The final pairwise feature f_dd' for the variable z_dd'cc' is (Δx, Δy, Δh, IOUnion, IOMin, IOMax, (Δx)², ..., (IOMax)², exp(−Δx), ..., exp(−IOMax)).
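A direct transcription of this feature construction could look as follows (our own sketch; the corner-coordinate box format is an assumption):

import numpy as np

def pairwise_features_same_class(det1, det2):
    """Build the 18-dim feature of Sec. 3 for c = c'.

    Each det is (x, y, h, box) with box = (x0, y0, x1, y1). Returns
    (dx, dy, dh, IOUnion, IOMin, IOMax) plus their squares and
    exp(-.) terms, as described in the non-linear mapping."""
    (x1, y1, h1, b1), (x2, y2, h2, b2) = det1, det2
    h_bar = 0.5 * (h1 + h2)
    base = [abs(x1 - x2) / h_bar, abs(y1 - y2) / h_bar, abs(h1 - h2) / h_bar]
    # intersection area of the two boxes
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    base += [inter / (a1 + a2 - inter),   # IOUnion
             inter / min(a1, a2),         # IOMin
             inter / max(a1, a2)]         # IOMax
    base = np.array(base)
    return np.concatenate([base, base ** 2, np.exp(-base)])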
Two different body part classes (c ≠ c'). We encode the kinematic body constraints into the pairwise feature by introducing auxiliary variables S_dd' and R_dd', where S_dd' and R_dd' are the Euclidean distance and the angle between the two detections, respectively. To capture the joint distribution of S_dd' and R_dd', instead of using S_dd' and R_dd' directly, we employ the posterior probability p(z_dd'cc' = 1 | S_dd', R_dd') as pairwise feature for z_dd'cc' to encode the geometric relations between the body part classes c and c'. More specifically, assuming the prior probability p(z_dd'cc' = 1) = p(z_dd'cc' = 0) = 0.5, the posterior probability that detections d and d' have the body part labels c and c', namely z_dd'cc' = 1, is

p(z_dd'cc' = 1 | S_dd', R_dd') = p(S_dd', R_dd' | z_dd'cc' = 1) / (p(S_dd', R_dd' | z_dd'cc' = 1) + p(S_dd', R_dd' | z_dd'cc' = 0)),

where p(S_dd', R_dd' | z_dd'cc' = 1) is obtained from a normalized 2D histogram of S_dd' and R_dd' over positive training examples, analogous to the negative likelihood p(S_dd', R_dd' | z_dd'cc' = 0). In Sec. 5.1 we also experiment with encoding the appearance into the pairwise feature by concatenating the feature f^p_dc from d and f^p_d'c from d', as f^p_dc is the output of the CNN-based part detectors. The final pairwise feature is (p(z_dd'cc' = 1 | S_dd', R_dd'), f^p_dc, f^p_d'c).
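A sketch of this histogram-based posterior under the stated uniform-prior assumption (bin count and edge placement are our own choices for illustration):

import numpy as np

class GeometricPosterior:
    """Posterior p(z=1 | S, R) from normalized 2D histograms of
    distance S and angle R, assuming p(z=1) = p(z=0) = 0.5."""

    def __init__(self, S_pos, R_pos, S_neg, R_neg, bins=16):
        s_all = np.concatenate([S_pos, S_neg])
        self.s_edges = np.linspace(0.0, s_all.max(), bins + 1)
        self.r_edges = np.linspace(-np.pi, np.pi, bins + 1)
        self.h_pos = self._hist(S_pos, R_pos)   # positive likelihood
        self.h_neg = self._hist(S_neg, R_neg)   # negative likelihood

    def _hist(self, S, R):
        h, _, _ = np.histogram2d(S, R, bins=(self.s_edges, self.r_edges))
        return h / h.sum()                      # normalized 2D histogram

    def __call__(self, s, r):
        i = np.clip(np.searchsorted(self.s_edges, s) - 1, 0, len(self.s_edges) - 2)
        j = np.clip(np.searchsorted(self.r_edges, r) - 1, 0, len(self.r_edges) - 2)
        lp, ln = self.h_pos[i, j], self.h_neg[i, j]
        return lp / (lp + ln + 1e-12)           # Bayes rule, uniform prior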
3.1. Probability Estimation
The coefficients α and β of the objective function (Eq. 6) are defined by the probability ratio in log space (Eq. 7 and Eq. 8). Here we describe the estimation of the corresponding probabilities: (1) For every pair of detection and part class, namely for any (d, c) ∈ D × C, we estimate a probability p_dc ∈ (0, 1) of the detection d being a body part of class c. (2) For every combination of two distinct detections and two body part classes, namely for any dd' ∈ \binom{D}{2} and any cc' ∈ C², we estimate a probability p_dd'cc' ∈ (0, 1) of d and d' belonging to the same person while d and d' are body parts of classes c and c', respectively.
Learning. Given the features f_dd' and a Gaussian prior p(θ_cc') = N(0, σ²) on the parameters, the logistic model is

p(z_dd'cc' = 1 | f_dd', θ_cc') = 1 / (1 + exp(−⟨θ_cc', f_dd'⟩)).    (11)

(|C| × (|C| + 1))/2 parameter vectors are estimated using ML.
Inference. Given two detections d and d', the coefficients α_dc for x_dc and α_d'c' for x_d'c' are obtained by Eq. 7; the coefficient β_dd'cc' for z_dd'cc' has the form

β_dd'cc' = log((1 − p_dd'cc') / p_dd'cc') = −⟨θ_cc', f_dd'⟩.        (12)

Model parameters θ_cc' are learned using logistic regression.
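A minimal sketch of this learning and inference step (our illustration; scikit-learn is an assumed tool here, and the correspondence of the Gaussian prior to an L2 penalty holds up to scaling):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pairwise_model(F, z, sigma2=1.0):
    """Fit theta_cc' for one class pair (Eq. 11).

    F: (n, k) pairwise features f_dd'; z: (n,) binary labels.
    The Gaussian prior N(0, sigma2) corresponds (up to scaling)
    to the L2-regularization strength C = sigma2."""
    clf = LogisticRegression(C=sigma2, fit_intercept=False)
    clf.fit(F, z)
    return clf.coef_.ravel()          # theta_cc'

def beta_coefficient(theta, f):
    """Eq. 12: beta = log((1 - p) / p) = -<theta, f>."""
    return -float(np.dot(theta, f))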
4. Body Part Detectors
We first introduce our deep learning-based part detection models and then evaluate them on two prominent benchmarks, significantly outperforming the state of the art.
4.1. Adapted Fast R-CNN (AFR-CNN)
To obtain strong part detectors we adapt Fast R-CNN (FR-CNN) [14]. FR-CNN takes as input an image and a set of class-independent region proposals [32] and outputs the softmax probabilities over all classes and refined bounding boxes. To adapt FR-CNN for part detection we alter it in two ways: 1) proposal generation and 2) detection region size. The adapted version is called AFR-CNN throughout the paper.
Detection proposals. Generating object proposals is essential for FR-CNN, yet detecting body parts is challenging due to their small size and high intra-class variability. We use DPM-based part detectors [22] for proposal generation. We collect the K top-scoring detections of each part detector in a common pool of N part-independent proposals and use these proposals as input to AFR-CNN. N is 2,000 in the case of a single person and 20,000 in the case of multiple people.
Larger context. Increasing the size of DPM detections by upscaling every bounding box by a fixed factor allows the detector to capture more context around each part. In Sec. 4.3 we evaluate the influence of upscaling and show that using larger context around parts is crucial for best performance.
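The upscaling itself is straightforward; a sketch with an assumed corner-coordinate box format:

def upscale_box(box, factor):
    """Scale a bounding box (x0, y0, x1, y1) about its center by
    `factor`, enlarging the context captured around a part."""
    x0, y0, x1, y1 = box
    cx, cy = 0.5 * (x0 + x1), 0.5 * (y0 + y1)
    hw, hh = 0.5 * factor * (x1 - x0), 0.5 * factor * (y1 - y0)
    return (cx - hw, cy - hh, cx + hw, cy + hh)

# e.g. the 4x context evaluated in Sec. 4.3 (1x is roughly head size):
# big = upscale_box((10, 20, 30, 40), 4.0)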
Details. Following the standard FR-CNN training procedure, ImageNet models are finetuned on the pose estimation task. The center of a predicted bounding box is used as the body part location prediction. See the supplemental material for a detailed parameter analysis.
Setting                      Head  Sho   Elb   Wri   Hip   Knee  Ank   PCK   AUC
oracle 2,000                 98.8  98.8  97.4  96.4  97.4  98.3  97.7  97.8  84.0
DPM scale 1                  48.8  25.1  14.4  10.2  13.6  21.8  27.1  23.0  13.6
AlexNet scale 1              82.2  67.0  49.6  45.4  53.1  52.9  48.2  56.9  35.9
AlexNet scale 4              85.7  74.4  61.3  53.2  64.1  63.1  53.8  65.1  39.0
 + optimal params            88.1  79.3  68.9  62.6  73.5  69.3  64.7  72.4  44.6
VGG scale 4 optimal params   91.0  84.2  74.6  67.7  77.4  77.3  72.8  77.9  50.0
 + finetune LSP              95.4  86.5  77.8  74.0  84.5  78.8  82.6  82.8  57.0

Table 1. Unary only performance (PCK) of AFR-CNN on the LSP (Person-Centric) dataset. AFR-CNN is finetuned from ImageNet to MPII (lines 3–6), and then finetuned to LSP (line 7).
4.2. Dense Architecture (Dense-CNN)
Using proposals for body part detection may be suboptimal. We thus develop a fully convolutional architecture for computing part probability scoremaps.
Stride. We build on VGG [27]. Fully convolutional VGG has a stride of 32 px – too coarse for precise part localization. We thus use the hole algorithm [6] to reduce the stride to 8 px.
Scale. Selecting the image scale is crucial. We found that scaling to a standing height of 340 px performs best: the VGG receptive field then sees the entire body, which helps disambiguate body parts.
Loss function. We start with a softmax loss that outputs probabilities for each body part and the background. Its downside is the inability to assign probabilities above 0.5 to several close-by body parts. We thus re-formulate the detection as multi-label classification, where at each location a separate probability distribution is estimated for each part. We use the sigmoid activation function on the output neurons and a cross entropy loss. We found this loss to perform better than softmax and to converge much faster than MSE [30]. The target training scoremap for each joint is constructed by assigning a positive label 1 at every location within 15 px of the ground truth, and a negative label 0 otherwise.
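A compact sketch of the target construction and loss described above (our illustration; stride handling and batching are omitted):

import numpy as np

def target_scoremap(h, w, joint_xy, radius=15):
    """Binary target map: 1 within `radius` px of the joint, else 0."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2
    return (d2 <= radius ** 2).astype(np.float32)

def sigmoid_cross_entropy(logits, targets):
    """Per-part, per-location cross entropy with sigmoid activation.
    Numerically stable form: max(x,0) - x*t + log(1 + exp(-|x|))."""
    x, t = logits, targets
    return np.mean(np.maximum(x, 0) - x * t + np.log1p(np.exp(-np.abs(x))))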
Location refinement. In order to improve location precision we follow [14]: we add a location refinement FC layer after FC7 and use the relative offsets (Δx, Δy) from a scoremap location to the ground truth as regression targets.
Regression to other parts. Similar to location refinement, we add an extra term to the objective function, where for each part we regress onto all other part locations. We found this auxiliary task to improve performance (c.f. Sec. 4.3).
Training. We follow best practices and use SGD for CNN training. In each iteration we forward-pass a single image. After FC6 we select all positive and random negative samples to keep the pos/neg ratio at 25%/75%. We finetune VGG from the ImageNet model to the pose estimation task and use training data augmentation. We train for 430k iterations with the following learning rates (lr): 10k at lr=0.001, 180k at lr=0.002, 120k at lr=0.0002 and 120k at lr=0.0001. Pre-training at the smaller lr prevents the gradients from diverging.
Setting                  Head  Sho   Elb   Wri   Hip   Knee  Ank   PCK   AUC
MPII softmax             91.5  85.3  78.0  72.4  81.7  80.7  75.7  80.8  51.9
 + LSPET                 94.6  86.8  79.9  75.4  83.5  82.8  77.9  83.0  54.7
 + sigmoid               93.5  87.2  81.0  77.0  85.5  83.3  79.3  83.8  55.6
 + location refinement   95.0  88.4  81.5  76.4  88.0  83.3  80.8  84.8  61.5
 + auxiliary task        95.1  89.6  82.8  78.9  89.0  85.9  81.2  86.1  61.6
 + finetune LSP          97.2  90.8  83.0  79.3  90.6  85.6  83.1  87.1  63.6

Table 2. Unary only performance (PCK) of Dense-CNN VGG on the LSP (PC) dataset. Dense-CNN is finetuned from ImageNet to MPII (line 1), to MPII+LSPET (lines 2–5), and finally to LSP (line 6).
4.3. Evaluation of Part Detectors
Datasets. We train and evaluate on three public benchmarks: “Leeds Sports Poses” (LSP) [17] (person-centric (PC)), “LSP Extended” (LSPET) [18]², and “MPII Human Pose” (“Single Person”) [3]. The MPII training set (19185 people) is used as default. In some cases LSP training and LSPET are added to MPII (marked as MPII+LSPET in the experiments).
Evaluation measures. We use the standard “PCK” met-
ric [26, 31, 30] and evaluation scripts available on the web
page of [3]. In addition, we report “Area under Curve”
(AUC) computed for the entire range of PCK thresholds.
AFR-CNN. Evaluation of AFR-CNN on LSP is shown in Tab. 1. An oracle selecting per part the closest of 2,000 proposals achieves 97.8% PCK, as the proposals cover the majority of ground truth locations. Choosing a single proposal per part using the DPM score achieves 23.0% PCK – not surprising given the difficulty of the body part detection problem. Re-scoring the proposals using AFR-CNN with AlexNet [20] dramatically improves the performance to 56.9% PCK, as the CNN learns richer image representations. Extending the regions by 4x (1x ≈ head size) achieves 65.1% PCK, as it incorporates more context, including information about symmetric parts, and allows higher-order part relations to be encoded implicitly. Using data augmentation and slightly tuning the training parameters improves the performance to 72.4% PCK. We refer to the supplementary material for a detailed analysis. The deeper VGG architecture improves over the smaller AlexNet, reaching 77.9% PCK. All results so far are achieved by finetuning the ImageNet models on MPII. Further finetuning on LSP leads to a remarkable 82.8% PCK: the CNN learns LSP-specific image representations. The strong increase in AUC (57.0 vs. 50.0%) is due to improvements for smaller PCK thresholds. Using no bounding box regression leads to a performance drop (81.3% PCK, 53.2% AUC): location refinement is crucial for better localization. Overall, AFR-CNN obtains very good results on LSP, by far outperforming the state of the art (c.f. Tab. 3, rows 7–9). Evaluation on MPII shows competitive performance (Tab. 4, row 1).
²To reduce labeling noise we re-annotated the original high-resolution images and make the data available at http://datasets.d2.mpi-inf.mpg.de/hr-lspet/hr-lspet.zip

[Figure 2: plots of detection rate (%) over the normalized distance threshold. (a) PCK total, LSP (PC), comparing AFR-CNN, DeepCut SP AFR-CNN, Dense-CNN, DeepCut SP Dense-CNN, Tompson et al. (NIPS'14), Chen&Yuille (NIPS'14) and Fan et al. (CVPR'15). (b) PCKh total, MPII Single Person, comparing AFR-CNN, DeepCut SP AFR-CNN, Dense-CNN, DeepCut SP Dense-CNN, Tompson et al. (NIPS'14) and Tompson et al. (CVPR'15).]
Figure 2. Pose estimation results over all PCK thresholds.

Dense-CNN. The results are in Tab. 2. Training with VGG on MPII with the softmax loss achieves 80.8% PCK, thereby outperforming AFR-CNN (c.f. Tab. 1, row 6). This shows the advantages of fully convolutional training and evaluation. Expectedly, training on the larger MPII+LSPET dataset improves the results (83.0 vs. 80.8% PCK). Using the cross-entropy loss with sigmoid activations improves the results to 83.8% PCK, as it better models the appearance of close-by parts. Location refinement improves localization accuracy (84.8% PCK), which becomes more apparent when analyzing AUC (61.5 vs. 55.6%). Interestingly, regressing to other parts further improves PCK to 86.1%, showing the value of training with the auxiliary task. Finally, finetuning on LSP achieves the best result of 87.1% PCK, which is significantly higher than the best published results (c.f. Tab. 3, rows 7–9). Unary-only evaluation on MPII reveals slightly higher AUC compared to the state of the art (Tab. 4, rows 3–4).
4.4. Using Detections in DeepCut Models
The SPLP problem (Sec. 2) is NP-hard; to solve instances of it efficiently we select a subset of representative detections from the entire set produced by a model. In our experiments we use |D| = 100 as the default detection set size. In the case of AFR-CNN we directly use the softmax output as unary probabilities: f^p_dc = (p_d1, ..., p_dC), where p_dc is the probability of the detection d being of part class c. For the Dense-CNN detection model we use the sigmoid detection unary scores.
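A sketch of how such unaries feed the ILP objective (our illustration; the criterion for picking the representative subset is an assumption — here simply the top-scoring detections):

import numpy as np

def select_detections(p_unary, k=100):
    """Keep the k detections with the highest best-class probability
    (assumed criterion) and return their alpha costs, Eq. (7)."""
    best = p_unary.max(axis=1)                 # (n,) best class score
    keep = np.argsort(-best)[:k]               # indices of top-k detections
    p = np.clip(p_unary[keep], 1e-9, 1 - 1e-9)
    alpha = np.log((1 - p) / p)                # unary ILP coefficients
    return keep, alpha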
5. DeepCut Results
The aim of this paper is to tackle the multi person case.
To that end, we evaluate the proposed DeepCut models on
four diverse benchmarks. We confirm that both single person
(SP) and multi person (MP) variants (Sec. 2) are effective
on standard SP pose estimation datasets [17, 3]. Then, we
demonstrate superior performance of DeepCut MP on the
multi person pose estimation task.
5.1. Single Person Pose Estimation
We now evaluate the single person (SP) and the more general multi person (MP) DeepCut models on the LSP and MPII SP benchmarks described in Sec. 4. Since this evaluation setting implicitly relies on the knowledge that all parts are present in the image, we always output the full number of parts.
Results on LSP. We report per-part PCK results (Tab. 3)
and results for a variable distance threshold (Fig. 2 (a)).
Setting                  Head  Sho   Elb   Wri   Hip   Knee  Ank   PCK   AUC
AFR-CNN (unary)          95.4  86.5  77.8  74.0  84.5  82.6  78.8  82.8  57.0
 + DeepCut SP            95.4  86.7  78.3  74.0  84.3  82.9  79.2  83.0  58.4
 + appearance pairwise   95.4  87.2  78.6  73.7  84.7  82.8  78.8  83.0  58.5
 + DeepCut MP            95.2  86.7  78.2  73.5  84.6  82.8  79.0  82.9  58.0
Dense-CNN (unary)        97.2  90.8  83.0  79.3  90.6  85.6  83.1  87.1  63.6
 + DeepCut SP            97.0  91.0  83.8  78.1  91.0  86.7  82.0  87.1  63.5
 + DeepCut MP            96.2  91.2  83.3  77.6  91.3  87.0  80.4  86.7  62.6
Tompson et al. [30]      90.6  79.2  67.9  63.4  69.5  71.0  64.2  72.3  47.3
Chen&Yuille [7]          91.8  78.2  71.8  65.5  73.3  70.2  63.4  73.4  40.1
Fan et al. [33]*         92.4  75.2  65.3  64.0  75.7  68.3  70.4  73.0  43.2

*re-evaluated using the standard protocol; for details see the project page of [33]

Table 3. Pose estimation results (PCK) on the LSP (PC) dataset.
The DeepCut SP AFR-CNN model using 100 detections improves over unary only (83.0 vs. 82.8% PCK, 58.4 vs. 57.0% AUC), as pairwise connections filter out some of the high-scoring detections on the background. The improvement is clear in Fig. 2 (a) for smaller thresholds. Using part appearance scores in addition to geometric features in the c = c' pairwise terms only slightly improves AUC, as the appearance of neighboring parts is mostly captured by the relatively large region centered at each part. The performance of DeepCut MP AFR-CNN matches the SP variant and improves over AFR-CNN alone: DeepCut MP correctly handles the SP case. The performance of DeepCut SP Dense-CNN is almost identical to unary only, unlike the results for AFR-CNN. Dense-CNN performance is noticeably higher compared to AFR-CNN, and the “easy” cases that could have been corrected by a spatial model are already resolved by the stronger part detectors alone.
Comparison to the state of the art (LSP). Tab. 3 compares results of the DeepCut models to other deep learning methods specifically designed for single person pose estimation. All DeepCuts significantly outperform the state of the art, with the DeepCut SP Dense-CNN model improving by 13.7% PCK over the best known result [7]. The improvement is even more dramatic for lower thresholds (Fig. 2 (a)): for PCK @ 0.1 the best model improves by 19.9% over Tompson et al. [30], by 26.7% over Fan et al. [33], and by 32.4% PCK over Chen&Yuille [7]. The latter is interesting, as [7] use a stronger spatial model that predicts the pairwise terms conditioned on the CNN features, whereas DeepCuts use geometric-only pairwise connectivity. Including body part orientation information into DeepCuts should further improve the results.
Results on MPII Single Person. Results are shown in Tab. 4 and Fig. 2 (b). DeepCut SP AFR-CNN noticeably improves over AFR-CNN alone (79.8 vs. 78.8% PCK, 51.1 vs. 49.0% AUC). The improvement is stronger for smaller thresholds (c.f. Fig. 2), as the spatial model improves part localization. Dense-CNN alone trained on MPII outperforms AFR-CNN (81.6 vs. 78.8% PCK), which shows the advantages of dense training and evaluation. As expected, Dense-CNN performs slightly better when trained on the larger MPII+LSPET. Finally, DeepCut Dense-CNN SP is slightly better than Dense-CNN alone, leading to the best result on the MPII dataset (82.4% PCK).
Setting               Head  Sho   Elb   Wri   Hip   Knee  Ank   PCKh  AUC
AFR-CNN (unary)       91.5  89.7  80.5  74.4  76.9  69.6  63.1  78.8  49.0
 + DeepCut SP         92.3  90.6  81.7  74.9  79.2  70.4  63.0  79.8  51.1
Dense-CNN (unary)     93.5  88.6  82.2  77.1  81.7  74.4  68.9  81.6  56.0
 + LSPET              94.0  89.4  82.3  77.5  82.0  74.4  68.7  81.9  56.5
 + DeepCut SP         94.1  90.2  83.4  77.3  82.6  75.7  68.6  82.4  56.5
Tompson et al. [30]   95.8  90.3  80.5  74.3  77.6  69.7  62.8  79.6  51.8
Tompson et al. [29]   96.1  91.9  83.9  77.8  80.9  72.3  64.8  82.0  54.9

Table 4. Pose estimation results (PCKh) on MPII Single Person.
Comparison to the state of the art (MPII). We compare the performance of the DeepCut models to the best deep learning approaches from the literature [30, 29]³. DeepCut SP Dense-CNN outperforms both [30, 29] (82.4 vs. 79.6 and 82.0% PCK, respectively). Similar to them, DeepCuts rely on dense training and evaluation of part detectors, but unlike them use a single-size receptive field and do not include multi-resolution context information. Also, the appearance and spatial components of DeepCuts are trained piece-wise, unlike [30]. We observe that the performance differences are higher for smaller thresholds (c.f. Fig. 2 (b)). This is remarkable, as a much simpler strategy for location refinement is used compared to [29]. Using multi-resolution filters and joint training should improve the performance.
5.2. Multi Person Pose Estimation
We now evaluate the DeepCut MP models on the challenging task of MP pose estimation with an unknown number of people per image and visible body parts per person.
Datasets. For evaluation we use two public MP benchmarks: “We Are Family” (WAF) [12] with 350 training and 175 testing group shots of people, and “MPII Human Pose” (“Multi-Person”) [3], consisting of 3844 training and 1758 testing groups of multiple interacting individuals in highly articulated poses with a variable number of parts. On MPII, we use a subset of 288 testing images for evaluation. We first pre-finetune both AFR-CNN and Dense-CNN from ImageNet to MPII and MPII+LSPET, respectively, and further finetune each model to WAF and MPII Multi-Person. For WAF, we re-train the spatial model on the WAF training set.
WAF evaluation measure. Approaches are evaluated using the official toolkit [12], thus results are directly comparable to prior work. The toolkit implements the occlusion-aware “Percentage of Correct Parts (mPCP)” metric. In addition, we report “Accuracy of Occlusion Prediction (AOP)” [8].
MPII Multi-Person evaluation measure. The PCK metric is suitable for SP pose estimation with a known number of parts and does not penalize false positives that are not part of the ground truth. Thus, for MP pose estimation we use the “Mean Average Precision (mAP)” measure, similar to [28, 34]. In contrast to [28, 34], which evaluate the detection of any part instance in the image disregarding inconsistent pose predictions, we evaluate consistent part configurations. First, multiple body pose predictions are generated and then assigned to the ground truth (GT) based on the highest PCKh [3]. Only a single pose can be assigned to each GT. Unassigned predictions are counted as false positives. Finally, the AP for each body part is computed and the mAP is reported.

³[30] was re-trained and evaluated on the MPII dataset by the authors.

Setting                 Head  U Arms  L Arms  Torso  mPCP  AOP
AFR-CNN det ROI         69.8   46.0    36.7    83.7  53.1  73.9
DeepCut MP AFR-CNN      99.0   79.5    74.3    87.1  82.2  85.6
Dense-CNN det ROI       76.0   46.0    40.2    83.7  55.3  73.8
DeepCut MP Dense-CNN    99.3   81.5    79.5    87.1  84.7  86.5
Ghiasi et al. [13]       -      -       -       -    63.6  74.0
Eichner&Ferrari [12]    97.6   68.2    48.1    86.1  69.4  80.0
Chen&Yuille [8]         98.5   77.2    71.3    88.5  80.7  84.9

Table 5. Pose estimation results (mPCP) on the WAF dataset.
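The pose-to-GT assignment step of the mAP measure can be sketched as follows (a simplified illustration; pckh is an assumed helper scoring a predicted pose against a GT pose, not the official evaluation code):

def assign_poses(predictions, ground_truths, pckh):
    """Greedy one-to-one assignment of predicted poses to GT poses
    by highest PCKh; unassigned predictions become false positives.

    predictions, ground_truths: lists of poses; pckh(pred, gt) -> float.
    Returns (assignment dict pred_idx -> gt_idx, false positive indices)."""
    scores = sorted(
        ((pckh(p, g), i, j)
         for i, p in enumerate(predictions)
         for j, g in enumerate(ground_truths)),
        reverse=True)
    assigned, used_gt = {}, set()
    for s, i, j in scores:
        if s > 0 and i not in assigned and j not in used_gt:
            assigned[i] = j          # only a single pose per GT
            used_gt.add(j)
    false_pos = [i for i in range(len(predictions)) if i not in assigned]
    return assigned, false_pos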
Baselines. To assess the performance of AFR-CNN and Dense-CNN we follow a traditional route from the literature based on a two-stage approach: first a set of regions of interest (ROI) is generated and then SP pose estimation is performed in the ROIs. This corresponds to unary only performance. ROIs are either based on the ground truth (GT ROI) or on the people detector output (det ROI).
Results on WAF. Results are shown in Tab. 5. det ROI is obtained by extending the provided upper body detection boxes. AFR-CNN det ROI achieves 53.1% mPCP and 73.9% AOP. DeepCut MP AFR-CNN significantly improves over AFR-CNN det ROI, achieving 82.2% mPCP. This improvement is stronger compared to LSP and MPII due to several reasons. First, mPCP requires consistent prediction of body sticks as opposed to body joints, and including the spatial model enforces consistency. Second, the mPCP metric is occlusion-aware. DeepCuts can deactivate detections for occluded parts, thus effectively reasoning about occlusion. This is supported by the strong increase in AOP (85.6 vs. 73.9%). Results of DeepCut MP Dense-CNN follow the same tendency, achieving the best performance of 84.7% mPCP and 86.5% AOP. Both increases in mPCP and AOP show the advantages of DeepCuts over traditional det ROI approaches.

Tab. 5 shows that DeepCuts outperform all prior methods. The deep learning method [8] is outperformed on both the mPCP (84.7 vs. 80.7%) and AOP (86.5 vs. 84.9%) measures. This is remarkable, as DeepCuts reason about part interactions across several people, whereas [8] primarily focuses on the single-person case and handles multi-person scenes akin to [34]. In contrast to [8], DeepCuts are not limited by the number of possible occlusion patterns and cover person-person occlusions and other types, such as truncation and occlusion by objects, in one formulation. DeepCuts significantly outperform [12] while being more general: unlike [12], DeepCuts do not require a person detector and are not limited by a number of occlusion states among people.

A qualitative comparison to [8] is provided in Fig. 3.
Figure 3. Qualitative comparison of our joint formulation DeepCut MP Dense-CNN (middle) to the traditional two-stage approach Dense-CNN det ROI (top) and the approach of Chen&Yuille [8] (bottom) on the WAF dataset. In contrast to det ROI, DeepCut MP is able to disambiguate multiple and potentially overlapping persons and correctly assemble independent detections into plausible body part configurations. In contrast to [8], DeepCut MP can better predict occlusions (image 2, persons 1–4 from the left, top row; image 4, persons 1, 4; image 5, person 2) and better cope with strong articulations and foreshortening (image 1, persons 1, 3; image 2, person 1, bottom row; image 3, persons 1–2). See the supplementary material for more examples.
Results on MPII Multi-Person. Obtaining a strong detector of highly articulated people with strong occlusions and truncations is difficult. We employ a neck detector as a person detector, as the neck turned out to be the most reliable part. A full body bounding box is created around a neck detection and used as det ROI. GT ROIs were provided by the authors of [3]. As the MP approach [8] is not publicly available, we compare to the SP state-of-the-art method [7] applied to GT ROI image crops.

Results are shown in Tab. 6. DeepCut MP AFR-CNN improves over AFR-CNN det ROI by 4.3%, achieving 51.4% AP. The largest differences are observed for the ankle, knee, elbow and wrist, as those parts benefit most from the connections to other parts. DeepCut MP UB AFR-CNN, which uses upper body parts only, slightly improves over the full body model when compared on common parts (60.5 vs. 58.2% AP). Similar tendencies are observed for the Dense-CNNs, though the improvements of MP UB over MP are more significant.

All DeepCuts outperform Chen&Yuille SP GT ROI, partially due to stronger part detectors compared to [7] (c.f. Tab. 3). Another reason is that Chen&Yuille SP GT ROI does not model body part occlusion and truncation, always predicting the full set of parts, which is penalized by the AP measure. In contrast, our formulation allows deactivating part hypotheses in the initial set of part candidates, thus effectively performing non-maximum suppression. In DeepCuts, part hypotheses are suppressed based on the evidence from all other body parts, making this process more reliable.
Setting                 Head  Sho   Elb   Wri   Hip   Knee  Ank   UBody  FBody
AFR-CNN det ROI         71.1  65.8  49.8  34.0  47.7  36.6  20.6  55.2   47.1
AFR-CNN MP              71.8  67.8  54.9  38.1  52.0  41.2  30.4  58.2   51.4
AFR-CNN MP UB           75.2  71.0  56.4  39.6   -     -     -    60.5    -
Dense-CNN det ROI       77.2  71.8  55.9  42.1  53.8  39.9  27.4  61.8   53.2
Dense-CNN MP            73.4  71.8  57.9  39.9  56.7  44.0  32.0  60.7   54.1
Dense-CNN MP UB         81.5  77.3  65.8  50.0   -     -     -    68.7    -
AFR-CNN GT ROI          73.2  66.5  54.6  42.3  50.1  44.3  37.8  59.1   53.1
Dense-CNN GT ROI        78.1  74.1  62.2  52.0  56.9  48.7  46.1  66.6   60.2
Chen&Yuille SP GT ROI   65.0  34.2  22.0  15.7  19.2  15.8  14.2  34.2   27.1

Table 6. Pose estimation results (AP) on MPII Multi-Person.
6. Conclusion
Articulated pose estimation of multiple people in uncontrolled real world images is challenging but of real world interest. In this work, we proposed a new formulation as a joint subset partitioning and labeling problem (SPLP). Different from previous two-stage strategies that separate the detection and pose estimation steps, the SPLP model jointly infers the number of people, their poses, spatial proximity, and part level occlusions. Empirical results on four diverse and challenging datasets show significant improvements over all previous methods, not only for the multi person, but also for the single person pose estimation problem. On the multi person WAF dataset we improve by 30% PCP over the traditional two-stage approach. This shows that a joint formulation is crucial to disambiguate multiple and potentially overlapping persons. Models and code are available at http://pose.mpi-inf.mpg.de.
References
[1] A. Alush and J. Goldberger. Ensemble segmentation using efficient integer linear programming. TPAMI, 34(10):1966–1977, 2012.
[2] B. Andres, J. H. Kappes, T. Beier, U. Köthe, and F. A. Hamprecht. Probabilistic image segmentation with closedness constraints. In ICCV, 2011.
[3] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR'14.
[4] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1–3):89–113, 2004.
[5] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic. 3D pictorial structures for multiple human pose estimation. In CVPR'14.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[7] X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS'14.
[8] X. Chen and A. Yuille. Parsing occluded people by flexible compositions. In CVPR, 2015.
[9] S. Chopra and M. Rao. The partition problem. Mathematical Programming, 59(1–3):87–115, 1993.
[10] E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation clustering in general weighted graphs. Theoretical Computer Science, 361(2–3):172–187, 2006.
[11] M. M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springer, 1997.
[12] M. Eichner and V. Ferrari. We are family: Joint pose estimation of multiple persons. In ECCV'10.
[13] G. Ghiasi, Y. Yang, D. Ramanan, and C. Fowlkes. Parsing occluded people. In CVPR'14.
[14] R. Girshick. Fast R-CNN. In ICCV'15.
[15] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In CVPR'14.
[16] H. Jiang and D. R. Martin. Global pose estimation using non-tree models. In CVPR'09.
[17] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC'10.
[18] S. Johnson and M. Everingham. Learning effective human pose estimation from inaccurate annotation. In CVPR'11.
[19] S. Kim, C. Yoo, S. Nowozin, and P. Kohli. Image segmentation using higher-order correlation clustering. TPAMI, 36:1761–1774, 2014.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS'12.
[21] L. Ladicky, P. H. Torr, and A. Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In CVPR'13.
[22] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Strong appearance and expressive spatial models for human pose estimation. In ICCV'13.
[23] L. Pishchulin, A. Jain, M. Andriluka, T. Thormaehlen, and B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In CVPR'12.
[24] D. Ramanan. Learning to parse images of articulated objects. In NIPS'06.
[25] X. Ren, A. C. Berg, and J. Malik. Recovering human body configurations using pairwise constraints between parts. In ICCV'05.
[26] B. Sapp and B. Taskar. Multimodal decomposable models for human pose estimation. In CVPR'13.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
[28] M. Sun and S. Savarese. Articulated part-based model for joint object detection and pose estimation. In ICCV'11.
[29] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In CVPR'15.
[30] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS'14.
[31] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR'14.
[32] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV'13.
[33] X. Fan, K. Zheng, Y. Lin, and S. Wang. Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. In CVPR'15.
[34] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. PAMI'13.
[35] J. Yarkony, A. Ihler, and C. C. Fowlkes. Fast planar correlation clustering for image segmentation. In ECCV, 2012.