proximity since they permit simultaneous assignment of the
same body-part candidates to multiple people hypotheses.
As a principled solution for multi-person pose estimation,
a model is proposed that jointly estimates poses of all people
present in an image by minimizing a joint objective. The
formulation is based on partitioning and labeling an initial
pool of body part candidates into subsets that correspond to
sets of mutually consistent body-part candidates and abide by
mutual consistency and exclusion constraints. The proposed
method has a number of appealing properties. (1) The for-
mulation is able to deal with an unknown number of people,
and also infers this number by linking part hypotheses. (2)
The formulation allows part hypotheses in the initial set of
part candidates to be either deactivated or merged, hence effectively
performing non-maximum suppression (NMS). In contrast
to NMS performed on individual part candidates, the model
incorporates evidence from all other parts making the pro-
cess more reliable. (3) The problem is cast in the form of
an Integer Linear Program (ILP). Although the problem is
NP-hard, the ILP formulation facilitates the computation of
bounds and feasible solutions with a certified optimality gap.
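To make the notion of a certified optimality gap concrete, a generic 0-1 ILP can be written as below. This is a sketch in generic notation, not the specific objective of the proposed model: the symbols c, A, b, the feasible solution \hat{x}, and the bounds LB and UB are assumptions introduced only for illustration. A lower bound LB (for example from a relaxation) and an upper bound UB given by any feasible solution bracket the unknown optimum, so their difference bounds how far \hat{x} can be from optimal.

```latex
\begin{align}
  f^\ast &= \min_{x \in \{0,1\}^n} \; c^\top x
           \quad \text{s.t.} \quad A x \le b, \\
  \mathrm{LB} &\le f^\ast \le \mathrm{UB} = c^\top \hat{x}, \\
  c^\top \hat{x} - f^\ast &\le \mathrm{UB} - \mathrm{LB}.
\end{align}
```

A solver can therefore stop early and still report how suboptimal the returned solution can be at most; many solvers report this difference relative to the objective value.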
This paper makes the following contributions. The main
contribution is the derivation of a joint detection and pose
estimation formulation cast as an integer linear program. Fur-
ther, two CNN variants are proposed to generate representa-
tive sets of body part candidates. These, combined with the
model, obtain state-of-the-art results for both single-person
and multi-person pose estimation on different datasets.
Related work. Most work on pose estimation targets the
single person case. Methods progressed from simple part
detectors and elaborate body models [25, 24, 16] to tree-
structured pictorial structures (PS) models with strong part
detectors [22, 34, 7, 26]. Impressive results are obtained pre-
dicting locations of parts with convolutional neural networks
(CNN) [31, 29]. While body models are not a necessary
component for effective part localization, constraints among
parts allow independent detections to be assembled into body
configurations as demonstrated in [7] by combining CNN-
based body part detectors with a body model [34].
A popular approach to multi-person pose estimation is to
detect people first and then estimate body pose independently
model for detection and pose estimation. [34] obtains mul-
tiple pose hypotheses corresponding to different root part
positions and then performs non-maximum suppression.
[15] detects people using a flexible configuration of poselets
and the body pose is predicted as a weighted average of acti-
vated poselets. [23] detects people and then predicts poses
of each person using a PS model. [5] estimates poses of mul-
tiple people in 3D by constructing a shared space of 3D body
part hypotheses, but uses 2D person detections to establish
the number of people in the scene. These approaches are
limited to cases in which people are sufficiently far from each other
and do not have overlapping body parts.
Our work is closely related to [12, 21] who also propose
a joint objective to estimate poses of multiple people. [12]
proposes a multi-person PS model that explicitly models
depth ordering and person-person occlusions. Our formula-
tion is not limited by a number of occlusion states among
people. [21] proposes a joint model for pose estimation and
body segmentation coupling pose estimates of individuals
by image segmentation. Both [12, 21] use a person detector to
generate initial hypotheses for the joint model. [21] resorts
to a greedy approach of adding one person hypothesis at a
time until the joint objective can no longer be reduced, whereas our
formulation can be solved with a certified optimality gap.
In addition, [21] relies on expensive labeling of body part
segmentation, which the proposed approach does not require.
Similarly to [8] we aim to distinguish between visible
and occluded body parts. [8] primarily focuses on the single-
person case and handles multi-person scenes akin to [34].
We consider the more difficult problem of full-body pose
estimation, whereas [12, 8] focus on upper-body poses and
consider a simplified case of people seen from the front.
Our work is related to early work on pose estimation
that also relies on integer linear programming to assemble
candidate body part hypotheses into valid configurations [16].
Their single-person method employs a tree graph augmented
with weaker non-tree repulsive edges and expects the same
number of parts. In contrast, our novel formulation relies
on a fully connected model to deal with an unknown number of
people per image and body parts per person.
The Minimum Cost Multicut Problem [9, 11], known in
machine learning as correlation clustering [4], has been used
in computer vision for image segmentation [1, 2, 19, 35] but
has not been used before in the context of pose estimation.
It is known to be NP-hard [10].
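The minimum cost multicut objective can be illustrated on a toy graph. The following is a minimal sketch, not the solver used in this work: it enumerates every partition of a handful of nodes and keeps the cheapest one, which is only feasible for very small instances. The function names, the example costs, and the sign convention (a cost is paid for every edge whose endpoints end up in different clusters, so negative costs reward a cut) are assumptions made for illustration.

```python
def partitions(n):
    """Yield every partition of nodes 0..n-1 as a list of cluster labels.

    Labels follow the restricted-growth rule (a new label is at most one
    larger than the largest label used so far), so each partition is
    produced exactly once.
    """
    def grow(labels, num_clusters):
        if len(labels) == n:
            yield list(labels)
            return
        for k in range(num_clusters + 1):
            labels.append(k)
            yield from grow(labels, max(num_clusters, k + 1))
            labels.pop()
    yield from grow([], 0)


def multicut_cost(labels, costs):
    """Sum the costs of all edges whose endpoints lie in different clusters."""
    return sum(c for (u, v), c in costs.items() if labels[u] != labels[v])


def brute_force_multicut(n, costs):
    """Return (best_cost, best_labels) by exhaustive search (exponential in n)."""
    return min((multicut_cost(labels, costs), labels) for labels in partitions(n))


# Hypothetical toy instance: positive cost = cutting is penalized,
# negative cost = cutting is rewarded.
example_costs = {
    (0, 1): 2.0, (2, 3): 1.5,    # pairs that should stay together
    (0, 2): -1.0, (1, 3): -1.0,  # pairs that should be separated
    (0, 3): -0.5, (1, 2): 0.5,
}
best_cost, best_labels = brute_force_multicut(4, example_costs)
```

Note that the number of clusters is not fixed in advance but follows from the edge costs, which mirrors how the number of people is inferred rather than given. The exhaustive search is for exposition only; since the problem is NP-hard, practical instances require dedicated ILP techniques.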