short-paper

Fusing object detection and region appearance for image-text alignment

Authors:

James Magahern,

Atul Kanaujia, and

Niels Haering

MM '11: Proceedings of the 19th ACM international conference on Multimedia

November 2011

Pages 1113 - 1116

https://doi.org/10.1145/2072298.2071951

Published: 28 November 2011

Abstract

We present a method for automatically aligning words to image regions that integrates specific object classifiers (e.g., "car" detectors) with weak models based on appearance features. Previous strategies have largely focused on the latter, and thus have not exploited progress on object category recognition. Hence, we augment region labeling with object detection, which simplifies the problem by reliably identifying a subset of the labels and thereby reduces correspondence ambiguity overall. Comprehensive testing on the SAIAPR TC-12 dataset shows that principled integration of object detection improves the region labeling task.
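The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the score dictionaries, the thresholds `det_thresh` and `iou_thresh`, and the rule that a word fixed by a detector is removed from the candidate pool are all assumptions made for illustration. The sketch first lets confident detections fix some region labels, then labels the remaining regions with the weaker appearance scores over the words that are still unexplained.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def align_words_to_regions(
    words: List[str],
    regions: Dict[str, Box],
    appearance: Dict[Tuple[str, str], float],
    detections: List[Tuple[str, Box, float]],
    det_thresh: float = 0.8,   # hypothetical confidence cutoff
    iou_thresh: float = 0.5,   # hypothetical overlap cutoff
) -> Dict[str, str]:
    """Return a region_id -> word labeling (illustrative only)."""
    labels: Dict[str, str] = {}

    # Step 1: confident detections of caption words fix a subset of labels.
    for word, box, score in sorted(detections, key=lambda d: -d[2]):
        if score < det_thresh or word not in words:
            continue
        free = [r for r in regions if r not in labels]
        if not free:
            break
        best = max(free, key=lambda r: iou(regions[r], box))
        if iou(regions[best], box) >= iou_thresh:
            labels[best] = word

    # Step 2 (assumption): words already explained by a detector are removed,
    # so the appearance model chooses among fewer candidates.
    used = set(labels.values())
    remaining = [w for w in words if w not in used] or list(words)
    for r in regions:
        if r not in labels:
            labels[r] = max(remaining, key=lambda w: appearance.get((r, w), 0.0))
    return labels
```

In this toy setting, a region that the appearance scores alone would mislabel can be corrected once a detector claims the competing word for another region, which is one way to read "reducing correspondence ambiguity".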

Cited By

K. Barnard (2016). Computational Methods for Integrating Vision and Language. Synthesis Lectures on Computer Vision, 6:1 (1-227). Online publication date: 20-Apr-2016.
https://doi.org/10.2200/S00705ED1V01Y201602COV007

Index Terms

Fusing object detection and region appearance for image-text alignment
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object recognition

Published In

MM '11: Proceedings of the 19th ACM international conference on Multimedia

November 2011

944 pages

ISBN:9781450306164

DOI:10.1145/2072298

General Chairs:
K. Selçuk Candan (Arizona State University, USA),
Sethuraman Panchanathan (Arizona State University, USA),
Balakrishnan Prabhakaran (University of Texas at Dallas, USA)

Program Chairs:
Hari Sundaram (Arizona State University, USA),
Wu-Chi Feng (Portland State University, USA),
Nicu Sebe (University of Trento, Italy)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Conference

MM '11

Sponsor:

SIGMM

MM '11: ACM Multimedia Conference

November 28 - December 1, 2011

Scottsdale, Arizona, USA

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%
