The Pascal Visual Object Classes (VOC) Challenge

Published: 01 June 2010

Abstract

The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset have become accepted as the benchmark for object detection.
This paper describes the dataset and evaluation procedure. We review the state of the art in evaluated methods for both classification and detection, and analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three-year history of the challenge, and proposes directions for future improvement and extension.


    Published In

    International Journal of Computer Vision, Volume 88, Issue 2
    June 2010
    194 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Author Tags

    1. Benchmark
    2. Database
    3. Object detection
    4. Object recognition
