Benchmarking Object Detectors with COCO: A New Path Forward (2403.18819v1)
Abstract: The Common Objects in Context (COCO) dataset has been instrumental in benchmarking object detectors over the past decade. Like every dataset, COCO contains subtle errors and imperfections stemming from its annotation procedure. With the advent of high-performing models, we ask whether these errors in COCO are hindering its utility in reliably benchmarking further progress. In search of an answer, we inspect thousands of masks from COCO (2017 version) and uncover different types of errors such as imprecise mask boundaries, non-exhaustively annotated instances, and mislabeled masks. Due to the prevalence of COCO, we choose to correct these errors to maintain continuity with prior research. We develop COCO-ReM (Refined Masks), a cleaner set of annotations with visibly better mask quality than COCO-2017. We evaluate fifty object detectors and find that models that predict visually sharper masks score higher on COCO-ReM, affirming that they were being incorrectly penalized due to errors in COCO-2017. Moreover, our models trained using COCO-ReM converge faster and score higher than their larger variants trained using COCO-2017, highlighting the importance of data quality in improving object detectors. With these findings, we advocate using COCO-ReM for future object detection research. Our dataset is available at https://cocorem.xyz
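To make the claim about incorrectly penalized models concrete, the following is a minimal toy sketch, not taken from the paper. It assumes masks are small binary grids and uses two hypothetical helpers, `rect_mask` and `mask_iou`, to compare a tight prediction against a loose ground-truth mask and against a refined one. All shapes and sizes are invented for illustration.

```python
def rect_mask(height, width, top, left, bottom, right):
    """Binary mask (list of rows) with 1s inside [top, bottom) x [left, right)."""
    return [
        [1 if (top <= y < bottom and left <= x < right) else 0 for x in range(width)]
        for y in range(height)
    ]


def mask_iou(a, b):
    """Intersection-over-union of two equal-sized binary masks."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                inter += 1
            if pa or pb:
                union += 1
    return inter / union if union else 0.0


# Toy setup (assumed): the true object is a 6x6 square on a 10x10 grid.
refined_gt = rect_mask(10, 10, 2, 2, 8, 8)   # tight mask, 36 pixels
coarse_gt = rect_mask(10, 10, 1, 1, 9, 9)    # loose mask with a 1-pixel margin, 64 pixels

sharp_pred = rect_mask(10, 10, 2, 2, 8, 8)   # prediction that follows the true boundary
blobby_pred = rect_mask(10, 10, 1, 1, 9, 9)  # prediction that overshoots like the loose mask

print("sharp  vs coarse GT :", mask_iou(sharp_pred, coarse_gt))    # 36/64 = 0.5625
print("sharp  vs refined GT:", mask_iou(sharp_pred, refined_gt))   # 1.0
print("blobby vs coarse GT :", mask_iou(blobby_pred, coarse_gt))   # 1.0
print("blobby vs refined GT:", mask_iou(blobby_pred, refined_gt))  # 0.5625
```

In this toy case the sharper prediction scores lower against the loose ground truth than the overshooting one, and the ranking flips once the ground truth is tightened. Since COCO-style mask AP counts a prediction as a match only when its mask IoU clears a threshold, imprecise ground-truth boundaries can plausibly lower the measured score of a model whose masks are actually more accurate.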