NCUR 2022 @Home Abstract Submission Gallery

An Out-of-Distribution Document Classification Benchmark for RVL-CDIP

According to Forbes, more than 95 percent of companies need to manage unstructured data. The sheer volume of this data requires automated analytic tools such as document classification, in which documents are assigned appropriate categories (e.g., invoice, memo, email). The Ryerson Vision Lab Complex Document Information Processing (RVL-CDIP) dataset is the standard benchmark in the document classification research community for evaluating document classifiers, and it consists of document images in 16 categories. However, machine learning models trained on RVL-CDIP ignore the real-world challenge of detecting whether an image belongs to none of those 16 categories (i.e., it is out-of-distribution). A company deploying these models may therefore have many out-of-distribution images falsely classified as one of the 16 in-distribution categories. To address this problem, we created a new benchmark dataset of 10,000 out-of-distribution images. We first attempted to reproduce published results of the leading classifiers on RVL-CDIP and then ran them against our new dataset to measure the change in performance. We show that while classifiers perform well at in-distribution classification, they suffer at out-of-distribution detection on our dataset. If our results demonstrate that current document classifiers perform poorly at out-of-distribution detection, the academic community will need to improve its systems, and our dataset can serve as a benchmark for doing so.
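
The abstract does not say how out-of-distribution detection is performed. As a minimal sketch, assuming the common maximum-softmax-probability baseline (an assumption, not necessarily the method evaluated here), a 16-way classifier can reject an input when its top class probability falls below a threshold. The function names, the example logits, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
import math

NUM_CLASSES = 16  # RVL-CDIP defines 16 in-distribution categories


def softmax(logits):
    """Convert raw classifier scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_ood(logits, threshold=0.5):
    """Return (class_index, confidence), or (None, confidence) if flagged OOD.

    Assumption: maximum softmax probability is used as the confidence score.
    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    probs = softmax(logits)
    confidence = max(probs)
    if confidence < threshold:
        return None, confidence  # treated as out-of-distribution
    return probs.index(confidence), confidence


# Hypothetical logits: one peaked prediction and one nearly uniform prediction.
confident_logits = [0.0] * NUM_CLASSES
confident_logits[3] = 8.0
flat_logits = [0.1] * NUM_CLASSES

print(classify_with_ood(confident_logits))  # accepted as class 3
print(classify_with_ood(flat_logits))       # rejected as out-of-distribution
```

A classifier without the rejection branch would always return one of the 16 indices, which is the failure mode the abstract describes: every out-of-distribution image is forced into an in-distribution category.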

Presenter
Gordon Lim
US-Michigan

Category

Computer Science 2


A conference by ©2024 The Council on Undergraduate Research. All rights reserved.