According to Forbes, more than 95 percent of companies need to manage unstructured data. The sheer volume of this data requires automated analytic tools such as document classification, in which documents are assigned appropriate categories (e.g., invoice, memo, email). The Ryerson Vision Lab Complex Document Information Processing (RVL-CDIP) dataset, which consists of document images in 16 categories, is the standard benchmark within the document classification research community for evaluating document classifiers. However, machine learning models trained on RVL-CDIP ignore the real-world challenge of detecting whether an image belongs to none of those 16 categories (i.e., it is out-of-distribution). A company deploying these models may therefore have many out-of-distribution images falsely classified as one of the 16 in-distribution categories. To address this problem, we created a new benchmark dataset of 10,000 out-of-distribution images. We first attempted to reproduce the published results of the leading classifiers on RVL-CDIP and then ran them against our new dataset to measure the change in performance. We show that while these classifiers perform well at in-distribution classification, they suffer at out-of-distribution detection on our dataset. Because our results demonstrate that current document classifiers perform poorly at out-of-distribution detection, we call on the research community to improve their systems and to adopt our dataset as a benchmark.
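A minimal sketch of how out-of-distribution detection is commonly evaluated on top of a trained classifier: the maximum softmax probability is used as a confidence score, and inputs scoring below a threshold are flagged as out-of-distribution. This is a standard baseline for illustration only, assumed here rather than taken from the paper, and the example logits and threshold are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def is_out_of_distribution(logits, threshold=0.5):
    # Flag an input as OOD when the classifier's maximum softmax
    # probability (its confidence) falls below the threshold.
    confidence = softmax(logits).max(axis=-1)
    return bool(confidence < threshold)

# Hypothetical logits from a 16-class RVL-CDIP classifier:
confident = np.array([8.0] + [0.1] * 15)  # sharply peaked on one class
uncertain = np.ones(16)                   # flat: confidence is only 1/16
```

With these inputs, `is_out_of_distribution(confident)` is `False` while `is_out_of_distribution(uncertain)` is `True`; the failure mode described above is that real classifiers often remain highly confident even on genuinely out-of-distribution images, so this simple thresholding does not separate the two well.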
An Out-of-Distribution Document Classification Benchmark for RVL-CDIP
Category
Computer Science 2