This is the official webpage of the MiikeMineStamps dataset.
The MiikeMineStamps dataset is a naturally long-tailed open-ended dataset in the domain of Japanese historical documents. It contains 5056 images of Japanese stamps that belong to 415 classes including two special ones. The stamps were extracted from a large compendium of historical documents from the Japanese company Mitsui Mi’ike Mine, one of the largest business archives in modern Japan that spans half a century, includes tens of thousands of documents, and has been widely used by labor historians, business historians, and others.
The current version is v1.0.
The collection currently has 5056 images of 415 stamps collected from hundreds of documents. New versions will be published as our expert labels more images.
Figure 1 shows several examples of stamps.
The histogram of the number of samples per stamp name is shown below. The dataset was collected via active learning; colors indicate the active learning cycle in which each sample was added.
@inproceedings{MiikeMineStamps,
author = {Toropov, Evgeny and Buitrago, Paola and Prabha, Rajanie and Uran, Julian and Adal, Raja},
year = {2021},
title = {MiikeMineStamps: A Long-Tailed Dataset of Japanese Stamps via Active Learning},
booktitle = {ICDAR '21: 16th International Conference on Document Analysis and Recognition},
}
The dataset consists of 5056 cropped images of stamps (Figure 1) and the related meta-information. The meta-information is available in two formats: JSON and an SQLite database with the Shuffler schema. JSON is human-readable and easier to load with conventional tools, while the Shuffler database format makes it easier to compute various statistics.
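As a rough illustration of working with the JSON meta-information, the sketch below counts samples per stamp class, which is the quantity behind the long-tailed histogram. It assumes the JSON file holds a list of per-image objects keyed by the field names documented on this page; that layout, the helper names `load_records` and `samples_per_class`, and the inline sample values are assumptions made for the example, not a description of the released files.

```python
import json
from collections import Counter


def load_records(json_text: str) -> list:
    """Parse JSON text that is assumed to be a list of per-image records."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of per-image records")
    return records


def samples_per_class(records: list) -> Counter:
    """Count how many images carry each stamp class name."""
    return Counter(record["name"] for record in records)


# Hypothetical sample data; the real file layout and values may differ.
sample_json = """
[
  {"imagefile": "stamps/0001.jpg", "name": "stamphere"},
  {"imagefile": "stamps/0002.jpg", "name": "stamphere"},
  {"imagefile": "stamps/0003.jpg", "name": "example-name"}
]
"""

records = load_records(sample_json)
counts = samples_per_class(records)

# List classes from most to least frequent to expose the long tail.
for class_name, n in counts.most_common():
    print(class_name, n)
```

With the real file, the text passed to `load_records` would come from reading the JSON file from disk instead of the inline string.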
The following information is available for each image:
imagefile
: the path to the image.
name
: stamp class name, see the conventions below.
decade
: the decade of the document where the stamp was originally placed.
x_on_page
: x coordinate of the stamp center relative to the enclosing page (in range [0, 1]).
y_on_page
: y coordinate of the stamp center relative to the enclosing page (in range [0, 1]).
width_on_page
: width of the stamp relative to the enclosing page (in range [0, 1]).
height_on_page
: height of the stamp relative to the enclosing page (in range [0, 1]).
The last four fields are recorded when pages can be identified in a photograph and the stamp is located inside one of the pages. Some stamp classes are consistently located in one of the corners of their pages.
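Because the four page-relative fields describe the stamp by its center and size as fractions of the page, converting them to a pixel box requires the page dimensions. The following is a minimal sketch of that conversion. The function name `stamp_box_in_pixels`, the `page_width` and `page_height` parameters, and the assumption that unrecorded fields are simply absent or null are all hypothetical choices for the example.

```python
from typing import Optional, Tuple

PAGE_FIELDS = ("x_on_page", "y_on_page", "width_on_page", "height_on_page")


def stamp_box_in_pixels(
    record: dict, page_width: int, page_height: int
) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) in page pixels.

    Returns None when any page-relative field is missing, which is
    assumed here to be how records without an identified page look.
    """
    if any(record.get(field) is None for field in PAGE_FIELDS):
        return None

    # Scale the relative center and size to pixels.
    center_x = record["x_on_page"] * page_width
    center_y = record["y_on_page"] * page_height
    width = record["width_on_page"] * page_width
    height = record["height_on_page"] * page_height

    # The stored coordinates are the stamp center, so shift by half the size.
    left = center_x - width / 2
    top = center_y - height / 2
    return (round(left), round(top), round(left + width), round(top + height))


# Hypothetical record and page size, for illustration only.
example = {
    "x_on_page": 0.5,
    "y_on_page": 0.25,
    "width_on_page": 0.2,
    "height_on_page": 0.1,
}
box = stamp_box_in_pixels(example, page_width=1000, page_height=2000)
print(box)
```

Returning `None` for incomplete records keeps the caller in charge of deciding whether to skip stamps that were not located inside an identified page.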
Stamp names are usually based on the phonetic reading of the Japanese characters using a modified Hepburn system. There are several exceptions to this, however. First, stamps that ask different categories of users, such as branch head, secretary, etc., to certify a document by placing their stamp in a particular place are collectively labeled "stamphere." Second,
The code used to collect the dataset is published at https://github.com/pscedu/ml4docs
This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges and Bridges-2 systems, which are supported by NSF award numbers ACI-1445606 and ACI-1928147, at the Pittsburgh Supercomputing Center (PSC). The work was made possible through the XSEDE Extended Collaborative Support Service (ECSS) program.