MiikeMineStamps

This is the official webpage of the MiikeMineStamps dataset.

The MiikeMineStamps dataset is a naturally long-tailed open-ended dataset in the domain of Japanese historical documents. It contains 5056 images of Japanese stamps that belong to 415 classes including two special ones. The stamps were extracted from a large compendium of historical documents from the Japanese company Mitsui Mi’ike Mine, one of the largest business archives in modern Japan that spans half a century, includes tens of thousands of documents, and has been widely used by labor historians, business historians, and others.

Download Dataset

Info

The current version is v1.0.

The collection currently has 5056 images of 415 stamps collected from hundreds of documents. New versions will be published as our expert labels more images.

Figure 1 shows several examples of stamps.

Figure 1. Examples of stamps in the dataset.

The histogram of the number of samples per stamp name is shown below. The dataset was collected via active learning; colors indicate the active learning cycle.

Figure 2. Histogram of stamps by name. Note the log scale. Classes with 1 instance are not shown.
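
As a rough illustration of how a per-class histogram like the one in Figure 2 could be computed, the sketch below counts samples per stamp name and drops classes with a single instance. The function name `class_histogram`, the `min_count` parameter, and the example labels are hypothetical and are not taken from the dataset or its tooling.

```python
from collections import Counter


def class_histogram(names, min_count=2):
    """Count samples per stamp name, keeping classes with at least min_count samples."""
    counts = Counter(names)
    kept = {name: n for name, n in counts.items() if n >= min_count}
    # Sort from most to least frequent; ties are broken alphabetically.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical labels for illustration only (not real dataset statistics).
example_names = ["class_a"] * 5 + ["class_b"] * 3 + ["class_c"]
histogram = class_histogram(example_names)
for name, count in histogram:
    print(name, count)
```

Plotting the resulting counts with a logarithmic vertical axis would make the long tail visible, since a few classes hold most of the samples.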

License

The dataset is available under a Creative Commons 4.0 license. If you find this work useful, please cite:
@inproceedings{MiikeMineStamps,
  author = {Toropov, Evgeny and Buitrago, Paola and Prabha, Rajanie and Uran, Julian and Adal, Raja},
  year = {2021},
  title = {MiikeMineStamps: A Long-Tailed Dataset of Japanese Stamps via Active Learning},
  booktitle = {ICDAR '21: 16th International Conference on Document Analysis and Recognition},
}

Dataset structure

The dataset consists of 5056 cropped images of stamps (Figure 1) and the related meta-information. Meta-information is available in two formats: JSON and an SQLite database with the Shuffler schema. JSON is human-readable and easier to load with conventional tools, while the Shuffler database format makes it easier to compute various statistics.
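
To illustrate why a database format is convenient for statistics, here is a minimal sketch that counts samples per class with a single SQL query using Python's standard `sqlite3` module. The table name `objects` and the columns `imagefile` and `name` are assumptions made for this example and should be checked against the actual database; the sketch builds a toy in-memory database with made-up rows rather than opening the real file.

```python
import sqlite3


def samples_per_class(conn):
    """Return (name, count) pairs, most frequent class first."""
    # ASSUMPTION: an "objects" table with a "name" column holds the class labels.
    cursor = conn.execute(
        "SELECT name, COUNT(*) FROM objects "
        "GROUP BY name ORDER BY COUNT(*) DESC, name"
    )
    return cursor.fetchall()


# Toy in-memory database with hypothetical rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (objectid INTEGER PRIMARY KEY, imagefile TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO objects (imagefile, name) VALUES (?, ?)",
    [("img1.jpg", "class_a"), ("img2.jpg", "class_a"), ("img3.jpg", "class_b")],
)
result = samples_per_class(conn)
print(result)
```

With the real database, `sqlite3.connect` would be given the path to the downloaded file instead of `":memory:"`.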

The following information is available for each image:

The last four fields are recorded when pages can be identified in a photograph and the stamp is located inside one of the pages. Some stamp classes are consistently located in one of the corners of their pages.

Stamp name conventions

Stamp names are usually based on the phonetic reading of the Japanese characters using a modified Hepburn system. There are several exceptions to this, however. First, stamps that ask different categories of users, such as branch head, secretary, etc., to certify a document by placing their stamp in a particular place are collectively labeled "stamphere." Second,

Code

The code used to collect the dataset is published at https://github.com/pscedu/ml4docs

Funding

Acknowledgements

This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges and Bridges-2 systems, which are supported by NSF award numbers ACI-1445606 and ACI-1928147, at the Pittsburgh Supercomputing Center (PSC). The work was made possible through the XSEDE Extended Collaborative Support Service (ECSS) program.