University College London

NCSE v2.0: A Dataset of OCR-Processed 19th Century English Newspapers

dataset
posted on 2025-02-11, 11:19, authored by Jonno Bourne
NCSE v2.0 Dataset Repository

This repository contains the NCSE v2.0 dataset and associated supporting data used in the paper "Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models".

Dataset Overview

The NCSE v2.0 is a digitized collection of six 19th-century English periodicals containing:

  • 82,690 pages
  • 1.4 million entries
  • 321 million words
  • 1.9 billion characters

The dataset includes:

  • 1.1 million text entries
  • 198,000 titles
  • 17,000 figure descriptions
  • 16,000 tables

Repository Contents

  1. NCSE v2.0 Dataset
    • NCSE_v2.zip: a folder containing a parquet file for each of the periodicals as well as a readme file.
  2. Bounding Box Dataset
    A zip file called bounding_box.zip, containing:
    • post_process: A folder of the processed periodical bounding box data
    • post_process_fill: A folder of the processed periodical bounding box data WITH column filling.
    • bbox_readme.txt: a readme file and data description for the bounding boxes
  3. Test Sets
    • cropped_images.zip: 378 images cropped from the NCSE test set pages, all 2-bit png files
    • ground_truth: 358 text files corresponding to the text from the cropped_images folder
  4. Classification Training Data
    The files below are used for training the classification models. They contain 12,000 observations, 2,000 from each periodical. The labels were assigned using mistral-large-2411. This data is used to train the ModernBERT classifier described in the paper. The topics are taken from the International Press Telecommunications Council (IPTC) subject codes.
    • silver_IPTC_class.parquet: IPTC topic classification silver set
    • silver_text_type.parquet: Text-type classification silver set
  5. Classified Data
    The zip file classification_data.zip contains all rows classified using the ModernBERT classifier described in the paper.
    • IPTC_type_classified.zip: contains one parquet file per periodical
    • text_type_classified.zip: contains one parquet file per periodical
    • classification_readme.md: Description of the data
  6. Classification Mappings
    Data for mapping the classification codes to human readable names.
    • class_mappings.zip: contains a json for each classification type
      • IPTC_class_mapping.json
      • text_type_class_mapping.json
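
As a rough illustration of how the classification mappings might be used, the sketch below reads one mapping file out of a zip archive and translates codes into readable names. It rests on assumptions not stated in this description: that each mapping JSON is a flat object of the form {"code": "name"}, and that the file may sit inside a sub-folder of the archive. The helper names load_class_mapping, label_codes and build_demo_archive are hypothetical, and the demo labels are invented so the example runs without downloading the data.

```python
import io
import json
import zipfile


def load_class_mapping(zip_source, json_name):
    """Read one mapping file from a zip archive and return it as a dict.

    Assumption: the JSON is a flat object of the form {"code": "name"}.
    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # The file may sit inside a sub-folder, so match on the end of the path.
        matches = [n for n in archive.namelist() if n.endswith(json_name)]
        if not matches:
            raise FileNotFoundError(f"{json_name} not found in archive")
        with archive.open(matches[0]) as handle:
            raw = json.load(handle)
    # JSON keys are strings; normalise values to strings as well.
    return {str(code): str(name) for code, name in raw.items()}


def label_codes(codes, mapping, unknown="unknown"):
    """Translate classification codes into readable names.

    Codes are converted to strings before lookup; codes missing from the
    mapping are replaced with the `unknown` placeholder.
    """
    return [mapping.get(str(code), unknown) for code in codes]


def build_demo_archive():
    """Create a tiny in-memory stand-in for class_mappings.zip.

    The labels here are made up for illustration only.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "class_mappings/text_type_class_mapping.json",
            json.dumps({"0": "demo label A", "1": "demo label B"}),
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    mapping = load_class_mapping(
        build_demo_archive(), "text_type_class_mapping.json"
    )
    print(label_codes([0, 1, 7], mapping))
```

With the real data, the path to class_mappings.zip would replace the demo archive. The same dictionary could then be applied to a column of a classified parquet file, but the actual column names and JSON layout should be checked against classification_readme.md first.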

Original Images

The original page images can be found at the King's College London Repositories:

Or via the project central archive

Citation

If you use this dataset, please cite:

No citation data currently available

Related Code

All original code related to this project, including the creation of the datasets and their analysis, can be found at:
https://github.com/JonnoB/ereading_the_unreadable

Contact

For questions about the dataset, please create an issue in this repository.

Usage Rights

In keeping with the original NCSE dataset, all data is made available under a Creative Commons Attribution 4.0 International License (CC BY).

    Centre for Advanced Spatial Analysis