Scrambled text: training Language Models to correct OCR errors using synthetic data

Dataset posted on 2024-09-27, 11:58, authored by Jonno Bourne

This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data".

In addition, it contains the 10,000 synthetic 19th-century articles generated using GPT-4o. These articles are available both as a single CSV, with the prompt parameters as columns, and as individual text files.

The files in the repository are as follows:

  • ncse_hf_dataset.zip: A Hugging Face dataset dictionary containing 91 articles from the Nineteenth Century Serials Edition (NCSE), with the original OCR and the transcribed ground truth. This dataset is used as the test set in the paper (see the loading sketch after this list).
  • synth_gt.zip: A zip file containing 5 parquet files of training data derived from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed token length, totalling 2 million tokens; the observation lengths are 200, 100, 50, 25, and 10 tokens.
  • synthetic_articles.zip: A zip file containing a CSV of all the synthetic articles and the prompts used to generate them.
  • synthetic_articles_text.zip: A zip file containing the text files of all the synthetic articles. The file names encode the prompt parameters and the ID reference from the synthetic-articles CSV.
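
As a quick orientation, the sketch below shows one way the unzipped files might be loaded in Python. This is a minimal sketch, not part of the repository's documented interface: the extraction paths are assumptions, the parquet and CSV file names are hypothetical, and it assumes the Hugging Face dataset was written with save_to_disk.

    import pandas as pd
    from datasets import load_from_disk

    # Test set: 91 NCSE articles with the original OCR and transcribed
    # ground truth. Assumes ncse_hf_dataset.zip was extracted to ./ncse_hf_dataset.
    ncse = load_from_disk("ncse_hf_dataset")
    print(ncse)

    # Training data: one parquet file per observation length
    # (200, 100, 50, 25, or 10 tokens). The file name below is hypothetical.
    train_200 = pd.read_parquet("synth_gt/synth_gt_200.parquet")
    print(train_200.head())

    # Synthetic articles with the prompt parameters as columns.
    # The file name below is hypothetical.
    articles = pd.read_csv("synthetic_articles/synthetic_articles.csv")
    print(articles.head())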

The data in this repository is used by the code repositories associated with the project:

  • https://github.com/JonnoB/scrambledtext_analysis
  • https://github.com/JonnoB/training_lms_with_synthetic_data

