Posted on 2024-09-27, 11:58. Authored by Jonno Bourne.
<p dir="ltr">This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data".</p><p dir="ltr">In addition, it contains the 10,000 synthetic 19th-century articles generated using GPT-4o. These articles are available both as a CSV with the prompt parameters as columns and as individual text files.</p><p dir="ltr">The files in the repository are as follows:</p><ul><li><b>ncse_hf_dataset</b>: a Hugging Face dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE), with the original OCR and the transcribed ground truth. This dataset is used as the test set in the paper.</li><li><b>synth_gt.zip</b>: a zip file containing 5 parquet files of training data built from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed token length, for a total of 2 million tokens. The observation lengths are 200, 100, 50, 25, and 10 tokens.</li><li><b>synthetic_articles.zip</b>: a zip file containing the CSV of all the synthetic articles and the prompts used to generate them.</li><li><b>synthetic_articles_text.zip</b>: a zip file containing the text files of all the synthetic articles. The file names encode the prompt parameters and the id reference from the synthetic-articles CSV.</li></ul><p dir="ltr">The data in this repository is used by the code repositories associated with the project:</p><ul><li>https://github.com/JonnoB/scrambledtext_analysis</li><li>https://github.com/JonnoB/training_lms_with_synthetic_data</li></ul>
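The fixed-length observations in synth_gt.zip can be illustrated with a minimal sketch: tokenised article text is cut into consecutive, non-overlapping windows of a chosen length (200, 100, 50, 25, or 10 tokens). This is an illustration of the splitting scheme described above, not the authors' actual preprocessing code; the function name and the tokenisation are assumptions.

```python
def chunk_tokens(tokens, length):
    """Split a token sequence into consecutive fixed-length observations,
    dropping any trailing remainder shorter than `length`.
    Illustrative only; the paper's real pipeline may differ."""
    return [tokens[i:i + length] for i in range(0, len(tokens) - length + 1, length)]

# Stand-in for a tokenised synthetic article of 1,050 tokens
tokens = list(range(1050))

# 200-token observations: 5 full windows, the last 50 tokens are dropped
observations = chunk_tokens(tokens, 200)
print(len(observations), len(observations[0]))  # 5 200
```

Under this scheme, each parquet file would simply hold the windows produced at one of the five lengths.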