Foldclass databases for protein structural domains in CATH and TED
This repository contains databases of protein domains for use with Foldclass and Merizo-search. We provide databases for all 365 million domains in TED, as well as all classified domains in CATH 4.3.
Foldclass and Merizo-search use two formats for databases. The default format uses a PyTorch tensor and a pickled list of Python tuples to store the data. This format is used for the CATH database, which is small enough to fit in memory. For larger-than-memory datasets, such as TED, we use a binary format that is searched using the Faiss library.
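The in-memory idea behind the default format can be sketched as follows. This is only an illustration, not the Foldclass implementation: plain Python lists stand in for the PyTorch tensor so the sketch has no dependencies, the tuple layout (name, length) is hypothetical, and both the row-to-tuple alignment and the use of cosine similarity are assumptions.

```python
import math
import pickle

# Hypothetical stand-in for the embedding tensor: one row per domain.
# The real default format stores these in a PyTorch tensor.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]

# Hypothetical tuple layout (domain name, length); the real fields may differ.
# Assumption: entry i of this list describes row i of the embeddings.
metadata = [("domA", 120), ("domB", 95), ("domC", 143)]

# Round-trip through pickle, mirroring how the metadata list is stored.
metadata = pickle.loads(pickle.dumps(metadata))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_hit(query):
    """Brute-force search: score every row and return the best match."""
    scores = [cosine(query, row) for row in embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return metadata[best], scores[best]


hit, score = top_hit([0.5, 0.9, 0.0])
```

A brute-force scan like this is only practical when every embedding fits in memory, which is why a larger-than-memory dataset such as TED needs a dedicated on-disk search library instead.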
The CATH database requires approximately 1.4 GB of disk space, whereas the TED database requires about 885 GB. Please ensure you have enough free storage space before downloading. For best search performance with the TED database, the database should be stored on the fastest storage hardware available to you.
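A quick way to check free space before downloading is sketched below, using only the Python standard library. The helper name `has_room` and the `margin` parameter are hypothetical, the sizes are the approximate figures quoted above, and 1 GB is assumed to mean 10^9 bytes.

```python
import shutil

# Approximate sizes quoted in the text above, in gigabytes.
CATH_GB = 1.4
TED_GB = 885


def has_room(path, required_gb, margin=1.0):
    """Return True if the filesystem holding `path` has enough free space.

    `margin` scales the requirement, e.g. 2.0 to leave room for an archive
    and its extracted contents at the same time.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * margin * 1e9


if __name__ == "__main__":
    for name, size_gb in (("CATH", CATH_GB), ("TED", TED_GB)):
        status = "ok" if has_room(".", size_gb) else "not enough free space"
        print(f"{name}: needs about {size_gb} GB, {status}")
```

Passing `margin=2.0` approximates the whole-folder zip download case described in the note below, where roughly twice the storage is needed.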
IMPORTANT:
We recommend going into each folder and downloading the files individually. If you download a whole folder in one go, you will receive a zip file that must then be decompressed. This is a particular problem for the TED database, because you will need roughly twice as much storage space as when downloading the individual files. Our GitHub repository (see Related Materials below) contains a convenience script to download each database; we recommend using that.
Funding
Exploiting Differentiable Programming Models For Protein Structure Prediction And Modelling
Biotechnology and Biological Sciences Research Council
Accelerating and enhancing the PSIPRED Workbench with deep learning
Biotechnology and Biological Sciences Research Council