esCorpius

The advanced cognitive abilities of large language models rely on the extensive unstructured textual data they receive during training. Data is crucial, and a lot of it is needed. Consequently, the most capable recent models are trained on trillions of tokens (short scale). Acquiring such data would not be possible without “crawling” the internet, that is, systematically searching and collecting data from websites and other online sources. The most important web archive of crawled data is Common Crawl (CC), which contains petabytes of data collected since 2008. The majority of the open datasets used for training large models were built with some pipeline for processing CC.

Pipelines for creating crawled datasets usually include several steps, such as cleaning, language detection, content identification, and deduplication. Deduplication is arguably the most important step in the pipeline, because a high degree of duplicated data can significantly impair the performance of models trained on the dataset. This occurs because of overlap between training, validation, and test sets, leading to artificially higher accuracy and fewer training steps. Recent creators of crawled datasets often combine several deduplication techniques, including exact matching of textual fragments and soft (near-duplicate) deduplication methods such as SimHash, Locality-Sensitive Hashing, MinHash, and others.
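To make the two families of techniques concrete, here is a minimal sketch that combines exact matching (a content hash over whitespace-normalized text) with a simple MinHash comparison. It is an illustration only, not the pipeline of any dataset mentioned here: the function names, the 3-word shingles, the 64 hash functions, and the 0.8 similarity threshold are all assumptions chosen for readability.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_perm=64, k=3):
    """For each of num_perm salted hash functions, keep the minimum hash over all shingles."""
    shingle_set = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Drop exact duplicates first, then near-duplicates above the similarity threshold."""
    seen_hashes = set()
    kept = []  # list of (document, signature) pairs
    for doc in docs:
        normalized = " ".join(doc.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after whitespace normalization
        seen_hashes.add(digest)
        signature = minhash_signature(doc)
        if any(estimated_jaccard(signature, s) >= threshold for _, s in kept):
            continue  # near-duplicate of a document already kept
        kept.append((doc, signature))
    return [doc for doc, _ in kept]
```

This sketch compares every new document against all kept ones, which is quadratic in the number of documents. Pipelines operating at web scale typically avoid that cost by bucketing signatures with Locality-Sensitive Hashing so that only likely matches are compared.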

The main disadvantage of most crawled datasets, and of the models trained on them, is that they are predominantly English-centric. There is a notable inequality in the availability and quality of models and data for other languages. Most NLP research is concentrated on English, with Mandarin Chinese being the second most studied language. In contrast, languages like Spanish, despite having a large number of speakers worldwide, receive significantly less attention. This situation has negative consequences, such as unequal access to clinical NLP technology for speakers of different languages.

ChatGPT’s performance on standard NLP tasks also deteriorates notably for mid-resource and especially low-resource languages, to a degree that depends on the specific task. To address this issue and enhance the capabilities of generative models in multilingual settings, several multilingual datasets have recently been released.

The first of these datasets is ROOTS, which served as the training data for BLOOM. ROOTS is the product of collaborative efforts by European scientists and comprises 1.6 TB of text across 46 natural languages, plus a subset dedicated to programming languages. This dataset stands out for its use of manual heuristics during corpus creation, including the selection of URLs and the setting of quality filtering thresholds. The quality filtering metrics devised by the creators of ROOTS have also been applied in the development of CulturaX.

CulturaX is a recently developed massive dataset of 6.3 trillion tokens across 167 languages, designed for training multilingual generative models. It was created by processing mC4 and OSCAR, two earlier multilingual datasets of lower quality. The development pipeline involves several steps: URL filtering based on a curated list of appropriate URLs, language identification using a combination of cld3 and FastText, filtering based on automatically selected thresholds for metrics derived from ROOTS, and deduplication of data using MinHash and URLs, among others.
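The idea of selecting filtering thresholds automatically can be sketched as follows. This is a hedged illustration and not the actual CulturaX procedure: the stopword-ratio metric, the percentile-based rule, and all function names are assumptions made for the example.

```python
import statistics


def stopword_ratio(text, stopwords):
    """Fraction of whitespace-separated words that are in the stopword set."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in stopwords for w in words) / len(words)


def select_threshold(values, percentile=10):
    """Use the given percentile of the observed metric values as the cutoff.

    Assumption: the threshold is read from the data distribution
    instead of being set by hand.
    """
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[percentile - 1]


def filter_by_metric(docs, metric, percentile=10):
    """Keep documents whose metric value is at or above the selected threshold."""
    scores = [metric(doc) for doc in docs]
    threshold = select_threshold(scores, percentile)
    return [doc for doc, score in zip(docs, scores) if score >= threshold]
```

A call such as `filter_by_metric(docs, lambda d: stopword_ratio(d, spanish_stopwords))`, where `spanish_stopwords` is a hypothetical set supplied by the user, would drop the documents in the lowest tenth of the stopword-ratio distribution.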

LHF Labs developed another multilingual dataset called esCorpius-m, which is suitable for training language models. It contains 0.3 trillion tokens across 34 languages.

This dataset is notably cleaner than most state-of-the-art corpora and has been thoroughly deduplicated.
It retains both document and paragraph boundaries, so language models can process text in coherent units, much as human readers do. This allows Natural Language Generation models to learn paragraph-level representations.
Moreover, the downloaded data keeps a clear trace of the source of each document. This traceability makes it possible to honor the right of individual website owners to withdraw their data, or the data of individuals cited on their websites, as protected by the GDPR.
Additionally, it provides the means to systematically exclude websites that have been blacklisted.
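As a rough sketch of how per-document source tracing supports withdrawal requests and blacklist exclusion, the example below filters records by the domain of their source URL. The record layout (a `url` field and a `paragraphs` list) is hypothetical and may not match the actual esCorpius-m schema; the function names are likewise assumptions.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def exclude_sources(records, blocked_domains):
    """Keep only records whose source is not a blocked domain or one of its subdomains."""
    blocked = {d.lower() for d in blocked_domains}
    kept = []
    for record in records:
        host = domain_of(record["url"])
        if any(host == d or host.endswith("." + d) for d in blocked):
            continue  # source was withdrawn or blacklisted
        kept.append(record)
    return kept


# Hypothetical records: each keeps its source URL and its paragraph boundaries.
records = [
    {"url": "https://www.example.com/a", "paragraphs": ["First paragraph.", "Second paragraph."]},
    {"url": "https://blog.example.org/post", "paragraphs": ["Only paragraph."]},
    {"url": "http://news.example.net/x", "paragraphs": ["Another paragraph."]},
]
remaining = exclude_sources(records, ["example.org"])
```

Because every record carries its own URL, a withdrawal request only requires adding the domain to the blocked list and re-running the filter. No re-crawling is needed.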

In summary, this is a high-quality multilingual dataset that excels in content cleaning and deduplication. In certain languages, like Spanish, it stands as the largest web corpus of such quality available for the development of large language models.