More
Сhoose
About CausalLM

A non-profit research initiative advancing the frontiers of artificial intelligence. We focus on omni-modal AI systems, efficient architectures, and synthetic data at scale.

Retrieval-Based Multi-Turn Chat SFT Synthetic Data

Retrieval-SFT-Chat: A New Synthetic Dialogue Dataset
Category:  Datasets
Date:  
Author:  CausalLM

A year ago, we released CausalLM/Refined-Anime-Text, a thematic dataset of 1 million entries designed for continued pre-training. We are thrilled that it has been widely adopted in various training scenarios and studies on data and internet culture.

In keeping with our commitment to open data and models, we are excited to release our new Retrieval-Based Multi-Turn Chat SFT Synthetic Dataset. This is a thematic subset comprising 100,000 newly synthesized text entries. While it shares themes with our previous release, we have intentionally included niche topics that are not widely represented in existing open datasets.

The dataset was created through a sophisticated pipeline:

  1. Text data was obtained via web crawling, now including Wikipedia content.
  2. Complete webpage texts were processed using large language models with long context windows.
  3. Following additional steps like agent-based self-verification, the data was synthesized into a multi-turn dialogue format, ideal for Supervised Fine-Tuning (SFT).

To give a sense of the scale and investment, the cost to create this 100k subset is no less than $25,000, with approximately 62 billion input and output tokens used during synthesis. The data primarily comprises texts in English, Chinese, Japanese, and German, with a small number of entries in other languages.

We kindly request that users refrain from filtering the data by language or theme to preserve the original thematic coverage. However, filtering based on content duplication is encouraged, as this subset has not yet been de-duplicated.

We look forward to seeing how the community utilizes this new resource. Subsets for other topics will be released in the future, so please stay tuned.

Dataset at a Glance
  • Name: Retrieval-Based Multi-Turn Chat SFT Synthetic Data
  • Size: 100,000 entries
  • Format: Multi-turn Dialogue
  • Languages: Primarily English, Chinese, Japanese, and German
  • Generation Cost: >$25,000
  • Synthesis Tokens: ~62 billion
Download Link

Explore the Dataset on Hugging Face →