Retrieval-SFT-Chat: A New Synthetic Dialogue Dataset

Сhoose

Category: Datasets

Date: February 28, 2025

Author: CausalLM

A year ago, we released CausalLM/Refined-Anime-Text, a thematic dataset of 1 million entries designed for continued pre-training. We are thrilled that it has been widely adopted in various training scenarios and studies on data and internet culture.

In keeping with our commitment to open data and models, we are excited to release our new Retrieval-Based Multi-Turn Chat SFT Synthetic Dataset. This is a thematic subset comprising 100,000 newly synthesized text entries. While it shares themes with our previous release, we have intentionally included niche topics that are not widely represented in existing open datasets.

The dataset was created through a sophisticated pipeline:

Text data was obtained via web crawling, now including Wikipedia content.
Complete webpage texts were processed using large language models with long context windows.
Following additional steps like agent-based self-verification, the data was synthesized into a multi-turn dialogue format, ideal for Supervised Fine-Tuning (SFT).

To give a sense of the scale and investment, the cost to create this 100k subset is no less than $25,000, with approximately 62 billion input and output tokens used during synthesis. The data primarily comprises texts in English, Chinese, Japanese, and German, with a small number of entries in other languages.

We kindly request that users refrain from filtering the data by language or theme to preserve the original thematic coverage. However, filtering based on content duplication is encouraged, as this subset has not yet been de-duplicated.

We look forward to seeing how the community utilizes this new resource. Subsets for other topics will be released in the future, so please stay tuned.

Dataset at a Glance

Name: Retrieval-Based Multi-Turn Chat SFT Synthetic Data
Size: 100,000 entries
Format: Multi-turn Dialogue
Languages: Primarily English, Chinese, Japanese, and German
Generation Cost: >$25,000
Synthesis Tokens: ~62 billion

Download Link

Explore the Dataset on Hugging Face →

More field notes

Explore the archive

Unlocking LLM Potential with Our "Special Sauce" for Synthetic Data

Datasets

February 26, 2024

Unlocking LLM Potential with Our "Special Sauce" for Synthetic Data

We introduce our unique recipe for generating high-quality synthetic datasets to boost LLM performance, featuring our new 1M+ entry Anime dataset as a proof of concept.

Models

August 26, 2024

Introducing miniG 9B

Meet miniG, a 9B parameter Vision Language Model with a 1M token context window. Trained on a massive 120M+ entry synthetic dataset, miniG pushes the boundaries of performance without extensive human preference alignment.

Research Areas

Connect

About CausalLM

Retrieval-Based Multi-Turn Chat SFT Synthetic Data

Dataset at a Glance

Download Link

More field notes

Unlocking LLM Potential with Our "Special Sauce" for Synthetic Data

Introducing miniG 9B