
A year ago, we released CausalLM/Refined-Anime-Text, a thematic dataset of 1 million entries designed for continued pre-training. We are thrilled that it has been widely adopted in a variety of training scenarios as well as in studies of data curation and internet culture.
In keeping with our commitment to open data and models, we are excited to release our new Retrieval-Based Multi-Turn Chat SFT Synthetic Dataset. This is a thematic subset comprising 100,000 newly synthesized text entries. While it shares themes with our previous release, we have intentionally included niche topics that are not widely represented in existing open datasets.
The dataset was created through a retrieval-based synthesis pipeline.
To give a sense of the scale and investment, the cost to create this 100k subset is no less than $25,000, with approximately 62 billion input and output tokens used during synthesis. The data primarily comprises texts in English, Chinese, Japanese, and German, with a small number of entries in other languages.
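For context, the stated figures imply an effective synthesis cost of roughly $0.40 per million tokens. A quick back-of-the-envelope check (the figures below are the approximate totals stated above, not exact accounting):

```python
# Rough effective cost per million tokens, from the stated totals.
total_cost_usd = 25_000       # stated lower bound on synthesis cost
total_tokens = 62e9           # approximate combined input + output tokens

cost_per_million = total_cost_usd / (total_tokens / 1e6)
print(f"~${cost_per_million:.2f} per million tokens")  # → ~$0.40 per million tokens
```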
We kindly request that users refrain from filtering the data by language or theme to preserve the original thematic coverage. However, filtering based on content duplication is encouraged, as this subset has not yet been de-duplicated.
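For the encouraged duplicate filtering, a minimal sketch is exact de-duplication by hashing normalized text. The `text` field name below is an assumption for illustration; adapt it to the dataset's actual schema, and consider near-duplicate methods (e.g., MinHash) for a more thorough pass:

```python
import hashlib

def dedup_exact(entries, key="text"):
    """Drop entries whose normalized text is an exact duplicate.

    `key` is a hypothetical field name -- adjust to the dataset's schema.
    Normalization here is just whitespace-stripping and lowercasing.
    """
    seen = set()
    unique = []
    for entry in entries:
        digest = hashlib.md5(
            entry[key].strip().lower().encode("utf-8")
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(entry)
    return unique

sample = [
    {"text": "Hello world"},
    {"text": "hello world "},   # duplicate after normalization
    {"text": "A different entry"},
]
print(len(dedup_exact(sample)))  # → 2
```

This keeps the first occurrence of each duplicate group, so the original entry ordering is preserved.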
We look forward to seeing how the community utilizes this new resource. Subsets for other topics will be released in the future, so please stay tuned.