

Introducing miniG 9B
Category: Models
Date:
Author: CausalLM

We are excited to introduce miniG, a powerful new model designed to explore the frontiers of large-scale synthetic data training.

A New Approach to Training

miniG is a 9B-parameter language model (initialized from THUDM/glm-4-9b-chat-1m), complemented by an optional 5B ViT that makes it a capable Vision Language Model (VLM). Its foundation is a synthetic dataset of over 120 million entries, generated using state-of-the-art language models with large context windows and leveraging methodologies akin to retrieval-augmented generation (RAG) and knowledge graph integration.

The entire data synthesis process was conducted within clusters derived from a curated 20 billion token pretraining corpus, with subsequent validation performed by the model itself. Notably, miniG has not undergone thorough alignment with human preferences and is under no obligation to cater to poorly constructed prompts or benchmark clichés.

Core Features

  • Supported Modalities: Text and Image. The Vision Language Model has undergone Locked-Image Tuning. For a text-only version, please use the revision=text-only branch at https://huggingface.co/CausalLM/miniG/tree/text-only (see the loading sketch after this list).
  • Massive Context Window: miniG supports a context window of up to 1,000,000 tokens.
  • Model Parameters: The model consists of a 9B LLM and an optional 5B ViT.
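
As a concrete starting point, here is a minimal loading sketch for the text-only branch using Hugging Face transformers, in line with the usage recommendations below. The dtype and device settings are assumptions, and the vision branch may require additional processor classes not shown here:

```python
# Minimal loading sketch for the text-only revision (assumes the standard
# Hugging Face transformers API; settings below are illustrative defaults).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "CausalLM/miniG", revision="text-only", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "CausalLM/miniG",
    revision="text-only",
    torch_dtype=torch.bfloat16,  # bf16 assumed; quantization is discouraged (see below)
    device_map="auto",
    trust_remote_code=True,      # GLM-4-based repos ship custom modeling code
)
```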

Usage & Inference Recommendations

Cautionary Notes: We strongly recommend using a standardized implementation such as Hugging Face transformers for inference. Accelerated inference frameworks such as vLLM or LMDeploy, as well as model quantization, can lead to significant performance degradation and potentially catastrophic effects, especially for vision inference.

Inference Parameters: To achieve results with fewer hallucinations, we advise using sampling with top_p=0.8 and temperature=0.3, or pure temperature sampling at 0.2. A lower temperature is generally required compared to similar models, which we tentatively attribute to overfitting on the vast dataset.
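
A minimal generation sketch applying these settings, reusing `model` and `tokenizer` from the loading example above; it assumes the repository ships a GLM-4-style chat template, and the prompt content is only illustrative:

```python
# Recommended sampling: top_p=0.8 with temperature=0.3
# (alternatively, pure temperature sampling at temperature=0.2 with top_p unset).
messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # non-empty system prompt (see below)
    {"role": "user", "content": "Summarize retrieval-augmented generation in two sentences."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.8,
    temperature=0.3,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```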

Input Formatting:

  1. Ensure the system prompt is not empty. Even a simple "You are a helpful assistant." is sufficient.
  2. Always include a newline character \n after each <|role|> tag for proper parsing, as in the sketch below.
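
For reference, a hand-built prompt obeying both rules might look as follows. The [gMASK]<sop> prefix is assumed from the model's GLM-4 lineage rather than stated in this post; when in doubt, prefer the tokenizer's built-in chat template as in the sketch above:

```python
# Both rules applied: a non-empty system prompt, and "\n" after each <|role|> tag.
# The [gMASK]<sop> prefix is an assumption from the GLM-4 lineage; verify it
# against the chat template shipped with the repository.
prompt = (
    "[gMASK]<sop>"
    "<|system|>\nYou are a helpful assistant."
    "<|user|>\nWhat is Locked-Image Tuning?"
    "<|assistant|>\n"
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
```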

Training & Disclaimer

The final released version is a merge of multiple candidate models. A naïve but efficient fine-tune can be completed within one day on 16 nodes of 8×A100-80G GPUs, with an estimated carbon emission of 700 kg CO2 eq.

Disclaimer: The model was trained on unfiltered internet data and may contain objectionable content. We lack the resources to vet all data or implement RLHF for safety. Users are responsible for performing their own safety checks and filtering model outputs.

Model Highlights

Parameters: 9B LLM (initialized from GLM-4-9B-Chat-1M) + Optional 5B ViT

Context Window: 1,000,000 tokens

Modalities: Text and Image (with Locked-Image Tuning)

Training Data: 120M+ entry synthetic dataset generated from a 20B token corpus.

| Capability | Description | miniG | Gemini-Flash | GLM-4-9B-Chat | Llama 3.1 8B Instruct |
|---|---|---|---|---|---|
| MMLU | Questions in 57 subjects (incl. STEM, humanities, and others) | 85.45 | 78.9 | 72.4 | 69.4 |
| IFEval | Instruction-following evaluation using verifiable prompts | 74.22 | - | 69 | 80.4 |
| GSM8K | Challenging math problems | 75.89 (5-shot) | 86.2 (11-shot) | 79.6 | 84.5 (8-shot CoT) |
| HumanEval | Python code generation on a held-out dataset (0-shot) | 79.88 | 74.3 | 71.8 | 72.6 |
| GPQA | Challenging questions from biology, physics, and chemistry | 37.37 | 39.5 | 34.3 (base) | 34.2 |
| Context Window | Maximum context length the model can handle | 1M | 1M | 128K | 128K |
| Input | Supported input modalities | Text, image | Text, image, audio, video | Text only | Text only |

Explore the Model on Hugging Face →