
We are excited to introduce miniG, a powerful new model designed to explore the frontiers of large-scale synthetic data training.
miniG is a 9B parameter language model (initialized from THUDM/glm-4-9b-chat-1m) complemented by an optional 5B ViT, making it a capable Vision Language Model (VLM). Its foundation is a unique synthetic dataset of over 120 million entries. This dataset was generated using state-of-the-art language models with large context windows, leveraging methodologies akin to retrieval-augmented generation (RAG) and knowledge graph integration.
The entire data synthesis process was conducted within clusters derived from a curated 20 billion token pretraining corpus, with subsequent validation performed by the model itself. Notably, miniG has not undergone thorough alignment with human preferences and is under no obligation to cater to poorly constructed prompts or benchmark clichés.
A text-only variant is available on the text-only branch at https://huggingface.co/CausalLM/miniG/tree/text-only (pass revision="text-only" when loading).

Cautionary Notes: We strongly recommend using a standardized implementation such as Hugging Face transformers for inference. Accelerated inference engines such as vLLM or LMDeploy, as well as model quantization, can lead to significant performance degradation and potentially catastrophic effects, especially for vision inference.
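As a rough sketch of loading the model with plain transformers, following the recommendation above: the repo id and the text-only revision come from the link above, while trust_remote_code=True, the bfloat16 dtype, and device_map="auto" are assumptions that may need adjusting for your setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id taken from the branch URL above.
model_id = "CausalLM/miniG"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    revision="text-only",    # omit to load the default (multimodal) revision
    trust_remote_code=True,  # assumption: the repo ships custom GLM-4-style code
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision="text-only",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```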
Inference Parameters: To reduce hallucinations, we advise sampling with `top_p=0.8` and `temperature=0.3`, or pure temperature sampling with `temperature=0.2`. A lower temperature is generally required than for similar models, which we tentatively attribute to overfitting on the vast dataset.
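A minimal generation sketch using those settings, continuing from the loading snippet above; the prompt and `max_new_tokens` value are placeholders.

```python
prompt = "Summarize the idea behind retrieval-augmented generation in two sentences."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.8,        # recommended nucleus sampling threshold
    temperature=0.3,  # recommended temperature; alternatively drop top_p and use temperature=0.2
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```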
Input Formatting: A simple system prompt such as "You are a helpful assistant." is sufficient. Ensure there is a newline (`\n`) after each `<|role|>` tag so the prompt is parsed correctly.

The final released version is a merge of multiple candidate models. An efficient naïve fine-tuning can be completed within one day on 16 nodes of 8×A100-80G, with an estimated carbon emission of 700 kg CO2 eq.
Disclaimer: The model was trained on unfiltered internet data and may contain objectionable content. We lack the resources to vet all data or implement RLHF for safety. Users are responsible for performing their own safety checks and filtering model outputs.
- Parameters: 9B LLM (initialized from GLM-4-9B-Chat-1M) + optional 5B ViT
- Context Window: 1,000,000 tokens
- Modalities: Text and image (with Locked-Image Tuning); see the vision-inference sketch after this list
- Training Data: 120M+ entry synthetic dataset generated from a 20B-token pretraining corpus
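Since the vision stack sits on the GLM-4 lineage, the sketch below assumes a GLM-4V-style multimodal interface in which an image is passed alongside the user message through the tokenizer's custom chat template. The `image` key, `return_dict=True` usage, and file path are assumptions; the actual miniG interface may differ.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CausalLM/miniG"  # default (multimodal) revision
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
messages = [{"role": "user", "image": image, "content": "Describe this image."}]

# Assumption: the custom GLM-4V-style template accepts an "image" key and
# returns a dict of tensors when return_dict=True.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs, max_new_tokens=256, do_sample=True, top_p=0.8, temperature=0.3
    )
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```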
| Benchmark / Capability | Description | miniG | Gemini-Flash | GLM-4-9B-Chat | Llama 3.1 8B Instruct |
|---|---|---|---|---|---|
| MMLU | Representative questions across 57 subjects (incl. STEM, humanities, and others) | 85.45 | 78.9 | 72.4 | 69.4 |
| IFEval | Evaluation of instruction-following using verifiable prompts | 74.22 | - | 69 | 80.4 |
| GSM8K | Challenging grade-school math word problems (shot count noted per model) | 75.89 (5-shot) | 86.2 (11-shot) | 79.6 | 84.5 (8-shot CoT) |
| HumanEval | Python code generation on a held-out dataset (0-shot) | 79.88 | 74.3 | 71.8 | 72.6 |
| GPQA | Challenging dataset of questions from biology, physics, and chemistry | 37.37 | 39.5 | 34.3 (base) | 34.2 |
| Context Window | Maximum context length the model can handle | 1M | 1M | 128K | 128K |
| Input | Supported input modalities | Text, image (single model) | Text, image, audio, video | Text only | Text only |