

Introducing miniG 9B
Category: Models
Date:
Author: CausalLM

We are excited to introduce miniG, a powerful new model designed to explore the frontiers of large-scale synthetic data training.

A New Approach to Training

miniG is a 9B-parameter language model (initialized from THUDM/glm-4-9b-chat-1m), complemented by an optional 5B ViT that makes it a capable Vision Language Model (VLM). Its foundation is a synthetic dataset of over 120 million entries, generated using state-of-the-art language models with large context windows and leveraging methodologies akin to retrieval-augmented generation (RAG) and knowledge graph integration.

The entire data synthesis process was conducted within clusters derived from a curated 20 billion token pretraining corpus, with subsequent validation performed by the model itself. Notably, miniG has not undergone thorough alignment with human preferences and is under no obligation to cater to poorly constructed prompts or benchmark clichés.

Core Features

  • Supported Modalities: Text and Image. The Vision Language Model has undergone Locked-Image Tuning. For a text-only version, please use the revision=text-only branch at https://huggingface.co/CausalLM/miniG/tree/text-only (see the loading sketch after this list).
  • Massive Context Window: miniG supports a context window of up to 1,000,000 tokens.
  • Model Parameters: The model consists of a 9B LLM and an optional 5B ViT.
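
As a concrete starting point, here is a minimal loading sketch for the text-only branch using Hugging Face transformers, in line with the usage recommendations below. The dtype and device settings are assumptions, and the vision branch may require additional processor classes not shown here:

```python
# Minimal loading sketch for the text-only revision (assumes the standard
# Hugging Face transformers API; settings below are illustrative defaults).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "CausalLM/miniG", revision="text-only", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "CausalLM/miniG",
    revision="text-only",
    torch_dtype=torch.bfloat16,  # bf16 assumed; quantization is discouraged (see below)
    device_map="auto",
    trust_remote_code=True,      # GLM-4-based repos ship custom modeling code
)
```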

Usage & Inference Recommendations

Cautionary Notes: We strongly recommend using a standardized implementation such as Hugging Face transformers for inference. Accelerated inference frameworks such as vLLM or LMDeploy, as well as model quantization, can lead to significant performance degradation and potentially catastrophic effects, especially for vision inference.

Inference Parameters: To achieve results with fewer hallucinations, we advise using sampling with top_p=0.8 and temperature=0.3, or pure temperature sampling at 0.2. A lower temperature is generally required compared to similar models, which we tentatively attribute to overfitting on the vast dataset.
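
A minimal generation sketch applying these settings, reusing `model` and `tokenizer` from the loading example above; it assumes the repository ships a GLM-4-style chat template, and the prompt content is only illustrative:

```python
# Recommended sampling: top_p=0.8 with temperature=0.3
# (alternatively, pure temperature sampling at temperature=0.2 with top_p unset).
messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # non-empty system prompt (see below)
    {"role": "user", "content": "Summarize retrieval-augmented generation in two sentences."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.8,
    temperature=0.3,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```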

Input Formatting:

  1. Ensure the system prompt is not empty. Even a simple "You are a helpful assistant." is sufficient.
  2. Always include a newline character \n after each <|role|> tag for proper parsing, as in the sketch below.
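
For reference, a hand-built prompt obeying both rules might look as follows. The [gMASK]<sop> prefix is assumed from the model's GLM-4 lineage rather than stated in this post; when in doubt, prefer the tokenizer's built-in chat template as in the sketch above:

```python
# Both rules applied: a non-empty system prompt, and "\n" after each <|role|> tag.
# The [gMASK]<sop> prefix is an assumption from the GLM-4 lineage; verify it
# against the chat template shipped with the repository.
prompt = (
    "[gMASK]<sop>"
    "<|system|>\nYou are a helpful assistant."
    "<|user|>\nWhat is Locked-Image Tuning?"
    "<|assistant|>\n"
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
```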

Training & Disclaimer

The final released version is a merge of multiple candidate models. A naïve but efficient fine-tune can be completed within one day on 16 nodes of 8×A100-80G GPUs, with an estimated carbon emission of 700 kg CO2 eq.

Disclaimer: The model was trained on unfiltered internet data and may contain objectionable content. We lack the resources to vet all data or implement RLHF for safety. Users are responsible for performing their own safety checks and filtering model outputs.

Model Highlights

Parameters: 9B LLM (initialized from GLM-4-9B-Chat-1M) + Optional 5B ViT

Context Window: 1,000,000 tokens

Modalities: Text and Image (with Locked-Image Tuning)

Training Data: 120M+ entry synthetic dataset generated from a 20B token corpus.

| Capability | Description | miniG | Gemini-Flash | GLM-4-9B-Chat | Llama 3.1 8B Instruct |
|---|---|---|---|---|---|
| MMLU | Questions in 57 subjects (incl. STEM, humanities, and others) | 85.45 | 78.9 | 72.4 | 69.4 |
| IFEval | Instruction-following evaluation using verifiable prompts | 74.22 | - | 69 | 80.4 |
| GSM8K | Challenging math problems | 75.89 (5-shot) | 86.2 (11-shot) | 79.6 | 84.5 (8-shot CoT) |
| HumanEval | Python code generation on a held-out dataset (0-shot) | 79.88 | 74.3 | 71.8 | 72.6 |
| GPQA | Challenging questions from biology, physics, and chemistry | 37.37 | 39.5 | 34.3 (base) | 34.2 |
| Context Window | Maximum context length the model can handle | 1M | 1M | 128K | 128K |
| Input | Supported input modalities | Text, image | Text, image, audio, video | Text only | Text only |

Explore the Model on Hugging Face →