Building Sovereign AI: iGenius and NVIDIA DGX Cloud’s Breakthrough in LLM Development

iGenius and NVIDIA DGX Cloud collaborated to develop Colosseum 355B, a state-of-the-art LLM tailored for regulated industries. Through continued pretraining, fine-tuning, and advanced infrastructure, they achieved significant improvements in model performance, multilingual capabilities, and domain-specific expertise, setting a new benchmark for sovereign AI solutions.

Revolutionizing AI: iGenius and NVIDIA DGX Cloud’s Journey to Build Sovereign LLMs for Regulated Industries

In recent years, large language models (LLMs) have made groundbreaking strides in areas like reasoning, code generation, machine translation, and summarization. However, despite their advanced capabilities, foundation models often fall short in domain-specific expertise—such as finance or healthcare—and struggle to capture cultural and linguistic nuances beyond English.

To address these limitations, continued pretraining (CPT), instruction fine-tuning, and retrieval-augmented generation (RAG) are essential. These methods require high-quality domain-specific datasets, a robust AI platform, and advanced expertise. Enter iGenius and NVIDIA DGX Cloud, whose collaboration has set a new benchmark in developing state-of-the-art LLMs for highly regulated industries.

iGenius: Pioneering AI for Regulated Sectors

iGenius, an Italian technology company founded in 2016, specializes in AI solutions for enterprises in regulated sectors like financial services and public administration. With a mission to humanize data and democratize business knowledge, iGenius operates across Europe and the United States, delivering AI-driven insights to businesses and individuals alike.

As an NVIDIA Inception partner, iGenius aimed to develop a cutting-edge foundational LLM within a tight timeline. However, they faced challenges in accessing large-scale GPU clusters and securing scalable training frameworks. To overcome these hurdles, iGenius collaborated with NVIDIA DGX Cloud, resulting in the creation of Colosseum 355B, a sovereign LLM tailored for highly regulated environments.

Colosseum 355B: A Sovereign AI Solution

Colosseum 355B is designed to power iGenius’s business intelligence agent, Crystal, a sovereign AI solution that operates as an isolated AI operating system. By building an end-to-end stack, iGenius ensures a secure experience without relying on centralized models. Key features include:

  • Database integration
  • AI-assisted configuration
  • LLM-powered orchestration for tool usage, query execution, and generation
  • Private deployment infrastructure

This approach allows Crystal to function as an orchestrator, managing tasks effectively while integrating specialized tools. By leveraging their own foundational LLMs, iGenius ensures greater control over data privacy, customization, and performance, meeting the specific needs of regulated industries.

NVIDIA DGX Cloud: Accelerating AI Development

Developing a high-performance AI training infrastructure for Colosseum 355B required a robust, distributed hardware and software solution. NVIDIA DGX Cloud provided iGenius with immediate access to AI-optimized infrastructure, including:

  • Over 3,000 NVIDIA H100 GPUs
  • A dedicated, high-bandwidth, RDMA-based network
  • 500 TB of Lustre-based high-performance storage
  • Access to the latest NVIDIA NeMo Framework containers

Within one week of signing up, iGenius had access to this environment, enabling them to complete continued pretraining for Colosseum 355B in just two months.

Dataset and Training Highlights

iGenius constructed a CPT dataset of approximately 2.5 trillion tokens, balanced to remain consistent with the composition of the base model’s original pretraining data. The multilingual capabilities of Colosseum 355B extend to more than 50 languages, with a focus on European languages such as Italian, German, and French, alongside non-European languages such as Japanese, Chinese, and Arabic.
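To make the scale concrete, here is a minimal sketch of how a CPT token budget might be split across data sources. The mixture weights and source names below are illustrative assumptions, not iGenius’s actual composition; only the ~2.5 trillion token total comes from the article.

```python
# Hypothetical sketch: allocating a CPT token budget across data sources.
# The mixture weights below are illustrative, not iGenius's actual recipe.

TOTAL_TOKENS = 2.5e12  # ~2.5 trillion tokens, as stated above

# Example sampling weights (must sum to 1.0); real mixtures are tuned to
# stay consistent with the base model's original data composition.
mixture = {
    "english_web": 0.45,
    "multilingual": 0.30,   # Italian, German, French, Japanese, Chinese, Arabic, ...
    "finance": 0.15,
    "reasoning_math_code": 0.10,
}

def tokens_per_source(total, weights):
    """Return the token count assigned to each data source."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return {name: int(total * w) for name, w in weights.items()}

budget = tokens_per_source(TOTAL_TOKENS, mixture)
for name, n in budget.items():
    print(f"{name:22s} {n / 1e9:8.0f} B tokens")
```

In practice such weights are swept at small scale before committing the full budget, since the mixture strongly affects both multilingual and domain performance.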

Specialized sources from domains like finance and reasoning were incorporated to enhance the model’s performance. The supervised fine-tuning (SFT) dataset included 1 million samples curated for tasks like problem-solving, factual recall, and coding questions.
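As a rough illustration of what such SFT samples can look like, the sketch below shows a common instruction/response record layout stored as JSON Lines. The field names and contents are assumptions for illustration, not iGenius’s actual schema.

```python
import json

# Illustrative sketch of supervised fine-tuning (SFT) records covering the
# task types mentioned above. Field names and content are assumptions.

sft_samples = [
    {
        "instruction": "What is the quick ratio and how is it computed?",
        "response": "The quick ratio measures short-term liquidity: "
                    "(current assets - inventory) / current liabilities.",
        "category": "factual_recall",
    },
    {
        "instruction": "Write a Python function that returns the nth Fibonacci number.",
        "response": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n"
                    "        a, b = b, a + b\n    return a",
        "category": "coding",
    },
]

# SFT datasets are commonly stored as JSON Lines, one record per line.
jsonl = "\n".join(json.dumps(s) for s in sft_samples)
```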

Continued Pretraining and Optimization

Improving a state-of-the-art LLM like Colosseum 355B is no small feat. iGenius leveraged the NVIDIA NeMo Framework to optimize training configurations, achieving a Model FLOP/s Utilization (MFU) of 40%—a significant improvement from the initial 25%. Key optimizations included:

  • Reducing the pipeline-parallel size from 12 to 8
  • Enabling FP8 training to accelerate training and reduce memory footprint
  • Extending the model’s context length from 4K to 16K

These optimizations allowed iGenius to complete more work in less time, highlighting the importance of exploring hyperparameters before large-scale training.
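MFU itself is straightforward to estimate: achieved model FLOP/s divided by the cluster’s aggregate peak FLOP/s, using the standard ~6 FLOPs-per-parameter-per-token approximation for a dense transformer’s forward and backward pass. The sketch below shows the arithmetic; the throughput and per-GPU peak figures are illustrative assumptions chosen to land near the reported 40%, not measured values from this project.

```python
# Sketch: estimating Model FLOP/s Utilization (MFU) for a dense LLM.
# Throughput and peak-FLOP/s numbers below are illustrative assumptions.

def mfu(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """MFU = achieved model FLOP/s divided by aggregate peak FLOP/s.

    Uses the standard approximation of ~6 FLOPs per parameter per token
    for the forward + backward pass of a dense transformer.
    """
    achieved = 6 * n_params * tokens_per_sec
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak

# Hypothetical example: a 355B-parameter model on 3,072 H100 GPUs,
# assuming ~989 TFLOP/s dense BF16 peak per GPU.
util = mfu(
    n_params=355e9,
    tokens_per_sec=570_000,     # illustrative cluster-wide token throughput
    n_gpus=3072,
    peak_flops_per_gpu=989e12,
)
print(f"MFU ≈ {util:.1%}")
```

With these assumed numbers the estimate comes out near 40%, matching the utilization reported after optimization; the same formula makes the jump from 25% concrete as a ~1.6x throughput gain on identical hardware.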

Alignment and Fine-Tuning

After pretraining, iGenius aligned Colosseum 355B with downstream tasks using supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). SFT refined the model’s parameters on labeled input-output pairs, while DPO aligned the model with human preferences by training it to favor preferred responses over rejected ones.
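The DPO objective can be sketched in a few lines: for each preference pair it penalizes the policy when its log-probability margin between the chosen and rejected response does not exceed the frozen reference model’s margin. The log-probabilities and the `beta` value below are illustrative placeholders, not values from this training run.

```python
import math

# Sketch of the Direct Preference Optimization (DPO) loss for one
# preference pair; all numeric inputs here are illustrative placeholders.

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed log-probability of a full response under
    the trainable policy model or the frozen reference model.
    """
    chosen_ratio = policy_chosen_lp - ref_chosen_lp       # log pi/pi_ref (chosen)
    rejected_ratio = policy_rejected_lp - ref_rejected_lp # log pi/pi_ref (rejected)
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))     # -log sigmoid(logits)

# The loss shrinks as the policy prefers the chosen response more strongly
# than the reference model does.
print(dpo_loss(-10.0, -14.0, -11.0, -12.0))
```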

iGenius evaluated the model with benchmarks such as IFEval and Massive Multitask Language Understanding (MMLU), on which Colosseum 355B achieved 82.04% accuracy in a 5-shot setting.
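For readers unfamiliar with the setup, "5-shot" means each test question is preceded by five solved exemplars in the prompt. The sketch below shows one common way such a prompt is assembled; it is a generic illustration, not the exact harness used for this evaluation.

```python
# Sketch: building an MMLU-style 5-shot prompt. A real harness draws the
# five solved exemplars from the benchmark's dev split for each subject.

CHOICES = ["A", "B", "C", "D"]

def format_example(question, options, answer=None):
    """Render one multiple-choice question; omit the answer for the test item."""
    lines = [question]
    lines += [f"{c}. {opt}" for c, opt in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def five_shot_prompt(dev_examples, test_question, test_options):
    """Concatenate 5 solved exemplars, then the unanswered test question."""
    shots = [format_example(q, opts, ans) for q, opts, ans in dev_examples[:5]]
    shots.append(format_example(test_question, test_options))
    return "\n\n".join(shots)
```

The model’s next-token prediction after the final "Answer:" (A, B, C, or D) is then scored against the ground truth.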

Challenges and Best Practices

Training at scale presents unique challenges, such as checkpoint loading delays and network instability. iGenius adopted best practices like progressive scaling, robust checkpointing, and effective monitoring to ensure maximum utilization of their infrastructure.

Key lessons included:

  • Explore configurations at a reduced scale
  • Monitor performance metrics like MFU
  • Maintain accurate experiment tracking
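One concrete form "robust checkpointing" often takes is writing checkpoints atomically and rotating old ones, so an interruption mid-write never corrupts the latest restore point. The sketch below illustrates that pattern under assumed paths and a byte-blob save format; it is not iGenius’s or NeMo’s actual checkpointing code.

```python
import os
import tempfile

# Sketch: atomic checkpoint writes with rotation, one way to make
# large-scale training robust to interruptions. Paths and the save
# format (raw bytes) are illustrative assumptions.

def save_checkpoint(state_bytes, ckpt_dir, step, keep_last=3):
    """Write a checkpoint atomically, then prune older ones."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"step_{step:08d}.ckpt")
    # Write to a temp file first so a crash never leaves a partial file
    # under the final name; os.replace is atomic on POSIX filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(state_bytes)
    os.replace(tmp_path, final_path)
    # Keep only the newest `keep_last` checkpoints to bound storage use.
    ckpts = sorted(p for p in os.listdir(ckpt_dir) if p.endswith(".ckpt"))
    for old in ckpts[:-keep_last]:
        os.remove(os.path.join(ckpt_dir, old))
    return final_path
```

At the scale of a 355B-parameter model on thousands of GPUs, the same idea applies per-rank with sharded state, and checkpoint frequency becomes a trade-off between restart cost and I/O overhead on the shared filesystem.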

By following these practices, iGenius successfully navigated the complexities of large-scale LLM training, delivering a state-of-the-art model tailored for regulated industries.


This collaboration between iGenius and NVIDIA DGX Cloud exemplifies how advanced AI infrastructure and expertise can accelerate the development of sovereign LLMs, paving the way for innovation in regulated sectors.

Published At: Jan. 25, 2025, 10:28 a.m.
Original Source: Continued Pretraining of State-of-the-Art LLMs for Sovereign AI and Regulated Industries with iGenius and NVIDIA DGX Cloud (Author: Martin Cimmino)
Note: This publication was rewritten using AI. The content was based on the original source linked above.