<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Reinforcement-Learning on Producthunt daily</title>
        <link>https://producthunt.programnotes.cn/en/tags/reinforcement-learning/</link>
        <description>Recent content in Reinforcement-Learning on Producthunt daily</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Tue, 23 Sep 2025 15:28:18 +0800</lastBuildDate><atom:link href="https://producthunt.programnotes.cn/en/tags/reinforcement-learning/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>DeepResearch</title>
        <link>https://producthunt.programnotes.cn/en/p/deepresearch/</link>
        <pubDate>Tue, 23 Sep 2025 15:28:18 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/deepresearch/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1722929184854-ab6210e4aa29?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTg2MTI0NzN8&amp;ixlib=rb-4.1.0" alt="Featured image of post DeepResearch" /&gt;&lt;h1 id=&#34;alibaba-nlpdeepresearch&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/Alibaba-NLP/DeepResearch&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Alibaba-NLP/DeepResearch&lt;/a&gt;
&lt;/h1&gt;&lt;div align=&#34;center&#34;&gt;
  &lt;picture&gt;
      &lt;img src=&#34;./assets/logo.png&#34; width=&#34;100%&#34;&gt;
  &lt;/picture&gt;
&lt;/div&gt;
&lt;hr&gt;
&lt;div align=&#34;center&#34; style=&#34;line-height: 1;&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/Models-5EDDD2?style=for-the-badge&amp;amp;logo=huggingface&amp;amp;logoColor=ffffff&amp;amp;labelColor&#34; loading=&#34;lazy&#34; alt=&#34;MODELS&#34;&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/Alibaba-NLP/DeepResearch&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/Github-24292F?style=for-the-badge&amp;amp;logo=github&amp;amp;logoColor=white&#34; loading=&#34;lazy&#34; alt=&#34;GITHUB&#34;&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/Blog-4285F4?style=for-the-badge&amp;amp;logo=google-chrome&amp;amp;logoColor=white&#34; loading=&#34;lazy&#34; alt=&#34;Blog&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p align=&#34;center&#34;&gt;
🤗 &lt;a href=&#34;https://huggingface.co/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B&#34; target=&#34;_blank&#34;&gt;HuggingFace&lt;/a&gt; ｜
&lt;img src=&#34;./assets/tongyi.png&#34; width=&#34;14px&#34; style=&#34;display:inline;&#34;&gt; &lt;a href=&#34;https://modelscope.cn/models/iic/Tongyi-DeepResearch-30B-A3B&#34; target=&#34;_blank&#34;&gt;ModelScope&lt;/a&gt; ｜ 💬 &lt;a href=&#34;./assets/wechat.jpg&#34;&gt;WeChat&lt;/a&gt;
&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;
&lt;a href=&#34;https://trendshift.io/repositories/14895&#34; target=&#34;_blank&#34;&gt;&lt;img src=&#34;https://trendshift.io/api/badge/repositories/14895&#34; alt=&#34;Alibaba-NLP/DeepResearch | Trendshift&#34; style=&#34;width: 250px; height: 55px;&#34; width=&#34;250&#34; height=&#34;55&#34;/&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;h1 id=&#34;introduction&#34;&gt;Introduction
&lt;/h1&gt;&lt;p&gt;We present &lt;img src=&#34;./assets/tongyi.png&#34; width=&#34;14px&#34; style=&#34;display:inline;&#34;&gt; &lt;strong&gt;Tongyi DeepResearch&lt;/strong&gt;, an agentic large language model featuring 30.5 billion total parameters, with only 3.3 billion activated per token. Developed by Tongyi Lab, the model is specifically designed for &lt;strong&gt;long-horizon, deep information-seeking&lt;/strong&gt; tasks. Tongyi DeepResearch demonstrates state-of-the-art performance across a range of agentic search benchmarks, including Humanity&amp;rsquo;s Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES, and SimpleQA.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Tongyi DeepResearch builds upon our previous work on the &lt;img src=&#34;./assets/tongyi.png&#34; width=&#34;14px&#34; style=&#34;display:inline;&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;./WebAgent/&#34; &gt;WebAgent&lt;/a&gt; project.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;More details can be found in our 📰 &lt;a href=&#34;https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/&#34;&gt;Tech Blog&lt;/a&gt;.&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;
  &lt;img width=&#34;100%&#34; src=&#34;./assets/performance.png&#34;&gt;
&lt;/p&gt;
&lt;h2 id=&#34;features&#34;&gt;Features
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;⚙️ &lt;strong&gt;Fully automated synthetic data generation pipeline&lt;/strong&gt;: We design a highly scalable data synthesis pipeline, which is fully automatic and empowers agentic pre-training, supervised fine-tuning, and reinforcement learning.&lt;/li&gt;
&lt;li&gt;🔄 &lt;strong&gt;Large-scale continual pre-training on agentic data&lt;/strong&gt;: Leveraging diverse, high-quality agentic interaction data to extend model capabilities, maintain freshness, and strengthen reasoning performance.&lt;/li&gt;
&lt;li&gt;🔁 &lt;strong&gt;End-to-end reinforcement learning&lt;/strong&gt;: We employ a strictly on-policy RL approach based on a customized Group Relative Policy Optimization framework, with token-level policy gradients, leave-one-out advantage estimation, and selective filtering of negative samples to stabilize training in a non‑stationary environment.&lt;/li&gt;
&lt;li&gt;🤖 &lt;strong&gt;Agent Inference Paradigm Compatibility&lt;/strong&gt;: At inference, Tongyi DeepResearch is compatible with two inference paradigms: ReAct, for rigorously evaluating the model&amp;rsquo;s core intrinsic abilities, and an IterResearch-based &amp;lsquo;Heavy&amp;rsquo; mode, which uses a test-time scaling strategy to unlock the model&amp;rsquo;s maximum performance ceiling.&lt;/li&gt;
&lt;/ul&gt;
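The leave-one-out advantage estimation mentioned above can be sketched in a few lines. This follows the standard RLOO-style formulation (each rollout's reward minus the mean reward of the other rollouts in its group); the exact Tongyi DeepResearch implementation is not published here, so treat it as an illustration:

```python
# Sketch of leave-one-out advantage estimation (standard RLOO-style
# formulation; the actual Tongyi DeepResearch code may differ).
# Each of G rollouts for the same prompt receives a scalar reward; the
# advantage of rollout i is its reward minus the mean reward of the
# other G-1 rollouts, so the baseline never includes the sample itself.

def leave_one_out_advantages(rewards: list[float]) -> list[float]:
    g = len(rewards)
    assert g > 1, "need at least two rollouts per prompt"
    total = sum(rewards)
    # baseline for rollout i = mean of the other rollouts' rewards
    return [r - (total - r) / (g - 1) for r in rewards]
```

For example, group rewards `[1.0, 0.0, 1.0]` yield advantages `[0.5, -1.0, 0.5]`; the advantages always sum to zero, which keeps the gradient estimate unbiased with respect to the group baseline.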
&lt;h1 id=&#34;model-download&#34;&gt;Model Download
&lt;/h1&gt;&lt;p&gt;You can directly download the model by following the links below.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th style=&#34;text-align: center&#34;&gt;Model&lt;/th&gt;
          &lt;th style=&#34;text-align: center&#34;&gt;Download Links&lt;/th&gt;
          &lt;th style=&#34;text-align: center&#34;&gt;Model Size&lt;/th&gt;
          &lt;th style=&#34;text-align: center&#34;&gt;Context Length&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;Tongyi-DeepResearch-30B-A3B&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🤗 HuggingFace&lt;/a&gt;&lt;br&gt; &lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/models/iic/Tongyi-DeepResearch-30B-A3B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🤖 ModelScope&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;30B-A3B&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;128K&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h1 id=&#34;news&#34;&gt;News
&lt;/h1&gt;&lt;p&gt;[2025/09/20]🚀 Tongyi-DeepResearch-30B-A3B is now on &lt;a class=&#34;link&#34; href=&#34;https://openrouter.ai/alibaba/tongyi-deepresearch-30b-a3b&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenRouter&lt;/a&gt;! Follow the &lt;a class=&#34;link&#34; href=&#34;https://github.com/Alibaba-NLP/DeepResearch?tab=readme-ov-file#6-you-can-use-openrouters-api-to-call-our-model&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Quick-start&lt;/a&gt; guide.&lt;/p&gt;
&lt;p&gt;[2025/09/17]🔥 We have released &lt;strong&gt;Tongyi-DeepResearch-30B-A3B&lt;/strong&gt;.&lt;/p&gt;
&lt;h1 id=&#34;deep-research-benchmark-results&#34;&gt;Deep Research Benchmark Results
&lt;/h1&gt;&lt;p align=&#34;center&#34;&gt;
  &lt;img width=&#34;100%&#34; src=&#34;./assets/benchmark.png&#34;&gt;
&lt;/p&gt;
&lt;h2 id=&#34;quick-start&#34;&gt;Quick Start
&lt;/h2&gt;&lt;p&gt;This guide provides instructions for setting up the environment and running inference scripts located in the &lt;a class=&#34;link&#34; href=&#34;./inference/&#34; &gt;inference&lt;/a&gt; folder.&lt;/p&gt;
&lt;h3 id=&#34;1-environment-setup&#34;&gt;1. Environment Setup
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Recommended Python version: &lt;strong&gt;3.10.0&lt;/strong&gt; (using other versions may cause dependency issues).&lt;/li&gt;
&lt;li&gt;It is strongly advised to create an isolated environment using &lt;code&gt;conda&lt;/code&gt; or &lt;code&gt;virtualenv&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Example with Conda&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda create -n react_infer_env &lt;span class=&#34;nv&#34;&gt;python&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;3.10.0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda activate react_infer_env
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;2-installation&#34;&gt;2. Installation
&lt;/h3&gt;&lt;p&gt;Install the required dependencies:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -r requirements.txt
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;3-environment-configuration-and-prepare-evaluation-data&#34;&gt;3. Configure the Environment and Prepare Evaluation Data
&lt;/h3&gt;&lt;h4 id=&#34;environment-configuration&#34;&gt;Environment Configuration
&lt;/h4&gt;&lt;p&gt;Configure your API keys and settings by copying the example environment file:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Copy the example environment file&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cp .env.example .env
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Edit the &lt;code&gt;.env&lt;/code&gt; file and provide your actual API keys and configuration values:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SERPER_KEY_ID&lt;/strong&gt;: Get your key from &lt;a class=&#34;link&#34; href=&#34;https://serper.dev/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Serper.dev&lt;/a&gt; for web search and Google Scholar&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JINA_API_KEYS&lt;/strong&gt;: Get your key from &lt;a class=&#34;link&#34; href=&#34;https://jina.ai/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Jina.ai&lt;/a&gt; for web page reading&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API_KEY/API_BASE&lt;/strong&gt;: OpenAI-compatible API for page summarization from &lt;a class=&#34;link&#34; href=&#34;https://platform.openai.com/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DASHSCOPE_API_KEY&lt;/strong&gt;: Get your key from &lt;a class=&#34;link&#34; href=&#34;https://dashscope.aliyun.com/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Dashscope&lt;/a&gt; for file parsing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SANDBOX_FUSION_ENDPOINT&lt;/strong&gt;: Python interpreter sandbox endpoints (see &lt;a class=&#34;link&#34; href=&#34;https://github.com/bytedance/SandboxFusion&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SandboxFusion&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MODEL_PATH&lt;/strong&gt;: Path to your model weights&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DATASET&lt;/strong&gt;: Name of your evaluation dataset&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OUTPUT_PATH&lt;/strong&gt;: Directory for saving results&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The &lt;code&gt;.env&lt;/code&gt; file is gitignored, so your secrets will not be committed to the repository.&lt;/p&gt;
&lt;/blockquote&gt;
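As a concrete sketch, a filled-in `.env` might look like the following. The key names come from the list above, but every value is a placeholder and the exact layout is an assumption; `.env.example` in the repository is the authoritative template:

```shell
# Hypothetical .env sketch — key names from the docs above, placeholder values.
SERPER_KEY_ID=your-serper-key
JINA_API_KEYS=your-jina-key
API_KEY=your-openai-compatible-key
API_BASE=https://api.openai.com/v1
DASHSCOPE_API_KEY=your-dashscope-key
SANDBOX_FUSION_ENDPOINT=http://localhost:8080
MODEL_PATH=/path/to/Tongyi-DeepResearch-30B-A3B
DATASET=my_questions
OUTPUT_PATH=./outputs
```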
&lt;h4 id=&#34;prepare-evaluation-data&#34;&gt;Prepare Evaluation Data
&lt;/h4&gt;&lt;p&gt;The system supports two input file formats: &lt;strong&gt;JSON&lt;/strong&gt; and &lt;strong&gt;JSONL&lt;/strong&gt;.&lt;/p&gt;
&lt;h4 id=&#34;supported-file-formats&#34;&gt;Supported File Formats:
&lt;/h4&gt;&lt;p&gt;&lt;strong&gt;Option 1: JSONL Format (recommended)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create your data file with &lt;code&gt;.jsonl&lt;/code&gt; extension (e.g., &lt;code&gt;my_questions.jsonl&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Each line must be a valid JSON object with &lt;code&gt;question&lt;/code&gt; and &lt;code&gt;answer&lt;/code&gt; keys:
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;&amp;#34;question&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;What is the capital of France?&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;&amp;#34;answer&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Paris&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;&amp;#34;question&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Explain quantum computing&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;&amp;#34;answer&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Option 2: JSON Format&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create your data file with &lt;code&gt;.json&lt;/code&gt; extension (e.g., &lt;code&gt;my_questions.json&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;File must contain a JSON array of objects, each with &lt;code&gt;question&lt;/code&gt; and &lt;code&gt;answer&lt;/code&gt; keys:
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;&amp;#34;question&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;What is the capital of France?&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;&amp;#34;answer&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Paris&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;&amp;#34;question&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Explain quantum computing&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;&amp;#34;answer&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Important Note:&lt;/strong&gt; The &lt;code&gt;answer&lt;/code&gt; field contains the &lt;strong&gt;ground truth/reference answer&lt;/strong&gt; used for evaluation. The system generates its own responses to the questions, and these reference answers are used to automatically judge the quality of the generated responses during benchmark evaluation.&lt;/p&gt;
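As a sanity check before launching an evaluation run, a small loader can verify both supported formats. This is a minimal sketch, assuming only the documented schema; the repository's own data loading may do more:

```python
import json

def load_eval_records(path: str) -> list[dict]:
    """Load evaluation data in either supported format (.json or .jsonl)
    and verify every record has 'question' and 'answer' keys.
    A minimal sketch; the repo's own loader may differ."""
    with open(path, encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            # JSONL: one JSON object per non-empty line
            records = [json.loads(line) for line in f if line.strip()]
        else:
            # JSON: a single array of objects
            records = json.load(f)
    for i, rec in enumerate(records):
        missing = {"question", "answer"} - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {missing}")
    return records
```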
&lt;h4 id=&#34;file-references-for-document-processing&#34;&gt;File References for Document Processing:
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;If using the &lt;em&gt;file parser&lt;/em&gt; tool, &lt;strong&gt;prepend the filename to the &lt;code&gt;question&lt;/code&gt; field&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Place referenced files in &lt;code&gt;eval_data/file_corpus/&lt;/code&gt; directory&lt;/li&gt;
&lt;li&gt;Example: &lt;code&gt;{&amp;quot;question&amp;quot;: &amp;quot;report.pdf What are the key findings?&amp;quot;, &amp;quot;answer&amp;quot;: &amp;quot;...&amp;quot;}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
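Putting the convention above together, a record that routes `report.pdf` through the file parser can be generated like this (the `file_question` helper is hypothetical; only the filename-prefix convention comes from the docs above):

```python
import json

def file_question(filename: str, question: str, answer: str = "") -> str:
    """Build one JSONL record that references a document in
    eval_data/file_corpus/ by prepending its filename to the question,
    per the convention above. Helper name is hypothetical."""
    return json.dumps({"question": f"{filename} {question}", "answer": answer})
```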
&lt;h4 id=&#34;file-organization&#34;&gt;File Organization:
&lt;/h4&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;project_root/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── eval_data/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;│   ├── my_questions.jsonl          # Your evaluation data
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;│   └── file_corpus/                # Referenced documents
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;│       ├── report.pdf
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;│       └── data.xlsx
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;4-configure-the-inference-script&#34;&gt;4. Configure the Inference Script
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Open &lt;code&gt;run_react_infer.sh&lt;/code&gt; and modify the following variables as instructed in the comments:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;MODEL_PATH&lt;/code&gt;  - path to the local or remote model weights.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DATASET&lt;/code&gt;     - full path to your evaluation file, e.g. &lt;code&gt;eval_data/my_questions.jsonl&lt;/code&gt; or &lt;code&gt;/path/to/my_questions.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;OUTPUT_PATH&lt;/code&gt; - path for saving the prediction results, e.g. &lt;code&gt;./outputs&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Depending on the tools you enable (retrieval, calculator, web search, etc.), provide the required &lt;code&gt;API_KEY&lt;/code&gt;, &lt;code&gt;BASE_URL&lt;/code&gt;, or other credentials. Each key is explained inline in the bash script.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;5-run-the-inference-script&#34;&gt;5. Run the Inference Script
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash run_react_infer.sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;hr&gt;
&lt;p&gt;With these steps, you can fully prepare the environment, configure the dataset, and run the model. For more details, consult the inline comments in each script or open an issue.&lt;/p&gt;
&lt;h3 id=&#34;6-you-can-use-openrouters-api-to-call-our-model&#34;&gt;6. You can use OpenRouter&amp;rsquo;s API to call our model
&lt;/h3&gt;&lt;p&gt;Tongyi-DeepResearch-30B-A3B is now available at &lt;a class=&#34;link&#34; href=&#34;https://openrouter.ai/alibaba/tongyi-deepresearch-30b-a3b&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenRouter&lt;/a&gt;. You can run inference without any GPUs.&lt;/p&gt;
&lt;p&gt;You need to modify the following in the file &lt;a class=&#34;link&#34; href=&#34;https://github.com/Alibaba-NLP/DeepResearch/blob/main/inference/react_agent.py&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;inference/react_agent.py&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the &lt;code&gt;call_server&lt;/code&gt; function, set the API key and base URL to those of your OpenRouter account.&lt;/li&gt;
&lt;li&gt;Change the model name to &lt;code&gt;alibaba/tongyi-deepresearch-30b-a3b&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Adjust how message content is concatenated, as described in the comments on lines &lt;strong&gt;88–90&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
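Sketched with Python's standard library, the modified call might build an OpenAI-compatible request against OpenRouter as below. The endpoint URL and payload shape follow OpenRouter's standard chat-completions API, and `build_request` is an illustrative helper, not the repository's actual `call_server`:

```python
import json
import urllib.request

# Standard OpenRouter chat-completions endpoint (OpenAI-compatible).
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "alibaba/tongyi-deepresearch-30b-a3b"

def build_request(messages: list[dict], api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request to OpenRouter.
    An illustrative sketch of the change described above."""
    payload = {"model": MODEL, "messages": messages}
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending is then a one-liner:
# with urllib.request.urlopen(build_request(msgs, key)) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

Separating request construction from sending keeps the OpenRouter-specific details (model slug, auth header) in one place, which mirrors the three bullet edits listed above.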
&lt;h2 id=&#34;benchmark-evaluation&#34;&gt;Benchmark Evaluation
&lt;/h2&gt;&lt;p&gt;We provide benchmark evaluation scripts for various datasets. Please refer to the &lt;a class=&#34;link&#34; href=&#34;./evaluation/&#34; &gt;evaluation scripts&lt;/a&gt; directory for more details.&lt;/p&gt;
&lt;h2 id=&#34;deep-research-agent-family&#34;&gt;Deep Research Agent Family
&lt;/h2&gt;&lt;p align=&#34;center&#34;&gt;
  &lt;img width=&#34;100%&#34; src=&#34;./assets/family.png&#34;&gt;
&lt;/p&gt;
&lt;p&gt;Tongyi DeepResearch also has an extensive deep research agent family. You can find more information in the following papers:&lt;/p&gt;
&lt;p&gt;[1] &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2501.07572&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WebWalker: Benchmarking LLMs in Web Traversal&lt;/a&gt; (ACL 2025)&lt;br&gt;
[2] &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2505.22648&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WebDancer: Towards Autonomous Information Seeking Agency&lt;/a&gt; (NeurIPS 2025)&lt;br&gt;
[3] &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2507.02592&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WebSailor: Navigating Super-human Reasoning for Web Agent&lt;/a&gt;&lt;br&gt;
[4] &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2507.15061&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization&lt;/a&gt;&lt;br&gt;
[5] &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2508.05748&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent&lt;/a&gt;&lt;br&gt;
[6] &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2509.13309&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents&lt;/a&gt;&lt;br&gt;
[7] &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2509.13313&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization&lt;/a&gt;&lt;br&gt;
[8] &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2509.13312&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research&lt;/a&gt;&lt;br&gt;
[9] &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2509.13305&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning&lt;/a&gt;&lt;br&gt;
[10] &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2509.13310&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Scaling Agents via Continual Pre-training&lt;/a&gt;&lt;br&gt;
[11] &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2509.13311&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Towards General Agentic Intelligence via Environment Scaling&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;-misc&#34;&gt;🌟 Misc
&lt;/h2&gt;&lt;div align=&#34;center&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.star-history.com/#Alibaba-NLP/DeepResearch&amp;amp;Date&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://api.star-history.com/svg?repos=Alibaba-NLP/DeepResearch&amp;amp;type=Date&#34; loading=&#34;lazy&#34; alt=&#34;Star History Chart&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;h2 id=&#34;-talent-recruitment&#34;&gt;🚩 Talent Recruitment
&lt;/h2&gt;&lt;p&gt;🔥🔥🔥 We are hiring! Research intern positions are open (based in Hangzhou, Beijing, or Shanghai).&lt;/p&gt;
&lt;p&gt;📚 &lt;strong&gt;Research Areas&lt;/strong&gt;: Web Agent, Search Agent, Agent RL, Multi-Agent RL, Agentic RAG&lt;/p&gt;
&lt;p&gt;☎️ &lt;strong&gt;Contact&lt;/strong&gt;: &lt;a class=&#34;link&#34; href=&#34;mailto:yongjiang.jy@alibaba-inc.com&#34; &gt;yongjiang.jy@alibaba-inc.com&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;contact-information&#34;&gt;Contact Information
&lt;/h2&gt;&lt;p&gt;For communications, please contact Yong Jiang (&lt;a class=&#34;link&#34; href=&#34;mailto:yongjiang.jy@alibaba-inc.com&#34; &gt;yongjiang.jy@alibaba-inc.com&lt;/a&gt;).&lt;/p&gt;
&lt;h2 id=&#34;citation&#34;&gt;Citation
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bibtex&#34; data-lang=&#34;bibtex&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nc&#34;&gt;@misc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nl&#34;&gt;tongyidr&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;author&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{Tongyi DeepResearch Team}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{Tongyi-DeepResearch}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{2025}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;howpublished&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{\url{https://github.com/Alibaba-NLP/DeepResearch}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description>
        </item>
        <item>
        <title>ML-From-Scratch</title>
        <link>https://producthunt.programnotes.cn/en/p/ml-from-scratch/</link>
        <pubDate>Fri, 05 Sep 2025 15:27:51 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/ml-from-scratch/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1588017571031-356e08526b59?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTcwNTcxODl8&amp;ixlib=rb-4.1.0" alt="Featured image of post ML-From-Scratch" /&gt;&lt;h1 id=&#34;eriklindernorenml-from-scratch&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/eriklindernoren/ML-From-Scratch&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;eriklindernoren/ML-From-Scratch&lt;/a&gt;
&lt;/h1&gt;&lt;h1 id=&#34;machine-learning-from-scratch&#34;&gt;Machine Learning From Scratch
&lt;/h1&gt;&lt;h2 id=&#34;about&#34;&gt;About
&lt;/h2&gt;&lt;p&gt;Python implementations of some of the fundamental Machine Learning models and algorithms from scratch.&lt;/p&gt;
&lt;p&gt;The purpose of this project is not to produce algorithms that are as optimized and computationally efficient as possible,
but rather to present their inner workings in a transparent and accessible way.&lt;/p&gt;
&lt;h2 id=&#34;table-of-contents&#34;&gt;Table of Contents
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#machine-learning-from-scratch&#34; &gt;Machine Learning From Scratch&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#about&#34; &gt;About&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#table-of-contents&#34; &gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#installation&#34; &gt;Installation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#examples&#34; &gt;Examples&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#polynomial-regression&#34; &gt;Polynomial Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#classification-with-cnn&#34; &gt;Classification With CNN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#density-based-clustering&#34; &gt;Density-Based Clustering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#generating-handwritten-digits&#34; &gt;Generating Handwritten Digits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#deep-reinforcement-learning&#34; &gt;Deep Reinforcement Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#image-reconstruction-with-rbm&#34; &gt;Image Reconstruction With RBM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#evolutionary-evolved-neural-network&#34; &gt;Evolutionary Evolved Neural Network&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#genetic-algorithm&#34; &gt;Genetic Algorithm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#association-analysis&#34; &gt;Association Analysis&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#implementations&#34; &gt;Implementations&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#supervised-learning&#34; &gt;Supervised Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#unsupervised-learning&#34; &gt;Unsupervised Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#reinforcement-learning&#34; &gt;Reinforcement Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#deep-learning&#34; &gt;Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#contact&#34; &gt;Contact&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;installation&#34;&gt;Installation
&lt;/h2&gt;&lt;pre&gt;&lt;code&gt;$ git clone https://github.com/eriklindernoren/ML-From-Scratch
$ cd ML-From-Scratch
$ python setup.py install
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;examples&#34;&gt;Examples
&lt;/h2&gt;&lt;h3 id=&#34;polynomial-regression&#34;&gt;Polynomial Regression
&lt;/h3&gt;&lt;pre&gt;&lt;code&gt;$ python mlfromscratch/examples/polynomial_regression.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p align=&#34;center&#34;&gt;
    &lt;img src=&#34;http://eriklindernoren.se/images/p_reg.gif&#34; width=&#34;640&#34;&gt;
&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;
    Figure: Training progress of a regularized polynomial regression model fitting &lt;br&gt;
    temperature data measured in Linköping, Sweden 2016.
&lt;/p&gt;
&lt;h3 id=&#34;classification-with-cnn&#34;&gt;Classification With CNN
&lt;/h3&gt;&lt;pre&gt;&lt;code&gt;$ python mlfromscratch/examples/convolutional_neural_network.py

+---------+
| ConvNet |
+---------+
Input Shape: (1, 8, 8)
+----------------------+------------+--------------+
| Layer Type           | Parameters | Output Shape |
+----------------------+------------+--------------+
| Conv2D               | 160        | (16, 8, 8)   |
| Activation (ReLU)    | 0          | (16, 8, 8)   |
| Dropout              | 0          | (16, 8, 8)   |
| BatchNormalization   | 2048       | (16, 8, 8)   |
| Conv2D               | 4640       | (32, 8, 8)   |
| Activation (ReLU)    | 0          | (32, 8, 8)   |
| Dropout              | 0          | (32, 8, 8)   |
| BatchNormalization   | 4096       | (32, 8, 8)   |
| Flatten              | 0          | (2048,)      |
| Dense                | 524544     | (256,)       |
| Activation (ReLU)    | 0          | (256,)       |
| Dropout              | 0          | (256,)       |
| BatchNormalization   | 512        | (256,)       |
| Dense                | 2570       | (10,)        |
| Activation (Softmax) | 0          | (10,)        |
+----------------------+------------+--------------+
Total Parameters: 538570

Training: 100% [------------------------------------------------------------------------] Time: 0:01:55
Accuracy: 0.987465181058
&lt;/code&gt;&lt;/pre&gt;
&lt;p align=&#34;center&#34;&gt;
    &lt;img src=&#34;http://eriklindernoren.se/images/mlfs_cnn1.png&#34; width=&#34;640&#34;&gt;
&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;
    Figure: Classification of the digit dataset using CNN.
&lt;/p&gt;
&lt;h3 id=&#34;density-based-clustering&#34;&gt;Density-Based Clustering
&lt;/h3&gt;&lt;pre&gt;&lt;code&gt;$ python mlfromscratch/examples/dbscan.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p align=&#34;center&#34;&gt;
    &lt;img src=&#34;http://eriklindernoren.se/images/mlfs_dbscan.png&#34; width=&#34;640&#34;&gt;
&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;
    Figure: Clustering of the moons dataset using DBSCAN.
&lt;/p&gt;
&lt;h3 id=&#34;generating-handwritten-digits&#34;&gt;Generating Handwritten Digits
&lt;/h3&gt;&lt;pre&gt;&lt;code&gt;$ python mlfromscratch/unsupervised_learning/generative_adversarial_network.py

+-----------+
| Generator |
+-----------+
Input Shape: (100,)
+------------------------+------------+--------------+
| Layer Type             | Parameters | Output Shape |
+------------------------+------------+--------------+
| Dense                  | 25856      | (256,)       |
| Activation (LeakyReLU) | 0          | (256,)       |
| BatchNormalization     | 512        | (256,)       |
| Dense                  | 131584     | (512,)       |
| Activation (LeakyReLU) | 0          | (512,)       |
| BatchNormalization     | 1024       | (512,)       |
| Dense                  | 525312     | (1024,)      |
| Activation (LeakyReLU) | 0          | (1024,)      |
| BatchNormalization     | 2048       | (1024,)      |
| Dense                  | 803600     | (784,)       |
| Activation (TanH)      | 0          | (784,)       |
+------------------------+------------+--------------+
Total Parameters: 1489936

+---------------+
| Discriminator |
+---------------+
Input Shape: (784,)
+------------------------+------------+--------------+
| Layer Type             | Parameters | Output Shape |
+------------------------+------------+--------------+
| Dense                  | 401920     | (512,)       |
| Activation (LeakyReLU) | 0          | (512,)       |
| Dropout                | 0          | (512,)       |
| Dense                  | 131328     | (256,)       |
| Activation (LeakyReLU) | 0          | (256,)       |
| Dropout                | 0          | (256,)       |
| Dense                  | 514        | (2,)         |
| Activation (Softmax)   | 0          | (2,)         |
+------------------------+------------+--------------+
Total Parameters: 533762
&lt;/code&gt;&lt;/pre&gt;
&lt;p align=&#34;center&#34;&gt;
    &lt;img src=&#34;http://eriklindernoren.se/images/gan_mnist5.gif&#34; width=&#34;640&#34;&gt;
&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;
    Figure: Training progress of a Generative Adversarial Network generating &lt;br&gt;
    handwritten digits.
&lt;/p&gt;
&lt;h3 id=&#34;deep-reinforcement-learning&#34;&gt;Deep Reinforcement Learning
&lt;/h3&gt;&lt;pre&gt;&lt;code&gt;$ python mlfromscratch/examples/deep_q_network.py

+----------------+
| Deep Q-Network |
+----------------+
Input Shape: (4,)
+-------------------+------------+--------------+
| Layer Type        | Parameters | Output Shape |
+-------------------+------------+--------------+
| Dense             | 320        | (64,)        |
| Activation (ReLU) | 0          | (64,)        |
| Dense             | 130        | (2,)         |
+-------------------+------------+--------------+
Total Parameters: 450
&lt;/code&gt;&lt;/pre&gt;
&lt;p align=&#34;center&#34;&gt;
    &lt;img src=&#34;http://eriklindernoren.se/images/mlfs_dql1.gif&#34; width=&#34;640&#34;&gt;
&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;
    Figure: Deep Q-Network solution to the CartPole-v1 environment in OpenAI gym.
&lt;/p&gt;
&lt;h3 id=&#34;image-reconstruction-with-rbm&#34;&gt;Image Reconstruction With RBM
&lt;/h3&gt;&lt;pre&gt;&lt;code&gt;$ python mlfromscratch/examples/restricted_boltzmann_machine.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p align=&#34;center&#34;&gt;
    &lt;img src=&#34;http://eriklindernoren.se/images/rbm_digits1.gif&#34; width=&#34;640&#34;&gt;
&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;
    Figure: How the network improves during training at reconstructing &lt;br&gt;
    the digit 2 from the MNIST dataset.
&lt;/p&gt;
&lt;h3 id=&#34;evolutionary-evolved-neural-network&#34;&gt;Evolutionary Evolved Neural Network
&lt;/h3&gt;&lt;pre&gt;&lt;code&gt;$ python mlfromscratch/examples/neuroevolution.py

+---------------+
| Model Summary |
+---------------+
Input Shape: (64,)
+----------------------+------------+--------------+
| Layer Type           | Parameters | Output Shape |
+----------------------+------------+--------------+
| Dense                | 1040       | (16,)        |
| Activation (ReLU)    | 0          | (16,)        |
| Dense                | 170        | (10,)        |
| Activation (Softmax) | 0          | (10,)        |
+----------------------+------------+--------------+
Total Parameters: 1210

Population Size: 100
Generations: 3000
Mutation Rate: 0.01

[0 Best Individual - Fitness: 3.08301, Accuracy: 10.5%]
[1 Best Individual - Fitness: 3.08746, Accuracy: 12.0%]
...
[2999 Best Individual - Fitness: 94.08513, Accuracy: 98.5%]
Test set accuracy: 96.7%
&lt;/code&gt;&lt;/pre&gt;
&lt;p align=&#34;center&#34;&gt;
    &lt;img src=&#34;http://eriklindernoren.se/images/evo_nn4.png&#34; width=&#34;640&#34;&gt;
&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;
    Figure: Classification of the digit dataset by a neural network that has&lt;br&gt;
    been evolved with an evolutionary algorithm.
&lt;/p&gt;
&lt;h3 id=&#34;genetic-algorithm&#34;&gt;Genetic Algorithm
&lt;/h3&gt;&lt;pre&gt;&lt;code&gt;$ python mlfromscratch/examples/genetic_algorithm.py

+--------+
|   GA   |
+--------+
Description: Implementation of a Genetic Algorithm which aims to produce
the user specified target string. This implementation calculates each
candidate&#39;s fitness based on the alphabetical distance between the candidate
and the target. A candidate is selected as a parent with probabilities proportional
to the candidate&#39;s fitness. Reproduction is implemented as a single-point
crossover between pairs of parents. Mutation is done by randomly assigning
new characters with uniform probability.

Parameters
----------
Target String: &#39;Genetic Algorithm&#39;
Population Size: 100
Mutation Rate: 0.05

[0 Closest Candidate: &#39;CJqlJguPlqzvpoJmb&#39;, Fitness: 0.00]
[1 Closest Candidate: &#39;MCxZxdr nlfiwwGEk&#39;, Fitness: 0.01]
[2 Closest Candidate: &#39;MCxZxdm nlfiwwGcx&#39;, Fitness: 0.01]
[3 Closest Candidate: &#39;SmdsAklMHn kBIwKn&#39;, Fitness: 0.01]
[4 Closest Candidate: &#39;  lotneaJOasWfu Z&#39;, Fitness: 0.01]
...
[292 Closest Candidate: &#39;GeneticaAlgorithm&#39;, Fitness: 1.00]
[293 Closest Candidate: &#39;GeneticaAlgorithm&#39;, Fitness: 1.00]
[294 Answer: &#39;Genetic Algorithm&#39;]
&lt;/code&gt;&lt;/pre&gt;
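The banner above describes the algorithm's fitness, selection, crossover, and mutation steps. They can be sketched roughly as follows (a simplified illustration only, not the repository's actual code; all names here are invented):

```python
import random
import string

TARGET = "Genetic Algorithm"
CHARS = string.ascii_letters + " "

def fitness(candidate):
    # Fitness grows as the alphabetical distance to the target shrinks;
    # an exact match scores 1.0.
    dist = sum(abs(ord(a) - ord(b)) for a, b in zip(candidate, TARGET))
    return 1.0 / (dist + 1.0)

def crossover(p1, p2):
    # Single-point crossover between a pair of parents.
    point = random.randrange(1, len(TARGET))
    return p1[:point] + p2[point:]

def mutate(candidate, rate=0.05):
    # Each character is replaced by a random one with probability `rate`.
    flips = random.choices([True, False], weights=[rate, 1.0 - rate],
                           k=len(candidate))
    return "".join(random.choice(CHARS) if flip else c
                   for c, flip in zip(candidate, flips))

def evolve(pop_size=100, generations=1000):
    population = ["".join(random.choice(CHARS) for _ in TARGET)
                  for _ in range(pop_size)]
    for _ in range(generations):
        weights = [fitness(c) for c in population]
        # Parents are drawn with probability proportional to fitness.
        population = [mutate(crossover(*random.choices(population, weights, k=2)))
                      for _ in range(pop_size)]
        best = max(population, key=fitness)
        if best == TARGET:
            return best
    return max(population, key=fitness)
```

With enough generations the population converges on the target string, mirroring the run shown above.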
&lt;h3 id=&#34;association-analysis&#34;&gt;Association Analysis
&lt;/h3&gt;&lt;pre&gt;&lt;code&gt;$ python mlfromscratch/examples/apriori.py
+-------------+
|   Apriori   |
+-------------+
Minimum Support: 0.25
Minimum Confidence: 0.8
Transactions:
    [1, 2, 3, 4]
    [1, 2, 4]
    [1, 2]
    [2, 3, 4]
    [2, 3]
    [3, 4]
    [2, 4]
Frequent Itemsets:
    [1, 2, 3, 4, [1, 2], [1, 4], [2, 3], [2, 4], [3, 4], [1, 2, 4], [2, 3, 4]]
Rules:
    1 -&amp;gt; 2 (support: 0.43, confidence: 1.0)
    4 -&amp;gt; 2 (support: 0.57, confidence: 0.8)
    [1, 4] -&amp;gt; 2 (support: 0.29, confidence: 1.0)
&lt;/code&gt;&lt;/pre&gt;
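The support and confidence figures in the output above follow from simple transaction counts. A minimal sketch, using the same toy transactions (the helper names are invented and unrelated to the repository's implementation):

```python
# Toy transaction database matching the example output above.
transactions = [
    {1, 2, 3, 4}, {1, 2, 4}, {1, 2},
    {2, 3, 4}, {2, 3}, {3, 4}, {2, 4},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)

def confidence(antecedent, consequent):
    # Estimated probability of the consequent given the antecedent:
    # support of the combined itemset divided by support of the antecedent.
    return support(antecedent | consequent) / support(antecedent)

# Rule 1 -> 2: support 3/7 (about 0.43), confidence 1.0
# Rule 4 -> 2: support 4/7 (about 0.57), confidence 0.8
```

Apriori prunes the search by only extending itemsets whose subsets already meet the minimum support, then keeps rules whose confidence clears the threshold.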
&lt;h2 id=&#34;implementations&#34;&gt;Implementations
&lt;/h2&gt;&lt;h3 id=&#34;supervised-learning&#34;&gt;Supervised Learning
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/adaboost.py&#34; &gt;Adaboost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/bayesian_regression.py&#34; &gt;Bayesian Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/decision_tree.py&#34; &gt;Decision Tree&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/regression.py&#34; &gt;Elastic Net&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/gradient_boosting.py&#34; &gt;Gradient Boosting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/k_nearest_neighbors.py&#34; &gt;K Nearest Neighbors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/regression.py&#34; &gt;Lasso Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/linear_discriminant_analysis.py&#34; &gt;Linear Discriminant Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/regression.py&#34; &gt;Linear Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/logistic_regression.py&#34; &gt;Logistic Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/multi_class_lda.py&#34; &gt;Multi-class Linear Discriminant Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/multilayer_perceptron.py&#34; &gt;Multilayer Perceptron&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/naive_bayes.py&#34; &gt;Naive Bayes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/neuroevolution.py&#34; &gt;Neuroevolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/particle_swarm_optimization.py&#34; &gt;Particle Swarm Optimization of Neural Network&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/perceptron.py&#34; &gt;Perceptron&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/regression.py&#34; &gt;Polynomial Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/random_forest.py&#34; &gt;Random Forest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/regression.py&#34; &gt;Ridge Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/support_vector_machine.py&#34; &gt;Support Vector Machine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/supervised_learning/xgboost.py&#34; &gt;XGBoost&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;unsupervised-learning&#34;&gt;Unsupervised Learning
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/unsupervised_learning/apriori.py&#34; &gt;Apriori&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/unsupervised_learning/autoencoder.py&#34; &gt;Autoencoder&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/unsupervised_learning/dbscan.py&#34; &gt;DBSCAN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/unsupervised_learning/fp_growth.py&#34; &gt;FP-Growth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/unsupervised_learning/gaussian_mixture_model.py&#34; &gt;Gaussian Mixture Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/unsupervised_learning/generative_adversarial_network.py&#34; &gt;Generative Adversarial Network&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/unsupervised_learning/genetic_algorithm.py&#34; &gt;Genetic Algorithm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/unsupervised_learning/k_means.py&#34; &gt;K-Means&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/unsupervised_learning/partitioning_around_medoids.py&#34; &gt;Partitioning Around Medoids&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/unsupervised_learning/principal_component_analysis.py&#34; &gt;Principal Component Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/unsupervised_learning/restricted_boltzmann_machine.py&#34; &gt;Restricted Boltzmann Machine&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;reinforcement-learning&#34;&gt;Reinforcement Learning
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/reinforcement_learning/deep_q_network.py&#34; &gt;Deep Q-Network&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;deep-learning&#34;&gt;Deep Learning
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/deep_learning/neural_network.py&#34; &gt;Neural Network&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/deep_learning/layers.py&#34; &gt;Layers&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;Activation Layer&lt;/li&gt;
&lt;li&gt;Average Pooling Layer&lt;/li&gt;
&lt;li&gt;Batch Normalization Layer&lt;/li&gt;
&lt;li&gt;Constant Padding Layer&lt;/li&gt;
&lt;li&gt;Convolutional Layer&lt;/li&gt;
&lt;li&gt;Dropout Layer&lt;/li&gt;
&lt;li&gt;Flatten Layer&lt;/li&gt;
&lt;li&gt;Fully-Connected (Dense) Layer&lt;/li&gt;
&lt;li&gt;Fully-Connected RNN Layer&lt;/li&gt;
&lt;li&gt;Max Pooling Layer&lt;/li&gt;
&lt;li&gt;Reshape Layer&lt;/li&gt;
&lt;li&gt;Up Sampling Layer&lt;/li&gt;
&lt;li&gt;Zero Padding Layer&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Model Types
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/examples/convolutional_neural_network.py&#34; &gt;Convolutional Neural Network&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/examples/multilayer_perceptron.py&#34; &gt;Multilayer Perceptron&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;mlfromscratch/examples/recurrent_neural_network.py&#34; &gt;Recurrent Neural Network&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;contact&#34;&gt;Contact
&lt;/h2&gt;&lt;p&gt;If there&amp;rsquo;s some implementation you would like to see here or if you&amp;rsquo;re just feeling social,
feel free to &lt;a class=&#34;link&#34; href=&#34;mailto:eriklindernoren@gmail.com&#34; &gt;email&lt;/a&gt; me or connect with me on &lt;a class=&#34;link&#34; href=&#34;https://www.linkedin.com/in/eriklindernoren/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>verifiers</title>
        <link>https://producthunt.programnotes.cn/en/p/verifiers/</link>
        <pubDate>Thu, 28 Aug 2025 15:29:47 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/verifiers/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1696345452312-dd623f76551c?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTYzNjYwODJ8&amp;ixlib=rb-4.1.0" alt="Featured image of post verifiers" /&gt;&lt;h1 id=&#34;willccbbverifiers&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/willccbb/verifiers&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;willccbb/verifiers&lt;/a&gt;
&lt;/h1&gt;&lt;div align=&#34;center&#34;&gt;
&lt;p align=&#34;center&#34;&gt;
  &lt;h1&gt;Verifiers&lt;/h1&gt;
&lt;/p&gt;
&lt;p&gt;
Environments for LLM Reinforcement Learning
&lt;/p&gt;
&lt;/div&gt;
&lt;h2 id=&#34;overview&#34;&gt;Overview
&lt;/h2&gt;&lt;p&gt;Verifiers is a library of modular components for creating RL environments and training LLM agents. Verifiers includes an async GRPO implementation built around the &lt;code&gt;transformers&lt;/code&gt; Trainer, is supported by &lt;code&gt;prime-rl&lt;/code&gt; for large-scale FSDP training, and can easily be integrated into any RL framework which exposes an OpenAI-compatible inference client. In addition to RL training, Verifiers can be used directly for building LLM evaluations, creating synthetic data pipelines, and implementing agent harnesses.&lt;/p&gt;
&lt;p&gt;Full documentation is available &lt;a class=&#34;link&#34; href=&#34;https://verifiers.readthedocs.io/en/latest/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;setup&#34;&gt;Setup
&lt;/h2&gt;&lt;p&gt;We recommend using &lt;code&gt;verifiers&lt;/code&gt; along with &lt;a class=&#34;link&#34; href=&#34;https://docs.astral.sh/uv/getting-started/installation/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;uv&lt;/a&gt; for dependency management in your own project:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -LsSf https://astral.sh/uv/install.sh &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; sh
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv init &lt;span class=&#34;c1&#34;&gt;# create a fresh project&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For local (CPU) development and evaluation with API models, do:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv add verifiers &lt;span class=&#34;c1&#34;&gt;# uv add &amp;#39;verifiers[dev]&amp;#39; for Jupyter + testing support&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For training on GPUs with &lt;code&gt;vf.GRPOTrainer&lt;/code&gt;, do:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv add &lt;span class=&#34;s1&#34;&gt;&amp;#39;verifiers[all]&amp;#39;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;amp;&amp;amp;&lt;/span&gt; uv pip install flash-attn --no-build-isolation
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To use the latest &lt;code&gt;main&lt;/code&gt; branch, do:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv add &amp;#39;verifiers @ git+https://github.com/willccbb/verifiers.git&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To use with &lt;code&gt;prime-rl&lt;/code&gt;, see &lt;a class=&#34;link&#34; href=&#34;https://github.com/PrimeIntellect-ai/prime-rl&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To install &lt;code&gt;verifiers&lt;/code&gt; from source for core library development, do:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/willccbb/verifiers.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; verifiers
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv sync --all-extras &lt;span class=&#34;o&#34;&gt;&amp;amp;&amp;amp;&lt;/span&gt; uv pip install flash-attn --no-build-isolation
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv run pre-commit install
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;In general, we recommend that you build and train Environments &lt;em&gt;with&lt;/em&gt; &lt;code&gt;verifiers&lt;/code&gt;, not &lt;em&gt;in&lt;/em&gt; &lt;code&gt;verifiers&lt;/code&gt;. If you find yourself needing to clone and modify the core library in order to implement key functionality for your project, we&amp;rsquo;d love for you to open an issue so that we can try and streamline the development experience. Our aim is for &lt;code&gt;verifiers&lt;/code&gt; to be a reliable toolkit to build on top of, and to minimize the &amp;ldquo;fork proliferation&amp;rdquo; which often pervades the RL infrastructure ecosystem.&lt;/p&gt;
&lt;h2 id=&#34;environments&#34;&gt;Environments
&lt;/h2&gt;&lt;p&gt;Environments in Verifiers are installable Python modules which can specify dependencies in a &lt;code&gt;pyproject.toml&lt;/code&gt;, and which expose a &lt;code&gt;load_environment&lt;/code&gt; function for instantiation by downstream applications (e.g. trainers). See &lt;code&gt;environments/&lt;/code&gt; for examples.&lt;/p&gt;
&lt;p&gt;To initialize a blank Environment module template, do:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;vf-init vf-environment-name &lt;span class=&#34;c1&#34;&gt;# -p /path/to/environments (defaults to &amp;#34;./environments&amp;#34;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To install an Environment module into your project, do:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;vf-install vf-environment-name &lt;span class=&#34;c1&#34;&gt;# -p /path/to/environments (defaults to &amp;#34;./environments&amp;#34;) &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To install an Environment module from this repo&amp;rsquo;s &lt;code&gt;environments&lt;/code&gt; folder, do:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;vf-install vf-math-python --from-repo &lt;span class=&#34;c1&#34;&gt;# -b branch_or_commit (defaults to &amp;#34;main&amp;#34;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Once an Environment module is installed, you can create an instance of the Environment using &lt;code&gt;load_environment&lt;/code&gt;, passing any necessary args:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;verifiers&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;vf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;vf_env&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_environment&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;vf-environment-name&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;env_args&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To run a quick evaluation of your Environment with an API-based model, do:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;vf-eval vf-environment-name &lt;span class=&#34;c1&#34;&gt;# vf-eval -h for config options; defaults to gpt-4.1-mini, 5 prompts, 3 rollouts for each&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The core elements of Environments are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Datasets: a Hugging Face &lt;code&gt;Dataset&lt;/code&gt; with a &lt;code&gt;prompt&lt;/code&gt; column for inputs, and either &lt;code&gt;answer (str)&lt;/code&gt; or &lt;code&gt;info (dict)&lt;/code&gt; columns for evaluation&lt;/li&gt;
&lt;li&gt;Rollout logic: interactions between models and the environment (e.g. &lt;code&gt;env_response&lt;/code&gt; + &lt;code&gt;is_completed&lt;/code&gt; for any &lt;code&gt;MultiTurnEnv&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Rubrics: an encapsulation for one or more reward functions&lt;/li&gt;
&lt;li&gt;Parsers: optional; an encapsulation for reusable parsing logic&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We support both &lt;code&gt;/v1/chat/completions&lt;/code&gt;-style and &lt;code&gt;/v1/completions&lt;/code&gt;-style inference via OpenAI clients, though we generally recommend &lt;code&gt;/v1/chat/completions&lt;/code&gt;-style inference for the vast majority of applications. Both the included &lt;code&gt;GRPOTrainer&lt;/code&gt; and &lt;code&gt;prime-rl&lt;/code&gt; support the full set of &lt;a class=&#34;link&#34; href=&#34;https://docs.vllm.ai/en/v0.6.0/dev/sampling_params.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SamplingParams&lt;/a&gt; exposed by vLLM (via their OpenAI-compatible &lt;a class=&#34;link&#34; href=&#34;https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;server&lt;/a&gt; interface), and leveraging this will often be the appropriate way to implement rollout strategies requiring finer-grained control, such as interrupting and resuming generations for interleaved tool use, or enforcing reasoning budgets.&lt;/p&gt;
&lt;p&gt;The primary constraint we impose on rollout logic is that token sequences must be &lt;em&gt;increasing&lt;/em&gt;, i.e. once a token has been added to a model&amp;rsquo;s context in a rollout, it must remain as the rollout progresses. Note that this causes issues with some popular reasoning models such as the Qwen3 and DeepSeek-R1-Distill series; see &lt;a class=&#34;link&#34; href=&#34;#footguns&#34; &gt;Footguns&lt;/a&gt; for guidance on adapting these models to support multi-turn rollouts.&lt;/p&gt;
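The increasing-tokens constraint can be stated as a simple invariant over the token contexts seen during a rollout. A minimal sketch for checking it (illustrative only, not part of the library API):

```python
def is_increasing(contexts):
    """Check that each token context in a rollout extends the previous one,
    i.e. tokens are only ever appended, never removed or rewritten."""
    return all(
        later[: len(earlier)] == earlier
        for earlier, later in zip(contexts, contexts[1:])
    )
```

Models that strip prior reasoning from their context between turns violate this invariant, which is why the adaptations described under Footguns are needed.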
&lt;h3 id=&#34;singleturnenv&#34;&gt;SingleTurnEnv
&lt;/h3&gt;&lt;p&gt;For tasks requiring only a single response from a model for each prompt, you can use &lt;code&gt;SingleTurnEnv&lt;/code&gt; directly by specifying a Dataset and a Rubric. Rubrics are sets of reward functions, which can be either sync or async.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;datasets&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;load_dataset&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;verifiers&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;vf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;dataset&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;load_dataset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;my-account/my-dataset&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;train&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;reward_A&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;completion&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;info&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;float&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;c1&#34;&gt;# reward fn, e.g. correctness&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;o&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;reward_B&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;parser&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;completion&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;float&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;c1&#34;&gt;# auxiliary reward fn, e.g. format&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;o&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;async&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;metric&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;completion&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;float&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;c1&#34;&gt;# non-reward metric, e.g. proper noun count&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;o&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;rubric&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Rubric&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;funcs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reward_A&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;reward_B&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;metric&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;weights&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;1.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;vf_env&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SingleTurnEnv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;n&#34;&gt;dataset&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dataset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;n&#34;&gt;rubric&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rubric&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;results&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vf_env&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;evaluate&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;client&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OpenAI&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;gpt-4.1-mini&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;num_examples&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;100&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rollouts_per_example&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;vf_env&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;make_dataset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;results&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# HF dataset format&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Datasets should be formatted with columns for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&#39;prompt&#39; (List[ChatMessage])&lt;/code&gt; OR &lt;code&gt;&#39;question&#39; (str)&lt;/code&gt; fields
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ChatMessage&lt;/code&gt; = e.g. &lt;code&gt;{&#39;role&#39;: &#39;user&#39;, &#39;content&#39;: &#39;...&#39;}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;if &lt;code&gt;question&lt;/code&gt; is set instead of &lt;code&gt;prompt&lt;/code&gt;, you can also pass &lt;code&gt;system_prompt (str)&lt;/code&gt; and/or &lt;code&gt;few_shot (List[ChatMessage])&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;answer (str)&lt;/code&gt; AND/OR &lt;code&gt;info (dict)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;task (str)&lt;/code&gt;: optional, used by &lt;code&gt;EnvGroup&lt;/code&gt; and &lt;code&gt;RubricGroup&lt;/code&gt; for orchestrating composition of Environments and Rubrics&lt;/li&gt;
&lt;/ul&gt;
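For illustration, single rows in the two prompt styles described above might look like the following (the field values are placeholders, not from a real dataset):

```python
# Hypothetical rows showing the two supported prompt styles.
chat_row = {
    "prompt": [{"role": "user", "content": "What is 2 + 2?"}],  # List[ChatMessage]
    "answer": "4",          # str consumed directly by reward functions
    "task": "arithmetic",   # optional tag for EnvGroup / RubricGroup
}
question_row = {
    "question": "What is 2 + 2?",  # plain str; the prompt is built for you
    "info": {"expected": 4},       # dict of auxiliary evaluation data
}
```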
&lt;p&gt;The following named attributes are available for use by reward functions in your Rubric:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;prompt&lt;/code&gt;: sequence of input messages&lt;/li&gt;
&lt;li&gt;&lt;code&gt;completion&lt;/code&gt;: sequence of messages generated during rollout by model and Environment&lt;/li&gt;
&lt;li&gt;&lt;code&gt;answer&lt;/code&gt;: primary answer column, optional if &lt;code&gt;info&lt;/code&gt; is used&lt;/li&gt;
&lt;li&gt;&lt;code&gt;state&lt;/code&gt;: can be modified during rollout to accumulate any metadata (&lt;code&gt;state[&#39;responses&#39;]&lt;/code&gt; includes full OpenAI response objects by default)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;info&lt;/code&gt;: auxiliary info needed for reward computation (e.g. test cases), optional if &lt;code&gt;answer&lt;/code&gt; is used&lt;/li&gt;
&lt;li&gt;&lt;code&gt;task&lt;/code&gt;: tag for task type (used by &lt;code&gt;EnvGroup&lt;/code&gt; and &lt;code&gt;RubricGroup&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;parser&lt;/code&gt;: the parser object declared. Note: &lt;code&gt;vf.Parser().get_format_reward_func()&lt;/code&gt; is a no-op (always 1.0); use &lt;code&gt;vf.ThinkParser&lt;/code&gt; or a custom parser if you want a real format adherence reward.&lt;/li&gt;
&lt;/ul&gt;
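Putting a few of these attributes together, a reward function might look like this sketch (it assumes chat-style completions, i.e. a list of message dicts, and falls back to plain strings):

```python
def exact_match_reward(completion, answer, **kwargs) -> float:
    """Return 1.0 if the model's final message matches `answer` exactly
    (after stripping whitespace), else 0.0."""
    # Chat-style completions are a list of messages; take the last one.
    final = completion[-1]["content"] if isinstance(completion, list) else completion
    return 1.0 if final.strip() == answer.strip() else 0.0
```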
&lt;p&gt;For tasks involving LLM judges, you may wish to use &lt;code&gt;vf.JudgeRubric()&lt;/code&gt; for managing requests to auxiliary models.&lt;/p&gt;
&lt;p&gt;Note on concurrency: environment APIs accept &lt;code&gt;max_concurrent&lt;/code&gt; to control parallel rollouts. The &lt;code&gt;vf-eval&lt;/code&gt; CLI currently exposes &lt;code&gt;--max-concurrent-requests&lt;/code&gt;; ensure this maps to your environment’s concurrency as expected.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;vf-eval&lt;/code&gt; also supports specifying &lt;code&gt;sampling_args&lt;/code&gt; as a JSON object, which is sent to the vLLM inference engine:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;vf-eval vf-environment-name --sampling-args &lt;span class=&#34;s1&#34;&gt;&amp;#39;{&amp;#34;reasoning_effort&amp;#34;: &amp;#34;low&amp;#34;}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Use &lt;code&gt;vf-eval -s&lt;/code&gt; to save outputs as dataset-formatted JSON, and view all locally-saved eval results with &lt;code&gt;vf-tui&lt;/code&gt;.&lt;/p&gt;
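If you assemble the JSON from Python, serializing a dict avoids shell-quoting mistakes. A small sketch (the sampling keys below are examples; which keys are valid depends on your inference server):

```python
import json
import shlex

# Example sampling parameters; serialize and quote for the CLI.
sampling_args = {"temperature": 0.7, "max_tokens": 512}
arg = shlex.quote(json.dumps(sampling_args))
cmd = f"vf-eval vf-environment-name --sampling-args {arg}"
```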
&lt;h3 id=&#34;toolenv&#34;&gt;ToolEnv
&lt;/h3&gt;&lt;p&gt;For many applications involving tool use, you can use &lt;code&gt;ToolEnv&lt;/code&gt; to leverage models&amp;rsquo; native tool/function-calling capabilities in an agentic loop. Tools can be specified as generic Python functions (with type hints and docstrings), which will then be passed in JSON schema form to each inference request.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;verifiers&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;vf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;vf_env&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ToolEnv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;n&#34;&gt;dataset&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;...&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# HF Dataset with &amp;#39;prompt&amp;#39;/&amp;#39;question&amp;#39; + &amp;#39;answer&amp;#39;/&amp;#39;info&amp;#39; columns&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;n&#34;&gt;rubric&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;...&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# Rubric object; vf.ToolRubric() can be optionally used for counting tool invocations in each rollout&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;n&#34;&gt;tools&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;search_tool&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;read_article_tool&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;python_tool&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# python functions with type hints + docstrings&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;n&#34;&gt;max_turns&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;In cases where your tools require heavy computational resources, we recommend hosting your tools as standalone servers (e.g. MCP servers) and creating lightweight wrapper functions to pass to &lt;code&gt;ToolEnv&lt;/code&gt;. Parallel tool call support is enabled by default.&lt;/p&gt;
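A tool is just a typed, documented Python function. Below, `word_count` is a toy stand-in for something like `search_tool`, and the schema dict only gestures at the signature-to-JSON-schema conversion, which the library performs for you when you pass tools to `ToolEnv`:

```python
import inspect
import typing

def word_count(text: str) -> int:
    """Count whitespace-separated words in `text`."""
    return len(text.split())

# Rough sketch of how a signature and docstring become a tool spec;
# the real JSON-schema generation is handled internally by ToolEnv.
hints = typing.get_type_hints(word_count)
tool_spec = {
    "name": word_count.__name__,
    "description": inspect.getdoc(word_count),
    "parameters": {k: v.__name__ for k, v in hints.items() if k != "return"},
}
```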
&lt;p&gt;For training or self-hosted endpoints, you&amp;rsquo;ll want to enable auto tool choice in &lt;a class=&#34;link&#34; href=&#34;https://docs.vllm.ai/en/stable/features/tool_calling.html#automatic-function-calling&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;vLLM&lt;/a&gt; with the appropriate parser. If your model does not support native tool calling, you may find the &lt;code&gt;XMLParser&lt;/code&gt; abstraction useful for rolling your own tool call parsing on top of &lt;code&gt;MultiTurnEnv&lt;/code&gt;; see &lt;code&gt;environments/xml_tool_env&lt;/code&gt; for an example.&lt;/p&gt;
&lt;h3 id=&#34;multiturnenv&#34;&gt;MultiTurnEnv
&lt;/h3&gt;&lt;p&gt;Both &lt;code&gt;SingleTurnEnv&lt;/code&gt; and &lt;code&gt;ToolEnv&lt;/code&gt; are subclasses of &lt;code&gt;MultiTurnEnv&lt;/code&gt;, which exposes an interface for writing custom Environment interaction protocols. The two methods you must override are:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;typing&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Tuple&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;verifiers&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;vf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;verifiers.types&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Messages&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;State&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;class&lt;/span&gt; &lt;span class=&#34;nc&#34;&gt;YourMultiTurnEnv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MultiTurnEnv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;fm&#34;&gt;__init__&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;bp&#34;&gt;self&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                 &lt;span class=&#34;n&#34;&gt;dataset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Dataset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                 &lt;span class=&#34;n&#34;&gt;rubric&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Rubric&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                 &lt;span class=&#34;n&#34;&gt;max_turns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;int&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                 &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;kwargs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;nb&#34;&gt;super&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;fm&#34;&gt;__init__&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dataset&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dataset&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rubric&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rubric&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;max_turns&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;max_turns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;kwargs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;k&#34;&gt;async&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;is_completed&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;bp&#34;&gt;self&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Messages&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;state&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;State&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;kwargs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;bool&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;c1&#34;&gt;# return whether or not a rollout is completed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;k&#34;&gt;async&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;env_response&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;bp&#34;&gt;self&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Messages&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;state&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;State&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;kwargs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Tuple&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Messages&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;State&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;c1&#34;&gt;# return new environment message(s) + updated state&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If your application requires more fine-grained control than &lt;code&gt;MultiTurnEnv&lt;/code&gt; allows, you may want to inherit from the base &lt;code&gt;Environment&lt;/code&gt; class directly and override the &lt;code&gt;rollout&lt;/code&gt; method.&lt;/p&gt;
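As a library-free sketch of the two hooks, here is the interaction logic a number-guessing environment might implement. The stand-in types and the game itself are illustrative, not part of verifiers:

```python
# Stand-ins for verifiers' Messages and State types.
Messages = list   # list of {"role": ..., "content": ...} dicts
State = dict

def is_completed(messages: Messages, state: State) -> bool:
    # End the rollout once the model has guessed the target number.
    return state.get("solved", False)

def env_response(messages: Messages, state: State) -> tuple:
    # Grade the model's last guess and respond with a hint.
    guess = int(messages[-1]["content"])
    if guess == state["target"]:
        state["solved"] = True
        reply = "correct"
    else:
        reply = "lower" if guess > state["target"] else "higher"
    return [{"role": "user", "content": reply}], state
```

In the real class, these would be `async` methods on a `vf.MultiTurnEnv` subclass, as in the signature sketch above.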
&lt;h2 id=&#34;training&#34;&gt;Training
&lt;/h2&gt;&lt;h3 id=&#34;grpotrainer&#34;&gt;GRPOTrainer
&lt;/h3&gt;&lt;p&gt;The included trainer (&lt;code&gt;vf.GRPOTrainer&lt;/code&gt;) supports GRPO-style RL training via Accelerate/DeepSpeed, uses vLLM for inference, and supports full-parameter finetuning. It is optimized for efficiently training dense transformer models on 2-16 GPUs.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# install environment&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;vf-install vf-wordle &lt;span class=&#34;o&#34;&gt;(&lt;/span&gt;-p /path/to/environments &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; --from-repo&lt;span class=&#34;o&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# quick eval&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;vf-eval vf-wordle -m &lt;span class=&#34;o&#34;&gt;(&lt;/span&gt;model_name in configs/endpoints.py&lt;span class=&#34;o&#34;&gt;)&lt;/span&gt; -n NUM_EXAMPLES -r ROLLOUTS_PER_EXAMPLE
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# inference (shell 0)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1,2,3,4,5 vf-vllm --model willcb/Qwen3-1.7B-Wordle &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --data-parallel-size &lt;span class=&#34;m&#34;&gt;6&lt;/span&gt; --enforce-eager --disable-log-requests
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# training (shell 1)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;6,7 accelerate launch --num-processes &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --config-file configs/zero3.yaml examples/grpo/train_wordle.py --size 1.7B
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Alternatively, you can train environments with the external &lt;code&gt;prime-rl&lt;/code&gt; project (FSDP-first orchestration); see the &lt;code&gt;prime-rl&lt;/code&gt; README for installation instructions. For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-toml&#34; data-lang=&#34;toml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c&#34;&gt;# orchestrator config (prime-rl)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;nx&#34;&gt;environment&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nx&#34;&gt;id&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;vf-math-python&amp;#34;&lt;/span&gt;  &lt;span class=&#34;c&#34;&gt;# or your environment ID&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# run (prime-rl)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv run rl &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --trainer @ configs/your_exp/train.toml &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --orchestrator @ configs/your_exp/orch.toml &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --inference @ configs/your_exp/infer.toml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;troubleshooting&#34;&gt;Troubleshooting
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Ensure your &lt;code&gt;wandb&lt;/code&gt; and &lt;code&gt;huggingface-cli&lt;/code&gt; logins are set up (or set &lt;code&gt;report_to=None&lt;/code&gt; in &lt;code&gt;training_args&lt;/code&gt;). You should also set &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; in your environment (a dummy key works for vLLM).&lt;/li&gt;
&lt;li&gt;If using high max concurrency, increase the number of allowed open sockets (e.g. &lt;code&gt;ulimit -n 4096&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;On some setups, inter-GPU communication can &lt;a class=&#34;link&#34; href=&#34;https://github.com/huggingface/trl/issues/2923&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;hang&lt;/a&gt; or crash during vLLM weight syncing. This can usually be alleviated by setting (or unsetting) &lt;code&gt;NCCL_P2P_DISABLE=1&lt;/code&gt; in your environment (or potentially &lt;code&gt;NCCL_CUMEM_ENABLE=1&lt;/code&gt;). Try this as your first step if you experience NCCL-related issues.&lt;/li&gt;
&lt;li&gt;If problems persist, please open an &lt;a class=&#34;link&#34; href=&#34;https://github.com/willccbb/verifiers/issues&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;issue&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;resource-requirements&#34;&gt;Resource Requirements
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;GRPOTrainer&lt;/code&gt; is optimized for setups with at least 2 GPUs, scaling up to multiple nodes. 2-GPU setups with sufficient memory for small-scale experimentation can be &lt;a class=&#34;link&#34; href=&#34;https://app.primeintellect.ai/dashboard/create-cluster?image=ubuntu_22_cuda_12&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;rented&lt;/a&gt; for &amp;lt;$1/hr.&lt;/p&gt;
&lt;h3 id=&#34;prime-rl&#34;&gt;PRIME-RL
&lt;/h3&gt;&lt;p&gt;If you do not require LoRA support, you may want to use the &lt;code&gt;prime-rl&lt;/code&gt; trainer: it natively supports Environments created with &lt;code&gt;verifiers&lt;/code&gt;, is optimized for performance and scalability via FSDP, includes a broader set of configuration options and user-experience features, and has more battle-tested defaults. Both trainers support asynchronous rollouts, and use a one-step off-policy delay by default to overlap training and inference. See the &lt;code&gt;prime-rl&lt;/code&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/PrimeIntellect-ai/prime-rl&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;docs&lt;/a&gt; for usage instructions.&lt;/p&gt;
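The one-step off-policy overlap mentioned above can be pictured with a toy schedule (an illustration of the scheduling idea only, not `prime-rl` or `verifiers` internals):

```python
# Toy illustration of a one-step off-policy schedule: each training step
# consumes rollouts generated with the previous step's weights, so rollout
# generation can overlap with training. Not prime-rl/verifiers internals.

trainer_version = 0    # weights version held by the trainer
generator_version = 0  # weights version loaded in the inference server
log = []

for step in range(1, 4):
    # Inference generates the batch for this step with the currently
    # loaded weights (one update behind after the first step).
    batch_version = generator_version
    trainer_version += 1                 # placeholder for a GRPO update
    generator_version = trainer_version  # weight sync after the step
    log.append((step, batch_version, trainer_version))

# Each consumed batch is exactly one version behind the post-update weights.
```

The payoff of the one-step delay is that the inference server never sits idle waiting for a weight update, at the cost of training on slightly stale rollouts.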
&lt;h2 id=&#34;further-documentation&#34;&gt;Further Documentation
&lt;/h2&gt;&lt;p&gt;See the full &lt;a class=&#34;link&#34; href=&#34;https://verifiers.readthedocs.io/en/latest/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;docs&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id=&#34;contributions&#34;&gt;Contributions
&lt;/h2&gt;&lt;p&gt;Verifiers warmly welcomes community contributions! Please open an issue or PR if you encounter bugs or other pain points during your development, or start a discussion for more open-ended questions.&lt;/p&gt;
&lt;p&gt;Please note that the core &lt;code&gt;verifiers/&lt;/code&gt; library is intended to be a relatively lightweight set of reusable components rather than an exhaustive catalog of RL environments. For &lt;em&gt;applications&lt;/em&gt; of &lt;code&gt;verifiers&lt;/code&gt; (e.g. &amp;ldquo;an Environment for XYZ task&amp;rdquo;), you are welcome to submit a PR for a self-contained module that lives within &lt;code&gt;environments/&lt;/code&gt; if it serves as a canonical example of a new pattern. Stay tuned for more info shortly about our plans for supporting community Environment contributions 🙂&lt;/p&gt;
&lt;h2 id=&#34;citation&#34;&gt;Citation
&lt;/h2&gt;&lt;p&gt;If you use this code in your research, please cite:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bibtex&#34; data-lang=&#34;bibtex&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nc&#34;&gt;@misc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nl&#34;&gt;brown_verifiers_2025&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;author&lt;/span&gt;       &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{William Brown}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;title&lt;/span&gt;        &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{{Verifiers}: Reinforcement Learning with LLMs in Verifiable Environments}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;howpublished&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{\url{https://github.com/willccbb/verifiers}}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;note&lt;/span&gt;         &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{Commit abcdefg • accessed DD Mon YYYY}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;year&lt;/span&gt;         &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{2025}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;roadmap&#34;&gt;Roadmap
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;A community Environments hub for crowdsourcing, sharing, and discovering new RL environments built with &lt;code&gt;verifiers&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Default patterns for hosted resources such as code sandboxes, auxiliary models, and MCP servers&lt;/li&gt;
&lt;li&gt;Multimodal input support&lt;/li&gt;
&lt;li&gt;Non-increasing token sequences via REINFORCE&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>ART</title>
        <link>https://producthunt.programnotes.cn/en/p/art/</link>
        <pubDate>Thu, 28 Aug 2025 15:29:23 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/art/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1696345452312-dd623f76551c?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTYzNjYwODJ8&amp;ixlib=rb-4.1.0" alt="Featured image of post ART" /&gt;&lt;h1 id=&#34;openpipeart&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenPipe/ART&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenPipe/ART&lt;/a&gt;
&lt;/h1&gt;&lt;div align=&#34;center&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://art.openpipe.ai&#34;&gt;&lt;picture&gt;
&lt;img alt=&#34;ART logo&#34; src=&#34;https://github.com/openpipe/art/raw/main/assets/ART_logo.png&#34; width=&#34;160px&#34;&gt;
&lt;/picture&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;
  &lt;h1&gt;Agent Reinforcement Trainer&lt;/h1&gt;
&lt;/p&gt;
&lt;p&gt;
Train multi-step agents for real-world tasks using GRPO.
&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/openpipe/art/blob/main/CONTRIBUTING.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/PRs-welcome-blue.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;PRs-Welcome&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://pypi.org/project/openpipe-art/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/pypi/v/openpipe-art?color=364fc7&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;PyPI version&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/openpipe/art-notebooks/blob/main/examples/2048/2048.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Train Agent&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://discord.gg/zbBHRUpwf4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/Join%20Discord-5865F2?style=plastic&amp;amp;logo=discord&amp;amp;logoColor=white&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Join Discord&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://art.openpipe.ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/Documentation-orange?style=plastic&amp;amp;logo=gitbook&amp;amp;logoColor=white&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Documentation&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;h2 id=&#34;-ruler-zero-shot-agent-rewards&#34;&gt;📏 RULER: Zero-Shot Agent Rewards
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;RULER&lt;/strong&gt; (Relative Universal LLM-Elicited Rewards) eliminates the need for hand-crafted reward functions by using an LLM-as-judge to automatically score agent trajectories. Simply define your task in the system prompt, and RULER handles the rest—&lt;strong&gt;no labeled data, expert feedback, or reward engineering required&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;✨ &lt;strong&gt;Key Benefits:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;2-3x faster development&lt;/strong&gt; - Skip reward function engineering entirely&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;General-purpose&lt;/strong&gt; - Works across any task without modification&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strong performance&lt;/strong&gt; - Matches or exceeds hand-crafted rewards in 3/4 benchmarks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Easy integration&lt;/strong&gt; - Drop-in replacement for manual reward functions&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Before: Hours of reward engineering&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;complex_reward_function&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;trajectory&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;c1&#34;&gt;# 50+ lines of careful scoring logic...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;pass&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# After: One line with RULER&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;judged_group&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ruler_score_group&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;group&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;openai/o3&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://art.openpipe.ai/fundamentals/ruler&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;📖 Learn more about RULER →&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;art-overview&#34;&gt;ART Overview
&lt;/h2&gt;&lt;p&gt;ART is an open-source RL framework that improves agent reliability by allowing LLMs to &lt;strong&gt;learn from experience&lt;/strong&gt;. ART provides an ergonomic harness for integrating GRPO into any Python application. For a quick hands-on introduction, run one of the notebooks below. When you&amp;rsquo;re ready to learn more, check out the &lt;a class=&#34;link&#34; href=&#34;https://art.openpipe.ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;docs&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;-notebooks&#34;&gt;📒 Notebooks
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Agent Task&lt;/th&gt;
          &lt;th&gt;Example Notebook&lt;/th&gt;
          &lt;th&gt;Description&lt;/th&gt;
          &lt;th&gt;Comparative Performance&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;ART•E LangGraph&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/openpipe/art-notebooks/blob/main/examples/langgraph/art-e-langgraph.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🏋️ Train agent&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Qwen 2.5 7B learns to search emails using LangGraph&lt;/td&gt;
          &lt;td&gt;[Link coming soon]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;MCP•RL&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/openpipe/art-notebooks/blob/main/examples/mcp-rl/mcp-rl.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🏋️ Train agent&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Qwen 2.5 3B masters the NWS MCP server&lt;/td&gt;
          &lt;td&gt;[Link coming soon]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;ART•E [RULER]&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/openpipe/art-notebooks/blob/main/examples/art-e.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🏋️ Train agent&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Qwen 2.5 7B learns to search emails using RULER&lt;/td&gt;
          &lt;td&gt;&lt;img src=&#34;https://github.com/openpipe/art/raw/main/assets/benchmarks/email_agent/accuracy-training-progress.svg&#34; height=&#34;72&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://producthunt.programnotes.cn/dev/art-e/art_e/evaluate/display_benchmarks.ipynb&#34; &gt;benchmarks&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;2048&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/openpipe/art-notebooks/blob/main/examples/2048/2048.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🏋️ Train agent&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Qwen 2.5 3B learns to play 2048&lt;/td&gt;
          &lt;td&gt;&lt;img src=&#34;https://github.com/openpipe/art/raw/main/assets/benchmarks/2048/accuracy-training-progress.svg&#34; height=&#34;72&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://producthunt.programnotes.cn/examples/2048/display_benchmarks.ipynb&#34; &gt;benchmarks&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Temporal Clue&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/openpipe/art-notebooks/blob/main/examples/temporal_clue/temporal-clue.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🏋️ Train agent&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Qwen 2.5 7B learns to solve Temporal Clue&lt;/td&gt;
          &lt;td&gt;[Link coming soon]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Tic Tac Toe&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/openpipe/art-notebooks/blob/main/examples/tic_tac_toe/tic-tac-toe.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🏋️ Train agent&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Qwen 2.5 3B learns to play Tic Tac Toe&lt;/td&gt;
          &lt;td&gt;&lt;img src=&#34;https://github.com/openpipe/art/raw/main/assets/benchmarks/tic-tac-toe-local/accuracy-training-progress.svg&#34; height=&#34;72&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://producthunt.programnotes.cn/examples/tic_tac_toe/display-benchmarks.ipynb&#34; &gt;benchmarks&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Codenames&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/openpipe/art-notebooks/blob/main/examples/codenames/Codenames_RL.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🏋️ Train agent&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Qwen 2.5 3B learns to play Codenames&lt;/td&gt;
          &lt;td&gt;&lt;img src=&#34;https://github.com/openpipe/art/raw/main/assets/benchmarks/codenames/win_rate_over_time.png&#34; height=&#34;72&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenPipe/art-notebooks/blob/main/examples/codenames/Codenames_RL.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;benchmarks&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;AutoRL [RULER]&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/openpipe/art-notebooks/blob/main/examples/auto_rl.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🏋️ Train agent&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Train Qwen 2.5 7B to master any task&lt;/td&gt;
          &lt;td&gt;[Link coming soon]&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;-art-news&#34;&gt;📰 ART News
&lt;/h2&gt;&lt;p&gt;Explore our latest research and updates on building SOTA agents.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;🗞️ &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://art.openpipe.ai/integrations/langgraph-integration&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ART now integrates seamlessly with LangGraph&lt;/a&gt;&lt;/strong&gt; - Train your LangGraph agents with reinforcement learning for smarter multi-step reasoning and improved tool usage.&lt;/li&gt;
&lt;li&gt;🗞️ &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://x.com/corbtt/status/1953171838382817625&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MCP•RL: Teach Your Model to Master Any MCP Server&lt;/a&gt;&lt;/strong&gt; - Automatically train models to effectively use MCP server tools through reinforcement learning.&lt;/li&gt;
&lt;li&gt;🗞️ &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://x.com/mattshumer_/status/1950572449025650733&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;AutoRL: Zero-Data Training for Any Task&lt;/a&gt;&lt;/strong&gt; - Train custom AI models without labeled data using automatic input generation and RULER evaluation.&lt;/li&gt;
&lt;li&gt;🗞️ &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://openpipe.ai/blog/ruler-easy-mode-for-rl-rewards&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RULER: Easy Mode for RL Rewards&lt;/a&gt;&lt;/strong&gt; is now available for automatic reward generation in reinforcement learning.&lt;/li&gt;
&lt;li&gt;🗞️ &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://openpipe.ai/blog/art-e-mail-agent&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ART·E: How We Built an Email Research Agent That Beats o3&lt;/a&gt;&lt;/strong&gt; demonstrates a Qwen 2.5 14B email agent outperforming OpenAI&amp;rsquo;s o3.&lt;/li&gt;
&lt;li&gt;🗞️ &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://openpipe.ai/blog/art-trainer&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ART Trainer: A New RL Trainer for Agents&lt;/a&gt;&lt;/strong&gt; enables easy training of LLM-based agents using GRPO.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://openpipe.ai/blog&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;📖 See all blog posts →&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;why-art&#34;&gt;Why ART?
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;ART provides convenient wrappers for introducing RL training into &lt;strong&gt;existing applications&lt;/strong&gt;. We abstract the training server into a modular service that your code doesn&amp;rsquo;t need to interface with.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Train from anywhere.&lt;/strong&gt; Run the ART client on your laptop and let the ART server kick off an ephemeral GPU-enabled environment, or run on a local GPU.&lt;/li&gt;
&lt;li&gt;Integrations with hosted platforms like W&amp;amp;B, Langfuse, and OpenPipe provide flexible observability and &lt;strong&gt;simplify debugging&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;ART is customizable with &lt;strong&gt;intelligent defaults&lt;/strong&gt;. You can configure training parameters and inference engine configurations to meet specific needs, or take advantage of the defaults, which have been optimized for training efficiency and stability.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;installation&#34;&gt;Installation
&lt;/h2&gt;&lt;p&gt;ART agents can be trained from any client machine that runs Python. To add ART to an existing project, run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install openpipe-art
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;-arte-agent&#34;&gt;🤖 ART•E Agent
&lt;/h2&gt;&lt;p&gt;Curious about how to use ART for a real-world task? Check out the &lt;a class=&#34;link&#34; href=&#34;https://openpipe.ai/blog/art-e-mail-agent&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ART•E Agent&lt;/a&gt; blog post, where we detail how we trained Qwen 2.5 14B to beat o3 at email retrieval!&lt;/p&gt;
&lt;img src=&#34;https://github.com/openpipe/art/raw/main/assets/ART_E_graphs.png&#34; width=&#34;700&#34;&gt;
&lt;h2 id=&#34;-training-loop-overview&#34;&gt;🔁 Training Loop Overview
&lt;/h2&gt;&lt;p&gt;ART&amp;rsquo;s functionality is divided into a &lt;strong&gt;client&lt;/strong&gt; and a &lt;strong&gt;server&lt;/strong&gt;. The OpenAI-compatible client is responsible for interfacing between ART and your codebase. Using the client, you can pass messages and get completions from your LLM as it improves. The server runs independently on any machine with a GPU. It abstracts away the complexity of the inference and training portions of the RL loop while allowing for some custom configuration. An outline of the training loop is shown below:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Inference&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Your code uses the ART client to perform an agentic workflow (usually executing several rollouts in parallel to gather data faster).&lt;/li&gt;
&lt;li&gt;Completion requests are routed to the ART server, which runs the model&amp;rsquo;s latest LoRA in vLLM.&lt;/li&gt;
&lt;li&gt;As the agent executes, each &lt;code&gt;system&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, and &lt;code&gt;assistant&lt;/code&gt; message is stored in a Trajectory.&lt;/li&gt;
&lt;li&gt;When a rollout finishes, your code assigns a &lt;code&gt;reward&lt;/code&gt; to its Trajectory, indicating the performance of the LLM.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Training&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When each rollout has finished, Trajectories are grouped and sent to the server. Inference is blocked while training executes.&lt;/li&gt;
&lt;li&gt;The server trains your model using GRPO, initializing from the latest checkpoint (or an empty LoRA on the first iteration).&lt;/li&gt;
&lt;li&gt;The server saves the newly trained LoRA to a local directory and loads it into vLLM.&lt;/li&gt;
&lt;li&gt;Inference is unblocked and the loop resumes at step 1.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This training loop runs until a specified number of inference and training iterations have completed.&lt;/p&gt;
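The loop above can be sketched in client-side pseudocode (all names here are hypothetical stand-ins, not the real ART client API):

```python
# Hypothetical sketch of the inference/training loop described above.
# run_rollout() and train_on() are stand-ins, not the real ART client API.

def run_rollout(prompt: str) -> dict:
    """One agentic rollout: collect messages, then assign a reward."""
    trajectory = {"messages": [{"role": "user", "content": prompt}]}
    completion = f"answer to: {prompt}"  # stand-in for a completion request
    trajectory["messages"].append({"role": "assistant", "content": completion})
    trajectory["reward"] = 1.0 if completion else 0.0  # your scoring logic
    return trajectory

def train_on(groups) -> int:
    """Stand-in for the server-side GRPO update on grouped trajectories."""
    return sum(len(group) for group in groups)

total_trained = 0
for _ in range(2):  # a fixed number of iterations, as described above
    # 1. Inference: gather several rollouts (in practice, in parallel).
    group = [run_rollout(f"task {i}") for i in range(4)]
    # 2. Training: send the grouped trajectories to the server;
    #    inference is blocked until the new LoRA is loaded into vLLM.
    total_trained += train_on([group])
```

In a real application, the completion call and the training call would go through the ART client to the GPU server; everything else (rollout orchestration and reward assignment) stays in your own code.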
&lt;h2 id=&#34;-supported-models&#34;&gt;🧩 Supported Models
&lt;/h2&gt;&lt;p&gt;ART should work with most vLLM/HuggingFace-transformers compatible causal language models, or at least the ones supported by &lt;a class=&#34;link&#34; href=&#34;https://docs.unsloth.ai/get-started/all-our-models&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Unsloth&lt;/a&gt;. Gemma 3 does not appear to be supported for the time being. If any other model isn&amp;rsquo;t working for you, please let us know on &lt;a class=&#34;link&#34; href=&#34;https://discord.gg/zbBHRUpwf4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Discord&lt;/a&gt; or open an issue on &lt;a class=&#34;link&#34; href=&#34;https://github.com/openpipe/art/issues&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GitHub&lt;/a&gt;!&lt;/p&gt;
&lt;h2 id=&#34;-contributing&#34;&gt;🤝 Contributing
&lt;/h2&gt;&lt;p&gt;ART is in active development, and contributions are most welcome! Please see the &lt;a class=&#34;link&#34; href=&#34;CONTRIBUTING.md&#34; &gt;CONTRIBUTING.md&lt;/a&gt; file for more information.&lt;/p&gt;
&lt;h2 id=&#34;-citation&#34;&gt;📖 Citation
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bibtex&#34; data-lang=&#34;bibtex&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nc&#34;&gt;@misc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nl&#34;&gt;hilton2025art&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;author&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{Brad Hilton and Kyle Corbitt and David Corbitt and Saumya Gandhi and Angky William and Bohdan Kovalenskyi and Andie Jones}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{ART: Agent Reinforcement Trainer}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;year&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{2025}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;publisher&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{GitHub}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;journal&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{GitHub repository}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;howpublished&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{\url{https://github.com/openpipe/art}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;-license&#34;&gt;⚖️ License
&lt;/h2&gt;&lt;p&gt;This repository&amp;rsquo;s source code is available under the &lt;a class=&#34;link&#34; href=&#34;LICENSE&#34; &gt;Apache-2.0 License&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;-credits&#34;&gt;🙏 Credits
&lt;/h2&gt;&lt;p&gt;ART stands on the shoulders of giants. While we owe many of the ideas and early experiments that led to ART&amp;rsquo;s development to the open source RL community at large, we&amp;rsquo;re especially grateful to the authors of the following projects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/unslothai/unsloth&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Unsloth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/vllm-project/vllm&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;vLLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/huggingface/trl&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;trl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/pytorch/torchtune&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;torchtune&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/skypilot-org/skypilot&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SkyPilot&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, thank you to our partners who&amp;rsquo;ve helped us test ART in the wild! We&amp;rsquo;re excited to see what you all build with it.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>IsaacLab</title>
        <link>https://producthunt.programnotes.cn/en/p/isaaclab/</link>
        <pubDate>Fri, 04 Jul 2025 15:30:49 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/isaaclab/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1726154659420-d68f2ed9b816?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTE2MTQyMDR8&amp;ixlib=rb-4.1.0" alt="Featured image of post IsaacLab" /&gt;&lt;h1 id=&#34;isaac-simisaaclab&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/isaac-sim/IsaacLab&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;isaac-sim/IsaacLab&lt;/a&gt;
&lt;/h1&gt;&lt;p&gt;&lt;img src=&#34;https://producthunt.programnotes.cn/docs/source/_static/isaaclab.jpg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Isaac Lab&#34;
	
	
&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h1 id=&#34;isaac-lab&#34;&gt;Isaac Lab
&lt;/h1&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.isaacsim.omniverse.nvidia.com/latest/index.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/IsaacSim-4.5.0-silver.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;IsaacSim&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://docs.python.org/3/whatsnew/3.10.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/python-3.10-blue.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Python&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://releases.ubuntu.com/20.04/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/platform-linux--64-orange.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Linux platform&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://www.microsoft.com/en-us/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/platform-windows--64-orange.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Windows platform&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/isaac-sim/IsaacLab/actions/workflows/pre-commit.yaml&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/actions/workflow/status/isaac-sim/IsaacLab/pre-commit.yaml?logo=pre-commit&amp;amp;logoColor=white&amp;amp;label=pre-commit&amp;amp;color=brightgreen&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;pre-commit&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/isaac-sim/IsaacLab/actions/workflows/docs.yaml&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/actions/workflow/status/isaac-sim/IsaacLab/docs.yaml?label=docs&amp;amp;color=brightgreen&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;docs status&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://opensource.org/licenses/BSD-3-Clause&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/license-BSD--3-yellow.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;License&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://opensource.org/license/apache-2-0&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/license-Apache--2.0-yellow.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;License&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Isaac Lab&lt;/strong&gt; is a GPU-accelerated, open-source framework designed to unify and simplify robotics research workflows, such as reinforcement learning, imitation learning, and motion planning. Built on &lt;a class=&#34;link&#34; href=&#34;https://docs.isaacsim.omniverse.nvidia.com/latest/index.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA Isaac Sim&lt;/a&gt;, it combines fast and accurate physics and sensor simulation, making it an ideal choice for sim-to-real transfer in robotics.&lt;/p&gt;
&lt;p&gt;Isaac Lab provides developers with a range of essential features for accurate sensor simulation, such as RTX-based cameras, LIDAR, or contact sensors. The framework&amp;rsquo;s GPU acceleration enables users to run complex simulations and computations faster, which is key for iterative processes like reinforcement learning and data-intensive tasks. Moreover, Isaac Lab can run locally or be distributed across the cloud, offering flexibility for large-scale deployments.&lt;/p&gt;
&lt;h2 id=&#34;key-features&#34;&gt;Key Features
&lt;/h2&gt;&lt;p&gt;Isaac Lab offers a comprehensive set of tools and environments designed to facilitate robot learning:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Robots&lt;/strong&gt;: A diverse collection of robots, ranging from manipulators and quadrupeds to humanoids, with 16 commonly available models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environments&lt;/strong&gt;: Ready-to-train implementations of more than 30 environments, which can be trained with popular reinforcement learning frameworks such as RSL RL, SKRL, RL Games, or Stable Baselines. We also support multi-agent reinforcement learning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Physics&lt;/strong&gt;: Rigid bodies, articulated systems, and deformable objects.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sensors&lt;/strong&gt;: RGB/depth/segmentation cameras, camera annotations, IMU, contact sensors, ray casters.&lt;/li&gt;
&lt;/ul&gt;
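The GPU acceleration mentioned above comes from stepping many environments as one batched operation rather than one at a time. A plain-NumPy sketch of that vectorized pattern (conceptual only, not Isaac Lab's actual API; here 8 toy 1-D point-mass environments step on the CPU):

```python
# Conceptual sketch of a batched environment step: every array carries one
# entry per environment, so a single call advances all of them at once.
# Isaac Lab runs thousands of such environments in parallel on the GPU.
import numpy as np

num_envs, dt = 8, 0.02
pos = np.zeros(num_envs)          # one state per environment
vel = np.zeros(num_envs)

def step(actions):
    """Advance all environments simultaneously; actions has shape (num_envs,)."""
    global pos, vel
    vel += actions * dt
    pos += vel * dt
    rewards = -np.abs(pos)        # e.g. reward for staying near the origin
    return pos.copy(), rewards

obs, rew = step(np.ones(num_envs))   # same action applied in every env
print(obs.shape, rew.shape)          # both are batched: one entry per env
```

Because observations, actions, and rewards are all batched tensors, the same structure maps directly onto GPU kernels, which is what makes the iterative rollout loops of reinforcement learning fast in practice.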
&lt;h2 id=&#34;getting-started&#34;&gt;Getting Started
&lt;/h2&gt;&lt;p&gt;Our &lt;a class=&#34;link&#34; href=&#34;https://isaac-sim.github.io/IsaacLab&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;documentation page&lt;/a&gt; provides everything you need to get started, including detailed tutorials and step-by-step guides. Follow these links to learn more about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://isaac-sim.github.io/IsaacLab/main/source/setup/installation/index.html#local-installation&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Installation steps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://isaac-sim.github.io/IsaacLab/main/source/overview/reinforcement-learning/rl_existing_scripts.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Reinforcement learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://isaac-sim.github.io/IsaacLab/main/source/tutorials/index.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Tutorials&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://isaac-sim.github.io/IsaacLab/main/source/overview/environments.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Available environments&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;isaac-sim-version-dependency&#34;&gt;Isaac Sim Version Dependency
&lt;/h2&gt;&lt;p&gt;Isaac Lab is built on top of Isaac Sim and requires specific versions of Isaac Sim that are compatible with each release of Isaac Lab.
Below, we outline the recent Isaac Lab releases and GitHub branches and their corresponding dependency versions for Isaac Sim.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Isaac Lab Version&lt;/th&gt;
          &lt;th&gt;Isaac Sim Version&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;main&lt;/code&gt; branch&lt;/td&gt;
          &lt;td&gt;Isaac Sim 4.5&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;v2.1.0&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Isaac Sim 4.5&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;v2.0.2&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Isaac Sim 4.5&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;v2.0.1&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Isaac Sim 4.5&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;v2.0.0&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Isaac Sim 4.5&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;feature/isaacsim_5_0&lt;/code&gt; branch&lt;/td&gt;
          &lt;td&gt;Isaac Sim 5.0&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Note that the &lt;code&gt;feature/isaacsim_5_0&lt;/code&gt; branch will receive active updates and may contain breaking changes
until the official Isaac Lab 2.2 release.
It currently requires the &lt;a class=&#34;link&#34; href=&#34;https://github.com/isaac-sim/IsaacSim&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Isaac Sim 5.0 branch&lt;/a&gt; on GitHub, built from source.
Please refer to the README in the &lt;code&gt;feature/isaacsim_5_0&lt;/code&gt; branch for instructions on using Isaac Lab with Isaac Sim 5.0.
We are actively working on introducing backwards-compatibility support for Isaac Sim 4.5 on this branch.&lt;/p&gt;
&lt;h2 id=&#34;contributing-to-isaac-lab&#34;&gt;Contributing to Isaac Lab
&lt;/h2&gt;&lt;p&gt;We wholeheartedly welcome contributions from the community to make this framework mature and useful for everyone.
These may happen as bug reports, feature requests, or code contributions. For details, please check our
&lt;a class=&#34;link&#34; href=&#34;https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;contribution guidelines&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;show--tell-share-your-inspiration&#34;&gt;Show &amp;amp; Tell: Share Your Inspiration
&lt;/h2&gt;&lt;p&gt;We encourage you to utilize our &lt;a class=&#34;link&#34; href=&#34;https://github.com/isaac-sim/IsaacLab/discussions/categories/show-and-tell&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Show &amp;amp; Tell&lt;/a&gt; area in the
&lt;code&gt;Discussions&lt;/code&gt; section of this repository. This space is designed for you to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Share the tutorials you&amp;rsquo;ve created&lt;/li&gt;
&lt;li&gt;Showcase your learning content&lt;/li&gt;
&lt;li&gt;Present exciting projects you&amp;rsquo;ve developed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By sharing your work, you&amp;rsquo;ll inspire others and contribute to the collective knowledge
of our community. Your contributions can spark new ideas and collaborations, fostering
innovation in robotics and simulation.&lt;/p&gt;
&lt;h2 id=&#34;troubleshooting&#34;&gt;Troubleshooting
&lt;/h2&gt;&lt;p&gt;Please see the &lt;a class=&#34;link&#34; href=&#34;https://isaac-sim.github.io/IsaacLab/main/source/refs/troubleshooting.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;troubleshooting&lt;/a&gt; section for
common fixes or &lt;a class=&#34;link&#34; href=&#34;https://github.com/isaac-sim/IsaacLab/issues&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;submit an issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For issues related to Isaac Sim, we recommend checking its &lt;a class=&#34;link&#34; href=&#34;https://docs.omniverse.nvidia.com/app_isaacsim/app_isaacsim/overview.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;documentation&lt;/a&gt;
or opening a question on its &lt;a class=&#34;link&#34; href=&#34;https://forums.developer.nvidia.com/c/agx-autonomous-machines/isaac/67&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;forums&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;support&#34;&gt;Support
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Please use GitHub &lt;a class=&#34;link&#34; href=&#34;https://github.com/isaac-sim/IsaacLab/discussions&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Discussions&lt;/a&gt; for discussing ideas, asking questions, and requests for new features.&lt;/li&gt;
&lt;li&gt;Github &lt;a class=&#34;link&#34; href=&#34;https://github.com/isaac-sim/IsaacLab/issues&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Issues&lt;/a&gt; should only be used to track executable pieces of work with a definite scope and a clear deliverable. These can be fixing bugs, documentation issues, new features, or general updates.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;connect-with-the-nvidia-omniverse-community&#34;&gt;Connect with the NVIDIA Omniverse Community
&lt;/h2&gt;&lt;p&gt;Do you have a project or resource you&amp;rsquo;d like to share more widely? We&amp;rsquo;d love to hear from you!
Reach out to the NVIDIA Omniverse Community team at &lt;a class=&#34;link&#34; href=&#34;mailto:OmniverseCommunity@nvidia.com&#34; &gt;OmniverseCommunity@nvidia.com&lt;/a&gt; to explore opportunities
to spotlight your work.&lt;/p&gt;
&lt;p&gt;You can also join the conversation on the &lt;a class=&#34;link&#34; href=&#34;https://discord.com/invite/nvidiaomniverse&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Omniverse Discord&lt;/a&gt; to
connect with other developers, share your projects, and help grow a vibrant, collaborative ecosystem
where creativity and technology intersect. Your contributions can make a meaningful impact on the Isaac Lab community and beyond!&lt;/p&gt;
&lt;h2 id=&#34;license&#34;&gt;License
&lt;/h2&gt;&lt;p&gt;The Isaac Lab framework is released under &lt;a class=&#34;link&#34; href=&#34;LICENSE&#34; &gt;BSD-3 License&lt;/a&gt;. The &lt;code&gt;isaaclab_mimic&lt;/code&gt; extension and its corresponding standalone scripts are released under &lt;a class=&#34;link&#34; href=&#34;LICENSE-mimic&#34; &gt;Apache 2.0&lt;/a&gt;. The license files of its dependencies and assets are present in the &lt;a class=&#34;link&#34; href=&#34;docs/licenses&#34; &gt;&lt;code&gt;docs/licenses&lt;/code&gt;&lt;/a&gt; directory.&lt;/p&gt;
&lt;h2 id=&#34;acknowledgement&#34;&gt;Acknowledgement
&lt;/h2&gt;&lt;p&gt;Isaac Lab development initiated from the &lt;a class=&#34;link&#34; href=&#34;https://isaac-orbit.github.io/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Orbit&lt;/a&gt; framework. We would appreciate it if you cite it in academic publications as well:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;@article{mittal2023orbit,
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   author={Mittal, Mayank and Yu, Calvin and Yu, Qinxi and Liu, Jingzhou and Rudin, Nikita and Hoeller, David and Yuan, Jia Lin and Singh, Ritvik and Guo, Yunrong and Mazhar, Hammad and Mandlekar, Ajay and Babich, Buck and State, Gavriel and Hutter, Marco and Garg, Animesh},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   journal={IEEE Robotics and Automation Letters},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   title={Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   year={2023},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   volume={8},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   number={6},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   pages={3740-3747},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   doi={10.1109/LRA.2023.3270034}
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description>
        </item>
        <item>
        <title>rl-swarm</title>
        <link>https://producthunt.programnotes.cn/en/p/rl-swarm/</link>
        <pubDate>Sat, 28 Jun 2025 15:28:05 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/rl-swarm/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1584785933913-feb6e407f2a2?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTEwOTU2MjR8&amp;ixlib=rb-4.1.0" alt="Featured image of post rl-swarm" /&gt;&lt;h1 id=&#34;gensyn-airl-swarm&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/gensyn-ai/rl-swarm&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;gensyn-ai/rl-swarm&lt;/a&gt;
&lt;/h1&gt;&lt;h1 id=&#34;rl-swarm&#34;&gt;RL Swarm
&lt;/h1&gt;&lt;p&gt;RL Swarm is a peer-to-peer system for reinforcement learning. It allows you to train models collaboratively with others in the swarm, leveraging their collective intelligence. It is open source and permissionless, meaning you can run it on a consumer laptop at home or on a powerful GPU in the cloud. You can also connect your model to the Gensyn Testnet to receive an on-chain identity that tracks your progress over time.&lt;/p&gt;
&lt;p&gt;Currently, we are running the &lt;a class=&#34;link&#34; href=&#34;https://github.com/open-thought/reasoning-gym/tree/main&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;reasoning-gym&lt;/a&gt; swarm on the Testnet. This swarm is designed to train models to solve a diverse set of reasoning tasks using the reasoning-gym dataset. The current list of default models includes:&lt;/p&gt;
&lt;p&gt;Models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gensyn/Qwen2.5-0.5B-Instruct&lt;/li&gt;
&lt;li&gt;Qwen/Qwen3-0.6B&lt;/li&gt;
&lt;li&gt;nvidia/AceInstruct-1.5B&lt;/li&gt;
&lt;li&gt;dnotitia/Smoothie-Qwen3-1.7B&lt;/li&gt;
&lt;li&gt;Gensyn/Qwen2.5-1.5B-Instruct&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This iteration of rl-swarm is powered by the &lt;a class=&#34;link&#34; href=&#34;https://github.com/gensyn-ai/genrl-swarm&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GenRL-Swarm&lt;/a&gt; library. It is a fully composable framework for decentralized reinforcement learning that enables users to create and customize their own swarms with multi-agent, multi-stage environments.&lt;/p&gt;
&lt;h2 id=&#34;requirements&#34;&gt;Requirements
&lt;/h2&gt;&lt;p&gt;Your hardware requirements will vary depending on a number of factors, including model size and the accelerator platform you use. Users running a large NVIDIA GPU will be assigned a model from the large model pool, while users running less powerful hardware will be assigned a model from the small model pool. This design decision is intended to allow users to advance at a similar rate regardless of the hardware they use, maximizing their utility to the swarm.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Supported Hardware&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;arm64 or x86 CPU with a minimum of 32 GB RAM (note that running other applications during training may crash training).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CUDA devices (officially supported):
&lt;ul&gt;
&lt;li&gt;RTX 3090&lt;/li&gt;
&lt;li&gt;RTX 4090&lt;/li&gt;
&lt;li&gt;RTX 5090&lt;/li&gt;
&lt;li&gt;A100&lt;/li&gt;
&lt;li&gt;H100&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With either configuration, you will need Python &amp;gt;=3.10 (on macOS, you will likely need to upgrade).&lt;/p&gt;
&lt;h2 id=&#34;-please-read-before-continuing-&#34;&gt;⚠️ Please read before continuing ⚠️
&lt;/h2&gt;&lt;p&gt;This software is &lt;strong&gt;experimental&lt;/strong&gt; and provided as-is for users who are interested in using (or helping to develop) an early version of the Gensyn Protocol for training models.&lt;/p&gt;
&lt;p&gt;If you care about on-chain participation, you &lt;strong&gt;must&lt;/strong&gt; read the &lt;a class=&#34;link&#34; href=&#34;#identity-management&#34; &gt;Identity Management&lt;/a&gt; section below.&lt;/p&gt;
&lt;p&gt;If you encounter issues, please first check &lt;a class=&#34;link&#34; href=&#34;#troubleshooting&#34; &gt;Troubleshooting&lt;/a&gt;. If you cannot find a solution there, please check if there is an open (or closed) &lt;a class=&#34;link&#34; href=&#34;../../issues&#34; &gt;Issue&lt;/a&gt;. If there is no relevant issue, please file one and include 1) all relevant &lt;a class=&#34;link&#34; href=&#34;#troubleshooting&#34; &gt;logs&lt;/a&gt;, 2) information about your device (e.g. which GPU, if relevant), and 3) your operating system information.&lt;/p&gt;
&lt;h2 id=&#34;instructions&#34;&gt;Instructions
&lt;/h2&gt;&lt;h3 id=&#34;run-the-swarm&#34;&gt;Run the Swarm
&lt;/h3&gt;&lt;p&gt;The easiest way to run RL Swarm is using Docker. This ensures a consistent setup across all operating systems with minimal dependencies.&lt;/p&gt;
&lt;h4 id=&#34;1-clone-this-repo&#34;&gt;1. Clone this repo
&lt;/h4&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/gensyn-ai/rl-swarm
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h4 id=&#34;2-install-docker&#34;&gt;2. Install Docker
&lt;/h4&gt;&lt;p&gt;Make sure you have Docker installed and the Docker daemon is running on your machine. To do that, follow &lt;a class=&#34;link&#34; href=&#34;https://docs.docker.com/get-started/get-docker/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;these instructions&lt;/a&gt; according to your OS. Ensure you allot sufficient memory to the Docker containers. For example if using Docker Desktop, this can be done by going to Docker Desktop Settings &amp;gt; Resources &amp;gt; Advanced &amp;gt; Memory Limit, and increasing it to the maximum possible value.&lt;/p&gt;
&lt;h4 id=&#34;3-start-the-swarm&#34;&gt;3. Start the Swarm
&lt;/h4&gt;&lt;p&gt;Run the following commands from the root of the repository.&lt;/p&gt;
&lt;h5 id=&#34;cpu-support&#34;&gt;CPU support
&lt;/h5&gt;&lt;p&gt;If you’re using a Mac or if your machine has CPU-only support:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker-compose run --rm --build -Pit swarm-cpu
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h5 id=&#34;gpu-support&#34;&gt;GPU support
&lt;/h5&gt;&lt;p&gt;If you&amp;rsquo;re using a machine with an officially supported GPU:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker-compose run --rm --build -Pit swarm-gpu
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h5 id=&#34;docker-compose-issue&#34;&gt;Docker compose issue
&lt;/h5&gt;&lt;p&gt;If &lt;code&gt;docker-compose&lt;/code&gt; does not work when running the above commands, please try &lt;code&gt;docker compose&lt;/code&gt; (no hyphen) instead, i.e. &lt;code&gt;docker compose run --rm --build -Pit swarm-gpu&lt;/code&gt;. This issue sometimes occurs for users running Ubuntu.&lt;/p&gt;
&lt;h3 id=&#34;experimental-advanced-mode&#34;&gt;Experimental (advanced) mode
&lt;/h3&gt;&lt;p&gt;If you want to experiment with the &lt;a class=&#34;link&#34; href=&#34;https://github.com/gensyn-ai/genrl-swarm&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GenRL-Swarm&lt;/a&gt; library and its &lt;a class=&#34;link&#34; href=&#34;https://github.com/gensyn-ai/genrl-swarm/blob/main/recipes/rgym/rg-swarm.yaml&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;configurable parameters&lt;/a&gt;, we recommend you run RL Swarm via shell script:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python3 -m venv .venv
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./run_rl_swarm.sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To learn more about experimental mode, check out our &lt;a class=&#34;link&#34; href=&#34;https://github.com/gensyn-ai/genrl-swarm/blob/main/getting_started.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;getting started guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;login&#34;&gt;Login
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;A browser window will pop open (you&amp;rsquo;ll need to manually navigate to http://localhost:3000/ if you&amp;rsquo;re on a VM).&lt;/li&gt;
&lt;li&gt;Click &amp;lsquo;login&amp;rsquo;.&lt;/li&gt;
&lt;li&gt;Login with your preferred method.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;huggingface&#34;&gt;Hugging Face
&lt;/h3&gt;&lt;p&gt;If you would like to upload your model to Hugging Face, enter your Hugging Face access token when prompted. You can generate one from your Hugging Face account, under &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/docs/hub/en/security-tokens&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Access Tokens&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;initial-peering-and-training&#34;&gt;Initial peering and training
&lt;/h3&gt;&lt;p&gt;From this stage onward your device will begin training. You should see your peer register and vote on-chain &lt;a class=&#34;link&#34; href=&#34;https://gensyn-testnet.explorer.alchemy.com/address/0xFaD7C5e93f28257429569B854151A1B8DCD404c2?tab=logs&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can also track your training progress in real time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;On the RL Swarm Dashboard: &lt;a class=&#34;link&#34; href=&#34;https://dashboard.gensyn.ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;dashboard.gensyn.ai&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;identity-management&#34;&gt;Identity management
&lt;/h2&gt;&lt;h3 id=&#34;introduction&#34;&gt;Introduction
&lt;/h3&gt;&lt;p&gt;On-chain identity is managed via an Alchemy modal sign-in screen. You need to supply an email address or log in via a supported method (e.g. Google). This creates an EOA public/private keypair (stored by Alchemy). You will also receive local session keys, stored in &lt;code&gt;userApiKey&lt;/code&gt;; note that these aren&amp;rsquo;t your EOA public/private keys.&lt;/p&gt;
&lt;p&gt;During the initial set-up process, you will also create a &lt;code&gt;swarm.pem&lt;/code&gt; file, which maintains the identity of your peer. This is then registered on-chain using the EOA wallet hosted in Alchemy, triggered using your local API keys. This links the &lt;code&gt;swarm.pem&lt;/code&gt; to the &lt;code&gt;email address&lt;/code&gt; (and corresponding EOA in Alchemy).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you want to link multiple nodes to a single EOA&lt;/strong&gt;, simply sign up each node using the same email address. You will get a new peer ID for each node, however they will all be linked to the same EOA that your email is linked to.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Please note&lt;/strong&gt;: if you are using a fork of this repo, or a service organised by someone else (e.g. a &amp;lsquo;one click deployment&amp;rsquo; provider), the identity management flow below is not guaranteed.&lt;/p&gt;
&lt;h3 id=&#34;what-this-means&#34;&gt;What this means
&lt;/h3&gt;&lt;p&gt;In the following two scenarios, everything will work (i.e. you will have an on-chain identity linked with your RL Swarm peer training):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The very first time you run the node from scratch with a new email address. The smart account will be created fresh and linked with a freshly generated &lt;code&gt;swarm.pem&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If you run it again with a &lt;code&gt;swarm.pem&lt;/code&gt; AND log in with the original &lt;code&gt;email address&lt;/code&gt; used with that &lt;code&gt;swarm.pem&lt;/code&gt;. Note: this will log an error on registration but will still be able to sign transactions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the following scenario, it will not work (i.e. you won&amp;rsquo;t have an on-chain identity linked with your RL Swarm peer training):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you keep your &lt;code&gt;swarm.pem&lt;/code&gt; and try to link it to an &lt;code&gt;email address&lt;/code&gt; distinct from the one with which it was first registered.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Therefore, you should take the following actions in these scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Signed up with &lt;code&gt;email address&lt;/code&gt;, generated &lt;code&gt;swarm.pem&lt;/code&gt;, BUT lost &lt;code&gt;swarm.pem&lt;/code&gt;&lt;/strong&gt; OR &lt;strong&gt;You want to run multiple nodes at once&lt;/strong&gt;: run from scratch with the same email address and generate a new &lt;code&gt;swarm.pem&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Signed up with &lt;code&gt;email address&lt;/code&gt;, generated &lt;code&gt;swarm.pem&lt;/code&gt;, kept &lt;code&gt;swarm.pem&lt;/code&gt;&lt;/strong&gt; -&amp;gt; you can re-run a single node using this pair if you&amp;rsquo;ve still got them both.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;troubleshooting&#34;&gt;Troubleshooting
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How do I find my logs?&lt;/strong&gt; You can find them inside the &lt;code&gt;/logs&lt;/code&gt; directory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;yarn.log&lt;/code&gt;: This file contains logs for the modal login server.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;swarm.log&lt;/code&gt;: This is the main log file for the RL Swarm application.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;wandb/&lt;/code&gt;: This directory contains various logs related to your training runs, including a &lt;code&gt;debug.log&lt;/code&gt; file. These can be uploaded to Weights &amp;amp; Biases (only available if you set &lt;code&gt;log_with&lt;/code&gt; to wandb).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;My peer &amp;lsquo;skipped a round&amp;rsquo;&lt;/strong&gt;: this occurs when your device isn&amp;rsquo;t fast enough to keep up with the pace of the swarm. For example, if you start training at round 100 and by the time you finish training the rest of the swarm reaches round 102, you will skip round 101 and go straight to 102. This is because your peer is more valuable if it is participating in the active round.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;My model doesn&amp;rsquo;t seem to be training?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you&amp;rsquo;re using a consumer device (e.g. a MacBook), it is likely just running slowly - check back in 20 minutes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Logging in with a new account after previous login?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Make sure you click &amp;lsquo;Logout&amp;rsquo; on the login screen before you leave your previous session&lt;/li&gt;
&lt;li&gt;Make sure you delete &lt;code&gt;swarm.pem&lt;/code&gt; from the root directory (try &lt;code&gt;sudo rm swarm.pem&lt;/code&gt;). If you don&amp;rsquo;t do this, and you previously registered with the peer-id stored in this file, it will disrupt the training process.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Issues with the Login screen&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Upgrade viem&lt;/strong&gt;: some users report issues with the &lt;code&gt;viem&lt;/code&gt; package. There are two fixes:
&lt;ul&gt;
&lt;li&gt;in the &lt;code&gt;modal-login/package.json&lt;/code&gt; update: &lt;code&gt;&amp;quot;viem&amp;quot;: &amp;quot;2.25.0&amp;quot;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;in the terminal &lt;code&gt;cd /root/rl-swarm/modal-login/ &amp;amp;&amp;amp; yarn upgrade &amp;amp;&amp;amp; yarn add next@latest &amp;amp;&amp;amp; yarn add viem@latest&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;I&amp;rsquo;m getting lots of warnings&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This is expected behaviour, usually output from package managers or other dependencies. The most common is the Protobuf warning below, which can be ignored:
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-gdscript3&#34; data-lang=&#34;gdscript3&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;WARNING&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;The&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;candidate&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;selected&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;download&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;or&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;install&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;is&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;a&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;yanked&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;version&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;protobuf&amp;#39;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;candidate&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Issues on VMs/VPSs?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How do I access the login screen if I&amp;rsquo;m running in a VM?&lt;/strong&gt;: port forwarding. Add this SSH flag: &lt;code&gt;-L 3000:localhost:3000&lt;/code&gt; when connecting to your VM. E.g. &lt;code&gt;gcloud compute ssh --zone &amp;quot;us-central1-a&amp;quot; [your-vm] --project [your-project] -- -L 3000:localhost:3000&lt;/code&gt;. Note, some VPSs may not work with &lt;code&gt;rl-swarm&lt;/code&gt;. Check the Gensyn &lt;a class=&#34;link&#34; href=&#34;https://discord.gg/AdnyWNzXh5&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;discord&lt;/a&gt; for up-to-date information on this.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Disconnection/general issues&lt;/strong&gt;: If you are tunneling to a VM and suffer a broken pipe, you will likely encounter OOM errors or unexpected behaviour the first time you relaunch the script. If you press &lt;code&gt;Ctrl+C&lt;/code&gt; and kill the script, it should spin down all background processes. Restart the script and everything should work normally.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Issues with npm/general installation?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Try  &lt;code&gt;npm install -g node@latest&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;OOM errors on MacBook?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Try this (experimental) fix to increase memory:
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-gdscript3&#34; data-lang=&#34;gdscript3&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;PYTORCH_MPS_HIGH_WATERMARK_RATIO&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;I have a Windows machine, can I still train a model on the swarm?&lt;/strong&gt;: Yes - but this is not very well tested and may require you to do some debugging to get it set up properly. Install WSL and Linux on your Windows machine using the following instructions: &lt;a class=&#34;link&#34; href=&#34;https://learn.microsoft.com/en-us/windows/wsl/install&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://learn.microsoft.com/en-us/windows/wsl/install&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;I want to move my node to a different machine and/or restart with a fresh build of the repo, but I want my animal name/peer id to persist.&lt;/strong&gt;: To achieve this, simply back up the &lt;code&gt;swarm.pem&lt;/code&gt; file on your current machine and then put it in the corresponding location on your new machine/build of the repo.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;I have multiple GPUs on one machine, can I run multiple peers?&lt;/strong&gt;: Yes - but you&amp;rsquo;ll need to make some manual changes. You&amp;rsquo;ll need to isolate each GPU, install this repo for each GPU, and expose each peer under a different port to pass the modal onboarding.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;My round/stage is behind the smart contract/other peers?&lt;/strong&gt;: This is expected behaviour given the different speeds of machines in the network. Once your machine completes its current round, it will move to the current round.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;I want to use a bigger and/or different model in the RL swarm, can I do that?&lt;/strong&gt;: Yes - but we only recommend doing so if you understand what size of model can reasonably run on your hardware. If you elect to bring a custom model, just paste the repo/model name into the command line when prompted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;I am running a model in the swarm on my CPU, have received a Python &lt;code&gt;RuntimeError&lt;/code&gt;, and my training progress seems to have stopped.&lt;/strong&gt;: There are several possible causes, but before trying anything, wait long enough to be sure your training is actually frozen and not just slow (e.g. wait longer than a single training iteration has previously taken on your machine). If you&amp;rsquo;re sure training is frozen, some things to try are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set this (experimental) fix: &lt;code&gt;export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 &amp;amp;&amp;amp; ./run_rl_swarm.sh&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
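For the multi-GPU question above, here is a minimal sketch of what &amp;lsquo;isolate each GPU and expose each peer under a different port&amp;rsquo; could look like. The clone paths and the &lt;code&gt;PORT&lt;/code&gt; variable are illustrative assumptions, not rl-swarm&amp;rsquo;s documented interface; the loop only prints the commands you would run, one per GPU.

```shell
# Hypothetical sketch: one clone of the repo per GPU, each pinned with
# CUDA_VISIBLE_DEVICES and given its own login port. Paths and the PORT
# variable are assumptions; adapt them to your actual setup.
for gpu in 0 1; do
  echo "cd ~/rl-swarm-gpu$gpu && CUDA_VISIBLE_DEVICES=$gpu PORT=$((3000 + gpu)) ./run_rl_swarm.sh"
done
```

`CUDA_VISIBLE_DEVICES` is the standard NVIDIA mechanism for restricting a process to one GPU; the per-peer port is what lets each node complete the modal login separately.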
</description>
        </item>
        <item>
        <title>all-rag-techniques</title>
        <link>https://producthunt.programnotes.cn/en/p/all-rag-techniques/</link>
        <pubDate>Mon, 16 Jun 2025 15:31:49 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/all-rag-techniques/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1510792047925-c55a452bbad7?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTAwNTkwNDV8&amp;ixlib=rb-4.1.0" alt="Featured image of post all-rag-techniques" /&gt;&lt;h1 id=&#34;fareedkhan-devall-rag-techniques&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/FareedKhan-dev/all-rag-techniques&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FareedKhan-dev/all-rag-techniques&lt;/a&gt;
&lt;/h1&gt;&lt;h1 id=&#34;all-rag-techniques-a-simpler-hands-on-approach-&#34;&gt;All RAG Techniques: A Simpler, Hands-On Approach ✨
&lt;/h1&gt;&lt;p&gt;Read this in your preferred language:&lt;br&gt;
&lt;a class=&#34;link&#34; href=&#34;https://www.readme-i18n.com/FareedKhan-dev/all-rag-techniques?lang=de&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Deutsch&lt;/a&gt; |  &lt;a class=&#34;link&#34; href=&#34;https://www.readme-i18n.com/FareedKhan-dev/all-rag-techniques?lang=es&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Español&lt;/a&gt; |  &lt;a class=&#34;link&#34; href=&#34;https://www.readme-i18n.com/FareedKhan-dev/all-rag-techniques?lang=fr&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Français&lt;/a&gt; |  &lt;a class=&#34;link&#34; href=&#34;https://www.readme-i18n.com/FareedKhan-dev/all-rag-techniques?lang=ja&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;日本語&lt;/a&gt; |  &lt;a class=&#34;link&#34; href=&#34;https://www.readme-i18n.com/FareedKhan-dev/all-rag-techniques?lang=ko&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;한국어&lt;/a&gt; |  &lt;a class=&#34;link&#34; href=&#34;https://www.readme-i18n.com/FareedKhan-dev/all-rag-techniques?lang=pt&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Português&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://www.readme-i18n.com/FareedKhan-dev/all-rag-techniques?lang=ru&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Русский&lt;/a&gt; |  &lt;a class=&#34;link&#34; href=&#34;https://www.readme-i18n.com/FareedKhan-dev/all-rag-techniques?lang=zh&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;中文&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.python.org/downloads/release/python-370/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/python-3.7&amp;#43;-blue.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Python 3.7&amp;#43;&#34;
	
	
&gt;&lt;/a&gt; &lt;a class=&#34;link&#34; href=&#34;https://cloud.nebius.ai/services/llm-embedding&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/Nebius%20AI-API-brightgreen&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Nebius AI&#34;
	
	
&gt;&lt;/a&gt; &lt;a class=&#34;link&#34; href=&#34;https://openai.com/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/OpenAI-API-lightgrey&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;OpenAI&#34;
	
	
&gt;&lt;/a&gt; &lt;a class=&#34;link&#34; href=&#34;https://medium.com/@fareedkhandev/testing-every-rag-technique-to-find-the-best-094d166af27f&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/Medium-Blog-black?logo=medium&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Medium&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This repository takes a clear, hands-on approach to &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;, breaking down advanced techniques into straightforward, understandable implementations. Instead of relying on frameworks like &lt;code&gt;LangChain&lt;/code&gt; or &lt;code&gt;FAISS&lt;/code&gt;, everything here is built using familiar Python libraries: &lt;code&gt;openai&lt;/code&gt;, &lt;code&gt;numpy&lt;/code&gt;, &lt;code&gt;matplotlib&lt;/code&gt;, and a few others.&lt;/p&gt;
&lt;p&gt;The goal is simple: provide code that is readable, modifiable, and educational. By focusing on the fundamentals, this project helps demystify RAG and makes it easier to understand how it really works.&lt;/p&gt;
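As a taste of that from-scratch style, here is a minimal retrieval step using only numpy. In the notebooks the embedding vectors come from an embeddings API (e.g. the openai client); the toy 3-d vectors below are stand-ins so the sketch runs on its own.

```python
import numpy as np

def retrieve(query_vec, chunk_vecs, chunks, k=2):
    """Return the top-k chunks ranked by cosine similarity to the query."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(chunk_vecs, dtype=float)
    # Normalize, then a matrix-vector product gives all cosine similarities.
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# Toy 3-d "embeddings" standing in for real model outputs.
chunks = ["RAG overview", "chunking strategies", "reranking"]
vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
print(retrieve([0.9, 0.2, 0.0], vecs, chunks, k=1))  # -> ['RAG overview']
```

Every notebook builds on this same primitive, swapping in real embeddings and layering the technique-specific logic on top.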
&lt;h2 id=&#34;update-&#34;&gt;Update: 📢
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;(12-May-2025) Added a new notebook on how to handle big data using Knowledge Graphs.&lt;/li&gt;
&lt;li&gt;(27-Apr-2025) Added a new notebook that finds the best RAG technique for a given query (Simple RAG + Reranker + Query Rewrite).&lt;/li&gt;
&lt;li&gt;(20-Mar-2025) Added a new notebook on RAG with Reinforcement Learning.&lt;/li&gt;
&lt;li&gt;(07-Mar-2025) Added 20 RAG techniques to the repository.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;-whats-inside&#34;&gt;🚀 What&amp;rsquo;s Inside?
&lt;/h2&gt;&lt;p&gt;This repository contains a collection of Jupyter Notebooks, each focusing on a specific RAG technique.  Each notebook provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A concise explanation of the technique.&lt;/li&gt;
&lt;li&gt;A step-by-step implementation from scratch.&lt;/li&gt;
&lt;li&gt;Clear code examples with inline comments.&lt;/li&gt;
&lt;li&gt;Evaluations and comparisons to demonstrate the technique&amp;rsquo;s effectiveness.&lt;/li&gt;
&lt;li&gt;Visualizations of the results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here&amp;rsquo;s a glimpse of the techniques covered:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th style=&#34;text-align: left&#34;&gt;Notebook&lt;/th&gt;
          &lt;th style=&#34;text-align: left&#34;&gt;Description&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;01_simple_rag.ipynb&#34; &gt;1. Simple RAG&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;A basic RAG implementation.  A great starting point!&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;02_semantic_chunking.ipynb&#34; &gt;2. Semantic Chunking&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Splits text based on semantic similarity for more meaningful chunks.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;03_chunk_size_selector.ipynb&#34; &gt;3. Chunk Size Selector&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Explores the impact of different chunk sizes on retrieval performance.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;04_context_enriched_rag.ipynb&#34; &gt;4. Context Enriched RAG&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Retrieves neighboring chunks to provide more context.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;05_contextual_chunk_headers_rag.ipynb&#34; &gt;5. Contextual Chunk Headers&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Prepends descriptive headers to each chunk before embedding.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;06_doc_augmentation_rag.ipynb&#34; &gt;6. Document Augmentation RAG&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Generates questions from text chunks to augment the retrieval process.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;07_query_transform.ipynb&#34; &gt;7. Query Transform&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Rewrites, expands, or decomposes queries to improve retrieval.  Includes &lt;strong&gt;Step-back Prompting&lt;/strong&gt; and &lt;strong&gt;Sub-query Decomposition&lt;/strong&gt;.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;08_reranker.ipynb&#34; &gt;8. Reranker&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Re-ranks initially retrieved results using an LLM for better relevance.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;09_rse.ipynb&#34; &gt;9. RSE&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Relevant Segment Extraction:  Identifies and reconstructs continuous segments of text, preserving context.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;10_contextual_compression.ipynb&#34; &gt;10. Contextual Compression&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Implements contextual compression to filter and compress retrieved chunks, maximizing relevant information.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;11_feedback_loop_rag.ipynb&#34; &gt;11. Feedback Loop RAG&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Incorporates user feedback to learn and improve RAG system over time.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;12_adaptive_rag.ipynb&#34; &gt;12. Adaptive RAG&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Dynamically selects the best retrieval strategy based on query type.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;13_self_rag.ipynb&#34; &gt;13. Self RAG&lt;/a&gt;&lt;/td&gt;
&lt;td style=&#34;text-align: left&#34;&gt;Implements Self-RAG, which dynamically decides when and how to retrieve, evaluates relevance, and assesses support and utility.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;14_proposition_chunking.ipynb&#34; &gt;14. Proposition Chunking&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Breaks down documents into atomic, factual statements for precise retrieval.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;15_multimodel_rag.ipynb&#34; &gt;15. Multimodel RAG&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Combines text and images for retrieval, generating captions for images using LLaVA.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;16_fusion_rag.ipynb&#34; &gt;16. Fusion RAG&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Combines vector search with keyword-based (BM25) retrieval for improved results.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;17_graph_rag.ipynb&#34; &gt;17. Graph RAG&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Organizes knowledge as a graph, enabling traversal of related concepts.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;18_hierarchy_rag.ipynb&#34; &gt;18. Hierarchy RAG&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Builds hierarchical indices (summaries + detailed chunks) for efficient retrieval.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;19_HyDE_rag.ipynb&#34; &gt;19. HyDE RAG&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Uses Hypothetical Document Embeddings to improve semantic matching.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;20_crag.ipynb&#34; &gt;20. CRAG&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Corrective RAG: Dynamically evaluates retrieval quality and uses web search as a fallback.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
&lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;21_rag_with_rl.ipynb&#34; &gt;21. RAG with RL&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Maximizes the reward of the RAG model using Reinforcement Learning.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;best_rag_finder.ipynb&#34; &gt;Best RAG Finder&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Finds the best RAG technique for a given query using Simple RAG + Reranker + Query Rewrite.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;22_Big_data_with_KG.ipynb&#34; &gt;22. Big Data with Knowledge Graphs&lt;/a&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;Handles large datasets using Knowledge Graphs.&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
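To give a flavour of one entry above: Fusion RAG (notebook 16) ultimately has to merge a dense (vector) score list and a sparse (BM25 keyword) score list for the same documents. One common way to do that, sketched here under the assumption of min-max normalization and a weighted sum (the notebook&amp;rsquo;s exact scheme may differ), is:

```python
import numpy as np

def min_max(scores):
    """Scale a score list to [0, 1]; a constant list maps to zeros."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    return np.zeros_like(s) if span == 0 else (s - s.min()) / span

def fuse(vector_scores, keyword_scores, alpha=0.5):
    """Weighted sum of normalized dense and sparse scores, per document."""
    return alpha * min_max(vector_scores) + (1 - alpha) * min_max(keyword_scores)

vec = [0.82, 0.40, 0.77]   # e.g. cosine similarities per document
bm25 = [1.2, 7.9, 3.4]     # e.g. BM25 keyword scores per document
fused = fuse(vec, bm25, alpha=0.5)
print(int(np.argmax(fused)))  # -> 2 (best document after fusion)
```

The normalization step matters because cosine similarities and BM25 scores live on very different scales; without it one signal silently dominates the other.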
&lt;h2 id=&#34;-repository-structure&#34;&gt;🗂️ Repository Structure
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;29
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;30
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;fareedkhan-dev-all-rag-techniques/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── README.md                          &amp;lt;- You are here!
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 01_simple_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 02_semantic_chunking.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 03_chunk_size_selector.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 04_context_enriched_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 05_contextual_chunk_headers_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 06_doc_augmentation_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 07_query_transform.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 08_reranker.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 09_rse.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 10_contextual_compression.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 11_feedback_loop_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 12_adaptive_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 13_self_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 14_proposition_chunking.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 15_multimodel_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 16_fusion_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 17_graph_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 18_hierarchy_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 19_HyDE_rag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 20_crag.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 21_rag_with_rl.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── 22_big_data_with_KG.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── best_rag_finder.ipynb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── requirements.txt                   &amp;lt;- Python dependencies
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;└── data/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    ├── val.json                       &amp;lt;- Sample validation data (queries and answers)
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    ├── AI_Information.pdf             &amp;lt;- A sample PDF document for testing.
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    └── attention_is_all_you_need.pdf  &amp;lt;- A sample PDF document for testing (for Multi-Modal RAG).
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;-getting-started&#34;&gt;🛠️ Getting Started
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clone the repository:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/FareedKhan-dev/all-rag-techniques.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; all-rag-techniques
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Install dependencies:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -r requirements.txt
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Set up your API key:&lt;/strong&gt; the notebooks use an OpenAI-compatible client, so the key is read from the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable even though it comes from Nebius AI.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Obtain an API key from &lt;a class=&#34;link&#34; href=&#34;https://studio.nebius.com/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Nebius AI&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set the API key as an environment variable:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;YOUR_NEBIUS_AI_API_KEY&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;or&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;setx OPENAI_API_KEY &lt;span class=&#34;s2&#34;&gt;&amp;#34;YOUR_NEBIUS_AI_API_KEY&amp;#34;&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# On Windows&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;or, within your Python script/notebook:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;os&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;os&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;environ&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;OPENAI_API_KEY&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;YOUR_NEBIUS_AI_API_KEY&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Run the notebooks:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Open any of the Jupyter notebooks (&lt;code&gt;.ipynb&lt;/code&gt; files) in Jupyter Notebook or JupyterLab. Each notebook is self-contained and can be run independently; within a notebook, execute the cells in order from top to bottom.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The &lt;code&gt;data/AI_Information.pdf&lt;/code&gt; file provides a sample document for testing; you can replace it with your own PDF. The &lt;code&gt;data/val.json&lt;/code&gt; file contains sample queries and ideal answers for evaluation, and &lt;code&gt;data/attention_is_all_you_need.pdf&lt;/code&gt; is used by the Multi-Modal RAG notebook.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;-core-concepts&#34;&gt;💡 Core Concepts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Embeddings:&lt;/strong&gt;  Numerical representations of text that capture semantic meaning.  We use Nebius AI&amp;rsquo;s embedding API and, in many notebooks, also the &lt;code&gt;BAAI/bge-en-icl&lt;/code&gt; embedding model.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Vector Store:&lt;/strong&gt;  A simple database to store and search embeddings.  We create our own &lt;code&gt;SimpleVectorStore&lt;/code&gt; class using NumPy for efficient similarity calculations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cosine Similarity:&lt;/strong&gt;  A measure of similarity between two vectors.  Higher values indicate greater similarity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Chunking:&lt;/strong&gt;  Dividing text into smaller, manageable pieces.  We explore various chunking strategies.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retrieval:&lt;/strong&gt; The process of finding the most relevant text chunks for a given query.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generation:&lt;/strong&gt;  Using a Large Language Model (LLM) to create a response based on the retrieved context and the user&amp;rsquo;s query.  We use the &lt;code&gt;meta-llama/Llama-3.2-3B-Instruct&lt;/code&gt; model via Nebius AI&amp;rsquo;s API.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Evaluation:&lt;/strong&gt;  Assessing the quality of the RAG system&amp;rsquo;s responses, often by comparing them to a reference answer or using an LLM to score relevance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
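The vector-store and cosine-similarity ideas above can be sketched in a few lines of NumPy. This is an illustrative toy with hand-made 3-dimensional "embeddings", not the exact `SimpleVectorStore` class used in the notebooks:

```python
import numpy as np

class SimpleVectorStore:
    """Minimal in-memory vector store ranked by cosine similarity (toy sketch)."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        # Store the text alongside its embedding vector.
        self.texts.append(text)
        self.vectors.append(np.asarray(vector, dtype=float))

    def search(self, query_vector, k=3):
        # Cosine similarity: dot product of the vectors divided by
        # the product of their norms; higher means more similar.
        q = np.asarray(query_vector, dtype=float)
        sims = [
            float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        top = np.argsort(sims)[::-1][:k]  # indices of the k highest scores
        return [(self.texts[i], sims[i]) for i in top]

# Toy usage with hand-made 3-d "embeddings".
store = SimpleVectorStore()
store.add("cats", [1.0, 0.0, 0.0])
store.add("dogs", [0.9, 0.1, 0.0])
store.add("math", [0.0, 0.0, 1.0])
print(store.search([1.0, 0.05, 0.0], k=2))
```

Real embeddings have hundreds or thousands of dimensions, but the search logic is the same: score every stored vector against the query and return the top `k`.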
&lt;h2 id=&#34;-contributing&#34;&gt;🤝 Contributing
&lt;/h2&gt;&lt;p&gt;Contributions are welcome!&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
