Multiple nodes
In this setup, you deploy the network node and one or more inference (ML) nodes across multiple servers. To join the network, you need to deploy two services:
- Network node – a service consisting of two nodes: a chain node and an API node. This service handles all communication. The chain node connects to the blockchain, while the API node manages user requests.
- Inference (ML) node – a service that performs inference of large language models (LLMs) on GPU(s). You need at least one ML node to join the network.
The guide provides instructions for deploying both services on the same machine as well as on different machines. Services are deployed as Docker containers.
Prerequisites
For the Network node, the approximate hardware requirements are:
- 16 cores CPU (amd64)
- 64+ GB RAM
- 1 TB NVMe SSD
- 100 Mbps minimum network connection (1 Gbps preferred)
The final requirements will depend on the number of MLNodes connected and their total throughput.
Before proceeding, complete the Quickstart guide through step 3.4, which includes:
- Hardware and software requirements
- Download deployment files
- Container access authentication
- Key management setup (Account Key and ML Operational Key)
- Participant registration and permissions
Starting the network node and inference nodes
This section describes how to deploy a distributed setup with a network node and multiple inference nodes.
Recommendation
All inference nodes should be registered with the same network node, regardless of their geographic location. Whether the clusters are deployed in different regions or across multiple data centers, each inference node should always connect back to the same network node.
Starting the network node
Make sure you have completed the Quickstart guide through step 3.3 (key management and participant registration) beforehand.
This server becomes the main entry point for external participants, so it must be exposed to the public internet (a static IP or domain is recommended) and hosted on a stable, high-bandwidth server with robust security.
Single-Machine Deployment: Network Node + Inference Node
If your network node server has GPU(s) and you want to run both the network node and an inference node on the same machine, execute the following commands in the gonka/deploy/join directory:
source config.env && \
docker compose -f docker-compose.yml -f docker-compose.mlnode.yml up -d && \
docker compose -f docker-compose.yml -f docker-compose.mlnode.yml logs -f
This will start one network node and one inference node on the same machine.
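To confirm that both services came up, you can list the container status (a quick sanity check; container names depend on the compose files in your checkout):
# Containers stuck in "Restarting" or "Exited" warrant a closer look at the logs
docker compose -f docker-compose.yml -f docker-compose.mlnode.yml ps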
Separate Deployment: Network Node Only
If your network node server has no GPU and you want to run only the network node (without an inference node), execute the following in the gonka/deploy/join directory:
source config.env && \
docker compose -f docker-compose.yml up -d && \
docker compose -f docker-compose.yml logs -f
Note
The address set as DAPI_API__POC_CALLBACK_URL for the network node must be accessible from ALL inference nodes (port 9100 of the api container by default).
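One way to verify this is a quick reachability check from each inference node server toward the callback address (a sketch; replace <network_node_host> with the host you set in DAPI_API__POC_CALLBACK_URL):
# Check that TCP port 9100 on the network node is reachable from this inference node
nc -zv <network_node_host> 9100
# Alternatively with curl: any HTTP response (even an error status) means the port is open,
# while a timeout or "connection refused" means it is not reachable
curl -v --max-time 5 http://<network_node_host>:9100/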
Network Node Status
Once active, the network node will start participating in the upcoming Proof of Computation (PoC). Its weight is updated based on the work produced by connected inference nodes; if no inference nodes are connected, the node will not participate in the PoC or appear in the list. After the next PoC, the network node will appear in the list of active participants (please allow 1–3 hours for the changes to take effect):
http://195.242.13.239:8000/v1/epochs/current/participants
If you add more servers with inference nodes (following the instructions below), the updated weight will be reflected in the list of active participants after the next PoC.
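To check whether your participant has appeared, you can query the endpoint directly (jq is optional and only used here for pretty-printing; the exact response fields are not described in this guide):
# Fetch the current epoch's participants and scan the output for your participant address
curl -s http://195.242.13.239:8000/v1/epochs/current/participants | jq .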
Running the inference node on a separate server
On the other servers, run only the inference node by following the instructions below.
Step 1. Configure the Inference Node
1.1. Download Deployment Files
Clone the repository with the base deploy scripts:
git clone https://github.com/gonka-ai/gonka.git -b main
Authentication required
If prompted for a password, use a GitHub personal access token (classic) with repo access.
1.2. (Optional) Pre-download Model Weights to Hugging Face Cache (HF_HOME)
Inference nodes download model weights from Hugging Face. To ensure the model weights are ready for inference, we recommend downloading them before deployment. Choose one of the following options.
Option 1: Local download
Create a writable directory (e.g. ~/hf-cache) and point HF_HOME at it:
export HF_HOME=/path/to/your/hf-cache
Right now, the network supports two models: Qwen/Qwen2.5-7B-Instruct and Qwen/QwQ-32B. Pre-load them if desired:
huggingface-cli download Qwen/Qwen2.5-7B-Instruct
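After downloading, you can verify that the weights actually landed in your cache (huggingface-cli is part of the huggingface_hub package; scan-cache lists everything stored under the active cache directory):
# With HF_HOME pointing at your cache directory, the downloaded model
# should show up here with its size and revision
huggingface-cli scan-cache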
Option 2: 6Block NFS-mounted cache (for participants on the 6Block internal network)
Mount shared cache:
sudo mount -t nfs 172.18.114.147:/mnt/toshare /mnt/shared
export HF_HOME=/mnt/shared
/mnt/shared only works in the 6Block testnet with access to the shared NFS.
1.3. Authenticate with Docker Registry
Some Docker images used in this guide are private. Make sure to authenticate with the GitHub Container Registry:
docker login ghcr.io -u <YOUR_GITHUB_USERNAME>
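If you prefer a non-interactive login (for example on a headless server), Docker can read the token from stdin; GHCR_PAT below is just a placeholder for an environment variable holding your personal access token:
# Pipe the token instead of typing it at the password prompt
echo "$GHCR_PAT" | docker login ghcr.io -u <YOUR_GITHUB_USERNAME> --password-stdin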
1.4. Open Ports for Network Node Connections
- 5050 - Inference requests (mapped to port 5000 of the MLNode)
- 8080 - Management API (mapped to port 8080 of the MLNode)
Important
These ports must not be exposed to the public internet (they should be accessible only within the network node environment).
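As an illustration, with ufw you could allow these ports only from the network node's address and keep them closed to everyone else (a sketch assuming ufw is your firewall; adapt to iptables, cloud security groups, or whatever your environment uses):
# Allow only the network node (replace <network_node_ip>) to reach the MLNode ports
sudo ufw allow from <network_node_ip> to any port 5050 proto tcp
sudo ufw allow from <network_node_ip> to any port 8080 proto tcp
# Deny the same ports from anywhere else (ufw evaluates rules in the order they were added)
sudo ufw deny 5050/tcp
sudo ufw deny 8080/tcp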
Step 2. Launch the Inference Node
On the inference node's server, change to the gonka/deploy/join directory and execute:
docker compose -f docker-compose.mlnode.yml up -d && docker compose -f docker-compose.mlnode.yml logs -f
This will deploy the inference node, which will start handling inference and PoC tasks as soon as it is registered with your network node (instructions below).
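Before registering the node, it is worth a quick sanity check that the GPUs are visible and the MLNode containers are running (container names and health output depend on docker-compose.mlnode.yml):
# Confirm the GPUs are visible on the host
nvidia-smi
# Show the MLNode containers and their current state
docker compose -f docker-compose.mlnode.yml ps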
Adding (Registering) Inference Nodes with the Network Node
Note
Usually, it takes the server a couple of minutes to start. However, if your server does not accept requests after 5 minutes, please contact us for assistance.
You must register each inference node with the network node to make it operational. The recommended method is the Admin API, which supports dynamic management and is accessible from the terminal of your network node server:
curl -X POST http://localhost:9200/admin/v1/nodes \
-H "Content-Type: application/json" \
-d '{
"id": "<unique_id>",
"host": "<your_inference_node_static_ip>",
"inference_port": <inference_port>,
"poc_port": <poc_port>,
"max_concurrent": <max_concurrent>,
"models": {
"<model_name>": {
"args": [
<model_args>
]
}
}
}'
Parameter descriptions
Parameter | Description | Examples |
---|---|---|
id | A unique identifier for your inference node. | node1 |
host | The static IP of your inference node, or the Docker container name if running in the same Docker network. | http://<mlnode_ip> |
inference_port | The port where the inference node accepts inference and training tasks. | 5000 |
poc_port | The port used for MLNode management. | 8000 |
max_concurrent | The maximum number of concurrent inference requests this node can handle. | 500 |
models | The supported models that the inference node can process. | (see below) |
model_name | The name of the model. | Qwen/QwQ-32B |
model_args | vLLM arguments for inference of the model. | "--quantization","fp8","--kv-cache-dtype","fp8" |
Right now, the network supports two models: Qwen/Qwen2.5-7B-Instruct and Qwen/QwQ-32B, both quantized to FP8; the QwQ model additionally uses an FP8 KV cache.
To ensure correct setup and optimal performance, use the arguments that best match your model and GPU layout.
Model and GPU layout | vLLM arguments |
---|---|
Qwen/Qwen2.5-7B-Instruct | "--quantization","fp8" |
Qwen/QwQ-32B on 8xA100 or 8xH100 | "--quantization","fp8","--kv-cache-dtype","fp8" |
Qwen/QwQ-32B on 8x3090 or 8x4090 | "--quantization","fp8","--kv-cache-dtype","fp8","--tensor-parallel-size","4" |
Qwen/QwQ-32B on 8x3080 | "--quantization","fp8","--kv-cache-dtype","fp8","--tensor-parallel-size","4","--pipeline-parallel-size","2" |
vLLM performance tuning reference
For detailed guidance on selecting optimal deployment configurations and vLLM parameters tailored to your GPU hardware, refer to the Benchmark to Choose Optimal Deployment Config for LLMs guide.
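Putting the pieces together, a registration request for a Qwen/QwQ-32B node on 8xA100 or 8xH100 might look like the following sketch. The id, host, and max_concurrent values are placeholders, and the ports mirror the examples from the parameter table; adjust them to match how your MLNode ports are actually exposed (e.g. 5050/8080 if you use the host mappings from step 1.4):
curl -X POST http://localhost:9200/admin/v1/nodes \
-H "Content-Type: application/json" \
-d '{
"id": "node1",
"host": "http://<your_inference_node_static_ip>",
"inference_port": 5000,
"poc_port": 8000,
"max_concurrent": 500,
"models": {
"Qwen/QwQ-32B": {
"args": [
"--quantization","fp8","--kv-cache-dtype","fp8"
]
}
}
}'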
If the node is successfully added, the response will return the configuration of the newly added inference node.
Retrieving All Inference Nodes
To get a list of all registered inference nodes in your network node, use:
curl -X GET http://localhost:9200/admin/v1/nodes
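If jq is installed, piping the response through it makes the list easier to scan (pretty-printing only; no assumptions about the response structure beyond it being JSON):
curl -s http://localhost:9200/admin/v1/nodes | jq .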
Removing an inference node
From your network node server, use the following Admin API request to remove an inference node dynamically, without restarting:
curl -X DELETE "http://localhost:9200/admin/v1/nodes/{id}" -H "Content-Type: application/json"
Here, id is the identifier of the inference node, as specified when it was registered. If successful, the response will be true.
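For example, to remove the node registered earlier under the placeholder id node1:
curl -X DELETE "http://localhost:9200/admin/v1/nodes/node1" -H "Content-Type: application/json"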