
Multiple nodes

In this setup, you deploy the network node and one or more inference (ML) nodes across multiple servers. To join the network, you need to deploy two services:

  • Network node – a service consisting of two nodes: a chain node and an API node. This service handles all communication. The chain node connects to the blockchain, while the API node manages user requests.
  • Inference (ML) node – a service that performs inference of large language models (LLMs) on GPU(s). You need at least one ML node to join the network.

The guide provides instructions for deploying both services on the same machine as well as on different machines. Services are deployed as Docker containers.


Prerequisites

For the Network node, the approximate hardware requirements are:

  • 16-core CPU (amd64)
  • 64+ GB RAM
  • 1 TB NVMe SSD
  • 100 Mbps minimum network connection (1 Gbps preferred)

The final requirements will depend on the number of MLNodes connected and their total throughput.

Before proceeding, complete the Quickstart guide through step 3.4, which includes:

  • Hardware and software requirements
  • Download deployment files
  • Container access authentication
  • Key management setup (Account Key and ML Operational Key)
  • Participant registration and permissions

Starting the network and inference node

This section describes how to deploy a distributed setup with a network node and multiple inference nodes.

Recommendation

All inference nodes should be registered with the same network node, regardless of their geographic location. Whether the clusters are deployed in different regions or across multiple data centers, each inference node should always connect back to the same network node.

Starting the network node

Make sure you have completed the Quickstart guide through the key management and participant registration steps beforehand.

This server becomes the main entry point for external participants. It must be exposed to the public internet (a static IP or domain name is recommended). Host it on a stable, high-bandwidth server with robust security, since network reliability is essential.

Single-Machine Deployment: Network Node + Inference Node

If your network node server has GPU(s) and you want to run both the network node and an inference node on the same machine, execute the following commands in the gonka/deploy/join directory:

source config.env && \
docker compose -f docker-compose.yml -f docker-compose.mlnode.yml up -d && \
docker compose -f docker-compose.yml -f docker-compose.mlnode.yml logs -f

This will start one network node and one inference node on the same machine.
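
To confirm that both services came up, you can check the container status (a quick sanity check; container names depend on your compose project):

# List the containers started from both compose files
docker compose -f docker-compose.yml -f docker-compose.mlnode.yml ps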

Separate Deployment: Network Node Only

If your network node server has no GPU and you want your server to run only the network node (without inference node), execute the following in the gonka/deploy/join directory:

source config.env && \
docker compose -f docker-compose.yml up -d && \
docker compose -f docker-compose.yml logs -f

Note

The address set as DAPI_API__POC_CALLBACK_URL for the network node must be accessible from ALL inference nodes (port 9100 of the api container by default).
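
As a quick sanity check, you can verify from each inference node server that the callback address is reachable. The address below is a placeholder; substitute the value you set for DAPI_API__POC_CALLBACK_URL:

# From an inference node server; 203.0.113.10:9100 is a placeholder for your callback address
curl -v http://203.0.113.10:9100/ -o /dev/null
# A successful TCP connection (even with an HTTP error status) confirms the port is reachable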

Network Node Status

Once active, the network node will start participating in the upcoming Proof of Computation (PoC). Its weight is updated based on the work produced by connected inference nodes. If no inference nodes are connected, the node will not participate in the PoC or appear in the list. After the following PoC, the network node will appear in the list of active participants (please allow 1–3 hours for the changes to take effect):

http://195.242.13.239:8000/v1/epochs/current/participants

If you add more servers with inference nodes (following the instructions below), the updated weight will be reflected in the list of active participants after the next PoC.
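
For example, you can query the endpoint and pretty-print the response to look for your participant entry (jq is used here purely for readability; the exact response schema may vary):

curl -s http://195.242.13.239:8000/v1/epochs/current/participants | jq .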

Running the inference node on a separate server

On the other servers, run only the inference node by following the instructions below.

Step 1. Configure the Inference Node

1.1. Download Deployment Files

Clone the repository with the base deploy scripts:

git clone https://github.com/gonka-ai/gonka.git -b main

Authentication required

If prompted for a password, use a GitHub personal access token (classic) with repo access.

1.2. (Optional) Pre-download Model Weights to Hugging Face Cache (HF_HOME)

Inference nodes download model weights from Hugging Face. To ensure the model weights are ready for inference, we recommend downloading them before deployment. Choose one of the following options.

Option 1: Local download

export HF_HOME=/path/to/your/hf-cache

Create a writable directory (e.g. ~/hf-cache) and pre-load models if desired. Right now, the network supports two models: Qwen/Qwen2.5-7B-Instruct and Qwen/QwQ-32B.

huggingface-cli download Qwen/Qwen2.5-7B-Instruct
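
For example, to make the cache location persistent across shell sessions and pre-load the second supported model as well (paths are illustrative; adjust to your environment):

# Create the cache directory and make HF_HOME persistent for future shells
mkdir -p ~/hf-cache
echo 'export HF_HOME=$HOME/hf-cache' >> ~/.bashrc

# Pre-download the second supported model
huggingface-cli download Qwen/QwQ-32B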

Option 2: 6Block NFS-mounted cache (for participants on 6Block internal network)

Mount shared cache:

sudo mount -t nfs 172.18.114.147:/mnt/toshare /mnt/shared
export HF_HOME=/mnt/shared
The path /mnt/shared only works in the 6Block testnet with access to the shared NFS.
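
If you want the NFS mount to survive reboots, you can add it to /etc/fstab (a sketch, assuming the same NFS server and export path as above):

# Persist the shared cache mount across reboots
echo '172.18.114.147:/mnt/toshare /mnt/shared nfs defaults 0 0' | sudo tee -a /etc/fstab
sudo mkdir -p /mnt/shared && sudo mount -a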

1.3. Authenticate with Docker Registry

Some Docker images used in this guide are private. Make sure to authenticate with the GitHub Container Registry:

docker login ghcr.io -u <YOUR_GITHUB_USERNAME>
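
To avoid entering the token interactively, you can pass a GitHub personal access token via stdin (GITHUB_PAT is a placeholder environment variable holding your token):

# Non-interactive login using a personal access token stored in an environment variable
echo "$GITHUB_PAT" | docker login ghcr.io -u <YOUR_GITHUB_USERNAME> --password-stdin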

1.4. Ports open for network node connections

  • 5050 – Inference requests (mapped to port 5000 of the MLNode)
  • 8080 – Management API (mapped to port 8080 of the MLNode)

Important

These ports must not be exposed to the public internet (they should be accessible only within the network node environment).
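
One way to enforce this is a host firewall rule that only allows the network node's IP to reach these ports. A minimal sketch with ufw, assuming a network node at 198.51.100.20 (placeholder address):

# Allow only the network node to reach the MLNode ports; deny everyone else
sudo ufw allow from 198.51.100.20 to any port 5050 proto tcp
sudo ufw allow from 198.51.100.20 to any port 8080 proto tcp
sudo ufw deny 5050/tcp
sudo ufw deny 8080/tcp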

Step 2. Launch the Inference Node

On the inference node's server, change to the gonka/deploy/join directory and execute:

docker compose -f docker-compose.mlnode.yml up -d && docker compose -f docker-compose.mlnode.yml logs -f

This will deploy the inference node, which will start handling inference and PoC tasks as soon as it is registered with your network node (instructions below).
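
Before registering, you can verify from the network node server that the inference node's ports are reachable (198.51.100.30 is a placeholder for your inference server's IP):

# From the network node server: check TCP reachability of the MLNode ports
nc -zv 198.51.100.30 5050
nc -zv 198.51.100.30 8080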

Adding (Registering) Inference Nodes with the Network Node

Note

Usually, it takes the server a couple of minutes to start. However, if your server does not accept requests after 5 minutes, please contact us for assistance.

You must register each inference node with the network node to make it operational. The recommended method is via the Admin API for dynamic management, which is accessible from the terminal of your network node server.

curl -X POST http://localhost:9200/admin/v1/nodes \
     -H "Content-Type: application/json" \
     -d '{
       "id": "<unique_id>",
       "host": "<your_inference_node_static_ip>",
       "inference_port": <inference_port>,
       "poc_port": <poc_port>,
       "max_concurrent": <max_concurrent>,
       "models": {
         "<model_name>": {
           "args": [
              <model_args>
           ]
         }
       }
     }'

Parameter descriptions

  • id – A unique identifier for your inference node. Example: node1
  • host – The static IP of your inference node, or the Docker container name if running in the same Docker network. Example: http://<mlnode_ip>
  • inference_port – The port where the inference node accepts inference and training tasks. Example: 5000
  • poc_port – The port used for MLNode management. Example: 8000
  • max_concurrent – The maximum number of concurrent inference requests this node can handle. Example: 500
  • models – The supported models that the inference node can serve (see below).
      • model_name – The name of the model. Example: Qwen/QwQ-32B
      • model_args – vLLM arguments for inference of the model. Example: "--quantization","fp8","--kv-cache-dtype","fp8"

Right now, the network supports two models: Qwen/Qwen2.5-7B-Instruct and Qwen/QwQ-32B. Both are quantized to FP8, and the QwQ model additionally uses an FP8 KV cache.

To ensure correct setup and optimal performance, use the arguments that best match your model and GPU layout:

  • Qwen/Qwen2.5-7B-Instruct: "--quantization","fp8"
  • Qwen/QwQ-32B on 8xA100 or 8xH100: "--quantization","fp8","--kv-cache-dtype","fp8"
  • Qwen/QwQ-32B on 8x3090 or 8x4090: "--quantization","fp8","--kv-cache-dtype","fp8","--tensor-parallel-size","4"
  • Qwen/QwQ-32B on 8x3080: "--quantization","fp8","--kv-cache-dtype","fp8","--tensor-parallel-size","4","--pipeline-parallel-size","2"

vLLM performance tuning reference

For detailed guidance on selecting optimal deployment configurations and vLLM parameters tailored to your GPU hardware, refer to the Benchmark to Choose Optimal Deployment Config for LLMs guide.

If the node is successfully added, the response will return the configuration of the newly added inference node.
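
As an illustration, here is a filled-in registration request for a node serving Qwen/QwQ-32B on 8xA100. The id, host address, and ports are placeholders; the ports assume the host-port mapping from step 1.4 (use the MLNode container ports instead if the node runs in the same Docker network):

curl -X POST http://localhost:9200/admin/v1/nodes \
     -H "Content-Type: application/json" \
     -d '{
       "id": "node1",
       "host": "http://198.51.100.30",
       "inference_port": 5050,
       "poc_port": 8080,
       "max_concurrent": 500,
       "models": {
         "Qwen/QwQ-32B": {
           "args": ["--quantization", "fp8", "--kv-cache-dtype", "fp8"]
         }
       }
     }'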

Retrieving All Inference Nodes

To get a list of all registered inference nodes in your network node, use:

curl -X GET http://localhost:9200/admin/v1/nodes
This will return a JSON array containing all configured inference nodes.
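
If jq is installed, you can pretty-print the response to inspect the registered configurations (jq is used purely for readability):

curl -s http://localhost:9200/admin/v1/nodes | jq .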

Removing an inference node

From your network node server, use the following Admin API request to remove an inference node dynamically without restarting:

curl -X DELETE "http://localhost:9200/admin/v1/nodes/{id}" -H "Content-Type: application/json"
Here, id is the identifier specified when the inference node was registered. If successful, the response will be true.