FAQ
Overview
What is Gonka?
Gonka is a decentralized network for high‑efficiency AI compute — run by those who run it. It functions as a cost-effective and efficient alternative to centralized cloud services for AI model training and inference. As a protocol, it's not a company or a start-up.
- In terms of Blockchain, Gonka is the foundational ledger and coordination layer (L1) of the decentralized AI network. It records balances, transactions and cryptographic artifacts that prove Hosts have correctly performed AI work, while all actual computations (such as inference and training) happen off-chain.
- In terms of Network, Gonka is a comprehensive ecosystem of participants, including Hosts and Developers, who interact through a decentralized infrastructure. Powered by the Gonka Blockchain, the network distributes tasks, verifies results, and rewards only honest participation and verifiable useful work, creating a competitive, scalable environment for AI workloads.
What problem is Gonka solving?
Gonka is a decentralized AI infrastructure built to reduce dependence on centralized cloud providers and to use computational power more efficiently than traditional decentralized networks. Its goal is to direct as much compute as possible toward useful AI tasks, such as inference and training, while minimizing waste due to consensus overhead.
Who are the key participants in the Gonka ecosystem?
The Gonka ecosystem has four key participant groups:
- Developer builds and deploys AI applications by leveraging the network’s distributed computing power.
- Gonka Contributor participates in development of the core blockchain codebase, protocol upgrades, performance optimizations, security patches, and new feature integrations.
- Holder holds the network’s native coin, which simply means having a Gonka wallet with coins in it. Holders may hold coins, transfer or sell them, spend them on inference and use them according to the protocol rules. Being a holder does not imply any obligation, responsibility, or governance role beyond standard coin ownership.
- Host contributes compute capacity to the network. Hosts perform inference and other computational tasks and are rewarded proportionally to their contributed compute capacity, as long as they maintain honest participation and reliability. Hosts form the backbone of the network. Only Hosts have voting power in the network. This voting power represents their weight in governance and is used to propose and vote on protocol decisions, parameter changes, and upgrades. Any Host can act as a Validator, a Transfer Agent, and an Executor (these are not predefined or on-chain roles, but dynamic operational functions assumed when processing an inference request).
What is the GNK coin?
GNK is the native coin of the Gonka network. It is used to incentivize participants, price resources, and ensure the sustainable growth of the network.
Can I buy GNK coin?
No, you cannot buy GNK on exchanges right now because the coin has not been listed yet. Follow official announcements on Twitter for any updates regarding listings.
At the moment, the main legitimate way to obtain GNK before any listing is to mine as a Host (GNK can already be earned by contributing computational resources to the network).
Important
Be aware that fake GNK listings and pages currently exist, including on CoinGecko. These pages do not represent the official GNK coin and are not affiliated with the project in any way. GNK is not tradable on any exchange at this time. Any coin claiming to be GNK, whether on Solana or other networks, is not an official GNK asset. Always verify information through official channels.
What makes the protocol efficient?
What differentiates Gonka from the "big players" is its pricing and the fact that inference is distributed equally, regardless of a Host's size. To learn more, please review the Whitepaper.
How does the network operate?
The network's operation is collaborative and depends on the role you wish to take:
- As a Developer: You can use the network's computational resources to build and deploy your AI applications.
- As a Host: You can contribute your computational resources to power the network. The protocol is designed to reward you for your contribution, ensuring the network's continuity and sovereignty.
Is this documentation exhaustive?
No. This documentation covers the primary concepts, standard workflows, and the most common operational scenarios of the protocol, but it does not represent the full behavior or implementation details of the codebase. The code includes additional logic, interactions, and edge cases that are not described here.
Because Gonka is an open-source and decentralized network, various parameters, mechanisms, and governance-driven behaviors may evolve through on-chain voting and community decisions. Certain details may change after publication, and not all edge cases or future updates may be reflected immediately.
For Hosts, Developers, and Contributors, the ultimate source of truth is the code itself. If any discrepancy arises between this documentation and the code, the code always prevails.
Participants are encouraged to review the relevant repositories, governance proposals, and network updates to ensure their understanding aligns with the protocol’s current state.
What is the incentive for contributing computational resources?
We've created a dedicated document focused on Tokenomics, where you can find all the information about how the incentive is measured.
What are the hardware requirements?
You can find the minimum and recommended hardware specifications clearly outlined in the documentation. You should review this section to ensure your hardware meets the requirements for effective contribution.
What wallets can I use to store GNK coins?
You can store GNK coin in several supported wallets within the Cosmos ecosystem:
- Keplr
- Cosmostation
- inferenced CLI - a command-line utility for local account management and network operations in Gonka.
Important for existing Leap Wallet users
If you previously created your Gonka account with Leap Wallet, please be aware that Leap is shutting down all of its products on May 28, 2026, including the browser extension, mobile app, and dashboard.
Because Leap is a non-custodial wallet, your assets and account remain on-chain. However, to keep access to your wallet, you should import your existing recovery phrase into another supported wallet, such as Keplr, before Leap services go offline.
Where can I find useful information about Gonka?
Below are the most important resources for learning about the Gonka ecosystem:
- gonka.ai — the main entry point for project information and ecosystem overview.
- Whitepaper — technical documentation describing the architecture, consensus model, Proof-of-Compute, etc.
- Tokenomics — project tokenomics overview, including supply, distribution, incentives, and economic design.
- GitHub — access to the project’s source code, repositories, development activity, and open-source contributions.
- Discord — the primary place for community discussions, announcements, and technical support.
- X (Twitter) — news, updates, and announcements.
Tokenomics
How is governance power calculated in Gonka?
Gonka uses a PoC-weighted voting model:
- Proof-of-Compute (PoC): Voting power is proportional to your verified compute contribution.
- Collateral commitment:
  - 20% of PoC-derived voting weight is activated automatically.
  - To unlock the remaining 80%, you must lock GNK coins as collateral.
- This ensures that governance influence reflects real compute work + economic collateral.
For the first 180 epochs (approximately 6 months), new participants can participate in governance and earn voting weight through PoC alone, without collateral requirements. During this period, the full governance rights are available, while voting weight remains tied to verified compute activity.
Why does Gonka require locking GNK coins for governance power?
Voting power is never derived solely from holding coins. GNK coins serve as economic collateral, not as a source of influence. Influence is earned through continuous computational contribution, while locking GNK collateral is required to secure participation in governance and enforce accountability.
Collateral
What is collateral?
Collateral is required to activate the collateral-eligible portion of PoC weight after the Grace Period (first 180 epochs). After the Grace Period:
- Base Weight (default 20%) is always active.
- The remaining weight requires GNK collateral to become active.
Collateral ensures that participants with governance weight also bear economic responsibility. Parameters are defined on-chain and may change via governance. Always verify current values before making economic decisions.
Is collateral required per node or per account?
Collateral is deposited per account. If multiple ML nodes are linked to the same account, the required collateral is calculated based on the total account weight across all nodes.
Do I need to deposit collateral?
Yes, if you want to activate more than the Base Weight. If no collateral is deposited, only the Base Weight remains active.
How much collateral is required?
Formula:
Required Collateral =
Total Weight × (1 - base_weight_ratio) × collateral_per_weight_unit
Recommended (with conservative buffer):
Total Weight × 2 × (1 - base_weight_ratio) × collateral_per_weight_unit
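As a rough illustration of the two formulas above, the sketch below evaluates them with integer arithmetic. The function names are hypothetical, base_weight_ratio is passed as a whole percent (20 for the default 20%), and the collateral_per_weight_unit value is a made-up placeholder in ngonka, not the on-chain parameter. Always read the current values from the chain before relying on the result.

```shell
# Sketch only: evaluates the collateral formulas with integer math.
# Arguments: total weight, base weight as a whole percent, and
# collateral_per_weight_unit in ngonka (placeholder value below).

required_collateral() (
  total_weight=$1 base_weight_pct=$2 collateral_per_unit=$3
  echo $(( total_weight * (100 - base_weight_pct) * collateral_per_unit / 100 ))
)

# Same formula with the conservative 2x buffer.
recommended_collateral() (
  total_weight=$1 base_weight_pct=$2 collateral_per_unit=$3
  echo $(( 2 * total_weight * (100 - base_weight_pct) * collateral_per_unit / 100 ))
)

required_collateral 5000 20 1000000      # prints 4000000000
recommended_collateral 5000 20 1000000   # prints 8000000000
```

With a total weight of 5000 and a 20% base weight, 4000 weight units are collateral-eligible, so the placeholder rate of 1,000,000 ngonka per unit gives 4,000,000,000 ngonka (4 GNK), or twice that with the buffer.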
Can I partially collateralize my weight?
Yes. Your total Active Weight consists of:
- Base Weight (always active)
- Collateral-Eligible Weight (activated proportionally to deposited collateral)
If you deposit less than the full required amount:
- Base Weight remains fully active
- Only the corresponding portion of collateral-eligible weight becomes active
- The remaining portion stays inactive
Active Weight is calculated as:
Active Weight =
Base Weight +
(Deposited Collateral / Required Collateral) × Collateral-Eligible Weight
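The Active Weight formula above can be sketched as a small shell function. The function name and input values are hypothetical, and capping the deposit at the required amount is an assumption (the text above does not say what happens when you over-collateralize).

```shell
# Sketch only: Active Weight = Base Weight
#   + (Deposited / Required) x Collateral-Eligible Weight
# Arguments: total weight, base weight as a whole percent,
# deposited collateral and required collateral (both in ngonka).

active_weight() (
  total_weight=$1 base_weight_pct=$2 deposited=$3 required=$4
  base=$(( total_weight * base_weight_pct / 100 ))
  eligible=$(( total_weight - base ))
  # Nothing to collateralize: all weight is already active.
  if [ "$required" -eq 0 ]; then echo "$total_weight"; exit 0; fi
  # Assumption: extra collateral does not raise weight above the total.
  if [ "$deposited" -gt "$required" ]; then deposited=$required; fi
  echo $(( base + deposited * eligible / required ))
)

active_weight 5000 20 0 4000000000            # prints 1000 (Base Weight only)
active_weight 5000 20 2000000000 4000000000   # prints 3000 (half collateralized)
active_weight 5000 20 4000000000 4000000000   # prints 5000 (fully collateralized)
```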
What happens if I do not deposit enough collateral?
Your Active Weight is reduced proportionally. Because rewards are distributed proportionally to Active Weight, other hosts receive a larger share of emissions when you under-collateralize. Inactive weight is not directly redistributed; it simply does not participate in consensus.
When does collateral take effect?
Collateral must be deposited before the start of the epoch to be effective. Collateral deposited during an epoch:
- does NOT increase weight immediately
- applies starting from the next epoch
Collateral cannot be increased mid-epoch.
In what unit do I deposit collateral?
Transactions must use ngonka, not GNK.
1 GNK = 1,000,000,000 ngonka
10 GNK = 10,000,000,000 ngonka
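To avoid miscounting zeros when preparing a transaction amount, the conversion above can be scripted. This is a minimal sketch with a hypothetical helper name. It accepts whole or decimal GNK amounts and silently truncates anything beyond 9 decimal places.

```shell
# Sketch only: convert a GNK amount to an ngonka amount string
# (1 GNK = 1,000,000,000 ngonka).

gnk_to_ngonka() (
  whole=${1%%.*}
  frac=""
  case $1 in *.*) frac=${1#*.} ;; esac
  frac=$(printf '%.9s' "$frac")                  # keep at most 9 decimal places
  while [ "${#frac}" -lt 9 ]; do frac="${frac}0"; done
  # Strip leading zeros so the shell does not read the digits as octal.
  whole=$(printf '%s' "$whole" | sed 's/^0*//')
  frac=$(printf '%s' "$frac" | sed 's/^0*//')
  echo "$(( ${whole:-0} * 1000000000 + ${frac:-0} ))ngonka"
)

gnk_to_ngonka 1      # prints 1000000000ngonka
gnk_to_ngonka 10     # prints 10000000000ngonka
gnk_to_ngonka 0.5    # prints 500000000ngonka
```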
Can collateral be slashed?
Yes. Collateral may be slashed for:
- Invalid inference
- Downtime (Confirmation PoC failure or jail)
Invalid inference slashing is capped at once per epoch. Downtime slashing may be applied per jail event.
What happens to slashed coins?
Currently, slashed GNK is permanently burned and removed from circulation. Future governance may change this mechanism.
Can I withdraw collateral?
Yes. Withdrawal triggers an unbonding period (default: 1 epoch). During unbonding, collateral remains subject to slashing. After unbonding, funds are automatically returned to your account balance.
What collateral is NOT
- Collateral is NOT voting power. Voting power is derived from PoC weight, not token balance.
- Collateral is NOT delegation. Each account must back its own weight.
- Collateral is NOT a permanent lock. It can be withdrawn (subject to unbonding).
- Collateral was NOT required during the Grace Period (first 180 epochs).
How are epoch-minted rewards distributed?
A fixed amount of GNK is minted each epoch and distributed proportionally to Active PoC Weight. Active Weight determines:
- Your share of epoch-minted Reward Coins
- Your governance influence
If your Active Weight is reduced due to insufficient collateral, your share of epoch rewards decreases proportionally. Inactive weight does not receive rewards.
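The proportional split described above can be illustrated with a short calculation. All numbers here are invented for the example: the epoch mint amount and the network-wide Active Weight are placeholders, not real network values, and the function name is hypothetical.

```shell
# Sketch only: a Host's share of the epoch mint is proportional to its
# Active Weight relative to the network-wide Active Weight.
# Amounts are in ngonka; keep inputs small enough to avoid 64-bit overflow.

epoch_reward() (
  epoch_mint=$1 my_active_weight=$2 network_active_weight=$3
  echo $(( epoch_mint * my_active_weight / network_active_weight ))
)

# Fully collateralized: 5000 of 100000 active weight units.
epoch_reward 1000000000000 5000 100000   # prints 50000000000

# Same Host with Base Weight only: 1000 of 96000 active weight units.
epoch_reward 1000000000000 1000 96000    # prints 10416666666
```

In the second call the Host's 4000 inactive units also drop out of the network total, which is why the other Hosts' shares grow.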
Do I need to manually deposit collateral?
Yes. Collateral must be deposited by submitting an on-chain transaction. It is not activated automatically. If no collateral is deposited:
- Your node continues operating normally.
- It is not jailed or disabled.
- Only the Base Weight (e.g. 20%) remains active.
Your rewards and governance influence will be reduced proportionally.
Can vested (locked) GNK be used as collateral?
No. Collateral must be deposited from your available (unlocked) GNK balance. Vested coins that are not yet released cannot be used as collateral.
Governance
What types of changes require a Governance Proposal?
Governance Proposals are required for any on-chain changes that affect the network, for example:
- Updating module parameters (MsgUpdateParams)
- Executing software upgrades
- Adding, updating, or deprecating inference models
- Any other actions that must be approved and executed via the governance module
Who can create a Governance Proposal?
Anyone with a valid governance key (cold account) can pay the required fee and create a Governance Proposal. However, each proposal must still be approved by active participants through PoC-weighted voting. Proposers are encouraged to discuss significant changes off-chain first (for example, via GitHub or community forums) to increase the likelihood of approval. See the full guide.
What happens if a proposal fails?
- If a proposal does not meet quorum → it automatically fails
- If the majority votes no → proposal rejected, no on-chain changes
- If a significant percentage votes no_with_veto (above veto threshold) → proposal is rejected and flagged, signaling strong community disagreement
- Deposits may or may not be refunded, depending on chain settings
Can governance parameters themselves be changed?
Yes. All key governance rules (quorum, majority threshold, and veto threshold) are on-chain configurable and can be updated via Governance Proposals. This allows the network to evolve its decision-making rules as participation patterns and compute economics change.
What should I do if I cannot vote because I do not have access to the cold key, or if I want another key to vote on my behalf?
If the key that holds voting power is not the key you use for day-to-day operations, governance voting permission can be granted in advance.
In this setup:
- Granter = account that owns voting power (cold key)
- Grantee = account that will submit votes on the granter’s behalf (warm key)
There are two common scenarios:
1. You want to vote, but you do not have access to the key that holds the voting power.
Please contact the owner of that key and ask them to grant your key permission to vote on their behalf. Without this authorization, your key cannot submit a governance vote for that voting power.
2. You want another key to vote on your behalf.
Use the grant command below from the key that holds the voting power. This will authorize the grantee key to submit governance votes for you. This delegation only allows voting on governance proposals. The grantee can still vote for their own key as well. The granter can revoke this permission at any time.
1) Grant voting permission (run from the granter key)
./inferenced tx authz grant <GRANTEE_GONKA_ADDRESS> generic \
--msg-type=/cosmos.gov.v1beta1.MsgVote \
--from=<GRANTER_KEY_NAME> \
--chain-id=gonka-mainnet \
--expiration=<UNIX_TIMESTAMP> \
--home .inference \
--keyring-backend file
{
"height": "0",
"txhash": "8D96FB6FC06FFB928FBC89FE950689CD040C7F338C197BA856175EC7462A3FFA",
"codespace": "",
"code": 0,
"data": "",
"raw_log": "",
"logs": [],
"info": "",
"gas_wanted": "0",
"gas_used": "0",
"tx": null,
"timestamp": "",
"events": []
}
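The --expiration flag in the grant command takes a Unix timestamp. One way to produce it is sketched below. The helper name and the 90-day lifetime are arbitrary choices for illustration, and the sketch assumes your date command supports +%s.

```shell
# Sketch only: compute a Unix timestamp N days from now for --expiration.

expiration_after_days() (
  days=$1
  echo $(( $(date +%s) + days * 86400 ))
)

EXPIRATION=$(expiration_after_days 90)
echo "--expiration=${EXPIRATION}"
```

A shorter lifetime limits how long a compromised warm key could vote on your behalf. The grant can also be revoked explicitly at any time.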
2) Verify the grant exists (run from any node)
./inferenced query authz grants <GRANTER_GONKA_ADDRESS> <GRANTEE_GONKA_ADDRESS> \
--node="http://<MAINNET_NODE_URL>:26657" \
--output=json | jq .
{
"grants": [
{
"authorization": {
"type": "cosmos-sdk/GenericAuthorization",
"value": {
"msg": "/cosmos.gov.v1beta1.MsgVote"
}
},
"expiration": "2026-12-03T18:38:18Z"
}
],
"pagination": {
"total": "1"
}
}
3) Vote using the grantee
# Find the proposal ID which you are voting for - use it as <VOTE_PROPOSAL_ID> in the voting body
./inferenced query gov proposals --output json
# Prepare the file with the voting body
cat > /tmp/authz-vote.json << 'EOF'
{
"body": {
"messages": [
{
"@type": "/cosmos.authz.v1beta1.MsgExec",
"grantee": "<GRANTEE_GONKA_ADDRESS>",
"msgs": [
{
"@type": "/cosmos.gov.v1beta1.MsgVote",
"proposal_id": "<VOTE_PROPOSAL_ID>",
"voter": "<GRANTER_GONKA_ADDRESS>",
"option": "VOTE_OPTION_YES"
}
]
}
]
}
}
EOF
# Vote using the file
./inferenced tx authz exec /tmp/authz-vote.json \
--from=<GRANTEE_KEY_NAME> \
--chain-id=gonka-mainnet \
--home .inference \
--keyring-backend file \
--node="http://<MAINNET_NODE_URL>:26657" -y
Example output of the proposals query:
{
"pagination": {
"total": "1"
},
"proposals": [
{
"deposit_end_time": "2026-03-06T10:40:07.016920026Z",
"final_tally_result": {
"abstain_count": "0",
"no_count": "0",
"no_with_veto_count": "0",
"yes_count": "0"
},
"id": "1",
"messages": [
{
"type": "cosmos-sdk/MsgSoftwareUpgrade",
"value": {
"authority": "gonka10d07y265gmmuvt4z0w9aw880jnsr700j2h5m33",
"plan": {
"height": "406062",
"info": "{\n \"binaries\":{\n \"linux/amd64\":\"https://github.com/product-science/race-releases/releases/download/release%2Fv0.2.10-testnet1/inferenced-amd64.zip?checksum=sha256:fb71310427436aebac32813735231882fca420cf0d94b036f8cacd055d0e1c78\"\n },\n \"api_binaries\":{\n \"linux/amd64\":\"https://github.com/product-science/race-releases/releases/download/release%2Fv0.2.10-testnet1/decentralized-api-amd64.zip?checksum=sha256:6fe214f4bb2d831c02ce407682820d95d01e6ae94a33fe9c4617b80e0ca716ce\"\n }\n }",
"name": "v0.2.10",
"time": "0001-01-01T00:00:00Z"
}
}
}
],
"proposer": "gonka1xfvr8mywcrxrcrryvj8c5d2grvyjdj5c90fd88",
"status": 2,
"submit_time": "2026-03-04T10:40:07.016920026Z",
"summary": "Upgrade Proposal v0.2.10",
"title": "Upgrade Proposal v0.2.10",
"total_deposit": [
{
"amount": "50000000",
"denom": "ngonka"
}
],
"voting_end_time": "2026-03-04T10:50:07.016920026Z",
"voting_start_time": "2026-03-04T10:40:07.016920026Z"
}
]
}
Voting options:
- VOTE_OPTION_YES
- VOTE_OPTION_ABSTAIN
- VOTE_OPTION_NO
- VOTE_OPTION_NO_WITH_VETO
4) Revoke delegation (run from the granter key)
./inferenced tx authz revoke <GRANTEE_GONKA_ADDRESS> /cosmos.gov.v1beta1.MsgVote \
--from=<GRANTER_KEY_NAME> \
--chain-id=gonka-mainnet \
--home .inference \
--keyring-backend file
{
code: 0
codespace: ""
data: ""
events: []
gas_used: "0"
gas_wanted: "0"
height: "0"
info: ""
logs: []
raw_log: ""
timestamp: ""
tx: null
txhash: A2C3CDA9E95DCF143C0D8981A4F573F1E68879ECF4903B25BA97383C3F2FDFBA
}
Improvement proposals
What’s the difference between Governance Proposals and Improvement Proposals?
Governance Proposals → on-chain proposals. Used for changes that directly affect the network and require on-chain voting. Examples:
- Updating network parameters (MsgUpdateParams)
- Executing software upgrades
- Adding new models or capabilities
- Any modification that needs to be executed by the governance module
Improvement Proposals → off-chain proposals under the control of active participants. Used for shaping the long-term roadmap, discussing new ideas, and coordinating larger strategic changes.
- Managed as Markdown files in the /proposals directory
- Reviewed and discussed through GitHub Pull Request
- Approved proposals are merged into the repository
How are Improvement Proposals reviewed and approved?
The goal of community proposal review is to gather community validation: reactions, comments, and concrete feedback that strengthen the case for eventual governance approval. This is especially relevant if implementing the proposal requires a lot of work, long-term commitment, coordination, or significant changes to the protocol.
- Read the recommended guide first: https://github.com/gonka-ai/gonka/discussions/795. It explains what belongs in Improvement Proposals and how to write a strong, structured proposal.
- Publish and discuss improvement proposals in GitHub Discussions (preferred); previously they were stored as Markdown files in the /proposals directory.
- To help the community evaluate your proposal (and improve its chances later in governance), it’s in the proposer’s interest and responsibility to actively gather early feedback and support signals (reactions, comments, concrete concerns).
- Share the Discussion link in Discord’s #improvements-proposals channel for reach and visibility, and amplify it through any other channels available to you (including direct outreach to Hosts/miners) to gather practical input and support.
- Share context about your experience and expertise in the proposal thread. If you represent a team or a company, mention it and link relevant work to help the community assess credibility and evaluate the proposal more efficiently.
- Community review:
- Active contributors and maintainers discuss the proposal in GitHub Discussions. Conversation can happen on any platform, but please consolidate the key context back in GitHub Discussions: it keeps the full history in one place, stays searchable, and is much easier to maintain over time. GitHub is the main source of truth.
- Please ask questions, provide feedback, suggestions, refinements, and upvote relevant proposals. Everybody’s attention and participation in this process is essential for sustainable evolution of the chain.
- Strong positive feedback and a high number of upvotes signal genuine community demand, allowing teams to treat well-received proposals as part of a community-driven roadmap and begin implementation with confidence in both community alignment and eventual governance approval. Note that feedback from the hosts is essential - it can help structure the project into milestones, unlock partial bounty payments, and even secure grants from the community pool. Ultimately, however, all on-chain updates and payments are subject to governance approval.
Can an Improvement Proposal lead to a Governance Proposal?
Yes. Often, an Improvement Proposal is used to explore ideas and gather consensus before drafting a Governance Proposal. For example:
- You might first propose a new model integration as an Improvement Proposal.
- After the community agrees, an on-chain Governance Proposal is created to update parameters or trigger the software upgrade.
Voting
How does the voting process work?
- Once a proposal is submitted and funded with the minimum deposit, it enters the voting period
- Voting options: yes, no, no_with_veto, abstain
  - yes → approve the proposal
  - no → reject the proposal
  - no_with_veto → reject and signal a strong objection
  - abstain → neither approve nor reject, but counts toward quorum
- You can change your vote anytime during the voting period; only your last vote is counted
- If quorum and thresholds are met, the proposal passes and executes automatically via the governance module
To vote, you can use the command below. This example votes yes, but you can replace it with your preferred option (yes, no, no_with_veto, abstain):
./inferenced tx gov vote 2 yes \
--from <cold_key_name> \
--keyring-backend file \
--unordered \
--timeout-duration=60s --gas=2000000 --gas-adjustment=5.0 \
--node $NODE_URL/chain-rpc/ \
--chain-id gonka-mainnet \
--yes
How can I track the status of a Governance Proposal?
You can query the proposal status at any time using the CLI:
export NODE_URL=http://47.236.19.22:18000
./inferenced query gov tally 2 -o json --node $NODE_URL/chain-rpc/
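To read a tally against the quorum, majority, and veto rules, the sketch below applies the standard Cosmos SDK governance checks to the four counts. The function name is hypothetical. The default thresholds (33.4% quorum, 50% majority, 33.4% veto, given in basis points) are the stock Cosmos SDK values and only an assumption for Gonka, so check the current on-chain governance parameters. The total voting power is not part of the tally output and has to be supplied separately.

```shell
# Sketch only: classify a tally the way the Cosmos SDK gov module does.
# Arguments: yes, no, no_with_veto, abstain, total voting power,
# then optional quorum / majority / veto thresholds in basis points.

tally_outcome() (
  yes=$1 no=$2 veto=$3 abstain=$4 total_power=$5
  quorum_bp=${6:-3340} threshold_bp=${7:-5000} veto_bp=${8:-3340}
  voted=$(( yes + no + veto + abstain ))
  non_abstain=$(( voted - abstain ))
  if [ $(( voted * 10000 )) -lt $(( total_power * quorum_bp )) ]; then
    echo "failed_quorum"          # too little voting power participated
  elif [ "$non_abstain" -eq 0 ]; then
    echo "rejected"               # everyone abstained
  elif [ $(( veto * 10000 )) -gt $(( voted * veto_bp )) ]; then
    echo "rejected_with_veto"     # veto share exceeds the veto threshold
  elif [ $(( yes * 10000 )) -gt $(( non_abstain * threshold_bp )) ]; then
    echo "passed"
  else
    echo "rejected"
  fi
)

tally_outcome 600 100 50 50 1000   # prints passed
tally_outcome 100 50 0 0 1000      # prints failed_quorum
```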
Running a Node
What if I want to stop mining but still use my account when I come back?
To restore a Network Node in the future, it will be sufficient to back up:
- cold key (most important, everything else can be rotated)
- secrets from tmkms: .tmkms/secrets/
- keyring from .inference: .inference/keyring-file/
- node key from .inference/config: .inference/config/node_key.json
- password for warm key: KEYRING_PASSWORD
My node was jailed. What does it mean?
Your validator has been jailed because it signed fewer than 50 blocks out of the last 100 blocks (the requirement counts the total number of signed blocks in that window, not consecutive ones). This means your node was temporarily excluded (about 15 minutes) from block production to protect network stability. There are several possible reasons for this:
- Consensus Key Mismatch. The consensus key used by your node may differ from the one registered on-chain for your validator. Make sure the consensus key you are using matches the one registered on-chain for your validator.
- Unstable Network Connection. Network instability or interruptions can prevent your node from reaching consensus, causing missed signatures. Ensure your node has a stable, low-latency connection and isn’t overloaded by other processes.
Rewards: Even if your node is jailed, you will continue to receive most of the rewards as a Host as long as it remains active in inference or other validator-related work. So, the reward is not lost unless inference issues are detected.
How to Unjail Your Node: To resume normal operation, unjail your validator once the issue is resolved. Use your cold key to submit the unjail transaction:
export NODE_URL=http://<NODE_URL>:<port>
./inferenced tx slashing unjail \
--from <cold_key_name> \
--keyring-backend file \
--chain-id gonka-mainnet \
--gas auto \
--gas-adjustment 1.5 \
--fees 200000ngonka \
--node $NODE_URL/chain-rpc/
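To check whether the node is still jailed, query the validator:
./inferenced query staking delegator-validators \
<cold_key_addr> \
--node $NODE_URL/chain-rpc/
While the node is jailed, the output contains jailed: true.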
How to decommission an old cluster?
Follow this guide to safely shut down an old cluster without impacting reputation.
1) Use the following command to disable each ML Node:
curl -X POST http://localhost:9200/admin/v1/nodes/<id>/disable
You can list all node IDs with:
curl http://localhost:9200/admin/v1/nodes | jq '.[].node.id'
2) Nodes that are not scheduled to serve inference during the next Proof-of-Compute (PoC) will automatically stop during that PoC. Nodes that are scheduled to serve inference will remain active for one more epoch before stopping. You can verify a node’s status in the mlnode field at:
curl http://<inference_url>/v1/epochs/current/participants
Once a node is marked as disabled, it is safe to power off the MLNode server.
3) After all MLNodes have been disabled and powered off, you can shut down the Network Node. Before doing so, it’s recommended (but optional) to back up the following files:
- .dapi/api-config.yaml
- .dapi/gonka.db (created after on-chain upgrade)
- .inference/config/
- .inference/keyring-file/
- .tmkms/
If you skip the backup, the setup can still be restored later using your Account Key.
My node cannot connect to the default seed node specified in the config.env
If your node cannot connect to the default seed node, simply point it to another one by updating three variables in config.env.
- SEED_API_URL - HTTP endpoint of the seed node (used for API communication). Choose any URL from the list below and assign it directly to SEED_API_URL.
export SEED_API_URL=<chosen_http_url>
Available genesis API URLs:
http://185.216.21.98:8000
http://36.189.234.197:18026
http://36.189.234.237:17241
http://node1.gonka.ai:8000
http://node2.gonka.ai:8000
http://node3.gonka.ai:8000
https://node4.gonka.ai
http://47.236.26.199:8000
http://47.236.19.22:18000
http://gonka.spv.re:8000
- SEED_NODE_RPC_URL - Public Tendermint RPC access MUST go through the seed node HTTP(S) proxy path /<chain-rpc>. Use the same scheme (http or https), host, and port as in SEED_API_URL, and append /chain-rpc.
export SEED_NODE_RPC_URL=http://<host>/chain-rpc
Example
SEED_NODE_RPC_URL=http://node2.gonka.ai:8000/chain-rpc/
Important
- Do NOT use http://<host>:26657 as a public RPC endpoint.
- Port 26657 MUST be internal-only (localhost/private network). Public RPC must go via /<chain-rpc>.
- SEED_NODE_P2P_URL - the P2P address used for networking between nodes. You must obtain the P2P port from the seed node’s status endpoint via the same /<chain-rpc> proxy.
Query the node:
http://<host>:<http_port>/chain-rpc/status
Example
https://node3.gonka.ai/chain-rpc/status
Find listen_addr in the response, for example:
"listen_addr": "tcp://0.0.0.0:5000"
Use this port:
export SEED_NODE_P2P_URL=tcp://<host>:<p2p_port>
Example
export SEED_NODE_P2P_URL=tcp://node3.gonka.ai:5000
Final result example
export SEED_API_URL=http://node2.gonka.ai:8000
export SEED_NODE_RPC_URL=http://node2.gonka.ai:8000/chain-rpc/
export SEED_NODE_P2P_URL=tcp://node2.gonka.ai:5000
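Because the RPC and P2P values are both derived from the chosen API URL, they can be generated rather than typed by hand. The helpers below are a hypothetical sketch, not part of the Gonka tooling. The P2P port still has to be read from listen_addr in the seed node's status response and passed in, and IPv6 literal hosts are not handled.

```shell
# Sketch only: derive the other two seed variables from SEED_API_URL.

# Same scheme, host and port as the API URL, plus the /chain-rpc/ proxy path.
seed_rpc_url() (
  api_url=${1%/}                 # drop a trailing slash if present
  echo "${api_url}/chain-rpc/"
)

# tcp://<host>:<p2p_port>, where the port comes from listen_addr.
seed_p2p_url() (
  api_url=$1 p2p_port=$2
  host=${api_url#*://}           # strip the scheme
  host=${host%%/*}               # strip any path
  host=${host%%:*}               # strip the HTTP port
  echo "tcp://${host}:${p2p_port}"
)

SEED_API_URL=http://node2.gonka.ai:8000
echo "export SEED_API_URL=${SEED_API_URL}"
echo "export SEED_NODE_RPC_URL=$(seed_rpc_url "$SEED_API_URL")"
echo "export SEED_NODE_P2P_URL=$(seed_p2p_url "$SEED_API_URL" 5000)"
```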
How to change the seed nodes?
There are two distinct ways to update seed nodes, depending on whether the node has already been initialized.
Once the file .node_initialized is created, the system no longer updates seed nodes automatically.
After that point:
- The seed list is used as-is
- Any changes must be done manually
- You can add as many seed nodes as you want
The format is a single comma-separated string:
seeds = "<node1_id>@<node1_ip>:<node1_p2p_port>,<node2_id>@<node2_ip>:<node2_p2p_port>"
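A small helper can assemble the comma-separated value and catch malformed entries before you edit the config. This is a sketch with a hypothetical function name. It assumes node IDs are 40 lowercase hex characters, as in the net_info output shown in this section, and the second entry in the usage example is a dummy placeholder rather than a real seed.

```shell
# Sketch only: validate <node_id>@<host>:<p2p_port> entries and join them
# into a single seeds line.

build_seeds_line() (
  for entry in "$@"; do
    if ! printf '%s\n' "$entry" | grep -Eq '^[0-9a-f]{40}@[^:@]+:[0-9]+$'; then
      echo "invalid seed entry: $entry" >&2
      exit 1
    fi
  done
  IFS=,                          # "$*" joins the arguments with a comma
  echo "seeds = \"$*\""
)

dummy_id=$(printf '%040d' 0)     # 40 zeros, placeholder node ID
build_seeds_line \
  "ce6f26b9508839c29e0bfd9e3e20e01ff4dda360@85.234.78.106:5000" \
  "${dummy_id}@192.0.2.10:5000"
```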
To find candidate seed nodes, list the peers known to a running node:
curl http://47.236.26.199:8000/chain-rpc/net_info | jq
In the response, look for:
- listen_addr - P2P endpoint
- rpc_address - RPC endpoint
Example:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 94098 0 94098 0 0 91935 0 --:--:-- 0:00:01 --:--:-- 91982
{
"jsonrpc": "2.0",
"id": -1,
"result": {
"listening": true,
"listeners": [
"Listener(@tcp://47.236.26.199:5000)"
],
"n_peers": "50",
"peers": [
{
"node_info": {
"protocol_version": {
"p2p": "8",
"block": "11",
"app": "0"
},
"id": "ce6f26b9508839c29e0bfd9e3e20e01ff4dda360",
"listen_addr": "tcp://85.234.78.106:5000",
"network": "gonka-mainnet",
"version": "0.38.17",
"channels": "40202122233038606100",
"moniker": "my-node",
"other": {
"tx_index": "on",
"rpc_address": "tcp://0.0.0.0:26657"
}
},
...
This displays all peers the node currently sees.
Use this method if you want the node to regenerate its configuration and automatically apply the seed nodes defined in config.env.
source config.env
docker compose down node
sudo rm -rf .inference/data/ .inference/.node_initialized
sudo mkdir -p .inference/data/
sudo cat .inference/config/config.toml
seeds = [...]
How are Hardware, Node Weight, and ML Node configuration actually validated?
The chain does not verify real hardware. It only validates the total participant weight, and this is the sole value used for weight distribution and reward calculation.
Any breakdown of this weight across ML Nodes, as well as any “hardware type” or other descriptive fields, is purely informational and can be freely modified by the Host.
When creating or updating a node (for example, via POST http://localhost:9200/admin/v1/nodes as shown in the handler code at https://github.com/gonka-ai/gonka/blob/aa85699ab203f8c7fa83eb1111a2647241c30fc4/decentralized-api/internal/server/admin/node_handlers.go#L62), the hardware field can be explicitly specified. If it is omitted, the API service attempts to auto-detect hardware information from the ML Node.
In practice, many hosts run a proxy ML Node behind which multiple servers operate; auto-detection only sees one of these servers, which is a fully valid setup. Regardless of configuration, all weight distribution and rewards rely solely on the Host total weight, and the internal split across ML Nodes or the reported hardware types never affect on-chain validation.
How to switch to Qwen/Qwen3-235B-A22B-Instruct-2507-FP8, upgrade ML Nodes, and remove other models?
This guide explains how Hosts should update their ML Nodes in response to changes in v0.2.8 model availability and the upcoming PoC v2 update. ML Node configuration compliance with PoC v2 is observed starting Epoch 155. Hosts are encouraged to review and prepare their ML Node configuration before that point. Migration to PoC v2 can be scheduled after epoch 155. After the migration phase, weights from ML Nodes that do not meet the configuration requirements may not be counted.
1. Background: model availability changes (upgrade v0.2.8)
As part of the v0.2.8 upgrade, the active model set has been updated.
Supported models (active set)
Only the following models remain supported:
- Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
- Qwen/Qwen3-32B-FP8
Qwen/Qwen3-32B-FP8 is supported during the migration period, but does not contribute to PoC v2 readiness or weight assignment. Participation in PoC v2 requires serving Qwen/Qwen3-235B-A22B-Instruct-2507-FP8.
Removed models
All previously supported models are removed from the active set and must not be served.
2. PoC v2 readiness criteria (Important)
Successful participation in the PoC v2 transition requires both of the following:
- All your ML Nodes serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8. This is the only model that contributes to PoC v2 weight.
- All your ML Nodes are upgraded to a PoC v2–compatible image:
- ghcr.io/product-science/mlnode:3.0.12-post3
- ghcr.io/product-science/mlnode:3.0.12-post3-blackwell
Important
- Serving the correct model without upgrading the ML Node is not sufficient.
- Nodes that do not meet both conditions will not be eligible once the network switches to a single-model configuration.
- The ML Node upgrade must be completed before the migration is finished and PoC v2 is activated through a separate governance proposal following the v0.2.8 upgrade.
- The v0.2.8 upgrade itself does not enable PoC v2.
3. Check ML Node allocation status (recommended safety step)
Before changing models, you should inspect the current ML Node allocation. Query your Network Node admin API:
curl http://127.0.0.1:9200/admin/v1/nodes
In the response, look for the timeslot_allocation field:
"timeslot_allocation": [
true,
false
]
- First boolean: Whether the node is serving inference in the current epoch
- Second boolean: Whether the node is scheduled to serve inference in the next PoC
Recommended behavior
- Prefer changing the model only on nodes where the second value is false
- This reduces risk while PoC v2 behavior is still being observed
- Gradual rollout across epochs is encouraged
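The recommendation above can be turned into a small check. This is a minimal sketch, assuming the two timeslot_allocation booleans have already been extracted from the Admin API response; the function name model_change_advice is hypothetical, and the jq expression in the usage comment searches recursively because the exact layout of the response is not shown here.

```shell
# Hypothetical helper: decide whether a node is a good candidate for a model
# change, given the two timeslot_allocation booleans from /admin/v1/nodes.
#   $1 - first boolean  (serving inference in the current epoch)
#   $2 - second boolean (scheduled to serve inference in the next PoC)
# Prints "change-now" when the second value is false, otherwise "wait".
model_change_advice() {
  case "$2" in
    false) echo "change-now" ;;
    true)  echo "wait" ;;
    *)     echo "unknown"; return 1 ;;
  esac
}

# Example usage (not executed here). The recursive jq search avoids assuming
# where timeslot_allocation sits inside the response:
#   curl -s http://127.0.0.1:9200/admin/v1/nodes \
#     | jq -r '.. | objects | select(has("timeslot_allocation")) | .timeslot_allocation | "\(.[0]) \(.[1])"' \
#     | while read -r first second; do model_change_advice "$first" "$second"; done
```

Nodes reported as "wait" can be revisited in a later epoch, which matches the gradual rollout suggested above.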
4. Update models for ML Nodes: keep the supported model only
Pre-download model weights (recommended). To avoid startup delays, pre-download weights into HF_HOME:
mkdir -p $HF_HOME
huggingface-cli download Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
Then update each ML Node through the Admin API so that it keeps only the supported model (Qwen/Qwen3-235B-A22B-Instruct-2507-FP8).
For example:
curl -X PUT "http://localhost:9200/admin/v1/nodes/node1" \
-H "Content-Type: application/json" \
-d '{
"id": "node1",
"host": "inference",
"inference_port": 5000,
"poc_port": 8080,
"max_concurrent": 800,
"models": {
"Qwen/Qwen3-235B-A22B-Instruct-2507-FP8": {
"args": [
"--tensor-parallel-size",
"4",
"--max-model-len",
"240000"
]
}
}
}'
Note
node-config.json is used only on the first launch of the Network Node API or when the local state/database is removed. Edit it for a fresh restart. For existing nodes, model updates should be performed via the Admin API.
5. Upgrade the ML Node image (required for PoC v2)
Edit docker-compose.mlnode.yml and update the ML Node image:
Standard GPUs
image: ghcr.io/product-science/mlnode:3.0.12-post3
NVIDIA Blackwell GPUs
image: ghcr.io/product-science/mlnode:3.0.12-post3-blackwell
Then pull the image and restart the services from gonka/deploy/join:
source config.env
docker compose -f docker-compose.yml -f docker-compose.mlnode.yml pull
docker compose -f docker-compose.yml -f docker-compose.mlnode.yml up -d
Confirm the ML Node is serving Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 only, which is the only model used for PoC v2 weights and future weight assignment:
curl http://127.0.0.1:8080/v1/models | jq
curl http://127.0.0.1:9200/admin/v1/nodes
Governance and PoC v2 activation notes
PoC v2 is introduced in stages, not activated all at once.
Stage 1. Observation (current state after v0.2.8)
After the v0.2.8 upgrade, PoC v2 logic is available but not active for weight assignment.
During this stage:
- Hosts are able to serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 or Qwen/Qwen3-32B-FP8
- Hosts must switch their ML Nodes to serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 and upgrade them to PoC v2-compatible versions in order to contribute to PoC v2 weight.
- The network observes adoption to assess Host readiness for moving to PoC v2 weights.
Stage 2. Governance proposal (optional, future)
Once a sufficient level of adoption among active Hosts is observed (approximately 50%):
- A separate governance proposal may be submitted
- This proposal may request approval to activate PoC v2 and use PoC v2 for weight assignment
The adoption threshold is observational only and does not trigger any automatic changes.
Stage 3. Activation (only after governance approval)
PoC v2 becomes the active method of weight assignment only if and when the governance proposal is approved by the chain.
Until this proposal is approved:
- PoC v2 remains inactive for weight assignment
- The existing PoC mechanism continues to be used to determine weight
Summary checklist
Before PoC v2 activation, ensure that:
- ML Node serves Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
- All other models are removed from the configuration
- ML Node image is 3.0.12-post3 (or 3.0.12-post3-blackwell)
Keys & security
Which CLI version should be used for warm keys created after the v0.2.9 upgrade?
For granting permissions to new warm keys created after the v0.2.9 upgrade, the CLI version v0.2.9 should be used.
Where can I find information on key management?
You can find a dedicated section on Key Management in the documentation. It outlines the procedures and best practices for securely managing your application's keys on the network.
I Cleared or Overwrote My Consensus Key
If you are using tmkms and deleted the .tmkms folder, simply restart tmkms — it will automatically generate a new key.
To register the new consensus key, submit the following transaction:
./inferenced tx inference submit-new-participant \
<PUBLIC_URL> \
--validator-key <CONSENSUS_KEY> \
--keyring-backend file \
--unordered \
--from <COLD_KEY_NAME> \
--timeout-duration 1m \
--node http://<node-url>/chain-rpc/ \
--chain-id gonka-mainnet
I Deleted the Warm Key
Back up the cold key on your local device, outside the server.
1) Stop the API container:
docker compose down api --no-deps
2) Set KEY_NAME for the warm key in your config.env file.
3) [SERVER]: Recreate the warm key:
source config.env && docker compose run --rm --no-deps -it api /bin/sh
4) Then execute inside the container:
printf '%s\n%s\n' "$KEYRING_PASSWORD" "$KEYRING_PASSWORD" | \
inferenced keys add "$KEY_NAME" --keyring-backend file
5) [LOCAL]: From your local device (where you backed up the cold key), run the transaction:
./inferenced tx inference grant-ml-ops-permissions \
gonka-account-key \
<address-of-warm-key-you-just-created> \
--from gonka-account-key \
--keyring-backend file \
--gas 2000000 \
--node http://<node-url>/chain-rpc/
6) Start the API container:
source config.env && docker compose up -d
Proof-of-Compute (PoC)
What is Proof-of-Compute?
Proof of Compute (PoC) is a consensus mechanism that replaces capital-based or hash-based weighting with provable Transformer-based computational capability. It defines how real AI compute is measured and converted into governance and consensus weight. PoC is executed through short, synchronized Sprints that occur at the end of each epoch. Outside the Sprint, the epoch is used for real-world AI computation. In practice, the terms Proof of Compute (PoC) and Sprint are often used interchangeably. When referring to “Next PoC” or “PoC phase”, this typically means the next Sprint, which is the execution phase of Proof of Compute.
What is Sprint?
Sprint is a phase of Proof of Compute. During a Sprint, all Hosts simultaneously run AI-relevant inference on a transformer with randomized layers over a stream of nonces, producing output vectors. A Host’s voting power for the next epoch is proportional to the number of nonces it processed, as long as the reported outputs are verifiably produced by the required Sprint model.
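To make the proportionality concrete, here is a small arithmetic illustration. It only shows the "share of processed nonces" idea from the paragraph above; the numbers and the function name are hypothetical, and the actual on-chain weight calculation may involve further steps not described here.

```shell
# Illustration only: share of total voting power implied by nonce counts.
#   $1 - nonces processed by one Host (hypothetical number)
#   $2 - nonces processed by all Hosts in the Sprint (hypothetical number)
nonce_share_percent() {
  awk -v host="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "total must be positive"; exit 1 }
    printf "%.2f\n", 100 * host / total
  }'
}

nonce_share_percent 300 1200   # prints 25.00
```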
How to simulate Proof-of-Compute (PoC)?
You may want to simulate PoC on an ML Node yourself to make sure that everything will work when the PoC phase begins on the chain.
To run this test, you need either a running ML Node that isn't yet registered with the api node, or a paused api node. To pause the api node, use docker pause api. Once you've finished the test, unpause it with docker unpause api.
For the test itself, you will send a POST /v1/pow/init/generate request to the ML Node, the same request that the api node sends at the start of the PoC phase:
https://github.com/gonka-ai/gonka/blob/312044d28c7170d7f08bf88e41427396f3b95817/mlnode/packages/pow/src/pow/service/routes.py#L32
The following model params are used for PoC: https://github.com/gonka-ai/gonka/blob/312044d28c7170d7f08bf88e41427396f3b95817/mlnode/packages/pow/src/pow/models/utils.py#L41
If your node is in the INFERENCE state then you first need to transition the node to the stopped state:
curl -X POST "http://<ml-node-host>:<port>/api/v1/stop" \
-H "Content-Type: application/json"
Now you can send a request to initiate PoC:
curl -X POST "http://<ml-node-host>:<port>/api/v1/pow/init/generate" \
-H "Content-Type: application/json" \
-d '{
"node_id": 0,
"node_count": 1,
"block_hash": "EXAMPLE_BLOCK_HASH",
"block_height": 1,
"public_key": "EXAMPLE_PUBLIC_KEY",
"batch_size": 1,
"r_target": 10.0,
"fraud_threshold": 0.01,
"params": {
"dim": 1792,
"n_layers": 64,
"n_heads": 64,
"n_kv_heads": 64,
"vocab_size": 8196,
"ffn_dim_multiplier": 10.0,
"multiple_of": 8192,
"norm_eps": 1e-5,
"rope_theta": 10000.0,
"use_scaled_rope": false,
"seq_len": 256
},
"url": "http://api:9100"
}'
Send this request to port 8080 of the ML Node's proxy container, or directly to the ML Node's port 8080: https://github.com/gonka-ai/gonka/blob/312044d28c7170d7f08bf88e41427396f3b95817/deploy/join/docker-compose.mlnode.yml#L26
If the test runs successfully, you will see logs similar to the following:
2025-08-25 20:53:33,568 - pow.compute.controller - INFO - Created 4 GPU groups:
2025-08-25 20:53:33,568 - pow.compute.controller - INFO - Group 0: GpuGroup(devices=[0], primary=0) (VRAM: 79.2GB)
2025-08-25 20:53:33,568 - pow.compute.controller - INFO - Group 1: GpuGroup(devices=[1], primary=1) (VRAM: 79.2GB)
2025-08-25 20:53:33,568 - pow.compute.controller - INFO - Group 2: GpuGroup(devices=[2], primary=2) (VRAM: 79.2GB)
2025-08-25 20:53:33,568 - pow.compute.controller - INFO - Group 3: GpuGroup(devices=[3], primary=3) (VRAM: 79.2GB)
2025-08-25 20:53:33,758 - pow.compute.controller - INFO - Using batch size: 247 for GPU group [0]
2025-08-25 20:53:33,944 - pow.compute.controller - INFO - Using batch size: 247 for GPU group [1]
2025-08-25 20:53:34,151 - pow.compute.controller - INFO - Using batch size: 247 for GPU group [2]
2025-08-25 20:53:34,353 - pow.compute.controller - INFO - Using batch size: 247 for GPU group [3]
Then the ML Node starts sending generated batches to the callback URL (DAPI_API__POC_CALLBACK_URL):
2025-08-25 20:54:58,822 - pow.service.sender - INFO - Sending generated batch to http://api:9100/
What does a confirmation ratio of 0 mean, and what should I do if this happens?
A 0% confirmation ratio is an unusual condition and indicates that no nonces were sent from your API node during the epoch, meaning the node did not participate in Confirmation Proof-of-Compute (CPoC) at all. To investigate, check the API node logs and ML Node logs, as they should indicate why nonce submission did not occur.
Possible causes include:
- API node misconfiguration or downtime
- publicly exposed admin or management ports that allow access to ML Nodes
- consensus node lagging behind the chain, which may delay PoC participation beyond the allowed window
- ML Node driver failures
To mitigate this risk, ensure that admin and management ports are not publicly accessible, verify that the API node is running and correctly configured, monitor consensus node synchronization, and set up alerts for ML Node and driver failures.
Performance & troubleshooting
How do I protect my node from DDoS attacks using the proxy pre-release (v0.2.8)?
A new proxy version is available with rate limiting and DDoS protection measures.
What’s New:
- Rate limiting on API/RPC endpoints, as protection against excessive requests that have been affecting network nodes
- Blocks resource-intensive internal routes like training and poc-batches
- Optional disabling of /chain-api, /chain-rpc, and /chain-grpc endpoints
Update instructions
Step 1: Update proxy image
sed -i -E 's|(image:[[:space:]]*ghcr.io/product-science/proxy)(:.*)?$|\1:0.2.8-pre-release-proxy@sha256:6ccb8ac8885e03aab786298858cc763a99f99543b076f2a334b3c67d60fb295f |' docker-compose.yml
Important
Step 2 disables /chain-api, /chain-rpc, and /chain-grpc endpoints on this node. After applying it, this node will no longer serve public RPC traffic. If you operate public RPC endpoints, you must run separate RPC-only nodes (without these restrictions) and keep this node private.
Step 2 (Optional): Disable chain-api, chain-rpc, and chain-grpc
If you want to completely disable /chain-api, /chain-rpc, and /chain-grpc endpoints:
sed -i 's|DASHBOARD_PORT=5173|DASHBOARD_PORT=5173\n - DISABLE_CHAIN_API=${DISABLE_CHAIN_API:-true}\n - DISABLE_CHAIN_RPC=${DISABLE_CHAIN_RPC:-true}\n - DISABLE_CHAIN_GRPC=${DISABLE_CHAIN_GRPC:-true}\n|' docker-compose.yml
sed -i -E -e '/GONKA_API_(EXEMPT|BLOCKED)_ROUTES/d' -e 's|(- GONKA_API_PORT=9000)|\1\n - GONKA_API_EXEMPT_ROUTES=chat inference\n - GONKA_API_BLOCKED_ROUTES=poc-batches training|' docker-compose.yml
After these changes, the proxy service in docker-compose.yml should look similar to this:
proxy:
container_name: proxy
image: ghcr.io/product-science/proxy:0.2.8-pre-release-proxy@sha256:6ccb8ac8885e03aab786298858cc763a99f99543b076f2a334b3c67d60fb295f
ports:
- "${API_PORT:-8000}:80"
- "${API_SSL_PORT:-8443}:443"
environment:
- NGINX_MODE=${NGINX_MODE:-http}
- SERVER_NAME=${SERVER_NAME:-}
- GONKA_API_PORT=9000
- GONKA_API_EXEMPT_ROUTES=chat inference
- GONKA_API_BLOCKED_ROUTES=poc-batches training
- CHAIN_RPC_PORT=26657
- CHAIN_API_PORT=1317
- CHAIN_GRPC_PORT=9090
- DASHBOARD_PORT=5173
- DISABLE_CHAIN_API=${DISABLE_CHAIN_API:-true}
- DISABLE_CHAIN_RPC=${DISABLE_CHAIN_RPC:-true}
- DISABLE_CHAIN_GRPC=${DISABLE_CHAIN_GRPC:-true}
Pull the new image and restart the proxy:
docker compose -f docker-compose.mlnode.yml -f docker-compose.yml pull proxy
source ./config.env && docker compose -f docker-compose.mlnode.yml -f docker-compose.yml up -d --no-deps proxy
You can close port 26657 as an external port.
It is optional, but highly recommended:
sed -i 's|- "26657:26657"|#- "26657:26657"|g' docker-compose.yml
node:
container_name: node
...
ports:
- "5000:26656" #p2p
#- "26657:26657" #rpc
source ./config.env && docker compose -f docker-compose.mlnode.yml -f docker-compose.yml up -d --no-deps node
If you previously accessed the node status using curl -s http://localhost:26657/status, you can now access it from within the containers:
docker exec proxy curl -s node:26657/status | jq
docker exec node wget -qO- http://localhost:26657/status | jq
For continuous monitoring with watch:
watch -n 5 'docker exec node wget -qO- http://localhost:26657/status | jq -r ".result.sync_info | \"Block: \(.latest_block_height) | Time: \(.latest_block_time) | Syncing: \(.catching_up)\""'
How much free disk space is required for a Cosmovisor update, and how can I safely remove old backups from the .inference directory?
Cosmovisor creates a full backup in the .inference state folder whenever it performs an update. For example, you can see a folder like data-backup-<some_date>.
As of November 20, 2025, the size of the data directory is about 150 GB, so each backup will take approximately the same amount of space.
To safely run the update, it is recommended to have 250+ GB of free disk space.
You can remove old backups to free space, although in some cases this may still be insufficient and you might need to expand the server disk.
To remove an old backup directory, you can use:
sudo su
cd .inference
ls -la # view the list of folders. There will be folders like data-backup... DO NOT DELETE ANYTHING EXCEPT THESE
rm -rf <data-backup...>
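Before an update, the free space can be compared with the 250 GB recommendation above. The following is a minimal sketch; the function name enough_space_for_update is hypothetical, and the df invocation in the usage comment assumes that .inference lives on the filesystem you want to measure.

```shell
# Hypothetical pre-update check against the 250 GB recommendation above.
#   $1 - free space in GB
#   $2 - required space in GB (defaults to 250)
enough_space_for_update() {
  free_gb=$1
  required_gb=${2:-250}
  if [ "$free_gb" -ge "$required_gb" ]; then
    echo "ok: ${free_gb} GB free"
  else
    echo "insufficient: ${free_gb} GB free, ${required_gb} GB recommended"
    return 1
  fi
}

# Example usage (not executed here): free space, in GB, on the filesystem
# that holds .inference. df -Pk reports available 1024-byte blocks in column 4.
#   free=$(df -Pk .inference | awk 'NR == 2 { print int($4 / 1024 / 1024) }')
#   enough_space_for_update "$free"
```

If the check reports insufficient space, removing old data-backup folders as shown above is the first thing to try.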
How to prevent unbounded memory growth in NATS?
NATS is currently configured to store all messages indefinitely, which leads to continuous growth in memory usage. A recommended solution is to configure a 24-hour time-to-live (TTL) for messages in both NATS streams.
- Install the NATS CLI. Install Golang by following the instructions here: https://go.dev/doc/install. Then install the NATS CLI:
go install github.com/nats-io/natscli/nats@latest
- If you already have the NATS CLI installed, run:
nats stream info txs_to_send --server localhost:<your_nats_server_port>
nats stream info txs_to_observe --server localhost:<your_nats_server_port>
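The 24-hour TTL itself can then be applied with nats stream edit. The sketch below only prints the commands so they can be reviewed first. It assumes the two stream names shown above and the natscli --max-age and --force flags of stream edit; confirm both with nats stream edit --help for your CLI version. The function name and the example port 4222 are hypothetical.

```shell
# Sketch: print the commands that would set a 24-hour max age on both streams.
# Stream names come from the text above; --max-age and --force are assumed to
# be available in your natscli version (check `nats stream edit --help`).
#   $1 - NATS server port
nats_ttl_commands() {
  for stream in txs_to_send txs_to_observe; do
    echo "nats stream edit $stream --max-age 24h --force --server localhost:$1"
  done
}

nats_ttl_commands 4222   # 4222 is only an example port
```

After running the printed commands, the nats stream info calls above can be repeated to confirm that the maximum age is now set.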
How to change inference_url?
You may need to update your inference_url if:
- You changed your API domain;
- You moved your API node to a new machine;
- You reconfigured HTTPS / reverse proxy;
- You are migrating infrastructure and want your Host entry to point to a new endpoint.
This operation does not require re-registration, re-deployment, or key regeneration. Updating your inference_url is performed through the same transaction used for initial registration (the submit-new-participant msg).
The chain logic checks whether your Host (participant) already exists:
- If the participant does not exist, the transaction creates a new one;
- If the participant already exists, only three fields may be updated: InferenceURL, ValidatorKey, WorkerKey.
All other fields are preserved automatically.
This means updating inference_url is a safe, non-destructive operation.
Note
When a Node updates its execution URL, the new URL becomes active immediately for inference requests coming from other Nodes. However, the URL recorded in ActiveParticipants is not updated until the next epoch because modifying it earlier would invalidate the cryptographic proof associated with the participant set. To avoid service disruption, it is recommended to keep both the previous and the new URLs operational until the next epoch completes.
[LOCAL] Perform the update locally, using your Cold Key:
./inferenced tx inference submit-new-participant \
<PUBLIC_URL> \
--validator-key <CONSENSUS_KEY> \
--keyring-backend file \
--unordered \
--from <COLD_KEY_NAME> \
--timeout-duration 1m \
--node http://<node-url>/chain-rpc/ \
--chain-id gonka-mainnet
Why is my application.db growing so large, and how do I fix it?
On some nodes, application.db keeps growing in size.
.inference/data/application.db stores the history of chain states (not blocks). By default, it keeps the state for the last 362880 blocks.
The state history contains a full Merkle tree for each state, and it is safe to keep it for a much shorter span, for example only the last 1000 blocks.
The pruning parameters can be set in .inference/config/app.toml:
...
pruning = "custom"
pruning-keep-recent = "1000"
pruning-interval = "100"
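To confirm which pruning values a node will pick up, the relevant keys can be listed straight from the file. This is a minimal sketch; the function name show_pruning_settings is hypothetical, and it only matches the three keys shown above.

```shell
# Sketch: print the pruning-related keys from an app.toml-style file.
#   $1 - path to the config file (for example .inference/config/app.toml)
show_pruning_settings() {
  grep -E '^[[:space:]]*pruning(-keep-recent|-interval)?[[:space:]]*=' "$1"
}

# Example usage (not executed here):
#   show_pruning_settings .inference/config/app.toml
```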
The new configuration takes effect after the node container is restarted. However, even with pruning enabled, cleaning the database is very slow.
There are several ways to reset application.db:
1) Stop node
docker stop node
2) Remove data
sudo rm -rf .inference/data/ .inference/.node_initialized
sudo mkdir -p .inference/data/
3) Start node
docker start node
This approach may take some time, during which the node will not be able to record transactions.
Use available trusted nodes to download the snapshot.
Snapshots are enabled by default and stored in .inference/data/snapshots
1) Prepare a new application.db (the node container is still running)
1.1) Prepare temporary home directory for inferenced
mkdir -p .inference/temp
cp -r .inference/config .inference/temp/config
mkdir -p .inference/temp/data/
1.2) Copy snapshots:
cp -r .inference/data/snapshots .inference/temp/data/
1.3) List snapshots
inferenced snapshots list --home .inference/temp
Copy height for the latest snapshot.
1.4) Start restoring from the snapshot (the node container is still running)
inferenced snapshots restore <INSERT_HEIGHT> 3 --home .inference/temp
This might take some time. Once it is finished, you'll have new application.db in .inference/temp/data/application.db
2) Replace application.db with new one
2.1) Stop node container (from another terminal window)
docker stop node
2.2) Move original application.db
mv .inference/data/application.db .inference/temp/application.db-backup
mv .inference/wasm .inference/wasm.db-backup
2.3) Replace it with new one
cp -r .inference/temp/data/application.db .inference/data/application.db
cp -r .inference/temp/wasm .inference/wasm
2.4) Start node container (from another terminal window):
docker start node
3) Wait until the node container is synchronized, then delete .inference/temp/
If you have several nodes, it is recommended to clean them one by one.
An additional option is to start a separate instance of the node container on a separate CPU-only machine and set it up in strict validator mode:
- preserve a really short history
- limit RPC and API access to the api container only
Once it's running, move the existing tmkms volume to the new node (disable block signing on the existing one first).
This is the general idea of the approach. If you decide to try it and have any questions, feel free to reach out on Discord.
A fix is now available for the long-standing issue where application.db continues to grow under many pruning configurations.
This improvement was contributed by Lelouch33 and is included in release 0.2.10-post6. With the updated logic and the following settings, application.db can remain around 100 GB:
- SNAPSHOT_INTERVAL=1000
- SNAPSHOT_KEEP_RECENT=2
- pruning-keep-recent = "20000"
- pruning-interval = "512"
References:
- https://github.com/gonka-ai/gonka/issues/819#issuecomment-3996332369
- https://github.com/gonka-ai/gonka/pull/867
After upgrading to this binary, pruning will begin after the next snapshot block. This process is relatively heavy and may temporarily slow down the node container while the old state history is being removed.
To reduce operational impact, it is recommended to apply the update to nodes one by one and use a higher pruning-interval, such as 512, to avoid pruning too frequently.
If a node slows down significantly during pruning, restarting the node container may help it catch up.
Applying this update before the upcoming v0.2.11 upgrade is recommended to prevent pruning from starting simultaneously across many nodes.
Apply update (example for v0.2.10-post7, which has an identical inferenced):
# Pre-check: Ensure no confirmation PoC is active (fails entire script if not false)
echo "--- Pre-flight Check: Confirmation PoC Status ---" && \
CONFIRMATION_POC_ACTIVE=$(curl -sf "https://node3.gonka.ai/v1/epochs/latest" | jq -r '.is_confirmation_poc_active') && \
[ "$CONFIRMATION_POC_ACTIVE" = "false" ] && \
echo "OK: No confirmation PoC active" && \
sudo rm -rf inferenced.zip .inference/cosmovisor/upgrades/v0.2.10-post7/ .inference/data/upgrade-info.json && \
sudo mkdir -p .inference/cosmovisor/upgrades/v0.2.10-post7/bin/ && \
wget -q -O inferenced.zip 'https://github.com/gonka-ai/gonka/releases/download/release%2Fv0.2.10-post7/inferenced-amd64.zip' && \
echo "5ed8941d50779fa2359a9745263b324b887465104f81073827321945ab1f392a inferenced.zip" | sha256sum --check && \
sudo unzip -o -j inferenced.zip -d .inference/cosmovisor/upgrades/v0.2.10-post7/bin/ && \
sudo chmod +x .inference/cosmovisor/upgrades/v0.2.10-post7/bin/inferenced && \
echo "Inference Installed and Verified" && \
# Link Binary
echo "--- Final Verification ---" && \
sudo rm -rf .inference/cosmovisor/current && \
sudo ln -sf upgrades/v0.2.10-post7 .inference/cosmovisor/current && \
echo "d9093b225cbd531afc56c99d0b0996b1fa2896c0745cd73293f0de08132f7754 .inference/cosmovisor/current/bin/inferenced" | sudo sha256sum --check && \
# Restart
source config.env && docker compose up node --no-deps --force-recreate -d
Automatic ClaimReward didn’t go through, what should I do?
If you have an unclaimed reward, execute the following (replace 106 with the relevant epoch index):
curl -X POST http://localhost:9200/admin/v1/claim-reward/recover \
-H "Content-Type: application/json" \
-d '{"force_claim": true, "epoch_index": 106}'
To check the epoch performance summary for your account:
curl http://node2.gonka.ai:8000/chain-api/productscience/inference/inference/epoch_performance_summary/106/<ACCOUNT_ADDRESS> | jq
Upgrades
Upgrade v0.2.12: Pre-Upgrade Model Cleanup
Important
This cleanup process must be completed before the upgrade happens. If you upgrade before cleaning up the models, your node will be rejected and go offline.
Version 0.2.12 removes every governance model that is not on the post-upgrade approved list. On mainnet, only the previously enforced model and Kimi will remain.
Each DAPI persists its MLNode configurations locally. On startup, it validates every configured model against the on-chain governance list. If a configuration includes at least one unsupported model, the entire node is rejected and the host goes offline.
Version 0.2.11 masked this problem by trimming the runtime view down to the enforced model, so /admin/v1/nodes appeared clean even when the persisted config still contained extra models. Version 0.2.12 stops this trimming, meaning the persisted config is loaded directly.
To fix this, the script below finds each node with extra models in /admin/v1/config and sends a PUT request with a cleaned config to /admin/v1/nodes/<id>. These changes are persisted within 60 seconds. The remaining model's arguments, hardware, and ports are preserved exactly. Nodes that do not list the enforced model are skipped and will require manual fixing.
Paste the following script into the host's shell. By default, it will apply the changes. To preview the changes without applying them, set APPLY=dry (or any value other than --apply).
Script in the repository:
ADMIN=${ADMIN:-http://127.0.0.1:9200}
KEEP=${KEEP:-Qwen/Qwen3-235B-A22B-Instruct-2507-FP8}
APPLY=${APPLY:-"--apply"}
curl -sS "$ADMIN/admin/v1/config" | jq -r --arg k "$KEEP" '
.nodes[] | "\(.id): " + (
if (.models | has($k) | not) then "skip (\(.models | keys))"
elif (.models | length) == 1 then "ok"
else "\(.models | keys) -> [\($k)]" end)'
if [[ "$APPLY" == "--apply" ]]; then
curl -sS "$ADMIN/admin/v1/config" \
| jq -c --arg k "$KEEP" \
'.nodes[] | select((.models | has($k)) and (.models | length > 1)) | .models = {($k): .models[$k]}' \
| while IFS= read -r p; do
id=$(jq -r .id <<<"$p")
curl -sS -f -X PUT -H 'Content-Type: application/json' -d "$p" \
"$ADMIN/admin/v1/nodes/$id" >/dev/null && echo "$id: updated"
done
echo "done; persisted within 60s"
else
echo "preview only; rerun without APPLY=dry to commit"
fi
Wait 60 seconds after running the script to ensure the changes are persisted before triggering the upgrade. Then, verify the configuration:
curl -sS http://127.0.0.1:9200/admin/v1/config \
| jq '.nodes[] | {id, models: (.models | keys)}'
Expected output:
{
"id": "<nodeId>",
"models": [
"Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"
]
}
Upgrade v0.2.12: Pre-download binaries
# 1. Create Directories
sudo mkdir -p .dapi/cosmovisor/upgrades/v0.2.12/bin \
.inference/cosmovisor/upgrades/v0.2.12/bin && \
# 2. DAPI: Download -> Verify -> Unzip directly to bin -> Make Executable
wget -q -O decentralized-api.zip "https://github.com/gonka-ai/gonka/releases/download/release%2Fv0.2.12/decentralized-api-amd64.zip" && \
echo "d0143a95e12e1ada06cfea5e4d3deab13534c3523c967e9a6b87ac9f9bf3247d decentralized-api.zip" | sha256sum --check && \
sudo unzip -o -j decentralized-api.zip -d .dapi/cosmovisor/upgrades/v0.2.12/bin/ && \
sudo chmod +x .dapi/cosmovisor/upgrades/v0.2.12/bin/decentralized-api && \
echo "DAPI Installed and Verified" && \
# 3. Inference: Download -> Verify -> Unzip directly to bin -> Make Executable
sudo rm -rf inferenced.zip .inference/cosmovisor/upgrades/v0.2.12/bin/ && \
wget -q -O inferenced.zip "https://github.com/gonka-ai/gonka/releases/download/release%2Fv0.2.12/inferenced-amd64.zip" && \
echo "df7656503d39f6703767d32d5578d1291e32cb114844d8c1cd0f134d1bf4babd inferenced.zip" | sha256sum --check && \
sudo unzip -o -j inferenced.zip -d .inference/cosmovisor/upgrades/v0.2.12/bin/ && \
sudo chmod +x .inference/cosmovisor/upgrades/v0.2.12/bin/inferenced && \
echo "Inference Installed and Verified" && \
# 4. Cleanup and Final Check
rm decentralized-api.zip inferenced.zip && \
echo "--- Final Verification ---" && \
sudo ls -l .dapi/cosmovisor/upgrades/v0.2.12/bin/decentralized-api && \
sudo ls -l .inference/cosmovisor/upgrades/v0.2.12/bin/inferenced && \
echo "94ce943338d12844028e84fe770106c9d28d866cf0af99f27da30f56d69efa34 .dapi/cosmovisor/upgrades/v0.2.12/bin/decentralized-api" | sudo sha256sum --check && \
echo "642eb9858cd77d182f3e1c4d44553f5379d615983430e1fd8e85f09632af4271 .inference/cosmovisor/upgrades/v0.2.12/bin/inferenced" | sudo sha256sum --check
Bounty program
What is the bounty program? Who can participate? How are rewards paid?
It’s not necessary to be a Host to participate: many bounties go to contributors who submit fixes, implement improvements, or contribute to broader Gonka infrastructure.
Awards are paid from the community pool after governance approval. Vulnerability reports are especially valued, and responsible disclosures that help prevent exploits and improve network safety are eligible for bounties as well.
Final bounty decisions, amounts, and categories are always up to community governance.
What is the vulnerability bounty pricing model?
A common way to think about severity is:
Risk = Impact × Likelihood
Impact levels
| Level | Description | Examples |
|---|---|---|
| Critical | Catastrophic for the whole network | Full network control hijack |
| High | Significant disturbance at scale | Network crash/halt; theft from module; wrong rewards for all participants |
| Medium | Moderate disruption, limited scope | Consensus or reward integrity at risk; single-participant funds or availability |
| Low | Minor impact on isolated participants, no chain impact | single-component, minor effect on a single participant, non-chain |
Likelihood
- Organic — Unintentional; occurs under normal conditions. Estimate by probability (how often conditions trigger it, usage patterns).
- Intentional — Profitable — Exploited for financial gain. Higher likelihood when gain is large and cost/complexity is low.
- Intentional — Griefing — Exploited to cause disruption. Higher likelihood when network-wide effect and low cost; single-participant griefing → lower likelihood.
Risk Matrix
| Impact \ Likelihood | High | Medium | Low |
|---|---|---|---|
| Critical | Critical | Critical | High |
| High | Critical | High | Medium |
| Medium | High | Medium | Low |
| Low | Medium | Low | Informational |
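The matrix can also be read as a simple lookup. The sketch below encodes the table above cell by cell; the function name risk_level is hypothetical and the labels must be spelled exactly as in the table.

```shell
# Lookup of the risk matrix above.
#   $1 - impact:     Critical | High | Medium | Low
#   $2 - likelihood: High | Medium | Low
risk_level() {
  case "$1/$2" in
    Critical/High|Critical/Medium|High/High) echo "Critical" ;;
    Critical/Low|High/Medium|Medium/High)    echo "High" ;;
    High/Low|Medium/Medium|Low/High)         echo "Medium" ;;
    Medium/Low|Low/Medium)                   echo "Low" ;;
    Low/Low)                                 echo "Informational" ;;
    *) echo "unknown"; return 1 ;;
  esac
}

risk_level High Medium   # prints High
```

As stated above, final bounty decisions remain with community governance; the lookup is only a way to reason about severity.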
How to get started in the bounty program?
- A new GitHub issue/discussion can be created to propose an improvement and get community feedback on whether it’s worth implementing.
- Or pick an existing issue labeled up-for-grabs. Before starting, leave a quick comment that work has started and include an approximate ETA, so others have visibility and avoid duplicate effort.
What is the suggested vulnerability reporting process?
- If an issue is not high or critical severity (limited impact, no network-wide effect) and the fix is low effort, opening a PR right away is usually fine.
- If an issue is high or critical severity, please report it privately to trusted community members (long-term Gonka repository contributors), either as a report or together with a fix in a private fork.
- If an issue looks like part of a broader class and a systematic review would likely uncover more issues of the same category, leave a note that a review is planned. This helps avoid duplicate reviews running in parallel.
To contribute, pick an issue, ship a solid fix, and share the link in the relevant dev channels to get feedback.
Where can I see who was paid bounties, for what, and when?
The most reliable sources are on-chain records and GitHub. Use them as the main source of truth for who was paid, what the bounty was for, and when it was executed.
Errors
Here you can find examples of common errors and typical log entries that may appear in node logs.
No epoch models available for this node
2025/08/28 08:37:08 ERROR No epoch models available for this node subsystem=Nodes node_id=node1
2025/08/28 08:37:08 INFO Finalizing state transition for node subsystem=Nodes node_id=node1 from_status=FAILED to_status=FAILED from_poc_status="" to_poc_status="" succeeded=false blockHeight=92476
How do I fix err="no validator signing info found" when starting from a state sync snapshot?
If you periodically hit err="no validator signing info found" during startup from a state sync snapshot, it is typically related to the Cosmos SDK iavl-fastnode behavior. A safe workaround is to disable fastnode for the initial startup, then (optionally) re-enable it after the node is fully synced.
Fix (Docker):
- Stop the node: docker stop node
- In .inference/config/app.toml, set: iavl-disable-fastnode = true
- Start the node: docker start node
After a restart, the issue should not recur.
Note
main includes v0.2.10-post6. Nodes starting from this version apply this setting automatically, so you typically won’t need to change it manually.
Inference
Why does the 4,096 output token limit cause the model to stall during thinking — returning zero tokens?
This is about you if
- You see content=null and finish_reason=length.
- The model is "silent": usage shows tokens, but there's no text.
- A probe request with max_tokens=100 returns nothing.
Fix-first: a working configuration for Kimi-K2.6
If you don't have time to dig in — copy this payload as a starting point. As of 2026-05-28 it worked on two public brokers; verify it's still current with your broker operator before using it.
{
"model": "moonshotai/Kimi-K2.6",
"messages": [
{"role": "user", "content": "Write hello world in Python."}
],
"max_tokens": 4096,
"thinking": {"type": "disabled"},
"thinking_token_budget": 0,
"temperature": 0.2
}
Why these exact fields:
- max_tokens: 4096 gives the model the entire available output budget. The effective cap on brokers right now is 3,072 (see Q3), so going higher is useless. Minimum 256, otherwise the gateway may force thinking_token_budget to zero.
- thinking: {"type": "disabled"} disables hidden thinking via a chat template hint.
- thinking_token_budget: 0 is belt-and-suspenders: it explicitly zeroes the budget at the generation parameter level (see Q2).
- The model ID is case-sensitive: moonshotai/Kimi-K2.6 (capital K) on gonka-api.org, moonshotai/kimi-k2.6 (lowercase k) on gonkagate.com. If you get a 404, flip the case. Cross-check against the GET /v1/models response.
Ready-to-use curl (replace <broker> and the model ID case):
curl -sS https://<broker>/v1/chat/completions \
-H "Authorization: Bearer $GONKA_API_KEY" \
-H "Content-Type: application/json" \
-d @payload.json
If it returned meaningful text — the problem is in your original payload; compare the fields one by one. If content=null — capture the id from the response and send it to the broker's support.
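Comparing the fields one by one can be automated. The sketch below is a minimal illustration, not part of any Gonka tooling: diff_payloads is a hypothetical helper that compares only top-level fields of two request bodies already parsed from JSON.

```python
def diff_payloads(working: dict, original: dict) -> dict:
    """Return {field: (working_value, original_value)} for each top-level field that differs."""
    missing = object()  # sentinel, so an explicit null is not confused with an absent field
    diff = {}
    for key in sorted(set(working) | set(original)):
        w = working.get(key, missing)
        o = original.get(key, missing)
        if w != o:
            diff[key] = (
                "<absent>" if w is missing else w,
                "<absent>" if o is missing else o,
            )
    return diff
```

Fields reported as "<absent>" on the working side are the first candidates to remove from the original payload.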
First check whether the rules are active on your broker
Gateway behavior depends on the broker and changes over time. Run this test:
curl https://<your-broker>/v1/chat/completions \
-H 'content-type: application/json' \
-H "Authorization: Bearer $GONKA_API_KEY" \
-d '{
"model": "moonshotai/Kimi-K2.6",
"messages": [{"role": "user", "content": "one word"}],
"max_tokens": 100
}'
| Gateway version | Expected result |
|---|---|
| devshard ≥ 0.2.13 (force-zero-below-256 active) | finish_reason="length", ~0–10 reasoning tokens |
| Older build | finish_reason="length", ~40–60 reasoning tokens (default max_tokens / 2) |
The rules below describe the recent gateway code (devshard ≥ 0.2.13). Your broker may not be updated yet. Not sure about the version? — run the fix-first above. If it works with meaningful text, the gateway is recent enough. If not — send response.id to the broker's support with a question about updating.
What happens on the model and gateway side
Kimi-K2.6 specifics. The model emits <think>…</think> blocks. Both sections (<think> and visible content) consume max_tokens equally. With small max_tokens the model burns the entire budget inside <think> and returns only </think>, which vLLM strips as a special token → content=null, finish_reason=length. From the client side — "0 tokens."
Gateway rules for thinking_token_budget (abbreviated ttb below; PR #1202, devshard 0.2.13+):
| Condition | What the gateway does |
|---|---|
| max_tokens < 256 | ttb = 0 (force-zero, overrides the client) |
| ttb not set, max_tokens >= 256 | ttb = max_tokens / 2 |
| ttb set by the client | uses the client's value |
| always | clamp: ttb ≤ 96,000 and ttb ≤ max_tokens − 64 |
Additionally:
- max_tokens floor → 16 (PR #1227): previously max_tokens=1 reliably produced content=null. Now it's silently raised to 16.
- thinking: {"type":"disabled"} mirror (PR #1224): the gateway mirrors it into chat_template_kwargs.thinking=false. The Kimi chat template reads the kwarg.
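The rules above can be read as a small resolver. The sketch below is a client-side approximation for predicting what budget a request will get, not the gateway's actual code. The function name, the constant names, the order in which the rules are applied, integer division for max_tokens / 2, and clamping the result at zero are all assumptions.

```python
TTB_HARD_CAP = 96_000      # clamp: ttb <= 96,000
FORCE_ZERO_BELOW = 256     # max_tokens below this forces ttb = 0
MAX_TOKENS_FLOOR = 16      # max_tokens is silently raised to this value
VISIBLE_RESERVE = 64       # clamp: ttb <= max_tokens - 64

def resolve_thinking_budget(max_tokens, client_ttb=None):
    """Return (effective_max_tokens, effective_ttb) under the assumed rule order."""
    max_tokens = max(max_tokens, MAX_TOKENS_FLOOR)
    if max_tokens < FORCE_ZERO_BELOW:
        ttb = 0                      # force-zero, overrides the client value
    elif client_ttb is None:
        ttb = max_tokens // 2        # default: half of the output budget
    else:
        ttb = client_ttb             # the client's explicit value
    ttb = min(ttb, TTB_HARD_CAP, max_tokens - VISIBLE_RESERVE)
    return max_tokens, max(ttb, 0)   # assumption: the budget never goes negative
```

For example, resolve_thinking_budget(4096) yields a hidden budget of 2048, while resolve_thinking_budget(4096, 0) keeps the whole budget for visible output.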
Scenarios that historically produced content=null (max_tokens=1, the probe-shape max=100, min=100, ttb=50) now return non-empty content through the recent gateway. On gonkagate.com (2026-05-25), max_tokens=100 without ttb returned ~50 reasoning tokens — force-zero-below-256 is not active there.
For Inference User:
- Re-test against a broker with gateway ≥ 0.2.13 (release 2026-05-23+).
- If you see zero tokens, capture the id from the response and send it to the broker. To extract it:
curl ... | jq .id
Format: devshard-<short>-<short>, e.g. devshard-7a4f-31b2. Where to send: the broker's support channel (for gonka-api.org — support links on the site; for gonkagate.com — the /contact section).
- Don't rely on thinking:disabled alone — to be safe, set thinking_token_budget: 0 explicitly (see Q2).
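The zero-token symptom can also be detected in client code before contacting support. This is a minimal sketch that assumes the response follows the OpenAI chat-completions shape (id, choices[0].message.content, choices[0].finish_reason); silent_response_id is a hypothetical helper name.

```python
def silent_response_id(response: dict):
    """Return the response id if the first choice has no visible text and
    stopped on the length limit; otherwise return None."""
    choices = response.get("choices") or []
    if not choices:
        return None
    choice = choices[0]
    content = (choice.get("message") or {}).get("content")
    if not content and choice.get("finish_reason") == "length":
        return response.get("id")
    return None
```

A non-None result is the id to pass to the broker's support channel.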
For Broker: on pre-0.2.13 — update per your validation / release cadence (no rush: clients on older versions and escrow rules require re-qualification). Until the update, clients apply the workaround above; after devshard-0.2.13 the zero-output content=null cases will disappear.
With Kimi K2, the entire token limit can be spent on thinking with no actual output. Is this an output cap, bandwidth, or upstream issue?
This is a gateway policy, not a model limitation. The thinking_token_budget resolver (PR #1202) allocates max_tokens / 2 for reasoning by default. For tool-heavy flows the budget burns out before any useful output. The mitigation is to explicitly set thinking_token_budget: 0 or thinking: {"type": "disabled"} (the gateway mirrors it into chat_template_kwargs via PR #1224). The model simply respects the budget.
Same cause as in Q1 — the model splits max_tokens between <think> and visible content. This is not bandwidth and not an output cap.
Two escape hatches
- With thinking: {"type": "disabled"}, the gateway mirrors it into chat_template_kwargs.thinking=false (the Kimi chat template reads the kwarg) and removes the top-level thinking. "adaptive" and "auto" are accepted (Claude Code CLI / Anthropic SDK preset, PR #1224); both resolve to enabled.
- With thinking_token_budget: 0, an explicit zero goes straight to vLLM as a generation parameter and reliably zeroes the thinking budget.
Important nuance: the mechanisms work at different levels (chat template hint vs. generation parameter) and don't overlap. thinking:disabled does NOT automatically zero thinking_token_budget — with the default max_tokens=4096 and only disabled, the model still gets a hidden ttb=2048 from the gateway resolver. In our tests Kimi respected thinking:disabled even on reasoning-heavy prompts. The model documentation (the planned docs/chat-api/kimi-k2.6.md) warns that in some reasoning scenarios the model may ignore the hint — we didn't reproduce it, but hedge anyway. Belt-and-suspenders: for critical flows, send both parameters together.
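Sending both parameters together can be wrapped in one helper. The sketch below is illustrative: without_thinking is a hypothetical name, and the two field values are the ones from the fix-first payload.

```python
import copy

def without_thinking(payload: dict) -> dict:
    """Return a copy of the request body with both escape hatches applied."""
    out = copy.deepcopy(payload)
    out["thinking"] = {"type": "disabled"}   # chat template hint
    out["thinking_token_budget"] = 0         # generation parameter
    return out
```

The input dictionary is left untouched, so the same base payload can still be sent with thinking enabled for reasoning-heavy requests.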
Numeric confirmation
The same bug-find prompt, max_tokens=500, the answer is identical in meaning:
| Config | usage.completion_tokens | Wall-clock |
|---|---|---|
| thinking: {"type":"disabled"} | 65 | 3.6s |
| default (gateway resolver → ttb = max_tokens/2 = 250) | 312 | 12.5s |
Half of the default budget goes to hidden thinking even for a trivial task — hence the advice to disable thinking for tool-heavy / agentic flows.
For Inference User:
- Tool-heavy / agentic flows without reasoning: "thinking": {"type": "disabled"} (Kimi) or "enable_thinking": false (Qwen, translated automatically).
- Complex reasoning: set thinking_token_budget explicitly (don't rely on the default max_tokens / 2).
- If thinking:disabled still causes burn on your prompt, duplicate it with thinking_token_budget: 0 explicitly.
For Broker: on pre-0.2.13 — update per cadence. Until the update, clients apply the workaround. On the landing page, note: Kimi for tool-heavy flows requires thinking:disabled, or an explicit thinking_token_budget, or a large max_tokens.
The input token cap for Kimi is 4k tokens, and the output cap is 8,192 tokens. When will these limits be raised?
The numbers in the question are incorrect
- Output cap: 3,072 tokens on both tested brokers (they return finish_reason=length at exactly 3,072 even with max_tokens=8000).
- Input: up to 240,000 tokens (--max-model-len on the mainnet Kimi deploy). Not 4,000.
Where the output cap comes from
The network ceiling in the code is 4,096 (RequestMaxTokensCap), but the effective limit is lower. The exact mechanism is a black box. Possible explanations (by likelihood, not confirmed from public code):
- The gateway default DefaultRequestMaxTokens = 3,072 is not overridden by the broker operator.
- The broker operator set request_max_tokens_cap = 3,072 per-model via an admin endpoint (POST /v1/admin/settings).
- An upstream DAPI or host-side cap (e.g. vLLM --max-tokens-per-request or a loader constraint).
To know for sure — ask the broker for the request_max_tokens_cap value for each model.
How much fits in 3,072 tokens
| Scenario | Fits in 3,072? |
|---|---|
| ~1,900–2,200 words of regular English | yes |
| ~600–800 lines of Python/JS | yes |
| A short answer (5–10 sentences) | yes |
| One tool call + moderate JSON (arguments ≤ 500 tokens) | yes |
| Small structured output (3–5 summary points) | yes |
| A long document summary (>10k source tokens) | no |
| Large code diffs (>2k lines) | no |
| 3+ parallel tool calls in one response | no |
| Agentic loop: reasoning + tool calls + visible content at once | no |
For use cases in the second group — request a cap increase from the broker (see For Broker).
How to raise the cap
The output cap is controlled by the broker, not the network. To raise it — ask your broker: they can increase request_max_tokens_cap with a single admin call (no code changes). A network-wide bump above 4,096 requires a PR to the gateway code + a new release; you can initiate it through a GitHub Discussion on gonka-ai/gonka.
For the curious / operators: the blockchain stores per-model price parameters (coins_per_input_token, coins_per_output_token) and deploy parameters (model_args), but there's no field for a hard output limit — the relaxation is a local broker policy, not a governance-defined value.
Where the 240k input comes from
The mainnet Kimi-K2.6 deploy is registered via the on-chain governance proposal v0.2.12 (inference-chain/app/upgrades/v0_2_12/upgrades.go:kimiGovernanceModel()):
ModelArgs: ["--max-model-len","240000",
"--tool-call-parser","kimi_k2",
"--reasoning-parser","kimi_k2"]
VRam: 720 (GB)
The model card declares 256K native context. The gateway doesn't limit input separately, other than the universal body size (10 MiB) and message count (≤ 2,048) — the "Request limits" section in docs/chat-api/README.md (planned document).
Important caveat (open issue)
Even if the broker agrees to raise the output cap, individual nodes may be started with a smaller --max-model-len. The gateway routing layer does not account for per-host context capacity (issue #818). For large payloads (>50k), landing on a "small" node is systematic behavior, not transient randomness.
For Inference User:
- The real output cap is determined by the broker: ask them for the request_max_tokens_cap value for each model.
- If you hit a small input limit, this is almost certainly --max-model-len on a specific node, not a global limit. The routing layer doesn't account for per-host context (issue #818); for large payloads (>50k) this is a systematic problem. Workaround: retry or split the request into several API calls.
- If you hit the output cap, ask the broker to raise it. A network-wide bump (above 4,096) is a code change; raise it through a GitHub Discussion on gonka-ai/gonka.
For Broker:
- Raising the cap per-model is a single admin call via POST /v1/admin/settings with model_limits[].request_max_tokens_cap, no code change. It increases per-request escrow exposure and the risk of hitting the per-host --max-model-len (5xx on individual nodes). Raise it for specific models under proven demand, after verifying --max-model-len on all escrow nodes.
- A network-wide bump (above 4,096) is a PR to the gateway code + a new release. If you see stable demand for large outputs, open a Discussion.
Agents like Hermes, OpenClaw with 30k+ system prompts fail on Kimi. Why?
In brief
The Kimi model accepts 30k+ input at the model and gateway levels, but stability depends on routing. The native window is 256K, the mainnet deploy uses --max-model-len 240000, and the gateway accepts a body of up to 10 MiB. Empirically, a single-shot ~69,000 prompt tokens (≈800 messages × 80 words) completed in 5.5 seconds. On sustained / repeated long requests (>50k) you'll encounter instability (issue #818) — on large payloads (215k) repeated attempts may fail with 503.
Sources for verification (all in gonka-ai/gonka)
- Native context 256K: the model card in docs/chat-api/ (the exact filename is planned as part of the chat-api docs set).
- Mainnet deploy params (on-chain): inference-chain/app/upgrades/v0_2_12/upgrades.go:kimiGovernanceModel().
- Body / message limits (10 MiB, ≤ 2,048 messages): docs/chat-api/README.md (planned), the "Request limits" section.
When 30k does break — two typical causes
1. A single rejected field in the agent's payload. The gateway maintains a strict allowlist. If the agent sent even one non-standard field (tags, enforced_tokens, plugins, guided_json) — the entire request is cut with HTTP 400. The Hermes-specific tags reject — anchor #reject-tags in docs/chat-api/troubleshooting.md (planned). Empirically: a valid 69k payload + tags:["session:abc"] → HTTP 400 in 2 seconds.
2. Routing to a node with a smaller --max-model-len. The gateway routing layer doesn't account for the host's actual context size when routing (issue #818; see also the planned known-issues.md §3). On very long payloads (>50k, especially >200k), landing on a "small" node is systematic behavior at the network level, not a client error: in our measurements 5×215k = 0/5 success. The request will fail on the vLLM side.
A related builder request: issue #1229 (opened in May 2026), blockers for agentic scenarios — long reasoning chains, tool-call compatibility, continuation after exceeding output limits.
Quick self-diagnosis checklist
- Remove the fields tags, enforced_tokens, plugins, strict, guided_json, guided_regex, guided_grammar, guided_choice one at a time. Resend the same request after each removal.
- If none of the removals helped, check the schema depth in tools[].function.parameters (≤ 16) and the total number of nodes (≤ 256), see Q9.
- If the payload is clean and it still fails, this is network-level (issue #818). Workaround: retry or split the request.
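The first step of the checklist can be scripted. The sketch below is an illustration under stated assumptions: it removes fields cumulatively (each removed field stays removed), it looks only at top-level fields, and send is a hypothetical caller-supplied function that posts the payload and returns the HTTP status code. The payload is assumed to be one that already fails with HTTP 400.

```python
SUSPECT_FIELDS = [
    "tags", "enforced_tokens", "plugins", "strict",
    "guided_json", "guided_regex", "guided_grammar", "guided_choice",
]

def find_rejected_field(payload: dict, send):
    """Remove suspect top-level fields one at a time, resending after each removal.

    Returns the field whose removal made the request stop failing with HTTP 400,
    or None if the request still fails after every suspect field is gone.
    """
    candidate = dict(payload)          # shallow copy; the caller's payload is not modified
    for field in SUSPECT_FIELDS:
        if field not in candidate:
            continue
        del candidate[field]
        if send(candidate) != 400:
            return field
    return None
```

A None result means the next thing to inspect is the tool schema (depth and node count) rather than the field list.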
For Inference User:
- First check the payload against the allowlist in docs/chat-api/README.md (planned). Most Hermes / OpenClaw 400s are due to a single field or schema.
- Generic broker messages like "upstream model provider rejected" are misleading: some brokers collapse specific gateway 400s into a generic message, while others pass through the original ("Chat completions parameter \"tags\" is currently rejected by the Gonka network..." with a link to docs). For the broker comparison, see comparison-brokers.md (planned). If one broker shows a generic error, try another to get a readable message and understand the root cause.
- If the payload is clean and it still fails, the cause is network-level (issue #818). Workaround: retry or split; on sustained >50k payloads a single retry is often not enough, so split.
For Broker:
- (1) Show the native context window for each model (on the landing page, via the /v1/models endpoint, or in docs) with an explicit caveat that effective per-request capacity may be lower due to host heterogeneity (issue #818). Some brokers intentionally omit this to avoid over-promising, which is a defensible choice. (2) Until host-level capacity advertising is implemented, consider client-side filtering or a "preferred-host" list.
- UX: the gateway returns specific 400s with field names and messages ("Chat completions parameter \"tags\" is currently rejected by the Gonka network..." + a link to docs). We recommend passing detailed messages through to clients in production, since it speeds up diagnosis. Security note: detailed messages may reveal internal field names, host paths, and validator IDs that help enumeration or prompt-injection probes. Conservative masking is a defensible default. If you wrap them in a generic "upstream provider rejected" for security, use a hybrid approach: full details in async logs / error tracking, a generic message with a tracking ID to the client. The compatibility map for agents is docs/chat-api/agents.md (planned).
Why does Kimi generate malformed JSON for tool calls when output exceeds 4k–8k tokens?
Neither bandwidth nor a Gonka-side limitation. Three overlapping causes.
(a) max_tokens truncation
The effective output cap on the tested brokers is 3,072 tokens; the gateway network ceiling is 4,096. When the assistant emits tool calls with large JSON blobs in arguments plus visible content, you can hit the broker's real cap and get truncated JSON. Details on per-broker override — Q3.
(b) Kimi-K2.6 tool-parser duplicate ID collision
[vLLM PR #21259 — UNVERIFIED]. With n > 1, the kimi_k2 parser recomputes history_tool_call_cnt inside the per-choice loop — both branches get id = functions.<name>:0. The gateway sees a duplicate ID in vLLM's response and rejects it with HTTP 400 (per the OpenAI spec). Anchor #reject-duplicate-tool-call-id in docs/chat-api/troubleshooting.md (planned). Upstream fix — vLLM PR #21259 (merge status not independently confirmed).
(c) Hermes tool-parser JSONDecodeError on multiple tool blocks
[vLLM #17790 — awaiting upstream fix]. Different parser, different problem: a JSONDecodeError when the model emits several tool-call blocks in one response — vLLM #17790. Related: <tool_call> inside <think> breaks hermes parsing — vLLM #42021. These don't depend on Gonka — awaiting an upstream fix.
For Inference User:
- Rewrite tool_call.id on the client side before sending subsequent messages, into the canonical format functions.<name>:<global_idx>. This is the official Moonshot recommendation, duplicated in docs/chat-api/troubleshooting.md#reject-duplicate-tool-call-id (planned). An alternative is fresh UUIDs.
- Don't dedup by id: two calls with the same id may contain different results. Losing them = losing the agent's work.
- Raise max_tokens for responses with tool calls; large arguments blobs quickly hit the cap.
- A generic broker error "upstream model provider rejected" usually means a gateway-side reject, not the model. First check the message and the ID for duplicates, then suspect the model (see broker differences in Q4).
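The ID rewrite can be done in one pass over the message history. The sketch below assumes OpenAI-style messages (assistant messages carry tool_calls[].id and tool_calls[].function.name, tool messages carry tool_call_id), a zero-based counter that runs across the whole conversation, and tool results that arrive in the same order as their calls. canonicalize_tool_call_ids is a hypothetical helper name.

```python
import copy
from collections import defaultdict, deque

def canonicalize_tool_call_ids(messages: list) -> list:
    """Return a copy of the history with tool-call ids rewritten to
    functions.<name>:<global_idx>, keeping every call and every result."""
    out = copy.deepcopy(messages)
    pending = defaultdict(deque)   # old id -> new ids not yet matched to a tool result
    next_idx = 0
    for msg in out:
        if msg.get("role") == "assistant":
            for call in msg.get("tool_calls") or []:
                new_id = f"functions.{call['function']['name']}:{next_idx}"
                pending[call["id"]].append(new_id)
                call["id"] = new_id
                next_idx += 1
        elif msg.get("role") == "tool":
            queue = pending.get(msg.get("tool_call_id"))
            if queue:              # duplicate old ids are matched first-in, first-out
                msg["tool_call_id"] = queue.popleft()
    return out
```

The queue per old id is what keeps two calls that shared an id distinct: nothing is deduplicated, and each result stays attached to its own call.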
For Broker:
- If you are considering gateway-side dedup-by-id: two tool calls with the same ID may contain different results; it's safer to rewrite the ID into the canonical format functions.<name>:<global_idx> (don't dedup). Document the pattern in the customer FAQ with a link to troubleshooting.md#reject-duplicate-tool-call-id. Security note: a naive dedup-by-id is an attack surface if not validated carefully. Canonicalizing the ID instead of removing the call is safer.
- UX: pass through the specific gateway error message ("messages[N].tool_calls[M].id is duplicated") instead of a generic wrapper; it reduces time-to-fix for agentic clients. Security note: balance debug-friendliness and information disclosure (see Q4).
Could enabling guided decoding fix the token cap issue?
Guided decoding has nothing to do with the token cap. The mechanism forces the model to generate output to a schema (JSON Schema, regex), but doesn't change the token count. About the cap — Q3.
The low-level vLLM fields guided_json, guided_regex, guided_grammar, guided_choice are rejected by the gateway with HTTP 400 (anchor #reject-guided-decoding in docs/chat-api/troubleshooting.md (planned)). The reason — they bypass the xgrammar bounds applied to the response_format / structured_outputs envelope to mitigate CVE-2025-48944.
The correct fields for structured output
| Field | Kimi K2.6 | Qwen3-235B | Notes |
|---|---|---|---|
| response_format (type: "json_schema" or "json_object") | works | works | OpenAI standard. Reliable choice. Empirically verified on both models through a public broker. |
| structured_outputs envelope (json/regex/choice/grammar/structural_tag/json_object) | HTTP 400 (network-wide reject) | HTTP 400 (network-wide reject) | PR #1215 (StructuredOutputsValidator) merged in the repository, but not activated on production mainnet as of 2026-05-25. Both brokers reject with an identical error: "Chat completions parameter structured_outputs is currently rejected by the Gonka network". The error references the dev branch dl/devshards-gateway-to-main, not main. This is a network-wide release lag, not per-broker. The only reliable structured-output option today is response_format on Kimi K2.6 and Qwen3. |
| Both at once (response_format + structured_outputs) | HTTP 400 | HTTP 400 / 502 (depends on the broker) | The gateway rejects the combo before vLLM (anchor #reject-structured_outputs-with-response_format). On vLLM 0.20.0 the fields are merged via dataclasses.replace() and violate exactly-one in StructuredOutputsParams.__post_init__. |
For Inference User:
- For maximum portability across brokers and models, use response_format (works everywhere). The structured_outputs envelope is currently rejected network-wide.
- Don't combine response_format and structured_outputs in one request: HTTP 400.
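A request builder can enforce both rules before anything is sent. The sketch below uses the OpenAI-standard response_format shape for json_schema; with_json_schema and the CONFLICTING tuple are hypothetical names, and the list of conflicting fields is taken from the fields this answer describes as rejected.

```python
import copy

CONFLICTING = ("structured_outputs", "guided_json", "guided_regex",
               "guided_grammar", "guided_choice")

def with_json_schema(payload: dict, name: str, schema: dict) -> dict:
    """Return a copy of the request body that asks for schema-constrained JSON
    via response_format, refusing payloads that also carry a rejected field."""
    present = [f for f in CONFLICTING if f in payload]
    if present:
        raise ValueError(f"remove {present} before adding response_format")
    out = copy.deepcopy(payload)
    out["response_format"] = {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema},
    }
    return out
```

Failing locally with a ValueError is cheaper than a round trip that ends in HTTP 400.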
For Broker:
- Guided decoding doesn't raise throughput. Don't promise it to clients as a solution for the token cap.
- Watch for the rollout of PR #1215 (StructuredOutputsValidator) on all routes: Qwen3 users are already waiting for the structured_outputs envelope for regex / choice / grammar workloads.
Why does generation speed fluctuate so drastically? And why does the boost apply only to reasoning tokens?
Speed fluctuations are a real, known open problem. The roots are in three different layers.
1. Per-host slowdowns / stalls (host-level)
An open research task: issue #818 "Slow nodes investigation" (OPEN since February 2026, Priority: High). Specific patterns are documented without a root cause in the planned known-issues.md: §1 "Host returns no stream after receipt" and §2 "Host stalls after producing chunks" (in some cases the stream resumes after a minute, in others never).
2. Routing variance (broker-level)
Between two consecutive requests the broker may land on different hosts with different loads. End-to-end latency varies depending on the devshard-XXXX-YYY host ID. Per-token generation speed on a stable host stays practically the same.[¹]
[¹] Illustrative observation: in one test (5 requests over ~30 sec) end-to-end latency varied such that tokens / total_latency showed a range of ~8–54 tok/s, but this metric includes TTFT and is not a published variance metric.
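The footnote's point, that tokens / total_latency mixes time-to-first-token (TTFT) into the speed figure, can be shown with a short calculation. The function name and the numbers in the usage note are illustrative, not measurements; the decode figure is a simple approximation that divides all completion tokens by the time after the first token.

```python
def throughput(completion_tokens: int, total_latency_s: float, ttft_s: float):
    """Return (end_to_end_tok_per_s, decode_tok_per_s)."""
    if not 0 <= ttft_s < total_latency_s:
        raise ValueError("ttft_s must be non-negative and smaller than total_latency_s")
    end_to_end = completion_tokens / total_latency_s
    decode = completion_tokens / (total_latency_s - ttft_s)
    return end_to_end, decode
```

With hypothetical inputs, throughput(200, 10.0, 6.0) and throughput(200, 5.0, 1.0) give 20 and 40 tok/s end to end, yet the same 50 tok/s once TTFT is excluded: the spread comes from waiting for the first token, not from per-token generation speed.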
3. Validation windows at the network level (chain-level)
During PoC / Confirmation-PoC events (cPoC — the phases that confirm validator work within an epoch) some nodes are temporarily unavailable. At epoch boundaries there was a known problem with the snapshot preserved-nodes, in which the gateway returned attempts: [] (no available hosts on the route) — from the client side, a timeout. The effect is more noticeable the fewer nodes with that model the broker serves; it's stronger on models with a small number of providers.
"Reasoning faster than visible" — not prioritization, but output structure
There's no special fast route for reasoning tokens on the gateway. In the devshard code, delta.reasoning, delta.content, delta.reasoning_content, delta.tool_calls are all detected the same way via sseChunkHasContent. Per-token speed is the same.
Kimi with thinking enabled first generates a bulky reasoning_content (hundreds to thousands of tokens), then a short visible answer (tens to hundreds). A client that doesn't show the reasoning field sees "silent, then blurts out the answer in a burst." In reality the model was generating the whole time, the result was just hidden.
For Inference User:
- Choose a broker that publishes uptime / p50 TTFT metrics. Available dashboards include gonka.pw and meter.gonka.gg (there may be others; the list is not exhaustive).
- On a slow request, remember the payload size: for short ones a retry lands on a different node; for sustained large payloads (>50k) landing on a node with a reduced window is a systematic problem (issue #818), and a retry alone may not work — better to split.
- To see progress while the model thinks, render delta.reasoning_content (or delta.reasoning) in the UI, e.g. in a collapsed block.
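Rendering reasoning separately starts with splitting the stream. The sketch below assumes OpenAI-style server-sent events ("data: {json}" lines ending with "data: [DONE]") and the delta field names mentioned above; split_stream is a hypothetical helper, and a real UI would update incrementally instead of joining at the end.

```python
import json

def split_stream(lines):
    """Collect hidden reasoning and visible content from SSE lines.

    Returns (reasoning_text, content_text).
    """
    reasoning, content = [], []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            thought = delta.get("reasoning_content") or delta.get("reasoning")
            if thought:
                reasoning.append(thought)
            if delta.get("content"):
                content.append(delta["content"])
    return "".join(reasoning), "".join(content)
```

Showing the first string in a collapsed block makes the "silent" phase visible as progress.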
For Broker:
- The highest-priority shared problem for the whole network. Contribute production logs / traces to issue #818 — this gives the core team data they don't have.
- Help implement host-side improvements (chunked gossip recovery, per-escrow lastAfterReq tracking, tracked in the planned host-improvements.md and related issues): they directly address routing / recovery weak spots.
Why does speed vary depending on hardware — faster on B200, slower on H200?
Speed depends on hardware — this is normal for a heterogeneous network. The PoC weight on the chain reflects the node's real performance (affecting the validator's reward share), while the broker's routing locally picks an available host from escrow — two consecutive requests may land on GPUs of different generations.
For Inference User: speed depends on the hardware distribution in the network. You don't pick hardware directly — you pick a broker. Need predictable latency — ask the broker which hardware tier they route to by default.
For Broker:
Where exactly the difference comes from (per internal benchmarks from kaitakuai/experiments — not measured on gonka-api.org or gonkagate.com):
| GPU | Memory | sm | Qwen3-235B nonces/min per instance | Per-GPU |
|---|---|---|---|---|
| 4×H100 SXM5 | 80 GB HBM3 | 90 | 1,248 @ batch=16 | ~312 |
| 4×H200 | 141 GB HBM3e | 90 | 1,408 @ batch=32–64 | ~352 |
| 2×B200 | 192 GB HBM3e | 100 | 1,984 @ batch=64 | ~992 |
- H200 vs H100: +13% per-GPU. Same chip (sm_90), but HBM3e + 141 GB vs HBM3 + 80 GB → allows a smaller TP for large models and a faster KV cache.
- B200/B300 vs H100/H200: ~3× per-GPU on Qwen3-235B FP8.
- Kimi-K2.6 INT4, specific numbers: 4×B200 gives 2,240 nonces/min = ~560 per-GPU (see experiments/2026-05/kimi_k26_int4_4xb200_q-int4-k2). 16×H100 TP gives 1,389 nonces/min = ~87 per-GPU (see experiments/2026-05/kimi-k26-int4-2x8xh100). The difference on a per-GPU basis is roughly 6×; in absolute numbers, per-GPU Kimi is slower than Qwen on the same hardware (4×B200 Kimi INT4 ~560 per-GPU vs Qwen ~992 per-GPU).
- Kimi-K2.6 INT4 on Blackwell: VLLM_USE_FLASHINFER_MOE_INT4=1 gives +138% vs Marlin (A/B test in experiments/2026-05/kimi_k26_b300_eager_flashinfer). Applicable only to INT4 MoE workloads on the Blackwell family (kernel gate: is_device_capability_family(100), covers B100/B200/B300; B300 is effectively sm_103a).
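The per-GPU figures and ratios above follow directly from the per-instance numbers. The short calculation below reproduces them; the dictionary keys are labels chosen for this sketch, and the inputs are the nonces/min values quoted in this answer.

```python
# (nonces per minute per instance, GPUs per instance), as quoted above
RUNS = {
    "qwen_4xH100": (1248, 4),
    "qwen_4xH200": (1408, 4),
    "qwen_2xB200": (1984, 2),
    "kimi_4xB200": (2240, 4),
    "kimi_16xH100": (1389, 16),
}

per_gpu = {name: rate / gpus for name, (rate, gpus) in RUNS.items()}

h200_vs_h100 = per_gpu["qwen_4xH200"] / per_gpu["qwen_4xH100"]         # about 1.13
b200_vs_h100 = per_gpu["qwen_2xB200"] / per_gpu["qwen_4xH100"]         # about 3.2
kimi_b200_vs_h100 = per_gpu["kimi_4xB200"] / per_gpu["kimi_16xH100"]   # about 6.5
```

The same division applies to your own per-host stats when comparing hardware tiers.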
Tracing and diagnostics: observability was merged in PR #1046 "Implement dapi & devshard observability" — it adds OpenTelemetry traces, Prometheus metrics, and dashboards. If Grafana has no per-host TTFT panels — check that DAPI / devshard are updated and the dashboards are included in the build.
Additional sources: the repo kaitakuai/experiments (updated regularly), your own per-host stats from gonka.pw, and network status from meter.gonka.gg. Want to influence the hardware distribution — scale devshard escrow toward hosts with the preferred GPUs.
Why can't the model use tools properly within Kilo Code?
Most likely, one of four causes — the gateway applies a strict parameter allowlist and tight caps on the JSON Schema. This is not Kilo-specific: the same causes trigger for any coding agent (Cline, Continue.dev, OpenCode, etc.).
1. Hard reject (HTTP 400) — needs to be fixed on the client side
| Trigger | Cause | Fix |
|---|---|---|
| The tags field in the payload | Not from the OpenAI Chat Completions standard; folkloric Hermes convention; anchor #reject-tags | Use metadata (OpenAI standard) or user for tracking |
| Schema depth > 16 in tools[].function.parameters | CVE-driven cap | Flatten the schema; PR #1187 raised it from 5 → 16 |
| Schema nodes > 256 (total) | CVE-driven cap | Reduce it; PR #1195 raised it from 128 → 256. MCP tools with large input schemas may approach the limit; test on your gateway. If you genuinely need an MCP tool with >256 nodes, file a feature request. |
2. Silent coerce / strip — the request doesn't fail, but behavior changes
| Trigger | What the gateway does | Notes |
|---|---|---|
| tool_choice: "required" | Silently → "auto" (network policy) | Anchor #coerce-tool-choice-required. In most cases the model will make a tool call for an obviously tool-relevant prompt, but there's no "required" guarantee |
| tools[].function.strict: true | Silently drops the field | vLLM parsers (hermes, kimi_k2) ignore the flag. PR #1193 |
The compat matrix for known clients: docs/chat-api/agents.md (planned). A basic working tool-calling example: Developer Quickstart §1.4.
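The schema caps from the hard-reject table can be pre-checked on the client. How the gateway counts depth and nodes is not specified here, so the sketch below makes its own assumption: every JSON object counts as one node and one level of depth, and arrays are transparent. That tends to overcount, which makes it a conservative pre-check and not a reproduction of the gateway's validator. schema_stats and within_limits are hypothetical names.

```python
MAX_DEPTH = 16    # cap on schema depth quoted above
MAX_NODES = 256   # cap on total schema nodes quoted above

def schema_stats(schema):
    """Return (max_depth, node_count), counting every JSON object as a node."""
    nodes = 0

    def walk(value, depth):
        nonlocal nodes
        if isinstance(value, dict):
            nodes += 1
            depth += 1
            return max([depth] + [walk(v, depth) for v in value.values()])
        if isinstance(value, list):
            return max([depth] + [walk(v, depth) for v in value])
        return depth

    return walk(schema, 0), nodes

def within_limits(schema) -> bool:
    depth, nodes = schema_stats(schema)
    return depth <= MAX_DEPTH and nodes <= MAX_NODES
```

Run it over each tools[].function.parameters value; a schema that fails this check is a candidate for flattening before the request is sent.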
For Inference User:
- Reproduce with the same curl that Kilo Code generates (via the client's debug log or an intermediate proxy). In the 400 body the gateway usually states the name of the rejected field; the broker may mask the message into a generic "upstream rejected" — but the specific problem field is usually one.
- Cross-check against the lists in agents.md and troubleshooting.md (planned): most 400s fall into the documented reject anchors (#reject-tags, #reject-enforced_tokens, #reject-structured_outputs-kimi).
- Quick checklist if the error message is unclear: check the fields tags, enforced_tokens, plugins, strict, guided_*; remove them one at a time and resend the request. If that doesn't help, check the schema depth (≤ 16) and nodes (≤ 256).
- If the rejected field is not documented, open an issue on gonka-ai/gonka with the captured request.
For Broker:
- If there is no link to agents.md on the dashboard, adding one is a cheap quick win.
- If you have capacity, file an issue about non-standard fields in gonka-ai/gonka: it helps every broker in the ecosystem.
Agents like Hermes and OpenClaw fail to complete tool tasks on Kimi. Why?
A composition of three factors
The original FAQ version mentioned a fourth — the special-token sanitizer — but that's about security/prompt-injection, not tool-call failure; the PR fix is deferred, since Kimi handles special tokens correctly (empirically).
- The gateway allocates half of max_tokens for thinking by default (see Q1/Q2). With the default thinking_token_budget = max_tokens / 2, that half goes to <think> before the model even starts emitting a tool call. For tool-heavy agentic flows the budget runs out before useful output. Mitigation: set thinking_token_budget: 0 explicitly (Q2). This is a gateway policy, not a model limitation.
- The output cap 3,072 (effective) / 4,096 (network ceiling) is tight for tool-heavy outputs (Q3). Large arguments blobs + visible content easily hit the ceiling.
- Upstream vLLM tool-parser bugs (Q5): duplicate tool_calls[].id collisions with n>1 (vLLM PR #21259, UNVERIFIED) and the hermes parser JSONDecodeError on multiple tool blocks (vLLM #17790).
Builder pain point with a link: issue #1229 — long reasoning chains, tool-call compatibility, continuation after exceeding output limits are listed as blockers of agentic coding workflows.
For Inference User:
- For Kimi this is mandatory: "thinking": {"type": "disabled"} + "max_tokens": 4096 (or an explicit thinking_token_budget: 0, see Q2 on belt-and-suspenders). This frees the entire cap for tool-heavy output. Empirically, Kimi easily emits 5 parallel tool calls in one response in ~4 seconds.
- Control tool_call.id on the client side: rewrite it into the canonical format functions.<name>:<global_idx> (Q5) to avoid the gateway duplicate-id reject.
- Control the schema: keep depth ≤ 16 and nodes ≤ 256 (Q9). MCP tools with large input schemas may not pass.
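The client-side id rewrite can be sketched as follows. This is a minimal illustration, assuming OpenAI-style message dicts and assuming that <global_idx> is a 0-based counter running across the whole conversation; the helper name is hypothetical. Tool replies are matched to calls in order, so the rewrite still works when the incoming ids are duplicated.

```python
import copy
from collections import defaultdict, deque


def canonicalize_tool_call_ids(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with ids rewritten to functions.<name>:<global_idx>."""
    out = copy.deepcopy(messages)
    pending = defaultdict(deque)  # old id -> new ids still waiting for a tool reply
    global_idx = 0                # assumption: one counter for the whole conversation
    for msg in out:
        if msg.get("role") == "assistant":
            for call in msg.get("tool_calls") or []:
                new_id = f"functions.{call['function']['name']}:{global_idx}"
                pending[call["id"]].append(new_id)
                call["id"] = new_id
                global_idx += 1
        elif msg.get("role") == "tool":
            queue = pending.get(msg.get("tool_call_id"))
            if queue:
                # Replies are consumed in call order, so duplicated old ids stay distinct.
                msg["tool_call_id"] = queue.popleft()
    return out


history = [
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "get_weather", "arguments": "{}"}},
        {"id": "call_1", "type": "function",
         "function": {"name": "get_time", "arguments": "{}"}},
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": "sunny"},
    {"role": "tool", "tool_call_id": "call_1", "content": "12:00"},
]
for msg in canonicalize_tool_call_ids(history):
    print(msg.get("tool_call_id") or [c["id"] for c in msg["tool_calls"]])
```

Run the rewrite on the full message history just before each request, so that ids produced in earlier turns stay consistent with the counter.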
For Broker:
- Combine the cap bump (Q3: per-model request_max_tokens_cap via /v1/admin/settings) with the recommendations above; together they cover the main class of agent failures on your gateway.
OpenCode cannot apply requested code changes (cuts off mid-sentence). What is causing this?
Three causes; the client can work around two, but not the third.
- max_tokens truncation on large diffs. Large code patches don't fit in the effective cap of 3,072 (Q3). Workaround: split the diff into several tool calls; each one fits the budget more easily.
- vLLM crashes on edge-case params. A series of 8 merged PRs (#1170, #1171, #1172, #1174, #1180, #1212, #1215, #1216) added hardening against fields that crashed the engine. On a recent gateway (≥ devshard 0.2.13), most known crash scenarios are rejected by 400 validators instead of crashing the engine.
- Host stream drops after receipt (open; described in the planned known-issues.md §1). The host accepted the request but doesn't return chunks. This is a network-level issue with no client workaround other than retry.
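For the third cause, retry is the only client-side option. A minimal sketch of a retry wrapper with exponential backoff, assuming your HTTP client surfaces a stalled or dropped stream as TimeoutError or ConnectionError (for example through a read timeout); send stands for whatever function issues your request and is hypothetical here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(send: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call send(); on a timeout or dropped connection, wait and try again."""
    for attempt in range(attempts):
        try:
            return send()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")


# Demo with a fake sender that stalls twice before answering.
state = {"calls": 0}

def fake_send() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("no chunks received")
    return "ok"

print(call_with_retry(fake_send, sleep=lambda seconds: None))  # ok
```

Each retry resends the whole request, so pair it with a read timeout short enough that a silent host is detected quickly.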
For Inference User:
- For Kimi: "thinking": {"type": "disabled"} + "max_tokens": 4096. Split large diffs into several tool calls.
- Long-term: see Q3 on the broker cap and Q5 on the tool-call canonical id format.
For Broker: document the "split big diffs" pattern in the customer FAQ for coding-agent clients.
Is there a model that handles both input and output without trade-offs?
MiniMax-M2.7 launched on mainnet ~2026-05-28 via the chain governance upgrade v0.2.13 — Gonka's third model. Verified live on both brokers. Clarification: "Qwen output cap 8,192" in the question is inaccurate — the output cap is the same for all models (3,072 / 4,096, Q3), not model-side.
| Model | Native context | Mainnet context | Native thinking | Tool calls |
|---|---|---|---|---|
| Kimi-K2.6 | 256K | 240K | yes (chat_template_kwargs) | functions.<name>:<idx> |
| Qwen3-235B-A22B-Instruct-2507-FP8 | 128K | 240K | no (Instruct) | hermes parser |
| MiniMax-M2.7 | ~180K | 180K | yes (<think> in content) | chatcmpl-tool-<hash> |
MiniMax deploy spec (inference-chain/app/upgrades/v0_2_13/upgrades.go:minimaxGovernanceModel()):
ModelArgs: ["--enable-auto-tool-choice", "--kv-cache-dtype", "fp8",
"--tool-call-parser", "minimax_m2",
"--reasoning-parser", "minimax_m2_append_think"]
VRam: 320 GB
ThroughputPerNonce: 5000 (Kimi: 1500; MiniMax is about 3.3× higher)
minimaxStartEpoch: 271
HfCommit: d494266a4affc0d2995ba1fa35c8481cbd84294b
Important differences of MiniMax from Kimi/Qwen:
- <think> blocks arrive in delta.content (not in reasoning_content as with Kimi); this is the behavior of the minimax_m2_append_think parser. Parse the tags client-side if you don't need them in the final text.
- Tool-call IDs have the form chatcmpl-tool-<hash>. They are already unique by shape, so the Q5 advice about canonical id rewriting doesn't apply.
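Parsing the tags client-side can be as small as the sketch below. It assumes the content has already been assembled from the streamed delta.content chunks, and it treats an unterminated <think> (for example, output cut off by the cap) as thinking through to the end of the text. The function name is illustrative.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content: str) -> tuple[str, list[str]]:
    """Return (visible_text, thinking_blocks) for a fully assembled message."""
    thoughts = [t.strip() for t in THINK_RE.findall(content)]
    visible = THINK_RE.sub("", content)
    # An opening tag with no closing tag means the rest of the text is thinking.
    head, sep, tail = visible.partition("<think>")
    if sep:
        thoughts.append(tail.strip())
        visible = head
    return visible.strip(), thoughts


text, thoughts = split_thinking("<think>Need the weather tool.</think>\n\nIt is sunny in Paris.")
print(text)      # It is sunny in Paris.
print(thoughts)  # ['Need the weather tool.']
```

When streaming, apply this to the accumulated text and not to individual chunks, since a tag can be split across two chunks.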
Related artifacts: PR #1163 Weight Scaling (merged 2026-05-13, aligned the economics with Kimi); PR #1226 (open, not merged) — a gateway-side refactor on top of the deployed model, not a blocker.
For Inference User: MiniMax-M2.7 is available today (ID MiniMaxAI/MiniMax-M2.7 on gonka-api.org, minimaxai/minimax-m2.7 on gonkagate.com — see case-sensitivity Q1). Choose by workload: Kimi for reasoning+tools, Qwen3 for large context + structured outputs, MiniMax-M2.7 — a tool-friendly alternative to Kimi with better throughput.
For Broker: the deploy was done by the network via the v0.2.13 upgrade. If you are not serving MiniMax yet, check that your mlnode-image supports the deploy args above and that the hosts are updated. PR #1226 (open) will improve the UX (per-model dispatch, tool-message shape), but it is not a blocker.
Why is there no working web search available?
By design — Gonka is an inference network, not an agent framework. Plugin / web execution is a concern of the client's agent layer or a broker with value-add services, not the inference path.
Specifically: on 2026-05-25 we tested the same plugins payload through two brokers. gonka-api.org silently strips the field (HTTP 200, anchor #strip-plugins in docs/chat-api/troubleshooting.md (planned)); gonkagate.com rejects it with HTTP 400 "Plugin config is invalid". Both are valid interpretations of the gateway contract: one in favor of lenient parsing (silent strip), the other strict validation (reject unknown fields). In both cases plugins is not executed: vLLM has no plugin-execution path, and quietly passing this field through would imply a backend capability that doesn't exist. When migrating between brokers, account for the divergence (details in comparison-brokers.md (planned)).
For Inference User: run the search in your own agent layer (LangChain, LlamaIndex, your own wrapper), inject the results into messages[].content before calling /v1/chat/completions. This is the standard pattern for all OpenAI-compatible endpoints.
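A minimal sketch of the injection step, assuming your agent layer has already produced a list of search results. The result keys (title, url, snippet), the helper name, and the system prompt wording are illustrative assumptions, not a Gonka or broker convention.

```python
def build_messages_with_search(question: str, results: list[dict]) -> list[dict]:
    """Put search results into the user message content as numbered sources."""
    sources = "\n\n".join(
        f"[{i}] {r['title']} ({r['url']})\n{r['snippet']}"
        for i, r in enumerate(results, start=1)
    )
    return [
        {"role": "system",
         "content": "Answer using only the numbered sources. Cite them as [n]."},
        {"role": "user",
         "content": f"Sources:\n\n{sources}\n\nQuestion: {question}"},
    ]


messages = build_messages_with_search(
    "What is Gonka?",
    [{"title": "Gonka FAQ", "url": "https://example.com/faq",
      "snippet": "Gonka is a decentralized network for AI compute."}],
)
print(messages[1]["content"])
```

The returned list goes into the messages field of an ordinary /v1/chat/completions request; no plugins field is involved.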
For Broker: an opportunity for differentiation. A broker-level value-add ("we do search and inject the results into messages") is a legitimate product, and it can be implemented entirely on top of Gonka, without protocol changes. Security note: stripping plugins may reflect an abuse-resistance policy (not a UX failure), so think this through if you plan to offer plugin execution as a product. If you want to propose it as a standard, open an ecosystem Discussion on gonka-ai/gonka.
When will reliable web fetching be supported?
By design it's not on the Gonka roadmap. The right place is a side-car or a value-add at the broker layer.
For Inference User: build / buy a fetch service (Tavily, Exa, Perplexity API for search; trafilatura/Readability for parsing), normalize into text, send it through an OpenAI-compatible call. There are plenty of ready-made solutions.
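The "normalize into text" step can be illustrated with the Python standard library alone. This is a deliberately crude stand-in for trafilatura or Readability (it keeps every text node and only drops script, style, and noscript content), shown to make the shape of the step concrete and not as a recommended extractor.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of non-visible elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML page, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = "<html><head><style>p{}</style></head><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
print(html_to_text(page))
```

The resulting text is what you would place into messages[].content; a dedicated extractor will do a much better job of removing navigation and boilerplate.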
For Broker: if you want to offer it as a tier, open an ecosystem Discussion in gonka-ai/gonka so the community converges on common conventions (e.g. a side-car that everyone deploys consistently).
Context7 docs research — summary fails. Is this the output token limit?
The same blocker as in "The input token cap for Kimi is 4k tokens, and the output cap is 8,192 tokens. When will these limits be raised?". The output cap (effective 3,072 / network ceiling 4,096) is tight for "tool result body + summary in one response." If thinking is enabled, half of the budget goes there (Q1/Q2).
A ready-made payload for the summary use case:
{
  "model": "moonshotai/Kimi-K2.6",
  "messages": [
    {"role": "system", "content": "You produce structured summaries of technical documents."},
    {"role": "user", "content": "Summarize the following document:\n\n<paste the text here>"}
  ],
  "max_tokens": 4096,
  "thinking": {"type": "disabled"},
  "thinking_token_budget": 0,
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "document_summary",
      "strict": true,
      "schema": {
        "type": "object",
        "additionalProperties": false,
        "required": ["summary", "key_points"],
        "properties": {
          "summary": {"type": "string", "description": "3-5 sentences"},
          "key_points": {"type": "array", "items": {"type": "string"}, "minItems": 3, "maxItems": 7}
        }
      }
    }
  }
}
For Inference User:
- Use the payload above as a template. response_format compresses the output into the required shape, saving budget.
- If the document is long and hits the cap (finish_reason=length), split it into N+1 calls: one fetch+plan, the rest section summaries; stitch them together client-side.
- Don't combine response_format with the structured_outputs envelope; that returns HTTP 400 (Q6).
- Schema: depth ≤ 16, nodes ≤ 256 (Q9).
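The schema limits can be checked before sending. The sketch below is a client-side pre-flight check under an assumed counting rule: every subschema (the root, each property, items, each anyOf/oneOf/allOf branch, and so on) counts as one node, and depth is the nesting level of subschemas with the root at depth 1. The gateway's exact counting rule is not described here, so treat the numbers as an approximation.

```python
def schema_stats(schema: dict) -> tuple[int, int]:
    """Return (max_depth, node_count) for a JSON Schema (assumed counting rule)."""

    def children(node: dict):
        for key in ("properties", "patternProperties", "$defs", "definitions"):
            yield from (v for v in node.get(key, {}).values() if isinstance(v, dict))
        for key in ("items", "additionalProperties", "not"):
            if isinstance(node.get(key), dict):
                yield node[key]
        for key in ("anyOf", "oneOf", "allOf", "prefixItems"):
            yield from (v for v in node.get(key, []) if isinstance(v, dict))

    def walk(node: dict, depth: int) -> tuple[int, int]:
        max_depth, count = depth, 1
        for child in children(node):
            child_depth, child_count = walk(child, depth + 1)
            max_depth = max(max_depth, child_depth)
            count += child_count
        return max_depth, count

    return walk(schema, 1)


def fits_gateway_limits(schema: dict, max_depth: int = 16, max_nodes: int = 256) -> bool:
    depth, nodes = schema_stats(schema)
    return depth <= max_depth and nodes <= max_nodes


summary_schema = {
    "type": "object",
    "additionalProperties": False,
    "required": ["summary", "key_points"],
    "properties": {
        "summary": {"type": "string"},
        "key_points": {"type": "array", "items": {"type": "string"}},
    },
}
print(schema_stats(summary_schema))         # (3, 4)
print(fits_gateway_limits(summary_schema))  # True
```

The summary schema from the payload above is far below both limits; large MCP tool schemas are the ones worth checking.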
For Broker: response_format is the simplest and most portable mitigation regardless of your cap-bump policy. Consider a per-customer cap-bump option once the per-model request_max_tokens_cap is in your admin config.
Gonka has no KV cache. When will caching be added?
Short answer: there's no ETA. On the Gonka gateway side everything is ready — the blocker is on the upstream vLLM side, issue #33264 has been open 4+ months with no merged PR. Until it's closed, the prompt_cache_key field in your request is silently ignored — don't include it, so you don't rely on behavior that doesn't exist.
The vLLM prefix KV cache works on each ML node. Gateway-level prompt_cache_key / cache_key are currently silently stripped — a limitation blocked by an unmerged upstream vLLM PR.
Current status quo
- Gateway behavior: prompt_cache_key (OpenAI standard) and cache_key (Moonshot Kimi convention) are silently stripped; neither reaches vLLM. Anchors: docs/chat-api/troubleshooting.md#strip-prompt_cache_key and #strip-cache_key (planned).
- Upstream blocker: vLLM uses the cache_salt field for prompt-cache isolation (RFC #16016, PR #17045). Aliasing prompt_cache_key → cache_salt is tracked in vLLM #33264, open since January 2026 with no merged PR.
- Security rationale: simply forwarding cache_key without isolation would be unsafe, because there are published prompt-cache timing side-channel attacks (arxiv 2502.07776 PROMPTPEEK). The gateway cannot offer false cache-isolation guarantees.
- The 80–90% hit rate is not a Gonka claim. It's either a misinterpretation of someone's marketing material or confusion with OpenAI / Anthropic native cache (which guarantee sticky routing within a single provider).
Important architectural caveat
Even when vLLM #33264 merges and the gateway adds a hash → cache_salt bridge, the cache remains per-vLLM-instance. Gonka's multi-host routing means two requests with the same cache_key may land on different hosts with different prefix caches. Without sticky routing (which doesn't exist now), guaranteeing an OpenAI-style ~80% hit rate is architecturally hard. None of the three prerequisites (upstream vLLM PR, gateway bridge, sticky routing) has shipped today.
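To see why sticky routing matters, consider a toy model. It assumes requests are routed uniformly at random across hosts and that a host never evicts a prefix once it has served it; neither assumption is a statement about Gonka's actual routing, which is not specified here. Under that model, the chance that the k-th request with a given prefix lands on a host that has already seen it is 1 - (1 - 1/N)^(k-1) for N hosts.

```python
def warm_hit_probability(hosts: int, request_index: int) -> float:
    """Chance that the k-th request with the same prefix hits a warm host.

    Toy model: uniform random routing over `hosts`, no cache eviction.
    """
    if hosts < 1 or request_index < 1:
        raise ValueError("hosts and request_index must be at least 1")
    return 1.0 - (1.0 - 1.0 / hosts) ** (request_index - 1)


for k in (1, 2, 5, 20):
    print(k, round(warm_hit_probability(10, k), 3))
```

With 10 hosts, the second request has only a 10% chance of reusing the prefix, and it takes many repeats before most requests land on a warm host. With a single host (or sticky routing) the probability is 1 from the second request onward.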
For Inference User: there's nothing to do today — prompt_cache_key and cache_key are no-ops. Don't rely on these fields for cost optimization.
For Broker: no gateway-side change is needed until vLLM #33264 merges. If you want to speed it up, comment on or contribute to that upstream issue. After the merge, the Gonka gateway will add a bridge that lights up both fields together.
When will image input be enabled for Kimi on the Gonka gateway?
Not available today. ETA — release v0.2.14 or later (current is 0.2.13), no fixed date. Multimodal payloads (messages[].content with type: "image_url" or "video_url") currently return HTTP 400 on both public brokers.
Work is active: the plan is written and broken into phases in the planned document multimodal-inference-plan.md in gonka-ai/gonka (≈466 lines, 6 phases: ML Node, Host↔ML Node, Broker/DAPI, Devshard Protocol, etc.). Until it's published, it's easier to track progress via the issues / PRs below.
Hard blockers today
- A multimodal-specific special-token sanitizer. The Kimi-K2.6 chat template accepts image_url/video_url content parts, but the gateway currently validates only text. Multimodal payloads (image URLs, alt-text, metadata) provide an additional injection surface that must be validated. The security review flagged it as a Phase 2 blocker. There's no published CVE for this specific multimodal threat yet; internal tracking is in progress.
- Independent VLM validation review. The validation methodology for image inputs needs to be independently confirmed. Issue #1026 (initial research: Qwen2-VL-2B F1=100% intermediate) + #1198 (re-validate, up-for-grabs).
Target: v0.2.14+, but there's no committed timeline; blocked by issue #1198 (independent validation, up-for-grabs).
What's empirically confirmed today: a request with a messages[0].content array containing {type:"image_url"} returns HTTP 400 on both routes (Kimi and Qwen3). Multimodal inputs are not accepted at the gateway level.
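Until multimodal input ships, a client can catch these payloads before they reach the gateway. A minimal pre-flight sketch, assuming OpenAI-style messages where content is either a string or a list of typed parts; the helper name is illustrative.

```python
MULTIMODAL_TYPES = {"image_url", "video_url"}


def find_multimodal_parts(messages: list[dict]) -> list[tuple[int, str]]:
    """Return (message_index, part_type) for every image or video content part."""
    found = []
    for index, message in enumerate(messages):
        content = message.get("content")
        if isinstance(content, list):  # plain string content cannot carry media parts
            for part in content:
                if isinstance(part, dict) and part.get("type") in MULTIMODAL_TYPES:
                    found.append((index, part["type"]))
    return found


messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "What is in this picture?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    ]},
]
print(find_multimodal_parts(messages))  # [(0, 'image_url')]
```

A non-empty result means the request would be rejected with HTTP 400 today, so the client can fail early or route the media through its own captioning step.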
For Inference User: not available today.
For Broker: three ways to speed it up:
- Take issue #1198 (up-for-grabs) — the independent VLM validation review is the hardest gating item.
- Review PR #1150 "vlm benchmark".
- When Phase 1-3 of the plan become reachable — prepare the gateway capability registry (Phase 3); the operator config will determine which content types your broker accepts.