LLM-Wrapper

LLM-Wrapper is a Go HTTP gateway between FinAI services and Google Gemini. It lets callers use an OpenAI-compatible chat-completion contract while also exposing Gemini-native generation routes. It centralizes API keys, rotation, retries, streaming, model metadata, protocol conversion, and limited repair of structured JSON.

News-Analyzer is its primary caller in the Trader system. The wrapper has no knowledge of investment concepts and does not decide verdicts. It transports model requests and makes response shapes more predictable.

Startup

The process loads a model configuration file and a JSON file containing one or more Gemini API keys. At least one valid key is required. It creates the key manager, Gemini HTTP client, API server, and model registry, then listens on port 11435.

Configuration is file-based. The container mounts the key and model files into its application directory. The model file specifies a default model and a list of model identifiers. For models the registry does not explicitly know, metadata may be filled with defaults.

HTTP interfaces

Route	Purpose
`POST /v1/chat/completions`	OpenAI-compatible chat completion, with normal or streamed output
`GET /v1/models`	OpenAI-style list of configured Gemini models and capabilities
`POST /v1/models/{model}:generateContent`	Gemini-native non-streaming request
`POST /v1/models/{model}:streamGenerateContent`	Gemini-native server-sent event relay
Swagger routes when enabled by the server	Interactive description generated from the checked-in API specification

The chat route accepts model, messages, sampling controls, maximum output tokens, stop sequences, tools, stream selection, and response-format instructions. The native routes accept Gemini content, system instruction, tools, and generation configuration.

OpenAI-to-Gemini translation

System messages become Gemini’s system instruction. User messages become user content. Assistant messages become model content. Tool results become function-response parts associated with user content, and requested tool calls become Gemini function-call parts.
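The role mapping above can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: the `Message` and `ToolCall` types are simplified stand-ins for the OpenAI chat schema, `translate` is a hypothetical helper, and wrapping a tool result under a `"result"` key is an assumption. The Gemini-side JSON field names (`functionCall`, `functionResponse`, `parts`) follow the public Gemini REST format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ToolCall and Message are simplified stand-ins for the OpenAI chat schema.
type ToolCall struct {
	ID, Name string
	Args     string // JSON-encoded arguments
}

type Message struct {
	Role       string // "system", "user", "assistant", or "tool"
	Content    string
	ToolCalls  []ToolCall // set on assistant messages that request tools
	ToolCallID string     // set on tool messages
}

// The types below use Gemini REST field names.
type FunctionCall struct {
	Name string         `json:"name"`
	Args map[string]any `json:"args"`
}

type FunctionResponse struct {
	Name     string         `json:"name"`
	Response map[string]any `json:"response"`
}

type Part struct {
	Text             string            `json:"text,omitempty"`
	FunctionCall     *FunctionCall     `json:"functionCall,omitempty"`
	FunctionResponse *FunctionResponse `json:"functionResponse,omitempty"`
}

type Content struct {
	Role  string `json:"role"`
	Parts []Part `json:"parts"`
}

// translate splits chat messages into a system instruction and Gemini contents.
// Messages with an unrecognized role are skipped in this sketch.
func translate(msgs []Message) (string, []Content) {
	var system []string
	var contents []Content
	names := map[string]string{} // tool call ID -> function name
	for _, m := range msgs {
		switch m.Role {
		case "system":
			system = append(system, m.Content)
		case "user":
			contents = append(contents, Content{Role: "user", Parts: []Part{{Text: m.Content}}})
		case "assistant":
			var parts []Part
			if m.Content != "" {
				parts = append(parts, Part{Text: m.Content})
			}
			for _, tc := range m.ToolCalls {
				args := map[string]any{}
				_ = json.Unmarshal([]byte(tc.Args), &args) // malformed JSON leaves args empty
				names[tc.ID] = tc.Name
				parts = append(parts, Part{FunctionCall: &FunctionCall{Name: tc.Name, Args: args}})
			}
			contents = append(contents, Content{Role: "model", Parts: parts})
		case "tool":
			// Assumption: the tool output is wrapped under a "result" key.
			resp := &FunctionResponse{Name: names[m.ToolCallID], Response: map[string]any{"result": m.Content}}
			contents = append(contents, Content{Role: "user", Parts: []Part{{FunctionResponse: resp}}})
		}
	}
	return strings.Join(system, "\n"), contents
}

func main() {
	system, contents := translate([]Message{
		{Role: "system", Content: "Answer briefly."},
		{Role: "user", Content: "What time is it in UTC?"},
		{Role: "assistant", ToolCalls: []ToolCall{{ID: "call_1", Name: "get_time", Args: `{"zone":"UTC"}`}}},
		{Role: "tool", ToolCallID: "call_1", Content: "12:00"},
	})
	out, _ := json.MarshalIndent(contents, "", "  ")
	fmt.Println("system instruction:", system)
	fmt.Println(string(out))
}
```

The sketch resolves a tool result's function name by remembering the call ID from the preceding assistant message, since OpenAI tool messages carry the ID rather than the name.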

Temperature, top-p, token limit, stops, tools, and structured-response settings are translated to Gemini generation configuration where supported. Unknown OpenAI-only fields are not guaranteed to affect Gemini.
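As a rough sketch of that parameter translation, the example below copies the supported controls into a Gemini-style generation configuration. `ChatParams`, `WantJSON`, and `toGenerationConfig` are hypothetical names chosen for illustration; the JSON tags on `GenerationConfig` follow the public Gemini REST field names. How the wrapper itself carries the schema is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChatParams holds the OpenAI-style controls relevant to this sketch.
// Pointers distinguish "not set" from an explicit zero.
type ChatParams struct {
	Temperature *float64
	TopP        *float64
	MaxTokens   *int
	Stop        []string
	WantJSON    bool // true when response_format asks for JSON output
}

// GenerationConfig uses Gemini REST field names.
type GenerationConfig struct {
	Temperature      *float64 `json:"temperature,omitempty"`
	TopP             *float64 `json:"topP,omitempty"`
	MaxOutputTokens  *int     `json:"maxOutputTokens,omitempty"`
	StopSequences    []string `json:"stopSequences,omitempty"`
	ResponseMimeType string   `json:"responseMimeType,omitempty"`
}

// toGenerationConfig copies supported controls and leaves unset ones out.
func toGenerationConfig(p ChatParams) GenerationConfig {
	cfg := GenerationConfig{
		Temperature:     p.Temperature,
		TopP:            p.TopP,
		MaxOutputTokens: p.MaxTokens,
		StopSequences:   p.Stop,
	}
	if p.WantJSON {
		cfg.ResponseMimeType = "application/json"
	}
	return cfg
}

func main() {
	temp, limit := 0.2, 512
	cfg := toGenerationConfig(ChatParams{
		Temperature: &temp,
		MaxTokens:   &limit,
		Stop:        []string{"END"},
		WantJSON:    true,
	})
	out, _ := json.Marshal(cfg)
	fmt.Println(string(out))
}
```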

For a non-streaming completion, the wrapper turns Gemini candidate text, tool calls, finish reason, and usage metadata into an OpenAI-style response. Prompt, completion, and total token counts are preserved when Gemini reports them.
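The usage and finish-reason part of that conversion might look like the sketch below. The Gemini field names (`promptTokenCount`, `candidatesTokenCount`, `totalTokenCount`) and the OpenAI field names are taken from the respective public APIs; the finish-reason mapping and the helper names `convertUsage` and `convertFinishReason` are assumptions for illustration, not the wrapper's confirmed behavior.

```go
package main

import "fmt"

// GeminiUsage mirrors the usageMetadata object in a Gemini response.
type GeminiUsage struct {
	PromptTokenCount     int `json:"promptTokenCount"`
	CandidatesTokenCount int `json:"candidatesTokenCount"`
	TotalTokenCount      int `json:"totalTokenCount"`
}

// OpenAIUsage mirrors the usage object in a chat-completion response.
type OpenAIUsage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

// convertUsage returns nil when Gemini reported no usage metadata,
// so the caller never sees invented token counts.
func convertUsage(u *GeminiUsage) *OpenAIUsage {
	if u == nil {
		return nil
	}
	return &OpenAIUsage{
		PromptTokens:     u.PromptTokenCount,
		CompletionTokens: u.CandidatesTokenCount,
		TotalTokens:      u.TotalTokenCount,
	}
}

// convertFinishReason maps a Gemini finish reason to an OpenAI one.
// The mapping is an assumption made for this sketch.
func convertFinishReason(reason string, hasToolCalls bool) string {
	if hasToolCalls {
		return "tool_calls"
	}
	switch reason {
	case "MAX_TOKENS":
		return "length"
	case "SAFETY":
		return "content_filter"
	default:
		return "stop"
	}
}

func main() {
	usage := convertUsage(&GeminiUsage{PromptTokenCount: 12, CandidatesTokenCount: 30, TotalTokenCount: 42})
	fmt.Printf("%+v %s\n", *usage, convertFinishReason("MAX_TOKENS", false))
}
```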

Model selection

If a request names a Gemini model, that identifier is sent upstream. A model name outside the accepted Gemini naming convention falls back to the configured default rather than being forwarded as an arbitrary provider model. News-Analyzer maintains its own ordered model preference and can retry with another configured model when a stage-level call fails.
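A minimal sketch of the fallback rule, assuming the accepted naming convention is simply a `gemini-` prefix (the document does not state the exact rule). `resolveModel` is a hypothetical helper, and the model identifiers are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveModel forwards a Gemini-style name and falls back to the
// configured default for anything else.
// Assumption: "Gemini naming convention" means a "gemini-" prefix.
func resolveModel(requested, defaultModel string) string {
	name := strings.TrimSpace(requested)
	if strings.HasPrefix(name, "gemini-") {
		return name
	}
	return defaultModel
}

func main() {
	const def = "gemini-3-flash-preview" // illustrative default
	for _, req := range []string{"gemini-2.5-flash", "gpt-4o", ""} {
		fmt.Printf("%q -> %s\n", req, resolveModel(req, def))
	}
}
```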

The model-list route describes configured models as owned by Google and reports context length, maximum completion tokens, modalities, and supported parameters from the local registry. This is configuration metadata, not a live query to Google.

Key rotation

Keys are held in memory in their file order. Each request asks the key manager for the next available key, advancing the index in round-robin order. When Gemini returns a rate-limit indication, the wrapper marks that key unavailable for one minute and tries another key.

The key manager is protected by a mutex, so concurrent requests cannot corrupt the rotation index or exhausted-key map. Once a key’s one-minute exclusion expires, it becomes eligible again. If every key is temporarily excluded, no key is available and the request fails.
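The rotation and exclusion behavior described above can be sketched as a small mutex-guarded type. This is an illustration under assumptions, not the wrapper's source: the names `KeyManager`, `Next`, `MarkRateLimited`, and `ErrNoKey` are hypothetical, and keys are tracked by value rather than by index.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrNoKey is returned when every key is temporarily excluded.
var ErrNoKey = errors.New("no key available")

// KeyManager hands out keys in round-robin order and skips keys that
// were recently rate limited.
type KeyManager struct {
	mu        sync.Mutex
	keys      []string
	next      int
	exhausted map[string]time.Time // key -> time it becomes eligible again
	cooldown  time.Duration
}

func NewKeyManager(keys []string) *KeyManager {
	return &KeyManager{keys: keys, exhausted: map[string]time.Time{}, cooldown: time.Minute}
}

// Next returns the next eligible key, advancing the rotation index.
func (m *KeyManager) Next() (string, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for i := 0; i < len(m.keys); i++ {
		k := m.keys[m.next]
		m.next = (m.next + 1) % len(m.keys)
		until, excluded := m.exhausted[k]
		if !excluded {
			return k, nil
		}
		if !time.Now().Before(until) {
			delete(m.exhausted, k) // exclusion expired, key is eligible again
			return k, nil
		}
	}
	return "", ErrNoKey
}

// MarkRateLimited excludes a key for the cooldown period.
func (m *KeyManager) MarkRateLimited(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.exhausted[key] = time.Now().Add(m.cooldown)
}

func main() {
	m := NewKeyManager([]string{"key-a", "key-b"}) // placeholders, not real keys
	first, _ := m.Next()
	m.MarkRateLimited(first) // pretend Gemini rate-limited this key
	second, _ := m.Next()
	third, _ := m.Next() // key-a is excluded, so key-b is handed out again
	fmt.Println(first, second, third)
}
```

Holding the lock for the whole scan keeps the index and the exclusion map consistent with each other, at the cost of serializing key selection, which is cheap relative to an upstream call.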

The wrapper makes at most five upstream attempts in its generation paths. Fewer than five distinct keys may be usable, and retry behavior depends on the error category.

Retries and errors

Rate-limit responses rotate the key. Temporary Gemini server failures (HTTP 500, 502, and 503) are retried with short, increasing delays in the implemented paths. Other upstream errors are returned to the caller and are not treated as a key-capacity problem. After attempts are exhausted, the caller receives an error rather than a fabricated model response.
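The error categories can be pictured as a small classification step inside a bounded loop. The sketch below assumes a rate-limit indication is HTTP 429, that server-error retries reuse the same key, and that the delay grows linearly; none of these details are confirmed by this document. `classify` and `generate` are hypothetical names, and the upstream call is injected as a function so the example runs without a network.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const maxAttempts = 5

type action int

const (
	succeed action = iota
	rotateKey
	retryAfterDelay
	fail
)

// classify maps an upstream HTTP status to a retry decision.
// Assumption: a rate-limit indication is HTTP 429.
func classify(status int) action {
	switch {
	case status >= 200 && status < 300:
		return succeed
	case status == 429:
		return rotateKey
	case status == 500 || status == 502 || status == 503:
		return retryAfterDelay
	default:
		return fail
	}
}

// generate makes up to maxAttempts upstream calls.
func generate(
	nextKey func() (string, error),
	markLimited func(key string),
	call func(key string) (status int, body string),
	baseDelay time.Duration,
) (string, error) {
	key, err := nextKey()
	if err != nil {
		return "", err
	}
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status, body := call(key)
		switch classify(status) {
		case succeed:
			return body, nil
		case rotateKey:
			markLimited(key)
			if key, err = nextKey(); err != nil {
				return "", err
			}
		case retryAfterDelay:
			if attempt < maxAttempts {
				time.Sleep(time.Duration(attempt) * baseDelay) // linearly increasing delay
			}
		default:
			return "", fmt.Errorf("upstream status %d", status)
		}
	}
	return "", errors.New("upstream attempts exhausted")
}

func main() {
	responses := []int{429, 503, 200} // scripted upstream statuses
	i := 0
	body, err := generate(
		func() (string, error) { return "placeholder-key", nil },
		func(string) {},
		func(string) (int, string) { s := responses[i]; i++; return s, "hello" },
		10*time.Millisecond,
	)
	fmt.Println(body, err, "attempts:", i)
}
```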

Because key-manager exhaustion is currently surfaced as a generic key-selection error in some paths, the exact downstream status is not a perfect indicator of whether the cause was configuration or temporary rate limiting. Logs provide the more useful diagnosis.

Structured responses and repair

When the OpenAI-compatible request asks for a JSON schema, the wrapper requests JSON from Gemini and retains the schema for post-processing. If Gemini returns parseable JSON, repair walks it against the expected schema.

Repair keeps direct field matches, recognizes small spelling or separator differences, can wrap flat child fields into an expected object or array, can place a flat field into a matching deeper schema path, recursively repairs object and array members, removes fields outside the expected schema, and supplies zero-like values for missing required fields.
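One of these steps, matching field names that differ only in case or separators and dropping fields outside the schema, could be sketched as below. This is a deliberately narrow illustration: it does not handle spelling differences, wrapping, or deeper path placement, and `normalize` and `alignFields` are hypothetical helpers, not the wrapper's functions. The field names in the example are made up.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalize lowercases a field name and strips common separators.
func normalize(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(name) {
		if r != '_' && r != '-' && r != ' ' {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// alignFields keeps direct matches, renames near matches to the expected
// name, and drops fields that match nothing in the expected list.
func alignFields(obj map[string]any, expected []string) map[string]any {
	byNorm := make(map[string]string, len(expected))
	out := make(map[string]any, len(expected))
	for _, name := range expected {
		byNorm[normalize(name)] = name
		if v, ok := obj[name]; ok {
			out[name] = v // a direct match wins over any near match
		}
	}
	keys := make([]string, 0, len(obj))
	for k := range obj {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic choice between competing near matches
	for _, k := range keys {
		target, ok := byNorm[normalize(k)]
		if !ok {
			continue // outside the expected schema: dropped
		}
		if _, taken := out[target]; !taken {
			out[target] = obj[k]
		}
	}
	return out
}

func main() {
	got := alignFields(
		map[string]any{"Source-Name": "wire", "summaryText": "ok", "debug": true},
		[]string{"source_name", "summary_text"},
	)
	fmt.Println(got)
}
```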

The zero-like values are empty text, zero numbers, false, empty objects, or an array containing an empty value according to the schema type. This can make deserialization succeed, but it does not make missing model content factually correct. Callers must use validation, data-quality fields, and confidence rules rather than interpreting a repaired zero as observed evidence.
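To make the zero-like values concrete, the sketch below derives them from a simplified schema and fills missing required fields. The `Schema` type, `zeroValue`, and `fillRequired` are hypothetical and cover only a small subset of JSON Schema; the field names in the example are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Schema is a simplified subset of JSON Schema used for this sketch.
type Schema struct {
	Type       string
	Properties map[string]*Schema
	Required   []string
	Items      *Schema
}

// zeroValue returns the zero-like value for a schema type.
func zeroValue(s *Schema) any {
	if s == nil {
		return nil
	}
	switch s.Type {
	case "string":
		return ""
	case "number", "integer":
		return 0
	case "boolean":
		return false
	case "object":
		return map[string]any{}
	case "array":
		return []any{zeroValue(s.Items)} // one empty element of the item type
	default:
		return nil
	}
}

// fillRequired adds zero-like values for missing required fields and
// recurses into nested objects.
func fillRequired(obj map[string]any, s *Schema) {
	for _, name := range s.Required {
		if _, ok := obj[name]; !ok {
			obj[name] = zeroValue(s.Properties[name])
		}
	}
	for name, child := range s.Properties {
		nested, ok := obj[name].(map[string]any)
		if ok && child != nil && child.Type == "object" {
			fillRequired(nested, child)
		}
	}
}

func main() {
	schema := &Schema{
		Type:     "object",
		Required: []string{"title", "score", "tags"},
		Properties: map[string]*Schema{
			"title": {Type: "string"},
			"score": {Type: "number"},
			"tags":  {Type: "array", Items: &Schema{Type: "string"}},
		},
	}
	var doc map[string]any
	if err := json.Unmarshal([]byte(`{"title":"example"}`), &doc); err != nil {
		fmt.Println("not JSON, returning original text")
		return
	}
	fillRequired(doc, schema)
	out, _ := json.Marshal(doc)
	fmt.Println(string(out)) // "score" and "tags" are placeholders, not model output
}
```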

If the model text is not valid JSON at all, or no schema is available, repair returns the original text. It is not a general natural-language-to-JSON converter.

Streaming

OpenAI-compatible streaming emits server-sent events shaped as chat-completion chunks: an initial assistant role, incremental content or tool-call data, a finish event, usage when available, and a final done marker.
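The event sequence can be sketched as follows. The chunk field names and the `data: [DONE]` marker follow the public OpenAI streaming format; `sseEvent` and `streamEvents` are hypothetical helpers, and the sketch omits tool-call deltas and the optional usage chunk. A real handler would write each event to the response and flush it rather than collecting strings.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type Delta struct {
	Role    string `json:"role,omitempty"`
	Content string `json:"content,omitempty"`
}

type Choice struct {
	Index        int     `json:"index"`
	Delta        Delta   `json:"delta"`
	FinishReason *string `json:"finish_reason"` // null until the final chunk
}

type Chunk struct {
	ID      string   `json:"id"`
	Object  string   `json:"object"`
	Created int64    `json:"created"`
	Model   string   `json:"model"`
	Choices []Choice `json:"choices"`
}

// sseEvent frames one chunk as a server-sent event.
func sseEvent(c Chunk) string {
	b, _ := json.Marshal(c) // these plain structs always marshal
	return "data: " + string(b) + "\n\n"
}

// streamEvents builds the ordered events for a text-only completion:
// role, content pieces, finish, then the done marker.
func streamEvents(id, model string, created int64, pieces []string) []string {
	chunk := func(d Delta, finish *string) string {
		return sseEvent(Chunk{
			ID: id, Object: "chat.completion.chunk", Created: created, Model: model,
			Choices: []Choice{{Index: 0, Delta: d, FinishReason: finish}},
		})
	}
	events := []string{chunk(Delta{Role: "assistant"}, nil)}
	for _, p := range pieces {
		events = append(events, chunk(Delta{Content: p}, nil))
	}
	stop := "stop"
	events = append(events, chunk(Delta{}, &stop), "data: [DONE]\n\n")
	return events
}

func main() {
	for _, e := range streamEvents("chatcmpl-1", "gemini-2.5-flash", time.Now().Unix(), []string{"Hel", "lo"}) {
		fmt.Print(e)
	}
}
```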

The Gemini-native streaming route relays Gemini’s event stream without converting it into OpenAI chunks. The server disables its HTTP write timeout so that long responses are not cut off. Clients must still implement their own idle and total timeouts.

Configuration files

The key file is a JSON array of secret strings. It must not be committed. The example key file contains placeholders only.
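Parsing such a file could look like the sketch below. `parseKeys` is a hypothetical helper, and treating a "valid" key as any non-blank string is an assumption; the wrapper may apply stricter checks. The example reads from an in-memory byte slice so it runs without a file on disk.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// parseKeys decodes a JSON array of strings and drops blank entries.
// It returns an error when no usable key remains.
func parseKeys(data []byte) ([]string, error) {
	var raw []string
	if err := json.Unmarshal(data, &raw); err != nil {
		return nil, fmt.Errorf("decode key file: %w", err)
	}
	keys := make([]string, 0, len(raw))
	for _, k := range raw {
		if k = strings.TrimSpace(k); k != "" {
			keys = append(keys, k)
		}
	}
	if len(keys) == 0 {
		return nil, errors.New("key file contains no usable keys")
	}
	return keys, nil
}

func main() {
	keys, err := parseKeys([]byte(`["placeholder-key-1", "  ", "placeholder-key-2"]`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(keys), "keys loaded") // log the count, never the keys themselves
}
```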

The model file defines the default model and model list. The checked-in example currently names Gemini 3 Flash Preview as default and includes Gemini 2.5 Flash and Gemini 3 Flash Preview. Deployed availability depends on the Google account and API at runtime; the wrapper does not verify every configured model during startup.

Security and operational behavior

The service does not implement end-user authentication. It should be reachable only by trusted services or protected by the deployment network and reverse proxy. Anyone who can reach it can consume configured Gemini quota.

Keys remain inside the wrapper and are never returned by the model-list API. Logs should identify route, model, retries, and errors without printing keys, prompts containing sensitive user data, or full provider responses.

Health monitoring should cover process reachability plus a controlled model request, because the process can be running while every key is rate limited or the configured default model is unavailable. Quota alarms should distinguish a single exhausted key from all-key exhaustion.

Boundaries callers must understand

Schema repair enforces shape, not truth. Key rotation improves throughput; it does not provide unlimited quota. Five attempts do not guarantee five distinct keys. The model registry is local configuration, not live capability discovery. Native and OpenAI-compatible streams have different event shapes. LLM-Wrapper success means the model call completed; it does not mean the investment analysis passed News-Analyzer’s validation.
