# How to Monitor a Stellar Validator or Horizon Node in Production
Running Stellar infrastructure in production requires robust monitoring. Undetected issues lead to missed transactions, consensus failures, and degraded service. This guide covers everything you need to monitor validators and Horizon nodes effectively.
## Monitoring Architecture
A complete monitoring stack for Stellar infrastructure includes:
```
┌─────────────────────────────────────────────────────────────┐
│                       Alerting Layer                        │
│               PagerDuty / Slack / Email / SMS               │
└─────────────────────────────────────────────────────────────┘
                               ▲
┌─────────────────────────────────────────────────────────────┐
│                         Prometheus                          │
│                Metrics Collection & Storage                 │
└─────────────────────────────────────────────────────────────┘
        ▲                     ▲                     ▲
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Stellar Core  │     │    Horizon    │     │  Soroban RPC  │
│   Exporter    │     │   Exporter    │     │   Exporter    │
└───────────────┘     └───────────────┘     └───────────────┘
        ▲                     ▲                     ▲
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Stellar Core  │     │    Horizon    │     │  Soroban RPC  │
│     Node      │     │    Server     │     │    Server     │
└───────────────┘     └───────────────┘     └───────────────┘
```

## Setting Up Prometheus + Grafana
### Docker Compose Setup
```yaml
# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_USERS_ALLOW_SIGN_UP=false

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

volumes:
  prometheus-data:
  grafana-data:
```

### Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  # Stellar Core metrics
  - job_name: 'stellar-core'
    static_configs:
      - targets: ['stellar-core:11626']
    metrics_path: /metrics

  # Horizon metrics
  - job_name: 'horizon'
    static_configs:
      - targets: ['horizon:8000']
    metrics_path: /metrics

  # Soroban RPC metrics (if running)
  - job_name: 'soroban-rpc'
    static_configs:
      - targets: ['soroban-rpc:8001']
    metrics_path: /metrics

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # PostgreSQL exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
```

## Critical Metrics to Monitor
### Stellar Core Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `stellar_core_ledger_age_seconds` | Time since last ledger close | > 30 seconds |
| `stellar_core_ledger_num` | Current ledger sequence | Lag > 10 from network |
| `stellar_core_herder_pending_txs` | Pending transactions | > 1000 |
| `stellar_core_overlay_inbound_connections` | Peer connections | < 5 |
| `stellar_core_scp_slot_externalized` | Consensus participation | Missing slots |
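The thresholds in the table can be expressed directly as code. The sketch below assumes the metric values have already been scraped into a plain dict (for example via Prometheus's HTTP query API); the helper names are illustrative, not part of any Stellar tooling:

```python
# Thresholds from the table above: metric -> (breach predicate, message).
CORE_THRESHOLDS = {
    "stellar_core_ledger_age_seconds": (lambda v: v > 30, "ledger is stale"),
    "stellar_core_herder_pending_txs": (lambda v: v > 1000, "pending tx pileup"),
    "stellar_core_overlay_inbound_connections": (lambda v: v < 5, "too few peers"),
}

def evaluate_core_metrics(samples: dict) -> list:
    """Return one alert string per sample that breaches its threshold."""
    alerts = []
    for metric, (breached, message) in CORE_THRESHOLDS.items():
        value = samples.get(metric)
        if value is not None and breached(value):
            alerts.append(f"{metric}: {message} (value={value})")
    return alerts

# Example: healthy ledger age and tx queue, but too few inbound peers
print(evaluate_core_metrics({
    "stellar_core_ledger_age_seconds": 4,
    "stellar_core_herder_pending_txs": 12,
    "stellar_core_overlay_inbound_connections": 3,
}))  # one alert, for low peer count
```

This is the same logic the Prometheus alert rules below encode declaratively; a script like this is handy for ad-hoc checks from a shell.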
### Horizon Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `horizon_ingest_ledger_ingested` | Last ingested ledger | Lag > 5 from Core |
| `horizon_request_duration_seconds` | API response times | P95 > 2s |
| `horizon_db_query_duration_seconds` | Database query times | P95 > 500ms |
| `horizon_requests_total` | Request count | Sudden drops |
| `horizon_state_verifier_ok` | State verification | != 1 |
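Ingestion lag can also be checked without Prometheus: Horizon's root document reports both `core_latest_ledger` and `history_latest_ledger`. A minimal sketch (the URL and the lag threshold are illustrative assumptions):

```python
import json
from urllib.request import urlopen

HORIZON_URL = "http://localhost:8000"  # adjust for your deployment

def ingestion_lag(root: dict) -> int:
    """Ledgers Horizon's ingestion is behind the Core node it follows."""
    return int(root["core_latest_ledger"]) - int(root["history_latest_ledger"])

def check_horizon(url: str = HORIZON_URL, max_lag: int = 5) -> bool:
    """Fetch Horizon's root document and compare its ledger fields."""
    with urlopen(url, timeout=5) as resp:
        root = json.load(resp)
    lag = ingestion_lag(root)
    if lag > max_lag:
        print(f"WARNING: Horizon ingestion lag is {lag} ledgers")
        return False
    return True
```

A check like this makes a good liveness probe because it exercises Horizon's HTTP stack and its view of Core in one request.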
### System Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `node_cpu_seconds_total` | CPU usage | > 80% sustained |
| `node_memory_MemAvailable_bytes` | Available memory | < 10% |
| `node_filesystem_avail_bytes` | Disk space | < 15% |
| `node_disk_io_time_seconds_total` | Disk I/O | High latency |
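The memory and disk rows reduce to "percent available" comparisons. A small sketch mirroring those thresholds (the function names are illustrative):

```python
def percent_available(available: int, total: int) -> float:
    """Available capacity as a percentage of total."""
    return available / total * 100

def low_memory(mem_available_bytes: int, mem_total_bytes: int) -> bool:
    # Mirrors: node_memory_MemAvailable_bytes / MemTotal < 10%
    return percent_available(mem_available_bytes, mem_total_bytes) < 10

def low_disk(fs_avail_bytes: int, fs_size_bytes: int) -> bool:
    # Mirrors: node_filesystem_avail_bytes / size < 15%
    return percent_available(fs_avail_bytes, fs_size_bytes) < 15

# 8 GiB free of 64 GiB RAM is 12.5%, above the 10% floor
print(low_memory(8 * 2**30, 64 * 2**30))  # False
```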
## Alert Rules
### Stellar Core Alerts
```yaml
# stellar-core-alerts.yml
groups:
  - name: stellar-core
    rules:
      # Ledger age alert - critical for validators
      - alert: StellarCoreLedgerStale
        expr: stellar_core_ledger_age_seconds > 30
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Stellar Core ledger is stale"
          description: "Ledger age is {{ $value }}s, expected < 30s"

      # Peer connection alert
      - alert: StellarCoreLowPeers
        expr: stellar_core_overlay_inbound_connections < 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer connections"
          description: "Only {{ $value }} inbound peer connections"

      # Pending transactions pileup
      - alert: StellarCorePendingTxsHigh
        expr: stellar_core_herder_pending_txs > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High pending transactions"
          description: "{{ $value }} transactions pending"

      # SCP consensus issues
      - alert: StellarCoreConsensusIssue
        expr: increase(stellar_core_scp_slot_externalized[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Stellar Core not participating in consensus"
          description: "No new slots externalized in 5 minutes"

      # Validator not in quorum
      - alert: StellarCoreNotInQuorum
        expr: stellar_core_scp_local_node_not_in_quorum == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Validator not in quorum"
          description: "This node is not part of the quorum"
```

### Horizon Alerts
```yaml
# horizon-alerts.yml
groups:
  - name: horizon
    rules:
      # Ingestion lag
      - alert: HorizonIngestionLag
        expr: stellar_core_ledger_num - horizon_ingest_ledger_ingested > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Horizon ingestion lagging"
          description: "{{ $value }} ledgers behind Core"

      # High API latency
      - alert: HorizonHighLatency
        expr: histogram_quantile(0.95, sum(rate(horizon_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency"
          description: "P95 latency is {{ $value }}s"

      # API errors
      - alert: HorizonHighErrorRate
        expr: rate(horizon_requests_total{status=~"5.."}[5m]) / rate(horizon_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High API error rate"
          description: "{{ $value | humanizePercentage }} of requests failing"

      # Database connection issues
      - alert: HorizonDbConnectionIssues
        expr: horizon_db_query_duration_seconds{quantile="0.95"} > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow database queries"
          description: "P95 query time is {{ $value }}s"
```

### System Alerts
```yaml
# system-alerts.yml
groups:
  - name: system
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is {{ $value | humanize }}%"

      - alert: LowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low available memory"
          description: "Only {{ $value | humanize }}% memory available"

      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value | humanize }}% disk space available"
```

## RPC Health Checks
Implement active health checks beyond metrics:
```bash
#!/bin/bash
# health-check.sh

HORIZON_URL="http://localhost:8000"
SOROBAN_URL="http://localhost:8001"
CORE_URL="http://localhost:11626"

# Check Horizon
horizon_health=$(curl -s -o /dev/null -w "%{http_code}" "$HORIZON_URL/health")
if [ "$horizon_health" != "200" ]; then
    echo "CRITICAL: Horizon health check failed"
    exit 2
fi

# Check Horizon ledger sync
horizon_info=$(curl -s "$HORIZON_URL")
core_ledger=$(curl -s "$CORE_URL/info" | jq -r '.info.ledger.num')
horizon_ledger=$(echo "$horizon_info" | jq -r '.history_latest_ledger')
lag=$((core_ledger - horizon_ledger))
if [ "$lag" -gt 10 ]; then
    echo "WARNING: Horizon ingestion lag is $lag ledgers"
    exit 1
fi

# Check Soroban RPC health
soroban_health=$(curl -s -X POST "$SOROBAN_URL" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}' | jq -r '.result.status')
if [ "$soroban_health" != "healthy" ]; then
    echo "CRITICAL: Soroban RPC health check failed"
    exit 2
fi

# Check Stellar Core state
core_state=$(curl -s "$CORE_URL/info" | jq -r '.info.state')
if [ "$core_state" != "Synced!" ]; then
    echo "WARNING: Stellar Core state is $core_state"
    exit 1
fi

echo "OK: All services healthy"
exit 0
```

## Ledger Lag Detection
Ledger lag indicates your node is falling behind the network:
```python
# ledger_lag_monitor.py
import time

import requests
from prometheus_client import Gauge, start_http_server

CORE_URL = "http://localhost:11626"
PUBLIC_HORIZON = "https://horizon.stellar.org"

ledger_lag = Gauge('stellar_ledger_lag', 'Ledger sequence lag from network')
local_ledger = Gauge('stellar_local_ledger', 'Local ledger sequence')
network_ledger = Gauge('stellar_network_ledger', 'Network ledger sequence')

def get_local_ledger():
    try:
        resp = requests.get(f"{CORE_URL}/info", timeout=5)
        return resp.json()['info']['ledger']['num']
    except Exception as e:
        print(f"Error getting local ledger: {e}")
        return None

def get_network_ledger():
    try:
        resp = requests.get(PUBLIC_HORIZON, timeout=5)
        return resp.json()['history_latest_ledger']
    except Exception as e:
        print(f"Error getting network ledger: {e}")
        return None

def monitor():
    while True:
        local = get_local_ledger()
        network = get_network_ledger()
        if local is not None and network is not None:
            lag = network - local
            ledger_lag.set(lag)
            local_ledger.set(local)
            network_ledger.set(network)
            if lag > 100:
                print(f"CRITICAL: Ledger lag is {lag}")
            elif lag > 10:
                print(f"WARNING: Ledger lag is {lag}")
        time.sleep(10)

if __name__ == '__main__':
    start_http_server(9101)
    monitor()
```

## Grafana Dashboards
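Grafana only renders these panels once it knows where Prometheus lives. The compose file earlier mounts `./grafana/datasources` for provisioning; a minimal datasource file might look like this (the internal URL assumes the compose service name `prometheus`):

```yaml
# grafana/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```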
### Stellar Overview Dashboard
```json
{
  "dashboard": {
    "title": "Stellar Infrastructure Overview",
    "panels": [
      {
        "title": "Current Ledger",
        "type": "stat",
        "targets": [{
          "expr": "stellar_core_ledger_num"
        }]
      },
      {
        "title": "Ledger Age",
        "type": "gauge",
        "targets": [{
          "expr": "stellar_core_ledger_age_seconds"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "yellow", "value": 10 },
                { "color": "red", "value": 30 }
              ]
            }
          }
        }
      },
      {
        "title": "Horizon Request Rate",
        "type": "graph",
        "targets": [{
          "expr": "rate(horizon_requests_total[5m])"
        }]
      },
      {
        "title": "API Latency (P95)",
        "type": "graph",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum(rate(horizon_request_duration_seconds_bucket[5m])) by (le))"
        }]
      },
      {
        "title": "Ingestion Lag",
        "type": "stat",
        "targets": [{
          "expr": "stellar_core_ledger_num - horizon_ingest_ledger_ingested"
        }]
      },
      {
        "title": "Pending Transactions",
        "type": "graph",
        "targets": [{
          "expr": "stellar_core_herder_pending_txs"
        }]
      }
    ]
  }
}
```

## Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        severity: critical
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#stellar-ops'
        title: ':warning: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
```

## Runbook Examples
### High Ledger Age

**Alert:** `StellarCoreLedgerStale`
**Impact:** Node is not processing new ledgers
**Steps:**

1. Check recent logs for errors: `docker logs stellar-core --tail 100`
2. Confirm the node's sync state: `curl -s localhost:11626/info | jq '.info.state'`
3. Inspect peer connectivity: `curl -s localhost:11626/peers`
4. If the node is stuck, restart it: `docker restart stellar-core`

### Horizon Ingestion Lag
**Alert:** `HorizonIngestionLag`
**Impact:** API returning stale data
**Steps:**

1. Check Horizon logs for ingestion errors: `docker logs horizon --tail 100`
2. Check for disk I/O pressure: `iostat -x 1 5`
3. Reingest missing ledgers if needed: `horizon db reingest range START END`

## Conclusion
Effective monitoring is essential for running Stellar infrastructure in production. Key takeaways:

- Scrape Stellar Core, Horizon, and system metrics into Prometheus and visualize them in Grafana.
- Alert on ledger age, ingestion lag, peer count, and API latency and error rates before your users notice.
- Supplement passive metrics with active health checks and network-wide ledger lag detection.
- Keep runbooks next to your alerts so on-call responses are fast and repeatable.
For teams that want reliability without the operational overhead, consider a managed provider like LumenQuery: we handle the monitoring so you can focus on building.

*Need production-ready Stellar infrastructure without the ops burden? Try LumenQuery: fully managed, with built-in monitoring and a 99.9% uptime SLA.*