Operations

How to Monitor a Stellar Validator or Horizon Node in Production

Running Stellar infrastructure in production requires robust monitoring: undetected issues lead to missed transactions, consensus failures, and degraded service. This guide walks through a complete monitoring setup for validators and Horizon nodes, from metrics collection to alerting and runbooks.

Monitoring Architecture

A complete monitoring stack for Stellar infrastructure includes:

┌─────────────────────────────────────────────────────────────┐
│                     Alerting Layer                          │
│  PagerDuty / Slack / Email / SMS                           │
└─────────────────────────────────────────────────────────────┘
                              ▲
┌─────────────────────────────────────────────────────────────┐
│                     Prometheus                              │
│  Metrics Collection & Storage                               │
└─────────────────────────────────────────────────────────────┘
        ▲                ▲                    ▲
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Stellar Core  │ │    Horizon    │ │  Soroban RPC  │
│   Exporter    │ │   Exporter    │ │   Exporter    │
└───────────────┘ └───────────────┘ └───────────────┘
        ▲                ▲                    ▲
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Stellar Core  │ │    Horizon    │ │  Soroban RPC  │
│   Node        │ │    Server     │ │   Server      │
└───────────────┘ └───────────────┘ └───────────────┘

Setting Up Prometheus + Grafana

Docker Compose Setup

# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_USERS_ALLOW_SIGN_UP=false

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

volumes:
  prometheus-data:
  grafana-data:

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  # Stellar Core metrics
  - job_name: 'stellar-core'
    static_configs:
      - targets: ['stellar-core:11626']
    metrics_path: /metrics

  # Horizon metrics
  - job_name: 'horizon'
    static_configs:
      - targets: ['horizon:8000']
    metrics_path: /metrics

  # Soroban RPC metrics (if running)
  - job_name: 'soroban-rpc'
    static_configs:
      - targets: ['soroban-rpc:8001']
    metrics_path: /metrics

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # PostgreSQL exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

Critical Metrics to Monitor

Stellar Core Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `stellar_core_ledger_age_seconds` | Time since last ledger close | > 30 seconds |
| `stellar_core_ledger_num` | Current ledger sequence | Lag > 10 from network |
| `stellar_core_herder_pending_txs` | Pending transactions | > 1000 |
| `stellar_core_overlay_inbound_connections` | Peer connections | < 5 |
| `stellar_core_scp_slot_externalized` | Consensus participation | Missing slots |

Horizon Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `horizon_ingest_ledger_ingested` | Last ingested ledger | Lag > 5 from Core |
| `horizon_request_duration_seconds` | API response times | P95 > 2s |
| `horizon_db_query_duration_seconds` | Database query times | P95 > 500ms |
| `horizon_requests_total` | Request count | Sudden drops |
| `horizon_state_verifier_ok` | State verification | != 1 |

System Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `node_cpu_seconds_total` | CPU usage | > 80% sustained |
| `node_memory_MemAvailable_bytes` | Available memory | < 10% |
| `node_filesystem_avail_bytes` | Disk space | < 15% |
| `node_disk_io_time_seconds_total` | Disk I/O | High latency |
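The thresholds in these tables can be encoded once and reused by scripts and dashboards alike. A minimal sketch (the metric names and limits mirror the tables above; extend the map with your own values):

```python
# Map each metric to (comparison, threshold), as listed in the tables above.
THRESHOLDS = {
    "stellar_core_ledger_age_seconds": ("gt", 30),
    "stellar_core_overlay_inbound_connections": ("lt", 5),
    "stellar_core_herder_pending_txs": ("gt", 1000),
}

def breaches(metric: str, value: float) -> bool:
    """Return True if `value` violates the configured threshold."""
    op, limit = THRESHOLDS[metric]
    return value > limit if op == "gt" else value < limit

print(breaches("stellar_core_ledger_age_seconds", 45))          # → True
print(breaches("stellar_core_overlay_inbound_connections", 8))  # → False
```

Keeping the thresholds in one place means your ad-hoc checks and your Prometheus rules are less likely to drift apart.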

Alert Rules

Stellar Core Alerts

# stellar-core-alerts.yml
groups:
  - name: stellar-core
    rules:
      # Ledger age alert - critical for validators
      - alert: StellarCoreLedgerStale
        expr: stellar_core_ledger_age_seconds > 30
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Stellar Core ledger is stale"
          description: "Ledger age is {{ $value }}s, expected < 30s"

      # Peer connection alert
      - alert: StellarCoreLowPeers
        expr: stellar_core_overlay_inbound_connections < 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer connections"
          description: "Only {{ $value }} inbound peer connections"

      # Pending transactions pileup
      - alert: StellarCorePendingTxsHigh
        expr: stellar_core_herder_pending_txs > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High pending transactions"
          description: "{{ $value }} transactions pending"

      # SCP consensus issues
      - alert: StellarCoreConsensusIssue
        expr: increase(stellar_core_scp_slot_externalized[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Stellar Core not participating in consensus"
          description: "No new slots externalized in 5 minutes"

      # Validator not in quorum
      - alert: StellarCoreNotInQuorum
        expr: stellar_core_scp_local_node_not_in_quorum == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Validator not in quorum"
          description: "This node is not part of the quorum"

Horizon Alerts

# horizon-alerts.yml
groups:
  - name: horizon
    rules:
      # Ingestion lag
      - alert: HorizonIngestionLag
        expr: stellar_core_ledger_num - horizon_ingest_ledger_ingested > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Horizon ingestion lagging"
          description: "{{ $value }} ledgers behind Core"

      # High API latency
      - alert: HorizonHighLatency
        expr: histogram_quantile(0.95, sum(rate(horizon_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency"
          description: "P95 latency is {{ $value }}s"

      # API errors
      - alert: HorizonHighErrorRate
        expr: rate(horizon_requests_total{status=~"5.."}[5m]) / rate(horizon_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High API error rate"
          description: "{{ $value | humanizePercentage }} of requests failing"

      # Database connection issues
      - alert: HorizonDbConnectionIssues
        expr: horizon_db_query_duration_seconds{quantile="0.95"} > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow database queries"
          description: "P95 query time is {{ $value }}s"
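The `histogram_quantile(0.95, ...)` expression in the latency alert interpolates within cumulative histogram buckets. A standalone sketch of the same idea (the bucket bounds and counts here are hypothetical, and Prometheus's real implementation handles more edge cases):

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """buckets: sorted (upper_bound, cumulative_count) pairs, like `le` buckets.
    Linearly interpolates within the bucket containing the q-th quantile."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            # fraction of the way through this bucket's observations
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + frac * (upper - prev_bound)
        prev_bound, prev_count = upper, count
    return buckets[-1][0]

# hypothetical latency buckets: 50 requests <= 0.5s, 90 <= 1s, 100 <= 2s
print(histogram_quantile(0.95, [(0.5, 50), (1.0, 90), (2.0, 100)]))  # → 1.5
```

The takeaway for alerting: the P95 value is an estimate whose precision depends on your bucket boundaries, so set thresholds with some margin.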

System Alerts

# system-alerts.yml
groups:
  - name: system
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is {{ $value | humanize }}%"

      - alert: LowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low available memory"
          description: "Only {{ $value | humanize }}% memory available"

      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value | humanize }}% disk space available"

RPC Health Checks

Implement active health checks beyond metrics:

#!/bin/bash
# health-check.sh

HORIZON_URL="http://localhost:8000"
SOROBAN_URL="http://localhost:8001"
CORE_URL="http://localhost:11626"

# Check Horizon
horizon_health=$(curl -s -o /dev/null -w "%{http_code}" "$HORIZON_URL/health")
if [ "$horizon_health" != "200" ]; then
  echo "CRITICAL: Horizon health check failed"
  exit 2
fi

# Check Horizon ledger sync
horizon_info=$(curl -s "$HORIZON_URL")
core_ledger=$(curl -s "$CORE_URL/info" | jq -r '.info.ledger.num')
horizon_ledger=$(echo "$horizon_info" | jq -r '.history_latest_ledger')

lag=$((core_ledger - horizon_ledger))
if [ "$lag" -gt 10 ]; then
  echo "WARNING: Horizon ingestion lag is $lag ledgers"
  exit 1
fi

# Check Soroban RPC health
soroban_health=$(curl -s -X POST "$SOROBAN_URL" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}' | jq -r '.result.status')

if [ "$soroban_health" != "healthy" ]; then
  echo "CRITICAL: Soroban RPC health check failed"
  exit 2
fi

# Check Stellar Core state
core_state=$(curl -s "$CORE_URL/info" | jq -r '.info.state')
if [ "$core_state" != "Synced!" ]; then
  echo "WARNING: Stellar Core state is $core_state"
  exit 1
fi

echo "OK: All services healthy"
exit 0
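The script uses the common monitoring exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL). If you run several such checks and need one overall status, the usual rule is that the worst individual result wins. A small sketch of that aggregation:

```python
# Exit-code convention used by the script above: 0=OK, 1=WARNING, 2=CRITICAL.
SEVERITY = {0: "OK", 1: "WARNING", 2: "CRITICAL"}

def overall_status(check_codes: list) -> str:
    """Aggregate several check exit codes: the highest (worst) code wins."""
    worst = max(check_codes, default=0)
    return SEVERITY.get(worst, "UNKNOWN")

print(overall_status([0, 0, 1]))  # → WARNING
print(overall_status([0, 2, 1]))  # → CRITICAL
print(overall_status([]))         # → OK
```

This matches how schedulers like Nagios or a simple cron wrapper would interpret the script's exit codes.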

Ledger Lag Detection

Ledger lag indicates your node is falling behind the network:

# ledger_lag_monitor.py
import requests
import time
from prometheus_client import Gauge, start_http_server

CORE_URL = "http://localhost:11626"
PUBLIC_HORIZON = "https://horizon.stellar.org"

ledger_lag = Gauge('stellar_ledger_lag', 'Ledger sequence lag from network')
local_ledger = Gauge('stellar_local_ledger', 'Local ledger sequence')
network_ledger = Gauge('stellar_network_ledger', 'Network ledger sequence')

def get_local_ledger():
    try:
        resp = requests.get(f"{CORE_URL}/info", timeout=5)
        return resp.json()['info']['ledger']['num']
    except Exception as e:
        print(f"Error getting local ledger: {e}")
        return None

def get_network_ledger():
    try:
        resp = requests.get(PUBLIC_HORIZON, timeout=5)
        return resp.json()['history_latest_ledger']
    except Exception as e:
        print(f"Error getting network ledger: {e}")
        return None

def monitor():
    while True:
        local = get_local_ledger()
        network = get_network_ledger()

        if local is not None and network is not None:
            lag = network - local
            ledger_lag.set(lag)
            local_ledger.set(local)
            network_ledger.set(network)

            if lag > 100:
                print(f"CRITICAL: Ledger lag is {lag}")
            elif lag > 10:
                print(f"WARNING: Ledger lag is {lag}")

        time.sleep(10)

if __name__ == '__main__':
    start_http_server(9101)
    monitor()

Grafana Dashboards

Stellar Overview Dashboard

{
  "dashboard": {
    "title": "Stellar Infrastructure Overview",
    "panels": [
      {
        "title": "Current Ledger",
        "type": "stat",
        "targets": [{
          "expr": "stellar_core_ledger_num"
        }]
      },
      {
        "title": "Ledger Age",
        "type": "gauge",
        "targets": [{
          "expr": "stellar_core_ledger_age_seconds"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "yellow", "value": 10 },
                { "color": "red", "value": 30 }
              ]
            }
          }
        }
      },
      {
        "title": "Horizon Request Rate",
        "type": "graph",
        "targets": [{
          "expr": "rate(horizon_requests_total[5m])"
        }]
      },
      {
        "title": "API Latency (P95)",
        "type": "graph",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum(rate(horizon_request_duration_seconds_bucket[5m])) by (le))"
        }]
      },
      {
        "title": "Ingestion Lag",
        "type": "stat",
        "targets": [{
          "expr": "stellar_core_ledger_num - horizon_ingest_ledger_ingested"
        }]
      },
      {
        "title": "Pending Transactions",
        "type": "graph",
        "targets": [{
          "expr": "stellar_core_herder_pending_txs"
        }]
      }
    ]
  }
}
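The gauge panel's threshold `steps` resolve a value to the color of the last step at or below it. For the ledger-age panel above, a sketch of that lookup:

```python
# Grafana-style threshold steps: sorted by value; the last step <= value wins.
STEPS = [(0, "green"), (10, "yellow"), (30, "red")]

def color_for(value: float, steps=STEPS) -> str:
    """Return the color of the highest step whose bound is <= value."""
    result = steps[0][1]
    for bound, color in steps:
        if value >= bound:
            result = color
    return result

print(color_for(5))   # → green
print(color_for(12))  # → yellow
print(color_for(45))  # → red
```

Matching the dashboard colors to your alert thresholds (yellow at warning, red at critical) keeps the on-call view consistent with what pages people.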

Alertmanager Configuration

# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'

    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        severity: critical

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#stellar-ops'
        title: ':warning: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
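The routing tree above sends criticals to PagerDuty and warnings to Slack, with everything else falling through to the root receiver. A minimal sketch of that matching logic (the receiver names mirror the config above; real Alertmanager routing also supports regex matchers and nested routes):

```python
# First matching route wins; unmatched alerts fall through to the root receiver.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warnings"),
]
DEFAULT_RECEIVER = "default"

def route(labels: dict) -> str:
    """Return the receiver for an alert's label set."""
    for match, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return DEFAULT_RECEIVER

print(route({"severity": "critical", "alertname": "StellarCoreLedgerStale"}))  # → pagerduty-critical
print(route({"severity": "info"}))  # → default
```

Because routes are evaluated in order, put the most specific (and most urgent) matches first.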

Runbook Examples

High Ledger Age

Alert: StellarCoreLedgerStale

Impact: Node is not processing new ledgers

Steps:

  • Check Stellar Core logs: docker logs stellar-core --tail 100
  • Verify network connectivity: curl -s localhost:11626/info | jq '.info.state'
  • Check peer connections: curl -s localhost:11626/peers
  • Restart Core if necessary: docker restart stellar-core
  • Monitor recovery in Grafana
Horizon Ingestion Lag

Alert: HorizonIngestionLag

Impact: API returning stale data

Steps:

  • Check Horizon logs: docker logs horizon --tail 100
  • Verify database connectivity
  • Check disk I/O: iostat -x 1 5
  • Reingest if necessary: horizon db reingest range START END
Conclusion

Effective monitoring is essential for running Stellar infrastructure in production. Key takeaways:

  • Monitor all layers - Core, Horizon, Soroban RPC, and system
  • Set appropriate thresholds - Avoid alert fatigue
  • Automate responses - Where possible, self-heal
  • Document runbooks - Every alert needs a response plan
  • Test regularly - Chaos engineering validates your monitoring

For teams that want reliability without the operational overhead, consider a managed provider like LumenQuery: we handle the monitoring so you can focus on building.


*Need production-ready Stellar infrastructure without the ops burden? Try LumenQuery—fully managed with built-in monitoring and 99.9% uptime SLA.*