ABS Monitoring

Aerospike Backup Service (ABS) exposes system metrics that Prometheus can scrape.

Prometheus configuration

ABS exposes metrics directly on its HTTP port, so you don’t need a dedicated Prometheus exporter. By default, metrics are available at http://<ABS_HOST>:8080/metrics. You can change the port with the service.http.port parameter.
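
As a minimal sketch, assuming the dotted name service.http.port maps to nested keys in the ABS YAML configuration file, a non-default port could be set as follows. The value 8081 is only an illustration.

```yaml
# ABS configuration fragment (assumed nesting of service.http.port).
service:
  http:
    port: 8081   # metrics would then be served at http://<ABS_HOST>:8081/metrics
```

If you change the port, use the same port in the Prometheus scrape target.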

The following example shows a standalone Prometheus configuration for scraping ABS metrics:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'aerospike-backup-service'
    static_configs:
      - targets: ['abs-service:8080']

Replace abs-service:8080 with your ABS host and port.

Grafana dashboard

A pre-built Grafana dashboard is available for visualizing ABS metrics. The dashboard includes panels for backup success and failure rates, backup duration, and restore operations.

Metrics

ABS includes the following application metrics:

Name	Description	Labels
`aerospike_backup_service_backup_duration_seconds`	Duration in seconds of finished backups by routine and type (full/incremental)	routine, type
`aerospike_backup_service_backup_events_total`	Backup service job events by routine, type (full/incremental), and outcome (success, failure, canceled, retry, skip)	routine, type, outcome
`aerospike_backup_service_backup_progress_pct`	Progress of running backups as a percentage	routine, type
`aerospike_backup_service_last_successful_backup_timestamp`	Unix timestamp of the last successful backup per routine and type (full/incremental)	routine, type
`aerospike_backup_service_restore_in_progress`	Number of restore processes running

Backup cancellation outcomes

The aerospike_backup_service_backup_events_total{outcome="canceled"} series increases when a backup is canceled.

- If a user explicitly cancels a backup (using the Cancel all jobs for a backup routine endpoint) or disables the routine, the increase is expected.
- If the service shuts down gracefully during a running backup, the logs can report the backup as canceled, but Prometheus may miss the final increment if it cannot scrape before shutdown completes.

Deprecated metrics

The following metrics are deprecated as of ABS 3.0. Use the recommended replacement metrics instead.


Name	Description	Replacement
`aerospike_backup_service_runs_total`	Successful backup runs counter	`aerospike_backup_service_backup_events_total`
`aerospike_backup_service_incremental_runs_total`	Successful incremental backup runs counter	`aerospike_backup_service_backup_events_total`
`aerospike_backup_service_skip_total`	Full backup skip counter	`aerospike_backup_service_backup_events_total`
`aerospike_backup_service_incremental_skip_total`	Incremental backup skip counter	`aerospike_backup_service_backup_events_total`
`aerospike_backup_service_failure_total`	Full backup failure counter	`aerospike_backup_service_backup_events_total`
`aerospike_backup_service_incremental_failure_total`	Incremental backup failure counter	`aerospike_backup_service_backup_events_total`
`aerospike_backup_service_duration_millis`	Full backup duration in milliseconds	`aerospike_backup_service_backup_duration_seconds`
`aerospike_backup_service_incremental_duration_millis`	Incremental backup duration in milliseconds	`aerospike_backup_service_backup_duration_seconds`

Example PromQL queries

Monitor and alert on backup performance with the following queries in Grafana panels or the Prometheus expression browser.

Number of successful full and incremental backups for a specific routine:

sum by (type) (aerospike_backup_service_backup_events_total{routine="daily-ns1", outcome="success"})

Number of failed backups per routine:

sum by (routine) (aerospike_backup_service_backup_events_total{outcome="failure"})

Number of canceled backups per routine:

sum by (routine) (aerospike_backup_service_backup_events_total{outcome="canceled"})

Average duration of backups that finished in the last 5 minutes, per routine and type:

rate(aerospike_backup_service_backup_duration_seconds_sum[5m]) / rate(aerospike_backup_service_backup_duration_seconds_count[5m])

Time since last full backup for a routine:

time() - aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1", type="full"}

Time since most recent backup for a routine regardless of backup type:

time() - max(aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1"})
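
If dashboards evaluate these queries often, Prometheus recording rules can precompute them. The following rules file is a sketch, not part of ABS: the group name and the record names are hypothetical and only follow the common level:metric:operations naming convention.

```yaml
groups:
  - name: abs-recording-rules   # hypothetical group name
    rules:
      # Cumulative failed backups per routine.
      - record: routine:aerospike_backup_service_backup_failures:sum
        expr: sum by (routine) (aerospike_backup_service_backup_events_total{outcome="failure"})
      # Seconds since the most recent successful backup of any type, per routine.
      - record: routine:aerospike_backup_service_last_success_age_seconds:max
        expr: time() - max by (routine) (aerospike_backup_service_last_successful_backup_timestamp)
```

Panels and alerts can then query the recorded series by name instead of repeating the full expression.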

Example Prometheus alerts

Integrate ABS metrics into your Prometheus alerting pipeline to stay informed of job failures and overdue backups.

Detect backup job failures recorded within the last 15 minutes with this alert.

- alert: BackupJobFailureDetected
  expr: increase(aerospike_backup_service_backup_events_total{outcome="failure"}[15m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "Backup job failure detected"
    description: "A backup failure was detected in the last 15 minutes for routine {{ $labels.routine }}."

Ensure backup continuity with this alert, which fires when a specific routine has not completed a successful backup within the last 24 hours.

- alert: BackupTooOld
  expr: time() - max(aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1"}) > 86400
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "Backup is older than 24 hours"
    description: "The last successful backup for routine daily-ns1 was more than 24 hours ago."
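
Prometheus loads alerting rules from rule files, where each rule belongs to a named group. The sketch below shows one way to wrap a rule in such a file. It also generalizes the 24-hour check to every routine with max by (routine), so the routine name does not need to be hard-coded. The file name abs-alerts.yml, the group name, and the alert name are assumptions for illustration.

```yaml
# abs-alerts.yml (hypothetical file name); reference it from prometheus.yml with:
#   rule_files:
#     - abs-alerts.yml
groups:
  - name: abs-backup-alerts
    rules:
      - alert: BackupTooOldAnyRoutine
        # One result series per routine; fires for each routine whose newest
        # successful backup (full or incremental) is older than 24 hours.
        expr: time() - max by (routine) (aerospike_backup_service_last_successful_backup_timestamp) > 86400
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Backup is older than 24 hours"
          description: "The last successful backup for routine {{ $labels.routine }} was more than 24 hours ago."
```

A routine that has never exported the timestamp series produces no result in this expression, so it would not trigger the alert. You can validate the file with promtool check rules abs-alerts.yml before reloading Prometheus.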

Process and Go runtime metrics

ABS also exposes standard process and Go runtime metrics on /metrics. Use them to detect resource saturation and runtime behavior changes before backup failures occur.

Metric	Description	What to watch
`process_cpu_seconds_total`	Cumulative CPU time (seconds) across all cores.	`rate(process_cpu_seconds_total[5m]) * 100` gives CPU usage as a percentage of one core. A sustained value above 100 means ABS uses more than one core on average.
`process_resident_memory_bytes`	Resident set size (RSS) in bytes.	Keep below container memory limits and watch for sustained growth during large backup or restore windows.
`process_open_fds`	Open file descriptors held by ABS.	Compare with `process_max_fds`; a sustained high ratio indicates descriptor pressure and possible `"too many open files"` errors.
`process_max_fds`	Maximum number of open file descriptors allowed for the process.	Track `process_open_fds / process_max_fds` and alert when the ratio approaches 1.0 for multiple scrape intervals.
`go_goroutines`	Active Go goroutines.	In a stable workload the count stays bounded. A monotonic increase usually indicates stalled background tasks or leaked goroutines.
`go_memstats_heap_alloc_bytes`	Go heap bytes allocated for live objects.	Compare with `process_resident_memory_bytes` to separate Go heap growth from non-heap memory pressure.

Prometheus query examples:

rate(process_cpu_seconds_total[5m]) * 100
- The process_cpu_seconds_total counter tracks total CPU time since the process started. This query turns that running total into a per-second average over the last 5 minutes, then multiplies by 100 to get a percentage. In this scale, 100 means one full CPU core is busy and 200 means two cores. If the result stays above your expected core budget for several minutes, compare the spike with aerospike_backup_service_backup_progress_pct to identify which routine is running. To reduce CPU usage, stagger routine schedules so fewer backups overlap, or increase the CPU resources allocated to the ABS instance.
process_resident_memory_bytes / 1024 / 1024 / 1024
- Converts the resident memory (RSS) of the ABS process from bytes to gibibytes, which is easier to compare against any memory limits. A good practice is to alert at roughly 80% of the memory limit. If memory keeps growing, increase the container or host memory limit, or reduce the number of backup routines that run concurrently.
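
The thresholds suggested above can also be expressed as alerting rules. The following is a sketch under assumptions: the job label matches the job_name from the scrape configuration example, the memory limit is 4 GiB, and the 0.9 ratio, the 5-minute and 10-minute durations, and the alert names are illustrative choices rather than ABS recommendations.

```yaml
- alert: AbsFileDescriptorsNearLimit
  # Ratio of open to maximum file descriptors, sustained for 5 minutes.
  expr: process_open_fds{job="aerospike-backup-service"} / process_max_fds{job="aerospike-backup-service"} > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ABS is close to its file descriptor limit"
    description: "Instance {{ $labels.instance }} uses more than 90% of its file descriptors."
- alert: AbsMemoryNearLimit
  # 80% of an assumed 4 GiB memory limit; replace 4 with your own limit in GiB.
  expr: process_resident_memory_bytes{job="aerospike-backup-service"} > 0.8 * 4 * 1024 * 1024 * 1024
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "ABS resident memory is above 80% of the assumed limit"
    description: "Instance {{ $labels.instance }} has exceeded 80% of a 4 GiB memory limit for 10 minutes."
```

A non-zero for duration keeps short spikes during a single backup window from firing the alert.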

For details about these collectors in the Prometheus Go client, see the collectors package documentation.

Endpoints

Name	Description
`/metrics`	Exposes metrics for Prometheus to check performance of the backup service.
`/health`	Allows monitoring systems to check the service health.
`/ready`	Checks whether the service is able to handle requests.
`/version`	Returns the application version, commit hash, and build time.
`/api-docs`	Serves the API documentation in Swagger UI format.
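
When ABS runs in Kubernetes, the /health and /ready endpoints can back container probes. The fragment below is a sketch that belongs under a container entry in a Pod spec. Mapping /health to the liveness probe and /ready to the readiness probe, port 8080 (the default HTTP port), and the timing values are assumptions to adapt to your deployment.

```yaml
livenessProbe:
  httpGet:
    path: /health      # restart the container if the service stops responding
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready       # stop routing traffic until the service can handle requests
    port: 8080
  periodSeconds: 5
```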

See the official Kubernetes documentation and Prometheus documentation for more information.