ABS Monitoring

Aerospike Backup Service (ABS) exposes system metrics that Prometheus can scrape.

Prometheus configuration

ABS exposes metrics directly on its HTTP port, so you don’t need a dedicated Prometheus exporter. By default, metrics are available at http://<ABS_HOST>:8080/metrics. You can change the port with the service.http.port parameter.

The following example shows a standalone Prometheus configuration for scraping ABS metrics:

/etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'aerospike-backup-service'
    static_configs:
      - targets: ['abs-service:8080']

Replace abs-service:8080 with your ABS host and port.
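
If you changed service.http.port or run more than one ABS instance, the same job can list several targets. The following is a minimal sketch rather than a required layout: the host names and port 8081 are hypothetical placeholders, and metrics_path is written out only to make the Prometheus default explicit.

```yaml
# Sketch only: host names and port are placeholders, not values shipped with ABS.
scrape_configs:
  - job_name: 'aerospike-backup-service'
    metrics_path: /metrics        # Prometheus default, shown for clarity
    scrape_interval: 15s          # per-job override of global.scrape_interval
    static_configs:
      - targets:
          - 'abs-1.example.internal:8081'   # assumes service.http.port is set to 8081
          - 'abs-2.example.internal:8081'
```

Each target gets its own instance label, so per-instance series stay distinguishable in queries and alerts.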

Grafana dashboard

A pre-built Grafana dashboard is available for visualizing ABS metrics. The dashboard includes panels for backup success and failure rates, backup duration, and restore operations.

Metrics

ABS includes the following application metrics:

| Name | Description | Labels |
| --- | --- | --- |
| aerospike_backup_service_backup_duration_seconds | Duration in seconds of finished backups by routine and type (full/incremental) | routine, type |
| aerospike_backup_service_backup_events_total | Backup service job events by routine, type (full/incremental), and outcome (success, failure, canceled, retry, skip) | routine, type, outcome |
| aerospike_backup_service_backup_progress_pct | Progress of backup processes in percentage | routine, type |
| aerospike_backup_service_last_successful_backup_timestamp | Unix timestamp of the last successful backup per routine and type (full/incremental) | routine, type |
| aerospike_backup_service_restore_in_progress | Number of restore processes running | |

Backup cancellation outcomes

The aerospike_backup_service_backup_events_total{outcome="canceled"} series increases when a backup is canceled.

  • If a user explicitly cancels a backup (using the Cancel all jobs for a backup routine endpoint) or disables the routine, this metric increase is expected.
  • If the service shuts down gracefully during a running backup, logs can report the backup as canceled, but Prometheus may miss the final increment if it cannot scrape before shutdown completes.
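
To surface cancellations you did not expect, you can alert on increases of this series. The rule below is a minimal sketch: the alert name, the one-hour window, and the info severity are illustrative choices, not ABS recommendations.

```yaml
- alert: BackupCanceled              # hypothetical alert name
  expr: increase(aerospike_backup_service_backup_events_total{outcome="canceled"}[1h]) > 0
  for: 0m
  labels:
    severity: info                   # assumed severity convention
  annotations:
    summary: "Backup canceled"
    description: "A {{ $labels.type }} backup for routine {{ $labels.routine }} was canceled in the last hour."
```

Because a cancellation during graceful shutdown may never be scraped, treat this rule as a signal for user-initiated or routine-disable cancellations rather than a complete record.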

Deprecated metrics

The following metrics are deprecated as of ABS 3.0. Use the recommended replacement metrics instead.

| Name | Description | Replacement |
| --- | --- | --- |
| aerospike_backup_service_runs_total | Successful backup runs counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_incremental_runs_total | Successful incremental backup runs counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_skip_total | Full backup skip counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_incremental_skip_total | Incremental backup skip counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_failure_total | Full backup failure counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_incremental_failure_total | Incremental backup failure counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_duration_millis | Full backup duration in milliseconds | aerospike_backup_service_backup_duration_seconds |
| aerospike_backup_service_incremental_duration_millis | Incremental backup duration in milliseconds | aerospike_backup_service_backup_duration_seconds |

Example PromQL queries

Monitor and alert on backup performance with the following queries in Grafana panels or the Prometheus expression browser.

  • Number of successful full and incremental backups for a specific routine:
sum by (type) ( aerospike_backup_service_backup_events_total{routine="daily-ns1", outcome="success"} )
  • Number of failed backups per routine:
sum by (routine) ( aerospike_backup_service_backup_events_total{outcome="failure"} )
  • Number of canceled backups per routine:
sum by (routine) ( aerospike_backup_service_backup_events_total{outcome="canceled"} )
  • Average backup duration over the last 5 minutes, for each routine and type:
rate(aerospike_backup_service_backup_duration_seconds_sum[5m]) / rate(aerospike_backup_service_backup_duration_seconds_count[5m])
  • Seconds since the last successful full backup for a routine:
time() - aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1", type="full"}
  • Seconds since the most recent successful backup for a routine, regardless of backup type:
time() - max(aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1"})
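
If dashboards evaluate these expressions often, Prometheus recording rules can precompute them. The following sketch uses only the metrics listed on this page; the group name and the recorded series names are hypothetical and follow the common level:metric:operations naming pattern.

```yaml
groups:
  - name: abs-backup-recordings          # hypothetical group name
    rules:
      # Seconds since the most recent successful backup, per routine.
      - record: routine:aerospike_backup_service_backup_age_seconds:max
        expr: time() - max by (routine) (aerospike_backup_service_last_successful_backup_timestamp)
      # Failed backups over the last 24 hours, per routine.
      - record: routine:aerospike_backup_service_backup_failures:increase24h
        expr: sum by (routine) (increase(aerospike_backup_service_backup_events_total{outcome="failure"}[24h]))
```

Aggregating with max by (routine) keeps the routine label on the result, which the unlabelled max() in the single-routine query drops.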

Example Prometheus alerts

Integrate ABS metrics into your Prometheus alerting pipeline to stay informed of job failures and stale backups.

  • Detect backup job failures recorded within the last 15 minutes with this alert.
- alert: BackupJobFailureDetected
  expr: increase(aerospike_backup_service_backup_events_total{outcome="failure"}[15m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "Backup job failure detected"
    description: "A backup failure was detected in the last 15 minutes for routine {{ $labels.routine }}."
  • Ensure backup continuity with this alert, which detects if a specific routine has failed to complete successfully within the last 24 hours.
- alert: BackupTooOld
  expr: time() - max(aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1"}) > 86400
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "Backup is older than 24 hours"
    description: "The last successful backup for routine daily-ns1 was more than 24 hours ago."
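
The BackupTooOld expression returns no result when the timestamp series does not exist, so the alert cannot fire in that case. A companion rule built on the PromQL absent() function can cover the gap. This is a sketch under stated assumptions: whether ABS omits the series before a routine's first successful backup is not confirmed here, and the group name, alert name, and one-hour hold time are illustrative. The groups wrapper also shows how rules like the ones above sit inside a rules file that prometheus.yml references through rule_files.

```yaml
groups:
  - name: abs-backup-alerts                # hypothetical group name
    rules:
      - alert: BackupTimestampMissing      # hypothetical alert name
        expr: absent(aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1"})
        for: 1h                            # assumed grace period after startup
        labels:
          severity: warning
        annotations:
          summary: "No successful-backup timestamp for daily-ns1"
          description: "Prometheus has no last successful backup timestamp series for routine daily-ns1."
```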

Process and Go runtime metrics

ABS also exposes standard process and Go runtime metrics on /metrics. Use them to detect resource saturation and runtime behavior changes before backup failures occur.

| Metric | Description | What to watch |
| --- | --- | --- |
| process_cpu_seconds_total | Cumulative CPU time (seconds) across all cores. | rate(process_cpu_seconds_total[5m]) * 100 gives CPU usage as a percentage of one core. A sustained value above 100 means ABS uses more than one core on average. |
| process_resident_memory_bytes | Resident set size (RSS) in bytes. | Keep below container memory limits and watch for sustained growth during large backup or restore windows. |
| process_open_fds | Open file descriptors held by ABS. | Compare with process_max_fds; a sustained high ratio indicates descriptor pressure and possible "too many open files" errors. |
| process_max_fds | Maximum number of open file descriptors allowed for the process. | Track process_open_fds / process_max_fds and alert when the ratio approaches 1.0 for multiple scrape intervals. |
| go_goroutines | Active Go goroutines. | In a stable workload the count stays bounded. A monotonic increase usually indicates stalled background tasks or leaked goroutines. |
| go_memstats_heap_alloc_bytes | Go heap bytes allocated for live objects. | Compare with process_resident_memory_bytes to separate Go heap growth from non-heap memory pressure. |

Prometheus query examples:

  • rate(process_cpu_seconds_total[5m]) * 100

    • The process_cpu_seconds_total counter tracks total CPU time since the process started. This query turns that running total into a per-second average over the last 5 minutes, then multiplies by 100 to get a percentage. In this scale, 100 means one full CPU core is busy and 200 means two cores. If the result stays above your expected core budget for several minutes, compare the spike with aerospike_backup_service_backup_progress_pct to identify which routine is running. To reduce CPU usage, stagger routine schedules so fewer backups overlap, or increase the CPU resources allocated to the ABS instance.
  • process_resident_memory_bytes / 1024 / 1024 / 1024

    • Converts the resident memory (RSS) of the ABS process from bytes to gibibytes, which is easier to compare against any memory limits. A good practice is to alert at roughly 80% of the memory limit. If memory keeps growing, increase the container or host memory limit, or reduce the number of backup routines that run concurrently.
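
The guidance above can be turned into alert rules. The sketch below is illustrative only: the alert names, the 0.9 descriptor ratio, the two-core CPU budget, and the 4 GiB memory limit are assumptions to replace with your own values, and the job label matches the scrape configuration example on this page.

```yaml
- alert: AbsFileDescriptorsNearLimit       # hypothetical alert name
  expr: process_open_fds{job="aerospike-backup-service"} / process_max_fds{job="aerospike-backup-service"} > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ABS is close to its file descriptor limit"
    description: "{{ $labels.instance }} uses {{ $value | humanizePercentage }} of its file descriptors."
- alert: AbsHighCpu                        # hypothetical alert name
  # 200 corresponds to two fully busy cores (assumed CPU budget).
  expr: rate(process_cpu_seconds_total{job="aerospike-backup-service"}[5m]) * 100 > 200
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "ABS CPU usage above two cores"
- alert: AbsHighMemory                     # hypothetical alert name
  # 80% of an assumed 4 GiB memory limit.
  expr: process_resident_memory_bytes{job="aerospike-backup-service"} > 0.8 * 4 * 1024 * 1024 * 1024
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "ABS resident memory above 80% of its limit"
```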

For details about these collectors in the Prometheus Go client, see the collectors package documentation.

Endpoints

| Name | Description |
| --- | --- |
| /metrics | Exposes metrics for Prometheus to check performance of the backup service. |
| /health | Allows monitoring systems to check the service health. |
| /ready | Checks whether the service is able to handle requests. |
| /version | Returns the application version, commit hash, and build time. |
| /api-docs | Serves the API documentation in Swagger UI format. |
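
On Kubernetes, the /health and /ready endpoints map naturally onto liveness and readiness probes. The fragment below is a sketch of a container spec, not a tested ABS manifest: the container name and probe timings are hypothetical, port 8080 is the default HTTP port mentioned earlier, and it assumes both endpoints answer with a 2xx or 3xx status when the service is healthy, which is what an httpGet probe treats as success.

```yaml
# Fragment of a Pod or Deployment container spec (sketch; values are assumptions).
containers:
  - name: aerospike-backup-service     # hypothetical container name
    ports:
      - containerPort: 8080            # default ABS HTTP port
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 10                # assumed probe interval
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 10                # assumed probe interval
```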

See the official Kubernetes documentation and Prometheus documentation for more information.
