ABS Monitoring
Aerospike Backup Service (ABS) exposes system metrics that Prometheus can scrape.
Prometheus configuration
ABS exposes metrics directly on its HTTP port, so you don’t need a dedicated Prometheus exporter.
By default, metrics are available at http://<ABS_HOST>:8080/metrics.
You can change the port with the service.http.port parameter.
The following example shows a standalone Prometheus configuration for scraping ABS metrics:
    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'aerospike-backup-service'
        static_configs:
          - targets: ['abs-service:8080']

Replace abs-service:8080 with your ABS host and port.
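If ABS listens on a non-default port, or you want to tag the scraped series, the scrape job can state those settings explicitly. The following is a sketch rather than a required configuration: port 9090 and the environment label are hypothetical values, and metrics_path and scheme only restate the Prometheus defaults.

```yaml
scrape_configs:
  - job_name: 'aerospike-backup-service'
    metrics_path: /metrics        # Prometheus default, stated explicitly
    scheme: http                  # Prometheus default
    static_configs:
      - targets: ['abs-service:9090']   # hypothetical port; must match service.http.port
        labels:
          environment: 'staging'        # hypothetical label attached to every scraped series
```

The port in the target must match whatever value service.http.port is set to, otherwise the scrape fails and Prometheus marks the target as down.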
Grafana dashboard
A pre-built Grafana dashboard is available for visualizing ABS metrics. The dashboard includes panels for backup success and failure rates, backup duration, and restore operations.
Metrics
ABS includes the following application metrics:
| Name | Description | Labels |
|---|---|---|
| aerospike_backup_service_backup_duration_seconds | Duration in seconds of finished backups by routine and type (full/incremental) | routine, type |
| aerospike_backup_service_backup_events_total | Backup service job events by routine, type (full/incremental), and outcome (success, failure, canceled, retry, skip) | routine, type, outcome |
| aerospike_backup_service_backup_progress_pct | Progress of backup processes in percentage | routine, type |
| aerospike_backup_service_last_successful_backup_timestamp | Unix timestamp of the last successful backup per routine and type (full/incremental) | routine, type |
| aerospike_backup_service_restore_in_progress | Number of restore processes running | |
Backup cancellation outcomes
The aerospike_backup_service_backup_events_total{outcome="canceled"} series increases when a backup is canceled.
- If a user explicitly cancels a backup (using the Cancel all jobs for a backup routine endpoint) or disables the routine, this metric increase is expected.
- If the service shuts down gracefully during a running backup, logs can report the backup as canceled, but Prometheus may miss the final increment if it cannot scrape before shutdown completes.
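To surface cancellations that nobody requested, one option is an alert on the canceled outcome. This is a sketch under assumptions: the group name, alert name, one-hour window, and severity are illustrative choices. The rule cannot tell a user-initiated cancel from any other, so treat it as a prompt to check the logs.

```yaml
groups:
  - name: abs-backup-cancellations      # hypothetical group name
    rules:
      - alert: BackupCanceled           # hypothetical alert name
        # Fires when the canceled counter grew for any routine in the last hour.
        expr: increase(aerospike_backup_service_backup_events_total{outcome="canceled"}[1h]) > 0
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "Backup canceled"
          description: "A backup for routine {{ $labels.routine }} ({{ $labels.type }}) was canceled in the last hour."
```

Because a graceful shutdown can prevent the final increment from being scraped, the absence of this alert does not prove that no backup was canceled.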
Deprecated metrics
The following metrics are deprecated as of ABS 3.0. Use the recommended replacement metrics instead.
| Name | Description | Replacement |
|---|---|---|
| aerospike_backup_service_runs_total | Successful backup runs counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_incremental_runs_total | Successful incremental backup runs counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_skip_total | Full backup skip counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_incremental_skip_total | Incremental backup skip counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_failure_total | Full backup failure counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_incremental_failure_total | Incremental backup failure counter | aerospike_backup_service_backup_events_total |
| aerospike_backup_service_duration_millis | Full backup duration in milliseconds | aerospike_backup_service_backup_duration_seconds |
| aerospike_backup_service_incremental_duration_millis | Incremental backup duration in milliseconds | aerospike_backup_service_backup_duration_seconds |
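When moving dashboards off the deprecated counters, the per-type and per-outcome split that used to live in the metric name moves into labels on the replacement metric. The recording rules below sketch that mapping. The rule names and group name are hypothetical, and the pairing of each deprecated counter with specific type and outcome values is inferred from the table descriptions, not confirmed.

```yaml
groups:
  - name: abs-deprecated-metric-equivalents   # hypothetical group name
    rules:
      # Approximates aerospike_backup_service_runs_total
      # (assumed to count successful full backups).
      - record: abs:full_backup_success:total
        expr: sum by (routine) (aerospike_backup_service_backup_events_total{type="full", outcome="success"})
      # Approximates aerospike_backup_service_incremental_skip_total.
      - record: abs:incremental_backup_skip:total
        expr: sum by (routine) (aerospike_backup_service_backup_events_total{type="incremental", outcome="skip"})
      # Approximates aerospike_backup_service_failure_total.
      - record: abs:full_backup_failure:total
        expr: sum by (routine) (aerospike_backup_service_backup_events_total{type="full", outcome="failure"})
```

Compare the recorded series against the deprecated ones on a test instance before switching any alerts over.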
Example PromQL queries
Monitor and alert on backup performance with the following queries in Grafana panels or the Prometheus expression browser.
- Number of successful full and incremental backups for a specific routine:

      sum by (type) (aerospike_backup_service_backup_events_total{routine="daily-ns1", outcome="success"})

- Number of failed backups per routine:

      sum by (routine) (aerospike_backup_service_backup_events_total{outcome="failure"})

- Number of canceled backups per routine:

      sum by (routine) (aerospike_backup_service_backup_events_total{outcome="canceled"})

- Average duration of backups that finished in the last 5 minutes, for each routine and type:

      rate(aerospike_backup_service_backup_duration_seconds_sum[5m]) / rate(aerospike_backup_service_backup_duration_seconds_count[5m])

- Time since last full backup for a routine:

      time() - aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1", type="full"}

- Time since most recent backup for a routine regardless of backup type:

      time() - max(aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1"})

Example Prometheus alerts
Integrate ABS metrics into your Prometheus alerting pipeline to stay informed of job failures or service latencies.
- Detect backup job failures recorded within the last 15 minutes with this alert.

      - alert: BackupJobFailureDetected
        expr: increase(aerospike_backup_service_backup_events_total{outcome="failure"}[15m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup job failure detected"
          description: "A backup failure was detected in the last 15 minutes for routine {{ $labels.routine }}."

- Ensure backup continuity with this alert, which detects if a specific routine has failed to complete successfully within the last 24 hours.

      - alert: BackupTooOld
        expr: time() - max(aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1"}) > 86400
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Backup is older than 24 hours"
          description: "The last successful backup for routine daily-ns1 was more than 24 hours ago."

Process and Go runtime metrics
ABS also exposes standard process and Go runtime metrics on /metrics. Use them to detect resource saturation and runtime behavior changes before backup failures occur.
| Metric | Description | What to watch |
|---|---|---|
| process_cpu_seconds_total | Cumulative CPU time (seconds) across all cores. | rate(process_cpu_seconds_total[5m]) * 100 gives CPU usage as a percentage of one core. A sustained value above 100 means ABS uses more than one core on average. |
| process_resident_memory_bytes | Resident set size (RSS) in bytes. | Keep below container memory limits and watch for sustained growth during large backup or restore windows. |
| process_open_fds | Open file descriptors held by ABS. | Compare with process_max_fds; a sustained high ratio indicates descriptor pressure and possible "too many open files" errors. |
| process_max_fds | Maximum number of open file descriptors allowed for the process. | Track process_open_fds / process_max_fds and alert when the ratio approaches 1.0 for multiple scrape intervals. |
| go_goroutines | Active Go goroutines. | In a stable workload the count stays bounded. A monotonic increase usually indicates stalled background tasks or leaked goroutines. |
| go_memstats_heap_alloc_bytes | Go heap bytes allocated for live objects. | Compare with process_resident_memory_bytes to separate Go heap growth from non-heap memory pressure. |
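The descriptor ratio and the heap comparison from the table can be turned into rules. This is a sketch under assumptions: the job label matches the job_name in the scrape configuration shown earlier, the 0.9 threshold and 5-minute hold are illustrative, and the group, alert, and recorded series names are hypothetical.

```yaml
groups:
  - name: abs-process-saturation            # hypothetical group name
    rules:
      - alert: AbsFileDescriptorPressure    # hypothetical alert name
        # Share of allowed descriptors in use; 0.9 is an assumed threshold.
        expr: process_open_fds{job="aerospike-backup-service"} / process_max_fds{job="aerospike-backup-service"} > 0.9
        for: 5m                             # hold across several scrape intervals
        labels:
          severity: warning
        annotations:
          summary: "ABS is close to its file descriptor limit"
          description: "{{ $labels.instance }} uses {{ $value | humanizePercentage }} of its file descriptors."
      # RSS minus live Go heap: a rough view of memory not explained by heap objects.
      - record: abs:non_heap_memory_bytes   # hypothetical series name
        expr: process_resident_memory_bytes{job="aerospike-backup-service"} - go_memstats_heap_alloc_bytes{job="aerospike-backup-service"}
```

If the recorded difference grows while the heap stays flat, the pressure is likely outside Go heap objects. This is a heuristic, not a precise accounting of process memory.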
Prometheus query examples:
- rate(process_cpu_seconds_total[5m]) * 100
  - The process_cpu_seconds_total counter tracks total CPU time since the process started. This query turns that running total into a per-second average over the last 5 minutes, then multiplies by 100 to get a percentage. In this scale, 100 means one full CPU core is busy and 200 means two cores. If the result stays above your expected core budget for several minutes, compare the spike with aerospike_backup_service_backup_progress_pct to identify which routine is running. To reduce CPU usage, stagger routine schedules so fewer backups overlap, or increase the CPU resources allocated to the ABS instance.
- process_resident_memory_bytes / 1024 / 1024 / 1024
  - Converts the resident memory (RSS) of the ABS process from bytes to gibibytes, which is easier to compare against any memory limits. A good practice is to alert at roughly 80% of the memory limit. If memory keeps growing, increase the container or host memory limit, or reduce the number of backup routines that run concurrently.
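Both queries can also drive alerts. The rules below are a sketch under assumptions: a 4 GiB memory limit, a CPU budget of two cores, 10-minute hold times, and a job label matching the scrape configuration shown earlier are all illustrative values, and the group and alert names are hypothetical.

```yaml
groups:
  - name: abs-resource-budget               # hypothetical group name
    rules:
      - alert: AbsMemoryNearLimit           # hypothetical alert name
        # Assumed 4 GiB limit; 0.8 applies the 80% guideline.
        expr: process_resident_memory_bytes{job="aerospike-backup-service"} > 0.8 * 4 * 1024 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ABS memory is above 80% of its limit"
          description: "{{ $labels.instance }} resident memory is {{ $value | humanize1024 }}B."
      - alert: AbsCpuAboveBudget            # hypothetical alert name
        # 200 on this scale means two fully busy cores (assumed budget).
        expr: rate(process_cpu_seconds_total{job="aerospike-backup-service"}[5m]) * 100 > 200
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ABS CPU usage is above its core budget"
          description: "{{ $labels.instance }} has used more than two cores on average for 10 minutes."
```

Replace the limit and budget with the values that apply to your deployment.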
For details about these collectors in the Prometheus Go client, see the collectors package documentation.
Endpoints
| Name | Description |
|---|---|
| /metrics | Exposes metrics for Prometheus to check performance of the backup service. |
| /health | Allows monitoring systems to check the service health. |
| /ready | Checks whether the service is able to handle requests. |
| /version | Returns the application version, commit hash, and build time. |
| /api-docs | Serves the API documentation in Swagger UI format. |
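On Kubernetes, the /health and /ready endpoints map naturally onto container probes. The fragment below is a sketch, assuming ABS listens on the default port 8080, that /health suits a liveness check, and that /ready suits a readiness check, as their descriptions suggest. The timing values are placeholders.

```yaml
# Fragment of a container spec in a Pod or Deployment manifest.
livenessProbe:
  httpGet:
    path: /health
    port: 8080              # assumed default ABS HTTP port
  initialDelaySeconds: 10   # placeholder timing
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```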
See the official Kubernetes documentation and Prometheus documentation for more information.