Rules

CCS alerts (last evaluation: 22.693s ago, evaluation time: 262us)

alert: credit_compliance_service_mismatch_count_gain
expr: delta(ccs_mismatches[1h]) > 0
for: 1m
labels: {severity: warning, team: analytics}
annotations:
  description: CCS batch runs have shown an increase in mismatch count
  environment: production
  host: creditservices2
  summary: CCS batch runs have shown an increase in mismatch count
state: ok | last evaluation: 22.693s ago | evaluation time: 254.4us
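For context, groups like the one above live in rule files that Prometheus loads through its rule_files setting. The sketch below shows how the CCS group would be declared on disk and wired into the server configuration; the file names, path, and evaluation interval are assumptions for illustration, not values taken from this deployment.

  # ccs.yml - sketch of the CCS group as an on-disk rule file (file name assumed)
  groups:
    - name: CCS alerts
      rules:
        - alert: credit_compliance_service_mismatch_count_gain
          expr: delta(ccs_mismatches[1h]) > 0
          for: 1m
          labels:
            severity: warning
            team: analytics
          annotations:
            description: CCS batch runs have shown an increase in mismatch count
            environment: production
            host: creditservices2
            summary: CCS batch runs have shown an increase in mismatch count

  # prometheus.yml - the rule file is picked up via rule_files
  # (path and interval are assumptions)
  global:
    evaluation_interval: 30s
  rule_files:
    - /etc/prometheus/rules/*.yml

The "last evaluation" timestamps shown for each group on this page advance once per evaluation interval.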

CDS alerts (last evaluation: 930ms ago, evaluation time: 486.8us)

alert: credit_decision_service_high_traffic
expr: (avg(cds_function_benchmark_seconds_count{name="process_flow"} - (cds_function_benchmark_seconds_count{name="process_flow"} offset 5m)) > 1000)
for: 1m
labels: {severity: critical}
annotations:
  description: The 5m average request rate is significantly higher than usual
  environment: production
  host: creditservices
  summary: CDS traffic is abnormally high
state: ok | last evaluation: 930ms ago | evaluation time: 308.6us

alert: credit_decision_service_response_500
expr: (cds_status_code_count{code="500"} > (cds_status_code_count{code="500"} offset 1s))
for: 1s
labels: {severity: warning, team: analytics}
annotations:
  description: The gunicorn workers are returning HTTP 500 status codes.
  environment: production
  host: creditservices2
  summary: CDS has some HTTP 500 error responses.
state: ok | last evaluation: 930ms ago | evaluation time: 90.13us

alert: credit_decision_service_error_count_change
expr: (cds_error_count_total - (cds_error_count_total offset 2m) > 0)
for: 2m
labels: {severity: warning, team: analytics}
annotations:
  description: CDS error count is accumulating.
  environment: production
  host: creditservices2
  summary: CDS is experiencing some exceptions.
state: ok | last evaluation: 930ms ago | evaluation time: 78.64us
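The CDS rules above detect growth by subtracting an offset sample of a counter from its current value. A general PromQL note on that pattern, not a change applied to the deployed rules: increase() is usually preferred for counters because it is robust to counter resets, whereas the raw subtraction goes negative after a process restart and the alert stays silent for that window.

  # Raw-subtraction pattern used above: misses growth across a counter reset
  cds_error_count_total - (cds_error_count_total offset 2m) > 0

  # Reset-aware equivalent for counters (sketch only, not one of the deployed rules)
  increase(cds_error_count_total[2m]) > 0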

Instances (last evaluation: 14.08s ago, evaluation time: 3.041ms)

alert: Prometheus target not responding
expr: up{job!="bashcontainerstats"} == 0
for: 2m
labels: {severity: critical}
annotations:
  description: '{{ $labels.job }} at {{ $labels.instance }} has been unreachable for more than 2 minutes.'
  environment: production
  summary: Prometheus target at {{ $labels.instance }} is unreachable
state: ok | last evaluation: 14.08s ago | evaluation time: 321.6us

alert: Bashcontainerstats not responding
expr: up{job="bashcontainerstats"} == 0
for: 2m
labels: {severity: warning}
annotations:
  description: '{{ $labels.job }} at {{ $labels.instance }} has been unreachable for more than 2 minutes.'
  environment: production
  summary: Prometheus target at {{ $labels.instance }} is unreachable
state: ok | last evaluation: 14.08s ago | evaluation time: 64.04us
alert: CPU Load
expr: avg_over_time(node_load1[5m]) > 60 or avg_over_time(node_load1[1m]) > 70
for: 1m
labels: {severity: critical}
annotations:
  description: '{{ $labels.instance }} has a high CPU load.'
  environment: production
  summary: Instance {{ $labels.instance }} has a high CPU load.
state: ok | last evaluation: 14.08s ago | evaluation time: 239.1us
alert: Disk Usage
expr: (100 - 100 * (node_filesystem_avail_bytes{device!~"by-uuid",device!~"tmpfs",mountpoint="/"} / node_filesystem_size_bytes{device!~"by-uuid",device!~"tmpfs",mountpoint="/"})) > 90
for: 5m
labels: {severity: critical}
annotations:
  description: '{{ $labels.instance }} has disk usage higher than 90%.'
  environment: production
  summary: Instance {{ $labels.instance }} has disk usage greater than 90%.
state: ok | last evaluation: 14.08s ago | evaluation time: 608.6us

alert: Postgres DB Connections
expr: avg_over_time(pg_stat_activity_count[1m]) > 280 and avg_over_time(pg_stat_activity_count[15m]) > 275
for: 1m
labels: {severity: warning}
annotations:
  description: '{{ $labels.instance }} has passed a DB connection threshold.'
  environment: production
  summary: Instance {{ $labels.instance }} has passed a DB connection threshold.
state: ok | last evaluation: 14.079s ago | evaluation time: 1.511ms

alert: RAM Usage
expr: (100 * (node_memory_MemAvailable_bytes) / (node_memory_MemTotal_bytes + node_memory_SwapTotal_bytes)) < 5
for: 5m
labels: {severity: critical}
annotations:
  description: '{{ $labels.instance }} has less than 5% available RAM.'
  environment: production
  summary: Instance {{ $labels.instance }} has less than 5% available RAM.
state: ok | last evaluation: 14.078s ago | evaluation time: 278.1us
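Rules in this group can be exercised offline with promtool's rule unit-testing support before they are deployed. The sketch below feeds a synthetic load series to the CPU Load alert and asserts that it fires; the rule file name, the job label, and the instance name are assumptions made for the example, not values from this environment.

  # cpu_load_test.yml - run with: promtool test rules cpu_load_test.yml
  # (rule file name, job and instance labels are assumptions for the example)
  rule_files:
    - instances.yml
  evaluation_interval: 1m
  tests:
    - interval: 1m
      input_series:
        - series: 'node_load1{instance="creditservices:9100", job="node"}'
          values: '70+0x10'            # load average pinned at 70 for 10 minutes
      alert_rule_test:
        - eval_time: 3m                # past the 1m "for" clause, so the alert should be firing
          alertname: CPU Load
          exp_alerts:
            - exp_labels:
                severity: critical
                instance: "creditservices:9100"
                job: node
              exp_annotations:
                description: "creditservices:9100 has a high CPU load."
                environment: production
                summary: "Instance creditservices:9100 has a high CPU load."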

postgres (last evaluation: 21.974s ago, evaluation time: 1.589ms)

alert: postgres_down
expr: pg_up == 0
for: 5m
labels: {service: postgres, severity: warning}
annotations:
  description: |-
    Postgres instance is down
    VALUE = {{ $value }}
    LABELS: {{ $labels }}
  summary: Postgres instance {{ $labels.instance }} is offline
state: ok | last evaluation: 21.974s ago | evaluation time: 190.5us
alert: High connection count on DB Primary
expr: sum by(environment, instance) (pg_stat_activity_count) > on(instance) pg_settings_max_connections * 0.9
for: 5m
labels: {service: postgres, severity: critical}
annotations:
  description: Postgres total connections have been above 90% of the configured max_connections for the past 5 minutes on dbprimary
  summary: dbprimary connection count is too high
state: ok | last evaluation: 21.974s ago | evaluation time: 722.1us

alert: Queries are too slow
expr: avg by(datname) (rate(pg_stat_activity_max_tx_duration{datname!~"template.*"}[2m])) > 2 * 60
for: 2m
labels: {service: postgres, severity: warning}
annotations:
  description: The average SQL transaction duration over the last two minutes exceeds two minutes for a database
  summary: Postgres is executing slow queries
state: ok | last evaluation: 21.973s ago | evaluation time: 594us
alert: Deadlock detected
expr: rate(pg_stat_database_deadlocks{datname="integra_production",instance="DB_PRIMARY_HOST_NAME:9187"}[1m]) > 0
for: 1m
labels: {service: postgres, severity: warning}
annotations:
  description: |-
    PostgreSQL has dead-locks
    VALUE = {{ $value }}
    LABELS: {{ $labels }}
  summary: Dead locks (instance {{ $labels.instance }})
state: ok | last evaluation: 21.973s ago | evaluation time: 72.94us
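The connection-count alert above compares two different metrics, so its comparison relies on explicit vector matching. Spelled out below as a reading aid only; the exact label sets are assumptions about what postgres_exporter publishes.

    sum by (environment, instance) (pg_stat_activity_count)   # total backends per instance
  > on (instance)                                             # join left and right on the instance label only
    pg_settings_max_connections * 0.9                         # 90% of the configured ceiling
                                                              # (* binds tighter than >, so the scalar
                                                              #  multiplication happens before the comparison)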

product (last evaluation: 25.153s ago, evaluation time: 2.289ms)
alert: Slow vendor responses
expr: avg_over_time(ruby_http_request_duration_seconds{app_name="vendor_data_service",controller="requests",quantile="0.99"}[5m]) > 5
for: 1m
labels: {service: vendor_data_service, severity: info}
annotations:
  description: The 99th-percentile response time from vendor_data_service over the last 5 minutes is above 5 seconds
  environment: production
  summary: Vendor data responses are slow
state: ok | last evaluation: 25.154s ago | evaluation time: 208.7us
alert: Account Service down
expr: absent(ruby_rss{app_name="account_service",type="unicorn_master"})
for: 2m
labels: {service: account_service, severity: critical}
annotations:
  description: Account service hasn't been detected for 2 minutes
  environment: production
  summary: Account service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 110.6us

alert: Admin App down
expr: absent(ruby_rss{app_name="admin_app",type="unicorn_master"})
for: 2m
labels: {service: admin_app, severity: critical}
annotations:
  description: Admin App hasn't been detected for 2 minutes
  environment: production
  summary: Admin App is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 63.89us

alert: Agent App down
expr: absent(ruby_rss{app_name="agent_app",type="unicorn_master"})
for: 2m
labels: {service: agent_app, severity: critical}
annotations:
  description: Agent App hasn't been detected for 2 minutes
  environment: production
  summary: Agent App is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 115.7us

alert: Application Service down
expr: absent(ruby_rss{app_name="application_service",type="unicorn_master"})
for: 2m
labels: {service: application_service, severity: critical}
annotations:
  description: Application service hasn't been detected for 2 minutes
  environment: production
  summary: Application service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 122.8us

alert: Credit Reporting Service down
expr: absent(ruby_rss{app_name="credit_reporting_service",type="unicorn_master"})
for: 2m
labels: {service: credit_reporting_service, severity: critical}
annotations:
  description: Credit Reporting service hasn't been detected for 2 minutes
  environment: production
  summary: Credit Reporting service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 400.6us

alert: Customer Service down
expr: absent(ruby_rss{app_name="customer_service",type="unicorn_master"})
for: 2m
labels: {service: customer_service, severity: critical}
annotations:
  description: Customer service hasn't been detected for 2 minutes
  environment: production
  summary: Customer service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 152.8us

alert: Customer App down
expr: absent(ruby_rss{app_name="customer_app",type="unicorn_master"})
for: 2m
labels: {service: customer_app, severity: critical}
annotations:
  description: Customer App hasn't been detected for 2 minutes
  environment: production
  summary: Customer App is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 117.5us

alert: Email Service down
expr: absent(ruby_rss{app_name="email_service",type="unicorn_master"})
for: 2m
labels: {service: email_service, severity: critical}
annotations:
  description: Email service hasn't been detected for 2 minutes
  environment: production
  summary: Email service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 95.19us

alert: Financial Service down
expr: absent(ruby_rss{app_name="financial_service",type="unicorn_master"})
for: 2m
labels: {service: financial_service, severity: critical}
annotations:
  description: Financial service hasn't been detected for 2 minutes
  environment: production
  summary: Financial service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 170.8us

alert: Five9 Service down
expr: absent(ruby_rss{app_name="five9_service",type="unicorn_master"})
for: 2m
labels: {service: five9_service, severity: critical}
annotations:
  description: Five9 service hasn't been detected for 2 minutes
  environment: production
  summary: Five9 service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 78.37us

alert: Leads Service down
expr: absent(ruby_rss{app_name="leads_service",type="unicorn_master"})
for: 2m
labels: {service: leads_service, severity: critical}
annotations:
  description: Leads service hasn't been detected for 2 minutes
  environment: production
  summary: Leads service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 101.5us

alert: Payment Gateway Service down
expr: absent(ruby_rss{app_name="payment_gateway_service",type="unicorn_master"})
for: 2m
labels: {service: payment_gateway_service, severity: critical}
annotations:
  description: Payment Gateway service hasn't been detected for 2 minutes
  environment: production
  summary: Payment Gateway service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 83.35us

alert: Scheduler service down
expr: absent(ruby_rss{app_name="scheduler_service",type="sidekiq"})
for: 1m
labels: {service: scheduler_service, severity: critical}
annotations:
  description: Scheduler service sidekiq hasn't been detected for 1 minute
  environment: production
  summary: Scheduler service sidekiq not running
state: ok | last evaluation: 25.153s ago | evaluation time: 57.93us

alert: Underwriting Service down
expr: absent(ruby_rss{app_name="underwriting_service",type="unicorn_master"})
for: 2m
labels: {service: underwriting_service, severity: critical}
annotations:
  description: Underwriting service hasn't been detected for 2 minutes
  environment: production
  summary: Underwriting service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 115us

alert: Vendor Data Service down
expr: absent(ruby_rss{app_name="vendor_data_service",type="unicorn_master"})
for: 2m
labels: {service: vendor_data_service, severity: critical}
annotations:
  description: Vendor Data service hasn't been detected for 2 minutes
  environment: production
  summary: Vendor Data service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 173.6us

alert: Vendor Proxy Service is down
expr: absent(ruby_rss{app_name="vendor_proxy_service",type="unicorn_master"})
for: 2m
labels: {service: vendor_proxy_service, severity: critical}
annotations:
  description: Vendor Proxy service hasn't been detected for 2 minutes
  environment: production
  summary: Vendor Proxy service is not running
state: ok | last evaluation: 25.153s ago | evaluation time: 102.3us
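All of the "down" rules above follow the same absent() pattern, one rule per application. The note below spells out why absent() is used rather than a plain == 0 check; it is a general PromQL observation, and the app name is a placeholder, not a new rule in this deployment.

  # Why absent() rather than "ruby_rss{...} == 0":
  # when a unicorn master goes away its ruby_rss series stops being scraped and
  # drops out of the query result, so an "== 0" comparison returns no samples and
  # can never fire. absent() returns a synthetic 1-valued vector exactly when no
  # series matches the selector, which is what a missing app looks like here.
  absent(ruby_rss{app_name="<app_name>", type="unicorn_master"})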

rails (last evaluation: 6.928s ago, evaluation time: 4.532ms)
alert: Rails apps are returning 500s
expr: (ruby_http_requests_total{status="500"} - (ruby_http_requests_total{status="500"} offset 1m) > 0)
for: 1m
labels: {service: rails, severity: warning}
annotations:
  description: '{{ $labels.app_name }} is returning some HTTP 500 error responses!'
  environment: production
  summary: '{{ $labels.app_name }} is returning some HTTP 500 error responses!'
state: ok | last evaluation: 6.928s ago | evaluation time: 216.6us
alert: Z-Score indicator high traffic on Customer App
expr: ((sum by(app_name) (rate(job:customer_app_http_requests_total:z_score[5m]) > 0)) / (count by(app_name) (rate(job:customer_app_http_requests_total:z_score[5m]) > 0))) > 2
for: 1m
labels: {service: customer_app, severity: warning}
annotations:
  description: The z-score for customer_app requests is above +2.0
  environment: production
  summary: Traffic for customer_app is high, as indicated by its z-score
state: ok | last evaluation: 6.928s ago | evaluation time: 2.209ms

alert: Z-Score indicator low traffic on Customer App
expr: ((sum by(app_name) (rate(job:customer_app_http_requests_total:z_score[5m]) > 0)) / (count by(app_name) (rate(job:customer_app_http_requests_total:z_score[5m]) > 0))) < -1
for: 1m
labels: {service: customer_app, severity: warning}
annotations:
  description: The z-score for customer_app requests is below -1.0
  environment: production
  summary: Traffic for customer_app is low, as indicated by its z-score
state: ok | last evaluation: 6.926s ago | evaluation time: 2.019ms
alert: Slow responses from Rails apps
expr: avg_over_time(ruby_http_request_duration_seconds{quantile="0.99"}[3m]) > 20
for: 1m
labels: {service: rails, severity: info}
annotations:
  description: '{{ $labels.app_name }} has responses that are slower than 20 seconds for over 3 minutes'
  environment: production
  summary: '{{ $labels.app_name }} is responding slowly!'
state: ok | last evaluation: 6.924s ago | evaluation time: 74.25us
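The two z-score alerts above consume the job:customer_app_http_requests_total:z_score series produced by the recording_rules group further down this page. For reference, that recorded value is the standard score of the current 5-minute request rate against its one-week history, so a sustained value above +2 means roughly two weekly standard deviations above normal and below -1 means one standard deviation under it.

  # Definition of the recorded z-score (copied from the recording_rules group below)
  (
      job:customer_app_http_requests_total:rate5m                     # current 5m request rate
    - job:customer_app_http_requests_total:rate5m:avg_over_time_1w    # its 1-week mean
  )
  / job:customer_app_http_requests_total:rate5m:stddev_over_time_1w   # its 1-week standard deviation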

recording_rules (last evaluation: 600ms ago, evaluation time: 583.4ms)

record: job:customer_app_http_requests_total:rate5m
expr: rate(ruby_http_requests_total{app_name="customer_app"}[5m])
state: ok | last evaluation: 600ms ago | evaluation time: 1.988ms

record: job:customer_app_http_requests_total:rate5m:avg_over_time_1w
expr: avg_over_time(job:customer_app_http_requests_total:rate5m[1w])
state: ok | last evaluation: 598ms ago | evaluation time: 289.9ms

record: job:customer_app_http_requests_total:rate5m:stddev_over_time_1w
expr: stddev_over_time(job:customer_app_http_requests_total:rate5m[1w])
state: ok | last evaluation: 308ms ago | evaluation time: 287.4ms

record: job:customer_app_http_requests_total:z_score
expr: (job:customer_app_http_requests_total:rate5m - job:customer_app_http_requests_total:rate5m:avg_over_time_1w) / job:customer_app_http_requests_total:rate5m:stddev_over_time_1w
state: ok | last evaluation: 21ms ago | evaluation time: 3.982ms
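The four rules above form a chain: a 5-minute rate, its one-week mean, its one-week standard deviation, and the z-score built from the three. Reproducing the pipeline for another application only requires swapping the app_name selector and the rule names. The sketch below does that for agent_app purely as an example; it is not part of the deployed rules, and the group/file layout is an assumption. Because Prometheus evaluates rules within a group sequentially, the later rules can read the series recorded by the earlier ones in the same cycle.

  groups:
    - name: recording_rules
      rules:
        - record: job:agent_app_http_requests_total:rate5m
          expr: rate(ruby_http_requests_total{app_name="agent_app"}[5m])
        - record: job:agent_app_http_requests_total:rate5m:avg_over_time_1w
          expr: avg_over_time(job:agent_app_http_requests_total:rate5m[1w])
        - record: job:agent_app_http_requests_total:rate5m:stddev_over_time_1w
          expr: stddev_over_time(job:agent_app_http_requests_total:rate5m[1w])
        - record: job:agent_app_http_requests_total:z_score
          expr: >
            (job:agent_app_http_requests_total:rate5m
             - job:agent_app_http_requests_total:rate5m:avg_over_time_1w)
            / job:agent_app_http_requests_total:rate5m:stddev_over_time_1w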

redis (last evaluation: 4.943s ago, evaluation time: 276.3us)

alert: redis_down
expr: redis_up == 0
for: 5m
labels: {service: redis, severity: warning}
annotations:
  description: |-
    Redis instance is down
    VALUE = {{ $value }}
    LABELS: {{ $labels }}
  environment: production
  summary: Redis down (instance {{ $labels.instance }})
state: ok | last evaluation: 4.943s ago | evaluation time: 183.7us

alert: redis_high_memory_load
expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
for: 10m
labels: {service: redis, severity: warning}
annotations:
  description: |-
    Redis is running out of memory (> 90%)
    VALUE = {{ $value }}
    LABELS: {{ $labels }}
  environment: production
  summary: Out of memory (instance {{ $labels.instance }})
state: ok | last evaluation: 4.943s ago | evaluation time: 84.44us
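redis_high_memory_load measures Redis usage against the host's total memory. If a maxmemory limit is ever configured on these instances, the commonly used redis_exporter also publishes that limit as redis_memory_max_bytes, and a variant of the rule can alert against it instead. The expression below is a sketch of that variant under those assumptions, not one of the deployed rules.

  # Alert against the configured maxmemory ceiling instead of host RAM.
  # redis_memory_max_bytes is 0 when no maxmemory is set, so filter those
  # series out first to avoid dividing by zero.
  redis_memory_used_bytes
    / (redis_memory_max_bytes > 0)
    * 100 > 90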

security (last evaluation: 2.906s ago, evaluation time: 1.129ms)

alert: Unusually high traffic on Customer App
expr: rate(customer_app_http_response_count_total[5m]) > 5
for: 1m
labels: {service: rails, severity: warning}
annotations:
  description: The 5m average request rate for {{ $labels.app_name }} is significantly higher than usual
  environment: production
  summary: Rails app traffic for {{ $labels.app_name }} is abnormally high
state: ok | last evaluation: 2.906s ago | evaluation time: 956.4us
alert: Potential Credential Stuffing attack
expr: rate(ruby_http_request_duration_seconds_count{action=~"create",app_name=~"customer_app",controller=~"sessions"}[10m]) > 0.2
for: 1m
labels: {service: rails, severity: critical}
annotations:
  description: The 10m average rate of login (sessions#create) requests on {{ $labels.app_name }} is significantly higher than usual
  environment: production
  summary: Login attempts against {{ $labels.app_name }} are abnormally high and may indicate credential stuffing
state: ok | last evaluation: 2.905s ago | evaluation time: 163.7us
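To put the credential-stuffing threshold in absolute terms: rate(...[10m]) > 0.2 means an average of more than 0.2 session-create requests per second across the 10-minute window, i.e. on the order of 120 login attempts before the 1-minute for clause starts counting. The restated expression below is only a reading aid and uses the same selectors as the rule above.

  # 0.2 requests/second * 600 seconds = 120 sessions#create requests in 10 minutes
  rate(ruby_http_request_duration_seconds_count{
    app_name=~"customer_app", controller=~"sessions", action=~"create"
  }[10m]) > 0.2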