Zyntem Fiscalization - Operational Runbook

Company: Zyntem
Product: Fiscalization by Zyntem
Version: 1.0
Last Updated: 2025-10-30
Status: Active


Overview

This runbook provides operational procedures for Fiscalization by Zyntem, including normal operations, incident response, and emergency procedures.

Target Audience: On-call engineers, DevOps, SREs

Emergency Contact: +1-XXX-XXX-XXXX (update with real number)


Table of Contents

  1. System Architecture
  2. Normal Operations
  3. Incident Response
  4. Emergency Procedures
  5. Common Issues
  6. Monitoring & Alerts
  7. Contact Information

System Architecture

Components

graph TD
A[User/Client] --> B[Cloud Run: Core API]
B --> C[Cloud SQL: PostgreSQL]
B --> D[Redis: Rate Limiting]
B --> E[Secret Manager: Credentials]
B --> F[Cloud Storage: Receipts]
B --> G[Adapters: Spain/Italy/France]
G --> H[Tax Authorities]

Critical Services

| Service | Platform | Region | Purpose |
|---|---|---|---|
| core-api | Cloud Run | europe-west1 | Main API gateway |
| dashboard | Cloud Run | europe-west1 | Customer UI |
| docs | Cloud Run | europe-west1 | Documentation |
| adapter-spain | Cloud Run | europe-west1 | Spain fiscalization |
| adapter-italy | Cloud Run | europe-west1 | Italy fiscalization |
| adapter-france | Cloud Run | europe-west1 | France fiscalization |
| postgres | Cloud SQL | europe-west1 | Primary database |
| redis | Cloud Memorystore | europe-west1 | Cache & rate limiting |

Dependencies

External Services:

  • Stripe (payment processing)
  • AEAT (Spain tax authority)
  • AdE SDI (Italy tax authority)
  • GCP Services (Cloud Run, Cloud SQL, etc.)

Normal Operations

Daily Checks

Morning Checklist (5 minutes):

  1. Check System Health

    # Verify all services running
    gcloud run services list --region europe-west1

    # Check health endpoints
    curl https://core-api-dev-zyntem.run.app/health
    curl https://dashboard-dev-zyntem.run.app/health
  2. Review Metrics

    • Open Cloud Monitoring dashboard
    • Verify error rate < 5%
    • Verify P99 latency < 2s
    • Check no active alerts
  3. Check Recent Deployments

    gcloud run revisions list --service core-api --region europe-west1 --limit 5
  4. Review Error Logs

    gcloud logging read "severity>=ERROR" --limit 20 --freshness=1d

Weekly Checks

Monday Morning (15 minutes):

  1. Certificate Expiration Check

    # Check certificates expiring in < 30 days
    # (Automated alert should notify, but double-check)
  2. Database Performance Review

    # Connect to Cloud SQL
    gcloud sql connect zyntem-dev --user=postgres

    # Check slow queries
    SELECT query, mean_exec_time, calls
    FROM pg_stat_statements
    WHERE mean_exec_time > 1000
    ORDER BY mean_exec_time DESC
    LIMIT 10;
  3. Backup Verification

    # Verify automated backups exist
    gcloud sql backups list --instance=zyntem-dev
  4. Review Capacity & Costs

    • Check Cloud SQL storage usage
    • Review Cloud Run instance scaling
    • Monitor GCP billing

Monthly Checks

First Monday of Month (30 minutes):

  1. Security Review

    • Review IAM permissions
    • Check for unused service accounts
    • Rotate API keys (if policy requires)
  2. Dependency Updates

    • Review Go dependency updates: go list -u -m all
    • Review Node dependency updates: npm outdated
    • Schedule dependency upgrade sprint
  3. Disaster Recovery Test

    • Test database restore from backup
    • Verify rollback procedures
    • Test failover scenarios (Phase 2)

Incident Response

Severity Levels

| Severity | Description | Response Time | Example |
|---|---|---|---|
| P0 - Critical | Complete service outage | 15 minutes | Database down, API returning 500s |
| P1 - High | Major feature broken | 1 hour | Transactions failing, dashboard down |
| P2 - Medium | Degraded performance | 4 hours | High latency, some errors |
| P3 - Low | Minor issue | 24 hours | UI bug, documentation error |

Incident Response Process

graph TD
A[Alert Triggered] --> B[Acknowledge Alert]
B --> C[Assess Severity]
C --> D{P0/P1?}
D -->|Yes| E[Page On-Call]
D -->|No| F[Create Ticket]
E --> G[Triage & Investigate]
G --> H[Implement Fix]
H --> I[Verify Resolution]
I --> J[Post-Mortem]
F --> K[Schedule Work]
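
The triage branch in the flow above can be sketched as two small shell functions. This is a minimal illustration, not part of the existing tooling: the function names `triage_action` and `response_target` are hypothetical, and the values come from the Severity Levels table.

```shell
#!/bin/sh
# Map a severity level to the next step in the incident flow.
# P0/P1 page the on-call engineer; P2/P3 become a ticket.
triage_action() {
  case "$1" in
    P0|P1) echo "page-on-call" ;;
    P2|P3) echo "create-ticket" ;;
    *)     echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Response-time target for each severity, taken from the Severity Levels table.
response_target() {
  case "$1" in
    P0) echo "15 minutes" ;;
    P1) echo "1 hour" ;;
    P2) echo "4 hours" ;;
    P3) echo "24 hours" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `triage_action P1` prints `page-on-call` and `response_target P1` prints `1 hour`.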

Step 1: Acknowledge Alert

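# Silence alert (if using Cloud Monitoring)
# Note: this disables the whole policy; re-enable it with --enabled once the incident is resolved
gcloud alpha monitoring policies update POLICY_ID --no-enabled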

# Or acknowledge in PagerDuty/Opsgenie

Time: < 5 minutes


Step 2: Assess Impact

Questions to Answer:

  • How many customers affected?
  • What functionality is broken?
  • Is data at risk?
  • Is this a security incident?

Commands:

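# Check error rate (5xx request counts over the last hour, via the Cloud Monitoring API)
curl -s -G "https://monitoring.googleapis.com/v3/projects/zyntem-dev/timeSeries" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
--data-urlencode 'filter=metric.type="run.googleapis.com/request_count" AND metric.labels.response_code_class="5xx"' \
--data-urlencode "interval.startTime=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
--data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Check affected services
# (relies on a custom "status" label being maintained on each service; Cloud Run does not set it)
gcloud run services list --filter="metadata.labels.status=unhealthy"

# Count error log entries (last 1 hour)
gcloud logging read 'severity>=ERROR' --freshness=1h --format='value(timestamp)' | wc -l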
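
Once an error count and a total request count are in hand, the comparison against the 5% alert threshold is simple arithmetic. The sketch below assumes both counts were obtained separately (for example from the queries above); the function names `error_rate_pct` and `exceeds_threshold` are illustrative only.

```shell
#!/bin/sh
# error_rate_pct ERRORS TOTAL
# Prints the error rate as a percentage with two decimals.
error_rate_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "0.00"; exit }
    printf "%.2f\n", (e / t) * 100
  }'
}

# exceeds_threshold ERRORS TOTAL [THRESHOLD_PCT]
# Exits 0 when the error rate is above the threshold (default 5, as in the alert policy).
exceeds_threshold() {
  awk -v e="$1" -v t="$2" -v th="${3:-5}" 'BEGIN {
    rate = (t == 0) ? 0 : (e / t) * 100
    exit !(rate > th)
  }'
}
```

With 30 errors out of 400 requests, `error_rate_pct 30 400` prints `7.50` and `exceeds_threshold 30 400` succeeds, which points to at least a P2 and probably a page.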

Step 3: Communicate Status

Update Status Page:

Notify Team:

Slack: #fiscalization-incidents
Subject: [P0] API Returning 500 Errors
Status: Investigating
ETA: Unknown
Lead: @engineer-name
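
To keep status updates consistent under pressure, the template above can be generated rather than typed. This is a minimal sketch; `incident_message` is a hypothetical helper, and posting the result to Slack is left to whatever integration the team already uses.

```shell
#!/bin/sh
# incident_message SEVERITY TITLE STATUS ETA LEAD
# Prints the team notification in the format used by this runbook.
incident_message() {
  printf 'Slack: #fiscalization-incidents\n'
  printf 'Subject: [%s] %s\n' "$1" "$2"
  printf 'Status: %s\n' "$3"
  printf 'ETA: %s\n' "${4:-Unknown}"
  printf 'Lead: %s\n' "$5"
}
```

Usage: `incident_message P0 "API Returning 500 Errors" Investigating Unknown @engineer-name`.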

Step 4: Investigate & Fix

Common Investigation Commands:

# Check service health
curl -v https://core-api-dev-zyntem.run.app/health

# View recent logs
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" --limit 50 --freshness=1h

# Check service conditions (readiness, latest revision status)
gcloud run services describe core-api --region europe-west1 --format="value(status.conditions)"

# Check database connectivity
gcloud sql connect zyntem-dev --user=postgres
# Then: SELECT 1;

# Check Redis connectivity
redis-cli -h REDIS_IP ping

Fix Options:

  1. Rollback (fastest, use if new deployment caused issue)
  2. Hotfix (if root cause identified and fix is simple)
  3. Workaround (temporary fix to restore service)

Step 5: Verify Resolution

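# Check error rate returned to normal (count 5xx responses from core-api in the last 15 minutes)
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="core-api" AND httpRequest.status>=500' --freshness=15m --format='value(timestamp)' | wc -l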

# Run smoke tests
./scripts/smoke-tests.sh

# Monitor for 15 minutes
watch -n 10 'curl -s https://core-api-dev-zyntem.run.app/health | jq .'
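
The `watch` command above needs someone looking at the screen. As an alternative sketch, the loop below runs a check repeatedly and fails on the first bad result, so it can gate the "resolved" announcement. `stable_for` is a hypothetical helper; its variables are global because plain POSIX sh has no `local`.

```shell
#!/bin/sh
# stable_for CHECKS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to CHECKS times, sleeping INTERVAL_SECONDS between runs.
# Returns 0 only if every run succeeds; stops at the first failure.
stable_for() {
  checks=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$checks" ]; do
    if ! "$@" >/dev/null 2>&1; then
      echo "check $i of $checks failed" >&2
      return 1
    fi
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "all $checks checks passed"
}
```

For the 15-minute window in this step, 90 checks at 10-second intervals would be `stable_for 90 10 curl -fsS https://core-api-dev-zyntem.run.app/health`. The `-f` flag makes curl return non-zero on HTTP errors.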

Step 6: Post-Mortem

Within 48 hours of resolution:

  1. Create Post-Mortem Document

    • What happened?
    • Root cause?
    • How was it detected?
    • How was it resolved?
    • How to prevent recurrence?
  2. Action Items

    • Immediate fixes
    • Long-term improvements
    • Monitoring enhancements
    • Process changes

Emergency Procedures

Emergency Rollback

When: Deployment caused critical issue

Time to Execute: < 30 seconds (instant traffic shift to previous revision)

Option 1: Instant traffic rollback (recommended)

# List recent revisions to identify the previous stable one
gcloud run revisions list \
--service core-api \
--region europe-west1 \
--limit 5

# Shift 100% traffic to the previous stable revision
gcloud run services update-traffic core-api \
--to-revisions=<PREVIOUS_REVISION>=100 \
--region europe-west1

# Verify
curl https://core-api-dev-zyntem.run.app/health
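
Picking `<PREVIOUS_REVISION>` by eye is error-prone during an incident. The sketch below selects the revision listed immediately after the current one. It assumes the revision names arrive newest first, one per line; `previous_revision` is a hypothetical helper and is not the logic used by `emergency-rollback.sh`.

```shell
#!/bin/sh
# previous_revision CURRENT
# Reads revision names from stdin (newest first, one per line) and prints
# the name that follows CURRENT, i.e. the next-older revision.
# Exits non-zero if CURRENT is not found or has no older revision.
previous_revision() {
  awk -v cur="$1" '
    found     { print $1; ok = 1; exit }
    $1 == cur { found = 1 }
    END       { exit !ok }
  '
}
```

A possible pipeline is `gcloud run revisions list --service core-api --region europe-west1 --sort-by='~metadata.creationTimestamp' --format='value(metadata.name)' | previous_revision core-api-00010-abc`. The explicit `--sort-by` is there so the newest-first assumption does not depend on the default ordering.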

Option 2: Emergency rollback script

./scripts/emergency-rollback.sh core-api

Note: The deploy workflow (deploy.yml) includes automatic rollback on failure. If a health check fails during traffic shifting, traffic is automatically reverted to the previous revision without manual intervention.


Database Emergency Restore

When: Database corruption or data loss

Time to Execute: 15-30 minutes

# List available backups
gcloud sql backups list --instance=zyntem-dev

# Restore in place (overwrites the current data on zyntem-dev)
gcloud sql backups restore BACKUP_ID \
--restore-instance=zyntem-dev

# Or restore to a separate instance (the target instance must already exist)
gcloud sql backups restore BACKUP_ID \
--restore-instance=zyntem-dev-restore \
--backup-instance=zyntem-dev

# Update application to point to restored instance
# (requires configuration change and deployment)

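WARNING: This will cause downtime, and an in-place restore discards everything written after the backup was taken. Use it only when the existing corruption or data loss cannot be tolerated.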


Enable Maintenance Mode

When: Need to perform emergency maintenance

Time to Execute: < 1 minute

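# Deploy the maintenance image as a new core-api revision without sending it traffic
# (traffic can only be routed to revisions of the same service; the suffix must be unused)
gcloud run deploy core-api \
--image gcr.io/zyntem/fiscalization-api/maintenance:latest \
--revision-suffix=maintenance \
--no-traffic \
--region europe-west1

# Redirect traffic to the maintenance revision
gcloud run services update-traffic core-api \
--to-revisions=core-api-maintenance=100 \
--region europe-west1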

Maintenance Page Response:

{
"status": "maintenance",
"message": "Fiscalization is undergoing emergency maintenance. Expected resolution: 2025-10-29 12:00 UTC",
"retry_after": 3600
}
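
The resolution time and retry interval in that response change with every incident. As a small sketch, the payload can be produced from two arguments. `maintenance_payload` is a hypothetical helper; how the maintenance image actually reads its response body is not specified in this runbook, and the timestamp is assumed to contain no double quotes.

```shell
#!/bin/sh
# maintenance_payload EXPECTED_RESOLUTION RETRY_AFTER_SECONDS
# Prints the maintenance response body in the shape shown above.
maintenance_payload() {
  printf '{\n'
  printf '"status": "maintenance",\n'
  printf '"message": "Fiscalization is undergoing emergency maintenance. Expected resolution: %s",\n' "$1"
  printf '"retry_after": %d\n' "$2"
  printf '}\n'
}
```

Usage: `maintenance_payload "2025-10-29 12:00 UTC" 3600`.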

Circuit Breaker Manual Override

When: External service (tax authority) is down

Action: Manually enable circuit breaker to prevent cascading failures

# Set circuit breaker state in Redis
redis-cli -h REDIS_IP SET "circuit_breaker:spain:aeat" "open"

# This triggers graceful degradation (FR30):
# - Receipts generated immediately
# - Fiscalization queued for retry

Revert:

redis-cli -h REDIS_IP DEL "circuit_breaker:spain:aeat"
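
A manual override is easy to forget once the tax authority recovers. The sketch below builds the key in the `circuit_breaker:<country>:<authority>` pattern shown above and prints the `redis-cli` command for review instead of running it. The helper names are hypothetical, and only the `spain:aeat` key is confirmed by this runbook. The optional `EX` argument is standard Redis key expiry; it assumes the application treats a missing key as a closed breaker, which the `DEL` revert suggests but does not prove.

```shell
#!/bin/sh
# breaker_key COUNTRY AUTHORITY
breaker_key() {
  printf 'circuit_breaker:%s:%s\n' "$1" "$2"
}

# breaker_command open|close COUNTRY AUTHORITY [TTL_SECONDS]
# Prints the redis-cli command; REDIS_IP is left as a placeholder.
breaker_command() {
  key=$(breaker_key "$2" "$3")
  case "$1" in
    open)
      if [ -n "${4:-}" ]; then
        echo "redis-cli -h REDIS_IP SET \"$key\" \"open\" EX $4"
      else
        echo "redis-cli -h REDIS_IP SET \"$key\" \"open\""
      fi
      ;;
    close)
      echo "redis-cli -h REDIS_IP DEL \"$key\""
      ;;
    *)
      echo "usage: breaker_command open|close COUNTRY AUTHORITY [TTL_SECONDS]" >&2
      return 1
      ;;
  esac
}
```

For example, `breaker_command open spain aeat 3600` prints a `SET` command that expires after one hour, and `breaker_command close spain aeat` prints the matching `DEL`.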

How to Check Cloud Run Logs

Via gcloud CLI

# Live tail of logs for the core-api service
gcloud run services logs tail core-api --region europe-west1

# Recent logs (last hour)
gcloud logging read \
'resource.type="cloud_run_revision" AND resource.labels.service_name="core-api"' \
--limit 100 --freshness=1h

# Errors only
gcloud logging read \
'resource.type="cloud_run_revision" AND resource.labels.service_name="core-api" AND severity>=ERROR' \
--limit 50

# Logs for a specific revision (useful during deployments)
gcloud logging read \
'resource.type="cloud_run_revision" AND resource.labels.revision_name="core-api-00010-abc"' \
--limit 50

Via Cloud Console

  1. Navigate to: https://console.cloud.google.com/run?project=zyntem-dev
  2. Select the core-api service
  3. Click the Logs tab
  4. Use severity filters (Error, Warning, Info) to narrow results

Common Issues

Issue 1: High Error Rate

Symptoms:

  • Alert: "Error rate > 5%"
  • Cloud Monitoring shows spike in 5xx errors

Investigation:

# Check error logs
gcloud logging read "severity=ERROR" --limit 50

# Check service status
gcloud run services describe core-api --region europe-west1

# Check recent Cloud SQL operations (restarts, maintenance, failovers)
gcloud sql operations list --instance=zyntem-dev

Common Causes:

  1. Database connection pool exhausted
  2. External service timeout (Stripe, tax authorities)
  3. Memory leak / OOM
  4. Bad deployment

Resolution:

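# If database issue: Raise the connection limit
# (--database-flags replaces all existing flags and may restart the instance)
gcloud sql instances patch zyntem-dev --database-flags max_connections=200

# If bad deployment: Rollback
./scripts/emergency-rollback.sh core-api

# If external service: Enable circuit breaker (key format as in "Circuit Breaker Manual Override")
redis-cli -h REDIS_IP SET "circuit_breaker:COUNTRY:SERVICE" "open"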

Issue 2: High Latency

Symptoms:

  • Alert: "P99 latency > 2s"
  • Requests timing out
  • Customer complaints of slow responses

Investigation:

# Check Cloud Run configuration (CPU/memory limits, scaling, concurrency)
gcloud run services describe core-api --region europe-west1

# Check database slow queries
# (See database query from "Normal Operations" section)

# Check Redis latency
redis-cli --latency -h REDIS_IP
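
When the dashboard is unavailable, P99 can be estimated from a sample of request latencies. The sketch below uses the nearest-rank method and assumes one latency value in milliseconds per line on stdin. `percentile` is a hypothetical helper, and how the sample is exported is left open.

```shell
#!/bin/sh
# percentile P
# Reads one numeric value per line on stdin and prints the P-th percentile
# using the nearest-rank method. P is an integer from 1 to 100.
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (idx < 1) idx = 1
      print v[idx]
    }
  '
}
```

For the five samples `120 80 2500 95 110`, `percentile 99` prints `2500`, above the 2 s (2000 ms) alert threshold, while `percentile 50` prints `110`.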

Common Causes:

  1. Cold starts (low traffic)
  2. Slow database queries
  3. External service latency
  4. Insufficient resources

Resolution:

# Prevent cold starts
gcloud run services update core-api --min-instances=1 --region europe-west1

# Scale up resources
gcloud run services update core-api --memory=1Gi --region europe-west1

# Add database indexes (if slow query identified)

Issue 3: Database Connection Errors

Symptoms:

  • Errors: could not connect to server
  • API returns 500 errors
  • Database queries timing out

Investigation:

# Check Cloud SQL status
gcloud sql instances describe zyntem-dev

# Check recent instance operations (restarts, maintenance, failovers)
gcloud sql operations list --instance=zyntem-dev --limit 10

# Check Cloud SQL logs
gcloud logging read "resource.type=cloudsql_database" --limit 50

Common Causes:

  1. Cloud SQL instance down
  2. Connection pool exhausted
  3. Network issue
  4. Maintenance window

Resolution:

# If instance down: Restart
gcloud sql instances restart zyntem-dev

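# If connection pool issue: Raise the connection limit
# (--database-flags replaces all existing flags and may restart the instance)
gcloud sql instances patch zyntem-dev --database-flags max_connections=200

# Emergency: Fail over to the standby (only works if the instance is configured for high availability)
gcloud sql instances failover zyntem-dev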

Issue 4: Certificate Expiration

Symptoms:

  • Alert: "Certificate expires in < 30 days"
  • Tax authority API calls failing with SSL errors

Investigation:

# Check certificate expiration
openssl x509 -in certificate.pem -noout -enddate
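
The `-enddate` output still has to be compared against the 30-day alert window. The sketch below converts that line into days remaining. It assumes GNU `date` (for `-d`), as other commands in this runbook do; `days_until_expiry` is a hypothetical helper, and its optional second argument fixes "now" as an epoch timestamp.

```shell
#!/bin/sh
# days_until_expiry "notAfter=Jan  1 00:00:00 2026 GMT" [NOW_EPOCH]
# Prints the number of whole days until the certificate expires
# (negative if it has already expired). Requires GNU date.
days_until_expiry() {
  end=${1#notAfter=}
  now=${2:-$(date -u +%s)}
  end_epoch=$(date -u -d "$end" +%s) || return 1
  echo $(( (end_epoch - now) / 86400 ))
}
```

Usage: `days_until_expiry "$(openssl x509 -in certificate.pem -noout -enddate)"`. For a plain yes/no answer, `openssl x509 -in certificate.pem -noout -checkend 2592000` exits non-zero when the certificate expires within 30 days.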

Resolution:

# Renew certificate (process varies by country)
# Spain: Contact AEAT for renewal
# Italy: Use InfoCert portal
# France: Automated renewal (if applicable)

# Upload new certificate to Secret Manager
gcloud secrets versions add CERT_NAME --data-file=new-certificate.pem

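# Deploy a new revision so the service picks up the new certificate
# (CERT_ROTATED_AT is a placeholder env var used only to force a new revision)
gcloud run services update adapter-spain --region europe-west1 \
--update-env-vars CERT_ROTATED_AT=$(date -u +%Y%m%dT%H%M%SZ)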

Monitoring & Alerts

Dashboards

Primary Dashboard:

Secondary Dashboards:

  • Cloud Run Service Metrics
  • Cloud SQL Performance
  • Error Logs

Alert Policies

| Alert | Threshold | Action |
|---|---|---|
| High Error Rate | > 5% for 2 min | Page on-call, investigate immediately |
| High Latency | P99 > 2s for 2 min | Page on-call, check database/resources |
| Database Down | Connection failures > 3 | Page on-call, emergency restore |
| Certificate Expiring | < 30 days | Email owner, schedule renewal |
| Disk Full | > 90% | Email team, clean up or scale |

Log Queries

Useful Log Queries:

# All errors in last hour
gcloud logging read 'severity>=ERROR' --freshness=1h

# Errors for specific service
gcloud logging read 'resource.type=cloud_run_revision AND resource.labels.service_name=core-api AND severity>=ERROR' --limit 50

# Slow database queries (>= 1s; requires log_min_duration_statement to be set on the instance)
gcloud logging read 'resource.type=cloudsql_database AND textPayload=~"duration: [0-9]{4,}[.][0-9]+ ms"' --limit 20

# Failed transactions
gcloud logging read 'jsonPayload.transaction.status="failed"' --limit 50
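
The service-specific query above is repeated throughout this runbook with only the service name and severity changing. A small builder keeps the quoting in one place. This is a sketch; `log_filter` is a hypothetical helper, not an existing script.

```shell
#!/bin/sh
# log_filter SERVICE [MIN_SEVERITY]
# Prints a Cloud Logging filter for one Cloud Run service.
# MIN_SEVERITY defaults to ERROR.
log_filter() {
  printf 'resource.type="cloud_run_revision" AND resource.labels.service_name="%s" AND severity>=%s\n' \
    "$1" "${2:-ERROR}"
}
```

Usage: `gcloud logging read "$(log_filter adapter-spain WARNING)" --limit 50 --freshness=1h`.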

Contact Information

On-Call Rotation

| Week | Primary On-Call | Backup |
|---|---|---|
| Week 1 | Engineer A (+1-XXX-XXX-XXXX) | Engineer B (+1-XXX-XXX-XXXX) |
| Week 2 | Engineer B (+1-XXX-XXX-XXXX) | Engineer C (+1-XXX-XXX-XXXX) |
| Week 3 | Engineer C (+1-XXX-XXX-XXXX) | Engineer A (+1-XXX-XXX-XXXX) |

Escalation Path

  1. Level 1: On-Call Engineer
  2. Level 2: Engineering Lead (javier.sanchez@zyntem.com)
  3. Level 3: CTO/Technical Advisor

External Contacts

GCP Support:

Stripe Support:

Tax Authority Contacts:

  • Spain AEAT: +34-XXX-XXX-XXX
  • Italy AdE: +39-XXX-XXX-XXX

Runbook Maintenance

Review Schedule: Quarterly (every 3 months)

Last Reviewed: 2025-10-29

Next Review: 2026-01-29

Changelog:

  • 2025-10-29: Initial version

Additional Resources