Zyntem Fiscalization - Operational Runbook

Company: Zyntem
Product: Fiscalization by Zyntem
Version: 1.0
Last Updated: 2025-10-30
Status: Active

Overview

This runbook provides operational procedures for Fiscalization by Zyntem, including normal operations, incident response, and emergency procedures.

Target Audience: On-call engineers, DevOps, SREs

Emergency Contact: +1-XXX-XXX-XXXX (update with real number)

Table of Contents

System Architecture
Normal Operations
Incident Response
Emergency Procedures
Common Issues
Monitoring & Alerts
Contact Information

System Architecture

Components

graph TD
    A[User/Client] --> B[Cloud Run: Core API]
    B --> C[Cloud SQL: PostgreSQL]
    B --> D[Redis: Rate Limiting]
    B --> E[Secret Manager: Credentials]
    B --> F[Cloud Storage: Receipts]
    B --> G[Adapters: Spain/Italy/France]
    G --> H[Tax Authorities]

Critical Services

Service	Platform	Region	Purpose
core-api	Cloud Run	europe-west1	Main API gateway
dashboard	Cloud Run	europe-west1	Customer UI
docs	Cloud Run	europe-west1	Documentation
adapter-spain	Cloud Run	europe-west1	Spain fiscalization
adapter-italy	Cloud Run	europe-west1	Italy fiscalization
adapter-france	Cloud Run	europe-west1	France fiscalization
postgres	Cloud SQL	europe-west1	Primary database
redis	Cloud Memorystore	europe-west1	Cache & rate limiting

Dependencies

External Services:

Stripe (payment processing)
AEAT (Spain tax authority)
AdE SDI (Italy tax authority)
GCP Services (Cloud Run, Cloud SQL, etc.)

Normal Operations

Daily Checks

Morning Checklist (5 minutes):

Check System Health

# Verify all services running
gcloud run services list --region europe-west1

# Check health endpoints
curl https://core-api-dev-zyntem.run.app/health
curl https://dashboard-dev-zyntem.run.app/health
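
The two health checks above can be wrapped in a small loop so that a failing endpoint stands out at a glance. This is a minimal sketch: `check_endpoints` and `http_ok` are hypothetical helper names, not part of the existing tooling, and the checker is passed in as an argument so it can be swapped out.

```shell
# check_endpoints CHECKER URL...
# Runs CHECKER against each URL and prints one OK/FAIL line per endpoint.
# Returns non-zero if any endpoint failed.
check_endpoints() {
  checker=$1
  shift
  failed=0
  for url in "$@"; do
    if "$checker" "$url" >/dev/null 2>&1; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Default checker: curl -f fails on HTTP 4xx/5xx; give up after 10 seconds.
http_ok() {
  curl -fsS --max-time 10 "$1"
}

# Usage:
#   check_endpoints http_ok \
#     https://core-api-dev-zyntem.run.app/health \
#     https://dashboard-dev-zyntem.run.app/health
```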

Review Metrics
- Open Cloud Monitoring dashboard
- Verify error rate < 5%
- Verify P99 latency < 2s
- Check no active alerts

Check Recent Deployments

gcloud run revisions list --service core-api --region europe-west1 --limit 5

Review Error Logs

gcloud logging read "severity>=ERROR" --limit 20 --freshness=1d

Weekly Checks

Monday Morning (15 minutes):

Certificate Expiration Check

# Check certificates expiring in < 30 days
# (Automated alert should notify, but double-check)
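
The runbook does not list a command for this check, so the following is only a sketch of how the 30-day rule could be verified by hand. The function names are hypothetical, `cert_days_left` assumes `openssl` and GNU `date` are installed, and the path of the certificate file is left to the operator.

```shell
# days_between NOW_EPOCH EXPIRY_EPOCH -> whole days remaining (truncated toward zero)
days_between() {
  echo $(( ($2 - $1) / 86400 ))
}

# needs_renewal DAYS_LEFT [THRESHOLD_DAYS] -> exit 0 when fewer than THRESHOLD_DAYS remain
needs_renewal() {
  [ "$1" -lt "${2:-30}" ]
}

# cert_days_left PEM_FILE -> days until the certificate's notAfter date
# Assumes openssl and GNU date (for `date -d`).
cert_days_left() {
  not_after=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  days_between "$(date -u +%s)" "$(date -u -d "$not_after" +%s)"
}

# Usage:
#   if needs_renewal "$(cert_days_left certificate.pem)"; then echo "renew soon"; fi
```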

Database Performance Review

# Connect to Cloud SQL
gcloud sql connect zyntem-dev --user=postgres

-- Check slow queries at the psql prompt (mean_exec_time is in milliseconds; requires the pg_stat_statements extension)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time DESC
LIMIT 10;

Backup Verification

# Verify automated backups exist
gcloud sql backups list --instance=zyntem-dev

Review Capacity & Costs
- Check Cloud SQL storage usage
- Review Cloud Run instance scaling
- Monitor GCP billing

Monthly Checks

First Monday of Month (30 minutes):

Security Review
- Review IAM permissions
- Check for unused service accounts
- Rotate API keys (if policy requires)
Dependency Updates
- Review Go dependency updates: go list -u -m all
- Review Node dependency updates: npm outdated
- Schedule dependency upgrade sprint
Disaster Recovery Test
- Test database restore from backup
- Verify rollback procedures
- Test failover scenarios (Phase 2)

Incident Response

Severity Levels

Severity	Description	Response Time	Example
P0 - Critical	Complete service outage	15 minutes	Database down, API returning 500s
P1 - High	Major feature broken	1 hour	Transactions failing, dashboard down
P2 - Medium	Degraded performance	4 hours	High latency, some errors
P3 - Low	Minor issue	24 hours	UI bug, documentation error

Incident Response Process

graph TD
    A[Alert Triggered] --> B[Acknowledge Alert]
    B --> C[Assess Severity]
    C --> D{P0/P1?}
    D -->|Yes| E[Page On-Call]
    D -->|No| F[Create Ticket]
    E --> G[Triage & Investigate]
    G --> H[Implement Fix]
    H --> I[Verify Resolution]
    I --> J[Post-Mortem]
    F --> K[Schedule Work]
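
The severity table and the flowchart above reduce to two lookups: the response-time target for a severity, and whether the alert pages on-call or becomes a ticket. The sketch below encodes exactly those values; the function names are hypothetical and nothing here talks to a paging system.

```shell
# response_minutes SEVERITY -> response-time target in minutes (from the severity table)
response_minutes() {
  case "$1" in
    P0) echo 15 ;;
    P1) echo 60 ;;
    P2) echo 240 ;;
    P3) echo 1440 ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# route_alert SEVERITY -> "page" for P0/P1, "ticket" otherwise (from the flowchart)
route_alert() {
  case "$1" in
    P0|P1) echo page ;;
    P2|P3) echo ticket ;;
    *)     echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```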

Step 1: Acknowledge Alert

# Stop the policy from firing while you work (re-enable it with --enabled after the incident)
gcloud alpha monitoring policies update POLICY_ID --no-enabled

# Or acknowledge in PagerDuty/Opsgenie

Time: < 5 minutes

Step 2: Assess Impact

Questions to Answer:

How many customers affected?
What functionality is broken?
Is data at risk?
Is this a security incident?

Commands:

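# Check 5xx request counts over the last hour
# (`gcloud monitoring` has no `read` command, so query the Monitoring API directly)
curl -s -G "https://monitoring.googleapis.com/v3/projects/zyntem-dev/timeSeries" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="run.googleapis.com/request_count" AND metric.labels.response_code_class="5xx"' \
  --data-urlencode "interval.startTime=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"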

# Check affected services (relies on a custom `status` label being maintained on each service)
gcloud run services list --region europe-west1 --filter="metadata.labels.status=unhealthy"

# Count error log entries (last 1 hour; requires GNU date, timestamp is UTC with a trailing Z)
gcloud logging read 'severity>=ERROR AND timestamp>="'$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)'"' --format='value(timestamp)' | wc -l
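
Once the error and total request counts are known, the error rate can be compared against the 5% alert threshold used elsewhere in this runbook. A minimal sketch, with hypothetical function names and counts supplied by the operator:

```shell
# error_rate_pct ERRORS TOTAL -> percentage with two decimals (0.00 when TOTAL is 0)
error_rate_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t + 0 == 0) printf "0.00\n"
    else printf "%.2f\n", e * 100 / t
  }'
}

# over_threshold RATE [THRESHOLD] -> exit 0 when RATE is strictly above THRESHOLD (default 5)
over_threshold() {
  awk -v r="$1" -v t="${2:-5}" 'BEGIN { exit !(r + 0 > t + 0) }'
}

# Usage:
#   rate=$(error_rate_pct 30 200)
#   if over_threshold "$rate"; then echo "error rate ${rate}% is above 5%"; fi
```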

Step 3: Communicate Status

Update Status Page:

https://status.zyntem.dev (Epic 5)
Post incident notice
Provide ETA if known

Notify Team:

Slack: #fiscalization-incidents
Subject: [P0] API Returning 500 Errors
Status: Investigating
ETA: Unknown
Lead: @engineer-name
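
To keep notices consistent under pressure, the template above can be generated rather than typed. This is only a sketch: `incident_notice` is a hypothetical helper that prints the text for pasting into the channel, and the defaults for status, ETA, and lead are assumptions.

```shell
# incident_notice SEVERITY TITLE [STATUS] [ETA] [LEAD]
# Prints the four-line incident notice used in #fiscalization-incidents.
incident_notice() {
  printf 'Subject: [%s] %s\n' "$1" "$2"
  printf 'Status: %s\n' "${3:-Investigating}"
  printf 'ETA: %s\n' "${4:-Unknown}"
  printf 'Lead: %s\n' "${5:-unassigned}"
}

# Usage:
#   incident_notice P0 "API Returning 500 Errors" Investigating Unknown @engineer-name
```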

Step 4: Investigate & Fix

Common Investigation Commands:

# Check service health
curl -v https://core-api-dev-zyntem.run.app/health

# View recent logs
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" --limit 50 --freshness=1h

# Check service conditions (Ready, ConfigurationsReady, RoutesReady)
gcloud run services describe core-api --region europe-west1 --format="value(status.conditions)"

# Check database connectivity
gcloud sql connect zyntem-dev --user=postgres
# Then: SELECT 1;

# Check Redis connectivity
redis-cli -h REDIS_IP ping

Fix Options:

Rollback (fastest, use if new deployment caused issue)
Hotfix (if root cause identified and fix is simple)
Workaround (temporary fix to restore service)

Step 5: Verify Resolution

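# Check error rate returned to normal (request counts for the last 15 minutes;
# `gcloud monitoring` has no `read` command, so query the Monitoring API directly)
curl -s -G "https://monitoring.googleapis.com/v3/projects/zyntem-dev/timeSeries" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="run.googleapis.com/request_count"' \
  --data-urlencode "interval.startTime=$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"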

# Run smoke tests
./scripts/smoke-tests.sh

# Monitor for 15 minutes
watch -n 10 'curl -s https://core-api-dev-zyntem.run.app/health | jq .'
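
`watch` needs someone looking at the screen for the full 15 minutes. As an alternative, the sketch below polls a check a fixed number of times and stops at the first failure. `monitor_health` and `core_api_healthy` are hypothetical helpers; the attempt count and interval are chosen by the operator.

```shell
# monitor_health CHECK_CMD ATTEMPTS INTERVAL_SECONDS
# Runs CHECK_CMD up to ATTEMPTS times, sleeping INTERVAL_SECONDS between runs.
# Stops and returns 1 on the first failing check.
monitor_health() {
  check_cmd=$1
  attempts=$2
  interval=$3
  i=1
  while [ "$i" -le "$attempts" ]; do
    if ! "$check_cmd" >/dev/null 2>&1; then
      echo "check $i/$attempts failed"
      return 1
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "all $attempts checks passed"
}

# Example check: succeeds only when /health answers with a 2xx status.
core_api_healthy() {
  curl -fsS --max-time 5 https://core-api-dev-zyntem.run.app/health
}

# Usage (90 checks, 10 seconds apart, is roughly 15 minutes):
#   monitor_health core_api_healthy 90 10
```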

Step 6: Post-Mortem

Within 48 hours of resolution:

Create Post-Mortem Document
- What happened?
- Root cause?
- How was it detected?
- How was it resolved?
- How to prevent recurrence?
Action Items
- Immediate fixes
- Long-term improvements
- Monitoring enhancements
- Process changes

Emergency Procedures

Emergency Rollback

When: Deployment caused critical issue

Time to Execute: < 30 seconds (instant traffic shift to previous revision)

Option 1: Instant traffic rollback (recommended)

# List recent revisions to identify the previous stable one
gcloud run revisions list \
  --service core-api \
  --region europe-west1 \
  --limit 5

# Shift 100% traffic to the previous stable revision
gcloud run services update-traffic core-api \
  --to-revisions=<PREVIOUS_REVISION>=100 \
  --region europe-west1

# Verify
curl https://core-api-dev-zyntem.run.app/health
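
Picking `<PREVIOUS_REVISION>` by eye is error-prone during an incident. The sketch below selects the newest revision that is not the one currently serving, given a newest-first list of names. `previous_revision` is a hypothetical helper, and "previous" does not guarantee "stable", so confirm the choice before shifting traffic.

```shell
# previous_revision CURRENT
# Reads revision names (newest first) on stdin and prints the newest one
# that is not CURRENT. Prints nothing if no other revision exists.
previous_revision() {
  grep -Fxv -- "$1" | head -n 1
}

# Usage (the sort flag asks gcloud for newest-first ordering):
#   gcloud run revisions list --service core-api --region europe-west1 \
#     --sort-by='~metadata.creationTimestamp' --format='value(metadata.name)' \
#     | previous_revision CURRENT_REVISION_NAME
```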

Option 2: Emergency rollback script

./scripts/emergency-rollback.sh core-api

Note: The deploy workflow (deploy.yml) includes automatic rollback on failure. If a health check fails during traffic shifting, traffic is automatically reverted to the previous revision without manual intervention.

Database Emergency Restore

When: Database corruption or data loss

Time to Execute: 15-30 minutes

# List available backups
gcloud sql backups list --instance=zyntem-dev

# Restore in place (overwrites all current data on zyntem-dev)
gcloud sql backups restore BACKUP_ID \
  --restore-instance=zyntem-dev

# Or restore onto a separate instance (zyntem-dev-restore must already exist)
gcloud sql backups restore BACKUP_ID \
  --backup-instance=zyntem-dev \
  --restore-instance=zyntem-dev-restore

# Update application to point to restored instance
# (requires configuration change and deployment)

WARNING: Restoring causes downtime and discards any data written after the backup was taken. Use it only when the corruption or data loss cannot be repaired in place.

Enable Maintenance Mode

When: Need to perform emergency maintenance

Time to Execute: < 1 minute

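# Deploy the maintenance image as a new, tagged revision of core-api without sending it traffic
gcloud run deploy core-api \
  --image gcr.io/zyntem/fiscalization-api/maintenance:latest \
  --region europe-west1 \
  --no-traffic \
  --tag maintenance

# Redirect all traffic to the maintenance revision
# (traffic can only be shifted between revisions of the same service)
gcloud run services update-traffic core-api \
  --to-tags=maintenance=100 \
  --region europe-west1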

Maintenance Page Response:

{
  "status": "maintenance",
  "message": "Fiscalization is undergoing emergency maintenance. Expected resolution: 2025-10-29 12:00 UTC",
  "retry_after": 3600
}
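
A client or smoke test that receives this body can use `retry_after` to decide how long to back off. The sketch below extracts the value with `sed` so that it has no extra dependencies; `retry_after_seconds` is a hypothetical helper, and a real JSON parser such as `jq` would be more robust against formatting changes.

```shell
# retry_after_seconds
# Reads the maintenance JSON on stdin and prints the integer value of
# "retry_after". Prints nothing if the field is absent.
retry_after_seconds() {
  sed -n 's/.*"retry_after"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Usage:
#   wait=$(printf '%s' "$response_body" | retry_after_seconds)
#   [ -n "$wait" ] && sleep "$wait"
```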

Circuit Breaker Manual Override

When: External service (tax authority) is down

Action: Manually enable circuit breaker to prevent cascading failures

# Set circuit breaker state in Redis
redis-cli -h REDIS_IP SET "circuit_breaker:spain:aeat" "open"

# This triggers graceful degradation (FR30):
# - Receipts generated immediately
# - Fiscalization queued for retry

Revert:

redis-cli -h REDIS_IP DEL "circuit_breaker:spain:aeat"
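
A manual override is easy to forget once the tax authority recovers. One possible safeguard, sketched below, is to set the key with an expiry so Redis removes it on its own. This assumes the application treats a missing key as a closed breaker, which the `DEL`-based revert above implies. The helper names are hypothetical, and the function only prints the command so it can be reviewed before it is run.

```shell
# cb_key COUNTRY AUTHORITY -> Redis key in the circuit_breaker:<country>:<authority> form
cb_key() {
  printf 'circuit_breaker:%s:%s\n' "$1" "$2"
}

# cb_open_cmd COUNTRY AUTHORITY TTL_SECONDS
# Prints (does not run) a redis-cli command that opens the breaker and lets
# Redis delete the key after TTL_SECONDS via the SET ... EX option.
cb_open_cmd() {
  printf 'redis-cli -h REDIS_IP SET "%s" "open" EX %s\n' "$(cb_key "$1" "$2")" "$3"
}

# Usage:
#   cb_open_cmd spain aeat 900
```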

How to Check Cloud Run Logs

Via gcloud CLI

# Live tail of logs for the core-api service
gcloud run services logs tail core-api --region europe-west1

# Recent logs (last hour)
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="core-api"' \
  --limit 100 --freshness=1h

# Errors only
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="core-api" AND severity>=ERROR' \
  --limit 50

# Logs for a specific revision (useful during deployments)
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.revision_name="core-api-00010-abc"' \
  --limit 50

Via Cloud Console

Navigate to: https://console.cloud.google.com/run?project=zyntem-dev
Select the core-api service
Click the Logs tab
Use severity filters (Error, Warning, Info) to narrow results

Common Issues

Issue 1: High Error Rate

Symptoms:

Alert: "Error rate > 5%"
Cloud Monitoring shows spike in 5xx errors

Investigation:

# Check error logs
gcloud logging read "severity>=ERROR" --limit 50

# Check service status
gcloud run services describe core-api --region europe-west1

# Check recent Cloud SQL operations (restarts, patches, failovers)
gcloud sql operations list --instance=zyntem-dev

Common Causes:

Database connection pool exhausted
External service timeout (Stripe, tax authorities)
Memory leak / OOM
Bad deployment

Resolution:

# If database issue: Raise the connection limit
# (--database-flags replaces all previously set flags and may restart the instance)
gcloud sql instances patch zyntem-dev --database-flags max_connections=200

# If bad deployment: Rollback
./scripts/emergency-rollback.sh core-api

# If external service: Enable circuit breaker (key shown for Spain/AEAT)
redis-cli -h REDIS_IP SET "circuit_breaker:spain:aeat" "open"

Issue 2: High Latency

Symptoms:

Alert: "P99 latency > 2s"
Requests timing out
Customer complaints of slow responses

Investigation:

# Check Cloud Run metrics
gcloud run services describe core-api --region europe-west1

# Check database slow queries
# (See database query from "Normal Operations" section)

# Check Redis latency
redis-cli --latency -h REDIS_IP

Common Causes:

Cold starts (low traffic)
Slow database queries
External service latency
Insufficient resources

Resolution:

# Prevent cold starts
gcloud run services update core-api --min-instances=1 --region europe-west1

# Scale up resources
gcloud run services update core-api --memory=1Gi --region europe-west1

# Add database indexes (if slow query identified)

Issue 3: Database Connection Errors

Symptoms:

Errors: could not connect to server
API returns 500 errors
Database queries timing out

Investigation:

# Check Cloud SQL status
gcloud sql instances describe zyntem-dev

# Check recent instance operations (restarts, maintenance, failovers)
gcloud sql operations list --instance=zyntem-dev --limit 10

# Check Cloud SQL logs
gcloud logging read "resource.type=cloudsql_database" --limit 50

Common Causes:

Cloud SQL instance down
Connection pool exhausted
Network issue
Maintenance window

Resolution:

# If instance down: Restart
gcloud sql instances restart zyntem-dev

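# If connection pool issue: Raise the connection limit
# (--database-flags replaces all previously set flags and may restart the instance)
gcloud sql instances patch zyntem-dev --database-flags max_connections=200

# Emergency: Fail over to the standby (requires a high-availability instance)
gcloud sql instances failover zyntem-dev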

Issue 4: Certificate Expiration

Symptoms:

Alert: "Certificate expires in < 30 days"
Tax authority API calls failing with SSL errors

Investigation:

# Check certificate expiration
openssl x509 -in certificate.pem -noout -enddate

Resolution:

# Renew certificate (process varies by country)
# Spain: Contact AEAT for renewal
# Italy: Use InfoCert portal
# France: Automated renewal (if applicable)

# Upload new certificate to Secret Manager
gcloud secrets versions add CERT_NAME --data-file=new-certificate.pem

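# Roll out a new revision so instances load the new certificate
# (an update with no changes creates no revision; CERT_ROTATED_AT is an arbitrary marker variable)
gcloud run services update adapter-spain --region europe-west1 \
  --update-env-vars CERT_ROTATED_AT=$(date -u +%Y%m%dT%H%M%SZ)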

Monitoring & Alerts

Dashboards

Primary Dashboard:

URL: https://console.cloud.google.com/monitoring/dashboards
Name: "Fiscalization Production Monitoring"
Metrics: Error rate, latency, request count, database health

Secondary Dashboards:

Cloud Run Service Metrics
Cloud SQL Performance
Error Logs

Alert Policies

Alert	Threshold	Action
High Error Rate	> 5% for 2 min	Page on-call, investigate immediately
High Latency	P99 > 2s for 2 min	Page on-call, check database/resources
Database Down	Connection failures > 3	Page on-call, emergency restore
Certificate Expiring	< 30 days	Email owner, schedule renewal
Disk Full	> 90%	Email team, clean up or scale
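
The "for 2 min" qualifier means a single bad sample should not page anyone: the condition has to hold across the whole window. Cloud Monitoring evaluates this itself; the sketch below only illustrates the rule for reasoning about a handful of samples. `sustained_breach` is a hypothetical helper, and how many samples make up two minutes depends on the sampling interval.

```shell
# sustained_breach THRESHOLD SAMPLE...
# Exit 0 only when every sample is strictly above THRESHOLD.
# Exit 1 if any sample is at or below it, or if no samples are given.
sustained_breach() {
  threshold=$1
  shift
  [ "$#" -gt 0 ] || return 1
  for sample in "$@"; do
    awk -v s="$sample" -v t="$threshold" 'BEGIN { exit !(s + 0 > t + 0) }' || return 1
  done
  return 0
}

# Usage (error-rate samples in percent, one per minute):
#   if sustained_breach 5 6.1 7.4; then echo "page on-call"; fi
```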

Log Queries

Useful Log Queries:

# All errors in last hour (requires GNU date; timestamp is UTC with a trailing Z)
gcloud logging read 'severity>=ERROR AND timestamp>="'$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)'"'

# Errors for specific service
gcloud logging read 'resource.type=cloud_run_revision AND resource.labels.service_name=core-api AND severity>=ERROR' --limit 50

# Slow database queries (> 1s); PostgreSQL logs durations as "duration: 1234.567 ms"
gcloud logging read 'resource.type=cloudsql_database AND textPayload=~"duration: [1-9][0-9]{3,}[.][0-9]+ ms"' --limit 20

# Failed transactions
gcloud logging read 'jsonPayload.transaction.status="failed"' --limit 50

Contact Information

On-Call Rotation

Week	Primary On-Call	Backup
Week 1	Engineer A (+1-XXX-XXX-XXXX)	Engineer B (+1-XXX-XXX-XXXX)
Week 2	Engineer B (+1-XXX-XXX-XXXX)	Engineer C (+1-XXX-XXX-XXXX)
Week 3	Engineer C (+1-XXX-XXX-XXXX)	Engineer A (+1-XXX-XXX-XXXX)

Escalation Path

Level 1: On-Call Engineer
Level 2: Engineering Lead (javier.sanchez@zyntem.com)
Level 3: CTO/Technical Advisor

External Contacts

GCP Support:

Phone: +1-XXX-XXX-XXXX
Portal: https://console.cloud.google.com/support
Priority: P1 (production outage)

Stripe Support:

Portal: https://support.stripe.com
Email: support@stripe.com

Tax Authority Contacts:

Spain AEAT: +34-XXX-XXX-XXX
Italy AdE: +39-XXX-XXX-XXX

Runbook Maintenance

Review Schedule: Quarterly (every 3 months)

Last Reviewed: 2025-10-29

Next Review: 2026-01-29

Changelog:

2025-10-29: Initial version

Additional Resources

DEPLOYMENT.md - Deployment procedures
DEPLOYMENT-CHECKLIST.md - Deployment checklist
CI-CD.md - CI/CD pipeline
Cloud Run Documentation
Cloud SQL Documentation
