Zyntem Fiscalization - Operational Runbook
Company: Zyntem
Product: Fiscalization by Zyntem
Version: 1.0
Last Updated: 2025-10-30
Status: Active
Overview
This runbook provides operational procedures for Fiscalization by Zyntem, including normal operations, incident response, and emergency procedures.
Target Audience: On-call engineers, DevOps, SREs
Emergency Contact: +1-XXX-XXX-XXXX (update with real number)
Table of Contents
- System Architecture
- Normal Operations
- Incident Response
- Emergency Procedures
- Common Issues
- Monitoring & Alerts
- Contact Information
System Architecture
Components
graph TD
A[User/Client] --> B[Cloud Run: Core API]
B --> C[Cloud SQL: PostgreSQL]
B --> D[Redis: Rate Limiting]
B --> E[Secret Manager: Credentials]
B --> F[Cloud Storage: Receipts]
B --> G[Adapters: Spain/Italy/France]
G --> H[Tax Authorities]
Critical Services
| Service | Platform | Region | Purpose |
|---|---|---|---|
| core-api | Cloud Run | europe-west1 | Main API gateway |
| dashboard | Cloud Run | europe-west1 | Customer UI |
| docs | Cloud Run | europe-west1 | Documentation |
| adapter-spain | Cloud Run | europe-west1 | Spain fiscalization |
| adapter-italy | Cloud Run | europe-west1 | Italy fiscalization |
| adapter-france | Cloud Run | europe-west1 | France fiscalization |
| postgres | Cloud SQL | europe-west1 | Primary database |
| redis | Cloud Memorystore | europe-west1 | Cache & rate limiting |
Dependencies
External Services:
- Stripe (payment processing)
- AEAT (Spain tax authority)
- AdE SDI (Italy tax authority)
- GCP Services (Cloud Run, Cloud SQL, etc.)
Normal Operations
Daily Checks
Morning Checklist (5 minutes):
- Check System Health
  # Verify all services are running
  gcloud run services list --region europe-west1
  # Check health endpoints
  curl https://core-api-dev-zyntem.run.app/health
  curl https://dashboard-dev-zyntem.run.app/health
- Review Metrics
  - Open Cloud Monitoring dashboard
  - Verify error rate < 5%
  - Verify P99 latency < 2s
  - Check no active alerts
- Check Recent Deployments
  gcloud run revisions list --service core-api --region europe-west1 --limit 5
- Review Error Logs
  gcloud logging read "severity>=ERROR" --limit 20 --freshness=1d
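The health-endpoint step of the morning checklist could be wrapped in a small helper so that a failing endpoint stands out. This is a minimal sketch, not part of the existing tooling: the function names `http_status` and `check_endpoints` are hypothetical, and it assumes each service returns HTTP 200 on `/health` when healthy.

```shell
#!/bin/sh
# Sketch only: http_status and check_endpoints are illustrative names,
# not scripts that ship with the repository.

# http_status URL -> prints the HTTP status code (curl prints 000 if no response arrives).
http_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1" || true
}

# check_endpoints URL... -> prints one line per endpoint; returns 1 if any is not HTTP 200.
check_endpoints() {
  failed=0
  for url in "$@"; do
    code=$(http_status "$url")
    if [ "$code" = "200" ]; then
      echo "OK   $url"
    else
      echo "FAIL $url (HTTP ${code:-none})"
      failed=1
    fi
  done
  return "$failed"
}
```

Example call: `check_endpoints https://core-api-dev-zyntem.run.app/health https://dashboard-dev-zyntem.run.app/health`. A non-zero exit status means at least one endpoint needs a closer look.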
Weekly Checks
Monday Morning (15 minutes):
- Certificate Expiration Check
  # Check certificates expiring in < 30 days
  # (Automated alert should notify, but double-check)
- Database Performance Review
  # Connect to Cloud SQL
  gcloud sql connect zyntem-dev --user=postgres
  # Check slow queries
  SELECT query, mean_exec_time, calls
  FROM pg_stat_statements
  WHERE mean_exec_time > 1000
  ORDER BY mean_exec_time DESC
  LIMIT 10;
- Backup Verification
  # Verify automated backups exist
  gcloud sql backups list --instance=zyntem-dev
- Review Capacity & Costs
  - Check Cloud SQL storage usage
  - Review Cloud Run instance scaling
  - Monitor GCP billing
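For the Backup Verification step, confirming that backups exist can be extended to confirming that the newest one is recent. The sketch below is an assumption-laden helper, not existing tooling: `backup_is_fresh` is a hypothetical name, the 26-hour limit in the example is illustrative, and the function requires GNU `date` (for `-d`).

```shell
#!/bin/sh
# Sketch only: backup_is_fresh is an illustrative helper name.

# backup_is_fresh END_TIME MAX_AGE_HOURS [NOW_EPOCH]
# END_TIME is an RFC 3339 timestamp such as "2025-10-29T03:00:00Z".
# Returns 0 if the backup finished no more than MAX_AGE_HOURS ago,
# 1 if it is older, 2 if the timestamp cannot be parsed.
backup_is_fresh() {
  end_epoch=$(date -u -d "$1" +%s) || return 2
  now_epoch=${3:-$(date -u +%s)}
  age_hours=$(( (now_epoch - end_epoch) / 3600 ))
  [ "$age_hours" -le "$2" ]
}
```

One way to feed it, assuming the backup listing exposes an `endTime` field and that the first row is the newest backup (verify both against your gcloud version): `backup_is_fresh "$(gcloud sql backups list --instance=zyntem-dev --limit=1 --format='value(endTime)')" 26`.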
Monthly Checks
First Monday of Month (30 minutes):
- Security Review
  - Review IAM permissions
  - Check for unused service accounts
  - Rotate API keys (if policy requires)
- Dependency Updates
  - Review Go dependency updates:
    go list -u -m all
  - Review Node dependency updates:
    npm outdated
  - Schedule dependency upgrade sprint
- Disaster Recovery Test
  - Test database restore from backup
  - Verify rollback procedures
  - Test failover scenarios (Phase 2)
Incident Response
Severity Levels
| Severity | Description | Response Time | Example |
|---|---|---|---|
| P0 - Critical | Complete service outage | 15 minutes | Database down, API returning 500s |
| P1 - High | Major feature broken | 1 hour | Transactions failing, dashboard down |
| P2 - Medium | Degraded performance | 4 hours | High latency, some errors |
| P3 - Low | Minor issue | 24 hours | UI bug, documentation error |
Incident Response Process
graph TD
A[Alert Triggered] --> B[Acknowledge Alert]
B --> C[Assess Severity]
C --> D{P0/P1?}
D -->|Yes| E[Page On-Call]
D -->|No| F[Create Ticket]
E --> G[Triage & Investigate]
G --> H[Implement Fix]
H --> I[Verify Resolution]
I --> J[Post-Mortem]
F --> K[Schedule Work]
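The routing decision in the flow above, together with the response times from the severity table, can be expressed as a small lookup. This is an illustrative sketch; `route_incident` and `response_time` are hypothetical names, and the output strings are placeholders rather than identifiers used by any paging tool.

```shell
#!/bin/sh
# Sketch only: function names and output strings are illustrative.

# route_incident SEVERITY -> prints the next action from the incident flow.
route_incident() {
  case "$1" in
    P0|P1) echo "page-on-call" ;;
    P2|P3) echo "create-ticket" ;;
    *)     echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# response_time SEVERITY -> prints the target response time from the severity table.
response_time() {
  case "$1" in
    P0) echo "15 minutes" ;;
    P1) echo "1 hour" ;;
    P2) echo "4 hours" ;;
    P3) echo "24 hours" ;;
    *)  return 1 ;;
  esac
}
```

For example, `route_incident P1` prints `page-on-call` and `response_time P1` prints `1 hour`.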
Step 1: Acknowledge Alert
# Silence alert by disabling its policy (if using Cloud Monitoring)
# Re-enable with --enabled once the incident is resolved
gcloud alpha monitoring policies update POLICY_ID --no-enabled
# Or acknowledge in PagerDuty/Opsgenie
Time: < 5 minutes
Step 2: Assess Impact
Questions to Answer:
- How many customers affected?
- What functionality is broken?
- Is data at risk?
- Is this a security incident?
Commands:
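# Check 5xx request counts for the last hour
# (gcloud has no command for reading time series; use Metrics Explorer or the Monitoring API)
curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://monitoring.googleapis.com/v3/projects/zyntem-dev/timeSeries" \
--data-urlencode 'filter=metric.type="run.googleapis.com/request_count" AND metric.label.response_code_class="5xx"' \
--data-urlencode "interval.startTime=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
--data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Check affected services (assumes services carry a "status" label)
gcloud run services list --filter="metadata.labels.status=unhealthy"
# Count error log entries (last 1 hour; requires GNU date)
gcloud logging read 'severity>=ERROR AND timestamp>="'$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)'"' --format='value(timestamp)' | wc -l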
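Raw counts from the commands above are easier to judge once they are turned into a percentage and compared with the 5% alert threshold. A minimal sketch follows, with hypothetical helper names; it assumes you already have an error count and a total request count for the same time window.

```shell
#!/bin/sh
# Sketch only: error_rate_pct and exceeds_threshold are illustrative names.

# error_rate_pct ERRORS TOTAL -> prints the error rate as a percentage with two decimals.
# Prints 0.00 when TOTAL is zero to avoid dividing by zero.
error_rate_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "0.00" } else { printf "%.2f\n", e * 100 / t }
  }'
}

# exceeds_threshold RATE THRESHOLD -> exit status 0 if RATE is strictly greater than THRESHOLD.
exceeds_threshold() {
  awk -v r="$1" -v th="$2" 'BEGIN { exit !(r > th) }'
}
```

Usage: `rate=$(error_rate_pct 15 200); exceeds_threshold "$rate" 5 && echo "error rate ${rate}% is above threshold"`.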
Step 3: Communicate Status
Update Status Page:
- https://status.zyntem.dev (Epic 5)
- Post incident notice
- Provide ETA if known
Notify Team:
Slack: #fiscalization-incidents
Subject: [P0] API Returning 500 Errors
Status: Investigating
ETA: Unknown
Lead: @engineer-name
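To keep incident notices consistent under pressure, the template above can be generated rather than typed. This is a sketch with a hypothetical function name; it only formats text and leaves posting to Slack to whatever tooling the team already uses.

```shell
#!/bin/sh
# Sketch only: incident_notice is an illustrative helper name.

# incident_notice SEVERITY TITLE STATUS ETA LEAD -> prints the notice text.
# An empty ETA is rendered as "Unknown".
incident_notice() {
  printf 'Subject: [%s] %s\nStatus: %s\nETA: %s\nLead: %s\n' \
    "$1" "$2" "$3" "${4:-Unknown}" "$5"
}
```

Example: `incident_notice P0 "API Returning 500 Errors" Investigating "" @engineer-name` reproduces the sample notice shown above.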
Step 4: Investigate & Fix
Common Investigation Commands:
# Check service health
curl -v https://core-api-dev-zyntem.run.app/health
# View recent logs
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" --limit 50 --freshness=1h
# Check service status conditions (readiness of the latest revision and routes)
gcloud run services describe core-api --region europe-west1 --format="value(status.conditions)"
# Check database connectivity
gcloud sql connect zyntem-dev --user=postgres
# Then: SELECT 1;
# Check Redis connectivity
redis-cli -h REDIS_IP ping
Fix Options:
- Rollback (fastest, use if new deployment caused issue)
- Hotfix (if root cause identified and fix is simple)
- Workaround (temporary fix to restore service)
Step 5: Verify Resolution
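# Check error rate returned to normal (5xx request counts, last 15 minutes)
# (gcloud has no command for reading time series; use Metrics Explorer or the Monitoring API)
curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://monitoring.googleapis.com/v3/projects/zyntem-dev/timeSeries" \
--data-urlencode 'filter=metric.type="run.googleapis.com/request_count" AND metric.label.response_code_class="5xx"' \
--data-urlencode "interval.startTime=$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
--data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"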
# Run smoke tests
./scripts/smoke-tests.sh
# Monitor for 15 minutes
watch -n 10 'curl -s https://core-api-dev-zyntem.run.app/health | jq .'
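Watching the health endpoint by eye works, but the "monitor for 15 minutes" step can also be turned into a pass/fail check. The sketch below requires a run of consecutive healthy responses before it reports success. The function names are hypothetical, and the numbers in the usage example (90 checks, 10 seconds apart) are an illustrative way to cover roughly 15 minutes.

```shell
#!/bin/sh
# Sketch only: http_status and wait_until_stable are illustrative names.

# http_status URL -> prints the HTTP status code (curl prints 000 if no response arrives).
http_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1" || true
}

# wait_until_stable URL REQUIRED MAX_ATTEMPTS INTERVAL_SECONDS
# Returns 0 once REQUIRED consecutive checks return HTTP 200.
# Returns 1 if MAX_ATTEMPTS checks are used up first.
wait_until_stable() {
  url=$1; required=$2; max_attempts=$3; interval=$4
  streak=0
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    attempt=$((attempt + 1))
    if [ "$(http_status "$url")" = "200" ]; then
      streak=$((streak + 1))
    else
      streak=0   # any failure restarts the count
    fi
    if [ "$streak" -ge "$required" ]; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}
```

Usage: `wait_until_stable https://core-api-dev-zyntem.run.app/health 90 120 10 && echo "stable"`. Allowing more attempts than the required streak leaves room for a few early failures while the fix settles.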
Step 6: Post-Mortem
Within 48 hours of resolution:
- Create Post-Mortem Document
  - What happened?
  - Root cause?
  - How was it detected?
  - How was it resolved?
  - How to prevent recurrence?
- Action Items
  - Immediate fixes
  - Long-term improvements
  - Monitoring enhancements
  - Process changes
Emergency Procedures
Emergency Rollback
When: Deployment caused critical issue
Time to Execute: < 30 seconds (instant traffic shift to previous revision)
Option 1: Instant traffic rollback (recommended)
# List recent revisions to identify the previous stable one
gcloud run revisions list \
--service core-api \
--region europe-west1 \
--limit 5
# Shift 100% traffic to the previous stable revision
gcloud run services update-traffic core-api \
--to-revisions=<PREVIOUS_REVISION>=100 \
--region europe-west1
# Verify
curl https://core-api-dev-zyntem.run.app/health
Option 2: Emergency rollback script
./scripts/emergency-rollback.sh core-api
Note: The deploy workflow (deploy.yml) includes automatic rollback on failure.
If a health check fails during traffic shifting, traffic is automatically reverted
to the previous revision without manual intervention.
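Picking `<PREVIOUS_REVISION>` by eye from the revision list is error-prone during an incident. A small filter can select the revision that comes immediately after the currently serving one in a newest-first list. This is a sketch under stated assumptions: `previous_revision` is a hypothetical name, the input must be one revision name per line, and the list must be sorted newest first.

```shell
#!/bin/sh
# Sketch only: previous_revision is an illustrative helper name.

# previous_revision CURRENT
# Reads revision names from stdin (one per line, newest first) and prints
# the name on the line after CURRENT, i.e. the next-older revision.
# Returns 1 if CURRENT is not found or has no older revision.
previous_revision() {
  awk -v cur="$1" '
    found       { print; done = 1; exit }
    $1 == cur   { found = 1 }
    END         { if (!done) exit 1 }
  '
}
```

Usage, with the sort order made explicit: `gcloud run revisions list --service core-api --region europe-west1 --sort-by='~metadata.creationTimestamp' --format='value(metadata.name)' | previous_revision core-api-00010-abc`. The next-older revision is not necessarily the last stable one, so confirm the result before shifting traffic.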
Database Emergency Restore
When: Database corruption or data loss
Time to Execute: 15-30 minutes
# List available backups
gcloud sql backups list --instance=zyntem-dev
# Restore in place (overwrites all current data on zyntem-dev)
gcloud sql backups restore BACKUP_ID \
--restore-instance=zyntem-dev
# Or restore into a separate instance (zyntem-dev-restore must already exist)
gcloud sql backups restore BACKUP_ID \
--backup-instance=zyntem-dev \
--restore-instance=zyntem-dev-restore
# Update application to point to restored instance
# (requires configuration change and deployment)
WARNING: Restoring causes downtime and discards all data written after the backup was taken. Use it only when the corruption or data loss cannot be repaired in place.
Enable Maintenance Mode
When: Need to perform emergency maintenance
Time to Execute: < 1 minute
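# Deploy the maintenance image as a new core-api revision without sending it traffic
# (traffic can only be routed to revisions of the same service)
gcloud run deploy core-api \
--image gcr.io/zyntem/fiscalization-api/maintenance:latest \
--region europe-west1 \
--no-traffic \
--tag maintenance
# Redirect all traffic to the tagged maintenance revision
gcloud run services update-traffic core-api \
--to-tags maintenance=100 \
--region europe-west1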
Maintenance Page Response:
{
"status": "maintenance",
"message": "Fiscalization is undergoing emergency maintenance. Expected resolution: 2025-10-29 12:00 UTC",
"retry_after": 3600
}
Circuit Breaker Manual Override
When: External service (tax authority) is down
Action: Manually enable circuit breaker to prevent cascading failures
# Set circuit breaker state in Redis
redis-cli -h REDIS_IP SET "circuit_breaker:spain:aeat" "open"
# This triggers graceful degradation (FR30):
# - Receipts generated immediately
# - Fiscalization queued for retry
Revert:
redis-cli -h REDIS_IP DEL "circuit_breaker:spain:aeat"
How to Check Cloud Run Logs
Via gcloud CLI
# Live tail of logs for the core-api service
gcloud run services logs tail core-api --region europe-west1
# Recent logs (last hour)
gcloud logging read \
'resource.type="cloud_run_revision" AND resource.labels.service_name="core-api"' \
--limit 100 --freshness=1h
# Errors only
gcloud logging read \
'resource.type="cloud_run_revision" AND resource.labels.service_name="core-api" AND severity>=ERROR' \
--limit 50
# Logs for a specific revision (useful during deployments)
gcloud logging read \
'resource.type="cloud_run_revision" AND resource.labels.revision_name="core-api-00010-abc"' \
--limit 50
Via Cloud Console
- Navigate to: https://console.cloud.google.com/run?project=zyntem-dev
- Select the core-api service
- Click the Logs tab
- Use severity filters (Error, Warning, Info) to narrow results
Common Issues
Issue 1: High Error Rate
Symptoms:
- Alert: "Error rate > 5%"
- Cloud Monitoring shows spike in 5xx errors
Investigation:
# Check error logs
gcloud logging read "severity=ERROR" --limit 50
# Check service status
gcloud run services describe core-api --region europe-west1
# Check recent Cloud SQL admin operations (restarts, maintenance, failovers)
gcloud sql operations list --instance=zyntem-dev
Common Causes:
- Database connection pool exhausted
- External service timeout (Stripe, tax authorities)
- Memory leak / OOM
- Bad deployment
Resolution:
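# If database issue: raise the connection limit
# (--database-flags replaces all previously set flags; changing max_connections may restart the instance)
gcloud sql instances patch zyntem-dev --database-flags max_connections=200
# If bad deployment: Rollback
./scripts/emergency-rollback.sh core-api
# If external service: Enable circuit breaker (e.g. SERVICE = spain:aeat)
redis-cli -h REDIS_IP SET "circuit_breaker:SERVICE" "open"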
Issue 2: High Latency
Symptoms:
- Alert: "P99 latency > 2s"
- Requests timing out
- Customer complaints of slow responses
Investigation:
# Check Cloud Run metrics
gcloud run services describe core-api --region europe-west1
# Check database slow queries
# (See database query from "Normal Operations" section)
# Check Redis latency
redis-cli --latency -h REDIS_IP
Common Causes:
- Cold starts (low traffic)
- Slow database queries
- External service latency
- Insufficient resources
Resolution:
# Prevent cold starts
gcloud run services update core-api --min-instances=1 --region europe-west1
# Scale up resources
gcloud run services update core-api --memory=1Gi --region europe-west1
# Add database indexes (if slow query identified)
Issue 3: Database Connection Errors
Symptoms:
- Errors: could not connect to server
- API returns 500 errors
- Database queries timing out
Investigation:
# Check Cloud SQL status
gcloud sql instances describe zyntem-dev
# Check recent admin operations (restarts, maintenance)
gcloud sql operations list --instance=zyntem-dev --limit 10
# Check Cloud SQL logs
gcloud logging read "resource.type=cloudsql_database" --limit 50
Common Causes:
- Cloud SQL instance down
- Connection pool exhausted
- Network issue
- Maintenance window
Resolution:
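# If instance down: Restart
gcloud sql instances restart zyntem-dev
# If connection pool issue: raise the connection limit
# (--database-flags replaces all previously set flags; changing max_connections may restart the instance)
gcloud sql instances patch zyntem-dev --database-flags max_connections=200
# Emergency: fail over to the standby (only for instances configured for high availability)
gcloud sql instances failover zyntem-dev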
Issue 4: Certificate Expiration
Symptoms:
- Alert: "Certificate expires in < 30 days"
- Tax authority API calls failing with SSL errors
Investigation:
# Check certificate expiration
openssl x509 -in certificate.pem -noout -enddate
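The `-enddate` output is a date string, which is awkward to compare against the 30-day alert threshold. The sketch below converts it to a number of days remaining. `days_until_expiry` is a hypothetical helper name, and the function relies on GNU `date` to parse the OpenSSL date format.

```shell
#!/bin/sh
# Sketch only: days_until_expiry is an illustrative helper name.

# days_until_expiry ENDDATE_LINE [NOW_EPOCH]
# ENDDATE_LINE is the output of `openssl x509 -noout -enddate`,
# for example "notAfter=Jan 31 00:00:00 2026 GMT".
# Prints the whole days remaining (negative once the certificate has expired).
days_until_expiry() {
  end_epoch=$(date -u -d "${1#notAfter=}" +%s) || return 1
  now_epoch=${2:-$(date -u +%s)}
  echo $(( (end_epoch - now_epoch) / 86400 ))
}
```

Usage: `days_until_expiry "$(openssl x509 -in certificate.pem -noout -enddate)"`. A result below 30 matches the condition that triggers the certificate alert.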
Resolution:
# Renew certificate (process varies by country)
# Spain: Contact AEAT for renewal
# Italy: Use InfoCert portal
# France: Automated renewal (if applicable)
# Upload new certificate to Secret Manager
gcloud secrets versions add CERT_NAME --data-file=new-certificate.pem
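# Deploy a new revision so instances pick up the new certificate
# (an update with no changes does not create a revision; CERT_ROTATED_AT is an illustrative variable name)
gcloud run services update adapter-spain --region europe-west1 --update-env-vars CERT_ROTATED_AT=$(date -u +%Y%m%dT%H%M%SZ)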
Monitoring & Alerts
Dashboards
Primary Dashboard:
- URL: https://console.cloud.google.com/monitoring/dashboards
- Name: "Fiscalization Production Monitoring"
- Metrics: Error rate, latency, request count, database health
Secondary Dashboards:
- Cloud Run Service Metrics
- Cloud SQL Performance
- Error Logs
Alert Policies
| Alert | Threshold | Action |
|---|---|---|
| High Error Rate | > 5% for 2 min | Page on-call, investigate immediately |
| High Latency | P99 > 2s for 2 min | Page on-call, check database/resources |
| Database Down | Connection failures > 3 | Page on-call, emergency restore |
| Certificate Expiring | < 30 days | Email owner, schedule renewal |
| Disk Full | > 90% | Email team, clean up or scale |
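Thresholds such as "> 5% for 2 min" fire only when the breach is sustained, not on a single spike. The following sketch illustrates that rule on a list of samples. It is not how Cloud Monitoring evaluates policies internally; `sustained_breach` is a hypothetical name, and it assumes one sample per minute, ordered oldest to newest.

```shell
#!/bin/sh
# Sketch only: sustained_breach is an illustrative helper name.

# sustained_breach THRESHOLD WINDOW SAMPLE...
# Returns 0 if the last WINDOW samples are all strictly above THRESHOLD.
# Returns 1 otherwise, including when fewer than WINDOW samples are given.
sustained_breach() {
  threshold=$1; window=$2; shift 2
  [ "$#" -ge "$window" ] || return 1
  printf '%s\n' "$@" | tail -n "$window" |
    awk -v th="$threshold" '!($1 > th) { below = 1 } END { exit below }'
}
```

Usage: `sustained_breach 5 2 1.0 6.2 7.4 && echo "page on-call"`. Here the last two samples (6.2 and 7.4) are both above 5, so the condition holds.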
Log Queries
Useful Log Queries:
# All errors in last hour
gcloud logging read 'severity>=ERROR AND timestamp>="'$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)'"'
# Errors for specific service
gcloud logging read 'resource.type=cloud_run_revision AND resource.labels.service_name=core-api AND severity>=ERROR' --limit 50
# Slow database queries (> 1s); PostgreSQL logs durations as "duration: 1234.567 ms"
gcloud logging read 'resource.type=cloudsql_database AND textPayload=~"duration: [1-9][0-9]{3,}[.][0-9]+ ms"' --limit 20
# Failed transactions
gcloud logging read 'jsonPayload.transaction.status="failed"' --limit 50
Contact Information
On-Call Rotation
| Week | Primary On-Call | Backup |
|---|---|---|
| Week 1 | Engineer A (+1-XXX-XXX-XXXX) | Engineer B (+1-XXX-XXX-XXXX) |
| Week 2 | Engineer B (+1-XXX-XXX-XXXX) | Engineer C (+1-XXX-XXX-XXXX) |
| Week 3 | Engineer C (+1-XXX-XXX-XXXX) | Engineer A (+1-XXX-XXX-XXXX) |
Escalation Path
- Level 1: On-Call Engineer
- Level 2: Engineering Lead (javier.sanchez@zyntem.com)
- Level 3: CTO/Technical Advisor
External Contacts
GCP Support:
- Phone: +1-XXX-XXX-XXXX
- Portal: https://console.cloud.google.com/support
- Priority: P1 (production outage)
Stripe Support:
- Portal: https://support.stripe.com
- Email: support@stripe.com
Tax Authority Contacts:
- Spain AEAT: +34-XXX-XXX-XXX
- Italy AdE: +39-XXX-XXX-XXX
Runbook Maintenance
Review Schedule: Quarterly (every 3 months)
Last Reviewed: 2025-10-29
Next Review: 2026-01-29
Changelog:
- 2025-10-29: Initial version
Additional Resources
- DEPLOYMENT.md - Deployment procedures
- DEPLOYMENT-CHECKLIST.md - Deployment checklist
- CI-CD.md - CI/CD pipeline
- Cloud Run Documentation
- Cloud SQL Documentation