Guess Vertica Infrastructure Project

This engagement covered Vertica database infrastructure work for Guess Inc. across system monitoring, backup strategy, disaster recovery planning, and operational optimization. It combined hands-on implementation with strategic documentation.

Key Engagement Areas

1. Vertica Cluster Management & Upgrades

  • Cluster Inventory: Seven Vertica clusters were managed:

    • VERTKP (Korea Production) - upgraded from 6.0.0-3 to 6.1.3-0
    • VERTSC (Supply Chain)
    • VERTW (Warehouse)
    • VERTR (Reporting)
    • VERTC (Commerce)
    • VERTU (US Production)
    • VERTKD (Korea Development)
  • Management Console (MC) Implementation:

    • Added all clusters to Vertica MC
    • Generated TLS 1.2 certificates for secure HTTPS sessions
    • Installed MC for Korea operations

2. Monitoring & Alerting Infrastructure

  • HashiCorp Consul-Based Monitoring System:

    • Deployed Consul agents across all US (V1) and Korea (VK1) nodes
    • Organized into two datacenters with elected server leadership
    • Implemented multi-tier health checking:
      • TCP port monitoring (Vertica service on port 5433, NFS, Consul itself)
      • Application-level checks (fat_query, epoch_gap, disk_usage)
      • Cron-initiated cluster health validation
  • Alert Routing Capabilities:

    • Configured consul-alerts agent supporting multiple notification channels:
      • Email alerts
      • PagerDuty integration
      • OpsGenie integration
      • Slack notifications
      • AWS SNS integration
      • Log-based alerts
  • Network Challenges:

    • Identified connectivity issues with VERTC cluster
    • Network timeout problems with Korea clusters
    • Resolved conflicts with existing SolarWinds SNMP monitoring
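
The health checks listed above could be expressed roughly as follows. This is a minimal sketch, not the configuration used in the engagement: it builds Consul check definitions as plain dictionaries and maps an epoch gap to the exit codes Consul expects from a script check (0 passing, 1 warning, anything else critical). The helper names, the script path, the intervals, and the warning and critical thresholds are all assumptions for illustration. The `args` field is the form current Consul releases use for script checks; older releases used a different field, so the definition would need adjusting for the agent version in place.

```python
import json

VERTICA_PORT = 5433  # Vertica service port named in the text


def tcp_check(name, host, port, interval="30s", timeout="5s"):
    """Build a Consul TCP check definition (interval/timeout are assumed values)."""
    return {
        "check": {
            "id": f"{name}-tcp",
            "name": f"{name} TCP {port}",
            "tcp": f"{host}:{port}",
            "interval": interval,
            "timeout": timeout,
        }
    }


def script_check(name, args, interval="5m"):
    """Build a Consul script check definition that runs `args` on a schedule."""
    return {
        "check": {
            "id": name,
            "name": name,
            "args": list(args),
            "interval": interval,
        }
    }


def exit_code_for_gap(epoch_gap, warn_at=1000, crit_at=10000):
    """Map an epoch gap to a Consul script-check exit code.

    0 = passing, 1 = warning, 2 = critical. The thresholds are hypothetical.
    """
    if epoch_gap >= crit_at:
        return 2
    if epoch_gap >= warn_at:
        return 1
    return 0


if __name__ == "__main__":
    # Print both definitions as JSON, ready to drop into an agent config directory.
    print(json.dumps(tcp_check("vertica", "localhost", VERTICA_PORT), indent=2))
    # Hypothetical script path; the real check scripts are not shown in this write-up.
    print(json.dumps(script_check("epoch_gap", ["/usr/local/bin/epoch_gap.py"]), indent=2))
```

A script such as the hypothetical `epoch_gap.py` would query the cluster, pass the result to `exit_code_for_gap`, and exit with the returned code so that Consul can classify the node's state.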

3. Backup & Recovery Strategy

  • Backup Infrastructure:

    • Implemented scheduled backups across clusters, including VERTKP, VERTSC, and VERTW
    • Utilized Compellent SAN storage with dedicated backup drives
    • Standard configuration: Node 5 in each cluster designated as backup target
    • Used Vertica’s native vbr.py backup utility
  • Recovery Planning:

    • Documented full disaster recovery procedures
    • Estimated recovery times (e.g., VERTW: 5 hours)
    • Created parallel production failover strategy for critical applications
    • Designed DNS service switching for hot failover scenarios
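
As an illustration of how a scheduled vbr.py backup might be wired up, the sketch below assembles the backup command line and a matching daily crontab entry. It assumes the default install location `/opt/vertica/bin/vbr.py`; the configuration file path, log file path, and run time are hypothetical and do not reflect the actual schedule used for these clusters.

```python
import shlex

VBR = "/opt/vertica/bin/vbr.py"  # default install location; adjust if different


def backup_command(config_file):
    """Return the vbr.py backup invocation as an argument list."""
    return [VBR, "--task", "backup", "--config-file", config_file]


def cron_line(config_file, hour, minute=0, log_file="/tmp/vbr.log"):
    """Return a crontab entry that runs the backup once a day at hour:minute."""
    cmd = " ".join(shlex.quote(part) for part in backup_command(config_file))
    return f"{minute} {hour} * * * {cmd} >> {shlex.quote(log_file)} 2>&1"


if __name__ == "__main__":
    # Hypothetical config file for the VERTW cluster, scheduled for 01:00 daily.
    print(cron_line("/home/dbadmin/vertw_backup.ini", hour=1))
```

Keeping the command as an argument list makes it reusable with `subprocess.run` for ad hoc backups, while the crontab string covers the scheduled case.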

4. System Audit & Optimization

  • Comprehensive Audit Framework:

    • Developed 6-day audit engagement process
    • Created 105+ audit points covering:
      • Technical architecture review
      • Hardware assessment
      • OS/Vertica configuration validation
      • Logical/physical design optimization
      • Performance baseline establishment
  • Optimization Areas:

    • Database designer projections (generic and custom)
    • Application design optimizations (ingestion, staging, transformation)
    • ROS (Read Optimized Store) tuning
    • Environmental optimizations
    • Global query profiling implementation

5. Operational Incidents & Resolutions

  • NFS Mount Crisis:

    • Resolved critical NFS disconnection affecting VERTC and VERTU clusters
    • A reboot of the ETL server (irvetl01) caused a service outage around 5 pm
    • Systematic troubleshooting of mount point failures
    • Restored service through coordinated ETL server reboot and NFS service restart
    • Documented improved procedures for future NFS issues
  • Time Synchronization:

    • Fixed time sync issues on VERTR cluster
    • Established SSH authentication standards

6. Documentation & Knowledge Transfer

  • Comprehensive Documentation Created:

    • Backup and restore procedures
    • Disaster recovery playbooks
    • Monitoring system architecture
    • Audit methodology and best practices
    • Troubleshooting guides for common issues
    • Network configuration documentation
  • Hardware Manifests:

    • Detailed hardware inventory across all clusters
    • Storage performance analysis
    • Disk utilization worksheets

Technical Achievements

  1. High Availability: Implemented robust monitoring across 7 Vertica clusters spanning US and Korea
  2. Scalability: Designed backup strategy supporting 49TB+ data volumes
  3. Reliability: Created failover mechanisms with minimal application downtime
  4. Compliance: Established audit framework ensuring best practice adherence
  5. Performance: Optimized query performance through projection design and profiling

Engagement Metrics

  • Clusters Managed: 7 production and development clusters
  • Geographic Scope: US and Korea datacenters
  • Audit Points: 105+ comprehensive system checks
  • Backup Capacity: 49TB+ storage infrastructure

Strategic Impact

This engagement moved Guess’s Vertica infrastructure from a basic setup to a production-ready system with:

  • Proactive monitoring and alerting
  • Comprehensive backup and disaster recovery capabilities
  • Performance optimization framework
  • Standardized operational procedures
  • Geographic redundancy and failover capabilities

The work established a foundation for scalable, reliable data warehouse operations supporting Guess’s global retail operations, with particular emphasis on critical POS (Point of Sale) applications and store management reporting systems.

Recommendations Implemented

  1. Monitoring: Multi-tier health checking with automated alerting
  2. Backup Strategy: Daily full backups with documented restore procedures
  3. Failover Planning: DNS-based service switching for critical applications
  4. Performance: Query profiling and projection optimization
  5. Documentation: Comprehensive operational runbooks and troubleshooting guides

This engagement represents a complete infrastructure modernization effort that positioned Guess for reliable, scalable data warehouse operations across their global retail footprint.