Database Troubleshooting Guide

Critical Connection Pool Issues

PostgreSQL Connection Pool Exhaustion - FATAL Error

Problem: FATAL: MaxClientsInSessionMode: max clients reached - in Session mode max clients are limited to pool_size

Root Cause: Backend is creating too many concurrent PostgreSQL connections without proper cleanup, exhausting Supabase’s connection pool limits.

Symptoms:

  • All GraphQL queries fail with connection errors
  • Frontend shows “Load failed” or connection timeout errors
  • Backend logs show psycopg.OperationalError
  • Multiple connection attempts to different Supabase pool hosts fail

Investigation:

-- Check active connections in the Supabase dashboard, or query directly:
SELECT count(*) AS active_connections FROM pg_stat_activity;

Immediate Solutions:

  1. Restart the backend service (Vercel will reset connections)
  2. Reduce concurrent queries in frontend temporarily
  3. Implement connection pooling with proper max_overflow and pool_size

Backend Connection Pool Config (Required Fix):

# In app/db_postgres.py or similar
engine = create_async_engine(
    DATABASE_URL,
    pool_size=5,           # Reduce from default
    max_overflow=10,       # Limit overflow connections  
    pool_timeout=30,       # Connection timeout
    pool_recycle=1800,     # Recycle connections every 30min
    pool_pre_ping=True,    # Validate connections
)

Prevention:

  • Use async with engine.begin() pattern consistently
  • Never leave database connections open
  • Monitor connection count in Supabase dashboard
  • Implement connection retry logic with exponential backoff
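
The retry logic mentioned in the last item can be a small wrapper; here is a minimal sketch (the helper name, attempt count, and delays are illustrative, not existing backend code):

import asyncio
import random

from sqlalchemy.exc import OperationalError


async def run_with_retry(operation, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry an async database operation with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except OperationalError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # 0.5s, 1s, 2s, ... plus jitter so concurrent retries don't align
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)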

Data Sync Issues

PostgreSQL/Neo4j Out of Sync - “Load Failed” Errors

Problem: Frontend shows “Load failed” errors when trying to fetch object details. GraphQL queries fail with network errors.

Root Cause: Objects exist in PostgreSQL but not in Neo4j, creating a data sync mismatch. The getObjectDetailsComplete resolver finds objects in PostgreSQL but fails when querying Neo4j for spatial relationships.

Symptoms:

  • Frontend: TypeError: Load failed in Apollo Client
  • Backend: Objects found in PostgreSQL but missing from Neo4j
  • GraphQL queries timeout or crash

Investigation:

// Neo4j: check object count
MATCH (o:Object) RETURN count(o) AS total_objects

-- PostgreSQL: check object count
SELECT COUNT(*) AS total_objects FROM object_instances;

Immediate Fix (Graceful Degradation): The backend now checks that an object exists in Neo4j before querying its relationships. Objects missing from Neo4j still load, using PostgreSQL data only.
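
A minimal sketch of that kind of existence check, following the sync-driver pattern used elsewhere in this guide (function and variable names are illustrative, not the actual resolver code):

def _object_exists_in_neo4j_sync(driver, object_id: str) -> bool:
    """Return True if the object node is present in Neo4j (sync driver)."""
    with driver.session() as session:
        record = session.run(
            "MATCH (o:Object {id: $id}) RETURN count(o) AS n", id=object_id
        ).single()
        return record["n"] > 0

# In the resolver: skip the relationship query and return PostgreSQL-only data
# when the node is missing.
# exists = await asyncio.to_thread(_object_exists_in_neo4j_sync, sync_driver, object_id)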

Permanent Solution (Data Sync): Run the data sync script to populate Neo4j with PostgreSQL objects:

cd Plings-Backend
python sync_data.py
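
The script itself is not reproduced here; conceptually it copies object rows from PostgreSQL into Neo4j nodes, roughly along these lines (environment variable names, columns, and node properties are assumptions for illustration):

import os

from neo4j import GraphDatabase
from sqlalchemy import create_engine, text

# A plain sync engine is fine for a one-off maintenance script
pg = create_engine(os.environ["SUPABASE_DB_URL"])
driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)

with pg.connect() as conn, driver.session() as session:
    for obj_id, name in conn.execute(text("SELECT id, name FROM object_instances")):
        # MERGE is idempotent, so re-running the sync is safe
        session.run(
            "MERGE (o:Object {id: $id}) SET o.name = $name",
            id=str(obj_id), name=name,
        )
driver.close()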

Prevention:

  • Ensure both databases are updated atomically during object creation
  • Monitor data consistency between PostgreSQL and Neo4j
  • Implement regular sync validation checks

Critical Column Name Mismatches (RESOLVED)

Supabase Schema vs Code Mismatches

CRITICAL: Always verify actual database column names before writing queries.

Object Images Table Issues (RESOLVED)

Symptoms:

  • column "object_id" does not exist errors
  • column "image_url" does not exist errors
  • column "created_at" does not exist errors

Root Cause: Code was using incorrect column names that didn’t match the actual Supabase schema.

Affected Files:

  • app/resolvers.py (lines 117-130)
  • app/image_resolvers.py (multiple functions)

Column Name Corrections:

-- ❌ WRONG column names used in code
object_id     -> object_instance_id
image_url     -> public_url  
created_at    -> uploaded_at
neo4j_id      -> terminusdb_id

-- ✅ CORRECT schema (verified via Supabase)
object_instance_id UUID
public_url TEXT
uploaded_at TIMESTAMP
terminusdb_id TEXT
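
For reference, a query against the corrected schema looks like this (the UUID is a placeholder):

-- Fetch images for one object using the verified column names
SELECT object_instance_id, public_url, uploaded_at, terminusdb_id
FROM object_images
WHERE object_instance_id = '00000000-0000-0000-0000-000000000000'
ORDER BY uploaded_at DESC;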

Prevention:

  1. Always verify schema first:
    -- Check actual table structure
    SELECT column_name, data_type, is_nullable
    FROM information_schema.columns 
    WHERE table_name = 'object_images'
    ORDER BY ordinal_position;
    
  2. Use MCP Supabase tools:
    # List all tables and their columns
    mcp_supabase_list_tables --project_id=your_project_id
    
  3. Test queries in Supabase SQL editor before implementing

UUID Format Issues (RESOLVED)

Symptoms:

  • invalid input syntax for type uuid: "dev-user-01"
  • Authentication failures in development

Root Cause: Development authentication was using string IDs instead of proper UUID format.

Solution:

# ❌ WRONG - String IDs
id="dev-user-01"
org_id="dev-org-01"

# ✅ CORRECT - Proper UUIDs
id="89abf882-fecc-4e3c-8347-b94425a62ac8"
org_id="9ba2023b-117f-4346-a809-839a66b8c0d4"

Prevention:

  • Always use proper UUID format for development data
  • Verify UUIDs exist in production database
  • Use uuid.uuid4() to generate valid UUIDs
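
A quick illustration of generating and validating UUIDs for development fixtures:

import uuid

dev_user_id = str(uuid.uuid4())   # always a valid UUID, e.g. '4f6c2b9e-...'
dev_org_id = str(uuid.uuid4())

def is_valid_uuid(value: str) -> bool:
    """Reject string IDs before they reach a UUID column."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False

assert is_valid_uuid(dev_user_id)
assert not is_valid_uuid("dev-user-01")   # the string that caused the original error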

Neo4j Async Event Loop Conflicts (CRITICAL)

Serverless Environment Issue

CRITICAL: Async Neo4j operations cause event loop conflicts in serverless environments like Vercel.

Symptoms:

  • <neo4j._async_compat.concurrency.AsyncRLock object...> is bound to a different event loop
  • Task got Future attached to a different loop errors
  • Neo4j mutations failing in production but working locally

Root Cause: Serverless functions create new event loops for each request, but async Neo4j sessions maintain state tied to the original event loop, causing conflicts when the function execution context changes.

Solution – Use Synchronous Neo4j Operations (✅ Updated 2025-06-29):

# ❌ WRONG - Async Neo4j operations (cause event loop conflicts)
async def resolve_move_object():
    neo4j_driver = info.context["neo4j"]
    async with neo4j_driver.session() as session:
        result = await session.run(query, params)

# ✅ CORRECT - Synchronous Neo4j operations
def _move_object_sync(driver, object_id: str, container_id: str, user_id: str):
    """Synchronous Neo4j operations to avoid event loop conflicts."""
    with driver.session() as session:
        result = session.run(query, params)
        return dict(result.single())

async def resolve_move_object():
    sync_neo4j_driver = info.context["sync_neo4j"]
    await asyncio.to_thread(
        _move_object_sync, sync_neo4j_driver, objectId, newContainerId, user_id
    )

Resolved Issues - Fixed in Production:

  1. getObjectDetailsComplete Resolver (2024-01-15)
    • Problem: Frontend “Load failed” errors when clicking on objects
    • Root Cause: Complex async Neo4j query in object detail resolvers
    • Solution: Converted to _get_object_relationships_sync() function with asyncio.to_thread()
    • Files Fixed: app/object_detail_resolvers.py
  2. Basic Object Resolver (2024-01-15)
    • Problem: Spatial relationship queries failing
    • Solution: Added _get_spatial_relationships_sync() function
    • Files Fixed: app/object_detail_resolvers.py
  3. getSpatialChildren Resolver (2024-01-15)
    • Problem: Spatial dashboard not loading children objects
    • Solution: Added _get_spatial_children_sync() function
    • Files Fixed: app/object_detail_resolvers.py
  4. Neo4j Fallback Function (2024-01-15)
    • Problem: Fallback queries for objects missing from PostgreSQL
    • Solution: Added _get_neo4j_fallback_object_sync() function
    • Files Fixed: app/object_detail_resolvers.py
  5. All Spatial Resolvers (2024-01-15 - LATEST)
    • Problem: “AsyncSession object does not support the context manager protocol” errors
    • Root Cause: Multiple remaining async Neo4j operations in spatial, status, and image resolvers
    • Solution: Complete conversion to sync pattern:
      • _update_spatial_relationships_sync() in spatial resolvers
      • _get_spatial_parent_sync() in spatial resolvers
      • _remove_spatial_relationship_sync() in spatial resolvers
      • _update_object_statuses_sync() in status resolvers
      • Removed async Neo4j enhancement in image resolvers
    • Files Fixed: app/spatial_resolvers.py, app/status_resolvers.py, app/image_resolvers.py

Implementation Pattern:

  1. Create separate _operation_sync() functions for Neo4j operations
  2. Use with driver.session() instead of async with driver.session()
  3. Call via await asyncio.to_thread(sync_function, sync_neo4j_driver, ...)
  4. ALWAYS pass the sync driver (sync_neo4j_driver) to any helper that opens a blocking with driver.session() context manager – passing the async variant (neo4j_driver) will surface the error "AsyncSession object does not support the context manager protocol" even if the helper itself is synchronous.

Quick Check:

# Good – synchronous helper receives the sync driver
spatial_parent, spatial_children = await asyncio.to_thread(
    _get_spatial_relationships_sync, info.context["sync_neo4j"], object_id
)

# Bad – will crash with AsyncSession context-manager error
spatial_parent, spatial_children = await asyncio.to_thread(
    _get_spatial_relationships_sync, info.context["neo4j"], object_id
)

Affected Operations:

  • All spatial relationship mutations (moveObject, createSpatialRelationship)
  • Any GraphQL resolver that performs Neo4j write operations
  • Complex queries that maintain session state

Prevention:

  • Never use async Neo4j sessions in serverless resolvers
  • Always use the established sync pattern with asyncio.to_thread()
  • Test in actual serverless environment, not just locally

Database Connection Issues

PostgreSQL Connection Pool Exhaustion (CRITICAL)

CRITICAL: Supabase PostgreSQL connection pool exhaustion in serverless environments.

Symptoms:

  • FATAL: MaxClientsInSessionMode: max clients reached - in Session mode max clients are limited to pool_size
  • GraphQL queries failing with OperationalError: connection failed
  • Concurrent requests causing 500 errors

Root Cause: Serverless environments like Vercel create multiple concurrent instances, each trying to establish database connections. Default SQLAlchemy pool settings are too aggressive for Supabase’s connection limits.

Solution - Supabase-Optimized Connection Pool (UPDATED 2024-01-15):

# ❌ WRONG - Default pool settings (causes exhaustion)
engine = create_async_engine(pg_dsn, pool_pre_ping=True, echo=False)

# ✅ CORRECT - Supabase-optimized pool (current production settings)
engine = create_async_engine(
    pg_dsn, 
    pool_pre_ping=True, 
    echo=False,
    # Conservative connection pool for Supabase Session mode limits
    pool_size=3,          # Reduced to stay within Supabase limits
    max_overflow=2,       # Minimal overflow to avoid pool exhaustion
    pool_timeout=30,      # Longer timeout for connection acquisition
    pool_recycle=300,     # Shorter recycle to free connections faster (5 min)
    pool_reset_on_return='rollback',  # Aggressive connection reset
    # Additional settings for better connection management
    connect_args={
        "server_settings": {
            "application_name": "plings-backend",
            "jit": "off"  # Disable JIT for faster connection setup
        }
    }
)

Current Pool Settings Explained (Production):

  • pool_size=3: Conservative pool size for Supabase Session mode
  • max_overflow=2: Minimal overflow connections (total max: 5)
  • pool_timeout=30: Longer timeout for better availability
  • pool_recycle=300: Fast connection turnover (5 minutes)
  • application_name: Better connection identification in logs
  • jit=off: Faster connection establishment

Health Check Monitoring: The /health/db endpoint now provides detailed connection pool metrics:

{
  "postgres": {
    "pool_size": 3,
    "checked_out": 2,
    "connection_usage": "2/5",
    "usage_percentage": "40.0%",
    "pool_full": false,
    "warning": null
  }
}
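
The endpoint code itself is not shown here, but these numbers can be derived from SQLAlchemy's pool API; a rough sketch (the total of 5 assumes pool_size=3 + max_overflow=2, and the threshold mirrors the 80% warning rule used for alerting):

def pool_metrics(engine, max_total: int = 5, warn_at: float = 0.8) -> dict:
    """Summarise connection pool usage for a health endpoint."""
    pool = engine.pool                      # AsyncEngine exposes the underlying pool
    checked_out = pool.checkedout()         # connections currently in use
    usage = checked_out / max_total
    return {
        "pool_size": pool.size(),
        "checked_out": checked_out,
        "connection_usage": f"{checked_out}/{max_total}",
        "usage_percentage": f"{usage * 100:.1f}%",
        "pool_full": checked_out >= max_total,
        "warning": "pool usage above 80%" if usage > warn_at else None,
    }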

Prevention:

  1. Monitor Health Endpoint: Check /health/db for pool usage
  2. Watch for Warnings: Alert when usage > 80%
  3. Optimize Query Duration: Use efficient queries to reduce connection hold time
  4. Connection Pattern Optimization: Use async with engine.begin() consistently
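
Item 4's pattern in practice: async with engine.begin() checks a connection out, commits or rolls back, and always returns it to the pool.

from sqlalchemy import text

async def count_objects(engine) -> int:
    # Connection is returned to the pool even if the query raises
    async with engine.begin() as conn:
        result = await conn.execute(text("SELECT count(*) FROM object_instances"))
        return result.scalar_one()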

Emergency Pool Size Reduction:

# If experiencing immediate outage, further reduce pool size
pool_size=2, max_overflow=1  # Total max: 3 connections per instance

PostgreSQL Connection String Format

CRITICAL: Async SQLAlchemy requires specific connection string format.

# ❌ WRONG - Standard PostgreSQL format
postgresql://user:pass@host:port/db

# ✅ CORRECT - Async psycopg format
postgresql+psycopg://user:pass@host:port/db

Implementation:

# Auto-convert in settings
pg_dsn = settings.supabase_db_url
if pg_dsn.startswith("postgresql://"):
    pg_dsn = pg_dsn.replace("postgresql://", "postgresql+psycopg://", 1)

engine = create_async_engine(pg_dsn, pool_pre_ping=True)

Neo4j Connection Issues

Common Problems:

  1. URI format: Must use bolt:// or neo4j://
  2. Authentication: Verify username/password
  3. Database selection: Specify correct database name

# Verify Neo4j connection (settings comes from the app's configuration module)
from neo4j import GraphDatabase

def test_neo4j_connection():
    driver = GraphDatabase.driver(
        settings.neo4j_uri,
        auth=(settings.neo4j_username, settings.neo4j_password)
    )
    try:
        with driver.session(database=settings.neo4j_database) as session:
            result = session.run("RETURN 1 as test")
            print(result.single()["test"])  # Should print: 1
    finally:
        driver.close()  # Release sockets even if the query fails

Debugging Workflow

1. Verify Database Schema

-- Check table exists
SELECT table_name FROM information_schema.tables 
WHERE table_schema = 'public' AND table_name = 'your_table';

-- Check column names and types
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns 
WHERE table_name = 'your_table'
ORDER BY ordinal_position;

2. Test Individual Queries

# Test query in isolation
from sqlalchemy import text

async def test_query():
    async with get_db_session() as session:  # app-provided async session factory
        result = await session.execute(
            text("SELECT object_instance_id, public_url FROM object_images LIMIT 1")
        )
        row = result.fetchone()
        print(f"Row: {row}")

3. Check Query Logs

# Enable SQL query logging
engine = create_async_engine(
    pg_dsn, 
    pool_pre_ping=True, 
    echo=True  # This logs all SQL queries
)

4. Verify Data Types

-- Check for type mismatches
SELECT 
    column_name,
    data_type,
    character_maximum_length,
    is_nullable
FROM information_schema.columns 
WHERE table_name = 'your_table';

Prevention Checklist

Before Writing Database Code:

  • Verify table exists in target database
  • Check actual column names (not assumptions)
  • Verify data types match expected usage
  • Test queries in database console first
  • Use proper UUID format for UUID columns
  • Verify foreign key relationships exist

Before Deployment:

  • Test all database queries in staging
  • Verify connection strings are correct format
  • Check environment variables are set
  • Test both PostgreSQL and Neo4j connections
  • Verify authentication works with real user data

Common Patterns to Avoid:

  • Don’t assume column names without verification
  • Don’t use string IDs where UUIDs are expected
  • Don’t mix sync/async database operations
  • Don’t forget to handle connection errors gracefully
  • Don’t skip testing queries in database console

Schema Verification Commands

Supabase (PostgreSQL)

-- List all tables
SELECT table_name FROM information_schema.tables 
WHERE table_schema = 'public';

-- Get table structure (psql meta-command)
\d+ table_name

-- Check constraints
SELECT * FROM information_schema.table_constraints 
WHERE table_name = 'your_table';

Neo4j

// List all node labels
CALL db.labels();

// List all relationship types
CALL db.relationshipTypes();

// Get schema overview
CALL db.schema.visualization();

Remember: Always verify the actual database schema before writing code. Assumptions about column names are the most common source of database errors.

Supabase Session-Mode Limits in Serverless Deployments (UPDATED 2025-06-29)

Even with the conservative pool (3 + 2 overflow) recommended earlier, a serverless platform such as Vercel can still exhaust the Session-mode cap because each lambda instance maintains its own pool. One heavy user running stress tests can spawn enough lambdas to blow past the limit.

Why it happens

Trigger                                      | Effect
Vercel auto-scales many short-lived lambdas  | every lambda initialises its own SQLAlchemy pool
Pool was 3 + 2 overflow                      | 5 sockets per lambda
5 concurrent lambdas                         | 25 sockets; the Session-mode limit is reached

Plings is now on the Supabase Pro plan (≈500 realtime concurrent connections), but Session-mode pooling is still stricter (usually 20-25). Therefore we need to minimise per-lambda usage.

Definitive Fix

  1. Singleton engine – The backend exposes get_pg_engine() in app/db_postgres.py. Each process calls it once and re-uses the returned engine everywhere.

     # dsn and create_async_engine come from the surrounding module
     _engine = None

     def get_pg_engine():
         global _engine
         if _engine is None:
             _engine = create_async_engine(dsn,
                 pool_size=1,        # base pool
                 max_overflow=1)     # total max 2 sockets
         return _engine

  2. Per-event-loop engine with NullPool – get_pg_engine() now caches one AsyncEngine per Python event loop (keyed by id(asyncio.get_event_loop())). Each engine uses NullPool, so every statement opens and closes its own socket (see the sketch after this list). This prevents both Session-mode exhaustion and cross-loop lock errors like <asyncio.locks.Lock …> is bound to a different event loop.

  3. (Recommended) Use Supabase PgBouncer – point supabase_db_url at the connection-pooling port (6543) so thousands of client sockets can share ~20 backend sessions.

  4. Health-check circuit breaker – /health/db now returns 503 when usage_percentage > 80%, signalling Vercel to shed load before Supabase returns FATAL errors.
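
A condensed sketch of the per-event-loop variant from item 2 (the real implementation lives in app/db_postgres.py; the dsn parameter here is a simplification):

import asyncio

from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine
from sqlalchemy.pool import NullPool

_engines: dict[int, AsyncEngine] = {}

def get_pg_engine(dsn: str) -> AsyncEngine:
    """Return the engine for the current event loop, creating it on first use."""
    # One engine per loop avoids cross-loop lock errors in serverless runtimes;
    # NullPool opens and closes a socket per statement, so nothing lingers.
    loop_id = id(asyncio.get_event_loop())
    if loop_id not in _engines:
        _engines[loop_id] = create_async_engine(dsn, poolclass=NullPool)
    return _engines[loop_id]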

After-action checklist

  • Centralised engine in db_postgres.py.
  • Re-wired main.py and graphql.py to call the singleton instead of building their own engine.
  • Replaced pool with NullPool (no persistent sockets) – opens/closes per transaction.
  • Updated docs (this section).

Result: Even the GraphQL debugger’s five concurrent queries no longer trigger `MaxClientsInSessionMode` errors.

Spatial Predicates Migration (NORMAL_/CURRENT_ System)

Status: ✅ COMPLETE
Migration Date: 2025-07-03 08:17:03
Scope: Backend + Frontend

Migration Summary

The Plings system has successfully migrated from legacy spatial predicates to a dual predicate system:

  • NORMAL_ predicates: Where objects should be (designated locations)
  • CURRENT_ predicates: Where objects currently are (real-time locations - future implementation)
  • Legacy predicates: Maintained for backward compatibility
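
As an illustration, a designated ("should be") location is stored with a NORMAL_ relationship; NORMAL_CONTAINS below is just one plausible member of that predicate family, not a confirmed name:

// Record that the container is the object's designated location
MATCH (c:Object {id: $containerId}), (o:Object {id: $objectId})
MERGE (c)-[:NORMAL_CONTAINS]->(o)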

Components Updated

✅ Backend (Completed 2025-07-03 08:17:03)

  • Supabase: Applied dual spatial predicates migration (27 total predicates)
  • Neo4j: Migrated existing relationships to NORMAL_ predicates
  • GraphQL Schema: Enhanced SpatialRelationshipType enum with NORMAL_, CURRENT_, and legacy predicates
  • Resolvers: All spatial resolvers updated with predicate mapping and backward compatibility
  • Object Creation: New objects use NORMAL_ predicates by default

✅ Frontend (Completed 2025-07-03 08:17:03)

  • Type Definitions: Updated to support all predicate variants
  • UI Components: Drag-and-drop now creates NORMAL_ predicates
  • GraphQL Integration: Fully compatible with enhanced backend schema
  • Documentation: Comprehensive migration guide created for developers
  • Backward Compatibility: Existing code continues to work unchanged

Key Benefits Achieved

  1. Enhanced Inventory Management: Foundation for misplacement detection
  2. Future-Proofing: System ready for real-time location tracking
  3. Zero Downtime: Migration completed without breaking changes
  4. Developer Experience: Clear migration guides and backward compatibility

Next Steps

  1. CURRENT_ Predicate Implementation: Add real-time location tracking
  2. Misplacement Detection: Compare NORMAL_ vs CURRENT_ predicates (see the sketch after this list)
  3. Advanced UI Features: Location auditing and correction workflows
  4. Automated Organization: Guide users to return objects to designated locations
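
A rough Cypher sketch of the comparison behind misplacement detection, once CURRENT_ relationships exist (the predicate names are assumptions):

// Objects whose current container differs from their designated one
MATCH (designated:Object)-[:NORMAL_CONTAINS]->(o:Object)
MATCH (actual:Object)-[:CURRENT_CONTAINS]->(o)
WHERE designated.id <> actual.id
RETURN o.id AS object, designated.id AS should_be_in, actual.id AS currently_in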

For detailed implementation information, see: