Database Troubleshooting Guide

Critical Connection Pool Issues

PostgreSQL Connection Pool Exhaustion - FATAL Error

Problem: FATAL: MaxClientsInSessionMode: max clients reached - in Session mode max clients are limited to pool_size

Root Cause: Backend is creating too many concurrent PostgreSQL connections without proper cleanup, exhausting Supabase’s connection pool limits.

Symptoms:

  • All GraphQL queries fail with connection errors
  • Frontend shows “Load failed” or connection timeout errors
  • Backend logs show psycopg.OperationalError
  • Multiple connection attempts to different Supabase pool hosts fail

Investigation:

-- Check active connections in the Supabase dashboard, or query directly:
SELECT count(*) AS active_connections FROM pg_stat_activity;

Immediate Solutions:

  1. Restart the backend service (Vercel will reset connections)
  2. Reduce concurrent queries in frontend temporarily
  3. Implement connection pooling with proper max_overflow and pool_size

Backend Connection Pool Config (Required Fix):

# In app/db_postgres.py or similar
engine = create_async_engine(
    DATABASE_URL,
    pool_size=5,           # Reduce from default
    max_overflow=10,       # Limit overflow connections  
    pool_timeout=30,       # Connection timeout
    pool_recycle=1800,     # Recycle connections every 30min
    pool_pre_ping=True,    # Validate connections
)

Prevention:

  • Use async with engine.begin() pattern consistently
  • Never leave database connections open
  • Monitor connection count in Supabase dashboard
  • Implement connection retry logic with exponential backoff
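
The retry logic mentioned in the last item can be a small wrapper; here is a minimal sketch (the helper name, attempt count, and delays are illustrative, not existing backend code):

import asyncio
import random

from sqlalchemy.exc import OperationalError


async def run_with_retry(operation, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry an async database operation with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except OperationalError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # 0.5s, 1s, 2s, ... plus jitter so concurrent retries don't align
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)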

Data Sync Issues

PostgreSQL/Neo4j Out of Sync - “Load Failed” Errors

Problem: Frontend shows “Load failed” errors when trying to fetch object details. GraphQL queries fail with network errors.

Root Cause: Objects exist in PostgreSQL but not in Neo4j, creating a data sync mismatch. The getObjectDetailsComplete resolver finds objects in PostgreSQL but fails when querying Neo4j for spatial relationships.

Symptoms:

  • Frontend: TypeError: Load failed in Apollo Client
  • Backend: Objects found in PostgreSQL but missing from Neo4j
  • GraphQL queries timeout or crash

Investigation:

// Neo4j: check object count
MATCH (o:Object) RETURN count(o) AS total_objects

-- PostgreSQL: check object count
SELECT COUNT(*) AS total_objects FROM object_instances;

Immediate Fix (Graceful Degradation): The backend now checks that an object exists in Neo4j before querying its relationships. Objects missing from Neo4j still load, using PostgreSQL data only.
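
A minimal sketch of that kind of existence check, following the sync-driver pattern used elsewhere in this guide (function and variable names are illustrative, not the actual resolver code):

def _object_exists_in_neo4j_sync(driver, object_id: str) -> bool:
    """Return True if the object node is present in Neo4j (sync driver)."""
    with driver.session() as session:
        record = session.run(
            "MATCH (o:Object {id: $id}) RETURN count(o) AS n", id=object_id
        ).single()
        return record["n"] > 0

# In the resolver: skip the relationship query and return PostgreSQL-only data
# when the node is missing.
# exists = await asyncio.to_thread(_object_exists_in_neo4j_sync, sync_driver, object_id)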

Permanent Solution (Data Sync): Run the data sync script to populate Neo4j with PostgreSQL objects:

cd Plings-Backend
python sync_data.py
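
The script itself is not reproduced here; conceptually it copies object rows from PostgreSQL into Neo4j nodes, roughly along these lines (environment variable names, columns, and node properties are assumptions for illustration):

import os

from neo4j import GraphDatabase
from sqlalchemy import create_engine, text

# A plain sync engine is fine for a one-off maintenance script
pg = create_engine(os.environ["SUPABASE_DB_URL"])
driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)

with pg.connect() as conn, driver.session() as session:
    for obj_id, name in conn.execute(text("SELECT id, name FROM object_instances")):
        # MERGE is idempotent, so re-running the sync is safe
        session.run(
            "MERGE (o:Object {id: $id}) SET o.name = $name",
            id=str(obj_id), name=name,
        )
driver.close()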

Prevention:

  • Ensure both databases are updated atomically during object creation
  • Monitor data consistency between PostgreSQL and Neo4j
  • Implement regular sync validation checks

Critical Column Name Mismatches (RESOLVED)

Supabase Schema vs Code Mismatches

CRITICAL: Always verify actual database column names before writing queries.

Object Images Table Issues (RESOLVED)

Symptoms:

  • column "object_id" does not exist errors
  • column "image_url" does not exist errors
  • column "created_at" does not exist errors

Root Cause: Code was using incorrect column names that didn’t match the actual Supabase schema.

Affected Files:

  • app/resolvers.py (lines 117-130)
  • app/image_resolvers.py (multiple functions)

Column Name Corrections:

-- ❌ WRONG column names used in code
object_id     -> object_instance_id
image_url     -> public_url  
created_at    -> uploaded_at
neo4j_id      -> terminusdb_id

-- ✅ CORRECT schema (verified via Supabase)
object_instance_id UUID
public_url TEXT
uploaded_at TIMESTAMP
terminusdb_id TEXT
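
For reference, a query against the corrected schema looks like this (the UUID is a placeholder):

-- Fetch images for one object using the verified column names
SELECT object_instance_id, public_url, uploaded_at, terminusdb_id
FROM object_images
WHERE object_instance_id = '00000000-0000-0000-0000-000000000000'
ORDER BY uploaded_at DESC;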

Prevention:

  1. Always verify schema first:
    -- Check actual table structure
    SELECT column_name, data_type, is_nullable
    FROM information_schema.columns 
    WHERE table_name = 'object_images'
    ORDER BY ordinal_position;
    
  2. Use MCP Supabase tools:
    # List all tables and their columns
    mcp_supabase_list_tables --project_id=your_project_id
    
  3. Test queries in Supabase SQL editor before implementing

UUID Format Issues (RESOLVED)

Symptoms:

  • invalid input syntax for type uuid: "dev-user-01"
  • Authentication failures in development

Root Cause: Development authentication was using string IDs instead of proper UUID format.

Solution:

# ❌ WRONG - String IDs
id="dev-user-01"
org_id="dev-org-01"

# ✅ CORRECT - Proper UUIDs
id="89abf882-fecc-4e3c-8347-b94425a62ac8"
org_id="9ba2023b-117f-4346-a809-839a66b8c0d4"

Prevention:

  • Always use proper UUID format for development data
  • Verify UUIDs exist in production database
  • Use uuid.uuid4() to generate valid UUIDs
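
A quick illustration of generating and validating UUIDs for development fixtures:

import uuid

dev_user_id = str(uuid.uuid4())   # always a valid UUID, e.g. '4f6c2b9e-...'
dev_org_id = str(uuid.uuid4())

def is_valid_uuid(value: str) -> bool:
    """Reject string IDs before they reach a UUID column."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False

assert is_valid_uuid(dev_user_id)
assert not is_valid_uuid("dev-user-01")   # the string that caused the original error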

Neo4j Async Event Loop Conflicts (CRITICAL)

Serverless Environment Issue

CRITICAL: Async Neo4j operations cause event loop conflicts in serverless environments like Vercel.

Symptoms:

  • <neo4j._async_compat.concurrency.AsyncRLock object...> is bound to a different event loop
  • Task got Future attached to a different loop errors
  • Neo4j mutations failing in production but working locally

Root Cause: Serverless functions create new event loops for each request, but async Neo4j sessions maintain state tied to the original event loop, causing conflicts when the function execution context changes.

Solution – Use Synchronous Neo4j Operations (✅ Updated 2025-06-29):

# ❌ WRONG - Async Neo4j operations (cause event loop conflicts)
async def resolve_move_object():
    neo4j_driver = info.context["neo4j"]
    async with neo4j_driver.session() as session:
        result = await session.run(query, params)

# ✅ CORRECT - Synchronous Neo4j operations
def _move_object_sync(driver, object_id: str, container_id: str, user_id: str):
    """Synchronous Neo4j operations to avoid event loop conflicts."""
    with driver.session() as session:
        result = session.run(query, params)
        return dict(result.single())

async def resolve_move_object():
    sync_neo4j_driver = info.context["sync_neo4j"]
    await asyncio.to_thread(
        _move_object_sync, sync_neo4j_driver, objectId, newContainerId, user_id
    )

Resolved Issues - Fixed in Production:

  1. getObjectDetailsComplete Resolver (2024-01-15)
    • Problem: Frontend “Load failed” errors when clicking on objects
    • Root Cause: Complex async Neo4j query in object detail resolvers
    • Solution: Converted to _get_object_relationships_sync() function with asyncio.to_thread()
    • Files Fixed: app/object_detail_resolvers.py
  2. Basic Object Resolver (2024-01-15)
    • Problem: Spatial relationship queries failing
    • Solution: Added _get_spatial_relationships_sync() function
    • Files Fixed: app/object_detail_resolvers.py
  3. getSpatialChildren Resolver (2024-01-15)
    • Problem: Spatial dashboard not loading children objects
    • Solution: Added _get_spatial_children_sync() function
    • Files Fixed: app/object_detail_resolvers.py
  4. Neo4j Fallback Function (2024-01-15)
    • Problem: Fallback queries for objects missing from PostgreSQL
    • Solution: Added _get_neo4j_fallback_object_sync() function
    • Files Fixed: app/object_detail_resolvers.py
  5. All Spatial Resolvers (2024-01-15 - LATEST)
    • Problem: “AsyncSession object does not support the context manager protocol” errors
    • Root Cause: Multiple remaining async Neo4j operations in spatial, status, and image resolvers
    • Solution: Complete conversion to sync pattern:
      • _update_spatial_relationships_sync() in spatial resolvers
      • _get_spatial_parent_sync() in spatial resolvers
      • _remove_spatial_relationship_sync() in spatial resolvers
      • _update_object_statuses_sync() in status resolvers
      • Removed async Neo4j enhancement in image resolvers
    • Files Fixed: app/spatial_resolvers.py, app/status_resolvers.py, app/image_resolvers.py

Implementation Pattern:

  1. Create separate _operation_sync() functions for Neo4j operations
  2. Use with driver.session() instead of async with driver.session()
  3. Call via await asyncio.to_thread(sync_function, sync_neo4j_driver, ...)
  4. ALWAYS pass the sync driver (sync_neo4j_driver) to any helper that opens a blocking with driver.session() context manager – passing the async variant (neo4j_driver) will surface the error "AsyncSession object does not support the context manager protocol" even if the helper itself is synchronous.

Quick Check:

# Good – synchronous helper receives the sync driver
spatial_parent, spatial_children = await asyncio.to_thread(
    _get_spatial_relationships_sync, info.context["sync_neo4j"], object_id
)

# Bad – will crash with AsyncSession context-manager error
spatial_parent, spatial_children = await asyncio.to_thread(
    _get_spatial_relationships_sync, info.context["neo4j"], object_id
)

Affected Operations:

  • All spatial relationship mutations (moveObject, createSpatialRelationship)
  • Any GraphQL resolver that performs Neo4j write operations
  • Complex queries that maintain session state

Prevention:

  • Never use async Neo4j sessions in serverless resolvers
  • Always use the established sync pattern with asyncio.to_thread()
  • Test in actual serverless environment, not just locally

Database Connection Issues

PostgreSQL Connection Pool Exhaustion (CRITICAL)

CRITICAL: Supabase PostgreSQL connection pool exhaustion in serverless environments.

Symptoms:

  • FATAL: MaxClientsInSessionMode: max clients reached - in Session mode max clients are limited to pool_size
  • GraphQL queries failing with OperationalError: connection failed
  • Concurrent requests causing 500 errors

Root Cause: Serverless environments like Vercel create multiple concurrent instances, each trying to establish database connections. Default SQLAlchemy pool settings are too aggressive for Supabase’s connection limits.

Solution - Supabase-Optimized Connection Pool (UPDATED 2024-01-15):

# ❌ WRONG - Default pool settings (causes exhaustion)
engine = create_async_engine(pg_dsn, pool_pre_ping=True, echo=False)

# ✅ CORRECT - Supabase-optimized pool (current production settings)
engine = create_async_engine(
    pg_dsn, 
    pool_pre_ping=True, 
    echo=False,
    # Conservative connection pool for Supabase Session mode limits
    pool_size=3,          # Reduced to stay within Supabase limits
    max_overflow=2,       # Minimal overflow to avoid pool exhaustion
    pool_timeout=30,      # Longer timeout for connection acquisition
    pool_recycle=300,     # Shorter recycle to free connections faster (5 min)
    pool_reset_on_return='rollback',  # Aggressive connection reset
    # Additional settings for better connection management
    connect_args={
        "server_settings": {
            "application_name": "plings-backend",
            "jit": "off"  # Disable JIT for faster connection setup
        }
    }
)

Current Pool Settings Explained (Production):

  • pool_size=3: Conservative pool size for Supabase Session mode
  • max_overflow=2: Minimal overflow connections (total max: 5)
  • pool_timeout=30: Longer timeout for better availability
  • pool_recycle=300: Fast connection turnover (5 minutes)
  • application_name: Better connection identification in logs
  • jit=off: Faster connection establishment

Health Check Monitoring: The /health/db endpoint now provides detailed connection pool metrics:

{
  "postgres": {
    "pool_size": 3,
    "checked_out": 2,
    "connection_usage": "2/5",
    "usage_percentage": "40.0%",
    "pool_full": false,
    "warning": null
  }
}
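
The endpoint code itself is not shown here, but these numbers can be derived from SQLAlchemy's pool API; a rough sketch (the total of 5 assumes pool_size=3 + max_overflow=2, and the threshold mirrors the 80% warning rule used for alerting):

def pool_metrics(engine, max_total: int = 5, warn_at: float = 0.8) -> dict:
    """Summarise connection pool usage for a health endpoint."""
    pool = engine.pool                      # AsyncEngine exposes the underlying pool
    checked_out = pool.checkedout()         # connections currently in use
    usage = checked_out / max_total
    return {
        "pool_size": pool.size(),
        "checked_out": checked_out,
        "connection_usage": f"{checked_out}/{max_total}",
        "usage_percentage": f"{usage * 100:.1f}%",
        "pool_full": checked_out >= max_total,
        "warning": "pool usage above 80%" if usage > warn_at else None,
    }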

Prevention:

  1. Monitor Health Endpoint: Check /health/db for pool usage
  2. Watch for Warnings: Alert when usage > 80%
  3. Optimize Query Duration: Use efficient queries to reduce connection hold time
  4. Connection Pattern Optimization: Use async with engine.begin() consistently
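
Item 4's pattern in practice: async with engine.begin() checks a connection out, commits or rolls back, and always returns it to the pool.

from sqlalchemy import text

async def count_objects(engine) -> int:
    # Connection is returned to the pool even if the query raises
    async with engine.begin() as conn:
        result = await conn.execute(text("SELECT count(*) FROM object_instances"))
        return result.scalar_one()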

Emergency Pool Size Reduction:

# If experiencing immediate outage, further reduce pool size
pool_size=2, max_overflow=1  # Total max: 3 connections per instance

PostgreSQL Connection String Format

CRITICAL: Async SQLAlchemy requires specific connection string format.

# ❌ WRONG - Standard PostgreSQL format
postgresql://user:pass@host:port/db

# ✅ CORRECT - Async psycopg format
postgresql+psycopg://user:pass@host:port/db

Implementation:

# Auto-convert in settings
pg_dsn = settings.supabase_db_url
if pg_dsn.startswith("postgresql://"):
    pg_dsn = pg_dsn.replace("postgresql://", "postgresql+psycopg://", 1)

engine = create_async_engine(pg_dsn, pool_pre_ping=True)

Neo4j Connection Issues

Common Problems:

  1. URI format: Must use bolt:// or neo4j://
  2. Authentication: Verify username/password
  3. Database selection: Specify correct database name

# Verify Neo4j connection (settings comes from the app's configuration module)
from neo4j import GraphDatabase

def test_neo4j_connection():
    driver = GraphDatabase.driver(
        settings.neo4j_uri,
        auth=(settings.neo4j_username, settings.neo4j_password)
    )
    try:
        with driver.session(database=settings.neo4j_database) as session:
            result = session.run("RETURN 1 as test")
            print(result.single()["test"])  # Should print: 1
    finally:
        driver.close()  # Release sockets even if the query fails

Debugging Workflow

1. Verify Database Schema

-- Check table exists
SELECT table_name FROM information_schema.tables 
WHERE table_schema = 'public' AND table_name = 'your_table';

-- Check column names and types
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns 
WHERE table_name = 'your_table'
ORDER BY ordinal_position;

2. Test Individual Queries

# Test query in isolation
from sqlalchemy import text

async def test_query():
    async with get_db_session() as session:  # app-provided async session factory
        result = await session.execute(
            text("SELECT object_instance_id, public_url FROM object_images LIMIT 1")
        )
        row = result.fetchone()
        print(f"Row: {row}")

3. Check Query Logs

# Enable SQL query logging
engine = create_async_engine(
    pg_dsn, 
    pool_pre_ping=True, 
    echo=True  # This logs all SQL queries
)

4. Verify Data Types

-- Check for type mismatches
SELECT 
    column_name,
    data_type,
    character_maximum_length,
    is_nullable
FROM information_schema.columns 
WHERE table_name = 'your_table';

Prevention Checklist

Before Writing Database Code:

  • Verify table exists in target database
  • Check actual column names (not assumptions)
  • Verify data types match expected usage
  • Test queries in database console first
  • Use proper UUID format for UUID columns
  • Verify foreign key relationships exist

Before Deployment:

  • Test all database queries in staging
  • Verify connection strings are correct format
  • Check environment variables are set
  • Test both PostgreSQL and Neo4j connections
  • Verify authentication works with real user data

Common Patterns to Avoid:

  • Don’t assume column names without verification
  • Don’t use string IDs where UUIDs are expected
  • Don’t mix sync/async database operations
  • Don’t forget to handle connection errors gracefully
  • Don’t skip testing queries in database console

Schema Verification Commands

Supabase (PostgreSQL)

-- List all tables
SELECT table_name FROM information_schema.tables 
WHERE table_schema = 'public';

-- Get table structure (psql meta-command)
\d+ table_name

-- Check constraints
SELECT * FROM information_schema.table_constraints 
WHERE table_name = 'your_table';

Neo4j

// List all node labels
CALL db.labels();

// List all relationship types
CALL db.relationshipTypes();

// Get schema overview
CALL db.schema.visualization();

Remember: Always verify the actual database schema before writing code. Assumptions about column names are the most common source of database errors.

Supabase Session-Mode Limits in Serverless Deployments (UPDATED 2025-06-29)

Even with the conservative pool (3 + 2 overflow) recommended earlier, a serverless platform such as Vercel can still exhaust the Session-mode cap because each lambda instance maintains its own pool. One heavy user running stress tests can spawn enough lambdas to blow past the limit.

Why it happens

Trigger                                      | Effect
Vercel auto-scales many short-lived lambdas  | every lambda initialises its own SQLAlchemy pool
Pool was 3 + 2 overflow                      | 5 sockets per lambda
5 concurrent lambdas                         | 25 sockets; the Session-mode limit is reached

Plings is now on the Supabase Pro plan (≈500 realtime concurrent connections), but Session-mode pooling is still stricter (usually 20-25). Therefore we need to minimise per-lambda usage.

Definitive Fix

  1. Singleton engine – The backend exposes get_pg_engine() in app/db_postgres.py. Each process calls it once and re-uses the returned engine everywhere.

     # dsn and create_async_engine come from the surrounding module
     _engine = None

     def get_pg_engine():
         global _engine
         if _engine is None:
             _engine = create_async_engine(dsn,
                 pool_size=1,        # base pool
                 max_overflow=1)     # total max 2 sockets
         return _engine

  2. Per-event-loop engine with NullPool – get_pg_engine() now caches one AsyncEngine per Python event loop (keyed by id(asyncio.get_event_loop())). Each engine uses NullPool, so every statement opens and closes its own socket (see the sketch after this list). This prevents both Session-mode exhaustion and cross-loop lock errors like <asyncio.locks.Lock …> is bound to a different event loop.

  3. (Recommended) Use Supabase PgBouncer – point supabase_db_url at the connection-pooling port (6543) so thousands of client sockets can share ~20 backend sessions.

  4. Health-check circuit breaker – /health/db now returns 503 when usage_percentage > 80%, signalling Vercel to shed load before Supabase returns FATAL errors.
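
A condensed sketch of the per-event-loop variant from item 2 (the real implementation lives in app/db_postgres.py; the dsn parameter here is a simplification):

import asyncio

from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine
from sqlalchemy.pool import NullPool

_engines: dict[int, AsyncEngine] = {}

def get_pg_engine(dsn: str) -> AsyncEngine:
    """Return the engine for the current event loop, creating it on first use."""
    # One engine per loop avoids cross-loop lock errors in serverless runtimes;
    # NullPool opens and closes a socket per statement, so nothing lingers.
    loop_id = id(asyncio.get_event_loop())
    if loop_id not in _engines:
        _engines[loop_id] = create_async_engine(dsn, poolclass=NullPool)
    return _engines[loop_id]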

After-action checklist

  • Centralised engine in db_postgres.py.
  • Re-wired main.py and graphql.py to call the singleton instead of building their own engine.
  • Replaced pool with NullPool (no persistent sockets) – opens/closes per transaction.
  • Updated docs (this section).

Result: Even the GraphQL debugger’s five concurrent queries no longer trigger `MaxClientsInSessionMode` errors.

Spatial Predicates Migration (NORMAL_/CURRENT_ System)

Status: ✅ COMPLETE
Migration Date: 2025-07-03 08:17:03
Scope: Backend + Frontend

Migration Summary

The Plings system has successfully migrated from legacy spatial predicates to a dual predicate system:

  • NORMAL_ predicates: Where objects should be (designated locations)
  • CURRENT_ predicates: Where objects currently are (real-time locations - future implementation)
  • Legacy predicates: Maintained for backward compatibility
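
As an illustration, a designated ("should be") location is stored with a NORMAL_ relationship; NORMAL_CONTAINS below is just one plausible member of that predicate family, not a confirmed name:

// Record that the container is the object's designated location
MATCH (c:Object {id: $containerId}), (o:Object {id: $objectId})
MERGE (c)-[:NORMAL_CONTAINS]->(o)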

Components Updated

✅ Backend (Completed 2025-07-03 08:17:03)

  • Supabase: Applied dual spatial predicates migration (27 total predicates)
  • Neo4j: Migrated existing relationships to NORMAL_ predicates
  • GraphQL Schema: Enhanced SpatialRelationshipType enum with NORMAL_, CURRENT_, and legacy predicates
  • Resolvers: All spatial resolvers updated with predicate mapping and backward compatibility
  • Object Creation: New objects use NORMAL_ predicates by default

✅ Frontend (Completed 2025-07-03 08:17:03)

  • Type Definitions: Updated to support all predicate variants
  • UI Components: Drag-and-drop now creates NORMAL_ predicates
  • GraphQL Integration: Fully compatible with enhanced backend schema
  • Documentation: Comprehensive migration guide created for developers
  • Backward Compatibility: Existing code continues to work unchanged

Key Benefits Achieved

  1. Enhanced Inventory Management: Foundation for misplacement detection
  2. Future-Proofing: System ready for real-time location tracking
  3. Zero Downtime: Migration completed without breaking changes
  4. Developer Experience: Clear migration guides and backward compatibility

Next Steps

  1. CURRENT_ Predicate Implementation: Add real-time location tracking
  2. Misplacement Detection: Compare NORMAL_ vs CURRENT_ predicates (see the sketch after this list)
  3. Advanced UI Features: Location auditing and correction workflows
  4. Automated Organization: Guide users to return objects to designated locations
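
A rough Cypher sketch of the comparison behind misplacement detection, once CURRENT_ relationships exist (the predicate names are assumptions):

// Objects whose current container differs from their designated one
MATCH (designated:Object)-[:NORMAL_CONTAINS]->(o:Object)
MATCH (actual:Object)-[:CURRENT_CONTAINS]->(o)
WHERE designated.id <> actual.id
RETURN o.id AS object, designated.id AS should_be_in, actual.id AS currently_in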

For detailed implementation information, see: