Database Troubleshooting Guide
Critical Connection Pool Issues
PostgreSQL Connection Pool Exhaustion - FATAL Error
Problem: `FATAL: MaxClientsInSessionMode: max clients reached - in Session mode max clients are limited to pool_size`
Root Cause: Backend is creating too many concurrent PostgreSQL connections without proper cleanup, exhausting Supabase’s connection pool limits.
Symptoms:
- All GraphQL queries fail with connection errors
- Frontend shows “Load failed” or connection timeout errors
- Backend logs show `psycopg.OperationalError`; multiple connection attempts to different Supabase pool hosts fail
Investigation:
```sql
-- Check active connections in the Supabase dashboard, or query directly:
SELECT count(*) AS active_connections FROM pg_stat_activity;
```
Immediate Solutions:
- Restart the backend service (Vercel will reset connections)
- Reduce concurrent queries in frontend temporarily
- Implement connection pooling with proper `max_overflow` and `pool_size`
Backend Connection Pool Config (Required Fix):
```python
# In app/db_postgres.py or similar
engine = create_async_engine(
    DATABASE_URL,
    pool_size=5,         # Reduce from default
    max_overflow=10,     # Limit overflow connections
    pool_timeout=30,     # Connection timeout
    pool_recycle=1800,   # Recycle connections every 30 min
    pool_pre_ping=True,  # Validate connections
)
```
Prevention:
- Use the `async with engine.begin()` pattern consistently
- Never leave database connections open
- Monitor connection count in the Supabase dashboard
- Implement connection retry logic with exponential backoff (a minimal sketch follows this list)
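The codebase's actual retry helper isn't shown in this guide; below is a minimal sketch of exponential backoff with jitter around an async database call. The helper name, attempt count, and delay values are illustrative assumptions.

```python
import asyncio
import random

from psycopg import OperationalError  # raised when connections fail

# Hypothetical helper - not part of the codebase.
async def with_backoff(run_query, max_attempts: int = 5, base_delay: float = 0.5):
    for attempt in range(max_attempts):
        try:
            return await run_query()
        except OperationalError:
            if attempt == max_attempts - 1:
                raise  # Out of retries; let the caller handle it
            # Exponential backoff with jitter: ~0.5s, 1s, 2s, 4s...
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```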
Data Sync Issues
PostgreSQL/Neo4j Out of Sync - “Load Failed” Errors
Problem: Frontend shows “Load failed” errors when trying to fetch object details. GraphQL queries fail with network errors.
Root Cause: Objects exist in PostgreSQL but not in Neo4j, creating a data sync mismatch. The getObjectDetailsComplete resolver finds objects in PostgreSQL but fails when querying Neo4j for spatial relationships.
Symptoms:
- Frontend: `TypeError: Load failed` in Apollo Client
- Backend: Objects found in PostgreSQL but missing from Neo4j
- GraphQL queries time out or crash
Investigation:
Check object counts in both databases:

```cypher
// Neo4j
MATCH (o:Object) RETURN count(o) AS total_objects
```

```sql
-- PostgreSQL
SELECT COUNT(*) AS total_objects FROM object_instances;
```
Immediate Fix (Graceful Degradation): The backend now includes an existence check before querying Neo4j relationships. Objects will load with PostgreSQL data only when missing from Neo4j.
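The existence-check helper itself isn't reproduced in this guide; the following is a minimal sketch of the pattern, with the helper name assumed for illustration:

```python
# Sketch only - the helper name is an assumption, not the shipped code.
def _object_exists_in_neo4j_sync(driver, object_id: str) -> bool:
    """Cheap existence probe run before any relationship query."""
    with driver.session() as session:
        record = session.run(
            "MATCH (o:Object {id: $id}) RETURN count(o) > 0 AS found",
            id=object_id,
        ).single()
    return bool(record and record["found"])
```

If the probe returns `False`, the resolver returns the PostgreSQL fields and skips the spatial queries entirely.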
Permanent Solution (Data Sync): Run the data sync script to populate Neo4j with PostgreSQL objects:
```bash
cd Plings-Backend
python sync_data.py
```
Prevention:
- Ensure both databases are updated atomically during object creation (one possible pattern is sketched after this list)
- Monitor data consistency between PostgreSQL and Neo4j
- Implement regular sync validation checks
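The guide doesn't prescribe a specific atomicity mechanism; one common pattern is write-then-compensate, sketched here with hypothetical table, column, and helper names:

```python
import asyncio

from sqlalchemy import text

# Hypothetical write-then-compensate sketch; names are illustrative,
# not the actual codebase API.
async def create_object(pg_session, sync_neo4j_driver, obj: dict):
    async with pg_session.begin():
        await pg_session.execute(  # 1. PostgreSQL first
            text("INSERT INTO object_instances (id, name) VALUES (:id, :name)"),
            obj,
        )
    try:
        # 2. Mirror into Neo4j via the sync driver in a worker thread
        await asyncio.to_thread(_create_object_node_sync, sync_neo4j_driver, obj)
    except Exception:
        # 3. Compensate: delete the PostgreSQL row so the stores stay consistent
        async with pg_session.begin():
            await pg_session.execute(
                text("DELETE FROM object_instances WHERE id = :id"), {"id": obj["id"]}
            )
        raise
```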
Critical Column Name Mismatches (RESOLVED)
Supabase Schema vs Code Mismatches
CRITICAL: Always verify actual database column names before writing queries.
Object Images Table Issues (RESOLVED)
Symptoms:
column "object_id" does not existerrorscolumn "image_url" does not existerrorscolumn "created_at" does not existerrors
Root Cause: Code was using incorrect column names that didn’t match the actual Supabase schema.
Affected Files:
- `app/resolvers.py` (lines 117-130)
- `app/image_resolvers.py` (multiple functions)
Column Name Corrections:
```sql
-- ❌ WRONG column names used in code
object_id  -> object_instance_id
image_url  -> public_url
created_at -> uploaded_at
neo4j_id   -> terminusdb_id

-- ✅ CORRECT schema (verified via Supabase)
object_instance_id UUID
public_url         TEXT
uploaded_at        TIMESTAMP
terminusdb_id      TEXT
```
Prevention:
- Always verify schema first:

```sql
-- Check actual table structure
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'object_images'
ORDER BY ordinal_position;
```

- Use MCP Supabase tools:

```bash
# List all tables and their columns
mcp_supabase_list_tables --project_id=your_project_id
```

- Test queries in the Supabase SQL editor before implementing
UUID Format Issues (RESOLVED)
Symptoms:
- `invalid input syntax for type uuid: "dev-user-01"`
- Authentication failures in development
Root Cause: Development authentication was using string IDs instead of proper UUID format.
Solution:
```python
# ❌ WRONG - String IDs
id="dev-user-01"
org_id="dev-org-01"

# ✅ CORRECT - Proper UUIDs
id="89abf882-fecc-4e3c-8347-b94425a62ac8"
org_id="9ba2023b-117f-4346-a809-839a66b8c0d4"
```
Prevention:
- Always use proper UUID format for development data
- Verify UUIDs exist in production database
- Use `uuid.uuid4()` to generate valid UUIDs (quick example below)
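For example:

```python
import uuid

# Generate proper UUIDs for development fixtures instead of string IDs
dev_user_id = str(uuid.uuid4())  # e.g. '89abf882-fecc-4e3c-8347-b94425a62ac8'
dev_org_id = str(uuid.uuid4())
```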
Neo4j Async Event Loop Conflicts (CRITICAL)
Serverless Environment Issue
CRITICAL: Async Neo4j operations cause event loop conflicts in serverless environments like Vercel.
Symptoms:
- `<neo4j._async_compat.concurrency.AsyncRLock object...> is bound to a different event loop`
- `Task got Future attached to a different loop` errors
- Neo4j mutations failing in production but working locally
Root Cause: Serverless functions create new event loops for each request, but async Neo4j sessions maintain state tied to the original event loop, causing conflicts when the function execution context changes.
Solution – Use Synchronous Neo4j Operations (✅ Updated 2025-06-29):
```python
# ❌ WRONG - Async Neo4j operations (cause event loop conflicts)
async def resolve_move_object(info, objectId, newContainerId, user_id):
    neo4j_driver = info.context["neo4j"]
    async with neo4j_driver.session() as session:
        result = await session.run(query, params)

# ✅ CORRECT - Synchronous Neo4j operations
def _move_object_sync(driver, object_id: str, container_id: str, user_id: str):
    """Synchronous Neo4j operations to avoid event loop conflicts."""
    with driver.session() as session:
        result = session.run(query, params)
        return dict(result.single())

async def resolve_move_object(info, objectId, newContainerId, user_id):
    sync_neo4j_driver = info.context["sync_neo4j"]
    await asyncio.to_thread(
        _move_object_sync, sync_neo4j_driver, objectId, newContainerId, user_id
    )
```
Resolved Issues - Fixed in Production:
- getObjectDetailsComplete Resolver (2024-01-15)
  - Problem: Frontend "Load failed" errors when clicking on objects
  - Root Cause: Complex async Neo4j query in object detail resolvers
  - Solution: Converted to `_get_object_relationships_sync()` function with `asyncio.to_thread()`
  - Files Fixed: `app/object_detail_resolvers.py`
- Basic Object Resolver (2024-01-15)
  - Problem: Spatial relationship queries failing
  - Solution: Added `_get_spatial_relationships_sync()` function
  - Files Fixed: `app/object_detail_resolvers.py`
- getSpatialChildren Resolver (2024-01-15)
  - Problem: Spatial dashboard not loading children objects
  - Solution: Added `_get_spatial_children_sync()` function
  - Files Fixed: `app/object_detail_resolvers.py`
- Neo4j Fallback Function (2024-01-15)
  - Problem: Fallback queries for objects missing from PostgreSQL
  - Solution: Added `_get_neo4j_fallback_object_sync()` function
  - Files Fixed: `app/object_detail_resolvers.py`
- All Spatial Resolvers (2024-01-15 - LATEST)
  - Problem: "AsyncSession object does not support the context manager protocol" errors
  - Root Cause: Multiple remaining async Neo4j operations in spatial, status, and image resolvers
  - Solution: Complete conversion to sync pattern:
    - `_update_spatial_relationships_sync()` in spatial resolvers
    - `_get_spatial_parent_sync()` in spatial resolvers
    - `_remove_spatial_relationship_sync()` in spatial resolvers
    - `_update_object_statuses_sync()` in status resolvers
    - Removed async Neo4j enhancement in image resolvers
  - Files Fixed: `app/spatial_resolvers.py`, `app/status_resolvers.py`, `app/image_resolvers.py`
Implementation Pattern:
- Create separate `_operation_sync()` functions for Neo4j operations
- Use `with driver.session()` instead of `async with driver.session()`
- Call via `await asyncio.to_thread(sync_function, sync_neo4j_driver, ...)`
- ALWAYS pass the sync driver (`sync_neo4j_driver`) to any helper that opens a blocking `with driver.session()` context manager - passing the async variant (`neo4j_driver`) will surface the error `"AsyncSession object does not support the context manager protocol"` even if the helper itself is synchronous.
Quick check:

```python
# Good - synchronous helper receives the sync driver
spatial_parent, spatial_children = await asyncio.to_thread(
    _get_spatial_relationships_sync, info.context["sync_neo4j"], object_id
)

# Bad - will crash with the AsyncSession context-manager error
spatial_parent, spatial_children = await asyncio.to_thread(
    _get_spatial_relationships_sync, info.context["neo4j"], object_id
)
```
Affected Operations:
- All spatial relationship mutations (`moveObject`, `createSpatialRelationship`)
- Any GraphQL resolver that performs Neo4j write operations
- Complex queries that maintain session state
Prevention:
- Never use async Neo4j sessions in serverless resolvers
- Always use the established sync pattern with `asyncio.to_thread()`
- Test in the actual serverless environment, not just locally
Database Connection Issues
PostgreSQL Connection Pool Exhaustion (CRITICAL)
CRITICAL: Supabase PostgreSQL connection pool exhaustion in serverless environments.
Symptoms:
- `FATAL: MaxClientsInSessionMode: max clients reached - in Session mode max clients are limited to pool_size`
- GraphQL queries failing with `OperationalError: connection failed`
- Concurrent requests causing 500 errors
Root Cause: Serverless environments like Vercel create multiple concurrent instances, each trying to establish database connections. Default SQLAlchemy pool settings are too aggressive for Supabase’s connection limits.
Solution - Supabase-Optimized Connection Pool (UPDATED 2024-01-15):
```python
# ❌ WRONG - Default pool settings (causes exhaustion)
engine = create_async_engine(pg_dsn, pool_pre_ping=True, echo=False)

# ✅ CORRECT - Supabase-optimized pool (current production settings)
engine = create_async_engine(
    pg_dsn,
    pool_pre_ping=True,
    echo=False,
    # Conservative connection pool for Supabase Session mode limits
    pool_size=3,        # Reduced to stay within Supabase limits
    max_overflow=2,     # Minimal overflow to avoid pool exhaustion
    pool_timeout=30,    # Longer timeout for connection acquisition
    pool_recycle=300,   # Shorter recycle to free connections faster (5 min)
    pool_reset_on_return='rollback',  # Aggressive connection reset
    # Additional settings for better connection management
    connect_args={
        "server_settings": {
            "application_name": "plings-backend",
            "jit": "off",  # Disable JIT for faster connection setup
        }
    },
)
```
Current Pool Settings Explained (Production):
- `pool_size=3`: Conservative pool size for Supabase Session mode
- `max_overflow=2`: Minimal overflow connections (total max: 5)
- `pool_timeout=30`: Longer timeout for better availability
- `pool_recycle=300`: Fast connection turnover (5 minutes)
- `application_name`: Better connection identification in logs
- `jit=off`: Faster connection establishment
Health Check Monitoring:
The `/health/db` endpoint now provides detailed connection pool metrics:
```json
{
  "postgres": {
    "pool_size": 3,
    "checked_out": 2,
    "connection_usage": "2/5",
    "usage_percentage": "40.0%",
    "pool_full": false,
    "warning": null
  }
}
```
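The endpoint's implementation isn't reproduced here; the sketch below shows how such metrics can be read from SQLAlchemy's pool API. The function shape and the 80% threshold are assumptions; for an `AsyncEngine`, pass `engine.sync_engine`.

```python
from sqlalchemy.engine import Engine

# Sketch only - the real /health/db handler may differ.
def pool_metrics(engine: Engine, max_total: int = 5) -> dict:
    pool = engine.pool  # QueuePool exposes size(), checkedout(), overflow()
    checked_out = pool.checkedout()
    usage = checked_out / max_total * 100
    return {
        "pool_size": pool.size(),
        "checked_out": checked_out,
        "connection_usage": f"{checked_out}/{max_total}",
        "usage_percentage": f"{usage:.1f}%",
        "pool_full": checked_out >= max_total,
        "warning": "pool usage above 80%" if usage > 80 else None,
    }
```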
Prevention:
- Monitor Health Endpoint: Check `/health/db` for pool usage
- Watch for Warnings: Alert when usage > 80%
- Optimize Query Duration: Use efficient queries to reduce connection hold time
- Connection Pattern Optimization: Use `async with engine.begin()` consistently
Emergency Pool Size Reduction:
```python
# If experiencing an immediate outage, further reduce the pool size
pool_size=2, max_overflow=1  # Total max: 3 connections per instance
```
PostgreSQL Connection String Format
CRITICAL: Async SQLAlchemy requires specific connection string format.
```text
# ❌ WRONG - Standard PostgreSQL format
postgresql://user:pass@host:port/db

# ✅ CORRECT - Async psycopg format
postgresql+psycopg://user:pass@host:port/db
```
Implementation:
```python
# Auto-convert in settings
pg_dsn = settings.supabase_db_url
if pg_dsn.startswith("postgresql://"):
    pg_dsn = pg_dsn.replace("postgresql://", "postgresql+psycopg://", 1)
engine = create_async_engine(pg_dsn, pool_pre_ping=True)
```
Neo4j Connection Issues
Common Problems:
- URI format: Must use `bolt://` or `neo4j://`
- Authentication: Verify username/password
- Database selection: Specify the correct database name
```python
# Verify the Neo4j connection
from neo4j import GraphDatabase

def test_neo4j_connection():
    driver = GraphDatabase.driver(
        settings.neo4j_uri,
        auth=(settings.neo4j_username, settings.neo4j_password),
    )
    with driver.session(database=settings.neo4j_database) as session:
        result = session.run("RETURN 1 AS test")
        print(result.single()["test"])  # Should print: 1
    driver.close()  # Release the driver's connections when done
```
Debugging Workflow
1. Verify Database Schema
```sql
-- Check table exists
SELECT table_name FROM information_schema.tables
WHERE table_schema = 'public' AND table_name = 'your_table';

-- Check column names and types
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'your_table'
ORDER BY ordinal_position;
```
2. Test Individual Queries
```python
from sqlalchemy import text

# Test a query in isolation
async def test_query():
    async with get_db_session() as session:
        result = await session.execute(
            text("SELECT object_instance_id, public_url FROM object_images LIMIT 1")
        )
        row = result.fetchone()
        print(f"Row: {row}")
```
3. Check Query Logs
```python
# Enable SQL query logging
engine = create_async_engine(
    pg_dsn,
    pool_pre_ping=True,
    echo=True,  # This logs all SQL queries
)
```
4. Verify Data Types
```sql
-- Check for type mismatches
SELECT
    column_name,
    data_type,
    character_maximum_length,
    is_nullable
FROM information_schema.columns
WHERE table_name = 'your_table';
```
Prevention Checklist
Before Writing Database Code:
- Verify table exists in target database
- Check actual column names (not assumptions)
- Verify data types match expected usage
- Test queries in database console first
- Use proper UUID format for UUID columns
- Verify foreign key relationships exist
Before Deployment:
- Test all database queries in staging
- Verify connection strings are correct format
- Check environment variables are set
- Test both PostgreSQL and Neo4j connections
- Verify authentication works with real user data
Common Patterns to Avoid:
- Don’t assume column names without verification
- Don’t use string IDs where UUIDs are expected
- Don’t mix sync/async database operations
- Don’t forget to handle connection errors gracefully
- Don’t skip testing queries in database console
Schema Verification Commands
Supabase (PostgreSQL)
```sql
-- List all tables
SELECT table_name FROM information_schema.tables
WHERE table_schema = 'public';

-- Get table structure (psql meta-command; works in psql, not the SQL editor)
\d+ table_name

-- Check constraints
SELECT * FROM information_schema.table_constraints
WHERE table_name = 'your_table';
```
Neo4j
```cypher
// List all node labels
CALL db.labels();

// List all relationship types
CALL db.relationshipTypes();

// Get schema overview
CALL db.schema.visualization();
```
Remember: Always verify the actual database schema before writing code. Assumptions about column names are the most common source of database errors.
Supabase Session-Mode Limits in Serverless Deployments (UPDATED 2025-06-29)
Even with the conservative pool (3 + 2 overflow) we previously recommended, a serverless platform such as Vercel can still exhaust the Session-mode cap because each lambda instance maintains its own pool. One heavy user running stress-tests can spawn enough lambdas to blow past the limit.
Why it happens
| Trigger | Effect |
|---|---|
| Vercel auto-scales many short-lived lambdas | every lambda initialises its own SQLAlchemy pool |
| Pool was 3 + 2 overflow | 5 sockets per lambda |
| 5 concurrent lambdas | 25 sockets; Session-mode limit reached |
Plings is now on the Supabase Pro plan (≈500 realtime concurrent connections), but Session-mode pooling is still stricter (usually 20-25). Therefore we need to minimise per-lambda usage.
Definitive Fix
- Singleton engine - The backend exposes `get_pg_engine()` in `app/db_postgres.py`. Each process calls it once and re-uses the returned engine everywhere.
```python
_engine = None

def get_pg_engine():
    global _engine
    if _engine is None:
        _engine = create_async_engine(
            dsn,
            pool_size=1,     # base pool
            max_overflow=1,  # total max 2 sockets
        )
    return _engine
```
- Per-event-loop engine with `NullPool` - `get_pg_engine()` now caches one `AsyncEngine` per Python event loop (keyed by `id(asyncio.get_event_loop())`). Each engine uses `NullPool`, so every statement opens and closes its own socket. This prevents both Session-mode exhaustion and cross-loop lock errors like `<asyncio.locks.Lock ...> is bound to a different event loop`. A sketch of this pattern follows this list.
- (Recommended) Use Supabase PgBouncer - point `supabase_db_url` at the connection-pooling port (6543) so thousands of client sockets can share ~20 backend sessions.
- Health-check circuit breaker - `/health/db` now returns `503` when `usage_percentage` > 80%, signalling Vercel to shed load before Supabase returns FATAL errors.
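A minimal sketch of the per-event-loop caching described above; the DSN handling is simplified and the real implementation lives in `app/db_postgres.py`:

```python
import asyncio

from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine
from sqlalchemy.pool import NullPool

# One engine per event loop, keyed by the loop's id; sketch only.
_engines: dict[int, AsyncEngine] = {}

def get_pg_engine(dsn: str) -> AsyncEngine:
    loop_id = id(asyncio.get_event_loop())
    if loop_id not in _engines:
        # NullPool keeps no persistent sockets - each statement opens and
        # closes its own connection, so no state crosses event loops.
        _engines[loop_id] = create_async_engine(dsn, poolclass=NullPool)
    return _engines[loop_id]
```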
After-action checklist
- Centralised the engine in `db_postgres.py`.
- Re-wired `main.py` and `graphql.py` to call the singleton instead of building their own engine.
- Replaced the pool with `NullPool` (no persistent sockets) - opens/closes per transaction.
- Updated docs (this section).
Result: Even the GraphQL debugger's five concurrent queries no longer trigger `MaxClientsInSessionMode` errors.
Spatial Predicates Migration (NORMAL_/CURRENT_ System)
Status: ✅ COMPLETE
Migration Date: 2025-07-03 08:17:03
Scope: Backend + Frontend
Migration Summary
The Plings system has successfully migrated from legacy spatial predicates to a dual predicate system:
- NORMAL_ predicates: Where objects should be (designated locations)
- CURRENT_ predicates: Where objects currently are (real-time locations - future implementation)
- Legacy predicates: Maintained for backward compatibility
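The exact mapping isn't documented here; as a hypothetical illustration of how resolvers can normalise legacy predicates onto the dual system (all predicate names below are assumptions):

```python
# Hypothetical predicate mapping - illustrative names only.
LEGACY_TO_NORMAL = {
    "INSIDE": "NORMAL_INSIDE",
    "ON_TOP_OF": "NORMAL_ON_TOP_OF",
    "ATTACHED_TO": "NORMAL_ATTACHED_TO",
}

def to_normal_predicate(predicate: str) -> str:
    # NORMAL_/CURRENT_ predicates pass through; legacy values resolve
    # to their NORMAL_ ("should be") variant.
    if predicate.startswith(("NORMAL_", "CURRENT_")):
        return predicate
    return LEGACY_TO_NORMAL.get(predicate, f"NORMAL_{predicate}")
```

Backward compatibility then amounts to accepting legacy values on input and normalising them before they reach Neo4j.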
Components Updated
✅ Backend (Completed 2025-07-03 08:17:03)
- Supabase: Applied dual spatial predicates migration (27 total predicates)
- Neo4j: Migrated existing relationships to NORMAL_ predicates
- GraphQL Schema: Enhanced SpatialRelationshipType enum with NORMAL_, CURRENT_, and legacy predicates
- Resolvers: All spatial resolvers updated with predicate mapping and backward compatibility
- Object Creation: New objects use NORMAL_ predicates by default
✅ Frontend (Completed 2025-07-03 08:17:03)
- Type Definitions: Updated to support all predicate variants
- UI Components: Drag-and-drop now creates NORMAL_ predicates
- GraphQL Integration: Fully compatible with enhanced backend schema
- Documentation: Comprehensive migration guide created for developers
- Backward Compatibility: Existing code continues to work unchanged
Key Benefits Achieved
- Enhanced Inventory Management: Foundation for misplacement detection
- Future-Proofing: System ready for real-time location tracking
- Zero Downtime: Migration completed without breaking changes
- Developer Experience: Clear migration guides and backward compatibility
Next Steps
- CURRENT_ Predicate Implementation: Add real-time location tracking
- Misplacement Detection: Compare NORMAL_ vs CURRENT_ predicates (illustrative sketch after this list)
- Advanced UI Features: Location auditing and correction workflows
- Automated Organization: Guide users to return objects to designated locations
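Since CURRENT_ predicates are not yet implemented, the following is purely an illustrative sketch of what misplacement detection could look like; the relationship names and query are assumptions, not shipped code:

```python
# Illustrative only - misplacement detection is future work.
MISPLACED_OBJECTS_QUERY = """
MATCH (o:Object)-[:NORMAL_INSIDE]->(designated:Object)
MATCH (o)-[:CURRENT_INSIDE]->(actual:Object)
WHERE designated <> actual
RETURN o.id AS object_id, designated.id AS should_be_in, actual.id AS is_in
"""

def find_misplaced_objects(driver):
    # Uses the sync driver pattern established earlier in this guide.
    with driver.session() as session:
        return [dict(record) for record in session.run(MISPLACED_OBJECTS_QUERY)]
```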
For detailed implementation information, see:
- Frontend: Spatial Predicates Migration Guide
- Backend: Spatial resolvers with predicate mapping system