Backend Troubleshooting Guide

This guide covers common issues and solutions for the Plings backend, with a focus on problems encountered in a Vercel serverless deployment.


1. Vercel Serverless, Lifespan, and Database Connections

This is the most critical and difficult-to-diagnose issue when deploying a FastAPI application to Vercel.

Symptom:

  • A persistent 500 Internal Server Error on all GraphQL operations.
  • Vercel logs show the 500 error but provide no traceback or details.
  • The underlying error is an AttributeError that only becomes visible after adding a global exception handler (see section 2), for example: AttributeError: 'State' object has no attribute 'pg_engine'.

Root Cause: The FastAPI lifespan function (or the older @app.on_event("startup")) is not tied to individual requests. In a serverless environment like Vercel it runs at most once per function instance, when that instance is provisioned (a “cold start”). Requests served by an already warm instance never trigger it, so you cannot assume it has completed successfully before a given request is handled.

If you initialize resources like database connections and attach them to the app’s state inside lifespan, that state may be missing when a request arrives. Resolvers then fail with an AttributeError when they access it via info.context['request'].state.pg_engine.

Solution: Lazy Initialization

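The correct pattern is to initialize resources on demand within a function that is guaranteed to run for every GraphQL request. The GraphQL get_context_value function is a natural place for this. The engine is created by the first request that needs it and cached in a module-level variable, so later requests on the same warm instance reuse it. This “lazy initialization” ensures the database engine is always available to resolvers.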

# In graphql.py

# A simple cache to hold the engine
db_engine = None

async def get_context_value(request: Request, response: Response) -> AppContext:
    global db_engine

    # 1. Check if the engine has already been initialized for this server instance
    if db_engine is None:
        logger.info("Database engine is not initialized. Creating new engine.")
        # 2. If not, create it and cache it globally
        settings = get_settings()
        pg_dsn = settings.supabase_db_url.replace(
            "postgresql://", "postgresql+psycopg://"
        )
        db_engine = create_async_engine(pg_dsn, pool_pre_ping=True)
        logger.info("Database engine created and cached.")
    
    # 3. Attach the guaranteed-to-exist engine to the request state
    request.state.pg_engine = db_engine
    
    # ... rest of the context setup
    return AppContext(request=request, response=response, user_context=user_context)

By using this pattern, you completely avoid relying on the serverless lifespan event for request-critical resources.
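
The same idea can be factored into a small helper that does not depend on GraphQL. The sketch below is only an illustration under stated assumptions, not the Plings implementation: LazyResource and fake_engine_factory are hypothetical names, the factory stands in for create_async_engine, and the lock only matters if the first requests can arrive on several threads at once.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Create a resource on first use and reuse it for the life of the process."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._value: Optional[T] = None
        self._lock = threading.Lock()

    def get(self) -> T:
        # Fast path: the resource was already created by an earlier request.
        if self._value is None:
            with self._lock:
                # Re-check inside the lock so only one caller runs the factory.
                if self._value is None:
                    self._value = self._factory()
        return self._value


# Hypothetical stand-in for create_async_engine(...); counts how often it runs.
calls = {"count": 0}


def fake_engine_factory() -> dict:
    calls["count"] += 1
    return {"dsn": "postgresql+psycopg://example"}


engine = LazyResource(fake_engine_factory)
first = engine.get()   # first request on this instance: the factory runs
second = engine.get()  # later request on the warm instance: cached object reused
```

Both calls return the same object and the factory runs once, which is the behavior the global db_engine cache relies on.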


2. Diagnosing 500 Errors with No Traceback

Symptom:

  • Vercel logs show “500 Internal Server Error” but no Python traceback.

Root Cause: An exception is occurring very early in the request lifecycle, before it hits the main GraphQL execution engine’s error handling. This is often caused by a misconfigured or crashing middleware, or the lifespan issue described above.

Solution: Global Exception Middleware

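To ensure you always capture a traceback, add a global try...except middleware as the outermost middleware in your main.py. Starlette wraps middleware in reverse order of registration, so the outermost middleware is the one added last.

# In main.py
import logging

from fastapi import Request, Response
from fastapi.middleware.cors import CORSMiddleware
from starlette.middleware.base import BaseHTTPMiddleware

logger = logging.getLogger(__name__)

class GlobalExceptionMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        try:
            return await call_next(request)
        except Exception:
            # exc_info=True attaches the full traceback to the log record
            logger.error("An unhandled exception occurred", exc_info=True)
            # Return a generic 500 so no internals leak to the client
            return Response("Internal Server Error", status_code=500)

# `app` is the existing FastAPI instance
app.add_middleware(
    CORSMiddleware,
    # ...
)
# Add it LAST: the most recently added middleware is the outermost one,
# so it wraps CORSMiddleware and everything registered before it.
app.add_middleware(GlobalExceptionMiddleware)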

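With this in place, any exception raised while a request is being handled is logged with a full traceback. Failures at import time happen before any middleware exists; those are covered under Import Errors below.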
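
If BaseHTTPMiddleware is not an option, the same logging can be done with a plain ASGI wrapper. The following is a hedged sketch rather than the project's code: LogExceptionsASGI, crashing_app, and the logger name are hypothetical, and the demo drives the wrapper by hand with a fake scope instead of a real server.

```python
import asyncio
import logging

logger = logging.getLogger("plings.errors")  # hypothetical logger name


class LogExceptionsASGI:
    """Pure-ASGI wrapper that logs any exception raised by the wrapped app."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        started = False

        async def tracking_send(message):
            # Remember whether response headers have already gone out.
            nonlocal started
            if message["type"] == "http.response.start":
                started = True
            await send(message)

        try:
            await self.app(scope, receive, tracking_send)
        except Exception:
            logger.exception("Unhandled exception for %s", scope.get("path"))
            if started:
                raise  # headers already sent; the response cannot be replaced
            await send({
                "type": "http.response.start",
                "status": 500,
                "headers": [(b"content-type", b"text/plain; charset=utf-8")],
            })
            await send({"type": "http.response.body", "body": b"Internal Server Error"})


# Hypothetical app that always crashes, used only to exercise the wrapper.
async def crashing_app(scope, receive, send):
    raise AttributeError("'State' object has no attribute 'pg_engine'")


async def _demo():
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "path": "/graphql/", "method": "POST", "headers": []}
    await LogExceptionsASGI(crashing_app)(scope, receive, send)
    return sent


messages = asyncio.run(_demo())
```

The wrapper logs the traceback and, as long as no headers have been sent yet, answers with a generic 500. A class of this shape can be registered with app.add_middleware in the same way as the BaseHTTPMiddleware version.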


3. Common GraphQL and Database Errors

3.1. Frontend 400 Bad Request & Schema Mismatches

Symptom:

  • The frontend reports a 400 Bad Request when making a GraphQL query.

Root Cause: This often indicates a mismatch between the frontend’s query and the backend’s schema. The frontend is asking for fields or types that do not exist on the backend. This can happen frequently during development.

Solution:

  1. Use the GraphiQL interface at /graphql/ to inspect the backend’s current schema.
  2. Compare the schema with the frontend’s GET_OBJECT_DETAILS (or similar) query.
  3. Add the missing fields and corresponding resolvers to the backend schema in graphql.py and *.resolvers.py. It’s acceptable to have stub resolvers that return None or [] initially to unblock the frontend.
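
Steps 1 and 2 can also be done programmatically. The sketch below assumes only the standard GraphQL introspection shape (data.__type.fields[].name); the helper name missing_fields, the sample response, and the field names images and location are hypothetical and serve only as an illustration.

```python
from typing import Iterable, List

# Standard GraphQL introspection query for a single type. Send it to the
# backend with the type name the frontend query targets.
TYPE_FIELDS_QUERY = """
query TypeFields($name: String!) {
  __type(name: $name) { fields { name } }
}
"""


def missing_fields(introspection_result: dict, requested: Iterable[str]) -> List[str]:
    """Return requested field names that the backend type does not expose."""
    type_info = (introspection_result.get("data") or {}).get("__type")
    if type_info is None:
        # The type itself is missing from the backend schema.
        return sorted(set(requested))
    available = {field["name"] for field in (type_info.get("fields") or [])}
    return sorted(set(requested) - available)


# Hypothetical introspection response for a type that exposes two fields.
sample_result = {"data": {"__type": {"fields": [{"name": "id"}, {"name": "name"}]}}}

# Hypothetical list of fields requested by the frontend query.
print(missing_fields(sample_result, ["id", "name", "images", "location"]))
# prints ['images', 'location']
```

Every name in the returned list needs a field and a resolver (a stub is enough at first) before the frontend query validates.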

3.2. Database relation "..." does not exist

Symptom:

  • asyncpg.exceptions.UndefinedTableError: relation "objects" does not exist

Root Cause: The SQL query references a table name that does not exist in the database.

Solution:

  • Do not guess table names.
  • The canonical source of truth for the Postgres schema is the supabase_schema.sql file in the root of the Plings-Lovable-Frontend repository.
  • Always consult this file to get the correct table and column names (e.g., object_instances, not objects).
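
To check a name before using it in a query, the table names can be pulled out of the schema file. This is a rough sketch that assumes the file contains ordinary CREATE TABLE statements; the regular expression is deliberately simple, and the inline sample (including the object_classes table) is hypothetical.

```python
import re
from typing import Set

# Matches CREATE TABLE [IF NOT EXISTS] [schema.]table, with optional double quotes.
CREATE_TABLE_RE = re.compile(
    r'CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(?:"?\w+"?\.)?"?(\w+)"?',
    re.IGNORECASE,
)


def table_names(schema_sql: str) -> Set[str]:
    """Return the set of table names declared in a schema dump."""
    return {match.group(1) for match in CREATE_TABLE_RE.finditer(schema_sql)}


# Hypothetical excerpt; in practice read the real file, for example
# pathlib.Path("supabase_schema.sql").read_text().
sample_sql = """
CREATE TABLE public.object_instances (id uuid PRIMARY KEY);
CREATE TABLE IF NOT EXISTS "public"."object_classes" (id uuid PRIMARY KEY);
"""

tables = table_names(sample_sql)
print("objects" in tables)           # prints False
print("object_instances" in tables)  # prints True
```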

3.3. Conflicting FastAPI Event Handlers


Symptom:

  • 500 errors on all endpoints including /health
  • CORS preflight failures
  • “FUNCTION_INVOCATION_FAILED” on Vercel

Root Cause: Conflicting FastAPI event handlers - mixing lifespan functions with @app.on_event decorators.

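Solution: Use only lifespan. Do not mix it with @app.on_event. Request-critical resources such as the database engine should still be initialized lazily, as described in section 1.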

from contextlib import asynccontextmanager
from fastapi import FastAPI

# ❌ WRONG - Conflicting event handlers
@asynccontextmanager
async def lifespan(app: FastAPI):
    yield

app = FastAPI(lifespan=lifespan)

@app.on_event("startup")  # ❌ CONFLICTS
async def startup():
    pass

# ✅ CORRECT - Use only lifespan
@asynccontextmanager
async def lifespan(app: FastAPI):
    # All startup code here
    # ...
    yield
    # All cleanup code here

app = FastAPI(lifespan=lifespan)
# No @app.on_event decorators

Critical Issues and Solutions

1. Import Errors Causing Application Startup Failures ⚠️ CRITICAL

Symptoms:

  • 500 errors on all endpoints including /health and /graphql
  • CORS preflight failures with status code 500
  • “FUNCTION_INVOCATION_FAILED” on Vercel
  • Python traceback: ImportError: cannot import name 'X' from 'Y'

Root Cause: Import statements that name the wrong module, typically because they were copied without verifying where the function is actually defined.

Most Common Import Mistakes:

# ❌ WRONG - These functions don't exist in custom modules
from .db_postgres import create_async_engine  # create_async_engine is from SQLAlchemy
from .db_neo4j import AsyncGraphDatabase       # AsyncGraphDatabase is from neo4j package

# ✅ CORRECT - Import from the actual source packages
from sqlalchemy.ext.asyncio import create_async_engine
from neo4j import AsyncGraphDatabase

Prevention Rules:

  1. Never assume what’s in custom modules - always check the actual file contents
  2. Use your IDE’s import suggestions - they show the correct source packages
  3. Test imports immediately - run python -c "from module import function" to verify
  4. Check the actual package documentation - SQLAlchemy, Neo4j, etc. have their own imports

Debugging Process:

# Test if imports work locally
cd Plings-Backend
python -c "from sqlalchemy.ext.asyncio import create_async_engine; print('✅ SQLAlchemy import OK')"
python -c "from neo4j import AsyncGraphDatabase; print('✅ Neo4j import OK')"

# Test main.py compilation
python -m py_compile app/main.py
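
The same check can be scripted so that several imports are verified in one run. This is a minimal sketch: check_imports is a hypothetical helper, and the demo uses the standard-library json module so that it runs anywhere. In the backend the targets would be pairs such as ("sqlalchemy.ext.asyncio", "create_async_engine") and ("neo4j", "AsyncGraphDatabase").

```python
import importlib
from typing import List, Tuple


def check_imports(targets: List[Tuple[str, str]]) -> List[str]:
    """Return human-readable problems; an empty list means every import resolves."""
    problems = []
    for module_name, attr in targets:
        try:
            module = importlib.import_module(module_name)
        except ImportError as exc:
            problems.append(f"{module_name}: cannot import module ({exc})")
            continue
        # Approximates `from module import attr` for names defined in the module.
        if not hasattr(module, attr):
            problems.append(f"{module_name}: has no attribute {attr!r}")
    return problems


# Demo with a standard-library module: the second name does not exist there,
# just as create_async_engine does not exist in a custom db module.
problems = check_imports([("json", "loads"), ("json", "create_async_engine")])
for problem in problems:
    print(problem)
```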

Complete Fix Pattern:

# In main.py - USE THESE CORRECT IMPORTS
from sqlalchemy.ext.asyncio import create_async_engine  # ✅ From SQLAlchemy
from neo4j import AsyncGraphDatabase                     # ✅ From Neo4j package

# NOT from custom modules:
# from .db_postgres import create_async_engine  # ❌ WRONG
# from .db_neo4j import AsyncGraphDatabase       # ❌ WRONG

2. GraphQL Authentication Failures

Symptoms:

  • “Authentication required: Invalid user context or ID in token”
  • Frontend dashboard shows empty or error states
  • GraphQL queries fail even with valid JWT tokens

Root Cause: Creating a new GraphQL instance in main.py instead of using the pre-configured one from graphql.py.

Solution:

# ❌ WRONG - Missing authentication context
from .graphql import schema
graphql_app = GraphQL(schema, debug=True)
app.mount("/graphql", graphql_app)

# ✅ CORRECT - Includes proper context and JWT parsing
from .graphql import graphql_app
app.mount("/graphql", graphql_app)

Why This Happens: The graphql.py file contains a get_context_value function that:

  • Parses JWT tokens from Authorization headers
  • Creates UserContext objects with user ID and org ID
  • Sets up database connections in GraphQL context
  • Provides development fallback authentication
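
When the error mentions an invalid user ID in the token, it can help to look at what the token actually contains. The sketch below decodes the claims without verifying the signature, so it is suitable for debugging only and never for authentication. unverified_claims is a hypothetical helper, and it is an assumption that the user ID travels in the standard sub claim.

```python
import base64
import json
from typing import Optional


def unverified_claims(authorization_header: Optional[str]) -> dict:
    """Decode JWT claims WITHOUT verifying the signature (debugging only)."""
    if not authorization_header or not authorization_header.startswith("Bearer "):
        raise ValueError("expected an 'Authorization: Bearer <token>' header")
    token = authorization_header[len("Bearer "):].strip()
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("a JWT has three dot-separated parts")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def _b64(obj: dict) -> str:
    """Encode a dict the way JWT segments are encoded (unpadded base64url)."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical unsigned token built only for this demo.
demo_token = f"{_b64({'alg': 'none'})}.{_b64({'sub': 'user-123'})}.sig"
claims = unverified_claims(f"Bearer {demo_token}")
print(claims.get("sub"))  # prints user-123
```

If the claim that the backend expects is absent from the decoded payload, the problem lies in how the token is issued rather than in the GraphQL mount.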

Testing:

# Test GraphQL authentication
curl -X POST -H "Content-Type: application/json" \
  -d '{"query":"query { myObjects { id name } }"}' \
  https://plings-backend.vercel.app/graphql/

# Should return object data, not authentication errors

3. Database Connection Issues

Symptoms:

  • Connection timeouts during startup
  • “Could not connect to database” errors

Common Causes:

  1. Environment variables missing
    • Check SUPABASE_DB_URL, NEO4J_URI, etc.
  2. Connection string format
    • PostgreSQL: Must use postgresql+psycopg:// for async
  3. Network/firewall issues
    • Verify database hosts are accessible from Vercel
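
The first two causes can be checked before any connection is attempted. The following sketch uses hypothetical helper names (missing_env, to_psycopg_dsn). SUPABASE_DB_URL and NEO4J_URI come from this guide, while NEO4J_PASSWORD and the sample values are assumptions made for the demo.

```python
import os
from typing import Iterable, List, Mapping


def missing_env(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the variable names that are unset or empty."""
    return [name for name in names if not env.get(name)]


def to_psycopg_dsn(url: str) -> str:
    """Rewrite a plain postgresql:// URL to the postgresql+psycopg:// form."""
    prefix = "postgresql://"
    if url.startswith(prefix):
        return "postgresql+psycopg://" + url[len(prefix):]
    return url  # already carries a driver suffix, or is not a Postgres URL


# Hypothetical environment used only for this demo.
sample_env = {
    "SUPABASE_DB_URL": "postgresql://user:pw@db.example.com:5432/postgres",
    "NEO4J_URI": "",
}
required = ["SUPABASE_DB_URL", "NEO4J_URI", "NEO4J_PASSWORD"]

missing = missing_env(required, sample_env)
dsn = to_psycopg_dsn(sample_env["SUPABASE_DB_URL"])
print(missing)  # prints ['NEO4J_URI', 'NEO4J_PASSWORD']
print(dsn)      # prints postgresql+psycopg://user:pw@db.example.com:5432/postgres
```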

Debugging:

# Add detailed logging to lifespan function
from contextlib import asynccontextmanager

from sqlalchemy import text

@asynccontextmanager
async def lifespan(app: FastAPI):
    try:
        logger.info("🚀 Starting application...")
        settings = get_settings()
        logger.info(f"✅ Settings loaded: {settings.supabase_db_url[:20]}...")

        # Test each connection separately
        pg_dsn = settings.supabase_db_url
        if pg_dsn.startswith("postgresql://"):
            pg_dsn = pg_dsn.replace("postgresql://", "postgresql+psycopg://", 1)

        engine = create_async_engine(pg_dsn, pool_pre_ping=True, echo=settings.debug)
        async with engine.begin() as conn:
            await conn.execute(text("SELECT 1"))
        logger.info("✅ PostgreSQL connection established")

    except Exception as e:
        logger.error(f"💥 Startup failed: {e}", exc_info=True)
        raise

    yield  # the application serves requests here
    await engine.dispose()  # release pooled connections on shutdown

Debugging Workflow

1. Check Application Health

curl https://plings-backend.vercel.app/health
# Should return: {"status":"healthy","message":"Plings API is running"}

2. Test GraphQL Basic Query

curl -X POST -H "Content-Type: application/json" \
  -d '{"query":"query { __typename }"}' \
  https://plings-backend.vercel.app/graphql/
# Should return: {"data":{"__typename":"Query"}}

3. Test GraphQL Authentication

curl -X POST -H "Content-Type: application/json" \
  -d '{"query":"query { myObjects { id name } }"}' \
  https://plings-backend.vercel.app/graphql/
# Should return object data or specific auth error

4. Check Vercel Logs

  • Go to Vercel dashboard
  • Check function logs for detailed error messages
  • Look for startup/lifespan errors

Prevention Checklist

Import Verification (CRITICAL - Check First!)

  • Never import database functions from custom modules - always use the actual packages
  • Import create_async_engine from sqlalchemy.ext.asyncio not db_postgres
  • Import AsyncGraphDatabase from neo4j not db_neo4j
  • Test all imports with python -c "from module import function"
  • Use IDE import suggestions to verify correct source packages

GraphQL Setup

  • Always import graphql_app from graphql.py
  • Never create new GraphQL() instances in main.py
  • Use only lifespan function, no @app.on_event decorators
  • Frontend uses /graphql/ with trailing slash
  • Test authentication with curl commands
  • Add detailed error logging to lifespan function
  • Verify environment variables are set correctly

File Structure Reference

app/
├── main.py          # FastAPI app, routes, CORS
├── graphql.py       # GraphQL schema, resolvers, context (USE THIS graphql_app)
├── auth.py          # JWT parsing, UserContext
├── resolvers.py     # GraphQL query resolvers
└── config.py        # Settings and environment

Remember: The graphql.py file contains the authoritative GraphQL configuration. Always use its graphql_app instance.


4. Asyncio Event Loop Conflicts with Neo4j on Vercel

Symptom:

  • GraphQL queries that access Neo4j fail intermittently or consistently.
  • The Vercel runtime logs show the error: Task <...> got Future <...> attached to a different loop.
  • This often happens inside async with neo4j_driver.session() as session: blocks.

Root Cause: This is a complex asyncio issue common in serverless environments. The Vercel Python runtime manages its own asyncio event loop. The neo4j asynchronous driver (AsyncGraphDatabase) may, under certain conditions, try to use or create a different event loop, causing a conflict. A Future or Task created in one loop cannot be awaited in another.

Solution: Use Synchronous Driver in a Separate Thread

The most robust solution is to avoid the neo4j async driver for queries originating from the main FastAPI/GraphQL thread. Instead, use the standard synchronous driver (GraphDatabase) and execute the database call in a separate thread using asyncio.to_thread (available in Python 3.9 and later). The blocking call then runs in a worker thread, so it does not block the main event loop, and no Neo4j Future or Task is ever bound to an event loop.

Implementation Steps:

  1. Update graphql.py to provide both drivers:
    • Create and cache a synchronous Neo4j driver alongside the existing async one.
    • Add both to the GraphQL context.
    # In Plings-Backend/app/graphql.py
    from typing import Any, Dict

    from neo4j import AsyncGraphDatabase, GraphDatabase
        
    # ... global caches
    sync_neo4j_driver_global = None
    
    async def get_context_value(request: Any, _) -> Dict[str, Any]:
        global db_engine, neo4j_driver_global, sync_neo4j_driver_global
            
        if sync_neo4j_driver_global is None:
            # ... initialize it
            settings = get_settings()
            sync_neo4j_driver_global = GraphDatabase.driver(
                settings.neo4j_uri,
                auth=(settings.neo4j_user, settings.neo4j_password)
            )
            
        return {
            # ... other context keys
            "neo4j": neo4j_driver_global,
            "sync_neo4j": sync_neo4j_driver_global
        }
    
  2. Refactor failing resolvers:
    • In any resolver that uses async with neo4j_driver.session(), replace the call with asyncio.to_thread.
    • Create a small synchronous helper function to contain the database logic.
    # In Plings-Backend/app/resolvers.py
    import asyncio
    import logging
    
    def _get_objects_sync(driver, query, **params):
        """Synchronous Neo4j query execution."""
        with driver.session() as session:
            result = session.run(query, **params)
            return [dict(record) for record in result]
    
    async def resolve_my_objects(info: GraphQLResolveInfo, ...):
        # ...
        try:
            sync_driver = info.context["sync_neo4j"]
            records = await asyncio.to_thread(
                _get_objects_sync, sync_driver, neo4j_query, user_id=...
            )
        except Exception as e:
            logging.error(f"Error fetching objects from Neo4j: {e}", exc_info=True)
            return []
        # ...
    

    This pattern isolates the Neo4j call from the main event loop, resolving the conflict and making the queries stable on Vercel.
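
The effect of asyncio.to_thread can be seen in isolation without a database. In this hedged sketch, FakeSyncDriver is a hypothetical stand-in for the synchronous Neo4j driver and time.sleep simulates blocking network I/O. A heartbeat task shows that the event loop keeps running while the worker thread is busy.

```python
import asyncio
import threading
import time


class FakeSyncDriver:
    """Hypothetical stand-in for a synchronous Neo4j driver."""

    def run_query(self, query: str, **params):
        time.sleep(0.05)  # simulate blocking network I/O
        thread_name = threading.current_thread().name
        return [{"query": query, "thread": thread_name, **params}]


def _get_objects_sync(driver, query, **params):
    """Synchronous helper, shaped like the one used in the resolver."""
    return driver.run_query(query, **params)


async def main():
    loop_thread = threading.current_thread().name
    ticks = 0

    async def heartbeat():
        # Counts how often the event loop gets to run during the query.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    records = await asyncio.to_thread(
        _get_objects_sync, FakeSyncDriver(), "MATCH (n) RETURN n", user_id="u1"
    )
    hb.cancel()
    return records, loop_thread, ticks


records, loop_thread, ticks = asyncio.run(main())
print(records[0]["thread"] != loop_thread)  # prints True
```

The query runs on a worker thread while the heartbeat continues to tick on the loop thread, which is why the pattern neither blocks the event loop nor depends on which loop the driver was created in.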