Prerequisites:
  • Python 3.10 or higher (3.11+ recommended)
  • Basic understanding of async/await patterns
  • SQLite3

Architecture Overview

CosmaSense uses a monorepo structure with three main packages:
  • Backend (packages/cosma-backend): Quart async web framework + Uvicorn
  • TUI (packages/cosma-tui): Textual framework for terminal UI
  • CLI (root): Click-based orchestrator
All packages share dependencies via workspace configuration.

Tech Stack

Backend Framework

Component | Technology
Web Framework | Quart (async Flask)
ASGI Server | Uvicorn
Validation | QuartSchema
Database | asqlite (async SQLite)
File Watching | watchdog

Component | Technology
Vector Search | sqlite-vec
Keyword Search | FTS5 (Full-Text Search)
LLM Backend | LiteLLM/Ollama
Embeddings | sentence-transformers (e5-base-v2)
File Parsing | MarkItDown (20+ formats)

Processing Pipeline

Files go through a 4-stage pipeline (see pipeline.py:56-174):
1. Discovery

Recursively scan directories and collect file metadata. Only files with modified timestamps different from the database are processed.
# Skip logic checks modified time
if file.modified_time == db_modified_time:
    skip_file()
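The discovery check can be sketched with the standard library alone. This is a minimal illustration, not the code in pipeline.py: the function name `discover_changed_files` and the `known_mtimes` dictionary (standing in for the modified times stored in the database) are hypothetical.

```python
import os


def discover_changed_files(root: str, known_mtimes: dict[str, int]) -> list[str]:
    """Return paths under `root` whose modified time differs from the stored one.

    `known_mtimes` is a stand-in for the database: path -> modified_time (seconds).
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = int(os.stat(path).st_mtime)
            # New files are absent from the mapping, so .get() returns None
            # and they are always treated as changed.
            if known_mtimes.get(path) != mtime:
                changed.append(path)
    return sorted(changed)
```

Files missing from the mapping are reported as changed, so a first scan of an empty database picks up everything.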
2. Parsing

Extract text from 20+ file formats using MarkItDown. Calculate a content hash to detect changes.
Supported formats:
  • Documents: PDF, DOCX, TXT, MD
  • Images: PNG, JPG, GIF (with OCR)
  • Code: PY, JS, TS, JAVA, etc.
  • Spreadsheets: XLSX, CSV
# Hash check to detect content changes
if calculate_hash(content) == db_hash:
    skip_to_next_stage()
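A runnable version of the hash check might look like the sketch below. The choice of SHA-256 and the helper `needs_reprocessing` are assumptions made for illustration; the source does not say which hash function `calculate_hash` uses.

```python
import hashlib
from typing import Optional


def calculate_hash(content: str) -> str:
    """Return a hex digest of the extracted text (SHA-256 is an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def needs_reprocessing(content: str, db_hash: Optional[str]) -> bool:
    """True when the file is new or its content hash differs from the stored one."""
    return db_hash is None or calculate_hash(content) != db_hash
```

Hashing the extracted text rather than relying on the timestamp means a file that was touched but not edited can skip the later stages.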
3. Summarization

AI generates:
  • Title
  • Summary (max 100 words)
  • 3-5 relevant keywords
Uses LiteLLM or Ollama for local/cloud LLM support.
4. Embedding

Create 768-dimensional vectors using the e5-base-v2 model for semantic search. Embeddings are stored in the file_embeddings virtual table, with triggers keeping it in sync.

Hybrid Search System

CosmaSense combines two search methods (see searcher.py:91-220):

Semantic Search (Vector Similarity)

  • Embeds query using same e5-base-v2 model
  • Calculates cosine similarity against file embeddings
  • Score: exp(-distance) scaled to 0-0.5 range

Keyword Search (FTS5)

  • SQLite FTS5 searches content/title/keywords
  • Uses BM25 ranking algorithm
  • Score: relevance scaled to 0-0.5 range

Combined Scoring

final_score = semantic_score + keyword_score
Results sorted by combined score in descending order.
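The scoring described above can be sketched as follows. The `exp(-distance)` mapping and the 0-0.5 ranges come from the text; everything else is an assumption for illustration, in particular the function names and the max-normalization used to scale keyword relevance. The sketch also assumes the keyword relevance has already been converted to a non-negative "higher is better" value. The real logic in searcher.py may differ.

```python
import math


def semantic_score(distance: float) -> float:
    """Map a vector distance (>= 0) to the 0-0.5 range via exp(-distance)."""
    return 0.5 * math.exp(-distance)


def keyword_score(relevance: float, max_relevance: float) -> float:
    """Scale keyword relevance to 0-0.5 (max-normalization is an assumption)."""
    if max_relevance <= 0:
        return 0.0
    return 0.5 * min(relevance / max_relevance, 1.0)


def combine(
    distances: dict[str, float], relevances: dict[str, float]
) -> list[tuple[str, float]]:
    """Sum both scores per path and sort by combined score, highest first."""
    max_rel = max(relevances.values(), default=0.0)
    scores = {}
    for path in set(distances) | set(relevances):
        # A file found by only one method scores 0 for the other.
        sem = semantic_score(distances[path]) if path in distances else 0.0
        kw = keyword_score(relevances[path], max_rel) if path in relevances else 0.0
        scores[path] = sem + kw
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because each component is capped at 0.5, the combined score stays within 0-1, and a file matched by both methods can outrank one that matches only a single method perfectly.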

Database Schema

See schema.sql for full details:
-- Core file data
CREATE TABLE files (
  path TEXT PRIMARY KEY,
  content TEXT,
  title TEXT,
  summary TEXT,
  keywords TEXT,
  hash TEXT,
  status TEXT,
  modified_time INTEGER
);

-- Vector search (sqlite-vec)
CREATE VIRTUAL TABLE file_embeddings USING vec0(
  embedding FLOAT[768]
);

-- Keyword search (FTS5)
CREATE VIRTUAL TABLE files_fts USING fts5(
  content, title, keywords
);
Triggers keep all three tables synchronized automatically.
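To illustrate how triggers can keep a keyword index in step with the core table, here is a minimal sketch using Python's built-in sqlite3 module. It covers only the FTS5 side (the vec0 table needs the sqlite-vec extension) and assumes an SQLite build with FTS5 enabled. The trigger names, the extra `path` column in the FTS table, and the `search` helper are hypothetical and are not taken from schema.sql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE files (
      path TEXT PRIMARY KEY, content TEXT, title TEXT, keywords TEXT
    );
    -- `path` is stored but not indexed, so matches can be traced back to files.
    CREATE VIRTUAL TABLE files_fts USING fts5(
      path UNINDEXED, content, title, keywords
    );
    -- Hypothetical triggers: mirror inserts and deletes into the index.
    CREATE TRIGGER files_after_insert AFTER INSERT ON files BEGIN
      INSERT INTO files_fts (path, content, title, keywords)
      VALUES (new.path, new.content, new.title, new.keywords);
    END;
    CREATE TRIGGER files_after_delete AFTER DELETE ON files BEGIN
      DELETE FROM files_fts WHERE path = old.path;
    END;
    """
)


def search(query: str) -> list[str]:
    """Return matching paths, best BM25 rank first."""
    rows = conn.execute(
        "SELECT path FROM files_fts WHERE files_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

With the triggers in place, application code only ever writes to `files`; the index follows without any extra calls.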

Async Programming in CosmaSense

CosmaSense uses Python’s async/await for non-blocking I/O operations.

Key Concepts

# Use 'async def' to create async functions
async def fetch_data():
    result = await database.query()
    return result

# Call with 'await' (only valid inside another async function)
data = await fetch_data()
Critical: Never use time.sleep() in async functions! It will freeze the entire app. Use await asyncio.sleep() instead.
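The difference is easy to see in a small standalone example. The worker names and delays below are made up for illustration. Because `asyncio.sleep()` suspends only the coroutine that awaits it, both workers run concurrently and the one with the shorter delay finishes first.

```python
import asyncio


async def worker(name: str, delay: float, log: list[str]) -> None:
    # asyncio.sleep() yields to the event loop; other coroutines keep running.
    await asyncio.sleep(delay)
    log.append(name)


async def main() -> list[str]:
    log: list[str] = []
    # Both workers start together, even though "slow" is listed first.
    await asyncio.gather(worker("slow", 0.2, log), worker("fast", 0.01, log))
    return log


completion_order = asyncio.run(main())
print(completion_order)  # ['fast', 'slow']
```

Replacing `await asyncio.sleep(delay)` with `time.sleep(delay)` would make each worker hold the event loop until it finished, so "slow" would complete before "fast" even started.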

Real-time Updates

Server-Sent Events (SSE)

The backend uses SSE to push updates to the TUI:
event: update
data: {"opcode": "file_parsing", "data": {"path": "/path/to/file.txt", "filename": "file.txt"}}
Events are published via an internal Hub and streamed to clients as SSE update events. Each event has an opcode and a data payload. Event categories:
  • File processing: file_parsing, file_parsed, file_summarizing, file_summarized, file_embedding, file_embedded, file_complete, file_failed, file_skipped
  • File system: file_created, file_modified, file_deleted, file_moved, directory_deleted, directory_moved
  • Queue: queue_item_added, queue_item_updated, queue_item_processing, queue_item_completed, queue_item_failed, queue_item_removed, queue_paused, queue_resumed
  • Scheduler: scheduler_paused, scheduler_resumed
  • Watch: watch_started, watch_added, watch_removed
  • General: status_update, error, info, shutting_down
See the Real-time Updates API reference for full details.
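On the client side, an event like the one shown above can be decoded with a few lines of Python. This is a minimal sketch, not the TUI's actual client code: `parse_sse_event` is a hypothetical helper that handles only the `event:` and `data:` fields of a single event block.

```python
import json


def parse_sse_event(raw: str) -> dict:
    """Parse one SSE event block into its event name, opcode, and payload."""
    event_name = "message"  # SSE default when no 'event:' field is present
    data_lines = []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
    # Multiple data lines in one event are joined with newlines.
    body = json.loads("\n".join(data_lines))
    return {"event": event_name, "opcode": body["opcode"], "data": body["data"]}


raw_event = (
    "event: update\n"
    'data: {"opcode": "file_parsing", '
    '"data": {"path": "/path/to/file.txt", "filename": "file.txt"}}\n'
)
parsed = parse_sse_event(raw_event)
print(parsed["opcode"], parsed["data"]["filename"])  # file_parsing file.txt
```

A client would typically dispatch on `parsed["opcode"]` to decide which part of the UI to refresh.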

File Watching

The backend uses the watchdog library to monitor filesystem changes:
# Auto-reindex when files change
watcher.schedule(handler, path, recursive=True)

Development Setup

1. Clone Repository

git clone https://github.com/cosmasense/cosma.git
cd cosma
2. Install Dependencies

pip install -r requirements.txt

# For development
pip install -r requirements-dev.txt
3. Run Backend

# Start backend server (port 60534)
cosma serve

# Or with debug logging
DEBUG=1 cosma serve
4. Run TUI

# In another terminal
cosma search

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=cosma_backend

# Run specific test file
pytest tests/test_pipeline.py

Troubleshooting

You forgot to use await when calling an async function:
# Wrong
result = async_function()

# Correct
result = await async_function()
You’re running blocking code in an async function. Use asyncio.to_thread():
# Wrong
async def process():
    slow_operation()  # Blocks event loop

# Correct
async def process():
    await asyncio.to_thread(slow_operation)
Multiple async operations are trying to write to the database simultaneously. Use proper connection pooling:
async with app.db.acquire() as conn:
    await conn.execute("INSERT ...")

Further Reading