---
name: text-embeddings
description: Generate text embeddings using Ollama models and store them in SQLite for semantic search and similarity calculations. Works with any text content (articles, documents, notes, etc.).
compatibility: Created for Zo Computer
metadata:
  author: rob.zo.computer
---

# Text Embeddings Skill

Generate text embeddings using Ollama's local embedding models and store them in SQLite for semantic search, similarity calculations, and retrieval.

## Prerequisites

- Ollama installed and running (`ollama serve`)
- An embedding model pulled (e.g., `embeddinggemma`, `nomic-embed-text`)
- Python 3 with `sqlite3` (built-in)

## Quick Start

### 1. Start Ollama Server

```bash
# Start the Ollama server in background
nohup ollama serve > /dev/shm/ollama.log 2>&1 &

# Verify it's running
sleep 2 && ollama list
```

### 2. Pull an Embedding Model

```bash
# Pull a lightweight embedding model
ollama pull embeddinggemma

# Or pull another model
ollama pull nomic-embed-text
```

### 3. Generate Embeddings

Use the scripts in `scripts/` to generate embeddings for your text content.

## Available Scripts

### `embed_text.py` — Generate embeddings for arbitrary text

Generate embeddings for any text content and store in SQLite.

```bash
# Embed a single text string
python3 scripts/embed_text.py --text "Your text here" --model embeddinggemma

# Embed multiple texts from a file (one per line)
python3 scripts/embed_text.py --file texts.txt --model embeddinggemma

# Embed all markdown files in a directory
python3 scripts/embed_text.py --dir /path/to/docs --model embeddinggemma --pattern "*.md"
```

**Output:** Creates `embeddings.db` with a `texts` table containing:
- `id` — Primary key
- `text` — Original text content
- `source` — Source file or identifier
- `embedding` — JSON array of embedding values
- `model` — Model used for embedding
- `created_at` — Timestamp
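
The scripts themselves aren't shown here, but generation presumably reduces to a POST against Ollama's `/api/embeddings` endpoint. A minimal self-contained sketch (the helper names and the default model are illustrative assumptions, not the scripts' actual API):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_payload(text: str, model: str = "embeddinggemma") -> bytes:
    """Build the JSON request body for Ollama's embeddings endpoint."""
    return json.dumps({"model": model, "prompt": text}).encode("utf-8")

def get_embedding(text: str, model: str = "embeddinggemma") -> list:
    """POST text to a locally running Ollama server and return the vector."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(text, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```

Requires `ollama serve` to be running locally; the response body contains a single `embedding` array of floats.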

### `search_similar.py` — Find similar texts by semantic search

Find texts most similar to a query using cosine similarity.

```bash
# Search for similar texts
python3 scripts/search_similar.py --query "programming" --db embeddings.db

# Show top N results
python3 scripts/search_similar.py --query "machine learning" --db embeddings.db --top 5

# Use a different similarity metric
python3 scripts/search_similar.py --query "cooking" --db embeddings.db --metric euclidean
```

**Output:** Prints ranked results with similarity scores.
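
Under the hood, ranking amounts to decoding each stored JSON embedding, scoring it against the query vector, and sorting. A self-contained toy sketch with an in-memory table and 2-dimensional stand-in vectors (names are illustrative; the real script would first embed the query text via Ollama):

```python
import json
import math
import sqlite3

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(db, query_vec, top=5):
    """Rank stored texts by cosine similarity to the query vector."""
    rows = db.execute("SELECT text, embedding FROM embeddings").fetchall()
    scored = [(cosine(query_vec, json.loads(emb)), text) for text, emb in rows]
    return sorted(scored, reverse=True)[:top]

# Toy demo: two stored "embeddings" in an in-memory table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE embeddings (text TEXT, embedding BLOB)")
db.executemany(
    "INSERT INTO embeddings VALUES (?, ?)",
    [("cats", json.dumps([1.0, 0.0])), ("finance", json.dumps([0.0, 1.0]))],
)
print(search(db, [0.9, 0.1], top=1))  # "cats" ranks first
```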

### `batch_embed.py` — Batch process files

Process multiple files or directories in batch.

```bash
# Process all markdown files in a directory
python3 scripts/batch_embed.py --input /path/to/articles --output articles.db --model embeddinggemma

# Process with custom pattern
python3 scripts/batch_embed.py --input /path/to/docs --output docs.db --model embeddinggemma --pattern "*.txt"

# Process with custom table name
python3 scripts/batch_embed.py --input /path/to/notes --output notes.db --model embeddinggemma --table notes
```

## SQLite Vector Storage

SQLite doesn't have native vector support, so embeddings are stored as JSON arrays in BLOB columns. This approach works well for:

- **Storage:** Plain JSON serialization of float arrays
- **Querying:** SQLite's JSON functions for extraction and manipulation
- **Portability:** Standard SQLite, no extensions needed

### Schema Example

```sql
CREATE TABLE embeddings (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  text TEXT NOT NULL,
  source TEXT,
  embedding BLOB,  -- JSON array: [0.1, -0.2, 0.3, ...]
  model TEXT,
  embedding_dim INTEGER,
  created_at TEXT DEFAULT (datetime('now'))
);
```
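
Working against this schema from Python needs nothing beyond the stdlib: `json.dumps` the vector on insert, and SQLite's JSON1 functions read it back. A minimal round-trip sketch:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE embeddings (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  text TEXT NOT NULL,
  source TEXT,
  embedding BLOB,
  model TEXT,
  embedding_dim INTEGER,
  created_at TEXT DEFAULT (datetime('now'))
)
""")

vec = [0.1, -0.2, 0.3]  # stand-in for a real model output
db.execute(
    "INSERT INTO embeddings (text, source, embedding, model, embedding_dim) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Your text here", "demo", json.dumps(vec), "embeddinggemma", len(vec)),
)

# JSON1 functions operate directly on the stored value
dim, v0 = db.execute(
    "SELECT json_array_length(embedding), json_extract(embedding, '$[0]') "
    "FROM embeddings"
).fetchone()
print(dim, v0)  # prints: 3 0.1
```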

### Querying Embeddings

```sql
-- Get embedding dimension
SELECT json_array_length(embedding) as dim FROM embeddings LIMIT 1;

-- Extract first few values
SELECT 
  json_extract(embedding, '$[0]') as v0,
  json_extract(embedding, '$[1]') as v1,
  json_extract(embedding, '$[2]') as v2
FROM embeddings
LIMIT 1;

-- Find texts with specific embedding dimension
SELECT text, source FROM embeddings WHERE json_array_length(embedding) = 768;
```

## Similarity Metrics

The scripts support multiple similarity metrics:

### Cosine Similarity (default)

Measures the cosine of the angle between vectors. Range: [-1, 1], where 1 is identical.

```python
dot = sum(a * b for a, b in zip(vec1, vec2))
similarity = dot / (sqrt(sum(a * a for a in vec1)) * sqrt(sum(b * b for b in vec2)))
```

### Euclidean Distance

Measures straight-line distance between vectors. Range: [0, ∞), where 0 is identical.

```python
distance = sqrt(sum((a - b) ** 2 for a, b in zip(vec1, vec2)))
```

### Dot Product

Measures alignment between vectors. Range: (-∞, ∞); higher means more similar. For unit-normalized vectors it equals cosine similarity.

```python
dot = sum(a * b for a, b in zip(vec1, vec2))
```
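
All three metrics fit in a few lines of dependency-free Python:

```python
import math

def cosine_similarity(vec1, vec2):
    """Angle-based similarity in [-1, 1]; 1 means identical direction."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

def euclidean_distance(vec1, vec2):
    """Straight-line distance in [0, inf); 0 means identical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec1, vec2)))

def dot_product(vec1, vec2):
    """Unnormalized alignment; higher means more similar."""
    return sum(a * b for a, b in zip(vec1, vec2))

a, b = [1.0, 0.0], [0.0, 1.0]
print(cosine_similarity(a, b), euclidean_distance(a, b), dot_product(a, b))
```

Orthogonal vectors like `a` and `b` above score 0 on cosine similarity and dot product, which is why mixing embeddings from different models (different spaces, different dimensions) produces meaningless rankings.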

## Usage Patterns

### Pattern 1: Article/Document Search

Embed all your articles, then search semantically:

```bash
# Embed articles
python3 scripts/batch_embed.py --input Articles --output articles.db --model embeddinggemma

# Search
python3 scripts/search_similar.py --query "machine learning" --db articles.db
```

### Pattern 2: Note Organization

Embed your notes for semantic organization:

```bash
# Embed notes
python3 scripts/batch_embed.py --input Notes --output notes.db --model embeddinggemma --pattern "*.md"

# Find related notes
python3 scripts/search_similar.py --query "project planning" --db notes.db
```

### Pattern 3: Code Snippet Search

Embed code documentation or comments:

```bash
# Embed code docs
python3 scripts/batch_embed.py --input docs --output code.db --model embeddinggemma

# Search for relevant code
python3 scripts/search_similar.py --query "authentication" --db code.db
```

## Advanced Usage

### Custom Embedding Functions

The `embed_text.py` script can be imported as a module:

```python
from scripts.embed_text import get_embedding, store_embedding

# Get embedding for any text
embedding = get_embedding("Your text here", model="embeddinggemma")

# Store in database
store_embedding(
    db_path="embeddings.db",
    text="Your text here",
    source="custom",
    embedding=embedding,
    model="embeddinggemma"
)
```

### Batch Processing with Custom Logic

Modify `batch_embed.py` to add custom preprocessing:

```python
import re

def preprocess_text(text: str) -> str:
    """Custom text preprocessing: strip markdown syntax, collapse whitespace."""
    text = re.sub(r"[#*_`>]+", " ", text)  # drop common markdown markers
    return re.sub(r"\s+", " ", text).strip()
```

### Integration with Other Skills

Combine with other skills for powerful workflows:

- **markdown-to-public-site:** Embed published docs for search
- **personal-data:** Embed health/activity data for pattern discovery
- **econ-scratchpad:** Embed research notes for semantic retrieval

## High-Throughput Embedding

The basic `ollama run` approach processes texts sequentially (~1 text/sec). For large-scale embedding, use these alternatives:

### Option 1: Concurrent Ollama API Calls

Use Ollama's HTTP API with async concurrency. ~10x improvement over subprocess calls.

```python
import asyncio
import aiohttp

OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "embeddinggemma"
CONCURRENCY = 16

async def get_embedding(session, text, semaphore):
    async with semaphore:
        async with session.post(OLLAMA_URL, json={"model": MODEL, "prompt": text[:2000]}) as resp:
            result = await resp.json()
            return result.get("embedding")

async def embed_all(texts):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [get_embedding(session, t, semaphore) for t in texts]
        return await asyncio.gather(*tasks)

# Usage
embeddings = asyncio.run(embed_all(["text1", "text2", ...]))
```

**Note:** Ollama still serializes GPU work internally, so this saturates at ~10-20 texts/sec.

### Option 2: sentence-transformers (Recommended for High Throughput)

Load the model directly and use GPU batching. **800+ texts/sec** on a modern GPU.

```bash
pip install sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dims, fast
# model = SentenceTransformer("all-mpnet-base-v2")  # 768 dims, higher quality

texts = ["text1", "text2", ...]
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)
```

**Model recommendations:**
- `all-MiniLM-L6-v2` — Fast, 384 dims, good for most use cases
- `all-mpnet-base-v2` — Higher quality, 768 dims
- `BAAI/bge-small-en-v1.5` — Strong retrieval quality for its size, 384 dims

### Throughput Comparison

| Method | Throughput | GPU Utilization |
|--------|-----------|-----------------|
| `ollama run` (subprocess) | ~1 text/sec | Low |
| Ollama API (concurrent) | ~10-20 texts/sec | Medium |
| sentence-transformers (batched) | 800+ texts/sec | High |

## Troubleshooting

### Ollama Server Not Running

```bash
# Check if running
ps aux | grep ollama

# Start server
nohup ollama serve > /dev/shm/ollama.log 2>&1 &

# Check logs
tail -f /dev/shm/ollama.log
```

### Model Not Found

```bash
# List available models
ollama list

# Pull missing model
ollama pull embeddinggemma
```

### Embedding Dimension Mismatch

Ensure all embeddings use the same model. Different models produce different dimensions:

```bash
# Check embedding dimensions
sqlite3 embeddings.db "SELECT model, json_array_length(embedding) as dim FROM embeddings GROUP BY model;"
```

## Performance Tips

- **Batch processing:** Process multiple texts in one script run
- **Background server:** Keep Ollama server running for faster requests
- **Model selection:** Use smaller models (embeddinggemma) for faster embedding
- **Database indexing:** Add indexes on `source` or `model` columns for faster queries

```sql
CREATE INDEX idx_source ON embeddings(source);
CREATE INDEX idx_model ON embeddings(model);
```

## References

- [Ollama Documentation](https://ollama.com/docs)
- [SQLite JSON Functions](https://www.sqlite.org/json1.html)
- [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)