---
name: personal-data
description: Central documentation for all personal data exports — iMessage, Spotify, Apple Health, Amazon/Audible, Google (Chrome, YouTube, Calendar, Maps), GitHub, X/Twitter, Anthropic Claude, Amex, Letterboxd, DoorDash, Function Health lab results, 23andMe, and uLogMe activity tracking. Includes schema overviews, common queries, and cross-dataset analysis patterns.
compatibility: Created for Zo Computer
metadata:
  author: rob.zo.computer
---
# Personal Data

This skill is the central reference for working with personal data exports. It covers:

1. **What data exists** — 14 datasets from various services, all stored as DuckDB databases
2. **How to query it** — Schema details, example queries, and cross-dataset patterns
3. **The rc-data site** — A notebook-style website for visualizing and exploring this data
4. **Maintenance** — How to refresh exports, add new datasets, and keep things current

**When to use this skill**: Any time you're asked about personal data, habits, history, or patterns—or when building analysis notebooks in rc-data.

---

## Quick Reference

| Dataset | Database Path | What's In It |
|---------|--------------|--------------|
| iMessage | `dataset-imessage/data.duckdb` | Texts, group chats, attachments |
| Spotify | `zo-data/spotify/data.duckdb` | Music & podcast listening |
| Apple Health | `zo-data/apple_health/data.duckdb` | Steps, heart rate, workouts, sleep |
| Amazon/Audible | `zo-data/amazon/data.duckdb` | Purchases, Kindle, audiobooks |
| Google | `zo-data/google/data.duckdb` | Chrome, YouTube, Calendar, Maps, Search |
| GitHub | `zo-data/github/data.duckdb` | Repos, PRs, issues |
| X/Twitter | `zo-data/x/data.duckdb` | Tweets, likes, DMs |
| Anthropic | `zo-data/anthropic/data.duckdb` | Claude conversations |
| Amex | `zo-data/amex/data.duckdb` | Credit card transactions |
| Letterboxd | `zo-data/letterboxd/data.duckdb` | Film ratings, reviews |
| DoorDash | `zo-data/doordash/data.duckdb` | Food delivery orders |
| Function Health | `function-health/data.duckdb` | Lab tests, biomarkers, health metrics |
| 23andMe | `zo-data/23_and_me/data.duckdb` | Genetic SNP data |
| uLogMe | `data/data.duckdb` | Desktop activity tracking |

All paths relative to `/home/workspace/`.
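
A small helper (hypothetical, not part of any dataset tooling) can resolve these relative paths against the workspace root:

```python
from pathlib import Path

# Hypothetical helper: resolve a dataset's relative database path
# against the workspace root used throughout this document.
def db_path(relative: str, root: str = "/home/workspace") -> Path:
    return Path(root) / relative

# Example: the Spotify database
spotify_db = db_path("zo-data/spotify/data.duckdb")
```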

---

## How to Query

All datasets use **DuckDB**, a fast analytical database. Query it from the command line or from Python:

```bash
# Command line
duckdb zo-data/spotify/data.duckdb -c "SELECT artist_name, COUNT(*) FROM streams GROUP BY 1 ORDER BY 2 DESC LIMIT 10"

# Show all tables
duckdb zo-data/spotify/data.duckdb -c "SHOW TABLES"

# Describe a table
duckdb zo-data/spotify/data.duckdb -c "DESCRIBE streams"
```

```python
# Python
import duckdb
conn = duckdb.connect("zo-data/spotify/data.duckdb", read_only=True)
df = conn.execute("SELECT * FROM streams LIMIT 10").fetchdf()
```

### Cross-Dataset Queries

DuckDB can attach multiple databases:

```sql
ATTACH 'zo-data/spotify/data.duckdb' AS spotify (READ_ONLY);
ATTACH 'zo-data/apple_health/data.duckdb' AS health (READ_ONLY);

-- Pre-aggregate steps per day so the join doesn't multiply rows
WITH daily_steps AS (
  SELECT DATE(start_date) AS day, SUM(value) AS steps
  FROM health.records WHERE type = 'step_count' GROUP BY 1
)
SELECT DATE(s.played_at) AS day, COUNT(*) AS songs, MAX(d.steps) AS steps
FROM spotify.streams s
JOIN daily_steps d ON DATE(s.played_at) = d.day
GROUP BY 1;
```

### Schema Discovery

Each dataset directory contains:
- `data.duckdb` — The database file
- `schema.yaml` — Table/column descriptions and row counts
- `README.md` — Context about the data source and quirks

To explore an unfamiliar dataset:
```bash
# List tables
duckdb path/to/data.duckdb -c "SHOW TABLES"

# See columns
duckdb path/to/data.duckdb -c "DESCRIBE table_name"

# Sample data
duckdb path/to/data.duckdb -c "SELECT * FROM table_name LIMIT 5"
```

---

## rc-data Analysis Site

The **rc-data** site (`/home/workspace/rc-data`) is a personal data exploration hub—a growing collection of notebook-style pages combining prose and visualizations.

> **Important**: The rc-data repo has its own `README.md` with detailed architecture docs. Read it when making changes.

### What Are Notebooks?

Notebooks are **ad-hoc, scratchpad-style pages** for data analysis. They're working documents where you explore questions like:

- "How has my X usage changed over the years?"
- "What time of day do I listen to music most?"
- "How does my step count correlate with my mood?"

Each notebook is a **visual prose response**: explanatory text, charts that reveal patterns, and summary stats. Over time, the site accumulates these explorations—becoming a personal reference for understanding your own data.

### URL Structure

```
/{dataset}/{notebook-name}
```

Examples:
- `/x/overview` — General X/Twitter usage summary
- `/spotify/top-artists` — Top artists analysis
- `/health/sleep-trends` — Sleep pattern exploration

### Registry

The notebook registry at `rc-data/src/data/registry.json` is the source of truth. It drives the homepage automatically.

```json
{
  "x": {
    "name": "X / Twitter",
    "blurb": "Tweets, likes, DMs, followers",
    "notebooks": [
      { "slug": "overview", "title": "Overview", "description": "General usage summary" }
    ]
  }
}
```

**Always add new notebooks to the registry.** They'll appear on the homepage automatically.
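
Since the registry drives the homepage, a malformed entry silently drops a notebook. A sanity check along these lines (hypothetical; field names taken from the example above) can catch that after edits:

```python
# Hypothetical sanity check for registry.json. Field names (name,
# blurb, notebooks, slug, title, description) follow the example above.
def validate_registry(registry: dict) -> list:
    problems = []
    for dataset, entry in registry.items():
        for field in ("name", "blurb", "notebooks"):
            if field not in entry:
                problems.append(f"{dataset}: missing '{field}'")
        for nb in entry.get("notebooks", []):
            for field in ("slug", "title", "description"):
                if field not in nb:
                    problems.append(f"{dataset}: notebook missing '{field}'")
    return problems
```

An empty return list means every entry is well-formed.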

### Hidden Notebooks

A **hidden notebook** is a page that has a route but is **not linked from anywhere**, including the homepage. Use this for sensitive, draft, or highly specific analyses you don’t want discoverable from navigation.

**How to create one:**
1. Build the notebook page component as usual.
2. Add a route in `rc-data/src/App.tsx`.
3. **Do not** add it to `rc-data/src/data/registry.json`.

The URL will work if you know it, but it won’t appear on the homepage or in any registry-driven navigation.

### Creating a Notebook

#### 1. Write a Python Query Script

Create `rc-data/src/data/{dataset}/{notebook}.py`:

```python
#!/usr/bin/env python3
import duckdb
import json
from pathlib import Path
from datetime import datetime

DB_PATH = "/home/workspace/zo-data/{dataset}/data.duckdb"
OUTPUT_PATH = Path(__file__).parent / "{notebook}.json"

conn = duckdb.connect(DB_PATH, read_only=True)

# Run queries
summary = conn.execute("SELECT COUNT(*) FROM table_name").fetchone()
by_year = conn.execute("""
    SELECT EXTRACT(YEAR FROM created_at) as year, COUNT(*) as count
    FROM table_name GROUP BY year ORDER BY year
""").fetchall()

# Structure output
data = {
    "summary": {"total": summary[0]},
    "byYear": [{"year": int(r[0]), "count": r[1]} for r in by_year],
    "generatedAt": datetime.now().isoformat(),
}

OUTPUT_PATH.write_text(json.dumps(data, indent=2))
print(f"Wrote {OUTPUT_PATH}")
conn.close()
```

Run: `python3 rc-data/src/data/{dataset}/{notebook}.py`

**Why Python?** DuckDB's Python bindings are reliable. Bun/Node bindings have compatibility issues.

#### 2. Create the Page Component

Create `rc-data/src/pages/{dataset}/{Notebook}.tsx`:

```tsx
import data from "@/data/{dataset}/{notebook}.json";
import NotebookLayout from "@/components/NotebookLayout";
import { ChartContainer, ChartConfig, ChartTooltip, ChartTooltipContent } from "@/components/ui/chart";
import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card";
import { BarChart, Bar, XAxis, YAxis, CartesianGrid } from "recharts";

const chartConfig = {
  count: { label: "Count", color: "var(--chart-1)" },
} satisfies ChartConfig;

export default function NotebookPage() {
  return (
    <NotebookLayout title="Title" subtitle="Optional subtitle">
      <p className="text-muted-foreground mb-8">Prose explaining what we're looking at...</p>
      
      <Card>
        <CardHeader><CardTitle>Chart Title</CardTitle></CardHeader>
        <CardContent>
          <ChartContainer config={chartConfig} className="h-[250px] w-full">
            <BarChart data={data.byYear}>
              <CartesianGrid vertical={false} />
              <XAxis dataKey="year" />
              <YAxis />
              <ChartTooltip content={<ChartTooltipContent />} />
              <Bar dataKey="count" fill="var(--color-count)" radius={4} />
            </BarChart>
          </ChartContainer>
        </CardContent>
      </Card>
    </NotebookLayout>
  );
}
```

`NotebookLayout` provides consistent styling and a home button.

#### 3. Add the Route

In `rc-data/src/App.tsx`:

```tsx
import NotebookPage from "./pages/{dataset}/{Notebook}";
// ...
<Route path="/{dataset}/{notebook}" element={<NotebookPage />} />
```

#### 4. Add to Registry

Update `rc-data/src/data/registry.json` — the notebook appears on the homepage automatically.

### Why Bake JSON?

Query scripts output JSON that's imported at build time:

- **Fast**: Pages load instantly, no database calls
- **Static**: Site can be published without backend
- **Versioned**: JSON snapshots in git show how data changes
- **Decoupled**: Query logic separate from presentation

To refresh: re-run the Python script.
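
To refresh every notebook at once, one option (a sketch, assuming each `*.py` under `rc-data/src/data/` is a standalone query script) is a small runner:

```python
import subprocess
from pathlib import Path

# Sketch: re-run every bake script under a data directory.
# Assumes each *.py file found is a standalone query script.
def refresh_all(data_dir: str) -> list:
    ran = []
    for script in sorted(Path(data_dir).rglob("*.py")):
        subprocess.run(["python3", str(script)], check=True)
        ran.append(str(script))
    return ran
```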

### Chart Guidelines

Charts use **ShadCN/UI components** (built on Recharts). Key rules:

1. **Always use ChartContainer + ChartConfig**
2. **Color variables**: `var(--chart-1)` through `var(--chart-5)` — never wrap in `hsl()`
3. **Pie tooltips**: Use `hideLabel` with matching `nameKey`
4. **Pie data**: Include `fill` property per slice

Full reference: `rc-data/docs/shadcncharts.md`

### Current Notebooks

| Path | Description |
|------|-------------|
| `/x/overview` | X/Twitter usage summary |

---

## Data Maintenance

Dataset management is handled through Zo's **[Datasets](/?t=datasets)** feature.

### What the Datasets Tab Does

- **Add new datasets**: Import source data (exports, archives, raw files) and Zo will ingest it into a standardized DuckDB database
- **Rebuild existing datasets**: When you have fresh exports, re-run ingestion to update the database
- **Track status**: See which datasets exist, their row counts, and when they were last updated

### How It Works

Each dataset lives in a directory (e.g., `zo-data/spotify/`) with:
- `source/` — Raw export files you provide
- `ingest/` — Scripts that transform source → DuckDB  
- `data.duckdb` — The queryable database
- `schema.yaml` — Auto-generated schema documentation
- `README.md` — Context about the data source

Zo follows a predefined process: it inspects the input source data, normalizes it into a single DuckDB file, generates documentation, and prepares it for uniform analysis. You don't need to write ingest scripts yourself—just provide the source data and let the Datasets feature handle the rest.

### Getting Fresh Exports

Most datasets come from service exports:

| Service | Where to Export |
|---------|----------------|
| Spotify | Privacy settings → Download your data |
| Google | takeout.google.com |
| X/Twitter | Settings → Your Account → Download archive |
| Apple Health | Health app → Export All Health Data |
| Amazon | Request Your Data page |
| Letterboxd | Settings → Import & Export |
| 23andMe | Settings → Download Raw Data |
| Anthropic | Claude settings → Export conversations |
| Function Health | functionhealth.com → Dashboard → Download results |

---

## Keeping This Skill Updated

**This skill should evolve as we learn.** If you discover:

- A useful query pattern → add it to the relevant dataset section
- A new table or field → update the schema documentation
- A gotcha or quirk → note it in General Tips or the dataset section
- A new dataset → add full documentation following existing format
- A notebook workflow improvement → update the Creating a Notebook section

The goal is that someone (or a future AI session) coming in fresh can understand everything needed to work with this data.

---

## Detailed Dataset Reference

### iMessage

**Database:** `dataset-imessage/data.duckdb`

**Tables:**
- `messages` — All sent/received messages
- `handles` — Contacts (phone/email)
- `chats` — Conversations (1:1 and group)
- `attachments` — File metadata
- `chat_messages`, `chat_participants`, `message_attachments` — Junction tables

**Key Fields:**
- `is_from_me` — TRUE if you sent it
- `service` — "iMessage" or "SMS"
- `sent_at` — Timestamp
- `associated_message_type` — Tapbacks (2000-2005 = adding, 3000-3005 = removing)
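
The tapback codes can be decoded with a small lookup. The adding/removing split (2000s vs 3000s) is from the field note above; the per-offset reaction names are the commonly reported Apple ordering, so spot-check them against your data:

```python
# Tapback decoder. The 2000/3000 split is documented above; the
# offset-to-reaction ordering is the commonly reported Apple mapping
# (verify against your own data).
REACTIONS = ["loved", "liked", "disliked", "laughed", "emphasized", "questioned"]

def decode_tapback(associated_message_type):
    if 2000 <= associated_message_type <= 2005:
        return f"added {REACTIONS[associated_message_type - 2000]}"
    if 3000 <= associated_message_type <= 3005:
        return f"removed {REACTIONS[associated_message_type - 3000]}"
    return None  # not a tapback
```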

**Example Queries:**
```sql
-- Messages per contact (top 10)
SELECT h.identifier, COUNT(*) as msg_count
FROM messages m JOIN handles h ON m.handle_id = h.handle_id
GROUP BY h.identifier ORDER BY msg_count DESC LIMIT 10;

-- Messages by hour
SELECT EXTRACT(HOUR FROM sent_at) as hour, COUNT(*) as cnt
FROM messages GROUP BY hour ORDER BY hour;

-- Group chat activity
SELECT c.display_name, COUNT(*) as messages
FROM messages m
JOIN chat_messages cm ON m.message_id = cm.message_id
JOIN chats c ON cm.chat_id = c.chat_id
WHERE c.style = 45
GROUP BY c.display_name ORDER BY messages DESC;
```

---

### Spotify

**Database:** `zo-data/spotify/data.duckdb`

**Tables:**
- `streams` — Music listening history
- `podcast_streams` — Podcast episodes
- `video_streams` — Video content

**Key Fields:**
- `played_at` — Timestamp (UTC)
- `track_name`, `artist_name`, `album_name`
- `ms_played` — Duration in milliseconds
- `skipped` — Whether track was skipped

**Example Queries:**
```sql
-- Top artists by listening time
SELECT artist_name, SUM(ms_played)/3600000.0 as hours
FROM streams GROUP BY artist_name ORDER BY hours DESC LIMIT 20;

-- Listening by hour
SELECT EXTRACT(HOUR FROM played_at) as hour, COUNT(*) as plays
FROM streams GROUP BY hour ORDER BY hour;

-- Skip rate by artist
SELECT artist_name, 
       ROUND(100.0 * COUNT(*) FILTER (WHERE skipped) / COUNT(*), 1) as skip_pct
FROM streams GROUP BY artist_name HAVING COUNT(*) > 50 ORDER BY skip_pct DESC;
```

---

### Apple Health

**Database:** `zo-data/apple_health/data.duckdb`

**Tables:**
- `records` — All measurements (steps, heart rate, etc.)
- `sleep` — Sleep stage data (from Eight Sleep, Garmin Connect, iPhone)
- `workouts` — Workout sessions
- `activity_summaries` — Daily Watch ring data

**Key Fields (records):**
- `type` — Measurement type (step_count, heart_rate, etc.)
- `value`, `unit` — The measurement
- `start_date`, `end_date` — Time range
- `source_name` — Device/app

**Key Fields (sleep):**
- `start_date`, `end_date` — Time range for each sleep segment
- `stage` — Sleep stage: `deep`, `rem`, `core`, `awake`, `in_bed`
- `hours` — Duration of this segment in hours
- `source_name` — Device: `"Eight Sleep"`, `"Connect"` (Garmin), or `"Robert's iPhone"`

**⚠️ Sleep Data Quirk — Deduplication Required:**

The `sleep` table contains **duplicate rows**: identical `(start_date, end_date, stage, source_name)` tuples appear 2-3x per night (varies by day). This is an artifact of the Apple Health export. **Always deduplicate** before aggregating.

Additionally, multiple sources may report sleep for the same night. Source preference order:
1. **Garmin ("Connect")** — preferred (wrist-based tracking)
2. **Eight Sleep** — fallback (mattress-based tracking)
3. **Robert's iPhone** — ignore (minimal data, only ~9 days in 2023)

As of Jan 2026, Garmin data stops at Dec 26, 2025 — recent sleep is Eight Sleep only.

**Canonical sleep query (deduplicated, source-preferred):**
```sql
WITH deduped AS (
  SELECT DISTINCT start_date, end_date, stage, hours, source_name
  FROM sleep
  WHERE start_date >= CURRENT_DATE - INTERVAL '30 days'
    AND stage NOT IN ('in_bed', 'awake')
),
preferred AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY start_date, end_date, stage
    ORDER BY CASE source_name
      WHEN 'Connect' THEN 1
      WHEN 'Eight Sleep' THEN 2
      ELSE 3 END
  ) as rn
  FROM deduped
)
SELECT
  CAST(start_date AS DATE) as day,
  ROUND(SUM(hours), 2) as total_sleep,
  ROUND(SUM(hours) FILTER (WHERE stage = 'deep'), 2) as deep,
  ROUND(SUM(hours) FILTER (WHERE stage = 'rem'), 2) as rem,
  ROUND(SUM(hours) FILTER (WHERE stage = 'core'), 2) as core
FROM preferred
WHERE rn = 1
GROUP BY 1
ORDER BY 1 DESC;
```
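
The same deduplicate-then-prefer logic, sketched in plain Python for clarity (rows as `(start_date, end_date, stage, hours, source_name)` tuples):

```python
# Sketch of the SQL above: drop exact duplicate rows, then keep one
# row per (start, end, stage) from the preferred source.
PREFERENCE = {"Connect": 1, "Eight Sleep": 2}  # anything else ranks last

def dedupe_sleep(rows):
    # rows: iterable of (start, end, stage, hours, source_name) tuples
    best = {}
    for row in set(rows):  # set() drops the exact-duplicate export rows
        start, end, stage, hours, source = row
        key = (start, end, stage)
        if key not in best or PREFERENCE.get(source, 3) < PREFERENCE.get(best[key][4], 3):
            best[key] = row
    return sorted(best.values())
```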

**Example Queries:**
```sql
-- Daily steps
SELECT DATE(start_date) as day, SUM(value) as steps
FROM records WHERE type = 'step_count'
GROUP BY day ORDER BY day DESC LIMIT 30;

-- Average heart rate by month
SELECT DATE_TRUNC('month', start_date) as month, ROUND(AVG(value), 1) as avg_hr
FROM records WHERE type = 'heart_rate'
GROUP BY month ORDER BY month;

-- Workout summary
SELECT activity_type, COUNT(*) as sessions,
       ROUND(SUM(duration)/60, 1) as total_minutes
FROM workouts GROUP BY activity_type ORDER BY sessions DESC;
```

---

### Amazon/Audible

**Database:** `zo-data/amazon/data.duckdb`

**Tables:**
- `retail_orders`, `retail_items` — Amazon purchases
- `audible_library`, `audible_listening`, `audible_purchases` — Audiobooks
- `kindle_library`, `kindle_notes` — Ebooks
- `digital_orders`, `digital_items` — Digital purchases

**Example Queries:**
```sql
-- Monthly Amazon spending
SELECT DATE_TRUNC('month', order_date) as month, SUM(total) as spent
FROM retail_orders GROUP BY month ORDER BY month DESC;

-- Audible listening by book
SELECT product_name, SUM(duration_ms)/3600000.0 as hours
FROM audible_listening GROUP BY product_name ORDER BY hours DESC LIMIT 10;

-- Most highlighted books
SELECT title, COUNT(*) as highlights
FROM kindle_notes GROUP BY title ORDER BY highlights DESC;
```

---

### Google

**Database:** `zo-data/google/data.duckdb`

**Tables:**
- `chrome_history` — Browser history
- `search_history` — Google searches
- `youtube_watches`, `youtube_subscriptions`, `youtube_playlists`
- `calendar_events`, `flights`
- `maps_activity`, `maps_saved_places`, `maps_reviews`

**Example Queries:**
```sql
-- Most visited domains
SELECT REGEXP_EXTRACT(url, 'https?://([^/]+)', 1) as domain, COUNT(*) as visits
FROM chrome_history GROUP BY domain ORDER BY visits DESC LIMIT 20;

-- YouTube by month
SELECT DATE_TRUNC('month', watched_at) as month, COUNT(*) as videos
FROM youtube_watches GROUP BY month ORDER BY month;

-- Flights this year
SELECT departure_time, airline_name, origin_airport, destination_airport
FROM flights WHERE YEAR(departure_time) = YEAR(CURRENT_DATE) ORDER BY departure_time;

-- Searches by hour
SELECT EXTRACT(HOUR FROM searched_at) as hour, COUNT(*) as searches
FROM search_history WHERE activity_type = 'search' GROUP BY hour ORDER BY hour;
```

---

### GitHub

**Database:** `zo-data/github/data.duckdb`

**Tables:**
- `repositories` — Repos you own
- `pull_requests` — PRs created
- `issues` — Issues
- `issue_comments`, `commit_comments`

**Example Queries:**
```sql
-- PRs by month
SELECT DATE_TRUNC('month', created_at) as month, COUNT(*) as prs
FROM pull_requests GROUP BY month ORDER BY month DESC;

-- Most active repos
SELECT repository, COUNT(*) as issues
FROM issues GROUP BY repository ORDER BY issues DESC;
```

---

### X/Twitter

**Database:** `zo-data/x/data.duckdb`

**Tables:**
- `tweets` — Your tweets
- `likes` — Tweets you liked
- `direct_messages` — 1:1 DMs
- `direct_messages_group` — Group DMs
- `followers`, `following`

**Example Queries:**
```sql
-- Tweets by year
SELECT EXTRACT(YEAR FROM created_at) as year, COUNT(*) as tweets
FROM tweets WHERE NOT is_retweet GROUP BY year ORDER BY year;

-- Top tweets
SELECT full_text, favorite_count, retweet_count
FROM tweets ORDER BY favorite_count DESC LIMIT 10;
```

---

### Anthropic Claude

**Database:** `zo-data/anthropic/data.duckdb`

**Tables:**
- `conversations` — Conversation metadata
- `messages` — Individual turns (human/assistant)
- `projects`, `project_docs`

**Example Queries:**
```sql
-- Conversations by month
SELECT DATE_TRUNC('month', created_at) as month, COUNT(*) as convos
FROM conversations GROUP BY month ORDER BY month;

-- Total characters by sender
SELECT sender, SUM(text_length) as total_chars
FROM messages GROUP BY sender;
```

---

### Amex

**Database:** `zo-data/amex/data.duckdb`

**Tables:**
- `transactions`

**Key Fields:**
- `date`, `description`, `amount`
- `category`, `category_group`
- `city_state`, `country`

**Example Queries:**
```sql
-- Monthly spending by category
SELECT DATE_TRUNC('month', date) as month, category_group, SUM(amount) as spent
FROM transactions WHERE amount > 0
GROUP BY month, category_group ORDER BY month DESC, spent DESC;

-- Top merchants
SELECT description, COUNT(*) as txns, SUM(amount) as total
FROM transactions WHERE amount > 0
GROUP BY description ORDER BY total DESC LIMIT 20;
```

---

### Letterboxd

**Database:** `zo-data/letterboxd/data.duckdb`

**Tables:**
- `watched` — Films logged
- `ratings` — 0.5-5.0 stars
- `reviews` — Written reviews
- `watchlist`, `lists`, `list_entries`
- `profile`

**Example Queries:**
```sql
-- Rating distribution
SELECT rating, COUNT(*) FROM ratings GROUP BY rating ORDER BY rating;

-- Films by year
SELECT EXTRACT(YEAR FROM date_logged) as year, COUNT(*) as films
FROM watched GROUP BY year ORDER BY year DESC;

-- 5-star films
SELECT film_name, year FROM ratings WHERE rating = 5.0;
```

---

### DoorDash

**Database:** `zo-data/doordash/data.duckdb`

**Tables:**
- `orders` — Line items per order
- `profile`

**Example Queries:**
```sql
-- Most ordered restaurants
SELECT store_name, COUNT(DISTINCT delivered_at) as orders, SUM(subtotal) as total
FROM orders GROUP BY store_name ORDER BY orders DESC;

-- Monthly spending
SELECT DATE_TRUNC('month', delivered_at) as month, SUM(subtotal) as spent
FROM orders GROUP BY month ORDER BY month DESC;
```

---

### Function Health

**Database:** `function-health/data.duckdb`

**Tables:**
- `lab_results` — Lab test results (one row per biomarker)

**Key Fields:**
- `category` — Health category (Blood, Heart, Liver, Kidney, Thyroid, etc.)
- `biomarker` — Test name/biomarker measured
- `raw_value` — Original reported value as string
- `numeric_value` — Numeric value for analysis (when applicable)
- `units` — Unit of measurement
- `status` — Test result status: In Range, Above Range, Below Range, Out of Range, Younger
- `sample_type` — Type of sample (blood, urine, etc.)
- `collected_at` — Date sample was collected (YYYY-MM-DD)
- `source` — Data source (e.g., Function Health version)

**Example Queries:**
```sql
-- Biomarkers out of range
SELECT category, biomarker, status, raw_value, units
FROM lab_results
WHERE status IN ('Above Range', 'Below Range', 'Out of Range')
ORDER BY category, biomarker;

-- Summary by health category
SELECT category, 
       COUNT(*) as total_tests,
       SUM(CASE WHEN status = 'In Range' THEN 1 ELSE 0 END) as healthy,
       SUM(CASE WHEN status IN ('Above Range', 'Below Range', 'Out of Range') THEN 1 ELSE 0 END) as flagged
FROM lab_results
GROUP BY category
ORDER BY flagged DESC, total_tests DESC;

-- Cardiovascular risk markers
SELECT biomarker, raw_value, units, status
FROM lab_results
WHERE category = 'Heart' 
  AND (biomarker LIKE '%cholesterol%' OR biomarker LIKE '%LDL%' 
       OR biomarker LIKE '%ApoB%' OR biomarker LIKE '%triglyceride%')
ORDER BY biomarker;
```

**Coverage:** Single test date (currently 2025-12-06), 112 biomarkers across 17 health categories

---

### 23andMe

**Database:** `zo-data/23_and_me/data.duckdb`

**Tables:**
- `snps` — Raw genotype data (~600K+ variants)
- `chromosome_stats` — Per-chromosome summary
- `genotype_distribution`

**Key Fields:**
- `rsid` — SNP identifier (rs# or i#)
- `chromosome`, `position`
- `genotype` — Two-letter call (AA, AG, etc.)
- `zygosity` — homozygous, heterozygous, no_call
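
The `zygosity` field can be derived from `genotype` directly; a sketch per the field description above (note: single-letter hemizygous calls on X/Y/MT are not handled here and would need their own branch):

```python
# Derive zygosity from a two-letter genotype call, mirroring the field
# description above. "--" or any non-base character counts as no_call.
# Hemizygous single-letter calls (X/Y/MT) are out of scope for this sketch.
def zygosity(genotype: str) -> str:
    if len(genotype) != 2 or not all(c in "ACGT" for c in genotype):
        return "no_call"
    return "homozygous" if genotype[0] == genotype[1] else "heterozygous"
```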

**Example Queries:**
```sql
-- SNPs per chromosome
SELECT chromosome, snp_count FROM chromosome_stats ORDER BY 
  CASE WHEN chromosome ~ '^\d+$' THEN CAST(chromosome AS INT) ELSE 100 END;

-- Look up specific SNP
SELECT * FROM snps WHERE rsid = 'rs1426654';
```

---

### uLogMe (Activity Tracking)

**Database:** `data/data.duckdb`

Desktop activity tracking — active windows and keystrokes.

**Tables:**
- `window_events` — Active window logs
  - `timestamp`, `app_name`, `window_title`, `browser_url`, `logical_date`
- `key_events` — Keystroke counts
  - `timestamp`, `key_count`, `logical_date`
- `notes`, `daily_blog`, `settings`

**Coverage:** December 2025 – ongoing
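
Window events are sampled at roughly 9-second intervals, so time estimates are just event counts times the interval (the arithmetic behind the `COUNT(*) * 9` queries in this section):

```python
# Time estimate from sampled window events: each event represents
# roughly 9 seconds of foreground time.
SAMPLE_INTERVAL_S = 9

def hours_from_events(event_count: int, interval_s: int = SAMPLE_INTERVAL_S) -> float:
    return event_count * interval_s / 3600.0
```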

**Example Queries:**
```sql
-- Time per app (hours, ~9 sec intervals)
SELECT app_name, COUNT(*) * 9 / 3600.0 as hours
FROM window_events GROUP BY app_name ORDER BY hours DESC;

-- Top browser URLs
SELECT browser_url, COUNT(*) as visits
FROM window_events WHERE browser_url IS NOT NULL
GROUP BY browser_url ORDER BY visits DESC LIMIT 20;

-- Keystrokes by hour
SELECT EXTRACT(HOUR FROM timestamp) as hour, SUM(key_count) as keys
FROM key_events GROUP BY hour ORDER BY hour;
```

---

## Cross-Dataset Analysis Examples

### Productivity + Music

```sql
ATTACH 'data/data.duckdb' AS ulog (READ_ONLY);
ATTACH 'zo-data/spotify/data.duckdb' AS spotify (READ_ONLY);

SELECT u.logical_date, SUM(u.key_count) as keystrokes,
       COUNT(s.track_name) as songs_played
FROM ulog.key_events u
LEFT JOIN spotify.streams s ON DATE(s.played_at) = u.logical_date
GROUP BY u.logical_date ORDER BY u.logical_date DESC;
```

### Messages + Steps

```sql
ATTACH 'dataset-imessage/data.duckdb' AS imsg (READ_ONLY);
ATTACH 'zo-data/apple_health/data.duckdb' AS health (READ_ONLY);

SELECT DATE(m.sent_at) as day,
       COUNT(*) FILTER (WHERE m.is_from_me) as msgs_sent,
       (SELECT SUM(value) FROM health.records r 
        WHERE r.type = 'step_count' AND DATE(r.start_date) = DATE(m.sent_at)) as steps
FROM imsg.messages m
GROUP BY day ORDER BY day DESC LIMIT 30;
```

---

## General Tips

1. **Read the README first**: Before querying any dataset, read its `README.md`. These contain critical business rules, data quirks, and deduplication requirements (e.g., the sleep table's duplicate rows) that will produce wildly wrong results if missed.
2. **Schema Discovery**: Check `schema.yaml` and `README.md` in each dataset directory
3. **Attach Multiple**: `ATTACH 'path' AS alias (READ_ONLY)` for cross-dataset queries
4. **Time Zones**: Most timestamps are UTC — convert for local analysis
5. **Row Counts**: `schema.yaml` includes approximate counts (0 = empty/not ingested)
6. **Read-Only**: Always use `read_only=True` or `(READ_ONLY)` to prevent accidental writes
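
For tip 4, a stdlib sketch of UTC-to-local conversion for hour-of-day analysis (the zone name is an example, not an assumption about the data):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Convert a naive UTC timestamp to a local zone before bucketing by hour.
# "America/New_York" is an example zone; substitute your own.
def to_local(ts_utc: datetime, zone: str = "America/New_York") -> datetime:
    return ts_utc.replace(tzinfo=timezone.utc).astimezone(ZoneInfo(zone))

played_at = datetime(2025, 6, 1, 2, 30)  # naive UTC, as stored
local = to_local(played_at)              # late evening of May 31, local time
```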
