PostgreSQL: Full-Text Search: Mistakes

Module: Database-Specific Features

Not creating GIN index on tsvector column

Create GIN index: CREATE INDEX USING GIN (search_vector), 100x faster queries

Without GIN index, PostgreSQL does sequential scan (checks every row). With GIN index, PostgreSQL uses index scan (checks only matching rows). Performance: Without GIN: Sequential scan, 1000ms for 100K rows. With GIN: Index scan, 10ms for 100K rows (100x faster). GIN index is inverted index (maps words to rows). Example: Search "postgresql" without GIN: Check every row (slow). With GIN: Look up "postgresql" in index, get matching rows (fast). Real-world: Medium uses GIN indexes on article search (100x faster). Setup: (1) Add generated column: ADD COLUMN search_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED. (2) Create GIN index: CREATE INDEX idx_search ON table USING GIN (search_vector). Lesson: Always create GIN index for full-text search.

Always create GIN index on tsvector columns. Essential for performance. Use generated columns for better performance than expression indexes.

Critical

Sequential scan (100x slower), poor performance on large tables

Using expression index instead of generated column for tsvector

Use generated column: ADD COLUMN search_vector tsvector GENERATED ALWAYS AS (to_tsvector(...)) STORED

Expression index: CREATE INDEX ON table USING GIN (to_tsvector('english', content)). Problem: Recalculates to_tsvector() on every query (slow). Generated column: ADD COLUMN search_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED. Benefit: Pre-computed tsvector (faster queries). Performance: Expression index: Recalculate on every query (50ms overhead). Generated column: Pre-computed (no overhead). Example: 100K rows, search query. Expression index: 60ms (10ms index + 50ms recalculation). Generated column: 10ms (no recalculation). Real-world: Shopify uses generated columns for product search (faster queries). Migration: (1) Add generated column. (2) Create GIN index on generated column. (3) Update queries to use generated column. Lesson: Use generated columns, not expression indexes.

Use generated columns for tsvector. Pre-computed = faster queries. Expression indexes recalculate on every query (slow).

High

Slower queries (recalculates tsvector on every query), wasted CPU

Using wrong language configuration for text (e.g., english for Spanish text)

Use appropriate language: to_tsvector('spanish', text) for Spanish, to_tsvector('english', text) for English

Language configuration controls stemming and stop words. Wrong language produces bad results. Example: Spanish text "Las bases de datos están corriendo". English config: to_tsvector('english', text) = Wrong stemming (bases → base, not bas), wrong stop words (las not removed). Spanish config: to_tsvector('spanish', text) = Correct stemming (bases → bas), correct stop words (las removed). Result: English config misses searches for "base" (stemmed wrong). Spanish config finds searches for "base" (stemmed correctly). Real-world: Shopify uses language-specific configs for international product search. Multi-language: Use CASE statement in generated column to apply language-specific config based on language column. Lesson: Always use appropriate language configuration.

Use appropriate language configuration. english for English, spanish for Spanish, simple for no stemming. Multi-language: Use CASE statement.

Medium

Bad search results (wrong stemming, wrong stop words), poor relevance

Not using weighted search (title and content treated equally)

Use setweight() to boost title: setweight(to_tsvector(..., title), 'A') || setweight(to_tsvector(..., content), 'B')

Without weights, title and content matches ranked equally. Problem: Content match ranked same as title match (bad UX). With weights, title matches ranked higher. Solution: setweight(to_tsvector('english', title), 'A') || setweight(to_tsvector('english', content), 'B'). Weight A (highest priority), Weight B (lower priority). Example: Search "postgresql". Article 1: "postgresql" in title (weight A). Article 2: "postgresql" in content (weight B). Without weights: Both ranked equally. With weights: Article 1 ranked higher (better UX). Real-world: Medium uses weighted search (title higher priority than content). Lesson: Always use weighted search for better ranking.

Use weighted search. Title weight A (highest), content weight B (lower). Better ranking = better UX.

Medium

Poor ranking (content matches ranked same as title matches), bad UX

Using LIKE operator for search instead of Full-Text Search

Use Full-Text Search: WHERE search_vector @@ to_tsquery(...), 100x faster than LIKE

LIKE operator: WHERE content LIKE ''%word%'. Problems: (1) Sequential scan (slow), (2) No ranking (all results equal), (3) No stemming (misses "words" when searching "word"). Full-Text Search: WHERE search_vector @@ to_tsquery('word'). Benefits: (1) GIN index scan (100x faster), (2) Ranking with ts_rank(), (3) Stemming (finds "words" when searching "word"). Performance: LIKE: Sequential scan, 1000ms for 100K rows. FTS: Index scan, 10ms for 100K rows (100x faster). Example: Search "database". LIKE: Misses "databases" (no stemming). FTS: Finds "databases" (stemming). Real-world: Medium migrated from LIKE to FTS (100x faster). Lesson: Use Full-Text Search, not LIKE.

Use Full-Text Search, not LIKE. 100x faster, ranking, stemming. LIKE only for exact substring matching.

High

Slow queries (sequential scan), no ranking, no stemming