PostgreSQL: Full-Text Search: Interview

Module: Database-Specific Features

When would you use PostgreSQL Full-Text Search vs Elasticsearch? What are the trade-offs?

PostgreSQL FTS vs Elasticsearch decision depends on scale and feature requirements. Use PostgreSQL FTS when: (1) Simple search needs (keyword search, basic ranking), (2) < 1M documents (good performance), (3) PostgreSQL-only stack (no separate service), (4) Avoid operational complexity (no Elasticsearch cluster), (5) Budget constraints (no Elasticsearch license/hosting). Use Elasticsearch when: (1) Advanced features needed (fuzzy search, synonyms, aggregations, facets), (2) > 1M documents (distributed search), (3) Complex queries (nested queries, geo-search), (4) Horizontal scaling needed (multiple nodes), (5) Real-time analytics (aggregations, dashboards). Trade-offs: PostgreSQL FTS: Simpler (just SQL), lower operational overhead, good for < 1M documents, limited features. Elasticsearch: More complex (separate service), higher operational overhead, good for > 1M documents, advanced features. Performance: PostgreSQL FTS: 10-50ms for < 1M documents. Elasticsearch: 5-20ms for > 1M documents (distributed). Real-world: Medium uses PostgreSQL FTS for article search (< 1M articles, simple needs). Amazon uses Elasticsearch for product search (> 100M products, advanced features). Lesson: PostgreSQL FTS for simple needs, Elasticsearch for advanced features or large scale.

Explain how tsvector and tsquery work. Why is GIN index essential?

tsvector and tsquery are core of PostgreSQL Full-Text Search. tsvector: Searchable document format. Converts text to normalized, stemmed tokens with positions. Example: to_tsvector('english', 'The databases are running') = 'databas':2 'run':4. Process: (1) Tokenize (split into words), (2) Normalize (lowercase), (3) Remove stop words (the, are), (4) Stem (databases → databas, running → run), (5) Store with positions. tsquery: Search query format. Supports operators: & (AND), | (OR), ! (NOT), <-> (phrase). Example: to_tsquery('english', 'postgresql & database') = 'postgresql' & 'databas'. @@ operator: Matches tsvector against tsquery. Returns true if match. GIN index: Inverted index (maps words to rows). Structure: Word → List of rows containing word. Example: 'postgresql' → [row1, row5, row10]. Why essential: Without GIN: Sequential scan (check every row), 1000ms for 100K rows. With GIN: Index scan (look up word in index, get rows), 10ms for 100K rows (100x faster). Real-world: GitLab uses GIN indexes on issue search (10x faster). Lesson: GIN index is essential for full-text search performance.

What is the difference between ts_rank() and ts_rank_cd()? When would you use each?

ts_rank() and ts_rank_cd() are ranking functions that order search results by relevance. ts_rank(): Basic ranking based on word frequency. How it works: Counts how many times query words appear in document. Higher frequency = higher rank. Good for: Short documents (tweets, product names), equal-length documents. Example: Document 1: "postgresql" appears 3 times. Document 2: "postgresql" appears 1 time. ts_rank() ranks Document 1 higher. ts_rank_cd(): Cover density ranking. How it works: Considers document length and word proximity. Penalizes long documents with sparse matches. Good for: Long documents (articles, blog posts), variable-length documents. Example: Document 1: 100 words, "postgresql" appears 3 times (sparse). Document 2: 50 words, "postgresql" appears 2 times (dense). ts_rank_cd() ranks Document 2 higher (denser matches). Formula: ts_rank_cd() = matches / document_length. Trade-offs: ts_rank(): Simple, fast, good for short documents. ts_rank_cd(): More complex, considers length, good for long documents. Real-world: Medium uses ts_rank_cd() for article search (long documents, variable length). Twitter uses ts_rank() for tweet search (short documents, equal length). Lesson: Use ts_rank() for short documents, ts_rank_cd() for long documents.

Design a full-text search system for a blog platform. Include: (1) Schema with generated column, (2) GIN index, (3) Search query with ranking and highlighting.

-- Design full-text search for blog platform

-- (1) Create articles table

CREATE TABLE articles (

article_id SERIAL PRIMARY KEY,

title VARCHAR(200) NOT NULL,

content TEXT NOT NULL,

author_id INTEGER,

category VARCHAR(50),

published_at TIMESTAMP DEFAULT NOW(),

created_at TIMESTAMP DEFAULT NOW()

);

-- (2) Add generated column for tsvector (weighted search)

ALTER TABLE articles

ADD COLUMN search_vector tsvector

GENERATED ALWAYS AS (

setweight(to_tsvector('english', COALESCE(title, '')), 'A') ||

setweight(to_tsvector('english', COALESCE(content, '')), 'B')

) STORED;

-- Weight A for title (higher priority), Weight B for content

-- (3) Create GIN index for fast queries

CREATE INDEX idx_articles_search_vector

ON articles

USING GIN (search_vector);

-- (4) Create index on published_at for filtering

CREATE INDEX idx_articles_published_at

ON articles (published_at DESC);

-- Insert sample articles

INSERT INTO articles (title, content, category) VALUES

(

'PostgreSQL Full-Text Search Tutorial',

'Learn how to implement full-text search in PostgreSQL. PostgreSQL provides powerful search capabilities with tsvector and tsquery.',

'Database'