Designing an address-matching pipeline for Indian addresses

Address matching at scale is hard—especially for Indian addresses with inconsistent formatting, transliteration, and regional variations. This post walks through how we designed a pipeline to clean, normalize, and match addresses for enterprise clients.

The problem

Clients had large datasets of addresses that needed to be validated and matched against reference data for feasibility checks and verification. Manual verification didn’t scale, and off-the-shelf tools often fell short for Indian formats.

Architecture choices

We went with a modular pipeline: ingest → normalize → match → score → output. Each stage is configurable so we could tune for different clients and use cases. The core is built in Python, with PostgreSQL for reference data and FastAPI for a thin service layer where needed.
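The staged design can be sketched as a list of composable stage functions driven by a per-client config. This is an illustrative sketch, not the production code; all names (`PipelineConfig`, `run_pipeline`, the stub `normalize` stage) are assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class PipelineConfig:
    """Per-client knobs: abbreviation maps, fuzzy thresholds, etc. (illustrative)."""
    abbreviations: dict = field(default_factory=dict)
    fuzzy_threshold: float = 0.85

# Each stage takes a batch of records plus config and returns a new batch.
Stage = Callable[[list, PipelineConfig], list]

def run_pipeline(records: list, stages: list[Stage], config: PipelineConfig) -> list:
    """Run each configurable stage in order over the record batch."""
    for stage in stages:
        records = stage(records, config)
    return records

# Example stage: a trivial normalizer stub (trim whitespace, uppercase).
def normalize(records: list, config: PipelineConfig) -> list:
    return [r.strip().upper() for r in records]

cleaned = run_pipeline(["  12 mg road, blr "], [normalize], PipelineConfig())
```

Keeping each stage a plain function with a shared signature is what makes the per-client tuning cheap: swapping or reordering stages is a config change, not a rewrite.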

Key learnings

  • Normalization first. Standardizing script, case, and common abbreviations before matching improved recall significantly.
  • Phased matching. Starting with exact/block matches and only then applying fuzzy matching reduced false positives.
  • Observability. Logging match scores and human-reviewed samples helped us iterate on rules and thresholds.
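The first two learnings can be shown together in a minimal two-phase matcher: normalize first, try an exact match on the normalized form, and fall back to fuzzy scoring only when that fails. The abbreviation map and threshold below are placeholders (in practice these are tuned per client), and the use of `difflib` is an assumption for the sketch rather than the library the pipeline actually uses.

```python
import difflib

# Tiny illustrative abbreviation map; the real rules are far larger.
ABBREV = {"rd": "road", "mg": "mahatma gandhi", "blr": "bengaluru"}

def normalize(addr: str) -> str:
    """Lowercase, strip punctuation, and expand common abbreviations."""
    tokens = addr.lower().replace(",", " ").split()
    return " ".join(ABBREV.get(t, t) for t in tokens)

def match(addr: str, reference: list[str], threshold: float = 0.85):
    """Phase 1: exact match on normalized forms. Phase 2: fuzzy fallback."""
    norm = normalize(addr)
    ref_norm = {normalize(r): r for r in reference}
    if norm in ref_norm:                      # exact/block match
        return ref_norm[norm], 1.0
    best, best_score = None, 0.0              # fuzzy fallback
    for n, original in ref_norm.items():
        score = difflib.SequenceMatcher(None, norm, n).ratio()
        if score > best_score:
            best, best_score = original, score
    return (best, best_score) if best_score >= threshold else (None, best_score)
```

Note how normalization does most of the work: "12 MG Rd, BLR" and "12 Mahatma Gandhi Road, Bengaluru" collapse to the same normalized string, so they resolve in the cheap exact phase without ever reaching the fuzzy scorer.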

You can extend this pattern to other geographies by swapping normalization rules and reference datasets.