Delhi's Digital Archives Are Drowning in Duplicate Images — The Numbers Reveal Why
Government digitisation drives across the capital have generated millions of redundant image files, and the data trail shows how bad the problem has actually become.

An estimated 40 to 60 percent of the digitised images held by Delhi's municipal and state agencies were exact or near-exact duplicates by the end of the 2025–26 financial year, according to internal assessments circulated among IT departments working under the Delhi e-Governance Society. The redundancy is not a minor housekeeping issue. It is eating storage budgets, slowing public-facing portals, and quietly undermining the city's ambitious push to digitise everything from land records in Patwari offices across East Delhi to heritage photographs held by the Archaeological Survey of India's regional office on Janpath.
The timing matters because Delhi is mid-way through a parallel expansion. Delhi Metro Rail Corporation is digitising engineering drawings and surveillance footage as Phase 4 construction accelerates across the Janakpuri West to RK Ashram Marg corridor. The Delhi government's Revenue Department has been uploading property records, mutation documents and scanned deeds onto the e-District portal since at least 2019. Every one of those workflows produces images. Without deduplication protocols baked into the ingestion pipeline, duplicates accumulate faster than any manual review team can catch them.
Storage is not cheap, even in bulk. Cloud object storage priced for Indian government procurement runs roughly ₹2 to ₹4 per gigabyte per month depending on the tier and vendor empanelled under the MeitY rate contract. A single uncompressed TIFF scan of an A3 land document — the standard the Revenue Department uses — sits at around 25 to 50 MB. Multiply that by millions of records across dozens of tehsils, add surveillance stills from Metro stations, and the monthly bill compounds sharply. When 40 percent of those files are duplicates serving no informational purpose, the wastage is arithmetically straightforward.
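The arithmetic can be sketched in a few lines. The archive size below (one million scans at 25 MB each) is a hypothetical figure chosen for illustration, not a number from any Delhi department; the ₹2 to ₹4 rate and the 40 percent duplicate share are the ones quoted above.

```python
def monthly_waste_inr(archive_gb, duplicate_share, rate_inr_per_gb):
    """Rupees spent each month storing files that are duplicates."""
    return archive_gb * duplicate_share * rate_inr_per_gb


# Hypothetical archive: 1,000,000 scans at 25 MB each (assumption for illustration).
scans = 1_000_000
mb_per_scan = 25
archive_gb = scans * mb_per_scan / 1000  # decimal gigabytes

# Duplicate share and per-GB rates as quoted in the article.
low = monthly_waste_inr(archive_gb, 0.40, 2.0)
high = monthly_waste_inr(archive_gb, 0.40, 4.0)

print(f"Archive size: {archive_gb:,.0f} GB")
print(f"Monthly spend on duplicates: Rs {low:,.0f} to Rs {high:,.0f}")
```

Under those assumed inputs, the duplicate share alone costs between ₹20,000 and ₹40,000 a month. Real archives would need their own file counts and negotiated rates plugged in.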
The Delhi State Archives on Shyam Nath Marg holds physical collections dating to the Mughal period, and its digitisation programme, running since at least 2016, had processed several lakh folios as of its last publicly available progress note. Archivists familiar with mass digitisation projects routinely cite duplication rates between 30 and 55 percent when batch scanning is done without real-time hash-checking. Those figures come from peer-reviewed library science literature and are not specific to Delhi. The Delhi State Archives has not published a specific deduplication audit figure, so the city's actual exposure on heritage collections remains unverified.
The Municipal Corporation of Delhi, which merged its three predecessor bodies in May 2022, inherited three separate document management systems, each with its own image repository. That merger created an almost guaranteed duplication event: the same property photographs, building-plan scans, and birth-certificate images existed in all three legacy databases simultaneously. MCD's IT wing has acknowledged migration challenges in budget documents tabled before the municipal council, though it has not released a precise duplicate-file count.
Duplicate image replacement is the process of identifying redundant files and substituting a single canonical version. Checksums such as MD5 catch byte-identical copies, while perceptual hashing catches near-duplicates such as a second scan of the same page. The exercise costs relatively little when set against the storage savings. Independent assessments of comparable South Asian government digitisation projects suggest a one-time deduplication exercise on a 10-terabyte archive can be completed in under 72 hours using open-source tools, with storage reclaimed typically between 30 and 45 percent. For Delhi's combined government storage estate, even a conservative 30 percent reclamation would represent tens of thousands of rupees saved monthly at current procurement rates.
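For readers curious what a checksum-based sweep involves, here is a minimal sketch using only Python's standard library. It is not the tooling any Delhi agency is known to use. The function names are hypothetical, and it finds byte-identical files only; near-duplicates would need perceptual hashing, which requires third-party image libraries and is not shown.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large scans are never loaded whole into memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root, algorithm="sha256"):
    """Group every file under root by digest; return only groups with 2+ files."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path, algorithm)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


def reclaimable_bytes(duplicate_groups):
    """Bytes freed if the first file in each group is kept as the canonical copy."""
    return sum(
        path.stat().st_size
        for paths in duplicate_groups.values()
        for path in paths[1:]
    )
```

Passing `algorithm="md5"` switches to the MD5 checksum mentioned above on systems where it is available. The sketch only reports duplicates; deciding which copy is canonical and replacing the rest is a separate, riskier step that any real project would want to log and review.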
The National Informatics Centre, which provides technical infrastructure to Delhi's state government from its Lodhi Road facility, has published national-level guidelines on data quality for government repositories, though state-level implementation timelines vary and Delhi has not announced a formal deduplication deadline.
Practically, departments that want to get ahead of this before the next budget cycle should prioritise three steps: mandate SHA-256 checksums at the point of scanning so byte-identical duplicates never enter the archive, schedule a retrospective hash-comparison sweep on existing repositories before October 2026, and publish results in annual IT audit disclosures so the Delhi Legislative Assembly's Public Accounts Committee can track progress. The data already shows the scale of the problem. The question now is whether anyone with authority over IT procurement acts on it before next year's storage invoices arrive.
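The first of those steps, a checksum at the point of scanning, could look something like the sketch below. The `IngestGate` class and its `admit` method are hypothetical names invented for illustration, and the in-memory dictionary stands in for whatever database a real ingestion pipeline would use.

```python
import hashlib


class IngestGate:
    """Rejects a scan whose SHA-256 digest matches one already admitted."""

    def __init__(self):
        # Maps digest -> name of the first (canonical) file seen with that content.
        self._seen = {}

    def admit(self, name, data):
        """Return (True, name) for new content, or (False, canonical_name) for a repeat."""
        digest = hashlib.sha256(data).hexdigest()
        canonical = self._seen.get(digest)
        if canonical is not None:
            return False, canonical
        self._seen[digest] = name
        return True, name


gate = IngestGate()
print(gate.admit("deed_001.tif", b"example scan bytes"))       # new content
print(gate.admit("deed_001_copy.tif", b"example scan bytes"))  # same bytes again
```

A gate like this stops only byte-for-byte repeats. Two separate scans of the same paper deed produce different bytes and would both be admitted, which is why the retrospective sweep remains necessary.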
Published by The Daily Delhi