At least 40 percent of images stored across Delhi's major government-linked digital portals are estimated to be duplicates or near-duplicates, according to internal audits reviewed by archivists working on civic digitisation projects in the capital. The figure, cited in assessments shared during a digital infrastructure review earlier this year, points to a sprawling, expensive and largely invisible problem inside the city's online public record systems.
The issue matters now because Delhi is midway through several large-scale digitisation drives. The Delhi Metro Rail Corporation's Phase 4 documentation project, which is cataloguing construction progress along the 65-kilometre corridor from Janakpuri West to R.K. Ashram Marg, has generated tens of thousands of site photographs since ground was broken on key sections in 2022. Heritage documentation teams working under the Archaeological Survey of India's Delhi circle have simultaneously been photographing Old Delhi structures in Shahjahanabad. Both streams of work are running into the same wall: storage systems that were built without a deduplication protocol.
Storage Bills and Server Strain
The financial cost is not trivial. Cloud storage pricing in India for government-grade infrastructure runs roughly between ₹3 and ₹7 per gigabyte per month depending on the service tier and procurement contract. A single large civic portal holding 500,000 unfiltered images — a conservative estimate for a department that has been digitising for five or more years — can carry 30 to 50 percent redundant data, translating to wasted storage expenditure running into lakhs of rupees annually per department. Across a city running dozens of such portals, the cumulative waste is substantial.
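The arithmetic behind that estimate can be sketched roughly as follows, using the image count, redundancy range and price range quoted above. The audits do not give an average file size, so the 20 MB figure here is purely an assumption (plausible for high-resolution survey photographs), and the function and parameter names are illustrative.

```python
def wasted_storage_cost_per_year(
    image_count: int,
    avg_image_mb: float,
    redundant_fraction: float,
    rupees_per_gb_month: float,
) -> float:
    """Estimate yearly spend (in rupees) on storing redundant images."""
    total_gb = image_count * avg_image_mb / 1000  # treating 1 GB as 1000 MB
    redundant_gb = total_gb * redundant_fraction
    return redundant_gb * rupees_per_gb_month * 12


# Quoted figures: 500,000 images, 30-50% redundant, Rs 3-7 per GB per month.
# ASSUMPTION: 20 MB per image; no average file size is stated in the audits.
low = wasted_storage_cost_per_year(500_000, 20, 0.30, 3)
high = wasted_storage_cost_per_year(500_000, 20, 0.50, 7)
print(f"Rs {low:,.0f} to Rs {high:,.0f} per year")
```

Under that assumption the range works out to roughly ₹1.1 lakh to ₹4.2 lakh a year for a single portal. The result scales linearly with file size, so a department storing smaller, compressed images would see a proportionally lower figure.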
The Delhi Urban Shelter Improvement Board, which manages records for resettlement colonies across areas including Rohini and Dwarka, flagged duplicate image accumulation as a strain on its document management system during a 2025 internal review. Staff uploading the same plot survey photographs from different devices, or re-uploading after failed submissions, were identified as the primary source of duplication — a human workflow problem rather than a purely technical one.
The problem is particularly acute in the heritage sector. The Intach Delhi chapter, which has been cataloguing structures in neighbourhoods like Mehrauli and Nizamuddin since the early 2000s, estimates that its photograph collection has grown to over 200,000 images, and no systematic audit has ever been conducted to identify duplicate frames shot during the same survey visit. Disk space is finite; so is the budget to expand it.
What Deduplication Actually Requires
Fixing the problem is not simply a matter of running a deletion script. Perceptual hashing, the technique that identifies near-identical images even when file names, resolution or compression differ, can only flag candidates; human verification is needed before anything is permanently removed from a public archive. Photographs of the Jama Masjid's north gate taken in 2018 and in 2023 may look identical to an algorithm but carry different evidentiary value to a conservation researcher tracking structural change over time.
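To make the idea concrete, here is a minimal sketch of one simple perceptual hash, the "average hash", written in plain Python. It is not necessarily the method any Delhi agency uses. It assumes the image has already been decoded into rows of 0-255 grayscale values (a real pipeline would use an imaging library for that step), and all function names are illustrative.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def shrink(pixels: Gray, size: int = 8) -> List[List[float]]:
    """Downscale to size x size by averaging the pixels in each cell.

    Assumes the image is at least size pixels wide and tall.
    """
    height, width = len(pixels), len(pixels[0])
    grid = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        row = []
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        grid.append(row)
    return grid


def average_hash(pixels: Gray, size: int = 8) -> int:
    """Return a size*size-bit hash: a bit is 1 where a cell is brighter than the mean."""
    cells = [v for row in shrink(pixels, size) for v in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | (v > mean)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# Synthetic 16x16 test images: a left-to-right gradient, a slightly
# brightened copy (different bytes, same scene), and a mirrored version.
original = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[min(255, v + 5) for v in row] for row in original]
mirrored = [list(reversed(row)) for row in original]

near = hamming_distance(average_hash(original), average_hash(brighter))
far = hamming_distance(average_hash(original), average_hash(mirrored))
```

The brightened copy hashes to the same value as the original even though every pixel differs, while the mirrored image differs in all 64 bits. The hash sees only visual similarity: it has no way to know that two matching frames were taken five years apart, which is why the review step matters.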
That distinction is driving a push within bodies like the National Informatics Centre, which supports government digital infrastructure across ministries, toward tiered deduplication: automated flagging of exact-hash matches for deletion, and human review queues for perceptual near-matches. The approach is standard practice in media organisations and large commercial archives globally, but its adoption inside Indian government digital workflows has been slow and uneven.
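A tiered pass of this kind might look something like the sketch below. It assumes each file arrives with a precomputed perceptual hash; the `triage` function, the file names and the distance threshold of 5 bits are all hypothetical choices for illustration, not a description of any agency's system.

```python
import hashlib
from typing import Dict, List, Tuple


def triage(
    images: Dict[str, Tuple[bytes, int]],
    max_distance: int = 5,  # ASSUMPTION: illustrative threshold, needs tuning
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split candidates into exact duplicates and near-matches for human review.

    images maps a file name to (raw file bytes, precomputed perceptual hash).
    Returns (exact_duplicates, review_queue); each entry is (file, earlier file).
    """
    exact_duplicates: List[Tuple[str, str]] = []
    review_queue: List[Tuple[str, str]] = []
    seen_digests: Dict[str, str] = {}  # SHA-256 digest -> first file with those bytes
    kept: List[Tuple[str, int]] = []   # (file name, perceptual hash) of unique files

    for name, (data, phash) in images.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_digests:
            # Tier 1: byte-for-byte identical, so it can be flagged automatically.
            exact_duplicates.append((name, seen_digests[digest]))
            continue
        seen_digests[digest] = name
        # Tier 2: visually similar but not identical, so a person decides.
        for other_name, other_hash in kept:
            if bin(phash ^ other_hash).count("1") <= max_distance:
                review_queue.append((name, other_name))
        kept.append((name, phash))
    return exact_duplicates, review_queue


# Hypothetical uploads with toy 8-bit perceptual hashes.
uploads = {
    "gate_2018.jpg": (b"scan-a", 0b11110000),
    "gate_2018_copy.jpg": (b"scan-a", 0b11110000),
    "gate_2023.jpg": (b"scan-b", 0b11110001),
    "plot_survey.jpg": (b"scan-c", 0b00001111),
}
exact, review = triage(uploads)
```

Here the re-uploaded copy lands in the automatic list, while the 2023 frame is paired with the 2018 one for a reviewer to judge. The pairwise comparison is quadratic in the number of unique images, so a production system would likely need an index over the hashes rather than a plain loop.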
For Delhi specifically, the practical next step is a citywide audit standard. Civic technology advocates working with the Delhi government's IT department have been pushing for a unified image metadata protocol, one that tags every photograph at the point of upload with location data, date, and the project it belongs to. Without that baseline, even the best deduplication software operates blind. Most departments have not yet embedded such protocols into their upload workflows. Those that have not should treat the coming financial year as the window to act, before Phase 4 Metro documentation, Yamuna riverfront redevelopment photography, and the next round of heritage surveys pile further redundant data onto an already strained system.
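What such a protocol could enforce at upload time is sketched below. No field list or schema has been published, so the `UploadMetadata` record, the validation rules, the project identifier and the 30-day gap are all assumptions made for illustration; the coordinates are approximate.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class UploadMetadata:
    """Hypothetical minimum record attached to every photograph at upload."""
    project: str
    captured_on: date
    latitude: float
    longitude: float


def validate(meta: UploadMetadata, today: Optional[date] = None) -> List[str]:
    """Return a list of problems; an empty list means the upload may proceed."""
    today = today or date.today()
    problems = []
    if not meta.project.strip():
        problems.append("project is required")
    if not -90 <= meta.latitude <= 90:
        problems.append("latitude out of range")
    if not -180 <= meta.longitude <= 180:
        problems.append("longitude out of range")
    if meta.captured_on > today:
        problems.append("capture date is in the future")
    return problems


def is_repeat_survey(a: UploadMetadata, b: UploadMetadata, min_days: int = 30) -> bool:
    """True when two frames from the same project were taken far enough apart
    that a visual match should be kept as a time series rather than merged."""
    gap = abs((a.captured_on - b.captured_on).days)
    return a.project == b.project and gap >= min_days


# Illustrative records for two visits to the same gate (project name is made up).
north_gate_2018 = UploadMetadata("shahjahanabad-heritage-survey", date(2018, 3, 1), 28.6507, 77.2334)
north_gate_2023 = UploadMetadata("shahjahanabad-heritage-survey", date(2023, 3, 1), 28.6507, 77.2334)

keep_both = is_repeat_survey(north_gate_2018, north_gate_2023)
```

With dates and project tags in place, a deduplication pass can tell a repeat survey apart from a redundant re-upload, which a hash comparison alone cannot do.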