Across Delhi's sprawling network of government servers, municipal archives and media organisations, a quiet data emergency is deepening. Duplicate image files — identical or near-identical photographs stored multiple times across different drives and cloud folders — are consuming terabytes of storage that agencies are paying for without realising the scale of the problem. Estimates from digital asset management professionals working with public-sector clients in the capital suggest that between 30 and 45 percent of stored image files in large institutional databases are redundant copies, though the figure varies sharply by organisation and no single authoritative audit has been published.
Why does this matter right now? Two forces are converging. Delhi Metro Rail Corporation is documenting Phase 4 construction, generating tens of thousands of site photographs every month across corridors such as the one running from Janakpuri West to RK Ashram Marg. Simultaneously, the Delhi government's own digitisation push, part of the broader e-District Delhi portal programme, is pulling decades of physical records into digital form. Both processes are creating enormous image libraries with minimal deduplication protocols in place. Storage costs, cloud licensing fees and the staff hours required to locate the correct version of a file are all climbing.
What the Numbers Actually Look Like
The economics are stark. A single terabyte of enterprise cloud storage on commonly used platforms costs Indian public-sector organisations roughly ₹4,000 to ₹7,000 per month depending on the contract tier and vendor. A department running a 50-terabyte image archive (not unusual for a body like the Delhi Development Authority, which documents construction across thousands of hectares) could theoretically eliminate 15 to 20 terabytes of duplicate content through a structured deduplication exercise. At those rates, that would trim monthly costs by between ₹60,000 and ₹1.4 lakh. Multiply the upper figure across a dozen major Delhi government departments and the annual savings potential comes to roughly ₹2 crore.
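The arithmetic behind those figures can be checked in a few lines. This is only a back-of-the-envelope sketch: it uses the price and duplicate ranges quoted above, and it assumes (purely for illustration) that all twelve departments hold archives of similar size. The names are invented for this example.

```python
# Estimates quoted in the article; none of these are audited figures.
DUPLICATE_TB_RANGE = (15, 20)          # terabytes judged removable from a 50 TB archive
PRICE_PER_TB_RANGE = (4_000, 7_000)    # rupees per terabyte per month
DEPARTMENTS = 12                       # "a dozen", assumed to be similar in size

def monthly_saving(duplicate_tb, price_per_tb):
    """Rupees saved per month if duplicate_tb terabytes are no longer stored."""
    return duplicate_tb * price_per_tb

low = monthly_saving(DUPLICATE_TB_RANGE[0], PRICE_PER_TB_RANGE[0])
high = monthly_saving(DUPLICATE_TB_RANGE[1], PRICE_PER_TB_RANGE[1])

# Scale to a full year across all departments (1 crore = 10,000,000 rupees).
annual_low = low * 12 * DEPARTMENTS
annual_high = high * 12 * DEPARTMENTS

print(f"Per department, per month: Rs {low:,} to Rs {high:,}")
print(f"All departments, per year: Rs {annual_low / 1e7:.2f} crore to Rs {annual_high / 1e7:.2f} crore")
```

The top of the range corresponds to the ₹1.4 lakh monthly figure. The bottom of the range is less than half of that, which is why a baseline audit matters before any savings are promised.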
The National Archives of India, headquartered near Janpath in central Delhi, has been digitising records since the early 2000s. Its photographic holdings run into millions of items. Archivists working in this space, speaking off the record, point to a structural problem: images are typically scanned and uploaded by different teams at different times, with no cross-referencing tool flagging when the same photograph has already been processed. The result is warehouse-scale duplication baked into the foundation of the archive itself.
At the Indira Gandhi National Centre for the Arts on Janpath, similar pressures apply to its digital heritage collections. Cultural institutions in Delhi generally lack dedicated digital asset management software with hash-based duplicate detection, the technology that compares file fingerprints to identify byte-identical images regardless of filename or folder location. Off-the-shelf tools capable of scanning a 10-terabyte library and flagging duplicates within hours are available for annual licence fees starting around ₹80,000, a fraction of the cost of the storage being wasted.
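For readers curious what hash-based detection involves, the core idea fits in a short script. The sketch below is not how any particular commercial product works. It is a minimal illustration using Python's standard library, with SHA-256 chosen here as the fingerprint and the function names and extension list invented for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_exact_duplicates(root, extensions=(".jpg", ".jpeg", ".png", ".tif", ".tiff")):
    """Group image files under root by content, ignoring names and folders.

    Returns a list of groups; each group holds two or more paths whose
    bytes are identical.
    """
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in extensions:
            groups[file_fingerprint(path)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Two files land in the same group only if every byte matches, so a photograph that has been resized or re-saved will not be caught by this approach.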
Practical Steps and What Comes Next
The deduplication problem is not simply about tidying up hard drives. In a city where Right to Information requests regularly ask government bodies to produce photographic evidence of completed works — road repairs in Shahdara, flood mitigation along the Yamuna floodplain, construction progress on the Barapullah elevated corridor — having multiple conflicting versions of the same image in circulation creates legal and administrative headaches. Departments have produced RTI responses citing photographs that turn out to be duplicates of older, pre-repair images, prompting fresh complaints and re-filing.
Solutions are available and in use elsewhere. Perceptual hashing algorithms, which fingerprint an image's visual content rather than its raw bytes, can catch near-duplicate images (the same photograph saved at different resolutions or with minor cropping differences) that byte-level comparison misses entirely. Delhi Police's crime scene documentation unit and the Delhi Fire Service both maintain large photographic databases where this distinction between exact and near-duplicate matters operationally.
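One of the simplest perceptual hashes, usually called an average hash, shows why resolution changes do not defeat this approach. The sketch below is an illustration only and is not drawn from any tool named in this article. It assumes the image has already been decoded into a grid of grayscale values from 0 to 255, and the function names are invented for the example.

```python
def average_hash(pixels, hash_size=8):
    """Average hash of a grayscale image given as a list of rows of 0-255 values.

    The image is reduced to hash_size x hash_size cells by block averaging,
    and each cell contributes one bit: 1 if brighter than the mean cell.
    Assumes the image is at least hash_size pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(hash_size):
        top, bottom = row * height // hash_size, (row + 1) * height // hash_size
        for col in range(hash_size):
            left, right = col * width // hash_size, (col + 1) * width // hash_size
            block = [pixels[y][x] for y in range(top, bottom) for x in range(left, right)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits

def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")
```

Two images are treated as near-duplicates when the distance between their hashes falls below a threshold. The right threshold is a tuning decision that depends on the collection, and in practice the pixel grid would come from an imaging library rather than being built by hand.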
The practical path forward for any Delhi institution serious about this issue starts with a baseline audit: how many image files exist, what percentage are duplicates, and what the current monthly storage bill is. That audit, conducted by a specialist vendor or an in-house team with the right tools, typically takes two to four weeks for a mid-sized archive. From there, automated deduplication workflows can be built into upload pipelines, preventing the problem from rebuilding itself. The data problem is solvable. The harder question is whether the administrative will exists to prioritise it before the next budget cycle makes the waste impossible to ignore.
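As a rough illustration of the upload-pipeline step, the sketch below checks each incoming file against the fingerprints of everything already accepted and keeps the running counts an audit would need. The `UploadGate` class and its methods are hypothetical names for this example. A production system would persist its index in a database rather than in memory.

```python
import hashlib

class UploadGate:
    """Hypothetical upload-time check that remembers every accepted file's SHA-256."""

    def __init__(self):
        self.seen = {}       # digest -> name of the first accepted copy
        self.accepted = 0
        self.rejected = 0

    def submit(self, name, data):
        """Return the name of the existing copy if data is a duplicate, else None."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.seen:
            self.rejected += 1
            return self.seen[digest]
        self.seen[digest] = name
        self.accepted += 1
        return None

    def duplicate_rate(self):
        """Share of submissions so far that were exact duplicates."""
        total = self.accepted + self.rejected
        return self.rejected / total if total else 0.0

gate = UploadGate()
for name, data in [("site_a.jpg", b"photo-1"), ("copy_of_a.jpg", b"photo-1"), ("site_b.jpg", b"photo-2")]:
    existing = gate.submit(name, data)
    if existing is not None:
        print(f"{name} duplicates {existing}; skipping upload")
```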