Duplicate Image Finder … sort of!

Going through my “huge” archive of digital camera images, I noticed that I had some duplicate images: the filenames differ, but the content is the same. Now, I could just download a tool from the almighty internet that would compare all my images, but that would be going too far! Why not give it a go myself?



Several approaches come to mind. I could start by cataloging all the files and comparing file sizes. If two file sizes match, I would then compare the content of the files. If the content matches too, the file gets marked as a duplicate.
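For comparison, here is a rough sketch of that size-then-content approach. It is not what I ended up using, and the names SizeThenContent, SameContent and FindDuplicates are made up for illustration. It also reads whole files into memory, which is fine for camera JPEGs but would need streaming for anything really large.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class SizeThenContent
{
	// Hypothetical helper: true when both files hold exactly the same bytes.
	public static bool SameContent( string pathA, string pathB )
	{
		byte[] a = File.ReadAllBytes( pathA );
		byte[] b = File.ReadAllBytes( pathB );
		return a.SequenceEqual( b );
	}

	// Hypothetical helper: returns (original, duplicate) pairs found under root.
	public static List<KeyValuePair<string, string>> FindDuplicates( string root )
	{
		var result = new List<KeyValuePair<string, string>>();

		// Step 1: catalog the files and group them by file size
		var bySize = Directory.GetFiles( root, "*.jpg", SearchOption.AllDirectories )
							  .GroupBy( f => new FileInfo( f ).Length );

		foreach( var group in bySize )
		{
			// Step 2: only files of equal size can have equal content,
			// so the byte-by-byte comparison stays inside each group
			var unique = new List<string>();
			foreach( string file in group )
			{
				string original = unique.FirstOrDefault( u => SameContent( u, file ) );
				if( original == null )
					unique.Add( file );
				else
					result.Add( new KeyValuePair<string, string>( original, file ) );
			}
		}
		return result;
	}
}
```

Each pair holds the first file seen with a given content as the key and the later copy as the value, so the value is the one you would rename or delete.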

I could also Google for an advanced image comparison algorithm and spend the next week trying to implement it.

I took the easy way out 🙂 First I cataloged all the image files (*.jpg), then I calculated the MD5 hash of every file. Before all that, I had created an SQL Compact Edition 3.5 database containing two fields: filename and MD5 hash. Only one index was created: a unique index on the MD5 hash.



So every time I tried to insert a record whose MD5 hash already existed in the database, I got an exception. In the exception handling code, I renamed the file to “DUP_” + the original filename. Naturally I could have deleted it on the spot, but I wanted to make sure the file really was a dup 🙂
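The insert-or-reject idea does not really depend on the database. As a minimal sketch, assuming an in-memory dictionary stands in for the SQL CE table and its unique index (the HashCatalog class and its methods are hypothetical, and this is not the database code I actually used), a rejected insert plays the role of the exception:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the table with a unique MD5 index.
class HashCatalog
{
	// key: MD5 hex string, value: first file seen with that hash
	private readonly Dictionary<string, string> firstFileByHash =
		new Dictionary<string, string>( StringComparer.OrdinalIgnoreCase );

	// Returns false when the hash is already present, which is the
	// same situation in which the unique index rejects the insert.
	public bool TryInsert( string filename, string md5 )
	{
		if( firstFileByHash.ContainsKey( md5 ) )
			return false;

		firstFileByHash.Add( md5, filename );
		return true;
	}

	// Returns the file that first claimed this hash, or null if unknown.
	public string OriginalFor( string md5 )
	{
		string filename;
		return firstFileByHash.TryGetValue( md5, out filename ) ? filename : null;
	}
}
```

Unlike the database, this catalog is gone when the program exits, so it only helps within a single run.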

Anyway, code speaks louder than words:

			// Needs: using System; using System.IO;
			//        using System.Security.Cryptography; using System.Text;

			// Create the default MD5 implementation
			MD5 md5Hasher = MD5.Create();

			// Get all files, including subdirectories (recursively)
			string[] files = Directory.GetFiles( @"path_to_imagefiles_here/",
												  "*.jpg",
												  SearchOption.AllDirectories );

			// loop through all files
			foreach( string file in files )
			{
				try
				{
					byte[] md5Data;

					// compute the MD5 hash straight from the file stream;
					// the using block closes the file even if hashing throws
					using( FileStream fs = new FileStream( file, FileMode.Open, FileAccess.Read ) )
					{
						md5Data = md5Hasher.ComputeHash( fs );
					}

					// create the MD5 human readable string
					StringBuilder sBuilder = new StringBuilder();

					// Loop through each byte of the hashed data 
					// and format each one as a hexadecimal string.
					for( int i = 0; i < md5Data.Length; i++ )
					{
						sBuilder.Append( md5Data[i].ToString( "x2" ) );
					}

					// insertRow returns false when the hash is already in the database
					if( !insertRow( file, sBuilder.ToString() ) )
					{
						// rename the file
						FileInfo fi = new FileInfo( file );
						File.Move( file, Path.Combine( fi.DirectoryName, "DUP_" + fi.Name ) );
					}
				}
				catch( IOException ex )
				{
					// skip files that cannot be read or renamed and carry on
					Console.WriteLine( "Skipping {0}: {1}", file, ex.Message );
				}
			}

Granted, this might not be the best way but it only took about 10 minutes to write… and it works for me 🙂

