Question For Coders - Duplicate File Finder

c# duplicate file finder

8 replies to this topic

#1
yahoo_b0rn_

  • Corporal


    Posts:
    33
    Joined:
    14-September 12
    Reputation:
    16 / 0 / 0
  • Gender:Unknown
  • Country:Country Flag

What algorithm would you use to find duplicate files in a drive or directory?

I tried several well-known programs that do this job, but their results differ from one another, even under the same test conditions!

What would be the fastest and most accurate algorithm?



#2
Dermot

LINQ and Directory.GetFiles?

#3
yahoo_b0rn_

Is LINQ fast enough? Also, how would you implement it? Is it a matter of getting all the files and comparing each one against all the others, or of hashing?
What's the algorithm?

#4
Dermot

OK, do you want to just list all the duplicates, or remove them?

For example, this LINQ query finds duplicates by hash and deletes them:

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

namespace DupeFinder
{
    internal class Program
    {
        // Hash a file's contents. The stream and hasher are disposed here so
        // the file is no longer open when File.Delete runs later.
        private static string HashFile(string path)
        {
            using (var sha1 = SHA1.Create())
            using (var stream = File.OpenRead(path))
            {
                // Hex-encode the digest; decoding raw hash bytes as UTF-8 is lossy.
                return BitConverter.ToString(sha1.ComputeHash(stream));
            }
        }

        private static void Main(string[] args)
        {
            Directory.GetFiles(@"d:\icons", "*.ico")
                .GroupBy(f => HashFile(f))
                .SelectMany(g => g.Skip(1)) // keep the first file of each group
                .ToList()
                .ForEach(File.Delete);

            Console.ReadKey();
        }
    }
}


#5
yahoo_b0rn_

Hashing doesn't seem to be the fastest way, I'm afraid.

#6
Dermot

For deleting or finding?

#7
yahoo_b0rn_


Dermot wrote: "For deleting or finding?"


To find the dupes, not to delete them.

What I did was:
1. Get all files
2. Sort by file size using LINQ
3. Compare files by size
4. If the sizes match, read the file bytes and compare them

It's faster than hashing. Are there any faster methods?
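
The four steps above could be sketched roughly as follows. This is not the poster's actual code: the class name SizeThenBytes, the method names, and the 64 KB buffer size are assumptions made for illustration, and grouping by size stands in for the sort-then-compare step. A likely reason this can beat hashing is that a byte comparison stops at the first difference, while a hash has to read every file to the end.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class SizeThenBytes
{
    // Fill the buffer as far as possible; Stream.Read may return fewer bytes than asked for.
    private static int ReadFull(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break; // end of stream
            total += n;
        }
        return total;
    }

    // Step 4: compare two files chunk by chunk, stopping at the first difference.
    public static bool SameContent(string a, string b)
    {
        const int BufferSize = 64 * 1024; // assumed chunk size
        using (var sa = File.OpenRead(a))
        using (var sb = File.OpenRead(b))
        {
            if (sa.Length != sb.Length) return false;
            var ba = new byte[BufferSize];
            var bb = new byte[BufferSize];
            while (true)
            {
                int na = ReadFull(sa, ba);
                int nb = ReadFull(sb, bb);
                if (na != nb) return false;
                if (na == 0) return true; // both files ended with no difference
                for (int i = 0; i < na; i++)
                    if (ba[i] != bb[i]) return false;
            }
        }
    }

    // Steps 1-3: collect files, group them by size, and only compare within a group.
    public static List<List<string>> FindDuplicates(string root)
    {
        var result = new List<List<string>>();
        var bySize = Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .GroupBy(f => new FileInfo(f).Length)
            .Where(g => g.Count() > 1); // a unique size cannot have a duplicate

        foreach (var sizeGroup in bySize)
        {
            // Partition the same-size files into sets with identical content.
            var sets = new List<List<string>>();
            foreach (var file in sizeGroup)
            {
                var match = sets.FirstOrDefault(s => SameContent(s[0], file));
                if (match != null) match.Add(file);
                else sets.Add(new List<string> { file });
            }
            result.AddRange(sets.Where(s => s.Count > 1));
        }
        return result;
    }
}
```

Calling SizeThenBytes.FindDuplicates(@"d:\icons") would return one list per set of identical files. The sketch does not handle unreadable files or folders, and when many files share the same size the pairwise comparisons can add up.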

#8
Dermot

I doubt there is, from what I can see. That seems like it would be the fastest way.

#9
jimmyha


  • Banned


    Posts:
    2
    Joined:
    19-September 12
    Reputation:
    0 / 0 / 0
  • Country:Country Flag
nice share...............