Ken Salter
A Duplicate File Finder in #Rust

I set up a new storage system for my home network. I attached a 10 TB drive and proceeded to copy over all my photos: from my own machines, from my spouse, my parents... any source I could find.

The end result was that I had quite a few duplicate photos, wasting a lot of space.

There is plenty of duplicate-file-finding software available as a free download. But since I'm trying to learn #rust, I decided to build one myself.

Thus was born dupefindr

I included quite a few options: the ability to move, copy, or delete duplicate files; wildcard file matching; multi-threading for performance; interactive selection of which files to keep (or keeping the newest or oldest automatically); and more!

```
Usage: dupefindr [OPTIONS] <COMMAND>

Commands:
  find    Find duplicate files
  move    Move duplicate files to a new location
  copy    Copy duplicate files to a new location
  delete  Delete duplicate files
  help    Print this message or the help of the given subcommand(s)

Options:
  -p, --path <PATH>
          The directory to search for duplicates in [default: .]
  -w, --wildcard <WILDCARD>
          Wildcard pattern to search for. Example: *.txt [default: *]
      --exclusion-wildcard <EXCLUSION_WILDCARD>
          Wildcard pattern to exclude. Example: *.txt [default: ]
  -r, --recursive
          Recursively search for duplicates
      --debug
          Display debug information
  -0, --include-empty-files
          Include empty files
      --dry-run
          Dry run the program. This will not delete or modify any files
  -H, --include-hidden-files
          Include hidden files
  -q, --quiet
          Hide progress indicators
  -v, --verbose
          Display verbose output
  -m, --max-threads <MAX_THREADS>
          Max threads to use. Example: 4. If set to 0, uses the number of CPUs [default: 0]
      --create-report
          Create a report
      --report-path <REPORT_PATH>
          Path of the report. Defaults to the folder where dupefindr was run [default: ./dupefindr-report.csv]
  -h, --help
          Print help
  -V, --version
          Print version
```

How does it work? First, it collects all the files in the path you specify, recursively traversing subfolders if configured. After all files have been collected, it generates an MD5 hash for each one: a fingerprint derived from the file's contents, so two files with identical contents produce the same hash. Each file's hash is compared against the hashes already collected, and a match means a duplicate has been found! Once this scan is complete, the program knows all the files that are duplicates, and proceeds to process them by either copying, moving, or deleting them. Which copy it keeps is based on newest, oldest, or interactive mode (i.e. the user chooses from a list).
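The hash-and-group idea above can be sketched in a few lines of Rust. This is not dupefindr's actual code: dupefindr uses MD5 (an external crate), so as a hypothetical stand-in this sketch hashes file contents with the standard library's `DefaultHasher` to stay dependency-free, and it skips recursion, wildcards, and threading.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::PathBuf;

// Hash a file's contents. dupefindr uses MD5; std's DefaultHasher
// stands in here so the sketch compiles without external crates.
fn hash_file(path: &PathBuf) -> io::Result<u64> {
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    Ok(hasher.finish())
}

// Group the files in `dir` by content hash; any group with more
// than one entry is a set of duplicates.
fn find_duplicates(dir: &str) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut groups: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() {
            groups.entry(hash_file(&path)?).or_default().push(path);
        }
    }
    Ok(groups.into_values().filter(|g| g.len() > 1).collect())
}

fn main() -> io::Result<()> {
    for group in find_duplicates(".")? {
        println!("duplicates: {:?}", group);
    }
    Ok(())
}
```

Grouping by hash into a `HashMap` (rather than comparing every pair of files) keeps the scan roughly linear in the number of files.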

I suggest running it first with the --dry-run command line argument. This runs the program without making any changes to your file system, so you can see what it would do to your files. And, as always, make sure you back up your data!

You can find the project here

Since I'm quite new to Rust, I expect there are many improvements that could be made to the code. Please feel free to create pull requests with your changes and additions! And be kind, I'm still learning :-)
