I set up a new storage system for my home network. I attached a 10TB drive and proceeded to copy all my photos from every source I had. I also grabbed photos from my spouse, my parents... any source I could find.
The end result was that I had quite a few duplicate photos, resulting in a lot of wasted space.
There is plenty of duplicate-file-finding software available as a free download, but since I'm trying to learn #rust, I decided to build one myself.
Thus was born dupefindr
I included quite a few options: the ability to move, copy, or delete duplicate files; wildcard file matching; multi-threading for performance; interactive selection of which files to keep (as well as keeping the newest or oldest automatically); and more!
Usage: dupefindr [OPTIONS] <COMMAND>

Commands:
  find    Find duplicate files
  move    Move duplicate files to a new location
  copy    Copy duplicate files to a new location
  delete  Delete duplicate files
  help    Print this message or the help of the given subcommand(s)

Options:
  -p, --path <PATH>
          The directory to search for duplicates in [default: .]
  -w, --wildcard <WILDCARD>
          Wildcard pattern to search for. Example: *.txt [default: *]
      --exclusion-wildcard <EXCLUSION_WILDCARD>
          Wildcard pattern to exclude. Example: *.txt [default: ]
  -r, --recursive
          Recursively search for duplicates
      --debug
          Display debug information
  -0, --include-empty-files
          Include empty files
      --dry-run
          Dry run the program. This will not delete or modify any files
  -H, --include-hidden-files
          Include hidden files
  -q, --quiet
          Hide progress indicators
  -v, --verbose
          Display verbose output
  -m, --max-threads <MAX_THREADS>
          Max threads to use. Example: 4. If set to 0, the number of CPUs is used [default: 0]
      --create-report
          Create a report
      --report-path <REPORT_PATH>
          Path of the report. Defaults to the folder where dupefindr was run [default: ./dupefindr-report.csv]
  -h, --help
          Print help
  -V, --version
          Print version
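For the curious, here is a rough sketch of how a command line shaped like this can be declared with clap's derive API. The struct, field names, and attributes below are my own illustrative guesses, not necessarily what dupefindr actually uses:

```rust
use clap::{Parser, Subcommand};

/// Illustrative sketch of a dupefindr-like CLI (not the project's real definition).
#[derive(Parser)]
#[command(version, about = "Find duplicate files")]
struct Cli {
    /// The directory to search for duplicates in
    #[arg(short, long, default_value = ".")]
    path: String,

    /// Wildcard pattern to search for. Example: *.txt
    #[arg(short, long, default_value = "*")]
    wildcard: String,

    /// Recursively search for duplicates
    #[arg(short, long)]
    recursive: bool,

    /// Dry run the program. This will not delete or modify any files
    #[arg(long)]
    dry_run: bool,

    #[command(subcommand)]
    command: Commands,
}

#[derive(Subcommand)]
enum Commands {
    /// Find duplicate files
    Find,
    /// Move duplicate files to a new location
    Move,
    /// Copy duplicate files to a new location
    Copy,
    /// Delete duplicate files
    Delete,
}

fn main() {
    let _cli = Cli::parse();
    // Dispatch on the chosen subcommand here.
}
```

With the derive API, clap generates help text much like the output shown above directly from the doc comments and attributes.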
How does it work? First, it collects all the files in the path you specify, recursively traversing sub-folders if configured. After all the files have been collected, it generates a hash for each file using MD5. This hash is a fingerprint of the file's contents: two files with identical contents produce the same hash. Each file's hash is compared against the hashes collected so far, and if it is already in the collection, a duplicate has been found! Once this scan is complete, the program knows all the files that are duplicates and proceeds to process them by either copying, moving, or deleting the duplicates. Which file it chooses to keep is based on either newest, oldest, or interactive mode (i.e. the user chooses from a list).
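To illustrate the core idea, here is a minimal sketch of grouping files by their MD5 digest. This is not dupefindr's actual code: the walkdir dependency, the helper name, and reading each whole file into memory (a real tool would likely hash in chunks and skip empty or hidden files per the options above) are all my assumptions.

```rust
use std::collections::HashMap;
use std::fs;
use std::path::PathBuf;

use walkdir::WalkDir; // assumed dependency for recursive traversal

/// Group files under `root` by the MD5 digest of their contents,
/// keeping only the groups that contain more than one file.
fn find_duplicates(root: &str) -> std::io::Result<HashMap<String, Vec<PathBuf>>> {
    let mut groups: HashMap<String, Vec<PathBuf>> = HashMap::new();

    for entry in WalkDir::new(root).into_iter().filter_map(Result::ok) {
        if !entry.file_type().is_file() {
            continue;
        }
        // Files with identical bytes produce identical digests.
        let bytes = fs::read(entry.path())?;
        let digest = format!("{:x}", md5::compute(&bytes));
        groups.entry(digest).or_default().push(entry.path().to_path_buf());
    }

    // Any hash seen more than once marks a set of duplicates.
    groups.retain(|_, files| files.len() > 1);
    Ok(groups)
}
```

dupefindr layers wildcard filtering, multi-threading, progress reporting, and the copy/move/delete actions on top of this basic grouping step.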
I suggest running it first with the --dry-run command line argument. This will run the program but not make any changes to your file system. You will be able to see what it would do to your files. And, as always, make sure you back up your data!
You can find the project here.
Since I'm quite new to Rust, I expect there are many improvements that can be made to the code. Please feel free to create Pull Requests with your changes/additions! And be kind, I'm still learning :-)