Duplication selection
What is a duplicate file?
This is an important question, and CnW FM offers several answers. The simplest definition is two files with the same name and the same length. Different versions of a file may need to be tracked, so the length may not be critical. The date may or may not be important: it is possible to open a document and save it without making any changes, which gives two identical documents with different dates.
The second part of this option is the type of file to test. Testing all files produces many hits, while testing specific types helps most users work with their own data rather than Windows operating system files.
The ultimate test of whether two files are identical is to take a digital signature (hash) of each file, e.g. MD5. This will reveal an identical file even when it has a very different file name. Forensic investigators often use this to find potentially hidden or renamed files. It is a very safe comparison, but because every file has to be read in full, it is slow.
For CnWFM there are 4 basic options, as shown in the box above. These are described below.
Match on file names and size only.
- This is the quickest match. It assumes that if the file name is the same and the size is the same, then the file is a duplicate. In many cases this is true, but a file can have its contents changed and still remain the same size. Such files will be wrongly reported as duplicates.
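As an illustration only, the name-and-size match can be sketched in Python as below. This is not CnW FM's own code. The function name is hypothetical, and treating file names as case-insensitive is an assumption based on Windows conventions.

```python
import os
from collections import defaultdict


def group_by_name_and_size(paths):
    """Group file paths by (lower-cased file name, size in bytes)."""
    groups = defaultdict(list)
    for path in paths:
        # Assumption: names are compared without regard to case.
        key = (os.path.basename(path).lower(), os.path.getsize(path))
        groups[key].append(path)
    # Only keys shared by two or more files are candidate duplicates.
    return {key: found for key, found in groups.items() if len(found) > 1}
```

No file contents are read, which is why this match is quick, and also why two files of equal name and size but different contents end up in the same group.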
Match on names, file size and contents
- This option requires two passes. The first is as above, matching on file name and size only. The second takes all files deemed a match and runs a CRC32 checksum on them to discover whether they have been changed in any way. CRC32 is used rather than MD5 because it is much faster and requires less memory to store the results. The chance of a false match is about 1 in 4 billion (CRC32 has 2^32 possible values), which is acceptable for this type of application. To save time, by default only the first 1MB of each file is tested.
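The second pass might look something like the Python sketch below. It assumes the caller already holds a list of candidate paths that share a name and size. The function names, the chunk size and the exact byte value used for 1MB are assumptions for illustration, not details taken from CnW FM.

```python
import zlib
from collections import defaultdict

DEFAULT_MATCH_LENGTH = 1024 * 1024  # assumed meaning of the default "1MB"


def crc32_of_prefix(path, match_length=DEFAULT_MATCH_LENGTH, chunk_size=64 * 1024):
    """Return the CRC32 of the first match_length bytes of a file."""
    crc = 0
    remaining = match_length
    with open(path, "rb") as handle:
        while remaining > 0:
            chunk = handle.read(min(chunk_size, remaining))
            if not chunk:  # file is shorter than match_length
                break
            crc = zlib.crc32(chunk, crc)
            remaining -= len(chunk)
    return crc


def confirm_by_crc32(candidates, match_length=DEFAULT_MATCH_LENGTH):
    """Split one name-and-size group into sets whose prefixes share a CRC32."""
    by_crc = defaultdict(list)
    for path in candidates:
        by_crc[crc32_of_prefix(path, match_length)].append(path)
    return [found for found in by_crc.values() if len(found) > 1]
```

Only a 32-bit integer is kept per file, which is the memory saving mentioned above. Files that differ only beyond the match length would still be grouped together.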
Match on contents MD5
- This mode is the same as the CRC32 mode above, but uses an MD5 hash and can optionally hash the complete file rather than just the first 1MB. It will be slow on large disk drives, but very accurate.
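A minimal Python sketch of MD5 matching with an optional length limit follows. The names are hypothetical, and the sketch simply hashes whatever paths it is given: whether CnW FM first filters on name and size in this mode is not stated here.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, match_length=None, chunk_size=64 * 1024):
    """Return the MD5 hex digest of a file; match_length=None hashes all of it."""
    digest = hashlib.md5()
    remaining = match_length
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = handle.read(size)
            if not chunk:  # end of file
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def group_by_md5(paths, match_length=None):
    """Group paths whose (possibly truncated) contents share an MD5 digest."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path, match_length)].append(path)
    return [found for found in groups.values() if len(found) > 1]
```

Reading in chunks keeps memory use flat even for very large files. The time cost comes from having to read every byte when the full file is hashed.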
Match over all drives
The final option is matching that spans all the drives that have been examined. In this mode, select the drives to examine, and the display will show any duplicate files found across those drives. A major use of this feature is to check that critical files are stored on more than one drive, which is essential for data security.
Match Length
The match length is the length of data that is compared to see if the file contents are the same. The comparison uses either CRC32 or MD5. The longer the compare, the more accurate the process, but the slower it runs. For a forensically secure compare, the full-length MD5 should be used, but for domestic and most general users, a 100KB compare will be extremely accurate. Once the program has been run in this mode, all duplicates are shown automatically upon completion.
File types
Searching for duplicates across all drives will show many matches that are not necessarily relevant. A better solution is usually to select the type of file to look for. CnW-FM allows this in several ways. There are simple selections for photo, audio, video and office-type files. There is also the option to enter a number of user-defined extensions.
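Filtering by file type before matching could be sketched in Python as follows. The category table and its extensions are hypothetical examples, not the lists that CnW-FM actually uses, and the function names are likewise assumptions.

```python
import os

# Hypothetical category table; the extensions are examples only.
FILE_TYPE_GROUPS = {
    "photo": {".jpg", ".jpeg", ".png", ".tif"},
    "audio": {".mp3", ".wav", ".flac"},
    "video": {".mp4", ".avi", ".mov"},
    "office": {".doc", ".docx", ".xls", ".xlsx", ".pdf"},
}


def wanted_extensions(categories=(), user_extensions=()):
    """Combine preset categories with user-entered extensions."""
    wanted = set()
    for name in categories:
        wanted |= FILE_TYPE_GROUPS[name]
    for ext in user_extensions:
        ext = ext.strip().lower()
        if ext:
            # Accept both "raw" and ".raw" as user input.
            wanted.add(ext if ext.startswith(".") else "." + ext)
    return wanted


def filter_by_type(paths, wanted):
    """Keep only paths whose extension (case-insensitive) is wanted."""
    return [p for p in paths if os.path.splitext(p)[1].lower() in wanted]
```

For example, `filter_by_type(paths, wanted_extensions(["photo"], ["raw"]))` would pass only photo files and `.raw` files on to the duplicate search, leaving operating system files out of the results.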
Another option is to show only files that are duplicated across drives. This is useful for checking that critical or valuable files are backed up on multiple devices.
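One way to picture the cross-drive filter is the short Python sketch below. It assumes duplicate groups are already available as lists of Windows-style paths and that the drive is identified by the drive letter or share prefix. Both the function names and that representation are assumptions for illustration.

```python
import ntpath  # Windows path rules, importable on any platform


def spans_multiple_drives(paths):
    """True if the paths sit on more than one drive (e.g. C: and E:)."""
    drives = {ntpath.splitdrive(p)[0].upper() for p in paths}
    return len(drives) > 1


def duplicates_across_drives(groups):
    """Keep only duplicate groups that have copies on different drives."""
    return [group for group in groups if spans_multiple_drives(group)]


groups = [
    [r"C:\docs\tax.xls", r"E:\backup\tax.xls"],  # backed up on a second drive
    [r"C:\a\photo.jpg", r"C:\b\photo.jpg"],      # both copies on one drive
]
backed_up = duplicates_across_drives(groups)
```

Here `backed_up` holds only the first group. Inverting the test would instead list files whose copies all sit on a single drive, which are the ones at risk if that drive fails.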