I used to work in a situation where I had a laptop and a desktop computer. Initially I was frustrated by the fact that the file I wanted was always on the other computer. I then discovered the rsync program and was able to synchronise my entire working directory between both computers and a backup disk. Initially all was well: I had access to my data on both computers and a backup for the inevitable disaster recovery. I chose a synchronisation scheme that only added to the target directory. This meant that I could work in different subdirectories on the two computers at the same time, and synchronisation in each direction would then bring the two computers up to date without worrying about losing either set of work. I then noticed that my working directory was steadily increasing in size. There were several reasons for this. Firstly, the rsync program only looks at file modification time when deciding that a file has changed. Secondly, I was working on large data files that I would keep on disk in compressed form unless I was actually working on them. Thirdly, I found that moving a directory was a nightmare; I had to remember to manually move the directory on each computer before synchronisation.
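For reference, an add-only synchronisation of this kind can be set up with rsync alone. This is only a sketch using rsync's standard options; adjust the paths and host to your situation:

rsync -a -u work-dir/ other-host:work-dir/

Here -a preserves file attributes, -u skips files that are newer on the receiving side, and the absence of --delete means files are only ever added to the target, which is also why stale duplicates accumulate.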
Thus the desync program was conceived to clean up directories that were being synchronised. Initially this developed as a couple of scripts to look for file duplicates and compressed/uncompressed file duplicates. However, when I came to consider a script to look for directory duplicates I decided a program was required. This program would attempt to address the three identified causes of duplication.
The desync program divides the directories a program like rsync works on into two broad types. A peer type directory holds files that are being worked on directly by a user and a store type holds a static copy. The desync program performs different actions depending on the type of the directory. For example you would want to keep compressed and uncompressed versions of a file you are currently working on in a peer directory but only want the compressed version in a store directory.
Perform directory clean-up operations.
desync [-no-archdup] [-no-compdup] [-no-dirdup] [-no-filedup] (-peer|-store) <directory> [-archive-suffix <arg>] [-backup-suffix <arg>] [-compress-suffix <arg>] [-duplicate-cutoff <arg>]
For a working directory the simplest command would be:
desync -peer <directory>
and for a back-up directory the simplest command would be:
desync -store <directory>
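The stage-selection options shown in the synopsis can be combined with either mode. For example, to clean a working directory while skipping the duplicated-directory stage you might use:

desync -no-dirdup -peer <directory>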
Get help text.
desync [-help] or [-h|-?]
Get current configuration options.
desync [-conf]
Note that the -help option displays the program's default values.
When the program is installed an example directory archive should be included. This will be located in the data directory for the program (/usr/share/desync or /usr/local/share/desync). To try the example, extract it using a command like:
tar jxpf /usr/local/share/desync/test2.bz2
or
tar jxpf /usr/local/share/desync/test3.bz2
This should create a directory called test-dir2 or test-dir3 respectively.
Run the desync program on the example using a command like:
desync -peer test-dir2
or
desync -peer test-dir3
The program performs its search for each type of duplication in stages.
The first stage looks for files that have identical backups. The program prints the following report:
----------------------------------------------------------------------
Generated report: Remove duplicated backups
----------------------------------------------------------------------
This report analyses a directory for pairs of files that appear to be
an original and backup. If these pairs have identical content then
the report marks the backup file for deletion.
----------------------------------------------------------------------
<1> [Trash] = Move to trash [test-dir2/TAGS.bak].
<2> [Trash] = Move to trash [test-dir2/temp/test_demangle/TAGS.bak.~1~].
<3> [Trash] = Move to trash [test-dir2/test_demangle/TAGS.bak.~2~].
<4> [Trash] = Move to trash [test-dir2/TAGS~].
<5> [Trash] = Move to trash [test-dir2/temp/test_demangle/TAGS~.~1~].
<6> [Trash] = Move to trash [test-dir2/test_demangle/TAGS~.~2~].
** Report has actions **
Options are:
(1) do actions
(2) edit report (remove some actions)
(3) restart
(4) ignore and continue to next report
Choice -
The top of this report has a title and some information about the current stage. It then lists six actions that will clean up the directory. You have four options: (1) will perform all the actions, (2) will allow you to remove some of the actions, (3) will restart the stage, and (4) will ignore the actions and go on to the next stage. If you choose to edit the report (option (2) above) you get a prompt like:
Type an index and 'Enter' to delete an action, 'Enter' on a blank line to finish.
Delete entry -
You enter numbers as listed in the report. Numbers outside the listed indices are silently ignored. 'Enter' on a blank line gives one of two prompts. If some actions remain you get something like the following:
Number of actions remaining = 6
** Report has actions **
Options are:
(1) do actions
(2) restart
(3) ignore and continue to next report
Choice -
If no actions remain you get the following:
Number of actions remaining = 0
** Report now empty **
Options are:
(1) restart
(2) ignore and continue to next report
Choice -
After accepting the actions you get a report on the success of the stage. The completion report for the first stage looks something like the following. Other stages would give similar reports.
----------------------------------------------------------------------
Completed report: Remove duplicated backups
----------------------------------------------------------------------
----------------------------------------------------------------------
<1> [Trash] = Move to trash [test-dir2/TAGS.bak].
<2> [Trash] = Move to trash [test-dir2/temp/test_demangle/TAGS.bak.~1~].
<3> [Trash] = Move to trash [test-dir2/test_demangle/TAGS.bak.~2~].
<4> [Trash] = Move to trash [test-dir2/TAGS~].
<5> [Trash] = Move to trash [test-dir2/temp/test_demangle/TAGS~.~1~].
<6> [Trash] = Move to trash [test-dir2/test_demangle/TAGS~.~2~].
** Report actions completed successfully **
Options are:
(1) continue
(2) undo and continue or restart
Choice -
The two options allow you to: (1) continue to the next stage, or (2) undo the actions and then choose whether to restart or continue to the next stage.
All current stages are undoable. Choosing to undo a set of actions leads to the following choices if the undo was successful. In future, where an undo operation is not possible, you will get a similar message but the only option will be to continue.
** Report actions undone successfully **
Options are:
(1) regenerate report
(2) continue to next report
Choice -
The second stage looks for compressed and uncompressed pairs of files. The program prints the following report:
----------------------------------------------------------------------
Generated report: Remove uncompressed duplicates
----------------------------------------------------------------------
This report analyses a directory for pairs of files that appear to be
compressed and uncompressed versions. If these pairs have identical
content then the report marks the uncompressed file for deletion.
----------------------------------------------------------------------
<1> [Trash] = Move to trash [test-dir2/gztags.gz].
<2> [Trash] = Move to trash [test-dir2/temp/test_demangle/gztags.gz.~1~].
<3> [Trash] = Move to trash [test-dir2/test_demangle/gztags.gz.~2~].
<4> [Trash] = Move to trash [test-dir2/bztags.bz2].
<5> [Trash] = Move to trash [test-dir2/temp/test_demangle/bztags.bz2.~1~].
<6> [Trash] = Move to trash [test-dir2/test_demangle/bztags.bz2.~2~].
** Report has actions **
Options are:
(1) do actions
(2) edit report (remove some actions)
(3) restart
(4) ignore and continue to next report
Choice -
Again the top of this report gives a title and some information about the current stage. It then lists six actions that will clean up the directory. You have the same four options as in the previous stage.
The third stage looks for pairs of archive files with the same name. The reports it prints are similar to those of the previous two stages.
The fourth stage looks for duplicated directories. The program prints the following report:
----------------------------------------------------------------------
Generated report: Merge duplicated directories.
----------------------------------------------------------------------
This report analyses a directory for duplicated branches. Once found
the report attempts to merge or create symbolic links between the
duplicated branches. When created, the symbolic links will be
relative to the store directory selected.
----------------------------------------------------------------------
<1> [program-1.1]
<2> [program-1.2]
** Report has actions **
Options are:
(1) merge directory <2> into <1> (note <1> appears more recent)
(2) link directory <2> to <1>
(3) link directory <1> to <2>
(4) ignore these directories and search for next pair
(5) ignore report and continue to next report
Choice -
This gives the directory names of the two duplicated directories. With this stage you have three main choices: (1) perform the actions to merge these duplicated directories, (2) create symbolic links in directory <2> for all files that have identical copies in <1>, or (3) create symbolic links in directory <1> for all files that have identical copies in <2>. The other choices are: (4) ignore this duplicated directory and attempt to find the next duplicated directory, and (5) ignore this duplicated directory and any remaining duplicated directories and go to the next stage. With this version of the program there is no next stage, so choosing (5) will end the program.
Note that the file names given in reports may not correspond directly to files in the original directory: extra suffixes may have been added to ensure that each file has a unique name in the trash directory.
For duplicated directories the program attempts to determine the correct location by determining which directory is more recent. It then merges the two directories together. The merge ensures that the most recent copy of each file ends up in the kept directory. Any files that exist in only one of the directories are also kept.
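To make this policy concrete, here is a minimal shell sketch of the newest-copy-wins rule just described, using the directory names from the example report above. It is an illustration only, not desync's implementation; in particular, desync keeps the displaced files rather than overwriting them, as described next.

# Sketch of the newest-copy-wins merge (illustration only, not
# desync's code): fold DUP into KEEP, copying a file across only
# when KEEP lacks it or holds an older version.
KEEP=program-1.1   # the directory judged more recent
DUP=program-1.2    # the duplicate being merged in
( cd "$DUP" && find . -type f ) | while read -r f; do
    src="$DUP/$f"
    dst="$KEEP/$f"
    if [ ! -e "$dst" ] || [ "$src" -nt "$dst" ]; then
        mkdir -p "$(dirname "$dst")"
        cp -p "$src" "$dst"   # -p preserves the modification time
    fi
done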
The program does not actually delete any files. Instead it moves them into a separate trash directory. The name of this directory is the same as that of the directory being cleaned up, with an additional suffix derived from the date and time the program was run. This allows you to inspect the removed files manually before performing an actual delete operation to recover disk space.
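Once you have checked the trash directory's contents, reclaiming the space is an ordinary shell operation (the exact name depends on when the program was run):

ls -lR <trash-directory>
rm -r <trash-directory>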
All operations of the program are recorded into a log file. It is intended that future versions of the program will be able to use the log file and the directory of "removed" files to undo all the actions performed by a previous execution of desync.
Use of the program has led to the identification of several areas for improvement. The main one is the issue of multiple versioned directories. For example, if you keep the source code of several versions of a program in different directories such as name-1.1, name-1.2 and name-1.3, desync will think they are accidentally duplicated. The program allows you to create symbolic links between any two subdirectories for files that are identical. This occurs in a pair-wise manner as duplicate directories are discovered, and it may be better to link these directories together when considering them as a set.
-archive-suffix <arg>
    Set the suffix used to recognise archive files.
-backup-suffix <arg>
    Set the suffix used to recognise backup files.
-compress-suffix <arg>
    Set the suffix used to recognise compressed files.
-duplicate-cutoff <arg>
-no-archdup
    Skip the duplicated-archive stage.
-no-compdup
    Skip the compressed/uncompressed duplicate stage.
-no-dirdup
    Skip the duplicated-directory stage.
-no-filedup
    Skip the duplicated-backup stage.
-help
    Display the help text, including the program's default values.
-h|-?
    Same as -help.
-conf
    Display the current configuration options.
-peer <required arg>
    Clean <arg> as a peer (working) directory.
-store <required arg>
    Clean <arg> as a store (static copy) directory.
This version of desync is distributed under the Eiffel Forum License, version 2. See the file LICENSE for details.
I would be delighted to hear from you if you like this program.
This program uses the excellent zlib and bzip2 compression libraries.
desync was written by Justin Finnerty who can be contacted at justin_finnerty@optusnet.com.au