desync - directory cleaner

Justin Finnerty
Last Updated Mon Oct 31 16:16:55 2005



Introduction to desync: the directory cleaner

I used to work in a situation where I had a laptop and a desktop computer. Initially I was frustrated by the fact that the file I wanted was always on the other computer. I then discovered the rsync program and was able to synchronise my entire working directory on both computers and a backup disk. Initially all was well: I had access to my data on both computers and a backup for the inevitable disaster recovery. I chose a synchronisation scheme that only added to the target directory. This meant that I could work in different subdirectories on the two computers at the same time, and synchronisation in each direction would then bring both computers up to date without losing either set of work. I then noticed that my working directory was steadily increasing in size. There were several reasons for this. Firstly, the rsync program only looks at the file modification time when deciding that a file has changed. Secondly, I was working on large data files that I kept on disk in compressed form unless I was actually working on them. Thirdly, I found that moving a directory was a nightmare; I had to remember to manually move the directory on each computer before synchronisation.
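One way to get such an add-only scheme with rsync is simply to omit its --delete option, so nothing is ever removed from the target; adding -u also stops files that are newer on the receiving side from being overwritten. The paths below are placeholders:

     rsync -au ~/work/ desktop:~/work/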

Thus the desync program was conceived to clean up directories that were being synchronised. Initially this took the form of a couple of scripts that looked for duplicate files and for compressed/uncompressed duplicates. However, when I came to consider a script to look for duplicate directories I decided a program was required. This program would attempt to address the three identified causes of duplication.

Terminology

The desync program divides the directories a program like rsync works on into two broad types: a peer type directory holds files that are being worked on directly by a user, while a store type directory holds a static copy. The desync program performs different actions depending on the type of the directory. For example, you would want to keep both the compressed and uncompressed versions of a file you are currently working on in a peer directory, but would only want the compressed version in a store directory.

Usage synopsis

Perform directory clean up operations.

     desync [-no-archdup] [-no-compdup] [-no-dirdup] [-no-filedup]
            (-peer|-store) <directory>
            [-archive-suffix <arg>] [-backup-suffix <arg>]
            [-compress-suffix <arg>] [-duplicate-cutoff <arg>]

For a working directory the simplest command would be:

     desync -peer <directory>

and for a back-up directory the simplest command would be:

     desync -store <directory>
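The other options described in the options summary below can be combined with these commands. For example, to clean a working directory while skipping the search for duplicated directory trees (the path is a placeholder):

     desync -no-dirdup -peer ~/work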

Get help text.

     desync  [-help] or [-h|-?] 

Get current configuration options.

     desync  [-conf]

Note that the -help option displays the program's default values.

Example operation

When the program is installed an example directory archive should be included. It will be located in the program's data directory (/usr/share/desync or /usr/local/share/desync). To try the example, unpack it using a command like:

  tar jxpf /usr/local/share/desync/test2.bz2

or

  tar jxpf /usr/local/share/desync/test3.bz2

This should create a directory called test-dir2 or test-dir3. Run the desync program on the example using a command like:

  desync -peer test-dir2

or

  desync -peer test-dir3

The program performs its search for each type of duplication in stages.

Removing identical backups

The first stage looks for files that have identical backups. The program prints the following report:

  ----------------------------------------------------------------------
  Generated report: Remove duplicated backups
  ----------------------------------------------------------------------
  This report analyses a directory for pairs of files that appear to be 
  an original and backup. If these pairs have identical content then the 
  report marks the backup file for deletion.
  ----------------------------------------------------------------------
  <1> [Trash] = Move to trash [test-dir2/TAGS.bak].
  <2> [Trash] = Move to trash [test-dir2/temp/test_demangle/TAGS.bak.~1~].
  <3> [Trash] = Move to trash [test-dir2/test_demangle/TAGS.bak.~2~].
  <4> [Trash] = Move to trash [test-dir2/TAGS~].
  <5> [Trash] = Move to trash [test-dir2/temp/test_demangle/TAGS~.~1~].
  <6> [Trash] = Move to trash [test-dir2/test_demangle/TAGS~.~2~].
  ** Report has actions **
  Options are:
   (1) do actions
   (2) edit report (remove some actions)
   (3) restart
   (4) ignore and continue to next report
   Choice -

(see note on filenames shown)

The top of this report has a title and some information about the current stage. It then lists six actions that will clean up the directory. You have four options: (1) will perform all the actions, (2) will allow you to remove some of the actions, (3) will restart the stage and (4) will ignore the actions and go on to the next stage.

Editing actions in a report

If you choose to edit the report (option (2) above) you get a prompt like:

   Type an index and 'Enter' to delete an action, 'Enter' on a 
  blank line to finish.
   Delete entry - 

You enter numbers as listed in the report. Numbers outside the listed indices are silently ignored. 'Enter' on a blank line gives one of two prompts. If some actions remain you get something like the following:

   Number of actions remaining = 6
  ** Report has actions **
  Options are:
   (1) do actions
   (2) restart
   (3) ignore and continue to next report
   Choice - 

If no actions remain you get the following:

   Number of actions remaining = 0
  ** Report now empty **
  Options are:
   (1) restart
   (2) ignore and continue to next report
   Choice - 

Completed stage report on actions

After accepting the actions you get a report on the success of the stage. The completion report for the first stage looks something like the following. Other stages would give similar reports.

  ----------------------------------------------------------------------
  Completed report: Remove duplicated backups
  ----------------------------------------------------------------------
  ----------------------------------------------------------------------
  <1> [Trash] = Move to trash [test-dir2/TAGS.bak].
  <2> [Trash] = Move to trash [test-dir2/temp/test_demangle/TAGS.bak.~1~].
  <3> [Trash] = Move to trash [test-dir2/test_demangle/TAGS.bak.~2~].
  <4> [Trash] = Move to trash [test-dir2/TAGS~].
  <5> [Trash] = Move to trash [test-dir2/temp/test_demangle/TAGS~.~1~].
  <6> [Trash] = Move to trash [test-dir2/test_demangle/TAGS~.~2~].
  ** Report actions completed successfully **
  Options are:
   (1) continue
   (2) undo and continue or restart
   Choice - 

The two options allow you to: (1) continue to the next stage, or (2) undo the actions and then choose whether to restart the stage or continue to the next one.

Undoing a set of actions

All current stages are undoable. Choosing to undo a set of actions leads to the following choices if the undo was successful. If an undo operation is ever not possible, you would get a similar message but the only option would be to continue.

  ** Report actions undone successfully **
  Options are:
   (1) regenerate report
   (2) continue to next report
   Choice - 

Removing uncompressed duplicates

The second stage looks for compressed and uncompressed pairs of files. The program prints the following report:

  ----------------------------------------------------------------------
  Generated report: Remove uncompressed duplicates
  ----------------------------------------------------------------------
  This report analyses a directory for pairs of files that appear to be 
  compressed and uncompressed versions. If these pairs have identical 
  content then the report marks the uncompressed file for deletion.
  ----------------------------------------------------------------------
  <1> [Trash] = Move to trash [test-dir2/gztags.gz].
  <2> [Trash] = Move to trash [test-dir2/temp/test_demangle/gztags.gz.~1~].
  <3> [Trash] = Move to trash [test-dir2/test_demangle/gztags.gz.~2~].
  <4> [Trash] = Move to trash [test-dir2/bztags.bz2].
  <5> [Trash] = Move to trash [test-dir2/temp/test_demangle/bztags.bz2.~1~].
  <6> [Trash] = Move to trash [test-dir2/test_demangle/bztags.bz2.~2~].
  ** Report has actions **
  Options are:
   (1) do actions
   (2) edit report (remove some actions)
   (3) restart
   (4) ignore and continue to next report
   Choice - 

(see note on filenames shown)

Again, the top of this report gives a title and some information about the current stage. It then lists six actions that will clean up the directory. You have the same four options as in the previous stage.

Removing duplicate archives

The third stage looks for pairs of archive files with the same name. The reports it prints are essentially similar to those of the previous two stages.

Removing duplicated directories

The fourth stage looks for duplicated directories. The program prints the following report:

  ----------------------------------------------------------------------
  Generated report: Merge duplicated directories.
  ----------------------------------------------------------------------
  This report analyses a directory for duplicated branches. Once found 
  the report attempts to merge or create symbolic links between the 
  duplicated branches. When created, the symbolic links will be relative 
  to the store directory selected.
  ----------------------------------------------------------------------
  <1> [program-1.1]
  <2> [program-1.2]
  ** Report has actions **
  Options are:
   (1) merge directory <2> into <1> (note <1> appears more recent)
   (2) link directory <2> to <1>
   (3) link directory <1> to <2>
   (4) ignore these directories and search for next pair
   (5) ignore report and continue to next report
   Choice - 

This gives the directory names of the two duplicated directories. With this stage you have three main choices: (1) merge the two duplicated directories, (2) create symbolic links in directory <2> for all files that have identical copies in <1>, or (3) create symbolic links in directory <1> for all files that have identical copies in <2>. The other choices are (4) ignore this duplicated pair and attempt to find the next one, and (5) ignore this pair and any remaining duplicated directories and go on to the next stage. In this version of the program there is no next stage, so choosing (5) will end the program.
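As a rough illustration of what the link options do (the file name is hypothetical, and the exact links desync creates may differ), linking directory <2> to <1> replaces each file in program-1.2 that has an identical copy in program-1.1 with a relative symbolic link, something like:

  ln -s ../program-1.1/README program-1.2/README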

Filenames listed in reports

Note that the file names given in reports may not correspond exactly to files in the original directory: they may have been given extra numeric suffixes to ensure that each file has a unique name in the trash directory.

Safeguards

For duplicated directories the program attempts to determine the correct location by determining which directory is more recent. It then merges the two directories together. The merging ensures that the most recent copy of each file ends up in the kept directory. Any files that exist in only one of the directories are also kept.
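A rough shell sketch of the merge rule for a single file that exists in both directories (the names are hypothetical; this only illustrates which copy survives, not how desync is implemented):

  # the duplicate copy replaces the kept copy only if it is newer,
  # otherwise the copy already in the kept directory survives
  if [ duplicate-dir/data.txt -nt kept-dir/data.txt ]; then
      cp -p duplicate-dir/data.txt kept-dir/data.txt
  fi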

The program does not actually delete any files. Instead it moves them into a separate directory. The name of this directory is the same as that of the directory being cleaned up, with an additional suffix derived from the date and time the program was run. This allows the user to manually inspect the removed files before performing an actual delete operation to recover disk space.
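Once you have checked that directory and are satisfied that nothing important has been removed, it can be deleted in the usual way to recover the space; the name below is a placeholder for whatever dated name desync created:

  rm -r <trash-directory>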

All operations of the program are recorded into a log file. It is intended that future versions of the program will be able to use the log file and the directory of "removed" files to undo all the actions performed by a previous execution of desync.

Future directions

Use of the program has led to the identification of several areas for improvement. The main one is the issue of multiple versioned directories. For example, if you keep the source code of several versions of a program in different directories such as name-1.1, name-1.2 and name-1.3, desync will think they have been accidentally duplicated. The program allows you to create symbolic links between any two such directories for files that are identical. This occurs in a pair-wise manner as duplicate directories are discovered, and it may be better to link these directories together when considering them as a set.

Options summary

Cleaning up options

-archive-suffix <arg>
A ':'-separated list of filename suffixes that identify archive files, used when searching for duplicate archives.

-backup-suffix <arg>
A ':'-separated list of backup filename suffixes, used when searching for duplicated backup files. The -help option will print the default value used if this option is not set.

-compress-suffix <arg>
A ':'-separated list of filename suffixes that identify compressed files, used when searching for files with a compressed duplicate (see the example after this options list).

-duplicate-cutoff <arg>
The maximum number of files with the same name to consider as potential leads for finding duplicate branches. The -help option will print the default value used if this option is not set.

-no-archdup
Do not attempt to find duplicate archive files.

-no-compdup
Do not attempt to find files with a compressed duplicate version.

-no-dirdup
Do not attempt to find duplicated directory trees.

-no-filedup
Do not attempt to find duplicated backup files.
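For example, a store directory could be cleaned using explicit suffix lists; the path and the particular suffix values below are illustrative only, and the built-in defaults can be seen with -help:

     desync -store /backup/work -backup-suffix .bak -compress-suffix .gz:.bz2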

Help options

-help
Print an extended help message to the terminal.

-h|-?
Print a short usage message to the terminal.

Configuration listing

-conf
Print the current configuration to the terminal.

Directory options

-peer <required arg>
Treat the directory as a peer (working) directory.

-store <required arg>
Treat the directory as a store (back-up) directory.

Credits

This version of desync is distributed under the Eiffel Forum License, version 2. See the file LICENSE for details.

I would be delighted to hear from you if you like this program.

This program uses the excellent zlib and bzip2 compression libraries.

Author

desync was written by Justin Finnerty who can be contacted at justin_finnerty@optusnet.com.au