Finding Filename Conflicts

Posted on May 26, 2009

Although my company works with web servers that run Linux (with case-sensitive filesystems), our development machines are typically OS X, which, like Windows, uses a case-insensitive filesystem by default. Because of this, we can’t create files that differ only in letter case. Unfortunately, we can’t always stop our clients and third parties from creating such files on the servers. Sometimes we have to fix it for them by removing files whose names are duplicates except for case.

The cleanup effort grows linearly with the number of conflicts, because each conflicting file has to be individually examined and removed. How can we find out ahead of time how big the problem is, i.e., how many files are affected?

The *nix command line comes to our rescue again. Here’s the command I used. By no means is it the best solution, but it gets the job done:

find . -type d -print0 | xargs -0 -n 1 sh -c 'ls -1 "$@" | tr "[A-Z]" "[a-z]" | sort | uniq -d' sh

I’ll explain this line in two parts. The first part, “find . -type d -print0”, lists the current directory and all of its subdirectories. The second part, from “xargs …” to the end, finds the duplicate filenames in each of those directories. The output of the first command (the list of directories) is piped to the second.

The `find` command is pretty simple. It searches the current directory (“.”) recursively for directories (“-type d”). Instead of delimiting them with the usual newline (“-print”), I use “-print0” (that’s a zero) to delimit them with the NUL character. This ensures that directory names containing spaces, quotes, newlines, and other strange characters don’t get misinterpreted.

Next is the `xargs` command. Xargs is a flexible utility that builds and runs command lines from the items it reads on standard input. The “-0” (that’s also a zero) argument tells xargs that the items are delimited by the NUL character, and “-n 1” tells it to pass only one item per invocation. Without “-n 1”, xargs would pack many directories into a single invocation and their file lists would be mixed together. The rest of the line is the actual command that is executed for each item, that is, each directory from `find`.

The command:

sh -c 'ls -1 "$@" | tr "[A-Z]" "[a-z]" | sort | uniq -d' sh

is not as complicated as it looks. `sh` tells xargs to run the shell interpreter, and the “-c” argument passes the shell command as a string (the stuff inside the single quotes). The trailing “sh” is a placeholder: the first argument after the command string becomes “$0”, so without it the directory would land in “$0” and never reach “$@”. In essence we’re running a command inside a shell inside another command. We have to do this because the shell command (“ls -1 …”) uses pipes, and xargs runs a single program rather than a pipeline. The inner pipes have to be set up by a shell of their own, separately from the top-level pipe from `find` into `xargs`.

So finally, we end up with the following string:

ls -1 "$@" | tr "[A-Z]" "[a-z]" | sort | uniq -d

The `ls` command, of course, lists the files in the directory (“$@”). Here “$@” expands to whatever directory arguments `xargs` hands to the shell, which originally came from `find`. The “-1” argument (that’s a numeral one) forces one filename per line, for easy parsing.
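To see how “$@” gets filled in, here is a small sketch that is not part of the original command. It runs `sh -c` with three made-up arguments. The first argument after the command string becomes “$0”, which is why a placeholder word is normally put in that position, and the remaining ones become “$@”.

```shell
# Hypothetical arguments: "placeholder", "dir one", "dir two".
# The first argument after the command string becomes $0; the rest form "$@".
args_seen=$(sh -c 'printf "%s\n" "$@"' placeholder "dir one" "dir two")

# Prints "dir one" and "dir two" on separate lines; "placeholder" is not in "$@".
printf '%s\n' "$args_seen"
```

Quoting “$@” also keeps “dir one” intact as a single argument despite the space.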

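Next, the list of files is piped to the `tr` (“translate”) command, which converts all uppercase characters to lowercase. After this step, names that differ only in letter case become identical, so they can be compared directly. Note that the “A-Z” range only covers unaccented ASCII letters.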
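As a quick illustration of the `tr` step on its own, the sketch below feeds it three made-up filenames, two of which differ only in case.

```shell
# Made-up filenames; the first two differ only in letter case.
lowered=$(printf '%s\n' 'README.txt' 'Readme.TXT' 'index.html' | tr "[A-Z]" "[a-z]")

# The first two lines come out identical once lowercased.
printf '%s\n' "$lowered"
```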

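Finally, the `uniq` command (“unique”) is used. Typically `uniq` is used to collapse repeated lines, but with the “-d” argument it prints only the lines that are repeated. Note that `uniq` compares adjacent lines only, so the lowercased names have to be sorted (by piping them through `sort`) for every duplicate to end up next to its twin.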
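The adjacency requirement is easy to miss, so here is a minimal sketch with made-up names showing `uniq -d` with and without a preceding `sort`.

```shell
# Made-up, already lowercased names; the two "b.txt" lines are not adjacent.
names=$(printf '%s\n' 'b.txt' 'a.txt' 'b.txt')

# uniq -d compares neighbouring lines only, so it reports nothing here.
unsorted_dups=$(printf '%s\n' "$names" | uniq -d)

# After sorting, the duplicates sit next to each other and are reported once.
sorted_dups=$(printf '%s\n' "$names" | sort | uniq -d)

printf 'without sort: [%s]\n' "$unsorted_dups"
printf 'with sort:    [%s]\n' "$sorted_dups"
```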

That’s it. Right now this operation doesn’t print the directory that the file conflict is in. Xargs processes the directories one after another (it only runs in parallel if asked to with “-P”), but since the output contains nothing but lowercased names, there is no way to tell which directory a given line came from. This would be easy to get around by adding an `echo` statement in the line above, or by running another `find` to track it down.
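One possible way to add the directory to the output is sketched below. This is only an illustration of the idea, not the command I actually ran. The function name `list_case_conflicts` is made up for this example. It passes one directory per shell invocation (“-n 1”), sorts before `uniq -d`, and uses a trailing placeholder for “$0” so that the directory arrives as “$1”.

```shell
# list_case_conflicts is a hypothetical helper name, not from the original command.
# It prints "directory/lowercased-name" for every case-only conflict under $1
# (or under the current directory if no argument is given).
list_case_conflicts() {
  find "${1:-.}" -type d -print0 |
    xargs -0 -n 1 sh -c '
      # $1 is a single directory; prefix each duplicate name with it.
      ls -1 "$1" | tr "[A-Z]" "[a-z]" | sort | uniq -d |
        while IFS= read -r name; do
          printf "%s/%s\n" "$1" "$name"
        done
    ' sh
}
```

Calling `list_case_conflicts .` scans from the current directory, and piping the result through `wc -l` gives a rough count of how many names are affected. Like the original, it skips dotfiles and would be confused by filenames that contain newlines.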
