In my last article I shared multiple commands and methods to check whether a node is connected to the internet with a shell script in Linux and Unix. In this article I will show some sample shell scripts to remove duplicate files.
Script 1: Find duplicate files using shell script
The script template is taken from the bash cookbook. I have modified the script to prompt the user before removing any duplicate file, so the user can decide which of the duplicates to delete. For example, /etc/hosts and /etc/hosts.bak may have identical content; the script has no way to know which file is the important one and, without a prompt, it might delete /etc/hosts as the duplicate.
Below is the sample script to remove duplicate files with a prompt before deleting the file:
#!/bin/bash
# Find duplicate files using shell script
# Remove duplicate files interactively
TMP_FILE=$(mktemp /tmp/temp_file.XXX)
DUP_FILE=$(mktemp /tmp/dup_file.XXX)
function add_file() {
# Store the hash of the file if not already added
echo "$1" "$2" >> $TMP_FILE
}
function red () {
# print colored output
/bin/echo -e "\e[01;31m$1\e[0m" 1>&2
}
function del_file() {
# Delete the duplicate file
rm -f "$1" 2>/dev/null
[[ $? == 0 ]] && red "File \"$1\" deleted"
}
function srch_file() {
# Store the filename in this variable
local NEW="$1"
# Before we check hash value of file, make this variable null
local SUM="0"
# If file exists and the temporary file's size is zero
if [ -f "$NEW" ] && [ ! -s "$TMP_FILE" ];then
# Store the hash value of the file. The caller captures this in RET, which is later used to detect duplicates
sha512sum "${NEW}" | awk -F' ' '{print $1}'
# Exit the loop here
return
fi
# If the size of temporary file is non-zero read temporary file line by line in a loop. Each line is stored in ELEMENT variable
while read ELEMENT; do
# Get the hash value of input file
SUM=$(sha512sum "${NEW}" | awk -F' ' '{print $1}')
# Get the hash value and filename collected from the temporary file
ELEMENT_SUM=$(echo "$ELEMENT" | awk -F' ' '{print $1}')
ELEMENT_FILENAME=$(echo "$ELEMENT" | awk -F' ' '{print $2}')
# if the hash value is same means we have found a duplicate file
if [ "${SUM}" == "${ELEMENT_SUM}" ];then
echo $ELEMENT_FILENAME > $DUP_FILE
return 1
else
continue
fi
done<$TMP_FILE
# If duplicate file is not found then just print the hash value of the file
echo "${SUM}"
}
function begin_search_and_deduplication {
local DIR_TO_SRCH="$1"
# NOTE: DIR_TO_SRCH is deliberately left unquoted so multiple directories
# can be passed; this loop will break on filenames containing whitespace
for FILE in $(find ${DIR_TO_SRCH} -type f); do
# this stores the return value from srch_file function
RET=$(srch_file "${FILE}")
if [[ "${RET}" == "" ]];then
FILE1=`cat $DUP_FILE`
echo "$FILE1 is a duplicate of $FILE"
while true; do
read -rp "Which file you wish to delete? $FILE1 (or) $FILE: " ANS
if [ "$ANS" == "$FILE1" ];then
del_file "$FILE1"
break
elif [ "$ANS" == "$FILE" ];then
del_file "$FILE"
break
fi
done
continue
elif [[ "${RET}" == "0" ]];then
continue
elif [[ "${RET}" == "1" ]];then
continue
else
# If the file hash is not found then its entry is added to the temporary file
add_file "${RET}" "${FILE}"
continue
fi
done
}
# This will read the user input to collect list of directories to search for duplicate files
echo "Enter directory name to search: "
echo "Press [ENTER] when ready"
echo
read DIR
begin_search_and_deduplication "${DIR}"
# Delete the temporary files once done
rm -f $TMP_FILE
rm -f $DUP_FILE
You can execute the script, which will prompt for the list of directories you wish to check for duplicate files in Linux. I have created a few files with duplicate content for the sake of this article.
# /tmp/remove_duplicate_files.sh
Enter directory name to search:
Press [ENTER] when ready
/dir1 /dir2 /dir3 <-- This is my input (search duplicate files in these directories)
/dir1/file1 is a duplicate of /dir1/file2
Which file you wish to delete? /dir1/file1 (or) /dir1/file2: /dir1/file2
File "/dir1/file2" deleted
/dir1/file1 is a duplicate of /dir2/file101
Which file you wish to delete? /dir1/file1 (or) /dir2/file101: /dir2/file101
File "/dir2/file101" deleted
Here, as you can see, the script waits for user input whenever a duplicate file is found in the provided directories, and proceeds based on the user's choice.
How it works
- I have added comments before most sections, which should help you understand how the script removes duplicate files.
- The script calculates a hash (sha512sum) of each file, then compares that hash against a list of hashes already computed.
- If the hash matches, we have seen the contents of this file before, so we can delete it.
- If the hash is new, we record the entry and move on to calculating the hash of the next file, until all files have been hashed.
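The hashing step above can be demonstrated in isolation. The snippet below is a standalone illustration (using throwaway files under /tmp, not part of the script itself) showing that identical content always yields an identical sha512sum, which is the only property the duplicate check relies on:

```shell
#!/bin/bash
# Identical content gives identical hashes; different content gives
# different hashes. This is the basis of the duplicate check above.
printf 'hello\n' > /tmp/hash_demo_a
printf 'hello\n' > /tmp/hash_demo_b
printf 'world\n' > /tmp/hash_demo_c

HASH_A=$(sha512sum /tmp/hash_demo_a | awk '{print $1}')
HASH_B=$(sha512sum /tmp/hash_demo_b | awk '{print $1}')
HASH_C=$(sha512sum /tmp/hash_demo_c | awk '{print $1}')

[ "$HASH_A" == "$HASH_B" ] && echo "hash_demo_a and hash_demo_b are duplicates"
[ "$HASH_A" != "$HASH_C" ] && echo "hash_demo_a and hash_demo_c differ"
```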
Now if you wish the script to automatically find and remove duplicate files, you can remove the interactive prompt block (the while true loop) from the script above and call del_file $FILE1 directly, so duplicate files (if found) are removed without confirmation.
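As a sketch of what that non-interactive variant boils down to, here is my own minimal rewrite (not the modified script itself): it assumes bash 4+ for associative arrays and keeps whichever copy find lists first. The demo directory /tmp/dedup_demo and its files are made up for illustration.

```shell
#!/bin/bash
# Non-interactive duplicate remover (sketch; assumes bash 4+ for
# associative arrays). Keeps the first file seen for each hash.
dedup() {
    declare -A seen
    while IFS= read -r -d '' f; do
        sum=$(sha512sum "$f" | awk '{print $1}')
        if [[ -n "${seen[$sum]}" ]]; then
            echo "Removing duplicate: $f (same as ${seen[$sum]})"
            rm -f -- "$f"
        else
            seen[$sum]="$f"
        fi
    done < <(find "$1" -type f -print0)
}

# Demo on a throwaway directory: file1 and file2 are duplicates
mkdir -p /tmp/dedup_demo
printf 'same\n'  > /tmp/dedup_demo/file1
printf 'same\n'  > /tmp/dedup_demo/file2
printf 'other\n' > /tmp/dedup_demo/file3
dedup /tmp/dedup_demo
```

Note that find's traversal order decides which copy survives, which is exactly the risk the interactive prompt in the full script is meant to avoid.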
Script 2: Remove duplicate files using shell script
Here we will use awk to find duplicate files using shell script. This code will find the copies of the same file in a directory and remove all except one copy of the file.
#!/bin/bash
# Filename: remove_duplicate.sh
# Description: Find and remove duplicate files and
# keep one sample of each file.
ls -lS --time-style=long-iso | awk 'BEGIN {
getline; getline;
name1=$8; size=$5
}
{
name2=$8;
if (size==$5)
{
"md5sum "name1 | getline; csum1=$1;
"md5sum "name2 | getline; csum2=$1;
if ( csum1==csum2 )
{
print name1; print name2
}
};
size=$5; name1=name2;
}' | sort -u > duplicate_files
cat duplicate_files | xargs -I {} md5sum {} | sort | uniq -w 32 | awk '{ print $2 }' | sort -u > unique_files
echo Removing..
comm -3 duplicate_files unique_files | tee /dev/stderr | xargs rm
echo Removed duplicate files successfully.
You must navigate into the directory where you wish to find and remove duplicate files and then execute the script. Here I want to find duplicate files inside /dir1, so I will cd /dir1 and then execute the script without any arguments:
[root@centos-8 dir1]# /tmp/remove_duplicate.sh
Removing..
file1_copy
file2_copy
Removed duplicate files successfully.
List the files under /dir1 and verify that no duplicate files exist there any more. Some more files were created and are left for your reference.
[root@centos-8 dir1]# ls -l
total 20
-rw-r--r-- 1 root root 34 Jan 9 07:01 duplicate_files
-rw-r--r-- 1 root root 16 Jan 9 06:56 duplicate_sample
-rw-r--r-- 1 root root 5 Jan 9 07:00 file1
-rw-r--r-- 1 root root 6 Jan 9 07:00 file2
-rw-r--r-- 1 root root 12 Jan 9 07:01 unique_files
How it works
ls -lS lists the details of the files in the current folder sorted by file size. The --time-style=long-iso option tells ls to print dates in ISO format. awk reads the output of ls -lS and compares columns and rows of the input text to find the duplicate files.
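To see why the awk code reads $5 and $8, note that with --time-style=long-iso (GNU ls) the timestamp occupies exactly two fields, so the size lands in field 5 and the filename in field 8. A quick check on a throwaway file (the /tmp/ls_field_demo path is made up for the demo):

```shell
#!/bin/bash
# With --time-style=long-iso, `ls -l` fields are:
# perms links owner group size date time name
#   $1    $2    $3    $4    $5   $6   $7   $8
mkdir -p /tmp/ls_field_demo
printf '12345\n' > /tmp/ls_field_demo/sample.txt
cd /tmp/ls_field_demo
ls -lS --time-style=long-iso | awk 'NR > 1 { print $5, $8 }'
# prints: 6 sample.txt   (5 characters plus trailing newline = 6 bytes)
```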
The logic behind the code to find duplicate files using shell script is as follows:
- We list the files sorted by size, so files of the same size will be adjacent. The first step in finding identical files is to find ones with the same size. Next, we calculate the checksum of those files. If the checksums match, the files are duplicates and all but one copy of each set of duplicates is removed.
- The BEGIN{} block of awk is executed before the main processing. It reads past the "total" line and initializes the variables. The bulk of the processing takes place in the {} block, while awk reads and processes the rest of the ls output. The END{} block statements are executed after all input has been read.
- In the BEGIN block, we read the first file line and store the name and size (which are the eighth and fifth columns). When awk enters the {} block, the rest of the lines are read one by one. This block compares the size obtained from the current line with the previously stored size in the size variable. If they are equal, it means the two files are duplicates by size and must be further checked by md5sum.
- Once a line is read, the entire line is in $0 and each column is available in $1, $2, ..., $n. Here, we read the md5sum checksums of the files into the csum1 and csum2 variables. The name1 and name2 variables store the consecutive filenames. If the checksums of two files are the same, they are confirmed to be duplicates and are printed.
- We calculate the md5sum of the duplicates and print one file from each group of duplicates by finding unique lines, comparing the md5sum from each line using -w 32 (the first 32 characters of the md5sum output; the md5sum output consists of a 32-character hash followed by the filename). One sample from each group of duplicates is written to unique_files.
- Now we need to remove the files listed in duplicate_files, excluding the files listed in unique_files. The comm command prints the files that are in duplicate_files but not in unique_files.
- comm only processes sorted input; therefore, sort -u is used to filter duplicate_files and unique_files.
- The tee command is used to pass the filenames to the rm command as well as print them. tee sends its input to both stdout and a file. We can also print text to the terminal by redirecting to stderr; /dev/stderr is the device corresponding to stderr (standard error). By redirecting to the stderr device file, the text piped into tee is printed in the terminal as standard error.
References:
Linux Shell Scripting Cookbook
Bash Cookbook