Shell script to remove outliers with IQR technique
by Linux.tar.gz from LinuxQuestions.org on (#4R6DS)
Hi,
I wrote a highly unoptimized script that takes a one-column file of values as its one and only argument.
As far as I've tested it, it does the job correctly, but slowly.
IQR stands for InterQuartile Range:
https://en.wikipedia.org/wiki/Interquartile_range
IQR outlier removal is explained here:
https://en.wikipedia.org/wiki/Interq...range#Outliers
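As a quick illustration of that rule (a sketch only, separate from the script in this post): a value counts as an outlier when it lies below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1. The helper name iqr_fences is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print the lower and upper fences for given quartiles.
# Usage: iqr_fences Q1 Q3
iqr_fences () {
    awk -v q1="$1" -v q3="$2" 'BEGIN {
        iqr = q3 - q1
        # Anything below the first number or above the second is an outlier
        print q1 - 1.5 * iqr, q3 + 1.5 * iqr
    }'
}

iqr_fences 10 12    # prints: 7 15
```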
I guess there are a gazillion ways to improve the script, which is why I'm posting it.
Also, I couldn't find a bash script that does this.
The script:
Code:
#!/bin/bash
# Usage: pass one file holding one numeric value per line
filename="/dev/shm/tmp.txt"
tmpfile="/dev/shm/tmp2.txt"
sort -n "$1" > "$filename"
IS_DATA_OK=0
IQRfilter () {
    rows=$(wc -l < "$filename")
    # Approximate quartile positions (line numbers in the sorted file)
    q2=$(echo "($rows+1)/2" | bc)
    q1=$(echo "$q2 / 2" | bc)
    q3=$(echo "3 * $q1" | bc)
    Q1=$(head -n "$q1" "$filename" | tail -n 1)
    Q2=$(head -n "$q2" "$filename" | tail -n 1)
    Q3=$(head -n "$q3" "$filename" | tail -n 1)
    # Parentheses keep negative values from confusing bc
    IQR=$(echo "($Q3)-($Q1)" | bc)
    IQR_X15LOW=$(echo "($Q1)-($IQR*1.5)" | bc)
    IQR_X15HIGH=$(echo "($Q3)+($IQR*1.5)" | bc)
    n=0
    while IFS= read -r line; do
        n=$((n+1))
        # [[ a < b ]] compares strings, so let bc compare numerically (GNU bc)
        if [ "$(echo "($line) < ($IQR_X15LOW) || ($line) > ($IQR_X15HIGH)" | bc)" -eq 1 ]; then
            # Delete the outlier by line number rather than by regex match
            sed "${n}d" "$filename" > "$tmpfile"
            IS_DATA_OK=0
            break
        fi
    done < "$filename"
}
while ((IS_DATA_OK!=1))
do
    IS_DATA_OK=1
    IQRfilter
    # tmpfile exists only when an outlier was removed in this pass
    if [ -f "$tmpfile" ]; then mv "$tmpfile" "$filename"; fi
done
mv "$filename" ./output.txt
rm -f "$tmpfile"
exit 0
A sample data file is attached.
Attached Files


input.txt (39.1 KB)
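One possible way to speed the script up, offered as a sketch rather than a drop-in replacement: sort once, load the values into an awk array, and repeat the fence test in memory instead of re-reading the file for every removed value. The function name iqr_filter is hypothetical. It reuses the quartile positions from the script above, but it is an assumption of this sketch that dropping every value outside the fences in each pass is acceptable; the script above removes one value per pass, so the two may not always agree.

```shell
#!/bin/bash
# Hypothetical helper: iqr_filter FILE prints the values that survive
# repeated IQR filtering, one per line, in ascending order.
iqr_filter () {
    sort -n "$1" | awk '
        { t[NR] = $1; v[NR] = $1 + 0 }   # keep original text and numeric value
        END {
            lo = 1; hi = NR; changed = 1
            # With fewer than 3 rows the position q1 would be 0, so stop there
            while (changed && hi - lo + 1 >= 3) {
                changed = 0
                rows = hi - lo + 1
                # Same quartile positions as the script (integer division)
                q2 = int((rows + 1) / 2)
                q1 = int(q2 / 2)
                q3 = 3 * q1
                Q1 = v[lo + q1 - 1]
                Q3 = v[lo + q3 - 1]
                iqr = Q3 - Q1
                low = Q1 - 1.5 * iqr
                high = Q3 + 1.5 * iqr
                # The data is sorted, so outliers can only sit at the two ends
                while (lo <= hi && v[lo] < low)  { lo++; changed = 1 }
                while (hi >= lo && v[hi] > high) { hi--; changed = 1 }
            }
            for (i = lo; i <= hi; i++) print t[i]
        }'
}

# Example: iqr_filter input.txt > output.txt
```

Because the data stays sorted, the surviving values are always one contiguous range, which is why two indices are enough and no lines need to be deleted from a file.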
