Shell script to remove outliers with IQR technique

by Linux.tar.gz, from LinuxQuestions.org (#4R6DS)

Hi,

I wrote a highly unoptimized script that takes a single argument: a one-column file of values.
As far as I've tested it, it does the job correctly, but slowly.

IQR stands for InterQuartile Range:
https://en.wikipedia.org/wiki/Interquartile_range

The IQR outliers removal is explained here:
https://en.wikipedia.org/wiki/Interq...range#Outliers
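
In short, a value counts as an outlier when it lies below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1. Here is a minimal sketch of that fence computation. The name `iqr_fences` is a hypothetical helper, not part of the script, and awk is used only to keep the arithmetic short:

```shell
#!/bin/bash

# iqr_fences Q1 Q3  (hypothetical helper, for illustration only)
# Prints the lower and upper fences of the 1.5*IQR rule on one line.
iqr_fences () {
    awk -v q1="$1" -v q3="$2" 'BEGIN {
        iqr = q3 - q1
        printf "%g %g\n", q1 - 1.5 * iqr, q3 + 1.5 * iqr
    }'
}

# With Q1=10 and Q3=20 the IQR is 10, so the fences are -5 and 35.
iqr_fences 10 20   # prints: -5 35
```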

I guess there are a gazillion ways to improve it, which is why I'm posting it.
Also, I couldn't find a bash script that does this.

The script:
Code:
#!/bin/bash

filename="/dev/shm/tmp.txt"
# Work on a numerically sorted copy of the input file
sort -n "$1" > "$filename"

tmpfile="/dev/shm/tmp2.txt"

IS_DATA_OK=0

IQRfilter () {

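# Number of values left in the sorted file
rows=`wc -l "$filename" | cut -d' ' -f1`

# Line positions of the median and of the approximate first and third quartiles
# (bc does integer division here because its scale defaults to 0)
q2=`echo "($rows+1)/2" | bc`
q1=`echo "$q2 / 2" | bc`
q3=`echo "3 * $q1" | bc`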

Q1=`head -$q1 $filename | tail -1`
#echo $Q1

Q2=`head -$q2 $filename | tail -1`
#echo $Q2

Q3=`head -$q3 $filename | tail -1`
#echo $Q3

# Parentheses so that a negative Q1 does not produce "--" in the bc expression
IQR=`echo "($Q3)-($Q1)" | bc`
#echo $IQR

IQR_X15LOW=`echo "$Q1-($IQR*1.5)" | bc`
#echo $IQR_X15LOW

IQR_X15HIGH=`echo "$Q3+($IQR*1.5)" | bc`
#echo $IQR_X15HIGH

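# Scan the sorted file and drop the first outlier found, then let the caller loop again.
n=0
while IFS= read -r line; do
  n=$((n+1))
  # Compare numerically with awk; < and > inside [[ ]] compare strings, so "10" sorts before "9"
  if awk -v x="$line" -v lo="$IQR_X15LOW" -v hi="$IQR_X15HIGH" 'BEGIN { exit !(x < lo || x > hi) }'; then
    # Delete by line number so that a different line containing the same digits is never removed
    sed "${n}d" "$filename" > "$tmpfile"
    IS_DATA_OK=0
    break
  fi
done < "$filename"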

}

while ((IS_DATA_OK!=1))
do

IS_DATA_OK=1
IQRfilter
# No tmpfile exists when no outlier was found, hence the silenced error
mv "$tmpfile" "$filename" 2>/dev/null

done

mv "$filename" ./output.txt

# -f: the temporary files are normally gone already at this point
rm -f "$filename" "$tmpfile"

exit 0

A sample data file is attached.
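
One possible speed-up, as a rough sketch rather than a drop-in replacement: sort once, load the values into awk, and filter in a single pass. The function name `iqr_filter_once` is hypothetical. It reuses the same quartile positions as the script, but it computes the fences only once, whereas the script recomputes them after every removal, so the two can give different results on some data.

```shell
#!/bin/bash

# iqr_filter_once FILE  (hypothetical name; single-pass sketch)
# Prints the values of FILE, sorted, that lie within the 1.5*IQR fences.
iqr_filter_once () {
    sort -n "$1" | awk '
        # Keep the original text for printing and a numeric copy for comparing
        { raw[NR] = $1; v[NR] = $1 + 0 }
        END {
            # Same quartile positions as the script: integer division
            q2 = int((NR + 1) / 2)
            q1 = int(q2 / 2)
            q3 = 3 * q1
            # Too few values to place the quartiles: print everything unchanged
            if (q1 < 1) {
                for (i = 1; i <= NR; i++) print raw[i]
                exit
            }
            iqr  = v[q3] - v[q1]
            low  = v[q1] - 1.5 * iqr
            high = v[q3] + 1.5 * iqr
            # Values equal to a fence are kept, as in the script
            for (i = 1; i <= NR; i++)
                if (v[i] >= low && v[i] <= high) print raw[i]
        }'
}
```

It would be called as `iqr_filter_once input.txt > output.txt`. Since it starts one sort and one awk process in total instead of several processes per value, it should be much faster on large files.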
Attached Files
input.txt (39.1 KB)