
Counting word frequency in shell

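#!/bin/bash
# Print the n most frequent words in a text file.
# The script uses [[ ]], so "sh" must be bash (or run it with bash directly).
help() {
	echo "This shell script reports the n most frequent words in a text file"
	echo "usage: sh $0 filename n"
	echo "filename is the text file to analyse; n is the number of words to report"
	echo "sh $0 englist_statment.txt 10"
}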
	    
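# Sample text (the content of englist_statment.txt), kept in a no-op
# here-document that serves as a multi-line comment.
: <<'EOF'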

First Flight
  Mr. Johnson had never been up in an aerophane before and he had read a lot about air accidents, so one day when a friend offered to take him for a ride in his own small phane, Mr. Johnson was very worried about accepting. Finally, however, his friend persuaded him that it was very safe, and Mr. Johnson boarded the plane.
  His friend started the engine and began to taxi onto the runway of the airport. Mr. Johnson had heard that the most dangerous part of a flight were the take-off and the landing, so he was extremely frightened and closed his eyes.
  After a minute or two he opened them again, looked out of the window of the plane, and said to his friend, Look at those people down there. They look as small as ants, dont they?
  Those are ants, answered his friend. Were still on the ground.
EOF

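# Both arguments are required.
if [[ -z "$1" || -z "$2" ]]; then
	help
	exit 1
fi

if [[ -f "$1" ]]; then
	# Split into one word per line, lower-case, count, sort by count (descending)
	# and then alphabetically, and keep the first n lines.
	# The tr set has no brackets: tr would treat [ and ] as literal characters.
	statis=$(tr -cs 'A-Za-z' '\n' < "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -k1nr -k2 | head -n "$2")
	echo "$statis"
else
	help
	exit 1
fi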
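
The pipeline in the script can also be tried on its own, one stage at a time. The following is only a sketch: `top_words` is a hypothetical helper name, and the check on n and the `grep` stage (which drops the empty "word" produced when the text starts with a non-letter) are additions that the original script does not have.

```shell
#!/bin/bash
# top_words N: read text on stdin, print the N most frequent words.
# Hypothetical helper for illustration; not part of the original script.
top_words() {
    local n=$1
    # Reject a missing or non-numeric count before head sees it.
    case $n in
        ''|*[!0-9]*) echo "top_words: n must be a non-negative integer" >&2; return 2 ;;
    esac
    tr -cs 'A-Za-z' '\n' |   # every run of non-letters becomes one newline
        tr 'A-Z' 'a-z' |     # fold to lower case
        grep -v '^$' |       # drop the empty line left by leading non-letters
        sort | uniq -c |     # count identical words
        sort -k1,1nr -k2 |   # highest count first, ties in alphabetical order
        head -n "$n"
}

# Prints "3 the" and then "1 and"; the padding of the count column
# depends on the uniq implementation.
printf 'The cat and the dog. The end\n' | top_words 2
```

Reading from stdin rather than from a file name makes it easy to feed the function a short string and inspect the result of each stage by cutting the pipeline short.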



Running the script on the sample text:

# sh statis_word.sh englist_statment.txt 5
     10 the
      6 and
      6 his
      5 a
      5 friend

If the script is not called correctly, it prints the help text:

# sh statis_word.sh englist_statment.txt
This shell script reports the n most frequent words in a text file
usage: sh statis_word.sh filename n
filename is the text file to analyse; n is the number of words to report
sh statis_word.sh englist_statment.txt 10

The tr options used in the pipeline:

# tr --help
Usage: tr [OPTION]... SET1 [SET2]
Translate, squeeze, and/or delete characters from standard input,
writing to standard output.

  -c, -C, --complement    first complement SET1
  -d, --delete            delete characters in SET1, do not translate
  -s, --squeeze-repeats   replace each input sequence of a repeated character
                            that is listed in SET1 with a single occurrence
                            of that character
  -t, --truncate-set1     first truncate SET1 to length of SET2
      --help     display this help and exit
      --version  output version information and exit

SETs are specified as strings of characters.  Most represent themselves.
Interpreted sequences are:

  \NNN            character with octal value NNN (1 to 3 octal digits)
  \\              backslash
  \a              audible BEL
  \b              backspace
  \f              form feed
  \n              new line
  \r              return
  \t              horizontal tab
  \v              vertical tab
  CHAR1-CHAR2     all characters from CHAR1 to CHAR2 in ascending order
  [CHAR*]         in SET2, copies of CHAR until length of SET1
  [CHAR*REPEAT]   REPEAT copies of CHAR, REPEAT octal if starting with 0
  [:alnum:]       all letters and digits
  [:alpha:]       all letters
  [:blank:]       all horizontal whitespace
  [:cntrl:]       all control characters
  [:digit:]       all digits
  [:graph:]       all printable characters, not including space
  [:lower:]       all lower case letters
  [:print:]       all printable characters, including space
  [:punct:]       all punctuation characters
  [:space:]       all horizontal or vertical whitespace
  [:upper:]       all upper case letters
  [:xdigit:]      all hexadecimal digits
  [=CHAR=]        all characters which are equivalent to CHAR

Translation occurs if -d is not given and both SET1 and SET2 appear.
-t may be used only when translating.  SET2 is extended to length of
SET1 by repeating its last character as necessary.  Excess characters
of SET2 are ignored.  Only [:lower:] and [:upper:] are guaranteed to
expand in ascending order; used in SET2 while translating, they may
only be used in pairs to specify case conversion.  -s uses SET1 if not
translating nor deleting; else squeezing uses SET2 and occurs after
translation or deletion.

The word-splitting step can also be written with the [CHAR*] form of SET2:

tr -cs "[A-Z][a-z]" "[\n*]"

To test what -c means, take a file test0.sh that contains upper-case letters, lower-case letters and digits:

# more test0.sh
M C a b 8 6
# more test0.sh |tr -c "[A-Z]" "$"
$$M$C$$$$$$$$$$$
# more test0.sh |tr -c "[a-z]" "$"
$$$$$$$a$b$$$$$$
# more test0.sh |tr -c "[:digit:]" "$"
$$$$$$$$$$$8$6$$

The output shows that -c takes the complement: every character that is not in SET1 is replaced with SET2. -s squeezes each run of a repeated character into a single occurrence:

# more test0.sh |tr -cs "[:digit:]" "$"
$8$6$

So tr -cs "[a-z][A-Z]" "\n" replaces every character that is not a letter with a newline and then squeezes each run of newlines into one, which leaves one word per line. The square brackets are not required: tr treats [ and ] inside a set as ordinary characters, so A-Za-z describes the letters on its own.
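
The effect of -c and -s can be reproduced without a test file by feeding tr a fixed string. The input below is made up for illustration and is not the content of the author's test0.sh, so the number of `$` characters differs from the transcript above.

```shell
#!/bin/bash
# Made-up input (not the author's test0.sh): letters, digits, spaces, one newline.
input='M C a b
8 6'

# -c alone: every character outside SET1 becomes "$", one for one.
printf '%s' "$input" | tr -c '[:digit:]' '$'; echo    # prints $$$$$$$$8$6

# -c with -s: each run of replaced characters collapses to a single "$".
printf '%s' "$input" | tr -cs '[:digit:]' '$'; echo   # prints $8$6

# The word-splitting step: runs of non-letters become one newline each,
# so the letters M, C, a and b end up on separate lines.
printf '%s' "$input" | tr -cs 'A-Za-z' '\n'
```

The replacement character is single-quoted here so that the shell never tries to expand the `$`.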