16.4. 文本处理命令

影响文本和文本文件的命令

sort

文件排序工具,通常用作管道中的过滤器。 此命令按正序或倒序,或根据各种键或字符位置对文本流或文件进行排序。 使用 -m 选项,它可以合并预先排序的输入文件。 其 info 页面列出了它的许多功能和选项。 请参阅 示例 11-10、示例 11-11 和 示例 A-8。
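sort 的几个最常用的选项可以用一个小示意来说明(下面的 /tmp/fruit.txt 路径和数据只是假设的示例):

```shell
# Sample data: a count followed by a fruit name on each line.
printf '3 bananas\n10 apples\n1 cherries\n' > /tmp/fruit.txt

sort /tmp/fruit.txt              # Lexicographic sort: "10" sorts before "3".
sort -n /tmp/fruit.txt           # Numeric sort on the leading field.
sort -nr /tmp/fruit.txt          # Reverse numeric sort.
sort -t ' ' -k 2 /tmp/fruit.txt  # Sort on the second field (the name).
```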

tsort

拓扑排序,读取成对的以空格分隔的字符串,并根据输入模式进行排序。 tsort 的最初目的,是为"远古"版本 UNIX 中过时的 ld 链接器对依赖项列表进行排序。

tsort 的结果通常与上面的标准 sort 命令的结果明显不同。
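tsort 的输入是成对的字符串,每一对 "A B" 表示 A 必须排在 B 之前;下面是一个最小的示意(数据纯属举例),输出是满足全部约束的某一种拓扑顺序:

```shell
# Each input pair "A B" means: A must come before B in the output.
# tsort prints one valid topological ordering of all the items.
printf 'socks shoes\nunderwear pants\npants shoes\n' | tsort
```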

uniq

此过滤器从已排序的文件中删除重复的行。它经常与 sort 结合使用在管道中。

cat list-1 list-2 list-3 | sort | uniq > final.list
# Concatenates the list files,
# sorts them,
# removes duplicate lines,
# and finally writes the result to an output file.

有用的 -c 选项在输入文件的每一行前面加上该行出现的次数。

bash$ cat testfile
This line occurs only once.
 This line occurs twice.
 This line occurs twice.
 This line occurs three times.
 This line occurs three times.
 This line occurs three times.


bash$ uniq -c testfile
      1 This line occurs only once.
       2 This line occurs twice.
       3 This line occurs three times.


bash$ sort testfile | uniq -c | sort -nr
      3 This line occurs three times.
       2 This line occurs twice.
       1 This line occurs only once.
	      

命令字符串 sort INPUTFILE | uniq -c | sort -nr 生成 INPUTFILE 文件内容的出现频率列表(sort 的 -nr 选项导致反向数字排序)。 此模板可用于分析日志文件和字典列表,以及任何需要检查文档词汇结构的地方。

示例 16-12. 词频分析

#!/bin/bash
# wf.sh: Crude word frequency analysis on a text file.
# This is a more efficient version of the "wf2.sh" script.


# Check for input file on command-line.
ARGS=1
E_BADARGS=85
E_NOFILE=86

if [ $# -ne "$ARGS" ]  # Correct number of arguments passed to script?
then
  echo "Usage: `basename $0` filename"
  exit $E_BADARGS
fi

if [ ! -f "$1" ]       # Check if file exists.
then
  echo "File \"$1\" does not exist."
  exit $E_NOFILE
fi



########################################################
# main ()
sed -e 's/\.//g'  -e 's/\,//g' -e 's/ /\
/g' "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
#                           =========================
#                            Frequency of occurrence

#  Filter out periods and commas, and
#+ change space between words to linefeed,
#+ then shift characters to lowercase, and
#+ finally prefix occurrence count and sort numerically.

#  Arun Giridhar suggests modifying the above to:
#  . . . | sort | uniq -c | sort +1 [-f] | sort +0 -nr
#  This adds a secondary sort key, so instances of
#+ equal occurrence are sorted alphabetically.
#  As he explains it:
#  "This is effectively a radix sort, first on the
#+ least significant column
#+ (word or string, optionally case-insensitive)
#+ and last on the most significant column (frequency)."
#
#  As Frank Wang explains, the above is equivalent to
#+       . . . | sort | uniq -c | sort +0 -nr
#+ and the following also works:
#+       . . . | sort | uniq -c | sort -k1nr -k
########################################################

exit 0

# Exercises:
# ---------
# 1) Add 'sed' commands to filter out other punctuation,
#+   such as semicolons.
# 2) Modify the script to also filter out multiple spaces and
#+   other whitespace.

bash$ cat testfile
This line occurs only once.
 This line occurs twice.
 This line occurs twice.
 This line occurs three times.
 This line occurs three times.
 This line occurs three times.


bash$ ./wf.sh testfile
      6 this
       6 occurs
       6 line
       3 times
       3 three
       2 twice
       1 only
       1 once
	       

expand, unexpand

expand 过滤器将制表符转换为空格。 它通常用在管道中。

unexpand 过滤器将空格转换为制表符。这会反转 expand 的效果。
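这两个过滤器的效果可以如下演示(制表位宽度 4 只是示例取值):

```shell
# expand turns tabs into spaces; -t sets the tab-stop width (default 8).
printf 'a\tb\n' | expand -t 4
# The tab after "a" becomes spaces, padding out to the next tab stop.

# unexpand -a converts runs of spaces back into tabs wherever possible.
printf 'a   b\n' | unexpand -a -t 4
```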

cut

用于从文件中提取字段的工具。 它类似于 awk 中的 print $N 命令集,但功能更有限。 在脚本中使用 cut 可能比 awk 更简单。 特别重要的是 -d(分隔符)和 -f(字段说明符)选项。

使用 cut 获取已挂载文件系统的列表

cut -d ' ' -f1,2 /etc/mtab

使用 cut 列出操作系统和内核版本

uname -a | cut -d" " -f1,3,11,12

使用 cut 从电子邮件文件夹中提取消息标头

bash$ grep '^Subject:' read-messages | cut -c10-80
Re: Linux suitable for mission-critical apps?
 MAKE MILLIONS WORKING AT HOME!!!
 Spam complaint
 Re: Spam complaint

使用 cut 解析文件

# List all the users in /etc/passwd.

FILENAME=/etc/passwd

for user in $(cut -d: -f1 $FILENAME)
do
  echo $user
done

# Thanks, Oleg Philon for suggesting this.

cut -d ' ' -f2,3 filename 等效于 awk -F'[ ]' '{ print $2, $3 }' filename。

Note

甚至可以将换行符指定为分隔符。 诀窍是在命令序列中实际嵌入一个换行符 (RETURN)。

bash$ cut -d'
 ' -f3,7,19 testfile
This is line 3 of testfile.
 This is line 7 of testfile.
 This is line 19 of testfile.
	      

感谢 Jaka Kranjc 指出这一点。

另请参阅 示例 16-48

paste

用于将不同的文件合并到单个多列文件中的工具。与 cut 结合使用,可用于创建系统日志文件。

bash$ cat items
alphabet blocks
 building blocks
 cables

bash$ cat prices
$1.00/dozen
 $2.50 ea.
 $3.75

bash$ paste items prices
alphabet blocks $1.00/dozen
 building blocks $2.50 ea.
 cables  $3.75
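paste 的 -d 和 -s 选项也值得了解,下面是一个小示意(/tmp 下的文件名和数据只是假设的示例):

```shell
# Sample data files.
printf '1\n2\n3\n' > /tmp/nums
printf 'one\ntwo\nthree\n' > /tmp/words

# -d sets the delimiter between columns (default is a tab).
paste -d: /tmp/nums /tmp/words
# 1:one
# 2:two
# 3:three

# -s ("serial") pastes one input file per output line, instead of one per column.
paste -s -d, /tmp/nums
# 1,2,3
```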

join

可以认为它是 paste 的一个特殊用途的近亲。 这个强大的实用程序允许以有意义的方式合并两个文件,这实际上创建了一个简单的关系数据库版本。

join 命令作用于正好两个文件,但只粘贴那些具有公共标记字段(通常是数字标签)的行,并将结果写入stdout。 要连接的文件应根据标记字段排序,以便匹配正常工作。

File: 1.data

100 Shoes
200 Laces
300 Socks

File: 2.data

100 $40.00
200 $1.00
300 $2.00

bash$ join 1.data 2.data
File: 1.data 2.data

 100 Shoes $40.00
 200 Laces $1.00
 300 Socks $2.00
	      

Note

标记字段在输出中只出现一次。
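如果输入文件尚未按标记字段排序,可以先排序再连接。下面用上面 1.data / 2.data 的打乱版本演示(/tmp 路径和数据只是示意):

```shell
# Unsorted versions of the 1.data / 2.data examples above.
printf '200 Laces\n100 Shoes\n' > /tmp/a.data
printf '100 $40.00\n200 $1.00\n' > /tmp/b.data

# join would mis-match on unsorted input, so sort on the tag field first.
sort /tmp/a.data > /tmp/a.sorted
sort /tmp/b.data > /tmp/b.sorted
join /tmp/a.sorted /tmp/b.sorted
# 100 Shoes $40.00
# 200 Laces $1.00
```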

head

将文件的开头列出到stdout。 默认是10行,但可以指定不同的数字。 该命令有许多有趣的选项。

示例 16-13. 哪些文件是脚本?

#!/bin/bash
# script-detector.sh: Detects scripts within a directory.

TESTCHARS=2    # Test first 2 characters.
SHABANG='#!'   # Scripts begin with a "sha-bang."

for file in *  # Traverse all the files in current directory.
do
  if [[ `head -c$TESTCHARS "$file"` = "$SHABANG" ]]
  #      head -c2                      #!
  #  The '-c' option to "head" outputs a specified
  #+ number of characters, rather than lines (the default).
  then
    echo "File \"$file\" is a script."
  else
    echo "File \"$file\" is *not* a script."
  fi
done
  
exit 0

#  Exercises:
#  ---------
#  1) Modify this script to take as an optional argument
#+    the directory to scan for scripts
#+    (rather than just the current working directory).
#
#  2) As it stands, this script gives "false positives" for
#+    Perl, awk, and other scripting language scripts.
#     Correct this.

示例 16-14. 生成 10 位随机数

#!/bin/bash
# rnd.sh: Outputs a 10-digit random number

# Script by Stephane Chazelas.

head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'


# =================================================================== #

# Analysis
# --------

# head:
# -c4 option takes first 4 bytes.

# od:
# -N4 option limits output to 4 bytes.
# -tu4 option selects unsigned decimal format for output.

# sed: 
# -n option, in combination with "p" flag to the "s" command,
# outputs only matched lines.



# The author of this script explains the action of 'sed', as follows.

# head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'
# ----------------------------------> |

# Assume output up to "sed" --------> |
# is 0000000 1198195154\n

#  sed begins reading characters: 0000000 1198195154\n.
#  Here it finds a newline character,
#+ so it is ready to process the first line (0000000 1198195154).
#  It looks at its <range><action>s. The first and only one is

#   range     action
#   1         s/.* //p

#  The line number is in the range, so it executes the action:
#+ tries to substitute the longest string ending with a space in the line
#  ("0000000 ") with nothing (//), and if it succeeds, prints the result
#  ("p" is a flag to the "s" command here, this is different
#+ from the "p" command).

#  sed is now ready to continue reading its input. (Note that before
#+ continuing, if -n option had not been passed, sed would have printed
#+ the line once again).

#  Now, sed reads the remainder of the characters, and finds the
#+ end of the file.
#  It is now ready to process its 2nd line (which is also numbered '$' as
#+ it's the last one).
#  It sees it is not matched by any <range>, so its job is done.

#  In few word this sed commmand means:
#  "On the first line only, remove any character up to the right-most space,
#+ then print it."

# A better way to do this would have been:
#           sed -e 's/.* //;q'

# Here, two <range><action>s (could have been written
#           sed -e 's/.* //' -e q):

#   range                    action
#   nothing (matches line)   s/.* //
#   nothing (matches line)   q (quit)

#  Here, sed only reads its first line of input.
#  It performs both actions, and prints the line (substituted) before
#+ quitting (because of the "q" action) since the "-n" option is not passed.

# =================================================================== #

# An even simpler altenative to the above one-line script would be:
#           head -c4 /dev/urandom| od -An -tu4

exit

另请参阅 示例 16-39。

tail

将文件的末尾(尾部)列出到 stdout。 默认是 10 行,但可以使用 -n 选项更改。 通常与 -f 选项一起使用来跟踪系统日志文件的更改,该选项会输出追加到文件的行。

示例 16-15. 使用 tail 监视系统日志

#!/bin/bash

filename=sys.log

cat /dev/null > $filename; echo "Creating / cleaning out file."
#  Creates the file if it does not already exist,
#+ and truncates it to zero length if it does.
#  : > filename   and   > filename also work.

tail /var/log/messages > $filename  
# /var/log/messages must have world read permission for this to work.

echo "$filename contains tail end of system log."

exit 0

Tip

要列出文本文件的特定行,请将 head 的输出通过管道传给 tail -n 1。 例如,head -n 8 database.txt | tail -n 1 列出文件 database.txt 的第 8 行。

要将变量设置为文本文件中的给定文本块:

var=$(head -n $m $filename | tail -n $n)

# filename = name of file
# m = from beginning of file, number of lines to end of block
# n = number of lines to set variable to (trim from end of block)

Note
tail 的较新实现弃用了较旧的 tail -$LINES filename 用法。 标准 tail -n $LINES filename 是正确的。
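将 head 的输出通过管道传给 tail,就可以提取文件中的某一行;下面用 seq 生成的示例数据演示(文件路径只是示意):

```shell
# Extract line 8 of a file: take the first 8 lines, then the last 1 of those.
seq 20 > /tmp/twenty.txt          # Sample data: lines "1" .. "20".
head -n 8 /tmp/twenty.txt | tail -n 1
# prints: 8
```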

另请参阅 示例 16-5、示例 16-39 和 示例 32-6。

grep

一种使用正则表达式的多用途文件搜索工具。 它最初是古老的 ed 行编辑器中的一个命令/过滤器: g/re/p -- global - regular expression - print(全局 - 正则表达式 - 打印)。

grep pattern [file...]

在目标文件中搜索 pattern 的出现位置,其中 pattern 可以是文字文本或正则表达式。

bash$ grep '[rst]ystem.$' osinfo.txt
The GPL governs the distribution of the Linux operating system.


如果未指定目标文件,则 grep 像在管道中那样,用作 stdout 上的过滤器。

bash$ ps ax | grep clock
765 tty1     S      0:00 xclock
 901 pts/1    S      0:00 grep clock


-i 选项进行不区分大小写的搜索。

-w 选项仅匹配整个单词。

-l 选项仅列出找到匹配项的文件,而不列出匹配的行。

-r(递归)选项搜索当前工作目录及其以下所有子目录中的文件。

-n 选项列出匹配的行,以及行号。

bash$ grep -n Linux osinfo.txt
2:This is a file containing information about Linux.
 6:The GPL governs the distribution of the Linux operating system.


-v(或 --invert-match)选项过滤掉匹配项。

grep pattern1 *.txt | grep -v pattern2

# Matches all lines in "*.txt" files containing "pattern1",
# but ***not*** "pattern2".	      

-c(或 --count)选项给出匹配项的数字计数,而不是实际列出匹配项。

grep -c txt *.sgml   # (number of occurrences of "txt" in "*.sgml" files)


#   grep -cz .
#            ^ dot
# means count (-c) zero-separated (-z) items matching "."
# that is, non-empty ones (containing at least 1 character).
# 
printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz .     # 3
printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '$'   # 5
printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '^'   # 5
#
printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -c '$'    # 9
# By default, newline chars (\n) separate items to match. 

# Note that the -z option is GNU "grep" specific.


# Thanks, S.C.

--color(或 --colour)选项以颜色标记匹配的字符串(在控制台上或在 xterm 窗口中)。 由于 grep 打印出包含匹配模式的每一整行,因此您可以准确地看到什么正在被匹配。 另请参阅 -o 选项,该选项仅显示行中匹配的部分。

示例 16-16. 打印存储的电子邮件消息中的 From 行

#!/bin/bash
# from.sh

#  Emulates the useful 'from' utility in Solaris, BSD, etc.
#  Echoes the "From" header line in all messages
#+ in your e-mail directory.


MAILDIR=~/mail/*               #  No quoting of variable. Why?
# Maybe check if-exists $MAILDIR:   if [ -d $MAILDIR ] . . .
GREP_OPTS="-H -A 5 --color"    #  Show file, plus extra context lines
                               #+ and display "From" in color.
TARGETSTR="^From"              # "From" at beginning of line.

for file in $MAILDIR           #  No quoting of variable.
do
  grep $GREP_OPTS "$TARGETSTR" "$file"
  #    ^^^^^^^^^^              #  Again, do not quote this variable.
  echo
done

exit $?

#  You might wish to pipe the output of this script to 'more'
#+ or redirect it to a file . . .

当使用多个目标文件调用时,grep 会指明哪个文件包含匹配项。

bash$ grep Linux osinfo.txt misc.txt
osinfo.txt:This is a file containing information about Linux.
 osinfo.txt:The GPL governs the distribution of the Linux operating system.
 misc.txt:The Linux operating system is steadily gaining in popularity.


Tip

要强制 grep 在仅搜索一个目标文件时显示文件名,只需给出 /dev/null 作为第二个文件。

bash$ grep Linux osinfo.txt /dev/null
osinfo.txt:This is a file containing information about Linux.
 osinfo.txt:The GPL governs the distribution of the Linux operating system.

如果存在成功的匹配,则 grep 返回 0 的退出状态,这使其在脚本的条件测试中很有用,尤其是在与抑制输出的 -q 选项结合使用时。

SUCCESS=0                      # if grep lookup succeeds
word=Linux
filename=data.file

grep -q "$word" "$filename"    #  The "-q" option
                               #+ causes nothing to echo to stdout.
if [ $? -eq $SUCCESS ]
# if grep -q "$word" "$filename"   can replace lines 5 - 7.
then
  echo "$word found in $filename"
else
  echo "$word not found in $filename"
fi

示例 32-6 演示了如何使用 grep 在系统日志文件中搜索单词模式。

示例 16-17. 在脚本中模拟 grep

#!/bin/bash
# grp.sh: Rudimentary reimplementation of grep.

E_BADARGS=85

if [ -z "$1" ]    # Check for argument to script.
then
  echo "Usage: `basename $0` pattern"
  exit $E_BADARGS
fi  

echo

for file in *     # Traverse all files in $PWD.
do
  output=$(sed -n /"$1"/p $file)  # Command substitution.

  if [ ! -z "$output" ]           # What happens if "$output" is not quoted?
  then
    echo -n "$file: "
    echo "$output"
  fi              #  sed -ne "/$1/s|^|${file}: |p"  is equivalent to above.

  echo
done  

echo

exit 0

# Exercises:
# ---------
# 1) Add newlines to output, if more than one match in any given file.
# 2) Add features.

grep 如何搜索两个(或多个)单独的模式? 如果您希望 grep 显示一个或多个文件中同时包含 "pattern1" 和 "pattern2" 的所有行,该怎么办?

一种方法是将 grep pattern1 的结果通过管道传给 grep pattern2。

例如,给定以下文件:

# Filename: tstfile

This is a sample file.
This is an ordinary text file.
This file does not contain any unusual text.
This file is not unusual.
Here is some text.

现在,让我们在这个文件中搜索同时包含 "file" 和 "text" 的行 . . .

bash$ grep file tstfile
# Filename: tstfile
 This is a sample file.
 This is an ordinary text file.
 This file does not contain any unusual text.
 This file is not unusual.

bash$ grep file tstfile | grep text
This is an ordinary text file.
 This file does not contain any unusual text.

现在,来看 grep 的一个有趣的娱乐用途 . . .

示例 16-18. 纵横字谜求解器

#!/bin/bash
# cw-solver.sh
# This is actually a wrapper around a one-liner (line 46).

#  Crossword puzzle and anagramming word game solver.
#  You know *some* of the letters in the word you're looking for,
#+ so you need a list of all valid words
#+ with the known letters in given positions.
#  For example: w...i....n
#               1???5????10
# w in position 1, 3 unknowns, i in the 5th, 4 unknowns, n at the end.
# (See comments at end of script.)


E_NOPATT=71
DICT=/usr/share/dict/word.lst
#                    ^^^^^^^^   Looks for word list here.
#  ASCII word list, one word per line.
#  If you happen to need an appropriate list,
#+ download the author's "yawl" word list package.
#  http://ibiblio.org/pub/Linux/libs/yawl-0.3.2.tar.gz
#  or
#  http://bash.deta.in/yawl-0.3.2.tar.gz


if [ -z "$1" ]   #  If no word pattern specified
then             #+ as a command-line argument . . .
  echo           #+ . . . then . . .
  echo "Usage:"  #+ Usage message.
  echo
  echo ""$0" \"pattern,\""
  echo "where \"pattern\" is in the form"
  echo "xxx..x.x..."
  echo
  echo "The x's represent known letters,"
  echo "and the periods are unknown letters (blanks)."
  echo "Letters and periods can be in any position."
  echo "For example, try:   sh cw-solver.sh w...i....n"
  echo
  exit $E_NOPATT
fi

echo
# ===============================================
# This is where all the work gets done.
grep ^"$1"$ "$DICT"   # Yes, only one line!
#    |    |
# ^ is start-of-word regex anchor.
# $ is end-of-word regex anchor.

#  From _Stupid Grep Tricks_, vol. 1,
#+ a book the ABS Guide author may yet get around
#+ to writing . . . one of these days . . .
# ===============================================
echo


exit $?  # Script terminates here.
#  If there are too many words generated,
#+ redirect the output to a file.

$ sh cw-solver.sh w...i....n

wellington
workingman
workingmen

egrep -- extended grep(扩展 grep)-- 与 grep -E 相同。 它使用一组稍有不同的、扩展的正则表达式,可以使搜索更加灵活。 它还允许使用布尔 |(或)运算符。

bash $ egrep 'matches|Matches' file.txt
Line 1 matches.
 Line 3 Matches.
 Line 4 contains matches, but also Matches

fgrep -- fast grep(快速 grep)-- 与 grep -F 相同。 它执行文字字符串搜索(不使用正则表达式),这通常会使速度快一些。

Note

在某些 Linux 发行版上,egrep 和 fgrep 是指向 grep 的符号链接或别名,但分别以 -E 和 -F 选项调用。

示例 16-19. 在韦氏 1913 词典中查找定义

#!/bin/bash
# dict-lookup.sh

#  This script looks up definitions in the 1913 Webster's Dictionary.
#  This Public Domain dictionary is available for download
#+ from various sites, including
#+ Project Gutenberg (http://www.gutenberg.org/etext/247).
#
#  Convert it from DOS to UNIX format (with only LF at end of line)
#+ before using it with this script.
#  Store the file in plain, uncompressed ASCII text.
#  Set DEFAULT_DICTFILE variable below to path/filename.


E_BADARGS=85
MAXCONTEXTLINES=50                        # Maximum number of lines to show.
DEFAULT_DICTFILE="/usr/share/dict/webster1913-dict.txt"
                                          # Default dictionary file pathname.
                                          # Change this as necessary.
#  Note:
#  ----
#  This particular edition of the 1913 Webster's
#+ begins each entry with an uppercase letter
#+ (lowercase for the remaining characters).
#  Only the *very first line* of an entry begins this way,
#+ and that's why the search algorithm below works.



if [[ -z $(echo "$1" | sed -n '/^[A-Z]/p') ]]
#  Must at least specify word to look up, and
#+ it must start with an uppercase letter.
then
  echo "Usage: `basename $0` Word-to-define [dictionary-file]"
  echo
  echo "Note: Word to look up must start with capital letter,"
  echo "with the rest of the word in lowercase."
  echo "--------------------------------------------"
  echo "Examples: Abandon, Dictionary, Marking, etc."
  exit $E_BADARGS
fi


if [ -z "$2" ]                            #  May specify different dictionary
                                          #+ as an argument to this script.
then
  dictfile=$DEFAULT_DICTFILE
else
  dictfile="$2"
fi

# ---------------------------------------------------------
Definition=$(fgrep -A $MAXCONTEXTLINES "$1 \\" "$dictfile")
#                  Definitions in form "Word \..."
#
#  And, yes, "fgrep" is fast enough
#+ to search even a very large text file.


# Now, snip out just the definition block.

echo "$Definition" |
sed -n '1,/^[A-Z]/p' |
#  Print from first line of output
#+ to the first line of the next entry.
sed '$d' | sed '$d'
#  Delete last two lines of output
#+ (blank line and first line of next entry).
# ---------------------------------------------------------

exit $?

# Exercises:
# ---------
# 1)  Modify the script to accept any type of alphabetic input
#   + (uppercase, lowercase, mixed case), and convert it
#   + to an acceptable format for processing.
#
# 2)  Convert the script to a GUI application,
#   + using something like 'gdialog' or 'zenity' . . .
#     The script will then no longer take its argument(s)
#   + from the command-line.
#
# 3)  Modify the script to parse one of the other available
#   + Public Domain Dictionaries, such as the U.S. Census Bureau Gazetteer.

另请参阅 示例 A-41,以获取在大型文本文件上进行快速 fgrep 查找的示例。

agrep (approximate grep,近似 grep) 将 grep 的功能扩展到近似匹配。 搜索字符串与结果匹配项可能相差指定数量的字符。 此实用程序不是核心 Linux 发行版的一部分。

Tip

要搜索压缩文件,请使用 zgrep、zegrep 或 zfgrep。 这些也适用于未压缩的文件,只是比普通的 grep、egrep、fgrep 慢。 它们便于搜索混合的文件集,其中一些是压缩的,另一些则不是。 要搜索 bzip 压缩的文件,请使用 bzgrep。

look

look 命令的工作方式类似于 grep,但是在"字典"(已排序的单词列表)上执行查找。 默认情况下,look 在 /usr/dict/words 中搜索匹配项,但也可以指定不同的字典文件。

示例 16-20. 检查列表中的单词是否有效

#!/bin/bash
# lookup: Does a dictionary lookup on each word in a data file.

file=words.data  # Data file from which to read words to test.

echo
echo "Testing file $file"
echo

while [ "$word" != end ]  # Last word in data file.
do               # ^^^
  read word      # From data file, because of redirection at end of loop.
  look $word > /dev/null  # Don't want to display lines in dictionary file.
  #  Searches for words in the file /usr/share/dict/words
  #+ (usually a link to linux.words).
  lookup=$?      # Exit status of 'look' command.

  if [ "$lookup" -eq 0 ]
  then
    echo "\"$word\" is valid."
  else
    echo "\"$word\" is invalid."
  fi  

done <"$file"    # Redirects stdin to $file, so "reads" come from there.

echo

exit 0

# ----------------------------------------------------------------
# Code below line will not execute because of "exit" command above.


# Stephane Chazelas proposes the following, more concise alternative:

while read word && [[ $word != end ]]
do if look "$word" > /dev/null
   then echo "\"$word\" is valid."
   else echo "\"$word\" is invalid."
   fi
done <"$file"

exit 0

sed, awk

特别适合于解析文本文件和命令输出的脚本语言。 可以单独或组合嵌入在管道和 shell 脚本中。

sed

非交互式"流编辑器",允许在批处理模式下使用许多 ex 命令。 它在 shell 脚本中找到了许多用途。

wc

wc 用于对文件或 I/O 流进行"字数统计":

bash $ wc /usr/share/doc/sed-4.1.2/README
13  70  447 README
[13 lines  70 words  447 characters]

wc -w仅给出单词计数。

wc -l仅给出行数计数。

wc -c仅给出字节计数。

wc -m仅给出字符计数。

wc -L仅给出最长行的长度。

使用 wc 来统计当前工作目录中有多少个 .txt 文件:

$ ls *.txt | wc -l
#  Will work as long as none of the "*.txt" files
#+ have a linefeed embedded in their name.

#  Alternative ways of doing this are:
#      find . -maxdepth 1 -name \*.txt -print0 | grep -cz .
#      (shopt -s nullglob; set -- *.txt; echo $#)

#  Thanks, S.C.

使用 wc 来统计所有名称以字母 d - h 开头的文件的大小总和

bash$ wc [d-h]* | grep total | awk '{print $3}'
71832
	      

使用 wc 来统计本书的主要源文件中 "Linux" 一词出现的次数。

bash$ grep Linux abs-book.sgml | wc -l
138
	      

另请参阅 示例 16-39 和 示例 20-8。

某些命令包含 wc 的部分功能作为选项。

... | grep foo | wc -l
# This frequently used construct can be more concisely rendered.

... | grep -c foo
# Just use the "-c" (or "--count") option of grep.

# Thanks, S.C.

tr

字符转换过滤器。

Caution

必须适当地使用引号和/或括号。引号可防止 shell 重新解释 tr 命令序列中的特殊字符。括号应加上引号以防止 shell 扩展。

以下任一种方式: tr "A-Z" "*" <filename 或者 tr A-Z \* <filename,都会将 filename 中的所有大写字母更改为星号(写入到 stdout)。 在某些系统上这可能无法工作,但 tr A-Z '[**]' 可以。

-d 选项删除一定范围的字符。

echo "abcdef"                 # abcdef
echo "abcdef" | tr -d b-d     # aef


tr -d 0-9 <filename
# Deletes all digits from the file "filename".

--squeeze-repeats(或 -s)选项删除连续字符串中除第一个实例之外的所有重复实例。 此选项对于删除多余的空白很有用。

bash$ echo "XXXXX" | tr --squeeze-repeats 'X'
X
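清理多余空白的一个典型用法如下(句子内容只是示意):

```shell
# Squeeze each run of repeated spaces down to a single space.
echo "too    many     spaces" | tr -s ' '
# too many spaces
```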

-c "complement"(补集)选项反转要匹配的字符集。 使用此选项,tr 仅对不匹配指定集合的字符起作用。

bash$ echo "acfdeb123" | tr -c b-d +
+c+d+b++++

请注意,tr 可以识别 POSIX 字符类。 [1]

bash$ echo "abcd2ef1" | tr '[:alpha:]' -
----2--1
	      

示例 16-21. toupper:将文件转换为全大写。

#!/bin/bash
# Changes a file to all uppercase.

E_BADARGS=85

if [ -z "$1" ]  # Standard check for command-line arg.
then
  echo "Usage: `basename $0` filename"
  exit $E_BADARGS
fi  

tr a-z A-Z <"$1"

# Same effect as above, but using POSIX character set notation:
#        tr '[:lower:]' '[:upper:]' <"$1"
# Thanks, S.C.

#     Or even . . .
#     cat "$1" | tr a-z A-Z
#     Or dozens of other ways . . .

exit 0

#  Exercise:
#  Rewrite this script to give the option of changing a file
#+ to *either* upper or lowercase.
#  Hint: Use either the "case" or "select" command.

示例 16-22. lowercase:将工作目录中的所有文件名更改为小写。

#!/bin/bash
#
#  Changes every filename in working directory to all lowercase.
#
#  Inspired by a script of John Dubois,
#+ which was translated into Bash by Chet Ramey,
#+ and considerably simplified by the author of the ABS Guide.


for filename in *                # Traverse all files in directory.
do
   fname=`basename $filename`
   n=`echo $fname | tr A-Z a-z`  # Change name to lowercase.
   if [ "$fname" != "$n" ]       # Rename only files not already lowercase.
   then
     mv $fname $n
   fi  
done   

exit $?


# Code below this line will not execute because of "exit".
#--------------------------------------------------------#
# To run it, delete script above line.

# The above script will not work on filenames containing blanks or newlines.
# Stephane Chazelas therefore suggests the following alternative:


for filename in *    # Not necessary to use basename,
                     # since "*" won't return any file containing "/".
do n=`echo "$filename/" | tr '[:upper:]' '[:lower:]'`
#                             POSIX char set notation.
#                    Slash added so that trailing newlines are not
#                    removed by command substitution.
   # Variable substitution:
   n=${n%/}          # Removes trailing slash, added above, from filename.
   [[ $filename == $n ]] || mv "$filename" "$n"
                     # Checks if filename already lowercase.
done

exit $?

示例 16-23. du:DOS 到 UNIX 文本文件转换。

#!/bin/bash
# Du.sh: DOS to UNIX text file converter.

E_WRONGARGS=85

if [ -z "$1" ]
then
  echo "Usage: `basename $0` filename-to-convert"
  exit $E_WRONGARGS
fi

NEWFILENAME=$1.unx

CR='\015'  # Carriage return.
           # 015 is octal ASCII code for CR.
           # Lines in a DOS text file end in CR-LF.
           # Lines in a UNIX text file end in LF only.

tr -d $CR < $1 > $NEWFILENAME
# Delete CR's and write to new file.

echo "Original DOS text file is \"$1\"."
echo "Converted UNIX text file is \"$NEWFILENAME\"."

exit 0

# Exercise:
# --------
# Change the above script to convert from UNIX to DOS.

示例 16-24. rot13:超弱加密。

#!/bin/bash
# rot13.sh: Classic rot13 algorithm,
#           encryption that might fool a 3-year old
#           for about 10 minutes.

# Usage: ./rot13.sh filename
# or     ./rot13.sh <filename
# or     ./rot13.sh and supply keyboard input (stdin)

cat "$@" | tr 'a-zA-Z' 'n-za-mN-ZA-M'   # "a" goes to "n", "b" to "o" ...
#  The   cat "$@"   construct
#+ permits input either from stdin or from files.

exit 0

示例 16-25. 生成 "密码引用" 谜题

#!/bin/bash
# crypto-quote.sh: Encrypt quotes

#  Will encrypt famous quotes in a simple monoalphabetic substitution.
#  The result is similar to the "Crypto Quote" puzzles
#+ seen in the Op Ed pages of the Sunday paper.


key=ETAOINSHRDLUBCFGJMQPVWZYXK
# The "key" is nothing more than a scrambled alphabet.
# Changing the "key" changes the encryption.

# The 'cat "$@"' construction gets input either from stdin or from files.
# If using stdin, terminate input with a Control-D.
# Otherwise, specify filename as command-line parameter.

cat "$@" | tr "a-z" "A-Z" | tr "A-Z" "$key"
#        |  to uppercase  |     encrypt       
# Will work on lowercase, uppercase, or mixed-case quotes.
# Passes non-alphabetic characters through unchanged.


# Try this script with something like:
# "Nothing so needs reforming as other people's habits."
# --Mark Twain
#
# Output is:
# "CFPHRCS QF CIIOQ MINFMBRCS EQ FPHIM GIFGUI'Q HETRPQ."
# --BEML PZERC

# To reverse the encryption:
# cat "$@" | tr "$key" "A-Z"


#  This simple-minded cipher can be broken by an average 12-year old
#+ using only pencil and paper.

exit 0

#  Exercise:
#  --------
#  Modify the script so that it will either encrypt or decrypt,
#+ depending on command-line argument(s).

当然,tr 适用于 代码混淆

#!/bin/bash
# jabh.sh

x="wftedskaebjgdBstbdbsmnjgz"
echo $x | tr "a-z" 'oh, turtleneck Phrase Jar!'

# Based on the Wikipedia "Just another Perl hacker" article.

fold

一个过滤器,它将输入行折行到指定的宽度。 它与 -s 选项结合使用时尤其有用,该选项在单词空格处断行(参见 示例 16-26 和 示例 A-1)。
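两种断行方式的差别可以如下演示(宽度 12 和句子内容只是示例取值):

```shell
str="the quick brown fox jumps over the lazy dog"

echo "$str" | fold -w 12       # May break in the middle of a word.
echo "$str" | fold -s -w 12    # -s breaks only at spaces (whole words kept).
```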

fmt

简单的文件格式化程序,用作管道中的过滤器以 "包装" 长文本输出行。

示例 16-26. 格式化的文件列表。

#!/bin/bash

WIDTH=40                    # 40 columns wide.

b=`ls /usr/local/bin`       # Get a file listing...

echo $b | fmt -w $WIDTH

# Could also have been done by
#    echo $b | fold - -s -w $WIDTH
 
exit 0

另请参阅 示例 16-5

Tip

Kamil Toman 的 par 实用程序是 fmt 的一个强大的替代方案,可从 http://www.cs.berkeley.edu/~amc/Par/ 获取。

col

这个名称具有欺骗性的过滤器从输入流中删除反向换行符。它还会尝试用等效的制表符替换空格。col 的主要用途是过滤来自某些文本处理实用程序(如 grofftbl)的输出。

column

列格式化程序。此过滤器通过在适当的位置插入制表符,将列表类型的文本输出转换为"漂亮打印"的表格。

示例 16-27. 使用 column 格式化目录列表

#!/bin/bash
# colms.sh
# A minor modification of the example file in the "column" man page.


(printf "PERMISSIONS LINKS OWNER GROUP SIZE MONTH DAY HH:MM PROG-NAME\n" \
; ls -l | sed 1d) | column -t
#         ^^^^^^           ^^

#  The "sed 1d" in the pipe deletes the first line of output,
#+ which would be "total        N",
#+ where "N" is the total number of files found by "ls -l".

# The -t option to "column" pretty-prints a table.

exit 0
colrm

列删除过滤器。 它从文件中删除列(字符),并将缺少指定列范围的文件写回到 stdout。 colrm 2 4 <filename 从文本文件 filename 的每一行中删除第 2 到第 4 个字符。

Caution

如果文件包含制表符或不可打印字符,这可能会导致不可预测的行为。在这种情况下,请考虑在 colrm 之前的管道中使用 expandunexpand
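上述建议可以这样落实:先用 expand 将制表符展开为空格,再交给 colrm,这样列号的计算才可靠(列号与数据只是示意):

```shell
# A tab would throw off colrm's column arithmetic, so expand it first.
printf 'ab\tcdef\n' | expand -t 4 | colrm 2 4
# The expanded line is "ab  cdef"; removing columns 2-4 leaves "acdef".
```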

nl

行编号过滤器: nl filename 将 filename 列出到 stdout,但在每个非空白行的开头插入连续的行号。 如果省略 filename,则对 stdin 操作。

nl 的输出非常类似于cat -b,因为默认情况下 nl 不列出空白行。

示例 16-28. nl:一个自我编号的脚本。

#!/bin/bash
# line-number.sh

# This script echoes itself twice to stdout with its lines numbered.

echo "     line number = $LINENO" # 'nl' sees this as line 4
#                                   (nl does not number blank lines).
#                                   'cat -n' sees it correctly as line #6.

nl `basename $0`

echo; echo  # Now, let's try it with 'cat -n'

cat -n `basename $0`
# The difference is that 'cat -n' numbers the blank lines.
# Note that 'nl -ba' will also do so.

exit 0
# -----------------------------------------------------------------
pr

打印格式化过滤器。这会将文件(或stdout)分页成适合硬拷贝打印或在屏幕上查看的部分。各种选项允许行和列操作、连接行、设置边距、编号行、添加页眉以及合并文件等。pr 命令结合了 nlpastefoldcolumnexpand 的大部分功能。

pr -o 5 --width=65 fileZZZ | more 给出 fileZZZ 的一个漂亮的分页屏幕列表,左边距设置为 5,宽度设置为 65。

一个特别有用的选项是-d,强制双倍行距(与 sed -G 效果相同)。

gettext

GNU gettext 软件包是一组用于本地化和将程序的文本输出翻译成外语的实用程序。虽然最初用于 C 程序,但现在它支持相当多的编程和脚本语言。

gettext 程序同样适用于 shell 脚本。 参见其 info 页面。

msgfmt

一个用于生成二进制消息目录的程序。 它用于本地化。

iconv

一个用于将文件转换为不同编码(字符集)的实用程序。 它的主要用途是本地化。

# Convert a string from UTF-8 to UTF-16 and print to the BookList
function write_utf8_string {
    STRING=$1
    BOOKLIST=$2
    echo -n "$STRING" | iconv -f UTF8 -t UTF16 | \
    cut -b 3- | tr -d \\n >> "$BOOKLIST"
}

#  From Peter Knowles' "booklistgen.sh" script
#+ for converting files to Sony Librie/PRS-50X format.
#  (http://booklistgensh.peterknowles.com)

recode

可以将此视为上面 iconv 的一个更高级版本。这是一个非常通用的实用程序,用于将文件转换为不同的编码方案。请注意,recode 不是标准 Linux 安装的一部分。

TeX, gs

TeXPostscript 是用于准备打印副本或格式化视频显示的文本标记语言。

TeX 是 Donald Knuth 精心设计的排版系统。通常方便地编写一个 shell 脚本,封装所有传递给这些标记语言之一的选项和参数。

Ghostscript (gs) 是一个 GPL 许可的 Postscript 解释器。

texexec

用于处理 TeX 和 pdf 文件的实用程序。 在许多 Linux 发行版上,它位于 /usr/bin 中,实际上是一个 shell 包装器,它调用 Perl 来启动 TeX。

texexec --pdfarrange --result=Concatenated.pdf *pdf

#  Concatenates all the pdf files in the current working directory
#+ into the merged file, Concatenated.pdf . . .
#  (The --pdfarrange option repaginates a pdf file. See also --pdfcombine.)
#  The above command-line could be parameterized and put into a shell script.

enscript

用于将纯文本文件转换为 PostScript 的实用程序

例如,enscript filename.txt -p filename.ps 生成 PostScript 输出文件 filename.ps。

groff, tbl, eqn

另一种文本标记和显示格式化语言是 groff。 它是古老的 UNIX roff/troff 显示和排版软件包的增强 GNU 版本。 手册页使用 groff。

tbl 表处理实用程序被认为是 groff 的一部分,因为它的功能是将表标记转换为 groff 命令。

eqn 方程处理实用程序同样是 groff 的一部分,其功能是将方程标记转换为 groff 命令。

示例 16-29. manview:查看格式化的手册页

#!/bin/bash
# manview.sh: Formats the source of a man page for viewing.

#  This script is useful when writing man page source.
#  It lets you look at the intermediate results on the fly
#+ while working on it.

E_WRONGARGS=85

if [ -z "$1" ]
then
  echo "Usage: `basename $0` filename"
  exit $E_WRONGARGS
fi

# ---------------------------
groff -Tascii -man $1 | less
# From the man page for groff.
# ---------------------------

#  If the man page includes tables and/or equations,
#+ then the above code will barf.
#  The following line can handle such cases.
#
#   gtbl < "$1" | geqn -Tlatin1 | groff -Tlatin1 -mtty-char -man
#
#   Thanks, S.C.

exit $?   # See also the "maned.sh" script.

另请参阅 示例 A-39

lex, yacc

lex 词法分析器生成用于模式匹配的程序。在 Linux 系统上,它已被非专有的 flex 取代。

yacc 实用程序基于一组规范创建一个解析器。在 Linux 系统上,它已被非专有的 bison 取代。

注释

[1]

这仅适用于 GNU 版本的 tr,而不适用于商业 UNIX 系统上常见的通用版本。