Bash String Processing

Introduction

Over the years, the bash shell has acquired lots of new bells and whistles. Some of these are very useful in shell scripts; but they don't seem well known. This page mostly discusses the newer ones, especially those that modify strings. In some cases, they provide useful alternatives to such old standbys as tr and sed, with gains in speed. The general theme is avoiding pipelines.

In the examples below, I'll assume that string is a shell variable that contains some character string. It might be as short as a single character, or it might be the contents of a whole document.

Case Conversions

One of the obscure enhancements that can be discovered by reading the man page for bash is the case-conversion pair:

	newstring=${string^^}	# the string, converted to UPPER CASE
	newstring=${string,,}	# the string, converted to lower case

(You can also convert just the first letter by using a single ^ or , .) Notice that the original variable, string, is not changed.

Normally, we think of doing this by using the tr command:

	newstring=`echo "$string" | tr '[a-z]' '[A-Z]'`
	newstring=`echo "$string" | tr '[A-Z]' '[a-z]'`

Of course, that involves spawning a new process. Actually, as the man page for tr tells you, this isn't optimal; depending on your locale setting, you might get unexpected results. It's safer to say

	newstring=`echo "$string" | tr '[:lower:]' '[:upper:]'`
	newstring=`echo "$string" | tr '[:upper:]' '[:lower:]'`

Using tr is certainly more readable; but it also takes a lot longer to type. How about execution time?

Timing tests

Here's a code snippet that does nothing, a hundred thousand times:

	str1=X

	i=0
	time (
	while [ $i -lt 100000 ]
	do
		let i++
	done
	)

On my machine — currently, a 3 GHz (6000 bogomips) dual-core Pentium box — that takes about 1.57 seconds. That's the bash overhead for running the useless loop. Nearly all of that is “user” time; the “sys” time is only a few dozen milliseconds.

Now let's add the line

	str2=${str1^^}

to the loop, just after the let statement. The execution time jumps to about 2.3 seconds; so executing the added line 100,000 times took about 0.7 second. That's about 7 microseconds per execution.

Now, let's try putting the line

	str2=`echo "$str1" | tr '[:lower:]' '[:upper:]'`

inside the loop instead. The execution time is now a whopping 1m 33s of real time — but only 3 seconds of user and 7 sec of system time! Apparently, the system gives both bash and tr a thousand one-millisecond time-slices a second, and then takes a vacation until the next round millisecond comes up.

If we try to even things up a bit by making the initial string longer, we find practically the same times for the version using tr, but about 0.2 second longer than before for the all-shell version, if the string to convert is "Hello, world!". Clearly, we need a really big string to measure bash's speed accurately.

So let's initialize the original string with the line

	str1=`cat /usr/share/dict/american-english`

which is a text file of 931708 characters. For this big file, a single cycle through the loop is enough: it takes bash about 45.7 seconds, all but a few milliseconds of which is “user” time. On the other hand, the tr version takes only 0.24 seconds to process the big text file.

Clearly, there's a trade-off here that depends on the size of the string to be converted. Evidently, the context switch required to invoke tr is the bottleneck when the string is short; but tr is so much more efficient than bash in converting big strings that it's faster when the string exceeds a few thousand characters. I find my machine takes about 1.55 milliseconds to process a string about 4100 characters long, regardless of which method is used. (About a quarter of a millisecond is used by the system when tr is invoked; presumably, that's the time required to set up the pipeline and make the context switch.)

sed-like Substitutions

Likewise, you can often make bash act enough like sed to avoid using a pipeline. The syntax is

	newstring=${oldstring/pattern/replacement}

Notice that there is no trailing slash, as in sed or vi: the closing brace terminates the substitution string.

The catch is that only shell-type patterns (like those used in pathname expansion) can be used, not the elaborate regular expressions recognized by sed. Also, only a single replacement normally occurs; but you can simulate a “global” replacement by using two slashes before the pattern:

	newstring=${oldstring//pattern/replacement}

A handy use for this trick is in sanitizing user input. For example, you might want to convert a filename into a form that's safe to use as (part of) a shell-variable name: filenames can contain hyphens and other special characters that are not allowed in variable names, which can only be alphanumeric. So, to clean up a dirty string:

	clean=${dirty//[-+=.,]/_}

If we had set dirty='a,b.c=d-e+f', the line above converts the dangerous characters to underscores, forming the clean string: a_b_c_d_e_f, which can be used safely in a shell script.

And you can omit the replacement string, thereby deleting the offensive parts entirely. So, for example,

	cleaned=${dirty//[-+=.,]}

is equivalent to

	cleaned=`echo $dirty | sed -e 's/[-+=.,]//g'`

or

	cleaned=`echo $dirty | tr -d '+=.,-'`

where we have to put the hyphen last so tr won't think it's an option.

Be careful: sed and tr allow the use of ranges like 'A-Z' and '0-9' ; but bash requires you to either enumerate these, or to use character classes like [:upper:] or [:digit:] within the brackets that define the pattern list.

You can even force the pattern to appear at the beginning or the end of the string being edited, by prefixing pattern with # (for the start) or % (for the end).

Faking basename and dirname

This use of # to mark the beginning of an edited string, and % for the end, can also be used to simulate the basename and dirname commands in shell scripts:

	dirpath=${path%/*}

extracts the part of the path variable before the last slash; and

	base=${path##*/}

yields the part after the last slash. CAUTION : Notice that the asterisk goes between the slash and the ##, but after the %.

That's because

	${varname#pattern}

trims the shortest prefix from the contents of the shell variable varname that matches the shell-pattern pattern ; and

	${varname##pattern}

trims the longest prefix that matches the pattern from the contents of the shell variable. Likewise,

	${varname%pattern}

trims the shortest suffix from the contents of the shell variable varname that matches the shell-pattern pattern ; and

	${varname%%pattern}

trims the longest suffix that matches the pattern from the contents of the shell variable. You can see that the general rule here is: a single # or % to match the shortest part; or a double ## or %% to match the longest part.

But be careful. If you just feed a bare filename instead of a pathname to dirname, you get just a dot [.]; but if there are no slashes in the variable you process with the hack above, you get the filename back, unaltered: because there were no slashes in it, nothing got removed. So this trick isn't a complete replacement for dirname.

Another use of basename is to remove a suffix from a filename. We often need to do this in shell scripts when we want to generate an output file with the same basename but a different extension from an input file. For example, to convert file.old to file.new, you could use

	newname=`basename $oldname .old`.new

so that, if you had set oldname to file.old , newname would be set to file.new . But it's faster to say

	newname=${oldname%.old}.new

(Notice that we have to use the % operation here, even though the generic replacement for basename given above uses the ## operation. That's because we're trimming off a suffix rather than a prefix, in this case.) If you didn't know the old file extension, you could still replace it by saying

	newname=${oldname%.*}.new

This way of trimming off a prefix or a suffix is also useful for separating numbers that contain a decimal point into the integer and fractional parts. For example, if we set DECIMAL=123.4567, we can get the part before the decimal as

	INTEGER=${DECIMAL%.*}

and the digits of the fraction as

	FRACT=${DECIMAL#*.}

Numerical operations

Speaking of digits, you can also perform simple integer arithmetic in bash without having to invoke another process, such as expr. Remember that the let operation automatically invokes arithmetic evaluation on its operands. So

	let sum=5+2

will store 7 in sum. Of course, the operands on the right side can just as well be shell variables; so, if x and y are numerical, you could

	let sum=x+y

which is both more compact and faster than

	sum=`expr $x + $y`

If you want to space the expression out for better readability, you can say

	let "sum = x + y"

and bash will do the right thing. (You have to use quotes so that let has just a single argument. If you don't like the quotes, you can say

	sum=$(( x + y ))

but then you can't have spaces around the = sign.)

This way of doing arithmetic is a lot more readable than using expr — especially when you're doing multiplications, because expr has to have its arguments separated by whitespace, so the asterisk[*] has to be quoted:

	product=`expr $x \* $y`

Yuck. Pretty ugly, compared to

	let "product = x * y"

Finally, when you need to increment a counter, you can say

	let i++

or

	let j+=2

which is cleaner, faster, and more readable than invoking expr.

Sub-strings

In addition to truncating prefixes and suffixes, bash can extract sub-strings. To get the 2 characters that follow the first 5 in a string, you can say

	${string:5:2}

for example.

This can save a lot of work when parsing replies to shell-script questions. If the shell script asks a yes/no question, you only need to check the first letter of the reply. Then

	init=${string:0:1}

is what you want to test. (This gives you 1 character, starting at position 0 — in other words, the first character of the string.)

If the “offset” parameter is −1, the substring begins at the last character of the string; so

	last=${string: −1:1}

gives you just the last character. (Note the space that's needed to separate the colon from the minus sign; this is required to avoid confusion with the colon-minus sequence used in specifying a default value.)

To get the last 2 characters, you should specify

	last2=${string: −2:2} ;

note that

	penult=${string: −2:1}

gives you the next -to-last character.

Replacing wc

Many invocations of wc can be avoided, especially when the object to be measured is small. Of course, you should avoid operating on a file directly with wc in constructions like

	size=`wc -c somefile`

because this captures the user-friendly repetition of the filename in the output. Instead, you want to re-direct the input to wc:

	size=`wc -c < somefile`

But if the operand is already in a shell variable, you certainly don't want to do this:

	size=`echo -n "$string" | wc -c`

— particularly if the string is short — because bash can do the job itself:

	size=${#string}

It's even possible to make bash fake wc -w, if you don't mind sacrificing the positional parameters:

	set $string
	nwords=$#

Copyright © 2011 – 2012 Andrew T. Young