Introduction
Most simple iterative programming tasks can be rephrased in terms of a 'map' or 'reduce' operation. The advantage is that once we've worked our problem into a map-reduce shape, we can easily increase parallelism.
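To make that concrete, here's a deliberately tiny sketch with some made-up input: first an explicit loop that accumulates a count, then the same computation written as a map of words onto lines followed by a reducing count.
# imperative: count the words containing 'foo' with a loop
count=0
for word in foo bar foobar; do
    [[ $word == *foo* ]] && count=$(( count + 1 ))
done
echo "$count"    # prints 2

# map-reduce style: map the words onto lines, reduce with a count
printf '%s\n' foo bar foobar | grep -c foo    # also prints 2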
A simple example
Let's check a couple of Wikipedia pages for mentions of 'Unix'. The most straightforward, imperative way to do this is something along these lines:
for url in Linux ANSI_escape_code Shell; do
    page="$( curl -s "https://en.wikipedia.org/wiki/$url" )"
    if [[ $page =~ Unix ]]; then
        echo "$url mentions Unix!"
    fi
done
If you're familiar with shell scripts, you may already see some immediate improvements we could make. If all we care about is whether the page contains 'Unix', we could just pipe the curl output to grep.
for url in Linux ANSI_escape_code Shell; do
    if curl -s "https://en.wikipedia.org/wiki/$url" | grep -q 'Unix'; then
        echo "$url mentions Unix!"
    fi
done
With this approach, it's not even necessary to download the rest of the article if we encounter 'Unix' early. grep will return success and stop reading the input from curl, so curl will stop downloading any more of the article. If we allow error reporting from curl, we can see this is the case:
$ curl -sS "https://en.wikipedia.org/wiki/Linux" | grep -q Unix && echo 'Found it!'
curl: (23) Failed writing body (161 != 1300)
Found it!
We can hit this even sooner with the -N 'no buffering' option.
$ curl -sSN "https://en.wikipedia.org/wiki/Linux" | grep -q Unix && echo 'Found it!'
curl: (23) Failed writing body (0 != 397)
Found it!
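The 'Failed writing body' message is curl telling us that grep closed the pipe early. If we'd rather confirm that programmatically than read stderr, bash's PIPESTATUS array records the exit status of each command in the most recent pipeline:
$ curl -sS "https://en.wikipedia.org/wiki/Linux" | grep -q Unix; echo "${PIPESTATUS[@]}"
The first number should come back as 23, curl's 'write error' code from the messages above, and the second as 0, meaning grep found its match and closed the pipe.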
Pipelines
This works for a single article at a time, but what about all of them? It would be nice to break this down into a single pipeline, and we can with xargs. echo allows multi-line strings, and xargs expects separate inputs to come on separate lines, so we can use them together to pass an arbitrary list of elements.
$ echo 'Linux
Shell
ANSI_escape_code' | xargs -i echo curl "https://en.wikipedia.org/wiki/{}"
curl https://en.wikipedia.org/wiki/Linux
curl https://en.wikipedia.org/wiki/Shell
curl https://en.wikipedia.org/wiki/ANSI_escape_code
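A quick aside: -i is the older spelling of this option in GNU xargs, and recent versions document it as deprecated in favor of the explicit -I form, so the same command can also be written as:
$ echo 'Linux
Shell
ANSI_escape_code' | xargs -I{} echo curl "https://en.wikipedia.org/wiki/{}"
The output is identical to the run above.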
Let's clean this up and search for mentions of 'Unix' again.
list() {
    for item in "$@"; do
        echo "$item"
    done
}
map() {
    xargs -i "$@"
}
list Linux ANSI_escape_code Shell \
| map curl -s "https://en.wikipedia.org/wiki/{}" \
| grep -q 'Unix' \
&& echo 'Found it!'
This looks nice, and it completes the same task as everything above. Helper
functions aside, this solution is half as many lines as the original, and gives
us the option of early stopping. It also gives us the option to parallelize:
all of the articles could be downloaded concurrently, since grep can easily
handle more input at once.
A more interesting example
Let's try something more interesting. On these three pages, what are the most common words around mentions of 'Unix'? The steps we need to take are:
- Download the files
- Pick out the space-delimited words containing 'Unix'
- Remove duplicates and count instances
- Sort on the number of instances
- Grab the top 5
list Linux ANSI_escape_code Shell \
| map curl -s "https://en.wikipedia.org/wiki/{}" \
| grep -o '[^ ]\+Unix[^ ]\+' \
| sort | uniq -c \
| sort -r -n -k 1 \
| head -n 5
$ bash example.sh
6 href="/wiki/Unix-like"
5 title="Unix-like">Unix-like</a>
4 href="/wiki/Unix"
3 title="Unix">Unix</a>
2 href="/wiki/File:Unix_timeline.en.svg"
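Those results are raw HTML attributes because we're grepping the page source directly. To see exactly what that grep -o pattern grabs, here it is run against a single made-up line of markup:
$ echo 'see <a href="/wiki/Unix-like" title="Unix-like">Unix-like</a>' | grep -o '[^ ]\+Unix[^ ]\+'
href="/wiki/Unix-like"
title="Unix-like">Unix-like</a>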
How about the most linked articles on the Linux page?
- Download the article
download-article() {
    curl -s "https://en.wikipedia.org/wiki/$1"
}
- Grab references to other Wikipedia pages. Clean them up.
page-to-links() {
    # page -> href=/wiki/Slackware, href=/wiki/Austrumi_Linux
    grep -o 'href="/wiki/[^"]\+"' \
    | tr -d '"'
}
- Break up the URL, take the last element, which is the name
links-to-articles() {
    # href=/wiki/Slackware -> Slackware
    tr '/' ' ' \
    | awk '{print $NF}'
}
- Get the top 'n' unique instances
top-unique() {
    sort | uniq -c \
    | sort -r -n -k 1 \
    | head -n "$1"
}
Assemble the pieces, and we have an answer!
download-article "Linux" \
| page-to-links \
| links-to-articles \
| top-unique 15
$ bash example.sh
13 Linux_kernel
10 Android_(operating_system)
9 Free_software
8 Ubuntu_(operating_system)
8 Linux_distribution
7 Linux
7 GNU
7 Debian
7 C_(programming_language)
6 Linus_Torvalds
6 IBM
6 Chrome_OS
5 Wayland_(display_server_protocol)
5 Unix-like
5 Smartphone
Parallelism
Let's write our own version of xargs with parallelism built in. One option is
to let each subprocess write to stdout whenever it has a result, which gets us
results more quickly and without blocking, but also interleaves results from
different processes. Alternatively, we can redirect the output of each
subprocess somewhere else, wait for them all to finish, then write the results
to stdout in order. Either way, writing our own xargs-style function in our
shell lets us 'map' our own shell functions, which xargs itself can't call.
faster-map() {
    while read -r line; do
        "$@" "$line" &
    done
    wait
}
strict-map() {
    output=()
    while read -r line; do
        tmp="$( mktemp /dev/shm/strict-map.XXXXXX )"
        output+=( "$tmp" )
        # could keep separate files for stdout + stderr if we wanted
        # but not necessary here
        "$@" "$line" >"$tmp" 2>&1 &
    done
    wait
    for item in "${output[@]}"; do
        cat "$item"
        rm -f "$item"
    done
}
faster-map should be slightly faster than strict-map, but both will be much faster than running each command + argument pair serially, like xargs does.
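For example, assuming download-article, page-to-links, links-to-articles, and top-unique from earlier are all defined in the same script, the link-counting pipeline can be parallelized across all three pages just by swapping the map in:
list Linux ANSI_escape_code Shell \
| strict-map download-article \
| page-to-links \
| links-to-articles \
| top-unique 15
strict-map keeps each article's output contiguous, which is the safer choice when the downstream commands are parsing the text line by line.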
Summary
Treating input data as a list of items and then progressively filtering or transforming it in small, atomic steps is a powerful way to describe a problem. It allows for easy extension, refinement, and parallelism.