Introduction

I've implemented a couple Unix core utilities in Haskell, and want to continue a series of posts going through the details - starting with simple programs like cat, seq, and which, and then moving on towards more feature-rich programs like uniq, tr and maybe grep.

You can find the full source from this post here. Let's implement split in Haskell!

Background

split is the original counterpart to cat, and comes all the way from Version 3 of AT&T Unix. Rather than concatenating files, split breaks one file into multiple files. The predicate for this could be number of bytes, number of lines, or the number of output files. Suppose we have a 50MB PDF that we want to email to someone (some better file sharing system isn't available), and our email system only supports attachments of 10MB. We can use split to break the PDF up, and our friend can use cat to reconstruct it. For example:

Our PDF, 50MB.

I leaf@elm ~> ls -lh paper.pdf
-rw-rw-r-- 1 leaf leaf 50M May 25 12:16 paper.pdf

Use split to break the file into 10M chunks and prefix the output files with chunks-.

leaf@elm ~> split --bytes 10M paper.pdf chunks-
leaf@elm ~> ls -lh chunks*
-rw-rw-r-- 1 leaf leaf 10M May 25 12:20 chunks-aa
-rw-rw-r-- 1 leaf leaf 10M May 25 12:20 chunks-ab
-rw-rw-r-- 1 leaf leaf 10M May 25 12:20 chunks-ac
-rw-rw-r-- 1 leaf leaf 10M May 25 12:20 chunks-ad
-rw-rw-r-- 1 leaf leaf 10M May 25 12:20 chunks-ae

Reconstruct the file using cat and verify its content.

leaf@elm ~> cat chunks-* > reconstructed.pdf
leaf@elm ~> sha1sum paper.pdf reconstructed.pdf
49886561f8e26ed5e2ae549897a28aaab44881bd  paper.pdf
49886561f8e26ed5e2ae549897a28aaab44881bd  reconstructed.pdf

Seem useful? Let's implement it in Haskell so we have a acceptably fast cross-platform implementation, and for fun.

Modules and Imports

The top of the file contains the module definition and imports. We'll skip going through the imports for now, but reference them as we move through the file. To follow along with the examples, you can put this header in a file and load it from ghci with :load split.hs. The overloaded strings extension allows us to treat character literals as other 'string-like' types; in this case ByteStrings.

{-# LANGUAGE OverloadedStrings #-}

module Main where

-- split
--
-- break a file into chunks by line count, byte count, or figure out the size based on a
-- required number of output files

import           Control.Monad
import qualified Data.ByteString.Lazy  as L
import           Data.Char
import           Data.Int
import           Data.List
import           System.Console.GetOpt
import           System.Exit
import           System.IO
import           Text.Read

Data Flow

If we look at the essence of split, it has one function: splitting files, and can do this three ways: by byte range, by line range, or by some byte range to produce a particular number of output files. We can call this the Runtime, and express it in Haskell like so.

data Runtime =
    -- what kinds of splits are we doing?
          RunBytes Int
        | RunLines Int
        | RunChunk Int
    deriving (Show, Eq)

Our option parsing will determine which runtime to use, and whatever other special flags should be enabled. The user doesn't provide the exact output files for split; they're determined internally. The flags not related to the runtime have to do with influencing what these output filenames look like. They have three parts: a prefix (default x), a body (determined by split, sorted lexicographically), and an optional additional suffix.

data Options = Options
        { optSuffixLength :: Int     -- body width
        , optExtraSuffix  :: String  -- extra suffix
        , optNumeric      :: Bool    -- use numbers in body
        , optRuntime      :: Runtime -- lines, bytes, or chunks
        }

A high level view of our implementation is:

Parse arguments to determine the runtime and additional flags
Use the options to produce a stream of output filenames
Start splitting the input file, using the runtime, into the output files

Each of our runtimes has the same prototype. Take a file, the value for your runtime (number of bytes, number of lines, number of chunks), a stream of output filenames, and produce an IO side effect. Like the majority of Unix utilities, split can read from a filename, stdin, or stdin again if the filename is -. To hide this from our runtimes, we define a File type which has a handle and potentially a size. This size will be useful for an optimization later, but isn't available for stdin.

data File = File Handle (Maybe Int)

splitBytes :: File -> Int -> [FilePath] -> IO ()
splitLines :: File -> Int -> [FilePath] -> IO ()
splitChunk :: File -> Int -> [FilePath] -> IO ()

runSplit :: Options -> File -> String -> IO ()
-- switchboard, and create the output filename stream
runSplit (Options s e n r) file prefix =
        case r of
            RunBytes v -> splitBytes file v filenames
            RunLines v -> splitLines file v filenames
            RunChunk v -> splitChunk file v filenames
    where
        filenames = filenameGenerator prefix n s e

Bytes

Splitting by bytes is the simplest of the runtimes. All we're doing here is grabbing n bytes, writing them to the next output file, and repeating if there's more to read. Our implementation of filenameGenerator shouldn't ever run out of values, but GHC doesn't know that with the types provided. To avoid a hole in the pattern matching or fussing with additional libraries like Data.InfList, we'll provide a match and error message.

Using Data.ByteString.Lazy means that our memory usage should stay constant and fairly low when dealing with large files or stdin, and that binary files will be no issue. In files whose size isn't divisible by our byte range, the remaining bytes will spill over into one more file that will be smaller than the rest.

outOfFilenames :: IO ()
outOfFilenames = die "split could not generate any more output filenames"

splitBytes :: File -> Int -> [FilePath] -> IO ()
-- simplest split, just by byte ranges
splitBytes _ _ [] = outOfFilenames
splitBytes f@(File h _) n (fn:fs) = do
        L.hGet h n >>= L.writeFile fn
        hIsEOF h   >>= flip unless (splitBytes f n fs)

Lines

Working with lines are a little more complex, but not much. Here we want to turn our stream of bytes into a stream of lines. Next, batch the lines into groups of the appropriate size, reconstruct the content by inserting newlines again, and write the group out as a result. We're finished when we've consumed all the lines in the file.

For very large ranges of lines, appending a subgroup of lines into the output file would be more memory efficient, but doesn't seem worth the complexity at the moment. As is, using Data.ByteString.Lazy and its function for split still allows us to maintain our lazy reading of the file and keep memory usage constant in the number of lines required for an output file.

splitLines :: File -> Int -> [FilePath] -> IO ()
-- create a stream of lines, group them, write them out
splitLines (File h _) n paths =
        L.split newline <$> L.hGetContents h >>= go paths
    where
        go :: [FilePath] -> [L.ByteString] -> IO ()
        go [] _ = outOfFilenames
        go _ [] = pure ()
        go (fn:fs) bs = do
            let elements = take n bs
            L.writeFile fn $
                if length elements == n
                    -- we're somewhere in the middle of the file
                    then L.snoc (L.intercalate "\n" elements) newline

                    -- don't tack an extra newline on the end of the file
                    else L.intercalate "\n" elements
            go fs (drop n bs)

        newline = 10

Chunks

Writing chunks means that the user has provided the number of output files they'd like, but not how large each of those files should be. An implicit requirement here is that the output files are all roughly the same size.

Our job is to determine size of the output files for them. For normal files, it's easy to get the size of a file without reading it with a stat call. However for stdin, we will have to read everything into memory. We'll have to determine the size and (lazily) get the content of the file different ways, but the work after that will be the same.

Get enough bytes from the lazy bytestring for the chunk, write it into the file.
When we're on the last chunk, write everything remaining into the file.

Suppose we have a a 20 byte file, and the user requests 3 chunks. To make this happen, we'll produce chunks of size: 6, 6, and 8. The last chunk picks up the remainder rather than spilling over into an additional file like in splitBytes.

chunkWriter :: Int64 -> [FilePath] -> Int64 -> L.ByteString -> IO ()
chunkWriter _ [] _ _ = outOfFilenames
chunkWriter i (fn:fs) chunkSize bs
        -- last iteration, it gets everything remaining
        | i == 1    = L.writeFile fn bs

        -- other iteration, write our chunk and continue
        | otherwise = do
            L.writeFile fn $ L.take chunkSize bs
            chunkWriter (i - 1) fs chunkSize $ L.drop chunkSize bs

We've already handled whether a file size exists for our input file with our File type, so our implementation here can just check for that information. Handling this earlier prevents more code having to worry about calling hGetSize, or equivalent, on a handle that doesn't support stating like stdin. In the case of a pre-known file size, the implementation becomes very similar to splitBytes except for the handling of remainder bytes in the last chunk. fromIntegral is necessary to convert the Int provided by the user into an Int64 for use by Data.ByteString.Lazy.

splitChunk (File h Nothing) n paths = do
-- size is unknown, have to read it all to get the length
        bs <- L.hGetContents h
        let chunks    = fromIntegral n
            fileSize  = L.length bs
            chunkSize = fileSize `div` chunks
        chunkWriter chunks paths chunkSize bs

splitChunk (File h (Just s)) n paths = do
-- we can skip reading everything into memory since we already know the size
        bs <- L.hGetContents h
        chunkWriter chunks paths chunkSize bs
    where
        chunks    = fromIntegral n
        chunkSize = fileSize `div` chunks
        fileSize  = fromIntegral s

Filename Generation

Generating suitable output filenames is the main complexity in split besides actually splitting files. We're subject to a couple constraints in addition to the user provided options:

Every filename generated must be lexicographically greater than the previous. Without this, the output files won't be sorted by shell globbing and the original file will be a pain to reconstruct. We'd have all the pieces, but not the ordering.
The stream of filenames must be infinite, even if we have to bend the user's arguments to make it happen. Suppose the user specifies a prefix width of 2. This only provides 26 ^ 2 = 676 filenames. If the input is split -b 1K 100-mb-file.pdf, we'd run out of filenames.
Output filenames shouldn't be longer than necessary.

To satisfy these constraints, we need permutations (order matters, compared to combinations), in such a way that we can add more permutations without breaking ordering. An infinite list of permutations can be had by generating permutations of size n, then n + 1, then n + 2, etc. replicateM can give us specific permutations, and limiting the selections to one less than all the options can provide us the wiggle room to maintain ordering. Let's go through an example in a world where the alphabet is only 3 letters: A, B, C.

Here are the permutations of length 2 for 'abc'

*Coreutils.Split> replicateM 2 "abc"
["aa","ab","ac","ba","bb","bc","ca","cb","cc"]
*Coreutils.Split> isSorted xs = xs == sort xs
*Coreutils.Split> isSorted $ replicateM 2 "abc"
True

However, if we try to tack permutations of length 3 onto the end, we'll break ordering.

*Coreutils.Split> replicateM 2 "abc" <> replicateM 3 "abc"
["aa","ab","ac","ba","bb","bc","ca","cb","cc","aaa","aab","aac","aba",
"abb","abc","aca","acb","acc","baa","bab","bac","bba","bbb","bbc","bca",
"bcb","bcc","caa","cab","cac","cba","cbb","cbc","cca","ccb","ccc"]
*Coreutils.Split> isSorted $ replicateM 2 "abc" <> replicateM 3 "abc"
False

An easy solution is available: if we filter results that begin with the last character in our alphabet, we can use it as a prefix for all later permutations and guarantee ordering. This continues to work even when additional 'permutation groups' are added.

*Coreutils.Split> let g n = filter (\i -> head i /= 'c') $ replicateM n "abc"
*Coreutils.Split> g 2
["aa","ab","ac","ba","bb","bc"]
*Coreutils.Split> g 2 <> map (\i -> 'c' : i) (g 3)
["aa","ab","ac","ba","bb","bc","caaa","caab","caac","caba","cabb", "cabc",
"caca","cacb","cacc","cbaa","cbab","cbac","cbba","cbbb","cbbc","cbca","cbcb","cbcc"]
*Coreutils.Split> isSorted $ g 2 <> map (\i -> 'c' : i) (g 3)
True

Here's what that looks like all together and handling the options we're exposing to the user. We'll start with the width provided, but increase it as necessary. numeric determines whether we're using letters or numbers in our alphabet. In either case, the idea is the same: filter out results that start with the last element in our alphabet so we can maintain ordering. prefix and suffix allow the user more control over what the results look like.

filenameGenerator :: String -> Bool -> Int -> String -> [String]
-- ^ lazy list of filenames that conform to these options. 9 and z aren't included
-- in the 'main' body of results since they're needed to prefix the next group so
-- we maintain ordering
filenameGenerator prefix numeric width suffix =
        initial <> more
    where
        initial =
            map (\i -> prefix <> i <> suffix)
            $ filter (\i -> head i /= next)
            $ replicateM width characters
        more = filenameGenerator (prefix <> [next]) numeric (width + 1) suffix

        characters
            | numeric   = concatMap show ([0..8] :: [Integer])
            | otherwise = ['a'..'y']

        next = last characters

Option Parsing and Running

With the core implementation handled, all that's left is to parse the user's command line arguments into our Options type and wire everything together. We'll use System.Console.GetOpt from the standard library for parsing. It's simple and expressive enough for our needs here. To use it, we'll define our default values and describe the options with help text conforming to the prototype getOpt requires. I've elided most of the option definitions here, but you can see them all here.

defaults :: Options
defaults = Options
        { optSuffixLength = 2
        , optExtraSuffix  = ""
        , optNumeric      = False
        , optRuntime      = NoRuntime
        }

options :: [OptDescr (Options -> Either String Options)]
options =
    [ Option "a" ["suffix-length"]
        (ReqArg
            (\arg opt -> case readMaybe arg of
                Nothing  -> Left $ "could not parse " <> arg <> " as a number"
                (Just n) -> Right opt { optSuffixLength = n })
            "N")
        "generate suffixes of length N (default 2)"
    ...

With our options defined, we're ready to write our splitMain function. It will only require the output of getArgs to run, i.e. we could define main as getArgs >>= splitMain.

The first thing to do is parse options. getOpt returns a triple:

Successfully parsed options
Arguments that were not parsed as options
Errors discovered during parsing, such as providing a flag that requires an argument without an argument

Assuming success, we need to separate the input file from prefix if provided. We can use pattern matching to get the optional behavior that split provides.

If nothing is provided, the file is stdin and the prefix is 'x'
If one thing is provided, it's used as a filename and the prefix is 'x'
If two things are provided, the first is used as a filename and the second as a prefix
If anything else is provided, it's an error

Additionally, this where we'll handle the - == stdin conversion and collect a file size if possible. All the code after this point deals only with a File data type, and doesn't need to know whether it's working against stdin or a normal file.

splitMain :: [String] -> IO ()
-- parse arguments and check for obvious errors
splitMain args = do
        let (actions, files, errors) = getOpt RequireOrder options args

        unless (null errors) $ do
            mapM_ putStr errors
            exitFailure

        (prefix, file) <- case files of
            []           -> return ("x", File stdin Nothing)
            [fn]         -> (,) "x" <$> getHandleAndSize fn
            [fn, prefix] -> (,) prefix <$> getHandleAndSize fn
            _            -> die "split cannot operate on more than one file at a time"

        case foldM (flip id) defaults actions of
            Left err   -> die err
            Right opts -> runSplit opts file prefix
    where
        getHandleAndSize :: FilePath -> IO File
        getHandleAndSize "-" = pure $ File stdin Nothing
        getHandleAndSize fn = do
                h <- openBinaryFile fn ReadMode
                size <- fromIntegral <$> hFileSize h
                pure $ File h (Just size)

A bit of spelunking in GHCI will show what's going on with foldM (flip id) defaults action.

*Coreutils.Split> let (a, f, e) = getOpt RequireOrder options ["-d", "file"]
*Coreutils.Split> f
["file"]
*Coreutils.Split> e
[]
*Coreutils.Split> :t a
a :: [Options -> Either String Options]
*Coreutils.Split> :t foldM (flip id) defaults
foldM (flip id) defaults
  :: (Foldable t, Monad m) => t (Options -> m Options) -> m Options
*Coreutils.Split> :t foldM (flip id) defaults a
foldM (flip id) defaults a :: Either String Options

It's using fold to apply the default options to each of the functions we defined in options to build up a 'filled in' Options result. If we have a match, the value from actions is used, otherwise the default value is used. The result of all this is either an error string or useable Options that we can pass to runSplit. And that's it!

Conclusion

We have a fully functional implementation of split in Haskell! We identified the core functionality, created data types to represent it with Runtime and Options, and implemented the core functionality using lazy IO and abstracted input. To tie it all together, we built a helper function for creating an infinite list of lexicographically ordered filenames, defined options and how to parse them, and wrote a main function to kick everything off. Here's all the code together.

Anardil

A spot of sunlight

Haskell coreutils - split