Introduction

At work, we use Apache HTTP indexing for easy sharing of files internally; sometimes you just want a link rather than full NFS access. With a little work in an .htaccess file, these shares can look pretty nice too. I wanted to create something similar for sharing my personal projects with the world, without relying on a third-party service like GitHub or Google Drive. Further, I've moved from using Apache on my personal servers to Nginx for performance and simplicity of management. Nginx does less out of the box, and that's what I prefer for the time being!

Architecture

My current architecture's data backend is on Digital Ocean Spaces - I've been using them since the beta and have been very pleased with the performance and ease of use. Having my resources off-server has been very convenient too; rather than rsync'ing to a server, I can use all manner of S3 tools to move my data around from anywhere.

However, S3 and Nginx make indexing more complicated than Apache and local files. I like having everything on S3, so I wrote a Haskell program, indexer, to pre-generate index pages for my data and upload them alongside it. A few hard-won Nginx proxy_pass rules stitch everything together.

You can see the end result here. Everything you see is static content, including the index pages themselves. The index pages and thumbnails are generated by indexer, the main topic of this post. Another tool, devbot, runs locally and watches for content changes to regenerate and upload the new content and new index pages.

Indexer

To generate the index pages, we need to recursively walk the directory to be shared, collecting name and stat information for each file. At the end, we should be able to write out HTML pages that describe and link to everything in the source directory as it will appear in S3.

Here are some more formal requirements:

HTML

  • Contain a listing of each file and child directory in a directory. This listing will be sorted alphabetically. Folders are always first.
  • Contain a link to the parent directory, if one exists. This must be the first element in the listing.
  • Contain a generic header and footer.
  • HTML output directories and files must be structured to mirror the source, but may not be written into the source.
  • Subsequent runs must only write to HTML files for source directories that have changed.

Scanning

  • Collect all information in a single file system pass.
  • Subdirectories should be scanned in parallel for faster execution.
  • Files and directories must be ignorable.

Icons

  • Listing elements will have icons based on file type, determined by file extension, file name, or stat (2) information. Executable files will also have a specific icon.
  • Icons for pictures will be custom thumbnails for those pictures.
  • Icons for videos will be custom GIF thumbnails for those videos.
  • Any custom thumbnails should only be generated once, and be as small as possible.

Metadata

  • Listing elements will have a human readable size.
  • Listing elements will have a human readable age description instead of a timestamp.

Walking the File System

Walking the file system isn't too hard, but I do need to define what information to collect along the way and how the tree itself will be represented. The following is what I went with - the tree is described in terms of a tree of DirectoryElements, with the only possible oddity being that each element knows its parent too. Further, though icons are required for all elements, during scanning I allow Nothing to be returned to indicate that post-processing is required for thumbnails. This separates scanning issues from thumbnail generation issues, and allows thumbnails to be generated in batches rather than piecemeal as they're discovered.

Source Code

data FileElement = FileElement
        { _fname :: String
        , _fpath :: FilePath      -- ^ full path to the file
        , _fsize :: Integer       -- ^ size in bytes
        , _ftime :: Integer       -- ^ how many seconds old is this file?
        , _ficon :: Maybe String  -- ^ URI for appropriate icon
        , _fexec :: Bool          -- ^ is this file executable?
        }
    deriving (Show, Eq)

data DirElement = DirElement
        { _dname    :: String
        , _dpath    :: FilePath
        , _children :: [DirectoryElement]
        }
    deriving (Show, Eq)

data DirectoryElement =
          File      FileElement
        | Directory DirElement
        | ParentDir DirElement
    deriving (Show, Eq)

Defining Ord on DirectoryElement lets us satisfy the ordering requirements easily too. With the following in place, I can use plain old sort to provide the exact ordering we're looking for on any grouping of files and directories.

instance Ord DirectoryElement where
    -- compare files and directories by name
    compare (File a)      (File b)      = compare (_fname a) (_fname b)
    compare (Directory a) (Directory b) = compare (_dname a) (_dname b)

    -- directories are always before files
    compare (File _)      (Directory _) = GT
    compare (Directory _) (File _)      = LT

    -- parent directories are before everything
    compare _          (ParentDir _)    = GT
    compare (ParentDir _) _             = LT
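As a quick sanity check, here's that ordering in action, with the element payloads simplified down to plain names so the snippet stands alone (a hypothetical demo, not part of indexer):

```haskell
-- Simplified stand-in for DirectoryElement: payloads reduced to names.
import Data.List (sort)

data DirectoryElement
        = File      String
        | Directory String
        | ParentDir String
    deriving (Show, Eq)

instance Ord DirectoryElement where
    compare (File a)      (File b)      = compare a b
    compare (Directory a) (Directory b) = compare a b
    compare (File _)      (Directory _) = GT
    compare (Directory _) (File _)      = LT
    compare _             (ParentDir _) = GT
    compare (ParentDir _) _             = LT

demo :: [DirectoryElement]
demo = sort [ File "b.txt", Directory "src", File "a.txt"
            , ParentDir "..", Directory "docs" ]
-- parent link first, then directories, then files, each alphabetical
```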

File and directory ignoring comes in during the tree-walk phase, so no unnecessary work is done walking or stat'ing these elements; a filter after listDirectory is sufficient. As for parallelism, Control.Concurrent.Async.mapConcurrently is the workhorse, conveniently matching the type of mapM - swapping between the two was nice for debugging during development. One note: don't attempt to use System.Process.callProcess under mapConcurrently! You'll block the Haskell IO fibers and jam up everything, potentially causing a deadlock. Another reason to process thumbnail creation outside of scanning.
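The shape of that scan might look something like the following - a minimal sketch with the element type pared down, and with names (scan, ignored, visit) that are mine, not necessarily indexer's:

```haskell
{-# LANGUAGE LambdaCase #-}
-- Sketch of a single-pass scan: list a directory, drop ignored names,
-- then visit every entry concurrently, recursing into subdirectories.
import Control.Concurrent.Async (mapConcurrently)
import System.Directory (listDirectory, doesDirectoryExist, getFileSize)
import System.FilePath ((</>))

-- pared-down element type, just for this sketch
data Element = FileE FilePath Integer   -- path and size in bytes
             | DirE  FilePath [Element] -- path and children
    deriving (Show, Eq)

-- example ignore list; indexer's real rules are configurable
ignored :: FilePath -> Bool
ignored name = name `elem` [".git", ".indexes"]

scan :: FilePath -> IO Element
scan path = do
    names    <- filter (not . ignored) <$> listDirectory path
    children <- mapConcurrently visit (map (path </>) names)
    pure (DirE path children)
  where
    visit p = doesDirectoryExist p >>= \case
        True  -> scan p
        False -> FileE p <$> getFileSize p
```

Swapping mapConcurrently for mapM here turns the parallel scan back into a sequential one, which is exactly the debugging trick mentioned above.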

Age, Size, and other Extras

With basic file information available, meeting the age and size requirements is mostly a matter of presentation.

For age, some Ord and Show instances do the heavy lifting, and for getting that information out of arbitrary DirectoryElements, a new class Ageable works great. Tree recursion is our friend here, since the age of a directory is the minimum age of its children. Age would be most easily represented as just an Integer, but the top level parent directory does not have an age. Further, any particular parent directory's age is unknown to its children due to the single pass scan we're making. This could be filled in through post-processing, but it's not interesting information to present anyway, so I don't bother.

Source Code

data Age = Age Integer | NoAge
    deriving (Eq)

instance Ord Age where
instance Show Age where

class Ageable a where
    age :: a -> Age

instance Ageable FileElement where
instance Ageable DirElement where
instance Ageable DirectoryElement where
instance (Ageable a) => Ageable [a] where
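The instance bodies are elided above; here's one way they might be filled in - a sketch of the intent, not necessarily indexer's real implementation. The key decision is that NoAge sorts as "oldest", so taking the minimum over a directory's children always prefers a real age:

```haskell
data Age = Age Integer | NoAge
    deriving (Eq)

instance Ord Age where
    -- NoAge sorts last ("oldest"), so minimum prefers real ages
    compare (Age a) (Age b) = compare a b
    compare (Age _) NoAge   = LT
    compare NoAge   (Age _) = GT
    compare NoAge   NoAge   = EQ

instance Show Age where
    -- human readable, per the metadata requirements
    show NoAge = ""
    show (Age s)
        | s < 60    = show s ++ " seconds"
        | s < 3600  = show (s `div` 60)    ++ " minutes"
        | s < 86400 = show (s `div` 3600)  ++ " hours"
        | otherwise = show (s `div` 86400) ++ " days"

class Ageable a where
    age :: a -> Age

-- the tree recursion: a group's age is the minimum of its members'
instance (Ageable a) => Ageable [a] where
    age [] = NoAge
    age xs = minimum (map age xs)
```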

Size follows a similar pattern, defining Num and Show instances and a Sizeable class to provide a polymorphic size function. The size of a directory is the sum of its children's sizes. Computing these sums (and the minimum age for Age) does incur multiple walks of the tree, but this is the in-memory representation, not the file system, so it's a bit of extra work I'm willing to pay for implementation simplicity. Yet again, parent directories get in the way of a simple representation and require a NoSize sum type so we can correctly distinguish between zero size and no size.

Source Code

data Size = Size Integer | NoSize

instance Num Size where
instance Show Size where

class Sizeable a where
    size :: a -> Size

instance Sizeable FileElement where
instance Sizeable DirElement where
instance (Sizeable a) => Sizeable [a] where
instance Sizeable DirectoryElement where
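Again the bodies are elided; a sketch of how they might look, with my guess at the intent for NoSize in Num (it's an identity for addition, so summing children just works):

```haskell
data Size = Size Integer | NoSize
    deriving (Eq)

instance Num Size where
    Size a + Size b = Size (a + b)
    NoSize + b      = b
    a      + NoSize = a
    fromInteger     = Size
    -- the remaining methods aren't needed just for summing
    (*)    = undefined
    abs    = undefined
    signum = undefined
    negate = undefined

instance Show Size where
    -- human readable, per the metadata requirements
    show NoSize = ""
    show (Size b)
        | b < 1024    = show b ++ " B"
        | b < 1024^2  = show (b `div` 1024)   ++ " KB"
        | b < 1024^3  = show (b `div` 1024^2) ++ " MB"
        | otherwise   = show (b `div` 1024^3) ++ " GB"

-- a directory's size is the sum of its children's sizes
total :: [Size] -> Size
total = sum
```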

Icons and Thumbnails

Source Code

HTML generation

Detecting Changes

Nginx

server {
    listen 80;
    listen [::]:80;
    server_name  public.anardil.net;

    proxy_ignore_headers   Set-Cookie;
    add_header             X-Cache-Status $upstream_cache_status;
    proxy_hide_header      Strict-Transport-Security;

    # root
    location = / {
        proxy_pass https://mirror.sfo2.digitaloceanspaces.com/share/.indexes/index.html;
    }

    # directories
    location ~ ^/(.*)/$ {
        proxy_pass https://mirror.sfo2.digitaloceanspaces.com/share/.indexes${request_uri}index.html;
    }

    # files
    location ~ ^/(.*)$ {
        proxy_pass https://mirror.sfo2.digitaloceanspaces.com/share${request_uri};
    }
}

Summary