Introduction
At work, we use Apache HTTP indexing for easy sharing of files internally;
sometimes you just want a link rather than full NFS access. With a little work
in an .htaccess
file, these shares can look pretty nice too. I wanted to
create something similar for sharing my personal projects with the world,
without relying on a third-party service like GitHub or Google Drive. Further,
I've moved from using Apache on my personal servers to Nginx for performance
and simplicity of management. Nginx does less out of the box, and that's what I
prefer for the time being!
My current architecture's data backend is DigitalOcean Spaces - I've been using them since the beta and have been very pleased with the performance and ease of use. Having my resources off-server has been very convenient too; rather than rsync'ing to a server, I can use all manner of S3 tools to move my data around from anywhere.
However, S3 and Nginx make indexing more complicated than Apache and local
files. I like having everything on S3, so I wrote a Haskell program, indexer,
to pre-generate index pages for my data and upload them alongside it. A few
hard-won Nginx proxy_pass rules stitch everything together.
You can see the end result here. Everything you see is static content, including the index pages themselves. The index pages and thumbnails are generated by indexer, the main topic of this post. Another tool, devbot, runs locally and watches for content changes to regenerate and upload the new content & new index pages.
Indexer
To generate the index pages, we need to recursively walk the directory to be shared, collecting name and stat information for each file. At the end, we should be able to write out HTML pages that describe and link to everything in the source directory as it will appear in S3.
Here are some more formal requirements:
HTML
- Contain a listing of each file and child directory in a directory. This listing will be sorted alphabetically. Folders are always first.
- Contain a link to the parent directory, if one exists. This must be the first element in the listing.
- Contain a generic header and footer.
- HTML output directories and files must be structured to mirror the source, but may not be written into the source.
- Subsequent runs must only write to HTML files for source directories that have changed.
Scanning
- Collect all information in a single file system pass.
- Subdirectories should be scanned in parallel for faster execution.
- Files and directories must be ignorable.
Icons
- Listing elements will have icons based on file type, determined by file extension, file name, or stat(2) information. Executable files will also have a specific icon.
- Icons for pictures will be custom thumbnails for those pictures.
- Icons for videos will be custom GIF thumbnails for those videos.
- Any custom thumbnails should only be generated once, and be as small as possible.
Metadata
- Listing elements will have a human readable size.
- Listing elements will have a human readable age description instead of a timestamp.
Walking the File System
Walking the file system isn't too hard, but I do need to define what
information to collect along the way and how the tree itself will be
represented. The following is what I went with - the tree is described in terms
of a tree of DirectoryElements, with the only possible oddity being that each
element also knows its parent. Further, though icons are required for all
elements, during scanning I allow Nothing to be returned to indicate that
post-processing is required for thumbnails. This separates scanning issues from
thumbnail generation issues, and allows thumbnails to be generated in batches
rather than piecemeal as they're discovered.
data FileElement = FileElement
  { _fname :: String
  , _fpath :: FilePath     -- ^ full path to the file
  , _fsize :: Integer      -- ^ size in bytes
  , _ftime :: Integer      -- ^ how many seconds old is this file?
  , _ficon :: Maybe String -- ^ URI for appropriate icon
  , _fexec :: Bool         -- ^ is this file executable?
  }
  deriving (Show, Eq)

data DirElement = DirElement
  { _dname :: String
  , _dpath :: FilePath
  , _children :: [DirectoryElement]
  }
  deriving (Show, Eq)

data DirectoryElement =
    File FileElement
  | Directory DirElement
  | ParentDir DirElement
  deriving (Show, Eq)
Defining Ord on DirectoryElement lets us satisfy the ordering requirements
easily too. With the following in place, I can use plain old sort to provide
the exact ordering we're looking for on any grouping of files and directories.
instance Ord DirectoryElement where
  -- compare files and directories by name
  compare (File a)      (File b)      = compare (_fname a) (_fname b)
  compare (Directory a) (Directory b) = compare (_dname a) (_dname b)
  -- directories are always before files
  compare (File _)      (Directory _) = GT
  compare (Directory _) (File _)      = LT
  -- parent directories are before everything
  compare (ParentDir _) (ParentDir _) = EQ
  compare _             (ParentDir _) = GT
  compare (ParentDir _) _             = LT
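To see the ordering in action, here's a minimal self-contained demonstration using simplified stand-in types (Element and its constructors are illustrative, not the real indexer types):

```haskell
import Data.List (sort)

-- simplified stand-in for the post's DirectoryElement
data Element = ParentDir | Directory String | File String
  deriving (Show, Eq)

instance Ord Element where
  -- compare like elements by name
  compare (File a)      (File b)      = compare a b
  compare (Directory a) (Directory b) = compare a b
  -- directories before files, parent directory before everything
  compare (File _)      (Directory _) = GT
  compare (Directory _) (File _)      = LT
  compare ParentDir     ParentDir     = EQ
  compare _             ParentDir     = GT
  compare ParentDir     _             = LT

main :: IO ()
main = print (sort [File "b.txt", Directory "src", ParentDir, File "a.txt"])
-- [ParentDir, Directory "src", File "a.txt", File "b.txt"]
```

Plain sort then yields exactly the listing order the requirements ask for: parent link first, then directories, then files, each group alphabetical.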
File and directory ignoring comes in during the tree-walk phase, so no
unnecessary work is done walking or stat'ing those elements. A filter after
listDirectory is sufficient. As for parallelism,
Control.Concurrent.Async.mapConcurrently is the workhorse, conveniently
matching the type of mapM. Swapping between these was nice for debugging
during development. One note: don't attempt to use System.Process.callProcess
under mapConcurrently! You'll block the Haskell IO fibers and jam up
everything, potentially causing a deadlock. Another reason to process
thumbnail creation outside of scanning.
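As a sketch of this scanning pattern, here's a minimal parallel walker in the same spirit. The ignored list and countFiles function are illustrative, not the actual indexer code, and it assumes the async package is available:

```haskell
import Control.Concurrent.Async (mapConcurrently)
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath ((</>))

-- hypothetical ignore rules; the real tool's list may differ
ignored :: FilePath -> Bool
ignored name = name `elem` [".git", ".indexes"]

-- count files recursively, scanning subdirectories in parallel;
-- swapping mapConcurrently for mapM makes this sequential for debugging
countFiles :: FilePath -> IO Int
countFiles dir = do
  names  <- filter (not . ignored) <$> listDirectory dir
  counts <- mapConcurrently visit (map (dir </>) names)
  pure (sum counts)
  where
    visit path = do
      isDir <- doesDirectoryExist path
      if isDir
        then countFiles path
        else pure 1
```

Because mapConcurrently and mapM share the shape (a -> IO b) -> [a] -> IO [b], the swap mentioned above is a one-word change.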
Age, Size, and other Extras
With basic file information available, meeting the age and size requirements is mostly a matter of presentation.
For age, some Ord and Show instances do the heavy lifting, and for getting
that information from arbitrary DirectoryElements, a new class Ageable works
great. Tree recursion is our friend here, since the age of a directory is the
minimum age of its children. Age would be most easily represented as just an
Integer, but the top-level parent directory does not have an age. Further,
any particular parent directory's age is unknown to its children due to the
single-pass scan we're making. This could be filled in through post-processing,
but it's not interesting information to present anyway, so I don't bother.
data Age = Age Integer | NoAge
  deriving (Eq)

instance Ord Age where
instance Show Age where

class Ageable a where
  age :: a -> Age

instance Ageable FileElement where
instance Ageable DirElement where
instance Ageable DirectoryElement where
instance (Ageable a) => Ageable [a] where
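The instance bodies are elided above; here's one self-contained guess at how they might be filled in. The unit cutoffs, wording, and the NoAge-sorts-last choice are my own, not taken from the post:

```haskell
data Age = Age Integer | NoAge
  deriving (Eq)

instance Ord Age where
  compare (Age a) (Age b) = compare a b
  compare NoAge   NoAge   = EQ
  compare NoAge   _       = GT  -- NoAge sorts after any real age,
  compare _       NoAge   = LT  -- so minimum prefers known ages

instance Show Age where
  show NoAge = ""
  show (Age s)
    | s < 60    = show s               <> " seconds"
    | s < 3600  = show (s `div` 60)    <> " minutes"
    | s < 86400 = show (s `div` 3600)  <> " hours"
    | otherwise = show (s `div` 86400) <> " days"

-- a directory's age is the minimum (youngest) of its children's ages
ages :: [Age] -> Age
ages [] = NoAge
ages xs = minimum xs
```

With NoAge greater than every real age, minimum naturally skips ageless entries like the parent directory.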
Likewise, size follows a similar pattern, defining Num and Show instances
and a Sizeable class to provide a polymorphic size function. The size of a
directory is the sum of its children's sizes. Computing these sums (and the
minimum age for Age) does incur multiple walks of the tree, but this is the
in-memory representation, not the file system, so it's a bit of extra work I'm
willing to pay for implementation simplicity. Yet again, parent directories get
in the way of a simple representation and require a NoSize sum type so we can
correctly distinguish between zero size and no size.
data Size = Size Integer | NoSize

instance Num Size where
instance Show Size where

class Sizeable a where
  size :: a -> Size

instance Sizeable FileElement where
instance Sizeable DirElement where
instance (Sizeable a) => Sizeable [a] where
instance Sizeable DirectoryElement where
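Again the bodies are elided; here's a self-contained sketch of how the Num and Show instances might look, defining only the pieces that summing actually uses (the unit cutoffs and the choice to treat NoSize as an additive identity are my assumptions):

```haskell
data Size = Size Integer | NoSize
  deriving (Eq)

instance Num Size where
  Size a + Size b = Size (a + b)
  NoSize + b      = b            -- NoSize acts as identity when summing
  a      + NoSize = a
  fromInteger     = Size         -- so sum's starting 0 becomes Size 0
  -- the remaining Num methods aren't needed for summing sizes
  _ * _  = error "Size: (*) not needed"
  abs    = error "Size: abs not needed"
  signum = error "Size: signum not needed"
  negate = error "Size: negate not needed"

instance Show Size where
  show NoSize = ""
  show (Size b)
    | b < 1024     = show b                  <> " B"
    | b < 1024 ^ 2 = show (b `div` 1024)     <> " KB"
    | b < 1024 ^ 3 = show (b `div` 1024 ^ 2) <> " MB"
    | otherwise    = show (b `div` 1024 ^ 3) <> " GB"
```

With (+) and fromInteger defined, a directory's total is just sum over its children, and NoSize entries drop out of the total rather than poisoning it.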
Icons and Thumbnails
HTML generation
Detecting Changes
Nginx
server {
  listen 80;
  listen [::]:80;
  server_name public.anardil.net;

  proxy_ignore_headers Set-Cookie;
  add_header X-Cache-Status $upstream_cache_status;
  proxy_hide_header Strict-Transport-Security;

  # root
  location = / {
    proxy_pass https://mirror.sfo2.digitaloceanspaces.com/share/.indexes/index.html;
  }

  # directories
  location ~ ^/(.*)/$ {
    proxy_pass https://mirror.sfo2.digitaloceanspaces.com/share/.indexes${request_uri}index.html;
  }

  # files
  location ~ ^/(.*)$ {
    proxy_pass https://mirror.sfo2.digitaloceanspaces.com/share${request_uri};
  }
}