Understanding Iteratees

Iteratees are an abstraction discovered by Oleg Kiselyov, which provide a performant, predictable, and safe alternative to lazy I/O. Though the data types involved are simple, their relationship to incremental processing is not obvious, and existing documentation ranges in quality from merely dense to outright baffling. This article attempts to clarify the concepts and use underlying iteratees.

Please note that these are my notes, as I attempt to implement iteratee–based libraries. I may have misunderstood minor or major parts of iteratees. If in doubt, the final authority is Oleg -- though understanding his answers requires a saving throw vs. confusion. Please e-mail me any comments or suggestions.

2010–08–19: the code available in this article has been expanded and packaged as the enumerator library.

Iteratees vs. Lazy I/O

Lazy I/O – eg, hGetContents and friends – is known to have several shortcomings. Most notably, IO errors can occur in pure code and Handles may remain open for arbitrary periods of time. Oleg notes[1] that this can lead to unexpected failures, due to resource exhaustion.

Iteratees do not suffer from these problems. Their resource use is bounded and predictable, and the type system provides guarantees that limited resources are released when no longer needed. Notably, iteratees can process arbitrarily large inputs in constant space.

Implementing iteratees

There are at least five generic iteratee libraries, each with differing type signatures and semantics. Oleg's Iteratee.hs, IterateeM.hs, & IterateeMCPS.hs, John Lato's iteratee package, and a post by Per Magnus Therning.

This page documents a sixth implementation, based on IterateeM, with simplified error handling and naming conventions (hopefully) more obvious to the average Haskell programmer.

data Chunk a
	= Chunk [a]
	| EOF
	deriving (Show, Eq)

data Step e a m b
	= Continue (Chunk a -> Iteratee e a m b)
	| Yield b (Chunk a)
	| Error e

newtype Iteratee e a m b = Iteratee
	{ runIteratee :: m (Step e a m b)
	}

In general, an iteratee begins in the Continue state. As each chunk is passed to the continuation, the iteratee may return the next step, which is one of:

Continue: The iteratee requires more input before it can produce a result.
Yield: The iteratee has received enough input to generate a result, along with left–over input. If the iteratee will no longer accept input, it should yield EOF. If no input remains, but the iteratee can still accept more, it should yield Chunk [].
Error: The iteratee experienced an error which prevents it from proceeding further. The type of error contained will depend on the enumerator and/or iteratee – common choices are String and SomeException.

Based on these semantics, some simple instances can be created:

instance Monoid (Chunk a) where
	mempty = Chunk []
	mappend (Chunk xs) (Chunk ys) = Chunk $ xs ++ ys
	mappend _ _ = EOF

instance Functor Chunk where
	fmap _ EOF = EOF
	fmap f (Chunk xs) = Chunk $ map f xs

instance (Show a, Show b, Show e) => Show (Step e a m b) where
	showsPrec d step = showParen (d > 10) $ case step of
		(Continue _) -> s "Continue"
		(Yield b chunk) -> s "Yield " . sp b . s " " . sp chunk
		(Error err) -> s "Error " . sp err
		where
			s = showString
			sp :: Show a => a -> ShowS
			sp = showsPrec 11

Slightly more complex is the Monad instance for iteratees. The first iteratee is run, and if it yielded a value, that value is fed into the second iteratee.

instance Monad m => Monad (Iteratee e a m) where
	return x = Iteratee . return . Yield x $ Chunk []
	m >>= f = Iteratee $ runIteratee m >>= \mStep -> case mStep of
		Continue k -> return $ Continue ((>>= f) . k)
		Error err -> return $ Error err
		Yield x (Chunk []) -> runIteratee $ f x
		Yield x chunk -> runIteratee (f x) >>= \r -> case r of
			Continue k -> runIteratee $ k chunk
			Error err -> return $ Error err
			
			-- runIteratee (f x) does not consume any input; if it
			-- returns Yield, then its "extra" input must be
			-- (Chunk []) and can be ignored.
			Yield x' _ -> return $ Yield x' chunk

instance MonadTrans (Iteratee e a) where
	lift m = Iteratee $ m >>= runIteratee . return

instance MonadIO m => MonadIO (Iteratee e a m) where
	liftIO = lift . liftIO

instance Monad m => Functor (Iteratee e a m) where
	fmap f i = i >>= return . f

Next, lets define a few simple primitive combinators for building iteratees from pure functions:

returnI :: Monad m => Step e a m b -> Iteratee e a m b
returnI = Iteratee . return

liftI :: Monad m => (Chunk a -> Step e a m b) -> Iteratee e a m b
liftI k = returnI $ Continue (returnI . k)

yield :: Monad m => b -> Chunk a -> Iteratee e a m b
yield x chunk = returnI $ Yield x chunk

continue :: Monad m => (Chunk a -> Iteratee e a m b) -> Iteratee e a m b
continue k = returnI $ Continue k

throwError :: Monad m => e -> Iteratee e a m b
throwError err = returnI $ Error err

These combinators are sufficient to define simple iteratees; for example, a variation of dropWhile:

-- import Prelude hiding (dropWhile)
-- import qualified Prelude as Prelude
dropWhile :: Monad m => (a -> Bool) -> Iteratee e a m ()
dropWhile f = liftI step where
	step (Chunk xs) = case Prelude.dropWhile f xs of
		[] -> Continue $ returnI . step
		xs' -> Yield () (Chunk xs')
	step EOF = Yield () EOF

Or an iteratee for printing received chunks to stdout, useful for debugging:

printChunks :: MonadIO m => Show a => Bool -> Iteratee e a m ()
printChunks printEmpty = continue step where
	step (Chunk []) | not printEmpty = continue step
	step (Chunk xs) = liftIO (print xs) >> continue step
	step EOF = liftIO (putStrLn "EOF") >> yield () EOF

Finally, to extract the final result from an iteratee, it's sufficient to feed it EOF and check the returned Step. Note that a "well–behaved" iteratee continuation will always return Yield or Error in response to EOF – iteratees which return Continue may loop forever, depending on their monadic behavior.

run :: Monad m => Iteratee e a m b -> m (Either e b)
run i = runIteratee i >>= check where
	check (Continue k) = runIteratee (k EOF) >>= check
	check (Yield x _) = return $ Right x
	check (Error e) = return $ Left e

Enumerators

Iteratees consume data from a sequence of input chunks. To generate those chunks, we define enumerators (and enumerator composition operators).

type Enumerator e a m b = Step e a m b -> Iteratee e a m b

infixl 1 >>==, ==<<
(>>==) :: Monad m =>
	Iteratee e a m b ->
	(Step e a m b -> Iteratee e a' m b') ->
	Iteratee e a' m b'
m >>== f = Iteratee (runIteratee m >>= runIteratee . f)

(==<<):: Monad m =>
	(Step e a m b -> Iteratee e a' m b') ->
	Iteratee e a m b ->
	Iteratee e a' m b'
f ==<< m = m >>== f

Note that the Enumerator type is semantically equivalent to:

type Enumerator e a m b = Step e a m b -> m (Step e a m b)

Simple enumerators can be defined in terms of existing combinators. The basic format of an enumerator is that when it receives a Continue step, it passes a chunk to the continuation to generate its returned iteratee. Other step types are passed through unchanged.

enumList :: Monad m => [a] -> Enumerator e a m b
enumList xs (Continue k) = case xs of
	[] -> k EOF
	(x:xs') -> k (Chunk [x]) >>== enumList xs'
enumList _ step = returnI step

More complex enumerators require building the result manually. Note that while the recursive step is much larger in this example, the fundamental layout (loop on Continue, pass on others) remains.

enumHandle :: Handle -> Enumerator String ByteString IO b
enumHandle h = Iteratee . allocaBytes bufferSize . loop where
	bufferSize = 4096
	loop (Continue k) = do_read k
	loop step = const $ return step
	do_read k p = do
		n <- try $ hGetBuf h p bufferSize
		case (n :: Either SomeException Int) of
			Left err -> return $ Error $ show err
			Right 0 -> return $ Continue k
			Right n' -> do
				bytes <- packCStringLen (p, n')
				step <- runIteratee (k (Chunk [bytes]))
				loop step p

In some cases, it might make more sense to define this enumerator in terms of bytes rather than byte strings. The required changes are minor – the bytes are stored directly in the Chunk list.

enumHandle :: Handle -> Enumerator String Word8 IO b
…
			Right n' -> do
				bytes <- F.peekArray n' p
				step <- runIteratee (k (Chunk bytes))
				loop step p

Enumeratees

Enumerators generate data, iteratees consume it. When a value needs to generate a stream using another stream as input, it is named an enumeratee.

type Enumeratee e aOut aIn m b = Step e aIn m b -> Iteratee e aOut m (Step e aIn m b)

Most interesting transformations in iteratee-based code are enumeratees. For example, map can be encoded as an enumeratee:

checkDone :: Monad m =>
	((Chunk a -> Iteratee e a m b) -> Iteratee e a' m (Step e a m b)) ->
	Enumeratee e a' a m b
checkDone _ (Yield x chunk) = return $ Yield x chunk
checkDone f (Continue k) = f k
checkDone _ (Error err) = throwError err

mapI :: Monad m => (ao -> ai) -> Enumeratee e ao ai m b
mapI f = checkDone $ continue . step where
	step k EOF = yield (Continue k) EOF
	step k (Chunk []) = continue $ step k
	step k chunk = k (fmap f chunk) >>== mapI f

A more complex example: sequenceI converts an iteratee to an enumeratee, by feeding it input until it returns EOF. This is useful for chaining iteratees together, to support embedded streams.

finished :: Monad m => Iteratee e a m Bool
finished = liftI $ \chunk -> case chunk of
	EOF -> Yield True EOF
	_   -> Yield False chunk

sequenceI :: Monad m => Iteratee e ao m ai -> Enumeratee e ao ai m b
sequenceI i = checkDone check where
	check k = finished >>= \f -> if f
		then yield (Continue k) EOF
		else step k
	step k = i >>= \v -> k (Chunk [v]) >>== sequenceI i

A join combinator is useful for "extracting" an output stream from an enumeratee's result.

joinI :: Monad m => Iteratee e a m (Step e a' m b) -> Iteratee e a m b
joinI outer = outer >>= check where
	check (Continue k) = k EOF >>== check
	check (Yield x _) = return x
	check (Error e) = throwError e

Oleg Kiselyov – Lazy vs correct IO