Haskell's a great language; it's efficient, consistent, terse, reliable, and so on. But if there's one thing Haskell's not, it's "batteries included". Compared to popular dynamic languages, such as Python and Ruby, Haskell has a very limited module library. Writing bindings to Python libraries (via the Python/C API) is an easy and practical approach to reusing the Python community's work.
Code: https://john-millikin.com/code/haskell-cpython (GitHub mirror)
In addition to standard Haskell development tools (GHC, Cabal, etc), building the example code requires the Python 3.1 headers. On Debian/Ubuntu, run apt-get install python3.1-dev.
Once necessary libraries are installed, you should be able to run the following test program. If the program won't compile, or crashes, double-check that GHC and Cabal are installed properly.
module Main where
import qualified Data.Text.IO as T
import qualified CPython as Py
main :: IO ()
main = do
  Py.initialize
  Py.getVersion >>= T.putStrLn
The program should give output like this:
$ runhaskell version.hs
3.1.2 (release31-maint, Sep 17 2010, 20:37:45)
[GCC 4.4.5]
Like any self-respecting language, Python has a variety of built-in types: integers, text, lists, tuples, etc. The first step to using any Python library is marshaling Haskell values into equivalent Python values. A full list of types supported by the CPython bindings is available in the API reference.
Let's marshal some basic values, using print() to see what Python makes of them:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.Text as T
import qualified Data.ByteString.Char8 as B
import System.IO (stdout)
import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py
main :: IO ()
main = do
  Py.initialize
  unicode <- Py.toUnicode "Hello World!"
  Py.print unicode stdout
  bytes <- Py.toBytes (B.pack "Hello\NULWorld!\ETX")
  Py.print bytes stdout
  float <- Py.toFloat 1.2345
  Py.print float stdout
  int <- Py.toInteger 12345
  Py.print int stdout
  list <- Py.toList [Py.toObject int]
  Py.print list stdout
  tuple <- Py.toTuple [Py.toObject int]
  Py.print tuple stdout
  set <- Py.toSet [Py.toObject int]
  Py.print set stdout
$ runhaskell marshaling.hs
'Hello World!'
b'Hello\x00World!\x03'
1.2345
12345
[12345]
(12345,)
{12345}
That's a big chunk to digest at once, so let's break it down a bit:
- The unicode, bytes, float, and int types match up precisely with Haskell's Text, ByteString, Double, and Integer, respectively. Byte literals are prefixed with b, to reduce confusion with unicode strings.
- The bindings use a SomeObject GADT to represent the contents of lists (and of arbitrary Python objects in general). Every value stored in a list must first be converted to a SomeObject, using Py.toObject.
- A Python set such as {1, 2, 3} is equivalent to Haskell's Data.Set.fromList [1, 2, 3].
Every Python object has a selection of methods, which can be called by external code to do stuff. If you've ever used a pseudo-OO language like C++ or Java, you've used methods before. Some methods are exposed directly via Python/C; others must be queried as attributes from an object.
When separate types have similar methods, those methods are usually standardized into a protocol. Python protocols are like Haskell typeclasses, except not type checked; any value with the appropriate methods is said to implement a protocol. For example, tuple, list, and bytes values all implement the sequence protocol.
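For readers coming from the Python side, here's a rough pure-Haskell sketch of the typeclass half of that comparison. The names Sequence, seqLength, and describe are made up for illustration and are not part of the CPython bindings; the point is only that Haskell checks membership in the "protocol" at compile time, where Python finds out at run time.

```haskell
module Main where

import qualified Data.ByteString.Char8 as B

-- Hypothetical typeclass standing in for a "sequence protocol".
-- In Python, any object with the right methods would qualify;
-- in Haskell, a type must declare an instance explicitly.
class Sequence s where
  seqLength :: s -> Int

instance Sequence [a] where
  seqLength = length

instance Sequence B.ByteString where
  seqLength = B.length

-- Works for any type with a Sequence instance; passing anything
-- else is rejected by the type checker rather than at run time.
describe :: Sequence s => s -> String
describe s = "sequence of length " ++ show (seqLength s)

main :: IO ()
main = do
  putStrLn (describe [1, 2, 3 :: Int])
  putStrLn (describe (B.pack "Hello"))
```

Running it prints "sequence of length 3" followed by "sequence of length 5".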
There's only so much you can do with the built-in types; sooner or later, you'll want to use one of Python's rich selection of libraries. That's why you're reading this, right?
Modules are exposed to the runtime as standard Python objects, and their contents (variables, procedures, class definitions) can be queried like any other object attribute. Let's look at an example of calling os.uname():
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.Text as T
import System.IO (stdout)
import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py
import qualified CPython.Types.Module as Py
main :: IO ()
main = do
  Py.initialize
  os <- Py.importModule "os"
  uname <- Py.getAttribute os =<< Py.toUnicode "uname"
  res <- Py.callArgs uname []
  Py.print res stdout
$ runhaskell import.hs
('Linux', 'desktop', '2.6.35-22-generic', '#35-Ubuntu SMP Sat Oct 16 20:45:36 UTC 2010', 'x86_64')
The getAttribute and callArgs functions are both part of the object protocol; the former works on all objects, while the latter works on objects with the __call__() magic method.
A module can be imported any number of times, but will only be loaded once per interpreter. This is very useful in Haskell, which has no native support for static data: if you need to call a Python method, just import its module at the call site.
Of course, even inexpensive operations can become a bottleneck if performed often enough; importing an already-loaded module is fast, but the full lookup still involves several string comparisons and a marshal. If the same Python function needs to be run many times, consider querying it once and caching the function object.
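Here's a minimal sketch of that caching idea, using only calls that already appear in the examples above. It looks up os.uname once and reuses the resulting function object; the loop count of 3 is arbitrary and just stands in for "many times".

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Monad (replicateM_)
import System.IO (stdout)

import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py
import qualified CPython.Types.Module as Py

main :: IO ()
main = do
  Py.initialize
  -- Pay for the import and the attribute lookup once...
  os <- Py.importModule "os"
  uname <- Py.getAttribute os =<< Py.toUnicode "uname"
  -- ...then reuse the cached function object for every call.
  replicateM_ 3 $ do
    res <- Py.callArgs uname []
    Py.print res stdout
```

The naive alternative would repeat the importModule and getAttribute steps inside the loop, which works but redoes the lookup on every iteration.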
If anybody's been playing around with the above examples, they might have run into the following problem:
$ runhaskell exceptions.hs
exceptions.hs: <CPython exception>
Because Python exceptions are themselves Python objects, printing them requires an IO action. In fact, because Python methods can perform arbitrary actions, printing the same exception twice might give different output! Therefore, the Show instance for Python exceptions is mostly worthless.
Every Python exception has three components: a class, a value, and an optional traceback (i.e. stack trace). The class is generally not interesting, but the value can be printed to see what went wrong:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Control.Exception as E
import qualified Data.Text as T
import System.IO (stdout)
import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types.Exception as Py
import qualified CPython.Types.Module as Py
main :: IO ()
main = do
  Py.initialize
  E.handle onException $ do
    Py.importModule "no-such-mod"
    return ()
onException :: Py.Exception -> IO ()
onException exc = Py.print (Py.exceptionValue exc) stdout
$ runhaskell exceptions.hs
ImportError('No module named no-such-mod',)
This'll do for quick and dirty scripts, but more complex errors will benefit from using the traceback module. Use procedures like print_exception() to get nice, pretty-printed error messages. If an exception originated in Python code, a stack trace will also be printed.
import qualified CPython.Constants as Py
import qualified CPython.Types as Py
-- ...
onException exc = do
  tb <- case Py.exceptionTraceback exc of
    Just obj -> return obj
    Nothing -> Py.none
  mod <- Py.importModule "traceback"
  proc <- Py.getAttribute mod =<< Py.toUnicode "print_exception"
  Py.callArgs proc [Py.exceptionType exc, Py.exceptionValue exc, tb]
  return ()
$ runhaskell exceptions.hs
ImportError: No module named no-such-mod
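Both handlers above print the error and move on. If you'd rather get the exception back as a value and decide what to do with it, something like the following sketch should work. The tryPython name is my own invention for illustration; it's nothing more than Control.Exception's try pinned to the Py.Exception type used earlier.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Control.Exception as E
import System.IO (stdout)

import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types.Exception as Py
import qualified CPython.Types.Module as Py

-- Hypothetical helper: run an action and capture any Python
-- exception as a Left value instead of letting it propagate.
tryPython :: IO a -> IO (Either Py.Exception a)
tryPython = E.try

main :: IO ()
main = do
  Py.initialize
  result <- tryPython (Py.importModule "no-such-mod")
  case result of
    -- The exception value is a Python object, so printing it is
    -- still an IO action.
    Left exc -> Py.print (Py.exceptionValue exc) stdout
    Right _ -> putStrLn "module imported"
```

Haskell exceptions of other types are not caught by tryPython and propagate as usual.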
Here's the payoff: implementing a Haskell library on top of an existing Python library. For this I'll use the mimetypes module, since it's simple and self-contained; more useful bindings might be to the Universal Feed Parser or docutils.
Even a simple binding is a bit big to read all at once as an example, so I've split it up. First are the imports and exports; no explanation needed, hopefully.
{-# LANGUAGE OverloadedStrings #-}
module MimeTypes
  ( MimeTypes
  , newMimeTypes
  , guessExtension
  , guessType
  ) where
import qualified Data.Text as T
import qualified CPython as Py
import qualified CPython.Constants as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py
import qualified CPython.Types.Module as Py
import qualified CPython.Types.Tuple as PyT
Next we have a data type for matching the mimetypes.MimeTypes class; it doesn't have the full complement of attributes, but enough for demonstration. newMimeTypes's parameters mimic those of the Python class's constructor.
Note that there are no Python types exposed in this module's public interface; clients of this module are insulated from the internal implementation. Aside from the absurdly heavy dependency list, there is no sign that this module is just a binding.
data MimeTypes = MimeTypes
  { mtGuessExtension :: Py.SomeObject
  , mtGuessType :: Py.SomeObject
  }
newMimeTypes :: [FilePath] -> Bool -> IO MimeTypes
newMimeTypes files strict = do
  Py.initialize
  mod <- Py.importModule "mimetypes"
  cls <- Py.getAttribute mod =<< Py.toUnicode "MimeTypes"
  -- toUnicode takes Text (see guessExtension), so pack each FilePath first
  pyFiles <- Py.toList =<< mapM (fmap Py.toObject . Py.toUnicode . T.pack) files
  pyStrict <- if strict then Py.true else Py.false
  mt <- Py.callArgs cls [Py.toObject pyFiles, Py.toObject pyStrict]
  pyGuessExtension <- Py.getAttribute mt =<< Py.toUnicode "guess_extension"
  pyGuessType <- Py.getAttribute mt =<< Py.toUnicode "guess_type"
  return $ MimeTypes pyGuessExtension pyGuessType
If you've any sense, one of the first things you thought after reading that was "golly, that sure is ugly". And you're right – it is ugly. Anybody who wants to make a serious go of binding large-scale Python libraries (such as Django) is strongly encouraged to write something similar to c2hs to automate the worst of it. Call it py2hs?
However, aside from being dreadfully verbose, it's not particularly complex. Parameters are marshaled from Haskell types into their Python equivalents, packaged up into a parameter list, and used to call the class constructor. After the MimeTypes object has been created, its guess_extension and guess_type methods are queried and cached for later use.
Which brings us to:
guessExtension :: MimeTypes -> T.Text -> Bool -> IO (Maybe T.Text)
guessExtension mt type_ strict = do
  pyType <- Py.toUnicode type_
  pyStrict <- if strict then Py.true else Py.false
  res <- Py.callArgs (mtGuessExtension mt) [Py.toObject pyType, Py.toObject pyStrict]
  textOrNone res
guessType :: MimeTypes -> T.Text -> Bool -> IO (Maybe T.Text, Maybe T.Text)
guessType mt url strict = do
  pyURL <- Py.toUnicode url
  pyStrict <- if strict then Py.true else Py.false
  res <- Py.callArgs (mtGuessType mt) [Py.toObject pyURL, Py.toObject pyStrict]
  Just tup <- Py.cast res
  [pyType, pyEncoding] <- Py.fromTuple tup
  type_ <- textOrNone pyType
  encoding <- textOrNone pyEncoding
  return (type_, encoding)
textOrNone :: Py.SomeObject -> IO (Maybe T.Text)
textOrNone obj = do
  isNone <- Py.isNone obj
  if isNone
    then return Nothing
    else do
      Just cast <- Py.cast obj
      Just `fmap` Py.fromUnicode cast
Really, it's more of the same: marshal parameters, call, dissect the result. Testing for None is common enough that I moved it to a helper; more complex bindings might have dozens of such helpers for special cases. Are you listening, py2hs author?
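In the same spirit, here's a sketch of two more helpers that would trim the repetition in the binding above. The names pyBool and getAttr are hypothetical, and I'm assuming the Object typeclass is exported from CPython.Protocols.Object alongside toObject. The demo calls the module-level mimetypes.guess_type function from Python's standard library rather than the class-based version.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Text as T
import System.IO (stdout)

import qualified CPython as Py
import qualified CPython.Constants as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py
import qualified CPython.Types.Module as Py

-- Hypothetical helper: marshal a Haskell Bool to Python's True or False.
pyBool :: Bool -> IO Py.SomeObject
pyBool b = fmap Py.toObject (if b then Py.true else Py.false)

-- Hypothetical helper: look up an attribute by its name.
-- Assumes the Object class is exported by CPython.Protocols.Object.
getAttr :: Py.Object o => o -> T.Text -> IO Py.SomeObject
getAttr obj name = Py.getAttribute obj =<< Py.toUnicode name

main :: IO ()
main = do
  Py.initialize
  mimetypes <- Py.importModule "mimetypes"
  guess <- getAttr mimetypes "guess_type"
  url <- Py.toUnicode "foo.txt"
  strict <- pyBool True
  res <- Py.callArgs guess [Py.toObject url, strict]
  Py.print res stdout
```

With these in place, each "if strict then Py.true else Py.false" and each "Py.getAttribute ... =<< Py.toUnicode ..." pair in the binding collapses to a single call.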
Finally, load up our new binding into GHCi and see if it works:
$ ghci -XOverloadedStrings
GHCi, version 6.12.1: http://www.haskell.org/ghc/ :? for help
Prelude> :l MimeTypes
[1 of 1] Compiling MimeTypes ( MimeTypes.hs, interpreted )
Ok, modules loaded: MimeTypes.
*MimeTypes> types <- newMimeTypes [] False
It loaded! And it didn't crash! We're off to a good start; let's see if our guessType works:
*MimeTypes> import Data.Text
*MimeTypes Data.Text> guessType types "foo.txt" True
(Just "text/plain",Nothing)
*MimeTypes Data.Text> guessType types "foo.html.gz" True
(Just "text/html",Just "gzip")
Looks good; it's picking up the file type, and the optional encoding. Now for guessExtension:
*MimeTypes Data.Text> guessExtension types "text/plain" True
Just ".ksh"
Hmm.
http://bugs.python.org/issue1043134