haskell-cpython: Calling Python libraries from Haskell

Haskell's a great language; it's efficient, consistent, terse, reliable, and so on. But if there's one thing Haskell's not, it's "batteries included". Compared to popular dynamic languages, such as Python and Ruby, Haskell has a very limited module library. Writing bindings to Python libraries (via the Python/C API) is an easy and practical approach to reusing the Python community's work.

Code: https://john-millikin.com/code/haskell-cpython (GitHub mirror)

Preflight

In addition to standard Haskell development tools (GHC, Cabal, etc.), building the example code requires the Python 3.1 headers. On Debian/Ubuntu, run apt-get install python3.1-dev.

Once necessary libraries are installed, you should be able to run the following test program. If the program won't compile, or crashes, double-check that GHC and Cabal are installed properly.

module Main where
import qualified Data.Text.IO as T
import qualified CPython as Py

main :: IO ()
main = do
	Py.initialize
	Py.getVersion >>= T.putStrLn

The program should give output like this:

$ runhaskell version.hs
3.1.2 (release31-maint, Sep 17 2010, 20:37:45)
[GCC 4.4.5]

Python's built-in types

Like any self-respecting language, Python has a variety of built-in types: integers, text, lists, tuples, etc. The first step to using any Python library is marshaling Haskell values into equivalent Python values. A full list of types supported by the CPython bindings is available in the API reference.

Let's marshal some basic stuff, using print() to see what Python makes of it:

{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.Text as T
import qualified Data.ByteString.Char8 as B
import System.IO (stdout)
import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py

main :: IO ()
main = do
	Py.initialize
	unicode <- Py.toUnicode "Hello World!"
	Py.print unicode stdout
	
	bytes <- Py.toBytes (B.pack "Hello\NULWorld!\ETX")
	Py.print bytes stdout
	
	float <- Py.toFloat 1.2345
	Py.print float stdout
	
	int <- Py.toInteger 12345
	Py.print int stdout
	
	list <- Py.toList [Py.toObject int]
	Py.print list stdout
	
	tuple <- Py.toTuple [Py.toObject int]
	Py.print tuple stdout
	
	set <- Py.toSet [Py.toObject int]
	Py.print set stdout

$ runhaskell marshaling.hs
'Hello World!'
b'Hello\x00World!\x03'
1.2345
12345
[12345]
(12345,)
{12345}

That's a big chunk to digest at once, so let's break it down a bit:

Python's unicode, bytes, float, and int types match up precisely with Haskell's Text, ByteString, Double, and Integer, respectively. Byte literals are prefixed with b, to reduce confusion with unicode strings.
Python's tuples are similar to Haskell's, except they may contain any number of elements. Single-element tuples are indicated by a trailing comma.
Python's lists are heterogeneous and support constant-time indexing; in Haskell, we use the SomeObject GADT to represent the contents of lists (and of arbitrary Python objects in general). Every value stored in a list must be first converted to a SomeObject, using Py.toObject.
Python's sets are also heterogeneous, with membership tests that are constant-time on average; the special syntax {1, 2, 3} is roughly equivalent to Haskell's Data.Set.fromList [1, 2, 3].
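
Marshaling also works in the other direction. Here is a minimal round-trip sketch; Py.fromUnicode appears later in this article, while Py.fromInteger and Py.fromFloat are assumed to exist as mirrors of Py.toInteger and Py.toFloat (check the API reference for the exact names):

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Text as T
import qualified CPython as Py
import qualified CPython.Types as Py

-- Send three Haskell values into Python and read them back.
-- Py.fromInteger and Py.fromFloat are assumptions: they are expected
-- to be the inverses of Py.toInteger and Py.toFloat.
roundTrip :: IO (T.Text, Integer, Double)
roundTrip = do
	text <- Py.fromUnicode =<< Py.toUnicode "Hello World!"
	int <- Py.fromInteger =<< Py.toInteger 12345
	float <- Py.fromFloat =<< Py.toFloat 1.2345
	return (text, int, float)

main :: IO ()
main = do
	Py.initialize
	roundTrip >>= print
```

If those assumptions hold, the values should come back unchanged.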

Methods and Protocols

Every Python object has a selection of methods, which can be called by external code to do stuff. If you've ever used a pseudo-OO language like C++ or Java, you've used methods before. Some methods are exposed directly via Python/C; others must be queried as attributes from an object.

When separate types have similar methods, those methods are usually standardized into a protocol. Python protocols are like Haskell typeclasses, except not type checked; any value with the appropriate methods is said to implement a protocol. For example, tuple, list, and bytes values all implement the sequence protocol.

Importing modules

There's only so much you can do with the built-in types; sooner or later, you'll want to use one of Python's rich selection of libraries. That's why you're reading this, right?

Modules are exposed to the runtime as standard Python objects, and their contents (variables, procedures, class definitions) can be queried like any other object attribute. Let's look at an example of calling os.uname():

{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Text as T
import System.IO (stdout)

import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py
import qualified CPython.Types.Module as Py

main :: IO ()
main = do
	Py.initialize
	
	os <- Py.importModule "os"
	uname <- Py.getAttribute os =<< Py.toUnicode "uname"
	res <- Py.callArgs uname []
	Py.print res stdout

$ runhaskell import.hs
('Linux', 'desktop', '2.6.35-22-generic', '#35-Ubuntu SMP Sat Oct 16 20:45:36 UTC 2010', 'x86_64')

The getAttribute and callArgs functions are both part of the object protocol; the former works on all objects, while the latter works on objects with the __call__() magic method.

A module can be imported any number of times, but will only be loaded once per interpreter. This comes in very handy in Haskell, which has no native support for static data – if you need to call a Python method, just import its module at the call site.

Of course, even inexpensive operations can become a bottleneck if performed often enough; importing an already-loaded module is fast, but the full lookup still involves several string comparisons and a marshal. If the same Python function needs to be run many times, consider querying it once and caching the function object.
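
A minimal sketch of that caching approach, using only the calls from the os.uname() example above; lookupUname is a hypothetical helper name, not part of the bindings:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Monad (replicateM_)
import System.IO (stdout)

import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py
import qualified CPython.Types.Module as Py

-- Hypothetical helper: import "os" and query "uname" once, returning
-- the function object so callers can reuse it.
lookupUname :: IO Py.SomeObject
lookupUname = do
	os <- Py.importModule "os"
	Py.getAttribute os =<< Py.toUnicode "uname"

main :: IO ()
main = do
	Py.initialize
	uname <- lookupUname
	-- Each call below reuses the cached object; no further imports
	-- or attribute lookups take place.
	replicateM_ 3 $ do
		res <- Py.callArgs uname []
		Py.print res stdout
```

The same idea is used in the mimetypes binding later on, where the function objects are stored in a record.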

Catching Exceptions

If anybody's been playing around with the above examples, they might have run into the following problem:

$ runhaskell exceptions.hs
exceptions.hs: <CPython exception>

Because Python exceptions are themselves Python objects, printing them requires an IO action. In fact, because Python methods can perform arbitrary actions, printing the same exception twice might give different output! Therefore, the Show instance for Python exceptions is mostly worthless.

Every Python exception has three components: a class, a value, and an optional traceback (i.e. stack trace). The class is generally not interesting, but the value can be printed to see what went wrong:

{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Control.Exception as E
import qualified Data.Text as T
import System.IO (stdout)

import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types.Exception as Py
import qualified CPython.Types.Module as Py

main :: IO ()
main = do
	Py.initialize
	E.handle onException $ do
		Py.importModule "no-such-mod"
		return ()

onException :: Py.Exception -> IO ()
onException exc = Py.print (Py.exceptionValue exc) stdout

$ runhaskell exceptions.hs
ImportError('No module named no-such-mod',)

This'll do for quick and dirty scripts, but more complex errors will benefit from using the traceback module. Use procedures like print_exception() to get nice, pretty-printed error messages. If an exception originated in Python code, a stack trace will also be printed.

import qualified CPython.Constants as Py
import qualified CPython.Types as Py

-- ...

onException exc = do
	tb <- case Py.exceptionTraceback exc of
		Just obj -> return obj
		Nothing -> Py.none
	mod <- Py.importModule "traceback"
	proc <- Py.getAttribute mod =<< Py.toUnicode "print_exception"
	Py.callArgs proc [Py.exceptionType exc, Py.exceptionValue exc, tb]
	return ()

$ runhaskell exceptions.hs
ImportError: No module named no-such-mod
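
When a failure is better handled as an ordinary value than in a handler, the standard Control.Exception.try should work just as well as handle. A small sketch, where tryImport is a hypothetical helper name and everything else comes from the examples above:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Control.Exception as E
import qualified Data.Text as T
import System.IO (stdout)

import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types.Exception as Py
import qualified CPython.Types.Module as Py

-- Hypothetical helper: a failed import becomes a Left value instead of
-- propagating as a Haskell exception.
tryImport :: T.Text -> IO (Either Py.Exception Py.SomeObject)
tryImport name = E.try (fmap Py.toObject (Py.importModule name))

main :: IO ()
main = do
	Py.initialize
	result <- tryImport "no-such-mod"
	case result of
		Left exc -> Py.print (Py.exceptionValue exc) stdout
		Right _ -> putStrLn "import succeeded"
```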

Putting it all together: binding 'mimetypes'

Here's the payoff: implementing a Haskell library on top of an existing Python library. For this I'll use the mimetypes module, since it's simple and self-contained; more useful bindings might be to the Universal Feed Parser or docutils.

Even a simple binding is a bit big to read all at once as an example, so I've split it up. First are the imports and exports; no explanation needed, hopefully.

{-# LANGUAGE OverloadedStrings #-}
module MimeTypes
	( MimeTypes
	, newMimeTypes
	, guessExtension
	, guessType
	) where

import qualified Data.Text as T
import qualified CPython as Py
import qualified CPython.Constants as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py
import qualified CPython.Types.Module as Py
import qualified CPython.Types.Tuple as PyT

Next we have a data type for matching the mimetypes.MimeTypes class; it doesn't have the full complement of attributes, but enough for demonstration. newMimeTypes's parameters mimic those of the Python class's constructor.

Note that there are no Python types exposed in this module's public interface; clients of this module are insulated from the internal implementation. Aside from the absurdly heavy dependency list, there is no sign that this module is just a binding.

data MimeTypes = MimeTypes
	{ mtGuessExtension :: Py.SomeObject
	, mtGuessType :: Py.SomeObject
	}

newMimeTypes :: [FilePath] -> Bool -> IO MimeTypes
newMimeTypes files strict = do
	Py.initialize
	mod <- Py.importModule "mimetypes"
	cls <- Py.getAttribute mod =<< Py.toUnicode "MimeTypes"
	-- FilePath is a String, so convert to Text before marshaling
	pyFiles <- Py.toList =<< mapM (fmap Py.toObject . Py.toUnicode . T.pack) files
	pyStrict <- if strict then Py.true else Py.false
	mt <- Py.callArgs cls [Py.toObject pyFiles, Py.toObject pyStrict]
	
	pyGuessExtension <- Py.getAttribute mt =<< Py.toUnicode "guess_extension"
	pyGuessType <- Py.getAttribute mt =<< Py.toUnicode "guess_type"
	return $ MimeTypes pyGuessExtension pyGuessType

If you've any sense, one of the first things you thought after reading that was "golly, that sure is ugly". And you're right – it is ugly. Anybody who wants to make a serious go of binding large-scale Python libraries (such as Django) is strongly encouraged to write something similar to c2hs to automate the worst of it. Call it py2hs?

However, aside from being dreadfully verbose, it's not particularly complex. Parameters are marshaled from Haskell types into their Python equivalents, packaged up into a parameter list, and used to call the class constructor. After the MimeTypes object has been created, its guess_extension and guess_type methods are queried and cached for later use.

Which brings us to:

guessExtension :: MimeTypes -> T.Text -> Bool -> IO (Maybe T.Text)
guessExtension mt type_ strict = do
	pyType <- Py.toUnicode type_
	pyStrict <- if strict then Py.true else Py.false
	res <- Py.callArgs (mtGuessExtension mt) [Py.toObject pyType, Py.toObject pyStrict]
	textOrNone res

guessType :: MimeTypes -> T.Text -> Bool -> IO (Maybe T.Text, Maybe T.Text)
guessType mt url strict = do
	pyURL <- Py.toUnicode url
	pyStrict <- if strict then Py.true else Py.false
	res <- Py.callArgs (mtGuessType mt) [Py.toObject pyURL, Py.toObject pyStrict]
	Just tup <- Py.cast res
	[pyType, pyEncoding] <- Py.fromTuple tup
	type_ <- textOrNone pyType
	encoding <- textOrNone pyEncoding
	return (type_, encoding)

textOrNone :: Py.SomeObject -> IO (Maybe T.Text)
textOrNone obj = do
	isNone <- Py.isNone obj
	if isNone
		then return Nothing
		else do
			Just cast <- Py.cast obj
			Just `fmap` Py.fromUnicode cast

Really, it's more of the same: marshal parameters, call, dissect the result. Testing for None is common enough that I moved it to a helper; more complex bindings might have dozens of such helpers for special cases. Are you listening, py2hs author?
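
Two more helpers in the same spirit, as a sketch: getAttr and pyBool are hypothetical names, not part of the bindings. The signature of getAttr assumes the Object class is exported from CPython.Protocols.Object.

```haskell
module Helpers
	( getAttr
	, pyBool
	) where

import qualified Data.Text as T
import qualified CPython.Constants as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py

-- Hypothetical helper: fetch an attribute by name, hiding the
-- Text-to-unicode marshaling that precedes every getAttribute call.
-- Assumes the Object class is exported for use in constraints.
getAttr :: Py.Object self => self -> T.Text -> IO Py.SomeObject
getAttr obj name = Py.getAttribute obj =<< Py.toUnicode name

-- Hypothetical helper: Python's True or False for a Haskell Bool.
pyBool :: Bool -> IO Py.SomeObject
pyBool b = fmap Py.toObject (if b then Py.true else Py.false)
```

With these, a lookup such as Py.getAttribute mod =<< Py.toUnicode "MimeTypes" shrinks to getAttr mod "MimeTypes", and each if strict then Py.true else Py.false becomes pyBool strict.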

Finally, load up our new binding into GHCi and see if it works:

$ ghci -XOverloadedStrings
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Prelude> :l MimeTypes
[1 of 1] Compiling MimeTypes        ( MimeTypes.hs, interpreted )
Ok, modules loaded: MimeTypes.
*MimeTypes> types <- newMimeTypes [] False

It loaded! And it didn't crash! We're off to a good start; let's see if our guessType works:

*MimeTypes> import Data.Text
*MimeTypes Data.Text> guessType types "foo.txt" True
(Just "text/plain",Nothing)
*MimeTypes Data.Text> guessType types "foo.html.gz" True
(Just "text/html",Just "gzip")

Looks good; it's picking up the file type, and the optional encoding. Now for guessExtension:

*MimeTypes Data.Text> guessExtension types "text/plain" True
Just ".ksh"

Hmm.

http://bugs.python.org/issue1043134

Hmm 🤔