Now on ScienceBlogs: Charles Darwin February 12, 1809 - April 19, 1882

ScienceBlogs Book Club: Inside the Outbreaks

Good Math, Bad Math

Finding the fun in good math; Shredding bad math and squashing the crackpots who espouse it.

Search

Profile

markcc.jpg
Mark Chu-Carroll (aka MarkCC) is a PhD Computer Scientist, who works for Google as a Software Engineer. My professional interests center on programming languages and tools, and how to improve the languages and tools that are used for building complex software systems.

Donors Choose

Other Information

Add this blog to my Technorati Favorites!

Recent Posts

Recent Comments

Categories

Blogroll

Old Topic Indices

Great Online Books

« The Balance of Screening Tests | Main | A Special Midweek Recipe: Ad-Libbed Cranberry Chutney »

Creating User-Defined Types in Haskell

Category: Haskell
Posted on: November 23, 2009 9:55 AM, by Mark C. Chu-Carroll

  • (This is an edited repost of one of the posts from the earlier version of my Haskell tutorial.)
  • (This file is a literate haskell script. If you save it as a file whose name ends in ".lhs", it's actually loadable and runnable in GHCI or Hugs.)

Like any other modern programming language, Haskell has excellent support for building user-defined data types. In fact, even though Haskell is very much not object-oriented, most Haskell programs end up being centered around the design and implementation of data structures, using constructions called classes and instances.

In this post, we're going to start looking at how you implement data types in Haskell. What I'm going to do is start by showing you how to implement a simple binary search tree. I'll start with a very simple version, and then build up from there.

There are a variety of useful primitive types in Haskell - a wide range of numeric types (arbitrary sized integers, rationals, floats, etc.), characters, and so on. There are also a lot of basic compound types provided by the standard library - the usual tuples, arrays, lists, and so on. But when you go to build your own datatypes, the main thing that you use in Haskell is something called an algebraic data types. An algebraic data type consists of a set of variants, each of which is (essentially) a tuple which has been tagged with a marker that identifies which variant it belongs to. When you define an algebraic type, Haskell creates a set of constructor functions for the type, one for each variant.

As usual, talking about things like this in theory is hopelessly vague, so let's dive in and look at some code. We'll start with a very simple binary search tree containing integers.

The basics are pretty straightforward. A binary search tree is a data structure consisting of a collection of nodes. Every node contains a value, a left child, and a right child. The left and right children each contain another binary search tree which could contain some values, or which could be empty.

When you're describing a data type and you use the word "or", that's your clue that you're looking at a variant. The tree type has two cases: trees containing values, and empty trees. A declaration of that type would look like:


>data SimpleIntBST = SIBNode Integer SimpleIntBST SimpleIntBST
>                  | SIBEmpty

What this says is: we're defining a data type named SimpleIntBST. There are two kinds of SimpleIntBST values. The first kind is an SIBNode, which contains an integer value, a left child, and a right child. The other kind is an SIBEmpty, which contains no values. In a language like C++ or Java, you'd have each node contain pointers to children nodes, and an empty child would be represented by a null pointer. But in Haskell, there are no pointers - and more importantly, null doesn't have a well-defined type. In Haskell, when you've got something like a null, you need to explicity define the null case using a variant, like SIBEmpty.

In the declaration, SIBNode and SIBEmpty are the names of the constructor functions that are used to create values. To create an SIBNode, you call it like a normal function with parameters of the correct type: for example, to create an SIBNode containing the value "4" with empty left and right children, you'd write:

SIBNode 4 SIBEmpty SIBEmpty

Let's try that in GHCI:

*Main> SIBEmpty
<interactive>:1:0:
    No instance for (Show SimpleIntBST)
      arising from use of `print' at <interactive>:1:0-7
    Possible fix: add an instance declaration for (Show SimpleIntBST)
    In the expression: print it
    In a 'do' expression: print it

What's wrong? The interpreter tried to print out the value returned by the expression. The interpreter outputs values using the print function. In general in Haskell, when we have a problem like this, we look at types. What's the type of the print function?

*Main> :type print
print :: (Show a) => a -> IO ()
*Main>

The parameter to print must be a value of some type that is an instance of the Show class. But we didn't make our binary tree an instance of Show. Fortunately, since Show is such a fundamental interface, Haskell provides a nifty shorthand to make the compiler generate the code necessary to provide an implementation of whatever is needed for Show, called a deriving clause. Let's try again, this time with a deriving clause. We'll call the new version "IntBST" rather than "SimpleIntBST", because treating this post as a literate file, we can't re-declare "SimpleIntBST".


>data IntBST = IBNode Integer IntBST IntBST
>                  | IBEmpty deriving (Show)

Now, loading that into GHC, we'll see:

*Main> IBNode 4 IBEmpty IBEmpty
IBNode 4 IBEmpty IBEmpty
*Main>

Which is exactly what we want: our binary search tree is now printable.

A data structure like this isn't useful without some operations to work with it. So let's go ahead and do some implementing! We'll start with an insert operator:


>ibstInsert :: IntBST -> Integer -> IntBST
>ibstInsert (IBNode i left right) v 
>        | i > v = IBNode i (ibstInsert left v) right
>        | otherwise = IBNode i left (ibstInsert right v)
>ibstInsert IBEmpty v = (IBNode v IBEmpty IBEmpty)

So let's look at that a bit. It starts off with an explicit type declaration. That's not strictly necessary: the compiler can infer the type of the function without any help. But it's a good practice to put it there: it's a useful bit of documentation for a human reader, and it's useful for catching problems during compilation - if you made any mistakes writing the function, the compiler might be able to catch them because the inferred type won't match the declaration.

After the type declaration, we get to the real definition, which is where things get interesting. I'm sure you noticed that there was something a bit strange about the type declaration for IntBST. In most non-functional languages, you define a new data type as some kind of a record or structure, which contains a list of named fields; when you use the structure, you access its fields by derefencing an instance of the data structure using the name of a field. But in our Haskell binary search tree, the values belonging to an instance of the data structure don't have names. Like most functional languages, Haskell uses pattern matching to access the pieces of a data structure.

A pattern looks like a value expression, except that it may contain some unbound variables. The pattern can also be followed by a guard expression which is a boolean expression using variables bound in the pattern. If the guard is included, it's separated from the patterns by a | character, which is generally read as "such that". So the first declaration line of ibstInsert has two patterns: (IBNode i left right), and v. It's followed by a guard. So that full declaration line can be read as "ibstInsert of an IntBST matching '(IBNode i left right)' and a value matching 'v' such that 'i > v' returns '(IBNode i left right)'.

As a shorthand, if you have a guard expression on an indented line following a pattern with a guard, the pattern is reused with the second guard. So the second line starting with a "|" is used if the first guard fails. As a guard expression, "otherwise" can only be used as the last guard for a pattern, and it's always true.

So the first case of the declaration of ibstInsert, in full, does the following:

  1. If the tree parameter is an IBNode, then bind "i" to the value element of the node, "left" to the left subtree, and "right" to the right subtree; and bind the variable "v" to the second parameter, which is the value to be inserted to the tree.
  2. If the value in the tree node is larger than the value being inserted, then return a tree that's the same as this, except that the value is inserted into the left subtree.
  3. Otherwise, return a tree like this one, except that the new value is inserted into the right subtree.

The second case says that if the node is an empty, then return a new non-empty node with the new value as its value, and with empty left and right children.

If you look at the code now that you know how to read it, one thing that should hopefully be rather striking about it is just how much it looks like the explanatory pseudo-code that you'd find in an algorithms textbook. It's very straightforward code, written in a very easy to read style. (As a brief aside, this is an example of one of the things that really attracts me to Haskell programming. Code in Haskell tends to very naturally decompose into very small definitions that look and read like explanatory code from a textbook.)

So let's look at a quick example of using this to build a tree.

> treeone = ibstInsert (ibstInsert (ibstInsert (ibstInsert (ibstInsert IBEmpty 3) 5) 2) 17) 10
> treetwo = (ibstInsert (ibstInsert (ibstInsert (ibstInsert treeone 4) 6) 1) 9)

That gives us two trees, one of which contains a subset of the values in the other. We can ask ghc to show us the values of them:

*Main> treeone
IBNode 3 (IBNode 2 IBEmpty IBEmpty) (IBNode 5 IBEmpty (IBNode 17 (IBNode 10 IBEmpty IBEmpty) IBEmpty))
*Main*gt; treetwo
IBNode 3 (IBNode 2 (IBNode 1 IBEmpty IBEmpty) IBEmpty) (IBNode 5 (IBNode 4 IBEmpty IBEmpty) (IBNode 17 (IBNode 10 (IBNode 6 IBEmpty (IBNode 9 IBEmpty IBEmpty)) IBEmpty) IBEmpty))
*Main>

The other thing that we want to be able to do with a binary search tree is find nodes in it. Since our current BST only contains integers, we'll just write a function that returns a boolean value indicating whether or not a particular value is in the tree.


>
>ibstSearch :: IntBST -> Integer -> Bool
>ibstSearch (IBNode k left right) v
>    | k == v = True
>        | k > v = ibstSearch left v
>        | otherwise = ibstSearch right v
>ibstSearch IBEmpty _ = False
>

This one should be pretty easy to read without much explanation. To be pedantic, I'll walk through it quickly. If it's a non-empty node, and the value in the node equals the value being searched for, return true. If the value in the node is bigger, then return the result of searching the left subtree; otherwise, return the result of searching the right subtree. If it's an empty node, then the value isn't there, so return false.


A binary search tree isn't particularly useful if it can only hold integers. So as a first step towards making it more real, we'd like to be able to store not just integers in the tree, but any value for which the necessary comparisons (> and ==) are defined. There's a typeclass for that, called "Ord", for Ordered. So now let's just rewrite IntBST as a generic BST using typeclasses:


> data (Ord a) => GenBST a = GBNode a (GenBST a) (GenBST a) > | GBEmpty
>      deriving (Show)

The variable on the left of the equal in the data type definition is a type parameter. The type-class constraint preceding the name of the new datatype means that the type parameter "a" can only be instantiated by a type that is an instance of the type-class "Ord". The method implementations will barely change:


> gbstInsert :: (Ord a) => GenBST a -> a -> GenBST a
> gbstInsert (GBNode i left right) v 
>         | i > v = GBNode i (gbstInsert left v) right
>         | otherwise = GBNode i left (gbstInsert right v)
> gbstInsert GBEmpty v = (GBNode v GBEmpty GBEmpty)
>
> gbstSearch :: (Ord a) => GenBST a -> a -> Bool
> gbstSearch (GBNode k left right) v
>        | k == v = True
>        | k > v = gbstSearch left v
>        | otherwise = gbstSearch right v
> gbstSearch GBEmpty _ = False

With that implementation of gbstInsert, we can pretty much reuse the code to generate a sample instance:

> gtreeone = gbstInsert (gbstInsert 
>        (gbstInsert (gbstInsert (gbstInsert GBEmpty 3) 5) 2) 17) 10
> gtreetwo = (gbstInsert (gbstInsert (gbstInsert (gbstInsert gtreeone 4) 6) 1) 9)

Look what ghci says about those:

*Main*gt; gtreetwo
GBNode 3 (GBNode 2 (GBNode 1 GBEmpty GBEmpty) GBEmpty) (GBNode 5 (GBNode 4 GBEmpty GBEmpty) (GBNode 17 (GBNode 10 (GBNode 6 GBEmpty (GBNode 9 GBEmpty GBEmpty)) GBEmpty) GBEmpty))
*Main> :type gtreetwo
gtreetwo :: GenBST Integer
*Main> gbstSearch gtreetwo 17
True
*Main> gbstSearch gtreetwo 18
False
*Main> gbstSearch gtreetwo 1
True
*Main> gbstSearch gtreeone 1
False
*Main>

We can use the new generic tree in exactly the same way that we used the integer specific one. We don't need to declare types of expressions that use it. The compiler will happily infer the type parameter, and just make everything work. But now we're not restricted to integers:

*Main> gbstInsert (gbstInsert (gbstInsert GBEmpty "hi") "hello") "Salutations"
GBNode "hi" (GBNode "hello" (GBNode "Salutations" GBEmpty GBEmpty) GBEmpty) GBEmpty
*Main> gbstInsert (gbstInsert (gbstInsert (gbstInsert GBEmpty "hi") "hello") "S
alutations") "Greetings"
GBNode "hi" (GBNode "hello" (GBNode "Salutations" (GBNode "Greetings" GBEmpty GBEmpty) GBEmpty) GBEmpty) GBEmpty
*Main>

That's some of the basics. Next post, we'll build on the binary search tree, giving it some more features, and updating the insert routine so that the tree stays nicely balanced. That will lead us in to some interesting things about how you code in a language like Haskell, where you can't actually modify data structures in-place.

Share on Facebook
Share on StumbleUpon
Share on Facebook
Find more posts in: Technology

Comments

1

Mark, can you load and run this in your interpreter? I haven't spent a lot of time on it, but there are at least a few syntax issues I run into when trying to get it to run with GHC 6.8.2 between extraneous spaces and missing blank lines between code and comment.

Posted by: phlyingpenguin | November 24, 2009 2:58 PM

2

Dear phlyingpenguin:
Browsers -- or browser/OS pairs -- frequently disagree on what to offer to the cut-and-paste machinery, some omitting the blank line between a code block and the following text, making a hash of otherwise sound literate blogging. (In this case, Safari was wrong for me, Firefox right; but there is the further problem that here the html is also missing some of the spaces between the opening '>'s of the code lines.) I think this paste will be cut-able:

http://hpaste.org/fastcgi/hpaste.fcgi/view?id=13329

Posted by: Michael | November 28, 2009 10:45 PM

3

Indeed, I keep forgetting that I've switched between Chrome and Safari which both exhibit the same behavior. Firefox does appear to render the way you intend. I had fixed the copy/paste manually without too much more difficulty later on. Thanks!

Posted by: phlyingpenguin | November 30, 2009 9:40 PM

4

Dear Mark,
I have a Haskell program to parsing and evaluating an expression. I have the defined the property to perform quickcheck and thus an instance for the Expr which is as follows

instance Arbitrary Expr where
arbitrary = sized arbExpr

arbExpr :: Int -> Gen Expr
arbExpr s =
frequency [ (1, do n return (Num (abs(n))))
, (s, do a b return (Add a b))
, (s, do a b return (Mul a b))
, (s, do a return (Sin (a)))
, (s, do a return (Cos (a)))
, (1, do return (Var "x"))
]
where
s' = s `div` 2

data Expr
= Num Double
| Add Expr Expr
| Mul Expr Expr
| Sin Expr
| Cos Expr
| Var Name
| Neg Expr
deriving (Eq,Show)

type Name = String

Now I want to get a random expression fro my GUI Calculator, where whenever the user presses a button and the Calculator generates a random Expression of the above type.
Thanks in advance..

Posted by: Pawan | December 22, 2009 6:18 AM

ScienceBlogs

Search ScienceBlogs:

Go to:

Advertisement
Follow ScienceBlogs on Twitter

© 2006-2011 ScienceBlogs LLC. ScienceBlogs is a registered trademark of ScienceBlogs LLC. All rights reserved.