on
Haskell
My name is Chris. I teach Haskell to people who are new to programming as well as to long-time coders. Haskell is a general purpose programming language that is most useful to mere mortals.
I’m going to show you how to write a package in Haskell and interact with the code inside of it.
Installing tools for writing Haskell code
We’re going to use Stack to manage our project dependencies and compiler, build our code, and run our tests. Start by getting Stack installed.
After you’ve finished the install instructions, stack should be in your path. ghci is the REPL (read-eval-print loop) for Haskell, though as often as not, you’ll use stack ghci to invoke a REPL that is aware of your project and its dependencies.
What we’re going to make
We’re going to write a little csv parser for some baseball data. I don’t care a whit about baseball, but it was the best example of free data I could find.
Project layout
There’s not a prescribed project layout, but there are a few guidelines I would advise following.
One is that Edward Kmett’s lens library is not only a fantastic library in its own right, but is also a great resource for people wanting to see how to structure a Haskell project, write and generate Haddock documentation, and organize your namespaces. Kmett’s library follows Hackage guidelines on what namespaces and categories to use for his libraries.
There is an alternative namespacing pattern demonstrated by Pipes, a streaming library. It uses a top-level eponymous namespace. For another popular example of how to organize a non-trivial Haskell project, you could also look at Pandoc.
Once we’ve finished laying out our project, it’s going to look like this:
$ tree
.
├── LICENSE
├── Setup.hs
├── bassbull.cabal
├── src
│   └── Main.hs
└── stack.yaml
1 directory, 5 files
Ordinarily I’d structure things a little more, but there isn’t a lot to this project. Onward!
Getting your project started
We’ll use Stack, our GHC Haskell dependency manager and build tool, to create some initial files for us.
$ stack new bassbull simple
Here bassbull is the name of our project and simple is the project template we’re using. Now we’re going to download our test data while inside the directory of our bassbull project.
You can download the data from here. If you want to download it via the terminal on a Unix-alike (Mac, Linux, BSD, etc) you can do so via:
$ curl -o batting.csv https://raw.githubusercontent.com/bitemyapp/csvtest/master/batting.csv
It should be about 2.3 MB when it’s all said and done.
Before we start making changes, I’m going to init my version control (git, for me) so I can track my changes and not lose any work.
$ cd bassbull
$ git init
$ git add .
$ git commit -am "Initial commit"
I’m also going to add the gitignore from GitHub’s gitignore repository plus some additions for Haskell so we don’t accidentally check in unnecessary build artifacts or other things inessential to the project.
This should go into a file named .gitignore at the top level of your bassbull project.
dist
dist-*
cabal-dev
*.o
*.hi
*.chi
*.chs.h
*.dyn_o
*.dyn_hi
.hpc
.hsenv
.cabal-sandbox/
cabal.sandbox.config
*.prof
*.aux
*.hp
*.eventlog
.stack-work/
Editing the Cabal file
First we need to fix up our cabal file a bit. Mine is named bassbull.cabal and is in the top level directory of the project.
Here’s what I changed my cabal file to:
name: bassbull
version: 0.1.0.0
synopsis: Processing some csv data
description: Baseball data analysis
homepage: bitemyapp.com
license: BSD3
license-file: LICENSE
author: Chris Allen
maintainer: [email protected]
copyright: 2016, Chris Allen
category: Data
build-type: Simple
cabal-version: >=1.10
executable bassbull
  ghc-options: -Wall
  hs-source-dirs: src
  main-is: Main.hs
  build-depends: base >= 4.7 && <5,
                 bytestring,
                 vector,
                 cassava
  default-language: Haskell2010
A few things to note:
- The description tells people what the package is about.
- The hs-source-dirs includes src so Cabal knows where my modules are.
- An executable stanza with the name bassbull is in the Cabal file so we can build a binary by that name and run it.
- main-is is set to Main.hs in the executable stanza so the compiler knows which file contains the Main module and main function.
- We have ghc-options with -Wall so we get the rather handy warnings GHC offers on top of the usual type checking.
- We included the libraries our project will use in build-depends.
Building and interacting with your program
The contents of src/Main.hs:
module Main where
main = putStrLn "hello"
One thing to note is that for a module to work as a main-is target for GHC, it must have a function named main and itself be named Main. Most people make little wrapper Main modules to satisfy this, sometimes with argument parsing and handling done via libraries like optparse-applicative.
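As a rough sketch of what such a wrapper can look like without any extra libraries, the following uses getArgs from base rather than optparse-applicative. The helper name pickCsvPath and the fallback file name are assumptions made for illustration; they are not part of the project we build below.

```haskell
module Main where

import System.Environment (getArgs)

-- Hypothetical helper: choose the CSV path from the command-line
-- arguments, falling back to a default when none is given.
pickCsvPath :: [String] -> FilePath
pickCsvPath (path : _) = path
pickCsvPath []         = "batting.csv"

main :: IO ()
main = do
  args <- getArgs
  putStrLn ("Would process: " ++ pickCsvPath args)
```

Keeping the decision in a pure function like pickCsvPath means the only thing main has to do is fetch the arguments and print.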
For now, we’ve left Main very simple, making it just a putStrLn of the string "hello". To validate that everything is working, let’s build and run this program.
First we install our dependencies by building our project. This can take some time on the first run, but Stack will cache and share dependencies across your projects automatically.
$ stack setup
$ stack build
We did the stack setup just in case you didn’t already have GHC installed. Note that you’ll only have to do this once for a particular version of GHC. If this succeeds, we should get a binary named bassbull. To run this, do the following.
$ stack exec bassbull
hello
If everything is in place, let’s move on to writing a little csv processor.
Writing a program to process csv data
One thing to note before we begin is that you can fire up a project-aware Haskell REPL using Stack’s GHCi command. The benefit of doing so is that you can write and type-check code interactively as you explore new and unfamiliar libraries or just to refresh your memory about existing code.
You can do so by running it in your shell like so:
$ stack ghci
If you do, you should see a bunch of stuff about loading packages installed for the project and then a Prelude Main> prompt.
[1 of 1] Compiling Main ( Main.hs, interpreted )
Ok, modules loaded: Main.
Prelude Main>
Now we can load our src/Main.hs in the REPL.
$ stack ghci
Preprocessing executable 'bassbull' for bassbull-0.1.0.0...
GHCi, version 7.8.3: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package array-0.5.0.0 ... linking ... done.
Loading package deepseq-1.3.0.2 ... linking ... done.
Loading package bytestring-0.10.4.0 ... linking ... done.
Loading package containers-0.5.5.1 ... linking ... done.
Loading package text-1.2.0.0 ... linking ... done.
Loading package hashable-1.2.2.0 ... linking ... done.
Loading package scientific-0.3.3.2 ... linking ... done.
Loading package attoparsec-0.12.1.2 ... linking ... done.
Loading package blaze-builder-0.3.3.4 ... linking ... done.
Loading package unordered-containers-0.2.5.1 ... linking ... done.
Loading package primitive-0.5.4.0 ... linking ... done.
Loading package vector-0.10.12.1 ... linking ... done.
Loading package cassava-0.4.2.0 ... linking ... done.
[1 of 1] Compiling Main ( src/Main.hs, interpreted )
src/Main.hs:3:1: Warning:
Top-level binding with no type signature: main :: IO ()
Ok, modules loaded: Main.
*Main> :load src/Main.hs
[1 of 1] Compiling Main ( src/Main.hs, interpreted )
src/Main.hs:3:1: Warning:
Top-level binding with no type signature: main :: IO ()
Ok, modules loaded: Main.
*Main>
Becoming comfortable with the REPL can be a serious boon to productivity. There is editor integration for those that want it as well.
Now we’re going to update our src/Main.hs. Our goal is to read a CSV file into a ByteString (basically a byte vector), parse the ByteString into a Vector of tuples, and sum up the “at bats” column.
module Main where
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V
-- from cassava
import Data.Csv
-- a simple type alias for data
type BaseballStats = (BL.ByteString, Int, BL.ByteString, Int)
main :: IO ()
main = do
  csvData <- BL.readFile "batting.csv"
  let v = decode NoHeader csvData :: Either String (V.Vector BaseballStats)
  let summed = fmap (V.foldr summer 0) v
  putStrLn $ "Total atBats was: " ++ (show summed)
  where summer (name, year, team, atBats) n = n + atBats
Let’s break down this code.
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V
-- from cassava
import Data.Csv
First, we’re importing our dependencies. Qualified imports let us give names to the namespaces we’re importing and use those names as a prefix, such as BL.ByteString. This is used to refer to values and type constructors alike. In the case of import Data.Csv where we didn’t qualify the import (with qualified), we’re bringing everything from that module into scope. This should be done only with modules that have names of things that won’t conflict with anything else. Other modules like Data.ByteString and Data.Vector have a bunch of functions that are named identically to functions in the Prelude and should be qualified.
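To make the name-clash point concrete without pulling in extra packages, here is a small sketch that uses Data.List.NonEmpty from base, which also exports a function called map. The function names bumpList and bumpNonEmpty are made up for this illustration.

```haskell
module Main where

-- Qualified: everything from this module must be prefixed with NE.
import qualified Data.List.NonEmpty as NE
-- Unqualified, but limited to the type and its constructor.
import Data.List.NonEmpty (NonEmpty (..))

-- Plain map here is the Prelude one, which works on lists.
bumpList :: [Int] -> [Int]
bumpList = map (+ 1)

-- NE.map is the NonEmpty version; the prefix keeps the two apart.
bumpNonEmpty :: NonEmpty Int -> NonEmpty Int
bumpNonEmpty = NE.map (+ 1)

main :: IO ()
main = do
  print (bumpList [1, 2, 3])
  print (NE.toList (bumpNonEmpty (1 :| [2, 3])))
```

Had the whole module been imported unqualified, every bare use of map would be ambiguous between the two versions.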
-- a simple type alias for data
type BaseballStats = (BL.ByteString, Int, BL.ByteString, Int)
Here we’re creating a type alias for BaseballStats. I made it a type alias for a few reasons. One is so I could put off talking about algebraic data types! I made it a type alias of the 4-tuple specifically because the Cassava library already understands how to translate CSV rows into tuples and our type here will “just work” as long as the columns that we say are Int actually are parseable as integral numbers. Haskell tuples are allowed to have heterogeneous types and are defined primarily by their length. The parentheses and commas are used to signify them. For example, (a, b) would be both a valid value and type constructor for referring to 2-tuples, (a, b, c) for 3-tuples, and so forth.
main :: IO ()
main = do
  csvData <- BL.readFile "batting.csv"
We need to read in a file so we can parse our CSV data. We called the lazy ByteString namespace BL using the qualified keyword in the import. From that namespace we used BL.readFile which has type FilePath -> IO ByteString. You can read this in English as “I take a FilePath as an argument and I return a ByteString after performing some side effects”.
You can see the type of BL.readFile here.
We’re binding over the IO ByteString that BL.readFile "batting.csv" returns. csvData has type ByteString due to binding over IO. Remember our tuples that we signified with parentheses earlier? Well, () is a sort of tuple too, but it’s the 0-tuple! In Haskell we usually call it unit. It can’t contain anything; it’s a type that has a single value - (), that’s it. It’s often used to signify we don’t return anything. Since there’s usually no point in executing functions that don’t return anything, () is often wrapped in IO. Printing a string is a good example of the result type IO (), as it does its work and returns nothing. In Haskell you can’t actually “return nothing;” the concept doesn’t even make sense. Thus we use () as the idiomatic “I got nothin’ for ya” type and value. Usually if something returns () you won’t even bother to bind to a name, you’ll just ignore it.
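A small sketch of the difference between an action whose result is worth naming and one that only hands back unit. The functions announce and shoutedLength are hypothetical and exist only for this example.

```haskell
module Main where

-- An action that does its work and hands back only ().
announce :: String -> IO ()
announce msg = putStrLn ("[info] " ++ msg)

-- An action whose result is worth binding to a name.
shoutedLength :: String -> IO Int
shoutedLength msg = do
  announce msg
  return (length msg)

main :: IO ()
main = do
  announce "starting"               -- the () result is simply ignored
  n <- shoutedLength "batting.csv"  -- n :: Int
  unit <- announce "done"           -- legal, but unit is just ()
  print (n, unit)                   -- prints (11,())
```

Binding the result of announce is allowed, it just never tells you anything, which is why you rarely see it done.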
let v = decode NoHeader csvData :: Either String (V.Vector BaseballStats)
v has the type you see to the right of the type annotation operator ::. I’m annotating the type to select the typeclass instance that decode uses to parse csv data. See more about the typeclass cassava uses for parsing csv data here.
In this case, because I defined a type alias of a tuple for my record, I get my parsing code for free (already defined for tuples, ByteString, and Int).
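The same idea can be seen with read from the Prelude, which is used here as a stand-in for decode: the expression stays the same and the annotated result type decides which parsing instance runs. This is an analogy built on base only, not Cassava code.

```haskell
module Main where

-- The same expression, read "42", parsed three different ways.
-- The type signature alone selects which Read instance is used.
asInt :: Int
asInt = read "42"

asDouble :: Double
asDouble = read "42"

asList :: [Int]
asList = read "[1,2,3]"

main :: IO ()
main = do
  print asInt     -- 42
  print asDouble  -- 42.0
  print asList    -- [1,2,3]
```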
let summed = fmap (V.foldr summer 0) v
Here we’re using a let expression to bind the expression fmap (V.foldr summer 0) v to the name summed so that the expressions that follow it can refer to summed without repeating all the same code.
First we fmap over the Either String (V.Vector BaseballStats). This lets us apply (V.foldr summer 0) to V.Vector BaseballStats. We partially applied the Vector folding function foldr to the summing function and the number 0. The number 0 here is our “start” value for the fold. Generally in Haskell we don’t use recursion directly. Instead we use higher order functions and abstractions, giving names to common things programmers do in a way that lets us be more productive. One of those very common things is folding data. You’re going to see examples of folding and the use of fmap from Functor in a bit.
We say V.foldr is partially applied because we haven’t applied all of the arguments yet. Haskell has something called currying built into all functions by default which lets us avoid some tedious work that would require a “Builder” pattern in languages like Java. Unlike previous code samples, these examples are using my interactive ghci REPL.
-- Person is a product/record, if that
-- is confusing think "struct" but better.
Prelude> data Person = Person String Int String deriving Show
Prelude> :type Person
Person :: String -> Int -> String -> Person
Prelude> :t Person "Chris" 415
Person "Chris" 415 :: String -> Person
Prelude> :t Person "Chris" 415 "Allen"
Person "Chris" 415 "Allen" :: Person
Prelude> let namedChris = Person "Chris"
Prelude> namedChris 415 "Allen"
Person "Chris" 415 "Allen"
Prelude> Person "Chris" 415 "Allen"
Person "Chris" 415 "Allen"
This lets us apply some, but not all, of the arguments to a function and pass around the result as a function expecting the rest of the arguments.
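The same thing works for ordinary functions, not just data constructors. Here is a minimal sketch in source-file style; volume and withLengthTwo are made-up names for the illustration.

```haskell
module Main where

-- A function of three arguments.
volume :: Int -> Int -> Int -> Int
volume l w h = l * w * h

-- Supplying only the first argument leaves a function
-- that still expects the remaining two.
withLengthTwo :: Int -> Int -> Int
withLengthTwo = volume 2

main :: IO ()
main = do
  print (withLengthTwo 3 4)            -- 24
  -- volume 2 3 is a function Int -> Int, so it can be mapped.
  print (map (volume 2 3) [1, 2, 3])   -- [6,12,18]
```

This is exactly what happens with V.foldr summer 0: two of its three arguments are supplied, and fmap later supplies the Vector.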
Fully explaining the fmap in let summed = fmap (V.foldr summer 0) v would require explaining Functor. I don’t want to belabor specific concepts too much, but I think a quick demonstration of fmap and foldr would help here. This is also a transcript from my interactive ghci REPL. I’ll explain Either, Right, and Left after the REPL sample. The :type or :t command is a command to my ghci REPL, not part of the Haskell language. It’s a way to request the type of an expression.
Prelude> let v = Right 1 :: Either String Int
Prelude> let x = Left "blah" :: Either String Int
Prelude> :t v
v :: Either String Int
Prelude> :t x
x :: Either String Int
Prelude> let addOne x = x + 1
<interactive>:4:12: Warning:
This binding for ‘x’ shadows the existing binding
defined at <interactive>:3:5
Prelude> addOne 2
<interactive>:5:1: Warning:
Defaulting the following constraint(s) to type ‘Integer’
(Show a0) arising from a use of ‘print’ at <interactive>:5:1-8
(Num a0) arising from a use of ‘it’ at <interactive>:5:1-8
In a stmt of an interactive GHCi command: print it
3
Prelude> fmap addOne v
Right 2
Prelude> fmap addOne x
Left "blah"
Either in Haskell is used to signify cases where we might get values of one of two possible types. Either String Int is a way of saying, “you’ll get either a String or an Int”. This is an example of sum types. You can think of them as a way to say or in your type, where a struct or class would let you say and. Either has two constructors, Right and Left. Culturally in Haskell Left signifies an “error” case. This is partly why the Functor instance for Either maps over the Right constructor but not the Left. If you have an error value, you can’t keep applying your happy path functions. In the case of Either String Int, String would be our error value in a Left constructor and Int would be the happy-path “yep, we’re good” value in the Right constructor. Also, Haskell has type inference. You don’t have to declare types explicitly like I did in the example from my REPL transcript - I did so for the sake of explicitness.
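As a sketch of how a function might produce an Either in the first place, here is a tiny parser built on reads from the Prelude. The names parseAtBats and describeResult, and the error message text, are assumptions for this example only.

```haskell
module Main where

-- Left carries an error message, Right carries the parsed number.
parseAtBats :: String -> Either String Int
parseAtBats s = case reads s of
  [(n, "")] -> Right n
  _         -> Left ("not a number: " ++ s)

-- either takes one handler per constructor and collapses the result.
describeResult :: Either String Int -> String
describeResult = either ("error: " ++) (\n -> "at bats: " ++ show n)

main :: IO ()
main = do
  print (fmap (+ 1) (parseAtBats "12"))      -- Right 13
  print (fmap (+ 1) (parseAtBats "twelve"))  -- Left "not a number: twelve"
  putStrLn (describeResult (parseAtBats "12"))
```

Note that fmap leaves the Left untouched, so the error message survives any number of happy-path transformations.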
Either isn’t the only type we can map over.
Prelude> let myList = [1, 2, 3] :: [Int]
Prelude> fmap addOne myList
[2,3,4]
Prelude> let multTwo x = x * 2
Prelude> fmap multTwo myList
[2,4,6]
Here we have the list type, signified using the [] brackets and whatever type is inside our list, in this case Int. With Either we have two possible types and Functor only lets us map over one of them, so the Functor instance for Either only applies our function over the happy path values. With the type [a] there’s only one type inside of it, so it’ll get applied regardless…or will it? What if I have an empty list?
Prelude> fmap multTwo []
[]
Prelude> fmap addOne []
[]
Conveniently not only does fmap let us avoid manually pattern matching the Left and Right cases of Either, but it lets us not bother to manually recurse our list or pattern-match the empty list case. This helps us prevent mistakes as well as clean up and abstract our code. In a less happy alternate universe, we would’ve had to write the following code, written in typical code file style rather than for the REPL this time:
addOne :: Int -> Int
addOne x = x + 1 -- at least we can abstract this out
incrementEither :: Either e Int -> Either e Int
incrementEither (Right numberWeWanted) = Right (addOne numberWeWanted)
incrementEither (Left errorString) = Left errorString
We use parens on the left-hand side here to pattern match at the function declaration level on whether our Either e Int is Right or Left. Parentheses wrap (addOne numberWeWanted) so we don’t try to erroneously pass two arguments to Right when we mean to pass the result of applying addOne to numberWeWanted, to Right. If our value is Right 1 this is returning Right (addOne 1) which reduces to Right 2.
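The list half of that less happy universe would look something like the sketch below: explicit recursion with a separate case for the empty list. incrementList is a made-up name for the illustration.

```haskell
module Main where

addOne :: Int -> Int
addOne x = x + 1

-- Hand-written recursion: one equation for the empty list,
-- one for a head element followed by the rest of the list.
incrementList :: [Int] -> [Int]
incrementList []       = []
incrementList (x : xs) = addOne x : incrementList xs

main :: IO ()
main = do
  print (incrementList [1, 2, 3])  -- [2,3,4]
  print (incrementList [])         -- []
  -- fmap gives the same answer without the boilerplate.
  print (fmap addOne [1, 2, 3])    -- [2,3,4]
```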
As we process the CSV data we’re going to be doing so by folding the data. This is a general model for understanding how you process data that extends beyond specific programming languages. You might have seen fold called reduce. Here are some examples of folds and list/string concatenation in Haskell. We’re switching back to REPL demonstration again.
Prelude> :t foldr
foldr :: (a -> b -> b) -> b -> [a] -> b
Prelude> foldr (+) 0 [1, 2, 3]
6
Prelude> foldr (+) 1 [1, 2, 3]
7
Prelude> foldr (+) 2 [1, 2, 3]
8
Prelude> foldr (+) 2 [1, 2, 3, 4]
12
Prelude> :t (++)
(++) :: [a] -> [a] -> [a]
Prelude> [1, 2, 3] ++ [4, 5, 6]
[1,2,3,4,5,6]
Prelude> "hello, " ++ "world!"
"hello, world!"
Okay, enough of the REPL jazz session.
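Before moving on, it may help to see a right fold written out by hand in source-file style. This is a teaching sketch of the list version only; myFoldr is a made-up name, and the real foldr in base is more general.

```haskell
module Main where

-- A hand-rolled right fold over lists.
-- z is the start value, used when the list runs out.
myFoldr :: (a -> b -> b) -> b -> [a] -> b
myFoldr _ z []       = z
myFoldr f z (x : xs) = f x (myFoldr f z xs)

main :: IO ()
main = do
  -- Expands to 1 + (2 + (3 + 0)).
  print (myFoldr (+) 0 [1, 2, 3])  -- 6
  -- With subtraction the nesting is visible: 1 - (2 - (3 - 0)).
  print (myFoldr (-) 0 [1, 2, 3])  -- 2
```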
Now back to the CSV processing code!
putStrLn $ "Total atBats was: " ++ (show summed)
Last, we stringify the summed up count using show, then concatenate that with a string to describe what we’re printing, then print the whole shebang using putStrLn. The $ is function application with very low precedence, so everything to the right of the $ is grouped into a single argument for whatever is to the left. To see why I did that, remove the $ and build the code. Alternatively, I could’ve used parentheses in the usual fashion. That would look like the following.
putStrLn ("Total atBats was: " ++ (show summed))
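A quick side-by-side sketch of the two spellings; withDollar and withParens are hypothetical names, and both are meant to compute the same thing.

```haskell
module Main where

-- $ groups everything to its right into one argument.
withDollar :: String -> Int
withDollar s = length $ words s

-- The same expression with explicit parentheses.
withParens :: String -> Int
withParens s = length (words s)

main :: IO ()
main = do
  print (withDollar "one two three")  -- 3
  print (withParens "one two three")  -- 3
  -- Chains of $ read right to left, like nested parentheses.
  putStrLn $ "words: " ++ show (withDollar "a b")
```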
show is a function from the typeclass Show. Here’s how you can find out about it in your REPL:
Prelude> :type show
show :: Show a => a -> String
Prelude> :info Show
class Show a where
showsPrec :: Int -> a -> ShowS
show :: a -> String
showList :: [a] -> ShowS
-- Defined in ‘GHC.Show’
instance (Show a, Show b) => Show (Either a b)
-- Defined in ‘Data.Either’
instance Show a => Show [a] -- Defined in ‘GHC.Show’
instance Show Ordering -- Defined in ‘GHC.Show’
instance Show a => Show (Maybe a) -- Defined in ‘GHC.Show’
instance Show Integer -- Defined in ‘GHC.Show’
instance Show Int -- Defined in ‘GHC.Show’
instance Show Char -- Defined in ‘GHC.Show’
instance Show Bool -- Defined in ‘GHC.Show’
...
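Your own types can join that list too. The sketch below shows the two usual routes, deriving the instance and writing it by hand; the Team and AtBats types are invented for this example.

```haskell
module Main where

-- Route 1: let the compiler derive Show.
data Team = Team String Int
  deriving Show

-- Route 2: write the instance yourself for custom output.
newtype AtBats = AtBats Int

instance Show AtBats where
  show (AtBats n) = show n ++ " at bats"

main :: IO ()
main = do
  putStrLn (show (Team "NYA" 2004))  -- Team "NYA" 2004
  putStrLn (show (AtBats 10))        -- 10 at bats
```

Derived instances print something that looks like the constructor call, which is usually what you want while debugging.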
What instance Show Integer is telling us is that Integer has implemented Show. This means we should be able to use show on something with that type. We can specialize the type of show to Integer in a few passes.
show :: Show a => a -> String
show :: Show Integer => Integer -> String
-- you can just drop Show Integer =>, the typeclass
-- instances associated with a specific type are
-- a given.
show :: Integer -> String
In fact, we can even make a pointless version of show pre-specialized to Integer. Here’s an example from my REPL:
Prelude> :t show
show :: Show a => a -> String
Prelude> :t show myInteger
show myInteger :: String
Prelude> let integerShow = show :: Integer -> String
Prelude> integerShow 1
"1"
Prelude> integerShow ("blah", ())
<interactive>:11:13:
Couldn't match expected type ‘Integer’
with actual type ‘([Char], ())’
In the first argument of ‘integerShow’, namely ‘("blah", ())’
In the expression: integerShow ("blah", ())
In an equation for ‘it’: it = integerShow ("blah", ())
Prelude> show ("blah", ())
"(\"blah\",())"
Next we’ll look at summer. summer is the function we are folding our Vector with. You can hang where clauses off of functions; they are a bit like let, but they come last. where clauses are more common in Haskell than let clauses, but there’s nothing wrong with using both.
Our folding function here takes two arguments: the tuple record (we’ll have many of those in the vector of records), and the sum of our data so far.
Here n is the sum we’re carrying along as we fold the Vector of BaseballStats.
where summer (name, year, team, atBats) n = n + atBats
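For comparison, here is the same small fold written once with let and once with where. Both function names are made up for the illustration, and plain lists stand in for the Vector.

```haskell
module Main where

-- let: the helper is defined before the expression that uses it.
sumWithLet :: [Int] -> Int
sumWithLet xs =
  let step x acc = acc + x
  in foldr step 0 xs

-- where: the helper hangs off the end of the definition.
sumWithWhere :: [Int] -> Int
sumWithWhere xs = foldr step 0 xs
  where step x acc = acc + x

main :: IO ()
main = do
  print (sumWithLet [1, 2, 3])    -- 6
  print (sumWithWhere [1, 2, 3])  -- 6
```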
Building and running our csv parsing program
First we’re going to rebuild the project.
$ stack build
Then, assuming we have the batting.csv I mentioned earlier in our current directory, we can run our program and get the results.
$ stack exec bassbull
Total atBats was: Right 4858210
$
Refactoring our code a bit
Splitting out logic into independent functions is a common method for making Haskell code more composable and easy to read.
To that end, we’ll clean up our example a bit.
First, we don’t care about name, year, and team for our folding code. So we’re going to use the Haskell idiom of binding things we don’t care about to _.
This changes our fold from this:
where summer (name, year, team, atBats) n = n + atBats
To this:
where summer (_, _, _, atBats) n = n + atBats
Next we’ll make our extraction of the ‘at bats’ from the tuple more compositional. If you’d like to play with this further, consider rewriting our example program at the end of this article into using a Haskell record instead of a tuple. I used a tuple here because Cassava already understands how to parse them, sparing me having to write that code.
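If you want to try the record version, one possible starting point is sketched below. It follows the positional FromRecord pattern from Cassava’s documentation; the BattingRow type, its field names, and the totalAtBats helper are assumptions made for this example, and it needs the cassava, vector, and bytestring packages.

```haskell
module Main where

import Control.Monad (mzero)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Csv
import qualified Data.Vector as V

-- Hypothetical record replacing the 4-tuple alias.
data BattingRow = BattingRow
  { rowPlayer :: BL.ByteString
  , rowYear   :: Int
  , rowTeam   :: BL.ByteString
  , rowAtBats :: Int
  } deriving (Show, Eq)

-- Parse a row by position; reject rows without exactly 4 fields.
instance FromRecord BattingRow where
  parseRecord v
    | V.length v == 4 =
        BattingRow <$> v .! 0 <*> v .! 1 <*> v .! 2 <*> v .! 3
    | otherwise = mzero

totalAtBats :: BL.ByteString -> Either String Int
totalAtBats csv = fmap (V.foldr ((+) . rowAtBats) 0) (decode NoHeader csv)

main :: IO ()
main = print (totalAtBats (BLC.pack "aaa,2004,SFN,10\nbbb,2005,NYA,20\n"))
```

The payoff is that the fold can say rowAtBats instead of relying on the position of a tuple element.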
First we’ll add fourth:
fourth :: (a, b, c, d) -> d
fourth (_, _, _, d) = d
Then we’ll rewrite our folding function again from:
where summer (_, _, _, atBats) n = n + atBats
Into:
where summer r n = n + fourth r
Here we can use something called eta reduction to remove the explicit record and sum values to make it point-free. Since our function is really just about composing the extraction of the fourth value from the tuple and summing that value with the summed up atBat values so far, this makes the code quite concise.
You can read more about this in the article on pointfree programming in Haskell.
To that end, we go from:
where summer r n = n + fourth r
to:
where summer = (+) . fourth
. is how we compose functions in Haskell. The entire definition of . is:
(f . g) x = f (g x)
So, for example, if we write multiplyByTwo . addOne we’re adding one, then passing that result to the multiplyByTwo function. In the csv parser code, first fourth gets applied to the r argument, then (+) is composed so that it is applied to the result of fourth r and the value n.
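Here is that paragraph as a small runnable sketch. addOne, multiplyByTwo, and summerExplicit are written out only for this example; fourth and summer match the definitions used in the article.

```haskell
module Main where

addOne :: Int -> Int
addOne x = x + 1

multiplyByTwo :: Int -> Int
multiplyByTwo x = x * 2

-- Runs addOne first, then multiplyByTwo on the result.
addThenDouble :: Int -> Int
addThenDouble = multiplyByTwo . addOne

fourth :: (a, b, c, d) -> d
fourth (_, _, _, d) = d

-- Point-free version from the article.
summer :: (a, b, c, Int) -> Int -> Int
summer = (+) . fourth

-- The same function with both arguments spelled out.
summerExplicit :: (a, b, c, Int) -> Int -> Int
summerExplicit r n = fourth r + n

main :: IO ()
main = do
  print (addThenDouble 3)                          -- 8
  print (summer ("x", 2004 :: Int, "y", 10) 5)     -- 15
  print (summerExplicit ("x", 2004 :: Int, "y", 10) 5)
```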
We should also split out our decoding of BaseballStats from CSV data.
We’re going to move this code:
let v = decode NoHeader csvData :: Either String (V.Vector BaseballStats)
Into an independent function:
baseballStats :: BL.ByteString -> Either String (V.Vector BaseballStats)
baseballStats = decode NoHeader
Then summed becomes:
let summed = fmap (V.foldr summer 0) (baseballStats csvData)
With that bit of tidying done, we should have:
module Main where
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V
-- cassava
import Data.Csv
type BaseballStats = (BL.ByteString, Int, BL.ByteString, Int)
fourth :: (a, b, c, d) -> d
fourth (_, _, _, d) = d
baseballStats :: BL.ByteString -> Either String (V.Vector BaseballStats)
baseballStats = decode NoHeader
main :: IO ()
main = do
  csvData <- BL.readFile "batting.csv"
  let summed = fmap (V.foldr summer 0) (baseballStats csvData)
  putStrLn $ "Total atBats was: " ++ (show summed)
  where summer = (+) . fourth
Now we’re going to double-check that our code is working:
$ stack build
...(stuff happens)...
$ stack exec bassbull
Total atBats was: Right 4858210
Streaming
We can improve upon what we have here. Currently we’re going to use as much memory as it takes to store the entirety of the csv file in memory, but we don’t really have to do that to sum up the records!
Since we’re just adding the current record’s “at bats” to the sum we’ve accumulated so far, we only really need to read one record into memory at a time. By default Cassava will load the csv into a Vector for convenience, but fortunately it has a streaming module so we can stream the data incrementally and fold our result without loading the entire dataset at once.
First, we’re going to drop Cassava’s default module for the streaming module.
Changing from this:
-- cassava
import Data.Csv
To this:
-- cassava
import Data.Csv.Streaming
Next, since we won’t have a Vector anymore (we’re streaming, not using in-memory collections), we can drop:
import qualified Data.Vector as V
In favor of using the Foldable typeclass, for which Cassava offers an instance in its streaming API:
import qualified Data.Foldable as F
Then in order to use the streaming API we just change the definition of our summed from:
let summed = fmap (V.foldr summer 0) (baseballStats csvData)
To:
let summed = F.foldr summer 0 (baseballStats csvData)
We are incrementally processing the results, not loading the entire dataset into a Vector.
The final result should look like:
module Main where
import qualified Data.ByteString.Lazy as BL
import qualified Data.Foldable as F
-- cassava
import Data.Csv.Streaming
type BaseballStats = (BL.ByteString, Int, BL.ByteString, Int)
fourth :: (a, b, c, d) -> d
fourth (_, _, _, d) = d
baseballStats :: BL.ByteString -> Records BaseballStats
baseballStats = decode NoHeader
main :: IO ()
main = do
  csvData <- BL.readFile "batting.csv"
  let summed = F.foldr summer 0 (baseballStats csvData)
  putStrLn $ "Total atBats was: " ++ (show summed)
  where summer = (+) . fourth
The core here is the Records datatype Cassava gives us via the Streaming module. You can read more about the Records datatype on hackage. Records is a sum type, which you could read out in English like so:
- data Records a -> Records is a datatype that takes a type variable a.
- Cons (...) | Nil (...) -> It is a sum type of two possible constructors, Cons or Nil (note the list-like nomenclature). This is a way of saying a Records a is always either Cons or Nil.
- Cons (Either String a) (Records a) -> the Cons data constructor is a product of Either String a and Records a. We’re saying Cons is always Either String a and Records a. Also, this Cons resembles the cons-cells in Lisp, Haskell, ML, etc. The library has the following comment about it: “A record or an error message, followed by more records.”
- Nil (Maybe String) BL.ByteString -> the Nil data constructor is a product of Maybe String and BL.ByteString. The library has the following comment: “End of stream, potentially due to a parse error. If a parse error occured, the first field contains the error message. The second field contains any unconsumed input.”
What the Records type is doing for us is letting us process the records like a lazy list, but with a little extra context in the Nil case.
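To get a feel for that shape without the library, here is a toy type with the same list-with-extra-context structure. Stream, More, Done, and summarize are invented for this sketch and are not Cassava’s definitions; the unconsumed-input field is left out to keep it short.

```haskell
module Main where

-- Toy stand-in for a streaming result type; NOT Cassava's Records.
data Stream a
  = More (Either String a) (Stream a)  -- a row or an error, then the rest
  | Done (Maybe String)                -- end of input, maybe with an error

-- Walk the stream like a list, counting good rows and
-- collecting every error message met along the way.
summarize :: Stream a -> (Int, [String])
summarize (More (Right _) rest) =
  let (n, errs) = summarize rest in (n + 1, errs)
summarize (More (Left e) rest) =
  let (n, errs) = summarize rest in (n, e : errs)
summarize (Done (Just e)) = (0, [e])
summarize (Done Nothing)  = (0, [])

example :: Stream Int
example =
  More (Right 10) (More (Left "bad row") (More (Right 20) (Done Nothing)))

main :: IO ()
main = print (summarize example)  -- (2,["bad row"])
```

The terminating constructor carrying a Maybe String is the “little extra context”: a plain list has nowhere to put a final error.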
Because Haskell has abstractions like the Foldable typeclass, we can talk about folding a dataset without caring about the underlying implementation! We could’ve used the foldr from Foldable on our Vector, a List, a Tree, a Map - not just Cassava’s streaming API. foldr from Foldable has the type: Foldable t => (a -> b -> b) -> b -> t a -> b. Note the similarity with the foldr for the list type, (a -> b -> b) -> b -> [a] -> b. What we’ve done is abstracted the specific type out and made it into a generic interface.
In case you’re wondering what the Foldable instance is doing under the hood:
-- | Skips records that failed to convert.
instance Foldable Records where
    foldr = foldrRecords
foldrRecords :: (a -> b -> b) -> b -> Records a -> b
foldrRecords f = go
  where
    go z (Cons (Right x) rs) = f x (go z rs)
    go z _ = z
{-# INLINE foldrRecords #-}
Adding tests
Now we’re going to add tests to our package. First we are going to add a test suite to our bassbull.cabal file. The name of our test suite will just be tests.
test-suite tests
  ghc-options: -Wall
  type: exitcode-stdio-1.0
  main-is: Tests.hs
  hs-source-dirs: tests
  build-depends: base,
                 bassbull,
                 hspec
  default-language: Haskell2010
We’re also going to add a library and shift over some code so that our package is exposed as a proper library rather than only working as an executable. We’re exposing a single module named Bassbull. With an hs-source-dirs of src and an exposed module named Bassbull, Cabal will expect a file to exist at src/Bassbull.hs.
library
  ghc-options: -Wall
  exposed-modules: Bassbull
  build-depends: base >= 4.7 && <5,
                 bytestring,
                 vector,
                 cassava
  hs-source-dirs: src
  default-language: Haskell2010
We need to change our executable in the Cabal file so that it depends on our library. No point duplicating the code!
executable bassbull
  main-is: Main.hs
  ghc-options: -rtsopts -O2
  build-depends: base,
                 bassbull,
                 bytestring,
                 cassava
  hs-source-dirs: src
  default-language: Haskell2010
Next we’re going to create a file named src/Bassbull.hs and shift code from src/Main.hs over to it. Note we’ve also pulled the work out of main into getAtBatsSum, which takes the path of the csv file to process as an argument.
-- src/Bassbull.hs
module Bassbull where
import qualified Data.ByteString.Lazy as BL
import qualified Data.Foldable as F
import Data.Csv.Streaming
type BaseballStats = (BL.ByteString, Int, BL.ByteString, Int)
baseballStats :: BL.ByteString -> Records BaseballStats
baseballStats = decode NoHeader
fourth :: (a, b, c, d) -> d
fourth (_, _, _, d) = d
summer :: (a, b, c, Int) -> Int -> Int
summer = (+) . fourth
-- FilePath is just an alias for String
getAtBatsSum :: FilePath -> IO Int
getAtBatsSum battingCsv = do
  csvData <- BL.readFile battingCsv
  return $ F.foldr summer 0 (baseballStats csvData)
And here’s our defrocked src/Main.hs which is now only responsible for fronting the executable.
module Main where
import Bassbull
main :: IO ()
main = do
  summed <- getAtBatsSum "batting.csv"
  putStrLn $ "Total atBats was: " ++ (show summed)
Next we’ll create a directory named tests and add a file named Tests.hs to it.
For our tests, we’re going to use HSpec because the library is easy to use, the syntax is clean, and the author Simon Hengel is one of the most responsive and helpful I’ve run into in open source.
Here’s our tests/Tests.hs file:
module Main where
import Bassbull
import Test.Hspec
main :: IO ()
main = hspec $ do
  describe "Verify that bassbull outputs the correct data" $ do
    it "sums the at bats column" $ do
      theSum <- getAtBatsSum "batting.csv"
      theSum `shouldBe` 4858210
There’s not too much here. We’re importing Bassbull, which is the library module we’ve exposed. This is also a Main module with its own main function because we execute our test suite as a binary just like we do with executables.
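The pure helpers can be tested the same way, without touching the csv file at all. The sketch below redefines fourth and summer locally so it stands on its own; in the real project you would import them from Bassbull instead. The spec descriptions are my own wording, and the hspec package is required.

```haskell
module Main where

import Test.Hspec

-- Local copies so the example is self-contained; in the
-- project these would come from `import Bassbull`.
fourth :: (a, b, c, d) -> d
fourth (_, _, _, d) = d

summer :: (a, b, c, Int) -> Int -> Int
summer = (+) . fourth

main :: IO ()
main = hspec $ do
  describe "fourth" $ do
    it "returns the last element of a 4-tuple" $ do
      fourth ('a', True, "x", 42 :: Int) `shouldBe` 42
  describe "summer" $ do
    it "adds the at bats to the running total" $ do
      summer ("p", 2004 :: Int, "t", 10) 5 `shouldBe` 15
```

Tests like these run quickly and keep passing even when batting.csv is not in the current directory.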
With all that in place, we’ll build and run the actual tests.
$ stack test
stack test is just a shortcut for building tests specifically, then running the executable produced to see test output.
You aren’t limited to building the tests binary and running your tests in that manner. You can also pass stack ghci an argument to make it load your tests. This can be faster as the REPL uses an interpreter and can reload your code very quickly - much more quickly than doing a full build & execution run.
$ stack ghci bassbull:tests
The above will then give you a REPL which can see anything the build target in your Cabal file named tests can see. You can then run the main function or individual test suites - if you bother to split them out.
Tests are useful and important in Haskell, although I often find I need far fewer of them. Often my process for working on an existing Haskell project will involve working on the code I’m changing with Emacs and a REPL instantiated via stack ghci. As my code starts passing the type-checker, I start running the tests as another layer of assurance that I’m doing the right thing.
I like having a lot of feedback and help from my computer when writing code!
Making your Haskell packages available to the Haskell community
Hackage is the main community repository of Haskell packages and will usually be where you look to find libraries you need.
Mostly you’ll find libraries and the occasional executable utility, but utilities should also be exposing library APIs that make their functionality accessible via Haskell code. This is not only more useful to other people but enforces good practices and more modular projects.
Haskell users are accustomed to documentation that is accessible via the Hackage website directly such as you might find for the base library that comes with GHC. The tool that builds this documentation is called Haddock.
I strongly recommend you look at well-established libraries like lens for examples of how to build your documentation and use continuous integration with your Haskell projects.
To learn more about building a package for uploading to Hackage, see this tutorial.
How I work
When I’m working with Haskell code, I interact with my code in a few ways. One is that I’m writing the code itself in Emacs. I’ll also have a terminal with a REPL open, usually via stack ghci, as I am almost always working on a specific project.
My Emacs config is pretty plain: it’s just haskell-mode and flycheck.
My basic happy-path event-loop for writing Haskell is:
1. Import the module I’m working on in the REPL before I’ve changed anything.
2. Change, add, or delete code.
3. :reload in the REPL. flycheck will give me type errors, but I sometimes like to see them in the REPL too.
4. Sometimes I’ll use eta-reduction to refactor code. You can see an example of this in this code review on StackExchange. Making code point-free makes the most sense when it’s primarily about composing functions rather than about applying them.
5. If code still type-checks after some cleaning, I’ll run the tests. If tests pass, I move on unless I’m suspicious about test coverage. If tests break or I want more coverage, I write more tests until I’m satisfied. When that’s done, I return to step #1 in this loop for the next unit of work I want to perform.
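As a small illustration of the eta-reduction and point-free refactoring mentioned in the loop above, here is a sketch using hypothetical names (nothing here is part of bassbull). Each primed definition means the same thing as the one before it.

```haskell
import Data.List (foldl')

-- Hypothetical helpers for illustration only.

-- Fully applied: the argument xs is named on both sides.
sumAtBats :: [Int] -> Int
sumAtBats xs = foldl' (+) 0 xs

-- Eta-reduced: the trailing xs is dropped from both sides.
sumAtBats' :: [Int] -> Int
sumAtBats' = foldl' (+) 0

-- Composition is where point-free style tends to read best.
-- Pointful version first:
countLongNames :: [String] -> Int
countLongNames names = length (filter (\n -> length n > 5) names)

-- Point-free version built from (.) and an operator section:
countLongNames' :: [String] -> Int
countLongNames' = length . filter ((> 5) . length)
```

Asking the REPL for :type sumAtBats and :type sumAtBats' should report the same type for both, which is a quick way to check that a refactor like this didn’t change anything the type-checker cares about.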
My diagnosis process when something isn’t working:
- If I can’t get something to type-check, I’ll break it down into sub-expressions, query the types of those sub-expressions, and make certain they are what I expected.
- If I have expressions I am trying to combine and I’m trying to make their types make sense, but I haven’t implemented them yet, I will use undefined and work with only application, composition, and monadic variations thereof to figure out how to get to where I’m going before I’ve implemented anything. You can see a good example of this in this GitHub gist. I wrote the solution @ifesdjeen displays in his final comment.
- If I have a function expecting arguments I can’t figure out how to satisfy, I will sometimes use typed holes or a similar trick with implicit parameters to see what type I need to provide.
- Since Haskell functions are pure and lazy, I can replace references to functions with their contents with a high degree of confidence that it will not change the semantics of my program. To that end, sometimes it’s easier to understand what’s going on by inlining the code by hand and seeing what your code turns into.
- If something type-checks but doesn’t work, I’ll run the tests. If the coverage isn’t catching it, I add a test that does. This is less common for me in Haskell than you’d think. If I can frame the test as an assertion about some property the code should satisfy, like with QuickCheck, I will do so. You can learn more about using QuickCheck in Real World Haskell.
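Three of the diagnosis techniques above (stubbing with undefined, typed holes, and inlining by hand) can be sketched in a few lines. Every name below is hypothetical and exists only to show the workflow; this is not code from bassbull or from the gist mentioned above.

```haskell
-- All names here are hypothetical; they only illustrate the workflow.

-- 1. Write down the types you believe you need and stub the bodies
--    with undefined. This type-checks, so you can test how the pieces
--    compose before implementing any of them.
parseRow :: String -> Maybe Int
parseRow = undefined

loadRows :: FilePath -> IO [String]
loadRows = undefined

-- If this composition type-checks, the plan is sound, even though
-- running it would crash on the first undefined.
totalFromFile :: FilePath -> IO (Maybe Int)
totalFromFile path = fmap (fmap sum . traverse parseRow) (loadRows path)

-- 2. A typed hole. Replacing an expression with _ makes GHC report the
--    type it expects there. Uncomment to see the message; the module
--    will not compile while the hole is present.
-- totalFromFile' :: FilePath -> IO (Maybe Int)
-- totalFromFile' path = fmap _ (loadRows path)
-- GHC reports a hole of type [String] -> Maybe Int.

-- 3. Inlining by hand. Because double is pure, substituting its body
--    for its name does not change what the program computes.
double :: Int -> Int
double x = x + x

viaHelper :: [Int] -> Int
viaHelper = sum . map double

inlined :: [Int] -> Int
inlined xs = sum (map (\x -> x + x) xs)
```

Loading this into the REPL and asking for :type totalFromFile confirms the plan fits together before parseRow or loadRows has a real body.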
Emacs
vim
Sublime Text 2/3
My personal dotfiles
Wrapping up
This is the end of our little journey in playing around with Haskell to process CSV data. Learning how to use abstractions like Foldable and Functor, or techniques like eta reduction, takes practice! I have a guide for learning Haskell, compiled from my experiences learning and teaching Haskell with many people over the last year or so.
If you are curious and want to learn more, I strongly recommend you do a course of basic exercises first and then explore the way Haskell enables you to think about your programs in terms of abstractions. Once you have the basics down, this can be done in a variety of ways. Some people like to attack practical problems, some like to follow along with white papers, some like to hammer out abstractions from scratch in focused exercises and examples.
Things to do after finishing this article:
- Check out the Haskell community website
- Learn about (unit|spec|property) testing Haskell software with Kazu Yamamoto’s tutorial
- Search for code by type structurally with Hoogle
- Learn about Haddock, the Haskell source documentation tool and look at the many examples of Haskell package documentation.
More than anything else, my greatest wish would be that you develop a richer and more rewarding relationship with learning. Haskell has been a big part of this in my life.
Special thanks to Daniel Compton and Julie Moronuki for helping me test & edit this article. I couldn’t have gotten it together without their help.