I need a parser for a simple Domain Specific Language and I am writing it in Haskell using the Parsec combinator library. My future posts will show why I am building this parser but in this post, I will focus on how to construct a parser for an imperative language using Parsec.
I will make no attempt to explain Haskell features except to touch upon some details of applicatives. There are many resources available on the Internet for learning Haskell
The implementation will be using Applicatives which will be easier to
read than one written in a monadic approach. The language I intend to
parse is a context-free grammar (CFG) and applicatives will do just
fine. It is helpful to review type signatures for <$>
, <*>
,
*>
and <*
shown below.
:t (<$>) (<$>) :: Functor f => (a -> b) -> f a -> f b :t (<*>) (<*>) :: Applicative f => f (a -> b) -> f a -> f b :t (<*) (<*) :: Applicative f => f a -> f b -> f a :t (*>) (*>) :: Applicative f => f a -> f b -> f b
<$>
applies a function (a -> b)
to an argument (f a)
in a
computational context and produces a new value (f b)
.
<*>
extracts both the function (a -> b)
and the argument from
context (f a)
producing a new value in (f b)
.
<*
always returns the first argument and *>
the second.
The syntax of the language that our parser will recognize is shown below.
script :: sequence of stmts stmt :: var_decl | if_stmt | while_stmt | for_stmt | continue_stmt | break_stmt | "{" stmt... "}" var_decl :: ident ":=" expr bool :: "true" | "false" expr :: bexpr bool_op bexpr | bexpr bexpr :: rexpr relop rexpr | rexpr rexpr :: term termOp term | term term :: term factor_op factor | factor factor :: ID | number | string | True | False | "(" expr ")" | "+|-" factor | ID "(" [exppr ["," expr]] ")" term_op :: "+" | "-" factor_op :: "*" | "/" if_stmt :: "if" expr stmt | "if" bool_expr stmt "else" stmt while_stmt :: "while" expr stmt for_stmt :: "for" ID stmt break_stmt :: "break" continue_stmt :: "continue"
The result of parsing will be an abstract syntax tree (AST). In further posts, I will implement evaluation of the tree or use it generate code.
Using Parsec in applicative style leads to remarkably concise and simple implementation.
Lists exports from this module. We export lower-level parse functions for testing purposes.
module Parser (Expr(..), Stmt(..), dslP, parse, exprP, stmtP) where
Notable imports are Parsec and friends. We also import some functions
from Control.Applicative
.
import Control.Applicative (liftA2, pure, (<*>), (<$>), (<*), (*>)) import Text.Parsec import Text.Parsec.String (Parser) import Text.ParserCombinators.Parsec.Char (digit, letter, alphaNum, lower, upper) import Text.ParserCombinators.Parsec.Language (emptyDef) import qualified Text.ParserCombinators.Parsec.Token as Token
tokenDef = Token.makeTokenParser $ emptyDef { Token.commentStart = "/*" , Token.commentEnd = "*/" , Token.commentLine = "//" , Token.identStart = letter , Token.identLetter = alphaNum , Token.reservedNames = [ "break" , "continue" , "else" , "false" , "if" , "print" , "true" , "while" ] , Token.reservedOpNames = ["+", "-", "*", "/", ":=" , "<", ">", "|" , "and", "or", "not"] } reserved = Token.reserved tokenDef reservedOp = Token.reservedOp tokenDef ident = Token.identifier tokenDef integer = Token.integer tokenDef float = Token.float tokenDef stringLit = Token.stringLiteral tokenDef ws = Token.whiteSpace tokenDef symbol = Token.symbol tokenDef parens = Token.parens tokenDef braces = Token.braces tokenDef
data Expr = Add Expr Expr | Sub Expr Expr | Mul Expr Expr | Div Expr Expr | Eq Expr Expr | Less Expr Expr | Greater Expr Expr | Le Expr Expr | Ge Expr Expr | Ne Expr Expr | And Expr Expr | Or Expr Expr | Not Expr | Neg Expr | Call String [Expr] | V String | I Integer | D Double | S String | T | F deriving (Show, Eq) data Stmt = Assign Expr Expr | Block [Stmt] | Print [Expr] | If Expr Stmt (Maybe Stmt) | While Expr Stmt | Break | Continue deriving (Show, Eq)
commaSep p = p `sepBy` (symbol ",")
This is the expression parser. This accepts semantically invalid expressions as there is no distinction between numerical, string and boolean expressions. In a future post, I will implement a semantic pass over the AST which will flag invalid expressions.
exprP :: Parser Expr --exprP = termP `chainl1` termopP exprP = bexprP `chainl1` bopP bexprP = rexprP `chainl1` relopP rexprP = termP `chainl1` termopP termP :: Parser Expr termP = factorP `chainl1` factoropP factorP :: Parser Expr factorP = Not <$> (notP *> factorP) <|> Neg <$> (symbol "-" *> factorP) <|> symbol "+" *> factorP <|> D <$> try float <|> I <$> try integer <|> S <$> stringLit <|> reserved "true" *> return T <|> reserved "false" *> return F <|> try callP <|> V <$> ident <|> parens exprP relopP = (reservedOp "=" *> return Eq <|> reservedOp "<" *> return Less <|> reservedOp ">" *> return Greater <|> reservedOp "!=" *> return Ne <|> reservedOp "<=" *> return Le <|> reservedOp "<=" *> return Ge) bopP = symbol "|" *> return Or <|> symbol "&" *> return And <|> reserved "or" *> return Or <|> reserved "and" *> return And termopP = symbol "+" *> return Add <|> symbol "-" *> return Sub factoropP = symbol "*" *> return Mul <|> symbol "/" *> return Div callP = Call <$> ident <*> parens (commaSep exprP) notP = reservedOp "!" <|> reserved "not"
stmtP :: Parser Stmt stmtP = assignP <|> blockP <|> printP <|> try ifElseP <|> ifP <|> whileP <|> breakP <|> continueP blockP = Block <$> braces (many stmtP) printP = Print <$> (reserved "print" >> (commaSep exprP)) assignP = Assign <$> exprP <*> (reservedOp ":=" >> exprP) ifP = If <$> ((reserved "if") >> exprP) <*> stmtP <*> return Nothing ifElseP = If <$> ((reserved "if") >> exprP) <*> stmtP <*> ((reserved "else") *> (Just <$> stmtP)) whileP = While <$> (reserved "while" >> exprP) <*> stmtP breakP = reserved "break" *> return Break continueP = reserved "continue" *> return Continue
dslP :: Parser [Stmt] dslP = ws *> many stmtP <* eof
With Parsec, it turns out to be important to order the parsers and
adorn them with try
. This was most evident in factorP
. Very
briefly, when there are two parsers one of which is a prefix of the
orher, the parser of the longer input should be listed first. If a
parser can fail after consuming some input, it should be wrapped in
try
so that the next parser will be tried at the correct input
position.
Other than this, using Parsec to build parsers is pretty straightforward.
Here is test program that verifies the correctness of the parser. The
tokenizer seems to have a bug. It correctly parses "1.2" as D 1.2
but parses -1.2
as I (-1)
. I will defer this issue for now!
import Text.Parsec (parseTest) import Data.List (intercalate) import Text.Parsec.String import Parser
Testing expression parsing.
exprTests :: [(String, Expr)] exprTests = [("10", I 10) , ("-1", Neg (I 1)) , ("- 1", Neg (I 1)) , ("1.2", D 1.2) , ("-1.2", Neg (D 1.2)) , ("- 1.3", Neg (D 1.3)) , ("a", V "a") , ("\"a\"", S "a") , ("true", T) , ("false", F) , ("1 + 2", Add (I 1) (I 2)) , ("1 + -2", Add (I 1) (Neg (I 2))) , ("1 + 2 * 3", Add (I 1) (Mul (I 2) (I 3))) , ("1 - 2", Sub (I 1) (I 2)) , ("1 - 2 * 3", Sub (I 1) (Mul (I 2) (I 3))) , ("1 + 2 * 3 / 4", Add (I 1) (Div (Mul (I 2) (I 3)) (I 4))) , ("1 + a", Add (I 1) (V "a")) , ("1 = a", Eq (I 1) (V "a")) , ("1 = 2", Eq (I 1) (I 2)) , ("true and true", And T T) , ("true & true", And T T) , ("true | true", Or T T) , ("true or true", Or T T) , ("1 = 2 & 2 = 4", And (Eq (I 1) (I 2)) (Eq (I 2) (I 4))) , ("a = b & c = d", And (Eq (V "a") (V "b")) (Eq (V "c") (V "d"))) , ("a = b | c = d", Or (Eq (V "a") (V "b")) (Eq (V "c") (V "d"))) , ("(a | b) & (c | d)", And (Or (V "a") (V "b")) (Or (V "c") (V "d"))) , ("(a & b) | (c & d)", Or (And (V "a") (V "b")) (And (V "c") (V "d"))) , ("-(1.2)", Neg (D 1.2)) , ("+(1.2)", D 1.2) , ("not true", Not T) , ("not not true", Not (Not T)) , ("true = false", Eq T F) , ("foo()", Call "foo" []) , ("foo(1)", Call "foo" [I 1]) , ("foo(1, true)", Call "foo" [I 1, T]) , ("foo(1, 2)", Call "foo" [I 1, I 2]) ]
Testing statement parsing.
stmtTests :: [(String, Stmt)] stmtTests = [ ("x := 1", Assign (V "x") (I 1)) , ("print 1, 2", Print [I 1, I 2]) , ("print 1", Print [I 1]) , ("{}", Block []) , ("if true print \"T\" else print \"F\"", If T (Print [S "T"]) (Just (Print [S "F"]))) , ("if true print 1", If T (Print [I 1]) Nothing) , ("break", Break) , ("continue", Continue) ]
Example DSL and its expected result.
dslTests :: [(String, [Stmt])] dslTests = [ ("x := 1 y:= 2", [Assign (V "x") (I 1), Assign (V "y") (I 2)]) , (" x := 1 ", [Assign (V "x") (I 1)]) ]
testParser
accepts a list of input and expected results of parsing
them. Inputs that don't produce the expected results are printed.
testParser:: (Eq a, Show a) => Parser a -> [(String, a)] -> IO () testParser p tests = do putStr (intercalate "\r\n" (filter (not . null) (map (\(s, e, r) -> case r of Right ast -> if e == ast then "" -- "Parsed: " ++ s else "Error: " ++ s ++ " Exp: " ++ show e ++ " Act: " ++ show ast Left e -> "Parse error: \n" ++ show e) (map (\(s, e) -> (s, e, parse p s s)) tests))))
Runs all tests defined above.
main :: IO () main = do testParser exprP exprTests testParser stmtP stmtTests testParser dslP dslTests putStr "\nDONE!\n"