Switch from Parsec to Megaparsec
Published on October 15, 2015, last updated September 27, 2018
- Renamed things
- Removed things
- Completely changed things
- Character parsing
- Expression parsing
- What happened to Text.Parsec.Token?
- What’s next?
This tutorial explains the practical differences between the two libraries that you will need to address if you choose to undertake the switch. Remember, all the functionality available in Parsec is available in Megaparsec and often in a better form.
You’ll mainly need to replace the Parsec part of your imports with
Megaparsec. That’s pretty simple. A typical import section of a module that
uses Megaparsec looks like this:
    -- this module contains commonly useful tools:
    import Text.Megaparsec
    -- if you parse a stream of characters
    import Text.Megaparsec.Char
    -- if you parse a stream of bytes
    import Text.Megaparsec.Byte
    -- if you need to parse permutation phrases:
    import Control.Applicative.Permutations -- from parser-combinators
    -- if you need to parse expressions:
    import Control.Monad.Combinators.Expr -- from parser-combinators
    -- for lexing of character streams
    import qualified Text.Megaparsec.Char.Lexer as L
    -- for lexing of binary streams
    import qualified Text.Megaparsec.Byte.Lexer as L
So the only noticeable difference is that Megaparsec has no
Text.Megaparsec.Token module; it is replaced with
Text.Megaparsec.Char.Lexer (or Text.Megaparsec.Byte.Lexer if you work
with binary data). See the section “What happened to Text.Parsec.Token?” for details.
Megaparsec introduces a more consistent naming scheme, so some things are called differently. Renaming functions is an easy, mechanical task. Here are the renamed items:
† Pay attention to these: since space parses many white space characters,
including zero, your parser will hang if you write something like
many space. So be careful to replace many space with either
many spaceChar or just space.
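As a small illustration of that advice, here is a sketch of two ways to skip optional white space without wrapping space in many. The names word and word' are made up for this example, and the Parser synonym is an assumption about how you set up your own parser type.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

-- Assumed parser type for this sketch: no custom errors, String input.
type Parser = Parsec Void String

-- 'space' already skips zero or more white space characters,
-- so it is used directly, never under 'many'.
word :: Parser String
word = space *> some letterChar <* space

-- Equivalent spelling: repeat a parser that always consumes
-- exactly one white space character.
word' :: Parser String
word' = many spaceChar *> some letterChar <* many spaceChar
```

Both parsers accept input such as "  abc " and return the letters without the surrounding white space.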
Parsec also has many names for the same or similar things. Megaparsec usually has one function per task. Here are the items that were removed in Megaparsec and the reasons for their removal:
- parseFromFile: reading from a file and then parsing its contents is trivial for every instance of Stream, and this function provides no way to use newer methods for running a parser.
- modifyState and the related user state functions: ad-hoc backtracking user state has been eliminated.
- tokens and similar primitives: slightly different versions of these functions now exist under the same names.
- Reply and Consumed are no longer public data types, because they are low-level implementation details.
- runPT and runP were essentially synonyms for runParserT and runParser.
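If you relied on parseFromFile, a replacement can be sketched in a couple of lines. The helper name parseFile below is hypothetical (it is not part of Megaparsec), and the sketch assumes Megaparsec 7 or later, where runParser returns a ParseErrorBundle.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

-- Assumed parser type for this sketch.
type Parser = Parsec Void String

-- Hypothetical helper: read the whole file, then run the parser on its
-- contents, passing the path so that it shows up in error messages.
parseFile :: Parser a -> FilePath -> IO (Either (ParseErrorBundle String Void) a)
parseFile p path = runParser p path <$> readFile path

-- Tiny example parser to run against a file.
word :: Parser String
word = some letterChar
```

Because the reading step is explicit, you can swap readFile for a Text or ByteString reader, or call a different runner, without waiting for the library to provide a variant.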
Completely changed things
In Megaparsec 5, 6, and 7 the modules Text.Megaparsec.Pos and
Text.Megaparsec.Error are completely different from those found in Parsec
and Megaparsec 4. Take some time to look at the documentation of these modules if
your use case requires operations on error messages or positions. You may
like the fact that we have well-typed and extensible error messages now.
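To give a feel for what “well-typed and extensible” can mean in practice, here is a minimal sketch of a custom error component. The KeywordError type, the reserved word list, and the check helper are all invented for illustration; the sketch assumes Megaparsec 7 or later for errorBundlePretty.

```haskell
import Text.Megaparsec
import Text.Megaparsec.Char

-- Hypothetical custom error component for this sketch.
data KeywordError = ReservedWord String
  deriving (Eq, Ord, Show)

-- Tells Megaparsec how to render the custom component.
instance ShowErrorComponent KeywordError where
  showErrorComponent (ReservedWord w) =
    "reserved word used as identifier: " ++ w

-- The custom component is the first type parameter of Parsec.
type Parser = Parsec KeywordError String

identifier :: Parser String
identifier = do
  w <- some letterChar
  if w `elem` ["let", "in"]
    then customFailure (ReservedWord w)
    else pure w

-- Runs the parser and renders any error as a human-readable string.
check :: String -> Either String String
check input = case parse identifier "<input>" input of
  Left bundle -> Left (errorBundlePretty bundle)
  Right x -> Right x
```

Since the error is an ordinary Haskell value, calling code can also pattern match on it instead of rendering it.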
For full, up-to-date information see the changelog. Over the years we have gone so far ahead of Parsec that it would take a lot of space to enumerate all the nice stuff.
New character parsers in Text.Megaparsec.Char
may be useful if you work with Unicode:
Case-insensitive character parsers are also available:
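A short sketch of both groups of parsers follows. It uses char' and string' for case-insensitive matching, and numberChar and punctuationChar as examples of the Unicode-aware character classes; the parser names (selectKw, hexMarker, numbered) are made up for this example.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

-- Assumed parser type for this sketch.
type Parser = Parsec Void String

-- Matches "select" regardless of letter case.
selectKw :: Parser String
selectKw = string' "select"

-- Matches 'x' or 'X'.
hexMarker :: Parser Char
hexMarker = char' 'x'

-- One or more characters from a Unicode number category,
-- followed by a single punctuation character.
numbered :: Parser (String, Char)
numbered = (,) <$> some numberChar <*> punctuationChar
```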
makeExprParser has a flipped order of arguments: term parser first, operator
table second. To specify the associativity of infix operators you use one of the
Operator constructors InfixN, InfixL, or InfixR.
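The argument order is easiest to see in a small example. The following sketch evaluates integer arithmetic directly; the grammar, the names term and expr, and the choice to skip white space handling are simplifications made for this illustration.

```haskell
import Control.Monad.Combinators.Expr
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

-- Assumed parser type for this sketch.
type Parser = Parsec Void String

-- A term is a number or a parenthesised expression.
term :: Parser Int
term = L.decimal <|> between (char '(') (char ')') expr

-- Term parser first, operator table second.
-- Rows are listed from highest to lowest precedence.
expr :: Parser Int
expr = makeExprParser term table
  where
    table =
      [ [Prefix (negate <$ char '-')]
      , [InfixL ((*) <$ char '*')]
      , [InfixL ((+) <$ char '+'), InfixL ((-) <$ char '-')]
      ]
```

With this table, "1+2*3" evaluates to 7 and "2-3-4" to -5, because both binary levels are left-associative and multiplication binds tighter.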
What happened to Text.Parsec.Token?
That module was extremely inflexible and thus it has been eliminated. In
Megaparsec you have Text.Megaparsec.Char.Lexer
instead, which doesn’t impose anything on the user but provides useful helpers.
The module can also parse indentation-sensitive languages.
Let’s quickly describe how you go about writing your lexer with
Text.Megaparsec.Char.Lexer. First, you should import the module qualified;
we will use L as its synonym here.
Start writing your lexer by defining what counts as white space in your
language. L.space, L.skipLineComment, and L.skipBlockComment can be helpful:
    sc :: Parser () -- ‘sc’ stands for “space consumer”
    sc = L.space space1 lineComment blockComment
      where
        lineComment  = L.skipLineComment "//"
        blockComment = L.skipBlockComment "/*" "*/"
sc is generally called a space consumer. Often you’ll need only one space
consumer, but you can define as many of them as you want. Note that this new
module lets you avoid consuming newline characters automatically: just use
something other than space1 as the first argument of L.space. Even better,
you can control what counts as white space on a per-lexeme basis:
    lexeme :: Parser a -> Parser a
    lexeme = L.lexeme sc

    symbol :: String -> Parser String
    symbol = L.symbol sc
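To show how these pieces fit together, here is a self-contained sketch that repeats the definitions above and adds two small helpers, identifier and parens, which are invented for this example.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

-- Assumed parser type for this sketch.
type Parser = Parsec Void String

-- Space consumer: white space, // line comments, /* block comments */.
sc :: Parser ()
sc = L.space space1 (L.skipLineComment "//") (L.skipBlockComment "/*" "*/")

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: String -> Parser String
symbol = L.symbol sc

-- Hypothetical helpers built on top of lexeme and symbol.
identifier :: Parser String
identifier = lexeme (some letterChar)

parens :: Parser a -> Parser a
parens = between (symbol "(") (symbol ")")
```

Every lexeme consumes the white space and comments that follow it, so only the very beginning of the input needs an explicit sc, as in sc *> parens identifier.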
Note that all tools in Megaparsec work with any instance of MonadParsec.
Commonly useful monad transformers such as StateT are instances of
MonadParsec out of the box. For example, suppose you want to
collect the contents of comments (say, they are documentation strings of a
sort). You may want to have a backtracking user state where you put the last
encountered comment satisfying some criteria, and then when you parse a
function definition you can check the state and attach the doc-string to your
parsed function. It’s all possible and easy with Megaparsec:
    import Control.Monad.State.Lazy
    …
    type MyParser = StateT String Parser

    skipLineComment' :: MyParser ()
    skipLineComment' = …

    skipBlockComment' :: MyParser ()
    skipBlockComment' = …

    sc :: MyParser ()
    sc = L.space space1 skipLineComment' skipBlockComment'
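The elided bodies depend on your language, but one possible shape can be sketched under a few assumptions: comments start with "--", there are no block comments, and a “definition” is just a name. The definition and runMy helpers are hypothetical and exist only to make the sketch runnable.

```haskell
import Control.Applicative (empty)
import Control.Monad.State.Lazy (StateT, evalStateT, get, put)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String
type MyParser = StateT String Parser

-- Like L.skipLineComment "--", but stores the comment text in the state.
skipLineComment' :: MyParser ()
skipLineComment' = do
  _ <- string "--"
  body <- takeWhileP (Just "comment text") (/= '\n')
  put body

-- No block comments in this sketch, hence 'empty'.
sc :: MyParser ()
sc = L.space space1 skipLineComment' empty

-- Hypothetical definition parser: pairs the last comment seen
-- with the name that follows it.
definition :: MyParser (String, String)
definition = do
  doc <- get
  name <- L.lexeme sc (some letterChar)
  pure (doc, name)

-- Runs a stateful parser starting from an empty doc-string.
runMy :: MyParser a -> String -> Maybe a
runMy p = parseMaybe (evalStateT p "")
```

Because StateT sits on top of the parser, the state is rolled back together with the input whenever the parser backtracks.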
Parsing of indentation-sensitive languages deserves its own tutorial, but let’s take a look at the basic tools upon which we can build. First of all, we should work with a space consumer that doesn’t eat newlines automatically. This means we’ll need to pick them up manually.
The main helper is called indentGuard. It takes a parser that will be used
to consume white space (indentation) and a predicate of the type
Int -> Bool. If, after running the given parser, the column number does not satisfy
the given predicate, the parser fails with the message “incorrect indentation”;
otherwise it returns the current column number.
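A minimal sketch of indentGuard in use follows. It assumes the signature exposed by recent Megaparsec releases, m () -> Ordering -> Pos -> m Pos, where the check is expressed as an Ordering against a reference position (older releases took a predicate instead). The parser names topLevel and nested are invented for this example.

```haskell
import Control.Applicative (empty)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

-- Assumed parser type for this sketch.
type Parser = Parsec Void String

-- Consumes indentation (and newlines); no comments in this sketch.
scn :: Parser ()
scn = L.space space1 empty empty

-- Succeeds only if the word starts exactly at column 1.
topLevel :: Parser String
topLevel = L.indentGuard scn EQ pos1 *> some letterChar

-- Succeeds only if the word starts to the right of column 1.
nested :: Parser String
nested = L.indentGuard scn GT pos1 *> some letterChar
```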
In simple cases you can explicitly pass around the value returned by
indentGuard, i.e. the current level of indentation. If you prefer to preserve
some sort of state, you can achieve backtracking state by combining StateT and
ParsecT, like this:
StateT Int Parser a
Here we have state of the type Int. You can use get and put as usual,
although it may be better to write a modified version of indentGuard that
could get the current indentation level (the indentation level on the previous line),
then consume the indentation of the current line, perform the necessary checks, and put
the new level of indentation.
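One way such a modified guard might look is sketched below. The name indentedMore and the specific rule (the new line must be indented strictly deeper than the previous one) are assumptions made for this example, not something the library provides.

```haskell
import Control.Applicative (empty)
import Control.Monad.State (StateT, evalStateT, get, put)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- The state holds the indentation level (column) of the previous line.
type IParser = StateT Int Parser

scn :: IParser ()
scn = L.space space1 empty empty

-- Hypothetical stateful guard: read the previous level, consume the
-- indentation of the current line, check it, and store the new level.
indentedMore :: IParser Int
indentedMore = do
  previous <- get
  current <- unPos <$> (scn *> L.indentLevel)
  if current > previous
    then current <$ put current
    else fail "incorrect indentation"
```

Running evalStateT (indentedMore *> some letterChar) 1 accepts "  foo" and rejects "foo", since column 1 is not deeper than the initial level of 1.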
Later update: now we have full support for indentation-sensitive parsing;
see nonIndented, indentBlock, and lineFold in the Text.Megaparsec.Char.Lexer module.
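As a rough sketch of the built-in support, the following parses a header line followed by a block of equally indented items using L.nonIndented and L.indentBlock. The grammar and the names item and itemList are invented for this example.

```haskell
import Control.Applicative (empty)
import Control.Monad (void)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

-- Assumed parser type for this sketch.
type Parser = Parsec Void String

-- Space consumer that also eats newlines (used between lines).
scn :: Parser ()
scn = L.space space1 empty empty

-- Space consumer that stays on the current line.
sc :: Parser ()
sc = L.space (void (some (char ' ' <|> char '\t'))) empty empty

item :: Parser String
item = L.lexeme sc (some alphaNumChar)

-- A header at column 1 followed by zero or more items that are
-- all indented to the same, deeper level.
itemList :: Parser (String, [String])
itemList = L.nonIndented scn (L.indentBlock scn header)
  where
    header = do
      name <- item
      pure (L.IndentMany Nothing (\items -> pure (name, items)) item)
```

Note the two space consumers: item uses the one that stays on the line, so newlines remain visible to indentBlock, which needs them to detect where each item begins.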
Character and string literals
Parsing of string and character literals is done a bit differently than in
Parsec. We have a single helper, charLiteral, which parses a character
literal. It does not parse surrounding quotes, because different languages
may quote character literals differently. The purpose of this parser is to
help with parsing of conventional escape sequences (a literal character is
parsed according to the rules defined in the Haskell report).
    myCharLiteral :: Parser Char
    myCharLiteral = char '\'' *> L.charLiteral <* char '\''
charLiteral can also be used to parse string literals. This is a simplified
version that will accept plain (not escaped) newlines in string literals
(it’s easy to make it conform to Haskell syntax; this is left as an exercise
for the reader):
    stringLiteral :: Parser String
    stringLiteral = char '"' >> manyTill L.charLiteral (char '"')
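A self-contained version of the two literal parsers, with the imports they need, might look like the sketch below. The names charLit and stringLit are chosen for this example only.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

-- Assumed parser type for this sketch.
type Parser = Parsec Void String

-- A character literal in single quotes, with Haskell-style escapes.
charLit :: Parser Char
charLit = char '\'' *> L.charLiteral <* char '\''

-- A string literal in double quotes; each element is parsed with
-- L.charLiteral, so escapes such as \t or \n are decoded.
stringLit :: Parser String
stringLit = char '"' *> manyTill L.charLiteral (char '"')
```

Here manyTill tries the closing quote before each character, which is why an escaped quote inside the literal does not end the string early.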
Parsing of numbers is easy:
    -- requires: import Data.Scientific (Scientific)
    -- The signatures use Parser because lexeme is defined for Parser above.

    decimal :: Integral a => Parser a
    decimal = lexeme L.decimal

    float :: RealFloat a => Parser a
    float = lexeme L.float

    number :: Parser Scientific
    number = lexeme L.scientific -- similar to ‘naturalOrFloat’ in Parsec
Megaparsec’s numeric parsers were heavily optimized in version 6; they are close to Attoparsec’s solutions in terms of performance.
The hexadecimal and octal parsers do not parse “0x” or “0o” prefixes, because different languages may use other prefixes for such numbers. We should parse the prefixes manually:
    hexadecimal :: Integral a => Parser a
    hexadecimal = lexeme $ char '0' >> char' 'x' >> L.hexadecimal

    octal :: Integral a => Parser a
    octal = lexeme $ char '0' >> char' 'o' >> L.octal
Since the Haskell report says nothing about sign in numeric literals, basic
parsers like decimal do not parse a sign. You can easily create parsers for
signed numbers with the help of L.signed:
    signedDecimal :: Integral a => Parser a
    signedDecimal = L.signed sc decimal

    signedFloat :: RealFloat a => Parser a
    signedFloat = L.signed sc float

    signedNumber :: Parser Scientific
    signedNumber = L.signed sc number
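Putting the numeric helpers together, a compact, self-contained sketch could look like this. It uses concrete result types (Integer and Double) and a comment-free space consumer to keep the example short; those choices, and the names integer and hex, are assumptions of the sketch.

```haskell
import Control.Applicative (empty)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

-- Assumed parser type for this sketch.
type Parser = Parsec Void String

sc :: Parser ()
sc = L.space space1 empty empty

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

integer :: Parser Integer
integer = lexeme L.decimal

-- The first argument of L.signed consumes white space that may
-- appear between the sign and the number.
signedInteger :: Parser Integer
signedInteger = L.signed sc integer

signedFloat :: Parser Double
signedFloat = L.signed sc (lexeme L.float)

-- The "0x" prefix is parsed manually; char' accepts 'x' or 'X'.
hex :: Parser Integer
hex = lexeme (char '0' *> char' 'x' *> L.hexadecimal)
```

If your language does not allow a space after the sign, pass a parser that consumes nothing, such as pure (), instead of sc.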
And that’s it: shiny and new, Text.Megaparsec.Char.Lexer is at your
service. Now you can implement anything you want without the need to copy
and edit the entire Text.Parsec.Token module (people sometimes had to do exactly that).
The changes you may want to make may be more fundamental than those described
here. For example, previously you may have had to use a workaround because
Text.Parsec.Token was not sufficiently flexible. Now you can replace it
with a proper solution. If you want to use the full potential of Megaparsec,
take time to read about its features; they can help you improve your parsers.