Switch from Parsec to Megaparsec

Published on October 15, 2015, last updated September 27, 2018

This tutorial explains the practical differences between the two libraries that you will need to address if you choose to undertake the switch. Remember, all the functionality available in Parsec is available in Megaparsec and often in a better form.

Imports

You’ll mainly need to replace the Parsec part in your imports with Megaparsec. That’s pretty simple. A typical import section of a module that uses Megaparsec looks like this:

-- this module contains commonly useful tools:
import Text.Megaparsec
-- if you parse a stream of characters
import Text.Megaparsec.Char
-- if you parse a stream of bytes
import Text.Megaparsec.Byte
-- if you need to parse permutation phrases:
import Control.Applicative.Permutations -- from parser-combinators
-- if you need to parse expressions:
import Control.Monad.Combinators.Expr -- from parser-combinators
-- for lexing of character streams
import qualified Text.Megaparsec.Char.Lexer as L
-- for lexing of binary streams
import qualified Text.Megaparsec.Byte.Lexer as L

So, the only noticeable difference is that Megaparsec has no Text.Megaparsec.Token module; it is replaced by Text.Megaparsec.Char.Lexer (or Text.Megaparsec.Byte.Lexer if you work with binary data). See the section “What happened to Text.Parsec.Token?” for more about this.

Renamed things

Megaparsec introduces a more consistent naming scheme, so some things are called differently. Renaming functions is a very easy task, though; you don’t need to think. Here are the renamed items:

  • many1 → some (re-exported from Control.Applicative)
  • skipMany1 → skipSome
  • tokenPrim → token
  • optionMaybe → optional (re-exported from Control.Applicative)
  • permute → makePermParser
  • buildExpressionParser → makeExprParser

Character parsing:

  • alphaNum → alphaNumChar
  • digit → digitChar
  • endOfLine → eol
  • hexDigit → hexDigitChar
  • letter → letterChar
  • lower → lowerChar
  • octDigit → octDigitChar
  • space → spaceChar†
  • spaces → space†
  • upper → upperChar

† Pay attention to these: Megaparsec’s space parses many spaceChars, including zero, so if you leave something like many space in your code, your parser will hang. Be careful to replace Parsec’s many space with either many spaceChar or simply space.
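
To illustrate, a hypothetical Parsec fragment such as many1 (letter <|> digit) translates mechanically (Parser is the usual type synonym for your parser monad):

-- Megaparsec version of the Parsec parser ‘many1 (letter <|> digit)’
ident :: Parser String
ident = some (letterChar <|> digitChar)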

Removed things

Parsec has many names for the same or similar things, while Megaparsec usually has one function per task. Here are the items that were removed in Megaparsec and the reasons for their removal:

  • parseFromFile—reading a file and then parsing its contents is trivial for every instance of Stream, and this function provides no way to use newer methods for running a parser, such as runParser'. A simple replacement is sketched after this list.

  • getState, putState, modifyState—ad-hoc backtracking user state has been eliminated.

  • unexpected, token, and tokens: slightly different versions of these functions now exist under the same names.

  • Reply and Consumed are not public data types anymore, because they are low-level implementation details.

  • runPT and runP were essentially synonyms for runParserT and runParser respectively.

  • chainl, chainl1, chainr, and chainr1—use Control.Monad.Combinators.Expr from the parser-combinators package instead.
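
If you miss parseFromFile, a minimal stand-in is easy to write yourself; the sketch below assumes String input and Megaparsec 7’s ParseErrorBundle:

import Data.Void (Void)
import Text.Megaparsec

-- read a file and run a parser on its contents, returning errors in Left
parseFromFile
  :: Parsec Void String a
  -> FilePath
  -> IO (Either (ParseErrorBundle String Void) a)
parseFromFile p file = runParser p file <$> readFile file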

Completely changed things

In Megaparsec 5, 6, and 7 the modules Text.Megaparsec.Pos and Text.Megaparsec.Error are completely different from those found in Parsec and Megaparsec 4. Take some time to look at the documentation of these modules if your use case requires operations on error messages or positions. You may like the fact that we now have well-typed and extensible error messages.
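
For instance, with Megaparsec 7 you can pretty-print the new error bundles like this (reportResult is a hypothetical helper, and p and input stand for your own parser and input):

import Data.Void (Void)
import Text.Megaparsec

-- run a parser on a String and print either a readable error report or the result
reportResult :: Show a => Parsec Void String a -> String -> IO ()
reportResult p input =
  case runParser p "<input>" input of
    Left bundle -> putStr (errorBundlePretty bundle)
    Right x     -> print x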

Other

For the full, up-to-date info see the changelog. Over the years we have gone so far ahead of Parsec that it would take a lot of space to enumerate all the nice stuff.

Character parsing

New character parsers in Text.Megaparsec.Char may be useful if you work with Unicode:

  • asciiChar
  • charCategory
  • controlChar
  • latin1Char
  • markChar
  • numberChar
  • printChar
  • punctuationChar
  • separatorChar
  • symbolChar

Case-insensitive character parsers are also available:

  • char'
  • string'
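
For example, string' makes case-insensitive keywords trivial (kwSelect is just an illustrative name, assuming the usual String-based Parser):

kwSelect :: Parser String
kwSelect = string' "select" -- matches "select", "SELECT", "Select", etc.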

Expression parsing

makeExprParser takes its arguments in the opposite order to buildExpressionParser: the term parser comes first, the operator table second. To specify the associativity of infix operators you use one of the three Operator constructors:

  • InfixN—non-associative infix
  • InfixL—left-associative infix
  • InfixR—right-associative infix
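
For example, here is a minimal sketch of an integer arithmetic parser; the expr name, the single-character operators, and the Parser type synonym are assumptions made purely for illustration (white space handling is omitted for brevity):

import Control.Applicative ((<|>))
import Data.Void (Void)
import Control.Monad.Combinators.Expr -- from parser-combinators
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- term parser first, operator table second
expr :: Parser Integer
expr = makeExprParser term operatorTable
  where
    term = L.decimal <|> between (char '(') (char ')') expr
    operatorTable =
      [ [ InfixL ((*) <$ char '*') ]   -- binds tighter
      , [ InfixL ((+) <$ char '+')
        , InfixL ((-) <$ char '-') ]   -- binds looser
      ]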

What happened to Text.Parsec.Token?

That module was extremely inflexible, so it has been eliminated. In Megaparsec you have Text.Megaparsec.Char.Lexer instead, which doesn’t impose anything on the user but provides useful helpers. The module can also help with parsing indentation-sensitive languages.

Let’s quickly describe how you go about writing your lexer with Text.Megaparsec.Char.Lexer. First, you should import the module qualified; we will use L as its synonym here.

White space

Start writing your lexer by defining what counts as white space in your language. L.space, L.skipLineComment, and L.skipBlockComment can be helpful:

sc :: Parser () -- ‘sc’ stands for “space consumer”
sc = L.space space1 lineComment blockComment
  where
    lineComment  = L.skipLineComment "//"
    blockComment = L.skipBlockComment "/*" "*/"

sc is generally called the space consumer. Often you’ll need only one space consumer, but you can define as many of them as you want. Note that this new module allows you to avoid consuming newline characters automatically: just use something other than space1 as the first argument of L.space. Even better, you can control what counts as white space on a per-lexeme basis:

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: String -> Parser String
symbol = L.symbol sc
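
With these helpers in place, individual lexemes are easy to define; for instance (the names below are only illustrative):

parens :: Parser a -> Parser a
parens = between (symbol "(") (symbol ")")

semicolon :: Parser String
semicolon = symbol ";"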

Monad transformers

Note that all tools in Megaparsec work with any instance of MonadParsec. All commonly useful monad transformers like StateT and WriterT are instances of MonadParsec out of the box. For example, suppose you want to collect the contents of comments (say, they are documentation strings of a sort): you may want backtracking user state where you put the last encountered comment satisfying some criteria, so that when you parse a function definition you can check the state and attach the doc-string to the parsed function. It’s all possible and easy with Megaparsec; here is one way it might look (the comment syntax and the definitions of the comment parsers are only an illustrative sketch):

import Control.Monad (void)
import Control.Monad.State.Lazy
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

-- the usual 'Parser' synonym assumed throughout this tutorial
type Parser = Parsec Void String

-- the user state holds the last encountered comment
type MyParser = StateT String Parser

-- one possible implementation (Megaparsec 7 style), assuming "//" line comments
skipLineComment' :: MyParser ()
skipLineComment' = do
  void (string "//")
  put =<< takeWhileP (Just "character") (/= '\n')

-- likewise for "/* ... */" block comments
skipBlockComment' :: MyParser ()
skipBlockComment' = do
  void (string "/*")
  put =<< manyTill anySingle (string "*/")

sc :: MyParser ()
sc = L.space (void spaceChar) skipLineComment' skipBlockComment'

Indentation-sensitive languages

Parsing of an indentation-sensitive language deserves its own tutorial, but let’s take a look at the basic tools upon which we can build. First of all, we should work with a space consumer that doesn’t eat newlines automatically. This means we’ll need to pick them up manually.

The main helper is called indentGuard. It takes a parser that will be used to consume white space (indentation) and a predicate of the type Int -> Bool. If, after running the given parser, the column number does not satisfy the given predicate, the parser fails with the message “incorrect indentation”; otherwise it returns the current column number.

In simple cases you can explicitly pass around the value returned by indentGuard, i.e. the current level of indentation. If you prefer to preserve some sort of state, you can achieve backtracking state by combining StateT and ParsecT, like this:

StateT Int Parser a

Here we have state of the type Int. You can use get and put as usual, although it may be better to write a modified version of indentGuard that gets the current indentation level (the indentation level on the previous line), consumes the indentation of the current line, performs the necessary checks, and puts the new level of indentation.

Later update: now we have full support for indentation-sensitive parsing, see nonIndented, indentBlock, and lineFold in the Text.Megaparsec.Char.Lexer module.
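
To give a taste of those tools, here is a rough sketch that parses a non-indented header followed by an indented list of items; the concrete grammar, the scn and pItem names, and the item syntax are assumptions made purely for illustration:

import Control.Applicative (empty, (<|>))
import Control.Monad (void)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

scn :: Parser () -- space consumer that does eat newlines
scn = L.space space1 empty empty

sc :: Parser () -- space consumer that leaves newlines alone
sc = L.space (void (some (char ' '))) empty empty

pItem :: Parser String
pItem = L.lexeme sc (some (alphaNumChar <|> char '-'))

-- a non-indented header followed by zero or more indented items
pItemList :: Parser (String, [String])
pItemList = L.nonIndented scn (L.indentBlock scn p)
  where
    p = do
      header <- pItem
      return (L.IndentMany Nothing (\items -> return (header, items)) pItem)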

Character and string literals

Parsing of string and character literals is done a bit differently than in Parsec. We have the single helper charLiteral, which parses a character literal. It does not parse surrounding quotes, because different languages may quote character literals differently. The purpose of this parser is to help with parsing of conventional escape sequences (literal character is parsed according to rules defined in the Haskell report).

myCharLiteral :: Parser Char
myCharLiteral = char '\'' *> L.charLiteral <* char '\''

charLiteral can also be used to parse string literals. This is a simplified version that will accept plain (unescaped) newlines in string literals (it’s easy to make it conform to Haskell syntax; this is left as an exercise for the reader):

stringLiteral :: Parser String
stringLiteral = char '"' >> manyTill L.charLiteral (char '"')

Numbers

Parsing of numbers is easy:

decimal :: (MonadParsec e s m, Token s ~ Char, Integral a) => m a
decimal = lexeme L.decimal

float :: (MonadParsec e s m, Token s ~ Char, RealFloat a) => m a
float = lexeme L.float

number :: (MonadParsec e s m, Token s ~ Char) => m Scientific
number = lexeme L.scientific -- similar to ‘naturalOrFloat’ in Parsec

Megaparsec’s numeric parsers were heavily optimized in version 6; they are close to Attoparsec’s solutions in terms of performance.

Hexadecimal and octal numbers do not parse the “0x” or “0o” prefixes, because different languages may have other prefixes for these kinds of numbers. We have to parse the prefixes manually:

hexadecimal :: (MonadParsec e s m, Token s ~ Char, Integral a) => m a
hexadecimal = lexeme $ char '0' >> char' 'x' >> L.hexadecimal

octal :: (MonadParsec e s m, Token s ~ Char, Integral a) => m a
octal = lexeme $ char '0' >> char' 'o' >> L.octal

Since the Haskell report says nothing about sign in numeric literals, basic parsers like decimal do not parse sign. You can easily create parsers for signed numbers with the help of signed:

signedDecimal :: (MonadParsec e s m, Token s ~ Char, Integral a) => m a
signedDecimal = L.signed sc decimal

signedFloat :: (MonadParsec e s m, Token s ~ Char, RealFloat a) => m a
signedFloat = L.signed sc float

signedNumber :: (MonadParsec e s m, Token s ~ Char) => m Scientific
signedNumber = L.signed sc number

And that’s it: shiny and new, Text.Megaparsec.Char.Lexer is at your service. Now you can implement anything you want without the need to copy and edit the entire Text.Parsec.Token module (people had to do that sometimes, you know).

What’s next?

The changes you may want to perform may be more fundamental than those described here. For example, previously you may have had to use a workaround because Text.Parsec.Token was not sufficiently flexible; now you can replace it with a proper solution. If you want to use the full potential of Megaparsec, take time to read about its features; they can help you improve your parsers.