Announcing Megaparsec 5

Published on May 15, 2016, last updated June 1, 2017

I’m happy to announce a new major release of Megaparsec.

Thanks

It’s hard not to see that Megaparsec is moving away from its granddaddy Parsec with every new release. In the version 5 we re-thought some design decisions made in Parsec and the result is a library written in more flexible and modern Haskell.

Megaparsec development is completely issue-driven. Actual users and their practical problems have been shaping the library for quite some time now and so I would like to thank people who opened most important issues and feature-requests that made me think again about design of the library:

  • Matteo Ferrando, an active user who noticed the problem with non-flexible position-advancing mechanism.

  • Herbert Valerio Riedel who proposed to add support for include files and move position-advancing functions into the Stream type class.

  • Wojciech Daniło who opened issue about well-typed and customizable error messages and actively participated in discussion.

  • Joe Hermaszewski who proposed deriving of Read and Show instances and using special functions for pretty-printing.

  • And others who proposed and in some cases implemented various minor changes.

Well-typed and customizable error messages

Perhaps the most important feature in this release is well-typed and customizable error messages. ParseError is now parametrized over token type t and custom error component e:

data ParseError t e = ParseError
  { errorPos        :: NonEmpty SourcePos -- ^ Stack of source positions
  , errorUnexpected :: Set (ErrorItem t)  -- ^ Unexpected items
  , errorExpected   :: Set (ErrorItem t)  -- ^ Expected items
  , errorCustom     :: Set e              -- ^ Associated data, if any
  } deriving (Show, Read, Eq, Typeable)

ErrorItem is parametrized over token type t and looks like this:

data ErrorItem t
  = Tokens (NonEmpty t)      -- ^ Non-empty stream of tokens
  | Label (NonEmpty Char)    -- ^ Label (cannot be empty)
  | EndOfInput               -- ^ End of input
  deriving (Show, Read, Eq, Ord, Data, Typeable)

Now an error in Megaparsec is not just a bunch of strings, but something more comprehensible. I expect some people may not like the intensive use of NonEmpty, but I tried to use types that are inhabited only by logically valid values.

As you can see, parse errors support stack of source positions. This has corresonding helpers in Text.Megaparsec.Prim and can be used to work with include files. Pretty-printing function parseErrorPretty also knows how to display stacks of source positions and there are pushPosition and popPosition functions available as well.

What about errorCustom? You can use your own type there and the whole library will work with that just fine. We have Dec out-of-the-box:

data Dec
  = DecFail String         -- ^ 'fail' has been used in parser monad
  | DecIndentation Ordering Pos Pos -- ^ Incorrect indentation error
  deriving (Show, Read, Eq, Ord, Data, Typeable)

Dec stands for “default error component” and unless you’re doing something advanced it should just work. To use your own type you just need to make it instance of ErrorComponent and ShowErrorComponent (only needed if you want to pretty-print error messages):

class Ord e => ErrorComponent e where

  -- | Represent message passed to 'fail' in parser monad.
  --
  -- @since 5.0.0

  representFail :: String -> e

  -- | Represent information about incorrect indentation.
  --
  -- @since 5.0.0

  representIndentation
    :: Ordering -- ^ Desired ordering between reference and actual level
    -> Pos             -- ^ Reference indentation level
    -> Pos             -- ^ Actual indentation level
    -> e

class Ord a => ShowErrorComponent a where

  -- | Pretty-print custom data component of 'ParseError'.

  showErrorComponent :: a -> String

Then you build on the top of failure primitive to report error messages with your data:

failure :: MonadParsec e s m
  => Set (ErrorItem (Token s)) -- ^ Unexpected items
  -> Set (ErrorItem (Token s)) -- ^ Expected items
  -> Set e                     -- ^ Custom data
  -> m a

For example, indentation-sensitive helpers from Text.Megaparsec.Lexer make use of this:

incorrectIndent :: MonadParsec e s m
  => Ordering  -- ^ Desired ordering between reference and actual level
  -> Pos               -- ^ Reference indentation level
  -> Pos               -- ^ Actual indentation level
  -> m a
incorrectIndent ord ref actual = failure E.empty E.empty (E.singleton x)
  where x = representIndentation ord ref actual

Which produces error messages like this (taken from this tutorial):

λ> parseTest parser "something\n  one\n    two\n  three"
3:5:
incorrect indentation (got 5, should be equal to 3)
λ> parseTest parser "something\n  one\n  two\n three"
4:2:
incorrect indentation (got 2, should be equal to 3)
λ> parseTest parser "something\n  one\n  two\n  three"
("something",["one","two","three"])

Much better than simply “incorrect indentation”!

The Stream type class

The Stream type class now has the associated type function Token:

class Ord (Token s) => Stream s where

  -- | Type of token in stream.
  --
  -- @since 5.0.0

  type Token s :: *

This is a very handy thing, it allowed to remove one type parameter from MonadParsec type class and write natural type constraints like this:

newline :: (MonadParsec e s m, Token s ~ Char) => m Char
newline = char '\n'

We will talk about this more in the next section.

Another addition to the Stream type class is the updatePos method. Its signature looks like this:

updatePos
  :: Proxy s -- ^ Proxy clarifying stream type, 'Token' is not injective
  -> Pos             -- ^ Tab width
  -> SourcePos       -- ^ Current position
  -> Token s         -- ^ Current token
  -> (SourcePos, SourcePos) -- ^ Actual position\/incremented position

This improved support for streams of tokens where information about position of token is encoded in token itself.

The MonadParsec type class

Declaration of MonadParsec has been changed:

class (ErrorComponent e, Stream s, Alternative m, MonadPlus m)
    => MonadParsec e s m | m -> e s where

We added type of custom error component e and removed type of token t because now it can be found from type of stream s via the type function Token. This gives us signatures like this:

token :: MonadParsec e s m
  => (Token s -> Either ( Set (ErrorItem (Token s))
                        , Set (ErrorItem (Token s))
                        , Set e ) a)
     -- ^ Matching function for the token to parse, it allows to
     -- construct arbitrary error message on failure; sets in the
     -- three-tuple are: unexpected items, expected items, and
     -- custom data pieces.
  -> Maybe (Token s) -- ^ Token to report when input stream is empty
  -> m a

tokens :: MonadParsec e s m
  => (Token s -> Token s -> Bool)
     -- ^ Predicate to check equality of tokens
  -> [Token s]
     -- ^ List of tokens to parse
  -> m [Token s]

Support for line folds

Finally support for line folds is here. It’s not difficult to implement though:


-- | Create a parser that supports line-folding. The first argument is
-- used to consume white space between components of line fold, thus it
-- /must/ consume newlines in order to work properly. The second
-- argument is a callback that receives custom space-consuming parser
-- as argument. This parser should be used after separate components of
-- line fold that can be put on different lines.
--
-- An example should clarify the usage pattern:
--
-- > sc = L.space (void spaceChar) empty empty
-- >
-- > myFold = L.lineFold sc $ \sc' -> do
-- >   L.symbol sc' "foo"
-- >   L.symbol sc' "bar"
-- >   L.symbol sc  "baz" -- for the last symbol we use a normal
--                        -- space consumer
--
-- @since 5.0.0

lineFold :: MonadParsec e s m
  => m ()              -- ^ How to consume indentation (white space)
  -> (m () -> m a)     -- ^ Callback that uses provided space-consumer
  -> m a
lineFold sc action =
  sc >> indentLevel >>= action . void . indentGuard sc GT

It’s simple and it works.

Use of scientific

Like Attoparsec, we switched to the scientific package for parsing of floating point values in Text.Megaparsec.Lexer. The Scientific type is safe against numbers with huge exponents and it can reliably represent integers too (it even has functions isFloating, isInteger, and others that allow handy dispatching), so now we can write number as:

-- | Parse a number: either integer or floating point. The parser can
-- handle overlapping grammars graciously. Use functions like
-- 'Data.Scientific.floatingOrInteger' from "Data.Scientific" to test
-- and extract integer or real values.

number :: (MonadParsec e s m, Token s ~ Char) => m Scientific

This is amazing, because you can get either floating point number or integer from it later and signed does not need an ad-hoc Signed type class anymore to compose with other functions from the module (previously number returned Either Integer Double):

signed :: (MonadParsec e s m, Token s ~ Char, Num a)
  => m ()              -- ^ How to parse space after sign
  -> m a               -- ^ Parser for unsigned number
  -> m a               -- ^ Parser for signed number

Performance

All these new features started to make Megaparsec slower. To counteract this I had to do some profiling and benchmarking (thanks again to Auke Booij who contributed benchmarks for Megaparsec on early stages of development) and indeed some inlining, manual re-write of (<*>), and careful use of strictness allowed me to improve the situation. Here are simplified results of comparison on my laptop:

Benchmark (size 1000) Parsec 3.1.9 Megaparsec 5.0.0
string/match 74.59 μs 48.65 μs
string/nomatch_early 374.0 ns 59.06 ns
string/nomatch_late 69.80 μs 31.51 μs
try-string/match 76.60 μs 48.15 μs
try-string/nomatch_early 383.5 ns 61.93 ns
try-string/nomatch_late 70.88 μs 31.72 μs
lookahead-string/match 76.78 μs 47.07 μs
lookahead-string/nomatch_early 389.5 ns 60.13 ns
lookahead-string/nomatch_late 77.80 μs 29.73 μs
notfollowedby-string/match 79.46 μs 48.54 μs
notfollowedby-string/nomatch_early 418.8 ns 47.17 ns
notfollowedby-string/nomatch_late 79.46 μs 30.56 μs
manual-string 328.8 μs 57.15 μs
choice/match 355.5 μs 232.6 μs
choice/nomatch 523.8 μs 289.2 μs
count 260.8 μs 48.85 μs
sepBy1 357.8 μs 100.6 μs
manyTill 464.7 μs 139.7 μs

tokens (that influences performance of string) is coded differently than in Parsec because I wanted to have a bit different error messages that show not just current mismatched token, but an entire sequence parsed up to first mismatch and an entire sequence that is expected. tokens also backtracks automatically in Megaparsec 4.3+ but this has no impact on performance. choice is not considerably faster because in Megaparsec it does a lot more than in Parsec since we need to be able to get parser state on errors for some features to work.

All in all, given flexibility and features of our fork, the results are not shameful. To be honest, Parsec can be made faster than Megaparsec with some effort, but that effort has yet to be applied. I would also appreciate if someone could add Megaparsec into other parsing benchamarks and let me know about results.

To be continued

These are the most important but not all changes and improvements in Megaparsec 5, please see the change log for the complete list.