Megaparsec 8

Published on November 7, 2019

Another year has passed and it is time again for a new major version of Megaparsec. What is different this time though, is that this is the least disruptive major release ever. In fact, I think most users will not need to do anything at all to upgrade.

There are reasons for this:

There are not so many issues opened and bugs reported. I think it has to do with the fact that Megaparsec “just works” these days and mostly in a satisfying manner.
The library is widely used now. At the time of this writing there are 146 packages on Hackage that depend on Megaparsec directly. New exciting libraries that choose to build on top of Megaparsec appear. Projects such as Idris and Dhall use Megaparsec to solve their parsing issues.

All this tells us that the library got older and more mature, so perhaps let’s not break something that is working. That said, there is always room for improvement.

Quality assurance with Nix

Before I started working on version 8, I decided to go for better quality assurance using Nix. Understanding how many projects now depend on Megaparsec and remembering my successful experiment with using Nix to find bugs in Ormolu, I decided to try to use Nix to check for breakage, performance changes, and bugs by using dependent packages.

I documented the result in the HACKING.md file which is now in Megaparsec’s repo.

Apart from the development shell, the Nix expression provides the following groups of targets:

base, which is a set of closely related packages such as parser-combinators and hspec-megaparsec together with their tests. By running nix-build -A base --no-out-link developer can make sure that all of this still builds and still passes the tests.
deps, which is a set of selected dependencies that I wanted to check for build breakage and test suite failures.
benches, which is a collection of benchmarks. This includes Megaparsec’s micro-benchmarks as well as a few packages that show how the library performs on realistic tasks.

Each of these sets can be accessed to “zoom in” on a particular package or benchmark. For example, I can run nix-build -A benches.parsers-bench to check benchmarks in parsers-bench. In short, I found that most packages will still work with the new changes and those that won’t are easy to fix. In fact, I had to patch the failing packages to continue using the system, so the patches for upgrading are available.

I found no logic or performance regressions.

Control over parse error locations

Let’s talk about the new features. The primitives failure and fancyFailure have been replaced by parseError:

parseError :: MonadParsec e s m => ParseError s e -> m a

-- now 'failure' and 'fancyFailure' are ordinary functions:

failure
  :: MonadParsec e s m
  => Maybe (ErrorItem (Token s)) -- ^ Unexpected item (if any)
  -> Set (ErrorItem (Token s)) -- ^ Expected items
  -> m a
failure us ps = do
  o <- getOffset
  parseError (TrivialError o us ps)

fancyFailure
  :: MonadParsec e s m
  => Set (ErrorFancy e) -- ^ Fancy error components
  -> m a
fancyFailure xs = do
  o <- getOffset
  parseError (FancyError o xs)

This is not about reducing the number of primitives though (which is also nice). The main feature that parseError enables is that we can report parse errors with arbitrary offsets, not necessarily with current offset from the parser’s state. This is important when you want to make a parse error to point to a particular part of input even if you already moved past that point. In the past this had to be accomplished by first obtaining the correct offset via getOffset and then by setting the offset with setOffset just before reporting the parse error. Not only this is ugly, it is also error-prone: you could forget to restore the correct offset. Here is a real example from mmark:

  o' <- getOffset
  setOffset o
  (void . hidden . string) "[]"
  -- ↑ if this fails, we want this to be reported at the offset 'o'
  setOffset (o' + 2)

I will not explain the full context here, but it suffices to say that this code had a bug for some time because I forgot to account for offset increment after parsing "[]" (the + 2 part). Now the same thing can be expressed nicer:

  region (setErrorOffset o) $
    (void . hidden . string) "[]"

-- N.B.

region :: MonadParsec e s m
  => (ParseError s e -> ParseError s e)
     -- ^ How to process 'ParseError's
  -> m a
     -- ^ The “region” that the processing applies to
  -> m a

region used to do the same getOffset/setOffset hacks and as a side effect it could change current offset if the function that updates parse errors changed it. Now region can use parseError and do away with its old hacks:

region f m = do
  r <- observing m
  case r of
    Left err -> parseError (f err)
    Right x -> return x

Nice.

Better story for multi-error parsers

From the very beginning of the project we were moving slowly in the direction of supporting multi-error parsing. In version 7 we even started to return ParseErrorBundle instead of old familiar ParseError. Everything was in place for multi-error revolution, except that there was no official way to report more than one parse error!

One prerequisite for having a multi-error parser is that it should be possible to skip over a problematic part of input and resume parsing from a position that is known to be good. This part is accomplished by using the withRecovery primitive (available beginning from Megaparsec 4.4.0):

-- | @'withRecovery' r p@ allows continue parsing even if parser @p@
-- fails. In this case @r@ is called with the actual 'ParseError' as its
-- argument. Typical usage is to return a value signifying failure to
-- parse this particular object and to consume some part of the input up
-- to the point where the next object starts.
--
-- Note that if @r@ fails, original error message is reported as if
-- without 'withRecovery'. In no way recovering parser @r@ can influence
-- error messages.

withRecovery
  :: (ParseError s e -> m a) -- ^ How to recover from failure
  -> m a             -- ^ Original parser
  -> m a             -- ^ Parser that can recover from failures

Before Megaparsec 8 users had to pick the type a to be a sum type including the possibilities for success and failure. For example, it could be Either (ParseError s e) Result. The parse errors had to be collected and later manually added to the ParseErrorBundle before displaying. Needless to say, all of this was an example of advanced usage that was not user friendly.

Megaparsec 8 adds support for delayed parse errors:

-- | Register a 'ParseError' for later reporting. This action does not end
-- parsing and has no effect except for adding the given 'ParseError' to the
-- collection of “delayed” 'ParseError's which will be taken into
-- consideration at the end of parsing. Only if this collection is empty
-- parser will succeed. This is the main way to report several parse errors
-- at once.

registerParseError :: MonadParsec e s m => ParseError s e -> m ()

-- | Like 'failure', but for delayed 'ParseError's.

registerFailure
  :: MonadParsec e s m
  => Maybe (ErrorItem (Token s)) -- ^ Unexpected item (if any)
  -> Set (ErrorItem (Token s)) -- ^ Expected items
  -> m ()

-- | Like 'fancyFailure', but for delayed 'ParseError's.

registerFancyFailure
  :: MonadParsec e s m
  => Set (ErrorFancy e) -- ^ Fancy error components
  -> m ()

These errors can be registered in the error-processing callback of withRecovery making the resulting type Maybe Result. This takes care of including the delayed errors in the final ParseErrorBundle as well as making the parser fail in the end if the collection of delayed errors in not empty.

With all this, I hope that the practice of writing multi-error parsers will become more common among the users.

Other

As always, for the full list of changes see the changelog.
I updated all texts including the official tutorial to be compatible with version 8. I even extended it to include sections teaching how to use the new features.
Satellite packages such as hspec-megaparsec have been updated and now work with version 8.

Happy parsing!