How to introduce custom error messages

Published on August 10, 2016, last updated September 27, 2018

One of the advantages of Megaparsec is the ability to use your own data types as part of the data that is returned on parse failure. This makes it possible to tailor error messages to your domain of interest in a way that is quite unique to this library. All data that constitutes an error message is typed, so it’s easy to inspect and manipulate.

The goal

In this tutorial we will walk through the creation of a parser found in an existing library called cassava-megaparsec, which is an alternative parser for the popular cassava library for parsing CSV data. The error messages of the default parser are not very user-friendly, so I was asked to design a better one using Megaparsec.

In addition to the standard error messages (“expected” and “unexpected” tokens), the library can report problems that arise when using the methods of the FromRecord and FromNamedRecord type classes, which describe how to transform a collection of ByteStrings into a value of a type that is an instance of those classes. This conversion can fail, and we would like to use a special data constructor in such cases.

The complete source code can be found in this GitHub repository.

Language extensions and imports

We will need some language extensions and imports. Here is the top of Data.Csv.Parser.Megaparsec almost literally:

{-# LANGUAGE BangPatterns       #-}
{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE RecordWildCards    #-}

module Data.Csv.Parser.Megaparsec
  ( ConversionError (..)
  , decode
  , decodeWith
  , decodeByName
  , decodeByNameWith )
where

import Control.Monad
import Data.ByteString (ByteString)
import Data.Csv hiding
  ( Parser
  , record
  , namedRecord
  , header
  , toNamedRecord
  , decode
  , decodeWith
  , decodeByName
  , decodeByNameWith )
import Data.Data
import Data.Vector (Vector)
import Data.Word (Word8)
import Text.Megaparsec
import Text.Megaparsec.Byte
import qualified Data.ByteString      as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Csv             as C
import qualified Data.HashMap.Strict  as H
import qualified Data.Set             as S
import qualified Data.Vector          as V

Note that there are two imports of Data.Csv: one for common things like names of type classes that I want to keep unprefixed, and a second one for the rest (qualified as C).

What is ParseError actually?

To start with custom error messages we should take a look at how parse errors are represented in Megaparsec 7.

At the top level we have ParseErrorBundle which is just a way to pass around a non-empty collection of ParseErrors together with some state that is necessary to pretty-print them (it’s of no interest to us here).

-- | A non-empty collection of 'ParseError's equipped with 'PosState' that
-- allows to pretty-print the errors efficiently and correctly.

data ParseErrorBundle s e = ParseErrorBundle
  { bundleErrors :: NonEmpty (ParseError s e)
    -- ^ A collection of 'ParseError's that is sorted by parse error offsets
  , bundlePosState :: PosState s
    -- ^ State that is used for line\/column calculation
  }

The main type for error messages is ParseError which is defined like this:

-- | @'ParseError' s e@ represents a parse error parametrized over the
-- stream type @s@ and the custom data @e@.
--
-- 'Semigroup' and 'Monoid' instances of the data type allow to merge parse
-- errors from different branches of parsing. When merging two
-- 'ParseError's, the longest match is preferred; if positions are the same,
-- custom data sets and collections of message items are combined. Note that
-- fancy errors take precedence over trivial errors in merging.

data ParseError s e
  = TrivialError Int (Maybe (ErrorItem (Token s))) (Set (ErrorItem (Token s)))
    -- ^ Trivial errors, generated by Megaparsec's machinery. The data
    -- constructor includes the offset of error, unexpected token (if any),
    -- and expected tokens.
  | FancyError Int (Set (ErrorFancy e))
    -- ^ Fancy, custom errors.
  deriving (Typeable, Generic)

We can see immediately that there are two different possibilities for parse errors:

  • TrivialError, which is usually generated by Megaparsec’s machinery. Most parse errors are of this kind.

  • FancyError, which is a collection of ErrorFancy items.

-- | Additional error data, extendable by user. When no custom data is
-- necessary, the type is typically indexed by 'Void' to “cancel” the
-- 'ErrorCustom' constructor.
--
-- @since 6.0.0

data ErrorFancy e
  = ErrorFail String
    -- ^ 'fail' has been used in parser monad
  | ErrorIndentation Ordering Pos Pos
    -- ^ Incorrect indentation error: desired ordering between reference
    -- level and actual level, reference indentation level, actual
    -- indentation level
  | ErrorCustom e
    -- ^ Custom error data, can be conveniently disabled by indexing
    -- 'ErrorFancy' by 'Void'
  deriving (Show, Read, Eq, Ord, Data, Typeable, Generic, Functor)

ErrorFail and ErrorIndentation are required by the library. The ErrorCustom constructor works like an extension slot that allows us to insert arbitrary data. When we don’t need any custom data, we can “multiply the constructor by zero” by parametrizing ErrorFancy by Void. Since Void is not inhabited by any value other than bottom, the ErrorCustom constructor cannot be used.
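
To make the Void trick concrete, here is a minimal sketch, assuming Megaparsec 7 or later. The helper name describeFancy is hypothetical and not part of the library. It renders the fancy items of a parser that carries no custom data and uses absurd from Data.Void to discharge the ErrorCustom case:

```haskell
import Data.Void (Void, absurd)
import Text.Megaparsec.Error (ErrorFancy (..))
import Text.Megaparsec.Pos (unPos)

-- | Hypothetical helper (not part of Megaparsec): describe a fancy error
-- item of a parser that has no custom error component.
describeFancy :: ErrorFancy Void -> String
describeFancy (ErrorFail msg) = "fail: " ++ msg
describeFancy (ErrorIndentation ord ref actual) =
  "indentation: wanted " ++ show ord ++ " relative to level "
    ++ show (unPos ref) ++ ", actual level " ++ show (unPos actual)
-- The payload has type 'Void', so no defined value can ever reach this
-- case; 'absurd' states that explicitly for the type checker.
describeFancy (ErrorCustom v) = absurd v
```

With this definition, describeFancy (ErrorFail "boom") evaluates to "fail: boom", and the last equation exists only to keep the pattern match total.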

Defining a custom error component

The custom error component will store conversion errors when a vector of ByteStrings cannot be converted into some type:

-- | Custom error component for CSV parsing. It allows typed reporting of
-- conversion errors.

data ConversionError = ConversionError String
  deriving (Eq, Data, Typeable, Ord, Read, Show)

instance ShowErrorComponent ConversionError where
  showErrorComponent (ConversionError msg) =
    "conversion error: " ++ msg

ConversionError is just a wrapper around a String that conversion functions of Cassava return. We could do better if Cassava provided typed error values, but String is all we have, so let’s work with it.

The ShowErrorComponent instance defines how to render the custom error component.

Another handy definition we need is the Parser type synonym:

-- | Parser type that uses custom error component 'ConversionError'.

type Parser = Parsec ConversionError BL.ByteString

Since it’s recommended to work with concrete types in your parser to get maximum efficiency out of the library, we’ll be writing parsers of this Parser type.

Top level API and helpers

Let’s start from the top and take a look at the top-level, public API:

-- | Deserialize CSV records from a lazy 'BL.ByteString'. If this fails due
-- to incomplete or invalid input, 'Left' is returned. Equivalent to
-- 'decodeWith' 'defaultDecodeOptions'.

decode :: FromRecord a
  => HasHeader
     -- ^ Whether the data contains a header that should be skipped
  -> FilePath
     -- ^ File name (only for displaying in parse error messages, use empty
     -- string if you have none)
  -> BL.ByteString
     -- ^ CSV data
  -> Either (ParseErrorBundle BL.ByteString ConversionError) (Vector a)
decode = decodeWith defaultDecodeOptions

-- | Like 'decode', but lets you customize how the CSV data is parsed.

decodeWith :: FromRecord a
  => DecodeOptions
     -- ^ Decoding options
  -> HasHeader
     -- ^ Whether the data contains a header that should be skipped
  -> FilePath
     -- ^ File name (only for displaying in parse error messages, use empty
     -- string if you have none)
  -> BL.ByteString
     -- ^ CSV data
  -> Either (ParseErrorBundle BL.ByteString ConversionError) (Vector a)
decodeWith = decodeWithC csv

-- | Deserialize CSV records from a lazy 'BL.ByteString'. If this fails due
-- to incomplete or invalid input, 'Left' is returned. The data is assumed
-- to be preceded by a header. Equivalent to 'decodeByNameWith'
-- 'defaultDecodeOptions'.

decodeByName :: FromNamedRecord a
  => FilePath
     -- ^ File name (only for displaying in parse error messages, use empty
     -- string if you have none)
  -> BL.ByteString
     -- ^ CSV data
  -> Either (ParseErrorBundle BL.ByteString ConversionError) (Header, Vector a)
decodeByName = decodeByNameWith defaultDecodeOptions

-- | Like 'decodeByName', but lets you customize how the CSV data is parsed.

decodeByNameWith :: FromNamedRecord a
  => DecodeOptions
     -- ^ Decoding options
  -> FilePath
     -- ^ File name (only for displaying in parse error messages, use empty
     -- string if you have none)
  -> BL.ByteString
     -- ^ CSV data
  -> Either (ParseErrorBundle BL.ByteString ConversionError) (Header, Vector a)
decodeByNameWith opts = parse (csvWithHeader opts)

-- | Decode CSV data using the provided parser, skipping a leading header if
-- necessary.

decodeWithC
  :: (DecodeOptions -> Parser a)
     -- ^ Parsing function parametrized by 'DecodeOptions'
  -> DecodeOptions
     -- ^ Decoding options
  -> HasHeader
     -- ^ Whether to expect a header in the input
  -> FilePath
     -- ^ File name (only for displaying in parse error messages, use empty
     -- string if you have none)
  -> BL.ByteString
     -- ^ CSV data
  -> Either (ParseErrorBundle BL.ByteString ConversionError) a
decodeWithC p opts@DecodeOptions {..} hasHeader = parse parser
  where
    parser = case hasHeader of
      HasHeader -> header decDelimiter *> p opts
      NoHeader  -> p opts

Nothing interesting here: just a bunch of wrappers that boil down to running the parser, either skipping the CSV header first or not.

The parser

Let’s start with parsing a field. A field in a CSV file can be either escaped or unescaped:

-- | Parse a field. The field may be in either the escaped or non-escaped
-- format. The returned value is unescaped.

field :: Word8 -> Parser Field
field del = label "field" (escapedField <|> unescapedField del)

An escaped field is written inside straight quotes "" and can contain any characters at all, but the quote sign itself " must be escaped by repeating it twice:

-- | Parse an escaped field.

escapedField :: Parser ByteString
escapedField =
  B.pack <$!> between (char 34) (char 34) (many $ normalChar <|> escapedDq)
  where
    normalChar = anySingleBut 34 <?> "unescaped character"
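    -- 'BL.pack [34, 34]' spells out the two quote bytes, so the definition
    -- does not depend on the OverloadedStrings extension.
    escapedDq  = label "escaped double-quote" (34 <$ string (BL.pack [34, 34]))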

Simple so far. unescapedField is even simpler: it can contain any character except for the quote sign ", the delimiter sign, and newline characters:

-- | Parse an unescaped field.

unescapedField :: Word8 -> Parser ByteString
unescapedField del = BL.toStrict <$> takeWhileP (Just "unescaped character") f
  where
    f x = x /= del && x /= 34 && x /= 10 && x /= 13
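
To see how the two kinds of fields behave on concrete input, here is a self-contained sketch that repeats the definitions above with a simplified parser type (Void instead of the custom error component). The type synonym P and the helper parseField are hypothetical names introduced only for this illustration; Megaparsec 7 or later is assumed:

```haskell
import Data.Void (Void)
import Data.Word (Word8)
import Text.Megaparsec
import Text.Megaparsec.Byte (char, string)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- Simplified parser type for this sketch: no custom error component.
type P = Parsec Void BL.ByteString

escapedField :: P B.ByteString
escapedField =
  B.pack <$> between (char 34) (char 34) (many (normalChar <|> escapedDq))
  where
    normalChar = anySingleBut 34 <?> "unescaped character"
    escapedDq  = label "escaped double-quote" (34 <$ string (BL.pack [34, 34]))

unescapedField :: Word8 -> P B.ByteString
unescapedField del = BL.toStrict <$> takeWhileP (Just "unescaped character") f
  where
    f x = x /= del && x /= 34 && x /= 10 && x /= 13

field :: Word8 -> P B.ByteString
field del = label "field" (escapedField <|> unescapedField del)

-- | Hypothetical helper: run 'field' with a comma (byte 44) as delimiter.
parseField
  :: BL.ByteString
  -> Either (ParseErrorBundle BL.ByteString Void) B.ByteString
parseField = parse (field 44) ""
```

Given the bytes of "a""b" (quotes included), parseField returns the three bytes a"b, because the doubled quote collapses into one. Given foo,1 it returns foo and stops at the delimiter. An unterminated "foo yields a Left value.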

To parse a record we have to parse a non-empty collection of fields separated by delimiter characters (provided by DecodeOptions). Then we convert it to Vector ByteString, because that’s what Cassava’s conversion functions expect:

-- | Parse a record, not including the terminating line separator. The
-- terminating line separator is not included as the last record in a CSV
-- file is allowed to not have a terminating line separator.

record
  :: Word8             -- ^ Field delimiter
  -> (Record -> C.Parser a)
     -- ^ How to “parse” record to get the data of interest
  -> Parser a
record del f = do
  notFollowedBy eof -- to prevent reading empty line at the end of file
  r <- V.fromList <$!> (sepBy1 (field del) (void $ char del) <?> "record")
  case C.runParser (f r) of
    Left msg -> customFailure (ConversionError msg)
    Right x  -> return x

(<$!>) works just like the familiar (<$>) operator, but applies V.fromList strictly. Now that we have a vector of ByteStrings, we can try to convert it: on success we just return the result, on failure we fail using the customFailure helper.

customFailure is defined in terms of a more general primitive called fancyFailure:

-- In Megaparsec's own source, E is Data.Set imported qualified.
customFailure :: MonadParsec e s m => e -> m a
customFailure = fancyFailure . E.singleton . ErrorCustom

-- where

fancyFailure
  :: MonadParsec e s m
  => Set (ErrorFancy e) -- ^ Fancy error components
  -> m a

-- and yes, there is also

failure
  :: MonadParsec e s m
  => Maybe (ErrorItem (Token s)) -- ^ Unexpected item (if any)
  -> Set (ErrorItem (Token s)) -- ^ Expected items
  -> m a

-- which you probably won't ever need
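
As a small illustration of customFailure outside the CSV setting, the following sketch defines a parser over String that rejects numbers above 100 with a typed error, and then recovers that typed value from the resulting bundle. The names RangeError, percentage, and customErrors are hypothetical and exist only in this example; Megaparsec 7 or later is assumed:

```haskell
import Data.Foldable (toList)
import Text.Megaparsec
import Text.Megaparsec.Char (digitChar)
import qualified Data.Set as Set

-- | Hypothetical custom error component for this sketch.
newtype RangeError = RangeError Int
  deriving (Eq, Ord, Show)

instance ShowErrorComponent RangeError where
  showErrorComponent (RangeError n) = "number out of range: " ++ show n

type P = Parsec RangeError String

-- | Parse a decimal number and reject values above 100 with a typed error.
percentage :: P Int
percentage = do
  n <- read <$> some digitChar
  if n > 100
    then customFailure (RangeError n)
    else pure n

-- | Hypothetical helper: collect the custom components stored in a bundle,
-- ignoring trivial errors and the other kinds of fancy items.
customErrors :: ParseErrorBundle s e -> [e]
customErrors bundle =
  [ e
  | FancyError _ items <- toList (bundleErrors bundle)
  , ErrorCustom e <- Set.toList items
  ]
```

Here parse percentage "" "42" succeeds with 42, while for the input "250" the expression either customErrors (const []) (parse percentage "" "250") gives back [RangeError 250], so the caller gets the offending number as an Int rather than as text inside a message.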

Back to our task. The library should also handle CSV files with headers:

-- | Parse a CSV file that includes a header.

csvWithHeader :: FromNamedRecord a
  => DecodeOptions     -- ^ Decoding options
  -> Parser (Header, Vector a)
     -- ^ The parser that parses a collection of named records
csvWithHeader DecodeOptions {..} = do
  !hdr <- header decDelimiter
  let f = parseNamedRecord . toNamedRecord hdr
  xs   <- sepEndBy1 (record decDelimiter f) eol
  eof
  return $ let !v = V.fromList xs in (hdr, v)

-- | Convert a 'Record' to a 'NamedRecord' by attaching column names. The
-- 'Header' and 'Record' must be of the same length.

toNamedRecord :: Header -> Record -> NamedRecord
toNamedRecord hdr v = H.fromList . V.toList $ V.zip hdr v

-- | Parse a header, including the terminating line separator.

header :: Word8 -> Parser Header
header del = V.fromList <$!> p <* eol
  where
    p = sepBy1 (name del) (void $ char del) <?> "file header"

-- | Parse a header name. Header names have the same format as regular
-- 'field's.

name :: Word8 -> Parser Name
name del = field del <?> "name in header"

The code should be self-explanatory by now. The only thing that remains is to parse a collection of records:

-- | Parse a CSV file that does not include a header.

csv :: FromRecord a
  => DecodeOptions     -- ^ Decoding options
  -> Parser (Vector a) -- ^ The parser that parses a collection of records
csv !DecodeOptions {..} = do
  xs <- sepEndBy1 (record decDelimiter parseRecord) eol
  eof
  return $! V.fromList xs

Trying it out

The custom error messages integrate seamlessly with the rest of the parser. Let’s parse a CSV file as a collection of (String, Maybe Int, Double) items. If I try to parse "foo, I get the usual Megaparsec error message with “unexpected” and “expected” parts:

1:5:
  |
1 | "foo
  |     ^
unexpected end of input
expecting '"', escaped double-quote, or unescaped character

However, when that phase of parsing is passed successfully, as with foo,12,boo input, the conversion is attempted and its results are reported:

1:11:
  |
1 | foo,12,boo
  |           ^
conversion error: expected Double, got "boo" (Failed reading: takeWhile1)

(I wouldn’t mind if (Failed reading: takeWhile1) part were omitted, but that’s what Cassava’s conversion methods are producing.)
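
For reference, output like the above can be obtained with a few lines of driver code. This is only a sketch: it assumes the cassava-megaparsec package is installed and exports decode and ConversionError as described in this tutorial, and the names Row, decodeRows, and report are hypothetical helpers, not part of the library:

```haskell
import Data.Csv (HasHeader (NoHeader))
import Data.Csv.Parser.Megaparsec (ConversionError, decode)
import Text.Megaparsec (ParseErrorBundle, errorBundlePretty)
import qualified Data.ByteString.Lazy.Char8 as BL8
import qualified Data.Vector as V

-- | The row type used in the examples of this section.
type Row = (String, Maybe Int, Double)

-- | Hypothetical helper: decode header-less CSV text into a list of rows.
-- The empty file name is only used when rendering error messages.
decodeRows
  :: String
  -> Either (ParseErrorBundle BL8.ByteString ConversionError) [Row]
decodeRows = fmap V.toList . decode NoHeader "" . BL8.pack

-- | Hypothetical helper: pretty-print the error bundle on failure, or show
-- the decoded rows on success.
report :: String -> String
report input = case decodeRows input of
  Left bundle -> errorBundlePretty bundle
  Right rows  -> show rows
```

Calling putStr (report "foo,12,boo") should then print a conversion error in the style shown above, with the exact wording depending on the installed Cassava version, while report "foo,12,3.5" shows the decoded row.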

Conclusion

I hope this walk-through has demonstrated that it’s quite easy to add arbitrary data to Megaparsec error messages. This way it’s possible to extract data from a failing parser while keeping track of things in a type-safe way, which is one thing we should always care about when writing Haskell programs.