Short ByteString and Text
haskellPublished on October 30, 2017
The ByteString
and Text
(strict and lazy) types have a great adoption in
the Haskell ecosystem. Usually when one has binary data, or data that cannot
contain characters outside of ASCII range, ByteString
is preferred. Text
on the other hand allows to work with arbitrary streams of characters,
including Unicode characters, similarly to String
, but in a more efficient
way.
The both libraries
(bytestring
and
text
) come with collections of
operations that are stream fuse-able and sufficient for solving virtually
any task. However, it happens so that there are other under-appreciated
types and libraries that complement ByteString
and Text
. These may be a
better choice in some circumstances, but I believe many Haskellers are not
aware of them. This post aims to rectify that.
ShortByteString
ShortByteString
lives in the same package as ByteString
—bytestring
.
Even without looking at the documentation this fact suggests that there is
something not-so-perfect about ByteString
, otherwise there would be no
additional different type.
Let’s see what’s wrong. Here is the definition of strict ByteString
:
data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
{-# UNPACK #-} !Int -- offset
{-# UNPACK #-} !Int -- length
We can see that the payload is a pointer to an array of bytes (Word8
in
Haskell). “Foreign” in ForeignPtr
means that the data the pointer points
to is managed not by Haskell’s GC, but possibly by some external means.
ForeignPtr
may have a finalizer associated with it. This finalizer is run
when Haskell run time detects that there are no more references to the the
pointer within Haskell heap and stack. Seems about right, it’s not that easy
to cause memory leaks by forgetting to free a ForeignPtr
. We should be OK
then? The problem is that Haskell GC cannot move around things ForeignPtr
points to (they are in what is called pinned memory), because foreign code
would become… err very fragile if it could, and also because moving around a
long ByteString
is not very efficient. This means that once a ByteString
is allocated, its payload stays at the same address. When many short
ByteStrings
are allocated/freed more or less intensively the impossibility
of moving them around to optimize memory usage contributes to memory
fragmentation.
So if you work with many short ByteString
s, you may start to obverse
something like a space leak when memory usage becomes much higher than the
amount of data you actually have.
ShortByteString
is different. It uses ByteArray#
as payload:
data ShortByteString = SBS ByteArray#
Let’s refresh what ByteArray#
is:
A
ByteArray#
is a just a region of raw memory in the garbage-collected heap, which is not scanned for pointers. It carries its own size (in bytes).
Not only we do not have to worry about memory fragmentation, but
ByteArray#
takes care of storing its own size, so the representation is
also more compact than that of ByteString
.
If we look at the
Data.ByteString.Short
module we however will find that ShortByteString
does not support as many
operations as ByteString
. Here goes a quote about where to use
ShortByteString
:
It is suitable for use as an internal representation for code that needs to keep many short strings in memory, but it should not be used as an interchange type. That is, it should not generally be used in public APIs. The
ByteString
type is usually more suitable for use in interfaces; it is more flexible and it supports a wide range of operations.
It’s also worth remembering that conversion between ShortByteString
and
ByteString
is an O(n) operation involving copying of payload.
ShortText
The ShortText
type provided by the new
text-short
package to
Text
is roughly what ShortByteString
to ByteString
—a more memory
efficient, but also quite limited in functionality variation (check the
Data.Text.Short
module for collection of supported operations on ShortText
values).
Here is the definition of strict Text
:
data Text = Text
{-# UNPACK #-} !A.Array -- payload (Word16 elements)
{-# UNPACK #-} !Int -- offset (units of Word16, not Char)
{-# UNPACK #-} !Int -- length (units of Word16, not Char)
And the definition of ShortText
:
newtype ShortText = ShortText ShortByteString
-- essentially has a ByteArray# inside
ShortText
is different from Text
in the following:
-
It uses UTF-8 internally, while
Text
uses UTF-16. This allowsShortText
to have a more compact representation in most cases, see for example this link about UTF-8 vs UTF-16 compactness. -
It does not support slicing because it does not store start/end offsets saving several bytes this way as well (may be a big deal if you have many short
ShortText
values).
So the use case for ShortText
is when you have to store a lot of short
text values in memory. This is the case e.g. with Cabal which needs to store
thousands of package names. Cabal 2.0 uses ShortText
under the hood, see
the heap
profiles
for example.
The text-utf8
project
Using UTF-8 instead of UTF-16 is a good idea not only for short text values. Indeed, there are several reasons (other than saving memory) to prefer UTF-8:
-
Encoding and decoding to UTF-8 becomes very cheap. I think most people will agree that UTF-8 is probably the most popular encoding right now, so the ability to covert from/to UTF-8 efficiently is a nice thing. Right now for encoding we need to do the conversion from UTF-16 to UTF-8 in vanilla
text
. Decoding UTF-8 values would just amount to checking if a given sequence of bytes is a valid UTF-8 string. -
It becomes possible to write efficient parsers over UTF-8 when syntactic-relevant code-points are in the ASCII plane.
There is a project called text-utf8
which
aims to take the familiar text
library and make it use UTF-8 instead of
UTF-16 under the hood. So far the results are:
-
Improved memory consumption.
-
It’s on par with UTF-16-based
text
in terms of performance, but the developers behindtext-utf8
think that they make it faster than originaltext
as well.
From a talk with Herbert Valerio Riedel who works on that project I learned
that text-utf8
will be released in foreseeable future and if it’s
successful the changes will be incorporated back into text
.
Conclusion
The ShortByteString
and ShortText
should help scaling applications that
tend to store large quantities of short ByteString
s/Text
values. As easy
as it is, just replacing a type may actually help to reduce memory usage
considerably.
I’d like to thank Herbert Valerio Riedel for reading a draft of this post,
telling me almost everything I have written here, and for developing
short-text
and text-utf8
projects :-D