blob: 2606c06ce09f7af3dffad007af8b8c989fab3642 (
plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
|
---
title: On the Design of a Parser
published: 2016-01-12
tags: Thermoprint
---
The concrete application we’ll be walking through is a naive parser for [bbcode](https://en.wikipedia.org/wiki/BBCode)
-- more specifically the contents of the directory `bbcode` in the
[git repo](git://git.yggdrasil.li/thermoprint#rewrite).
In a manner consistent with designing software as
[compositions of simple morphisms](https://en.wikipedia.org/wiki/Tacit_programming) we start by determining the type of
our solution (as illustrated by the following mockup):
~~~ {.haskell}
-- | Our target structure -- a rose tree with an explicit terminal constructor
data DomTree = Element Text (Map Text Text) [DomTree]
| Content Text
deriving (Show, Eq)
bbcode :: Text -> Maybe DomTree
-- ^ Parse BBCode
~~~
Writing a parser capable of dealing with `Text` directly from scratch would be unnecessarily abstruse, we’ll be using
the [attoparsec](https://hackage.haskell.org/package/attoparsec/docs/Data-Attoparsec-Text.html) family of parser
combinators instead.
We reproduce an incomplete version of the lexer below (it’s missing tag attributes and self-closing tags).
We introduce `escapedText`, a helper function for extracting text until we reach one of a set of delimiting characters
(exclusive).
While doing this we also parse any delimiting character iff it's prefixed with an escape character (we use `\\`) -- the
escape character itself needs only be escaped if encountered directly before one of the delimiting characters.
~~~ {.haskell}
data Token = BBOpen Text -- ^ "[open]"
| BBClose Text -- ^ "[/close]"
| BBStr Text -- ^ "text"
token :: Parser [Token]
token = BBClose <$ "[/" <*> escapedText' [']'] <* "]"
<|> BBOpen <$ "[" <*> escapedText' [']'] <* "]"
<|> BBStr <$> escapedText ['[']
escapedText' :: [Char] -> Parser Text
escapedText' = option "" . escapedText
escapedText :: [Char] -> Parser Text
escapedText [] = takeText -- No delimiting characters -- parse all remaining input
escapedText cs = recurse $ choice [ takeWhile1 (not . special) -- a series of characters we don't treat as special
, escapeSeq -- an escaped delimiter
, escapeChar' -- the escape character
]
where
escapeChar = '\\'
special = inClass $ escapeChar : cs
escapeChar' = string $ T.singleton escapeChar
escapeSeq = escapeChar' *> (T.singleton <$> satisfy special) -- escape character followed by a special character (which encludes the escape character)
recurse p = mappend <$> p <*> escapedText' cs -- parse a prefix and optionally append another chunk of escapedText
runTokenizer :: Text -> Maybe [Token]
runTokenizer = either (const Nothing) Just . parseOnly (many token <* endOfInput)
~~~
We have now reduced the Problem to `[Token] -> DomTree`.
We quickly see that the structure of the problem is that of a
[fold](https://hackage.haskell.org/package/base/docs/Data-Foldable.html).
Having realised this we require a function of type `Token -> DomTree -> DomTree` to recursively build up our target
structure.
In general we’ll want to not only keep track of the `DomTree` during recursion but also maintain a reference to the
position at which we’ll be inserting new tokens.
This kind of problem is well understood and solved idiomatically by using a
[zipper](https://en.wikipedia.org/wiki/Zipper_(data_structure))
([a cursory introduction](http://learnyouahaskell.com/zippers)).
Writing zippers tends to be tedious. We’ll therefore introduce an
[additional intermediate structure](https://hackage.haskell.org/package/containers/docs/Data-Tree.html) for which an
[implementation](https://hackage.haskell.org/package/rosezipper) is available readily.
The morphism from this new structure (`Forest BBLabel`) to our `DomTree` will be almost trivial.
~~~ {.haskell}
import Data.Tree.Zipper (TreePos, Empty, Full)
import qualified Data.Tree.Zipper as Z
data BBLabel = BBTag Text
| BBPlain Text
rose :: [BBToken] -> Maybe (Forest BBLabel)
rose = Z.toForest <$> foldM (flip rose') (Z.fromForest [])
rose' :: BBToken -> TreePos Empty BBLabel -> Maybe (TreePos Empty BBLabel)
rose' (BBStr t) = return . Z.nextSpace . Z.insert (Node (BBPlain t) []) -- insert a node with no children and move one step to the right in the forest we’re currently viewing
rose' (BBOpen t) = return . Z.children . Z.insert (Node (BBTag t) []) -- insert the node and move into position to insert it's first child
rose' (BBClose t) = close t -- haskell complains if multiple equations for the same function have a differing number of arguments, therefore: 'close'
where
close :: Text -> TreePos Empty BBLabel -> Maybe (TreePos Empty BBLabel)
close tag pos = do
pos' <- Z.parent pos -- fail if we're trying to close a tag that does not have a parent (this indicates imbalanced tags)
let
pTag = (\(BBTag t) -> t) $ Z.label pos' -- yes, this will fail unceremoniously if the parent is not a tag, this poses no problem since we're constructing the structure ourselves. The proof that this failure mode does not occur is left as an exercise for the reader.
guard (pTag == tag) -- The structure shows that this mode of failure (opening tags content does not match the closing tags) is not logically required -- it only serves as a *notification* to the user
return $ Z.nextSpace pos' -- move one level up and to point at the next sibling of the parent
~~~
All that is left to do now is present our final morphism:
~~~ {.haskell}
dom :: Forest BBLabel -> [DomTree]
dom = map dom'
where
dom' (Node (BBPlain t) []) = Content t
dom' (Node (BBTag t) ts = Element t $ map dom' ts
~~~
|