Object-oriented lexical encoding of multiword expressions: Short and sweet

Multiword expressions (MWEs) exhibit both regular and idiosyncratic properties. Their idiosyncrasy requires lexical encoding in parallel with their component words. Their (at times intricate) regularity, on the other hand, calls for means of flexible factorization to avoid redundant descriptions of shared properties. However, so far, non-redundant general-purpose lexical encoding of MWEs has not received a satisfactory solution. We offer a proof of concept that this challenge might be effectively addressed within eXtensible MetaGrammar (XMG), an object-oriented metagrammar framework. We first make an existing metagrammatical resource, the FrenchTAG grammar, MWE-aware. We then evaluate the factorization gain during incremental implementation with XMG on a dataset extracted from an MWE-annotated reference corpus.


Introduction
Multiword expressions are combinations of words which encompass heterogeneous linguistic objects such as idioms (IDs: to pull one's leg), compounds (a hot dog), light verb constructions (LVCs: to pay a visit), inherently reflexive verbs (IRVs: s'apercevoir 'self perceive'⇒'realize' in French), rhetorical figures (as busy as a bee), or named entities (the Sea of Tranquility). Their most pervasive and challenging feature is their non-compositional semantics, i.e. the fact that their meaning cannot be deduced from the meanings of their components, and from their syntactic structures, in a way deemed regular for the given language. For this reason, as well as because of their pervasiveness in texts, MWEs constitute a major challenge in semantically oriented NLP applications.
But MWEs also exhibit unexpected behavior on other levels of linguistic analysis including the lexical, morphological and syntactic ones. These properties can be defective or restrictive (Lichte et al., 2018). A defective property excludes a literal interpretation of the MWE, e.g. a cross-roads cannot be understood literally because of the lack of number agreement between the determiner and the head noun. A restrictive property reduces the number of possible surface realizations of the MWE with respect to the literal reading. For instance, in example (3), the possessive determiner has to agree with the subject, otherwise the expression can only be understood literally, as in #John crossed her fingers. Since defective and restrictive properties help distinguish literal from idiomatic readings of MWEs, their description and processing are important both for linguistic modeling and for NLP applications, including MWE identification (Constant et al., 2017).
(1) John broke my mug
(2) John broke his/our fall 'John made his/our fall less forceful'
(3) John crossed his fingers 'John hoped for good luck'
(4) John held his tongue 'John refrained from expressing his view'
When characterizing MWEs, some authors (Grégoire, 2010; Przepiórkowski et al., 2014) oppose the regular behavior of "free" phrases (i.e. those obeying the rules of a "regular" grammar), like (1), to the idiosyncratic behavior of MWEs, like (2)-(4). Some others point out that regularity is a matter of scale rather than a binary phenomenon (Gross, 1988; Herzig Sheinfux et al., 2015; Lichte et al., 2018). We take the latter stand, and extend it by assuming that the degree of regularity is a feature of linguistic properties on the one hand, and of MWEs on the other hand. Firstly, the more (resp. fewer) objects share a certain property, the more it is regular (resp. idiosyncratic). For instance, allowing a possessive determiner in a Verb-Det-Noun construction is more regular than imposing that it agrees with the subject, because the former applies to (1)-(4), while the latter is limited to (3)-(4). Still, the latter is not fully irregular since it is shared by many expressions. Secondly, in (3), while the direct object of the verb to cross is lexicalized (has to be realized by the lexeme finger), the subject is not. While the noun does not admit adjectival modifiers (#He crossed his long fingers.), passivization is allowed (fingers crossed). While the noun has to occur in plural, the verb can be inflected freely, etc. Thus, this MWE combines more regular properties (e.g. a free subject) with more idiosyncratic ones (e.g. a lexically and morphologically fixed object).
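The scale-wise view of regularity can be illustrated with a small Python sketch. It is only an illustration under our own assumptions: the paper does not define a numeric measure, so the sharing ratio below, as well as the property names allows_possessive_det and possessive_agrees_with_subject, are hypothetical. The property assignments themselves follow the discussion of examples (1)-(4).

```python
# Illustrative sketch only: the paper defines no numeric regularity measure.
# Property names are hypothetical labels for the two properties discussed above.
EXAMPLES = {
    "(1) John broke my mug": {"allows_possessive_det"},
    "(2) John broke his/our fall": {"allows_possessive_det"},
    "(3) John crossed his fingers": {
        "allows_possessive_det",
        "possessive_agrees_with_subject",
    },
    "(4) John held his tongue": {
        "allows_possessive_det",
        "possessive_agrees_with_subject",
    },
}


def sharing_ratio(prop, examples=EXAMPLES):
    """Fraction of examples exhibiting `prop` (a hypothetical proxy for regularity)."""
    return sum(prop in props for props in examples.values()) / len(examples)


def rank_by_regularity(examples=EXAMPLES):
    """Order properties from most shared (more regular) to least shared."""
    props = set().union(*examples.values())
    return sorted(props, key=lambda p: sharing_ratio(p, examples), reverse=True)
```

With these four examples, the possessive determiner property is shared by all of them and the agreement property by half of them, which mirrors the ordering argued for in the text.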
Because MWEs exhibit (more or less) idiosyncratic properties, their modeling has to include lexical encoding, i.e. MWEs should become separate lexical entries, in addition to their single-word components. The main challenge is then to account for the irregularity of an MWE, while avoiding redundancy, i.e. repeated description of common properties. For instance, the subject-possessive agreement is shared by (3)-(4) and many other MWEs, so its formalization should preferably be done only once, rather than repeatedly for each MWE lexicon entry. As shown in Sec. 2, no previous work seems to have addressed this challenge in a satisfactory way.
In this paper, we aim at providing a proof of concept that non-redundant lexical encoding of MWEs can be effectively achieved in an object-oriented metagrammar-based approach. We use XMG (Crabbé et al., 2013; Petitjean et al., 2016), a declarative constraint-based description language in which more or less regular tree structures are modeled via a hierarchy of classes. We test our proposal on French. We build on FrenchTAG (Crabbé, 2005), a pre-existing XMG resource which implements a large fragment of a reference grammar of French (Abeillé, 2002). We show how FrenchTAG can be adapted and extended so as to accommodate a small subset of verbal MWEs (VMWEs) of different syntactic structures and of varying degrees of syntactic flexibility. We evaluate the proposal on a dataset based on the PARSEME corpus of VMWEs (Savary et al., forthcoming). The experiment shows that adding MWE descriptions to a general grammar can be done elegantly by introducing interface constraints in pre-existing classes (to account for restrictive properties), and by adding some new classes (to account for defective properties and for various syntactic structures of lexicalized verbal arguments).
The paper is organized as follows. We discuss the state of the art in lexical encoding of MWEs in computational lexicons and grammars (Sec. 2). We introduce our formalisms and tools (Sec. 3). We present FrenchTAG, the metagrammar we start from (Sec. 4). We explain the methodology of making the original metagrammar MWE-aware (Sec. 5). We discuss the evaluation protocol and results (Sec. 6). Finally, we conclude and give directions for future work (Sec. 7).

Related work
Lexical encoding of MWEs has a long linguistic tradition, notably in French with Gross (1986) and Mel'čuk (1988). They assume that units of meaning are located at the level of elementary sentences rather than of words, and MWEs, especially verbal, are special instances of predicates in which some arguments are lexicalized. Those works paved the way towards systematic syntactic description of MWEs, but were not directly applicable to NLP due to insufficient formalization (Constant and Tolone, 2010).
With the growing understanding of the challenges which MWEs pose to NLP, a large number of NLP-dedicated MWE lexicons have been created for many languages (Losnegaard et al., 2016), some of which account for selected morpho-syntactic properties. They can be classified notably along two axes:
Axis 1 Formalization of the lexicon-grammar interaction.
Axis 2 Existence of factorization mechanisms.
Axis 1 introduces a division between works which account for continuous MWEs only (i.e. those whose components are adjacent in text) and works which also cover possibly discontinuous ones. In the former case (Savary, 2008), finite-state-related formalisms, possibly enriched with unification, are often used (Karttunen et al., 1992; Breidt et al., 1996; Oflazer et al., 2004; Silberztein, 2005) since only local phenomena need to be covered, and there is no need to account for the grammar of a language in a comprehensive way.
Conversely, the description of discontinuous MWEs, most prominently of VMWEs, usually calls for more or less explicit reference to a full-fledged grammar, because of interactions between MWEs and external elements. For instance, the MWE in (2) has a compulsory but non-lexicalized modifier of the noun fall, which can be realized by syntactically complex nominal phrases (John broke his secretly adored office mate's fall). Such long-distance dependencies have been covered with two objectives in mind: (i) theory-independence and (ii) integration with computational grammars. Firstly, it was postulated that MWE encoding, which is a labor-intensive task, calls for a theory-neutral framework (Grégoire, 2010; Przepiórkowski et al., 2014; McShane et al., 2015). These works assume the existence of general grammar rules (of the language under study), whose observance a native lexicographer is able to verify. The description of an MWE is then done in such a way that only its idiosyncratic properties (i.e. those not conforming to the regular grammar) are encoded, while the regular ones are assumed implicitly. Although these lexicons suffer from insufficient formalization (Lichte et al., 2018), they could be successfully applied to parsing after ad hoc conversion to particular grammar formalisms (Patejuk, 2015). Secondly, a range of computational grammars accommodate some types of MWEs directly in their lexicons. In Head-driven Phrase Structure Grammar (HPSG), , , and Villavicencio et al. (2004) represent decomposable English MWEs (to spill the beans) by paraphrasing (spill ⇒ 'reveal', the beans ⇒ 'a secret') and MWEs with opaque semantics (to kick the bucket) by separate semantic predicates. Bond et al. (2015) additionally focus on co-indexation mechanisms needed to represent possessed idioms such as (3) (2006) parses Arabic continuous semi-fixed MWEs (traffic light) as single tokens, while syntactically compositional but semantically non-compositional MWEs ('fiery bike'⇒'motorbike') are handled by the grammar via lexical selection rules, similarly to the HPSG approaches. A formally very different account is found in Lexicalized Tree-Adjoining Grammar (LTAG). Since the elementary structures of LTAG, the elementary trees, correspond to an "extended domain of locality", even the surface structure of discontinuous MWEs can be directly represented within the lexicon (Abeillé and Schabes, 1989; Abeillé and Schabes, 1996; Vaidya et al., 2014; Lichte and Kallmeyer, 2016). This sort of approach is therefore most similar to the words-with-spaces approaches in HPSG and LFG, yet making available more structure and including slots rather than spaces. As a conclusion, there exist, on the one hand, generic lexical MWE resources which suffer from the lack of sufficient formalization, and, on the other hand, perfectly formalized solutions but restricted to particular grammar formalisms.
Along Axis 2, the challenge to address is the proliferation of idiosyncrasy profiles of MWEs. Some MWE lexicons do not introduce generalization of the MWE behavior (Al-Haj et al., 2014). Some do it via inflection codes (Savary, 2009), equivalence classes (Grégoire, 2010), macros (Przepiórkowski et al., 2014), or type hierarchies (Herzig Sheinfux et al., 2015), with a limited degree of recursiveness. The metagrammatical approach by Jacquemin (2001) addresses the morphological, syntactic and semantic variation of French MWEs in a factorized way. There, canonical forms of MWEs are represented as fully lexicalized CFG-like rules with feature structures and unification, while MWE variants are covered by metarules, but the proposal is restricted to continuous terminological compounds.
In view of this state of the art, it seems that a non-redundant lexical encoding of MWEs, which would account for both continuous and discontinuous MWEs, as well as their scale-wise regularity, has not yet received a satisfactory solution. The goal of this paper is to take steps towards such a solution in a metagrammatical (i.e. relatively theory-independent) framework. We focus on lexical and morpho-syntactic properties of MWEs, and present a proof of concept in the context of FTAG.

From a metagrammar to parsing
The framework of eXtensible MetaGrammar (XMG) (Crabbé et al., 2013; Petitjean et al., 2016) provides description languages and dedicated compilers for generating a wide range of linguistic resources. Descriptions are organized into classes, as in object-oriented programming. Similarly, classes have encapsulated name spaces and inheritance relations may hold between them. The crucial elements of a class are dimensions. They can be equipped with specific description languages and are compiled independently, thereby enabling the grammar writer to treat the levels of linguistic information separately.
In this paper, we use the <syn> and <iface> dimensions. <syn> holds tree constraints that express dominance and precedence relations among nodes. Nodes may carry (untyped) feature structures. <iface> is an interface dimension where (untyped) feature structures are used to share information between other dimensions and classes. The general structure of a class is given on the left of Listing 1. The more concrete class CanSubject on the right of Listing 1 describes the canonical position of a subject N on the left of the verbal nucleus VN. The corresponding tree fragment is given in Fig. 1(a). Note that import, export, and declare behave similarly to the corresponding constructs in object-oriented programming. The ; and | operators express conjunction and disjunction of statements, respectively.
While XMG comes with compilers (implemented as constraint solvers) for several syntactic formalisms, we focus on LTAG, due to its extended domain of locality enabling seamless representation of MWEs. We use LTAG grammars compiled from XMG metagrammars as input to the TuLiPA parser (Parmentier et al., 2008; Arps and Petitjean, 2018), applied in the experiments reported on in Sec. 6.

FrenchTAG: A French XMG metagrammar
FrenchTAG (Crabbé, 2005) is a syntactic XMG implementation of the reference grammar of French by Abeillé (2002). It contains 285 XMG classes, including 96 families (sets of trees assigned to lexemes), which compile into 9045 TAG trees. Its main focus is verbs. It defines about 40 verbal subcategorisation frames and, for each frame, the allowed diatheses (active, passive, middle, reflexive, impersonal, etc.) and argument realizations (canonical, clitic, extracted, omitted, etc.). Listing 2 shows an extract of the classes describing transitive verbs like ouvrir 'open'. Invoking a class C1 within class C2 means that C2 inherits C1's descriptions. The Subject class is a disjunction of various realizations of a subject: canonical (Jean 'Jean'), clitic (il 'he.SUBJ'), relative (Jean qui ouvre la porte 'Jean who opens the door'), etc. Similarly, a direct Object can be canonical (la porte 'the door'), clitic (la 'it.OBJ'), relative (la porte que Jean ouvre 'the door which Jean opens'), reflexive (Jean s'ouvre 'Jean opens himself'), etc. Then, the dian0Vn1Active class combines any realization of a subject and an object with a verb in active voice. Class n0Vn1 is a disjunction of diatheses: active (Jean ouvre la porte 'Jean opens the door'), passive (la porte est ouverte par Jean 'the door is opened by Jean'), short passive (la porte est ouverte 'the door is opened'), etc. Class n0ClV represents inherently reflexive verbs discussed in Sec. 5.
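The way disjunction and conjunction of classes multiply out into alternatives can be sketched in Python. This is a deliberately simplified model under our own assumptions: real XMG classes describe tree fragments and constraints rather than labels, the compiler may discard combinations whose constraints cannot be satisfied, and all names other than CanonicalSubject are hypothetical stand-ins.

```python
from itertools import product


def disj(*classes):
    """Model of the XMG '|' operator: pool the alternatives of all classes."""
    return [alt for cls in classes for alt in cls]


def conj(*classes):
    """Model of the XMG ';' operator: pick one alternative per class and merge them."""
    return [sum(choice, ()) for choice in product(*classes)]


# Each class is a list of alternatives; each alternative is a tuple of fragment labels.
# Only "CanonicalSubject" is a class name taken from the text; the rest are hypothetical.
SUBJECT = disj(
    [("CanonicalSubject",)], [("CliticSubject",)], [("RelativeSubject",)]
)
OBJECT = disj(
    [("CanonicalObject",)],
    [("CliticObject",)],
    [("RelativeObject",)],
    [("ReflexiveObject",)],
)
ACTIVE_VERB = [("ActiveVerb",)]

# Analogue of dian0Vn1Active: any subject and any object with an active verb.
DIAN0VN1_ACTIVE = conj(SUBJECT, ACTIVE_VERB, OBJECT)
```

In this toy model the active diathesis yields one alternative per pair of subject and object realizations, which gives an intuition for why a few hundred classes can compile into thousands of trees.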
Each class describes a more or less abstract set of tree fragments, which can be made more specific by adding constraints in the inheriting classes. For instance, CanonicalSubject inherited by Subject corresponds to the tree in Fig. 1(a). In case of class conjunction, tree fragments of the inherited classes are combined in that shared nodes (represented by global variables, invisible here) are unified. Here, dian0Vn1Active yields several combinations of tree fragments (a)-(f), including tree (i) (Jean l'ouvre 'John opens it') obtained from (a), (d) and (e) unified along the verbal spine S→VN→V.
Listing 2: FrenchTAG classes for the realizations of a subject, a direct object, an active diathesis, all diatheses of transitive verbs, and of an inherently reflexive verb.
Compiling an XMG metagrammar M into an LTAG boils down to finding minimal tree models which fulfill constraints expressed in M. Here, the tree fragment (d) imposes that the clitic CL precedes the verb V possibly indirectly (which is signaled by dots between the sibling nodes). But since no other fragment imposes a third node between the CL and V, the minimal model is the one with a direct precedence. An extract of the FrenchTAG class hierarchy, with some classes from Listing 2, is shown in Fig. 2.
Figure 2: Extract of the XMG class hierarchy in the original FrenchTAG metagrammar.

Enriching a metagrammar with MWEs
XMG offers elegant, fully formalized and powerful factoring mechanisms, which enable largely non-redundant grammar engineering. However, FrenchTAG, which is one of its most advanced use cases, only encodes a limited number of MWE types. Prominently, it covers some inherently reflexive verbs (IRVs), such as se taire 'self ignore'⇒'refrain from talking', in which the reflexive clitic se 'self' either is inherent to the verb (i.e. the verb never occurs alone), or markedly changes the meaning and/or the subcategorization frame of the verb. Although the clitic in an IRV occupies the syntactic place of a direct object, it is not a semantic argument of the verb and does not alternate with non-reflexive objects, unlike for transitive verbs. For instance, the verb ouvrir 'open' assigned to n0Vn1 from Fig. 2 exhibits various object realizations including a reflexive one (Jean s'ouvre 'Jean opens himself'), the latter covered by a combination of tree fragments (d) and (e) from Fig. 1. Conversely, IRVs are assigned to the n0ClV class (cf. Fig. 2) describing the tree fragment (f), which directly integrates the clitic reflexive CL. This mechanism displays one possibility of representing restrictive properties of MWEs: new classes are created by combining only those alternatives which are allowed by a given type of MWEs. Here, class n0ClV directly combines a free subject with a verb taking a clitic but not any other form of object. In early stages of this work, we experimented with this approach. New classes were created for sets of MWEs which had the same syntactic structure and allowed exactly the same syntactic alternations of their verbal arguments. While some redundancy could be avoided due to factorization of the most common alternations, the number of classes grew rapidly with the addition of new MWEs to the lexicon.
Here, we describe another approach which offers a much higher degree of factorization due to an intensive use of interface filters (or interface FSs) both in the grammar and in the lexicon. Consider the VMWE in (5). On the one hand, it shares some properties with transitive verbs (class n0Vn1). For instance, its verb inflects freely, and its subject is unconstrained (canonical, clitic, etc.) and agrees with the verb as shown in (5)-(8). On the other hand, it exhibits restrictive properties. Its verb cannot be passivized (9). Its object is lexicalized (10) and cannot be cliticized (11), extracted (12) or modified (13).

(5) Jean prend la porte 'Jean takes the door'⇒'Jean leaves (because he is forced to)'
(6) Jean/il prend la porte 'Jean/he takes the door'⇒'Jean/he leaves'
(7) Jean qui prend la porte 'Jean, who takes the door'⇒'Jean, who leaves . . . '
(8) Prenez la porte! 'Take.2.PL the door!'⇒'Leave!'
(9) #La porte est prise par Jean 'The door is taken by Jean'
(10) #Jean prend la sortie 'Jean takes the exit'
(11) #Jean la prend 'Jean takes it'
(12) #La porte que Jean prend 'The door that Jean takes'
(13) #Jean prend la grande porte 'Jean takes the big door'
In order to account for these properties, the MWE prendre la porte receives the lexical class mweLemmePrendreLaPorte from the left column of Listing 4. The entry holds the head of the MWE, here prendre 'take', which anchors (in some inflected form) tree templates from the mwen0Vn1 family specified in fam. A set of filter feature-value pairs helps to select among the tree templates in mwen0Vn1 by unifying with their interface features. For example, the filter dia=active selects the trees that correspond to active diathesis (see also below). The other lexicalized components of prendre la porte are specified in coanchor, where ObjDetNode and ObjNode are node names from the trees in mwen0Vn1. These names are also used in the equation part in order to pass morphological features on to the referred nodes.
 1 class mweLemmePrendreLaPorte {
 2   <lemma> {
 3     entry <- "prendre";
 4     cat <- v;
 5     fam <- mwen0Vn1;
 6     filter dia = active;
 7     filter subj = free;
 8     filter obj = lex;
 9     filter objtype = canonical;
10     filter objstruct = lexDetLexN;
11     coanchor ObjDetNode -> "la"/d;
12     coanchor ObjNode -> "porte"/n;
The right column of Listing 4 shows the MWE-aware versions of some FrenchTAG classes from Fig. 2, in which each syntactic alternation receives an interface constraint, e.g. dia=active in line 12. MWEs having a structure of transitive verbs, like (5), are assigned to class mwen0Vn1, which includes similar diatheses to n0Vn1. These are however filtered by the MWE lexicon entries: here, the dia filter in the mweLemmePrendreLaPorte class will only admit the active alternation from mwen0Vn1, while all others will be discarded during grammar anchoring prior to parsing. Similarly, a verbal argument in an MWE, here a mweSubject or a mweObject, can be either free or lexicalized. In the former case, the pre-existing FrenchTAG classes Subject and Object are re-used. In the latter, new classes from Fig. 3 and 4 are needed, mostly because our objective so far is to keep the original classes intact.
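The selection of tree templates by interface filters can be approximated with a short Python sketch. It rests on assumptions of ours: feature structures are modeled as flat dictionaries, unification fails only on conflicting values, and the template names and the values passive and clitic are hypothetical. The filter set itself is the one given in the lexicon entry for prendre la porte.

```python
# Filters of the lexicon entry mweLemmePrendreLaPorte, as given in Listing 4.
ENTRY_FILTERS = {
    "dia": "active",
    "subj": "free",
    "obj": "lex",
    "objtype": "canonical",
    "objstruct": "lexDetLexN",
}

# Hypothetical interface feature structures of some tree templates in mwen0Vn1.
TEMPLATES = [
    {"name": "active_canonical_obj",
     "iface": {"dia": "active", "subj": "free", "obj": "lex",
               "objtype": "canonical", "objstruct": "lexDetLexN"}},
    {"name": "passive",
     "iface": {"dia": "passive", "subj": "free", "obj": "lex"}},
    {"name": "active_clitic_obj",
     "iface": {"dia": "active", "subj": "free", "obj": "lex",
               "objtype": "clitic"}},
    {"name": "active_underspecified",
     "iface": {"dia": "active"}},
]


def unify(fs1, fs2):
    """Unify two flat feature structures; return None if any feature clashes."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result and result[feature] != value:
            return None
        result[feature] = value
    return result


def select_templates(filters, templates):
    """Keep the names of templates whose interface unifies with the filters."""
    return [t["name"] for t in templates
            if unify(filters, t["iface"]) is not None]
```

Under these assumptions, the passive and clitic-object templates are rejected because of clashes on dia and objtype, while a template that leaves a feature unspecified still unifies. Whether underspecified templates occur in the actual metagrammar is not stated in the text.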
The mweObjectLexStruct class is a disjunction of 3 classes which describe lexicalized noun phrases of the Det-Noun, Noun or Noun-Adj structures, respectively, as in (5), Jean fait face au problème 'Jean makes face to the problem'⇒'Jean deals with the problem' and Jean fait profil bas 'Jean makes low profile'⇒'Jean keeps quiet and discreet'. The corresponding tree fragments are depicted in Fig. 4 (b)-(d). The mweObjectLex class, inherited by mweObjectLexStruct, is similar to Object in Fig. 2 up to substitution marking of the object nodes. For instance, one of the diatheses described by mweObjectLex is the one in Fig. 4(a), which is identical to Fig. 1(b) up to the substitution mark of the N node. This allows us to unify precisely this node with the root (xLexNPTop) of one of the trees in Fig. 4(b)-(d), so as to ensure that the object is lexicalized. Note that the ObjDetNode and ObjNode node names, as well as the objstruct filters in Fig. 3, coincide with the lexicon entry in Listing 4. This ensures that the filters of this MWE select only the right syntactic alternations, and that its coanchors attach to nodes D and N in Fig. 4(e).
In brief, making FrenchTAG MWE-aware is based on 3 principles: (i) creating MWE lexicon entries with coanchors and interface filters constraining the syntactic alternations, (ii) reusing pre-existing classes for (more) regular properties and decorating some of them with interface filters, (iii) creating new classes for lexicalized arguments of various syntactic structures. Fig. 5 shows an extract of the class hierarchy with the pre-existing classes from Fig. 2, as well as decorated or newly created MWE-aware classes. Henceforth, we refer to this MWE-aware version of FrenchTAG as mweFrenchTAG.
Figure 5: Extract of the XMG class hierarchy in the metagrammar with encoded MWEs. Classes encoding the original families with additional interface filters (and possibly with restricted alternatives) are marked with underlined mwe prefixes. New classes encoding the syntactic structure of lexicalized verbal arguments of MWEs are boxed. All other classes remain unchanged with respect to Fig. 2.

Evaluation
This work is meant to provide a proof of concept that non-redundant lexical encoding of MWEs can be effectively achieved in XMG. Therefore, we do not aim at evaluating the coverage of the metagrammar. Instead, we wish to see: (i) to what extent the metagrammar allows us to cover the properties of MWEs of a variable degree of regularity, as they are present in real data, (ii) how far the metagrammar changes or extends, when new MWEs and phenomena are to be covered.
To this end, we used the French part of the PARSEME corpus of VMWEs (Savary et al., forthcoming). For the 1,449 unique VMWEs, their idiomatic and some literal occurrences were extracted. From the resulting 5,893 occurrences an evaluation dataset was selected following three criteria: frequency, syntactic variety, and presence of literal readings. We first chose the 14 most frequent VMWEs of various categories and checked that some of them exhibited literal readings in the dataset. Their total number of occurrences was equal to 1,004. We needed a dataset of a more manageable size but representing a large variety of syntactic phenomena. To this end, each occurrence was abstracted as a sequence of components, each component represented by (i) its part-of-speech (POS) tag, and (ii) the set of outgoing dependency relation types limited to 2 (the first two in lexicographic order). This reduced the 1,004 occurrences to 225 classes, from which we repeatedly randomly selected a subset of 56 occurrences (4 for each of the 14 VMWEs). The syntactic variety, in terms of Shannon's entropy, of VMWE occurrences inside each subset was calculated. The subset maximizing the entropy was then randomly divided into two disjoint subsets: DEV (50%) and TEST (50%).
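The selection procedure can be sketched in Python as follows. This is a minimal reconstruction under stated assumptions, not the script used for the experiment: we assume that entropy is computed over the abstraction classes present in a subset, the input format (a list of (POS, relations) pairs per occurrence) is ours, and the number of random trials and the seed are hypothetical parameters.

```python
import math
import random
from collections import Counter


def abstract(occurrence):
    """Map an occurrence, given as [(pos, [dependency relations]), ...], to a
    hashable signature keeping at most the first two relation types in
    lexicographic order for each component."""
    return tuple((pos, tuple(sorted(set(rels))[:2])) for pos, rels in occurrence)


def entropy(signatures):
    """Shannon entropy (in bits) of the distribution of signatures."""
    counts = Counter(signatures)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_subset(occ_by_vmwe, per_vmwe=4, trials=1000, seed=0):
    """Repeatedly sample `per_vmwe` occurrences per VMWE and keep the sample
    with the highest entropy. Each VMWE needs at least `per_vmwe` occurrences."""
    rng = random.Random(seed)
    best, best_h = None, -1.0
    for _ in range(trials):
        subset = [(vmwe, occ)
                  for vmwe, occs in occ_by_vmwe.items()
                  for occ in rng.sample(occs, per_vmwe)]
        h = entropy([abstract(occ) for _, occ in subset])
        if h > best_h:
            best, best_h = subset, h
    return best, best_h
```

With 14 VMWEs and per_vmwe=4, each sample contains 56 occurrences, as in the protocol described above. The subsequent random split into DEV and TEST is not shown.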
We initially intended to use literal readings to demonstrate the way the grammar represents ambiguous MWEs. However, it turned out that most of them were embedded in other MWEs. We therefore abandoned them in this study and thus reduced DEV and TEST to 26 sentences each.
The original FrenchTAG is focused on verbal predicates and their arguments, while providing only basic coverage for other syntactic categories. It also has a rather scarce lexicon. Therefore, the 52 sentences were manually simplified so as to: (i) avoid subordinate clauses and coordinations, (ii) reduce the non-lexicalized arguments of the verbs (to the head nouns, or to single-word proper names and adverbials). The simplified datasets, DEV-S and TEST-S, are cited in the Appendix. The lexicon necessary for parsing them was automatically derived from the Unitex corpus processor. Then, in phase 1, most MWEs occurring in DEV-S were encoded in mweFrenchTAG (in addition to a dozen previously encoded examples), so as to parse at least the syntactic alternations occurring in DEV-S. As shown in Tab. 1, after this phase, we observed an 18% increase in the number of XMG classes covering the grammar (i.e. not the lexicon), which is substantial but expected, since this phase included re-engineering the class hierarchy to make it MWE-aware (cf. Sec. 5). In phase 2, we repeated the same actions for TEST-S. We noted only a 1.1% increase in the number of classes between phases 1 and 2. We expect this increase to be even lower in the next encoding iterations, as new MWEs should exhibit increasingly similar syntactic structures and properties to those already encoded. Experimental large-scale validation of this hypothesis is left to future work.
Some examples in DEV-S and TEST-S have not been successfully encoded so far. Example (33) contains a regular extraction whose type seems not to be covered by FrenchTAG. The MWEs in (38) and (63) are reflexive but behave like a copula, and their optimal representation requires more insight. Finally, the encoding of the MWE in (36), (37), (60) and (61) is only partly satisfactory since the implementation of a compulsory but non-lexicalized modifier has not yet been solved in our approach.
Note that the ratio of not (or partly) encoded MWEs decreases between phases 1 and 2.

Conclusions and future work
There is a fundamental tension between the flexibility of MWEs, that is their unforeseeable but possibly recurring "irregularities", and lexical encoding systems which ought to be denotationally precise and at the same time maximally theory-independent. In this paper, we have argued that XMG incorporates both the necessary flexibility and denotational rigor. To this end, we have shown how to extend the metagrammar underlying FrenchTAG, a large-coverage grammar of French, by grammar-specific classes to which the rather grammar-agnostic descriptions of MWEs can be linked. By evaluation against an MWE-annotated reference corpus, we noticed a considerable drop in the growth of grammar size during incremental implementation, which seems to suggest that the redundancy in MWE descriptions is efficiently captured.
There are several work packages left for future work: (i) to extend the approach to challenging phenomena such as co-indexation; (ii) to make the XMG compiler and LTAG parser also handle compulsory or prohibited modification of anchor or co-anchor nodes; (iii) to allow for co-anchoring lemmas rather than fixed inflected forms; (iv) to introduce intermediate classes into the original hierarchy from Fig. 3 which directly account for more co-anchoring phenomena. As to the last point, our strategy was to keep the original FrenchTAG classes intact, but this is sub-optimal in cases where co-anchoring becomes massive. Finally, what needs to be done in the long run is to investigate how the MWE descriptions we propose can be reused with grammars from other frameworks, for example HPSG or LFG.