Wednesday, January 10, 2024

A Language of Graphs

Recently I got thinking about syntax trees, and what a purely-written language might be like that was restricted to the syntactic structures available to linearized spoken languages and made those structures  explicit in a 2D representation. Or in other words, a graphical (double-entendre fully intended) language consisting of trees--that is, graphs in which there is exactly one path between any two nodes/vertices--whose nodes are either functional lexemes roughly corresponding to internal syntactic nodes and function words in natural languages, or semantic lexemes corresponding to content words--but where, since the "internal" structure is made visible, content words are not restricted to leaf nodes!

Without loss of generality, and for the sake of simplicity, we can even restrict the visual grammar to binary trees--which X-bar theory does for natural languages anyway--although calling them "binary" doesn't make much sense if you don't display them in the traditional top-down tree format with a distinguished root node, since internal nodes can have up to three connections--one "parent" and two "daughters", which are a natural distinction in natlang syntax trees but completely arbitrary when you aren't trying to impose a reversible linearization on the leaf nodes! So, in other terms, we can say that sentences of this sort of language would consist of tree-structured graphs with a maximal vertex degree of 3.

I am hardly the first person to have thought up the idea of 2D written language, but a common issue plaguing such conlang projects (including their most notable example, UNLWS) is figuring out how to lay them out in actual two dimensions; general graphs are three-dimensional, and squishing them onto a plane often requires crossing lines or making long detours, or both. Even when you can avoid crossings, figuring out the optimal way to lay out a graph on the page is a very hard computational problem. Trees, however, have the very nice property that they are always planar, and trivial to draw on a 2D surface; if we allow cycles, or diamonds (same thing with undirected edges), it becomes much more difficult to specify grammatical rules that will naturally enforce planarity--which is whay I've yet to see a 2D language project that even tries. Not only is it easy to planarize trees, there are even multiple ways of doing so automatically, so one could aspire to writing software that would nicely lay out graphical sentences given, say, parenthesized typed input. (Another benefit of trees is that they can be fully specified by matched-parentheses expressions, w we could actually hope to be able to write this on a keyboard!) And then we can imagine imposing additional grammatical rules and pragmatic implications for different standard layout choices--what does it mean if one node is arbitrarily specified as the root, and you do lay it out as a traditional tree? What if you instead highlight a root node by centering it and laying out the rest of the sentence around it? What if you center a degree-two node and split the rest of the sentence into two halves splayed out on either side?

The downside of trees is that semantic structure is not limited to trees; knowledge graphs are arbitrary non-planar graphs. But, linear natural languages already deal with that successfully; expanding our linguistic from a line to a tree should still reduce the kinds of ambiguities that natural languages handle all the time. So, this sort of 2D language will require the equivalent of pronouns for cross-references; but they probably won't look much like spoken pronouns, and there's a lot more potential freedom in where you decide to make cuts in the semantic graph to turn it into a tree, and thus where pronouns get introduced to encode those missing edges, and those choices can probably be filled with pragmatic meaning on top of the implications of visual layout decisions.

Now, what should words--the nodes in these trees--look like? It seems to be common in 2D languages for glyphs to be essentially arbitrary logographs, perhaps with standard boundary shapes or connection point shapes for different word classes. The philosophy behing UNLWS, that it should take maximal advantage of the native possibilities of the written visual medium, even encourages using iconic pictoral expressions when feasible. But that's not how natural languages work; even visual languages (i.e., sign languages), despite having more iconicity on average than oral languages, have a phonological system consisting of a finite number of basic combinatorial units that are used to build meaningful words, analogous to the finite number of phonemes that oral languages have to sring together into arbitrary words. Since we've already got a certain limited set of graphical "phonological" items necessary for drawing syntax trees, and constraint breeds creativity, why not just re-use those?


Here we have an idealized representation of the available phonemes / graphemes / glyphemes: a vertex with one adjoining edge, a vertex with 2 adjoining edges, and a vertex with 3 adjoining edges. On the left, the three -emic forms. On the right, the basic allographic variants. In all cases, absolute orientation and chirality don't matter--if you mirror the "y" glyph, it is still the same glyph. Note that "graph" and "grapheme" are standard terms in linguistics for the written equivalents of "phones" and "phonemes", but that's gonna get really confusing when we're also talking about "graphs" in the mathematical sense. "Glyph" also has a technical meaning, but I am going to repurpose it here to talk about the basic units of this 2D language. So, we have glyphs, glyphemes, and alloglyphs, which are composed into graphs to form lexemes and phrases. Having only 3 glyphemes to work with may seem extremely limiting, but the expanded combinatorial possibilities in 2D vs. 3D make up for it.

While keeping syntax restricted to tree structures is the core idea of this language experiment, lexical items, which don't need to be invented and laid out on the fly, can be more general; we could allow them to be any planar graph. And just as syntax trees can be laid out in many different ways, we could say that lexical items are solely defined by their abstract graphs, which can also laid out in many ways. But, it turns out that recognizing the topological equivalence of two graphs laid out in different ways is a computationall hard problem! If this language is to be usable by humans, that simply will not do. Thus, the layout for lexical items should be significant, up to rotation and reflection equivalence, so that their visual representations are easily recognizable. This doesn't require introducing any additional phonemic elements--the arrangement of phonemes and letters in one-dimensional natural language words also affects meaning, but we don't consider it "phonemic". Despite the Monty Python sketch about the guy who speaks in anagrams, spoken words are not just bags of sounds in arbitrary order, and written words are not just bags of letters--that's why, for example, "bat" and "tab" mean different things, and "bta" just isn't an English word at all. The spatial arrangement--which, in the case of natural language, works out to just linear order--matters a lot, and that sketch only works because it's precisely constructed to use close-enough anagrams  with a lot of supporting context. So, what sort of glyphotactic rules should we have to determine the valid and recognizable arrangements of glyphs in 2D space?

With 3 edges per vertex, the most natural-seeming arrangement is to spread them out at 120 degree angles, and degree-2 vertices would sit nicely in a pattern with 180-degree angles (although we probably want to minimize those, since vertices are more noticeable if they are highlighted by a corresponding angle, rather than a straight line through them). That suggests a triangular grid, which can accomodate both arrangements. The idealized glyphemes and alloglyphs shown above are drawn assuming placement on such a triangular grid, with 60, 120, and 180-degree angles. (I will continue to refer to the features of glyphs in terms of 60, 120, and 180-degree angles, but these, too, are idealizations; in practice, non-equilateral grids might be used for artistic or typographic purposes--e.g., as an equivalent to italics--in which case these angle measurements should be interpreted as representing 1, 2, or 3 angular steps around a point in the grid.) So, words shouldn't be completely arbitrary planar graphs--they should be planar graphs with a particular layout on a triangular grid.

It does not make sense to extend a single grid across an entire phrase or sentence; the boundaries of trees grow exponentially, so you'd need a hyperbolic grid to do it in the general case, and hyperbolic paper is hard to come by (although laying out a sentence on a single common grid within, say, a Poincare-disc model of the hyperbolic plane might be a neat artistic exercise). Maintaining a grid within a word is sufficient to maintain graphical recognizability, and breaking the grid is one signal of the boundary between lexicon and morphology on one side and syntax on the other.

Making an analogy to chemistry, I feel, as an aesthetic preference, that word-graphs should have a minimal amount of "strain". That is, glyphotactically valid layouts should use 120-degree angles wherever possible, and squish them to 60 degrees or spread them to 180 degrees only where necessary. So, where is it necessary?

  • 60-degree angles should only occur on 3-vertex triangles, the acute points of 4-vertex diamonds, or as paired 60-degree angles on the interior of a hexagon.
  • 180-degree angles should only occur adjacent to 60-degree angles, or crossing vertices at the centers of hexagons.
Additional restrictions:
  • All edges should be exactly one grid unit long--i.e., there are no words distinguished by having a straight line across multiple edges, vs. two edges with a 180-degree angle at a vertex in the middle.
  • Syntactic connections must occur on the outer boundary. I.e., you can't have a word embedded inside another word.
  • All vertices must have a maximum of three adjacent edges; thus, any word must have at least one exterior vertex with degree 2 or 1, to allow a syntactic adge to attach to it.
  • As they are nodes in a binary syntax tree, words can have at most 3 external syntactic connection points.

With those restrictions in place, here are all of the possible word skeletons of 2, 3, or 4 vertices:


I refer to these "word skeletons" rather than full words because they abstract away the specification of syntactic binding points--and the choice of binding points may distinguish words (although they should probably be semantically-related words if I'm not being perverse!) Including all of the possible binding point patterns for every skeleton massively increases the number of possibilities, and it quickly gets impractically tedious to enumerate them all and write them down. Here are all of the word skeletons with 5 vertices:

And here are all of the word skeletons with 6 vertices:

And the number of possible 7-vertex words is.... big. Counting graphs turns out to also be a hard problem, so I can't tell you exactly how fast the number of possible words grows, but it grows fast.

Now, I just need to start actually assigning meanings to some of these....

2 comments:

  1. Given the restriction that each connector is exactly one unit long, are the graphs restricted not only to be planar, but to fit together without overlap?

    If language is allowed to put words on the hidden nodes, I think that's similar to having a secret third child for each node, so in some ways this takes a binary tree and makes it into a trinary tree (where each node has at most 4 connections).

    ReplyDelete
  2. Affix graphs, which provide morphological alterations within a word, if there are any, would need to fit together without overlaps. But once you leave a word, the grid no longer applies--so inter-word syntactic connections can be as long as you like to ensure no overlap, and regularly making them longer than the local grid unit would be the equivalent of putting spaces between words in linear languages.

    ReplyDelete