Saturday, December 11, 2021

Conlanging with vec2word

Word2Vec is a family of machine-learning-based NLP (Natural Language Processing) algorithms which encode the semantics of words as vectors on a unit hypersphere--that is, the meaning of each word is encoded as a big list of numbers (a vector) whose sum of squares is 1. The smaller the angle between any two vectors, the more similar their associated words are in meaning. These kinds of models let you do some kinda neat stuff, like evaluating the semantic similarity between documents (which is useful for fuzzy searches), and doing arithmetic on vectors to complete analogies (e.g., "king - man + woman = queen").
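
For concreteness, here's roughly what those similarity and analogy queries look like using the gensim library (the model file name is just a placeholder--any word2vec-format model will do):

```python
from gensim.models import KeyedVectors

# Load a pretrained word2vec model (path and format are placeholders).
model = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Cosine similarity between two words: closer to 1 means more similar in meaning.
print(model.similarity("cat", "dog"))

# Analogy arithmetic: king - man + woman ~= queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```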

It occurred to me (thanks to a Zoom discussion about a taxonomic philosophical language based on Semitic-style triliteral roots) that one could automatically generate vocabulary with taxonomic structure by starting with a word vector model, sorting the vectors along each dimension, and then mapping each vector entry to a phoneme based on its position along that dimension. Going not from source language words 2 vectors... but from vectors 2 conlang words.
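
To make that concrete, here's a toy sketch of the sort-and-map idea (not the actual vec2word code--the tiny vocabulary, random vectors, and phoneme inventories are all just stand-ins for illustration):

```python
import numpy as np

# Toy stand-ins for a real model: one random vector per word, 3 dimensions.
words = ["cat", "dog", "house", "run"]
vectors = np.random.randn(len(words), 3)

# One (purely illustrative) phoneme inventory per dimension.
inventories = [list("ptk"), list("aiu"), list("mns")]

def vector_to_form(idx):
    """Build a word form by reading off one phoneme per dimension."""
    form = []
    for dim, inventory in enumerate(inventories):
        # Rank this word's coordinate along the current dimension...
        order = np.argsort(vectors[:, dim])
        rank = int(np.where(order == idx)[0][0])
        # ...and map that rank onto one of the available phonemes.
        slot = rank * len(inventory) // len(words)
        form.append(inventory[slot])
    return "".join(form)

for i, word in enumerate(words):
    print(word, "->", vector_to_form(i))
```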

(Side note: in theory, it would make more sense to convert semantic vectors into polyspherical coordinates and factor out the redundant radius dimension first... but in practice, sorting by all-but-one Euclidean coordinate gets you the same groupings, just in a possibly-different order [e.g., think about expressing your latitude and longitude in degrees, vs. miles north or south of the Earth's core and miles under Null Island, respectively; longitude coordinates end up sorting differently, and the scale is not linear, but coordinates that are close in one scheme are still close in the other], and is way more computationally efficient--not because there's anything special about Euclidean coordinates, but because pretty much every model ever already comes in that format, and the fastest conversion operation is the one you never do.)

Of course, decent models tend to have between 100 and 300 dimensions, which would make for really long words... but, we can do a neat thing called Principal Component Analysis (PCA) to figure out which vector components are the most important (i.e., in this case, carry the greatest semantic load), which means we just choose however long we want our words to be, extract that many principal components, and then proceed as before.
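
With scikit-learn, that reduction step might look something like this (the 300 dimensions and 3 output components are just example numbers, and the random matrix stands in for a real model):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a real model: 10,000 word vectors in 300 dimensions.
vectors = np.random.randn(10_000, 300)

# Keep only as many principal components as we want phoneme slots per word.
n_slots = 3
reduced = PCA(n_components=n_slots).fit_transform(vectors)
print(reduced.shape)  # (10000, 3)
```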

Additionally, if we don't allow an arbitrarily large number of phonemes to scatter along every dimension (which we don't, because languages have finite phonemic inventories, and rules about what phonemes can appear in what contexts), we won't necessarily be able to give every vector in the model a unique form. Thus, we will have to group semantically-similar source words together in the output. This is actually a good thing, because we don't want to just create an algorithmic relex of the source language (uh... unless you actually do, in which case, go for it, I guess!); rather, the list of source words associated with any given output word can serve as exemplars of a more general semantic field from which the precise definition of your new conlang word can be picked. It's not completely-automated, definitions-included word generation, but it is a lot easier than coming up with new words and definitions completely from scratch! And, even if you aren't intending to create a proper taxonomic language, this approach to producing semantic prompts for specific word forms can help produce a conlang lexicon which naturally contains discoverable sound symbolism patterns, without obvious taxonomic morphology.

Having realized that we will need to do some clustering, we have a few ways we could go about it. We could try just dividing the semantic space into rectangular regions, by dividing up each dimension individually, either at regular intervals or adjusted to try to get an equal distribution of source words in each cluster (and of course we can do that in Euclidean or polyspherical space)... but natural boundaries in semantic space aren't necessarily rectangular, and forcing rectangular clusters can end up putting weirdly different source words together in the same bucket. If you want that--it gives you more flexibility in choosing which way to go with a definition--cool! But, there's another option: K-Means Clustering, which takes a set of points (i.e., vectors) and a number of clusters to group them into, and tries to find the best grouping into that number of clusters, dividing the space into Voronoi regions around the cluster centers. The number of clusters can be determined from the number of possible forms that are available, and then forms can be assigned just based on the locations of cluster centers.
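
A rough sketch of that clustering step with scikit-learn (the cluster count, vocabulary, and vectors here are placeholders; the real number of clusters would come from counting the word forms your phonology allows):

```python
import numpy as np
from sklearn.cluster import KMeans

words = [f"word{i}" for i in range(1000)]   # stand-in source vocabulary
vectors = np.random.randn(len(words), 300)  # stand-in word vectors

n_forms = 50  # however many distinct word forms are available
km = KMeans(n_clusters=n_forms, n_init=10, random_state=0).fit(vectors)

# Each cluster's member list becomes the exemplar set for one output word.
exemplars = {c: [] for c in range(n_forms)}
for word, label in zip(words, km.labels_):
    exemplars[label].append(word)

# The cluster centers are the points that actually get mapped to word forms.
print(km.cluster_centers_.shape)  # (50, 300)
```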

(Another side note: it turns out that K-Means clustering of points in a sphere in Euclidean space produces exactly the same results as clustering points in polyspherical coordinate space--so once again, no coordinate transformation is required!)

If you choose to do K-Means clustering, there is then also the choice to perform clustering before or after doing dimensionality reduction with PCA. Even though dimensionality reduction with PCA keeps the most important information around, when you are going from 300 dimensions down to, say, 3 (for a triliteral root), or even 10 (for a set of pretty darn long words), the stuff that gets thrown out can still be pretty important, so PCA will smush a bunch of stuff together which genuinely does have some kind of objective semantic relation... but whose relations you might be hard-pressed to actually figure out from the list of exemplars! Again, that could be a plus or a minus, depending on what you are going for (and clustering post-PCA will be more computationally efficient), but if you want the most semantically-coherent categories, clustering should be done prior to PCA.
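
In code, that "cluster first, then reduce" ordering might look roughly like this (same placeholder numbers as above; running PCA on just the cluster centers is one reasonable way to do it, not necessarily the only one):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

vectors = np.random.randn(10_000, 300)  # stand-in word vectors

# 1. Cluster in the full 300-dimensional space, so no semantics are thrown away yet.
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(vectors)

# 2. Only then reduce the cluster centers down to the number of phoneme slots.
centers_reduced = PCA(n_components=3).fit_transform(km.cluster_centers_)

# 3. These 50 x 3 coordinates are what finally get mapped to phonemes.
print(centers_reduced.shape)
```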

That is the process that I have currently implemented in the vec2word Python program, hosted in the conlang-software-dev GitHub organization. And for this Lexember, I have been using prompts generated from a filtered (numbers and proper nouns removed) version of a word2vec model generated from a 2017 Wikipedia dump to create vocabulary for 2 new conlangs. This process might not work for everybody, but it's been my most consistent and productive Lexember to date! The exact process I have developed around this tool is not quite what I thought it would be when I first conceived of it, and is in fact different for each language, so I might write up some more on that later. But for now, the software is there for other people to try out, and I wanna get some more experience through the end of Lexember to solidify my process thoughts before putting them out here for the world.

Additional Thought: Useful semantic relations could be automatically extracted by taking the differences of vectors (e.g., "woman - man = queen - king") and looking to see how often applying that difference to some other vector yields another known word. Such vectors are not guaranteed to correspond to any existing regular morphological process in the source language used to build the model, but that's just perfect for providing inspiration for new stuff that you could do in a conlang!

I have not implemented this yet, partly because it would be extremely computationally expensive, but I am very tempted to.
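
For anyone who wants to poke at the idea anyway, a very rough sketch of what the check might look like (sampling word pairs rather than trying all of them, since the full pairwise search is exactly the expensive part; the model path and similarity cutoff are placeholders):

```python
import random
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # placeholder path
vocab = list(model.key_to_index)

def relation_score(a, b, samples=200):
    """Estimate how often applying the (b - a) offset to another word lands on a known word."""
    offset = model[b] - model[a]
    hits = 0
    for w in random.sample(vocab, samples):
        # Nearest known word to (w + offset); a rough proxy for "yields another known word".
        candidate, similarity = model.similar_by_vector(model[w] + offset, topn=1)[0]
        if candidate != w and similarity > 0.6:  # arbitrary cutoff
            hits += 1
    return hits / samples

print(relation_score("man", "woman"))
```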

And, as always, just in case you feel like giving me money: you can do that right here.
