Friday, January 7, 2022

Lexember 2021: A vec2word Retrospective

So last Lexember, I made heavy use of vec2word for machine-assisted vocabulary creation.

As should be expected with an experimental prototype thing, it did not go as smoothly as I had hoped it would. But, it worked well enough that I think I can recommend the concept! In fact, it worked well enough that, for the first time ever, I did not miss a single day of Lexember--and actually produced significantly more than one word per day.

I ended up using vec2word's suggested semantic fields to do my glossopoesis for both Tjugem (an in-progress whistled conlang) and Fysh A (the result of my pondering about speech in the modality of modulated electric fields), but I used it in slightly different ways for each language.

For Fysh A, I generated two cluster lists from the same vector model--one for single-syllable words, and one for two-syllable words. The idea here was that the shorter list of one-syllable words would produce more semantically broad clusters, and short words should be semantically broad so that they can get used a lot. It turns out that was not the best line of thinking after all, because common words are not always super semantically broad! That problem can be ameliorated in a couple of ways, though.

First, Lexember doesn't have enough days to exhaust the complete monosyllabic word list, so there are a lot of monosyllables left over--I could try to continue using the vec2word outputs to assign meaning to all of them, but I can just as well simply decide not to, and get my common-yet-specific words through more traditional Artisanal Lexicon Creation processes. That will mess up the phonosemantic tendencies... but well, natural languages don't have universal phonosemantic systems anyway!

Second, on several occasions I just decided that some particular word was going to mean a much more specific subset of what was suggested by the model. That requires more thought than I originally anticipated, but honestly it's how I would probably recommend using the system if you stick with the cluster-generation algorithm I described in my last vec2word post (i.e., trying to get clusters of words that are as semantically coherent as possible).
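
For the curious, the two-list setup amounts to something like the following sketch--this is not vec2word's actual code, and the vector file, library choices, and cluster counts are just illustrative:

```python
# Rough sketch of the two-list setup: cluster the same word vectors twice,
# with each cluster count matched to how many word shapes of that length the
# phonology allows. The file name and cluster counts are made up.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

vectors = KeyedVectors.load_word2vec_format("vectors.txt")  # hypothetical file
words = list(vectors.index_to_key)
matrix = np.array([vectors[w] for w in words])

def cluster_words(n_clusters):
    """Group every word in the model into n_clusters semantic fields."""
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(matrix)
    fields = {}
    for word, label in zip(words, labels):
        fields.setdefault(label, []).append(word)
    return list(fields.values())

monosyllable_fields = cluster_words(200)   # few clusters -> broad fields
disyllable_fields = cluster_words(4000)    # many clusters -> narrow fields
```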

For Tjugem, I also generated two cluster lists, one for (a subset of) possible single-syllable roots, and one for (a subset of) possible two-syllable roots. These served a very different purpose from the lists for Fysh A, however. The intention was to use the much larger (with correspondingly narrower semantic fields) list of two-syllable roots as slightly-polysemous stems, with single-syllable suffixes corresponding to broader semantic fields to disambiguate the precise meaning--sort of like Chinese two-character compounds.

This approach, however, had a couple of problems: in many cases, two-syllable stems would already have precise enough meanings that any broadly compatible suffix wouldn't actually add anything useful; and conversely, it was often difficult to find a variety of suffixes that would all make sense in different ways with one stem. It was almost always possible, but it required a lot more searching and thought about how meanings might shift or how a useful meaning might be implied by weird combinations. To make this kind of structure work, I suspect it would be better not to try to ensure that the clusters are as coherent as possible--in fact, bothering with vector clustering is probably entirely unnecessary, as one could just produce random jumbles of source words to represent random homophone sets; then, compounds could be automatically generated by finding stems and suffixes that have overlaps in their polysemies.
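
To make that last idea concrete, here's a toy sketch of how the random-homophone-set version might work; none of these forms or glosses are real Tjugem, and this isn't something vec2word does yet:

```python
# Toy sketch: each stem and suffix gets a random jumble of source-word senses,
# and a compound is proposed wherever a stem and a suffix share a sense.
# All forms and glosses here are invented for illustration.
import random

def make_homophone_sets(source_words, forms, senses_per_form=4):
    """Assign each phonological form a random bundle of source-word senses."""
    return {form: set(random.sample(source_words, senses_per_form))
            for form in forms}

def propose_compounds(stems, suffixes):
    """Yield (stem, suffix, shared senses) wherever the polysemies overlap."""
    for stem, stem_senses in stems.items():
        for suffix, suffix_senses in suffixes.items():
            shared = stem_senses & suffix_senses
            if shared:
                yield stem, suffix, shared

glosses = ["water", "river", "flow", "cold", "fish", "swim", "deep", "stone"]
stems = make_homophone_sets(glosses, ["pama", "kilu", "tonsa"])
suffixes = make_homophone_sets(glosses, ["ri", "ko"], senses_per_form=3)
for stem, suffix, senses in propose_compounds(stems, suffixes):
    print(stem + suffix, "narrowed toward", sorted(senses))
```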

It may also be worth exploring entirely different clustering strategies. For example, a proper ontological hierarchy might be generated by, rather than producing all the clusters at once and then sorting them along their centroid dimensions, instead looking for a small number of high-level clusters, each corresponding to an initial phoneme or syllable, and then independently finding another layer of clusters within each of those, and so on, until you have the total number that you want. This is essentially how philosophical languages like Wilkins's Real Character work, although John Wilkins produced his ontology entirely manually!
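
A minimal sketch of what that recursive strategy could look like, assuming you already have a word list and a matrix of their vectors--the branching factor and depth are arbitrary, and this isn't implemented in vec2word:

```python
# Recursively split the vocabulary: a handful of top-level clusters (one per
# initial phoneme or syllable), then another layer of clusters inside each,
# and so on, giving a rough ontological hierarchy instead of a flat list.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_clusters(words, matrix, branching=5, depth=3):
    """Return a nested list-of-lists hierarchy over the given words."""
    if depth == 0 or len(words) <= branching:
        return list(words)
    labels = KMeans(n_clusters=branching, random_state=0).fit_predict(matrix)
    tree = []
    for label in range(branching):
        idx = np.where(labels == label)[0]
        tree.append(hierarchical_clusters([words[i] for i in idx],
                                          matrix[idx],
                                          branching, depth - 1))
    return tree
```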

I also experimented last month with various ways of trying to extract meaningful semantic relations in an unsupervised manner, which might suggest possible morphological processes to add to a conlang. As I predicted, this is an incredibly computationally expensive process, so not much ended up coming of it; however, some new approaches to efficient clustering have been suggested to me, so there are a few more things I might still try in the future.
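
To give a sense of why this is so expensive: one obvious unsupervised framing is to cluster the offset vectors between word pairs and hope that well-populated clusters correspond to reusable relations (plural, agentive, and so on). The number of pairs grows quadratically with vocabulary size, so even a heavily sampled sketch like the hypothetical one below takes a while--and this is just one possible framing, not a description of what vec2word actually does:

```python
# Hypothetical sketch: sample word pairs, cluster their offset vectors, and
# treat well-populated clusters as candidate semantic relations. Sampling
# keeps it tractable; exhaustive pairing would be quadratic in vocabulary size.
import random
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def candidate_relations(words, matrix, n_pairs=50_000, n_relations=100):
    """Cluster sampled offsets; return the word pairs behind each cluster."""
    n = len(words)
    pairs = [tuple(random.sample(range(n), 2)) for _ in range(n_pairs)]
    offsets = np.array([matrix[j] - matrix[i] for i, j in pairs])
    labels = MiniBatchKMeans(n_clusters=n_relations,
                             random_state=0).fit_predict(offsets)
    relations = {}
    for (i, j), label in zip(pairs, labels):
        relations.setdefault(label, []).append((words[i], words[j]))
    return relations
```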

That's pretty much all I have to say about that, but, as always, just in case you feel like giving me money: you can do that right here.
