Tuesday, June 21, 2022

The Phonology of Baseline

Dath ilan is an alternate-history Earth envisioned by Eliezer Yudkowsky; its history diverges from our own at least a couple thousand years ago, and its civilization has achieved a much higher degree of global economic coordination. Part of this increased coordination is that everyone on dath ilan speaks, at minimum, an in-universe conlang called "Baseline". Out-of-universe, Baseline does not actually exist--but descriptions of what it is like do, so I have determined to attempt to remedy the situation. In terms of explicit descriptions of Baseline's phonology, this is all we have:
"For example, all the phonemes are a minimum distance away from each other that guarantees people with slightly less acute hearing can understand it when spoken under slightly adverse conditions. In-between phonemes that are possible to pronounce, but potentially difficult to hear correctly, are then reserved for constructing 'conlangs', constructed languages, many of which use 'Baseline' as a baseline but add new short words using the expanded phoneme set."

That seems... not to be super well supported by the data? Like, Baseline appears to contain all three of s/θ/f, which are easily confusable in low-fidelity audio environments. (It's actually rather difficult to figure out what the objective perceptual distance between different phones is, independent of biases induced by a test subject's pre-existing knowledge of any specific language; the closest I could find to that kind of research is the planning that went into designing the NATO Phonetic Alphabet--but even that is optimized to avoid confusion by speakers of particular popular languages, which is overconstrained for our purposes here. However, when native speakers of some language--like English--do in fact sometimes confuse phonemes of their own language, that seems like strong evidence that the underlying phones are actually pretty close!)

However, fortunately for us, the character who speaks that paragraph is not specifically trained in linguistics, and may not know exactly what he's talking about--and there are other constraints on the design of Baseline which may conflict with that one, such that the optimal design for Baseline phonology is not one which optimizes distinctness-of-phonemes in isolation. In particular, Baseline speakers seem to have a strong sense of syllables as the most salient components of word structure, and count of syllables as the obvious way to measure utterance length; and, they value having short words and short utterances for concepts that are common in their culture. Thus, we can also expect to have a large phonemic inventory to allow for the maximum number of individual syllables, maximum information per syllable, and maximal number of short words, which is in direct conflict with keeping individual phones as far apart from each other in acoustic space as possible.

By skimming all of the "Planecrash" stories (about dath ilani people who are in a plane crash, and get isekaied to various other fantasy worlds to have culture shock in), I have extracted a total of five actual Baseline words-that-are-not-names:

dath
ilan
tsi-imbi
farsheth
kelthorkarnen

And then a bunch of personal names:

Alis       Athpechya  Bahb
Bahdhi     Bohob      Corun
Elshorm    Elzbeth    Helorm
Illeia     Karal      Keltham
Limyar     Merrin     Miyalsvor
Nemamel    Ranthal    Salthin
Thellim    Verrez

Most names have two syllables; a few (4 in this list) have 3, or maybe 4. "Bahb" is the only one-syllable name, but I don't think it is representative of any real name used for a dath ilani person, as it appears in a context where it is clearly meant to be a transcription of the English name "Bob", as part of the set "Alis, Bahb, and Karal", standing in for "Alice, Bob, and Carol", the standard placeholder names for participants in a cryptographic protocol. "Bohob" seems to be an alternative adaptation of "Bob" that fits Baseline naming patterns better. In combination with "Bahdhi", though, the orthographic possibility of "Bahb" suggests the existence of <a> and <ah> as separate vowels. If <h> can only occur in onset positions, there would be minimal ambiguity introduced in the Anglicization by adopting that convention. <Illeia> could be a four-syllable name, but we have a negative example in that <Athpechya> is presented as a dath ilani equivalent for a non-Baseline 4-syllable name, which has been cut down to 3 syllables (assuming <y> is to be interpreted as a consonant). Thus, I am inclined to interpret the intervocalic <i> in <Illeia> as a transcriptional variant of <y>, much like <c> is a transcriptional variant of <k>, rather than as a whole extra syllable.

As a cultural note, all dath ilani are mononymic, so there is nothing to be said about the structure of family names / patronymics.

From this data, I conclude that Baseline has a 6 vowel system:

       Front       Back
High   /i/ <i>     /u/ <u>
Mid    /ɛ/ <e>     /o/ <o>
Low    /æ/ <a>     /ɑ/ <ah>

with three degrees of height, a binary front-back distinction, and rounding in the back non-low vowels.

I would like the <e> vowel to be a little higher, to maximize contrast with /æ/, but we've got an explicit negative example where the dath ilani Merrin struggles to pronounce the French name "Félix", which confirms that the Baseline <e> vowel is not /e/. ¯\_(ツ)_/¯

Attested consonants, based on the assumption that names are supposed to be pronounced in the most obvious possible way for an Anglophone reader, are as follows:

p - /p/
b - /b/
d - /d/
k/c - /k/

f - /f/
v - /v/
s - /s/
z - /z/
th - /θ/
dh - /ð/
sh - /ʃ/
h - /h/

ts - /t͡s/
ch - /t͡ʃ/

l - /L/ (for maximal distinctiveness from /j/, I'm assuming this to be universally a dark/velarized l, rather than copying English's light/dark allophony; the presence of this and /v/ justify the lack of /w/)
r - /r/ (for maximal distinctiveness from /l/, I'll assume this to be a tap/trill even though that's not the most natural reading for most Anglophones).
y - /j/

m - /m/
n - /n/

The lack of /g/ is not typologically odd, but the lack of isolated /t/ (assuming that <ts> is, in fact, an affricate, which seems reasonable given the existence of <ch> and the lack of other /Cs/ clusters in onset positions) in the presence of /p/ and /d/ is a bizarre gap. On that basis, and because there seems to be a fairly robust voicing distinction in the fricatives, I infer that there should also be /t/ and /g/ phonemes, even though they happen to be missing from this dataset. Additionally, I feel we ought to fill in unattested */ʒ/, */d͡z/, and */d͡ʒ/, on the basis that, having decided that voicing was usefully distinctive for all other obstruents, the in-world engineers of Baseline wouldn't have just left those specific place/manner combinations unused!

Now, I want to consider the case of <tsi-imbi> a little more closely; it's the only word with a hyphen in it, and the only word with consecutive identical vowels if you ignore the hyphen. In fact, once the intervocalic <i> of <Illeia> is read as <y>, no attested words have consecutive vowels at all! I infer that this is to maximize the ease of syllable segmentation, and that the hyphen should in fact represent an additional marginal glottal stop (/ʔ/) phoneme (such as shows up in the English "uh-oh"), which shows up wherever vowels would otherwise be in hiatus. That also allows us to resolve any possible ambiguity in the usage of <ah> to transcribe the low-back vowel. Something like <bahob> (a minimal change from the attested <Bohob>) would have to be read as /bæ.hob/, while /bɑob/ would be phonetically [bɑ.ʔ.ob], with extra-metrical /ʔ/, and transcribed as <bah-ob>--and /bɑ.hob/ would be <bahhob>.
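As a minimal sketch of that spelling convention, the function below turns a phonemic word into its Anglicized spelling, writing a hyphen wherever two vowels would meet across a syllable boundary. The name `romanize`, the list-of-syllables input format, and writing the affricates without tie bars (and the lateral as plain "l") are all choices made here for illustration, not anything given in the source material.

```python
# Spellings attested in the consonant and vowel lists above; the unattested
# phonemes (/t/, /g/, /ʒ/, ...) have no spelling given, so they are left out.
VOWELS = {"i": "i", "u": "u", "ɛ": "e", "o": "o", "æ": "a", "ɑ": "ah"}
CONSONANTS = {
    "p": "p", "b": "b", "d": "d", "k": "k", "f": "f", "v": "v",
    "s": "s", "z": "z", "θ": "th", "ð": "dh", "ʃ": "sh", "h": "h",
    "ts": "ts", "tʃ": "ch", "l": "l", "r": "r", "j": "y", "m": "m", "n": "n",
}
SPELLING = {**VOWELS, **CONSONANTS}


def romanize(syllables):
    """Spell a word given as a list of syllables (each a list of phonemes)."""
    letters = []
    for k, syllable in enumerate(syllables):
        # Hiatus: the previous syllable ends in a vowel and this one starts
        # with one, so the (assumed) glottal stop is written as a hyphen.
        if k > 0 and syllables[k - 1][-1] in VOWELS and syllable[0] in VOWELS:
            letters.append("-")
        letters.extend(SPELLING[p] for p in syllable)
    return "".join(letters)


# The three cases discussed above:
print(romanize([["b", "æ"], ["h", "o", "b"]]))  # bahob
print(romanize([["b", "ɑ"], ["o", "b"]]))       # bah-ob
print(romanize([["b", "ɑ"], ["h", "o", "b"]]))  # bahhob
```

Because <h> never closes a syllable, an <ah> followed by a consonant letter or a hyphen can only be the low back vowel, which is what keeps the three spellings distinct.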

Now, this raises a potential problem with the transcription of other consonants; while we have examples of single intervocalic <l> and <r>, there are also a few instances of doubled <ll> and <rr>--but no other doubled consonants. And if we aren't allowing doubled vowels, having geminate continuant consonants across syllable boundaries seems like a very weird choice, completely counter to the goal of making syllabic segmentation easy and unambiguous. One could imagine heterosyllabic /l.ʔ.l/ and /r.ʔ.r/ sequences, with epenthetic glottal stops separating syllables just like they do between vowels, but in the absence of written hyphens in the attested names, I am going to assume that the doubled letters are there purely for purposes of Anglophone aesthetics, and that cross-syllable geminates do not actually exist in Baseline.

That leads to the following consonants chart:

             Bilabial/                           Postalveolar/
             Labiodental   Dental   Alveolar     Palatal         Velar   Glottal
Plosive      p b                    t d                          k g     (ʔ)
Nasal        m                      n
Trill                               r
Fricative    f v           θ ð      s z          ʃ ʒ                     h
Affricate                           t͡s d͡z        t͡ʃ d͡ʒ
Approximant                                      j               L

The fricatives are a little bit weird; I probably would have dropped θ/ð and h in exchange for x/ɣ to maximize distinctiveness and get slightly better correspondence between fricative and plosive series. But perhaps the in-world justification is that they just Wanted More Options for making more short words, and the possibility of x/h confusion pushed for pulling in the dental fricatives instead, despite the labial/dental/alveolar confusability. And for the plosives, I think it would make sense if all of the voiceless plosives were also secondarily aspirated--we've only got two plosive series, so we might as well make them as phonetically distinctive as possible!

We can also state the following apparent phonotactic rules:
  • Syllables have the form (C1)V((r)C2)(s|z), where:
      • C1 is any consonant.
      • C2 is any consonant except /h/.
      • The optional /r/ cannot occur before another /r/ in the C2 slot.
      • The optional final sibilant cannot occur after another sibilant in the C2 slot.
      • /s/ cannot occur after voiced stops/fricatives.
      • /z/ cannot occur after voiceless stops/fricatives.
Within a word:
  • A syllable cannot end with the same consonant with which the next syllable starts (nor should t/d precede t͡s/d͡z or t͡ʃ/d͡ʒ, respectively).
  • Vowels cannot occur in hiatus, and l and r cannot occur in hiatus with themselves, with extra-syllabic glottal stops being inserted for repair.
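As a rough sketch, the syllable-level rules above can be encoded as a small checker. Several things here are assumptions rather than anything the source material states: the phoneme sets include the inferred-but-unattested /t/, /g/, /ʒ/ and voiced affricates; affricates are written without tie bars and the lateral as plain "l"; affricates are counted as sibilants for the final-sibilant rule; and the function names are made up for illustration. The word-level rules are not covered.

```python
VOWELS = {"i", "u", "ɛ", "o", "æ", "ɑ"}
VOICELESS = {"p", "t", "k", "f", "θ"}           # voiceless stops/fricatives
VOICED = {"b", "d", "g", "v", "ð"}              # voiced stops/fricatives
SIBILANTS = {"s", "z", "ʃ", "ʒ", "ts", "dz", "tʃ", "dʒ"}  # assumption: affricates included
SONORANTS = {"m", "n", "r", "j", "l"}
CONSONANTS = VOICELESS | VOICED | SIBILANTS | SONORANTS | {"h"}


def is_valid_coda(coda):
    """Check a coda (sequence of consonants) against ((r)C2)(s|z)."""
    if not coda:
        return True
    # Try every way of reading the coda as [r] + C2 + [final sibilant].
    for has_r in (False, True):
        for has_final in (False, True):
            if len(coda) != 1 + has_r + has_final:
                continue
            c2 = coda[1] if has_r else coda[0]
            if has_r and (coda[0] != "r" or c2 == "r"):
                continue  # the pre-C2 slot only holds /r/, and not before /r/
            if c2 not in CONSONANTS or c2 == "h":
                continue  # C2 is any consonant except /h/
            if has_final:
                final = coda[-1]
                if final not in ("s", "z") or c2 in SIBILANTS:
                    continue  # no sibilant after a sibilant
                if (final == "s" and c2 in VOICED) or (final == "z" and c2 in VOICELESS):
                    continue  # voicing must not clash with a stop/fricative
            return True
    return False


def is_valid_syllable(phonemes):
    """Check a list of phonemes against (C1)V((r)C2)(s|z)."""
    vowel_positions = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if len(vowel_positions) != 1:
        return False
    v = vowel_positions[0]
    onset, coda = phonemes[:v], phonemes[v + 1:]
    if len(onset) > 1 or any(c not in CONSONANTS for c in onset):
        return False
    return is_valid_coda(coda)


print(is_valid_syllable(["ʃ", "o", "r", "m"]))  # True  (<shorm>)
print(is_valid_syllable(["ɛ", "l", "z"]))       # True  (<elz>)
print(is_valid_syllable(["æ", "d", "s"]))       # False (/s/ after voiced /d/)
```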

Making codas more complex than onsets is just weird, and I cannot justify that in-world at all, but that seems to be where the available data is pointing. Maybe it allows sub-syllable-level suffixing/infixing morphology?

We have no data on tone or stress, so I assume by default that Baseline has some sort of non-lexical, predictable stress system--e.g., strict initial stress. However, based on characters commenting on how many syllables are required to say something in various languages, and treating syllable count as a reliable measure of how long an utterance is / how much effort it takes to express something, I infer that the language is syllable-timed, rather than stress- or mora-timed.

Making another default assumption that the maximum onset principle for syllabification applies, the attested syllables are as follows:

a ath
i il im
el elz
bah bahb
beth bi bo
dath dhi
far
he hob
ka kar kel ko
lan le lim lis lorm
ma mel mer mi
ne nen
pech
ral ran rez rin run
sal
sheth shorm
thal tham thel thin thor
tsi
ya yals yar
ver vor
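For anyone who wants to reproduce that list, here is a minimal sketch of the syllabification working directly on the Anglicized spellings. It assumes single-consonant onsets (so "maximum onset" just means the last consonant before a vowel), treats a hyphen as a hard syllable break, reads <ah> as a vowel only when the <h> is not followed by a vowel letter, and keeps doubled <ll>/<rr> as two letters split across the boundary, which is how the list above spells them. The function names are invented for this example.

```python
VOWEL_LETTERS = set("aeiou")
VOWEL_TOKENS = {"a", "ah", "e", "i", "o", "u"}
DIGRAPHS = {"ts", "ch", "sh", "th", "dh"}

WORDS = [
    "dath", "ilan", "tsi-imbi", "farsheth", "kelthorkarnen",
    "Alis", "Athpechya", "Bahb", "Bahdhi", "Bohob", "Corun", "Elshorm",
    "Elzbeth", "Helorm", "Illeia", "Karal", "Keltham", "Limyar", "Merrin",
    "Miyalsvor", "Nemamel", "Ranthal", "Salthin", "Thellim", "Verrez",
]


def tokenize(word):
    """Split a spelling into phoneme-sized tokens; '-' is kept as its own token."""
    w = word.lower()
    tokens, i = [], 0
    while i < len(w):
        two, after = w[i:i + 2], w[i + 2:i + 3]
        if two == "ah" and after not in VOWEL_LETTERS:
            tokens.append("ah")  # <h> is onset-only, so here it marks the vowel
            i += 2
        elif two in DIGRAPHS:
            tokens.append(two)
            i += 2
        else:
            tokens.append("k" if w[i] == "c" else w[i])  # <c> is a variant of <k>
            i += 1
    # Intervocalic <i> is read as the consonant <y> (as in <Illeia>).
    for j in range(1, len(tokens) - 1):
        if (tokens[j] == "i" and tokens[j - 1] in VOWEL_TOKENS
                and tokens[j + 1] in VOWEL_TOKENS):
            tokens[j] = "y"
    return tokens


def split_chunk(tokens):
    """Syllabify a hyphen-free token run, giving each vowel at most one onset consonant."""
    vowels = [i for i, t in enumerate(tokens) if t in VOWEL_TOKENS]
    starts = [0]
    for prev, cur in zip(vowels, vowels[1:]):
        starts.append(cur - 1 if cur - prev > 1 else cur)
    bounds = starts + [len(tokens)]
    return ["".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]


def syllabify(word):
    syllables, chunk = [], []
    for tok in tokenize(word) + ["-"]:
        if tok == "-":
            if chunk:
                syllables.extend(split_chunk(chunk))
            chunk = []
        else:
            chunk.append(tok)
    return syllables


print(syllabify("Athpechya"))  # ['ath', 'pech', 'ya']
print(syllabify("tsi-imbi"))   # ['tsi', 'im', 'bi']
print(sorted({s for w in WORDS for s in syllabify(w)}))
```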

The possible syllables are a much larger set!
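To put a rough number on that, the sketch below generates every coda the (C1)V((r)C2)(s|z) template allows and multiplies out the onsets and vowels. The result depends on assumptions that go beyond the attested data: it uses the full inferred inventory of 24 consonants (including the unattested /t/, /g/, /ʒ/, and voiced affricates, written here without tie bars), counts affricates as sibilants, and ignores the marginal glottal stop. The function names are hypothetical.

```python
VOWELS = ["i", "u", "ɛ", "o", "æ", "ɑ"]
VOICELESS = ["p", "t", "k", "f", "θ"]           # voiceless stops/fricatives
VOICED = ["b", "d", "g", "v", "ð"]              # voiced stops/fricatives
SONORANTS = ["m", "n", "r", "j", "l"]
SIBILANTS = ["s", "z", "ʃ", "ʒ", "ts", "dz", "tʃ", "dʒ"]  # assumption: affricates included
CONSONANTS = VOICELESS + VOICED + SONORANTS + SIBILANTS + ["h"]


def possible_codas():
    """Return the set of distinct codas licensed by ((r)C2)(s|z)."""
    codas = {()}  # the empty coda
    for c2 in CONSONANTS:
        if c2 == "h":
            continue  # /h/ is onset-only
        for pre in ((), ("r",)):
            if pre and c2 == "r":
                continue  # no /r/ before /r/
            finals = [()]
            if c2 not in SIBILANTS:
                if c2 not in VOICED:
                    finals.append(("s",))
                if c2 not in VOICELESS:
                    finals.append(("z",))
            for final in finals:
                # A set, so that e.g. r+s is counted once whether the /s/
                # is read as C2 or as the final sibilant.
                codas.add(pre + (c2,) + final)
    return codas


def count_possible_syllables():
    onsets = len(CONSONANTS) + 1  # any consonant, or no onset at all
    return onsets * len(VOWELS) * len(possible_codas())


print(len(possible_codas()), count_possible_syllables())
```

Under these assumptions that comes to 82 codas and 25 x 6 x 82 = 12,300 possible syllables, against the 52 attested ones listed above.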