Sunday, May 16, 2021

Speech in the Electroreceptive Modality

 A few weeks ago, I saw this SciShow video on electric eels that hunt in packs. 

And then I went to the local aquarium where my kids spent several minutes watching the electric eel and the "shock meter" posted above its tank.

So I got to thinking about how electric field modulation might be used as a modality for a proper language. After a little googling, I found a couple of interesting articles: "Electrocyte Physiology: 50 years later" and "Regulation and modulation of electric waveforms in gymnotiform electric fish".

Which reveal that it is possible to modulate electrocyte activity on millisecond timescales, and that (at least one family of) electric fish can produce multiple simultaneous overlapping waveforms.

From which I conclude that speech in this modality should be totally possible (modulo the fact that we humans happen not to have electrocytes).

The phonological structure of a language in this modality would be considerably less constrained than our own--it could consist of arbitrary combinations of some finite number of simultaneous formants, rather than the particular types of noises that happen to be obtainable from mechanical vibrations of air in a vocal tract with a limited range of geometries. But any language using this modality would naturally be adapted for use in an underwater environment (since electroreception is much less effective in air), and for communication at relatively close range--not quite like protactile communication, because you have the option of lower-fidelity communication at longer range, but kinda similar.

For reference, information in human speech is not encoded in the specific frequencies, but in the relations between frequencies, which allows speech to be frequency-shifted without changing meaning (not the case for, e.g., the tonal language of the aliens from the novel The Jupiter Theft). This is necessary because human vocal tracts come in different sizes! A species using electrostatic communication will have similar constraints, but for a different reason--speaking more "loudly", to cover a longer range, entails a reduction in the maximum available signal frequency, since electrocytes take time to build charge, and the larger charge needed to produce a stronger electric field takes more time to accumulate. Additionally, electric field waveforms are influenced by hormonal and neuro-structural changes in real electric fishes, so the use of specific frequency bands may hold identifying information, just as human voices do.

So, suppose that our intelligent electric fish can generate up to three independent waveforms simultaneously, with differing maximum frequency components, because that seems biologically plausible based on the "Electrocyte physiology" paper. Since real electric fish can modulate electrocyte activity at scales of a couple of milliseconds, we'll put the top of the speech frequency range around 500Hz. That's considerably lower than the top of the standard human speech range used for telephony purposes, but well within the range at which human speech exists, and well above the minimum frequencies that humans can hear. So even though it would surely be plausible to encode language in infrasound, we can maintain even more solid plausibility by keeping at least the near-range frequency bands entirely within the equivalent of the human hearing range.

We can use the lowest-frequency signaling component as an analog to the "voicing bar" in human acoustic phonology. Whatever the fundamental frequency is, it will set the baseline for interpreting all of the higher frequency components. It doesn't need to be expressed all the time (so there can be voiceless segments), but it does need to be expressed frequently, so that listeners don't lose track of the frequency base.

"Voicing" all by itself, however, does not carry linguistic information--it could just be the sound of someone engaged in active sensing. So for any given independent segment, we need at least one additional "formant" (or "just voicing" could be the equivalent of a schwa vowel or something, but I am ignoring that possibility for now)--although there could be "dependent" segments, or sub-segmental features, which involve just the higher formants, or just a single formant. Two-component signals could correspond to vowels, or perhaps more broadly to sonorants--they are unlikely to be "unvoiced" because that leaves you with only a single frequency component which is difficult to interpret as a segment in isolation. Adding in the third "formant" gives you literal con-sonants--sonorants with an additional frequency component added. Just like human acoustic phonology, some segments could be defined by motion of formants... but I am not sure how precise higher-order modulation of electrocyte activity like that might be, so I am more comfortable as a first pass just saying that each significant segment is defined by a fixed pattern of frequency relations, with movement between them being entirely incidental.

So, what can we conclude about this communication medium if we just look at what we know about production, and some generic information-theoretical constraints?

So, suppose that for near-range communication, we use 125Hz as the average base frequency (with some variation between individuals), with 500Hz as the absolute top, soprano-singer level of the available range. For maximum volume, that might scale down to 20Hz base (the lower limit of human hearing) with an 80Hz top. That gives us two full octaves of range in which to place the additional formants--analogous to a typical human vocal range. If we restrict segmental patterns to falling within a single octave, that would allow plenty of room for linguistically-significant tone, if you wanted it, where the whole frequency structure is shifted up or down without altering volume/range.
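
Just to double-check that arithmetic, here's a minimal sketch in Python (the only inputs are the figures above):

  import math

  # Frequency ranges proposed above: (base, top) in Hz.
  RANGES = {
      "near-range voice": (125.0, 500.0),
      "maximum volume": (20.0, 80.0),
  }

  for name, (base, top) in RANGES.items():
      octaves = math.log2(top / base)
      print(f"{name}: {base:.0f}-{top:.0f}Hz spans {octaves:.0f} octaves")

  # Both registers span exactly two octaves, so the same set of
  # segmental frequency ratios fits at either volume.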

Unlike in human articulatory phonology, there is no obvious reason why there should be limits on the articulation of sequences of consonants vs. sonorants. But the need to bring in the "voicing bar" periodically to establish the frequency baseline means that it makes sense to me to define syllable units that always begin with voicing, and may or may not end with voicing. Segmentation can be further improved (especially if many syllables are voiced throughout) if every syllable has a consonant onset and a sonorant rhyme--analogous to a human CV syllable structure. The equivalent of unvoiced onsets would be partially-devoiced syllables, leaving a "voiceless vowel" rhyme; additionally, one could have a voiced onset with a voiceless rhyme, as well as fully voiced syllables. In any case, you get a regular pattern of 3 components, reducing to 2 components (dropping either the voicing bar or the consonant bar), optionally reducing to a single component (a voiceless vowel), before reintroducing all three components for the next syllable.

Presuming that you need at least 2 full cycles of the base frequency to identify said frequency, that implies that light syllables could be spoken at a rate of 10Hz, and heavy syllables at a rate of 6Hz, at maximum volume. A typical English speech rate is 3 to 6 syllables per second (and of course people slow down when speaking loudly), so that should be fine!
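
Making the derivation explicit--note that the cycle counts (about 2 for a light syllable, about 3 for a heavy one) are my assumption, chosen to reproduce the figures above:

  # Syllable rates at maximum volume, where the base frequency is 20Hz.
  BASE_HZ = 20.0

  # Assumed cycle counts: ~2 cycles for a light syllable, ~3 for a
  # heavy one (my numbers, matching the 10Hz/6Hz figures above).
  for label, cycles in [("light", 2), ("heavy", 3)]:
      seconds = cycles / BASE_HZ
      print(f"{label}: {cycles} cycles = {seconds * 1000:.0f}ms "
            f"= {1 / seconds:.1f} syllables/second")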

If total charge is directly proportional to cycle time, that provides a dynamic range of 6.25 times in volume between "normal close range voice" and maximum volume (and of course, one could "whisper" by reducing magnitude further without changing frequency), which works out to a 2.5 times increase in physical range for "yelling at the top of one's voice". Not a lot, but still potentially useful for "talking to one other person" vs. "talking to your whole hunting group". And that advantage is magnified by the fact that you can pack more people within a certain radius in a 3D aquatic environment than you can in a 2D land-based environment.
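
For the record, the 2.5 figure follows if perceived field strength falls off with the square of distance; that inverse-square reading is mine (a dipole-like 1/r^3 falloff would give a smaller gain):

  # Trading frequency for charge: dropping the base from 125Hz to 20Hz.
  normal_hz, loud_hz = 125.0, 20.0
  amplitude_gain = normal_hz / loud_hz  # 6.25x field strength

  # Assumed inverse-square falloff of perceived strength with distance.
  range_gain = amplitude_gain ** 0.5    # 2.5x physical range
  print(amplitude_gain, range_gain)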


Now, what if we imagine some species-specific biophysical phonetic constraints? I'll call the relevant creatures "Fysh" (because they are like fish, but aliens). 

Fysh communication is accomplished by the modulation of electrostatic fields produced by electrocytes and detected by electroreceptor organs. There are a total of 3 electrically-active tissue systems under active neurological control, providing the possibility of producing 3 independent simultaneous frequencies of electrostatic field modulation.

One of these systems, evolved for active sensing, is semi-autonomic, similar to human breathing; it can be actively controlled, but when not under conscious control it produces a constant low-amplitude background pattern. When multiple individuals are near each other, they will instinctively adjust their frequencies to avoid confusion, with the lower-status individual adjusting to higher frequencies.

While the exact waveform is unique to each individual, it is an approximate sinusoid. This system is always used for the lowest frequency component in linguistic communication.

The other two systems are fully voluntary, and each produces sawtooth waves evolved for hunting and stunning prey. Their outputs are mutually indistinguishable, so the specific organ or tissue used to produce a higher or lower formant is irrelevant, and the precise perceived ratio of volumes between these formants may change depending on the relative orientation of the speaker and listener.

The communication channel has a perceptual limit at approximately 20Hz, below which changes in electric field strength are not intuitively perceived by most individuals as being part of a single consistent wave pattern. The upper limit is set by articulatory constraints; Fysh cannot consciously produce frequencies over 500Hz in any of the three systems.

The average fundamental frequency for linguistic communication across all individuals is approximately 125Hz. Individuals can speak arbitrarily "quietly" at any frequency should they so choose, but higher volumes inherently limit the maximum achievable frequency, since there is a minimum time required to build any given level of charge. Shifting down to a 20Hz fundamental allows Fysh to "yell" at about 6.25 times their normal volume (perhaps slightly more, with some distortion), but clear communication is impossible at higher volumes.

Due to volume restrictions on frequency and individual variations in the natural fundamental, Fysh speech segments are independent of absolute frequency (just like humans') and are defined by ratios within chords. Below 100Hz, Fysh can reliably recognize frequency differences of about 2Hz (finer percentage distinctions are perceptible above 100Hz, but the lower, louder end sets the limits for linguistic usage), resulting in around 10 reliably distinct frequencies in the lowest usable octave (20-40Hz). Of course, any given language will not use that many distinctions, but variations in which precise ratios are used can be indicators of different dialects. Also, while I have used octaves as a basis for human reference, just as audio octave equivalence is not a universal experience across human cultures, any particular Fysh culture may or may not actually recognize electrostatic octave equivalence or give it any linguistic significance.
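
A quick enumeration of that lowest octave, assuming a flat 2Hz just-noticeable difference:

  # Distinguishable frequencies in the lowest (loudest) octave, given
  # a flat 2Hz just-noticeable difference below 100Hz.
  JND_HZ = 2.0
  LOW, HIGH = 20.0, 40.0  # one octave above the 20Hz perceptual floor

  frequencies = []
  f = LOW
  while f < HIGH:
      frequencies.append(f)
      f += JND_HZ
  print(len(frequencies), frequencies)
  # -> 10 frequencies: 20, 22, ..., 38 Hz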

Between segments, it is possible for a Fysh to initiate multiple formants quickly enough to be perceptually simultaneous. Ceasing articulation, however, can only be done one formant at a time, and instantaneous transitions between discrete frequencies are not possible.

Linguistically-significant segmental features always extend across at least two full cycles of the fundamental frequency, giving a maximum speech rate of 10 minimum-length segments per second at the bottom of the frequency range. Faster speech is possible at higher frequencies, but between 6 and 10 segments per second is a typical speed range for average speech frequencies.

Independent phones may consist of chords of any combination of 2 or 3 formants, such that they can be identified by the frequency interval between the formants. The one exception is the "schwa" phone, consisting of the base frequency by itself, which needs no second reference frequency because it has a unique distinguishable waveform.

Dependent phones may consist of one or two non-fundamental formants. These only occur as parts of larger utterances which contain a fundamental formant for reference at some point.


With that foundation more precisely specified, we can now consider the phonology and romanization of one specific language, which I shall identify as Fysh A. 

Fysh A features frequent "devoicing", where the fundamental formant is suppressed. Segments are organized into syllables based on a consistent 3-part chord. The possible "notes" of these chords are:

  1. An arbitrary fundamental frequency, roughly analogous to the human voicing feature.
  2. An "a" note, in a frequency band centered on 4/3 times the fundamental.
  3. An "o" note, in a frequency band centered on 3/2 times the fundamental.
  4. An "u" note, in a frequency band centered on 5/3 times the fundamental.
  5. An "i" note, in a frequency band centered on double the fundamental.

Note that the octave span of these frequency bands is essentially coincidental (i.e., I liked it); other Fysh languages may not have a similar structure. They may have more or fewer vowel frequencies, or they may allow a vowel frequency to overlap with the fundamental, being distinguished by waveform.
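
For concreteness, here are the band centers as exact ratios, evaluated against a typical 125Hz fundamental (just a sketch; the widths of the bands are left unspecified):

  from fractions import Fraction

  # Center ratios of the Fysh A note bands, relative to the fundamental.
  NOTES = {
      "a": Fraction(4, 3),
      "o": Fraction(3, 2),
      "u": Fraction(5, 3),
      "i": Fraction(2, 1),
  }

  F0 = 125.0  # Hz, a typical fundamental
  for vowel, ratio in NOTES.items():
      print(f"{vowel}: {ratio} * F0 = {float(ratio) * F0:.1f}Hz")
  # a: 166.7Hz, o: 187.5Hz, u: 208.3Hz, i: 250.0Hz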

Syllables may begin fully voiced, or (exclusively at the beginning of a word) have a delayed-fundamental feature which offsets the initiation of the fundamental formant. All syllables then drop at least one formant, which may be any of the three.

Complex syllables will drop a second formant; the remaining formant cannot be the fundamental. Complex syllables are required word-finally to avoid simultaneous cessation of multiple formants (the avoidance itself is a universal feature of Fysh languages, although complex syllables may not be phonemic in all of them).

Sequential syllables within a word must have at least one matching non-fundamental formant at their boundary. Where a phonemically-simple syllable occurs before another syllable which does not have two matching formants, there is a sub-segmental period in which the non-matching formant is dropped before the new syllable begins.

Between words, a final formant may transition smoothly to a neighboring formant in the following onset, or a sub-segmental length pause may be automatically inserted.

This results in a total of 8 syllable types:

1. Simple Devoiced: syllables which drop the fundamental and are not word-final.
2. Simple Voiced: syllables which drop a higher formant and are not word-final.

These first two types set the basic unit of syllable length.

3. Complex Devoiced: syllables which drop the fundamental followed by a higher formant.
4. Complex Voiced: syllables which drop a higher formant followed by the fundamental.

Complex syllables are the same length as simple syllables, but divide the length evenly among the three parts rather than two.

5 - 8: Voice-delayed. These syllables are one-third longer than non-delayed syllables due to time devoted to the initial unvoiced section. They can only occur word-initially.

Additionally, any syllable can be phonemically lengthened, which extends the time spent on the voiced core.

Epenthetic single-formant subsegments are one-sixth the length of a non-delayed syllable.
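
Collecting the timing rules in one place (lengths in units of a simple, non-delayed syllable, straight from the rules above):

  from fractions import Fraction

  # Lengths in units of a simple, non-delayed syllable.
  UNIT = Fraction(1)
  durations = {
      "simple (voiced core in 2 even parts)": UNIT,
      "complex (3 even parts)": UNIT,
      "voice-delayed (extra 1/3 unvoiced onset)": UNIT + Fraction(1, 3),
      "epenthetic subsegment": UNIT / 6,
  }
  for shape, length in durations.items():
      print(f"{shape}: {length}")
  # Phonemic lengthening extends the voiced core beyond these values.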

Fysh A also features lexical stress, realized as an increase in field amplitude by a factor of about 1.2 compared with immediately adjacent syllables in the same word, with one stressed syllable per word. Stress may also be associated with a proportional drop in frequency and corresponding increase in syllable length, but this is non-contrastive.

Romanized symbols can be used to represent the features of Fysh A syllables in a way that allows them to be mapped onto human pronunciations. As indicated above, single high-formant bands are represented by 4 vowel letters, by analogy with the rhymes of human CV syllables.

Each possible core chord, of which there are 6, is represented by a consonant letter, again by analogy with human CV syllables.

A straightforward mapping of Fysh A segments to Roman letters might use unvoiced letters for 2-part chords and voiced letters for 3-part chords. However, the romanization is shorter and more easily pronounceable if we instead use sonorant letters for the onsets of syllables which drop the fundamental first (because these can stand as entire syllables on their own), and obstruent letters for the onsets of syllables which drop the fundamental second--even though this strategy does not intuitively represent the detailed internal structure of a Fysh A syllable.

Each syllable type is thus romanized as follows:

  1. Simple Devoiced: A single sonorant letter.
  2. Simple Voiced: A sonorant letter followed by a vowel letter.
  3. Complex Devoiced: A voiceless obstruent letter followed by a vowel letter.
  4. Complex Voiced: A voiced obstruent letter followed by a vowel letter.

Voice delay is indicated by a leading <h>, and <e> indicates phonemic gemination of the voiced core of a syllable.

Lexical stress is indicated by an apostrophe at the beginning of the syllable, unless stress is word-initial.

The conventional selections for voiced obstruent, unvoiced obstruent, and sonorant romanizations, paired with their component vowels, are as follows:

  • z, s, l -> ao
  • x, c, r -> au
  • b, p, m -> ai
  • g, k, w -> ou
  • d, t, n -> oi
  • v, f, y -> ui
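
As a sketch, the table and the syllable-type rules can be wired together like so; the function shape is mine, and the placement of <e> between onset letter and vowel is an assumption based on forms like <yei> in the lexicon below:

  # The conventional letter table: for each two-note chord,
  # (voiced obstruent, voiceless obstruent, sonorant).
  CHORD_LETTERS = {
      "ao": ("z", "s", "l"),
      "au": ("x", "c", "r"),
      "ai": ("b", "p", "m"),
      "ou": ("g", "k", "w"),
      "oi": ("d", "t", "n"),
      "ui": ("v", "f", "y"),
  }

  def romanize(chord, syll_type, rhyme=None, delayed=False, geminate=False):
      """Romanize one Fysh A syllable according to the rules above.

      chord: a two-vowel string like "ao"; syll_type: one of
      "simple_devoiced", "simple_voiced", "complex_devoiced",
      "complex_voiced"; rhyme: the vowel letter of the formant that
      remains at the end (all types except simple devoiced).
      """
      voiced_obs, voiceless_obs, sonorant = CHORD_LETTERS[chord]
      onset = {
          "simple_devoiced": sonorant,
          "simple_voiced": sonorant,
          "complex_devoiced": voiceless_obs,
          "complex_voiced": voiced_obs,
      }[syll_type]
      out = ("h" if delayed else "") + onset
      if geminate:
          out += "e"  # assumed placement, as in <yei>
      if syll_type != "simple_devoiced":
          out += rhyme
      return out

  print(romanize("ao", "simple_devoiced"))                          # l
  print(romanize("ui", "simple_voiced", rhyme="i", geminate=True))  # yei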

Suggested conventional pronunciations are as follows:

  • z /z/; s /s/; l /l/
  • x /ʒ/; c /ʃ/; r /ɹ/
  • b /b/; p /p/; m /m/
  • g /g/; k /k/; w /w/, /u/
  • d /d/; t /t/; n /n/
  • v /v/; f /f/; y /j/, /ɨ/
  • h /hə̥/
  • a /a/
  • o /o/
  • u /ʊ/
  • i /i/
  • e /e/, or actual gemination

And that's enough information to write a few computer programs that will generate the complete list of possible syllables, valid sequences of syllables, and words of arbitrary length; and to read the romanization and synthesize an audio representation of the actual electric field patterns that it describes....
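
For instance, a minimal synthesis sketch for a single simple voiced syllable might look like this (numpy/scipy are my tool choices, and everything beyond the two-part simple-syllable timing is simplified):

  import numpy as np
  from scipy.io import wavfile

  RATE = 44100
  NOTES = {"a": 4 / 3, "o": 3 / 2, "u": 5 / 3, "i": 2.0}

  def sawtooth(freq, t):
      # The two voluntary systems produce sawtooth waves.
      return 2.0 * ((freq * t) % 1.0) - 1.0

  def simple_voiced_syllable(f0, chord, dur=0.2):
      """Sine fundamental plus two sawtooth formants; the higher
      formant drops out at the midpoint (simple syllables divide
      their length into two even parts), leaving chord[0] as the
      rhyme."""
      t = np.arange(int(RATE * dur)) / RATE
      fundamental = np.sin(2 * np.pi * f0 * t)
      f1 = sawtooth(f0 * NOTES[chord[0]], t)
      f2 = sawtooth(f0 * NOTES[chord[1]], t) * (t < dur / 2)
      return (fundamental + 0.5 * f1 + 0.5 * f2) / 2.0

  signal = simple_voiced_syllable(125.0, "ao")
  wavfile.write("syllable.wav", RATE, (signal * 32767).astype(np.int16))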


So, I am ready to move on to actual words and grammar!

Now, the interesting part of this is simply being in a weird modality, so I don't intend to put in too much effort on a super intricate grammar--just whatever is needed to produce an introduction to the Conlangery Podcast! But being in this modality, and in the sort of environment that permits this modality (i.e., aquatic) will have some influence on the lexicon, and perhaps on the grammar as well. For example, because electric communication is inherently limited in range, the idea of a speech or address or broadcast to a large audience would be entirely absent--large-scale communication would have to rely on multiple steps of person-to-person repetition. So perhaps "podcast episode" ends up being translated with a word meaning something like "an extended-length memorized text for repetition to many people", with "podcast" somehow derived from that.

Without getting too much into other details of the speakers' anatomy, they must obviously have an electric sense, which suggests there should be basic vocabulary for electrical properties of materials--e.g., high vs. low relative permittivity, and high vs. low voltage.  And we should expect deictics and other spatial terminology addressing a full 3D environment (although with the vertical dimension still distinguished from the horizontal, as it is the axis of both gravity and pressure change)...

But none of that is necessary right now to introduce a podcast!

OK, so the Conlangery intro is: "Welcome to Conlangery, the podcast about constructed languages and the people who create them."

Let's suppose there is in fact a cultural tradition of extended memorized texts that can be passed around; those might be called something like "utterance memories". I don't want a word for "word", because words aren't real, and these are aliens, so why should I impose my Anglophone human ideas of linguistic analysis on their vocabulary? But "utterances" are real, and can be of arbitrary length, so there you go!

So I'm thinking the beginning will end up as something like "Welcome! This is an utterance-memory from language-art". For "podcast", I'm thinking I can go as far as assuming that Fysh will have some kind of mythological cycles within which individual stories might be named and extracted, and that could be generalized to refer to other compendia of knowledge. For "language", I'm thinking "utterance-way". "Constructed" and "people" and "create" are fairly basic vocabulary, so the end result is something like this:

"Welcome! Hear an utterance-memory from utterance-way-art, which is a myth_cycle about created utterance-ways and the people who create them."

That requires the following vocabulary:

  • Welcome - <yei>
  • Hear - <lo>
  • utterance - <re'go>
  • memory - <'pama>
  • way / method - <'weza>
  • art - <'tifu>
  • myth_cycle - <mi'feu>
  • create - <'deoza>
  • person - <'yino>

And enough grammar to combine them in the appropriate ways.

Sticking that vocab into some kind of grammatical framework (head-initial, heavily isolating so I don't have to think too hard about morphology in this system), we get:

<Yei> <lo> IMP  OBJ <'pama>-<re'go>. BE_PART REL OBJ <'tifu>-<'weza>-<re'go>, BE_EXAMPLE REL OBJ <mi'feu>, BE_PART REL OBJ "topic" <'weza>-<re'go>, <'deoza> REL SUBJ <'yino> <'yino>, OBJ "topic" <'yino> <'yino>, <'deoza> REL OBJ "it" "it".

So now I can assign phonological forms to grammatical morphs:

  • IMP - eh, let's go ahead and reduplicate that, just like I did for plurals!
  • OBJ - <za>
  • SUBJ - <gu>
  • BE_PART - <'bala>
  • BE_EXAMPLE - <ye'vi>
  • it / that - <ma>
  • REL - <rea>
  • topic - <'hreyi>

And boom, we've got a translation!

Yei 'lo lo za 'pama re'go 'bala rea za 'tifu 'weza re'go ye'vi rea za mi'feu 'bala rea za 'hreyi 'weza re'go 'deoza rea gu 'yino 'yino za 'hreyi 'yino 'yino 'deoza rea za ma ma.

"Welcome, hear 'memory of utterance' which consists in the art of way of utterance which is a myth-cycle which consists in the topic of way of utterance which people create [and] the topic of people which create them."

Which you can also hear at the beginning of this episode of Conlangery.

