Generating phonologically redundant vocabulary for an engineered constructed language

(draft, comments welcome)

The theory

Most if not all natural languages, and many popular constructed languages, have a large number of minimal pairs, or near-homophones: pairs of words which differ by a single phoneme, or even by a single distinctive feature in one phoneme. Examples in English include "fight" /fAjt/ and "bite" /bAjt/, differing by one phoneme but at least two distinctive features; or "seat" /si:t/ and "seed" /si:d/, differing by a single distinctive feature. Such near-homophones are probably more likely to persist in a natural language (or a constructed language that comes into use as a spoken language) if the words of the pair are unlikely to occur in the same context.

I've been experimenting with some methods to generate the initial root vocabulary for an engineered language in such a way that there will be no minimal pairs. I've tried several different approaches:

  1. no two words differ by less than two phonemes
  2. no two words differ by less than two distinctive features
  3. no two words differ by less than three distinctive features

For instance, if I take the first criterion, and have some roots of the form CVC (consonant-vowel-consonant), then /kan/ could coexist with /kim/, /kel/, /koŋ/ and /kur/, but would block /kam/, /tan/, /ken/, and so forth from being used. I've managed to quantify exactly how this criterion reduces the number of possible words available for a given phonology and phonotactics.

It's necessary first to consider a syllable or word as a series of phoneme slots, each of which can be filled by any of several phonemes (possibly including null). Then represent a word of a given length by an N-dimensional matrix where each coordinate's value represents a particular phoneme in a particular position. For instance, a syllabary table may represent the set of all possible CV syllables in a given phonology: a 2-dimensional matrix. If the language allows an optional final /n/, we can add a second layer, giving a 3-dimensional matrix with slight thickness. We can extend the matrix into any number of dimensions, to allow for words of more than one syllable (e.g. CVCVCV...) or words with more complex structure (CCVCC, etc.).

For instance, a simple CV phonology with three consonants and three vowels might be represented by:

ka  ta  pa
ki  ti  pi
ku  tu  pu

It's clear that there are several ways we could pick three different monosyllables that each differ from the others by two phonemes (e.g., ka, ti, and pu; or ta, pi, and ku; or pa, ki, and tu); but there's no way to get more than three out of it. We can represent this by blocking off all the spaces in the same row and column as a word we've picked.

Step 1:

ka  ##  ##
##  __  __
##  __  __

Step 2:

ka  ##  ##
##  ti  ##
##  ##  __

Step 3:

ka  ##  ##
##  ti  ##
##  ##  pu
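
The blocking procedure above is easy to script. Here is a minimal sketch in Python rather than Perl; it illustrates the idea, and is not one of my actual scripts. Picking words this way amounts to requiring a Hamming distance of at least two between any two chosen cells:

    from itertools import product

    def redundant_words(slots, min_distance=2):
        """Greedily pick words (one phoneme per slot) so that every
        pair of chosen words differs in at least min_distance slots."""
        chosen = []
        for word in product(*slots):
            if all(sum(a != b for a, b in zip(word, w)) >= min_distance
                   for w in chosen):
                chosen.append(word)
        return chosen

    words = redundant_words([['k', 't', 'p'], ['a', 'i', 'u']])
    print([''.join(w) for w in words])   # ['ka', 'ti', 'pu']

Iterating over the cells in simple row-major order happens to find the diagonal in this small example.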

If we extend this into three dimensions, by allowing a final consonant, the second layer we add allows us to get three more redundant words out of the system (e.g., /kin/, /tun/, /pan/); as does the third (/kum/, /tam/, /pim/) — but a fourth layer (four final consonants, or three consonants plus null) yields no additional benefit in terms of the maximum number of words available. It's easy to see why: for each cell of the matrix representing a word we pick, we must block off not only the cells on the same row or column of the same layer, but also the cell at the same row and column of every other layer. After we've picked nine words from the first three layers of our 3x3x4 matrix, all the cells on the fourth layer are blocked off.
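
Reusing the redundant_words sketch from above, the three-dimensional case comes out the same way (the finals /n m l/ plus null are an arbitrary choice for the demo):

    # '' stands for the null final (no final consonant).
    slots = [['k', 't', 'p'], ['a', 'i', 'u'], ['n', 'm', 'l', '']]
    print(len(redundant_words(slots)))   # 9 -- a fourth value in the
                                         # final slot doesn't raise it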

If the phonotactics of the language don't allow certain phonemes to occur next to each other, we can pre-block certain cells of the matrix (representing words in which those forbidden combinations occur) before we start searching. This may reduce the total number of words available. For instance, if we have initial consonants /k/, /j/, /m/; medial vowels /i/, /u/, /a/; and the same consonants in final position, it's obvious we can get up to nine redundant CVC words. But if we forbid the sequence /ji/, we must pre-block three cells (one in each layer), and can get no more than eight redundant words (two of the cells blocked because they contain /ji/ would have been blocked by our choice of other words anyway). If we forbid /ij/ as well, then by the simple algorithm described so far we must pre-block five cells (one cell, representing /jij/, violates both constraints). It's still possible to get eight redundant words under these constraints, as Alex Fink pointed out in private email; he suggests starting with a set of nine redundant words that contains /jij/ and then dropping it, getting e.g. /jam juk mik maj mum kim kak kuj/.

I haven't yet gotten around to figuring out how to generalize this insight for an arbitrary set of constraints in any number of dimensions, or to implementing it in my scripts, but I plan to work on it the next time I'm getting ready to generate vocabulary for a particular conlang (e.g., when I eventually build a large enough corpus for säb zjed'a to make corpus frequency analysis meaningful, and am ready to relex it with a new (hopefully more euphonious!) set of redundant morphemes). One way to do it might be to first figure out how many constraints each cell violates, find a cell that violates the most constraints, and let that be the starting point for the script's geometrical iteration over the set of all cells, then remove the offending word we started with when done. Or iterate over the cells in order of most to least constraints violated, instead of in the straightforward geometrical order described below, and then drop all words that violate any constraints at the end? This needs a lot more work. [updated 2008/8/7]
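
Alex Fink's hand-found example can at least be checked mechanically; in this small sketch the word list is the one quoted above, and only the checking code is new:

    # A full nine-word redundant set containing the doubly forbidden
    # /jij/; dropping the offending word leaves eight legal words.
    fink = ['jij', 'jam', 'juk', 'mik', 'maj', 'mum',
            'kim', 'kak', 'kuj']
    assert all(sum(a != b for a, b in zip(x, y)) >= 2
               for i, x in enumerate(fink) for y in fink[i + 1:])
    legal = [w for w in fink if 'ji' not in w and 'ij' not in w]
    print(legal)   # the eight words that respect both constraints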

The maximum number of redundant words to be extracted from a given phonology, assuming there are no forbidden sequences, is equal to the product of the widths of the matrix in all dimensions except the widest. (If the width is equal in all dimensions — that is, if the same number of phonemes can occur in every slot — we can treat any one of them as the widest, discard it, and multiply the others; e.g., for a 3x3x3 matrix (representing CVC syllables with three consonants and three vowels), 3 * 3 = 9.) So for instance, with ten initial consonants, an optional semivowel (one of two, so three possibilities for the second slot), one of five vowels, and an optional final nasal (one of three, so four possibilities for the fourth slot), the dimensions are 10x3x5x4, and the maximum number of redundant words is 3 * 5 * 4 = 60. It wouldn't matter if there were only 5 initial consonants allowed, or as many as 100; the maximum number of words would still be 60. The redundant words available in an N-slot phonology are bottlenecked by the (N-1) smallest dimensions of the matrix representing it.
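
The bound itself is a one-liner (again Python rather than Perl, purely for illustration):

    from math import prod

    def max_redundant(dims):
        """Product of all matrix widths except (one copy of) the
        largest."""
        return prod(dims) // max(dims)

    print(max_redundant([3, 3, 4]))       # 9
    print(max_redundant([10, 3, 5, 4]))   # 3 * 5 * 4 = 60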

We can extend this method to finding words that differ by at least two distinctive features. Each dimension of the matrix now represents a distinctive feature of a phoneme in a given slot. For instance, CV syllables with voiced and unvoiced fricatives and plosives at three points of articulation, and front or back vowels with three heights, could be represented by a five-dimensional matrix: one dimension (width 3) represents the point of articulation of the consonant, another dimension (width 2) represents its manner of articulation (plosive or fricative), a third dimension (width 2) represents its voicing, and the other two dimensions represent the vowel's height and frontness/backness. If we follow the same method as before to pick out cells representing words and block the cells in the same row, column, stack, etc. as each word-cell picked so far, we will come up with a set of CV words where every word differs from the others by at least two distinctive features — perhaps both in the same phoneme (/pa/ vs. /va/ vs. /ga/), perhaps in different phonemes (/pa/ vs. /bu/ vs. /fo/).
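
The redundant_words sketch from above works here unchanged if we feed it feature values instead of phonemes; the symbol tables below are one plausible reading of the inventory just described, chosen only for the demo:

    from itertools import product

    # Consonant = (place, manner, voicing); vowel = (height, backness).
    consonants = dict(zip(product(range(3), range(2), range(2)),
                          'p b f v t d s z k g x ɣ'.split()))
    vowels = dict(zip(product(range(3), range(2)), 'i u e o a ɑ'.split()))

    # Five feature slots: 3 x 2 x 2 for the consonant, 3 x 2 for the vowel.
    feats = redundant_words([range(3), range(2), range(2), range(3), range(2)])
    print(len(feats))   # size of the doubly-distinct set the greedy found
    print([consonants[w[:3]] + vowels[w[3:]] for w in feats[:6]])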

It's interesting to note that the order in which we pick words (and thus block off other potential words) can determine how many words we get. For instance, with a 3x3x3 matrix we can get as few as seven words if we pick cells in an unwise sequence. Generally the best method I've found is to start in a corner, proceed diagonally whenever we've just filled a cell successfully, stay in one plane until it is completely filled in, and use the same row/column as our starting point when proceeding to the next plane. This always works when the matrix is equally thick in all dimensions, and usually when different dimensions have different thicknesses. There are some cases where this doesn't fill in the matrix as efficiently as possible; with some combinations of thicknesses in four or more dimensions, the direction you proceed in matters a lot. I haven't figured out all the details yet.
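
I can't promise the following is exactly equivalent to the corner-and-diagonal procedure just described, but the same diagonal idea can be phrased as a modular-sum rule that provably meets the bound from the previous section for any combination of widths: keep every cell whose coordinates sum to zero modulo the largest width.

    from itertools import product

    def diagonal_fill(dims):
        """Cells with coordinate sum 0 mod the largest width.  Two such
        cells can never differ in only one slot, and for each setting
        of the other slots exactly one value of the widest slot
        qualifies, so the count equals the product-of-the-rest bound."""
        m = max(dims)
        return [c for c in product(*[range(d) for d in dims])
                if sum(c) % m == 0]

    print(len(diagonal_fill([3, 3, 4])))      # 9
    print(len(diagonal_fill([10, 3, 5, 4])))  # 60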

I have found, empirically, that filling in the cells in a random order instead of in the above systematic way has more adverse consequences the higher the number of phoneme slots. With three dimensions each of thickness 3, you can do no worse than find seven redundant words. In general, as long as you have only three dimensions, random order is not much worse than systematic order. But in four or more dimensions, the average performance of a random fill-in gets worse and worse.
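
That claim is easy to probe empirically with the same greedy rule (a sketch; the trial count and dimensions are arbitrary):

    import random
    from itertools import product

    def random_fill(dims, trials=2000, min_distance=2):
        """Greedy fill in a random cell order; report the worst and
        best set sizes seen across the trials."""
        cells = list(product(*[range(d) for d in dims]))
        sizes = []
        for _ in range(trials):
            random.shuffle(cells)
            chosen = []
            for c in cells:
                if all(sum(a != b for a, b in zip(c, w)) >= min_distance
                       for w in chosen):
                    chosen.append(c)
            sizes.append(len(chosen))
        return min(sizes), max(sizes)

    print(random_fill([3, 3, 3]))      # never worse than 7 (of 9), as above
    print(random_fill([3, 3, 3, 3]))   # with four slots the shortfall grows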

Interaction with other design criteria

Any of the above criteria for a minimum degree of redundancy is in tension with other desirable goals for a usable constructed language: conciseness and euphony. With any given phonology, the first redundancy criterion drastically reduces the number of words available at a given length. Without this criterion, a given phonology might yield hundreds of monosyllables and thousands of disyllables, never needing any trisyllables; but if this degree of redundancy is required, the same phonology yields only a few tens of monosyllables and a few hundred disyllables, requiring trisyllables and perhaps even tetrasyllables for thousands of less common words. This applies a fortiori if the phonology is designed for a high degree of euphony (which typically means a limited phoneme inventory and tight restrictions on permitted consonant clusters and diphthongs, therefore narrowing the matrix's thickness and pre-blocking many word-cells). So strict use of this criterion would probably produce a fairly verbose language -- either a large core vocabulary with many common words being polysyllabic roots, or a small core vocabulary with many common words being long compounds of short roots.

An alternate, more conservative way of using these methods would be to apply them not to the vocabulary of a language as a whole, but to sets of words within a given semantic domain, or a given distributional category. Thus one would ensure that no two words likely to occur in the same context form a minimal pair. For instance, with a CVC phonology with 10 consonants and 5 vowels, one might generate several different internally redundant sets of 50 words, and use one set for physical verbs, one for mental verbs, one for concrete adjectives, one for abstract adjectives, one for animate nouns, one for inanimates, etc. Words in a given category might have near-homophones in another category, but not in the same category. I would be interested in hearing from anyone who decides to use this approach in one of their conlangs, especially if you find my scripts helpful.
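
One way to implement this partitioning (a hypothetical helper built on the greedy sketch from earlier, not one of my actual scripts): repeatedly extract an internally redundant set and remove it from the pool, checking distances only within each set.

    from itertools import product

    def domain_sets(slots, n_sets, min_distance=2):
        """Carve out several disjoint, internally redundant word sets;
        words keep their distance within a set but may form minimal
        pairs with words in other sets (other semantic domains)."""
        used, sets = set(), []
        for _ in range(n_sets):
            chosen = []
            for word in product(*slots):
                if word not in used and all(
                        sum(a != b for a, b in zip(word, w)) >= min_distance
                        for w in chosen):
                    chosen.append(word)
            used.update(chosen)
            sets.append(chosen)
        return sets

    cvc = [list('ptkbdgmnlr'), list('aeiou'), list('ptkbdgmnlr')]
    print([len(s) for s in domain_sets(cvc, 4)])   # up to 50 words apiece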

In April 2006 I started working on säb zjed'a, an engineered language that used this methodology to generate the set of wordforms from which its initial vocabulary was taken. The self-segregating morphology scheme I've considered has fricatives always and only at the beginning of a morpheme, optionally followed by any of several stops, nasals, liquids and semivowels before the first vowel, with the same set of non-fricative consonants allowed in final position; this results in a few hundred redundant monosyllables, most of them easily pronounceable though not always euphonious. So far säb zjed'a is a lexically minimalist language, with more morphemes than Toki Pona or Ygyde but fewer than Lojban or gjâ-zym-byn. I expect I will use this same methodology (with the improvements suggested by Alex Fink) when I eventually relex the language based on a corpus frequency analysis, after its corpus grows large enough for such an analysis to be meaningful — though with a different set of phonology input files; I'm finding the current phonology not euphonious enough to be fun to work with, which is why building the corpus to analyzable size has been so slow.

The scripts

I wrote several Perl scripts to generate redundant vocabulary for a given phonology. Here they are:



Last updated August 2008