Unlocking the Secrets of Language: The Power of 66 Words
Understanding Language Efficiency
At PAT, we are working to build systems that genuinely understand human language, the way people do when they communicate.
Recent advances in machine learning have sparked debates about machine sentience, yet current models have inherent limitations in grasping the nuances of human language. That tension leads us to examine the mechanics of human language and to advocate a linguistics-based approach to natural language understanding (NLU), rather than reliance on simpler representations such as word vectors.
Insights from Linguistic Research
I had the opportunity to discuss my NLU research with Bill Foley, a prominent figure in the development of Role and Reference Grammar (RRG). His insights into exotic languages have been pivotal in shaping theoretical frameworks within linguistics. As the foundational 1984 text he co-authored with Robert D. Van Valin, Jr. puts it:
"Data from exotic languages, such as Austronesian, Papuan, Australian, and American Indian languages, have played decisive roles in the formulation of many of our theoretical concepts."
He recommended starting our analysis with the Langenscheidt Basic German Vocabulary, which highlights the essential words for language learners. Intriguingly, it reveals that:
"The German language, like any other language, is comprised of millions of words, yet 50% of normal spoken and written texts consist of just 66 words."
This prompts an analysis of these words in English to uncover their significance. The introductory remarks of the book further emphasize the importance of learning the most utilized words:
"Students rightfully ask, which words do I have to learn in order to carry on an everyday conversation… The magic answer is usually 2,000 words … the most important words used in 80% of all written and oral communication."
Given that a mere 66 words account for 50% of typical text, and the first 2,000 words account for 80%, we aim to exploit this frequency distribution in our NLU systems: get the high-frequency core right first, and broad coverage follows. The arithmetic is sketched below.
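To make that arithmetic concrete, here is a minimal sketch of how cumulative coverage can be computed over any tokenized corpus. The corpus file name and cutoffs are hypothetical; this is an illustration, not our production pipeline.

```python
from collections import Counter

def cumulative_coverage(tokens, cutoffs=(66, 100, 2000)):
    """Report what share of all tokens the top-N most frequent words cover."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Frequencies sorted from most to least common headword.
    ranked = [freq for _, freq in counts.most_common()]
    for n in cutoffs:
        covered = sum(ranked[:n])
        print(f"top {n:>5} words cover {covered / total:.1%} of all tokens")

# Hypothetical usage with any tokenized text:
# cumulative_coverage(open("corpus.txt").read().lower().split())
```

On a large, balanced corpus, the printed shares climb steeply at first and then flatten, which is exactly the pattern the Langenscheidt figures describe.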
The Role of Indexes and Symbols
C.S. Peirce's classification of signs into icons, indexes, and symbols offers valuable insight. Icons resemble their referents; indexes point to theirs (e.g., "that" directing attention to an object in the environment); and symbols rest on cultural convention, such as English speakers agreeing to call canines "dogs."
Language leans heavily on indexes and symbols. Symbols align well with the WordNet project, which focuses on "open class" words, but many "closed class" words, crucial for understanding context, are overlooked. Words like "I," "here," and "now" are proforms that derive meaning solely from their context. Without tracking the context of utterance (CoU), a statement such as "I went there with her today" is unresolvable: who is "I," where is "there," who is "her," and which day is "today"? A sketch of CoU tracking follows.
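Here is a minimal sketch of how tracking the context of utterance lets a system resolve index words. The structure and field names are our own hypothetical choices for illustration, not an established API.

```python
from dataclasses import dataclass

@dataclass
class ContextOfUtterance:
    """Minimal CoU: who is speaking, where, when, and what is salient."""
    speaker: str
    place: str
    time: str
    salient_referents: dict  # e.g. {"her": "Maria", "there": "the library"}

def resolve_indexicals(tokens, cou):
    """Replace index words with their referents from the context of utterance."""
    fixed = {"i": cou.speaker, "here": cou.place,
             "now": cou.time, "today": cou.time}
    return [cou.salient_referents.get(t, fixed.get(t.lower(), t))
            for t in tokens]

cou = ContextOfUtterance("Alice", "Sydney", "2023-05-01",
                         {"her": "Maria", "there": "the library"})
print(resolve_indexicals("I went there with her today".split(), cou))
# ['Alice', 'went', 'the library', 'with', 'Maria', '2023-05-01']
```

The point is not the lookup itself but what it requires: a running record of speaker, place, time, and salient referents that pure word-level statistics never maintain.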
Analyzing the First 100 Words
The first 100 English words can be categorized through various lenses, whether by word form or by meaning. Rather than merely tagging them with parts of speech, we adopt the RRG layered model, which accounts for when, where, and why. Labeling words like "there" and "then" as mere adverbs fails to acknowledge their distinct locational and temporal roles, as the sketch below illustrates.
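A minimal sketch of such a classification follows. The category labels and word lists here are our own illustrative assumptions, loosely echoing RRG's separation of temporal and locational operators; they are not the full RRG inventory.

```python
# Hypothetical mapping: instead of tagging everything "adverb", distinguish
# the locational and temporal roles that the RRG layered model treats apart.
OPERATOR_CLASS = {
    "there": "deictic-location",   # where
    "here":  "deictic-location",
    "then":  "deictic-time",       # when
    "now":   "deictic-time",
    "why":   "interrogative-reason",
}

def classify(word):
    return OPERATOR_CLASS.get(word.lower(), "open-class or other")

for w in ("there", "then", "quickly"):
    print(w, "->", classify(w))
```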
These 100 headwords account for 50% of all the words used in the Oxford English Corpus. For an NLU system to emulate English accurately, then, it must handle these words correctly.
The first video, "Do 100 words REALLY unlock 50% of a language?", explores this idea, examining how far a small vocabulary can carry everyday communication.
Closed vs. Open Word Classes
Analyzing the most frequently used words reveals that the majority belong to closed classes, underscoring their role as the building blocks of English. Conjunctions such as "and," "but," and "or," for instance, carry little meaning as isolated words; their function only emerges at the phrase and clause level, where they link constituents. A rough measurement of closed-class density is sketched below.
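As a rough illustration, one can measure how much of a text consists of closed-class words. The inventory below is a small hypothetical sample, nowhere near exhaustive.

```python
# Hypothetical, abbreviated closed-class inventory for illustration only.
CLOSED_CLASS = {
    "the", "a", "an",                      # determiners
    "and", "but", "or",                    # conjunctions
    "i", "you", "he", "she", "it", "we",   # pronouns
    "in", "on", "at", "to", "of", "with",  # prepositions
}

def closed_class_share(text):
    """Fraction of tokens drawn from the closed-class inventory."""
    tokens = text.lower().split()
    hits = sum(1 for t in tokens if t in CLOSED_CLASS)
    return hits / len(tokens)

print(f"{closed_class_share('I went there with her today and she was happy'):.0%}")
```

Even with this tiny word list, a large share of ordinary sentences lights up, which is the frequency effect the Langenscheidt figures capture.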
Understanding Word Vectors
Word vectors, popularized by Google's word2vec paper in 2013, have real limitations. Each word form is reduced to a single point in vector space, so "similarity" reflects aggregate co-occurrence statistics rather than meaning in context. Linguist J.R. Firth famously stated, "You shall know a word by the company it keeps," but statistical company-keeping alone is no substitute for a linguistic framework.
Systems relying on word vectors therefore struggle with much of the complexity of human language, above all with index words, whose referents shift with every utterance and so cannot be pinned to a fixed vector. The sketch below makes this concrete.
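Here is a minimal numpy sketch of the standard cosine-similarity comparison. The four-dimensional vectors are made-up toys, not trained embeddings; the point is structural, not numerical.

```python
import numpy as np

def cosine(u, v):
    """Standard word-vector similarity: cosine of the angle between embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings, purely illustrative.
vectors = {
    "dog":   np.array([0.9, 0.1, 0.0, 0.2]),
    "puppy": np.array([0.8, 0.2, 0.1, 0.3]),
    "i":     np.array([0.1, 0.9, 0.4, 0.0]),
}

print(cosine(vectors["dog"], vectors["puppy"]))  # high: related symbols
# But "i" gets one fixed vector no matter who is speaking; no similarity
# score can say *who* "I" refers to in a given utterance.
print(cosine(vectors["i"], vectors["dog"]))
```

Cosine similarity works tolerably for symbols like "dog" and "puppy"; it has nothing to say about an index word whose meaning is supplied by the context of utterance.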
Exploring Cultural Linguistics
Current NLU technology is further constrained by its reliance on language-specific models. Because each language has unique characteristics, a separate word-vector network must be built per language, yet these networks struggle to interoperate because they encode signs rather than meaning.
These limitations make intent matching brittle: when an utterance is misclassified, developers typically patch the system by adding or adjusting example texts, rather than drawing on any robust understanding of the language. The sketch below shows the pattern.
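A minimal sketch of that common pattern follows: embed the utterance, then pick the nearest intent centroid. The intents are hypothetical, and embed() is a toy character-frequency stand-in for a real trained encoder.

```python
import numpy as np

def embed(text):
    """Toy 'embedding': normalized character-frequency vector, illustration only."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    return v / (np.linalg.norm(v) or 1.0)

# Hypothetical intents, each defined only by its example utterances.
INTENTS = {
    "greeting": [embed("hello there"), embed("hi, how are you")],
    "weather":  [embed("what is the weather"), embed("will it rain today")],
}
CENTROIDS = {name: np.mean(vecs, axis=0) for name, vecs in INTENTS.items()}

def match_intent(utterance):
    """Return the intent whose centroid is closest to the utterance embedding."""
    v = embed(utterance)
    return max(CENTROIDS, key=lambda name: float(np.dot(v, CENTROIDS[name])))

print(match_intent("hiya"))  # outcome depends entirely on example coverage
```

Nothing in this loop understands the utterance; when it guesses wrong, the only remedy is more examples, which is precisely the brittleness described above.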
Conclusion: A Path Forward
In our pursuit of human-like NLU systems, we must acknowledge the strengths and weaknesses of existing technologies. The significance of context in language comprehension cannot be overstated, and to succeed, we need systems built on advanced linguistic models like RRG that prioritize meaning within context.
The second video, "50 People Tell Us Words Or Phrases Only Their State Uses | Culturally Speaking," provides insight into regional linguistic variations, showcasing the rich diversity of language and its cultural implications.
References
[i] Foley, William A., and Robert D. Van Valin, Jr., Functional Syntax and Universal Grammar, Cambridge University Press, 1984.
[ii] Langenscheidt Basic German Vocabulary, Langenscheidt, 1991, p. VII.