Human languages with greater information density have higher communication speed but lower conversation breadth
Nature Human Behaviour (2024)
Abstract
Human languages vary widely in how they encode information within circumscribed semantic domains (for example, time, space, colour, human body parts and activities), but little is known about the global structure of semantic information and nothing about its relation to human communication.
We first show that across a sample of ~1,000 languages, there is broad variation in how densely languages encode information into words.
Second, we show that this language information density is associated with a denser configuration of semantic information.
Finally, we trace the relationship between language information density and patterns of communication, showing that informationally denser languages tend towards faster communication but conceptually narrower conversations or expositions within which topics are discussed at greater depth.
These results highlight an important source of variation across the human communicative channel, revealing that the structure of language shapes the nature and texture of human engagement, with consequences for human behaviour across levels of society.
Data availability
The datasets analysed in the current study are available at the following links: for parallel corpora, https://opus.nlpl.eu/; for conversations, https://www.ldc.upenn.edu/; for audio duration, https://wordproject.org and https://faithcomesbyhearing.com; for language family and location, https://glottolog.org/; for Wikipedia articles, https://pypi.org/project/Wikipedia-API/; for language fusion and informativity, https://github.com/OlenaShcherbakova/Sociodemographic_factors_complexity/tree/v2.0/data; and for morphological complexity, https://github.com/mllewis/langLearnVar. The language information density, semantic density and communicative speed measures can be found at https://github.com/peteaceves/Language_Density_and_Communication.
Code availability
The code used to create the language information density and semantic density measures was written in Python v.3.7.2 and can be found at https://github.com/peteaceves/Language_Density_and_Communication. The statistical models were run in Stata v.17, and the code can be found in the Supplementary Information.
Acknowledgements
We thank Z. Chen (Stanford University) for her excellent research assistance and are grateful for comments from C. Chambers (Johns Hopkins University), M. Lewis (Meta), D. Casasanto (Cornell University), G. Lupyan (University of Wisconsin-Madison), J. L. Martin (University of Chicago), A. Sharkey (Arizona State University), J. Murphy (RAND Corporation) and J. Chu (Massachusetts Institute of Technology). P.A. also thanks the judges of the 2017 INFORMS/Organization Science Dissertation Proposal Competition for their feedback and acknowledges support from the National Science Foundation Doctoral Dissertation Research Improvement Grant (no. 1702788). The funder had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.