Tuesday, February 20, 2024

Human Language: Info Density Grants Speed But Less Breadth


Table link


Human languages with greater information density have higher communication speed but lower conversation breadth

article link

Abstract

Human languages vary widely in how they encode information within circumscribed semantic domains (for example, time, space, colour, human body parts and activities), but little is known about the global structure of semantic information and nothing about its relation to human communication.

We first show that across a sample of ~1,000 languages, there is broad variation in how densely languages encode information into words.
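As a rough illustration of what "information per word" can mean, the sketch below computes the Shannon entropy of a text's word distribution in bits per token. This is not the authors' measure (their code is in the repository listed under Code availability); the function names `unigram_entropy_bits` and `density_proxy` and the whitespace tokenisation are assumptions made for this example.

```python
import math
from collections import Counter


def unigram_entropy_bits(tokens):
    """Shannon entropy (bits per token) of the empirical word distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Sum of -p * log2(p) over every distinct word type.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def density_proxy(text):
    """Hypothetical helper: token count and bits per token for one text.

    Assumes whitespace tokenisation, which is a simplification and does
    not suit languages written without spaces between words.
    """
    tokens = text.lower().split()
    return {"tokens": len(tokens), "bits_per_token": unigram_entropy_bits(tokens)}
```

Applied to translations of the same passage, a language that conveys the content in fewer tokens with higher entropy per token would count as denser under this crude proxy. Real estimates need much larger parallel corpora and more careful entropy estimators.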

Second, we show that this language information density is associated with a denser configuration of semantic information.

Finally, we trace the relationship between language information density and patterns of communication, showing that informationally denser languages tend towards faster communication but conceptually narrower conversations or expositions within which topics are discussed at greater depth.

These results highlight an important source of variation across the human communicative channel, revealing that the structure of language shapes the nature and texture of human engagement, with consequences for human behaviour across levels of society.

Data availability

The datasets analysed in the current study are available at the following links: for parallel corpora, https://opus.nlpl.eu/; for conversations, https://www.ldc.upenn.edu/; for audio duration, https://wordproject.org and https://faithcomesbyhearing.com; for language family and location, https://glottolog.org/; for Wikipedia articles, https://pypi.org/project/Wikipedia-API/; for language fusion and informativity, https://github.com/OlenaShcherbakova/Sociodemographic_factors_complexity/tree/v2.0/data; and for morphological complexity, https://github.com/mllewis/langLearnVar. The language information density, semantic density and communicative speed measures can be found at https://github.com/peteaceves/Language_Density_and_Communication.

Code availability

The code used to create the language information density and semantic density measures was written in Python v.3.7.2 and can be found at https://github.com/peteaceves/Language_Density_and_Communication. The statistical models were run in Stata v.17, and the code can be found in the Supplementary Information.

Acknowledgements

We thank Z. Chen (Stanford University) for her excellent research assistance and are grateful for comments from C. Chambers (Johns Hopkins University), M. Lewis (Meta), D. Casasanto (Cornell University), G. Lupyan (University of Wisconsin-Madison), J. L. Martin (University of Chicago), A. Sharkey (Arizona State University), J. Murphy (RAND Corporation) and J. Chu (Massachusetts Institute of Technology). P.A. also thanks the judges of the 2017 INFORMS/Organization Science Dissertation Proposal Competition for their feedback and acknowledges support from the National Science Foundation Doctoral Dissertation Research Improvement Grant (no. 1702788). The funder had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
