[CLLE] Register Variation in Massively Multilingual Web Data - Can We Finally Navigate the Jungle? (V. Laippala)
Published 15 January 2026; updated 30 January 2026
11 February 2026
14:00-15:00
Room D30, Maison de la Recherche
Veronika Laippala, a faculty member at the University of Turku (Finland), will use her teaching mobility visit to UT2J to present the work of the NLP (natural language processing) research group she belongs to: the TurkuNLP Group.
Veronika Laippala is on a teaching mobility visit at UT2J from 9 to 13 February.
We are taking this opportunity to hold a seminar giving an overview of the work carried out by the TurkuNLP group at the University of Turku (Finland). The seminar will take place at the Maison de la Recherche at UT2J, in room D30, from 14:00 to 15:00 on Wednesday 11 February.
Abstract:
Web-scale datasets are extremely large, automatically collected, and massively multilingual. For instance, the HPLT dataset (Burchell et al. 2025) covers 193 languages and eight trillion tokens. While primarily collected for Natural Language Processing (NLP) research, web datasets also hold great potential for linguistics and all other fields benefiting from text-form data. However, this potential remains in many ways untapped, largely due to the noisiness of the data. Unlike traditional, manually curated corpora, web data has often been referred to as a jungle: messy, heterogeneous and uncontrolled.
In this talk, I will present the web data resources developed at TurkuNLP. In particular, our focus is on the automatic identification of registers (Biber 1988), i.e. situationally defined text varieties such as news, different kinds of blogs, and informative texts, and on using this information to provide richer metadata for web-scale data. First, I will discuss the development of register identification tools and the challenges we have faced when modeling the jungle of web language use.
Second, I will focus on two case studies. In the first, I will show our recent findings on possible register universals and the cross-linguistic similarities of registers we have observed across 16 languages. In the second, I will present our work on using register-labeled web data for LLM training, and how informed sampling based on registers can improve LLM performance.
The seminar will be followed by a demo session (15:10 to 16:10) of the tools developed by the TurkuNLP research team.