boilingsteam, to llm
@boilingsteam@mastodon.cloud

Building a Large Japanese Web Corpus for Large Language Models: https://arxiv.org/abs/2404.17733

f_moncomble, to linguistics French
@f_moncomble@mastodon.online

New on the blog: find my corpus collection apps on this page:
https://prendrelangue.fr/category/logiciels/
@linguistics

avldigital, to germanistik German
@avldigital@openbiblio.social

Call for papers for "Que font les nouveaux corpus à la recherche en littérature ? (XIIe-XXIe siècles)", which will take place at Université de Fribourg on September 25-27, 2024.

🗓️Deadline for Abstracts: January 8, 2024

📌Further Information: https://avldigital.de/de/vernetzen/details/callforpapers/que-font-les-nouveaux-corpus-a-la-recherche-en-litterature-xiie-xxie-siecles/ @litstudies @germanistik @italianstudies

Hugo, to languagelearning
@Hugo@wikis.world

Long and fiery winter night it is! User:Dragons_Bot is importing frequency lists in 500 more languages from Unilex into Lingua Libre. The Dragons Bot script is running tonight, editing Lili persistently. We will then have common-word lists for 1001 languages, ready for you to record. At step 3 of the Recording Studio, click "local list", then search for List:{your_iso}/Unilex and you are good to go! If your community's languages aren't there, let me know below. 🎉
https://lingualibre.org/wiki/Special:RecordWizard

Lingua Libre user interface, step 3, where the user picks a word list.

Hugo,
@Hugo@wikis.world

@theklan part of those frequency lists is based on the Bible. For each of the 1001 languages, Unilex scraped various open online resources: Wikipedias, Bible translations, WordPress blogs. It is used in all our Android phones for text autocomplete. Sometimes only the Bible was available, so the frequency list reflects it. I would prefer larger, more diverse raw data, but they did what they could with what they had. I do not know of any better hyperlingual open corpus. cc @alexture
https://github.com/unicode-org/unilex

j_mieczni, to dh German
@j_mieczni@101010.pl

📢 Workshop on the design and application of large machine-readable audio-video corpora of spoken interaction, organized on behalf of CHORD-talk-in-interaction at the University of Basel on 📅December 7-8, 2023. Register here: https://www.chord-talk-in-interaction.usi.ch/cti/workshops/workshop-2 for on-site or online participation by 📅November 27.
Arnulf Deppermann, Henrike Helmer and Silke Reineke from the Leibniz-Institut für Deutsche Sprache, Mannheim, will draw on the IDS corpus FOLK (Forschungs- und Lehrkorpus gesprochenes Deutsch) to discuss and illustrate the challenges of creating and using spoken and multimodal data in a and perspective. They will address workflows, research tools and the specifics of audio and video data gathered to document social interactions in authentic settings. They will discuss the potential and the constraints of using large corpora and their tools to support research on social interaction. There will also be an open discussion on the possibilities and limitations of / allowing the participants to exchange views on current trends towards sharing research data and the conditions under which they can be made available to the scientific community.
CHORD-talk-in-interaction is a collaboration 🤝👥 between USI Università della Svizzera italiana and the universities of Basel, Lausanne and Neuchâtel.
@linguistics
@dh
@qdr

InistCNRS, to random French

The specialized ISTEX corpora, built by the Inist teams, are offered for use in natural language processing and text mining (TDM). Discover the two new ISTEX corpora in the Mémoire collection.
🔎Full-text documents aligned with the thesauri published on the Loterre platform


https://www.inist.fr/nos-actualites/istex-deux-nouveaux-corpus-dans-la-collection-memoire/?utm_source=dlvr.it&utm_medium=mastodon

Coocho, to linguistics

I'm rereading my & Sofia's chapter "Talking about women: Elicitation, manual tagging, and semantic tagging in a study of pick-up artists’ referential strategies" for the first time in ages.
I somehow remembered it as one of the least favourite things we wrote, but it's actually pretty cool!☺️
It's in https://benjamins.com/catalog/scl.98, let me know if you'd like the pdf - happy to share!

cazabon, to technology

Dear #technology #journalists,

Please stop writing and disseminating "most #popular #programming #languages" articles. Besides the fact that #popularity is a relatively useless criterion to select an implementation by [1], the sources of your data are all terrible, and generally fall into:

-questionnaires sent to #CTOs by #management sites
-undocumented analysis of some large but random #corpus of #code
-#survey questions given to users of some popular tool or website

1/2

#analysis

ellescommelinguistes, to languagelearning
@ellescommelinguistes@mastodon.social

And we're kicking off the academic year with a new #publication!

Hutin, Mathilde & Allassonnière-Tang, Marc. 2023. L’apport des données participatives pour l’étude linguistique des français du monde: le cas de l’opposition /a~ɑ/. Journal of French Language Studies, 1-24. doi: 10.1017/S0959269523000200

#JournalOfFrenchLanguageStudies #JFLS #French #linguistics #phonetics #phonology #largecorpora #corpus #crowdsourcing
#LangueFrançaise #openscience #scienceouverte

https://www.cambridge.org/core/journals/journal-of-french-language-studies/article/lapport-des-donnees-participatives-pour-letude-linguistique-des-francais-du-monde-le-cas-de-lopposition-a/F0F8EE9E94B153A08B724346FB68C342

illandancient, to random

On the website for my corpus of 21st century Scots there is a utility to compare different dialects.

It works by generating lists of the top 200 most common words in each dialect and then displaying which words the dialects have in common.

It is supposed to look like an Euler diagram with two overlapping groups. But it's a bit unintuitive.

It works if you know what you're looking at, but if you don't then it's just colours and shapes.
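
The comparison described above (a top-N word list per dialect, then the words the two lists share) can be sketched roughly as follows. This is only an illustration, not the code behind the site: the function names, the simple regex tokenizer, and the two sample sentences are all assumptions made up for the example.

```python
import re
from collections import Counter


def top_words(text: str, n: int = 200) -> list[str]:
    """Return the n most frequent word forms in text, most frequent first.

    Tokenization here is a deliberately naive assumption: lowercase the text
    and keep runs of letters, allowing apostrophes inside a word.
    """
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return [word for word, _ in Counter(tokens).most_common(n)]


def compare_dialects(text_a: str, text_b: str, n: int = 200) -> dict[str, set[str]]:
    """Split the two top-n lists into the three regions of a two-set Euler diagram."""
    top_a = set(top_words(text_a, n))
    top_b = set(top_words(text_b, n))
    return {
        "only_a": top_a - top_b,   # common in dialect A only
        "shared": top_a & top_b,   # the overlapping region
        "only_b": top_b - top_a,   # common in dialect B only
    }


if __name__ == "__main__":
    # Made-up sample sentences standing in for two dialect sub-corpora.
    sample_a = "the wee dug ran tae the hoose"
    sample_b = "the wee loon ran tae the hoose"
    for region, words in compare_dialects(sample_a, sample_b).items():
        print(region, sorted(words))
```

With real sub-corpora, the three sets map directly onto the left, middle and right regions of the diagram. Note that when several words tie at the cut-off, which of them make the top N depends on the order they were first seen.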

1/

https://www.chrisgilmour.co.uk/test/dialcomp.php?a=Central&b=Doric&top=200

A colourful Euler diagram taken from Wikipedia that uses coloured rectangles and other shapes to show the relationships between different Solar System objects, where things like Dwarf Planets are a sub-set of Minor Planets, etc.
