#corpus - kbin.social

boilingsteam, 14 days ago to llm

Building a Large Japanese Web Corpus for Large Language Models: https://arxiv.org/abs/2404.17733 #llm #japanese #dataset #corpus #training

f_moncomble, 24 days ago to linguistics French

New on the blog — find my #corpus collection apps on this page:
https://prendrelangue.fr/category/logiciels/
#linguistics @linguistics

avldigital, 4 months ago to germanistik German

#CfP for the #conference "Que font les nouveaux #corpus à la recherche en #littérature? (XIIe-XXIe siècles)", which will take place at the Université de Fribourg (#unifr) on September 25-27, 2024.

🗓️Deadline for Abstracts: January 8, 2024

📌Further Information: https://avldigital.de/de/vernetzen/details/callforpapers/que-font-les-nouveaux-corpus-a-la-recherche-en-litterature-xiie-xxie-siecles/ #LiteratureHistory #LiteraryHistory #fidavlnews @litstudies @germanistik @italianstudies #Poetics

Hugo, 4 months ago to languagelearning

A long and fiery winter night it is! User:Dragons_Bot is importing frequency lists for 500 more #languages from Unilex into #Lingualibre. The Dragons Bot script is running tonight, editing Lili persistently. We will then have common-word lists for 1001 languages, ready for you to record. At step 3 of the Recording Studio, click "local list", then search for List:{your_iso}/Unilex and you are good to go! If your community's languages aren't there, let me know below. 🎉
https://lingualibre.org/wiki/Special:RecordWizard

Image: Lingua Libre user interface, step 3, where the user picks their word list.

Hugo, 4 months ago

@theklan Some of those frequency lists are based on the Bible. For each of the 1001 #languages, #Unilex scraped various open online resources: Wikipedias, Bible translations, WordPress blogs. It is used in all our Android phones for text autocomplete. Sometimes only the Bible was available, so the frequency list reflects it. I would prefer larger, more diverse raw data, but they made do with what they had. I do not know of any better hyperlingual open #corpus. cc @alexture
https://github.com/unicode-org/unilex

j_mieczni, 5 months ago to dh German

📢 Workshop on the design and application of large machine-readable audio-video corpora of spoken interaction, organized by #LorenzaMondada on behalf of CHORD-talk-in-interaction at the University of Basel on 📅December 7-8, 2023. Register here: https://www.chord-talk-in-interaction.usi.ch/cti/workshops/workshop-2 for on-site or online participation by 📅November 27.
Arnulf Deppermann, Henrike Helmer and Silke Reineke from the Leibniz-Institut für Deutsche Sprache, Mannheim, will draw on the IDS corpus FOLK (Forschungs- und Lehrkorpus gesprochenes Deutsch) to discuss and illustrate the challenges of creating and using spoken and multimodal data from an #EMCA and #interactionallinguistics perspective. They will address workflows, #corpus research tools and the specifics of audio and video data gathered to document social interactions in authentic settings. They will discuss the potential and the constraints of using large corpora and their tools to support research on social interaction. There will also be an open discussion on the possibilities and limitations of #openresearchdata / #opendata, allowing participants to exchange views on current trends towards sharing research data and the conditions under which such data can be made available to the scientific community.
CHORD-talk-in-interaction is a collaboration 🤝👥 between USI Università della Svizzera italiana and the universities of Basel, Lausanne and Neuchâtel.
@linguistics
@dh
@qdr

InistCNRS, 5 months ago to random French

The specialised ISTEX corpora, built by the Inist teams, are offered for use in natural language processing and text and data mining (TDM). Discover the two new ISTEX #corpus sets for the #Mémoire collection.
🔎 Full-text documents aligned with the thesauri published on the Loterre platform

#FouilleDeTextes #TDM
https://www.inist.fr/nos-actualites/istex-deux-nouveaux-corpus-dans-la-collection-memoire/?utm_source=dlvr.it&utm_medium=mastodon

Coocho, 7 months ago to linguistics

I'm re-reading my & Sofia's chapter "Talking about women: Elicitation, manual tagging, and semantic tagging in a study of pick-up artists’ referential strategies" for the first time in ages.
I somehow remembered it as one of my least favourite things we wrote, but it's actually pretty cool!☺️
It's in https://benjamins.com/catalog/scl.98, let me know if you'd like the pdf - happy to share!
#linguistics #corpus #manosphere

cazabon, 8 months ago to technology

Dear #technology #journalists,

Please stop writing and disseminating "most #popular #programming #languages" articles. Besides the fact that #popularity is a relatively useless criterion to select an implementation by [1], the sources of your data are all terrible, and generally fall into:

- questionnaires sent to #CTOs by #management sites
- undocumented analysis of some large but random #corpus of #code
- #survey questions given to users of some popular tool or website

1/2

#analysis

ellescommelinguistes, 8 months ago to languagelearning

And we're kicking off the academic year with a new #publication!

Hutin, Mathilde & Allassonnière-Tang, Marc. 2023. L’apport des données participatives pour l’étude linguistique des français du monde: le cas de l’opposition /a~ɑ/. Journal of French Language Studies, 1-24. doi: 10.1017/S0959269523000200

#JournalOfFrenchLanguageStudies #JFLS #French #linguistics #phonetics #phonology #largecorpora #corpus #crowdsourcing
#LangueFrançaise #openscience #scienceouverte

https://www.cambridge.org/core/journals/journal-of-french-language-studies/article/lapport-des-donnees-participatives-pour-letude-linguistique-des-francais-du-monde-le-cas-de-lopposition-a/F0F8EE9E94B153A08B724346FB68C342

illandancient, 1 year ago to random

On the website for my corpus of 21st-century Scots there is a utility to compare different dialects.

It works by generating lists of the top 200 most common words in each dialect and then displaying which words the dialects have in common.

It is supposed to look like an Euler diagram with two overlapping groups. But it's a bit unintuitive.

It works if you know what you're looking at, but if you don't then it's just colours and shapes.

#corpus #CorpusLinguistics

1/

https://www.chrisgilmour.co.uk/test/dialcomp.php?a=Central&b=Doric&top=200

Image: A colourful Euler diagram taken from Wikipedia that uses coloured rectangles and other shapes to show the relationships between different Solar System objects, where things like Dwarf Planets are a subset of Minor Planets, etc.
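
The comparison described in this post (take the 200 most common words in each dialect, then show which ones overlap) can be sketched roughly as below. This is only an illustration in Python, not the code behind the site's dialcomp.php utility: the function names `top_words` and `compare_dialects`, the simple regex tokenizer, and the two sample sentences are all assumptions made up for the example.

```python
import re
from collections import Counter


def top_words(text: str, n: int = 200) -> list[str]:
    """Return the n most frequent word forms in text, most common first."""
    # Assumed tokenizer: lowercase runs of letters, allowing internal apostrophes.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return [word for word, _ in Counter(tokens).most_common(n)]


def compare_dialects(text_a: str, text_b: str, n: int = 200) -> dict[str, list[str]]:
    """Split the two top-n word lists into the three regions of a two-set Euler diagram."""
    top_a = set(top_words(text_a, n))
    top_b = set(top_words(text_b, n))
    return {
        "only_a": sorted(top_a - top_b),  # frequent in dialect A only
        "shared": sorted(top_a & top_b),  # frequent in both dialects
        "only_b": sorted(top_b - top_a),  # frequent in dialect B only
    }


# Made-up one-line "corpora" standing in for real dialect texts.
central = "I dinna ken whit ye mean"
doric = "I dinna ken fit ye mean"

result = compare_dialects(central, doric, n=200)
print(result["only_a"])  # ['whit']
print(result["shared"])  # ['dinna', 'i', 'ken', 'mean', 'ye']
print(result["only_b"])  # ['fit']
```

The three lists map directly onto the left-only, overlapping, and right-only regions of the diagram, so printing them as plain labelled lists might be one way to make the output readable for people who don't know what the shapes mean.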
