stefan, (edited ) to journalism avatar

Learn how to request a dataset of all the databases an agency maintains with @muckrock’s latest webinar:

Next session is on June 14 and you can sign up here:

stefan, to history avatar

"The Arsenical Books Database — part of the Winterthur Museum and the University of Delaware’s Poison Book Project — has identified hundreds of examples of 19th-century books that used [green pigments containing arsenic] in their covers and other binding components."

Via @dataisplural.

boilingsteam, to llm avatar

Building a Large Japanese Web Corpus for Large Language Models:

KathyReid, (edited ) to ML avatar

Delighted to be able to publicise a paper that was presented at the @ALTAnlp 2023 Workshop at the end of last year, co-authored with my #PhD supervisor, Associate Professor @eltwilliams, and written as part of my research at #ANU School of Cybernetics.

Titled "Right the docs: Characterising voice dataset documentation practices used in machine learning", it combines exploratory interviews and documentation analysis to characterise how large voice datasets (e.g. #LibriSpeech, @mozilla's #CommonVoice, and several others) document their #metadata.

Unsurprisingly, it finds that the #dataset #documentation practices seen currently do not meet the needs of the #ML practitioners who use these datasets.

We show, once again, in the words of Nithya Sambasivan - "everyone wants to do the model work, but nobody wants to do the data work" ...

#RightTheDocs #WriteTheDocs


Reid, K., Williams, E.T., 2023. Right the docs: Characterising voice dataset documentation practices used in machine learning, in: Muresan, S., Chen, V., Casey, K., David, V., Nina, D., Koji, I., Erik, E., Stefan, U. (Eds.), Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association. Association for Computational Linguistics, Melbourne, Australia, pp. 51–66.

stefan, to random avatar

"The Data Liberation Project is an initiative to identify, obtain, reformat, clean, document, publish, and disseminate government datasets of public interest."

Interesting initiative, and they're looking for volunteers:

mwfc, to random avatar

Is anyone aware of a freely available Structural Health Monitoring database with data from Europe?

I am curious to correlate some data from various sources.

I am especially interested in building damage due to ground movement.

stefan, to random avatar

"A dataset of 77,000+ (distinct) candidates across 57,000+ US elections for mayor, city council, school board, county executive, county legislature, sheriff, and prosecutor."


via @dataisplural.

adulau, to opensource

A very nice dataset from Malpedia with all the deobfuscated strings from their dataset. The repository contains the result of the FLARE FLOSS tool applied to all unpacked and dumped samples in Malpedia.


daieuxetdailleurs, to France French avatar

I am currently finalizing a dataset devoted to national celebrations and commemorations in France since 1970, and over the coming days I will share a short series of posts on the subject ⤵️

PS: it will be made available, and I am of course preparing a few ...

@geneafr @archivistodon

daieuxetdailleurs, to random French avatar

One of my (professional) life goals is to be featured among the favorites of @datagouvfr 😋 ❤️

daieuxetdailleurs, to France French avatar

New dataset: mentions of climatic and natural events since the end of the 18th century in […] and […]

More than 1,000 events (and this is only the beginning), drawn from the inventories of:

  • cathedral works (19th century)
  • public calamities (1950s and 1960s)

@archivistodon @geneafr

researchbuzz, to Raleigh avatar

I connected Raleigh's dataset of trees to an API query for finding nearby items of interest, and then to an OpenAI API query so the trees could describe themselves and the area around them.

As you do when you're a weirdo

boilingsteam, to linux avatar

RedPajama v2 Open Dataset with 30T Tokens for Training LLMs:

a, to Futurology avatar

The Common Crawl September/October 2023 crawl archive (CC-MAIN-2023-40) has been released.

100 TiB (compressed) of freshly crawled web data that can be used in your next data mining project.
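For anyone wanting to try this, here is a minimal Python sketch of how one might locate the files in that crawl. It assumes the layout Common Crawl publishes for its crawls: a gzipped listing of relative file paths under `crawl-data/<crawl id>/` on `data.commoncrawl.org`. The function names are illustrative, not part of any official client, and nothing is downloaded unless you call `fetch_listing` yourself.

```python
import gzip
import urllib.request

# Assumed public HTTP endpoint for Common Crawl data.
BASE_URL = "https://data.commoncrawl.org/"


def paths_listing_url(crawl_id: str, kind: str = "warc") -> str:
    """Build the URL of the gzipped path listing for one crawl.

    kind is one of "warc" (raw responses), "wat" (metadata), "wet" (plain text).
    """
    if kind not in {"warc", "wat", "wet"}:
        raise ValueError(f"unknown listing kind: {kind!r}")
    return f"{BASE_URL}crawl-data/{crawl_id}/{kind}.paths.gz"


def parse_paths_listing(gz_bytes: bytes) -> list[str]:
    """Turn a downloaded *.paths.gz payload into a list of absolute file URLs."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [BASE_URL + line.strip() for line in text.splitlines() if line.strip()]


def fetch_listing(crawl_id: str, kind: str = "warc") -> list[str]:
    """Download and parse a listing (performs a network request)."""
    with urllib.request.urlopen(paths_listing_url(crawl_id, kind)) as resp:
        return parse_paths_listing(resp.read())
```

Calling `fetch_listing("CC-MAIN-2023-40", "wet")` should return one URL per extracted-text file. Since the full archive is around 100 TiB, sampling a handful of files from the listing is a more realistic starting point than downloading everything.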


itnewsbot, to machinelearning avatar

AI-Powered Snore Detector Shakes the Pillow So You Won’t - If you snore, you’ll probably find out about it from someone. An elbow to the ribs...

regroup_horizon, to politicaltheory avatar

Do you need data related to for your research? 👀📚 Look no further! 👇

Our portal has been filled with new datasets, varying from citizens' attitudes on the to its impact on migration, labor and municipalities. ✅

See here:

@politicalscience @politicaltheory @sociology

ruthpozuelo, (edited ) to datascience avatar

What is the best way to share a dataset with non-technical users and invite them to collaborate on it?

stefan, to ilaughed avatar

A haunting collection of roughly 10,000 recordings of nuclear weapons tests from the 1940s to the 1960s.

"The films are equal parts terrifying and fascinating."

via @Beautifulpublicdata

KathyReid, to random avatar

Normal people: refreshing their browser to get Famous People tickets

Me: refreshing my browser waiting for the v15 to drop

🤓 📊 ⌨️

lysander07, to Futurology

The "Wikidata Research Articles Dataset" comprises peer-reviewed full research papers about Wikidata from its first decade of existence (2012-2022).

via @wikiresearch (but posted on )

ppatel, to machinelearning avatar

The MIT researchers found that models trained for autocaptioning with their dataset consistently generated captions that were precise, semantically rich, and described data trends and complex patterns.

Researchers teach an AI to write better chart captions.

A new dataset can help scientists develop automatic systems that generate richer, more descriptive captions for online charts for people.

CEDO, to ai avatar

"Everything online is fodder for our […]" And we teach our children using that company's services. Parents who object are told to get lost. The Autoriteit Persoonsgegevens (the Dutch data protection authority) does nothing.

CEDO, avatar

@ErikJonker It does not stop at "reading": training an AI means making derivative works. That something is online does not mean it is in the public domain. People put things online within a context, for a particular purpose. The "grab all you can" approach with which Big Tech is now filling its datasets ignores this completely.
