#datasets - kbin.social

dissentanddatalove, 13 days ago to art

The Institute for Dissent and Datalove is a loose collective comprised of hackers, artists, activists and tinkerers. It overlaps with networks of solidarities involved in active defense of free speech and free/libre technologies, technology critics and political interventions.

The Institute for Dissent and Datalove has so far mostly been used for operations of de/re-contextualization of large datasets, de-formatting of formats and playful use of liberating algorithms.

It tries to criticize and deconstruct itself, while remaining grounded in uncompromising collective practices of autonomy and solidarity.

We even have a website: https://dissent-and-datalove.institute/

#introduction #art #datasets


loopier, 16 days ago to ai

In case you weren't aware, this very interesting call about #AI with small #datasets and #DIY #electronics related to the #body came out:

https://axolot.cat/residencies-hibrides-opencall/

Spread the word!


skinnylatte, 1 month ago to india

Someone I follow just launched this site. Looks pretty interesting.

https://www.dataforindia.com/

#India #Asia #Data #DataViz #DataSets


sflorg, 1 month ago to trans

#Trans identities are missing from our #datasets, meaning that their experiences in a number of domains cannot be studied quantitatively. The way in which many of our datasets are constructed reinforces a cis-normative understanding of the world, where people are pushed into the false #binary of describing themselves as either male or female.
#SocialScience #LGBTQ #sflorg
https://www.sflorg.com/2024/03/ss03292401.html


kellogh, 3 months ago to random

i have a hard time believing that using terms like “artificial intelligence” is causing people to believe that it’s “thinking”.

you don’t have to explain to people why artificial grass doesn’t grow. the “artificial” part says that on its own

“lab grown” diamonds are a thing now, they avoided “artificial” diamonds because they’re chemically identical to the real thing

besides, there’s far simpler explanations — e.g. the “AI is going to take over the world” messaging


doboprobodyne, 3 months ago

@kellogh @mistersql

I beg your pardon; I saw only your second toot and pressed reply without looking for a first.

I think we are both right, but I had not twigged that you thought it improbable that folks might anthropomorphise the machines. In my limited experience humans have a low threshold for thinking in terms of analogy, and I suspect this leads to a low threshold for anthropomorphisation.

You're quite right about the dictionary definition of learning. I'm slightly hesitant to agree beyond that but because of my own ignorance rather than anything else: I am disinclined to call a "hard disk drive" "machine-knowledge" because it seems to me to be too close to thinking by analogy, when such is not necessary.

I know, it's a slippery slope to arguing with a reporter about learning by analogy like that famous video of Richard Feynman, or correcting one's grandchildren for being lazy using "car" as a contraction of "motor-car" (thanks Gran :P ), or readjusting one's students for saying "plane" when they mean "airplane".

As much as I wish "meta-algorithm" might be a preferred term, I realise it never will be. Forgive me; I think I replied on an optimistic day xD English teachers, indeed any linguists, everywhere will agree that language is a fickle mistress; perhaps only slightly less fickle than thought herself.

I hope it is some small consolation that I think right and wrong are constructs of the animal mind, that I accept p>0.05 as probably being real, and that we're probably both right.

#AI #ML #Datasets #LLM #GAN #linguistics #language #communication #hiveMind #NLP #cognition


nic221, 4 months ago to ai

One of the world’s largest AI training datasets is about to get bigger and ‘substantially better’ https://venturebeat.com/ai/one-of-the-worlds-largest-ai-training-datasets-is-about-to-get-bigger-and-substantially-better/ (a new version of The Pile) #AI #training #datasets


remixtures, 4 months ago to ai Portuguese

#AI #GenerativeAI #GeneratedImages #Datasets #AITraining: "Datasets are the building blocks of every AI generated image and text. Diffusion models break images in these datasets down into noise, learning how the images “diffuse.” From that information, the models can reassemble them. The models then abstract those formulas into categories using related captions, and that memory is applied to random noise, so as not to duplicate the actual content of training data, though it sometimes happens. An AI-generated image of a child is assembled from thousands of abstractions of these genuine photographs of children. In the case of Stable Diffusion and Midjourney, these images come from the LAION-5B dataset, a collection of captions and links to 2.3 billion images. If there are hundreds of images of a single child in that archive of URLs, that child could influence the outcomes of these models.

The presence of child pornography in this training data is obviously disturbing. An additional point of serious concern is the likelihood that images of children who experienced traumatic abuse are influencing the appearance of children in the resulting model’s synthetic images, even when those generated images are not remotely sexual.

The presence of this material in AI training data points to an ongoing negligence of the AI data pipeline. This crisis is partly the result of who policymakers talk with and allow to define AI: too often, it is industry experts who have a vested interest in deterring attention from the role of training data, and the facts of what lies within it. As with Omelas, we each face a decision of what to do now that we know these facts."

https://www.techpolicy.press/laion5b-stable-diffusion-and-the-original-sin-of-generative-ai/


researchbuzz, 6 months ago to statistics

#DHS #statistics #BigData #datasets #cybersecurity #transparency

'The Department of Homeland Security (DHS) on Thursday unveiled its new Office of Homeland Security Statistics (OHSS), which aims to advance DHS’s statistical reporting and analysis capabilities. The new office said on Nov. 9 it plans to begin releasing its initial sets of data in the coming weeks and throughout fiscal year 2024, including a report on Federal cybersecurity incidents.'

https://www.meritalk.com/articles/dhs-launches-new-office-of-homeland-security-statistics/


KathyReid, 6 months ago to Podcast

Great episode of #twimai #podcast, featuring @DAIR's @alex talking about their career, at the intersection of #CS and #sociology, and in particular, the pivotal role of #data in #MachineLearning, and how #datasets have #politics.

https://twimlai.com/podcast/twimlai/pushing-back-on-ai-hype/


ruthpozuelo, 7 months ago (edited 7 months ago) to datascience

What is the best way to share a dataset with non-technical users and invite them to collaborate on it?

#data #dataset #datasets #datascience #database


danmcquillan, 7 months ago to ai

Re-reading 'On the genealogy of machine learning datasets: A critical history of ImageNet' by @alexhanna. So clear the LLM debacle goes back to the start of the DL boom; it's data fetish, flat universalism, social illiteracy & contempt for workers https://journals.sagepub.com/doi/full/10.1177/20539517211035955
#AI #datasets #Imagenet #resistingAI


remixtures, 8 months ago to ai Portuguese

#AI #GenerativeAI #ML #Algorithms #DataSets: "Ultimately, when we think about what a more ethical development of this technology could look like, I think it looks a lot like machine learning before 2015, or before the last 10 years. It looks like building systems for specific purposes with specific scopes and with specific goals. Then you can start to ask questions like “What values do we want the system to reinforce?” “What data makes sense to use here?” It’s the opposite of the “move fast and break things” philosophy. It’s really hard to advocate for that kind of work because it’s not flashy. It’s frustrating because I think none of this was inevitable. We’ve been talking about it in the field for ages, these issues with general or universal AI systems. I think one of the biggest arguments that we make at DAIR is that we don’t need to be building systems this way, we do not need to be making general purpose AI, we don’t need to be making these kinds of generative AI systems. It’s really hard to go down this line of work without independent funding, because that’s not where the money is right now."

https://cchange.xyz/dylan-baker/


panda, 8 months ago to ai

Providing the prompt to #AI generated content as alt text poisons the #datasets of #scrapers scraping the fediverse for training data. You're welcome.

#hackback #lifehack

https://www.scientificamerican.com/article/ai-generated-data-can-poison-future-ai-models/#:~:text=Some%20experts%20say%20that%20will,to%20the%20model%20being%20trained.


remixtures, 9 months ago to ai Portuguese

#AI #Datasets #FacialRecognition #ML #CC #CreativeCommons #Flickr #Nvidi: "Flickr Faces High-Quality (FFHQ) is a dataset of Flickr face photos originally created for face generation research by NVIDIA in 2019. It includes 70,000 total face images from 67,646 unique Flickr photos. Since its release the dataset has become one of the most widely used face datasets for a wide variety of research and commercial applications ranging from face recognition to oral region gender recognition. The images in FFHQ were taken from Flickr users without explicit consent and were selected because they contained high quality face images with a permissive Creative Commons license. Many of the images contain infants and children and over 10% of the dataset no longer exists on the original source yet NVIDIA, a $1T company, continues to use and benefit from the 70,000 face images taken on Flickr.com to develop commercial AI technologies.
(...)
Even though the main dataset and its derivatives mention the Creative Commons licenses associated with the media, of which many require attribution, no human readable attribution was provided for any photo in any dataset. Attribution is only provided in a 256MB JSON file that could not be opened on a standard laptop computer using Sublime text editor, let alone parsed to understand author attribution. This may amount to a large-scale breach of the Creative Commons attribution requirement. For further reading on the exploitation of Creative Commons licensing scheme, read "Creative Commons and The Face Recognition Problem". To further complicate the issue, it may not be possible at all to use non-consensual face images for AI/ML when attribution is required because including the subject or author name can force the face photo to become PII (personally identifiable information), a protected class of data."

https://exposing.ai/ffhq/


remixtures, 9 months ago to ai Portuguese

#AI #DataSets #AIEthics: "The Google memo points to the dawning realization that improvements in AI will require putting a lot more care and thought into how data is collected and curated. Even OpenAI, which relies on gargantuan datasets to make its products, is now pointing to this issue. A close engagement with datasets has been deeply undervalued in the AI field, and this neglect has had serious consequences downstream, from technical failures to human rights violations.

This is why investigating datasets is so important. Not because companies want an edge in the current AI wars, but to understand the ideologies, viewpoints, and harms that are being ingested, concentrated, and reproduced by AI systems. The new internet-scale datasets require new investigative methods, new research questions. What political and cultural inflections are baked into training sets? Who and what is represented? What is rendered invisible and unintelligible? Who profits from all this data, and at whose expense? What legal issues does the mass extraction of data raise for copyright, privacy, moral rights, and the right to publicity? What about the people whose creative work and livelihoods are impacted? How could these practices change? And as the accelerating machines of scrape-generate-publish-repeat begin to ingest their own material, what logics, perspectives, and aesthetics will be reinforced in this recursive loop?"

https://knowingmachines.org/9-ways-to-see/9-ways-to-see-a-dataset


lapingvino, 9 months ago to github

Through a look into the things people were saying about the upcoming Go 1.21 release I discovered #doltdb and #dolthub today. I have been looking for ages for something like that:

A #mysql compatible database system that works like #git for data, and a #github like environment for #datasets, free for public datasets.

For people here struggling with #healthcare in the #USA: one of the projects they are working on there is a dataset of pricing per hospital, along with data-based work to push for more and better policy reform around this. I think we can do MUCH better if we have good #data, and now we have a platform to share and work together on data, so let's use it!


GabrielleGirardeau, 1 year ago to random

There is still time to apply for the iBio_Sorbonne #neuro #summerschool !! Come learn about computational #analysis for #behavior #ephys #imaging in the south of France (#Banyuls). Theory and hands-on training on real #datasets. It's cheap and travel help is available. Please spread widely!!
