researchbuzz, to random
@researchbuzz@researchbuzz.masto.host

"A group of researchers from Germany, Italy, and the United Kingdom have created a dataset called “corpus of resolutions” that contains all resolutions, drafts, and meeting records of the United Nations (UN) Security Council, an international body that aims to ensure global peace and security."

https://datainnovation.org/2024/05/analyzing-un-security-council-resolutions/

dissentanddatalove, to art
@dissentanddatalove@post.lurk.org

The Institute for Dissent and Datalove is a loose collective comprised of hackers, artists, activists and tinkerers. It overlaps with networks of solidarities involved in active defense of free speech and free/libre technologies, technology critics and political interventions.

The Institute for Dissent and Datalove has so far mostly been used for operations of de/re-contextualization of large datasets, de-formatting of formats and playful use of liberating algorithms.

It tries to criticize and deconstruct itself, while remaining grounded in uncompromising collective practices of autonomy and solidarity.

We even have a website: https://dissent-and-datalove.institute/

loopier, to ai

In case you weren't aware, this very interesting open call came out:

https://axolot.cat/residencies-hibrides-opencall/

Spread the word!

skinnylatte, to india
@skinnylatte@hachyderm.io

Someone I follow just launched this site. Looks pretty interesting.

https://www.dataforindia.com/

pratik,
@pratik@writing.exchange

@skinnylatte this is a good book on India’s data https://www.goodreads.com/book/show/60869134

skinnylatte,
@skinnylatte@hachyderm.io

@pratik the site is by that author!

sflorg, to trans

#Trans identities are missing from our #datasets, meaning that their experiences in a number of domains cannot be studied quantitatively. The way in which many of our datasets are constructed reinforces a cis-normative understanding of the world, where people are pushed into the false #binary of describing themselves as either male or female.
#SocialScience #LGBTQ #sflorg
https://www.sflorg.com/2024/03/ss03292401.html

nic221, to ai
@nic221@techhub.social

One of the world’s largest AI training datasets is about to get bigger and ‘substantially better’ https://venturebeat.com/ai/one-of-the-worlds-largest-ai-training-datasets-is-about-to-get-bigger-and-substantially-better/ (a new version of The Pile)

remixtures, to ai Portuguese
@remixtures@tldr.nettime.org

"Datasets are the building blocks of every AI generated image and text. Diffusion models break images in these datasets down into noise, learning how the images “diffuse.” From that information, the models can reassemble them. The models then abstract those formulas into categories using related captions, and that memory is applied to random noise, so as not to duplicate the actual content of training data, though it sometimes happens. An AI-generated image of a child is assembled from thousands of abstractions of these genuine photographs of children. In the case of Stable Diffusion and Midjourney, these images come from the LAION-5B dataset, a collection of captions and links to 2.3 billion images. If there are hundreds of images of a single child in that archive of URLs, that child could influence the outcomes of these models.

The presence of child pornography in this training data is obviously disturbing. An additional point of serious concern is the likelihood that images of children who experienced traumatic abuse are influencing the appearance of children in the resulting model’s synthetic images, even when those generated images are not remotely sexual.

The presence of this material in AI training data points to an ongoing negligence of the AI data pipeline. This crisis is partly the result of who policymakers talk with and allow to define AI: too often, it is industry experts who have a vested interest in deterring attention from the role of training data, and the facts of what lies within it. As with Omelas, we each face a decision of what to do now that we know these facts."

https://www.techpolicy.press/laion5b-stable-diffusion-and-the-original-sin-of-generative-ai/

gimulnautti,
@gimulnautti@mastodon.green

@remixtures “Accountability is not as challenging as AI companies would like us to believe. Flying a commercial airliner full of untested experimental fuel is negligence. Rules asking airlines to tell us what’s in the fuel tank do not hamper innovation.”

Good piece.

researchbuzz, to statistics
@researchbuzz@researchbuzz.masto.host

'The Department of Homeland Security (DHS) on Thursday unveiled its new Office of Homeland Security Statistics (OHSS), which aims to advance DHS’s statistical reporting and analysis capabilities. The new office said on Nov. 9 it plans to begin releasing its initial sets of data in the coming weeks and throughout fiscal year 2024, including a report on Federal cybersecurity incidents.'

https://www.meritalk.com/articles/dhs-launches-new-office-of-homeland-security-statistics/

KathyReid, to Podcast
@KathyReid@aus.social

Great episode of TWIML AI, featuring @DAIR's @alex talking about their career.

https://twimlai.com/podcast/twimlai/pushing-back-on-ai-hype/

ruthpozuelo, to datascience
@ruthpozuelo@mastodon.social

What is the best way to share a dataset with non-technical users and invite them to collaborate on it?

jgkoomey,
@jgkoomey@mastodon.energy

@ruthpozuelo @justinbuist Some may find this too simple, but good old Google Sheets would work well for you in many applications.

ruthpozuelo,
@ruthpozuelo@mastodon.social

@Jdreben Power BI is not great for cell entry requirements, but maybe Excel Online works?

danmcquillan, to ai

Re-reading 'On the genealogy of machine learning datasets: A critical history of ImageNet' by @alexhanna. So clear the LLM debacle goes back to the start of the DL boom; it's data fetish, flat universalism, social illiteracy & contempt for workers https://journals.sagepub.com/doi/full/10.1177/20539517211035955

remixtures, to ai Portuguese
@remixtures@tldr.nettime.org

"Ultimately, when we think about what a more ethical development of this technology could look like, I think it looks a lot like machine learning before 2015, or before the last 10 years. It looks like building systems for specific purposes with specific scopes and with specific goals. Then you can start to ask questions like “What values do we want the system to reinforce?” “What data makes sense to use here?” It’s the opposite of the “move fast and break things” philosophy. It’s really hard to advocate for that kind of work because it’s not flashy. It’s frustrating because I think none of this was inevitable. We’ve been talking about it in the field for ages, these issues with general or universal AI systems. I think one of the biggest arguments that we make at DAIR is that we don’t need to be building systems this way, we do not need to be making general purpose AI, we don’t need to be making these kinds of generative AI systems. It’s really hard to go down this line of work without independent funding, because that’s not where the money is right now."

https://cchange.xyz/dylan-baker/

remixtures, to ai Portuguese
@remixtures@tldr.nettime.org

"Flickr Faces High-Quality (FFHQ) is a dataset of Flickr face photos originally created for face generation research by NVIDIA in 2019. It includes 70,000 total face images from 67,646 unique Flickr photos. Since its release the dataset has become one of the most widely used face datasets for a wide variety of research and commercial applications ranging from face recognition to oral region gender recognition. The images in FFHQ were taken from Flickr users without explicit consent and were selected because they contained high quality face images with a permissive Creative Commons license. Many of the images contain infants and children and over 10% of the dataset no longer exists on the original source yet NVIDIA, a $1T company, continues to use and benefit from the 70,000 face images taken on Flickr.com to develop commercial AI technologies.
(...)
Even though the main dataset and its derivatives mention the Creative Commons licenses associated with the media, of which many require attribution, no human readable attribution was provided for any photo in any dataset. Attribution is only provided in a 256MB JSON file that could not be opened on a standard laptop computer using Sublime text editor, let alone parsed to understand author attribution. This may amount to a large-scale breach of the Creative Commons attribution requirement. For further reading on the exploitation of Creative Commons licensing scheme, read "Creative Commons and The Face Recognition Problem". To further complicate the issue, it may not be possible at all to use non-consensual face images for AI/ML when attribution is required because including the subject or author name can force the face photo to become PII (personally identifiable information), a protected class of data."

https://exposing.ai/ffhq/

remixtures, to ai Portuguese
@remixtures@tldr.nettime.org

"The Google memo points to the dawning realization that improvements in AI will require putting a lot more care and thought into how data is collected and curated. Even OpenAI, which relies on gargantuan datasets to make its products, is now pointing to this issue. A close engagement with datasets has been deeply undervalued in the AI field, and this neglect has had serious consequences downstream, from technical failures to human rights violations.

This is why investigating datasets is so important. Not because companies want an edge in the current AI wars, but to understand the ideologies, viewpoints, and harms that are being ingested, concentrated, and reproduced by AI systems. The new internet-scale datasets require new investigative methods, new research questions. What political and cultural inflections are baked into training sets? Who and what is represented? What is rendered invisible and unintelligible? Who profits from all this data, and at whose expense? What legal issues does the mass extraction of data raise for copyright, privacy, moral rights, and the right to publicity? What about the people whose creative work and livelihoods are impacted? How could these practices change? And as the accelerating machines of scrape-generate-publish-repeat begin to ingest their own material, what logics, perspectives, and aesthetics will be reinforced in this recursive loop?"

https://knowingmachines.org/9-ways-to-see/9-ways-to-see-a-dataset
