gabrielesvelto,
@gabrielesvelto@fosstodon.org

So I just learned what "The Stack" is today: an aggregation of GitHub repos for machine learning from which I can opt out.

But I won't.

I won't because they scraped some hot garbage I wrote in bash and Python that would make you faint. Bottom-of-the-barrel throw-away scripts full of coding crimes. Stuff like

find | grep | awk | xargs | ugh

...invoked via subprocess.run() then fed into more garbage.

I want "artificial intelligence" to learn this. It's going to be fantastic.

bzdev,
@bzdev@fosstodon.org

@gabrielesvelto Instead of Gabriele's "find|grep|awk|..." I once did roughly lex | lex | lex |.... I had used LaTeX to write a chapter of a final report. Our manager decided we should use troff (this was the 1980s). So, a few days before it was due, I wrote a series of lex programs, each doing part of the conversion & fixing some previous errors until I was left with something good enough that the rest could be easily done by hand.
Very ugly coding but also very practical given its one-time use.

gabrielesvelto,
@gabrielesvelto@fosstodon.org

tired: opt-out of AI training datasets
wired: enthusiastically opt-in all the garbage that's sitting on your disk

gabrielesvelto,
@gabrielesvelto@fosstodon.org

I wonder if I could cook up a script that turns Star Trek erotic fan fiction into Rust code, then upload that to GitHub

derickr,
@derickr@phpc.social

@gabrielesvelto link to such fiction please 😂

gabrielesvelto,
@gabrielesvelto@fosstodon.org

@derickr the Archive of Our Own has 100k+ such works, carefully labeled with genre, warnings, etc...

https://archiveofourown.org/tags/Star%20Trek/works

Ironically this stuff did end up in many machine-learning training datasets, creating one of those typical "what could go wrong?" scenarios.

avghelper,
@avghelper@fosstodon.org

@gabrielesvelto Consider: Markdown

Then the same thing but rendered as HTML, just in case :blobfoxevil:

nebucatnetzer,
@nebucatnetzer@emacs.ch

@gabrielesvelto How does the GPL work with LLMs?
I always use it for all my code because I personally think it is a great concept.

gabrielesvelto,
@gabrielesvelto@fosstodon.org

@nebucatnetzer it seems that companies gathering data for training are deliberately avoiding it and other explicitly free software licenses. I guess they fear the consequences of generating more GPL'd code.

mdione,
@mdione@en.osm.town

@gabrielesvelto I don't think that AI companies actually care about the quality of the code their systems spew. The whole point is that 'it works' (even when it doesn't), not that a human would be able to modify it later.

gabrielesvelto,
@gabrielesvelto@fosstodon.org

@mdione it's not just that. Most code you'll find around has notable bad patterns, a very common one being mostly ignoring errors. Since LLM training gives disproportionate weight to common patterns, the output will consistently reproduce the bad ones. This output is bound to be unstable and insecure by design, not just unmaintainable.

mancavgeek,
@mancavgeek@social.teamb.space

@gabrielesvelto
I was thinking of doing something like this - I've coded plenty of projects that never worked, for reasons that I never figured out, that I think would be perfect for this.
@reedmideke

gabrielesvelto,
@gabrielesvelto@fosstodon.org

@mancavgeek @reedmideke stuff that doesn't work is especially good!

990000,
@990000@mstdn.social

@gabrielesvelto lol yeah I’m going to upload my worst code ever, also endless files of imaginary API keys

gabrielesvelto,
@gabrielesvelto@fosstodon.org

@990000 go for it! When I'm in a hurry anything goes! Allocate a truckload of stuff, leak everything. Ignore error conditions. Catch all exceptions, ignore them and plough ahead. Stuff printf()s every three statements. Append data to a string held in a global variable as a replacement for structured logging. Invoke shell commands from C++. Let's goooo!

tuxicoman,
@tuxicoman@social.jesuislibre.net

@gabrielesvelto @990000

They can decrease the weight of bad code based on number of forks.

Are lawyers too lazy to demonstrate that rephrased GPL code is GPL too? If the model is only based on GPL code, would the output not be GPL too?

gabrielesvelto,
@gabrielesvelto@fosstodon.org

@tuxicoman @990000 IANAL but I guess it would. That being said there's a truckload of bad code on GitHub. Bad patterns are everywhere. Even if the individual weight is low I'm sure they show up far more often than stuff that's properly done.

Think error handling: how much code have you seen doing it properly versus the "try/catch everything/do nothing to handle errors/happy go lucky" pattern?

990000,
@990000@mstdn.social

@gabrielesvelto catch all exceptions 😂 🏆

gabrielesvelto,
@gabrielesvelto@fosstodon.org

@990000 I once worked on a piece of financial software written in Java and came across something like this:

try {
    ...
    // stuff that throws exceptions
} catch (Exception e) {
    System.out.println("Caught exception " + e.toString());
}

I grep'd the codebase and found around 600 instances of this.

This was production code. It was a bad joke, but it was production code.

God only knows what's on GitHub.
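
For contrast with the catch block quoted above, here is a minimal sketch of what handling the error could look like. It is not taken from the software described in the post: the class LedgerLoader, the method loadLedger and the file name are hypothetical, chosen only for illustration. The idea is to catch the specific exception, attach context, and rethrow so the caller cannot carry on as if nothing happened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical example class; the names are illustrative assumptions.
public class LedgerLoader {

    // Reads all lines of a file, failing loudly instead of printing and moving on.
    public static List<String> loadLedger(Path file) {
        try {
            return Files.readAllLines(file);
        } catch (IOException e) {
            // Catch only the failure we expect, add context, keep the cause.
            throw new UncheckedIOException("Cannot read ledger " + file, e);
        }
    }

    public static void main(String[] args) {
        try {
            loadLedger(Path.of("does-not-exist.ledger"));
        } catch (UncheckedIOException e) {
            // The top level decides what a failure means; here we report and stop.
            System.err.println("Aborting: " + e.getMessage());
        }
    }
}
```

Whether to rethrow unchecked, as here, or to declare the checked exception with "throws" is a design choice; either way the error reaches code that can act on it.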

PeterLudemann,

@gabrielesvelto @990000
That's good code because it actually does "something" with the exception. I've seen plenty of code that does nothing in the "catch". Because checked exceptions are annoying.

PeterLudemann,

@gabrielesvelto @990000
There are many things I dislike about Java but checked exceptions can be "handled" by simply adding "throws" to the enclosing method.

gabrielesvelto,
@gabrielesvelto@fosstodon.org

@PeterLudemann @990000 yeah, but you have to handle them at some point. In this case it meant all sorts of error conditions were fatal and left the software in an inconsistent state. For example, it would make several network connections, and if one failed or dropped the software needed to be restarted, as it wouldn't reconnect.
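
As a rough illustration of the reconnect behaviour that was missing, the sketch below retries a connection attempt a bounded number of times. Everything in it is an assumption made for the example: the Reconnect class, the Connector interface, withRetry and its parameters are hypothetical, and the original software's networking code is not known.

```java
import java.io.IOException;

// Hypothetical sketch; names and parameters are illustrative assumptions.
public class Reconnect {

    // A connection attempt that may fail with an I/O error.
    public interface Connector<T> {
        T connect() throws IOException;
    }

    // Tries connector up to maxAttempts times, sleeping delayMillis between tries.
    // Only IOException is retried; the last failure is kept as the cause.
    public static <T> T withRetry(Connector<T> connector, int maxAttempts, long delayMillis)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connector.connect();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMillis);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt and stop retrying.
                        Thread.currentThread().interrupt();
                        throw new IOException("Interrupted while waiting to retry", ie);
                    }
                }
            }
        }
        throw new IOException("Giving up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Simulated connection that drops twice before succeeding.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("connection dropped");
            }
            return "connected";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A real client would probably add backoff and a way to tell transient failures from permanent ones; the point here is only that a dropped connection does not have to mean a restart.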

hub,
@hub@cosocial.ca

@gabrielesvelto I want them to cause every piece of code to become copyleft.

gabrielesvelto,
@gabrielesvelto@fosstodon.org

@hub oh gosh, if we could convince these things to spit out GPL-3 license text everywhere - possibly hidden in Unicode or something - it would be fantastic. Ultimate poison pill.
