I'm as anti-"AI" as the next person, but I think it's important to keep in mind the larger strategic picture of "AI" w.r.t. #search when it comes to #DuckDuckGo - both have the problem of inaccurate information, mining the commons, etc. But Google's use of LLMs in search is specifically a bid to cut the rest of the internet out of information retrieval and treat it merely as a source of training data - replacing traditional search with #LLM search. That includes a whole ecosystem of surveillance and enclosure of information systems including assistants, Chrome, Android, Google Drive/Docs/et al, and other vectors.
DuckDuckGo simply doesn't have the same market position to do that, and their system is set up as just an allegedly privacy preserving proxy. So while I think more new search engines are good and healthy, and LLM search is bad and doesn't work, I think we should keep the bigger picture in mind to avoid being reactionary, and I don't think the mere presence of LLM search is a good reason to stop using DuckDuckGo.
Seeing people praise #copilot for finally getting rid of hallucinations through simple RAG techniques of checking for reality in eg. citations. This moment where a lot of the trivial claims against #LLMs stopped being true, but the deeper harms of surveillance and information monopoly remained was inevitable and the chief danger of dismissing it as "fancy autocomplete." That is why I wrote this almost a year ago, as a warning of what comes next and what we can do about it: https://jon-e.net/surveillance-graphs/ #SurveillanceGraphs
Molly White is right as usual: "We’ve already tried out having a tech industry led by a bunch of techno-utopianists and those who think they can reduce everything to markets and equations. Let’s try something new, and not just give new names to the old."
trying to articulate new ideologies for computing is where my mind has been at the last few years too. i joke about the 'anti-perf manifesto,' but forging imaginaries that can run on computers that are actively antagonistic to the techno-utopians is all about killing myths of heroism where we are the someone else who goes out and "brings home the spoils." how do we reach a computing that isn't foundationally based on asymmetric power, we serfs at the mercy of the lord of the platform and vice versa, we altruistic platform providers building things the commoners couldn't possibly understand. The language of "scale" where one or a few services need to expand to provide for millions hides futures where we can provide for each other horizontally in overlapping quilts of dozens, hundreds. You could shorthand the "#AI" boom as the continuation of the information conglomerates trying to provide the everything platform, and if our dreams are to meaningfully challenge theirs we can't also aspire to simply "do what they're doing, except it's us doing it."
I tried to articulate this as the cloud orthodoxy vs. a still-nebulous idea i've landed on as vulgarity in computing, but i'll probably be orbiting this idea for as long as i am on line.
#Amazon releases details on its Alexa #LLM, which will use its constant surveillance data to "personalize" the model. Like #Google, they're moving away from wakewords towards being able to trigger Alexa contextually - when the assistant "thinks" it should be responding, which of course requires continual processing of speech for content, not just a word.
The consumer page suggests user data is "training" the model, but the developer page describes exactly the augmented LLM, iterative generation process grounded in a personal knowledge graph that Microsoft, Facebook, and Google all describe as the next step in LLM tech.
We can no longer think of LLMs on their own when we consider these technologies, that era was brief and has passed. I've been waving my arms up and down about this since ChatGPT was released - criticisms of LLMs that stop short at their current form, arguing about whether the language models themselves can "understand" language, miss the bigger picture of what they are intended for. These are surveillance technologies that act as interfaces to knowledge graphs and external services, putting a human voice on whole-life surveillance.
The NYTimes story on the AI writing news is a story about the repackaging of the knowledge graph. the language model is just an interface. Repackaging as an assistant, the examples of broken factboxes, the sale as a labor saving device, "we don't intend to replace your writers, we want to give you more convenient access to factual information" - here's a piece that should help make sense of that. #SurveillanceGraphs https://jon-e.net/surveillance-graphs/#the-lens-of-search-re-centers-our-focus-away-from-the-generative
The rewriting titles idea is perfectly in line with what they discuss in their investor calls in the context of advertising. it's a natural move if you see the LLMs as scope-limited enterprise tools that are intended to hook companies into dependence on their information access systems (consolidation of power) and hook people into them as means of interacting with an ecosystem of apps, commerce, etc. (intimacy of surveillance).
The debate about whether the LLMs are sentient is not serving us well. It's true, of course they aren't sentient, but it's obscuring more of the truth of the strategy than it is inoculating us against it at this point. Whether the LLMs are sentient is irrelevant because the plan was never to just continue to use the LLMs on their own. They are interfaces to other systems, and can be presented as tools that can be conditioned by "factual information."
They won't work as advertised, of course, but we have to be very clear about the threat: The threat is not that LLMs will write the news. That's already happening, do any search. The threat is that the LLMs will be used to leverage greater control over our access to information by destabilizing our already fragile information ecosystem and presenting themselves as precisely not sentient, but handy assistants to interact with trusted databases - the last trustable sources of information left.
The addition of context-optimized clickbait headers for those willing to pay to be the brand beneath them is just an especially cynical product to sell to whichever suckers are desperate enough to buy it.
in my work the last few years I have been playing part-time journalist, talking with people on and off the record, chasing stories through scraped corporate documents, etc. To me that flows naturally with the other parts of my work building software, experimenting with social dynamics and even studying language, but it never escapes me that because my work doesn't fit in any discipline there is no place for it. I've been told to strip the amateur journalism entirely, transform it into qualitative research/ethnography, or just quit academia and do it as straight ahead journalism. but it's the mash of different disciplines and traditions that makes it interesting!
if all we ask from "reforming" or rebuilding #ScholComm is for the owners of the journals to change, but everything else remains intact, we will still be missing so much of what our work could be without their structuring influence. I have chosen to not pursue any of the milestones or metrics that might allow me to get a TT job one day in order to do what, to me, is the most interesting work I could do, and it really sucks that that is the tradeoff. Many academics like to imagine the scientific process as welcoming creativity and new ideas, but those new ideas have to be strongly constrained in form - the revolutionary new idea in my field has to look just like everything else in my field just with different results.
How sick would it be if it was normal for transdisciplinary collaboration to not just look like a linguist in the author list contributing to the discussion of a traditional Nature systems Neuro paper, but to genuinely be able to work across fields and come out with something where we truly don't know what the other side will look like? Prespecifying a paper, much less a project, to fit a journal's specification makes our work boring and I have been in more than a few meetings about potential collaborations that went nowhere because there wouldn't be a venue for it.
Not everyone has to want that, some people just want to do molecular biology only, and that's fine! but for that to be the only way to do things is yet another way that our broken communication systems affect literally everything we do in academia.
I guess, relatedly, if anyone knows of any venue for #SurveillanceGraphs hmu. It's already undergoing a sort of informal public peer review through the annotations, but I would like to have a more systematic process of people checking me on my shit and offering their perspectives. In my mind, it would be great if more processes like that could result in coauthorship if someone wants to contribute, but maybe that's another conversation.
A bit of an overview and then I'll get into some of the more specific arguments in a thread:
This piece is in three parts:
First I trace the mutation of the liberatory ambitions of the #SemanticWeb into #KnowledgeGraphs, an underappreciated component in the architecture of #SurveillanceCapitalism. This mutation plays out against the backdrop of the broader platform capture of the web, rendering us as consumer-users of information services rather than empowered people communicating over informational protocols.
I then show how this platform logic influences two contemporary public information infrastructure projects: the NIH's Biomedical Data Translator and the NSF's Open Knowledge Network. I argue that projects like these, while well intentioned, demonstrate the fundamental limitations of platformatized public infrastructure and create new capacities for harm by their enmeshment in and inevitable capture by information conglomerates. The dream of a seamless "knowledge graph of everything" is unlikely to deliver on the utopian promises made by techno-solutionists, but these projects do create new opportunities for algorithmic oppression -- automated conversion therapy, predictive policing, abuse of bureaucracy in "smart cities," etc. Given the framing of corporate knowledge graphs, these projects are poised to create facilitating technologies (that the info conglomerates write about needing themselves) for a new kind of interoperable corporate data infrastructure, where a gradient of public to private information is traded between "open" and quasi-proprietary knowledge graphs to power derivative platforms and services.
When approaching "AI" from the perspective of the semantic web and knowledge graphs, it becomes apparent that the new generation of #LLMs are intended to serve as interfaces to knowledge graphs. These "augmented language models" are joint systems that combine a language model as a means of interacting with some underlying knowledge graph, integrated in multiple places in the computing ecosystem: eg. mobile apps, assistants, search, and enterprise platforms. I concretize and extend prior criticism about the capacity for LLMs to concentrate power by capturing access to information in increasingly isolated platforms and expand surveillance by creating the demand for extended personalized data graphs across multiple systems from home surveillance to your workplace, medical, and governmental data.
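To make the "joint system" shape concrete, here is a minimal sketch of that loop, assuming a toy personal graph stored as (subject, predicate, object) tuples. Everything in it is hypothetical: the graph contents, the function names, and especially `stand_in_model`, which is a placeholder for whatever language model a real system would call. It is not how any particular company's product works, only the general pattern of retrieving facts from a graph and conditioning generation on them.

```python
# Hypothetical personal knowledge graph: (subject, predicate, object) tuples.
personal_graph = {
    ("user", "hasAppointment", "dentist_tuesday"),
    ("user", "livesIn", "portland"),
}


def retrieve(graph, subject):
    """Pull every fact about one subject out of the graph."""
    return sorted(t for t in graph if t[0] == subject)


def build_prompt(question, facts):
    """Put the retrieved facts in front of the question."""
    lines = [f"{s} {p} {o}" for s, p, o in facts]
    return "Known facts:\n" + "\n".join(lines) + "\nQuestion: " + question


def stand_in_model(prompt):
    """Placeholder for a language model call (assumption: no real API here)."""
    return "MODEL(" + prompt + ")"


def answer(question, graph, subject="user"):
    """The language model is the interface; the graph is what it speaks for."""
    facts = retrieve(graph, subject)
    return stand_in_model(build_prompt(question, facts))


print(answer("what do I have on this week?", personal_graph))
```

The point of the sketch is where the value sits: the model is swappable, while the graph of personal data is what makes the answer "personalized," which is why the demand for more of that data follows.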
I pose Vulgar Linked Data as an alternative to the infrastructural pattern I call the Cloud Orthodoxy: rather than platforms operated by an informational priesthood, reorienting our public infrastructure efforts to support vernacular expression across heterogeneous #p2p mediums. This piece extends a prior work of mine, Decentralized Infrastructure for (Neuro)science, which has a more complete draft of what that might look like.
(I don't think you can pre-write threads on masto, so i'll post some thoughts as I write them under this) /1
As a technology, Knowledge Graphs are a particular configuration and deployment of the technologies of the semantic web. Though the technologies are heterogeneous and vary widely, the common architectural feature is treating data as a graph rather than as tables as in relational databases. These graphs are typically composed of triplet links or "triples" - subject-predicate-object tuples (again, this is heterogeneous) - that make use of controlled vocabularies or schemas.
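A minimal sketch of what "data as a graph of triples" means in practice, assuming plain Python tuples rather than any real semantic web library. The names (`alice`, `worksAt`, `match`) are made up for illustration, and real systems would use URIs and a controlled vocabulary instead of bare strings.

```python
# Each triple is a (subject, predicate, object) tuple; the graph is just a set of them.
triples = {
    ("alice", "knows", "bob"),
    ("alice", "worksAt", "acme"),
    ("bob", "worksAt", "acme"),
}


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Who works at acme? There is no table schema to consult, only links to follow.
print(match(triples, p="worksAt", o="acme"))
```

Unlike a relational table, nothing here fixes the set of "columns" in advance: adding a new kind of statement is just adding another tuple with a new predicate.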
These seemingly-ordinary data structures have a much longer and richer history in the semantic web. Initially, the idea was to supplement the ordinary "duplet" links of the web with triplets to make the then-radically new web of human-readable documents into something that could also be read by computers. The dream was a fluid, multiscale means of structuring information to bypass the need for platforms altogether - from personal to public information, we could directly exchange and publish information ourselves.
Needless to say, that didn't happen, and the capture of the web by platforms (with search prominent among them) blunted the idealism of the semantic web.
The essential feature of knowledge graphs that makes them coproductive with surveillance capitalism is how they allow for a much more fluid means of data integration. Most contemporary corporations are data corporations, and their operation increasingly requires integrating far-flung and heterogeneous datasets, often stitched together from decades of acquisitions. While they are of course not universal, and there is again a large amount of variation in their deployment and use, knowledge graphs power many of the largest information conglomerates. The graph structure of KGs as well as the semantic constraints that can be imposed by controlled ontologies and schemas make them particularly well-suited to the sprawling data conglomerate that typifies contemporary surveillance capitalism.
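A rough sketch of why graphs make integration so fluid, assuming two toy datasets from hypothetical acquisitions and a `sameAs` predicate linking identifiers (the datasets, identifiers, and function names are all invented for illustration). Merging is just a set union; one linking triple is enough to join records that were never designed to go together.

```python
# Two hypothetical datasets that were never designed to share a schema.
purchases = {("user:42", "bought", "item:9")}
legal_records = {("person:jane", "hasAddress", "addr:1")}

# One linking statement asserts that two identifiers refer to the same person.
links = {("user:42", "sameAs", "person:jane")}

# "Integration" is a set union: no migrations, no shared table layout.
merged = purchases | legal_records | links


def aliases(graph, node):
    """Collect every identifier connected to node by sameAs, in either direction."""
    seen = {node}
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if p != "sameAs":
                continue
            if s == current and o not in seen:
                seen.add(o)
                frontier.append(o)
            elif o == current and s not in seen:
                seen.add(s)
                frontier.append(s)
    return seen


def describe(graph, node):
    """Everything known about node under any of its identifiers."""
    ids = aliases(graph, node)
    return sorted((p, o) for s, p, o in graph if s in ids and p != "sameAs")


print(describe(merged, "user:42"))
```

A shopping record and a legal record end up in the same profile after adding a single triple, which is the property that makes the structure so well-suited to stitching together decades of acquisitions.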
I give a case study in RELX, parent of Elsevier and LexisNexis, among others, which is relatively explicit about how it operates as a gigantic graph of data with various overlay platforms.
These knowledge graph powered platform giants represent the capture of information infrastructures broadly, but what would public infrastructure look like? The notion of openness is complicated when it comes to the business models of information conglomerates. In adjacent domains of open source, peer production, and open standards, "openness" is used both to challenge and to reinforce systems of informational dominance.
In particular, Google's acquisition of the peer-production platform Freebase was the precipitating event that ushered in the era of knowledge graphs in the first place, and its tight relationship with its successor, Wikidata, is instructive of the role of openness: public information is crowdsourced to farm the commons and repackaged in derivative platforms.
The information conglomerates in multiple places have expressed a desire for "neutral" exchange schemas and technologies to be able to rent, trade, and otherwise link their proprietary schemas to make a gradient of "factual" public information to contextual information like how a particular company operates, through to personal information often obtained through surveillance. It looks like the NIH and the NSF are set to serve that role for several domains...
These two projects share a common design pattern: create authoritative schemas for a given domain, create a string of platforms to collect data under that schema, ingest and mine as much data as possible, provide access through some limited platform, etc. All very normal! This formulation is based on a very particular arrangement of power and agency, however, where like much of the rest of the platform web, some higher "developer" priesthood class designs systems for the rest of us to use. The utopian framing of universal platforms paradoxically strongly limits their use, being capable of only what the platform architects are able to imagine. The two agencies both innovate new funding mechanisms to operate these projects as "public-private" partnerships that further doom them to inevitable capture when the grant money runs out.
This is where the story starts to merge with the story of "AI." Since the dawn of the semantic web, there was a tension between vernacular expression and making things smoothly computable by autonomous "agents." That is a complicated history in its own right, but after >20 years today's "AI" technologies are starting to resemble the dreams of the latter kind of semantic web head.
The projects are both oriented towards creating knowledge graphs that power algorithmic, often natural language query interfaces. The NIH's biomedical translator project is one example: autonomous reasoning agents compute over data from text mining and other curated platforms to yield "serendipitous" emergent information from the graph. The harms of such an algorithmic health system are immediately clear, and have been richly problematized previously. The Translator's prototypes are happy to perform algorithmic conversion therapy, as the many places where violence is encoded in biomedical information systems are laundered into neatly-digestible recommendations.
Though the aims of the projects themselves dip into the colonial dream of the great graph of everything, the true harms for both of these projects come from what happens with the technologies after they end. Many information conglomerates are poised to pounce on the infrastructures built by the NIH and NSF projects, stepping in to integrate their work or buy the startups that spin off from them.
The NSF's Open Knowledge Network is much more explicitly bound to the national security and economic interests of the US federal government, intended to provide the infrastructure to power an "AI-driven future." That project is at a much earlier stage, but in its early sketches it promises to take the same patterns of knowledge-graphs plus algorithmic platforms and apply them to government, law enforcement, and a broad range of other domains.
This pattern of public graphs for private profits is well underway at existing companies like Google, and I assume the academics and engineers in both of these projects are operating with the best of intentions and perhaps playing a role they are unaware of.
ok we might not make it to an arXiv submission today, but the document is all prepped and ready to go except the abstract so we definitely will make tomorrow. phew. finally. #SurveillanceGraphs
sometimes big data solutionism jumps the shark and is just very funny
harnessing the vast amounts of data generated in every sphere of life and transforming them into useful, actionable information and knowledge is crucial to the efficient functioning of a modern society