A bit of an overview and then I'll get into some of the more specific arguments in a thread:
This piece is in four parts:
First I trace the mutation of the liberatory ambitions of the #SemanticWeb into #KnowledgeGraphs, an underappreciated component in the architecture of #SurveillanceCapitalism. This mutation plays out against the backdrop of the broader platform capture of the web, rendering us as consumer-users of information services rather than empowered people communicating over informational protocols.
I then show how this platform logic influences two contemporary public information infrastructure projects: the NIH's Biomedical Data Translator and the NSF's Open Knowledge Network. I argue that projects like these, while well intentioned, demonstrate the fundamental limitations of platformatized public infrastructure and create new capacities for harm through their enmeshment in and inevitable capture by information conglomerates. The dream of a seamless "knowledge graph of everything" is unlikely to deliver on the utopian promises made by techno-solutionists, but these projects do create new opportunities for algorithmic oppression -- automated conversion therapy, predictive policing, abuse of bureaucracy in "smart cities," etc. Given the framing of corporate knowledge graphs, these projects are poised to create facilitating technologies (which the info conglomerates themselves write about needing) for a new kind of interoperable corporate data infrastructure, where a gradient of public to private information is traded between "open" and quasi-proprietary knowledge graphs to power derivative platforms and services.
When approaching "AI" from the perspective of the semantic web and knowledge graphs, it becomes apparent that the new generation of #LLMs are intended to serve as interfaces to knowledge graphs. These "augmented language models" are joint systems that combine a language model as a means of interacting with some underlying knowledge graph, integrated in multiple places in the computing ecosystem: eg. mobile apps, assistants, search, and enterprise platforms. I concretize and extend prior criticism about the capacity for LLMs to concentrate power by capturing access to information in increasingly isolated platforms and expand surveillance by creating the demand for extended personalized data graphs across multiple systems from home surveillance to your workplace, medical, and governmental data.
I pose Vulgar Linked Data as an alternative to the infrastructural pattern I call the Cloud Orthodoxy: rather than platforms operated by an informational priesthood, reorienting our public infrastructure efforts to support vernacular expression across heterogeneous #p2p mediums. This piece extends a prior work of mine, Decentralized Infrastructure for (Neuro)science, which has a more complete draft of what that might look like.
(I don't think you can pre-write threads on masto, so i'll post some thoughts as I write them under this) /1
@jonny This is super interesting, especially the observations on the role of AI / LLMs in all this. Unfortunately, this is way too complicated for my BA students. If you know any publications that address some of the issues in a more digestible form, I would appreciate recommendations.
@mob which parts?! i am aware of some ppl writing about some subset of these ideas, but not aware of anyone else writing about the intersection of knowledge graphs & information conglomerates & language models
@jonny Amazing. I have been fixated on this since semweb became an idea: participating in open data and public good projects that faded away (people either went professional in a career-focused way or burnt out), encountering the priesthood in full force (some fully haughty, some brilliant but incomprehensible, with few useful bridges), and trying to spread the value and work up my own accessible tools.
@jonny love that hcommons accepted it with the "banned at arxiv.org" lol! I wanted to ask this earlier when you'd shared the draft, but are you the founder of Institute of Pirate Technology? 😁
@manisha
I guess they haven't formally accepted it since it is in review! so uhhh if they don't then I'll just move on to the next repository 🤷. all I need from them is a doi and semi-persistent storage of 1MB.
I can be said to be the founder of such an institute that is perpetually unfounded and has no formal claim to existence, yes;)
As a technology, Knowledge Graphs are a particular configuration and deployment of the technologies of the semantic web. Though the technologies are heterogeneous and vary widely, the common architectural feature is treating data as a graph rather than as tables as in relational databases. These graphs are typically composed of triplet links or "triples" - subject-predicate-object tuples (again, this is heterogeneous) - that make use of controlled vocabularies or schemas.
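(to make the data structure concrete, a minimal sketch in Python with rdflib - a real RDF library, though the people and relations here are invented:)

```python
# a minimal triple sketch using rdflib; the individuals are invented,
# but FOAF is a real controlled vocabulary.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# each statement is a subject-predicate-object triple
g.add((EX.alice, RDF.type, FOAF.Person))       # "alice is a Person"
g.add((EX.alice, FOAF.knows, EX.bob))          # "alice knows bob"
g.add((EX.alice, FOAF.name, Literal("Alice"))) # "alice is named 'Alice'"

# the shared vocabulary (FOAF) is what lets independently-made graphs
# agree on what "Person" and "knows" mean
print(g.serialize(format="turtle"))
```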
These seemingly-ordinary data structures have a much longer and richer history in the semantic web. Initially, the idea was to supplement the ordinary "duplet" links of the web with triplets to make the then-radically new web of human-readable documents into something that could also be read by computers. The dream was a fluid, multiscale means of structuring information to bypass the need for platforms altogether - from personal to public information, we could directly exchange and publish information ourselves.
Needless to say, that didn't happen, and the capture of the web by platforms (with search prominent among them) blunted the idealism of the semantic web.
The essential feature of knowledge graphs that makes them coproductive with surveillance capitalism is how they allow for a much more fluid means of data integration. Most contemporary corporations are data corporations, and their operation increasingly requires integrating far-flung and heterogeneous datasets, often stitched together from decades of acquisitions. While they are of course not universal, and there is again a large amount of variation in their deployment and use, knowledge graphs power many of the largest information conglomerates. The graph structure of KGs as well as the semantic constraints that can be imposed by controlled ontologies and schemas make them particularly well-suited to the sprawling data conglomerate that typifies contemporary surveillance capitalism.
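(a sketch of what that fluidity looks like in practice - the datasets and vocabularies below are invented, but the mechanics are real: graphs merge by simple union, and one added link stitches records together across schemas:)

```python
# invented example of graph-based data integration: two datasets
# collected under different schemas merge with no migration step.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import OWL

RETAIL = Namespace("http://example.org/retail/")
ADTECH = Namespace("http://example.org/adtech/")

purchases = Graph()
purchases.add((RETAIL.customer42, RETAIL.bought, Literal("pregnancy test")))

tracking = Graph()
tracking.add((ADTECH.device7, ADTECH.visitedPlace, Literal("clinic")))

merged = purchases + tracking  # graph union: just the set of all triples

# a single assertion links the two identities across datasets
merged.add((RETAIL.customer42, OWL.sameAs, ADTECH.device7))

for triple in merged:
    print(triple)
```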
I give a case study in RELX, parent of Elsevier and LexisNexis, among others, which is relatively explicit about how it operates as a gigantic graph of data with various overlay platforms.
These knowledge graph powered platform giants represent the capture of information infrastructures broadly, but what would public infrastructure look like? The notion of openness is complicated when it comes to the business models of information conglomerates. In adjacent domains of open source, peer production, and open standards, "openness" is used both to challenge and to reinforce systems of informational dominance.
In particular, Google's acquisition of the peer-production platform Freebase was the precipitating event that ushered in the era of knowledge graphs in the first place, and its tight relationship with its successor, Wikidata, is instructive about the role of openness: public information is crowdsourced to farm the commons and repackaged in derivative platforms.
The information conglomerates have in multiple places expressed a desire for "neutral" exchange schemas and technologies to be able to rent, trade, and otherwise link their proprietary schemas - making a gradient from "factual" public information, through contextual information like how a particular company operates, to personal information often obtained through surveillance. It looks like the NIH and the NSF are set to serve that role for several domains...
These two projects share a common design pattern: create authoritative schemas for a given domain, create a string of platforms to collect data under that schema, ingest and mine as much data as possible, provide access through some limited platform, etc. All very normal! This formulation is based on a very particular arrangement of power and agency, however, where, like much of the rest of the platform web, some higher "developer" priesthood class designs systems for the rest of us to use. The utopian framing of universal platforms paradoxically strongly limits their use: they are capable of only what the platform architects are able to imagine. Both agencies also innovate new funding mechanisms to operate these projects as "public-private" partnerships, which further dooms them to inevitable capture when the grant money runs out.
This is where the story starts to merge with the story of "AI." Since the dawn of the semantic web, there has been a tension between vernacular expression and making things smoothly computable by autonomous "agents." That is a complicated history in its own right, but after >20 years, today's "AI" technologies are starting to resemble the dreams of the latter kind of semantic web head.
The projects are both oriented towards creating knowledge graphs that power algorithmic, often natural language, query interfaces. The NIH's Biomedical Data Translator is one example: autonomous reasoning agents compute over data from text mining and other curated platforms to yield "serendipitous" emergent information from the graph. The harms of such an algorithmic health system are immediately clear, and have been richly problematized previously. The Translator's prototypes are happy to perform algorithmic conversion therapy, as the many places where violence is encoded in biomedical information systems are laundered into neatly-digestible recommendations.
Though the aims of the projects themselves dip into the colonial dream of the great graph of everything, the true harms for both of these projects come from what happens to the technologies after they end. Many information conglomerates are poised to pounce on the infrastructures built by the NIH and NSF projects, stepping in to integrate their work or buy the startups that spin off from them.
The NSF's Open Knowledge Network is much more explicitly bound to the national security and economic interests of the US federal government, intended to provide the infrastructure to power an "AI-driven future." That project is at a much earlier stage, but in its early sketches it promises to take the same patterns of knowledge-graphs plus algorithmic platforms and apply them to government, law enforcement, and a broad range of other domains.
This pattern of public graphs for private profits is well underway at existing companies like Google, and I assume the academics and engineers in both of these projects are operating with the best of intentions and perhaps playing a role they are unaware of.
@jonny Hey there Jonny, enjoyed the work so far (admittedly I skipped to the end to get to the "Vulgar" recommendations - will have to loop back to the rest)
I would be really interested to know how these sorts of technologies might fit into the vision! (As a relative RDF newb, I can't yet decipher for myself)
@photocyte @knowledgepixels @nanopub @tkuhn
yes I have seen! I reference nanopublications very obliquely earlier in the piece. definitely think it's a step in the right direction, but it needs generalization - rather than just the "publication" level of simple factual assertions, being able to also express raw data; making tooling so researchers and the general public can create, fork, and merge their own schemas, design their own views on the underlying triples, etc. then we need a p2p protocol for something akin to linked data fragments instead of being fashioned as a platform, and we are really cooking ❤️
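(for the curious, a loose sketch of the linked data fragments idea - peers exchange triples matching a pattern rather than asking a platform to run whole queries for them; the names and data are invented:)

```python
# loose sketch of the "linked data fragments" idea: peers serve
# triples matching a (subject, predicate, object) pattern, rather
# than a platform running whole queries on everyone's behalf.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")

def fragment(graph: Graph, s=None, p=None, o=None):
    """return all triples matching a pattern (None = wildcard) -
    the basic unit a peer could serve over some p2p transport."""
    return list(graph.triples((s, p, o)))

g = Graph()
g.add((EX.paper1, EX.cites, EX.paper2))
g.add((EX.paper3, EX.cites, EX.paper2))

# "who cites paper2?" - answerable by any peer holding matching triples
print(fragment(g, p=EX.cites, o=EX.paper2))
```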
@photocyte @knowledgepixels @nanopub @tkuhn
put another way - need to integrate with and support existing practice, meet people where they are. add LD tooling into existing editors and whatnot. but I do like the idea of nanopubs and think it's one of many places to work towards.
@photocyte @knowledgepixels @nanopub @tkuhn
as I argue in this piece, the ecosystem matters too, so the platform -> p2p part is not as trivial a shift as it might appear; it has to be taken in context with the larger systems it's embedded within. the rest of the piece is about the dangers of this era of linked data platforms and how they can be trivially captured and repurposed for, uh, off-target aims.
@photocyte thanks for connecting these activities. @jonny, this is very interesting! I haven't managed yet to read your piece in full, but it seems to me we at @knowledgepixels are moving in the same direction. We see nanopubs as a container format for all kinds of statements (scientific, meta, opinions, etc.), built on a decentralized p2p network that is evolving. It's like going back to the original Semantic Web vision, but making it reliable and trust-aware on top so it really happens this time.
@tkuhn @photocyte @knowledgepixels well you have my attention - i looked but didn't see the p2p part. that's exactly where i'm headed, and if you are too then we should talk, because i'm about to start on a p2p protocol and if you are already working on one i'll just join up with y'all
@bengo @photocyte @jonny @knowledgepixels Yes! That's exactly the type of content we want to deal with in the near future. I didn't know of the specific term "discourse graph". Thanks for sharing this. Would be great to talk!
@tkuhn @bengo @photocyte @jonny @knowledgepixels
Love this thread synthesizing so many cool directions! Adding our own perspective to the mix (along with @InferenceActive on birdsite): https://osf.io/preprints/metaarxiv/9nb3u/
TL;DR
Trying to make the case that attention/sensemaking data (e.g., what researchers are attending to and their assessments of content) are an important kind of nano-scientific knowledge that gets extracted by platforms instead of helping to power content curation and discovery networks