jonny
@jonny@neuromatch.social

Digital infrastructure 4 a cooperative internet. social/technological systems & systems neuro with some light dynamical systems & crush on topology on the side.

writin bout the surveillance state n makin some p2p

science/work-oriented alt of @jonny

information is political, science is labor

This is a public account, quotes/boosts/links are always ok <3.


julialuna, to random

clueless techbros doing cryptography certainly is one of the things

jonny,

@julialuna
it is one of the places where the fash, with their cultures of extreme masculinity, always fail in hilarious ways: simply asking "does this seem right?" is often enough to diagnose problems, but that is impossible for them. I imagine the heights of tech bro culture are similar

jonny, to random

Glad to formally release my latest work - Surveillance Graphs: Vulgarity and Cloud Orthodoxy in Linked Data Infrastructures.

web: https://jon-e.net/surveillance-graphs
hcommons: https://doi.org/10.17613/syv8-cp10

A bit of an overview and then I'll get into some of the more specific arguments in a thread:

This piece is in three parts:

First I trace the mutation of the liberatory ambitions of the #SemanticWeb into #KnowledgeGraphs, an underappreciated component in the architecture of #SurveillanceCapitalism. This mutation plays out against the backdrop of the broader platform capture of the web, rendering us as consumer-users of information services rather than empowered people communicating over informational protocols.

I then show how this platform logic influences two contemporary public information infrastructure projects: the NIH's Biomedical Data Translator and the NSF's Open Knowledge Network. I argue that projects like these, while well intentioned, demonstrate the fundamental limitations of platformatized public infrastructure and create new capacities for harm by their enmeshment in and inevitable capture by information conglomerates. The dream of a seamless "knowledge graph of everything" is unlikely to deliver on the utopian promises made by techno-solutionists, but these projects do create new opportunities for algorithmic oppression -- automated conversion therapy, predictive policing, abuse of bureaucracy in "smart cities," etc. Given the framing of corporate knowledge graphs, these projects are poised to create facilitating technologies (that the info conglomerates write about needing themselves) for a new kind of interoperable corporate data infrastructure, where a gradient of public to private information is traded between "open" and quasi-proprietary knowledge graphs to power derivative platforms and services.

When approaching "AI" from the perspective of the semantic web and knowledge graphs, it becomes apparent that the new generation of are intended to serve as interfaces to knowledge graphs. These "augmented language models" are joint systems that combine a language model as a means of interacting with some underlying knowledge graph, integrated in multiple places in the computing ecosystem: eg. mobile apps, assistants, search, and enterprise platforms. I concretize and extend prior criticism about the capacity for LLMs to concentrate power by capturing access to information in increasingly isolated platforms and expand surveillance by creating the demand for extended personalized data graphs across multiple systems from home surveillance to your workplace, medical, and governmental data.

I pose Vulgar Linked Data as an alternative to the infrastructural pattern I call the Cloud Orthodoxy: rather than platforms operated by an informational priesthood, reorienting our public infrastructure efforts to support vernacular expression across heterogeneous mediums. This piece extends a prior work of mine, Decentralized Infrastructure for (Neuro)science, which has a more complete draft of what that might look like.

(I don't think you can pre-write threads on masto, so i'll post some thoughts as I write them under this) /1

jonny,

As a technology, Knowledge Graphs are a particular configuration and deployment of the technologies of the semantic web. Though the technologies are heterogeneous and vary widely, the common architectural feature is treating data as a graph rather than as tables as in relational databases. These graphs are typically composed of triplet links or "triples" - subject-predicate-object tuples (again, this is heterogeneous) - that make use of controlled vocabularies or schemas.
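For concreteness, here is a minimal sketch of that triple structure in Python with rdflib; the URIs and the EX vocabulary are made up for illustration:

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical controlled vocabulary
EX = Namespace("http://example.org/vocab/")

g = Graph()
alice = URIRef("http://example.org/people/alice")
bob = URIRef("http://example.org/people/bob")

# Each statement is one subject-predicate-object triple
g.add((alice, EX.knows, bob))
g.add((alice, EX.name, Literal("Alice")))

# Serialized as N-Triples: one triple per line
print(g.serialize(format="nt"))
```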

These seemingly-ordinary data structures have a much longer and richer history in the semantic web. Initially, the idea was to supplement the ordinary "duplet" links of the web with triplets to make the then-radically new web of human-readable documents into something that could also be read by computers. The dream was a fluid, multiscale means of structuring information to bypass the need for platforms altogether - from personal to public information, we could directly exchange and publish information ourselves.

Needless to say, that didn't happen, and the capture of the web by platforms (with search prominent among them) blunted the idealism of the semantic web.

/2

The significance of the relationship between search, the semantic web, and what became knowledge graphs is less widely appreciated. The semantic web was initially an alternative to monolithic search engine platforms - or to platforms in general [15]. It imagined the use of triplet links and shared ontologies at the protocol level as a way of organizing the information on the web into a richly explorable space: rather than needing to rely on a search bar, one could traverse a structured graph of information [16, 17] to find what one needed without mediation by a third party. The Semantic Web project was an attempt to supplement the arbitrary power to express human-readable information in linked documents with computer-readable information. It imagined a linked and overlapping set of schemas, ranging from locally expressive vocabularies used among small groups of friends through globally shared, logically consistent ontologies. The semantic web was intended to evolve fluidly, like language, with cultures of meaning meshing and separating at multiple scales [18, 19, 20]:
[blockquote]: Locally defined languages are easy to create, needing local consensus about meaning: only a limited number of people have to share a mental pattern of relationships which define the meaning. However, global languages are so much more effective at communication, reaching the parts that local languages cannot. […] So the idea is that in any one message, some of the terms will be from a global ontology, some from subdomains. The amount of data which can be reused by another agent will depend on how many communities they have in common, how many ontologies they share. In other words, one global ontology is not a solution to the problem, and a local subdomain is not a solution either. But if each agent has uses a mix of a few ontologies of different scale, that is forms a global solution to the problem. [18]

[blockquote]: The Semantic Web, in naming every concept simply by a URI, lets anyone express new concepts that they invent with minimal effort. Its unifying logical language will enable these concepts to be progressively linked into a universal Web. [19]
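A toy sketch of that multi-scale mixing with rdflib, where one triple uses a global vocabulary (schema.org) and another uses a hypothetical local one:

```python
from rdflib import Graph, Literal, Namespace, URIRef

SCHEMA = Namespace("https://schema.org/")            # global, widely shared vocabulary
LOCAL = Namespace("http://my.group.example/terms/")  # hypothetical small-group vocabulary

g = Graph()
me = URIRef("http://my.group.example/people/jonny")

# Any schema.org-aware agent can reuse this triple...
g.add((me, SCHEMA.name, Literal("jonny")))
# ...but only people who share the LOCAL vocabulary can interpret this one
g.add((me, LOCAL.crushOnTopology, Literal("yes")))
```

How much of a message another agent can reuse depends on how many of these vocabularies the two parties share: exactly the "mix of a few ontologies of different scale" from the quote above.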
The form of the semantic web that emerged as “Knowledge Graphs” flipped the vision of a free and evolving internet on its head. The mutation from “Linked Open Data” [16] to “Knowledge Graphs” is a shift in meaning from a public and densely linked web of information from many sources to a proprietary information store used to power derivative platforms and services. The shift isn’t quite so simple as a “closure” of a formerly open resource — we’ll return to the complex role of openness in a moment. It is closer to an enclosure, a domestication of the dream of the Semantic Web. A dream of a mutating, pluralistic space of communication, where we were able to own and change and create the information that structures our digital lives, was reduced to a ring of platforms that give us precisely as much agency as is needed to keep us content in our captivity. Links that had all the expressive power of utterances, questions, hints, slander, and lies were reduced to mere facts. We were recast from our role as people creating a digital world to consumers of subscriptions and services. The artifacts that we create for and with and between each other as the substance of our lives online were yoked to the acquisitive gaze of the knowledge graph as content to be mined. We vulgar commoners, we data subjects, are not allowed to touch the graph — even if it is built from our disembodied bits.

jonny,

The essential feature of knowledge graphs that makes them coproductive with surveillance capitalism is how they allow for a much more fluid means of data integration. Most contemporary corporations are data corporations, and their operation increasingly requires integrating far-flung and heterogeneous datasets, often stitched together from decades of acquisitions. While they are of course not universal, and there is again a large amount of variation in their deployment and use, knowledge graphs power many of the largest information conglomerates. The graph structure of KGs as well as the semantic constraints that can be imposed by controlled ontologies and schemas make them particularly well-suited to the sprawling data conglomerate that typifies contemporary surveillance capitalism.

I give a case study in RELX, parent of Elsevier and LexisNexis, among others, which is relatively explicit about how it operates as a gigantic graph of data with various overlay platforms.

/3

In contrast, merging graphs is more straightforward - the data is just triplets, so in an idealized case [footnote 9] it is possible to just concatenate them and remove duplicates (eg. for a short example, see [35, 36]). The graph can be operated on locally, with more global coordination provided by ontologies and schemas, which themselves have a graph structure [37]. Discrepancies between graphlike schemas can be resolved by, you guessed it, making more graph to describe the links and transformations between them. Long-range operations between data are part of the basic structure of a graph - just traverse nodes and edges until you get to where you need to go - and the semantic structure of the graph provides additional constraints on that traversal. Again, a technical description is out of scope here, and graphs are not magic, but they are well-suited to merging, modifying, and analyzing large quantities of heterogeneous data [footnote 10]. So if you are a data broker, and you just made a hostile acquisition of another data broker who has additional surveillance information to fill the profiles of the people in your existing dataset, you can just stitch those new properties on like a fifth arm on your nightmarish data Frankenstein.
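A minimal sketch of that idealized merge with rdflib (the brokers and properties are invented for illustration):

```python
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab/")
alice = URIRef("http://example.org/people/alice")

broker_a = Graph()
broker_a.add((alice, EX.homeAddress, Literal("123 Example St")))

broker_b = Graph()
broker_b.add((alice, EX.homeAddress, Literal("123 Example St")))  # duplicate triple
broker_b.add((alice, EX.carTelematics, Literal("hard braking, 3x/week")))  # newly acquired property

merged = broker_a + broker_b  # graph union
print(len(merged))  # 2 -- a Graph is a set of triples, so the duplicate collapses
```

Because a graph is just a set of triples, the union-with-deduplication is a single operation; contrast that with reconciling two relational schemas column by column.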
What does this look like in practice? While in a bygone era Elsevier was merely a rentier holding publicly funded research hostage for profit, its parent company RELX is paradigmatic of the transformation of a more traditional information rentier into a sprawling, multimodal surveillance conglomerate (see [38]). RELX proudly describes itself as a gigantic haunted graph of data:

[blockquote]: Technology at RELX involves creating actionable insights from big data – large volumes of data in different formats being ingested at high speeds. We take this high-quality data from thousands of sources in varying formats – both structured and unstructured. We then extract the data points from the content, link the data points and enrich them to make it analysable. Finally, we apply advanced statistics and algorithms, such as machine learning and natural language processing, to provide professional customers with the actionable insights they need to do their jobs. We are continually building new products and data and technology platforms, re-using approaches and technologies across the company to create platforms that are reliable, scalable and secure. Even though we serve different segments with different content sets, the nature of the problems solved and the way we apply technology has commonalities across the company. [39]

Alt text for figure: https://jon-e.net/surveillance-graphs/#in-its-2022-annual-report-relx-describes-its-business-model-as-i
Text from: https://jon-e.net/surveillance-graphs/#derivative-platforms-beget-derivative-platforms-as-each-expands

Derivative platforms beget derivative platforms, as each expands the surface of dependence and provides new opportunities for data capture. Its integration into clinical systems by way of reference material is growing to include electronic health record (EHR) systems, and they are “developing clinical decision support applications […] leveraging [their] proprietary health graph” [39]. Similarly, their integration into Apple’s watchOS to track medications indicates their interest in directly tracking personal medical data. That’s all within the biomedical sciences, but RELX’s risk division also provides “comprehensive data, analytics, and decision tools for […] life insurance carriers” [39], so while we will never have the kind of external visibility into its infrastructure to say for certain, it’s not difficult to imagine combining its diverse biomedical knowledge graph with personal medical information in order to sell risk-assessment services to health and life insurance companies. LexisNexis has personal data enough to serve as an “integral part” of the United States Immigration and Customs Enforcement’s (ICE) arrest and deportation program [42, 43], including dragnet location data [44], driving behavior data from internet-connected cars [45], and payment and credit data as just a small sample from its large catalogue [46] [...]

jonny,

These knowledge graph powered platform giants represent the capture of information infrastructures broadly, but what would public infrastructure look like? The notion of openness is complicated when it comes to the business models of information conglomerates. In adjacent domains of open source, peer production, and open standards, "openness" is used both to challenge and to reinforce systems of informational dominance.

In particular, Google's acquisition of the peer-production platform Freebase was the precipitating event that ushered in the era of knowledge graphs in the first place, and its tight relationship with its successor, Wikidata, is instructive of the role of openness: public information is crowdsourced to farm the commons and repackaged in derivative platforms.

The information conglomerates have in multiple places expressed a desire for "neutral" exchange schemas and technologies that would let them rent, trade, and otherwise link their proprietary schemas, forming a gradient from "factual" public information, through contextual information like how a particular company operates, to personal information often obtained through surveillance. It looks like the NIH and the NSF are set to serve that role for several domains...

/4

text from https://jon-e.net/surveillance-graphs/#%E2%80%9Cpeer-production%E2%80%9D-models-a-more-generic-term-for-public-collabor

“Peer production” models, a more generic term for public collaboration that includes FOSS, have similar discontents. The related term “crowdsource [footnote 13]” quite literally describes a patronizing means of harvesting free labor via some typically gamified platform. Wikipedia is perhaps the most well-known example of peer production [footnote 14], and it too struggles with its position as a resource to be harvested by information conglomerates. In 2015, the increasing prevalence of Google’s information boxes caused a substantial decline in Wikipedia page views [68, 69] as its information was harvested into Google’s knowledge graph, and a “will she, won’t she” search engine arguably intended to avoid dependence on Google was at the heart of its 2014-2016 leadership crisis [70, 71]. While shuttering Freebase, Google donated a substantial amount of money to kick-start its successor, Wikidata [72], presumably as a means of crowdsourcing the curation of its knowledge graph [73, 74, 75].

[footnote 13]: For critical work on crowdsourcing in the context of “open science,” see [229], and in the semantic web see [230]
[footnote 14]: I have written about the peculiar structure of Wikipedia among wikis previously, section 3.4.1 - “The Wiki Way” [1]
Clearly, on its own, mere “openness” is no guarantee of virtue, and socio-technological systems must always be evaluated in their broader context: what is open? why? who benefits? Open source, open standards, and peer production models do not inherently challenge the rent-seeking behavior of information conglomerates, but can instead facilitate it. In particular, the maintainers of corporate knowledge graphs want to reduce labor duplication by making use of some public knowledge graph that they can then “add value” to with shades of proprietary and personal data (emphasis mine):

[blockquote]: In a case like IBM clients, who build their own custom knowledge graphs, the clients are not expected to tell the graph about basic knowledge. For example, a cancer researcher is not going to teach the knowledge graph that skin is a form of tissue, or that St. Jude is a hospital in Memphis, Tennessee. This is known as “general knowledge,” captured in a general knowledge graph. The next level of information is knowledge that is well known to anybody in the domain—for example, carcinoma is a form of cancer or NHL more often stands for non-Hodgkin lymphoma than National Hockey League (in some contexts it may still mean that—say, in the patient record of an NHL player). The client should need to input only the private and confidential knowledge or any knowledge that the system does not yet know. [26]
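A hedged sketch of the layering that quote describes, with rdflib: a hypothetical confidential client graph unioned with a public "general knowledge" graph, so that a single query joins across the boundary. All names here are invented:

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/med/")  # invented vocabulary

general_kg = Graph()  # the public, "general knowledge" layer
general_kg.add((EX.carcinoma, EX.isA, EX.cancer))

private_kg = Graph()  # the client's confidential layer
private_kg.add((EX.patient_1234, EX.diagnosedWith, EX.carcinoma))

# One query joins across the public/private boundary
combined = general_kg + private_kg
results = combined.query("""
    PREFIX ex: <http://example.org/med/>
    SELECT ?patient WHERE {
        ?patient ex:diagnosedWith ?dx .
        ?dx ex:isA ex:cancer .
    }
""")
for (patient,) in results:
    print(patient)  # -> http://example.org/med/patient_1234
```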
Having such standards be under the stewardship of ostensibly neutral and open third parties provides cover for powerful actors exerting their influence and helps overcome the initial energy barrier to realizing network effects from their broad use [83, 84]. Peter Mika, the director of Semantic Search at Yahoo Labs, describes this need for third-party intervention in domain-specific standards:

[blockquote]: A natural next step for Knowledge Graphs is to extend beyond the boundaries of organisations, connecting data assets of companies along business value chains. This process is still at an early stage, and there is a need for trade associations or industry-specific standards organisations to step in, especially when it comes to developing shared entity identifier schemes. [85]

As with search, we should be particularly wary of information infrastructures that are technically open [footnote 17] but embed design logics that preserve the hegemony of the organizations that have the resources to make use of them. The existing organization of industrial knowledge graphs as chimeric “data + compute” models gives a hint at what we might look for in public knowledge graphs: the data is open, but to make use of it we have to rely on some proprietary algorithm or cloud infrastructure.

[footnote 17]: Go ahead, try and make your own web crawler to compete with Google - all the information is just out there in public on the open web!

jonny,

@risottobias hold that thought! it gets tricky

jonny,

These two projects share a common design pattern: create authoritative schemas for a given domain, create a string of platforms to collect data under that schema, ingest and mine as much data as possible, provide access through some limited platform, etc. All very normal! This formulation is based on a very particular arrangement of power and agency, however, where, like much of the rest of the platform web, some higher "developer" priesthood class designs systems for the rest of us to use. The utopian framing of universal platforms paradoxically strongly limits their use, since they are capable of only what the platform architects are able to imagine. The two agencies both innovate new funding mechanisms to operate these projects as "public-private" partnerships, which further dooms them to inevitable capture when the grant money runs out.

This is where the story starts to merge with the story of "AI." Since the dawn of the semantic web, there has been a tension between vernacular expression and making things smoothly computable by autonomous "agents." That is a complicated history in its own right, but after more than 20 years, today's "AI" technologies are starting to resemble the dreams of the latter kind of semantic web head.

The projects are both oriented towards creating knowledge graphs that power algorithmic, often natural-language query interfaces. The NIH's Biomedical Data Translator project is one example: autonomous reasoning agents compute over data from text mining and other curated platforms to yield "serendipitous" emergent information from the graph. The harms of such an algorithmic health system are immediately clear, and have been richly problematized previously. The Translator's prototypes are happy to perform algorithmic conversion therapy, as the many places where violence is encoded in biomedical information systems are laundered into neatly-digestible recommendations.

/5

If the graph encodes being transgender as a disease, it is not farfetched to imagine the ranking system attempting to “cure” it. A seemingly pre-release version of the translator’s query engine, ARAX, does just that: in a query for entities with a biolink:treats link to gender dysphoria, it ranks the standard therapeutics [105, 106] Testosterone and Estradiol 6th and 10th of 11, respectively — behind a recommendation for Lithium (4th) and Pimozide (5th) due to an automated text scrape of two conversion therapy papers [footnote 29]. Queries to ARAX for treatments for gender identity disorder helpfully yielded “zinc” and “water,” offering a paper from the translator group that describes automated drug recommendation as the only provenance [107]. A query for treatments for DOID:1233 “transvestism” was predictably troubling, again prescribing conversion therapy from automated scrapes of outdated and harmful research. The ROBOKOP query engine behaved similarly, answering a query for genes associated with gender dysphoria with exclusively trivial or incorrect responses [footnote 30].

[footnote 29]: as well as a recommendation for “date allergenic extract” from a misinterpretation of “to date” in the abstract of a paper that reads “Cross-sex hormonal treatment (CHT) used for gender dysphoria (GD) could by itself affect well-being without the use of genital surgery; however, to date, there is a paucity of studies investigating the effects of CHT alone”
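A toy illustration of the underlying mechanism (emphatically not the Translator's actual stack): once a biolink:treats edge is ingested from a bad scrape, it answers queries like any other "fact," with its provenance nowhere in sight. The graph contents and namespace here are invented; only the biolink:treats predicate is from the real Biolink model:

```python
from rdflib import Graph, Namespace

BIOLINK = Namespace("https://w3id.org/biolink/vocab/")
EX = Namespace("http://example.org/kg/")  # toy graph, not the Translator's

kg = Graph()
# Whatever the text-mining pipeline scraped is now a "fact" in the graph:
kg.add((EX.lithium, BIOLINK.treats, EX.gender_dysphoria))

results = kg.query("""
    PREFIX biolink: <https://w3id.org/biolink/vocab/>
    SELECT ?treatment WHERE { ?treatment biolink:treats ?condition . }
""")
for (treatment,) in results:
    print(treatment)  # the scrape's error surfaces as a recommendation
```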
It is critically important to understand that with an algorithmic, graph-based precision medicine system like this, harm can occur even without intended malice. The power of the graph model for precision medicine is precisely its ability to make use of the extended structure of the graph [footnote 31]. The “value added” by the personalized biomedical graph is being able to incorporate the patient’s personal information like genetics, environment, and comorbidities into diagnosis and treatment. So, harmful information embedded within a graph — like transness being a disease in search of a cure — means the system either a) incorporates that harm into its outputs for seemingly unrelated queries or b) doesn’t work. This simultaneously explodes and obscures the risk surface for medically marginalized people: the violence historically encoded in mainstream medical practices and ontologies (eg. [104, 109], among many), incorrectly encoded information like that from automated text mining, explicitly adversarial information injected into the graph through some crowdsourcing portal like this one [110], and so on, all presented as an ostensibly “neutral” informatics platform. Each of these sources of harm could influence both medical care and biomedical research in ways that even a well-meaning clinician might not be able to recognize.
The risk of harm is again multiplied by the potential for harmful outputs of a biomedical knowledge graph system to trickle through medical practice and re-enter as training data. The Consortium also describes the potential for ranking algorithms to be continuously updated based on usage or results in research or clinical practice [footnote 32] [87]. Existing harm in medical practice, amplified by any induced by the Translator system, could then be re-encoded as implicit medical consensus in an opaque recommendation algorithm. There is, of course, no unique “loss function” to evaluate health. One belief system’s vision of health is demonic pathology in another. Say an insurance company uses the clinical recommendations of some algorithm built off the Translator’s graph to evaluate its coverage of medical procedures. This gives it license to lower its bottom line under cover of some seemingly objective but fundamentally unaccountable algorithm. There is no need for speculation: Cigna already does this [111]. Could a collection of anti-abortion clinics giving one star to abortion in every case meaningfully influence whether abortion is prescribed or covered? Why not? Who moderates the graph?

[footnote 32]: “The Reasoners then return ranked and scored potential translations with provenance and supporting evidence. The user is then able to evaluate the translations and supporting evidence and provide feedback to the Reasoners, thus promoting continuous improvement..."

jonny,

Though the aims of the projects themselves dip into the colonial dream of the great graph of everything, the true harms of both projects come from what happens with the technologies after the projects end. Many information conglomerates are poised to pounce on the infrastructures built by the NIH and NSF projects, stepping in to integrate their work or buy the startups that spin off from them.

The NSF's Open Knowledge Network is much more explicitly bound to the national security and economic interests of the US federal government, intended to provide the infrastructure to power an "AI-driven future." That project is at a much earlier stage, but in its early sketches it promises to take the same pattern of knowledge graphs plus algorithmic platforms and apply it to government, law enforcement, and a broad range of other domains.

This pattern of public graphs for private profits is well underway at existing companies like Google, and I assume the academics and engineers in both of these projects are operating with the best of intentions and perhaps playing a role they are unaware of.

/6

jonny, (edited )

(i need to take a break before the last section because it gets pretty hectic talking about the language models. brb)

edit: I'm not feeling well and have to make slides for lab meeting tmrw. I shall return then.

jonny,

@manisha
I guess they haven't formally accepted it since it is in review! so uhhh if they don't then I'll just move on to the next repository 🤷. all I need from them is a doi and semi-persistent storage of 1MB.

I can be said to be the founder of such an institute that is perpetually unfounded and has no formal claim to existence, yes;)

jonny,

@manisha
more details on the IoPT or InstOPiraTe or PiraTech Institute coming soon

jonny,

@manisha
it is a treasure :)

jonny, to random

anakin skywalker voice: now this is pettiness

jonny,

and with that i upload to hcommons

jonny,

@lili it is "not a content type that they cover," and they won't consider my appeal until it is published in a journal lmao

jonny,

@saggiotipo
hell ya bb

aris, to random

I can't believe I'm reading that kind of shit in 2023. Just fix your stack overflow. We don't need to understand "chess" or "the goal of the project" to have a negative opinion on your shit response.
https://github.com/official-stockfish/Stockfish/pull/4558

jonny,

@aris
Jesus what dickish behavior

jonny, to random

today I became lightly anti-preprint and even more pro-self publishing but in my defense they started it

jonny,

(I am being petty and joking btw idk how well that translates)

jonny, (edited ) to random

arxiv is now my enemy. they will take my preprint only after it is a postprint. this is a completely ad-hoc moderation decision as far as I can tell, mentioned nowhere in their rules or guidelines.

edit: apparently it is in their guidelines, but imo the moderation decision is indicative of the deeply conservative nature of preprints vs self publishing

jonny,

@theruran
seems that way to me. you can point to whatever thousand other papers that are less substantive and less in scope than my work and the commonality between them is they pose no challenge to any system whatever

jonny,

@cnsyoung
we're going to hcommons!!!

jonny,

@MsPhelps
aha I must have missed that. the application of the discretion is revealing to say the least.

jonny,

also tbc I am speaking hyperbolically when I say "my enemy"; I am just going to not send stuff there lol.

jonny, to random

whatever atragon is, it is easily confused with a small fish

jonny,

@flancian
up 2 u :) I like the hashtag archive but do think they probably serve different purposes?
