
BenjaminHan

@BenjaminHan@sigmoid.social

Working on natural language, knowledge, reasoning, machine learning, and AI at a fruity company.

Husband, father, runner, German learner, piano player. A curious soul living in #PacificNorthwest (WA US).

Running 05/25/18-05/19/24 (distance, # of runs, best time, pace/mile, date, source):

5K 645 21:34 6'56" 4/20/24 Strava
10K 97 48:59 7'52" 5/16/24 Strava
15K 4 1:16:05 8’10” 5/19/24 Strava
HM 25 1:48:25 8’16” 5/19/24 Strava
M 7 3:44:58 8'35" 3/24/24 AppleW

#nlp #nlproc #knowledgeGraphs #ai #running #classicalMusic


BenjaminHan, to LLMs

1/

When performing reasoning or generating code, do LLMs really understand what they are doing, or do they just memorize? Several new results paint a not-so-rosy picture.

The authors in [1] are interested in testing LLMs on “semantic” vs. “symbolic” reasoning: the former involves reasoning with language-like input, and the latter is reasoning with abstract symbols.

BenjaminHan,

2/

They use a symbolic dataset and a semantic dataset to test the models’ memorization and reasoning abilities (screenshot 1). For each dataset they create a corresponding one in the other modality: e.g., they replace the natural-language labels of the relations and entities with abstract symbols to produce a symbolic version of a semantic dataset (screenshot 2).

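To make the setup concrete, here is a minimal sketch (made-up triples, not the paper's data or code) of how a semantic relational dataset can be turned into its symbolic counterpart by replacing every entity and relation label with an opaque token:

```python
# Hypothetical illustration (made-up triples, not the paper's data or code):
# turn a "semantic" relational dataset into a "symbolic" one by replacing
# every natural-language entity and relation label with an opaque token.

def symbolize(triples):
    """Map each distinct entity/relation name to an abstract symbol."""
    entities, relations = {}, {}

    def ent(name):
        if name not in entities:
            entities[name] = f"e{len(entities)}"
        return entities[name]

    def rel(name):
        if name not in relations:
            relations[name] = f"r{len(relations)}"
        return relations[name]

    return [(ent(h), rel(r), ent(t)) for h, r, t in triples]

semantic = [
    ("Alice", "is mother of", "Bob"),
    ("Bob", "is father of", "Carol"),
]
print(symbolize(semantic))
# -> [('e0', 'r0', 'e1'), ('e1', 'r1', 'e2')]
```

With the surface labels gone, a model can only succeed by following the underlying reasoning pattern.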

BenjaminHan,

3/

The end result? LLMs perform much worse on symbolic reasoning (screenshot), suggesting they lean heavily on the semantics of the words involved rather than truly understanding and following the reasoning patterns.

BenjaminHan,

4/

The same tendency is borne out by another paper that tests code-generating LLMs when function names are swapped in the input [2] (screenshot 1). The authors found not only that almost all models failed completely, but also that most of them exhibit an “inverse scaling” effect: the larger the model, the worse it gets (screenshot 2).

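A rough sketch of what such an identifier-swap prompt can look like (hypothetical wording, not the paper's actual prompts; `len`/`print` are used only as a familiar pair):

```python
# Hypothetical sketch of an identifier-swap probe (not the paper's prompts).
# A model that relies on the usual meaning of the names will complete the
# function with the wrong call.

SWAPPED_PROMPT = """\
# In the code below the builtins `len` and `print` have been swapped:
# `len(x)` writes x to stdout, and `print(x)` returns the number of items in x.
# Complete the function so that it returns the size of the list.

def list_size(items):
    return """

print(SWAPPED_PROMPT)
# Correct completion under the swap:            print(items)
# Completion a name-driven model tends to give: len(items)
```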

BenjaminHan,

5/

This shows the semantic priors learned from these function names have totally dominated, and the models don’t really understand what they are doing.

How about LLMs on causal reasoning? There have been reports of extremely impressive performance from GPT-3.5 and GPT-4, but these models also lack consistency and may even have cheated by memorizing the tests [3], as discussed in a previous post [4].

BenjaminHan,

6/

In a more recent work [5], the authors tested LLMs on purely symbolic causal-reasoning tasks, where all variables are abstract symbols (screenshot 1). They constructed a dataset systematically: starting by picking variables, then generating all possible causal graphs over them, and finally mapping each graph to the statistical correlations it implies. They then “verbalize” these graphs into problems for LLMs to solve for a given causation hypothesis (screenshot 2).

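As a rough sketch of the graph-enumeration step (standard-library Python, my own wording; the paper's pipeline additionally derives the implied correlations via d-separation, which is omitted here):

```python
# Enumerate every DAG over a small set of variables and verbalize its edges.
from itertools import combinations, product

def all_digraphs(variables):
    """Yield every orientation choice: each pair is absent, ->, or <-."""
    pairs = list(combinations(variables, 2))
    for choice in product((None, 0, 1), repeat=len(pairs)):
        edges = []
        for (a, b), c in zip(pairs, choice):
            if c == 0:
                edges.append((a, b))   # a causes b
            elif c == 1:
                edges.append((b, a))   # b causes a
        yield edges

def is_acyclic(variables, edges):
    """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
    indeg = {v: 0 for v in variables}
    for _, b in edges:
        indeg[b] += 1
    frontier = [v for v in variables if indeg[v] == 0]
    removed = 0
    while frontier:
        v = frontier.pop()
        removed += 1
        for a, b in edges:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    frontier.append(b)
    return removed == len(variables)

def verbalize(edges):
    if not edges:
        return "No variable causally affects any other."
    return " ".join(f"{a} directly causes {b}." for a, b in edges)

variables = ["A", "B", "C"]
dags = [e for e in all_digraphs(variables) if is_acyclic(variables, e)]
print(len(dags), "DAGs over 3 variables")  # 25
print(verbalize(dags[1]))                  # "B directly causes C."
```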

BenjaminHan,

7/

The results? Both GPT-3.5 and GPT-4 perform worse than a BART model fine-tuned on MNLI, and not much better than the uniform random baseline (screenshot).

(On LinkedIn: https://www.linkedin.com/posts/benjaminhan_llms-mastodon-paper-activity-7083976296350822401-tKAI)

BenjaminHan,

8/

[1] Xiaojuan Tang, Zilong Zheng, Jiaqi Li, Fanxu Meng, Song-Chun Zhu, Yitao Liang, and Muhan Zhang. 2023. Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners. http://arxiv.org/abs/2305.14825

BenjaminHan,

9/

[2] Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, and Shay B. Cohen. 2023. The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python. http://arxiv.org/abs/2305.15507

[3] Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal Reasoning and Large Language Models: Opening a New Frontier for Causality. http://arxiv.org/abs/2305.00050

BenjaminHan, to random

Don't Join Threads—Make Instagram's 'Twitter Killer' Join You | WIRED https://www.wired.com/story/meta-threads-privacy-decentralization/

BenjaminHan, to Futurology

1/

arXiv.org has been a wonderful place for researchers to submit findings and catch up with the latest work, but does submitting papers there (“early ArXiving”) increase the odds that they will later be accepted at a peer-reviewed venue?

(Drumroll please)

BenjaminHan,

2/

The authors answer this question in [1] by first laying out the causal graph in which early ArXiving is the treatment (variable A) and paper acceptance at a conference is the effect (variable Y; screenshot). Two confounders exist: an unobserved one, the creativity and originality of the paper (intrinsic), and an observed one, the topic, authors, and institution of the paper (extrinsic).
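Written out as an adjacency list, the graph described above looks roughly like this (just a restatement of the structure in the post, not the paper's notation):

```python
# A = early ArXiving (treatment), Y = paper acceptance (outcome)
# U = intrinsic confounder: creativity/originality (unobserved)
# X = extrinsic confounder: topic/authors/institution (observed)
causal_graph = {
    "A": ["Y"],        # the treatment may affect the outcome
    "U": ["A", "Y"],   # unobserved quality drives both arXiving and acceptance
    "X": ["A", "Y"],   # observed paper/author attributes drive both as well
}
```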

BenjaminHan,

3/

To cope with the unobserved confounder, they deploy a popular causal inference framework known as negative outcome control [2], in which a variable called a Negative Control Outcome (NCO) is added to correct the bias. The NCO must share the same observed and unobserved confounders as the outcome variable while not being causally affected by the treatment. They use “citation counts after a certain number of years of publication” as the NCO.
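In symbols, with treatment A, observed confounders X, unobserved confounders U, and candidate NCO N, these two requirements are commonly written as follows (a textbook formulation; the paper's exact assumptions may differ in detail):

```latex
% 1. The treatment has no causal effect on the NCO:
N(a) = N \quad \text{for all treatment levels } a
% 2. The NCO is subject to the same unobserved confounding as the outcome:
N \;\not\!\perp\!\!\!\perp\; U \mid X
```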

BenjaminHan,

4/

Numerically, the effect of early ArXiving is defined by the causal estimand ATET (screenshot 1). The paper reports that the ATET is high (9.76%) before accounting for the NCO, but after the NCO is taken into account, the effect is almost non-existent across authors with different citation counts or institution ranks (screenshot 2).
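For reference, ATET is the average treatment effect on the treated; in potential-outcome notation it is usually written as (textbook form; the paper's estimand may condition on further covariates):

```latex
\mathrm{ATET} \;=\; \mathbb{E}\bigl[\, Y(1) - Y(0) \;\bigm|\; A = 1 \,\bigr]
```

where Y(1) and Y(0) denote a paper's acceptance outcome with and without early ArXiving, and A = 1 restricts the average to the papers that were actually posted early.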

Conclusion: enjoy your guilt-free early ArXiving!


BenjaminHan, to random

If you're an Asimov fan like many of us, prompt engineering feels like something we have definitely read about somewhere many moons ago. This great article revisits some of those stories.

Login Jones. 2023. Asimov - The Original Prompt Engineer. https://lojones.github.io/2023/04/30/asimov-prompt-engineer.html

cmuhcii, to random

In case you missed the CMU Meeting of the Minds last week, here are a few students we recognized at the SCS showcase with HCII independent projects. Congrats, undergrads!

In case you can't read the posters, they include:

👩‍💻 Using large language models to enhance the user interface of webpages

📱A mobile application empowering women and girls on negotiation skills

🏥 Sepsis treatment prediction model evaluation using dynamics model


@jbigham @adamperer

A student points to research poster while talking about the project to a guest
A student stands to the right of her research poster on sepsis treatment prediction modeling

BenjaminHan,

@cmuhcii @jbigham @adamperer Also very cool that HCII is on Mastodon!

BenjaminHan, to gpt

1/

Solving causal tasks is a hallmark of intelligence. One recent study [1] categorizes these tasks into covariance-based and logic-based reasoning (screenshot) and examines how models perform on causal discovery, actual causality, and causal judgments.

BenjaminHan,

2/

Notably, the study reports an impressively high accuracy of 96% achieved by GPT-4 (screenshot) on the Tübingen benchmark [2], a classification task that asks whether a change in one variable (e.g., the age of an abalone) causes a change in another variable (e.g., the height of an abalone).
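Illustratively, a pairwise cause-effect prompt of this kind might look as follows (hypothetical wording, not the study's exact prompt; the variable names follow the abalone example above):

```python
# Illustrative template for pairwise cause-effect classification on the
# Tuebingen pairs (the study's exact wording differs).

PAIRWISE_TEMPLATE = (
    "You are given two variables: {x} and {y}.\n"
    "Which causal direction is more plausible?\n"
    "(A) {x} causes {y}\n"
    "(B) {y} causes {x}\n"
    "Answer with A or B."
)

print(PAIRWISE_TEMPLATE.format(x="age of abalone", y="height of abalone"))
```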

BenjaminHan,

3/

But an examination of one error suggests that the model did not truly "reason" (screenshot 1). The authors further conducted a "memorization test" with the model, asking it to recall part of the dataset while providing the other part in the prompts. The results show that the Tübingen benchmark is already in GPT-4's training set (screenshot 2)!

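A sketch of the kind of memorization probe described above (hypothetical prompt text; the authors' protocol differs in detail): reveal one part of a benchmark record and ask the model to reproduce the withheld part verbatim.

```python
# Memorization probe sketch: show part of a benchmark entry, ask for the rest.

def memorization_probe(dataset_name, shown_part, withheld_field):
    return (
        f"The benchmark dataset '{dataset_name}' contains an entry that "
        f"starts like this:\n{shown_part}\n"
        f"State the entry's '{withheld_field}' exactly as it appears in the "
        f"dataset."
    )

print(memorization_probe(
    dataset_name="Tuebingen cause-effect pairs",
    shown_part="pair0001: first variable = ...",
    withheld_field="second variable",
))
# If the model reliably fills in withheld fields, the benchmark was very
# likely part of its training data.
```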

BenjaminHan,

4/

Coincidentally, another recent study systematically tests how many books GPT models have memorized [3]. The researchers used a similar approach in the form of a name cloze test (screenshot 1) and found that the models do retain substantial information from these books (GPT-4 remembers more than ChatGPT; screenshot 2).

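A minimal sketch of a name cloze test (the paper draws passages from actual books; the passage here is just the public-domain opening of Moby-Dick):

```python
# Name cloze sketch: mask one character name and ask the model to restore it.

def name_cloze(passage, name, mask="[MASK]"):
    """Hide one occurrence of a character name behind a mask token."""
    return passage.replace(name, mask, 1)

passage = "Call me Ishmael. Some years ago, never mind how long precisely..."
prompt = (
    "Fill in the single proper name hidden by [MASK]:\n\n"
    + name_cloze(passage, "Ishmael")
)
print(prompt)
# An exact-match answer ("Ishmael") is taken as evidence that the passage
# was memorized during training.
```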

BenjaminHan,

5/

They also discovered that a significant amount of copyrighted material is present (screenshot 1). Last but definitely not least, they demonstrated that how well the models perform on a downstream task (predicting the first publication date of a random 250-word sample from a book) strongly depends on how much of that data they have seen before (screenshot 2).


BenjaminHan,

6/

The moral of the story: when investigating the capabilities of black-box LLMs, always perform memorization tests on the benchmark datasets first!

BenjaminHan,

7/

[1] Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal Reasoning and Large Language Models: Opening a New Frontier for Causality. http://arxiv.org/abs/2305.00050

[2] Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. 2016. Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Journal of Machine Learning Research (JMLR), 17(32):1–102. https://jmlr.org/papers/v17/14-518.html
