
BenjaminHan

@BenjaminHan@sigmoid.social

Working on natural language, knowledge, reasoning, machine learning, and AI at a fruity company.

Husband, father, runner, German learner, piano player. A curious soul living in #PacificNorthwest (WA US).

Running 05/25/18-05/19/24 (distance, # of runs, best time, pace/mile, date, source):

5K 645 21:34 6'56" 4/20/24 Strava
10K 97 48:59 7'52" 5/16/24 Strava
15K 4 1:16:05 8’10” 5/19/24 Strava
HM 25 1:48:25 8’16” 5/19/24 Strava
M 7 3:44:58 8'35" 3/24/24 AppleW

#nlp #nlproc #knowledgeGraphs #ai #running #classicalMusic


BenjaminHan, to LLMs

1/

When performing reasoning or generating code, do LLMs really understand what they are doing, or do they just memorize? Several new results paint a not-so-rosy picture.

The authors in [1] are interested in testing LLMs on “semantic” vs. “symbolic” reasoning: the former involves reasoning with language-like input, and the latter is reasoning with abstract symbols.

BenjaminHan,

2/

They use a symbolic dataset and a semantic dataset to test the models’ memorization and reasoning abilities (screenshot 1). For each dataset they create a corresponding one in the other modality: e.g., they replace the natural-language labels of the relations and entities with abstract symbols to produce a symbolic version of a semantic dataset (screenshot 2).

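To make the setup concrete, here is a minimal sketch (made-up triples, not the paper's data or code) of how a semantic relational dataset can be turned into its symbolic counterpart by replacing every entity and relation label with an opaque token:

```python
# Hypothetical illustration (made-up triples, not the paper's data or code):
# turn a "semantic" relational dataset into a "symbolic" one by replacing
# every natural-language entity and relation label with an opaque token.

def symbolize(triples):
    """Map each distinct entity/relation name to an abstract symbol."""
    entities, relations = {}, {}

    def ent(name):
        if name not in entities:
            entities[name] = f"e{len(entities)}"
        return entities[name]

    def rel(name):
        if name not in relations:
            relations[name] = f"r{len(relations)}"
        return relations[name]

    return [(ent(h), rel(r), ent(t)) for h, r, t in triples]

semantic = [
    ("Alice", "is mother of", "Bob"),
    ("Bob", "is father of", "Carol"),
]
print(symbolize(semantic))
# -> [('e0', 'r0', 'e1'), ('e1', 'r1', 'e2')]
```

With the surface labels gone, a model can only succeed by following the underlying reasoning pattern.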

BenjaminHan,

3/

The end result? LLMs perform much worse on symbolic reasoning (screenshot), suggesting they lean heavily on the semantics of the words involved rather than truly understanding and following the reasoning patterns.

BenjaminHan,

4/

The same tendency is borne out by another paper that tests code-generating LLMs when function names are swapped in the input [2] (screenshot 1). The authors found not only that almost all models failed completely, but also that most of them exhibit an “inverse scaling” effect: the larger the model, the worse it gets (screenshot 2).

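A rough sketch of what such an identifier-swap prompt can look like (hypothetical wording, not the paper's actual prompts; `len`/`print` are used only as a familiar pair):

```python
# Hypothetical sketch of an identifier-swap probe (not the paper's prompts).
# A model that relies on the usual meaning of the names will complete the
# function with the wrong call.

SWAPPED_PROMPT = """\
# In the code below the builtins `len` and `print` have been swapped:
# `len(x)` writes x to stdout, and `print(x)` returns the number of items in x.
# Complete the function so that it returns the size of the list.

def list_size(items):
    return """

print(SWAPPED_PROMPT)
# Correct completion under the swap:            print(items)
# Completion a name-driven model tends to give: len(items)
```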

BenjaminHan,

5/

This shows the semantic priors learned from these function names have totally dominated, and the models don’t really understand what they are doing.

How about LLMs on causal reasoning? There have been reports of extremely impressive performance from GPT-3.5 and GPT-4, but these models also lack consistency and may even have cheated by memorizing the tests [3], as discussed in a previous post [4].

BenjaminHan,

6/

In a more recent work [5], the authors tested LLMs on purely symbolic causal-reasoning tasks, where all variables are abstract symbols (screenshot 1). They constructed a dataset systematically: starting by picking variables, then generating all possible causal graphs over them, and finally mapping each graph to the statistical correlations it implies. They then “verbalize” these graphs into problems for LLMs to solve for a given causation hypothesis (screenshot 2).

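As a rough sketch of the graph-enumeration step (standard-library Python, my own wording; the paper's pipeline additionally derives the implied correlations via d-separation, which is omitted here):

```python
# Enumerate every DAG over a small set of variables and verbalize its edges.
from itertools import combinations, product

def all_digraphs(variables):
    """Yield every orientation choice: each pair is absent, ->, or <-."""
    pairs = list(combinations(variables, 2))
    for choice in product((None, 0, 1), repeat=len(pairs)):
        edges = []
        for (a, b), c in zip(pairs, choice):
            if c == 0:
                edges.append((a, b))   # a causes b
            elif c == 1:
                edges.append((b, a))   # b causes a
        yield edges

def is_acyclic(variables, edges):
    """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
    indeg = {v: 0 for v in variables}
    for _, b in edges:
        indeg[b] += 1
    frontier = [v for v in variables if indeg[v] == 0]
    removed = 0
    while frontier:
        v = frontier.pop()
        removed += 1
        for a, b in edges:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    frontier.append(b)
    return removed == len(variables)

def verbalize(edges):
    if not edges:
        return "No variable causally affects any other."
    return " ".join(f"{a} directly causes {b}." for a, b in edges)

variables = ["A", "B", "C"]
dags = [e for e in all_digraphs(variables) if is_acyclic(variables, e)]
print(len(dags), "DAGs over 3 variables")  # 25
print(verbalize(dags[1]))                  # "B directly causes C."
```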

BenjaminHan,

7/

The results? Both GPT-3.5 and GPT-4 perform worse than a BART model fine-tuned on MNLI, and not much better than the uniform random baseline (screenshot).

(On LinkedIn: https://www.linkedin.com/posts/benjaminhan_llms-mastodon-paper-activity-7083976296350822401-tKAI)

BenjaminHan,

8/

[1] Xiaojuan Tang, Zilong Zheng, Jiaqi Li, Fanxu Meng, Song-Chun Zhu, Yitao Liang, and Muhan Zhang. 2023. Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners. http://arxiv.org/abs/2305.14825

BenjaminHan,

9/

[2] Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, and Shay B. Cohen. 2023. The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python. http://arxiv.org/abs/2305.15507

[3] Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal Reasoning and Large Language Models: Opening a New Frontier for Causality. http://arxiv.org/abs/2305.00050

BenjaminHan, to random

Don't Join Threads—Make Instagram's 'Twitter Killer' Join You | WIRED https://www.wired.com/story/meta-threads-privacy-decentralization/

BenjaminHan, to Futurology

1/

arXiv.org has been a wonderful place for researchers to submit findings and catch up with the latest work, but does submitting papers there (“early ArXiving”) increase the odds that they will later be accepted at a peer-reviewed venue?

(Drumroll please)

BenjaminHan,

2/

The authors answer this question in [1] by first laying out the causal graph in which early ArXiving is the treatment (variable A) and paper acceptance at a conference is the effect (variable Y; screenshot). Two confounders exist: an unobserved one, the creativity and originality of the paper (intrinsic), and an observed one, the topic, authors, and institution of the paper (extrinsic).
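Written out as an adjacency list, the graph described above looks roughly like this (just a restatement of the structure in the post, not the paper's notation):

```python
# A = early ArXiving (treatment), Y = paper acceptance (outcome)
# U = intrinsic confounder: creativity/originality (unobserved)
# X = extrinsic confounder: topic/authors/institution (observed)
causal_graph = {
    "A": ["Y"],        # the treatment may affect the outcome
    "U": ["A", "Y"],   # unobserved quality drives both arXiving and acceptance
    "X": ["A", "Y"],   # observed paper/author attributes drive both as well
}
```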

BenjaminHan,

3/

To cope with the unobserved confounder, they deploy a popular causal inference framework known as negative outcome control [2], in which a variable called a Negative Control Outcome (NCO) is added to correct the bias. The NCO must share the same observed and unobserved confounders as the outcome variable while not being causally affected by the treatment. They use “citation counts after a certain number of years of publication” as the NCO.
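In symbols, with treatment A, observed confounders X, unobserved confounders U, and candidate NCO N, these two requirements are commonly written as follows (a textbook formulation; the paper's exact assumptions may differ in detail):

```latex
% 1. The treatment has no causal effect on the NCO:
N(a) = N \quad \text{for all treatment levels } a
% 2. The NCO is subject to the same unobserved confounding as the outcome:
N \;\not\!\perp\!\!\!\perp\; U \mid X
```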

BenjaminHan,

4/

Numerically, the effect of early ArXiving is defined by the causal estimand ATET (screenshot 1). The paper reports that the ATET is high (9.76%) before accounting for the NCO, but after the NCO is taken into account, the effect is almost non-existent across authors with different citation counts or institution ranks (screenshot 2).
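For reference, ATET is the average treatment effect on the treated; in potential-outcome notation it is usually written as (textbook form; the paper's estimand may condition on further covariates):

```latex
\mathrm{ATET} \;=\; \mathbb{E}\bigl[\, Y(1) - Y(0) \;\bigm|\; A = 1 \,\bigr]
```

where Y(1) and Y(0) denote a paper's acceptance outcome with and without early ArXiving, and A = 1 restricts the average to the papers that were actually posted early.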

Conclusion: enjoy your guilt-free early ArXiving!


BenjaminHan, to random

If you're an Asimov fan like many of us, prompt engineering feels like something we have definitely read about somewhere many moons ago. This great article revisits some of those stories.

Login Jones. 2023. Asimov - The Original Prompt Engineer. https://lojones.github.io/2023/04/30/asimov-prompt-engineer.html

cmuhcii, to random

In case you missed the CMU Meeting of the Minds last week, here are a few students we recognized at the SCS showcase with HCII independent projects. Congrats, undergrads!

In case you can't read the posters, they include:

👩‍💻 Using large language models to enhance the user interface of webpages

📱A mobile application empowering women and girls on negotiation skills

🏥 Sepsis treatment prediction model evaluation using dynamics model


@jbigham @adamperer

A student points to research poster while talking about the project to a guest
A student stands to the right of her research poster on sepsis treatment prediction modeling

BenjaminHan,

@cmuhcii @jbigham @adamperer Also very cool that HCII is on Mastodon!

BenjaminHan, to gpt

1/

Solving causal tasks is a hallmark of intelligence. One recent study [1] categorizes these tasks into covariance-based and logic-based reasoning (screenshot) and examines how models perform on causal discovery, actual causality, and causal judgments.

BenjaminHan,

2/

Notably, the study reports an impressively high accuracy of 96% achieved by GPT-4 (screenshot) on the Tübingen benchmark [2], a classification task that asks whether a change in one variable (e.g., the age of an abalone) causes a change in another variable (e.g., the height of an abalone).
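Illustratively, a pairwise cause-effect prompt of this kind might look as follows (hypothetical wording, not the study's exact prompt; the variable names follow the abalone example above):

```python
# Illustrative template for pairwise cause-effect classification on the
# Tuebingen pairs (the study's exact wording differs).

PAIRWISE_TEMPLATE = (
    "You are given two variables: {x} and {y}.\n"
    "Which causal direction is more plausible?\n"
    "(A) {x} causes {y}\n"
    "(B) {y} causes {x}\n"
    "Answer with A or B."
)

print(PAIRWISE_TEMPLATE.format(x="age of abalone", y="height of abalone"))
```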

BenjaminHan,

3/

But an examination of one error suggests that the model did not truly "reason" (screenshot 1). The authors further conducted a "memorization test" with the model, asking it to recall part of the dataset while providing the other part in the prompts. The results show that the Tübingen benchmark is already in GPT-4's training set (screenshot 2)!

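A sketch of the kind of memorization probe described above (hypothetical prompt text; the authors' protocol differs in detail): reveal one part of a benchmark record and ask the model to reproduce the withheld part verbatim.

```python
# Memorization probe sketch: show part of a benchmark entry, ask for the rest.

def memorization_probe(dataset_name, shown_part, withheld_field):
    return (
        f"The benchmark dataset '{dataset_name}' contains an entry that "
        f"starts like this:\n{shown_part}\n"
        f"State the entry's '{withheld_field}' exactly as it appears in the "
        f"dataset."
    )

print(memorization_probe(
    dataset_name="Tuebingen cause-effect pairs",
    shown_part="pair0001: first variable = ...",
    withheld_field="second variable",
))
# If the model reliably fills in withheld fields, the benchmark was very
# likely part of its training data.
```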

BenjaminHan,

4/

Coincidentally, another recent study systematically tests how many books GPT models have memorized [3]. The researchers used a similar approach in the form of a name cloze test (screenshot 1) and found that the models do retain substantial information from these books (GPT-4 remembers more than ChatGPT; screenshot 2).

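A minimal sketch of a name cloze test (the paper draws passages from actual books; the passage here is just the public-domain opening of Moby-Dick):

```python
# Name cloze sketch: mask one character name and ask the model to restore it.

def name_cloze(passage, name, mask="[MASK]"):
    """Hide one occurrence of a character name behind a mask token."""
    return passage.replace(name, mask, 1)

passage = "Call me Ishmael. Some years ago, never mind how long precisely..."
prompt = (
    "Fill in the single proper name hidden by [MASK]:\n\n"
    + name_cloze(passage, "Ishmael")
)
print(prompt)
# An exact-match answer ("Ishmael") is taken as evidence that the passage
# was memorized during training.
```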

BenjaminHan,

5/

They also discovered that a significant amount of copyrighted material is present (screenshot 1). Last but definitely not least, they demonstrated that how well the models perform on a downstream task (predicting the first publication date of a random 250-word sample from a book) strongly depends on how much of that data they have seen before (screenshot 2).


BenjaminHan,

6/

The moral of the story: when investigating the capabilities of black-box LLMs, always perform memorization tests on the benchmark datasets first!

BenjaminHan,

7/

[1] Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal Reasoning and Large Language Models: Opening a New Frontier for Causality. http://arxiv.org/abs/2305.00050

[2] Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. 2016. Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Journal of Machine Learning Research (JMLR), 17(32):1–102. https://jmlr.org/papers/v17/14-518.html
