Player FM - Internet Radio Done Right
33 subscribers
Checked 11d ago
Added four years ago
Content provided by Daniel Filan. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Daniel Filan or their podcast platform partner. If you believe someone is using your copyrighted work without permission, you can follow the process outlined here: https://th.player.fm/legal
AXRP - the AI X-risk Research Podcast
AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.
52 episodes
All episodes
The Future of Life Institute is one of the oldest and most prominent organizations in the AI existential safety space, working on such topics as the AI pause open letter and how the EU AI Act can be improved. Metaculus is one of the premier forecasting sites on the internet. Behind both of them lies one man: Anthony Aguirre, who I talk with in this episode.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/02/09/episode-38_7-anthony-aguirre-future-of-life-institute.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
00:33 - Anthony, FLI, and Metaculus
06:46 - The Alignment Workshop
07:15 - FLI's current activity
11:04 - AI policy
17:09 - Work FLI funds
Links:
Future of Life Institute: https://futureoflife.org/
Metaculus: https://www.metaculus.com/
Future of Life Foundation: https://www.flf.org/
Episode art by Hamish Doodles: hamishdoodles.com
Typically this podcast talks about how to avert destruction from AI. But what would it take to ensure AI promotes human flourishing as well as it can? Is alignment to individuals enough, and if not, where do we go from here? In this episode, I talk with Joel Lehman about these questions.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/01/24/episode-38_6-joel-lehman-positive-visions-of-ai.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
01:12 - Why aligned AI might not be enough
04:05 - Positive visions of AI
08:27 - Improving recommendation systems
Links:
Why Greatness Cannot Be Planned: https://www.amazon.com/Why-Greatness-Cannot-Planned-Objective/dp/3319155237
We Need Positive Visions of AI Grounded in Wellbeing: https://thegradientpub.substack.com/p/beneficial-ai-wellbeing-lehman-ngo
Machine Love: https://arxiv.org/abs/2302.09248
AI Alignment with Changing and Influenceable Reward Functions: https://arxiv.org/abs/2405.17713
Episode art by Hamish Doodles: hamishdoodles.com
Suppose we're worried about AIs engaging in long-term plans that they don't tell us about. If we were to peek inside their brains, what should we look for to check whether this was happening? In this episode, Adrià Garriga-Alonso talks about his work trying to answer this question.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/01/20/episode-38_5-adria-garriga-alonso-detecting-ai-scheming.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
01:04 - The Alignment Workshop
02:49 - How to detect scheming AIs
05:29 - Sokoban-solving networks taking time to think
12:18 - Model organisms of long-term planning
19:44 - How and why to study planning in networks
Links:
Adrià's website: https://agarri.ga/
An investigation of model-free planning: https://arxiv.org/abs/1901.03559
Model-Free Planning: https://tuphs28.github.io/projects/interpplanning/
Planning in a recurrent neural network that plays Sokoban: https://arxiv.org/abs/2407.15421
Episode art by Hamish Doodles: hamishdoodles.com
AI researchers often complain about the poor coverage of their work in the news media. But why is this happening, and how can it be fixed? In this episode, I speak with Shakeel Hashim about the resource constraints facing AI journalism, the disconnect between journalists' and AI researchers' views on transformative AI, and efforts to improve the state of AI journalism, such as Tarbell and Shakeel's newsletter, Transformer.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2025/01/05/episode-38_4-shakeel-hashim-ai-journalism.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
01:31 - The AI media ecosystem
02:34 - Why not more AI news?
07:18 - Disconnects between journalists and the AI field
12:42 - Tarbell
18:44 - The Transformer newsletter
Links:
Transformer (Shakeel's substack): https://www.transformernews.ai/
Tarbell: https://www.tarbellfellowship.org/
Episode art by Hamish Doodles: hamishdoodles.com
Lots of people in the AI safety space worry about models being able to make deliberate, multi-step plans. But can we already see this in existing neural nets? In this episode, I talk with Erik Jenner about his work looking at internal look-ahead within chess-playing neural networks.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/12/12/episode-38_3-erik-jenner-learned-look-ahead.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
00:57 - How chess neural nets look into the future
04:29 - The dataset and basic methodology
05:23 - Testing for branching futures?
07:57 - Which experiments demonstrate what
10:43 - How the ablation experiments work
12:38 - Effect sizes
15:23 - X-risk relevance
18:08 - Follow-up work
21:29 - How much planning does the network do?
Research we mention:
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network: https://arxiv.org/abs/2406.00877
Understanding the learned look-ahead behavior of chess neural networks (a development of the follow-up research Erik mentioned): https://openreview.net/forum?id=Tl8EzmgsEp
Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT: https://arxiv.org/abs/2310.07582
Episode art by Hamish Doodles: hamishdoodles.com
39 - Evan Hubinger on Model Organisms of Misalignment (1:45:47)
The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he's worked on at Anthropic under this agenda: "Sleeper Agents" and "Sycophancy to Subterfuge".
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/12/01/episode-39-evan-hubinger-model-organisms-misalignment.html
Topics we discuss, and timestamps:
0:00:36 - Model organisms and stress-testing
0:07:38 - Sleeper Agents
0:22:32 - Do 'sleeper agents' properly model deceptive alignment?
0:38:32 - Surprising results in "Sleeper Agents"
0:57:25 - Sycophancy to Subterfuge
1:09:21 - How models generalize from sycophancy to subterfuge
1:16:37 - Is the reward editing task valid?
1:21:46 - Training away sycophancy and subterfuge
1:29:22 - Model organisms, AI control, and evaluations
1:33:45 - Other model organisms research
1:35:27 - Alignment stress-testing at Anthropic
1:43:32 - Following Evan's work
Main papers:
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models: https://arxiv.org/abs/2406.10162
Anthropic links:
Anthropic's newsroom: https://www.anthropic.com/news
Careers at Anthropic: https://www.anthropic.com/careers
Other links:
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research: https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1
Simple probes can catch sleeper agents: https://www.anthropic.com/research/probes-catch-sleeper-agents
Studying Large Language Model Generalization with Influence Functions: https://arxiv.org/abs/2308.03296
Stress-Testing Capability Elicitation With Password-Locked Models [aka model organisms of sandbagging]: https://arxiv.org/abs/2405.19550
Episode art by Hamish Doodles: hamishdoodles.com
You may have heard of singular learning theory, and its "local learning coefficient", or LLC - but have you heard of the refined LLC? In this episode, I chat with Jesse Hoogland about his work on SLT, and using the refined LLC to find a new circuit in language models.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/11/27/38_2-jesse-hoogland-singular-learning-theory.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
00:34 - About Jesse
01:49 - The Alignment Workshop
02:31 - About Timaeus
05:25 - SLT that isn't developmental interpretability
10:41 - The refined local learning coefficient
14:06 - Finding the multigram circuit
Links:
Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient: https://arxiv.org/abs/2410.02984
Investigating the learning coefficient of modular addition: hackathon project: https://www.lesswrong.com/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition
Episode art by Hamish Doodles: hamishdoodles.com
Road lines, street lights, and licence plates are examples of infrastructure used to ensure that roads operate smoothly. In this episode, Alan Chan talks about using similar interventions to help avoid bad outcomes from the deployment of AI agents.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/11/16/episode-38_1-alan-chan-agent-infrastructure.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
01:02 - How the Alignment Workshop is
01:32 - Agent infrastructure
04:57 - Why agent infrastructure
07:54 - A trichotomy of agent infrastructure
13:59 - Agent IDs
18:17 - Agent channels
20:29 - Relation to AI control
Links:
Alan on Google Scholar: https://scholar.google.com/citations?user=lmQmYPgAAAAJ&hl=en&oi=ao
IDs for AI Systems: https://arxiv.org/abs/2406.12137
Visibility into AI Agents: https://arxiv.org/abs/2401.13138
Episode art by Hamish Doodles: hamishdoodles.com
Do language models understand the causal structure of the world, or do they merely note correlations? And what happens when you build a big AI society out of them? In this brief episode, recorded at the Bay Area Alignment Workshop, I chat with Zhijing Jin about her research on these questions.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/11/14/episode-38_0-zhijing-jin-llms-causality-multi-agent-systems.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
00:35 - How the Alignment Workshop is
00:47 - How Zhijing got interested in causality and natural language processing
03:14 - Causality and alignment
06:21 - Causality without randomness
10:07 - Causal abstraction
11:42 - Why LLM causal reasoning?
13:20 - Understanding LLM causal reasoning
16:33 - Multi-agent systems
Links:
Zhijing's website: https://zhijing-jin.com/fantasy/
Zhijing on X (aka Twitter): https://x.com/zhijingjin
Can Large Language Models Infer Causation from Correlation?: https://arxiv.org/abs/2306.05836
Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents: https://arxiv.org/abs/2404.16698
Episode art by Hamish Doodles: hamishdoodles.com
37 - Jaime Sevilla on AI Forecasting (1:44:25)
Epoch AI is the premier organization that tracks the trajectory of AI - how much compute is used, the role of algorithmic improvements, the growth in data used, and when the above trends might hit an end. In this episode, I speak with the director of Epoch AI, Jaime Sevilla, about how compute, data, and algorithmic improvements are impacting AI, and whether continuing to scale can get us AGI.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/10/04/episode-37-jaime-sevilla-forecasting-ai.html
Topics we discuss, and timestamps:
0:00:38 - The pace of AI progress
0:07:49 - How Epoch AI tracks AI compute
0:11:44 - Why does AI compute grow so smoothly?
0:21:46 - When will we run out of computers?
0:38:56 - Algorithmic improvement
0:44:21 - Algorithmic improvement and scaling laws
0:56:56 - Training data
1:04:56 - Can scaling produce AGI?
1:16:55 - When will AGI arrive?
1:21:20 - Epoch AI
1:27:06 - Open questions in AI forecasting
1:35:21 - Epoch AI and x-risk
1:41:34 - Following Epoch AI's research
Links for Jaime and Epoch AI:
Epoch AI: https://epochai.org/
Machine Learning Trends dashboard: https://epochai.org/trends
Epoch AI on X / Twitter: https://x.com/EpochAIResearch
Jaime on X / Twitter: https://x.com/Jsevillamol
Research we discuss:
Training Compute of Frontier AI Models Grows by 4-5x per Year: https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year
Optimally Allocating Compute Between Inference and Training: https://epochai.org/blog/optimally-allocating-compute-between-inference-and-training
Algorithmic Progress in Language Models [blog post]: https://epochai.org/blog/algorithmic-progress-in-language-models
Algorithmic progress in language models [paper]: https://arxiv.org/abs/2403.05812
Training Compute-Optimal Large Language Models [aka the Chinchilla scaling law paper]: https://arxiv.org/abs/2203.15556
Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data [blog post]: https://epochai.org/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
Will we run out of data? Limits of LLM scaling based on human-generated data [paper]: https://arxiv.org/abs/2211.04325
The Direct Approach: https://epochai.org/blog/the-direct-approach
Episode art by Hamish Doodles: hamishdoodles.com
36 - Adam Shai and Paul Riechers on Computational Mechanics (1:48:27)
Sometimes, people talk about transformers as having "world models" as a result of being trained to predict text data on the internet. But what does this even mean? In this episode, I talk with Adam Shai and Paul Riechers about their work applying computational mechanics, a sub-field of physics studying how to predict random processes, to neural networks.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/09/29/episode-36-adam-shai-paul-riechers-computational-mechanics.html
Topics we discuss, and timestamps:
0:00:42 - What computational mechanics is
0:29:49 - Computational mechanics vs other approaches
0:36:16 - What world models are
0:48:41 - Fractals
0:57:43 - How the fractals are formed
1:09:55 - Scaling computational mechanics for transformers
1:21:52 - How Adam and Paul found computational mechanics
1:36:16 - Computational mechanics for AI safety
1:46:05 - Following Adam and Paul's research
Simplex AI Safety: https://www.simplexaisafety.com/
Research we discuss:
Transformers represent belief state geometry in their residual stream: https://arxiv.org/abs/2405.15943
Transformers represent belief state geometry in their residual stream [LessWrong post]: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their
Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer: https://www.lesswrong.com/posts/mBw7nc4ipdyeeEpWs/why-would-belief-states-have-a-fractal-structure-and-why
Episode art by Hamish Doodles: hamishdoodles.com
Patreon: https://www.patreon.com/axrpodcast
MATS: https://www.matsprogram.org
Note: I'm employed by MATS, but they're not paying me to make this video.
35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization (2:17:24)
How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't figure out if they succeeded at them? In this episode, I chat with Peter Hase about his research into these questions.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/08/24/episode-35-peter-hase-llm-beliefs-easy-to-hard-generalization.html
Topics we discuss, and timestamps:
0:00:36 - NLP and interpretability
0:10:20 - Interpretability lessons
0:32:22 - Belief interpretability
1:00:12 - Localizing and editing models' beliefs
1:19:18 - Beliefs beyond language models
1:27:21 - Easy-to-hard generalization
1:47:16 - What do easy-to-hard results tell us?
1:57:33 - Easy-to-hard vs weak-to-strong
2:03:50 - Different notions of hardness
2:13:01 - Easy-to-hard vs weak-to-strong, round 2
2:15:39 - Following Peter's work
Peter on Twitter: https://x.com/peterbhase
Peter's papers:
Foundational Challenges in Assuring Alignment and Safety of Large Language Models: https://arxiv.org/abs/2404.09932
Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs: https://arxiv.org/abs/2111.13654
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models: https://arxiv.org/abs/2301.04213
Are Language Models Rational? The Case of Coherence Norms and Belief Revision: https://arxiv.org/abs/2406.03442
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks: https://arxiv.org/abs/2401.06751
Other links:
Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV): https://arxiv.org/abs/1711.11279
Locating and Editing Factual Associations in GPT (aka the ROME paper): https://arxiv.org/abs/2202.05262
Of nonlinearity and commutativity in BERT: https://arxiv.org/abs/2101.04547
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model: https://arxiv.org/abs/2306.03341
Editing a classifier by rewriting its prediction rules: https://arxiv.org/abs/2112.01008
Discovering Latent Knowledge Without Supervision (aka the Collin Burns CCS paper): https://arxiv.org/abs/2212.03827
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision: https://arxiv.org/abs/2312.09390
Concrete problems in AI safety: https://arxiv.org/abs/1606.06565
Rissanen Data Analysis: Examining Dataset Characteristics via Description Length: https://arxiv.org/abs/2103.03872
Episode art by Hamish Doodles: hamishdoodles.com
34 - AI Evaluations with Beth Barnes (2:14:02)
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html
Topics we discuss, and timestamps:
0:00:37 - What is METR?
0:02:44 - What is an "eval"?
0:14:42 - How good are evals?
0:37:25 - Are models showing their full capabilities?
0:53:25 - Evaluating alignment
1:01:38 - Existential safety methodology
1:12:13 - Threat models and capability buffers
1:38:25 - METR's policy work
1:48:19 - METR's relationships with labs
2:04:12 - Related research
2:10:02 - Roles at METR, and following METR's work
Links for METR:
METR: https://metr.org
METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/
METR - Hiring: https://metr.org/hiring
Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/
Other links:
Update on ARC's recent eval efforts (contains GPT-4 taskrabbit captcha story): https://metr.org/blog/2023-03-18-update-on-recent-evals/
Password-locked models: a stress case for capabilities evaluation: https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566
Untrusted smart models and trusted dumb models: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models
AI companies aren't really using external evaluators: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators
Nobody Knows How to Safety-Test AI (Time): https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/
ChatGPT can talk, but OpenAI employees sure can't: https://www.vox.com/future-perfect/2024/5/17/24158478/openai-departures-sam-altman-employees-chatgpt-release
Leaked OpenAI documents reveal aggressive tactics toward former employees: https://www.vox.com/future-perfect/351132/openai-vested-equity-nda-sam-altman-documents-employees
Beth on her non-disparagement agreement with OpenAI: https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai?commentId=MrJF3tWiKYMtJepgX
Sam Altman's statement on OpenAI equity: https://x.com/sama/status/1791936857594581428
Episode art by Hamish Doodles: hamishdoodles.com
33 - RLHF Problems with Scott Emmons (1:41:24)
Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html
Topics we discuss, and timestamps:
0:00:33 - Deceptive inflation
0:17:56 - Overjustification
0:32:48 - Bounded human rationality
0:50:46 - Avoiding these problems
1:14:13 - Dimensional analysis
1:23:32 - RLHF problems, in theory and practice
1:31:29 - Scott's research program
1:39:42 - Following Scott's research
Scott's website: https://www.scottemmons.com
Scott's X/twitter account: https://x.com/emmons_scott
When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747
Other works we discuss:
AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752
Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394
Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475
The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693
Episode art by Hamish Doodles: hamishdoodles.com