LW - Solving adversarial attacks in computer vision as a baby version of general AI alignment by stanislavfort
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving adversarial attacks in computer vision as a baby version of general AI alignment, published by stanislavfort on August 30, 2024 on LessWrong.
I spent the last few months trying to tackle the problem of adversarial attacks in computer vision from the ground up. The results of this effort are written up in our new paper Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness (explainer on X/Twitter).
Taking inspiration from biology, we reached or exceeded state-of-the-art robustness with 100x-1000x less compute, got human-understandable interpretability for free, turned classifiers into generators, and designed transferable adversarial attacks on closed-source (v)LLMs such as GPT-4 or Claude 3.
I strongly believe that there is a compelling case for devoting serious attention to solving the problem of adversarial robustness in computer vision, and I try to draw an analogy to the alignment of general AI systems here.
1. Introduction
In this post, I argue that the problem of adversarial attacks in computer vision is in many ways analogous to the larger task of general AI alignment. In both cases, we are trying to faithfully convey an implicit function locked within the human brain to a machine, and we do so extremely successfully on average. Under static evaluations, the human and machine functions match up exceptionally well.
However, as is typical in high-dimensional spaces, some phenomena can be relatively rare and basically impossible to find by chance, yet ubiquitous in their absolute count. This is the case for adversarial attacks - imperceptible modifications to images that completely fool computer vision systems and yet have virtually no effect on humans.
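To make the phenomenon concrete, here is a minimal sketch of the classic Fast Gradient Sign Method (FGSM), one standard way such imperceptible perturbations are computed. This is not the method from the paper; the pretrained ResNet-50 classifier and the epsilon budget below are illustrative assumptions.

```python
# Minimal FGSM sketch (illustrative, not the paper's attack).
import torch
import torch.nn.functional as F
import torchvision.models as models

# Assumed pretrained classifier; any differentiable image classifier works.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

def fgsm_attack(image: torch.Tensor, label: torch.Tensor, epsilon: float = 8 / 255):
    """Return a perturbed copy of `image` (shape [1, 3, H, W], values in [0, 1]).

    `epsilon` bounds the per-pixel change, so the edit stays visually negligible
    to a human while it can flip the classifier's prediction.
    """
    image = image.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step each pixel by at most epsilon in the direction that increases the loss.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```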
Their existence highlights a crucial and catastrophic mismatch between the implicit human vision function and the function learned by machines - a mismatch that can be exploited in a dynamic evaluation by an active, malicious agent. Such failure modes will likely be present in more general AI systems, and our inability to remedy them even in the more restricted vision context (yet) does not bode well for the broader alignment project.
This is a call to action to solve the problem of adversarial vision attacks - a stepping stone on the path to aligning general AI systems.
2. Communicating implicit human functions to machines
The basic goal of computer vision can be viewed as trying to endow a machine with the same vision capabilities a human has. A human carries, locked inside their skull, an implicit vision function mapping visual inputs into semantically meaningful symbols, e.g. a picture of a tortoise into a semantic label tortoise.
This function is represented implicitly, and while we are extremely good at using it, we do not have direct, conscious access to its inner workings and therefore cannot communicate it to others easily.
To convey this function to a machine, we usually form a dataset of fixed images and their associated labels. We then use a general enough class of functions, typically deep neural networks, and a gradient-based learning algorithm together with backpropagation to teach the machine how to correlate images with their semantic content, e.g. how to assign a label parrot to a picture of a parrot.
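As a schematic illustration of this recipe, the sketch below trains a network on labeled images with gradient descent and backpropagation. The dataset (CIFAR-10), architecture (ResNet-18), and hyperparameters are illustrative placeholders, not the setup used in the paper.

```python
# Minimal supervised-training sketch: (image, human label) pairs, a deep network,
# and gradient-based learning with backpropagation.
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for images, labels in loader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)  # compare prediction to the human-provided label
    loss.backward()                                # backpropagation
    optimizer.step()                               # gradient-based update
```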
This process is extremely successful in communicating the implicit human vision function to the computer, and the implicit human and explicit, learned machine functions agree to a large extent.
The agreement between the two is striking.
Given how different the architectures are (a simulated graph-like function doing a single forward pass vs the wet protein brain of a mammal running continuous inference), how different the learning algorithms are (gradient descent with backpropagation vs something completely different but still unknown), and how differ...