[{"content":"This page is the short-form entry point to the full essay on adversarial robustness. Instead of asking a reader to absorb everything in one sitting, the material is now organized into five posts that each answer one clear question.\nIf you are arriving from my homepage, start here. The sequence is designed for continuous reading, but each part can also stand on its own.\nReading Order Why AI Can Be Brilliant but Fragile Why strong models can still fail under tiny perturbations, and why that matters in high-stakes settings. What Robustness Really Means A practical theory primer on Bayes error, gradient structure, and loss landscapes. How Adversarial Attacks Evolved How the threat model expanded from pixel noise to 3D, viewpoint, physical, and explainability attacks. How We Defend Models Against Adversarial Attacks A map of the main defense routes: adversarial training, data-centric methods, certification, and efficient purification. Robustness in Modern Models and High-Stakes Settings Why robustness becomes a systems problem once we move into modern architectures and costly real-world domains. Who This Series Is For This series targets readers who already know the basics of machine learning, but do not want to read an 80k-character survey before they understand the main picture. The goal is not to reduce rigor; it is to improve pacing.\nThe full reference version is still available here:\nThe Robustness of Adversarial Network What Changes Across the 5 Parts The split version is not a mechanical chapter break. I rewrote the structure around reader questions:\nPart 1 frames the problem. Part 2 defines robustness. Part 3 explains the threat evolution. Part 4 organizes the defense landscape. Part 5 shows why deployment context changes the problem. If you want the fastest path through the material, read Parts 1, 2, and 4. 
If you want the most complete path, read all five in order.\n","permalink":"https://xiaokunduan.github.io/posts/adversarial-robustness-series/","summary":"A shorter, more readable 5-part path through adversarial robustness: motivation, theory, attacks, defenses, and high-stakes deployment.","title":"Adversarial Robustness, in 5 Readable Parts"},{"content":" Adversarial Robustness · Part 1 of 5 Why AI Can Be Brilliant but Fragile · Series Home · What Robustness Really Means → Modern deep learning systems can outperform humans on tasks that once looked completely out of reach. They classify images at scale, reason over long contexts, support scientific discovery, and drive perception stacks in complex environments. Yet that apparent competence hides a deeply uncomfortable fact: the same systems can be pushed into catastrophic failure by changes that are almost invisible to us.\nThat tension is the real starting point for adversarial robustness. The problem is not that models make occasional mistakes. The problem is that very capable models can make high-confidence mistakes under perturbations so small that a human observer would not even notice them. Once that happens, robustness stops being an academic detail and becomes a question of whether we can trust a model at all.\nA classic adversarial example. A tiny, carefully designed perturbation causes the model to classify a panda as a gibbon with very high confidence. Image source: (Goodfellow et al., 2015)\nThe panda-to-gibbon example remains the clearest illustration. A clean image is classified correctly. Add a perturbation that is barely perceptible, and the prediction flips with high confidence. The key point is that the perturbation is not random visual noise. It is optimized using the model\u0026rsquo;s own gradients to push the classifier toward failure. In other words, the attack exploits how the model actually organizes its decision boundary.\nOnce we look at the problem through that lens, the stakes become obvious. 
If a model can be manipulated so easily, what should we expect in settings where errors are costly?\nIn autonomous driving, a failure in perception can corrupt downstream planning. In medical imaging, a brittle decision boundary can turn a minor input shift into a dangerous diagnosis error. In finance or risk screening, a high-confidence mistake can become an automated operational error. The uncomfortable conclusion is that benchmark strength does not imply deployment trustworthiness. A system can be excellent on average and still be dangerously fragile in the worst case.\nThis is why adversarial robustness matters. It asks a sharper question than standard accuracy: how stable is a model when the input is pushed, manipulated, or shifted in ways that matter? That question matters because the world does not present inputs in a clean, static, benchmark-style form.\nThe research community has treated this as an ongoing arms race. As soon as one class of attacks becomes familiar, more adaptive or more realistic attacks follow. Defenses improve, but attacks also become stronger, broader, and less artificial.\nThe arms race in adversarial machine learning. Robustness research evolves as attacks and defenses push against each other over time.\nThat dynamic is why a single “fix” is never the whole story. Robustness is not one trick, one paper, or one benchmark. It is an attempt to understand where models are structurally vulnerable, what kinds of perturbations actually matter, and which defenses meaningfully change the picture.\nThis series follows that path in order. First, we clarify what robustness really means beyond intuition. 
Then we look at how the threat model evolved, how defenses are organized, and why robustness becomes even harder when we move into modern architectures and high-stakes applications.\n","permalink":"https://xiaokunduan.github.io/posts/part-1-why-ai-is-fragile/","summary":"Why high-performing AI systems can still fail under tiny perturbations, and why that fragility matters.","title":"Why AI Can Be Brilliant but Fragile"},{"content":" Adversarial Robustness · Part 2 of 5 What Robustness Really Means ← Why AI Can Be Brilliant but Fragile · Series Home · How Adversarial Attacks Evolved → If robustness matters, the next question is obvious: what exactly are we trying to make robust? The naive answer is “make the model resist noise.” That is too shallow. Robustness is tied to the geometry of the data distribution, the smoothness of the model, and the structure of the features the model uses to decide.\nThis matters because many familiar trade-offs in robustness are not just implementation accidents. Some of them reflect real limits in the learning problem itself.\n1. Robustness Has a Theoretical Ceiling The work of Zhang \u0026amp; Sun, 2024 is useful because it reframes certified robustness in terms of data ambiguity. The core object is Bayes error, the irreducible error that remains even for an optimal classifier because class distributions overlap.\nBayes error as irreducible ambiguity. Some inputs lie in an overlapping region where uncertainty is built into the data distribution itself.\nIn standard learning, a model tries to fit the original distribution $\\mathcal{D}$. In certified robustness, however, the model must give the same answer not just on one input $\\boldsymbol{x}$, but across a neighborhood around it. That changes the target distribution. 
The original distribution is effectively “blurred” by the perturbation neighborhood:\n$$ \\mathcal{D}' = \\mathcal{D} * v $$\nCertified robustness changes the learning problem itself. Convolution with a neighborhood distribution increases class overlap and therefore raises irreducible uncertainty.\nThat blurring makes class overlap worse, not better. So the robust learning problem can become intrinsically harder than the standard one. This is the cleanest explanation for why robust models often lose standard accuracy: part of the trade-off is structural, not just algorithmic.\n2. Robust Models Look Different Internally Theory is only one side of the story. We also want practical signs that a model has learned robust features rather than brittle shortcuts. Jain et al., 2023 show that one of the most informative signals is the model\u0026rsquo;s input gradient on clean samples.\nRobust and vulnerable models reveal different gradient structure. Robust gradients are more organized and human-interpretable; vulnerable ones often look noisy and unstable. Image source: (Jain et al., 2023)\nThe contrast is striking. Vulnerable models tend to produce chaotic, high-frequency gradients. Robust models produce gradients that look more structured and semantically aligned. Numerically, robust models also have much smaller gradient norms, suggesting smoother decision surfaces and lower sensitivity to tiny perturbations.\nThat difference changes what successful attacks must look like.\nEffective attacks look different once a model becomes robust. Vulnerable models fall to high-frequency noise, while robust models are harder to fool without semantically meaningful perturbations. Image source: (Jain et al., 2023)\nThis is an important shift in perspective. Robustness is not only about “surviving stronger attacks.” It is also about whether the model relies on stable, meaningful structure rather than fragile surface correlations.\n3. 
Loss Landscapes Explain Stability Another useful lens is the loss landscape. A robust model should live in a region where small changes to the input or parameters do not sharply increase loss. In other words, robust models should prefer flatter, smoother basins over sharp ones.\nMIMIR and smoother loss basins. Robust pre-training can guide models toward flatter, more stable regions of the optimization landscape. Image source: (Gao et al., 2023)\nGAT under semantic attacks. Generalized adversarial training can flatten the loss landscape across multiple semantic perturbations rather than overfitting to one attack family. Image source: (Laidlaw et al., 2021)\nThis lens is useful because it connects optimization to behavior. A defense should not merely lower loss on a fixed benchmark attack. It should push the model toward a smoother decision space where many nearby perturbations become less dangerous by construction.\nThe Practical Takeaway Robustness is not one scalar property bolted on after training. It is a combination of:\nhow much irreducible uncertainty exists in the target distribution, whether the model\u0026rsquo;s feature usage is structured or brittle, and whether the decision surface is smooth enough to stay stable under perturbation. That gives us a more rigorous question for the rest of the series. If robust models must look different internally and may face theoretical limits, what kinds of attacks are actually trying to break them?\nReferences [1]Zhang, R., \u0026amp; Sun, J. (2024). Certified Robust Accuracy of Neural Networks Are Bounded due to Bayes Errors. Computer Aided Verification, 445\u0026ndash;466. https://doi.org/10.1007/978-3-031-63175-8_19 [2]Jain, G., Balasubramanian, V. N., \u0026amp; Carlini, N. (2023, May 9). Characterizing Model Robustness via Natural Input Gradients. The Eleventh International Conference on Learning Representations. 
","permalink":"https://xiaokunduan.github.io/posts/part-2-what-robustness-really-means/","summary":"A theory-first guide to Bayes error, gradients, and loss landscapes as the foundation for robustness.","title":"What Robustness Really Means"},{"content":" Adversarial Robustness · Part 3 of 5 How Adversarial Attacks Evolved ← What Robustness Really Means · Series Home · How We Defend Models Against Adversarial Attacks → Defenses only make sense if we are clear about the threat. That threat has changed a lot. Early adversarial machine learning focused on digital image classifiers and tiny perturbations measured in $L_p$ norms. Today, the attack surface includes 3D perception, physical transformations, multi-view consistency, and even attacks on explanations rather than predictions.\nThe story of adversarial robustness is therefore also a story about how the threat model expanded.\nFrom Pixel-Space Optimization to Standard Baselines The classic setup, summarized by Yuan et al., 2019, asks the attacker to find a perturbation $\\boldsymbol{\\delta}$ inside a norm-bounded set that maximizes model loss. This gives us the familiar gradient-based baselines:\nFGSM, a single-step attack along the sign of the gradient. PGD, a stronger iterative attack with repeated projection back into the threat set. These attacks matter because they formalized adversarial examples as an optimization problem rather than as anecdotal failures. They also established the core intuition that gradients expose the local directions in which a model is fragile.\n3D Perception Opened a Wider Surface As machine learning moved into autonomous driving and robotics, the attack surface widened. The work by Zhang et al., 2023 shows how 3D detectors can fail under very small point-cloud changes.\n3D detection under point perturbation. 
Small changes in point positions can sharply reduce detector quality. Image source: (Zhang et al., 2023)\nThe important shift is conceptual. In 2D, we usually think about pixel perturbations. In 3D, attackers can perturb points, remove critical points, or inject fake ones. That means vulnerability now depends on representation choices such as how the detector encodes geometry and aggregates evidence.\nPoint detachment via saliency. The attack first identifies which points matter most to the detector, then removes them selectively. Image source: (Zhang et al., 2023)\nArchitecture matters in 3D robustness. Voxel-based and point-based detectors do not fail in the same way or to the same degree. Image source: (Zhang et al., 2023)\nThis is already more realistic than pixel-space perturbation. The attack is no longer just “noise on an image.” It is targeted corruption of a perception pipeline.\nThe Threat Became 3D-Consistent and Semantic Another important step is that attacks no longer need to target one rendered image at a time. Nguyen et al., 2024 attack the underlying NeRF scene representation instead of attacking a single 2D output.\nAdvIRL attacks the 3D representation itself. A reinforcement learning loop modifies NeRF parameters so that many rendered views become adversarial together. Image source: (Nguyen et al., 2024)\nOnce the attack moves into the 3D representation, the perturbation becomes consistent across views. This is a major change: the attack is now tied to scene structure rather than to a single frame.\nAt the same time, researchers began to focus on attacks that are semantically meaningful to humans. One strong example is adversarial viewpoint. Yang et al., 2024 show that a model may recognize an object confidently from one viewpoint and fail completely from another.\nNatural versus adversarial viewpoints. A small change in viewpoint can move the model from a low-loss region into a failure region. 
Image source: (Yang et al., 2024)\nSemantic attacks in driving. Rotations and translations are not arbitrary noise; they are meaningful scene changes that can still destabilize perception. Image source: (Mao et al., 2023)\nThis moves adversarial robustness closer to the real world. The threat is no longer defined only by mathematical convenience. It is defined by changes that correspond to geometry, viewpoint, motion, and the physical arrangement of a scene.\nThe Attack Surface Now Includes Trust and Interpretation The most unsettling extension is that attackers do not always need to change the model\u0026rsquo;s top-line prediction. They can instead target the explanation around that prediction. Baniecki \u0026amp; Biecek, 2024 summarize this as Adversarial Explainable AI (AdvXAI).\nAdversarial Explainable AI expands the target. Explanations, saliency maps, and trust signals can also be attacked. Image source: (Baniecki \u0026amp; Biecek, 2024)\nThis matters because it changes the object of attack. A system can now be manipulated in ways that affect how humans audit, debug, or trust it, even if the prediction interface looks stable.\nThe Main Lesson Adversarial attacks evolved along three axes:\nfrom 2D pixels to richer data modalities such as point clouds and 3D scenes, from synthetic perturbations to semantically meaningful and physical transformations, and from predictions to explanations and system-level trust. That is why modern robustness research cannot be reduced to “defend against PGD.” PGD is still useful, but it is no longer the whole battlefield.\nReferences [1]Yuan, X., He, P., Zhu, Q., \u0026amp; Li, X. (2019). Adversarial Examples: Attacks and Defenses for Deep Learning. IEEE Transactions on Neural Networks and Learning Systems, 30, Article 9. https://doi.org/10.1109/TNNLS.2018.2886017 [2]Zhang, C.-H., Zhang, Z., Wu, S., Jiang, T.-Y., \u0026amp; Liu, S. (2023). 
A Comprehensive Study of the Robustness for LiDAR-based 3D Object Detectors against Adversarial Attacks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 21919\u0026ndash;21929. [3]Nguyen, T., Ergezer, M., \u0026amp; Green, C. (2024). AdvIRL: Reinforcement Learning-Based Adversarial Attacks on 3D NeRF Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20907\u0026ndash;20917. [4]Yang, R., Chen, Y., Misailovic, S., \u0026amp; Singh, G. (2024). Towards Viewpoint-Invariant Visual Recognition via Adversarial Training. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 25191\u0026ndash;25202. [5]Baniecki, H., \u0026amp; Biecek, P. (2024). Adversarial Attacks and Defenses in Explainable Artificial Intelligence: A Survey. Information Fusion, 107, 102303. https://doi.org/10.1016/j.inffus.2024.102303","permalink":"https://xiaokunduan.github.io/posts/part-3-how-attacks-evolved/","summary":"From FGSM and PGD to 3D attacks, adversarial viewpoints, and explainability attacks.","title":"How Adversarial Attacks Evolved"},{"content":" Adversarial Robustness · Part 4 of 5 How We Defend Models Against Adversarial Attacks ← How Adversarial Attacks Evolved · Series Home · Robustness in Modern Models and High-Stakes Settings → Once we accept that attacks are diverse and evolving, the defense literature becomes easier to organize. It is not a random pile of papers. Most methods are trying to solve one of four problems:\nmake the model survive stronger attacks during training, improve the training data rather than only the optimizer, provide formal guarantees rather than benchmark-only evidence, or reduce the computational cost enough for real deployment. 
That four-part map is a more useful way to read the field than the raw chronology of papers.\n1. Adversarial Training: Learn in a Hostile Environment Adversarial training remains the default baseline because it directly optimizes for worst-case behavior:\n$$ \\min_{\\theta} \\mathbb{E}_{(\\boldsymbol{x}, y) \\sim \\mathcal{D}} \\left[ \\max_{\\boldsymbol{\\delta} \\in \\mathcal{S}} \\mathcal{L}(\\boldsymbol{x} + \\boldsymbol{\\delta}, y; \\theta) \\right] $$\nThe inner loop generates hard examples; the outer loop forces the model to handle them. It works, but it is expensive and often hurts clean accuracy.\nTwo representative responses to those limitations are Generalized Adversarial Training (GAT) and LORE.\nGAT, from Laidlaw et al., 2021, extends adversarial training beyond one perturbation family and asks how multiple semantic attacks combine.\nComposite attacks are order-sensitive. GAT studies how attack composition changes the optimization problem and motivates broader adversarial training. Image source: (Laidlaw et al., 2021)\nLORE, from Zhao et al., 2023, focuses on the accuracy-robustness trade-off by constraining fine-tuned embeddings to stay close to the original pre-trained representation on clean data.\nLORE treats the trade-off as constrained optimization. Rather than simply weighting clean and robust loss, it protects a clean embedding region while improving robustness. Image source: (Zhao et al., 2023)\nThe common idea is simple: adversarial training is still the workhorse, but it must be broadened or stabilized if we want general robustness without destroying normal performance.\n2. Data-Centric Defenses: Change What the Model Learns From Some defenses target the training distribution rather than the optimizer. The question becomes: can better data or better supervision make robust features easier to learn?\nIPMix, from Lee et al., 2022, is a good example of high-diversity augmentation. 
It mixes image-level, patch-level, and pixel-level transformations in a structured way to enrich training coverage.\nIPMix increases diversity through multi-granularity mixing. The goal is not just more data, but richer combinations of local and global structure. Image source: (Lee et al., 2022)\nBut more augmentation is not always better. Bai et al., 2023 show that for adversarially trained Vision Transformers, strong augmentations such as MixUp and CutMix can actually make robustness worse. Their “light recipe” works precisely because it removes that extra source of training ambiguity.\nThe ViT light recipe argues for less, not more. In adversarial training, strong augmentation can interfere with learning rather than help it. Image source: (Bai et al., 2023)\nThese two results look contradictory until we ask the right question. Data-centric defenses are not about maximizing variety in the abstract. They are about improving the quality of supervisory signal for the specific robustness objective we care about.\n3. Certified and Deterministic Defenses: Stop Chasing Individual Attacks Empirical defenses ask whether a model survives known attacks. Certified defenses ask a stronger question: can we prove the prediction stays unchanged inside a whole perturbation set?\nRandomized smoothing became one of the most practical routes because it can scale beyond tiny toy networks. But it has limits, especially in high dimensions. Dual Randomized Smoothing (DRS), from Kumar et al., 2023, attacks that limitation directly by decomposing a hard high-dimensional certification problem into smaller ones.\nDRS as divide-and-conquer certification. Splitting the input can improve the effective certified radius in high-dimensional settings. Image source: (Kumar et al., 2023)\nFor applications that need deterministic guarantees rather than probabilistic ones, the literature moves further. 
Certified Geometric Training (CGT), from Yang et al., 2023, is representative because it brings geometric verification into training and gives verifiable boundaries for transformations such as rotation.\nCGT provides explicit safety bounds. Instead of hoping a model generalizes, the method certifies that predictions remain safe within a transformation range. Image source: (Yang et al., 2023)\nThe key shift here is philosophical. Certification tries to replace the attack-defense arms race with a guaranteed safety boundary. That is expensive and often conservative, but in some domains it is the only standard that really matters.\n4. Efficient Defenses: Make Robustness Deployable A defense that only works with extreme compute or slow inference is hard to use in practice. That is why efficient purification remains important.\nThe OSCP framework from Lei et al., 2024 is a good example. Instead of long iterative purification, it aims for one-step purification while preserving semantic structure.\nOSCP is built around the speed-versus-quality trade-off. It aims to purify quickly enough for real-time use while keeping the defense meaningful. Image source: (Lei et al., 2024)\nPurification quality matters. If the defense removes semantics together with noise, it can break the input even when it defeats the perturbation. Image source: (Lei et al., 2024)\nEfficiency is not a side concern. It determines whether a defense can move from paper results into an actual system.\nA Better Way to Read the Defense Landscape The field becomes much easier to reason about once we treat it as four defense routes:\nadversarial training for worst-case learning, data-centric methods for better robust supervision, certification for provable guarantees, efficient purification for deployability. No single route dominates every setting. The right defense depends on the threat model, the cost of errors, and the available computation. 
That is why robustness is best understood as a design space, not as one leaderboard.\nReferences [1]Laidlaw, C., Singla, S., \u0026amp; Feizi, S. (2021). Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations. Proceedings of the IEEE/CVF International Conference on Computer Vision, 15302\u0026ndash;15312. https://doi.org/10.1109/ICCV48922.2021.01501 [2]Zhao, Z., Zhang, J., Wu, X., \u0026amp; Liu, J. (2023). LORE: Lagrangian-Optimized Robust Embeddings for Visual Encoders. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16568\u0026ndash;16578. [3]Lee, S.-H., Jeong, M., Park, S.-Y., Yun, S.-B., \u0026amp; Choo, J. (2022). IPMix: Label-Preserving Data Augmentation Method for Training Robust Classifiers. Computer Vision \u0026ndash; ECCV 2022, 21\u0026ndash;38. [4]Bai, Y., Ding, M., Wang, Y., Zhang, Z.-M., Wang, J., \u0026amp; Tao, D. (2023). A Light Recipe to Train Robust Vision Transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, Article 10. https://doi.org/10.1109/TPAMI.2023.3283256 [5]Kumar, A., Schwarzschild, A., Gupta, T., Goldblum, M., Gehr, T., \u0026amp; Goldstein, T. (2023). Mitigating the Curse of Dimensionality for Certified Robustness via Dual Randomized Smoothing. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 24657\u0026ndash;24666. https://doi.org/10.1109/CVPR52729.2023.02366 [6]Yang, R., Laurel, J., Misailovic, S., \u0026amp; Singh, G. (2023, May 9). Provable Defense Against Geometric Transformations. The Eleventh International Conference on Learning Representations. [7]Lei, C. T., Yam, H. M., Guo, Z., \u0026amp; Lau, C. P. (2024). Instant Adversarial Purification with Adversarial Consistency Distillation. 
","permalink":"https://xiaokunduan.github.io/posts/part-4-how-we-defend-models/","summary":"A compact map of the main defense routes: adversarial training, data-centric methods, certification, and efficient purification.","title":"How We Defend Models Against Adversarial Attacks"},{"content":" Adversarial Robustness · Part 5 of 5 Robustness in Modern Models and High-Stakes Settings ← How We Defend Models Against Adversarial Attacks · Series Home Robustness becomes more complicated once we leave the standard image-classification setting. Different model families fail for different reasons, and high-stakes applications care about more than top-1 accuracy. They care about calibration, detection quality, deployment shifts, asymmetric error costs, and multi-sensor consistency.\nThis is the point where robustness stops looking like a narrow adversarial example problem and starts looking like a systems problem.\nModern Architectures Change the Shape of Fragility Robustness is not architecture-agnostic. Vision Transformers, prompt-tuned models, self-supervised encoders, spiking networks, and prototype-based systems all introduce different inductive biases and therefore different failure modes.\nFor ViTs, one of the most important lessons is that the standard training recipe does not transfer cleanly into adversarial settings. Bai et al., 2023 show that removing strong augmentation can improve adversarial robustness when training ViTs.\nRobustness in ViTs requires recipe changes. The best standard recipe is not automatically the best robust recipe. Image source: (Bai et al., 2023)\nPrompt tuning creates a different issue. Fu et al., 2023 argue that naive adversarial training in prompt-tuned systems can produce gradient obfuscation, creating an illusion of safety rather than real robustness. 
The lesson is that robust training must adapt to the actual parameterization of the model.\nSelf-supervised pre-training changes the picture again. Gao et al., 2023 show that reconstructing clean images from doubly corrupted inputs can push an encoder toward smoother, more robust features.\nMIMIR uses corruption and reconstruction to shape robust features. Robustness is influenced by pre-training objectives, not only by the final fine-tuning stage. Image source: (Gao et al., 2023)\nTaken together, these examples show that “modern models” do not add one new robustness problem. They add many. Each architecture changes the interface between optimization, representation, and attack surface.\nHigh-Stakes Domains Care About More Than Classification Now consider autonomous driving, medical imaging, or any other domain where errors have very different costs. Here robustness is not just about whether an image classifier flips its label. It is about whether a perception-and-decision stack remains stable under distribution shifts, multi-modal corruption, and task-specific failure modes.\nIn 3D autonomous driving perception, architecture choice already affects robustness. Zhang et al., 2023 show that voxel-based detectors are often more robust than point-based ones under point-cloud attacks.\nRobustness depends on representation in 3D detection. Architecture is already part of the defense story. Image source: (Zhang et al., 2023)\nFor multi-sensor fusion systems, the problem is broader. Mao et al., 2023 propose robustness certification for semantic transformations such as rotation and translation in camera-LiDAR fusion.\nCOMMIT moves robustness toward system-level guarantees. The target is no longer one classifier, but a fused perception pipeline. Image source: (Mao et al., 2023)\nMedical settings expose another axis: domain shift. A model trained in one hospital may see a noticeably different distribution in another. 
Weng et al., 2023 show that robustness can generalize across domains more than we might expect, but that generalization cannot be assumed.\nHospital-to-hospital shift changes the robustness question. In medical imaging, distribution shift is often as important as explicit attack design. Image source: (Weng et al., 2023)\nCost Matters, Not Just Error Rate In many deployments, not all mistakes are equal. Misclassifying a malignant tumor as benign is not the same kind of error as the reverse. That motivates cost-sensitive robustness, where the defense is shaped around the most dangerous failures rather than around a flat misclassification rate.\nA cost-sensitive certified radius protects against the most dangerous errors first. Safety-critical robustness must respect asymmetric consequences. Image source: (Horváth et al., 2023)\nThis is a useful closing lesson for the entire series. Robustness is not just a property of a model. It is also a property of the environment, the task, and the cost structure around errors.\nWhere the Field Is Heading Modern robustness research is moving toward a broader agenda:\nrobustness that respects architecture-specific behavior, robustness that survives domain shift and multi-modal deployment, and robustness that prioritizes the failures that matter most in practice. That is why I think the right mental model is no longer “Can this classifier survive PGD?” The better question is: What does reliability require in this system, under this deployment, with these costs?\nReferences [1]Bai, Y., Ding, M., Wang, Y., Zhang, Z.-M., Wang, J., \u0026amp; Tao, D. (2023). A Light Recipe to Train Robust Vision Transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, Article 10. https://doi.org/10.1109/TPAMI.2023.3283256 [2]Fu, Z., Yuan, X., Li, Y., Guo, Y., Wang, Y., \u0026amp; Zhang, Y. (2023). ADAPT to Robustify Prompt Tuning Vision Transformers. 
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16084\u0026ndash;16093. https://doi.org/10.1109/CVPR52729.2023.01548 [3]Gao, P., Wang, J., Liu, T., Yan, S., \u0026amp; Wang, B. (2023). MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 24908\u0026ndash;24918. https://doi.org/10.1109/CVPR52729.2023.02390 [4]Zhang, C.-H., Zhang, Z., Wu, S., Jiang, T.-Y., \u0026amp; Liu, S. (2023). A Comprehensive Study of the Robustness for LiDAR-based 3D Object Detectors against Adversarial Attacks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 21919\u0026ndash;21929. [5]Mao, C., Liu, C., Yang, R., Yang, H., Singh, G., \u0026amp; Liu, X. (2023). COMMIT: Certifying Robustness of Multi-Sensor Fusion Systems against Semantic Attacks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 21789\u0026ndash;21798. [6]Weng, T., Chiang, P., Wang, S., Zhang, H., \u0026amp; Hsieh, C. (2023). Generalizability of Adversarial Robustness Under Distribution Shifts. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 24604\u0026ndash;24613. 
https://doi.org/10.1109/CVPR52729.2023.02361 ","permalink":"https://xiaokunduan.github.io/posts/part-5-robustness-in-modern-high-stakes-settings/","summary":"Why robustness becomes a systems problem in modern architectures and high-cost applications.","title":"Robustness in Modern Models and High-Stakes Settings"},{"content":" Prefer the 5-part readable version?\nThis full article is still here as the reference version, but I also split it into a shorter 5-part series for easier reading and sharing.\nStart the series\nMotivation We are in the midst of a transformative era driven by deep learning, particularly by the large language models (LLMs) based on the Transformer architecture. These models are demonstrating capabilities that surpass human experts in a growing range of domains, operating with unprecedented efficiency and accuracy. From mastering complex intellectual challenges like Go and protein folding to accelerating drug discovery and scientific breakthroughs, the power of AI seems to be reshaping our very definition of \u0026ldquo;intelligence.\u0026rdquo;\nHowever, in stark contrast to these powerful capabilities lies a profound and counter-intuitive vulnerability: these seemingly omnipotent tools can be remarkably fragile when faced with minute, imperceptible perturbations.\nA classic example of an adversarial attack. A nearly imperceptible perturbation causes the model to misclassify a \u0026lsquo;panda\u0026rsquo; as a \u0026lsquo;gibbon\u0026rsquo; with extremely high confidence. Image source: (Goodfellow et al., 2015)\nIn the original image, a standard classification model correctly identifies the \u0026ldquo;panda\u0026rdquo; with 57.7% confidence. 
However, when a carefully crafted layer of \u0026ldquo;adversarial noise\u0026rdquo;—nearly invisible to the human eye—is introduced, a catastrophic failure occurs. This noise is not random; it is computed using the model\u0026rsquo;s gradient information, specifically designed to maximize the classification error. The result is not just an incorrect prediction, but one made with exceptionally high confidence (99.3%), misidentifying the panda as a \u0026ldquo;gibbon.\u0026rdquo;\nThis example sharply poses a fundamental question: If a model that can outperform humans on intellectual benchmarks can be so easily deceived by such a trivial \u0026ldquo;digital sleight of hand,\u0026rdquo; how can we trust it to perform high-stakes tasks like autonomous driving, medical diagnosis, or financial trading? This is precisely why enhancing model robustness is of paramount importance.\nThe problem, first highlighted by Goodfellow and his colleagues, has sparked an ongoing evolution of adversarial attacks and defenses. Researchers are committed to building increasingly resilient defense mechanisms (the Captain America\u0026rsquo;s shield, in our analogy) to withstand the continuous emergence of more sophisticated attack methods (Thor\u0026rsquo;s hammer).\nThe continuous arms race in adversarial machine learning. 
This is conceptualized as a battle between an unstoppable force (attacks) and an immovable object (defenses).\nThis post aims not merely to catalogue the latest \u0026ldquo;hammers and shields.\u0026rdquo; Our core objective is to systematically dissect and analyze these cutting-edge works to identify unexplored intersections, emerging challenges, and critical research gaps, thereby charting a course for future inquiry.\nUnderstanding Robustness Before delving into specific defense techniques like adversarial training or data augmentation, we must first address a few fundamental questions: What are the intrinsic theoretical limits of a defense system\u0026rsquo;s robustness? Do such boundaries exist? And how can we assess whether a model possesses inherently robust characteristics?\nThe ultimate limit of a deep learning model\u0026rsquo;s robustness is not dictated by our algorithms, but by the inherent ambiguity within the data distribution itself. Groundbreaking research from Zhang \u0026amp; Sun, 2024 leverages the concept of Bayes Error to reveal this insurmountable \u0026ldquo;theoretical ceiling.\u0026rdquo;\nThe core idea of Bayes error lies in the unavoidable, inherent ambiguity of data. It arises from the natural overlap between the distributions of different classes, where even a \u0026ldquo;perfect\u0026rdquo; classifier cannot make a 100% certain judgment. A widely-circulated example provides an excellent intuitive explanation for this statistical concept:\nAn intuitive illustration of Bayes error. The inherent ambiguity of the input data (is it a cat or a dog?) represents the irreducible error that even a perfect classifier cannot overcome.\nIn Figure 2(a), it is difficult to determine whether the animal is a cat or a dog based solely on its back. This input sample lies in the overlapping region of the \u0026ldquo;cat\u0026rdquo; and \u0026ldquo;dog\u0026rdquo; class distributions. 
More formally, for a given data distribution, the Bayes optimal classifier operates by selecting the class with the highest posterior probability for any given input. The Bayes error represents the absolute minimum error rate that even this perfect classifier cannot avoid. It quantifies the inherent, irreducible uncertainty within the data by calculating the expectation of one minus the highest class probability for each sample.\nThe core insight from Zhang \u0026amp; Sun, 2024 is that the pursuit of certified robustness fundamentally alters the data distribution the model needs to learn. Standard training aims to fit the original data distribution $\\mathcal{D}$. Certified training, however, requires the model to yield the same prediction for an input $\\boldsymbol{x}$ and its entire surrounding neighborhood $\\mathcal{V}_{\\boldsymbol{x}}$. This is equivalent to \u0026ldquo;smearing\u0026rdquo; the label of $\\boldsymbol{x}$ across the whole neighborhood. When this is done for all data points, the process is mathematically equivalent to a convolution of the original distribution $\\mathcal{D}$ with a function $v$ representing the neighborhood, forming a new, \u0026ldquo;blurred\u0026rdquo; distribution $\\mathcal{D}' = \\mathcal{D} * v$.\nThe effect of convolution on data distributions. The process of certified robustness effectively \u0026lsquo;blurs\u0026rsquo; the original distribution (left), increasing the overlap between classes (right) and thereby raising the inherent Bayes error. (Concept from Zhang and Sun, 2024)\nThis \u0026ldquo;blurring\u0026rdquo; can raise the Bayes error but never lower it. The paper rigorously proves that the convolution process will necessarily increase or maintain the system\u0026rsquo;s inherent uncertainty, formally stated as $\\beta_{\\mathcal{D}'} \\ge \\beta_{\\mathcal{D}}$. This implies that for a model to become robust, the learning task itself becomes inherently more difficult. 
Therefore, the drop in accuracy for robust models is not solely a flaw of the algorithm but is rooted in the higher, unavoidable error rate of the target distribution it optimizes. The research further derives a computable Irreducible Robustness Error, $\\zeta_D$, establishing the theoretical upper bound of certified robust accuracy at $1 - \\zeta_D$. This upper bound fundamentally explains the trade-off between improving model robustness and sacrificing standard accuracy.\nGiven that model robustness has a theoretical ceiling, how can we evaluate whether a model possesses inherently robust features? The work by Jain et al., 2023 provides an elegant and efficient method: by observing the model\u0026rsquo;s \u0026ldquo;input gradients\u0026rdquo; on normal, clean samples, we can easily identify its level of robustness.\nInput gradients can be understood as the model\u0026rsquo;s \u0026ldquo;attention map\u0026rdquo; or \u0026ldquo;sensitivity distribution,\u0026rdquo; revealing the degree to which it relies on input features for its decisions. Their research finds significant, systematic differences between the input gradients of vulnerable and robust models.\nA comparison of input gradients on clean samples for vulnerable versus robust models. Robust models exhibit structured, human-interpretable gradients, whereas vulnerable models show chaotic, high-frequency noise. Image source: (Jain et al., 2023)\nThis finding is not only visually intuitive but also numerically validated. The gradient norms of robust models are about two orders of magnitude lower than those of vulnerable models, implying their decision function surfaces are much \u0026ldquo;smoother\u0026rdquo; and less sensitive to small input changes. This intrinsic difference also dictates the types of attacks that are effective against them.\nVisualization of adversarial perturbations for different models. 
Attacks on vulnerable models consist of high-frequency noise, while attacks on robust models must contain semantic structures similar to the original image to be effective. Image source: (Jain et al., 2023)\nA model\u0026rsquo;s robustness is deeply reflected in its gradient response to normal inputs and the nature of attacks that can threaten it. A robust model focuses on structure and, therefore, can only be challenged by structured attacks. This provides a new dimension for rapidly assessing model robustness without resorting to expensive attack-based testing.\nA robust model should maintain a stable decision boundary in the face of input perturbations. This stability can be visualized through the \u0026ldquo;Loss Landscape,\u0026rdquo; which depicts how the loss function (the degree of error) changes with slight variations in model parameters or inputs.\nAn ideally robust model should possess a wide, flat loss landscape in both its parameter and input spaces, rather than a sharp, steep one. In a flat region, small perturbations do not cause drastic increases in the loss value, indicating more stable model performance.\nThis notion is validated from different perspectives by research from Xu et al. (2023) and in the work on Generalized Adversarial Training (GAT) by Laidlaw et al., 2021.\nA smoother loss landscape resulting from MIMIR pre-training. Compared to the baseline MAE, models pre-trained with MIMIR (first and last columns) converge to broader, flatter loss basins, indicating greater stability. Image source: (Gao et al., 2023)\nThe smooth loss landscape of GAT under various semantic attacks. When faced with multiple semantic attacks such as changes in hue, rotation, and saturation, the standard model\u0026rsquo;s loss landscape (blue curve) is filled with dramatically fluctuating high peaks, indicating extreme vulnerability. In contrast, the model trained with GAT (green curve) exhibits a loss landscape that is exceptionally flat and close to the bottom. 
This demonstrates that GAT does not simply defend against specific attacks but fundamentally shapes a smoother, more stable decision space, making it \u0026lsquo;immune\u0026rsquo; to a wide variety of input perturbations. Image source: (Laidlaw et al., 2021)\nThe loss landscape offers a deeper perspective for understanding robustness. An effective defense mechanism should not merely aim to reduce loss on certain adversarial examples, but should reshape the entire loss landscape, guiding the model towards an inherently smoother and more stable state.\nThe Evolving Threats Effective defense must be built upon a deep understanding of the threats. In the field of deep learning security, adversarial attacks are far from static; they continuously evolve in response to advancements in defense technologies.\nEarly research on adversarial attacks, as summarized in surveys like the one by Yuan et al., 2019, primarily focused on the digital image space, using $L_p$ norms to measure the \u0026ldquo;imperceptibility\u0026rdquo; of perturbations. The goal of such attacks is to find a perturbation $\\boldsymbol{\\delta}$ within an $L_p$-norm ball $\\mathcal{S}$ of radius $\\epsilon$ centered at the original input $\\boldsymbol{x}$ that maximizes the model\u0026rsquo;s loss. Classic methods largely rely on the model\u0026rsquo;s gradient information. For example, the Fast Gradient Sign Method (FGSM) adds a perturbation in the direction of the loss function\u0026rsquo;s gradient $\\nabla_{\\boldsymbol{x}} \\mathcal{L}(\\boldsymbol{x}, y; \\theta)$ in a single step, while Projected Gradient Descent (PGD) iteratively updates the perturbation in small steps and projects it back into the $L_p$-norm ball, often finding more effective attacks. These gradient-based optimization methods form the cornerstone of adversarial attack research.\nWith the rise of technologies like autonomous driving, the focus of attacks has gradually shifted from 2D images to more complex 3D perception systems. 
Zhang et al., 2023 conducted a comprehensive robustness evaluation in this domain, extending classic attack ideas to 3D point clouds.\nThe effect of adversarial point perturbation attacks on mainstream 3D detectors. As shown, minute perturbations to a point cloud (point displacements of less than 10cm), imperceptible to the human eye, can cause a sharp decline in the performance of various mainstream 3D detection models, including SECOND, PointPillar, and PV-RCNN, leading to numerous false negatives and false positives. Image source: (Zhang et al., 2023)\nThe study systematically analyzes three types of attacks: Point Perturbation, which involves slightly shifting the 3D coordinates of each point; Point Detachment, which removes a small number of critical points; and Point Attachment, which adds optimized \u0026ldquo;fake\u0026rdquo; points in vulnerable areas of the scene. The effectiveness of detachment and attachment relies on locating \u0026ldquo;critical points\u0026rdquo; or \u0026ldquo;vulnerable regions,\u0026rdquo; which is typically achieved by generating a saliency map based on the gradient of the loss function with respect to each point\u0026rsquo;s coordinates.\nThe process of a saliency-map-based point detachment attack. An attacker first generates an \u0026lsquo;importance map\u0026rsquo; via gradient computation (a), where warmer colors indicate points with a greater impact on the detection result. The attacker then precisely and iteratively removes these critical points, which are primarily distributed on the object\u0026rsquo;s surface (b, c, d), thereby compromising the model\u0026rsquo;s perception with minimal cost. Image source: (Zhang et al., 2023)\nBeyond point clouds, emerging 3D representation methods like Neural Radiance Fields (NeRF) have also exposed new attack surfaces. Nguyen et al., 2024 proposed an innovative framework for attacking NeRF models in a black-box setting. 
The core idea is to shift from manipulating the pixels of 2D rendered images to directly modifying the parameters of the NeRF model itself through reinforcement learning. This creates an inherently adversarial 3D scene from which any rendered 2D view will possess adversarial properties.\nThe workflow of the AdvIRL framework. A reinforcement learning agent iteratively fine-tunes the NeRF model\u0026rsquo;s parameters, optimizing based on feedback from a target classifier to train a 3D model that consistently generates adversarial views. Image source: (Nguyen et al., 2024)\nThe elegance of this method lies in its shift of focus from the vulnerability of individual 2D images to the vulnerability of the 3D representation itself, making the resulting adversarial noise inherently robust to 3D transformations like rotation and scaling.\nOverview of adversarial examples generated by AdvIRL. The method achieves success across various scenes. The leftmost image in each row is the original, unattacked rendering, while the columns to the right show adversarial images rendered from different viewpoints after the AdvIRL attack. The adversarial noise (mainly color and texture distortions) is consistent in 3D space. Image source: (Nguyen et al., 2024)\nTo demonstrate its attack capabilities more concretely, let\u0026rsquo;s look at a few examples. In an attack targeting a lighthouse scene, AdvIRL successfully caused the model to misclassify.\nA targeted attack on the lighthouse scene. This is an adversarial rendering generated by AdvIRL with the goal of making the classifier misidentify the \u0026lsquo;lighthouse\u0026rsquo; as a \u0026lsquo;boathouse.\u0026rsquo; The attack successfully achieved this with a 50% classification confidence. Subfigures (a)-(d) show renderings from different angles, demonstrating the attack\u0026rsquo;s effectiveness across multiple viewpoints. 
Image source: (Nguyen et al., 2024)\nIn simulations relevant to safety-critical domains like autonomous driving, the threat is even more palpable. The researchers launched a targeted attack on a truck scene, aiming to have the model identify it as a \u0026ldquo;cannon.\u0026rdquo;\nImages generated from different angles of the adversarially perturbed truck. This targeted attack was remarkably successful, with 15 out of 20 renderings from different viewpoints being misclassified as \u0026lsquo;cannon,\u0026rsquo; with confidence levels ranging from 15% to 70%. This highlights the potential risk of such attacks on visual models in applications like autonomous driving. Image source: (Nguyen et al., 2024)\nWhile $L_p$-norm attacks are mathematically tractable, they often diverge from real-world physical perturbations. Consequently, research is increasingly shifting towards more semantically meaningful attacks that directly manipulate our perceptual dimensions, such as object geometry and viewpoint.\nThe work by Yang et al., 2024 profoundly reveals the extreme vulnerability of modern visual models in this dimension. The study points out that a model\u0026rsquo;s performance can fluctuate dramatically with changes in viewpoint, with multiple \u0026ldquo;adversarial viewpoint\u0026rdquo; regions where even a slight change can cause the model to switch from a correct identification to a completely wrong one.\nA comparison between natural and adversarial viewpoints. 
The same forklift model is correctly identified with high confidence (0.99) from a \u0026lsquo;Natural Viewpoint.\u0026rsquo; However, when observed from certain uncommon \u0026lsquo;Adversarial Viewpoints,\u0026rsquo; the same model fails completely, misidentifying it as unrelated objects like a \u0026lsquo;spotlight\u0026rsquo; or \u0026lsquo;stopwatch.\u0026rsquo; The loss landscape in the bottom right intuitively illustrates this: natural viewpoints correspond to low-loss \u0026lsquo;valleys,\u0026rsquo; while multiple adversarial viewpoints correspond to high-loss \u0026lsquo;peaks.\u0026rsquo; Image source: (Yang et al., 2024)\nTo systematically study and defend against such attacks, the researchers proposed the Viewpoint-Invariant Adversarial Training (VIAT) framework. At its core is a min-max optimization process designed to find and leverage the most confusing viewpoints to train the classifier.\nThe min-max optimization process of the VIAT framework. The entire process is a game. First, multi-view images of an object are encoded into a continuous Neural Radiance Field (NeRF) representation. Then, in the inner maximization stage, an attacker learns a Gaussian Mixture distribution to find the \u0026lsquo;worst-case\u0026rsquo; viewpoint distribution that maximizes the classifier\u0026rsquo;s loss. Subsequently, in the outer minimization stage, the classifier is trained on images rendered from these worst-case viewpoints to minimize its loss, thereby learning viewpoint invariance. Image source: (Yang et al., 2024)\nA key finding of this research is the strong transferability of adversarial viewpoints across objects of the same class.\nThe similarity of loss landscapes for different objects of the same class. Although the four \u0026lsquo;sofa\u0026rsquo; objects differ significantly in appearance, their respective loss landscapes are surprisingly similar. 
The high-loss \u0026lsquo;peaks\u0026rsquo; and low-loss \u0026lsquo;valleys\u0026rsquo; appear in roughly the same locations across all four maps. Image source: (Yang et al., 2024)\nThis discovery provides the theoretical basis for an efficient training strategy called distribution sharing, where adversarial viewpoint distributions can be shared and reused among different objects within the same class during training, significantly improving efficiency and enhancing the model\u0026rsquo;s generalization capabilities.\nModels trained with the VIAT framework exhibit a qualitative leap in robustness, capable not only of defending against specially designed synthetic attacks but also of generalizing to complex real-world scenarios.\nPerformance comparison of standard and VIAT models across multiple scenarios. The standard model proves extremely vulnerable to various adversarial viewpoints (whether from synthetic attacks like GMVFool/ViewFool or real-world objects), frequently making errors. In contrast, the VIAT-trained model demonstrates powerful defense capabilities, making correct judgments in the vast majority of cases, proving its robustness has strong generalization ability and practical value. Image source: (Yang et al., 2024)\nSimilarly, in the autonomous driving domain, Mao et al., 2023 also focused on common physical-world transformations like rotation and translation. These transformations directly alter meaningful attributes of a scene and pose a direct threat to the reliability of multi-sensor fusion systems.\nSemantic transformations in an autonomous driving scene. Slight rotations (a) or changes in the distance (b) of the vehicle ahead can significantly alter the input data distribution for sensors (camera and LiDAR), potentially leading to detection failure. Image source: (Mao et al., 2023)\nAs attack and defense methods become increasingly sophisticated, viewing attacks as isolated events is no longer sufficient. 
Zhou et al., 2022 were the first to borrow the lifecycle model of Advanced Persistent Threats (APT) from the cybersecurity domain to provide a systematic analysis framework for adversarial attacks.\nThe APT lifecycle model for adversarial attacks. A complex attack can be decomposed into five interconnected stages: 1. Vulnerability Analysis (theoretical reconnaissance), 2. Fabrication (generating basic attacks), 3. Post-Fabrication (enhancing attack transferability or depth), 4. Real Application (applying in the physical world), and 5. Re-evaluation of Imperceptibility (optimizing the perturbation to be more covert). Image source: (Zhou et al., 2022)\nThis framework integrates disparate attack methods into a unified view, revealing their roles and objectives throughout the attack process.\nThe most advanced and insidious attacks no longer target the model\u0026rsquo;s prediction outcome but its explainability—that is, our trust in the model\u0026rsquo;s decision-making process. The survey by Baniecki \u0026amp; Biecek, 2024 systematically summarizes this emerging field, terming it Adversarial Explainable AI (AdvXAI).\nThe advent of AdvXAI has opened a new battleground: attacks on model explainability (XAI).\nA taxonomy of attacks and defenses in Adversarial Explainable AI (AdvXAI). Attackers can manipulate model explanations through adversarial examples, backdoors, etc., while defenders can counter with model regularization, explanation aggregation, and other techniques. Image source: (Baniecki \u0026amp; Biecek, 2024)\nThe goal of such attacks is no longer the model\u0026rsquo;s predictive accuracy but our trust in its decision-making process. An attacker can successfully manipulate a model\u0026rsquo;s explanation without altering its prediction, thereby providing a false yet seemingly plausible justification to mask the model\u0026rsquo;s true, potentially biased or flawed, reasoning. 
This attack on the model\u0026rsquo;s \u0026ldquo;mind\u0026rdquo; presents a novel challenge for deep learning security, warning us against blindly trusting the explanations provided by XAI methods.\nTypes of Defense Adversarial Training We now turn to the most mainstream and effective defense paradigm today: Adversarial Training (AT). The core idea of AT is to introduce adversarial attacks into the training process. By continuously generating adversarial examples that can fool the current model and using them as training data, AT forces the model to learn a decision boundary that is less sensitive to input perturbations. This chapter will start from the basic framework of AT, explore its evolution to address more complex threats, and analyze various advanced strategies aimed at mitigating its inherent performance trade-offs.\nThe essence of adversarial training, as articulated in surveys by Yuan et al., 2019 and Zhou et al., 2022, is a min-max optimization problem. Its objective can be formally expressed as: $$ \\min_{\\theta} \\mathbb{E}_{(\\boldsymbol{x}, y) \\sim \\mathcal{D}} \\left[ \\max_{\\boldsymbol{\\delta} \\in \\mathcal{S}} \\mathcal{L}(\\boldsymbol{x} + \\boldsymbol{\\delta}, y; \\theta) \\right] $$ This framework involves a two-stage game. First is the Inner Maximization, where for fixed model parameters $\\theta$ and input $\\boldsymbol{x}$, an adversarial perturbation $\\boldsymbol{\\delta}$ is found within the set $\\mathcal{S}$ to maximize the loss function $\\mathcal{L}$, generating the most effective attack sample. This is followed by the Outer Minimization, where the model parameters $\\theta$ are adjusted to minimize the expected loss on these worst-case samples, thereby enhancing the model\u0026rsquo;s robustness. 
Although this framework is powerful, traditional AT often focuses only on defending against single, pixel-based $L_p$-norm attacks, which falls short of the complex and varied threat models in the real world.\nTo address real-world composite perturbations, which are often multi-dimensional, the work by Laidlaw et al., 2021 extends adversarial training from defending against single perturbations to defending against combinations of multiple semantic perturbations, proposing the Generalized Adversarial Training (GAT) framework. A core insight of their research is that in composite attacks, the order of attacks is crucial.\nThe impact of attack order on the effectiveness of composite attacks. As shown, simply moving the ℓ∞ attack from the first step to the fourth step of the attack sequence can turn a previously ineffective composite attack into a successful one, causing the model to misclassify a \u0026lsquo;warplane\u0026rsquo; as a \u0026lsquo;wing.\u0026rsquo; Image source: (Laidlaw et al., 2021)\nTo this end, the GAT framework utilizes a Composite Adversarial Attack (CAA) method that automatically learns the optimal attack sequence to generate more powerful adversarial examples. By training in this more severe and realistic attack environment, GAT enables the model\u0026rsquo;s decision boundary to become smoother to multiple types of perturbations, thus achieving stronger general robustness.\nDespite its effectiveness, the most criticized drawback of adversarial training is that it often comes at the cost of the model\u0026rsquo;s standard accuracy on clean data. To mitigate this \u0026ldquo;accuracy-robustness trade-off,\u0026rdquo; researchers have proposed several ingenious balancing strategies. One such solution, presented by Chen et al., 2023, is the B-MTARD framework. This method moves beyond a single model and introduces two expert teachers for knowledge distillation.\nThe multi-teacher distillation framework of B-MTARD. 
The framework has a student model learn simultaneously from two teachers: a clean teacher, who imparts knowledge on achieving high standard accuracy, and a robust teacher, who passes on experience in adversarial robustness. To address imbalances between the teachers\u0026rsquo; \u0026lsquo;teaching styles\u0026rsquo; and the student\u0026rsquo;s \u0026lsquo;learning pace,\u0026rsquo; the framework innovatively designs two dynamic balancers: an entropy-based balancer to unify the \u0026lsquo;knowledge intensity\u0026rsquo; of the teachers, and a normalized loss balancer to coordinate the student\u0026rsquo;s \u0026lsquo;learning progress.\u0026rsquo; Image source: (Chen et al., 2023)\nThrough this intelligent collaborative teaching, the student model can effectively assimilate the strengths of both, achieving a better balance between accuracy and robustness.\nAnother perspective on solving the trade-off comes from Liu et al., 2023, which approaches the problem from the angle of model complexity. Their work finds that a key intrinsic metric, $\\Gamma_{ce}$ (measuring the confidence contrast between \u0026ldquo;mastered\u0026rdquo; and \u0026ldquo;unmastered\u0026rdquo; samples), has a phased relationship with the model\u0026rsquo;s generalization gap: a positive correlation in the early stages of training and a negative one in the later stages. Based on this observation, they designed a \u0026ldquo;phased\u0026rdquo; training strategy: in the early phase, regularization is used to decrease $\\Gamma_{ce}$ to build a good generalization foundation, while in the later phase, $\\Gamma_{ce}$ is increased to \u0026ldquo;reclaim\u0026rdquo; lost standard accuracy. This dynamic adjustment allows the model to focus on the most appropriate optimization target at different training stages.\nFurthermore, Zhao et al., 2023 formalize the trade-off as a principled constrained optimization problem and solve it using the Lagrangian dual method. 
The core idea is to strictly constrain the embeddings generated by the fine-tuned model on clean data to remain close to those of the original pre-trained model, while simultaneously fine-tuning for adversarial robustness.\nThe core concept and optimization framework of LORE. As shown on the right, the optimization process of LORE is constrained by a \u0026lsquo;safe zone\u0026rsquo; centered around the original model\u0026rsquo;s embedding. The fine-tuned model\u0026rsquo;s embedding on clean inputs must remain within this region. Meanwhile, the training objective is to pull the embedding of the attacked input back to the original model\u0026rsquo;s anchor point. Image source: (Zhao et al., 2023)\nLORE\u0026rsquo;s improvements in training stability and performance trade-off. This constrained mechanism yields significant benefits. LORE (blue dashed line) completely avoids the catastrophic collapse of standard accuracy seen in traditional adversarial fine-tuning (solid lines) during the initial training phase. LORE (orange starred line) also achieves a better Pareto frontier between robust accuracy and standard accuracy compared to naive regularization methods (blue dotted line). Image source: (Zhao et al., 2023)\nBy transforming the trade-off from a simple weighted sum into a principled, constrained optimization problem, LORE provides a powerful and elegant framework for maximizing the preservation of a model\u0026rsquo;s original knowledge and performance while enhancing its robustness.\nData-Centric Defenses Beyond directly optimizing the model to counter attacks through adversarial training, another powerful and complementary defense paradigm is data-centric. 
The core idea is that instead of merely forcing the model to learn in a difficult environment, we can directly provide it with higher-quality, more diverse, and more challenging training data, compelling it to learn more generalizable features.\nIPMix, the method of Lee et al., 2022, goes beyond single-level transformations and fuses image-level, patch-level, and pixel-level augmentations within a parallel, chain-mixed framework.\nA visual comparison of common data augmentation methods, from simple CutOut and Mixup to the more complex PixMix. Image source: (Lee et al., 2022)\nThe Chain-Mixed Framework of IPMix. At the heart of IPMix is a parallel processing workflow. The original image is routed into multiple parallel \u0026lsquo;augmentation chains.\u0026rsquo; Some chains apply traditional image-level transformations, while others mix the image with external synthetic images at the pixel or patch level. These variously \u0026lsquo;prepared\u0026rsquo; images are finally fused and mixed with the original image via a skip connection to ensure core semantics are preserved. Image source: (Lee et al., 2022)\nThe essence of this method lies in the granularity of its information fusion. It moves beyond simple arithmetic blending to more sophisticated mixing mechanisms, creating new pixel combinations that enrich the training data distribution.\nThe fine-grained mixing operations of IPMix. IPMix uses random masks at the pixel and element levels to \u0026lsquo;weave\u0026rsquo; the information from two images together. It can also perform targeted mixing in specific regions (such as \u0026lsquo;scar-like\u0026rsquo; or \u0026lsquo;block-like\u0026rsquo; patterns) to simulate various local occlusions and artifacts. 
Image source: (Lee et al., 2022)\nThe effect of this complex augmentation strategy is significant, as it reshapes the model\u0026rsquo;s internal feature space.\nImproved feature space and attention mechanism after IPMix training. The t-SNE visualization in the upper part shows that after IPMix training, the feature clusters for different classes become more compact intra-class and more separated inter-class. Meanwhile, the Grad-CAM heatmaps in the lower part reveal that the baseline model\u0026rsquo;s attention is entirely captured by background textures, whereas the IPMix-trained model can precisely and completely cover the subject (a dragonfly) itself. Image source: (Lee et al., 2022)\nIn contrast to IPMix\u0026rsquo;s pursuit of \u0026ldquo;all-encompassing\u0026rdquo; complexity, the work by Zhang et al., 2023 approaches the problem from the perspectives of \u0026ldquo;efficiency\u0026rdquo; and \u0026ldquo;universality,\u0026rdquo; proposing a plug-and-play UAA framework. Its core idea is to decouple the expensive process of generating adversarial perturbations from the model training pipeline.\nThe two-stage decoupled framework of UAA. In the first stage (top), a generator G capable of producing \u0026lsquo;universal\u0026rsquo; adversarial perturbations is trained offline. To prevent G from overfitting, the target classifier\u0026rsquo;s parameters are randomly re-initialized in each epoch. In the second stage (bottom), this pre-trained, fixed G serves as an efficient data augmentation tool, seamlessly integrated into the main model\u0026rsquo;s training process. Image source: (Zhang et al., 2023)\nHowever, does simply increasing the quantity and complexity of data solve the problem of insufficient robustness?\nResearch from Wang et al., 2023 emphasizes the \u0026ldquo;quality\u0026rdquo; of augmented data. 
Their results surprisingly show that using more advanced generative models (like EDM) to create synthetic data is far more effective at improving adversarial robustness than many traditional, distortion-based augmentation methods. A key finding of this study concerns \u0026ldquo;robust overfitting,\u0026rdquo; the phenomenon where test robustness starts to decrease after peaking during training. Combating it typically requires \u0026ldquo;early stopping.\u0026rdquo; However, with large-scale, high-quality data from EDM, robust overfitting is virtually eliminated, allowing the model\u0026rsquo;s test robustness to improve steadily throughout training. This means one can train for longer to achieve better performance without carefully searching for the optimal stopping point. The paper also investigated the impact of generated data quality (measured by FID score) on model performance. The results indicate that the lower the FID score of the generated data (i.e., the higher the quality), the higher the standard and robust accuracy of the final trained model.\nOn the topic of how data augmentation should be done, different researchers hold almost opposing views. 
Research from Bai et al., 2023 uncovered a startlingly counter-intuitive phenomenon: in the specific paradigm of adversarial training for Vision Transformers (ViTs), widely-used strong data augmentations (like MixUp, CutMix) are not only unhelpful but actually detrimental.\nThis finding fundamentally challenges the conventional wisdom that \u0026ldquo;stronger augmentation leads to better robustness.\u0026rdquo; The root cause lies in a conflict between two mechanisms for \u0026ldquo;making the task harder\u0026rdquo;—strong data augmentation and adversarial perturbation—which leads to \u0026ldquo;overcorrection\u0026rdquo; or a \u0026ldquo;loss of training focus,\u0026rdquo; thereby undermining the learning of robustness.\nFirst, adversarial training is itself a very strong form of regularization. It requires the model not only to classify a single data point correctly but also to maintain a stable prediction within a high-dimensional neighborhood around that point (e.g., a small ball of radius ε). This is already a highly challenging optimization target. Strong data augmentation, such as MixUp, which creates semantically ambiguous mixed images that do not exist in the real world, is also a powerful regularization technique. When these two potent regularization methods are combined, the training task becomes exceptionally difficult. It\u0026rsquo;s like asking a student to solve an extremely complex math problem written in two mixed languages simultaneously; the student is likely to become confused by having to deal with both difficulties at once and may end up mastering neither.\nSecond, adversarial training is a min-max process, with its core being the inner \u0026ldquo;maximization\u0026rdquo; step—finding the \u0026ldquo;strongest\u0026rdquo; adversarial example that maximizes the loss for the current model at each training step. This is like finding a high-quality \u0026ldquo;sparring partner\u0026rdquo; for the model. 
However, strong data augmentation undermines the quality of this \u0026ldquo;sparring partner.\u0026rdquo; The starting point becomes ambiguous: When training begins with an image created by mixing two pictures with MixUp, this starting point has already deviated from the clean, real data distribution. The gradient signal obtained from such a \u0026ldquo;blurry\u0026rdquo; starting point to find the most confusing perturbation direction is far less clear and effective than that from a clean sample. The effectiveness of the attack is reduced: The paper\u0026rsquo;s experiments show that when using the \u0026ldquo;light recipe\u0026rdquo; (without strong data augmentation), the attacks during training become more effective. This means the model faces \u0026ldquo;genuine\u0026rdquo; strong attacks in every round. Conversely, with the \u0026ldquo;standard recipe\u0026rdquo; using strong data augmentation, the attack effectiveness is significantly diminished. As the attacks in training are weakened, the model naturally incurs a lower loss on these \u0026ldquo;watered-down\u0026rdquo; adversarial examples, leading to an apparently high \u0026ldquo;robust accuracy\u0026rdquo; on the validation set. However, this is a false, overestimated robustness. When the model, after training, is confronted with a truly powerful standard attack (like AutoAttack) generated from a clean sample, this deceptive defense crumbles.\nThis contradiction points to a core dilemma in data augmentation: how can we harness the benefits of diversity from high-intensity augmentations without introducing destructive noise to the model?\nTo address this, the work by Park et al., 2023 shifts the focus from how to transform images to how to assign appropriate \u0026ldquo;supervisory signals\u0026rdquo; to the transformed images. 
Instead of assigning a rigid, 100% confidence hard label to all augmented samples—regardless of their distortion level—it introduces a more intelligent and adaptive labeling paradigm. The confidence of the label assigned to augmented data should correlate with the degree of \u0026ldquo;distortion.\u0026rdquo; Augmented samples closer to the original data should have higher label confidence, while those that are further away and more severely distorted should have lower confidence.\nThe core idea and workflow of AutoLabel. As shown in Figure (a), the process is intuitive. For an augmented image that is similar to the original and only slightly distorted (top), the learned label confidence remains high (0.91). But for a severely distorted image that is difficult even for humans to recognize (bottom), the learned label confidence drops significantly (0.52). Image source: (Park et al., 2023)\nThis adjustment is achieved through a feedback mechanism. AutoLabel defines a \u0026ldquo;transformation distance\u0026rdquo; based on the augmentation parameters, which quantifies the degree of image distortion. As shown in Figure (b), the lower the mixup ratio and the longer the augmentation chain, the greater the transformation distance. AutoLabel uses this distance to dynamically update the label confidences for training samples in different \u0026ldquo;distance bins\u0026rdquo; by evaluating the model\u0026rsquo;s calibration error (ECE) on a clean validation set after each training epoch.\nCertified Defenses The adversarial training and data augmentation methods we\u0026rsquo;ve discussed so far fall under the category of Empirical Defenses. These methods aim to improve a model\u0026rsquo;s performance against known types of attacks by introducing diverse or adversarial data during training. Their effectiveness is typically empirically verified by measuring test accuracy against specific attack algorithms like PGD or AutoAttack. 
However, empirical defenses have a fundamental limitation: they cannot guarantee protection against new, more powerful attacks that may emerge in the future. A model that appears robust on current benchmarks could still be compromised by an adaptive attack designed specifically for it.\nTo overcome this uncertainty, the research community has increasingly turned to a more rigorous and challenging goal: Certified Defenses.\nThe objective of certified defenses, as articulated in works like the survey by Zhou et al., 2022, is not simply to improve performance against specific attacks, but to provide a mathematical, provable guarantee of robustness for a model\u0026rsquo;s predictions.\nThis guarantee is attack-agnostic. It does not depend on the specific algorithm used by an attacker to generate an adversarial example. Instead, it defines a perturbation set $\\mathcal{S}$ around an original input $\\boldsymbol{x}$ (e.g., an $L_\\infty$-norm ball of radius $\\epsilon$) and mathematically proves that for all possible inputs $\\boldsymbol{x}' \\in \\mathcal{S}$, the model\u0026rsquo;s prediction will remain unchanged.\nFormally, a certified defense aims to verify the truth of the following statement: $$ \\forall \\boldsymbol{x}' \\in \\mathcal{S}, \\quad \\arg\\max_j f(\\boldsymbol{x}')_j = \\arg\\max_j f(\\boldsymbol{x})_j $$\nThis paradigm shift offers the highest level of security assurance for a model. It is no longer an endless \u0026ldquo;arms race\u0026rdquo; between attacks and defenses, but a means of establishing an absolute safety boundary for the model\u0026rsquo;s behavior. 
Once certified, we can be confident that any perturbation within this boundary, no matter how cleverly designed, cannot alter the model\u0026rsquo;s prediction.\nThe technical means to achieve such guarantees fall into two main categories: complete verification methods, such as Mixed-Integer Linear Programming (MILP), which provide exact answers but are computationally expensive and difficult to scale to large networks; and incomplete verification methods, like convex relaxation and randomized smoothing, which provide a lower bound on robustness. While potentially conservative due to approximation, they are more computationally efficient and practical.\nAmong incomplete verification methods, Randomized Smoothing (RS) has become one of the most practical and widely studied techniques due to its scalability and model-agnostic nature. Its core idea is to construct a smoothed classifier by injecting random noise (typically Gaussian) at the input and using a \u0026ldquo;majority vote\u0026rdquo; principle. This smoothed classifier can provide a probabilistic, yet provable, robustness radius for its predictions. However, the basic RS framework faces several challenges, and recent research has focused on deepening its theoretical understanding, overcoming its intrinsic limitations, and expanding its applications.\nIt has long been established in practice that for randomized smoothing to be effective, the underlying base classifier must be trained on noise-augmented data. However, this practice lacked solid theoretical support for a long time.\nThe work by Li et al., 2023 provides a deep analysis of this phenomenon, revealing that noise-augmented training is not universally beneficial. The study introduces the concept of \u0026ldquo;interference distance\u0026rdquo; to describe the degree of separation between decision regions of the same class in the data distribution. 
A large interference distance means the decision regions are sparsely distributed and isolated from each other, while a small interference distance means they are densely packed and close together.\nThe effect of interference distance on noise-augmented training. When the interference distance is large (top row), both noise-augmented training (second column) and the subsequent smoothing operation (third column) cause the decision regions (orange) to continuously \u0026lsquo;shrink,\u0026rsquo; leading to significant performance degradation. In contrast, when the interference distance is small (bottom row), the training noise helps to \u0026lsquo;merge\u0026rsquo; adjacent decision regions into a larger, more stable whole, resulting in better performance after smoothing. Image source: (Li et al., 2023)\nThis research theoretically proves that the effectiveness of noise-augmented training is closely related to this \u0026ldquo;interference distance\u0026rdquo;: it can be detrimental for distributions with a large interference distance, but essential for those with a small one. This not only explains why the method works well on real-world datasets like CIFAR-10 but also, more importantly, indicates that the noise level for training and the noise level for certification do not need to be the same. Tuning them as independent hyperparameters can lead to superior performance.\nOne of the most critical weaknesses of randomized smoothing is the Curse of Dimensionality. As the input dimension $d$ increases, the certified radius rapidly decays at a rate of $1/\\sqrt{d}$, severely limiting its application to high-dimensional data like high-resolution images.\nTo overcome this challenge, Kumar et al., 2023 proposed Dual Randomized Smoothing (DRS). The core idea is to decompose a high-dimensional smoothing problem into multiple parallel low-dimensional problems.\nThe core idea and theoretical advantage of DRS. 
As shown in (a), DRS partitions a high-dimensional input into two lower-dimensional subspaces and applies randomized smoothing to them independently. The theoretical analysis in (b) shows that the certified radius upper bound for DRS (solid green and red lines) is significantly higher than that of traditional RS (dashed blue line), and this advantage becomes more pronounced as the dimensionality increases. Image source: (Kumar et al., 2023)\nThrough this \u0026ldquo;divide and conquer\u0026rdquo; strategy, the decay rate of the DRS robustness radius upper bound is improved to $(1/\\sqrt{m} + 1/\\sqrt{n})$ (where $m+n=d$), effectively mitigating the curse of dimensionality.\nTo translate this theoretical advantage into a practical algorithm, the study proposes a concrete implementation workflow. The process begins by partitioning an input image $\\boldsymbol{x} \\in \\mathbb{R}^d$ into two lower-dimensional sub-images, $\\boldsymbol{x}^l \\in \\mathbb{R}^m$ and $\\boldsymbol{x}^r \\in \\mathbb{R}^n$, using two non-overlapping downsampling operators, $\\pi_l$ and $\\pi_r$. Subsequently, randomized smoothing is performed on these two sub-images in parallel.\nThe implementation workflow of Dual Randomized Smoothing (DRS). This flowchart details the steps of DRS, including downsampling using 2x2 pixel indexing, parallel noise injection, interpolation, and classification of sub-images, estimating probability lower bounds via statistical methods, and finally aggregating the results. Image source: (Kumar et al., 2023)\nIn the workflow shown in Figure 20, two parallel smoothed classifiers, $g_l$ and $g_r$, are constructed. For the left path, the probability of the smoothed classifier $g_l$ predicting class $c$, denoted as $p_c(g_l, \\boldsymbol{x}^l)$, is estimated by extensive sampling of the base classifier $f^l$ on the sub-image with added Gaussian noise $\\boldsymbol{\\epsilon}^l \\sim \\mathcal{N}(0, \\sigma^2 \\boldsymbol{I}_m)$. 
A key engineering detail is that since the downsampled sub-images are smaller, they must be enlarged back to the original size via Interpolation to be processed by the base classifier, which was trained on full-size images.\nDuring the certification phase, to obtain a provable guarantee, the algorithm does not directly use the prediction frequency as the probability. Instead, it utilizes statistical tools (like the Clopper-Pearson interval) to compute a confidence lower bound for the probabilities of the top-1 class $c_A$ and top-2 class $c_B$ for each sub-classifier, $\\underline{p_A^l}$ and $\\underline{p_A^r}$. Finally, the prediction of the DRS classifier $g_{DRS}$ is determined by aggregating these two independent probability distributions, with the total probability lower bound for the top class $c_A$ being $\\underline{p_A} = \\underline{p_A^l} + \\underline{p_A^r}$. This aggregated probability lower bound $\\underline{p_A}$ is then used to calculate the certified radius $R$ for the entire system in the original $d$-dimensional space: $$ R = \\sigma \\Phi^{-1}(\\underline{p_A}) $$ where $\\Phi^{-1}$ is the inverse cumulative distribution function of the standard normal distribution. In this manner, DRS provides a computationally feasible and theoretically superior solution for the certified robustness of high-dimensional inputs.\nAnother issue with traditional RS is its often rigid trade-off between robustness and accuracy. To defend against more realistic attacks that can only perturb a subset of entities, Duan et al., 2023 proposed the Hierarchical Randomized Smoothing framework. Its core idea is \u0026ldquo;targeted noising.\u0026rdquo;\nThe workflow of Hierarchical Randomized Smoothing. Instead of adding noise to all entities, this method proceeds in two steps: first, it randomly selects a subset of entities (nodes with dashed borders); then, it adds noise only to the selected entities. 
Image source: (Duan et al., 2023)\nTo achieve efficient certification, the method \u0026ldquo;appends\u0026rdquo; a selection indicator to the original data as an extra channel or feature dimension. This technique of encoding meta-information as part of the data greatly simplifies the mathematical proof, allowing the framework to flexibly integrate almost any existing smoothing distribution.\nThe advantage of hierarchical smoothing in the robustness-accuracy trade-off. Hierarchical smoothing (blue stars) significantly outperforms traditional additive noise (orange circles) and random erasing (green circles) on the Pareto frontier of robustness versus accuracy, offering a series of superior trade-off solutions. Image source: (Duan et al., 2023)\nBasic RS is primarily designed for classification tasks. Extending it to more complex outputs, such as semantic segmentation, is a crucial frontier of research.\nChiang et al., 2023 address the issue of high abstain rates in semantic segmentation certification with Adaptive Hierarchical Certification. When a model\u0026rsquo;s prediction for a pixel wavers between semantically related classes (e.g., \u0026ldquo;car\u0026rdquo; and \u0026ldquo;truck\u0026rdquo;), traditional methods would \u0026ldquo;abstain.\u0026rdquo; This method, however, adaptively relaxes the requirement, certifying the pixel to a coarser parent class (e.g., \u0026ldquo;vehicle\u0026rdquo;).\nVisualization of results from adaptive hierarchical certification. The baseline method, SEGCERTIFY (middle), has a large number of gray \u0026lsquo;abstained\u0026rsquo; pixels. In contrast, ADAPTIVECERTIFY (right) successfully certifies many of these pixels to coarser levels (e.g., certifying some road pixels as \u0026lsquo;flat ground\u0026rsquo;), thus providing more meaningful certified information. 
Image source: (Chiang et al., 2023)\nSimilarly, E et al., 2023 proposed Localized Randomized Smoothing for multi-output tasks like image segmentation and node classification. The core idea is that to accurately segment a part of an image, one only needs to protect the information in that part and its immediate vicinity, while stronger \u0026ldquo;noise\u0026rdquo; can be applied to distant, less relevant areas. This approach dramatically improves the model\u0026rsquo;s defense against global perturbations while maintaining high accuracy in critical regions.\nThe workflow of Localized Randomized Smoothing. To certify the segmentation result of the top-right grid (containing the parrot\u0026rsquo;s head), the system generates a series of noisy images where the noise applied to the top-right is significantly weaker than in other areas. By repeating this process for each grid cell and stitching the results, a final segmentation map that is both highly accurate and robust is obtained. Image source: (E et al., 2023)\nAlthough randomized smoothing is powerful and scalable, the robustness guarantee it provides is probabilistic, meaning there is always a tiny failure probability (e.g., $\\alpha=0.001$). For safety-critical applications like autonomous driving, this probabilistic guarantee may still be insufficient. Therefore, another research frontier is dedicated to developing defense methods that can provide deterministic guarantees, ensuring that the model\u0026rsquo;s prediction will absolutely not change within a specified perturbation range.\nThe main challenge for such methods is that the perturbation sets for many real-world disturbances (like geometric transformations) are highly non-convex, making verification extremely difficult.\nYang et al., 2023 were the first to successfully integrate deterministic geometric robustness certification into the training process, proposing the Certified Geometric Training (CGT) framework. 
The core technical contribution is a Fast Geometric Verifier (FGV), which is thousands or even tens of thousands of times faster than existing tools. This leap in speed makes it feasible to perform geometric robustness verification in every iteration of training. By optimizing for robustness over randomly sampled small local intervals during training, CGT enables the model to generalize to global robustness across the entire target transformation range.\nThe verifiable safety boundary of CGT in an autonomous driving scenario. CGT not only enables the autonomous driving model to accurately predict the steering angle (blue prediction line closely follows the green ground truth line) but, more importantly, provides a strict verifiable safety boundary (red area). This boundary guarantees that even if the input image undergoes any rotation within ±2°, the model\u0026rsquo;s steering prediction will never go outside this region, providing a deterministic safety promise for the system. Image source: (Yang et al., 2023)\nSimilarly, in the 3D point cloud domain, Jia et al., 2023 proposed the first framework that can provide a deterministic $L_0$-norm robustness guarantee for point cloud classifiers. The core idea is to partition the input point cloud into multiple disjoint sub-clouds using a hash function, and then perform a majority vote on the independent predictions for each sub-cloud.\nOptimizing sub-cloud classification in PointCert using a Point Cloud Completion Network (PCN). To address the difficulty of classifying sparse sub-clouds with standard classifiers, a practical variant of PointCert adds a Point Cloud Completion Network (PCN) before the classifier. It first \u0026lsquo;completes\u0026rsquo; the sparse sub-cloud into a full shape before classification, thereby improving the accuracy of the voting process. 
Image source: (Jia et al., 2023)\nBecause the hash function assigns each point to exactly one sub-cloud, adding or deleting a point changes at most one sub-cloud, and modifying a point changes at most two (the one it leaves and the one it enters). The final prediction therefore remains unchanged as long as the \u0026ldquo;winning margin\u0026rdquo; in the vote is large enough. Based on this, PointCert derives a tight formula for the certifiable perturbation size, which precisely quantifies the maximum number of points that can be added, deleted, or modified while guaranteeing the prediction remains constant.\nAn emerging and highly promising direction is to shift the focus of defense from traditional discriminative models to generative models. Research by Wang et al., 2023 demonstrates that diffusion models are not only powerful generators but can also be used as certifiably robust classifiers.\nThe classification principle is as follows: given an input image (which may be perturbed), we attempt to denoise it using the diffusion model, conditioned on every possible class. The class that allows the image to be reconstructed with the lowest reconstruction error is considered the model\u0026rsquo;s prediction.\nThe robustness of this method is rooted in how well the diffusion model learns the data manifold. An adversarial perturbation slightly pushes a data point off its original, true manifold. When denoising is conditioned on the wrong class, the model tries to pull this point towards a completely different manifold, resulting in a high reconstruction error. The study theoretically proves that the diffusion classifier has a small Lipschitz constant, providing a mathematical basis for its inherent smoothness and robustness.\nEfficient Defenses Although defense techniques like Adversarial Training (AT) are effective, their immense computational cost has been a major barrier to their widespread adoption. Therefore, a crucial research direction is to significantly improve the efficiency of defenses while maintaining strong robustness. 
Adversarial Purification offers a promising path by decoupling the defense task from model training, \u0026ldquo;cleaning\u0026rdquo; the input at inference time. However, purification methods based on diffusion models, while effective, are too slow for real-world, real-time applications due to their iterative denoising process.\nThe work by Lei et al., 2024 on OSCP was specifically designed to resolve this core conflict. The study proposes a framework that can complete purification in a single step, achieving a remarkable balance between speed and effectiveness.\nThe speed and effectiveness advantages of the OSCP framework. The introductory figure of the paper uses a vivid analogy to illustrate its core contribution. Traditional purification methods (top) are time-consuming and inefficient, and their results are often suboptimal. In contrast, the OSCP framework (bottom) is extremely efficient and can effectively remove adversarial noise while better preserving the original details and quality of the image. Image source: (Lei et al., 2024)\nThe implementation of the OSCP framework relies on two core innovations: GAND, a training method designed for \u0026ldquo;one-step purification,\u0026rdquo; and CAP, an inference process that ensures the purified image does not lose its semantic integrity.\nThe technical flowchart of the OSCP framework (GAND training and CAP inference). The GAND training stage (a) teaches the model to handle adversarial noise, while the CAP inference stage (b) uses an edge map as guidance to enforce structural integrity during denoising. Image source: (Lei et al., 2024)\nThe introduction of the CAP process fundamentally solves a fatal flaw of traditional purification methods: the loss of semantic information. 
Many purification methods are \u0026ldquo;blind\u0026rdquo; in their noise removal, potentially erasing complex details that are part of the original image along with the noise.\nA comparison between the CAP method and traditional purification methods. A typical purification method (blue box), when processing an attacked image of a turtle, might remove the adversarial perturbation but also erase the turtle\u0026rsquo;s head. In contrast, the CAP method (red box), guided by the edge map, performs a directed purification, perfectly preserving the turtle\u0026rsquo;s head while removing the noise. Image source: (Lei et al., 2024)\nThis ability to preserve semantic information ultimately translates into the ultra-high quality of the purified images.\nA quality comparison of purified images. Whether dealing with natural textures or artificial details, the traditional method DiffPure (column c) leads to significant blurring and loss of detail. In contrast, the images purified by OSCP (column d) are clear and sharp, far surpassing traditional methods in visual quality and being nearly indistinguishable from the original clean images (column a). Image source: (Lei et al., 2024)\nRobustness in Modern Models The robustness of deep learning models is not a one-size-fits-all problem; it is closely tied to the model\u0026rsquo;s internal architecture. Different architectures, such as Vision Transformers (ViT), Spiking Neural Networks (SNNs), or prototype-based networks, exhibit unique vulnerabilities and defense potentials due to their distinct inductive biases and information processing mechanisms.\nThe rise of Vision Transformers (ViT) has revolutionized computer vision, but their robustness has also become a central issue. Lacking the strong inductive biases of CNNs like \u0026ldquo;convolution\u0026rdquo; and \u0026ldquo;locality,\u0026rdquo; ViTs heavily rely on large-scale data and strong data augmentation in standard training. 
Researchers have found that directly transferring this successful paradigm to adversarial training does not yield optimal results.\nThe work by Bai et al., 2023 proposed a groundbreaking \u0026ldquo;light recipe\u0026rdquo; for this challenge. The study discovered that for adversarial training, strong data augmentations (like MixUp, CutMix) are not only unhelpful but are in fact harmful. The root cause is the conflict between the two strong regularization methods—adversarial training and strong data augmentation—which makes the training task exceptionally difficult and interferes with the generation of effective adversarial examples. Therefore, the core of this recipe is to remove all strong data augmentations, supplemented by \u0026ldquo;ε-warmup\u0026rdquo; (gradually increasing the adversarial perturbation strength in the early stages of training) and a larger \u0026ldquo;weight decay.\u0026rdquo;\nThe state-of-the-art robustness of the light recipe on ImageNet. A ViT model trained with this \u0026rsquo;light recipe\u0026rsquo; achieved state-of-the-art performance in adversarial robustness on ImageNet, significantly outperforming the best-performing CNN models at the time. Image source: (Bai et al., 2023)\nWith the popularization of large pre-trained models, parameter-efficient fine-tuning techniques like Prompt Tuning have become a new hotspot. However, ensuring robustness within the prompt tuning paradigm presents new challenges. Research from Fu et al., 2023 found that naively applying adversarial training to prompt tuning leads to severe gradient obfuscation, creating a false sense of security. This is because most model parameters are frozen during prompt tuning, causing the input gradients to become \u0026ldquo;shattered\u0026rdquo; and rendering traditional gradient-based attacks ineffective.\nTo solve this problem, the study proposed the ADAPT framework. 
Its core is an adaptive adversarial attack that considers both the learnable \u0026ldquo;prompt\u0026rdquo; parameters $\\boldsymbol{\\theta}_p$ and the input image $\\boldsymbol{x}$ when generating attacks, rather than just the input. The corresponding adaptive adversarial prompt training then uses this stronger attack to optimize the prompt parameters. Experiments show that the ADAPT framework achieves adversarial robustness comparable to full fine-tuning while tuning only about 1% of the parameters.\nSelf-supervised learning has also proven to be an effective way to enhance the robustness of ViTs. The work by Gao et al., 2023, inspired by the Information Bottleneck theory, proposed MIMIR, a self-supervised adversarial pre-training method. The core idea is to have the model reconstruct the complete, clean original image from an input that has been doubly corrupted (adversarial perturbation + random masking).\nThe pre-training process of MIMIR. The training objective of MIMIR consists of two parts. The first is to minimize the reconstruction loss between the reconstructed image and the original clean image. The second is to minimize the mutual information between the corrupted input and the latent features extracted by the encoder. This compels the encoder to discard information related to the perturbation. Image source: (Gao et al., 2023)\nAn encoder trained in this manner extracts features that are far less sensitive to adversarial perturbations and learns a smoother loss landscape.\nSpiking Neural Networks (SNNs), known for their event-driven nature and spatio-temporal information processing, have garnered attention for their energy efficiency and biological plausibility. The work by Zhang et al., 2023 offers a new perspective on SNN robustness.
The research innovatively treats an SNN as a Temporal Self-Ensemble model, viewing the network states over $T$ timesteps as an ensemble of $T$ independent sub-networks.\nBased on this perspective, the paper identifies two key challenges to SNN robustness: the vulnerability of individual temporal sub-networks and the propagation of vulnerability across timesteps. To address this, the study proposes the Robust Temporal self-Ensemble (RTE) training framework. RTE aims to simultaneously enhance the robustness of each temporal sub-network and suppress the propagation of adversarial perturbations across timesteps through a unified loss function. Experiments demonstrate that RTE achieves a better robustness-accuracy trade-off than existing SNN adversarial defense methods on multiple benchmarks.\nPrototype-Based Networks are considered an important direction in explainable AI due to their case-based reasoning. However, research by Saralajew et al., 2025 points out that many existing deep prototype networks may produce misleading explanations that are inconsistent with the model\u0026rsquo;s actual decision-making process due to their unconstrained weights.\nUnfaithful explanations caused by unconstrained weights. As shown, the PIPNet model incorrectly classifies the input \u0026lsquo;Fish Crow\u0026rsquo; as a \u0026lsquo;Common Raven.\u0026rsquo; Although the similarity score of the input image to the \u0026lsquo;Fish Crow\u0026rsquo; prototype is higher, the model assigns a disproportionately large weight to the \u0026lsquo;Common Raven\u0026rsquo; prototype, allowing weaker evidence to dominate the final decision. Image source: (Saralajew et al., 2025)\nTo address this issue, the study extends the \u0026ldquo;Classification-by-Components\u0026rdquo; (CBC) model. The new CBC architecture can be seen as a deep Radial Basis Function (RBF) network with well-defined interpretability constraints. 
It constrains the weights through probabilistic modeling and introduces negative reasoning, considering not only \u0026ldquo;what it is\u0026rdquo; but also \u0026ldquo;what it is not.\u0026rdquo;\nA CBC model learning concepts through positive and negative reasoning. To identify a \u0026lsquo;Vermilion Flycatcher,\u0026rsquo; the CBC model not only learns an abstract concept of a \u0026lsquo;slender beak\u0026rsquo; through positive reasoning but also confirms the absence of a \u0026lsquo;stout, conical beak\u0026rsquo; through negative reasoning, thereby distinguishing it from the similarly colored \u0026lsquo;Northern Cardinal.\u0026rsquo; Image source: (Saralajew et al., 2025)\nThis reasoning approach, based on concept comparison, makes the model\u0026rsquo;s decision-making process more comprehensive and robust, achieving a unification of performance, interpretability, and robustness.\nRobustness in Cost-sensitive Cases\nThe challenge of adversarial robustness is not confined to generic image classification tasks. When deep learning models are deployed in high-stakes, specific domains such as autonomous driving and medical imaging, robustness issues take on new, more complex forms. The input data (e.g., 3D point clouds, multi-modal sensor streams) and task objectives (e.g., 3D object detection, surface reconstruction) in these fields place unprecedented demands on model reliability.\nIn autonomous driving systems, precise 3D perception of the surrounding environment is the cornerstone of safe navigation. However, both LiDAR-based and camera-based 3D perception models face severe threats from adversarial attacks.\nThe work by Zhang et al., 2023 provides the first systematic robustness evaluation of mainstream LiDAR-based 3D object detectors. The study found that imperceptible perturbations to point clouds can cause a sharp decline in the performance of top detection models.
The research systematically analyzes three attack modalities: Point Perturbation, Point Detachment, and Point Attachment. A key finding is that a model\u0026rsquo;s vulnerability is closely related to its feature representation method: Voxel-based models, which discretize space and to some extent disrupt the continuity of attack gradients, are generally more robust than Point-based models that process raw points directly.\nThe accuracy-robustness trade-off for different 3D detector architectures. Voxel-based models (red) generally cluster in the top-right corner, exhibiting higher accuracy and robustness. In contrast, point-based models (green) are generally less robust. PointPillar (the green triangle at the bottom left) is an extreme outlier, being exceptionally sensitive to point perturbations due to its unique feature encoding method. Image source: (Zhang et al., 2023)\nTo enhance the performance of camera-only 3D object detection, Chen et al., 2023 proposed a pioneering knowledge distillation framework. The core idea is to use a powerful, LiDAR-based \u0026ldquo;teacher\u0026rdquo; model and a 2D instance segmentation \u0026ldquo;teacher\u0026rdquo; model to jointly guide the training of a \u0026ldquo;student\u0026rdquo; model that uses only multi-camera images. This approach aims to boost the student\u0026rsquo;s performance without increasing its inference-time computational complexity.\nThe cross-modality, cross-task, and cross-stage knowledge distillation framework of X³KD. X³KD is a highly synergistic training system. The LiDAR teacher provides Cross-modal Feature Distillation, Adversarial Training, and Output Distillation to the student in the Bird\u0026rsquo;s-Eye View space. Meanwhile, the 2D segmentation teacher provides Cross-task Instance Segmentation Distillation in the Perspective View space. Through this comprehensive guidance, the student\u0026rsquo;s 3D perception capabilities are significantly improved. 
Image source: (Chen et al., 2023)\nA qualitative comparison between X³KD and a baseline model. The bird\u0026rsquo;s-eye view comparison shows that the baseline model, BEVDepth (left), produces chaotic and incorrectly oriented detections in complex scenes. In contrast, the model trained with X³KD (middle) yields clean and accurate detections that are highly consistent with the ground truth (GT, right). Image source: (Chen et al., 2023)\nBesides detection, another critical 3D perception task is surface reconstruction. The work by Tang et al., 2023 introduces SurfR, an efficient and precise method for reconstructing object surfaces from unorganized, noisy point clouds. The core innovations of SurfR are parallel multi-scale feature extraction and a cross-scale attention mechanism.\nThe multi-scale feature extraction and cross-scale attention mechanism of SurfR. SurfR first partitions the point cloud into grid cells and extracts features in parallel at multiple scales. Then, for a given query point, it samples neighboring features at each scale. Finally, a Transformer encoder processes the features from different scales, using self-attention to fuse information across scales. Image source: (Tang et al., 2023)\nThe reconstruction performance of SurfR under different noise levels. SurfR can generate high-quality surfaces and preserve rich geometric details when processing point clouds with varying levels of noise and sparsity. Its performance is superior or comparable to existing methods on multiple benchmarks, while achieving an order-of-magnitude speedup. Image source: (Tang et al., 2023)\nFinally, we must recognize that in systems like autonomous driving, threats come not only from attacks on individual sensors but also from multi-modal semantic attacks on the entire system. Mao et al., 2023 proposed COMMIT, the first robustness certification framework for multi-sensor fusion systems.
This framework certifies that an autonomous vehicle\u0026rsquo;s perception module remains stable under common real-world semantic attacks such as rotation and translation, providing mathematical safety guarantees.\nThe core idea of the COMMIT framework. A standard camera-LiDAR fusion model (top) may miss a vehicle entirely when subjected to a minor rotation attack. The COMMIT framework (bottom), through a randomized smoothing strategy tailored for multi-sensor fusion, can provide a provable robustness guarantee, for instance, ensuring that the Intersection over Union (IoU) of vehicle detection remains above 0.5 for any rotation within 10 degrees. Image source: (Mao et al., 2023)\nThe importance of robustness is particularly pronounced in fields like medical imaging analysis, where a model\u0026rsquo;s incorrect judgment could directly endanger a patient\u0026rsquo;s life. Such applications face not only the threat of adversarial attacks but also the more common challenge of domain shift, where a model trained on data from one institution (e.g., Hospital A) may experience a significant performance drop when deployed at another institution (Hospital B) with a slightly different data distribution.\nThe work by Weng et al., 2023 provides the first large-scale experimental analysis of how adversarial robustness behaves in domain generalization scenarios.\nThe core problem of robustness generalization under domain shift. The central question of this research is whether robustness acquired through adversarial training in a \u0026lsquo;source domain\u0026rsquo; (e.g., real photos) can \u0026lsquo;generalize\u0026rsquo; to an unseen, stylistically different \u0026lsquo;target domain\u0026rsquo; (e.g., cartoons). Image source: (Weng et al., 2023)\nThe study finds that both empirical and certified robustness can generalize to a considerable extent to new, unseen data distributions.
A surprising discovery is that the visual similarity between the source and target domains does not correlate well with the level of robustness generalization. The study also extends its experiments to a real-world medical imaging application.\nDomain shift in the CAMELYON17 dataset. Histopathology image slides from five different hospitals exhibit clear distribution shifts due to differences in scanning equipment and staining procedures. The experiments demonstrate that in this real-world medical scenario, adversarial augmentation not only significantly improves the generalization of robustness but also has a minimal impact on the model\u0026rsquo;s accuracy on clean data. Image source: (Weng et al., 2023)\nBesides domain shift, high-risk domains also face the challenge of asymmetric error costs. For example, in medical diagnosis, misclassifying a malignant tumor as benign (a false negative) is far more costly than misclassifying a benign tumor as malignant (a false positive).\nHorváth et al., 2023 address this issue by proposing the first provably cost-sensitive adversarial defense. This method no longer treats all misclassifications as equally costly but allows the user to define a cost matrix $\\boldsymbol{C}$ to encode the severity of different types of misclassifications.\nVisualization of the cost-sensitive certified radius. The core of this method is to define a cost-sensitive certified radius. For a malignant tumor sample (red plus), the certified radius guarantees that any perturbation within this \u0026lsquo;safe zone\u0026rsquo; will not cause it to be misclassified as the high-cost \u0026lsquo;benign\u0026rsquo; class (green minus). The size of this radius depends on the gap between the model\u0026rsquo;s confidence in the correct classification and its confidence in the target incorrect class. 
Image source: (Horváth et al., 2023)\nBy designing a training method (Margin-CS) that specifically optimizes this confidence gap, the research enables models in safety-critical applications to prioritize defending against the most dangerous, highest-cost attacks.\nTang et al., 2024 model the image restoration problem (e.g., denoising, deraining, super-resolution) as an Optimal Transport (OT) problem and introduce the \u0026ldquo;transport residual\u0026rdquo; as a degradation-specific cue that guides the restoration process.\nThe core idea of the RCOT framework. As shown in (a), the RCOT method first analyzes the \u0026lsquo;residuals\u0026rsquo; corresponding to different degradation types and encodes them into a \u0026lsquo;diagnostic report.\u0026rsquo; This embedding is then used as a condition to guide the optimal transport map, so that the restoration is targeted at the specific degradation. The denoising results in (b) demonstrate that RCOT can more effectively preserve and reconstruct fine image structures. Image source: (Tang et al., 2024)\nAs generative models, especially diffusion models, have become widespread, both their robustness and ways to improve their performance have become active research topics. Ben-Iwhi et al., 2024 proposed a technique called Group Orthogonalization Regularization (GOR), which aims to address the widespread parameter redundancy in deep neural networks by reducing the correlation between convolutional filters.\nThe improvement in generation quality of diffusion models by GOR. When GOR is combined with LoRA to fine-tune a text-to-image diffusion model, the resulting model (bottom row) generates images with richer, more vivid details (especially in the eyes) than the standard method (top row). Image source: (Ben-Iwhi et al., 2024)\nExperiments also demonstrate that adding GOR during adversarial training can effectively improve model robustness.
This indicates that optimizing the internal parameter structure of a model can not only enhance its core performance but also strengthen its resilience against adversarial attacks.\nReferences [1]Zhang, R., \u0026amp; Sun, J. (2024). Certified Robust Accuracy of Neural Networks Are Bounded due to Bayes Errors. Computer Aided Verification, 445\u0026ndash;466. https://doi.org/10.1007/978-3-031-63175-8_19 [2]Jain, G., Balasubramanian, V. N., \u0026amp; Carlini, N. (2023, May 9). Characterizing Model Robustness via Natural Input Gradients. The Eleventh International Conference on Learning Representations. [3]Laidlaw, C., Singla, S., \u0026amp; Feizi, S. (2021). Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations. Proceedings of the IEEE/CVF International Conference on Computer Vision, 15302\u0026ndash;15312. https://doi.org/10.1109/ICCV48922.2021.01501 [4]Yuan, X., He, P., Zhu, Q., \u0026amp; Li, X. (2019). Adversarial Examples: Attacks and Defenses for Deep Learning. IEEE Transactions on Neural Networks and Learning Systems, 30, Article 9. https://doi.org/10.1109/TNNLS.2018.2886017 [5]Zhang, C.-H., Zhang, Z., Wu, S., Jiang, T.-Y., \u0026amp; Liu, S. (2023). A Comprehensive Study of the Robustness for LiDAR-based 3D Object Detectors against Adversarial Attacks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 21919\u0026ndash;21929. [6]Nguyen, T., Ergezer, M., \u0026amp; Green, C. (2024). AdvIRL: Reinforcement Learning-Based Adversarial Attacks on 3D NeRF Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20907\u0026ndash;20917. [7]Yang, R., Chen, Y., Misailovic, S., \u0026amp; Singh, G. (2024). Towards Viewpoint-Invariant Visual Recognition via Adversarial Training. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 25191\u0026ndash;25202. 
[8]Mao, C., Liu, C., Yang, R., Yang, H., Singh, G., \u0026amp; Liu, X. (2023). COMMIT: Certifying Robustness of Multi-Sensor Fusion Systems against Semantic Attacks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 21789\u0026ndash;21798. [9]Zhou, S., Liu, C., Ye, D., Zhu, T., Zhou, W., \u0026amp; Yu, P. S. (2022). Adversarial Attacks and Defenses in Deep Learning: From a Perspective of Cybersecurity. ACM Computing Surveys, 55, Article 8. https://doi.org/10.1145/3547330 [10]Baniecki, H., \u0026amp; Biecek, P. (2024). Adversarial Attacks and Defenses in Explainable Artificial Intelligence: A Survey. Information Fusion, 107, 102303. https://doi.org/10.1016/j.inffus.2024.102303 [11]Chen, R., Guo, S., Jiang, L.-J., Niu, X., \u0026amp; Zhang, Q. (2023). Mitigating the Accuracy-Robustness Trade-off via Balanced Multi-Teacher Adversarial Distillation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 4668\u0026ndash;4678. [12]Liu, D., Niu, Y., Wu, Q., Liu, J., \u0026amp; Zhang, H. (2023). Boosting Adversarial Training via Fisher-Rao Norm-based Regularization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 24898\u0026ndash;24907. [13]Zhao, Z., Zhang, J., Wu, X., \u0026amp; Liu, J. (2023). LORE: Lagrangian-Optimized Robust Embeddings for Visual Encoders. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16568\u0026ndash;16578. [14]Lee, S.-H., Jeong, M., Park, S.-Y., Yun, S.-B., \u0026amp; Choo, J. (2022). IPMix: Label-Preserving Data Augmentation Method for Training Robust Classifiers. Computer Vision \u0026ndash; ECCV 2022, 21\u0026ndash;38. [15]Zhang, D., Wang, C., Li, J., \u0026amp; Zhang, M. (2023). The Power of Many: Synergistic Unification of Diverse Augmentations for Efficient Adversarial Robustness. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 24808\u0026ndash;24817. 
https://doi.org/10.1109/CVPR52729.2023.02381 [16]Wang, Z., Pang, T., Du, C., Lin, M., Liu, W., \u0026amp; Yan, S. (2023). Better Diffusion Models Further Improve Adversarial Training. Proceedings of the 40th International Conference on Machine Learning, 36246\u0026ndash;36263. [17]Bai, Y., Ding, M., Wang, Y., Zhang, Z.-M., Wang, J., \u0026amp; Tao, D. (2023). A Light Recipe to Train Robust Vision Transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, Article 10. https://doi.org/10.1109/TPAMI.2023.3283256 [18]Park, S.-H., Shin, J.-g., Kim, K., Park, T., Lee, I.-K., \u0026amp; Choo, J. (2023, May 9). What Are Effective Labels for Augmented Data? Improving Calibration and Robustness with AutoLabel. The Eleventh International Conference on Learning Representations. [19]Li, B.-J., Cisse, M., Singh, S. P., \u0026amp; van der Maaten, L. (2023). Understanding Noise-Augmented Training for Randomized Smoothing. [20]Kumar, A., Schwarzschild, A., Gupta, T., Goldblum, M., Gehr, T., \u0026amp; Goldstein, T. (2023). Mitigating the Curse of Dimensionality for Certified Robustness via Dual Randomized Smoothing. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 24657\u0026ndash;24666. https://doi.org/10.1109/CVPR52729.2023.02366 [21]Duan, Y., Chen, Z., Wang, E., Li, Z., \u0026amp; Zhu, S. (2023, May 9). Hierarchical Randomized Smoothing. The Eleventh International Conference on Learning Representations. [22]Chiang, P.-y., Fang, K.-h., Zhang, H., \u0026amp; Hsieh, C.-j. (2023). Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16738\u0026ndash;16747. [23]E, W., Jia, J., \u0026amp; Liu, J. (2023, May 9). Localized Randomized Smoothing for Collective Robustness Certification. The Eleventh International Conference on Learning Representations. [24]Yang, R., Laurel, J., Misailovic, S., \u0026amp; Singh, G. (2023, May 9).
Provable Defense Against Geometric Transformations. The Eleventh International Conference on Learning Representations. [25]Jia, R., Liu, C., \u0026amp; Singh, G. (2023). PointCert: A Deterministic Approach for Certified L0-Robustness of Point Cloud Classifiers. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17351\u0026ndash;17360. https://doi.org/10.1109/CVPR52729.2023.01667 [26]Wang, H., Liu, S., Jiang, H., \u0026amp; Wang, Z. (2023). Your Diffusion Model is Secretly a Certifiably Robust Classifier. Proceedings of the 40th International Conference on Machine Learning, 36081\u0026ndash;36102. [27]Lei, C. T., Yam, H. M., Guo, Z., \u0026amp; Lau, C. P. (2024). Instant Adversarial Purification with Adversarial Consistency Distillation. [28]Fu, Z., Yuan, X., Li, Y., Guo, Y., Wang, Y., \u0026amp; Zhang, Y. (2023). ADAPT to Robustify Prompt Tuning Vision Transformers. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16084\u0026ndash;16093. https://doi.org/10.1109/CVPR52729.2023.01548 [29]Gao, P., Wang, J., Liu, T., Yan, S., \u0026amp; Wang, B. (2023). MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 24908\u0026ndash;24918. https://doi.org/10.1109/CVPR52729.2023.02390 [30]Zhang, T., Yuan, B., \u0026amp; Wang, Y. (2023). Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16997\u0026ndash;17007. https://doi.org/10.1109/CVPR52729.2023.01634 [31]Saralajew, S., Rana, A., Villmann, T., \u0026amp; Shaker, A. (2025). A Robust Prototype-Based Network with Interpretable RBF Classifier Foundations. Proceedings of the AAAI Conference on Artificial Intelligence, 39, Article 19. https://doi.org/10.1609/aaai.v39i19.32432 [32]Chen, Y., Yu, J., Chen, Z., Tang, S., Wu, G., Wang, C., Wang, X., \u0026amp; Sun, J.
(2023). X\u0026sup3;KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17614\u0026ndash;17624. https://doi.org/10.1109/CVPR52729.2023.01691 [33]Tang, K., Zhou, Y., Zhang, Y., \u0026amp; Liu, Y. (2023). SurfR: Surface Reconstruction with Multi-scale Attention. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13182\u0026ndash;13191. https://doi.org/10.1109/CVPR52729.2023.01272 [34]Weng, T., Chiang, P., Wang, S., Zhang, H., \u0026amp; Hsieh, C. (2023). Generalizability of Adversarial Robustness Under Distribution Shifts. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 24604\u0026ndash;24613. https://doi.org/10.1109/CVPR52729.2023.02361 [35]Horv\u0026aacute;th, M. Z., Chiang, P., Zhang, H., \u0026amp; Vechev, M. (2023, May 9). Provably Cost-Sensitive Adversarial Defense via Randomized Smoothing. The Eleventh International Conference on Learning Representations. [36]Tang, X., Hu, X., Gu, X., \u0026amp; Sun, J. (2024). Residual-Conditioned Optimal Transport: Towards Structure-Preserving Unpaired and Paired Image Restoration. Proceedings of the 41st International Conference on Machine Learning, 47757\u0026ndash;47777. [37]Ben-Iwhi, I., Fratarcangeli, M., \u0026amp; Sintorn, I.-M. (2024). Group Orthogonalization Regularization for Vision Models Adaptation and Robustness.
","permalink":"https://xiaokunduan.github.io/posts/2025-09-08-adversary-robustness/","summary":"\u003cdiv class=\"series-callout\"\u003e\n  \u003cp class=\"series-callout__title\"\u003ePrefer the 5-part readable version?\u003c/p\u003e\n  \u003cp\u003eThis full article is still here as the reference version, but I also split it into a shorter 5-part series for easier reading and sharing.\u003c/p\u003e\u003cp\u003e\u003ca class=\"series-callout__button\" href=\"/posts/adversarial-robustness-series/\"\u003eStart the series\u003c/a\u003e\u003c/p\u003e\u003c/div\u003e\n\n\u003ch1 id=\"motivation\"\u003e\u003cstrong\u003eMotivation\u003c/strong\u003e\u003c/h1\u003e\n\u003cp\u003eWe are in the midst of a transformative era driven by deep learning, particularly by the large language models (LLMs) based on the Transformer architecture. These models are demonstrating capabilities that surpass human experts in a growing range of domains, operating with unprecedented efficiency and accuracy. From mastering complex intellectual challenges like Go and protein folding to accelerating drug discovery and scientific breakthroughs, the power of AI seems to be reshaping our very definition of \u0026ldquo;intelligence.\u0026rdquo;\u003c/p\u003e","title":"The Robustness of Adversarial Network"}]