
Top-Conference Paper Sharing

TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models (ICCV 2025)
Authors: Teng-Fang Hsiao, Bo-Kai Ruan, Yi-Lun Wu, Tzu-Ling Lin, Hong-Han Shuai
Speaker: Teng-Fang Hsiao
Abstract
Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance
image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they
experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free
Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes
on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this
interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference
Contextual Masking -- this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally,
our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing
the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I
methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.
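
The masking and winner-takes-all ideas above operate inside the MM-DiT attention layers. Below is a minimal PyTorch sketch of how such a mechanism could look, assuming per-token instruction-relevance scores are already available; the tensor shapes, the threshold tau, and the function name masked_wta_attention are illustrative assumptions, not the authors' implementation.

```python
import torch

def masked_wta_attention(q_vis, k_refs, v_refs, relevance, tau=0.5):
    """
    q_vis:     (N, d)     queries from the generated image's vision tokens
    k_refs:    (R, C, d)  condensed contextual key tokens from R reference images
    v_refs:    (R, C, d)  corresponding value tokens
    relevance: (R, C)     instruction-relevance score of each reference token in [0, 1]
    Assumes every vision token has at least one reference token above the threshold.
    """
    n, d = q_vis.shape
    # Attention logits of every vision token against every reference's context tokens.
    logits = torch.einsum("nd,rcd->nrc", q_vis, k_refs) / d ** 0.5      # (N, R, C)

    # Reference Contextual Masking: hide reference tokens that the textual
    # instruction deems irrelevant.
    logits = logits.masked_fill(relevance[None] < tau, float("-inf"))

    # Winner-Takes-All: each vision token reads only from the single reference
    # whose surviving tokens match it best, limiting cross-reference mixing.
    best_ref = logits.amax(dim=-1).argmax(dim=-1)                       # (N,)
    keep = torch.zeros_like(logits, dtype=torch.bool)
    keep[torch.arange(n), best_ref] = True
    logits = logits.masked_fill(~keep, float("-inf"))

    attn = torch.softmax(logits.flatten(1), dim=-1)                     # (N, R*C)
    return attn @ v_refs.reshape(-1, d)                                 # (N, d)
```
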
Boost Self-Supervised Dataset Distillation via Parameterization, Predefined Augmentation, and Approximation (ICLR 2025)
Authors: Sheng-Feng Yu, Jia-Jiun Yao, Wei-Chen Chiu
Speaker: TBD
Abstract
Although larger datasets are crucial for training large deep models, the rapid growth of dataset sizes has brought a significant challenge
in terms of considerable training costs, even resulting in prohibitive computational expenses. Dataset distillation has recently become a popular
technique for reducing dataset size by learning a highly compact set of representative exemplars, where a model trained on these exemplars should
ideally perform comparably to one trained on the full dataset. While most existing work on dataset distillation focuses on supervised datasets,
we instead aim to distill images and their self-supervisedly trained representations into a distilled set. This procedure, named Self-Supervised
Dataset Distillation, effectively extracts rich information from real datasets, yielding distilled sets with enhanced cross-architecture
generalizability. In particular, to preserve the key characteristics of the original dataset more faithfully and compactly, several novel techniques
are proposed: 1) we introduce an innovative parameterization of images and representations via distinct low-dimensional bases, where the choice of
bases is experimentally shown to play a crucial role; 2) we tackle the instability induced by the randomness of data augmentation, a key component
of self-supervised learning that is underestimated in prior work on self-supervised dataset distillation, by utilizing predetermined augmentations;
3) we further leverage a lightweight network to model the connections among the representations of augmented views of the same image, leading to
more compact distilled pairs. Extensive experiments conducted on various datasets validate the superiority of our approach in terms of
distillation efficiency, cross-architecture generalization, and transfer learning performance.
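
As a rough illustration of the parameterization in point 1) and the lightweight connection network in point 3), the PyTorch sketch below stores only per-exemplar coefficients over shared low-dimensional bases and predicts the representations of predefined augmented views with a small head. All sizes, the DistilledSet class, and the MLP head are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DistilledSet(nn.Module):
    def __init__(self, n_exemplars=100, n_img_bases=64, n_rep_bases=32,
                 img_dim=3 * 32 * 32, rep_dim=512, n_aug_views=4):
        super().__init__()
        # Shared low-dimensional bases for images and for representations.
        self.img_bases = nn.Parameter(torch.randn(n_img_bases, img_dim) * 0.01)
        self.rep_bases = nn.Parameter(torch.randn(n_rep_bases, rep_dim) * 0.01)
        # Per-exemplar coefficients: the compact quantities actually learned.
        self.img_coeff = nn.Parameter(torch.randn(n_exemplars, n_img_bases) * 0.01)
        self.rep_coeff = nn.Parameter(torch.randn(n_exemplars, n_rep_bases) * 0.01)
        # Lightweight network relating an exemplar's representation to the
        # representations of its predefined augmented views.
        self.aug_head = nn.Sequential(
            nn.Linear(rep_dim, rep_dim), nn.ReLU(),
            nn.Linear(rep_dim, rep_dim * n_aug_views),
        )
        self.n_aug_views = n_aug_views
        self.rep_dim = rep_dim

    def forward(self):
        images = self.img_coeff @ self.img_bases        # (N, img_dim), assumed 32x32 RGB
        reps = self.rep_coeff @ self.rep_bases          # (N, rep_dim)
        # Representations of the augmented views are predicted, not stored,
        # which keeps the distilled set compact.
        aug_reps = self.aug_head(reps).view(-1, self.n_aug_views, self.rep_dim)
        return images.view(-1, 3, 32, 32), reps, aug_reps

distilled = DistilledSet()
imgs, reps, aug_reps = distilled()   # optimized with a distillation objective (not shown)
```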

Domain-adaptive Video Deblurring via Test-time Blurring (ECCV 2024)
Speaker: Jin-Ting He (何勁廷)
Abstract
Dynamic scene video deblurring aims to remove undesirable blurry artifacts captured during the exposure process. Although previous video
deblurring methods have achieved impressive results, they suffer from significant performance drops due to the domain gap between training
and testing videos, especially for those captured in real-world scenarios. To address this issue, we propose a domain adaptation scheme based on
a blurring model to achieve test-time fine-tuning for deblurring models in unseen domains. Since blurred and sharp pairs are unavailable for
fine-tuning during inference, our scheme can generate domain-adaptive training pairs to calibrate a deblurring model for the target domain.
First, a Relative Sharpness Detection Module is proposed to identify relatively sharp regions from the blurry input images and regard them as
pseudo-sharp images. Next, we utilize a blurring model to produce blurred images based on the pseudo-sharp images extracted during testing.
To synthesize blurred images in compliance with the target data distribution, we propose a Domain-adaptive Blur Condition Generation Module to
create domain-specific blur conditions for the blurring model. Finally, the generated pseudo-sharp and blurred pairs are used to fine-tune a
deblurring model for better performance. Extensive experimental results demonstrate that our approach can significantly improve state-of-the-art
video deblurring methods, providing performance gains of up to 7.54 dB on various real-world video deblurring datasets.
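
The overall test-time adaptation loop described above can be pictured as follows. This is a loose PyTorch sketch under assumed interfaces: relative_sharpness, blur_condition_generator, blurring_model, and deblur_net are hypothetical callables standing in for the paper's modules, and the patch size, learning rate, and L1 loss are illustrative choices.

```python
import torch
import torch.nn.functional as F

def extract_patches(frames, size=64, stride=64):
    # Non-overlapping patches from (B, C, H, W) frames; a simplification.
    p = frames.unfold(2, size, stride).unfold(3, size, stride)
    return p.permute(0, 2, 3, 1, 4, 5).reshape(-1, frames.shape[1], size, size)

def test_time_adapt(blurry_frames, deblur_net, blurring_model,
                    relative_sharpness, blur_condition_generator,
                    steps=50, lr=1e-5, topk=16):
    opt = torch.optim.Adam(deblur_net.parameters(), lr=lr)
    for _ in range(steps):
        # 1) Relative Sharpness Detection: keep the sharpest regions of the
        #    blurry inputs and treat them as pseudo-sharp targets.
        patches = extract_patches(blurry_frames)
        scores = relative_sharpness(patches)                       # (P,)
        pseudo_sharp = patches[scores.topk(min(topk, scores.numel())).indices]

        # 2) Domain-adaptive Blur Condition Generation: blur conditions that
        #    match the target domain's blur statistics.
        cond = blur_condition_generator(blurry_frames)

        # 3) Blurring model synthesizes domain-consistent blurred counterparts.
        with torch.no_grad():
            pseudo_blurred = blurring_model(pseudo_sharp, cond)

        # 4) Fine-tune the deblurring model on the generated pairs.
        loss = F.l1_loss(deblur_net(pseudo_blurred), pseudo_sharp)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return deblur_net
```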

Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models (ICLR 2025)
Authors: Hao-Chien Hsueh (薛皓謙), Wen-Hsiao Peng (彭文孝), and Ching-Chun Huang (黃敬群)
Speaker: Chi-En Yen (顏琦恩)
Abstract
Diffusion probabilistic models have achieved remarkable success in generative tasks across diverse data types.
While recent studies have explored alternative degradation processes beyond Gaussian noise, this paper bridges two key diffusion paradigms:
hot diffusion, which relies entirely on noise, and cold diffusion, which uses only blurring without noise. We argue that hot diffusion fails
to exploit the strong correlation between high-frequency image details and low-frequency structures, leading to random behaviors in the early
steps of generation. Conversely, while cold diffusion leverages image correlations for prediction, it neglects the role of noise (randomness)
in shaping the data manifold, resulting in out-of-manifold issues and partially explaining its performance drop. To integrate both strengths,
we propose Warm Diffusion, a unified Blur-Noise Mixture Diffusion Model (BNMD), to control blurring and noise jointly. Our divide-and-conquer
strategy exploits the spectral dependency in images, simplifying score model estimation by disentangling the denoising and deblurring processes.
We further analyze the Blur-to-Noise Ratio (BNR) using spectral analysis to investigate the trade-off between model learning dynamics and
changes in the data manifold. Extensive experiments across benchmarks validate the effectiveness of our approach for image generation.
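
To make the blur-noise mixture concrete, the sketch below corrupts an image with a timestep-dependent Gaussian blur followed by additive Gaussian noise, with a single Blur-to-Noise Ratio scalar tying the two together; the linear schedule, kernel size, and function names are assumptions rather than the paper's recipe.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(x, sigma, ksize=9):
    # Depthwise Gaussian blur; sigma == 0 means no blurring.
    if sigma <= 0:
        return x
    half = ksize // 2
    grid = torch.arange(-half, half + 1, dtype=x.dtype, device=x.device)
    k1d = torch.exp(-grid ** 2 / (2 * sigma ** 2))
    k1d = k1d / k1d.sum()
    k2d = torch.outer(k1d, k1d).expand(x.shape[1], 1, ksize, ksize)
    return F.conv2d(x, k2d, padding=half, groups=x.shape[1])

def warm_forward(x0, t, noise_sigma_max=1.0, bnr=2.0):
    """Corrupt x0 at normalized time t in [0, 1] with a blur-noise mixture."""
    noise_sigma = noise_sigma_max * t      # noise schedule (assumed linear)
    blur_sigma = bnr * noise_sigma         # BNR ties blur strength to noise strength
    x_blur = gaussian_blur(x0, blur_sigma)
    return x_blur + noise_sigma * torch.randn_like(x0)

# bnr -> 0 recovers "hot" (noise-only) diffusion; a very large bnr approaches
# "cold" (blur-dominated) diffusion.
x0 = torch.randn(1, 3, 32, 32)
xt = warm_forward(x0, t=0.5)
```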

MOTE-NAS: Multi-Objective Training-based Estimate for Efficient Neural Architecture Search (NeurIPS 2024)
Authors: Yu-Ming Zhang, Jun-Wei Hsieh, Xin Li, Ming-Ching Chang, Chun-Chieh Lee, Kuo-Chin Fan
Speaker: Jun-Wei Hsieh
Abstract
Neural Architecture Search (NAS) methods seek effective optimization toward performance metrics regarding model accuracy and generalization
while facing challenges regarding search costs and GPU resources. Recent Neural Tangent Kernel (NTK) NAS methods achieve remarkable search
efficiency based on a training-free model estimate; however, they overlook the non-convex nature of the DNNs in the search process. In this
paper, we develop a Multi-Objective Training-based Estimate (MOTE) for efficient NAS, retaining search effectiveness and achieving a new
state of the art in the accuracy-cost trade-off. To improve upon NTK-based estimates, and inspired by the Training Speed Estimation (TSE) method,
MOTE is designed to model the actual performance of DNNs from macro to micro perspectives by drawing on the loss landscape and convergence speed
simultaneously. Using two reduction strategies, MOTE is computed on a reduced architecture and a reduced dataset. Inspired by evolutionary search,
our iterative ranking-based, coarse-to-fine architecture search is highly effective. Experiments on NAS-Bench-201 show MOTE-NAS achieves 94.32%
accuracy on CIFAR-10, 72.81% on CIFAR-100, and 46.38% on ImageNet-16-120, outperforming NTK-based NAS approaches. An evaluation-free (EF)
version of MOTE-NAS completes the search in only 5 minutes while delivering a model more accurate than KNAS.
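
A rough sketch of a training-based estimate in the spirit of MOTE, under stated assumptions: the candidate is trained briefly on a reduced architecture and reduced dataset, and the score combines a convergence-speed term (accumulated training loss, as in TSE) with a crude loss-landscape sharpness proxy. The epoch count, perturbation scale, and the way the two terms are combined are illustrative, not the paper's formulation.

```python
import copy
import torch
import torch.nn.functional as F

def mote_score(reduced_model, reduced_loader, epochs=3, lr=0.1, eps=1e-3):
    opt = torch.optim.SGD(reduced_model.parameters(), lr=lr, momentum=0.9)
    tse = 0.0  # convergence-speed term: lower accumulated loss means faster training
    for _ in range(epochs):
        for x, y in reduced_loader:
            loss = F.cross_entropy(reduced_model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            tse += loss.item()

    # Loss-landscape term: sensitivity of the loss to a small random
    # perturbation of the trained weights (a rough sharpness proxy).
    x, y = next(iter(reduced_loader))
    base = F.cross_entropy(reduced_model(x), y).item()
    perturbed = copy.deepcopy(reduced_model)
    with torch.no_grad():
        for p in perturbed.parameters():
            p.add_(eps * torch.randn_like(p))
    sharp = abs(F.cross_entropy(perturbed(x), y).item() - base)

    # Lower is better for both objectives; candidates would be ranked by this
    # score in an iterative, coarse-to-fine search.
    return tse + sharp
```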

SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking (AAAI 2024)
Authors: Yu-Hsiang Wang, Jun-Wei Hsieh, Ping-Yang Chen, Ming-Ching Chang, Hung-Hin So, Xin Li
Speaker: Jun-Wei Hsieh
Abstract
Despite recent progress in Multiple Object Tracking (MOT), several obstacles such as occlusions, similar objects, and complex scenes remain
an open challenge. Meanwhile, a systematic study of the cost-performance tradeoff for the popular tracking-by-detection paradigm is still
lacking. This paper introduces SMILEtrack, an innovative object tracker that effectively addresses these challenges by integrating an efficient
object detector with a Siamese network-based Similarity Learning Module (SLM). The technical contributions of SMILEtrack are twofold. First,
we propose an SLM that calculates the appearance similarity between two objects, overcoming the limitations of feature descriptors in Separate
Detection and Embedding (SDE) models. The SLM incorporates a Patch Self-Attention (PSA) block inspired by the Vision Transformer, which
generates reliable features for accurate similarity matching. Second, we develop a Similarity Matching Cascade (SMC) module with a novel GATE
function for robust object matching across consecutive video frames, further enhancing MOT performance. Together, these innovations help
SMILEtrack achieve an improved trade-off between cost (e.g., running speed) and performance (e.g., tracking accuracy) compared with several
existing state-of-the-art trackers, including the popular BYTETrack method. SMILEtrack outperforms BYTETrack by 0.4-0.8 MOTA and 2.1-2.2 HOTA
points on the MOT17 and MOT20 datasets.
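
As an illustration of the Siamese similarity idea, the sketch below embeds two detection crops into patch tokens, refines them with a self-attention layer (in the spirit of the PSA block), and compares pooled embeddings by cosine similarity; the patch size, embedding width, and the PatchSimilarity class are assumptions, not the released SLM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchSimilarity(nn.Module):
    def __init__(self, patch=16, dim=128, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patchify the crop
        self.psa = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                              batch_first=True)           # patch self-attention
        self.proj = nn.Linear(dim, dim)

    def encode(self, crop):
        tokens = self.embed(crop).flatten(2).transpose(1, 2)   # (B, N_patches, dim)
        tokens = self.psa(tokens)                              # attend across patches
        return F.normalize(self.proj(tokens.mean(dim=1)), dim=-1)

    def forward(self, crop_a, crop_b):
        # Appearance similarity in [-1, 1]; would serve as the appearance cost
        # when matching detections to tracks across consecutive frames.
        return (self.encode(crop_a) * self.encode(crop_b)).sum(dim=-1)

slm = PatchSimilarity()
sim = slm(torch.rand(1, 3, 128, 64), torch.rand(1, 3, 128, 64))  # one pair of detection crops
```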