TeleStyle
Content-Preserving Style Transfer in Images and Videos
A lightweight, state-of-the-art model built on Qwen-Image-Edit for precise stylization without losing content fidelity.
Latest News
We refined the code and updated requirements.txt. In addition, a new version of the TeleStyle-Image model with improved performance has been uploaded, and we released a free online demo for TeleStyle-Image. Please star this project if you find the demo useful.
We released the technical report, code, and models of TeleStyle.
Abstract & Technical Overview

Content-preserving style transfer (generating stylized outputs from a content reference and a style reference) remains a significant challenge for Diffusion Transformers (DiTs). This difficulty arises primarily from the inherent entanglement of content and style features within their internal representations. Standard methods often struggle to separate these elements, producing outputs where the style overpowers the original structure, or where the content is preserved so rigidly that the style cannot be applied effectively.
In this technical report, we present TeleStyle, a lightweight yet effective model designed for both image and video stylization. Built upon the robust foundation of Qwen-Image-Edit, TeleStyle inherits the base model's strong capabilities in content preservation and style customization. Unlike previous approaches that require massive computational resources or struggle with generalization, ours focuses on efficiency and precision.
To facilitate effective training, we curated a high-quality dataset covering a set of distinct styles. Recognizing that curated data alone is insufficient for broad generalization, we further synthesized triplets spanning thousands of diverse, in-the-wild style categories. This combination ensures the model encounters a vast array of aesthetic possibilities during training.
We introduce a Curriculum Continual Learning framework to train TeleStyle on this hybrid dataset of clean (curated) and noisy (synthetic) triplets. This staged training allows the model to learn from high-fidelity data first, establishing a strong baseline, before adapting to the greater variance of the synthetic data. Consequently, the model generalizes to unseen styles without compromising precise content fidelity.
Additionally, we introduce a video-to-video stylization module to enhance temporal consistency and visual quality. Style transfer in video often suffers from flickering or inconsistent texturing between frames. Our module addresses this by propagating style features correctly across the temporal dimension, maintaining the coherence of the stylized video. TeleStyle achieves state-of-the-art performance across three core evaluation metrics: style similarity, content consistency, and aesthetic quality.
Live Demo
Experience TeleStyle firsthand. Upload a content image and a style reference to see the model in action.
Methodology & Architecture
1. Foundation on Qwen-Image-Edit
TeleStyle is not built from scratch but stands on the shoulders of giants. We selected Qwen-Image-Edit as our base model due to its exceptional understanding of semantic content and its ability to accept multimodal inputs. By using this pre-trained foundation, we bypass the need to learn basic image generation features, allowing us to focus entirely on the style transfer task. This "standing on shoulders" approach significantly reduces training time and computational costs while yielding superior results.
2. Hybrid Dataset Construction
Data is the fuel of any AI model. For style transfer, the quality of data is defined by how well the style reference is separated from the content reference. We observed that existing datasets often lacked the specific "content-style" pairing needed for robust training.
- Curated Data: We hand-picked a set of high-quality style images representing distinct artistic movements (Impressionism, Cyberpunk, Watercolor, etc.) and paired them with compatible content images. This creates a "clean" signal for the model.
- Synthetic Triplets: To cover the near-infinite variety of real-world styles, we generated thousands of synthetic triplets using in-the-wild images. While "noisy," this data forces the model to learn robust feature extraction, preventing overfitting to the clean dataset. (A sketch of the triplet structure follows this list.)
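Conceptually, each training sample is a (content, style, target) triplet tagged by the pool it comes from. The record below is a hypothetical sketch of that structure for illustration only; it is not the repository's actual data format.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class StyleTriplet:
    """One training sample: a content image, a style reference,
    and the stylized target the model should reproduce."""
    content_path: Path    # image providing structure and semantics
    style_path: Path      # reference image providing texture and palette
    target_path: Path     # stylized result (curated or synthesized)
    is_synthetic: bool    # True for the noisy, in-the-wild pool

# Example record from the (hypothetical) synthetic pool
sample = StyleTriplet(
    content_path=Path("data/content/0001.jpg"),
    style_path=Path("data/styles/watercolor_17.jpg"),
    target_path=Path("data/targets/0001_watercolor_17.jpg"),
    is_synthetic=True,
)
```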
3. Curriculum Continual Learning
Training on a mix of clean and noisy data simultaneously can lead to suboptimal convergence. To solve this, we implemented a Curriculum Continual Learning framework. The training process follows a structured "curriculum" (a minimal sketch of the sampling schedule appears after this list):
- Phase 1: Foundation. The model trains exclusively on the high-quality, curated dataset. This establishes a strong understanding of fundamental style transfer mechanics.
- Phase 2: Expansion. We gradually introduce the synthetic, noisy data. The model uses its foundational knowledge to interpret this new data, effectively "cleaning" it on the fly during the learning process.
- Phase 3: Refinement. The model undergoes a final fine-tuning stage where both datasets are used to polish its performance, maximizing both adaptability and fidelity.
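The snippet below is a minimal sketch of how such a phased sampling schedule could be expressed. The phase boundaries and mixing ratios are illustrative placeholders, not the values used to train TeleStyle.

```python
import random

def sample_batch(curated, synthetic, phase, batch_size=8):
    """Draw a training batch according to the curriculum phase.

    Phase 1: curated data only (clean signal).
    Phase 2: gradually mix in synthetic, noisy triplets.
    Phase 3: joint fine-tuning on both pools.
    """
    if phase == 1:
        pool_weights = [(curated, 1.0)]
    elif phase == 2:
        pool_weights = [(curated, 0.7), (synthetic, 0.3)]  # illustrative ratio
    else:
        pool_weights = [(curated, 0.5), (synthetic, 0.5)]

    pools, weights = zip(*pool_weights)
    batch = []
    for _ in range(batch_size):
        pool = random.choices(pools, weights=weights, k=1)[0]
        batch.append(random.choice(pool))
    return batch
```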
4. Video-to-Video Stylization
Video style transfer presents a unique challenge: temporal consistency. Applying style frame by frame often results in a "flickering" effect where artistic textures shift randomly between frames. TeleStyle incorporates a specialized video module (sketched after this list) that:
- Analyzes optical flow to understand movement between frames.
- Propagates style features along motion paths, ensuring that a brushstroke on a moving object follows that object naturally.
- Maintains a temporal memory buffer to ensure long-term consistency, preventing the style from drifting over time.
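The sketch below illustrates the general idea of flow-guided propagation with a temporal memory buffer. The callables estimate_flow, warp, and stylize_frame are placeholders for whichever flow estimator and stylization backbone are actually used; this is not the module's real implementation.

```python
from collections import deque

def stylize_video(frames, style_ref, estimate_flow, warp, stylize_frame,
                  buffer_size=4, blend=0.5):
    """Flow-guided video stylization sketch.

    For each frame, the previously stylized frame is warped along the
    estimated optical flow and blended with the fresh stylization,
    which damps frame-to-frame flicker and long-term style drift.
    """
    memory = deque(maxlen=buffer_size)  # temporal memory buffer
    outputs = []
    for frame in frames:
        stylized = stylize_frame(frame, style_ref)
        if memory:
            prev_frame, prev_stylized = memory[-1]
            flow = estimate_flow(prev_frame, frame)   # motion between frames
            warped_prev = warp(prev_stylized, flow)   # carry style along motion paths
            stylized = blend * stylized + (1 - blend) * warped_prev
        memory.append((frame, stylized))
        outputs.append(stylized)
    return outputs
```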
Installation & Usage Guide
1. Environment Setup
First, clone the repository and install the dependencies from requirements.txt. We recommend using a fresh virtual environment; a quick version check after the list below can confirm your setup matches the tested configurations.
Tested Configurations:
- Python 3.11
- PyTorch 2.9.1 + CUDA 12.1
- diffusers 0.36.0
- transformers 4.57.3
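After installing requirements.txt, a quick sanity check of the installed versions can save debugging time later:

```python
import torch, diffusers, transformers

print("torch         :", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("diffusers     :", diffusers.__version__)
print("transformers  :", transformers.__version__)
```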
2. Download Checkpoints
You generally need two sets of weights: the base model and the TeleStyle adapters.
Image Model
- diffsynth_Qwen-Image-Edit-2509-Lightning-4steps-V1.0-bf16.safetensors
- diffsynth_Qwen-Image-Edit-2509-telestyle.safetensors
Video Model
- dit.ckpt
- prompt_embeds.pth
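The snippet below shows one way to fetch these files with huggingface_hub. The repository IDs are placeholders; substitute the official TeleStyle and base-model repos linked from this project.

```python
from huggingface_hub import hf_hub_download

# Repo IDs below are placeholders; use the official repos linked from this project.
image_base = hf_hub_download(
    repo_id="YOUR_ORG/TeleStyle-Image",
    filename="diffsynth_Qwen-Image-Edit-2509-Lightning-4steps-V1.0-bf16.safetensors",
)
image_adapter = hf_hub_download(
    repo_id="YOUR_ORG/TeleStyle-Image",
    filename="diffsynth_Qwen-Image-Edit-2509-telestyle.safetensors",
)
video_dit = hf_hub_download(repo_id="YOUR_ORG/TeleStyle-Video", filename="dit.ckpt")
video_prompt = hf_hub_download(repo_id="YOUR_ORG/TeleStyle-Video", filename="prompt_embeds.pth")
```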
3. Running Inference
Image Stylization:
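The actual entry point is provided by the scripts in this repository; the sketch below only illustrates the expected flow (load the base and adapter weights, pass a content image and a style reference, save the result). TeleStyleImagePipeline and its arguments are hypothetical placeholders, not the real API.

```python
from PIL import Image
# Hypothetical wrapper for illustration; use the inference script shipped in this repo instead.
from telestyle import TeleStyleImagePipeline

pipe = TeleStyleImagePipeline(
    base_model="diffsynth_Qwen-Image-Edit-2509-Lightning-4steps-V1.0-bf16.safetensors",
    adapter="diffsynth_Qwen-Image-Edit-2509-telestyle.safetensors",
    device="cuda",
)

content = Image.open("examples/content.jpg").convert("RGB")
style = Image.open("examples/style.jpg").convert("RGB")

# The Lightning checkpoint is distilled for 4-step sampling.
result = pipe(content=content, style=style, num_inference_steps=4, seed=42)
result.save("outputs/stylized.png")
```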
Video Stylization:
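Likewise, the video sketch below is illustrative only; TeleStyleVideoPipeline and its arguments are hypothetical placeholders, and the repository scripts define the real interface.

```python
# Hypothetical wrapper for illustration; the real inference script lives in this repo.
from telestyle import TeleStyleVideoPipeline

pipe = TeleStyleVideoPipeline(
    dit_ckpt="dit.ckpt",
    prompt_embeds="prompt_embeds.pth",
    device="cuda",
)

# Paths and fps are illustrative placeholders.
pipe(
    video_path="examples/input.mp4",
    style_image="examples/style.jpg",
    output_path="outputs/stylized.mp4",
    fps=16,
)
```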
Frequently Asked Questions
What makes TeleStyle different from StyleGAN or other style transfer models?
Most traditional models like StyleGAN operate on a global level, often losing specific semantic details of the content image. Diffusion-based methods, while powerful, often struggle to disentangle the "style" prompt from the "content" structure. TeleStyle specifically addresses this entanglement issue using a unique Curriculum Continual Learning approach, allowing it to apply style textures aggressively while keeping the content structure intact.
Can TeleStyle handle high-resolution videos?
Yes, specifically because TeleStyle is designed as a lightweight model. By optimizing the internal representations of the Diffusion Transformer, we significantly reduce the memory footprint required per frame. This allows for processing higher resolution inputs compared to standard latent diffusion models, assuming sufficient GPU VRAM is available. For 4K video, tiling strategies may still be necessary depending on your hardware.
Is the model open source for commercial use?
TeleStyle is released under a permissive license for research and educational purposes. For commercial applications, please review the specific license terms attached to the model weights on Hugging Face, as restrictions may apply based on the base model (Qwen-Image-Edit) and the datasets used for training.
How much VRAM do I need to run TeleStyle?
We recommend a GPU with at least 16GB of VRAM for comfortable inference at standard resolutions (512x512 or 768x768). For video inference involving multiple frames in memory for temporal consistency, 24GB VRAM (like an NVIDIA RTX 3090 or 4090) is ideal. We are actively working on quantization techniques to lower these requirements further.
What is the "Curriculum Continual Learning" framework?
It is a training strategy where we order the training data by difficulty. Instead of throwing all data (clean and noisy) at the model at once, we first teach it with "easy," high-quality examples. Once it masters those, we introduce "harder," synthetic examples. This prevents the model from getting confused early in training and results in a much more robust final model.
Where can I find the dataset used for training?
Portions of the dataset, particularly the synthetic generation scripts and the list of style categories, are available in our GitHub repository. The full high-quality curated dataset may be subject to copyright restrictions depending on the source images, but we provide instructions on how to replicate our data curation process.
Does it support arbitrary style images?
Yes! One of the core strengths of TeleStyle is "zero-shot" style transfer. This means you can upload any image—a painting, a sketch, or a photograph—and the model will attempt to extract its stylistic essence and apply it to your content. Performance is best with images that have strong, distinct textures or color palettes.
Project Roadmap
- Release Reference Code (Complete)
- Release Models (Complete)
- Release Technical Report (Complete)
- Future Optimization for Mobile Devices