HyperAI Weekly AI Model Update: Irodori-TTS, SAM-Audio, MatAnyone 2, PrismAudio, and More

Introduction

This week’s HyperAI update focuses on a strong mix of audio, video, image understanding, OCR, and speech recognition models. The headline project is Irodori-TTS-500M-v3, an open Japanese text-to-speech model that combines high-fidelity 48 kHz speech generation, zero-shot voice cloning, and fine-grained style control through emoji annotations.

The update also includes tools for prompt-based audio separation, video matting, 4D world simulation, video-to-audio generation, document OCR, on-device segmentation, expressive audio editing, and low-latency streaming ASR. Below is a cleaned-up, publication-ready version of the original weekly roundup, with the useful screenshots preserved in their original context.

Source Note

This article is based on the BAAI Hub / HyperAI weekly update. The original page states that the article was sourced from WeChat and that images can be removed if there are copyright concerns.

QR codes, promotional posters, group invitation images, and unrelated recommendation banners were intentionally removed. The DiaMoE-TTS and DreamOmni2 image links are retained at their source positions, but their preview requests timed out during checking, so they are noted here instead of being treated as fully verified screenshots.

Weekly HyperAI Update Overview

From June 27 to July 3, HyperAI updated several public resources on its official website:

12 selected public tutorials
5 popular AI encyclopedia entries
4 AI conference deadlines in July

The main theme this week is practical experimentation. Most entries are not just paper descriptions; they provide online demos or runnable notebooks so users can quickly test the model behavior.

Selected Public Tutorials

Irodori-TTS-500M-v3: Japanese TTS with Emoji Style Control

Irodori-TTS is an open-source Japanese text-to-speech project released by developer Aratako in 2026. The featured model, Irodori-TTS-500M-v3, is designed for Japanese speech synthesis, zero-shot voice cloning, and emoji-guided voice style control.

The model is built around a Rectified Flow Diffusion Transformer (RF-DiT) architecture and generates speech in a continuous DACVAE latent space. In practical use, the most interesting point is that it can clone a target voice from only a short reference clip, usually around 3 to 10 seconds, without extra fine-tuning.
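To make the rectified-flow idea more concrete, here is a minimal sketch of how sampling works in general. It is not Irodori-TTS code. Rectified flow connects a noise sample x0 and a data sample x1 with the straight path x_t = (1 - t) * x0 + t * x1, trains a network to predict the velocity along that path, and generates by integrating the velocity from t = 0 to t = 1. The function `ideal_velocity` is a hypothetical stand-in for the trained transformer, and the 4-dimensional "latent" is an illustrative assumption.

```python
import random


def ideal_velocity(x, t, target):
    """Toy stand-in for a trained velocity network (hypothetical).

    For a single known target, the straight path x_t = (1 - t) * x0 + t * target
    has velocity (target - x_t) / (1 - t), which equals target - x0.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, target, steps=8):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t never reaches 1.0, so the division above is safe
        v = ideal_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4)]  # starting point x0
target_latent = [0.5, -1.0, 2.0, 0.0]               # illustrative "data" latent
sample = euler_sample(noise, target_latent)
print(sample)
```

Because the path is straight, a few Euler steps are enough in this toy case. A real model learns the velocity from many noise and data pairs, conditions it on text and reference audio, and decodes the resulting latent back to a waveform.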

It also supports style control through emoji annotations. That makes the model more flexible than a basic TTS system: users can guide tone, emotion, pacing, and subtle non-verbal expression in a more lightweight way.

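The screenshot shows the Irodori-TTS-500M-v3 interface for Japanese text-to-speech with emoji style control. On the left are three tabs, “Basic TTS,” “Voice Cloning,” and “Emoji Guide,” with “Basic TTS” selected. The input box below contains the Japanese text “こんにちは、今日はいい天気ですね。” On the right is the waveform of the selected audio, with a “Generate Speech” button underneath.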

MatAnyone 2: Video Matting for Foreground Extraction

MatAnyone 2 is a video matting model released by NTU S-Lab and SenseTime. It is built for extracting human foregrounds and generating alpha mattes from videos.

The model improves stability by using a learned quality evaluator. This helps reduce boundary artifacts and preserve details such as hair, semi-transparent edges, and foreground contours. It is also useful when the user wants to isolate a specific person in a multi-person video.

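The screenshot shows the MatAnyone 2 demo interface. The top of the page is labeled “MatAnyone 2: Video Matting” and explains that the tool extracts the foreground from a video. The left side is the control panel, with options for uploading a video and adjusting parameters, plus a processing status message below. The right side shows the original input video frame next to the foreground matte produced by the model, which clearly outlines the target foreground region.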

InSpatio-World: Real-Time 4D World Simulation

InSpatio-World is a real-time 4D world simulator released by the InSpatio team in 2026. It can take an input video and a specified camera trajectory, then generate a stable new-view video.

The core idea is to make video scenes more controllable. Instead of passively watching a fixed camera view, users can define camera movement and explore the scene from new viewpoints while preserving temporal consistency.

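The screenshot shows the interface and output of the InSpatio-World real-time 4D world simulator. The left side is the input area for uploading a video and choosing a camera trajectory, with a “Generate novel view” button below. The right side shows the generated video, a scene with a coffee cup, bread, and other items viewed from different angles.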

DiaMoE-TTS: IPA-Based Multi-Dialect Speech Synthesis

DiaMoE-TTS is a multi-dialect speech synthesis framework from Giant AI Lab. It uses the International Phonetic Alphabet, or IPA, as a unified frontend for dialect speech generation.

The model combines a Mixture-of-Experts design with parameter-efficient adaptation methods such as LoRA and conditioning adapters. This allows the system to adapt more quickly to new dialects, even when only limited data is available.
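The two ingredients named above can be sketched in a few lines. This is a generic illustration, not DiaMoE-TTS code: the source does not describe the router, the adapter placement, or any dimensions, so the 2x2 weight, the rank-1 adapter, and the two "experts" below are all assumptions. LoRA keeps a weight matrix W frozen and trains only a low-rank update, y = W x + (alpha / r) * B (A x). A Mixture-of-Experts layer blends expert outputs with softmax gate weights.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w, a, b, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    r = len(a)  # rank = number of rows of A
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + (alpha / r) * u for y, u in zip(base, update)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(experts, router_scores, x):
    """Blend expert outputs using softmax gate weights."""
    gates = softmax(router_scores)
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    return [sum(g * out[i] for g, out in zip(gates, outputs)) for i in range(dim)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (illustrative)
A = [[1.0, 1.0]]               # rank-1 down-projection
B_zero = [[0.0], [0.0]]        # B starts at zero, so the adapter is initially a no-op
B_tuned = [[0.5], [-0.5]]      # pretend values after adapting to a new dialect

x = [2.0, 3.0]
y_init = lora_forward(W, A, B_zero, x)
y_tuned = lora_forward(W, A, B_tuned, x)

# Two hypothetical experts, mixed with equal router scores.
experts = [
    lambda v: lora_forward(W, A, B_zero, v),
    lambda v: lora_forward(W, A, B_tuned, v),
]
y_mix = moe_forward(experts, [0.0, 0.0], x)
print(y_init, y_tuned, y_mix)
```

The appeal for low-resource dialects is the parameter count: for a d x k weight, a rank-r adapter trains only r * (d + k) values instead of d * k.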

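The screenshot shows the “DiaMoE-TTS: Multi-Dialect Speech Synthesis” interface. The top of the page introduces the IPA-based Mixture-of-Experts design and parameter-efficient adaptation methods such as LoRA and conditioning adapters. In the middle is a “Generate Speech” button. Below it is an example text input box that supports nine Chinese dialects, and on the right are the generated speech waveform and the voice reference (dialect prompt). The bottom lists the supported dialects with their corresponding prompt voices, and also notes that the model uses a KPL model for dialect synthesis, along with information such as generation time.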

SAM-Audio: Segment Anything in Audio

SAM-Audio is Meta’s audio source separation foundation model. It can isolate a target sound from a mixed audio signal using natural language descriptions, visual cues from video, or a selected time span.

For example, a user can describe the sound they want to separate, such as “man speaking,” “dog barking,” “car engine,” or “piano playing.” The model then attempts to separate the target audio from everything else in the mixture.

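The screenshot shows the interface of Meta’s SAM-Audio model, which performs audio source separation. The left side displays the waveforms of two input audio tracks. Below them, the “Sound Description” input box contains the example instruction “man speaking,” next to an “Enable Span Prediction” checkbox and an orange “Separate Sound” button at the bottom. The right side shows the output waveform of the separated target sound, followed by a categorized list of example descriptions covering voices, animal sounds, musical instruments, and other kinds of sounds to separate.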

PrismAudio: Video-to-Audio Generation with Decomposed CoT and Multi-Dimensional Rewards

PrismAudio is a video-to-audio generation model from Tongyi Lab. It focuses on generating audio that matches the visual scene, timing, atmosphere, and spatial feeling of a video.

The model introduces a decomposed Chain-of-Thought planning process. Instead of treating video-to-audio generation as one single reasoning step, it separates the process into semantic, temporal, aesthetic, and spatial dimensions. Each dimension is paired with a targeted reward signal for reinforcement learning.

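The screenshot shows the interface of the PrismAudio video-to-audio generation model. The left side is the input area, with an “Upload Video” button and a video preview window below it showing a woman sitting on a bench. Further down is a “Caption / Prompt” field with the example text “A girl in the rain.” The right side is the run log, listing steps such as preparing the video and checking its duration. The bottom is the output area, which presents the generated audio and video.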

DreamOmni2: Multimodal Instruction-Based Image Editing and Generation

DreamOmni2 is a multimodal image editing and generation model from CUHK JIA Lab. It has been accepted to CVPR 2026 as a Highlight paper.

The model is built on FLUX.1-Kontext-dev and uses a fine-tuned Qwen2.5-VL-7B visual language model to handle instructions. It supports natural language prompts together with reference images, which makes it suitable for tasks such as object replacement, style transfer, pose imitation, and concept-driven generation.

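The image shows a DreamOmni2 editing and generation example. The upper left is the original street scene and the upper right is a photo of a person. Below is the edited result, in which the person stands in the street scene and blends naturally with the background.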

PixelRefer: Fine-Grained Object Understanding for Images and Videos

PixelRefer is a unified image and video object understanding framework from Alibaba DAMO Academy. It focuses on fine-grained object-centric comprehension instead of only describing an entire scene.

The framework supports region-level pointing, captioning, and question answering. It also introduces a scale-adaptive object tokenizer and a lighter PixelRefer-Lite variant to make object representation more compact and efficient.

The screenshot shows the PixelRefer demo interface. The title at the top reads “Spatial-temporal object referring with arbitrary granularity.” The page displays a cityscape image that includes the Brooklyn Bridge and skyscrapers. Below it are “Image” and “Video” tabs, with “Image” selected. At the bottom of the interface are a “Generate Caption” button and a “Model Status” area.

Unlimited-OCR: One-Shot Long-Document OCR and Layout Parsing

Unlimited-OCR is an OCR and document layout parsing project released by Baidu in 2026. It is designed for long-document parsing rather than only single-page recognition.

The project can process single document images, multi-page images, and pages converted from PDFs. It is especially useful for papers, reports, scanned documents, long tables, and multi-page structured materials.

The screenshot shows the interface of Unlimited-OCR, the project Baidu released in 2026. The left side is the document upload area, with the prompts “Drop your document here” and “or click anywhere to browse,” plus “PDF,” “image,” and “text” options. The right side is the OCR output area, which displays “OCR output will appear here” and “Use a document size greater than 1MB.”

EdgeTAM: Promptable Image and Video Segmentation for Edge Devices

EdgeTAM is an on-device Track Anything Model developed by Meta Reality Labs and NTU S-Lab. It is designed for resource-constrained devices while keeping the interactive segmentation ability of SAM-style models.

The model reduces the memory attention bottleneck of SAM 2 through a 2D Spatial Perceiver and a distillation pipeline. In practice, that means it can support promptable segmentation and video object tracking more efficiently on edge hardware.
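The general Perceiver idea behind this kind of compression can be sketched as follows. This is an illustration of the principle only, not EdgeTAM's module: a small, fixed set of latent queries cross-attends over a long list of memory tokens, so later layers attend to a handful of latents instead of the full memory. The token values, sizes, and the use of memory tokens as both keys and values are assumptions, and the spatial structure that the "2D Spatial" name implies is not modeled here.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, memory):
    """Each latent query attends over all memory tokens (keys = values = memory)."""
    dim = len(memory[0])
    compressed = []
    for q in latents:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        compressed.append(
            [sum(w * tok[j] for w, tok in zip(weights, memory)) for j in range(dim)]
        )
    return compressed


# Six memory tokens (for example, features from past frames), two latent queries.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0], [2.0, 0.0]]
latents = [[0.0, 0.0], [1.0, -1.0]]  # in a real model these would be learned

compressed = cross_attend(latents, memory)
print(len(memory), "memory tokens ->", len(compressed), "latents")
```

A zero query spreads its attention uniformly, so the first latent here is simply the mean of the memory tokens. Whatever attends to the memory afterwards now pays a cost proportional to the number of latents rather than the number of stored tokens.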

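The screenshot shows the EdgeTAM demo interface, titled “EdgeTAM: On-Device Track Anything Model.” The left side is the input section, with a “Choose Image” button at the top and the image “16943930.png” below it, which contains a blue infinity symbol. The right side is the results section, showing the segmentation of the infinity symbol, with foreground (include) and background (exclude) options. Below that are readouts such as “Score: 0.6992 | Mask area: 5774 pixels,” along with “Reset All Points” and “Undo Last Point” buttons.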

Step-Audio-EditX: Zero-Shot Voice Cloning and Expressive Audio Editing

Step-Audio-EditX is an audio editing model from StepFun. It combines a 3B-parameter LLM-based audio model with reinforcement learning to support zero-shot voice cloning and expressive audio editing.

The model can handle Mandarin, English, Sichuanese, Cantonese, Japanese, and Korean. It is built for tasks such as emotion control, speaking-style editing, paralinguistic editing, and iterative audio refinement.

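The screenshot shows the Step-Audio-EditX interface for zero-shot voice cloning and expressive audio editing. The interface has two tabs, “Voice Cloning” and “Audio Editing,” with “Voice Cloning” selected. On the left is the “Input Audio (Reference Voice)” field, followed by the “Target Text (Text to Synthesize)” area containing the example text “Hi, the weather is good today.” and a “CLONE” button at the bottom. On the right is the “Cloned Audio Output” area, which shows the cloned audio waveform and a progress bar, with the message “Clone completed. Output duration: 4.2s” at the bottom.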

Nemotron 3.5 ASR Streaming 0.6B: Lightweight Streaming Speech Recognition

Nemotron 3.5 ASR Streaming 0.6B is an automatic speech recognition model from NVIDIA. It is built for low-latency streaming transcription and uses a cache-aware FastConformer-RNNT architecture.

The key design is context reuse. During streaming inference, the model reuses encoder context instead of recomputing overlapping audio chunks, which helps reduce redundant computation and improve real-time performance.
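The saving from context reuse can be shown with a toy example. This is not NeMo or Nemotron code, and a real model caches intermediate layer state rather than the simple per-frame features used here. The hypothetical `CountingEncoder` stands in for expensive per-frame computation, and each output depends on a frame plus up to `left` earlier frames. Both streaming loops below produce identical outputs, but only one re-encodes the overlapping context on every chunk.

```python
from collections import deque


class CountingEncoder:
    """Toy per-frame feature extractor that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, frame):
        self.calls += 1
        return frame * 2.0  # stand-in for an expensive computation


def context_mean(features, left):
    """Each output averages a feature with up to `left` earlier features."""
    out = []
    for i in range(len(features)):
        window = features[max(0, i - left): i + 1]
        out.append(sum(window) / len(window))
    return out


def stream_recompute(audio, chunk, left, encoder):
    """Baseline: re-encode the overlapping left context for every chunk."""
    outputs = []
    for start in range(0, len(audio), chunk):
        ctx_start = max(0, start - left)
        feats = [encoder(f) for f in audio[ctx_start:start + chunk]]
        outputs.extend(context_mean(feats, left)[start - ctx_start:])
    return outputs


def stream_cached(audio, chunk, left, encoder):
    """Cache-aware: keep the last `left` encoded features and encode only new frames."""
    outputs = []
    cache = deque(maxlen=left)
    for start in range(0, len(audio), chunk):
        new_feats = [encoder(f) for f in audio[start:start + chunk]]
        feats = list(cache) + new_feats
        outputs.extend(context_mean(feats, left)[len(cache):])
        cache.extend(new_feats)
    return outputs


audio = [float(i) for i in range(12)]  # 12 illustrative "frames"
enc_a, enc_b = CountingEncoder(), CountingEncoder()
out_a = stream_recompute(audio, chunk=4, left=2, encoder=enc_a)
out_b = stream_cached(audio, chunk=4, left=2, encoder=enc_b)
print(out_a == out_b, enc_a.calls, enc_b.calls)
```

With 12 frames, chunks of 4, and 2 frames of left context, the baseline encodes 16 frames while the cached version encodes 12. The gap widens as the context grows relative to the chunk size.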

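The screenshot shows the interface of the Nemotron 3.5 ASR Streaming 0.6B automatic speech recognition model. A prompt at the top asks the user to upload or record a short speech clip for the CPU demo. In the middle is an audio waveform, followed by a target language selector set to en-US and an attention context size field showing 56.13. The orange area at the bottom is the “Transcribe” button, and below it is the transcription text area, which displays a passage describing a country road and a school classroom.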

Popular Encyclopedia Entries

HyperAI also highlighted five popular AI encyclopedia entries this week:

Large Language Model (LLM)
World Action Model (WAM)
Harmonic Mean
Virtual Screening
Reinforcement Learning from AI Feedback (RLAIF)

HyperAI’s wiki collects hundreds of AI-related concepts and explanations. It is useful for readers who want a quick way to understand terms that often appear in papers, tutorials, and model documentation.

AI Conference Deadlines in July

The original update also lists several AI and computer science conference deadlines in July. All deadline times are marked as AoE time.

Date	Time (AoE)	Conference
July 09	23:59:59	POPL 2027
July 10	23:59:59	ICSE 2027
July 17	23:59:59	SIGMOD 2027
July 28	23:59:59	AAAI 2027
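AoE (Anywhere on Earth) corresponds to UTC-12, so each deadline falls about half a day later in UTC. The sketch below converts the listed deadlines using only Python's standard library. The year 2026 is an assumption, since the table gives only the month and day, and the helper name `aoe_deadline_to_utc` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Anywhere on Earth is a fixed offset of UTC-12 with no daylight saving.
AOE = timezone(timedelta(hours=-12), name="AoE")

YEAR = 2026  # assumption: the source table lists only month and day


def aoe_deadline_to_utc(year, month, day, hour=23, minute=59, second=59):
    """Interpret a wall-clock time as AoE and return it in UTC."""
    local = datetime(year, month, day, hour, minute, second, tzinfo=AOE)
    return local.astimezone(timezone.utc)


deadlines = {
    "POPL 2027": (7, 9),
    "ICSE 2027": (7, 10),
    "SIGMOD 2027": (7, 17),
    "AAAI 2027": (7, 28),
}

for name, (month, day) in deadlines.items():
    print(name, aoe_deadline_to_utc(YEAR, month, day).isoformat())
```

To see a deadline in your own timezone, call `.astimezone()` with no argument on the returned value. Always confirm the exact deadline on the conference's own call for papers.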

About HyperAI

HyperAI is an artificial intelligence and high-performance computing community. Its website provides public resources for developers, researchers, and AI learners.

According to the original source, HyperAI has already collected or supported:

2,100+ public datasets with domestic acceleration nodes
700+ classic and popular online tutorials
300+ AI4Science paper case studies
700+ AI-related encyclopedia entries
A complete Chinese documentation mirror for Apache TVM

FAQ

What is Irodori-TTS-500M-v3?

Irodori-TTS-500M-v3 is an open Japanese text-to-speech model based on an RF-DiT architecture. It supports Japanese speech generation, short-reference zero-shot voice cloning, and emoji-based style control.

Can Irodori-TTS clone a voice without fine-tuning?

Yes. The original update describes Irodori-TTS as supporting zero-shot voice cloning from a short reference audio clip, typically around 3 to 10 seconds. The effect still depends on the quality and clarity of the reference audio.

What is SAM-Audio used for?

SAM-Audio is used for prompt-based audio source separation. Users can describe the sound they want to extract, provide visual cues, or specify a time range to isolate a target sound from a mixed recording.

What is the difference between video matting and video segmentation?

Video segmentation usually separates objects into regions or masks, while video matting estimates a more detailed alpha matte. Matting is especially important for clean foreground extraction, hair detail, semi-transparent edges, and compositing.
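The difference shows up when the foreground is composited onto a new background with the standard equation C = alpha * F + (1 - alpha) * B. The sketch below uses one row of grayscale pixels with illustrative values. A matte keeps fractional alpha at soft edges, while a segmentation mask is modeled here, as a simplification, by thresholding the same alpha to 0 or 1.

```python
def composite(alpha, fg, bg):
    """Per-pixel compositing: C = alpha * F + (1 - alpha) * B."""
    return [a * f + (1.0 - a) * b for a, f, b in zip(alpha, fg, bg)]


def binarize(alpha, threshold=0.5):
    """Turn a soft matte into a hard mask, as a stand-in for segmentation output."""
    return [1.0 if a >= threshold else 0.0 for a in alpha]


# One row of pixels crossing a hair-like soft edge (values are illustrative).
alpha = [1.0, 0.75, 0.25, 0.0]
foreground = [200.0, 200.0, 200.0, 200.0]
background = [40.0, 40.0, 40.0, 40.0]

soft = composite(alpha, foreground, background)
hard = composite(binarize(alpha), foreground, background)

print("matte:", soft)
print("mask: ", hard)
```

The matte produces a gradual transition across the edge, while the hard mask jumps straight from foreground to background, which is what causes jagged or haloed edges in composited video.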

What does PrismAudio generate?

PrismAudio generates audio for video. It tries to align generated sound with the video’s semantic content, timing, aesthetic feeling, and spatial cues.

Why is Unlimited-OCR useful for long documents?

Unlimited-OCR is designed for long-horizon parsing, not just isolated single-page OCR. It can be useful when dealing with papers, reports, scanned files, long tables, or multi-page PDF-derived images.

Is Nemotron 3.5 ASR Streaming 0.6B suitable for real-time speech transcription?

Yes, it is designed for low-latency streaming ASR. Its cache-aware FastConformer-RNNT architecture reuses context during streaming inference, which helps reduce redundant computation.

Related Tools

Irodori-TTS: Open-source Japanese TTS with reference-audio voice cloning and style control.
Irodori-TTS-500M-v3 on Hugging Face: Model page for the 500M v3 Japanese TTS checkpoint.
SAM-Audio: Meta’s repository for Segment Anything in Audio inference and examples.
MatAnyone 2: Project page for the MatAnyone 2 video matting framework.
InSpatio-World: Project page for real-time interactive 4D world simulation.
DiaMoE-TTS: GitHub repository for IPA-based multi-dialect speech synthesis.
PrismAudio: Project page for video-to-audio generation with decomposed CoT and multi-dimensional rewards.
DreamOmni2: Open-source multimodal instruction-based image editing and generation project.
PixelRefer: Alibaba DAMO Academy’s framework for fine-grained image and video object understanding.
Unlimited-OCR: Baidu’s long-horizon OCR and document parsing project.
EdgeTAM: Meta’s on-device track-anything model for promptable image and video segmentation.
Step-Audio-EditX: StepFun’s model for zero-shot voice cloning and expressive audio editing.
Nemotron 3.5 ASR Streaming 0.6B: NVIDIA’s Hugging Face model page for low-latency streaming ASR.

Summary

This weekly update brings together a useful group of new AI demos and model resources, especially around audio generation, speech recognition, video processing, image understanding, and long-document OCR.

The most practical entries are Irodori-TTS for Japanese voice generation, SAM-Audio for prompt-based sound separation, MatAnyone 2 for clean video matting, Unlimited-OCR for long documents, and Nemotron 3.5 ASR for streaming speech recognition.

Overall, this roundup is useful for readers who want to quickly discover which new AI models are worth testing, what each one does, and where to try them.