CrossGaussian

CafeTorro®

September 2024 ~ October 2025

🌐 ACM UIST 2025 Adjunct

📺️ IEEE ISMAR 2025 Research Demonstration

CrossGaussian: Enhancing Remote Collaboration through 3D Gaussian Splatting and Real-time 360° Streaming


Abstract

In remote collaboration systems, remote users often experience information asymmetry and limited interactivity when collaborating with on-site users using virtually reconstructed scenes of the physical environment. Real-time 360° camera streaming mitigates the narrow field-of-view limitations of conventional video conferencing systems by providing a wide-angle, fast-rendered view; however, the lack of depth information still restricts active and free spatial exploration. Conversely, offline CAD-based scene reconstruction allows free navigation but requires substantial time and cost to produce. To address these issues, this study introduces 3D Gaussian Splatting (3DGS) — a learning-based neural rendering technique capable of rapidly and accurately reconstructing large-scale physical environments with high responsiveness. CrossGaussian integrates real-time 360° video streaming and 3DGS-based large-scale scene reconstruction through an automated pipeline, thereby presenting the first room-scale remote collaboration design space that enables free-viewpoint exploration and novel visual interactions in remote collaborative environments.

Introduction

CrossGaussian is a first-author project I led from topic ideation to publication during my junior winter to senior fall in an HCI research lab. I defined the research direction, conducted literature reviews, designed the system, ran user studies, and presented the work at international conferences. In the early phase, I analyzed 20+ top-tier papers from CHI, UIST, and CVPR focused on remote collaboration systems, 3D reconstruction, and AI-based novel view synthesis. This review revealed the limitations of photogrammetry, NeRF, and Instant-NGP in remote collaboration, particularly high computational cost, slow processing, and limited interactivity. Based on this, I identified 3D Gaussian Splatting (3DGS) as a promising approach due to its explicit rendering and real-time performance. I independently designed an end-to-end prototype pipeline and developed it with co-authors. After building the prototype, I conducted a user study with 24 participants, collecting data via NASA-TLX, SUS, and custom questionnaires. Insights from early feedback led me to refine the research focus toward defining and exploring a design space for 3DGS-driven visualization and interaction techniques in remote collaboration. The work culminated in acceptance to the ACM UIST 2025 Poster Session and the ISMAR 2025 Demo Session, where I demonstrated the system live for three days. CrossGaussian demonstrates my ability to bridge cutting-edge AI rendering technology with human-centered design, delivering a functioning system that advances remote collaboration through HCI-driven insights.

BACKGROUND


In co-located collaboration, participants can freely move, explore, and interact within the shared physical space. In remote collaboration, however, this autonomy is significantly constrained. Remote users depend on the on-site collaborator to look behind objects or change viewpoints within a video feed, increasing communication burden, causing unnecessary coordination, and ultimately limiting interaction. Some prior work mounts cameras on robotic platforms to provide spatial context, but despite this advantage, such systems often induce simulator sickness for remote users. Therefore, enabling free exploration of the physical environment for remote collaborators remains an open challenge.

RESEARCH

Real-time 360° video streaming partially mitigates viewpoint limitations by providing a wide field of view; however, the lack of depth information prevents users from actively understanding spatial structure or estimating object distance. While manually created 3D models offer another alternative, fully modeling every space is inefficient and costly. To address this, recent remote collaboration research has explored progressive reconstruction using camera-based photogrammetry. Yet, because this approach relies on image-based Structure-from-Motion (SfM) to produce surface-centric meshes, it suffers from limitations in resolution, accuracy, and responsiveness. More recently, Neural Radiance Fields (NeRF) have been adopted for remote collaboration, but their high computational cost still makes them unsuitable for real-time interactive environments.



In contrast, 3D Gaussian Splatting (3DGS), a recently emerged neural rendering approach for novel view synthesis, represents a scene as a collection of numerous Gaussian primitives (defined by position, color, covariance, and opacity) to enable fast rendering. Unlike NeRF, which encodes a scene implicitly through neural networks, 3DGS adopts an explicit, computation-efficient structure optimized for rapid processing. Owing to these characteristics, 3DGS offers significantly faster training, higher rendering performance, and better scalability to large-scale or dynamic environments than NeRF. Building on this, our study integrates real-time contextual 360° video streaming with the fast, precise, and responsive capabilities of 3DGS for remote collaboration. Furthermore, we explore a room-scale design space for remote environment exploration and interaction techniques based on this integration.
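For concreteness, the sketch below shows what one such primitive stores and how its anisotropic covariance is assembled from a rotation and per-axis scales (Sigma = R S S^T R^T, as in the original 3DGS formulation). The field names are illustrative; actual implementations differ in attribute naming and storage.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianPrimitive:
    """Illustrative parameters of a single 3DGS primitive (field names are not tied to any implementation)."""
    position: np.ndarray   # (3,) world-space center of the Gaussian
    rotation: np.ndarray   # (4,) unit quaternion (w, x, y, z); with `scale` it defines the covariance
    scale: np.ndarray      # (3,) per-axis extent (often stored in log space during training)
    opacity: float         # alpha used for blending during rasterization
    sh_coeffs: np.ndarray  # (k, 3) spherical-harmonics coefficients for view-dependent color

def covariance(rotation: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Sigma = R S S^T R^T: the anisotropic covariance that is projected to a screen-space splat."""
    w, x, y, z = rotation
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    S = np.diag(scale)
    return R @ S @ S.T @ R.T
```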

SYSTEM ARCHITECTURE

To integrate 3DGS and real-time camera streaming into a remote collaboration environment, I independently designed the full system pipeline.


1. Data Collection — Real-Time Synchronized Input: The system begins with real-time 360° video capture. Footage from an Insta360 camera is automatically stored locally and immediately referenced by a custom SDK plugin that forwards frames to a remote GPU server for multi-view optimization and 3D scene training. As soon as video is recorded, it is automatically passed into the image-based point-cloud generation pipeline, triggering reconstruction. Simultaneously, 360° video is streamed in real time via H.264 packets.

2. Reconstruction — Automated and Remote Gaussian Pipeline: To eliminate manual execution and path configuration during collaboration, I built an automated end-to-end pipeline connecting Unity and Python. A Python script (Putty.py) uses Paramiko to create SSH/SFTP connections, sending commands to the remote GPU server for training (a minimal Paramiko sketch follows this list). Only video capture occurs locally; all heavy computation (SfM, Gaussian optimization, conversion) runs automatically on the remote server. Once training completes, result files are automatically retrieved and rendered in the collaboration environment.

3. Streaming — Reliable and Synchronized Transmission: Real-time video transmission is handled via a TCP-based streaming protocol. Each encoded frame is segmented into small chunks, and decoding occurs only after all chunks are received. ACK signals ensure reliable delivery with no duplication or loss, enabling low-latency, glitch-free 360° streaming synchronized with the 3DGS output (see the socket sketch after this list).

4. Rendering — Unified Visualization and Immersive Overlay: Finally, 360° video and Gaussian scenes are fused into a single space. Incoming video is decoded in real time using FFmpeg’s GPU decoder (h264_cuvid) and converted from NV12 to RGBA via NVIDIA NPP. A shader maps each fisheye pixel into spherical space, projecting the camera’s field of view onto a virtual dome for wide-FOV real-time streaming (the underlying mapping is sketched after this list). Gaussian rendering results are composited on the same render target, producing a unified immersive scene.
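To make the reconstruction automation in step 2 concrete, here is a minimal Paramiko sketch of the upload-train-download cycle. The host, credentials, paths, and training command are placeholders for illustration and do not reflect the actual Putty.py configuration.

```python
import os
import paramiko

# Placeholder connection details; the real Putty.py uses its own configuration.
HOST, USER, KEY = "gpu-server.example.org", "trainer", "/home/user/.ssh/id_rsa"

def run_remote_training(local_frames_dir: str, remote_dir: str,
                        remote_result: str, local_result: str) -> None:
    """Upload captured frames, launch 3DGS training on the GPU server, then fetch the trained scene."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(HOST, username=USER, key_filename=KEY)

    # SFTP: push the extracted frames to the remote server.
    sftp = client.open_sftp()
    for name in os.listdir(local_frames_dir):
        sftp.put(os.path.join(local_frames_dir, name), f"{remote_dir}/{name}")

    # SSH: run SfM + Gaussian optimization remotely (command is illustrative).
    cmd = f"cd {remote_dir} && python train_gaussians.py --data . --out {remote_result}"
    _stdin, stdout, _stderr = client.exec_command(cmd)
    stdout.channel.recv_exit_status()        # block until training finishes

    # SFTP: pull the trained scene back for rendering in the collaboration client.
    sftp.get(remote_result, local_result)
    sftp.close()
    client.close()
```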
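The acknowledged, chunked transport in step 3 follows a pattern like the one below. The chunk size, header layout, and single-byte ACK are assumptions made for this sketch, not the system's actual wire format.

```python
import socket
import struct

CHUNK = 16 * 1024                       # assumed chunk size
HEADER = struct.Struct("!IHHI")         # frame id, chunk index, chunk count, payload size (assumed layout)
ACK = b"\x06"

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a TCP stream."""
    buf = b""
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("stream closed")
        buf += part
    return buf

def send_frame(sock: socket.socket, frame_id: int, encoded: bytes) -> None:
    """Split one encoded H.264 frame into chunks and wait for an ACK after each chunk."""
    chunks = [encoded[i:i + CHUNK] for i in range(0, len(encoded), CHUNK)]
    for idx, chunk in enumerate(chunks):
        sock.sendall(HEADER.pack(frame_id, idx, len(chunks), len(chunk)) + chunk)
        if recv_exact(sock, 1) != ACK:
            raise ConnectionError(f"chunk {idx} of frame {frame_id} not acknowledged")

def recv_frame(sock: socket.socket) -> tuple[int, bytes]:
    """Reassemble a frame; the decoder is fed only after every chunk has arrived."""
    parts, total, frame_id = {}, None, None
    while total is None or len(parts) < total:
        frame_id, idx, total, size = HEADER.unpack(recv_exact(sock, HEADER.size))
        parts[idx] = recv_exact(sock, size)
        sock.sendall(ACK)               # acknowledge this chunk
    return frame_id, b"".join(parts[i] for i in range(total))
```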
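The dome projection in step 4 is implemented as a shader; the NumPy sketch below only reproduces the underlying pixel-to-sphere math, assuming an equidistant fisheye model (the actual shader may use the lens's calibrated projection).

```python
import numpy as np

def fisheye_to_sphere(u: np.ndarray, v: np.ndarray, fov_deg: float = 180.0) -> np.ndarray:
    """Map normalized fisheye pixel coordinates (u, v in [-1, 1]) to unit directions on a sphere.

    Assumes an equidistant fisheye model: the polar angle grows linearly with the
    radial distance from the image center.
    """
    r = np.sqrt(u ** 2 + v ** 2)                  # radial distance from the image center
    theta = r * np.radians(fov_deg) / 2.0         # polar angle away from the optical axis
    phi = np.arctan2(v, u)                        # azimuth around the optical axis
    return np.stack([
        np.sin(theta) * np.cos(phi),              # x
        np.sin(theta) * np.sin(phi),              # y
        np.cos(theta),                            # z (optical axis)
    ], axis=-1)

# Example: directions for a coarse pixel grid, placed on a virtual dome of radius 5 m.
u, v = np.meshgrid(np.linspace(-1, 1, 8), np.linspace(-1, 1, 8))
dome_vertices = 5.0 * fisheye_to_sphere(u, v)
```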











DESIGN IMPLEMENTATION

Leveraging the explicit scene representation structure of 3DGS and its precise depth rendering capabilities at room scale, we explored a design space to enhance explorability and interactivity in remote collaboration environments. Inspired by existing Cross-Reality scene blending research, we designed the following key features for remote collaboration.



Blending of Overlapping Scenes

Abrupt transitions between real-time streaming and 3DGS scenes can cause motion sickness and reduced presence. To mitigate this, our system implements a feature that visually separates yet blends 3DGS scenes with 360-degree video streaming. By adjusting the transparency of each overlaid scene and applying color scaling techniques, users can maintain real-time environmental context (360-degree stream) while simultaneously exploring with free viewpoints (3DGS). This reduces cognitive load during context switching and preserves presence. The overlay structure also enables visual distinction of non-salient regions through color scaling of 3DGS scenes or pixel value adjustments of 360-degree footage. Similar to Gruenefeld et al.'s adjustable scene blending, users can customize their optimal collaboration experience by controlling the blending ratio between 3DGS and 360-degree video to balance realism and exploration freedom.
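As a rough illustration of this blending control, assuming the 360° stream and the 3DGS render are already available as RGBA buffers (the real system performs the blend in Unity shaders):

```python
import numpy as np

def blend_layers(stream_rgba: np.ndarray, gaussian_rgba: np.ndarray,
                 blend_ratio: float, desaturate_stream: float = 0.0) -> np.ndarray:
    """Linearly blend the live 360-degree stream with the 3DGS render.

    blend_ratio = 0 shows only the stream, 1 shows only the reconstructed scene;
    desaturate_stream applies color scaling so non-salient stream regions recede.
    Inputs are float arrays in [0, 1] with shape (H, W, 4).
    """
    stream = stream_rgba.copy()
    if desaturate_stream > 0.0:
        gray = stream[..., :3].mean(axis=-1, keepdims=True)
        stream[..., :3] = (1 - desaturate_stream) * stream[..., :3] + desaturate_stream * gray
    return (1 - blend_ratio) * stream + blend_ratio * gaussian_rgba
```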

Occlusion-Aware Exploration

Automatic Occlusion Detection and Visualization: Because the camera in a remote space captures the scene from a single viewpoint, there is a fundamental limitation in that it cannot visualize areas occluded by structures such as walls or pillars. For example, when an on-site worker needs to inspect equipment located behind a column, the 360° camera alone cannot reveal the hidden region. To address this, our system utilizes the 3D spatial information of the 3D Gaussian Splatting (3DGS) model to automatically detect and visualize occluded areas based on the camera’s position within the physical environment. Specifically, the system first computes which regions are blocked by surrounding structures relative to the current position and orientation of the 360° camera in the 3DGS model. It then compares the depth values of adjacent pixels to estimate each pixel’s precise depth and pseudo-normal direction, determining which parts correspond to shadows or occluded regions. Using Unity compute shaders and HLSL, the system performs real-time GPU-based shadow computation to quickly identify these occluded areas and visually highlight them for the user. Through this approach, remote users can intuitively perceive and utilize the camera’s viewpoint coverage in the on-site environment for more effective collaboration.
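The occlusion test itself runs in Unity compute shaders; the NumPy sketch below only illustrates the underlying depth-comparison idea, with a simplified pinhole camera standing in for the 360° camera model.

```python
import numpy as np

def occluded(points: np.ndarray, cam_pos: np.ndarray, cam_rot: np.ndarray,
             depth_map: np.ndarray, fx: float, fy: float, cx: float, cy: float,
             eps: float = 0.05) -> np.ndarray:
    """Flag scene points whose line of sight to the camera is blocked by nearer geometry.

    points: (N, 3) world-space samples from the 3DGS model.
    cam_rot: (3, 3) world-to-camera rotation; depth_map: (H, W) depths rendered from the camera pose.
    """
    cam = (points - cam_pos) @ cam_rot.T                       # transform into camera space
    z = cam[:, 2]
    behind = z <= 0                                            # points behind the camera are never visible
    z_safe = np.where(behind, 1.0, z)                          # avoid division by zero
    u = np.clip((fx * cam[:, 0] / z_safe + cx).astype(int), 0, depth_map.shape[1] - 1)
    v = np.clip((fy * cam[:, 1] / z_safe + cy).astype(int), 0, depth_map.shape[0] - 1)
    return behind | (z > depth_map[v, u] + eps)                # farther than the stored depth => occluded
```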


See-Through Capability: Our system provides see-through capabilities for remote 3D environments by leveraging depth information inherent in the 3DGS model. While photogrammetry relies on mesh-based representations with fixed surfaces, making transparency control difficult, Gaussian Splatting uses 3D Gaussians with alpha values, enabling natural semi-transparent rendering through alpha blending at the rendering stage. This allows users to see through objects and directly inspect spaces beyond them without complex viewpoint manipulation, enabling intuitive exploration and novel interactions without information loss caused by occlusion.
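Because opacity is a per-primitive attribute in 3DGS, see-through can be approximated by scaling the alphas of Gaussians inside a region of interest before rasterization. The array names below are illustrative, not the renderer's actual data layout.

```python
import numpy as np

def make_region_translucent(positions: np.ndarray, opacities: np.ndarray,
                            region_center: np.ndarray, region_radius: float,
                            alpha_scale: float = 0.2) -> np.ndarray:
    """Return per-Gaussian opacities with primitives inside a spherical region faded out.

    positions: (N, 3) Gaussian centers; opacities: (N,) alphas in [0, 1].
    alpha_scale controls how transparent the occluding structure becomes.
    """
    inside = np.linalg.norm(positions - region_center, axis=1) < region_radius
    out = opacities.copy()
    out[inside] *= alpha_scale      # alpha blending then reveals what lies behind the faded region
    return out
```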


PERCEPTUAL EVALUATION


We conducted a user study to examine how reconstruction delay affects users' perception of object presence and manipulability. While state-of-the-art 3D reconstruction often exceeds the real-time threshold of 33ms, prior work rarely addresses how such delays impact user perception. We conducted a within-subjects study with 18 participants across four randomized delay conditions: 0.15s, 1s, 10s, and 60s. After observing an object under each condition, participants rated perceived manipulability and trust in existence on a 7-point Likert scale. We used Friedman tests and Wilcoxon signed-rank tests for analysis. Results showed that reconstruction delay significantly decreased perceived manipulability, with scores of 5.8±1.6 (0.15s), 5.7±1.4 (1s), 5.2±1.4 (10s), and 4.3±1.7 (60s). Significant differences were observed between 0.15s-60s and 1s-60s conditions. Trust in existence showed a similar pattern: 6.2±1.2 (0.15s), 5.8±1.3 (1s), 5.1±1.5 (10s), and 4.3±1.8 (60s). Qualitative feedback revealed that most participants lost trust after 10s, with 60s delays making objects feel disconnected and significantly reducing willingness to interact.
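The analysis can be reproduced with SciPy along the lines of the sketch below; the ratings array is a random placeholder standing in for the per-participant Likert responses, not the study data.

```python
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Placeholder ratings: rows = 18 participants, columns = delay conditions (0.15 s, 1 s, 10 s, 60 s).
# Replace with the actual 7-point Likert responses; the values here are random stand-ins.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(18, 4))
conditions = ["0.15s", "1s", "10s", "60s"]

# Omnibus test across the four repeated-measures conditions.
stat, p = friedmanchisquare(*(ratings[:, i] for i in range(ratings.shape[1])))
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")

# Pairwise Wilcoxon signed-rank tests (apply a multiple-comparison correction in practice).
for i, j in combinations(range(ratings.shape[1]), 2):
    w, pw = wilcoxon(ratings[:, i], ratings[:, j])
    print(f"{conditions[i]} vs {conditions[j]}: W = {w}, p = {pw:.4f}")
```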




OUTCOME


This research was accepted as a first-author poster at ACM UIST (ACM Symposium on User Interface Software and Technology) 2025, one of the most prestigious conferences in user interface and interaction technology. It was also accepted for the demo session at IEEE ISMAR (International Symposium on Mixed and Augmented Reality) 2025, the world's leading conference in augmented and mixed reality, where we conducted live demonstrations for three days. At both conferences, we received significant interest and positive feedback from renowned researchers in the HCI field worldwide and experts from global companies regarding the system's real-time performance, practical utility in remote collaboration, and the innovative nature of our 3DGS-based approach.




MATERIALS

This project has been published as an Adjunct Proceedings paper in the ACM Digital Library. To access the full paper, click the image on the right to go to the publication page; the paper is freely available under Open Access.


