Zhengyuan (Dora) Dong

Ph.D. Student, Data Systems Group, Cheriton School of Computer Science, University of Waterloo

My Research Interests: Data Lakes, Model Lakes, Multi-agent Systems, AI for Science

I like jogging, AI for music, AI for productivity, and AI for the occult, and I have two parrots.

🏃🎹🧑‍💻🔮🦜🦜

I am open to collaboration and always welcome discussion.

News

  • 2025 Dec. Our demo paper LazyVLM was accepted to the ICDE Demo Track
  • 2025 Dec. We released our paper ModelTables on arXiv

Publications

  • ModelTables: A Corpus of Tables about Models. Zhengyuan Dong, Victor Zhong, Renée J. Miller. arXiv preprint arXiv:2512.16106 (2025)
  • InteracSPARQL: An Interactive System for SPARQL Query Refinement Using Natural Language Explanations. Xiangru Jian, Zhengyuan Dong, M. Tamer Özsu. arXiv preprint arXiv:2511.02002 (2025)
  • LazyVLM: Neuro-Symbolic Approach to Video Analytics. Xiangru Jian*, Wei Pang*, Zhengyuan Dong*, Chao Zhang*, M. Tamer Özsu. ICDE Demo Track (2026)
  • GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic Tasks. Hao Xu*, Xiangru Jian*, Xinjian Zhao*, Wei Pang*, Chao Zhang, Suyuchen Wang, Qixin Zhang, Zhengyuan Dong, Joao Monteiro, Bang Liu, Qiuzhuang Sun, Tianshu Yu. arXiv preprint arXiv:2504.12764 (2025)
  • BioMANIA: Simplifying bioinformatics data analysis through conversation. Zhengyuan Dong, Victor Zhong, and Yang Lu. bioRxiv (2023)

Service

  • Reviewer, IEEE Transactions on Multimedia (TMM), ACL ARR, Pattern Recognition, ACL/ICML/NeurIPS workshop (2025 - present)
  • Academic Graduate Mentor, UR2PHD Program, University of Waterloo (2025 - present)

Open Source Projects

ModelTables

Status: Completed ✅ in Jun 2025. Updated in Dec 2025

ModelTables is a benchmark corpus of tables in Model Lakes that captures structured semantics of performance and configuration tables often overlooked by text-only retrieval. Built from Hugging Face model cards, GitHub READMEs, and referenced papers, it links tables to their surrounding model and publication context. The corpus covers over 60K models and 90K tables, with multi-source ground truth using citation links, model-card inheritance, and shared training datasets. We evaluate table search methods including Data Lake operators (unionable, joinable, keyword) and IR baselines (dense, sparse, hybrid retrieval), demonstrating the first large-scale benchmark for structured model knowledge discovery.

LazyVLM

Status: Completed ✅ in Mar 2025. To Be Released

LazyVLM is a neuro-symbolic video analytics system that combines the flexibility of Vision Language Models (VLMs) with the efficiency of symbolic methods. It allows users to query open-domain video data at scale using a semi-structured text interface, decomposing complex video queries into efficient operations for robust and scalable analytics.

BioMANIA

Status: Completed ✅ in Oct 2023. Updated in Oct 2024

An AI-driven chatbot platform that simplifies bioinformatics data analysis through conversation. Features include front-end and back-end components, extensive data setup, model fine-tuning, and deployment solutions across Docker, Railway, and terminal CLI.

DocLocal

Status: Completed ✅ in Jun 2023

A GUI application that downloads and manages GitHub repository README files locally while offering integrated web search functionality through popular search engines. The tool streamlines documentation access by automatically fetching README files from repositories and displaying them in a user-friendly interface for offline browsing.

Teaching

  • Mentor, CS 399 Readings in Computer Science (F25)
  • Teaching Assistant, CS 348 Introduction to Database Systems (S24, S25, F25)
  • Teaching Assistant, CS 136 Elementary Algorithm Design and Data Abstraction (W24, F24, W25)

Honors

  • Prov-Doc Entrance Award, University of Waterloo, 2024
  • International Doctoral Student Award (IDSA), University of Waterloo, 2024

Talks