SimClone: Detecting Tabular Data Clones using Value Similarity
This program is tentative and subject to change.
Thu 1 May 2025 15:30 - 16:00 at Canada Hall 3 Poster Area - Thu Afternoon Break Posters 15:30-16:00
Data clones are multiple copies of the same data across datasets. The presence of data clones between datasets can cause problems such as difficulty managing data assets and data license violations when datasets containing clones are used to build AI software. However, detecting data clones is not trivial. The majority of prior studies in this area rely on structural information (e.g., font size, column headers) to detect data clones, but tabular datasets used to build AI software are typically stored without any structural information. In this paper, we propose a novel method called SimClone for detecting data clones in tabular datasets without relying on structural information. SimClone uses value similarities to detect data clones. We also propose a visualization approach as part of SimClone to help locate the exact position of the cloned data between a dataset pair. Our results show that SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC. In addition, SimClone’s visualization component helps identify the exact location of the data clone in a dataset with a Precision@10 value of 0.80 in the top 20 true positive predictions.
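To make two terms from the abstract concrete, here is a minimal Python sketch. It is not SimClone's actual algorithm, which this page does not describe. The Jaccard overlap of distinct column values is only one assumed, illustrative notion of "value similarity", and the function names `jaccard_similarity` and `precision_at_k` are hypothetical. Precision@k is the standard ranking metric: the fraction of the top-k ranked candidates that are correct.

```python
from typing import Hashable, Iterable, Sequence


def jaccard_similarity(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Share of distinct values that two columns have in common.

    Assumption: this is an illustrative similarity measure, not the one
    used by SimClone.
    """
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # two empty columns: treat as having no evidence of overlap
    return len(set_a & set_b) / len(set_a | set_b)


def precision_at_k(ranked_hits: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k ranked candidates that are true clones.

    ranked_hits[i] is True when the i-th highest-ranked candidate
    location really contains cloned data.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_hits[:k]) / k


# Two toy columns sharing the values 3 and 4 out of six distinct values.
overlap = jaccard_similarity([1, 2, 3, 4], [3, 4, 5, 6])

# Eight of the ten top-ranked candidate locations are correct.
p_at_10 = precision_at_k([True] * 8 + [False] * 2, k=10)
```

With these toy inputs, `overlap` is 2/6 and `p_at_10` is 0.8. Read this way, the Precision@10 of 0.80 reported in the abstract means that, on average, eight of the ten highest-ranked candidate locations were correct.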
Wed 30 Apr | Displayed time zone: Eastern Time (US & Canada)
13:30 - 14:00 | Wed Lunch Posters 13:30-14:00 | Research Track / Journal-first Papers / New Ideas and Emerging Results (NIER) | at Canada Hall 3 Poster Area
13:30 30m Poster | Pattern-based Generation and Adaptation of Quantum Workflows | Quantum, Research Track | Martin Beisel (Institute of Architecture of Application Systems (IAAS), University of Stuttgart); Johanna Barzen (University of Stuttgart); Frank Leymann (University of Stuttgart); Lavinia Stiliadou (Institute of Architecture of Application Systems (IAAS), University of Stuttgart); Daniel Vietz (University of Stuttgart); Benjamin Weder (Institute of Architecture of Application Systems (IAAS), University of Stuttgart)
13:30 30m Talk | Mole: Efficient Crash Reproduction in Android Applications With Enforcing Necessary UI Events | Journal-first Papers | Maryam Masoudian (Sharif University of Technology, Hong Kong University of Science and Technology (HKUST)); Heqing Huang (City University of Hong Kong); Morteza Amini (Sharif University of Technology); Charles Zhang (Hong Kong University of Science and Technology)
13:30 30m Talk | Automated Testing Linguistic Capabilities of NLP Models | Journal-first Papers | Jaeseong Lee (The University of Texas at Dallas); Simin Chen (University of Texas at Dallas); Austin Mordahl (The University of Texas at Dallas); Cong Liu (University of California, Riverside); Wei Yang (UT Dallas); Shiyi Wei (University of Texas at Dallas)
13:30 30m Poster | BSan: A Powerful Identifier-Based Hardware-Independent Memory Error Detector for COTS Binaries | Research Track | Wen Zhang (University of Georgia); Botang Xiao (University of Georgia); Qingchen Kong (University of Georgia); Le Guan (University of Georgia); Wenwen Wang (University of Georgia)
13:30 30m Talk | A Unit Proofing Framework for Code-level Verification: A Research Agenda | Formal Methods, New Ideas and Emerging Results (NIER) | Paschal Amusuo (Purdue University); Parth Vinod Patil (Purdue University); Owen Cochell (Michigan State University); Taylor Le Lievre (Purdue University); James C. Davis (Purdue University) | Pre-print
13:30 30m Talk | Listening to the Firehose: Sonifying Z3’s Behavior | New Ideas and Emerging Results (NIER)
13:30 30m Talk | Towards Early Warning and Migration of High-Risk Dormant Open-Source Software Dependencies | Security, New Ideas and Emerging Results (NIER) | Zijie Huang (Shanghai Key Laboratory of Computer Software Testing and Evaluation); Lizhi Cai (Shanghai Key Laboratory of Computer Software Testing & Evaluating, Shanghai Software Center); Xuan Mao (Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, China); Kang Yang (Shanghai Key Laboratory of Computer Software Testing and Evaluating, Shanghai Development Center of Computer Software Technology)
13:30 30m Poster | SimClone: Detecting Tabular Data Clones using Value Similarity | Journal-first Papers | Xu Yang (University of Manitoba); Gopi Krishnan Rajbahadur (Centre for Software Excellence, Huawei, Canada); Dayi Lin (Centre for Software Excellence, Huawei Canada); Shaowei Wang (University of Manitoba); Zhen Ming (Jack) Jiang (York University)
13:30 30m Talk | SolSearch: An LLM-Driven Framework for Efficient SAT-Solving Code Generation | Formal Methods, New Ideas and Emerging Results (NIER) | Junjie Sheng (East China Normal University); Yanqiu Lin (East China Normal University); Jiehao Wu (East China Normal University); Yanhong Huang (East China Normal University); Jianqi Shi (East China Normal University); Min Zhang (East China Normal University); Xiangfeng Wang (East China Normal University)
Thu 1 May | Displayed time zone: Eastern Time (US & Canada)
15:30 - 16:00 | Thu Afternoon Break Posters 15:30-16:00 | Journal-first Papers / Research Track / New Ideas and Emerging Results (NIER) / SE in Society (SEIS) | at Canada Hall 3 Poster Area
15:30 30m Talk | Mole: Efficient Crash Reproduction in Android Applications With Enforcing Necessary UI Events | Journal-first Papers | Maryam Masoudian (Sharif University of Technology, Hong Kong University of Science and Technology (HKUST)); Heqing Huang (City University of Hong Kong); Morteza Amini (Sharif University of Technology); Charles Zhang (Hong Kong University of Science and Technology)
15:30 30m Talk | Best ends by the best means: ethical concerns in app reviews | Journal-first Papers | Neelam Tjikhoeri (Vrije Universiteit Amsterdam); Lauren Olson (Vrije Universiteit Amsterdam); Emitzá Guzmán (Vrije Universiteit Amsterdam)
15:30 30m Talk | Shaken, Not Stirred. How Developers Like Their Amplified Tests | Journal-first Papers | Carolin Brandt (Delft University of Technology); Ali Khatami (Delft University of Technology); Mairieli Wessel (Radboud University); Andy Zaidman (Delft University of Technology)
15:30 30m Poster | BSan: A Powerful Identifier-Based Hardware-Independent Memory Error Detector for COTS Binaries | Research Track | Wen Zhang (University of Georgia); Botang Xiao (University of Georgia); Qingchen Kong (University of Georgia); Le Guan (University of Georgia); Wenwen Wang (University of Georgia)
15:30 30m Talk | Towards Early Warning and Migration of High-Risk Dormant Open-Source Software Dependencies | Security, New Ideas and Emerging Results (NIER) | Zijie Huang (Shanghai Key Laboratory of Computer Software Testing and Evaluation); Lizhi Cai (Shanghai Key Laboratory of Computer Software Testing & Evaluating, Shanghai Software Center); Xuan Mao (Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, China); Kang Yang (Shanghai Key Laboratory of Computer Software Testing and Evaluating, Shanghai Development Center of Computer Software Technology)
15:30 30m Talk | Exploring User Privacy Awareness on GitHub: An Empirical Study | Journal-first Papers | Costanza Alfieri (Università degli Studi dell'Aquila); Juri Di Rocco (University of L'Aquila); Paola Inverardi (Gran Sasso Science Institute); Phuong T. Nguyen (University of L’Aquila)
15:30 30m Poster | SimClone: Detecting Tabular Data Clones using Value Similarity | Journal-first Papers | Xu Yang (University of Manitoba); Gopi Krishnan Rajbahadur (Centre for Software Excellence, Huawei, Canada); Dayi Lin (Centre for Software Excellence, Huawei Canada); Shaowei Wang (University of Manitoba); Zhen Ming (Jack) Jiang (York University)
15:30 30m Talk | Strategies to Embed Human Values in Mobile Apps: What do End-Users and Practitioners Think? | SE in Society (SEIS) | Rifat Ara Shams (CSIRO's Data61); Mojtaba Shahin (RMIT University); Gillian Oliver (Monash University); Jon Whittle (CSIRO's Data61 and Monash University); Waqar Hussain (Data61, CSIRO); Harsha Perera (CSIRO's Data61); Arif Nurwidyantoro (Universitas Gadjah Mada)