Open Source Software Tools for Data Management and Deep Model Training Automation
Designing and optimizing deep models requires managing large datasets and conducting carefully controlled experiments that depend on large sets of hyper-parameters and problem-dependent software and data configurations. These experiments are executed by training the model under observation with varying configurations. Since a typical training run can take days even on proven acceleration hardware such as Graphics Processing Units (GPUs), avoiding human error in configuration preparation and securing the repeatability of the experiments are of utmost importance. Failed training runs lead to lost time, wasted energy, and frustration. Unrepeatable or poorly monitored and logged training runs, in turn, make it exceedingly hard to track performance and lock onto a successful, well-generalizing deep model. Hence, dataset management and training automation are crucial for training deep models efficiently. In this paper, we present two open source software tools that aim to achieve these goals: a Dataset Manager (DatumAid) tool and a Training Automation Manager (OrchesTrain) tool. DatumAid integrates with the Computer Vision Annotation Tool (CVAT) to facilitate the management of annotated datasets. It extends CVAT with functionality that allows users to filter labeled data, manipulate datasets, and export datasets for training purposes. The tool adopts a simple code structure while providing flexibility to users through configuration files. OrchesTrain aims to automate the model training process by facilitating rapid preparation and training of models in the desired style for the intended tasks. Users can seamlessly integrate their models prepared in the PyTorch library into the system and leverage the full capabilities of OrchesTrain. The Wandb, MLflow, and TensorBoard loggers can be used separately or simultaneously. To ensure reproducibility of the conducted experiments, all configurations and code are saved to the selected logger, organized in a YAML file, along with the serialized model files. Both software tools are publicly available on GitHub.
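The reproducibility idea described in the abstract (storing each run's configuration as YAML next to the serialized model, in one or several loggers at once) can be illustrated with a minimal sketch. This is not OrchesTrain's actual code or API: `FileLogger`, `snapshot_run`, `to_yaml`, and the file names `config.yaml` and `model.bin` are hypothetical names chosen for illustration, and a plain directory stands in for a Wandb, MLflow, or TensorBoard backend. The YAML writer handles only nested mappings of simple scalars so that the example needs nothing beyond the Python standard library.

```python
import hashlib
import tempfile
from pathlib import Path


def to_yaml(data, indent=0):
    """Serialize nested dicts of str/int/float/bool values to a small YAML subset.

    Assumption: string values contain no quotes or backslashes.
    """
    lines = []
    pad = "  " * indent
    for key, value in data.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 1))
        elif isinstance(value, bool):
            lines.append(f"{pad}{key}: {'true' if value else 'false'}")
        elif isinstance(value, str):
            lines.append(f'{pad}{key}: "{value}"')
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


class FileLogger:
    """Hypothetical stand-in for one logging backend; writes artifacts to a directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def log_artifact(self, name, payload):
        (self.root / name).write_bytes(payload)


def snapshot_run(config, model_bytes, loggers):
    """Store the run configuration (as YAML) and the model bytes in every logger."""
    snapshot = dict(config)
    # A checksum ties the saved configuration to the exact model file.
    snapshot["model_sha256"] = hashlib.sha256(model_bytes).hexdigest()
    yaml_text = "\n".join(to_yaml(snapshot)) + "\n"
    for logger in loggers:
        logger.log_artifact("config.yaml", yaml_text.encode("utf-8"))
        logger.log_artifact("model.bin", model_bytes)
    return yaml_text


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        backends = [FileLogger(Path(tmp) / "logger_a"), FileLogger(Path(tmp) / "logger_b")]
        run_config = {"lr": 0.001, "optimizer": {"name": "adam", "amsgrad": False}}
        print(snapshot_run(run_config, b"serialized-weights", backends))
```

Passing a list of loggers keeps the "separate or simultaneous" choice in the caller's hands: one element logs to a single backend, several elements log the same snapshot to all of them.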
Wed 13 Sep. Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
13:30 - 15:00 | Open Source and Software Ecosystems 2 | Research Papers / Journal-first Papers / Industry Showcase (Papers) at Room D | Chair(s): Paul Grünbacher, Johannes Kepler University Linz, Austria
13:30 12m Talk | Personalized First Issue Recommender for Newcomers in Open Source Projects | Research Papers | Wenxin Xiao School of Computer Science, Peking University, Jingyue Li Norwegian University of Science and Technology, Hao He Carnegie Mellon University, Ruiqiao Qiu Beijing Institute of Technology, Minghui Zhou Peking University
13:42 12m Talk | Understanding and Enhancing Issue Prioritization in GitHub | Research Papers | Yingying He Nanjing University of Aeronautics and Astronautics, Wenhua Yang Nanjing University of Aeronautics and Astronautics, Minxue Pan Nanjing University, Yasir Hussain Nanjing University of Aeronautics and Astronautics, Yu Zhou Nanjing University of Aeronautics and Astronautics
13:55 12m Research paper | Who is the Real Hero? Measuring Developer Contribution via Multi-dimensional Data Integration | Research Papers | Yuqiang Sun Nanyang Technological University, Zhengzi Xu Nanyang Technological University, Chengwei Liu Nanyang Technological University, Yiran Zhang Nanyang Technological University, Yang Liu Nanyang Technological University
14:08 12m Talk | Predicting Health Indicators for Open Source Projects (using Hyperparameter Optimization) | Journal-first Papers | Tianpei Xia North Carolina State University, Wei Fu North Carolina State University, Rui Shu North Carolina State University, Rishabh Agrawal North Carolina State University, Tim Menzies North Carolina State University
14:21 12m Talk | To Share, or Not to Share: Exploring Test-Case Reusability in Fork Ecosystems | Research Papers | Mukelabai Mukelabai The University of Zambia, Zambia, Christoph Derks Ruhr-University Bochum, Germany, Jacob Krüger Eindhoven University of Technology, Thorsten Berger Ruhr University Bochum
14:34 12m Talk | LiSum: Open Source Software License Summarization with Multi-Task Learning (Recorded talk) | Research Papers | Linyu Li, Sihan Xu Nankai University, Yang Liu Nanyang Technological University, Ya Gao Nankai University, Xiangrui Cai Nankai University, Jiarun Wu Nankai University, Wenli Song Civil Aviation University of China, Zheli Liu Nankai University
14:47 12m Talk | Open Source Software Tools for Data Management and Deep Model Training Automation | Industry Showcase (Papers) | Umut Tıraşoğlu ORDULU Corp., Abdussamet Türker ORDULU Corp., Adnan Ekici ORDULU Corp., Hayri Yiğit ORDULU Corp., Yusuf Enes Bölükbaşı ORDULU Corp., Toygar Akgun TOBB ETU