Introduction
Code clones refer to similar or identical code segments within or between software systems [11]. Code clones may occur because of code reuse, automatic code generation, and the use of specific design patterns [11]. Code cloning is a widely used software development practice.
Code clones were initially considered a negative factor in software quality because they may cause harmful effects such as defect propagation and reduced maintainability [11], [16], [20]. However, some studies have affirmed the value of code clones, finding that they can improve software development efficiency [14]. As software continues to grow in scale, the need for automated detection and management of code clones has gradually increased.
So far, several studies on code clone analysis have confirmed that code clones are relatively common in software systems and exhibit various effects [14], [16], [18], [19], [20], [21]. However, most prior studies focused on code clones within projects. Ossher et al. were the first to detect and analyze code clones on a large scale across multiple projects [22]. However, their study did not distinguish between clones within and between projects. Gharehyazie et al. focused on detecting and analyzing code clones between projects [2]. They detected code clones in 5753 Java projects across multiple fields on GitHub and produced a series of findings; for example, cross-project code clones account for approximately 5% of the total code and 10–30% of the code clones within a project. They also proposed an onion model of cross-project code cloning.
However, the research by Gharehyazie et al. exhibits two primary limitations: only clones that were exact matches of each other were detected, and only Java code was investigated. Studies have shown that the proportion of Type-3 clones (refer to Section III for the definition of this type) among code clones is relatively high. Moreover, software evolution also changes the types of code clones, thereby increasing the proportion of Type-3 clones. Furthermore, as the diversity of programming languages gradually increases, the characteristics of code clones may differ between languages [3]. The status of cross-project code clones can be investigated more effectively by expanding the scope of the investigation to multiple languages and Type-3 clones.
Owing to device heterogeneity and platform compatibility factors, the Internet of Things (IoT) may be the software development field with the largest number of programming languages in use. Makhshari et al. analyzed the challenges in IoT system development, including third-party breaking changes [1]. We believe that inconsistency with copied third-party code also poses risks. Therefore, we chose IoT systems as the target field of this investigation.
In summary, this study has two research questions:
RQ1: How prevalent is cross-project code cloning in IoT systems?
RQ2: How does code cloning affect the targeted IoT systems in terms of maintainability?
To answer RQ1, we targeted IoT open-source software (OSS) on GitHub and conducted a Type-3-level cross-project code clone investigation covering multiple languages. First, we collected representative OSS systems from GitHub, including 123 software systems in nine languages. Then, we divided the collected OSS systems into three categories: Communication, Device, and App/Cloud. Finally, we generated 18 groups from all the repositories according to their language and category. We used the Multilingual Syntactic Code Clone Detector (MSCCD) [3] to detect clones in each group and analyzed the results.
For RQ2, we attempted to classify cross-project code clones from the perspective of clone genealogy analysis and identify the categories with higher risks. We extracted the text similarity of each clone pair at every point in its history. Code clones were classified based on the presence of a modification history for the two segments of each clone pair and the trend of similarity changes. Moreover, we also analyzed commit messages to extract defect-fixing commits associated with code clones, thereby investigating the status of defect propagation through cross-project code clones.
The primary contributions of this work are as follows:
Cross-project code clones were detected in one-third of the repositories and 10 of the 20 groups.
From the perspective of IoT systems, communication components and drivers for terminal devices were found to be more likely to produce cross-project code clones. Seventy percent of the groups with cross-project clones belonged to the communication category, and the device category group contained the most clone divisions and the second most clone pairs.
From the perspective of languages, languages with a low level of abstraction and those without a complete package management system were observed to be more likely to produce cross-project code clones.
We identified several factors that led to a surge in the number of detected clone pairs, and these clones may need to be filtered out in some code clone detection applications.
Approximately 95% of the cross-project clones were observed to be untouched. Among the cross-project clones with modification records, approximately 72% exhibited decreasing similarity. These clones may eventually no longer be detected by the same code clone detection tool.
We collected nine defect propagation instances caused by cross-project code clones, some of which had been exposed for nearly three years. This indicates real security risks.
This paper is composed of the following sections. Section II introduces related research, and Section III defines the terms used in this paper. Section IV introduces the research methods used in this work. Section V presents the answers to the two research questions. Section VI discusses several factors that can lead to a sudden increase in the number of detected clones and threats to validity. Finally, Section VII concludes this paper.
Related Studies
With the expansion of software, the necessity for automated management of code clones is increasing. Numerous studies have investigated code clones in real-world software projects. Baxter et al. determined that 7–15% of the code of moderate-scale software systems consists of code clones [13]. Kapser and Godfrey also identified that 12% of the lines of code (LOC) of the Apache HTTP Server were part of code clones and that some subsystems may have significantly higher clone density than others [12]. Several works have analyzed code clones to illustrate the benefits and disadvantages of code cloning and provide refactoring suggestions [14], [15], [18], [19], [20], [21]. However, most prior work focused primarily on intra-project code clones, although certain studies addressed cross-project code clone analysis [23], [24], [25]. Ossher et al. conducted the first large-scale cross-project clone study, detecting file-level code clones across 13,000 Java projects [22]. They identified that more than 15% of all projects include at least one duplicated file and that over 10% of files are clones. However, they did not distinguish between intra- and cross-project clones. Gharehyazie et al. focused on cross-project code reuse on GitHub [2]. They detected code clones across 5,753 Java projects using Deckard [6], discovering that cross-project clones account for 10 to 30% of all clones within a project and up to 5% of the code base. They also determined that most clones originate within the same project, followed by projects within the same application domain, and finally projects across different domains. Although the work of Gharehyazie et al. targeted various domains, they only detected Type-1 (refer to Section III for the definition) clones in Java. In this work, we detected cross-project Type-3 code clones in code written in nine languages.
To detect code clones automatically, several clone detection tools have been developed, such as CCFinder [5], Deckard [6], NiCad [7], SourcererCC [8], and SAGA [9]. These state-of-the-art tools demonstrate superior recall, precision, and scalability. However, the repositories we collected in this work encompass 11 programming languages, and none of these detectors can cover all of them. The only two multi-language clone detectors capable of satisfying this need are CCFinderSW [10] and MSCCD [3]. CCFinderSW achieves its multi-language mechanism by converting the grammar rules of the target language into regular expressions. However, it cannot support certain languages, such as Lua, whose grammar rules cannot be converted into regular expressions. Moreover, CCFinderSW cannot detect Type-3 clones. MSCCD is the only detector that can detect Type-3 code clones in all target languages; therefore, we used MSCCD in this research.
Code clone evolution refers to the process by which duplicated code snippets change over time, whereas clone genealogy tracks the lineage and historical changes of code clones across multiple software versions, aiding in understanding their evolution. Kim et al. performed the first study of clone genealogy analysis and defined the types of changes that can be applied to a code clone [27]. Several subsequent studies have built on their work. For example, Krinke determined that approximately one-half of the clones maintained consistency and that cloned code was relatively more stable [28]. Göde and Koschke discovered that more than half of the clones were stagnant and that only approximately 12% of the clones were ever changed. The two studies by Barbour et al. targeted the late propagation phenomenon [15], [16]. They identified eight types of late propagation that caused errors and determined that the size of the code experiencing late propagation affects fault-proneness. These studies primarily focused on code clones within projects, whereas our study attempts to perform clone genealogy analysis on code clones between projects. We used a relatively simple model, focusing on the presence of modification history and the changing trend of text similarity. Our classification results are consistent with the conclusions of existing studies.
Terminologies
In this paper, we employ specific terminologies defined as follows.
Code Segment: A contiguous sequence of code lines defined by the triplet (f, s, e), with source file f, start line s, and end line e. A code segment can be a full file or only a part of one [3].
Code Clone: A relationship between code segments. Code segments with a code clone relationship have the same functionality and are syntactically similar or identical.
Types of Code Clones [11]:
T1 (Type-1): Code clones with identical code segments, except for the differences in white space, layout, and comments.
T2 (Type-2): Code clones with identical code segments, except for the differences in identifier names and literal values, along with the T1 clone differences.
T3 (Type-3): Code clones with syntactically similar code segments that differ at the statement level. The segments may have statements added, modified, and/or removed with respect to each other, in addition to the T1 and T2 clone differences.
T4 (Type-4): Code clones with syntactically dissimilar code segments that implement the same functionality.
The T1, T2, and T3 code clones are relevant to this research.
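As a hypothetical illustration of these definitions (the snippets below are ours, not taken from the studied systems; Python is used only for brevity):

```python
# Original segment.
def average(values):
    total = sum(values)
    return total / len(values)

# Type-2 clone of `average`: identical except for identifier names
# (a Type-1 clone would additionally keep the identifiers unchanged).
def mean(numbers):
    acc = sum(numbers)
    return acc / len(numbers)

# Type-3 clone of `average`: a statement was added (a guard against
# empty input), so the segments now differ at the statement level
# while remaining syntactically similar.
def safe_average(values):
    if not values:
        return 0.0
    total = sum(values)
    return total / len(values)
```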
Clone Pair: Two code segments with the code clone relationship [11].
Clone Division: A clone division is a set of more than two code segments. When an undirected graph is constructed from all detected clone pairs (code segments as nodes and clone pairs as edges), each connected component consisting of three or more code segments corresponds to a clone division.
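Under this definition, clone divisions can be extracted from a detector's clone-pair list by computing connected components; a minimal stdlib-only sketch (segment identifiers and sample pairs are illustrative):

```python
from collections import defaultdict

def clone_divisions(clone_pairs):
    """Group code segments into clone divisions: connected components
    of the clone-pair graph containing three or more segments."""
    graph = defaultdict(set)
    for a, b in clone_pairs:
        graph[a].add(b)
        graph[b].add(a)

    seen, divisions = set(), []
    for node in graph:
        if node in seen:
            continue
        # Iterative depth-first search collects one connected component.
        component, stack = set(), [node]
        while stack:
            cur = stack.pop()
            if cur in component:
                continue
            component.add(cur)
            stack.extend(graph[cur] - component)
        seen |= component
        if len(component) >= 3:  # definition: size greater than two
            divisions.append(component)
    return divisions

pairs = [("a.c:1-10", "b.c:5-14"), ("b.c:5-14", "c.c:7-16"),
         ("x.c:1-9", "y.c:2-10")]
# One division {a.c, b.c, c.c}; the isolated x-y pair is too small.
print(clone_divisions(pairs))
```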
Methodology
This section introduces our research approach, as shown in Figure 1. We first extended the database collected by Makhshari et al., which contains OSS for IoT systems. Then, we grouped the collected software by language and primary target within the IoT system. Next, we detected cross-project code clones for each group using MSCCD, a multilingual code clone detector that can detect Type-3 code clones at various granularity levels. For the detected cross-project code clones, we generated file- and function-level clone genealogy reports, which allow checking the similarity between the two segments of a clone pair at any point in time. The following subsections introduce the details.
A. Collection and Classification of Open-Source Software for Internet of Things (IoT) Systems
In this research, we extended the dataset introduced by Makhshari et al. [1]. To identify bugs in IoT software, they collected bug reports from 91 IoT repositories by searching IoT-related topics and excluding repositories with few stars, labeled issues, or closed issues. We removed the bug-report requirement and further collected 31 high-star repositories. In detail, we searched for repositories with more than ten stars using the following keywords: “IoT,” “Internet of things,” “IoT-application,” “IoT-platform,” and “IoT-device.” We then excluded forked repositories and repositories with fewer than 50 closed issues or fewer than five labeled issues, resulting in 853 remaining repositories. Of the repositories not included in the dataset presented by Makhshari et al., those with fewer than 1000 stars were excluded. We then manually checked the remaining 90 repositories and excluded those unrelated to actual IoT systems, such as documentation and sample code. Thus, the extended dataset contains 122 repositories in 11 languages.
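The selection criteria above can be expressed as a simple predicate over repository metadata; the field names below are our assumptions (loosely modeled on GitHub API repository fields), not the exact schema used in this study:

```python
def keep_repository(repo):
    """Apply the selection criteria: not a fork, at least 50 closed
    issues and 5 labeled issues, and, for repositories newly added
    beyond the Makhshari et al. dataset, at least 1000 stars."""
    if repo.get("fork", False):
        return False
    if repo.get("closed_issues", 0) < 50 or repo.get("labeled_issues", 0) < 5:
        return False
    if repo.get("newly_added", False) and repo.get("stars", 0) < 1000:
        return False
    return True

candidates = [
    {"name": "iot-gw", "fork": False, "closed_issues": 120,
     "labeled_issues": 9, "stars": 2400, "newly_added": True},
    {"name": "iot-fork", "fork": True, "closed_issues": 120,
     "labeled_issues": 9, "stars": 2400, "newly_added": True},
]
print([r["name"] for r in candidates if keep_repository(r)])  # → ['iot-gw']
```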
B. Grouping of the Collected Repositories
This section presents the categorizations of the collected IoT repositories. IoT systems exhibit a typical hierarchical structure, each layer possessing distinct functionalities and characteristics. Similar functionalities are a prerequisite for cross-project code cloning. Grouping before conducting clone detection can reduce computational load as compared to that when directly performing code clone detection across all OSS systems, thereby enabling large-scale, Type-3 level, multi-language cross-project code clone investigation. Furthermore, clones persisting after grouping are more likely to be associated with the unique characteristics of IoT systems.
We categorized the OSS systems into the following three categories:
Device: These OSS systems primarily operate on dedicated devices such as sensors, performing tasks such as device control and data collection at the end of the IoT architectural spectrum.
Communication: These OSS systems primarily facilitate communication through various protocols.
App/Cloud: These software systems operate primarily on general-purpose devices such as servers, integrating data and providing services at the top of the IoT architectural spectrum.
We checked the home page or documentation of the OSS projects to determine their categories. Moreover, specific OSS projects encompass functionalities across multiple categories, allowing them to belong to more than one category.
Finally, we grouped all the OSS by language and category. We obtained 20 groups, as listed in Table 1, comprising a total of 114 OSS systems in 10 programming languages. Among the OSS systems collected, which were described in the previous section, eight were not grouped as they did not have a corresponding OSS within the same category.
C. Clone Detection Using Multilingual Syntactic Code Clone Detector
We used MSCCD, a code clone detector capable of identifying Type-3 clones across multiple languages, to analyze each group [3]. We raised the similarity threshold from the default value of 70% to 80%. While the lower default setting might boost recall, it also increases the risk of false positives and accidental clones, and verifying each clone manually is impractical given the large number of clones reported. Thus, we adopted an 80% threshold to better balance precision and recall for Type-3 clones. Moreover, the configuration stipulates a minimum of 30 tokens and granularity values ranging from 0 to 10, enabling MSCCD to detect code clones exceeding 30 tokens at both the file and block levels.
MSCCD was run for each group as detailed in Section IV-B. It identifies intra- and cross-project clones; however, this study primarily focused on the analysis of cross-project clones.
D. History Tracking for Cross-Project Code Clones
This subsection introduces a tool that we implemented to track the history of each reported clone pair. Figure 2 shows the model of history tracking. A clone pair is detected when its current similarity exceeds the similarity threshold. However, the two segments may not have been in a cloned state throughout their entire history. In other words, their similarity may have varied over time, and tracking these variations reveals how the clone pair evolved.
The input was a cross-project clone pair that was reported by MSCCD. Then, the following steps were performed to extract the history:
Step 1: Extract the file path and line numbers of the two code segments of the target clone pair.
Step 2: (Function mode only) Extract the function name of the target segment using the Ctags tool [4] or a parser for the target language. The tool first extracts all function items with their function name, return type, parameter list, and line numbers. Then, the item whose line numbers match those extracted in Step 1 is returned.
Step 3: Extract all the revisions of the target file using the git log command and copy these modified files.
Step 4: (Function mode only) Identify the target function in each revision. The tool extracts all function items with their function name, return type, and parameter list and then returns the one whose function name, return type, and parameter list match those extracted in Step 2.
Step 5: (Function mode only) Check the revision list extracted in Step 3 and filter out the revisions where the target functions are not modified.
Step 6: Calculate the similarity of the reported clone pair at each point in time when one side of the pair is modified. In particular, all revisions of the two code segments are traversed in descending order. Each revision's similarity to the most recent revision on the other side that is older than itself is calculated. If no older revision exists on the other side (e.g., revisions R2 and R1 of code segment A in Figure 2), the similarity to the temporally closest revision on the other side is calculated instead. We used the Overlap Similarity [8], defined in Equation 1: the Overlap Similarity of two code segments $CS_{x}$ and $CS_{y}$ is the proportion of shared tokens between the two segments relative to the token count of the larger segment.
\begin{equation*} Sim( CS_{x}, CS_{y} ) = \frac {\left |{CS_{x}\cap CS_{y}}\right |}{MAX\left ({\left |{CS_{x}}\right |, \left |{CS_{y}}\right |}\right )} \tag {1}\end{equation*}
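Equation 1 can be computed directly over token multisets; a minimal sketch (the whitespace tokenization here is naive and purely for illustration):

```python
from collections import Counter

def overlap_similarity(tokens_x, tokens_y):
    """Overlap Similarity (Eq. 1): shared tokens (multiset
    intersection) divided by the token count of the larger segment."""
    cx, cy = Counter(tokens_x), Counter(tokens_y)
    shared = sum((cx & cy).values())
    return shared / max(len(tokens_x), len(tokens_y))

a = "int i = 0 ; i = i + 1 ;".split()
b = "int j = 0 ; j = j + 2 ;".split()
print(round(overlap_similarity(a, b), 2))  # → 0.64 (7 shared of 11 tokens)
```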
Step 7: Generate the HTML report for the result. Figure 3 shows a part of a report for a function-level clone. At the top of the report, the project names, function names, and other relevant information for each code segment, along with a graphical representation of their clone history, are displayed. The subsequent section details the status of each code segment at various points in time, including their commit information, source code, and patch files.
Based on the changes in historical similarity, we can analyze the occurrence time of the clone and the direction of its evolution. We provide more details of the scenarios based on historical similarity in Section V-B.
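The revision-pairing rule of Step 6 can be sketched as follows, with revisions represented as (timestamp, content) tuples; all names are ours and not part of the tool's actual interface:

```python
def pair_revisions(revs_a, revs_b):
    """For each revision of segment A, pick the revision of segment B
    to compare against: the most recent B revision that is older, or,
    if none exists, the temporally closest B revision."""
    pairs = []
    revs_b = sorted(revs_b)  # ascending by timestamp
    for t_a, content_a in sorted(revs_a):
        older = [(t_b, c_b) for t_b, c_b in revs_b if t_b <= t_a]
        if older:
            t_b, content_b = older[-1]  # most recent older revision
        else:
            # Fallback: no older revision on the other side exists.
            t_b, content_b = min(revs_b, key=lambda r: abs(r[0] - t_a))
        pairs.append(((t_a, content_a), (t_b, content_b)))
    return pairs

# Segment A changed at t=1 and t=5; segment B at t=3 and t=7.
print(pair_revisions([(5, "a1"), (1, "a0")], [(3, "b0"), (7, "b1")]))
```

At each produced pairing, the Overlap Similarity of the two contents would then be computed to build the similarity history.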
E. Exploring Commits That Fixed a Defect
To answer RQ2, we attempted to identify the commits that fixed software defects from the list of commits that were included in the cloned revision history. When a commit fixes a software defect, the previous revision contains the defect, likely located at the modified location. Based on this, we checked whether the defect had been propagated to other software through code cloning and verified whether the propagated defect had been fixed. To achieve this, we retrieved keywords in the commit message. We traversed all commits and selected those that contained the keywords bug(s), issue(s), and error(s) as candidates. Finally, we manually checked all candidates and only retained the commits that were guaranteed to be defect fixes.
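The keyword screening step can be sketched with a case-insensitive whole-word match over commit messages; the keyword list follows the text, while the regex details and data layout are our assumptions:

```python
import re

# bug(s), issue(s), error(s) as whole words, case-insensitive.
FIX_KEYWORDS = re.compile(r"\b(bugs?|issues?|errors?)\b", re.IGNORECASE)

def fix_candidates(commits):
    """Return commits whose message mentions a defect-related keyword;
    candidates still require manual confirmation as actual fixes."""
    return [c for c in commits if FIX_KEYWORDS.search(c["message"])]

commits = [
    {"sha": "a1", "message": "Fix buffer overflow bug in parser"},
    {"sha": "b2", "message": "Update README"},
    {"sha": "c3", "message": "Resolve issue #42: wrong checksum"},
]
print([c["sha"] for c in fix_candidates(commits)])  # → ['a1', 'c3']
```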
Results
A. RQ1: Trend of Cross-Project Clones in IoT Systems
Table 2 provides a summary of the detected clones. Of the 114 repositories targeted in clone detection, 103 were identified to have code clones, with 81 containing file-level code clones. Among these repositories, 48 contain cross-project code clones, representing approximately 40% of all repositories. Of these, 30 projects had file-level cross-project clones, whereas 46 included block-level ones. Based on the grouping, cross-project code clones were identified in 14 groups, more than one-half of the total. The number of cross-project code clones was approximately 16% of the number of intra-project code clones. For intra-project code clones, the average sizes of the file- and block-level clones were 759 and 53 tokens, respectively. Cross-project file-level clones were considerably larger, averaging 2761 tokens, whereas cross-project block-level clones averaged 49 tokens. Among these cross-project code clones, 2179 clone divisions were identified, 360 of which were file-level, with the remainder being block-level. File-level clone divisions consisted of an average of 19 code segments, while block-level clone divisions comprised an average of six code segments.
However, a considerable number of the more than 859,000 detected cross-project clone pairs can be filtered out: although their similarity is high and they are correct clones from the detector's perspective, they were not produced by copy-and-paste and do not cause problems such as defect propagation. They include clones between drivers for different hardware models in the ARM architecture, source code of imported third-party packages managed in vendor folders of Go projects, and automatically generated ProtoReflect functions in Go. Details of these issues are discussed in Section VI-B. Because such clones caused millions of reported cross-project clone pairs, we filtered them out; the numbers of remaining clones are shown in Table 1. After filtering, 11,196 cross-project clone pairs and 914 clone divisions remained, and subsequent investigations were conducted on them. In addition, the results of the three C# groups are not included in Table 2 because MSCCD was unable to scan all of their source files, an issue usually caused by problems in the ANTLR-generated parser used by MSCCD [31]. Many incomplete code segments containing only package import statements were detected as clones, and these false positives are challenging to eliminate automatically. After manual inspection, we found no correct cross-project code clones in these three groups; therefore, we excluded their results from the entire investigation.
Of the 14 groups in which cross-project code clones were detected, eight were related to communication functionalities, out of only nine communication-related groups in the initial grouping (refer to Table 1). In contrast, three groups each were related to the Device and App/Cloud categories. Thus, communication-related code exhibits a more obvious tendency to be cloned across projects. Table 3 lists the details of each group: the number of clone pairs and clone divisions, clone pairs per 10K LOC, and the average number of code segments within each clone division, in file- and block-level statistics. Regarding detected clone pairs, the most block-level clones were identified in Groups 2 (C, Dev) and 8 (Go, Comm), with 7910 and 1095 pairs, respectively. Group 2 also reported the most file-level code clones, with 982 pairs. Moreover, Group 10 (Erlang, Comm) shows only a marginal difference between the numbers of clones detected at the file and block levels. For clone pairs per 10K LOC, Groups 2, 3, 5, 8, and 10 exceeded 200 pairs at the block level. Group 2 (C, Dev) significantly exceeded the other groups at the file level, with 982 pairs, and also exhibited distinct characteristics in the number and average size of its clone divisions. Meanwhile, the two Go groups have significantly larger average clone division sizes than all other groups, indicating the presence of substantial, large clone divisions within the Go language groups.
No cross-project code clones were reported in the groups not listed in Table 3. Moreover, some groups exhibited a very small number of detected cross-project clones. Several factors may explain this. First, the characteristics of the programming language may influence the generation of code clones. For example, languages like Python and Java, which have mature package management ecosystems, allow developers to easily incorporate external code, reducing the need to reuse code by directly copying source code. Second, the software's functionality may also have an impact. For instance, code related to communication protocols and hardware drivers is often difficult to abstract sufficiently, increasing the need to reuse code by copying source code. In contrast, it is easier to apply abstraction in the Cloud and Application layers of IoT systems, resulting in a lower likelihood of code cloning. Additionally, the development style of individual teams can influence the number of code clones: some teams may actively reuse external code, while others aim to minimize code cloning as much as possible. A detailed analysis of the most influential factors for each group is necessary.
To directly show the code that is reused across projects, we identified all clone divisions with more than five code segments and summarized their similarity types, counts, and functions, as listed in Tables 4 and 5. Of the 61 clone divisions at the block level, 12 are clones owing to similar data structures. Among them, eight are observed in C and C++ and are lists or matrices required by data transformation algorithms (including type transformation, encryption and decryption, and compression and decompression); for example, the S-box or T-box required by the Advanced Encryption Standard algorithm, or the array that encodes the zigzag scan order for image/video compression. Moreover, four clone divisions with similar data structures are found in the Java, JavaScript, and Erlang languages; these are typically configuration files for testing. The remaining 49 clone divisions are generated because of similar functions. Most (16 clone divisions) are related to communication functions, such as implementing hash algorithms, form processing, and managing permissions, status, and messages in communication. The rest are standard functions, such as the equals function in Java that compares objects for equality, and test functions.
Among the 11 clone divisions at the file level, one is a test configuration file for JavaScript, and three are interface files in Java with high text similarity owing to references to similar packages. Similar functions resulted in the remaining seven clone divisions. Among the five clone divisions in the C language, two are related to data transformation and verification, whereas three are related to managing specific hardware. The two clone divisions in Erlang are modules for process supervision.
The answer to RQ1: Our findings demonstrate a substantial presence of cross-project code clones in IoT systems, with a pronounced concentration in communication-related functionalities.
B. Effect of Cross-Project Clones on Maintainability
In this section, we explore the impact of cross-project cloning on software maintainability through the analysis of code clone modification histories. As introduced in Section IV, the main information in the extracted modification history is the historical similarity between the code segments. Code clone detection tools are usually used to detect the existence of code clones in the current state; however, the code pairs detected as clones may not have been clones at every past instant, and the trend of their similarity changes also varies. We classified the clone modification histories according to the trend of similarity changes and the presence of modification records, as shown in Table 4. The following list summarizes the typical scenario for each type:
P1 (Modifications were incorporated on both sides, and the similarity is generally constant): In this case, the overall similarity remains unchanged, although both clones have a modification history. This situation implies that the two clones have strong consistency and must synchronize their modifications with each other. The code clone is highly likely to be a known clone and is managed by the developer. Managing clones in this pattern will reduce the risk of defect propagation and inconsistency. However, we should also focus on the occurrence of late propagation.
P2 (Modification was incorporated on at least one side, and the similarity decreases): When modification history exists and the similarity gradually decreases, the two sides of the clone pair are likely to evolve in different directions. Assuming that the similarity keeps decreasing, this clone pair may not be detected by the same clone detector at the same granularity in the future (particularly for coarse-grained detection). However, this does not imply that similar code segments that need to be managed have been eliminated through code evolution. This way, risk locations such as defect propagation or inconsistency may no longer be detectable.
P3 (Modification was incorporated on at least one side, and the similarity increases): This situation is uncommon. One possible situation is that the two code segments did not establish a clone relationship in the early stage. After their respective evolution, one referred to the other, which made the code similarity significantly higher than the initial one.
P4 (No modification was incorporated on either side, or a single side was modified while the similarity was maintained): In this case, sufficient similarity change information is unavailable to determine the direction of clone evolution. Therefore, all the risks mentioned above may occur in this case.
To classify the clone modification history more accurately, we must define the starting point and threshold of the similarity change. In the above scenarios, the starting point of the similarity change is the time when the clone is considered to be established. This point may be the initial version in the modification history, the version closest in time to the newer of the two initial versions, or the time point with the highest similarity. In this implementation, we used the time point with the highest similarity between the initial versions of the two code segments in the clone pair as the starting point for observing similarity changes. Moreover, very minor revisions should be ignored when considering the trend of similarity changes. Therefore, we set a threshold for similarity changes: only when a similarity change exceeds the threshold is it considered significant. In this investigation, we used a threshold of 5%.
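Assuming a similarity series that starts at the point described above, the P1–P4 classification with the 5% threshold can be sketched as follows (the function is our illustration, not the actual implementation):

```python
def classify_history(similarities, modified_a, modified_b, threshold=0.05):
    """Classify a clone pair's history into P1-P4.

    similarities: similarity values over time, beginning at the chosen
        starting point (highest similarity of the initial versions).
    modified_a / modified_b: whether each side has a modification history.
    """
    delta = similarities[-1] - similarities[0]
    changed = abs(delta) > threshold  # ignore changes within 5%
    if not (modified_a or modified_b):
        return "P4"  # untouched on both sides
    if not changed:
        # Modified, but similarity generally constant: P1 needs
        # modifications on both sides, otherwise P4.
        return "P1" if (modified_a and modified_b) else "P4"
    return "P2" if delta < 0 else "P3"

print(classify_history([0.95, 0.90, 0.72], modified_a=True,
                       modified_b=False))  # → P2 (similarity decreases)
```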
Table 6 lists the clone history classification results for the detected file-level code clones. Among the five groups with detected file-level code clones, all groups exhibit the P1 cloning pattern, whereas three groups exhibit the P2 pattern. We must focus on risks such as late propagation for these clones. Furthermore, the results of Group 2 differ significantly from those of the other groups, with more than 93% of the clones exhibiting the P4 pattern. This result shows that most of the code clones between C-language device projects have no obvious modification history and remain in a state consistent with, or very close to, the original.
Table 6 also shows the clone history classification results of the function-level code clones. Excluding Group 5, all groups exhibit only the P4 pattern, which shows that most of the function-level cross-project code clones have not yet evolved significantly. In Group 5, by contrast, half of the clones are likely to remain consistent, and only one pair exhibits the P4 pattern. This is because the group contains the twin projects iotagent-ul and iotagent-json, which are generated from the same library and implement IoT-agent functionality for different data formats; the code clones between these two projects are therefore likely to remain consistent. Moreover, many developers contributed to both projects, which greatly reduces the difficulty of maintaining consistency.
To demonstrate the risks of different types of cross-project code clones, we attempted to extract the software bug fixes related to these code clones according to the method described in Section IV. Candidates for defect-fix commits were detected in three groups. We manually judged these candidates, and the results are listed in Table 7. In Group 2, 18 commits were judged to be bug fixes, of which three bugs were propagated through code clones and were not fixed. Group 10 contained 17 bug-fix commits, of which four bugs were propagated and not fixed. In Group 5, two bug-fix commits were observed, both propagated but fixed. Among the nine detected bug propagations, seven were not fixed, which is consistent with the clone history classification results of the respective groups: Group 5 exhibits the highest proportion of the P1 pattern, whereas in Groups 2 and 10 the proportion of the P1 pattern is very low (less than 5%), which may indicate that most clones were not managed, resulting in uncorrected bug propagation.
Subsequently, we present three case studies describing situations in which a defect candidate was judged to be an actual defect.
1) Unfixed Cross-Project Propagated Defect
This example is related to a file-level code clone between the AliOS Things and Zephyr projects in Group 2. The file name is hci_raw.c, which is an implementation of a raw Host Controller Interface driver for Bluetooth. The defect is related to the length of the structure buf not being checked in the
The modification history of this clone pair is classified as P2, which implies that the similarity continued to decrease after the clone was established. In the latest state, the similarity has dropped below 80%, meaning that this clone can no longer be detected by a file-level clone detector with a similarity threshold of 80%. Moreover, we observed that most of the cross-project code clones in AliOS Things contain no modification history at all. This code remains untouched after being propagated to the project, which may indicate fewer chances of discovering defects in it (compared with actively developed code). Such clones of inactive code are also less likely to be identified.
2) Fixed Cross-Project Propagated Defect
This case is based on a file-level code clone between the iotagent-json and iotagent-ul projects in Group 5, named commonBindings.js. A bug was fixed via a commit with ID a63c9fc in iotagent-json. The details of this case are shown in Figure 6. The triangular and circular points represent the commits of iotagent-json and iotagent-ul, respectively. The bug was in the guessType function, which is used to determine the type of a given device. Because the parentheses representing a function call were forgotten in the condition of a branch, the true branch was always executed. The function then returned the wrong device type, which caused further problems. In iotagent-json, the bug was introduced on October 30, 2017, and fixed on January 17, 2018. In iotagent-ul, the bug was introduced by a commit on December 21, 2017, and fixed on the same day that iotagent-json fixed the problem.
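Although the actual defect was in JavaScript, the defect class (testing a function reference instead of its return value) exists in Python as well and can be illustrated there. The names below (looks_like_json, guess_type_*) are hypothetical stand-ins, not the project's code.

```python
def looks_like_json(payload):
    # Hypothetical predicate standing in for the real type check.
    return payload.strip().startswith("{")

def guess_type_buggy(payload):
    # BUG: the call parentheses are missing, so the condition tests the
    # function object itself, which is always truthy.
    if looks_like_json:
        return "json"
    return "plaintext"

def guess_type_fixed(payload):
    # Fixed: the predicate is actually evaluated against the payload.
    if looks_like_json(payload):
        return "json"
    return "plaintext"
```

The buggy version classifies every payload as "json", including non-JSON input such as "t|23", while the fixed version returns "plaintext" for it.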
The revision history of this clone is also classified as P2, and its 77% similarity in the latest state is considerably lower than the 93% peak. However, because the two projects are highly similar, the same evolution was applied to the other side several times. In this mode, developers are clearly aware of the existence of the clones and consciously propagate code evolution that needs to be applied to the other side. Therefore, when the bug in iotagent-json was discovered, the other project quickly fixed it as well. Note that, because the other project fixed the bug within 12 hours, the risk should not be high. However, in terms of versions, the correction was performed two commits later, which can conceptually be called late propagation, and a risk of spreading the bug to the production environment exists. Compared with relying on developers' awareness, automated code clone management tools can significantly reduce this risk.
3) Unpropagated Defect
This case is based on a file-level code clone of the file named
To determine defect propagation in code cloning, the time of cloning, the time of defect introduction, and whether the clone maintains consistency are the key pieces of information. For Type-3 code clones, the clone modification history can indicate whether a defect has been propagated more accurately than simply relying on the similarity in the latest state.
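The temporal part of this judgment can be sketched as follows. This is a simplified illustration under our own assumptions (timestamps as comparable values, copy-time propagation only); it does not cover consistently co-evolving clones (P1), through which a change can also propagate after cloning.

```python
def defect_propagated(clone_established, defect_introduced, defect_fixed=None):
    """Return True if the defect plausibly propagated through the clone.

    All arguments are comparable timestamps (e.g. Unix epoch seconds).
    A defect propagates at copy time when the code was cloned while the
    defect was still present in the source segment.
    """
    if defect_introduced > clone_established:
        return False   # the clone predates the defect
    if defect_fixed is not None and defect_fixed <= clone_established:
        return False   # the defect was already fixed when the clone was made
    return True
```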
The answer to RQ2: Cross-project code cloning in IoT systems poses significant maintainability challenges. The primary risks include the potential for unaddressed defects that may propagate between projects and the difficulty in managing and maintaining cloned code segments.
Discussion
A. Trend of Cross-Project Clones in Terms of Commit Number
In Section V-A, we constructed the dataset by searching IoT-related code repositories based on specific keywords and filtering the search results according to the number of stars and issues. In this process, we did not filter or classify repositories based on the number of commits. However, the number of commits is also an essential attribute of a code repository. A higher number of commits indicates higher activity in the project. Smaller and more frequent commits make changes easier to track and revert, enhancing maintainability. In this section, we classified the repositories based on whether they had more than 1,000 commits to observe trends in cross-project code cloning.
Among the 114 repositories, 72 had more than 1,000 commits, accounting for over 60%. We categorized all cross-project code clones based on the origin of each code segment: both segments from repositories with more than 1,000 commits (Group A), both segments from repositories with fewer than 1,000 commits (Group B), and clones spanning the two groups. The same principle applies to clone divisions: if all segments in a clone division originated from the same group, the division belonged to that group; otherwise, it was classified as spanning the two groups. Table 8 presents the number of clone pairs and divisions at various granularities and their average sizes. The results show that all cross-project code clones were associated with Group A, and no clones were found between repositories with fewer than 1,000 commits. Furthermore, for nearly 95% of clone pairs, both segments originated from Group A. Clone divisions exhibited a similar distribution, with over 96% of all code segments in clone divisions originating from repositories in Group A. This result suggests that projects with more commits and higher activity are more likely to introduce cross-project code clones, and they also tend to clone code from other high-commit repositories. Although the repositories included in this study were filtered by star and issue counts, representing the more popular and higher-profile repositories in the search results, the detected cross-project code clones were still concentrated among repositories with more commits.
For the cross-project code clones analyzed in Section V-B, we calculated the mean, standard deviation, and median number of commits for each clone pair, with the results shown in Table 9. Here, the commit number of a clone pair is defined as the sum of the commits that changed either code segment. Unsurprisingly, file-level code clones had a higher average number of commits. The two JavaScript groups had the highest average commit numbers, which can be attributed to the nature of the projects in Group 5: the iotagent series of projects exhibits notably high activity. Regarding the average number of commits for each type of clone modification history, we found that consistent clones (P1) tend to have relatively high commit counts. Among the five groups reporting P1 clones, three showed a significantly higher average commit count for P1 clones than for the other categories; only in file-level Group 10 did another category exhibit a significantly higher average commit count than P1. In addition, groups with higher mean commit counts also tend to exhibit higher standard deviations. This suggests that even in projects with frequent commits, the commits are concentrated on a few critical files rather than evenly distributed across all files, an observation that aligns with common experience in software development.
B. Situations That May Be Considered as Misdetection
In this study, we identified some situations that significantly increase the number of clone pairs. Code clone detectors indicate that these code pairs have a high degree of similarity and are, by definition, code clones. However, in application scenarios of code clone detection technology, such as defect detection, whether these code pairs should be treated as code clones must be discussed. If many detection results should not be treated as code clones, the usefulness of the detection results decreases. The following are some such situations.
1) Drivers for Different Hardware Series with the Same Architecture
In Group 2, multiple OSS systems include several drivers because they need to support the STM32 series of microcontrollers. We observed that many code clone pairs were detected between drivers of different series. The initial detection results included 14,851 clone pairs, of which only 8,917 remained after excluding clones between STM drivers of different series. That is, 40% of the initial detection results were clones between drivers of different series. Drivers must often be localized and corrected after importing; not all drivers remain untouched. Drivers may contain defects when imported, and defects may also be introduced during localization corrections. Such drivers may propagate defects further after being cloned. Therefore, management based on clone relationships is necessary. In actual tasks, code clones between drivers of different series should be ignored (at least when cross-project clones of drivers of the same series exist).
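A simple path-based heuristic can recognize such pairs. The sketch below is our own assumption, relying on the convention that STM32 HAL driver file names embed the series token (e.g. stm32f4xx_hal_uart.c); it is not the filtering rule used in the study.

```python
import re

# Matches the STM32 series token in a driver path, e.g. "f4" in
# "drivers/stm32f4xx_hal_uart.c".
SERIES = re.compile(r"stm32([a-z]\d)", re.IGNORECASE)

def is_cross_series_pair(path_a, path_b):
    """True if both files are STM32 drivers from *different* series."""
    m_a, m_b = SERIES.search(path_a), SERIES.search(path_b)
    if not (m_a and m_b):
        return False            # at least one file is not an STM32 driver
    return m_a.group(1).lower() != m_b.group(1).lower()
```

Clone pairs for which this predicate returns True could then be excluded before further analysis.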
2) Similar Functions
In Group 8, we obtained a very large clone division, with 16,487 clone pairs detected among 192 block-level code segments (including 8,039 cross-project clone pairs). Each code segment here is a ProtoReflect function, which is automatically generated by the protoreflect package for the introspection and manipulation of Protobuf messages at runtime. Because a runtime access interface is required for each message, the number of ProtoReflect functions equals the number of defined messages, and refactoring to reduce their number is difficult. These clone pairs should be ignored in the actual use of code clone detection.
The situation of the equals and hashCode methods in Java is similar. The equals method determines whether two object instances are equal, whereas the hashCode method calculates the hash code of an object instance. These two methods also constitute a large clone division. For these clones, class relationships should also be taken into account.
3) Dependency Management by Vendor Folder in Go Projects
In Go software projects, external dependencies can be managed through the vendor folder. This approach saves the source code of external dependency packages in the project's vendor folder, providing advantages such as locking dependencies to specific versions and building the project offline. Inevitably, for code clone detection tools, all copies of the same external package files will be detected as clones. Better practices should be used to manage these external packages, such as defect detection based on package names and version numbers.
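Such clones can be filtered out after detection with a simple path check. This is a hypothetical post-filter of our own, not part of the study's pipeline:

```python
def is_vendored(path):
    # A file under a Go vendor/ directory is a copied external dependency.
    parts = path.replace("\\", "/").split("/")
    return "vendor" in parts

def drop_vendor_clones(clone_pairs):
    # Discard clone pairs in which either segment comes from vendored
    # dependency code; such duplicates are expected and are better
    # handled by package-name and version checks.
    return [(a, b) for a, b in clone_pairs
            if not (is_vendored(a) or is_vendored(b))]
```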
C. Threats to Validity
1) Method to Identify Defects in Software Repositories
In this study, we attempted to identify commits representing bug fixes among the commits included in the clone evolution history, in order to collect details of defects associated with code clones. We retrieved defect-fixing commits by analyzing commit messages. However, no information is available that reflects the recall of this detection method. For mining defects from software repositories, a commonly used method is to retrieve issues with bug labels and then obtain specific commits by resolving the pull requests linked to those issues. However, that method cannot detect defects that are not recorded in the issue tracking system. Analyzing commit messages is the simplest method for this investigation, but it is unsuitable for defects corrected across multiple commits or by commits with multiple targets. Owing to the lack of a recall evaluation, the relevant results are limited.
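A commit-message filter of this kind can be sketched as follows. The concrete keyword list is our assumption, not the exact pattern used in the study, and flagged candidates still require manual confirmation, as was done here.

```python
import re

# Hypothetical keyword pattern for flagging bug-fix commit candidates.
FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug|defect|fault|repair)\b",
                         re.IGNORECASE)

def is_fix_candidate(commit_message):
    """True if the commit message suggests a defect fix."""
    return FIX_PATTERN.search(commit_message) is not None
```

As the surrounding text notes, the recall of such a filter is unknown: fixes described without these keywords, or spread across multiple commits, are missed.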
Moreover, our judgment criterion for identifying defect propagation is whether the code containing the defect before the correction is propagated through code clones. However, we did not run the target program to determine whether the defect would cause a similar problem after propagation. If a defect depends on other files, it may no longer cause problems in the cloned environment. For various objective reasons, running each version of each target program in the corresponding environment to confirm the existence of the defect is difficult.
2) Classifying Clone Modification History
We classified the code clone modification history by the presence of modifications and the trend of the similarity change. However, this classification method uses only the similarity records of the earliest and latest periods, so similarity changes between them may not be reflected. For example, the result cannot capture the situation of "maintaining consistency for a period of time but losing it at a certain point"; we obtain only a result showing a decreasing similarity trend. In future work, additional visualization methods can be used to address this problem.
3) Methods of Collecting Repositories
The methodology used in this study can also be applied to software systems beyond IoT; the only necessary adjustment is the approach to selecting candidate repositories. Given current technological limitations, conducting large-scale (hundreds of repositories or more) Type-3 cross-project code clone detection within a reasonable time is challenging. Therefore, pre-selecting candidate repositories to narrow the scope of detection is essential. In this study, we grouped the collected repositories based on programming language and the IoT hierarchical structure, ensuring that each detection task involved at most 15 repositories simultaneously. If the scope of candidate repositories is too narrow, many clones may be missed; if it is too broad, the computational workload may become unmanageable. Therefore, a suitable approach to limiting the scope of detection must be determined.
Conclusion
This study investigated the prevalence and impact of cross-project code cloning in open-source IoT systems. We analyzed a diverse set of repositories across multiple programming languages and determined that cross-project code clones are common, particularly in communication-related functionalities. Our analysis revealed that many of these clones remain untouched after propagation, posing risks for latent defects and defect propagation. Proactive clone management significantly reduces these risks, highlighting the importance of effective clone synchronization and maintenance practices.
Our future work focuses on developing automated tools to enhance clone management and exploring the effects of clones in other software domains beyond IoT.