1 Introduction
Identifying similar or identical code fragments among programs has many important applications. One such application is detecting illegal code reuse. With the rapid growth of open-source projects, software plagiarism has become a serious threat to maintaining a healthy and trustworthy environment in the software industry. In 2005 there was an intellectual property lawsuit filed by Compuware against IBM [19]. As a result, IBM paid $140 million in fines to license Compuware’s software and an additional $260 million to purchase Compuware’s services. In software plagiarism cases, determining whether two code fragments are the same is made increasingly difficult by emerging, readily available code obfuscation techniques [17], [18], with which a plagiarist transforms the stolen code in various ways to hide its appearance and logic. Moreover, the plaintiff is often not allowed to access the source code of the suspicious program.