Engineering Journal: Science and Innovation
ELECTRONIC SCIENCE AND ENGINEERING PUBLICATION
Certificate of Registration Media number Эл #ФС77-53688 of 17 April 2013. ISSN 2308-6033. DOI 10.18698/2308-6033
Article

The comparison of program sources using the sequence alignment of tokens

Published: 19.11.2014

Authors: Dubanov A.V.

Published in issue: #9(33)/2014

DOI: 10.18698/2308-6033-2014-9-1318

Category: Information technology

Borrowing detection is a pressing problem today. In this work, one of the known algorithms for biopolymer sequence alignment was modified so that it can compare program sources and detect similar fragments in them. The input to the algorithm is source code treated as sequences of symbols. The alphabet of symbols that make up these sequences corresponds to the set of lexical domains. The algorithm was implemented and demonstrated on code fragments written in the Scheme language. The prospects and limitations of applying the algorithm are also discussed.
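The idea of aligning token sequences can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not say which alignment algorithm was modified, so plain Smith-Waterman local alignment is used here as a stand-in. The lexical classes, the `tokenize` and `local_align` names, and the scoring values (match +2, mismatch -1, gap -1) are all assumptions made for illustration.

```python
import re

# Hypothetical lexical classes for a Scheme-like language; the set of
# lexical domains used in the article is not given in the abstract.
TOKEN_SPEC = [
    ("COMMENT", r";[^\n]*"),
    ("SPACE", r"\s+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("STRING", r'"(?:\\.|[^"\\])*"'),
    ("NUMBER", r"[+-]?\d+(?:\.\d+)?(?=[\s()]|$)"),
    ("KEYWORD", r"(?:define|lambda|if|let|cond|else)(?=[\s()]|$)"),
    ("IDENT", r'[^\s()";]+'),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Map source text to a sequence of lexical-class symbols."""
    classes = []
    for m in MASTER.finditer(source):
        if m.lastgroup not in ("SPACE", "COMMENT"):
            classes.append(m.lastgroup)
    return classes


def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (score, (start_a, end_a), (start_b, end_b)), where the
    ranges are half-open index ranges into a and b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end_a, end_b = i, j
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end_a), (j, end_b)


if __name__ == "__main__":
    a = tokenize("(define (square x) (* x x))")
    b = tokenize("(define (sq n) (* n n)) ; renamed copy")
    print(local_align(a, b))
```

Because the comparison runs over lexical classes rather than raw text, renaming identifiers does not change the token sequence: the two definitions in the usage example produce identical 12-symbol sequences and align over their full length.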

