skip to main content
research-article

A Lightweight Indexing Approach for Efficient Batch Similarity Processing with MapReduce

Published: 25 June 2019 Publication History

Abstract

Similarity search is a principle operation in different fields of study. However, the cost for that operation is expensive due to several reasons, mainly by redundancy and big data load. There are many approaches that concentrate on how to speed up similarity search, especially with massive datasets, so that we can employ it for plenty of recent applications. In this paper, we study an efficient way for either single or batch similarity processing with MapReduce while minimizing redundant data by building lightweight indexes from the data and query sources. More specifically, we propose a general query processing scheme that not only handles a single query but also deals with sets of query in an incremental manner. In addition, we build the indexes in an ordered fashion, the so-called sorted inverted indexes, so that we can perform our quick pruning strategy that discards unrelated objects. Moreover, we embed metadata inside the indexes to reduce inessential duplicates. Last but not least, we measure our proposed solution by conducting empirical experiments on real datasets. The results verify the efficiency of our method when we do similarity search with query batches, especially when both query sets and datasets are large.

References

[1]
Alabduljalil MA, Tang X, Yang T. Optimizing parallel algorithms for all pairs similarity search. In: Proceedings of the 6th ACM International Conference on web search and data mining, 2013. pp. 203–212.
[2]
Baraglia R, De Francisci Morales G. Lucchese C. Document similarity self-join with Mapreduce. In: 2010 IEEE International Conference on data mining, 2010. pp. 731–736.
[3]
Dang TK, Küng J, Wagner R. The SH-tree: a super hybrid index structure for multidimensional data. In: Proceedings of the 12th International Conference on database and expert systems applications, 2001. pp. 340–349.
[4]
Dean J and Ghemawat S MapReduce: simplified data processing on large clusters J Commun ACM 2008 51 1 107-113
[5]
Gao Y, Yang K, Chen L, Zheng B, Chen G, and Chen C Metric similarity joins using MapReduce IEEE Trans Knowl Data Eng 2017 29 3 656-669
[6]
Metwally A and Faloutsos C V-SMART-join: a scalable MapReduce framework for all-pair similarity joins of multisets and vectors Proc VLDB Endow 2012 5 8 704-715
[7]
Nguyen DT-T, Yong CH, Pham XQ, Loan TTK, Huh EN. An index scheme for similarity search on cloud computing using mapreduce over docker container. In: Proceedings of the 10th International Conference on ubiquitous information management and communication, 2016;60:1–6.
[8]
Phan TN, Dang TK. An efficient batch similarity processing with MapReduce. In: Proceedings of the 5th International Conference on future data and security engineering, 2018. pp. 158–171.
[9]
Phan TN, Küng J, Dang TK. An adaptive similarity search in massive datasets. In: Transactions on large-scale data- and knowledge-centered systems XXIII, Lecture Notes in computer science. 2016;9480:45–74.
[10]
Phan TN, Küng J, Dang TK. eHSim: an efficient hybrid similarity search with MapReduce. In: Proceedings of the 30th IEEE International Conference on advanced information networking and applications. 2016. p. 422–429, IEEE computer society.
[11]
Rajaraman A, Ullman JD. Chapter 3: Finding similar items. In: Mining of massive datasets. pp. 71–127 Cambridge University Press; 2011.
[12]
Rong C, Lu W, Wang X, Du X, Chen Y, and Tung AKH Efficient and Scalable processing of string similarity join IEEE Trans Knowl Data Eng 2013 25 10 2217-2230
[13]
Satuluri V and Parthasarathy S Bayesian locality sensitive hashing for fast similarity search Proc VLDB Endow 2012 5 5 430-441
[14]
Tang M, Yu Y, Aref WG, Malluhi QM, Ouzzani M. Efficient processing of Hamming-distance-based similarity-search queries over MapReduce. In: Proceedings of 18th International Conference on extending database technology, 2015. pp. 361–372.
[15]
Wang J, Li G, Deng D, Zhang Y, Feng J. Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search. In: 31st IEEE International Conference on data engineering, 2015. pp. 519–530
[16]
Xiao C, Wang W, Lin X, Yu JX, and Wang G Efficient similarity joins for near-duplicate detection ACM Trans Database Syst 2011 6 3 15:1-15:41
[17]
Zadeh RB and Goel A Dimension independent similarity computation J Mach Learn Res 2013 14 1 1605-1626
[18]
Zezula P, Amato G, Dohnal V, Batko M. Similarity search—the metric space approach. In: Series: advances in database systems, vol. 32, XVIII, 2006. 220 p., ISBN: 0-387-29146-6.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image SN Computer Science
SN Computer Science  Volume 1, Issue 1
Jan 2020
823 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 25 June 2019
Accepted: 13 June 2019
Received: 12 April 2019

Author Tags

  1. Similarity search
  2. Batch processing
  3. Lightweight indexing
  4. Metadata
  5. MapReduce
  6. Hadoop

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 18 Aug 2024

Other Metrics

Citations

Cited By

View all

View Options

View options

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media

-