Progression of String Matching Practices in Web Mining – A Survey

KALADEVI A.C, NIVETHA S.M

Abstract


Approximate string matching is the technique of finding strings that match a pattern approximately rather than exactly. The problem divides into two sub-problems: finding approximate substring matches inside a given string, and finding dictionary strings that match the pattern approximately. The basic technique is dictionary-based entity extraction, which identifies predefined entities in a document. To improve recall, the next development is approximate entity extraction: for a given query, it finds all substrings in a document that roughly match entities in a given dictionary. Because it returns every such substring, it produces redundant matches, which lowers its performance. To overcome this drawback, a technique called Approximate Membership Localization (AML) is used, and the AML problem is solved with the P-Prune algorithm. This paper surveys the performance and accuracy of the string matching process and proposes the use of P-Prune in a blog-search framework.
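The redundancy described above can be illustrated with a minimal sketch. It is not the P-Prune algorithm and is not taken from the paper: the token-level Jaccard similarity, the threshold, the sample dictionary, and the greedy non-overlap filter are all assumptions chosen only to show why returning every approximately matching substring yields overlapping results, and why keeping one best match per region is attractive.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections (assumed measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def approximate_matches(doc_tokens, dictionary, threshold, max_len):
    """Approximate extraction: return (start, end, entity, score) for every
    window of up to max_len tokens that is similar enough to some entity."""
    matches = []
    n = len(doc_tokens)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            window = doc_tokens[start:end]
            for entity in dictionary:
                score = jaccard(window, entity.split())
                if score >= threshold:
                    matches.append((start, end, entity, score))
    return matches


def keep_best_non_overlapping(matches):
    """Naive localization stand-in (hypothetical, NOT P-Prune): take matches
    in descending score order and skip any that overlap a chosen span."""
    chosen = []
    for m in sorted(matches, key=lambda m: (-m[3], m[0], m[1])):
        if all(m[1] <= c[0] or m[0] >= c[1] for c in chosen):
            chosen.append(m)
    return sorted(chosen)


# Hypothetical sample data for illustration only.
dictionary = ["canon eos 5d digital camera", "nikon d90"]
doc = "the new canon eos 5d digital camera review".split()

matches = approximate_matches(doc, dictionary, threshold=0.7, max_len=5)
best = keep_best_non_overlapping(matches)

print(len(matches), "overlapping approximate matches")
for start, end, entity, score in best:
    print(" ".join(doc[start:end]), "->", entity, round(score, 2))
```

On this sample, the extraction step reports three overlapping windows for the same mention, while the filter keeps only the exact five-token span. A real AML implementation would prune candidate windows during the search rather than enumerating all of them first.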


Keywords


Blog, P-Prune, Approximate Membership Localization, Approximate Membership Extraction, RSS Feeds






ISSN: 1694-2507 (Print)

ISSN: 1694-2108 (Online)