Characterizing Web Spam Using Content and HTTP Session Analysis
Steve Webb
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332
webb@cc.gatech.edu
James Caverlee
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332
caverlee@cc.gatech.edu
Calton Pu
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332
calton@cc.gatech.edu
ABSTRACT
Web spam research has been hampered by a lack of statistically
significant collections. In this paper, we perform the
first large-scale characterization of web spam using content
and HTTP session analysis techniques on the Webb Spam
Corpus - a collection of about 350,000 web spam pages. Our
content analysis results are consistent with the hypothesis
that web spam pages are different from normal web pages,
showing far more duplication of physical content and URL
redirections. An analysis of session information collected
during the crawling of the Webb Spam Corpus shows significant
concentration of hosting IP addresses in two narrow
ranges as well as significant overlaps among session header
values. These findings suggest that content and HTTP session
analysis may contribute a great deal towards future
efforts to automatically distinguish web spam pages from
normal web pages.
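As a rough illustration of the two measurements the abstract mentions, the sketch below estimates exact content duplication by hashing page bodies and tallies hosting IP addresses by their first octet. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the use of SHA-256, the first-octet grouping, and the sample data are all hypothetical.

```python
import hashlib
from collections import Counter


def duplicate_fraction(pages):
    """Return the fraction of pages that are byte-for-byte copies of another page.

    Assumption: "duplication of physical content" is approximated here by
    exact byte equality, detected with a SHA-256 digest of each page body.
    """
    digests = Counter(hashlib.sha256(body).hexdigest() for body in pages)
    total = sum(digests.values())
    unique = len(digests)
    return (total - unique) / total if total else 0.0


def first_octet_histogram(ips):
    """Count hosting IPv4 addresses by first octet (a coarse range grouping)."""
    return Counter(int(ip.split(".")[0]) for ip in ips)


# Hypothetical sample data; the IPs come from reserved documentation ranges.
pages = [
    b"<html>buy now</html>",
    b"<html>buy now</html>",
    b"<html>other</html>",
]
ips = ["192.0.2.1", "192.0.2.9", "203.0.113.5"]

print(duplicate_fraction(pages))
print(first_octet_histogram(ips).most_common())
```

Exact hashing only catches identical pages; near-duplicates that differ by a few bytes would need a similarity measure such as shingling, which this sketch does not attempt.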
1. INTRODUCTION
Web spam has…