Selective Search

Selective Search is a Distributed Information Retrieval (DIR) project for large-scale text search in which a collection of documents is partitioned into subsets (known as shards) based on the similarity of the documents in the corpus, so that a query can be searched across a few shards instead of all of them.

The goal of the project is to implement an efficient and effective distributed search system that reduces search cost without degrading search quality.

Decomposing a large-scale collection into subunits based on the content similarity of its records is termed “Topic Based Partitioning”, and searching only the few shards relevant to a given query is termed “Selective Search”, in a publication by Kulkarni et al.

Table of Contents

  1. General Architecture
  2. Current Implementation
    1. Technology Stack
    2. Version Compatibility
    3. Implementation Architecture
    4. Features
  3. Getting Started
    1. Compile
    2. Run
      1. Run on IDE
      2. Run on Spark Cluster
    3. Setup SolrCloud
  4. Examples
  5. Configuration and Tuning
  6. Questions
  7. References

General Architecture

[Figure: Overview1]

Current Implementation

Selective Search is written in Scala. It builds on the Apache Spark MLlib libraries for unsupervised clustering and distributed computing, and on the Apache Solr libraries for search and information retrieval.

Technology Stack

Version Compatibility

  1. Java JDK 8 version “1.8.0_131”

  2. Scala 2.12.2

  3. Apache Spark version 2.1.1

  4. Apache Solr 6.6.2

  5. Spark-Solr 3.3.0

  6. Apache Maven 3.5.0

  7. Apache Ivy 2.4.0

  8. Apache Ant version 1.10.1

Implementation Architecture

[Figure: Overview2]

Features

Getting Started

Compile

To compile the current version of selective-search, you need the software listed under Version Compatibility installed on your machine.

To verify that the software listed above is installed on your machine, check the version reported by each tool.
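As a minimal sketch, assuming each tool is on your PATH under its usual command name, a loop like the following reports which of the build tools and runtimes are available. The helper name `check_tool` is hypothetical and not part of selective-search. Spark and Solr are typically run from their own install directories, so they are left out here.

```shell
#!/usr/bin/env bash
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper, not part of selective-search.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Build tools and runtimes from the Version Compatibility list.
for tool in java scala mvn ant; do
  check_tool "$tool" || true
done

# For each tool that is found, print its version and compare it
# with the Version Compatibility list:
#   java -version
#   scala -version
#   mvn -version
#   ant -version
```

A tool reported as missing needs to be installed, or added to PATH, before compiling.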

After verifying the required software setup, download the source code and compile it.

Run

To run the selective-search project on localhost, Apache SolrCloud must be configured. If you have not configured it yet, follow the instructions in the Setup SolrCloud section.

After SolrCloud is set up, there are two ways to run selective-search: set it up in an IDE (IntelliJ or Eclipse), or launch a Spark cluster and execute the job on it.

1. Run on IDE

2. Run on Spark Cluster

Configure a Spark cluster on localhost.
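One way to do this is Spark's standalone mode, which ships start scripts under `sbin/`. The sketch below is an assumption about the setup rather than the project's documented procedure: `SPARK_HOME` is a placeholder for wherever Spark is unpacked, the master is assumed to listen on the default port 7077, and the `run` and `start_local_cluster` helpers are hypothetical. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Assumption: SPARK_HOME is where Spark is unpacked (placeholder default).
SPARK_HOME="${SPARK_HOME:-/path/to/spark}"
# Standalone masters listen on port 7077 by default.
MASTER_URL="${MASTER_URL:-spark://$(hostname):7077}"
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 runs them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

start_local_cluster() {
  # Start the standalone master, then one worker registered with it.
  # start-slave.sh is the worker script name in Spark 2.x.
  run "$SPARK_HOME/sbin/start-master.sh"
  run "$SPARK_HOME/sbin/start-slave.sh" "$MASTER_URL"
}

start_local_cluster
```

The master's web UI (port 8080 by default) shows the exact `spark://` URL to pass to `--master` when submitting a job.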

Run the selective search project on the Spark cluster:

    nohup ./bin/spark-submit \
      --master spark://RajaniM-1159:7077 \
      --num-executors 2 \
      --executor-memory 8g \
      --driver-memory 12g \
      --conf "spark.rpc.message.maxSize=2000" \
      --conf "spark.driver.maxResultSize=2000" \
      --class com.sfsu.cs.main.TopicalShardsCreator \
      /Users/rajanishivarajmaski/selective-search/target/selective-search-1.0-SNAPSHOT.jar \
      TopicalShardsCreator \
      -zkHost localhost:9983 \
      -collection word-count \
      -warcFilesPath /Users/rajani.maski/rm/cluweb_catb_part/ \
      -dictionaryLocation /Users/rajani.maski/rm/spark-solr-899/ \
      -numFeatures 25000 \
      -numClusters 50 \
      -numPartitions 50 \
      -numIterations 30 &

Setup SolrCloud

Selective Search requires a Solr collection that uses the implicit routing strategy.
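With the implicit router, shards are named explicitly when the collection is created, and Solr does not hash documents across them. As a rough sketch, the Solr Collections API `CREATE` action accepts `router.name=implicit` together with a comma-separated `shards` list. The helper `implicit_collection_url`, the shard names `shard1`, `shard2`, ... and the Solr address are assumptions for illustration; the shard names the project actually expects may differ. The sketch also assumes a configset for the collection already exists in ZooKeeper.

```shell
#!/usr/bin/env bash
# Assumption: Solr runs locally on its default port 8983.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Print a Collections API URL that creates a collection with the implicit router.
# Usage: implicit_collection_url <collection> <number-of-shards>
# Hypothetical helper; shard names shard1..shardN are placeholders.
implicit_collection_url() {
  collection="$1"
  num_shards="$2"
  shards=""
  i=1
  while [ "$i" -le "$num_shards" ]; do
    # Append "shardN", comma-separated after the first entry.
    shards="${shards:+$shards,}shard$i"
    i=$((i + 1))
  done
  # maxShardsPerNode is raised so a single node can host every shard.
  echo "${SOLR_URL}/admin/collections?action=CREATE&name=${collection}&router.name=implicit&shards=${shards}&replicationFactor=1&maxShardsPerNode=${num_shards}"
}

implicit_collection_url word-count 3
# To send the request to a running SolrCloud node:
#   curl "$(implicit_collection_url word-count 3)"
```

The collection name `word-count` matches the `-collection` argument in the spark-submit command above. If each cluster is meant to map to one shard, the shard count would presumably match `-numClusters`, but that mapping is an assumption here.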

Visualization

Examples

Follow the steps listed below to run selective search on any other (custom) dataset.

Configuration and Tuning

Required configurations

Tuning

Troubleshooting

References

  1. Anagha Kulkarni. 2015. Selective Search: Efficient and Effective Large-scale Search. ACM Transactions on Information Systems, 33(4). ACM.
  2. Anagha Kulkarni. 2010. Topic-based Index Partitions for Efficient and Effective Selective Search. 8th Workshop on Large-Scale Distributed Systems for Information Retrieval.
  3. Rolf Jagerman, Carsten Eickhoff. Web-scale Topic Models in Spark: An Asynchronous Parameter Server. 2016.
  4. Clueweb09 dataset. Lemur Project.
  5. 20Newsgroups. Jrennie. qwone.com/~jason/20Newsgroups
  6. Mon Shih Chuang and Anagha Kulkarni. 2017. Improving Shard Selection for Selective Search. Proceedings of the Asia Information Retrieval Societies Conference. November 2017.
  7. L. Si and J. Callan. 2003. Relevant Document Distribution Estimation Method for Resource Selection. In Proceedings of the Twenty Sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Toronto: ACM.