
Application Name

Reverse Index

Summary

  • Name: Reverse Index
  • Contact Person: support-compss@bsc.es
  • Access Level: public
  • License Agreement: GPL
  • Platform: COMPSs
  • Repository: Reverse Index

Description

Given a directory, this application parses all the files in it and writes every link it finds to a result file. The files are distributed into a given number of chunks, and the chunks are processed in parallel. Once processed, the chunks are merged into a final result file; the merge tasks also run in parallel. In the result file, each link is followed by the names of the files that contain it.
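The chunk-then-merge flow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the application's own code: the function names, the simplified href regular expression (the real application relies on htmlparser.jar) and the use of a thread pool in place of COMPSs tasks are all hypothetical.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, simplified link pattern (assumption); not the htmlparser.jar logic.
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def split_into_chunks(items, n_chunks):
    """Distribute items round-robin into at most n_chunks non-empty lists."""
    chunks = [items[i::n_chunks] for i in range(n_chunks)]
    return [c for c in chunks if c]


def process_chunk(pages):
    """Build a partial index {link: set of filenames} for one chunk."""
    partial = defaultdict(set)
    for filename, html in pages:
        for link in HREF.findall(html):
            partial[link].add(filename)
    return dict(partial)


def merge(a, b):
    """Merge two partial indexes into a new one, leaving the inputs untouched."""
    merged = {link: set(files) for link, files in a.items()}
    for link, files in b.items():
        merged.setdefault(link, set()).update(files)
    return merged


def merge_group(group):
    """Merge a group of one or two partial indexes."""
    return merge(group[0], group[1]) if len(group) == 2 else group[0]


def reverse_index(pages, n_chunks):
    """pages is a list of (filename, html_text) pairs."""
    chunks = split_into_chunks(pages, n_chunks)
    with ThreadPoolExecutor() as pool:
        # Process all chunks in parallel.
        partials = list(pool.map(process_chunk, chunks))
        # Merge pairwise; the merges within one round also run in parallel.
        while len(partials) > 1:
            groups = [partials[i:i + 2] for i in range(0, len(partials), 2)]
            partials = list(pool.map(merge_group, groups))
    return partials[0] if partials else {}
```

In this sketch the partial indexes stay in memory, whereas the application writes (*.part) temporary files; the pairwise merge rounds are one possible way to let merges run in parallel.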

Arguments:

  1. Debug: if true, prints debug information
  2. Website path: path to the directory where to read the files from
  3. Chunks: number of chunks when processing files
  4. Output filename: filename for the result file where the application merges all the links found
  5. Temp directory: directory where the application writes the (*.part) temporary files
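
As a rough illustration of how the five positional arguments listed above map onto typed values, the following Python sketch validates and converts them. The class and function names are hypothetical, and the checks (exactly five arguments, at least one chunk, "true" compared case-insensitively) are assumptions rather than documented behaviour of the application.

```python
from dataclasses import dataclass


@dataclass
class ReverseIndexArgs:
    debug: bool            # 1. Debug
    website_path: str      # 2. Website path
    chunks: int            # 3. Chunks
    output_filename: str   # 4. Output filename
    temp_dir: str          # 5. Temp directory


def parse_args(argv):
    """Convert the five positional arguments, in the documented order."""
    if len(argv) != 5:
        raise ValueError("expected 5 arguments, got %d" % len(argv))
    debug_s, website_path, chunks_s, output_filename, temp_dir = argv
    chunks = int(chunks_s)  # raises ValueError if not an integer
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    # Assumption: only the literal "true" (any case) enables debug output.
    debug = debug_s.lower() == "true"
    return ReverseIndexArgs(debug, website_path, chunks, output_filename, temp_dir)
```

For example, splitting the argument string `true /YOUR_PATH_TO/test 3 /YOUR_PATH_TO/results.txt /YOUR_PATH_TO/tmp` on whitespace and passing the result to `parse_args` yields debug enabled and three chunks.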

Execution instructions

The test directory under this project contains three HTML pages that can be parsed as an example.

export CLASSPATH=$CLASSPATH:/YOUR_PATH_TO/reverseindex.jar
export CLASSPATH=$CLASSPATH:/YOUR_PATH_TO/htmlparser.jar
runcompssext --app=reverse.Reverse --project=/YOUR_PATH_TO/project.xml --resources=/YOUR_PATH_TO/resources.xml --cline_args="true /YOUR_PATH_TO/test 3 /YOUR_PATH_TO/results.txt /YOUR_PATH_TO/tmp"

Dependencies

Compilation and/or execution may require some of the jars found in the lib directory of this project:

  • activation.jar
  • commons-compress-1.4.1.jar
  • filterbuilder.jar
  • htmllexer.jar
  • htmlparser.jar
  • sitecapturer.jar
  • tar.jar
  • thumbelina.jar

Last modified on 10/26/15 09:33:51
