/*****************************************************************
Readme : ADMIRE - Agent based Distributed Multimedia Indexing and
         Retrieval Engine.
Module : Indexer.
*****************************************************************/

This folder contains the following directory structure.

1 - Bin:
    Contains the class files for the Indexer code. The following
    files are included:
        Indexer.class
        DirectoryTraverser.class
        HTMLFilter.class
        HTMLReader.class
        HTMLParser.class
        Helper.class
        Stemmer.class
        Stopper.class
        Page.class
        StackElement.class
        XMLParser.class

2 - Lib:
    Contains the jar files for the two underlying libraries used by
    the indexer. The TextToOnto library constructs an ontology from a
    given text corpus and is called directly by the indexer. Some of
    its classes have been modified, and some classes have been added,
    to fit our needs and to add the database functionality.
    TextToOnto in turn depends on some modules from the KAON API.
    The following jar files are needed:
        Texttoonto.jar
        mysql-connector-java-3.0.11-stable-bin.jar
        guibase.jar
        qtag.jar

3 - data:
    Contains all the HTML files to be indexed and searched. These
    were saved from http://archive.org/movies

4 - javadoc:
    Contains the HTML files generated by javadoc. The Indexer folder
    has all the javadocs for the Indexer module. The Texttoonto
    folder contains the javadocs for the three main Java files in the
    API that I wrote or modified.

5 - src:
    Contains the .java files for the indexer module. The Indexer
    class also provides very limited search functionality, without
    any user interface; results are simply printed with
    System.out.println. The following directory structure is
    provided.

    Module - Indexer:
        Indexer.java
        DirectoryTraverser.java
        HTMLFilter.java
        Helper.java
        Stemmer.java
        Stopper.java

    Module - TextToOnto:
        Third party library. The following two files are written by
        Faisal:
        DBEngine.java
        IndexStructure.java

/*****************************************************************
HOW TO BUILD:

The source code in src\indexer has to be linked to the jar files in
lib.

1 - Unzip the contents and put them in this folder on the local
    machine:
        c:\admire
    Please make sure that all the jar files are in the folder:
        c:\admire\lib
    Also make sure that the .class files are in the folder:
        c:\admire\bin
    Finally, there should be enough HTML files, with their
    corresponding file folders, in the folder:
        c:\admire\data

2 - If you just need to run the existing code, do the following at
    the command prompt:
        cd c:\admire\bin
        setClassPath
        java Indexer
    Output messages will be displayed showing the status of the
    Indexer execution. If you get an SQLException, there might be a
    problem with the online MySQL database at mia.ece.uic.edu.
    Please send me a mail and I'll fix it.

3 - If you need to build the indexer:
    Compile the .java files in src\indexer and redirect the output
    path to c:\admire\bin. Then follow step 2 to run the Indexer.

4 - To verify that the code is working properly:
    - Check the IndexDB database online at:
          http://mia.ece.uic.edu/phpMyAdmin
      I'll send the login and password by mail. Then click on
      Databases and on IndexDB in the left side frame. There should
      be two tables in IndexDB:
          FileTable has info about the HTML file names, GIF file
          names, MPEG file names, hashmap of HTML files and the IP
          address of the machine hosting the file.
          IndexTable has all the terms used for indexing. It has the
          term stem, term frequency, term TFIDF score and a foreign
          key to locate the file in which the term appears.
    - Change the query term to be searched:
      The query string is held in the field m_QueryStr, which is
      initialized in the constructor of Indexer:
          m_QueryStr = "war";
      If you change the query string, you'll need to rebuild the
      Indexer module as outlined in step 3.
    - Move around some of the HTML files in the data folder.
      You can put the files to be indexed in the data folder and the
      ones not to be indexed in the junk folder.

5 - If you need to build the open-source TextToOnto API:
    Compile the API. Since the API has a multi-level hierarchical
    structure, an IDE such as JBuilder may be helpful in building it.
    Once the API has been built, the edu folder needs to be jarred.
    Use the following command to jar the contents of the API:
        jar -cvf texttoonto.jar edu
    Then copy this jar file to:
        c:\admire\lib
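
Note on the TFIDF score in IndexTable:
The verification step under HOW TO BUILD mentions that IndexTable stores a term frequency and a TFIDF score for each term stem. This README does not state the exact formula the Indexer uses, so the sketch below is only an illustration of what those columns mean. It assumes the textbook weighting tf * ln(N / df). The class TfIdfSketch and its methods are hypothetical names, not part of the Indexer source, and the code is written for a current JDK rather than the one the module was originally built with.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT the Indexer's actual code.
class TfIdfSketch {

    // Assumed weighting: tf * ln(N / df). Returns 0 for non-positive inputs.
    static double tfIdf(int termFrequency, int totalDocs, int docsWithTerm) {
        if (termFrequency <= 0 || totalDocs <= 0 || docsWithTerm <= 0) {
            return 0.0;
        }
        return termFrequency * Math.log((double) totalDocs / docsWithTerm);
    }

    // Counts how often each (already stemmed) term occurs in one document.
    static Map<String, Integer> termFrequencies(List<String> stems) {
        Map<String, Integer> counts = new HashMap<>();
        for (String stem : stems) {
            counts.merge(stem, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three toy "documents", given as lists of term stems.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("war", "film", "war"),
            Arrays.asList("film", "news"),
            Arrays.asList("cartoon", "film"));
        String query = "war";

        // Document frequency: number of documents containing the query stem.
        int df = 0;
        for (List<String> doc : docs) {
            if (doc.contains(query)) {
                df++;
            }
        }

        // Print the term frequency and score of the query stem per document.
        for (int i = 0; i < docs.size(); i++) {
            int tf = termFrequencies(docs.get(i)).getOrDefault(query, 0);
            System.out.printf("doc %d: tf=%d tfidf=%.4f%n",
                              i, tf, tfIdf(tf, docs.size(), df));
        }
    }
}
```

Under this assumed weighting, a stem that appears in every file gets a score of 0, since ln(N / N) = 0, and a stem that appears in only a few files scores higher there. The scores actually stored in IndexTable may be computed differently.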