ECE-491 : Object-Oriented Programming for Computer Engineers. (Spring 2004).
Term Project: Java-based Indexer module for Multimedia indexing and searching. [JavaDocs] [Readme.txt] [Class Presentation]
Objective: This program indexes a set of multimedia files based on their associated HTML content. Such an index plays a central role in multimedia search engines, which consult it to answer multimedia queries. The index is stored in an online MySQL database, and a modified version of the general-purpose TextToOnto Java library is used for indexing. For more in-depth information about the multimedia indexer, click here.
Testing procedure:
Follow these steps to test the code:
1. Download the Source Code and unzip its contents into C:\
2. Change to the c:\admire\bin directory using the command cd c:\admire\bin
3. Execute setClassPath.bat. This sets all the classpaths that are necessary for execution.
4. Execute the command: java Indexer. This indexes the sample files present in c:\admire\data and puts the index terms in a MySQL database server running on http://mia.ece.uic.edu.
5. To check that the databases are actually being populated: (a) Go to http://mia.ece.uic.edu/phpMyAdmin. You will need the user name and password that have been mailed to you separately. (b) Open the INDEXDB database and browse it to see the actual indexes.
Indexer: The Indexer module in Admire is responsible for indexing the video clips based on their descriptions in the respective HTML files. The module first generates a corpus from all the HTML files to be indexed on each node. The corpus is built by reading in files of type HTML and storing them in a hash map that uses the file name as the key and the whole content of the HTML file as the value. As each new HTML file is read in, the hash of its contents is compared against those of the files already read. If the new file hashes to the same bin as any earlier one, the two files (with different file names) are assumed to have the same contents, and the new file is not added to the list of files to be indexed.
Once a list of distinct files (based on HTML content) is generated, the module parses the HTML file contents to extract the poster (a gif file). Note that a single HTML file normally contains many gif files, so the one that represents the movie contents has to be singled out. This is accomplished by looking for a specific pattern in the “href” attribute of an anchor in the HTML file. Once that unique anchor is located, the “IMG” tag is searched for, and finally the value of its “src=” attribute is extracted as the gif file name of the poster.
The file names of the HTML files and the corresponding file names of the poster (.gif) files are stored in the online database at the centralized server. The server runs a MySQL database containing two tables, FILES_TABLE and INDEX_TABLE. The JDBC API is used to connect to the database.
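The duplicate check and the poster lookup described above can be sketched as follows. This is only an illustration, not the Admire code: the class name CorpusSketch, both method names, and the idea of identifying the movie anchor by a marker string in its href (".mpg" in the example) are assumptions, since the actual pattern is not given here. The sketch also compares full contents through a hash set, which is stricter than relying on a hash collision alone.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CorpusSketch {

    // Keeps only the first file seen for each distinct HTML content.
    // Input and output map file name -> whole HTML content.
    static Map<String, String> distinctByContent(Map<String, String> filesToHtml) {
        Set<String> seenContents = new LinkedHashSet<>();
        Map<String, String> distinct = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : filesToHtml.entrySet()) {
            // add() returns false when an equal content string was already stored
            if (seenContents.add(entry.getValue())) {
                distinct.put(entry.getKey(), entry.getValue());
            }
        }
        return distinct;
    }

    // Finds an anchor whose href contains hrefMarker (a hypothetical marker),
    // followed directly by an IMG tag, and returns that tag's src value.
    // Returns null when no such anchor exists.
    static String extractPoster(String html, String hrefMarker) {
        Pattern pattern = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*[\"'][^\"']*" + Pattern.quote(hrefMarker)
                + "[^\"']*[\"'][^>]*>\\s*"
                + "<img\\s[^>]*src\\s*=\\s*[\"']([^\"']+\\.gif)[\"']",
            Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        String page = "<img src=\"logo.gif\">"
            + "<a href=\"clips/movie1.mpg\"><IMG SRC=\"movie1_poster.gif\"></a>";
        files.put("movie1.html", page);
        files.put("movie1_copy.html", page); // same content, different name
        for (Map.Entry<String, String> entry : distinctByContent(files).entrySet()) {
            System.out.println(entry.getKey() + " -> "
                + extractPoster(entry.getValue(), ".mpg"));
        }
    }
}
```

A regular expression is enough for a fixed, known page layout like the one described; for arbitrary HTML a real parser would be the safer choice.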
Once the HTML file contents are read in and ready for indexing, the indexer converts the HTML content into XML content using the Simple API for XML (SAX). The XML content is then processed further by stop word elimination and word stemming. Stop words are high-frequency words that carry little semantic content, such as “a”, “the”, and “in”. Word stemming reduces each word to its stem to obtain the terms for indexing; for example, it converts “going” to “go”. The module then launches the Term Extraction phase. Based on the corpus, the TermExtractor builds the dictionary of content using the TextToOnto library. The dictionary generated this way is then pruned to remove terms that appear fewer times than a threshold. Finally, the Indexer module updates the statistics of the corpus, which internally generates the list of terms, calculates their metrics (Term Frequency, Entropy, TFIDF, etc.) and writes them to INDEX_TABLE in the server database.
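To make the term pipeline above concrete, here is a minimal sketch of stop word elimination, threshold pruning, and a TFIDF score. It does not use TextToOnto, and everything in it is an assumption made for illustration: the class name TermStatsSketch, the tiny stop word list, and the common tf * ln(N / df) weighting, which may differ from the formula the library applies. Stemming is left out because it needs a full algorithm such as Porter's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class TermStatsSketch {

    // Tiny illustrative stop word list; a real list is much longer.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "in", "of", "and", "to", "is"));

    // Lower-cases the text, splits on non-letters, and drops stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Counts each term over the whole corpus and drops terms that occur
    // fewer than minCount times.
    static Map<String, Integer> pruneRareTerms(List<List<String>> corpus, int minCount) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> doc : corpus) {
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        counts.values().removeIf(count -> count < minCount);
        return counts;
    }

    // tf * ln(N / df): tf is the term count in doc, N the number of documents,
    // df the number of documents containing the term.
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        int tf = Collections.frequency(doc, term);
        int df = 0;
        for (List<String> other : corpus) {
            if (other.contains(term)) {
                df++;
            }
        }
        return df == 0 ? 0.0 : tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = new ArrayList<>();
        corpus.add(tokenize("The car chase in the city"));
        corpus.add(tokenize("A city at night"));
        corpus.add(tokenize("The car race"));
        System.out.println("kept terms: " + pruneRareTerms(corpus, 2));
        System.out.println("tfidf(chase) = " + tfIdf("chase", corpus.get(0), corpus));
    }
}
```

A term that occurs in every document gets a score of zero under this weighting, which is why such terms contribute little to an index even when they survive the stop word list.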