For this challenge, our team chose to work with machine learning techniques and decided to build a search engine made specifically for NASA's huge catalog of data sources.
Our search engine is implemented in Python and can be used as a RESTful API.
As our dataset list, we decided to query the CMR API that NASA provides openly. It exposes metadata, including links, for over 33,000 datasets. We wrote a JavaScript implementation that queries the CMR API page by page until every result has been fetched.
We take all of that metadata and store it either in a database or in a local file; for our example, a local file will suffice. Parsing this local file for information is the process that follows.
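Our harvester was written in JavaScript, but the same pagination loop is easy to sketch in Python. The sketch below is illustrative rather than the code we shipped: it assumes CMR's documented page_size/page_num parameters and the feed.entry layout of the collections.json response, and the fetch_all_collections helper name is ours.

```python
import json
import requests

CMR_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"
PAGE_SIZE = 2000  # the maximum page size CMR allows per request

def fetch_all_collections():
    """Walk the CMR collections endpoint page by page until it runs dry."""
    entries = []
    page_num = 1
    while True:
        resp = requests.get(CMR_URL,
                            params={"page_size": PAGE_SIZE, "page_num": page_num})
        resp.raise_for_status()
        batch = resp.json().get("feed", {}).get("entry", [])
        if not batch:  # an empty page means we've seen everything
            break
        entries.extend(batch)
        page_num += 1
    return entries

if __name__ == "__main__":
    # A local file is enough for this project; a database would also work.
    with open("collections.min.json", "w") as f:
        json.dump(fetch_all_collections(), f)
```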
As a programming language we picked Python, since its syntax is easy to read and understand. The main part of the search engine is divided into two files: createIndex_tfidf.py and queryIndex_tfidf.py.
Create Index TF-IDF
The first thing we do is read the collection of dataset metadata we gathered into memory and get it ready for preprocessing. The preprocessing stage has several phases:
For each entry in the list of datasets, we grab the title and summary and join them into a single string. From that string we then try to find the terms or keywords that best define that particular entry. During this process we take each word from the text, check how frequently it occurs in that text, and reduce it to its stem. Stemming means, for example, taking words like "Launched" or "Launching" and bringing them back to their base form, which in this case is "Launch".
After that has been done, we take all of the terms and check how many of them are stop words. A stop word is a word that occurs very frequently throughout all of the documents but is of little use to a search engine algorithm, because a query for that word would return a huge amount of unrelated data, so we filter these out. Examples of stop words: "also, and, are, but, both, by, from, for", etc.
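A minimal sketch of these two preprocessing phases, using NLTK's PorterStemmer; the extract_terms helper is our illustration, not necessarily the exact code in createIndex_tfidf.py:

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def extract_terms(title, summary, stop_words):
    """Join title and summary, tokenize, drop stop words, stem, and count."""
    text = (title + " " + summary).lower()
    words = re.findall(r"[a-z0-9]+", text)
    stems = [stemmer.stem(w) for w in words if w not in stop_words]
    return Counter(stems)  # term -> frequency within this entry

# Both inflections reduce to the same base form:
print(stemmer.stem("launched"), stemmer.stem("launching"))  # launch launch
```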
Once the preprocessing is done and we have the keywords, we start building an index of all of them. The index records every important keyword, which documents it belongs to, and the positions at which it shows up in the text. Additionally, we calculate the tf and df weights of each keyword based on its positions and counts.
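As an illustration, the in-memory structure could look like the following sketch; the names are hypothetical and the exact weighting in createIndex_tfidf.py may differ, but it shows how tf and df fall out of the recorded positions:

```python
import math
from collections import defaultdict

# term -> {doc_id: [positions where the term occurs in that document]}
index = defaultdict(dict)

def add_document(doc_id, terms):
    """Record the position of every term occurrence for one document."""
    for position, term in enumerate(terms):
        index[term].setdefault(doc_id, []).append(position)

def tf_idf(term, doc_id, total_docs):
    """Log-scaled term frequency times inverse document frequency."""
    positions = index[term].get(doc_id)
    if not positions:
        return 0.0
    tf = 1 + math.log10(len(positions))       # how often the term occurs here
    df = len(index[term])                      # how many documents contain it
    return tf * math.log10(total_docs / df)    # rarer terms weigh more
```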
By doing so we've generated a huge list of keywords that link to specific datasets, and that, in combination with one another, support even more complex search queries over the datasets. In other words, we can use this index as the backbone of a search engine.
Query Index TF-IDF
The first thing that needs to be done in order to use the newly created index is to load it into memory and process it.
Now that we know what each line contains and how the information is stored, we process the file by reading each keyword along with the links and IDs it points to, ordering them based on their TF-IDF weights. These weights take on a different meaning with every unique search term that is entered to query the database.
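The on-disk layout isn't reproduced here, so the loader below assumes a hypothetical pipe-delimited format (term, then doc:positions postings, then per-document tf values, then idf); adjust the parsing to whatever layout the index file actually uses:

```python
def load_index(path):
    """Load the index file into memory as: term -> (postings, tf list, idf).

    Assumed (hypothetical) line format:
        term|doc1:pos1,pos2;doc2:pos1|tf1,tf2|idf
    """
    index = {}
    with open(path) as f:
        for line in f:
            term, postings, tfs, idf = line.rstrip("\n").split("|")
            docs = []
            for posting in postings.split(";"):
                doc_id, positions = posting.split(":")
                docs.append((doc_id, [int(p) for p in positions.split(",")]))
            index[term] = (docs,
                           [float(tf) for tf in tfs.split(",")],
                           float(idf))
    return index
```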
After we've read, parsed, and ordered the index file accordingly, we can start issuing commands via the Python command-line interface and serving search terms.
For every search term that's entered, we use the Natural Language Toolkit (NLTK) both to remove stop words, if the term contains any, and to stem each word to its base form. This is followed by the TF and DF weight calculations for that search term, which we use to rank the results that best match that particular search and return the top 10 matching documents, in this case NASA dataset entries and articles.
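Putting it together, query handling might look like the sketch below. It reuses the hypothetical index layout from load_index above and ranks by summed tf-idf per document; the real queryIndex_tfidf.py may weight or normalize differently:

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def search(query, index, stop_words, top_k=10):
    """Normalize the query like the documents, then rank by summed tf-idf."""
    terms = [stemmer.stem(w) for w in query.lower().split()
             if w not in stop_words]
    scores = defaultdict(float)
    for term in terms:
        if term not in index:
            continue
        postings, tfs, idf = index[term]
        for (doc_id, _positions), tf in zip(postings, tfs):
            scores[doc_id] += tf * idf
    # Highest-scoring documents first; return the top 10 by default.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```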
This search engine can be improved further by utilizing more machine learning techniques, such as a relevance-feedback TF-IDF algorithm, to refine search terms and show users suggestions for what might come next, similar to Google's suggestions.
Although it's fast enough for personal use, further optimizations, like parallel processing and a smarter way to store and retrieve data, could provide a stable platform usable on a wider scale.
Usage
First check if the following files are listed in your directory:
If they're missing, you'll have to run the following command; make sure you have admin privileges on Windows or use sudo on Unix:
python createIndex_tfidf.py stopwords.dat collections.min.json testIndex.dat titleIndex.dat
You'll notice the command prompt listing all the pages and entries within the CMR API that I scraped.
Once that's finished, you can start the search engine and query the index:
python queryIndex_tfidf.py stopwords.dat testIndex.dat titleIndex.dat
This is an interactive command prompt in which you can type your queries in the following ways:
In order to close the app, simply press Ctrl+C.
SpaceApps is a NASA incubator innovation program.