Pigeon Cloud | Data Concierge

Skopje

The Challenge | Data Concierge

Develop an artificial intelligence tool to help Earth science data users and enthusiasts find datasets and resources of interest!

Pigeon Cloud NASA Search Engine

This is a search engine built on top of machine learning examples, used to query and process any dataset and to create an index of its keywords.

Pigeon Cloud

For this challenge, our team started from machine learning code examples and built a search engine made specifically for the huge list of NASA data sources.


Our search engine is implemented in Python and can be exposed as a RESTful API.

For the dataset list, we decided to query the CMR API that NASA openly provides, which exposes metadata and links for over 33K datasets. We wrote a JavaScript implementation that pages through the CMR API and collects all the results.

We take all of that metadata and store it either in a database or in a local file; for our example, a local file suffices. What follows is the process of parsing this local file for information.
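
Although our scraper is written in JavaScript, a rough Python equivalent of the same idea is sketched below. This is illustrative only: the endpoint URL, the page_size/page_num parameters, and the feed.entry field names follow the public CMR collections search, but the code is not our actual scraper.

# Rough Python sketch of the CMR scraping step (not our JavaScript code).
# Pages through the CMR collections endpoint and dumps every entry to a
# local JSON file, which is the "local file" the indexer parses later.
import json
import requests

CMR_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"

def fetch_all_collections(page_size=2000):
    entries = []
    page_num = 1
    while True:
        resp = requests.get(CMR_URL, params={"page_size": page_size, "page_num": page_num})
        resp.raise_for_status()
        batch = resp.json().get("feed", {}).get("entry", [])
        if not batch:
            break
        entries.extend(batch)
        page_num += 1
    return entries

if __name__ == "__main__":
    with open("collections.min.json", "w") as f:
        json.dump(fetch_all_collections(), f)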

As a programming language we picked Python, since its syntax is easy to read and understand. The main part of the search engine is divided into two files:

  1. createIndex_tfidf.py
  2. queryIndex_tfidf.py

Create Index TF-IDF

The first thing we do is read the collection of dataset metadata we gathered into memory and get it ready for preprocessing. The preprocessing stage has several phases:

  1. Join the Title and Summary of each entry
  2. Find Terms and do word stemming
  3. Remove StopWords

For each entry in the list of datasets, we grab the title and summary and join them into a single string. From that string we then try to find the terms or keywords that best define that particular entry. During this process we take each word from the text, check how frequently it appears in that text, and get the stem of that word. Stemming means, for example, taking words like "Launched" or "Launching" and reducing them to their base form, which in this case would be "Launch".
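
A minimal sketch of this step, using NLTK's PorterStemmer (the title/summary field names come from the CMR metadata; the regex tokenizer is our simplification, not the exact createIndex_tfidf.py code):

# Sketch of the join + term extraction + stemming phase.
import json
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def extract_terms(entry):
    # 1. Join the title and summary of the entry into one string
    text = entry.get("title", "") + " " + entry.get("summary", "")
    # 2. Lowercase, split into word tokens, and stem each one,
    #    so "Launched" and "Launching" both become "launch"
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stemmer.stem(w) for w in words]

with open("collections.min.json") as f:
    docs = json.load(f)

terms_per_doc = [extract_terms(entry) for entry in docs]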

After that has been done, we take all of the terms and check how many of them are stop words. A stop word is a word that appears very frequently throughout all of the documents but is not useful to a search engine algorithm, because a query for that word would return a huge amount of unrelated data. Examples of stop words: "also, and, are, but, both, by, from, for", etc.
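
Continuing the sketch above, stop-word removal boils down to a set lookup, assuming stopwords.dat holds one stop word per line (it is the same file passed on the command line in the Usage section):

# Filter stop words out of each document's stemmed term list.
with open("stopwords.dat") as f:
    stop_words = {line.strip() for line in f if line.strip()}

filtered_per_doc = [[t for t in terms if t not in stop_words]
                    for terms in terms_per_doc]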

Once the preprocessing is done and we have the keywords, we start building an index of all of them. The index records every important keyword, which documents it belongs to, and at which positions it shows up in the text. Additionally, we calculate the tf (term frequency) and df (document frequency) weight of each keyword based on its positions and counts.

By doing so we've generated a huge list of keywords, each linking to specific datasets, and combinations of keywords support even more complex search queries over the datasets. This means we can use the index as the core of a search engine.
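
In sketch form, the index and its weights can be pictured like this (the on-disk layout and exact weighting used by createIndex_tfidf.py may differ; the log-scaled idf shown here is the textbook definition):

# Illustrative inverted index: for every keyword, record the documents it
# appears in and the positions within them, then derive tf and df weights.
import math
from collections import defaultdict

index = defaultdict(dict)                 # term -> {doc_id: [positions]}
for doc_id, terms in enumerate(filtered_per_doc):
    for pos, term in enumerate(terms):
        index[term].setdefault(doc_id, []).append(pos)

N = len(filtered_per_doc)
tf = {t: {d: len(p) for d, p in docs.items()} for t, docs in index.items()}  # term frequency per document
df = {t: len(docs) for t, docs in index.items()}                             # document frequency
idf = {t: math.log(N / df[t]) for t in index}                                # inverse document frequency
# the tf-idf weight of term t in document d is then tf[t][d] * idf[t]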


Query Index TF-IDF

The first thing that needs to be done in order to use the newly created index is to load it into memory and process it.

Now that we know what's inside each line and how the information is stored, we process the index by reading each keyword along with the links and IDs it points to, and ordering them based on their TF-IDF weights. These weights take on a different meaning with every unique search term entered to query the database.
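
The exact line layout of testIndex.dat is whatever createIndex_tfidf.py wrote; purely as an illustration, assuming a hypothetical "term|doc_id:positions:weight|..." layout, the load-and-order step looks roughly like this:

# Hypothetical loader: the "term|doc_id:pos,pos:weight" field layout below is
# an assumption for illustration, not the real testIndex.dat format.
def parse_posting(field):
    doc_id, positions, weight = field.split(":")
    return int(doc_id), [int(p) for p in positions.split(",")], float(weight)

def load_index(path):
    index = {}                            # term -> [(doc_id, positions, weight), ...]
    with open(path) as f:
        for line in f:
            term, *postings = line.rstrip("\n").split("|")
            parsed = [parse_posting(p) for p in postings]
            # keep each term's postings ordered by descending tf-idf weight
            parsed.sort(key=lambda posting: posting[2], reverse=True)
            index[term] = parsed
    return index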

After we've read, parsed, and ordered the index file accordingly, we can start issuing commands via the Python command-line interface and start serving search terms.

With every search term that's entered, we use the Natural Language Toolkit to process it, removing any stop words it contains and stemming each word to its base form. This is followed by the TF and DF weight calculations for that search term, which we use to rank the results that best match that particular search and return the top 10 matching documents, in this case NASA datasets and articles.
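
A minimal sketch of the query path, reusing the loader above (the summed tf-idf scoring is our simplification; the exact ranking lives in queryIndex_tfidf.py):

# Normalize the query with NLTK, then score documents by the tf-idf weight of
# the matching terms and keep the ten best results.
import re
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def search(query, index, stop_words, top_k=10):
    terms = [stemmer.stem(w) for w in re.findall(r"[a-z0-9]+", query.lower())]
    terms = [t for t in terms if t not in stop_words]
    scores = defaultdict(float)
    for term in terms:
        for doc_id, _positions, weight in index.get(term, []):
            scores[doc_id] += weight
    return sorted(scores, key=scores.get, reverse=True)[:top_k]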

This search engine can be further improved by utilizing more machine learning techniques, such as a relevance-feedback TF-IDF based algorithm, to refine search terms and show users suggestions for what might follow, similar to Google's suggestions.

Although it's fast enough for personal use, further optimizations such as parallel processing and a smarter way to store and retrieve data could provide a stable platform usable on a wider scale.


Usage

First check if the following files are listed in your directory:

  • testIndex.dat
  • titleIndex.dat

If they are missing, you'll have to run the following command; make sure you have admin privileges on Windows or run it with sudo on Unix:

python createIndex_tfidf.py stopwords.dat collections.min.json testIndex.dat titleIndex.dat

You'll notice the command prompt listing all the pages and entries within the CMR API that we scraped.

Once that's done, you can start the search engine and query the index:

python queryIndex_tfidf.py stopwords.dat testIndex.dat titleIndex.dat

This is an interactive command prompt in which you can type your queries in the following ways:

  • owq=(query) - One Word Query: targets each word in the phrase individually
  • ftq=(query) - Free Text Query: queries by all the terms or keywords you've entered, similar to a full-text search
  • pq=(query) - Phrase Query: searches for a particular phrase by similarity and tf-idf weight
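
An illustrative session (the actual output format is whatever queryIndex_tfidf.py prints):

ftq=sea surface temperature
(returns the top 10 matching dataset titles and links, ranked by tf-idf weight)
pq=global precipitation measurement
(returns the datasets whose metadata contains that phrase)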

To close the app, simply press Ctrl+C.

