Software and Security: [Java] Extracting information from websites

Disclaimer: this tutorial was not written by me, Qkyrie. It was written by Deque, a respected member of hackforums.net.

What am I about to learn from this tutorial?

The first part covers searching for movie titles on IMDB. Because websites are heterogeneous and can't all be handled with the same code, I pay particular attention to how you can find solutions to similar problems. With this you should be able to write your own filter programs.

What do I need to know to follow this tutorial?

I assume that you are a somewhat advanced Java programmer. You should have prior experience with string handling and streams. Advanced OOP knowledge is not necessary.

If you have problems with string manipulation, I recommend working through this tutorial: http://www.esus.com/docs/GetIndexPage.jsp?uid=224

For network streams look at this site: http://oreilly.com/catalog/javaio/chapter/ch05.html

How do I use a search engine with Java?

I suggest you first look at the site whose data you want to get. What does the URL look like after starting a search? Which parameters are passed in the URL? In the case of IMDB a search URL may look like this: http://www.imdb.com/find?s=all&q=die+hard

The search option follows s=. s=all searches everything the database can give you, e.g. actors and characters. If you want to limit your search to movie titles, you have to set s=tt (you figure this out by trying). q probably stands for query. That's where you put your search string, in which the individual words are separated by +.
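As a small sketch of the URL scheme described above, the search URL can be assembled from the search option and the search string. The class name ImdbSearchUrl and the method build() are made up for this illustration; URLEncoder takes care of turning the spaces into +.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class ImdbSearchUrl {

    // Hypothetical helper: builds a search URL from a search option
    // (e.g. "all" or "tt") and a free-text search string.
    static String build(String option, String query) {
        try {
            // URLEncoder replaces spaces with '+' and escapes special characters
            return "http://www.imdb.com/find?s=" + option
                    + "&q=" + URLEncoder.encode(query.trim(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://www.imdb.com/find?s=tt&q=die+hard
        System.out.println(build("tt", "die hard"));
    }
}
```

Whether the site still accepts these parameters is something you have to check in the browser first, as described above.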

Searching Google for a solution, you may find something like this:
http://www.iks.hs-merseburg.de/~uschroet...8b320dcd1e

Code:
package com.tutego.insel.net;

import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

public class YahooSeeker
{
  public static void main( String... args ) throws Exception
  {
    // default search string, used when no arguments are given
    String search = "the green mile";

    if ( args.length > 0 )
    {
      // join all command line arguments into one search string
      search = args[ 0 ];
      for ( int i = 1; i < args.length; i++ )
        search += " " + args[ i ];
    }

    // URL-encode the search string and pass it as parameter p
    search = "p=" + URLEncoder.encode( search.trim(), "UTF-8" );
    URL u = new URL( "http://de.search.yahoo.com/search?" + search );

    // the delimiter \Z (end of input) makes next() return the whole response
    String r = new Scanner( u.openStream() ).useDelimiter( "\\Z" ).next();
    System.out.println( r );
  }
}

This works fine with the given example. But if you try to use http://www.imdb.com/find? instead of the Yahoo address, you will be blocked. You get this message:
Server returned HTTP response code: 403

Looking up the meaning of the response code (e.g. by googling "HTTP response code"), we get this answer: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

Quote: 10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.

Since we are denied access, we search for an IMDB API. Not every search engine provides an API, because most of them don't want to allow automated searches. There is no API for IMDB either, but we are allowed to download text data files: http://www.imdb.com/interfaces
These text files are not easy to handle, and the effort is disproportionate for a small program. Instead we can look for APIs written by other people that make this work easier, e.g.: http://www.deanclatworthy.com/imdb/
With the aid of this web service you can run a search by changing two lines of code:

Code:
search = "q=" + URLEncoder.encode(search.trim(), "UTF-8");
URL u = new URL("http://www.deanclatworthy.com/imdb/?" + search);

The output will look like this:

Quote:{"title":"The Green Mile","imdburl":"http:\/\/www.imdb.com\/title\/tt0120689\/","country":"USA","languages":"English ,French","genres":"Crime,Drama,Fantasy,Mystery","rating":"8.4","votes":"211313","usascreens":2875,"ukscreens":340,"year":"1999","stv":0,"series":0}

We can gather a lot of information from this with some string processing. But we get nothing more than text, and we have to trust that this service will stay available. It would be much better to get direct access to IMDB.
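To give an idea of the string processing meant here, the following sketch pulls a single quoted value such as the title out of output in the format shown above. The class JsonFieldExtractor and its method getStringField() are invented for this example. It assumes that the value is a quoted string without escaped quotes inside, so it is not a replacement for a real JSON parser.

```java
class JsonFieldExtractor {

    // Hypothetical helper: returns the text of a field written as "key":"value",
    // or null if the field is not present as a quoted string.
    static String getStringField(String json, String key) {
        String marker = "\"" + key + "\":\"";
        int start = json.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        // the value ends at the next double quote (escaped quotes are not handled)
        int end = json.indexOf('"', start);
        if (end < 0) {
            return null;
        }
        return json.substring(start, end);
    }

    public static void main(String[] args) {
        String json = "{\"title\":\"The Green Mile\",\"rating\":\"8.4\",\"year\":\"1999\",\"stv\":0}";
        System.out.println(getStringField(json, "title"));
        System.out.println(getStringField(json, "year"));
    }
}
```

Numeric fields like usascreens are not quoted in the output, so this sketch returns null for them.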

We use the class URLConnection now. Sample code and explanations are on this site: http://download.oracle.com/javase/tutori...iting.html

Code:
public static void testSearch() {
    BufferedReader in = null;
    try {
        URL url = new URL("http://www.imdb.de/find?s=tt&q=die+hard");
        URLConnection urlc = url.openConnection();
        in = new BufferedReader(new InputStreamReader(urlc.getInputStream()));
        String inputLine;
        // print the response line by line
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // in is still null if opening the stream failed
        if (in != null) {
            try {
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Again we get the response code 403. But we know that common browsers somehow get access. For this reason we pretend to be Firefox:

Code:
urlc.addRequestProperty("user-agent", "Firefox");

Our method looks like this now:

Code:
public static void testSearch() {
    BufferedReader in = null;
    try {
        URL url = new URL("http://www.imdb.de/find?s=tt&q=die+hard");
        URLConnection urlc = url.openConnection();
        // pretend to be a browser, otherwise the server answers with 403
        urlc.addRequestProperty("user-agent", "Firefox");
        in = new BufferedReader(new InputStreamReader(urlc.getInputStream()));
        String inputLine;
        // print the response line by line
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // in is still null if opening the stream failed
        if (in != null) {
            try {
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Et voilà, it's working. The output is the HTML source code of the page. Try it out if you want to see it. I won't post it here since it is quite long.
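If you want to check that the header is really set before sending anything, you can read it back from the connection object. This is only a sketch: the class UserAgentCheck and its method are made up for illustration, and it uses setRequestProperty(), which overwrites an existing value, instead of addRequestProperty(), which appends one. As far as I know, openConnection() only creates the connection object and does not contact the server yet.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

class UserAgentCheck {

    // Hypothetical helper: sets the User-Agent header on a fresh connection
    // object and returns the value that would be sent with the request.
    static String configuredUserAgent(String address, String userAgent) {
        try {
            // no network traffic happens here, the connection is not opened yet
            URLConnection urlc = new URL(address).openConnection();
            urlc.setRequestProperty("User-Agent", userAgent);
            return urlc.getRequestProperty("User-Agent");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                configuredUserAgent("http://www.imdb.com/find?s=tt&q=die+hard", "Firefox"));
    }
}
```

Keep in mind that a server may look at more than the User-Agent header, so this trick is not guaranteed to work for every site.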

How do I extract information out of the source code?

It is nice to have output, but it is not as well structured as the output of the IMDB API. Let's look at a search query with Firefox. We visit http://www.imdb.com/ and search for "die hard". There are several results. Right-click and choose to view the page source. By pressing Ctrl-F we search for the striking string "Popular Titles " and locate it in line 499.
Alternatively you can select the text in your browser, right-click, and choose to view the source of the selection.

The following shows line 499 of the source code from its beginning up to the first search result:

Code:
<p><b>Popular Titles</b> (Displaying 5 Results)<table><tr> <td valign="top"><a href="/title/tt0095016/" onClick="(new Image()).src='/rg/find-tiny-photo-1/title_popular/images/b.gif?link=/title/tt0095016/';"><img src="http://ia.media-imdb.com/images/M/MV5BMTIxNTY3NjM0OV5BMl5BanBnXkFtZTcwNzg5MzY0MQ@@._V1._SY30_SX23_.jpg" width="23" height="32" border="0"></a>&nbsp;</td><td align="right" valign="top"><img src="/images/b.gif" width="1" height="6"><br>1.</td><td valign="top"><img src="/images/b.gif" width="1" height="6"><br><a href="/title/tt0095016/" onclick="(new Image()).src='/rg/find-title-1/title_popular/images/b.gif?link=/title/tt0095016/';">Stirb langsam</a> (1988) <p class="find-aka">aka "Die Hard"&nbsp;- USA <em>(original title)</em></p>

To gather information you have to look for characteristic features of a title in the source code. Firstly, the search results are all in the line that contains the string "Popular Titles ", so you can narrow your search to that line. Secondly, you have to filter out the titles themselves. They are links and are enclosed by <a href...></a>. But you can't identify them by the link tag alone, because the corresponding pictures are links too. However, we can tell whether there is an image tag or a plain string between <a href...> and </a>.

Armed with these considerations, we write our first filter methods for testing and let our program print the results for several search queries.

Using the method contains() we find the line containing "Popular Titles ".

Code:
public void testSearch() {
    BufferedReader in = null;
    try {
        ...
        // as long as there are lines left
        while ((inputLine = in.readLine()) != null) {
            // check whether the line contains the string
            if (inputLine.contains("Popular Titles")) {
                System.out.println(inputLine);
            }
        }
    } catch (IOException e) {
        ...
}
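The filtering step can also be tried without a network connection. The following sketch reads from any Reader, so a StringReader with a few test lines works just as well as the stream from the URLConnection. The class LineFilter and the method linesContaining() are names I made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineFilter {

    // Hypothetical helper: collects all lines of the source that contain the marker.
    static List<String> linesContaining(Reader source, String marker) {
        List<String> hits = new ArrayList<String>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(source)) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                if (inputLine.contains(marker)) {
                    hits.add(inputLine);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        String page = "<html>\n<p><b>Popular Titles</b> (Displaying 5 Results)\n</html>";
        for (String line : linesContaining(new StringReader(page), "Popular Titles")) {
            System.out.println(line);
        }
    }
}
```

For the real search you would pass new InputStreamReader(urlc.getInputStream()) instead of the StringReader.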

Below I introduce several methods for string processing we will use for our task.

This method looks for a character c in a string string. The search begins at index start. If the character is found, the method returns its index; otherwise it returns null.

Code:
private Integer getIndexOfChar(String string, char c, int start) {
    Integer index = null;
    for (int i = start; i < string.length(); i++) {
        if (string.charAt(i) == c) {
            index = i;
            break;
        }
    }
    return index;
}

getSubString() uses the method above to determine a substring. The substring ranges from the character from to the character to, both exclusive (the first occurrence of each is used).

Code:
private String getSubString(String string, char from, char to) {
    String sub;
    Integer beginIndex = getIndexOfChar(string, from, 0);
    if (beginIndex == null) {
        System.err.println("BeginIndex not found");
        return null;
    }
    Integer endIndex = getIndexOfChar(string, to, beginIndex + 1);
    if (endIndex == null) {
        System.err.println("EndIndex not found");
        return null;
    }
    sub = string.substring(beginIndex + 1, endIndex);
    return sub;
}

The method below checks whether the char array input contains the char array compare starting at a given index.

Code:
private boolean containsCompareAtIndex(char[] compare, char[] input, int index) {
    // check whether the rest of input is long enough
    if (compare.length > input.length - index) {
        return false;
    }
    // compare character by character, starting at the first one
    for (int j = 0; j < compare.length; j++) {
        if (compare[j] != input[index + j]) {
            return false;
        }
    }
    return true;
}

getCompareIndex() returns the index at which the char array compare first occurs in input, or null if it does not occur.

Code:
private Integer getCompareIndex(char[] compare, char[] input) {
    Integer compareIndex = null;
    for (int i = 0; i < input.length; i++) {
        if (input[i] == compare[0] && containsCompareAtIndex(compare, input, i)) {
            compareIndex = i;
            break;
        }
    }
    return compareIndex;
}
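The helper methods above are written by hand to show how the search works. For comparison, here is a sketch of the same three operations built on String.indexOf() from the standard library. The class StringSearchDemo and its method names are invented for this example, and the methods return -1 instead of null when nothing is found, because that is what indexOf() does.

```java
class StringSearchDemo {

    // counterpart of getIndexOfChar(): index of c at or after start, or -1
    static int indexOfChar(String string, char c, int start) {
        return string.indexOf(c, start);
    }

    // counterpart of getCompareIndex(): index of the first occurrence of compare, or -1
    static int indexOfText(String input, String compare) {
        return input.indexOf(compare);
    }

    // counterpart of getSubString(): text between the first 'from'
    // and the next 'to' after it, or null if one of them is missing
    static String between(String string, char from, char to) {
        int beginIndex = string.indexOf(from);
        if (beginIndex < 0) {
            return null;
        }
        int endIndex = string.indexOf(to, beginIndex + 1);
        if (endIndex < 0) {
            return null;
        }
        return string.substring(beginIndex + 1, endIndex);
    }

    public static void main(String[] args) {
        String link = "<a href=\"/title/tt0095016/\">Stirb langsam</a>";
        System.out.println(indexOfText(link, "<a href"));
        System.out.println(between(link, '>', '<'));
    }
}
```

Which variant you use is a matter of taste; the hand-written methods make every step visible, the library calls are shorter.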

This method is our goal. getTitles() returns the list of titles found in inputLine (see the comments for an explanation).

Code:
private List<String> getTitles(String inputLine) {
    List<String> titles = new LinkedList<String>();
    char[] compare = "<a href".toCharArray();
    Integer titleBeginIndex = null;
    do {
        char[] input = inputLine.toCharArray();
        titleBeginIndex = getCompareIndex(compare, input);
        if (titleBeginIndex != null) { // a title was probably found
            // cut off inputLine up to and including the '<' of "<a href"
            inputLine = inputLine.substring(titleBeginIndex + 1);
            // determine the substring between '>' and '<'
            String title = getSubString(inputLine, '>', '<');
            // a missing or empty substring (e.g. an image link) is not added to the list
            if (title != null && !title.equals("")) {
                titles.add(title);
            }
        }
    // go on until no more titles can be found
    } while (titleBeginIndex != null);
    return titles;
}

Now we should test our code. To do this, we have to edit our testSearch() method:

Code:
public void testSearch() {
    BufferedReader in = null;
    List<String> titles = new LinkedList<String>();
    try {
        ...
        while ((inputLine = in.readLine()) != null) {
            if (inputLine.contains("Popular Titles")) {
                titles.addAll(getTitles(inputLine));
            }
        }
        for (String string : titles) {
            System.out.println(string);
        }
    } catch (IOException e) {
        ...
}

These are the results for "blair witch":

Quote:Blair Witch Project
Book of Shadows: Blair Witch 2

That looks quite good. But there are still shortcomings. Several movies with the same title can only be distinguished by their year, which is not extracted yet. Titles with special characters are not displayed correctly. But for the first part it should be enough.
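One possible direction for the missing year, sketched here with a regular expression instead of the character search used above: the pattern matches a title link with plain text in it, followed by a year in parentheses. The class TitleYearFilter and the method getTitlesWithYear() are made up for this example, and the pattern assumes the page layout from the source line shown earlier, which may well have changed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleYearFilter {

    // A link to /title/tt<digits>/ whose content is plain text (group 1),
    // followed by a four-digit year in parentheses (group 2).
    // Image links do not match, because their content starts with '<'.
    private static final Pattern TITLE_WITH_YEAR = Pattern.compile(
            "<a href=\"/title/tt\\d+/\"[^>]*>([^<]+)</a>\\s*\\((\\d{4})\\)");

    // Hypothetical helper: returns entries of the form "Title (Year)".
    static List<String> getTitlesWithYear(String inputLine) {
        List<String> titles = new ArrayList<String>();
        Matcher matcher = TITLE_WITH_YEAR.matcher(inputLine);
        while (matcher.find()) {
            titles.add(matcher.group(1) + " (" + matcher.group(2) + ")");
        }
        return titles;
    }

    public static void main(String[] args) {
        String line = "<a href=\"/title/tt0095016/\" onClick=\"x\"><img src=\"a.jpg\"></a> 1. "
                + "<a href=\"/title/tt0095016/\" onclick=\"y\">Stirb langsam</a> (1988)";
        for (String title : getTitlesWithYear(line)) {
            System.out.println(title);
        }
    }
}
```

The special characters are not handled by this sketch either.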

Deque


Tuesday, 22 March 2011
