
Collecting data for analysis – scraper

When dealing with Machine Learning problems, we often face the challenge of obtaining valuable data. We do not always have access to professional databases populated with exactly the content we care about.

The solution is to build a machine (a scraper) that downloads the content we are interested in from the Internet.

While writing an article (available only in Polish, because it analyzes details specific to that language) on how to teach a program to understand Polish, I needed a large number of Polish texts to train a model of colloquial Polish speech.

To this end, I wrote an application that bulk-downloaded entries from one of the popular websites aggregating opinions about companies and saved them in a database. I also used it to build a database of apartments currently listed for sale in Warsaw.

The source code is available at this address.

Let’s analyze how it works

Most of the downloading logic lives in the main application file, in the Main class, in the run start-up method. From time to time we will reach for other classes in which we keep additional, useful methods.

Let’s create an instance of the http_connect_class class

from modules import http_request

http_request_obj = http_request.http_connect_class()

And let’s download the contents of the entire page into the soup variable

import urllib.request
from bs4 import BeautifulSoup

def get_html(self, url):
    # Fetch the page and hand the response to BeautifulSoup for parsing
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    soup = BeautifulSoup(response, 'html.parser')
    return soup

soup = http_request_obj.get_html(url)

In the portal’s HTML code, the star rating and the review text look like this:

<span class='review positive'>5</span>
 
<div class='text'>
  <span>Przesyłka na czas. Produkt zgodny z opisem. Polecam :)</span>
</div>

or:

<span class='review negative'>1</span>
 
<div class='text'>
  <span>Bateria w telefonie trzyma pół dnia. Nie polecam!</span>
</div>

Using the BeautifulSoup library, we parse the code and build lists containing the rating and the review text. A bit of trivia: the library’s name comes from the fact that it also copes well with so-called “tag soup”, i.e. sloppily written HTML full of malformed tags.

from bs4 import BeautifulSoup
import re

rating_star_list = soup.findAll('span', {'class': re.compile("^review")})
text_review_list = soup.findAll("div", {"class": "text"})

Because the regular expression matches every class name starting with review, both review negative and review positive entries end up in the list.
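To pair each rating with its review text, we can zip the two lists and pull plain text out of the tags. A minimal sketch on hypothetical markup in the portal’s format (the sample HTML below is my own, not the portal’s real code):

```python
from bs4 import BeautifulSoup
import re

# Hypothetical snippet in the portal's format, for illustration only
html = """
<span class='review positive'>5</span>
<div class='text'><span>Przesyłka na czas. Polecam :)</span></div>
<span class='review negative'>1</span>
<div class='text'><span>Nie polecam!</span></div>
"""

soup = BeautifulSoup(html, 'html.parser')
rating_star_list = soup.findAll('span', {'class': re.compile("^review")})
text_review_list = soup.findAll("div", {"class": "text"})

# zip() pairs the n-th rating with the n-th review text
for star, text in zip(rating_star_list, text_review_list):
    rating_star = int(star.get_text())
    text_review = text.get_text(strip=True)
    print(rating_star, text_review)
```

Note that this pairing relies on the rating and text elements appearing in the same order on the page, which holds for markup shaped like the examples above.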

Now we create a new instance of the db_sql_lite_class class,

from modules import sqlite

db_obj = sqlite.db_sql_lite_class()

and open a handy SQLite database

import sqlite3

def open_database(self, db_path):
    try:
        conn = sqlite3.connect(db_path)
    except sqlite3.Error:
        # Without a connection there is nothing useful to return
        print("Blad otwarcia bazy:", db_path)
        raise
    return conn

conn = db_obj.open_database(db_path)
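The INSERT used further below assumes that a reviews table already exists in the database. The article does not show its schema, so here is a plausible two-column sketch (the column names are my assumption, not taken from the repository):

```python
import sqlite3

# Assumed schema -- the linked repository may use different column names
conn = sqlite3.connect(":memory:")  # in the real application: sqlite3.connect(db_path)
conn.execute("""
    CREATE TABLE IF NOT EXISTS reviews (
        rating_star INTEGER,
        text_review TEXT
    )
""")
conn.commit()
```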

And we save data from the portal stored in variables.

def save_data_in_database(self, conn, db_path, rating_star, text_review):
    try:
        # A parameterized query avoids quoting problems and SQL injection
        sql = "INSERT INTO reviews VALUES (?, ?)"
        c = conn.cursor()
        c.execute(sql, (int(rating_star), text_review))
        conn.commit()
    except sqlite3.Error:
        print("Blad zapisu do bazy", db_path)

db_obj.save_data_in_database(conn, db_path, rating_star, text_review)

The scraper’s work is done and the data has been saved to the database.
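As a quick sanity check we can read the rows back. A self-contained sketch on a throw-away in-memory database (in the real application we would connect to db_path instead):

```python
import sqlite3

# Throw-away database standing in for the real db_path file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (rating_star INTEGER, text_review TEXT)")
conn.execute("INSERT INTO reviews VALUES (?, ?)", (5, "Polecam :)"))
conn.commit()

rows = conn.execute("SELECT rating_star, text_review FROM reviews").fetchall()
print(rows)  # → [(5, 'Polecam :)')]
```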