Web scraping investment fund prices with Python

In this article we will look at a relatively short Python project that reads the names, ISIN codes and historical prices of all SEB investment funds. The work is easy enough to do by hand, but I personally got bored by the fourth fund and its prices. So let's do it all automatically with the power of Python!

Why SEB investment funds?

At the moment I can think of only two reasons. Here we will do it for educational purposes, but in general we might do this if:

  • we want to invest and, of course, analyse the time series data, but have little to no money to spare for investing. We are, however, SEB's clients, and bank clients can usually buy the bank's own funds at a reduced commission and invest relatively small amounts.
  • we want to practice web scraping on some basic pages. Believe me, in finance it is hard to find web pages more basic than those of big old banks that have been around for a century. They never had a real need to upgrade their web presence (up until now…).

When I started this article I was still working at SEB, and from those times I still have a life insurance policy for which I have to decide how my money is allocated across their funds. In the future we can expect a continuation of this topic with an analysis of the time series data, as SEB itself does not offer a way to look at the historical performance of your portfolio. That task is left to me (and you). If you have a similar problem with SEB or another bank, this series of articles might be not only educational but even useful.

The Code

For those who don't care about my monologue around the code, here is the complete code:

from bs4 import BeautifulSoup
import requests
import pandas as pd

fundList = requests.get('https://ibanka.seb.lv/cgi-bin/ipank/ipank.p?act=VPFOND')
soup = BeautifulSoup(fundList.text, 'html.parser')

isins = []

for tr in soup.find_all('tr'):
    if len(tr)>10 :
        fundName = tr.find_all('td')[1].text
        sep = str(tr.a).find('isin=') + 5
        isins.append([fundName, str(tr.a)[sep:sep+12]])

funds = pd.DataFrame(isins, columns=list(['Fund Name','ISIN']))
print('List of ISINs and fund names gathered!\n')

prices = []
for isin in isins:
    temp = []
    pricePage = requests.get('https://ibanka.seb.lv/cgi-bin/ipank/ipank.p?sesskey=&lang=LAT&act=VPFONDINFO&isin=' + isin[1])
    soup = BeautifulSoup(pricePage.text, 'html.parser')

    scripts = str(soup.find_all('script'))
    txt = scripts[scripts.find('function convertDate(date) {')+159:scripts.find('initChart(min,seriesdata,isin,valuuta')-1]
    txt = txt.replace('\n','')
    txt = txt.replace('convertDate(','')
    txt = txt.replace(')','')
    txt = txt.replace('removeNegatives(','')
    txt = txt[1:txt.find(',isin')]

    temp = txt.split('],[')

    if len(temp)>1000:
        for p in temp:
            if len(p)>0:
                #Stray '[' and ']' characters left over from the split are cleaned from the price column below
                prices.append([isin[0], isin[1], str(p[1:5]) + '-' + str(p[5:7]) + '-' + str(p[7:9]), str(p[11:])])
        
    print('Prices for {} have been gathered!'.format(isin[0]))

print('\nCleaning prices...')
for p in prices:
    p[3] = p[3].replace(']','')

data = pd.DataFrame(prices, columns=list(['Fund Name','ISIN','DATE','PRICE']))
#Prices are still strings here; they are converted to numeric after the pivot below

result = data.pivot(index = 'DATE', columns = 'ISIN', values = 'PRICE')
result = result.apply(pd.to_numeric, errors='ignore')
result = result.dropna(how = 'any')

gains = result.pct_change()
correlations = gains.corr()
riskSTD = gains.std()

print('Saving data to MS Excel file')
with pd.ExcelWriter(r'D:\Development\Python Projects\FONDI\SEB funds.xlsx') as pdExcel:
    result.to_excel(pdExcel, sheet_name='DATA')
    gains.to_excel(pdExcel, sheet_name='YIELDS')
    funds.to_excel(pdExcel, sheet_name='FUND LIST')
    correlations.to_excel(pdExcel, sheet_name='CORRs')
    riskSTD.to_excel(pdExcel, sheet_name='RISKS')
    #The 'with' block saves the workbook automatically on exit, so no explicit save() call is needed

print('\nImport process completed!')

For those who want to view this code in separate parts and continue reading my monologue, I have split this fairly short code into three steps:

  • Acquisition of fund name and ISIN code
  • Reading each individual fund's page and finding prices
  • Saving prices in a DataFrame

Acquisition of fund name and ISIN code

For those who wonder what ISIN is:

The International Securities Identification Number (ISIN) is a code that uniquely identifies a specific securities issue. The organization that allocates ISINs in any particular country is the country's respective National Numbering Agency (NNA).

Source: Investopedia

So it's clear – an ISIN is a unique security identifier made up of two letters (the country code) followed by nine alphanumeric characters and a check digit, twelve characters in total. In our case it also has an added value – it is included in the web addresses of the SEB fund price pages. In other words, we will collect them because they are vital for successfully fulfilling our plan, but we will return to this in the next section on reading the individual fund pages.
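To make that format tangible, here is a tiny sketch that checks whether a string is shaped like an ISIN. The regular expression and the sample values are my own illustration – it only checks the shape of the code, not the check digit:

import re

#Two letters (country code), nine alphanumeric characters, one check digit:
isinPattern = re.compile(r'^[A-Z]{2}[A-Z0-9]{9}[0-9]$')

print(bool(isinPattern.match('LV0000400935')))   #True  - a well-formed 12-character code
print(bool(isinPattern.match('LV123')))          #False - too short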

How are SEB investment fund pages structured?

Open the SEB investment fund price page and right-click to select Inspect (or the equivalent in whatever web browser you are using) to see the page's structure from the inside. I use Google Chrome, so in my case I right-click and choose Inspect, and the developer panel opens on the right side of the window.

I am not going to run an introduction-to-HTML course here, so I will describe it in my own words. Hopefully it reads as plain English.

Here you are looking for an element that starts with <table>, which obviously marks a table on the page. Everything from <table> to </table> consists of rows, which in turn are marked as <tr> … </tr>, and between the tr tags we expect to see the cells <td> … </td> that contain the actual data. Try to expand these elements on your own and you will quickly see that we are interested in the second <td> element of each <tr> that contains information about an SEB fund. The records we are looking for should look something like this:
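The row below is a simplified, made-up illustration (the attributes and the ISIN are placeholders, the real markup differs in details), together with a minimal sketch of how the fund name and ISIN can be pulled out of such a row:

from bs4 import BeautifulSoup

#Illustrative only - a row shaped roughly like the ones on the SEB fund list page:
sample_row = '<tr><td>1</td><td><a href="ipank.p?act=VPFONDINFO&isin=LV0000400935">SEB aktīvais plāns</a></td><td>EUR</td></tr>'

tr = BeautifulSoup(sample_row, 'html.parser').tr
fundName = tr.find_all('td')[1].text      #the second <td> holds the fund name
sep = str(tr.a).find('isin=') + 5         #the ISIN starts right after 'isin='
print(fundName, str(tr.a)[sep:sep+12])    #an ISIN is always 12 characters long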

As you can see, such a record contains both the fund name and the ISIN. So we need to find every <tr> entry containing these two elements, but first we must read the page with Python:

from bs4 import BeautifulSoup
import requests
import pandas as pd

fundList = requests.get('https://ibanka.seb.lv/cgi-bin/ipank/ipank.p?act=VPFOND')
soup = BeautifulSoup(fundList.text, 'html.parser')

So here we use two modules: requests and BeautifulSoup from bs4. We use requests to pull the page into Python and save the response in the variable fundList. Then we want to turn the HTML into a more workable format, so we take the HTML text from fundList and parse it with BeautifulSoup. Don't forget to pass .text or .content of your requests response to BeautifulSoup. I must admit that this part is hard to mess up.

BeautifulSoup allows you to use a method like .find_all('tag'). It selects all HTML elements that match whatever tag you provide. The plain .find('tag') also works, but it stops at the first match. So let's write .find_all('tr') and .find_all('td'), because these two elements are the only ones that interest us. We don't need .find_all('table'), since every table is built from <tr> and <td> elements anyway.
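As a small illustration of the difference, using the soup object we created above:

#.find() stops at the first match, .find_all() collects every match:
first_row = soup.find('tr')       #a single Tag object (or None if nothing matches)
all_rows = soup.find_all('tr')    #a ResultSet with every <tr> on the page
print(len(all_rows), 'table rows found')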

isins = []

for tr in soup.find_all('tr'):
    if len(tr)>10 :
        #As mentioned, we are interested in the second td element -> find_all('td')[1]:
        fundName = tr.find_all('td')[1].text
        #We will use 'isin=' inside the link as the marker right before the ISIN:
        sep = str(tr.a).find('isin=') + 5
        isins.append([fundName, str(tr.a)[sep:sep+12]]) #+12 == the length of any ISIN

The isins list must be defined before the loop, otherwise we will get an error as soon as we use the .append() method to add data for a fund to it.
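In other words (demoList is just a throwaway name for illustration):

demoList = []                      #without this line...
demoList.append('LV0000400935')    #...this call would raise NameError: name 'demoList' is not defined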

Saving information to DataFrame

It is more convenient to work with 2D data if it’s stored in a DataFrame rather than lists or arrays. DataFrames provide various built-in operations that we can perform on given data tables:

#Storing data to DataFrame:
funds = pd.DataFrame(isins, columns=list(['Fund Name','ISIN']))
print('List of ISINs and fund names gathered!\n')
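Once the funds table sits in a DataFrame, those built-in operations are a single call away, for example:

print(funds.head())                #a quick peek at the first rows (fund name + ISIN)
print(len(funds), 'funds found')   #how many funds were collected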

Reading each individual fund's page and finding prices

By clicking on a fund, or on its link in the table from the previous section, we see that the fund page addresses have a simple structure. It is enough to append the ISIN code stored in isin[1] to the base URL and we get the address of an individual fund's page.
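For example, a minimal sketch of how such an address is put together (isin here is one entry from the isins list built earlier):

#The fund page address is just a fixed base URL with the ISIN appended:
baseUrl = 'https://ibanka.seb.lv/cgi-bin/ipank/ipank.p?sesskey=&lang=LAT&act=VPFONDINFO&isin='
fundUrl = baseUrl + isin[1]   #isin[1] holds the 12-character ISIN code
print(fundUrl)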

Structure: Where to find prices?

Same as before, we need to find the part of the HTML code that interests us. It is quite fortunate that these pages are not designed according to the latest fashion with jQuery requests or Flash. Then it would be harder to scrape them… at least for me.

When looking at the code inside Python you can use the soup.prettify() method, which indents the text block and makes it look more structured, but we can of course also inspect the page in our browser. If you quickly scroll through the code you will realise that this time we are interested in a block inside a <script> tag: it starts just after the function convertDate(date) { definition, continues with a long run of date/price pairs, and ends right before the initChart(min,seriesdata,isin,valuuta…) call.

Once you locate the information you need, define a starting point and an ending point. I would definitely not recommend slicing from character 1 to character 35 (these are random numbers, just to show an example). Maybe some readers could recommend a more efficient method of handling this, but I will use the string .find() method to locate the starting and ending points of the price data.
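A toy illustration of that approach, with made-up markers and data rather than the exact strings found on the SEB page:

pageText = 'some script code ... seriesdata=[[20200102,1.23],[20200103,1.24]]; initChart(...);'
start = pageText.find('seriesdata=') + len('seriesdata=')   #the slice starts right after the marker
end = pageText.find(';', start)                              #and ends at the next unique character
print(pageText[start:end])   # -> [[20200102,1.23],[20200103,1.24]]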

So let's start reading:

prices = [] #the list that we will amend until all prices will be read and then pass it to DataFrame

for isin in isins:
#We don't want to do data cleaning in prices[] list so we will introduce a temp
    temp = []

    pricePage = requests.get('https://ibanka.seb.lv/cgi-bin/ipank/ipank.p?sesskey=&lang=LAT&act=VPFONDINFO&isin=' + isin[1])
    soup = BeautifulSoup(pricePage.text, 'html.parser')

#Data cleaning - as fun as watching the paint dry:
    scripts = str(soup.find_all('script'))
    txt = scripts[scripts.find('function convertDate(date) {')+159:scripts.find('initChart(min,seriesdata,isin,valuuta')-1]
    txt = txt.replace('\n','')
    txt = txt.replace('convertDate(','')
    txt = txt.replace(')','')
    txt = txt.replace('removeNegatives(','')
    txt = txt[1:txt.find(',isin')]

#Converting string to list with string method .split()
    temp = txt.split('],[')

If you are reading this article in order to learn how to download information from websites with Python, I have to warn you that most of the time goes into cleaning up your data or searching for the right cleaning solution. I had to play around a bit until I got this part done right… or at least working. I wanted to find a fancier solution for this data cleaning issue, but at the end of the day I didn't care how it looks as long as it works.

As you can see, I use the txt variable to temporarily hold the piece of text that soup.find_all('script') delivered. The main task of these txt operations is to clean the text as much as possible so that we don't have to do much cleaning later on. Here we use the simple .find() search to locate start and end points based on what we saw in the HTML code. The key is to find keywords or characters that are unique to these start and end points.
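To make the idea concrete, here is a sketch of the cleaning chain on a made-up snippet shaped like the chart data (the real JavaScript on the SEB page may differ slightly):

rawSeries = '[[convertDate(20200102),1.2345],[convertDate(20200103),removeNegatives(1.2401)]]'
cleaned = rawSeries.replace('convertDate(', '').replace(')', '').replace('removeNegatives(', '')

for pair in cleaned.strip('[]').split('],['):     #one 'date,price' string per observation
    dateStr, priceStr = pair.split(',')
    print(dateStr[:4] + '-' + dateStr[4:6] + '-' + dateStr[6:8], priceStr)
# -> 2020-01-02 1.2345
# -> 2020-01-03 1.2401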
