How do you extract a HTML table and add a new column with constant values from an earlier <strong> tag?

2020-05-23 python html beautifulsoup robobrowser

I'm attempting to extract a series of tables from an HTML document and append a new column with a constant value from a tag being used as a header. The idea would then be to make this new three column table a dataframe. below is the code i've come up with so far. I.e. each table would have a third column where all the row values would equal either AGO, DPK, ATK, or PMS depending which header precedes the series of tables. Would be grateful for any help as i'm new to python and HTML. Thanks a mill!

import pandas as pd
from bs4 import BeautifulSoup
from robobrowser import RoboBrowser

br = RoboBrowser()
br.open("https://oilpriceng.net/03-09-2019")

table = br.find_all('td', class_='vc_table_cell')

for element in table:
    data = element.find('span', class_='vc_table_content')
    prod_name = br.find_all('strong')
    ago = prod_name[0].text
    dpk = prod_name[1].text
    atk = prod_name[2].text
    pms = prod_name[3].text
    if br.find('strong').text == ago:
        data.append(ago.text)
    elif br.find('strong').text == dpk:
        data.append(dpk.text)
    elif br.find('strong').text == atk:
        data.append(atk.text)
    elif br.find('strong').text == pms:
        data.append(pms.text)
    print(data.text)

df = pd.DataFrame(data)

The result i'm hoping for is to go from this

                AGO

Enterprise     Price
Coy A          $0.5/L
Coy B          $0.6/L
Coy C          $0.7/L

to the new table below as a dataframe in Pandas

Enterprise     Price            Product
Coy A          $0.5/L           AGO
Coy B          $0.6/L           AGO
Coy C          $0.7/L           AGO

and to repeat the same thing for other tables with DPK, ATK and PMS information

Answers

I hope I understood your question right. This script will scrape all tables found in the page into the dataframe and save it to csv file:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://oilpriceng.net/03-09-2019/'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data, last = {'Enterprise':[], 'Price':[], 'Product':[]}, ''
for tag in soup.select('h1 strong, tr:has(td.vc_table_cell)'):
    if tag.name == 'strong':
        last = tag.get_text(strip=True)
    else:
        a, b = tag.select('td')
        a, b = a.get_text(strip=True), b.get_text(strip=True)
        if a and b != 'DEPOT PRICE':
            data['Enterprise'].append(a)
            data['Price'].append(b)
            data['Product'].append(last)

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv')

Prints:

            Enterprise         Price Product
0            AVIDOR PH        ₦190.0     AGO
1            SHORELINK                   AGO
2    BULK STRATEGIC PH        ₦190.0     AGO
3                  TSL                   AGO
4              MASTERS                   AGO
..                 ...           ...     ...
165             CHIPET        ₦132.0     PMS
166               BOND                   PMS
167           RAIN OIL                   PMS
168               MENJ        ₦133.0     PMS
169              NIPCO  ₦ 2,9000,000     LPG

[170 rows x 3 columns]

The data.csv (screenshot from LibreOffice):

enter image description here

Related