素材牛VIP会员
python beautifulsoup 如何抓取不规则表格的内容
 qi***hu  分类:Python  人气:695  回帖:3  发布于6年前 收藏

在爬一个网站数据的时候发现,旧的页面采用的表格和现在的格式不一样,这到不算大问题,只是旧式表格采用的是表格格式并不规则。因网站登陆本身需要账号,就不提供网址了。
具体如下:
新式:

旧式:

在旧式表中中,列名行与数据第一行有6个td标签,其余仅有5个td标签。
表格中的tr标签与td标签均没有特别的属性用做区分。

目前我的处理方式是:
新式:
读列名行,按顺序做一个列表例如:['厂家','备注','单位','变化']。
之后每行数据 按顺序制作成一个字典例如{'厂家':'ABC','备注':'ABC'}
然后插入到我的数据库中。
旧式:
方法类似,只是我要每行 判断cells的数量 来确定读哪部分。

我的问题是:
请问 有没有更好的办法,将表格中的数据按照格式读取出来,甚至能处理旧式表格这样的布局?

旧式表格:

<tr style="height: 16.55pt">

    <td style="border-bottom: windowtext 1.5pt double; border-left: windowtext 1.5pt double; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 9.66%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="9%">
    <div style="text-align: center; layout-grid-mode: char" align="center">产品</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
    <div style="text-align: left; layout-grid-mode: char" align="left">厂家</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
    <div style="text-align: center; layout-grid-mode: char" align="center">元<span>/公斤</span></div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
    <div style="text-align: center; layout-grid-mode: char" align="center">涨<span>/跌</span></div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
    <div style="text-align: left; layout-grid-mode: char" align="left">产地</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
    <div style="text-align: center; layout-grid-mode: char" align="center">备注</div>
    </td>
</tr>
<tr style="height: 9.3pt">
    <td rowspan="7" style="border-bottom: windowtext 1.5pt double; border-left: windowtext 1.5pt double; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 9.66%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="9%">
    <div style="text-align: center; layout-grid-mode: char" align="center">进</div>
    <div style="text-align: center; layout-grid-mode: char" align="center">口</div>
    <div style="text-align: center; layout-grid-mode: char" align="center">原</div>
    <div style="text-align: center; layout-grid-mode: char" align="center">生</div>
    <div style="text-align: center; layout-grid-mode: char" align="center">多</div>
    <div style="text-align: center; layout-grid-mode: char" align="center">晶</div>
    <div style="text-align: center; layout-grid-mode: char" align="center">硅</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
    <div align="left">WackerChemie</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
    <div align="center">480</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
    <div style="text-align: center; layout-grid-mode: char" align="center">-</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
    <div style="text-align: left; layout-grid-mode: char" align="left">德国</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
    <div style="text-align: center; layout-grid-mode: char" align="center">&nbsp;</div>
    </td>
</tr>
<tr style="height: 9.3pt">
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
    <div align="left">Hemlock Semiconductor</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
    <div align="center">450</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
    <div style="text-align: center; layout-grid-mode: char" align="center">-</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
    <div style="text-align: left; layout-grid-mode: char" align="left">美国</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
    <div style="text-align: center; layout-grid-mode: char" align="center">&nbsp;</div>
    </td>
</tr>
<tr style="height: 9.3pt">
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
    <div align="left">Tokuyama Corporation</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
    <div align="center">460</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
    <div style="text-align: center; layout-grid-mode: char" align="center">-</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
    <div style="text-align: left; layout-grid-mode: char" align="left">日本</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
    <div style="text-align: center; layout-grid-mode: char" align="center">&nbsp;</div>
    </td>
</tr>
<tr style="height: 9.3pt">
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
    <div align="left">MEMC Electronic Materials,Inc</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
    <div align="center">450</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
    <div style="text-align: center; layout-grid-mode: char" align="center">-</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
    <div style="text-align: left; layout-grid-mode: char" align="left">美国</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
    <div style="text-align: center; layout-grid-mode: char" align="center">&nbsp;</div>
    </td>
</tr>
<tr style="height: 9.3pt">
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
    <div align="left">MitsubishiPolysilicon</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
    <div align="center">460</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
    <div style="text-align: center; layout-grid-mode: char" align="center">-</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
    <div style="text-align: left; layout-grid-mode: char" align="left">日本</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
    <div style="text-align: center; layout-grid-mode: char" align="center">&nbsp;</div>
    </td>
</tr>
<tr style="height: 9.3pt">
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
    <div align="left">REC Group</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
    <div align="center">430</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
    <div style="text-align: center; layout-grid-mode: char" align="center">-</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
    <div style="text-align: left; layout-grid-mode: char" align="left">美国</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
    <div style="text-align: center; layout-grid-mode: char" align="center">&nbsp;</div>
    </td>
</tr>
<tr style="height: 14.55pt">
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 14.55pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
    <div align="left"><span style="color: black">DC Chemical</span></div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 14.55pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
    <div align="center">430</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 14.55pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
    <div style="text-align: center; layout-grid-mode: char" align="center">-</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 14.55pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
    <div style="text-align: left; layout-grid-mode: char" align="left">韩国</div>
    </td>
    <td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 14.55pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
    <div style="text-align: center; text-indent: 5.25pt; layout-grid-mode: char" align="center">&nbsp;</div>
    </td>
</tr>

。。我想把新式的表格也发出来。可惜这些里面各种属性太多了 超过限制了。

讨论这个帖子(3)垃圾回帖将一律封号处理……

Lv1 新人
真***溜 职业无 6年前#1

如果你能提供html源码,就好办了~


你的html源码不完整,补上<table>标签后,直接贴到Excel里,就能变成表格了。

如果你会Excel的话,很容易就搞定了


python3

import pandas as pd

html = 'tab.html' # 你给的table源码

#默认pd会用 lxml 解析html
df = pd.read_html(html,header=0,encoding='utf8')[0]
print(df)
df2 = df.iloc[1:,0:-1]
df2.columns = df.columns.delete(0)
df2 = df2.append(df.iloc[0,1:])
df2['产品']=df.iat[0,0].replace(' ','')
df2.insert(0,'产品',df2.pop('产品'))
df2 = df2.sort_index()
print(df2)

结果:

        产品                             厂家 元/公斤 涨/跌  产地   备注
0  进口原生多晶硅                   WackerChemie  480   -  德国  NaN
1  进口原生多晶硅          Hemlock Semiconductor  450   -  美国  NaN
2  进口原生多晶硅           Tokuyama Corporation  460   -  日本  NaN
3  进口原生多晶硅  MEMC Electronic Materials,Inc  450   -  美国  NaN
4  进口原生多晶硅          MitsubishiPolysilicon  460   -  日本  NaN
5  进口原生多晶硅                      REC Group  430   -  美国  NaN
6  进口原生多晶硅                    DC Chemical  430   -  韩国  NaN
Lv4 码徒
小***学 软件测试工程师 6年前#2

http://0594666.com/shop/shop2...
怎么提取

Lv5 码农
疯***斯 职业无 6年前#3

学一下pandas
redad_html(url,match='你想要的表格里的字符')
这样就可以直接得到你想要的表格的数据内容。非常爽。

 文明上网,理性发言!   😉 阿里云幸运券,戳我领取