Jul 06 2012
Programming i html: BeautifulSoup <b>bold</b> tag fail – Stack Overflow
I have an HTML document that contains:
<b>
<p align="left">TXT1</p>
</b>
<p align="left">
<b>NR1</b>
<b>TXT2</b>
TXT3
<b>TXT4</b>
TXT5
</p>
When I do:
import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

html = urllib.urlopen('url')  # 'url' stands in for the page address
htmlr = html.read()
soup = BeautifulSoup(htmlr)
print soup
I get something different:
<p align="left">TXT1</p>
<p align="left">NR1 <b>TXT2</b> TXT3 <b>TXT4</b>
TXT5</p>
I am analyzing HTML document layout, so losing tags is quite frustrating. Why is this happening, and what's the best way to stop it? Help much appreciated!
EDIT: I need to handle badly formed HTML documents for information extraction purposes. If their creator wanted some text to be rendered bold, I have to take that into account, even if the person wrote invalid HTML.
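To make the goal concrete, here is a minimal sketch of one way to record which text falls inside a <b> element without letting a parser repair or re-nest the tags. It is not taken from the answers to this question, and it swaps BeautifulSoup for the Python 3 standard library's html.parser, which only tokenizes the markup. The names BoldTracker and bold_runs are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class BoldTracker(HTMLParser):
    """Collect (text, is_bold) pairs without repairing or re-nesting tags."""

    def __init__(self):
        super().__init__()
        self.bold_depth = 0  # number of <b> tags currently open
        self.runs = []       # (text, is_bold) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.bold_depth += 1

    def handle_endtag(self, tag):
        # Ignore stray </b> tags that have no matching opener.
        if tag == "b" and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((text, self.bold_depth > 0))


def bold_runs(html):
    """Hypothetical helper: return the (text, is_bold) pairs found in html."""
    parser = BoldTracker()
    parser.feed(html)
    parser.close()
    return parser.runs


# The markup from the question, including the <b> wrapped around a <p>.
sample = """<b>
<p align="left">TXT1</p>
</b>
<p align="left">
<b>NR1</b>
<b>TXT2</b>
TXT3
<b>TXT4</b>
TXT5
</p>"""

for text, is_bold in bold_runs(sample):
    print(text, is_bold)
```

Under these assumptions, TXT1, NR1, TXT2 and TXT4 are reported as bold and TXT3 and TXT5 as not bold, because the tracker only counts open <b> tags and never decides whether <b> is allowed to contain <p>.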
source: http://stackoverflow.com/questions/11363885/beautifulsoup-bbold-b-tag-fail