BeautifulSoup <b>bold</b> tag fail
I have an HTML document that contains:
<b> <p align="left">TXT1</p> </b> <p align="left"> <b>NR1</b> <b>TXT2</b> TXT3 <b>TXT4</b> TXT5 </p>
When I do:
import urllib
from BeautifulSoup import BeautifulSoup

html = urllib.urlopen('url')
htmlr = html.read()
soup = BeautifulSoup(htmlr)
print soup
I get something different:
<p align="left">TXT1</p> <p align="left">NR1 <b>TXT2</b> TXT3 <b>TXT4</b> TXT5</p>
I am analyzing the layout of HTML documents, so losing tags is quite frustrating. Why is this happening, and what's the best way to stop it? Help much appreciated!
EDIT: I need to handle badly formed HTML documents for information extraction purposes. If their creator wanted some text to be rendered bold, I have to take that into account, even if the person wrote invalid HTML.
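To make the requirement concrete, here is a minimal sketch of reading the bold intent straight from the tag stream, without letting a tree builder repair the markup first. It is not BeautifulSoup: it swaps in the event-based `html.parser` module from the Python 3 standard library. The names `BoldTracker` and `bold_chunks` are hypothetical, and the sketch assumes only `<b>` marks bold text (not `<strong>` or CSS).

```python
from html.parser import HTMLParser


class BoldTracker(HTMLParser):
    """Collect (text, is_bold) pairs without repairing the markup."""

    def __init__(self):
        super().__init__()
        self.bold_depth = 0  # number of <b> tags currently open
        self.chunks = []     # (text, is_bold) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Assumption: only <b> counts as bold in this sketch.
        if tag == "b":
            self.bold_depth += 1

    def handle_endtag(self, tag):
        # Ignore stray </b> tags that have no matching opener.
        if tag == "b" and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.chunks.append((text, self.bold_depth > 0))


def bold_chunks(markup):
    """Return the text chunks of markup, each flagged as bold or not."""
    parser = BoldTracker()
    parser.feed(markup)
    parser.close()
    return parser.chunks


# The snippet from the question, including the invalid <b><p>...</p></b>.
sample = ('<b> <p align="left">TXT1</p> </b> <p align="left"> <b>NR1</b> '
          '<b>TXT2</b> TXT3 <b>TXT4</b> TXT5 </p>')

for text, is_bold in bold_chunks(sample):
    print(text, is_bold)
```

On the snippet above this reports TXT1, NR1, TXT2 and TXT4 as bold and TXT3 and TXT5 as not bold. Because the parser only emits events and never builds a tree, the `<b>` wrapped around a `<p>` still counts even though that nesting is invalid.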