How Do I Use Beautifulsoup4 To Get All Text Before <br> Tag

April 22, 2024

I'm trying to scrape some data for my app. My question is I need some Here is the HTML code: This

Solution 1:

Try this; it should give you the desired output. The content variable in the script below is assumed to hold the HTML you pasted above.

from bs4 import BeautifulSoup

soup = BeautifulSoup(content,"lxml")
items = ','.join([''.join([item.previous_sibling,item.text,item.next_sibling]) for item in soup.select(".tip.info")])
data = ' '.join(items.split()).replace(",","\n")
print(data)

Output:

This is a first sentence. 
This is a second sentence. 
This is a third sentence.

Solution 2:

Then, you could go through each of the sublists, repeatedly replacing tags by turning them into soup and then getting the lists of children for these. Eventually, you will have several sublists containing only what BeautifulSoup calls 'navigable strings', which you can manipulate as usual.

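Join the elements together; then I would suggest collapsing runs of whitespace with a regex sub like this (joined_text stands for the joined string, and the replacement is a single space so that words do not run together):

import re
result = re.sub(r'\s{2,}', ' ', joined_text)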
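
The idea above can be sketched roughly as follows. This is only a minimal illustration, not the answerer's exact code: the html variable is a stand-in for the question's markup, and instead of re-parsing sublists it walks .descendants once and keeps the NavigableString nodes. It also uses \s+ rather than \s{2,} so that single newlines are normalised too.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Stand-in for the question's markup (an assumption for this sketch).
html = """<tr><td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br></td></tr>"""

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# Walk every descendant of the cell and keep only the text nodes.
strings = [node for node in td.descendants if isinstance(node, NavigableString)]

# Join the pieces, then collapse each run of whitespace into one space.
joined = "".join(strings)
result = re.sub(r"\s+", " ", joined).strip()
print(result)
```

This leaves all three sentences on one line; splitting them per <br> needs an extra step.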

Solution 3:

You can easily do this using bs4 and basic string manipulation like so:

from bs4 import BeautifulSoup

data = '''
<tr><td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br></td></tr>
'''

soup = BeautifulSoup(data, 'html.parser')
for i in soup.find_all('td'):
    # collapse all whitespace, then start a new line after each sentence
    print(' '.join(i.text.split()).replace('. ', '.\n'))

This will give as output:

This is a first sentence.
This is a second sentence.
This is a third sentence.

Solution 4:

htmlText = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""

from bs4 import BeautifulSoup
# these two steps are to put everything into one line. may not be necessary for you
htmlText = htmlText.replace("\n", " ")
while "  " in htmlText:
    htmlText = htmlText.replace("  ", " ")

# import into bs4
soup = BeautifulSoup(htmlText, "lxml")

# using https://stackoverflow.com/a/34640357/5702157
for br in soup.find_all("br"):
    br.replace_with("\n")

parsedText = soup.get_text()
while "\n " in parsedText:
    parsedText = parsedText.replace("\n ", "\n") # remove spaces at the start of new lines
print(parsedText.strip())
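
If you would rather not edit the tree or the raw HTML string, another option is to group the cell's children by <br> directly. The following is a hedged sketch, not part of the answer above: html, sentences and current are names made up for the illustration, and it assumes every sentence is terminated by a <br>, so any text after the last <br> is dropped.

```python
from bs4 import BeautifulSoup, Tag

# Stand-in for the question's markup (an assumption for this sketch).
html = """<tr><td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br></td></tr>"""

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

sentences = []  # one entry per <br> encountered
current = []    # text pieces seen since the previous <br>
for child in td.children:
    if isinstance(child, Tag) and child.name == "br":
        # A <br> closes the current group: normalise whitespace and store it.
        sentences.append(" ".join("".join(current).split()))
        current = []
    elif isinstance(child, Tag):
        # Any other tag (here the <a> links) contributes its text.
        current.append(child.get_text())
    else:
        # Plain text node directly inside the <td>.
        current.append(str(child))

print("\n".join(sentences))
```

Because the grouping is driven by the <br> tags rather than by '. ' in the text, sentences that contain periods or commas are not split in the wrong place.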
