
Remove Html Tags And Get Start/end Indices Of Marked-down Text?

I have a bunch of text that is in Markdown format: a**b**c is abc. I've got it converted to HTML tags to be more regular: abc I know there's a lot of to

Solution 1:

It looks like what you want is an HTML parser. HTML parsers are complicated, so you should use an existing library: writing your own is hard and likely to fail on many edge cases. Unfortunately, as highlighted in this question, most of the existing HTML parsing libraries do not retain position information. The good news is that one HTML parser which does reliably retain position information is in the Python standard library (see HTMLParser in the html.parser module). And since you are using Python 3, the problems that parser used to have are fixed.

A basic example might look like this:

from html.parser import HTMLParser


class StripTextParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.data = []
        super(StripTextParser, self).__init__(*args, **kwargs)

    def handle_data(self, data):
        if data.strip():
            # Only use strings which contain more than whitespace.
            startpos = self.getpos()
            # `self.getpos()` returns `(line, column)` of the start position.
            # Use that plus the length of data to calculate the end position
            # (this assumes the data does not span multiple lines).
            endpos = (startpos[0], startpos[1] + len(data))
            self.data.append((data, startpos, endpos))


def strip_text(html):
    parser = StripTextParser()
    parser.feed(html)
    return parser.data

test1 = "<sup><sup>There</sup></sup> <sup><sup>was</sup></sup> <sup><sup>another</sup></sup> <sup><sup>thread</sup></sup> <sup><sup>like</sup></sup> <sup><sup>this</sup></sup>"
print(strip_text(test1))

# Outputs: [('There', (1, 10), (1, 15)), ('was', (1, 38), (1, 41)), ('another', (1, 64), (1, 71)), ('thread', (1, 94), (1, 100)), ('like', (1, 123), (1, 127)), ('this', (1, 150), (1, 154))]


test2 = """
<ul>
<li>https://steamcommunity.com/tradeoffer/new/partner=30515749&token=WOIxg5eB</li>
<li>79</li>
<li>Why did the elephants get kicked out of the public pool?  THEY KEPT DROPPING THEIR TRUNKS! </li>
</ul>
"""
print(strip_text(test2))

# Outputs: [('https://steamcommunity.com/tradeoffer/new/partner=30515749&token=WOIxg5eB', (3, 4), (3, 77)), ('79', (4, 4), (4, 6)), ('Why did the elephants get kicked out of the public pool?  THEY KEPT DROPPING THEIR TRUNKS! ', (5, 4), (5, 95))]

test3 = "<em><strike>a</strike></em>"
print(strip_text(test3))

# Outputs: [('a', (1, 12), (1, 13))]

Without some more specific information about the format desired for the output, I just created a list of tuples. Of course, you can refactor to fit your specific needs. And if you want all of the whitespace, then remove the if data.strip(): line.
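If you need absolute character offsets into the source string rather than (line, column) pairs, one possible refactoring is sketched below. The names OffsetTextParser and text_offsets are made up for this illustration and are not part of the answer above. The sketch assumes lines are separated by "\n", and it passes convert_charrefs=False so that entities such as &amp; are never decoded into the text, which keeps len(data) equal to the length of the matching source span.

```python
from html.parser import HTMLParser


class OffsetTextParser(HTMLParser):
    """Collect non-blank text chunks with absolute offsets into the source."""

    def __init__(self, line_starts):
        # With convert_charrefs=False, entities are reported through
        # handle_entityref/handle_charref instead of being decoded into the
        # data, so they are simply left out of the chunks in this sketch.
        super().__init__(convert_charrefs=False)
        self.line_starts = line_starts  # offset of the first char of each line
        self.chunks = []                # (data, start, end), end is exclusive

    def handle_data(self, data):
        if data.strip():
            line, column = self.getpos()  # line is 1-based, column is 0-based
            start = self.line_starts[line - 1] + column
            self.chunks.append((data, start, start + len(data)))


def text_offsets(html):
    # Line 1 starts at offset 0; every later line starts right after a "\n".
    line_starts = [0] + [i + 1 for i, ch in enumerate(html) if ch == "\n"]
    parser = OffsetTextParser(line_starts)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks


sample = "<ul>\n<li>one</li>\n<li>two\nthree</li>\n</ul>"
print(text_offsets(sample))
# Outputs: [('one', 9, 12), ('two\nthree', 22, 31)]
```

Because the offsets are absolute, sample[start:end] gives back each chunk unchanged, even when the text spans more than one line.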

Solution 2:

This is the code that could be a good start for you. Hope it helps.

import sys
from html.parser import HTMLParser

line=sys.argv[1]

class MyHTMLParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Keep state on the instance; mutable class attributes would be
        # shared between all parser instances.
        self.stripped_text = ""
        self.isTag = False
        self.beginDataIndices = []
        self.endDataIndices = []
        self.global_index = 0  # current offset into stripped_text

    def handle_starttag(self, tag, attrs):
        #print("Encountered a start tag:", tag)
        self.isTag = True

    def handle_endtag(self, tag):
        #print("Encountered an end tag :", tag)
        self.isTag = False

    def handle_data(self, data):
        #print("Encountered some data  :", data)
        self.stripped_text += data
        if self.isTag:
            # Data directly after a start tag: record its span in the
            # stripped text (end index is exclusive). Nesting is not tracked.
            self.beginDataIndices.append(self.global_index)
            self.endDataIndices.append(self.global_index + len(data))
        # Advance by the length of the data, not by one character.
        self.global_index += len(data)

    def printIndices(self):
        for i in range(len(self.endDataIndices)):
            print("(%d, %d)" % (self.beginDataIndices[i], self.endDataIndices[i]))

parser = MyHTMLParser()
parser.feed(line)
print(parser.stripped_text)
parser.printIndices()
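The code above reports indices into the text with the tags removed, but a single flag cannot tell which tag a span belongs to once tags are nested. A minimal sketch of one way to extend the idea, assuming a stack of open tags, is shown here. SpanParser and strip_with_spans are hypothetical names chosen for this example, and tags that are never closed (such as a bare <br>) simply produce no span.

```python
from html.parser import HTMLParser


class SpanParser(HTMLParser):
    """Strip tags and record (tag, start, end) spans into the stripped text."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # pieces of tag-free text, joined at the end
        self.length = 0       # number of characters of stripped text so far
        self.open_tags = []   # stack of (tag, start offset) for unclosed tags
        self.spans = []       # finished (tag, start, end), end is exclusive

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the innermost open tag with this name; ignore stray end tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                _, start = self.open_tags.pop(i)
                self.spans.append((tag, start, self.length))
                break

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def strip_with_spans(html):
    parser = SpanParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.text_parts), parser.spans


print(strip_with_spans("a<b>b</b>c"))
# Outputs: ('abc', [('b', 1, 2)])
```

Spans are appended when a tag closes, so for nested markup the inner tag is listed before the outer one.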
