Question:
How to handle overlapping keyword bracketing in Python?

Problem:

Suppose I have a list of keywords and an input string called section. I want my code to find those keywords within section and wrap them in [] square brackets. However, my keywords sometimes overlap with each other.


keywords = ["alpha", "alpha beta", "alpha beta charlie", "alpha beta charlie delta"]


To fix that, I sorted them by length so that the longer keywords would be prioritized. However, when I run my code, I sometimes get double or nested brackets (I assume it is because it still detects the shorter keywords as valid matches).


I tried this:

import re


keywords = ["alpha", "alpha beta", "alpha beta charlie", "alpha beta charlie delta"]


keywords.sort(key=len, reverse=True)


section = "alpha alpha beta alpha beta charlie alpha beta charlie delta"

section = section.replace("'", "’").replace("\"", "”")

section_lines = section.split('\n')

for i, line in enumerate(section_lines):

    if not line.startswith('#'):

        section_lines[i] = re.sub(r'-',' ',line)

        for x in range(4):

            section_lines[i] = re.sub(r'\b' + f"{keywords[x]}" + r'\b', f"[{keywords[x]}]", section_lines[i], flags=re.IGNORECASE)

            section_lines[i] = section_lines[i].replace("[[", "[").replace("]]", "]")


section = '\n'.join(section_lines)

section = section.replace("   "," ").replace("  "," ")


print(section)

Do not mind the line splits; they are for another part of the code that handles multiple lines.


I wanted: [alpha] [alpha beta] [alpha beta charlie] [alpha beta charlie delta]


but instead I got: [alpha] [alpha] beta] [alpha] beta] charlie] [alpha] beta] charlie] delta]
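
One way to see where the stray brackets come from is to record the string after each pass. The sketch below reuses the keywords and section from the attempt above and keeps only the substitution and the "[[" / "]]" cleanup; the names text and steps are introduced here for illustration.

```python
import re

keywords = ["alpha", "alpha beta", "alpha beta charlie", "alpha beta charlie delta"]
keywords.sort(key=len, reverse=True)

text = "alpha alpha beta alpha beta charlie alpha beta charlie delta"
steps = []  # the string as it looks after each pass, longest keyword first

for kw in keywords:
    # \b also matches right after "[", so text that an earlier pass
    # already bracketed is matched again by the shorter keyword.
    text = re.sub(r'\b' + kw + r'\b', f"[{kw}]", text, flags=re.IGNORECASE)
    # Collapsing "[[" removes the doubled opening bracket, but the
    # closing brackets are not adjacent, so they are left behind.
    text = text.replace("[[", "[").replace("]]", "]")
    steps.append(text)
    print(f"{kw!r} -> {text}")
```

After the second pass the last phrase already reads "[alpha beta charlie] delta]": the shorter keyword matched inside the brackets added by the first pass, and the cleanup only removed the doubled "[".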


Solution:

Substituting the string 4 times, once per keyword, lets each pass match text that a previous pass has already bracketed. Instead, you can join the sorted keywords into a single alternation pattern and substitute once:

import re


keywords = ["alpha", "alpha beta", "alpha beta charlie", "alpha beta charlie delta"]

keywords.sort(key=len, reverse=True)

section = "alpha alpha beta alpha beta charlie alpha beta charlie delta"

print(re.sub(rf"\b({'|'.join(keywords)})\b", r'[\1]', section))


This outputs:

[alpha] [alpha beta] [alpha beta charlie] [alpha beta charlie delta]
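
If the keywords may contain regex metacharacters, or the input has several lines with some starting with '#', the same single-pass idea could be wrapped in a helper. This is a minimal sketch, not part of the original answer: the function name bracket_keywords is hypothetical, and it assumes the case-insensitive matching and the '#' line skipping from the question are still wanted. The hyphen and quote replacements from the question are left out.

```python
import re


def bracket_keywords(text, keywords):
    """Wrap each keyword occurrence in [] in one pass, longest keyword first."""
    # Longest first, so "alpha beta" is tried before "alpha" at the same position.
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" inside a keyword literal.
    alternation = "|".join(re.escape(kw) for kw in ordered)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

    result = []
    for line in text.split("\n"):
        # Lines starting with '#' are left untouched, as in the question.
        if not line.startswith("#"):
            # m.group(0) keeps the original casing of the matched text.
            line = pattern.sub(lambda m: f"[{m.group(0)}]", line)
        result.append(line)
    return "\n".join(result)


keywords = ["alpha", "alpha beta", "alpha beta charlie", "alpha beta charlie delta"]
print(bracket_keywords("alpha alpha beta\n# alpha\nAlpha Beta Charlie", keywords))
```

Because the regex engine consumes each match before moving on, text that has been bracketed is never scanned a second time, so no cleanup of "[[" or "]]" is needed.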


Ritu Singh
