), it gets replaced with a Div element; this way we can set the `class` attribute to `PARA_STYLE` via Pandoc, applying the CSS style to the element.
+  - *Additionally:* an id is created for images and table headers (marked with `TF` and `TH`), so that we can link them in the second script.
+
+- On Paragraphs we also call:
+  ***Reference():*** this function records in a table the reference patterns `[number]` and `[i.number]`, relating each one to a link that points to its definition.
+- On Images we also:
+  - change the extension, so the HTML properly uses the .png images instead of the unsupported .emf extension
+  - create a yellow overlay for debugging purposes if the image's format was not originally .emf as expected.
+- On Headers:
+  Since annex headers all have level 8, we relate in a table, as we did with references, each annex name and its respective link.
+---
+**Filter 2:**
+This filter finishes the job of the first one, using the tables we built in the previous step and the TOC links Pandoc generates automatically.
+
+*First, some clean-up:*
+> Sometimes certain portions of text get encapsulated in Plain objects: e.g. `"Some text"`, instead of being `[Str("Some"), Space(), Str("text")]` as expected in Pandoc, is `[Plain(Str("Some")), Plain(Space()), Plain(Str("text"))]`.
+> This causes problems for the other functions, since they expect the first structure, so we use the function ***Normalize()*** to ensure the proper structure is respected throughout our document.
+
+After normalizing, we run the following functions on each element of the document:
+1. ***Substitute(el, word):*** this function replaces every instance of `word` followed by a pattern of type `x.x.x.x` with a link pointing to that specific instance of `word`. These links are generated via helper functions.
+*Example:* `Substitute(el, clause)` will substitute each pattern `clause x.x.x.x` with a link generated by the **ClauseLink()** function, e.g. `clause 4.4.2` will point to clause 4.4.2 in the document.
+*word* can be clause, table, figure or annex.
+
+2. ***MultipleClauses():*** similar to Substitute, but it specifically handles multiple `x.x.x.x` patterns after the word `clauses`.
+*Example:* in `clauses 4.4.1, 4.4.2, 4.4.3 and 4.4.4`, each clause number `x.x.x` will become a link to that clause.
+
+3. Every string of type `[number]` or `[i.number]` will be substituted with the link pointing to that reference's definition.
+
+---
+
+### Explanation of the visual cues of the generated HTML
+- grey background: paragraph style applied
+- vertical grey bar: HTML blockquote, due to indentation in the docx
+- highlighted text: on-demand style, i.e. the text's font is changed with respect to its paragraph style's base font
+
+# Generate docx from html
+Copy your manipulated version of ETSI_GS_skeleton.docx to the docx_to_html folder, then generate html_to_docx_output.docx:
+> `html_to_docx.py ./API`
+
+### Explanation of html_to_docx.py
+
+In the reverse step, we want to generate a docx starting from our HTML. Using Pandoc we would run into trouble with styles, so we decided to write the conversion manually.
+Basically we use BeautifulSoup4 to parse and navigate the HTML and, in combination with it, python-docx (starting from an ETSI skeleton) to write the content of each HTML tag into the docx document one by one.
+
+---
+**Recreate the styles:**
+Recreating the on-demand styles is a bit tricky, since we don't want these styles to actually be saved in the Word document.
+- ***Solution:*** we use cssutils to parse the CSS used for styling and create a list of Style objects. These objects just hold the same style properties we passed down from the original docx document to the generated HTML, plus two methods that let us apply those properties to a run or a paragraph, as in the sketch below.
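+
+As a minimal sketch of the idea (the class shape, the property set and the `style.css` filename are illustrative assumptions, not the actual implementation):
+
+```python
+import cssutils
+from docx.shared import Pt
+
+class Style:
+    """Holds a few CSS-derived properties and applies them to a python-docx run."""
+
+    def __init__(self, name: str, rule):
+        self.name = name
+        self.bold = rule.style.getPropertyValue("font-weight") == "bold"
+        self.italic = rule.style.getPropertyValue("font-style") == "italic"
+        size = rule.style.getPropertyValue("font-size")  # e.g. "10pt"
+        self.size = Pt(float(size[:-2])) if size.endswith("pt") else None
+
+    def apply_to_run(self, run):
+        # Only override what the CSS actually sets
+        if self.bold:
+            run.bold = True
+        if self.italic:
+            run.italic = True
+        if self.size:
+            run.font.size = self.size
+
+# Build a lookup table from the stylesheet shipped with the generated HTML
+styles = {
+    rule.selectorText.lstrip("."): Style(rule.selectorText.lstrip("."), rule)
+    for rule in cssutils.parseFile("style.css")
+    if rule.type == rule.STYLE_RULE
+}
+```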
+
+---
+**The main loop:**
+We iterate over each HTML tag with bs4, and on each tag we call the function `handle_tag()`: this is where most of the magic happens. We call this function recursively on each tag's children, so we can handle every individual bit of information as we please.
+
+We have some variables that keep the state of previous iterations; they will be introduced as we proceed with the explanation. The two most important ones are `run` and `para`. They keep the reference to the current paragraph/run we're working on; they're basically just pointers.
+
+- **Step 1:** identifying the type of tag. We use the `tag.name` property, which is basically what's written in the HTML, e.g. for a `<p>` tag, `tag.name` is `"p"`.
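+
+A rough, hypothetical shape of that recursion (heavily simplified; the real `handle_tag()` dispatches on many more tag names, applies styles, and tracks more state):
+
+```python
+from bs4 import BeautifulSoup, NavigableString
+from docx import Document
+
+def handle_tag(node, doc, para=None, run=None):
+    # Plain text: append a run to the current paragraph
+    if isinstance(node, NavigableString):
+        if para is None:
+            para = doc.add_paragraph()
+        run = para.add_run(str(node))
+        return para, run
+    # A <p> tag starts a new paragraph; other tags keep writing into the current one
+    if node.name == "p":
+        para = doc.add_paragraph()
+    # Recurse into the children so every individual bit of content is handled
+    for child in node.children:
+        para, run = handle_tag(child, doc, para, run)
+    return para, run
+
+doc = Document()  # The real script starts from the ETSI skeleton instead
+soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
+for tag in soup.children:
+    handle_tag(tag, doc)
+doc.save("html_to_docx_output.docx")
+```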
+def get_plaintext_from_codeblock(pre: Tag) -> list[str]:
+    """
+    Get the plaintext lines from a `<pre>` tag's `<code>` child tag, which it always has. Preserves indentation by replacing any tabs with tab placeholders, which is necessary because Pandoc trims preceding whitespace.
+ """
+ # There will only be one code tag inside the pre tag
+ code = pre.find("code")
+
+    # Get the direct children of the code tag. Each span contains an <a> tag and a series of spans representing a single line's worth of text.
+ code_children = code.find_all("span", recursive=False)
+
+ # Keep track of each child's spans' text
+ lines: list[str] = []
+
+ for code_child in code_children:
+ spans = code_child.find_all("span", recursive=False)
+ line = ""
+
+ for i, span in enumerate(spans):
+ if (
+ len(span.get_text().strip()) == 0 and i == 0
+ ): # Preserve indentation added by the first span with just whitespace
+ new_span_text = span.get_text().replace(
+ "\xa0\xa0", TAB_PLACEHOLDER
+                )  # Code blocks contain two non-breaking spaces for every tab they should have, so replace each pair with one tab placeholder
+ line = f"{line}{new_span_text}"
+
+ else: # Add each span's text to the current line, replacing non-breaking spaces with normal spaces
+ new_span_text = span.get_text().replace("\xa0", " ")
+ line = f"{line}{new_span_text}"
+
+ lines.append(line)
+ line = "" # Prepare for next line
+
+ return lines
+
+
+def handle_examples_and_notes(soup: BeautifulSoup):
+ """
+ Unwrap the inner divs within examples and notes.
+ - Examples are divs that apply the `EX` class with two child divs with no attributes
+ - Notes are divs that apply the `NO` or `TAN` classes (one or the other)
+
+ Performs the following tasks:
+ 1. Unwrap the inner divs to reduce the div's contents to the tag and its body content all at the same level
+ 2a. Conditional handling depending on the content type of the first body element
+ 1. If the content is another paragraph, simply merge it with the first paragraph (the label/tag)
+ 2. If the content is a code block, take the first line and merge it with the label/tag so the code block is level with the label tag, then add in all the other lines after it.
+ 2b. If applicable, handling of all subsequent body elements with a similar logic to the handling of the first element
+ 3. Wrap the contents of child elements, such as spans or paragraphs, that apply some style with tags containing that style name so the style can be readded during postprocessing.
+ """
+
+ def handle_first_element(soup: BeautifulSoup, children: list[Tag]):
+ """Handles the first element in the example or note. It is handled differently than other elements because the first element interacts with the example/note tag, since the body text should be on-level with the tag."""
+
+ def handle_paragraph(soup: BeautifulSoup, tag: Tag, body: Tag):
+ """Simply consolidate the body paragraph with the tag paragraph"""
+ tag_body_para = soup.new_tag("p")
+ tag_body_para.append(tag.get_text())
+
+ for child in body.children:
+ tag_body_para.append(copy.copy(child))
+
+ # Replace the two old paragraphs with the single new paragraph
+ tag.insert_before(tag_body_para)
+ tag.decompose()
+ body.decompose()
+
+ return soup
+
+ def handle_code_block(soup: BeautifulSoup, tag: Tag, body: Tag):
+ """Apply the HTML Sample style to the individual lines and merge the first line with the tag"""
+ # Get code blocks' lines
+ pre = body.find_all("pre")[0]
+ lines = get_plaintext_from_codeblock(pre)
+
+ # Make a new div for the tag and the body
+ consolidated_div = soup.new_tag("div")
+
+ # Make the new paragraph for the first line, containing the label text and the first line of the code block
+ label_text = NavigableString(
+ f"{tag.get_text()}\t"
+ ) # Add tab for indentation
+
+ first_body_span = soup.new_tag("span", attrs={"class": "HTML_Sample"})
+ first_body_span.append(lines.pop(0))
+
+ label_and_first_line_para = soup.new_tag("p")
+ label_and_first_line_para.append(label_text)
+ label_and_first_line_para.append(first_body_span)
+
+ consolidated_div.append(label_and_first_line_para)
+
+ # For the rest of the lines, add tabs to their beginnings and add them as subsequent paragraphs
+ for line in lines:
+ line_paragraph = soup.new_tag("p", attrs={"class": "HTML_Sample"})
+ line_paragraph.append(line)
+ consolidated_div.append(line_paragraph)
+
+ tag.insert_before(consolidated_div)
+ tag.decompose()
+ body.decompose()
+
+ return soup
+
+ # Existing tag and (first or only) body element
+ tag = children[0]
+ body = children[1]
+
+ if body.name == "p":
+ soup = handle_paragraph(soup, tag, body)
+
+ if body.name == "div" and body.find("pre"):
+ soup = handle_code_block(soup, tag, body)
+
+ return soup
+
+ def handle_subsequent_elements(soup: BeautifulSoup, body_elements: list[Tag]):
+ """Ensure correct indentation of all other body elements"""
+
+ def handle_codeblock(soup: BeautifulSoup, element: Tag):
+ """
+ Ensure the code block has the correct indentation by prepending a tab placeholder
+ """
+ pre = element.find_all("pre")[0]
+
+ lines: list[str] = get_plaintext_from_codeblock(pre)
+
+ codeblock_div = soup.new_tag("div", attrs={"class": "EX"})
+            for i, line in enumerate(lines):
+                # Create suffix to tell whether paragraph should have space after it; compare by index so duplicate lines don't all get the space
+                suffix = WITH_SPACE if i == len(lines) - 1 else NO_SPACE
+
+ line_paragraph = soup.new_tag(
+ "p", attrs={"class": f"HTML_Sample/{suffix}"}
+ ) # This class/style name doesn't exist, but will be normalized later on in postprocessing.
+ line_paragraph.append(line)
+
+ codeblock_div.append(line_paragraph)
+
+ pre.parent.parent.append(codeblock_div)
+ pre.decompose()
+
+ return soup
+
+ for element in body_elements:
+ if element.name == "div" and element.find("pre"):
+ handle_codeblock(soup, element)
+
+ return soup
+
+ def wrap_child_elements_with_style_tags(
+ soup: BeautifulSoup, ex_no_tan_children: list[Tag]
+ ):
+ """
+ Wrap any child elements (spans, paragraphs, etc) with tags containing their styles. This is necessary because the entire example/note div will be flattened by Pandoc and the inner styles will be lost otherwise. Later, these tags will be used after conversion to Docx to restore the styles.
+
+ #### Tag format
+
+ {{{stylename}}}[tag contents]{{{/stylename}}}
+ """
+
+ def make_tag(text: str, classname: str):
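+            # e.g. make_tag("bold text", "strong") -> "{{{strong}}}bold text{{{/strong}}}"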
+ return f"{{{{{{{classname}}}}}}}{text}{{{{{{/{classname}}}}}}}"
+
+ for child in ex_no_tan_children:
+ grandchildren: list[Tag] = child.find_all()
+
+ for grandchild in grandchildren:
+ grandchild_class = grandchild.get("class", [])
+
+ text = grandchild.get_text()
+
+ if grandchild.name in HTML_BASIC_FORMAT_TAGS and not is_reference(
+ child.parent
+ ):
+ tagged_text = make_tag(text, grandchild.name)
+
+ grandchild.clear()
+ grandchild.append(tagged_text)
+
+ if (
+ grandchild.name == "a"
+ and grandchild.has_attr("href")
+ and not is_reference(child.parent)
+ ):
+ tagged_text = make_tag(text, f'LINK|{grandchild.get("href")}')
+
+ grandchild.clear()
+ grandchild.append(tagged_text)
+
+ if len(grandchild_class) > 0:
+ grandchild_class = grandchild_class[
+ 0
+ ] # Get the class name at index 0, since there will always be only one class
+
+ tagged_text = make_tag(text, grandchild_class)
+
+ grandchild.clear()
+ grandchild.append(tagged_text)
+
+ return soup
+
+ divs = soup.find_all(
+ "div", class_=lambda cls: cls in EXAMPLE_NOTE_CLASSES if cls else False
+ )
+
+ for div in divs:
+ children: list[Tag] = div.find_all(recursive=False)
+
+ if (
+ len(children) != 2
+ or not (children[0].name == "div" and children[1].name == "div")
+ or not (len(children[0].attrs) == 0 and len(children[1].attrs) == 0)
+ ):
+ continue # Not the correct structure of examples or notes
+
+ # Step 1: Unwrap the divs
+ for child in children:
+ child.unwrap()
+
+ # Step 2a: Combine the first and second children (the tag and the (first or only) body element)
+ children: list[Tag] = div.find_all(
+ recursive=False
+ ) # Update children now that the divs were removed
+
+ if not (len(children) >= 2):
+ continue # Not a correctly formatted example or note
+
+ soup = handle_first_element(soup, children)
+
+ if len(children) == 2:
+ continue # At this point, examples and notes with only one child are done
+
+ # Step 2b
+ soup = handle_subsequent_elements(soup, children[2:])
+
+ # Step 3
+ divs = soup.find_all(
+ "div", class_=lambda cls: cls in EXAMPLE_NOTE_CLASSES if cls else False
+ ) # Refresh list with changes
+
+ for div in divs:
+ children: list[Tag] = div.find_all(recursive=False)
+
+ soup = wrap_child_elements_with_style_tags(soup, children)
+
+ return soup
+
+
+def handle_abbreviations(soup: BeautifulSoup):
+ """
+    Convert abbreviations into a docx-friendly form, adding a tab placeholder to ensure correct indentation.
+
+ #### Start
+ `[abbreviation][meaning]`
+
+ #### End
+ `[abbreviation]{{{TAB}}}[meaning]`
+ """
+ ABBREVIATION_CLASS = "EW"
+ abbreviations = soup.find_all("div", attrs={"class": ABBREVIATION_CLASS})
+
+ if len(abbreviations) == 0:
+ return soup # Nothing to do here
+
+ for abbr in abbreviations:
+ children = abbr.find_all("div")
+
+ if len(children) != 2:
+ continue # Not an abbreviation
+
+ tab = NavigableString(TAB_PLACEHOLDER)
+ abbr_div = children[0]
+
+ abbr_div.insert_after(tab)
+
+ for child in children:
+ child.unwrap()
+
+ return soup
+
+
+def convert_codeblock_styles_to_etsi(soup: BeautifulSoup):
+ """Style codeblocks to use ETSI styles rather than the styling applied by the HTML codeblock"""
+ pres = soup.find_all("pre")
+
+ for pre in pres:
+ lines: list[str] = get_plaintext_from_codeblock(pre)
+
+ new_codeblock = soup.new_tag("div")
+
+        for i, line in enumerate(lines):
+            line_para = soup.new_tag("p")
+
+            # Create a suffix for this tag specifying whether space should be added after the paragraph. Only the last paragraph in the code block should have the space added; compare by index so duplicate lines don't all get it.
+            suffix = WITH_SPACE if i == len(lines) - 1 else NO_SPACE
+
+ line_para.append(
+ f"{{{{{{HTML_Sample/{suffix}}}}}}}{line}{{{{{{/HTML_Sample/{suffix}}}}}}}"
+ )
+
+ new_codeblock.append(line_para)
+
+ pre.insert_before(new_codeblock)
+ pre.decompose()
+
+ return soup
+
+
+def cleanup_code_tags(soup: BeautifulSoup):
+ """Throughout the text, there are portions of text enclosed in `<` and `>`. Unwrap these code tags to preserve just the text. If the text is a hyperlink, replace the code tag with a hyperlink containing the link."""
+ code_tags = soup.find_all("code")
+
+ for code in code_tags:
+ text = code.text
+
+        if len(text) == 0 or not (text.startswith("<") and text.endswith(">")):
+ continue
+
+ is_link = "https://" in text[1:] or "http://" in text[1:]
+
+ if is_link:
+ text = text[1:-1] # Remove < and >
+ hyperlink = soup.new_tag("a", attrs={"href": text})
+ hyperlink.append(text)
+
+ code.replace_with(hyperlink)
+
+ elif text == "":
+ code.parent.decompose()
+
+ else:
+ plaintext = NavigableString(text)
+
+ code.replace_with(plaintext)
+
+ return soup
+
+
+def create_custom_tags_for_bold_italic_underline_styles(soup: BeautifulSoup):
+ """Some spans apply complex styles part of which involves bolding, italicizing, or underlining. Create custom tags with the class name"""
+ CSS_BIU = combine_biu_classes()
+
+ spans = soup.find_all("span", class_=lambda cls: cls in CSS_BIU if cls else False)
+
+ for span in spans:
+ cls = span.get("class")[0]
+ if cls in HANDLE_UNDERSCORE_CLASSES:
+ # The styles for these classes do not have the underscores
+ cls = cls.replace("_", " ")
+
+ text = f"{{{{{{{cls}}}}}}}{span.get_text()}{{{{{{/{cls}}}}}}}"
+
+ span.replace_with(NavigableString(text))
+
+ return soup
+
+
+def prepare_table_cell_classes(soup: BeautifulSoup):
+ """Transform table cells to prepare them for conversion to docx. Unwrap the div and apply the div's class to all the paragraphs contained therein."""
+ for cell in soup.find_all(["th", "td"]):
+ div = cell.find("div")
+
+        if not div or not div.get("class"):
+ continue
+
+ # Table notes already have their style applied to them, skip them here to avoid messing them up
+ if div.find("div", class_="TAN"):
+ continue
+
+ div_class = div.get("class")[0]
+
+ # Add a tag with the class to help apply the corresponding style during postprocessing
+ label_para = soup.new_tag("p")
+ label_para.append(NavigableString(f"{{{{{{{div_class}}}}}}}"))
+
+ div.insert(0, label_para)
+
+ # Ensure plaintext is wrapped in a paragraph so paragraph styles can be applied to it
+        for child in list(div.children):
+            try:
+                if not isinstance(child, NavigableString):
+                    continue
+
+                # Build the paragraph first; soup.new_tag("p").append(...) would return None
+                para = soup.new_tag("p")
+                para.append(child.get_text())
+                child.replace_with(para)
+            except ValueError:
+                p_warning(f"Could not add child: {repr(child)} to paragraph in table cell")
+
+ div.unwrap()
+
+ return soup
+
+
+def add_whitespace_placeholder_to_headings(soup: BeautifulSoup):
+ """Adds placeholders for tabs and newline characters in headers that must be maintained in the Docx."""
+ HEADER_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]
+ headers = soup.find_all(HEADER_TAGS)
+
+ main_headers = r"\d+(?:\.\d+){0,2}"
+ annex_headers = r"Annex\s[A-Z](?:\s(?:\(normative\)|\(informative\)))?:"
+ annex_subheaders = r"[A-Z](?:\.\d{1,2})+"
+
+ tabbed_header_regex = rf"^({main_headers}|{annex_subheaders})\s" # The headers that should have a tab offset
+ newline_header_regex = rf"^({annex_headers})\s*"
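+    # e.g. "4.2.1 Overview" -> "4.2.1{TAB_PLACEHOLDER}Overview",
+    # and "Annex B (informative): Title" -> "Annex B (informative):{NEWLINE_PLACEHOLDER}Title"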
+
+ for header in headers:
+ replacement_text = re.sub(
+ tabbed_header_regex, rf"\1{TAB_PLACEHOLDER}", header.get_text().lstrip()
+ )
+ replacement_text = re.sub(
+ newline_header_regex, rf"\1{NEWLINE_PLACEHOLDER}", replacement_text
+ )
+
+        if TAB_PLACEHOLDER in replacement_text or NEWLINE_PLACEHOLDER in replacement_text:
+            header.clear()
+            header.append(NavigableString(replacement_text))
+
+ return soup
+
+
+# endregion
+
+
+def preprocess(
+ src: str, src_type: str, excluded_html_files: list[str], consolidated_html_path: str
+):
+ """
+ ### Description
+ Preprocessing mandatory for conversion from HTML to Docx. Performs the following tasks:
+ 1. Removes the table of contents Pandoc adds to the HTML.
+ 2. Sources images from their EMF files rather than from their PNG files.
+ 3. Removes file references from links.
+ 4. Formats examples and notes.
+ 5. Converts all remaining code blocks to the ETSI style.
+ 6. Cleans up the formatting of remaining code tags containing some plaintext enclosed in `< >`
+ 7. Converts Docx paragraph styles to Docx run styles in spans
+ 8. Consolidates the HTML files.
+
+ ### Arguments
+ - `src`: The source directory containing the HTML files to convert
+ - `src_type`: The source file type, `html`
+    - `excluded_html_files`: A list of HTML filenames that, should they occur within `src`, will not be included for conversion to Docx
+    - `consolidated_html_path`: The path at which the consolidated HTML file is written
+ """
+
+ for filename in os.listdir(src):
+ if filename.endswith(src_type) and filename not in excluded_html_files:
+ # Setup and preprocessing
+ input_path = os.path.join(src, filename)
+ with open(input_path, "r", encoding="utf-8") as html:
+ soup = BeautifulSoup(html, "html.parser")
+
+ soup = remove_pandoc_toc(soup)
+ soup = change_images_to_use_high_quality(soup, src)
+ soup = modify_links(soup)
+ soup = handle_examples_and_notes(soup)
+ soup = handle_abbreviations(soup)
+ soup = convert_codeblock_styles_to_etsi(soup)
+ soup = cleanup_code_tags(soup)
+ soup = create_custom_tags_for_bold_italic_underline_styles(soup)
+ soup = prepare_table_cell_classes(soup)
+ soup = add_whitespace_placeholder_to_headings(soup)
+
+ contents = soup.decode_contents()
+
+            new_filename = re.sub(
+                r"([\w-]+?)\.html", r"--preprocessed--\1.html", filename
+            )  # Ensure file order is preserved by keeping the number in front
+ output_path = os.path.join(src, new_filename)
+
+ with open(output_path, "w", encoding="utf-8") as html_new:
+ html_new.write(contents)
+
+ handle_html_consolidation("create", src, consolidated_html_path)
diff --git a/md_to_docx_converter/src/to_html/__init__.py b/md_to_docx_converter/src/to_html/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/md_to_docx_converter/src/to_html/postprocessing.py b/md_to_docx_converter/src/to_html/postprocessing.py
new file mode 100644
index 0000000000000000000000000000000000000000..ad8d46fd778746bb3bbb65b973f4233f22095026
--- /dev/null
+++ b/md_to_docx_converter/src/to_html/postprocessing.py
@@ -0,0 +1,752 @@
+import os, re, html
+from bs4 import BeautifulSoup, Tag, NavigableString
+
+from src.utils import (
+ apply_renaming_logic,
+ get_dirty_filenames_mapping_with_expected_filenames,
+ p_error,
+)
+
+from src.constants import ABBREVIATION_CLASS
+
+normative_file = "clause-2"
+informative_file = "clause-2"
+files_with_references = [normative_file, informative_file]
+
+
+# region Helpers
+def remove_code_blocks_with_only_images(soup: BeautifulSoup):
+ """
+ Removes code elements inside paragraphs that contain only images, preserving the img elements.
+ """
+ paragraphs = soup.find_all("p")
+
+ for paragraph in paragraphs:
+ code_elements = paragraph.find_all("code")
+
+ for code_element in code_elements:
+ # Check if the code element contains only one child and it's an img
+ children = [
+ child for child in code_element.children if isinstance(child, Tag)
+ ]
+ if len(children) == 1 and children[0].name == "img":
+ img_element = children[0]
+ # Replace the code element with just the img element
+ code_element.replace_with(img_element)
+
+ return soup
+
+
+def fix_toc_links(soup: BeautifulSoup, filenames_mapping: dict):
+ """
+ Fixes the table of contents links in the HTML by updating their href attributes
+ based on the provided filenames mapping.
+ """
+ toc_links = soup.select("#TOC a")
+
+ for link in toc_links:
+ href = link.get("href", "")
+        before_hash, after_hash = href.split("#", 1) if "#" in href else (href, "")
+        if before_hash in filenames_mapping:
+            link["href"] = f"{filenames_mapping[before_hash]}#{after_hash}"
+
+ return soup
+
+
+def fix_ex_json_spacing(soup: BeautifulSoup):
+ """Removes `
` tags from the ends of lines containing JSON data. These tags add too much space between lines."""
+ examples = soup.find_all("div", class_="EX")
+
+ for example in examples:
+ for br_tag in example.find_all("br"):
+ br_tag.decompose()
+
+ return soup
+
+
+def unwrap_gt_lt_code_tags(soup: BeautifulSoup):
+ """
+ #### Functionality
+    Unwrap the code tags to preserve their contents without any changes to how they render, maintaining the `&lt;` and `&gt;` entities so that `<` and `>` render as plaintext.
+
+ #### Explanation of Necessity
+ During preprocessing, sections of text marked by a beginning `<` and an ending `>` needed to be enclosed in code blocks for Pandoc to preserve the text.
+ """
+ # codes = soup.find_all("code", lambda tag: tag.parent and tag.parent.name != "pre")
+ codes = soup.select("code:not(pre > code):not(em > code)")
+
+ for code in codes:
+ text = NavigableString(html.unescape(code.get_text()))
+ code.insert_before(text)
+ code.decompose()
+
+ return soup
+
+
+def format_references(soup: BeautifulSoup):
+ """
+    References are wrapped in the `EX` class and were impacted during the HTML-to-Markdown preprocessing.
+
+    Converts references to the expected format (a div carrying the reference id, containing a tag span and a body span) to ensure they are displayed correctly.
+ """
+ EXAMPLE_CLASS = "EX"
+ examples = soup.find_all("div", class_=EXAMPLE_CLASS, id=False)
+
+ for example in examples:
+ # For now, these example divs follow the usual format of having all the contents contained within two child divs. Unwrap these divs.
+ for child in list(example.children):
+ if child.name == "div":
+ child.unwrap()
+
+ # Setup
+ paragraphs = example.find_all("p", recursive=False)
+ references: list[Tag] = []
+
+ for paragraph in paragraphs:
+ if not paragraph.contents:
+ continue
+
+ first_child = paragraph.contents[0]
+
+ # Check that the first child is an empty , otherwise skip this paragraph
+ if not (
+ isinstance(first_child, Tag)
+ and first_child.name == "span"
+ and first_child.get("id")
+ and first_child.get_text() == ""
+ ):
+ continue
+
+ ref_id = first_child.get("id", "")
+ second_child = paragraph.contents[1]
+            tag, remaining_text = second_child.split("] ", 1)
+            tag += "] "  # Leaving this trailing space because we need to check if it is needed when generating a docx
+ tag = tag.replace("[n.", "[")
+
+ first_child.extract()
+ second_child.extract()
+
+ # Prepare new parent div and child spans
+ parent_div = soup.new_tag(
+ "div", attrs={"id": ref_id, "class": EXAMPLE_CLASS}
+ )
+ tag_span = soup.new_tag("span")
+ body_span = soup.new_tag("span")
+            body_span.append(NavigableString(remaining_text))
+
+ # Add tag
+ tag_span.append(NavigableString(tag))
+
+ # Add body contents
+ for contents in list(paragraph.contents):
+ body_span.append(contents)
+ body_span.append(NavigableString("\n"))
+
+ # Append spans to div, and div to references list
+ parent_div.append(tag_span)
+ parent_div.append(body_span)
+ references.append(parent_div)
+
+ # Switch references
+ for reference in references:
+ example.insert_before(reference)
+ example.decompose()
+
+ return soup
+
+
+def handle_ew_div(soup: BeautifulSoup):
+ """Expands a single div applying the `EW` class to a series of paragraphs, each of which containing an abbreviation and its meaning, into a series of divs each dedicated to a single abbreviation. Also adds a tab between the abbreviation and its meaning to ensure proper indentation."""
+ consolidated_abbreviations = soup.find("div", attrs={"class": ABBREVIATION_CLASS})
+
+ if not consolidated_abbreviations:
+ return soup
+
+ paragraphs = consolidated_abbreviations.find_all("p")
+
+ for para in reversed(
+ paragraphs
+ ): # Reverse the order to make sure the abbreviations end up in the correct order
+ abbreviation, meaning = para.get_text().split(" ", 1)
+
+ abbr_div = soup.new_tag("div")
+ abbr_div.append(NavigableString(abbreviation))
+
+ meaning_div = soup.new_tag("div")
+ meaning_div.append(NavigableString(meaning))
+
+ div_replacement = soup.new_tag("div", attrs={"class": ABBREVIATION_CLASS})
+ div_replacement.append(abbr_div)
+ div_replacement.append(meaning_div)
+
+ consolidated_abbreviations.insert_after(div_replacement)
+ consolidated_abbreviations.insert_after(
+ NavigableString("\n")
+ ) # For readability in the raw HTML, has no visual effect when rendered
+
+ consolidated_abbreviations.decompose()
+
+ return soup
+
+
+def format_examples_and_notes(soup: BeautifulSoup):
+ """Restores the expected HTML structure to examples and notes, which have come in from the Markdown as sequential blocks of code enclosed three times each in ``s."""
+
+ def get_label_text_and_class(para: Tag):
+ """Get the label text from the paragraph and determine the class to assign to the div"""
+ text = para.contents[0].split(":")[0] + ":"
+        remaining_text = (
+            para.contents[0].split(": ", 1)[1] if ": " in para.contents[0] else ""
+        )
+ cls = ""
+
+ if "[!tip]" in text:
+ cls = "EX"
+ else:
+ cls = "TAN" if para.find_parent("td") else "NO"
+
+ text = text.replace("[!note] ", "").replace("[!tip] ", "")
+
+ remaining_contents = para.contents[1:]
+ if remaining_text:
+ remaining_contents.insert(0, NavigableString(remaining_text))
+ return text, cls, remaining_contents
+
+ # Take only the top-level blockquotes to simplify logic
+ blockquotes = [
+ bq for bq in soup.find_all("blockquote") if not bq.find_parent("blockquote")
+ ]
+
+ for blockquote in reversed(blockquotes):
+ label_para = blockquote.find("p")
+
+ if not label_para:
+ continue
+
+ label_text, label_class, remaining_contents = get_label_text_and_class(
+ label_para
+ )
+
+ new_parent_div = soup.new_tag("div", attrs={"class": label_class})
+
+ # Process label
+ label_div = soup.new_tag("div")
+
+ new_label_para = soup.new_tag("p")
+ new_label_para.append(NavigableString(label_text))
+
+ label_div.append(new_label_para)
+ new_parent_div.append(label_div)
+
+ # Process body
+ body_div = soup.new_tag("div")
+ if (
+ remaining_contents
+ ): # this happens when there is not an empty line after the [!note] or [!tip], better to do here because in md we have gridtables to take care of
+ para_container = soup.new_tag("p")
+ for content in remaining_contents:
+ para_container.append(content)
+ body_div.append(para_container)
+ else:
+ current = blockquote.next_sibling
+
+ while current:
+ if current.name == "blockquote":
+ # Finished gathering body elements
+ current.decompose()
+ break
+
+ next_sibling = current.next_sibling
+ current.extract()
+ body_div.append(current)
+ current = next_sibling
+
+ new_parent_div.append(body_div)
+
+ # Finish processing
+ blockquote.replace_with(new_parent_div)
+
+ return soup
+
+
+def add_links_to_references_in_text(soup):
+ def reform_broken_links_in_text(soup: BeautifulSoup):
+ """
+ Fix reference links throughout the text that are effectively just plaintext. Reform the link.
+
+ Broken reference tags take the following form:
+
+ ```
+
+        <span>[{reference tag}]</span> OR <span>\[{reference tag}\]</span>
+
+ ```
+ """
+ # Some links are already properly formed, so they don't need anything done to them. Such links are surrounded by an extra pair of brackets, so add a negative lookahead and a negative lookbehind to exclude those outer brackets.
+ BAD_LINK_REGEX = r"^\\?\[{1}((?:i\.)?[A-Za-z0-9]+)\\?\](?!\])(.*?)$"
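+        # e.g. "[i.4] some trailing text" -> group(1) = "i.4", group(2) = " some trailing text"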
+
+ spans = soup.find_all("span", attrs={None})
+
+ for span in spans:
+ text = span.get_text()
+ match = re.search(BAD_LINK_REGEX, text)
+
+ if not match:
+ continue
+
+ tag_contents = match.group(1)
+ after_contents = (
+ NavigableString(match.group(2)) if match.group(2).strip() else None
+ )
+
+ is_informative = tag_contents.startswith("i.")
+
+ ref_file = informative_file if is_informative else normative_file
+ ref_id = tag_contents if is_informative else f"n.{tag_contents}"
+ link = f"{ref_file}.html#{ref_id}"
+
+ a = soup.new_tag("a", attrs={"href": link})
+ a.append(f"[{tag_contents}]")
+
+ span.replace_with(a)
+
+ # After the , add any miscellaneous text that was erroneously part of the link
+ if after_contents:
+ a.insert_after(after_contents)
+
+ return soup
+
+ REG_REGEX = r"(? tag
+ link = (
+ f"{informative_file}.html#{internal_text}"
+ if is_informative
+ else f"{normative_file}.html#{internal_text}"
+ )
+ a = soup.new_tag("a", attrs={"href": link})
+ a.append(f"[{internal_text.replace('n.', '')}]")
+ return a
+
+ def process_text_nodes(element):
+ for content in list(element.contents):
+ if isinstance(content, NavigableString):
+ before_text = ""
+ after_text = content
+ while True:
+ match = re.search(REG_REGEX, after_text)
+ if match:
+ before_text = after_text[: match.start()]
+ after_text = after_text[match.end() :]
+ if before_text:
+ content.insert_before(NavigableString(before_text))
+
+ is_informative = match.group(1) == "i."
+ # replace content with the tag
+ match_text = match.group(0)
+ a = insert_link_with_reference(match_text, is_informative)
+ content.insert_before(a)
+ else:
+ if after_text:
+ content.insert_before(NavigableString(after_text))
+ break
+
+ content.extract()
+
+            elif isinstance(content, Tag) and content.name not in ["a", "code"]:
+ process_text_nodes(content)
+
+ for element in soup.find_all(["p", "div"]):
+ if element.name == "div" and element.get("id") == "list-of-references":
+ continue # It is the original reference in clause 2
+ if (
+ element.name == "p"
+ and element.parent.name == "div"
+ and element.parent.get("id") == "list-of-references"
+ ):
+ continue # Skip paragraphs that are direct children of the reference list
+ process_text_nodes(element)
+
+ soup = reform_broken_links_in_text(soup)
+
+ return soup
+
+
+def move_dangling_brackets_out_of_links(soup: BeautifulSoup):
+ """
+ Move any dangling brackets that are part of links to the outside of the link tag.
+ This is necessary for links that were improperly formatted and had brackets inside the link text.
+ """
+
+ for a in soup.find_all("a"):
+ text = a.get_text()
+ # Check for dangling brackets
+ if (
+ (text.startswith("(") and not text.endswith(")"))
+ or (text.startswith("[") and not text.endswith("]"))
+ or (text.startswith("{") and not text.endswith("}"))
+ ):
+ a.insert_before(
+ NavigableString(text[0])
+ ) # Add the opening bracket before the link
+ a_text = text[1:] # Remove the opening bracket from the link text
+ a.string = a_text
+ elif (
+ (text.endswith(")") and not text.startswith("("))
+ or (text.endswith("]") and not text.startswith("["))
+ or (text.endswith("}") and not text.startswith("{"))
+ ):
+ a.insert_after(
+ NavigableString(text[-1])
+ ) # Add the closing bracket after the link
+ a.string = text[:-1] # Remove the closing bracket from the link text
+
+ return soup
+
+
+def remove_links_from_labels(soup: BeautifulSoup):
+ """
+ Remove links from label elements.
+ """
+ labels = soup.find_all("div", class_=["TF", "TH"])
+ for label in labels:
+ a_tag = label.find("a")
+ if a_tag:
+ a_tag.unwrap()
+ return soup
+
+
+def add_ids_to_labels(soup: BeautifulSoup):
+ """
+ Add ids to label elements if they don't have one.
+ """
+ labels = soup.find_all("div", class_=["TF", "TH"])
+ for label in labels:
+ if not label.get("id"):
+ label_text = label.get_text().strip()
+ id = label_text.split(":")[0].split(" ")[1]
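+            # e.g. "Figure 4.1-1: Overview" -> id "4.1-1" -> "Figure_4.1-1"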
+ if label_text.startswith("Figure"):
+ label.attrs["id"] = f"Figure_{id}"
+ elif label_text.startswith("Table"):
+ label.attrs["id"] = f"Table_{id}"
+ return soup
+
+
+def replace_dash_characters(soup: BeautifulSoup):
+ """
+ Replace dash characters in the a_tags and ids with the correct ones.
+ """
+ a_tags = soup.find_all("a")
+ for a in a_tags:
+ if a.string:
+ a.string = a.string.replace("‑", "-").replace("—", "-")
+ href = a.get("href", "")
+ if href:
+ a["href"] = href.replace("‑", "-").replace("—", "-")
+ ids = soup.find_all(id=True)
+ for element in ids:
+ id = element.get("id", "")
+ if id:
+ element["id"] = id.replace("‑", "-").replace("—", "-")
+ return soup
+
+
+def move_figure_id_to_FL_elements(soup: BeautifulSoup):
+ """
+    Move the id attribute from each caption (`TF`) div to the preceding figure (`FL`) div.
+ """
+ figures = soup.find_all("div", class_="FL")
+
+ for figure in figures:
+ # get next sibling
+ next_sibling = figure.find_next_sibling()
+ # check that the very next sibling is a div with class TF
+ if (
+ next_sibling
+ and next_sibling.name == "div"
+ and "TF" in next_sibling["class"]
+ ):
+ id = next_sibling.get("id", "")
+ figure.attrs["id"] = id
+ next_sibling["id"] = None
+
+ return soup
+
+
+def fix_custom_tags(soup: BeautifulSoup):
+ """
+    Fix custom link tags in the HTML: resolve "above"/"below" references to the nearest figure or table id, and expand "root" references to the current file's main heading id.
+ """
+
+ def notAnImage(href: str) -> bool:
+ image_extensions = [".png", ".jpg", ".jpeg", ".svg"]
+ return not any(href.endswith(ext) for ext in image_extensions)
+
+    # Example: re-point a link whose href ends in "below" to the id of the next figure or table
+ h1_tag = soup.find("h1", id=True)
+
+ a_tags = soup.find_all("a")
+
+ for a in a_tags:
+ href = a.get("href", "")
+ if href.endswith("below"):
+ is_table = "Table" in href
+ class_name = "TH" if is_table else "FL"
+ next_element = a.find_next("div", class_=class_name, id=True)
+ if next_element:
+ prefix = "Table_" if is_table else "Figure_"
+ string_to_be_replaced = f"{prefix}below"
+ new_a_text = next_element["id"].replace(prefix, "")
+ a["href"] = href.replace(string_to_be_replaced, next_element["id"])
+ a.string = a.string.replace("below", new_a_text)
+ else:
+ # flash an error
+ print(
+ p_error(f"Error: Found a broken custom tag in file {h1_tag.string}")
+ )
+ print(
+ p_error(
+ f"Error: No next element found for '{a.string}'. There are not any figures/tables above this tag."
+ )
+ )
+ os._exit(1)
+ elif href.endswith("above"):
+ is_table = "Table" in href
+ class_name = "TH" if is_table else "FL"
+ previous_element = a.find_previous("div", class_=class_name, id=True)
+ if previous_element:
+ prefix = "Table_" if is_table else "Figure_"
+ string_to_be_replaced = f"{prefix}above"
+ new_a_text = previous_element["id"].replace(prefix, "")
+ a["href"] = href.replace(string_to_be_replaced, previous_element["id"])
+ a.string = a.string.replace("above", new_a_text)
+ else:
+ # flash an error
+ print(
+ p_error(f"Error: Found a broken custom tag in file {h1_tag.string}")
+ )
+ print(
+ p_error(
+ f"Error: No previous element found for '{a.string}'. There are not any figures/tables above this tag."
+ )
+ )
+ os._exit(1)
+ elif (
+ href.find("#") != -1 and href.find("root") != -1 and notAnImage(href)
+ ): # when root is used in md
+ new_id_prefix = f"{h1_tag['id']}"
+ a["href"] = href.replace("root", new_id_prefix)
+ a.string = a.string.replace("root", new_id_prefix)
+ return soup
+
+
+def extract_images_from_html(soup: BeautifulSoup) -> tuple[dict, BeautifulSoup]:
+ """
+ Extracts image sources from the given HTML content.
+
+    Args:
+        soup (BeautifulSoup): The parsed HTML content.
+
+    Returns:
+        tuple: A dictionary mapping figure ids to image source filenames, and the modified soup.
+ """
+ figures = soup.find_all("div", class_="FL")
+ images_mapping = {}
+
+ for fig in figures:
+ id = fig.get("id", "")
+ if id:
+ img = fig.find("img")
+ if img:
+ src = img.get("src", "").replace("media/", "")
+ images_mapping[id] = src
+ figure_caption = fig.find("figcaption")
+ if (
+ figure_caption
+ ): # TODO: check if we might want to keep the caption instead of removing it
+ figure_caption.decompose()
+
+ return images_mapping, soup
+
+
+def add_custom_link_to_images(
+ soup: BeautifulSoup, images_mapping: dict
+) -> BeautifulSoup:
+ """
+ Adds a custom link to images in the HTML content based on the provided images mapping.
+
+    Args:
+        soup (BeautifulSoup): The parsed HTML content.
+        images_mapping (dict): A dictionary mapping image filenames to their figure id and the HTML file containing them.
+
+    Returns:
+        BeautifulSoup: The modified soup with custom links added to images.
+ """
+ # look for text that matches the pattern Figure+++
+ a_tags = soup.find_all("a")
+ for a in a_tags:
+ href = a.get("href", "")
+ if "Figure+++" in href:
+ # Extract the filename from the href
+ filename = href.split("+++")[1]
+ if filename in images_mapping:
+ image_info = images_mapping[filename]
+ a["href"] = f"{image_info['file']}#{image_info['id']}"
+ a.string = f"figure {image_info['id'].split('_')[1]}"
+ else:
+ raise ValueError(
+ f"ERROR: Image '{filename}' not found in images mapping. Are you sure it exists in the media folder and is used in the document?"
+ )
+
+ return soup
+
+
+def fix_capitalization_in_links(soup: BeautifulSoup) -> BeautifulSoup:
+ """
+ Ensures that the capitalization in the link text matches the capitalization in the href attribute.
+ """
+ a_tags = soup.find_all("a")
+ span_clauses_tags = soup.find_all("span", class_="clauses-marker")
+ for a in a_tags:
+ text = a.get_text()
+ if not text:
+ continue
+
+ if not text.startswith(("figure", "table", "clause", "annex")):
+ continue
+
+ # First case: it is the first word in a sentence
+ if a.parent and a.parent.contents[0] == a:
+ capitalized_text = text.capitalize()
+ a.string = capitalized_text
+
+ # Second case: it is after a period
+ elif a.previous_sibling and isinstance(a.previous_sibling, NavigableString):
+ prev_text = a.previous_sibling.strip()
+ if (
+ prev_text.endswith(".")
+ or prev_text.endswith("!")
+ or prev_text.endswith("?")
+ ):
+ capitalized_text = text.capitalize()
+ a.string = capitalized_text
+ for span in span_clauses_tags:
+ text = span.get_text()
+ if not text:
+ continue
+ if span.parent and span.parent.contents[0] == span:
+ capitalized_text = text.capitalize()
+ span.string = capitalized_text
+ elif span.previous_sibling and isinstance(
+ span.previous_sibling, NavigableString
+ ):
+ prev_text = span.previous_sibling.strip()
+ if (
+ prev_text.endswith(".")
+ or prev_text.endswith("!")
+ or prev_text.endswith("?")
+ ):
+ capitalized_text = text.capitalize()
+ span.string = capitalized_text
+ return soup
+
+
+def postprocess(html_dir: str):
+ """
+ ### Description
+ Iterates through the generated HTML files, applying various transformations to improve
+ formatting, fix links, and ensure consistent styling. This includes:
+
+ - Renaming files using the same logic applied to MD files
+ - Formatting references properly in reference sections
+ - Adding links to references mentioned in the text
+ - Fixing table of contents links
+ - Adjusting bracket placement around links
+ - Removing excess spacing in code examples
+ - Unwrapping code tags that should render as plaintext
+ - Formatting abbreviation sections properly
+ - Processing examples, notes and tips
+ - Fixing image handling in code blocks
+ - Adding and fixing IDs for figures and tables
+ - Fixing custom tags for relative figure/table references
+ - Normalizing dash characters in links and IDs
+ - Ensuring proper capitalization in links
+ - Creating cross-references for images
+
+ ### Arguments
+ - `html_dir`: Directory containing the HTML files to be processed
+ """
+ filenames_mapping = get_dirty_filenames_mapping_with_expected_filenames(html_dir)
+ images_mapping = {}
+
+ for filename in os.listdir(html_dir):
+ if filename.endswith(".html"):
+ with open(os.path.join(html_dir, filename), "r", encoding="utf-8") as file:
+ html = file.read()
+ if filename == "index.html":
+ new_filename = filename
+ else:
+ new_filename = apply_renaming_logic(html, filename, "html")
+ os.rename(
+ os.path.join(html_dir, filename), os.path.join(html_dir, new_filename)
+ )
+ file_path = os.path.join(html_dir, new_filename)
+
+ with open(file_path, "r", encoding="utf-8") as html:
+ soup = BeautifulSoup(html, "html.parser")
+
+ soup = remove_code_blocks_with_only_images(soup)
+ soup = format_examples_and_notes(soup)
+
+ if (
+ new_filename.replace(".html", "") in files_with_references
+ ): # Reference-specific formatting
+ soup = format_references(soup)
+ else:
+ soup = add_links_to_references_in_text(soup)
+
+ soup = fix_toc_links(soup, filenames_mapping)
+ soup = move_dangling_brackets_out_of_links(soup)
+ soup = fix_ex_json_spacing(soup)
+ soup = unwrap_gt_lt_code_tags(soup)
+
+ soup = handle_ew_div(soup)
+
+ soup = remove_links_from_labels(soup)
+ soup = add_ids_to_labels(soup)
+ soup = replace_dash_characters(soup)
+ soup = move_figure_id_to_FL_elements(soup)
+ soup = fix_custom_tags(soup)
+ images, soup = extract_images_from_html(soup)
+ for image_id, image_src in images.items():
+ images_mapping[image_src] = {"id": image_id, "file": new_filename}
+
+ contents = soup.decode_contents()
+
+ with open(file_path, "w", encoding="utf-8") as html:
+ html.write(contents)
+
+ for filename in os.listdir(html_dir):
+ if filename.endswith(".html"):
+ file_path = os.path.join(html_dir, filename)
+ with open(file_path, "r", encoding="utf-8") as html:
+ soup = BeautifulSoup(html, "html.parser")
+
+ try:
+ soup = add_custom_link_to_images(soup, images_mapping)
+ soup = fix_capitalization_in_links(soup)
+ except ValueError as e:
+ print(p_error(f"Error in file {filename}:"))
+ print(p_error(str(e)))
+ os._exit(1)
+
+ contents = soup.decode_contents()
+
+ with open(file_path, "w", encoding="utf-8") as html:
+ html.write(contents)
diff --git a/md_to_docx_converter/src/to_html/preprocessing.py b/md_to_docx_converter/src/to_html/preprocessing.py
new file mode 100644
index 0000000000000000000000000000000000000000..50cd220bcd7e821e94580d5ffc14b913fbddc2ea
--- /dev/null
+++ b/md_to_docx_converter/src/to_html/preprocessing.py
@@ -0,0 +1,687 @@
+import os, re, json
+import sys
+from typing_extensions import Literal
+
+from src.constants import (
+ NORMATIVE_REF_FILE,
+ INFORMATIVE_REF_FILE,
+ DEFAULT_CLAUSES,
+ DEFAULT_ANNEXES,
+ REFS,
+ DIV_START_REGEX,
+ DIV_END_REGEX,
+ BAD_DIV_DELINEATOR_REGEX,
+)
+
+from src.utils import (
+ handle_consolidated_md,
+ get_file_order,
+ int_to_letter,
+ p_warning,
+ p_error,
+ p_label,
+)
+from src.constants import MAX_HEADING_LEVEL
+
+files_with_references = [NORMATIVE_REF_FILE, INFORMATIVE_REF_FILE]
+
+
+# region Helpers
+
+def undo_prettier_formatting(text: str) -> str:
+ """Undo any formatting changes made by Prettier to ensure the Markdown is in a more raw format for processing."""
+
+ def fix_notes_and_examples(text: str) -> str:
+ if text.startswith("> > >"):
+ text = text.replace("> > >", ">>>")
+ return text
+
+    # Undo Prettier's re-wrapping of the note/example markers, line by line
+ new_lines = []
+ lines = text.split("\n")
+
+ for line in lines:
+ new_line = line
+
+ # Fix notes and examples
+ new_line = fix_notes_and_examples(new_line)
+
+ new_lines.append(new_line)
+
+ new_text = "\n".join(new_lines)
+
+ return new_text
+
+def run_format_checks(filename: str, file_lines: list[str]):
+ """Runs various checks on the Markdown file contents to ensure they are properly formatted. If any improper formatting is detected, display any fatal errors or warnings as necessary."""
+
+ def check_divs():
+ """
+ ### Display an error and exit when...
+ - An opening does not have a closing
+
+ ### Display a warning when...
+ - The number of openings and number of closings do not match
+ - Find a closing without a corresponding opening, this is likely meant to be an opening and needs metadata
+ """
+ i = 0
+ in_div = False
+ in_div_no_metadata = (
+ False # For if/when a div is found that doesn't have any class
+ )
+ start_line_num = (
+ 0 # For keeping track of the line number at which the latest div was opened
+ )
+
+ # Keep track of numbers of div starts and div ends
+ num_div_start = 0
+ num_div_end = 0
+
+ while i < len(file_lines):
+ line = file_lines[i].replace("\n", "")
+ line_num = i + 1
+
+ bad_div_delin_match = re.match(BAD_DIV_DELINEATOR_REGEX, line)
+ if bad_div_delin_match and line.find(":::") == -1:
+ # This div delineator doesn't have exactly three colons `:::`
+ print(
+ p_error(
+ f"{p_label(filename)}:{p_label(line_num)}: Improperly formatted div delineator in line. Line: {p_label(line)}"
+ )
+ )
+ raise Exception("DIV_DELINEATOR_ERROR")
+
+ start_match = re.match(DIV_START_REGEX, line)
+            num_div_start += 1 if start_match else 0
+
+ if start_match:
+ in_div_no_metadata = False # Set this to false in case it was true from a previous div without metadata
+ if in_div:
+ # The previous div wasn't closed, print error and quit
+ print(
+ p_error(
+ f"{p_label(filename)}:{p_label(start_line_num)}: No end tag found for div starting at this line"
+ )
+ )
+ raise Exception("DIV_DELINEATOR_ERROR")
+ else:
+ # A normal div opener
+ in_div = True
+
+ start_line_num = line_num
+ i += 1
+ continue
+
+ end_match = re.match(DIV_END_REGEX, line)
+            num_div_end += 1 if end_match else 0
+
+ if end_match:
+ if not in_div and not in_div_no_metadata:
+ # This should open a div, but it doesn't have a class assigned to it
+ print(
+ p_warning(
+ f"{p_label(filename)}:{p_label(line_num)}: The delineator at this line seems to open a div, this div or one before it may not be correctly structured."
+ )
+ )
+ in_div_no_metadata = True
+
+ elif not in_div and in_div_no_metadata:
+ # The closing to a classless div
+ in_div_no_metadata = False
+
+ in_div = False
+ i += 1
+ continue
+
+ i += 1
+
+ def check_notes_and_examples():
+ """
+ Display an error and exit when the [!note] or [!tip] tags are followed by some text that is not "NOTE" or "EXAMPLE" respectively.
+ """
+ note_regex = r"^>>> \[!note\]"
+ note_in_table_regex = r"^\| >>> \[!note\]"
+ example_regex = r"^>>> \[!tip\]"
+ note_allowed_text = r"^>>> \[!note\]( NOTE(\s\d+)?:)?$"
+ note_in_table_allowed_text = r"^\| >>> \[!note\](( NOTE(\s\d+)?:)?\s*\|)?$"
+ example_allowed_text = r"^>>> \[!tip\]( EXAMPLE(\s\d+)?:)?$"
+
+ for i, line in enumerate(file_lines):
+ line_num = i + 1
+ if re.match(note_regex, line):
+ if re.match(note_allowed_text, line):
+ continue
+ print(f"line: {line}")
+ print(
+ p_error(
+ f"{p_label(filename)}:{p_label(line_num)}: NOTE with unexpected text found. Please ensure the NOTE tag is either empty or is followed by 'NOTE', an optional number and a colon."
+ )
+ )
+ raise Exception("NOTE_NUMBERING_ERROR")
+ if re.match(note_in_table_regex, line):
+ if re.match(note_in_table_allowed_text, line):
+ continue
+ print(f"line: {line}")
+ print(
+ p_error(
+ f"{p_label(filename)}:{p_label(line_num)}: NOTE with unexpected text found in table. Please ensure the NOTE tag is either empty or is followed by 'NOTE', an optional number and a colon. Also, since this is in a table, ensure the line ends with a pipe '|' character without any other text after it. Finally, NOTES in tables can only be in cells that span over all columns."
+ )
+ )
+ raise Exception("NOTE_NUMBERING_ERROR")
+ if re.match(example_regex, line):
+ if re.match(example_allowed_text, line):
+ continue
+ print(f"line: {line}")
+ print(
+ p_error(
+ f"{p_label(filename)}:{p_label(line_num)}: EXAMPLE with unexpected text found. Please ensure the EXAMPLE tag is either empty or is followed by 'EXAMPLE', an optional number and a colon."
+ )
+ )
+ raise Exception("EXAMPLE_NUMBERING_ERROR")
+
+ check_divs()
+ check_notes_and_examples()
+
+def remove_ignore_prettier_statements(text: str) -> str:
+    """Remove any existing `<!-- prettier-ignore -->` statements from the text to avoid duplication"""
+    new_lines = []
+    for line in text.split("\n"):
+        stripped = line.strip()
+        # Drop every prettier-ignore comment line, e.g. "<!-- prettier-ignore -->"
+        if not (stripped.startswith("<!-- prettier-ignore") and stripped.endswith("-->")):
+            new_lines.append(line)
+
+    return "\n".join(new_lines)
+
+def add_divs_to_images_tables(text : str) -> str:
+ """Add divs around images and their captions, and tables captions to the ones defined using the ETSI guidelines."""
+ file_lines = text.split("\n")
+ new_file_lines = []
+ TABLE_CAPTION_REGEX = r"^\*\*Table"
+ IMAGE_CAPTION_REGEX = r"^\*\*Figure"
+ IMAGE_DEF_REGEX = r"^!\[.*\]\(.*\)"
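+    # e.g. the caption "**Table 4.1-1: Results**" becomes the three lines "::: TH", "Table 4.1-1: Results" and ":::"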
+
+ for line in file_lines:
+ if re.match(IMAGE_DEF_REGEX, line):
+ # If the line is an image definition, add divs around it
+ new_file_lines.append("::: FL")
+ new_file_lines.append(line)
+ new_file_lines.append(":::")
+ elif re.match(IMAGE_CAPTION_REGEX, line):
+ # If the line is an image caption, add divs around it
+ new_file_lines.append("::: TF")
+ new_file_lines.append(line.replace("**", ""))
+ new_file_lines.append(":::")
+ elif re.match(TABLE_CAPTION_REGEX, line):
+ # If the line is a table caption, add divs around it
+ new_file_lines.append("::: TH")
+ new_file_lines.append(line.replace("**", ""))
+ new_file_lines.append(":::")
+ else:
+ new_file_lines.append(line)
+
+ return "\n".join(new_file_lines) + "\n"
+
+def handle_less_than_greater_than_text(file_contents: str):
+    """Replace `<` and `>` with `&lt;` and `&gt;` respectively and wrap the whole section in single code ticks to allow the text to render in the HTML"""
+    regex = r"\<(?!img\b|span\b|sup|/sup)(.+?)\>"
+    replace = r"`&lt;\1&gt;`"
+ table_regex = rf"\|([^|\n]*?{regex}[^|\n]*?)\|"
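+    # e.g. "send a <GET> request" -> "send a `&lt;GET&gt;` request"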
+
+ def replace_in_table(match: re.Match):
+ cell_body = match.group(1)
+ intended_length = len(cell_body)
+
+ cell_body = re.sub(regex, replace, cell_body)
+
+ new_cell = f"|{cell_body[:intended_length]}|"
+ return new_cell
+
+ while True:
+ file_contents, count = re.subn(table_regex, replace_in_table, file_contents)
+ if (
+ count == 0
+ ): # Break out of the loop once there aren't any more substitutions to make
+ break
+
+ file_contents = re.sub(regex, replace, file_contents)
+
+ return file_contents
+
+
+def add_ids_to_headings(file_contents: str):
+ """Add HTML IDs to all headings in the document."""
+
+ def extract_id_from_heading(heading_text: str) -> str:
+ """Extract ID from heading text using the same logic as preprocessing.py"""
+ annex_regex = r"^Annex\s+([A-Z])(?:\s+\([\w]+\):|\s*:)?\s"
+ clause_number_regex = r"^([A-Z]?\.?\d+(\.\d+)*)\s"
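+        # e.g. "Annex A (normative): Foo" -> "A", "4.2.1 Overview" -> "4.2.1"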
+
+ if re.match(annex_regex, heading_text):
+ match = re.match(annex_regex, heading_text)
+ return match.group(1) # Extracts only the letter (A, B, C, etc.)
+ elif re.match(clause_number_regex, heading_text):
+ match = re.match(clause_number_regex, heading_text)
+ return match.group(1) # Extracts the clause number
+ else:
+ # Generate a fallback ID from the heading text
+ return re.sub(r"[^a-zA-Z0-9]", "-", heading_text.strip()).lower()
+
+ for level in range(1, MAX_HEADING_LEVEL):
+ regex = rf"^(#{{{level}}} (.+))$"
+
+ def replacement(match):
+ full_heading = match.group(1)
+ heading_text = match.group(2)
+ heading_id = extract_id_from_heading(heading_text)
+ return rf"{full_heading} {{#{heading_id}}}"
+
+ file_contents = re.sub(regex, replacement, file_contents, flags=re.MULTILINE)
+
+ return file_contents
+
+
+def add_empty_lines_in_notes_and_examples(file_contents: str):
+ """Ensure there is an empty line after the NOTE and EXAMPLE tags to ensure proper rendering. This is required because Pandoc would otherwise merge the note/example tag line with the first line of the note/example."""
+ file_lines = file_contents.split("\n")
+ new_file_lines = []
+ i = 0
+ while i < len(file_lines):
+ line = file_lines[i]
+
+ # opening of a note or example
+ if line.startswith(">>> [!note]") or line.startswith(">>> [!tip]") or line.startswith("| >>> [!note]"):
+ new_file_lines.append(line)
+ # Check if the next line exists and is not empty
+ if i + 1 < len(file_lines) and file_lines[i + 1].strip() != "":
+ if not line.startswith("| >>> [!note]"):
+ new_file_lines.append("") # Add an empty line only for notes/examples outside tables
+ else:
+ if not line.startswith("+") and not line.endswith("+"):
+ line_length = len(line) - 2 # Subtract 2 for the "|" at the start and end
+ new_file_lines.append("|" + " " * line_length + "|") # Add an empty line
+
+ # closing of a note or example
+ elif line.find(">>>") != -1 or line.find("| >>>") != -1:
+ #check before the line
+ line_before = file_lines[i - 1] if i > 0 else ""
+ empty_line_regex = r"^\s*$"
+ empty_table_row_regex = r"^\|\s*\|$"
+ if not re.match(empty_line_regex, line_before) and not line_before.startswith("| "):
+ new_file_lines.append("") # Add an empty line before any other blockquote
+ elif not re.match(empty_table_row_regex, line) and line_before.startswith("| "):
+ line_length = len(line) - 2 # Subtract 2 for the "|" at the start and end
+ new_file_lines.append("|" + " " * line_length + "|") # Add an empty line before any other blockquote in a table
+
+ new_file_lines.append(line)
+
+ #check after the line
+ if not line.startswith("| >>>"): # we are not in a table
+ new_file_lines.append("") # Add an empty line after any other blockquote
+ elif line.startswith("| >>>"):
+ if not line.startswith("+-") and not line.startswith("+="):
+ line_length = len(line) - 2 # Subtract 2 for the "|" at the start and end
+ new_file_lines.append("|" + " " * line_length + "|") # Add an empty line
+ else:
+ new_file_lines.append(line)
+ i += 1
+ return "\n".join(new_file_lines) + "\n"
+
+# Used to keep track of clause numbers across multiple levels when auto-numbering
+clauses_counters = [0] * MAX_HEADING_LEVEL
+clauses_counters[0] = 3 # first 3 clauses are taken by mandatory files
+annexes_counters = [0] * MAX_HEADING_LEVEL
+example_counter = 0
+note_counter = 0
+note_in_table_counter = 0
+figure_counter = 0
+table_counter = 0
+
+
+def auto_number_content(
+ file_contents: str, content_type: Literal["clauses", "annexes"]
+):
+ global example_counter, note_counter, note_in_table_counter
+
+ def auto_number_heading(line: str):
+ global clauses_counters, annexes_counters, figure_counter, table_counter
+ new_heading = ""
+ new_line = line
+ is_annex = content_type == "annexes"
+ if is_annex and line.startswith("# Annex"):
+ annexes_counters[0] += 1
+ # set all lower levels to 0
+ for j in range(1, MAX_HEADING_LEVEL):
+ annexes_counters[j] = 0
+ match = re.match(r"^# Annex ([A-Z])(?:\s+.*)?$", line)
+ if match:
+ new_heading = match.group(1)
+ new_line = match.group(0) # keep the line as is
+ else:
+ annex_letter = int_to_letter(annexes_counters[0])
+ new_heading = annex_letter
+ new_line = line.replace("Annex", f"Annex {annex_letter}")
+ else: # it is a clause of a regular file or a sub-clause of an annex
+ match = re.match(r"^(#+) (.+)$", line)
+ if match:
+ level = len(match.group(1))
+ if is_annex:
+ annexes_counters[level - 1] += 1
+                    # set all lower levels to 0
+ for j in range(level, MAX_HEADING_LEVEL):
+ annexes_counters[j] = 0
+ else:
+ clauses_counters[level - 1] += 1
+                    # set all lower levels to 0
+ for j in range(level, MAX_HEADING_LEVEL):
+ clauses_counters[j] = 0
+ figure_counter = 0
+ table_counter = 0
+
+ # check if we have an hardcoded clause number
+ existing_number = re.match(
+ r"^(?:\d+|[A-Za-z])(?:\.\d+)*\s+", match.group(2)
+ )
+
+ if not existing_number: # user didn't write a clause number
+ # create a string with all heading counters up to the level
+ if is_annex:
+ annex_letter = int_to_letter(annexes_counters[0])
+ heading_string = (
+ annex_letter
+ + "."
+ + ".".join(str(x) for x in annexes_counters[1:level])
+ )
+ else:
+ heading_string = ".".join(
+ str(x) for x in clauses_counters[:level]
+ )
+ new_heading = heading_string
+ new_line = f"{match.group(1)} {heading_string} {match.group(2)}"
+ else:
+ new_heading = existing_number.group(0).strip()
+ new_line = match.group(0) # keep the line as is
+ return new_line, new_heading
+
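+ # Worked example (illustrative): with clauses_counters == [4, 1, 0, ...],
+ # auto_number_heading("## Overview") returns ("## 4.2 Overview", "4.2");
+ # a hardcoded heading like "## 5.1 Scope" is kept as-is with heading "5.1".
+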
+ def auto_number_example(line: str) -> str:
+ global example_counter
+ new_line = line
+ example_counter += 1
+ if "EXAMPLE" not in line:
+ if (
+ example_counter != 1
+ ): # if there is only one example, the number can be omitted; patched in a later pass
+ new_line = line.replace(
+ ">>> [!tip]", f">>> [!tip] EXAMPLE {example_counter}:"
+ )
+ return new_line
+
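+ # e.g. the second ">>> [!tip]" line in a file becomes ">>> [!tip] EXAMPLE 2:".
+ # The first is left bare here and patched later to "EXAMPLE 1:" (or plain
+ # "EXAMPLE:" if it turns out to be the only one).
+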
+ def auto_number_note(line: str) -> str:
+ global note_counter
+ new_line = line
+ note_counter += 1
+ if "NOTE" not in line:
+ if (
+ note_counter != 1
+ ): # if there is only one note, the number can be omitted; patched in a later pass
+ new_line = line.replace(
+ ">>> [!note]", f">>> [!note] NOTE {note_counter}:"
+ )
+ return new_line
+
+ def auto_number_figure(line: str) -> str:
+ global figure_counter
+ new_line = line
+ figure_counter += 1
+ match = re.search(r"Figure:", line)
+ if match:
+ figure_number = f"{previous_heading}-{figure_counter}"
+ new_line = line.replace("Figure:", f"Figure {figure_number}:")
+ return new_line
+
+ def auto_number_table(line: str) -> str:
+ global table_counter
+ new_line = line
+ table_counter += 1
+ match = re.search(r"Table:", line)
+ if match:
+ table_number = f"{previous_heading}-{table_counter}"
+ new_line = line.replace("Table:", f"Table {table_number}:")
+ return new_line
+
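+ # e.g. with previous_heading == "4.4" and table_counter == 2, a "Table:" caption
+ # becomes "Table 4.4-2:"; figures are numbered the same way via figure_counter.
+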
+ def auto_number_note_in_table(line: str) -> str:
+ global note_in_table_counter
+ new_line = line
+ note_in_table_counter += 1
+ if "NOTE" not in line:
+ if (
+ note_in_table_counter != 1
+ ): # if there is only one note, the number can be omitted; patched in a later pass
+ new_text = f"| >>> [!note] NOTE {note_in_table_counter}:"
+ text_to_replace = "| >>> [!note]"
+ diff_in_length = len(new_text) - len(text_to_replace)
+ # ensure we keep the table formatting by adding spaces at the end of the line if needed
+ if diff_in_length > 0:
+ text_to_replace = text_to_replace + " " * diff_in_length
+ new_line = line.replace(
+ text_to_replace, new_text
+ )
+ return new_line
+
+ # Walk the lines, renumbering headings, notes, examples, figures and tables in place
+ lines = file_contents.splitlines()
+ # required to keep track of the previous line for figures and tables
+ previous_line = ""
+ previous_heading = ""
+ first_example_line_index = -1
+ first_note_line_index = -1
+ first_note_in_table_line_index = -1
+ for i, line in enumerate(lines):
+ new_line = line
+
+ if line.startswith("#"):
+ new_line, new_heading = auto_number_heading(new_line)
+ previous_heading = new_heading
+
+ if example_counter >= 1 and first_example_line_index != -1 and "EXAMPLE" not in lines[first_example_line_index]:
+ lines[
+ first_example_line_index
+ ] += f" EXAMPLE{' 1' if example_counter > 1 else ''}:"
+ example_counter = 0
+ first_example_line_index = -1
+
+ if note_counter >= 1 and first_note_line_index != -1 and "NOTE" not in lines[first_note_line_index]:
+ lines[
+ first_note_line_index
+ ] += f" NOTE{' 1' if note_counter > 1 else ''}:"
+ note_counter = 0
+ first_note_line_index = -1
+
+ elif line.startswith(">>> [!tip]"):
+ new_line = auto_number_example(new_line)
+ if example_counter == 1:
+ first_example_line_index = i
+
+ elif line.startswith(">>> [!note]") :
+ new_line = auto_number_note(new_line)
+ if note_counter == 1:
+ first_note_line_index = i
+
+ elif previous_line.startswith("::: TF"):
+ new_line = auto_number_figure(new_line)
+
+ elif previous_line.startswith("::: TH"):
+ new_line = auto_number_table(new_line)
+
+ if note_in_table_counter >= 1 and first_note_in_table_line_index != -1:
+ note_string = f"| >>> [!note] NOTE{' 1' if note_in_table_counter > 1 else ''}:"
+ note_string_length = len(note_string)
+ text_to_be_replaced = "| >>> [!note]"
+ text_to_be_replaced_length = len(text_to_be_replaced)
+ diff_in_length = note_string_length - text_to_be_replaced_length
+ if diff_in_length > 0:
+ text_to_be_replaced = text_to_be_replaced + " " * diff_in_length
+ lines[first_note_in_table_line_index] = lines[
+ first_note_in_table_line_index
+ ].replace(
+ text_to_be_replaced, note_string
+ )
+ note_in_table_counter = 0
+ first_note_in_table_line_index = -1
+
+ elif line.startswith("| >>> [!note]"):
+ new_line = auto_number_note_in_table(new_line)
+ if note_in_table_counter == 1:
+ first_note_in_table_line_index = i
+
+ lines[i] = new_line
+ previous_line = line
+
+ ### Re-run the numbering fix-up for examples and notes: inside the loop it only triggers when a new heading or table starts, so an element sitting in the last heading or table would otherwise be skipped
+
+ if example_counter >= 1 and first_example_line_index != -1 and "EXAMPLE" not in lines[first_example_line_index]:
+ lines[
+ first_example_line_index
+ ] += f" EXAMPLE{' 1' if example_counter > 1 else ''}:"
+
+ if note_counter >= 1 and first_note_line_index != -1 and "NOTE" not in lines[first_note_line_index]:
+ lines[first_note_line_index] += f" NOTE{' 1' if note_counter > 1 else ''}:"
+
+ if note_in_table_counter >= 1 and first_note_in_table_line_index != -1:
+ note_string = f"| >>> [!note] NOTE{' 1' if note_in_table_counter > 1 else ''}:"
+ note_string_length = len(note_string)
+ text_to_be_replaced = "| >>> [!note]"
+ text_to_be_replaced_length = len(text_to_be_replaced)
+ diff_in_length = note_string_length - text_to_be_replaced_length
+ if diff_in_length > 0:
+ text_to_be_replaced = text_to_be_replaced + " " * diff_in_length
+ lines[first_note_in_table_line_index] = lines[
+ first_note_in_table_line_index
+ ].replace(
+ text_to_be_replaced, note_string
+ )
+
+ file_contents = "\n".join(lines) + "\n"
+ return file_contents
+
+
+def add_ids_to_references(file_contents: str, filename: str):
+ """Add IDs to reference definitions so they can be linked to from elsewhere in the document."""
+
+ def handle_references(file_contents: str, filename: str):
+ """Wrap reference patterns in spans that carry their IDs."""
+ # Pattern for informative references with "i." prefix
+ REF_REGEX_I = r"\[(i\.[A-Za-z0-9]+)\]"
+
+ # Pattern for normative references with "n." prefix
+ REF_REGEX_N = r"\[(n\.[A-Za-z0-9]+)\]"
+
+ if (
+ filename.replace(".md", "") in files_with_references
+ ): # references clauses, add span with ids
+ REF_REPLACE_I = r'<span id="\1">\[\1\]</span>'
+ REF_REPLACE_N = r'<span id="\1">\[\1\]</span>'
+ file_contents = re.sub(REF_REGEX_I, REF_REPLACE_I, file_contents)
+ file_contents = re.sub(REF_REGEX_N, REF_REPLACE_N, file_contents)
+ return file_contents
+
+ file_contents = handle_references(file_contents, filename)
+
+ return file_contents
+
+
+# endregion
+
+
+def preprocess(
+ src: str, src_type: str, consolidated_md_path: str, file_order_json: str
+):
+ """
+ ### Description
+ Preprocesses Markdown files to prepare them for conversion to HTML by applying various transformations:
+
+ 1. Performs format validation checks on divs, notes, and examples
+ 2. Removes any existing prettier-ignore statements
+ 3. Adds appropriate divs around images, tables, and their captions
+ 4. Auto-numbers clauses, annexes, examples, notes, figures, and tables
+ 5. Adds IDs to references for proper linking
+ 6. Handles special characters like '<' and '>' to ensure proper rendering
+ 7. Adds IDs to headings for navigation and cross-referencing
+ 8. Ensures proper formatting of notes and examples with empty lines
+ 9. Consolidates all preprocessed files into a single Markdown file
+
+ ### Arguments
+ - `src`: The absolute or relative path of the directory containing the source Markdown files
+ - `src_type`: The source file type, `md`
+ - `consolidated_md_path`: The path at which the consolidated Markdown file will be created
+ - `file_order_json`: Path to JSON file specifying custom order of clauses and annexes
+
+ ### Returns
+ - A dictionary mapping filenames to their numeric/alphabetic positions in the document
+ """
+ filename_numbers_mapping = {}
+ clauses = DEFAULT_CLAUSES
+ annexes = DEFAULT_ANNEXES
+
+ if file_order_json:
+ with open(file_order_json, "r") as file:
+ json_data = json.load(file)
+ clauses = json_data.get("clauses")
+ annexes = json_data.get("annexes")
+
+ files, clauses_filenames, annexes_filenames = get_file_order(src, clauses, annexes)
+ files = [f"{filename}.md" for filename in files]
+ clauses_filenames = [f"{filename}.md" for filename in clauses_filenames]
+ annexes_filenames = [f"{filename}.md" for filename in annexes_filenames]
+ preprocessed_filenames = []
+ for filename in files:
+ filename_without_extension = filename[:-3] # Remove .md extension
+ if filename.endswith(src_type) and filename != "consolidated.md":
+ input_path = os.path.join(src, filename)
+ try:
+ text = open(input_path, "r", encoding="utf-8").read()
+
+ text = undo_prettier_formatting(text)
+ run_format_checks(filename, text.splitlines())
+
+ text = remove_ignore_prettier_statements(text)
+
+ text = add_divs_to_images_tables(text)
+
+ if filename in clauses_filenames:
+ text = auto_number_content(text, "clauses")
+ filename_numbers_mapping[filename_without_extension] = (
+ clauses_counters[0]
+ )
+ elif filename in annexes_filenames:
+ text = auto_number_content(text, "annexes")
+ filename_numbers_mapping[filename_without_extension] = (
+ int_to_letter(annexes_counters[0]).lower()
+ )
+ text = add_ids_to_references(text, filename)
+ text = handle_less_than_greater_than_text(text)
+ text = add_ids_to_headings(text)
+ text = add_empty_lines_in_notes_and_examples(text)
+ if filename_without_extension in REFS:
+ text = text.replace("::: REFS", "::: EX")
+
+ new_filename = re.sub(
+ r"([\w-]+?).md", r"--preprocessed--\1.md", filename
+ ) # Ensure file order is preserved by keeping the number in front
+ output_path = os.path.join(src, new_filename)
+ open(output_path, "w", encoding="utf-8").write(text)
+ preprocessed_filenames.append(new_filename)
+ except Exception as e:
+ if e.args and e.args[0] in (
+ "DIV_DELINEATOR_ERROR",
+ "NOTE_NUMBERING_ERROR",
+ "EXAMPLE_NUMBERING_ERROR",
+ ):
+ # delete all files that start with --preprocessed--
+ for f in os.listdir(src):
+ if f.startswith("--preprocessed--"):
+ os.remove(os.path.join(src, f))
+ sys.exit(1)
+
+ handle_consolidated_md("create", src, consolidated_md_path, preprocessed_filenames)
+
+ return filename_numbers_mapping
diff --git a/md_to_docx_converter/src/to_md/__init__.py b/md_to_docx_converter/src/to_md/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/md_to_docx_converter/src/to_md/cleaning.py b/md_to_docx_converter/src/to_md/cleaning.py
new file mode 100644
index 0000000000000000000000000000000000000000..7c3081f27fcefb5b9bae688b7162f1c30919c5e0
--- /dev/null
+++ b/md_to_docx_converter/src/to_md/cleaning.py
@@ -0,0 +1,1217 @@
+import re, json, os
+from bs4 import BeautifulSoup, Tag, NavigableString, Comment
+
+from src.utils import get_css_classes, is_whitespace_navstr
+
+from src.constants import (
+ LIST_HIERARCHY_CLASSES,
+ ORDERED_LIST_ITEM_CLASSES,
+ HTML_LIST_TAGS,
+)
+
+
+# region Helpers
+def ensure_correct_css_class_use(soup: BeautifulSoup):
+ """Reassigns the ondemand classes applied by some divs and spans to established alternatives. Unwraps other divs and spans that needlessly apply classes."""
+
+ def handle_tags_to_unwrap(soup: BeautifulSoup):
+ """
+ Handles the following classes:
+ - `ondemand_CHAR_color_000000`
+ - `ondemand_PAR_space_after_3`
+ - `ondemand_PAR_space_after_6`
+ - `ondemand_PAR_space_after_10`
+ - `ondemand_PAR_alignment_CENTER_space_before_3`
+ - `ondemand_CHAR_name_Century_Gothic_size_16_color_FFFFFF`
+ - `ondemand_CHAR_size_9`
+ - `ondemand_CHAR_name_Arial_size_9`
+ - `ondemand_CHAR_size_9_color_000000`
+ - `ondemand_CHAR_size_12_color_000000`
+ - `example-title`
+ """
+ CLASSES_TO_UNWRAP: list[str] = [
+ "ondemand_CHAR_color_000000",
+ "ondemand_PAR_space_after_3",
+ "ondemand_PAR_space_after_6",
+ "ondemand_PAR_space_after_10",
+ "ondemand_PAR_space_after_12",
+ "ondemand_PAR_left_indent_36_space_after_10",
+ "ondemand_PAR_alignment_CENTER_space_before_3",
+ "ondemand_CHAR_name_Century_Gothic_size_16_color_FFFFFF",
+ "ondemand_CHAR_size_9",
+ "ondemand_CHAR_name_Arial_size_9",
+ "ondemand_CHAR_size_9_color_000000",
+ "ondemand_CHAR_size_12_color_000000",
+ "example-title",
+ "block"
+ ]
+
+ divs_to_unwrap = soup.find_all(
+ lambda tag: tag.has_attr("class")
+ and any(cls in CLASSES_TO_UNWRAP for cls in tag["class"])
+ )
+
+ for div in divs_to_unwrap:
+ div.unwrap()
+
+ return soup
+
+ def handle_tags_to_reassign(soup: BeautifulSoup):
+ """
+ Makes the following switches:
+ - `ondemand_CHAR_color_0000FF` -> `Hyperlink`
+ - {`ondemand_CHAR_color_595959`,
+ `ondemand_CHAR_name_Times_New_Roman_size_10_color_595959`} -> `HTML_Keyboard`
+ - `ondemand_PAR_alignment_CENTER_space_after_12` -> `FL`
+ - `ondemand_CHAR_name_Roboto_Mono_size_8_color_31849B` -> `HTML_Code`
+ - `ondemand_PAR_alignment_CENTER` -> `TAC`
+ - {`ondemand_CHAR_name_Courier_New`,
+ `ondemand_CHAR_name_Courier_New_size_9`,
+ `ondemand_CHAR_name_Courier_New_size_8`} -> `PL`
+ """
+ class_pairs = [ # Format: [<class to replace>, <replacement class>]
+ ["ondemand_CHAR_color_0000FF", "Hyperlink"],
+ ["ondemand_CHAR_color_595959", "HTML_Keyboard"],
+ ["ondemand_PAR_alignment_CENTER_space_after_12", "FL"],
+ ["ondemand_CHAR_name_Roboto_Mono_size_8_color_31849B", "HTML_Code"],
+ ["ondemand_PAR_alignment_CENTER", "TAC"],
+ ["ondemand_CHAR_name_Courier_New", "PL"],
+ ["ondemand_CHAR_name_Courier_New_size_8", "PL"],
+ ["ondemand_CHAR_name_Courier_New_size_9", "PL"],
+ [
+ "ondemand_CHAR_name_Times_New_Roman_size_10_color_595959",
+ "HTML_Keyboard",
+ ],
+ [
+ "ETSI-code_Char",
+ "HTML_Sample"
+ ],
+ ]
+
+ for i in range(len(class_pairs)):
+ tags: list[Tag] = soup.find_all(class_=class_pairs[i][0])
+ for tag in tags:
+ tag["class"] = [class_pairs[i][1]]
+
+ return soup
+
+ def correct_span_styles(soup: BeautifulSoup):
+ """Some styles correspond to Word paragraph styles, and therefore should only be applied to `<p>`s and `<div>`s, yet they are also applied to spans, which should only use styles that correspond to Word run styles. In such cases, make the necessary substitutions."""
+ STYLES_AND_REPLACEMENTS = [["PL", "HTML_Sample"]]
+
+ for pair in STYLES_AND_REPLACEMENTS:
+ before_style = pair[0]
+ after_style = pair[1]
+
+ spans = soup.find_all("span", class_=before_style, recursive=True)
+ for span in spans:
+ span["class"] = [after_style]
+
+ return soup
+
+ def handle_exceptional_cases(soup: BeautifulSoup):
+ """
+ Some CSS classes need very specific handling. This handling is done here.
+
+ Classes impacted:
+ - "ondemand_PAR_first_line_indent_56"
+ - "ondemand_PAR_first_line_indent_70"
+ """
+
+ def handle_ondemand_PAR_first_line_indent_56_and_70(
+ soup: BeautifulSoup,
+ ):
+ """Maintaining the span inside these divs, replace the div that wraps it with a paragraph. These classes appear only in code block sequences, so form the format expected during code block creation here."""
+ CLASSES = [
+ "ondemand_PAR_first_line_indent_56",
+ "ondemand_PAR_first_line_indent_70",
+ ]
+ divs = (
+ soup.find("html")
+ .find("body")
+ .find_all(
+ "div",
+ attrs={"class": lambda c: c and any(cls in c for cls in CLASSES)},
+ recursive=False,
+ )
+ )
+
+ for div in divs:
+ spans = div.find_all("span")
+ span_cls = spans[0].get("class")[
+ 0
+ ] # All the classes should be the same here
+
+ # Consolidate the contents of all the spans into a new span
+ span_contents = "".join([s.get_text() for s in spans])
+ consolidated_span = soup.new_tag("span", attrs={"class": span_cls})
+ consolidated_span.append(NavigableString(span_contents))
+
+ # Create a new paragraph for the consolidated span and insert it after the div
+ paragraph = soup.new_tag("p")
+ paragraph.append(consolidated_span)
+ div.insert_after(paragraph)
+
+ # Remove the div
+ div.decompose()
+
+ return soup
+
+ soup = handle_ondemand_PAR_first_line_indent_56_and_70(soup)
+
+ return soup
+
+ soup = handle_tags_to_unwrap(soup)
+ soup = handle_tags_to_reassign(soup)
+ soup = handle_exceptional_cases(soup)
+ soup = correct_span_styles(soup)
+
+ return soup
+
+
+def handle_blockquotes(soup: BeautifulSoup):
+ """Unwrap blockquotes, preserving their contents"""
+
+ for blockquote in soup.find_all("blockquote"):
+ blockquote.unwrap()
+
+ return soup
+
+
+def remove_example_tag_number_from_body(soup: BeautifulSoup):
+ """Some examples have their number mistakenly copied and added as part of their body. Remove those from the body."""
+ divs: list[Tag] = soup.find_all("div", class_="EX")
+ regex = r"(?:EXAMPLE |^\s*)?((?:(?:[0-9])|(?:[1-3][0-9])))" # Captures the number part and the ending colon, ex., 1:, 3:, 14:, etc
+
+ for div in divs:
+ children = [
+ child
+ for child in div.children
+ if (isinstance(child, Tag) and (child.name == "div"))
+ ]
+
+ if len(children) != 2:
+ continue # Skip this one because the problem only occurs in examples with exactly two children
+
+ # Getting the children and matches in their text
+ tag = children[0]
+ tag_text = tag.get_text(strip=True)
+ tag_match = re.search(regex, tag_text)
+
+ body = children[1]
+ body_text = body.get_text(strip=True)
+ body_match = re.search(regex, body_text)
+
+ # Disqualify divs
+ if not (body_match and tag_match) or tag_match.group(1) != body_match.group(1):
+ continue # The example number isn't repeated in the body here
+
+ if not (isinstance(tag, Tag) and isinstance(body, Tag)):
+ continue # Not two divs
+
+ # Remove only if it's just the number/colon pair in the div and nothing else
+ if len(body_text) == 2 or len(body_text) == 3:
+ body.decompose()
+
+ return soup
+
+
+def handle_tags_for_empty_classes(soup: BeautifulSoup, css_src: list[str]):
+ """
+ Replace the tags for only those HTML tags that only have a class attribute that applies an empty CSS class with a paragraph (to maintain visual positioning).
+
+ Ex:
Content -> {line beginning}Content{newline}
+ """
+ empty_css_classes = get_css_classes("empty", css_src)
+
+ for tag in soup.find_all(True):
+ if list(tag.attrs.keys()) == ["class"]:
+ classes = tag.get("class", [])
+ if len(classes) == 1 and classes[0] in empty_css_classes:
+ paragraph = soup.new_tag("p")
+
+ while tag.contents:
+ paragraph.append(tag.contents[0].extract())
+
+ tag.replace_with(
+ paragraph
+ ) # Maintain spacing between text by enclosing it in a paragraph
+
+ return soup
+
+
+def consolidate_sequential_spans(soup: BeautifulSoup, css_src: list[str]):
+ """Consolidates only those series of spans that contain data that should be contained within a single span"""
+ CLASSES = get_css_classes("all", css_src)
+
+ def handle_spans(soup: BeautifulSoup, tag: Tag):
+ nonlocal CLASSES
+ children = list(tag.children)
+ i = 0
+
+ while i < len(children):
+ child = children[i]
+
+ if isinstance(child, Tag) and child.name == "span":
+ span_class = child.get("class", [])
+
+ if len(span_class) == 1 and span_class[0] in CLASSES:
+ span_class = span_class[0]
+ group = [child]
+ i += 1
+
+ # Get consecutive spans, possibly separated by whitespace
+ while i < len(children):
+ next_child = children[i]
+
+ if ( # Spans
+ isinstance(next_child, Tag)
+ and next_child.name == "span"
+ and next_child.get("class", []) == [span_class]
+ ):
+ group.append(next_child)
+ i += 1
+ elif is_whitespace_navstr(next_child):
+ i += 1
+ else: # End of sequence
+ break
+
+ # Create new consolidated span
+ if len(group) > 1:
+ new_parent_span = soup.new_tag("span", **{"class": span_class})
+
+ for span in group:
+ for contents in list(span.contents):
+ new_parent_span.append(contents.extract())
+
+ group[0].insert_before(new_parent_span)
+
+ # Remove old spans
+ for span in group:
+ span.decompose()
+
+ children = list(tag.children)
+ i = children.index(new_parent_span) + 1
+
+ else:
+ i += 1
+
+ else:
+ i += 1
+
+ return soup
+
+ def handle_spans_not_in_tables(soup: BeautifulSoup):
+ divs = soup.find_all("div")
+
+ paras = soup.find_all("p")
+
+ for div in divs:
+ soup = handle_spans(soup, div)
+
+ for para in paras:
+ soup = handle_spans(soup, para)
+
+ return soup
+
+ def handle_spans_in_tables(soup: BeautifulSoup):
+ tables = soup.find_all("table")
+
+ for table in tables:
+ cells = table.find_all(["td", "th"])
+
+ for cell in cells:
+ soup = handle_spans(soup, cell)
+
+ divs = cell.find_all("div")
+
+ for div in divs:
+ soup = handle_spans(soup, div)
+
+ return soup
+
+ soup = handle_spans_not_in_tables(soup)
+ soup = handle_spans_in_tables(soup)
+
+ return soup
+
+
+def consolidate_div_sequences(soup: BeautifulSoup):
+ """
+ Consolidates sequences of divs that apply the following classes to their contents:
+ - `PL`
+ - `EX`
+ - `NO`
+ - `EW`
+ - `B1plus`
+
+ Divs applying these classes are meant to go together
+ """
+ EXAMPLES_NOTES_CLASSES = ["EX", "NO"]
+
+ CLASSES = EXAMPLES_NOTES_CLASSES + ["B1plus", "PL", "EW"]
+
+ def is_beginning_ex_no(div: Tag):
+ """
+ Checks example and note divs to see if they should be the start of an example or note - that is, divs that have two classless child divs and no other HTML tags
+
+ Returns `True` for such divs, `False` for other examples or notes
+ """
+ nonlocal EXAMPLES_NOTES_CLASSES
+
+ if not div or not isinstance(div, Tag):
+ return False
+
+ div_class = div.get("class", [])
+
+ if len(div_class) != 1 or div["class"][0] not in EXAMPLES_NOTES_CLASSES:
+ return False
+
+ # Children of parent div
+ children = [
+ child
+ for child in div.contents
+ if isinstance(child, Tag)
+ or (isinstance(child, NavigableString) and not child.strip() == "")
+ ]
+
+ # All children that are classless divs
+ div_children = [
+ child
+ for child in children
+ if isinstance(child, Tag)
+ and child.name == "div"
+ and len(child.get("class", [])) == 0
+ ]
+
+ # Does this div have only two child divs?
+ is_top_level_div: bool = len(div_children) == 2 and len(children) == len(
+ div_children
+ )
+
+ return is_top_level_div
+
+ def div_is_in_series(current: Tag, next: Tag):
+ """
+ ### Description
+ Determine whether a div is in a series by checking the next tag
+
+ ### Arguments
+ - `current`: The div being tested
+ - `next`: The div after `current` within a list of all divs in the soup
+ """
+ if not current.name == "div" and not next.name == "div":
+ return False
+
+ current_class = current.get("class")
+ current_class = current_class[0] if current_class else None
+
+ next_class = next.get("class")
+ next_class = next_class[0] if next_class else None
+
+ if (
+ not current_class
+ or not next_class
+ or (current_class != next_class)
+ or is_beginning_ex_no(next)
+ ):
+ return False
+
+ current_sibling = current.find_next_sibling(lambda tag: isinstance(tag, Tag))
+
+ if next is not current_sibling:
+ return False
+
+ return True
+
+ def populate_group(group: list[Tag], divs: list[Tag]):
+ """Gets the divs that follow. Populates `group` with the elements that should be part of a previous example or note's body."""
+ nonlocal div_class, i # `i` is the index used with the while loop
+
+ while i < len(divs) and div_is_in_series(group[-1], divs[i]):
+ group.append(divs[i])
+ i += 1
+
+ return group
+
+ def add_body_div(soup: BeautifulSoup, group: list[Tag]):
+ """Wrap all the divs' content within a single div in the example or note. Add this new div to the example/note and remove the old divs that are in `group`."""
+ nonlocal div_class
+
+ PARENT = group[0].parent
+ POSITION = group[-1].next_sibling
+ FIRST_DIV_ID = group[0].get("id", "")
+ IS_REFERENCE = FIRST_DIV_ID.startswith("n.") or FIRST_DIV_ID.startswith("i.")
+
+ def move_old_elements_to_div(new_parent_div: Tag):
+ nonlocal PARENT, POSITION, soup, group
+ for child in group:
+ # Append the contents of each child div to a paragraph
+ paragraph = soup.new_tag("p")
+ for contents in list(child.contents):
+ paragraph.append(contents.extract())
+
+ # Append the paragraph to the parent
+ new_parent_div.append(paragraph)
+
+ # Remove the child
+ child.decompose()
+
+ return new_parent_div
+
+ # Create new parent div
+ new_parent_div = soup.new_tag(
+ "div",
+ **{
+ "class": div_class,
+ "id": "list-of-references" if IS_REFERENCE else None,
+ },
+ )
+ new_parent_div = move_old_elements_to_div(new_parent_div)
+
+ # Add new parent div back to the HTML
+ if POSITION and isinstance(POSITION, (Tag, NavigableString)):
+ POSITION.insert_before(new_parent_div)
+ else:
+ PARENT.append(new_parent_div)
+
+ return soup
+
+ def prep_examples(soup: BeautifulSoup):
+ """Some example body elements may have irregular formats. Normalize them to allow the function to correctly consolidate them."""
+
+ def fix_html_sample_and_html_code_discrepancies(soup: BeautifulSoup):
+ """
+ Handles example body elements that take the following form:
+
+ ```
+
+ [indentation][body component]
+
+ ```
+
+ Unwraps the inner div to produce the following normalized form:
+
+ ```
+
+ [indentation][body component]
+
+ ```
+ """
+ examples = soup.find_all("div", attrs={"class": "EX"})
+
+ for example in examples:
+ children = example.find_all(recursive=False)
+
+ if len(children) != 2:
+ continue
+
+ if children[0].name != "span" and children[1].name != "span":
+ continue
+
+ if re.search(r"\[(?:i\.)?\d{1,2}\]", children[0].get_text().strip()):
+ continue # This is a reference, not an example
+
+ indentation_tag = children[0]
+ body_tag = children[1]
+ cls = body_tag.get("class", [])[0]
+
+ new_tag_text = f"{indentation_tag.get_text()}{body_tag.get_text()}"
+ new_tag = soup.new_tag("span", attrs={"class": cls})
+ new_tag.append(new_tag_text)
+
+ indentation_tag.insert_after(new_tag)
+ indentation_tag.decompose()
+ body_tag.decompose()
+
+ return soup
+
+ soup = fix_html_sample_and_html_code_discrepancies(soup)
+
+ return soup
+
+ soup = prep_examples(soup)
+
+ divs = soup.find_all("div")
+ i = 0
+
+ while i < len(divs) - 1:
+ div = divs[i]
+ div_class = div.get("class", [])
+
+ # Operate only on divs with one class which is in `classes`...
+ if (
+ len(div_class) == 1
+ and div_class[0] in CLASSES
+ and div.parent.name
+ != "li" # Causes problems with divs inside of list items
+ ):
+ # Skip divs with only one child in them...
+ if div_class[0] in EXAMPLES_NOTES_CLASSES and is_beginning_ex_no(div):
+ i += 1
+ continue
+
+ if not div_is_in_series(div, divs[i + 1]):
+ i += 1
+ continue
+
+ # ...Otherwise, continue on
+ div_class = div_class[0]
+ group = [div] # Start tracking the divs, adding this first div
+ i += 1
+
+ group = populate_group(group, divs)
+
+ if len(group) < 2:
+ continue # There is nothing left to do
+
+ soup = add_body_div(soup, group)
+
+ else:
+ i += 1
+
+ return soup
+
+
+def format_lists(soup: BeautifulSoup):
+ """
+ ### Description
+ From sequences of divs applying list styles to single list items, form nested list structures.
+
+ ### Process
+ 1. Unwraps the divs that apply the list class to individual elements and applies that class to the list itself.
+ 2. Make lists out of list items contained in divs.
+ 3. Nest sublists within lists where appropriate.
+ """
+
+ def unwrap_list_items(soup: BeautifulSoup):
+ """Unwrap the divs applying the class name and add the class to the top-level `` or ``."""
+
+ def unwrap_list_items(ol_ul: Tag):
+ """Unwrap the contents of individual list items, but if the list item has multiple elements inside, convert those divs to paragraphs and wrap them inside a single parent paragraph."""
+ for li in ol_ul.find_all("li", recursive=False):
+ divs: list[Tag] = li.find_all("div")
+
+ if not divs:
+ continue
+
+ div = divs[0]
+ list_class = div.get("class")
+ if not ol_ul.get("class") and list_class:
+ ol_ul["class"] = list_class
+
+ if len(divs) == 1:
+ div.unwrap() # If this is the only div, simply unwrap it
+ continue
+
+ # If it has more than one div, wrap their contents in a single div
+ new_parent_div = soup.new_tag("div", attrs={"class": "list-item"})
+ for div in divs:
+ para = soup.new_tag("p")
+
+ for content in div.contents[:]:
+ para.append(content.extract())
+
+ new_parent_div.append(para)
+ div.decompose()
+
+ li.append(new_parent_div)
+
+ return ol_ul
+
+ for ol_ul in soup.find_all(HTML_LIST_TAGS):
+
+ ol_ul = unwrap_list_items(ol_ul)
+
+ return soup
+
+ def make_lists_from_divs(soup: BeautifulSoup):
+ """Make lists out of sequences of divs that represent lists. Apply attributes when apppropriate."""
+
+ def add_elements_to_group(group: list[Tag], i: int):
+ """Iterate through `elements`, adding items to `group` while there is still a sequence of divs applying list hierarchy classes. Return `group` when finished."""
+ nonlocal elements, current_class
+
+ j = i + 1
+
+ while j < len(elements):
+ next = elements[j]
+ next_class = next.get("class")
+ next_class = next_class[0] if next_class else None
+
+ if not (
+ isinstance(next, Tag)
+ and next.name == "div"
+ and next_class == current_class
+ ) or (isinstance(next, Tag) and next.find(HTML_LIST_TAGS)):
+ break # End of sequence
+
+ group.append(next.extract())
+ j += 1
+
+ return j - 1, group
+
+ def make_list(group: list[Tag]):
+ """Return a new list from the divs contained in `group`"""
+
+ def init_list():
+ """Initialize a new ordered or unordered list based on the first element in `group`. Return an ordered list if the first element contains some kind of indexing marker (ex., *1)*, *1.*, *a.*, or *a)* ), otherwise return an unordered list."""
+ nonlocal first_element_text, cls
+
+ # Get the starting value, if it exists
+ start = first_element_text.strip().split(" ")[0]
+ item_match = re.search(r"^([A-Za-z0-9]+)[\.\)]\s+", start)
+
+ # If the starting value exists, the list will be an ordered list
+ if item_match:
+ item_number = item_match.group(1)
+
+ # Reduce any instances of <digits><letters> (ex., "1a") to just the <letters> part
+ check = re.search(r"^\d+([A-Za-z]+)$", item_number)
+ item_number = check.group(1) if check else item_number
+
+ list_type = "1" if item_number.isdigit() else "a"
+
+ new_list = soup.new_tag(
+ "ol",
+ attrs={"class": cls, "start": item_number, "type": list_type},
+ )
+
+ # Otherwise, it will be an unordered list
+ else:
+ new_list = soup.new_tag("ul", attrs={"class": cls})
+
+ return new_list
+
+ first_element_text = group[0].get_text()
+ cls = group[0].get("class")[0] if group[0].get("class") else None
+
+ new_list = init_list()
+
+ # Add group elements to the list as list items
+ for elem in group:
+ li = soup.new_tag("li")
+
+ for elem_child in elem.contents:
+ # Strip bullets/numbering from child text, if applicable
+ if not elem_child.get_text().strip():
+ continue # No text
+
+ child_text = elem_child.get_text().strip()
+
+ index_portion = re.search(r"^([A-Za-z0-9]+[\.\)]\s+)", child_text)
+ bullet_portion = re.search(r"^(\-\s+)", child_text)
+
+ match = None
+
+ if index_portion:
+ match = index_portion.group(0)
+
+ elif bullet_portion:
+ match = bullet_portion.group(0)
+
+ # Set child text to the stripped text, if applicable
+ child_text = child_text[len(match) :] if match else child_text
+
+ # Add the child to the list item
+ if isinstance(elem_child, NavigableString):
+ li.append(NavigableString(child_text))
+
+ elif isinstance(elem_child, Tag):
+ li.append(elem_child.extract())
+
+ new_list.append(li)
+
+ return new_list
+
+ elements = soup.find("body").find_all(recursive=False)
+
+ i = 0
+
+ while i < len(elements):
+ current = elements[i]
+ current_class = current.get("class")
+ current_class = current_class[0] if current_class else None
+
+ is_part_of_code_block = (
+ isinstance(current, Tag)
+ and len(current.find_all()) == 1
+ and current.find("span", attrs={"class": "HTML_Sample"})
+ )
+
+ if (
+ not (isinstance(current, Tag) and current.name == "div")
+ or current_class not in LIST_HIERARCHY_CLASSES
+ or (is_part_of_code_block and current_class == "B1plus")
+ ):
+ i += 1
+ continue # Not a list item contained in a div
+
+ # Unwrap any divs contained herein that apply the same class that `current_class` applies
+ for div in current.find_all("div", attrs={"class": [current_class]}):
+ div.unwrap()
+
+ i, group = add_elements_to_group([current], i)
+
+ new_list = make_list(group)
+
+ current.replace_with(new_list)
+
+ i += 1
+
+ return soup
+
+ def nest_lists(soup: BeautifulSoup):
+ """Ensure list items are properly nested by comparing the level in the hierarchy contained in their classnames."""
+
+ def is_list(element):
+ """Returns `True` if the provided element is an ordered or unordered list, otherwise returns `False`."""
+ return isinstance(element, Tag) and element.name in HTML_LIST_TAGS
+
+ def get_level(classname: str):
+ """Takes a list-related classname like `B1`, `B2plus`, etc. Return an int representing the level at which the list should be nested based on the classname, ranging from 1 (top-level) to 5 (most nested). The `BN` and `BL` classes are at the top level."""
+ nesting_level = (
+ 1 if classname in ORDERED_LIST_ITEM_CLASSES else int(classname[1])
+ )
+ return nesting_level
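+ # e.g. get_level("B2plus") -> 2 and get_level("B4") -> 4, while the
+ # top-level classes in ORDERED_LIST_ITEM_CLASSES (the `BN`/`BL` family) map to 1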
+
+ elements = soup.find("body").find_all(recursive=False)
+
+ i = 0
+
+ while i < len(elements) - 1:
+ current = elements[i]
+ next = elements[i + 1]
+
+ if not (is_list(current) and is_list(next)):
+ i += 1
+ continue
+
+ current_class = current.get("class")[0]
+ next_class = next.get("class")[0]
+
+ current_level = get_level(current_class)
+ next_level = get_level(next_class)
+
+ if next_level > current_level:
+ current.append(next.extract())
+
+ elif next_level < current_level:
+ parent = current.find_parent(HTML_LIST_TAGS)
+ while parent:
+ parent_level = get_level(parent.get("class")[0])
+
+ if parent_level <= next_level:
+ break # Correct level reached
+
+ # If not at the right level, move back up one level
+ parent = parent.find_parent(HTML_LIST_TAGS)
+
+ if parent:
+ parent.insert_after(next.extract())
+ else:
+ i += 1
+
+ i += 1
+
+ return soup
+
+ soup = unwrap_list_items(soup)
+ soup = make_lists_from_divs(soup)
+ soup = nest_lists(soup)
+
+ return soup
+
+
+def create_code_blocks(soup: BeautifulSoup, css_src: list[str]):
+ """
+ Creates code blocks for code-like data, identified as:
+ - Sequences of spans that apply a CSS class which applies a monospaced font
+ - The contents of PL blocks
+ """
+
+ def create_code_block(lines: list[str]):
+ pre = soup.new_tag("pre")
+ code = soup.new_tag("code")
+
+ # Preserve code-like structure, one entry per line
+ plaintext = "\n".join(line.rstrip("\n") for line in lines) + "\n"
+
+ # Create the <pre><code>{...contents...}</code></pre> structure
+ code.append(plaintext)
+ code["class"] = "language-json" # Add JSON hint
+ pre.append(code)
+
+ return pre
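+ # e.g. create_code_block(['{', ' "a": 1', '}']) yields
+ # <pre><code class="language-json">{\n "a": 1\n}\n</code></pre>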
+
+ def make_block(soup: BeautifulSoup, current_group: list[Tag]):
+ """Makes a code block out of the paragraphs contained in `current_group`, then empties `current_group` to prepare for future code blocks"""
+ if len(current_group) < 2:
+ # There aren't enough elements to justify making a code block
+ return []
+
+ plaintext_lines = []
+
+ # Get paragraphs' lines
+ for p in current_group:
+ text = p.find("span").get_text(strip=False)
+
+ # Replace any newlines in the middle of the string (not at the beginning or the end) with spaces
+ middle_newline_regex = r"(?<!^)\n(?!$)"
+ text = re.sub(middle_newline_regex, " ", text)
+ plaintext_lines.append(text)
+
+ # Replace the group of paragraphs with a single code block
+ code_block = create_code_block(plaintext_lines)
+ current_group[0].insert_before(code_block)
+ for p in current_group:
+ p.decompose()
+
+ return [] # Empty the group to prepare for future code blocks
+
+ def handle_whole_divs(soup: BeautifulSoup, div: Tag, monospaced_div_classes: list[str]):
+ """Turn divs whose single class marks their entire contents as code (ex., PL) into a code block; the signature matches the call in the processing loop below."""
+ div_class = div.get("class", [])
+ if (
+ len(div_class) == 1
+ and div_class[0] in monospaced_div_classes
+ and len(div.contents) > 0
+ ):
+ plaintext_lines: list[str] = []
+
+ # Get just the text of the div's children
+ for child in div.children:
+ if isinstance(child, Tag) and child.name == "p":
+ text = child.get_text(strip=False)
+ plaintext_lines.append(text)
+
+ elif isinstance(child, NavigableString):
+ plaintext_lines.append(str(child))
+
+ elif isinstance(child, Tag):
+ plaintext_lines.append(child.get_text(strip=False))
+
+ # Remove excess newlines
+ plaintext_lines = [line for line in plaintext_lines if line.strip()]
+
+ # Replace old div with new code block
+ code_block = create_code_block(plaintext_lines)
+
+ div.insert_before(code_block)
+ div.decompose()
+
+ return soup
+
+ def cleanup_divs(soup: BeautifulSoup):
+ """There will be redundant divs leftover that only contain code blocks. Remove those divs, preserving the contents."""
+ for div in soup.find_all("div"):
+ children: list[Tag] = [
+ child
+ for child in div.contents
+ if not (isinstance(child, NavigableString) and child.strip() == "")
+ ]
+
+ if (
+ len(children) == 1
+ and isinstance(children[0], Tag)
+ and children[0].name == "pre"
+ ):
+ div.unwrap()
+
+ return soup
+
+ def cleanup_excess_newlines(soup: BeautifulSoup):
+ """Remove excessive newlines from some code blocks"""
+ for pre in soup.find_all("pre"):
+ if not (
+ len(pre.contents) == 1
+ and isinstance(pre.contents[0], Tag)
+ and pre.contents[0].name == "code"
+ ):
+ continue # This is not a block
+
+ code: Tag = pre.contents[0]
+
+ # Work only on code blocks that need to have their contents modified
+ lines = code.get_text().splitlines()
+ lines_stripped = [
+ line for line in code.get_text().splitlines() if line.strip()
+ ]
+
+ if len(lines) == len(lines_stripped):
+ continue # This code block is already formatted correctly
+
+ # Re-form the body of the code block
+ contents = "\n" + "\n".join(lines_stripped) + "\n"
+
+ # Replace contents
+ code.clear()
+ code.append(NavigableString(contents))
+
+ return soup
+
+ monospaced_classes = get_css_classes("monospaced", css_src)
+ monospaced_classes_for_divs = ["PL"]
+ classes_to_target_for_span_consolidation = ["B1plus", "EX", "NO"]
+ divs = soup.find_all("div")
+
+ # Process the divs
+ for div in divs[:]:
+ soup = handle_monospaced_span_sequences_within_div(
+ soup,
+ div,
+ monospaced_div_classes=classes_to_target_for_span_consolidation,
+ monospaced_span_classes=monospaced_classes,
+ )
+ soup = handle_whole_divs(soup, div, monospaced_classes_for_divs)
+
+ # Manipulations outside of divs
+ soup = handle_monospaced_span_sequences_outside_div(soup, monospaced_classes)
+
+ # Cleanup
+ soup = cleanup_divs(soup)
+ soup = cleanup_excess_newlines(soup)
+
+ return soup
+
+
+def handle_empty_tags(soup: BeautifulSoup):
+ """Removes tags without any content, or whose content is just a newline"""
+ tags = soup.find_all(True)
+
+ # Tags like these are considered empty because they take the form of <img />, but they aren't actually empty
+ non_empty_tags: list[str] = ["img"]
+
+ def is_empty(tag: Tag):
+ # Empty if no contents or only whitespace
+ if tag.name in non_empty_tags:
+ return False
+ if not tag.contents:
+ return True
+ if all(
+ (is_whitespace_navstr(contents))
+ or (isinstance(contents, Tag) and is_empty(contents))
+ for contents in tag.contents
+ ):
+ return True
+ return False
+
+ for tag in soup.find_all(True):
+ # We want to avoid removing table cells, since they may contain empty content. Removing them would break the table structure.
+ if tag.name in ["td", "th"]:
+ continue
+
+ if is_empty(tag):
+ tag.decompose()
+
+ return soup
+
+def fix_references(soup: BeautifulSoup):
+ """Fix references in the document"""
+ for text_element in soup.find_all(string=True):
+ REF_REGEX = r"\[(\d+)\](?![\(\[\{])"
+ # give me all matches
+ matches = re.findall(REF_REGEX, text_element)
+ if matches:
+ new_text = text_element
+ for match in matches:
+ # Replace [number] with [n.number]
+ new_text = re.sub(r"\[" + match + r"\](?![\(\[\{])", f"[n.{match}]", new_text)
+
+ # Replace the original text with the modified text
+ text_element.replace_with(new_text)
+ return soup
+
+def remove_style_from_images(soup: BeautifulSoup):
+ """Removes any style attributes from images"""
+ for img in soup.find_all("img"):
+ if img.has_attr("style"):
+ del img["style"]
+ if img.has_attr("alt"):
+ del img["alt"]
+
+ parent_tag = img.parent
+ if parent_tag and parent_tag.name == "span" and parent_tag.has_attr("style"):
+ # remove parent span if it only contains the image and has a style attribute and leave the img
+ parent_tag.replace_with(img)
+
+ return soup
+
+def handle_figures_tables_structure(soup: BeautifulSoup):
+ """Ensures that figures and tables captions are properly structured to follow ETSI guidelines"""
+ fl_divs = soup.find_all("div", class_="FL")
+ tf_divds = soup.find_all("div", class_="TF")
+ th_divds = soup.find_all("div", class_="TH")
+
+ for div in fl_divs + tf_divds + th_divds:
+ if div.get("class") == ["FL"]:
+ div_contents = [content for content in div.contents if not is_whitespace_navstr(content)]
+ if len(div_contents) == 1 and isinstance(div_contents[0], Tag) and div_contents[0].name == "img":
+ img = div_contents[0]
+ div.replace_with(img)
+ if div.get("class") == ["TF"] or div.get("class") == ["TH"]:
+ # add leading and trailing ** to the text in the div
+ text = div.get_text(strip=True)
+ if text:
+ new_text = f"**{text}**"
+ new_tag = soup.new_tag("p")
+ new_tag.string = new_text
+ div.replace_with(new_tag)
+
+ return soup
+# endregion
+
+
+def cleaning(soup: BeautifulSoup, css_src: list[str]):
+ """
+ ### Description
+ Cleaning that takes place as a part of preprocessing
+ - Make various changes to tags based on CSS class
+ - Remove tags that have no content
+ - Remove blockquotes
+ - Normalize improperly formatted lists
+ - Consolidate sequential spans and divs
+ - Create code blocks where applicable
+ - Fix lists
+
+ ### Arguments
+ - `soup`: the BeautifulSoup-parsed HTML source on which to operate.
+ - `css_src`: a list of the CSS source files used by the HTML.
+ """
+
+ soup = ensure_correct_css_class_use(soup)
+
+ soup = remove_example_tag_number_from_body(soup)
+
+ soup = handle_blockquotes(soup)
+
+ soup = handle_tags_for_empty_classes(soup, css_src)
+
+ soup = consolidate_sequential_spans(soup, css_src)
+
+ soup = format_lists(soup)
+
+ soup = consolidate_div_sequences(soup)
+
+ soup = handle_empty_tags(soup)
+
+ soup = create_code_blocks(soup, css_src)
+
+ soup = fix_references(soup)
+
+ soup = remove_style_from_images(soup)
+
+ soup = handle_figures_tables_structure(soup)
+
+ return soup
diff --git a/md_to_docx_converter/src/to_md/postprocessing.py b/md_to_docx_converter/src/to_md/postprocessing.py
new file mode 100644
index 0000000000000000000000000000000000000000..b80bc80eddf255130c3503deaa111624c9e316fc
--- /dev/null
+++ b/md_to_docx_converter/src/to_md/postprocessing.py
@@ -0,0 +1,268 @@
+import os, re
+
+from src.utils import apply_renaming_logic
+from src.constants import REFS
+
+# region Regex
+# For unescaping characters
+CHARACTER_REGEX = r"\\([\<\>\{\}\(\)\"\'])"
+SPAN_REGEX = r"(\[.+?\]\{[^\{\}]+?\})"
+REF_REGEX = r"\[\\\[((?:i\.)?\d{1,2})\\\]\]\((\d{1,2}-references.md#\1)\)"
+
+# For escaped characters inside of grid tables, preserve the positioning of the table structural elements (ex., pipes) by adding a space of padding for each "unescaped" character before the column's closing pipe
+REGEX_IN_TABLE = rf"\|([^|\n]*?(?:{CHARACTER_REGEX}|\\([\[\]]))[^|\n]*)\|"
+# endregion
+
+
+# region Helpers
+def remove_html_comments(file_contents: str):
+ """
+ Remove HTML comments, which follow this pattern:
+
+ ``
+ """
+ comment_regex = r"\<\!\-{2}.+?\-{2}\>"
+ file_contents = re.sub(comment_regex, "", file_contents)
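+ # e.g. "before <!-- stray comment --> after" -> "before  after"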
+
+ return file_contents
+
+
+def convert_links_to_markdown(file_contents: str):
+ """Since Pandoc doesn't do it automatically, change the file extensions within links to other locations in the document from `.html` to `.md`"""
+ # For links inside of grid tables, preserve the positioning of the table structural elements (ex., pipes) by "replacing" the two letters lost in the transition from "html" to "md" with spaces appended to the end of the link
+ regex_in_table = (
+ r"\|([^|\n]*?\(\d{1,2}-.+?)\.html#((?:i\.)?\d{1,2}|[\w.-]+?\)[^|\n]*?)\|"
+ )
+ replace_in_table = r"|\1.md#\2 |"
+
+ # For all other links
+ regex_not_in_table = r"\((\d{1,2}-.+?)\.html#((?:i\.)?\d{1,2}|[\w.-]+?)\)"
+ replace_not_in_table = r"(\1.md#\2)"
+
+ while True:
+ file_contents, count = re.subn(regex_in_table, replace_in_table, file_contents)
+ if (
+ count == 0
+ ): # Break out of the loop once there aren't any more substitutions made
+ break
+
+ file_contents = re.sub(regex_not_in_table, replace_not_in_table, file_contents)
+ return file_contents
+
+
+def unescape_characters(file_contents: str, is_cleanup: bool):
+ """
+ During conversion to Markdown, Pandoc automatically escapes Markdown special characters. In some cases, this is useful, however for others, this makes reading the text more difficult with no benefit, as the characters work fine for plain text if they aren't used in a way that triggers their special functionality.
+
+ Removes the backslash from before the following characters, thereby "unescaping" them:
+ - `<` and `>`
+ - `{` and `}`
+ - `(` and `)`
+ - `[` and `]`
+ - `"` and `'`
+ """
+
+ def unescape_brackets_in_line(line: str):
+ """Unescape brackets in a single line of text, returning the newly processed line and the number of brackets unescaped. This number is necessary to maintain padding inside of grid tables."""
+
+ def unescape_brackets_in_string(escape_string: str):
+ return re.sub(r"\\([\[\]])", r"\1", escape_string)
+
+ if not ("\\[" in line or "\\]" in line):
+ # Nothing to do, return the line and 0 for the number of lines escaped
+ return line, 0
+
+ replacement_line: list[str] = []
+ end_of_last_match = 0
+
+ len_before = len(line)
+
+ for regex in [SPAN_REGEX, REF_REGEX]:
+ # Do not unescape brackets in spans or references
+ for match in re.compile(regex).finditer(line):
+ # Unescape any brackets before the match
+ text_before_match = line[end_of_last_match : match.start()]
+ text_before_match = unescape_brackets_in_string(text_before_match)
+ replacement_line.append(text_before_match)
+
+ # Keep the match as it is
+ replacement_line.append(match.group(0))
+
+ # Prepare for next match
+ end_of_last_match = match.end()
+
+ # Handle the part of the line after the final match
+ text_after_last_match = line[end_of_last_match:]
+ text_after_last_match = unescape_brackets_in_string(text_after_last_match)
+ replacement_line.append(text_after_last_match)
+
+ new_line = "".join(replacement_line)
+
+ # Calculate the number of brackets unescaped, needed for table padding
+ len_after = len(new_line)
+ num_brackets_unescaped = len_before - len_after
+
+ return new_line, num_brackets_unescaped
+
+ def unescape_brackets(file_contents: str):
+ """Unescape brackets `[` and `]` outside of spans, but keep them escaped inside of spans to avoid Pandoc mistaking them for the span's opening/closing tags"""
+ lines = file_contents.split("\n")
+ replacement_lines: list[str] = []
+
+ for line in lines:
+ if re.search(r"\|(?:[^\|]+?)\|", line):
+ replacement_lines.append(line)
+ continue # This is in a table, lines in tables are handled with `replace_in_table`
+
+ replacement_line, _ = unescape_brackets_in_line(line)
+ replacement_lines.append(replacement_line)
+
+ file_contents = "\n".join(replacement_lines)
+
+ return file_contents
+
+ def replace_in_table(match: re.Match):
+ column_text: str = match.group(1)
+ column_text = column_text.replace(
+ "\xa0", " "
+ ) # Replace non-breaking spaces with normal spaces
+ length_before = len(column_text)
+
+ if is_cleanup:
+ # Remove trailing backslashes that get appended to the end of some [] pairs
+ column_text = column_text.replace("\\[\\]\\", "[]")
+
+ column_text = re.sub(CHARACTER_REGEX, r"\1", column_text)
+
+ # Handle brackets
+ column_text, _ = unescape_brackets_in_line(column_text)
+
+ length_after = len(column_text)
+
+ new_column = f'|{column_text}{" " * (length_before - length_after)}|'
+
+ return new_column
+
+ def unspace_captions(file_contents: str):
+ """Remove extraneous spaces in figure and table captions, which are denoted by starting with `**` and ending with `**`"""
+ lines = file_contents.split("\n")
+ replacement_lines: list[str] = []
+ caption_regex = r"(?:\\\*|\\?\*){2}\s*((?:Figure|Table)(?:\s+[A-Za-z]?\.?\d+(?:\.[A-Za-z0-9]+)*(?:-\d+)?)?(?:\s*:\s*.*?))\s*(?:\\\*|\\?\*){2}"
+ for line in lines:
+ if "|" in line: #we need to add spaces to keep the table structure
+ new_line = re.sub(caption_regex, r"**\1** ", line)
+ else:
+ new_line = re.sub(caption_regex, r"**\1**", line)
+ replacement_lines.append(new_line)
+ file_contents = "\n".join(replacement_lines)
+ return file_contents
+
+ prev_contents = None
+ while file_contents != prev_contents:
+ prev_contents = file_contents
+ file_contents, _ = re.subn(REGEX_IN_TABLE, replace_in_table, file_contents)
+
+ # For all other escaped characters
+ file_contents = re.sub(CHARACTER_REGEX, r"\1", file_contents)
+ file_contents = unescape_brackets(file_contents)
+ file_contents = unspace_captions(file_contents)
+
+ return file_contents
+
+def remove_toc_in_front_page(text: str) -> str:
+ """Remove the table of contents from the front page"""
+ # Find the start of the table of contents; it runs to the end of the front page
+ toc_start = text.find("::: TT")
+ toc_end = len(text)
+
+ # If the start marker is found, remove everything from it to the end
+ if toc_start != -1:
+ text = text[:toc_start] + text[toc_end:]
+
+ return text
+
+def add_ignore_prettier_text_before_after_tables(text: str) -> str:
+ """Add the text `--- IGNORE ---` before and after each table in the Markdown file to prevent Prettier from reformatting the tables"""
+
+ new_lines = []
+ lines = text.split("\n")
+ inside_table = False
+ for i in range(len(lines)):
+ line = lines[i]
+ if line.startswith("+--") or line.startswith("+=="):
+ if not inside_table:
+ new_lines.append("")
+ inside_table = True
+ new_lines.append(line)
+ else:
+ new_lines.append(line)
+ if i + 1 < len(lines):
+ next_line = lines[i + 1]
+ if not (next_line.startswith("|") or next_line.startswith("+--") or next_line.startswith("+==")):
+ inside_table = False
+ new_lines.append("")
+ else:
+ new_lines.append(line)
+
+ return "\n".join(new_lines)
+
+
+def rename_html_classes_in_text(text: str) -> str:
+ """Apply the renaming logic to all HTML classes in the text"""
+ return text.replace("{.HTML_","{.HTML-")
+
+# region Import from Word Logic
+def remove_leading_tab(filename: str) -> str:
+ REF_REGEX = r"(\d+)-tab"
+ filename = re.sub(REF_REGEX, r"\1-", filename)
+ return filename
+
+
+# endregion
+
+
+def postprocess(markdown_dir: str, is_cleanup: bool):
+ """
+ ### Description
+ Iterates through the generated Markdown files, applying various transformations to improve
+ formatting and readability:
+
+ 1. Removes table of contents from front page
+ 2. Removes HTML comments from all files
+ 3. Converts links from HTML to Markdown format
+ 4. Unescapes special characters that Pandoc unnecessarily escaped:
+ - Angle brackets, curly braces, parentheses, quotes
+ - Square brackets with special handling for tables and spans
+ 5. Adds prettier-ignore comments before and after tables to preserve formatting
+ 6. Renames HTML classes to use hyphens instead of underscores
+ 7. Applies special handling for reference pages
+
+ ### Args
+ - `markdown_dir`: String path to directory containing the generated Markdown files
+ - `is_cleanup`: `True` if additional formatting should be performed (when converting from "dirty" HTML), `False` otherwise
+ """
+
+ for filename in os.listdir(markdown_dir):
+ if filename.endswith(".md"):
+ file_path = os.path.join(markdown_dir, filename)
+
+ with open(file_path, "r", encoding="utf-8") as file:
+ text = file.read()
+
+ if filename == "front-page.md":
+ text = remove_toc_in_front_page(text)
+
+ text = remove_html_comments(text)
+
+ text = convert_links_to_markdown(text)
+ text = unescape_characters(text, is_cleanup)
+
+ text = add_ignore_prettier_text_before_after_tables(text)
+
+ text = rename_html_classes_in_text(text)
+
+ if filename.replace(".md", "") in REFS:
+ text = text.replace("{#list-of-references .EX}", "REFS")
+
+ with open(file_path, "w", encoding="utf-8") as file:
+ file.write(text)
diff --git a/md_to_docx_converter/src/to_md/preprocessing.py b/md_to_docx_converter/src/to_md/preprocessing.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9fb0f6faf617ae669322d7d4445fdffda61ce1f
--- /dev/null
+++ b/md_to_docx_converter/src/to_md/preprocessing.py
@@ -0,0 +1,468 @@
+import re, json, os
+from bs4 import BeautifulSoup, Tag, NavigableString
+
+from src.constants import HEADER_TAGS, EXAMPLE_NOTE_CLASSES
+
+from src.to_md.cleaning import cleaning as preprocess_cleaning
+
+
+# region Helpers
+def remove_pandoc_toc(soup: BeautifulSoup):
+ """
+ Removes various elements added by Pandoc during conversion to HTML, but would only clutter the Markdown:
+ - Table of Contents
+ - Links to the CSS style sheets
+ - Document headers and footers
+ - Various buttons and switches
+ - Editor tag enclosing the document's text
+
+ These will be added back in during conversion to HTML.
+ """
+ # Remove the Table of Contents
+ toc = soup.find(id="TOC") or soup.find("nav", {"class": "toc"})
+ if toc:
+ toc.decompose()
+
+ # Remove CSS links
+ for link in soup.find_all("link", rel="stylesheet"):
+ link.decompose()
+
+ # Remove headers and footers
+ for h_f in soup.select(".footer, .header"):
+ h_f.decompose()
+
+ # Handle div with id="editor", which contains the body contents. There is only one of its kind.
+ editor_div = soup.find("div", attrs={"id": "editor"})
+ body = soup.find("body")
+
+ editor_div.extract()
+ body.append(editor_div)
+ editor_div.unwrap()
+
+ for element in soup.select("button, label.switch, span.sider, input#editing"):
+ element.decompose()
+
+ return soup
+
+
+def remove_flex(soup: BeautifulSoup):
+ """Removes the `` and `` elements."""
+ for flex in soup.find_all("flex"):
+ # Decomposing the flex will cause the child s to likewise decompose
+ flex.decompose()
+
+ return soup
+
+
+def format_abbreviations(soup: BeautifulSoup):
+ """
+ Correctly formatted abbreviations take this form and appear in a series:
+
+ `<div class="EW"><div>[abbreviation]</div><div>[meaning of abbreviation]</div></div>`
+
+ Reformat this to have Pandoc produce a format easier to edit in the Markdown:
+
+ `<div class="EW"><p>[abbreviation] [meaning]</p><p>[abbreviation] [meaning]</p> ... </div>`
+ """
+ ABBREVIATION_CLASS = "EW"
+
+ def is_abbreviation(element):
+ if not element.name or element.name != "div":
+ return False
+
+ div: Tag = element
+
+ children = div.find_all(recursive=False)
+ cls = div.get("class", [])
+
+ if len(cls) != 1:
+ return False
+
+ cls = cls[0]
+ if cls != ABBREVIATION_CLASS:
+ return False
+
+ if len(children) != 2:
+ return False
+
+ if children[0].name != "div" and children[1].name != "div":
+ return False
+
+ return True
+
+ def make_abbreviation_paragraph(soup: BeautifulSoup, abbr_div: Tag):
+ """From a single unprocessed abbreviation div, create a paragraph containing the abbreviation and its meaning separated by a single space"""
+ children = abbr_div.find_all("div", recursive=False)
+ abbreviation = children[0].get_text()
+ meaning = children[1].get_text()
+
+ abbreviation_paragraph = soup.new_tag("p")
+ abbreviation_paragraph.append(NavigableString(f"{abbreviation} {meaning}"))
+
+ return abbreviation_paragraph
+
+ if soup.find("div", attrs={"class": ABBREVIATION_CLASS}) is None:
+ return soup # Nothing to do here
+
+ elements: list[Tag] = soup.find_all(
+ "div", attrs=lambda attrs: attrs and len(attrs) > 0
+ )
+
+ i = 0
+
+ while i < len(elements):
+ element = elements[i]
+
+ if not is_abbreviation(element):
+ i += 1
+ continue
+
+ consolidated_abbr_div = soup.new_tag(
+ "div", attrs={"class": ABBREVIATION_CLASS}
+ ) # Consolidate abbreviations in this div
+ old_abbreviations: list[Tag] = []
+
+ old_abbreviations.append(element)
+
+ # Add the first abbreviation to the div
+ abbr_para = make_abbreviation_paragraph(soup, abbr_div=element)
+ consolidated_abbr_div.append(abbr_para)
+
+ # Check if this div is in a series...
+ is_in_series = False
+ if i + 1 < len(elements) and is_abbreviation(elements[i + 1]):
+ is_in_series = True
+
+ if not is_in_series: # Insert the new div here and remove the old one
+ element.insert_after(consolidated_abbr_div)
+ element.decompose()
+
+ i += 1
+ continue
+
+ # ...if so, continue processing divs...
+ while i < len(elements) - 1:
+ i += 1
+ next_element = elements[i]
+
+ if is_abbreviation(next_element):
+ old_abbreviations.append(next_element)
+
+ next_abbr_para = make_abbreviation_paragraph(
+ soup, abbr_div=next_element
+ )
+ consolidated_abbr_div.append(next_abbr_para)
+ elif isinstance(next_element, NavigableString):
+ continue # Simply iterate over these, they don't matter as far as consolidation is concerned
+ else:
+ break # Finished the sequence
+
+ # ...and finally, add the new div and remove the old ones
+ old_abbreviations[-1].insert_after(consolidated_abbr_div)
+ for old_abbr in old_abbreviations:
+ old_abbr.decompose()
+
+ i += 1
+
+ return soup
+
+
+def format_notes_and_examples(soup: BeautifulSoup, is_cleanup: bool):
+ """
+ Does two things with examples and notes:
+ 1. Converts the inner divs to paragraphs
+ 2. If the cleanup flag is passed and if there are any code blocks that follow, move them into the main body of the tag
+ """
+
+ def convert_divs_to_paragraphs(
+ soup: BeautifulSoup, classes: list[str] = EXAMPLE_NOTE_CLASSES
+ ):
+ # Only get divs that apply the EX or NO classes
+ divs = soup.find_all(
+ "div",
+ class_=lambda class_attr: class_attr
+ and any(cls in class_attr for cls in classes),
+ )
+
+ for div in divs:
+ children = [child for child in div.children if isinstance(child, Tag)]
+
+ # Only want divs with one or two classless child divs: the body alone, or a tag div and a body div
+ child_is_classless_div: bool = (
+ len(children) == 1 # One child...
+ and children[0].name == "div" # ...which is a div...
+ and len(children[0].get("class", [])) == 0 # ...and is classless
+ )
+
+ children_are_classless_divs: bool = (
+ len(children) == 2 # Two children...
+ and children[0].name == "div" # ...the first is a div...
+ and len(children[0].get("class", [])) == 0 # ...with no class...
+ and children[1].name == "div" # ...and the second is a div...
+ and len(children[1].get("class", [])) == 0 # ...with no class
+ )
+
+ if not child_is_classless_div and not children_are_classless_divs:
+ continue
+
+ if child_is_classless_div:
+ child_div: Tag = children[0]
+ child_paragraph: Tag = soup.new_tag("p")
+
+ child_paragraph.extend(child_div.contents)
+
+ child_div.insert_before(child_paragraph)
+ child_div.decompose()
+
+ if children_are_classless_divs:
+ tag_div: Tag = children[0]
+ body_div: Tag = children[1]
+
+ tag_paragraph: Tag = soup.new_tag("p")
+ body_paragraph: Tag = soup.new_tag("p")
+
+ tag_paragraph.extend(tag_div.contents)
+ body_paragraph.extend(body_div.contents)
+
+ tag_div.insert_before(tag_paragraph)
+ tag_div.decompose()
+
+ body_div.insert_before(body_paragraph)
+ body_div.decompose()
+
+ return soup
+
+ def move_code_blocks_to_body(
+ soup: BeautifulSoup, classes: list[str] = EXAMPLE_NOTE_CLASSES
+ ):
+ divs = soup.find_all(
+ "div",
+ class_=lambda class_attr: class_attr
+ and any(cls in class_attr for cls in classes),
+ )
+
+ for div in divs:
+ div_class = div.get("class", [])[0]
+
+ current = div.find_next_sibling()
+
+ while current:
+ # Skip siblings that aren't tags
+ if not isinstance(current, Tag):
+ current = current.find_next_sibling()
+ continue
+
+ current_class = ( # Get the class, if applicable
+ current.get("class", [])[0]
+ if current.get("class", []) and len(current.get("class", [])) == 1
+ else None
+ )
+
+ # Stop searching once it finds something to add to the body
+ current_children = [
+ child for child in current.children if isinstance(child, Tag)
+ ]
+ if not current_children:
+ break # This is a div without tags, not the kind of example or note to which the code block should be added
+
+ child = current_children[0]
+
+ is_successive_body_element = current_class == div_class
+ is_code_block = current.name == "pre" and child.name == "code"
+
+ if not (is_successive_body_element or is_code_block):
+ break
+
+ next_sibling = current.find_next_sibling()
+
+ # Handle code blocks
+ if is_code_block:
+ div.append(current)
+
+ # Handle successive elements
+ elif is_successive_body_element:
+
+ current_tags = current.find_all(True) # Match every descendant tag
+
+ if (
+ "EXAMPLE" in current_tags[0].get_text()
+ or "NOTE" in current_tags[0].get_text()
+ ):
+ break # This is the beginning of another example or note
+
+ # Move the tags to the body
+ for tag in current_tags:
+ div.append(tag.extract())
+
+ # Remove the old div containing the body elements
+ current.extract()
+
+ current = next_sibling
+
+ return soup
+
+ soup = convert_divs_to_paragraphs(soup)
+
+ if is_cleanup:
+ soup = move_code_blocks_to_body(soup)
+
+ return soup
+
+
+def simplify_headings(soup: BeautifulSoup, is_cleanup: bool):
+ """
+ Preprocess headings so that when Pandoc converts them to HTML, they are more human-manageable.
+ """
+
+ def update_filenames_in_ids(soup: BeautifulSoup):
+ """Ensure that the filenames in inner links reflect the new filename. This involves removing any substrings `tab` from href attributes."""
+ tags = soup.find_all(attrs={"href": True})
+
+ for tag in tags:
+ href_text = tag.get("href")
+
+ if "tab" in href_text:
+ tag["href"] = href_text.replace("tab", "")
+
+ return soup
+
+ headings = soup.find_all(HEADER_TAGS)
+
+ mapping = {}
+
+ for heading in headings:
+ if heading.has_attr("data-number"):
+ del heading["data-number"]
+
+ if is_cleanup:
+ old_id = heading["id"].replace("tab", "") if heading.has_attr("id") else ""
+ heading_text = heading.get_text()
+ annex_regex = r"^Annex\s+([A-Z])\s"
+ clause_number_regex = r"^([A-Z]?\.?\d+(\.\d+)*)\s"
+ if re.match(annex_regex, heading_text):
+ match = re.match(annex_regex, heading_text)
+ new_id = match.group(1) # Extracts only the letter (A, B, C, etc.)
+ elif re.match(clause_number_regex, heading_text):
+ match = re.match(clause_number_regex, heading_text)
+ new_id = match.group(1) # Extracts the clause number
+ else:
+ # this is a clause without a number
+ new_id = old_id
+ if new_id != old_id:
+ mapping[old_id] = new_id
+
+ if heading.has_attr("id"):
+ del heading["id"]
+
+ soup = update_filenames_in_ids(soup)
+
+ return soup, mapping
+
+
+def edit_links_accordingly_with_expected_filenames(
+ soup: BeautifulSoup, filenames_mapping: dict
+):
+ """
+ Edit links in the document to point to the expected filenames based on the provided mapping.
+ """
+ for old_filename, new_filename in filenames_mapping.items():
+ # This can happen when converting from docx to html and then md: Pandoc adds the word "tab" because a "\t" appears in the title of an ETSI document
+ if re.match(r"^\d+-tab", old_filename):
+ old_filename = old_filename.replace("-tab", "-", 1)
+ new_filename = (
+ new_filename.replace("html", "md")
+ if "html" in old_filename
+ else new_filename.replace("md", "html")
+ )
+ # Update links in the soup
+ for tag in soup.find_all(
+ attrs={"href": lambda href: href and href.startswith(old_filename + "#")}
+ ):
+ # replace the portion of href left to the "#"
+ tag["href"] = tag["href"].replace(old_filename, new_filename)
+
+ return soup
+
+
+def fix_link_to_headings_with_id_mapping(src: str, mapping: dict):
+ """
+ Fix links to headings in the HTML file using the provided old-to-new heading ID mapping.
+ """
+
+ with open(src, "r", encoding="utf-8") as f:
+ soup = BeautifulSoup(f, "html.parser")
+
+ for old_id, new_id in mapping.items():
+ links = soup.find_all(
+ attrs={"href": lambda href: href and href.endswith(f"#{old_id}")}
+ )
+ for link in links:
+ link["href"] = link["href"].replace(old_id, new_id)
+
+ content = soup.decode_contents()
+ with open(src, "w", encoding="utf-8") as f:
+ f.write(content)
+
+
+def preprocess(
+ src_path: str, dest_path: str, is_cleanup: bool, css_src, filenames_mapping: dict
+):
+ """
+ ### Description
+ Preprocessing mandatory for conversion from HTML to Markdown. **Operating on a single source HTML file**, performs the following tasks:
+
+ 1. Removes the table of contents, CSS links, headers/footers, buttons, and other Pandoc-generated elements
+ 2. Removes flex and flex-item elements that aren't needed in Markdown
+ 3. Formats abbreviations to have a more readable structure in Markdown
+ 4. Restructures notes and examples:
+ - Converts inner divs to paragraphs
+ - Moves related code blocks into the body of notes/examples (when cleanup flag is enabled)
+ 5. Simplifies headings by:
+ - Removing data-number attributes
+ - Generating cleaner IDs based on heading text
+ - Creating a mapping between old and new IDs
+ 6. Updates links in the document to use expected filenames (when cleanup flag is enabled)
+ 7. Applies additional cleaning operations for "dirty" HTML sources
+
+ ### Arguments
+ - `src_path`: The absolute or relative path to the source HTML file
+ - `dest_path`: The absolute or relative path where the processed HTML file will be saved
+ - `is_cleanup`: `True` if additional formatting should be performed (when converting from "dirty" HTML), `False` otherwise
+ - `css_src`: A list of absolute or relative paths to all source CSS files (needed for cleanup processing)
+ - `filenames_mapping`: Dictionary mapping old filenames to expected filenames
+
+ ### Returns
+ - A dictionary mapping old heading IDs to new heading IDs
+ """
+
+ with open(src_path, "r", encoding="utf-8") as html:
+ soup = BeautifulSoup(html, "html.parser")
+
+ soup = remove_pandoc_toc(soup)
+ soup = remove_flex(soup)
+
+ if is_cleanup:
+ soup = preprocess_cleaning(soup, css_src)
+ else:
+ # Preprocessing that is only applicable when converting "clean" HTML to Markdown
+ soup = format_abbreviations(soup)
+
+ soup = format_notes_and_examples(soup, is_cleanup)
+ soup, mapping = simplify_headings(soup, is_cleanup)
+ if is_cleanup:
+ soup = edit_links_accordingly_with_expected_filenames(soup, filenames_mapping)
+
+ contents = soup.decode_contents()
+
+ with open(dest_path, "w", encoding="utf-8") as html_new:
+ html_new.write(contents)
+
+ return mapping
diff --git a/md_to_docx_converter/src/utils.py b/md_to_docx_converter/src/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..b9073bd3ed05641ba6a68358b1d366469d54a32e
--- /dev/null
+++ b/md_to_docx_converter/src/utils.py
@@ -0,0 +1,750 @@
+import re, os, sys, json, cssutils
+from typing import Literal
+from cssutils.css import CSSStyleSheet, CSSStyleRule, CSSStyleDeclaration
+from bs4 import BeautifulSoup, NavigableString
+
+from src.constants import (
+ ABBREVIATION_CLASS,
+ FRM_TYPE,
+ OUTPUT_DOC_NAME,
+ PERM_CSS_FILE,
+ REFERENCE_DOC,
+ TO_TYPE,
+ CONSOLIDATED_MD_NAME,
+ CONSOLIDATED_HTML_NAME,
+ ETSI_INITIAL_FILES,
+ DEFAULT_CLAUSES,
+ DEFAULT_HTML_CLAUSES,
+ DEFAULT_ANNEXES,
+ ETSI_FINAL_FILES,
+ SCOPE,
+ REFS,
+ DEFS
+)
+
+cssutils.log.setLevel("FATAL") # Suppress irrelevant warnings and errors
+
+
+# region Warning and Error tags
+def p_warning(text: str):
+ return f"\033[1m\033[33mWARNING:\033[0m {text}"
+
+
+def p_error(text: str):
+ return f"\033[1m\033[31mERROR:\033[0m {text}"
+
+
+def p_label(text: str):
+ return f"\033[4m\033[1m{text}\033[0m"
+
+
+# endregion
+
+
+# region Validators for script arguments
+def validate_src_directory(path):
+ """
+ Ensures the provided source directory is valid.
+
+ #### Args
+ `path`: The user-defined source directory provided through the src flag
+ """
+ if not os.path.isdir(path):
+ print(f"{p_error} Source path {p_label(path)} is not a valid directory")
+ sys.exit(1)
+
+
+def validate_type(file_type, is_src: bool):
+ """Ensure that the source and destination file types are HTML or Markdown"""
+ acceptable_src_types = {"html_dirty", "html", "md"}
+ acceptable_dest_types = {"html", "md", "docx"}
+
+ if (is_src and file_type not in acceptable_src_types) or (
+ not is_src and file_type not in acceptable_dest_types
+ ):
+ print(
+ f'File type {p_label(file_type)} is not an acceptable file type\nAcceptable source types: "html_dirty", "html", "md"; acceptable destination types: "html", "md", "docx"'
+ )
+ sys.exit(1)
+
+
+def validate_conversion(frm: FRM_TYPE, to: TO_TYPE):
+ """
+ Ensures that the conversion is of one of the following valid types:
+ - `HTML` -> `Markdown` OR `Docx`
+ - Includes dirty `HTML` -> clean `Markdown`
+ - `Markdown` -> `HTML`
+
+ If not, prints an error and exits
+ """
+ VALID_CONVERSIONS = [
+ ["html", "md"],
+ ["html_dirty", "md"],
+ ["md", "html"],
+ ["html", "docx"],
+ ]
+
+ if [frm, to] not in VALID_CONVERSIONS:
+ print(
+ p_error(
+ f"Invalid destination type {p_label(to)} for source type {p_label(frm)}"
+ )
+ )
+ sys.exit(1)
+
+
+def validate_file_order_json(json_file: str, is_conv_md_to_html: bool):
+ """Various validation checks for the filepath provided by the `--file_order` argument and, if necessary, print any warnings and errors and exit the script as necessary."""
+
+ def is_json_malformed():
+ with open(json_file, "r") as file:
+ json_data = json.load(file)
+
+ for key in ["clauses", "annexes"]:
+ if key not in json_data:
+ print(
+ p_error(
+ f"Missing key {p_label(key)} in JSON file at {p_label(json_file)}"
+ )
+ )
+ return True
+
+ if not isinstance(json_data[key], list):
+ print(
+ p_error(
+ f"{p_label(json_file)}: Value at {p_label(key)} must be a list of strings"
+ )
+ )
+ return True
+
+ if not all(isinstance(item, str) for item in json_data[key]):
+ print(
+ p_error(
+ f"{p_label(json_file)}: Value at {p_label(key)} must be a list of strings"
+ )
+ )
+ return True
+
+ for item in json_data[key]:
+ if re.search(r"\.(?:md|html)$", item):
+ print(
+ p_error(
+ f"{p_label(json_file)}: Value {item} in {p_label(key)} must not contain file extensions"
+ )
+ )
+ return True
+
+ return False
+
+ if is_conv_md_to_html:
+ if not os.path.exists(json_file):
+ print(p_error(f"No file found at {p_label(json_file)}"))
+ sys.exit(1)
+
+ if json_file and not json_file.endswith(".json"):
+ print(p_error(f"Provided file {p_label(json_file)} is not a JSON file."))
+ sys.exit(1)
+
+ if is_json_malformed():
+ # Appropriate error message is printed in `is_json_malformed()`
+ sys.exit(1)
+
+ else:
+ if json_file:
+ print(
+ p_warning("No file ordering JSON will be used during this conversion.")
+ )
+
+
+def are_files_present(
+ src_dir: str,
+ clauses: list[str] = DEFAULT_CLAUSES,
+ annexes: list[str] = DEFAULT_ANNEXES,
+):
+ """Assess whether always-present files and clause and annex files are present. If any are missing, print warnings or errors as needed and return `False`. If all necessary files are present, return `True`."""
+ # Ensure all standard ETSI files are present
+ for filelist in [ETSI_INITIAL_FILES, ETSI_FINAL_FILES]:
+ for file in filelist:
+ if not os.path.exists(os.path.join(src_dir, f"{file}.md")):
+ print(
+ p_error(
+ f"File {p_label(f'{file}.md')} not found in source directory {p_label(src_dir)}"
+ )
+ )
+ return False
+
+ # Assess and handle clauses and annexes
+ for file_list in [clauses, annexes]:
+ for file in file_list:
+ file_exists = os.path.exists(os.path.join(src_dir, f"{file}.md"))
+
+ if file_list != DEFAULT_CLAUSES and file_list != DEFAULT_ANNEXES:
+ if not file_exists:
+ # Print an error and report failure
+ print(
+ p_error(
+ f"File {p_label(f'{file}.md')} not found in source directory {p_label(src_dir)}"
+ )
+ )
+ return False
+
+ else:
+ if not file_exists:
+ # Print warning and continue
+ print(
+ p_warning(
+ f"File {p_label(f'{file}.md')} not found in source directory {p_label(src_dir)}, continuing on"
+ )
+ )
+ continue
+
+ return True
+
+
+# endregion
+
+
+# region Command Generators
+def get_html_to_md_command(md_input_path: str, output_path: str):
+ """
+ Returns an array containing elements of the Pandoc command to convert from HTML to Markdown.
+
+ ### Arguments
+ - `md_input_path`: The absolute or relative path of the preprocessed HTML file generated during the preprocessing step.
+ - `output_path`: The absolute or relative path to which the generated Markdown file will be saved.
+
+ ### Command
+
+ NOTE: The verbose destination Markdown format specified by the `-t` parameter forces Pandoc to generate grid tables.
+
+ `pandoc -f html <md_input_path> -t markdown-pipe_tables-simple_tables-multiline_tables+grid_tables --wrap=none -o <output_path> --lua-filter=html_to_md.lua`
+ """
+ command = [
+ "pandoc",
+ "-f",
+ "html",
+ md_input_path,
+ "-t",
+ "markdown-pipe_tables-simple_tables-multiline_tables+grid_tables",
+ "--wrap=none",
+ "-o",
+ output_path,
+ "--lua-filter=html_to_md.lua",
+ ]
+
+ return command
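+
+# Illustrative usage (assuming the caller runs the command with the standard library's subprocess):
+# subprocess.run(get_html_to_md_command("clause-4--preprocessed--.html", "clause-4.md"), check=True)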
+
+
+def get_md_to_html_command(
+ src: str, dest: str, consolidated_md_path: str, css_src: list[str]
+):
+ """
+ ### Description
+ Returns an array containing elements of the Pandoc command to convert from Markdown to HTML.
+
+ ### Arguments
+ - `src`: The source directory containing Markdown files to convert to HTML
+ - `dest`: The destination directory that will contain the generated HTML files
+ - `consolidated_md_path`: The absolute or relative path to a single Markdown file that contains the contents of all the Markdown files to convert to HTML.
+ - `css_src`: A list of absolute or relative paths to CSS source files that the generated HTML will use. The class names will correspond to those used in the Pandoc-formatted tags in the Markdown.
+
+ ### Command
+
+ `pandoc -f markdown <consolidated_md_path> -t chunkedhtml --wrap=none -o <dest> --css=<css_file> (repeated for subsequent CSS files) --standalone --toc --toc-depth 6 --template=official.html --split-level=1 --lua-filter=md_to_html.lua --lua-filter=md_to_html_2.lua --lua-filter=md_to_html_3.lua --resource-path=<src>`
+ """
+ command = [
+ "pandoc",
+ "-f",
+ "markdown",
+ consolidated_md_path, # Uses temporary MD file
+ "-t",
+ "chunkedhtml",
+ "--wrap=none",
+ "-o",
+ dest,
+ *[f"--css={css_file}" for css_file in css_src],
+ "--standalone",
+ "--toc",
+ "--toc-depth",
+ "6",
+ "--template=official.html",
+ "--split-level=1",
+ "--lua-filter=md_to_html.lua",
+ "--lua-filter=md_to_html_2.lua",
+ "--lua-filter=md_to_html_3.lua",
+ f"--resource-path={src}",
+ ]
+
+ return command
+
+
+def get_html_to_docx_command(dest: str, consolidated_html_path, output_doc_path):
+ """
+ Returns an array containing elements of the Pandoc command to convert from HTML to Docx.
+
+ ### Arguments
+ - `dest`: The destination directory for the generated Docx, used for accessing the `media` directory
+ - `consolidated_html_path`: The absolute or relative path to the HTML file containing the consolidated contents of all the HTML files to convert to Docx
+ - `output_doc_path`: The absolute or relative path at which to save the generated Docx
+
+ ### Command
+
+ `pandoc -f html <consolidated_html_path> --resource-path=<dest> -t docx -o <output_doc_path> --lua-filter=html_to_docx.lua --reference-doc={REFERENCE_DOC, that is, customized_reference.docx}`
+ """
+ command = [
+ "pandoc",
+ "-f",
+ "html",
+ consolidated_html_path,
+ f"--resource-path={dest}",
+ "-t",
+ "docx",
+ "-o",
+ output_doc_path,
+ "--lua-filter=html_to_docx.lua",
+ f"--reference-doc={REFERENCE_DOC}",
+ ]
+
+ return command
+
+
+# endregion
+
+
+# region Consolidated File Generators
+def handle_consolidated_md(
+ action: Literal["create", "delete"],
+ src: str,
+ consolidated_md_path: str,
+ ordered_files: list[str] = None,
+):
+ """
+ For use when converting to HTML. Consolidates the input files into a single file or deletes that file.
+
+ #### Args
+ - `action`: Either "create" or "delete", defining the function's behavior
+ - `src`: The absolute or relative path to the directory containing the Markdown files to consolidate
+ - `consolidated_md_path`: The absolute or relative path at which to create, or from which to delete, the consolidated Markdown file
+ - `ordered_files`: The ordered list of Markdown filenames to consolidate (only used when creating)
+ """
+ if action not in ["create", "delete"]:
+ raise ValueError(f'Invalid action {action}, expected "create" or "delete"')
+
+ consolidated_md_name = consolidated_md_path.split("/")[-1]
+
+ def create():
+ """Handles creation of the consolidated Markdown file"""
+
+ # Files to protect from the temporary-file cleanup below
+ excluded_md_files = [consolidated_md_name, "index.md"]
+
+ create_consolidated_file(src, consolidated_md_path, ordered_files)
+
+ # Clean up temporary files
+ for file in os.listdir(src):
+ if "--preprocessed--" in file and file not in excluded_md_files:
+ os.remove(os.path.join(src, file))
+
+ def delete():
+ """Handles deletion of the consolidated Markdown file"""
+ if os.path.exists(consolidated_md_path):
+ os.remove(consolidated_md_path)
+
+ if action == "create":
+ create()
+
+ if action == "delete":
+ delete()
+
+
+def get_prefix_of_pandoc_html_generated_file(filename):
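+ """Return the leading integer prefix of a Pandoc-generated filename for use as a sort key; names without one sort last (infinity)."""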
+ prefix = re.match(r"^(\d+)", filename)
+ prefix = int(prefix.group(1)) if prefix else float("inf")
+ return prefix
+
+
+def handle_html_consolidation(
+ action: Literal["create", "delete"],
+ src: str,
+ consolidated_html_path: str,
+):
+ """
+ Consolidates all the input HTML files into a single HTML file for conversion to Docx or deletes the consolidated file, depending on the value passed to `action`.
+
+ #### Args
+ `action`: `"create"` or `"delete"`, specifies whether to create or delete the file
+ `src`: The source directory containing HTML Files to consolidate
+ `consolidated_html_path`: The absolute or relative path at which to save/from which to delete the consolidated HTML file
+ """
+ if action not in ["create", "delete"]:
+ raise ValueError(f'Invalid action {action}, expected "create" or "delete"')
+
+ def create():
+ """Handles creation of the consolidated HTML file"""
+
+ # Ordering the files for the table of contents
+ # Include only the numbered files
+ excluded_html_files = [CONSOLIDATED_HTML_NAME, "index.html"]
+
+ clauses = DEFAULT_HTML_CLAUSES
+ annexes = DEFAULT_ANNEXES
+ ordered_files = order_local_files_by_prefix(src, excluded_html_files, "html", clauses, annexes)
+ create_consolidated_file(src, consolidated_html_path, ordered_files)
+
+ # Clean up temporary files
+ for file in os.listdir(src):
+ if "--preprocessed--" in file and file not in excluded_html_files:
+ os.remove(os.path.join(src, file))
+
+ def delete():
+ """Handles deletion of the consolidated HTML file"""
+ if os.path.exists(consolidated_html_path):
+ os.remove(consolidated_html_path)
+
+ if action == "create":
+ create()
+
+ if action == "delete":
+ delete()
+
+
+# endregion
+
+
+# region Files helpers
+def get_file_order(
+ src_dir: str,
+ clauses: list[str],
+ annexes: list[str],
+):
+ """Return a list of strings defining the order in which the files should be processed during conversion. If no list of clauses is provided, use the default clauses. If no list of annexes is provided, use the default annexes."""
+
+ def is_file_in_list(filename: str):
+ """Return `True` if this file or a file with a name like this one's name is already in the list defining the file order, otherwise `False`"""
+ nonlocal new_file_order
+
+ filename = filename.split(".")[0] # Remove file extension
+
+ for added_file in new_file_order:
+ if filename in added_file:
+ return True
+
+ return False
+
+ # Begin the list
+ new_file_order = list(ETSI_INITIAL_FILES) # Copy to avoid mutating the shared constant
+ clauses_ordered = []
+ annexes_ordered = []
+
+ # Assess and handle clauses and annexes
+ for file_list, item_prefix in [(clauses, "clause-"), (annexes, "annex-")]:
+ for file in file_list:
+ new_file_order.append(file)
+
+ if item_prefix == "clause-":
+ clauses_ordered.extend(file_list)
+ elif item_prefix == "annex-":
+ annexes_ordered.extend(file_list)
+
+ # Add other clauses and files in alphabetical order
+ for filename in sorted(os.listdir(src_dir)):
+ if filename.startswith(item_prefix) and not is_file_in_list(filename):
+ name = os.path.splitext(filename)[0] # Filename without extension
+
+ if name not in file_list:
+ new_file_order.append(name)
+
+ if item_prefix == "clause-":
+ clauses_ordered.append(name)
+ elif item_prefix == "annex-":
+ annexes_ordered.append(name)
+
+ # End the list
+ new_file_order.extend(ETSI_FINAL_FILES)
+
+ return new_file_order, clauses_ordered, annexes_ordered
+
+
+def get_output_doc_path(dest: str):
+ """
+ Returns the path at which the generated Docx is saved/from which the generated Docx is accessed
+
+ ### Arguments
+ `dest`: The destination directory in which the Docx should be saved
+ """
+ return f"{dest}/{OUTPUT_DOC_NAME}"
+
+
+def apply_renaming_logic(text: str, filename: str, postfix: str) -> str:
+ new_filename = filename
+ if postfix == "md":
+ if text.startswith("::: ZA"):
+ new_filename = f"front-page.{postfix}"
+ else:
+ header_regex = r"^# (\d+)"
+ filename_regex = r"^\d+-([\w-]+?)\.md$" # We are inside the postfix == "md" branch
+ if text.startswith("# Annex"):
+ new_filename = f"annex-{text.split()[2].lower()}.{postfix}"
+ elif re.match(header_regex, text):
+ match = re.match(header_regex, text)
+ chapter_number = match.group(1)
+ new_filename = f"clause-{chapter_number}.{postfix}"
+ else: # it is not a chapter with a number or an annex
+ match = re.match(filename_regex, filename)
+ if match:
+ new_filename = f"{match.group(1)}.{postfix}"
+ else: # postfix == "html"
+ soup = BeautifulSoup(text, "html.parser")
+
+ if soup.find("div", class_="ZA"):
+ new_filename = f"front-page.{postfix}"
+ else:
+ title_tag = soup.find("h1", id=True)
+ title = title_tag.get_text()
+ if title.startswith("Annex"):
+ annex_number = title.split()[1].lower().replace(":", "")
+ new_filename = f"annex-{annex_number}.{postfix}"
+ else:
+ header_regex = r"^(\d+)\s"
+ match = re.match(header_regex, title)
+ if match:
+ chapter_number = match.group(1)
+ if postfix == "md" and chapter_number == "1":
+ new_filename = f"{SCOPE}.{postfix}"
+ elif postfix == "md" and chapter_number == "2":
+ new_filename = f"{REFS}.{postfix}"
+ elif postfix == "md" and chapter_number == "3":
+ new_filename = f"{DEFS}.{postfix}"
+ else:
+ new_filename = f"clause-{chapter_number}.{postfix}"
+ else:
+ # it is a clause without a number, just use the filename without the leading number
+ new_filename = re.sub(r"^\d+-", "", filename).lower()
+ return new_filename
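+
+# Illustrative renamings: a Markdown file beginning with "# Annex A (normative): Title" becomes
+# "annex-a.md", while one beginning with "# 4 Overview" becomes "clause-4.md".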
+
+
+def get_dirty_filenames_mapping_with_expected_filenames(dir_path: str):
+ """
+ Returns a mapping of dirty filenames to their expected clean counterparts
+ """
+ mapping = {}
+ filenames = sorted(
+ [
+ f
+ for f in os.listdir(dir_path)
+ if f.endswith(".html") and not f == "index.html"
+ ],
+ key=get_prefix_of_pandoc_html_generated_file,
+ )
+
+ for filename in filenames:
+ with open(os.path.join(dir_path, filename), "r", encoding="utf-8") as file:
+ expected_filename = apply_renaming_logic(file.read(), filename, "html")
+ mapping[filename] = expected_filename
+
+ return mapping
+
+
+def order_local_files_by_prefix(
+ src: str,
+ excluded_files: list[str],
+ type: Literal["html", "md"],
+ clauses: list[str] = DEFAULT_CLAUSES,
+ annexes: list[str] = DEFAULT_ANNEXES,
+):
+ if type not in ["html", "md"]:
+ raise ValueError(f"Invalid type: {type}")
+
+ original_files = [
+ file
+ for file in os.listdir(src)
+ if "--preprocessed--" in file
+ and file.endswith(f".{type}")
+ and file not in excluded_files
+ ]
+
+ files_order, _, _ = get_file_order(src, clauses, annexes)
+
+ ordered_files = []
+ for prefix in files_order:
+ matching_files = [
+ filename
+ for filename in original_files
+ if re.match(rf"--preprocessed--{re.escape(prefix)}[.-]", filename)
+ ]
+ ordered_files.extend(matching_files)
+ return ordered_files
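+
+# Illustrative: given the prefix order ["clause-4", "annex-a"], files named
+# "--preprocessed--clause-4.html" and "--preprocessed--annex-a.html" are returned in that order.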
+
+
+def create_consolidated_file(src: str, consolidated_file_path: str, files: list[str]):
+ """
+ Creates a consolidated file from the provided list of files.
+
+ ### Arguments
+ - `src`: The source directory containing the files to consolidate
+ - `consolidated_file_path`: The path to the consolidated file to create
+ - `files`: A list of file names to consolidate
+
+ ### Returns
+ - Nothing; the consolidated contents are written to `consolidated_file_path`
+ """
+
+ consolidated_contents = ""
+
+ # Create a temporary file with the consolidated contents
+ for file in files:
+ html_file_path = os.path.join(src, file)
+ with open(html_file_path, "r", encoding="utf-8") as html:
+ consolidated_contents += f"{html.read()}\n"
+
+ with open(consolidated_file_path, "w", encoding="utf-8") as c_html:
+ c_html.write(consolidated_contents)
+
+
+# endregion
+
+
+# region BeautifulSoup Type Checkers
+def is_whitespace_navstr(element):
+ """
+ ### Description
+ Returns `True` if `element` is an instance of a `NavigableString` composed of nothing but whitespace, otherwise returns `False`.
+ """
+ if isinstance(element, NavigableString) and not element.strip():
+ return True
+
+ return False
+
+
+# endregion
+
+
+# region Other helpers
+def get_consolidated_md_path(src: str):
+ """
+ ### Description
+ Returns the path at which to store/from which to retrieve the consolidated Markdown file
+
+ ### Arguments
+ `src`: The source directory containing the Markdown files to consolidate
+ """
+ return os.path.join(src, CONSOLIDATED_MD_NAME)
+
+
+def get_consolidated_html_path(src: str):
+ """
+ ### Description
+ Returns the path at which to store/from which to retrieve the consolidated HTML file
+
+ ### Arguments
+ `src`: The source directory containing the HTML files to consolidate
+ """
+ return os.path.join(src, CONSOLIDATED_HTML_NAME)
+
+
+def get_css_classes(
+ classes_to_get: Literal["all", "empty", "monospaced", "bullets"],
+ css_files: list[str],
+):
+ """
+ Returns a sorted list of CSS classes from the CSS source according to some restriction defined by `classes_to_get`.
+ ### Arguments
+ - `classes_to_get`: The desired subgroup of classes
+ - `css_files`: A list of relative or absolute paths to CSS source files
+
+ #### Values for `classes_to_get`
+ - `"all"`
+ The sorted list will contain every class name
+ - `"empty"`
+ The sorted list will contain only classes that contain no properties. Used during cleanup when converting a "dirty" HTML to Markdown.
+ - `"monospaced"`
+ The sorted list will contain only classes that apply monospaced fonts to the text.
+ ##### Monospaced fonts:
+ - Courier New
+ - Roboto Mono
+ - Monospace
+ """
+
+ def get_regex():
+ """Returns the regex corresponding to the type of classes to get defined by `classes_to_get`"""
+ match classes_to_get:
+ case "all":
+ return r"\.([\w\_\-]+)\b"
+
+ case "empty":
+ return r"\.([\w\-_]+?)\s*\{\s*\}"
+
+ case "monospaced":
+ monospaced_fonts = {"Courier New", "Roboto Mono", "Monospace"}
+ monospaced_fonts = "|".join(
+ [re.escape(font) for font in monospaced_fonts]
+ )
+ monospaced_class_regex = (
+ rf"\.([\w\-_]+)\s*{{[^}}]*font-family\s*:\s*[^;{{}}]*"
+ rf"(?:{monospaced_fonts})[^;{{}}]*;"
+ )
+ return monospaced_class_regex
+
+ case "bullets":
+ return r"\.(B\d\w*?)\s*\{[^\}]+?\}"
+
+ case _:
+ return
+
+ regex = get_regex()
+ classes = set()
+
+ for css_file in css_files:
+ if os.path.exists(css_file) and css_file != "customCSS.css":
+ with open(css_file, "r", encoding="utf-8") as f:
+ css_contents = f.read()
+ classes.update(re.findall(regex, css_contents))
+
+ if classes_to_get == "empty" and ABBREVIATION_CLASS in classes:
+ classes.remove(ABBREVIATION_CLASS)
+
+ return sorted(classes)
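+
+# Illustrative behaviour against a stylesheet containing `.EX { font-family: "Courier New"; }` and
+# `.ZZ { }`: "all" yields ["EX", "ZZ"], "monospaced" yields ["EX"], and "empty" yields ["ZZ"].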
+
+
+def get_bold_italic_underline_css_classes(css_styles_src: str):
+ """Return three separate sets for CSS classes that apply bolding, italicization, and underlining"""
+ sheet: CSSStyleSheet = cssutils.parseFile(css_styles_src)
+ bold = set()
+ italic = set()
+ underline = set()
+
+ for rule in sheet:
+ if rule.type != rule.STYLE_RULE:
+ continue
+
+ rule: CSSStyleRule
+
+ for selector in rule.selectorList:
+ classname = selector.selectorText.lstrip(".")
+
+ style: CSSStyleDeclaration = rule.style
+
+ if style.getPropertyValue("font-weight").lower() == "bold":
+ bold.add(classname)
+
+ if style.getPropertyValue("font-style").lower() == "italic":
+ italic.add(classname)
+
+ if style.getPropertyValue("text-decoration").lower() == "underline":
+ underline.add(classname)
+
+ return bold, italic, underline
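+
+# e.g. a rule `.B1 { font-weight: bold; font-style: italic; }` puts "B1" in both the bold and italic sets.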
+
+
+def combine_biu_classes():
+ """Combines bold and italic classes in the separate lists from `get_bold_italic_underline_css_classes` into a single list"""
+ CSS_BOLD, CSS_ITALIC, _ = get_bold_italic_underline_css_classes(PERM_CSS_FILE)
+
+ return set(cls for group in [CSS_BOLD, CSS_ITALIC] for cls in group)
+
+def int_to_letter(number: int):
+ """Converts a 1-based integer to a corresponding uppercase letter (A=1, B=2, etc.)"""
+ if 1 <= number <= 26:
+ return chr(number + 64) # 65 is the ASCII code for 'A'
+ raise ValueError("Number must be between 1 and 26.")
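+
+# Example: int_to_letter(1) == "A", int_to_letter(26) == "Z"; int_to_letter(0) raises ValueError.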
+
+# endregion
diff --git a/md_to_docx_converter/template.html b/md_to_docx_converter/template.html
new file mode 100644
index 0000000000000000000000000000000000000000..03e658d38caabe5d2467957d2b52706c8da0ced5
--- /dev/null
+++ b/md_to_docx_converter/template.html
@@ -0,0 +1,283 @@
+<!DOCTYPE html>
+<html xmlns="http://www.w3.org/1999/xhtml" lang="$lang$">
+
+<head>
+ <meta charset="utf-8" />
+ <meta name="generator" content="pandoc" />
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+ $for(author-meta)$
+ <meta name="author" content="$author-meta$" />
+ $endfor$ $if(date-meta)$
+ <meta name="dcterms.date" content="$date-meta$" />
+ $endif$ $if(keywords)$
+ <meta name="keywords" content="$for(keywords)$$keywords$$sep$, $endfor$" />
+ $endif$ $if(description-meta)$
+ <meta name="description" content="$description-meta$" />
+ $endif$
+ <title>$if(title-prefix)$$title-prefix$ – $endif$$pagetitle$</title>
+
+ $for(css)$
+ <link rel="stylesheet" href="$css$" />
+ $endfor$ $for(header-includes)$ $header-includes$ $endfor$ $if(math)$
+ $if(mathjax)$
+ <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
+ $endif$ $math$ $endif$
+</head>
+
+<body>
+ $for(include-before)$ $include-before$ $endfor$ $if(title)$
+ <header id="title-block-header">
+ <h1 class="title">$title$</h1>
+ $if(subtitle)$
+ <p class="subtitle">$subtitle$</p>
+ $endif$ $for(author)$
+ <p class="author">$author$</p>
+ $endfor$ $if(date)$
+ <p class="date">$date$</p>
+ $endif$ $if(abstract)$
+ <div class="abstract">
+ <div class="abstract-title">$abstract-title$</div>
+ $abstract$
+ </div>
+ $endif$
+ </header>
+ $endif$
+
+ <main>
+ $body$ $for(include-after)$ $include-after$ $endfor$
+ </main>
+</body>
+
+</html>
diff --git a/md_to_docx_converter/templates/README.md b/md_to_docx_converter/templates/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e09d9180565a445d8656b4c4b30004bc6c160bb5
--- /dev/null
+++ b/md_to_docx_converter/templates/README.md
@@ -0,0 +1,7 @@
+# Templates
+
+## Contents
+
+**[JSON templates](./json/)**
+
+**[Document skeletons](./document_skeleton/)**
\ No newline at end of file
diff --git a/md_to_docx_converter/templates/document_skeleton/README.md b/md_to_docx_converter/templates/document_skeleton/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..042c7147cfc4a986a316190311a52c1e3ae93df5
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/README.md
@@ -0,0 +1,20 @@
+# Document Skeletons
+
+This folder contains skeletons of files required for creating new ETSI deliverables in Markdown. The files are divided based on the pertinent deliverable type, if applicable, or categorized as common if they are shared among all ETSI deliverable types.
+
+### Supported Deliverable Types
+
+- Group Specification
+- Group Report
+- Technical Specification
+- Technical Report
+
+### Necessary files
+
+**Group Specification**: [common files](./common_files/) and [group specification files](./group_specification/)
+
+**Group Report**: [common files](./common_files/) and [group report files](./group_report/)
+
+**Technical Specification**: [common files](./common_files/) and [technical specification files](./technical_specification/)
+
+**Technical Report**: [common files](./common_files/) and [technical report files](./technical_report/)
\ No newline at end of file
diff --git a/md_to_docx_converter/templates/document_skeleton/common_files/annex-a.md b/md_to_docx_converter/templates/document_skeleton/common_files/annex-a.md
new file mode 100644
index 0000000000000000000000000000000000000000..d7e4612aaabfd3b452c585261da7eaf4f313aef6
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/common_files/annex-a.md
@@ -0,0 +1,2 @@
+# Annex A (normative or informative): Title of annex
+
diff --git a/md_to_docx_converter/templates/document_skeleton/common_files/annex-b.md b/md_to_docx_converter/templates/document_skeleton/common_files/annex-b.md
new file mode 100644
index 0000000000000000000000000000000000000000..cd33ea6bfc921d9b3f8c5753a785ef8bc7e441f6
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/common_files/annex-b.md
@@ -0,0 +1,6 @@
+# Annex B (normative or informative): Title of annex
+
+## B.1 First clause of the annex
+
+### B.1.1 First subdivided clause of the annex
+
diff --git a/md_to_docx_converter/templates/document_skeleton/common_files/clause-4.md b/md_to_docx_converter/templates/document_skeleton/common_files/clause-4.md
new file mode 100644
index 0000000000000000000000000000000000000000..706532535d603c59b39363ca97896be99fd3fd12
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/common_files/clause-4.md
@@ -0,0 +1 @@
+# 4 Here put your content
\ No newline at end of file
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/annex-bibliography.md b/md_to_docx_converter/templates/document_skeleton/group_report/annex-bibliography.md
new file mode 100644
index 0000000000000000000000000000000000000000..5eedc212f48490bf068aa8977e531aa0bb91d2cb
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/annex-bibliography.md
@@ -0,0 +1,4 @@
+# Annex (informative): Bibliography
+
+-
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/annex-change_history.md b/md_to_docx_converter/templates/document_skeleton/group_report/annex-change_history.md
new file mode 100644
index 0000000000000000000000000000000000000000..4c71836f22bc157101f448a4d40be7b74af9ff26
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/annex-change_history.md
@@ -0,0 +1,12 @@
+# Annex (informative): Change history
+
+::: TAL
++-----------------------+-----------------------+------------------------------------------+
+| Date                  | Version               | Information about changes                |
++=======================+=======================+==========================================+
+|                       | {TAC}                 |                                          |
+|                       |                       |                                          |
+|                       | <#>                   |                                          |
++-----------------------+-----------------------+------------------------------------------+
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/clause-1_Scope.md b/md_to_docx_converter/templates/document_skeleton/group_report/clause-1_Scope.md
new file mode 100644
index 0000000000000000000000000000000000000000..9ac75138f412612e5400121caf403532354baef1
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/clause-1_Scope.md
@@ -0,0 +1,4 @@
+# 1 Scope
+
+The present document ...
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/clause-2_References.md b/md_to_docx_converter/templates/document_skeleton/group_report/clause-2_References.md
new file mode 100644
index 0000000000000000000000000000000000000000..abbd5028ddc07a4c65e085ec2cf9a7c56af03c60
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/clause-2_References.md
@@ -0,0 +1,24 @@
+# 2 References
+
+## 2.1 Normative references
+
+Normative references are not applicable in the present document.
+
+## 2.2 Informative references
+
+References are either specific (identified by date of publication and/or edition number or version number) or non-specific. For specific references, only the cited version applies. For non-specific references, the latest version of the referenced document (including any amendments) applies.
+
+>>> [!note] NOTE:
+
+While any hyperlinks included in this clause were valid at the time of publication, ETSI cannot guarantee their long-term validity.
+
+>>>
+
+The following referenced documents may be useful in implementing an ETSI deliverable or add to the reader's understanding, but are not required for conformance to the present document.
+
+::: REFS
+[i.1] Reference body
+
+[i.2] Reference body
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/clause-3_Definitions.md b/md_to_docx_converter/templates/document_skeleton/group_report/clause-3_Definitions.md
new file mode 100644
index 0000000000000000000000000000000000000000..2b562a5c8a6297b506a3d2f68e806d949b4809dc
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/clause-3_Definitions.md
@@ -0,0 +1,20 @@
+# 3 Definition of terms, symbols and abbreviations
+
+## 3.1 Terms
+
+For the purposes of the present document, the [following] terms [given in ... and the following] apply:
+
+## 3.2 Symbols
+
+For the purposes of the present document, the [following] symbols [given in ... and the following] apply:
+
+## 3.3 Abbreviations
+
+For the purposes of the present document, the [following] abbreviations [given in ... and the following] apply:
+
+::: EW
+ABBR Abbreviation
+
+2ABR Second Abbreviation
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/executive-summary.md b/md_to_docx_converter/templates/document_skeleton/group_report/executive-summary.md
new file mode 100644
index 0000000000000000000000000000000000000000..d8b749a65f8743511a5eec02147eacfb9c7eb6d9
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/executive-summary.md
@@ -0,0 +1,2 @@
+# Executive summary
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/foreword.md b/md_to_docx_converter/templates/document_skeleton/group_report/foreword.md
new file mode 100644
index 0000000000000000000000000000000000000000..2d19ca8560a174a2bb9cacc9a4a19100e5515cde
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/foreword.md
@@ -0,0 +1,4 @@
+# Foreword
+
+This Group Report (GR) has been produced by ETSI Industry Specification Group \<long ISGname\> (\<short ISGname\>).
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/front-page.md b/md_to_docx_converter/templates/document_skeleton/group_report/front-page.md
new file mode 100644
index 0000000000000000000000000000000000000000..15fbd9b0da94fa5215cd6ce69151430f6348755f
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/front-page.md
@@ -0,0 +1,88 @@
+*Disclaimer*
+
+The present document has been produced and approved by the \<long ISGname\> (\<short ISGname\>) ETSI Industry Specification Group (ISG) and represents the views of those members who participated in this ISG. It does not necessarily represent the views of the entire ETSI membership.
+
+::: ZA
+[ETSI GR LLL-LLL DD]{.ondemand_CHAR_size_32}Vm.t.e[(yyyy-mm)]{.ondemand_CHAR_size_16}
+:::
+
+::: ZB
+**Group REPORT**
+:::
+
+::: ZT
+Title;
+:::
+
+::: ZT
+Part #: Part element of title;
+:::
+
+::: ZT
+Sub-part #: Sub-part element of title
+:::
+
+::: ZT
+Release \#
+:::
+
+::: ZT
+*Should you need a step-by-step guide for drafting an ETSI deliverable, please consult the "Principles for Drafting ETSI Deliverables" document. Otherwise you may contact us at edithelp@etsi.org.*
+:::
+
+::: TAC
+Reference
+
+
+
+Keywords
+
+
+
+*ETSI*
+
+650 Route des Lucioles
+
+F-06921 Sophia Antipolis Cedex - FRANCE
+
+Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16
+
+[Siret N° 348 623 562 00017 - APE 7112B]{.ondemand_CHAR_name_Arial_size_7}
+
+[Association à but non lucratif enregistrée à la]{.ondemand_CHAR_name_Arial_size_7}
+
+[Sous-préfecture de Grasse (06) N° w061004871]{.ondemand_CHAR_name_Arial_size_7}
+
+***Important notice***
+
+The present document can be downloaded from the [[ETSI Search & Browse Standards]{.underline}](https://www.etsi.org/standards-search) application.
+
+The present document may be made available in electronic versions and/or in print. The content of any electronic and/or print versions of the present document shall not be modified without the prior written authorization of ETSI. In case of any existing or perceived difference in contents between such versions and/or in print, the prevailing version of an ETSI deliverable is the one made publicly available in PDF format on [[ETSI deliver]{.underline}](http://www.etsi.org/deliver) repository.
+
+Users should be aware that the present document may be revised or have its status changed; this information is available in the [[Milestones listing]{.underline}](https://portal.etsi.org/Services/editHelp/Standards-development/Tracking-a-draft/Status-codes)[.]{.ondemand_CHAR_name_Arial_size_9_color_0000FF}
+
+If you find errors in the present document, please send your comments to the relevant service listed under [[Committee Support Staff]{.underline}](https://portal.etsi.org/People/Commitee-Support-Staff).
+
+If you find a security vulnerability in the present document, please report it through our [Coordinated Vulnerability Disclosure (CVD)](https://www.etsi.org/standards/coordinated-vulnerability-disclosure) program.
+
+***Notice of disclaimer & limitation of liability***
+
+The information provided in the present deliverable is directed solely to professionals who have the appropriate degree of experience to understand and interpret its content in accordance with generally accepted engineering or other professional standard and applicable regulations.
+
+No recommendation as to products and services or vendors is made or should be implied.
+
+No representation or warranty is made that this deliverable is technically accurate or sufficient or conforms to any law and/or governmental rule and/or regulation and further, no representation or warranty is made of merchantability or fitness for any particular purpose or against infringement of intellectual property rights.
+
+In no event shall ETSI be held liable for loss of profits or any other incidental or consequential damages.
+
+Any software contained in this deliverable is provided "AS IS" with no warranties, express or implied, including but not limited to, the warranties of merchantability, fitness for a particular purpose and non-infringement of intellectual property rights and ETSI shall not be held liable in any event for any damages whatsoever (including, without limitation, damages for loss of profits, business interruption, loss of information, or any other pecuniary loss) arising out of or related to the use of or inability to use the software.
+
+***Copyright Notification***
+
+No part may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm except as authorized by written permission of ETSI. The content of the PDF version shall not be modified without the written authorization of ETSI. The copyright and the foregoing restriction extend to reproduction in all media.
+
+© ETSI yyyy.
+
+All rights reserved.
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/history.md b/md_to_docx_converter/templates/document_skeleton/group_report/history.md
new file mode 100644
index 0000000000000000000000000000000000000000..88870d89fd70a71c3fe516723f8f61c176a1128e
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/history.md
@@ -0,0 +1,12 @@
+# History
+
+::: TAL
++:-----------------------:+:-----------------------:+:-----------------------:+
+| Document History                                                            |
++=========================+=========================+=========================+
+|                         |                         |                         |
++-------------------------+-------------------------+-------------------------+
+:::
+
+*Latest changes made on YYYY-MM-DD*
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/intellectual-property-rights.md b/md_to_docx_converter/templates/document_skeleton/group_report/intellectual-property-rights.md
new file mode 100644
index 0000000000000000000000000000000000000000..5848ab7e258d944a022528037b6b3e217be3ffbf
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/intellectual-property-rights.md
@@ -0,0 +1,18 @@
+# Intellectual Property Rights
+
+::: H6
+Essential patents
+:::
+
+IPRs essential or potentially essential to normative deliverables may have been declared to ETSI. The declarations pertaining to these essential IPRs, if any, are publicly available for **ETSI members and non-members**, and can be found in ETSI SR 000 314: *"Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in respect of ETSI standards"*, which is available from the ETSI Secretariat. Latest updates are available on the [ETSI IPR online database](https://ipr.etsi.org/).
+
+Pursuant to the ETSI Directives including the ETSI IPR Policy, no investigation regarding the essentiality of IPRs, including IPR searches, has been carried out by ETSI. No guarantee can be given as to the existence of other IPRs not referenced in ETSI SR 000 314 (or the updates on the ETSI Web server) which are, or may be, or may become, essential to the present document.
+
+::: H6
+Trademarks
+:::
+
+The present document may include trademarks and/or tradenames which are asserted and/or registered by their owners. ETSI claims no ownership of these except for any which are indicated as being the property of ETSI, and conveys no right to use or reproduce any trademark and/or tradename. Mention of those trademarks in the present document does not constitute an endorsement by ETSI of products, services or organizations associated with those trademarks.
+
+**DECT™**, **PLUGTESTS™**, **UMTS™** and the ETSI logo are trademarks of ETSI registered for the benefit of its Members. **3GPP™**, **LTE™** and **5G™** logo are trademarks of ETSI registered for the benefit of its Members and of the 3GPP Organizational Partners. **oneM2M™** logo is a trademark of ETSI registered for the benefit of its Members and of the oneM2M Partners. **GSM**® and the GSM logo are trademarks registered and owned by the GSM Association.
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/introduction.md b/md_to_docx_converter/templates/document_skeleton/group_report/introduction.md
new file mode 100644
index 0000000000000000000000000000000000000000..3d07efe555d3ce74da304571d46338266d5f860e
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/introduction.md
@@ -0,0 +1,2 @@
+# Introduction
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_report/modal-verbs-terminology.md b/md_to_docx_converter/templates/document_skeleton/group_report/modal-verbs-terminology.md
new file mode 100644
index 0000000000000000000000000000000000000000..fa21889aa4f651b4dc25a677f4b08d63270ad417
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_report/modal-verbs-terminology.md
@@ -0,0 +1,6 @@
+# Modal verbs terminology
+
+In the present document "**should**", "**should not**", "**may**", "**need not**", "**will**", "**will not**", "**can**" and "**cannot**" are to be interpreted as described in clause 3.2 of the [ETSI Drafting Rules](https://portal.etsi.org/Services/editHelp!/Howtostart/ETSIDraftingRules.aspx) (Verbal forms for the expression of provisions).
+
+"**must**" and "**must not**" are **NOT** allowed in ETSI deliverables except when used in direct citation.
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/annex-bibliography.md b/md_to_docx_converter/templates/document_skeleton/group_specification/annex-bibliography.md
new file mode 100644
index 0000000000000000000000000000000000000000..5eedc212f48490bf068aa8977e531aa0bb91d2cb
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/annex-bibliography.md
@@ -0,0 +1,4 @@
+# Annex (informative): Bibliography
+
+-
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/annex-change_history.md b/md_to_docx_converter/templates/document_skeleton/group_specification/annex-change_history.md
new file mode 100644
index 0000000000000000000000000000000000000000..4c71836f22bc157101f448a4d40be7b74af9ff26
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/annex-change_history.md
@@ -0,0 +1,12 @@
+# Annex (informative): Change history
+
+::: TAL
++-----------------------+-----------------------+------------------------------------------+
+| Date                  | Version               | Information about changes                |
++=======================+=======================+==========================================+
+|                       | {TAC}                 |                                          |
+|                       |                       |                                          |
+|                       | <#>                   |                                          |
++-----------------------+-----------------------+------------------------------------------+
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/clause-1_Scope.md b/md_to_docx_converter/templates/document_skeleton/group_specification/clause-1_Scope.md
new file mode 100644
index 0000000000000000000000000000000000000000..9ac75138f412612e5400121caf403532354baef1
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/clause-1_Scope.md
@@ -0,0 +1,4 @@
+# 1 Scope
+
+The present document ...
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/clause-2_References.md b/md_to_docx_converter/templates/document_skeleton/group_specification/clause-2_References.md
new file mode 100644
index 0000000000000000000000000000000000000000..7711f3e1f631bd75a0ff30cd32725c9d19ec69f1
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/clause-2_References.md
@@ -0,0 +1,40 @@
+# 2 References
+
+## 2.1 Normative references
+
+References are either specific (identified by date of publication and/or edition number or version number) or non-specific. For specific references, only the cited version applies. For non-specific references, the latest version of the referenced document (including any amendments) applies.
+
+Referenced documents which are not found to be publicly available in the expected location might be found in the [ETSI docbox](https://docbox.etsi.org/Reference/).
+
+>>> [!note] NOTE:
+
+While any hyperlinks included in this clause were valid at the time of publication, ETSI cannot guarantee their long-term validity.
+
+>>>
+
+The following referenced documents are necessary for the application of the present document.
+
+::: REFS
+[n.1] Reference body
+
+[n.2] Reference body
+:::
+
+## 2.2 Informative references
+
+References are either specific (identified by date of publication and/or edition number or version number) or non-specific. For specific references, only the cited version applies. For non-specific references, the latest version of the referenced document (including any amendments) applies.
+
+>>> [!note] NOTE:
+
+While any hyperlinks included in this clause were valid at the time of publication, ETSI cannot guarantee their long-term validity.
+
+>>>
+
+The following referenced documents may be useful in implementing an ETSI deliverable or add to the reader's understanding, but are not required for conformance to the present document.
+
+::: REFS
+[i.1] Reference body
+
+[i.2] Reference body
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/clause-3_Definitions.md b/md_to_docx_converter/templates/document_skeleton/group_specification/clause-3_Definitions.md
new file mode 100644
index 0000000000000000000000000000000000000000..2b562a5c8a6297b506a3d2f68e806d949b4809dc
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/clause-3_Definitions.md
@@ -0,0 +1,20 @@
+# 3 Definition of terms, symbols and abbreviations
+
+## 3.1 Terms
+
+For the purposes of the present document, the [following] terms [given in ... and the following] apply:
+
+## 3.2 Symbols
+
+For the purposes of the present document, the [following] symbols [given in ... and the following] apply:
+
+## 3.3 Abbreviations
+
+For the purposes of the present document, the [following] abbreviations [given in ... and the following] apply:
+
+::: EW
+ABBR Abbreviation
+
+2ABR Second Abbreviation
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/executive-summary.md b/md_to_docx_converter/templates/document_skeleton/group_specification/executive-summary.md
new file mode 100644
index 0000000000000000000000000000000000000000..d8b749a65f8743511a5eec02147eacfb9c7eb6d9
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/executive-summary.md
@@ -0,0 +1,2 @@
+# Executive summary
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/foreword.md b/md_to_docx_converter/templates/document_skeleton/group_specification/foreword.md
new file mode 100644
index 0000000000000000000000000000000000000000..42511b6462eceed58bcaf905d86470148684d13b
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/foreword.md
@@ -0,0 +1,4 @@
+# Foreword
+
+This Group Specification (GS) has been produced by ETSI Industry Specification Group \<long ISGname\> (\<short ISGname\>).
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/front-page.md b/md_to_docx_converter/templates/document_skeleton/group_specification/front-page.md
new file mode 100644
index 0000000000000000000000000000000000000000..dd4e1ad7c971fb08072ac8c24f03aa9c8819bd63
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/front-page.md
@@ -0,0 +1,89 @@
+*Disclaimer*
+
+The present document has been produced and approved by the \<long ISGname\> (\<short ISGname\>) ETSI Industry Specification Group (ISG) and represents the views of those members who participated in this ISG. It does not necessarily represent the views of the entire ETSI membership.
+
+::: ZA
+[ETSI GS LLL-LLL DDD ]{.ondemand_CHAR_size_32}Vm.t.e[(yyyy-mm)]{.ondemand_CHAR_size_16}
+:::
+
+::: ZB
+***Group Specification***
+:::
+
+::: ZT
+Title;
+:::
+
+::: ZT
+Part #: Part element of title;
+:::
+
+::: ZT
+Sub-part #: Sub-part element of title
+:::
+
+::: ZT
+Release #
+:::
+
+::: ZT
+*Should you need a step-by-step guide for drafting an ETSI deliverable, please consult the \"Principles for Drafting ETSI Deliverables\" document. Otherwise you may contact us at edithelp@etsi.org.*
+:::
+
+::: TAC
+Reference
+
+
+
+Keywords
+
+
+
+*ETSI*
+
+650 Route des Lucioles
+
+F-06921 Sophia Antipolis Cedex - FRANCE
+
+Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16
+
+[Siret N° 348 623 562 00017 - APE 7112B]{.ondemand_CHAR_name_Arial_size_7}
+
+[Association à but non lucratif enregistrée à la]{.ondemand_CHAR_name_Arial_size_7}
+
+[Sous-préfecture de Grasse (06) N° w061004871]{.ondemand_CHAR_name_Arial_size_7}
+
+***Important notice***
+
+The present document can be downloaded from the [ETSI Search & Browse Standards](https://www.etsi.org/standards-search) application.
+
+The present document may be made available in electronic versions and/or in print. The content of any electronic and/or print versions of the present document shall not be modified without the prior written authorization of ETSI. In case of any existing or perceived difference in contents between such versions and/or in print, the prevailing version of an ETSI deliverable is the one made publicly available in PDF format on [ETSI deliver](http://www.etsi.org/deliver) repository.
+
+Users should be aware that the present document may be revised or have its status changed, this information is available in the [Milestones listing](https://portal.etsi.org/Services/editHelp/Standards-development/Tracking-a-draft/Status-codes).
+
+If you find errors in the present document, please send your comments to the relevant service listed under [Committee Support Staff](https://portal.etsi.org/People/Commitee-Support-Staff).
+:::
+
+If you find a security vulnerability in the present document, please report it through our [Coordinated Vulnerability Disclosure (CVD)](https://www.etsi.org/standards/coordinated-vulnerability-disclosure) program.
+
+***Notice of disclaimer & limitation of liability***
+
+The information provided in the present deliverable is directed solely to professionals who have the appropriate degree of experience to understand and interpret its content in accordance with generally accepted engineering or other professional standard and applicable regulations.
+
+No recommendation as to products and services or vendors is made or should be implied.
+
+No representation or warranty is made that this deliverable is technically accurate or sufficient or conforms to any law and/or governmental rule and/or regulation and further, no representation or warranty is made of merchantability or fitness for any particular purpose or against infringement of intellectual property rights.
+
+In no event shall ETSI be held liable for loss of profits or any other incidental or consequential damages.
+
+Any software contained in this deliverable is provided "AS IS" with no warranties, express or implied, including but not limited to, the warranties of merchantability, fitness for a particular purpose and non-infringement of intellectual property rights and ETSI shall not be held liable in any event for any damages whatsoever (including, without limitation, damages for loss of profits, business interruption, loss of information, or any other pecuniary loss) arising out of or related to the use of or inability to use the software.
+
+***Copyright Notification***
+
+No part may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm except as authorized by written permission of ETSI. The content of the PDF version shall not be modified without the written authorization of ETSI. The copyright and the foregoing restriction extend to reproduction in all media.
+
+© ETSI yyyy.
+
+All rights reserved.
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/history.md b/md_to_docx_converter/templates/document_skeleton/group_specification/history.md
new file mode 100644
index 0000000000000000000000000000000000000000..88870d89fd70a71c3fe516723f8f61c176a1128e
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/history.md
@@ -0,0 +1,12 @@
+# History
+
+::: TAL
++:-----------------------:+:-----------------------:+:-----------------------:+
+| Document History                                                            |
++=========================+=========================+=========================+
+|                         |                         |                         |
++-------------------------+-------------------------+-------------------------+
+:::
+
+*Latest changes made on YYYY-MM-DD*
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/intellectual-property-rights.md b/md_to_docx_converter/templates/document_skeleton/group_specification/intellectual-property-rights.md
new file mode 100644
index 0000000000000000000000000000000000000000..5848ab7e258d944a022528037b6b3e217be3ffbf
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/intellectual-property-rights.md
@@ -0,0 +1,18 @@
+# Intellectual Property Rights
+
+::: H6
+Essential patents
+:::
+
+IPRs essential or potentially essential to normative deliverables may have been declared to ETSI. The declarations pertaining to these essential IPRs, if any, are publicly available for **ETSI members and non-members**, and can be found in ETSI SR 000 314: *"Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in respect of ETSI standards"*, which is available from the ETSI Secretariat. Latest updates are available on the [ETSI IPR online database](https://ipr.etsi.org/).
+
+Pursuant to the ETSI Directives including the ETSI IPR Policy, no investigation regarding the essentiality of IPRs, including IPR searches, has been carried out by ETSI. No guarantee can be given as to the existence of other IPRs not referenced in ETSI SR 000 314 (or the updates on the ETSI Web server) which are, or may be, or may become, essential to the present document.
+
+::: H6
+Trademarks
+:::
+
+The present document may include trademarks and/or tradenames which are asserted and/or registered by their owners. ETSI claims no ownership of these except for any which are indicated as being the property of ETSI, and conveys no right to use or reproduce any trademark and/or tradename. Mention of those trademarks in the present document does not constitute an endorsement by ETSI of products, services or organizations associated with those trademarks.
+
+**DECT™**, **PLUGTESTS™**, **UMTS™** and the ETSI logo are trademarks of ETSI registered for the benefit of its Members. **3GPP™**, **LTE™** and **5G™** logo are trademarks of ETSI registered for the benefit of its Members and of the 3GPP Organizational Partners. **oneM2M™** logo is a trademark of ETSI registered for the benefit of its Members and of the oneM2M Partners. **GSM**® and the GSM logo are trademarks registered and owned by the GSM Association.
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/introduction.md b/md_to_docx_converter/templates/document_skeleton/group_specification/introduction.md
new file mode 100644
index 0000000000000000000000000000000000000000..3d07efe555d3ce74da304571d46338266d5f860e
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/introduction.md
@@ -0,0 +1,2 @@
+# Introduction
+
diff --git a/md_to_docx_converter/templates/document_skeleton/group_specification/modal-verbs-terminology.md b/md_to_docx_converter/templates/document_skeleton/group_specification/modal-verbs-terminology.md
new file mode 100644
index 0000000000000000000000000000000000000000..fa21889aa4f651b4dc25a677f4b08d63270ad417
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/group_specification/modal-verbs-terminology.md
@@ -0,0 +1,6 @@
+# Modal verbs terminology
+
+In the present document "**should**", "**should not**", "**may**", "**need not**", "**will**", "**will not**", "**can**" and "**cannot**" are to be interpreted as described in clause 3.2 of the [ETSI Drafting Rules](https://portal.etsi.org/Services/editHelp!/Howtostart/ETSIDraftingRules.aspx) (Verbal forms for the expression of provisions).
+
+"**must**" and "**must not**" are **NOT** allowed in ETSI deliverables except when used in direct citation.
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/annex-bibliography.md b/md_to_docx_converter/templates/document_skeleton/technical_report/annex-bibliography.md
new file mode 100644
index 0000000000000000000000000000000000000000..5eedc212f48490bf068aa8977e531aa0bb91d2cb
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/annex-bibliography.md
@@ -0,0 +1,4 @@
+# Annex (informative): Bibliography
+
+-
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/annex-change_history.md b/md_to_docx_converter/templates/document_skeleton/technical_report/annex-change_history.md
new file mode 100644
index 0000000000000000000000000000000000000000..4c71836f22bc157101f448a4d40be7b74af9ff26
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/annex-change_history.md
@@ -0,0 +1,12 @@
+# Annex (informative): Change history
+
+::: TAL
++-----------------------+-----------------------+------------------------------------------+
+| Date                  | Version               | Information about changes                |
++=======================+=======================+==========================================+
+|                       | {TAC}                 |                                          |
+|                       |                       |                                          |
+|                       | <#>                   |                                          |
++-----------------------+-----------------------+------------------------------------------+
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/clause-1_Scope.md b/md_to_docx_converter/templates/document_skeleton/technical_report/clause-1_Scope.md
new file mode 100644
index 0000000000000000000000000000000000000000..9ac75138f412612e5400121caf403532354baef1
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/clause-1_Scope.md
@@ -0,0 +1,4 @@
+# 1 Scope
+
+The present document ...
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/clause-2_References.md b/md_to_docx_converter/templates/document_skeleton/technical_report/clause-2_References.md
new file mode 100644
index 0000000000000000000000000000000000000000..abbd5028ddc07a4c65e085ec2cf9a7c56af03c60
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/clause-2_References.md
@@ -0,0 +1,24 @@
+# 2 References
+
+## 2.1 Normative references
+
+Normative references are not applicable in the present document.
+
+## 2.2 Informative references
+
+References are either specific (identified by date of publication and/or edition number or version number) or non-specific. For specific references, only the cited version applies. For non-specific references, the latest version of the referenced document (including any amendments) applies.
+
+>>> [!note] NOTE:
+
+While any hyperlinks included in this clause were valid at the time of publication, ETSI cannot guarantee their long-term validity.
+
+>>>
+
+The following referenced documents may be useful in implementing an ETSI deliverable or add to the reader's understanding, but are not required for conformance to the present document.
+
+::: REFS
+[i.1] Reference body
+
+[i.2] Reference body
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/clause-3_Definitions.md b/md_to_docx_converter/templates/document_skeleton/technical_report/clause-3_Definitions.md
new file mode 100644
index 0000000000000000000000000000000000000000..2b562a5c8a6297b506a3d2f68e806d949b4809dc
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/clause-3_Definitions.md
@@ -0,0 +1,20 @@
+# 3 Definition of terms, symbols and abbreviations
+
+## 3.1 Terms
+
+For the purposes of the present document, the [following] terms [given in ... and the following] apply:
+
+## 3.2 Symbols
+
+For the purposes of the present document, the [following] symbols [given in ... and the following] apply:
+
+## 3.3 Abbreviations
+
+For the purposes of the present document, the [following] abbreviations [given in ... and the following] apply:
+
+::: EW
+ABBR Abbreviation
+
+2ABR Second Abbreviation
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/executive-summary.md b/md_to_docx_converter/templates/document_skeleton/technical_report/executive-summary.md
new file mode 100644
index 0000000000000000000000000000000000000000..d8b749a65f8743511a5eec02147eacfb9c7eb6d9
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/executive-summary.md
@@ -0,0 +1,2 @@
+# Executive summary
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/foreword.md b/md_to_docx_converter/templates/document_skeleton/technical_report/foreword.md
new file mode 100644
index 0000000000000000000000000000000000000000..beb1afa46eb28416c3a804b699cca3237676163e
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/foreword.md
@@ -0,0 +1,4 @@
+# Foreword
+
+This Technical Report (TR) has been produced by {ETSI Technical Committee|ETSI Project|} ().
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/front-page.md b/md_to_docx_converter/templates/document_skeleton/technical_report/front-page.md
new file mode 100644
index 0000000000000000000000000000000000000000..70e7c5eef2758e3ccd31746bc6a3cc3004196293
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/front-page.md
@@ -0,0 +1,82 @@
+::: ZA
+[ETSI TR 1DD DDD ]{.ondemand_CHAR_size_32}Vm.t.e[(yyyy-mm)]{.ondemand_CHAR_size_16}
+:::
+
+::: ZT
+Title;
+:::
+
+::: ZT
+Part #: Part element of title;
+:::
+
+::: ZT
+Sub-part #: Sub-part element of title
+:::
+
+::: ZT
+Release #
+:::
+
+::: ZT
+*Should you need a step-by-step guide for drafting an ETSI deliverable, please consult the "Principles for Drafting ETSI Deliverables" document. Otherwise you may contact us at edithelp@etsi.org.*
+:::
+
+::: ZB
+**TECHNICAL REPORT**
+:::
+
+::: TAC
+Reference
+
+
+
+Keywords
+
+
+
+*ETSI*
+
+650 Route des Lucioles
+
+F-06921 Sophia Antipolis Cedex - FRANCE
+
+Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16
+
+[Siret N° 348 623 562 00017 - APE 7112B]{.ondemand_CHAR_name_Arial_size_7}
+
+[Association à but non lucratif enregistrée à la]{.ondemand_CHAR_name_Arial_size_7}
+
+[Sous-préfecture de Grasse (06) N° w061004871]{.ondemand_CHAR_name_Arial_size_7}
+
+***Important notice***
+
+The present document can be downloaded from the [ETSI Search & Browse Standards](https://www.etsi.org/standards-search) application.
+
+The present document may be made available in electronic versions and/or in print. The content of any electronic and/or print versions of the present document shall not be modified without the prior written authorization of ETSI. In case of any existing or perceived difference in contents between such versions and/or in print, the prevailing version of an ETSI deliverable is the one made publicly available in PDF format on [ETSI deliver](http://www.etsi.org/deliver) repository.
+
+Users should be aware that the present document may be revised or have its status changed, this information is available in the [Milestones listing](https://portal.etsi.org/Services/editHelp/Standards-development/Tracking-a-draft/Status-codes).
+
+If you find errors in the present document, please send your comments to the relevant service listed under [Committee Support Staff](https://portal.etsi.org/People/Commitee-Support-Staff).
+
+If you find a security vulnerability in the present document, please report it through our [Coordinated Vulnerability Disclosure (CVD)](https://www.etsi.org/standards/coordinated-vulnerability-disclosure) program.
+
+***Notice of disclaimer & limitation of liability***
+
+The information provided in the present deliverable is directed solely to professionals who have the appropriate degree of experience to understand and interpret its content in accordance with generally accepted engineering or other professional standard and applicable regulations.
+
+No recommendation as to products and services or vendors is made or should be implied.
+
+No representation or warranty is made that this deliverable is technically accurate or sufficient or conforms to any law and/or governmental rule and/or regulation and further, no representation or warranty is made of merchantability or fitness for any particular purpose or against infringement of intellectual property rights.
+
+In no event shall ETSI be held liable for loss of profits or any other incidental or consequential damages.
+
+Any software contained in this deliverable is provided "AS IS" with no warranties, express or implied, including but not limited to, the warranties of merchantability, fitness for a particular purpose and non-infringement of intellectual property rights and ETSI shall not be held liable in any event for any damages whatsoever (including, without limitation, damages for loss of profits, business interruption, loss of information, or any other pecuniary loss) arising out of or related to the use of or inability to use the software.
+
+***Copyright Notification***
+
+No part may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm except as authorized by written permission of ETSI. The content of the PDF version shall not be modified without the written authorization of ETSI. The copyright and the foregoing restriction extend to reproduction in all media.
+
+© ETSI yyyy.
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/history.md b/md_to_docx_converter/templates/document_skeleton/technical_report/history.md
new file mode 100644
index 0000000000000000000000000000000000000000..88870d89fd70a71c3fe516723f8f61c176a1128e
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/history.md
@@ -0,0 +1,12 @@
+# History
+
+::: TAL
++:-----------------------:+:-----------------------:+:-----------------------:+
+| Document History                                                            |
++=========================+=========================+=========================+
+|                         |                         |                         |
++-------------------------+-------------------------+-------------------------+
+:::
+
+*Latest changes made on YYYY-MM-DD*
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/intellectual-property-rights.md b/md_to_docx_converter/templates/document_skeleton/technical_report/intellectual-property-rights.md
new file mode 100644
index 0000000000000000000000000000000000000000..5848ab7e258d944a022528037b6b3e217be3ffbf
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/intellectual-property-rights.md
@@ -0,0 +1,18 @@
+# Intellectual Property Rights
+
+::: H6
+Essential patents
+:::
+
+IPRs essential or potentially essential to normative deliverables may have been declared to ETSI. The declarations pertaining to these essential IPRs, if any, are publicly available for **ETSI members and non-members**, and can be found in ETSI SR 000 314: *"Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in respect of ETSI standards"*, which is available from the ETSI Secretariat. Latest updates are available on the [ETSI IPR online database](https://ipr.etsi.org/).
+
+Pursuant to the ETSI Directives including the ETSI IPR Policy, no investigation regarding the essentiality of IPRs, including IPR searches, has been carried out by ETSI. No guarantee can be given as to the existence of other IPRs not referenced in ETSI SR 000 314 (or the updates on the ETSI Web server) which are, or may be, or may become, essential to the present document.
+
+::: H6
+Trademarks
+:::
+
+The present document may include trademarks and/or tradenames which are asserted and/or registered by their owners. ETSI claims no ownership of these except for any which are indicated as being the property of ETSI, and conveys no right to use or reproduce any trademark and/or tradename. Mention of those trademarks in the present document does not constitute an endorsement by ETSI of products, services or organizations associated with those trademarks.
+
+**DECT™**, **PLUGTESTS™**, **UMTS™** and the ETSI logo are trademarks of ETSI registered for the benefit of its Members. **3GPP™**, **LTE™** and **5G™** logo are trademarks of ETSI registered for the benefit of its Members and of the 3GPP Organizational Partners. **oneM2M™** logo is a trademark of ETSI registered for the benefit of its Members and of the oneM2M Partners. **GSM**® and the GSM logo are trademarks registered and owned by the GSM Association.
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/introduction.md b/md_to_docx_converter/templates/document_skeleton/technical_report/introduction.md
new file mode 100644
index 0000000000000000000000000000000000000000..3d07efe555d3ce74da304571d46338266d5f860e
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/introduction.md
@@ -0,0 +1,2 @@
+# Introduction
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_report/modal-verbs-terminology.md b/md_to_docx_converter/templates/document_skeleton/technical_report/modal-verbs-terminology.md
new file mode 100644
index 0000000000000000000000000000000000000000..fa21889aa4f651b4dc25a677f4b08d63270ad417
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_report/modal-verbs-terminology.md
@@ -0,0 +1,6 @@
+# Modal verbs terminology
+
+In the present document "**should**", "**should not**", "**may**", "**need not**", "**will**", "**will not**", "**can**" and "**cannot**" are to be interpreted as described in clause 3.2 of the [ETSI Drafting Rules](https://portal.etsi.org/Services/editHelp!/Howtostart/ETSIDraftingRules.aspx) (Verbal forms for the expression of provisions).
+
+"**must**" and "**must not**" are **NOT** allowed in ETSI deliverables except when used in direct citation.
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/annex-bibliography.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/annex-bibliography.md
new file mode 100644
index 0000000000000000000000000000000000000000..5eedc212f48490bf068aa8977e531aa0bb91d2cb
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/annex-bibliography.md
@@ -0,0 +1,4 @@
+# Annex (informative): Bibliography
+
+-
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/annex-change_history.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/annex-change_history.md
new file mode 100644
index 0000000000000000000000000000000000000000..4c71836f22bc157101f448a4d40be7b74af9ff26
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/annex-change_history.md
@@ -0,0 +1,12 @@
+# Annex (informative): Change history
+
+::: TAL
++-----------------------+-----------------------+------------------------------------------+
+| Date                  | Version               | Information about changes                |
++=======================+=======================+==========================================+
+|                       | {TAC}                 |                                          |
+|                       |                       |                                          |
+|                       | <#>                   |                                          |
++-----------------------+-----------------------+------------------------------------------+
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/clause-1_Scope.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/clause-1_Scope.md
new file mode 100644
index 0000000000000000000000000000000000000000..9ac75138f412612e5400121caf403532354baef1
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/clause-1_Scope.md
@@ -0,0 +1,4 @@
+# 1 Scope
+
+The present document ...
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/clause-2_References.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/clause-2_References.md
new file mode 100644
index 0000000000000000000000000000000000000000..7711f3e1f631bd75a0ff30cd32725c9d19ec69f1
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/clause-2_References.md
@@ -0,0 +1,40 @@
+# 2 References
+
+## 2.1 Normative references
+
+References are either specific (identified by date of publication and/or edition number or version number) or non-specific. For specific references, only the cited version applies. For non-specific references, the latest version of the referenced document (including any amendments) applies.
+
+Referenced documents which are not found to be publicly available in the expected location might be found in the [ETSI docbox](https://docbox.etsi.org/Reference/).
+
+>>> [!note] NOTE:
+
+While any hyperlinks included in this clause were valid at the time of publication, ETSI cannot guarantee their long-term validity.
+
+>>>
+
+The following referenced documents are necessary for the application of the present document.
+
+::: REFS
+[n.1] Reference body
+
+[n.2] Reference body
+:::
+
+## 2.2 Informative references
+
+References are either specific (identified by date of publication and/or edition number or version number) or non-specific. For specific references, only the cited version applies. For non-specific references, the latest version of the referenced document (including any amendments) applies.
+
+>>> [!note] NOTE:
+
+While any hyperlinks included in this clause were valid at the time of publication, ETSI cannot guarantee their long-term validity.
+
+>>>
+
+The following referenced documents may be useful in implementing an ETSI deliverable or add to the reader's understanding, but are not required for conformance to the present document.
+
+::: REFS
+[i.1] Reference body
+
+[i.2] Reference body
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/clause-3_Definitions.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/clause-3_Definitions.md
new file mode 100644
index 0000000000000000000000000000000000000000..2b562a5c8a6297b506a3d2f68e806d949b4809dc
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/clause-3_Definitions.md
@@ -0,0 +1,20 @@
+# 3 Definition of terms, symbols and abbreviations
+
+## 3.1 Terms
+
+For the purposes of the present document, the [following] terms [given in ... and the following] apply:
+
+## 3.2 Symbols
+
+For the purposes of the present document, the [following] symbols [given in ... and the following] apply:
+
+## 3.3 Abbreviations
+
+For the purposes of the present document, the [following] abbreviations [given in ... and the following] apply:
+
+::: EW
+ABBR Abbreviation
+
+2ABR Second Abbreviation
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/executive-summary.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/executive-summary.md
new file mode 100644
index 0000000000000000000000000000000000000000..d8b749a65f8743511a5eec02147eacfb9c7eb6d9
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/executive-summary.md
@@ -0,0 +1,2 @@
+# Executive summary
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/foreword.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/foreword.md
new file mode 100644
index 0000000000000000000000000000000000000000..4d02e164057f711b8863cb40d26383f382c22c0c
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/foreword.md
@@ -0,0 +1,4 @@
+# Foreword
+
+This Technical Specification (TS) has been produced by {ETSI Technical Committee|ETSI Project|} ().
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/front-page.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/front-page.md
new file mode 100644
index 0000000000000000000000000000000000000000..e984af220868ac9bd500cfdbe0ad329fbb33e308
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/front-page.md
@@ -0,0 +1,84 @@
+::: ZA
+[ETSI TS 1DD DDD ]{.ondemand_CHAR_size_32}Vm.t.e[(yyyy-mm)]{.ondemand_CHAR_size_16}
+:::
+
+::: ZT
+Title;
+:::
+
+::: ZT
+Part #: Part element of title;
+:::
+
+::: ZT
+Sub-part #: Sub-part element of title
+:::
+
+::: ZT
+Release #
+:::
+
+::: ZT
+*Should you need a step-by-step guide for drafting an ETSI deliverable, please consult the "Principles for Drafting ETSI Deliverables" document. Otherwise you may contact us at edithelp@etsi.org.*
+:::
+
+::: ZB
+***TECHNICAL SPECIFICATION***
+:::
+
+::: TAC
+Reference
+
+
+
+Keywords
+
+
+
+*ETSI*
+
+650 Route des Lucioles
+
+F-06921 Sophia Antipolis Cedex - FRANCE
+
+Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16
+
+[Siret N° 348 623 562 00017 - APE 7112B]{.ondemand_CHAR_name_Arial_size_7}
+
+[Association à but non lucratif enregistrée à la]{.ondemand_CHAR_name_Arial_size_7}
+
+[Sous-préfecture de Grasse (06) N° w061004871]{.ondemand_CHAR_name_Arial_size_7}
+
+***Important notice***
+
+The present document can be downloaded from the [ETSI Search & Browse Standards](https://www.etsi.org/standards-search) application.
+
+The present document may be made available in electronic versions and/or in print. The content of any electronic and/or print versions of the present document shall not be modified without the prior written authorization of ETSI. In case of any existing or perceived difference in contents between such versions and/or in print, the prevailing version of an ETSI deliverable is the one made publicly available in PDF format on [ETSI deliver](http://www.etsi.org/deliver) repository.
+
+Users should be aware that the present document may be revised or have its status changed, this information is available in the [Milestones listing](https://portal.etsi.org/Services/editHelp/Standards-development/Tracking-a-draft/Status-codes).
+
+If you find errors in the present document, please send your comments to the relevant service listed under [Committee Support Staff](https://portal.etsi.org/People/Commitee-Support-Staff).
+
+If you find a security vulnerability in the present document, please report it through our [Coordinated Vulnerability Disclosure (CVD)](https://www.etsi.org/standards/coordinated-vulnerability-disclosure) program.
+
+***Notice of disclaimer & limitation of liability***
+
+The information provided in the present deliverable is directed solely to professionals who have the appropriate degree of experience to understand and interpret its content in accordance with generally accepted engineering or other professional standard and applicable regulations.
+
+No recommendation as to products and services or vendors is made or should be implied.
+
+No representation or warranty is made that this deliverable is technically accurate or sufficient or conforms to any law and/or governmental rule and/or regulation and further, no representation or warranty is made of merchantability or fitness for any particular purpose or against infringement of intellectual property rights.
+
+In no event shall ETSI be held liable for loss of profits or any other incidental or consequential damages.
+
+Any software contained in this deliverable is provided "AS IS" with no warranties, express or implied, including but not limited to, the warranties of merchantability, fitness for a particular purpose and non-infringement of intellectual property rights and ETSI shall not be held liable in any event for any damages whatsoever (including, without limitation, damages for loss of profits, business interruption, loss of information, or any other pecuniary loss) arising out of or related to the use of or inability to use the software.
+
+***Copyright Notification***
+
+No part may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm except as authorized by written permission of ETSI. The content of the PDF version shall not be modified without the written authorization of ETSI. The copyright and the foregoing restriction extend to reproduction in all media.
+
+© ETSI yyyy.
+
+All rights reserved.
+:::
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/history.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/history.md
new file mode 100644
index 0000000000000000000000000000000000000000..88870d89fd70a71c3fe516723f8f61c176a1128e
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/history.md
@@ -0,0 +1,12 @@
+# History
+
+::: TAL
++:-----------------------:+:-----------------------:+:-----------------------:+
+| Document History                                                            |
++=========================+=========================+=========================+
+|                         |                         |                         |
++-------------------------+-------------------------+-------------------------+
+:::
+
+*Latest changes made on YYYY-MM-DD*
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/intellectual-property-rights.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/intellectual-property-rights.md
new file mode 100644
index 0000000000000000000000000000000000000000..5848ab7e258d944a022528037b6b3e217be3ffbf
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/intellectual-property-rights.md
@@ -0,0 +1,18 @@
+# Intellectual Property Rights
+
+::: H6
+Essential patents
+:::
+
+IPRs essential or potentially essential to normative deliverables may have been declared to ETSI. The declarations pertaining to these essential IPRs, if any, are publicly available for **ETSI members and non-members**, and can be found in ETSI SR 000 314: *"Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in respect of ETSI standards"*, which is available from the ETSI Secretariat. Latest updates are available on the [ETSI IPR online database](https://ipr.etsi.org/).
+
+Pursuant to the ETSI Directives including the ETSI IPR Policy, no investigation regarding the essentiality of IPRs, including IPR searches, has been carried out by ETSI. No guarantee can be given as to the existence of other IPRs not referenced in ETSI SR 000 314 (or the updates on the ETSI Web server) which are, or may be, or may become, essential to the present document.
+
+::: H6
+Trademarks
+:::
+
+The present document may include trademarks and/or tradenames which are asserted and/or registered by their owners. ETSI claims no ownership of these except for any which are indicated as being the property of ETSI, and conveys no right to use or reproduce any trademark and/or tradename. Mention of those trademarks in the present document does not constitute an endorsement by ETSI of products, services or organizations associated with those trademarks.
+
+**DECT™**, **PLUGTESTS™**, **UMTS™** and the ETSI logo are trademarks of ETSI registered for the benefit of its Members. **3GPP™**, **LTE™** and **5G™** logo are trademarks of ETSI registered for the benefit of its Members and of the 3GPP Organizational Partners. **oneM2M™** logo is a trademark of ETSI registered for the benefit of its Members and of the oneM2M Partners. **GSM**® and the GSM logo are trademarks registered and owned by the GSM Association.
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/introduction.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/introduction.md
new file mode 100644
index 0000000000000000000000000000000000000000..3d07efe555d3ce74da304571d46338266d5f860e
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/introduction.md
@@ -0,0 +1,2 @@
+# Introduction
+
diff --git a/md_to_docx_converter/templates/document_skeleton/technical_specification/modal-verbs-terminology.md b/md_to_docx_converter/templates/document_skeleton/technical_specification/modal-verbs-terminology.md
new file mode 100644
index 0000000000000000000000000000000000000000..fa21889aa4f651b4dc25a677f4b08d63270ad417
--- /dev/null
+++ b/md_to_docx_converter/templates/document_skeleton/technical_specification/modal-verbs-terminology.md
@@ -0,0 +1,6 @@
+# Modal verbs terminology
+
+In the present document "**should**", "**should not**", "**may**", "**need not**", "**will**", "**will not**", "**can**" and "**cannot**" are to be interpreted as described in clause 3.2 of the [ETSI Drafting Rules](https://portal.etsi.org/Services/editHelp!/Howtostart/ETSIDraftingRules.aspx) (Verbal forms for the expression of provisions).
+
+"**must**" and "**must not**" are **NOT** allowed in ETSI deliverables except when used in direct citation.
+
diff --git a/md_to_docx_converter/templates/json/file_order.json b/md_to_docx_converter/templates/json/file_order.json
new file mode 100644
index 0000000000000000000000000000000000000000..6505b5ab961c39e9c6291762e6f8320d748a3d49
--- /dev/null
+++ b/md_to_docx_converter/templates/json/file_order.json
@@ -0,0 +1,4 @@
+{
+ "clauses": [],
+ "annexes": []
+}
diff --git a/media/image12.png b/media/image12.png
index c1d304088f60e423dcdbebd15305601c0fe99ace..ee6ebfb9fa2f57cd6810c414441b08f821f83216 100644
Binary files a/media/image12.png and b/media/image12.png differ
diff --git a/media/image15.png b/media/image15.png
index 3a1258a6d0eafc722c20d059a177bf5f80facd07..b9195520bc5ea27350582e09043cd83b198d5b10 100644
Binary files a/media/image15.png and b/media/image15.png differ
diff --git a/media/image16.png b/media/image16.png
index ec94518f8c3de36d9cc050fd3ce93d9aa9289b07..a58fbef4c1be2d7863641380a2ee0204722183d9 100644
Binary files a/media/image16.png and b/media/image16.png differ
diff --git a/media/image4.PNG b/media/image4.png
similarity index 100%
rename from media/image4.PNG
rename to media/image4.png