fix: wrong src folder when passing --src parameter (968040d7) · Commits · CIM - Context Information Management / NGSI-LD API

md_to_docx_converter/README.md

+32 −5

Original line number	Diff line number	Diff line
		@@ -20,7 +20,13 @@ Latest

		Latest

		### 1.1.1 Create a virtual environment
		### 1.1.2 Optional Software

		#### [WSL (Windows Subsystem for Linux)](https://learn.microsoft.com/en-us/windows/wsl/install)

		See additional setup steps in [section 1.2.1](#121-for-use-with-wsl).

		### 1.1.3 Create a virtual environment

		If you prefer to use Miniconda or Pyenv, set up a virtual environment according to the following steps. If you are not using Miniconda or Pyenv, you can use any other Python environment manager; just go to point 3 of the list and install the requirements using pip.

		@@ -45,7 +51,7 @@ If you prefer to use Miniconda or Pyenv, set up a virtual environment according to
		1. `cd path/to/ETSI-GS-CIM-009`
		2. `pip install -r requirements.txt`

		### 1.1.2 Check Pandoc
		### 1.1.4 Check Pandoc

		Ensure Pandoc version 3.7.0.2 is installed.

		@@ -73,6 +79,25 @@ Current folder
		├── ETSIstyles.css
		```

		### 1.2.1 For Use with WSL

		To ensure the script runs correctly with WSL 1[^1], do one of the following:

		- Ensure the directory containing the script's files is saved in the Linux filesystem (e.g., at `\\wsl$\{Distro}\home\{user}\path\to\script\dir`).

		OR

		- Ensure the current user has full control of the directory containing the script's files within the base Windows filesystem (e.g., at `C:\path\to\script\dir`).
		1. Right click on the top-level directory in the file explorer.
		2. Click Properties
		3. Go to the Security tab
		4. Click Edit to modify permissions.
		5. In the list, select the current user.

		If the current user is not in the list, click Add. In the dialogue that opens, ensure that the User object type is selected and type the current user's username in the Enter object names to select box. Click Check names, then click OK.

		6. Under Permissions, check the box for Full control, then click Apply and OK.

		## 1.3 Expected Folder Structure after conversion

		Through conversion you will eventually end up with the following folder structure:
		@@ -130,7 +155,7 @@ The script `convert.py` handles the following conversions:
		- If this argument is not provided, the source path will be _ETSI-GS-CIM-009_/_GENERATED_FILES_/_{value provided via `--folder`}_/_{value provided via `--frm`}_.

		- `--file_order` - Provide the path to a JSON that contains the order in which user-created clauses and annexes should be.
		- If this argument is not provided, the default ordering convention will be used. See [section 2.2.1.1](#2.2.1.1-preparation-create-markdown-from-scratch) for more details.
		- If this argument is not provided, the default ordering convention will be used. See [section 2.2.1.1](#2211---preparation-create-markdown-from-scratch) for more details.

		## 2.2 Conversion

		@@ -154,7 +179,7 @@ Create Markdown from scratch or convert "dirty" HTML to "clean" Markdown
		- Custom: Create a file ordering JSON according to the [template](./templates/json/file_order.json) and provide it to the script with the `--file_order` argument.
		- In either case, any other files that follow the naming convention clause-{some text} or annex-{some text} will be ordered alphabetically after the predefined files in the appropriate section of the document.
		4. Create a media directory inside the directory created in step 1.
		- As images are added to the Markdown, place the image in `PNG` and `EMF` form inside.[^1]
		- As images are added to the Markdown, place the image in `PNG` and `EMF` form inside.[^2]

		At this point, the document's clauses and annexes can be created. Follow these considerations to ensure the document is created properly:
		- Ensure any files defined in a [file ordering JSON](./templates/json/file_order.json) are present in the source directory created in step 1, otherwise the script will print an error and quit prematurely.
		@@ -224,4 +249,6 @@ Specify a different directory containing the HTML files.

		`convert.py --frm html --to docx --folder {docname} --src relative/or/absolute/source/path`

		[^1]: Method subject to change
		[^1]: These steps may not be necessary with WSL 2, but it is recommended to follow them nevertheless.

		[^2]: Method subject to change
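
The `--src` fallback described in the README hunk above can be illustrated with a small sketch. The function name `resolve_src` and its parameters are hypothetical and are not taken from `convert.py`; only the default layout `GENERATED_FILES/{folder}/{frm}` and the fact that `--src` accepts a relative or absolute path come from the README.

```python
import os
from typing import Optional


def resolve_src(root: str, folder: str, frm: str, src: Optional[str] = None) -> str:
    """Hypothetical helper: pick the source directory for a conversion."""
    if src:
        # An explicit --src wins; a relative path is resolved against the
        # current working directory (an assumption of this sketch).
        return os.path.abspath(src)
    # Default documented in the README: <root>/GENERATED_FILES/<folder>/<frm>
    return os.path.join(root, "GENERATED_FILES", folder, frm)
```

Resolving the path once and passing the result to every later step is one way to keep helper files, such as the mapping JSON, in the same directory that the conversion actually reads from.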

md_to_docx_converter/convert.py

+8 −3

		@@ -207,15 +207,20 @@ def convert():
		if os.path.exists(DEST):
		shutil.rmtree(DEST)

		preprocess_html(SRC, SRC_TYPE, CONSOLIDATED_MD_PATH, FILE_ORDER_JSON)
		filename_numbers_mapping = preprocess_html(SRC, SRC_TYPE, CONSOLIDATED_MD_PATH, FILE_ORDER_JSON)
		filename_numbers_mapping_path = os.path.join(SRC, "filename_numbers_mapping.json")
		with open(filename_numbers_mapping_path, "w") as f:
		json.dump(filename_numbers_mapping, f, indent=4)

		# Conversion
		command = get_md_to_html_command(SRC, DEST, CONSOLIDATED_MD_PATH, CSS_SRC)

		try:
		subprocess.run(command, check=True, capture_output=True, text=True)
		os.remove(filename_numbers_mapping_path)
		except subprocess.CalledProcessError as e:
		print(f"Error converting Markdown files in {SRC} to HTML:\n{e.stderr}")
		os.remove(filename_numbers_mapping_path)
		sys.exit(1)

		# Copy the media directory back over to preserve the emfs, since Pandoc doesn't bring those over
		@@ -225,8 +230,8 @@ def convert():
		shutil.copytree(f"{SRC}/media", f"{DEST}/media")

		# Copy ETSIstyles.css into the parent folder
		styles_css = CSS_SRC[0]
		shutil.copy(styles_css, os.path.join(FILEGEN_DIR, FOLDER))
		for css_file in CSS_SRC:
		shutil.copy(css_file, os.path.join(FILEGEN_DIR, FOLDER))
		shutil.copy("advancedTOCLogic.js", DEST)

		# Cleanup the consolidated Markdown
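
The hunk above removes `filename_numbers_mapping.json` in both the success and the failure branch. A minimal sketch of the same behaviour with `try`/`finally` follows, assuming a hypothetical helper `run_with_mapping` that receives the command as an argument; it is not the code in `convert.py`.

```python
import json
import os
import subprocess
import sys


def run_with_mapping(src: str, mapping: dict, command: list[str]) -> None:
    """Hypothetical helper: write the mapping, run the command, always clean up."""
    mapping_path = os.path.join(src, "filename_numbers_mapping.json")
    with open(mapping_path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, indent=4)
    try:
        subprocess.run(command, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(f"Error converting Markdown files in {src} to HTML:\n{e.stderr}")
        sys.exit(1)
    finally:
        # Runs on success, on failure, and while SystemExit propagates.
        os.remove(mapping_path)
```

With `finally`, the temporary file is also removed when the command cannot be started at all, a case the two explicit `os.remove` calls do not cover.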

md_to_docx_converter/md_to_html_3.lua

+1 −1

		local pandoc = require "pandoc"
		-- local logfile = io.open("debug.log", "a")

		local mt, filename_numbers_mapping = pandoc.mediabag.fetch("../../../filename_numbers_mapping.json")
		local mt, filename_numbers_mapping = pandoc.mediabag.fetch("filename_numbers_mapping.json")
		filename_numbers_mapping = pandoc.json.decode(filename_numbers_mapping, false)

		local function split(str, delimiter)

md_to_docx_converter/src/constants.py

+12 −0

		@@ -148,3 +148,15 @@ DEFAULT_HTML_CLAUSES = [f"clause-{i}" for i in range(1, 21)]

		DEFAULT_ANNEXES = [f"annex-{letter}" for letter in "abcdefghijklmnopqrstuvwxyz"]
		# endregion

		# region Markdown Preprocessing Formatting Checks
		METADATA_REGEX = r"(?:\{\.)?\w+(?:\s+\.\w+)*\}?" # Follows the form `className` or `{.class1 .class2 ... .classN}`
		DIV_START_REGEX = rf"[-\s]:::\s{METADATA_REGEX}\s"
		DIV_END_REGEX = r"\s:::\s"

		BAD_COLON_GROUP_REGEX = (
		r"(?<!:)(?::{1,2}|:{4,})(?!:)" # Match all colon groups except the valid `:::`
		)
		# Match lines that start with : or :: and are not followed by letters
		BAD_DIV_DELINEATOR_REGEX = r"^\s*(?::{1,2})(?![a-zA-Z])"
		# endregion
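
To show how `BAD_DIV_DELINEATOR_REGEX` behaves together with the `startswith(":::")` guard used in `check_divs`, here is a small sketch. The regex is copied from the hunk above; the wrapper function `is_bad_delineator` is hypothetical.

```python
import re

# Copied from constants.py in this commit.
BAD_DIV_DELINEATOR_REGEX = r"^\s*(?::{1,2})(?![a-zA-Z])"


def is_bad_delineator(line: str) -> bool:
    """Flag lines that open with `:` or `::` but are not a `:::` fence."""
    # The regex alone also matches the first two colons of `:::`,
    # so the startswith guard is what lets a valid fence through.
    return bool(re.match(BAD_DIV_DELINEATOR_REGEX, line)) and not line.startswith(":::")


print(is_bad_delineator(":: TF"))   # two colons
print(is_bad_delineator("::: TF"))  # valid fence
```

Because `startswith` is applied to the unstripped line, an indented fence such as `  ::: TF` is also flagged under this check.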

md_to_docx_converter/src/to_html/preprocessing.py

+161 −27

		import os, re, json
		import sys
		from typing_extensions import Literal

		from src.constants import (
		@@ -6,10 +7,20 @@ from src.constants import (
		INFORMATIVE_REF_FILE,
		DEFAULT_CLAUSES,
		DEFAULT_ANNEXES,
		REFS
		REFS,
		DIV_START_REGEX,
		DIV_END_REGEX,
		BAD_DIV_DELINEATOR_REGEX,
		)

		from src.utils import handle_consolidated_md, get_file_order, int_to_letter
		from src.utils import (
		handle_consolidated_md,
		get_file_order,
		int_to_letter,
		p_warning,
		p_error,
		p_label,
		)
		from src.constants import MAX_HEADING_LEVEL

		files_with_references = [NORMATIVE_REF_FILE, INFORMATIVE_REF_FILE]
		@@ -18,6 +29,92 @@ files_with_references = [NORMATIVE_REF_FILE, INFORMATIVE_REF_FILE]
		# region Helpers


		def run_format_checks(filename: str, file_lines: list[str]):
		"""Runs various checks on the Markdown file contents to ensure they are properly formatted. If any improper formatting is detected, display any fatal errors or warnings as necessary."""

		def check_divs():
		"""
		### Display an error and exit when...
		- An opening does not have a closing

		### Display a warning when...
		- The number of openings and number of closings do not match
		- A closing is found without a corresponding opening; this is likely meant to be an opening and needs metadata
		"""
		i = 0
		in_div = False
		in_div_no_metadata = (
		False # For if/when a div is found that doesn't have any class
		)
		start_line_num = (
		0 # For keeping track of the line number at which the latest div was opened
		)

		# Keep track of numbers of div starts and div ends
		num_div_start = 0
		num_div_end = 0

		while i < len(file_lines):
		line = file_lines[i].replace("\n", "")
		line_num = i + 1

		bad_div_delin_match = re.match(BAD_DIV_DELINEATOR_REGEX, line)
		if bad_div_delin_match and line.startswith(":::") is False:
		# This div delineator doesn't have exactly three colons `:::`
		print(
		p_error(
		f"{p_label(filename)}:{p_label(line_num)}: Improperly formatted div delineator in line. Line: {p_label(line)}"
		)
		)
		raise Exception("DIV_DELINEATOR_ERROR")

		start_match = re.match(DIV_START_REGEX, line)
		num_div_start += 1 if start_match else 0

		if start_match:
		in_div_no_metadata = False # Set this to false in case it was true from a previous div without metadata
		if in_div:
		# The previous div wasn't closed, print error and quit
		print(
		p_error(
		f"{p_label(filename)}:{p_label(start_line_num)}: No end tag found for div starting at this line"
		)
		)
		raise Exception("DIV_DELINEATOR_ERROR")
		else:
		# A normal div opener
		in_div = True

		start_line_num = line_num
		i += 1
		continue

		end_match = re.match(DIV_END_REGEX, line)
		num_div_end += 1 if end_match else 0

		if end_match:
		if not in_div and not in_div_no_metadata:
		# This should open a div, but it doesn't have a class assigned to it
		print(
		p_warning(
		f"{p_label(filename)}:{p_label(line_num)}: The delineator at this line seems to open a div, this div or one before it may not be correctly structured."
		)
		)
		in_div_no_metadata = True

		elif not in_div and in_div_no_metadata:
		# The closing to a classless div
		in_div_no_metadata = False

		in_div = False
		i += 1
		continue

		i += 1

		check_divs()
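
The open/close tracking in `check_divs` can be condensed into a simplified sketch. It assumes that an opener is any line starting with `:::` followed by metadata and that a closer is a bare `:::`, rather than using `DIV_START_REGEX` and `DIV_END_REGEX`. The exception class `DivDelineatorError` and the function `check_div_balance` are hypothetical; the commit itself raises `Exception("DIV_DELINEATOR_ERROR")`.

```python
class DivDelineatorError(Exception):
    """Hypothetical exception type used only in this sketch."""


def check_div_balance(lines: list[str]) -> tuple[int, int]:
    """Return (openers, closers); raise if a div opens while another is still open."""
    openers = closers = 0
    open_line = None  # 1-based line number of the currently open div, if any
    for line_num, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line.startswith(":::"):
            continue
        if line == ":::":
            # Assumed closer: a bare fence with no metadata.
            closers += 1
            open_line = None
        else:
            # Assumed opener: a fence followed by a class name or attributes.
            openers += 1
            if open_line is not None:
                raise DivDelineatorError(f"line {open_line}: no end tag found for div")
            open_line = line_num
    return openers, closers
```

A dedicated exception type lets the caller catch formatting errors with `except DivDelineatorError` instead of comparing `e.args[0]` against a string.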


		def handle_less_than_greater_than_text(file_contents: str):
		"""Replace `<` and `>` with `&lt;` and `&gt;` respectively and wrap the whole section in single code ticks to allow the text to render in the HTML"""
		regex = r"\<(?!img\b|span\b|sup|/sup)(.+?)\>"
		@@ -92,6 +189,7 @@ def auto_number_content(
		file_contents: str, content_type: Literal["clauses", "annexes"]
		):
		global example_counter, note_counter, note_in_table_counter

		def auto_number_heading(line: str):
		global clauses_counters, annexes_counters, figure_counter, table_counter
		new_heading = ""
		@@ -157,8 +255,12 @@ def auto_number_content(
		new_line = line
		if "EXAMPLE" not in line:
		example_counter += 1
		if example_counter != 1: # if it is one the number can be omitted, need to check later
		new_line = line.replace(">>> [!tip]", f">>> [!tip] EXAMPLE {example_counter}:")
		if (
		example_counter != 1
		): # if it is one the number can be omitted, need to check later
		new_line = line.replace(
		">>> [!tip]", f">>> [!tip] EXAMPLE {example_counter}:"
		)
		return new_line

		def auto_number_note(line: str) -> str:
		@@ -166,8 +268,12 @@ def auto_number_content(
		new_line = line
		if "NOTE" not in line:
		note_counter += 1
		if note_counter != 1: # if it is one the number can be omitted, need to check later
		new_line = line.replace(">>> [!note]", f">>> [!note] NOTE {note_counter}:")
		if (
		note_counter != 1
		): # if it is one the number can be omitted, need to check later
		new_line = line.replace(
		">>> [!note]", f">>> [!note] NOTE {note_counter}:"
		)
		return new_line

		def auto_number_figure(line: str) -> str:
		@@ -194,8 +300,12 @@ def auto_number_content(
		global note_in_table_counter
		new_line = line
		note_in_table_counter += 1
		if note_in_table_counter != 1: # if it is one the number can be omitted, need to check later
		new_line = line.replace(">>> [!note]", f">>> [!note] NOTE {note_in_table_counter}:")
		if (
		note_in_table_counter != 1
		): # if it is one the number can be omitted, need to check later
		new_line = line.replace(
		">>> [!note]", f">>> [!note] NOTE {note_in_table_counter}:"
		)
		return new_line

		# take line and line number and replace the line number
		@@ -214,39 +324,45 @@ def auto_number_content(
		previous_heading = new_heading

		if example_counter >= 1 and first_example_line_index != -1:
		lines[first_example_line_index] += f" EXAMPLE{' 1' if example_counter > 1 else ''}:"
		lines[
		first_example_line_index
		] += f" EXAMPLE{' 1' if example_counter > 1 else ''}:"
		example_counter = 0
		first_example_line_index = -1

		if note_counter >= 1 and first_note_line_index != -1:
		lines[first_note_line_index] += f" NOTE{' 1' if note_counter > 1 else ''}:"
		lines[
		first_note_line_index
		] += f" NOTE{' 1' if note_counter > 1 else ''}:"
		note_counter = 0
		first_note_line_index = -1


		elif line.startswith(">>> [!tip]"):
		new_line = auto_number_example(new_line)
		if example_counter == 1:
		first_example_line_index = i


		elif line.startswith(">>> [!note]"):
		new_line = auto_number_note(new_line)
		if note_counter == 1:
		first_note_line_index = i


		elif previous_line.startswith("::: TF"):
		new_line = auto_number_figure(new_line)


		elif previous_line.startswith("::: TH"):
		new_line = auto_number_table(new_line)

		if note_in_table_counter >= 1 and first_note_in_table_line_index != -1:
		note_string = f" NOTE{' 1' if note_in_table_counter > 1 else ''}:"
		first_index_after_bracket = lines[first_note_in_table_line_index].find("[!note]")
		lines[first_note_in_table_line_index] = lines[first_note_in_table_line_index][:first_index_after_bracket] + note_string + lines[first_note_in_table_line_index][first_index_after_bracket:]
		first_index_after_bracket = lines[first_note_in_table_line_index].find(
		"[!note]"
		)
		lines[first_note_in_table_line_index] = (
		lines[first_note_in_table_line_index][:first_index_after_bracket]
		+ note_string
		+ lines[first_note_in_table_line_index][first_index_after_bracket:]
		)
		note_in_table_counter = 0
		first_note_in_table_line_index = -1

		@@ -255,22 +371,29 @@ def auto_number_content(
		if note_in_table_counter == 1:
		first_note_in_table_line_index = i


		lines[i] = new_line
		previous_line = line

		### Re-run the numbering logic for examples and notes: it only triggers at specific points, so an element in the last heading or table may otherwise be skipped

		if example_counter >= 1 and first_example_line_index != -1:
		lines[first_example_line_index] += f" EXAMPLE{' 1' if example_counter > 1 else ''}:"
		lines[
		first_example_line_index
		] += f" EXAMPLE{' 1' if example_counter > 1 else ''}:"

		if note_counter >= 1 and first_note_line_index != -1:
		lines[first_note_line_index] += f" NOTE{' 1' if note_counter > 1 else ''}:"

		if note_in_table_counter >= 1 and first_note_in_table_line_index != -1:
		note_string = f" NOTE{' 1' if note_in_table_counter > 1 else ''}:"
		first_index_after_bracket = lines[first_note_in_table_line_index].find("[!note]")
		lines[first_note_in_table_line_index] = lines[first_note_in_table_line_index][:first_index_after_bracket] + note_string + lines[first_note_in_table_line_index][first_index_after_bracket:]
		first_index_after_bracket = lines[first_note_in_table_line_index].find(
		"[!note]"
		)
		lines[first_note_in_table_line_index] = (
		lines[first_note_in_table_line_index][:first_index_after_bracket]
		+ note_string
		+ lines[first_note_in_table_line_index][first_index_after_bracket:]
		)

		file_contents = "\n".join(lines) + "\n"
		return file_contents
		@@ -340,12 +463,18 @@ def preprocess(
		try:
		text = open(input_path, "r", encoding="utf-8").read()

		run_format_checks(filename, text.splitlines())

		if filename in clauses_filenames:
		text = auto_number_content(text, "clauses")
		filename_numbers_mapping[filename_without_extension] = clauses_counters[0]
		filename_numbers_mapping[filename_without_extension] = (
		clauses_counters[0]
		)
		elif filename in annexes_filenames:
		text = auto_number_content(text, "annexes")
		filename_numbers_mapping[filename_without_extension] = int_to_letter(annexes_counters[0]).lower()
		filename_numbers_mapping[filename_without_extension] = (
		int_to_letter(annexes_counters[0]).lower()
		)
		text = add_ids_to_references(text, filename)
		text = handle_less_than_greater_than_text(text)
		text = add_ids_to_headings(text)
		@@ -363,9 +492,14 @@ def preprocess(
		# print(
		# f"Warning: Could not preprocess {input_path}. It may not be a valid UTF-8 text file or is missing."
		# )
		if e.args and e.args[0] == "DIV_DELINEATOR_ERROR":
		# delete all files that start with --preprocessed--
		for f in os.listdir(src):
		if f.startswith("--preprocessed--"):
		os.remove(os.path.join(src, f))
		sys.exit(1)
		pass

		with open("filename_numbers_mapping.json", "w") as f:
		json.dump(filename_numbers_mapping, f, indent=4)

		handle_consolidated_md("create", src, consolidated_md_path, preprocessed_filenames)

		return filename_numbers_mapping
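
The cleanup performed when a div delineator error is caught (deleting every file whose name starts with `--preprocessed--`) can be isolated as a small sketch. The function name `remove_preprocessed` and its return value are assumptions of this example, not part of `preprocessing.py`.

```python
import os


def remove_preprocessed(src: str, prefix: str = "--preprocessed--") -> list[str]:
    """Hypothetical helper: delete leftover preprocessed files and report their names."""
    removed = []
    for name in sorted(os.listdir(src)):
        path = os.path.join(src, name)
        # Only touch regular files; a directory with the same prefix is left alone.
        if name.startswith(prefix) and os.path.isfile(path):
            os.remove(path)
            removed.append(name)
    return removed
```

Returning the removed names makes the cleanup easy to log or test before the script exits.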