Merge branch 'md-to-docx-conversion-tool' into 'master' (63a4ddc0) · Commits · CIM - Context Information Management / NGSI-LD API

.gitignore

+36 −0

Original line number	Diff line number	Diff line
		@@ -45,3 +45,39 @@ jspm_packages/
		.env
		.next


		# Folders
		node_modules
		public
		media*
		saved_files
		official
		editing
		shared
		.venv
		.vscode
		.envrc
		__pycache__

		# Input directory
		md_to_docx_converter/INPUT/*
		!md_to_docx_converter/INPUT/README.md
		md_to_docx_converter/INPUT/file_order/*
		!md_to_docx_converter/INPUT/file_order
		!md_to_docx_converter/INPUT/file_order/README.md

		# Output Files
		md_to_docx_converter/background_highlight.css
		md_to_docx_converter/temp.docx
		md_to_docx_converter/API*.css
		md_to_docx_converter/primer*.css
		md_to_docx_converter/API*.html
		md_to_docx_converter/primer*.html

		md_to_docx_converter/GENERATED_FILES
		editing.zip
		official.zip
		package.json
		package-lock.json
		md_to_docx_converter/public/API*.css
		md_to_docx_converter/public/primer*.css
		No newline at end of file

md_to_docx_converter/ETSIstyles.css

0 → 100644

+264 −0

		.NF {
		font-family: Arial;
		font-size: 9pt;
		align-items: center;
		display: flex;
		gap: 4pt;
		}

		.TAH {
		font-family: Arial;
		font-size: 9pt;
		font-weight: bold;
		text-align: center;
		}

		.TAC {
		font-family: Arial;
		font-size: 9pt;
		text-align: center;
		}

		.TAL {
		font-family: Arial;
		font-size: 9pt;
		}

		.HTML_Variable {
		font-family: "Roboto Mono", Monospace;
		font-size: 8pt;
		font-style: italic;
		}

		.HTML_Keyboard {
		font-family: Times New Roman;
		font-size: 10pt;
		color: #595959;
		}

		.HTML_Definition {
		font-style: italic;
		}

		.TAN {
		font-family: Arial;
		font-size: 9pt;
		padding-left: 12pt;
		display: flex;
		gap: 12pt;
		}

		.HTML_Code {
		font-family: "Roboto Mono", Monospace;
		font-size: 8pt;
		color: #31849b;
		}

		.Plain_Text_Char {
		font-family: Courier New;
		}

		.HTML_Error {
		font-family: Times New Roman;
		font-weight: bold;
		font-style: italic;
		color: #595959;
		}

		.TAJ {
		font-family: Arial;
		font-size: 9pt;
		text-align: justify;
		}

		.FP {
		}

		.ZA {
		font-family: Arial;
		font-size: 20pt;
		text-align: right;
		}

		.ZGSM {
		}

		.ZT {
		font-family: Arial;
		font-size: 17pt;
		font-weight: bold;
		text-align: center;
		}

		.ZG {
		font-family: Arial;
		text-align: right;
		}

		.ZB {
		font-family: Arial;
		font-style: italic;
		text-align: right;
		font-family: Century Gothic;
		font-size: 16pt;
		font-weight: bold;
		color: #ffffff;
		}

		.H6 {
		font-family: Arial;
		font-size: 10pt;
		margin-top: 6pt;
		margin-bottom: 9pt;
		}
		.NO {
		padding-left: 12pt;
		margin-bottom: 9pt;
		display: flex;
		gap: 12pt;
		}

		.NO > p {
		margin: 0;
		padding: 0;
		white-space: pre-wrap;
		}

		.NO > p:first-of-type {
		white-space: nowrap;
		}

		.EX {
		padding-left: 12pt;
		margin-bottom: 9pt;
		display: flex;
		gap: 12pt;
		align-items: flex-start;
		}

		.EX > p {
		margin: 0;
		padding: 0;
		white-space: pre-wrap;
		}

		.EX > p:first-of-type {
		white-space: nowrap;
		}

		.EW {
		gap: 12pt;
		display: flex;
		align-items: flex-start;
		}

		.EW > div:first-child {
		min-width: 100px;
		flex-shrink: 0;
		}

		.EW > div:last-child {
		flex: 1;
		}

		.FL {
		font-family: Arial;
		font-weight: bold;
		text-align: center;
		margin-top: 3pt;
		margin-bottom: 9pt;
		}

		.TF {
		font-family: Arial;
		font-weight: bold;
		text-align: center;
		margin-bottom: 12pt;
		}

		.B1plus {
		margin-bottom: 9pt;
		}

		.TH {
		font-family: Arial;
		font-weight: bold;
		text-align: center;
		margin-top: 3pt;
		margin-bottom: 9pt;
		}

		.B2plus {
		margin-bottom: 9pt;
		}

		.HTML_Sample {
		font-family: Courier New;
		font-size: 8pt;
		}

		.B3 {
		margin-bottom: 9pt;
		}

		.B1 {
		margin-bottom: 9pt;
		}

		.PL {
		font-family: Courier New;
		font-size: 8pt;
		}

		.Hyperlink {
		text-decoration: underline;
		color: #0000ff;
		}

		.B3plus {
		margin-bottom: 9pt;
		}

		.B4 {
		margin-bottom: 9pt;
		}

		.B2 {
		margin-bottom: 9pt;
		}

		.B5 {
		margin-bottom: 9pt;
		}

		.BL {
		margin-bottom: 9pt;
		}

		.BN {
		margin-bottom: 9pt;
		}

		.image_overlay {
		visibility: hidden;
		}

		/* CUSTOM ETSI ONDEMAND STYLES FOR FRONTPAGE AND HISTORY TABLE */

		.ondemand_CHAR_size_32 {
		font-size: 32pt;
		}

		.ondemand_CHAR_size_16 {
		font-size: 16pt;
		}

		.ondemand_CHAR_name_Arial_size_7 {
		font-family: Arial;
		font-size: 7pt;
		}

		.ondemand_PAR_space_before_3_after_3 {
		margin-top: 3pt;
		margin-bottom: 3pt;
		}

md_to_docx_converter/INPUT/README.md

0 → 100644

+7 −0

		# User Inputs Folder

		This folder contains input files that are supplied to the script.

		## Contents

		`file_order`: Contains file ordering JSON files that follow the [template](../templates/json/file_order.json).
		No newline at end of file

md_to_docx_converter/INPUT/file_order/README.md

0 → 100644

+65 −0

		# File Orderings

		Place JSON files here that define how files contained in the conversion source directory should be ordered.

		### Considerations

		#### Ensure files defined in the JSON exist in the source directory

		Unlike the script's default file ordering, a provided file ordering causes the script to fail if it names a file that is not present in the conversion source directory. Make sure that every file named in a provided JSON exists there.
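
A check of this kind can be run before conversion. The following is a minimal sketch, not the script's own validation: the function name `missing_files` is hypothetical, and it assumes that each JSON entry corresponds to a file named `<entry>.md` in the source directory, as the example listing further down suggests.

```python
import json
from pathlib import Path


def missing_files(order_path, source_dir):
    """Return the JSON-listed names that have no matching .md file in source_dir."""
    order = json.loads(Path(order_path).read_text(encoding="utf-8"))
    # "clauses" and "annexes" are the two keys used by the file ordering JSON.
    names = order.get("clauses", []) + order.get("annexes", [])
    # Assumption: an entry "clause-example1" refers to "clause-example1.md".
    return [name for name in names if not (Path(source_dir) / f"{name}.md").is_file()]
```

An empty list means every file named in the JSON was found.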

		#### Preceding numbers in top-level headings override the script ordering

		In the top-level heading of each Markdown file, a leading number (for example, `# 4 Clause Heading`) overrides the file order defined in the script. Ensure that:
		1. No two Markdown source files use the same preceding number in their top-level heading. It is best to make sure the numbers correspond with the file's intended place in the overall hierarchy.
		2. Numbering of non-standard clauses and annexes begins with 4, because 1, 2, and 3 are reserved for the predefined clauses Scope, References, and Definitions.
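
The first consideration can be checked mechanically. The sketch below is illustrative only: `leading_number` and `find_duplicates` are hypothetical helpers, and the code assumes that the first line starting with `# ` is the file's top-level heading.

```python
import re
from collections import defaultdict

# A top-level heading that starts with a number, for example "# 4 Clause Heading".
HEADING = re.compile(r"^#\s+(\d+)\b")


def leading_number(markdown_text):
    """Return the number that starts the first top-level heading, or None."""
    for line in markdown_text.splitlines():
        if line.startswith("# "):
            match = HEADING.match(line)
            return int(match.group(1)) if match else None
    return None


def find_duplicates(files):
    """Map each leading number used by more than one file to those filenames.

    files: dict of filename -> Markdown text.
    """
    by_number = defaultdict(list)
    for name, text in files.items():
        number = leading_number(text)
        if number is not None:
            by_number[number].append(name)
    return {n: names for n, names in by_number.items() if len(names) > 1}
```

An empty result means no two files claim the same position.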

		### Example

		The following example specifies the ordering of a few example files. An [empty template](../../templates/json/file_order.json) is provided for convenience.

		``` json
		{
		  "clauses": [
		    "clause-example1",
		    "clause-example3",
		    "clause-example2",
		    "example4"
		  ],
		  "annexes": [
		    "annex-example1",
		    "annex-3",
		    "example2"
		  ]
		}
		```

		This will ensure that the following file order is produced.

		1. Universal ETSI initial files
		    1. Intellectual Property Rights
		    2. Foreword
		    3. Modal verbs terminology
		    4. Executive summary
		    5. Introduction
		2. ETSI universal initial clauses
		    1. Scope
		    2. References
		    3. Definition of terms, symbols and abbreviations
		3. Clauses defined in the JSON
		    1. clause-example1.md
		    2. clause-example3.md
		    3. clause-example2.md
		    4. example4.md
		4. Other clauses[^1]
		5. Annexes defined in the JSON
		    1. annex-example1.md
		    2. annex-3.md
		    3. example2.md
		6. Other annexes[^2]
		7. Universal ETSI final files
		    1. History

		[^1]: If any other files exist in the conversion source directory whose filenames follow the format clause-{text}.md, they will be added alphabetically after the JSON-defined clauses.

		[^2]: Similar to [^1], any files in the source directory that follow the format annex-{text}.md will be added alphabetically after the JSON-defined annexes.
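
The rule described in the two footnotes can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: `order_files` is a hypothetical name, and JSON entries are assumed to map to `<entry>.md`.

```python
def order_files(listed, available, prefix):
    """Return JSON-listed files first, then unlisted '<prefix>-*.md' files alphabetically.

    listed: names from the JSON, without extension.
    available: filenames present in the conversion source directory.
    prefix: "clause" or "annex".
    """
    ordered = [f"{name}.md" for name in listed]
    extras = sorted(
        filename
        for filename in available
        if filename.startswith(f"{prefix}-")
        and filename.endswith(".md")
        and filename not in ordered
    )
    return ordered + extras
```

With the clauses from the example JSON and two extra files, `clause-alpha.md` and `clause-zeta.md`, in the source directory, the extras are appended in that order after `example4.md`.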
		No newline at end of file

md_to_docx_converter/README.md

0 → 100644

+217 −0

		# 1. Introduction

		## 1.1 Setup

		### 1.1.1 Required Software

		#### [Miniconda](https://docs.conda.io/projects/conda/en/stable/user-guide/install/index.html) (or any other Python environment manager)

		Latest

		#### [Pandoc](https://pandoc.org/installing.html)

		Version 3.7.0.2

		#### [LibreOffice (command line)](https://www.libreoffice.org/)

		Latest

		#### [ImageMagick](https://imagemagick.org/)

		Latest

		### 1.1.2 Create a virtual environment with Conda

		The following steps assume that Miniconda is installed. If you use a different Python environment manager, create and activate an environment with it, then go to step 3 and install the requirements using pip.

		1. Follow the [instructions](https://docs.conda.io/projects/conda/en/stable/user-guide/install/index.html) to set up Miniconda.

		2. Create a virtual environment with Python version 3.10.16 to use with the script. Because the environment is created with `-p`, it is activated by its path.

		    1. `conda create -p path/to/environment_name python=3.10.16`
		    2. `conda activate path/to/environment_name`

		3. Install the requirements contained in _requirements.txt_.
		    1. `cd path/to/ETSI-GS-CIM-009`
		    2. `pip install -r requirements.txt`

		### 1.1.3 Check Pandoc

		Ensure Pandoc version 3.7.0.2 is installed.

		`pandoc --version`
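
The same check can be scripted. The sketch below is an assumption-laden illustration, not part of the tool: the helper names are hypothetical, and it relies on the first line of `pandoc --version` having the form `pandoc <version>`.

```python
import re
import subprocess

REQUIRED = "3.7.0.2"  # the Pandoc version named in this README


def parse_pandoc_version(output):
    """Extract the version from the first line of `pandoc --version` output."""
    lines = output.splitlines()
    match = re.match(r"pandoc(?:\.exe)?\s+(\S+)", lines[0]) if lines else None
    return match.group(1) if match else None


def installed_pandoc_version():
    """Run `pandoc --version`; return the reported version, or None if unavailable."""
    try:
        result = subprocess.run(
            ["pandoc", "--version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_pandoc_version(result.stdout)
```

Comparing `installed_pandoc_version()` with `REQUIRED` then shows whether the expected version is on the path.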

		## 1.2 Initialize folders

		Run the following command to create the necessary folder structure:

		```bash
		python init.py --folder {folder_name}
		```

		This will generate the folder structure required for the conversion process.

		The expected folder structure is as follows:

		```
		Current folder
		└── GENERATED_FILES
		    └── {folder_name}
		        ├── md
		        └── media
		├── customCSS.css
		├── ETSIstyles.css
		```

		## 1.3 Expected Folder Structure after conversion

		Through conversion you will eventually end up with the following folder structure:

		```
		Current folder
		└── GENERATED_FILES
		    └── {folder_name}
		        ├── docx
		        ├── html
		        |   └── media
		        ├── md
		        └── media
		├── customCSS.css
		├── ETSIstyles.css
		```

		Where:

		- `{folder_name}` is the name of the folder you specified when running the `init.py` script.
		- `md` is the folder containing the Markdown files that you work on.
		- `media` is the folder containing the media files (images) referenced in the Markdown files.
		- `customCSS.css` is the CSS file that defines custom styles used for the HTML pages only. This file is volatile and is copied to the output directory during conversion.
		- `ETSIstyles.css` is the CSS file that represents the styles of an ETSI Word document. This file is also volatile and is copied to the output directory during conversion.
		- `html` contains the HTML generated from the Markdown files. This folder is automatically created during conversion.
		- `html/media` is the folder containing the media files (images) referenced in the HTML files. This folder is automatically created during conversion.
		- `docx` contains the Docx output generated from the HTML files. This folder is automatically created during conversion.

		# 2. Usage

		The script `convert.py` handles the following conversions:

		| From       | To         |
		| ---------- | ---------- |
		| Dirty HTML | Markdown   |
		| Markdown   | Clean HTML |
		| Clean HTML | Docx       |
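
Read as data, the table forms a chain of single-step conversions. The sketch below is only an illustration of that reading: the format identifiers `html_dirty`, `md`, `html`, and `docx` are the values accepted by `--frm` and `--to`, while `CONVERSIONS`, `is_supported`, and `pipeline` are hypothetical names that do not come from `convert.py`.

```python
# Single-step conversions from the table: source format -> destination format.
CONVERSIONS = {
    "html_dirty": "md",
    "md": "html",
    "html": "docx",
}


def is_supported(frm, to):
    """Return True if frm -> to is one of the single-step conversions."""
    return CONVERSIONS.get(frm) == to


def pipeline(frm, to):
    """Return the chain of formats leading from frm to to, or None if there is none."""
    steps = [frm]
    while steps[-1] != to:
        next_format = CONVERSIONS.get(steps[-1])
        if next_format is None:
            return None
        steps.append(next_format)
    return steps
```

For example, `pipeline("html_dirty", "docx")` lists the three conversions that would have to be run one after another.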

		## 2.1 Arguments

		`convert.py -h` - List the script's arguments

		### 2.1.1 Required

		- `--frm [file type]` - Source type: `html_dirty` for HTML that needs additional processing to produce the desired formatting, `html` for HTML that is already properly formatted, or `md` for Markdown.

		- `--to` - Destination type: `html`, `md`, or `docx`. It must be different from the source file type provided with `--frm`.

		- `--folder` - The name of the folder within _ETSI-GS-CIM-009/GENERATED_FILES_ in which to place the converted files.
		    - BEST PRACTICE: Use a short term by which the document can be uniquely identified, for example, _API_ or _primer_.

		### 2.1.2 Optional

		- `--src` - Provide the path of the source directory rather than allowing the script to generate it.
		    - If this argument is not provided, the source path will be _ETSI-GS-CIM-009_/_GENERATED_FILES_/_{value provided via `--folder`}_/_{value provided via `--frm`}_.

		- `--file_order` - Provide the path to a JSON file that defines the order of user-created clauses and annexes.
		    - If this argument is not provided, the default ordering convention will be used. See [section 2.2.1.1](#2.2.1.1-preparation-create-markdown-from-scratch) for more details.
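
The arguments above and the default source path can be modelled with `argparse`. This is a minimal sketch, not the parser used by `convert.py`: `build_parser` and `resolve_source` are hypothetical names, and the default path is expressed relative to the repository folder.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with the documented required and optional arguments."""
    parser = argparse.ArgumentParser(prog="convert.py")
    parser.add_argument("--frm", required=True, choices=["html_dirty", "html", "md"])
    parser.add_argument("--to", required=True, choices=["html", "md", "docx"])
    parser.add_argument("--folder", required=True)
    parser.add_argument("--src")
    parser.add_argument("--file_order")
    return parser


def resolve_source(args, root="GENERATED_FILES"):
    """Return --src if given, otherwise <root>/<folder>/<frm>."""
    if args.frm == args.to:
        raise ValueError("--to must differ from --frm")
    return Path(args.src) if args.src else Path(root) / args.folder / args.frm
```

Called with `--frm md --to html --folder API`, `resolve_source` yields `GENERATED_FILES/API/md`.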

		## 2.2 Conversion

		`convert.py` requires "dirty" HTML to convert to Markdown. Therefore, if the "dirty" HTML does not already exist, it must be generated from the document in docx format.

		NOTE: For the following commands, `{docname}` refers to the filename of the document without the extension. For example, the `{docname}` _API_ refers to _API.docx_.

		### 2.2.1 Preparation
		---

		Either create Markdown from scratch or generate the "dirty" HTML that will be converted to "clean" Markdown.

		#### 2.2.1.1 Create Markdown from scratch

		1. If it does not already exist, create the directory `GENERATED_FILES/{docname}/md` in the folder that contains `convert.py`.
		2. Copy the template files from [Required Markdown files](./templates/document_skeleton/) into that directory.
		3. Decide whether to use the script's default naming convention or custom clause and annex names.
		    - Default: Name clauses clause-{number 4-20} and annexes annex-{letter a-z}. Clauses and annexes are arranged in order of their alphanumeric suffix.
		    - Custom: Create a file ordering JSON according to the [template](./templates/json/file_order.json) and provide it to the script with the `--file_order` argument.
		    - In either case, any other files that follow the naming convention clause-{some text} or annex-{some text} are ordered alphabetically after the predefined files in the appropriate section of the document.
		4. Create a media directory inside the directory created in step 1.
		    - As images are added to the Markdown, place each image in `PNG` and `EMF` form inside it.[^1]

		At this point, the document's clauses and annexes can be created. Observe the following considerations to ensure the document is created properly:
		- Ensure any files defined in a [file ordering JSON](./templates/json/file_order.json) are present in the source directory created in step 1, otherwise the script will print an error and quit prematurely.
		- Ensure that the [required files](./templates/document_skeleton/) are all present in the source directory, otherwise the script will print an error and quit prematurely.
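
The default naming convention from step 3 implies a numeric sort for clauses, so that `clause-10` follows `clause-9`, and an alphabetical sort for annexes. The following sketch shows one way to express that. `default_order` is a hypothetical helper, and the `.md` extension is an assumption.

```python
import re


def default_order(filenames):
    """Order clause-<number>.md numerically, followed by annex-<letter>.md alphabetically."""
    clauses, annexes = [], []
    for name in filenames:
        if m := re.fullmatch(r"clause-(\d+)\.md", name):
            # Sort on the integer value so that 10 comes after 9.
            clauses.append((int(m.group(1)), name))
        elif m := re.fullmatch(r"annex-([a-z])\.md", name):
            annexes.append((m.group(1), name))
    return [name for _, name in sorted(clauses)] + [name for _, name in sorted(annexes)]
```

Files that match neither pattern are ignored here. The sketch does not cover the alphabetical placement of other `clause-{some text}` and `annex-{some text}` files.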

		#### 2.2.1.2 Generate "dirty" HTML

		##### 1) Preprocess {docname}.docx

		`preprocessing.py {docname}.docx`

		##### 2) Copy customCSS.css to the document's directory

		`cp customCSS.css GENERATED_FILES/{docname}/customCSS.css`

		##### 3) Prepare the images in {docname}.docx for the HTML

		1. `pandoc --extract-media GENERATED_FILES/{docname}/temp -f docx -t html GENERATED_FILES/{docname}/temp/temp.docx -o GENERATED_FILES/{docname}/temp/{docname}.html`

		2. `libreoffice --headless --convert-to png --outdir GENERATED_FILES/{docname}/temp/media GENERATED_FILES/{docname}/temp/media/*.emf`

		3. Run either of the following, depending on the installed ImageMagick version:
		    - `mogrify -trim GENERATED_FILES/{docname}/temp/media/*.png`
		    - `magick mogrify -trim GENERATED_FILES/{docname}/temp/media/*.png`

		##### 4) Convert to HTML using Pandoc

		`pandoc --resource-path GENERATED_FILES/{docname}/temp -f docx -t chunkedhtml -L filter_1.lua -L filter_2.lua --css=customCSS.css --css="{docname}.css" -s GENERATED_FILES/{docname}/temp/temp.docx -o GENERATED_FILES/{docname}/html_dirty --toc --toc-depth 4 --template=official.html --split-level=1`

		##### 5) Delete temporary files that are no longer needed

		`rm -r GENERATED_FILES/{docname}/temp`
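
Because every command in this procedure differs only in `{docname}`, the whole sequence can be rendered from templates. The sketch below only builds the command strings and does not run anything. `COMMAND_TEMPLATES` and `render_commands` are hypothetical names, and the plain `mogrify` variant is chosen arbitrarily.

```python
# Command templates copied from the steps above; {docname} is the only placeholder.
COMMAND_TEMPLATES = [
    "preprocessing.py {docname}.docx",
    "cp customCSS.css GENERATED_FILES/{docname}/customCSS.css",
    "pandoc --extract-media GENERATED_FILES/{docname}/temp -f docx -t html "
    "GENERATED_FILES/{docname}/temp/temp.docx -o GENERATED_FILES/{docname}/temp/{docname}.html",
    "libreoffice --headless --convert-to png --outdir GENERATED_FILES/{docname}/temp/media "
    "GENERATED_FILES/{docname}/temp/media/*.emf",
    "mogrify -trim GENERATED_FILES/{docname}/temp/media/*.png",
    "pandoc --resource-path GENERATED_FILES/{docname}/temp -f docx -t chunkedhtml "
    '-L filter_1.lua -L filter_2.lua --css=customCSS.css --css="{docname}.css" '
    "-s GENERATED_FILES/{docname}/temp/temp.docx -o GENERATED_FILES/{docname}/html_dirty "
    "--toc --toc-depth 4 --template=official.html --split-level=1",
    "rm -r GENERATED_FILES/{docname}/temp",
]


def render_commands(docname):
    """Return the documented commands, in order, with {docname} filled in."""
    return [template.replace("{docname}", docname) for template in COMMAND_TEMPLATES]
```

The wildcard patterns (`*.emf`, `*.png`) rely on shell expansion, so the rendered strings are meant to be pasted into a shell rather than passed to a process as a plain argument list.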

		### 2.2.2 Dirty HTML to Markdown

		---

		Starting with "dirty" HTML contained in the default source location (that is, _GENERATED_FILES_/_{docname}_/_html_dirty_), convert to "clean" Markdown, which will be contained in _GENERATED_FILES_/_{docname}_/_md_.

		`convert.py --frm html_dirty --to md --folder {docname}`

		Specify a different directory containing the "dirty" HTML files.

		`convert.py --frm html_dirty --to md --folder {docname} --src relative/or/absolute/source/path`

		### 2.2.3 Markdown to HTML

		---

		Starting with Markdown files contained in the default source location (_GENERATED_FILES_/_{docname}_/_md_), convert to HTML. The Markdown is assumed to be in a "clean" state.

		`convert.py --frm md --to html --folder {docname}`

		Specify a different directory containing the Markdown files.

		`convert.py --frm md --to html --folder {docname} --src relative/or/absolute/source/path`

		### 2.2.4 HTML to Docx

		---

		Starting with HTML files contained in the default source location (_GENERATED_FILES_/_{docname}_/_html_), convert to Docx.

		`convert.py --frm html --to docx --folder {docname}`

		Specify a different directory containing the HTML files.

		`convert.py --frm html --to docx --folder {docname} --src relative/or/absolute/source/path`

		[^1]: Method subject to change