chore: update readme (a650bd54) · Commits · CIM - Context Information Management / NGSI-LD API

md_to_docx_converter/README.md

+55 −54

		@@ -57,29 +57,7 @@ Ensure Pandoc version 3.7.0.2 is installed.

		`pandoc --version`

		## 1.2 Initialize folders

		Run the following command to create the necessary folder structure:

		```bash
		python init.py --folder {folder_name}
		```

		This will generate the folder structure required for the conversion process.

		The expected folder structure is as follows:

		```
		Current folder
		└── GENERATED_FILES
		└── {folder_name}
		├── md
		└── media
		├── customCSS.css
		├── ETSIstyles.css
		```

		### 1.2.1 For Use with WSL
		## 1.2 For Use with WSL

		To ensure the script runs correctly with WSL 1[^1], do one of the following...

		@@ -159,20 +137,45 @@ The script `convert.py` handles the following conversions:

		## 2.2 Conversion

		`convert.py` requires "dirty" HTML to convert to Markdown. Therefore, if the "dirty" HTML does not already exist, it must be generated from the document in docx format.
		`convert.py` converts documents between formats: Markdown to HTML, HTML to DOCX, and "dirty" HTML to "clean" Markdown (useful when starting a new document from content in an existing DOCX file).

		NOTE: For the following commands, `{docname}` refers to the filename of the document without the extension. For example, the `{docname}` _API_ refers to _API.docx_.
		### 2.2.0 General Usage (Markdown to HTML to DOCX)

		### 2.2.1 Preparation: Generate "dirty" HTML
		The general usage pattern is as follows:

		### 2.2.1 Preparation
		---
		From md to html (validation of Markdown):

		`convert.py --frm md --to html --folder {folder_name} --src relative/or/absolute/source/path`

		NOTE: The `--src` argument is optional. Use it when the Markdown files are located in a different directory (e.g. a dedicated repository or workspace). It shall point to the directory containing the Markdown files and the media folder.

		NOTE 2: If `--src` is omitted, the script expects to find an "md" folder under "./GENERATED_FILES/<folder_name>" containing the Markdown files and the media folder.
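
The two notes above describe a simple fallback rule for locating the Markdown files. A minimal sketch of that rule follows; `resolve_md_source` is a hypothetical helper written for illustration and is not taken from `convert.py`, whose actual implementation may differ.

```python
from pathlib import Path
from typing import Optional


def resolve_md_source(folder_name: str, src: Optional[str] = None) -> Path:
    """Return the directory expected to hold the Markdown files.

    Hypothetical helper mirroring the notes above; not part of convert.py.
    """
    if src:
        # --src given: the relative or absolute path is used as-is.
        return Path(src)
    # --src omitted: fall back to the default location.
    return Path("GENERATED_FILES") / folder_name / "md"
```

For example, `resolve_md_source("API")` yields the default `GENERATED_FILES/API/md`, while `resolve_md_source("API", "../my-repo/docs")` returns the path passed in.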


		From html to docx (generation of DOCX):

		`convert.py --frm html --to docx --folder {folder_name}`
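
The two commands above can be chained from a small wrapper. The sketch below is illustrative only: `build_convert_command` and `md_to_docx` are hypothetical names, and invoking the script through the current Python interpreter from the directory containing `convert.py` is an assumption.

```python
import subprocess
import sys
from typing import List, Optional


def build_convert_command(frm: str, to: str, folder: str,
                          src: Optional[str] = None) -> List[str]:
    """Build the argument list for one convert.py invocation."""
    cmd = [sys.executable, "convert.py",
           "--frm", frm, "--to", to, "--folder", folder]
    if src is not None:
        # --src is optional; only add it when a source path was supplied.
        cmd += ["--src", src]
    return cmd


def md_to_docx(folder: str, src: Optional[str] = None) -> None:
    """Run both stages in order, stopping at the first failure."""
    # Stage 1: Markdown to HTML (validation of the Markdown).
    subprocess.run(build_convert_command("md", "html", folder, src), check=True)
    # Stage 2: HTML to DOCX (generation of the DOCX).
    subprocess.run(build_convert_command("html", "docx", folder), check=True)
```

Using `check=True` means a failed Markdown validation prevents the DOCX stage from running on stale HTML.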


		### 2.2.1 Starting from scratch or existing DOCX to Markdown

		Run the following command to create the necessary folder structure:

		```bash
		python init.py --folder {folder_name}
		```

		Create Markdown from scratch or convert "dirty" HTML to "clean" Markdown
		This will generate the folder structure required for the conversion process.
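
As a rough illustration of the result, the sketch below creates only the `md` directory that the rest of this section relies on. `init_folder` is a hypothetical function, not the contents of `init.py`, which may create additional folders and files.

```python
from pathlib import Path


def init_folder(folder_name: str, root: Path = Path(".")) -> Path:
    """Create GENERATED_FILES/{folder_name}/md under root and return it.

    Assumption: this covers only the md directory; the real init.py
    may set up more than this.
    """
    md_dir = root / "GENERATED_FILES" / folder_name / "md"
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
    md_dir.mkdir(parents=True, exist_ok=True)
    return md_dir
```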

		From now on, either of the following two paths can be taken:
		- Create Markdown files from scratch, see [section 2.2.1.1](#2211---create-markdown-from-scratch)
		- Convert an existing DOCX file to Markdown, see [section 2.2.1.2](#2212---convert-docx-to-markdown)

		In both cases, you will end up with Markdown files in GENERATED_FILES/{folder_name}/md. Then proceed to [section 2.2.2](#222---dirty-html-to-markdown) to convert the "dirty" HTML to "clean" Markdown.

		#### 2.2.1.1 Create Markdown from scratch

		1. If it does not already exist, create the directory GENERATED_FILES/{docname}/md where `convert.py` is.
		1. If it does not already exist, create the directory GENERATED_FILES/{folder_name}/md in the directory containing `convert.py` by running the init script.
		2. Copy the template files from [Required Markdown files](./templates/document_skeleton/)
		3. Decide whether to use the script's default naming convention or custom clause and annex names.
		- Default: Name clauses clause-{number 4-20} and annexes annex-{letter a-z}. Clauses and annexes will be arranged in order according to the alphanumeric suffix.
		@@ -187,67 +190,65 @@ At this point, the document's clauses and annexes can be created. It is importan

		#### 2.2.1.2 Generate "dirty" HTML

		##### 1) Preprocess {docname}.docx
		##### 1) Preprocess {folder_name}.docx

		`preprocessing.py {docname}.docx`
		`preprocessing.py {folder_name}.docx`

		##### 2) Copy customCSS.css to the document's directory

		`cp customCSS.css GENERATED_FILES/{docname}/customCSS.css`
		`cp customCSS.css GENERATED_FILES/{folder_name}/customCSS.css`

		##### 3) Prepare the images in {docname}.docx for the HTML
		##### 3) Prepare the images in {folder_name}.docx for the HTML

		1. `pandoc --extract-media GENERATED_FILES/{docname}/temp -f docx -t html GENERATED_FILES/{docname}/temp/temp.docx -o GENERATED_FILES/{docname}/temp/{docname}.html`
		1. `pandoc --extract-media GENERATED_FILES/{folder_name}/temp -f docx -t html GENERATED_FILES/{folder_name}/temp/temp.docx -o GENERATED_FILES/{folder_name}/temp/{folder_name}.html`

		2. `libreoffice --headless --convert-to png --outdir GENERATED_FILES/{docname}/temp/media GENERATED_FILES/{docname}/temp/media/*.emf`
		2. `libreoffice --headless --convert-to png --outdir GENERATED_FILES/{folder_name}/temp/media GENERATED_FILES/{folder_name}/temp/media/*.emf`

		3. Either of the following...
		- `mogrify -trim GENERATED_FILES/{docname}/temp/media/*.png`
		- `magick mogrify -trim GENERATED_FILES/{docname}/temp/media/*.png`
		- `mogrify -trim GENERATED_FILES/{folder_name}/temp/media/*.png`
		- `magick mogrify -trim GENERATED_FILES/{folder_name}/temp/media/*.png`

		#### Convert to HTML using Pandoc

		`pandoc --resource-path GENERATED_FILES/{docname}/temp -f docx -t chunkedhtml -L filter_1.lua -L filter_2.lua --css=customCSS.css --css="{docname}.css" -s GENERATED_FILES/{docname}/temp/temp.docx -o GENERATED_FILES/{docname}/html_dirty --toc --toc-depth 4 --template=official.html --split-level=1`
		`pandoc --resource-path GENERATED_FILES/{folder_name}/temp -f docx -t chunkedhtml -L filter_1.lua -L filter_2.lua --css=customCSS.css --css="{folder_name}.css" -s GENERATED_FILES/{folder_name}/temp/temp.docx -o GENERATED_FILES/{folder_name}/html_dirty --toc --toc-depth 4 --template=official.html --split-level=1`
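
When scripting this step, it can help to build the long Pandoc command from the folder name rather than typing it by hand. The sketch below only assembles the argument list shown above; `dirty_html_command` is a hypothetical helper, and it assumes the filters, CSS files, and template are reachable from the working directory.

```python
from typing import List


def dirty_html_command(folder_name: str) -> List[str]:
    """Assemble the Pandoc call that produces the "dirty" chunked HTML."""
    base = f"GENERATED_FILES/{folder_name}"
    return [
        "pandoc",
        "--resource-path", f"{base}/temp",
        "-f", "docx",
        "-t", "chunkedhtml",
        "-L", "filter_1.lua",
        "-L", "filter_2.lua",
        "--css=customCSS.css",
        f"--css={folder_name}.css",
        "-s", f"{base}/temp/temp.docx",
        "-o", f"{base}/html_dirty",
        "--toc",
        "--toc-depth", "4",
        "--template=official.html",
        "--split-level=1",
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)`; passing a list avoids shell quoting issues if the folder name contains spaces.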

		#### Delete temporary files that are no longer needed

		`rm -r GENERATED_FILES/{docname}/temp`
		`rm -r GENERATED_FILES/{folder_name}/temp`

		### 2.2.2 Dirty HTML to Markdown

		---
		#### Generate Markdown from the "dirty" HTML

		Starting with "dirty" HTML contained in the default source location (that is, _GENERATED_FILES_/_{docname}_/_html_dirty_), convert to "clean" Markdown, which will be contained in _GENERATED_FILES_/_{docname}_/_md_.
		Starting with "dirty" HTML contained in the default source location (that is, _GENERATED_FILES_/_{folder_name}_/_html_dirty_), convert to "clean" Markdown, which will be contained in _GENERATED_FILES_/_{folder_name}_/_md_.

		`convert.py --frm html_dirty --to md --folder {docname}`
		`convert.py --frm html_dirty --to md --folder {folder_name}`

		Specify a different directory containing the "dirty" HTML files.

		`convert.py --frm html_dirty --to md --folder {docname} --src relative/or/absolute/source/path`
		`convert.py --frm html_dirty --to md --folder {folder_name} --src relative/or/absolute/source/path`

		### 2.2.3 Markdown to HTML
		### 2.2.2 Markdown to HTML

		---

		Starting with Markdown files contained in the default source location (_GENERATED_FILES_/_{docname}_/_md_), convert to HTML. The Markdown is assumed to be in a "clean" state.
		Starting with Markdown files contained in the default source location (_GENERATED_FILES_/_{folder_name}_/_md_), convert to HTML. The Markdown is assumed to be in a "clean" state.

		`convert.py --frm md --to html --folder {docname}`
		`convert.py --frm md --to html --folder {folder_name}`

		Specify a different directory containing the Markdown files.

		`convert.py --frm md --to html --folder {docname} --src relative/or/absolute/source/path`
		`convert.py --frm md --to html --folder {folder_name} --src relative/or/absolute/source/path`

		### 2.2.4 HTML to Docx
		### 2.2.3 HTML to Docx

		---

		Starting with HTML files contained in the default source location (_GENERATED_FILES_/_{docname}_/_html_), convert to Docx.
		Starting with HTML files contained in the default source location (_GENERATED_FILES_/_{folder_name}_/_html_), convert to Docx.

		`convert.py --frm html --to docx --folder {docname}`
		`convert.py --frm html --to docx --folder {folder_name}`

		Specify a different directory containing the HTML files.

		`convert.py --frm html --to docx --folder {docname} --src relative/or/absolute/source/path`
		`convert.py --frm html --to docx --folder {folder_name} --src relative/or/absolute/source/path`

		[^1]: These steps may not be necessary with WSL 2, but it is recommended to follow them nevertheless.