Commit 63a4ddc0 authored by Marco Cavalli

Merge branch 'md-to-docx-conversion-tool' into 'master'

Add initial templates for group report, group specification, technical report,...

See merge request !13
parents 2a4ca015 101fd3cf
+36 −0
@@ -45,3 +45,39 @@ jspm_packages/
.env
.next



# Folders
node_modules
public
media*
saved_files
official
editing
shared
.venv
.vscode
.envrc
__pycache__

# Input directory
md_to_docx_converter/INPUT/*
!md_to_docx_converter/INPUT/README.md
md_to_docx_converter/INPUT/file_order/*
!md_to_docx_converter/INPUT/file_order
!md_to_docx_converter/INPUT/file_order/README.md

# Output Files
md_to_docx_converter/background_highlight.css
md_to_docx_converter/temp.docx
md_to_docx_converter/API*.css
md_to_docx_converter/primer*.css
md_to_docx_converter/API*.html
md_to_docx_converter/primer*.html

md_to_docx_converter/GENERATED_FILES
editing.zip
official.zip
package.json
package-lock.json
md_to_docx_converter/public/API*.css
md_to_docx_converter/public/primer*.css
 No newline at end of file
+264 −0
.NF {
  font-family: Arial;
  font-size: 9pt;
  align-items: center;
  display: flex;
  gap: 4pt;
}

.TAH {
  font-family: Arial;
  font-size: 9pt;
  font-weight: bold;
  text-align: center;
}

.TAC {
  font-family: Arial;
  font-size: 9pt;
  text-align: center;
}

.TAL {
  font-family: Arial;
  font-size: 9pt;
}

.HTML_Variable {
  font-family: "Roboto Mono", Monospace;
  font-size: 8pt;
  font-style: italic;
}

.HTML_Keyboard {
  font-family: Times New Roman;
  font-size: 10pt;
  color: #595959;
}

.HTML_Definition {
  font-style: italic;
}

.TAN {
  font-family: Arial;
  font-size: 9pt;
  padding-left: 12pt;
  display: flex;
  gap: 12pt;
}

.HTML_Code {
  font-family: "Roboto Mono", Monospace;
  font-size: 8pt;
  color: #31849b;
}

.Plain_Text_Char {
  font-family: Courier New;
}

.HTML_Error {
  font-family: Times New Roman;
  font-weight: bold;
  font-style: italic;
  color: #595959;
}

.TAJ {
  font-family: Arial;
  font-size: 9pt;
  text-align: justify;
}

.FP {
}

.ZA {
  font-family: Arial;
  font-size: 20pt;
  text-align: right;
}

.ZGSM {
}

.ZT {
  font-family: Arial;
  font-size: 17pt;
  font-weight: bold;
  text-align: center;
}

.ZG {
  font-family: Arial;
  text-align: right;
}

.ZB {
  font-family: Arial;
  font-style: italic;
  text-align: right;
  font-family: Century Gothic;
  font-size: 16pt;
  font-weight: bold;
  color: #ffffff;
}

.H6 {
  font-family: Arial;
  font-size: 10pt;
  margin-top: 6pt;
  margin-bottom: 9pt;
}
.NO {
  padding-left: 12pt;
  margin-bottom: 9pt;
  display: flex;
  gap: 12pt;
}

.NO > p {
  margin: 0;
  padding: 0;
  white-space: pre-wrap;
}

.NO > p:first-of-type {
  white-space: nowrap;
}

.EX {
  padding-left: 12pt;
  margin-bottom: 9pt;
  display: flex;
  gap: 12pt;
  align-items: flex-start;
}

.EX > p {
  margin: 0;
  padding: 0;
  white-space: pre-wrap;
}

.EX > p:first-of-type {
  white-space: nowrap;
}

.EW {
  gap: 12pt;
  display: flex;
  align-items: flex-start;
}

.EW > div:first-child {
  min-width: 100px;
  flex-shrink: 0;
}

.EW > div:last-child {
  flex: 1;
}

.FL {
  font-family: Arial;
  font-weight: bold;
  text-align: center;
  margin-top: 3pt;
  margin-bottom: 9pt;
}

.TF {
  font-family: Arial;
  font-weight: bold;
  text-align: center;
  margin-bottom: 12pt;
}

.B1plus {
  margin-bottom: 9pt;
}

.TH {
  font-family: Arial;
  font-weight: bold;
  text-align: center;
  margin-top: 3pt;
  margin-bottom: 9pt;
}

.B2plus {
  margin-bottom: 9pt;
}

.HTML_Sample {
  font-family: Courier New;
  font-size: 8pt;
}

.B3 {
  margin-bottom: 9pt;
}

.B1 {
  margin-bottom: 9pt;
}

.PL {
  font-family: Courier New;
  font-size: 8pt;
}

.Hyperlink {
  text-decoration: underline;
  color: #0000ff;
}

.B3plus {
  margin-bottom: 9pt;
}

.B4 {
  margin-bottom: 9pt;
}

.B2 {
  margin-bottom: 9pt;
}

.B5 {
  margin-bottom: 9pt;
}

.BL {
  margin-bottom: 9pt;
}

.BN {
  margin-bottom: 9pt;
}

.image_overlay {
  visibility: hidden;
}

/* CUSTOM ETSI ONDEMAND STYLES FOR FRONTPAGE AND HISTORY TABLE */

.ondemand_CHAR_size_32 {
  font-size: 32pt;
}

.ondemand_CHAR_size_16 {
  font-size: 16pt;
}

.ondemand_CHAR_name_Arial_size_7 {
  font-family: Arial;
  font-size: 7pt;
}

.ondemand_PAR_space_before_3_after_3 {
  margin-top: 3pt;
  margin-bottom: 3pt;
}
+7 −0
# User Inputs Folder

This folder holds the input files that are supplied to the script.

## Contents

**file_order**: Contains file ordering JSON files that follow the [template](../templates/json/file_order.json).
 No newline at end of file
+65 −0
# File Orderings

Place JSON files here that define how files contained in the conversion source directory should be ordered.

### Considerations

#### Ensure files defined in the JSON exist in the source directory

Unlike the script's default file ordering, a provided file ordering causes the script to fail if it lists a file that is not present in the conversion source directory. Make sure that every file defined in the JSON exists.
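
A pre-flight check along the following lines can catch missing files before the script is run. This is a minimal sketch, not part of the tool: the function name `missing_files` is hypothetical, and it assumes that each JSON entry corresponds to a file named `<entry>.md` in the source directory.

```python
import json
from pathlib import Path


def missing_files(file_order_path, source_dir):
    """Return the entries of a file ordering JSON that have no matching .md file."""
    order = json.loads(Path(file_order_path).read_text(encoding="utf-8"))
    source = Path(source_dir)
    # Assumption: each entry names a Markdown file without its extension.
    entries = order.get("clauses", []) + order.get("annexes", [])
    return [name for name in entries if not (source / f"{name}.md").is_file()]
```

An empty list means that every file named in the JSON was found.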

#### Preceding numbers in top-level headings override the script ordering

In the top-level heading of each Markdown file, a leading number (for example, `# 4 Clause Heading`) overrides the file order defined in the script. It is important to ensure that:
1. No two Markdown source files use the same preceding number in their top-level heading. It is best to make sure the numbers correspond with the file's intended place in the overall hierarchy.
2. Numbering of non-standard clauses and annexes begins with **4**, because **1**, **2**, and **3** are reserved for the predefined clauses *Scope*, *References*, and *Definitions*.
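
The first rule can be checked mechanically. The sketch below is an illustration only, not the script's own logic: the helper names are hypothetical, and it assumes that the top-level heading is the first line beginning with `# ` and that the number directly follows the `#`.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches a top-level heading that starts with a number, e.g. "# 4 Clause Heading".
HEADING_NUMBER = re.compile(r"^#\s+(\d+)\b")


def leading_number(markdown_text):
    """Return the leading number of the first top-level heading, or None."""
    for line in markdown_text.splitlines():
        if line.startswith("# "):
            match = HEADING_NUMBER.match(line)
            return int(match.group(1)) if match else None
    return None


def duplicate_numbers(source_dir):
    """Map each heading number used by more than one file to those file names."""
    seen = defaultdict(list)
    for path in sorted(Path(source_dir).glob("*.md")):
        number = leading_number(path.read_text(encoding="utf-8"))
        if number is not None:
            seen[number].append(path.name)
    return {n: names for n, names in seen.items() if len(names) > 1}
```

An empty dictionary means that no two files share a leading number. The sketch does not skip `# ` lines inside fenced code blocks, so a file that opens with a code fence may need manual review.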

### Example

The following example specifies the ordering of a few example files. An [empty template](../../templates/json/file_order.json) is provided for convenience.

``` json
{
  "clauses": [
    "clause-example1",
    "clause-example3",
    "clause-example2",
    "example4"
  ],
  "annexes": [
    "annex-example1",
    "annex-3",
    "example2"
  ]
}
```

This produces the following file order:

1. **Universal ETSI initial files**
    1. Intellectual Property Rights
    2. Foreword
    3. Modal verbs terminology
    4. Executive summary
    5. Introduction
2. **ETSI universal initial clauses**
    1. Scope
    2. References
    3. Definition of terms, symbols and abbreviations
3. **Clauses defined in the JSON**
    1. *clause-example1.md*
    2. *clause-example3.md*
    3. *clause-example2.md*
    4. *example4.md*
4. **Other clauses**[^1]
5. **Annexes defined in the JSON**
    1. *annex-example1.md*
    2. *annex-3.md*
    3. *example2.md*
6. **Other annexes**[^2]
7. **Universal ETSI final files**
    1. History
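
Items 3 to 6 of this list can be sketched in a few lines of Python. This is an illustration under assumptions, not the script's implementation: `order_section` is a hypothetical helper, names are given without the `.md` extension, and `clause-extra` and `annex-extra` are made-up files that stand in for the "other" clauses and annexes.

```python
def order_section(defined, available, prefix):
    """Order one section: JSON-defined names first, then other prefixed names alphabetically."""
    extras = sorted(
        name for name in available
        if name.startswith(prefix) and name not in defined
    )
    return list(defined) + extras


file_order = {
    "clauses": ["clause-example1", "clause-example3", "clause-example2", "example4"],
    "annexes": ["annex-example1", "annex-3", "example2"],
}
# Hypothetical contents of the source directory (file names without ".md").
available = file_order["clauses"] + file_order["annexes"] + ["clause-extra", "annex-extra"]

clauses = order_section(file_order["clauses"], available, "clause-")
annexes = order_section(file_order["annexes"], available, "annex-")
print(clauses)
print(annexes)
```

The JSON-defined names keep the order given in the JSON, and only the remaining prefixed names are sorted.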

[^1]: If any other files exist in the conversion source directory whose filenames follow the format *clause-{text}.md*, they will be added alphabetically after the JSON-defined clauses.

[^2]: Similar to [^1], any files in the source directory that follow the format *annex-{text}.md* will be added alphabetically after the JSON-defined annexes.
 No newline at end of file
+217 −0
# 1. Introduction

## 1.1 Setup

### 1.1.1 Required Software

#### [Miniconda](https://docs.conda.io/projects/conda/en/stable/user-guide/install/index.html) (or any other Python environment manager)

Latest

#### [Pandoc](https://pandoc.org/installing.html)

Version 3.7.0.2

#### [LibreOffice (command line)](https://www.libreoffice.org/)

Latest

#### [ImageMagick](https://imagemagick.org/)

Latest

### 1.1.2 Create a virtual environment with Conda

The following steps use Miniconda. If you use another Python environment manager, skip to step 3 and install the requirements with pip.

1. Follow the [instructions](https://docs.conda.io/projects/conda/en/stable/user-guide/install/index.html) to set up Miniconda.

2. Create a virtual environment with Python version **3.10.16** to use with the script.

   1. `conda create -p path/to/environment_name python=3.10.16`
   2. `conda activate path/to/environment_name`

3. Install the requirements contained in _requirements.txt_.
   1. `cd path/to/ETSI-GS-CIM-009`
   2. `pip install -r requirements.txt`

### 1.1.3 Check Pandoc

Ensure Pandoc version **3.7.0.2** is installed.

`pandoc --version`

## 1.2 Initialize folders

Run the following command to create the necessary folder structure:

```bash
python init.py --folder {folder_name}
```

This will generate the folder structure required for the conversion process.

The expected folder structure is as follows:

```
Current folder
└── GENERATED_FILES
    └── {folder_name}
        ├── md
        │   └── media
        ├── customCSS.css
        └── ETSIstyles.css
```
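
For orientation, the layout above could be created with a few lines of Python. This is a hypothetical sketch, not the contents of `init.py`: the function name `init_folders` and its `root` parameter are assumptions, and the CSS files are created as empty placeholders, whereas the real script presumably provides its own stylesheets.

```python
from pathlib import Path


def init_folders(folder_name, root="."):
    """Create GENERATED_FILES/{folder_name}/md/media plus the two CSS files."""
    base = Path(root) / "GENERATED_FILES" / folder_name
    (base / "md" / "media").mkdir(parents=True, exist_ok=True)
    for css_name in ("customCSS.css", "ETSIstyles.css"):
        # Assumption: empty placeholders stand in for the real stylesheets.
        (base / css_name).touch(exist_ok=True)
    return base
```

Because of `exist_ok=True`, calling the function again on an existing folder leaves its contents untouched.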

## 1.3 Expected Folder Structure after conversion

After conversion, the folder structure is as follows:

```
Current folder
└── GENERATED_FILES
    └── {folder_name}
        ├── docx
        ├── html
        │   └── media
        ├── md
        │   └── media
        ├── customCSS.css
        └── ETSIstyles.css
```

Where:

- `{folder_name}` is the name of the folder you specified when running the `init.py` script.
- `md` is the folder containing the Markdown files that you work on.
  - `md/media` is the folder containing the media files (images) referenced in the Markdown files.
- `customCSS.css` is the CSS file that defines custom styles used for the HTML pages only. **This file is volatile, and is copied to the output directory during conversion.**
- `ETSIstyles.css` is the CSS file that represents the styles of an ETSI Word document. **This file is also volatile, and is copied to the output directory during conversion.**
- `html` is the folder generated from the Markdown files. **This folder is automatically created during conversion.**
  - `html/media` is the folder containing the media files (images) referenced in the HTML files. **This folder is automatically created during conversion.**
- `docx` is the folder generated from the HTML files. **This folder is automatically created during conversion.**

# 2. Usage

The script `convert.py` handles the following conversions:

| From       | To         |
| ---------- | ---------- |
| Dirty HTML | Markdown   |
| Markdown   | Clean HTML |
| Clean HTML | Docx       |

## 2.1 Arguments

`convert.py -h` - List the script's arguments

### 2.1.1 Required

- `--frm [file type]` - Source type: **html_dirty** for HTML that needs additional processing to produce the desired formatting, **html** for HTML that is already properly formatted, or **md** for Markdown.

- `--to` - Destination type: **html**, **md**, or **docx**. It must be different from the source file type provided with `--frm`.

- `--folder` - The name of the folder within _ETSI-GS-CIM-009/GENERATED_FILES_ to place the converted files
  - BEST PRACTICE: Use a short term by which the document can be uniquely identified, for example, _API_ or _primer_.

### 2.1.2 Optional

- `--src` - Provide the path of the source directory rather than allowing the script to generate it.
  - If this argument is not provided, the source path will be _ETSI-GS-CIM-009_/_GENERATED_FILES_/_{value provided via `--folder`}_/_{value provided via `--frm`}_.

- `--file_order` - Provide the path to a JSON file that defines the order in which user-created clauses and annexes should appear.
    - If this argument is not provided, the default ordering convention will be used. See [section 2.2.1.1](#2.2.1.1-preparation-create-markdown-from-scratch) for more details.
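
The arguments and the default source path can be pictured with a short `argparse` sketch. It is an illustration based only on the descriptions above, not the parser used by `convert.py`: the helper names `build_parser` and `resolve_source` are hypothetical, and the repository root is assumed to be the current directory.

```python
import argparse
from pathlib import Path


def build_parser():
    """Declare the arguments described in sections 2.1.1 and 2.1.2."""
    parser = argparse.ArgumentParser(prog="convert.py")
    parser.add_argument("--frm", required=True, choices=["html_dirty", "html", "md"])
    parser.add_argument("--to", required=True, choices=["html", "md", "docx"])
    parser.add_argument("--folder", required=True)
    parser.add_argument("--src")
    parser.add_argument("--file_order")
    return parser


def resolve_source(args, root="."):
    """Return the explicit --src path, or the default derived from --folder and --frm."""
    if args.frm == args.to:
        raise ValueError("--to must differ from --frm")
    if args.src:
        return Path(args.src)
    return Path(root) / "GENERATED_FILES" / args.folder / args.frm


args = build_parser().parse_args(["--frm", "md", "--to", "html", "--folder", "API"])
print(resolve_source(args))
```

Passing `--src some/dir` makes `resolve_source` return that path unchanged.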

## 2.2 Conversion

`convert.py` requires "dirty" HTML to convert to Markdown. Therefore, if the "dirty" HTML does not already exist, it must be generated from the document in docx format.

**NOTE**: For the following commands, `{docname}` refers to the filename of the document without the extension. For example, the `{docname}` _API_ refers to _API.docx_.

### 2.2.1 Preparation
---

Either create Markdown from scratch, or generate "dirty" HTML from an existing docx document so that it can be converted to "clean" Markdown.

#### 2.2.1.1 Create Markdown from scratch

1. If it does not already exist, create the directory *GENERATED_FILES*/*{docname}*/*md* in the directory that contains `convert.py`.
2. Copy the template files for the [required Markdown files](./templates/document_skeleton/).
3. Decide whether to use the script's default naming convention or custom clause and annex names.
    - Default: Name clauses *clause-{number 4-20}* and annexes *annex-{letter a-z}*. Clauses and annexes will be arranged in order according to the alphanumeric suffix.
    - Custom: Create a file ordering JSON according to the [template](./templates/json/file_order.json) and provide it to the script with the `--file_order` argument.
    - In either case, any other files that follow the naming convention *clause-{some text}* or *annex-{some text}* will be ordered alphabetically after the predefined files in the appropriate section of the document.
4. Create a *media* directory inside the directory created in step 1.
    - As images are added to the Markdown, place each image inside in `PNG` and `EMF` form.[^1]

At this point, the document's clauses and annexes can be created. Keep the following considerations in mind to ensure that the document is created properly:
- Ensure any files defined in a [file ordering JSON](./templates/json/file_order.json) are present in the source directory created in step 1, otherwise the script will print an error and quit prematurely.
- Ensure that the [required files](./templates/document_skeleton/) are all present in the source directory, otherwise the script will print an error and quit prematurely.
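
The default naming convention from step 3 can be sketched as follows. This is an assumption-laden illustration, not the script's code: the helper names are hypothetical, names are given without the `.md` extension, and numbered clauses are assumed to sort numerically, so that *clause-10* follows *clause-4*.

```python
import re

CLAUSE = re.compile(r"^clause-(\d+)$")
ANNEX = re.compile(r"^annex-([a-z])$")


def clause_number(name):
    """Return the clause number if the name is clause-4 to clause-20, else None."""
    match = CLAUSE.match(name)
    if match and 4 <= int(match.group(1)) <= 20:
        return int(match.group(1))
    return None


def default_order(names):
    """Order file names (without extension) following the default naming convention."""
    numbered = sorted((n for n in names if clause_number(n) is not None), key=clause_number)
    other_clauses = sorted(
        n for n in names if n.startswith("clause-") and clause_number(n) is None
    )
    lettered = sorted(n for n in names if ANNEX.match(n))
    other_annexes = sorted(
        n for n in names if n.startswith("annex-") and not ANNEX.match(n)
    )
    return numbered + other_clauses + lettered + other_annexes
```

Names that match neither prefix are left out of the result, since the text above does not say where such files would go.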

#### 2.2.1.2 Generate "dirty" HTML

##### 1) Preprocess *{docname}.docx*

`preprocessing.py {docname}.docx`

##### 2) Copy *customCSS.css* to the document's directory

`cp customCSS.css GENERATED_FILES/{docname}/customCSS.css`

##### 3) Prepare the images in *{docname}.docx* for the HTML

1. `pandoc --extract-media GENERATED_FILES/{docname}/temp -f docx -t html GENERATED_FILES/{docname}/temp/temp.docx -o GENERATED_FILES/{docname}/temp/{docname}.html`

2. `libreoffice --headless --convert-to png --outdir GENERATED_FILES/{docname}/temp/media GENERATED_FILES/{docname}/temp/media/*.emf`

3. Either of the following...
   - `mogrify -trim GENERATED_FILES/{docname}/temp/media/*.png`
   - `magick mogrify -trim GENERATED_FILES/{docname}/temp/media/*.png`

##### 4) Convert to HTML using Pandoc

`pandoc --resource-path GENERATED_FILES/{docname}/temp -f docx -t chunkedhtml -L filter_1.lua -L filter_2.lua --css=customCSS.css --css="{docname}.css" -s GENERATED_FILES/{docname}/temp/temp.docx -o GENERATED_FILES/{docname}/html_dirty --toc --toc-depth 4 --template=official.html --split-level=1`

##### 5) Delete temporary files that are no longer needed

`rm -r GENERATED_FILES/{docname}/temp`

### 2.2.2 Dirty HTML to Markdown

---

Starting with "dirty" HTML contained in the default source location (that is, _GENERATED_FILES_/_{docname}_/_html_dirty_), convert to "clean" Markdown, which will be contained in _GENERATED_FILES_/_{docname}_/_md_.

`convert.py --frm html_dirty --to md --folder {docname}`

To use a different directory containing the "dirty" HTML files, add `--src`:

`convert.py --frm html_dirty --to md --folder {docname} --src relative/or/absolute/source/path`

### 2.2.3 Markdown to HTML

---

Starting with Markdown files contained in the default source location (_GENERATED_FILES_/_{docname}_/_md_), convert to HTML. The Markdown is assumed to be in a "clean" state.

`convert.py --frm md --to html --folder {docname}`

To use a different directory containing the Markdown files, add `--src`:

`convert.py --frm md --to html --folder {docname} --src relative/or/absolute/source/path`

### 2.2.4 HTML to Docx

---

Starting with HTML files contained in the default source location (_GENERATED_FILES_/_{docname}_/_html_), convert to Docx.

`convert.py --frm html --to docx --folder {docname}`

To use a different directory containing the HTML files, add `--src`:

`convert.py --frm html --to docx --folder {docname} --src relative/or/absolute/source/path`
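
The three conversions in sections 2.2.2 to 2.2.4 can be chained from a small driver. The following is a sketch under assumptions, not a script shipped with the tool: `stage_commands` and `run_pipeline` are hypothetical names, and `convert.py` is assumed to be in the current directory and runnable with the active Python interpreter.

```python
import subprocess
import sys

# The three conversion stages from sections 2.2.2 to 2.2.4, in order.
STAGES = [("html_dirty", "md"), ("md", "html"), ("html", "docx")]


def stage_commands(docname, start="html_dirty"):
    """Build one convert.py command per stage, beginning at the given source type."""
    sources = [frm for frm, _ in STAGES]
    first = sources.index(start)
    return [
        [sys.executable, "convert.py", "--frm", frm, "--to", to, "--folder", docname]
        for frm, to in STAGES[first:]
    ]


def run_pipeline(docname, start="html_dirty"):
    """Run each stage in turn and stop at the first failure."""
    for command in stage_commands(docname, start):
        subprocess.run(command, check=True)
```

For example, `run_pipeline("API", start="md")` would run the Markdown to HTML stage and then the HTML to Docx stage, each reading from its default source location.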

[^1]: Method subject to change.