Metadata-Version: 2.4
Name: rfc3987-syntax
Version: 1.1.0
Summary: Helper functions to syntactically validate strings according to RFC 3987.
Project-URL: Homepage, https://github.com/willynilly/rfc3987-syntax
Project-URL: Documentation, https://github.com/willynilly/rfc3987-syntax#readme
Project-URL: Issues, https://github.com/willynilly/rfc3987-syntax/issues
Project-URL: Source, https://github.com/willynilly/rfc3987-syntax
Author: Jan Kowalleck
Author-email: Will Riley <wanderingwill@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: RFC 3987,RFC3987,parser,syntax,validator
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development
Classifier: Topic :: Utilities
Requires-Python: >=3.9
Requires-Dist: lark>=1.2.2
Provides-Extra: testing
Requires-Dist: pytest>=8.3.5; extra == 'testing'
Description-Content-Type: text/markdown
# rfc3987-syntax
Helper functions to parse and validate the **syntax** of terms defined in **[RFC 3987](https://www.rfc-editor.org/info/rfc3987)** — the IETF standard for Internationalized Resource Identifiers (IRIs).
## 🎯 Purpose
The goal of `rfc3987-syntax` is to provide a **lightweight, permissively licensed Python module** for validating that strings conform to the **ABNF grammar defined in RFC 3987**. These helpers are:
- ✅ Strictly aligned with the **syntax rules of RFC 3987**
- ✅ Built using a **permissive MIT license**
- ✅ Designed for both **open source and proprietary use**
- ✅ Powered by [Lark](https://github.com/lark-parser/lark), a fast, EBNF-based parser
> 🧠 **Note:** This project focuses on **syntax validation only**. RFC 3987 specifies **additional semantic rules** (e.g., Unicode normalization, BiDi constraints, percent-encoding requirements) that must be enforced separately.
## 📄 License, Attribution, and Citation
**`rfc3987-syntax`** is licensed under the [MIT License](LICENSE), which allows reuse in both open source and commercial software.
This project:
- ❌ Does **not** depend on the `rfc3987` Python package (GPL-licensed)
- ✅ Uses [`lark`](https://github.com/lark-parser/lark), licensed under MIT
- ✅ Implements grammar from **[RFC 3987](https://datatracker.ietf.org/doc/html/rfc3987)**, using **[RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986)** where RFC 3987 delegates syntax
> ⚠️ This project is **not affiliated with or endorsed by** the authors of RFC 3987 or the `rfc3987` Python package.
Please cite this software in accordance with the enclosed CITATION.cff file.
## ⚠️ Limitations
The grammar and parser enforce **only the ABNF syntax** defined in RFC 3987. The following are **not validated** and must be handled separately for full compliance:
- ❌ Unicode **Normalization Form C (NFC)**
- ❌ Bidirectional text (**BiDi**) constraints (RFC 3987 §4.1)
- ❌ **Port number ranges** (must be 0 to 65535)
- ❌ Valid **IPv6 compression** (only one `::`, max segments)
- ❌ Context-aware **percent-encoding** requirements
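
Some of the checks listed above can be approximated with the Python standard library once the relevant component strings are in hand. The following is a minimal sketch, not part of `rfc3987-syntax`: the helper names `is_nfc`, `port_in_range`, and `is_well_formed_ipv6` are hypothetical, and extracting the port or host text from a parsed IRI is assumed to happen elsewhere. Note that `ipaddress.IPv6Address` applies the standard library's own notion of a valid address, which may differ in details from RFC 3986.

```python
import ipaddress
import unicodedata


def is_nfc(value: str) -> bool:
    """Return True if the string is already in Unicode Normalization Form C."""
    return unicodedata.is_normalized("NFC", value)


def port_in_range(port: str) -> bool:
    """Return True for an empty port or a decimal port in 0..65535."""
    if port == "":
        return True  # RFC 3986 allows an empty port after the colon
    if not (port.isascii() and port.isdigit()):
        return False
    digits = port.lstrip("0") or "0"  # ignore leading zeros before the size check
    return len(digits) <= 5 and int(digits) <= 65535


def is_well_formed_ipv6(address: str) -> bool:
    """Return True if the text (without brackets) parses as an IPv6 address."""
    try:
        ipaddress.IPv6Address(address)
    except ValueError:
        return False
    return True
```

A caller would typically run `is_valid_syntax` first and apply checks like these only to strings that already pass the grammar.
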
ChatGPT 4o was used during the original development process. Errors may exist due to this assistance. Additional review, testing, and bug fixes by human experts are welcome.
## 📦 Installation
```bash
pip install rfc3987-syntax
```
## 🛠 Usage
### List all supported "terms" (i.e., non-terminals and terminals within ABNF production rules) used to validate the syntax of an IRI according to RFC 3987
```python
from rfc3987_syntax import RFC3987_SYNTAX_TERMS

print("Supported terms:")
for term in RFC3987_SYNTAX_TERMS:
    print(term)
```
### Syntactically validate a string using the general-purpose validator
```python
from rfc3987_syntax import is_valid_syntax

if is_valid_syntax(term='iri', value='http://github.com'):
    print("✓ Valid IRI syntax")

if not is_valid_syntax(term='iri', value='bob'):
    print("✗ Invalid IRI syntax")

# 'bob' has no scheme, so it is not an IRI, but it is a valid relative IRI reference
if is_valid_syntax(term='iri_reference', value='bob'):
    print("✓ Valid IRI-reference syntax")
```
### Alternatively, use term-specific helpers to validate RFC 3987 syntax.
```python
from rfc3987_syntax import is_valid_syntax_iri
from rfc3987_syntax import is_valid_syntax_iri_reference

if is_valid_syntax_iri('http://github.com'):
    print("✓ Valid IRI syntax")

if not is_valid_syntax_iri('bob'):
    print("✗ Invalid IRI syntax")

if is_valid_syntax_iri_reference('bob'):
    print("✓ Valid IRI-reference syntax")
```
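When many strings need checking at once, a small wrapper around any of the boolean validators can separate the passing values from the failing ones. This is a sketch only: `partition_by_syntax` is a hypothetical helper, not part of `rfc3987-syntax`, and it accepts the validator as a parameter so that it has no dependency of its own.

```python
from typing import Callable, Iterable


def partition_by_syntax(
    values: Iterable[str],
    validator: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Split values into (valid, invalid) lists according to validator."""
    valid: list[str] = []
    invalid: list[str] = []
    for value in values:
        # Append to whichever list matches the validator's verdict
        (valid if validator(value) else invalid).append(value)
    return valid, invalid
```

With the package installed, passing `is_valid_syntax_iri` as the `validator` argument would group a list of candidate strings by whether they match the `iri` rule.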
### Get the Lark parse tree for a syntax validation (useful for additional semantic validation)
```python
from lark import ParseTree  # type used in the annotation below
from rfc3987_syntax import parse

ptree: ParseTree = parse(term="iri", value="http://github.com")
print(ptree)
```
## 📚 Sources
This grammar was derived from:
- **RFC 3987 Internationalized Resource Identifiers (IRIs)**
  → Defines IRI syntax and extensions to URI (e.g. Unicode characters, `ucschar`)
  → https://datatracker.ietf.org/doc/html/rfc3987
- **RFC 3986 Uniform Resource Identifier (URI): Generic Syntax**
  → Provides reusable components like `scheme`, `authority`, `ipv4address`, etc.
  → https://datatracker.ietf.org/doc/html/rfc3986
> 📝 When `RFC 3986` is listed as the source, it is **used in accordance with RFC 3987**, which explicitly references it for foundational elements.
### Rule-to-Source Mapping
| Rule/Component | Source | Notes |
|----------------------|------------|-------|
| `iri` | RFC 3987 | Top-level IRI rule |
| `iri_reference` | RFC 3987 | Top-level IRI Reference rule |
| `absolute_iri` | RFC 3987 | Top-level Absolute IRI rule |
| `scheme` | RFC 3986 | Referenced by RFC 3987 §2.2 |
| `ihier_part` | RFC 3987 | IRI-specific hierarchy |
| `irelative_ref` | RFC 3987 | IRI-specific relative ref |
| `irelative_part` | RFC 3987 | IRI-specific relative part |
| `iauthority` | RFC 3986 | Standard URI authority |
| `ipath_abempty` | RFC 3986 | Path format variant |
| `ipath_absolute` | RFC 3986 | Absolute path |
| `ipath_noscheme` | RFC 3986 | Path disallowing scheme prefix |
| `ipath_rootless`     | RFC 3986   | Path that begins with a segment rather than `/` |
| `iquery` | RFC 3987 | Query extension to URI |
| `ifragment` | RFC 3987 | Fragment extension to URI |
| `ipchar`, `isegment` | RFC 3986 | Path characters and segments |
| `isegment_nz_nc` | RFC 3987 | IRI-specific path constraint |
| `iunreserved` | RFC 3987 | Includes `ucschar` |
| `ucschar`, `iprivate`| RFC 3987 | Unicode support |
| `sub_delims` | RFC 3986 | Reserved characters |
| `ip_literal` | RFC 3986 | IPv6 or IPvFuture in `[]` |
| `ipv6address` | RFC 3986 | Expanded forms only |
| `ipvfuture` | RFC 3986 | Forward-compatible |
| `ipv4address` | RFC 3986 | Dotted-decimal IPv4 |
| `ls32` | RFC 3986 | Final 32 bits of IPv6 |
| `h16`, `dec_octet` | RFC 3986 | Hex and decimal chunks |
| `port` | RFC 3986 | Optional numeric |
| `pct_encoded` | RFC 3986 | Percent encoding (e.g. `%20`) |
| `alpha`, `digit`, `hexdig` | RFC 3986 | Character classes |
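
To make a few rows of the table concrete, the sketch below transcribes some of the simpler RFC 3986 rules (`scheme`, `pct-encoded`, `dec-octet`, `IPv4address`) into Python regular expressions. It is an illustration of the ABNF only; it is not how `rfc3987-syntax` implements these rules (the package uses a Lark grammar), and the constant names are hypothetical.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

# pct-encoded = "%" HEXDIG HEXDIG
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")

# dec-octet = 0-9 / 10-99 / 100-199 / 200-249 / 250-255 (no leading zeros)
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def matches(pattern: re.Pattern, value: str) -> bool:
    """Return True only if the whole string matches the rule."""
    return pattern.fullmatch(value) is not None
```

For example, `matches(IPV4ADDRESS, "192.168.0.1")` holds, while `"256.1.1.1"` and `"01.2.3.4"` are rejected because `dec-octet` allows neither values above 255 nor leading zeros.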