LMT: Tailor hyper-parameters towards higher accuracy (and longer training time)

Also, allow users to override them
author: Birte Kristina Friesel <birte.friesel@uos.de> 2024-01-25 09:43:35 +0100
committer: Birte Kristina Friesel <birte.friesel@uos.de> 2024-01-25 09:55:36 +0100
commit: eb34056dbf9e10be7bed3835600f4edd7f7a1ef3 (patch)
tree: 31a181ed3d2bd36322fcaee19967e5a7b7edae33
parent: 9e8f9774c42cac0904d56d8edfd4abd4b2b717d1 (diff)
2 files changed, 60 insertions, 2 deletions
diff --git a/README.md b/README.md
index a95718e..76074cb 100644
--- a/README.md
+++ b/README.md
@@ -116,8 +116,13 @@ The following variables may be set to alter the behaviour of dfatool components.
 | `DFATOOL_DTREE_FUNCTION_LEAVES` | 0, **1** | Use functions (fitted via linear regression) in decision tree leaves when modeling numeric parameters with at least three distinct values. If 0, integer parameters are treated as enums instead. |
 | `DFATOOL_DTREE_SKLEARN_CART` | **0**, 1 | Use sklearn CART ("Decision Tree Regression") algorithm for decision tree generation. Uses binary nodes and supports splits on scalar variables. Overrides `FUNCTION_LEAVES` (=0) and `NONBINARY_NODES` (=0). |
 | `DFATOOL_DTREE_SKLEARN_DECART` | **0**, 1 | Use sklearn CART ("Decision Tree Regression") algorithm for decision tree generation. Ignore scalar parameters, thus emulating the DECART algorithm. |
-| `DFATOOL_DTREE_LMT` | **0**, 1 | Use [Linear Model Tree](https://github.com/cerlymarco/linear-tree) algorithm for regression tree generation. Uses binary nodes and linear functions. Overrides `FUNCTION_LEAVES` (=0) and `NONBINARY_NODES` (=0). |
 | `DFATOOL_CART_MAX_DEPTH` | **0** .. *n* | maximum depth for sklearn CART. Default (0): unlimited. |
+| `DFATOOL_DTREE_LMT` | **0**, 1 | Use [Linear Model Tree](https://github.com/cerlymarco/linear-tree) algorithm for regression tree generation. Uses binary nodes and linear functions. Overrides `FUNCTION_LEAVES` (=0) and `NONBINARY_NODES` (=0). |
+| `DFATOOL_LMT_MAX_DEPTH` | 5 .. **20** | Maximum depth for LMT. LMT default: 5 |
+| `DFATOOL_LMT_MIN_SAMPLES_SPLIT` | 0.0 .. 1.0, **6** .. *n* | Minimum samples required to still perform an LMT split. LMT default: 6 |
+| `DFATOOL_LMT_MIN_SAMPLES_LEAF` | 0.0 .. 1.0, **3** .. *n* | Minimum samples that each leaf of a split candidate must contain. LMT default: 0.1 (10% of *n samples*) |
+| `DFATOOL_LMT_MAX_BINS` | 10 .. **120** | Number of bins used to determine optimal split. LMT default: 25. |
+| `DFATOOL_LMT_CRITERION` | **mse**, rmse, mae, poisson | Error metric to use when selecting best split. LMT default: mse |
 | `DFATOOL_ULS_ERROR_METRIC` | **ssr**, rmsd, mae, … | Error metric to use when selecting best-fitting function during unsupervised least squares (ULS) regression. Least squares regression itself minimizes root mean square deviation (rmsd), hence the equivalent (but partitioning-compatible) sum of squared residuals (ssr) is the default. Supports all metrics accepted by `--error-metric`. |
 | `DFATOOL_ULS_MIN_DISTINCT_VALUES` | 2 .. **3** .. *n* | Minimum number of unique values a parameter must take to be eligible for ULS |
 | `DFATOOL_ULS_SKIP_CODEPENDENT_CHECK` | **0**, 1 | Do not detect and remove co-dependent features in ULS. |
diff --git a/lib/parameters.py b/lib/parameters.py
index e697eda..8627b72 100644
--- a/lib/parameters.py
+++ b/lib/parameters.py
@@ -1106,7 +1106,60 @@ class ModelAttribute:
             from sklearn.linear_model import LinearRegression
             from dfatool.lineartree import LinearTreeRegressor
 
-            lmt = LinearTreeRegressor(base_estimator=LinearRegression(), max_depth=20)
+            # max_depth : int, default=5
+            #     The maximum depth of the tree considering only the splitting nodes.
+            #     A higher value implies a higher training time.
+            max_depth = int(os.getenv("DFATOOL_LMT_MAX_DEPTH", "20"))
+
+            # min_samples_split : int or float, default=6
+            #     The minimum number of samples required to split an internal node.
+            #     The minimum valid number of samples in each node is 6.
+            #     A lower value implies a higher training time.
+            #     - If int, then consider `min_samples_split` as the minimum number.
+            #     - If float, then `min_samples_split` is a fraction and
+            #       `ceil(min_samples_split * n_samples)` are the minimum
+            #       number of samples for each split.
+            if "." in os.getenv("DFATOOL_LMT_MIN_SAMPLES_SPLIT", ""):
+                min_samples_split = float(os.getenv("DFATOOL_LMT_MIN_SAMPLES_SPLIT"))
+            else:
+                min_samples_split = int(os.getenv("DFATOOL_LMT_MIN_SAMPLES_SPLIT", "6"))
+
+            # min_samples_leaf : int or float, default=0.1
+            #     The minimum number of samples required to be at a leaf node.
+            #     A split point at any depth will only be considered if it leaves at
+            #     least `min_samples_leaf` training samples in each of the left and
+            #     right branches.
+            #     The minimum valid number of samples in each leaf is 3.
+            #     A lower value implies a higher training time.
+            #     - If int, then consider `min_samples_leaf` as the minimum number.
+            #     - If float, then `min_samples_leaf` is a fraction and
+            #       `ceil(min_samples_leaf * n_samples)` are the minimum
+            #       number of samples for each node.
+            if "." in os.getenv("DFATOOL_LMT_MIN_SAMPLES_LEAF", ""):
+                min_samples_leaf = float(os.getenv("DFATOOL_LMT_MIN_SAMPLES_LEAF"))
+            else:
+                min_samples_leaf = int(os.getenv("DFATOOL_LMT_MIN_SAMPLES_LEAF", "3"))
+
+            # max_bins : int, default=25
+            #     The maximum number of bins to use to search the optimal split in each
+            #     feature. Features with a small number of unique values may use less than
+            #     ``max_bins`` bins. Must be lower than 120 and larger than 10.
+            #     A higher value implies a higher training time.
+            max_bins = int(os.getenv("DFATOOL_LMT_MAX_BINS", "120"))
+
+            # criterion : {"mse", "rmse", "mae", "poisson"}, default="mse"
+            #     The function to measure the quality of a split. "poisson"
+            #     requires ``y >= 0``.
+            criterion = os.getenv("DFATOOL_LMT_CRITERION", "mse")
+
+            lmt = LinearTreeRegressor(
+                base_estimator=LinearRegression(),
+                max_depth=max_depth,
+                min_samples_split=min_samples_split,
+                min_samples_leaf=min_samples_leaf,
+                max_bins=max_bins,
+                criterion=criterion,
+            )
             fit_parameters, category_to_index, ignore_index = param_to_ndarray(
                 parameters, with_nan=False, categorial_to_scalar=categorial_to_scalar
             )
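
The two `min_samples_*` settings in the patch above share one parsing rule: a value containing a `.` is read as a fraction of the sample count (float), anything else as an absolute count (int). Below is a minimal sketch of that rule as a standalone function. The helper name `env_int_or_float` is hypothetical and not part of dfatool; the environment variable names and the defaults 6 and 3 are taken from the diff.

```python
import os


def env_int_or_float(name, default):
    """Hypothetical helper (not part of dfatool).

    Returns `default` if the variable is unset, a float if its value
    contains a '.', and an int otherwise. Malformed values raise
    ValueError, as the int()/float() calls in the patch would.
    """
    raw = os.getenv(name)
    if raw is None:
        return default
    if "." in raw:
        return float(raw)
    return int(raw)


# Example environment: a fractional split threshold, leaf threshold unset.
os.environ["DFATOOL_LMT_MIN_SAMPLES_SPLIT"] = "0.05"
os.environ.pop("DFATOOL_LMT_MIN_SAMPLES_LEAF", None)

min_samples_split = env_int_or_float("DFATOOL_LMT_MIN_SAMPLES_SPLIT", 6)
min_samples_leaf = env_int_or_float("DFATOOL_LMT_MIN_SAMPLES_LEAF", 3)

print(min_samples_split, min_samples_leaf)
```

With this rule, `1` and `1.0` mean different things: the former is an absolute count of one sample, the latter a fraction covering all samples. Whether either value is accepted is decided by the regressor, not by the parsing step.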