Documentation

Table of Content

Semantic Map

The Semantic Map is a powerful visualization tool that helps you understand what your AI assistant knows. It analyzes your uploaded documents and displays them as an interactive map, grouping related content into topics and highlighting areas that may need improvement.

What is the Semantic Map

The Semantic Map uses machine learning to analyze all the content in your knowledge base and organize it visually. Each dot on the map represents a piece of your content (a "chunk"), and dots that are close together share similar meaning.

Component	What It Shows
Topics (Clusters)	Groups of related content automatically detected
Points	Individual chunks from your documents
Colors	Different colors represent different topics
Unclustered Points	Content that doesn't fit neatly into any topic

The map helps you answer questions like:

What topics does my AI know about?
Are any topics underrepresented?
Is my content well-organized or scattered?

Embeddings Representation

When you upload documents, each chunk is converted into a high-dimensional vector (an "embedding") that captures its semantic meaning. Chunks with similar meaning have similar vectors, even if they use different words.

Dimensionality Reduction

The map displays content in 2D, but embeddings exist in much higher dimensions. We use UMAP (Uniform Manifold Approximation and Projection) to project these high-dimensional vectors onto a 2D plane while preserving the relationships between them.

Points close together in the original space remain close in 2D
Points far apart remain separated
The resulting layout reveals natural groupings in your content

Clustering

Once projected to 2D, we use HDBSCAN (Hierarchical Density-Based Spatial Clustering) to automatically detect topic clusters. Unlike traditional clustering algorithms:

It doesn't require you to specify how many topics to find
It identifies "noise" (content that doesn't belong to any clear topic)
It handles clusters of varying sizes and densities

Topic Labeling

Each detected cluster is analyzed to generate a human-readable topic label. The system examines representative content from each cluster and produces a 2-4 word descriptive name (e.g., "Shipping & Returns", "Product Specifications").

Primary Stats

Metric	Definition	What It Tells You
Topics	Number of distinct clusters detected	How many separate subjects your knowledge base covers
Chunks	Total content pieces analyzed	Overall size of your knowledge base
Clustered	Percentage of chunks belonging to a topic	How much of your content is well-organized
Unclustered	Percentage marked as "noise"	Content that may be too unique or poorly organized

Clustered vs. Unclustered:

High clustered % (>90%) = Well-organized content with clear topics
High unclustered % (>20%) = Scattered content, consider reorganizing

Silhouette Score

Range: -1 to +1

The Silhouette Score measures how similar each chunk is to its own cluster compared to other clusters. It evaluates both cohesion (how tightly grouped points are within a cluster) and separation (how distinct clusters are from each other).

Score	Interpretation
0.7 – 1.0	Excellent — Strong, well-separated clusters
0.5 – 0.7	Good — Clear structure with minor overlap
0.25 – 0.5	Fair — Clusters exist but boundaries are fuzzy
0.0 – 0.25	Weak — Overlapping or poorly defined clusters
Below 0	Poor — Points may be assigned to wrong clusters

Mathematical intuition: For each point, the score compares its average distance to points in its own cluster versus the nearest neighboring cluster. A high score means points are much closer to their own cluster.

Calinski-Harabasz Index

Range: 0 to ∞ (higher is better)

Also known as the Variance Ratio Criterion, this metric measures the ratio of between-cluster dispersion to within-cluster dispersion.

Score	Interpretation
Higher values	Better-defined, more separated clusters
Lower values	Clusters are less distinct or overlapping

Mathematical intuition: Compares how spread out clusters are from each other versus how spread out points are within each cluster. Well-defined clusters are tight internally but far apart from each other.

Note: This score is not normalized, so it's most useful for comparing different analyses of the same dataset rather than as an absolute measure.

Davies-Bouldin Index

Range: 0 to ∞ (lower is better)

This metric measures the average similarity between each cluster and its most similar neighboring cluster.

Score	Interpretation
0 – 0.5	Excellent — Clusters are very distinct
0.5 – 1.0	Good — Reasonable separation
1.0 – 2.0	Fair — Some cluster overlap
Above 2.0	Poor — Significant overlap between clusters

Mathematical intuition: For each cluster, finds the neighboring cluster it's most similar to, then averages across all clusters. Lower scores indicate clusters that are compact and well-separated.

Quality Assesment

The overall Quality rating combines these metrics into a single assessment:

Rating	Criteria
Excellent	Silhouette ≥ 0.7
Good	Silhouette 0.5 – 0.7
Fair	Silhouette 0.25 – 0.5
Poor	Silhouette < 0.25

Diagnostics

The Semantic Map provides four diagnostic analyses:

Diagnostic	What It Measures	Healthy State
Noise Analysis	Percentage of unclustered content	<10% unclustered
Balance	Distribution of content across topics	No single topic >50%
Depth	Content volume per topic	All topics have ≥5 chunks
Organization	Overall cluster quality	Multiple distinct topics detected

Severity indicators:

Green: No issues detected
Blue: Minor observation
Yellow: Moderate concern worth addressing
Red: Significant issue affecting chatbot quality

Advance Settings

For fine-tuning the analysis, you can adjust clustering parameters:

Parameter	Range	Default	Technical Effect
Minimum Cluster Size	2–50	3	Minimum points required to form a cluster. Lower values detect smaller, more specific topics. Higher values require more evidence before creating a topic.
Neighbor Sensitivity	2–100	15	Number of neighbors considered when projecting to 2D. Lower values preserve local structure (fine-grained patterns). Higher values preserve global structure (broad patterns).
Density Threshold	1–10	2	Minimum points in a neighborhood to be considered a core point. Higher values require denser clusters, reducing noise sensitivity.

When to adjust:

Symptom	Try This
Too many tiny topics	Increase Minimum Cluster Size
Topics are too broad	Decrease Minimum Cluster Size
Related content split across topics	Increase Neighbor Sensitivity
Unrelated content grouped together	Decrease Neighbor Sensitivity
Too much noise (unclustered)	Decrease Density Threshold
Loose/scattered clusters	Increase Density Threshold

Results

Healthy Knowledge Base Signs:

5–15 topics (varies by use case)
>90% clustered content
Silhouette score >0.5
Balanced topic sizes (no single topic dominates)
All diagnostics green or blue

Warning Signs:

Issue	Possible Cause	Solution
Very few topics (1-2)	Content is too similar or too limited	Add diverse content covering different subjects
Many tiny topics	Content is too fragmented	Consolidate related documents
High unclustered %	Documents cover many unrelated subjects	Reorganize into focused topic files
Low Silhouette score	Topics overlap significantly	Separate content more clearly by subject
One dominant topic	Knowledge base is unbalanced	Add content for underrepresented areas

Quick Reference

Task	How To
Run analysis	Training → Semantic Map → Run Analysis
Filter by topic	Click a topic in the sidebar
View all topics	Click "Show All" in the sidebar
View chunk content	Click any point on the map
Adjust settings	Click the settings icon
Reset view	Click "Reset View" after panning/zooming
Re-analyze	Click "Re-run" in the header

Temperature - Top P