grilly.datasets.clean_conversations
Conversations SVC Cleaner
Cleans the conversations_svc_semantic.jsonl file by removing:
- Leaked filenames (*.py, *.pt, *.txt, etc.)
- File paths (Unix/Windows)
- Code artifacts (&&, backticks, arrows, shell commands)
- Technical numbers (checkpoint IDs, dimension specs)
- Leaked project names (AURA, STDP, etc.)
- Entries that become too short or invalid after cleaning
Output: conversations_svc_cleaned.jsonl
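The rule list above could be approximated with a handful of regular expressions. The following is a minimal sketch only: the pattern names, the exact expressions, and the function sketch_clean_text are hypothetical and are not taken from the module, whose actual rules are not shown in this reference. Only the extensions and project names explicitly listed above are covered.

```python
import re

# Hypothetical patterns, one per rule category listed above; the real module's
# expressions are not shown in this reference and may differ.
FILENAME_RE = re.compile(r"\b[\w\-]+\.(?:py|pt|txt)\b")
UNIX_PATH_RE = re.compile(r"(?:/[\w.\-]+){2,}/?")
WINDOWS_PATH_RE = re.compile(r"\b[A-Za-z]:\\(?:[\w.\-]+\\?)+")
ARTIFACT_RE = re.compile(r"&&|`|->|=>")
PROJECT_RE = re.compile(r"\b(?:AURA|STDP)\b")
WHITESPACE_RE = re.compile(r"\s+")


def sketch_clean_text(text: str) -> str:
    """Strip leaked technical fragments, then normalise whitespace."""
    # Paths go first so that a filename inside a path is removed with it.
    for pattern in (WINDOWS_PATH_RE, UNIX_PATH_RE, FILENAME_RE,
                    ARTIFACT_RE, PROJECT_RE):
        text = pattern.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()
```

Removing paths before bare filenames avoids leaving half of a path behind, and the final whitespace pass closes the gaps left by each substitution.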
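Functions
- clean_svc(svc): Clean SVC fields.
- clean_text(text): Apply all cleaning rules to a text string.
- is_valid_after_cleaning(entry): Check if an entry is still valid after cleaning.
- main(): Autogenerated reference for grilly.datasets.clean_conversations.main.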
Classes
- Counter: Dict subclass for counting hashable items.
- Path: PurePath subclass that can make system calls.
- grilly.datasets.clean_conversations.clean_text(text)[source]
Apply all cleaning rules to a text string.
Dependencies: re.
Variables: text (str, required).
Usage Example
from grilly.datasets.clean_conversations import clean_text
result = clean_text(text='example')
- Parameters
text (str) – The text string to clean.
- Return type
str
- grilly.datasets.clean_conversations.clean_svc(svc)[source]
Clean SVC fields.
Dependencies: None detected from callable globals.
Variables: svc (dict, required).
Usage Example
from grilly.datasets.clean_conversations import clean_svc
result = clean_svc(svc={})
- Parameters
svc (dict) – The SVC fields to clean.
- Return type
dict
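As an illustration of what cleaning SVC fields might involve, the sketch below applies a text cleaner to every string value in a dict and leaves other values untouched. This is an assumption about the behaviour, not the module's code: sketch_clean_svc, its extra clean parameter, and the field names in the example are all hypothetical (the real clean_svc takes only svc).

```python
from typing import Callable


def sketch_clean_svc(svc: dict, clean: Callable[[str], str]) -> dict:
    """Return a copy of `svc` with `clean` applied to every string value."""
    # Non-string values (numbers, nested structures) pass through unchanged.
    return {
        key: clean(value) if isinstance(value, str) else value
        for key, value in svc.items()
    }


# Trivial stand-in cleaner (str.strip) and hypothetical field names.
example = sketch_clean_svc({"s": " the cat ", "v": "sat", "n": 3}, str.strip)
```

Returning a new dict rather than mutating the input keeps the original entry available for comparison or logging.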
- grilly.datasets.clean_conversations.is_valid_after_cleaning(entry)[source]
Check if an entry is still valid after cleaning.
Dependencies: None detected from callable globals.
Variables: entry (dict, required).
Usage Example
from grilly.datasets.clean_conversations import is_valid_after_cleaning
result = is_valid_after_cleaning(entry={})
- Parameters
entry (dict) – The cleaned entry to check.
- Return type
bool
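A validity check of this kind could reject entries whose text became too short once leaked fragments were removed. The sketch below shows one possible shape under stated assumptions: the MIN_CHARS threshold, the "text" key, and the name sketch_is_valid are hypothetical, since this reference does not document the actual criteria.

```python
MIN_CHARS = 10  # hypothetical threshold; the real limit is not documented here


def sketch_is_valid(entry: dict, text_key: str = "text") -> bool:
    """Reject entries whose cleaned text is missing, non-string, or too short."""
    text = entry.get(text_key)
    if not isinstance(text, str):
        return False
    return len(text.strip()) >= MIN_CHARS
```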
- grilly.datasets.clean_conversations.main()[source]
Autogenerated reference for grilly.datasets.clean_conversations.main.
Dependencies: collections, json, pathlib, re, sys.
Variables: This callable does not take explicit input variables.
Usage Example
from grilly.datasets.clean_conversations import main
result = main()
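Given the listed dependencies (collections, json, pathlib), an entry point like this plausibly streams the input JSONL file, cleans each entry, and writes the survivors to the output file while counting outcomes. The sketch below is a guess at that overall shape, not the module's implementation: sketch_clean_file, its parameters, and the counter keys are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Callable


def sketch_clean_file(src: Path, dst: Path,
                      clean_entry: Callable[[dict], dict],
                      is_valid: Callable[[dict], bool]) -> Counter:
    """Stream a JSONL file through a cleaner, writing only valid entries."""
    stats = Counter()
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            stats["read"] += 1
            entry = clean_entry(json.loads(line))
            if is_valid(entry):
                fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
                stats["kept"] += 1
            else:
                stats["dropped"] += 1
    return stats
```

Processing one line at a time keeps memory use flat regardless of file size, and the returned Counter gives a quick summary of how many entries were read, kept, and dropped.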