LLMs

Creating Thread Datasets using Tune Studio

Oct 8, 2024

5 min read

Despite Large Language Models such as GPT4o and Llama 3.1 405B, garnering everyone’s eyes showing their prowess in Contextual understanding and sheer vast knowledge of almost every tiny tidbit out there, the tightest of GenAI workflows break when these models fail to come up with relevant information. Such struggles often result in prominent context-less hallucinations in models, which might solve a service query ticket with the feature your company seldom thought about providing.

In this blog, we shall discuss what is needed to create Thread Datasets for Fine-Tuning LLMs, a technique widely adopted in modern Assistant Workflows, and how to use Tune Studio’s API to access the sharpest and strongest LLMs in the sphere to create such datasets.

Thread Datasets Against Other Assistant Workflows

System Prompting, an ancient tactic still being sold to consumers in the modern day and age, soon becomes irrelevant when the data your workflow intends to work upon decides to dive into something much more niche than LLMs in the workflow can handle. Think of it this way: a genius 180 IQ child in ancient Mesopotamia’s intellect is still very much limited to that time’s knowledge and cannot figure out why OpenAI hasn’t released their latest promised feature!

Finetuned Large Language Models trained upon Thread Datasets from previous RAG-based or otherwise inferences have shown their might in making AI Assistants airtight. Such datasets rely not heavily on the foundational model’s vast knowledge of everything but instead on a compiled embedded understanding of something.

Advantages of Using Thread Datasets

Thread Datasets provide the much-needed context to an Assistant Workflow and enrich the model’s contextual continuity, allowing it to learn from context-rich conversations. This is a boon in multi-turn interactions, which is how these User-facing pipelines interact.

Apart from the significant decrease in Hallucination scores, such fine-tuned models also excel at Evaluation Metrics, which users will notice through the relevance of query answers and the thorough data utilisation that will be done in such cases (Remember, though, that Overfitting is still very much a thing for Assistant Workflows), where each thread can provide multiple layers of information compared to traditional datasets.

Creating a Thread Dataset using Tune Studio

The quality of the Finetuned model on thread datasets is highly dependent on the Model that generated the conversation. Clear distinctions are shown between the sentence structures and context understanding while utilising existing knowledge about vast topics such as marketing workflows, writing code, or whatever the pipeline requires. One should use the more prominent models or their distilled versions for such a task.

To create thread datasets, you can access GPT4o, Anthropic Claude, Llama 3.1 405B, and Llama 3.2 90B for free on Tune Studio today!

Let's understand how to create such a dataset using Tune Studio’s API. For example, we will create a “Smart Home Customer Support” dataset.

Note: All API References Can Be Found Here.

1. Creating a Thread Dataset

To create a thread dataset, we'll use the CreateThreadDataset API endpoint. This endpoint requires organization, cluster information, dataset name, description, and format.

Python Code:

import requests

url = "https://studio.tune.app/tune.dataset.ThreadDatasetService/CreateThreadDataset"
payload = {
    "auth": {
        "organization": "<your_organization>",
        "cluster": "<your_cluster>"
    },
    "name": "SmartHome Customer Support Dataset",
    "description": "Dataset for managing customer support conversations about SmartHome.",
    "format": "json"
}
headers = {
    "X-Tune-Key": "<your_api_key>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Expected Output: After running the above code, you will receive a JSON response similar to this:

{
  "datasetId": "abc123",
  "name": "SmartHome Customer Support Dataset",
  "description": "Dataset for managing customer support conversations about SmartHome.",
  "format": "json"
}

This confirms the successful creation of the dataset.

2. Listing Threads in the Dataset

Once the dataset is created, you can list its threads using the ListThreadDatasetThreads API endpoint. This requires the dataset ID and pagination parameters.

Python Code:

url = "https://studio.tune.app/tune.dataset.ThreadDatasetService/ListThreadDatasetThreads"
payload = {
    "auth": {
        "organization": "<your_organization>",
        "cluster": "<your_cluster>"
    },
    "datasetId": "abc123",  # Use the datasetId obtained from the previous step
    "page": {
        "limit": 10,
        "prevPageToken": None,
        "nextPageToken": None,
        "totalPage": 1
    },
    "order": "asc"  # or "desc" based on your requirement
}
headers = {
    "X-Tune-Key": "<your_api_key>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Expected Output: After executing this code, the response will include a list of threads:

{
  "threads": [
    {
      "threadId": "thread1",
      "messages": [
        {"role": "customer", "content": "How do I reset my SmartHome device?"},
        {"role": "support", "content": "To reset your SmartHome device, press and hold the reset button for 10 seconds."}
      ]
    },
    {
      "threadId": "thread2",
      "messages": [
        {"role": "customer", "content": "What should I do if my device is not responding?"},
        {"role": "support", "content": "Try unplugging the device and plugging it back in."}
      ]
    }
  ],
  "pagination": {
    "totalCount": 2,
    "nextPageToken": null
  }
}

This output shows the threads and messages in the dataset.

3. Exporting the Dataset

To export the dataset for analysis or sharing, we will use the ExportThreadDataset API endpoint, specifying the dataset ID, type, and format.

Python Code:

url = "https://studio.tune.app/tune.dataset.ThreadDatasetService/ExportThreadDataset"
payload = {
    "auth": {
        "organization": "<your_organization>",
        "cluster": "<your_cluster>"
    },
    "datasetId": "abc123",  # Use the datasetId from the first step
    "type": "full",  # Specify the type of export
    "format": "json"  # Specify the format of the export
}
headers = {
    "X-Tune-Key": "<your_api_key>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Expected Output: Upon successful export, you will receive a response indicating the status of the operation:

{
  "status": "success",
  "message": "Dataset exported successfully."
}

This confirms the successful export of the dataset and provides a link to download the file.

Conclusion

In this tutorial, we demonstrated how to manage a thread dataset using the Thread Dataset API. We focused on creating a dataset, listing its threads, and exporting it for further analysis. Each step was accompanied by Python code snippets and expected output, providing a clear understanding of the API's capabilities.

Additional Resources

Despite Large Language Models such as GPT4o and Llama 3.1 405B, garnering everyone’s eyes showing their prowess in Contextual understanding and sheer vast knowledge of almost every tiny tidbit out there, the tightest of GenAI workflows break when these models fail to come up with relevant information. Such struggles often result in prominent context-less hallucinations in models, which might solve a service query ticket with the feature your company seldom thought about providing.

In this blog, we shall discuss what is needed to create Thread Datasets for Fine-Tuning LLMs, a technique widely adopted in modern Assistant Workflows, and how to use Tune Studio’s API to access the sharpest and strongest LLMs in the sphere to create such datasets.

Thread Datasets Against Other Assistant Workflows

System Prompting, an ancient tactic still being sold to consumers in the modern day and age, soon becomes irrelevant when the data your workflow intends to work upon decides to dive into something much more niche than LLMs in the workflow can handle. Think of it this way: a genius 180 IQ child in ancient Mesopotamia’s intellect is still very much limited to that time’s knowledge and cannot figure out why OpenAI hasn’t released their latest promised feature!

Finetuned Large Language Models trained upon Thread Datasets from previous RAG-based or otherwise inferences have shown their might in making AI Assistants airtight. Such datasets rely not heavily on the foundational model’s vast knowledge of everything but instead on a compiled embedded understanding of something.

Advantages of Using Thread Datasets

Thread Datasets provide the much-needed context to an Assistant Workflow and enrich the model’s contextual continuity, allowing it to learn from context-rich conversations. This is a boon in multi-turn interactions, which is how these User-facing pipelines interact.

Apart from the significant decrease in Hallucination scores, such fine-tuned models also excel at Evaluation Metrics, which users will notice through the relevance of query answers and the thorough data utilisation that will be done in such cases (Remember, though, that Overfitting is still very much a thing for Assistant Workflows), where each thread can provide multiple layers of information compared to traditional datasets.

Creating a Thread Dataset using Tune Studio

The quality of the Finetuned model on thread datasets is highly dependent on the Model that generated the conversation. Clear distinctions are shown between the sentence structures and context understanding while utilising existing knowledge about vast topics such as marketing workflows, writing code, or whatever the pipeline requires. One should use the more prominent models or their distilled versions for such a task.

To create thread datasets, you can access GPT4o, Anthropic Claude, Llama 3.1 405B, and Llama 3.2 90B for free on Tune Studio today!

Let's understand how to create such a dataset using Tune Studio’s API. For example, we will create a “Smart Home Customer Support” dataset.

Note: All API References Can Be Found Here.

1. Creating a Thread Dataset

To create a thread dataset, we'll use the CreateThreadDataset API endpoint. This endpoint requires organization, cluster information, dataset name, description, and format.

Python Code:

import requests

url = "https://studio.tune.app/tune.dataset.ThreadDatasetService/CreateThreadDataset"
payload = {
    "auth": {
        "organization": "<your_organization>",
        "cluster": "<your_cluster>"
    },
    "name": "SmartHome Customer Support Dataset",
    "description": "Dataset for managing customer support conversations about SmartHome.",
    "format": "json"
}
headers = {
    "X-Tune-Key": "<your_api_key>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Expected Output: After running the above code, you will receive a JSON response similar to this:

{
  "datasetId": "abc123",
  "name": "SmartHome Customer Support Dataset",
  "description": "Dataset for managing customer support conversations about SmartHome.",
  "format": "json"
}

This confirms the successful creation of the dataset.

2. Listing Threads in the Dataset

Once the dataset is created, you can list its threads using the ListThreadDatasetThreads API endpoint. This requires the dataset ID and pagination parameters.

Python Code:

url = "https://studio.tune.app/tune.dataset.ThreadDatasetService/ListThreadDatasetThreads"
payload = {
    "auth": {
        "organization": "<your_organization>",
        "cluster": "<your_cluster>"
    },
    "datasetId": "abc123",  # Use the datasetId obtained from the previous step
    "page": {
        "limit": 10,
        "prevPageToken": None,
        "nextPageToken": None,
        "totalPage": 1
    },
    "order": "asc"  # or "desc" based on your requirement
}
headers = {
    "X-Tune-Key": "<your_api_key>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Expected Output: After executing this code, the response will include a list of threads:

{
  "threads": [
    {
      "threadId": "thread1",
      "messages": [
        {"role": "customer", "content": "How do I reset my SmartHome device?"},
        {"role": "support", "content": "To reset your SmartHome device, press and hold the reset button for 10 seconds."}
      ]
    },
    {
      "threadId": "thread2",
      "messages": [
        {"role": "customer", "content": "What should I do if my device is not responding?"},
        {"role": "support", "content": "Try unplugging the device and plugging it back in."}
      ]
    }
  ],
  "pagination": {
    "totalCount": 2,
    "nextPageToken": null
  }
}

This output shows the threads and messages in the dataset.

3. Exporting the Dataset

To export the dataset for analysis or sharing, we will use the ExportThreadDataset API endpoint, specifying the dataset ID, type, and format.

Python Code:

url = "https://studio.tune.app/tune.dataset.ThreadDatasetService/ExportThreadDataset"
payload = {
    "auth": {
        "organization": "<your_organization>",
        "cluster": "<your_cluster>"
    },
    "datasetId": "abc123",  # Use the datasetId from the first step
    "type": "full",  # Specify the type of export
    "format": "json"  # Specify the format of the export
}
headers = {
    "X-Tune-Key": "<your_api_key>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Expected Output: Upon successful export, you will receive a response indicating the status of the operation:

{
  "status": "success",
  "message": "Dataset exported successfully."
}

This confirms the successful export of the dataset and provides a link to download the file.

Conclusion

In this tutorial, we demonstrated how to manage a thread dataset using the Thread Dataset API. We focused on creating a dataset, listing its threads, and exporting it for further analysis. Each step was accompanied by Python code snippets and expected output, providing a clear understanding of the API's capabilities.

Additional Resources

Despite Large Language Models such as GPT4o and Llama 3.1 405B, garnering everyone’s eyes showing their prowess in Contextual understanding and sheer vast knowledge of almost every tiny tidbit out there, the tightest of GenAI workflows break when these models fail to come up with relevant information. Such struggles often result in prominent context-less hallucinations in models, which might solve a service query ticket with the feature your company seldom thought about providing.

In this blog, we shall discuss what is needed to create Thread Datasets for Fine-Tuning LLMs, a technique widely adopted in modern Assistant Workflows, and how to use Tune Studio’s API to access the sharpest and strongest LLMs in the sphere to create such datasets.

Thread Datasets Against Other Assistant Workflows

System Prompting, an ancient tactic still being sold to consumers in the modern day and age, soon becomes irrelevant when the data your workflow intends to work upon decides to dive into something much more niche than LLMs in the workflow can handle. Think of it this way: a genius 180 IQ child in ancient Mesopotamia’s intellect is still very much limited to that time’s knowledge and cannot figure out why OpenAI hasn’t released their latest promised feature!

Finetuned Large Language Models trained upon Thread Datasets from previous RAG-based or otherwise inferences have shown their might in making AI Assistants airtight. Such datasets rely not heavily on the foundational model’s vast knowledge of everything but instead on a compiled embedded understanding of something.

Advantages of Using Thread Datasets

Thread Datasets provide the much-needed context to an Assistant Workflow and enrich the model’s contextual continuity, allowing it to learn from context-rich conversations. This is a boon in multi-turn interactions, which is how these User-facing pipelines interact.

Apart from the significant decrease in Hallucination scores, such fine-tuned models also excel at Evaluation Metrics, which users will notice through the relevance of query answers and the thorough data utilisation that will be done in such cases (Remember, though, that Overfitting is still very much a thing for Assistant Workflows), where each thread can provide multiple layers of information compared to traditional datasets.

Creating a Thread Dataset using Tune Studio

The quality of the Finetuned model on thread datasets is highly dependent on the Model that generated the conversation. Clear distinctions are shown between the sentence structures and context understanding while utilising existing knowledge about vast topics such as marketing workflows, writing code, or whatever the pipeline requires. One should use the more prominent models or their distilled versions for such a task.

To create thread datasets, you can access GPT4o, Anthropic Claude, Llama 3.1 405B, and Llama 3.2 90B for free on Tune Studio today!

Let's understand how to create such a dataset using Tune Studio’s API. For example, we will create a “Smart Home Customer Support” dataset.

Note: All API References Can Be Found Here.

1. Creating a Thread Dataset

To create a thread dataset, we'll use the CreateThreadDataset API endpoint. This endpoint requires organization, cluster information, dataset name, description, and format.

Python Code:

import requests

url = "https://studio.tune.app/tune.dataset.ThreadDatasetService/CreateThreadDataset"
payload = {
    "auth": {
        "organization": "<your_organization>",
        "cluster": "<your_cluster>"
    },
    "name": "SmartHome Customer Support Dataset",
    "description": "Dataset for managing customer support conversations about SmartHome.",
    "format": "json"
}
headers = {
    "X-Tune-Key": "<your_api_key>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Expected Output: After running the above code, you will receive a JSON response similar to this:

{
  "datasetId": "abc123",
  "name": "SmartHome Customer Support Dataset",
  "description": "Dataset for managing customer support conversations about SmartHome.",
  "format": "json"
}

This confirms the successful creation of the dataset.

2. Listing Threads in the Dataset

Once the dataset is created, you can list its threads using the ListThreadDatasetThreads API endpoint. This requires the dataset ID and pagination parameters.

Python Code:

url = "https://studio.tune.app/tune.dataset.ThreadDatasetService/ListThreadDatasetThreads"
payload = {
    "auth": {
        "organization": "<your_organization>",
        "cluster": "<your_cluster>"
    },
    "datasetId": "abc123",  # Use the datasetId obtained from the previous step
    "page": {
        "limit": 10,
        "prevPageToken": None,
        "nextPageToken": None,
        "totalPage": 1
    },
    "order": "asc"  # or "desc" based on your requirement
}
headers = {
    "X-Tune-Key": "<your_api_key>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Expected Output: After executing this code, the response will include a list of threads:

{
  "threads": [
    {
      "threadId": "thread1",
      "messages": [
        {"role": "customer", "content": "How do I reset my SmartHome device?"},
        {"role": "support", "content": "To reset your SmartHome device, press and hold the reset button for 10 seconds."}
      ]
    },
    {
      "threadId": "thread2",
      "messages": [
        {"role": "customer", "content": "What should I do if my device is not responding?"},
        {"role": "support", "content": "Try unplugging the device and plugging it back in."}
      ]
    }
  ],
  "pagination": {
    "totalCount": 2,
    "nextPageToken": null
  }
}

This output shows the threads and messages in the dataset.

3. Exporting the Dataset

To export the dataset for analysis or sharing, we will use the ExportThreadDataset API endpoint, specifying the dataset ID, type, and format.

Python Code:

url = "https://studio.tune.app/tune.dataset.ThreadDatasetService/ExportThreadDataset"
payload = {
    "auth": {
        "organization": "<your_organization>",
        "cluster": "<your_cluster>"
    },
    "datasetId": "abc123",  # Use the datasetId from the first step
    "type": "full",  # Specify the type of export
    "format": "json"  # Specify the format of the export
}
headers = {
    "X-Tune-Key": "<your_api_key>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Expected Output: Upon successful export, you will receive a response indicating the status of the operation:

{
  "status": "success",
  "message": "Dataset exported successfully."
}

This confirms the successful export of the dataset and provides a link to download the file.

Conclusion

In this tutorial, we demonstrated how to manage a thread dataset using the Thread Dataset API. We focused on creating a dataset, listing its threads, and exporting it for further analysis. Each step was accompanied by Python code snippets and expected output, providing a clear understanding of the API's capabilities.

Additional Resources

Written by

Aryan Kargwal

Data Evangelist