

May dd, 2022
Shoichiro Yokotani, Application Development Expert
AI Platform division

Frovedis Machine Learning: Unsupervised Learning. Dimensionality reduction using clustering and t-SNE. Reduced learning time (compared to scikit-learn)

Unsupervised learning is a general term for learning that extracts information from a data set with no indication of the correct answers. In supervised learning, each input has corresponding output (correct-answer) data, which allows us to verify the correctness of the learning results. In unsupervised learning there is no such measure of correctness, so it is generally difficult to judge whether the learning results are appropriate.

Unsupervised learning can be divided into two main categories. The first is clustering, which divides a data set into groups according to the characteristics of its members. For example, news articles can be grouped into politics, economics, sports, and so on, based on the similarity of the words they contain. Unsupervised learning does not name the groups automatically, so each group must be labeled by a human. Judging how valid the grouping is also requires human judgment.
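The news-article example above can be sketched in a few lines of scikit-learn. This is only an illustrative sketch: the four headlines are made-up toy data, and TF-IDF vectorization is an assumption chosen for brevity (the article itself uses Word2vec vectors). Note that k-means returns only integer cluster ids; names such as "politics" or "sports" still have to be assigned by a person.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy headlines; real news articles would be much longer.
docs = [
    "parliament passes budget vote after election debate",
    "election campaign debate focuses on parliament reform",
    "team wins championship match with late goal",
    "striker scores goal as team wins league match",
]

# Turn each document into a vector of word weights (TF-IDF is an assumption here).
vectors = TfidfVectorizer().fit_transform(docs)

# Ask for two groups. The result is one integer cluster id per document.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

# Print each document next to its cluster id; naming the clusters is left to a human.
for label, doc in zip(labels, docs):
    print(label, doc)
```

Documents that share vocabulary should end up with the same cluster id, but which id corresponds to which topic is arbitrary.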

The second is dimensionality reduction, which converts data sets with high-dimensional features to a lower dimension. High-dimensional data is hard to understand because its features cannot be graphed directly. Dimensionality reduction helps in such cases, either by deriving a small number of important variables from the high-dimensional features or by mapping the data into a low-dimensional space based on the distances between data points in the high-dimensional space. The reduced data is easier to understand visually.
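The two approaches described above can be illustrated with a minimal sketch. PCA is used here as one example of deriving a few important variables (the article does not name a specific method for that approach), and t-SNE as an example of a distance-based mapping. The data is synthetic, and the sizes and parameters are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in data: 300 points in 50 dimensions, drawn around 3 centers.
centers = rng.normal(scale=5.0, size=(3, 50))
X = np.vstack([c + rng.normal(size=(100, 50)) for c in centers])

# Approach 1: derive two new variables (principal components) from the 50 features.
X_pca = PCA(n_components=2).fit_transform(X)

# Approach 2: place points in 2-D so that neighbors in 50-D stay close together.
X_tsne = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(X)

print(X.shape, X_pca.shape, X_tsne.shape)  # (300, 50) (300, 2) (300, 2)
```

Either two-column result can be passed to a scatter plot, which is what makes the structure of the data visible.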

As a sample for the unsupervised learning algorithms presented here, we use a data set of news articles that have already been segmented into words and vectorized using Word2vec. As a first step, we use k-means clustering to group the words in the news articles. Next, we visualize the data features using t-SNE and cluster again using the DBSCAN algorithm. We measure the training time of each algorithm with both Frovedis and scikit-learn.
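The workflow described above (k-means, then t-SNE, then DBSCAN, each timed) might be sketched as follows on the scikit-learn side. Several things here are assumptions rather than the article's actual setup: the random vectors stand in for the Word2vec data, the `timed` helper is hypothetical, DBSCAN is applied to the 2-D t-SNE output, and the values of `n_clusters`, `eps`, and `min_samples` are placeholders that would need tuning on real data. The Frovedis side of the comparison is not shown.

```python
import time

import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.manifold import TSNE


def timed(name, fn):
    """Run fn(), print the wall-clock time it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.3f} s")
    return result, elapsed


rng = np.random.default_rng(0)

# Synthetic stand-in for Word2vec word vectors: 600 vectors of dimension 100.
centers = rng.normal(scale=4.0, size=(4, 100))
word_vectors = np.vstack([c + rng.normal(size=(150, 100)) for c in centers])

# Step 1: group the vectors with k-means (n_clusters=4 is an assumed value).
kmeans_labels, t_kmeans = timed(
    "k-means",
    lambda: KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(word_vectors),
)

# Step 2: reduce to two dimensions with t-SNE for visualization.
embedding, t_tsne = timed(
    "t-SNE",
    lambda: TSNE(n_components=2, random_state=0).fit_transform(word_vectors),
)

# Step 3: cluster again with DBSCAN, here on the 2-D embedding (an assumption).
# A label of -1 marks points that DBSCAN treats as noise.
dbscan_labels, t_dbscan = timed(
    "DBSCAN",
    lambda: DBSCAN(eps=3.0, min_samples=5).fit_predict(embedding),
)
```

Running the same three steps with another library's estimators and comparing the printed times gives the kind of comparison the article reports.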

