Malaysian Journal of Computer Science

EVALUATING SIMILARITY MEASURES FOR MALAY NOISY TEXT NORMALIZATION: PERFORMANCE AND THRESHOLD ANALYSIS

Fri, 01 Aug 2025 00:00:00 +0800

Noisy text normalization is a critical preprocessing step in natural language processing (NLP), particularly for user-generated content (UGC), which is rich in slang, abbreviations, and typographical errors. This extended study investigates the performance of multiple similarity measures in normalizing Malay noisy text, addressing gaps in prior studies that predominantly relied on rule-based approaches and single similarity measures. By systematically evaluating token-based, edit-based, and sequence-based similarity measures across various thresholds, this study provides a comprehensive analysis of their effectiveness and computational efficiency. The methodology comprises a two-phase experiment: an initial phase to identify optimal thresholds using a small dataset and a second phase that generalizes the findings on a larger dataset. Key findings reveal that edit-based measures, such as Levenshtein Distance and Damerau-Levenshtein, consistently outperform other measures at lower thresholds, achieving normalization success rates exceeding 83%. Ratcliff/Obershelp emerged as the most effective sequence-based measure, while token-based measures like Jaccard and Cosine demonstrated limited performance. The study also highlights the critical role of the threshold in balancing normalization accuracy and flexibility. Additionally, computational time analysis underscores the trade-offs between accuracy and efficiency across similarity categories. These findings pave the way for more robust and adaptable text normalization strategies, particularly for Malay language studies.
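
To make the idea of threshold-based matching concrete, the following is a minimal sketch, not the authors' implementation. It scores a noisy token against a small lexicon with a length-normalized Levenshtein similarity and with Python's `difflib.SequenceMatcher`, whose ratio is documented as similar to Ratcliff/Obershelp pattern matching. The lexicon, the threshold values, and all function names are illustrative assumptions.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lev_similarity(a: str, b: str) -> float:
    """Edit-based similarity in [0, 1]: 1 - distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ratcliff_similarity(a: str, b: str) -> float:
    """Sequence-based similarity via difflib (Ratcliff/Obershelp-like)."""
    return SequenceMatcher(None, a, b).ratio()


def normalize(token, lexicon, similarity, threshold):
    """Return the best-scoring lexicon word, or the token itself if no
    candidate reaches the threshold."""
    best_word, best_score = None, 0.0
    for word in lexicon:
        score = similarity(token, word)
        if score > best_score:
            best_word, best_score = word, score
    return best_word if best_score >= threshold else token


# Hypothetical three-word lexicon, for illustration only.
LEXICON = ["makan", "tidak", "sangat"]
```

With this toy lexicon, `normalize("mkn", LEXICON, lev_similarity, 0.5)` maps the token to `makan`, while raising the threshold to 0.8 leaves the token unchanged. This is the accuracy versus flexibility trade-off that the threshold controls.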

ENHANCING MULTILABEL CLASSIFICATION IN CHARGE PREDICTION USING LABEL CORRELATION AND PROBLEM TRANSFORMATION METHOD

Nasa Zata Dina; Sri Devi Ravana (Corresponding Author); Norisma Idris — Fri, 01 Aug 2025 00:00:00 +0800

Legal Judgment Prediction (LJP) has recently gained significant interest from both academics and legal practitioners. The majority of LJP methods focus on the single-label prediction problem, neglecting the real-world multilabel case. Therefore, this study aimed to classify multilabel legal cases using label correlation and problem transformation methods. Data were collected from publicly accessible legal documents of the European Court of Human Rights (ECHR) and EUR-Lex. Multilabel text classification tasks face challenges such as sample diversity, complexity, and the need for effective utilization of label correlations. In this paper, we propose a model that integrates domain-specific text embedding and label correlation. The proposed model uses label powerset as the problem transformation, converting the multilabel problem into a multiclass problem, and incorporates domain-specific text embedding and label correlation, which enhances classification performance in charge prediction and addresses label omission issues. Extensive experiments on two legal text datasets demonstrate the model’s excellent performance. The proposed model substantially outperformed two baseline studies, attaining F1-scores of 80.32%-90.09% and Hamming Loss scores of 0.0119-0.0210, whereas the baseline models attained F1-scores of 52%-80% and Hamming Loss scores of 0.0452-0.1479. The significance of this study is the implementation of label correlation in the label powerset problem transformation method and the application of domain-specific embedding to solve the multilabel classification problem in the legal domain.
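
As a hedged illustration of the label powerset transformation and the Hamming Loss metric named above, the sketch below maps each distinct label combination to a single multiclass id and back. It does not model the label-correlation or embedding components of the proposed model. The label names and function names are hypothetical.

```python
def label_powerset_fit(label_sets):
    """Assign one multiclass id to each distinct label combination."""
    mapping = {}
    for labels in label_sets:
        key = frozenset(labels)
        if key not in mapping:
            mapping[key] = len(mapping)
    return mapping


def to_multiclass(label_sets, mapping):
    """Multilabel targets -> multiclass ids (label powerset)."""
    return [mapping[frozenset(labels)] for labels in label_sets]


def to_multilabel(class_ids, mapping):
    """Multiclass ids -> the original label combinations."""
    inverse = {class_id: key for key, class_id in mapping.items()}
    return [set(inverse[c]) for c in class_ids]


def hamming_loss(true_sets, pred_sets, all_labels):
    """Fraction of (sample, label) pairs that are predicted wrongly."""
    wrong = sum(len(set(t) ^ set(p)) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * len(all_labels))


# Hypothetical label names, for illustration only.
y_true = [{"Art3"}, {"Art3", "Art6"}, {"Art6"}, {"Art3", "Art6"}]
mapping = label_powerset_fit(y_true)
class_ids = to_multiclass(y_true, mapping)
```

Here the three distinct label combinations become classes 0, 1, and 2, so any multiclass classifier can be trained on `class_ids` and its predictions mapped back with `to_multilabel`.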

ENHANCING RECOMMENDER SYSTEMS WITH DEEP REINFORCEMENT LEARNING AND KNOWLEDGE GRAPH EMBEDDINGS

Fri, 01 Aug 2025 00:00:00 +0800

Deep Reinforcement Learning (DRL), a subfield of machine learning, has shown remarkable potential in various domains, including recommender systems (RSs). This study leverages DRL to improve RS performance by effectively modeling user preferences and addressing their unique needs. A knowledge graph (KG) is constructed using product information, such as features and historical purchase data, to serve as the environment for the Markov Decision Process (MDP) within the DRL framework. The KG is enriched with embeddings to enable efficient navigation and enhance its utility. The Actor-Critic model in DRL employs these embeddings within the MDP, enabling a more accurate representation of user preferences. Central to this approach is the Representation of User Preferences via Path Embedding Propagation (RUPPEP), which serves as the study’s core contribution. Experimental results demonstrate that DRL-based RSs achieve superior performance metrics, with a 13.26% improvement in NDCG for the Amazon Cell Phones dataset and a 15.43% increase for the Amazon Beauty dataset compared to the best SOTA baseline model, highlighting their potential to advance the field of recommendation systems.
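
For readers unfamiliar with the NDCG metric reported above, a minimal sketch follows. It assumes linear gain and computes the ideal ranking from the relevance values of the recommended list itself, which is a common simplification. The evaluation protocol used in the study may differ, and the function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return 0.0 if ideal == 0 else dcg_at_k(relevances, k) / ideal
```

A list that ranks every relevant item first scores 1.0, and the score drops as relevant items slip to lower ranks.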

OPTIMIZING BERTSNN TO ENHANCE SOURCE-TARGET DOMAIN SIMILARITY SCORING FOR CROSS-DOMAIN SENTIMENT CLASSIFICATION OF PRODUCT REVIEWS

Haitao Zhao ; Jasy Liew Suet Yan (Corresponding Author) — Fri, 01 Aug 2025 00:00:00 +0800

Cross-domain sentiment analysis (CDSA) predicts sentiment polarity in a target domain using knowledge from source domains, but existing CDSA methods lack effective source domain selection strategies. This study investigates BertSNN, which combines pre-trained BERT embeddings, a Siamese neural network, and various distance metrics to measure domain similarity and optimize source domain selection for CDSA. First, we experiment with document-level (DocBERT) and sentence-level (SentenceBERT) embeddings with BiLSTM and BiLSTM + CNN neural network configurations to identify the best combination for BertSNN. Second, we explore two distance metrics (Euclidean and Manhattan) alongside shifted cosine similarity to determine the most effective choice for domain similarity scoring. Using product reviews, we test on 25 target domains, examining whether using multiple top-ranked similar source domains improves cross-domain sentiment classification compared to using a single most similar source domain. Results indicate that document-level embeddings, BiLSTM, and shifted cosine similarity produce the best-performing BertSNN, which can select high-quality similar source domains to train a cross-domain sentiment classifier for a target domain, beating two traditional baseline methods (i.e., bag-of-words and TF-IDF representations). Our findings also show that using the top five most similar source domains (k = 5) for training generally improves cross-domain sentiment classification performance compared with using a single most similar source domain (k = 1). This study contributes to CDSA by advancing the understanding of embedding choices and distance metrics within a Siamese neural network for source-target domain similarity scoring and by providing actionable insights on domain selection strategies to improve sentiment analysis models.
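
The scoring functions mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the BertSNN implementation. In particular, "shifted cosine similarity" is assumed here to mean cosine similarity rescaled from [-1, 1] to [0, 1], and the abstract does not confirm this definition. The domain vectors stand in for learned representations, and all names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (L1) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def shifted_cosine(u, v):
    """ASSUMED definition: (1 + cosine similarity) / 2, in [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 0.5  # cosine undefined for a zero vector; treated as neutral
    return (1 + dot / norm) / 2


def top_k_sources(target_vec, source_vecs, k):
    """Rank candidate source domains by similarity to the target domain
    and keep the k most similar ones."""
    ranked = sorted(source_vecs.items(),
                    key=lambda item: shifted_cosine(target_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

Setting `k = 1` selects a single most similar source domain, while `k = 5` corresponds to the multi-source setting compared in the study.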

FUTURE SOCIAL MEDIA USE WITH THE EMERGENCE OF AI IN MALAYSIA AND INDONESIA

Muhammad Zainul Abidin Mohamed Tahir ; Roslina Othman (Corresponding Author) — Fri, 01 Aug 2025 00:00:00 +0800

This study explored the future uses of social media in Malaysia and Indonesia. There were 28.68 million social media users in Malaysia and 167 million in Indonesia. Most of the 10,211 users in the United States said that social media had a largely negative impact. Another survey found that 39% of the users in the United States posited that by 2035, uses of social media would not significantly serve the public good, and 18% said social media was evolving to a worse future for society. The objectives were to explore the potential, emerging trends, and future social media uses in Malaysia and Indonesia through content analysis, the Delphi survey, and triangulation. The content analysis analyzed 40 websites using QDA Miner and WordStat 9 for theme and case identification. The Delphi survey was constructed from existing studies and distributed to 12 experts. The findings revealed that potential social media uses comprised artificial intelligence-enhanced social commerce, socio-political attempts, and the need for control mechanisms enforced by the authorities. The emerging trends included the creation of non-traditional families and social chatbots. The preferable futures of social media use comprised augmented and virtual reality, audience analytics, conversational commerce, AI-powered chatbots, digital citizenship, and virtual campaigns against online scammers. The values of social media use included narrative control, compliance with Maqasid al-Shariah, regulated technological designs, and safety in the virtual environment. The study recommended an expanded ethical guideline for social media use and an action plan for regulating social media use in the future.

A HYBRID CONTEXTUAL EMBEDDING BASED CLUSTERING AND CLASSIFICATION TECHNIQUE FOR UNSUPERVISED IMPLICIT ASPECT CATEGORIZATION IN INDONESIAN REVIEWS

Nur Hayatin; Suraya Alias (Corresponding Author); Lai Po Hung — Fri, 01 Aug 2025 00:00:00 +0800

Aspect categorization groups reviews by aspect categories defined for the review domain. A problem arises when only sentiment features are available as clues for predicting implicit aspects. Yet implicit aspects play an important role in generating a summary: without them, important words needed for analyzing users’ reviews may be lost. Existing techniques struggle to exploit implicit aspects because of limited resources and high computational cost. Hence, we propose an implicit aspect categorization model based on a hybrid contextual embedding-based clustering and classification technique. We developed the model using an unsupervised learning approach, which requires no labelled data for training. A contextual embedding-based clustering technique generates training data from explicit sentences, which is then used to train a classifier for implicit aspect categorization. The proposed model comprises four steps: data preprocessing, sentence feature selection, generating training data through clustering, and categorizing implicit aspects using a classification technique. We experiment with several classification techniques (Logistic Regression, Support Vector Machine, Naïve Bayes, Decision Tree, and Random Forest) to find the best combination for the proposed technique. In the experiments, the combination of contextual embedding-based clustering and the Random Forest algorithm produces higher accuracy than the other classification techniques, reaching an accuracy of 72.04% and an F1 score of 0.6788.

FISH IMAGE ANALYSIS: FUSION OF MOMENT-BASED AND DIRECTIONAL FEATURES IN COLOUR SPACE

Fri, 01 Aug 2025 00:00:00 +0800

This study introduces an innovative approach to content-based image retrieval (CBIR) specifically designed for fish species identification. The proposed method integrates shape, colour, and texture features using Zernike Moments Invariant (ZMI) and Local Directional Pattern (LDP), applied to the momentgram and the hue channel of the HSV colour space. This fusion ensures invariance to transformations such as rotation, scaling, and translation, enabling robust performance on natural images with varying orientations and quality. The method was evaluated using the Fish4Knowledge dataset, consisting of 27,370 images, with 30% randomly selected as query images. Experimental results demonstrate that the proposed method achieved a mean average precision (MAP) of 84.17%, significantly outperforming comparable state-of-the-art approaches. Statistical analysis using two-tailed paired t-tests confirms its superiority. By combining global shape descriptors, local texture features, and colour properties, this method delivers a comprehensive representation of fish images. The inclusion of moment-based descriptors enhances its robustness against low-resolution images and noise. This research underscores the importance of combining diverse features within CBIR systems and offers a significant improvement in retrieval accuracy, contributing to domain-specific applications such as sustainable fisheries management and aquaculture research.

PERFORMANCE COMPARISON OF ZERO-SHOT AND TWO-SHOT PROMPTING IN DETECTING FAKE NEWS USING LARGE LANGUAGE MODELS

Muhammad Naim Syahmi Roslan ; Masnizah Mohd (Corresponding Author) — Fri, 01 Aug 2025 00:00:00 +0800

Fake news detection is a crucial challenge in Natural Language Processing (NLP), particularly during significant social events like elections and national crises. This study uses the GPT-3.5-Turbo model to test the effectiveness of zero-shot and two-shot prompting in detecting fake news on the PolitiFact and Liar datasets. Zero-shot prompting consists of task instructions without examples, whereas two-shot prompting adds two task-related examples to the prompt. The methodology includes dataset preparation, Large Language Model (LLM) response collection, encoding, and evaluation using metrics such as accuracy, precision, recall, and F1-score. The results show that two-shot prompting increases performance marginally across all metrics when compared to zero-shot prompting. Accuracy on PolitiFact improved from 0.286 to 0.293, while accuracy on Liar improved from 0.220 to 0.226. Precision, recall, and F1-score also showed minor gains. However, these gains were not statistically significant and highlight the model’s difficulty with multi-class classification in the political domain. The GPT-3.5-Turbo model performed better on the PolitiFact dataset, suggesting variability in performance across different datasets. In conclusion, although two-shot prompting provides a slight advantage, GPT-3.5-Turbo’s overall performance remains limited, indicating the need for more sophisticated techniques (such as advanced prompting methods or more powerful LLMs) to enhance fake news detection.
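
The difference between the two prompting styles can be sketched as a prompt builder. The instruction wording, the label set, and the example statements below are hypothetical placeholders, not the prompts used in the study, and no model call is made.

```python
# Hypothetical label set, for illustration only.
LABELS = ("true", "half-true", "false")


def build_prompt(statement, examples=()):
    """Build a classification prompt. With no examples this is zero-shot;
    with two (statement, label) pairs it is two-shot."""
    parts = ["Classify the statement as one of: " + ", ".join(LABELS) + "."]
    for text, label in examples:
        parts.append(f"Statement: {text}\nLabel: {label}")
    parts.append(f"Statement: {statement}\nLabel:")
    return "\n\n".join(parts)


zero_shot = build_prompt("Placeholder claim to verify.")
two_shot = build_prompt(
    "Placeholder claim to verify.",
    examples=[("Placeholder example A.", "true"),
              ("Placeholder example B.", "false")],
)
```

Both prompts end with an open `Label:` field for the model to complete. The only difference is the two worked examples placed before the query.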

ENHANCING SEMANTIC INFORMATION RETRIEVAL (SIR) THROUGH ANTONYMS EXTRACTION FOR RETRIEVING PRECISE COVID-19 INFORMATION

Fri, 01 Aug 2025 00:00:00 +0800

The Semantic Web extends the capabilities of the traditional Web by enabling machines to process and interpret data through an ontology knowledge base. Integrating ontologies into the Web facilitates more accurate and precise searches, task automation, and optimized integration between systems. This work focuses on semantic information retrieval (SIR) for COVID-19-related queries, leveraging ontologies to generate precise search results and antonyms to reduce irrelevant results. By conducting syntactic and semantic analysis, the system expands the search query using the context derived from the ontology. The query is further refined by extracting antonyms via the ontology relations. The refined query is then submitted to the search engine to retrieve more precise results. A ranking module further filters and prioritizes the most pertinent result links. The SIR approach is novel among existing information retrieval systems in two respects: it eliminates irrelevant search results via antonyms, rather than displaying all the results retrieved for the query, and it re-ranks the results semantically. The SIR algorithm demonstrates significant performance improvements for most queries, primarily due to the semantic analysis, antonym addition, and re-ranking processes. On the query dataset, the approach achieved 100% precision and 80% recall, outperforming existing search engines in these metrics.

DETECTING EMOTIONAL STATE OF DEPRESSION IN SOCIAL MEDIA POSTS USING LOGISTIC REGRESSION-RECURSIVE FEATURE ELIMINATION

Wang Li; Wandeep Kaur (Corresponding Author); Chen Wangmei — Fri, 01 Aug 2025 00:00:00 +0800

Depression detection through social media has garnered widespread attention due to its potential for early intervention in mental health issues. This study aims to detect depressive users based on the content they share on social media using machine learning techniques. Given the complexity and diversity of depressive text, existing research still falls short in exploring comprehensive feature extraction techniques. To address this challenge, this study proposes an integrated framework for detecting depressive tendencies through multi-dimensional feature extraction and selection techniques. The proposed approach combines TF-IDF with N-grams, DistilBERT embeddings, and SentiWordNet to capture linguistic, semantic, and emotional features. Additionally, logistic regression-based recursive feature elimination (LR-RFE) is employed to optimize high-dimensional feature sets by reducing redundancy and emphasizing key indicators. Experiments conducted on the CLEF eRisk dataset revealed varying levels of effectiveness across individual feature extraction methods. Notably, multi-feature integration significantly enhanced classification performance, achieving an accuracy of 80.8% and an F1 score of 80.54% with the combined feature set. Feature selection further improved model efficiency and performance. These findings contribute to advancing automated depression detection and lay a foundation for developing scalable and interpretable machine learning models for mental health assessment.