AI Commons Health & Wellbeing Hackathon Solutions
Overview
The AI Commons Project is a proof of concept of a new methodology of developing Artificial Intelligence solutions that allows anyone, anywhere to benefit from the possibilities that AI can provide. The project aims to increase/improve the accessibility, reproducibility, contextualization and enhancement of Artificial Intelligence solutions globally and especially in emerging markets.
The project aims to demonstrate how a global community of AI experts can learn and co-create mutually beneficial solutions, with the opportunity for cross-country incremental enhancement.
Na Lie
Statement of Purpose
Introduction
Data Science Nigeria
Nelson Ogbeide; Ojeabulu O. Gift; Comfort Igboko; Caleb Emelike; Precious Cadeton
Problem Definition
A study revealed that all mobile subscribers in Nigeria receive spam SMS, at an average of 2.45 spam messages per day. In recent times, the proliferation of fraud and fake information has made it challenging to identify trustworthy messages and information. Fraudsters exploit this channel as a major vehicle for fraud, which increases the need for a clear view of the reliability of online content.
All mobile subscribers in Nigeria are affected, which makes everyone with a mobile phone in Nigeria susceptible to fraud.
Individuals and organizations often send broadcasts telling people to disregard certain types of messages because they are scams, and to refrain from forwarding unverified messages.
Solution
NaLie is a solution that provides a real-time validation system for text messages. It uses CrowdML and NLP for the detection and verification of text-based financial fraud and fake messages. It was first released in 2019.
Poster presentation: Here
The output of the solution is the class that the text message belongs to. The text classes are Fake BVN, Investment Scam, 419 Scam, Fake job and Good Text. The solution validates input text based on two major criteria: a database method (i.e. sender ID, profile and author) and a feature-based method (message content and linguistic features).
Mobile subscribers.
Technical expertise required to build the solution includes programming skills, Natural Language Processing, and software/ML engineering.
N/A.
More data should be used for training and evaluation.
Usage
The aim of the solution is to proactively detect and prevent text-based financial fraud and fake messages. For instance, a mobile subscriber receives a text message asking her to click a link to update her Bank Verification Number (BVN) details because of a system update supposedly under way at her bank. As soon as the message arrives, a NaLie notification pops up to warn the user that the message is fraudulent.
Anyone who owns a mobile phone.
The solution receives text as input from the user and returns a response/notification to the user’s screen.
The solution can be made to read the user’s incoming texts automatically and return an appropriate notification.
Domain and Applications
The application was tested on text messages in the financial and labour sectors.
NaLie has been developed into an app that is available for download on the Play Store. The application's feedback on the Play Store is 90% positive.
Dataset
The dataset comprises fake/fraudulent messages in the Nigerian financial and labour sectors. It contains a variety of fake messages, received via text and online, relating to bank alerts and job alerts.
The dataset was created mainly for this project, but it can be extended and used for problems of similar scope.
The dataset was created by the research team, which includes the four solution implementers listed above.
Composition
Each instance represents a text message or online message received by users, together with the class the message belongs to.
Training set: 22,867 instances. Testing set: 5,880 instances, including NaN values.
The dataset is fairly representative of the fraudulent messages received in any location in Nigeria.
Each instance comprises a text column and one column for each of the five label classes. The columns are Text, Fake BVN, Investment Scam, 419 Scam, Fake Job and Good Text. A sample of the text column is “Dear Customer, We are running a compulsory security enrollment of all ATM cards issued by banks in Nigeria. CBN as the apex body will block all cards not enrolled within 24hrs of receiving this notification. Visit link: http://217.71.50.11/~update to secure your card now.” A text can belong to only one class, i.e. for every instance exactly one label is true. If Fake BVN is 1 (true), then all other classes are 0 (false).
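The one-class-per-message constraint described above can be checked mechanically. The following is a minimal sketch, assuming each row is held as a Python dict keyed by the column names given in this section. The helper name `active_class`, the exact column spellings and the sample row are hypothetical and are not taken from the project code.

```python
# Label columns as named in this section (exact spellings are an assumption).
LABEL_COLUMNS = ["Fake BVN", "Investment Scam", "419 Scam", "Fake Job", "Good Text"]


def active_class(row):
    """Return the single label set to 1, or raise if the row is not one-hot."""
    active = [name for name in LABEL_COLUMNS if row.get(name) == 1]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active label, found {len(active)}")
    return active[0]


# Hypothetical row for illustration only; not drawn from the dataset.
sample_row = {
    "Text": "Dear Customer, visit the link to secure your card now.",
    "Fake BVN": 1,
    "Investment Scam": 0,
    "419 Scam": 0,
    "Fake Job": 0,
    "Good Text": 0,
}

print(active_class(sample_row))
```

A check of this kind could be run over every row before training, so that rows with no label or with several labels are caught early.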
Yes, the label feature has five classes namely, Fake BVN, Investment Scam, 419 Scam, Fake job and Good Text for the classification task.
No. But the data was split into train and test in the ratio 70:30 for this project.
Collection Process
Yes, it was randomly sampled from the non-exhaustive data available on the internet.
The dataset was crowdsourced and participation was voluntary.
Preprocessing/Cleaning/Labelling
Yes, tokenisation was done and each instance of text was labelled by experts. Messages in images were converted to text using OCR.
Yes, Check Here
sklearn, NLTK, spaCy…… are all open source.
Uses
Tasks related to the solution.
Maintenance
Data Science Nigeria
A message can be sent by filling the form at https://datasciencenigeria.org/contact/
Yes. The dataset will be updated from time to time by Data Science Nigeria. When there’s an update, the documentation will be updated as well.
Yes. Any changes made will be updated in the documentation.
Yes, users can contribute to the dataset. This is encouraged and compensation points are assigned to every submission to drive adoption.
Dataset Publicly Available
Model
Model Details
Model date: Sep. 2019. Model version: v2. Several algorithms were tried, including KNeighbors, XGBoost, Random Forest, Decision Tree and LGBMClassifier. LGBMClassifier was the best-performing algorithm and was used to train the final model.
Data preprocessing followed standard text classification data preparation, using TfidfVectorizer to convert the text into a machine-readable format.
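To illustrate the vectorization step, here is a minimal sketch of turning raw messages into TF-IDF features with scikit-learn's `TfidfVectorizer`. The three messages are hypothetical, and default vectorizer settings are assumed because the source does not state the parameters that were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages for illustration; not drawn from the project dataset.
messages = [
    "Update your BVN now to avoid account suspension",
    "Your salary has been credited to your account",
    "Invest today and double your money in 24 hours",
]

# Default settings are an assumption; the project's parameters are not documented here.
vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and returns a sparse matrix
# with one row per message and one column per vocabulary term.
features = vectorizer.fit_transform(messages)

print(features.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

The fitted vectorizer must be reused (via `transform`) on test or incoming messages so that they are mapped onto the same vocabulary as the training data.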
Evaluation
Testing the Solution
The data was split into train and test sets in the ratio 70:30, and performance was evaluated on accuracy.
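A 70:30 split of this kind can be sketched with scikit-learn's `train_test_split`. This is an illustration under assumptions: the toy data, the `random_state` seed and the absence of stratification are all hypothetical, since the source does not say how the split was performed.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten messages with placeholder class labels.
texts = [f"message {i}" for i in range(10)]
labels = ["Good Text"] * 5 + ["Fake BVN"] * 5

# test_size=0.3 holds out 30% of the rows; random_state is an assumed seed
# included only so that the split is repeatable.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

print(len(train_texts), len(test_texts))
```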
Testing by Third Party
Result
Result Details
OneVsRestClassifier(estimator=LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0),
n_jobs=None)
Training runs = 100, Evaluation runs = 1
Accuracy = (TP + TN)/(TP + TN + FP + FN)
where: TP = True positive; FP = False positive; TN = True negative; FN = False negative
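The accuracy formula above translates directly into code. The sketch below uses hypothetical counts chosen only to show the arithmetic; they are not results from this project.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Hypothetical confusion counts for illustration only.
print(accuracy(tp=40, tn=45, fp=5, fn=10))
```

With these counts, 85 of 100 predictions are correct, so the function returns 0.85.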
Environment
Python >= 3.5
scikit-learn >= 0.20.0
lightgbm >= 2.2.0
pandas >= 0.23.4
numpy >= 1.14.2
joblib >= 0.12.5
Yes, the solution is deployed.
Steps to Reproduce the Solution
Result Details
The solution can be reproduced by running all cells in the notebook in the link Here
Safety
General
Concept Drift
About 80% precision.
No. Though users can upload data, it only affects the solution when the model is updated with the newly ingested data. The process is not automated.
A test will be run to ensure correctness of output.
Yes, it is tested periodically and the model is updated every 6 months.
Security
We don’t collect usage data. Also, the data is anonymized for training.
Initiate system shutdown in order to identify and solve the problem.