APKs collected from the Google Play Store are filtered using VirusTotal to retain only benign APKs. VirusTotal aggregates many antivirus products and online scan engines, so it can catch malware that a user's own antivirus may have missed and can help rule out false positives. The Drebin dataset provides malicious Android APKs, and PRAGuard provides obfuscated malicious APKs.
After the datasets are collected, the potentially relevant features, both static and dynamic, are extracted. The feature reduction phase then reduces the extracted features to form a knowledge base. Unnecessary features may lead to misclassification, so feature reduction is an important part of the learning process. A feature vector table is constructed to represent all the features present in an APK in a simple form and to serve as input to the various learning models.
The machine learning approach classifies inputs according to knowledge previously learned from the training datasets, and can thus identify whether an APK is benign, malicious, or obfuscated. The deep learning approach performs the same three-way classification. The main difference between the two here is that machine learning needs the reduced feature vector table, whereas deep learning performs the feature reduction itself. Association rule mining is then used to classify the obfuscated samples into seven categories: trivial, string encryption, class encryption, reflection, trivial + string encryption, trivial + string encryption + reflection, and trivial + string encryption + reflection + class encryption.
Fig 3.1 Proposed Architecture
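The feature vector table and a very simple form of feature reduction can be sketched as follows. This is a minimal illustration, not the reduction method used in this work: the function names, the example feature names, and the choice to drop only constant columns are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def build_feature_table(apps: Dict[str, Set[str]]) -> Tuple[List[str], Dict[str, List[int]]]:
    """Turn per-APK feature sets into a binary feature vector table.

    Returns the sorted column names and one 0/1 row per APK.
    """
    columns = sorted(set().union(*apps.values()))
    table = {
        name: [1 if feature in features else 0 for feature in columns]
        for name, features in apps.items()
    }
    return columns, table


def drop_constant_features(columns: List[str], table: Dict[str, List[int]]):
    """Illustrative reduction step: remove columns that have the same
    value for every APK, since they cannot separate the classes."""
    rows = list(table.values())
    keep = [i for i in range(len(columns)) if len({row[i] for row in rows}) > 1]
    reduced_columns = [columns[i] for i in keep]
    reduced_table = {name: [row[i] for i in keep] for name, row in table.items()}
    return reduced_columns, reduced_table


# Hypothetical feature sets for two APKs.
apps = {
    "a.apk": {"permission.INTERNET", "api.SmsManager"},
    "b.apk": {"permission.INTERNET"},
}
columns, table = build_feature_table(apps)
reduced_columns, reduced_table = drop_constant_features(columns, table)
```

In this example permission.INTERNET is present in both APKs, so it is dropped and only api.SmsManager remains as a distinguishing column.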
Three types of datasets are used: benign, malicious, and obfuscated. The benign APKs are collected from 9apps and the Google Play Store, the malware APKs from Drebin and Malgenome, and the obfuscated APKs from PRAGuard.
9apps and Google Play Store: These are online app stores from which applications and media for entertainment, office work, education, games, design, wallpapers, and so on can be downloaded. Commonly searched apps are available on both stores, and apps can be downloaded in small sizes that fit on a device without slowing it down. 9apps is a user-friendly store with innovative content.
Drebin: A project sponsored by the German Federal Ministry of Education and Research. It is a free dataset composed of 5,560 applications from 179 different malware families. The samples were collected between 2010 and 2012.
Malgenome: This dataset contains 1,200 malware samples that cover the majority of existing Android malware families. The samples are systematically characterized from various aspects, including their installation methods, activation mechanisms, and the nature of the malicious payloads they carry.
Android PRAGuard: This dataset contains 10,479 samples, obtained by obfuscating the Malgenome and the Contagio Minidump datasets with seven different obfuscation techniques: trivial, string encryption, class encryption, reflection, trivial + string encryption, trivial + string encryption + reflection, and trivial + string encryption + reflection + class encryption.
3.3 Feature Extraction
Feature extraction is the process of transforming the input data into a set of features that represents it well. The types of features considered here are static features, dynamic features, and obfuscation-resilient features.
Android apps need to request permissions, either at installation time or while the app is running. Each request specifies a system permission that the user must grant for the app to operate correctly. App permissions are highly granular. An app may request more permissions than it actually uses, but this is not necessarily malicious: it may simply be requesting extra permissions in anticipation of their use in future versions.
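As an illustration, the requested permissions can be read from the uses-permission elements of the manifest. The sketch below assumes the manifest has already been decoded to plain-text XML (the manifest inside an APK is stored in a binary XML format, so a decoding tool is needed first). The function name and the sample manifest are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace used for android:* attributes in AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def requested_permissions(manifest_xml: str) -> list:
    """Return the sorted, de-duplicated permission names requested in a
    decoded (plain-text) AndroidManifest.xml."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"
    permissions = {el.get(name_attr) for el in root.iter("uses-permission")}
    permissions.discard(None)  # ignore malformed entries without a name
    return sorted(permissions)


# Hypothetical manifest used only for this example.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <uses-permission android:name="android.permission.INTERNET"/>
    <application/>
</manifest>"""

permissions = requested_permissions(SAMPLE_MANIFEST)
```

Each returned permission name can then be used as one binary column of the feature vector table.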
3.3.2 Intent Filters
An app's manifest often declares the messages, called Intents, that it can receive and process, using filters that indicate the Intent properties of interest. This information can be useful for identifying malware (e.g., apps that listen for Intents indicating system actions). However, an app may instead declare filters in code, which allows for another form of obfuscation.
3.3.3 App components
A variety of component types with specific functionalities (e.g., components that provide GUIs and others that run background services) are declared within an Android app's manifest. However, features based on the presence of particular components, especially those that simply track component names as some approaches do, can be obfuscated easily through renaming.
A type of feature often ignored by existing Android malware detection and classification techniques is the use of reflection and, with it, dynamic class loading. Reflection allows an app to inspect or modify itself at runtime and to violate certain language constructs related to information hiding (e.g., by accessing the private members of a class). At the same time, Android malware increasingly uses reflection to obfuscate its malicious behaviour.
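The manifest-based features of the last two subsections can be sketched together. The example below counts components per type instead of recording their names, which is one possible way to avoid the renaming problem described above, and collects the Intent actions declared in intent filters. It assumes a decoded plain-text manifest, and the function name and sample manifest are hypothetical. Filters registered in code are not visible to this kind of manifest scan.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"
COMPONENT_TAGS = ("activity", "service", "receiver", "provider")


def component_features(manifest_xml: str):
    """Return (component counts per type, sorted Intent actions declared
    in the intent filters of those components)."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"
    counts = {tag: 0 for tag in COMPONENT_TAGS}
    actions = set()
    for tag in COMPONENT_TAGS:
        for component in root.iter(tag):
            counts[tag] += 1
            # <action> elements only occur inside <intent-filter> children.
            for action in component.iter("action"):
                name = action.get(name_attr)
                if name:
                    actions.add(name)
    return counts, sorted(actions)


# Hypothetical manifest used only for this example.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
  <application>
    <activity android:name=".a">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
      </intent-filter>
    </activity>
    <receiver android:name=".b">
      <intent-filter>
        <action android:name="android.intent.action.BOOT_COMPLETED"/>
      </intent-filter>
    </receiver>
    <service android:name=".c"/>
  </application>
</manifest>"""

counts, actions = component_features(SAMPLE_MANIFEST)
```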
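One simple way to obtain reflection-related features is to count references to reflection and dynamic class loading APIs in the disassembled code. The sketch below assumes the DEX code has been disassembled to smali text by an external tool. The list of signatures is a small illustrative selection, not the full set used in this work, and the feature names are hypothetical.

```python
# Substrings that identify reflection and dynamic class loading calls in
# smali (disassembled DEX) text. Illustrative selection only.
REFLECTION_SIGNATURES = {
    "class_for_name": "Ljava/lang/Class;->forName(",
    "method_invoke": "Ljava/lang/reflect/Method;->invoke(",
    "set_accessible": "->setAccessible(Z)V",
    "dex_class_loader": "Ldalvik/system/DexClassLoader;-><init>(",
}


def reflection_features(smali_text: str) -> dict:
    """Count how often each reflection-related signature is referenced."""
    return {
        name: smali_text.count(signature)
        for name, signature in REFLECTION_SIGNATURES.items()
    }


# Hypothetical smali fragment used only for this example.
SAMPLE_SMALI = """
invoke-static {v0}, Ljava/lang/Class;->forName(Ljava/lang/String;)Ljava/lang/Class;
invoke-virtual {v4, v5}, Ljava/lang/reflect/Field;->setAccessible(Z)V
invoke-virtual {v1, v2, v3}, Ljava/lang/reflect/Method;->invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;
"""

features = reflection_features(SAMPLE_SMALI)
```

The setAccessible entry corresponds to the information-hiding violation mentioned above, since it is the call that makes private members reachable through reflection.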
3.3.5 Native calls
A capability of Android apps that is almost never taken into account is the use of native code. A review of existing work found no analysis that uses machine learning and static analysis to examine the internal behaviour of an app's native binaries. Because native binaries are largely ignored, malware authors can package malicious payloads in them.
3.3.6 API Calls
An application programming interface (API) is a set of procedures, protocols, and tools for building software applications. Every time a call is made to a server on behalf of an application using an SDK or an API, it counts as an API request or API call. API calls represent specific operations that client applications can invoke at runtime to perform tasks. Examples of manager classes whose API calls are considered are TelephonyManager, SmsManager, PowerManager, ConnectivityManager, and NotificationManager.
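Calls to these manager classes can be turned into binary features by parsing the invoke instructions of the disassembled code. The sketch below assumes smali text as input. The regular expression and the function name are illustrative assumptions, and the tracked classes are the five managers listed above with their Android package names.

```python
import re

# Manager classes from the text, with their fully qualified names.
TRACKED_CLASSES = (
    "android.telephony.TelephonyManager",
    "android.telephony.SmsManager",
    "android.os.PowerManager",
    "android.net.ConnectivityManager",
    "android.app.NotificationManager",
)

# Matches smali invoke instructions and captures the target class, e.g.
# "invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()..."
INVOKE_RE = re.compile(r"invoke-[\w/]+\s+\{[^}]*\},\s+L([\w/$]+);->")


def api_call_features(smali_text: str) -> dict:
    """Return a 0/1 feature per tracked manager class, set to 1 when any
    method of that class is invoked in the given smali text."""
    called = {m.group(1).replace("/", ".") for m in INVOKE_RE.finditer(smali_text)}
    return {cls: int(cls in called) for cls in TRACKED_CLASSES}


# Hypothetical smali fragment used only for this example.
SAMPLE_SMALI = """
invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
invoke-static {}, Landroid/telephony/SmsManager;->getDefault()Landroid/telephony/SmsManager;
invoke-virtual {v1, v2}, Ljava/lang/StringBuilder;->append(Ljava/lang/String;)Ljava/lang/StringBuilder;
"""

features = api_call_features(SAMPLE_SMALI)
```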
3.3.7 Hardware and Software Features
Android devices come supplied with several hardware features that can be used to build apps. The purpose of a <uses-feature> declaration in the manifest is to inform any external entity of the set of hardware and software features on which the application depends. Because feature support can vary across Android devices, the <uses-feature> element serves an important role in letting an application describe the device-variable features that it uses.
3.3.8 System Call
A system call is the programmatic way in which a computer program requests a service from the kernel of the operating system on which it is executed. As a result, any functionality provided by a software application eventually reduces to a fixed set of system calls. The patterns in the system calls made by a mobile application therefore provide insight into the behaviour of an APK.
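A common way to capture such patterns is a system call frequency vector. The sketch below assumes the calls were recorded as strace-style text lines during dynamic analysis. The tracing setup, the regular expression, and the vocabulary are assumptions for illustration and are not specified in this work.

```python
import re
from collections import Counter

# Matches the system call name at the start of an strace-style line,
# optionally preceded by "[pid 123]" or a bare process id.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscall_counts(trace_lines) -> Counter:
    """Count how often each system call name appears in the trace."""
    counts = Counter()
    for line in trace_lines:
        match = SYSCALL_RE.match(line.strip())
        if match:  # signal and exit lines do not match and are skipped
            counts[match.group(1)] += 1
    return counts


def syscall_vector(counts, vocabulary) -> list:
    """Project the counts onto a fixed, ordered list of system calls."""
    return [counts.get(name, 0) for name in vocabulary]


# Hypothetical trace lines used only for this example.
TRACE = [
    'openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY) = 3',
    'read(3, "Processor"..., 1024) = 512',
    "[pid  2041] ioctl(9, BINDER_WRITE_READ, 0x7ffc) = 0",
    '2041 read(3, "", 1024) = 0',
    "--- SIGCHLD {si_signo=SIGCHLD} ---",
]

counts = syscall_counts(TRACE)
vector = syscall_vector(counts, ["ioctl", "openat", "read", "write"])
```

Using a fixed vocabulary keeps the vector length the same for every APK, which is what the feature vector table requires.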
3.3.9 Package Name
The package name is a unique name that identifies a specific app. Generally, the package name of an app has the format domain.company.application, but the choice of name is entirely up to the app's developer. The domain portion is the domain extension, such as com or org, used by the developer of the app. The company portion is usually the name of the developer's company or product. The final application portion usually describes the app itself.
This feature checks whether the time at which the app was signed and the time at which the certificate was generated are close together. The intuition is that automated repackaging tools modify existing apps and sign them with auto-generated, ad hoc certificates before distribution. Thus, if the certificate creation date is close to the date on which the app was signed, this can reveal the use of an automated repackaging tool. Apps for which the time difference is below ten minutes are marked. For each certificate, features are also built from the time zone and the length of the common name string, which helps to identify similar certificates generated by repackaging tools.
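The certificate checks can be sketched as below. The example assumes that the certificate creation time, the app signing time, and the certificate common name have already been obtained by a separate parsing step that is not shown. The function name and feature names are hypothetical, and representing the time zone as a UTC offset in minutes is one possible encoding.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the text: a difference below ten minutes is marked.
REPACKAGING_WINDOW = timedelta(minutes=10)


def certificate_features(cert_created: datetime, app_signed: datetime,
                         common_name: str) -> dict:
    """Build the certificate-based features.

    Both datetimes must be timezone-aware (or both naive), otherwise
    Python refuses to subtract them.
    """
    delta = abs(app_signed - cert_created)
    offset = cert_created.utcoffset()
    return {
        "signed_near_cert_creation": delta < REPACKAGING_WINDOW,
        "cert_tz_offset_minutes": (
            int(offset.total_seconds() // 60) if offset is not None else None
        ),
        "cert_cn_length": len(common_name),
    }


# Hypothetical values used only for this example.
tz = timezone(timedelta(hours=8))
suspicious = certificate_features(
    datetime(2012, 3, 1, 10, 0, tzinfo=tz),
    datetime(2012, 3, 1, 10, 4, tzinfo=tz),
    "Android Debug",
)
ordinary = certificate_features(
    datetime(2012, 3, 1, 10, 0, tzinfo=tz),
    datetime(2012, 3, 3, 10, 0, tzinfo=tz),
    "Android Debug",
)
```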
3.3.11 DEX-based Features
Each method is tagged with the Android framework libraries that it invokes. These tags represent the classes of APIs used by the method and are encoded as binary features. The apps are also scanned for the presence of string variables in DEX files containing keywords that were obtained by reverse engineering malware from the Malgenome dataset. For instance, su relates to executing code with superuser privileges, while emulator and sdk suggest that the app checks for the presence of an emulator.
3.3.12 Inconsistent Representations
Check whether the file extensions match the file contents (as identified by the file header or a magic number) to allow highlighting apps that try to hide shell scripts or ELF binaries as images or other resources. Such inconsistencies are good indicators of malicious intent in some (e.g., Malgenome) but not all (e.g., Marvin) datasets, potentially owing to trends in malware writing and repackaging tools.
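The extension check can be sketched by comparing the first bytes of a file with the magic number expected for its extension. The table of extensions below is a small illustrative subset, and the function and field names are hypothetical.

```python
import os

# Expected leading bytes for a few common resource types (subset only).
MAGIC_BY_EXTENSION = {
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
    ".gif": (b"GIF87a", b"GIF89a"),
}

# Leading bytes of the payload types mentioned in the text.
PAYLOAD_MAGIC = {
    "elf": b"\x7fELF",
    "shell_script": b"#!",
}


def inspect_resource(file_name: str, header: bytes) -> dict:
    """Flag files whose leading bytes do not match their extension and
    report whether the content is actually an ELF binary or a script."""
    extension = os.path.splitext(file_name)[1].lower()
    expected = MAGIC_BY_EXTENSION.get(extension)
    mismatch = expected is not None and not header.startswith(expected)
    payload = None
    if mismatch:
        for kind, magic in PAYLOAD_MAGIC.items():
            if header.startswith(magic):
                payload = kind
                break
    return {"extension_mismatch": mismatch, "hidden_payload": payload}


# Hypothetical file names and headers used only for this example.
disguised_elf = inspect_resource("logo.png", b"\x7fELF\x01\x01\x01")
real_png = inspect_resource("logo.png", b"\x89PNG\r\n\x1a\n\x00\x00")
disguised_script = inspect_resource("bg.JPG", b"#!/system/bin/sh\n")
```

Files with extensions that are not in the table are never flagged by this sketch, so a real implementation would need a much larger table or a file-type detection library.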
3.3.13 Incognito Apps
In some cases, the payload of a malicious app is in an APK that is disguised among the assets of the host app. To capture this malicious payload, both syntactic and resource-centric features are recursively extracted for any incognito APK or DEX file found within the app. These features are grouped under a separate category to keep them apart from the statistics of the host app. For instance, permission.INTERNET counts the static number of accesses to the Internet, while icg.permission.INTERNET does the same for the incognito app.
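The recursive search for incognito APK and DEX files can be sketched with the standard zipfile module, since an APK is a ZIP archive. The sketch identifies embedded files by their leading bytes rather than by their names. Treating every embedded ZIP archive as a candidate APK, the depth limit, and the "!/" path separator are simplifying assumptions of this example. Only the icg. prefix is taken from the text.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # start of a ZIP (and therefore APK) file
DEX_MAGIC = b"dex\n"       # start of a DEX file


def find_incognito(apk_bytes: bytes, prefix: str = "", max_depth: int = 3) -> list:
    """Return the paths of APK/DEX files embedded inside an APK,
    descending into nested archives up to max_depth levels."""
    found = []
    if max_depth == 0:
        return found
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            data = archive.read(info)
            path = prefix + info.filename
            if data.startswith(DEX_MAGIC):
                # Top-level DEX files of the host app are its own code.
                if prefix or "/" in info.filename:
                    found.append(path)
            elif data.startswith(ZIP_MAGIC) and zipfile.is_zipfile(io.BytesIO(data)):
                found.append(path)
                found.extend(find_incognito(data, path + "!/", max_depth - 1))
    return found


def icg_feature_name(name: str) -> str:
    """Move a feature into the separate category for incognito apps."""
    return "icg." + name
```

For example, a host APK with a nested APK stored as assets/update.png would yield that path and the path of the DEX file inside it, and a permission.INTERNET feature extracted from the nested APK would be recorded as icg.permission.INTERNET.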
3.3.14 Native Code
The assets of the app are scanned to identify any native ELF files. These files are parsed to extract features from the header and from the individual sections of the file: the number of entries in the program header, the program header size, and the number and size of the section headers. For individual sections, the section flags are extracted to determine whether the section is W (writable), A (allocatable), X (executable), M (mergeable), S (strings), etc., and these flags are used as Boolean features. Within code sections, instructions invoking critical system calls are also searched for, such as ioctl, which is used for Android's inter-process and inter-component communication.
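The header and section flag features can be sketched with the standard struct module, following the field layout of the ELF file header. The function and feature names are hypothetical, and the sketch reports the per-entry sizes stored in the header. It does not walk the section table or disassemble code sections.

```python
import struct

# Section flag bits as defined by the ELF format.
SECTION_FLAGS = {"W": 0x1, "A": 0x2, "X": 0x4, "M": 0x10, "S": 0x20}


def elf_header_features(data: bytes) -> dict:
    """Extract program/section header counts and entry sizes from the
    ELF file header (supports 32-bit and 64-bit, either byte order)."""
    if len(data) < 16 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = data[4] == 2                 # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little-endian
    # Fields after the 16-byte e_ident: e_type, e_machine, e_version,
    # e_entry, e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum,
    # e_shentsize, e_shnum, e_shstrndx. Addresses/offsets are 8 bytes in
    # 64-bit files and 4 bytes in 32-bit files.
    layout = endian + ("HHIQQQIHHHHHH" if is_64bit else "HHIIIIIHHHHHH")
    fields = struct.unpack_from(layout, data, 16)
    return {
        "ph_entry_size": fields[8],
        "ph_entries": fields[9],
        "sh_entry_size": fields[10],
        "sh_entries": fields[11],
    }


def section_flag_features(sh_flags: int) -> dict:
    """Decode a section's sh_flags value into Boolean features."""
    return {name: bool(sh_flags & bit) for name, bit in SECTION_FLAGS.items()}
```

For instance, a section with flags value 0x6 decodes to allocatable and executable but not writable, which is the usual combination for a code section.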