Project Specification
Summary
This project aims to develop an advanced web-based solution that automatically scans Android applications, primarily those on the Google Play Store, for potential privacy violations resulting from excessive or unjustified tracking. The system will integrate large-scale static analysis of APKs with privacy-focused natural language processing based on a fine-tuned Large Language Model (LLM). This combination enables the generation of a rich "privacy health score" based on a number of factors, including which permissions the application requests and the language of its privacy policy and terms of service.
The fundamental function of the tool is to bridge the gap between hard-to-understand technical privacy data and average consumers by deconstructing technical jargon and legalese into clear, plain language. By providing users with an easy-to-understand summary alongside the privacy health score, the app puts users in a position to make informed decisions about the apps they install and the permissions they grant. This addresses a critical transparency problem in the mobile app ecosystem, where users tend to accept app permissions without fully realising their implications.
The project also has a research aspect, in which the tool will be used to study the 100 most-downloaded apps in ten different categories on the Google Play Store. The findings of this research will be published on the same platform, with helpful information on typical privacy practices and areas for improvement. Through this concurrent focus on technological innovation and empirical study, the project seeks to promote user awareness, foster privacy-by-design approaches, and help raise data protection standards across the Android app market.
Background and References
Recent investigative reporting from RTÉ's Prime Time (2025) revealed a profoundly concerning privacy vulnerability in mobile data brokerage markets. The investigation discovered that granular movement data of tens of thousands of smartphones active in Ireland was available to be purchased commercially, enabling third parties to infer home and work addresses of phone owners after these phones had entered sensitive locations, including prisons and military bases (RTÉ, 2025).
Though the report mentioned that individual identities were not specifically listed, the researchers showed that correlating location trails with residency patterns allowed for easy identification of specific individuals and their daily routines. The Data Protection Commission (2025) states that such data trading poses serious national‑security and personal‑safety risks. Unscrupulous entities could exploit the ability to identify home addresses or track vulnerable individuals, pointing to dire deficiencies in privacy enforcement across mobile ecosystems (Irish Legal News, 2025).
The problem is exacerbated by the sheer range of mobile‑application availability. The Google Play Store hosts millions of apps spanning areas such as communication, banking, education, gaming, navigation, and health (Android Developers, 2025). Apps follow numerous monetisation models: one‑time purchase, subscription, or free download supported by advertising. In all of these models, apps must request permissions from users before accessing protected device functionality such as geolocation, storage, or contact details (Bashir, 2022).
But most users accept by default, seldom reading the terms put in front of them. These permissions, defined in an app's AndroidManifest.xml file, determine which system resources the software will use and, by implication, how user data can be collected or shared (Android Developers, 2025). The manifest thus forms a critical technical document for understanding an app's inner workings. But it is not an end-user document, and its implications are seldom obvious.
Literature corroborates that while the Android platform requires declaration of requested permissions, it does not compel developers to reveal why the respective permissions are required or how the data collected will subsequently be used (NCBI, 2022). Consequently, privacy notices often use ambiguous language that obscures data‑handling practices.
To alleviate this opacity, recent advancements in machine‑learning research demonstrate how language models can automatically summarize and interpret privacy documents. For instance, MobileLLaMA 2.7B (Base), an open‑source model, is demonstrated to be capable of reading large text files, including privacy notices, to highlight data‑collection intents and infer risk categories (MobileLLaMA Team, 2024; MobileLLaMA Research Group, 2023). Adding an interpretable LLM to the privacy‑audit pipeline therefore presents a plausible avenue toward user‑level transparency.
Regulatory policies support this necessity. The General Data Protection Regulation (GDPR) and the forthcoming European Union Artificial Intelligence Act (2025) mandate that data subjects provide informed, express consent for all processing operations (GDPR.eu, 2024; Artificial Intelligence Act, 2025). These regulations prioritise interpretability and accountability: fundamental values that automated explanation systems like the tool proposed herein can realise directly in user interfaces.
Despite the presence of such policies, users either routinely dismiss consent requests or remain unaware of how apps recycle their data (Dawinderapps, 2023). Tools are therefore needed to narrow the gap between technical disclosure and human understanding.
The tool proposed for this project will bridge that gap by pulling information out of manifest and policy documents and converting it into understandable, structured information regarding what data an app is collecting, why it is doing so, and how the data can be used or shared. Going beyond technical analysis, the interface will display these findings in an easy‑to‑use, tutorial format, augmented by a "how‑to" guide, to enable non‑experts to make privacy‑informed decisions. Briefly, by fusing open‑source LLMs with robust data‑collection pipelines, the project tackles growing worries over privacy transparency amidst mobile‑app expansion.
Proposed Approach
The proposed approach is a cloud-hosted web application. I plan to call it “PrivacyTotal”, with inspiration coming from the popular OSINT tool “VirusTotal”. It would have a full workflow that includes frontend/backend communication, privacy auditing and accurate search.
I plan to develop the LLM on my local machine, then upload the trained model into the cloud environment for it to interact with clients. This would include the development of the web application, which will be equipped with the interaction page for the tool, information about the tool, and the results of the research portion of the project.
The webpage would welcome the user, then prompt them to input an application name alongside the developer's name for the application. These frontend inputs are then combined into a search query for the backend search logic, which searches for the application listing on the Google Play Store.
The backend receives the app name request and triggers the crawler/search component to search for the application on the Google Play Store. It feeds the APK to the LLM, which analyses all the relevant metadata and returns the results of its analysis.
The report is then generated and output in a human-readable format for the client to view and would contain information about the application and its connections and permissions.
For this to work, I need to develop a tool using a pre-existing LLM and tune it to my specific needs in analysing the APK files from the Google Play Store. It would do this by analysing the “AndroidManifest.xml” file that each application provides.
This file contains:
The permissions declaration, which lists the system permissions the user must grant the application. For example, a map application that requires location access will prompt the user for their location data so that it can function efficiently and seamlessly.
The components of the app, including all activities, services, broadcast receivers, and content providers.
Activities are single screens on the app, such as a login screen or settings screen.
Services are used to run tasks in the background, such as a music app playing songs while the app is minimised/not in focus.
Broadcast receivers are used to listen for system-wide messages, such as a low battery broadcast or when the phone has finished booting.
Content Providers share data between apps, such as the “Contacts” app providing contact data that other apps can use, like Snapchat, which requests contact data from your “Contacts” app to find people in your contacts on Snapchat.
In short, the file is somewhat of a “blueprint” for the application, listing all the important parts of the application.
This file is written in “XML” which is similar to “HTML”, storing and describing data using tags. Here are a few sample lines from the “AndroidManifest.xml” file:
<uses-permission android:name="android.permission.INTERNET" /> <!-- Allow network access -->
android:networkSecurityConfig="@xml/network_security_config"
<service android:name=".LocationUpdateService" />
The second line provided points to the “network_security_config.xml” file, which the Android manifest references to configure outgoing connections, whether the application is communicating with a backend server or an API. This file will also need to be analysed by the LLM to inform the user of connections being made by the application. Here are a few lines available in the “network_security_config.xml” file:
<domain-config cleartextTrafficPermitted="true">
    <domain includeSubdomains="true">example.com</domain>
    <domain includeSubdomains="true">10.0.2.2</domain>
</domain-config>
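A small sketch of how the pipeline might read this file once it has been extracted from the APK, using only Python's standard library (the sample XML below mirrors the fragment above):

```python
import xml.etree.ElementTree as ET

# Sample network_security_config.xml content, as it might look after
# extraction from a decompiled APK (illustrative, not from a real app).
CONFIG_XML = """
<network-security-config>
  <domain-config cleartextTrafficPermitted="true">
    <domain includeSubdomains="true">example.com</domain>
    <domain includeSubdomains="true">10.0.2.2</domain>
  </domain-config>
</network-security-config>
"""

def permitted_domains(xml_text):
    """Return (domain, cleartext_allowed) pairs declared in the config."""
    root = ET.fromstring(xml_text)
    results = []
    for cfg in root.iter("domain-config"):
        cleartext = cfg.get("cleartextTrafficPermitted", "false") == "true"
        for dom in cfg.iter("domain"):
            results.append((dom.text.strip(), cleartext))
    return results

print(permitted_domains(CONFIG_XML))
# [('example.com', True), ('10.0.2.2', True)]
```

Domains with cleartextTrafficPermitted="true" are an obvious input for the risk report, since they allow unencrypted traffic.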
As we have now established, most of the static analysis of the application can focus on these two files, and my focus when training the LLM will be on consistent reporting and analysis of these files and the connections they declare. There will also be analysis of the privacy policy and terms and conditions that users must accept in order to use applications. Analysing these can also help us determine whether the application is GDPR compliant: if the application’s data requirements fall outside the scope of its declared policy, this breaches the GDPR and other data protection laws.
Here is the full proposed system flow:
Static APK Analysis
Decompile apps using apktool or AndroGuard
Extract and classify permissions and exported components
Identify API calls indicative of location tracking, biometric tracking or background monitoring.
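As a minimal sketch of the permission-extraction step, the snippet below parses a decoded AndroidManifest.xml (the plain XML that apktool produces) using Python's standard library; the DANGEROUS set is an illustrative subset, not Android's full dangerous-permission list:

```python
import xml.etree.ElementTree as ET

# Android attributes live in this XML namespace.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Illustrative decoded manifest fragment (not from a real app).
MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android">
  <uses-permission android:name="android.permission.INTERNET" />
  <uses-permission android:name="android.permission.ACCESS_FINE_LOCATION" />
  <uses-permission android:name="android.permission.READ_CONTACTS" />
</manifest>
"""

# Illustrative subset of Android's "dangerous" permission group.
DANGEROUS = {
    "android.permission.ACCESS_FINE_LOCATION",
    "android.permission.READ_CONTACTS",
}

def classify_permissions(manifest_xml):
    """Split declared permissions into normal vs dangerous buckets."""
    root = ET.fromstring(manifest_xml)
    perms = [p.get(ANDROID_NS + "name") for p in root.iter("uses-permission")]
    return {
        "normal": [p for p in perms if p not in DANGEROUS],
        "dangerous": [p for p in perms if p in DANGEROUS],
    }

print(classify_permissions(MANIFEST))
```

In the real pipeline a library such as AndroGuard would handle decoding the binary manifest; the classification step afterwards would look much like this.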
Textual Analysis
Train and fine-tune an LLM using datasets of annotated privacy policies and app descriptions.
Detect misleading or contradictory statements in the text
Summarize the privacy practices of the application in a human-readable format.
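To illustrate what the fine-tuning data might look like, here is a hypothetical JSONL training record; the field names (policy_text, summary, risk_label) are my own illustrative schema, not an established dataset format:

```python
import json

# Hypothetical annotated example pairing a policy excerpt with the target
# plain-language summary and a risk label for supervised fine-tuning.
record = {
    "policy_text": "We may share your precise location with advertising partners.",
    "summary": "The app shares your exact location with third-party advertisers.",
    "risk_label": "high",
}

# One JSON object per line produces a JSONL training file.
line = json.dumps(record)
print(line)
```

Storing one record per line keeps the dataset easy to stream during training and easy to extend as more annotated policies are collected.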
Risk Scoring and Reporting
Provide a weighted score for app behaviours based on privacy risk and compliance with the GDPR and other data protection laws.
Generate a “privacy health score” that breaks down the risk categories, delivered as an exportable PDF report.
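A weighted scoring function could look something like the sketch below; the risk categories and weights are placeholder assumptions, not the finalised rubric:

```python
# Placeholder weights per risk category (assumed for this sketch).
WEIGHTS = {
    "dangerous_permissions": 0.4,
    "cleartext_traffic": 0.2,
    "policy_mismatch": 0.3,
    "third_party_trackers": 0.1,
}

def privacy_health_score(findings):
    """Map per-category risk (0.0 = safe, 1.0 = worst) to a 0-100 health score."""
    risk = sum(WEIGHTS[cat] * findings.get(cat, 0.0) for cat in WEIGHTS)
    return round((1.0 - risk) * 100)

score = privacy_health_score({
    "dangerous_permissions": 0.5,  # half of requested permissions are dangerous
    "cleartext_traffic": 1.0,      # cleartext connections are permitted
    "policy_mismatch": 0.0,
    "third_party_trackers": 0.0,
})
print(score)  # 60
```

Keeping the weights in one table makes the rubric easy to tune as the research portion of the project reveals which behaviours correlate with real privacy risk.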
Per the official GDPR website, “The instrument for a privacy impact assessment (PIA) or data protection impact assessment (DPIA) was introduced with the General Data Protection Regulation (Art. 35 of the GDPR). This refers to the obligation of the controller to conduct an impact assessment and to document it before starting the intended data processing.”
Using PIAs in my design will help keep the tool compliant with data protection laws and will also help the tool become more trustworthy and accurate over time. Because an APK's Play Store metadata may include developer information, such as names, email addresses and company data, identification of individuals is possible.
These assessments can further reduce the risk of accidental developer exposure from the metadata. This also helps build trust and transparency with clients, while meeting AI best practices regardless of the degree of risk.
As such, here is a diagram of the proposed workflow of the application:
Deliverables
Here is a Gantt chart showing the proposed time for certain development milestones:
Functional Web Application
Develop a secure, cloud-hosted web platform that enables users to submit applications for privacy and tracking analysis. Users will provide the application name and developer name of the app. The system will automatically locate and retrieve the relevant APK and associated privacy documents. Following thorough analysis, results will be presented directly on the website in a structured, user-friendly format that clearly communicates privacy risks and app behaviours.
LLM-based Privacy Analyser
Utilize a completed, optimized Large Language Model that is tailored to parse legal privacy policies and technical metadata included in APK files. It will be incorporated into the web application backend smoothly with the capability to perform static analysis of APK manifests and natural language processing of textual terms of service and privacy documents. The analyser will give thorough evaluations of privacy transparency and permission usage to inform and empower users.
Documentation and Governance Framework
A strong data governance framework with explicit ethical guidelines, privacy-by-design methodologies, and sound documentation practices, providing assurance of compliance with applicable data protection legislation, e.g. the General Data Protection Regulation (GDPR) and the forthcoming EU Artificial Intelligence Act (AI Act). The framework will include risk analyses, impact assessments, transparency processes, and the audit trails necessary to support accountability and lawful processing throughout the application's lifetime.
Completed Report into popular applications
After developing and successfully deploying the tool, a thorough research study will be conducted analysing the top 100 most downloaded apps across 10 broad categories of the Google Play Store. The study will assess the transparency and readability of these apps' Privacy Policies and user agreements and conduct an in‑depth examination of privacy practices and user data disclosures. The report will offer thoughtful remarks on the general state of privacy transparency in the Google Play universe, pointing out trends, strengths, and areas for improvement.
Technical Requirements
My current hardware configuration is considered low-end by today's standards, so I must keep this in mind throughout the project. Choosing the right tools will be critical, as picking a tool that my hardware does not support can lead to problems during development. The hardware is as follows:
Intel i7 8700
Nvidia RTX 2060 Super
32GB DDR4 RAM
2TB SSD
After thorough research, I have determined that some of the best prebuilt low-compute LLMs feasible for tuning within my time frame and on my current hardware are:
TinyLlama
Runs easily on < 8 GB VRAM, which works with my setup.
Good at providing policy summaries and classification tasks.
Mistral 7B Instruct
Good reasoning and text-interpretation ability when quantised with QLoRA.
Phi-2 (Microsoft)
Small generalist model that is pre-tuned for code and policy reasoning and can be fine-tuned quickly.
MobileLLaMA
Very good at handling legal texts and metadata, directly in line with my specifications, also lightweight to train.
I have chosen to use MobileLLaMA 2.7B as it is efficient in a low-resource environment. It is a compact variant of the LLaMA models, designed specifically for better performance on low-end hardware such as mobile devices and low-power GPUs. The 2.7B refers to the parameter count; this version provides competitive reasoning, natural language processing and common-sense capabilities.
To gather the data required to train my MobileLLaMA LLM, I need to build a comprehensive and clean dataset from the Google Play Store. There are pre-existing datasets available online, but I would like to build and establish my own.
I will configure a script, possibly utilising python libraries such as “requests” and “BeautifulSoup” to scrape privacy policies and other metadata from Google Play app listings. I would then need to crawl and save the full text of these policies. To crawl means to automatically visit and collect information from a website, so in this context we are automatically gathering information from the Google Play Store. The information would be stored in text files that can be given to the LLM for training.
Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree from a document, which can then be used to extract data from the HTML, making it useful for web scraping. It can extract privacy policies, terms of service and app metadata from application websites or Google Play listings, making it easier to precisely target links, text and other fields of information. The parse tree is used to find specific elements, such as an “<a>” tag that may point to a privacy policy link, or a “<div>” that might hold policy paragraphs, app descriptions and metadata. This makes extraction safer and more maintainable when scaling web crawlers.
BeautifulSoup does this by:
Sending an HTTP request, using the “requests” library, to the webpage URL and receiving the HTML content in response.
Parsing the HTML content with a parser such as html.parser or html5lib to convert the raw HTML into a structured format (a parse tree).
Extracting data from the parse tree using tags, classes or IDs.
Filtering out tags, links and other unrelated text to provide clean text.
Exporting the cleaned text as JSON, a format the LLM pipeline can ingest for analysis.
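The steps above can be sketched as follows; for portability this version uses the standard library's html.parser rather than BeautifulSoup itself, but the logic maps directly onto bs4's find_all("a") (the sample page and URLs are hypothetical):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect hrefs whose anchor text mentions a privacy policy."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Keep the link only if its visible text looks like a policy link.
        if self._href and "privacy" in data.lower():
            self.links.append(self._href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

# Hypothetical listing-page fragment (normally fetched with requests.get).
PAGE = '<div><a href="https://example.com/privacy">Privacy Policy</a><a href="/tos">Terms</a></div>'

parser = LinkExtractor()
parser.feed(PAGE)
print(parser.links)  # ['https://example.com/privacy']
```

With BeautifulSoup the same filter would be a one-liner over soup.find_all("a"), but the extraction logic is identical.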
For the crawler, I will focus on 3 things for textual analysis:
Privacy Policies
The privacy policy for an app on the Google Play Store is usually linked as a URL on the Play Store page for the application, so we can use the crawler to visit these policy pages and extract the full text for analysis. We would then save the full policy text locally for later analysis.
It should be noted that the privacy policy link for applications on the Google Play Store is stored on a page following the same URL format: “play.google.com/store/apps/datasafety?id=*".
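Assuming that URL pattern holds, building the lookup address from an app's package id is straightforward (the package id below is hypothetical):

```python
from urllib.parse import urlencode

def data_safety_url(package_id):
    """Build the Play Store data-safety page URL for an app's package id."""
    return "https://play.google.com/store/apps/datasafety?" + urlencode({"id": package_id})

print(data_safety_url("com.example.app"))
# https://play.google.com/store/apps/datasafety?id=com.example.app
```

Using urlencode rather than string concatenation keeps the crawler safe if a package id ever contains characters that need escaping.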
Terms & Conditions (T&C)
The Terms & Conditions can be a bit more difficult, as there might not be an explicit link to them on the Google Play Store. Normally, the Terms & Conditions pages are stored somewhere on the developer's website, and each website has its own structure and URL format.
By editing the URL on three separate websites where I had found the privacy policy for an application, I was able to find the Terms and Conditions pages on two of them, meaning that changing the URL is not a foolproof method for finding the Terms and Conditions.
App Permissions
App permissions are listed inside the application's APK file, but there are also permissions listed on the Google Play Store listing for the application that indicate what the app may be collecting; this information, however, lacks specifics. By analysing the APK file, we can extract all the permissions the application requests from the user.
We could cross-reference the permissions in the APK file with the permissions on the app listing and then list the differences that might indicate some misleading terminology/conditions.
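A sketch of that cross-referencing step, using hypothetical permission lists for both sources:

```python
# Hypothetical permission sets: one extracted from the APK manifest,
# one scraped from the Play Store listing.
apk_permissions = {
    "android.permission.INTERNET",
    "android.permission.ACCESS_FINE_LOCATION",
    "android.permission.READ_CONTACTS",
}
listed_permissions = {
    "android.permission.INTERNET",
    "android.permission.ACCESS_FINE_LOCATION",
}

# Permissions requested in the APK but absent from the public listing are
# candidates for flagging as potentially misleading disclosure.
undisclosed = sorted(apk_permissions - listed_permissions)
print(undisclosed)  # ['android.permission.READ_CONTACTS']
```

Representing both sides as sets makes the comparison symmetric: the reverse difference would surface permissions advertised on the listing but never actually requested.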
Conclusion
This project will integrate state-of-the-art static analysis of APK files with privacy-focused Large Language Model (LLM) processing to yield an end-to-end application that empowers users by explaining app privacy risks in a concise manner. The final product will be an open, ethical AI system designed to heighten user awareness and enhance data protection practices. Beyond individual use, the project will also contribute a systematic research investigation of privacy transparency in top-rated Google Play Store applications.
The investigation, focusing on findings regarding the privacy policies and user terms of the top apps in each category, will be presented alongside the tool on the website, making the findings usable and actionable for users. Ultimately, this platform aims to bridge the gap between complex privacy data and user knowledge, ensuring informed consent and accountability among parties in the mobile app space.
References
Android Developers. (2025) App manifest overview | App architecture. Available at: https://developer.android.com/guide/topics/manifest/manifest-intro (Accessed: 16 October 2025).
Artificial Intelligence Act. (2025) Official website of the EU Artificial Intelligence Act. Available at: https://artificialintelligenceact.eu/ (Accessed: 16 October 2025).
Bashir, K. (2022) What is AndroidManifest.xml (Manifest element). Medium. Available at: https://medium.com/@kananbashir/what-is-androidmanifest-xml-manifest-element-59552b121cb0 (Accessed: 16 October 2025).
Dawinderapps. (2023) Android Interview Questions #41: Everything You Need to Know About Android Manifest. Medium. Available at: https://medium.com/@dawinderapps/android-interview-questions-41-everything-you-needs-to-know-about-android-manifest-d360d3e0b711 (Accessed: 16 October 2025).
GDPR.eu. (2024) Privacy Impact Assessment (PIA) – General Data Protection Regulation (GDPR). Available at: https://gdpr-info.eu/issues/privacy-impact-assessment/ (Accessed: 18 October 2025).
MobileLLaMA Team. (2024) MobileLLaMA 2.7B Base model. Hugging Face. Available at: https://huggingface.co/mtgv/MobileLLaMA-2.7B-Base (Accessed: 18 October 2025).
MobileLLaMA Research Group. (2023) MobileVLM: A Fast, Strong and Open Vision Language Model. arXiv preprint. Available at: https://arxiv.org/pdf/2312.16886.pdf (Accessed: 18 October 2025).
National Center for Biotechnology Information (NCBI). (2022) Privacy Policies of IoT Devices: Collection and Analysis. PubMed Central (PMC). Available at: https://pmc.ncbi.nlm.nih.gov/articles/PMC8914639/ (Accessed: 20 October 2025).
OpenSource.com. (2021) A guide to web scraping in Python using Beautiful Soup. Available at: https://opensource.com/article/21/9/web-scraping-python-beautiful-soup (Accessed: 21 October 2025).
GeeksforGeeks. (2025) Implementing Web Scraping in Python with BeautifulSoup. Available at: https://www.geeksforgeeks.org/python/implementing-web-scraping-python-beautiful-soup/ (Accessed: 21 October 2025).
RTÉ. (2025) RTÉ Prime Time investigation uncovers location data of thousands of smartphones in Ireland for sale. Available at: https://about.rte.ie/2025/09/18/rte-prime-time-investigation-uncovers-location-data-of-thousands-of-smartphones-in-ireland-for-sale/ (Accessed: 10 October 2025).