Subject: web scraping - Prolib Integro

Title: Visualization and processing the structure of delicious service
Wizualizacja i przetwarzanie struktury serwisu delicious
Authors: Żak, Paweł
Description: Describes the process of extracting data from the structure of the delicious service using web scraping, and the use of the open-source platform Gephi, together with a purpose-built plugin, to visualize and extract that data.
Content provider: Repozytorium Uniwersytetu Jagiellońskiego

Other

Title: System do zbierania i analizy danych zakładów bukmacherskich
System for aggregating and analyzing live betting data
Authors: Markiewicz, Jan
Description: The thesis describes the creation of a system that aggregates live betting data, together with a browser for that data. It characterizes the nature of bookmaker bets and presents the various types of odds and bets. It reports how the data was obtained, how the process was automated, and how the data was archived. Subsequent chapters describe the functionality and architecture of the web-based data browser. The thesis also includes a brief analysis of the aggregated data.
Content provider: Repozytorium Uniwersytetu Jagiellońskiego

Other

Title: GameScraper
Authors: Maleszewski, Andrzej
Description: The goal of this work was to create a simple web application that acts as an automatic content aggregator for video game prices in popular online stores, with particular emphasis on sales, which are very popular among customers. The target group is gamers and people interested in expanding their digital video game library. Information is gathered by web spiders implemented in Scrapy, a framework written in Python, and the data is obtained through the stores' APIs. The user-facing website was built with Laravel, a framework written in PHP. On it, users can follow current sales, create personal wishlists, or suggest to the site administrator which titles they would like to see on the site. The basic functionality is complemented by a simple administrator panel, written from scratch, for managing the site's content and users.
Content provider: Repozytorium Uniwersytetu Jagiellońskiego

Other

Title: Improving the credibility of the extracted position from a vast collection of job offers with machine learning ensemble methods
Authors: Drozda, Paweł
Ropiak, Krzysztof
Nowak, Bartosz A.
Talun, Arkadiusz
Osowski, Maciej
Subjects: machine learning
web scraping
granularity method
classification
Publisher: Uniwersytet Warmińsko-Mazurski w Olsztynie
Links: https://bibliotekanauki.pl/articles/22615539.pdf
Description: The main aim of this paper is to evaluate crawlers that collect job offers from websites. In particular, the research focuses on checking how effectively ensemble machine learning methods validate the position extracted from job ads. Moreover, to significantly reduce the training time of the algorithms (Random Forests and XGBoost), granularity methods were tested as a way of shrinking the input training dataset. Both methods achieved satisfactory results, with accuracy and F1 measures exceeding 96%. In addition, granulation reduced the input dataset by more than 99%, and the results obtained were only slightly worse (accuracy lower by 1% to 5%, F1 lower by 3% to 8%). It can thus be concluded that the considered methods can be used in the evaluation of job web crawlers.
Content provider: Biblioteka Nauki

Article
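
The accuracy and F1 measures reported in the abstract above can be illustrated with a minimal sketch. It assumes a binary labelling (1 = the extracted position is valid, 0 = it is not); the function name `accuracy_and_f1` and the sample labels are hypothetical and are not taken from the paper.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for two equally long lists of labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Hypothetical labels: 1 = extracted position is valid, 0 = invalid.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} F1={f1:.3f}")
```

Reporting both measures matters when valid and invalid extractions are unevenly represented, because accuracy alone can look high while one class is handled poorly.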

Title: Pozyskiwanie i analiza danych na temat ofert pracy z wykorzystaniem big data
The collection and analysis of the data on job advertisements with the use of big data
Authors: Maślankowski, Jacek
Subjects: big data
text mining
web scraping
labour market
Publisher: Główny Urząd Statystyczny
Links: https://bibliotekanauki.pl/articles/962829.pdf
Description: The goal of this paper is to present both the benefits for official statistics (labour market) of using web scraping to gather data on job advertisements from websites that count as big data sources, and the challenges connected with this process. The paper presents the results of experimental research using web scraping and text mining methods. The analysis was based on data from 2017 and 2018 obtained from the most popular job-search websites, which was then compared with Statistics Poland’s data collected through Z-05 forms. The analysis shows that web scraping can be used in official statistics to obtain statistical data from alternative sources that complement existing statistical databases, provided that consistency with existing surveys is maintained.
Content provider: Biblioteka Nauki

Article

Title: The use of web-scraped data to analyze the dynamics of footwear prices
Authors: Juszczak, Adam
Subjects: Big data
Consumer Price Index
Inflation
Online shopping
Web-scraping
Publisher: Uniwersytet Ekonomiczny w Katowicach
Links: https://bibliotekanauki.pl/articles/2027264.pdf
Description: Aim/purpose: Web scraping is a technique for automatically extracting data from websites. With the rise of online shopping, it makes it possible to acquire information about the prices of goods sold by retailers such as supermarkets or internet shops. This study examines the possibility of using web-scraped data from one clothing store. It aims to compare known price index formulas applied to the web-scraping case and to verify their sensitivity to the choice of data filter type. Design/methodology/approach: The author uses price data scraped from one of the biggest online shops in Poland. The data were obtained as part of the eCPI (electronic Consumer Price Index) project conducted by the National Bank of Poland. Three types of products were selected for the analysis (female ballerinas, male shoes, and male oxfords) to compare their prices over a period of more than one year. Six price indexes were used in the calculations: the Jevons and Dutot indexes with their chain and GEKS (an acronym of the creators' names, Gini–Éltető–Köves–Szulc) versions. Apart from the analysis conducted on the full data set, the author introduced filters to remove outliers. Findings: Clothing and footwear are considered one of the most difficult groups of goods for measuring price changes because of high product churn, which undermines the use of the traditional Jevons and Dutot indexes. It is possible, however, to use chained and GEKS indexes instead. Still, these indexes are fairly sensitive to large price changes. As observed for both product groups, the results provided by the GEKS and chained versions of the indexes differed, which could lead to the conclusion that, although they yield promising results, they may be better suited to other COICOP (Classification of Individual Consumption by Purpose) groups. Research implications/limitations: The findings showed that the use of filters did not significantly reduce the difference between price indexes based on the GEKS and chain formulas. Originality/value/contribution: The use of web-scraped data is a fairly new topic in the literature. Research on the possibility of using different price indexes provides useful insights for the future use of these data by statistical offices.
Content provider: Biblioteka Nauki

Article
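
As a rough illustration of the two elementary indexes named in the abstract above, the sketch below computes the Jevons index (geometric mean of price relatives) and the Dutot index (ratio of average prices) on products priced in both periods. The product identifiers, the prices, and the simple price-relative filter with its 0.5 and 2.0 thresholds are all hypothetical; the abstract does not say which filter types the author used.

```python
from math import exp, log


def matched_ids(base, current, lower=0.0, upper=float("inf")):
    """Products priced in both periods whose price relative lies in [lower, upper]."""
    common = set(base) & set(current)
    return sorted(i for i in common if lower <= current[i] / base[i] <= upper)


def jevons(base, current, ids):
    """Jevons index: geometric mean of the price relatives."""
    if not ids:
        raise ValueError("no matched products")
    return exp(sum(log(current[i] / base[i]) for i in ids) / len(ids))


def dutot(base, current, ids):
    """Dutot index: ratio of arithmetic mean prices (the 1/n factors cancel)."""
    if not ids:
        raise ValueError("no matched products")
    return sum(current[i] for i in ids) / sum(base[i] for i in ids)


# Hypothetical scraped prices; shoe_4 is new, so it has no base-period match.
base = {"shoe_1": 100.0, "shoe_2": 200.0, "shoe_3": 80.0}
current = {"shoe_1": 110.0, "shoe_2": 200.0, "shoe_3": 20.0, "shoe_4": 150.0}

all_ids = matched_ids(base, current)
kept = matched_ids(base, current, lower=0.5, upper=2.0)  # assumed outlier filter

print("Jevons, all matched:", jevons(base, current, all_ids))
print("Dutot, all matched: ", dutot(base, current, all_ids))
print("Jevons, filtered:   ", jevons(base, current, kept))
print("Dutot, filtered:    ", dutot(base, current, kept))
```

Because both formulas use only matched products, high product churn shrinks the matched set quickly, which is the weakness of the direct Jevons and Dutot indexes that the abstract points out.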

Title: The use of web-scraped data to analyse the dynamics of clothing and footwear prices
Wykorzystanie danych scrapowanych do analizy dynamiki cen odzieży i obuwia
Authors: Juszczak, Adam
Subjects: inflation
web scraping
online shopping
GEKS-J
Publisher: Główny Urząd Statystyczny
Links: https://bibliotekanauki.pl/articles/28408209.pdf
Description: Web scraping is a technique that makes it possible to obtain information from websites automatically. As online shopping has grown in popularity, it has become an abundant source of information on the prices of goods sold by retailers. In addition to significantly reducing the cost of price research, the use of scraped data usually improves the precision of inflation estimates and allows inflation to be tracked in real time. For this reason, web scraping is a popular research tool both at statistical offices (Eurostat, the British Office of National Statistics, the Belgian Statbel) and at universities (e.g. the Billion Prices Project conducted at the Massachusetts Institute of Technology). However, the use of scraped data to calculate inflation brings many challenges at the stages of collection, processing, and aggregation. The aim of the study is to compare various methods of calculating price indices of clothing and footwear on the basis of scraped data. Using data from one of the largest online stores selling clothing and footwear for the period from February 2018 to November 2019, the author compared the results of the Jevons chain index, the GEKS-J index, and GEKS-J indices computed with window extension and window updating methods. The calculations confirmed a high chain drift in the chain index and showed very similar results across the window extension and updating methods (excluding the FBEW method).
Content provider: Biblioteka Nauki

Article
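
The chain drift and the GEKS-J index mentioned in the abstract above can be sketched on toy data. The example assumes a fixed window of three periods and the standard GEKS construction, in which the index from period 0 to period t is the geometric mean, over every period in the window, of the indirect Jevons comparison routed through that period. The prices are hypothetical, and the window extension and updating methods discussed in the paper are not implemented here.

```python
from math import exp, log, prod


def jevons(p_from, p_to):
    """Bilateral Jevons index on products priced in both periods."""
    ids = set(p_from) & set(p_to)
    if not ids:
        raise ValueError("no matched products between the two periods")
    return exp(sum(log(p_to[i] / p_from[i]) for i in ids) / len(ids))


def chained_jevons(periods, t):
    """Chain index: product of period-to-period Jevons links from 0 to t."""
    return prod(jevons(periods[s], periods[s + 1]) for s in range(t))


def geks_j(periods, t):
    """GEKS-J index from period 0 to t over the whole window `periods`."""
    links = (jevons(periods[0], mid) * jevons(mid, periods[t]) for mid in periods)
    return prod(links) ** (1.0 / len(periods))


# Hypothetical prices: product "a" goes on sale and returns to its old price,
# while product "c" enters the assortment in period 1.
prices = [
    {"a": 100.0, "b": 100.0},
    {"a": 50.0, "b": 100.0, "c": 100.0},
    {"a": 100.0, "b": 100.0, "c": 100.0},
]

print("direct Jevons 0->2: ", jevons(prices[0], prices[2]))
print("chained Jevons 0->2:", chained_jevons(prices, 2))
print("GEKS-J 0->2:        ", geks_j(prices, 2))
```

In this toy case every price in period 2 equals its period 0 level, so the direct index is 1, yet the chained index ends up below 1 because the matched set changes between links. The GEKS-J value lies closer to 1 than the chained one, which is the kind of drift reduction that motivates multilateral indices.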

Title: Diagnostyczne aspekty działalności podmiotów zarejestrowanych w AIIP
Diagnostic aspects of AIIP members’ business activities
Authors: Zych, Magdalena
Publisher: Uniwersytet Jagielloński, Biblioteka Jagiellońska
Description: Purpose/Thesis: The aim of the research was to compare the scope of activity of entities registered in the directory of the Association of Independent Information Professionals (AIIP) with the scope of diagnostics as understood in information science. Approach/Methods: Web scraping and content analysis of the written descriptions in the AIIP directory were used. The data were collected with a program written in Python. Results and conclusions: The term "diagnostics" is not used in the AIIP directory. Most services nevertheless fall within the spectrum of diagnostics, and the descriptions use terms related to it, e.g. analysis, audit, research, decision, evaluation, assessment, improvement, order, problem, forecast, solution, trend, change. Research limitations: The description of the activity of entities registered in AIIP is limited to the written content of the full records in the AIIP directory. Practical implications: The web scraping procedure described can be applied in other studies. The list of services and terms used in information professionals' descriptions can be used to benchmark their written communication with stakeholders. Originality/Value: To the author's knowledge, this is the first study of the diagnostic dimension of the activity of AIIP members.
Content provider: Repozytorium Uniwersytetu Jagiellońskiego

Article
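
The combination of web scraping and content analysis described in the abstract above might look roughly like the sketch below, which strips the markup from already downloaded profile pages and counts occurrences of diagnostics-related terms. This is not the author's program: the HTML snippets, the English term list, and the function names are hypothetical, and only the Python standard library is used.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def count_terms(html_pages, terms):
    """Count how often each term of interest occurs across all pages."""
    wanted = {term.lower() for term in terms}
    counts = Counter()
    for html in html_pages:
        words = re.findall(r"[a-z]+", visible_text(html).lower())
        counts.update(word for word in words if word in wanted)
    return counts


# Hypothetical directory records, standing in for downloaded pages.
pages = [
    "<div class='profile'><p>We offer competitive analysis and market research.</p></div>",
    "<div class='profile'><p>Audit, evaluation and trend analysis for libraries.</p></div>",
]
terms = ["analysis", "audit", "evaluation", "trend", "forecast"]
counts = count_terms(pages, terms)
for term in terms:
    print(term, counts[term])
```

Matching whole lowercase words keeps the sketch simple; a real study would likely also need to handle inflected forms and multi-word expressions.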

Title: Nowoczesne technologie informacyjno-komunikacyjne w zarządzaniu informacją
Modern information and communication technologies in information management
Authors: Urbanek, Ewa
Description: The thesis is devoted to modern information and communication technologies (ICT), their role in information management, and the whole process of their evolution up to the present state. ICTs are developing at a very fast pace and accompany everyone in today's society. The main definitions of ICT, their origins, and various classifications are discussed. The purpose of the thesis was to analyse the development and study the current state of ICTs in information management. The research goal was to examine the data available for TED (Technology, Entertainment and Design) talks on technology in relation to the subject area of the thesis. Both the method of analysis and criticism of the literature and web scraping, which made it possible to analyse websites, were used. The data was extracted using Python code. The study found that the analysed TED conferences did not address the topics of information and communication technologies or information management. Other available data was also studied, which made it possible to identify the main keywords, the number of talks by date added, the types of conferences with the highest number of recordings, the speakers who most often addressed the topic of technology, and the number of views and likes by conference type and date added.
Content provider: Repozytorium Uniwersytetu Jagiellońskiego

Other

Title: Current challenges and possible big data solutions for the use of web data as a source for official statistics
Współczesne wyzwania i możliwości w zakresie stosowania narzędzi big data do uzyskania danych webowych jako źródła dla statystyki publicznej
Authors: Daas, Piet
Maślankowski, Jacek
Subjects: big data
web data
websites
web scraping
Publisher: Główny Urząd Statystyczny
Links: https://bibliotekanauki.pl/articles/31232088.pdf
Description: Web scraping has become popular in scientific research, especially in statistics. Preparing an appropriate IT environment for web scraping is currently not difficult and can be done relatively quickly, and extracting data in this way requires only basic IT skills. This has resulted in the increased use of this type of data, widely referred to as big data, in official statistics. Over the past decade, much work was done in this area both at the national level, within the national statistical institutes, and at the international level, by Eurostat. The aim of this paper is to present and discuss current problems related to accessing, extracting, and using information from websites, along with suggested potential solutions. For the sake of the analysis, the paper presents a case study of large-scale web scraping performed in 2022 by means of big data tools. The results of the case study, conducted on a total population of approximately 503,700 websites, demonstrate that it is not possible to provide reliable data on the basis of such a large sample, as typically up to 20% of the websites might not be accessible at the time of the survey. What is more, it is not possible to know the exact number of active websites in particular countries, because the dynamic nature of the Internet causes websites to change continuously.
Content provider: Biblioteka Nauki

Article
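
The accessibility problem raised in the abstract above can be made concrete with a small sketch that measures the share of websites that fail to respond. It is only an illustration under stated assumptions, not the tooling used in the case study: the function names, the HEAD-request check, the timeout, and the example URLs are hypothetical. The checker is passed in as a parameter so that the calculation can be tried without any network access.

```python
import http.client
from urllib.request import Request, urlopen


def is_accessible(url, timeout=10.0):
    """Return True if a HEAD request to `url` succeeds (assumed criterion)."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def inaccessible_share(urls, check=is_accessible):
    """Fraction of `urls` for which `check` reports no access."""
    urls = list(urls)
    if not urls:
        raise ValueError("no URLs to check")
    failed = sum(1 for url in urls if not check(url))
    return failed / len(urls)


# Offline demonstration with a stubbed checker instead of real requests.
stub_results = {
    "https://a.example": True,
    "https://b.example": True,
    "https://c.example": False,
    "https://d.example": True,
    "https://e.example": True,
}
share = inaccessible_share(stub_results, check=stub_results.get)
print(f"inaccessible share: {share:.0%}")
```

Calling `inaccessible_share(urls)` without the `check` argument would issue real requests; some servers reject HEAD requests, so a production crawler would probably need a GET fallback and repeated attempts over time.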

Information

You are searching for the phrase "web scraping" by the criterion: Subject